Online Diffusion Policy RL: Making Robot Manipulation Adapt On the Fly
Bob Jiang
March 7, 2026
Introduction
Diffusion models moved from image generation into robot control for a simple reason: they are good at representing multimodal action distributions. In manipulation, there are often many valid ways to complete the same task, and classic behavior cloning can average these modes into mushy, infeasible actions.
Diffusion policies address that by generating an action sequence through iterative denoising, conditioned on observations such as images and proprioception. The catch is that most diffusion policy work is still offline: train on demonstrations, then deploy the frozen policy.
In real deployments, the world shifts:
- Lighting changes, camera pose drifts, clutter appears.
- Objects differ in friction and compliance.
- The robot state estimator accumulates bias.
- People nudge the scene, intentionally or not.
Offline policies often fail not because the architecture is wrong, but because the training distribution is too narrow.
This post is a practical guide to:
- What diffusion policies are, and why they work well for manipulation.
- Where offline diffusion policies break.
- How to add online learning in a controlled way, using reinforcement-learning-style updates and safety constraints.
- A concrete implementation blueprint you can adapt to your stack.
References you can start with:
- Diffusion Policy project page (Columbia): https://diffusion-policy.cs.columbia.edu/
- The arXiv paper: https://arxiv.org/abs/2303.04137
- A 2026 review focused on online diffusion policy RL algorithms: https://arxiv.org/abs/2601.06133
Part 1: Diffusion policies in one mental model
Behavior cloning, but do not collapse the modes
Standard behavior cloning fits a model to predict an action given an observation:
- Input: observation at time t
- Output: action at time t
If there are multiple valid actions, a mean squared error loss pushes you toward an average action. For grasping and contact-rich manipulation, the average can be invalid.
Diffusion policies model the full distribution by treating action sequences as data and learning a denoising process that converts noise into a plausible action trajectory.
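To make that concrete, here is a minimal sketch of the standard denoising training objective. The toy MLP, the dimensions, and the way observations are folded in are placeholders, not the architecture from the Diffusion Policy paper; the point is only the structure of the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, ACT_DIM, OBS_DIM, T_STEPS = 16, 7, 32, 100  # toy sizes, not paper settings

class NoisePredictor(nn.Module):
    """Toy MLP that predicts the noise added to a flattened action sequence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(H * ACT_DIM + OBS_DIM + 1, 256), nn.ReLU(),
            nn.Linear(256, H * ACT_DIM),
        )

    def forward(self, noisy_actions, obs, t):
        t_feat = t.float().unsqueeze(-1) / T_STEPS  # crude timestep embedding
        x = torch.cat([noisy_actions.flatten(1), obs, t_feat], dim=-1)
        return self.net(x).view(-1, H, ACT_DIM)

# Fixed noise schedule (linear betas, DDPM-style).
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def denoising_loss(model, actions, obs):
    """actions: (B, H, ACT_DIM) demo action chunks; obs: (B, OBS_DIM) features."""
    B = actions.shape[0]
    t = torch.randint(0, T_STEPS, (B,))
    eps = torch.randn_like(actions)
    ab = alpha_bars[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps  # forward diffusion
    eps_hat = model(noisy, obs, t)
    return F.mse_loss(eps_hat, eps)  # predict the injected noise
```

Training on demonstrations is then just minimizing this loss over observation and action-chunk pairs; at deployment time, sampling runs the learned denoiser in reverse from Gaussian noise.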
What is being generated
A common setup is receding horizon control:
- Generate a horizon of H actions: a_{t:t+H-1}
- Execute the first K actions (often K=1)
- Replan from the next observation
This is important: even if the generated trajectory is imperfect, frequent replanning lets the controller correct.
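In code, the loop is short. The `policy.sample` and `env` interfaces below are assumptions standing in for whatever your stack provides.

```python
def receding_horizon_rollout(policy, env, execute_k=1, max_steps=500):
    """Replan every `execute_k` steps. `policy.sample(obs)` is assumed to return
    an (H, action_dim) action sequence drawn from the diffusion policy."""
    obs = env.reset()
    for _ in range(max_steps):
        plan = policy.sample(obs)          # generate a full horizon of actions
        for action in plan[:execute_k]:    # execute only the first K of them
            obs, done = env.step(action)   # assumed to return (next_obs, done)
            if done:
                return obs
    return obs
```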
Why diffusion helps
Diffusion offers three useful properties:
- Multimodality: multiple distinct action sequences can solve the same task.
- Stability: the denoising steps can act like a regularizer, often producing smooth trajectories.
- Conditional generation: you can condition on images, goal descriptors, language embeddings, or object poses.
Part 2: Why offline diffusion policies fail in the real world
Offline training assumes deployment data looks like training data. In robotics, that assumption is fragile.
Failure mode A: observation shift
A few degrees of camera shift can move key pixels. A policy that relies on specific visual cues can mislocalize the object and generate the wrong approach.
Failure mode B: contact dynamics mismatch
Most demonstrations happen in a controlled regime. On a new object, contact forces and slip change the transition dynamics. The policy can generate trajectories that were safe in training but now cause failure.
Failure mode C: compounding error over long horizons
Even with replanning, if your policy distribution is slightly biased, repeated small errors push you into states you never saw in demos.
Failure mode D: limited recovery behaviors
Many demonstration datasets overrepresent success trajectories. The policy learns how to do the task when things go well, not how to recover when something goes wrong.
This is exactly where online learning shines: it can incorporate new states and recovery experiences.
Part 3: What online diffusion policy RL means
Online learning in this context usually means one or more of the following:
- Online data collection: keep logging trajectories during deployment.
- Online policy improvement: update the policy using new data.
- Online objective shift: optimize success rate or reward, not just imitation loss.
The arXiv review paper above categorizes algorithm families and tradeoffs. Even if you do not adopt a full RL system, the key idea is the same: do not freeze the policy forever.
A practical compromise: imitation plus conservative online updates
Pure RL on hardware is expensive and risky. A pragmatic approach is:
- Start with a strong offline diffusion policy trained on demonstrations.
- Add a small online improvement loop that focuses on:
- recovery states
- small distribution shifts
- domain-specific nuisance factors
This can be framed as:
- supervised fine-tuning on new successful rollouts
- preference-style learning from comparisons
- conservative policy gradients with safety filters
Part 4: A step-by-step blueprint you can implement
Below is a blueprint that is intentionally system oriented, not just algorithm buzzwords.
Step 0: Define the task and the safety envelope
Pick a task with measurable success and clear constraints. Example tasks:
- pick and place into a bin
- cable insertion with compliance
- drawer open then retrieve object
Define constraints explicitly:
- joint limits, velocity limits
- workspace bounds
- force/torque thresholds
- collision constraints from a signed distance field or depth map
If you do not define safety, online learning will eventually find the unsafe corner.
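A minimal way to make the envelope explicit is a single config object plus a check function. The field names and limits below are illustrative, not recommendations for your robot.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SafetyEnvelope:
    joint_limits: np.ndarray      # (n_joints, 2) lower/upper bounds in radians
    max_joint_vel: float          # rad/s
    workspace_bounds: np.ndarray  # (3, 2) xyz min/max in meters
    max_wrench: float             # Newtons, from the force-torque sensor

    def violates(self, q, qdot, ee_pos, wrench_norm) -> bool:
        """Return True if any hard constraint is violated."""
        if np.any(q < self.joint_limits[:, 0]) or np.any(q > self.joint_limits[:, 1]):
            return True
        if np.any(np.abs(qdot) > self.max_joint_vel):
            return True
        if np.any(ee_pos < self.workspace_bounds[:, 0]) or np.any(ee_pos > self.workspace_bounds[:, 1]):
            return True
        return wrench_norm > self.max_wrench
```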
Step 1: Start from an offline diffusion policy baseline
Train or adopt a baseline diffusion policy:
- observation encoder: vision backbone plus proprioception
- diffusion model: predicts denoised action sequence
- planner: receding horizon, replan every step
Use the standard references for architecture and training details:
- Diffusion Policy: https://arxiv.org/abs/2303.04137
Step 2: Add an execution time safety filter
Before online updates, make execution robust.
A simple but effective pattern:
- Sample N candidate action sequences from the diffusion policy.
- Score each candidate with fast constraints:
- collision check
- joint limit check
- energy or jerk penalty
- optional learned value function
- Execute the first action of the best candidate.
This gives you two benefits:
- higher success rate immediately
- a clean interface where learning can improve the sampler, while the filter enforces constraints
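Here is a sketch of the pattern. `policy.sample_batch`, `collision_cost`, and `value_fn` are assumed interfaces; plug in your own checks.

```python
import numpy as np

def filtered_action(policy, obs, n_candidates=16, collision_cost=None, value_fn=None):
    """Sample candidates, reject hard violations, score the rest, and return
    the first action of the best surviving sequence."""
    candidates = policy.sample_batch(obs, n_candidates)   # (N, H, action_dim), assumed API
    best_action, best_score = None, -np.inf
    for seq in candidates:
        if collision_cost is not None and collision_cost(seq) == np.inf:
            continue                                       # hard constraint: discard
        jerk = np.sum(np.diff(seq, n=2, axis=0) ** 2)      # smoothness / jerk penalty
        score = -jerk
        if value_fn is not None:
            score += value_fn(obs, seq)                    # optional learned value
        if score > best_score:
            best_action, best_score = seq[0], score
    return best_action                                      # None means every candidate was rejected
```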
Step 3: Instrument everything
Log:
- raw observations and compressed embeddings
- generated action sequences and selected actions
- constraint violations, near misses, and intervention signals
- success labels
Also log context:
- camera calibration hash
- end-effector tool ID
- object SKU or approximate geometry ID
Without these, you cannot diagnose why online updates changed behavior.
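One way to keep the logging honest is a single record type that every component writes to. The field names below are suggestions, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeRecord:
    observations: list            # raw observations or compressed embeddings per step
    planned_sequences: list       # full generated action sequences
    executed_actions: list        # actions actually sent to the robot
    constraint_events: list       # violations, near misses, interventions
    success: bool = False
    # deployment context, so later updates can be diagnosed
    camera_calibration_hash: str = ""
    tool_id: str = ""
    object_id: str = ""
    extras: dict = field(default_factory=dict)
```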
Step 4: Design a reward and a curriculum for online updates
Hardware RL is often made feasible by constraining the learning scope.
Use a curriculum:
- Phase 1: online supervised fine-tuning on newly collected successes
- Phase 2: add negative examples from failures, with conservative objectives
- Phase 3: add sparse-reward RL for recovery behaviors only
Reward design tips:
- Use a sparse terminal success signal.
- Add dense shaping only if it is physically meaningful, for example distance to goal in task space.
- Add penalties for constraint proximity.
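Those tips translate into a small reward function like the sketch below; the weights and the 5 cm margin are assumptions you should tune per task.

```python
import numpy as np

def step_reward(ee_pos, goal_pos, constraint_margin, success, done,
                shaping_weight=0.1, margin_weight=1.0):
    """Sparse terminal success, physically meaningful shaping, constraint-proximity penalty.
    constraint_margin: distance to the nearest hard limit (smaller = closer to violation)."""
    r = 0.0
    if done and success:
        r += 1.0                                              # sparse terminal signal
    r -= shaping_weight * np.linalg.norm(ee_pos - goal_pos)   # task-space distance shaping
    if constraint_margin < 0.05:                              # within 5 cm of a limit (illustrative)
        r -= margin_weight * (0.05 - constraint_margin)
    return r
```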
Step 5: Choose an online update algorithm family
You have three practical families, each with an engineering profile.
Family A: replay buffer plus supervised fine-tuning
You maintain a buffer of recent trajectories. When you see a new success mode, you fine-tune the diffusion model on those trajectories.
Pros:
- simple
- stable
Cons:
- can overfit and forget old skills
Mitigations:
- mix old demo data with new data
- use small learning rates
- add behavior regularization, for example a KL-style penalty toward the base policy (see the sketch below)
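Here is a sketch of one update step in this family. An exact KL term is awkward to compute for a diffusion policy, so the sketch uses a common stand-in: an MSE penalty that keeps the adapted model's noise predictions close to a frozen copy of the base policy. The batch is assumed to already mix demo and online data.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, frozen_base, optimizer, batch, alpha_bars, n_steps, reg_weight=0.1):
    """One conservative fine-tuning step on a batch that mixes old demos with
    new online successes, regularized toward a frozen copy of the base policy."""
    actions, obs = batch["actions"], batch["obs"]        # (B, H, A), (B, obs_dim)
    B = actions.shape[0]
    t = torch.randint(0, n_steps, (B,))
    eps = torch.randn_like(actions)
    ab = alpha_bars[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps

    eps_hat = model(noisy, obs, t)
    loss = F.mse_loss(eps_hat, eps)                      # imitation (denoising) loss
    with torch.no_grad():
        eps_base = frozen_base(noisy, obs, t)            # frozen base policy predictions
    loss = loss + reg_weight * F.mse_loss(eps_hat, eps_base)  # stay close to the base

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```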
Family B: conservative offline-to-online RL
You treat new data as off policy experience and run a conservative RL update.
Pros:
- can improve beyond the demos
Cons:
- more moving parts: critic learning, reward modeling, hyperparameters
If you go this route, start with a narrow target: recover from a small set of failures.
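One concrete instantiation, in the spirit of advantage-weighted regression: weight the per-sample denoising loss by exponentiated, clipped advantages from a separately trained critic. This is only one member of the family, and the critic is assumed to exist already.

```python
import torch
import torch.nn.functional as F

def advantage_weighted_denoising_loss(model, batch, alpha_bars, n_steps,
                                      temperature=1.0, max_weight=20.0):
    """Per-sample denoising loss weighted by exp(advantage / temperature).
    batch["advantages"] comes from a critic trained on logged rewards."""
    actions, obs, adv = batch["actions"], batch["obs"], batch["advantages"]
    B = actions.shape[0]
    t = torch.randint(0, n_steps, (B,))
    eps = torch.randn_like(actions)
    ab = alpha_bars[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps

    eps_hat = model(noisy, obs, t)
    per_sample = F.mse_loss(eps_hat, eps, reduction="none").mean(dim=(1, 2))  # (B,)
    weights = torch.exp(adv / temperature).clamp(max=max_weight).detach()     # conservative clipping
    return (weights * per_sample).mean()
```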
Family C: preference-based updates
When humans can label which rollout is better, you can learn a reward model and then update the policy.
Pros:
- does not require dense rewards
Cons:
- needs a labeling pipeline
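The usual starting point is a Bradley-Terry style objective on labeled pairs of rollouts. The sketch below scores a fixed-size trajectory summary vector, which is a simplification of how you would featurize real rollouts.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry style loss: push the preferred rollout's score above the rejected one.
    preferred / rejected: (B, feat_dim) trajectory summaries labeled by a human."""
    r_pos = reward_model(preferred).squeeze(-1)
    r_neg = reward_model(rejected).squeeze(-1)
    return -F.logsigmoid(r_pos - r_neg).mean()
```

Once the reward model is trained, it slots into Family A or B as the source of success labels or advantages.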
Step 6: Prevent catastrophic forgetting
Online learning is dangerous because it can break what already worked.
Practical techniques:
- Keep a frozen base policy and learn a small adapter.
- Constrain updates with KL regularization to the base policy.
- Use periodic evaluation on a fixed validation set of scenarios.
- Roll back automatically if success rate drops.
Step 7: Deployment pattern for safe online learning
A pattern that works in practice:
- Shadow mode: collect data, but do not update.
- Batch mode: update offline overnight, validate, then deploy.
- Online mode with guardrails: update slowly with strict monitoring.
Even if a paper describes a fully online method, you can still adopt batch mode first. It gets you most of the benefit with far less risk.
Part 5: Concrete implementation sketch
This section is intentionally concrete so you can map it to your codebase.
Data structures
- ReplayBuffer
- stores trajectories, success labels, constraint metrics
- Policy
- base diffusion policy parameters
- optional adapter parameters
- SafetyFilter
- constraint checks
- candidate scoring
- Evaluator
- runs nightly evaluation scenarios
Training loop pseudocode
- Initialize policy from offline training.
- For each episode on robot:
- run policy with safety filter
- log trajectory
- compute success label
- add to replay buffer
- Periodically, run an update:
- sample a batch from replay buffer plus demo buffer
- compute imitation loss on action sequences
- compute conservative regularization term
- update adapter or full policy
- After update:
- run evaluation suite
- if metrics pass, keep; else roll back
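A Python skeleton of that loop is below. Every interface (policy, env, buffers, evaluator, `run_episode`, `mix_batches`) is a placeholder for your own stack, and `update_fn` is whichever family you chose in Step 5.

```python
import copy

def online_learning_loop(policy, env, safety_filter, replay_buffer, demo_buffer,
                         evaluator, update_fn, episodes_per_update=20,
                         min_success_ratio=0.8):
    """Collect with the safety filter, update periodically, and roll back on regressions.
    Assumes `policy` exposes torch-style state_dict()/load_state_dict()."""
    checkpoint = copy.deepcopy(policy.state_dict())
    baseline = evaluator.run(policy)                            # fixed validation scenarios

    episode_count = 0
    while True:
        trajectory = run_episode(policy, env, safety_filter)    # logs obs/actions/violations
        replay_buffer.add(trajectory, success=trajectory.success)
        episode_count += 1

        if episode_count % episodes_per_update == 0:
            batch = mix_batches(replay_buffer, demo_buffer)     # keep old demos in the mix
            update_fn(policy, batch)                            # Family A, B, or C update

            metrics = evaluator.run(policy)
            if metrics["success_rate"] < min_success_ratio * baseline["success_rate"]:
                policy.load_state_dict(checkpoint)              # automatic rollback
            else:
                checkpoint = copy.deepcopy(policy.state_dict()) # accept the update
                baseline = metrics
```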
Metrics that matter
Track these, not just training loss:
- task success rate
- constraint violation rate
- intervention count
- recovery success rate after perturbations
- time to complete
- distribution drift metrics on observation embeddings
Part 6: Where the field is going next
A few trends are likely to dominate the next wave:
- Policy composition: combine diffusion policies with task and motion planning, so the model handles the contact-rich parts and the planner handles geometry.
- World model integration: use learned dynamics models to score candidate trajectories before execution.
- Better uncertainty: diffusion gives you samples, but you still need calibrated uncertainty to know when to ask for help.
- Standardized online benchmarks: we need benchmarks that measure adaptation, not just static performance.
Practical checklist
If you want a quick action list:
- Train or adopt an offline diffusion policy baseline.
- Add candidate sampling plus a safety filter.
- Log everything and build an evaluation suite.
- Start with batch fine-tuning using recent successes.
- Add conservative online updates only after you can measure regressions.
Conclusion
Diffusion policies are a strong foundation for manipulation because they can represent diverse and smooth action distributions. But offline training alone is rarely enough for messy real world deployments.
Online diffusion policy RL is best seen as a set of engineering patterns: collect the right data, update conservatively, enforce safety, and validate continuously.
If you do those four things, the exact algorithm choice becomes less scary, and your robots will keep getting better after deployment instead of getting stuck on the first unexpected edge case.
Further reading
- Diffusion Policy: https://arxiv.org/abs/2303.04137
- Diffusion Policy project page: https://diffusion-policy.cs.columbia.edu/
- Review: A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control: https://arxiv.org/abs/2601.06133