Diffusion Transformer Policies: How On-Device Diffusion Is Making Robot Manipulation Faster and Cheaper
Bob Jiang
February 22, 2026
Introduction: why diffusion suddenly matters for robots
Robot manipulation has a reputation for being either:
- Hand-engineered (reliable but brittle), or
- Learned in the cloud (impressive demos but expensive, slow, and hard to deploy).
Diffusion-based policies are one of the most important shifts in the ālearnedā side because they change how actions are produced. Instead of predicting a single best next action (which often collapses to average behavior), diffusion policies learn to model a distribution over actions and sample from it.
That sounds like a generative-image trick, but in robotics it fixes a real pain:
- Many tasks have multiple valid solutions (grasp from left or right, push then grasp, rotate then lift).
- Demonstrations can be multi-modal (different humans do different things).
- Sensor noise makes exact next-action prediction unstable.
Diffusion policies take the same core idea that made image generation robustāiterative denoisingāand apply it to action sequences.
In this post, we will break down:
- What diffusion policies are in robotics (in plain terms)
- Why Diffusion Transformers are a natural next step
- What āon-device diffusion policyā changes for real-world deployment
- The practical tradeoffs: latency, compute, safety, and reliability
We will also point you to key references, including the original Diffusion Policy work and newer papers on diffusion transformers and efficient/on-device inference.
The baseline: behavior cloning and why it plateaus
The simplest way to learn robot manipulation from demonstrations is behavior cloning (BC):
- Input: observations (images, proprioception, language)
- Output: the next action (joint targets, end-effector delta, gripper open/close)
- Training: supervised learning on demonstration pairs
BC works, but it often fails in three common ways:
1) āAveragingā kills multi-modal behavior
If your dataset includes two equally-good strategies, a deterministic model tends to predict an average that is not executable.
Example: if half the demos grasp from the left and half from the right, the averaged grasp might be straight into the object.
2) Long-horizon compounding errors
A small error at time step 20 becomes a bigger distribution shift by time step 60.
3) Action distributions are weird
Robot actions are not nicely unimodal Gaussians. They can be discontinuous (contact, slip) and depend on latent factors (friction, object pose).
Diffusion-based policies attack all three by learning a richer action distribution and by predicting a whole action chunk instead of a single action.
Diffusion Policy 101 (robot version)
If you have seen diffusion models for images, the loop is:
- Start with noise
- Iteratively denoise
- End with a clean sample
In robotics, the āimageā is replaced by an action trajectory.
What is sampled?
Typically, the policy samples a sequence of actions over a horizon:
- Action chunk: e.g. 8ā32 steps
- Each step could be joint velocities, end-effector deltas, or other control commands
So instead of predicting a single next action, the model predicts a plan-ish chunk of actions conditioned on the current observation.
What does the model learn?
The model learns how to reverse a noising process on trajectories:
- Forward process: take expert action chunks and add noise across timesteps
- Reverse process: learn to remove noise and recover a plausible expert-like chunk
This gives you two big wins:
- You can represent multiple valid futures.
- You tend to get smoother, more consistent motion because the sampled chunk is internally coherent.
A canonical reference is Columbiaās Diffusion Policy page, which benchmarks diffusion policies across manipulation tasks and reports strong performance versus common baselines:
- Diffusion Policy (project): https://diffusion-policy.cs.columbia.edu/
Why the next step is a Diffusion Transformer
Early diffusion policies often used convolutional or MLP-ish backbones with temporal conditioning. But robots are inherently sequence + context problems:
- Temporal dependencies across the horizon
- Cross-modal conditioning (vision, proprioception, language)
- Variable task complexity
Transformers are strong at:
- Modeling sequences
- Attending to the relevant parts of history
- Scaling with data
A Diffusion Transformer combines:
- Diffusion-style denoising of action trajectories
- Transformer attention over time (and sometimes across modalities)
Conceptually:
- The diffusion process defines how you sample actions
- The transformer defines how you model dependencies in those actions (and what conditioning matters)
In the literature you will see names like:
- Diffusion trajectory-guided policy (DTP) for long-horizon tasks (OpenReview, 2024): https://openreview.net/forum?id=VaoeAi5CW8
- Diffusion-transformer-style policies and scaling discussions in later conference papers (e.g., ICCV/CVPR-era work)
The deployment problem: diffusion is good, but it is slow
Here is the uncomfortable truth: diffusion sampling costs steps.
Even if each denoising step is fast, 10ā50 iterations can add up, especially if:
- Your backbone is heavy (transformer)
- Your robot control loop wants low latency
- You want to run on embedded hardware (not a datacenter GPU)
That is why āon-device diffusion transformer policyā papers are interesting. They try to answer:
Can we keep the robustness and multi-modal benefits of diffusion while meeting real-time constraints on edge hardware?
One example surfaced in public CVF access:
- On-Device Diffusion Transformer Policy for Efficient Robot Manipulation (ICCV 2025 PDF): https://openaccess.thecvf.com/content/ICCV2025/papers/Wu_On-Device_Diffusion_Transformer_Policy_for_Efficient_Robot_Manipulation_ICCV_2025_paper.pdf
And related scaling/open-source work appears in diffusion transformer policy papers:
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (ICCV 2025 PDF): https://openaccess.thecvf.com/content/ICCV2025/papers/Hou_Dita_Scaling_Diffusion_Transformer_for_Generalist_Vision-Language-Action_Policy_ICCV_2025_paper.pdf
(Exact details vary per paper, but the shared theme is efficiency + scaling.)
What āon-deviceā actually changes
Running a policy on-device is not just a performance flex. It changes the product constraints.
1) Latency becomes predictable
Cloud inference adds variable network delay. On-device inference makes latency mostly a function of compute and scheduling.
For manipulation, predictable latency often matters more than raw average latency because:
- Contact dynamics are fast
- Safety controllers rely on timing assumptions
2) Reliability improves in the messy world
Warehouses, farms, homes: connectivity is inconsistent. If your policy needs the internet, you do not have a product.
3) Data privacy and compliance get easier
On-device processing can simplify privacy concerns when cameras are involved (especially in home settings).
4) Cost per robot drops
Cloud GPUs are expensive. If you are shipping a fleet, on-device compute is often cheaper over time, even if the device BOM is higher.
The typical engineering tricks to make diffusion policies faster
Most āefficient diffusionā methods boil down to a few knobs:
A) Fewer denoising steps
- Use improved schedulers
- Distill many-step sampling into fewer steps
B) Smaller or structured backbones
- Use a transformer, but keep it narrow
- Use temporal attention only where needed
- Share weights across steps
C) Action chunking and receding horizon control
Instead of sampling 100 steps, sample 16 steps and re-plan every 4ā8 steps.
This also acts like an implicit feedback loop.
D) Quantization and deployment optimization
- INT8 / mixed precision
- Kernel fusion
- Static shapes where possible
E) Conditioning compression
If you condition on images, the vision encoder can dominate cost. Many systems:
- Precompute image features
- Use lightweight visual backbones
- Cache features across denoising iterations
When diffusion policies beat classic policies (and when they do not)
They shine when:
- The task is multi-modal (many valid approaches)
- Demonstrations are diverse
- Contact-rich motion needs smoothness
- You want a generalist policy with broad behavior coverage
They struggle when:
- You need ultra-low latency with high-frequency control
- Your dataset is small and narrow (a simpler BC model may be enough)
- The environment requires strong state estimation and classical feedback
A pragmatic take: diffusion is not replacing control theory. It is replacing the brittle āsingle next action regressionā layer in learned manipulation systems.
A practical mental model: diffusion policy as a constraint-satisfying sampler
Think of a diffusion policy as a sampler that tries to satisfy multiple soft constraints:
- āThe trajectory should look like expert behaviorā
- āIt should be consistent with the current observationā
- āIt should fit the task conditioning (language, goal image, etc.)ā
Sampling gives you the freedom to represent diverse strategies, which can be crucial in manipulation.
If you want to go further, you can add additional constraints:
- collision checks
- torque limits
- velocity limits
This is one direction where hybrid systems will likely win: diffusion proposes, safety filters verify.
How to evaluate diffusion transformer policies (what to measure)
If you are building or choosing a policy, measure more than success rate:
- Success rate across tasks and object variations
- Time-to-success (efficiency)
- Trajectory smoothness (jerk, acceleration)
- Recovery behavior after perturbations
- Latency distribution (p50, p95, worst-case)
- Compute footprint (watts, memory)
- Safety incidents (collisions, excessive forces)
On-device work forces you to care about #5 and #6.
What to watch next (2026 and beyond)
Three trends are converging:
1) Generalist VLA policies + diffusion-style action generation
Vision-language-action policies want broad coverage and multi-modality. Diffusion fits naturally.
2) Robotics compute stacks are maturing
Edge accelerators, better runtimes, and better quantization make on-device more realistic.
3) Safety and determinism will become mandatory
As these policies leave labs, we will see more safety wrappers, verification layers, and deterministic fallbacks.
My bet: the winners will be the teams that treat diffusion as a component in a full robot stack (perception, planning, control, safety), not as a standalone magic model.
References and further reading
- Diffusion Policy (Columbia): https://diffusion-policy.cs.columbia.edu/
- Diffusion Trajectory-guided Policy (OpenReview 2024): https://openreview.net/forum?id=VaoeAi5CW8
- On-Device Diffusion Transformer Policy for Efficient Robot Manipulation (ICCV 2025): https://openaccess.thecvf.com/content/ICCV2025/papers/Wu_On-Device_Diffusion_Transformer_Policy_for_Efficient_Robot_Manipulation_ICCV_2025_paper.pdf
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (ICCV 2025): https://openaccess.thecvf.com/content/ICCV2025/papers/Hou_Dita_Scaling_Diffusion_Transformer_for_Generalist_Vision-Language-Action_Policy_ICCV_2025_paper.pdf
- Awesome list (diffusion in robotics): https://github.com/showlab/Awesome-Robotics-Diffusion