General Policy Composition: A Training-Free Way to Boost Diffusion Robot Policies at Test Time
Bob Jiang
April 18, 2026
The uncomfortable truth about robot foundation models
Diffusion-based robot policies (and their cousins in flow matching and VLA models) are on a pretty predictable trajectory:
- They get better when you scale data.
- They get better when you scale compute.
- They get better when you scale model capacity.
And all three of those are expensive.
If you are training something like a vision-action (VA) policy or a vision-language-action (VLA) policy, you are paying for (1) robot time, (2) teleop time, (3) dataset cleaning, (4) simulation, (5) evaluation, and (6) the very real engineering cost of getting stable training runs.
So here is a question that has been quietly sitting in the background:
Can we make robot diffusion policies better without training them again?
A 2026 ICLR paper argues yes: by composing existing pre-trained policies at test time, at the level of diffusion "scores" (gradients of the log-probability with respect to the noisy action). The method is called General Policy Composition (GPC).
This post explains what that means, when it helps, and how to think about implementing it in real robot stacks.
Quick background: what a diffusion robot policy is actually doing
If you have not looked at diffusion policies since the hype wave, here is the key idea in plain language.
A diffusion policy treats action generation as a denoising process:
- Start with a noisy action sequence (or a noisy trajectory).
- Iteratively denoise it, conditioning on observations (images, proprioception, etc.).
- Output a sequence of actions and execute them in a receding-horizon loop.
Diffusion Policy (Chi et al.) is the canonical reference here, and it popularized a very practical recipe:
- Use diffusion to model multimodal action distributions (so the policy can represent multiple valid ways to solve a task).
- Generate action sequences (not just one-step actions) and run them with receding-horizon control.
- Condition on vision and optionally other modalities.
The original Diffusion Policy project page is still one of the clearest summaries of why diffusion works well for visuomotor control, and it highlights the advantages around multimodality, high-dimensional action spaces, and training stability.
Sources:
- Diffusion Policy project page: https://diffusion-policy.cs.columbia.edu/
- arXiv: https://arxiv.org/abs/2303.04137
The bottleneck: data is the currency, and robots are slow
If your policy underperforms in a new environment, you typically do one of these:
- Collect more demonstrations.
- Fine-tune with additional robot interaction.
- Add more simulation and try sim-to-real tricks.
- Do domain randomization.
- Do some form of adaptation.
All of those cost time and introduce operational risk.
GPC is interesting because it tries to shift the trade-off:
- No new data.
- No new training.
- Only inference-time compute and some search over mixing weights.
That does not mean it is "free" (inference becomes heavier), but it is a different knob to turn.
What is "policy composition" in diffusion terms?
GPC's framing is: many diffusion-based policies can be described as estimating a score (loosely: a gradient direction that tells you how to denoise toward higher-probability actions given the observation).
If you have multiple pre-trained policies, you can combine their scores.
The simplest mental model
Assume you have two policies:
- Policy A: good at task structure but weak on some corner cases.
- Policy B: a different architecture or modality that covers those corner cases.
Instead of choosing one, GPC does something like:
- At each denoising step, compute both scores.
- Mix them with a convex combination.
- Use the mixed score to denoise.
In symbols (high-level):
- Score mix: s = w·sA + (1 - w)·sB, where w is between 0 and 1.
Then you generate an action trajectory using that mixed score.
This is conceptually similar to ensemble ideas in supervised learning, but the key difference is:
- You are composing the generative dynamics of a diffusion sampler.
- You are composing at the distribution level, not just averaging final actions.
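The convex score mix above is simple enough to sketch directly. This is a minimal illustration with NumPy arrays standing in for the two policies' score estimates at a single denoising step; the function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def mix_scores(score_a, score_b, w):
    """Convex combination of two score estimates at one denoising step.
    w in [0, 1]; w = 1 recovers policy A, w = 0 recovers policy B."""
    assert 0.0 <= w <= 1.0, "w must be a convex weight"
    return w * score_a + (1.0 - w) * score_b

# Toy example: score estimates for an action chunk of 2 steps x 3 dims.
score_a = np.array([[0.5, -1.0, 0.2], [0.1, 0.0, -0.3]])
score_b = np.array([[0.3, -0.8, 0.4], [0.2, 0.1, -0.1]])
mixed = mix_scores(score_a, score_b, w=0.7)
```

In a real sampler this mix would replace the single policy's score inside the denoising loop, so the combination acts on the generative dynamics rather than on final actions.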
The 2026 twist: General Policy Composition (GPC)
The ICLR 2026 paper "Compose Your Policies!" proposes General Policy Composition (GPC) as a training-free framework.
Their claims (paraphrased, with the original wording worth reading) are:
- A proper convex composition of scores can yield a better "single-step" objective than either parent policy.
- Under stability assumptions (they reference a Grönwall-type bound), single-step improvements can propagate through the generation trajectory.
- In practice, this can improve success rate across benchmarks like Robomimic, PushT, and RoboTwin.
Source:
- arXiv: https://arxiv.org/abs/2510.01068
- Project page: https://sagecao1125.github.io/GPC-Site/
Why "general"?
The "general" part matters because the authors position GPC as plug-and-play across:
- Vision-action (VA) and vision-language-action (VLA) policies
- Diffusion and flow-matching policies
- Different visual modalities (e.g., RGB vs point cloud, depending on the policy)
That is a stronger claim than "ensemble two diffusion policies with the same input." It is essentially: compose heterogeneous policies at test time.
When composition can help (and when it probably will not)
From an engineering standpoint, the biggest value of this line of work is not the specific math. It is the practical intuition:
Different policies fail differently.
If you have two policies whose errors are not perfectly correlated, mixing them can reduce failure probability.
Good cases
GPC is most plausible when:
- Both policies are "okay," not perfect. If one is extremely weak, it can poison the mixture.
- They have complementary strengths. Examples:
  - One policy trained on a broader dataset, another on a narrower but higher-quality dataset.
  - One policy conditioned on RGB, another on depth or point clouds.
  - One is a VLA that uses language context well, another is a VA that is very crisp on low-level manipulation.
- Your deployment environment is uncertain. If you are dealing with variable lighting, clutter, or object variation, robustness is the entire game.
- You can afford the inference-time cost. Composition means running multiple models (and potentially a search procedure) in the control loop.
Bad cases
Composition is less likely to help when:
- One policy dominates the other (the weaker one mostly adds noise).
- The task requires very tight real-time latency and you cannot increase compute.
- The failure mode is systematic (e.g., neither policy ever learned the required concept).
A practical rule: composition is not a substitute for missing skills. It is a robustness and performance boost for skills you already have.
How to implement GPC without hand-waving
Let's translate "convex score composition + test-time search" into something you can actually wire up.
Step 1: Standardize the action parameterization
If you want to combine policies, they must agree on:
- Action space (joint deltas, end-effector pose deltas, gripper control)
- Action horizon length
- Normalization and scaling
If they disagree, you will end up composing apples and forklifts.
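A cheap way to enforce this agreement is to attach explicit action metadata to each policy and refuse to compose on mismatch. The `ActionSpec` type below is a hypothetical sketch, not part of any published codebase:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSpec:
    """Hypothetical metadata describing a policy's action parameterization."""
    space: str     # e.g. "ee_delta_pose" or "joint_delta"
    horizon: int   # length of the predicted action chunk
    dim: int       # per-step action dimension
    norm: str      # normalization scheme, e.g. "zscore"

def composable(a: ActionSpec, b: ActionSpec) -> bool:
    # Score composition only makes sense if the parameterizations match exactly.
    return a == b

spec_a = ActionSpec("ee_delta_pose", horizon=16, dim=7, norm="zscore")
spec_b = ActionSpec("ee_delta_pose", horizon=16, dim=7, norm="zscore")
spec_c = ActionSpec("joint_delta", horizon=16, dim=7, norm="zscore")
```

Checking this once at load time is far cheaper than debugging a composed policy that silently mixes incompatible action encodings.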
Step 2: Expose the score (or noise prediction) interface
In many diffusion implementations, the network predicts something like:
- ε (the predicted noise), or
- x0 (the denoised target), or
- v (a v-prediction reparameterization)
Your sampler then converts that into an update step.
To do composition, you need to:
- Run both networks at the same denoising step.
- Convert their outputs into the same score-like quantity (or mix in a consistent parameterization).
The GPC project page shows a simple mixing formula for noise prediction in one of their experiments:
- ε̂ = w1·εA + w2·εB
That is an implementation-friendly view: you can mix predicted noise and proceed with a standard sampler.
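One clean way to wire this up is a wrapper that presents two noise-prediction networks as a single callable, so your existing sampler does not change. The function and argument names here are illustrative assumptions, with constant predictors standing in for real networks:

```python
import numpy as np

def make_composed_eps_fn(policy_a, policy_b, w):
    """Wrap two noise-prediction networks into one eps(x_t, t, obs)
    callable, so an unmodified diffusion sampler can consume the mix.
    Assumes both policies share action space, horizon, and normalization."""
    def eps_fn(x_t, t, obs):
        eps_a = policy_a(x_t, t, obs)   # noise predicted by policy A
        eps_b = policy_b(x_t, t, obs)   # noise predicted by policy B
        return w * eps_a + (1.0 - w) * eps_b
    return eps_fn

# Stand-in "networks" for illustration: constant noise predictors.
policy_a = lambda x, t, obs: np.zeros_like(x)
policy_b = lambda x, t, obs: np.ones_like(x)
eps_fn = make_composed_eps_fn(policy_a, policy_b, w=0.75)
eps = eps_fn(np.zeros((8, 7)), t=10, obs=None)   # one denoising step's mix
```

Because the wrapper has the same signature as a single policy, the rest of the sampling loop (noise schedule, update rule, clipping) stays untouched.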
Step 3: Choose a weight schedule
A fixed weight w is the simplest starting point, but it might not be optimal.
Two common strategies:
- Static mixture
  - Choose a single w (e.g., 0.7) and keep it constant.
  - Pros: simple, stable.
  - Cons: may underperform.
- Search at test time
  - Evaluate multiple w values (or a small schedule) and select what looks best.
  - Pros: can extract more gains.
  - Cons: adds compute, requires an evaluation signal.
That last part (evaluation signal) is the real engineering problem.
Step 4: Decide how you will "score" candidate trajectories
To search over weights, you need a way to pick the best generated trajectory.
Options:
- Model-based scoring: use a learned value function or critic.
- Heuristic scoring: distance-to-goal, collision penalties, smoothness.
- Language-conditioned scoring: if you have language goals, you can score alignment.
If you do not have a scoring function, you can still use a static mixture and hope the ensemble effect helps.
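A heuristic scorer can be very small. The sketch below, under assumed 2D toy trajectories, combines distance-to-goal with a smoothness penalty and picks the best candidate; the weights and function names are illustrative:

```python
import numpy as np

def heuristic_score(traj, goal, smooth_weight=0.1):
    """Score a candidate trajectory: negative final distance to the goal,
    minus a smoothness penalty on step-to-step changes. Higher is better.
    Purely illustrative; real scoring functions are task-specific."""
    dist = np.linalg.norm(traj[-1] - goal)
    smooth = np.sum(np.diff(traj, axis=0) ** 2)
    return -(dist + smooth_weight * smooth)

def pick_best(trajs, goal):
    scores = [heuristic_score(t, goal) for t in trajs]
    return trajs[int(np.argmax(scores))]

# Two toy candidates: one stops halfway, one reaches the goal.
goal = np.array([1.0, 1.0])
trajs = [np.linspace([0, 0], [0.5, 0.5], 5),
         np.linspace([0, 0], [1.0, 1.0], 5)]
best = pick_best(trajs, goal)
```

In a weight search, you would generate one trajectory per candidate w, score them all this way, and execute the winner.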
Step 5: Wrap it in receding-horizon control
Diffusion Policy's receding-horizon approach is not just a detail; it is how you keep things stable:
- Generate an action sequence.
- Execute only the first k actions.
- Observe the new state.
- Re-plan.
Composition fits naturally here because:
- You can search more aggressively at a lower frequency (every re-plan), rather than every control tick.
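The loop above is mostly bookkeeping. Here is a minimal sketch with toy stand-ins for the sampler and the environment; `generate` and `env_step` are hypothetical placeholders for your (composed) diffusion sampler and robot interface:

```python
import numpy as np

def receding_horizon_loop(generate, env_step, obs, execute_k=8, replans=4):
    """Sketch of receding-horizon execution: generate an action chunk,
    execute only the first k actions, observe, then re-plan."""
    executed = []
    for _ in range(replans):
        chunk = generate(obs)                # shape (horizon, action_dim)
        for action in chunk[:execute_k]:     # execute only the prefix
            obs = env_step(action)           # step the robot, get new obs
            executed.append(action)
    return np.array(executed)

# Toy stand-ins: a "sampler" that emits zeros and an env that echoes actions.
actions = receding_horizon_loop(
    generate=lambda obs: np.zeros((16, 7)),
    env_step=lambda a: a,
    obs=np.zeros(7),
)
```

Any weight search would live inside `generate`, so its extra cost is paid once per re-plan rather than once per control tick.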
Practical deployment considerations
Latency: do not destroy your control loop
If your base diffusion policy already runs near your compute limit, composing two or three policies will be painful.
Practical mitigations:
- Use smaller backbones for one component policy.
- Lower denoising steps (fewer diffusion iterations).
- Use asynchronous planning (generate trajectories on a separate thread and feed the controller).
Safety: composition is not certification
If you are deploying on real hardware, you still need:
- Action bounds and joint limit checks
- Collision avoidance layers (even basic ones)
- Emergency stop integration
- Monitoring for divergence (e.g., high action variance)
Composition changes the action distribution; treat it as a new policy for safety purposes.
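As a last line of defense, even a tiny action filter helps. This sketch clamps actions to bounds and rate-limits per-step changes; it is an assumed minimal filter, explicitly not a substitute for collision checking or an e-stop:

```python
import numpy as np

def safe_clip(action, lower, upper, max_delta, prev_action):
    """Minimal safety filter: clamp each action to [lower, upper] and
    limit the per-step change to max_delta. A sketch only; real
    deployments still need collision checks and an e-stop."""
    action = np.clip(action, lower, upper)
    delta = np.clip(action - prev_action, -max_delta, max_delta)
    return prev_action + delta

# A wild composed action gets bounded, then rate-limited.
prev = np.zeros(3)
filtered = safe_clip(np.array([2.0, -5.0, 0.1]),
                     lower=-1.0, upper=1.0,
                     max_delta=0.5, prev_action=prev)
```

Running the filter on every executed action also gives you a cheap divergence monitor: if the filter is constantly saturating, the composed policy is misbehaving.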
Dataset bias: composition can mix biases too
If both policies were trained on similar demonstrations, they may share the same blind spots.
Composition is most useful when the policies were trained differently.
Why this matters in the bigger âPhysical AIâ narrative
The last year of robotics discourse has been obsessed with "foundation models," but there is a quieter engineering reality:
Robots need reliability before they need generality.
GPC is interesting because it is a reliability tool:
- It leverages multiple imperfect policies.
- It tries to reduce failure without retraining.
- It provides a clear interface for combining modalities and architectures.
Even if you never implement GPC exactly as described, the lesson sticks:
If you can afford it, running multiple policies and composing them at inference can be a cheaper path to robustness than another month of data collection.
A concrete workflow you can try this week
If you are building with an existing diffusion policy codebase:
- Train or obtain two policies that differ meaningfully (modality, backbone, dataset, or conditioning).
- Implement a sampler wrapper that:
- runs both policies,
- mixes predicted noise (or score) with a weight w,
- generates trajectories.
- Run offline evaluation on a benchmark you already use (Robomimic-style tasks are ideal).
- Sweep w over {0.1, 0.3, 0.5, 0.7, 0.9}.
- If you see a consistent lift, move to a small test-time search with a lightweight scoring function.
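The offline sweep in the workflow above is a few lines. In this sketch, `success_fn(w)` is a hypothetical stand-in for "run one evaluation episode with mixing weight w and report success"; the deterministic toy evaluator exists only to make the example self-contained:

```python
import numpy as np

def sweep_weights(success_fn, weights=(0.1, 0.3, 0.5, 0.7, 0.9), episodes=20):
    """Offline grid sweep over mixing weights: estimate a success rate
    per weight and return the best one along with all rates."""
    rates = {w: float(np.mean([success_fn(w) for _ in range(episodes)]))
             for w in weights}
    best_w = max(rates, key=rates.get)
    return best_w, rates

# Deterministic toy evaluator for illustration: succeeds only near w = 0.7.
toy_success = lambda w: float(abs(w - 0.7) < 0.05)
best_w, rates = sweep_weights(toy_success)
```

With a real evaluator, use enough episodes per weight that the rate differences exceed the noise before trusting the argmax.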
References
- Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023 / IJRR 2024)
- Cao et al., "Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition" (ICLR 2026)