iDP3 Explained: Egocentric 3D Diffusion Policies for Generalizable Humanoid Manipulation
Bob Jiang
April 2, 2026
The deployment problem most robot learning demos quietly avoid
If you have followed robot learning for the past few years, you have probably seen the pattern:
- A policy looks impressive in a lab demo.
- It works with a fixed camera or a carefully controlled setup.
- Then it falls apart the moment you move the camera, change the table, or run on a mobile platform.
That gap is not just bad luck. It is often baked into the representation.
Many visuomotor pipelines (especially ones that use 3D point clouds) assume two things that are reasonable in a lab, but painful in the wild:
- Reliable camera calibration (so you can transform observations into a stable world frame)
- Clean point cloud segmentation (so the model can focus on task relevant geometry)
Both are hard to maintain on a humanoid robot that moves its head, changes viewpoint constantly, and operates in rooms that are not engineered like a robotics lab.
A recent line of work around 3D diffusion policies tries to fix generalization by learning from 3D structure, not just pixels. One particularly practical extension is Improved 3D Diffusion Policy (iDP3), designed specifically to remove the two deployment crutches above.
This post explains what iDP3 is, why it matters, and what a real implementation needs to look like.
What is iDP3 (Improved 3D Diffusion Policy)
iDP3 is a 3D visuomotor imitation learning policy that uses diffusion modeling to predict robot actions from 3D observations.
The key idea is simple but powerful:
- Traditional 3D visuomotor policies often represent the scene in a world coordinate frame.
- iDP3 represents 3D observations in the camera coordinate frame (an egocentric 3D representation).
By staying in the camera frame, iDP3 is designed to:
- Avoid explicit camera calibration between the sensor and the robot or a global frame
- Reduce reliance on brittle segmentation pipelines
In the paper, the authors describe iDP3 as a policy that âeliminates these constraints by leveraging egocentric 3D visual representations.â They demonstrate that a humanoid can learn skills in a single scene and still generalize across diverse real world environments. Source: arXiv HTML version of the paper and project site.
Primary references:
- Paper: âGeneralizable Humanoid Manipulation with Improved 3D Diffusion Policiesâ (arXiv) https://arxiv.org/html/2410.10803v1
- Project site: https://humanoid-manipulation.github.io/
- Training and deployment repo: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy
Why 3D diffusion policies are attractive for robots
Diffusion models are not just for images.
In robot learning, you can treat actions (trajectories, joint targets, end effector commands) as samples from a distribution conditioned on observations. Diffusion gives you a way to model multimodal action distributions and generate smooth sequences.
In practice, diffusion policies have been used because they often deliver:
- Stable behavior cloning with good performance on contact rich tasks
- Smooth trajectories (especially when trained properly)
- Better handling of uncertainty than deterministic regressors
The catch: diffusion policies trained on 2D RGB images can still be fragile under view changes. This is where 3D representations help.
The two killers: calibration and segmentation
Before we talk about the iDP3 fix, it helps to be explicit about why calibration and segmentation are such persistent problems.
1) Camera calibration breaks silently
A lot of 3D pipelines require transforming a point cloud into a consistent coordinate frame.
That means you need:
- Intrinsics (focal length, distortion)
- Extrinsics (camera pose relative to robot base or world)
On a mobile humanoid with head motion, the extrinsics can drift, and even small errors can produce large 3D misalignment.
The worst part: calibration errors often do not crash anything. They just degrade performance in ways that are hard to debug.
2) Segmentation is a maintenance tax
Point cloud segmentation is typically done via:
- Thresholding by depth or workspace bounds
- Background subtraction
- Classical clustering
- Learned segmentation models
Every option has a failure mode:
- Lighting changes
- Novel objects
- Reflective surfaces
- Depth sensor noise
If your policy depends on segmentation, you are shipping a pipeline, not a model.
iDP3âs core move: egocentric 3D in the camera frame
The iDP3 approach is to use an egocentric 3D representation âin the camera frame, instead of the world frame.â That wording matters.
Representing the scene in the camera frame gives you invariances that are naturally aligned with deployment:
- The robot can move its head, but the representation stays consistent because the camera frame is always the reference.
- You do not need to estimate the camera pose in a global frame.
This does not magically solve generalization. It changes where the burden sits:
- You stop fighting calibration.
- You start caring more about robust 3D perception and training diversity.
On the project website, the authors emphasize view generalization and claim strong robustness to large view changes compared to a 2D Diffusion Policy baseline. Source: https://humanoid-manipulation.github.io/
System level picture: model plus data plus deployment
A good policy paper is never just a policy.
In the iDP3 work, the system has four key components:
- Humanoid robot platform
- Data collection system
- Visuomotor policy learning method (iDP3)
- Real world deployment pipeline
This structure is useful because it matches reality: deployment failures are usually not due to the model alone.
Data collection: whole upper body teleoperation (Vision Pro)
For imitation learning to work, you need demonstrations.
A big difference in this work is that demonstrations include more than just arm motion.
The authors describe a whole upper body teleoperation system that incorporates:
- Arms and hands
- Waist degrees of freedom
- Active vision (the robot can change viewpoint)
On the project site, they mention using Apple Vision Pro based teleoperation (building on VisionProTeleop) and streaming robot vision back to the headset. Source: https://humanoid-manipulation.github.io/
Why this matters:
- Many manipulation datasets are constrained to a narrow workspace.
- Adding waist and head motion increases reachable space and the diversity of viewpoints.
- Viewpoint diversity is exactly what you need if you want a policy that generalizes.
Practical takeaway:
If your data collection only covers one camera pose, your model will quietly learn âcamera pose is part of the task.â
What the GitHub repo reveals about real implementation choices
Papers often hide the messy details. Repositories leak them.
From the iDP3 training and deployment repo, a few practical signals stand out:
- They note using an RTX 4090 class GPU for training.
- They mention deployment running on the CPU of an onboard computer (in their case, Fourier GR1).
- They recommend the RealSense L515 for depth, and explicitly warn that RealSense D435 depth can be too imprecise for training a 3D policy.
Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy
That is an important reminder: for 3D policies, your depth quality is not a minor detail. It is part of the model.
How iDP3 likely works (conceptually)
The full method includes architectural and training details, but you can understand iDP3 with a conceptual pipeline:
- Sense: capture RGB D (or depth) from a head mounted camera
- Lift to 3D: convert depth into a point cloud or voxel like 3D features in the camera frame
- Encode: embed the 3D observation into a latent representation
- Diffuse actions: run a diffusion model that denoises an action sequence conditioned on the observation latent
- Execute: send actions to the robot controller (joint targets, end effector targets, or task space commands)
The main âimprovedâ parts, per the paper abstract and introduction, are around avoiding calibration and segmentation by the choice of representation, plus additional modifications that boost effectiveness.
Source: https://arxiv.org/html/2410.10803v1
What generalization means here: view, object, and scene
Generalization claims can be vague. The project page breaks it into three concrete axes:
View generalization
They show that the policy remains effective under âlarge view changes.â Their comparison suggests a 2D diffusion policy struggles when the camera angle deviates significantly.
Source: https://humanoid-manipulation.github.io/
Object generalization
They evaluate ânew objectsâ beyond training objects, and argue that 3D structure helps.
Source: https://humanoid-manipulation.github.io/
Scene generalization
They show tasks trained in one scene working in âunseen environments,â and note smoother behavior compared to a 2D baseline.
Source: https://humanoid-manipulation.github.io/
The important nuance:
- This is still imitation learning.
- The policy is not magically reasoning about the world.
It is benefiting from a representation that preserves geometry in a way that is less entangled with lighting and textures.
What tasks are we talking about
On the project site, they highlight practical manipulation skills such as:
- Pick and place
- Pour
- Wipe
These are deceptively hard:
- Pick and place requires robust grasp geometry.
- Pour requires controlling a container orientation and reacting to contact.
- Wipe requires consistent surface contact and long horizon motion.
These are exactly the kinds of skills where pure 2D policies tend to overfit to appearance.
Source: https://humanoid-manipulation.github.io/
How to think about iDP3 vs VLA models
If you read this blog regularly, you have seen a lot of discussion about vision language action (VLA) models. iDP3 is different:
- iDP3 is an imitation learning policy focusing on robust 3D perception and action generation.
- VLAs often focus on instruction following and general task composition, but can struggle with fine contact dynamics.
One interesting connection is mentioned directly in the iDP3 repo: a community project called PointVLA uses the iDP3 encoder to inject 3D into VLA training.
Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy
My take:
- In the near term, the most capable robots will likely use hybrid stacks.
- Use a higher level planner (maybe language conditioned) to decide goals.
- Use a lower level 3D policy like iDP3 to execute contact rich manipulation reliably.
Practical implementation checklist (what you need if you want to replicate the idea)
If you are building a real system inspired by iDP3, treat this as a checklist.
1) Depth sensing is not optional
Your model is only as good as your depth.
- Choose a sensor with stable depth indoors.
- Measure noise and missing depth regions.
- Expect reflective or black objects to be failure cases.
The iDP3 repo specifically highlights RealSense L515 and warns against D435 depth quality for training.
Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy
2) Collect data with viewpoint variation baked in
If you want view invariance, record demonstrations with:
- different head poses
- different table positions
- different object locations
The teleoperation system in the project explicitly enables âactive visionâ and waist DoF, which increases viewpoint diversity.
Source: https://arxiv.org/html/2410.10803v1
3) Define your action space carefully
Diffusion policies can output:
- joint deltas
- end effector poses
- trajectories
You must match the output to a controller that can execute smoothly and safely.
4) Build a deployment harness, not just a model
Deployment needs:
- sensor synchronization
- observation preprocessing
- safety limits
- recovery logic (timeouts, regrasp, abort)
Most âimitation learning failuresâ in the field are deployment harness failures.
Limitations (and why they matter)
The project site lists several limitations worth taking seriously:
- Teleoperation with Vision Pro is easy to set up but can be tiring, limiting data scale.
- Depth sensors still produce noisy point clouds, limiting performance.
- Fine grained skills (like turning a screw) are time consuming to collect.
- They avoided lower body balancing, focusing on upper body manipulation.
Source: https://humanoid-manipulation.github.io/
This is exactly where the next generation of systems will compete:
- Better data collection (less human fatigue, more hours of data)
- Better depth sensing and 3D perception
- Better whole body control and balance
Where this goes next
If you are trying to forecast the direction of humanoid manipulation, iDP3 points to a clear trend:
- Representations that reduce âinfrastructure dependenciesâ (calibration, segmentation)
- Policies that can travel with the robot (head mounted sensors, egocentric perception)
- Data collection that captures realistic embodiment (waist motion, viewpoint changes)
The broader implication is not âdiffusion beats transformersâ or any single model class winning forever.
The implication is that the winning manipulation stacks will be the ones that are easiest to deploy and maintain.
If a method removes two fragile steps (calibration and segmentation) while improving generalization, it becomes a serious candidate for real products, not just papers.
Further reading
- iDP3 paper (HTML): https://arxiv.org/html/2410.10803v1
- iDP3 project site (videos and results): https://humanoid-manipulation.github.io/
- iDP3 training and deployment code: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy