iDP3 Egocentric 3D Diffusion Policy for Humanoid Manipulation

The deployment problem most robot learning demos quietly avoid

If you have followed robot learning for the past few years, you have probably seen the pattern:

A policy looks impressive in a lab demo.
It works with a fixed camera or a carefully controlled setup.
Then it falls apart the moment you move the camera, change the table, or run on a mobile platform.

That gap is not just bad luck. It is often baked into the representation.

Many visuomotor pipelines (especially ones that use 3D point clouds) assume two things that are reasonable in a lab, but painful in the wild:

Reliable camera calibration (so you can transform observations into a stable world frame)
Clean point cloud segmentation (so the model can focus on task relevant geometry)

Both are hard to maintain on a humanoid robot that moves its head, changes viewpoint constantly, and operates in rooms that are not engineered like a robotics lab.

A recent line of work around 3D diffusion policies tries to fix generalization by learning from 3D structure, not just pixels. One particularly practical extension is Improved 3D Diffusion Policy (iDP3), designed specifically to remove the two deployment crutches above.

This post explains what iDP3 is, why it matters, and what a real implementation needs to look like.

What is iDP3 (Improved 3D Diffusion Policy)

iDP3 is a 3D visuomotor imitation learning policy that uses diffusion modeling to predict robot actions from 3D observations.

The key idea is simple but powerful:

Traditional 3D visuomotor policies often represent the scene in a world coordinate frame.
iDP3 represents 3D observations in the camera coordinate frame (an egocentric 3D representation).

By staying in the camera frame, iDP3 is designed to:

Avoid explicit camera calibration between the sensor and the robot or a global frame
Reduce reliance on brittle segmentation pipelines

In the paper, the authors describe iDP3 as a policy that “eliminates these constraints by leveraging egocentric 3D visual representations.” They demonstrate that a humanoid can learn skills in a single scene and still generalize across diverse real world environments. Source: arXiv HTML version of the paper and project site.

Primary references:

Paper: “Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies” (arXiv) https://arxiv.org/html/2410.10803v1
Project site: https://humanoid-manipulation.github.io/
Training and deployment repo: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy

Why 3D diffusion policies are attractive for robots

Diffusion models are not just for images.

In robot learning, you can treat actions (trajectories, joint targets, end effector commands) as samples from a distribution conditioned on observations. Diffusion gives you a way to model multimodal action distributions and generate smooth sequences.

In practice, diffusion policies have been used because they often deliver:

Stable behavior cloning with good performance on contact rich tasks
Smooth trajectories (especially when trained properly)
Better handling of uncertainty than deterministic regressors

The catch: diffusion policies trained on 2D RGB images can still be fragile under view changes. This is where 3D representations help.

The two killers: calibration and segmentation

Before we talk about the iDP3 fix, it helps to be explicit about why calibration and segmentation are such persistent problems.

1) Camera calibration breaks silently

A lot of 3D pipelines require transforming a point cloud into a consistent coordinate frame.

That means you need:

Intrinsics (focal length, distortion)
Extrinsics (camera pose relative to robot base or world)

On a mobile humanoid with head motion, the extrinsics can drift, and even small errors can produce large 3D misalignment.

The worst part: calibration errors often do not crash anything. They just degrade performance in ways that are hard to debug.

2) Segmentation is a maintenance tax

Point cloud segmentation is typically done via:

Thresholding by depth or workspace bounds
Background subtraction
Classical clustering
Learned segmentation models

Every option has a failure mode:

Lighting changes
Novel objects
Reflective surfaces
Depth sensor noise

If your policy depends on segmentation, you are shipping a pipeline, not a model.

iDP3’s core move: egocentric 3D in the camera frame

The iDP3 approach is to use an egocentric 3D representation “in the camera frame, instead of the world frame.” That wording matters.

Representing the scene in the camera frame gives you invariances that are naturally aligned with deployment:

The robot can move its head, but the representation stays consistent because the camera frame is always the reference.
You do not need to estimate the camera pose in a global frame.

This does not magically solve generalization. It changes where the burden sits:

You stop fighting calibration.
You start caring more about robust 3D perception and training diversity.

On the project website, the authors emphasize view generalization and claim strong robustness to large view changes compared to a 2D Diffusion Policy baseline. Source: https://humanoid-manipulation.github.io/

System level picture: model plus data plus deployment

A good policy paper is never just a policy.

In the iDP3 work, the system has four key components:

Humanoid robot platform
Data collection system
Visuomotor policy learning method (iDP3)
Real world deployment pipeline

This structure is useful because it matches reality: deployment failures are usually not due to the model alone.

Data collection: whole upper body teleoperation (Vision Pro)

For imitation learning to work, you need demonstrations.

A big difference in this work is that demonstrations include more than just arm motion.

The authors describe a whole upper body teleoperation system that incorporates:

Arms and hands
Waist degrees of freedom
Active vision (the robot can change viewpoint)

On the project site, they mention using Apple Vision Pro based teleoperation (building on VisionProTeleop) and streaming robot vision back to the headset. Source: https://humanoid-manipulation.github.io/

Why this matters:

Many manipulation datasets are constrained to a narrow workspace.
Adding waist and head motion increases reachable space and the diversity of viewpoints.
Viewpoint diversity is exactly what you need if you want a policy that generalizes.

Practical takeaway:

If your data collection only covers one camera pose, your model will quietly learn “camera pose is part of the task.”

What the GitHub repo reveals about real implementation choices

Papers often hide the messy details. Repositories leak them.

From the iDP3 training and deployment repo, a few practical signals stand out:

They note using an RTX 4090 class GPU for training.
They mention deployment running on the CPU of an onboard computer (in their case, Fourier GR1).
They recommend the RealSense L515 for depth, and explicitly warn that RealSense D435 depth can be too imprecise for training a 3D policy.

Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy

That is an important reminder: for 3D policies, your depth quality is not a minor detail. It is part of the model.

How iDP3 likely works (conceptually)

The full method includes architectural and training details, but you can understand iDP3 with a conceptual pipeline:

Sense: capture RGB D (or depth) from a head mounted camera
Lift to 3D: convert depth into a point cloud or voxel like 3D features in the camera frame
Encode: embed the 3D observation into a latent representation
Diffuse actions: run a diffusion model that denoises an action sequence conditioned on the observation latent
Execute: send actions to the robot controller (joint targets, end effector targets, or task space commands)

The main “improved” parts, per the paper abstract and introduction, are around avoiding calibration and segmentation by the choice of representation, plus additional modifications that boost effectiveness.

Source: https://arxiv.org/html/2410.10803v1

What generalization means here: view, object, and scene

Generalization claims can be vague. The project page breaks it into three concrete axes:

View generalization

They show that the policy remains effective under “large view changes.” Their comparison suggests a 2D diffusion policy struggles when the camera angle deviates significantly.

Source: https://humanoid-manipulation.github.io/

Object generalization

They evaluate “new objects” beyond training objects, and argue that 3D structure helps.

Source: https://humanoid-manipulation.github.io/

Scene generalization

They show tasks trained in one scene working in “unseen environments,” and note smoother behavior compared to a 2D baseline.

Source: https://humanoid-manipulation.github.io/

The important nuance:

This is still imitation learning.
The policy is not magically reasoning about the world.

It is benefiting from a representation that preserves geometry in a way that is less entangled with lighting and textures.

What tasks are we talking about

On the project site, they highlight practical manipulation skills such as:

Pick and place
Pour
Wipe

These are deceptively hard:

Pick and place requires robust grasp geometry.
Pour requires controlling a container orientation and reacting to contact.
Wipe requires consistent surface contact and long horizon motion.

These are exactly the kinds of skills where pure 2D policies tend to overfit to appearance.

Source: https://humanoid-manipulation.github.io/

How to think about iDP3 vs VLA models

If you read this blog regularly, you have seen a lot of discussion about vision language action (VLA) models. iDP3 is different:

iDP3 is an imitation learning policy focusing on robust 3D perception and action generation.
VLAs often focus on instruction following and general task composition, but can struggle with fine contact dynamics.

One interesting connection is mentioned directly in the iDP3 repo: a community project called PointVLA uses the iDP3 encoder to inject 3D into VLA training.

Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy

My take:

In the near term, the most capable robots will likely use hybrid stacks.
Use a higher level planner (maybe language conditioned) to decide goals.
Use a lower level 3D policy like iDP3 to execute contact rich manipulation reliably.

Practical implementation checklist (what you need if you want to replicate the idea)

If you are building a real system inspired by iDP3, treat this as a checklist.

1) Depth sensing is not optional

Your model is only as good as your depth.

Choose a sensor with stable depth indoors.
Measure noise and missing depth regions.
Expect reflective or black objects to be failure cases.

The iDP3 repo specifically highlights RealSense L515 and warns against D435 depth quality for training.

Source: https://github.com/YanjieZe/Improved-3D-Diffusion-Policy

2) Collect data with viewpoint variation baked in

If you want view invariance, record demonstrations with:

different head poses
different table positions
different object locations

The teleoperation system in the project explicitly enables “active vision” and waist DoF, which increases viewpoint diversity.

Source: https://arxiv.org/html/2410.10803v1

3) Define your action space carefully

Diffusion policies can output:

joint deltas
end effector poses
trajectories

You must match the output to a controller that can execute smoothly and safely.

4) Build a deployment harness, not just a model

Deployment needs:

sensor synchronization
observation preprocessing
safety limits
recovery logic (timeouts, regrasp, abort)

Most “imitation learning failures” in the field are deployment harness failures.

Limitations (and why they matter)

The project site lists several limitations worth taking seriously:

Teleoperation with Vision Pro is easy to set up but can be tiring, limiting data scale.
Depth sensors still produce noisy point clouds, limiting performance.
Fine grained skills (like turning a screw) are time consuming to collect.
They avoided lower body balancing, focusing on upper body manipulation.

Source: https://humanoid-manipulation.github.io/

This is exactly where the next generation of systems will compete:

Better data collection (less human fatigue, more hours of data)
Better depth sensing and 3D perception
Better whole body control and balance

Where this goes next

If you are trying to forecast the direction of humanoid manipulation, iDP3 points to a clear trend:

Representations that reduce “infrastructure dependencies” (calibration, segmentation)
Policies that can travel with the robot (head mounted sensors, egocentric perception)
Data collection that captures realistic embodiment (waist motion, viewpoint changes)

The broader implication is not “diffusion beats transformers” or any single model class winning forever.

The implication is that the winning manipulation stacks will be the ones that are easiest to deploy and maintain.

If a method removes two fragile steps (calibration and segmentation) while improving generalization, it becomes a serious candidate for real products, not just papers.

iDP3 Explained: Egocentric 3D Diffusion Policies for Generalizable Humanoid Manipulation

The deployment problem most robot learning demos quietly avoid

What is iDP3 (Improved 3D Diffusion Policy)

Why 3D diffusion policies are attractive for robots

The two killers: calibration and segmentation

1) Camera calibration breaks silently

2) Segmentation is a maintenance tax

iDP3’s core move: egocentric 3D in the camera frame

System level picture: model plus data plus deployment

Data collection: whole upper body teleoperation (Vision Pro)

What the GitHub repo reveals about real implementation choices

How iDP3 likely works (conceptually)

What generalization means here: view, object, and scene

View generalization

Object generalization

Scene generalization

What tasks are we talking about

How to think about iDP3 vs VLA models

Practical implementation checklist (what you need if you want to replicate the idea)

1) Depth sensing is not optional

2) Collect data with viewpoint variation baked in

3) Define your action space carefully

4) Build a deployment harness, not just a model

Limitations (and why they matter)

Where this goes next

Further reading

Share this article:

Tags:

About Bob Jiang

Agentic Loops for Robot Manipulation: Execution Monitoring, Anchored Diffusion, and the Safety Gap (April 2026)

Related Articles

AGIBOT Rolls Out Its 10,000th Humanoid Robot: Why the Data Flywheel Matters

AGIBOT at MWC 2026: Why Robot Rentals and RaaS Are the Fastest Path to Real-World Humanoids