Humanoid Policy = Human Policy: Egocentric VR Data for Robots

Why humanoid manipulation is still a data problem

Humanoid robots are finally getting strong enough, cheap enough, and mechanically reliable enough to leave the lab. But there is a stubborn bottleneck that keeps showing up in every real deployment: dexterous manipulation needs a lot of high-quality behavior data.

If you train policies only from robot demonstrations (teleop, kinesthetic teaching, or scripted trajectories), you quickly run into hard scaling limits:

Cost: every minute of robot teleoperation is expensive and fragile.
Coverage: you never collect enough variation in objects, lighting, camera viewpoints, clutter, and background.
Transfer: data from one humanoid does not cleanly map to another humanoid with different kinematics and hands.

That is why the idea in the paper Humanoid Policy ~{} Human Policy (arXiv:2503.13441) is so provocative: treat humans as “just another humanoid embodiment” and co-train on human demonstrations at scale.

Instead of relying on internet video (messy supervision, unknown camera geometry), the authors focus on something more structured and surprisingly accessible: task-oriented egocentric human demonstrations captured with consumer VR hardware, aligned to the kinds of tasks humanoid robots actually do.

In this post, we will unpack the core method and why it matters:

The PH2D dataset: task-oriented, egocentric human manipulation with 3D hand-finger poses.
HAT (Human Action Transformer): a policy trained in a unified, human-centric state-action space.
Why co-training on human data improves robustness (backgrounds, object variation, placement).
What you should take away if you are building robot policies in 2026.

Sources: arXiv abstract and full HTML paper plus the project page and demos.\

arXiv: https://arxiv.org/abs/2503.13441\
Paper HTML: https://arxiv.org/html/2503.13441v3\
Project page: https://human-as-robot.github.io/

The key bet: “human data is the next scaling lever”

The paper’s motivation is simple:

We want manipulation policies that generalize across tasks and robots.
We know that diversity helps (multi-robot datasets, multi-task imitation learning).
But robot data is slow and expensive to collect.

So the authors ask: what if high-quality human demonstrations can provide the missing diversity, as long as we close the “embodiment gap” between humans and humanoid robots?

In the arXiv abstract, they describe the goal as using egocentric human demonstrations as “cross-embodiment training data” and mitigating the gap from both the data and modeling perspectives.

That phrase is doing a lot of work, so let’s break it down.

Why internet video is not enough (yet)

There is a long history of using human videos for robot learning, but many approaches become modular:

Detect hands, objects, affordances, or keypoints.
Infer a latent action.
Map it to a robot controller.

That can work, but each module introduces error, and you often lose the rich 3D structure that makes manipulation learnable.

The paper is intentionally different: rather than trying to scrape the open internet, it uses task-oriented data with accurate 3D hand poses collected with VR.

PH2D: Physical Human-Humanoid Data (task-oriented egocentric demos)

The first contribution is a dataset called PH2D.

From the project page and paper, the key properties are:

Egocentric videos (what the demonstrator sees).
Automatic but accurate 3D hand-finger poses from consumer-grade VR devices.
Task-oriented manipulation, aligned with the tasks and motion patterns used in humanoid teleoperation.
Language annotations (the paper states PH2D includes language annotations).

The authors emphasize that the dataset is not “random daily life,” but structured around manipulation tasks that can be aligned with humanoid robot demonstrations.

Why is that important?

Because if you want to use human behavior data for robot learning, you need to answer two questions:

What is the action representation? (humans have 20+ DoF hands; robots vary wildly)
What is the observation representation? (camera intrinsics, viewpoint, embodiment)

PH2D is designed to make those questions solvable rather than hand-wavy.

Why egocentric data is a sweet spot

Egocentric demonstrations are unusually valuable for robot learning because the robot’s onboard cameras often see the world in a similar way:

Hands and objects in the near field.
Strong viewpoint consistency.
Motion patterns that directly relate to manipulation.

This does not magically remove the embodiment gap, but it reduces one major mismatch compared to third-person video.

HAT: Human Action Transformer in a unified state-action space

The second contribution is the policy itself: Human Action Transformer (HAT).

From the abstract:

HAT has a unified state-action space for both humans and humanoid robots.
The policy can be differentiably retargeted to robot actions.
It is co-trained with smaller-scale robot data to improve alignment.

The core idea is to train a single model to predict manipulation behavior using a representation that makes “human” and “robot” two variants of the same thing.

Think of it like this:

A human hand is an end-effector with pose + finger configuration.
A humanoid robot hand is also an end-effector with pose + finger configuration.
If you pick a representation where both fit, then you can learn in a shared space.

The paper describes that robot actions can be obtained by applying inverse kinematics and hand retargeting to convert predicted human-centric actions into robot controls.

What “unified state-action space” means in practice

A useful mental model:

The model consumes sequences of observations (egocentric video plus state).
It outputs future trajectories for hand/finger motion (in a human-centric representation).
A retargeting layer maps those trajectories into joint targets or control commands for a specific robot.

This is not just a conceptual alignment. The authors emphasize it is differentiable, which matters because:

you can backprop through retargeting during training,
the model can learn what aspects of the human motion are “retargetable,”
and you can avoid brittle, hand-engineered action mappings.

Why co-training with human data helps: robustness and generalization

The paper’s claim is not “human data solves everything.” It is more specific:

human data improves generalization and robustness with significantly better data collection efficiency.

From the project page, the demonstrated improvements include:

Object generalization: objects seen in human data but not in robot data.
Object placement generalization: diverse placements in human demonstrations help beyond narrow robot teleop distributions.
Background generalization: robot data may have limited background variation, while human demos can cover more surfaces and scenes.
Cross-humanoid generalization: transfer to another humanoid platform with limited robot demos.

These are exactly the failure modes you hit when deploying.

A policy that works in a clean lab scene with a single table and consistent lighting often fails the moment you:

put it on a different table,
change the clutter distribution,
move objects 10 cm from where the training set expects,
or swap camera exposure.

The paper argues that human demonstrations are an efficient way to buy that missing diversity.

The important nuance: human data alone is not the point

The method still uses robot data.

That is crucial because:

Humans do not have the same dynamics and constraints as robots.
Humans can use compliance, tactile feedback, and micro-adjustments robots may not have.
Even with a unified representation, you want some on-robot data to ground the policy.

So the best way to interpret HAT is not “replace robot teleop,” but:

Use a smaller amount of robot demos to define the task and align embodiments.
Use a larger amount of human demos to expand coverage and robustness.

A practical pipeline: how you could apply this idea in 2026

Even if you are not building a transformer exactly like HAT, the paper suggests a concrete playbook for policy scaling.

1) Define tasks that are “human-and-robot aligned”

The dataset is task-oriented for a reason. Pick tasks where:

the human demo can be captured with an egocentric camera,
the action can be represented as end-effector + fingers,
and the robot can physically reproduce the same behavior.

Examples include:

pick-and-place of standardized objects,
drawer/door interaction,
bimanual handoffs,
peg-in-hole or insertion-like primitives.

2) Capture structured human data with commodity sensors

The authors explicitly use consumer-grade VR devices (project page).

That matters because it changes the economics:

You can collect data without tying up a robot.
You can scale to many demonstrators.
You can vary environments quickly (different tables, rooms, backgrounds).

The important part is not VR as a “metaverse” thing. It is VR as a cheap sensor suite for:

egocentric RGB,
hand/finger pose,
head pose,
time-synchronized trajectories.

3) Co-train with a small amount of robot data

To get deployment-quality behavior, you likely still need:

some teleop data from the target robot,
calibration of the action mapping,
and a way to handle robot-specific constraints.

HAT addresses this by co-training with humanoid data collected via teleoperation using the same VR hardware.

4) Put retargeting in the training loop

If your action mapping is differentiable (or at least smooth enough for learning), you can train end-to-end.

Practically, that reduces the “policy works in simulation but fails on robot” gap that often comes from a mismatch between:

what the policy predicts,
and what the controller can actually execute.

5) Evaluate on the right axes: not just success rate

The paper focuses heavily on generalization axes that matter in deployment:

novel objects,
new placements,
background changes,
new robots (few-shot transfer).

If you are only measuring success rate on a fixed benchmark scene, you will not see the benefits of diverse human data.

Where this approach can fail (and what to watch)

This line of work is promising, but there are real caveats.

1) Human demonstrations contain “human-only tricks”

Humans use compliance and tactile sensing naturally.

A robot policy trained on human trajectories may learn to rely on micro-motions or contact strategies the robot cannot reproduce, especially if:

the robot hand is underactuated,
the controller is position-based with limited force control,
the robot has less tactile feedback.

The paper’s use of co-training with robot data is one mitigation, but it does not eliminate the issue.

2) Retargeting quality becomes a first-class system component

Once you adopt a unified human-centric action representation, the retargeting layer is no longer an afterthought.

If your inverse kinematics is unstable, or if your hand retargeting is poorly calibrated, you will get:

noisy actions,
inconsistent contact,
or policy collapse where the model predicts actions that are hard to retarget.

Treat retargeting like part of the model, not a post-processing step.

3) Dataset bias is still dataset bias

PH2D is task-oriented. That is a strength, but it means your policy will reflect:

which tasks were chosen,
which objects were used,
and which environments were captured.

You still need deliberate coverage. The main advantage is that coverage becomes cheaper to buy.

The bigger picture: humanoid robots need “internet-scale,” but not necessarily from the internet

A recurring theme in robot learning is that we want an ImageNet moment for manipulation.

The paper argues that we do not have to wait for perfect internet-video supervision to get there. We can build an “internet-scale” dataset in a different way:

make the data cheap to collect (VR, commodity sensors),
make it aligned to robot tasks (task-oriented demos),
make the representation cross-embodiment (unified state-action space),
and use a bit of robot data to ground everything.

That is not just academically interesting. It is one of the most plausible scaling paths for humanoid manipulation in the next few years.

Takeaways

Humanoid manipulation is still bottlenecked by data diversity, not just model size.
Task-oriented egocentric human demonstrations are a strong candidate for scalable data.
HAT frames humans and humanoids as different embodiments in a unified state-action space and retargets actions differentiably.
The value is not “human data replaces robot data,” but human data multiplies the coverage you can afford.

If you are building manipulation systems in 2026, this is the direction to watch: not just better policies, but better ways to feed them.

References

Ri-Zhao Qiu et al., Humanoid Policy ~{} Human Policy, arXiv:2503.13441. https://arxiv.org/abs/2503.13441
Project page and demos: https://human-as-robot.github.io/

Humanoid Policy = Human Policy: Scaling Humanoid Manipulation with Egocentric VR Data