RGMP-S Explained: Geometric Priors + Spiking Features for Generalizable Humanoid Manipulation
Bob Jiang
April 9, 2026
The problem: humanoid manipulation fails in two boring, expensive ways
Humanoid manipulation looks like a single capability (pick up objects, open doors, use tools), but it keeps failing in two very different layers:
- High-level reasoning isn't physically grounded. A vision-language model (VLM) can explain what to do in English, but it may still pick a plan that is impossible given reachability, occlusions, or object geometry.
- Low-level action learning is data hungry. Even when you know the right plan, turning it into stable long-horizon motion typically takes a lot of demonstrations or reinforcement learning rollouts.
The paper "Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation" proposes a framework called RGMP-S that attacks both layers at once.
Primary source:
- arXiv:2601.09031, "Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation" (Jan 2026): https://arxiv.org/abs/2601.09031
Code (per authors):
- https://github.com/xtli12/RGMP-S
What RGMP-S is (in one sentence)
RGMP-S = a multimodal policy stack where a VLM selects long-horizon skills using explicit geometric priors, and a spiking-inspired recurrent module distills interaction dynamics for data-efficient action generation.
That sentence is dense, so let's unpack it into the two key contributions:
- Long-horizon Geometric Prior Skill Selector (planning / skill selection)
- Recursive Adaptive Spiking Network (representation learning for action generation)
Part 1: Geometric priors that force the planner to respect reality
A lot of modern "VLM + robot policy" systems implicitly assume the VLM will infer the geometry from pixels.
In practice, that's fragile:
- the VLM may "know" what a mug is but not its exact pose
- it may hallucinate affordances (a handle you can't actually grasp)
- it may choose a subgoal that is semantically correct but kinematically unreachable
RGMP-S's bet: don't ask the VLM to be a geometry engine
Instead of trying to make the VLM learn perfect 3D geometry end-to-end, RGMP-S injects lightweight 2D geometric inductive biases that help convert language goals into spatially consistent constraints.
The paper frames this as grounding high-level reasoning in "physical reality" by aligning:
- semantic intent (what the instruction wants)
- with spatial constraints (what is physically possible)
How this typically looks in a real robot stack
Even if the paper's details are specific, the pattern is broadly useful:
- Parse the instruction into a structured goal (object of interest, target region, constraints)
- Extract geometry cues from the scene that are cheap but informative (bounding boxes, keypoints, 2D relations like left-of / inside / overlap)
- Score candidate skills/subgoals by combining semantic relevance and geometric feasibility
In other words: make the planner answer two questions, not one.
- "Does this skill match the instruction?"
- "Is this skill feasible given the observed geometry?"
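One simple way to combine the two answers is to multiply a semantic score by a feasibility score. This is a minimal sketch, not the paper's selector: the `Candidate` class, the example scores, and the product rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate skill with two independent scores in [0, 1] (hypothetical)."""
    name: str
    semantic: float     # how well the skill matches the instruction
    feasibility: float  # how plausible it is given the observed 2D geometry


def rank_candidates(candidates):
    """Rank by semantic * feasibility, so a low score on either axis hurts."""
    return sorted(candidates, key=lambda c: c.semantic * c.feasibility, reverse=True)


candidates = [
    Candidate("grasp_handle", semantic=0.9, feasibility=0.1),  # handle looks occluded
    Candidate("grasp_body", semantic=0.7, feasibility=0.8),
    Candidate("push_mug", semantic=0.3, feasibility=0.9),
]
ranked = rank_candidates(candidates)
print([c.name for c in ranked])  # ['grasp_body', 'push_mug', 'grasp_handle']
```

The best semantic match (`grasp_handle`) drops to last place because the geometry says it is unlikely to work, which is exactly the behavior a pure language planner lacks.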
Why 2D priors can be enough
2D geometry sounds like a downgrade from "full 3D scene understanding," but it often gives you the constraints you actually need:
- relative ordering: in-front-of / behind
- proximity: near / far
- containment: in-bounds, inside a region
- occlusion indicators: overlaps that suggest hidden grasp points
For many manipulation tasks, those are the difference between:
- choosing a grasp that exists
- choosing a grasp that only exists in the model's imagination
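Cues like these can be computed from nothing more than bounding boxes. The sketch below assumes axis-aligned pixel boxes from any detector; the `Box` type and the helper names are hypothetical, and box overlap is only a rough proxy for true occlusion.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates, with x1 < x2 and y1 < y2."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def intersection(a: Box, b: Box) -> float:
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, w) * max(0.0, h)


def left_of(a: Box, b: Box) -> bool:
    """True if a's horizontal center is left of b's."""
    return (a.x1 + a.x2) / 2 < (b.x1 + b.x2) / 2


def inside(a: Box, region: Box) -> bool:
    """True if a lies entirely within region."""
    return (a.x1 >= region.x1 and a.y1 >= region.y1
            and a.x2 <= region.x2 and a.y2 <= region.y2)


def occluded_fraction(target: Box, occluder: Box) -> float:
    """Fraction of the target box covered by the occluder box."""
    a = area(target)
    return intersection(target, occluder) / a if a > 0 else 0.0


mug = Box(100, 100, 200, 200)
bottle = Box(150, 80, 260, 220)
tray = Box(50, 50, 400, 300)

print(left_of(mug, bottle))            # True
print(inside(mug, tray))               # True
print(occluded_fraction(mug, bottle))  # 0.5
```

A planner could treat a high `occluded_fraction` as a hint that a grasp point may be hidden, without ever reconstructing the scene in 3D.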
Part 2: Spiking feature learning for sample-efficient action generation
Once a high-level skill is selected, the system still needs to generate long-horizon motion.
This is where many approaches fall apart:
- behavior cloning overfits to small demos
- policies become brittle under slight changes
- temporal credit assignment gets messy (especially with contact)
RGMP-S's bet: treat interaction as an event-driven temporal process
The paper introduces a Recursive Adaptive Spiking Network designed to improve:
- spatiotemporal consistency (actions don't jitter or drift over time)
- long-horizon feature distillation (retain what matters across many steps)
- data efficiency (learn from fewer demos without collapsing)
Spiking networks (broadly) are known for:
- event-driven computation
- temporal dynamics baked into the representation
- efficiency and robustness in some settings
RGMP-S uses "spiking features" as a way to parameterize robot-object interaction dynamics, rather than relying only on standard continuous activations.
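For readers who have not worked with spiking models, a textbook leaky integrate-and-fire (LIF) neuron shows what "event-driven" means here. This is a generic illustration, not the paper's Recursive Adaptive Spiking Network; the `decay` and `threshold` values are arbitrary assumptions.

```python
import numpy as np


def lif_spikes(inputs, decay=0.9, threshold=1.0):
    """Leaky integrate-and-fire over a 1D input sequence.

    Each step the membrane potential leaks by `decay`, adds the input,
    and emits a spike (then resets to 0) once it reaches `threshold`.
    """
    v = 0.0
    spikes = np.zeros(len(inputs), dtype=np.int8)
    for t, x in enumerate(inputs):
        v = decay * v + x
        if v >= threshold:
            spikes[t] = 1
            v = 0.0
    return spikes


signal = np.array([0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.0, 0.0])
print(lif_spikes(signal))  # [0 0 0 0 1 0 0 0]
```

The output is silent while nothing happens and fires only after input has accumulated, so the temporal history is part of the representation itself.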
Why this matters specifically for contact-rich manipulation
Contact introduces discrete-ish events:
- first touch
- slip onset
- object lifts off
- collisions
Even if your sensors are continuous, the useful structure often has event-like moments.
A representation that's good at:
- compressing long temporal windows
- and emphasizing meaningful state changes
can make the downstream policy easier to learn (and harder to overfit).
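To see how event-like moments can be pulled out of a continuous sensor stream, here is a minimal sketch that turns a force signal into contact on/off events. The thresholds and the function name are assumptions, and the two-threshold hysteresis is a common signal-processing trick rather than anything taken from the paper.

```python
def contact_events(force, on_threshold=1.0, off_threshold=0.5):
    """Return (step, event) pairs for contact onset and release.

    Two thresholds (hysteresis) keep noise around a single threshold
    from producing a burst of spurious on/off events.
    """
    events = []
    in_contact = False
    for t, f in enumerate(force):
        if not in_contact and f >= on_threshold:
            in_contact = True
            events.append((t, "contact_on"))
        elif in_contact and f <= off_threshold:
            in_contact = False
            events.append((t, "contact_off"))
    return events


force = [0.0, 0.2, 1.3, 0.9, 1.1, 0.8, 0.4, 0.1]
print(contact_events(force))  # [(2, 'contact_on'), (6, 'contact_off')]
```

Eight continuous readings collapse into two events, which is the kind of compression that makes long temporal windows cheaper to model.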
Recursive/adaptive = the practical part
In practice, "spiking features" alone aren't magic. The more important idea is the recursive distillation:
- reuse hidden state across time
- adapt to different phases of the task
- keep the model from memorizing a small set of trajectories
If you've trained imitation policies before, you'll recognize the pain RGMP-S is targeting:
- policy works for the demo start state
- diverges after a few seconds
- never recovers because errors compound
Anything that stabilizes the internal temporal features is valuable.
Putting it together: an end-to-end view of RGMP-S
A helpful way to think about RGMP-S is a two-stage loop:
- Skill selection (slow time scale)
  - interpret instruction + scene
  - apply geometric priors
  - select the next skill / subgoal
- Motion synthesis (fast time scale)
  - execute the selected skill
  - use spiking/recurrent features to maintain temporal consistency
  - observe outcomes and transition to the next skill
This "slow planner / fast controller" structure is common.
What's new here is the paper's claim that both halves become more generalizable when:
- planning is constrained by cheap, explicit geometry
- control is stabilized by an interaction-aware temporal representation
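The two-time-scale loop can be written down in a few lines. This is a structural sketch only: `select_skill`, `step_skill`, and `observe` are hypothetical callables you would supply, and resetting the recurrent state at each skill boundary is my assumption, not something the paper specifies.

```python
def run_episode(select_skill, step_skill, observe, max_skills=10, max_steps=200):
    """Pick a skill (slow loop), then step it until it reports done (fast loop).

    select_skill(obs) -> skill name, or None when the task is finished
    step_skill(skill, obs, hidden) -> (done, hidden) for one control step
    observe() -> current observation
    """
    log = []
    for _ in range(max_skills):                # slow time scale
        skill = select_skill(observe())
        if skill is None:
            break
        hidden, done, steps = None, False, 0   # assumed: fresh state per skill
        while not done and steps < max_steps:  # fast time scale
            done, hidden = step_skill(skill, observe(), hidden)
            steps += 1
        log.append((skill, steps))
    return log


# Toy stand-ins so the loop can be exercised without a robot.
plan = ["reach", "grasp", "lift"]
state = {"index": 0}


def observe():
    return dict(state)


def select_skill(obs):
    return plan[obs["index"]] if obs["index"] < len(plan) else None


def step_skill(skill, obs, hidden):
    count = (hidden or 0) + 1   # the "hidden state" here just counts steps
    done = count >= 3           # pretend every skill takes three steps
    if done:
        state["index"] += 1
    return done, count


log = run_episode(select_skill, step_skill, observe)
print(log)  # [('reach', 3), ('grasp', 3), ('lift', 3)]
```

Keeping the two loops separate is also what makes the stack debuggable: you can log which skill was chosen and how long it ran before anything else.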
Where RGMP-S fits vs the current trend (VLA end-to-end policies)
There's a strong industry trend toward single giant policies (VLA models) that map images + text straight to actions.
That's attractive because it's simple to deploy.
But the downside is also simple:
- when it fails, it fails opaquely
- it needs a lot of data to become robust
RGMP-S is a more "systems" take:
- add explicit structure where the model tends to hallucinate (geometry)
- add temporal bias where the model tends to overfit (contact dynamics)
My take: this is the more realistic path for humanoids in 2026.
End-to-end will win eventually, but right now most teams still need:
- interpretable constraints
- modular debugging
- ways to reduce demo requirements
Practical lessons you can steal even if you do not implement RGMP-S
1) Ground language planning with cheap geometry, not perfect 3D
You probably do not need dense 3D reconstruction to improve success rate.
Try this first:
- detect objects
- compute relative 2D spatial relations
- forbid obviously impossible subgoals
This is the highest ROI "anti-hallucination" layer for many VLM planners.
2) Long-horizon manipulation is a temporal representation problem
If your policy jitters, stalls, or drifts, it's often not "the robot is hard."
It's that your model:
- cannot remember the right features long enough
- or remembers the wrong features too strongly
Recurrent state + interaction-aware features (spiking or not) is a direct fix.
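As a reminder of what "recurrent state" buys you, here is one step of a standard GRU cell in NumPy. The weights are random and untrained, the dimensions are arbitrary, and this is the generic GRU formulation rather than the paper's network; the point is only that the gates control how long features are remembered.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, p):
    """One GRU update. The update gate z decides how much old state to keep."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])            # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_new  # z near 0 keeps memory, z near 1 overwrites


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights for illustration only (untrained)."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p["U" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p["b" + gate] = np.zeros(hidden_dim)
    return p


params = init_params(input_dim=4, hidden_dim=8)
h = np.zeros(8)
for x in np.random.default_rng(1).normal(size=(20, 4)):  # 20 fake observations
    h = gru_step(x, h, params)
print(h.shape)  # (8,)
```

"Cannot remember long enough" and "remembers the wrong thing too strongly" both map onto how the update gate behaves, which is why gated recurrence is a reasonable first thing to try.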
3) Evaluate on heterogeneity, not just more tasks in one simulator
The paper reports experiments across:
- a simulation benchmark (ManiSkill)
- multiple real robot platforms (including a custom humanoid, per abstract)
That matters because generalization failures are often robot-specific:
- different compliance
- different camera placement
- different latency
If your approach only works on one platform, it's not generalizable.
Limitations and open questions (what I would check before believing it)
Based on the abstract and the typical failure modes in this space, the key questions to validate are:
- Ablations: how much does the geometric prior contribute vs the spiking module?
- Compute/latency: can the full loop run at real-time rates on embedded hardware?
- Failure cases: does it degrade gracefully, or does the planner/controller mismatch cause cascading failures?
- Data scaling: does the spiking feature learning still help when you have a lot more demos?
These do not invalidate the approach, but they determine whether you ship it.
A concrete implementation sketch (if you want to try this idea this week)
You do not need to reproduce RGMP-S exactly to benefit from its structure. Here is a minimal "RGMP-S shaped" prototype you can build:
- Perception
  - Run object detection + segmentation (or even just boxes).
  - Derive simple 2D relations: overlap, relative depth cue (size), left/right, inside/outside a ROI.
- Skill library
  - Define 5 to 15 parameterized skills (reach, grasp, lift, place, open/close, push, pull).
  - Make every skill declare its preconditions (object visible, handle exposed, target region free).
- Language-to-skill selection
  - Prompt a VLM/LLM to propose a short list of candidate skills and arguments.
  - Hard-filter candidates using your geometric preconditions.
  - Soft-rank what remains using semantic similarity.
- Controller / policy
  - Start with a recurrent policy (GRU/LSTM) trained on demonstrations.
  - If you want the "spiking flavor," add an auxiliary event-like channel: contact on/off, slip indicator, force threshold crossings.
  - Train with strong regularization (dropout, noise injection, action smoothing) to reduce overfitting.
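The skill library and selection steps above can be sketched as follows. Everything here is hypothetical scaffolding: the `Skill` class, the scene dictionary keys, and the semantic scores (which would come from your VLM/LLM) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Scene = Dict[str, bool]  # assumed output of the perception step


@dataclass
class Skill:
    """A skill that declares its own geometric preconditions."""
    name: str
    preconditions: List[Callable[[Scene], bool]] = field(default_factory=list)

    def feasible(self, scene: Scene) -> bool:
        return all(check(scene) for check in self.preconditions)


def object_visible(scene: Scene) -> bool:
    return scene.get("object_visible", False)


def handle_exposed(scene: Scene) -> bool:
    return scene.get("handle_exposed", False)


def target_free(scene: Scene) -> bool:
    return scene.get("target_free", False)


LIBRARY = [
    Skill("grasp_handle", [object_visible, handle_exposed]),
    Skill("grasp_body", [object_visible]),
    Skill("place", [target_free]),
]


def select(scene: Scene, semantic_scores: Dict[str, float]) -> List[str]:
    """Hard-filter by preconditions, then soft-rank by semantic score."""
    feasible = [s for s in LIBRARY if s.feasible(scene)]
    feasible.sort(key=lambda s: semantic_scores.get(s.name, 0.0), reverse=True)
    return [s.name for s in feasible]


scene = {"object_visible": True, "handle_exposed": False, "target_free": True}
scores = {"grasp_handle": 0.9, "grasp_body": 0.6, "place": 0.2}  # from the VLM/LLM
print(select(scene, scores))  # ['grasp_body', 'place']
```

Because the handle is not exposed, `grasp_handle` is removed before ranking even though the language model liked it best. Missing scene keys default to `False`, so an incomplete perception result fails closed.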
Even this stripped-down version will usually outperform "pure prompting + end-to-end BC" once tasks get longer than roughly 10 to 20 seconds.
Conclusion
RGMP-S is interesting because it is not trying to brute-force humanoid manipulation with more parameters.
It is instead doing two "grown-up robotics" things:
- planning: constrain language reasoning with explicit geometric feasibility
- control: encode interaction dynamics with a temporal representation that resists overfitting
If you are building a humanoid manipulation stack today, this is a solid template:
- give the planner a geometry filter
- give the controller a memory that respects contact events
And then iterate relentlessly on the messy real-world edge cases.
References
- Xuetao Li et al. "Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation." arXiv:2601.09031 (2026). https://arxiv.org/abs/2601.09031
- RGMP-S code repository (per authors): https://github.com/xtli12/RGMP-S