GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Pipeline

Architecture of GuidedVLA. We introduce explicit, structured guidance into the multi-head attention layers of the VLA action decoder. Instead of relying on implicitly entangled representations, we repurpose dedicated attention heads to specialize in distinct task-relevant factors: (i) the Object Head, whose attention maps are supervised to ground task-relevant objects and suppress distractors via ℒ_object; (ii) the Skill Head, which aligns internal feature representations with temporal skill phases (e.g., Pick → Place) through an auxiliary classification loss ℒ_skill; (iii) the Depth Head, which injects geometric cues by cross-attending only to features from a depth encoder. Together, these forms of guidance force the policy to be explicitly aware of spatial, temporal, and geometric structure.

Object Grounding Head

Supervises attention maps to explicitly ground task-relevant objects and suppress distractors via an attention-mask alignment loss. Critical for precise localization of transparent or refractive objects and small targets.

Key insight: Forces action tokens to attend to semantically meaningful regions rather than incidental visual contrast.
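
As a rough illustration of what an attention-mask alignment loss can look like, the sketch below scores one head's attention against a binary object mask using cross-entropy. The exact form of ℒ_object is not given here, so the choice of cross-entropy, the function name object_alignment_loss, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_alignment_loss(attn_logits, object_mask, eps=1e-8):
    """Hypothetical alignment loss (an assumed form, not the paper's definition).

    attn_logits: (num_action_tokens, num_patches) raw scores of the object head.
    object_mask: (num_patches,) binary mask, 1 on task-relevant object patches.
    """
    attn = softmax(attn_logits, axis=-1)
    # Turn the mask into a target distribution over patches.
    target = object_mask / (object_mask.sum() + eps)
    # Cross-entropy between the target and the attention, averaged over action tokens.
    return float(-(target * np.log(attn + eps)).sum(axis=-1).mean())

mask = np.array([1.0, 1.0, 0.0, 0.0])
on_object = np.array([[5.0, 5.0, 0.0, 0.0]])      # attention concentrated on the object
on_distractor = np.array([[0.0, 0.0, 5.0, 5.0]])  # attention concentrated elsewhere
loss_good = object_alignment_loss(on_object, mask)
loss_bad = object_alignment_loss(on_distractor, mask)
```

Under this formulation, attention that falls on the masked object patches yields a lower loss than attention that falls on distractor patches, which is the behavior the head is meant to learn.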

Skill Recognition Head

Aligns internal feature representations with temporal skill phases (e.g., Pick → Place) through an auxiliary classification loss. Prevents stage-skipping in multi-step behaviors.

Key insight: Encodes temporal intent progression to maintain stage awareness across extended horizons.
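
A minimal sketch of an auxiliary skill classification loss follows, assuming a single linear classifier on top of the skill head's features and a two-phase label set. The names skill_loss and SKILL_PHASES, the linear classifier, and the shapes are hypothetical; the actual ℒ_skill may be defined differently.

```python
import numpy as np

SKILL_PHASES = ["pick", "place"]  # hypothetical label set for illustration

def skill_loss(features, W, b, labels):
    """Mean cross-entropy of a linear skill classifier (an assumed form).

    features: (T, d) skill-head features, one row per timestep.
    W, b:     (d, num_phases) weights and (num_phases,) bias.
    labels:   (T,) integer index into SKILL_PHASES for each timestep.
    """
    logits = features @ W + b
    # Log-softmax computed in a numerically stable way.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

features = np.ones((3, 4))
labels = np.array([0, 1, 0])
# An untrained (all-zero) classifier predicts a uniform distribution over phases.
uniform_loss = skill_loss(features, np.zeros((4, 2)), np.zeros(2), labels)
```

With an all-zero classifier the loss equals log(2), the entropy of a uniform guess over two phases; training would push it below that as the features become predictive of the current phase.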

Geometry Perception Head

Injects explicit 3D spatial information by constraining dedicated attention heads to process only features from a frozen depth encoder (Depth Anything 3).

Key insight: Provides metric geometric reasoning for sub-centimeter precision tasks where monocular RGB cues are insufficient.
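
One way to constrain a head to depth features only is to mask every non-depth key before the softmax. The sketch below shows that masking for a single head; the function name depth_only_attention, the boolean is_depth_token flag, and the token layout are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def depth_only_attention(q, k, v, is_depth_token):
    """Single-head attention restricted to depth tokens (illustrative sketch).

    q: (Tq, d) action-token queries; k, v: (Tk, d) context keys and values.
    is_depth_token: (Tk,) bool, True where the token comes from the depth encoder.
    Requires at least one depth token.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Non-depth tokens get -inf so they receive exactly zero attention weight.
    scores = np.where(is_depth_token[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
is_depth = np.array([False, False, True, True, True])
out, weights = depth_only_attention(q, k, v, is_depth)
```

Because the masked weights are exactly zero, the head's output depends only on the depth tokens, regardless of what the RGB or language tokens contain.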

Experiment

GuidedVLA achieves significant performance gains across simulation benchmarks and real-world platforms, with particularly strong improvements under distribution shifts.

SIMULATION 1: LIBERO-Plus Benchmark Results

The proposed model achieves the highest average success rate (75.4%), a 7.2-point improvement over its base model π₀ (68.2%). Notably, single-head ablations reveal task-specific alignment: the object head is the strongest single head on the Goal suite (requiring precise target grounding), the skill head on the Long suite (requiring sequential temporal consistency), and the depth head on the Spatial and Object suites (requiring 3D understanding).

Model	Perturbation Dimensions							Task Suites				Total
Model	Camera	Robot	Language	Light	Background	Noise	Layout	Spatial	Object	Goal	Long	Total
OpenVLA	0.8	3.5	23.0	8.1	34.8	15.2	28.5	19.4	14.0	15.1	14.3	15.6
OpenVLA-OFT	56.4	31.9	79.5	88.7	93.3	75.8	74.2	84.0	66.5	63.0	66.4	69.6
NORA	2.2	37.0	65.1	45.7	58.6	12.8	62.1	47.6	34.4	38.8	36.3	39.0
WorldVLA	0.1	27.9	41.6	43.7	17.1	10.9	38.0	32.5	28.6	31.8	8.2	25.0
UniVLA	1.8	46.2	69.6	69.0	81.0	21.2	31.9	55.5	36.7	40.7	39.9	43.9
π₀-FAST	65.1	21.6	61.0	73.2	73.2	74.4	68.8	74.4	72.7	57.5	43.4	61.6
RIPT-VLA	55.2	31.2	77.6	88.4	91.6	73.5	74.2	85.8	64.3	58.0	67.5	68.4
DreamVLA	65.0	40.8	63.5	85.7	82.6	84.9	74.0	79.7	79.0	61.7	59.8	69.9
AdaMoE	53.8	17.5	20.6	73.7	73.8	58.6	65.8	51.0	57.9	53.3	38.1	50.1
π₀	62.3	39.8	63.1	86.0	82.8	82.4	69.6	77.7	74.1	62.2	60.5	68.2
w/ object head	68.2	40.0	62.1	91.4	87.2	85.0	76.5	77.4	78.8	67.5	62.7	71.5
w/ skill head	69.3	40.5	63.2	90.2	87.6	85.5	75.5	79.8	78.6	66.6	63.6	71.8
w/ depth head	68.1	43.9	65.8	90.7	83.4	85.6	72.8	81.4	79.0	65.4	61.8	71.7
w/ all heads (Ours)	70.8	49.4	66.8	92.9	88.1	89.3	78.4	82.3	79.9	71.2	68.4	75.4
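
The single-head comparison can be checked directly against the table. The short script below copies the task-suite columns of the three single-head rows and the Total column of the π₀ and full-model rows, then finds the best single head per suite; the variable names are only for illustration.

```python
# Task-suite success rates (%) copied from the single-head rows of the table.
single_head = {
    "object head": {"Spatial": 77.4, "Object": 78.8, "Goal": 67.5, "Long": 62.7},
    "skill head":  {"Spatial": 79.8, "Object": 78.6, "Goal": 66.6, "Long": 63.6},
    "depth head":  {"Spatial": 81.4, "Object": 79.0, "Goal": 65.4, "Long": 61.8},
}
baseline_total = 68.2  # Total for the base model
full_total = 75.4      # Total for the model with all heads

# For each suite, pick the single-head variant with the highest success rate.
best_per_suite = {
    suite: max(single_head, key=lambda head: single_head[head][suite])
    for suite in ["Spatial", "Object", "Goal", "Long"]
}
gain_points = round(full_total - baseline_total, 1)

print(best_per_suite)
print(gain_points)
```

Note that the margins between single heads are small on some suites (for example 79.0 versus 78.8 on Object), so the per-suite ranking is best read as a trend rather than a decisive separation.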

SIMULATION 2: RoboTwin 2.0 Benchmark

RoboTwin 2.0 Benchmark Performance. Success rates across 8 manipulation tasks comparing the π₀ baseline, single-head experts, and our full model. While specific heads excel at aligned tasks (e.g., the depth head on the geometry-heavy Beat Block Hammer task), the full model (shown in purple) integrates these capabilities to achieve the best overall average performance (90.63%).

ABLATION: Factor Quality Correlation

Higher Factor Quality Leads to Better Task Performance. Top: Quantitative analysis on the LIBERO-Plus layout perturbation track shows that improving the quality of each specialized head consistently boosts success rates. (a) Object Head: as the proportion of attention focused on task-relevant object regions increases, success rises from 61.3% to 74.6%, highlighting the importance of precise object-centric attention. (b) Skill Head: higher skill-recognition accuracy, measured by a linear probe, correlates with improved performance (66.2% to 72.9%), indicating that better temporal understanding enhances control. (c) Depth Head: increasing the ratio of true depth features (versus noise) dramatically improves both qualitative depth estimation and quantitative success (15.6% to 76.7%), confirming that explicit 3D cues are critical for robust manipulation. Bottom: Qualitative visualizations show how changes along the x-axis metrics are reflected in the corresponding feature representations.
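
For concreteness, the object-head quality metric in (a), the proportion of attention on task-relevant object regions, could be computed as below. This is a sketch under the assumption that attention rows are already normalized and that a per-patch boolean object mask is available; the function name object_attention_ratio is hypothetical.

```python
import numpy as np

def object_attention_ratio(attn, object_mask):
    """Mean fraction of attention mass that lands on object patches.

    attn:        (num_action_tokens, num_patches), each row sums to 1.
    object_mask: (num_patches,) bool, True on task-relevant object patches.
    """
    return float(attn[:, object_mask].sum(axis=-1).mean())

attn = np.array([
    [0.7, 0.1, 0.2],
    [0.5, 0.3, 0.2],
])
object_mask = np.array([True, False, True])
ratio = object_attention_ratio(attn, object_mask)  # (0.9 + 0.7) / 2 = 0.8
```

A ratio near 1 means the head attends almost exclusively to the task-relevant objects, while a ratio near the mask's area fraction indicates attention that is no better than uniform.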

REAL WORLD: Cross-Platform Generalization

Cross-Platform Real-World Generalization. Success rates (N=20 trials per task) across four generalization scenarios on the ALOHA and PSI-Bot platforms. Our method consistently outperforms the baseline, achieving performance gains across all scenarios (up to a 52.7% relative improvement in average success rate, from 44.2% to 67.5% in the Scene setting) and demonstrating robustness under challenging out-of-domain conditions. †In-domain generalization includes variations in object positions within the training distribution.

Generalization Setting	Method	ALOHA AgileX			PSI-Bot RealMan			Average (%)
Generalization Setting	Method	Task 1	Task 2	Task 3	Task 4	Task 5	Task 6	Average (%)
In-Domain†	Base Policy	10/20	11/20	9/20	12/20	12/20	13/20	55.8
In-Domain†	Ours	14/20	15/20	14/20	16/20	17/20	15/20	75.8
Scene	Base Policy	7/20	8/20	6/20	12/20	11/20	9/20	44.2
Scene	Ours	13/20	12/20	11/20	15/20	16/20	14/20	67.5
Lighting	Base Policy	11/20	9/20	10/20	14/20	12/20	13/20	57.5
Lighting	Ours	13/20	16/20	15/20	17/20	18/20	16/20	79.2

Tasks: (1) pick up fruits and vegetables, (2) stack the bowls, (3) clean the tabletop, (4) pick up the beaker, (5) stack the beakers, (6) heat the beaker.
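
The Average (%) column and the relative gains can be reproduced from the per-task counts in the table. The script below is a small worked check; it assumes the averages are rounded to one decimal before the relative gain is taken, which is what reproduces the 52.7% figure for the Scene setting.

```python
TRIALS = 20  # trials per task

# Successes per task (Task 1..6), copied from the table.
successes = {
    "In-Domain": {"Base Policy": [10, 11, 9, 12, 12, 13], "Ours": [14, 15, 14, 16, 17, 15]},
    "Scene":     {"Base Policy": [7, 8, 6, 12, 11, 9],    "Ours": [13, 12, 11, 15, 16, 14]},
    "Lighting":  {"Base Policy": [11, 9, 10, 14, 12, 13], "Ours": [13, 16, 15, 17, 18, 16]},
}

def average_rate(counts):
    # Average success rate in percent, rounded to one decimal as in the table.
    return round(100.0 * sum(counts) / (TRIALS * len(counts)), 1)

averages = {
    setting: {method: average_rate(counts) for method, counts in methods.items()}
    for setting, methods in successes.items()
}

# Relative gain of Ours over Base Policy, computed from the rounded averages.
relative_gain = {
    setting: round(100.0 * (avg["Ours"] - avg["Base Policy"]) / avg["Base Policy"], 1)
    for setting, avg in averages.items()
}

for setting in successes:
    print(setting, averages[setting], relative_gain[setting])
```

The largest relative gain falls in the Scene setting, where the base policy is weakest, while the absolute gains are similar across settings at roughly 20 to 23 percentage points.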

Real Robot Tasks

Task 1: Pick up fruits and vegetables (4×)

Task 2: Stack the bowls (4×)

Task 3: Clean the tabletop (4×)

Task 4: Pick up the beaker (4×)

Task 5: Stack the beakers (4×)

Task 6: Heat the beaker (4×)