Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

arXiv:2606.27663v1. Abstract: Vision-Language-Action (VLA) models leverage large-scale vision-language pretraining for flexible robot manipulation, yet at test time they remain brittle along two axes: spatial generalization, when object positions differ from those seen during trai