Transferring skills between different objects remains one of the core challenges of open-world robot manipulation. Generalization must account for the high-level structural differences between distinct objects while preserving similar low-level interaction control.
In this paper, we propose an example-based zero-shot approach to skill transfer. Rather than treating skills as atomic, we decompose them into a prioritized list of grounded task-axis controllers (GTACs). Each GTAC defines an adaptable controller, such as a position or force controller, along a task axis. Importantly, GTACs are grounded in object keypoints and axes, e.g., the relative position of a screw head or the axis of its shaft. Zero-shot transfer is thus achieved by finding semantically similar grounding features on novel target objects. We achieve this example-based grounding of skills using visual foundation models, such as SD-DINO, which can detect semantically similar keypoints across objects. We evaluate our framework in real-robot experiments, including screwing, pouring, and spatula-scraping tasks, and demonstrate robust and versatile controller transfer for each.
Our method enables robots to perform manipulation tasks on novel objects by grounding modular task-axis controllers using visual correspondences. We begin by selecting functionally meaningful keypoints on a reference object, which define the structure of lifted skills—task specifications that are not tied to any specific object geometry. We use a visual foundation model, SD-DINO, to extract dense features from both the reference and target images, and match the keypoints across objects based on semantic similarity. The corresponding keypoints on the target object, together with 3D point cloud data, allow us to reconstruct axes such as surface normals or edge directions, enabling the instantiation of grounded task-axis controllers.
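The correspondence step can be sketched in a few lines, assuming dense descriptor maps have already been extracted for both images (e.g., by an SD-DINO-style extractor). The helpers below, including their names and the PCA-based normal recovery, are illustrative stand-ins rather than the paper's actual implementation.

```python
import numpy as np

def match_keypoints(ref_feats, tgt_feats, ref_keypoints):
    """Transfer reference keypoints to the target image by nearest-neighbor
    cosine similarity in dense feature space.

    ref_feats, tgt_feats: (H, W, D) per-pixel descriptor maps, e.g. from an
        SD-DINO-style extractor (interface assumed here).
    ref_keypoints: list of (row, col) pixel coordinates on the reference image.
    """
    H, W, D = tgt_feats.shape
    # Flatten and L2-normalize target descriptors so dot products are cosines.
    tgt = tgt_feats.reshape(-1, D)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)

    matches = []
    for (r, c) in ref_keypoints:
        q = ref_feats[r, c]
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = tgt @ q                   # cosine similarity to every target pixel
        best = int(np.argmax(sims))
        matches.append((best // W, best % W))
    return matches

def estimate_axis(points, keypoint_xyz, k=50):
    """Recover a local surface normal at a matched keypoint from a point
    cloud via PCA over its k nearest neighbors: the direction of smallest
    variance approximates the normal."""
    d = np.linalg.norm(points - keypoint_xyz, axis=1)
    nbrs = points[np.argsort(d)[:k]]
    cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]                    # smallest-variance direction
```

Matching each reference keypoint independently keeps the transfer zero-shot: no target-specific training is involved, only a nearest-neighbor lookup in the shared feature space.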
Each controller in our framework operates along one or more of these grounded axes and keypoints.
We support multiple controller types, including position alignment (PosAlign), axis alignment (AxisAlign), force control (ForceAlign), and waypoint traversal (PosWaypoint), which can be composed into multi-step skills.
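To make the composition concrete, one way to represent these controller types is as plain data structures parameterized by named groundings. The field names, the Grounding container, and the example skill below are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Grounding:
    """Object-relative geometric features recovered from correspondences."""
    keypoints: dict  # name -> 3D point, e.g. "screw_head"
    axes: dict       # name -> unit 3D axis, e.g. "screw_shaft"

@dataclass
class PosAlign:
    """Drive a tool keypoint toward a grounded object keypoint."""
    tool_point: str
    object_point: str

@dataclass
class AxisAlign:
    """Rotate the tool so its axis aligns with a grounded object axis."""
    tool_axis: str
    object_axis: str

@dataclass
class ForceAlign:
    """Regulate a desired contact force along a grounded axis."""
    axis: str
    target_force: float  # Newtons

@dataclass
class PosWaypoint:
    """Traverse a sequence of waypoints expressed in the object frame."""
    waypoints: list      # list of 3D points

# A multi-step skill is an ordered list of prioritized controller stacks,
# e.g. a screwing skill (parameters are made up for illustration):
screw_skill = [
    [AxisAlign("driver_shaft", "screw_shaft"), PosAlign("driver_tip", "screw_head")],
    [ForceAlign("screw_shaft", 5.0), AxisAlign("driver_shaft", "screw_shaft")],
]
```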
These controllers are prioritized, and lower-priority commands are projected into the null spaces of higher-priority ones to ensure conflict-free execution.
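The prioritized composition can be sketched as a standard successive null-space projection: each controller contributes a desired motion together with the task directions it constrains, and every lower-priority motion is projected so it cannot disturb directions already claimed above it. This is a generic sketch of the technique, not the paper's exact formulation.

```python
import numpy as np

def compose_prioritized(commands):
    """Combine prioritized controller outputs via null-space projection.

    commands: list of (delta, A) pairs, highest priority first, where
        delta: desired motion in task space (e.g. a 6D twist), and
        A:     matrix whose rows span the directions this controller constrains.
    """
    dim = commands[0][0].shape[0]
    N = np.eye(dim)              # running null-space projector
    total = np.zeros(dim)
    for delta, A in commands:
        total += N @ delta       # apply only the still-unclaimed component
        AN = A @ N               # constrained directions within the remaining space
        # Shrink the null space: remove directions spanned by A within N.
        N = N @ (np.eye(dim) - np.linalg.pinv(AN) @ AN)
    return total
```

For instance, a ForceAlign controller can claim the screw-shaft axis first, after which a PosAlign command is automatically restricted to the directions orthogonal to it.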
Importantly, all controllers are specified geometrically and grounded in object-relative frames, which allows zero-shot transfer to novel objects without retraining or task-specific demonstrations.
We evaluate our method on real-world scraping, pouring, and screwing tasks, demonstrating robust generalization across diverse tools and object geometries.
@misc{seker2025groundedtaskaxeszeroshot,
title={Grounded Task Axes: Zero-Shot Semantic Skill Generalization via Task-Axis Controllers and Visual Foundation Models},
author={M. Yunus Seker and Shobhit Aggarwal and Oliver Kroemer},
year={2025},
eprint={2505.11680},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.11680},
}