Vi-TacMan exploits the complementary strengths of vision and touch to manipulate previously unseen articulated objects: vision, with its global receptive field, proposes a coarse grasp and an initial interaction direction; these cues are sufficient to initialize the tactile-informed controller, which leverages local contact information to achieve precise and robust manipulation.
Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This observation suggests a natural division of labor: vision provides global, coarse guidance, while touch delivers precise, robust execution. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all $\boldsymbol{p}<0.0001$). Critically, manipulation succeeds without explicit kinematic models: the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on 50,000+ simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
When manipulating a previously unseen articulated object placed at a random position, the Vi-TacMan framework begins by capturing an RGB-D image of the scene. Utilizing its global receptive field, the vision module identifies holdable and movable parts, segments them from the surrounding environment, and proposes both a feasible grasp configuration and a coarse motion direction. These visual cues provide sufficient information for the tactile-informed controller to establish stable contact with the object. Guided by the inferred motion direction, the controller maintains this stable contact throughout the manipulation process, ensuring precise and robust execution.
To identify holdable and movable regions, we first train a transformer-based detector that leverages DinoV3 visual features. The detector achieves a notably high mAP of 0.86 on the test set and generalizes seamlessly to real-world images. The model architecture and checkpoints are released with our code. The detection results are then used to prompt SAM2 for segmentation. In the following examples, blue indicates "holdable" parts, while orange indicates "movable" parts.
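As a concrete illustration of this step, the sketch below prompts SAM2 with detected boxes to obtain per-part masks. It assumes the `sam2` package's `SAM2ImagePredictor` interface and a publicly available checkpoint; the box dictionary stands in for our detector's output and is not the released API.

```python
# Minimal sketch: box-prompted segmentation with SAM2. Assumes the `sam2`
# package and the "facebook/sam2-hiera-large" checkpoint; the detector that
# produces the boxes is represented only by its output dictionary here.
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def segment_parts(rgb: np.ndarray, boxes: dict) -> dict:
    """Prompt SAM2 with one detected box per part label and return binary masks."""
    predictor.set_image(rgb)  # H x W x 3 uint8 RGB image
    masks = {}
    for label, box_xyxy in boxes.items():  # e.g. {"holdable": [x0, y0, x1, y1], "movable": [...]}
        m, scores, _ = predictor.predict(box=np.asarray(box_xyxy), multimask_output=False)
        masks[label] = m[0].astype(bool)   # (H, W) mask for this part
    return masks
```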
With the movable and holdable masks defined, the next step is to establish a stable grasp on the handle. Recent advances such as AnyGrasp have demonstrated the effectiveness of parallel grippers for object grasping, even in cluttered environments. However, the handle-grasping problem is considerably simpler in our setting, so we adopt a sampling-based method. The grasp region is restricted to the holdable area, and the grasping point $\boldsymbol{g}$ is defined as the centroid of this region, which also determines the gripper translation. We then sample gripper rotations to identify one that yields a collision-free grasp with minimal gripper width.
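For concreteness, a simplified sketch of this sampling procedure is given below. The gripper dimensions, the clearance, and the coarse point-cloud collision test are illustrative assumptions on our part rather than the exact implementation.

```python
# A simplified, self-contained sketch of the sampling-based grasp selection
# described above. Gripper dimensions are illustrative, and the collision test
# is a coarse point-cloud check, not a full mesh-level collision query.
import numpy as np

FINGER_DEPTH = 0.04    # finger length past the grasp point (m, assumed)
FINGER_THICK = 0.01    # finger thickness (m, assumed)
MAX_WIDTH = 0.08       # maximum gripper opening (m, assumed)

def select_grasp(holdable_pts, obstacle_pts, normal, num_rot=36):
    """Pick the in-plane rotation giving a collision-free grasp of minimal width.

    holdable_pts: (N, 3) points of the holdable region; obstacle_pts: (M, 3)
    scene points outside that region; normal: outward surface normal at the
    grasp point. Returns (width, translation, rotation) or None.
    """
    g = holdable_pts.mean(axis=0)               # grasp point: centroid of the region
    z = -normal / np.linalg.norm(normal)        # approach along the inward normal
    a = np.cross(z, [0.0, 0.0, 1.0])            # axes spanning the in-plane rotations
    if np.linalg.norm(a) < 1e-6:
        a = np.cross(z, [0.0, 1.0, 0.0])
    a /= np.linalg.norm(a)
    b = np.cross(z, a)

    def in_sweep(p):
        # Inside the volume swept by the fingers while closing.
        return (np.abs(p[:, 0]) < FINGER_THICK) & (np.abs(p[:, 2]) < FINGER_DEPTH)

    best = None
    for theta in np.linspace(0.0, np.pi, num_rot, endpoint=False):
        y = np.cos(theta) * a + np.sin(theta) * b        # sampled closing direction
        x = np.cross(y, z)
        R = np.stack([x, y, z], axis=1)                  # gripper rotation (columns x, y, z)

        hold = (holdable_pts - g) @ R
        enclosed = hold[in_sweep(hold)]
        if enclosed.shape[0] == 0:
            continue
        width = 2.0 * np.abs(enclosed[:, 1]).max() + 0.005   # opening that encloses the handle

        obst = (obstacle_pts - g) @ R
        collides = in_sweep(obst) & (np.abs(obst[:, 1]) < MAX_WIDTH / 2.0 + FINGER_THICK)
        if width <= MAX_WIDTH and not collides.any() and (best is None or width < best[0]):
            best = (width, g, R)
    return best
```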
We show a comparison of grasp poses generated by our method and AnyGrasp. Handles of articulated objects typically constrain the grasp pose to a 1-DoF configuration. However, the 6-DoF predictions from AnyGrasp often yield tilted poses that cannot provide reliable grasps. In contrast, our method exploits surface normals as priors to guide the grasp approach direction and further refines the in-plane rotation through collision detection, resulting in more reliable grasp poses.
We model the distribution of interaction directions given visual observations by sampling different point groups in the movable region and inferring the corresponding point displacements after a small perturbation. Each group yields a rigid transformation that, combined with the selected grasping point, is converted into a candidate interaction direction. In the ideal case, all directions would equal the ground truth and the distribution would degenerate to a Dirac delta function. However, due to uncertainty in visual observations, the directions are typically scattered around the ground truth. We fit a von Mises-Fisher (vMF) distribution to these samples to model this uncertainty, where the mean direction achieves the highest probability density. In practice, we compute the Fréchet mean of the sampled directions under the geodesic metric (arc length) on the sphere, yielding an unbiased estimator of the mean.
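A minimal sketch of this estimation step is shown below. It computes the Fréchet mean of the sampled unit directions by iterating the standard Riemannian log/exp maps on the sphere, and estimates the vMF concentration with the common closed-form approximation of Banerjee et al.; both choices are assumptions for illustration rather than the exact implementation.

```python
# Sketch: Fréchet mean of sampled unit directions under the geodesic
# (arc-length) metric on S^2, plus an approximate vMF concentration.
import numpy as np

def log_map(mu, v):
    """Tangent vectors at mu pointing toward each row of v (all unit-norm)."""
    cos_t = np.clip(v @ mu, -1.0, 1.0)
    theta = np.arccos(cos_t)                       # geodesic distances to mu
    perp = v - cos_t[:, None] * mu                 # components orthogonal to mu
    norm = np.linalg.norm(perp, axis=1, keepdims=True)
    return np.where(norm > 1e-9, theta[:, None] * perp / norm, 0.0)

def exp_map(mu, u):
    """Point on the sphere reached from mu along tangent vector u."""
    t = np.linalg.norm(u)
    return mu if t < 1e-12 else np.cos(t) * mu + np.sin(t) * u / t

def frechet_mean(dirs, iters=100, tol=1e-8):
    """Fréchet mean of unit vectors `dirs` (K, 3) under the arc-length metric."""
    mu = dirs.mean(axis=0)
    mu /= np.linalg.norm(mu)                       # extrinsic mean as initialization
    for _ in range(iters):
        step = log_map(mu, dirs).mean(axis=0)      # Riemannian fixed-point update
        mu = exp_map(mu, step)
        if np.linalg.norm(step) < tol:
            break
    return mu

def vmf_kappa(dirs):
    """Approximate vMF concentration (Banerjee et al. approximation, d = 3)."""
    r_bar = np.linalg.norm(dirs.mean(axis=0))
    return r_bar * (3.0 - r_bar**2) / (1.0 - r_bar**2)
```

A tight cluster of sampled directions yields a large concentration (low uncertainty), while scattered samples yield a small one; the returned mean direction is what seeds the tactile controller.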
We illustrate the approach using four representative objects, one from each test category. For each object, we present the obtained samples, the fitted vMF distribution, the ground truth, and the predictions of the three baseline methods. By fitting the distribution and incorporating the surface normal as an inductive bias, our proposed method demonstrates greater robustness to high uncertainties when encountering previously unseen objects.
This work is supported in part by the National Science and Technology Innovation 2030 Major Program (2025ZD0219402), the National Natural Science Foundation of China (62376009), the Beijing Nova Program, the State Key Lab of General AI at Peking University, the PKU-BingJi Joint Laboratory for Artificial Intelligence, and the National Comprehensive Experimental Base for Governance of Intelligent Society, Wuhan East Lake High-Tech Development Zone.
@article{cui2025vitacman,
title={Vi-{T}ac{M}an: Articulated Object Manipulation via Vision and Touch},
author={Cui, Leiyao and Zhao, Zihang and Xie, Sirui and Zhang, Wenhuan and Han, Zhi and Zhu, Yixin},
journal={arXiv preprint arXiv:2510.06339},
year={2025}
}