Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Published in 2025 IEEE International Conference on Robotics & Automation (ICRA 2025), 2025

Recommended citation: Li, J., Wang, Z., Zheng, J., Zhou, X., Wang, G., Song, G., Liu, Y., Liu, J., Zhang, Y., Zhan, X. Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning. 2025 IEEE International Conference on Robotics & Automation (ICRA 2025).

Abstract

Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions, which are abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities by pretraining a robotic multimodal encoder on extensive out-of-domain data. Then, we employ two operations, Collapse and Corrupt, to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of the same task goal into interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Extensive evaluation across more than 130 tasks and 4,000 trials on both the simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of the proposed framework, demonstrating a significant advantage in overcoming data constraints in robotic learning.
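
The abstract describes the Collapse and Corrupt operations only at a high level. As an illustration of the general idea, not the authors' released implementation, the sketch below shows one common way such operations can be realized on frozen encoder features: Collapse centers each modality's embeddings to remove the modality-specific offset, and Corrupt adds small Gaussian noise so that embeddings of the same goal from different modalities overlap and become interchangeable. The function names, the mean-centering choice, and the noise scale are assumptions made for this example.

```python
import numpy as np


def collapse(embeddings: np.ndarray) -> np.ndarray:
    """Center (N, D) L2-normalized features from one modality around their
    per-modality mean, then re-normalize, to reduce the modality gap."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=-1, keepdims=True)


def corrupt(embeddings: np.ndarray, noise_scale: float = 0.05, seed: int = 0) -> np.ndarray:
    """Add isotropic Gaussian noise and re-normalize, so that nearby points
    from different modalities map to overlapping regions of the latent space."""
    rng = np.random.default_rng(seed)
    noisy = embeddings + noise_scale * rng.standard_normal(embeddings.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)


# Example usage (hypothetical features from a pretrained multimodal encoder):
# during policy training, condition on corrupt(collapse(text_features));
# at test time, either modality's collapsed embedding can be fed in its place.
text_features = np.random.randn(32, 512)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)
goal_latent = corrupt(collapse(text_features))
```

In this sketch, the same collapse-then-corrupt pipeline would be applied to image-goal features, which is what makes the two modalities interchangeable as conditioning inputs for the policy.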

Other information

  • Project Page
  • Paper
  • This paper was also previously accepted at the NeurIPS 2024 Workshop on Open-World Agents (OWA).