TY - JOUR
T1 - Long-Horizon Language-Conditioned Imitation Learning for Robotic Manipulation
AU - Yao, Xiangtong
AU - Blei, Tobias
AU - Meng, Yuan
AU - Zhang, Yu
AU - Zhou, Hongkuan
AU - Bing, Zhenshan
AU - Huang, Kai
AU - Sun, Fuchun
AU - Knoll, Alois
N1 - Publisher Copyright:
© 2025 The Authors.
PY - 2025
Y1 - 2025
N2 - Language-conditioned policies enable robots to follow human language instructions and execute complex tasks. While language-conditioned imitation learning has proven effective in teaching robots to perform tasks guided by language instructions, it faces several challenges arising from the multimodal nature of human demonstrations and limited training data. Variability in demonstrations complicates policy learning, as the same instruction may correspond to diverse actions. To mitigate these issues, we propose an end-to-end transformer-based policy that predicts categorical distributions over a discretized action space. By discretizing the action space and employing autoregressive sampling, our model sidesteps the exponential growth of high-dimensional discrete action spaces and learns complex action distributions effectively. In addition, we apply data augmentation techniques to reuse existing data more effectively and implement an action-disturbance strategy to enhance the model's generalization capabilities. Furthermore, we employ a co-training strategy to leverage data that lacks language annotations. The effectiveness of our approach is demonstrated through simulation and real-world experiments on a robot manipulator in a long-horizon, language-conditioned setting, including multiple environments and zero-shot transfer to real-world settings.
AB - Language-conditioned policies enable robots to follow human language instructions and execute complex tasks. While language-conditioned imitation learning has proven effective in teaching robots to perform tasks guided by language instructions, it faces several challenges arising from the multimodal nature of human demonstrations and limited training data. Variability in demonstrations complicates policy learning, as the same instruction may correspond to diverse actions. To mitigate these issues, we propose an end-to-end transformer-based policy that predicts categorical distributions over a discretized action space. By discretizing the action space and employing autoregressive sampling, our model sidesteps the exponential growth of high-dimensional discrete action spaces and learns complex action distributions effectively. In addition, we apply data augmentation techniques to reuse existing data more effectively and implement an action-disturbance strategy to enhance the model's generalization capabilities. Furthermore, we employ a co-training strategy to leverage data that lacks language annotations. The effectiveness of our approach is demonstrated through simulation and real-world experiments on a robot manipulator in a long-horizon, language-conditioned setting, including multiple environments and zero-shot transfer to real-world settings.
KW - Imitation learning
KW - language-controlled robotics
KW - long-horizon task learning
UR - http://www.scopus.com/inward/record.url?scp=105001007061&partnerID=8YFLogxK
U2 - 10.1109/TMECH.2025.3547047
DO - 10.1109/TMECH.2025.3547047
M3 - Article
AN - SCOPUS:105001007061
SN - 1083-4435
JO - IEEE/ASME Transactions on Mechatronics
JF - IEEE/ASME Transactions on Mechatronics
ER -