TY - GEN
T1 - VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition
T2 - 12th International Conference on 3D Vision, 3DV 2025
AU - Li, Yun Jin
AU - Gladkova, Mariia
AU - Xia, Yan
AU - Wang, Rui
AU - Cremers, Daniel
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Cross-modal place recognition methods are flexible alternatives to GPS under varying environmental conditions and sensor setups. However, this task is non-trivial, since extracting consistent and robust global descriptors from different modalities is challenging. To tackle this issue, we propose Voxel-Cross-Pixel (VXP), a novel camera-to-LiDAR place recognition framework that enforces local similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared feature space. Specifically, VXP is trained in three stages: first, we deploy a visual transformer to compactly represent input images; second, we establish local correspondences between image-based and point-cloud-based feature spaces using our novel geometric alignment module; third, we aggregate local similarities into an expressive shared latent space. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++, and KITTI) demonstrate that our method surpasses state-of-the-art cross-modal retrieval methods by a large margin. Our evaluations show that the proposed method is accurate, efficient, and lightweight. Our project page is available at: https://yunjinli.github.io/projects-vxp/.
KW - autonomous driving
KW - cross-modal retrieval
KW - foundation models
KW - place recognition
UR - https://www.scopus.com/pages/publications/105016126895
U2 - 10.1109/3DV66043.2025.00117
DO - 10.1109/3DV66043.2025.00117
M3 - Conference contribution
AN - SCOPUS:105016126895
T3 - Proceedings - 2025 International Conference on 3D Vision, 3DV 2025
SP - 1233
EP - 1242
BT - Proceedings - 2025 International Conference on 3D Vision, 3DV 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 March 2025 through 28 March 2025
ER -