TY - CONF
T1 - Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
AU - Wang, Yi
AU - Albrecht, Conrad M.
AU - Braham, Nassim Ait Ali
AU - Liu, Chenying
AU - Xiong, Zhitong
AU - Zhu, Xiao Xiang
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR integrates complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth) and demonstrate its consistent improvement regardless of architecture, in both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work provides valuable insights and raises further interest in researching the hidden relationships of multimodal representations (https://github.com/zhu-xlab/DeCUR).
AB - The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR integrates complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth) and demonstrate its consistent improvement regardless of architecture, in both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work provides valuable insights and raises further interest in researching the hidden relationships of multimodal representations (https://github.com/zhu-xlab/DeCUR).
KW - Multimodal representations
KW - Self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85209210148&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-73397-0_17
DO - 10.1007/978-3-031-73397-0_17
M3 - Conference contribution
AN - SCOPUS:85209210148
SN - 9783031733963
T3 - Lecture Notes in Computer Science
SP - 286
EP - 303
BT - Computer Vision – ECCV 2024: 18th European Conference, Proceedings
A2 - Leonardis, Aleš
A2 - Ricci, Elisa
A2 - Roth, Stefan
A2 - Russakovsky, Olga
A2 - Sattler, Torsten
A2 - Varol, Gül
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th European Conference on Computer Vision, ECCV 2024
Y2 - 29 September 2024 through 4 October 2024
ER -