TY - JOUR
T1 - An Online-training-free Adaptor for Open Heterogeneous Collaborative Perception via Diffusion Model
AU - Wang, Tianhang
AU - Lu, Fan
AU - Qu, Sanqing
AU - Li, Bin
AU - Wu, Ya
AU - Cao, Hu
AU - Knoll, Alois
AU - Chen, Guang
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Collaborative perception seeks to mitigate the limitations of single-vehicle perception, such as occlusions, by facilitating communication and information sharing among connected vehicles. However, most existing works assume a homogeneous scenario where all vehicles share identical sensor types and perception model architectures. In contrast, real-world systems often involve heterogeneous agents with diverse sensor configurations and independently developed models. In such settings, directly exchanging features without proper alignment can significantly degrade performance and hinder effective collaboration. While some methods have been proposed to address heterogeneity, they typically require retraining or access to internal model parameters, making them impractical for scalable deployment. To address these challenges, we propose DiffAlign, a plug-and-play adapter that enables feature alignment across heterogeneous agents in a training-free and model-agnostic manner. DiffAlign treats received BEV features as noisy latent representations and progressively refines them through a pretrained diffusion process. This alignment strategy does not require access to model internals or any retraining, which makes it both scalable and privacy-preserving while supporting diverse sensor modalities and perception backbones. Extensive experiments on the simulated OPV2V and real-world V2V4Real datasets demonstrate that DiffAlign consistently improves detection performance in heterogeneous settings, improving CoBEVT by 132.01% and 91.95%, respectively. Our method provides a practical path toward scalable, generalizable, and deployment-ready collaborative perception.
AB - Collaborative perception seeks to mitigate the limitations of single-vehicle perception, such as occlusions, by facilitating communication and information sharing among connected vehicles. However, most existing works assume a homogeneous scenario where all vehicles share identical sensor types and perception model architectures. In contrast, real-world systems often involve heterogeneous agents with diverse sensor configurations and independently developed models. In such settings, directly exchanging features without proper alignment can significantly degrade performance and hinder effective collaboration. While some methods have been proposed to address heterogeneity, they typically require retraining or access to internal model parameters, making them impractical for scalable deployment. To address these challenges, we propose DiffAlign, a plug-and-play adapter that enables feature alignment across heterogeneous agents in a training-free and model-agnostic manner. DiffAlign treats received BEV features as noisy latent representations and progressively refines them through a pretrained diffusion process. This alignment strategy does not require access to model internals or any retraining, which makes it both scalable and privacy-preserving while supporting diverse sensor modalities and perception backbones. Extensive experiments on the simulated OPV2V and real-world V2V4Real datasets demonstrate that DiffAlign consistently improves detection performance in heterogeneous settings, improving CoBEVT by 132.01% and 91.95%, respectively. Our method provides a practical path toward scalable, generalizable, and deployment-ready collaborative perception.
KW - Collaborative perception
KW - diffusion model
KW - open heterogeneous
UR - https://www.scopus.com/pages/publications/105020933395
U2 - 10.1109/TCSVT.2025.3628726
DO - 10.1109/TCSVT.2025.3628726
M3 - Article
AN - SCOPUS:105020933395
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -