Abstract
Our research aims to generate robust, dense 3-D depth maps for robotics, in particular autonomous driving applications. Since cameras output 2-D images and active sensors such as LiDAR or radar produce only sparse depth measurements, dense depth maps must be estimated. Recent methods based on vision transformer networks have outperformed conventional deep learning approaches in various computer vision tasks, including depth prediction, but have focused on the use of a single camera image. This article explores the potential of vision transformers applied to the fusion of monocular images, semantic segmentation, and projected sparse radar reflections for robust monocular depth estimation. A semantic segmentation branch is added to provide object-level understanding and is investigated in both a supervised and an unsupervised manner. We evaluate our new depth estimation approach on the nuScenes dataset, where it outperforms existing state-of-the-art camera-radar depth estimation methods. We further show that, through transfer learning, models benefit from the additional segmentation branch during training even when segmentation is not run at inference. Further studies are needed to investigate the use of 4-D imaging radars and enhanced ground-truth generation. The related code is available as open-source software at https://github.com/TUMFTM/CamRaDepth.
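To make the fusion idea described above concrete, the following is a minimal, self-contained sketch of how sparse radar reflections can be projected into the camera image plane and stacked with the RGB image and a segmentation map to form a multi-channel network input. It is an illustration only, not the CamRaDepth pipeline: the intrinsics `K`, the radar points, and the placeholder tensors are synthetic assumptions; in practice these values come from the nuScenes calibration and annotation data.

```python
import numpy as np
import torch


def project_radar_to_image(points_xyz: np.ndarray, K: np.ndarray,
                           height: int, width: int) -> np.ndarray:
    """Project 3-D radar points (camera frame, metres) into a sparse depth map."""
    depth_map = np.zeros((height, width), dtype=np.float32)
    # Keep only points in front of the camera.
    pts = points_xyz[points_xyz[:, 2] > 0.0]
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    uv = (K @ pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Write the range (z) of each valid point; later points overwrite earlier ones.
    depth_map[v[valid], u[valid]] = pts[valid, 2]
    return depth_map


if __name__ == "__main__":
    H, W = 900, 1600                               # nuScenes camera resolution
    K = np.array([[1266.4, 0.0, 800.0],            # hypothetical intrinsics
                  [0.0, 1266.4, 450.0],
                  [0.0, 0.0, 1.0]])
    radar_points = np.random.uniform([-20, -2, 1], [20, 2, 80], size=(120, 3))

    sparse_depth = project_radar_to_image(radar_points, K, H, W)
    rgb = torch.rand(3, H, W)                      # placeholder camera image
    seg = torch.randint(0, 10, (1, H, W)).float()  # placeholder segmentation map

    # Channel-wise fusion: 3 (RGB) + 1 (sparse radar depth) + 1 (segmentation).
    fused_input = torch.cat([rgb, torch.from_numpy(sparse_depth)[None], seg], dim=0)
    print(fused_input.shape)                       # torch.Size([5, 900, 1600])
```

The fused tensor would then be fed to a transformer-based depth network; channel concatenation is only one plausible fusion strategy, chosen here for brevity.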
Original language | English
---|---
Pages (from-to) | 28442-28453
Number of pages | 12
Journal | IEEE Sensors Journal
Volume | 23
Issue number | 22
DOIs |
State | Published - 15 Nov 2023
Keywords
- Autonomous driving
- computer vision
- depth prediction
- intelligent vehicles
- semantic segmentation
- sensor fusion