Abstract
Automated driving involves complex perception tasks that require a precise understanding of diverse traffic scenarios and confident navigation. Traditional data-driven algorithms trained on closed-set data often fail to generalize to out-of-distribution (OOD) and edge cases. Recently, Large Vision Language Models (LVLMs) have shown potential in integrating the reasoning capabilities of language models to understand and reason about complex driving scenes, aiding generalization to OOD scenarios. However, grounding such OOD objects remains challenging. In this work, we propose zPROD, an automated framework for zero-shot promptable open-vocabulary OOD object detection, segmentation, and grounding in autonomous driving. We leverage LVLMs with visual grounding capabilities, eliminating the need for lengthy text communication and providing precise indications of OOD objects in the scene or on the track of the egocentric vehicle. We evaluate our approach on OOD datasets from existing road anomaly segmentation benchmarks such as SMIYC and Fishyscapes. Our zero-shot approach shows superior performance on RoadAnomaly and RoadObstacle and comparable results on the Fishyscapes subset compared to supervised models, and acts as a baseline for future zero-shot methods based on open-vocabulary OOD detection.
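The high-level pipeline the abstract describes, labeling objects with an open-vocabulary model and flagging those outside the known class set, particularly when they lie on the ego vehicle's track, can be sketched in simplified form. The class names, box format, and corridor heuristic below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical closed-set driving classes; anything a grounded VLM labels
# outside this set is treated as out-of-distribution (OOD).
KNOWN_CLASSES = {"car", "truck", "bus", "person", "bicycle", "traffic sign"}

@dataclass
class Detection:
    label: str   # open-vocabulary label returned by the grounding model
    box: tuple   # (x0, y0, x1, y1) in pixels

def is_ood(det: Detection) -> bool:
    """An object is OOD if its label falls outside the closed training set."""
    return det.label.lower() not in KNOWN_CLASSES

def on_ego_track(det: Detection, img_w: int, corridor_frac: float = 0.4) -> bool:
    """Illustrative heuristic: treat the central fraction of the image width
    as the ego corridor; an object is 'on track' if its box overlaps it."""
    left = img_w * (1 - corridor_frac) / 2
    right = img_w * (1 + corridor_frac) / 2
    x0, _, x1, _ = det.box
    return x1 > left and x0 < right

def flag_hazards(dets, img_w):
    """Return OOD detections that intersect the ego corridor."""
    return [d for d in dets if is_ood(d) and on_ego_track(d, img_w)]
```

For example, given a known `car` at the image edge and an unknown `sofa` near the center, only the sofa is flagged as an on-track OOD hazard.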
| Original language | English |
|---|---|
| Pages (from-to) | 230-238 |
| Number of pages | 9 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 265 |
| State | Published - 2025 |
| Event | 6th Northern Lights Deep Learning Conference, NLDL 2025 - Tromsø, Norway, 7 Jan 2025 → 9 Jan 2025 |
Fingerprint
Dive into the research topics of 'Zero-Shot Open-Vocabulary OOD Object Detection and Grounding using Vision Language Models'.