Zero-Shot Open-Vocabulary OOD Object Detection and Grounding using Vision Language Models

Research output: Contribution to journal › Conference article › peer-review


Abstract

Automated driving involves complex perception tasks that require a precise understanding of diverse traffic scenarios and confident navigation. Traditional data-driven algorithms trained on closed-set data often fail to generalize to out-of-distribution (OOD) inputs and edge cases. Recently, Large Vision Language Models (LVLMs) have shown potential in integrating the reasoning capabilities of language models to understand and reason about complex driving scenes, aiding generalization to OOD scenarios. However, grounding such OOD objects remains a challenging task. In this work, we propose zPROD, an automated framework for zero-shot, promptable, open-vocabulary OOD object detection, segmentation, and grounding in autonomous driving. We leverage LVLMs with visual grounding capabilities, eliminating the need for lengthy text communication and providing precise indications of OOD objects in the scene or on the track of the egocentric vehicle. We evaluate our approach on OOD datasets from existing road anomaly segmentation benchmarks such as SMIYC and Fishyscapes. Our zero-shot approach shows superior performance on RoadAnomaly and RoadObstacle and comparable results on the Fishyscapes subset relative to supervised models, and acts as a baseline for future zero-shot methods based on open-vocabulary OOD detection.

Original language: English
Pages (from-to): 230-238
Number of pages: 9
Journal: Proceedings of Machine Learning Research
Volume: 265
State: Published - 2025
Event: 6th Northern Lights Deep Learning Conference, NLDL 2025 - Tromsø, Norway
Duration: 7 Jan 2025 – 9 Jan 2025

