Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols

Iqra Qasim, Alexander Horsch, Dilip Prasad

Research output: Contribution to journal › Article › peer-review

Abstract

Untrimmed videos contain interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics that are worth highlighting when describing a video in natural language. Owing to such diversity, a single sentence can correctly describe only a portion of the video. Dense Video Captioning (DVC) aims to detect and describe the different events in a given video. The term DVC originated in the 2017 ActivityNet challenge, after which considerable effort has been devoted to the task. DVC is divided into three sub-tasks: (1) Video Feature Extraction, (2) Temporal Event Localization, and (3) Dense Caption Generation. In this survey, we discuss the studies that claim to perform DVC along with its sub-tasks and summarize their results. We also discuss the datasets that have been used for DVC. Finally, we highlight current challenges in the field, along with concluding remarks and future trends.
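To make the three-stage breakdown concrete, the sketch below wires the sub-tasks together as a single pipeline. It is a minimal, hypothetical illustration only: the function names, the mean-feature stand-in for a pretrained video encoder, the thresholding heuristic for localization, and the template captions are all assumptions for exposition, not methods taken from the surveyed papers.

```python
# Hypothetical sketch of the three-stage DVC pipeline described in the abstract.
# All names and the toy logic are illustrative assumptions, not a surveyed method.
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    start: float        # event start time in seconds
    end: float          # event end time in seconds
    caption: str = ""   # filled in by the captioning stage


def extract_features(video_frames: List[list]) -> List[list]:
    """Stage 1: Video Feature Extraction.
    A real system would use a pretrained 2D/3D CNN or video transformer;
    here each frame is reduced to its mean value as a stand-in feature."""
    return [[sum(frame) / len(frame)] for frame in video_frames]


def localize_events(features: List[list], fps: float = 1.0) -> List[Event]:
    """Stage 2: Temporal Event Localization.
    Toy heuristic: group consecutive frames whose feature exceeds a threshold."""
    events, start = [], None
    for i, feat in enumerate(features + [[0.0]]):  # sentinel closes the last event
        active = feat[0] > 0.5
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append(Event(start / fps, i / fps))
            start = None
    return events


def generate_captions(events: List[Event]) -> List[Event]:
    """Stage 3: Dense Caption Generation.
    A real model would condition a language decoder on event-level features;
    here a template sentence is attached to each localized event."""
    for k, ev in enumerate(events):
        ev.caption = f"Event {k + 1} occurs from {ev.start:.1f}s to {ev.end:.1f}s."
    return events


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.9], [0.1, 0.0], [0.8, 0.6]]
    for ev in generate_captions(localize_events(extract_features(frames))):
        print(ev.caption)
```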

Original language: English
Article number: 154
Journal: ACM Computing Surveys
Volume: 57
Issue number: 6
DOIs
State: Published - 10 Feb 2025
Externally published: Yes

Keywords

  • ActivityNet challenge
  • artificial intelligence
  • deep learning
  • Dense video captioning models
  • event localization
  • video feature extraction
