TY - JOUR
T1 - Dense Video Captioning
T2 - A Survey of Techniques, Datasets and Evaluation Protocols
AU - Qasim, Iqra
AU - Horsch, Alexander
AU - Prasad, Dilip
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/2/10
Y1 - 2025/2/10
N2 - Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics that are worth highlighting while describing a video in natural language. Owing to such a vast diversity, a single sentence can only correctly describe a portion of the video. Dense Video Captioning (DVC) aims to detect and describe different events in a given video. The term DVC originated in the 2017 ActivityNet challenge, after which considerable effort has been made to address the challenge. DVC is divided into three sub-tasks: (1) Video Feature Extraction, (2) Temporal Event Localization, and (3) Dense Caption Generation. In this survey, we discuss all of the studies that claim to perform DVC along with its sub-tasks and summarize their results. We also discuss all of the datasets that have been used for DVC. Last, current challenges in the field are highlighted along with observatory remarks and future trends in the field.
KW - ActivityNet challenge
KW - artificial intelligence
KW - deep learning
KW - Dense video captioning models
KW - event localization
KW - video feature extraction
UR - http://www.scopus.com/inward/record.url?scp=85219754759&partnerID=8YFLogxK
U2 - 10.1145/3712059
DO - 10.1145/3712059
M3 - Article
AN - SCOPUS:85219754759
SN - 0360-0300
VL - 57
JO - ACM Computing Surveys
JF - ACM Computing Surveys
IS - 6
M1 - 154
ER -