Refining Action Segmentation with Hierarchical Video Representations

Hyemin Ahn, Dongheui Lee

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

34 Scopus citations


In this paper, we propose Hierarchical Action Segmentation Refiner (HASR), which can refine temporal action segmentation results from various models by understanding the overall context of a given video in a hierarchical way. When a backbone model for action segmentation estimates how the given video can be segmented, our model extracts segment-level representations based on frame-level features, and extracts a video-level representation based on the segment-level representations. Based on these hierarchical representations, our model can refer to the overall context of the entire video, and predict how the segment labels that are out of context should be corrected. Our HASR can be plugged into various action segmentation models (MS-TCN, SSTDA, ASRF), and improve the performance of state-of-the-art models based on three challenging datasets (GTEA, 50Salads, and Breakfast). For example, in 50Sal-ads dataset, the segmental edit score improves from 67.9% to 77.4% (MS-TCN), from 75.8% to 77.3% (SSTDA), from 79.3% to 81.0% (ASRF). In addition, our model can refine the segmentation result from the unseen backbone model, which was not referred to when training HASR. This generalization performance would make HASR be an effective tool for boosting up the existing approaches for temporal action segmentation. Our code is available at

Original languageEnglish
Title of host publicationProceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages9
ISBN (Electronic)9781665428125
StatePublished - 2021
Externally publishedYes
Event18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: 11 Oct 202117 Oct 2021

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499


Conference18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
CityVirtual, Online


Dive into the research topics of 'Refining Action Segmentation with Hierarchical Video Representations'. Together they form a unique fingerprint.

Cite this