Abstract
Temporal action segmentation is an essential task for understanding complex human activity sequences and identifying long-term dependencies between human actions, which is crucial for effective non-verbal human-robot collaboration and for robotic assistance that must infer the underlying human intentions. To minimize the labeling effort, we focus on a timestamp-supervised setting, where only one label per action segment is provided at a randomly selected time instant. This significantly reduces the annotation effort and improves scalability to larger real-world datasets. We propose a contrastive learning-based approach that enforces similarity between video snippet features of the same action and contrasts features of differing actions. Our boundary estimation algorithm determines the positive and negative sets with respect to the ground-truth timestamp labels. Additionally, our proposed loss function penalizes predictions that belong to neither of the action labels of the enclosing timestamps. The evaluation of our approach on four public datasets shows significant improvements over the state of the art in varying environments and yields competitive results compared to models trained in a fully supervised manner. Our approach is further applicable to semi-automatic annotation.
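As a rough illustration of the contrastive idea described in the abstract, the sketch below assumes an InfoNCE-style formulation over per-snippet features, where `pos_idx` and `neg_idx` stand in for the sets that the boundary estimation step would produce around an annotated timestamp. The function name, the temperature parameter, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(features, anchor_idx, pos_idx, neg_idx, temperature=0.1):
    """InfoNCE-style loss over per-snippet features (illustrative only).

    features:   (T, D) snippet embeddings from the segmentation backbone
    anchor_idx: index of the annotated timestamp snippet
    pos_idx:    snippet indices assumed to share the anchor's action
                (e.g. taken from an estimated boundary around the timestamp)
    neg_idx:    snippet indices assumed to belong to other actions
    """
    feats = F.normalize(features, dim=1)              # work in cosine-similarity space
    anchor = feats[anchor_idx]                        # (D,)
    pos_sim = feats[pos_idx] @ anchor / temperature   # (P,) similarities to positives
    neg_sim = feats[neg_idx] @ anchor / temperature   # (N,) similarities to negatives
    # Each positive is contrasted against all negatives; the "correct class"
    # for the cross-entropy is always column 0 (the positive itself).
    logits = torch.cat(
        [pos_sim.unsqueeze(1), neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)],
        dim=1,
    )
    targets = torch.zeros(pos_sim.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage: 200 snippets with 64-dim features, annotated timestamp at snippet 50.
features = torch.randn(200, 64)
loss = snippet_contrastive_loss(features, anchor_idx=50,
                                pos_idx=torch.arange(40, 60),
                                neg_idx=torch.arange(100, 140))
```

In this reading, pulling the positives toward the anchor and pushing the negatives away encourages snippets of the same action to cluster in feature space, which is the behavior the abstract describes.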
Original language | English |
---|---|
Pages (from - to) | 1-8 |
Number of pages | 8 |
Journal | IEEE Robotics and Automation Letters |
DOIs | |
Publication status | Accepted/In press - 2024 |