Sener Merzbach, Fadime: Modelling Complex Activities from Visual and Textual Data. - Bonn, 2021. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-63266
@phdthesis{handle:20.500.11811/9235,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-63266},
author = {Sener Merzbach, Fadime},
title = {Modelling Complex Activities from Visual and Textual Data},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2021,
month = jul,

note = {Complex activity videos are long-range videos composed of multiple sub-activities that follow a temporal structure and serve a connected purpose. Recognizing human activities in such videos is a long-standing goal with a broad spectrum of applications, such as assistive technologies, human-robot interaction, and security systems. Although extensive efforts have been made to recognize human actions in short, trimmed videos, complex activity videos have received attention only recently. This dissertation provides several models and techniques for understanding human activities in these long-range videos. In particular, we focus on the problems of action anticipation and temporal action segmentation, using both supervised and unsupervised learning approaches.
Motivated by the goal of reducing the high annotation costs of learning models on complex activity videos, we present two approaches. Given a collection of videos, all of the same complex activity, our temporal action segmentation method partitions the videos into sub-activities in an unsupervised way, based on visual data alone, following an iterative discriminative-generative approach. Our action anticipation approach generalizes instructional knowledge from large-scale text corpora and transfers this knowledge to the visual domain using a small-scale annotated video dataset. We further develop models for describing complex activities with natural language, enabling translation between elements of the visual and textual domains, and we present a complex activity dataset of videos aligned with textual descriptions. Finally, we present a generic supervised approach for learning representations from long-range videos that we apply to action anticipation and temporal action segmentation. With our flexible multi-granular temporal aggregation framework for reasoning over short- and long-range observations, we investigate the required temporal extent, the representation granularity, and the influence of semantic abstraction.
This dissertation advances the state of the art in complex activity understanding, challenges the community with new problems, presents novel models that learn visual and temporal relations between human actions, and contributes a dataset for studying the intersection of vision and language. We thoroughly evaluate our approaches and compare them with the respective state of the art on a set of benchmarks. We conclude by outlining future research directions and open issues in complex activity understanding research.},

url = {https://hdl.handle.net/20.500.11811/9235}
}
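
Purely as an illustration of the multi-granular temporal aggregation idea mentioned in the abstract, below is a minimal sketch; it is not the dissertation's code, and the function name, window spans, and feature dimensions are illustrative assumptions. Per-frame features are mean-pooled over several windows of different temporal extents, all ending at the current time step, and concatenated into one multi-scale representation.

    # Hypothetical sketch, not the dissertation's implementation:
    # pool per-frame features over multiple temporal granularities.
    import numpy as np

    def aggregate_multigranular(frame_feats, spans=(10, 30, 90)):
        """frame_feats: (T, D) array of per-frame features observed so far.
        Returns a (len(spans) * D,) vector pooling the last s frames per span."""
        T = frame_feats.shape[0]
        pooled = []
        for s in spans:
            window = frame_feats[max(0, T - s):]  # most recent s frames (or fewer)
            pooled.append(window.mean(axis=0))    # one summary per granularity
        return np.concatenate(pooled)

    # Example: 120 observed frames with 512-dimensional features.
    feats = np.random.rand(120, 512).astype(np.float32)
    rep = aggregate_multigranular(feats)
    print(rep.shape)  # (1536,) = 3 spans x 512 dims

Such a representation could then feed an anticipation or segmentation head; the spans and the mean-pooling operator here are placeholders for whatever extents and aggregation the actual framework uses.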

The following license is associated with this item:

InCopyright