Learning action representation with limited information
이필현, Graduate School, Yonsei University, 2023. Doctoral dissertation (Korea).
With the tremendous growth of video content on the Internet, analyzing human actions in long untrimmed videos has become an essential task. Although remarkable advances in deep learning have enabled strong automatic video analysis models, they come at a cost: deep learning models often require expensive information such as human annotations and rich data from various sources. This effectively hinders their deployment in many real-world systems where the available information is restricted. To tackle this challenge, this dissertation aims to build efficient models that can learn action representations under constrained scenarios where only a limited amount of information is available for model training and inference. Specifically, the main focus lies in temporal action localization (or detection), whose goal is to localize the temporal intervals of action instances in a given video. The main contributions of this dissertation are as follows.

First, we utilize video-level weak supervision for model training to alleviate the notoriously expensive cost of human annotations for temporal action localization. Specifically, we make the first attempt to model background frames given only video-level labels. The key idea is to suppress the activations of background frames for precise action localization by forcing them to be classified into an auxiliary background class. We then delve deeper into background modeling and introduce a novel perspective in which background frames are regarded as out-of-distribution samples.

Second, we explore another type of weak supervision, point-level annotations, where only a single frame of each action instance is annotated. In this setting, we propose a pseudo-label-based approach to learn action completeness from sparse point labels. The resulting model produces more complete and accurate action predictions.
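The auxiliary-background-class idea can be sketched as follows. This is a minimal illustrative example, not the dissertation's actual model: the tensor sizes, the attention mechanism, and the rule for selecting background frames are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, C = 8, 4                                   # 8 frames, 4 action classes (+1 background)
logits = rng.normal(size=(T, C + 1))          # per-frame class logits (toy values)
attention = softmax(rng.normal(size=T), axis=0)  # per-frame foreground weight

# Video-level score: attention-weighted pooling over frames, then softmax;
# only the C action classes are used for the video-level prediction.
video_score = softmax((attention[:, None] * logits).sum(0))[:C]

# Background objective: frames with low attention are pushed toward the
# auxiliary background class (index C) via a cross-entropy-style loss,
# suppressing their activations during localization.
bg_frames = attention < attention.mean()      # assumed selection rule
frame_probs = softmax(logits, axis=-1)
bg_loss = -np.log(frame_probs[bg_frames, C] + 1e-8).mean()
```

Minimizing `bg_loss` drives the selected frames' probability mass onto the background class, so at inference their action-class activations stay low and action intervals can be localized more precisely.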
Lastly, we identify that the bottleneck of action localization models at inference is the heavy computational cost of the motion modality, i.e., optical flow. To relieve this cost, we design a decomposed cross-modal knowledge distillation pipeline that injects motion knowledge into an RGB-based model. By exploiting multimodal complementarity, the model accurately predicts action intervals at low latency, shedding light on the potential adoption of temporal action localization models in real-world systems. We believe that the action representation learning methods under information constraints proposed in this dissertation will serve as essential tools for real-world action analysis systems and benefit various computer vision applications.
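The cross-modal distillation objective can be sketched with a standard temperature-softened KL divergence between a flow "teacher" and an RGB "student". This is a generic distillation sketch under assumed shapes and temperature, not the dissertation's decomposed pipeline itself.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    The RGB student mimics the flow teacher's soft targets, so motion
    knowledge is available at inference without computing optical flow.
    `tau` is an assumed illustrative temperature.
    """
    p = softmax(teacher_logits / tau)   # soft targets from the flow teacher
    q = softmax(student_logits / tau)   # RGB student predictions
    kl = (p * (np.log(p + 1e-8) - np.log(q + 1e-8))).sum(-1).mean()
    return float(kl) * tau ** 2         # standard temperature rescaling

rng = np.random.default_rng(1)
teacher = rng.normal(size=(5, 4))       # 5 snippets, 4 classes (toy values)
student = teacher + 0.1 * rng.normal(size=(5, 4))
loss = kd_loss(student, teacher)        # small: student is close to teacher
```

At inference, only the RGB student is run, which is where the latency saving comes from: the optical-flow computation and the flow network are dropped entirely.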