RISS Academic Research Information Service

Learning action representation with limited information

이필현, Graduate School, Yonsei University, 2023, domestic doctoral dissertation

With the tremendous growth of video content on the Internet, analyzing human actions in long untrimmed videos has become an essential task. Although remarkable advances in deep learning have made it possible to build strong automatic video analysis models, they come at a cost: deep learning models often require expensive information such as human annotations and rich data from various sources. This hinders the deployment of such models in many real-world systems where the available information is restricted. To tackle this challenge, this dissertation aims to build efficient models that can learn action representations under constrained scenarios where only a limited amount of information is available for model training and inference. Specifically, the main focus lies in temporal action localization (or detection), whose goal is to localize the temporal intervals of action instances in a given video. The main contributions of this dissertation are as follows.

First, we utilize video-level weak supervision for model training to alleviate the notoriously expensive cost of human annotations for temporal action localization. Specifically, we make the first attempt to model background frames given video-level labels. The key idea is to suppress the activations of background frames for precise action localization by forcing them to be classified into an auxiliary background class. We then delve deeper into background modeling and introduce a novel perspective in which background frames are treated as out-of-distribution samples.

Second, we explore another type of weak supervision, point-level annotations, where only a single frame of each action instance is annotated. In this setting, we propose a pseudo-label-based approach that learns action completeness from sparse point labels, and the resulting model produces more complete and accurate action predictions.

Lastly, we identify the heavy computational cost of the motion modality, i.e., optical flow, as the inference-time bottleneck of action localization models. To reduce this cost, we design a decomposed cross-modal knowledge distillation pipeline that injects motion knowledge into an RGB-based model. By exploiting multimodal complementarity, the model can accurately predict action intervals at low latency, shedding light on the potential adoption of temporal action localization models in real-world systems.

We believe that the action representation learning methods under information constraints proposed in this dissertation will serve as an essential tool for real-world action analysis systems and potentially benefit various computer vision applications.
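
To make the auxiliary-background-class idea from the first contribution concrete, below is a minimal sketch of video-level weakly supervised training with an added background class. It assumes a PyTorch setup, and the module structure, the top-k temporal pooling, and every name and hyperparameter are illustrative rather than taken from the dissertation.

    # Minimal sketch: video-level weak supervision with an auxiliary background class.
    # Assumes PyTorch; all names and hyperparameters are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeaklySupervisedLocalizer(nn.Module):
        def __init__(self, feat_dim=2048, num_actions=20, topk_ratio=0.125):
            super().__init__()
            # Frame-level classifier over C action classes plus 1 background class.
            self.classifier = nn.Conv1d(feat_dim, num_actions + 1, kernel_size=1)
            self.topk_ratio = topk_ratio

        def forward(self, feats):  # feats: (B, T, D) snippet features
            logits = self.classifier(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, C+1)
            k = max(1, int(logits.shape[1] * self.topk_ratio))
            # Temporal top-k pooling turns frame scores into a video-level prediction.
            video_logits = logits.topk(k, dim=1).values.mean(dim=1)  # (B, C+1)
            return logits, video_logits

    def video_level_loss(video_logits, action_labels):
        # action_labels: (B, C) multi-hot video-level labels.
        # Every untrimmed video contains background frames, so the background
        # class (last index) is always marked positive; frames without action
        # evidence are pushed toward it, suppressing their action activations.
        bg = torch.ones(action_labels.shape[0], 1, device=action_labels.device)
        targets = torch.cat([action_labels, bg], dim=1)
        targets = targets / targets.sum(dim=1, keepdim=True)
        return -(targets * F.log_softmax(video_logits, dim=1)).sum(dim=1).mean()

In a typical pipeline of this kind, the per-frame action scores are thresholded and grouped into temporal intervals at inference, with the background column serving as a suppression signal.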

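For the point-level setting of the second contribution, one simple way to picture learning action completeness from sparse points is to grow dense pseudo-labels outward from each annotated frame and use them as extra supervision. The threshold-based growing rule below is an illustrative assumption, not the dissertation's exact algorithm.

    # Minimal sketch: dense pseudo-labels grown from point-level annotations.
    # The expansion rule is illustrative, not the dissertation's exact method.
    import torch

    def expand_point_to_interval(frame_scores, point_idx, cls, threshold=0.5):
        # Grow a pseudo action interval around a single annotated frame by
        # extending left and right while the class score stays above a threshold.
        T = frame_scores.shape[0]
        start = end = point_idx
        while start > 0 and frame_scores[start - 1, cls] >= threshold:
            start -= 1
        while end < T - 1 and frame_scores[end + 1, cls] >= threshold:
            end += 1
        return start, end

    def build_pseudo_labels(frame_scores, points):
        # points: list of (frame_index, class_index) single-frame annotations.
        # Returns a (T, C) tensor of dense pseudo-labels for extra supervision.
        T, C = frame_scores.shape
        pseudo = torch.zeros(T, C)
        for idx, cls in points:
            s, e = expand_point_to_interval(frame_scores, idx, cls)
            pseudo[s:e + 1, cls] = 1.0
        return pseudo

The intuition is that training the frame-level classifier against such pseudo-intervals, rather than against the isolated points alone, encourages predictions that cover action instances more completely.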
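
Finally, the cross-modal knowledge distillation of the third contribution transfers motion knowledge from an optical-flow teacher into an RGB-only student, so that flow never needs to be computed at test time. The sketch below shows a generic training-time distillation loss, again assuming PyTorch; the combination of a softened KL term with a feature-matching hint is only an illustrative stand-in for the decomposed pipeline described in the abstract.

    # Minimal sketch: cross-modal knowledge distillation from a flow-based teacher
    # into an RGB-only student. The loss form and weighting are illustrative.
    import torch
    import torch.nn.functional as F

    def cross_modal_distillation_loss(student_rgb_logits, teacher_flow_logits,
                                      student_feats, teacher_feats,
                                      tau=2.0, alpha=1.0):
        # Soft-target distillation on frame-level class predictions.
        kd = F.kl_div(
            F.log_softmax(student_rgb_logits / tau, dim=-1),
            F.softmax(teacher_flow_logits.detach() / tau, dim=-1),
            reduction="batchmean",
        ) * (tau * tau)
        # Feature-level hint so the RGB stream absorbs motion cues from the teacher.
        hint = F.mse_loss(student_feats, teacher_feats.detach())
        return kd + alpha * hint

Training consumes paired RGB and optical-flow inputs, but inference runs the RGB student alone, removing the optical-flow computation that the abstract identifies as the latency bottleneck.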