Temporal moment localization (TML) aims to retrieve the moment in a video that best matches a given sentence query. This task is challenging because it requires understanding both the semantic meaning of the video and the sentence and the relationship between them. TML methods using 2D temporal maps, which represent proposal features or scores for all moment proposals with start and end times as the two axes, have improved performance by modeling moment proposals in relation to one another. These methods, however, are limited by the coarsely pre-defined, fixed boundaries of target moments, which depend on the length of the training videos and the amount of available memory. To overcome this limitation, we propose a boundary matching and refinement network (BMRN) that generates 2D boundary matching and refinement maps along with a proposal feature map to obtain the final proposal score map. BMRN adjusts the fixed boundaries of moment proposals using center and length offsets predicted from the boundary refinement maps. In addition, we introduce length-aware proposal-interactive feature map extraction, which combines a cross-modal feature map with a similarity map between the predicted duration of the target moment and each moment proposal, and then obtains the final proposal feature map through two-stream proposal interaction, applying two-dimensional convolution and transformer layers to the combined feature map. We further improve BMRN with a cross-modal contrastive learning approach for TML, yielding BMRN-CCL. BMRN and BMRN-CCL outperform state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets by a large margin. Through comprehensive ablation studies, we also show the effectiveness of the component losses and of the modules for cross-modal interaction, proposal interaction, boundary matching and refinement, and cross-modal contrastive learning.
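The boundary refinement idea above can be illustrated with a minimal sketch. The code below is a hypothetical implementation, not the paper's actual method: it assumes proposal boundaries are normalized to [0, 1], that the center offset is predicted relative to the proposal length, and that the length offset acts as a log-scale factor. The function name and parameterization are illustrative only.

```python
import numpy as np

def refine_proposals(starts, ends, d_center, d_length):
    """Refine fixed proposal boundaries with predicted offsets (a sketch).

    starts, ends:   normalized boundaries of each moment proposal in [0, 1]
    d_center:       predicted center shift, relative to proposal length
    d_length:       predicted log-scale length adjustment
    """
    centers = (starts + ends) / 2.0
    lengths = ends - starts
    # Shift the center proportionally to the proposal's own length,
    # and rescale the length by an exponentiated offset.
    new_centers = centers + d_center * lengths
    new_lengths = lengths * np.exp(d_length)
    new_starts = np.clip(new_centers - new_lengths / 2.0, 0.0, 1.0)
    new_ends = np.clip(new_centers + new_lengths / 2.0, 0.0, 1.0)
    return new_starts, new_ends
```

For example, a proposal spanning [0.2, 0.4] with a predicted center offset of 0.5 (half its length) and no length change would be refined to [0.3, 0.5]. In the full model, one such offset pair would be predicted per cell of the 2D refinement maps.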
Key words: Temporal moment localization, Video understanding, Multi-modal learning, 2D-map proposal refinement, Cross-modal contrastive learning