IDTrack: Independent Sequence Prediction in the Decoder for Visual Object Tracking

Multilingual Abstract

Visual Object Tracking (VOT) is a key task in robotics, autonomous driving, and video surveillance: given an object's position in the initial frame of a video, the goal is to estimate its position in every subsequent frame. Tracking is made difficult by scale variation, distractors, object deformation, and occlusion, and many models have been proposed that address these challenges through better image feature learning. In this paper, we propose a novel model that takes the past coordinates of the search image as an additional input, so that image features and coordinate features are learned jointly. The model uses a Vision Transformer (ViT) as its encoder. Depending on the format of the coordinate input, the decoder can be configured as one of three models: IDTrack, SDTrack, or Baseline. In comparative experiments, IDTrack performs best, followed by SDTrack and then Baseline. An ablation study on IDTrack finds that the best configuration uses four positional embeddings in the decoder, the x1y1x2y2 prediction format, three past coordinates, Random Horizontal Flip augmentation, and a learning rate decay at epoch 50. Compared with the Baseline model, IDTrack improves AO (average overlap) by 5.88% and SR (success rate) by 6.70%.
Thus, by combining the temporal features of past-frame coordinates with the spatial features of images for relation modeling, the proposed model offers a new tracking framework and achieves performance comparable to state-of-the-art (SOTA) models.
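
The abstract reports results as AO (average overlap) and SR (success rate) on boxes in the x1y1x2y2 format. As a rough illustration of what these two numbers measure, the sketch below computes them from predicted and ground-truth boxes. It assumes the common definitions: AO is the mean intersection-over-union (IoU) across frames, and SR is the fraction of frames whose IoU exceeds a threshold. The abstract does not state the threshold or the benchmark protocol, so the default of 0.5 and all function names here are assumptions made for illustration.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes in x1y1x2y2 format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over all frames (assumed definition of AO)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equally long")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def success_rate(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of frames with IoU above `threshold` (assumed definition of SR)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equally long")
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(preds)
```

For example, with one perfectly tracked frame and one frame where the prediction misses the target entirely, both `average_overlap` and `success_rate` return 0.5.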

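Korean Abstract (Abstract)

Visual Object Tracking (VOT) is an important task studied in fields such as robotics, autonomous vehicles, and surveillance cameras. Its goal is to estimate the position of an object in subsequent frames, given its position in the initial frame of a video. Object tracking faces several difficulties, including scale variation, distraction, object deformation, and occlusion, and a number of models have been proposed to address them through better image feature learning. This paper proposes a new model that introduces the past coordinates of the search image in order to jointly learn coordinate features as well as image features. The proposed model uses a ViT (Vision Transformer) architecture as its encoder, and its decoder can be configured as one of three models (IDTrack, SDTrack, Baseline) depending on the form of the coordinate input. In comparative experiments, the IDTrack model showed the highest performance, followed by the SDTrack and Baseline models. An ablation study on the IDTrack model further showed that the best performance was obtained with four decoder positional embeddings, a prediction format of x1y1x2y2, three past coordinates, Random Horizontal Flip applied, and learning rate decay at epoch 50, improving on the Baseline model by 5.88% in AO and 6.70% in SR. The model in this paper thus proposes a new framework that implements relation modeling by using the temporal features of coordinate information from past frames together with the spatial features of images, and it achieves performance comparable to SOTA (State-of-the-Art) models.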
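
The ablation result combines Random Horizontal Flip with three past coordinates, which implies that the box coordinates have to be mirrored together with the image. The abstract does not describe how this is done, so the following is only a minimal sketch under stated assumptions: boxes are in x1y1x2y2 pixel coordinates relative to the search image, `image_width` is that image's width, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def hflip_box(box: Sequence[float], image_width: float) -> Box:
    """Mirror one x1y1x2y2 box about the vertical centre line of the image."""
    x1, y1, x2, y2 = box
    # The old right edge becomes the new left edge, so x1 <= x2 still holds.
    return (image_width - x2, y1, image_width - x1, y2)


def hflip_sequence(boxes: Sequence[Sequence[float]],
                   image_width: float) -> List[Box]:
    """Apply the same flip to every box in a coordinate sequence.

    Assumption: when the search image is flipped, the past coordinates and
    the target box must be flipped too, or they no longer match the image.
    """
    return [hflip_box(b, image_width) for b in boxes]
```

Flipping the same sequence twice returns the original coordinates, which is a convenient sanity check for this kind of augmentation.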

Table of Contents