Visual Object Tracking (VOT) is a crucial task in fields such as robotics, autonomous driving, and video surveillance. Given the position of an object in the initial frame of a video, the goal is to estimate its position in subsequent frames. Object tracking faces several challenges, including scale variation, distraction, object deformation, and occlusion. To address these challenges, various models for improved image feature learning have been proposed. In this paper, we propose a novel model that takes the object's coordinates from past frames together with the search image and jointly learns image features and coordinate features. The proposed model uses the Vision Transformer (ViT) architecture as its encoder. Depending on the format of the coordinate input, the decoder is configured as one of three models: IDTrack, SDTrack, or Baseline. Comparative experiments show that IDTrack performs best, followed by SDTrack and then Baseline. An ablation study on IDTrack shows that the optimal configuration uses four positional embeddings for the decoder, a prediction format of x1y1x2y2, three past coordinates, Random Horizontal Flip augmentation, and a learning rate decay at epoch 50. Compared to the Baseline model, IDTrack achieves a 5.88% improvement in AO (average overlap) and a 6.70% improvement in SR (success rate).
Thus, the proposed model leverages the temporal features of past frame coordinates and the spatial features of images to implement relational modeling, presenting a new framework that achieves performance comparable to state-of-the-art (SOTA) models.
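For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AO and SR are commonly computed from boxes in the x1y1x2y2 format. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names are hypothetical, the SR threshold of 0.5 is an assumed default (the abstract does not state the threshold), and the averaging here is over frames of a single sequence, whereas benchmark protocols may average differently.

```python
from typing import Sequence, Tuple

# A box in x1y1x2y2 format: top-left corner (x1, y1), bottom-right corner (x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two x1y1x2y2 boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_rate(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold (0.5 is assumed)."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


if __name__ == "__main__":
    # Toy data: one perfect prediction and one shifted by half the box width.
    preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    gts = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]
    print(average_overlap(preds, gts))
    print(success_rate(preds, gts))
```

On the toy data, the first frame has an IoU of 1 and the second an IoU of 1/3, so the two metrics diverge: AO rewards partial overlap, while SR only counts frames that clear the threshold.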