Temporal moment localization (TML) aims to retrieve the moment in a video that best matches a given sentence query. This task is challenging because it requires understanding both the semantic meaning of the video and the sentence and the relationship between them. TML methods using 2D temporal maps, which represent proposal features or scores for all moment proposals with start and end times as the two axes, have improved performance by modeling moment proposals in relation to one another. These methods, however, are limited by the coarsely pre-defined, fixed boundaries of target moments, which depend on the length of the training videos and the amount of available memory. To overcome this limitation, we propose a boundary matching and refinement network (BMRN) that generates 2D boundary matching and refinement maps along with a proposal feature map to obtain the final proposal score map. BMRN adjusts the fixed boundaries of moment proposals using center and length offsets predicted from the boundary refinement maps. In addition, we introduce length-aware proposal-interactive feature map extraction, which combines a cross-modal feature map with a similarity map between the predicted duration of the target moment and each moment proposal, and then obtains the final proposal feature map through two-stream proposal interaction, applying two-dimensional convolution and transformer layers to the combined feature map. We further improve BMRN with a cross-modal contrastive learning approach for TML, yielding BMRN-CCL. BMRN and BMRN-CCL outperform state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets by a large margin. Through comprehensive ablation studies, we also show the effectiveness of the component losses and of the modules for cross-modal interaction, proposal interaction, boundary matching and refinement, and cross-modal contrastive learning.
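The boundary refinement idea above can be illustrated with a minimal sketch. The code below is a hypothetical implementation, not the paper's actual method: it assumes proposal boundaries are normalized to [0, 1], that the center offset is predicted relative to the proposal length, and that the length offset acts as a log-scale factor. The function name and parameterization are illustrative only.

```python
import numpy as np

def refine_proposals(starts, ends, d_center, d_length):
    """Refine fixed proposal boundaries with predicted offsets (a sketch).

    starts, ends:   normalized boundaries of each moment proposal in [0, 1]
    d_center:       predicted center shift, relative to proposal length
    d_length:       predicted log-scale length adjustment
    """
    centers = (starts + ends) / 2.0
    lengths = ends - starts
    # Shift the center proportionally to the proposal's own length,
    # and rescale the length by an exponentiated offset.
    new_centers = centers + d_center * lengths
    new_lengths = lengths * np.exp(d_length)
    new_starts = np.clip(new_centers - new_lengths / 2.0, 0.0, 1.0)
    new_ends = np.clip(new_centers + new_lengths / 2.0, 0.0, 1.0)
    return new_starts, new_ends
```

For example, a proposal spanning [0.2, 0.4] with a predicted center offset of 0.5 (half its length) and no length change would be refined to [0.3, 0.5]. In the full model, one such offset pair would be predicted per cell of the 2D refinement maps.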
Key words: Temporal moment localization, Video understanding, Multi-modal learning, 2D-map proposal refinement, Cross-modal contrastive learning