Human action analysis is crucial for identifying abnormal behaviors linked to security threats, unusual events, and suspicious activities in surveillance and public settings. However, video-based abnormal action detection remains challenging, particularly in complex, real-world scenarios. This study proposes a deep learning approach for abnormal human action detection that combines feature extraction from a pre-trained CLIP image encoder with a Transformer-based sequential model, capturing both spatial (visual) and temporal action characteristics across video sequences. Visual features representing the scene and the subject’s appearance are extracted directly from video frames with the CLIP image encoder and fed into an encoder-only Transformer that classifies each action sequence as abnormal or normal. The model was evaluated on the Surveillance Perspective Human Action Recognition (SPHAR) dataset, achieving high classification accuracy and real-time performance. Experimental results demonstrate the effectiveness and robustness of the proposed method in detecting abnormal human actions from a surveillance perspective.
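
The data flow described above (per-frame features, then an encoder-only Transformer, then a binary decision) can be illustrated with a small, dependency-free sketch. It is not the implementation evaluated in this study: random vectors stand in for CLIP image-encoder outputs, the weights are untrained, and the single attention head, single encoder layer, sinusoidal position encoding, mean pooling, and all function names (`encoder_layer`, `classify_clip`, `init_params`) are assumptions chosen for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply a sequence of row vectors x (T x d_in) by a matrix w (d_in x d_out)."""
    return [[sum(a * w[i][j] for i, a in enumerate(row)) for j in range(len(w[0]))]
            for row in x]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((a - mean) ** 2 for a in row) / len(row)
    return [(a - mean) / math.sqrt(var + eps) for a in row]


def add_positions(x):
    """Add sinusoidal position encodings (an assumption) so frame order is visible."""
    d = len(x[0])
    out = []
    for t, row in enumerate(x):
        pe = [math.sin(t / 10000 ** (j / d)) if j % 2 == 0
              else math.cos(t / 10000 ** ((j - 1) / d)) for j in range(d)]
        out.append([a + b for a, b in zip(row, pe)])
    return out


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over the frame sequence."""
    q, k, v = linear(x, wq), linear(x, wk), linear(x, wv)
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def encoder_layer(x, p):
    """One post-norm encoder layer: attention and feed-forward, each with a residual."""
    attn = self_attention(x, p["wq"], p["wk"], p["wv"])
    x = [layer_norm([a + b for a, b in zip(r, ar)]) for r, ar in zip(x, attn)]
    hidden = [[max(0.0, h) for h in row] for row in linear(x, p["w1"])]  # ReLU
    ff = linear(hidden, p["w2"])
    return [layer_norm([a + b for a, b in zip(r, fr)]) for r, fr in zip(x, ff)]


def init_params(d_model, d_ff, rng):
    """Random, untrained weights; a real model would learn these from labelled clips."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
                for _ in range(rows)]
    return {"wq": mat(d_model, d_model), "wk": mat(d_model, d_model),
            "wv": mat(d_model, d_model), "w1": mat(d_model, d_ff),
            "w2": mat(d_ff, d_model),
            "w_out": [rng.gauss(0.0, 0.1) for _ in range(d_model)], "b_out": 0.0}


def classify_clip(frame_features, params):
    """Return a score in (0, 1), read here as the probability of 'abnormal'."""
    x = encoder_layer(add_positions(frame_features), params)
    pooled = [sum(col) / len(x) for col in zip(*x)]  # mean over frames
    logit = sum(a * w for a, w in zip(pooled, params["w_out"])) + params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    rng = random.Random(0)
    params = init_params(d_model=8, d_ff=16, rng=rng)
    # Stand-in for CLIP features: 5 frames, each an 8-dimensional vector.
    clip = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
    print("abnormal score:", classify_clip(clip, params))
```

In practice the stand-in features would be replaced by embeddings from a pre-trained CLIP image encoder, and the encoder would typically be built and trained with a deep learning framework rather than written by hand; the sketch only makes the sequence-to-label structure concrete.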