Abnormal behavior detection in indoor environments has recently become important in fields such as public safety and unmanned store operation. For image-based abnormal behavior detection in particular, it is essential to identify precisely not only the movement of an object but also the locations of individual body parts. Many studies have therefore examined methods that use keypoints, which correspond to the joints of the human body. This calls for a model that can process keypoint time-series data and perform high-level classification and recognition of abnormal behavior.
In this paper, a model that combines a YOLO (You Only Look Once)-based object detection and keypoint prediction model with a Transformer-based classification structure was designed to detect abnormal behavior that may occur in indoor environments. Because the existing Transformer structure is not keypoint-based, it must be adapted to keypoint input data, and this adaptation raises data-processing problems such as occluded keypoints. To address this, techniques from natural language processing were adopted: occluded keypoint information is masked, and a CLS token is used to mark and compose the keypoint sequence.
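The masking and CLS-token idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `build_sequence`, the placeholder values `CLS_VALUE` and `MASK_VALUE`, and the convention that `None` marks an occluded joint are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumption: an occluded joint is represented as None.
Keypoint = Optional[Tuple[float, float]]

CLS_VALUE = (1.0, 1.0)   # hypothetical placeholder standing in for the CLS token
MASK_VALUE = (0.0, 0.0)  # hypothetical placeholder written in place of masked joints


def build_sequence(frames: List[List[Keypoint]]):
    """Flatten per-frame keypoints into one token sequence led by a CLS token.

    Returns (tokens, attend). attend[i] is False for masked tokens, so an
    attention layer could be told to ignore those positions.
    """
    tokens = [CLS_VALUE]  # the CLS token opens the sequence
    attend = [True]
    for frame in frames:
        for kp in frame:
            if kp is None:  # occluded joint: mask it
                tokens.append(MASK_VALUE)
                attend.append(False)
            else:
                tokens.append(kp)
                attend.append(True)
    return tokens, attend


# Two frames with three joints each; one joint is occluded in each frame.
frames = [
    [(0.50, 0.20), None, (0.55, 0.40)],
    [(0.51, 0.21), (0.48, 0.30), None],
]
tokens, attend = build_sequence(frames)
print(len(tokens), attend)
```

In a real model the CLS and mask placeholders would normally be learned embeddings rather than fixed coordinates; fixed tuples are used here only to keep the sketch dependency-free.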
To generate refined keypoints, a generative adversarial network (GAN) was constructed in which a Transformer based on natural language processing techniques serves as the generator and two discriminators are used. One discriminator evaluates the realism of the keypoints, and the other evaluates the consistency between the keypoints and the abnormal behavior event class. This design targets both keypoint refinement and improved classification performance.
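One common way to train a generator against two discriminators is to add both adversarial terms to the task loss as a weighted sum. The sketch below shows that pattern only; the source does not state the exact objective, so the function names, the use of binary cross-entropy, and the weights `w_real` and `w_cons` (with arbitrary defaults) are assumptions.

```python
import math


def bce(prob: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for a single predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_loss(
    cls_loss: float,
    d_real_prob: float,   # realism discriminator's output for refined keypoints
    d_cons_prob: float,   # consistency discriminator's output for (keypoints, class)
    w_real: float = 0.1,  # hypothetical weight, arbitrary default
    w_cons: float = 0.1,  # hypothetical weight, arbitrary default
) -> float:
    """Weighted sum of the classification loss and two adversarial terms.

    The generator is rewarded when both discriminators output values near 1,
    i.e. when the refined keypoints look real and match the event class.
    """
    adv_real = bce(d_real_prob, 1.0)
    adv_cons = bce(d_cons_prob, 1.0)
    return cls_loss + w_real * adv_real + w_cons * adv_cons


# The loss drops as the generator fools both discriminators more often.
print(generator_loss(1.0, 0.1, 0.1) > generator_loss(1.0, 0.9, 0.9))
```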
For model training, the indoor abnormal behavior dataset released on AIHub by the National Information Society Agency (NIA) of Korea was used. It covers eight events that can occur in indoor stores (falls, arson, theft, vandalism, smoking, abandonment, assault, and mobility-impaired persons) and consists of high-resolution CCTV footage with XML-based labeling data. The start and end points of each event were identified from the labeling data, and the footage was split into frames accordingly. Preprocessing then extracted bounding boxes and keypoints, handled occluded keypoints and unidentified values, and converted the annotations to YOLO format through normalization.
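The normalization step can be sketched as follows, assuming the pose label layout used by Ultralytics YOLO (class index, normalized box center and size, then `x y visibility` per keypoint). The function name, the 1920x1080 frame size, the sample coordinates, and the choice to write unidentified keypoints as `0 0 0` are assumptions for illustration, not details taken from the dataset.

```python
from typing import List, Optional, Tuple

PixelKeypoint = Optional[Tuple[float, float, int]]  # (x, y, visibility) or None


def to_yolo_pose_line(
    class_id: int,
    box: Tuple[float, float, float, float],  # (x_min, y_min, x_max, y_max) in pixels
    keypoints: List[PixelKeypoint],
    img_w: int,
    img_h: int,
) -> str:
    """Convert one pixel-space annotation to a normalized YOLO pose label line."""
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box as a normalized center point plus width and height.
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    fields = [str(class_id)] + [f"{v:.6f}" for v in (xc, yc, w, h)]
    for kp in keypoints:
        if kp is None:
            # Assumption: unidentified keypoints become zeros with visibility 0.
            fields += ["0.000000", "0.000000", "0"]
        else:
            x, y, vis = kp
            fields += [f"{x / img_w:.6f}", f"{y / img_h:.6f}", str(vis)]
    return " ".join(fields)


line = to_yolo_pose_line(
    class_id=0,
    box=(960, 270, 1440, 810),
    keypoints=[(1200, 300, 2), None],
    img_w=1920,
    img_h=1080,
)
print(line)
```

Dividing by the frame width and height maps every coordinate into the range 0 to 1, which makes the labels independent of the original video resolution.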
For object detection and keypoint prediction, the n, s, and m versions of YOLOv8 and YOLOv11 were compared, and YOLOv8n-pose was selected. The keypoints it predicts are then fed into the abnormal behavior classification model. Experimental results showed that the structure combining the Transformer and GAN achieved validation accuracy similar to or higher than that of the reference models (LSTM, GRU, and a plain Transformer), with a modest improvement in training accuracy in particular. In addition, a grid search over the weight regularization value of the discriminators showed that setting both discriminators to 0.05 gave a training accuracy of 78.8% and a validation accuracy of 61.5%, the most stable learning convergence and generalization performance among the settings tested.
Quantitative analysis shows that refined keypoints that are not occluded have an average L2 distance of 0.0839 from the ground-truth keypoints, which corresponds to about 54 pixels at the YOLO input size of 640 pixels. For occluded keypoints, the average error was lower, at 0.0447. This suggests that the Transformer and generative adversarial learning make it possible to predict occluded locations. It also suggests that the ground-truth keypoints themselves may be somewhat inaccurate under occlusion, so the refined results may be placed in a structurally more natural way. These results indicate that the approach has potential for keypoint-based abnormal behavior classification problems.
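The conversion from normalized L2 distance to pixels can be checked with a short sketch. It assumes that both axes are normalized by the same 640-pixel input size; the function names and the sample coordinates are illustrative only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_l2(preds: List[Point], targets: List[Point]) -> float:
    """Average Euclidean distance between paired normalized keypoints."""
    dists = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(preds, targets)]
    return sum(dists) / len(dists)


def to_pixels(normalized_distance: float, input_size: int = 640) -> float:
    """Scale a normalized distance to pixels for a square input of input_size."""
    return normalized_distance * input_size


# Reported averages from the text, converted at the 640-pixel input size.
non_occluded_px = to_pixels(0.0839)
occluded_px = to_pixels(0.0447)
print(round(non_occluded_px, 1), round(occluded_px, 1))

# Toy example: per-joint distances of 0.05 and 0.10 average to 0.075.
example = mean_l2([(0.50, 0.50), (0.20, 0.20)], [(0.53, 0.54), (0.20, 0.30)])
print(round(example, 3))
```

Under the same assumption, 0.0839 comes to roughly 54 pixels, matching the figure in the text, and 0.0447 comes to roughly 29 pixels.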