Design and Implementation of a Keypoint Analysis Model for Indoor Abnormal Behavior Detection

Multilingual Abstract

Abnormal behavior detection in indoor environments has recently become important in fields such as public safety and unmanned store operation. For video-based detection in particular, it is essential to capture not only the movement of a person but also the precise locations of individual body parts. Many studies have therefore used keypoints, which correspond to the joints of the human body, and there is a need for a model that can process keypoint time series while performing high-dimensional classification and recognition of abnormal behavior.
This paper designs a model for detecting abnormal behavior in indoor environments that combines a YOLO (You Only Look Once)-based object detection and keypoint prediction model with a Transformer-based classification structure. Because the standard Transformer is not designed for keypoints, it must be adapted to keypoint input data, and this adaptation raises data-handling problems such as occluded keypoints. To address this, techniques from natural language processing were applied: occluded keypoint information is masked, and a CLS token is used to delimit the keypoint sequence.
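The masking step can be illustrated with a minimal sketch. The abstract does not give the actual token layout, so everything here is an assumption: the `build_sequence` helper, the placeholder values `CLS_VALUE` and `MASK_VALUE`, and the choice to place a single CLS slot at the start of the sequence are all hypothetical.

```python
from typing import List, Optional, Tuple

# None marks a joint that was occluded in that frame.
Keypoint = Optional[Tuple[float, float]]

# Hypothetical placeholder values, not taken from the paper.
CLS_VALUE = (-1.0, -1.0)   # stands in for a learned CLS embedding
MASK_VALUE = (0.0, 0.0)    # fill value written over occluded joints


def build_sequence(frames: List[List[Keypoint]]):
    """Flatten per-frame keypoints into one sequence with a leading CLS slot.

    Returns (tokens, ignore_mask). ignore_mask[i] is True where the
    position holds an occluded joint and should be ignored by attention.
    """
    tokens = [CLS_VALUE]
    ignore_mask = [False]  # the CLS slot is always attended to
    for frame in frames:
        for kp in frame:
            if kp is None:
                tokens.append(MASK_VALUE)
                ignore_mask.append(True)
            else:
                tokens.append(kp)
                ignore_mask.append(False)
    return tokens, ignore_mask


# Two frames with three joints each; one joint is occluded in each frame.
frames = [
    [(0.50, 0.20), None, (0.55, 0.40)],
    [(0.51, 0.21), (0.48, 0.30), None],
]
tokens, ignore_mask = build_sequence(frames)
```

A boolean mask of this shape follows the same convention as a padding mask in common Transformer implementations, where `True` marks positions to be ignored.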
To generate refined keypoints, a generative adversarial network (GAN) was built with this NLP-style Transformer as the generator and two discriminators. One discriminator evaluates how realistic the keypoints are, and the other evaluates the consistency between the keypoints and the abnormal behavior event class. The design is intended to both support classification performance and refine the keypoints.
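One way a generator objective could combine feedback from two discriminators is sketched below. This is not the paper's loss: the abstract does not state the loss functions, so the binary cross-entropy terms, the additive combination, and the weights `w_real` and `w_class` are all assumptions made for illustration.

```python
import math


def bce(prob: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one probability, clipped for stability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_loss(d_real_prob: float, d_class_prob: float,
                   w_real: float, w_class: float,
                   task_loss: float = 0.0) -> float:
    """Hypothetical generator objective with two adversarial terms.

    d_real_prob:  realism discriminator's probability that the refined
                  keypoints are real.
    d_class_prob: consistency discriminator's probability that the refined
                  keypoints match the event class.
    The generator is rewarded when both discriminators output values near 1.
    """
    adv_real = bce(d_real_prob, 1.0)
    adv_class = bce(d_class_prob, 1.0)
    return task_loss + w_real * adv_real + w_class * adv_class


# The loss is lower when both discriminators are fooled.
fooled = generator_loss(0.9, 0.9, w_real=1.0, w_class=1.0)
caught = generator_loss(0.1, 0.1, w_real=1.0, w_class=1.0)
```

Keeping the two adversarial terms separate makes it possible to tune the influence of keypoint realism and class consistency independently.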
The model was trained on the indoor abnormal behavior dataset released on AIHub, which is operated by Korea's National Information Society Agency (NIA). The dataset covers eight events that can occur in indoor stores (falling, arson, theft, vandalism, smoking, abandonment, assault, and mobility-impaired persons) and consists of high-resolution CCTV footage with XML-based labeling data. The start and end points of each event were read from the labeling data and the footage was split into frames. Preprocessing then extracted bounding boxes and keypoints, handled occluded keypoints and unidentified values, and normalized the results for conversion to YOLO format.
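The normalization step can be sketched as follows, assuming the common YOLO pose label layout: class, normalized box center and size, then an `x y visibility` triple per keypoint. The `to_yolo_pose_line` helper, the COCO-style visibility flags (0 = not labeled, 1 = labeled but occluded, 2 = visible), and the image size in the example are assumptions, not details taken from the paper.

```python
def to_yolo_pose_line(class_id, box, keypoints, img_w, img_h):
    """Convert pixel-space annotations into one YOLO pose label line.

    box:       (x_min, y_min, x_max, y_max) in pixels.
    keypoints: list of (x, y, visible) in pixels, or None when the joint
               could not be identified in the labeling data.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    fields = [str(class_id)] + [f"{v:.6f}" for v in (cx, cy, w, h)]
    for kp in keypoints:
        if kp is None:
            # Unidentified joint: zero coordinates, visibility flag 0.
            fields += ["0.000000", "0.000000", "0"]
        else:
            x, y, visible = kp
            flag = "2" if visible else "1"  # 1 = labeled but occluded
            fields += [f"{x / img_w:.6f}", f"{y / img_h:.6f}", flag]
    return " ".join(fields)


# Hypothetical 1000x500 frame with one visible joint and one missing joint.
line = to_yolo_pose_line(0, (100, 100, 300, 300),
                         [(250, 125, True), None], img_w=1000, img_h=500)
```

Dividing by the image width and height keeps every coordinate in the range 0 to 1, so the labels stay valid when frames are resized to the detector's input size.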
For object detection and keypoint prediction, the n, s, and m variants of YOLOv8 and v11 were compared, and YOLOv8n-pose was selected. Its predicted keypoints are then fed into the abnormal behavior classification model. In the experiments, the structure combining the Transformer and GAN achieved validation accuracy similar to or higher than the baseline LSTM, GRU, and plain Transformer models, with training accuracy in particular improving somewhat. In addition, a grid search over the discriminators' weight regularization values found that setting both discriminators to 0.05 gave a training accuracy of 78.8% and a validation accuracy of 61.5%, the most stable training convergence and generalization performance.
In the quantitative analysis, the refined keypoints for non-occluded joints had an average L2 distance of 0.0839 from the ground-truth keypoints, which corresponds to about 54 pixels at the YOLO input size of 640 pixels. For occluded keypoints the average error was lower, at 0.0447. This suggests that the Transformer and generative adversarial training can predict occluded locations. It is also possible that the ground-truth keypoints themselves are somewhat inaccurate because of occlusion and that the refined results sit in more structurally natural positions. These results point to the potential of the approach for keypoint-based abnormal behavior classification.
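The pixel figure follows from scaling the normalized distance by the input size. The sketch below reproduces that arithmetic and shows one way a mean L2 distance could be computed. The `mean_l2` and `to_pixels` helpers are hypothetical, and the conversion assumes coordinates normalized against a square 640-pixel input.

```python
import math


def mean_l2(pred, target):
    """Average Euclidean distance between paired (x, y) keypoints."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)


def to_pixels(normalized_distance, input_size=640):
    """Scale a distance in normalized coordinates to pixels."""
    return normalized_distance * input_size


visible_px = to_pixels(0.0839)    # reported error, non-occluded keypoints
occluded_px = to_pixels(0.0447)   # reported error, occluded keypoints
```

By the same conversion, the 0.0447 error reported for occluded keypoints comes to roughly 29 pixels.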

Korean Abstract

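Abnormal behavior detection in indoor environments has recently been growing in importance in fields such as public safety and unmanned store operation. For video-based abnormal behavior detection in particular, nothing matters more than precisely capturing not only the movement of a person but also the locations of individual body parts. Methods that use keypoints, the joints of the human body, are therefore widely studied, and a model is needed that can process the resulting time series data while handling both high-dimensional classification and recognition of abnormal behavior.
This paper designs a model that combines a YOLO (You Only Look Once)-based object detection and keypoint prediction model with a Transformer-based classification structure in order to detect abnormal behavior that can occur in indoor environments. The standard Transformer structure is not keypoint-based, so it must be converted to fit keypoint input data, and data such as occluded keypoints can be difficult to handle. Techniques used in natural language processing were therefore applied to mask occluded keypoint information and to delimit the keypoint sequence with a CLS token.
To generate refined keypoints, a generative adversarial network (GAN) was constructed in which a Transformer using natural language processing techniques serves as the generator, together with two discriminators. One discriminator evaluates the realism of the keypoints, and the other evaluates the agreement between the keypoints and the abnormal behavior event class. In this way both support for classification performance and the keypoint refinement effect are taken into account.
The model was trained on the indoor abnormal behavior dataset released on AIHub, operated by the National Information Society Agency (NIA) of Korea. It covers eight events that can occur in indoor stores (falling, arson, theft, vandalism, smoking, abandonment, assault, and mobility-impaired persons) and consists of high-resolution CCTV footage and XML-based labeling data. The start and end points of each event were identified from the labeling data and the footage was split into frames, after which preprocessing extracted bounding boxes and keypoints, handled occluded keypoints and unidentified values, and normalized the data for conversion to YOLO format.
For object detection and keypoint prediction, the n, s, and m variants of YOLOv8 and v11 were compared experimentally and YOLOv8n-pose was selected. Keypoints are predicted with this model and the results are then passed to the abnormal behavior classification model. In the experiments, the structure combining the Transformer and GAN showed validation accuracy similar to or higher than the baseline LSTM, GRU, and plain Transformer models, and training accuracy in particular improved somewhat. A grid search over the discriminators' weight regularization values further showed that with both discriminators set to 0.05, training accuracy was 78.8% and validation accuracy was 61.5%, giving the most stable training convergence and generalization performance.
In the quantitative analysis, refined keypoints for non-occluded joints showed an average L2 distance of 0.0839 from the ground-truth keypoints, which corresponds to about 54 pixels at the YOLO input size of 640 pixels, while occluded keypoints showed a lower average error of 0.0447. This indicates that occluded locations can be predicted through the Transformer and generative adversarial training. It may also be that the ground-truth keypoints themselves are somewhat inaccurate because of occlusion and that the refined results are positioned in a more structurally natural way. These findings suggest that the approach could be useful for keypoint-based abnormal behavior classification.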

Table of Contents