An Enhanced Multimodal Transformer with Hyper Attention for Real-Time and Robust Facial Emotion Analysis

Multilingual Abstract

Facial expression analysis is an essential component of affective computing, as it allows intelligent systems to understand human emotional reactions from visual cues. Despite the progress achieved through modern deep learning approaches, many existing solutions still suffer performance drops when exposed to occlusions, illumination changes, or subtle and ambiguous facial movements. To address these challenges, this thesis introduces FERONet, a multimodal transformer-based framework designed for reliable and real-time facial expression recognition. The architecture incorporates a hyper-attentive feature extraction strategy that jointly leverages spatial, channel, and cross-region attention to capture detailed local patterns as well as broader structural relationships within the face. Furthermore, a hierarchical transformer equipped with token-reduction stages enhances computational efficiency, while a temporal decoder with cross-attention enables the system to model the progression of expressions in video sequences.
The proposed method combines information from multiple sources (RGB images, motion cues derived from optical flow, and geometric features extracted from depth or facial landmarks), resulting in improved robustness across diverse recording conditions. Extensive evaluations conducted on five widely used benchmarks (FER-2013, RAF-DB, CK+, BU-3DFE, and AFEW) demonstrate that FERONet delivers accuracy competitive with the state of the art, reaching up to 97.3%, while maintaining real-time inference of under 16 milliseconds per frame. These findings highlight the model’s suitability for deployment in practical environments such as driver monitoring systems, healthcare-related emotion assessment, and intelligent learning technologies.
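The abstract mentions a temporal decoder with cross-attention but gives no implementation details. As a rough illustration of the general technique only, the following sketch implements generic scaled dot-product cross-attention in plain Python. It is not FERONet's actual decoder: the function names, the toy frame features, and the single query vector are all hypothetical choices made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of d-dimensional vectors (e.g. decoder states)
    keys, values: one vector per frame (e.g. per-frame encoder features)
    Returns (outputs, weights); weights[i][j] is how strongly
    query i attends to frame j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Weighted average of the frame value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy data: three 2-D frame features and one decoder query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[2.0, 0.0]]
outputs, weights = cross_attention(query, frames, frames)
```

In this toy case the query aligns equally with the first and third frames, so those two receive the same attention weight and the second frame receives less. A real video model would use learned projections for queries, keys, and values, and typically several attention heads.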


Table of Contents

I. Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Main Contributions
1.4 Composition of the Thesis
II. Literature Review
2.1 Deep Learning and CNN-Based Models
2.2 Limitations of CNNs and the Emergence of Transformer-Based FER
2.3 Multimodal and Cross-Domain FER
III. Proposed Method and Model Architecture
3.1 Multimodal Feature Encoder
3.2 Triple Attention Block
3.3 Hierarchical Transformer with Token Merging
3.4 Temporal Decoder with Cross-Attention
IV. Implementation Results and Evaluation
4.1 Robustness Strategies
4.2 Experimental Setup
4.3 Experimental Results
4.4 Comparison with SOTA Models
4.5 Discussions
VI. Conclusion and Future Direction
References
Acknowledgments
