Facial expression analysis is an essential component of affective computing, as it allows intelligent systems to understand human emotional reactions from visual cues. Despite the progress achieved through modern deep learning approaches, many existing solutions still suffer performance drops when exposed to occlusions, illumination changes, or subtle and ambiguous facial movements. To address these challenges, this thesis introduces FERONet, a multimodal transformer-based framework designed for reliable and real-time facial expression recognition. The architecture incorporates a hyper-attentive feature extraction strategy that jointly leverages spatial, channel, and cross-region attention to capture detailed local patterns as well as broader structural relationships within the face. Furthermore, a hierarchical transformer equipped with token-reduction stages enhances computational efficiency, while a temporal decoder with cross-attention enables the system to model the progression of expressions in video sequences.
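To make the idea of a token-reduction stage concrete, the following is a minimal sketch in plain Python. It is not FERONet's actual mechanism, which the abstract does not specify. It assumes, purely for illustration, that each token carries an importance score (for example, derived from attention weights) and that a stage keeps a fixed fraction of the highest-scoring tokens. The names `reduce_tokens`, `stage_token_counts`, and `keep_ratio` are hypothetical.

```python
import math
from typing import List, Sequence


def reduce_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[List[float]]:
    """Keep the highest-scoring tokens, preserving their original order.

    Assumption: `scores` holds one importance value per token. How such
    scores are obtained is outside the scope of this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []

    # Number of tokens that survive this stage (at least one).
    keep = max(1, math.ceil(len(tokens) * keep_ratio))

    # Indices of the `keep` highest-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]

    # Restore positional order so later layers see tokens in sequence.
    return [list(tokens[i]) for i in sorted(top)]


def stage_token_counts(num_tokens: int, keep_ratio: float, stages: int) -> List[int]:
    """Token count entering the first stage and remaining after each stage."""
    counts = [num_tokens]
    for _ in range(stages):
        counts.append(max(1, math.ceil(counts[-1] * keep_ratio)))
    return counts
```

Under these assumptions, halving the token count at each stage shrinks the sequence geometrically, which is where the efficiency gain of a hierarchical design would come from: self-attention cost grows quadratically with the number of tokens, so each halving cuts that cost to roughly a quarter.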
The proposed method combines information from multiple sources (RGB images, motion cues derived from optical flow, and geometric features extracted from depth or facial landmarks), which improves robustness across diverse recording conditions. Extensive evaluations on five widely used benchmarks (FER-2013, RAF-DB, CK+, BU-3DFE, and AFEW) show that FERONet achieves accuracy competitive with the state of the art, reaching up to 97.3%, while maintaining real-time inference at under 16 milliseconds per frame. These findings highlight the model’s suitability for deployment in practical environments such as driver monitoring systems, healthcare-related emotion assessment, and intelligent learning technologies.
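As a quick check on the real-time claim, the short sketch below converts a per-frame latency into a frame rate and compares it against a target. It assumes frames are processed one at a time with no batching or pipelining. The function names and the target frame rates are illustrative choices, not values taken from the thesis.

```python
def frames_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-frame latency, assuming sequential processing."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def meets_budget(latency_ms: float, target_fps: float) -> bool:
    """True if one frame is processed within the time slot of the target frame rate."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return latency_ms <= 1000.0 / target_fps


if __name__ == "__main__":
    latency = 16.0  # upper bound stated in the abstract, in milliseconds
    print(frames_per_second(latency))   # 62.5
    print(meets_budget(latency, 30.0))  # True
    print(meets_budget(latency, 60.0))  # True
```

A latency below 16 ms therefore corresponds to more than 62.5 frames per second, which fits within the per-frame time slot of both 30 fps and 60 fps video under the sequential-processing assumption.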