Demand for accurate mental workload (MW) monitoring and classification has intensified, particularly in high-stakes domains such as aerospace and healthcare. Traditional MW classification methods often rely on hand-crafted features and on single-modality inputs or static fusion techniques, which limit accuracy and fail to exploit cross-sensor complementarity. Recent multimodal fusion methods, such as attention-based weighting, averaging, or majority voting, struggle to assess the relative informativeness of each modality, particularly when sensor reliability varies. To address these limitations, we propose CogniMoE, an end-to-end multimodal framework that learns directly from raw physiological signals. It introduces three key innovations: (i) on-the-fly scalogram generation using FP16 arithmetic, which eliminates pre-computation and reduces memory and processing overhead; (ii) parallel CNN-LSTM branches, one per modality, that incorporate attention mechanisms and dynamic dropout to extract robust spatiotemporal features; and (iii) a Mixture of Experts (MoE) gating network that adaptively fuses modalities based on their real-time informativeness, maintaining performance even when a modality degrades. Trained in a subject-independent manner on diverse participants, CogniMoE demonstrates strong generalizability and scalability. Evaluations on the MAUS and CLAS datasets show that it outperforms both traditional and recent state-of-the-art approaches, achieving accuracies of 94% and 92%, respectively. Moreover, on-the-fly scalogram generation reduces memory usage and processing time by an order of magnitude, providing a lightweight and efficient solution. The MoE gating mechanism further improves classification performance by approximately 5% on average over non-adaptive fusion strategies by dynamically adjusting modality importance to individual participant characteristics.
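To make the adaptive fusion idea concrete, the following is a minimal sketch of gated fusion over per-modality embeddings. It is not the CogniMoE implementation. It assumes a single linear gate followed by a softmax, and the function names (`softmax`, `gate_scores`, `moe_fuse`), the modality names (`ecg`, `gsr`), and all numeric values are hypothetical placeholders. In a trained model the gate parameters would be learned and the embeddings would come from the per-modality CNN-LSTM branches.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scores(
    embeddings: Dict[str, List[float]],
    gate_weights: List[List[float]],
    gate_bias: List[float],
) -> List[float]:
    """Hypothetical linear gate: one score per modality, computed from
    the concatenation of all modality embeddings (sorted by name)."""
    concat = [x for name in sorted(embeddings) for x in embeddings[name]]
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]


def moe_fuse(
    embeddings: Dict[str, List[float]],
    gate_weights: List[List[float]],
    gate_bias: List[float],
) -> Tuple[List[float], Dict[str, float]]:
    """Fuse equal-length modality embeddings as a softmax-weighted sum."""
    names = sorted(embeddings)
    weights = softmax(gate_scores(embeddings, gate_weights, gate_bias))
    dim = len(embeddings[names[0]])
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Illustrative inputs only: two modalities with 2-dimensional embeddings.
embeddings = {"ecg": [1.0, 3.0], "gsr": [3.0, 5.0]}
zero_gate = [[0.0] * 4, [0.0] * 4]  # one row per modality, 4 = 2 modalities x 2 dims

# An uninformative gate weights both modalities equally (plain averaging).
fused_equal, weights_equal = moe_fuse(embeddings, zero_gate, [0.0, 0.0])

# A strongly negative score for "gsr" mimics a degraded sensor being down-weighted.
fused_degraded, weights_degraded = moe_fuse(embeddings, zero_gate, [0.0, -20.0])
```

With an uninformative gate the fusion reduces to simple averaging, which corresponds to the non-adaptive baseline mentioned above. Once the gate assigns a low score to one modality, the fused representation is dominated by the remaining one, which is the behavior the abstract describes for a degrading sensor.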