Recent advances in Text-to-Speech (TTS) have enabled emotionally expressive speech synthesis; however, fine-grained control of emotional intensity without degrading speech quality remains challenging. Many existing systems exhibit a trade-off between expressiveness and naturalness, especially at extreme intensity levels, largely due to simplistic feature-fusion strategies.
To address this issue, this thesis presents a fine-grained emotion-controllable TTS framework built on a Bidirectional State-Space Model (BiMamba). By replacing Transformer-based self-attention with a linear-time bidirectional state-space backbone, the proposed system improves computational efficiency while maintaining strong context modeling for speech generation.
Two components are introduced to enhance controllability and synthesis quality. First, Emotion-Guided Cross Attention (EGCA) models emotion–acoustic interactions by using emotion representations to selectively attend to relevant acoustic regions, producing stable and discriminative intensity representations that go beyond additive or concatenative conditioning. Second, a Dual Discriminative Learning strategy uses a Joint Conditional and Unconditional (JCU) discriminator to enforce both overall realism and speaker-consistent emotion-intensity fidelity.
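To make the contrast with additive or concatenative conditioning concrete, the following is a minimal sketch of generic single-head scaled dot-product cross-attention in which an emotion vector acts as the query and acoustic frames act as keys and values. It is not the thesis's EGCA module: the function names, the toy dimensions, and the absence of learned projections are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def emotion_guided_attention(emotion_query, acoustic_frames):
    """Hypothetical helper: the emotion vector is the query and each
    acoustic frame serves as both key and value (no learned projections)."""
    d = len(emotion_query)
    # One scaled dot-product score per acoustic frame.
    scores = [
        sum(q * k for q, k in zip(emotion_query, frame)) / math.sqrt(d)
        for frame in acoustic_frames
    ]
    weights = softmax(scores)
    # Attention-weighted sum of frames: an emotion-aware acoustic summary.
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, acoustic_frames))
        for i in range(d)
    ]
    return weights, pooled

# Toy inputs (assumed values): a 2-dim emotion vector and three acoustic frames.
emotion = [1.0, 0.0]
frames = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, pooled = emotion_guided_attention(emotion, frames)
```

In this toy setup the frame most aligned with the emotion vector receives the largest weight, whereas additive conditioning would add the same emotion vector to every frame regardless of its content.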
Experiments on the Emotional Speech Dataset (ESD) demonstrate that the proposed model outperforms competitive baselines in perceptual quality, achieving higher naturalness (NMOS) and speaker similarity (SMOS) while maintaining strong emotion recognition accuracy and improved efficiency. Notably, the system remains robust at maximum intensity, substantially mitigating the quality degradation commonly observed in prior emotion-controllable TTS approaches. These results indicate that interaction-aware conditioning and dual supervision provide an effective path toward practical, high-fidelity emotional TTS with reliable intensity control.