Fine-Grained Emotion-Controllable Text-to-Speech via Acoustic–Emotion interaction and Bidirectional State Space Model

Multilingual Abstract

Recent advances in Text-to-Speech (TTS) have enabled emotionally expressive speech synthesis; however, fine-grained control of emotional intensity without degrading speech quality remains challenging. Many existing systems exhibit a trade-off between expressiveness and naturalness, especially at extreme intensity levels, largely due to simplistic feature-fusion strategies.

To address this issue, this thesis presents a fine-grained emotion-controllable TTS framework built on a Bidirectional State-Space Model (BiMamba). By replacing Transformer-based self-attention with a linear-time bidirectional state-space backbone, the proposed system improves computational efficiency while maintaining strong context modeling for speech generation.
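As a rough illustration of why a bidirectional state-space backbone runs in linear time, the following minimal sketch scans a sequence once left to right and once right to left, then sums the two outputs. It is not the thesis implementation: the function names, the scalar parameters `a`, `b`, `c`, and the additive fusion are all assumptions made for illustration, and the input-dependent (selective) parameterization used by Mamba is omitted.

```python
def ssm_scan(xs, a, b, c):
    """Linear state-space recurrence for a single channel.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    a, b, c are fixed scalars here (an assumption); Mamba would
    generate them from the input at every step.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over time, so cost grows linearly with length
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_ssm(xs, fwd, bwd):
    """Combine a forward scan and a backward scan of the same sequence."""
    y_forward = ssm_scan(xs, *fwd)
    # Reverse the input, scan, then reverse the output back into time order.
    y_backward = ssm_scan(xs[::-1], *bwd)[::-1]
    # Additive fusion is a hypothetical choice made for this sketch.
    return [f + r for f, r in zip(y_forward, y_backward)]


xs = [1.0, 0.0, 0.0, 2.0]
ys = bidirectional_ssm(xs, fwd=(0.5, 1.0, 1.0), bwd=(0.5, 1.0, 1.0))
print(ys)
```

Each output position sees both past and future inputs, which is the kind of context self-attention provides, but the work is two sequential passes rather than a pairwise comparison of all positions.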

Two components are introduced to enhance controllability and synthesis quality. First, Emotion-Guided Cross Attention (EGCA) is designed to model emotion–acoustic interactions by using emotion representations to selectively attend to relevant acoustic regions, producing stable and discriminative intensity representations beyond additive or concatenative conditioning. Second, a Dual Discriminative Learning strategy is adopted using a Joint Conditional and Unconditional (JCU) discriminator to jointly enforce overall realism and speaker-consistent emotion-intensity fidelity.
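The idea of letting an emotion representation attend to acoustic regions can be sketched as single-head scaled dot-product cross-attention, with the emotion vector as the query and the acoustic frames as keys and values. This is only an assumed reading of the abstract, not the EGCA module itself: the function name is hypothetical, and the learned projections, multiple heads, and normalization a real module would likely use are left out.

```python
import math


def emotion_guided_cross_attention(emotion_query, acoustic_frames):
    """Attend from one emotion vector to a sequence of acoustic frames.

    emotion_query:   list of d floats (the query).
    acoustic_frames: list of T frames, each a list of d floats, used
                     directly as keys and values (an assumption; no
                     learned projections in this sketch).
    Returns (context, weights).
    """
    d = len(emotion_query)
    # Scaled dot-product score between the emotion query and each frame.
    scores = [
        sum(q * k for q, k in zip(emotion_query, frame)) / math.sqrt(d)
        for frame in acoustic_frames
    ]
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the frames.
    context = [
        sum(w * frame[i] for w, frame in zip(weights, acoustic_frames))
        for i in range(d)
    ]
    return context, weights


query = [1.0, 0.0]
frames = [[0.0, 1.0], [3.0, 0.0], [0.0, -1.0]]
context, weights = emotion_guided_cross_attention(query, frames)
print(weights)
```

Unlike adding or concatenating a fixed emotion embedding to every frame, the weights here depend on how well each frame matches the emotion query, so frames that carry the relevant cues contribute more to the resulting representation.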

Experiments on the Emotional Speech Dataset (ESD) demonstrate that the proposed model outperforms competitive baselines in perceptual quality, achieving higher naturalness (NMOS) and speaker similarity (SMOS) while maintaining strong emotion recognition accuracy and improved efficiency. Notably, the system remains robust at maximum intensity, substantially mitigating the quality degradation commonly observed in prior emotion-controllable TTS approaches. These results indicate that interaction-aware conditioning and dual supervision provide an effective path toward practical, high-fidelity emotional TTS with reliable intensity control.


Table of Contents

1 Introduction
2 Related Work
2.1 Emotion intensity control in TTS
2.2 State space models and Mamba for speech
2.3 Dual discriminative learning
3 Method
3.1 Mamba
3.1.1 Overview
3.1.2 Continuous-time SSM and discretization
3.1.3 Why scan works for Mamba (and not for nonlinear recurrences)
3.1.4 Why Transformer attention does not admit scan in the same form
3.1.5 Selective parameterization: nonlinearity in parameter generation, linear state update
3.1.6 Bidirectional block and fusion
3.2 Bidirectional Mamba Layer
3.2.1 Layer architecture
3.2.2 Roles of the sub-blocks
3.2.3 Advantages Over Transformer
3.3 Rank Model and Emotion-Guided Cross Attention
3.3.1 Rank Model: Inter- and Intra-Emotion Learning
3.3.2 Emotion-Guided Cross Attention (EGCA)
3.4 TTS Model with Dual Discriminative Learning
3.4.1 Motivation for Dual Discriminative Learning
3.4.2 Architecture and Formulation
4 Experiments
4.1 Dataset
4.2 Implementation Details
4.2.1 Acoustic Feature Extraction
4.2.2 Training Configuration
4.2.3 Inference Details
4.2.4 Baseline Models
4.2.5 Evaluation Metrics
4.3 Main Results
4.3.1 Overall Performance Analysis
4.3.2 Comparison with Baseline Systems
4.3.3 Spectral and Prosodic Quality
4.3.4 Speaker Identity Preservation
4.4 Ablation Study
4.4.1 Impact of EGCA
4.4.2 Impact of Dual Discriminative Learning
4.5 Computational Efficiency
4.6 Performance at Extreme Intensity
4.7 Intensity Control Effectiveness
4.8 Representation t-SNE Visualization
4.9 Results and Discussion
4.9.1 Overall results
4.9.2 Robustness at extreme intensity
4.9.3 Controllability and perceptual intensity recognition
4.9.4 Ablation analysis
4.9.5 Efficiency
4.9.6 Representation Insights for Speech LLMs
5 Conclusion
Reference
Acknowledgment
