RISS (Research Information Sharing Service)

      Fine-Grained Emotion-Controllable Text-to-Speech via Acoustic–Emotion interaction and Bidirectional State Space Model

      https://www.riss.kr/link?id=T17403293


      Multilingual Abstract

      Recent advances in Text-to-Speech (TTS) have enabled emotionally expressive speech synthesis; however, fine-grained control of emotional intensity without degrading speech quality remains challenging. Many existing systems exhibit a trade-off between expressiveness and naturalness, especially at extreme intensity levels, largely due to simplistic feature-fusion strategies.

      To address this issue, this thesis presents a fine-grained emotion-controllable TTS framework built on a Bidirectional State-Space Model (BiMamba). By replacing Transformer-based self-attention with a linear-time bidirectional state-space backbone, the proposed system improves computational efficiency while maintaining strong context modeling for speech generation.
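To make the "linear-time bidirectional state-space" idea concrete, the following is a minimal sketch and not the thesis implementation. It assumes a scalar hidden state, per-step parameters `a`, `b`, `c` supplied by the caller (standing in for input-dependent, "selective" parameters), and fusion of the two directions by simple addition. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Run h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t in one O(T) pass."""
    h = 0.0
    y = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = a_t * h + b_t * x_t  # the state update is linear in h
        y.append(c_t * h)
    return y


def bidirectional_ssm(x: Sequence[float], a: Sequence[float],
                      b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Scan left-to-right and right-to-left, then fuse by addition (an assumption)."""
    fwd = ssm_scan(x, a, b, c)
    # Reverse the inputs, scan, and reverse the outputs back to the original order.
    bwd = ssm_scan(x[::-1], a[::-1], b[::-1], c[::-1])[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


if __name__ == "__main__":
    impulse = [0.0, 1.0, 0.0]
    decay = [0.5, 0.5, 0.5]
    ones = [1.0, 1.0, 1.0]
    # An impulse in the middle spreads to both neighbours after fusion.
    print(bidirectional_ssm(impulse, decay, ones, ones))
```

Each direction touches every time step once, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions.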

      Two components are introduced to enhance controllability and synthesis quality. First, Emotion-Guided Cross Attention (EGCA) is designed to model emotion–acoustic interactions by using emotion representations to selectively attend to relevant acoustic regions, producing stable and discriminative intensity representations beyond additive or concatenative conditioning. Second, a Dual Discriminative Learning strategy is adopted using a Joint Conditional and Unconditional (JCU) discriminator to jointly enforce overall realism and speaker-consistent emotion-intensity fidelity.
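The contrast between cross-attention and additive or concatenative conditioning can be illustrated with a small sketch. This is not the EGCA module from the thesis. It assumes a single emotion vector used as the query, raw acoustic frames used as both keys and values with no learned projections, and standard scaled dot-product attention. The function name and inputs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def emotion_guided_attention(
    emotion_query: Sequence[float],
    acoustic_frames: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights over frames, weighted sum of frames)."""
    d = len(emotion_query)
    # Scaled dot-product score between the emotion query and each acoustic frame.
    scores = [
        sum(q * k for q, k in zip(emotion_query, frame)) / math.sqrt(d)
        for frame in acoustic_frames
    ]
    # Numerically stable softmax over frames.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is a weighted sum of frames, dominated by the
    # frames that align with the emotion query.
    context = [
        sum(w * frame[j] for w, frame in zip(weights, acoustic_frames))
        for j in range(d)
    ]
    return weights, context


if __name__ == "__main__":
    query = [1.0, 0.0]
    frames = [[4.0, 0.0], [0.0, 4.0]]
    weights, context = emotion_guided_attention(query, frames)
    print(weights, context)
```

Unlike adding or concatenating a fixed emotion embedding to every frame, the weights here depend on how each acoustic frame relates to the emotion query, so different frames contribute differently to the resulting representation.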

      Experiments on the Emotional Speech Dataset (ESD) demonstrate that the proposed model outperforms competitive baselines in perceptual quality, achieving higher naturalness (NMOS) and speaker similarity (SMOS) while maintaining strong emotion recognition accuracy and improved efficiency. Notably, the system remains robust at maximum intensity, substantially mitigating the quality degradation commonly observed in prior emotion-controllable TTS approaches. These results indicate that interaction-aware conditioning and dual supervision provide an effective path toward practical, high-fidelity emotional TTS with reliable intensity control.

      Table of Contents

      • 1 Introduction 1
      • 2 Related Work 4
      • 2.1 Emotion intensity control in TTS 4
      • 2.2 State space models and Mamba for speech 5
      • 2.3 Dual discriminative learning 7
      • 3 Method 10
      • 3.1 Mamba 10
      • 3.1.1 Overview 10
      • 3.1.2 Continuous-time SSM and discretization 11
      • 3.1.3 Why scan works for Mamba (and not for nonlinear recurrences) 11
      • 3.1.4 Why Transformer attention does not admit scan in the same form 12
      • 3.1.5 Selective parameterization: nonlinearity in parameter generation, linear state update 13
      • 3.1.6 Bidirectional block and fusion 14
      • 3.2 Bidirectional Mamba Layer 14
      • 3.2.1 Layer architecture 14
      • 3.2.2 Roles of the sub-blocks 15
      • 3.2.3 Advantages Over Transformer 15
      • 3.3 Rank Model and Emotion-Guided Cross Attention 16
      • 3.3.1 Rank Model: Inter- and Intra-Emotion Learning 16
      • 3.3.2 Emotion-Guided Cross Attention (EGCA) 19
      • 3.4 TTS Model with Dual Discriminative Learning 27
      • 3.4.1 Motivation for Dual Discriminative Learning 28
      • 3.4.2 Architecture and Formulation 30
      • 4 Experiments 36
      • 4.1 Dataset 36
      • 4.2 Implementation Details 37
      • 4.2.1 Acoustic Feature Extraction 37
      • 4.2.2 Training Configuration 37
      • 4.2.3 Inference Details 38
      • 4.2.4 Baseline Models 40
      • 4.2.5 Evaluation Metrics 41
      • 4.3 Main Results 43
      • 4.3.1 Overall Performance Analysis 43
      • 4.3.2 Comparison with Baseline Systems 44
      • 4.3.3 Spectral and Prosodic Quality 45
      • 4.3.4 Speaker Identity Preservation 46
      • 4.4 Ablation Study 46
      • 4.4.1 Impact of EGCA 46
      • 4.4.2 Impact of Dual Discriminative Learning 47
      • 4.5 Computational Efficiency 48
      • 4.6 Performance at Extreme Intensity 50
      • 4.7 Intensity Control Effectiveness 51
      • 4.8 Representation t-SNE Visualization 53
      • 4.9 Results and Discussion 54
      • 4.9.1 Overall results 55
      • 4.9.2 Robustness at extreme intensity 55
      • 4.9.3 Controllability and perceptual intensity recognition 56
      • 4.9.4 Ablation analysis 56
      • 4.9.5 Efficiency 56
      • 4.9.6 Representation Insights for Speech LLMs 57
      • 5 Conclusion 60
      • Reference 62
      • Acknowledgment 68