RISS (Research Information Sharing Service)

      Approaches in Generative Speech Models: Restoration and Enhancement in Degraded Speech Signals

      https://www.riss.kr/link?id=T17291110

      Multilingual Abstract

      In real-world communication scenarios, speech signals are often degraded by adverse factors such as background noise, reverberation, bandwidth limitations, and packet loss. These distortions hinder both human perception and the performance of speech-driven applications, including voice assistants and automatic speech recognition systems. Recent advances in deep learning have led to significant progress in speech processing, with methods broadly categorized into discriminative and generative approaches. Discriminative models learn direct mappings from input to target signals, but they often suffer from limited generalization and over-smoothed outputs. Generative models, in contrast, offer greater flexibility and perceptual quality, but they are often constrained by high computational cost and slow inference due to their iterative generation process.

      This dissertation proposes practical generative approaches that aim to improve efficiency, adaptability, or reconstruction quality across three key restoration tasks. First, for packet loss concealment (PLC), we propose Flow-PLC, a generative model based on the flow-matching framework. Flow-PLC learns a vector field that deterministically transforms a source distribution into clean waveforms, enabling fast and high-fidelity reconstruction. By avoiding iterative sampling, this approach significantly improves inference speed while maintaining high reconstruction quality, making it well-suited for real-time applications.
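As a rough illustration of the flow-matching idea behind Flow-PLC, the sketch below builds the optimal-transport (OT) conditional path named in the table of contents and integrates a vector field with fixed-step Euler. It is a toy under stated assumptions, not the dissertation's model: the function names (`ot_path`, `fm_loss`, `euler_sample`), the 160-sample frame, the Gaussian source, and the oracle field standing in for a trained network are all hypothetical.

```python
import numpy as np


def ot_path(x0, x1, t, sigma_min=0.0):
    """Point on the OT conditional path and its target velocity.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0   (the time derivative of x_t)
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def fm_loss(v_pred, u_t):
    """Mean squared error between a predicted and a target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt)
    return x


rng = np.random.default_rng(0)
x1 = rng.standard_normal(160)  # assumed stand-in for a clean waveform frame
x0 = rng.standard_normal(160)  # assumed sample from a Gaussian source


# Oracle field for this single (x0, x1) pair with sigma_min = 0.
# A trained network would replace it in practice.
def oracle(x, t):
    return x1 - x0


x_one_step = euler_sample(oracle, x0, n_steps=1)
x_eight_steps = euler_sample(oracle, x0, n_steps=8)
```

Because the OT path is a straight line, the oracle velocity is constant in time, so a single Euler step already lands on the target in this toy setting. A learned field is only approximately straight, so the number of solver steps a real system needs is an empirical question.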

      Second, for speech enhancement under noisy conditions, we introduce a token-based generative framework that combines autoregressive modeling with flow matching to improve robustness against diverse noise conditions. We adopt a non-neural tokenization method called dMel, which discretizes Mel spectrograms while retaining both semantic and acoustic information. An autoregressive language model is trained to predict clean dMel sequences from noisy inputs, and a flow-matching-based dequantizer is employed to refine and reconstruct Mel spectrograms from the predicted tokens. The framework thus combines the sequence modeling capability of language models with the fine-resolution reconstruction ability of flow matching, yielding more robust speech enhancement under severe noise distortions.
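The tokenization step can be pictured with a minimal sketch: each log-Mel value is snapped to the nearest of a small set of evenly spaced intensity levels, and the level index is the token. This assumes uniform binning; the bin count (16), the value range (-7 to 2), and the function names are illustrative assumptions rather than the dissertation's settings. The naive detokenizer below only returns bin centers, which is the coarse reconstruction that a learned dequantizer would be meant to refine.

```python
import numpy as np


def bin_centers(n_bins, lo, hi):
    """Evenly spaced quantization levels covering [lo, hi]."""
    return np.linspace(lo, hi, n_bins)


def dmel_tokenize(log_mel, n_bins=16, lo=-7.0, hi=2.0):
    """Map each log-Mel value to the index of its nearest bin center.

    log_mel: array of shape (frames, mel_channels).
    Returns an integer array of the same shape with values in [0, n_bins - 1].
    """
    centers = bin_centers(n_bins, lo, hi)
    clipped = np.clip(log_mel, lo, hi)
    # Compare every value against every center, then keep the closest one.
    idx = np.abs(clipped[..., None] - centers).argmin(axis=-1)
    return idx.astype(np.int64)


def dmel_detokenize(tokens, n_bins=16, lo=-7.0, hi=2.0):
    """Naive inverse: replace each token with its bin center."""
    return bin_centers(n_bins, lo, hi)[tokens]


rng = np.random.default_rng(0)
log_mel = rng.uniform(-7.0, 2.0, size=(100, 80))  # assumed 80-channel log-Mel
tokens = dmel_tokenize(log_mel)
recon = dmel_detokenize(tokens)
max_err = float(np.max(np.abs(recon - log_mel)))
```

For inputs inside the assumed range, the round-trip error is bounded by half the bin spacing, which makes the trade-off explicit: fewer bins give a smaller vocabulary for the language model but a coarser spectrogram to refine.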

      Finally, for general speech restoration involving simultaneous distortions (such as noise, reverberation, and bandwidth reduction), we propose FLOWER, a conditioning framework that provides normalizing flow-based Gaussian guidance to the generative model. This guidance is obtained from a normalizing flow model during training, but at inference time it is sampled directly from a Gaussian distribution, enabling lightweight deployment. Consequently, FLOWER improves restoration performance without increasing model complexity.
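The train/inference asymmetry described above can be sketched with a toy invertible map. Everything here is an assumption made for illustration: the elementwise affine flow, the choice of clean features as the flow input, and the names `AffineFlow`, `gaussian_nll`, and `get_guidance` are hypothetical and do not reproduce the FLOWER architecture.

```python
import numpy as np


class AffineFlow:
    """Toy elementwise affine flow: z = (x - shift) * exp(-log_scale)."""

    def __init__(self, shift, log_scale):
        self.shift = np.asarray(shift, dtype=float)
        self.log_scale = np.asarray(log_scale, dtype=float)

    def forward(self, x):
        """Map features x of shape (batch, dim) to latents z, with log|det J|."""
        z = (x - self.shift) * np.exp(-self.log_scale)
        log_det = -np.sum(self.log_scale) * np.ones(x.shape[0])
        return z, log_det

    def inverse(self, z):
        """Exact inverse of forward."""
        return z * np.exp(self.log_scale) + self.shift


def gaussian_nll(z, log_det):
    """Per-sample negative log-likelihood under a standard Gaussian prior."""
    dim = z.shape[1]
    log_pz = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * dim * np.log(2.0 * np.pi)
    return -(log_pz + log_det)


def get_guidance(flow, clean_feats, training, rng, shape=None):
    """Training: latent produced by the flow. Inference: plain Gaussian sample."""
    if training:
        z, _ = flow.forward(clean_feats)
        return z
    return rng.standard_normal(shape)


rng = np.random.default_rng(0)
flow = AffineFlow(shift=[1.0, -2.0], log_scale=[0.5, -0.3])
clean = rng.standard_normal((4, 2))  # assumed stand-in for clean features
z_train, log_det = flow.forward(clean)
g_train = get_guidance(flow, clean, training=True, rng=rng)
g_infer = get_guidance(flow, None, training=False, rng=rng, shape=(4, 2))
```

Training the flow by maximum likelihood pushes its latents toward a standard Gaussian, which is what would make a plain Gaussian sample a plausible substitute at inference time. In that case the flow itself is not needed for deployment, which is in line with the lightweight-deployment claim.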

      Keywords: flow matching (FM), generative model, general speech restoration (GSR), language model (LM), normalizing flow (NF), packet loss concealment (PLC), speech enhancement (SE)

      Table of Contents

      • 1 Introduction
      • 1.1 Background: Speech Enhancement and Restoration
      • 1.2 Outline of the Thesis
      • 2 Flow-PLC: Towards Efficient Packet Loss Concealment with Flow Matching
      • 2.1 Introduction
      • 2.2 Background
      • 2.3 Proposed Method
      • 2.3.1 Problem statement
      • 2.3.2 Flow-matching with OT path
      • 2.3.3 Model architecture
      • 2.4 Experiments
      • 2.4.1 Dataset
      • 2.4.2 Implementation details
      • 2.4.3 Evaluation metrics
      • 2.5 Results
      • 2.5.1 Comparative analysis of performance
      • 2.5.2 Computational complexity: Inference time
      • 2.5.3 Ablation study
      • 2.6 Conclusion
      • 3 dMel-LM + RoSE: Tokenized Generative Speech Enhancement with Language Model and Flow Matching
      • 3.1 Introduction
      • 3.2 Background
      • 3.3 Proposed Method
      • 3.3.1 Overall framework
      • 3.3.2 Tokenization: dMel
      • 3.3.3 Speech language model: dMel-LM
      • 3.3.4 Refining of speech enhancement: RoSE
      • 3.4 Experimental Setup
      • 3.4.1 Dataset
      • 3.4.2 Model configuration
      • 3.4.3 Baseline systems
      • 3.4.4 Evaluation metrics
      • 3.5 Result
      • 3.6 Conclusion
      • 4 FLOWER: Flow-based Estimated Gaussian Guidance for General Speech Restoration
      • 4.1 Introduction
      • 4.2 Related Work
      • 4.3 Preliminaries
      • 4.3.1 Score-based diffusion model
      • 4.3.2 Conditional normalizing flow network
      • 4.3.3 FM-based model using OT path
      • 4.4 Proposed Method
      • 4.4.1 Overview of FLOWER framework
      • 4.4.2 Network architecture
      • 4.4.3 Extensions to flow matching
      • 4.5 Experimental Setup
      • 4.5.1 Datasets
      • 4.5.2 Evaluation metrics
      • 4.5.3 Implementation details
      • 4.5.4 Comparative models
      • 4.6 Results
      • 4.6.1 Quantitative evaluation
      • 4.6.2 Qualitative results
      • 4.7 Conclusion
      • 5 Conclusion