Approaches in Generative Speech Models: Restoration and Enhancement in Degraded Speech Signals

Multilingual Abstract

In real-world communication scenarios, speech signals are often degraded by adverse factors such as background noise, reverberation, bandwidth limitations, and packet loss. These distortions hinder both human perception and the performance of speech-driven applications, including voice assistants and automatic speech recognition systems. Recent advances in deep learning have led to significant progress in speech processing, with methods broadly categorized into discriminative and generative approaches. Discriminative models learn direct mappings from input to target signals, but they often suffer from limited generalization and over-smoothed outputs. Generative models offer greater flexibility and perceptual quality, but they are often constrained by high computational cost and slow inference due to their iterative generation process.

This dissertation proposes practical generative approaches that aim to improve efficiency, adaptability, or reconstruction quality across three key restoration tasks. First, for packet loss concealment (PLC), we propose Flow-PLC, a generative model based on the flow-matching framework. Flow-PLC learns a vector field that deterministically transforms a source distribution into clean waveforms, enabling fast and high-fidelity reconstruction. By avoiding iterative sampling, this approach significantly improves inference speed while maintaining high reconstruction quality, making it well-suited for real-time applications.
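
As a rough illustration of the flow-matching idea summarized above, the following sketch uses the widely used optimal-transport (OT) conditional path, in which a source sample is moved toward a clean target along a straight line. The abstract does not give Flow-PLC's equations, so the function names, the `sigma_min` constant, and the oracle vector field in the usage note are assumptions for illustration, not the dissertation's implementation.

```python
import numpy as np

def ot_path_sample(x0, x1, t, sigma_min=1e-4):
    """Return a point on the OT conditional path and its target velocity.

    x0: sample from the source distribution (e.g. Gaussian noise)
    x1: clean target signal
    t:  scalar time in [0, 1]
    sigma_min is an assumed small constant, not a value from the source.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # constant along the straight path
    return x_t, u_t

def flow_matching_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

def euler_integrate(vector_field, x0, num_steps=1):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * vector_field(x, i * dt)
    return x
```

With an oracle field that returns the target velocity, a single Euler step carries `x0` to approximately `x1`, which is the intuition behind few-step inference on straight paths. A trained network would replace the oracle and would only approximate this behavior.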

Second, for speech enhancement under noisy conditions, we introduce a token-based generative framework that combines autoregressive modeling with flow matching. We adopt a non-neural tokenization method called dMel, which discretizes Mel spectrograms while retaining both semantic and acoustic information. An autoregressive language model is trained to predict clean dMel sequences from noisy inputs, and a flow-matching-based dequantizer refines and reconstructs Mel spectrograms from the predicted tokens. By leveraging the sequence modeling capability of language models and the fine-resolution reconstruction ability of flow matching, the framework achieves more robust speech enhancement under diverse and severe noise conditions.
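
The tokenization step described above can be sketched as per-channel binning of log-Mel values. This is a minimal sketch assuming uniformly spaced bins; the bin count and the value range (`num_bins`, `lo`, `hi`) are hypothetical choices, and the language model and flow-matching dequantizer from the abstract are not modeled here.

```python
import numpy as np

def dmel_quantize(log_mel, num_bins=16, lo=-7.0, hi=2.0):
    """Map each log-Mel value to the index of its nearest bin center.

    num_bins, lo, and hi are illustrative assumptions, not values
    taken from the dissertation.
    """
    centers = np.linspace(lo, hi, num_bins)
    clipped = np.clip(log_mel, lo, hi)
    # Broadcast values against the centers and pick the closest one.
    return np.argmin(np.abs(clipped[..., None] - centers), axis=-1)

def dmel_dequantize(tokens, num_bins=16, lo=-7.0, hi=2.0):
    """Naive inverse: look up the bin center for each token."""
    centers = np.linspace(lo, hi, num_bins)
    return centers[tokens]

# Toy input standing in for a (frames x Mel channels) log-Mel spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.uniform(-7.0, 2.0, size=(100, 80))
tokens = dmel_quantize(log_mel)
recon = dmel_dequantize(tokens)
```

The table lookup in `dmel_dequantize` leaves a quantization error of up to half a bin width per value. That residual is the kind of detail a learned dequantizer, such as the flow-matching-based one mentioned in the abstract, would aim to refine.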

Finally, for general speech restoration involving simultaneous distortions—such as noise, reverberation, and bandwidth reduction—we propose FLOWER, a conditioning framework that provides normalizing flow-based Gaussian guidance to the generative model. This guidance is obtained from a normalizing flow model during training, but at inference time, it is sampled directly from a Gaussian distribution, enabling lightweight deployment. Consequently, FLOWER improves restoration performance without increasing model complexity.

keywords: Flow matching (FM), generative model, general speech restoration (GSR), language model (LM), normalizing flow (NF), packet loss concealment (PLC), speech enhancement (SE)

Table of Contents

1 Introduction
1.1 Background: Speech Enhancement and Restoration
1.2 Outline of the Thesis
2 Flow-PLC: Towards Efficient Packet Loss Concealment with Flow Matching
2.1 Introduction
2.2 Background
2.3 Proposed Method
2.3.1 Problem statement
2.3.2 Flow-matching with OT path
2.3.3 Model architecture
2.4 Experiments
2.4.1 Dataset
2.4.2 Implementation details
2.4.3 Evaluation metrics
2.5 Results
2.5.1 Comparative analysis of performance
2.5.2 Computational complexity: Inference time
2.5.3 Ablation study
2.6 Conclusion
3 dMel-LM + RoSE: Tokenized Generative Speech Enhancement with Language Model and Flow Matching
3.1 Introduction
3.2 Background
3.3 Proposed Method
3.3.1 Overall framework
3.3.2 Tokenization: dMel
3.3.3 Speech language model: dMel-LM
3.3.4 Refining of speech enhancement: RoSE
3.4 Experimental Setup
3.4.1 Dataset
3.4.2 Model configuration
3.4.3 Baseline systems
3.4.4 Evaluation metrics
3.5 Result
3.6 Conclusion
4 FLOWER: Flow-based Estimated Gaussian Guidance for General Speech Restoration
4.1 Introduction
4.2 Related Work
4.3 Preliminaries
4.3.1 Score-based diffusion model
4.3.2 Conditional normalizing flow network
4.3.3 FM-based model using OT path
4.4 Proposed Method
4.4.1 Overview of FLOWER framework
4.4.2 Network architecture
4.4.3 Extensions to flow matching
4.5 Experimental Setup
4.5.1 Datasets
4.5.2 Evaluation metrics
4.5.3 Implementation details
4.5.4 Comparative models
4.6 Results
4.6.1 Quantitative evaluation
4.6.2 Qualitative results
4.7 Conclusion
5 Conclusion
