In real-world communication scenarios, speech signals are often degraded by background noise, reverberation, bandwidth limitations, and packet loss. These distortions hinder both human perception and the performance of speech-driven applications, including voice assistants and automatic speech recognition systems. Recent deep-learning approaches to speech processing have made significant progress and fall broadly into two categories: discriminative and generative. Discriminative models learn direct mappings from input to target signals but often suffer from limited generalization and over-smoothed outputs. Generative models offer greater flexibility and perceptual quality but are often constrained by high computational cost and slow inference due to their iterative generation process.
This dissertation proposes practical generative approaches that aim to improve efficiency, adaptability, or reconstruction quality across three key restoration tasks. First, for packet loss concealment (PLC), we propose Flow-PLC, a generative model based on the flow-matching framework. Flow-PLC learns a vector field that deterministically transforms a source distribution into clean waveforms, enabling fast and high-fidelity reconstruction. By avoiding iterative sampling, this approach significantly improves inference speed while maintaining high reconstruction quality, making it well-suited for real-time applications.
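As a rough illustration of the flow-matching idea described here, the following minimal sketch assumes a straight-line path between a source sample and a clean target, which this abstract does not specify. The function names and the "oracle" vector field are hypothetical stand-ins for a trained network and are not taken from Flow-PLC.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Point on the straight-line path from x0 to x1, and its target velocity.

    Assumption: linear interpolation path; a network would be regressed
    onto `target_velocity` at the input (x_t, t).
    """
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt)
    return x

def make_oracle_field(x1):
    """Hypothetical stand-in for a trained model: the exact field for one known target."""
    return lambda x, t: (x1 - x) / (1.0 - t)

# Toy usage: transport a Gaussian source sample onto a known "clean" frame.
rng = np.random.default_rng(0)
source = rng.standard_normal(8)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
restored = euler_sample(make_oracle_field(clean), source, n_steps=4)
```

In this toy setting the transformation is deterministic once the source sample is fixed, and the number of Euler steps is the only inference-time cost parameter. A learned vector field would only approximate the oracle used here.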
Second, for speech enhancement under noisy conditions, we introduce a token-based generative framework that combines autoregressive modeling with flow matching. We adopt dMel, a non-neural tokenization method that discretizes Mel spectrograms while retaining both semantic and acoustic information. An autoregressive language model is trained to predict clean dMel sequences from noisy inputs, and a flow-matching-based dequantizer is employed to reconstruct and refine Mel spectrograms from the predicted tokens. By combining the sequence modeling capability of language models with the fine-resolution reconstruction ability of flow matching, the framework achieves more robust speech enhancement under diverse and severe noise conditions.
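To make the non-neural tokenization step concrete, the sketch below quantizes log-Mel values into uniformly spaced intensity bins and maps tokens back to bin centers. The bin count, the value range, and the function names are assumptions for illustration. The bin-center lookup is a naive stand-in and is not the flow-matching-based dequantizer proposed in this work.

```python
import numpy as np

def quantize_mel(log_mel, lo, hi, n_bins=16):
    """Map each log-Mel value to an integer bin index in [0, n_bins - 1].

    Assumption: uniform bins over [lo, hi]; values outside the range
    fall into the first or last bin.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    # Only the interior edges are used, so indices stay within 0..n_bins-1.
    return np.digitize(log_mel, edges[1:-1]).astype(np.int64)

def dequantize_mel(tokens, lo, hi, n_bins=16):
    """Naive dequantizer: replace each token with the center of its bin."""
    width = (hi - lo) / n_bins
    return lo + (tokens + 0.5) * width

# Toy usage on a random (frames x Mel channels) array.
rng = np.random.default_rng(0)
log_mel = rng.uniform(-10.0, 2.0, size=(5, 80))
tokens = quantize_mel(log_mel, lo=-10.0, hi=2.0)
coarse = dequantize_mel(tokens, lo=-10.0, hi=2.0)
```

For in-range values, the bin-center reconstruction is off by at most half a bin width. That residual is the kind of detail a learned dequantizer would be expected to recover.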
Finally, for general speech restoration involving simultaneous distortions such as noise, reverberation, and bandwidth reduction, we propose FLOWER, a conditioning framework that provides normalizing-flow-based Gaussian guidance to the generative model. During training, this guidance is obtained from a normalizing flow model; at inference time, it is sampled directly from a Gaussian distribution, enabling lightweight deployment. Consequently, FLOWER improves restoration performance without increasing model complexity.
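The difference between training-time and inference-time guidance can be sketched as follows. This is a toy illustration under strong assumptions: the flow is a fixed elementwise affine map and not a trained normalizing flow, it is assumed to take clean-speech features as input, and `AffineFlow` and `get_guidance` are hypothetical names.

```python
import numpy as np

class AffineFlow:
    """Toy invertible map z = (x - shift) / scale, standing in for a normalizing flow."""

    def __init__(self, shift, scale):
        self.shift = float(shift)
        self.scale = float(scale)

    def forward(self, x):
        return (x - self.shift) / self.scale

    def inverse(self, z):
        return z * self.scale + self.shift

def get_guidance(flow, clean_feat=None, shape=None, rng=None):
    """Return the guidance vector for the restoration model.

    Training: map clean features through the flow (assumed input).
    Inference: no clean features exist, so draw from a standard Gaussian.
    """
    if clean_feat is not None:
        return flow.forward(clean_feat)
    rng = np.random.default_rng() if rng is None else rng
    return rng.standard_normal(shape)

# Toy usage: features distributed as N(2, 3^2) map to roughly N(0, 1).
flow = AffineFlow(shift=2.0, scale=3.0)
clean_feat = np.random.default_rng(0).normal(2.0, 3.0, size=(4, 6))
train_guidance = get_guidance(flow, clean_feat=clean_feat)
infer_guidance = get_guidance(flow, shape=(4, 6), rng=np.random.default_rng(1))
```

Because the inference branch never calls the flow, the flow model would not need to be deployed, which matches the lightweight-deployment point made above.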
Keywords: flow matching (FM), generative model, general speech restoration (GSR), language model (LM), normalizing flow (NF), packet loss concealment (PLC), speech enhancement (SE)