Dysarthria, resulting from neurological impairment, significantly degrades speech intelligibility, undermining patients' right to communicate and leading to social isolation. While deep learning-based Voice Conversion (VC) technology has emerged as a promising alternative, existing approaches often rely on high-performance server resources, which makes them dependent on internet connectivity and raises privacy concerns. Their reliance on batch processing also frequently results in high latency, making them unsuitable for real-time conversation. To overcome these limitations, this study proposes a real-time, on-device speech conversion system designed to support practical communication for patients with dysarthria. The proposed system is engineered to operate standalone on the NVIDIA Jetson Orin Nano, an edge computing platform. To ensure real-time performance, a Contextual Block Conformer-based streaming Automatic Speech Recognition (ASR) architecture is introduced to process continuous speech input with low latency. This is integrated with an end-to-end Text-to-Speech (TTS) model that jointly trains FastSpeech2 and HiFi-GAN, enabling the recognized text to be converted immediately into intelligible speech. To maintain stability in real-world environments, the robustness of the model is enhanced through data augmentation techniques, including device-specific noise synthesis and speed perturbation. Additionally, a bidirectional communication support function is implemented by integrating microphone array-based Direction of Arrival (DOA) estimation to identify speakers and facilitate dialogue. Experimental results demonstrate that the proposed streaming model reduces the First Response Time to approximately 0.04 seconds, a significant reduction relative to conventional batch models.
Moreover, it achieved a Real-Time Factor (RTF) sufficient for real-time processing across the entire pipeline, including both ASR and TTS. These findings validate that dysarthric speech can be converted into high-quality standard speech with minimal delay, even under limited computational resources. The significance of this study lies in presenting a standalone, practical communication assistance solution that extends beyond laboratory settings and is directly applicable to patients' daily lives.
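
As a rough illustration of the two latency metrics reported above, the following Python sketch measures them for a generic chunk-by-chunk pipeline. It assumes the standard definition of RTF (processing time divided by audio duration, so values below 1 mean faster than real time) and assumes that First Response Time is the delay from the start of processing until the first non-empty output; the abstract does not spell out the exact measurement protocol. The function `measure_streaming` and its parameters are hypothetical names introduced for this example and are not taken from the proposed system.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def measure_streaming(
    chunks: Sequence[bytes],
    chunk_seconds: float,
    process_chunk: Callable[[bytes], str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (first_response_time, rtf) for a chunk-wise pipeline.

    Assumptions (not from the source): first response time is the delay
    until the first non-empty output, and every chunk holds exactly
    `chunk_seconds` of audio.
    """
    if not chunks or chunk_seconds <= 0:
        raise ValueError("need at least one chunk of positive duration")

    start = clock()
    first_response = None
    for chunk in chunks:
        output = process_chunk(chunk)  # hypothetical ASR/TTS step
        # Record the elapsed time only once, at the first non-empty output.
        if first_response is None and output:
            first_response = clock() - start
    processing_seconds = clock() - start

    audio_seconds = len(chunks) * chunk_seconds
    rtf = processing_seconds / audio_seconds  # RTF < 1: faster than real time
    return first_response, rtf
```

A batch model emits nothing until all of the audio has been processed, so under these assumptions its first response time is close to its total processing time. A streaming model can respond after the first chunk, which is the effect the reported reduction describes.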