A Real-Time On-Device Speech Conversion System for Patients with Dysarthria (구음장애 환자를 위한 실시간 온디바이스 발화 변환 시스템)

Multilingual Abstract

Dysarthria, which results from neurological impairment, significantly degrades speech intelligibility, thereby infringing upon patients' communication rights and leading to social isolation. While deep learning-based Voice Conversion (VC) technology has emerged as a promising alternative, existing approaches often rely on high-performance server resources, which makes them dependent on internet connectivity and raises privacy concerns. Furthermore, their reliance on batch processing frequently results in high latency, making them unsuitable for real-time conversation. To overcome these limitations, this study proposes a real-time, on-device speech conversion system designed to support practical communication for patients with dysarthria. The proposed system is engineered to operate independently on the NVIDIA Jetson Orin Nano, an edge computing platform. To ensure real-time performance, a Contextual Block Conformer-based streaming Automatic Speech Recognition (ASR) architecture is introduced to process continuous speech input with low latency. This is integrated with an end-to-end Text-to-Speech (TTS) model based on JETS (Jointly Training FastSpeech2 and HiFi-GAN), enabling the immediate conversion of the recognized text into intelligible speech. To improve stability in real-world environments, the robustness of the model is enhanced through data augmentation techniques, including device-specific noise synthesis and speed perturbation. Additionally, a bidirectional communication support function is implemented by integrating microphone array-based Direction of Arrival (DOA) estimation to identify speakers and facilitate dialogue. Experimental results show that, compared with conventional batch models, the proposed streaming model reduces the First Response Time to approximately 0.04 seconds. The entire pipeline, including both ASR and TTS, also achieved a Real-Time Factor (RTF) sufficient for real-time processing. These findings indicate that dysarthric speech can be converted almost instantaneously into high-quality standard speech even with limited computational resources. The significance of this study lies in presenting an independent and practical communication assistance solution that goes beyond laboratory settings and is directly applicable to the daily lives of patients.
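
The abstract reports a Real-Time Factor (RTF) for the full ASR and TTS pipeline without defining it. Under the usual definition, RTF is processing time divided by the duration of the input audio, so a value below 1.0 means the system keeps up with incoming speech. The following is a minimal sketch of that calculation, not the thesis's measurement code. The function names, the assumption that the stages run sequentially, and the timings are all hypothetical and are not results from the study.

```python
from typing import Mapping


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def pipeline_rtf(stage_seconds: Mapping[str, float], audio_seconds: float) -> float:
    """RTF of a sequential pipeline: summed stage times over audio duration."""
    return real_time_factor(sum(stage_seconds.values()), audio_seconds)


# Hypothetical timings for a 4-second utterance (illustrative only).
stages = {"asr": 0.8, "tts": 0.4}
rtf = pipeline_rtf(stages, audio_seconds=4.0)

# An RTF below 1.0 means processing finishes faster than the audio plays.
is_real_time = rtf < 1.0
```

Summing the per-stage times reflects the assumption that ASR and TTS run one after the other. If the stages overlapped in a streaming setup, the effective RTF would be bounded by the slowest stage rather than the sum.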


Table of Contents

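I. Introduction
1. Research Background
2. Research Objectives
II. Related Work
1. Dysarthric Speech Conversion Systems Based on ASR-TTS Pipelines
2. Efficient Neural Network Architectures for Speech Processing
Ⅲ. Proposed Method
1. System Overview
2. Robust Model Design through Data Augmentation
1) Device-Specific Noise Augmentation
2) Handling Speech Variability through Speed Perturbation
3. Building a Low-Latency Process with a Streaming ASR Architecture
1) Overall Architecture
2) Block-Wise Streaming Processing
3) Limitations of the Streaming Approach and Parameter Optimization Strategy
4) JETS-Based Speech Synthesis
Ⅳ. Experiments and Results
1. Experimental Environment
1) Hardware and Software Configuration
2) Datasets
2. Performance Evaluation Metrics
1) Evaluation Metrics for the Dysarthric Speech Recognition Model
2) Evaluation Metrics for the Speech Synthesis Model
3. Performance Evaluation of the Dysarthric Speech Recognition Model
1) Comparison Model Configuration
2) Experimental Results
4. Performance Evaluation of the Speech Synthesis Model
1) Comparison Model Configuration
2) Experimental Results
5. Qualitative Analysis and Summary
Ⅴ. System Application: DOA-Based Multi-Speaker Interactive Dialogue System
Ⅵ. Conclusion
References
ABSTRACT
Acknowledgments
Research Achievements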
