Building a Robust Korean Multimodal ASR Model (강건한 한국어 멀티모달 자동 음성인식 모델 구축)

Korean Abstract

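Advances in end-to-end speech recognition models have dramatically improved the performance of automatic speech recognition (ASR), in some domains surpassing human performance. Several challenges nevertheless remain.

Transformer-based attention-based encoder-decoder (AED) architectures degrade sharply when an utterance is longer than about 30 seconds. Studies have proposed ways to keep models robust on long utterances, but most of them redefine the loss function or fine-tune the model, so they share the drawback of requiring retraining.

ASR that relies on audio information is also vulnerable to the many kinds of noise found in the real world. Audio-visual speech recognition (AVSR), which uses both audio features and visual features that are less affected by noise, emerged in response, and many studies have shown robust recognition in noisy environments. AV-HuBERT, trained on the LRS3 and VoxCeleb2 datasets, achieved notable performance by combining features from the two modalities. However, these public datasets are predominantly English, and very few models have been trained on Korean data.

This study identifies cross-attention misalignment as the cause of the performance degradation of AED-based models on long utterances. It resolves the problem by adjusting the cross-attention weight distribution with Gaussian masking so that the decoder attends to the appropriate position at each time step. To find that position during decoding, it uses the position at which the forward probability of Connectionist Temporal Classification (CTC) is maximal. As a result, without retraining, the method improves the word error rate (WER) on LibriSpeech across all utterance lengths of 25 seconds or more, with an error reduction rate of up to 91.10% (33.48% vs 2.98%).

Next, the study builds a Korean audio-visual dataset and model. Only Korean videos published under a Creative Commons License (CCL) are extracted from YouTube and put through strict preprocessing. An AV-HuBERT model is then pre-trained and fine-tuned on the resulting dataset. Using audio only, the model reaches a WER of 17.78%; using both audio and visual information, it reaches 14.93%. These results suggest that visual information helps recognition performance and that the constructed Korean AVSR data is effective for AVSR.

Keywords: AV-HuBERT, AVSR, Cross-attention, Hybrid CTC/Attention, Long-form speech.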
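The abstract describes re-weighting cross-attention with a Gaussian mask centered on a position taken from the CTC forward probabilities. The following is a minimal sketch of that idea for a single decoder step, not the thesis implementation. The function names, the `sigma` value, the choice to apply the mask to already-normalized weights, and the toy `forward_probs` list are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def pick_center(forward_probs):
    """Return the encoder frame index with the largest score.

    `forward_probs` is a hypothetical stand-in for per-frame CTC forward
    probabilities; computing them is outside the scope of this sketch.
    """
    return max(range(len(forward_probs)), key=lambda t: forward_probs[t])

def gaussian_mask(num_frames, center, sigma):
    """Unnormalized Gaussian weights over encoder frames, peaked at `center`."""
    return [math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
            for t in range(num_frames)]

def refocus_attention(weights, center, sigma=1.0):
    """Multiply one decoder step's attention weights by the mask, then renormalize."""
    mask = gaussian_mask(len(weights), center, sigma)
    masked = [w * m for w, m in zip(weights, mask)]
    total = sum(masked)
    if total == 0.0:
        # Degenerate case: keep the original distribution unchanged.
        return list(weights)
    return [v / total for v in masked]

# Toy example: attention has drifted toward frame 7, while the (assumed)
# CTC forward probabilities peak at frame 3.
weights = [0.05, 0.05, 0.10, 0.30, 0.05, 0.05, 0.05, 0.35]
forward_probs = [0.01, 0.02, 0.10, 0.60, 0.15, 0.05, 0.04, 0.03]

center = pick_center(forward_probs)
adjusted = refocus_attention(weights, center, sigma=1.0)
```

In this toy case the mass that had drifted to the last frame is suppressed and the adjusted distribution peaks at the frame chosen from `forward_probs`. A smaller `sigma` concentrates attention more tightly around the center; how the thesis sets this width is not stated in the abstract.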


다국어 초록 (Multilingual Abstract)

Recent advancements in end-to-end (E2E) speech recognition have led to remarkable improvements in automatic speech recognition (ASR) performance, even surpassing human-level accuracy in certain domains. Nevertheless, various challenges remain.

In particular, Transformer-based attention-based encoder-decoder (AED) architectures exhibit a pronounced decline in performance when the spoken utterance is longer than approximately 30 seconds. Several studies have proposed methods to improve robustness on such long utterances, but most of them, for example redefining the loss function or fine-tuning the model, require retraining, which is a notable drawback.

ASR technology, which relies predominantly on acoustic information, also faces significant challenges from the wide variety of noise present in real-world environments. In response, Audio-Visual Speech Recognition (AVSR) techniques have been introduced. They integrate audio features, which tend to degrade in the presence of noise, with visual features, which are relatively less affected by noise, to achieve robust recognition in noisy environments. Notably, AV-HuBERT, trained on the LRS3 and VoxCeleb2 datasets, has demonstrated impressive performance by combining features from both modalities. However, these publicly available datasets are predominantly English, and only a small number of models trained on Korean data currently exist.

In this study, we identify cross-attention misalignment as the main cause of performance degradation in AED-based speech recognition models on long utterances. To address this issue, we employ Gaussian masking to adjust the cross-attention weight distribution so that the decoder focuses on the appropriate positions at each time step. During decoding, we use the position at which the forward probability of Connectionist Temporal Classification (CTC) is maximized to locate the appropriate alignment. As a result, our method improves the word error rate (WER) for utterances exceeding 25 seconds in the LibriSpeech dataset without requiring retraining, achieving up to a 91.10% error reduction rate (from 33.48% to 2.98%).

Next, we construct a Korean audio-visual dataset and train a corresponding model. We extract only Korean-language videos published under the Creative Commons License (CCL) from YouTube, apply rigorous preprocessing, and use the resulting dataset for AV-HuBERT pre-training and fine-tuning. When using only audio information, the model achieves a WER of 17.78%, whereas incorporating both audio and visual information lowers the WER to 14.93%. These results suggest that visual information contributes to improved speech recognition performance and that the constructed Korean AVSR dataset effectively supports AVSR.

Keywords: AV-HuBERT, AVSR, Cross-attention, Hybrid CTC/Attention, Long-form speech.
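The 91.10% figure quoted in the abstract is consistent with the usual definition of a relative error reduction rate, (baseline WER - improved WER) / baseline WER. The short check below assumes that definition, which the abstract does not spell out; the function name is illustrative.

```python
def error_reduction_rate(baseline_wer, improved_wer):
    """Relative reduction in WER, in percent (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer * 100.0

# WER values reported in the abstract for long LibriSpeech utterances.
err = error_reduction_rate(33.48, 2.98)
print(f"{err:.2f}%")
```

Under this definition the result rounds to the 91.10% stated in the abstract.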


Table of Contents

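I. Introduction
II. Related Work
2.1 Long-Form Speech Recognition
2.2 AVSR
III. Improving Long-Form Speech Recognition by Improving Cross-Attention
3.1 Self-Attention
3.2 Cross-Attention
3.3 Improving Cross-Attention
3.4 Experiments and Results
3.4.1 Experimental Setup
3.4.2 Performance of the Baseline Model
3.4.3 Performance of the Model with Improved Cross-Attention
IV. Building a Korean AVSR Dataset
4.1 Data Collection
4.2 Speech Transcription
4.3 Extracting Utterances from Video
4.4 Dataset Characteristics
4.5 ASR Performance
V. Building a Korean AVSR Model
5.1 AV-HuBERT
5.1.1 Fine-Tuning the English Pre-trained Model
5.1.2 Experiments and Results
5.1.3 Korean Pre-training and Fine-Tuning
5.1.4 Experiments and Results
VI. Conclusion
References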
