AvaNet and EvaNet for efficient integration of text, speech, and emotion in 3D avatar creation : towards seamless human-computer interaction

Multilingual Abstract

Creating a 3D avatar using neural networks is a crucial element of human-computer interaction. The objective of 3D facial animation is to generate a 3D avatar whose lip movements are closely synchronized with audio, driven either by text (text-driven) or by speech (speech-driven) input. In this dissertation, two types of models are proposed: a text-driven 3D facial animation model and a speech-driven emotional 3D facial animation model.
One disadvantage of speech-driven 3D facial animation models is that the content cannot be changed without re-recording. To address this issue, such models inevitably require a text-to-speech (TTS) system to synthesize speech from the content. However, this conventional pipeline, which uses a TTS system and an automatic speech recognition (ASR) model to extract context-related information from the speech, incurs significant computational cost and a large number of trainable parameters.
To address these challenges, this dissertation proposes a novel model named AvaNet, which efficiently combines different domains, namely text, speech, and 3D avatar. AvaNet uses the text embedding encoded by the text encoder of the TTS model as an intermediate feature to generate both the speech and the vertices of the 3D mesh. Because the TTS model handles context and prosody elements (intonation, speaking speed, etc.), the proposed model can adjust the 3D facial animation in sync with the synthesized speech. Consequently, AvaNet reduces model size while demonstrating outstanding performance in quantitative experiments and ABX test comparisons.
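The shared-embedding idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class names (`ToyTextEncoder`, `LinearHead`), the dimensions, and the one-frame-per-symbol simplification are all assumptions made for illustration. The only point it shows is that the text is encoded once and the same embedding feeds both a speech head and a vertex head, so no separate ASR model is needed to recover content from the synthesized audio.

```python
import random

random.seed(0)  # deterministic toy weights

EMBED_DIM = 8      # hypothetical embedding size
MEL_DIM = 4        # hypothetical acoustic feature size per frame
NUM_VERTICES = 5   # hypothetical tiny mesh; real face meshes are far larger


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyTextEncoder:
    """Stand-in for a TTS text encoder: one embedding per input symbol."""

    def __init__(self, vocab):
        self.table = {
            symbol: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for symbol in vocab
        }

    def encode(self, text):
        return [self.table[symbol] for symbol in text]


class LinearHead:
    """Stand-in decoder: a single linear projection of each embedding."""

    def __init__(self, out_dim):
        self.weights = [
            [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
            for _ in range(out_dim)
        ]

    def decode(self, embeddings):
        return [matvec(self.weights, e) for e in embeddings]


def synthesize(text, encoder, speech_head, vertex_head):
    shared = encoder.encode(text)           # text is encoded once
    speech = speech_head.decode(shared)     # acoustic features per step
    vertices = vertex_head.decode(shared)   # flattened xyz offsets per step
    return speech, vertices


encoder = ToyTextEncoder("abcdefghijklmnopqrstuvwxyz ")
speech_head = LinearHead(MEL_DIM)
vertex_head = LinearHead(NUM_VERTICES * 3)
speech_frames, vertex_frames = synthesize(
    "hello avatar", encoder, speech_head, vertex_head
)
```

Because both heads read the same sequence, the two outputs have the same number of time steps by construction. A real system would also need duration modeling to expand symbols into frames, which this sketch omits.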
While leveraging a TTS model offers advantages, it requires training on large text-speech pair datasets. Furthermore, for avatars that express emotions, an essential aspect of human interaction, published datasets lack sufficient emotional speech-text pairs. Therefore, the dissertation introduces a speech-driven emotional 3D facial animation model called EvaNet. EvaNet effectively expresses emotions using a limited emotional audio-visual dataset. The model categorizes emotions into four types (anger, happiness, sadness, and neutral) and uses a style embedding extracted from a randomly selected reference avatar belonging to the target emotion. Adjusting the style embedding allows the avatar to vividly convey emotions at varying intensities for both seen and unseen speakers. Additionally, a non-autoregressive model comprising gated activation units (GAUs) and bidirectional long short-term memory (BLSTM) modules is designed to improve inference speed. Quantitative and qualitative experiments validate the proposed model's superior performance. User study evaluations, including mean opinion score (MOS) tests on the overall quality and emotion manipulation of the generated avatar, yield results consistent with the model's effectiveness.
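Two ingredients mentioned above can be sketched in a few lines. Both parts rest on assumptions: the abstract does not define the gated activation unit, so the WaveNet-style form tanh(filter) * sigmoid(gate) is assumed here, and it does not say how the style embedding is adjusted, so linear interpolation from a neutral embedding toward the target emotion is used as one plausible reading. The function names and the example vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter) * sigmoid(gate) (assumed WaveNet-style gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]


def scale_style(neutral_style, emotion_style, intensity):
    """Move from the neutral style toward the emotion style.

    intensity = 0 returns the neutral embedding, 1 returns the emotion
    embedding, and values in between give a weaker expression (assumption).
    """
    return [n + intensity * (e - n) for n, e in zip(neutral_style, emotion_style)]


# Hypothetical 3-dimensional style embeddings.
neutral = [0.0, 0.0, 0.0]
happy = [1.0, -0.5, 2.0]

half_happy = scale_style(neutral, happy, 0.5)
gated = gated_activation([0.0, 1.0], [0.0, 0.0])
```

The gate output lies in (0, 1) and scales the tanh branch element-wise, and the whole unit can be evaluated for every time step at once, which is compatible with the non-autoregressive design the abstract describes.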

Korean Abstract

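Creating a 3D avatar with neural networks is an important element of human-computer interaction. The goal of text-driven and speech-driven 3D facial animation models is to generate, from text and speech inputs, a 3D avatar whose mouth shape is synchronized with the audio. This typically involves using a text-to-speech (TTS) system and an automatic speech recognition (ASR) model to extract relevant information from the speech. However, this conventional pipeline requires considerable computational cost and a large number of trainable parameters. To address these problems, this dissertation proposes a novel model called AvaNet. AvaNet efficiently combines different domains such as text, speech, and 3D avatars. AvaNet uses the text embedding encoded by the text encoder of the TTS model as an intermediate feature to generate both the speech and the vertices of the 3D mesh. By exploiting the TTS model's ability to handle context and intonation, the proposed model can control factors such as the speaking rate of the speech and the 3D avatar. As a result, AvaNet is smaller than existing state-of-the-art models while showing superior performance in objective experiments and ABX tests.
Although using a TTS model offers advantages, the model must be trained on large-scale text-speech pair datasets. However, when creating avatars that express emotion, an important aspect of human interaction, published datasets lack the emotional speech-text pairs needed to train the model. To solve this, this dissertation introduces a speech-driven emotional 3D facial animation model called EvaNet. EvaNet expresses emotions effectively using a limited emotional audio-visual dataset. The model classifies emotions into four types (anger, happiness, sadness, and neutral) and uses a style embedding extracted from a reference avatar belonging to the target emotion. This allows the avatar to vividly convey emotions of varying intensity by adjusting the style embedding. In addition, a non-autoregressive model composed of GAU and BLSTM modules is designed to improve inference speed, and subjective and objective experiments demonstrate the superior performance of the proposed model. User study evaluations, namely mean opinion score (MOS) tests on the overall quality and emotion control of the generated avatar, consistently indicate the model's effectiveness.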

Table of Contents

List of Figures
List of Tables
Abstract
1 Introduction
1.1 Speech-driven 3D facial animation
1.2 Text-driven 3D facial animation
1.3 Speech and Text-driven 3D facial animation
1.4 AvaNet
1.5 Speech-driven emotional 3D facial animation
1.6 EvaNet
2 Background
2.1 3D facial animation
2.1.1 3D face model
2.1.2 Speech-to-Animation (S2A)
2.1.3 FaceFormer [1]
2.1.4 CodeTalker [2]
2.2 Text-to-Speech (TTS)
2.2.1 Autoregressive TTS model
2.2.2 Non-autoregressive TTS model
2.3 Wav2Vec 2.0
3 Proposed Methods: AvaNet and EvaNet
3.1 AvaNet
3.1.1 Text-to-Speech module
3.1.2 Speech-to-Animation module
3.2 EvaNet
3.2.1 Emotion classification
3.2.2 EvaNet: Emotional avatar generator
4 Performance Evaluation
4.1 Experiment: AvaNet
4.1.1 Experiment settings
4.1.2 Vertices Indexing
4.1.3 Experiment results
4.2 Experiment: EvaNet
4.2.1 Experiment settings
4.2.2 Experiment results
5 Conclusion
Korean Summary
