Creating a 3D avatar using neural networks is a crucial element of human-computer interaction. The objective of 3D facial animation is to generate a 3D avatar whose lip movements are closely synchronized with the audio, where the animation is driven either by text (text-driven) or by speech (speech-driven). In this dissertation, two models are proposed: a text-driven 3D facial animation model and a speech-driven emotional 3D facial animation model.
One disadvantage of speech-driven 3D facial animation models is that the content cannot be changed without re-recording the speech. To work from text instead, such models require a text-to-speech (TTS) system to synthesize speech from the content. However, this conventional pipeline, which uses a TTS system followed by an automatic speech recognition (ASR) model to extract context-related information from the synthesized speech, incurs significant computational cost and requires a large number of trainable parameters.
To address these challenges, this dissertation proposes a novel model named AvaNet, which efficiently combines three domains: text, speech, and 3D avatar. AvaNet uses the text embeddings produced by the text encoder of the TTS model as intermediate features to generate both the speech and the vertices of the 3D mesh. Because the TTS model handles context and prosodic elements (intonation, speaking speed, etc.), the proposed model can adjust the 3D facial animation in sync with the synthesized speech. Consequently, AvaNet reduces model size while achieving strong performance in quantitative experiments and ABX test comparisons.
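The shared-encoder idea described above can be illustrated with a toy sketch. This is not AvaNet itself: the class name `SharedEncoderModel`, the dimensions, the random linear heads, and the one-output-frame-per-token simplification are all assumptions made for illustration. The only point it shows is that the text is encoded once and the same intermediate features feed both a speech head and a vertex head.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class SharedEncoderModel:
    """Hypothetical stand-in: one text encoder, two decoder heads."""

    def __init__(self, vocab_size=32, embed_dim=8, mel_dim=4, num_vertices=5, seed=0):
        rng = random.Random(seed)
        # Toy "text encoder": a lookup table of token embeddings.
        self.embedding = [
            [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        # Two heads read the same text features.
        self.speech_head = make_linear(embed_dim, mel_dim, rng)
        self.vertex_head = make_linear(embed_dim, num_vertices * 3, rng)
        self.encoder_calls = 0

    def encode(self, token_ids):
        """Encode the text; counted so the sharing is visible."""
        self.encoder_calls += 1
        return [self.embedding[t] for t in token_ids]

    def forward(self, token_ids):
        hidden = self.encode(token_ids)  # computed once, used by both heads
        speech = [apply_linear(self.speech_head, h) for h in hidden]
        # Each vertex frame holds flattened (x, y, z) values per vertex.
        vertices = [apply_linear(self.vertex_head, h) for h in hidden]
        return speech, vertices


model = SharedEncoderModel()
speech_frames, vertex_frames = model.forward([1, 2, 3])
```

In this sketch no ASR model is needed to recover content from the synthesized audio, because the vertex head reads the text features directly. A real system would also need duration or alignment modeling between tokens and output frames, which is omitted here.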
While leveraging a TTS model offers advantages, it requires training on large datasets of paired text and speech. Furthermore, for avatars that express emotions, an essential aspect of human interaction, published datasets contain few emotional speech-text pairs. Therefore, the dissertation introduces a speech-driven emotional 3D facial animation model called EvaNet. EvaNet expresses emotions effectively using a limited emotional audio-visual dataset. The model categorizes emotions into four types (angry, happy, sad, and neutral) and uses a style embedding extracted from a randomly selected reference avatar belonging to the target emotion. Adjusting the style embedding allows the avatar to convey varying emotional intensity for both seen and unseen speakers. Additionally, a non-autoregressive model comprising gated activation units (GAUs) and bidirectional long short-term memory (BLSTM) modules is designed to improve inference speed. Quantitative and qualitative experiments validate the proposed model’s superior performance from an objective standpoint. User study evaluations, including mean opinion score (MOS) tests on the overall quality and emotion manipulation of the generated avatar, are consistent with these results.
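As a rough illustration of the gated activation units mentioned above, the sketch below assumes the common WaveNet-style form, tanh(filter) multiplied element-wise by sigmoid(gate). The abstract does not state the exact formulation used in EvaNet, so the function names and this gating form are assumptions. Because every frame is computed from its own inputs only, all frames can be processed without waiting on previously generated outputs, which is the property a non-autoregressive design relies on.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter) * sigmoid(gate) for one frame.

    Assumed WaveNet-style gating; filter_pre and gate_pre are the
    pre-activation values of the two branches for a single frame.
    """
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]


def gated_activation_sequence(filter_frames, gate_frames):
    """Apply the gate to every frame independently (no dependence on past outputs)."""
    return [gated_activation(f, g) for f, g in zip(filter_frames, gate_frames)]


# Hypothetical pre-activations for three frames with two channels each.
filters = [[1.0, -1.0], [0.0, 2.0], [3.0, 0.5]]
gates = [[0.0, 0.0], [5.0, -5.0], [-50.0, 50.0]]
outputs = gated_activation_sequence(filters, gates)
```

The sigmoid branch acts as a soft switch: a strongly negative gate suppresses the channel, while a strongly positive gate passes the tanh branch through almost unchanged.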