A survey on talking face generation = 대화형 얼굴 생성에 관한 조사 연구

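Korean Abstract

Talking face generation is an AI-based generative technology that naturally synthesizes a person's lip movements, facial expressions, and gestures from multimodal inputs such as speech, video, and text, and it is emerging as a core area of digital human research. This paper systematically analyzes research trends in this technology and comprehensively reviews the representative generative paradigms and models that have recently appeared.
In particular, focusing on the 2D image-based Wav2Lip, the 3D geometry-based AD-NeRF, the Gaussian representation-based LAM (Large Avatar Model), the unified diffusion model-based EchoMimic V3, and the gesture-extended model EMO2, it compares and analyzes the synthesis pipeline, structural characteristics, and performance metrics of each approach. It also summarizes major datasets such as VFHQ, HDTF, MOSEI, and AVSpeech, along with evaluation metrics, to present both the course of technical development and its limitations.
The analysis confirms that recent research is moving toward efficient 3D representations, multimodal integration, and full-body coordinated generation, and that various research challenges remain, including real-time interaction, emotional consistency, multilingual generalization, and ethical governance. By presenting a comprehensive view of the current state and future direction of digital human technology, this study aims to contribute to broader application in related academic and industrial fields.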


Multilingual Abstract

Talking Face Generation (TFG) is an artificial intelligence-based generative technology that synthesizes natural lip movements, facial expressions, and gestures of human characters from multimodal inputs such as speech, images, and text, and has emerged as a core research area in digital human studies. This paper systematically analyzes recent research trends in TFG and provides a comprehensive review of representative generative paradigms and models.
In particular, we examine and compare approaches including the 2D image–based Wav2Lip, the 3D geometry–based AD-NeRF, the Gaussian representation–based Large Avatar Model, the unified diffusion model–based EchoMimic V3, and the gesture-augmented model EMO2, focusing on their synthesis pipelines, architectural characteristics, and performance metrics. Major datasets and evaluation benchmarks, such as VFHQ, HDTF, MOSEI, and AVSpeech, are also summarized to highlight both the technological progress and the remaining limitations.
Our analysis indicates that recent advances are converging toward efficient 3D representations, multimodal integration, and full-body coordinated generation. At the same time, several open challenges remain, including real-time interactive generation, emotional consistency, multilingual generalization, and ethical governance. By presenting a comprehensive overview of the current landscape and future directions of digital human technologies, this study aims to facilitate broader applications in both academic research and industrial practice.


Table of Contents

1. Introduction
1.1 Research Background and Motivation
1.2 The Core Trajectory of Technological Evolution from Implicit Fields to Explicit Primitives
1.3 Structure and Core Contributions
2. Related Works
2.1 General Talking Head Generation Surveys
2.2 Surveys on Audio-Driven Facial Animation and Multimodal Methods
2.3 Surveys on Specific Generative Architectures and Human Motion Modeling
2.4 Summary
3. The Evolution of Generative Paradigms from Signal Mapping to Unified Embodiment
3.1 From early signal-mapping approaches in 2D and implicit 3D
3.2 The Revolution of Explicit 3D Representations: 3DGS as a Successor to NeRF
3.3 The Integration of Unified Multimodality and Embodied Intelligence
3.4 The Emergence of a Unified Framework
4. Toward a Unified Framework for Talking Face Generation
4.1 Unified Diffusion Models: The Multimodal Engine
4.2 Gaussian Representation Methods: The Explicit 3D Foundation
4.3 Gesture-Enhanced Generative Models: The Embodied Extension
4.4 Comparative Analysis and Methodological Implications
4.4.1 The "Fidelity-Generality-Embodiment" Trilemma
4.4.2 Methodological Implications: Three Fundamental Shifts
4.4.3 Critical Open Challenges
4.5 Summary
5. Datasets and Evaluation Metrics for Multilayer Digital Human Generation
5.1 Datasets
5.2 Evaluation Metrics
5.2.1 The Evaluation Metric System for LAM
5.2.2 The Evaluation Metric System for EMO2
5.2.3 The Evaluation Metric System for EchoMimic V3
5.2.4 Subjective Evaluation Framework
5.3 Summary of This Chapter
5.4 Cross-Layer Gaps and Integrated Benchmarks
5.4.1 The Phenomenon of "Dataset Silos"
5.4.2 The Absence of Standardized Cross-Modal Protocols
5.4.3 Proposal: Toward an Embodied Benchmark Suite
6. Applications of Talking Head Systems
6.1 Virtual Co-presence and Real-time Communication
6.2 Creative Media and Entertainment Production
6.3 Digital Marketing and E-commerce Livestreaming
6.4 Embodied Agents and Service Robots
7. Challenges and Open Problems
7.1 Cross-modal Fusion and Feature Alignment
7.2 Long-term Temporal Coherence
7.3 Affective and Expressive Modeling
7.4 Real-time Deployment and Lightweight Optimization
7.5 Evaluation Standardization and Cross-lingual Generalization
7.6 Model Generalization and Identity Preservation
8. Future Research Directions
8.1 Unified Foundational Models for Digital Humans
8.2 Affective-Consistent Generation
8.3 Multilingual and Cross-Cultural Generalization
8.4 Model Lightweighting and Real-Time Interaction Optimization
8.5 Full-Body Coordinated Speech-Driven Animation
8.6 Technical Standardization and Ethical Governance
9. Conclusion
References
ABSTRACT
