EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head


1 State Key Laboratory for Novel Software Technology, Nanjing University, China,
2 Fudan University, Shanghai, China    3 Huawei Noah's Ark Lab

Abstract

Despite significant progress in the field of 3D talking heads, prior methods still suffer from multi-view inconsistency and a lack of emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. We also present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality.

Trained on the EmoTalk3D dataset, our framework follows a "Speech-to-Geometry-to-Appearance" mapping: it first predicts a faithful 3D geometry sequence from audio features, then synthesizes the appearance of the 3D talking head, represented by 4D Gaussians, from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos and fused to render free-view talking-head animations.

Our model enables controllable emotion in the generated talking heads and can be rendered across a wide range of viewpoints. Our method exhibits improved rendering quality and stability in lip-motion generation while capturing dynamic facial details such as wrinkles and subtle expressions.
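For concreteness, below is a minimal PyTorch sketch of the canonical/dynamic Gaussian fusion described above. All names (GaussianHead, canonical_feat, dynamic_mlp) are our own illustration rather than the released implementation, and per-Gaussian scale and rotation attributes are omitted for brevity.

import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Illustrative sketch (not the authors' code): fuse canonical and
    dynamic Gaussian features. Canonical Gaussians capture static,
    identity-specific appearance; dynamic Gaussians are predicted per
    frame from the 3D geometry and carry motion-dependent detail such
    as wrinkles and subtle expressions."""

    def __init__(self, num_points: int, feat_dim: int = 32):
        super().__init__()
        # Canonical (static) per-point appearance features, learned once.
        self.canonical_feat = nn.Parameter(torch.zeros(num_points, feat_dim))
        # Small MLP predicting per-frame (dynamic) residual features
        # from the predicted 3D point positions.
        self.dynamic_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        # Decoder mapping fused features to Gaussian attributes
        # (here only RGB color + opacity, for brevity).
        self.decoder = nn.Linear(feat_dim, 4)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (num_points, 3) geometry predicted for one frame
        dynamic_feat = self.dynamic_mlp(points)
        fused = self.canonical_feat + dynamic_feat  # simple additive fusion
        return self.decoder(fused)  # (num_points, 4) per-Gaussian attributes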

Method

Overall Pipeline. The pipeline consists of five modules, wired together as in the sketch below: 1) an Emotion-Content Disentangle Encoder that parses content features and emotion features from the input speech; 2) a Speech-to-Geometry Network (S2GNet) that predicts dynamic 3D point clouds from these features; 3) a Gaussian Optimization and Completion module that establishes a canonical appearance; 4) a Geometry-to-Appearance Network (G2ANet) that synthesizes facial appearance from the dynamic 3D point clouds; and 5) a Rendering module that renders the dynamic Gaussians into free-view animations.
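A hedged sketch of how the five modules might be chained at inference time; every callable here (disentangle_encoder, s2gnet, g2anet, renderer) is a hypothetical stand-in for the corresponding module, not the paper's actual API.

def synthesize_talking_head(audio, cameras,
                            disentangle_encoder, s2gnet, g2anet,
                            canonical_gaussians, renderer):
    """Illustrative driver for the five-module pipeline."""
    # 1) Parse content and emotion features from the input speech.
    content_feat, emotion_feat = disentangle_encoder(audio)
    # 2) Predict a dynamic 3D point-cloud sequence from the features.
    point_clouds = s2gnet(content_feat, emotion_feat)
    frames = []
    for points in point_clouds:
        # 4) Synthesize per-frame appearance (dynamic Gaussians) from
        #    the geometry, fused with the canonical Gaussians that
        #    step 3 optimized offline and passed in here.
        gaussians = g2anet(points, canonical_gaussians)
        # 5) Splat the fused Gaussians into the requested camera views.
        frames.append(renderer(gaussians, cameras))
    return frames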

Dataset

We establish the EmoTalk3D dataset, an emotion-annotated multi-view talking-head dataset with per-frame 3D facial shapes. The dataset provides audio, per-frame multi-view images, camera parameters, and the corresponding reconstructed 3D shapes. The data have been released to the public for non-commercial research purposes.
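As an illustration, one frame of such a dataset could be organized as follows; the field names and array shapes are our assumptions for exposition, not the official release format.

from dataclasses import dataclass
import numpy as np

@dataclass
class EmoTalk3DSample:
    """Hypothetical layout of one captured frame; consult the official
    release for the actual file structure and naming."""
    audio: np.ndarray        # waveform segment aligned to this frame
    images: np.ndarray       # (num_views, H, W, 3) multi-view frames
    intrinsics: np.ndarray   # (num_views, 3, 3) camera intrinsics
    extrinsics: np.ndarray   # (num_views, 4, 4) world-to-camera poses
    points: np.ndarray       # (N, 3) reconstructed per-frame 3D shape
    emotion: str             # emotion annotation, e.g. "angry", "happy"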

Data Request

We are preparing the data release; the request procedure will be announced soon. This dataset is for non-commercial research use only, and requests from commercial companies will not be licensed.

Results

Top: Ground Truth    Bottom: Ours    Input Emotion: Angry
Top: Ground Truth    Bottom: Ours    Input Emotion: Disgusted
Top: Ground Truth    Bottom: Ours    Input Emotion: Happy

In-the-wild Audio-driven

Free-viewpoint Animation

Supplementary Video


BibTeX

@inproceedings{he2024emotalk3d,
  title={EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head},
  author={He, Qianyun and Ji, Xinya and Gong, Yicheng and Lu, Yuanxun and Diao, Zhengyu and Huang, Linjia and Yao, Yao and Zhu, Siyu and Ma, Zhan and Xu, Songchen and Wu, Xiaofei and Zhang, Zixiao and Cao, Xun and Zhu, Hao},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}      
}