High-fidelity 3D head avatars are becoming indispensable for emerging spatial computing devices such as Apple’s Vision Pro, where virtual “spatial persona” calls and other immersive experiences demand natural, expressive facial renderings. Yet creating avatars that convincingly display a broad range of emotions, complete with texture cues such as wrinkles, skin-tone variations, and shadows, remains largely unsolved. Traditional 3D pipelines based on purely geometric blendshapes, as well as one-shot 2D editing methods, often omit these subtleties, producing expressions that look hollow or become inconsistent when viewed from different positions and angles.
To tackle this gap, we propose Emotion-Driven Editing of Gaussian Avatars (EMO-GA), a pipeline designed to create expressive and emotionally rich 3D avatars. The process begins by transforming neutral or mildly expressive frames into “pseudo” emotion exemplars via text-prompted diffusion (e.g., UltraEdit), injecting details like deeper cheek contours for happiness or shading shifts for anger. Next, photometric FLAME-based tracking is applied to multi-view data, producing accurate head geometry to which the Gaussian splats are rigged. A baseline Gaussian avatar is then constructed, augmented with a color MLP and patch-based photometric constraints to capture both global shape and fine texture cues. Finally, we optimize the shared FLAME expression offsets and per-Gaussian texture features against the noisy “pseudo” edits generated by diffusion models. This optimization involves a delicate balance: while geometry updates capture the overall structure of the expression, texture adjustments are critical for fine details like shading, lip color, and subtle skin-tone shifts. To avoid over-fitting to diffusion inconsistencies and to maintain spatiotemporal consistency, we share expression offsets across frames and alternate cycles of geometry and texture refinement. By incorporating robust regularization strategies, such as view weighting and patch-based photometric losses, the pipeline ensures that geometry and texture converge coherently to the desired emotional state without introducing noticeable flicker or identity loss.
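To make the optimization step concrete, the following minimal sketch outlines how a shared FLAME expression offset and per-Gaussian texture features could be fitted against diffusion pseudo-targets with a patch-based photometric loss and per-view weighting. It assumes a PyTorch-style setup; `GaussianAvatar`, its `render` and `gaussian_color_features` methods, and the data containers are hypothetical placeholders for illustration, not the thesis code.

```python
# Minimal sketch (not the thesis code) of the joint geometry/texture fit.
# All avatar-specific names below are hypothetical placeholders.
import torch
import torch.nn.functional as F


def patch_photometric_loss(pred, target, patch=16):
    """Mean L1 over non-overlapping patches: a simple way to emphasise local
    texture cues (wrinkles, shading, lip color) rather than only global color."""
    p_pred = F.unfold(pred, kernel_size=patch, stride=patch)   # (B, C*p*p, L)
    p_tgt = F.unfold(target, kernel_size=patch, stride=patch)
    return (p_pred - p_tgt).abs().mean()


def optimize_emotion_edit(avatar, frames, pseudo_targets, view_weights,
                          n_iters=2000, lr_geo=1e-3, lr_tex=5e-3, reg=1e-2):
    """frames: (flame_params, camera) per training view/frame;
    pseudo_targets: diffusion-edited images aligned with `frames`;
    view_weights: per-view reliability weights for the noisy edits."""
    # A single expression offset shared across ALL frames: this is what keeps
    # the edit temporally consistent instead of chasing each noisy diffusion frame.
    expr_offset = torch.zeros(avatar.n_expr_coeffs, requires_grad=True)
    # Per-Gaussian texture features that feed the color MLP.
    base_feats = avatar.gaussian_color_features()
    tex_feats = base_feats.clone().requires_grad_(True)

    opt = torch.optim.Adam([
        {"params": [expr_offset], "lr": lr_geo},   # geometry update
        {"params": [tex_feats], "lr": lr_tex},     # texture update
    ])

    for _ in range(n_iters):
        opt.zero_grad()
        loss = torch.zeros(())
        for (flame_params, cam), target, w in zip(frames, pseudo_targets, view_weights):
            # Geometry: FLAME expression plus the shared learned offset drives
            # the rigged Gaussians; texture: the color MLP reads tex_feats.
            pred = avatar.render(flame_params, cam,
                                 expr_offset=expr_offset,
                                 color_features=tex_feats)
            loss = loss + w * patch_photometric_loss(pred, target)
        # Regularization limits identity drift and over-fitting to
        # inconsistencies in the diffusion pseudo-targets.
        loss = loss + reg * expr_offset.pow(2).sum()
        loss = loss + reg * (tex_feats - base_feats).pow(2).mean()
        loss.backward()
        opt.step()
    return expr_offset, tex_feats
```

In the actual pipeline the two parameter groups are refined in alternating geometry and texture cycles rather than strictly jointly, but the structure of the objective, with a shared expression offset, per-view weighting, and a patch-based photometric term, follows the description above.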
Our results show that geometry alone fails to capture the richness of facial affect; textural refinement is indispensable for natural emotional expressions. EMO-GA integrates these elements into a single, view-consistent representation, eliminating the need for massive labeled 3D emotion corpora. By achieving nuanced expression edits and preserving multi-view fidelity, this pipeline paves the way for more authentic avatar interactions in spatial FaceTime calls, virtual collaboration platforms, and other scenarios where subtle emotional detail truly matters.
Baseline comparison: from left to right, the figure shows the neutral input sequence, three 2D diffusion-based baselines, a 3D geometry-only baseline (EMOTE), and our EMO-GA output for the target emotion Angry. The 2D diffusion methods markedly distort the subject’s identity and disrupt temporal coherence. EMOTE optimizes geometry but lacks texture refinement and therefore fails to reproduce the fine surface details needed to express emotion. In contrast, EMO-GA preserves identity and multi-view consistency while introducing subtle texture cues, such as forehead wrinkles, that clearly convey anger. For more such results, please refer to the presentation above.
Additional subjects from the dataset are shown: from left to right, each row presents the neutral input sequence followed by EMO-GA outputs for the target emotions Happy, Angry, Sad, and Surprised. We tested many more subjects from the dataset; for further results, please refer to the presentation above, and feel free to contact us with any queries.
Additional result: effect of photometric color matching. (a) Baseline avatar; (b) EMO-GA capturing blush from the diffusion edit when prompted for makeup on the forehead and cheeks. Although not our focus, this shows how photometric cues in diffusion images can yield simple makeup-like texture edits, allowing the avatar’s facial texture to change dramatically without altering its identity.
@mastersthesis{utkarsh2025emoga,
title = {Emotion-Driven Editing of Gaussian Avatars},
author = {Utkarsh, Abhinav},
school = {Technical University of Munich},
type = {Master’s Thesis},
address = {Munich, Germany},
year = {2025},
month = {February},
url = {https://abhinavutkarsh.com/Emotion-Driven-Editing-of-Gaussian-Avatars/static/pdfs/Emotion-Driven Editing of Gaussian Avatars.pdf},
note = {Visual Computing \& Artificial Intelligence Lab}
}