We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme that integrates audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions that balance global textual control and local audio guidance, supporting speaker-specific animation. Efficiency is enhanced via our unified step and CFG distillation techniques, achieving a 20× inference speed boost over the base model, generating a 10-second 540×540 video in 10 seconds, or 720×720 in 30 seconds, on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
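The sliding-window denoising strategy is not specified in detail here; as a rough illustration of the scheduling idea only, overlapping frame windows can be laid out as below, where consecutive windows share a few frames so that motion stays temporally coherent across chunks. The window size, overlap, and function name are assumptions, not the paper's actual parameters.

```python
def sliding_windows(num_frames, window=16, overlap=4):
    """Return (start, end) frame-index ranges for chunked denoising.

    Consecutive windows share `overlap` frames, so each new chunk is
    denoised conditioned on frames it shares with the previous one,
    which is one way to keep an arbitrarily long video coherent.
    """
    assert 0 <= overlap < window
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += step
    return windows


# Example: a 40-frame video split into 16-frame windows with 4-frame overlap.
print(sliding_windows(40, window=16, overlap=4))
```

Because the last window is clamped to the total frame count, every frame is covered exactly, regardless of video length.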
MagicInfinite supports video synthesis of arbitrary lengths and can rapidly generate videos longer than one minute.
MagicInfinite supports a diverse range of character styles, including anime, cartoon, and unique artistic styles.
MagicInfinite enables the animation of a wide variety of cartoon characters and animals, generating dynamic, expressive, and natural anime-style videos.
MagicInfinite also supports portrait images of characters with unique artistic styles, accommodating various facial orientations and character scales, including close-up, half-body, and full-body representations.
MagicInfinite supports the animation of reference images featuring diverse facial and body orientations. This includes individuals facing forward, in profile, or even with near-backward postures.
MagicInfinite supports character animations of varying scales, ranging from ultra-close-up portrait images that fully occupy the screen to half-body and full-body representations.
MagicInfinite employs a hybrid dual-to-single-stream denoising network with Audio Cross-Attention in the final blocks. An MLLM encodes the static portrait and text prompt into tokens, which are concatenated for text-to-video (T2V) generation, then refined and denoised. Wav2Vec encodes the driving audio, which is resampled by an Audio Encoder and injected under the guidance of a Face Region Mask, enabling precise lip sync and an adaptive loss.
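To make the Face Region Mask's role concrete, here is a minimal NumPy sketch of mask-gated audio cross-attention: video tokens attend to audio tokens, and the mask scales how strongly the audio-derived features are added back per token, so lip regions receive strong audio guidance while other regions stay under textual control. The tensor shapes, gating scheme, and function names are illustrative assumptions, not the released architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def audio_cross_attention(video_tokens, audio_tokens, face_mask, d=64):
    """Mask-gated cross-attention from video tokens to audio tokens.

    video_tokens: (Nv, d) queries (spatio-temporal video latents)
    audio_tokens: (Na, d) keys/values (resampled Wav2Vec features)
    face_mask:    (Nv,) in [0, 1]; 1 = face region, 0 = background
    """
    scores = video_tokens @ audio_tokens.T / np.sqrt(d)  # (Nv, Na)
    attn = softmax(scores, axis=-1)
    audio_ctx = attn @ audio_tokens                      # (Nv, d)
    # The face-region mask gates the residual audio update:
    # tokens outside the face pass through unchanged.
    return video_tokens + face_mask[:, None] * audio_ctx


rng = np.random.default_rng(0)
v = rng.standard_normal((6, 64))
a = rng.standard_normal((3, 64))
mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
out = audio_cross_attention(v, a, mask)
print(out.shape)  # (6, 64)
```

The same mask can weight the training loss, emphasizing reconstruction accuracy around the mouth while leaving global motion to the text conditioning.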
@article{yi2025magic,
  title={MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice},
  author={Hongwei Yi and Tian Ye and Shitong Shao and Xuancheng Yang and Jiantong Zhao and Hanzhong Guo and Terrance Wang and Qingyu Yin and Zeke Xie and Lei Zhu and Wei Li and Michael Lingelbach and Daquan Zhou},
  journal={to be updated},
  year={2025}
}