MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, using input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animation. Efficiency is enhanced via our innovative unified step and CFG distillation techniques, achieving a 20× inference speed boost over the base model (generating a 10-second 540x540p video in 10 seconds, or 720x720p in 30 seconds, on 8 H100 GPUs) without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite’s superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
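To give a rough sense of why distilling both the step count and classifier-free guidance (CFG) can reduce inference cost, the sketch below counts network forward passes for a hypothetical teacher and student. The guidance formula is the standard CFG combination; the step counts (40 and 8) are made-up illustrative values, not figures from the paper, and the function names are our own.

```python
def guided_prediction(cond, uncond, scale):
    """Standard CFG: move the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(num_steps, passes_per_step):
    """Total network evaluations needed for one sampling run."""
    return num_steps * passes_per_step


# Hypothetical teacher: CFG needs a conditional and an unconditional pass per step.
teacher_cost = forward_passes(num_steps=40, passes_per_step=2)

# Hypothetical student: fewer steps (step distillation) and a single pass
# per step because guidance is baked into the model (CFG distillation).
student_cost = forward_passes(num_steps=8, passes_per_step=1)

print(teacher_cost, student_cost, teacher_cost / student_cost)
```

Forward-pass counts are only a proxy: wall-clock speedups also depend on factors such as resolution, attention cost, and multi-GPU communication.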

MagicInfinite: More Demo Results

Infinitely Long Video

MagicInfinite supports video synthesis of arbitrary lengths and can rapidly generate videos longer than one minute.
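A minimal sketch of how sliding-window denoising can cover a sequence of arbitrary length: the frames are split into overlapping windows, each window is denoised on its own, and frames that fall in more than one window are merged. Averaging the overlaps is our assumption for illustration; this page does not say how MagicInfinite merges them. Frames are plain floats here, and `denoise_window` is a hypothetical stand-in for the model.

```python
def sliding_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges that cover num_frames with the given overlap."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            return ranges
        start += stride


def denoise_long(latents, window, overlap, denoise_window):
    """Apply one denoising step window by window, averaging overlapping frames."""
    sums = [0.0] * len(latents)
    counts = [0] * len(latents)
    for start, end in sliding_windows(len(latents), window, overlap):
        out = denoise_window(latents[start:end])
        for offset, value in enumerate(out):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [s / c for s, c in zip(sums, counts)]


def halve(chunk):
    """Toy stand-in for the denoiser: scales every frame by 0.5."""
    return [0.5 * x for x in chunk]


print(sliding_windows(10, window=4, overlap=2))
print(denoise_long([2.0] * 10, window=4, overlap=2, denoise_window=halve))
```

Because each call only ever sees `window` frames, memory stays bounded no matter how long the full sequence is, while the overlap gives neighbouring windows shared context.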

Diverse Character Styles

MagicInfinite supports a diverse range of character styles, including anime, cartoon and unique artistic styles.

Anime And Cartoon Style

MagicInfinite animates a wide variety of cartoon characters and animals, generating dynamic, expressive, and natural-looking anime-style videos.

Special Artistic Styles

MagicInfinite also supports portrait images of characters with unique artistic styles, accommodating various facial orientations and character scales, including close-up, half-body, and full-body representations.

Diverse Facial And Body Orientations

MagicInfinite supports the animation of reference images featuring diverse facial and body orientations, including subjects facing forward, in profile, or even turned nearly away from the camera.

Diverse Character Animation Scales

MagicInfinite supports character animations of varying scales, ranging from ultra-close-up portrait images that fully occupy the screen to half-body and full-body representations.

Architecture Overview

MagicInfinite employs a hybrid dual-to-single-stream denoising network with Audio Cross-Attention in the final blocks. An MLLM encodes the static portrait and the text into tokens, which are concatenated for T2V, refined, and denoised. Wav2Vec encodes the audio; the resulting features are resampled by an Audio Encoder and guided by a Face Region Mask for precise lip sync and adaptive loss.
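One simple way a face region mask can enter a training loss is as a per-element weight, so that errors inside the face count more than errors elsewhere. The sketch below shows that idea with a fixed weight; it is an assumption for illustration only, since this page does not give the form of MagicInfinite's adaptive loss. The function name and the `face_weight` parameter are hypothetical.

```python
def region_weighted_mse(pred, target, face_mask, face_weight=2.0):
    """Mean squared error in which masked (face) elements are up-weighted.

    pred, target: flat lists of floats of equal length.
    face_mask: flat list with 1 inside the face region and 0 outside.
    face_weight: hypothetical weight applied inside the face region
                 (elements outside the region get weight 1).
    """
    if not (len(pred) == len(target) == len(face_mask)):
        raise ValueError("pred, target and face_mask must have the same length")
    total = 0.0
    weight_sum = 0.0
    for p, t, m in zip(pred, target, face_mask):
        w = 1.0 + (face_weight - 1.0) * m
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


# The same error costs more when it falls inside the face region.
inside = region_weighted_mse([1.0, 0.0], [0.0, 0.0], [1, 0], face_weight=3.0)
outside = region_weighted_mse([1.0, 0.0], [0.0, 0.0], [0, 1], face_weight=3.0)
print(inside, outside)
```

Normalising by the sum of weights keeps the loss on the same scale regardless of how large the face region is.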

BibTeX

@article{yi2025magic,
      title={MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice},
      author={Hongwei Yi and Tian Ye and Shitong Shao and Xuancheng Yang and Jiantong Zhao and Hanzhong Guo and Terrance Wang and Qingyu Yin and Zeke Xie and Lei Zhu and Wei Li and Michael Lingelbach and Daquan Zhou},
      journal={to be updated},
      year={2025}
    }