Abstract

Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, the Identifier Assigner and the Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, together with a scalable training strategy, not only enhances flexibility but also generalizes to scenarios with more characters than seen during training. Remarkably, although trained on only a two-character dataset, our model generalizes to multi-character animation while remaining compatible with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

Method Overview

We present MultiAnimate, a framework for multi-character image animation. To the best of our knowledge, it is the first extensible framework of its kind built upon modern DiT-based video generators.

Method Pipeline
Figure 1: Our pipeline contains two main streams: the reference stream, which encodes the reference image and its pose to capture appearance information, and the motion stream, which encodes multi-character pose sequences and tracking masks to model motion and spatial conditions. The two streams are fused through element-wise addition of latent tokens. The Identifier Assigner unifies the per-person tracking masks into a structured label representation that preserves spatial relationships and interactions among multiple characters. The Identifier Adapter then projects this representation into the feature space of the DiT backbone.
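To make the mask-driven conditioning concrete, the sketch below illustrates one plausible reading of the Identifier Assigner and Identifier Adapter: per-person binary tracking masks are fused into a single integer label map (0 = background, k = person k), which is then mapped to dense features via a lookup table. The overlap rule (later mask wins), the toy embedding table, and all function names here are our assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def assign_identifiers(masks):
    """Fuse per-person binary tracking masks [N, H, W] into one integer
    label map [H, W]: 0 = background, k = person k.
    Overlap resolution (later mask wins) is an assumption of this sketch."""
    n, h, w = masks.shape
    labels = np.zeros((h, w), dtype=np.int64)
    for k in range(n):
        labels[masks[k] > 0] = k + 1
    return labels

def embed_labels(labels, num_ids, dim, seed=0):
    """Toy stand-in for the Identifier Adapter: map each integer label
    to a d-dim vector via a lookup table (random here; learned in practice)."""
    table = np.random.default_rng(seed).standard_normal((num_ids + 1, dim))
    return table[labels]  # [H, W, dim]

# Two overlapping 4x4 masks: person 1 covers columns 0-1, person 2 columns 1-2.
m1 = np.zeros((4, 4)); m1[:, :2] = 1
m2 = np.zeros((4, 4)); m2[:, 1:3] = 1
labels = assign_identifiers(np.stack([m1, m2]))
feats = embed_labels(labels, num_ids=2, dim=8)
```

Because the label map is a single structured tensor rather than a fixed set of per-person channels, the same representation accommodates any number of characters, which is consistent with the extensibility the framework claims.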

Demo & Results

Our framework, trained only on two-character data, produces identity-consistent videos with three or more characters, generalizing to scenarios with up to seven participants as shown below.

Figure 2: Two-character animation.
Figure 3: Three-character animation.
Figure 4: Four-character animation.
Figure 5: Five-character animation.
Figure 6: Six-character animation.
Figure 7: Seven-character animation.

BibTeX


@article{hu2026multianimateposeguidedimageanimation,
  title={MultiAnimate: Pose-Guided Image Animation Made Extensible},
  author={Yingcheng Hu and Haowen Gong and Chuanguang Yang and Zhulin An and Yongjun Xu and Songhua Liu},
  year={2026},
  eprint={2602.21581},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.21581},
}