VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

1Hong Kong University of Science and Technology, 2Peking University, 3University of Hong Kong, 4National University of Singapore, 5University of Central Florida, 6Everlyn AI
Teaser Image

Illustration of VideoGen-of-Thought (VGoT). (a) Comparison of existing methods with VGoT in multi-shot video generation. Existing methods struggle to maintain consistency and logical coherence across multiple shots, while VGoT effectively addresses these challenges through a multi-shot generation approach. (b) Overview of our proposed framework VGoT, which consists of the Script Module, which generates detailed shot descriptions spanning five domains; the Keyframe Module, which creates keyframes from the scripts; the Shot-Level Video Module, which synthesizes video latents conditioned on keyframes and scripts; and the Smooth Module, which ensures seamless transitions across shots, resulting in a cohesive video narrative.

Abstract

Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. Existing models, despite being trained on large-scale data with rich computational resources, are unsurprisingly inadequate for maintaining a logical storyline and visual consistency across multiple shots of a cohesive script, since they are often trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind. Multi-Shot Video Generation: We divide the video generation process into a structured, modular sequence comprising (1) Script Generation, which translates a brief story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms the information from scripts and keyframes into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: Inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: We ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism with a reset boundary that effectively combines latent features from adjacent shots, yielding smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos. The code will be made publicly available.
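To make the four-stage pipeline concrete, below is a minimal Python sketch of how the stages hand off to one another. All names (ShotScript, generate_scripts, vgot, etc.) are hypothetical placeholders chosen for illustration, not the released API; each stage is stubbed out rather than calling real diffusion models.

```python
# Hypothetical sketch of the VGoT pipeline described above; names and stubs are
# placeholders, not the authors' actual implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class ShotScript:
    """Per-shot description over the five cinematic domains."""
    character: str
    background: str
    relation: str
    camera_pose: str
    lighting: str  # HDR lighting description

def generate_scripts(story: str, num_shots: int) -> List[ShotScript]:
    """Script Generation (stub): expand a brief story into per-shot prompts."""
    return [ShotScript("?", "?", "?", "?", "?") for _ in range(num_shots)]

def generate_keyframes(scripts: List[ShotScript]) -> List[str]:
    """Keyframe Generation (stub): text-to-image synthesis conditioned on
    identity-preserving (IP) embeddings derived from the narrative."""
    return [f"keyframe_{i}" for i, _ in enumerate(scripts)]

def generate_shot_videos(scripts: List[ShotScript], keyframes: List[str]) -> List[str]:
    """Shot-Level Video Generation (stub): synthesize video latents conditioned
    on each shot's keyframe and script."""
    return [f"shot_latents_{i}" for i, _ in enumerate(zip(scripts, keyframes))]

def smooth_shots(shot_latents: List[str]) -> List[str]:
    """Smoothing Mechanism (stub): combine latents at adjacent-shot boundaries
    so transitions remain seamless."""
    return shot_latents

def vgot(story: str, num_shots: int = 5) -> List[str]:
    scripts = generate_scripts(story, num_shots)
    keyframes = generate_keyframes(scripts)
    shots = generate_shot_videos(scripts, keyframes)
    return smooth_shots(shots)

if __name__ == "__main__":
    print(vgot("A painter grows old in a seaside town.", num_shots=3))
```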

Video

Method: VideoGen-of-Thought

Teaser Image

The flowchart of VideoGen-of-Thought. Left: Shot descriptions are generated from user prompts, covering attributes including character details, background, relations, camera pose, and HDR lighting. Pre-shot descriptions provide broader context for the upcoming scenes. Middle top: Keyframes are generated using a text-to-image diffusion model conditioned on identity-preserving (IP) embeddings, which ensures consistent representation of characters throughout the shots. IP portraits help maintain visual identity consistency. Right: Shot-level video clips are generated from the keyframes, followed by shot-by-shot smoothing inference to ensure temporal consistency across different shots. This collaborative framework ultimately produces a cohesive, narrative-driven video.
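As a rough illustration of the cross-shot smoothing step, the sketch below cross-fades the boundary frame latents of two adjacent shots before concatenating them. This is an assumption-laden simplification: the paper's reset-boundary rule may differ, and blend_boundary and its parameters are invented for this example.

```python
# Illustrative only: a simple latent cross-fade at the boundary between two
# adjacent shots. Not the authors' exact reset-boundary mechanism.
import numpy as np

def blend_boundary(prev_shot: np.ndarray, next_shot: np.ndarray, k: int = 4) -> np.ndarray:
    """Linearly interpolate the last k frame latents of prev_shot with the
    first k of next_shot, then concatenate the result into one sequence.
    Both inputs have shape (frames, channels, height, width)."""
    w = np.linspace(0.0, 1.0, k).reshape(k, 1, 1, 1)          # blend weights per frame
    blended = (1 - w) * prev_shot[-k:] + w * next_shot[:k]    # cross-fade in latent space
    return np.concatenate([prev_shot[:-k], blended, next_shot[k:]], axis=0)

# Usage with dummy latents (16 frames per shot, 4 latent channels, 32x32):
a = np.random.randn(16, 4, 32, 32)
b = np.random.randn(16, 4, 32, 32)
video_latents = blend_boundary(a, b)  # shape (28, 4, 32, 32)
```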


Gallery


Visual comparison of VGoT and baselines:

Teaser Image

Visual showcases of VGoT generated multi-shot videos:

Teaser Image

BibTeX

@misc{zheng2024videogenofthoughtcollaborativeframeworkmultishot,
      title={VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation}, 
      author={Mingzhe Zheng and Yongqi Xu and Haojian Huang and Xuran Ma and Yexin Liu and Wenjie Shu and Yatian Pang and Feilong Tang and Qifeng Chen and Harry Yang and Ser-Nam Lim},
      year={2024},
      eprint={2412.02259},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}