Current video generation models excel at producing short clips but still struggle to create multi-shot, movie-like videos. Existing models, even when trained on large-scale data with substantial computational resources, remain inadequate at maintaining a logical storyline and visual consistency across the multiple shots of a cohesive script, since they are typically trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed around three goals. Multi-Shot Video Generation: we divide the video generation process into a structured, modular sequence comprising (1) Script Generation, which translates a brief story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms script and keyframe information into shot-level videos; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: we ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. In addition, we incorporate a cross-shot smoothing mechanism, which integrates a reset boundary that combines latent features from adjacent shots, producing smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos. The code will be made publicly available.
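To make the modular flow concrete, the following is a minimal sketch of how the four modules could be chained. All interfaces here (generate_script, generate_keyframe, generate_shot_video, and the ShotPrompt fields) are hypothetical placeholders standing in for an LLM, a text-to-image diffusion model, and an image-to-video model; they are not the released implementation.

from dataclasses import dataclass

@dataclass
class ShotPrompt:
    # The five scriptwriting domains generated per shot (names assumed).
    character: str
    background: str
    relation: str
    camera_pose: str
    lighting: str

# Placeholder stubs; each would wrap a real generative model in practice.
def generate_script(story: str, num_shots: int) -> list[ShotPrompt]:
    return [ShotPrompt("", "", "", "", "") for _ in range(num_shots)]

def generate_keyframe(shot: ShotPrompt, ip_embedding: object) -> object:
    return object()  # a keyframe image would be returned here

def generate_shot_video(keyframe: object, shot: ShotPrompt) -> object:
    return object()  # a short clip (frame tensor) would be returned here

def video_gen_of_thought(story: str, num_shots: int = 30) -> list[object]:
    shots = generate_script(story, num_shots)           # (1) script generation
    ip_embedding = object()                             # IP portrait embedding stub
    keyframes = [generate_keyframe(s, ip_embedding)     # (2) keyframe generation
                 for s in shots]
    clips = [generate_shot_video(k, s)                  # (3) shot-level videos
             for k, s in zip(keyframes, shots)]
    # (4) a smoothing pass over adjacent clips (sketched separately below)
    # would then blend shot boundaries before final assembly.
    return clips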
The flowchart of VideoGen-of-Thought. Left: shot descriptions are generated from the user prompt, covering attributes such as character details, background, relations, camera pose, and lighting (HDR); pre-shot descriptions provide broader context for the upcoming scenes. Middle Top: keyframes are generated with a text-to-image diffusion model conditioned on identity-preserving (IP) embeddings, ensuring characters are represented consistently across shots; IP portraits help maintain visual identity consistency. Right: shot-level video clips are generated from the keyframes, followed by shot-by-shot smooth inference to ensure temporal consistency across shots. Together, this collaborative framework produces a cohesive, narrative-driven video.
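The paper's exact reset-boundary rule is not reproduced here; the sketch below is a minimal linear cross-fade of diffusion latents in PyTorch, illustrating only the general idea of combining latent features from adjacent shots at their boundary.

import torch

def smooth_shot_boundary(latents_a: torch.Tensor,
                         latents_b: torch.Tensor,
                         blend_frames: int = 4):
    """Cross-fade the latents of two adjacent shots around their cut.

    latents_a, latents_b: (frames, channels, height, width) tensors of
    consecutive shots. This linear blend is an illustrative stand-in for
    VGoT's reset-boundary mechanism, not the paper's exact rule.
    """
    # Blend weights ramp from shot A (w=0) to shot B (w=1) across the window.
    w = torch.linspace(0.0, 1.0, steps=blend_frames).view(-1, 1, 1, 1)
    blended = (1 - w) * latents_a[-blend_frames:] + w * latents_b[:blend_frames]
    latents_a, latents_b = latents_a.clone(), latents_b.clone()
    latents_a[-blend_frames:] = blended  # end of shot A eases toward B
    latents_b[:blend_frames] = blended   # start of shot B picks up from A
    return latents_a, latents_b

# Example: two 16-frame shots with 4-channel 64x64 latents.
shot_a, shot_b = torch.randn(16, 4, 64, 64), torch.randn(16, 4, 64, 64)
shot_a, shot_b = smooth_shot_boundary(shot_a, shot_b)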
Method
"A set of one-sentence prompts, 30 shots, describe the story of Marco, a chef who stumbles upon ancient secrets through culinary discoveries." |
"A set of one-sentence prompts, 30 shots, describe the life of Dr. Sarah, a scientist, dedicated to finding a cure for a rare disease." |
|
||
|
||
|
||
|
||
|
"A set of one-sentence prompts, 30 shots, describe the story of Marco, a chef who stumbles upon ancient secrets through culinary discoveries." |
"A set of one-sentence prompts, 30 shots, describe the story of Marco, a chef who stumbles upon ancient secrets through culinary discoveries." |
"A set of one-sentence prompts, 30 shots, describe the life of Dr. Sarah, a scientist, dedicated to finding a cure for a rare disease." |
"A set of one-sentence prompts, 30 shots, describe the life of Olivia, an ambitious fashion designer, from her first sketches to her iconic fashion show." |
"A set of one-sentence prompts, 30 shots, describe the life of Lily, a young pianist, as she grows from a novice to a world-class musician. " |
"A set of one-sentence prompts, 30 shots, describe the journey of Isaac, an AI developer who accidentally creates an AI with human emotions and a consciousness of its own." |
@article{zheng2024videogen,
title={VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation},
author={Zheng, Mingzhe and Xu, Yongqi and Huang, Haojian and Ma, Xuran and Liu, Yexin and Shu, Wenjie and Pang, Yatian and Tang, Feilong and Chen, Qifeng and Yang, Harry and others},
journal={arXiv preprint arXiv:2412.02259},
year={2024}
}