Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant obstacle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
Single-shot camera control models encode camera extrinsics into pretrained video models. We adapt them for multi-shot evaluation by applying them shot by shot and concatenating the results using our calibration pipeline.
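The shot-by-shot adaptation can be sketched as follows (a minimal illustrative toy, not the actual evaluation code; the function names and the nested-list pose representation are our own assumptions). Each shot is generated independently with a trajectory in its local frame, and a per-shot transform from the calibration pipeline re-expresses all poses in one global frame before concatenation.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def align_shots(shot_trajectories, shot_to_global):
    """shot_trajectories: list of shots, each a list of 4x4 camera-to-world
    poses in that shot's local frame. shot_to_global: one 4x4 transform per
    shot (from the calibration pipeline) mapping local to global coordinates.
    Returns a single flat trajectory in the global frame."""
    aligned = []
    for poses, T in zip(shot_trajectories, shot_to_global):
        aligned.extend(matmul4(T, P) for P in poses)
    return aligned
```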
We compare ShotVerse against open-source multi-shot generators. HoloCine is a holistic baseline with explicit shot structure, while MultiShotMaster is another open-source multi-shot video model evaluated under the same prompts.
Leading closed-source models rely on implicit textual control. We provide them with our hierarchical prompts to evaluate their zero-shot cinematic understanding.
Camera encoder is necessary for controllability. Without the camera encoder, the model follows the intended motion pattern less reliably, whereas adding the encoder yields clearer, more stable camera behavior.
4D RoPE captures shot hierarchy. Replacing 4D RoPE with 3D RoPE significantly degrades Shot Transition Accuracy, demonstrating that the explicit shot axis is critical for respecting shot boundaries.
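One plausible way such a 4D rotary embedding can be structured is sketched below (illustrative only, not the paper's implementation): the channel dimension is split into four groups, and each group is rotated by one positional axis (shot index, frame, height, width). The dedicated shot axis is what makes cut boundaries explicit to attention; ablating it to 3D RoPE collapses the shot and frame positions into one axis.

```python
import math

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate consecutive pairs of an
    even-length feature vector x by an angle depending on pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def rope_4d(x, shot, t, h, w):
    """Sketch of a 4D RoPE: split channels into four equal groups and rotate
    each group by one axis. The explicit shot index gives attention a direct
    signal for respecting shot boundaries."""
    d = len(x) // 4
    groups = [x[0:d], x[d:2 * d], x[2 * d:3 * d], x[3 * d:]]
    positions = [shot, t, h, w]
    return sum((rope_1d(g, p) for g, p in zip(groups, positions)), [])
```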
Unified camera calibration is necessary. Removing global coordinate calibration reduces inter-shot consistency and aesthetics, supporting that unified coordinates are important for geometrically consistent pose conditioning across cuts.
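The effect of unified calibration can be illustrated with a positions-only toy (rotations and scale are omitted for brevity; the real pipeline calibrates full poses): per-shot trajectories, each expressed relative to its own first frame, are chained so that every shot continues from where the previous one ended, yielding one geometrically consistent global trajectory. All names here are our own.

```python
def unify(shots):
    """shots: list of trajectories, each a list of 3D camera positions
    relative to that shot's first frame. Chains shots so shot k starts
    where shot k-1 ended, returning one global trajectory."""
    anchor, out = (0.0, 0.0, 0.0), []
    for positions in shots:
        for p in positions:
            out.append(tuple(a + q for a, q in zip(anchor, p)))
        anchor = out[-1]  # next shot continues from this shot's last pose
    return out
```

Without this step (the ablated setting), each shot's poses are conditioned in an unrelated local frame, so the generator receives geometrically inconsistent pose signals across cuts.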
Synthetic supervision hurts film-like rendering. Aesthetics drops noticeably and semantic alignment weakens slightly, suggesting that real cinematic triplets provide crucial composition and lighting cues beyond what synthetic triplets capture.
High-noise-only injection is largely sufficient. Adding a second encoder at low noise levels slightly degrades perceptual quality, as early pose injection already establishes the global motion scaffold.
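The design can be sketched as a noise-gated conditioning loop (the function names, the sigma cutoff value, and the `step` callback are illustrative assumptions, not the actual API): camera-pose features are injected only while the noise level is above a threshold, fixing the global motion scaffold early and leaving late steps to refine appearance.

```python
def denoise(latent, sigmas, step, cam_feat, cutoff=0.5):
    """Run a denoising loop over decreasing noise levels `sigmas`.
    `step(latent, sigma, cond)` performs one model update; cond carries the
    camera-pose features at high noise and is None at low noise."""
    for sigma in sigmas:
        cond = cam_feat if sigma >= cutoff else None
        latent = step(latent, sigma, cond)
    return latent
```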