Multimodal AI Models: ByteDance Vidi2 Independently Produces Finished Videos from Raw Material

[15:54 Mon, 1 December 2025 by Rudi Schmidts]

China's ByteDance kicks off the December AI presentation series, demonstrating its latest multimodal AI model, Vidi 2, with a paper and a demo. Multimodal models accept various input types (e.g., text, audio, image, or video) and can subsequently generate diverse outputs from them.

Vidi 2 specializes in analyzing many hours of raw footage and interpreting associated prompts. Possible outputs include generating a polished TikTok video or even a complete film from a full script. Vidi 2 "learns" the raw material in great detail. For example, it can find the in- and out-points of individual scenes, as well as people or objects within the raw footage.

Through this spatio-temporal linking, Vidi 2 is intended to enable potential applications in complex editing scenarios, such as understanding plot or characters, automatically switching between different views, and intelligent, composition-aware reframing and trimming of individual scenes.
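
To make the reframing idea a little more tangible, here is a minimal Python sketch of how a vertical 9:16 crop could be placed around a detected subject in a landscape frame. This is purely illustrative and not ByteDance's method: the `Box` type, the `vertical_crop` function, and the simple "center on the subject, clamp to the frame" rule are all assumptions made for this example. A real composition-aware system would presumably weigh far more than a single bounding box.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixels (hypothetical helper type)."""
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def vertical_crop(frame_w: int, frame_h: int, subject: Box,
                  aspect_w: int = 9, aspect_h: int = 16) -> Box:
    """Return a full-height crop window with the given aspect ratio.

    The window is centered horizontally on the subject and then
    clamped so that it never leaves the frame.
    """
    crop_h = frame_h
    # Integer width for the target aspect ratio, never wider than the frame.
    crop_w = min(frame_w, frame_h * aspect_w // aspect_h)
    center_x = subject.x + subject.w // 2
    left = center_x - crop_w // 2
    # Clamp the window to the frame boundaries.
    left = max(0, min(left, frame_w - crop_w))
    return Box(left, 0, crop_w, crop_h)


# Example: a 1920x1080 frame with the subject right of center.
crop = vertical_crop(1920, 1080, Box(1500, 300, 200, 400))
print(crop)
```

Running such a function per frame (or per scene, with some smoothing) would turn a sequence of subject positions into a moving vertical crop; the clamping step keeps the window inside the picture when the subject is close to an edge.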

By the way, ByteDance is already using the model in two applications on TikTok. "Smart Split" is available worldwide via TikTok Studio Web and automatically cuts, frames, subtitles, and transcribes longer content into multiple short videos that can be shared directly on TikTok. This allows creators, for example, to split their daily vlog or a podcast episode into several clips.

To start, creators can upload content longer than one minute and select the sections they want to convert into shorter clips. Based on the selected video segment, Smart Split can automatically determine a video length, or creators can specify a particular length. Additionally, various caption formatting options are available, and the content can be converted into vertical clips. Once Smart Split has created the clips, creators can select each video and upload it directly to their TikTok account.

"AI Outline," on the other hand, is designed to help creatives structure their content by generating video titles, hashtags, hooks, and outlines. To do this, users either enter a prompt or select a frequently searched topic from Creator Search Insights. AI Outline is thus intended to provide a better overview of how to structure content. Handy, perhaps, if you yourself no longer know what you actually wanted to convey with your video ;)

More information at bytedance.github.io