Soon to be cinematic? New NVIDIA AI creates high-definition videos

[10:28 Thu,20.April 2023 by Thomas Richter]

The quality of text-to-video AIs is improving faster than was expected only a short time ago. Recently presented video AIs such as Meta's Make-a-Video, Google's Imagen and Phenaki, or the open-source VideoFusion were still limited to generating small videos (256 x 256 or 128 x 128 pixels) - only Imagen reached 1,280 x 768. NVIDIA's new video AI now achieves resolutions of up to 1,280 x 2,048 pixels at 24 fps and shows significantly fewer temporal artifacts, i.e. better coherence between the individual frames.

The NVIDIA research team included Andreas Blattmann and Robin Rombach, two experts from LMU Munich who also co-developed the image AI Stable Diffusion. Like Stable Diffusion, the new video AI uses a latent diffusion model (LDM) for still images. The still-image generator is turned into a video generator by training a temporal dimension into the diffusion model. Since the model used was derived from Stable Diffusion weights, its output resolution is still well below HD.

Therefore, the subsequent diffusion upsampler also gets a temporal component, which leads to temporally consistent video super-resolution. Chaining these stages makes videos several seconds long with a resolution of up to 1,280 x 2,048 pixels possible with "reasonable" computing effort. The frame rate is also upsampled twice with the help of a special latent diffusion model to enable relatively smooth motion at 24 fps.
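To get a feel for how two rounds of frame-rate upsampling add up to a clip of several seconds, here is a minimal arithmetic sketch. The numbers are assumptions chosen for illustration and are not stated in this article: 8 starting keyframes and 3 new frames inserted between each pair of neighboring frames per stage. The function names are hypothetical as well.

```python
def frames_after_interpolation(keyframes: int, inserted_per_gap: int, stages: int) -> int:
    """Frame count after repeatedly inserting frames between neighboring frames."""
    frames = keyframes
    for _ in range(stages):
        gaps = frames - 1                   # number of neighboring frame pairs
        frames += gaps * inserted_per_gap   # new frames are placed inside each gap
    return frames


def clip_seconds(frames: int, fps: float) -> float:
    """Playback duration of a clip with the given frame count and frame rate."""
    return frames / fps


# Hypothetical values: 8 keyframes, 3 inserted frames per gap, 2 stages, 24 fps.
total = frames_after_interpolation(keyframes=8, inserted_per_gap=3, stages=2)
print(total, round(clip_seconds(total, fps=24), 1))
```

Under these assumed values the clip ends up with 113 frames, which at 24 fps plays for roughly 4.7 seconds.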

A whole series of 4.7-second sample videos can be viewed on the demo page in full 1,280 x 2,048 resolution if you open each one in a separate window.

Also interesting is the ability to add your own objects to the synthesized videos via DreamBooth, thus personalizing the text-to-video AI:

There is also a very special use case in which the new method can even generate several minutes of coherent video - albeit only at a resolution of 512 x 1,024 pixels - namely videos of driving scenes in the wild. The following is a 9-second clip - the full 5-minute video can be found here.

More information at research.nvidia.com
