Google and Meta already presented their text-to-video AIs a few months ago with Imagen Video and Make-A-Video. These work much like the text-to-image AIs currently experiencing a boom, such as Stable Diffusion, Midjourney or DALL-E 2: images or videos are generated from a text description.
Researchers from Singapore have now presented an interesting alternative to this type of video generation. Their new algorithm, called Tune-A-Video, combines a sample video with a description of the desired result, similar to the img2img (image-to-image) function of Stable Diffusion, in which the appearance of the image to be generated (such as the composition and the shapes of the objects) is roughly specified by an input image and defined in more detail by text. With Tune-A-Video, a video of one's own is supplied analogously; via the text description, the individual objects in the foreground as well as the background can then be exchanged as desired.
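To make the analogy concrete, here is a minimal sketch of Stable Diffusion's image-to-image mode using the Hugging Face diffusers library. This illustrates only the img2img principle the comparison refers to, not Tune-A-Video itself; the model ID, file names and parameter values are illustrative choices.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion img2img pipeline (model ID is an illustrative choice).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The input image fixes the composition and the shapes of the objects ...
init_image = Image.open("cat_with_hat.png").convert("RGB").resize((512, 512))

# ... while the prompt defines the desired content in more detail.
prompt = "a cat wearing a straw hat, oil painting"

# 'strength' controls how far the result may deviate from the input image.
result = pipe(prompt=prompt, image=init_image,
              strength=0.6, guidance_scale=7.5).images[0]
result.save("cat_with_straw_hat.png")
```

Tune-A-Video applies the same idea one dimension higher: the role of the input image is taken over by a whole video, and the prompt steers how its content is changed.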
The possibilities start with object-based video editing, where targeted objects are manipulated, as in the following example of a cat wearing a hat: the cat can be doubled, the hat exchanged for another one, the facial expression changed, or the hat removed from the video without leaving any gaps. Likewise, entire objects in the video can be swapped for others, the background can be replaced, or the whole video can be re-rendered in a completely different style (for example, as a comic, oil painting, anime or pencil drawing), in each case with all movements preserved.
Cat with hat in the original and with variations
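Since all of these edits are driven purely by the text description, they boil down to small changes in the prompt. The following source/edit prompt pairs are hypothetical examples for the cat-with-hat clip above, not taken from the paper:

```python
# Hypothetical prompts for the cat-with-hat example; the wording is
# illustrative and not taken from the Tune-A-Video paper.
source_prompt = "a cat wearing a hat"

edits = {
    "exchange the object":    "a dog wearing a hat",
    "exchange the hat":       "a cat wearing a party hat",
    "remove the hat":         "a cat",
    "replace the background": "a cat wearing a hat on a beach",
    "change the style":       "a cat wearing a hat, oil painting style",
}
```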
Compared to pure video (or image) generation from a text prompt, this has the advantage that the type of movement (how fast, in which style, from where to where) and the overall composition (where each object is located in the frame, what the camera angle is) can be specified precisely, things that are otherwise difficult to describe with this exactness in a prompt. In theory, one's own videos as well as clips from any film can serve as the model video, because the final result can look sufficiently different from the source that it is unlikely to be recognised as a copy or treated as a copyright infringement. In this way, Tune-A-Video combines the possibilities of techniques such as animating objects via motion capture with the design and dynamic rendering of those objects.
At the moment, the algorithm already delivers very good continuity of movements and objects, even if the replaced objects themselves are often still rendered rather imperfectly and the frame rate is very low. But as we know, at the current pace of AI development, the quality of the results is improving rapidly and should soon be good enough. More information at tuneavideo.github.io.
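For orientation, the project's public repository describes an inference setup roughly along the following lines. This is only a sketch: the module paths, class names, checkpoint paths and parameters shown here are assumptions based on that repository and may differ from the current version.

```python
# Rough sketch based on the public Tune-A-Video repository; module paths,
# class names and parameters are assumptions and may have changed.
import torch
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid

# A UNet fine-tuned on a single sample video (e.g. the cat-with-hat clip),
# combined with a pretrained Stable Diffusion checkpoint.
unet = UNet3DConditionModel.from_pretrained(
    "./outputs/cat-with-hat", subfolder="unet", torch_dtype=torch.float16
)
pipe = TuneAVideoPipeline.from_pretrained(
    "./checkpoints/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16
).to("cuda")

# The prompt exchanges objects, background or style, while the tuned UNet
# preserves the motion and composition of the sample video.
prompt = "a dog wearing a party hat"
video = pipe(prompt, video_length=24, height=512, width=512,
             num_inference_steps=50, guidance_scale=12.5).videos
save_videos_grid(video, "dog_with_party_hat.gif")
```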