It was clear that the next step after generating images from text descriptions would be generating videos - and now Meta AI, the AI research division of Meta (formerly Facebook), has unveiled just such an algorithm. Dubbed "Make-a-Video," the text-to-video AI is similar to the text-to-image AIs DALL-E 2 and Stable Diffusion that have made a splash in recent months.
Like these, it has learned what the real world looks like, what objects it consists of, and how people describe it, based on billions of images paired with text descriptions. On top of this, however, additional neural-network layers for temporal sequences of images were trained on about 20 million videos, so that the model learns how different objects typically move.
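The idea of extending an image model with extra temporal layers can be illustrated with a toy sketch. This is not Meta's code: it simply factorizes a video convolution into a 2D spatial pass per frame (the part an image model already provides) followed by a 1D pass across frames (the newly added temporal part), using made-up helper names.

```python
import numpy as np

def conv2d(frame, k):
    """Toy valid 2D convolution of one frame with kernel k (no padding)."""
    kh, kw = k.shape
    H, W = frame.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * k)
    return out

def pseudo3d(video, k_spatial, k_temporal):
    """Spatial convolution on each frame, then a temporal convolution
    across frames - a factorized stand-in for a full 3D convolution."""
    # Spatial pass: the "image model" part, applied frame by frame.
    spatial = np.stack([conv2d(f, k_spatial) for f in video])  # (T, H', W')
    # Temporal pass: the added layers that mix information across time.
    kt = len(k_temporal)
    out = np.zeros((spatial.shape[0] - kt + 1,) + spatial.shape[1:])
    for t in range(out.shape[0]):
        out[t] = np.tensordot(k_temporal, spatial[t:t + kt], axes=1)
    return out
```

A design note: factorizing space and time this way lets the spatial weights be initialized from a pretrained image model, while only the temporal weights need video data - which matches the article's point that far fewer videos (about 20 million) than images (billions) were used.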
So, using only text descriptions, the Make-a-Video AI can generate short video clips from arbitrary prompts, such as "A teddy bear paints a portrait" or "A fluffy baby sloth in an orange knit cap trying to operate a laptop, with a detailed studio light screen reflected in its eye." As with the image-generating AIs, the visual style (realistic, surreal, abstract, stylized, ...) can be chosen freely.
Bring images to life
Instead of text, a single image can also serve as input (analogous to the image-to-image method of the image AIs) in order to animate it, or two images (a start and an end frame, between which the Make-a-Video algorithm then generates the intermediate frames). Alternatively, in a video-to-video fashion, a video can act as input, from which Make-a-Video then generates variations.
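The start/end-image mode can be illustrated with a deliberately naive sketch. Make-a-Video uses a learned interpolation network for this; here, plain linear blending between the two images (a hypothetical `interpolate_frames` helper, not part of any real API) stands in for it, just to show the input and output shapes involved.

```python
import numpy as np

def interpolate_frames(start, end, n_frames):
    """Return n_frames images blending linearly from start to end.
    A learned model would instead synthesize plausible motion here."""
    weights = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - w) * start + w * end for w in weights])

# Usage: five frames fading from a black 2x2 image to a white one.
clip = interpolate_frames(np.zeros((2, 2)), np.ones((2, 2)), 5)
```

The first and last frames reproduce the inputs exactly; the real system's advantage is that its in-between frames contain motion rather than a cross-fade.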
Meta AI has published the research paper on Make-a-Video, but not the associated code or model - however, attempts to replicate the algorithm based on the paper are likely to follow soon.
The quality of the generated videos still leaves something to be desired: the clips are more like short image animations than real (complex) video, and they last only a few seconds. But Make-a-Video clearly shows where the journey is heading - the generation of videos and object-based editing via text, even for consumers, is getting closer.
Longer videos with new AI
How fast development is progressing right now is also shown by the fact that, at the same time as Meta's "Make-a-Video", another project called Phenaki by a (still) anonymous research team has appeared. It works at a much lower resolution, but in one important respect it is even more interesting: it allows the generation of videos lasting several minutes. In the following example, a lengthy text description (complete with instructions for the virtual camera movements) was used to generate an impressive two-minute video: