Microsoft Introduces 'VASA-1' AI That Can Turn Photos Into Lifelike Talking Videos

Microsoft VASA-1 examples

A photo is a photo: a static image that will look the same way forever. But with AI, that is no longer the case.

An AI research paper from Microsoft promises a future where users can simply upload a photo of a person and a sample of their voice to create a live, animated talking head of the person in the photo.

Microsoft has introduced this ability with an AI it calls 'VASA-1'.

The AI in question not only animates the lips to match the audio, but also animates the entire face.

The result is a hyper-realistic talking face video, complete with lip sync, facial expressions and head movement.

In Microsoft's words, the AI generates talking faces with "appealing visual affective skills (VAS)."

In a post on its website, Microsoft said:

" [...] The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos."

"Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively."

The method not only allows VASA-1 to deliver high video quality with realistic facial and head dynamics, but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency.

This, in Microsoft's words, "paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors."

While similar lip sync and head movement technology is already available from rivals, Microsoft's VASA-1 takes it to a different level, with higher quality, greater realism and fewer mouth artifacts.

This approach to audio-driven animation is similar to that of the VLOGGER AI from Google.

Microsoft is introducing VASA-1 only as a research preview, and it is not available for anyone outside of the Microsoft Research team to try.

There are a number of reasons for this. For example, the team worries that the technology could be misused to impersonate humans.

"Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive," said Microsoft.

"We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."

While the company acknowledges the possibility of misuse, the team believes that the technology's substantial positive potential far outweighs its worries.

"We are dedicated to developing AI responsibly, with the goal of advancing human well-being."

"The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations," said the company.

Published: 20/04/2024