DeepMind, Google’s AI research lab, says it’s developing AI tech to generate soundtracks for videos.
In a post on its official blog, DeepMind says that it sees the tech, V2A (short for “video-to-audio”), as an important piece of the AI-generated media puzzle. While plenty of organizations, including DeepMind, have developed video-generating AI models, those models can’t create sound effects synced to the videos they generate.
“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” DeepMind writes. “V2A technology [could] turn out to be a promising approach for bringing generated movies to life.”
DeepMind’s V2A tech takes the description of a soundtrack (e.g. “jellyfish pulsating under water, marine life, ocean”) paired with a video to create music, sound effects and even dialogue that matches the characters and tone of the video, watermarked by DeepMind’s deepfakes-combating SynthID technology. The AI model powering V2A, a diffusion model, was trained on a mix of sounds and dialogue transcripts in addition to video clips, DeepMind says.
“By training on video, audio and the extra annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the data provided within the annotations or transcripts,” according to DeepMind.
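DeepMind hasn’t published code or an API for V2A, but the description above maps onto a familiar pattern: encode the video frames (and, optionally, a text prompt), then run a diffusion-style denoising loop that turns noise into a waveform conditioned on those encodings. The sketch below is purely illustrative; the `generate_audio_for_video` function, the brightness-based “video encoder” and the hash-based “prompt encoder” are all hypothetical stand-ins for learned components, and the loop only mimics the shape of reverse diffusion. It shows what such a pipeline looks like, not DeepMind’s actual model.

```python
import numpy as np

def generate_audio_for_video(frames: np.ndarray, prompt: str | None = None,
                             sample_rate: int = 16_000, steps: int = 25,
                             seed: int = 0) -> np.ndarray:
    """Toy video-to-audio pipeline in the shape DeepMind describes:
    encode the video (and an optional text prompt), then iteratively
    denoise an audio signal conditioned on those encodings.

    `frames` has shape (num_frames, height, width, channels); the clip is
    assumed to be 24 fps. Everything below is a stand-in: real systems use
    learned encoders and a trained diffusion model, not hand-rolled math.
    """
    rng = np.random.default_rng(seed)
    duration_s = frames.shape[0] / 24.0
    n_samples = int(duration_s * sample_rate)

    # "Encode" the video: per-frame mean brightness, stretched to audio length,
    # serves as a crude conditioning signal derived from raw pixels.
    video_cond = frames.mean(axis=(1, 2, 3))
    video_cond = np.interp(np.linspace(0, 1, n_samples),
                           np.linspace(0, 1, len(video_cond)), video_cond)

    # "Encode" the optional prompt: hash it into a scalar bias. The prompt is
    # optional, mirroring the claim that V2A can work from pixels alone.
    prompt_cond = (hash(prompt) % 1000) / 1000.0 if prompt else 0.0

    # Reverse diffusion, crudely: start from noise and repeatedly nudge the
    # sample toward a conditioning-dependent target waveform.
    audio = rng.standard_normal(n_samples)
    target = (np.sin(2 * np.pi * (200 + 200 * prompt_cond) *
                     np.arange(n_samples) / sample_rate)
              * (video_cond / (video_cond.max() + 1e-8)))
    for _ in range(steps):
        audio = 0.8 * audio + 0.2 * target
    return audio.astype(np.float32)

if __name__ == "__main__":
    clip = np.random.rand(48, 64, 64, 3)  # two seconds of fake 24 fps video
    wave = generate_audio_for_video(
        clip, prompt="jellyfish pulsating under water, marine life, ocean")
    print(wave.shape, wave.dtype)
```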
Mum’s the word on whether any of the training data was copyrighted, and whether the data’s creators were informed of DeepMind’s work. We’ve reached out to DeepMind for clarification and will update this post if we hear back.
AI-powered sound-generating tools aren’t novel. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models to create video sound effects. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.
But DeepMind claims that its V2A tech is unique in that it can understand the raw pixels from a video and sync generated sounds with the video automatically, optionally sans a description.
V2A isn’t perfect, and DeepMind acknowledges this. Because the underlying model wasn’t trained on many videos with artifacts or distortions, it doesn’t create particularly high-quality audio for them. And in general, the generated audio isn’t super convincing; my colleague Natasha Lomas described it as “a smorgasbord of stereotypical sounds,” and I can’t say I disagree.
For those reasons, and to prevent misuse, DeepMind says it won’t release the tech to the general public anytime soon, if ever.
“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this invaluable feedback to inform our ongoing research and development,” DeepMind writes. “Before we consider opening access to it to the broader public, our V2A technology will undergo rigorous safety assessments and testing.”
DeepMind pitches its V2A technology as an especially useful tool for archivists and folks working with historical footage. But generative AI along these lines also threatens to upend the film and TV industry. It’ll take some seriously strong labor protections to ensure that generative media tools don’t eliminate jobs, or, as the case may be, entire professions.