Mark Hamilton, an MIT PhD student in electrical engineering and computer science and affiliate of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate. To do that, he set out first to create a system that can learn human language “from scratch.”
“Funny enough, the key moment of inspiration came from the movie ‘March of the Penguins.’ There’s a scene where a penguin falls while crossing the ice, and lets out a little belabored groan while getting up. When you watch it, it’s almost obvious that this groan is standing in for a four-letter word. This was the moment where we thought, maybe we need to use audio and video to learn language,” says Hamilton. “Is there a way we could let an algorithm watch TV all day and from this figure out what we’re talking about?”
“Our model, ‘DenseAV,’ aims to learn language by predicting what it’s seeing from what it’s hearing, and vice versa. For instance, if you hear the sound of someone saying ‘bake the cake at 350,’ chances are you might be seeing a cake or an oven. To succeed at this audio-video matching game across millions of videos, the model has to learn what people are talking about,” says Hamilton.
Once they trained DenseAV on this matching game, Hamilton and his colleagues looked at which pixels the model searched for when it heard a sound. For example, when someone says “dog,” the algorithm immediately starts looking for dogs in the video stream. By seeing which pixels are selected by the algorithm, one can discover what the algorithm thinks a word means.
Interestingly, a similar search process happens when DenseAV listens to a dog barking: It searches for a dog in the video stream. “This piqued our interest. We wanted to see if the algorithm knew the difference between the word ‘dog’ and a dog’s bark,” says Hamilton. The team explored this by giving DenseAV a “two-sided brain.” Interestingly, they found that one side of DenseAV’s brain naturally focused on language, like the word “dog,” and the other side focused on sounds like barking. This showed that DenseAV not only learned the meaning of words and the locations of sounds, but also learned to distinguish between these types of cross-modal connections, all without human intervention or any knowledge of written language.
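For readers curious what a “two-sided” similarity could look like mechanically, here is a toy sketch in Python: the feature vector is split into two heads, and each head scores the audio-visual match on its own. The head split and feature shapes are assumptions for illustration, and the specialization into a “language” head and a “sound” head is something the trained model discovers on its own, not something hard-coded.

```python
# Toy illustration of the "two-sided brain" idea: split each feature vector into
# two heads and give each head its own audio-visual similarity score. Nothing here
# forces head 0 toward words and head 1 toward sounds; that specialization emerges
# during training. Feature sizes are illustrative assumptions.
import torch

def two_head_similarity(audio_feat: torch.Tensor,   # (D,) audio feature
                        visual_feat: torch.Tensor   # (D,) visual feature
                        ) -> torch.Tensor:
    a_heads = audio_feat.view(2, -1)            # split channels into 2 heads of size D/2
    v_heads = visual_feat.view(2, -1)
    per_head = (a_heads * v_heads).sum(dim=-1)  # one similarity score per head
    return per_head  # compare per_head[0] vs per_head[1] to see which head "fired"

print(two_head_similarity(torch.randn(512), torch.randn(512)))
```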
One branch of applications is learning from the massive amount of video published to the internet every day: “We want systems that can learn from massive amounts of video content, such as instructional videos,” says Hamilton. “Another exciting application is understanding new languages, like dolphin or whale communication, which don’t have a written form of communication. Our hope is that DenseAV can help us understand these languages that have evaded human translation efforts since the beginning. Finally, we hope that this method can be used to discover patterns between other pairs of signals, like the seismic sounds the earth makes and its geology.”
A formidable challenge lay ahead of the team: learning language without any text input. Their objective was to rediscover the meaning of language from a blank slate, avoiding the use of pre-trained language models. This approach is inspired by how children learn by observing and listening to their environment to understand language.
To achieve this feat, DenseAV uses two main components to process audio and visual data separately. This separation made it impossible for the algorithm to cheat by letting the visual side look at the audio and vice versa. It forced the algorithm to recognize objects and created detailed and meaningful features for both audio and visual signals. DenseAV learns by comparing pairs of audio and visual signals to find which signals match and which signals don’t. This method, called contrastive learning, doesn’t require labeled examples, and allows DenseAV to discover the important predictive patterns of language itself.
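As a rough illustration of this contrastive setup, the sketch below treats each audio clip and its own video frame as a matching pair and pushes mismatched pairs in the batch apart. The backbone outputs, embedding size, and the InfoNCE-style loss are assumptions for illustration; DenseAV’s actual architecture and objective differ in detail.

```python
# Minimal sketch of audio-visual contrastive learning, assuming two separate
# backbones have already produced one embedding per audio clip and per frame.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, visual_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Matching audio/visual pairs from the same video should score higher
    than all mismatched pairs in the batch (InfoNCE-style objective)."""
    audio_emb = F.normalize(audio_emb, dim=-1)          # (B, D)
    visual_emb = F.normalize(visual_emb, dim=-1)        # (B, D)
    logits = audio_emb @ visual_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features for a batch of 8 video clips:
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```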
One major difference between DenseAV and previous algorithms is that prior works focused on a single notion of similarity between sound and images. An entire audio clip, like someone saying “the dog sat on the grass,” was matched to an entire image of a dog. This didn’t allow previous methods to discover fine-grained details, like the connection between the word “grass” and the grass underneath the dog. The team’s algorithm searches for and aggregates all the possible matches between an audio clip and an image’s pixels. This not only improved performance, but allowed the team to precisely localize sounds in a way that previous algorithms couldn’t. “Conventional methods use a single class token, but our approach compares every pixel and every second of sound. This fine-grained method lets DenseAV make more detailed connections for better localization,” says Hamilton.
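A minimal sketch of this dense matching idea follows, under assumed feature shapes: compare every audio time step against every image location, keep the full similarity volume for localization, and pool it into a single clip-level score for training. The max-then-mean aggregation here is one plausible choice, not necessarily the paper’s exact formulation.

```python
# Dense pairwise matching: instead of one clip-level similarity, compare every
# moment of sound with every image location, then aggregate. Shapes and the
# pooling strategy are illustrative assumptions.
import torch

def dense_similarity(audio_feats: torch.Tensor,   # (T, D): one feature per audio time step
                     visual_feats: torch.Tensor   # (H, W, D): one feature per image location
                     ) -> tuple[torch.Tensor, torch.Tensor]:
    # Full similarity volume: how well each time step matches each location.
    volume = torch.einsum("td,hwd->thw", audio_feats, visual_feats)  # (T, H, W)
    # Clip-level score: best location per time step, averaged over time.
    clip_score = volume.amax(dim=(1, 2)).mean()
    # The volume doubles as a localization map: volume[t] shows where the
    # sound at time t (say, the word "grass") lands in the image.
    return clip_score, volume

score, heatmaps = dense_similarity(torch.randn(20, 512), torch.randn(14, 14, 512))
print(score.shape, heatmaps.shape)  # scalar score, (20, 14, 14) similarity volume
```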
The researchers trained DenseAV on AudioSet, which includes 2 million YouTube videos. They also created new datasets to test how well the model can link sounds and images. In these tests, DenseAV outperformed other top models in tasks like identifying objects from their names and sounds, proving its effectiveness. “Previous datasets only supported coarse evaluations, so we created a dataset using semantic segmentation datasets. This provides pixel-perfect annotations for precise evaluation of our model’s performance. We can prompt the algorithm with specific sounds or images and get those detailed localizations,” says Hamilton.
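For a sense of how such a pixel-level evaluation might work, the snippet below scores a predicted similarity heatmap against a ground-truth segmentation mask with a simple intersection-over-union. The thresholding and metric details are illustrative assumptions rather than the paper’s actual protocol.

```python
# Sketch of a pixel-level evaluation: binarize a predicted audio-visual similarity
# heatmap and compare it to a ground-truth object mask. Threshold choice and the
# single-image IoU metric are assumptions for illustration.
import torch

def heatmap_iou(heatmap: torch.Tensor,   # (H, W) predicted similarity scores
                gt_mask: torch.Tensor,   # (H, W) boolean ground-truth object mask
                threshold: float = 0.5) -> float:
    pred = heatmap >= heatmap.max() * threshold     # binarize relative to the peak
    intersection = (pred & gt_mask).sum().item()
    union = (pred | gt_mask).sum().item()
    return intersection / union if union else 0.0

# Example with a random heatmap and a square ground-truth region:
mask = torch.zeros(14, 14, dtype=torch.bool)
mask[4:10, 4:10] = True
print(heatmap_iou(torch.rand(14, 14), mask))
```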
Due to the massive amount of data involved, the project took about a year to complete. The team says that transitioning to a large transformer architecture presented challenges, as these models can easily overlook fine-grained details. Encouraging the model to focus on these details was a significant hurdle.
Looking ahead, the team aims to create systems that can learn from massive amounts of video-only or audio-only data. This is crucial for new domains where there’s lots of either mode, but not together. They also aim to scale this up using larger backbones and possibly integrate knowledge from language models to improve performance.
“Recognizing and segmenting visual objects in images, as well as environmental sounds and spoken words in audio recordings, are each difficult problems in their own right. Historically, researchers have relied upon expensive, human-provided annotations in order to train machine learning models to perform these tasks,” says David Harwath, assistant professor in computer science at the University of Texas at Austin, who was not involved in the work. “DenseAV makes significant progress towards developing methods that can learn to solve these tasks simultaneously by simply observing the world through sight and sound, based on the insight that the things we see and interact with often make sound, and we also use spoken language to talk about them. This model also makes no assumptions about the specific language that’s being spoken, and could therefore in principle learn from data in any language. It will be exciting to see what DenseAV could learn by scaling it up to thousands or millions of hours of video data across a multitude of languages.”
Additional authors on a paper describing the work are Andrew Zisserman, professor of computer vision engineering at the University of Oxford; John R. Hershey, Google AI Perception researcher; and William T. Freeman, MIT electrical engineering and computer science professor and CSAIL principal investigator. Their research was supported, in part, by the U.S. National Science Foundation, a Royal Society Research Professorship, and an EPSRC Programme Grant Visual AI. This work will be presented at the IEEE/CVF Computer Vision and Pattern Recognition Conference this month.