You’ve likely heard that a picture is worth a thousand words, but can a large language model (LLM) get the picture if it’s never seen images before?
As it turns out, language models trained purely on text have a solid understanding of the visual world. They can write image-rendering code to generate complex scenes with intriguing objects and compositions, and even when that knowledge is not applied properly, LLMs can refine their images. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when prompting language models to self-correct their code for different images, where the systems improved on their simple clipart drawings with each query.
The visual knowledge of these language models is gained from how concepts like shapes and colors are described across the internet, whether in language or code. When given a prompt like “draw a parrot in the jungle,” users jog the LLM to consider what it’s read in descriptions before. To assess how much visual knowledge LLMs have, the CSAIL team constructed a “vision checkup” for LLMs: using their “Visual Aptitude Dataset,” they tested the models’ abilities to draw, recognize, and self-correct these concepts. Collecting each final draft of these illustrations, the researchers trained a computer vision system that identifies the content of real photos.
“We essentially train a vision system without directly using any visual data,” says Tamar Rott Shaham, co-lead author of the study and an MIT electrical engineering and computer science (EECS) postdoc at CSAIL. “Our team queried language models to write image-rendering code to generate data for us and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, like text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”
To build this dataset, the researchers first queried the models to generate code for different shapes, objects, and scenes. Then, they compiled that code to render simple digital illustrations, like a row of bicycles, showing that LLMs understand spatial relations well enough to draw the two-wheelers in a horizontal row. As another example, the model generated a car-shaped cake, combining two random concepts. The language model also produced a glowing light bulb, indicating its ability to create visual effects.
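To make the pipeline concrete, here is a hypothetical sketch (not from the paper) of the kind of image-rendering program an LLM might emit for a prompt like “draw a row of bicycles.” The SVG output, the stylized two-circles-and-a-frame bicycle, and all function names are illustrative assumptions; the point is that spatial relations like “a horizontal row” reduce to arithmetic over coordinates.

```python
# Hypothetical rendering code for "draw a row of bicycles":
# each bicycle is two wheel circles plus a simple frame polyline,
# and the row is produced by stepping the x-offset.

def bicycle_svg(x, y, wheel_r=20):
    """Return SVG elements for one stylized bicycle anchored at (x, y)."""
    back, front = (x, y), (x + 3 * wheel_r, y)
    parts = [
        f'<circle cx="{back[0]}" cy="{back[1]}" r="{wheel_r}" fill="none" stroke="black"/>',
        f'<circle cx="{front[0]}" cy="{front[1]}" r="{wheel_r}" fill="none" stroke="black"/>',
        f'<polyline points="{back[0]},{back[1]} {x + 1.5 * wheel_r},{y - 1.5 * wheel_r} '
        f'{front[0]},{front[1]}" fill="none" stroke="black"/>',
    ]
    return "\n".join(parts)

def row_of_bicycles(n=3, spacing=120):
    """Lay n bicycles out left to right: the 'horizontal row' spatial relation."""
    body = "\n".join(bicycle_svg(40 + i * spacing, 80) for i in range(n))
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{n * spacing + 80}" height="160">\n{body}\n</svg>')

svg = row_of_bicycles(3)  # three bicycles, hence six wheel circles
```

Compiling such code to an image, as the researchers did, is then just a matter of handing the SVG to any standard renderer.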
“Our work shows that when you query an LLM (without multimodal pre-training) to create an image, it knows much more than it seems,” says co-lead author, EECS PhD student, and CSAIL member Pratyusha Sharma. “Let’s say you asked it to draw a chair. The model knows other things about this piece of furniture that it may not have immediately rendered, so users can query the model to improve the visual it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a significant extent.”
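The self-correction loop Sharma describes can be sketched as follows. This is a minimal, assumed control flow, not the authors' implementation: `query_llm` is a stand-in for a real model call and is stubbed here so the loop is runnable.

```python
def query_llm(code):
    # Stub for a real language-model API call that is asked to improve
    # its own drawing code; here it just appends a marker per round.
    return code + "\n# refinement: add more detail"

def refine_drawing(initial_code, rounds=3):
    """Feed the model its own rendering code back, asking it to enrich
    the drawing, for a fixed number of iterations."""
    code = initial_code
    for _ in range(rounds):
        code = query_llm(code)
    return code

improved = refine_drawing("draw_chair()")  # accumulates one refinement per round
```

With a real model behind `query_llm`, each pass gives the LLM a chance to render details, such as armrests on a chair, that its first draft left out.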
The researchers gathered these illustrations, which were then used to train a computer vision system that can recognize objects within real photos (despite never having seen one before). With this synthetic, text-generated data as its only reference point, the system outperforms counterparts trained on other procedurally generated image datasets, including ones built with authentic photos.
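The idea of training a recognizer purely on rendered data can be illustrated with a toy sketch. The study trains a full vision system on the LLMs' rendered illustrations; everything below (16×16 binary rasters, two shape classes, a nearest-centroid rule) is a deliberately simplified stand-in for that pipeline.

```python
# Toy version of "train a vision system without visual data":
# procedurally render shapes, average them into class templates,
# then classify an unseen render by nearest template.

def render(shape, size=16):
    """Rasterize a filled square or circle onto a size x size binary grid."""
    c, r = size / 2, size / 3
    img = []
    for y in range(size):
        for x in range(size):
            if shape == "square":
                img.append(1 if abs(x - c) < r and abs(y - c) < r else 0)
            else:  # circle
                img.append(1 if (x - c) ** 2 + (y - c) ** 2 < r * r else 0)
    return img

def centroid(images):
    """Per-pixel mean over a list of equally sized images."""
    return [sum(px) / len(images) for px in zip(*images)]

# "Training": one averaged template per class, from synthetic renders only.
templates = {s: centroid([render(s)] * 4) for s in ("square", "circle")}

def classify(img):
    """Assign an image to the class with the nearest template."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda s: dist(templates[s], img))

label = classify(render("circle"))  # -> "circle"
```

The real system replaces the templates with a learned network and the toy rasters with the LLMs' compiled illustrations, but the training loop never touches a natural photo either way.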
The CSAIL team believes that combining the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools, like diffusion models, could be helpful. Systems like Midjourney sometimes lack the know-how to consistently tweak the finer details in an image, making it difficult for them to handle requests like reducing how many cars are pictured, or placing an object behind another. If an LLM sketched out the requested change for the diffusion model beforehand, the resulting edit could be more satisfactory.
The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the same concepts that they can draw. This became clear when the models incorrectly identified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models’ misconceptions.
While the models struggled to perceive these abstract depictions, they demonstrated the creativity to draw the same concepts differently each time. When the researchers queried LLMs to draw concepts like strawberries and arcades multiple times, the models produced pictures from diverse angles with varied shapes and colors, hinting that they might have actual mental imagery of visual concepts (rather than reciting examples they saw before).
The CSAIL team believes this procedure could be a baseline for evaluating how well a generative AI model can train a computer vision system. Moreover, the researchers look to expand the tasks they challenge language models on. As for their current study, the MIT group notes that they don’t have access to the training sets of the LLMs they used, making it difficult to further investigate the origin of their visual knowledge. In the future, they intend to explore training an even better vision model by letting the LLM work directly with it.
Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu ’22, MNG ’23 and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, who are all CSAIL affiliates; as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported, in part, by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They are presenting their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.