Try taking a picture of each of North America’s roughly 11,000 tree species, and you’ll have a mere fraction of the millions of photos within nature image datasets. These massive collections of snapshots — ranging from butterflies to humpback whales — are a great research tool for ecologists because they provide evidence of organisms’ unique behaviors, rare conditions, migration patterns, and responses to pollution and climate change.
While comprehensive, nature image datasets aren’t yet as useful as they could be. It’s time-consuming to search these databases and retrieve the images most relevant to your hypothesis. You’d be better off with an automated research assistant — or perhaps artificial intelligence systems called multimodal vision language models (VLMs). They’re trained on both text and images, making it easier for them to pinpoint finer details, like the particular trees in the background of a photo.
But just how well can VLMs assist nature researchers with image retrieval? A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, and elsewhere designed a performance test to find out. Each VLM’s task: locate and reorganize the most relevant results within the team’s “INQUIRE” dataset, composed of 5 million wildlife pictures and 250 search prompts from ecologists and other biodiversity experts.
Searching for that special frog
In these evaluations, the researchers found that larger, more advanced VLMs, which are trained on far more data, can sometimes get researchers the results they want to see. The models performed reasonably well on straightforward queries about visual content, like identifying debris on a reef, but struggled significantly with queries requiring expert knowledge, like identifying specific biological conditions or behaviors. For instance, VLMs somewhat easily uncovered examples of jellyfish on the beach, but struggled with more technical prompts like “axanthism in a green frog,” a condition that limits a frog’s ability to make its skin yellow.
Their findings indicate that the models need much more domain-specific training data to process difficult queries. MIT PhD student Edward Vendrow, a CSAIL affiliate who co-led work on the dataset in a new paper, believes that by familiarizing themselves with more informative data, the VLMs could one day be great research assistants. “We want to build retrieval systems that find the exact results scientists seek when monitoring biodiversity and analyzing climate change,” says Vendrow. “Multimodal models don’t quite understand more complex scientific language yet, but we believe that INQUIRE will be an important benchmark for tracking how they improve in comprehending scientific terminology and ultimately helping researchers automatically find the exact images they need.”
The team’s experiments illustrated that larger models tended to be more effective for both simpler and more intricate searches, thanks to their expansive training data. They first used the INQUIRE dataset to test whether VLMs could narrow a pool of 5 million images down to the top 100 most-relevant results for a given query (also known as “ranking”). For straightforward search queries like “a reef with manmade structures and debris,” relatively large models like “SigLIP” found matching images, while smaller-sized CLIP models struggled. According to Vendrow, larger VLMs are “only beginning to be useful” at ranking tougher queries, as sketched below.
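To make that ranking step concrete, here is a minimal sketch of how an embedding model such as CLIP can score candidate images against a text query and keep the closest matches; it assumes Hugging Face’s transformers library, and the model name, query, and file paths are illustrative stand-ins rather than the team’s exact pipeline.

```python
# Minimal CLIP-style ranking sketch (illustrative assumptions, not the INQUIRE codebase).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a reef with manmade structures and debris"
image_paths = ["reef_001.jpg", "reef_002.jpg", "frog_113.jpg"]  # placeholder files
images = [Image.open(p) for p in image_paths]

# Embed the text query and the candidate images in the same space.
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the query and each image, highest first.
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: x[1], reverse=True)
print(ranked[:100])  # in INQUIRE, the top 100 of 5 million images would be kept
```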
Vendrow and his colleagues also evaluated how well multimodal models could re-rank those 100 results, reorganizing which images were most pertinent to a search. In these tests, even large multimodal models trained on more curated data, like GPT-4o, struggled: its precision score was only 59.6 percent, the highest achieved by any model.
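The re-ranking step can be sketched in a similarly hedged way: a multimodal model such as GPT-4o is shown each top-ranked candidate alongside the query and asked whether the image actually matches, and its answers are used to reorder the shortlist. The prompt wording, yes/no scoring, and file names below are assumptions for illustration, not the paper’s exact protocol.

```python
# Hedged sketch of LLM-based re-ranking of a retrieved shortlist.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def relevance_score(query: str, image_path: str) -> float:
    """Ask a multimodal model whether an image matches the query; 1.0 for yes, 0.0 for no."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does this image show: '{query}'? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return 1.0 if "yes" in response.choices[0].message.content.lower() else 0.0

query = "axanthism in a green frog"
candidates = ["frog_001.jpg", "frog_002.jpg"]  # placeholder top-ranked images
reranked = sorted(candidates, key=lambda p: relevance_score(query, p), reverse=True)
```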
The researchers presented these results at the Conference on Neural Information Processing Systems (NeurIPS) earlier this month.
Asking for INQUIRE
The INQUIRE dataset includes search queries based on discussions with ecologists, biologists, oceanographers, and other experts about the types of images they’d look for, including animals’ unique physical conditions and behaviors. A team of annotators then spent 180 hours searching the iNaturalist dataset with these prompts, carefully combing through roughly 200,000 results to label 33,000 matches that fit the prompts.
For instance, the annotators used queries like “a hermit crab using plastic waste as its shell” and “a California condor tagged with a green ‘26’” to identify the subsets of the larger image dataset that depict these specific, rare events.
Then, the researchers used the same search queries to see how well VLMs could retrieve iNaturalist images. The annotators’ labels revealed when the models struggled to understand scientists’ keywords, as their results included images previously tagged as irrelevant to the search. For example, VLMs’ results for “redwood trees with fire scars” sometimes included images of trees without any markings.
“This is a careful curation of data, with a focus on capturing real examples of scientific inquiries across research areas in ecology and environmental science,” says Sara Beery, the Homer A. Burnell Career Development Assistant Professor at MIT, CSAIL principal investigator, and co-senior author of the work. “It’s proved vital to expanding our understanding of the current capabilities of VLMs in these potentially impactful scientific settings. It has also outlined gaps in current research that we can now work to address, particularly for complex compositional queries, technical terminology, and the fine-grained, subtle differences that delineate categories of interest for our collaborators.”
“Our findings imply that some vision models are already precise enough to help wildlife scientists retrieve some images, but many tasks are still too difficult for even the largest, best-performing models,” says Vendrow. “Although INQUIRE is focused on ecology and biodiversity monitoring, the wide variety of its queries means that VLMs that perform well on INQUIRE are likely to excel at analyzing large image collections in other observation-intensive fields.”
Inquiring minds want to see
Taking their project further, the researchers are working with iNaturalist to develop a query system to better help scientists and other curious minds find the images they actually want to see. Their working demo allows users to filter searches by species, enabling quicker discovery of relevant results like, say, the various eye colors of cats. Vendrow and co-lead author Omiros Pantazis, who recently received his PhD from University College London, also aim to improve the re-ranking system by augmenting current models to provide better results.
University of Pittsburgh Associate Professor Justin Kitzes highlights INQUIRE’s ability to uncover secondary data. “Biodiversity datasets are rapidly becoming too large for any individual scientist to review,” says Kitzes, who wasn’t involved in the research. “This paper draws attention to a difficult and unsolved problem, which is how to effectively search such data with questions that go beyond simply ‘who is here’ to ask instead about individual characteristics, behavior, and species interactions. Being able to efficiently and accurately uncover these more complex phenomena in biodiversity image data will be critical to fundamental science and real-world impacts in ecology and conservation.”
Vendrow, Pantazis, and Beery wrote the paper with iNaturalist software engineer Alexander Shepard, University College London professors Gabriel Brostow and Kate Jones, University of Edinburgh associate professor and co-senior author Oisin Mac Aodha, and University of Massachusetts Amherst Assistant Professor Grant Van Horn, who served as co-senior author. Their work was supported, in part, by the Generative AI Laboratory at the University of Edinburgh, the U.S. National Science Foundation/Natural Sciences and Engineering Research Council of Canada Global Center on AI and Biodiversity Change, a Royal Society Research Grant, and the Biome Health Project funded by the World Wildlife Fund United Kingdom.