A high schooler built a website that lets you challenge AI models to a Minecraft build-off

As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to evaluate the capabilities of generative AI models. For one group of developers, that’s Minecraft, the Microsoft-owned sandbox-building game.

The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against one another in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.

For Adi Singh, the 12th-grader who started MC-Bench, the value of Minecraft isn't so much the game itself as the familiarity people have with it; it is, after all, the best-selling video game of all time. Even people who have never played the game can judge which blocky representation of a pineapple is better realized.

“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run benchmark prompts, per MC-Bench’s website, but the companies are not otherwise affiliated.

“Currently we are only doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to those longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that’s safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”

Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, partly because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they’re trained, models are naturally gifted at certain narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

Put simply, it’s hard to glean what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT but can’t figure out how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
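To make that concrete, here is a minimal, hypothetical sketch of the kind of build script a model might produce for such a prompt. The article doesn’t describe MC-Bench’s actual harness or API, so the `place_block` helper and the snowman layout below are illustrative assumptions only.

```python
# Hypothetical sketch of a model-generated Minecraft build script.
# place_block() is a stand-in: it just records placements in a list,
# whereas a real benchmark harness would send them to a Minecraft world.

blocks = []

def place_block(x: int, y: int, z: int, block_type: str) -> None:
    """Record a single block placement at the given coordinates."""
    blocks.append((x, y, z, block_type))

def build_snowman(base_x: int = 0, base_y: int = 0, base_z: int = 0) -> None:
    """Stack two snow blocks and a carved pumpkin to suggest Frosty the Snowman."""
    for height, block_type in enumerate(["snow_block", "snow_block", "carved_pumpkin"]):
        place_block(base_x, base_y + height, base_z, block_type)

if __name__ == "__main__":
    build_snowman()
    print(blocks)
```

The point of the benchmark is that readers never need to inspect a script like this; they only compare the rendered results.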

But it’s easier for most MC-Bench users to judge whether a snowman looks good than to dig into code, which gives the project wider appeal, and thus the potential to gather more data about which models consistently score higher.

Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they’re a strong signal, though.

“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they’re heading in the right direction.”