Can an LLM run a Vending Machine?
Benchmarking LLMs has evolved from (increasingly gamed) single leaderboards to more useful task-specific benchmarks. One super-interesting new one is Vending Bench by Andon Labs, which compares different models' ability to independently and profitably run a vending machine.
Their goal (testing agentic capabilities) is pretty different from what we are up to at maia, but their results are fascinating and genuinely useful for our efforts in a number of ways that we'll write about in the future. But two immediate top-line takeaways:
1) Similar to the findings in Otis et al (discussed here), AI for MSMEs in developing countries isn't as simple as letting ChatGPT take the wheel. If you just put an LLM in charge of a vending machine, Vending Bench shows most actually go out of business, and this is a much more straightforward business than those of our users. LLMs are amazing and can massively improve MSME productivity, but their capabilities are a 'jagged frontier': they succeed at some very hard tasks but can also fail at some very simple ones. Teaching MSMEs about that jagged frontier, and building scaffolding around tools like Maia to adjust for it, is a core part of the work.
2) Behind most buzzy AI hype areas is a kernel of truth, and that's true of 'context engineering'. Figuring out what information to put into an LLM significantly impacts the value of what comes out. If you look at how LLMs fail at running a vending machine, they succeed steadily for a while, then enter a sharp death loop where their results crater, and Andon Labs' paper shows this isn't because they are hitting the models' technical context limits. Often well within those limits, an LLM can simply go off the rails, as anyone who has used an LLM for programming and debugging can attest, and recovering performance requires a context refresh. This is one reason we are hard at work rolling out our new backend for Maia, which features quite a few evolutions in managing chat history: key business context is always used to customize advice, while the bloat of long multi-turn conversations doesn't degrade performance.
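To make the idea concrete, here is a minimal sketch of one common pattern for this kind of chat-history management (this is an illustration, not Maia's actual backend): a small set of "pinned" business facts goes into every prompt, while the rolling conversation history is capped, with older turns compressed into a summary so the context stays lean. The function name and message format are assumptions for the example.

```python
# Illustrative sketch (not Maia's actual backend): pin key business
# context in every prompt, keep only the most recent turns verbatim,
# and compress everything older into a short summary note.

def build_context(pinned_facts, history, max_history_turns=6):
    """Assemble a messages list for an LLM call.

    pinned_facts: short strings that must always be in context
    history: list of (role, text) tuples, oldest first
    """
    messages = [{
        "role": "system",
        "content": "Key business context:\n" + "\n".join(pinned_facts),
    }]
    dropped = history[:-max_history_turns]  # older turns to compress
    kept = history[-max_history_turns:]     # recent turns kept verbatim
    if dropped:
        # In a real system the dropped turns would be summarized by a
        # model; here we just note how many were compressed away.
        messages.append({
            "role": "system",
            "content": f"(summary of {len(dropped)} earlier turns)",
        })
    messages.extend({"role": r, "content": t} for r, t in kept)
    return messages

history = [("user", f"turn {i}") for i in range(10)]
ctx = build_context(["Shop sells phone accessories",
                     "Monthly rent: $200"], history)
print(len(ctx))  # 1 pinned msg + 1 summary + 6 recent turns = 8
```

The design choice worth noting is the asymmetry: business facts never age out of the prompt, but conversational turns do, which is exactly the behavior the death-loop findings argue for.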
There is also really interesting material there about which business tasks LLMs perform better than others, but we'll leave that topic for another day. If you are interested in learning more, Anthropic did a really entertaining post about their Vend test of Claude in their office, which you can find here, and you can find Andon Labs' live Vending Bench results here.