Model Madness: Same Prompt, Four Answers
It's four days after Selection Sunday. The bracket is set. Duke is the overall one seed. The first round tips off today.
I asked four AI models to fill out a bracket.
Same prompt. Word for word. No extra context, no hints, no seeding data handed to them. Just: "Give me a final four full bracket for the NCAA men's tournament that's about to start."
Every model gave a different champion. Every single one.
The Responses
Claude went with Florida. Defending champions, best rebounding margin in the country, frontcourt depth that wears teams down in March. Confident, specific, cited real numbers. The pick nobody else made.
ChatGPT went with UConn. Knows how to close, doesn't beat itself, tournament experience. Solid reasoning, except it also picked Purdue - and Purdue lost Zach Edey to the NBA. ChatGPT either didn't know or didn't care. The picks felt like last year's field more than this year's.
Gemini went with Duke. The most detailed response by far - it gave regional reasoning, cited Lucas Oil Stadium's historical tilt toward blue bloods, flagged specific upset alerts, and predicted a Duke-Purdue title game rematch, citing 2010 and 2015 as precedent. A lot of words. A lot of confidence.
Grok went with Arizona. Momentum, favorable draw, peaking at the right time. Grok also opened by noting the tournament had just started - hedging before it even made a pick - and then offered to redo the bracket with more upsets the moment it finished handing over its picks. The most self-aware response. Maybe the least committed.
The One Thing They Agreed On
Arizona in the Final Four. All four models, independently, put the Wildcats in Indianapolis. That's it. That's the only unanimous call across all four brackets.
Everything else was chaos.
Why I Did This
Partly because it's March and I was filling out a bracket anyway.
But also because I've been thinking about how different these models actually are in practice - not on benchmarks, but on a real question with genuine uncertainty. A bracket prediction is a good test. The information is public. The question is bounded. And the outcome is verifiable.
In three weeks, we'll know who was right.
The Site
I built a small tracker that auto-scores all four sets of picks in real time using ESPN's public API. As games finish, pick cards flip green or red, the leaderboard updates, scores accumulate. No manual updates needed.
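The scoring loop is the least glamorous part. Here's a minimal sketch of the idea in TypeScript - it hits ESPN's unofficial site API, whose URL and JSON shape are undocumented assumptions that can change without notice, and uses a made-up BracketPick type. The real site weights rounds and renders cards; the core is just this:

```ts
// Minimal sketch, not the production tracker: poll ESPN's unofficial
// scoreboard endpoint and credit each model for completed games it called.
// Endpoint and response shape are assumptions; ESPN doesn't document them.

type BracketPick = { model: string; winner: string }; // one predicted winner

const SCOREBOARD_URL =
  "https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard";

async function scorePicks(picks: BracketPick[]): Promise<Map<string, number>> {
  const res = await fetch(SCOREBOARD_URL);
  const data = await res.json();

  // Collect the display names of teams that have won completed games.
  const winners = new Set<string>();
  for (const event of data.events ?? []) {
    const comp = event.competitions?.[0];
    if (!comp?.status?.type?.completed) continue;
    const w = comp.competitors?.find((c: any) => c.winner === true);
    if (w?.team?.displayName) winners.add(w.team.displayName);
  }

  // One point per correct pick; real bracket scoring would weight by round.
  const scores = new Map<string, number>();
  for (const p of picks) {
    const gained = winners.has(p.winner) ? 1 : 0;
    scores.set(p.model, (scores.get(p.model) ?? 0) + gained);
  }
  return scores;
}
```

Run that on a timer, diff each poll against the last one, and flipping a pick card green or red is just a DOM update.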
There's also a comparison section showing how each model's champion pick stacked up against ESPN Bracket Challenge public percentages. Grok's Arizona was a contrarian pick at 18% public support. ChatGPT's UConn was the most contrarian call of all at 9% - and given ChatGPT's bracket also banks on a Purdue team that no longer has Edey, even 9% might be generous.
Built the whole thing in a single Claude Code session, which felt appropriate.
We'll see how Claude does.