2026-05-05

This is from a benchmark arena I’m building for word-association reasoning, modeled on Codenames. The matchup: an LLM-driven team (cluegiver + guesser, both invoking the same model) against a deterministic baseline that picks clues by simulating the partner’s response over a WordNet ranker — no LLM. The question is whether putting an LLM on the playing engine earns its API cost.
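
The baseline's search is worth pinning down, since it is the bar the LLM team has to clear. Below is a minimal sketch of the idea, assuming a candidate-clue vocabulary and a similarity(clue, word) scorer standing in for the WordNet ranker; the names and signatures are illustrative, not the arena's actual API.

```python
# Sketch only: score each candidate clue by the run of team words the
# simulated partner would pick before its first miss, under the same
# similarity ranker the real guesser uses. Deterministic; no LLM anywhere.
def pick_clue(candidates, board_words, team_words, similarity):
    best_clue, best_run = None, -1
    for clue in candidates:
        # Rank the remaining board the way the simulated partner would.
        ranked = sorted(board_words, key=lambda w: similarity(clue, w), reverse=True)
        run = 0
        for word in ranked:
            if word not in team_words:
                break              # partner's first miss ends the turn
            run += 1
        if run > best_run:
            best_clue, best_run = clue, run
    return best_clue, best_run     # clue plus the count to announce
```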

Across 11 experiments and roughly $1.50 of API spend, every LLM variant I’d tested had failed to beat the deterministic baseline: gpt-4.1-nano, gpt-4o-mini, and gemini-2.5-flash-lite, the cheapest tier from each of three providers, in asymmetric and symmetric configurations, prompt-hardened and not. The pattern looked decisive enough that I was about to write up “LLMs don’t compete on this game” as a closed finding. Before doing that, I ran one ladder probe with Claude Haiku 4.5: same hardened prompt, same wordnet-search opponent, $2.87. Win rate went from 0.20 (nano) to 0.60 (Haiku), Wilson 95% CI [0.45, 0.74]. Fallback rate went from 16.5% to 0.0%. Trap-driven losses went from 40% to 2.5%.
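
The interval quoted above is just the standard Wilson score interval on 24/40; a quick way to reproduce it, nothing arena-specific:

```python
from math import sqrt

def wilson_ci(wins, n, z=1.96):
    # Wilson score interval for a binomial proportion at ~95% coverage.
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

print(wilson_ci(24, 40))  # -> roughly (0.45, 0.74)
```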

The misleading signal was that the eleven prior experiments looked uniform: nano, mini, and gemini-flash-lite all came in below 0.50, regardless of prompt, regardless of pairing. The shape of the data argued for a model-tier-invariant verdict. But “tested at three cheapest-tier models” is not “tested across the model ladder.” The cheapest tiers of the three providers are the most tightly clustered points in cost-capability space; they share the same capability ceilings even when they look like independent samples. One step up the ladder wasn’t a marginal change: it crossed the threshold where the model could actually follow the structured-output rules and reason about board state without the failure-mode debris that drowned the signal at nano.

The decision rule going forward: before declaring “X doesn’t work” based on the cheapest model, run one disciplined ladder probe at the next tier up. The cost is bounded (the Haiku probe, N=20 paired, ran me $2.87, ~10x the cheapest tier) and the information gain is the answer to “is this a model-tier issue or a real ceiling?” If the next tier doesn’t move the result, the closure is much stronger. If it does move the result, you’ve avoided shipping the wrong conclusion.
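
The same rule written out as code. The 0.50 threshold against the deterministic baseline and the function name are my framing, not anything the arena enforces; a sketch, not an implementation.

```python
def can_close_negative_finding(cheapest_tier_rates, probe_rate=None, baseline=0.50):
    # Only close "X doesn't work" if the bottom of the ladder is uniformly
    # negative AND one probe a tier up also fails to clear the baseline.
    if not all(rate < baseline for rate in cheapest_tier_rates):
        return False            # not a uniform negative result to begin with
    if probe_rate is None:
        return False            # spend the bounded ladder step before closing
    return probe_rate < baseline

# The three cheapest-tier results, then the Haiku probe:
print(can_close_negative_finding([0.20, 0.408, 0.463]))                   # False: no probe yet
print(can_close_negative_finding([0.20, 0.408, 0.463], probe_rate=0.60))  # False: it was a tier issue
```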

prior win rates at cheapest tier:
  gpt-4.1-nano (E13):     8/40 = 0.20   ($0.15)
  gpt-4o-mini (E8):       0.408 (N=60)  ($0.08)
  gemini-2.5-flash-lite:  0.463 (N=54)  ($0.06)
ladder probe one step up:
  Claude Haiku 4.5 (E14): 24/40 = 0.60  ($2.87)
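
As a check that the nano-to-Haiku jump (8/40 vs 24/40) clears sampling noise, here is a plain two-proportion z-test, hand-rolled so it has no dependency on the arena code:

```python
from math import sqrt, erfc

def two_prop_z(w1, n1, w2, n2):
    # Pooled two-proportion z-test; returns z and the two-sided p-value.
    p1, p2 = w1 / n1, w2 / n2
    pooled = (w1 + w2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, erfc(abs(z) / sqrt(2))

print(two_prop_z(8, 40, 24, 40))  # -> z ≈ 3.65, p ≈ 0.0003
```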

The transferable rule: a uniform negative result at the bottom of a cost-capability ladder doesn’t generalize up the ladder. Before publishing a “this doesn’t work” finding, spend the bounded cost of one ladder step. The cheapest tiers share the same failure modes across providers (instruction-following ceilings, reasoning depth), and those failure modes aren’t the question you’re trying to answer.