2026-05-05

This is from a benchmark arena I'm building for word-association reasoning. In each game, an LLM cluegiver proposes a one-word clue under a strict validation contract: the clue must satisfy a list of legality rules (one token, not a board word, references real own-team targets, stays inside a risk budget). Failed clues route to a deterministic fallback that wastes the turn.
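
For concreteness, a minimal sketch of what that contract might look like. The two error codes that appear later in this post come from the actual logs; `Clue`, the function signature, and the other codes are illustrative stand-ins, not the arena's real API.

```python
from dataclasses import dataclass

@dataclass
class Clue:
    word: str
    targets: list[str]  # own-team board words the clue is meant to hit

def validate_clue(clue: Clue, board_words: set[str],
                  unrevealed_own: set[str], risk_budget: int) -> str | None:
    """Return a typed error code, or None if the clue is legal.

    Assumes board words are stored lowercase.
    """
    word = clue.word.lower()
    if len(word.split()) != 1:
        return "clue_not_one_token"  # hypothetical code for the one-token rule
    if any(word == b or word in b or b in word for b in board_words):
        return "clue_matches_board_word"  # the dominant failure in this post
    live = [t for t in clue.targets if t in unrevealed_own]
    if not live:
        return "selected_clue_insufficient_unrevealed_targets"
    if len(live) > risk_budget:
        return "clue_exceeds_risk_budget"  # hypothetical code for the risk rule
    return None  # legal; a non-None result routes to the deterministic fallback
```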

Ran a 40-game LLM-vs-deterministic match and got a 0.20 win rate with a 57.3% prompt-validation fallback rate. Pulled one trace, saw a selected_clue_insufficient_unrevealed_targets error where the LLM had proposed targeting an opponent card, and wrote up the mechanism: "the LLM is hallucinating target ownership." Hardened the system prompt around ownership rules. Re-ran. Fallback rate barely moved.

Then I aggregated error codes across all 40 games. The dominant failure was clue_matches_board_word: 318 of 493 total validation errors, 64% of all failures. The LLM was proposing board words as the clue word: word="saddle" when "saddle" was on the board, even when saddle was its own target. The ownership-hallucination mode I'd hardened against (selected_clue_insufficient_unrevealed_targets) accounted for 14 of 493 errors, 2.8%. I had patched the rare failure and missed the dominant one.
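
The aggregation step is basically one `collections.Counter`. This assumes each game log keeps its validation errors as records with an `error_code` field, which is my schema for illustration, not necessarily the arena's:

```python
from collections import Counter

def error_histogram(game_logs):
    """Count validation-error codes across all games, most common first."""
    counts = Counter(
        rec["error_code"]
        for game in game_logs
        for rec in game["validation_errors"]
    )
    total = sum(counts.values())
    for code, n in counts.most_common():
        print(f"{code:55s} {n:4d}  {n / total:6.1%}")
    # On this run: clue_matches_board_word 318 (64%),
    # selected_clue_insufficient_unrevealed_targets 14 (2.8%), ...
```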

The misleading signal was that the first trace I pulled happened to land on the rare error. I sampled n=1, treated it as representative, and built a system prompt around it. The fix the prompt actually needed was different: emphasize the forbidden-clue-words list, give a worked example of the substring rule, and restate the constraint at the top of the prompt instead of burying it in a checklist. After that fix, the fallback rate dropped 57.3% → 16.5%.
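
Roughly what the reworked prompt construction looked like; the wording below is a reconstruction of the shape of the fix, not the production prompt:

```python
def build_cluegiver_prompt(board_words: list[str], own_targets: list[str]) -> str:
    forbidden = ", ".join(sorted(board_words))
    return (
        # The most-violated constraint goes first, not buried in a checklist.
        "RULE 1: your clue must NOT be any board word, must not contain one "
        "as a substring, and must not be contained in one.\n"
        f"Forbidden clue words: {forbidden}\n"
        # One worked example of the substring rule.
        "Example: if 'saddle' is on the board, then 'saddle', 'saddles', and "
        "'sad' are all illegal clues, even when 'saddle' is your own target.\n"
        f"Your team's unrevealed targets: {', '.join(own_targets)}\n"
        "Respond with exactly one clue word."
    )
```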

my claim: target-ownership hallucination → 9% of fallbacks
reality:  target-ownership hallucination → 2.8% of validation errors
          clue-matches-board-word        → 64% of validation errors

The instinct to spot-check before claiming a mechanism was right. The execution was wrong: a single trace shows what can go wrong, not what does. The validation-error code is a categorical variable; I needed the histogram before I had a claim.

The transferable rule: when diagnosing failure mechanisms, aggregate the categorical signal before reading individual cases. A trace shows a possibility; the histogram shows the dominant cause. The two questions look similar ("why did this one fail?" versus "why do these fail?"), but the answers can be wildly different. If the system emits a typed error code, sort and count the codes first, then pull traces from the most common bucket.
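
As a reusable habit, that triage order fits in a few lines (same assumed log schema as above):

```python
import random
from collections import Counter

def triage(game_logs, traces_per_bucket=3):
    """Histogram first; only then read traces, sampled from the dominant bucket."""
    errors = [rec for game in game_logs for rec in game["validation_errors"]]
    counts = Counter(rec["error_code"] for rec in errors)
    dominant, _ = counts.most_common(1)[0]
    bucket = [rec for rec in errors if rec["error_code"] == dominant]
    return counts, random.sample(bucket, min(traces_per_bucket, len(bucket)))
```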