2026-05-05
This is from a benchmark arena I'm building for word-association reasoning. Each game has an LLM cluegiver propose a one-word clue under a strict validation contract: the clue must satisfy a list of legality rules (one token, not a board word, references real own-team targets, stays inside a risk budget). Failed clues route to a deterministic fallback that wastes the turn.
Ran a 40-game LLM-vs-deterministic match and got a 0.20 win rate with a 57.3% prompt-validation fallback rate. Pulled one trace, saw a selected_clue_insufficient_unrevealed_targets error where the LLM had proposed targeting an opponent card, and wrote up the mechanism: "the LLM is hallucinating target ownership." Hardened the system prompt around ownership rules. Re-ran. Fallback rate barely moved.
Then I aggregated error codes across all 40 games. The dominant failure was clue_matches_board_word: 318 of 493 total validation errors, 64% of all failures. The LLM was proposing board words as the clue word: clue="saddle" when "saddle" was on the board, even when saddle was its own target. The ownership-hallucination mode I'd hardened against (selected_clue_insufficient_unrevealed_targets) accounted for 14 of 493 errors, 2.8%. I had patched the rare failure and missed the dominant one.
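The aggregation itself is a one-liner with `collections.Counter`. The error list below is reconstructed from the counts in this post (318 + 14 + 161 other codes = 493); the real arena would pull the `code` field out of its game logs instead.

```python
from collections import Counter

# Reconstructed from the counts reported above; in practice these come
# from the typed error codes in the game logs.
errors = (
    ["clue_matches_board_word"] * 318
    + ["selected_clue_insufficient_unrevealed_targets"] * 14
    + ["other_validation_errors"] * 161
)

hist = Counter(errors)
total = sum(hist.values())
for code, n in hist.most_common():
    print(f"{code:50s} {n:4d}  {n / total:6.1%}")
```

Thirty seconds of this before writing the mechanism note would have pointed at the right bucket.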
The misleading signal was that the first trace I pulled happened to land on the rare error. I sampled n=1, treated it as representative, and built a system prompt around it. The actual prompt fix needed was different: emphasize the forbidden-clue-words list, give a worked example of the substring rule, re-state the constraint at the top of the prompt instead of burying it in a checklist. After that fix, fallback rate dropped 57.3% → 16.5%.
my claim: target-ownership hallucination ≈ 9% of fallbacks
reality: target-ownership hallucination ≈ 2.8% of validation errors
reality: clue-matches-board-word ≈ 64% of validation errors
The instinct to spot-check before claiming a mechanism was right. The execution was wrong: a single trace shows what can go wrong, not what does. The validation-error code is a categorical variable; I needed the histogram before I had a claim.
The transferable rule: when diagnosing failure mechanisms, aggregate the categorical signal before reading individual cases. A trace shows a possibility; the histogram shows the dominant cause. The two questions look similar ("why did this one fail?" versus "why do these fail?") and the answers can be wildly different. If the system emits a typed error code, sort and count the codes first, then pull traces from the most common bucket.
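That rule compresses into a small utility: count the typed codes, then return the traces worth reading first. The trace dicts here are a hypothetical shape (a flat list with an `error_code` field), not the arena's actual log format.

```python
from collections import Counter

def dominant_failure(traces: list[dict]) -> tuple[str, list[dict]]:
    """Count typed error codes, then return the most common code and
    the traces in that bucket -- the ones to read first."""
    counts = Counter(t["error_code"] for t in traces if t.get("error_code"))
    top_code, _ = counts.most_common(1)[0]
    return top_code, [t for t in traces if t.get("error_code") == top_code]
```

Reading order falls out of the return value: the histogram picks the bucket, and only then do individual traces get pulled.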