2026-05-05

This is from a benchmark arena I’m building for word-association reasoning, modeled on Codenames. Two AI roles: a cluegiver picks a one-word clue meant to point a partner at the cluegiver’s hidden target cards while avoiding bystanders and traps; a guesser sees only the clue and ranks unrevealed board words. A turn ends on a non-target reveal, and a trap reveal ends the game.
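For concreteness, the turn-resolution rules reduce to something like the sketch below. Names and structure are illustrative, not the arena's actual code:

```python
from enum import Enum

class Card(Enum):
    TARGET = "target"
    BYSTANDER = "bystander"
    TRAP = "trap"

def resolve_turn(ranked_guesses, hidden_board, max_guesses):
    """Reveal guesses in ranked order. A non-target reveal ends the turn;
    a trap reveal ends the game. Returns (targets_found, game_over)."""
    targets_found = 0
    for word in ranked_guesses[:max_guesses]:
        card = hidden_board[word]
        if card is Card.TARGET:
            targets_found += 1            # correct guess: keep going
        elif card is Card.TRAP:
            return targets_found, True    # trap: game over
        else:
            return targets_found, False   # bystander: turn over
    return targets_found, False
```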

I built an embedding-augmented cluegiver and it crushed the WordNet-only baseline 53/60 = 0.883. Then I ran the obvious follow-up: swap the guesser to the same embedding pack so both halves of the stack live in the same semantic space. The symmetric variant lost the head-to-head 20/60 = 0.333 against the cluegiver-only-emb version. Trap-driven games went from 3% (asymmetric) to 20% (symmetric): 6.7x more often, the embedding guesser walked onto a hidden trap and ended the game.

The cluegiver pulls candidate clues from a 28k-word GloVe-6B-300d slice. Score = sum(target cosines) − λ × sum(risk cosines), then re-ranked by a simulated-EV pass that asks the guesser ranker how it would actually decode each candidate. The guesser ranker is the part that flipped. The original WordNet guesser scores unrevealed board words by max-depth shared hypernym ancestor — words with no shared ancestor score exactly 0. The pipeline guesses the highest-scored word, and 0-scored words are skipped whenever any positive-scored word exists.
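Roughly, the two scorers look like the sketch below, assuming GloVe vectors loaded into numpy arrays and NLTK's WordNet; the λ value here is illustrative, not the tuned one:

```python
import numpy as np
from itertools import product
from nltk.corpus import wordnet as wn

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def clue_score(clue_vec, target_vecs, risk_vecs, lam=0.5):
    """Cluegiver score: sum of target cosines minus lambda * sum of risk cosines."""
    return (sum(cosine(clue_vec, t) for t in target_vecs)
            - lam * sum(cosine(clue_vec, r) for r in risk_vecs))

def wordnet_affinity(clue, board_word):
    """WordNet guesser score: depth of the deepest shared hypernym ancestor
    over all synset pairs; exactly 0 when no ancestor is shared."""
    best = 0
    for s1, s2 in product(wn.synsets(clue), wn.synsets(board_word)):
        for ancestor in s1.lowest_common_hypernyms(s2):
            best = max(best, ancestor.max_depth())
    return best
```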

The embedding ranker has no such structural zero. Cosine similarity is continuous on [-1, 1], and even genuinely unrelated word pairs sit at small positive cosines (0.05-0.20). For clue “horse” pointing at target “saddle”, a board risk like “wagon” might sit at cosine 0.41 (horse-drawn wagon) while the target “saddle” sits at 0.46. The guesser always takes the highest score, so with margins that thin, the small-but-positive cosines on risk words turn into guesses on risk words as soon as the true targets are exhausted or edged out.
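A minimal sketch of the embedding guesser makes the failure mode concrete, assuming a vectors dict mapping words to GloVe arrays and reusing cosine() from above: the argmax always lands somewhere, even when nothing on the board is genuinely related to the clue.

```python
def embedding_rank(clue, unrevealed, vectors):
    """Rank unrevealed board words purely by cosine to the clue. There is
    no structural zero: unrelated pairs still score ~0.05-0.20, so the
    top of the ranking always points at something."""
    return sorted(unrevealed,
                  key=lambda w: cosine(vectors[clue], vectors[w]),
                  reverse=True)
```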

I tried the obvious patch: impose a 0.20 floor and score-zero anything below it. Trap rate dropped 20% → 12%, win rate moved 0.333 → 0.367. Still net-negative. Threshold tuning is local optimization on a misaligned objective; the structural issue is that continuous similarity functions don’t reject unrelated pairs the way discrete ancestry graphs do, and the guesser role needs rejection.
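The patch, roughly (the 0.20 floor is the value I used; the surrounding mechanics and names are a sketch, again reusing cosine() and a vectors dict):

```python
UNRELATED_FLOOR = 0.20  # scores below this are treated as "unrelated"

def floored_rank(clue, unrevealed, vectors):
    """Cosine ranking with a hard unrelated floor: sub-floor scores become 0,
    and 0-scored words are skipped whenever any positive-scored word exists."""
    scored = [(cosine(vectors[clue], vectors[w]), w) for w in unrevealed]
    floored = [(s if s >= UNRELATED_FLOOR else 0.0, w) for s, w in scored]
    positives = [(s, w) for s, w in floored if s > 0]
    ranked = sorted(positives if positives else floored, reverse=True)
    return [w for _, w in ranked]
```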

The asymmetric stack — embedding cluegiver, WordNet guesser — was the right shape all along. The two halves play complementary roles. The cluegiver wants a wide candidate pool: embeddings give 7x more candidates than WordNet, including the thematic associations (saddle ↔ horse, denim, accordion) that WordNet’s hypernym graph misses. The guesser wants a selective filter: WordNet’s discrete graph rejects the spurious distributional similarity that the cluegiver’s pool never meant to commit to. The simulated-EV pass forwards only candidates the WordNet ranker can actually decode (87% of emitted clues come from the embedding pool, 13% from the WordNet fallback). Best of both, by design.
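The routing itself looks roughly like this, reusing wordnet_affinity() from earlier. The emb_candidates/wn_candidates names, the EV calculation, and the fallback policy are stand-ins for illustration, not the exact pass:

```python
def simulated_ev_route(emb_candidates, wn_candidates, targets, risks, unrevealed):
    """Keep only embedding-pool candidates the WordNet ranker can decode,
    re-rank them by a simulated decode, and fall back to a WordNet-sourced
    clue when nothing decodable survives."""
    decodable = []
    for clue in emb_candidates:
        affinity = {w: wordnet_affinity(clue, w) for w in unrevealed}
        if max(affinity.values(), default=0) == 0:
            continue                          # WordNet guesser can't see this clue
        ranked = sorted(unrevealed, key=affinity.get, reverse=True)
        top = ranked[:len(targets)]           # words it would actually guess
        ev = (sum(1 for w in top if w in targets)
              - sum(1 for w in top if w in risks))
        decodable.append((ev, clue))
    if decodable:
        return max(decodable)[1]              # best simulated decode
    return wn_candidates[0]                   # WordNet fallback clue
```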

The transferable rule: continuous similarity functions need a discrete “unrelated” floor to be useful as decoders. Symmetry isn’t always the architectural goal. When two roles in a system have asymmetric needs — generation wants breadth, decoding wants selectivity — matching their tools degrades both. Reach for the same tool on both sides only after checking that the role demands actually agree.