2026-05-06

The honest test for whether a learned lever generalizes is to run it on data the system wasn’t tuned for. For most ML work that means a held-out split. For a system whose lever is “augment the candidate pool with a 28k-vocab GloVe slice,” a random-sample split isn’t enough; the held-out boards will share too much vocab with the training boards.
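To make that concrete, here's a toy check (pack path, board size, and split sizes are illustrative, not the arena's actual config): sample boards from the pack, split them randomly, and measure how much of the held-out vocabulary already appears in the training boards.

```python
# Toy check: with ~1000 pack words and 25-word boards (as in Codenames), a
# random board split leaves essentially no unseen vocabulary in the held-out
# boards. Pack path and split sizes are illustrative.
import random

pack = open("packs/wordnet-concrete-v0.1.txt").read().split()
boards = [random.sample(pack, 25) for _ in range(400)]
random.shuffle(boards)
train, held_out = boards[:320], boards[320:]

train_vocab = {w for b in train for w in b}
held_vocab = {w for b in held_out for w in b}
shared = len(held_vocab & train_vocab) / len(held_vocab)
print(f"held-out vocab already seen in training boards: {shared:.0%}")  # ~100%
```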

This is from a benchmark arena I’m building for word-association reasoning, modeled on Codenames. The current engine pairs a GloVe-augmented cluegiver with a WordNet-only guesser. On the original word pack (wordnet-concrete-v0.1, ~1000 high-frequency English nouns), the engine beat the WordNet-only baseline by +76.7pp at N=360. The question was whether that gap was real or an artifact of the pack the embedding lever was implicitly tuned to.

I built pack-2 with the same shape as pack-1 (concrete nouns, WordNet + Brown frequency filter, ~1000 words, physical_entity ancestor) but vocab-disjoint by construction via --exclude-pack. I then rebuilt the slim embedding pack to cover both packs, so missing-vector gaps wouldn’t be a confound (99.5% pack-2 coverage vs 70.8% before). N=360 against the same wordnet-search baseline.
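A rough sketch of the pack-2 build, assuming the pack tool roughly follows WordNet's physical_entity subtree plus a Brown-corpus frequency floor; function names, thresholds, and file paths here are illustrative, not the actual CLI:

```python
# Illustrative reconstruction of the pack-2 build: concrete WordNet nouns
# (descendants of physical_entity), filtered by Brown-corpus frequency, with
# pack-1's vocabulary excluded so the two packs are disjoint by construction.
from collections import Counter
from nltk.corpus import brown, wordnet as wn

def build_pack(target_size=1000, min_brown_count=5, exclude=frozenset()):
    freq = Counter(w.lower() for w in brown.words())
    physical_entity = wn.synset("physical_entity.n.01")
    candidates = set()
    for synset in wn.all_synsets(pos="n"):
        # Keep only synsets under physical_entity (the "concrete" filter).
        ancestors = {h for path in synset.hypernym_paths() for h in path}
        if physical_entity not in ancestors:
            continue
        for lemma in synset.lemma_names():
            word = lemma.lower()
            if "_" in word or word in exclude:
                continue
            if freq[word] >= min_brown_count:
                candidates.add(word)
    # Most frequent candidates first, capped at the target size.
    return sorted(candidates, key=lambda w: -freq[w])[:target_size]

pack1 = set(open("packs/wordnet-concrete-v0.1.txt").read().split())
pack2 = build_pack(exclude=frozenset(pack1))
assert not set(pack2) & pack1    # vocab-disjoint by construction
```

The slim-pack rebuild is then just a re-slice of the full GloVe file over a vocabulary that includes both packs, with a coverage check on the pack words themselves:

```python
# Re-slice the full GloVe 300d file so both packs are covered, then report the
# fraction of each pack that actually has a vector. File names are illustrative;
# `clue_vocab` stands in for the ~28k-word candidate clue vocabulary.
clue_vocab = set(open("packs/clue-vocab-28k.txt").read().split())
keep = clue_vocab | pack1 | set(pack2)
have_vector = set()
with open("glove.6B.300d.txt") as src, open("packs/glove-slim.txt", "w") as dst:
    for line in src:
        word = line.split(" ", 1)[0]
        if word in keep:
            dst.write(line)
            have_vector.add(word)
for name, pack in [("pack-1", pack1), ("pack-2", set(pack2))]:
    print(f"{name} coverage: {len(pack & have_vector) / len(pack):.1%}")
```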

Result: engine-v1 won 240/360 (66.7%) vs baseline 120/360 (33.3%). +33.3pp pooled, all six seeds the same sign, five of six individually significant. The lift was real and direction-stable. The magnitude shrank: +76.7pp on pack-1, +33.3pp on pack-2. Roughly 44% of the embedding lever’s ROI was retained on a vocabulary the system wasn’t tuned for.
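For reference, a minimal sketch of the significance checks, assuming the 360 games split evenly into six seeds of 60 and a simple two-sided binomial test against a 50% head-to-head win rate; the arena's actual test and per-seed counts come from the run logs, not this snippet:

```python
# Minimal per-seed and pooled significance check, assuming six seeds of 60
# head-to-head games each and a two-sided binomial test against p = 0.5.
# `wins_by_seed` must be read from the actual run logs; no values are assumed here.
from scipy.stats import binomtest

def report(wins_by_seed, games_per_seed=60):
    for seed, wins in enumerate(wins_by_seed):
        p = binomtest(wins, games_per_seed, 0.5, alternative="two-sided").pvalue
        print(f"seed {seed}: {wins}/{games_per_seed} engine wins, p = {p:.4f}")
    total_wins = sum(wins_by_seed)
    total_games = games_per_seed * len(wins_by_seed)
    pooled = binomtest(total_wins, total_games, 0.5, alternative="two-sided").pvalue
    print(f"pooled: {total_wins}/{total_games} engine wins, p = {pooled:.2e}")
```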

The mechanism was intrinsic to GloVe-300d’s representation of low-frequency English. Pack-1 was downsampled by raising the frequency floor; its words have richer neighborhood structure (more near-neighbors at meaningful similarity). Pack-2 included more specialist words (“agglutinin”, “alabaster”, “armhole”) whose embedding neighbors are less predictive of game-relevant similarity. The cluegiver had fewer and weaker candidate clues per board word on pack-2, and the lever shrank. The WordNet baseline degraded less under the same shift, since hypernym ancestry coverage was similar across packs, so the differential narrowed.
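One way to sanity-check the neighborhood-richness claim is to count, for each pack word, how many GloVe neighbors sit above a cosine-similarity threshold; the 0.45 threshold and file paths below are illustrative, not the cluegiver's actual settings.

```python
# Rough diagnostic: for each pack word, count GloVe neighbors above a cosine
# similarity threshold. If pack-2 words sit in sparser neighborhoods, their
# median count should be visibly lower than pack-1's.
import numpy as np

def load_glove(path):
    words, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=np.float32))
    mat = np.vstack(rows)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit vectors: dot = cosine
    return words, mat

def neighbor_counts(pack, words, mat, threshold=0.45):
    index = {w: i for i, w in enumerate(words)}
    counts = []
    for w in pack:
        if w not in index:
            continue
        sims = mat @ mat[index[w]]
        counts.append(int((sims >= threshold).sum()) - 1)  # exclude the word itself
    return np.array(counts)

words, mat = load_glove("packs/glove-slim.txt")
for name, path in [("pack-1", "packs/wordnet-concrete-v0.1.txt"),
                   ("pack-2", "packs/pack-2.txt")]:
    pack = set(open(path).read().split())
    counts = neighbor_counts(pack, words, mat)
    print(f"{name}: median near-neighbors per word = {np.median(counts):.0f}")
```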

Two reportable numbers, not one. The +76.7pp number reflects pack-1’s favorability for the embedding lever; the +33.3pp number is the cleaner generalization-class delta. External claims about the engine should report both. The build-the-disjoint-pack step was the load-bearing one. Sampling a held-out slice from the same vocabulary distribution would have made the embedding cluegiver look more general than it actually is.