2026-05-06
A held-out vocabulary pack had degraded the engine’s win-rate margin over a baseline from +76.7pp to +33.3pp. The instinctive next move: more vocab, more vectors, attack the gap with scale. Two experiments later, both hypotheses falsified.
This is from a benchmark arena I’m building for word-association reasoning, modeled on Codenames. The current engine pairs a GloVe-6B-300d cluegiver (28k slim vocab) with a WordNet guesser. The 76.7pp number was on the same word pack the embedding pack was tuned for. The 33.3pp number was on a vocab-disjoint pack of lower-frequency English nouns, where the embedding lever shrank.
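For concreteness, the embedding side of the cluegiver reduces to nearest-neighbor scoring over the slim pack. A minimal sketch, assuming the pack is a plain GloVe-style text file (one word plus 300 floats per line); the file name, helper names, and mean-cosine scoring here are illustrative, not the engine’s actual code:

    import numpy as np

    def load_pack(path):
        # One "word f1 ... f300" line per entry, GloVe text format.
        vecs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vecs

    def clue_candidates(vecs, targets, top_n=50):
        # Rank every vocab word by mean cosine similarity to the targets.
        t = np.stack([vecs[w] / np.linalg.norm(vecs[w]) for w in targets])
        scored = []
        for word, v in vecs.items():
            if word in targets:
                continue
            sims = t @ (v / np.linalg.norm(v))
            scored.append((float(sims.mean()), word))
        return sorted(scored, reverse=True)[:top_n]

    pack = load_pack("glove_6b_300d_slim28k.txt")  # hypothetical file name
    print(clue_candidates(pack, ["ocean", "wave", "salt"])[:5])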
Experiment one: rebuild the slim pack at 60k vocab via build_embedding_pack.py --top-n 60000, doubling coverage of the lower-frequency band. N=360 against the same baseline. Result: +31.7pp pooled, CIs fully overlapping the 28k version. The difference is noise. The cluegiver’s simulator-EV gate caps the candidate pool at top-K by EV, so doubling the vocabulary just added more low-EV candidates that got filtered out. The top-K was already saturated at 28k.
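The gate mechanism behind the null result, as a sketch (the gate shape, K, and the EV callable are placeholders of mine, not the engine’s signatures):

    def gate(candidates, simulate_ev, k=200):
        # Simulator-EV gate: score every candidate with the game
        # simulator, keep only the top-K by expected value.
        scored = sorted(candidates, key=simulate_ev, reverse=True)
        return scored[:k]

If the top-K by EV is already filled from the 28k vocabulary, the extra 32k lower-frequency words admitted at 60k all rank below the cutoff and get filtered out, so the clue actually emitted never changes.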
Experiment two: keep the algorithm and dimension fixed (GloVe, 300d), swap the training corpus from 6B tokens (Wikipedia + Gigaword) to 840B tokens (Common Crawl). 140x more tokens, 5x more vocab, ~2.5GB pretrained model. Three sub-runs covering both packs and a case-aggregated variant. Pack-1 (the home pack) regressed from +76.7pp to +45.0pp cased, +38.9pp case-aggregated. Pack-2 gained +5pp. Net: 840B is not viable. Reverted to 6B.
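The case-aggregated variant needs a note: 840B is cased (“Apple” and “apple” are separate entries, unlike the lowercased 6B), so one sub-run collapsed case variants before play. A sketch under the assumption of plain vector averaging (the actual aggregation rule may differ):

    from collections import defaultdict
    import numpy as np

    def case_aggregate(vecs):
        # Collapse cased entries ("Apple", "APPLE", "apple") into one
        # lowercase key by averaging their vectors.
        buckets = defaultdict(list)
        for word, v in vecs.items():
            buckets[word.lower()].append(v)
        return {w: np.mean(vs, axis=0) for w, vs in buckets.items()}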
The mechanism was structural. Common Crawl is messier than Wikipedia + Gigaword. Web junk (code, names, HTML, repetitive boilerplate) pollutes the neighborhood structure for everyday English nouns, exactly the category the packs draw on. 6B’s clean-text training produces tighter, game-association-relevant neighborhoods despite 140x fewer tokens. Case sensitivity wasn’t the explanation: the case-aggregated 840B run regressed further than the cased one.
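A quick way to eyeball the pollution claim, assuming gensim is on hand and both text files are local (no_header=True because GloVe text files lack the word2vec header line; gensim >= 4.0):

    from gensim.models import KeyedVectors

    kv_6b = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", no_header=True)
    kv_840b = KeyedVectors.load_word2vec_format("glove.840B.300d.txt", no_header=True)

    for word in ["pitch", "bark", "spring"]:  # arbitrary everyday nouns
        print(word)
        print("  6B:  ", [w for w, _ in kv_6b.most_similar(word, topn=8)])
        print("  840B:", [w for w, _ in kv_840b.most_similar(word, topn=8)])

If the mechanism holds, the 840B lists for everyday nouns should show more proper names, codey tokens, and boilerplate fragments than the 6B lists.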
The vocab-expansion experiment falsified “more vocabulary closes the gap”: the EV gate was already saturated, so the extra vocabulary never reached an emitted clue. The corpus-upgrade experiment falsified “more training data closes the gap”: the extra tokens drowned relevant signal in unrelated noise. The remaining vector-source levers (fastText subword, contextual embeddings) test different axes and cost more to set up. The takeaway so far: the gate logic was doing more work than the embedding pool size, and a clean training distribution beat raw token count on this narrow lexical task.