2026-03-08
38% of entity_state was junk (464 of 1,204 production rows). Topic noise like “SEMICONDUCTOR INDUSTRY” and “MORTGAGE RATES.” Unresolvable companies like “ARK INNOVATION ETF” and “BAE SYSTEMS.” Foreign or malformed tickers like “3LAM.PA” and “OTCMKTS:POETF.” The polluted rows hit entity pages, the dashboard, and a third-party CDN icon lookup that expects real ticker symbols; they also burned LLM calls synthesizing states for entities nobody would ever view.
The pipeline tags every signal with an entity (a ticker, a company name, or a topic theme), then maintains a per-entity rolling state record. Two failures had compounded. The resolver fell back to returning the raw uppercased name when it couldn’t map to a known ticker. That was the faucet. A March 7 patch added a regex check upstream of the index but only filtered rows tagged entity_type == "ticker"; the 353 entities tagged general or topic bypassed it entirely.
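A minimal sketch of how the two failures compound. Everything except the entity_type field is a hypothetical stand-in: the log does not show the real alias table, regex, or row shape, so KNOWN_TICKERS, TICKER_RE, resolve_with_fallback, and march7_filter are illustrative only.

```python
import re

# Hypothetical stand-ins; the real alias table and regex are not shown in this log.
KNOWN_TICKERS = {"NVIDIA": "NVDA", "APPLE": "AAPL"}
TICKER_RE = re.compile(r"[A-Z]{1,5}")


def resolve_with_fallback(name: str) -> str:
    """Pre-fix behaviour: on a miss, hand back the raw uppercased name."""
    key = name.strip().upper()
    return KNOWN_TICKERS.get(key, key)


def march7_filter(rows):
    """Regex check that only inspects rows typed as 'ticker'."""
    kept = []
    for row in rows:
        if row["entity_type"] == "ticker" and not TICKER_RE.fullmatch(row["entity"]):
            continue  # malformed ticker: dropped
        kept.append(row)  # any other entity_type is never inspected
    return kept


signals = [
    {"entity": resolve_with_fallback("Nvidia"), "entity_type": "ticker"},
    {"entity": resolve_with_fallback("3LAM.PA"), "entity_type": "ticker"},
    {"entity": resolve_with_fallback("mortgage rates"), "entity_type": "topic"},
    {"entity": resolve_with_fallback("BAE Systems"), "entity_type": "general"},
]

indexed = [row["entity"] for row in march7_filter(signals)]
print(indexed)
```

The malformed ticker is caught, but the topic and general rows reach the index untouched because the filter never looks at them.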
Fix was structural: one gate, all types, no fallback to raw names. Every entity now has to pass _resolve_name() before entering the index. The resolver checks macro aliases, hardcoded ticker aliases, and ~13K SEC company tickers. No match, no row. Aliases expanded by 16 tickers plus macro variants (WTI, ^GSPC, ^DJI, ^IXIC, VIX, US Treasury). When a legitimate new company shows up, the move is to add a line to the alias table, never to weaken the gate.
Production cleanup: 1,204 to 740 entities. Deleted 464 state rows plus FK cascades on history and change tables. Result: 728 tickers + 12 macro entities. Local: 1,258 to 736.
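A sketch of the cleanup pattern using an in-memory SQLite database. The schema, the entity_history table name, and the sample rows are hypothetical, and the production database engine is not stated in this log; the point is only that deleting a state row removes its dependent rows through ON DELETE CASCADE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE entity_state (entity TEXT PRIMARY KEY);
    CREATE TABLE entity_history (
        id INTEGER PRIMARY KEY,
        entity TEXT NOT NULL REFERENCES entity_state(entity) ON DELETE CASCADE
    );
    """
)
conn.executemany(
    "INSERT INTO entity_state (entity) VALUES (?)",
    [("NVDA",), ("MORTGAGE RATES",), ("BAE SYSTEMS",)],
)
conn.executemany(
    "INSERT INTO entity_history (entity) VALUES (?)",
    [("NVDA",), ("MORTGAGE RATES",), ("MORTGAGE RATES",)],
)

# Stand-in for "everything the gate resolves"; illustrative only.
valid = {"NVDA"}
rows = conn.execute("SELECT entity FROM entity_state").fetchall()
junk = [(entity,) for (entity,) in rows if entity not in valid]

conn.executemany("DELETE FROM entity_state WHERE entity = ?", junk)
conn.commit()

remaining_state = conn.execute("SELECT COUNT(*) FROM entity_state").fetchone()[0]
remaining_history = conn.execute("SELECT COUNT(*) FROM entity_history").fetchone()[0]
print(remaining_state, remaining_history)
```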
A resolver with a “return the raw input on miss” fallback is structurally identical to no resolver at all. The index downstream becomes a dump of every string anything upstream ever mentioned. The gate has to fail closed, and the gate has to live in one place that everything passes through. Two narrower gates that almost cover the input space still leave some path uncovered (here, every entity not typed as a ticker), and the bypass is invisible until you sample.