Feature salience anchored to engine-eval delta, not hand-set values

2026-05-06

The substrate’s top-ranked feature on a position should be the feature that actually matters most. On five real first-deviation positions from caissaresearch.com, the chess analytics product I run, it wasn’t. The lead was decided by hand-set salience values, one per detector, and the engine’s view of the position was nowhere in the ranking.

Worst case in the validation set: a position where hanging_piece fired at salience 0.73 with rationale “white can capture the bishop.” Stockfish had white at -3.2. Capturing the bishop barely moved the eval, because white was already concretely losing on a different axis. The substrate was leading with a feature the engine considered irrelevant.

The fix is to ground salience in counterfactual removal. For each fired imbalance or motif, mutate the board to neutralize the feature: remove the pinner, the hanging piece, the weak pawns, the extra bishop, the active rook. Run Stockfish on the neutralized FEN, take the eval delta, map it to [0.10, 0.95] saturating at 300cp. Mate evals project to ±10000cp so they hit the ceiling cleanly. A feature whose removal forces mate is decisive by definition.
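A minimal sketch of that pipeline for the simplest case, a feature neutralized by removing one piece. The range [0.10, 0.95], the 300cp saturation point, and the ±10000cp mate projection come from the description above; everything else is an assumption for illustration: the linear shape of the curve, the use of the absolute delta, and the names remove_piece, salience_from_delta, counterfactual_salience and evaluate_cp. The engine is injected as a plain callable so the sketch stays independent of any particular Stockfish binding.

```python
from typing import Callable

# Range and saturation point are from the text; the linear ramp is an assumption.
FLOOR, CEILING = 0.10, 0.95
SATURATION_CP = 300
MATE_CP = 10_000  # the evaluator is assumed to project mate evals to +/- this


def remove_piece(fen: str, square: str) -> str:
    """Return a FEN with whatever stands on `square` (e.g. 'c1') removed.

    Only the piece-placement field is edited. Castling rights, en passant
    and legality (e.g. removing a king) are NOT re-validated here.
    """
    fields = fen.split()
    ranks = fields[0].split("/")
    file_idx = "abcdefgh".index(square[0])
    rank_idx = 8 - int(square[1])  # FEN lists rank 8 first

    # Expand digit runs so every square is one character.
    cells = []
    for ch in ranks[rank_idx]:
        cells.extend("." * int(ch) if ch.isdigit() else ch)
    cells[file_idx] = "."

    # Recompress runs of empty squares back into digits.
    out, run = [], 0
    for c in cells:
        if c == ".":
            run += 1
            continue
        if run:
            out.append(str(run))
            run = 0
        out.append(c)
    if run:
        out.append(str(run))

    ranks[rank_idx] = "".join(out)
    fields[0] = "/".join(ranks)
    return " ".join(fields)


def salience_from_delta(delta_cp: float) -> float:
    """Map an eval delta in centipawns to [FLOOR, CEILING], saturating at 300cp."""
    magnitude = min(abs(delta_cp), SATURATION_CP)
    return FLOOR + (CEILING - FLOOR) * magnitude / SATURATION_CP


def counterfactual_salience(
    fen: str, square: str, evaluate_cp: Callable[[str], float]
) -> float:
    """Salience of the feature carried by the piece on `square`."""
    baseline = evaluate_cp(fen)
    neutralized = evaluate_cp(remove_piece(fen, square))
    return salience_from_delta(baseline - neutralized)
```

In a real setup evaluate_cp would wrap a Stockfish call at fixed depth. With python-chess, for example, info["score"].white().score(mate_score=10000) returns a centipawn number with mates projected to roughly ±10000, which the clamp above sends straight to the ceiling. Features that are not a single piece (weak pawns, a bishop pair) would need their own neutralizing mutation.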

Result on the validation set: the lead feature now matches engine emphasis. The hanging-piece position’s salience dropped from 0.73 to 0.21, because removing the bishop barely moves an already-lost eval. The false lead is gone. Cost is 3-10 extra Stockfish calls per position at depth 14, around 50ms each, 0.3-1.0s overhead. Fine for interactive use.

The trap with hand-set salience is that it ranks features against each other on a fixed scale defined in isolation. Two detectors authored a year apart can’t compare salience values, because there’s no shared reference. Anchoring to engine-eval delta ties every feature to the same scale the position itself is measured on. Cross-detector normalization comes for free, since they all map through the same cp-delta curve.
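A small illustration of what the shared curve buys, with made-up inputs. Only the 0.73 hand-set value for hanging_piece comes from the case above; the second feature, its hand-set value, and both centipawn deltas are hypothetical, and the linear curve is the same assumption as before.

```python
FLOOR, CEILING, SATURATION_CP = 0.10, 0.95, 300


def salience_from_delta(delta_cp: float) -> float:
    """Assumed linear map from |delta| in cp to [0.10, 0.95], saturating at 300cp."""
    magnitude = min(abs(delta_cp), SATURATION_CP)
    return FLOOR + (CEILING - FLOOR) * magnitude / SATURATION_CP


# (detector, hand-set salience, eval delta in cp after neutralizing the feature)
# Deltas and the "pin" row are hypothetical values for illustration only.
fired = [
    ("hanging_piece", 0.73, 39),
    ("pin", 0.55, 210),
]

# Hand-set values compare numbers that were never on a common scale.
lead_by_hand = max(fired, key=lambda f: f[1])[0]

# Anchored values all pass through the same cp-delta curve.
lead_by_engine = max(fired, key=lambda f: salience_from_delta(f[2]))[0]

print(lead_by_hand, lead_by_engine)
```

Under these assumed numbers the hand-set ranking leads with hanging_piece while the anchored ranking leads with pin, and a 39cp delta lands at roughly 0.21 on the assumed curve.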

For a heuristic ranking system pulling from multiple sources, the ranking has to anchor to something external; otherwise the units are vibes. Counterfactual removal is the cheapest external anchor: neutralize the feature, ask the world how much the world changed. When the world has a quantitative measure (engine eval, model loss, downstream metric, conversion), the answer is a real number on a real scale, regardless of whether the detector underneath was hand-coded or learned.
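The same idea, stripped of chess, might look like the sketch below. The function name rank_by_removal and its signature are hypothetical; the only requirements are a way to neutralize each feature and a measure that returns a number.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

S = TypeVar("S")  # whatever the "world" is: a FEN, a model input, a page variant


def rank_by_removal(
    state: S,
    neutralizers: Dict[str, Callable[[S], S]],
    measure: Callable[[S], float],
) -> List[Tuple[str, float]]:
    """Rank features by how far the measure moves when each one is neutralized."""
    baseline = measure(state)
    impact = {
        name: abs(baseline - measure(neutralize(state)))
        for name, neutralize in neutralizers.items()
    }
    # Largest change first; the units are the measure's own units.
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)


# Toy world: the state is a dict of contributions, the measure is their sum.
toy_state = {"a": 5.0, "b": 1.0}
toy_neutralizers = {
    name: (lambda s, n=name: {k: v for k, v in s.items() if k != n})
    for name in toy_state
}
ranking = rank_by_removal(toy_state, toy_neutralizers, lambda s: sum(s.values()))
print(ranking)
```

The detectors themselves never appear in the harness. Whether a feature was found by a hand-coded rule or a learned model, it is ranked only by what its removal does to the measure.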