2026-05-06

Two FOW-aware changes shipped today on a Fog of War chess engine I’m building. Both held strength against a random opponent. Both dropped agreement with Stockfish-with-truth, the metric I’d been using as a fairness check between configs.

The baseline (uniform opponent prior, no fog discount) scored a 100% win rate vs random and 34.6% agreement with Stockfish-with-truth. Agreement means Stockfish, given the true board, picked the same move the engine picked.
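The agreement metric is just a match rate over sampled positions. A minimal sketch, with hypothetical names (the real harness's interface differs):

```python
def agreement_rate(engine_moves, oracle_moves):
    """Fraction of positions where the engine's chosen move matches
    the move Stockfish-with-truth picks on the same (true) board.
    Both inputs are parallel lists of UCI move strings."""
    assert len(engine_moves) == len(oracle_moves)
    matches = sum(e == o for e, o in zip(engine_moves, oracle_moves))
    return matches / len(engine_moves)
```

A 34.6% baseline means roughly one position in three where the fog-aware engine and full-information Stockfish land on the same move.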

P4.1 added a Stockfish-shallow opponent prior: a softmax over depth-4 multipv scores, blended with uniform. Result: 96.7% win rate, 25.9% agreement, Δ −8.7 pp.
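The shape of that prior, as a sketch. All names and knob values here (`temp`, `blend`, the centipawn inputs) are illustrative assumptions, not the engine's actual parameters:

```python
import math

def opponent_prior(multipv_cp, legal_moves, temp=100.0, blend=0.5):
    """Opponent move prior: softmax over shallow-Stockfish multipv
    centipawn scores, blended with a uniform prior over legal moves.
    multipv_cp maps move -> centipawn score from the opponent's view;
    moves missing from the multipv list get only uniform mass."""
    uniform = 1.0 / len(legal_moves)
    if multipv_cp:
        best = max(multipv_cp.values())
        # subtract the max before exponentiating for numerical stability
        exps = {mv: math.exp((cp - best) / temp) for mv, cp in multipv_cp.items()}
        z = sum(exps.values())
    prior = {}
    for mv in legal_moves:
        soft = exps.get(mv, 0.0) / z if multipv_cp else uniform
        prior[mv] = blend * soft + (1 - blend) * uniform
    # renormalize in case multipv contained moves outside legal_moves
    total = sum(prior.values())
    return {mv: p / total for mv, p in prior.items()}
```

Against a truly random opponent, any `blend` above zero concentrates belief mass on strong replies the opponent will never preferentially play, which is exactly the mismatch discussed below.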

P2.3 added a fog discount: a static penalty for our pieces sitting in opponent territory without defenders. Result: 100% win rate, 30.0% agreement, Δ −4.6 pp.
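The discount is a simple static term. A sketch under assumed names and an assumed penalty value; the real engine's board representation and weights differ:

```python
def fog_discount(our_pieces, defended, opponent_half, penalty_cp=30):
    """Static fog penalty: subtract penalty_cp centipawns for each of
    our pieces in opponent territory with no friendly defender.
    our_pieces: set of squares we occupy; defended: subset of those
    with at least one defender; opponent_half: squares counted as
    opponent territory. Returns a (non-positive) centipawn adjustment."""
    exposed = [sq for sq in our_pieces
               if sq in opponent_half and sq not in defended]
    return -penalty_cp * len(exposed)
```

The point of the term is to price in what the engine cannot see: an undefended piece deep in fogged territory may already be attacked.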

The Measurement Protocol cross-check trigger fires at Δagreement ≥ 5 pp. Both fired, in the wrong direction. The reflex read is that both changes are bad. The actual story is that the oracle was the bug.

Stockfish-with-truth evaluates positions as if both sides see everything. It has no representation of hidden information, no concept that the player I’m scoring is making decisions under partial observability. Any change that makes the engine play moves a full-info Stockfish wouldn’t pick (defensive retreats from exposed pieces, particle-aware avoidances, fog discounts on undefended pieces in enemy territory) drops agreement by definition. The metric counts FOW-aware play as a regression because the oracle doesn’t know FOW exists.

Two compounding causes for P4.1 specifically. One: FOW-aware play registers as wrong against the oracle. Two: a tighter opponent prior, which models the actually-random opponent as a strong-move player, makes the belief state collapse faster, which sends the engine into its fallback path more often, which produces deterministic moves that also disagree with Stockfish-truth. P2.3 only had cause one.

Track A, the agreement-with-oracle check, still catches obvious regressions. An evaluator bug that makes the engine play hanging-piece blunders shows up here. But the bound is narrower than I'd assumed: the metric is fair only when the change preserves the regime the oracle models. Anything that exploits structure the oracle ignores reads as a regression even when it's an improvement.

The fix is a different track: head-to-head against a peer that lives in the same regime. Two engines under partial observability, same rules, paired round-robin, Elo from the result. The peer doesn’t have to be strong. It has to share the regime.
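Converting a head-to-head score fraction into an Elo gap is the standard logistic model. A sketch (function name is mine; the clamp guards against 0% or 100% scores, where the formula diverges):

```python
import math

def elo_diff(score_fraction):
    """Elo difference implied by a head-to-head score fraction
    ((wins + 0.5 * draws) / games) under the standard logistic
    model: expected score = 1 / (1 + 10**(-diff/400))."""
    s = min(max(score_fraction, 1e-6), 1 - 1e-6)  # avoid log of 0/inf
    return -400.0 * math.log10(1.0 / s - 1.0)

# 0.50 -> 0 Elo; ~0.64 -> ~+100 Elo
```

With a few hundred paired games per matchup, this gives a regime-matched strength estimate that, unlike oracle agreement, cannot be structurally biased against FOW-aware play.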

Agreement-with-oracle is only a fairness check when the oracle and the system under test share the regime. When they don’t, the metric is biased against every regime-specific improvement, and the bias is structural, not noise.