2026-05-05

A Fog of War chess engine I’m building had been silently playing worse than random for stretches of every losing game. The unit tests passed. The bot finished games. The win rate against a random opponent climbed from 81% to 89% over a few iterations. I’d been treating the gap to 100% as a structural wall and started planning the next phase to push past it.

The engine maintains 256 hypothetical board states (particles) and votes across them, because in Fog of War each player only sees the squares their own pieces can reach. Every opponent move filters out particles that contradict the latest observation. If no particle survives a filtering step, the set is empty, and nothing in the update loop refills it.
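The filtering step can be sketched minimally. The dict-based board representation and the names here are illustrative, not the engine's actual data structures:

```python
def filter_particles(particles, observation):
    """Keep only hypothetical boards that agree with every observed square.

    particles:   list of dicts mapping square -> piece (full hypothetical boards)
    observation: dict mapping each currently visible square -> piece seen
                 ('' means the square is visibly empty)
    """
    return [
        p for p in particles
        if all(p.get(sq, '') == piece for sq, piece in observation.items())
    ]

particles = [{'e4': 'P', 'd5': 'p'}, {'e4': 'P', 'd5': ''}]
filter_particles(particles, {'d5': 'p'})   # one survivor
filter_particles(particles, {'e4': ''})    # contradicts both hypotheses: []
```

The last call is the failure mode: one observation that no hypothesis can explain, and the whole set dies at once.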

I added per-ply diagnostics and rendered them as an HTML report. Ran it across all 11 losses.
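The check that mattered in that report is trivially small. A sketch, assuming the diagnostics are collected as (ply, unique_particles) pairs:

```python
def collapse_plies(diagnostics):
    """Plies where the particle set was empty, i.e. the filter had collapsed.

    diagnostics: list of (ply, unique_particles) pairs, one per ply.
    """
    return [ply for ply, n in diagnostics if n == 0]

collapse_plies([(1, 256), (2, 40), (3, 0), (4, 0)])  # -> [3, 4]
```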

In 6 of 11, unique_particles dropped to zero mid-game and stayed there. The filter had collapsed entirely. Once collapsed, my move-selection function fell through to return own_legal_moves[0]. The bot was playing the first alphabetical legal move every turn for the rest of the game. Strictly worse than random play. The games were ~250 plies of degenerate shuffling I’d been writing off as “long losses, probably belief degradation.”
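The shape of the bug, reconstructed with illustrative names (the real selection function does more):

```python
def select_move_buggy(particles, own_legal_moves):
    # Reconstruction of the failure mode, not the engine's actual code.
    if not particles:
        # "Graceful" default: the first move in an alphabetically sorted
        # move list. Deterministic, legal, never crashes -- and invisible.
        return own_legal_moves[0]
    raise NotImplementedError  # normal voting path elided

moves = sorted(['e2e4', 'a2a3', 'g1f3'])
select_move_buggy([], moves)  # 'a2a3', this turn and every turn after
```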

The fix was 30 lines. When the particle set is empty, score moves on a visibility-only synthesized board (own pieces plus visible opponent pieces) and pick the best-scoring one. King-capture detection on visible kings still fires; captures of visible pieces still score; threats from visible pieces still get penalized. The next 200-game bake-off ran 197W 3L 0D, 98.5%. Up from 178W 17L 5D, 89.0%.
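A sketch of the fix under the same illustrative dict-board representation; `score_move` here is a toy stand-in for the engine's real evaluator:

```python
def fallback_move(own_pieces, visible_opponent_pieces, own_legal_moves, score_move):
    """Particle set collapsed: score moves on a board synthesized from
    what is actually visible, instead of returning a blind default."""
    synthesized = {**own_pieces, **visible_opponent_pieces}
    return max(own_legal_moves, key=lambda m: score_move(m, synthesized))

# Toy evaluator: reward landing on a visible opponent (lowercase) piece.
def score_move(move, board):
    return 1 if board.get(move[2:4], '').islower() else 0

fallback_move({'e4': 'P'}, {'d5': 'p'}, ['e4e5', 'e4d5'], score_move)  # 'e4d5'
```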

The fallback I’d written was strictly worse than the simplest possible baseline: a random legal move would have outperformed it in every collapsed game. The fact that it never crashed and never threw an error made it invisible to every test I had. The tests asked “does the engine pick a move?” Yes. They didn’t ask “is the move better than random?” — and the bake-off couldn’t catch it either, because it averages win rate across hundreds of moves, and a hundred-ply degenerate stretch gets washed into noise.
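The missing test is a baseline-competitiveness check, not a crash check. A minimal Monte-Carlo shape, assuming some per-position scoring oracle exists:

```python
import random

def fallback_beats_random(positions, choose_fallback, score, trials=200):
    """True if the fallback's chosen moves score at least as well, summed
    over positions, as uniform-random legal moves do on average."""
    fb_total = rand_total = 0.0
    for pos in positions:
        moves = pos['legal_moves']
        fb_total += score(choose_fallback(pos), pos)
        rand_total += sum(score(random.choice(moves), pos)
                          for _ in range(trials)) / trials
    return fb_total >= rand_total
```

Run against recorded collapsed positions from those six games, a check like this would have flagged the alphabetical fallback immediately.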

The reflex in error-handling is to degrade gracefully and return a reasonable default. Reasonable defaults are not free. A fallback structurally worse than the next-simplest baseline is a silent regression dressed as graceful degradation. The instrumentation that surfaces it has to be specific: not “did anything fail,” but “is each path producing output competitive with the simplest alternative.”