2026-05-08
The first fog-chess bake-offs answered the wrong shape of question.
They told me whether a version won or lost a batch of games. That was not useless, but it was too blunt for the kind of engine I was building. One obvious belief bug could make a thirty-game run take an hour and then leave me with the same next step I could have seen in the first broken game.
So the loop changed.
The new local process is laddered. Start with the smallest targeted replay that should exercise the previous failure. If that still breaks, stop and fix it. If it passes, run a tiny annotation batch. Only after the obvious failures disappear does a larger bake-off earn its cost.
The more important change is that a bake-off now emits a review queue. The trace is not just a pile of games. It ranks moments that deserve human attention: belief collapses, generic reseeds, hard-observation contradictions, tactical shortcuts, large evaluation swings, and positions where the bot’s move looks suspicious relative to the belief set.
That changed the human job from “watch a lot of games and notice something” to “inspect the few moves most likely to teach the engine.” In one pass, the annotations separated three different failure classes that would have blurred together in an aggregate score:
- belief particles contradicting visible board facts,
- move selection ignoring danger already present in the belief set,
- endgame belief repair losing pawn structure after a hidden king capture.
Those notes then became executable gates. Suggested-move annotations can be replayed. Hard observation facts can be checked against saved belief snapshots. The engine is no longer only improving by vibes after a visual review; the review creates tests and tripwires for the next version.
The public score still matters eventually. But early in an imperfect-information engine, the score is a lagging indicator. The faster compounding object is the queue of explainable failures.
A validation run should preserve the question it answered and the next questions it created.