2026-05-06

The chess substrate I’m building on caissaresearch.com had around 1500 unit tests passing on the imbalance and tactical-motif detectors. Running the new single-position analyzer on five real first-deviation positions from a recent middlegame pattern run surfaced three bugs in two hours.

Bug 1 was a trapped_piece false positive on a bishop at d3. Knight on c5 attacks (worth 320cp), pawn on c2 defends (worth 100cp), bishop value 330. Static loss after capture and recapture is 330 - 320 = 10cp, well inside any reasonable trade tolerance. The old _detect_trapped_piece rule compared cheapest attacker against piece value (320 < 330) and tagged the bishop trapped. It never modeled the recapture. Fix is per-move 2-ply lite-SEE replacing the single-attacker check, plus a skip-if-static-safe guard that runs SEE on the piece’s current square first. If staying put loses ≤50cp, the piece is by definition not trapped. As a side benefit, a Magnus position the suite had labeled “knight trapped on h5” reclassified to “passive, not trapped,” same logic: no white attackers, static loss is zero, the knight can stay put.
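The fix above can be sketched in a few lines. This is a minimal illustration, not the substrate's real code: the function names, the attacker/defender-list interface, and the 50cp tolerance plumbing are all assumptions; only the 2-ply exchange arithmetic and the skip-if-static-safe guard come from the description.

```python
# Hypothetical sketch of the 2-ply lite-SEE and the skip-if-static-safe guard.
# Interface is assumed: attacker/defender values are centipawn lists for the
# square the piece sits on.

def lite_see(piece_value, attacker_values, defender_values):
    """2-ply static exchange on one square: cheapest attacker captures,
    one defender recaptures. Returns expected centipawn loss for the
    piece's side (0 if the square is statically safe)."""
    if not attacker_values:
        return 0  # no attackers: staying put loses nothing
    if not defender_values:
        return piece_value  # capture goes unanswered: lose the whole piece
    # Attacker takes the piece, we recapture the attacker:
    # net loss = piece value minus the attacker we win back.
    return max(0, piece_value - min(attacker_values))

def is_trapped(piece_value, attackers_here, defenders_here, trade_tolerance=50):
    """Skip-if-static-safe guard: run SEE on the current square first."""
    if lite_see(piece_value, attackers_here, defenders_here) <= trade_tolerance:
        return False  # staying put loses <= tolerance: not trapped by definition
    # ...otherwise fall through to the per-escape-move lite-SEE check (elided)
    return True
```

On the d3 bishop: `lite_see(330, [320], [100])` returns 10, inside the tolerance, so the guard short-circuits before any escape-move analysis. The old rule's `320 < 330` comparison never got this far.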

Bug 2 was attach_plans recommending an offensive plan for the losing side. The position had black at -2.14 by Stockfish; the substrate emitted central_counter as the primary plan. The plan attacher had no idea black was losing, because it used material delta as the winning-side proxy and material was even on the board. Fix is threading engine_cp from the multipv top result through to the plan attacher. When engine_cp shows a side ≥150cp worse, demote that side’s offensive plans (kingside_attack, central_counter) and promote defensives (defend_passively, blockade, prophylaxis). If no defensive plan emitted, inject a synthetic defend_passively so the substrate always has a coaching answer for the losing side.
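The demote/promote logic reduces to a small reordering pass. A minimal sketch under assumptions: the plan names and the 150cp threshold are from the post, but the list-of-strings representation, function signature, and sign convention (positive engine_cp = white better) are illustrative, not the real attach_plans interface.

```python
# Hypothetical sketch of the engine_cp-aware plan adjustment.

OFFENSIVE = {"kingside_attack", "central_counter"}
DEFENSIVE = {"defend_passively", "blockade", "prophylaxis"}

def adjust_plans(plans, engine_cp, side, losing_threshold=150):
    """Reorder plans for `side` given engine_cp from the multipv top result
    (centipawns, positive = white better, an assumed convention). If the side
    is >= 150cp worse, demote its offensive plans, promote defensive ones,
    and inject a synthetic defend_passively when no defensive plan emitted."""
    side_cp = engine_cp if side == "white" else -engine_cp
    if side_cp > -losing_threshold:
        return plans  # not clearly losing: leave the ordering alone
    defensive = [p for p in plans if p in DEFENSIVE]
    offensive = [p for p in plans if p in OFFENSIVE]
    other = [p for p in plans if p not in DEFENSIVE and p not in OFFENSIVE]
    if not defensive:
        defensive = ["defend_passively"]  # synthetic fallback plan
    return defensive + other + offensive
```

On the bug's position (black at -2.14, so engine_cp around +214), a bare `["central_counter"]` for black comes back as `["defend_passively", "central_counter"]`: the losing side always gets a defensive primary plan.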

Both bugs were latent the whole time. The unit tests checked the detectors in isolation: given this kind of input, does the rule fire as designed? The tests passed because the detectors were faithful to their own logic. The logic was wrong.

The reflex fix when a detector misfires is to add a unit test for the new failure case. That doesn't catch the next instance of the same shape: a rule that is internally consistent with its specification but externally wrong about the position. Tests that only check a detector against its own contract are structurally a tautology with respect to that contract; they cannot fail for this reason. The check that can fail is N real positions run end-to-end, with someone looking at every output. Five positions in two hours surfaced two latent detector bugs the suite had cleared for months.