2026-05-02

The first 100-game Modal run almost completed. 483/484 batches finished. 15,445 positions wrote back cleanly. One batch failed on a depth guard with the signature depths=[17,17,17], then retries narrowed the problem from 32 positions to 8 positions to one FEN:

8/5p2/3Kp1p1/8/8/5q2/5k2/8 b - -

The initial mental model was that Stockfish had failed the position. That was too broad. The position was legal, the engine was alive, and single-position retries around it completed. The problem was the contract around MultiPV.

For broad backfill we ask Stockfish for depth 18, MultiPV 3. MultiPV means the engine tries to return several principal variations, not just its best line. On some tail positions, the best line can reach depth 18 while the second or third line is still at depth 17 when the per-position time cap fires. Our guard treated that as a failed position because it required every returned PV to hit the requested depth.

That was stricter than the product contract. The materialized evals row uses the best move, score, and principal variation. It does not need all three PV lines to clear the same depth before the row is useful. The fix was to validate the line we actually consume and record result_multipv honestly when fewer lines meet the full-depth bar.

The isolated retry passed after the guard change: best-line depth 18, requested MultiPV 3, result MultiPV 2, best move f3d5, cp -8115. Later batches kept provenance instead of pretending the result had the exact requested shape.

The bug was not in Stockfish. It was in the boundary between a chess quality setting and a production acceptance rule. Requested depth is a goal; reached depth is evidence. The system has to store both.