2026-05-04

The next Stockfish bottleneck was not raw engine speed. It was the callback shape.

The first Modal rollout had every worker POST results directly back to the Railway web server. That worked at max_containers=100, but it tied parallelism to callback fanout: every additional worker meant more concurrent POSTs into one server and one DB writer path.

The buffered write-back slice changed only the transport. Workers still receive claimed position keys and evaluate the same Stockfish 18, depth 18, MultiPV 3 contract. But instead of POSTing to Railway, each worker writes a JSON result shard into a durable Modal Volume. A separate drainer reads those shards later and POSTs them through the existing signed Railway endpoint.
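The worker side of that transport can be sketched in a few lines. This is a hedged illustration, not the actual worker: the shard layout, field names, and `write_shard` helper are assumptions, and a temp directory stands in for the mounted Modal Volume path. The one detail worth showing is the atomic rename, which keeps the drainer from ever reading a half-written shard.

```python
import json
import os
import tempfile
import uuid

VOLUME_ROOT = tempfile.mkdtemp()  # stands in for the mounted Modal Volume path

def write_shard(job_id, batch_id, results, root=None):
    """Write one batch's evaluation results as a JSON shard.

    Layout and field names are illustrative, not the real schema.
    """
    root = root or VOLUME_ROOT
    shard_dir = os.path.join(root, f"job-{job_id}")
    os.makedirs(shard_dir, exist_ok=True)
    shard_path = os.path.join(
        shard_dir, f"batch-{batch_id}-{uuid.uuid4().hex}.json"
    )
    tmp_path = shard_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(
            {"job_id": job_id, "batch_id": batch_id, "results": results}, f
        )
    # Atomic rename: the drainer only ever sees complete shards.
    os.replace(tmp_path, shard_path)
    return shard_path

path = write_shard(
    58, 1, [{"position_key": "abc", "depth": 18, "multipv": 3}]
)
```

Because each shard is a single file, worker failures leave at most a `.tmp` file behind, which the drainer can ignore.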

The production smoke used newly ingested Chess.com games for chess.com:nihalsarin:

Metric                    Value
Games ingested            10 rated blitz
Job                       58
Claimed positions         151
Batches                   3
Worker CPU                0.25
Worker cap                100
Submit wall               79.579s
Worker wall sum           186.994s
Shards written            3
Error shards              0
Unbuffered errors         0
Drain wall                8.387s
Accepted/materialized     151/151
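Two ratios fall out of those numbers and are worth checking: effective parallelism (worker wall summed across containers divided by submit wall) and per-position cost. With only 3 batches in flight, the cap of 100 containers was nowhere near binding.

```python
# Ratios derived directly from the smoke-test table above.
submit_wall = 79.579       # seconds, wall time of the submit phase
worker_wall_sum = 186.994  # seconds, summed across worker containers
positions = 151

# Average number of workers actually busy during the submit window.
effective_parallelism = worker_wall_sum / submit_wall  # ~2.35

# Worker-seconds spent per claimed position.
per_position = worker_wall_sum / positions  # ~1.24 s
```

So this smoke run exercised the transport, not the concurrency ceiling: roughly 2.3 workers were busy on average against a cap of 100.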

The important check was the state between compute and drain. After workers finished, Railway still showed the job as running, batch rows as pending, and 151 queue rows as running. That is exactly the new contract: compute can finish without mutating the DB.

Then the drainer applied the three shards, marked them applied, and the queue returned clean: pending=0, running=0, error=0.
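The drainer's contract is small enough to sketch. This is a hedged stand-in, not the real implementation: `post` substitutes for the signed Railway endpoint, a local directory substitutes for the Volume, and the rename-to-`.applied` marker is an assumption about how idempotency is achieved. The point it illustrates is that a re-run after a crash skips already-applied shards.

```python
import glob
import json
import os
import tempfile

def drain(root, post):
    """Apply every unapplied shard: POST its payload, then mark it applied.

    The `.applied` rename makes the pass idempotent: a re-run after a
    crash only picks up shards that were never successfully posted.
    """
    applied = 0
    # Glob only matches un-renamed shards; *.json.applied is skipped.
    for shard in sorted(glob.glob(os.path.join(root, "job-*", "batch-*.json"))):
        with open(shard) as f:
            payload = json.load(f)
        post(payload)  # the existing signed endpoint does the DB writes
        os.replace(shard, shard + ".applied")
        applied += 1
    return applied

# Usage: one fake shard, a list standing in for the signed POST.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "job-58"))
with open(os.path.join(root, "job-58", "batch-0.json"), "w") as f:
    json.dump({"results": []}, f)

seen = []
first_pass = drain(root, seen.append)   # applies the shard
second_pass = drain(root, seen.append)  # finds nothing left to apply
```

Ordering the glob keeps replays deterministic, and because the DB mutation happens only inside the existing signed endpoint, the drainer itself stays stateless.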

This does not prove the new throughput ceiling. It proves the architecture can decouple compute burst from write-back pressure. The next question is where the wall moves when the same path runs at real scale.