2026-05-04
The next Stockfish bottleneck was not raw engine speed. It was the callback shape.
The first Modal rollout had every worker POST its results directly back to the
Railway web server. That worked at max_containers=100, but it meant that higher
parallelism would translate directly into higher callback fanout against one
server and one DB writer path, as sketched below.
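
For concreteness, a minimal sketch of that original transport. The endpoint route and payload shape here are illustrative assumptions, not the real contract:

```python
# Sketch of the original transport: every Modal worker POSTs its results
# straight back to the Railway web server. Route and payload are illustrative.
import requests

def post_results_direct(base_url: str, job_id: int, results: list[dict]) -> None:
    # N concurrent workers mean N concurrent POSTs into one server and one
    # DB writer path: callback fanout scales linearly with parallelism.
    resp = requests.post(
        f"{base_url}/internal/engine-results",
        json={"job_id": job_id, "results": results},
        timeout=30,
    )
    resp.raise_for_status()
```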
The buffered write-back slice changed only the transport. Workers still receive
claimed position keys and evaluate the same Stockfish 18, depth 18,
MultiPV 3 contract. But instead of POSTing to Railway, each worker writes a
JSON result shard into a durable Modal Volume. A separate drainer reads those
shards later and POSTs them through the existing signed Railway endpoint.
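
A minimal sketch of the buffered worker side, assuming Modal's Volume API. The app name, volume name, shard layout, and the run_stockfish stub are all invented for illustration:

```python
# Sketch of the buffered worker: evaluate claimed positions, then write one
# JSON shard into a durable Modal Volume instead of POSTing back to Railway.
# App/volume/path names and run_stockfish are assumptions, not the real code.
import json
import uuid
from pathlib import Path

import modal

app = modal.App("engine-workers")
shards = modal.Volume.from_name("result-shards", create_if_missing=True)

def run_stockfish(position_key: str) -> dict:
    # Stand-in for the real engine call (Stockfish 18, depth 18, MultiPV 3);
    # only the transport change is the point of this sketch.
    return {"position": position_key, "pvs": []}

@app.function(volumes={"/shards": shards}, cpu=0.25, max_containers=100)
def evaluate_batch(job_id: int, position_keys: list[str]) -> str:
    results = [run_stockfish(key) for key in position_keys]
    shard_dir = Path(f"/shards/job-{job_id}")
    shard_dir.mkdir(parents=True, exist_ok=True)
    shard_path = shard_dir / f"{uuid.uuid4().hex}.json"
    shard_path.write_text(json.dumps({"job_id": job_id, "results": results}))
    shards.commit()  # persist the shard so the later drainer can see it
    return str(shard_path)
```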
The production smoke test used newly ingested Chess.com games for
chess.com:nihalsarin:
| Metric | Value |
|---|---|
| Games ingested | 10 rated blitz |
| Job ID | 58 |
| Claimed positions | 151 |
| Batches | 3 |
| Worker CPU (cores) | 0.25 |
| Worker cap (max_containers) | 100 |
| Submit wall time | 79.579 s |
| Worker wall time (sum) | 186.994 s |
| Shards written | 3 |
| Error shards | 0 |
| Unbuffered errors | 0 |
| Drain wall time | 8.387 s |
| Accepted / materialized | 151 / 151 |
The important check was the state between compute and drain. After workers
finished, Railway still showed the job as running, batch rows as pending, and
151 queue rows as running. That is exactly the new contract: compute can
finish without mutating the DB.
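
That invariant is cheap to assert in a smoke script. A sketch, assuming a hypothetical status endpoint that reports job, batch, and queue counts:

```python
# Sketch of the between-compute-and-drain assertion. Endpoint path and
# response fields are assumptions for illustration.
import requests

def assert_compute_done_db_untouched(base_url: str, job_id: int) -> None:
    status = requests.get(
        f"{base_url}/internal/jobs/{job_id}/status", timeout=10
    ).json()
    assert status["job_state"] == "running"    # job row not yet finalized
    assert status["batches"]["pending"] == 3   # batch rows untouched
    assert status["queue"]["running"] == 151   # claimed queue rows still held
```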
Then the drainer applied the three shards, marked them applied, and the queue
returned clean: pending=0, running=0, error=0.
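
A matching sketch of the drainer, again with assumed paths and endpoint: scan the volume for unapplied shards, replay each one through the signed endpoint, and rename applied shards so a crashed or repeated drain stays idempotent:

```python
# Sketch of the drainer: read shards from the Volume, replay each through the
# existing signed Railway endpoint, mark it applied. Paths, the endpoint route,
# and the sign() helper are assumptions.
import json
from pathlib import Path

import requests

def drain(shard_root: Path, base_url: str, sign) -> int:
    applied = 0
    for shard in sorted(shard_root.rglob("*.json")):
        payload = json.loads(shard.read_text())
        resp = requests.post(
            f"{base_url}/internal/engine-results",
            json=payload,
            headers=sign(payload),  # assumed request-signing helper
            timeout=60,
        )
        resp.raise_for_status()
        # Rename so a rerun skips already-applied shards (idempotent drain).
        shard.rename(shard.with_suffix(".applied"))
        applied += 1
    return applied
```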
This does not prove a new throughput ceiling. It proves the architecture can decouple a compute burst from write-back pressure. The next question is where the wall moves when the same path runs at real scale.