2026-05-04

The next Stockfish bottleneck was not raw engine speed. It was the callback shape.

The first Modal rollout had every worker POST results directly back to the Railway web server. That worked at max_containers=100, but it tied parallelism to callback fanout: every additional worker meant more concurrent POSTs into one server and one DB writer path.

The buffered write-back slice changed only the transport. Workers still receive claimed position keys and evaluate the same Stockfish 18, depth 18, MultiPV 3 contract. But instead of POSTing to Railway, each worker writes a JSON result shard into a durable Modal Volume. A separate drainer reads those shards later and POSTs them through the existing signed Railway endpoint.
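The worker side of that transport can be sketched in a few lines. This is a hedged illustration, not the actual worker: the shard layout, field names, and `write_shard` helper are assumptions, and a temp directory stands in for the mounted Modal Volume path. The one detail worth showing is the atomic rename, which keeps the drainer from ever reading a half-written shard.

```python
import json
import os
import tempfile
import uuid

VOLUME_ROOT = tempfile.mkdtemp()  # stands in for the mounted Modal Volume path

def write_shard(job_id, batch_id, results, root=None):
    """Write one batch's evaluation results as a JSON shard.

    Layout and field names are illustrative, not the real schema.
    """
    root = root or VOLUME_ROOT
    shard_dir = os.path.join(root, f"job-{job_id}")
    os.makedirs(shard_dir, exist_ok=True)
    shard_path = os.path.join(
        shard_dir, f"batch-{batch_id}-{uuid.uuid4().hex}.json"
    )
    tmp_path = shard_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(
            {"job_id": job_id, "batch_id": batch_id, "results": results}, f
        )
    # Atomic rename: the drainer only ever sees complete shards.
    os.replace(tmp_path, shard_path)
    return shard_path

path = write_shard(
    58, 1, [{"position_key": "abc", "depth": 18, "multipv": 3}]
)
```

Because each shard is a single file, worker failures leave at most a `.tmp` file behind, which the drainer can ignore.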

The production smoke used newly ingested Chess.com games for chess.com:nihalsarin:

Metric                    Value
Games ingested            10 rated blitz
Job                       58
Claimed positions         151
Batches                   3
Worker CPU                0.25
Worker cap                100
Submit wall               79.579s
Worker wall sum           186.994s
Shards written            3
Error shards              0
Unbuffered errors         0
Drain wall                8.387s
Accepted/materialized     151/151
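Two ratios fall out of those numbers and are worth checking: effective parallelism (worker wall summed across containers divided by submit wall) and per-position cost. With only 3 batches in flight, the cap of 100 containers was nowhere near binding.

```python
# Ratios derived directly from the smoke-test table above.
submit_wall = 79.579       # seconds, wall time of the submit phase
worker_wall_sum = 186.994  # seconds, summed across worker containers
positions = 151

# Average number of workers actually busy during the submit window.
effective_parallelism = worker_wall_sum / submit_wall  # ~2.35

# Worker-seconds spent per claimed position.
per_position = worker_wall_sum / positions  # ~1.24 s
```

So this smoke run exercised the transport, not the concurrency ceiling: roughly 2.3 workers were busy on average against a cap of 100.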

The important check was the state between compute and drain. After workers finished, Railway still showed the job as running, batch rows as pending, and 151 queue rows as running. That is exactly the new contract: compute can finish without mutating the DB.

Then the drainer applied the three shards, marked them applied, and the queue returned clean: pending=0, running=0, error=0.
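The drainer's contract is small enough to sketch. This is a hedged stand-in, not the real implementation: `post` substitutes for the signed Railway endpoint, a local directory substitutes for the Volume, and the rename-to-`.applied` marker is an assumption about how idempotency is achieved. The point it illustrates is that a re-run after a crash skips already-applied shards.

```python
import glob
import json
import os
import tempfile

def drain(root, post):
    """Apply every unapplied shard: POST its payload, then mark it applied.

    The `.applied` rename makes the pass idempotent: a re-run after a
    crash only picks up shards that were never successfully posted.
    """
    applied = 0
    # Glob only matches un-renamed shards; *.json.applied is skipped.
    for shard in sorted(glob.glob(os.path.join(root, "job-*", "batch-*.json"))):
        with open(shard) as f:
            payload = json.load(f)
        post(payload)  # the existing signed endpoint does the DB writes
        os.replace(shard, shard + ".applied")
        applied += 1
    return applied

# Usage: one fake shard, a list standing in for the signed POST.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "job-58"))
with open(os.path.join(root, "job-58", "batch-0.json"), "w") as f:
    json.dump({"results": []}, f)

seen = []
first_pass = drain(root, seen.append)   # applies the shard
second_pass = drain(root, seen.append)  # finds nothing left to apply
```

Ordering the glob keeps replays deterministic, and because the DB mutation happens only inside the existing signed endpoint, the drainer itself stays stateless.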

This does not prove the new throughput ceiling. It proves the architecture can decouple compute burst from write-back pressure. The next question is where the wall moves when the same path runs at real scale.