2026-05-04
The goal is not to build more queue machinery. The goal is maximum Stockfish throughput per dollar while keeping analysis quality high.
The quality contract stays fixed for the broad path: Stockfish 18, depth
18, MultiPV 3. Pathological positions can get special handling, but only
with explicit provenance. The default should not silently become weaker just
because one tail position is expensive.
The next bottlenecks are now ordered:
| Bottleneck | Signal | Experiment |
|---|---|---|
| Stockfish compute | Worker wall dominates everything else | raise containers while tracking cost/position |
| Modal scheduling | 200 or 300 containers barely improve throughput |
stop raising caps and tune batch shape |
| Modal Volume shards | many small files make submit/list/drain slow | compare 64, 128, 256 position shards |
| Drainer throughput | compute finishes fast, drain wall dominates | add partitioned drainers or rate-shaped drains |
| Railway DB writes | endpoint write-back p95/max grows | meter writes before scaling web servers |
The efficient ladder is:
- Run a medium buffered job: 25-50 Chess.com games,
cpu=0.25,max_containers=100,batch_size=128, capped around$0.25. - Run the same shape at
max_containers=200, then300. - If
300still gives material gains, test whether the Modal account and platform allow a higher cap. If gains flatten, stop there. - Sweep batch size at the best container cap:
64,128,256. - Stress the drainer only if drain wall is more than
10-15%of compute wall.
The decision rules matter more than the knobs:
- If
200beats100materially, try300. - If
300beats200materially and cost per position stays stable, explore beyond300. - If throughput gain is under
15-20%, stop raising container count. - If drain is small, do not optimize the drainer yet.
- If drain dominates, parallelize write-back deliberately instead of letting compute workers stampede the web server.
The architecture target is compute burst, shaped materialization. Modal can run many workers. The server and DB should accept results at a rate they can actually sustain.