2026-05-03
The Stockfish rollout had a hidden tax: every production iteration dragged a large ML dependency stack through the web deploy path.
The web app does not need Maia inference on startup. It serves routes, reads SQLite, accepts signed Modal callbacks, and renders analysis pages. But the base Python environment still installed the full Maia dependency stack: Torch and friends. That made every Railway image enormous even when the active sprint was about Stockfish batch write-back.
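One way to keep that boundary honest in code is to keep the Torch-backed imports out of the web app's module graph entirely and only pay for them on the routes that actually run inference. A minimal sketch of the lazy-import guard, assuming hypothetical names (`app.maia_engine` and its `evaluate` function are illustrative, not the project's actual layout):

```python
# Hypothetical route-level guard; module and function names are illustrative.
from importlib import util


def maia_available() -> bool:
    """Check whether the optional ML stack is installed, without importing torch."""
    return util.find_spec("torch") is not None


def run_maia_inference(fen: str):
    """Import the heavy stack lazily, only when an inference route is hit."""
    if not maia_available():
        raise RuntimeError(
            "Maia inference requires the optional extra: pip install 'app[maia]'"
        )
    from app.maia_engine import evaluate  # hypothetical module behind the extra
    return evaluate(fen)
```

With a guard like this, the web image can drop Torch from its environment and startup never touches the ML stack; only a misrouted inference request fails, and it fails with an actionable message.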
The measured cost was not subtle:
| Metric | Before | After |
|---|---|---|
| Image size | 2.85 GB | 189.5 MB |
| Image push | 163s | 8.1s |
| Export | ~42s | ~2s |
| Bytecode files | 8,521 | 1,539 |
The fix was dependency boundary work, not infrastructure tuning. Maia/Torch/NumPy/pandas moved out of the base dependency set and into an optional `maia` extra. The web image stopped shipping Linux Torch/CUDA/NVIDIA wheels it did not need for the production routes under test.
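Concretely, this is the standard optional-dependencies split in packaging metadata. A sketch of what the boundary can look like in `pyproject.toml`, with illustrative package names and pins (the actual manifest is assumed, not quoted):

```toml
[project]
name = "app"            # illustrative project name
version = "0.1.0"
dependencies = [
    # Lean web runtime only: routes, SQLite access, callback verification.
    "flask",            # illustrative; the actual web framework is assumed
]

[project.optional-dependencies]
# The heavy ML stack lives behind an extra; the web image never installs it.
maia = [
    "torch",
    "numpy",
    "pandas",
]
```

The web Dockerfile then installs the bare package (`pip install .`) while the Maia runtime installs `pip install '.[maia]'`, so the Linux Torch/CUDA wheels never enter the web layer in the first place.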
This mattered because the rollout was deliberately iterative. We were deploying guardrails, retry behavior, callback fixes, and observability while production jobs were running. A three-minute image push makes every small correction expensive. An eight-second push changes the feedback loop.
The lesson is simple: dependency boundaries are deploy velocity boundaries. A service should not pay build, upload, cold-start, and security-surface costs for capabilities that belong to a different runtime.