2026-05-05

Production was down for 10 hours on caissaresearch.com, a chess analytics product I run. The actual fix took 24 minutes from the moment I started investigating; the other 9-plus hours were a deploy that had silently never happened.

Earlier the same day I’d switched the Dockerfile from a railpack-managed build to a direct uv sync --frozen --no-dev --no-editable invocation. The --no-editable flag was the trigger: it builds the project and installs a real wheel into site-packages instead of pointing at the source tree. The wheel was correct in every way the build cared about: deps resolved, imports worked, /healthz returned 200. But pyproject.toml had no [tool.setuptools.package-data] declaration, and without one setuptools packaged only the .py files from the packages that packages.find discovered. Every Jinja template was silently dropped from the wheel, and every page-rendering route immediately threw TemplateNotFound.
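A check along these lines would catch a wheel that ships no templates before it reaches production. It is a minimal sketch, not the project's actual code: the function names and the list of template suffixes are assumptions, and it relies only on the fact that a wheel is a zip archive.

```python
import zipfile
from pathlib import PurePosixPath


def template_files_in_wheel(wheel_path, suffixes=(".html", ".jinja", ".j2")):
    """Return the archive paths inside the wheel that look like templates.

    The suffix list is an assumption; adjust it to the project's templates.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if PurePosixPath(name).suffix in suffixes
        )


def assert_wheel_ships_templates(wheel_path):
    """Fail loudly if the built wheel contains no template files at all."""
    found = template_files_in_wheel(wheel_path)
    if not found:
        raise AssertionError(f"{wheel_path} contains no template files")
    return found
```

Run against the built wheel in CI, this fails on the artifact itself rather than on the source tree, which is the distinction an editable install hides.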

The bug was latent under the previous editable install, where the source tree on disk satisfied template lookups. The Dockerfile switch made the wheel itself the source of truth, and the wheel was lying.

The 10-hour duration was a separate bug. A cleanup commit earlier the same day had deleted railpack.json and nixpacks.toml but left two tests in test_deploy_config.py reading them, so CI was red on main. Railway gates production deploys on green CI and marks deploys whose CI failed as SKIPPED, which surfaces only as a status field in its dashboard. When I pushed the package-data fix, Railway evaluated it, marked it SKIPPED for the same reason, and quietly held the fix in the queue. From the outside, railway status happily reported the previous green deploy as the current one. The synthetic uptime check was correctly firing on the symptom; nothing was firing on the cause.

The package-data fix took one block in pyproject.toml. Unblocking the deploy took rewriting the two dead tests to read the Dockerfile instead. Total code change: 15 lines across two commits.
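A deploy-config test pointed at the Dockerfile might look roughly like this. It is a sketch under assumptions, not the real contents of test_deploy_config.py: the helper names are hypothetical, and it only understands shell-form RUN lines.

```python
import shlex

# Flags taken from the uv sync command described in this post.
REQUIRED_FLAGS = {"--frozen", "--no-dev", "--no-editable"}


def uv_sync_flags(dockerfile_text):
    """Collect the long flags passed to `uv sync` in shell-form RUN lines."""
    flags = set()
    # Join backslash-continued lines so a multi-line RUN reads as one command.
    joined = dockerfile_text.replace("\\\n", " ")
    for line in joined.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("RUN "):
            continue
        tokens = shlex.split(stripped[4:])
        for i in range(len(tokens) - 1):
            if tokens[i] == "uv" and tokens[i + 1] == "sync":
                for tok in tokens[i + 2:]:
                    if tok in ("&&", "||", ";"):
                        break  # the uv sync command ends here
                    if tok.startswith("--"):
                        flags.add(tok)
    return flags


def missing_sync_flags(dockerfile_text):
    """Return the required flags that the Dockerfile's uv sync call lacks."""
    return REQUIRED_FLAGS - uv_sync_flags(dockerfile_text)
```

A test can then read the Dockerfile and assert that missing_sync_flags returns an empty set, so it fails on a real config regression instead of on a file that no longer exists.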

A deploy that was held by a gate is not a deploy that ran. CI failure, branch protection block, manual approval pending: the gated state needs to surface the same way a build failure does, because “still pending behind a gate” and “everything is fine” look identical in every dashboard that summarizes the latest successful deploy. Alert on the gate, not just on the build.
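One way to alert on the gate is to compare the commit at the head of main with what the deploy history says about that commit. The sketch below is deliberately platform-agnostic: the Deploy record and every status name other than SKIPPED are hypothetical, and fetching the history from Railway or any other provider is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Only SKIPPED comes from the incident; the other names are illustrative.
HELD_STATUSES = frozenset({"SKIPPED", "PENDING", "WAITING_APPROVAL"})


@dataclass(frozen=True)
class Deploy:
    commit: str
    status: str


def gate_alert(head_commit: str, deploys: Sequence[Deploy]) -> Optional[str]:
    """Return an alert message if head_commit is not what is live, else None.

    `deploys` is the deploy history, ordered newest first.
    """
    # The newest successful deploy is what the dashboard reports as current.
    live = next((d.commit for d in deploys if d.status == "SUCCESS"), None)
    attempt = next((d for d in deploys if d.commit == head_commit), None)
    if attempt is None:
        return f"no deploy attempt recorded for {head_commit}; live is {live}"
    if attempt.status in HELD_STATUSES:
        return f"deploy of {head_commit} is held ({attempt.status}); live is {live}"
    if attempt.status != "SUCCESS":
        return f"deploy of {head_commit} ended as {attempt.status}; live is {live}"
    return None
```

The point of the design is that a held deploy and a failed deploy both produce a message, while only a successful deploy of the head commit returns None.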