2026-03-07

Pushed a two-line fix to scheduler.py after a 35-minute outage on a daily news pipeline I run. Four consecutive Railway deploys had failed the healthcheck while the old container kept serving stale code.

The pipeline runs as a Python scheduler on Railway’s Hobbyist tier (one container, one Postgres). On startup, the scheduler called ensure_schema(), which opens a DB connection and runs a migration check, before _start_health_server(). With a warm DB the ordering didn’t matter; ensure_schema() returned in milliseconds. On a cold start, though, Railway’s internal networking to Postgres takes ~60s to become reachable, so ensure_schema() blocked. The health server never started in time, and Railway killed each new container before it could replace the old one.
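For concreteness, a minimal reconstruction of the startup path. Only the two function names come from scheduler.py; the driver (psycopg2), the DATABASE_URL and PORT env vars, and the body of the migration check are assumptions for illustration.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import psycopg2  # assumed driver; the real one may differ


def ensure_schema():
    # Opens a DB connection and runs the migration check. On a cold
    # start this blocked for ~60s waiting on internal networking.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # DSN Railway injects
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1")  # stand-in for the real migration check


class _Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # The healthcheck needs nothing but the process being up.
        self.send_response(200)
        self.end_headers()


def _start_health_server():
    port = int(os.environ.get("PORT", 8080))
    server = HTTPServer(("0.0.0.0", port), _Health)
    threading.Thread(target=server.serve_forever, daemon=True).start()


def main():
    ensure_schema()          # blocks on the DB first...
    _start_health_server()   # ...so on cold starts this never ran in time
```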

The fix was a two-line swap: start the health server first, in a background thread, then connect to the DB. The server doesn’t need the DB to respond to a healthcheck.
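In terms of the sketch above, the entire diff is the order of two calls:

```python
def main():
    _start_health_server()   # binds the port immediately; no DB needed
    ensure_schema()          # the slow part now runs behind a live endpoint
```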

Two adjacent issues wasted ~25 of the 35 minutes. Railway’s Teardown setting wasn’t enabled, so the old container kept holding the deploy slot: same symptom (failed healthcheck), different cause. And railway logs shows the active deployment by default, not the failing one, so the old container’s ModuleNotFoundError on a separate route looked like the deploy failure for a while. It wasn’t.

The standard cloud-deploy pattern is health and readiness probes that respond immediately, with all blocking initialization (DB connections, cache warm-up, schema migrations) running after the process is reachable. I had it backward, and the bug was invisible on every warm-DB deploy. Cold start is the path that exercises ordering invariants, and it’s the one local Docker testing can’t reproduce.
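For the record, a sketch of that pattern with a separate readiness signal (hypothetical paths; reuses ensure_schema() from the sketch above). Liveness answers as soon as the port is bound; readiness flips only once blocking init has finished.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()


class Probes(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":      # liveness: the process is up
            self.send_response(200)
        elif self.path == "/ready":     # readiness: init has completed
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()


def init_then_mark_ready():
    ensure_schema()   # DB connect, migrations, warm-up: all after the port binds
    ready.set()


def main():
    server = HTTPServer(("0.0.0.0", 8080), Probes)
    threading.Thread(target=init_then_mark_ready, daemon=True).start()
    server.serve_forever()   # the real scheduler loop would run here instead
```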