2026-04-21

Every scheduler-triggered Bluesky post since the slot-based publish schedule rolled out had failed. ~30 scheduled runs across 3+ days, zero posts delivered. Manual posts from the dev laptop succeeded, which masked the regression in casual inspection.

The pipeline writes briefs and queues social posts through a social_queue table; a scheduler picks up due rows and calls into post_to_bluesky. atproto and tweepy lived under [project.optional-dependencies].social in pyproject.toml, and the Dockerfile ran pip install --no-cache-dir ., base dependencies only. Inside post_to_bluesky, from atproto import Client raised ModuleNotFoundError, the except ImportError branch returned False in under 30ms, and the post was dropped.

The 0.03s duration on social-publish-* run_log rows was the diagnostic that should have surfaced the bug on day one. atproto’s login round-trip alone is ~500ms; nothing real should fail that fast. Took three sessions to find: pursued missing credentials, rotated app password, broken handle verification, and Bluesky IP rate-limit before adding two lines of diagnostic logging that printed the exception class on every failed post.

Final fix moved both deps into base dependencies and reverted the Dockerfile to plain pip install .. The intermediate fix, pip install ".[social]", worked but left the trap in place for any future deploy target that forgot the extra.

The shape that made this invisible is that an except ImportError branch returning False is structurally identical to a normal “library reachable, call failed” path. Duration is the only signal that distinguishes them, and duration isn’t typically a load-bearing alert. Optional dependencies that are load-bearing in production aren’t optional; they’re base dependencies that the package metadata is lying about.