2026-04-25

CI deploy gate stuck red for ~30 minutes. test_extract_signals_success failed in CI on a foreign-key violation against signal_id=67. It didn’t reproduce locally on the dev box and didn’t reproduce on the rerun; on the rerun a different test failed instead, with savepoint "_test" does not exist. Neither test had been changed in the failing commit.

The test suite ran against a real Postgres container, ~1,565 tests in 17 seconds. The fixture that made it that fast was a _SavepointConn proxy in tests/conftest.py: one shared psycopg2 connection per test, commit() aliased to RELEASE SAVEPOINT _test; SAVEPOINT _test, and the pool’s getconn/putconn monkey-patched so every “different connection” was actually the same proxied object. Speedy. The trap was that this scheme broke a fundamental property of the connection pool: two checked-out connections are separate sessions, and neither can see the other’s uncommitted writes. Code that passed in tests only because two pool connections were secretly the same proxy was indistinguishable from correct code, even though it would FK-violate in production with two real connections.
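For readers who haven’t seen the pattern, a minimal sketch of a savepoint proxy like the one described above. This is not the original fixture: the class body, patch_pool, and the choice to forward everything else with __getattr__ are assumptions for illustration. It relies only on the connection exposing cursor() and the pool exposing getconn/putconn, which psycopg2 connections and pools do.

```python
class SavepointConn:
    """Wraps one real connection; commit() never ends the outer transaction."""

    def __init__(self, real_conn):
        self._real = real_conn

    def __getattr__(self, name):
        # Anything not overridden here (cursor(), etc.) goes to the real connection.
        return getattr(self._real, name)

    def commit(self):
        # Release the current savepoint and immediately open a fresh one.
        # Nothing is durably committed, so the fixture can roll it all back later.
        cur = self._real.cursor()
        try:
            cur.execute("RELEASE SAVEPOINT _test")
            cur.execute("SAVEPOINT _test")
        finally:
            cur.close()


def patch_pool(pool, proxy):
    """Hypothetical helper: make every checkout return the same proxied connection."""
    pool.getconn = lambda *args, **kwargs: proxy
    pool.putconn = lambda *args, **kwargs: None
```

After patch_pool, two getconn() calls return the same object, so a child helper that checks out its “own” connection sees the parent’s uncommitted rows. That is exactly the visibility two real connections would not have.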

The reflex fix to the original CI failure was to swap the savepoint scheme for a TRUNCATE-after-test fixture. Suite went from 17s to 42s, +25s for the full run. Acceptable. Then 7 deterministic test failures surfaced under the new fixture. Three of them were real production-class bugs in nested-transaction code paths: _create_holding_signals and _check_short_interest_change opened an outer transaction, then called store_signal_entities(), which grabbed a different pool connection that couldn’t see the uncommitted parent INSERT. Both were dormant in the current schedule but armed for the next 13F filing wave in mid-May. Fixed with an opt-in cur= kwarg on the helper so the parent’s cursor, and with it the parent’s open transaction, flows through.
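A minimal sketch of the opt-in cur= pattern, under stated assumptions: the real signature of store_signal_entities() isn’t shown in this log, so the pool argument, the pooled_cursor helper, the create_holding_signal_sketch caller, and the signals / signal_entities table and column names are all hypothetical. The point is only the shape: when a cursor is passed in, the helper writes on the caller’s connection and leaves commit to the caller.

```python
from contextlib import contextmanager


@contextmanager
def pooled_cursor(pool):
    """Hypothetical helper: borrow a connection, commit on success, roll back on error."""
    conn = pool.getconn()
    try:
        cur = conn.cursor()
        try:
            yield cur
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            cur.close()
    finally:
        pool.putconn(conn)


def store_signal_entities(pool, signal_id, entity_ids, cur=None):
    """Insert child rows; table and column names are assumed for illustration."""
    sql = "INSERT INTO signal_entities (signal_id, entity_id) VALUES (%s, %s)"
    rows = [(signal_id, entity_id) for entity_id in entity_ids]
    if cur is not None:
        # Caller owns the transaction: same connection, so the uncommitted
        # parent row is visible to the FK check. No commit here.
        cur.executemany(sql, rows)
        return
    # Old behaviour: separate connection, separate transaction.
    with pooled_cursor(pool) as own_cur:
        own_cur.executemany(sql, rows)


def create_holding_signal_sketch(pool, payload, entity_ids):
    """Hypothetical caller: parent INSERT and child INSERTs share one cursor."""
    with pooled_cursor(pool) as cur:
        cur.execute(
            "INSERT INTO signals (payload) VALUES (%s) RETURNING id", (payload,)
        )
        signal_id = cur.fetchone()[0]
        store_signal_entities(pool, signal_id, entity_ids, cur=cur)
    return signal_id
```

Keeping cur= optional means existing callers that are not inside a transaction keep their behaviour; only the nested paths need to change.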

Test isolation that doesn’t mirror production transaction semantics is worse than slow tests. Test passes were not evidence of working code; they were evidence the fixture’s lock-and-proxy trick was hiding the kind of bug that only fires on two real connections. The signal that the cleverness is the problem is the same shape every time: tests pass locally, fail in CI on environment-specific timing, and the failure surfaces in tests adjacent to the bug.