2026-03-05

The daily brief ran empty for two consecutive days in early March. The dedup stage was returning a dedup_rate of 100%. Every signal out of the news pipeline was marked as a duplicate of something already in the database, and every brief was reduced to a header with no body.

The pipeline pulls from 50+ financial news sources every two hours, parses each article into structured signals (“Rocket Lab announced earnings of X,” “the FDA approved drug Y”), and runs dedup upstream of the synthesis LLM call so the brief doesn’t repeat itself across sources. Dedup is itself an LLM call: send a batch of signal IDs, get back groups of {canonical_id, duplicate_ids}, mark accordingly. The code took those IDs at face value.
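To make the failure mode concrete, here is a minimal sketch of what taking those IDs at face value can look like. The field names canonical_id and duplicate_ids come from the pipeline; the function apply_groups_unchecked and the in-memory deduplicated_by dict are hypothetical stand-ins for illustration, not the actual code.

```python
def apply_groups_unchecked(groups, deduplicated_by):
    """Record every duplicate -> canonical mapping exactly as the model returned it.

    `groups` is a list of dicts shaped like
    {"canonical_id": int, "duplicate_ids": [int, ...]}.
    `deduplicated_by` maps signal ID -> ID of the signal it duplicates.
    Nothing here checks that the IDs were part of the batch that was sent.
    """
    for group in groups:
        for dup_id in group["duplicate_ids"]:
            deduplicated_by[dup_id] = group["canonical_id"]
    return deduplicated_by


# The batch sent to the model contained only these IDs.
batch_ids = [101, 102, 103]

# A hallucinated response: canonical_id=1 was never in the batch.
response = [{"canonical_id": 1, "duplicate_ids": [101, 102, 103]}]

marked = apply_groups_unchecked(response, {})
```

In this toy run every signal in the batch ends up marked as a duplicate of ID 1, which is how a single bad ID can push the dedup rate to 100%.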

The model frequently returned canonical_id=1, which pointed to a row from Feb 20 that had never appeared in any of the dedup batches that referenced it. Every group rooted at that ID quietly marked unrelated signals as duplicates of a row from two weeks ago. In total, 1,714 signals were falsely deduped between Feb 20 and Mar 5. On Mar 3 and 4 the rate hit 100%.

canonical_id=1 pointed to a row in event_signals, a different table that the dedup pool never reads from. There was no path by which a legitimate dedup decision could ever produce that ID. The model was hallucinating IDs from prior context, and the join accepted them silently. The foreign key was constrained against the target table, not against the batch being deduped.

The fix is a one-line validation in dedup_signals(): check every LLM-returned ID against the batch’s valid_ids set before it touches the join column. A hallucinated canonical_id causes the whole group to be skipped; hallucinated duplicate_ids inside an otherwise-valid group get filtered out. The repair cleared deduplicated_by for every signal pointing to an event_signals target, since those were wrong by definition. The next brief surfaced ~640 signals that had been invisible for two weeks.
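The validation rule can be sketched roughly as follows. This is an illustration of the two behaviors described above, not the actual body of dedup_signals(): the helper name validate_groups is hypothetical, and dropping groups that end up with no duplicates is an added assumption (such a group would be a no-op anyway).

```python
def validate_groups(groups, valid_ids):
    """Keep only dedup decisions whose IDs were actually in the batch.

    `groups` is a list of dicts shaped like
    {"canonical_id": int, "duplicate_ids": [int, ...]}.
    `valid_ids` is the collection of signal IDs sent to the model in this call.
    """
    valid = set(valid_ids)
    cleaned = []
    for group in groups:
        canonical = group["canonical_id"]
        if canonical not in valid:
            # Hallucinated canonical ID: skip the whole group.
            continue
        # Hallucinated duplicate IDs: filter them out, keep the rest.
        dups = [d for d in group["duplicate_ids"] if d in valid]
        if dups:  # assumption: a group with no duplicates left is dropped
            cleaned.append({"canonical_id": canonical, "duplicate_ids": dups})
    return cleaned


valid_ids = {101, 102, 103}
response = [
    {"canonical_id": 1, "duplicate_ids": [101, 102]},      # canonical not in batch
    {"canonical_id": 101, "duplicate_ids": [102, 999]},    # one bad duplicate
]
checked = validate_groups(response, valid_ids)
```

Here the first group is discarded entirely and the second survives with only the duplicate that was really in the batch. Running the check before any write means a bad ID never reaches the join column.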

A foreign key constrained against the target table will accept any ID the model returns, as long as it exists somewhere. The constraint that catches the hallucination, that IDs must exist in this batch, is a per-call invariant, not a relational one, which means it has to live in application code at the boundary.
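A toy demonstration of that gap, using SQLite from the Python standard library. The schema is hypothetical and much simpler than the real one (a targets table and a signals table with a deduplicated_by foreign key); it only shows that a foreign key checks existence in the referenced table and knows nothing about which IDs were in the current batch.

```python
import sqlite3


def make_db():
    """Build an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
    conn.execute("CREATE TABLE targets (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE signals ("
        " id INTEGER PRIMARY KEY,"
        " deduplicated_by INTEGER REFERENCES targets(id))"
    )
    conn.execute("INSERT INTO targets (id) VALUES (1)")  # an old, unrelated row
    conn.executemany("INSERT INTO signals (id) VALUES (?)", [(101,), (102,)])
    return conn


def fk_accepts(conn, signal_id, target_id):
    """Return True if the database lets us mark signal_id as a duplicate of target_id."""
    try:
        conn.execute(
            "UPDATE signals SET deduplicated_by = ? WHERE id = ?",
            (target_id, signal_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
accepted_stale = fk_accepts(conn, 101, 1)       # ID exists, but was never in the batch
accepted_missing = fk_accepts(conn, 102, 9999)  # ID exists nowhere
```

The database rejects only the ID that exists nowhere. The stale ID passes, because "was this ID in the batch I just sent?" is not a question the schema can answer.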