2026-03-21

Replaced a signal-deduplication pipeline that looked messy with one that looked clean. Production LLM cost jumped from $0.84/day to $3.50/day. Signal volume for one ticker (Rocket Lab) went from 3 entries/day to 69 entries/day. Reverted in two days, deleted 1,629 lines of the new architecture, and shipped a narrower variant that swapped only one stage.

The pipeline pulls articles from 50+ financial news sources every two hours, parses each into structured signals, and dedupes before synthesis. The old version did dedup in three stages. A title-similarity clusterer grouped articles into events at a SequenceMatcher threshold of 0.45. An LLM extracted signals per event, one call per event regardless of how many articles landed there. A final batched LLM dedup catch handled cross-title duplicates at the signal layer. A pre-work investigation found three real leaks: cross-title dupes the title clusterer missed, signals split across the 48-hour event window, and occasional hallucinated IDs from the LLM dedup catch. The cleaner architecture replaced the title clusterer and the LLM dedup catch with one embedding clusterer that ran at the article layer, before extraction, with a cosine threshold of 0.75. Two stages collapsed into one. Prevent duplication at the source.
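The old first stage, greedy title clustering, can be sketched roughly like this. The function name, data shape, and the greedy first-match assignment policy are my assumptions; only the SequenceMatcher comparison and the 0.45 threshold come from the pipeline description.

```python
from difflib import SequenceMatcher


def cluster_by_title(articles, threshold=0.45):
    """Greedy title clustering: attach each article to the first event
    whose representative title clears the similarity threshold,
    otherwise start a new event. Hypothetical reconstruction."""
    events = []  # each event is a list of article dicts
    for article in articles:
        for event in events:
            rep = event[0]["title"]
            ratio = SequenceMatcher(None, rep.lower(),
                                    article["title"].lower()).ratio()
            if ratio >= threshold:
                event.append(article)
                break
        else:
            events.append([article])
    return events


articles = [
    {"title": "Rocket Lab wins new launch contract"},
    {"title": "Rocket Lab awarded launch contract"},
    {"title": "Fed holds rates steady"},
]
events = cluster_by_title(articles)
# The two Rocket Lab titles land in one event; the Fed article starts its own.
```

At 0.45 this match is loose on purpose: it sweeps rewritten coverage of the same event together, which is exactly the batching effect that later turned out to matter.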

What I had not measured was what the messy thing was silently doing. The old title clusterer at 0.45 merged far more aggressively than the new embedding clusterer at 0.75: a loose title match sweeps near-duplicate coverage of one event together, while 0.75 cosine similarity only merges articles that are nearly paraphrases, not articles that merely share a topic. The new clusterer split apart articles the old one had merged. That meant one extraction call per article instead of one per event, roughly 3x more upstream extraction calls. Then 4x more surviving signals drove 4x more entity-update calls downstream. The summarize-entity stage cost went from $0.01-0.05 per cycle to $0.10-0.25 per cycle. Each layer of the cascade was a separate cost multiplier, none of which I had modeled, because the old pipeline’s grouping was not visible as architecture. It was visible only as the LLM call count, which had been low.
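A toy cost model makes the compounding visible. All volumes and unit costs below are hypothetical placeholders; only the ~3x extraction and ~4x signal multipliers come from the incident itself.

```python
def cycle_cost(extraction_calls, signal_count,
               cost_per_extraction=0.002, cost_per_entity_update=0.005):
    """Per-cycle LLM cost as a sum of stage volumes times unit costs.
    Unit costs are made-up placeholders, not measured figures."""
    return (extraction_calls * cost_per_extraction
            + signal_count * cost_per_entity_update)


# Old pipeline: ~3 articles per event, one extraction call per event.
old = cycle_cost(extraction_calls=10, signal_count=10)

# New pipeline: clusters split apart, so ~one extraction call per
# article (3x) and ~4x surviving signals hitting entity updates.
new = cycle_cost(extraction_calls=30, signal_count=40)

# The multipliers compound across stages into a >3x total jump,
# even though no single stage changed its unit cost.
```

The point of the sketch is that each stage's volume feeds the next stage's volume, so removing an upstream batching step multiplies every downstream line item at once.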

The revert kept the old event clusterer plus per-event extraction and swapped only the final dedup catch (an LLM-batched call) for an embedding cosine check that ran on the smaller signal volume. Embedding cost is now around $0.002/day. Same dedup quality, no upstream cascade.
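The replacement check can be sketched as a greedy cosine filter over signals. This assumes signals arrive with precomputed embedding vectors (in practice they would come from an embedding API), and the 0.75 threshold is carried over from the article clusterer; the revert's actual threshold isn't stated.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def dedupe_signals(signals, embeddings, threshold=0.75):
    """Keep a signal only if it is not within `threshold` cosine
    similarity of any already-kept signal. Runs on the small
    post-extraction signal volume, so there is no upstream effect."""
    kept, kept_vecs = [], []
    for sig, vec in zip(signals, embeddings):
        if all(cosine(vec, k) < threshold for k in kept_vecs):
            kept.append(sig)
            kept_vecs.append(vec)
    return kept


signals = ["RKLB wins contract", "RKLB awarded contract", "Fed holds rates"]
vecs = [[1.0, 0.1], [0.95, 0.2], [0.0, 1.0]]  # toy 2-d embeddings
unique = dedupe_signals(signals, vecs)
# The two near-identical RKLB signals collapse to one; the Fed signal survives.
```

Because this filter sits after extraction, it only ever sees the dozens of surviving signals per cycle rather than every article, which is why its cost lands in the fractions of a cent.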

The trap was conflating “theoretically cleaner data flow” with “practically better.” Architectural waste, like extracting a duplicate before deduping it, is visible and quantifiable. Architectural batching, like grouping articles into events so a single extraction call covers all of them, is invisible until you remove it. The first looks like obvious work to do. The second is the work.