2026-04-02
In one commit, I migrated all 11 LLM stages of a daily news pipeline I run from OpenAI over to Gemini 2.5 Flash. By the second morning, $120 of Google Cloud credits had drained. My internal cost_log table summed to $5.11 over the same window. Both numbers were correct. They were measuring different things.
Five things compounded into the gap.
Gemini 2.5 Flash is a reasoning model and emits thinking tokens by default. On the same prompts that produced 29 output tokens under the OpenAI models, Gemini was generating 291. On the extraction stage, 285 became 933. Three to ten times the output, on tasks that were not reasoning tasks in any meaningful sense. The fix is one parameter, reasoning_effort=low, but the API does not reject calls that omit it.
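A sketch of the guardrail I wish had been in the migration commit: a thin wrapper that refuses to assemble a request without an explicit reasoning effort, since the API itself will not complain. The `build_chat_request` helper and the allowed-values set are illustrative; `reasoning_effort` is the parameter name exposed by Gemini's OpenAI-compatibility layer, and the exact accepted values may differ by model version.

```python
# Hypothetical wrapper: make the thinking-token control mandatory at the
# call-construction layer, because the API accepts calls that omit it
# and then bills for the thinking tokens those calls generate.

ALLOWED_EFFORT = {"none", "low", "medium", "high"}  # assumed value set

def build_chat_request(model: str, messages: list, reasoning_effort: str = "low") -> dict:
    """Assemble kwargs for a chat.completions.create(...) call."""
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"explicit reasoning_effort required, got {reasoning_effort!r}")
    return {
        "model": model,
        "messages": messages,
        "reasoning_effort": reasoning_effort,  # the one parameter the migration omitted
    }

req = build_chat_request(
    "gemini-2.5-flash",
    [{"role": "user", "content": "Extract the named entities."}],
)
```

Defaulting to "low" inverts the failure mode: forgetting the parameter now under-spends instead of over-spending.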
When I first noticed inflated outputs, I set the parameter on the two highest-volume stages and watched cost drop on those stages. Nine other stages were still uncontrolled. The dashboard improved enough to feel like the fire was out.
Free-tier rate limits saturated next. Synchronous retries hammered the depleted bucket. Across 48 hours the pipeline issued 9,200 retries, every one of which counted as an API request. Classic backpressure failure, where retries stop being a recovery mechanism and become the load.
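The shape of the fix is standard: capped exponential backoff with jitter and a hard retry budget, so a depleted bucket gets breathing room instead of 9,200 extra requests. This is a minimal sketch, not the pipeline's actual retry code; `RateLimitError` stands in for whatever 429 exception the provider SDK raises.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's 429 exception."""

def call_with_backoff(fn, max_retries=4, base=1.0, cap=30.0, sleep=time.sleep):
    """Retry fn() on rate limits with capped exponential backoff plus
    full jitter, then give up, instead of retrying synchronously and
    becoming the load the limiter is shedding."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # budget exhausted; surface the failure upstream
            # full jitter: a random delay in [0, min(cap, base * 2^attempt)]
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The budget matters as much as the delay curve: a bounded number of retries per call puts a ceiling on how much load a stuck stage can generate.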
The most uncomfortable finding: the cost-tracking schema logs only prompt_tokens and completion_tokens, because that is what every model API I had previously talked to reported. Gemini reports thinking tokens in a separate field. I was not reading that field. The internal cost number was missing the dimension that dominated the bill.
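A defensive way to read usage objects is to treat thinking tokens as an optional field and fold them into output tokens before pricing. Field names here are assumptions to be checked against each SDK: the google-genai SDK reports thinking tokens as `thoughts_token_count` on the response's usage metadata, and OpenAI-style usage objects report them as `reasoning_tokens` under `completion_tokens_details`.

```python
def billable_output_tokens(usage) -> int:
    """Visible completion tokens plus any thinking tokens the provider
    reports in a separate field; every field is treated as optional so
    a missing dimension reads as 0 instead of crashing the cost log."""
    out = (
        getattr(usage, "completion_tokens", 0)          # OpenAI-style
        or getattr(usage, "candidates_token_count", 0)  # google-genai style
        or 0
    )
    thinking = getattr(usage, "thoughts_token_count", None)  # google-genai style
    if thinking is None:
        details = getattr(usage, "completion_tokens_details", None)
        thinking = getattr(details, "reasoning_tokens", 0) if details else 0
    return out + (thinking or 0)
```

Note the remaining trap: defaulting a missing field to 0 is exactly how the cost log lied in the "everything is fine" direction, so a production version should probably log which fields were absent rather than silently zeroing them.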
Last, and small but expensive: Gemini’s billing chart labels its x-axis UTC-8, which is Pacific standard time; in April the Pacific zone is on daylight time at UTC-7. The off-by-one made my own scheduled bursts look like external traffic for half a day.
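The offset shift is easy to confirm, assuming the chart's fixed UTC-8 label corresponds to the US Pacific zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

# January: standard time, UTC-8, matching the chart's year-round label.
winter = datetime(2026, 1, 2, 9, 0, tzinfo=PACIFIC)
# April: daylight time, UTC-7. Plotting April events on a fixed UTC-8
# axis shifts every one of them by an hour.
spring = datetime(2026, 4, 2, 9, 0, tzinfo=PACIFIC)
```

A fixed-offset axis label is never wrong twice a year by accident; it is wrong for roughly eight months, whenever the named zone it stands in for is observing daylight time.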
Six commits to recover. reasoning_effort on the loud stages, then a circuit breaker (3 consecutive 429s halt the provider for 30 minutes), then a $5/day hard cost cap checked every five minutes, then reasoning_effort on the remaining nine stages, then embeddings reverted, then all 11 stages reverted to OpenAI. The Gemini integration stayed in the codebase, gated off. The infrastructure was the part of the work that was right; what was wrong was assuming the migration could ship without it.
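The circuit-breaker commit reduces to a few lines of state. This is a sketch matching the numbers above (3 consecutive 429s, 30-minute halt), not the pipeline's actual class; the names and the injected clock are illustrative.

```python
import time

class ProviderBreaker:
    """Trip open after `threshold` consecutive 429s; refuse calls to the
    provider until `cooldown_s` has elapsed, then close again."""

    def __init__(self, threshold=3, cooldown_s=30 * 60, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable for tests
        self.consecutive_429s = 0
        self.open_until = 0.0

    def allow(self) -> bool:
        """Check before each provider call; False means the breaker is open."""
        return self.clock() >= self.open_until

    def record(self, status: int) -> None:
        """Feed back each response status; any success resets the streak."""
        if status == 429:
            self.consecutive_429s += 1
            if self.consecutive_429s >= self.threshold:
                self.open_until = self.clock() + self.cooldown_s
                self.consecutive_429s = 0
        else:
            self.consecutive_429s = 0
```

The breaker and the retry budget are complementary: backoff bounds the damage from one call, the breaker bounds the damage from a provider-wide outage or drained quota.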
Cost observability is a property of your provider, not your code. Migrating providers is not just migrating latency and quality, it is migrating which dimensions you are billed on. If the schema does not have a column for the dimension blowing up, no amount of summing fixes that. The cost log is a control surface, not a passive record. When it lies in the “everything is fine” direction, you have a production incident in progress and do not know it yet.