2026-05-05

Wired Anthropic prompt caching for the reflection prompt in an agent-simulation kernel I’m building. Expected ~40% savings on per-call input cost across long dialogues, since the static prefix (persona, scenario, calibration rubric) repeats unchanged across every reflection in a run. Set the cache_control: {"type": "ephemeral"} marker on the system content block exactly as Anthropic’s docs prescribe.
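
For reference, the marker shape on the native-SDK path looks roughly like the sketch below; the model id and STATIC_PREFIX constant are placeholders, not the kernel’s real config.

```python
# Minimal sketch of the cache_control marker on the system block (native SDK path).
# STATIC_PREFIX and the model id are placeholders for the kernel's actual values.
import anthropic

client = anthropic.Anthropic()

STATIC_PREFIX = "<persona + scenario + calibration rubric, identical every reflection>"

response = client.messages.create(
    model="claude-haiku-4-5",   # placeholder id for the Haiku 4.5 tier
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STATIC_PREFIX,
            "cache_control": {"type": "ephemeral"},  # marks the prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "Reflect on the last exchange."}],
)
```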

Three test paths: through the LiteLLM async wrapper plus Instructor for structured outputs, through LiteLLM directly, and through the native Anthropic SDK. All three returned cache_creation_input_tokens=0 and cache_read_input_tokens=0. The request succeeded at normal pricing every time. No error, no warning, just a silent no-op.
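
The only reliable signal is the usage block on the response (field names are the Anthropic SDK’s); the values in the comments are from the runs described here.

```python
# Silent no-op in practice: the call succeeds, but both cache counters stay at zero.
u = response.usage
print(u.cache_creation_input_tokens, u.cache_read_input_tokens)
# Haiku 4.5:        0     0      (markers silently ignored, full input price)
# Sonnet, call 1:   2803  0      (cache write)
# Sonnet, call 2:   0     2803   (cache read)
```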

The same code path with Sonnet 4.6 cached cleanly: cache_creation=2803 on the first call, cache_read=2803 on the second. So caching wasn’t broken end-to-end. It was failing only on Haiku 4.5, the cheaper-tier model where I wanted the savings most.

Bisected the threshold by running fixed-prefix calls at increasing prompt sizes. At ~2,809 tokens, no caching. At ~5,602 tokens, both write and read fired correctly. Anthropic’s docs publish 2,048 tokens as the Haiku minimum; the empirical floor on Haiku 4.5 sits somewhere between 2,809 and 5,602 (probably around 4,096). My static prefix, at ~1,500 tokens, was under even the published minimum, let alone the empirical floor, and would never cache.
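
The probe was nothing fancier than the sketch below. The filler text and the words-to-tokens approximation are mine; the real run sized prefixes with the tokenizer.

```python
# Rough threshold probe: send the same synthetic prefix twice per size and watch
# when the cache counters go nonzero. Filler and token estimate are illustrative.
import anthropic

client = anthropic.Anthropic()

def probe(approx_tokens: int, model: str = "claude-haiku-4-5") -> None:
    prefix = "calibration rubric line. " * (approx_tokens // 5)
    system = [{"type": "text", "text": prefix,
               "cache_control": {"type": "ephemeral"}}]
    for label in ("first call (expect write)", "second call (expect read)"):
        resp = client.messages.create(
            model=model, max_tokens=16, system=system,
            messages=[{"role": "user", "content": "ok"}],
        )
        u = resp.usage
        print(f"{approx_tokens:>5} tokens, {label}: "
              f"creation={u.cache_creation_input_tokens} "
              f"read={u.cache_read_input_tokens}")

for size in (1_500, 2_809, 4_096, 5_602):
    probe(size)
```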

The reflex fix would be to pad the prefix to 4,096+ tokens: accept the cache-write surcharge (1.25x normal input) up front, plus ~2,600 tokens of filler billed on every call, in exchange for the read discount (0.1x), and the math doesn’t pencil out at our turn count. The actual fix was switching the medium-tier ablation provider from Haiku-uncached to Sonnet-cached. Cached Sonnet reads are $0.30/M; uncached Haiku input is $1/M. After the first dialogue, Sonnet-with-caching is cheaper per call than Haiku-without, and probably better at the actual reflection task.
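
The back-of-envelope arithmetic at the quoted rates, with an illustrative prefix size and turn count (both assumptions, not the kernel’s real numbers):

```python
# Prefix cost per dialogue at the rates quoted above:
# Sonnet input $3/M -> $3.75/M cache write (1.25x), $0.30/M cached read (0.1x);
# Haiku input $1/M uncached. Prefix size and turn count are illustrative.
PREFIX = 2_800   # tokens in the static prefix
TURNS = 20       # reflections per dialogue

haiku_uncached = TURNS * PREFIX * 1.00 / 1e6
sonnet_cached = PREFIX * 3.75 / 1e6 + (TURNS - 1) * PREFIX * 0.30 / 1e6

print(f"Haiku, no cache: ${haiku_uncached:.5f}")  # ~$0.056
print(f"Sonnet, cached:  ${sonnet_cached:.5f}")   # ~$0.026
```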

Published cache minimums are a starting point, not a contract. Below threshold, cache_control is silently ignored, the same shape as a config file that “supports” a key the parser drops on the floor. Verifying caching by reading cache_creation_input_tokens on the response is the only way to confirm the markers are doing anything. Default assumption when a feature has a size threshold and a silent failure mode: test at the size you’re actually shipping, not the size the docs assume you’re shipping.
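
What that default looks like in practice is roughly a one-line guard after each reflection call; the logger name and the warn-rather-than-raise choice below are mine, not the kernel’s.

```python
# Sketch of a post-call guard: if the prefix was supposed to be cached but neither
# counter is nonzero, the cache_control markers are being silently ignored.
import logging

log = logging.getLogger("reflection")

def assert_cache_active(usage) -> None:
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if wrote == 0 and read == 0:
        log.warning("prompt cache inactive: cache_control markers had no effect")
```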