2026-05-06

The output renderer for my dyadic-simulation kernel is a pure function: trace JSONL in, markdown report out. No LLM calls, no model spend, deterministic. Same trace, same output, every time.
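
For concreteness, a minimal sketch of that shape, assuming a standalone render function. The name render_report and the trust_delta / warmth_delta fields are illustrative stand-ins, not the actual agent_lab API:

```python
import json
from pathlib import Path


def render_report(trace_path: str | Path) -> str:
    """Pure function: trace JSONL in, markdown report out.

    No network, no model calls, no clock, no randomness. The only input is
    the file's contents, so the same trace yields the same report every time.
    """
    turns = [
        json.loads(line)
        for line in Path(trace_path).read_text().splitlines()
        if line.strip()
    ]
    out = [f"# Scenario report ({len(turns)} turns)", ""]
    for i, turn in enumerate(turns, start=1):
        # Illustrative fields only; the real trace schema is richer.
        out.append(
            f"- Turn {i}: trust Δ {turn.get('trust_delta', 0.0):+.2f}, "
            f"warmth Δ {turn.get('warmth_delta', 0.0):+.2f}"
        )
    return "\n".join(out)
```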

agent-lab is a typed-state simulation kernel I built where two LLM agents run an N-turn dialogue and emit per-turn structured reflections (trust deltas, warmth deltas, sacred-cow violations). The renderer turns a single trace into a reader-facing scenario report: dialogue summary, symbolic-state trajectory, friction prediction, repair likelihood, sacred-cow boundaries crossed.
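
A sketch of what one per-turn reflection record might carry, pieced together from the fields named in this post; the names and types are illustrative, not the kernel's actual typed state:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnReflection:
    """One agent's structured reflection after a single turn (illustrative schema)."""

    speaker: str                     # which of the two agents is reflecting
    trust_delta: float               # signed change in trust this turn
    warmth_delta: float              # signed change in warmth this turn
    sacred_cow_violations: tuple[str, ...] = ()   # e.g. ("dignity",)
    private_reaction: str = ""       # unspoken appraisal of the other agent
    open_disagreement: bool = False  # was disagreement voiced out loud?
    confidence_shifts: int = 0       # number of stance revisions this turn
```

Every report section is then a deterministic aggregation over a list of these records; nothing downstream of the trace gets to improvise.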

The reflex move is to let an LLM read the trace and write the report. It would ship faster and the prose would read better. The cost is reproducibility. If I show you a report and you ask “what does this say about run X,” I want the answer to be a function of run X’s data, not a function of run X plus whichever model generated the prose plus whichever sampling temperature plus whichever prompt revision was current that week.

The renderer implements 8 detection rules from a construct-validity mapping (Gottman, EFT, adult-attachment lineages). Each rule is a deterministic predicate over trace fields: contempt-pattern fires on (trust crash + dignity sacred-cow); criticism-pattern fires on (warmth crash + character-attack private_reaction); defensiveness-pattern fires on (open_disagreement + zero confidence_shifts), and so on. 530 lines of renderer code, 30 unit tests, one per detector path plus integration tests for the end-to-end render. CLI is python -m agent_lab render <trace_path>.
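
The shape of a detector, using the contempt rule as the example; the threshold, field names, and test names here are illustrative stand-ins for whatever the renderer actually uses:

```python
def contempt_pattern(turn: dict, trust_crash_threshold: float = -0.5) -> bool:
    """Fires on (trust crash + dignity sacred-cow), per the mapping above.

    A deterministic predicate over trace fields: no model in the loop, so the
    same turn always yields the same verdict and a unit test can pin it down.
    """
    trust_crashed = turn.get("trust_delta", 0.0) <= trust_crash_threshold
    dignity_violated = "dignity" in turn.get("sacred_cow_violations", [])
    return trust_crashed and dignity_violated


def test_contempt_fires_on_trust_crash_with_dignity_violation():
    turn = {"trust_delta": -0.8, "sacred_cow_violations": ["dignity"]}
    assert contempt_pattern(turn)


def test_contempt_ignores_crash_without_dignity_violation():
    turn = {"trust_delta": -0.8, "sacred_cow_violations": []}
    assert not contempt_pattern(turn)
```

Because the predicate is pure, each flag in a report traces back to the exact turn and fields that fired it, and the corresponding unit test pins that behavior down.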

The variance the kernel is designed to surface is cross-model variance: same scenario, three models, different judgment. Three sample reports in evals/sample_reports/ show this directly. Haiku reads as rich-rupture, Sonnet reads as constructive, gpt-4o-mini reads as flat-affect, all on the same scenario and archetype. If the renderer were itself an LLM call, that variance would reach the reader filtered through a fourth model, no longer attributable to the three models under study alone.

The principle: the consumption layer for an experiment is part of the experiment. Putting an LLM there means every reader is reading through a model whose behavior is not under test. A deterministic renderer is reproducibility you can grep.