2026-05-06
40 simulation cells: 5 cofounder-postmortem scenarios × 2 archetypes × 2 models × 2 repeats. 38 of 40 cells produced “no rupture-class pattern detected.” Original validation rubric: 0 Match / 2 Partial / 38 Miss. The kernel did not demonstrate predictive validity on this corpus.
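For concreteness, the cell grid is just the cross product of those four factors. A minimal sketch; the scenario and model names come from the corpus described below, while the archetype labels are placeholders (the two archetypes aren’t named in this note):

```python
from itertools import product

# Grid for this pass. Scenario and model names are from the run described
# below; the archetype labels are placeholders, not agent-lab's actual names.
SCENARIOS = ["snap", "twitter", "microsoft", "tinder", "fab"]
ARCHETYPES = ["archetype-a", "archetype-b"]
MODELS = ["gpt-4o-mini", "claude-haiku-4.5"]
REPEATS = [0, 1]

CELLS = list(product(SCENARIOS, ARCHETYPES, MODELS, REPEATS))
assert len(CELLS) == 40  # 5 x 2 x 2 x 2
```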
The framing the result supports is sharper than the score suggests.
agent-lab is a typed-state simulation kernel I built: two LLM agents run a dyadic dialogue and emit structured reflections (trust deltas, warmth deltas, sacred-cow violations). The validation pass: encode 5 published cofounder ruptures (Snap, Twitter, Microsoft, Tinder, Fab) as YAML scenarios, run gpt-4o-mini and claude-haiku-4.5 through them for 6 turns each, and check whether the kernel’s friction prediction matches the documented rupture mode.
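The shape of the typed state, sketched as Python types. All class and field names here are assumptions, not agent-lab’s actual API; the constructs (sacred-cow violations, bid-turned-toward, repair-accepted) are the ones named in the results below, the rest of the taxonomy is illustrative:

```python
from dataclasses import dataclass, field

# Illustrative construct split -- "contempt-display" is a placeholder name.
RUPTURE_SIDE = {"sacred-cow-violation", "contempt-display"}
CONSTRUCTIVE = {"bid-turned-toward", "repair-accepted"}

@dataclass
class Reflection:
    """One structured reflection emitted by the listening agent after a turn."""
    turn: int
    trust_delta: float                                # signed increment to cumulative trust
    warmth_delta: float                               # signed increment to cumulative warmth
    fired: list[str] = field(default_factory=list)    # construct firings this turn

@dataclass
class ListenerState:
    """Cumulative typed state for one agent-as-listener across the dialogue."""
    trust: float = 0.50    # the +0.50 initial baseline cited below
    warmth: float = 0.50   # warmth baseline not stated in this note; assumed symmetric

    def apply(self, r: Reflection) -> None:
        self.trust += r.trust_delta
        self.warmth += r.warmth_delta
```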
The simulated dialogues didn’t rupture. Final cumulative trust ended at a median of +0.80 across 80 listener-states (40 cells × 2 listeners), against an initial +0.50 baseline. Final cumulative warmth ended at a median of +0.85. Sacred-cow firings: 1 across 40 cells. 84% of construct firings (21 of 25) were constructive-dynamics (bid-turned-toward, repair-accepted); 16% (4 of 25) were rupture-side. The off-the-shelf LLM dyads constructively resolved the harassment-allegation and equity-dilution-during-illness scenarios at very high rates, regardless of how charged the underlying real-world case was.
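The bookkeeping behind those numbers is nothing more than medians and a set split. A sketch, using the illustrative rupture-side set from above:

```python
from statistics import median

RUPTURE_SIDE = {"sacred-cow-violation", "contempt-display"}  # illustrative split, as above

def summarize(final_states: list[tuple[float, float]], firings: list[str]) -> dict:
    """final_states: one (trust, warmth) pair per listener-state (80 in this run).
    firings: every construct fired across all cells (25 in this run)."""
    rupture = sum(1 for f in firings if f in RUPTURE_SIDE)
    return {
        "median_final_trust": median(t for t, _ in final_states),   # +0.80 here
        "median_final_warmth": median(w for _, w in final_states),  # +0.85 here
        "constructive_firings": len(firings) - rupture,             # 21 of 25 (84%)
        "rupture_side_firings": rupture,                            # 4 of 25 (16%)
    }
```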
The kernel correctly classified non-ruptures as non-ruptures. That’s a true negative on the kernel’s read of the trace (rupture being the positive class). The bottleneck is one layer up: the simulated dyad behavior. With these models, at 6 turns, on these scenarios, the LLM dyads were too cooperative for the kernel’s rupture-side detectors to fire. The pass measures the LLM-dyad cooperativeness ceiling on the cofounder corpus, not the kernel’s classification ability on rupture trajectories that aren’t there.
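That stage attribution can be made mechanical. A sketch of the decision for a documented-rupture scenario, again assuming the illustrative construct split:

```python
RUPTURE_SIDE = {"sacred-cow-violation", "contempt-display"}  # illustrative, as above

def attribute_miss(trace_firings: list[str], kernel_flagged_rupture: bool) -> str:
    """For a documented-rupture scenario, a Miss can mean two different things."""
    if not any(f in RUPTURE_SIDE for f in trace_firings):
        # Upstream failure: the dyad stayed cooperative, so there was no
        # rupture trajectory for the kernel to classify. This run's 38 Misses.
        return "dyad-layer miss"
    if not kernel_flagged_rupture:
        # Kernel failure: rupture-side events were in the trace and the
        # detector still reported "no rupture-class pattern detected".
        return "kernel-layer miss"
    return "no miss"
```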
The reflex move when validation looks like a fail is to fix the rubric: loosen the match criterion, lower the detection threshold, claim partial credit. That hides what the run actually measured. The honest write-up names which step failed (dyad simulation, not kernel classification) and surfaces the corpus-vs-model-panel mismatch as itself the finding. The fix space is upstream: a larger model panel, longer simulation horizons, or scenario reframings that close the cooperative escape hatch. Not the rubric.
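Concretely, the next pass is a sweep over those three upstream knobs, not a rubric change. A hypothetical config sketch; every key and value here is illustrative, not agent-lab’s actual config surface:

```python
# Hypothetical next-pass sweep over the three upstream fixes named above.
NEXT_PASS = {
    # Larger model panel: keep the baseline two, add stronger models (TBD).
    "models": ["gpt-4o-mini", "claude-haiku-4.5"],
    # Longer horizons: 6 turns was this run's ceiling.
    "turns": [6, 12, 24],
    # Scenario reframings that close the cooperative escape hatch, e.g.
    # pinning one agent to a non-concession stance in its system prompt.
    "reframings": ["baseline", "non-concession-stance"],
    "repeats": 2,
}
```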
When a multi-stage system fails validation, the score answers a downstream question. The first question to ask is which stage failed.