Log by topic

The default /log/ feed is reverse-chronological. This page groups entries by the shape of the lesson, so a problem you’re seeing today can find the closest prior receipt.

Some entries appear in more than one bucket when the lesson cuts across.

Tests and validation: when the green check meant nothing

Tautological tests, masked fixtures, statistical noise, validation that scored the wrong thing.

Five real positions surfaced three substrate bugs the unit tests cleared — detector tests checking only the detector’s own contract are tautological against that contract.
Savepoint test fixture masked two production bugs — a 2.5x faster fixture turned green CI into evidence of nothing.
When validation fails, ask which step failed — a 0/2/38 score answers a downstream question; first ask which stage failed.
Diagnosed the wrong LLM failure mode from one trace — single-trace conclusion was 14 errors; aggregate showed 318.
Did not promote exit_wide to default — +9pp leaderboard advantage was below the noise floor at n≈100.
Held-out vocabulary halved the lever’s ROI — disjoint-by-construction held-out set cut +76.7pp to +33.3pp.
Plausible-shaped errors in an LLM-drafted methodology doc — make primary-source verification a separate phase from drafting.
PEAD validated as the one signal with real edge — surviving signal across 387 events; behavioral inefficiency.
Paper trader was treating ‘roll’ as ‘close’ — the harness was the lie; -4.7% drag came from misbooked actions, not the strategy.

Silent failure modes

Things that broke without surfacing. The bug is invisible until you specifically look for it.

Bluesky posts silently dropped for three weeks — optional dep + 0.03s failure duration; took three sessions to spot.
A silent deploy-skip turned a 24-minute fix into a 10-hour outage — alert on the gate, not just on the build.
Anthropic cache_control silently no-ops below model minimums — empirical floor sits well above the published 2,048-token threshold.
Datetime separator mismatch hid every same-day signal — one character was load-bearing in a lexicographic compare.
Engine fell back to alphabetical when belief collapsed — silent fallback path; surfacing it lifted win rate 89% → 98.5%.
An authoritative seed must disable, not just enable — removing entries from a canonical list was invisible to the seed.
Vite baked the bundle before the env vars existed — VITE_* inline at build time; runtime env was too late.
LLM dedup was hallucinating signal IDs into joins — canonical_id values not present in any batch silently false-deduped 1,714 signals.
$120 in Gemini credits, $5.11 in my cost log — schema missing the dimension that dominated the bill.
Agent skip threshold quietly broke prediction refinement — gate inherited by the refinement loop silently froze every existing prediction.

Layer and boundary contracts

Where the trust, durability, or visibility line actually lives — and where it was assumed to live.

Hidden info at the protocol, not the renderer — UI-layer enforcement is a comment, not a constraint.
Persist before fanout, even when the cluster is one server — durability boundary belongs in front of the fanout.
The labeling UI was part of the corpus definition — fields hidden from the labeler are absent from the dataset forever.
ChatGPT highlight-copy returned only the visible message — virtualized DOM ≠ rendered text; go to the API for full state.
Entity resolver became the single gatekeeper — a resolver that falls back to raw input pollutes everything downstream.
A public chess eval cache is not a replacement for provenance — same-shape rows, different contracts; read-through with source metadata.
Next standalone passed health but failed hydration — HTTP 200 from /health while the browser app crashed; healthcheck was the wrong contract.
Video embed ready is not video playing — reveal on observable playback, not iframe readiness.
Scheduled video channels need one duration clock — the guide and player were telling time from different clocks.
Stockfish MultiPV failed on one legal line, not the whole position — validate the result you actually use, not every byproduct.

Operational rollout and infra

Deploy ordering, fanout, image size, ladders. Keeping the rollout boring.

Railway deploy deadlocked behind cold DB init — ensure_schema before health server; cold-start surfaced the invariant.
Stockfish backfill wedged behind a single vCPU — fan-out across ephemeral workers, not a fatter persistent box.
Buffered write-back turned Modal callbacks into drainable shards — durable shards first, controlled drainer back to the system of record.
The Stockfish backfill finished because the rollout stayed boring — every scale step had a ledger, cap, metric, and stop condition.
The next Stockfish target is throughput per dollar, not more knobs — short ladder: medium run, ceiling test, batch sweep, drainer stress.
The web image was paying for an ML stack it did not use — 2.85 GB → 189.5 MB; image push 163s → 8s.

LLM systems and evaluation

Model behavior, tier choice, oracle limits, prompt and eval architecture.

11 LLM falsifications, then one tier swap flipped the verdict — cheap-tier results lie; ladder probe before declaring a null.
The cleaner dedup architecture cost 4x — the messy stages were doing real grouping work that didn’t survive the rewrite.
Predictions became immutable claims at fixed horizons — drop the update verbs; let the LLM gate its own output.
The oracle had no concept of the regime under test — Stockfish-with-truth marked every FOW-aware move it wouldn’t pick as wrong.
Belief-state overlay inverted the discrete floor — continuous suppression replaced a discrete reject floor; -46.7pp.
Symmetric upgrade regressed; asymmetric stack was the right shape — continuous cosine has no “unrelated” floor; the discrete floor was the load-bearing piece.
140x more training tokens regressed the engine 32pp — both “more data” levers null in one evening.

Heuristic ranking and scoring

Anchoring units to something external; choosing the right partition.

Feature salience anchored to engine-eval delta, not hand-set values — counterfactual removal is the cheapest external anchor.
Brief synthesis: map-reduce by entity, not by signal cap — change the partitioning unit, not the cap.

Distribution and product

Channel structure, GTM, semantic loops, source curation.

Three channels, one audience — channel count is not signal count when the channels share an audience.
Window TV grew a Mandarin immersion ladder — graded listening rows with learner level as a routing signal.
aide started turning session logs into reusable project memory — reviewed semantic artifact loop: logs → proposed knowledge → runbooks.