<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://brianhliou.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://brianhliou.com/" rel="alternate" type="text/html" /><updated>2026-04-12T22:27:23+00:00</updated><id>https://brianhliou.com/feed.xml</id><title type="html">Brian Liou</title><subtitle>I build backend systems and side projects. Writing about technical things and other stuff I&apos;m thinking about.</subtitle><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><entry><title type="html">The Market Isn’t One Game</title><link href="https://brianhliou.com/posts/the-market-isnt-one-game/" rel="alternate" type="text/html" title="The Market Isn’t One Game" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://brianhliou.com/posts/the-market-isnt-one-game</id><content type="html" xml:base="https://brianhliou.com/posts/the-market-isnt-one-game/"><![CDATA[<p>People explain why you can’t beat the market in two ways. Either it’s rigged, or you’re not smart enough. Both miss the point.</p>

<p>The real problem is that the market isn’t one game. It’s a bunch of different games happening at the same time, on the same board, with the same pieces, but under completely different rules depending on who you are. Most people don’t lose because they play badly. They lose because they don’t realize which game they’re in.</p>

<h2 id="why-this-is-harder-than-any-other-game">Why This Is Harder Than Any Other Game</h2>

<p>Every competitive game humans have come up with is simpler than markets. Chess averages about 35 legal moves per position. Go averages about 250. Real-time games like League of Legends add execution pressure, fog of war, and team coordination on top of that.</p>

<p>Markets sit beyond all of them. Four things make them different.</p>

<p><strong>The game rewrites itself.</strong> A strategy that works attracts money. More money changes prices. Changed prices kill the strategy. In chess, no move you make changes the rules of chess. In markets, the rules shift based on how people play. The game mutates in response to its own players.</p>

<p><strong>You never see the full board.</strong> In Go, both players can see everything. In markets, the relevant information includes geopolitics, weather, psychology, corporate fraud, central bank policy, and whatever is happening in a CEO’s personal life. Most of it is hidden, delayed, or just wrong.</p>

<p><strong>You can’t see your opponents.</strong> The other side of your trade could be a retiree moving money around, a hedge fund closing a position, or an insider who already knows what’s about to happen. You don’t pick who you play against and you can’t tell how good they are.</p>

<p><strong>There’s no finish line.</strong> You can’t get checkmated. There’s no final score. Which means losses can always become “I’m just early.” Every other game ends. This one doesn’t.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Complexity scale:

Chess  →  Poker  →  Go  →  League of Legends  →  Markets
 ~35      hidden    ~250   real-time +           all of
moves/    cards     moves/ fog of war +          the left +
turn                turn   coordination          reflexivity +
                                                 infinite players +
                                                 no win condition
</code></pre></div></div>

<h2 id="six-games-one-board">Six Games, One Board</h2>

<p>Here’s the framework. The market isn’t one game with a skill ladder. It’s six different games being played in the same place. And they stack. Each one above requires something the one below doesn’t have.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────┐
│  ACCESS        Insider knowledge, connections   │ ← information you shouldn't have
├─────────────────────────────────────────────────┤
│  STRUCTURAL    Market making, HFT, liquidity    │ ← co-located servers, exchange access
├─────────────────────────────────────────────────┤
│  SIGNAL        Proprietary data → predictions   │ ← alt-data pipelines, ML at scale
├─────────────────────────────────────────────────┤
│  SYSTEMATIC    Quant process, factor exposure   │ ← tooling, discipline, AI-augmented research
├─────────────────────────────────────────────────┤
│  ANALYTICAL    Fundamental research, valuation  │ ← domain knowledge, time
├─────────────────────────────────────────────────┤
│  NARRATIVE     Vibes, sentiment, stories        │ ← a brokerage account
└─────────────────────────────────────────────────┘
  Each level up requires new data, infrastructure, or capital to unlock.
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>Game</th>
      <th>What the edge is</th>
      <th>What you need to unlock it</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Narrative</strong></td>
      <td>Sentiment, vibes, stories</td>
      <td>A brokerage account and an opinion</td>
    </tr>
    <tr>
      <td><strong>Analytical</strong></td>
      <td>Better understanding of what a business is worth</td>
      <td>Domain knowledge, financial literacy, time to research</td>
    </tr>
    <tr>
      <td><strong>Systematic</strong></td>
      <td>Quantitative process, run at scale</td>
      <td>Tooling, programming ability, discipline to follow a process</td>
    </tr>
    <tr>
      <td><strong>Signal</strong></td>
      <td>Proprietary data turned into predictions</td>
      <td>Alt-data pipelines, ML infrastructure, serious capital for data acquisition</td>
    </tr>
    <tr>
      <td><strong>Structural</strong></td>
      <td>Profiting from how markets work, not from predicting prices</td>
      <td>Co-located servers, exchange memberships, regulatory approvals</td>
    </tr>
    <tr>
      <td><strong>Access</strong></td>
      <td>Knowing things other people can’t know</td>
      <td>Being in the room, or close enough to it</td>
    </tr>
  </tbody>
</table>

<p>Each level up isn’t just harder. It requires something fundamentally different. You don’t graduate from Analytical to Systematic by getting smarter. You need different tools, different infrastructure, and often different capital. A jogger doesn’t become a Formula 1 driver by running faster. They need a completely different vehicle.</p>

<p>Getting better at the Narrative game will never get you the returns available in the Signal game. <strong>Picking the wrong game costs you more than playing your game badly.</strong></p>

<h2 id="above-the-game">Above the Game</h2>

<p>Some players aren’t even competing for returns. They run the arena.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────────────────────────────────┐
│  SOVEREIGNS                       │  Can flip the table
│  (nation-states, SWFs)            │
├───────────────────────────────────┤
│  RULE-SETTERS                     │  Write the rules
│  (Fed, SEC, regulators)           │
├───────────────────────────────────┤
│  INFRASTRUCTURE                   │  Tax every transaction
│  (exchanges, index providers)     │
├───────────────────────────────────┤
│  ┌─────────────────────────────┐  │
│  │      THE SIX GAMES          │  │  Compete for returns
│  │    (all the players)        │  │
│  └─────────────────────────────┘  │
└───────────────────────────────────┘
</code></pre></div></div>

<p><strong>Infrastructure</strong> takes a cut of every transaction. Exchanges, clearinghouses, index providers, data vendors. When S&amp;P adds a stock to an index, billions of dollars move automatically. These players don’t need to be right about anything. They just collect rent.</p>

<p><strong>Rule-setters</strong> decide the rules everyone else is trying to model. The Fed doesn’t predict interest rates. It decides them. When the SEC approved Bitcoin ETFs, hundreds of billions in new flows opened up overnight. One regulatory decision can create or wipe out an entire market.</p>

<p><strong>Sovereigns</strong> can flip the whole table. Sanctions, capital controls, currency interventions. A sovereign wealth fund can move a market just by showing up or leaving.</p>

<p>The higher up this stack you go, the less skill matters. These players don’t play the game. They shape the board.</p>

<h2 id="marked-cards">Marked Cards</h2>

<p>Then there are insiders. They break the whole framework because they’re not playing better. They already know the answer.</p>

<p><strong>Hard insiders</strong> are corporate officers and deal advisors trading on information that isn’t public yet. Illegal. Sometimes prosecuted.</p>

<p><strong>Soft insiders</strong> are people like engineers who can feel demand shifting before it hits earnings, lobbyists who know which way a regulation is going, or VCs who see private metrics that tell you where public markets are headed. Mostly legal, definitely gray.</p>

<p><strong>Political insiders</strong> are legislators trading on policy knowledge. The STOCK Act makes it illegal. Nobody enforces it.</p>

<p><strong>Connected capital</strong> isn’t inside the room, but is close enough that information leaks to it through relationships. You can see this in the unusual options activity that shows up before almost every big acquisition. Somebody always knows.</p>

<p>These people haven’t found a better chess move. They’ve read the last page of the book. And they make the game harder for everyone else, because some of what looks like random market movement is actually people trading on things that haven’t been announced yet.</p>

<h2 id="the-index-paradox">The Index Paradox</h2>

<p>Given everything above, here’s the strangest part. The best move for most players is to stop playing.</p>

<p>An index fund has no strategy. It holds everything. It doesn’t think. It beats the majority of active participants because it has no research costs, almost no transaction costs, never panics, and captures the economic return of owning businesses without trying to compete.</p>

<p>No other game works like this. There’s no chess equivalent of “don’t play, collect the average score of everyone, and beat 80% of them.” In markets that’s a real option and it works.</p>

<p>This sets the floor. Whatever you do, you have to beat this. Not just make money. Make more than you would have made doing literally nothing. After trading costs, after taxes, after the time you spent thinking about it.</p>

<h2 id="whats-the-highest-level-you-can-actually-play">What’s the Highest Level You Can Actually Play?</h2>

<p>This is the question the whole framework builds toward.</p>

<p>Look at the stack again. Access requires connections you probably don’t have. Structural requires exchange infrastructure and regulatory approvals. Signal requires millions in data acquisition and ML infrastructure.</p>

<p>For most individuals, the realistic ceiling is the <strong>Systematic</strong> game. You need programming ability, good tooling, and the discipline to follow a quantitative process. That’s a real unlock, but it’s an unlock most technically skilled people can actually reach.</p>

<p>AI shifts the ceiling here. One person with good tooling can now chew through earnings transcripts, SEC filings, patent data, job postings, and alternative data at a throughput that would have cost seven figures in analyst salaries five years ago. The edge isn’t some superhuman insight. It’s coverage. You can look at more things more consistently than any single human analyst.</p>

<p>But the ceiling of that game is specific. You’re hunting for mispricings in places where big money can’t go without moving prices. Micro-caps, weird special situations, post-spinoff equities, niche sectors with almost no analyst coverage. Your edge is that the space is too small and too annoying for institutional capital to bother with.</p>

<p><strong>So what can you actually make?</strong></p>

<p>Honestly, a few percentage points over the index per year. Sustained over a long time. With real variance and stretches where you underperform and wonder if the whole thing is broken.</p>

<p>Not 60%+ annual returns like Renaissance’s Medallion fund. Those come from the Signal and Structural games at a scale and speed you can’t touch as a solo player.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Realistic annual returns by game:

Narrative:     negative after costs (most retail traders lose money)
Analytical:    roughly index-matching after effort and fees
Systematic:    index + 2-5% in favorable conditions, with dry spells
Signal:        10-30%+ (requires massive infrastructure investment)
Structural:    consistent but requires institutional setup
Access:        high, but illegal or ethically gray

Index fund:    7-10% long-term average, no effort, no skill required
</code></pre></div></div>

<p>The index does about 7-10% a year over long periods. To justify playing actively, you need to clear that plus trading costs, tax drag, and the value of the time you’re spending. A good systematic solo player might compound at 12-15% in a good environment. That’s real money over decades. But it demands:</p>

<ul>
  <li>A specific theory of your edge that you could be proven wrong about. Not “I’m smart.” Something like “I find micro-cap spinoffs before institutional coverage shows up.”</li>
  <li>A process that works without you needing to feel inspired or convicted in the moment</li>
  <li>Position sizing where no single mistake can blow you up</li>
  <li>The ability to watch yourself underperform the index for years and keep going anyway</li>
</ul>
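
<p>To put a number on “real money over decades,” here’s the compounding arithmetic. The starting capital is arbitrary and the rates are the illustrative figures from this post, not forecasts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Compounding gap between indexing and a good systematic solo player.
# Starting capital is arbitrary; the rates are the illustrative figures
# from this post, not forecasts.
principal = 100_000
years = 30

index = principal * 1.08 ** years        # ~8%: mid-range index return
systematic = principal * 1.13 ** years   # ~13%: index plus ~5 points of alpha

print(f"Index:      ${index:,.0f}")      # ~$1,006,000
print(f"Systematic: ${systematic:,.0f}") # ~$3,912,000
</code></pre></div></div>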

<p>Most of the edge available at this level comes from going where the market doesn’t look. Your advantage is the hassle. The day it stops being a hassle for bigger players, the advantage disappears.</p>

<h2 id="how-to-think-about-all-this">How to Think About All This</h2>

<p>A few things fall out of the framework.</p>

<p><strong>The market isn’t your opponent.</strong> It’s the combined output of everyone playing every game at once. Beating it means beating the weighted average of all of them, including the majority who lose. The bar is lower than it sounds, but higher than most people clear.</p>

<p><strong>The first decision is which game to play.</strong> Not which stock to buy. Which game. Most people never think about this consciously. The game gets chosen for them. Usually it’s the Narrative game, which has the worst expected returns of all six.</p>

<p><strong>Each game above requires a new unlock.</strong> You don’t level up by getting smarter at your current game. You level up by acquiring something new: tooling, data, infrastructure, access. If you don’t have what the next level requires, you can’t play it no matter how skilled you are.</p>

<p><strong>The ceiling is real.</strong> For a solo player, the Systematic game is as high as it goes. The returns there are meaningful but not spectacular. A few points of alpha over long periods. The Signal, Structural, and Access games pay better but require things individuals don’t have.</p>

<p><strong>Indexing is the right default.</strong> Not because the game can’t be beaten. It can. But knowing whether <em>you</em> can beat it is itself an incredibly hard problem. Everyone thinks they’re above average. The cost of being wrong is years of underperformance plus all the time you spent. The “just index” advice isn’t giving up. It’s the right answer for most people in a game where you can’t reliably judge your own skill level.</p>

<p>The most important question isn’t whether the market can be beaten. It’s whether you can honestly figure out which game you’re playing and whether your edge in that game is real enough to clear the index after all costs. If you can’t answer that clearly, the answer is already index.</p>

<p>The next question is what actually happens when you try to play the Systematic game for real. That’s what I’m building toward.</p>

<h2 id="what-you-learned">What You Learned</h2>

<p>✓ Markets are six different games on one board, not one game with skill levels<br />
✓ Each game above requires new data, infrastructure, or capital to unlock<br />
✓ Picking the wrong game costs more than playing your game badly<br />
✓ The highest realistic level for a solo player is the Systematic game<br />
✓ The honest return ceiling there is a few points of alpha per year, with long dry spells<br />
✓ Indexing is the right default because self-assessment in this game is unreliable</p>

<hr />

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://en.wikipedia.org/wiki/A_Random_Walk_Down_Wall_Street">A Random Walk Down Wall Street</a> - Malkiel’s classic on efficient markets and why indexing works</li>
  <li><a href="https://en.wikipedia.org/wiki/Renaissance_Technologies">Renaissance Technologies</a> - Background on the most successful quant fund, playing the Signal game at its peak</li>
  <li><a href="https://en.wikipedia.org/wiki/Annie_Duke">Annie Duke</a> - Author of Thinking in Bets, on making decisions when you can’t know the outcome</li>
  <li><a href="https://unusualwhales.com/politics">Unusual Whales</a> - Tracks congressional trading, real data on the Access game</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="strategy" /><category term="investing" /><summary type="html"><![CDATA[Markets aren't one game with skill levels. They're multiple games on the same board. Here's a framework for figuring out which game you're in.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">I Built an AI System That Synthesizes Stock Intelligence</title><link href="https://brianhliou.com/posts/signal-rundown/" rel="alternate" type="text/html" title="I Built an AI System That Synthesizes Stock Intelligence" /><published>2026-03-09T00:00:00+00:00</published><updated>2026-03-09T00:00:00+00:00</updated><id>https://brianhliou.com/posts/signal-rundown</id><content type="html" xml:base="https://brianhliou.com/posts/signal-rundown/"><![CDATA[<p>Every analyst has an opinion on the market. Almost none of them show you the data behind it, update it as new information comes in, or tell you when they were wrong.</p>

<p>The problem with market analysis today isn’t a lack of information. It’s a lack of <strong>synthesis</strong>. There are dozens of sources covering any major stock: earnings reports, insider trades, analyst ratings, SEC filings, news articles. Nobody’s reading all of them. And the analysts who do form opinions rarely tie them to specific, timestamped, falsifiable claims.</p>

<p>I wanted to test something: if you point AI at the full firehose of financial news and force it to take structured positions backed by specific data, can it produce better intelligence than reading five articles? So I built a system to find out.</p>

<p><strong>Signal Rundown</strong> is an AI that reads dozens of financial sources every 2 hours, maintains explicit bullish/bearish/neutral positions on every stock it tracks, and surfaces the data behind each call. Insider trades, analyst ratings, earnings data, direction changes, all synthesized into what actually matters.</p>

<p><img src="/assets/projects/signal-rundown/landing.png" alt="Signal Rundown landing page" /></p>

<h2 id="what-it-actually-does">What It Actually Does</h2>

<p>Signal Rundown is not a chatbot and not a news aggregator. It’s a <strong>signal extraction and synthesis system</strong> that runs continuously.</p>

<p>Every 2 hours, the system pulls from dozens of sources (RSS feeds, Google News, financial data APIs) and filters aggressively, because most financial news is noise. The articles that survive get their full text extracted and run through LLMs that break each one down into structured signals: direction, key facts, sentiment, which companies are affected.</p>

<p>This matters because raw articles are ambiguous. A single earnings report might be bullish for the company, bearish for a competitor, and neutral for the sector. The extraction step forces a position.</p>

<p>Those signals accumulate into <strong>entity states</strong>: a rolling analytical position per stock that updates as new evidence arrives. Each entity state includes a direction, a headline thesis, the key data points driving the call, and what to watch next. When a major event approaches, like earnings, the system generates <strong>predictions</strong> with a specific direction, baseline price, and expected move range.</p>
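
<p>To make “structured signal” concrete, here’s a minimal sketch of what one extracted record could look like. The fields are my illustration of the idea, not Signal Rundown’s actual schema:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical shape of one extracted signal. Field names are
# illustrative, not Signal Rundown's actual schema.
@dataclass
class Signal:
    ticker: str                  # company the signal applies to
    direction: str               # "bullish" | "bearish" | "neutral"
    key_facts: list[str]         # the specific data points behind the call
    source_url: str
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# One article, opposite positions: the extraction step forces a direction
# per affected company instead of one vague summary.
signals = [
    Signal("NVDA", "bullish", ["Data center revenue up sharply"], "https://example.com/report"),
    Signal("AMD", "bearish", ["Losing share in AI accelerators"], "https://example.com/report"),
]
</code></pre></div></div>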

<p>The tech stack is Python, PostgreSQL, and LLMs. The whole system runs on a single Railway instance for about $30/month. Collection runs every 2 hours starting at 5 AM, entity states update on the same cadence, and the dashboard refreshes continuously.</p>

<h2 id="what-makes-it-different">What Makes It Different</h2>

<p>Most analysis is <strong>stateless</strong>. Someone publishes a take, it floats around for a day, and it’s forgotten. Nobody checks whether it was right.</p>

<p>Signal Rundown works more like a real analyst: it maintains views, updates them as evidence changes, and shows you exactly what data drove each position. Every entity has a living analytical state that accumulates evidence over time. When the system flips from bullish to bearish, that shift is logged with the specific signals that caused it.</p>

<p>The difference is <strong>synthesis at scale</strong>. Signal Rundown does this across dozens of stocks simultaneously, processes information in minutes instead of days, and never has a bad Monday. For each stock, you see not just “bullish” or “bearish” but the insider trades, analyst ratings, earnings data, and news signals that support the call.</p>

<p>The other key decision: <strong>predictions are first-class objects, not chat responses.</strong> Each prediction records a baseline price, an expected move range, and a target event. After the event, the system scores the prediction against the actual price. The track record is public and permanent.</p>
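
<p>A sketch of what a first-class prediction and its scoring rule might look like, built from the fields described above. Again, illustrative, not the production code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

# Sketch of a prediction as a first-class, scoreable record, using the
# fields described above. Illustrative only, not the system's code.
@dataclass
class Prediction:
    ticker: str
    direction: str            # "bullish" or "bearish"
    baseline_price: float     # price when the prediction was made
    expected_move_pct: float  # e.g. 5.0 means a roughly 5% expected move
    target_event: str         # e.g. "Q3 earnings"

def score(pred: Prediction, actual_price: float) -&gt; bool:
    """After the event: did the price move in the predicted direction?"""
    move = (actual_price - pred.baseline_price) / pred.baseline_price
    return move &gt; 0 if pred.direction == "bullish" else move &lt; 0
</code></pre></div></div>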

<h2 id="what-it-looks-like">What It Looks Like</h2>

<p>The system has been running in production for several weeks. Each entity page shows the AI’s full analytical state: direction, key data points, what to watch, insider activity, analyst consensus, and active predictions.</p>

<p><img src="/assets/projects/signal-rundown/entity-nvda.png" alt="NVDA entity page showing bearish position and full analysis" /></p>

<p>Here’s an example of the social card the system generates for NVDA:</p>

<p><img src="/assets/projects/signal-rundown/card-nvda.png" alt="NVDA social card" /></p>

<p>That’s not a summary of one article. That’s the synthesis of dozens of signals from multiple sources, updated every 2 hours. The system spotted the insider selling pattern, cross-referenced it with the margin data, and formed a view.</p>

<h2 id="whats-next">What’s Next</h2>

<p>Two things I’m focused on:</p>

<p><strong>Building the track record.</strong> The prediction system is live and scoring against real prices. As more events get scored, the accuracy data will speak for itself. I’ll publish results when there’s enough data to be meaningful.</p>

<p><strong>Building in public.</strong> The dashboard is live at <a href="https://signalrundown.com">signalrundown.com</a>. I’m posting the AI’s takes daily on <a href="https://x.com/signalrundown">@signalrundown</a> on Twitter and <a href="https://bsky.app/profile/signalrundown.bsky.social">Bluesky</a>.</p>

<p>If you want to follow along, check out <a href="https://signalrundown.com">signalrundown.com</a>.</p>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><summary type="html"><![CDATA[There's no shortage of market analysis. There's a shortage of synthesis. Signal Rundown scans dozens of financial sources every 2 hours and tells you what's actually happening with the stocks you follow.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Building a Model Serving API From Scratch</title><link href="https://brianhliou.com/posts/model-serving-api/" rel="alternate" type="text/html" title="Building a Model Serving API From Scratch" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://brianhliou.com/posts/model-serving-api</id><content type="html" xml:base="https://brianhliou.com/posts/model-serving-api/"><![CDATA[<p>I built a model serving API from scratch. Not because the world needs another inference server, but because I wanted to understand what happens between “send prompt” and “receive tokens.” The things ML system design interviews ask about: batching, backpressure, streaming, graceful degradation. I wanted hands-on experience so I could talk about them from building, not reading.</p>

<p>The result: a FastAPI server wrapping Ollama with a bounded request queue, SSE streaming, naive batching, 11 custom Prometheus metrics, and structured logging. It runs on a $7/month ARM server. I ran 8 structured experiments against it. The data revealed things I didn’t expect.</p>

<p><strong>Source:</strong> <a href="https://github.com/brianhliou/model-serving-api">github.com/brianhliou/model-serving-api</a></p>

<h2 id="the-system">The system</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────┐
│  Hetzner CAX21 (ARM64, 4 vCPU, 8GB RAM, $7/month)      │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌────────┐  ┌───────────┐ │
│  │  Caddy   │→ │ FastAPI  │→ │ Ollama │  │  Grafana  │ │
│  │ (TLS,    │  │ (queue,  │  │(llama  │  │  Alloy    │ │
│  │  proxy)  │  │  batch,  │  │ 3.2)   │  │(telemetry)│ │
│  │          │  │  metrics)│  │        │  │           │ │
│  └──────────┘  └──────────┘  └────────┘  └───────────┘ │
│       :443          :8000       :11434                   │
│                                                         │
│  Docker Compose, bridge network, internal DNS            │
└─────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Four containers on a single machine. Caddy terminates TLS with automatic Let’s Encrypt certificates (three lines of config). FastAPI handles the serving logic: bounded request queue, batch dispatcher, OpenAI-compatible API, Prometheus metrics. Ollama wraps llama.cpp and runs the model. Grafana Alloy scrapes metrics every 15 seconds and ships them to Grafana Cloud.</p>

<p>The containers communicate over a Docker bridge network using service names as hostnames. Caddy resolves <code class="language-plaintext highlighter-rouge">api</code>, FastAPI resolves <code class="language-plaintext highlighter-rouge">ollama</code>, Alloy resolves <code class="language-plaintext highlighter-rouge">api</code>. Sub-millisecond latency between containers because the traffic never leaves the host. Only Caddy exposes ports to the internet (80, 443). The FastAPI port binds to <code class="language-plaintext highlighter-rouge">127.0.0.1</code> only.</p>

<p>The core idea: a bounded request queue sits between clients and the model. When the queue is full, clients get an instant 503 with <code class="language-plaintext highlighter-rouge">Retry-After</code> instead of waiting indefinitely. This is <strong>backpressure</strong>: the serving layer’s most important job.</p>
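
<p>A minimal sketch of that contract in FastAPI, assuming a <code class="language-plaintext highlighter-rouge">QueueFullError</code> raised when admission fails (the real handler does more):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class QueueFullError(Exception):
    pass

@app.exception_handler(QueueFullError)
async def queue_full(request: Request, exc: QueueFullError):
    # Instant rejection with a hint: clients should back off and retry
    # instead of piling up behind a backend that can't keep pace.
    # The 5-second hint is an arbitrary illustration.
    return JSONResponse(
        status_code=503,
        content={"error": {"message": "Server overloaded, retry shortly"}},
        headers={"Retry-After": "5"},
    )
</code></pre></div></div>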

<p>The API is OpenAI-compatible (<code class="language-plaintext highlighter-rouge">/v1/chat/completions</code>), supports streaming via SSE, and exposes 11 custom Prometheus metrics (TTFT, tokens/sec, queue depth, error rates by type, backend latency).</p>

<h2 id="what-the-model-actually-does">What the model actually does</h2>

<p>The model running on this server is Llama 3.2 3B Instruct: 3.21 billion parameters, 28 transformer layers, 128K token context window. It was built through <strong>knowledge distillation</strong> from Meta’s larger Llama 3.1 8B and 70B models. The 3B model wasn’t trained from scratch. Instead, Meta pruned the 8B architecture down to 3B parameters, then trained the smaller model to match the output distributions of the larger ones. This is why a 3B model performs competitively with many 7B models.</p>

<h3 id="how-a-single-token-is-generated">How a single token is generated</h3>

<p>When you send “What is 2+2?” to the API, the model processes it through 28 identical transformer layers. Each layer does two things:</p>

<ol>
  <li>
    <p><strong>Attention</strong>: The model decides which parts of the input matter for predicting the next word. For each position, it computes query, key, and value vectors, calculates attention scores between all positions, and produces a weighted sum. Llama 3.2 uses <strong>Grouped Query Attention</strong> (GQA): 24 query heads share 8 key-value heads (a 3:1 ratio). This cuts the memory needed for cached attention data by 3x.</p>
  </li>
  <li>
    <p><strong>Feed-forward network</strong>: Each token’s representation passes through a gated network (SwiGLU) with three weight matrices: gate, up, and down projections. The gate controls information flow through element-wise multiplication. Each FFN layer has ~75 million parameters.</p>
  </li>
</ol>

<p>After all 28 layers, the model produces a probability distribution over 128,256 possible tokens. Temperature scaling adjusts how “random” the selection is (lower = more deterministic), and top-p sampling filters the candidate set. One token is drawn from this distribution.</p>
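
<p>Here’s that final sampling step as a compact numpy sketch. The five-token vocabulary is a toy stand-in for the real 128,256:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.9) -&gt; int:
    """One decode step: temperature scaling, then top-p (nucleus) sampling."""
    # Temperature: lower values sharpen the distribution (more deterministic).
    probs = np.exp(logits / temperature)
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass &gt;= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    # Renormalize over the survivors and draw one token.
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

# Toy 5-token vocabulary instead of Llama's 128,256.
print(sample_token(np.array([2.0, 1.5, 0.3, -1.0, -2.0])))
</code></pre></div></div>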

<p>This entire process repeats for every single output token, one at a time.</p>

<h3 id="two-phase-inference">Two-phase inference</h3>

<p>Token generation has two distinct phases with very different performance characteristics:</p>

<p><strong>Prefill</strong> (prompt processing): All input tokens are processed in parallel through the transformer. This is compute-bound: lots of matrix multiplications that can be parallelized across CPU cores. Speed: 50-150+ tokens/second.</p>

<p><strong>Decode</strong> (generation): Each output token is generated sequentially. The model must read its entire 2 GB of weights from memory to produce one token. This is memory-bandwidth-bound: the CPU can compute faster than it can load data. Speed: 7-8 tokens/second.</p>

<p>The decode bottleneck explains a key number in my experiments. To generate one token, the CPU reads ~2 GB of model weights from RAM. With the server’s DDR4 memory bandwidth, the theoretical ceiling is roughly 15-30 tokens/second. After overhead from the KV cache, dequantization, and non-sequential memory access, the practical rate is ~7.5 tok/s. This rate is nearly identical whether the system is idle or under heavy load.</p>
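
<p>The back-of-the-envelope version, where the bandwidth range is a rough assumption for DDR4 on this class of server rather than a measured value:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Decode speed ceiling: every output token streams the full quantized
# weights from RAM. The bandwidth range is a rough assumption for this
# class of DDR4 ARM server, not a measured number.
weights_gb = 2.0                      # Q4_K_M Llama 3.2 3B

for bandwidth_gb_s in (30, 60):       # plausible effective DDR4 bandwidth
    print(f"{bandwidth_gb_s} GB/s: {bandwidth_gb_s / weights_gb:.0f} tok/s ceiling")

# Measured rate: ~7.5 tok/s once KV cache reads, dequantization, and
# non-sequential access are paid for.
</code></pre></div></div>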

<h3 id="how-a-3b-model-fits-in-8gb-ram">How a 3B model fits in 8GB RAM</h3>

<p>The raw model weights in 16-bit precision would be 6.4 GB. That doesn’t fit. Ollama uses <strong>Q4_K_M quantization</strong>: weights are compressed from 16 bits to ~4.5 bits per parameter with block-wise quantization, where small groups of weights share scale values. Sensitive layers (attention output, FFN down projection) get 5-6 bits; less sensitive layers get 4 bits.</p>

<p>The memory budget on this server:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>RAM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Model weights (Q4_K_M)</td>
      <td>~2.0 GB</td>
    </tr>
    <tr>
      <td>KV cache (inference state)</td>
      <td>~0.5-1.0 GB</td>
    </tr>
    <tr>
      <td>FastAPI + Python runtime</td>
      <td>~100 MB</td>
    </tr>
    <tr>
      <td>Caddy + Alloy + Docker</td>
      <td>~200 MB</td>
    </tr>
    <tr>
      <td>OS + kernel</td>
      <td>~300 MB</td>
    </tr>
    <tr>
      <td><strong>Total active</strong></td>
      <td><strong>~3.1-3.6 GB</strong></td>
    </tr>
    <tr>
      <td>Page cache (remaining)</td>
      <td>~4.4-4.9 GB</td>
    </tr>
  </tbody>
</table>

<p>Comfortable margin. The KV cache stores attention keys and values from all previous tokens so the model doesn’t recompute them. Each token in the cache costs 112 KB across all 28 layers and 8 KV heads. At 2K context, that’s ~224 MB. At 8K, ~900 MB. At the model’s full 128K context, the KV cache alone would need ~14 GB, which is why CPU inference practically limits context length.</p>
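
<p>Those cache numbers fall straight out of the architecture specs. A quick check, assuming fp16 cache entries and a head dimension of 3072 / 24 = 128:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># KV cache cost per token for Llama 3.2 3B, from the specs quoted above.
# Assumes fp16 cache entries; head_dim = hidden_size / query_heads = 3072 / 24.
layers = 28
kv_heads = 8
head_dim = 128
bytes_per_entry = 2                              # fp16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry   # K and V
print(per_token / 1024)                          # 112.0 KB per token

for context in (2_048, 8_192, 131_072):          # 2K, 8K, full 128K
    print(f"{context:&gt;7} tokens: {context * per_token / 2**30:.1f} GB")
</code></pre></div></div>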

<h2 id="backpressure-works-but-the-math-is-brutal">Backpressure works, but the math is brutal</h2>

<p>I sent 100 simultaneous requests against a queue of 50:</p>

<table>
  <thead>
    <tr>
      <th>Outcome</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rejected instantly (503)</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Accepted, then timed out (504)</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Successful (200)</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Zero successful completions. Not one.</p>

<p>Ollama processes requests sequentially. Each request takes 2-3 seconds. 50 queued requests need 100-150 seconds to drain. The request timeout is 60 seconds. So by the time the server gets to request #26, the deadline has already passed.</p>

<p>The queue protects the system from crashing. Rejected clients get an instant response and can retry. But the queue doesn’t make the system faster. A queue of 50 with a sequential backend and 60s timeout means accepting work you can’t finish.</p>

<p>The correct formula: <code class="language-plaintext highlighter-rouge">max_queue_size = (timeout / avg_request_duration) * backend_concurrency</code>. For this system: <code class="language-plaintext highlighter-rouge">60s / 2.5s * 1 = 24</code>. My queue of 50 is too large.</p>

<h3 id="the-queue-isnt-really-a-queue">The “queue” isn’t really a queue</h3>

<p>Looking deeper at the implementation, <code class="language-plaintext highlighter-rouge">QueueManager</code> is not a FIFO queue. It’s a counter. There’s no <code class="language-plaintext highlighter-rouge">asyncio.Queue</code>, no waiting, no ordering. When <code class="language-plaintext highlighter-rouge">acquire()</code> is called, it checks if <code class="language-plaintext highlighter-rouge">active &gt;= max_size</code>. If yes, it immediately raises <code class="language-plaintext highlighter-rouge">QueueFullError</code>. If no, it increments the counter. That’s it. No mutex needed because asyncio is single-threaded.</p>

<p>This is actually a <strong>load shedder</strong>, not a queue. Requests are either admitted instantly or rejected instantly. The name “queue” is misleading. In the backpressure flood experiment, asyncio task scheduling, not arrival order, determined which requests got admitted. Request #0 (the first to arrive) was rejected while request #1 got in.</p>
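
<p>Reduced to essentials, the mechanism is a handful of lines. This is a sketch of the behavior described above, not the repo’s exact code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class QueueFullError(Exception):
    pass

class QueueManager:
    """A load shedder: admit instantly or reject instantly. No FIFO, no waiting."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.active = 0

    def acquire(self) -&gt; None:
        # Check-then-increment is safe without a lock: asyncio runs
        # coroutines on one thread, and there is no await between the two.
        if self.active &gt;= self.max_size:
            raise QueueFullError()
        self.active += 1

    def release(self) -&gt; None:
        self.active -= 1
</code></pre></div></div>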

<h3 id="503-rejection-isnt-fast-enough">503 rejection isn’t fast enough</h3>

<p>The 50 rejected requests averaged 0.87 seconds to get their 503 response. That’s nearly a full second to say “no.” For a fast-fail mechanism, that’s too slow.</p>

<p>The latency comes from the network stack: TLS handshake to the server, HTTP request parsing, response propagation back through Caddy. Under extreme load (100 simultaneous requests), the server’s event loop is contended. At concurrency 60 in another experiment, 503 rejections took only 0.73 seconds. The 140ms difference reflects the server being less overloaded.</p>

<h2 id="latency-doesnt-just-increase-it-cliffs">Latency doesn’t just increase. It cliffs.</h2>

<p>I swept concurrency from 1 to 60:</p>

<table>
  <thead>
    <tr>
      <th>Concurrency</th>
      <th>Avg Latency</th>
      <th>Success Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>7.4s</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>2</td>
      <td>4.3s</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>5</td>
      <td>10.0s</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>10</td>
      <td>18.7s</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>20</td>
      <td>42.9s</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>30</td>
      <td>20.9s</td>
      <td>27%</td>
    </tr>
    <tr>
      <td>50</td>
      <td>43.7s</td>
      <td>16%</td>
    </tr>
    <tr>
      <td>60</td>
      <td>39.2s</td>
      <td>20%</td>
    </tr>
  </tbody>
</table>

<p>The jump from 20 to 30 is the interesting part. Latency drops from 42.9s to 20.9s, but success rate craters from 100% to 27%.</p>

<p>At concurrency 20, all requests fit in the queue and all eventually complete, with the last ones barely making the 60s timeout. At 30, the requests that time out (73%) are removed from the average, leaving only the fast early ones that Ollama processed first. The average looks better, but the system is failing.</p>

<p><strong>Averages lie at the boundary.</strong> When requests start timing out, the surviving “successful” requests look artificially fast because they were the lucky ones processed first. You need success rate alongside latency, not one or the other.</p>

<h3 id="batch-tiers-are-visible-in-the-data">Batch tiers are visible in the data</h3>

<p>At concurrency 10, latencies form a clear bimodal distribution: 4 requests complete at ~10.6s, 6 at ~24.1s. These are two batch rounds. The batch dispatcher collects requests for up to 100ms or 8 requests, then fires them all concurrently via <code class="language-plaintext highlighter-rouge">asyncio.gather</code>. But Ollama processes them sequentially, so the first batch finishes, then the second batch starts.</p>

<p>At concurrency 15: trimodal (6 at ~14.6s, 8 at ~32.7s, 1 at ~35.0s). Three batch rounds. At concurrency 20: four tiers. The batch_size=8 configuration creates predictable staircase patterns in the latency distribution.</p>
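
<p>A stripped-down sketch of that collect-then-dispatch loop (the structure is illustrative, not the project’s actual dispatcher):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import asyncio

BATCH_WINDOW_S = 0.1   # collect for up to 100ms...
BATCH_SIZE = 8         # ...or until 8 requests have arrived

async def dispatcher(queue: asyncio.Queue, backend) -&gt; None:
    """Collect requests into batches, then fire each batch concurrently."""
    while True:
        batch = [await queue.get()]   # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + BATCH_WINDOW_S
        while len(batch) &lt; BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining &lt;= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # Fire the whole batch at once. With a sequential backend like
        # Ollama, this concurrency is an illusion: requests still finish
        # one at a time, which produces the latency tiers described above.
        await asyncio.gather(*(backend.generate(req) for req in batch))
</code></pre></div></div>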

<h3 id="concurrency-2-is-faster-than-concurrency-1">Concurrency 2 is faster than concurrency 1</h3>

<p>This was unexpected. The single request at concurrency 1 took 7.4s. At concurrency 2, the mean was 4.3s, with the faster request completing in 3.1s.</p>

<p>The explanation: concurrency 1 included a cold-start penalty (model loading, KV cache warmup). At concurrency 2, both requests arrive together, get batched, and share the warmup cost. Compare to later experiments where warm-model sequential requests took 2-3s. The 7.4s single request was paying a one-time tax.</p>

<h2 id="streaming-is-faster-than-non-streaming-under-load">Streaming is faster than non-streaming under load</h2>

<p>I expected streaming to add overhead from more HTTP chunks and I/O. Under no contention, that’s true: streaming (3.03s) is slightly slower than non-streaming (2.85s). The SSE framing and chunk processing add about 6% overhead.</p>

<p>At concurrency 5, the picture reverses:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Avg Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Non-streaming</td>
      <td>12.29s</td>
    </tr>
    <tr>
      <td>Streaming</td>
      <td>8.35s</td>
    </tr>
  </tbody>
</table>

<p>Streaming is 32% faster. The reason is in my implementation: non-streaming requests go through a batch dispatcher that collects requests for 100ms before dispatching as a group. Streaming requests bypass the batcher entirely, going directly to <code class="language-plaintext highlighter-rouge">backend.stream()</code>.</p>

<p>This was an honest finding about my own code. The batch dispatcher adds more latency than it saves because Ollama processes requests sequentially regardless. Batching only helps when the backend can exploit parallelism (like a GPU with continuous batching). With a sequential backend, it’s pure overhead.</p>

<p>The 100ms batch window is the problem. At solo concurrency, a single request waits up to 100ms for more requests that may never arrive. At high concurrency, the window fills quickly, but the backend can’t parallelize the batch anyway.</p>

<h2 id="time-to-first-token-degrades-10x-under-contention">Time to first token degrades 10x under contention</h2>

<p>The most dramatic finding. I measured TTFT (time to first token) for streaming requests:</p>

<table>
  <thead>
    <tr>
      <th>Condition</th>
      <th>Mean TTFT</th>
      <th>Min</th>
      <th>Max</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No contention</td>
      <td>0.87s</td>
      <td>0.53s</td>
      <td>0.93s</td>
    </tr>
    <tr>
      <td>Concurrency 5</td>
      <td>9.02s</td>
      <td>1.08s</td>
      <td>10.83s</td>
    </tr>
  </tbody>
</table>

<p>A 10x degradation from just 5 concurrent users.</p>

<p>TTFT measures how long until the client sees the first token. This maps directly to the two-phase inference described above. The 0.87s baseline TTFT is the prefill time: the model processes the prompt tokens through all 28 layers before it can start generating output. Under contention, requests queue behind each other at Ollama.</p>

<p>The concurrent TTFTs show a clear staircase pattern: 0.86s, 3.47s, 6.01s, 8.60s, 11.01s. Each step is approximately 2.5s apart, the time for Ollama to finish one request’s prefill and generation before starting the next. TTFT under sequential processing is essentially <code class="language-plaintext highlighter-rouge">queue_position * avg_request_duration</code>.</p>

<p>The sequential TTFT distribution (20 samples) is Gaussian centered on 0.886s with a standard deviation of just 15ms. Extremely consistent. The first request was an outlier at 0.53s because the model was already warm from a prior experiment.</p>

<p>TTFT is the metric that matters most for user experience. A user staring at a blank screen for 9 seconds will close the tab. This is why production systems use <strong>continuous batching</strong>: it allows the model to interleave generation across requests, keeping TTFT low even under load.</p>

<h2 id="token-generation-rate-is-rock-solid">Token generation rate is rock-solid</h2>

<p>Five sequential streaming requests, 100 tokens each:</p>

<table>
  <thead>
    <tr>
      <th>Run</th>
      <th>Tokens/sec</th>
      <th>Mean Inter-Token Interval</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>8.0</td>
      <td>125ms</td>
    </tr>
    <tr>
      <td>2</td>
      <td>7.6</td>
      <td>131ms</td>
    </tr>
    <tr>
      <td>3</td>
      <td>7.7</td>
      <td>129ms</td>
    </tr>
    <tr>
      <td>4</td>
      <td>7.9</td>
      <td>126ms</td>
    </tr>
    <tr>
      <td>5</td>
      <td>8.0</td>
      <td>126ms</td>
    </tr>
  </tbody>
</table>

<p>No degradation as output gets longer. Once Ollama starts generating, it produces tokens at a steady ~7.8 tok/s on ARM64.</p>

<h3 id="why-75-toks">Why 7.5 tok/s?</h3>

<p>The Hetzner CAX21 uses Ampere Altra processors (ARM Neoverse N1 cores) with DDR4 memory. Token generation is memory-bandwidth-bound: each token requires reading the entire model weights (~2 GB for Q4_K_M) from RAM. The arithmetic intensity is only ~3.2 FLOPs per byte of memory accessed, which puts decode squarely in the memory-bound regime of the roofline model.</p>

<p>llama.cpp (which Ollama wraps) uses ARM NEON SIMD instructions for the core computation: 128-bit wide vector operations that process 4 floats or 16 int8 values simultaneously. Hand-written kernels for each quantization format handle dequantization and multiply-accumulate in fused operations.</p>

<h3 id="inter-token-timing-isnt-perfectly-constant">Inter-token timing isn’t perfectly constant</h3>

<p>Looking at the raw chunk timestamps across 100 tokens, the inter-token interval ranges from 109ms to 163ms with a coefficient of variation of 11.2%. There are periodic spikes every 5-7 tokens where the interval jumps by 20-30ms, possibly from KV cache extension operations. One request showed a 206ms gap followed by a compensating 54ms interval, which looks like a garbage collection pause or memory operation.</p>

<h3 id="sustained-throughput-is-stable">Sustained throughput is stable</h3>

<p>A 2-minute sustained load test at concurrency 5: 56 requests, 990 tokens, 7.6 tok/s, stable the entire time. No memory leaks, no thermal throttling. The per-window latency (10s buckets) varied by only 0.55s standard deviation across the full run. The aggregate token rate was 96.5% of the isolated single-stream rate.</p>

<p>The bottleneck isn’t generation speed. It’s sequential processing. The model generates tokens fast enough; it just can’t serve multiple users at once.</p>

<h2 id="prompt-length-matters-more-than-expected">Prompt length matters more than expected</h2>

<table>
  <thead>
    <tr>
      <th>Prompt</th>
      <th>Avg Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Short (5 tokens)</td>
      <td>2.0s</td>
    </tr>
    <tr>
      <td>Long (~50 tokens)</td>
      <td>4.9s</td>
    </tr>
    <tr>
      <td>5-turn conversation</td>
      <td>5.9s</td>
    </tr>
    <tr>
      <td>10-turn conversation</td>
      <td>7.8s</td>
    </tr>
  </tbody>
</table>

<p>A 10-turn conversation takes nearly 4x longer than a short prompt, even with the same <code class="language-plaintext highlighter-rouge">max_tokens=30</code> output limit. The extra time is almost entirely prompt processing (the prefill phase). The model needs to process all input tokens through 28 layers of attention before generating the first output.</p>

<h3 id="the-kv-cache-explains-everything">The KV cache explains everything</h3>

<p>During prefill, the model computes attention keys and values for every input token and stores them in the KV cache. For subsequent output tokens during decode, it only computes attention for the new token against the cached keys and values. This is why prefill is compute-bound (matrix-matrix multiplication across all input tokens) while decode is memory-bandwidth-bound (matrix-vector for one token, but must read all cached KV entries).</p>

<p>Prefill attention complexity is O(n^2) where n is the prompt length. A 10-turn conversation with ~200 tokens of context requires 4x the prefill computation of a 5-turn conversation with ~100 tokens. Once the prefill is done, decode speed is nearly identical regardless of prompt length.</p>

<p>For chat applications, this means every request gets slower as conversations grow. Production systems deal with this through <strong>KV cache reuse</strong>: storing the cached attention state between turns so only the new user message needs prefill processing. Ollama doesn’t expose this across requests, so every request pays the full prefill cost from scratch.</p>

<h2 id="what-the-data-hid">What the data hid</h2>

<p>Beyond the headline findings, the raw experiment data revealed patterns I didn’t expect:</p>

<p><strong>Cold-start tax is 1.5-4.3x.</strong> The first request to each experiment was consistently slower. For short prompts: 2.88s first vs 1.6s warm (1.8x). For 10-turn prompts: 15.3s first vs 3.5s warm (4.3x). The penalty scales with prompt complexity because the initial request pays both model loading overhead and the full prefill cost without any cached state.</p>

<p><strong>Zero completions in the 100-request flood.</strong> Despite 50 queue slots, not a single request completed. The queue accepted 50 requests, but the serial backend couldn’t process any of them within the 60s timeout. The queue protects the system from crashing, but it accepted work that was mathematically impossible to finish.</p>

<p><strong>Only 2 out of 50 succeeded in the degradation test.</strong> Request #0 (8.98s) and request #40 (31.88s). The 22.9s gap between them aligns almost exactly with 2 batch processing rounds. The remaining 48 requests all timed out at ~60.7s.</p>

<p><strong>Token generation rate is identical across all modes.</strong> Solo streaming: 7.8 tok/s. Concurrent non-streaming: 7.6 tok/s aggregate. Solo sequential: ~7.7 tok/s. The Ollama backend generates tokens at a fixed rate regardless of how many requests are queued. All latency differences come from queuing and batching, not token generation.</p>

<h2 id="what-id-do-differently">What I’d do differently</h2>

<p><strong>Queue size:</strong> Set it to 20-25, not 50. With a sequential backend and 60s timeout, a queue of 50 means accepting requests you’ll never finish. The formula: <code class="language-plaintext highlighter-rouge">(timeout / request_duration) * concurrency = (60 / 2.5) * 1 = 24</code>.</p>

<p><strong>Batching:</strong> Skip it entirely for a sequential backend. The 100ms collection window adds latency with no benefit. Only enable it when the backend supports parallel processing.</p>

<p><strong>TTFT alerting:</strong> Set a Grafana alert on p95 TTFT &gt; 5s. That metric tells you users are having a bad experience earlier than total latency does.</p>

<p><strong>503 latency:</strong> Investigate why rejection takes 870ms. For a load shedder, the rejection path should be sub-10ms. The current latency is dominated by network overhead, but with connection pooling and HTTP keep-alive, it could be much faster.</p>

<p><strong>The backend:</strong> The most impactful improvement would be swapping Ollama for <code class="language-plaintext highlighter-rouge">llama-cpp-python</code> with continuous batching. That allows multiple requests to share the model simultaneously, keeping TTFT low under load. The <code class="language-plaintext highlighter-rouge">InferenceBackend</code> Protocol abstraction makes this a clean swap: implement <code class="language-plaintext highlighter-rouge">generate()</code>, <code class="language-plaintext highlighter-rouge">stream()</code>, and <code class="language-plaintext highlighter-rouge">health()</code>, and the serving logic stays unchanged.</p>
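
<p>The boundary looks roughly like this. The signatures are my guess at the shape from the method names above, not copied from the repo:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import AsyncIterator, Protocol

class InferenceBackend(Protocol):
    """Any class with these three methods can sit behind the serving layer."""
    async def generate(self, prompt: str, max_tokens: int) -&gt; str: ...
    def stream(self, prompt: str, max_tokens: int) -&gt; AsyncIterator[str]: ...
    async def health(self) -&gt; bool: ...

# A continuous-batching backend (e.g. llama-cpp-python) would be one new
# class satisfying this Protocol; queue, metrics, and API code stay put.
</code></pre></div></div>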

<h2 id="key-takeaways">Key takeaways</h2>

<ul>
  <li>
    <p>Backpressure protects the system, but queue size must match <code class="language-plaintext highlighter-rouge">(timeout / request_duration) * concurrency</code></p>
  </li>
  <li>
    <p>Latency averages lie at the boundary: when requests start timing out, the survivors look artificially fast</p>
  </li>
  <li>
    <p>Batching is not universally good: with a sequential backend, it’s pure overhead that adds 100ms to every request</p>
  </li>
  <li>
    <p>TTFT is the metric that matters most for UX, and it degrades linearly with queue position</p>
  </li>
  <li>
    <p>Token generation on ARM64 is memory-bandwidth-bound at ~7.5 tok/s, consistent across all load conditions</p>
  </li>
  <li>
    <p>Prompt length affects latency as much as output length: prefill is O(n^2) and grows with every conversation turn</p>
  </li>
  <li>
    <p>A 3B model fits comfortably on an 8GB server via 4-bit quantization (6.4 GB compressed to 2 GB)</p>
  </li>
  <li>
    <p>The most impactful improvement isn’t in the serving layer: it’s swapping a sequential backend for one with continuous batching</p>
  </li>
</ul>

<hr />

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://github.com/brianhliou/model-serving-api">model-serving-api</a> - Full source, experiment data, Grafana dashboard</li>
  <li><a href="https://ollama.com">Ollama</a> - Local model backend used in this project</li>
  <li><a href="https://fastapi.tiangolo.com">FastAPI</a> - Python async web framework</li>
  <li><a href="https://github.com/prometheus/client_python">prometheus-client</a> - Python Prometheus instrumentation</li>
  <li><a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> - C/C++ inference engine that Ollama wraps</li>
  <li><a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md">Llama 3.2 model card</a> - Model architecture and training details</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><summary type="html"><![CDATA[Production model serving layer wrapping Ollama with request batching, SSE streaming, backpressure, Prometheus metrics, and structured experiments revealing how latency, TTFT, and throughput behave under load.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">System Design: Concepts, Patterns, Technologies</title><link href="https://brianhliou.com/posts/system-design-study-guide/" rel="alternate" type="text/html" title="System Design: Concepts, Patterns, Technologies" /><published>2026-02-20T00:00:00+00:00</published><updated>2026-02-20T00:00:00+00:00</updated><id>https://brianhliou.com/posts/system-design-study-guide</id><content type="html" xml:base="https://brianhliou.com/posts/system-design-study-guide/"><![CDATA[<p>My reference for system design interviews. Three sections: core concepts (the building blocks), common patterns (recurring solutions), and key technologies (when to reach for what).</p>

<p>Scan the tables to refresh. Read the bold text for the key decisions and tradeoffs.</p>

<hr />

<h2 id="core-concepts">Core Concepts</h2>

<h3 id="networking-essentials">Networking Essentials</h3>

<table>
  <thead>
    <tr>
      <th>Topic</th>
      <th>Key Points</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TCP vs UDP</td>
      <td><strong>TCP</strong>: reliable, ordered, connection-oriented (HTTP, databases). <strong>UDP</strong>: fast, no guarantees (video streaming, DNS).</td>
    </tr>
    <tr>
      <td>HTTP/HTTPS</td>
      <td>Request-response over TCP. HTTP/2 adds multiplexing. HTTP/3 uses QUIC (UDP-based). TLS for encryption.</td>
    </tr>
    <tr>
      <td>WebSockets</td>
      <td>Persistent bidirectional connection over TCP. Use for real-time (chat, live updates). Initiated via HTTP upgrade.</td>
    </tr>
    <tr>
      <td>DNS</td>
      <td>Domain to IP resolution. TTL controls caching. Can route by geography for latency.</td>
    </tr>
    <tr>
      <td>Load Balancers</td>
      <td><strong>L4</strong> (transport): routes by IP/port, fast, no inspection. <strong>L7</strong> (application): routes by URL/headers, can do auth and rate limiting.</td>
    </tr>
  </tbody>
</table>

<p><strong>The key distinction in interviews is L4 vs L7.</strong> L4 is faster but dumb. L7 is slower but can make smart routing decisions based on content.</p>

<h3 id="api-design">API Design</h3>

<table>
  <thead>
    <tr>
      <th>Style</th>
      <th>Use When</th>
      <th>Tradeoffs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>REST</td>
      <td>Public APIs, CRUD, simple resource models</td>
      <td>Well-understood, cacheable. Overfetching/underfetching common.</td>
    </tr>
    <tr>
      <td>gRPC</td>
      <td>Service-to-service, low latency, streaming</td>
      <td>Binary protobuf, fast, typed contracts. Not browser-friendly without proxy.</td>
    </tr>
    <tr>
      <td>GraphQL</td>
      <td>Client-driven queries, mobile apps needing flexible data</td>
      <td>Single endpoint, no overfetching. Complexity on server, caching harder.</td>
    </tr>
  </tbody>
</table>

<p><strong>Key decisions</strong>: <strong>pagination</strong> (cursor-based for real-time data, offset for static), <strong>idempotency</strong> (POST with client-generated ID), <strong>versioning</strong> (URL path is simplest: <code class="language-plaintext highlighter-rouge">/v1/resource</code>).</p>
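
<p>A minimal sketch of why cursor pagination holds up when new rows keep arriving (in-memory rows stand in for a database):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import base64
import json

# Cursor pagination sketch: the cursor encodes where the last page ended,
# so rows inserted at the head don't shift later pages (unlike offsets).
def encode_cursor(last_id: int) -&gt; str:
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()

def list_items(rows: list[dict], cursor: str | None = None, limit: int = 20) -&gt; dict:
    after = json.loads(base64.urlsafe_b64decode(cursor))["after"] if cursor else 0
    page = [r for r in rows if r["id"] &gt; after][:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if page else None
    return {"items": page, "next_cursor": next_cursor}

# rows must be ordered by id for this to page correctly.
</code></pre></div></div>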

<h3 id="data-modeling">Data Modeling</h3>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Strengths</th>
      <th>Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Relational (PostgreSQL)</td>
      <td>ACID, joins, complex queries, strong consistency</td>
      <td>Schema rigidity, harder to shard</td>
    </tr>
    <tr>
      <td>Document (MongoDB)</td>
      <td>Flexible schema, nested data, horizontal scaling</td>
      <td>No joins, denormalized data can diverge</td>
    </tr>
    <tr>
      <td>Wide-column (Cassandra)</td>
      <td>Massive write throughput, time-series, multi-DC</td>
      <td>Limited query patterns, must design around partition key</td>
    </tr>
    <tr>
      <td>Key-value (Redis, DynamoDB)</td>
      <td>Sub-ms latency, simple access patterns</td>
      <td>No complex queries</td>
    </tr>
  </tbody>
</table>

<p><strong>Design your schema around how data is read, not how it’s logically organized.</strong> If reads far outnumber writes, denormalize. If you need joins, use relational.</p>

<h3 id="caching">Caching</h3>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>How It Works</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cache-aside</td>
      <td>App checks cache first, falls back to DB on miss, fills cache after</td>
      <td>General purpose, most common</td>
    </tr>
    <tr>
      <td>Write-through</td>
      <td>Write to cache and DB synchronously on every write</td>
      <td>Need cache and DB always in sync</td>
    </tr>
    <tr>
      <td>Write-back</td>
      <td>Write to cache only, async flush to DB</td>
      <td>Write-heavy, can tolerate data loss risk</td>
    </tr>
    <tr>
      <td>Read-through</td>
      <td>Cache itself fetches from DB on miss</td>
      <td>Simplify app logic, cache acts as proxy</td>
    </tr>
  </tbody>
</table>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cache-aside read path:
  App → Cache → HIT  → return
             → MISS → read DB → fill cache → return
</code></pre></div></div>

<p><strong>Eviction</strong>: LRU (most common), TTL (simplest to reason about), LFU (frequency-based, good for skewed access).</p>

<p><strong>Cache invalidation is the hard part.</strong> TTL is simplest: accept staleness up to N seconds. Event-driven invalidation (publish on write, subscribers evict) is more precise but more complex. When in doubt, start with TTL.</p>
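
<p>A minimal cache-aside sketch with TTL invalidation; the dict stands in for Redis and <code class="language-plaintext highlighter-rouge">db_read</code> is a placeholder for the real database call:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

TTL_SECONDS = 60
_cache = {}   # key -&gt; (value, expires_at); stands in for Redis

def db_read(key):
    return f"row-for-{key}"   # placeholder for the real database read

def get(key):
    hit = _cache.get(key)
    if hit and time.time() &lt; hit[1]:
        return hit[0]                                 # cache HIT
    value = db_read(key)                              # MISS or expired: hit the DB
    _cache[key] = (value, time.time() + TTL_SECONDS)  # fill cache on the way out
    return value
</code></pre></div></div>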

<h3 id="sharding">Sharding</h3>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>How It Works</th>
      <th>Tradeoffs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hash-based</td>
      <td>Hash the partition key, mod by shard count</td>
      <td>Even distribution, but range queries hit all shards</td>
    </tr>
    <tr>
      <td>Range-based</td>
      <td>Assign contiguous key ranges to shards</td>
      <td>Efficient range queries, but hot ranges cause imbalance</td>
    </tr>
  </tbody>
</table>

<p><strong>Choose a partition key with high cardinality and even distribution.</strong> Bad key: country (skewed). Good key: user ID (uniform). Cross-shard queries are expensive. Resharding is painful without consistent hashing.</p>
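
<p>A small sketch of hash-based routing, and of why naive modulo makes resharding painful (hypothetical user keys; going from 4 to 5 shards remaps most of them):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

def shard_for(key: str, num_shards: int) -&gt; int:
    # Stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user-{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # ~80% with naive modulo
</code></pre></div></div>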

<h3 id="replication">Replication</h3>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>How It Works</th>
      <th>Tradeoffs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Leader-Follower</td>
      <td>One leader handles writes, followers replicate and serve reads</td>
      <td>Simple, but leader is a bottleneck. Replication lag means stale reads.</td>
    </tr>
    <tr>
      <td>Multi-Leader</td>
      <td>Multiple nodes accept writes, replicate to each other</td>
      <td>Better write availability across regions. Conflict resolution is hard.</td>
    </tr>
    <tr>
      <td>Leaderless (Quorum)</td>
      <td>Read/write to multiple nodes. Consistency when W + R &gt; N.</td>
      <td>High availability, tunable consistency. More complex client logic.</td>
    </tr>
  </tbody>
</table>

<p><strong>Leader-follower is the default.</strong> Most SQL databases use it. Go multi-leader for multi-region writes. Go leaderless (Dynamo-style) when you need high availability and can handle eventual consistency.</p>
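
<p>The quorum condition as a toy check (not a client implementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def quorums_overlap(n: int, w: int, r: int) -&gt; bool:
    # W + R &gt; N forces the read and write quorums to share at least
    # one replica, so a read always sees the latest acknowledged write.
    return w + r &gt; n

assert quorums_overlap(3, 2, 2)        # common balanced N=3 setting
assert not quorums_overlap(3, 1, 1)    # fast, but only eventually consistent
</code></pre></div></div>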

<h3 id="consistent-hashing">Consistent Hashing</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hash Ring (linear view):

  0 ─── NA ─── NB ─── NC ─── ND ─── 0
         ↑   ↑         ↑
        k1  k2        k3

  Keys route to the next node clockwise:
  k1 → NA    k2 → NB    k3 → NC
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>Concept</th>
      <th>Detail</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hash ring</td>
      <td>Both nodes and keys are hashed to positions on a ring. Each key routes to the next node clockwise.</td>
    </tr>
    <tr>
      <td>Adding a node</td>
      <td>Only keys between the new node and its predecessor move. Minimal redistribution.</td>
    </tr>
    <tr>
      <td>Virtual nodes</td>
      <td>Each physical node maps to multiple ring positions. Improves balance.</td>
    </tr>
    <tr>
      <td>Used in</td>
      <td>Distributed caches, DynamoDB partitioning, Cassandra, CDN routing.</td>
    </tr>
  </tbody>
</table>

<p><strong>The point is minimal disruption.</strong> Adding or removing a node only affects its immediate neighbors, not a full reshuffle.</p>
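
<p>A minimal hash ring sketch with virtual nodes; the node names match the diagram, everything else is illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib

def _pos(s: str) -&gt; int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node takes `vnodes` ring positions to improve balance.
        self._ring = sorted((_pos(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -&gt; str:
        # Next node clockwise from the key's position, wrapping past the end.
        i = bisect.bisect(self._points, _pos(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["NA", "NB", "NC", "ND"])
print(ring.node_for("k1"))
</code></pre></div></div>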

<h3 id="cap-theorem">CAP Theorem</h3>

<table>
  <thead>
    <tr>
      <th>Choice</th>
      <th>Behavior During Partition</th>
      <th>Examples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>CP</strong> (Consistency)</td>
      <td>Reject requests rather than serve stale data</td>
      <td>ZooKeeper, HBase, etcd</td>
    </tr>
    <tr>
      <td><strong>AP</strong> (Availability)</td>
      <td>Serve requests, accept eventual consistency</td>
      <td>Cassandra, DynamoDB (default), CouchDB</td>
    </tr>
  </tbody>
</table>

<p><strong>CAP only forces a choice during network partitions.</strong> When the network is healthy, you get both C and A. Most production systems choose AP with tunable consistency: DynamoDB lets you choose strong or eventual per read, Cassandra lets you set consistency level per query.</p>

<h3 id="rate-limiting">Rate Limiting</h3>

<table>
  <thead>
    <tr>
      <th>Algorithm</th>
      <th>How It Works</th>
      <th>Tradeoffs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Token bucket</td>
      <td>Tokens added at fixed rate, each request costs one</td>
      <td>Allows bursts up to bucket size. Most common.</td>
    </tr>
    <tr>
      <td>Sliding window log</td>
      <td>Store timestamp of each request, count within window</td>
      <td>Precise, but high memory at scale</td>
    </tr>
    <tr>
      <td>Sliding window counter</td>
      <td>Weighted count from current + previous window</td>
      <td>Memory-efficient approximation. Good enough for most cases.</td>
    </tr>
    <tr>
      <td>Fixed window counter</td>
      <td>Count requests per fixed time window (e.g., per minute)</td>
      <td>Simplest. Spike at window boundaries (double rate across boundary).</td>
    </tr>
  </tbody>
</table>

<p><strong>Token bucket is the standard choice.</strong> It handles bursts naturally and is what most API gateways implement. Rate limit by IP for anonymous traffic, by API key or user ID for authenticated traffic.</p>
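
<p>A minimal in-process token bucket sketch; a real deployment would keep this state in Redis, keyed by IP or API key:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -&gt; bool:
        now = time.monotonic()
        # Lazy refill: add tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens &gt;= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)   # 10 req/s steady, bursts to 20
</code></pre></div></div>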

<h3 id="database-indexing">Database Indexing</h3>

<table>
  <thead>
    <tr>
      <th>Index Type</th>
      <th>Best For</th>
      <th>How It Works</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>B-tree</td>
      <td>Range queries, sorted access</td>
      <td>Balanced tree, O(log n). Default in most databases.</td>
    </tr>
    <tr>
      <td>Hash</td>
      <td>Exact-match lookups</td>
      <td>O(1) lookup. No range queries.</td>
    </tr>
    <tr>
      <td>Composite</td>
      <td>Multi-column queries</td>
      <td>Leftmost prefix rule: index on (a, b, c) supports (a), (a, b), (a, b, c).</td>
    </tr>
    <tr>
      <td>Covering</td>
      <td>Avoiding table lookups</td>
      <td>Index includes all columns the query needs, no row fetch required.</td>
    </tr>
  </tbody>
</table>

<p><strong>Indexes speed reads but slow writes.</strong> Every insert and update must update every relevant index. Don’t over-index. Start with the queries you need to optimize, add indexes for those.</p>
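
<p>The leftmost prefix rule is easy to verify with SQLite’s query planner (schema is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INT, b INT, c INT, payload TEXT)")
conn.execute("CREATE INDEX idx_abc ON events (a, b, c)")

queries = [
    "SELECT * FROM events WHERE a = 1",             # SEARCH ... USING INDEX idx_abc
    "SELECT * FROM events WHERE a = 1 AND b = 2",   # SEARCH ... USING INDEX idx_abc
    "SELECT * FROM events WHERE b = 2 AND c = 3",   # SCAN: no leftmost column (a)
]
for q in queries:
    print(conn.execute("EXPLAIN QUERY PLAN " + q).fetchone()[3])
</code></pre></div></div>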

<h3 id="numbers-to-know">Numbers to Know</h3>

<p><strong>Latency:</strong></p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 cache reference</td>
      <td>~1 ns</td>
    </tr>
    <tr>
      <td>L2 cache reference</td>
      <td>~4 ns</td>
    </tr>
    <tr>
      <td>RAM access</td>
      <td>~100 ns</td>
    </tr>
    <tr>
      <td>SSD random read</td>
      <td>~100 μs</td>
    </tr>
    <tr>
      <td>HDD seek</td>
      <td>~10 ms</td>
    </tr>
    <tr>
      <td>Same-datacenter round trip</td>
      <td>~0.5 ms</td>
    </tr>
    <tr>
      <td>Cross-continent round trip</td>
      <td>~150 ms</td>
    </tr>
  </tbody>
</table>

<p><strong>Throughput (order of magnitude):</strong></p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Ballpark</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Single web server</td>
      <td>~10K QPS</td>
    </tr>
    <tr>
      <td>Single SQL database</td>
      <td>~10K QPS</td>
    </tr>
    <tr>
      <td>Redis</td>
      <td>~100K QPS</td>
    </tr>
    <tr>
      <td>Kafka (per broker)</td>
      <td>~100K msgs/sec</td>
    </tr>
  </tbody>
</table>

<p><strong>Storage:</strong></p>

<table>
  <thead>
    <tr>
      <th>Unit</th>
      <th>Size</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1 tweet (text + metadata)</td>
      <td>~1 KB</td>
    </tr>
    <tr>
      <td>1 image (compressed)</td>
      <td>~200 KB</td>
    </tr>
    <tr>
      <td>1 minute of video (720p)</td>
      <td>~5 MB</td>
    </tr>
  </tbody>
</table>

<p><strong>Time conversions for estimation:</strong></p>

<table>
  <thead>
    <tr>
      <th>Period</th>
      <th>Seconds</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1 day</td>
      <td>~100K</td>
    </tr>
    <tr>
      <td>1 month</td>
      <td>~2.5M</td>
    </tr>
    <tr>
      <td>1 year</td>
      <td>~30M</td>
    </tr>
  </tbody>
</table>

<h3 id="back-of-envelope-estimation">Back-of-Envelope Estimation</h3>

<p>The goal is order of magnitude, not precision. 2x off is fine. 100x off means you picked the wrong architecture.</p>

<p><strong>Method:</strong></p>
<ol>
  <li>Start with users or requests (DAU, writes/day, reads/day)</li>
  <li>Estimate per-unit size or per-unit cost</li>
  <li>Multiply out: per second, per day, per year</li>
  <li>Round aggressively</li>
</ol>

<p><strong>Worked example: URL shortener storage</strong></p>

<ul>
  <li>100M new URLs per month</li>
  <li>Each record: short code (7 B) + long URL (~200 B) + metadata (~50 B) ≈ 250 B</li>
  <li>Monthly storage: 100M x 250 B = <strong>25 GB/month</strong></li>
  <li>5-year retention: 25 GB x 60 = <strong>1.5 TB total</strong></li>
  <li>Read QPS: 10:1 read/write → 1B reads/month → 1B / 2.5M sec ≈ <strong>400 QPS</strong></li>
</ul>

<p>1.5 TB fits on a single machine. 400 QPS is trivially handled. This tells you the bottleneck isn’t storage or throughput, it’s latency (caching helps) and availability (replication helps).</p>
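
<p>The same estimate in runnable form, which makes the rounding explicit:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>urls_per_month = 100e6
record_bytes = 250                                 # 7 B code + ~200 B URL + ~50 B metadata, rounded
monthly_gb = urls_per_month * record_bytes / 1e9   # ≈ 25 GB/month
total_tb = monthly_gb * 60 / 1e3                   # 5 years ≈ 1.5 TB
read_qps = urls_per_month * 10 / 2.5e6             # 10:1 reads ≈ 400 QPS
print(f"{monthly_gb:.0f} GB/mo, {total_tb:.1f} TB, {read_qps:.0f} QPS")
</code></pre></div></div>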

<hr />

<h2 id="common-patterns">Common Patterns</h2>

<h3 id="real-time-updates">Real-time Updates</h3>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>How It Works</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>WebSockets</td>
      <td>Persistent bidirectional connection</td>
      <td>Chat, collaborative editing, gaming</td>
    </tr>
    <tr>
      <td>Server-Sent Events</td>
      <td>Server pushes over HTTP, one-directional</td>
      <td>Live feeds, notifications, dashboards</td>
    </tr>
    <tr>
      <td>Long polling</td>
      <td>Client sends request, server holds until data available</td>
      <td>Fallback when WebSockets not supported</td>
    </tr>
  </tbody>
</table>

<p>For <strong>feeds and timelines</strong>, the key decision is fan-out strategy:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>How It Works</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Fan-out on write</td>
      <td>Push updates to all subscriber inboxes at write time</td>
      <td>Most users have small follower counts</td>
    </tr>
    <tr>
      <td>Fan-out on read</td>
      <td>Pull and merge updates at read time</td>
      <td>Some users have millions of followers (celebrities)</td>
    </tr>
    <tr>
      <td>Hybrid</td>
      <td>Fan-out on write for normal users, fan-out on read for high-follower users</td>
      <td>Twitter-scale systems</td>
    </tr>
  </tbody>
</table>

<h3 id="dealing-with-contention">Dealing with Contention</h3>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>How It Works</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Optimistic locking</td>
      <td>Read version, write with version check, retry on conflict</td>
      <td>Low contention (most writes succeed)</td>
    </tr>
    <tr>
      <td>Pessimistic locking</td>
      <td>Acquire lock before read-modify-write</td>
      <td>High contention (conflicts are expensive)</td>
    </tr>
    <tr>
      <td>CAS (Compare-and-Swap)</td>
      <td>Atomic conditional update at the DB level</td>
      <td>Counters, inventory, simple state transitions</td>
    </tr>
    <tr>
      <td>Queue writes</td>
      <td>Serialize concurrent writes through a queue</td>
      <td>Ordering matters, or writes need complex processing</td>
    </tr>
  </tbody>
</table>

<p><strong>Default to optimistic locking.</strong> Switch to pessimistic or queuing when conflict rates are high enough that retries become wasteful.</p>
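
<p>A minimal optimistic-locking sketch against SQLite; the retry loop is the pattern, the schema is illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, stock INT, version INT)")
conn.execute("INSERT INTO items VALUES (1, 10, 0)")

def decrement_stock(item_id: int, max_retries: int = 3) -&gt; bool:
    for _ in range(max_retries):
        stock, version = conn.execute(
            "SELECT stock, version FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        if stock == 0:
            return False
        # Write succeeds only if nobody bumped the version since our read.
        cur = conn.execute(
            "UPDATE items SET stock = ?, version = ? WHERE id = ? AND version = ?",
            (stock - 1, version + 1, item_id, version),
        )
        if cur.rowcount == 1:
            return True          # no conflict
        # Conflict: someone else wrote first. Re-read and retry.
    return False
</code></pre></div></div>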

<h3 id="multi-step-processes">Multi-step Processes</h3>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>How It Works</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Saga (choreography)</td>
      <td>Each service emits events, next service reacts</td>
      <td>Loosely coupled services, simple flows</td>
    </tr>
    <tr>
      <td>Saga (orchestration)</td>
      <td>Central coordinator directs each step</td>
      <td>Complex flows, need visibility into process state</td>
    </tr>
    <tr>
      <td>Two-phase commit</td>
      <td>Coordinator asks all to prepare, then commit/abort</td>
      <td>Strong consistency across services. Avoid if possible (slow, fragile).</td>
    </tr>
  </tbody>
</table>

<p><strong>Idempotency is the foundation.</strong> Every step must be safe to retry. Use client-generated UUIDs as idempotency keys so retries don’t create duplicates. Every compensating action (undo) must also be idempotent.</p>
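
<p>A minimal idempotency-key sketch (the dict stands in for a DB table; <code class="language-plaintext highlighter-rouge">charge</code> is a hypothetical operation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import uuid

_completed = {}   # idempotency key -&gt; stored result (a DB table in practice)

def charge(idempotency_key: str, amount_cents: int):
    if idempotency_key in _completed:
        return _completed[idempotency_key]   # retry: replay the stored result
    result = {"charged": amount_cents}       # placeholder for the real side effect
    _completed[idempotency_key] = result     # record atomically with the effect
    return result

key = str(uuid.uuid4())            # client-generated, reused on every retry
assert charge(key, 100) is charge(key, 100)   # retries don't double-charge
</code></pre></div></div>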

<h3 id="scaling-reads">Scaling Reads</h3>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>How It Works</th>
      <th>Tradeoff</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Read replicas</td>
      <td>Route reads to follower replicas</td>
      <td>Replication lag (stale reads)</td>
    </tr>
    <tr>
      <td>Caching</td>
      <td>Cache hot data in Redis or Memcached</td>
      <td>Invalidation complexity</td>
    </tr>
    <tr>
      <td>CDN</td>
      <td>Cache static/semi-static content at the edge</td>
      <td>Only for cacheable content</td>
    </tr>
    <tr>
      <td>Denormalization</td>
      <td>Pre-join data at write time</td>
      <td>Faster reads, harder writes</td>
    </tr>
    <tr>
      <td>Materialized views</td>
      <td>Precomputed query results, refreshed periodically</td>
      <td>Stale between refreshes</td>
    </tr>
  </tbody>
</table>

<h3 id="scaling-writes">Scaling Writes</h3>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>How It Works</th>
      <th>Tradeoff</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Sharding</td>
      <td>Partition data across nodes</td>
      <td>Cross-shard queries, resharding pain</td>
    </tr>
    <tr>
      <td>Write-ahead log</td>
      <td>Append-only log, apply changes asynchronously</td>
      <td>Fast sequential I/O and crash recovery, but reads lag until the log is applied</td>
    </tr>
    <tr>
      <td>Batching</td>
      <td>Buffer writes, flush in bulk</td>
      <td>Higher throughput, higher per-write latency</td>
    </tr>
    <tr>
      <td>Async processing</td>
      <td>Accept write into queue, process later</td>
      <td>Fast ack to client, eventual consistency</td>
    </tr>
    <tr>
      <td>Event sourcing</td>
      <td>Store events as source of truth, derive state</td>
      <td>Full audit trail, complex to query current state</td>
    </tr>
  </tbody>
</table>

<h3 id="handling-large-blobs">Handling Large Blobs</h3>

<table>
  <thead>
    <tr>
      <th>Concern</th>
      <th>Approach</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Storage</td>
      <td>Object store (S3, GCS). Never store blobs in your database.</td>
    </tr>
    <tr>
      <td>Uploads</td>
      <td><strong>Presigned URLs</strong> for direct client-to-S3 upload. Chunked uploads for large files (resumable).</td>
    </tr>
    <tr>
      <td>Serving</td>
      <td>CDN in front of object store. Signed URLs for access control.</td>
    </tr>
    <tr>
      <td>Processing</td>
      <td>Async pipeline triggered by upload event (thumbnails, transcoding, virus scan).</td>
    </tr>
  </tbody>
</table>
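
<p>A sketch of the presigned-upload flow with boto3 (bucket and key are placeholders; real use needs AWS credentials):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import boto3

s3 = boto3.client("s3")

# The server hands the client a short-lived URL; the client PUTs the file
# straight to S3, so large blobs never pass through application servers.
upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/photo.jpg"},   # placeholders
    ExpiresIn=900,   # 15 minutes
)
</code></pre></div></div>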

<h3 id="managing-long-running-tasks">Managing Long Running Tasks</h3>

<table>
  <thead>
    <tr>
      <th>Concern</th>
      <th>Approach</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Dispatch</td>
      <td>Message queue (SQS, Kafka) decouples producer from worker</td>
    </tr>
    <tr>
      <td>Execution</td>
      <td>Worker pool pulls from queue, processes independently</td>
    </tr>
    <tr>
      <td>Reliability</td>
      <td><strong>Checkpointing</strong> for progress. Idempotent retries on failure.</td>
    </tr>
    <tr>
      <td>Failure</td>
      <td>Dead letter queue for messages that repeatedly fail. Alert on DLQ depth.</td>
    </tr>
    <tr>
      <td>Visibility</td>
      <td>Status tracking in DB. Status endpoint for clients to poll.</td>
    </tr>
  </tbody>
</table>
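
<p>A minimal worker-loop sketch of the retry-then-dead-letter flow, with stdlib queues standing in for SQS/Kafka:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import queue

tasks = queue.Queue()          # stand-in for SQS/Kafka
dead_letters = queue.Queue()   # DLQ; alert when its depth grows
MAX_ATTEMPTS = 3

def process(body):
    ...   # must be idempotent: a retry may re-deliver the same message

def worker():
    while True:
        msg = tasks.get()                       # msg = {"body": ..., "attempts": 0}
        try:
            process(msg["body"])
        except Exception:
            msg["attempts"] += 1
            if msg["attempts"] &gt;= MAX_ATTEMPTS:
                dead_letters.put(msg)           # give up: park it for inspection
            else:
                tasks.put(msg)                  # re-queue for another attempt
        finally:
            tasks.task_done()
</code></pre></div></div>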

<hr />

<h2 id="key-technologies">Key Technologies</h2>

<table>
  <thead>
    <tr>
      <th>Technology</th>
      <th>What It Is</th>
      <th>Reach For It When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Redis</strong></td>
      <td>In-memory key-value store</td>
      <td>Caching, rate limiting, leaderboards, pub/sub, session storage. Sub-ms reads.</td>
    </tr>
    <tr>
      <td><strong>Elasticsearch</strong></td>
      <td>Distributed search engine</td>
      <td>Full-text search, log aggregation (ELK), faceted search, autocomplete.</td>
    </tr>
    <tr>
      <td><strong>Kafka</strong></td>
      <td>Distributed event streaming</td>
      <td>Event-driven architecture, decoupling services, high-throughput messaging, replay.</td>
    </tr>
    <tr>
      <td><strong>API Gateway</strong></td>
      <td>Reverse proxy at the edge</td>
      <td>Rate limiting, auth, routing, SSL termination, versioning.</td>
    </tr>
    <tr>
      <td><strong>Cassandra</strong></td>
      <td>Wide-column distributed DB</td>
      <td>High write throughput, time-series, multi-DC replication.</td>
    </tr>
    <tr>
      <td><strong>DynamoDB</strong></td>
      <td>Managed key-value / document DB</td>
      <td>Predictable single-digit ms latency at any scale, serverless backends.</td>
    </tr>
    <tr>
      <td><strong>PostgreSQL</strong></td>
      <td>Relational database</td>
      <td>ACID, complex queries, joins. <strong>Default choice until you outgrow it.</strong></td>
    </tr>
    <tr>
      <td><strong>Flink</strong></td>
      <td>Stream processing framework</td>
      <td>Real-time aggregations, windowed computations, complex event processing.</td>
    </tr>
    <tr>
      <td><strong>ZooKeeper</strong></td>
      <td>Distributed coordination</td>
      <td>Leader election, distributed locks, config management. Being replaced by etcd in newer systems.</td>
    </tr>
  </tbody>
</table>

<hr />

<!-- ## Question Breakdowns

TODO: Add question breakdown tables as studied. Format:

| Question | Key Components | Core Decisions |
|----------|---------------|----------------|
| Bit.ly | ... | ... |
| Dropbox | ... | ... |

-->

<hr />

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://www.hellointerview.com/learn/system-design/in-a-hurry/introduction">Hello Interview - System Design</a> - Structured course with question breakdowns</li>
  <li><a href="https://github.com/donnemartin/system-design-primer">System Design Primer</a> - Comprehensive open-source reference</li>
  <li><a href="https://dataintensive.net/">Designing Data-Intensive Applications</a> - The foundational book on distributed systems</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="system-design" /><category term="distributed-systems" /><summary type="html"><![CDATA[A scannable reference for system design interviews. Core concepts, common patterns, and key technologies with the tradeoffs that actually matter.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">aide</title><link href="https://brianhliou.com/posts/aide/" rel="alternate" type="text/html" title="aide" /><published>2026-02-11T00:00:00+00:00</published><updated>2026-02-11T00:00:00+00:00</updated><id>https://brianhliou.com/posts/aide</id><content type="html" xml:base="https://brianhliou.com/posts/aide/"><![CDATA[<p>The <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">METR study</a> found developers believe AI makes them 20% faster. Measured: 19% slower. I wanted data on my own usage, so I built a dashboard.</p>

<p><strong>Source:</strong> <a href="https://github.com/brianhliou/aide">github.com/brianhliou/aide</a></p>

<p><img src="/assets/projects/aide/thumbnail.png" alt="aide dashboard overview" style="max-width: 100%; display: block; margin: 20px auto;" /></p>

<h2 id="what-is-this">What Is This</h2>

<p>aide ingests Claude Code’s session logs (JSONL) into SQLite and shows long-term trends across all your projects: cost, token usage, session patterns, efficiency metrics. The “Fitbit for AI coding.”</p>

<p>Zero LLM calls. Zero cost to run. All data stays local.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.claude/projects/**/*.jsonl → parser → SQLite → dashboard
</code></pre></div></div>

<h2 id="the-problem">The Problem</h2>

<p>Claude Code generates detailed session logs for every interaction: messages, tool calls, token counts, timestamps. These logs are JSONL files buried in <code class="language-plaintext highlighter-rouge">~/.claude/projects/</code>. Nobody looks at them.</p>

<p>Everyone has opinions about whether AI coding tools are worth it. Nobody has data. Am I getting more efficient over time? Which projects eat the most tokens? Do longer prompts produce better results?</p>

<p>Nothing shows <em>personal trends across all your sessions</em>, the view that tells you if you’re improving.</p>

<h2 id="how-it-works">How It Works</h2>

<h3 id="data-pipeline">Data Pipeline</h3>

<ol>
  <li><strong>Discovery</strong> - finds all <code class="language-plaintext highlighter-rouge">*.jsonl</code> files under <code class="language-plaintext highlighter-rouge">~/.claude/projects/</code></li>
  <li><strong>Parsing</strong> - extracts messages, token usage, tool calls, session metadata</li>
  <li><strong>Work Blocks</strong> - splits each session into continuous coding periods at 30-minute idle gaps</li>
  <li><strong>Ingestion</strong> - upserts into SQLite with incremental ingest (tracks file mtime, only re-processes changed files)</li>
</ol>

<h3 id="cost-estimation">Cost Estimation</h3>

<p>All costs estimated at current API rates. For subscription users (Pro/Max), a toggle shows token-based metrics instead of dollar amounts.</p>

<h2 id="key-features">Key Features</h2>

<h3 id="overview-dashboard">Overview Dashboard</h3>

<p>Summary cards, effectiveness metrics (cache hit rate, edit ratio, compaction rate, error rate), trend charts, work blocks per week.</p>

<p><img src="/assets/projects/aide/overview-full.png" alt="Overview page with effectiveness metrics and trend charts" style="max-width: 100%; display: block; margin: 20px auto;" /></p>

<h3 id="session-detail">Session Detail</h3>

<p>Drill into any session: token breakdown, tool usage, files touched with read/edit/write counts, work block timeline, error categorization.</p>

<p><img src="/assets/projects/aide/session-detail.png" alt="Session detail showing tokens, tools, and files" style="max-width: 100%; display: block; margin: 20px auto;" /></p>

<h3 id="insights">Insights</h3>

<p>First-prompt effectiveness, cost concentration, time patterns (when you code), model usage, tool sequences, thinking block analysis.</p>

<p><img src="/assets/projects/aide/insights.png" alt="Insights page with first-prompt analysis and time patterns" style="max-width: 100%; display: block; margin: 20px auto;" /></p>

<h3 id="more">More</h3>

<ul>
  <li><strong>Session Autopsy</strong> - <code class="language-plaintext highlighter-rouge">aide autopsy &lt;session-id&gt;</code> generates a per-session diagnostic report: cost breakdown, context window analysis, compaction detection, CLAUDE.md improvement suggestions.</li>
  <li><strong>Subscription Mode</strong> - toggle between API cost view and token-based metrics for Pro/Max subscribers.</li>
  <li><strong>CLI Stats</strong> - <code class="language-plaintext highlighter-rouge">aide stats</code> prints a quick summary to the terminal without opening the browser.</li>
</ul>

<h2 id="technical-details">Technical Details</h2>

<h3 id="work-blocks">Work Blocks</h3>

<p>A JSONL “session” is just one terminal window staying open. A session spanning Mon-Wed with sleep in between reads as a 48-hour session, making “duration” useless.</p>

<p>The gap distribution between messages is bimodal: most gaps are under 5 minutes (active work), a small cluster is over 30 minutes (away). A 30-minute threshold cleanly separates the two modes.</p>

<p>Each session splits into work blocks, continuous coding periods. “119 work blocks across 35 sessions” tells you more than “35 sessions.”</p>
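
<p>A sketch of that split (not the exact aide implementation), assuming a session’s message timestamps are already parsed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import timedelta

IDLE_GAP = timedelta(minutes=30)

def work_blocks(timestamps):
    # Sort the session's message timestamps, then cut a new block
    # whenever the gap to the previous message exceeds 30 minutes.
    blocks = []
    for ts in sorted(timestamps):
        if blocks and ts - blocks[-1][-1] &lt;= IDLE_GAP:
            blocks[-1].append(ts)
        else:
            blocks.append([ts])
    return blocks
</code></pre></div></div>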

<h3 id="error-categorization">Error Categorization</h3>

<p>Tool errors are categorized automatically: Test (pytest, jest), Lint (ruff, eslint), Build (pip, npm), Git, Edit Mismatch, File Access. Most “errors” are normal iteration (test failures during edit-test-fix cycles). The dashboard separates iteration from actual mistakes.</p>

<h3 id="effectiveness-metrics">Effectiveness Metrics</h3>

<ul>
  <li><strong>Cache Hit Rate</strong> - % of input context served from cache (higher = better reuse)</li>
  <li><strong>Edit Ratio</strong> - % of tool calls that are file edits (higher = more productive)</li>
  <li><strong>Compaction Rate</strong> - % of sessions hitting context limits</li>
  <li><strong>Read-to-Edit Ratio</strong> - reads per edit (lower = less searching)</li>
  <li><strong>Iteration Rate</strong> - sessions with files edited 3+ times</li>
</ul>

<h2 id="what-i-learned">What I Learned</h2>

<p><strong>Claude Code logs are a gold mine.</strong> Every tool call, every token count, every timestamp is there. The hard part was deciding which metrics actually matter.</p>

<p><strong>Work blocks changed everything.</strong> Raw session duration was misleading for every chart. Splitting at idle gaps made the data honest. Data cleaning matters more than fancy visualizations.</p>

<p><strong>Zero LLM calls was the right constraint.</strong> Every metric is heuristic. No API calls, no marginal cost. Re-ingest and rebuild as many times as you want.</p>

<h2 id="tech-stack">Tech Stack</h2>

<ul>
  <li><strong>Python 3.12+</strong> with Click CLI</li>
  <li><strong>Flask</strong> + Jinja2 templates</li>
  <li><strong>Chart.js</strong> (CDN) for interactive charts</li>
  <li><strong>Tailwind CSS</strong> (CDN) for styling</li>
  <li><strong>SQLite</strong> (stdlib) for storage</li>
  <li><strong>uv</strong> for package management</li>
</ul>

<h2 id="try-it-out">Try It Out</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>aide-dashboard
aide ingest          <span class="c"># Parse your Claude Code logs</span>
aide serve           <span class="c"># Open dashboard at localhost:8787</span>
</code></pre></div></div>

<p>Requires Claude Code session logs at <code class="language-plaintext highlighter-rouge">~/.claude/projects/</code>. If you use Claude Code, you already have them.</p>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><summary type="html"><![CDATA[AI developer effectiveness dashboard. Ingests Claude Code session logs into SQLite and serves local analytics with cost trends, tool usage, and efficiency metrics.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Play Power Law Games</title><link href="https://brianhliou.com/posts/play-power-law-games/" rel="alternate" type="text/html" title="Play Power Law Games" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://brianhliou.com/posts/play-power-law-games</id><content type="html" xml:base="https://brianhliou.com/posts/play-power-law-games/"><![CDATA[<p>Some games have capped outcomes. You can play perfectly and still only win a predictable amount. Other games have uncapped outcomes, where a single result can dwarf everything else combined. Most people spend their entire lives playing the first kind without realizing the second kind exists.</p>

<p>This distinction comes down to two distributions that govern almost everything: <strong>normal distributions</strong> and <strong>power laws</strong>.</p>

<h2 id="the-two-distributions">The Two Distributions</h2>

<p>In a <strong>normal distribution</strong>, outcomes cluster around an average. Extreme results are rare and bounded. Height is normally distributed: most people are close to average, and no one is five times taller than anyone else. If you measure more, your average converges and stabilizes.</p>

<p>In a <strong>power law</strong>, there is no meaningful average. A small number of outcomes account for the vast majority of the total. Income follows a power law: some people earn 10x, 100x, or 1,000x more than others. If you measure more, the average keeps shifting because one new outlier can dominate everything you’ve measured so far.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Normal Distribution</th>
      <th>Power Law</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Outcomes</td>
      <td>Cluster around an average</td>
      <td>Span orders of magnitude</td>
    </tr>
    <tr>
      <td>Extremes</td>
      <td>Rare, bounded</td>
      <td>Rare, but unbounded</td>
    </tr>
    <tr>
      <td>The average</td>
      <td>Useful predictor</td>
      <td>Misleading (dominated by outliers)</td>
    </tr>
    <tr>
      <td>More data</td>
      <td>Average stabilizes</td>
      <td>Average keeps shifting</td>
    </tr>
    <tr>
      <td>Winning strategy</td>
      <td>Optimize consistency</td>
      <td>Maximize persistence</td>
    </tr>
  </tbody>
</table>
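
<p>The “more data” row is easy to see in a few lines of stdlib simulation (the Pareto shape 1.1 is an arbitrary heavy-tailed choice):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

random.seed(0)
for n in (100, 10_000, 1_000_000):
    normal = [random.gauss(100, 15) for _ in range(n)]
    heavy = [random.paretovariate(1.1) for _ in range(n)]   # heavy-tailed
    print(n, round(sum(normal) / n, 1), round(sum(heavy) / n, 1))
# The normal mean pins itself near 100 almost immediately. The Pareto mean
# wanders across runs and sample sizes: one new outlier can move it.
</code></pre></div></div>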

<h2 id="where-you-see-each">Where You See Each</h2>

<p>Normal distribution games are everywhere. Working an hourly job, ranking up in a competitive video game, filling restaurant tables night after night. The outcomes are proportional to effort. You grind, you get a predictable return. There’s a ceiling.</p>

<p>Power law games look completely different. Venture capital: Y Combinator calculated that 75% of their returns came from just 2 out of 280 startups they funded. Publishing: most books flop, but one bet on a story about a boy wizard turned Bloomsbury into a global brand. YouTube: less than 4% of videos reach 10,000 views, but those videos account for over 93% of all views.</p>

<p>The pattern repeats: most attempts produce little, but the rare wins are so large they make everything else irrelevant.</p>

<h2 id="why-power-laws-exist">Why Power Laws Exist</h2>

<p>Power laws emerge when small causes can cascade into massive effects. In physics, this happens at <strong>critical points</strong>, where systems become maximally unstable and a tiny perturbation can ripple through the entire system.</p>

<p>Forest fires work this way. Most lightning strikes burn a few trees. But when the forest is dense enough, one identical lightning strike can trigger a fire that burns across an entire state. The cause is the same. The outcome is not.</p>

<p>The 1988 Yellowstone fire burned 1.4 million acres, 50 times more than all fires over the previous 15 years combined. There was nothing special about the spark. The forest was simply in a critical state.</p>

<p>Earthquakes follow the same pattern. The physical process behind a tiny tremor you can’t feel and a catastrophic quake that levels a city is identical. The difference is whether the stress cascades along the fault line or dissipates locally. You can’t predict which one it will be. The system is inherently unpredictable at the critical point.</p>

<p>The same dynamics show up in networks. Barabasi found that the internet follows a power law: a few sites have thousands of times more connections than most. New nodes are more likely to connect to well-connected nodes, creating a snowball effect. This <strong>preferential attachment</strong> is why early advantages compound: the more connected you are, the more connections you attract.</p>

<h2 id="the-decision">The Decision</h2>

<p>If you’re playing a normal distribution game, consistency wins. Show up every day, optimize the small things, grind out incremental improvements. The returns are proportional and predictable. There’s nothing wrong with this, but the ceiling is real.</p>

<p>If you’re playing a power law game, persistence wins. Most of your bets will produce nothing. That’s not failure, that’s the expected distribution. The strategy is to keep making intelligent bets, because you can’t know in advance which one will be the outlier. You only need one.</p>

<p>The costly mistake is spending all your time on normal distribution games, optimizing for a 10% raise or a slightly better ranking, while ignoring power law games where a single outcome could change everything. The opportunity cost is invisible because you never see the power law game you didn’t play.</p>

<p>The even costlier mistake is suppressing small fires. The US Forest Service spent a century trying to prevent all fires, which just made the forest denser and the eventual megafires more catastrophic. In your own life, avoiding all risk and variability doesn’t eliminate the power law. It just ensures that when the big event comes, you’re unprepared, or worse, you’re never in the game at all.</p>

<h2 id="playing-power-law-games">Playing Power Law Games</h2>

<p>Knowing the theory is one thing. Actually shifting how you think and act is another. Here’s what changes when you start treating life as a power law game.</p>

<h3 id="input-vs-outcome">Input vs. Outcome</h3>

<p>Normal thinking: output is proportional to input. Work 10% harder, get a 10% raise. Study four hours instead of two, get a better grade. The goal is efficiency: best return per hour.</p>

<p>Power law thinking: outcomes are non-linear. 99% of efforts yield nothing. 1% yield 1,000x. Working harder on the wrong thing is useless. Finding the right thing is the only thing that matters. The goal is optionality: maximize exposure to positive outliers.</p>

<p>The fear shifts too. Normal players fear wasting time on something that doesn’t pay immediately. Power law players fear missing the magnitude, being steady but capped.</p>

<h3 id="time-vs-equity">Time vs. Equity</h3>

<p>Normal path: sell time for money. Career ladder. Junior to senior to manager. The variance is low. You won’t make $10M next Tuesday, but you won’t make $0 either.</p>

<p>Power law path: seek leverage. Code, media, capital. These work for you while you sleep. A salary only works when you’re awake.</p>

<p>The <strong>barbell strategy</strong>: keep a boring, low-effort income source to survive (capped downside), while aggressively pursuing high-variance projects (uncapped upside). Prefer equity, royalties, or products over salary.</p>

<p>Instead of consulting for $200/hr, build a SaaS tool that might fail completely or scale to $20k/month with zero marginal cost.</p>

<h3 id="networks">Networks</h3>

<p>Normal strategy: maintain a stable circle of similar peers. Coworkers, local friends. Comfortable, predictable, low new information.</p>

<p>Power law strategy: send cold signals. Emails, DMs, published work. One introduction to a super-connector is worth more than 1,000 coffees with peers. Publish your work publicly, because the internet has fat tails. Bill Gates might see your blog post, but he will never see your internal memo.</p>

<p>Casper, the researcher from the Veritasium video, read a line in a book: “One idea could transform your entire life.” He wrote underneath it: “Send an email to Veritasium.” Four weeks of silence. Then a reply that changed his career. Same mechanism as the lightning strike. The cause was small. The outcome was not.</p>

<h3 id="skills">Skills</h3>

<p>Normal path: deep specialization in a predefined niche. “I am a tax accountant for mid-sized retail firms.” The risk is obsolescence. If the niche shrinks, your value drops.</p>

<p>Power law path: <strong>talent stacking</strong>. Combine 2-3 skills that don’t usually go together. Being top 1% in one skill is brutally competitive (normal distribution competition). Being top 25% in three different things and combining them creates a monopoly of one (power law value).</p>

<h3 id="failure">Failure</h3>

<p>This is the most critical shift. Normal players view failure as a net loss of resources and social standing. “I wasted six months on that.” Power law players view failure as the cost of discovery. They expect 9 out of 10 projects to fail.</p>

<p>The VC approach to life: don’t try to ensure every Saturday night is “pretty good.” Have five terrible weekends exploring weird hobbies to find the one passion that defines the next decade.</p>

<h2 id="the-algorithm">The Algorithm</h2>

<p><strong>Cap the downside.</strong> Ensure you can’t go to zero, financially or socially. Keep a safety net.</p>

<p><strong>Maximize shots on goal.</strong> Take as many small, intelligent risks as possible. Write more posts, launch more tiny projects, send more cold emails.</p>

<p><strong>Cut losers fast.</strong> If something shows linear or diminishing returns, kill it. Don’t fall for the sunk cost fallacy.</p>

<p><strong>Let winners run.</strong> When something starts working exponentially, drop everything else and double down.</p>

<p>The phrase that stuck with me: <strong>be persistent, not consistent</strong>. In a power law world, the person who makes 100 smart bets and fails 99 times will outperform the person who made one safe bet and succeeded.</p>

<h2 id="what-you-learned">What You Learned</h2>

<p>✓ Normal distributions reward consistency; power laws reward persistence<br />
✓ Power laws emerge from critical systems where small causes cascade into massive effects<br />
✓ Seek leverage (code, media, capital) over trading time for money<br />
✓ Stack skills instead of hyper-specializing; top 25% in three things beats top 1% in one<br />
✓ Failure is the cost of discovery, not a loss. Expect most bets to fail.<br />
✓ The algorithm: cap downside, maximize shots, cut losers fast, let winners run</p>

<hr />

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://www.youtube.com/watch?v=HBluLfX2F_k">Veritasium: You’ve (Likely) Been Playing The Game of Life Wrong</a> - The video that inspired this post</li>
  <li><a href="https://en.wikipedia.org/wiki/Antifragile_(book)">Antifragile (Wikipedia)</a> - The barbell strategy and convexity in depth</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="strategy" /><category term="thinking" /><summary type="html"><![CDATA[Most people spend their lives optimizing normal distribution games where outcomes are predictable and capped. The real leverage comes from identifying and playing power law games where one win can outweigh everything else combined.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">7 Principles for Staying Effective</title><link href="https://brianhliou.com/posts/7-principles-for-staying-effective/" rel="alternate" type="text/html" title="7 Principles for Staying Effective" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://brianhliou.com/posts/7-principles-for-staying-effective</id><content type="html" xml:base="https://brianhliou.com/posts/7-principles-for-staying-effective/"><![CDATA[<p>This is the operating system I return to when things get noisy. Seven principles that compound over time.</p>

<h2 id="1-speed-over-perfection">1. Speed Over Perfection</h2>

<p>Perfectionism is fear disguised as refinement. At scale, it destroys momentum.</p>

<ul>
  <li>Delay compounds into lost opportunity.</li>
  <li>Ship at 90% readiness; the last 10% rarely changes outcomes.</li>
  <li>Treat every feedback cycle as superior to every polish cycle.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> Progress compounds. Perfection stalls.</p>
</blockquote>

<h2 id="2-outcome-neutrality">2. Outcome Neutrality</h2>

<p>Attachment distorts perception. Each result is only data.</p>

<ul>
  <li>Observe outcomes like a scientist, without identity in the result.</li>
  <li>Define metrics, trust them, and extract insight fast.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> The only real outcome is learning.</p>
</blockquote>

<h2 id="3-compounding-through-micro-iteration">3. Compounding Through Micro-Iteration</h2>

<p>Small, consistent optimizations appear trivial until they are not.</p>

<ul>
  <li>Refine one variable daily. Log changes.</li>
  <li>Celebrate 1% improvements as system upgrades.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> Tiny adjustments, applied relentlessly, create transformations.</p>
</blockquote>

<h2 id="4-systems-over-tactics">4. Systems Over Tactics</h2>

<p>Tactics are brittle. Systems create durability.</p>

<ul>
  <li>Ask: “Is this repeatable, automatable, and improvable?”</li>
  <li>Design mechanisms that run without constant input.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> I build engines. Outputs are side effects.</p>
</blockquote>

<h2 id="5-radical-responsibility">5. Radical Responsibility</h2>

<p>Every dependency reduces leverage.</p>

<ul>
  <li>Audit dependencies: emotional, technical, financial.</li>
  <li>Replace reliance with ownership wherever possible.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> No rescuer is coming. I own the system end to end.</p>
</blockquote>

<h2 id="6-leveraged-courage">6. Leveraged Courage</h2>

<p>Discomfort marks the edge of growth.</p>

<ul>
  <li>Seek controlled discomfort regularly.</li>
  <li>Treat fear signals as coordinates for expansion.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> Discomfort is data. Courage is leverage.</p>
</blockquote>

<h2 id="7-strategic-patience-tactical-urgency">7. Strategic Patience, Tactical Urgency</h2>

<p>Vision must be long; execution must be immediate.</p>

<ul>
  <li>Think in decades. Operate in days.</li>
  <li>Hold long arcs while acting with daily precision.</li>
</ul>

<blockquote>
  <p><strong>Internalize:</strong> Patient for outcomes, ruthless for actions.</p>
</blockquote>

<hr />

<h2 id="summary-protocol">Summary Protocol</h2>

<p>When momentum wavers, return here:</p>

<ul>
  <li><strong>Speed</strong> → act fast.</li>
  <li><strong>Neutrality</strong> → treat results as data.</li>
  <li><strong>Compounding</strong> → improve 1%.</li>
  <li><strong>Systems</strong> → design for repeatability.</li>
  <li><strong>Responsibility</strong> → own everything.</li>
  <li><strong>Courage</strong> → move toward fear.</li>
  <li><strong>Patience</strong> → hold the long game.</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="productivity" /><category term="mindset" /><summary type="html"><![CDATA[A mental framework for high performance: speed over perfection, outcome neutrality, compounding through micro-iteration, and more.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Solving Gobblet Gobblers: Building a 20-Million Position Tablebase</title><link href="https://brianhliou.com/posts/gobblet-gobblers/" rel="alternate" type="text/html" title="Solving Gobblet Gobblers: Building a 20-Million Position Tablebase" /><published>2025-12-23T00:00:00+00:00</published><updated>2025-12-23T00:00:00+00:00</updated><id>https://brianhliou.com/posts/gobblet-gobblers</id><content type="html" xml:base="https://brianhliou.com/posts/gobblet-gobblers/"><![CDATA[<p>Gobblet Gobblers is a tic-tac-toe variant with a stacking mechanic, marketed as a children’s game (ages 5+). I solved the game, built a 20-million position tablebase, and deployed a web UI for exploring optimal play.</p>

<p><strong>Result: Player 1 wins with perfect play.</strong></p>

<p>This post covers the solver implementation, the 180× Rust rewrite, and the deployment architecture.</p>

<p><strong>Demo:</strong> <a href="https://gobblet-gobblers-tablebase.vercel.app/">gobblet-gobblers-tablebase.vercel.app</a><br />
<strong>Source:</strong> <a href="https://github.com/brianhliou/gobblet-gobblers">github.com/brianhliou/gobblet-gobblers</a></p>

<p><img src="/assets/projects/gobblet-gobblers/hero.png" alt="Hero Screenshot" />
<em>The analysis interface showing move evaluations. Green indicates winning moves, red indicates losing moves.</em></p>

<hr />

<h2 id="1-game-rules">1. Game Rules</h2>

<p>Gobblet Gobblers is a two-player game on a 3×3 board. Each player has six pieces: two small, two medium, and two large. Victory requires placing three pieces of your color in a row (horizontal, vertical, or diagonal).</p>

<div style="display:flex;gap:16px;justify-content:center;flex-wrap:wrap;margin:1.5rem 0">
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/01-empty-board.svg" alt="Empty board with full reserves" style="max-width:220px" />
<figcaption style="font-size:0.85em;color:#666;margin-top:4px">Initial position</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/02-stacking-demo.svg" alt="Board showing stacked pieces" style="max-width:220px" />
<figcaption style="font-size:0.85em;color:#666;margin-top:4px">Red large gobbles blue small</figcaption>
</figure>
</div>

<p>The distinguishing mechanic is <strong>gobbling</strong>: larger pieces can cover smaller pieces of either color. When a large piece is placed on top of a small piece, the small piece becomes hidden and does not count toward winning lines. Only the top piece of each stack is visible.</p>

<p>The <strong>reveal rule</strong> introduces significant complexity. When a player lifts a piece to move it, any piece underneath becomes visible. If this reveals an opponent’s winning line, the moving player loses, unless the piece being moved can legally gobble one of the pieces in that winning line. This “hail mary” escape is the only recourse.</p>

<div style="display:flex;gap:16px;justify-content:center;flex-wrap:wrap;margin:1.5rem 0">
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/03-reveal-before.svg" alt="Position before reveal" style="max-width:220px" />
<figcaption style="font-size:0.85em;color:#666;margin-top:4px">P2's large hides P1's small</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/03-reveal-after.svg" alt="Position after hail mary save" style="max-width:220px" />
<figcaption style="font-size:0.85em;color:#666;margin-top:4px">P2 gobbles into column 0 to survive</figcaption>
</figure>
</div>

<p>Several edge cases arise from this rule:</p>

<ol>
  <li>
    <p><strong>Simultaneous reveal and win</strong>: If your move creates your own three-in-a-row while also revealing the opponent’s winning line, the reveal takes precedence: the lift happens before the place. You lose.</p>
  </li>
  <li>
    <p><strong>Same-square restriction</strong>: A piece cannot be placed back on the square it was lifted from. If the only valid gobble target is the origin square, there is no legal escape.</p>
  </li>
  <li>
    <p><strong>Multiple revealed lines</strong>: If lifting reveals two or more winning lines, blocking all of them with a single piece is typically impossible.</p>
  </li>
  <li>
    <p><strong>Zugzwang</strong>: If a player has no legal moves (all possible piece lifts would reveal opponent wins with no valid escape), that player loses immediately.</p>
  </li>
</ol>

<p>The game ends in a draw upon threefold repetition of the same board position.</p>

<hr />

<h2 id="2-results">2. Results</h2>

<h3 id="primary-finding">Primary Finding</h3>

<p><strong>Player 1 (first mover) wins with optimal play.</strong></p>

<p>This was determined through minimax search with alpha-beta pruning.</p>

<h3 id="important-limitations">Important Limitations</h3>

<p>This is <strong>not an exhaustive solve</strong>. Alpha-beta pruning, by design, skips branches once a winning move is found. The 19,836,040 positions in the tablebase represent the positions <em>visited during our search</em>, not the complete set of reachable positions.</p>

<p>The solution proves P1 can force a win, but:</p>

<ol>
  <li>
    <p><strong>Not necessarily optimal.</strong> Move ordering prioritized smaller pieces before larger ones to improve pruning efficiency. The solver found that P1 wins by leading with small piece placements, but this may not be the shortest path to victory. A different move ordering might discover a faster win.</p>
  </li>
  <li>
    <p><strong>Not exhaustive.</strong> When the solver found a winning move at a P1-to-move position, it stopped searching other moves from that position. Those unexplored branches may contain positions not in our tablebase. At P2-to-move positions, all moves were explored (P2 must check every escape attempt), so P2’s options are complete.</p>
  </li>
  <li>
    <p><strong>Tablebase coverage is path-dependent.</strong> The 19.8M positions reflect what our specific search order encountered.</p>
  </li>
</ol>

<h3 id="tablebase-statistics">Tablebase Statistics</h3>

<p>Statistics for positions visited during our alpha-beta search:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Positions in tablebase</td>
      <td>19,836,040</td>
    </tr>
    <tr>
      <td>P1 winning positions</td>
      <td>10,226,838 (51.56%)</td>
    </tr>
    <tr>
      <td>P2 winning positions</td>
      <td>9,570,219 (48.25%)</td>
    </tr>
    <tr>
      <td>Drawn positions</td>
      <td>38,983 (0.20%)</td>
    </tr>
  </tbody>
</table>

<p>The low draw rate (0.20%) among visited positions is notable.</p>

<p>Note: These percentages describe the <em>visited</em> positions, not the full game. The complete state space is ~341 million positions; our pruned search visited only ~20 million of them.</p>

<h3 id="game-length">Game Length</h3>

<p><strong>Optimal play: P1 wins in 13 plies (7 moves).</strong></p>

<p>With optimal play by both sides, P1 forces a win in 13 plies. The winning first move is a small or large piece; opening with a medium piece is a mistake.</p>

<hr />

<h2 id="3-approach">3. Approach</h2>

<h3 id="algorithm">Algorithm</h3>

<p>The solver uses minimax search with alpha-beta pruning and transposition tables.</p>

<p><strong>Minimax</strong> computes the outcome of a position recursively: a position is winning for the current player if any child position is losing for the opponent; losing if all children are winning for the opponent; drawn otherwise.</p>

<p><strong>Alpha-beta pruning</strong> eliminates branches that cannot affect the final result. If the maximizing player has already found a winning move, remaining moves need not be evaluated. This optimization is only effective when good moves are explored first.</p>

<p><strong>Transposition tables</strong> cache position outcomes to avoid redundant computation when the same position is reachable via different move sequences.</p>

<h3 id="symmetry-reduction">Symmetry Reduction</h3>

<p>The 3×3 board has D₄ symmetry (8 equivalent configurations under rotation and reflection). Before lookup or storage, positions are <strong>canonicalized</strong> by computing all 8 transformations and selecting the lexicographically smallest encoding.</p>

<div style="display:grid;grid-template-columns:repeat(4,1fr);gap:8px;max-width:600px;margin:1.5rem auto">
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-0-identity.svg" alt="Gobblet Gobblers board identity transformation" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Identity</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-1-rotate-90.svg" alt="Gobblet Gobblers board rotated 90 degrees" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Rotate 90°</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-2-rotate-180.svg" alt="Gobblet Gobblers board rotated 180 degrees" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Rotate 180°</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-3-rotate-270.svg" alt="Gobblet Gobblers board rotated 270 degrees" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Rotate 270°</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-4-flip-horizontal.svg" alt="Gobblet Gobblers board flipped horizontally" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Flip H</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-5-flip-vertical.svg" alt="Gobblet Gobblers board flipped vertically" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Flip V</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-6-flip-diagonal.svg" alt="Gobblet Gobblers board flipped diagonally" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Flip diag</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/assets/projects/gobblet-gobblers/symmetries/sym-7-flip-antidiagonal.svg" alt="Gobblet Gobblers board flipped anti-diagonally" style="width:100%" />
<figcaption style="font-size:0.75em;color:#666">Flip anti-diag</figcaption>
</figure>
</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Transformations:
- Identity
- Rotate 90° CW
- Rotate 180°
- Rotate 270° CW
- Reflect horizontal
- Reflect vertical
- Reflect main diagonal
- Reflect anti-diagonal
</code></pre></div></div>

<p>This reduces the effective state space by approximately 8× for most positions (some positions are self-symmetric).</p>
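
<p>A sketch of canonicalization for a flat 9-cell board encoding (the project’s real encoding also covers stacks, reserves, and side to move; the permutation idea is the same):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def rot90(cells):
    # 3x3 row-major grid: new (r, c) takes old (2 - c, r)
    return [cells[(2 - c) * 3 + r] for r in range(3) for c in range(3)]

def flip_h(cells):
    # mirror left-right: new (r, c) takes old (r, 2 - c)
    return [cells[r * 3 + (2 - c)] for r in range(3) for c in range(3)]

def symmetries(cells):
    out, cur = [], list(cells)
    for _ in range(4):
        out += [cur, flip_h(cur)]   # each rotation, with and without a flip
        cur = rot90(cur)
    return out                      # the 8 elements of D4

def canonicalize(cells):
    # Lexicographically smallest of the 8 equivalent encodings
    return min(tuple(s) for s in symmetries(cells))
</code></pre></div></div>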

<h3 id="tablebase-construction">Tablebase Construction</h3>

<p>The goal is a mapping from canonical position hash to outcome. Given this tablebase, optimal play reduces to: look up the current position, enumerate legal moves, look up each resulting position, choose any move that preserves your winning status (or minimizes loss).</p>
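
<p>In outline, tablebase playback looks like the sketch below, where <code class="language-plaintext highlighter-rouge">generate_moves</code>, <code class="language-plaintext highlighter-rouge">apply_move</code>, and <code class="language-plaintext highlighter-rouge">canonical_hash</code> are passed in as stand-ins for the project’s helpers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def best_move(state, tablebase, generate_moves, apply_move, canonical_hash):
    # Outcomes are stored for the player to move. After our move it is
    # the opponent's turn, so we prefer children marked LOSS for them.
    preference = {"LOSS": 0, "DRAW": 1, "WIN": 2}
    scored = []
    for move in generate_moves(state):
        child = apply_move(state, move)
        outcome = tablebase[canonical_hash(child)]
        scored.append((preference[outcome], move))
    return min(scored, key=lambda t: t[0])[1]
</code></pre></div></div>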

<hr />

<h2 id="4-implementation-v1-python">4. Implementation: V1 (Python)</h2>

<h3 id="initial-design">Initial Design</h3>

<p>The first implementation used an object-oriented design in Python. <code class="language-plaintext highlighter-rouge">GameState</code> objects encapsulated board state, reserves, and current player. <code class="language-plaintext highlighter-rouge">Piece</code> objects represented individual pieces with player and size attributes. Move generation returned <code class="language-plaintext highlighter-rouge">Move</code> objects.</p>

<p>To explore a child position, the solver deep-copied the current state, applied the move to the copy, recursed, then discarded the copy.</p>

<p>This design was correct but exhibited unacceptable performance.</p>

<h3 id="performance-analysis">Performance Analysis</h3>

<p>Benchmarking revealed the bottleneck:</p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Time per position</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">generate_moves()</code></td>
      <td>18 µs</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">play_move()</code> × 27 moves</td>
      <td>567 µs</td>
    </tr>
    <tr>
      <td>└─ <code class="language-plaintext highlighter-rouge">deepcopy()</code> overhead</td>
      <td><strong>420 µs (75%)</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">encode_state()</code> × 27</td>
      <td>40 µs</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">canonicalize()</code> × 27</td>
      <td>222 µs</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td>~850 µs</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">deepcopy</code> consumed 75% of <code class="language-plaintext highlighter-rouge">play_move</code> time: approximately 15 µs per copy, 27 copies per position. At 750 positions/second, working through the ~341M-position state space would take days.</p>

<p>Additionally, Python’s recursion limit (1,000 by default) was insufficient: the game tree extends to depths exceeding 400 moves, and unordered search descends far deeper still. The solver was converted to an iterative implementation using an explicit stack of <code class="language-plaintext highlighter-rouge">StackFrame</code> objects.</p>
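<p>The conversion follows the standard explicit-stack pattern. A generic sketch (the solver’s real frames also carry undo information and pruning state):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def evaluate_iterative(root, children, leaf_value, combine):
    # Each frame: (node, iterator over children, collected child values).
    stack = [(root, iter(children(root)), [])]
    value = None
    while stack:
        node, child_iter, child_values = stack[-1]
        child = next(child_iter, None)
        if child is None:                      # frame finished: fold results
            value = combine(child_values) if child_values else leaf_value(node)
            stack.pop()
            if stack:
                stack[-1][2].append(value)     # hand the result to the parent
        else:
            stack.append((child, iter(children(child)), []))
    return value
</code></pre></div></div>

<p>For minimax, <code class="language-plaintext highlighter-rouge">combine</code> would negate and maximize the child values (negamax).</p>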

<h3 id="optimization-undo-based-move-application">Optimization: Undo-Based Move Application</h3>

<p>Rather than copy-mutate-discard, the optimized approach mutates in place and undoes the mutation when backtracking:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Before: copy-based (slow)
</span><span class="n">child_state</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="nf">copy</span><span class="p">()</span>  <span class="c1"># Deep copy
</span><span class="nf">apply_move</span><span class="p">(</span><span class="n">child_state</span><span class="p">,</span> <span class="n">move</span><span class="p">)</span>
<span class="n">outcome</span> <span class="o">=</span> <span class="nf">solve</span><span class="p">(</span><span class="n">child_state</span><span class="p">)</span>

<span class="c1"># After: undo-based (fast)
</span><span class="n">undo_info</span> <span class="o">=</span> <span class="nf">apply_move_in_place</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">move</span><span class="p">)</span>  <span class="c1"># Mutate directly
</span><span class="n">outcome</span> <span class="o">=</span> <span class="nf">solve</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
<span class="nf">undo_move_in_place</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">undo_info</span><span class="p">)</span>  <span class="c1"># Restore
</span></code></pre></div></div>

<p>This required tracking sufficient information to reverse each move: which piece was captured at the destination, which piece was revealed at the source, whether the player turn was switched.</p>
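<p>One way to record that information, with illustrative field names rather than the solver’s actual structure:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class UndoInfo:
    source: Optional[int]           # cell index, or None for a reserve placement
    dest: int
    covered_at_dest: Optional[int]  # piece gobbled at the destination, if any
    revealed_at_src: Optional[int]  # piece uncovered at the source, if any
    turn_switched: bool = True      # whether play passed to the opponent
</code></pre></div></div>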

<p><strong>Result: 750 → 2,000 positions/second (2.7× speedup).</strong></p>

<h3 id="optimization-disabling-garbage-collection">Optimization: Disabling Garbage Collection</h3>

<p>During enumeration, progress stopped entirely at ~70 million objects in memory. Profiling revealed Python’s cyclic garbage collector was scanning all objects to find reference cycles:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>815/817 samples in _PyGC_Collect → mark_all_reachable
</code></pre></div></div>

<p>With tens of millions of objects (26M in one dict, 24M in another set, 20M in the transposition table), each GC scan consumed minutes. The data structures contained only integers, so no reference cycles were possible.</p>

<p><strong>Fix:</strong> <code class="language-plaintext highlighter-rouge">gc.disable()</code> during hot loops. Reference counting (non-cyclic cleanup) continues to function.</p>
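<p>The change itself is two lines around the solve loop (<code class="language-plaintext highlighter-rouge">solve()</code> here is a hypothetical entry point):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import gc

gc.disable()        # suspend the cyclic collector
try:
    solve()         # hot loop: tens of millions of long-lived objects
finally:
    gc.enable()     # restore normal collection afterwards
</code></pre></div></div>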

<h3 id="optimization-shared-path-set">Optimization: Shared Path Set</h3>

<p>Each <code class="language-plaintext highlighter-rouge">StackFrame</code> initially stored a <code class="language-plaintext highlighter-rouge">frozenset</code> of all positions on the current path for cycle detection. At depth D, frame D contains a frozenset of D elements. Total storage: O(D²).</p>

<p>At depth 84,000:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 + 2 + ... + 84,000 = 3.5 billion integers across frames
</code></pre></div></div>

<p><strong>Fix:</strong> Use a single shared <code class="language-plaintext highlighter-rouge">set</code> for the entire search. Add positions when pushing frames, remove when popping. Storage: O(D).</p>
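<p>A sketch of the pattern:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>path = set()                      # one set shared by the entire search

def push(stack, position):
    if position in path:          # already on the current path: a cycle
        return False
    path.add(position)
    stack.append(position)
    return True

def pop(stack):
    path.discard(stack.pop())     # keep the set in sync while backtracking
</code></pre></div></div>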

<h3 id="final-python-performance">Final Python Performance</h3>

<p>With all optimizations (undo-based moves, gc.disable, shared path set), throughput reached <strong>~3,700 positions/second</strong>. Total solve time: 1.5 hours.</p>

<hr />

<h2 id="5-implementation-v2-rust">5. Implementation: V2 (Rust)</h2>

<h3 id="motivation">Motivation</h3>

<p>3,700 positions/second was sufficient to complete the solve, but a Rust rewrite offered both performance gains and a verification opportunity: two independent implementations should produce identical results.</p>

<h3 id="why-rust">Why Rust</h3>

<p>The Python bottlenecks pointed directly at language-level constraints:</p>

<ol>
  <li>
    <p><strong>Object overhead.</strong> Python objects carry type information, reference counts, and GC metadata. A simple <code class="language-plaintext highlighter-rouge">Piece(player=1, size=2)</code> occupies ~100 bytes. In Rust, this is 2 bytes (or 1 byte with packing).</p>
  </li>
  <li>
    <p><strong>Garbage collection.</strong> Even with <code class="language-plaintext highlighter-rouge">gc.disable()</code>, Python’s reference counting still runs on every object creation and destruction. Rust has no runtime GC; memory is managed at compile time.</p>
  </li>
  <li>
    <p><strong>Heap allocation.</strong> Python allocates nearly everything on the heap. Rust allows stack allocation for fixed-size data, avoiding allocator overhead entirely.</p>
  </li>
  <li>
    <p><strong>Memory layout control.</strong> Python dictionaries and lists have indirection and padding. Rust’s <code class="language-plaintext highlighter-rouge">#[repr(packed)]</code> and bitfield operations allow exact control over memory layout.</p>
  </li>
</ol>

<p>The bit-packed representation described below is technically possible in Python (using integers as bitfields), but the surrounding code would still pay Python’s interpretation overhead. Rust compiles these bit operations to native instructions.</p>

<h3 id="bit-level-encoding">Bit-Level Encoding</h3>

<p>The V2 representation eliminates all object allocation in the hot path:</p>

<p><strong>Board state (64 bits):</strong> The entire game state fits in a single <code class="language-plaintext highlighter-rouge">u64</code>.</p>

<p><img src="/assets/projects/gobblet-gobblers/05-bit-layout.svg" alt="64-bit board encoding layout" style="max-width:100%;margin:1.5rem 0;display:block" /></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bits 0-53:  Board (9 cells × 6 bits)
Bit 54:     Current player (0=P1, 1=P2)
Bits 55-63: Unused

Cell encoding (6 bits):
  Bits 0-1: Small piece owner  (0=empty, 1=P1, 2=P2)
  Bits 2-3: Medium piece owner
  Bits 4-5: Large piece owner
</code></pre></div></div>

<p>Each cell can hold at most one piece of each size (the stacking constraint). With 3 sizes × 2 bits per size = 6 bits per cell, 9 cells × 6 bits = 54 bits for the full board.</p>
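<p>For illustration, the same cell arithmetic in Python (the production code is Rust, but the masks and shifts are identical):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CELL_BITS = 6

def cell(board: int, index: int) -&gt; int:
    """Extract the 6-bit cell at index 0-8 from the 64-bit board."""
    return (board &gt;&gt; (index * CELL_BITS)) &amp; 0b111111

def owner(cell_bits: int, size: int) -&gt; int:
    """Owner of the piece of a given size (0=small, 1=medium, 2=large):
    returns 0=empty, 1=P1, 2=P2."""
    return (cell_bits &gt;&gt; (size * 2)) &amp; 0b11
</code></pre></div></div>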

<p><strong>Move encoding (8 bits):</strong> <code class="language-plaintext highlighter-rouge">PackedMove</code> is a single byte.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bits 0-3: Destination (0-8)
Bits 4-7: Source (0-8 for slides, 9-11 for reserve placement by size)
</code></pre></div></div>
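<p>The round trip is a shift and a mask (a sketch; function names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pack_move(source: int, dest: int) -&gt; int:
    return (source &lt;&lt; 4) | dest         # fits in one byte

def unpack_move(packed: int) -&gt; tuple[int, int]:
    return packed &gt;&gt; 4, packed &amp; 0x0F   # (source, destination)
</code></pre></div></div>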

<p><strong>Undo encoding (16 bits):</strong> <code class="language-plaintext highlighter-rouge">PackedUndo</code> stores the move plus captured/revealed pieces.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bits 0-7:   PackedMove
Bits 8-10:  Captured piece (3-bit encoding)
Bits 11-13: Revealed piece (3-bit encoding)
</code></pre></div></div>

<p><strong>Win detection (bitmask):</strong> Precomputed masks for the 8 winning lines enable branchless checking:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">WIN_MASKS</span><span class="p">:</span> <span class="p">[</span><span class="nb">u16</span><span class="p">;</span> <span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="mi">0b000_000_111</span><span class="p">,</span> <span class="c1">// Row 0</span>
    <span class="mi">0b000_111_000</span><span class="p">,</span> <span class="c1">// Row 1</span>
    <span class="mi">0b111_000_000</span><span class="p">,</span> <span class="c1">// Row 2</span>
    <span class="mi">0b001_001_001</span><span class="p">,</span> <span class="c1">// Col 0</span>
    <span class="mi">0b010_010_010</span><span class="p">,</span> <span class="c1">// Col 1</span>
    <span class="mi">0b100_100_100</span><span class="p">,</span> <span class="c1">// Col 2</span>
    <span class="mi">0b100_010_001</span><span class="p">,</span> <span class="c1">// Main diagonal</span>
    <span class="mi">0b001_010_100</span><span class="p">,</span> <span class="c1">// Anti-diagonal</span>
<span class="p">];</span>

<span class="k">fn</span> <span class="nf">check_winner</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">Player</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">p1_mask</span><span class="p">,</span> <span class="n">p2_mask</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.visibility_masks</span><span class="p">();</span>
    <span class="k">for</span> <span class="o">&amp;</span><span class="n">win</span> <span class="k">in</span> <span class="o">&amp;</span><span class="n">WIN_MASKS</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p1_mask</span> <span class="o">&amp;</span> <span class="n">win</span><span class="p">)</span> <span class="o">==</span> <span class="n">win</span> <span class="p">{</span> <span class="k">return</span> <span class="nf">Some</span><span class="p">(</span><span class="nn">Player</span><span class="p">::</span><span class="n">One</span><span class="p">);</span> <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p2_mask</span> <span class="o">&amp;</span> <span class="n">win</span><span class="p">)</span> <span class="o">==</span> <span class="n">win</span> <span class="p">{</span> <span class="k">return</span> <span class="nf">Some</span><span class="p">(</span><span class="nn">Player</span><span class="p">::</span><span class="n">Two</span><span class="p">);</span> <span class="p">}</span>
    <span class="p">}</span>
    <span class="nb">None</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="memory-comparison">Memory Comparison</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Python V1</th>
      <th>Rust V2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Board state</td>
      <td>~500 bytes (objects + GC overhead)</td>
      <td>8 bytes (<code class="language-plaintext highlighter-rouge">u64</code>)</td>
    </tr>
    <tr>
      <td>Move</td>
      <td>~100 bytes</td>
      <td>1 byte (<code class="language-plaintext highlighter-rouge">u8</code>)</td>
    </tr>
    <tr>
      <td>Undo info</td>
      <td>~200 bytes</td>
      <td>2 bytes (<code class="language-plaintext highlighter-rouge">u16</code>)</td>
    </tr>
    <tr>
      <td>Move list</td>
      <td>Heap allocation</td>
      <td>Stack-allocated</td>
    </tr>
  </tbody>
</table>

<h3 id="performance-result">Performance Result</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Python V1</th>
      <th>Rust V2</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Solve time</td>
      <td>1.5 hours</td>
      <td>31 seconds</td>
      <td><strong>174×</strong></td>
    </tr>
    <tr>
      <td>Positions/sec</td>
      <td>~3,700</td>
      <td>~640,000</td>
      <td><strong>173×</strong></td>
    </tr>
  </tbody>
</table>

<p>The Rust solver evaluated 19,836,040 unique positions. The ~180× speedup comes entirely from eliminating allocation overhead and using compact representations. The algorithmic approach is identical.</p>

<hr />

<h2 id="6-parity-testing">6. Parity Testing</h2>

<p>With two independent implementations (Python V1 and Rust V2), cross-validation was possible.</p>

<h3 id="methodology">Methodology</h3>

<ol>
  <li>Run the full solve on both implementations independently</li>
  <li>Compare tablebase outputs position-by-position</li>
  <li>Investigate any discrepancies</li>
</ol>
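<p>Step 2 is simple once both tablebases are loaded as dicts mapping canonical position to outcome (a sketch; the loaders are omitted):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def compare_tablebases(v1: dict, v2: dict):
    shared = v1.keys() &amp; v2.keys()
    mismatches = [(k, v1[k], v2[k]) for k in shared if v1[k] != v2[k]]
    only_in_one = v1.keys() ^ v2.keys()   # positions present in just one solve
    return mismatches, only_in_one
</code></pre></div></div>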

<h3 id="initial-discrepancies">Initial Discrepancies</h3>

<p>The first comparison revealed 109 positions where V1 and V2 disagreed. Investigation showed the bugs were in V1’s test fixtures, not in either solver’s core logic.</p>

<p><strong>Root cause:</strong> V1 unit tests constructed game states directly, bypassing move validation. Some test positions were impossible: more pieces on the board than existed in the starting reserves. These invalid states had been used during V1 development.</p>

<p>V2’s design derives reserve counts from the board state rather than tracking them separately. When V2 encountered these positions, the derivation detected the inconsistency.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// V2: Reserves computed from board</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">reserves</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">player</span><span class="p">:</span> <span class="n">Player</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">[</span><span class="nb">u8</span><span class="p">;</span> <span class="mi">3</span><span class="p">]</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">on_board</span> <span class="o">=</span> <span class="k">self</span><span class="nf">.pieces_on_board</span><span class="p">(</span><span class="n">player</span><span class="p">);</span>
    <span class="p">[</span><span class="mi">2</span> <span class="o">-</span> <span class="n">on_board</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">on_board</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">on_board</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Resolution:</strong> Fixed the V1 test fixtures to construct only valid positions. After removing invalid positions from the comparison, both solvers produced identical results.</p>

<h3 id="reveal-rule-edge-cases">Reveal Rule Edge Cases</h3>

<p>The reveal rule required careful handling:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Board state:
  2L/1S  ..  2S      (P2 Large on top of P1 Small at (0,0))
  1M     ..  ..
  1M     ..  ..

P1 has column 0 except (0,0). If P2 lifts L from (0,0):
→ Reveals 1S at (0,0) → P1 wins column 0!
→ P2 must gobble INTO column 0 to survive.
→ (0,0)→(1,0): P2 L gobbles P1 M → column broken, game continues.
</code></pre></div></div>

<p>Both implementations needed to handle:</p>
<ul>
  <li>Detection of revealed winning lines</li>
  <li>Validation that the moving piece can legally gobble a piece in that line</li>
  <li>The same-square restriction (can’t place back where you lifted)</li>
  <li>Zugzwang (no legal moves = immediate loss)</li>
</ul>

<hr />

<h2 id="7-move-ordering">7. Move Ordering</h2>

<p>Alpha-beta pruning’s effectiveness depends critically on move ordering. Exploring winning moves first enables early cutoffs; exploring losing moves first wastes computation.</p>

<h3 id="v1-observations">V1 Observations</h3>

<p>Without move ordering, the V1 solver exhibited pathological behavior:</p>

<table>
  <thead>
    <tr>
      <th>Configuration</th>
      <th>Max depth reached</th>
      <th>Positions evaluated</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No ordering</td>
      <td>12,724 (stuck)</td>
      <td>9,423 in 60s</td>
    </tr>
    <tr>
      <td>With ordering</td>
      <td>473</td>
      <td>45,892 in 60s</td>
    </tr>
  </tbody>
</table>

<p>Without ordering, the solver descended over 12,000 moves deep before any backtracking occurred. It was exploring chains of bad moves that led nowhere.</p>

<p>With move ordering (winning moves first, losing moves last), max depth dropped to ~480 and the solver made steady progress.</p>
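<p>A sketch of that ordering, reusing the helper names from the V1 code (<code class="language-plaintext highlighter-rouge">winner()</code> is an illustrative terminal check):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def order_moves(state, moves, table):
    def rank(move):
        undo = apply_move_in_place(state, move)
        if winner(state) is not None:      # immediate terminal win: try first
            score = 0
        else:                              # otherwise use any cached outcome
            known = table.get(canonicalize(encode_state(state)))
            score = {"loss": 1, None: 2, "draw": 3, "win": 4}[known]
        undo_move_in_place(state, undo)
        return score
    return sorted(moves, key=rank)
</code></pre></div></div>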

<h3 id="terminal-detection">Terminal Detection</h3>

<p>Move ordering must identify terminal positions (immediate wins/losses) during move generation, not just look them up in a transposition table. An early V2 bug omitted this check, causing the solver to store 198M positions and descend 3.3M moves deep before completing. After fixing the bug: 19.8M positions, max depth 460. The 10× reduction in positions and 7,000× reduction in depth came from a single missing check.</p>

<h3 id="why-depth-explodes-without-ordering">Why Depth Explodes Without Ordering</h3>

<p>Without good move ordering, the solver descended millions of moves deep. Why?</p>

<p>The state space is large enough (~341M positions) that players can shuffle pieces through millions of unique positions before being forced into either a terminal state or threefold repetition. One test run descended 26 million moves deep before finding any terminal position.</p>

<p>With good ordering, wins are found early and these shuffling paths are pruned before being explored.</p>

<hr />

<h2 id="8-the-graph-history-interaction-problem">8. The Graph History Interaction Problem</h2>

<p>Running the solver with and without alpha-beta pruning produced different results:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Result</th>
      <th>Positions</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pruned</td>
      <td><strong>P1 wins</strong></td>
      <td>20M</td>
      <td>31s</td>
    </tr>
    <tr>
      <td>Full (no pruning)</td>
      <td><strong>Draw</strong></td>
      <td>531M</td>
      <td>46 min</td>
    </tr>
  </tbody>
</table>

<p>Same solver, same game logic, different answers. A fundamental bug.</p>

<h3 id="the-bug">The Bug</h3>

<p>The natural approach is to cache <code class="language-plaintext highlighter-rouge">position → result</code> so that positions reached via different move sequences are computed only once. But this caching scheme is fundamentally unsound for games with repetition rules.</p>

<p>The problem: <strong>the result at a position depends on how you got there</strong>, not just the position itself.</p>

<div style="max-width: 500px; margin: 1.5rem auto;">
  <img src="/assets/projects/gobblet-gobblers/07-ghi-problem.svg" alt="Graph History Interaction: same position reached via different paths" style="width: 100%;" />
</div>

<p>Position N can be reached via Path 1 (through A, B, C, D, E, F) or Path 2 (through U, V, W, X, Y, Z). After reaching N, threefold repetition is checked against positions <em>on the current path</em>. If the game later revisits position B:</p>

<ul>
  <li>Via Path 1: B is the 2nd occurrence (one more repetition triggers a draw)</li>
  <li>Via Path 2: B is the 1st occurrence (not a repetition)</li>
</ul>

<p>The game’s result from N depends on which positions were visited before N. Caching <code class="language-plaintext highlighter-rouge">N → result</code> is wrong because there is no single result for N.</p>

<p>This is the <strong>Graph History Interaction (GHI) problem</strong>: in games with path-dependent rules like repetition, position evaluations depend on history, not just the position itself.</p>

<h3 id="why-pruning-masked-the-bug">Why Pruning Masked the Bug</h3>

<p>Alpha-beta pruning with good move ordering finds winning lines quickly and cuts off most of the search tree. The winning path for P1 reaches terminal wins without entering cycle-heavy regions. The bug exists in pruned search too, but the incorrect cache entries happen in branches that get pruned anyway.</p>

<p>Full search explores everything, including all the cyclic paths where GHI causes incorrect results.</p>

<h3 id="attempted-fix-dont-cache-cycle-influenced-values">Attempted Fix: Don’t Cache Cycle-Influenced Values</h3>

<p>A natural fix: track whether a frame’s outcome was influenced by any cycle, and only cache “pure” values.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Frame</span> <span class="p">{</span>
    <span class="n">had_cycle</span><span class="p">:</span> <span class="nb">bool</span><span class="p">,</span>  <span class="c1">// Any child returned cycle-draw?</span>
<span class="p">}</span>

<span class="c1">// Only cache if no cycle influence</span>
<span class="k">if</span> <span class="o">!</span><span class="n">frame</span><span class="py">.had_cycle</span> <span class="p">{</span>
    <span class="k">self</span><span class="py">.table</span><span class="nf">.insert</span><span class="p">(</span><span class="n">canonical</span><span class="p">,</span> <span class="n">outcome</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Result: Impractical.</strong> With ~18,700 direct cycle detections and cycle influence propagating upward, 99.9% of positions became cycle-influenced. With almost nothing cached, the solver reverted to near-full tree enumeration.</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>With had_cycle tracking</th>
      <th>Original</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pure positions cached</td>
      <td>175,806</td>
      <td>19,836,040</td>
    </tr>
    <tr>
      <td>Cycle-influenced (not cached)</td>
      <td>158M+</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>This approach is theoretically correct but the memory and time requirements become prohibitive.</p>

<h3 id="attempted-fix-full-tree-enumeration">Attempted Fix: Full Tree Enumeration</h3>

<p>Another approach: don’t cache transpositions at all. Treat each path as a distinct node. Same position via different paths = different nodes.</p>

<p><strong>Result: Impractical.</strong> The full game tree is finite but enormous. A test run descended 117 million moves deep before I terminated it. The state space allows players to shuffle through vast numbers of unique positions before threefold repetition forces termination.</p>

<h3 id="resolution">Resolution</h3>

<p>The pragmatic solution:</p>

<ol>
  <li><strong>Use alpha-beta pruned solve.</strong> It produces correct results (P1 wins).</li>
  <li><strong>Store intrinsic position values.</strong> The tablebase maps canonical position → outcome assuming a fresh game from that position.</li>
  <li><strong>Handle repetition at runtime.</strong> The viewer tracks game history and applies threefold repetition dynamically.</li>
</ol>

<p>The tablebase answers: “If this position were the start of a new game, who wins?”
The viewer answers: “Given the current game history, who wins?”</p>

<p>These differ only when repetition rules block the optimal path, a rare edge case.</p>
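<p>The viewer-side check is simple because it sees one real history rather than every possible path. A sketch (in Python for brevity; the actual viewer is JavaScript/WASM):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

class GameHistory:
    def __init__(self):
        self.counts = Counter()

    def record(self, canonical_position) -&gt; bool:
        """Record a visited position; True means threefold repetition."""
        self.counts[canonical_position] += 1
        return self.counts[canonical_position] &gt;= 3
</code></pre></div></div>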

<hr />

<h2 id="9-deployment-architecture">9. Deployment Architecture</h2>

<p>With the solver complete, the next step was building a web interface for exploring optimal play. The viewer needs two components: game logic (move generation, win detection) and tablebase lookups (position → outcome).</p>

<p>The game logic is straightforward (compiles to ~32KB of WASM, runs entirely in the browser). The challenge is serving the tablebase. The 170MB binary format is already compact (9 bytes per entry); in SQLite format it would be larger, and as raw JSON, much larger still. The question: how to serve 20 million position lookups with low latency and minimal hosting cost?</p>

<h3 id="options-considered">Options Considered</h3>

<p><strong>Option 1: Traditional Backend</strong></p>

<p>FastAPI or Axum server with SQLite tablebase.</p>

<ul>
  <li>Pros: Simple architecture</li>
  <li>Cons: Always-on server required, hosting costs, latency depends on server location</li>
</ul>

<p><strong>Option 2: WASM + Edge Database</strong></p>

<p>Game logic in browser (WASM), tablebase in serverless SQL (Turso, PlanetScale).</p>

<ul>
  <li>Pros: Game runs locally, only lookups hit network</li>
  <li>Cons: Each move requires looking up ~27 child positions. Even at 10ms per query, that’s 270ms latency per move.</li>
</ul>

<p><strong>Option 3: WASM + Embedded Binary (Selected)</strong></p>

<p>Game logic in browser (WASM), tablebase bundled with serverless function.</p>

<ul>
  <li>Game logic compiles to ~32KB WASM</li>
  <li>170MB tablebase deploys with the Vercel function</li>
  <li>Lookups are in-memory binary search: &lt;1ms</li>
  <li>Vercel replicates globally</li>
</ul>

<p><strong>How this works with Vercel:</strong></p>

<p>Vercel does two things for us:</p>

<ol>
  <li>
    <p><strong>Static hosting (CDN):</strong> The React UI and WASM module are just files (HTML, JavaScript, CSS, and a <code class="language-plaintext highlighter-rouge">.wasm</code> binary). When you visit the URL, your browser downloads these files (~80KB total) from Vercel’s CDN. This is standard web hosting; the files are cached at edge locations worldwide. After the initial download, the app runs entirely in your browser.</p>
  </li>
  <li>
    <p><strong>Serverless functions:</strong> For the tablebase, Vercel runs backend code. You write a function (Node.js, Python, etc.), bundle it with data files, and deploy. When a request arrives, Vercel spins up an instance, executes your code, and returns the response. The 170MB tablebase lives on Vercel’s servers; the browser never downloads it. Instead, the browser makes HTTPS requests to the function endpoint and receives JSON results.</p>
  </li>
</ol>

<p>The split: game logic runs locally (fast, offline-capable), tablebase lookups go to the server (because 170MB is too large to download to every browser).</p>

<div style="max-width: 650px; margin: 1.5rem auto;">
  <img src="/assets/projects/gobblet-gobblers/10-architecture.svg" alt="Deployment architecture showing browser with WASM module communicating with Vercel Edge Function containing the tablebase" style="width: 100%;" />
</div>

<h3 id="implementation-details">Implementation Details</h3>

<p>The tablebase is a sorted binary file: 9 bytes per entry (8-byte key, 1-byte value), 19.8M entries.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lookup: binary search over 19.8M entries
Comparisons: log₂(19,836,040) ≈ 24
</code></pre></div></div>
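<p>The lookup itself is a textbook binary search over fixed-width records. A sketch, assuming big-endian 8-byte keys followed by a 1-byte outcome (the endianness is an assumption):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import struct

RECORD = 9  # 8-byte key + 1-byte outcome

def lookup(buf: bytes, key: int):
    lo, hi = 0, len(buf) // RECORD
    while lo &lt; hi:
        mid = (lo + hi) // 2
        k = struct.unpack_from("&gt;Q", buf, mid * RECORD)[0]
        if k &lt; key:
            lo = mid + 1
        elif k &gt; key:
            hi = mid
        else:
            return buf[mid * RECORD + 8]  # the outcome byte
    return None                           # position not in the tablebase
</code></pre></div></div>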

<p><strong>Latency breakdown:</strong></p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Binary search (27 positions)</td>
      <td>&lt;1ms</td>
    </tr>
    <tr>
      <td>Network round trip</td>
      <td>20-100ms</td>
    </tr>
    <tr>
      <td>Cold start (if idle)</td>
      <td>~300ms</td>
    </tr>
  </tbody>
</table>

<p>The compute time is negligible; latency is dominated by network round trip to the nearest edge location. Cold starts occur when the function hasn’t been invoked recently. The first move may feel slow, but subsequent moves complete in under 100ms.</p>

<p>This architecture works because:</p>
<ul>
  <li>The tablebase is read-only (no database writes)</li>
  <li>170MB fits within Vercel’s 250MB deployment limit</li>
  <li>No external database to manage or pay for</li>
</ul>

<hr />

<h2 id="10-sample-game-43-plies">10. Sample Game (43 Plies)</h2>

<p>The following is a sample 43-ply game from our solver where P1 wins. This is not necessarily optimal play by either side; it is simply the longest winning path our search discovered.</p>

<style>
.gobblet-viewer{max-width:480px;margin:2rem auto;font-family:system-ui,-apple-system,sans-serif}
.viewer-board{background:#1a1a1a;border-radius:12px;padding:8px;display:flex;justify-content:center}
.viewer-board .board-img{max-width:100%;height:auto;display:block}
.viewer-controls{display:flex;align-items:center;justify-content:center;gap:8px;margin-top:12px;padding:8px;background:#2a2a2a;border-radius:8px}
.viewer-controls button{background:#3a3a3a;border:none;color:#fff;font-size:16px;width:40px;height:40px;border-radius:6px;cursor:pointer;transition:background .15s,transform .1s}
.viewer-controls button:hover:not(:disabled){background:#4a4a4a}
.viewer-controls button:active:not(:disabled){transform:scale(.95)}
.viewer-controls button:disabled{opacity:.3;cursor:not-allowed}
.frame-counter{color:#888;font-size:14px;font-variant-numeric:tabular-nums;min-width:60px;text-align:center}
.viewer-move-label{text-align:center;margin-top:8px;color:#ccc;font-size:14px;min-height:20px}
</style>

<div id="game-viewer"></div>

<script>
(function(){
const MOVES=["Initial position","1. P1: S(0,0)","2. P2: S(2,2)","3. P1: M(1,2)","4. P2: (2,2)→(0,1)","5. P1: (1,2)→(0,1)","6. P2: L(0,1)","7. P1: M(2,2)","8. P2: (0,1)→(0,0)","9. P1: (2,2)→(0,2)","10. P2: L(0,2)","11. P1: S(2,1)","12. P2: (0,0)→(0,1)","13. P1: L(0,0)","14. P2: (0,2)→(1,1)","15. P1: (0,2)→(2,1)","16. P2: M(0,2)","17. P1: L(0,2)","18. P2: (1,1)→(1,2)","19. P1: (0,0)→(1,0)","20. P2: (1,2)→(2,0)","21. P1: (0,0)→(1,2)","22. P2: (2,0)→(1,2)","23. P1: (1,0)→(0,0)","24. P2: M(2,2)","25. P1: (0,0)→(1,0)","26. P2: (2,2)→(2,0)","27. P1: (1,0)→(1,1)","28. P2: S(1,0)","29. P1: (1,1)→(1,0)","30. P2: (2,0)→(0,0)","31. P1: (1,0)→(0,0)","32. P2: (1,0)→(2,0)","33. P1: (0,0)→(1,0)","34. P2: (2,0)→(2,2)","35. P1: (1,0)→(1,1)","36. P2: (0,0)→(2,0)","37. P1: (1,1)→(2,2)","38. P2: (0,1)→(2,1)","39. P1: (0,2)→(1,1)","40. P2: (0,2)→(0,0)","41. P1: (1,1)→(0,0)","42. P2: (1,2)→(0,1)","43. P1: (0,0)→(0,2) — P1 wins!"];
const BASE="/assets/projects/gobblet-gobblers/optimal-game";
const c=document.getElementById("game-viewer");
let frame=0;
c.innerHTML='<div class="gobblet-viewer"><div class="viewer-board"><img class="board-img" alt="Gobblet Gobblers game position"></div><div class="viewer-controls"><button class="btn-first" title="First">⏮</button><button class="btn-prev" title="Previous">◀</button><span class="frame-counter">0 / 43</span><button class="btn-next" title="Next">▶</button><button class="btn-last" title="Last">⏭</button></div><div class="viewer-move-label"></div></div>';
const img=c.querySelector(".board-img"),counter=c.querySelector(".frame-counter"),label=c.querySelector(".viewer-move-label");
const bf=c.querySelector(".btn-first"),bp=c.querySelector(".btn-prev"),bn=c.querySelector(".btn-next"),bl=c.querySelector(".btn-last");
function load(n){frame=Math.max(0,Math.min(n,43));img.src=BASE+"/move-"+String(frame).padStart(2,"0")+".svg";counter.textContent=frame+" / 43";label.textContent=MOVES[frame]||"";bf.disabled=bp.disabled=frame===0;bn.disabled=bl.disabled=frame===43;}
bf.onclick=()=>load(0);bp.onclick=()=>load(frame-1);bn.onclick=()=>load(frame+1);bl.onclick=()=>load(43);
document.addEventListener("keydown",e=>{const r=c.getBoundingClientRect();if(r.top>window.innerHeight||r.bottom<0)return;if(e.key==="ArrowLeft"){load(frame-1);e.preventDefault();}else if(e.key==="ArrowRight"){load(frame+1);e.preventDefault();}});
load(0);
})();
</script>

<p><strong>Move sequence:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S(0,0) S(2,2) M(1,2) (2,2)-&gt;(0,1) (1,2)-&gt;(0,1) L(0,1) M(2,2) (0,1)-&gt;(0,0)
(2,2)-&gt;(0,2) L(0,2) S(2,1) (0,0)-&gt;(0,1) L(0,0) (0,2)-&gt;(1,1) (0,2)-&gt;(2,1)
M(0,2) L(0,2) (1,1)-&gt;(1,2) (0,0)-&gt;(1,0) (1,2)-&gt;(2,0) (0,0)-&gt;(1,2) (2,0)-&gt;(1,2)
(1,0)-&gt;(0,0) M(2,2) (0,0)-&gt;(1,0) (2,2)-&gt;(2,0) (1,0)-&gt;(1,1) S(1,0) (1,1)-&gt;(1,0)
(2,0)-&gt;(0,0) (1,0)-&gt;(0,0) (1,0)-&gt;(2,0) (0,0)-&gt;(1,0) (2,0)-&gt;(2,2) (1,0)-&gt;(1,1)
(0,0)-&gt;(2,0) (1,1)-&gt;(2,2) (0,1)-&gt;(2,1) (0,2)-&gt;(1,1) (0,2)-&gt;(0,0) (1,1)-&gt;(0,0)
(1,2)-&gt;(0,1) (0,0)-&gt;(0,2)
</code></pre></div></div>

<p>To replay this game interactively, paste the sequence into the import field at <a href="https://gobblet-gobblers-tablebase.vercel.app/">gobblet-gobblers-tablebase.vercel.app</a> and step through with the history controls.</p>

<hr />

<h2 id="11-context-other-solved-games">11. Context: Other Solved Games</h2>

<p>Gobblet Gobblers sits in a particular complexity class among solved games:</p>

<table>
  <thead>
    <tr>
      <th>Game</th>
      <th>State Space</th>
      <th>Year Released</th>
      <th>Year Solved</th>
      <th>Result</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tic-tac-toe</td>
      <td>~10³</td>
      <td>Ancient</td>
      <td>1952</td>
      <td>Draw</td>
    </tr>
    <tr>
      <td><strong>Gobblet Gobblers (3×3)</strong></td>
      <td><strong>~3×10⁸</strong></td>
      <td>2003</td>
      <td>2024</td>
      <td>First player wins</td>
    </tr>
    <tr>
      <td>Connect Four</td>
      <td>~10¹³</td>
      <td>1974</td>
      <td>1988</td>
      <td>First player wins</td>
    </tr>
    <tr>
      <td>Checkers</td>
      <td>~10²⁰</td>
      <td>~3000 BCE</td>
      <td>2007</td>
      <td>Draw</td>
    </tr>
    <tr>
      <td>Gobblet (4×4)</td>
      <td>~6×10²¹</td>
      <td>2001</td>
      <td>Unsolved</td>
      <td>Unknown (likely P1)</td>
    </tr>
    <tr>
      <td>Chess</td>
      <td>~10⁴⁴</td>
      <td>~600 CE</td>
      <td>Unsolved</td>
      <td>Unknown</td>
    </tr>
    <tr>
      <td>Go (19×19)</td>
      <td>~10¹⁷⁰</td>
      <td>~2000 BCE</td>
      <td>Unsolved</td>
      <td>Unknown</td>
    </tr>
  </tbody>
</table>

<p>Gobblet Gobblers has a state space of ~341 million positions, though symmetry reduction and pruning bring the solved tablebase down to ~20 million. The late solve date (2024) reflects the game’s novelty rather than difficulty.</p>

<p>The full 4×4 Gobblet has a state space comparable to Checkers (~10²⁰) and remains unsolved. Unlike the 3×3 version where exhaustive search is tractable, 4×4 would require heuristic methods like alpha-beta with evaluation functions or Monte Carlo tree search.</p>

<hr />

<h2 id="12-conclusion">12. Conclusion</h2>

<p>Gobblet Gobblers is solved: <strong>Player 1 wins with optimal play in 13 plies.</strong></p>

<p>Despite being marketed as a children’s game, the reveal rule and stacking mechanic create genuine complexity. The state space of ~341 million positions reduces to ~20 million after symmetry and pruning, tractable for exhaustive search on a laptop.</p>

<p>The technical lessons:</p>

<ol>
  <li>
    <p><strong>Representation dominates performance.</strong> The same algorithm ran 180× faster in Rust with bit-packed state (8 bytes) versus Python with objects (~500 bytes). Eliminating allocation in the hot path matters more than algorithmic cleverness.</p>
  </li>
  <li>
    <p><strong>Move ordering makes or breaks alpha-beta.</strong> Without ordering, the solver descended 3 million moves deep into shuffling paths. With terminal detection during move generation, max depth dropped to 460, a 7,000× reduction.</p>
  </li>
  <li>
    <p><strong>The GHI problem is real.</strong> Games with repetition rules cannot use simple position → outcome caching. The same position reached via different paths can have different results. The pragmatic fix: use pruned search and handle repetition at query time.</p>
  </li>
</ol>

<p>This is not an exhaustive solve. Alpha-beta pruning skips branches once a winning move is found. The tablebase contains the positions our search visited, which is sufficient to prove P1 wins and to play optimally from any visited position.</p>

<p>The demo is live at <a href="https://gobblet-gobblers-tablebase.vercel.app/">gobblet-gobblers-tablebase.vercel.app</a>. Analysis mode colors each legal move by outcome. Try to beat the solver, or watch optimal play unfold.</p>

<hr />

<p><em>Questions or feedback: <a href="https://github.com/brianhliou/gobblet-gobblers/issues">GitHub Issues</a></em></p>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><summary type="html"><![CDATA[A minimax solver for Gobblet Gobblers: proving P1 wins, achieving 180× speedup with Rust, and encountering the Graph History Interaction problem.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Inside Chinese Pinyin: What 79,000+ Sentences Reveal</title><link href="https://brianhliou.com/posts/chinese-pinyin-analysis/" rel="alternate" type="text/html" title="Inside Chinese Pinyin: What 79,000+ Sentences Reveal" /><published>2025-11-06T00:00:00+00:00</published><updated>2025-11-06T00:00:00+00:00</updated><id>https://brianhliou.com/posts/chinese-pinyin-analysis</id><content type="html" xml:base="https://brianhliou.com/posts/chinese-pinyin-analysis/"><![CDATA[<p>I analyzed 79,704 Chinese sentences from <a href="https://tatoeba.org">Tatoeba</a> to understand which pinyin syllables actually appear in practice. The corpus contains 1,161 unique syllables (there are likely more obscure ones not covered here). Using a Trie data structure to map every syllable, I found patterns in tone distribution, character complexity, and polyphonic pronunciation that differ from what textbooks suggest.</p>

<p>Here’s what the data reveals.</p>

<h2 id="the-trie-structure">The Trie Structure</h2>

<p>A Trie (prefix tree) is a data structure where each path from root to leaf represents a complete string. For pinyin, each node is a single letter, and following a path like h → a → n → 4 gives you the syllable “han4”.</p>
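<p>A minimal version of the structure (terminal nodes in the real analysis also carry character and frequency metadata):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class TrieNode:
    def __init__(self):
        self.children = {}      # letter or tone digit -&gt; child node
        self.terminal = False   # True if a complete syllable ends here

class PinyinTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, syllable: str):
        node = self.root
        for ch in syllable:     # e.g. "h", "a", "n", "4"
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, syllable: str) -&gt; bool:
        node = self.root
        for ch in syllable:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.terminal
</code></pre></div></div>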

<p>Pinyin syllables form a tree where each letter is a node. Here’s the complete h-branch showing how syllables form from root to terminal nodes:</p>

<p><img src="/assets/posts/pinyin-trie/pinyin_trie_visualization_h_branch.svg" alt="Pinyin Trie - H Branch" style="max-width: 600px; height: auto; display: block; margin: 0 auto;" /></p>

<p><em>The h-branch from root to all complete syllables (ha, hao, he, etc.). Dashed lines show other branches.</em></p>

<p><strong>Explore more:</strong></p>

<p><a href="/assets/posts/pinyin-trie/pinyin_trie_visualization_depth1.svg">Level 1 →</a><br />
<a href="/assets/posts/pinyin-trie/pinyin_trie_visualization_depth2.svg">Level 2 →</a><br />
<a href="/assets/posts/pinyin-trie/pinyin_trie_visualization_depth3.svg">Level 3 →</a><br />
<a href="/assets/posts/pinyin-trie/pinyin_trie_visualization.svg">Full Trie →</a> <em>(hover over nodes for details)</em></p>

<h2 id="the-tone-paradox">The Tone Paradox</h2>

<p>Neutral tone represents only 2.2% of unique syllables, but 8.4% of actual character usage. Why? Extremely common grammatical particles (的, 了, 吗, 么) are all neutral tone:</p>

<p><img src="/assets/posts/pinyin-trie/tone_distributions.png" alt="Tone Distribution Analysis" />
<em>Three perspectives on tone: by syllables, by characters, and by frequency</em></p>

<p>Key insight: <strong>Tone 4 dominates</strong> across all measures (27-34%), while neutral tone punches far above its weight due to particle frequency.</p>

<h2 id="syllable-crowding">Syllable Crowding</h2>

<p>Most pinyin syllables map to a handful of characters. But a few are extremely crowded:</p>

<p><img src="/assets/posts/pinyin-trie/syllable_complexity.png" alt="Syllable Complexity Distribution" />
<em>Character count per syllable - most have 3-4 characters, but some have 30+</em></p>

<p>The most crowded syllables:</p>
<ul>
  <li><strong>yi4</strong>: 37 characters (意, 义, 议, 异, 易, 亿, 艺, 益…)</li>
  <li><strong>shi4</strong>: 32 characters (是, 事, 市, 式, 试, 视, 世, 士…)</li>
  <li><strong>ji4</strong>: 30 characters (记, 际, 计, 技, 季, 继, 既, 寄…)</li>
</ul>

<p>Meanwhile, some syllables are unique to a single character: <strong>wo3</strong> (我), <strong>le0</strong> (了), <strong>ni3</strong> (你).</p>

<h2 id="the-polyphonic-myth">The Polyphonic Myth</h2>

<p>Textbooks emphasize that many Chinese characters have multiple pronunciations. But in the 5,002 characters that appear in this corpus, only <strong>3.8%</strong> (199 characters) are actually polyphonic. While Unicode defines 80,000+ Chinese characters, this analysis focuses on what learners encounter in real usage:</p>

<p><img src="/assets/posts/pinyin-trie/polyphonic_characters.png" alt="Polyphonic Characters" />
<em>Top 20 characters with multiple pronunciations</em></p>

<p>Even when characters <em>have</em> multiple pronunciations, one usually dominates:</p>
<ul>
  <li><strong>的</strong>: 4 pronunciations, but <code class="language-plaintext highlighter-rouge">de0</code> is used 99.8% of the time</li>
  <li><strong>一</strong>: 3 pronunciations, but <code class="language-plaintext highlighter-rouge">yi1</code> is the primary form</li>
  <li><strong>著</strong>: 5 pronunciations (the most polyphonic character in the corpus)</li>
</ul>

<h2 id="syllable-structure">Syllable Structure</h2>

<p>Most Chinese syllables complete at depth 4 (3 letters + tone, like ban1 or mao2):</p>

<p><img src="/assets/posts/pinyin-trie/depth_distribution.png" alt="Depth Distribution" />
<em>Syllable completion by depth - 41.5% of syllables complete at depth 4 (482 out of 1,161)</em></p>

<p>The longest syllables (depth 7: 6 letters + tone) are all <strong>-uang</strong> combinations:</p>
<ul>
  <li>chuang1, chuang2, chuang3, chuang4</li>
  <li>shuang1, shuang3</li>
  <li>zhuang1, zhuang4</li>
</ul>

<h2 id="syllable--tone-coverage">Syllable × Tone Coverage</h2>

<p>When you strip away tones, 1,161 syllables collapse to <strong>401 base forms</strong>. This heatmap shows which base syllables exist across all 5 tones:</p>

<p><img src="/assets/posts/pinyin-trie/syllable_tone_matrix.png" alt="Syllable-Tone Matrix" />
<em>401 base syllables × 5 tones. Cell values show character count. Some bases exist across all tones, others only in one or two.</em></p>

<p>Observations:</p>
<ul>
  <li>Most base syllables don’t have all 5 tone variants</li>
  <li>Neutral tone (column 0) is sparse - only 25 base syllables have it</li>
  <li>Some bases are tone-specific (appear in only 1-2 tone columns)</li>
</ul>

<h2 id="key-findings">Key Findings</h2>

<ul>
  <li><strong>1,161 unique syllables</strong> found in 79,704 sentences from the Tatoeba corpus</li>
  <li><strong>401 base syllables</strong> when tones are removed (average 2.9 tones per base)</li>
  <li><strong>91.8% of characters</strong> have only one pronunciation in practice</li>
  <li><strong>Neutral tone</strong>: 2.2% of syllables but 8.4% of usage (particle effect)</li>
  <li><strong>Top 10 syllables</strong> account for 18.8% of all character instances</li>
  <li><strong>Character homophony</strong>: Average 4.5 characters per syllable (range: 1-37)</li>
</ul>

<h2 id="what-you-learned">What You Learned</h2>

<p>✓ Chinese has 1,161 unique pinyin syllables (401 base forms without tones)<br />
✓ Only 3.8% of characters in real usage are actually polyphonic<br />
✓ Neutral tone punches above its weight: 2.2% of syllables, 8.4% of usage<br />
✓ Tone 4 dominates across all measures (27-34%)<br />
✓ A Trie data structure maps every syllable path efficiently</p>

<h2 id="technical-notes">Technical Notes</h2>

<p>This analysis is based on 79,704 Chinese sentences from the <a href="https://tatoeba.org">Tatoeba corpus</a>. I built a character-level Trie data structure where each node represents a single letter or tone number, with terminal nodes storing character metadata and frequency data.</p>

<p><strong>Tools:</strong> Python 3.9+, matplotlib (charts), graphviz (tree visualization)</p>

<p><strong>Data pipeline:</strong></p>
<ol>
  <li>Extract characters from corpus</li>
  <li>Compute pinyin with jieba + pypinyin and/or GPT-4o-mini</li>
  <li>Build Trie structure</li>
  <li>Analyze distributions, patterns, and edge cases</li>
</ol>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://github.com/brianhliou/hanzi-flow/tree/main/scripts/character_set/analysis">hanzi-flow on GitHub</a> — Full analysis code and visualizations</li>
  <li><a href="https://tatoeba.org">Tatoeba</a> — Sentence corpus</li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="data-analysis" /><category term="chinese" /><category term="linguistics" /><summary type="html"><![CDATA[Analyzing 1,161 unique pinyin syllables from 79,704 Chinese sentences reveals surprising patterns about tones, polyphonic characters, and what learners actually encounter.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How Prometheus and Grafana Actually Work</title><link href="https://brianhliou.com/posts/how-prometheus-and-grafana-work/" rel="alternate" type="text/html" title="How Prometheus and Grafana Actually Work" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://brianhliou.com/posts/how-prometheus-and-grafana-work</id><content type="html" xml:base="https://brianhliou.com/posts/how-prometheus-and-grafana-work/"><![CDATA[<p>When you start adding observability to your applications, the Prometheus ecosystem can be confusing. You install <code class="language-plaintext highlighter-rouge">prometheus-client</code>, but where are the dashboards? You hear about Prometheus and Grafana - are they the same thing? And why do you need three separate tools just to track some metrics?</p>

<p>This guide explains how the observability stack actually works - the three components, how they communicate, and most importantly, <strong>what’s stored in memory versus what’s persisted to disk</strong>. This last part is crucial for understanding how metrics flow through the system, yet most tutorials skip over it.</p>

<h2 id="understanding-the-stack">Understanding the Stack</h2>

<p>Here’s the key insight: <strong>prometheus-client, Prometheus, and Grafana are three separate applications</strong>, not one tool.</p>

<p>When you run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>prometheus-client
</code></pre></div></div>

<p>You get a Python library that helps your application expose metrics. But you still need:</p>
<ul>
  <li><strong>Prometheus</strong> (a separate application) to collect and store those metrics</li>
  <li><strong>Grafana</strong> (another separate application) to visualize them</li>
</ul>

<p>Here’s what each does:</p>
<ul>
  <li><strong>prometheus-client</strong>: Instrumentation library that lives in your application code</li>
  <li><strong>Prometheus</strong>: Time-series database that scrapes and stores metrics</li>
  <li><strong>Grafana</strong>: Visualization platform that queries Prometheus and renders dashboards</li>
</ul>

<p>This guide walks through each component, shows how they interact with working examples, and explains the architecture decisions that make this separation useful.</p>

<h2 id="the-three-separate-applications">The Three Separate Applications</h2>

<p>Let’s break down each component and what it actually does:</p>

<h3 id="1-prometheus-client-instrumentation-library">1. prometheus-client (Instrumentation Library)</h3>

<p><strong>What it is:</strong> A language-specific library that lives in your application</p>

<p><em>This guide uses Python examples (<code class="language-plaintext highlighter-rouge">pip install prometheus-client</code>), but Prometheus has official client libraries for Go, Java, Ruby, and more. The concepts are identical across all languages.</em></p>

<p><strong>What it does:</strong></p>
<ul>
  <li>Provides classes/functions to define metrics: <code class="language-plaintext highlighter-rouge">Counter</code>, <code class="language-plaintext highlighter-rouge">Histogram</code>, <code class="language-plaintext highlighter-rouge">Gauge</code></li>
  <li>Formats metrics in Prometheus text format</li>
  <li>Exposes a <code class="language-plaintext highlighter-rouge">/metrics</code> HTTP endpoint</li>
</ul>

<p><strong>What it does NOT do:</strong></p>
<ul>
  <li>Does NOT store metrics long-term</li>
  <li>Does NOT provide a UI</li>
  <li>Does NOT include Prometheus itself</li>
</ul>

<p><strong>Think of it as:</strong> A “printer driver” - helps your app output data in the right format</p>
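<p>In Python, defining and updating these metrics takes a few lines (the metric and label names here mirror the sample <code class="language-plaintext highlighter-rouge">/metrics</code> output shown later; <code class="language-plaintext highlighter-rouge">do_work()</code> is a hypothetical handler):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from prometheus_client import Counter, Histogram

REQUESTS = Counter("api_requests_total", "Total number of API requests",
                   ["endpoint", "status"])
LATENCY = Histogram("api_request_duration_seconds", "API request latency",
                    ["endpoint"])

def handle_request():
    with LATENCY.labels(endpoint="/api/data").time():  # observes duration on exit
        result = do_work()
    REQUESTS.labels(endpoint="/api/data", status="success").inc()
    return result
</code></pre></div></div>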

<h3 id="2-prometheus-standalone-go-application">2. Prometheus (Standalone Go Application)</h3>

<p><strong>What it is:</strong> A completely separate application (not a Python library!)</p>

<p><strong>How you get it:</strong> Docker image <code class="language-plaintext highlighter-rouge">prom/prometheus:latest</code> (or download the binary)</p>

<p><strong>What it does:</strong></p>
<ul>
  <li>Scrapes your <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint every 15 seconds (configurable)</li>
  <li>Stores time-series data in a database on disk</li>
  <li>Provides PromQL query language</li>
  <li>Includes a basic web UI at port 9090</li>
</ul>

<p><strong>Think of it as:</strong> The database that stores your metrics history</p>

<h3 id="3-grafana-standalone-gotypescript-application">3. Grafana (Standalone Go/TypeScript Application)</h3>

<p><strong>What it is:</strong> Yet another separate application</p>

<p><strong>How you get it:</strong> Docker image <code class="language-plaintext highlighter-rouge">grafana/grafana:latest</code></p>

<p><strong>What it does:</strong></p>
<ul>
  <li>Queries Prometheus using PromQL</li>
  <li>Renders beautiful dashboards</li>
  <li>Provides alerting (not covered here)</li>
  <li>Full web UI at port 3000</li>
</ul>

<p><strong>Think of it as:</strong> The visualization layer</p>

<p><img src="/assets/posts/observability/prometheus-targets.png" alt="Three separate containers running" />
<em>Prometheus showing that it’s actively scraping the API - three separate applications communicating over HTTP</em></p>

<h3 id="how-theyre-actually-deployed">How They’re Actually Deployed</h3>

<p>Understanding the deployment topology is important:</p>

<p><strong>prometheus-client (runs everywhere):</strong></p>
<ul>
  <li>Lives inside <strong>every instance</strong> of your application</li>
  <li>If you have 50 API servers, you have 50 instances with prometheus-client embedded</li>
  <li>Each exposes its own <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint</li>
</ul>

<p><strong>Prometheus (centralized):</strong></p>
<ul>
  <li>Typically <strong>1-2 instances</strong> (or a small HA cluster) for your entire infrastructure</li>
  <li>One Prometheus can scrape hundreds or thousands of application instances</li>
  <li>Configuration lists all the targets to scrape</li>
</ul>

<p><strong>Grafana (centralized):</strong></p>
<ul>
  <li>Typically <strong>1 instance</strong> (or a small HA cluster)</li>
  <li>One Grafana can query multiple Prometheus instances</li>
  <li>Shared by your entire team</li>
</ul>

<p><strong>The topology:</strong> N applications : 1 Prometheus (or cluster) : 1 Grafana</p>

<p>In our demo, we’re running everything locally (1 API, 1 Prometheus, 1 Grafana). But in production, you’d have many application instances all being scraped by one or more Prometheus instances, with a single Grafana instance for visualization. At large scale, you might use federated Prometheus or tools like Thanos/Cortex for horizontal scaling.</p>

<h3 id="the-pull-model">The Pull Model</h3>

<p>Prometheus <strong>pulls</strong> metrics from your application (rather than your application pushing metrics to Prometheus):</p>

<p><strong>Why this matters:</strong></p>
<ul>
  <li>Your API doesn’t need to know about Prometheus</li>
  <li>Prometheus controls scrape frequency</li>
  <li>Simple and reliable - no complex retry logic needed</li>
  <li>Easy to add/remove Prometheus without changing app code</li>
</ul>

<p><strong>Configuration example:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prometheus/prometheus.yml</span>
<span class="na">scrape_configs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">job_name</span><span class="pi">:</span> <span class="s1">'</span><span class="s">api'</span>
    <span class="na">static_configs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">targets</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">api:8000'</span><span class="pi">]</span>
    <span class="na">scrape_interval</span><span class="pi">:</span> <span class="s">15s</span>
</code></pre></div></div>

<h2 id="whats-in-memory-vs-disk">What’s in Memory vs. Disk</h2>

<p>Regardless of which language you use, the principle is the same: <strong>client libraries store metrics in memory, Prometheus persists them to disk</strong>.</p>

<h3 id="in-your-application-prometheus-client">In Your Application (prometheus-client)</h3>

<p>Metrics are stored as <strong>in-memory data structures</strong> - NOT as text, NOT on disk. In Python, these are objects; in Go, they’re structs; in Java, they’re class instances. The key is: they live in RAM.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">prometheus_client</span> <span class="kn">import</span> <span class="n">Counter</span>

<span class="c1"># This creates a Python object in memory
</span><span class="n">requests_total</span> <span class="o">=</span> <span class="nc">Counter</span><span class="p">(</span><span class="sh">'</span><span class="s">api_requests_total</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Total requests</span><span class="sh">'</span><span class="p">)</span>

<span class="c1"># When you increment, you're just updating a number
</span><span class="n">requests_total</span><span class="p">.</span><span class="nf">inc</span><span class="p">()</span>  <span class="c1"># Internally: self._value = 42 → self._value = 43
</span></code></pre></div></div>

<p>That’s it. Just a float in memory. No disk writes, no database, no persistence.</p>
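<p>You can even read that value back in-process, straight from the client registry (a sketch using <code class="language-plaintext highlighter-rouge">get_sample_value</code>, a lookup helper the library mostly uses in tests):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from prometheus_client import Counter, REGISTRY

requests_total = Counter('api_requests_total', 'Total requests')
requests_total.inc()

# Reads the current in-memory value - no disk, no network involved
print(REGISTRY.get_sample_value('api_requests_total'))  # 1.0
</code></pre></div></div>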

<p><strong>What happens when your API restarts?</strong> (Easy to verify yourself - see the sketch after this list.)</p>
<ul>
  <li>All counters reset to 0</li>
  <li>All metrics are lost</li>
  <li>Your application has no memory of previous values</li>
</ul>
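<p>If you’re running the demo stack from the end of this post, the reset is quick to see (a sketch - it assumes the compose service is named <code class="language-plaintext highlighter-rouge">api</code> and the API is on its default port):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Read the counter, restart the API, read it again
curl -s http://localhost:8000/metrics | grep api_requests_total
docker compose restart api   # assumes the service is named "api"
curl -s http://localhost:8000/metrics | grep api_requests_total   # back to 0
</code></pre></div></div>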

<h3 id="the-metrics-endpoint">The /metrics Endpoint</h3>

<p>The text format is only generated <strong>on-demand</strong> when Prometheus scrapes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">prometheus_client</span> <span class="kn">import</span> <span class="n">generate_latest</span>

<span class="nd">@app.get</span><span class="p">(</span><span class="sh">"</span><span class="s">/metrics</span><span class="sh">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">metrics</span><span class="p">():</span>
    <span class="c1"># This converts Python objects → text, fresh every time
</span>    <span class="k">return</span> <span class="nc">Response</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="nf">generate_latest</span><span class="p">())</span>
</code></pre></div></div>

<p>When you visit <code class="language-plaintext highlighter-rouge">http://localhost:8000/metrics</code>, you see:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># HELP api_requests_total Total number of API requests
# TYPE api_requests_total counter
api_requests_total{endpoint="/api/data",status="success"} 42.0

# HELP api_request_duration_seconds API request latency
# TYPE api_request_duration_seconds histogram
api_request_duration_seconds_sum{endpoint="/api/data"} 6.3
api_request_duration_seconds_count{endpoint="/api/data"} 42
</code></pre></div></div>

<p>This is just <strong>a snapshot</strong> of what’s currently in memory. It’s generated fresh on every request.</p>

<h3 id="in-prometheus-the-database">In Prometheus (The Database)</h3>

<p>Prometheus is where persistence happens:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Every 15 seconds:
1. Prometheus → GET http://api:8000/metrics
2. Parses the text response
3. Writes to its time-series database on disk (/prometheus/data/, a WAL plus compacted blocks)

Time-series database:
  2:00:00 PM → api_requests_total = 1200
  2:00:15 PM → api_requests_total = 1247
  2:00:30 PM → api_requests_total = 1289
  ...continues for 15 days (configurable)
</code></pre></div></div>

<p>Prometheus <strong>stores the history</strong> so you can see trends, calculate rates, and query past data.</p>
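<p>The 15-day window is just the default; retention is a startup flag on the Prometheus binary. A sketch of how you’d extend it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Keep 30 days of history instead of the default 15
prometheus --config.file=/etc/prometheus/prometheus.yml \
           --storage.tsdb.retention.time=30d
</code></pre></div></div>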

<p><img src="/assets/posts/observability/prometheus-graph.png" alt="Prometheus storing time-series data" />
<em>Prometheus showing time-series data - notice how it tracks changes over time</em></p>

<h3 id="the-elegant-separation">The Elegant Separation</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────┐
│ Your API (RAM only)          │
│ Current state:               │
│ requests_total = 1247        │
│                              │
│ ✗ No disk writes             │
│ ✗ No history                 │
│ ✗ Lost on restart            │
└──────────┬───────────────────┘
           │ Scrapes every 15s
           ▼
┌──────────────────────────────┐
│ Prometheus (Disk + RAM)      │
│ Time-Series Database:        │
│ 2:00:00 → requests = 1200    │
│ 2:00:15 → requests = 1247    │
│ 2:00:30 → requests = 1289    │
│                              │
│ ✓ Persisted to disk          │
│ ✓ Full history (15 days)     │
│ ✓ Survives API restarts      │
└──────────────────────────────┘
</code></pre></div></div>

<p><strong>Why this is elegant:</strong></p>
<ul>
  <li>Your API stays fast (no disk I/O)</li>
  <li>Prometheus handles the hard parts (storage, retention, querying)</li>
  <li>If your API crashes, historical data survives</li>
</ul>

<h2 id="the-complete-flow">The Complete Flow</h2>

<p>Let’s trace a single request through the entire stack:</p>

<h3 id="step-1-request-comes-in">Step 1: Request Comes In</h3>

<p>A user hits your API endpoint. This is just your normal application code - no observability logic here:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@app.get</span><span class="p">(</span><span class="sh">"</span><span class="s">/api/data</span><span class="sh">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_data</span><span class="p">():</span>
    <span class="c1"># Your normal endpoint logic
</span>    <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">data</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}</span>
</code></pre></div></div>

<h3 id="step-2-middleware-tracks-it">Step 2: Middleware Tracks It</h3>

<p>Before and after your endpoint runs, middleware captures timing and updates metrics in memory:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@app.middleware</span><span class="p">(</span><span class="sh">"</span><span class="s">http</span><span class="sh">"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">track_metrics</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="n">call_next</span><span class="p">):</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>

    <span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">call_next</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>

    <span class="c1"># Update in-memory counters
</span>    <span class="n">duration</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span>
    <span class="n">requests_total</span><span class="p">.</span><span class="nf">labels</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="sh">"</span><span class="s">/api/data</span><span class="sh">"</span><span class="p">).</span><span class="nf">inc</span><span class="p">()</span>
    <span class="n">request_duration</span><span class="p">.</span><span class="nf">observe</span><span class="p">(</span><span class="n">duration</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>

<p><strong>What’s actually happening internally:</strong> Just updating Python variables in RAM:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Simplified view of what happens
</span><span class="n">self</span><span class="p">.</span><span class="n">_value</span> <span class="o">=</span> <span class="mi">42</span>  <span class="c1"># Now it's 43
</span><span class="n">self</span><span class="p">.</span><span class="n">_sum</span> <span class="o">+=</span> <span class="mf">0.123</span>  <span class="c1"># Add the duration
</span></code></pre></div></div>

<h3 id="step-3-prometheus-scrapes-every-15-seconds">Step 3: Prometheus Scrapes (Every 15 Seconds)</h3>

<p><strong>1. Prometheus sends request:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET http://api:8000/metrics
</code></pre></div></div>

<p><strong>2. API responds:</strong></p>
<ul>
  <li>Runs <code class="language-plaintext highlighter-rouge">generate_latest()</code></li>
  <li>Converts Python objects → text format</li>
  <li>Returns:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>api_requests_total{endpoint="/api/data"} 43.0
api_request_duration_seconds_sum{...} 6.3
</code></pre></div>    </div>
  </li>
</ul>

<p><strong>3. Prometheus processes:</strong></p>
<ul>
  <li>Parses text response</li>
  <li>Writes to disk</li>
  <li>Stores: <code class="language-plaintext highlighter-rouge">(timestamp: 2:00:15, metric: api_requests_total, value: 43)</code></li>
</ul>

<p><img src="/assets/posts/observability/prometheus-targets.png" alt="Prometheus scraping targets" />
<em>The Prometheus targets page showing active scraping - notice the “UP” status and last scrape time</em></p>
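<p>You don’t have to wait for the targets page to see a scrape - a scrape is just an HTTP GET, so you can impersonate Prometheus with curl (assuming the demo’s default port):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Exactly what Prometheus does every 15 seconds
curl -s http://localhost:8000/metrics | grep api_requests_total
</code></pre></div></div>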

<h3 id="step-4-grafana-queries-prometheus">Step 4: Grafana Queries Prometheus</h3>

<p>When you open a dashboard, Grafana queries historical data:</p>

<p><strong>1. User action:</strong></p>
<ul>
  <li>Opens Grafana dashboard</li>
</ul>

<p><strong>2. Grafana sends query:</strong></p>
<ul>
  <li>Sends PromQL query to Prometheus: <code class="language-plaintext highlighter-rouge">rate(api_requests_total[1m])</code></li>
</ul>

<p><strong>3. Prometheus responds:</strong></p>
<ul>
  <li>Queries its time-series database</li>
  <li>Returns data points:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(2:00:00, 0.4), (2:00:15, 0.4), (2:00:30, 0.47)]
</code></pre></div>    </div>
  </li>
</ul>

<p><strong>4. Grafana renders:</strong></p>
<ul>
  <li>Draws the graph with the returned data</li>
</ul>

<p><img src="/assets/posts/observability/grafana-dashboard.png" alt="Full Grafana dashboard" />
<em>The complete Grafana dashboard showing request rate, latency, and request distribution</em></p>

<p>The API has <strong>no idea</strong> any of this is happening. It just keeps updating numbers in RAM.</p>
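<p>Grafana isn’t doing anything magical either - it calls Prometheus’s HTTP query API, which you can hit yourself (a sketch against the demo’s default Prometheus port):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The same kind of query Grafana sends, as a raw API call
curl -s 'http://localhost:9090/api/v1/query?query=rate(api_requests_total[1m])'
# Returns JSON like: {"status":"success","data":{"resultType":"vector","result":[...]}}
</code></pre></div></div>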

<h2 id="reference-metrics-and-queries">Reference: Metrics and Queries</h2>

<h3 id="metric-types">Metric Types</h3>

<p><strong>Counter</strong> (only goes up):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">requests_total</span> <span class="o">=</span> <span class="nc">Counter</span><span class="p">(</span><span class="sh">'</span><span class="s">api_requests_total</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Total requests</span><span class="sh">'</span><span class="p">)</span>
<span class="n">requests_total</span><span class="p">.</span><span class="nf">inc</span><span class="p">()</span>  <span class="c1"># 0 → 1 → 2 → 3 ...
</span></code></pre></div></div>

<p><strong>Histogram</strong> (distribution):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">latency</span> <span class="o">=</span> <span class="nc">Histogram</span><span class="p">(</span><span class="sh">'</span><span class="s">latency_seconds</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Latency</span><span class="sh">'</span><span class="p">)</span>
<span class="n">latency</span><span class="p">.</span><span class="nf">observe</span><span class="p">(</span><span class="mf">0.234</span><span class="p">)</span>  <span class="c1"># Records a single value
# Automatically creates buckets for percentile calculations
</span></code></pre></div></div>

<p><strong>Gauge</strong> (can go up or down):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">active</span> <span class="o">=</span> <span class="nc">Gauge</span><span class="p">(</span><span class="sh">'</span><span class="s">active_requests</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Active requests</span><span class="sh">'</span><span class="p">)</span>
<span class="n">active</span><span class="p">.</span><span class="nf">inc</span><span class="p">()</span>  <span class="c1"># Increment
</span><span class="n">active</span><span class="p">.</span><span class="nf">dec</span><span class="p">()</span>  <span class="c1"># Decrement
</span></code></pre></div></div>
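<p>Gauges also ship convenience helpers. For instance, reusing the same <code class="language-plaintext highlighter-rouge">active</code> gauge (<code class="language-plaintext highlighter-rouge">handle_request()</code> is a hypothetical stand-in for your own code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>with active.track_inprogress():  # inc() on entry, dec() on exit
    handle_request()             # hypothetical work being measured
</code></pre></div></div>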

<h3 id="promql-examples">PromQL Examples</h3>

<p>PromQL (Prometheus Query Language) uses a functional syntax with built-in aggregation functions. Here are common queries you’ll use:</p>

<pre><code class="language-promql"># Total requests
sum(api_requests_total)

# Request rate (per second)
rate(api_requests_total[1m])

# P95 latency
histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m]))

# Average latency by endpoint
rate(api_request_duration_seconds_sum[5m])
  /
rate(api_request_duration_seconds_count[5m])
</code></pre>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>I’ve created a minimal example that demonstrates all of this: <a href="https://github.com/brianhliou/observability-starter">observability-starter</a></p>

<p><strong>Get it running in 60 seconds:</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/brianhliou/observability-starter
<span class="nb">cd </span>observability-starter
make up

<span class="c"># Wait ~30 seconds, then open:</span>
<span class="c"># - API:        http://localhost:8000</span>
<span class="c"># - Prometheus: http://localhost:9090</span>
<span class="c"># - Grafana:    http://localhost:3002</span>
</code></pre></div></div>

<p>The repo includes:</p>
<ul>
  <li>Minimal FastAPI app with 4 endpoints</li>
  <li>Full docker-compose stack</li>
  <li>Pre-configured Grafana dashboard</li>
  <li>Load testing script</li>
  <li>Detailed README</li>
</ul>

<p><strong>What you get:</strong></p>
<ul>
  <li>See the <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint in plain text</li>
  <li>Watch Prometheus scrape in real-time</li>
  <li>Generate load and see graphs update</li>
  <li>Three separate applications, all working together</li>
</ul>

<h2 id="common-gotchas">Common Gotchas</h2>

<h3 id="1-cardinality-explosion">1. Cardinality Explosion</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># BAD: Unbounded labels
</span><span class="n">requests</span><span class="p">.</span><span class="nf">labels</span><span class="p">(</span><span class="n">user_id</span><span class="o">=</span><span class="n">user_id</span><span class="p">).</span><span class="nf">inc</span><span class="p">()</span>  <span class="c1"># Millions of users!
</span>
<span class="c1"># GOOD: Bounded labels
</span><span class="n">requests</span><span class="p">.</span><span class="nf">labels</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="n">endpoint</span><span class="p">,</span> <span class="n">status</span><span class="o">=</span><span class="n">status</span><span class="p">).</span><span class="nf">inc</span><span class="p">()</span>
</code></pre></div></div>

<p>Keep label cardinality low - dozens of label combinations, not millions. Every unique combination becomes its own time series in Prometheus: 10 endpoints × 5 status codes is 50 series, while one label value per user ID is a series per user.</p>

<h3 id="2-counters-reset-on-restart">2. Counters Reset on Restart</h3>

<p>When your API restarts, counters go to zero. Prometheus handles this with the <code class="language-plaintext highlighter-rouge">rate()</code> function, which calculates the per-second rate and handles resets automatically.</p>
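<p>A sketch of what <code class="language-plaintext highlighter-rouge">rate()</code> sees across a restart (made-up numbers):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Samples:  2:00:00 → 1200   2:00:15 → 1247   (restart)   2:00:30 → 12

rate() sees the drop (1247 → 12), assumes the counter reset to zero,
and treats the 12 as a fresh increase - no huge negative spike.
</code></pre></div></div>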

<h3 id="3-memory-footprint">3. Memory Footprint</h3>

<p>Even with millions of requests, prometheus-client uses minimal memory:</p>
<ul>
  <li>Counter: Just a float (8 bytes)</li>
  <li>Histogram: Sum + count + bucket counts (~100 bytes)</li>
</ul>

<p>The API doesn’t store individual requests - just aggregates.</p>
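<p>A quick way to convince yourself, with a throwaway histogram (a sketch - the metric name is made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
from prometheus_client import Histogram

demo = Histogram('demo_latency_seconds', 'Throwaway latency histogram')

# A million observations...
for _ in range(1_000_000):
    demo.observe(random.random())

# ...but the histogram still holds only its bucket counters, a sum,
# and a count. None of the individual observations are retained.
</code></pre></div></div>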

<h2 id="when-to-use-this-stack">When to Use This Stack</h2>

<h3 id="prometheusgrafana-is-a-good-fit-if">Prometheus/Grafana is a good fit if:</h3>
<ul>
  <li>You want <strong>open-source</strong> with no vendor lock-in</li>
  <li>You’re okay <strong>running infrastructure</strong> (Kubernetes, VMs, Docker)</li>
  <li>You need <strong>powerful querying</strong> (PromQL)</li>
  <li>Cost predictability matters (no per-host or per-metric pricing)</li>
</ul>

<h3 id="what-alternatives-exist">What Alternatives Exist</h3>

<p>Most large tech companies use managed platforms instead:</p>

<p><strong>SaaS Platforms:</strong></p>
<ul>
  <li><strong>Datadog</strong> - Used by Airbnb, Peloton, Samsung</li>
  <li><strong>New Relic, Dynatrace, Splunk</strong> - Popular in enterprises</li>
</ul>

<p><strong>Cloud-Native:</strong></p>
<ul>
  <li><strong>CloudWatch</strong> (AWS), <strong>Azure Monitor</strong>, <strong>Google Cloud Operations</strong></li>
</ul>

<p><strong>What they provide that Prometheus/Grafana don’t:</strong></p>
<ul>
  <li><strong>Unified observability:</strong> Metrics + logs + traces in one platform (Prometheus is metrics-only)</li>
  <li><strong>No infrastructure:</strong> They handle HA, scaling, backups</li>
  <li><strong>Advanced features:</strong> APM, distributed tracing, anomaly detection, log analysis</li>
  <li><strong>Better UX:</strong> Pre-built dashboards, faster onboarding, integrated alerting</li>
</ul>

<p><strong>The trade-off:</strong></p>
<ul>
  <li><strong>Prometheus/Grafana:</strong> Lower cost at scale (no license fees, though you still pay in infrastructure and ops time, vs. $50k-500k+/year), full control, no vendor lock-in</li>
  <li><strong>Managed platforms:</strong> Faster setup, more features, less operational burden</li>
</ul>

<p><strong>Common pattern:</strong> Many companies use <strong>both</strong> - Prometheus for internal metrics, managed platforms for application observability.</p>

<h2 id="what-you-learned">What You Learned</h2>

<p>By now you should understand:</p>

<p>✓ <strong>prometheus-client</strong> stores metrics in RAM, not disk<br />
✓ <strong>Prometheus</strong> is a separate application that scrapes and persists<br />
✓ <strong>Grafana</strong> is another separate application that visualizes<br />
✓ Text format is generated on-demand, not stored<br />
✓ The pull model means your API stays simple<br />
✓ Historical data survives API restarts</p>

<p>You now have the foundation to implement observability in your own applications. Clone the <a href="https://github.com/brianhliou/observability-starter">observability-starter</a> repo and start experimenting.</p>

<hr />

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://github.com/brianhliou/observability-starter">observability-starter repo</a> - Working example</li>
  <li><a href="https://prometheus.io/docs/">Prometheus Documentation</a></li>
  <li><a href="https://grafana.com/docs/">Grafana Documentation</a></li>
  <li><a href="https://github.com/prometheus/client_python">Prometheus Python Client</a></li>
</ul>]]></content><author><name>Brian Liou</name><uri>https://brianhliou.com</uri></author><category term="observability" /><category term="monitoring" /><category term="python" /><summary type="html"><![CDATA[Understanding the observability stack: prometheus-client vs Prometheus vs Grafana, and how they work together to monitor your APIs.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://brianhliou.com/assets/img/og-default.png" /><media:content medium="image" url="https://brianhliou.com/assets/img/og-default.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>