MistyBanqi is the Banqi (Chinese Dark Chess) engine that plays on mistboard.com: a classical alpha-beta search engine with a handcrafted evaluation, written in Rust, with no neural network in it. I started its evaluation from george0828Zhang’s open-source CDC engine, and kept that engine as the fixed opponent I measured every change against. It’s now competent, about even with that benchmark, and open-source. Almost all the progress came from fixing how I measured strength, not the engine.

The rules are at mistboard.com/rules/banqi, with a playable board. The one rule that matters here: a move is either an ordinary move or capture, or a flip that turns over a face-down tile and reveals a random piece from the bag of unrevealed ones. A flip is a chance event, so on top of ordinary alpha-beta the search carries chance nodes, which it handles with Star1 expectiminimax; the rest is the usual machinery, a transposition table and quiescence. Everything that scores a position is hand-written, which is the whole subject of what follows.

Don’t measure against yourself

My first upgrade was the usual search machinery, a transposition table and repetition detection. I measured it against an earlier build of my own engine in paired self-play. It scored 60.9%, about +77 Elo, and every statistic said the gain was real.
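For reference, the step from a match score to an Elo difference is the standard logistic formula. A minimal sketch, with draws counted as half a point; the function names are illustrative and not taken from the engine:

```rust
/// Elo difference implied by a match score strictly between 0 and 1,
/// where a draw counts as half a point.
/// Standard logistic Elo model: 400 * log10(s / (1 - s)).
/// `elo_from_score` is an illustrative name, not part of MistyBanqi.
pub fn elo_from_score(score: f64) -> f64 {
    assert!(score > 0.0 && score < 1.0, "score must be strictly between 0 and 1");
    400.0 * (score / (1.0 - score)).log10()
}

/// Inverse mapping: the expected score for a given Elo advantage.
pub fn score_from_elo(elo: f64) -> f64 {
    1.0 / (1.0 + 10.0_f64.powf(-elo / 400.0))
}
```

`elo_from_score(0.609)` comes out at roughly 77, which is the figure quoted above, and an even 50% result maps to 0.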

Then I ran the same change against the reference engine. The +77 Elo was gone; the result was even. A bot tuned against itself optimizes for beating its own blind spots, and the self-play number was measuring my engine’s idea of a hard opponent, which is just my engine. So I made that engine the yardstick and ran every later change as a large paired bakeoff against it, a few hundred games a side fanned across containers on Modal so each run lands in under an hour. Banqi draws a lot, so a few-percent edge needs that many games to see; the 20-game matches I started with were noise.
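To put a rough number on why 20 games is noise, here is a sketch of the usual normal-approximation arithmetic. It assumes independent games (paired openings are correlated, so treat it as a ballpark), and the win and draw rates in the usage note are made up for illustration, not measured:

```rust
/// Per-game standard deviation of the score (win = 1, draw = 0.5, loss = 0)
/// for the given win and draw probabilities.
pub fn score_std_dev(p_win: f64, p_draw: f64) -> f64 {
    let mean = p_win + 0.5 * p_draw;
    // E[x^2] = p_win * 1^2 + p_draw * 0.5^2
    (p_win + 0.25 * p_draw - mean * mean).sqrt()
}

/// Half-width of an approximate 95% confidence interval on the match
/// score after `games` independent games.
pub fn ci95_half_width(p_win: f64, p_draw: f64, games: u32) -> f64 {
    1.96 * score_std_dev(p_win, p_draw) / (games as f64).sqrt()
}

/// Games needed before the 95% half-width shrinks to `edge`.
pub fn games_needed(p_win: f64, p_draw: f64, edge: f64) -> u32 {
    let n = (1.96 * score_std_dev(p_win, p_draw) / edge).powi(2);
    n.ceil() as u32
}
```

With an assumed 28% wins and 50% draws (a 53% score), `ci95_half_width(0.28, 0.5, 20)` is about 0.15, so a 20-game match cannot tell 53% from 40% or 65%. `games_needed(0.28, 0.5, 0.03)` lands a bit over 500 games.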

Tuning the evaluation

The first thing the honest yardstick found was a bug I’d been carrying the whole time, in the piece-value table. Here is the corrected version:

Piece Value
General 30
Cannon 16 (was 12)
Advisor 14
Elephant 11
Chariot 9 (was 14)
Horse 7
Soldier 4

The cannon and the chariot were backwards. The cannon captures by jumping a screen piece, which makes it the most tactically dangerous piece on the board even though it sits low in the capture order; the chariot is a plain slider. That one fix was worth a few points by itself and most of the eval gain. Self-play could never have shown it to me, because every version of my own engine carried the same wrong table; it took an outside opponent to expose it.

The more interesting tuning is that a piece’s value isn’t really a constant. It depends on what’s left on the board. The reference engine already did a small version of this for the general, the one piece a soldier can capture: its value climbs as enemy soldiers come off, because the only thing that threatens it is disappearing. I generalized that to every piece. Each piece has a set of enemy pieces that can capture it, and as those are traded away its value rises toward untouchable. That term, adaptive domination, was the single biggest eval gain, and the whole stack reached +16.6% against the reference engine with two-thirds fewer losses. It still gave +8.7% when I deliberately paired it with the old, wrong value table, which is the sign it captures real Banqi structure instead of numbers fit to one opponent.

The whole term is a dozen lines. For every piece on the board, count the enemy pieces still alive (on the board or in the bag) that could capture it, and scale a bonus inversely:

// dom_val: a piece's value grows as the enemy pieces that can capture it
// disappear — toward "immortal" when its dominators are gone.
const DOM_K: f64 = 0.5;
for i in 0..NSQ {
    let c = self.sq[i];
    if !is_piece(c) { continue; }
    let role = code_role(c);
    let enemy = (1 - code_color(c)) as usize;
    // living enemy pieces that can capture this role
    // (role 5 = cannon, which screen-captures anything, so it always counts)
    let mut dominators = 0;
    for d in 0..7 {
        if d == 5 || can_capture(d, role) { dominators += alive[enemy][d]; }
    }
    let bonus = values[role] * DOM_K / (1.0 + dominators as f64);
    total += if code_color(c) == persp { bonus } else { -bonus };
}

A general with the enemy soldiers all traded off, or any piece whose capturers have left the board, drifts toward untouchable, which is exactly the intuition a fixed table can’t hold. The full function is in the open-source repository.
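The excerpt above depends on engine internals, so here is a standalone toy version to make the arithmetic concrete. The role numbering (chosen so that 5 is the cannon, as in the excerpt), the simplified `can_capture` rule, and the starting piece counts are assumptions of this sketch, not the engine's actual definitions:

```rust
// Assumed role numbering: 0 General, 1 Advisor, 2 Elephant, 3 Chariot,
// 4 Horse, 5 Cannon, 6 Soldier. Values are the corrected table above.
pub const VALUES: [f64; 7] = [30.0, 14.0, 11.0, 9.0, 7.0, 16.0, 4.0];
// Standard Banqi set per side: 1 general, 2 of each middle piece, 5 soldiers.
pub const START_COUNTS: [u32; 7] = [1, 2, 2, 2, 2, 2, 5];
pub const DOM_K: f64 = 0.5;

/// Simplified rule for ordinary adjacent captures (assumption of this sketch).
pub fn can_capture(attacker: usize, target: usize) -> bool {
    match attacker {
        0 => target != 6,                // general takes anything but a soldier
        6 => target == 0 || target == 6, // soldier takes the general or a soldier
        5 => false,                      // cannon captures by jumping; counted separately
        _ => attacker <= target,         // otherwise: equal or lower rank
    }
}

/// Living enemy pieces that could capture a piece of `role`.
/// The cannon (5) always counts, mirroring the excerpt.
pub fn dominators(role: usize, enemy_alive: &[u32; 7]) -> u32 {
    (0..7)
        .filter(|&d| d == 5 || can_capture(d, role))
        .map(|d| enemy_alive[d])
        .sum()
}

/// Bonus that grows as the dominators disappear.
pub fn dom_bonus(role: usize, enemy_alive: &[u32; 7]) -> f64 {
    VALUES[role] * DOM_K / (1.0 + dominators(role, enemy_alive) as f64)
}
```

Under these assumptions a general facing a full enemy set has 8 dominators (the enemy general, two cannons, five soldiers) and a bonus of 15 / 9, about 1.7. Once those are all gone the bonus is the full 15.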

The static eval terms plateaued, so next I went after depth. Two things stopped me.

First, the flips. At a flip node the engine has to average over every piece the tile might turn out to be, so a single flip fans out into a dozen-odd weighted outcomes, and there are many flippable tiles. That chance branching makes the tree explode; you can’t naively search deep in Banqi the way you can in chess. (Star1 expectiminimax prunes the chance branches, which helps, but it doesn’t make the problem go away.)
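A sketch of what a flip node computes, with only the cutoff test of Star1: it keeps bounds on the expectation and stops once they fall outside the alpha-beta window. Full Star1 also narrows the window passed to each child, which is left out here. The signature, the fail-hard return values, and the closure standing in for the recursive search are assumptions of this sketch:

```rust
/// Expected value of flipping one tile, with a Star1-style cutoff.
/// `bag_counts[i]` is how many copies of piece type `i` are still unrevealed,
/// `v_min`/`v_max` bound any position value, and `search_outcome(i)` stands in
/// for searching the position after type `i` is revealed.
pub fn flip_value<F: FnMut(usize) -> f64>(
    bag_counts: &[u32],
    v_min: f64,
    v_max: f64,
    alpha: f64,
    beta: f64,
    mut search_outcome: F,
) -> f64 {
    let total: u32 = bag_counts.iter().sum();
    if total == 0 {
        return 0.0; // empty bag: the caller should not generate a flip at all
    }
    let mut acc = 0.0; // sum of p_i * v_i over outcomes searched so far
    let mut remaining = 1.0; // probability mass not yet searched
    for (i, &n) in bag_counts.iter().enumerate() {
        if n == 0 {
            continue;
        }
        let p = n as f64 / total as f64;
        acc += p * search_outcome(i);
        remaining -= p;
        // Even if every unsearched outcome were the best possible,
        // the expectation cannot exceed alpha: fail low.
        if acc + remaining * v_max <= alpha {
            return alpha;
        }
        // Even if every unsearched outcome were the worst possible,
        // the expectation already reaches beta: fail high.
        if acc + remaining * v_min >= beta {
            return beta;
        }
    }
    acc
}
```

With a wide-open window this is the plain weighted average. With a tight window it can stop after the first outcome, which is where the savings come from. The explosion remains because every outcome that is searched is itself a full subtree.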

Second, and more important: even with more nodes, the results stopped moving. Past a few hundred thousand nodes per move, deeper play didn’t beat the reference engine any better. The bottleneck wasn’t depth. It was the evaluation, the function that scores the positions where the search bottoms out. So I stopped trying to search my way to strength and started watching games. A win-rate tells you that you’re losing; it never tells you which move to look at.

Pathology one: it draws games it has won

Here is one such game. Misty (red) is up ten pieces to two and still draws by threefold repetition. Two problems are behind it, and both are about how the engine understands draws. First, in production it was handed only the current position with no history, so it was blind to the repetition it was walking into. Second, its evaluation gives no reward for converting, so a position it’s winning by a mile and a position it has actually won score about the same; it had no reason to make progress instead of shuffling.

The fix threaded the real game’s move history into the search, so the engine sees a repetition coming and avoids it when ahead, seeking it only when losing (a small contempt setting). That was worth about +53 Elo. A separate guard handles the ugliest version, where the engine sheds a piece into a losing capture and then takes the draw anyway, strictly worse than just taking it. That one looked like a contempt bug, but it reproduced at contempt zero; the real cause was the eval scoring the losing capture a hair above the draw value.
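A minimal sketch of the two pieces involved: counting repetitions across the real game history plus the current search path, and scoring a draw with contempt. It assumes positions are identified by 64-bit hash keys; the function names and the centipawn-style integer scores are illustrative, not the engine's:

```rust
/// How many times `key` already appears in the real game's history plus
/// the positions reached so far inside the current search line.
pub fn occurrences(game_history: &[u64], search_path: &[u64], key: u64) -> usize {
    game_history
        .iter()
        .chain(search_path.iter())
        .filter(|&&k| k == key)
        .count()
}

/// True if reaching `key` now would be its third occurrence.
pub fn is_threefold(game_history: &[u64], search_path: &[u64], key: u64) -> bool {
    occurrences(game_history, search_path, key) >= 2
}

/// Value of a draw from the side to move's point of view.
/// A positive `contempt` makes a draw slightly bad for the engine, so it
/// only heads for one when its alternatives score below `-contempt`.
pub fn draw_score(engine_to_move: bool, contempt: i32) -> i32 {
    if engine_to_move { -contempt } else { contempt }
}
```

Without the game history, the first argument is empty and a position the real game has already visited twice looks brand new to the search. That is the blindness described above.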

Pathology two: it lets its general get hunted

The soldier is the only piece that can capture the general, so a loose general is in real danger. In this game, a single enemy soldier hunts Misty’s general down the a-file.

Misty (black, here) drifts its general into the a1 corner. One red soldier marches down the file, and the general is walled in: the squares next to it are still face-down, you can’t move onto a face-down tile, and a general can’t capture a soldier. The general ends with no legal move at all, frozen in place, and is captured 28 moves later.

The save is what players call making luft: flip a face-down neighbour to open the general an escape before the hunter arrives. I added a general-safety term that induces exactly that, and it cut how often Misty loses its own general from 35.5% to 26%. Look at what measuring that took: against the reference engine the win-rate barely moved, because it doesn’t hunt generals and Misty usually won these games anyway. The benchmark couldn’t see the fix, so I measured the disaster directly instead of trusting the score. (I also caught myself asking the engine’s own eval whether a save worked. An eval that’s blind to the danger can’t grade a fix for its own blindness, so I checked it by hand.)
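As a rough illustration of the condition such a term has to notice, the sketch below flags a general with no empty neighbour but at least one face-down neighbour it could flip. This is a hypothetical stand-in, not the engine's actual general-safety term, and the `Cell` type and board layout are invented for the example:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Cell {
    Empty,
    FaceDown,
    Occupied, // any revealed piece
}

/// Orthogonal neighbours of (r, c) that lie on a rows x cols board.
pub fn neighbours(rows: usize, cols: usize, r: usize, c: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::with_capacity(4);
    if r > 0 { out.push((r - 1, c)); }
    if r + 1 < rows { out.push((r + 1, c)); }
    if c > 0 { out.push((r, c - 1)); }
    if c + 1 < cols { out.push((r, c + 1)); }
    out
}

/// Number of neighbours of (r, c) that are in state `kind`.
pub fn count_adjacent(board: &[Vec<Cell>], r: usize, c: usize, kind: Cell) -> usize {
    let rows = board.len();
    let cols = if rows > 0 { board[0].len() } else { 0 };
    neighbours(rows, cols, r, c)
        .into_iter()
        .filter(|&(nr, nc)| board[nr][nc] == kind)
        .count()
}

/// True when the piece at (r, c) has no empty square to step to but does
/// have a face-down neighbour. A flip reveals a piece in place, so whether
/// it actually helps depends on what turns up; this only flags the
/// walled-in condition.
pub fn should_make_luft(board: &[Vec<Cell>], r: usize, c: usize) -> bool {
    count_adjacent(board, r, c, Cell::Empty) == 0
        && count_adjacent(board, r, c, Cell::FaceDown) > 0
}
```

A real term would also weigh how far away the enemy soldiers are and what is left in the bag. The point here is only that the danger is visible on the board long before the soldier arrives.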

This is the honest edge of what handcrafting reaches. The threat is a slow, quiet, multi-move march that no static term sees until it’s too late, which is exactly the long-horizon judgment a learned evaluation is for.

The last measurement: is the climb worth paying for?

The cheap gains are spent, and both pathologies point at the same fix: an evaluation that understands positions instead of counting them. That’s a learned value network, the AlphaZero recipe CLAP_CDC used to win the Computer Olympiad at this game.

So I built the pipeline (board encoding, a small ResNet, chance-aware MCTS, gated self-play so a bad generation can’t poison the next), and then, before paying for a full run, made one more measurement: a cheap local de-risk to check whether the value net would actually beat the hand-tuned engine.

It didn’t. I ran the de-risk on my laptop’s GPU, and across three runs the network capped around a 35% win-rate against the alpha-beta engine, never reaching the 55% that would count as an improvement. Doubling the search budget barely moved it, so this isn’t a case of too little search; the likeliest causes are the noisy value targets the chance nodes produce and a network that’s too small. Those are fixable, but testing the fixes needs far more compute than a laptop that throttles after a few hours. Cloud GPUs aren’t plug-and-play here either: the game logic is single-threaded Python, so a fast GPU sits starved waiting on it until that path is rewritten in Rust behind a batched inference server. That’s a week or two of work, then hundreds to low thousands of dollars to run. On the evidence I have, it’s a bet, not a sure step, so I haven’t taken it; the cheaper local experiments come first.

So MistyBanqi is a competent alpha-beta engine, not state of the art, and that’s a measured choice, not a dead end. The top programs (CLAP_CDC, DarkKnight) are closed-source and stronger; the learned net is the next climb, for the day a de-risk says it’s worth paying for.

The pattern under every section here: the cheapest, highest-leverage decision in the whole project was choosing what to measure. With the self-play score, the win-rate, the general-loss rate, and the de-risk gate, picking an instrument that could see what I actually cared about did more each time than any change to the engine itself.
