Toward Solving Shogi4 (4×4 Shogi)

Shogi4 is a public-domain 4×4 drop-shogi by Oca Studios whose rules disappeared from the public web. I recovered them from the official Android package, then built a strong-solution engine around the recovered rules.

This is not a full solution. It is a validated solver, a 2.1-billion-position closed subgame, and a sharded external-memory design tested on closed games but never run at full scale. That boundary is the interesting part: Shogi4 is the first rung above Dōbutsu Shōgi where drops turn a game solve into a distributed-systems problem.

Try the rules directly below. The viewer enforces legal moves, including drops, promotion, and friendly-jumps, but it does not show theoretical results.

Open standalone →

The rest of the post separates the full-game solver design, which is tested but unrun at full scale, from the largest closed subgame, which is actually solved.

The rules

No secondary source has Shogi4’s rules with engine-grade precision. The print-and-play PDF was never archived, and the publisher’s server is down. I recovered the rules from the official Android package: a Python/Kivy app whose game code shipped as an .mp3 that was actually a gzipped tarball of bytecode. Decompiled, its get_possible_squares and can_take_square functions are the authoritative ruleset. The full recovered rules are on Mistboard; this summary is enough to follow the solve.

The game is played on a 4×4 board, five pieces a side, all moving one square per turn:

Shogi4 starting position on a 4 by 4 board, with the first player's crane, fox, raccoon-dog, tapir, and carp facing upward and the second player's pieces facing downward Starting position. Tile ownership is orientation, not color.

Piece	Move
Royal (Crane / Pheasant)	one square, any of 8 directions (a king)
Carp	one square forward
Fox	one square orthogonally
Raccoon-dog	one square diagonally
Tapir	one square forward or forward-diagonal

Three rules make Shogi4 its own engine rather than Dōbutsu re-pointed:

Drops. A captured piece switches sides and returns from hand to any empty square, so material is conserved. The only restriction is no drop on the opponent’s back row.
Promotion. A piece reaching the far row must promote: Carp, Tapir, and Raccoon-dog to a silver-general step set, Fox to a gold. A captured promoted piece reverts to its base.
Friendly-jump. If an allied piece sits on the square adjacent in a legal move direction, a piece may leap it to land two squares away. One ally, no chains, enemies cannot be jumped. This has no Dōbutsu equivalent.

You win by capturing the royal. There is no check or checkmate, moving where you can be captured is legal, and there is no king-march win. Compared with Dōbutsu, the extra piece, wider board, broader promotion set, drop restriction, and friendly-jump rule are enough to require a new engine.

Why full Shogi4 is a distributed problem

Strong solving assigns every position a value (win, loss, or draw) by retrograde analysis, working backward from finished positions, here a captured king. A position is a win if some move reaches a loss, a loss if every move reaches a win, and a draw otherwise.

In chess, captures only remove material, so the solve decomposes into a ladder of ever-smaller endgames you can solve bottom-up. Drops remove that structure. Captured pieces come back, material is fixed, and the graph is a single strongly-connected cycle that has to be solved as one object. That object is large:

Quantity	Value
All-arrangements upper bound (exact)	205,148,532,253,680
Reachable from the start	~3×10¹³
Dense-rank index domain N	410,297,064,507,360

The solver indexes positions by a dense rank, a position-to-integer bijection, so the value table is a flat array and a lookup is O(1). The index must span every legal arrangement, so N (about 4×10¹⁴), not the reachable count, sets the size: at two bits per slot, about 100 TB, or 50 TB after the exact left-right symmetry fold. That makes the full solve a distributed, external-memory computation rather than an in-RAM job.

The full-solve design

The solver is push-based. Each position holds a counter of its undecided children. Terminal positions (a king that can be captured next move) are seeded as wins, and from there values flow backward: a position with a move into a loss becomes a win at once, and a position whose counter reaches zero, meaning every child is now a win, becomes a loss. Values only ever move from decided to undecided, so the computation is a monotone fixpoint.

To scale past one machine, the rank space is partitioned into shards, each owning its slots’ values and counters. Resolving a position generates its predecessors by un-move generation and sends each as a message to the shard that owns it. That routing is the shuffle, and it is the bulk of the work: about 7 predecessor-messages per position, or 15 to 30 PB at full scale.

A sharded retrograde solver superstep: decided values generate predecessors, messages are shuffled by dense rank, each shard updates value and counter arrays on disk, then a barrier starts the next round Each shard owns a slice of the dense-rank table. Messages move computation to the shard that owns the predecessor.

Workers advance in supersteps behind a barrier. Each drains the current round of messages, resolves what it can, and emits the next round. Because the fixpoint is monotone, the final table does not depend on how the space is sharded or on the order messages arrive. Sharding changes the schedule, never the answer, and that property is what makes the computation safe to distribute.

Storage sets the shape of the computation more than compute does. The value array lives sharded on disk, streamed and discarded per superstep rather than held resident, and the exact left-right fold halves both storage and work.

Validation

The engine is checked against an independent Python port of the decompiled app. The index and values are checked separately.

Move generation. perft (the move-tree node count at each depth from the start) matches to the digit:

Depth	Positions
1	8
2	64
3	626
4	6,304
5	68,723
6	769,014

A separate 4,000-position differential test over random playouts (drops, promotions, captures, jumps) finds zero move-list mismatches between the two engines.

The index. The dense rank is a verified bijection: rank-then-unrank is the identity on every small-game index, and the positions it enumerates match a direct enumeration. Its full-game domain is exactly twice the independent arrangement count. That enumerator is gated against Dōbutsu, where it reproduces Tanaka’s published count of 1,567,925,964 positions and the full by-pieces-in-hand breakdown.

The values. Three solvers (two algorithms in Python, one in Rust) agree on every closed sub-game. A local consistency audit finds zero violations: every win has a move to a loss, every loss has only moves to wins, and every draw has neither. Left-right mirroring holds exactly, with zero mismatches, and the predecessor generator matches the true reverse edges.

The small closed games, solved in full, are the first game-theoretic results for Shogi4 (values from the side to move):

Sub-game	Positions	Win	Loss	Draw
2 kings	480	35.0%	0%	65.0%
2 kings + Carp	24,480	75.9%	23.3%	0.7%
2 kings + Fox	24,480	76.7%	23.3%	0%
2 kings + Raccoon-dog	24,480	76.2%	23.8%	0%
2 kings + Carp + Fox	1,164,704	74.5%	17.8%	7.7%

The two-kings result, no losses, is a sanity check: with only kings, the side to move can always avoid a forced capture.

De-risking the full-solve design on a laptop

Before renting a cluster, I tested the distributed design on one laptop against the in-memory solver as the oracle.

Confluence. The sharded solver reproduces the single-machine result at 1, 4, 16, and 64 shards. Propagation runs in up to 54 supersteps, the depth of the longest forced line. The monotone-fixpoint argument says shard count cannot matter; the test confirms it.
Verification. One consistency pass certifies the whole table for less than the solve costs. On {2 kings + carp + fox} the solve runs 5.5 s and the audit 3.0 s, so verification is 55% of the solve, and every value I corrupt is caught (each surfaces as 1 to 5 local violations).
Recovery. Crash-and-replay reproduces each superstep byte-for-byte. Building it caught a real bug: predecessors were collected from a hash set, so replay order diverged from the original. Sorting the output made the solver deterministic, which is what idempotent recovery needs.

The largest closed run

With the design validated, I solved a larger closed game: the two kings plus one of each piece type (Carp, Fox, Raccoon-dog, Tapir), 2,100,849,024 positions. It is a reduced game, not a slice of full Shogi4, since drops conserve material and the real game never reaches a one-of-each configuration. It is the biggest closed game I could use to prove the core at scale.


Instance	Hetzner Cloud CCX43 (16 dedicated vCPU, 64 GB)
OS	Ubuntu 26.04
Mode	single-threaded, fully in memory
Peak memory	~15 GB (work queue packed to 32-bit ranks)
Wall time	5.3 hours
Cost	about $2

This is a single-machine, in-memory solve, not a run of the distributed path. At ~15 GB it fits on a laptop; I ran it on a rented box only to keep mine free for the five hours. The result, from the side to move:

Outcome	Positions	Share
Win	1,662,776,212	79.15%
Loss	339,566,116	16.16%
Draw	98,506,696	4.69%

The counts sum to the position total exactly and pass the audit.

The run also calibrates cost per edge against working-set size:

Positions	ns per edge
1,164,704	327
51,461,568	493
2,100,849,024	697

The 2.1-billion run is the first whose value arrays (about 14 GB) leave CPU cache, so 697 ns/edge is the basis for projecting the full solve. In-cache benchmarks understate this kind of job.

What the full solve would cost

From the 697 ns/edge anchor and the 4×10¹⁴ index, the full solve is roughly 50 to 100 TB of working storage and 130 to 190 core-years of compute: about a million core-hours, or $10,000 to $15,000 in cloud terms.

Two corrections produced that figure. The raw position count was first off by an order of magnitude. The footprint was then mis-sized against the reachable positions (~3×10¹³) rather than the arrangement domain (~4×10¹⁴, about 13 times larger), which is what a dense-rank index must cover.

That index is the awkward but necessary part. A small solver can enumerate reachable positions and keep a HashMap from position key to table slot. At this scale, building that map is already too large. The scalable version gives every arrangement a mathematical address: rank(position) -> integer. That wastes slots on illegal or unreachable arrangements, but it turns the tablebase into flat arrays that can be sharded, streamed, checkpointed, and audited.

The same tradeoff shows up in the backward step. A small in-memory solver can store a compressed sparse row (CSR) list of every predecessor edge, then walk that reverse graph directly. The full solver can’t afford one stored row per edge. It generates predecessors on demand instead: undo the move shape, rank each predecessor, and send the update to the shard that owns that rank. That costs more CPU per edge, but it is the difference between a memory-bound shortcut and an external-memory solver.

I measured that exact tradeoff on a related 4×5 Micro Shogi rung after this estimate: K+P+G per side, 869,287,068 reachable canonical positions. The reachable-HashMap + stored-CSR solver needed ~60.5 GiB peak RSS and 12,385 seconds for the core solve. A KPG-specific dense-rank + unmove solver used a 2,037,557,340-slot rank domain, 2.34× the reachable count, and reproduced the same value counts with ~6.86 GiB peak RSS and 8,464 seconds total solve time.

Method	Index	Predecessors	Peak memory	Single-core time
Reachable + CSR	869M reachable keys	Stored reverse graph	~60.5 GiB	12,385 s core solve
Dense rank + unmove	2.04B rank slots	Generated on demand	~6.86 GiB	8,464 s total solve

That is the engineering trade: dense rank spends slots and rank/unrank CPU to avoid storing the reverse graph. On this rung, memory fell 8.8× and the single-core run finished 1.46× faster because it avoided building and holding the reverse graph.

Why it’s left unrun

The engine, distributed design, verification, and calibration are done. Finishing the full solve needs rented compute, not another idea.

That spend does not clear the bar. The technique is known, so there is no paper in the result; the game is obscure, so there is little demand for its value. Self-funding would buy one number: the game-theoretic value of a game almost no one plays.

The value of the project was the capability. The rules are recovered, the solver is built and validated, and the distributed path is checked against closed games before the full cluster run. Solving the game would confirm a value; it would not change the method.

Resources:

Shogi4 viewer: free play with legal moves only.
Shogi4 rules: the full recovered ruleset.
brianhliou/shogi4: the Rust engine, solver, and validation harnesses.
Solving Dōbutsu Shōgi: the rung below, solved in full.