
I Fine-Tuned Four Tiny Models to Fight Each Other in a Civilization Sim

This started as a simple question: can a 0.6B model, fine-tuned on the right corpus, develop a distinct enough voice to play a character convincingly? Not just generate coherent text — but make strategic decisions in a game world while sounding like itself and not like the other players.

The answer is mostly yes, with caveats. Here's how I built it.

The Concept

Four factions. Four distinct voices. One 10×10 grid world.

Each faction is a LoRA adapter on top of Qwen3-0.6B. Every turn, each faction gets a text description of the world state and must output a structured JSON decision: what to do, why, and a line of in-world flavor text. After 100 turns, the whole run renders as a self-contained HTML chronicle — a narrative history of the simulation.
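A single turn's output has roughly this shape. The field names below are illustrative — the post doesn't pin down the exact schema — but the three-part structure (action, reasoning, flavor) is what each adapter is trained to emit:

```python
import json

# Hypothetical example of one faction's per-turn decision JSON.
# Field names are illustrative, not the project's actual schema.
raw = """
{
  "action": "expand",
  "target": [4, 7],
  "reasoning": "The central cells yield more grain per turn than our frontier.",
  "flavor": "The Ledger records what the land will pay before it spends a soldier."
}
"""
decision = json.loads(raw)
print(decision["action"])
```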

Everything runs locally on an M1 Pro 16GB via MLX.

The Pipeline

The full pipeline has four stages:

1. Corpus collection. Each faction needed source material to train on. I collected prose that matched each faction's intended voice — different registers, different preoccupations, different rhythms. Each corpus ended up with 1,800–4,000 passages.

2. Dataset generation. A larger teacher model (Qwen3-4B-4bit) synthesized training examples from each corpus. For each passage, it was prompted to write a faction decision JSON seeded by that passage's voice and a sample game state. Target: 500 examples per faction. The format was a full chat template with system prompt + user message + assistant JSON completion — the exact format the student model would see at inference time.

3. LoRA fine-tuning. Each of the four factions got its own adapter trained on its 500 examples. Rank 8, ~1.4M trainable parameters, about 0.24% of the base model. Training ran for 600 iterations with checkpoints every 100 steps.

4. Simulation and chronicle. main.py spins up a World, runs 100 turns, and calls each faction's adapter in sequence. The WorldEvent log feeds into a Jinja2 HTML template that renders the chronicle.
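Stage 2 can be sketched as a loop over corpus passages, with the teacher call stubbed out. Everything here is illustrative, not the project's actual code — `teacher_generate` stands in for a call to the Qwen3-4B teacher, and the record shape is a simplified prompt/completion pair rather than the full rendered chat template:

```python
import json

def teacher_generate(prompt: str) -> str:
    # Placeholder: in the real pipeline this calls the 4B teacher model.
    return json.dumps({"action": "research", "flavor": prompt[:40]})

def build_examples(passages, persona, out_path, target=500):
    # One training example per passage, capped at the per-faction target.
    with open(out_path, "w") as f:
        for passage in passages[:target]:
            prompt = (
                f"Faction persona: {persona}\n"
                f"Voice sample: {passage}\n"
                f"Game state: <sampled state>\n"
                f"Write the faction's decision JSON."
            )
            completion = teacher_generate(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```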

What Went Wrong (and How I Fixed It)

The first chronicle was garbage. Every faction sounded identical, decisions were incoherent, and one faction repeated "research" for 25 consecutive turns. Here's what was actually broken.

Prompt/training format mismatch. The training examples included the faction persona in the user message. Inference didn't. The model was trained to reason in a particular voice given a persona description — but at inference time, that cue was missing. The fix was one line: prepend "Faction persona: {description}" to every game-state prompt at inference time.
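The fix, sketched with illustrative names (the post doesn't show the actual prompt-builder):

```python
from dataclasses import dataclass

@dataclass
class Faction:
    name: str
    description: str

def build_prompt(faction: Faction, world_state_text: str) -> str:
    # Training examples carried the persona in the user message, so
    # inference must too — otherwise the cue the adapter was trained
    # on is simply absent.
    return f"Faction persona: {faction.description}\n\n{world_state_text}"
```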

EOS token placement. The chat template for Qwen3 uses <|im_end|> as an end-of-turn marker. Training examples that didn't include this token in the right place caused the model to generate garbage after the JSON. The fix was to use tokenizer.apply_chat_template() for both training example construction and inference formatting, so the token boundaries matched exactly.
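To show where those markers must sit, here is a minimal hand-rolled stand-in for the ChatML layout Qwen3 uses (simplified — the real pipeline calls `tokenizer.apply_chat_template()` so the boundaries come from the tokenizer itself rather than hand-rolled string formatting):

```python
def render_chatml(messages):
    # Each turn is wrapped in <|im_start|>role ... <|im_end|>. If the
    # assistant completion in a training example lacks its closing
    # <|im_end|>, the model never learns to stop after the JSON.
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(out)

messages = [
    {"role": "system", "content": "You are a faction in a grid-world sim."},
    {"role": "user", "content": "Faction persona: archivists.\nGame state: turn 3."},
    {"role": "assistant", "content": '{"action": "gather", "flavor": "The records must be kept."}'},
]
text = render_chatml(messages)
```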

Context poisoning. When a faction declaimed (made a speech), the full flavor text was broadcast to all other factions' memory. By turn 30, every faction was echoing the same phrase back and forth, because it kept appearing in their context and they kept incorporating it. The fix was to summarize declaims in broadcast: "{faction} made a proclamation" instead of the actual text.
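A minimal sketch of that fix (event fields are illustrative):

```python
def summarize_for_broadcast(event: dict) -> str:
    # Declaims are summarized rather than quoted, so a striking phrase
    # can't propagate into every faction's memory and echo forever.
    if event["action"] == "declaim":
        return f"{event['faction']} made a proclamation"
    return f"{event['faction']}: {event['action']}"
```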

Checkpoint overfitting. Training for 600 iterations seemed reasonable, but validation loss bottomed out around iteration 200 for every faction and rose from there. The checkpoints at 600 were the worst-performing weights. I had to manually copy the 0000200_adapters.safetensors file back to adapters.safetensors for each faction after training.
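The manual restore amounts to promoting the iteration-200 checkpoint to the live adapter file. A sketch, with illustrative paths:

```python
import shutil
from pathlib import Path

def restore_checkpoint(adapter_dir: str, step: int = 200) -> None:
    # Copy the best checkpoint (e.g. 0000200_adapters.safetensors)
    # over the final adapters.safetensors that inference loads.
    src = Path(adapter_dir) / f"{step:07d}_adapters.safetensors"
    dst = Path(adapter_dir) / "adapters.safetensors"
    shutil.copyfile(src, dst)
```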

Loop detection. Even after fixing the format mismatch, factions would get stuck in behavioral ruts — researching every turn, or expanding until population collapsed. I added lightweight loop detection in the prompt construction: if the last three entries in a faction's private memory are all the same action, the strategic hint overrides to "take a different action." Crude, but effective.
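The loop check can be sketched in a few lines (memory entry shape and hint wording are illustrative):

```python
def strategic_hint(memory: list, default_hint: str) -> str:
    # If the faction's last three recorded actions are identical,
    # override the strategic hint to break the behavioral rut.
    recent = [m["action"] for m in memory[-3:]]
    if len(recent) == 3 and len(set(recent)) == 1:
        return (f"You have chosen '{recent[0]}' three turns in a row. "
                f"Take a different action.")
    return default_hint
```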

What the Final Output Looked Like

After fixing all of the above:

T094 [LoRA] chorus   attack   chorus attacks hollow and is repelled (−10 pop)

That single line at turn 94 was the first combat event — after 93 turns of maneuvering. The Ledger expanded aggressively toward the center of the map. The Archive accumulated grain and stone but stayed cautious. The Chorus declaimed frequently, built territory, and eventually attacked. The Hollow stockpiled iron and built forges.

Final state after 100 turns:

archive  pop= 126  cells=  4  resources={grain: 42, stone: 0}   research=3
ledger   pop=  64  cells= 13  resources={grain: 19, stone: 2}   research=0
chorus   pop=  21  cells= 19  resources={grain: 6, stone: 3}    research=0
hollow   pop= 100  cells=  7  resources={grain: 7, stone: 16}   research=0

Ledger dominated by territory. Hollow dominated by iron. Archive survived by staying small and resource-rich. Chorus overexpanded and collapsed its population: classic overreach.

None of this was scripted. It emerged from four small models making decisions in a shared world.

What I'd Do Differently

More training data. 500 examples is thin. The models were capable of distinct voices but fell back on repetitive patterns under novel game states. 2,000 examples per faction would likely produce more robust behavior.

Reward signal instead of SFT. Supervised fine-tuning on teacher-generated examples is a blunt instrument. The teacher model doesn't know anything about the game state — it just mimics the voice. A simple RL loop where factions get rewarded for population and territory growth would produce more strategically coherent play.

Structured output constraints. The JSON parsing is fragile. A constrained decoding approach (grammar-based sampling) would eliminate the fallback decision entirely and let me run with lower temperatures.
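For context, the kind of fragile parsing this refers to looks roughly like the sketch below: slice out the first JSON object in the model's output and fall back to a safe default when that fails. The fallback shape is illustrative; grammar-constrained decoding would make this code unnecessary.

```python
import json

FALLBACK = {"action": "hold", "flavor": ""}

def parse_decision(text: str) -> dict:
    # Grab the outermost {...} span; anything the model emits before
    # or after the JSON is discarded. Malformed JSON hits the fallback.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return FALLBACK
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return FALLBACK
```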

The Part That Actually Worked

The voices held. When things went well — particularly for the Ledger and Hollow — the faction's reasoning read as distinctly theirs. The Ledger thought in terms of yields and exchange rates. The Hollow listed preconditions before acting. The Archive hedged every decision with historical analogy.

For a 0.6B model with 500 training examples and 1.4M fine-tuned parameters, that's more than I expected.

The project is on GitHub. The full pipeline — corpus collection scripts, dataset generation, training config, simulation, and chronicle renderer — is about 1,200 lines of Python.
