iRacing AI Race Strategist
Fine-tuned Llama 3.1 8B that calls race strategy in real time — and the eval harness that proved it works.
Highlights
- Fine-tuned Llama 3.1 8B with QLoRA on 9,269 synthetic examples generated via Claude API with category-balanced distribution.
- Built a 12-metric weighted scorer plus LLM-as-judge blind A/B harness: score lifted 29.7 → 78.5, hallucinations driven 48/55 → 0/55.
- Event-driven LLM orchestration with priority dispatch, per-event cooldowns, and state-machine transitions — replaces naive per-tick inference.
- Async telemetry → LLM → TTS pipeline at 800–1100 ms total latency with q4_k_m GGUF quantization and deterministic fallback.
Why fine-tune for race strategy
Off-the-shelf LLMs hallucinate on telemetry. They invent lap times, confuse race phases, and call strategy that contradicts the rules of the format they’re in. A race strategist that ships into someone’s ear during a live race has zero tolerance for that. The whole project is built around the eval harness first — fine-tuning is the lever the harness measures.
Numbers that matter
- Score lift: 29.7 → 78.5 on the 12-metric weighted scorer across a 55-case eval set; 100% win rate vs. base in blind A/B.
- Hallucinations: 48 → 0 out of 55 cases; the base model invented driver names, wrong positions, and phantom systems.
- End-to-end latency: 800–1,100 ms for the telemetry → LLM → TTS pipeline, q4_k_m GGUF on-device, served via LM Studio.
- 43 telemetry fields captured at 1 Hz.
- 25 event types detected.
- 60+ cars with trait mappings.
- 65+ tracks with corner names.
Before / after
Same telemetry, same prompt. Base Llama 3.1 8B on the left, fine-tuned on the right.
Fuel Critical at Spa
Lap 18, 1.8 laps of fuel remaining, running P4.
Base
"Hey, 3rd place, 85% of the lap..." (verbose, wrong position, no urgency).
Fine-tuned
"Box this lap, critical fuel. P4 will close that 4.2 second gap. Manage tires through Pouhon."
Tire Warning at Barcelona
Front right at 108°C, 52% worn.
Base
"Alright, driver! We're on lap 18, 35% through the stint..." (generic, no action).
Fine-tuned
"Front right temp one-oh-eight, wear at fifty-two percent. Ease trail braking into Turn 5."
Battle at Nürburgring
P3, gap 0.4s to P2.
Base
"Hey driver, we're in a good spot, P3 on lap 10..." (no urgency, hallucinated names).
Fine-tuned
"P3, gap four-tenths. Use your strong brakes into Veedol, be aggressive on exit."
Per-category eval scores
The base model's weakest categories, pit approach (15.0) and tire warning (18.2), both gained more than 50 points.
| Category | Base | Fine-tuned | Lift |
|---|---|---|---|
| Gap Management | 43.4 | 95.6 | +52.2 |
| Routine Updates | 39.9 | 87.7 | +47.8 |
| Pace Feedback | 48.8 | 87.0 | +38.2 |
| Fuel Critical | 22.2 | 85.7 | +63.5 |
| Position Battle | 27.4 | 78.0 | +50.6 |
| Tire Cold | 25.2 | 74.8 | +49.6 |
| Tire Critical | 31.8 | 72.8 | +41.0 |
| Pit Approach | 15.0 | 71.0 | +56.0 |
| Tire Warning | 18.2 | 68.4 | +50.2 |
| Fuel Warning | 23.2 | 59.4 | +36.2 |
The eval harness
The hardest part wasn’t training; it was knowing whether the model got better. The harness combines two scoring signals over a fixed test set:
- 12-metric weighted scorer — urgency match (15%), required keywords (15%), conciseness (10%), TTS-suitability (10%), telemetry reference (10%), actionability (10%), track reference (8%), car-appropriateness (8%), specificity (8%), correct values (6%), plus a –20% penalty for hallucinations.
- LLM-as-judge blind A/B — base and fine-tuned responses presented anonymously to a stronger model that picks the better answer with reasoning. Used to validate the scorer.
- 55 handcrafted test cases across the 10 training categories — known failure modes of the base model.
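The weighted scorer can be sketched as follows. The metric names, weights, and the –20% hallucination penalty come from the list above; the 0–100 per-metric scale and the function shape are assumptions, not the project's actual code:

```python
# Sketch of the weighted scorer; weights follow the writeup, the
# per-metric scores (0-100 each) are assumed to come from separate checks.
WEIGHTS = {
    "urgency_match": 0.15,
    "required_keywords": 0.15,
    "conciseness": 0.10,
    "tts_suitability": 0.10,
    "telemetry_reference": 0.10,
    "actionability": 0.10,
    "track_reference": 0.08,
    "car_appropriateness": 0.08,
    "specificity": 0.08,
    "correct_values": 0.06,
}
HALLUCINATION_PENALTY = 0.20  # flat -20 points on the 0-100 scale

def score_response(metric_scores: dict, hallucinated: bool) -> float:
    """Combine per-metric scores (each 0-100) into one weighted 0-100 score."""
    total = sum(WEIGHTS[name] * metric_scores.get(name, 0.0) for name in WEIGHTS)
    if hallucinated:
        total -= HALLUCINATION_PENALTY * 100
    return max(total, 0.0)
```

A perfect response scores 100; a perfect response that invents a driver name or a phantom system drops to 80, which is why hallucinations dominate the base model's low category scores.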
Architecture
The main loop polls iRacing at 1 Hz; the pipeline is asyncio-driven so the LLM call and Piper synth never block telemetry capture.
EventDetector tracks per-event-type cooldowns, batches rapid changes (lap-1 position shuffling collapses into one event), and emits a priority-sorted list per tick. The engine picks the top event and dispatches.
Critical bypass — the engine’s global cooldown is skipped for EventPriority.CRITICAL events (defend, fuel critical, yellow flag, tires gone) so urgent callouts always fire.
Fallback path — if LM Studio times out or errors, the engine speaks the event’s built-in callout (“Box now! 1.5 laps of fuel.”). Strategy keeps flowing even when the LLM is gone.
Side taps — SessionLogger writes gzipped JSON of every tick and every LLM call for fine-tuning data collection; OverlayServer broadcasts the same state over WebSocket to a debug overlay for video demos.
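The dispatch rules above (priority pick, per-event cooldowns, critical bypass) can be sketched roughly like this; the class names, cooldown values, and method shape are hypothetical stand-ins for the project's actual engine:

```python
# Illustrative sketch of priority dispatch with cooldowns; names and
# cooldown durations are hypothetical, not the project's real classes.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional

class EventPriority(IntEnum):
    ROUTINE = 0
    WARNING = 1
    CRITICAL = 2

@dataclass
class Event:
    kind: str
    priority: EventPriority
    fallback_callout: str  # spoken verbatim when the LLM times out or errors

@dataclass
class Engine:
    global_cooldown_s: float = 20.0     # assumed value
    per_event_cooldown_s: float = 45.0  # assumed value
    _last_spoke: float = float("-inf")
    _last_by_kind: dict = field(default_factory=dict)

    def pick(self, events: List[Event], now: float) -> Optional[Event]:
        """Return the highest-priority event that clears its cooldowns."""
        for ev in sorted(events, key=lambda e: e.priority, reverse=True):
            if now - self._last_by_kind.get(ev.kind, float("-inf")) < self.per_event_cooldown_s:
                continue  # this event type spoke too recently
            # the global cooldown is skipped for CRITICAL events
            if ev.priority < EventPriority.CRITICAL and \
               now - self._last_spoke < self.global_cooldown_s:
                continue
            self._last_spoke = now
            self._last_by_kind[ev.kind] = now
            return ev
        return None
```

Whatever `pick` returns is sent to the LLM; on timeout the engine speaks `fallback_callout` instead, so a critical event always produces audio.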
Training data composition
9,269 synthetic examples generated via the Claude API with category-balanced distribution. Each example pairs raw telemetry JSON with a concise, TTS-suitable race engineer response — the model learns to infer situations from data rather than from semantic labels.
| Category | Share | Example |
|---|---|---|
| Fuel Critical | 10% | Box now, DNF risk |
| Fuel Warning | 12% | Fuel getting low |
| Tire Critical | 10% | Tires gone |
| Tire Warning | 12% | Overheating, wearing |
| Tire Cold | 8% | Not up to temp |
| Position Battle | 12% | Attacking / defending |
| Gap Management | 10% | Dirty air, clean air, undercut |
| Pit Approach | 8% | Coming into pits |
| Pace Feedback | 8% | PB, slower, improving |
| Routine | 10% | Status updates |
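One training pair might look like the following. The field names and chat-message format here are illustrative, not the project's actual 43-field telemetry schema; the assistant text is the Spa example quoted earlier:

```python
# Hypothetical shape of one training pair: raw telemetry JSON in,
# concise TTS-ready callout out. Field names are illustrative only.
import json

telemetry = {
    "lap": 18, "position": 4, "fuel_laps_remaining": 1.8,
    "gap_behind_s": 4.2, "track": "Spa-Francorchamps",
    "tire_temps_c": {"lf": 92, "rf": 95, "lr": 90, "rr": 93},
}

example = {
    "messages": [
        {"role": "user", "content": json.dumps(telemetry)},
        {"role": "assistant",
         "content": "Box this lap, critical fuel. P4 will close that "
                    "4.2 second gap. Manage tires through Pouhon."},
    ]
}
```

Note there is no `"category": "fuel_critical"` label in the input: the model sees only the numbers, which is what forces it to infer the situation from data.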
QLoRA: what actually changes
The 8B base weights are quantized to 4-bit (NF4) and frozen. Training updates only two small rank-16 matrices per target module, A and B, whose product BA is added to the frozen W in the forward pass. Roughly 0.1% of the parameters are trainable; the other 99.9% sit on disk and are never touched.
Why it matters: a single consumer GPU (RTX 5060 Ti, 16 GB) can hold the 4-bit base in memory and still have room to backprop through the adapters. Without quantization, an 8B base in fp16 alone is ~16 GB — leaving no room for gradients, activations, or optimizer state.
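The update can be written as a few lines of linear algebra. This is a toy-dimension sketch using plain NumPy in place of the quantized 4-bit kernels; the real run uses r=16, alpha=32 on the full 8B weight matrices:

```python
# Minimal numeric sketch of a LoRA-adapted linear layer: frozen weight W
# plus a trainable low-rank correction (alpha/r) * B @ A. Toy dimensions.
import numpy as np

d_in, d_out, r, alpha = 64, 64, 16, 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen base weight (4-bit in the real run)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, initialized to zero

def forward(x):
    # y = x W^T + (alpha/r) * x A^T B^T; gradients flow only through A and B
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

trainable = A.size + B.size  # 2 * r * d adapter parameters per module
total = W.size + trainable
```

Because B starts at zero, the adapted layer is exactly the base layer at step 0, and training moves it away gradually; at toy dimensions the adapters are a large fraction of the parameters, but at 8B scale they shrink to a tiny sliver of the total.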
Training setup
| Setting | Value |
|---|---|
| Base model | Llama 3.1 8B Instruct |
| Method | QLoRA (4-bit base + LoRA adapters) |
| LoRA rank / alpha | 16 / 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 |
| Epochs | ~2.6 (stopped at step 1500 / 1653) |
| Effective batch | 16 (4 × 4 grad accumulation) |
| Best eval loss | 0.1025 at step 1500 |
| Trainable params | ~0.1% of total |
| Hardware | NVIDIA RTX 5060 Ti 16GB |
The full eval harness, training pipeline, and conversion scripts are in the repo — including a 30-test telemetry suite and a 45-test LLM client suite.