← All projects
Shipped

iRacing AI Race Strategist

Fine-tuned Llama 3.1 8B that calls race strategy in real time — and the eval harness that proved it works.

Highlights

  • Fine-tuned Llama 3.1 8B with QLoRA on 9,269 synthetic examples generated via Claude API with category-balanced distribution.
  • Built a 12-metric weighted scorer plus LLM-as-judge blind A/B harness: score lifted 29.7 → 78.5, hallucinations driven 48/55 → 0/55.
  • Event-driven LLM orchestration with priority dispatch, per-event cooldowns, and state-machine transitions — replaces naive per-tick inference.
  • Async telemetry → LLM → TTS pipeline at 800–1100 ms total latency with q4_k_m GGUF quantization and deterministic fallback.

Watch it run

Why fine-tune for race strategy

Off-the-shelf LLMs hallucinate on telemetry. They invent lap times, confuse race phases, and call strategy that contradicts the rules of the format they’re in. A race strategist that ships into someone’s ear during a live race has zero tolerance for that. The whole project is built around the eval harness first — fine-tuning is the lever the harness measures.

Numbers that matter

Score lift

29.7 → 78.5

12-metric weighted scorer across a 55-case eval set; 100% win rate vs. base in blind A/B.

Hallucinations

48 → 0

Eliminated out of 55 cases. Base model invented driver names, wrong positions, and phantom systems.

End-to-end latency

800–1,100 ms

Telemetry → LLM → TTS pipeline, q4_k_m GGUF on-device, served via LM Studio.

  • 43 telemetry fields @ 1 Hz
  • 25 event types detected
  • 60+ cars with trait mappings
  • 65+ tracks with corner names

Before / after

Same telemetry, same prompt. Base Llama 3.1 8B on the left, fine-tuned on the right.

Fuel Critical at Spa

Lap 18, 1.8 laps of fuel remaining, running P4.

Base

"Hey, 3rd place, 85% of the lap..." (verbose, wrong position, no urgency).

Fine-tuned

"Box this lap, critical fuel. P4 will close that 4.2 second gap. Manage tires through Pouhon."

Tire Warning at Barcelona

Front right at 108°C, 52% worn.

Base

"Alright, driver! We're on lap 18, 35% through the stint..." (generic, no action).

Fine-tuned

"Front right temp one-oh-eight, wear at fifty-two percent. Ease trail braking into Turn 5."

Battle at Nurburgring

P3, gap 0.4s to P2.

Base

"Hey driver, we're in a good spot, P3 on lap 10..." (no urgency, hallucinated names).

Fine-tuned

"P3, gap four-tenths. Use your strong brakes into Veedol, be aggressive on exit."

Per-category eval scores

The hardest categories for the base model — pit approach (15.0) and tire warning (18.2) — moved the most.

| Category | Base | Fine-tuned | Lift |
| --- | --- | --- | --- |
| Gap Management | 43.4 | 95.6 | +52.2 |
| Routine Updates | 39.9 | 87.7 | +47.8 |
| Pace Feedback | 48.8 | 87.0 | +38.2 |
| Fuel Critical | 22.2 | 85.7 | +63.5 |
| Position Battle | 27.4 | 78.0 | +50.6 |
| Tire Cold | 25.2 | 74.8 | +49.6 |
| Tire Critical | 31.8 | 72.8 | +41.0 |
| Pit Approach | 15.0 | 71.0 | +56.0 |
| Tire Warning | 18.2 | 68.4 | +50.2 |
| Fuel Warning | 23.2 | 59.4 | +36.2 |

The eval harness

The hardest part wasn't training; it was knowing whether the model got better. The harness combines two scoring signals over a fixed test set:

  • 12-metric weighted scorer — urgency match (15%), required keywords (15%), conciseness (10%), TTS-suitability (10%), telemetry reference (10%), actionability (10%), track reference (8%), car-appropriateness (8%), specificity (8%), correct values (6%), plus a –20% penalty for hallucinations.
  • LLM-as-judge blind A/B — base and fine-tuned responses presented anonymously to a stronger model that picks the better answer with reasoning. Used to validate the scorer.
  • 55 handcrafted test cases across the 10 training categories — known failure modes of the base model.
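The weighted scorer's shape can be sketched in a few lines. The weights below come from the list above; the metric functions themselves are stand-ins, not the real checks:

```python
# Sketch of the weighted scorer's scoring pass. Weights follow the list
# above; each metric is assumed to return a value in [0, 1]. The real
# per-metric checks (keyword matching, TTS heuristics, etc.) are omitted.

WEIGHTS = {
    "urgency_match": 0.15, "required_keywords": 0.15, "conciseness": 0.10,
    "tts_suitability": 0.10, "telemetry_reference": 0.10, "actionability": 0.10,
    "track_reference": 0.08, "car_appropriateness": 0.08, "specificity": 0.08,
    "correct_values": 0.06,
}
HALLUCINATION_PENALTY = 0.20  # subtracted when the response invents facts

def score_response(metrics: dict[str, float], hallucinated: bool) -> float:
    """Weighted sum of per-metric scores on a 0-100 scale, minus the penalty."""
    total = sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
    if hallucinated:
        total -= HALLUCINATION_PENALTY
    return round(max(0.0, total) * 100, 1)

# A response that nails every metric but hallucinates still loses 20 points:
perfect = {name: 1.0 for name in WEIGHTS}
print(score_response(perfect, hallucinated=False))  # 100.0
print(score_response(perfect, hallucinated=True))   # 80.0
```

The hallucination penalty being a flat deduction (rather than a weighted metric) is what lets a single invented fact sink an otherwise polished response.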

Architecture

The main loop polls iRacing at 1 Hz; the pipeline is asyncio-driven so the LLM call and Piper synth never block telemetry capture.
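The non-blocking shape looks roughly like this, with the real SDK, model, and synth calls replaced by stubbed sleeps:

```python
import asyncio

# Sketch of the asyncio wiring: telemetry capture keeps ticking while LLM
# inference and TTS run as separate queue-fed tasks. The sleeps stand in
# for pyirsdk reads, the LM Studio call, and Piper synthesis.

async def telemetry_loop(events: asyncio.Queue, ticks: list):
    for lap_pct in (0.25, 0.50, 0.75):          # stand-in for 1 Hz pyirsdk polling
        ticks.append(lap_pct)
        await events.put({"lap_pct": lap_pct})  # hand off; never await inference here
        await asyncio.sleep(0.01)               # 1.0 s in the real loop

async def llm_loop(events: asyncio.Queue, speech: asyncio.Queue):
    while True:
        event = await events.get()
        await asyncio.sleep(0.03)               # stand-in for the 800-1100 ms LLM call
        await speech.put(f"callout for {event['lap_pct']:.0%}")

async def tts_loop(speech: asyncio.Queue, spoken: list):
    while True:
        spoken.append(await speech.get())       # stand-in for Piper synth + playback

async def main():
    events, speech = asyncio.Queue(), asyncio.Queue()
    ticks, spoken = [], []
    workers = [asyncio.create_task(llm_loop(events, speech)),
               asyncio.create_task(tts_loop(speech, spoken))]
    await telemetry_loop(events, ticks)
    await asyncio.sleep(0.2)                    # let in-flight events drain
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return ticks, spoken

ticks, spoken = asyncio.run(main())
print(len(ticks), len(spoken))  # 3 3
```

Because the telemetry loop only enqueues and never awaits inference, a slow LLM response delays the callout, not the capture.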

Main path: iRacing Sim (Windows, pyirsdk) → TelemetryReader (43 fields, 1 Hz poll) → StrategyCalculator (rolling 5-lap fuel, urgency) → EventDetector (27 event types, per-type cooldowns) → StrategyEngine (priority dispatch, critical bypass) → LMStudioClient (aiohttp to :1234, Llama 3.1 8B q4_k_m) → PiperTTS (async queue, priority preempt) → audio output (22 050 Hz PCM via sounddevice). On LLM failure the engine speaks the event's built-in callout instead. Side taps receive the state every tick: SessionLogger (gzip JSON → fine-tune set) and OverlayServer (FastAPI WebSocket on :8080).

EventDetector tracks per-event-type cooldowns, batches rapid changes (lap-1 position shuffling collapses into one event), and emits a priority-sorted list per tick. The engine picks the top event and dispatches.
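A minimal sketch of that per-event-type cooldown plus priority sort (the event names, cooldown values, and priorities here are illustrative, not the project's actual table):

```python
# Sketch of per-event-type cooldowns: an event type is eligible only if it
# hasn't fired within its own cooldown window; each tick yields a
# priority-sorted list and the engine dispatches the head.

COOLDOWNS = {"fuel_critical": 0.0, "position_change": 30.0, "routine": 120.0}
PRIORITY = {"fuel_critical": 0, "position_change": 1, "routine": 2}  # lower = first

class EventDetector:
    def __init__(self):
        self._last_fired: dict[str, float] = {}

    def detect(self, candidates: list[str], now: float) -> list[str]:
        ready = [e for e in candidates
                 if now - self._last_fired.get(e, float("-inf")) >= COOLDOWNS[e]]
        ready.sort(key=PRIORITY.__getitem__)
        return ready

    def mark_fired(self, event: str, now: float):
        self._last_fired[event] = now

det = EventDetector()
first = det.detect(["routine", "position_change"], now=0.0)
print(first)                      # ['position_change', 'routine']
det.mark_fired(first[0], now=0.0)
# 10 s later: position_change is still cooling down, routine is not.
print(det.detect(["routine", "position_change"], now=10.0))  # ['routine']
```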

Critical bypass — the engine’s global cooldown is skipped for EventPriority.CRITICAL events (defend, fuel critical, yellow flag, tires gone) so urgent callouts always fire.

Fallback path — if LM Studio times out or errors, the engine speaks the event’s built-in callout (“Box now! 1.5 laps of fuel.”). Strategy keeps flowing even when the LLM is gone.
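Both rules fit in a few lines. This is a sketch, with the event shape, the global cooldown value, and the stub names as assumptions:

```python
from enum import IntEnum

# Sketch of the dispatch rules described above: CRITICAL events skip the
# engine's global cooldown, and an LLM failure falls back to the event's
# built-in deterministic callout.

class EventPriority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    ROUTINE = 2

GLOBAL_COOLDOWN_S = 20.0  # illustrative value

def should_dispatch(priority: EventPriority, seconds_since_last: float) -> bool:
    # Critical bypass: urgent callouts always fire.
    return priority == EventPriority.CRITICAL or seconds_since_last >= GLOBAL_COOLDOWN_S

def speak(event: dict, llm_call) -> str:
    try:
        return llm_call(event)
    except Exception:
        return event["builtin_callout"]  # deterministic fallback on timeout/error

event = {"type": "fuel_critical", "builtin_callout": "Box now! 1.5 laps of fuel."}

def broken_llm(_event):
    raise TimeoutError("LM Studio did not answer in time")

print(should_dispatch(EventPriority.CRITICAL, seconds_since_last=2.0))  # True
print(should_dispatch(EventPriority.ROUTINE, seconds_since_last=2.0))   # False
print(speak(event, broken_llm))  # Box now! 1.5 laps of fuel.
```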

Side taps: SessionLogger writes gzipped JSON of every tick and every LLM call for fine-tuning data collection; OverlayServer broadcasts the same state over WebSocket to a debug overlay for video demos.

Training data composition

9,269 synthetic examples generated via the Claude API with category-balanced distribution. Each example pairs raw telemetry JSON with a concise, TTS-suitable race engineer response — the model learns to infer situations from data rather than from semantic labels.
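The exact record schema isn't shown in this writeup; a plausible chat-format example, with field names as assumptions, looks like:

```python
import json

# Illustrative training record: raw telemetry JSON in, concise TTS-ready
# engineer callout out. Field names and schema are assumptions; the point
# is that the input is numbers, not semantic labels like "fuel_critical".

example = {
    "messages": [
        {"role": "system",
         "content": "You are a race engineer. Be concise and TTS-friendly."},
        {"role": "user",
         "content": json.dumps({
             "lap": 18, "position": 4, "fuel_laps_remaining": 1.8,
             "gap_behind_s": 4.2, "track": "Spa-Francorchamps",
         })},
        {"role": "assistant",
         "content": "Box this lap, critical fuel. P4 will close that "
                    "4.2 second gap. Manage tires through Pouhon."},
    ]
}
print(len(example["messages"]))  # 3
```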

| Category | Share | Example |
| --- | --- | --- |
| Fuel Critical | 10% | Box now, DNF risk |
| Fuel Warning | 12% | Fuel getting low |
| Tire Critical | 10% | Tires gone |
| Tire Warning | 12% | Overheating, wearing |
| Tire Cold | 8% | Not up to temp |
| Position Battle | 12% | Attacking / defending |
| Gap Management | 10% | Dirty air, clean air, undercut |
| Pit Approach | 8% | Coming into pits |
| Pace Feedback | 8% | PB, slower, improving |
| Routine | 10% | Status updates |

QLoRA: what actually changes

The 8B base weights are quantized to 4-bit (NF4) and frozen. Training only updates two tiny rank-16 matrices per target module — A and B — whose product BA is added on top of the frozen W at the forward pass. Roughly 0.1% of the parameters are trainable; the other 99.9% sit on disk and are never touched.

Effective weights: W + B·A, where W (d × d) is the frozen 4-bit (NF4) Llama 3.1 8B base and B (d × r), A (r × d) are the trainable fp16 LoRA adapters with rank r = 16, α = 32. The forward pass uses W + BA; W stays on disk and BA loads on top. The LoRA parameters (B + A) span 7 target modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) × 32 layers: ~0.1% trainable against the ~8B base weights, which are never touched.
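Numerically, the adapter is just a low-rank addition on the forward pass, sketched here with tiny dense matrices (pure Python, framework-free; real dims are in the thousands):

```python
# Tiny numeric sketch of the LoRA forward pass: y = (W + (alpha/r) * B @ A) x.
# W is frozen; only A (r×d) and B (d×r) receive gradients during training.
# Real dims: d in the thousands, r = 16, alpha = 32; here d = 3, r = 1.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d, r, alpha = 3, 1, 2.0
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base (identity here)
B = [[0.5], [0.0], [0.0]]    # d×r, trainable (real LoRA inits B to zero)
A = [[0.0, 1.0, 0.0]]        # r×d, trainable

def lora_forward(x):
    base = matvec(W, x)                  # frozen path
    delta = matvec(B, matvec(A, x))      # low-rank path: compute B(Ax), never form BA
    scale = alpha / r
    return [b + scale * dlt for b, dlt in zip(base, delta)]

print(lora_forward([1.0, 2.0, 3.0]))  # [3.0, 2.0, 3.0]
```

Computing B(Ax) instead of materializing BA is why the adapters cost almost nothing at train time: two skinny matmuls per target module instead of a full d × d update.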

Why it matters: a single consumer GPU (RTX 5060 Ti, 16 GB) can hold the 4-bit base in memory and still have room to backprop through the adapters. Without quantization, an 8B base in fp16 alone is ~16 GB — leaving no room for gradients, activations, or optimizer state.
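The back-of-envelope arithmetic behind that claim:

```python
# Rough VRAM arithmetic for an 8B-parameter base (approximate; ignores
# quantization block overhead, KV cache, and CUDA context).
params = 8e9
fp16_gb = params * 2 / 1e9        # 2 bytes per param
nf4_gb  = params * 0.5 / 1e9      # 4 bits per param
print(f"fp16 base: ~{fp16_gb:.0f} GB, NF4 base: ~{nf4_gb:.0f} GB")
# fp16 base: ~16 GB, NF4 base: ~4 GB
# On a 16 GB card, NF4 leaves roughly 12 GB for the fp16 adapters,
# their gradients and optimizer state, and activations.
```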

Training setup

| Setting | Value |
| --- | --- |
| Base model | Llama 3.1 8B Instruct |
| Method | QLoRA (4-bit base + LoRA adapters) |
| LoRA rank / alpha | 16 / 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 |
| Epochs | ~2.6 (stopped at step 1500 / 1653) |
| Effective batch | 16 (4 × 4 grad accumulation) |
| Best eval loss | 0.1025 at step 1500 |
| Trainable params | ~0.1% of total |
| Hardware | NVIDIA RTX 5060 Ti 16 GB |
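The table above maps onto a Hugging Face-style configuration roughly like this. A sketch, assuming the peft + bitsandbytes + transformers stack, not the project's actual script; the dropout value is an assumption:

```python
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# Sketch of the setup in the table above, assuming the HF stack.

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantization of the frozen base
    bnb_4bit_compute_dtype="bfloat16",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                   # assumption; not stated in the writeup
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch 16
    num_train_epochs=3,                  # checkpointed and stopped early at step 1500
)
```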

The full eval harness, training pipeline, and conversion scripts are in the repo — including a 30-test telemetry suite and a 45-test LLM client suite.