iRacing AI Race Strategist
Fine-tuned Llama 3.1 8B that calls race strategy in real time — and the eval harness that proved it works.
Highlights
- Fine-tuned Llama 3.1 8B with QLoRA on 9,269 synthetic examples generated via Claude API with category-balanced distribution.
- Built a 12-metric weighted scorer plus LLM-as-judge blind A/B harness: score lifted 29.7 → 78.5, hallucinations driven 48/55 → 0/55.
- Event-driven LLM orchestration with priority dispatch, per-event cooldowns, and state-machine transitions — replaces naive per-tick inference.
- Async telemetry → LLM → TTS pipeline at 800–1100 ms total latency with q4_k_m GGUF quantization and deterministic fallback.
Why fine-tune for race strategy
Off-the-shelf LLMs hallucinate on telemetry. They invent lap times, confuse race phases, and call strategy that contradicts the rules of the format they’re in. A race strategist that ships into someone’s ear during a live race has zero tolerance for that. The whole project is built around the eval harness first — fine-tuning is the lever the harness measures.
Numbers that matter
- Score lift: 29.7 → 78.5 on the 12-metric weighted scorer across a 55-case eval set; 100% win rate vs. base in blind A/B.
- Hallucinations: 48 → 0 out of 55 cases; the base model invented driver names, wrong positions, and phantom systems.
- End-to-end latency: 800–1,100 ms for the telemetry → LLM → TTS pipeline, q4_k_m GGUF on-device, served via LM Studio.
- 43 telemetry fields captured at 1 Hz.
- 25 event types detected.
- 60+ cars with trait mappings.
- 65+ tracks with corner names.
Before / after
Same telemetry, same prompt. Base Llama 3.1 8B on the left, fine-tuned on the right.
Fuel Critical at Spa
Lap 18, 1.8 laps of fuel remaining, running P4.
Base
"Hey, 3rd place, 85% of the lap..." (verbose, wrong position, no urgency).
Fine-tuned
"Box this lap, critical fuel. P4 will close that 4.2 second gap. Manage tires through Pouhon."
Tire Warning at Barcelona
Front right at 108°C, 52% worn.
Base
"Alright, driver! We're on lap 18, 35% through the stint..." (generic, no action).
Fine-tuned
"Front right temp one-oh-eight, wear at fifty-two percent. Ease trail braking into Turn 5."
Battle at Nürburgring
P3, gap 0.4s to P2.
Base
"Hey driver, we're in a good spot, P3 on lap 10..." (no urgency, hallucinated names).
Fine-tuned
"P3, gap four-tenths. Use your strong brakes into Veedol, be aggressive on exit."
Per-category eval scores
The base model's weakest categories, pit approach (15.0) and tire warning (18.2), both gained more than 50 points.
| Category | Base | Fine-tuned | Lift |
|---|---|---|---|
| Gap Management | 43.4 | 95.6 | +52.2 |
| Routine Updates | 39.9 | 87.7 | +47.8 |
| Pace Feedback | 48.8 | 87.0 | +38.2 |
| Fuel Critical | 22.2 | 85.7 | +63.5 |
| Position Battle | 27.4 | 78.0 | +50.6 |
| Tire Cold | 25.2 | 74.8 | +49.6 |
| Tire Critical | 31.8 | 72.8 | +41.0 |
| Pit Approach | 15.0 | 71.0 | +56.0 |
| Tire Warning | 18.2 | 68.4 | +50.2 |
| Fuel Warning | 23.2 | 59.4 | +36.2 |
The eval harness
The hardest part wasn’t training; it was knowing whether the model got better. The harness combines two scoring signals over a fixed test set:
- 12-metric weighted scorer — urgency match (15%), required keywords (15%), conciseness (10%), TTS-suitability (10%), telemetry reference (10%), actionability (10%), track reference (8%), car-appropriateness (8%), specificity (8%), correct values (6%), plus a –20% penalty for hallucinations.
- LLM-as-judge blind A/B — base and fine-tuned responses presented anonymously to a stronger model that picks the better answer with reasoning. Used to validate the scorer.
- 55 handcrafted test cases across the 10 training categories — known failure modes of the base model.
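The weighted scorer can be sketched as follows. The metric names, weights, and the –20% hallucination penalty come from the list above; the 0–100 per-metric scale and the function shape are assumptions, not the project's actual code:

```python
# Sketch of the weighted scorer; weights follow the writeup, the
# per-metric scores (0-100 each) are assumed to come from separate checks.
WEIGHTS = {
    "urgency_match": 0.15,
    "required_keywords": 0.15,
    "conciseness": 0.10,
    "tts_suitability": 0.10,
    "telemetry_reference": 0.10,
    "actionability": 0.10,
    "track_reference": 0.08,
    "car_appropriateness": 0.08,
    "specificity": 0.08,
    "correct_values": 0.06,
}
HALLUCINATION_PENALTY = 0.20  # flat -20 points on the 0-100 scale

def score_response(metric_scores: dict, hallucinated: bool) -> float:
    """Combine per-metric scores (each 0-100) into one weighted 0-100 score."""
    total = sum(WEIGHTS[name] * metric_scores.get(name, 0.0) for name in WEIGHTS)
    if hallucinated:
        total -= HALLUCINATION_PENALTY * 100
    return max(total, 0.0)
```

A perfect response scores 100; a perfect response that invents a driver name or a phantom system drops to 80, which is why hallucinations dominate the base model's low category scores.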
Architecture
The main loop polls iRacing at 1 Hz; the pipeline is asyncio-driven so the LLM call and Piper synth never block telemetry capture.
EventDetector tracks per-event-type cooldowns, batches rapid changes (lap-1 position shuffling collapses into one event), and emits a priority-sorted list per tick. The engine picks the top event and dispatches.
Critical bypass — the engine’s global cooldown is skipped for EventPriority.CRITICAL events (defend, fuel critical, yellow flag, tires gone) so urgent callouts always fire.
Fallback path — if LM Studio times out or errors, the engine speaks the event’s built-in callout (“Box now! 1.5 laps of fuel.”). Strategy keeps flowing even when the LLM is gone.
Side taps — SessionLogger writes gzipped JSON of every tick and every LLM call for fine-tuning data collection; OverlayServer broadcasts the same state over WebSocket to a debug overlay for video demos.
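The dispatch rules above (priority pick, per-event cooldowns, critical bypass) can be sketched roughly like this; the class names, cooldown values, and method shape are hypothetical stand-ins for the project's actual engine:

```python
# Illustrative sketch of priority dispatch with cooldowns; names and
# cooldown durations are hypothetical, not the project's real classes.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional

class EventPriority(IntEnum):
    ROUTINE = 0
    WARNING = 1
    CRITICAL = 2

@dataclass
class Event:
    kind: str
    priority: EventPriority
    fallback_callout: str  # spoken verbatim when the LLM times out or errors

@dataclass
class Engine:
    global_cooldown_s: float = 20.0     # assumed value
    per_event_cooldown_s: float = 45.0  # assumed value
    _last_spoke: float = float("-inf")
    _last_by_kind: dict = field(default_factory=dict)

    def pick(self, events: List[Event], now: float) -> Optional[Event]:
        """Return the highest-priority event that clears its cooldowns."""
        for ev in sorted(events, key=lambda e: e.priority, reverse=True):
            if now - self._last_by_kind.get(ev.kind, float("-inf")) < self.per_event_cooldown_s:
                continue  # this event type spoke too recently
            # the global cooldown is skipped for CRITICAL events
            if ev.priority < EventPriority.CRITICAL and \
               now - self._last_spoke < self.global_cooldown_s:
                continue
            self._last_spoke = now
            self._last_by_kind[ev.kind] = now
            return ev
        return None
```

Whatever `pick` returns is sent to the LLM; on timeout the engine speaks `fallback_callout` instead, so a critical event always produces audio.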
Training data composition
9,269 synthetic examples generated via the Claude API with category-balanced distribution. Each example pairs raw telemetry JSON with a concise, TTS-suitable race engineer response — the model learns to infer situations from data rather than from semantic labels.
| Category | Share | Example |
|---|---|---|
| Fuel Critical | 10% | Box now, DNF risk |
| Fuel Warning | 12% | Fuel getting low |
| Tire Critical | 10% | Tires gone |
| Tire Warning | 12% | Overheating, wearing |
| Tire Cold | 8% | Not up to temp |
| Position Battle | 12% | Attacking / defending |
| Gap Management | 10% | Dirty air, clean air, undercut |
| Pit Approach | 8% | Coming into pits |
| Pace Feedback | 8% | PB, slower, improving |
| Routine | 10% | Status updates |
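One training pair might look like the following. The field names and chat-message format here are illustrative, not the project's actual 43-field telemetry schema; the assistant text is the Spa example quoted earlier:

```python
# Hypothetical shape of one training pair: raw telemetry JSON in,
# concise TTS-ready callout out. Field names are illustrative only.
import json

telemetry = {
    "lap": 18, "position": 4, "fuel_laps_remaining": 1.8,
    "gap_behind_s": 4.2, "track": "Spa-Francorchamps",
    "tire_temps_c": {"lf": 92, "rf": 95, "lr": 90, "rr": 93},
}

example = {
    "messages": [
        {"role": "user", "content": json.dumps(telemetry)},
        {"role": "assistant",
         "content": "Box this lap, critical fuel. P4 will close that "
                    "4.2 second gap. Manage tires through Pouhon."},
    ]
}
```

Note there is no `"category": "fuel_critical"` label in the input: the model sees only the numbers, which is what forces it to infer the situation from data.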
QLoRA: what actually changes
The 8B base weights are quantized to 4-bit (NF4) and frozen. Training updates only two small rank-16 matrices per target module, A and B, whose product BA is added to the frozen W in the forward pass. Roughly 0.1% of the parameters are trainable; the other 99.9% sit on disk and are never touched.
Why it matters: a single consumer GPU (RTX 5060 Ti, 16 GB) can hold the 4-bit base in memory and still have room to backprop through the adapters. Without quantization, an 8B base in fp16 alone is ~16 GB — leaving no room for gradients, activations, or optimizer state.
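The update can be written as a few lines of linear algebra. This is a toy-dimension sketch using plain NumPy in place of the quantized 4-bit kernels; the real run uses r=16, alpha=32 on the full 8B weight matrices:

```python
# Minimal numeric sketch of a LoRA-adapted linear layer: frozen weight W
# plus a trainable low-rank correction (alpha/r) * B @ A. Toy dimensions.
import numpy as np

d_in, d_out, r, alpha = 64, 64, 16, 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen base weight (4-bit in the real run)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, initialized to zero

def forward(x):
    # y = x W^T + (alpha/r) * x A^T B^T; gradients flow only through A and B
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

trainable = A.size + B.size  # 2 * r * d adapter parameters per module
total = W.size + trainable
```

Because B starts at zero, the adapted layer is exactly the base layer at step 0, and training moves it away gradually; at toy dimensions the adapters are a large fraction of the parameters, but at 8B scale they shrink to a tiny sliver of the total.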
Training setup
| Setting | Value |
|---|---|
| Base model | Llama 3.1 8B Instruct |
| Method | QLoRA (4-bit base + LoRA adapters) |
| LoRA rank / alpha | 16 / 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 |
| Epochs | ~2.6 (stopped at step 1500 / 1653) |
| Effective batch | 16 (4 × 4 grad accumulation) |
| Best eval loss | 0.1025 at step 1500 |
| Trainable params | ~0.1% of total |
| Hardware | NVIDIA RTX 5060 Ti 16GB |
The full eval harness, training pipeline, and conversion scripts are in the repo — including a 30-test telemetry suite and a 45-test LLM client suite.