// THE_LAB
The Lab
A test run, not a product. I wired up Claude Code as a two-agent generator/evaluator loop and pointed it at this site: one agent edits the code, another grades the live page through a real browser, and I approve every change in between. Three rounds so far, two of them graded: 83 → 89 / 100 against a rubric I wrote. Below is the loop, what it changed, and an honest read on it.
status: experiment · human-gated · not autonomous
▌ THE LOOP
┌────────────────────────────────────────────────────┐
│ /improve-site · orchestrator                       │
│ owns the dev server · git · the human gate         │
└─────────────────────────┬──────────────────────────┘
                          │ one round, looped
          ┌───────────────┴───────────────┐
          ▼                               ▼
┌────────────────────┐  ranked  ┌────────────────────┐
│ site-eval          │ fixes →  │ site-builder       │
│ Playwright +       │          │ edits the code     │
│ vision · grades    │          │ in house style     │
│ vs the rubric      │← approved│                    │
└─────────┬──────────┘  fixes   └─────────┬──────────┘
          │                               │
          │ re-grade           implements │
          ▼                               ▼
┌────────────────────────────────────────────────────┐
│ ME · approve the fix set before anything ships     │
└────────────────────────────────────────────────────┘
site-eval
A Claude Code subagent driving a real Chromium browser via the Playwright MCP. It screenshots desktop + mobile, reads the console, measures contrast and layout, and scores eight rubric categories with evidence.
site-builder
A second subagent that implements only the changes I approve, in the existing code style. It can't grade and can't open the browser — the two roles stay separate.
me · the gate
Every round pauses for my approval before any code lands, and I recompute the totals from the per-category scores rather than take the grader's word for them.
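As a rough sketch of one round (not the actual orchestrator), the three roles can be modelled as plain functions passed into a single step. Every name here (`Fix`, `Grade`, `Roles`, `runRound`, the `impact` field) is hypothetical, and the real agents would be asynchronous calls rather than synchronous functions.

```typescript
// Hypothetical shapes for illustration; none of these names come from the real setup.
interface Fix { id: string; summary: string; impact: number }
interface Grade { scores: Record<string, number>; fixes: Fix[] }

interface Roles {
  evaluate: () => Grade;              // stands in for site-eval
  approve: (ranked: Fix[]) => Fix[];  // stands in for the human gate
  build: (approved: Fix[]) => void;   // stands in for site-builder
}

// One round: grade, rank, pause for approval, build only what was approved, re-grade.
function runRound(roles: Roles): { before: Grade; applied: Fix[]; after: Grade } {
  const before = roles.evaluate();
  // Highest-impact fixes first (assumes the grader attaches an impact estimate).
  const ranked = [...before.fixes].sort((a, b) => b.impact - a.impact);
  const approved = roles.approve(ranked);
  // Guard the gate: drop anything the evaluator never proposed.
  const proposed = new Set(ranked.map((f) => f.id));
  const applied = approved.filter((f) => proposed.has(f.id));
  if (applied.length > 0) roles.build(applied);
  const after = roles.evaluate();
  return { before, applied, after };
}
```

Keeping `evaluate` and `build` as separate callbacks mirrors the split described above: the builder never sees a grading function, and nothing reaches `build` without passing through `approve`.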
▌ THE SCORECARD
Weighted total · /100 · three graded states
Full 0–100 axis. The site started strong, so the gains are modest — mostly closing accessibility, layout, and copy gaps. Totals are computed in code from the per-category scores, not hardcoded.
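Recomputing a weighted total from per-category scores might look like the sketch below. It assumes each category is scored 0 to 10 and carries a numeric weight; the real weights and the full list of eight categories are not given here, so `Category`, `weightedTotal`, and `matchesReported` are illustrative names only.

```typescript
// Hypothetical rubric shape; weights and the 0-10 scale are assumptions.
interface Category { name: string; weight: number; score: number }

// Weighted total on a 0-100 scale, rounded to the nearest integer.
function weightedTotal(categories: Category[]): number {
  const weightSum = categories.reduce((sum, c) => sum + c.weight, 0);
  if (weightSum <= 0) throw new Error("weights must sum to a positive number");
  for (const c of categories) {
    if (c.score < 0 || c.score > 10) throw new Error(`score out of range: ${c.name}`);
  }
  const earned = categories.reduce((sum, c) => sum + c.weight * (c.score / 10), 0);
  return Math.round((earned / weightSum) * 100);
}

// Check a grader-reported total against the recomputed one.
function matchesReported(reported: number, categories: Category[]): boolean {
  return reported === weightedTotal(categories);
}
```

With two made-up categories, one of weight 3 scored 8/10 and one of weight 1 scored 6/10, `weightedTotal` returns 75, so a reported total of 76 would fail `matchesReported`.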
Where the points are now · Round 2 (89/100)
cyan = near-maxed · magenta = room to climb.
▌ WHAT THE LOOP CHANGED
Round 1
- High-contrast :focus-visible ring + a prefers-reduced-motion block · Accessibility 7 → 9
- Capped the /projects ASCII scene and closed the dead gap · Visual craft 8 → 9
- Desktop-only "live scene" hint so the canvas reads as interactive · Memorable 5 → 6
Round 2
- One-time hero intro sweep (off under reduced-motion + on mobile) · Memorable 6 → 7
- "Let's talk" contact CTA so /projects no longer dead-ends · Navigation 6 → 8
- Led the hero subhead with the load-bearing Wells Fargo claim · Clarity held at 9
Round 3 · shipped · ungraded
- Tightened the H1 to two lines so the proof-point sits higher · shipped, not yet graded
- Faint post-sweep interactivity cue (0.34 vs 1.0 hover), auto-off on first interaction · shipped, not yet graded
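The gating on the intro sweep and the interactivity cue can be expressed as two small pure functions. This is a sketch under assumptions, not the site's code: `IntroContext`, `shouldPlayIntro`, and `showCue` are hypothetical names, and it assumes the cue only appears once the sweep has finished.

```typescript
// Hypothetical inputs. In a browser they might come from
// window.matchMedia("(prefers-reduced-motion: reduce)").matches,
// a viewport-width check, and a stored "already played" flag.
interface IntroContext {
  reducedMotion: boolean;
  isMobile: boolean;
  alreadyPlayed: boolean;
}

// The sweep plays once, and never under reduced motion or on mobile.
function shouldPlayIntro(ctx: IntroContext): boolean {
  return !ctx.reducedMotion && !ctx.isMobile && !ctx.alreadyPlayed;
}

// The cue shows only after the sweep and switches off at the first interaction.
function showCue(sweepFinished: boolean, hasInteracted: boolean): boolean {
  return sweepFinished && !hasInteracted;
}
```

Keeping the decisions free of DOM access makes the fallbacks easy to check in isolation, which matters given that regressing the reduced-motion or mobile behaviour is one of the rubric's anti-goals.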
▌ THE RUBRIC — AND ITS ANTI-GOALS
The loop grades against a rubric that weights substance over spectacle for one reader: a senior engineer skimming for fifteen seconds. It also spells out anti-goals — things the builder agent is forbidden to do, so chasing the score can’t turn into gaming it:
Anti-goals
- ✗ Don't add features or sections just to look busy; restraint scores higher.
- ✗ Don't inflate claims to chase a score; credibility means verifiable.
- ✗ Don't sacrifice readability for spectacle in the hero.
- ✗ Don't regress the reduced-motion or mobile fallbacks already in place.
- ✗ Don't introduce heavy dependencies for cosmetic gains.
▌ MY TAKE
Honestly, the loop worked pretty well — but I don’t think this site was the best test of it. I’d already started and refined this site before I pointed the loop at it, so a lot of the strengths were already there and the agent was mostly polishing. Some of that earlier refinement came from my younger brother, who looked at an early version and told me it looked generic. The variable ASCII-art heroes were his idea — I’m the one who took the suggestion and made them actually good.
What I’d really want to see is the loop running with less direction from me: fewer guardrails, and room to keep inventing more interesting pages on its own. I read that Anthropic got a whole 3D museum site to emerge out of a loop like this, which is the kind of thing I mean. I just haven’t wanted to pay for the API tokens to find out what mine would do.
I think the real version of this is a proper harness loop with a token budget behind it — that’s where it would get interesting. I’ve also heard Codex can handle subagents inside a harness, which might be the cleaner way to run it. For now, this was a test run — and a useful one.