// THE_LAB

The Lab

A test run, not a product. I wired up Claude Code as a two-agent generator/evaluator loop and pointed it at this site: one agent edits the code, another grades the live page through a real browser, and I approve every change in between. Three rounds, two of them graded: the score went from 83 to 89 / 100 against a rubric I wrote. Below is the loop, what it changed, and an honest read on it.

status: experiment · human-gated · not autonomous

▌ THE LOOP


  ┌──────────────────────────────────────────────────┐
  │  /improve-site   —   orchestrator                  │
  │  owns the dev server · git · the human gate        │
  └──────────────────────┬─────────────────────────────┘
                         │   one round, looped
         ┌───────────────┴────────────────┐
         ▼                                ▼
  ┌────────────────────┐  ranked   ┌────────────────────┐
  │  site-eval         │  fixes →  │  site-builder      │
  │  Playwright +      │           │  edits the code    │
  │  vision · grades   │           │  in house style    │
  │  vs the rubric     │ ← approved│                    │
  └─────────┬──────────┘   fixes   └─────────┬──────────┘
            │                                │
            │ re-grade            implements │
            ▼                                ▼
  ┌────────────────────────────────────────────────────┐
  │  ME  ·  approve the fix set before anything ships    │
  └────────────────────────────────────────────────────┘

site-eval

A Claude Code subagent driving a real Chromium browser via the Playwright MCP. It screenshots desktop + mobile, reads the console, measures contrast and layout, and scores eight rubric categories with evidence.
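How site-eval measures contrast isn't shown here. One common way to do it is the WCAG 2.x contrast-ratio formula, sketched below. The function names are hypothetical and are not taken from the agent's actual tooling.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance of an sRGB colour, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)
```

The ratio runs from 1:1 (identical colours) to 21:1 (black on white). WCAG AA asks for at least 4.5:1 on normal-size text, which is the kind of threshold a grader could cite as evidence.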

site-builder

A second subagent that implements only the changes I approve, in the existing code style. It can't grade and can't open the browser — the two roles stay separate.

me · the gate

Every round pauses for my approval before any code lands, and I recompute the totals from the per-category scores rather than take the grader's word for them.
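The control flow of one gated round can be sketched as follows. This is an illustration only: the real orchestrator is the /improve-site command, whose internals aren't shown here, and `Fix`, `run_round`, and the three callables are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Fix:
    title: str
    category: str


def run_round(
    evaluate: Callable[[], List[Fix]],     # stands in for site-eval: ranked fixes
    approve: Callable[[Fix], bool],        # stands in for the human gate
    build: Callable[[List[Fix]], None],    # stands in for site-builder
) -> List[Fix]:
    """Run one round; the builder only ever receives approved fixes."""
    proposed = evaluate()
    approved = [fix for fix in proposed if approve(fix)]
    if approved:
        # Nothing reaches the builder unless the gate said yes.
        build(approved)
    return approved
```

Passing the three roles in as separate callables mirrors the separation described above: the builder never sees a fix the gate rejected, and it has no way to call the evaluator itself.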

▌ THE SCORECARD

Weighted total · /100 · three graded states

[Bar chart, 0–100 axis: Baseline 83 · Round 1 87 · Round 2 89]

Full 0–100 axis. The site started strong, so the gains are modest — mostly closing accessibility, layout, and copy gaps. Totals are computed in code from the per-category scores, not hardcoded.

Where the points are now · Round 2 (89/100)

Category                    Score   Weight
First-screen clarity        9/10    20
Evidence & credibility      9/10    18
Visual craft & restraint    9/10    16
Performance & stability     9/10    14
Accessibility               9/10    12
Mobile experience           9/10    10
Navigation & flow           8/10    6
The memorable factor        7/10    4

cyan = near-maxed · magenta = room to climb.
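The site's own scoring code isn't shown, but the weighted total can be recomputed from the per-category scores listed above. This sketch makes three assumptions: each score is out of 10, the w-number is the category weight, and the total is the weighted average scaled to 100 and rounded. The two earlier states are rebuilt by reverting the deltas in the change log, assuming any category not listed there held steady.

```python
from typing import Dict, Tuple

# category -> (score out of 10, weight), taken from the Round 2 scorecard
ROUND_2: Dict[str, Tuple[int, int]] = {
    "First-screen clarity": (9, 20),
    "Evidence & credibility": (9, 18),
    "Visual craft & restraint": (9, 16),
    "Performance & stability": (9, 14),
    "Accessibility": (9, 12),
    "Mobile experience": (9, 10),
    "Navigation & flow": (8, 6),
    "The memorable factor": (7, 4),
}


def weighted_total(card: Dict[str, Tuple[int, int]]) -> float:
    """Weighted average of the 0-10 scores, scaled to a 0-100 total."""
    total_weight = sum(weight for _, weight in card.values())
    weighted_sum = sum(score * weight for score, weight in card.values())
    return weighted_sum * 10 / total_weight


def with_scores(card, changes: Dict[str, int]):
    """Copy a scorecard, replacing the scores named in `changes`."""
    return {
        name: (changes.get(name, score), weight)
        for name, (score, weight) in card.items()
    }


# Assumption: undo the Round 2 deltas (Navigation 6 -> 8, Memorable 6 -> 7).
ROUND_1 = with_scores(ROUND_2, {"Navigation & flow": 6, "The memorable factor": 6})

# Assumption: undo the Round 1 deltas (Accessibility 7 -> 9,
# Visual craft 8 -> 9, Memorable 5 -> 6).
BASELINE = with_scores(
    ROUND_1,
    {"Accessibility": 7, "Visual craft & restraint": 8, "The memorable factor": 5},
)

totals = {
    "Baseline": round(weighted_total(BASELINE)),
    "Round 1": round(weighted_total(ROUND_1)),
    "Round 2": round(weighted_total(ROUND_2)),
}
```

Under those assumptions the unrounded totals are 82.6, 87.0, and 88.6, which round to the 83, 87, and 89 reported on the scorecard.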

▌ WHAT THE LOOP CHANGED

Round 1

  • High-contrast :focus-visible ring + a prefers-reduced-motion block · Accessibility 7 → 9
  • Capped the /projects ASCII scene and closed the dead gap · Visual craft 8 → 9
  • Desktop-only "live scene" hint so the canvas reads as interactive · Memorable 5 → 6

Round 2

  • One-time hero intro sweep (off under reduced-motion + on mobile) · Memorable 6 → 7
  • "Let's talk" contact CTA so /projects no longer dead-ends · Navigation 6 → 8
  • Led the hero subhead with the load-bearing Wells Fargo claim · Clarity held at 9

Round 3

shipped · ungraded

  • Tightened the H1 to two lines so the proof-point sits higher · shipped, not yet graded
  • Faint post-sweep interactivity cue (0.34 vs 1.0 hover), auto-off on first interaction · shipped, not yet graded

▌ THE RUBRIC — AND ITS ANTI-GOALS

The loop grades against a rubric that weights substance over spectacle for one reader: a senior engineer skimming for fifteen seconds. It also spells out anti-goals — things the builder agent is forbidden to do, so chasing the score can’t turn into gaming it:

Anti-goals

  • Don't add features or sections just to look busy — restraint scores higher.
  • Don't inflate claims to chase a score — credibility means verifiable.
  • Don't sacrifice readability for spectacle in the hero.
  • Don't regress the reduced-motion or mobile fallbacks already in place.
  • Don't introduce heavy dependencies for cosmetic gains.

▌ MY TAKE

Honestly, the loop worked pretty well — but I don’t think this site was the best test of it. I’d already started and refined this site before I pointed the loop at it, so a lot of the strengths were already there and the agent was mostly polishing. Some of that earlier refinement came from my younger brother, who looked at an early version and told me it looked generic. The variable ASCII-art heroes were his idea — I’m the one who took the suggestion and made them actually good.

What I’d really want to see is the loop with less direction from me: fewer guardrails, and room to keep inventing more interesting pages on its own. I read that Anthropic got a whole 3D museum site to emerge out of a loop like this, which is the kind of thing I mean. I just haven’t wanted to pay for the API tokens to find out what mine would do.

I think the real version of this is a proper harness loop with a token budget behind it — that’s where it would get interesting. I’ve also heard Codex can handle subagents inside a harness, which might be the cleaner way to run it. For now, this was a test run — and a useful one.

Get in touch.