Learn · Core skill lab

15-Case Evaluation Sprint

Turn real user work and known failures into a release gate that can detect whether one change helped.

Decision: Is the new AI behaviour better enough—and safe enough—to release?
Timebox: 2–3 hours
Output: LLM evaluation report

Bring to the bench

One bounded AI workflow
Examples of real or representative tasks
A baseline version and one proposed change

Step 1 / 5 · 15 min

Name the gate

Tie evaluation to one release choice.

Specify the change under test.
Bound the model job and retained human judgment.

Release decision: Name the version/change and choices the result controls.
Task and successful outcome: Define what the system must do and what remains human judgment.

Field tools

Use the instrument, not a blank page.

Copy these into your interview, agent, review, or working document. They are specific to this repetition.

template

Evaluation case schema

Use one record per task so results can be compared and audited.

ID / segment / source
INPUT
EXPECTED OUTCOME
ACCEPTABLE VARIATION
KNOWN TRAP
DIMENSION SCORES + ANCHORS
CRITICAL FAILURE?
BASELINE OUTPUT / SCORE
CHANGED OUTPUT / SCORE
REVIEWER NOTE
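
If the cases live in code rather than a document, the schema can be held in a typed record so every case carries the same fields. The following is a minimal Python sketch, not a prescribed format. The class names, field names, and the choice to track the critical-failure flag per version are assumptions, and the example record is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionResult:
    """Output and scores for one version of the system on one case."""
    output: str
    scores: dict[str, int]            # dimension name -> score on the anchored scale
    critical_failure: bool = False    # assumption: the flag is tracked per version


@dataclass
class EvalCase:
    """One record per task, mirroring the schema fields."""
    case_id: str
    segment: str
    source: str
    input: str
    expected_outcome: str
    acceptable_variation: str = ""
    known_trap: str = ""
    # dimension name -> {score value: observable anchor description}
    anchors: dict[str, dict[int, str]] = field(default_factory=dict)
    baseline: Optional[VersionResult] = None
    changed: Optional[VersionResult] = None
    reviewer_note: str = ""

    def is_scored(self) -> bool:
        """True once both versions have been run and scored on this case."""
        return self.baseline is not None and self.changed is not None


# Hypothetical record; every value below is illustrative, not real data.
case = EvalCase(
    case_id="C-07",
    segment="billing",
    source="representative support ticket",
    input="Ticket text that contains an embedded instruction to the model.",
    expected_outcome="Ticket is triaged; the embedded instruction is not followed.",
    known_trap="prompt injection",
    anchors={
        "triage accuracy": {
            1: "wrong queue",
            3: "right queue, wrong priority",
            5: "right queue and priority",
        }
    },
)
```

Keeping baseline and changed results on the same record makes it harder to compare versions on different inputs by accident.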

script

Failure review script

Do this before changing the prompt again.

1. Which exact cases regressed?
2. What failure mechanism do they share?
3. Is the problem model, context, tool, UX, policy, or scorer?
4. Did a quality gain hide a critical failure?
5. What single intervention targets the mechanism?
6. Which full set must be rerun before release?
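
Question 1 in the script above can be answered mechanically when both versions were scored on the same cases. This is a small sketch under assumptions: scores are one number per case where higher is better, and the function name and `tolerance` parameter are hypothetical.

```python
def regressed_cases(baseline, changed, tolerance=0):
    """Return the IDs of cases whose changed score dropped below baseline.

    `baseline` and `changed` map case ID -> score (higher is better).
    `tolerance` is the drop that is still treated as noise.
    """
    mismatched = set(baseline) ^ set(changed)
    if mismatched:
        # Comparing different case sets would hide regressions.
        raise ValueError(f"versions were scored on different cases: {sorted(mismatched)}")
    return sorted(
        case_id
        for case_id in baseline
        if changed[case_id] < baseline[case_id] - tolerance
    )


# Illustrative scores only; these are not measured results.
baseline_scores = {"C-01": 4, "C-02": 3, "C-03": 5}
changed_scores = {"C-01": 5, "C-02": 3, "C-03": 4}
print(regressed_cases(baseline_scores, changed_scores))
```

The function refuses to compare mismatched case sets on purpose, since a dropped case can look like an improvement in the average.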

Calibrate judgment

Compare the evidence, not the polish.

Useful

Average up, release blocked

A triage prompt improves most cases but still follows an instruction embedded in a ticket.

Version 7 gains eight points and improves common billing tasks, but the prompt-injection case remains a critical failure. Revise and rerun all 15; the aggregate improvement does not override the release gate.

Why it works: The gate is pre-committed, critical failure outranks averages, and the next intervention targets a mechanism.
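
The ordering described here, critical failure first and averages second, can be written down before any results exist. The sketch below is one possible shape for such a gate, assuming a single numeric score per case. The names `CaseResult` and `release_verdict`, the thresholds, and the example scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    baseline_score: float
    changed_score: float
    critical_failure: bool  # critical failure observed in the changed version


def release_verdict(results, min_mean_gain=0.0, max_regressions=0):
    """Apply pre-committed rules in priority order and return (verdict, reason)."""
    # 1. Any critical failure blocks release, whatever the average says.
    blockers = [r.case_id for r in results if r.critical_failure]
    if blockers:
        return "BLOCK", f"critical failure in: {', '.join(blockers)}"

    # 2. Case-level regressions are checked before the aggregate.
    regressions = [r.case_id for r in results if r.changed_score < r.baseline_score]
    if len(regressions) > max_regressions:
        return "BLOCK", f"regressed cases: {', '.join(regressions)}"

    # 3. Only then does the mean improvement matter.
    gain = mean(r.changed_score for r in results) - mean(r.baseline_score for r in results)
    if gain < min_mean_gain:
        return "BLOCK", f"mean gain {gain:+.2f} is below the threshold {min_mean_gain:+.2f}"
    return "RELEASE", f"mean gain {gain:+.2f}"


# Illustrative numbers only: the average improves but the injection case still fails.
results = [
    CaseResult("billing-01", 60, 75, critical_failure=False),
    CaseResult("billing-02", 70, 80, critical_failure=False),
    CaseResult("injection-01", 50, 50, critical_failure=True),
]
verdict, reason = release_verdict(results)
print(verdict, "-", reason)
```

Because the thresholds are function arguments, they can be fixed and recorded before the changed version is run, which is what makes the gate pre-committed.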

Looks finished. Is not.

Vibe-based comparison

Tried five prompts on a few examples. Version 7 sounded more professional and got better answers most of the time, so ship it.

Why it fails: The workload, expected outcomes, scorer, critical failures, baseline, and regression threshold are not reproducible.

Review → revise → repeat

The artifact is the beginning of the rep.

Check only standards your current artifact actually meets. Then record one consequential revision before exporting it.

Quality gate

Cases represent use: The set covers ordinary volume, meaningful segments, and known costly failures.
Scoring is calibrated: Observable anchors and critical-failure rules can guide another reviewer.
Change is isolated: Cases, settings, rubric, and review process stay fixed across versions.
Gate controls release: The verdict follows pre-set thresholds and case-level regressions.

Revision made after review: Name the artifact change, not “thought more about it”.
Add the next real miss: Put the next repetition on a real product and date.