The first version of my LLM-as-judge passed everything.
Every output, every entry, every recommendation. The pass rate was 99%. The judge was confident. The judge was wrong.
The prompt was generic. "Read this output and rate its quality on a scale of 1 to 10. Provide a brief explanation." The judge would dutifully return: "8/10. The text is well-written, conveys the key information clearly, and is engaging to read." Every output landed between 7 and 9. Nothing ever failed.
But when I read the outputs by hand, half of them were bad. Hallucinated prices. Restaurants confidently described as having dishes that didn't exist on the menu. Generic "great spot for any occasion" filler that could describe any place. The judge was passing trash, and worse — the judge was gaslighting me about it. When I asked it to be stricter, it returned 5/10 with the same vague explanations. The numbers changed; the diagnostic value didn't.
My judge was performing the appearance of evaluation without doing any of the work. It had the shape of a quality gate. It had none of the function. And without diagnostic detail, I couldn't even tell what to fix.
The fix came from an analogy I should have seen sooner.
What sommeliers do that LLM judges don't
Sommeliers don't score a wine "7/10."
They write tasting notes. A wine's note describes its appearance, its nose, its palate, its finish, its structure, its faults, its likely cellar life. Each of those is its own axis with its own vocabulary. "Garnet at the rim, intense fruit on the nose with notes of dried cherry and tobacco leaf, medium-plus acidity, tannins integrated but firm, finish carries through with savory complexity." The verdict — this is a good wine — is downstream of those notes, not upstream of them.
The score, if one exists at all, isn't the artifact. The notes are. The score is just a summary the layperson can read.
This matters for two reasons. First, it makes evaluation diagnostic. If a wine fails, the sommelier can tell you exactly why — the acidity was thin, the finish was short, the oak was overworked. The note tells you what to fix. Second, it makes evaluation checkable. Two sommeliers tasting the same wine will produce different notes, but they'll generally agree on the axes. The vocabulary is shared. The criteria are stable.
My broken judge did neither. It produced an aggregate verdict with no diagnostic. It used vague vocabulary ("well-written," "engaging") that meant nothing across raters. It hid failures inside passes.
If I'd been a sommelier evaluating wine the way my judge evaluated text, I'd have been fired by the second tasting.
Translating the analogy
The tasting-notes approach translates directly to LLM evaluation. Three principles:
Decompose into named axes. Don't ask "is this output good?" Ask: is the specificity good? Is the tone good? Is the factual accuracy good? Each axis is judged separately. The verdict is the combination of axis scores, not a unified vibe.
Use specific vocabulary per axis. Each axis has its own success criteria, stated explicitly. "Specificity" doesn't mean "is this specific enough" — it means "does this output name concrete dishes, prices, ambiance details, instead of generic filler like 'great for any occasion.'" Specific failure modes get named. The judge has to identify the kind of failure, not just the existence of one.
Weight the axes by what matters. Not every axis is equally important to the output's quality. Specificity might matter more than structural completeness. Voice might matter more than spelling. Assigning weights forces you to be explicit about your editorial priorities, which surfaces disagreements before they show up in production.
These three principles are the entire methodology. Most of the engineering is just doing them carefully.
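As a rough illustration of the three principles, a rubric can be written as plain data and rendered into the criteria section of a judge prompt. This is a minimal sketch, not Pico's actual code: the `Axis` class, the `rubric_prompt` helper, the tag names, and the weights shown here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    name: str
    weight: float                  # share of the final score; weights sum to 1.0
    criterion: str                 # what success looks like, stated explicitly
    failure_tags: tuple[str, ...]  # named failure modes the judge may report


# Hypothetical rubric: names, weights, and tags are illustrative only.
RUBRIC = (
    Axis("specificity", 0.5,
         "Names concrete dishes, prices, and ambiance details.",
         ("generic_filler", "no_concrete_detail")),
    Axis("tone", 0.3,
         "Reads like a friend recommending a place.",
         ("press_release", "no_personality")),
    Axis("accuracy", 0.2,
         "Every claim is traceable to the input data.",
         ("unsupported_claim",)),
)


def rubric_prompt(rubric: tuple[Axis, ...]) -> str:
    """Render the rubric as the criteria section of a judge prompt."""
    if abs(sum(axis.weight for axis in rubric) - 1.0) > 1e-9:
        raise ValueError("axis weights must sum to 1.0")
    lines = ["Score each axis separately from 1 to 10."]
    for axis in rubric:
        lines.append(
            f"- {axis.name}: {axis.criterion} "
            f"Allowed failure tags: {', '.join(axis.failure_tags)}."
        )
    return "\n".join(lines)


print(rubric_prompt(RUBRIC))
```

Keeping the rubric as data rather than prose buried in a prompt means the criteria, the vocabulary, and the weights each live in one place and can be changed independently.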
The Pico judge, as a worked example
The judge that runs Pico — my conversational restaurant discovery platform for São Paulo and Rio — scores every restaurant entry on five criteria with explicit weights:
- Specificity (30%) — concrete dishes, ambiance, prices. No generic filler.
- Tone (25%) — friend-recommending voice with personality.
- Accuracy (20%) — every claim traceable to input data. No hallucinated dishes or prices.
- Portuguese (15%) — natural Brazilian Portuguese, grammar, cultural fit. Not translated American copy.
- Structure (10%) — tagline, description, vibe tags, best-for, slug all well-formed.
The weights came from honest editorial argument. Specificity matters most because vague restaurant copy is the failure mode all of Pico's competitors share — Yelp, Google Maps, iFood. Tone matters next because the wager of the product is conversational discovery; tone is the product. Accuracy is non-negotiable, but it's a quality gate rather than a quality lever — you don't earn extra credit for being correct; you fail if you're wrong.
Portuguese gets 15% because the readers are paulistanos and cariocas first, and translated copy reads like ChatGPT was asked to "make it Brazilian." Structure gets 10% because the fields are mechanical — easy to enforce, easy to validate, but they're not what makes a Pico entry good.
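To make the arithmetic concrete, here is a small sketch of how those five weights could combine per-axis scores into one number. The weights are the ones listed above; the `weighted_score` function, the lowercase axis keys, and the 0 to 10 scale are assumptions made for the example.

```python
# Weights as listed above for the five criteria.
WEIGHTS = {
    "specificity": 0.30,
    "tone": 0.25,
    "accuracy": 0.20,
    "portuguese": 0.15,
    "structure": 0.10,
}


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis scores (assumed to be on a 0-10 scale) into one number."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


# Hypothetical entry: strong everywhere except specificity.
scores = {"specificity": 4, "tone": 8, "accuracy": 9, "portuguese": 8, "structure": 10}
print(round(weighted_score(scores), 2))  # 7.2
```

The made-up entry lands at 7.2 overall even though specificity scored 4. That is the old judge's failure in miniature: the aggregate looks fine, and only the per-axis breakdown shows what is wrong.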
The judge scores each axis independently, with specific failure vocabulary for each:
- Specificity failures look like: "Generic filler detected — phrase 'great spot for any occasion' suggests no concrete recommendation."
- Tone failures look like: "Reads as press release. No first-person warmth. No personality."
- Accuracy failures look like: "Mentions 'tasting menu with eight courses' but input data shows no tasting menu offered."
When an entry fails, I don't just know it failed. I know which axis, what specifically, and therefore what to fix.
A failing entry goes back through the pipeline for revision with the judge's notes attached. A passing entry ships. A second failure flags for manual review — the system doesn't keep beating its head against the same wall.
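The control flow in that paragraph can be sketched in a few lines. This is an assumption-heavy outline rather than the real pipeline: `judge` and `revise` are hypothetical callables standing in for the LLM calls, and the status strings are made up for the example.

```python
from typing import Callable

# Hypothetical signatures: judge returns (passed, notes); revise takes the
# entry plus the judge's notes and returns a new draft.
Judge = Callable[[str], tuple[bool, list[str]]]
Reviser = Callable[[str, list[str]], str]


def review(entry: str, judge: Judge, revise: Reviser) -> tuple[str, str]:
    """Return (status, final_text); status is 'ship' or 'manual_review'."""
    passed, notes = judge(entry)
    if passed:
        return "ship", entry
    # First failure: one revision pass with the judge's notes attached.
    revised = revise(entry, notes)
    passed, _ = judge(revised)
    if passed:
        return "ship", revised
    # Second failure: stop retrying and hand the entry to a human.
    return "manual_review", revised
```

Capping the loop at one revision is the design choice that matters here: an entry that fails twice is more likely to have a problem the reviser cannot see than one a third attempt would fix.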
How to write a judge prompt that doesn't hallucinate
The biggest failure mode I hit early — and the reason "tasting notes" matters as a frame — is judges that hallucinate verdicts. The judge is also an LLM. It will happily make up specific failures it didn't actually find, or pass outputs by inventing strengths that aren't there. Without discipline, you have two unreliable LLMs talking to each other.
Four patterns help:
Force structured output, not free-form prose. The judge returns a strict schema: per-axis scores, per-axis evidence, per-axis vocabulary tags. Free-form "let me explain my reasoning" prose is where hallucinations live. Schema-bound output is auditable.
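One way to enforce this is to reject any judge reply that does not parse into the expected shape. The sketch below assumes the judge is asked to reply in JSON with one object per axis; the field names `score`, `evidence`, and `tags` and the `parse_verdict` helper are hypothetical.

```python
import json

AXES = ("specificity", "tone", "accuracy", "portuguese", "structure")


def parse_verdict(raw: str) -> dict[str, dict]:
    """Parse the judge's JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # free-form prose raises JSONDecodeError (a ValueError)
    if not isinstance(data, dict) or set(data) != set(AXES):
        raise ValueError(f"expected exactly these axes: {AXES}")
    for axis, entry in data.items():
        if not isinstance(entry, dict):
            raise ValueError(f"{axis}: expected an object")
        score = entry.get("score")
        if type(score) is not int or not 1 <= score <= 10:
            raise ValueError(f"{axis}: score must be an integer from 1 to 10")
        for field in ("evidence", "tags"):
            value = entry.get(field)
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"{axis}: {field} must be a list of strings")
    return data
```

A reply like "8/10. The text is well-written." fails at the first line of the function, which is the point: the vague verdict never reaches the rest of the pipeline.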
Force the judge to cite the specific text. Every score has to point to specific phrases from the output that justify it. "Specificity score 4/10. Generic phrases: 'great for any occasion,' 'perfect spot,' 'wonderful atmosphere.'" If the judge can't cite, the judge has to score lower — the absence of evidence is a failure signal.
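Citations are only useful if they are real, and since the judge can invent quotes too, it helps to check them mechanically. A minimal sketch, assuming the cited phrases arrive as a list of strings; the `unsupported_citations` helper is hypothetical, and the case- and whitespace-insensitive matching is one possible choice.

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def unsupported_citations(output: str, cited: list[str]) -> list[str]:
    """Return cited phrases that do not appear verbatim in the judged output."""
    haystack = _norm(output)
    # An empty citation counts as unsupported rather than trivially matching.
    return [p for p in cited if not _norm(p) or _norm(p) not in haystack]


entry = "A great spot for any occasion, with a wonderful atmosphere."
cited = ["great spot for any occasion", "eight-course tasting menu"]
print(unsupported_citations(entry, cited))  # ['eight-course tasting menu']
```

Any phrase this returns is a quote the judge made up, so the verdict that depends on it should be discarded or rerun rather than trusted.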
Use evaluation-only prompts that don't also ask the model to do the original task. Don't have the judge "read this output and improve it if needed." Have the judge read it and score it. Separation of concerns. Mixing evaluation with revision contaminates the judgment.
Test your judge against known-bad and known-good outputs. Curate a small set of obvious failures and obvious successes. Run them through the judge. If the judge passes the known-bad or fails the known-good, fix the prompt. The judge is also software; software needs tests.
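A regression check along these lines can be very small. In this sketch, `judge` is a hypothetical callable that returns `True` when an output passes, and the two lists are the hand-curated examples; none of these names come from a real library.

```python
from typing import Callable, Iterable


def judge_regressions(
    judge: Callable[[str], bool],
    known_good: Iterable[str],
    known_bad: Iterable[str],
) -> list[str]:
    """Return one message per case where the judge disagrees with the label."""
    failures = []
    for text in known_good:
        if not judge(text):
            failures.append(f"failed known-good: {text!r}")
    for text in known_bad:
        if judge(text):
            failures.append(f"passed known-bad: {text!r}")
    return failures
```

Run it whenever the judge prompt changes. An empty list means the judge still agrees with every curated example, and anything else points at the specific case the new prompt broke.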
What this approach doesn't solve
Tasting notes don't fix everything. Four real limitations worth naming:
It's slower and more expensive. Every output costs an extra LLM call to score. For a corpus of 5,000 entries — Pico's current size — that's 5,000 extra inferences. The math is real. The defense is that the judge call is shorter than the generation call, and the alternative (shipping bad output) is more expensive in the long run.
Criteria drift over time. What counted as "specific enough" in the first month of Pico is the baseline now. As the corpus grows, the bar should rise. Your judge's criteria need maintenance the same way your codebase does.
Structurally-correct-but-vibes-off output still slips through. Sometimes a piece of text hits every axis cleanly and is still subtly wrong — too sales-y, too sterile, missing the cultural register. The judge is constrained by its vocabulary. If your vocabulary doesn't name a failure mode, the judge can't catch it. The fix is iterative: when you see a slipped output, add the failure mode to the vocabulary. Your judge gets smarter the way a sommelier does — by tasting bad wines and learning to name what's wrong with them.
You still have to pick the criteria. The judge is autonomous in scoring but not in deciding what to score. The criteria, the weights, the vocabulary: those come from you. Pick the wrong ones and you ship garbage that scores beautifully. The judge is a force multiplier on your editorial taste, not a substitute for it.
The principle
Decompose. Weight. Judge per axis. The verdict is the combination; the diagnostic is the breakdown.
The point isn't that this is some breakthrough architecture. It isn't. It's just evaluation done seriously, the way the rest of the engineering world figured it out for testing and the wine world figured it out for tasting. The thing that's new is treating LLM output the way a sommelier treats a glass: as something multi-dimensional, with vocabulary attached to each dimension, where the verdict is downstream of the analysis.
The version of my judge that gaslights me less is the version that talks like a sommelier. The version that ships trash is the version that talks like a Yelp review.
Take the analogy seriously enough, and your judge stops lying to you.