runbookify

QA Calibration Session Manager: Get Your Reviewers to Score the Same Way

Several reviewers score the same tickets blind, the tool surfaces exactly where they disagree, and your QA lead approves the calibrated 'agreed' score and rubric clarifications — so agents are judged fairly and consistently.

Intermediate · A weekend · Builds on Next.js, Supabase, Resend
What you'll build

A web tool where your QA team picks a calibration set, each reviewer scores the same tickets independently and blind, the tool surfaces the rubric items and tickets with the most disagreement, your lead approves the agreed scores plus any rubric clarifications, and it exports a clean calibration report.


Before you start

  • A Supabase account (free)
  • A Vercel account (free)
  • A Resend account (free)
  • A handful of calibration tickets and your scoring rubric
  • Claude Code or any AI coding agent

The problem this kills

You run quality reviews on support tickets, and you have a rubric. On paper, every reviewer should score the same ticket the same way. In reality they don't. One reviewer is a hardliner on tone; another waves it through. One reads "resolution" strictly; another counts a workaround as a fix. The result is that an agent's QA score depends as much on who graded them as on what they actually did — and your agents know it. The moment scoring feels like a lottery, the whole program loses its teeth.

The fix that real QA teams use is calibration: have several reviewers grade the same set of tickets, then sit down and work through exactly where they disagreed and why. Done well, it tightens the rubric, aligns the reviewers, and makes scores defensible. Done in a spreadsheet, it's a mess — people see each other's scores and anchor to them, nobody can tell which rubric items are the real problem, and the "agreed" answer never gets written down anywhere official. You do not need to be a developer to build something that does this properly.

What you'll build

A simple internal web tool for your QA team. A lead picks a calibration set — a handful of tickets and the rubric to grade them against. Each assigned reviewer opens the app and scores those tickets independently and blind: they can't see anyone else's scores until they've submitted their own, so nobody anchors. Once scores are in, the tool builds a variance report that surfaces exactly where the reviewers split — which tickets caused the most disagreement, and crucially which rubric items are the most contentious (the ones where your guidance is too vague). Your QA lead reviews that report and, for each disputed item, approves the calibrated "agreed" score and writes any rubric clarification that should become the new standard. Only the lead's approval makes those official. The tool then exports a clean calibration report you can share with reviewers and agents.
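To make the variance report concrete, here is a minimal sketch of how the "where did reviewers split" ranking could be computed. It assumes numeric point scores and uses the population standard deviation as the disagreement measure; the plan may well choose a different measure once it knows what "agreement" means for your team. The names `Score`, `stdDev`, and `rankDisagreement` are hypothetical and do not come from the plan.

```typescript
// Hypothetical shape for one reviewer's score on one rubric item of one ticket.
interface Score {
  reviewerId: string;
  ticketId: string;
  rubricItemId: string;
  points: number;
}

interface Disagreement {
  key: string;    // a ticket id or a rubric item id, depending on the grouping
  spread: number; // mean standard deviation across the cells under this key
}

// Population standard deviation; 0 for an empty list.
function stdDev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, b) => a + (b - mean) * (b - mean), 0) / values.length;
  return Math.sqrt(variance);
}

// Assumption: disagreement is measured per (ticket, rubric item) cell as the
// spread of reviewer scores, then averaged per ticket or per rubric item.
function rankDisagreement(
  scores: Score[],
  by: "ticketId" | "rubricItemId"
): Disagreement[] {
  const cells = new Map<string, { key: string; points: number[] }>();
  for (const s of scores) {
    const cellId = `${s.ticketId}|${s.rubricItemId}`;
    const cell = cells.get(cellId) ?? { key: s[by], points: [] };
    cell.points.push(s.points);
    cells.set(cellId, cell);
  }

  const spreadsByKey = new Map<string, number[]>();
  Array.from(cells.values()).forEach((cell) => {
    const list = spreadsByKey.get(cell.key) ?? [];
    list.push(stdDev(cell.points));
    spreadsByKey.set(cell.key, list);
  });

  // Most contentious first.
  return Array.from(spreadsByKey.entries())
    .map(([key, spreads]) => ({
      key,
      spread: spreads.reduce((a, b) => a + b, 0) / spreads.length,
    }))
    .sort((a, b) => b.spread - a.spread);
}
```

Calling `rankDisagreement(scores, "rubricItemId")` gives the rubric items ordered from most to least contentious, and `rankDisagreement(scores, "ticketId")` does the same for tickets. Averaging per cell first keeps a ticket with many rubric items from dominating the ranking just because it has more scores.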

What's inside the Implementation Plan

The downloadable plan is a step-by-step file you paste into an AI coding agent. It opens by interviewing you about your business — how QA works on your team today, what your rubric actually looks like (categories, items, weights, pass/fail vs points), how reviewers score, what counts as "agreement," your typical and peak calibration cadence, and the edge cases that always cause arguments. It reads a short spec back to you for a thumbs-up, then builds the tool around your rubric and your rules instead of a generic template. From there it walks the agent through the data model, the blind independent-scoring flow, the variance and outlier engine, the rubric-item disagreement view, the lead's approval gate, and the report export. Every step ends with a ready-to-copy prompt.
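As a rough illustration of why the interview asks about weights and pass/fail vs points, here is one way a rubric could be modeled and rolled up into a single ticket score. This is a sketch under assumptions, not the plan's actual data model: the type names, the `weightedScore` function, and the 0 to 100 scale are all hypothetical.

```typescript
// Hypothetical rubric model: each item is either pass/fail or a points scale.
type RubricItem =
  | { id: string; label: string; weight: number; kind: "passFail" }
  | { id: string; label: string; weight: number; kind: "points"; maxPoints: number };

interface Answer {
  itemId: string;
  value: number | boolean; // boolean for pass/fail items, number for points items
}

// Assumption: the ticket score is the weight-averaged fraction earned per item,
// scaled to 0..100. Your own rubric may roll up differently.
function weightedScore(rubric: RubricItem[], answers: Answer[]): number {
  const byItem = new Map<string, number | boolean>(
    answers.map((a): [string, number | boolean] => [a.itemId, a.value])
  );

  let earned = 0;
  let possible = 0;
  for (const item of rubric) {
    const value = byItem.get(item.id);
    if (value === undefined) {
      throw new Error(`Missing answer for rubric item ${item.id}`);
    }

    let fraction: number;
    if (item.kind === "passFail") {
      if (typeof value !== "boolean") {
        throw new Error(`Item ${item.id} expects pass/fail`);
      }
      fraction = value ? 1 : 0;
    } else {
      if (
        typeof value !== "number" ||
        item.maxPoints <= 0 ||
        value < 0 ||
        value > item.maxPoints
      ) {
        throw new Error(`Item ${item.id} expects 0..${item.maxPoints} points`);
      }
      fraction = value / item.maxPoints;
    }

    earned += fraction * item.weight;
    possible += item.weight;
  }
  return possible === 0 ? 0 : (earned / possible) * 100;
}
```

Modeling the two item kinds as separate variants means a pass/fail item can never be given points by mistake, which is the kind of edge case the spec read-back is meant to catch before anything gets built.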

The governance it includes (this is the point)

This isn't a toy. The plan builds in the controls a real QA function needs:

  • Login, so only your team can use it
  • Row-level security, so people only ever see their own organization's data
  • Blind scoring, so reviewers can't see each other's scores before they submit (anchoring is a real source of bias)
  • A complete audit trail of every score, edit, and approval (who, what, when, and why)
  • A hard human-approval gate, so an "agreed" score or rubric change only becomes official when the QA lead signs off
  • Duplicate guards, so one reviewer can't accidentally submit two sets of scores for the same ticket

The whole tool exists to make a careful human judgment easy and defensible: the data shows the disagreement, a person decides the standard.
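Three of these controls (blind scoring, the duplicate guard, and the approval gate) plus the audit trail can be sketched as plain in-memory logic for a single ticket. Treat this as an illustration of the rules only: the class and method names are hypothetical, and in a Supabase-backed build the same rules would more likely be enforced in the database itself. It also assumes the lead can see every score at any time, which your own rules may not allow.

```typescript
type Role = "reviewer" | "lead";

interface Actor {
  id: string;
  role: Role;
}

// One audit row: who did what, when, and why.
interface AuditEntry {
  who: string;
  what: string;
  when: Date;
  why: string;
}

// Hypothetical in-memory model of one ticket inside a calibration set.
class CalibrationTicket {
  private scores = new Map<string, number>(); // reviewerId -> points
  private agreed: number | null = null;
  readonly audit: AuditEntry[] = [];

  private log(who: string, what: string, why: string): void {
    this.audit.push({ who, what, when: new Date(), why });
  }

  // Duplicate guard: a reviewer gets exactly one submission per ticket.
  submit(reviewer: Actor, points: number): void {
    if (this.scores.has(reviewer.id)) {
      throw new Error(`Reviewer ${reviewer.id} has already scored this ticket`);
    }
    this.scores.set(reviewer.id, points);
    this.log(reviewer.id, `submitted score ${points}`, "blind scoring");
  }

  // Blind scoring: a reviewer sees nobody's scores until they have submitted.
  // Assumption: the lead can always see everything.
  visibleScores(viewer: Actor): Record<string, number> {
    const canSeeAll = viewer.role === "lead" || this.scores.has(viewer.id);
    const visible: Record<string, number> = {};
    if (!canSeeAll) return visible;
    this.scores.forEach((points, reviewerId) => {
      visible[reviewerId] = points;
    });
    return visible;
  }

  // Approval gate: only the lead can make an agreed score official.
  approve(actor: Actor, points: number, why: string): void {
    if (actor.role !== "lead") {
      throw new Error("Only the QA lead can approve an agreed score");
    }
    this.agreed = points;
    this.log(actor.id, `approved agreed score ${points}`, why);
  }

  agreedScore(): number | null {
    return this.agreed;
  }
}
```

Keeping the "why" on every audit entry is what lets the final report explain a calibrated score, and not only state it.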

Who it's for

QA leads, support quality managers, and team leads who own grading consistency and are tired of scores that depend on who happened to review the ticket. If you can describe your rubric and what "good agreement" means to you, you can build this.

You've got this — open the plan, paste the first prompt, and you'll be running your first real calibration session this weekend.
