(prompt, response) pair and whose output is a score. Type the score as int and bound it with ge/le so it cannot leave the scale.
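As a rough illustration of the bounded score, here is a minimal sketch that uses only the standard library. It assumes `ge` and `le` mean inclusive lower and upper bounds (the way Pydantic's `Field(ge=..., le=...)` uses them) and that the scale is 1 to 5. The names `Judgment`, `parse_judgment`, `SCORE_MIN`, and `SCORE_MAX` are hypothetical and not part of any API described here.

```python
from dataclasses import dataclass

# Assumed scale: inclusive bounds equivalent to ge=1, le=5.
SCORE_MIN, SCORE_MAX = 1, 5


@dataclass(frozen=True)
class Judgment:
    """Hypothetical judge output: a single bounded integer score."""

    score: int

    def __post_init__(self) -> None:
        # bool is a subclass of int, so rule it out explicitly along with floats.
        if isinstance(self.score, bool) or not isinstance(self.score, int):
            raise TypeError(f"score must be an int, got {type(self.score).__name__}")
        # Inclusive range check: ge means >=, le means <=.
        if not SCORE_MIN <= self.score <= SCORE_MAX:
            raise ValueError(
                f"score must be in [{SCORE_MIN}, {SCORE_MAX}], got {self.score}"
            )


def parse_judgment(raw: dict) -> Judgment:
    """Validate a decoded judge reply such as {"score": 4}."""
    return Judgment(score=raw["score"])
```

Calling `parse_judgment({"score": 4})` returns a `Judgment`, while a reply of `{"score": 7}` raises `ValueError` instead of silently entering the dataset. A schema library that supports `ge`/`le` would express the same bounds declaratively.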
Add a rationale
A free-text rationale makes the score auditable and surfaces rubric drift.
Multi-dimension rubric
Score each dimension independently, then add an overall score. Independent fields stop one weak dimension from dragging down the rest.
Picking the shape
| You need | Schema |
|---|---|
| One quality number | int with ge=1, le=5 |
| Number plus justification | Add a rationale field after the score |
| Per-criterion breakdown | One bounded int field per dimension |
| A vs B instead of a score | Preference data |
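
The per-criterion and rationale rows above can be combined into one shape. The sketch below is a standard-library approximation under stated assumptions: the dimensions `accuracy` and `clarity`, the 1 to 5 scale, and the class name `RubricJudgment` are all hypothetical, chosen only to illustrate one bounded int field per dimension with a rationale placed after the scores.

```python
from dataclasses import dataclass

# Assumed scale and dimension names; swap in the rubric's real criteria.
SCORE_MIN, SCORE_MAX = 1, 5
SCORED_FIELDS = ("accuracy", "clarity", "overall")


@dataclass(frozen=True)
class RubricJudgment:
    """Hypothetical multi-dimension judge output."""

    accuracy: int   # scored on its own
    clarity: int    # scored on its own
    overall: int    # a separate judgment, not derived from the fields above
    rationale: str  # free-text justification, placed after the scores

    def __post_init__(self) -> None:
        # Apply the same inclusive bounds to every scored field.
        for name in SCORED_FIELDS:
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be an int")
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(
                    f"{name} must be in [{SCORE_MIN}, {SCORE_MAX}], got {value}"
                )
        # An empty rationale defeats the purpose of auditing the score.
        if not isinstance(self.rationale, str) or not self.rationale.strip():
            raise ValueError("rationale must be a non-empty string")
```

Because each dimension is its own field, a response can score low on `accuracy` and high on `clarity` in the same record, and the rationale stays attached to the scores it explains.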
Relationship to evals
This is the same primitive as single-label classification, pointed at model outputs instead of raw data. When the judge is the deliverable, it lives here. When it scores a system under test, see Evals.
Next steps
| Task | Guide |
|---|---|
| Rank two responses | Preference data |
| Reduce single-model bias | Quality pipeline |