Inputs Required
- Clear task definition and "correct" criteria
- Golden set and scoring rubric
- Baseline model/prompt
- Metrics (quality, latency, cost)
- Acceptance thresholds
- Segmentation (easy/hard cases, user types, edge cases)
- Online instrumentation plan
Output Artifact
Evaluation plan + scorecards: offline results by segment, online experiment plan (A/B or shadow), ship/no-ship thresholds, and regression cadence.
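The offline half of this artifact can be sketched in a few lines: score a candidate against a golden set, break results out by segment, and apply per-segment acceptance thresholds to reach a ship/no-ship call. This is a minimal illustration, not a prescribed implementation; the names (`score_by_segment`, `ship_decision`, the toy `candidate` function) and the exact-match "correct" criterion are assumptions, and a real harness would call your model and use your rubric.

```python
# Minimal offline eval scorecard sketch. Assumes each golden-set item
# carries a segment label and an exact-match "expected" answer; swap in
# your own scoring rubric and model call as needed.
from collections import defaultdict

def score_by_segment(golden_set, predict):
    """Return accuracy per segment for a candidate predict() function."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in golden_set:
        seg = item["segment"]
        totals[seg] += 1
        if predict(item["input"]) == item["expected"]:
            hits[seg] += 1
    return {seg: hits[seg] / totals[seg] for seg in totals}

def ship_decision(scorecard, thresholds):
    """Ship only if every segment meets its acceptance threshold."""
    failing = {s: a for s, a in scorecard.items()
               if a < thresholds.get(s, 0.0)}
    return (len(failing) == 0, failing)

# Toy golden set segmented into easy / hard / edge cases.
golden_set = [
    {"input": "2+2", "expected": "4", "segment": "easy"},
    {"input": "17*23", "expected": "391", "segment": "hard"},
    {"input": "0/0", "expected": "undefined", "segment": "edge"},
]

def candidate(prompt):
    # Stand-in for a model call; replace with real inference.
    return {"2+2": "4", "17*23": "391", "0/0": "0"}.get(prompt, "")

scorecard = score_by_segment(golden_set, candidate)
ok, failing = ship_decision(scorecard,
                            {"easy": 0.9, "hard": 0.8, "edge": 0.9})
print(scorecard)
print("ship" if ok else f"no-ship, failing segments: {sorted(failing)}")
```

Scoring by segment (rather than one aggregate number) is what makes the decision defensible: an overall win can hide a regression on edge cases, and the per-segment thresholds surface it before the online experiment.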
When to use this
When you must compare models or prompt approaches and need a defensible way to decide what ships.