More eval traces will not stabilize your kappa. Stratify the ones you have
TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces to 200. Variance barely moved. Then we stratified the 50 we already had, by score class and a couple
โก
Key Insights
10 editorial insights.
AiFeed24 Teamยทโฑ 1 min readยทNews
Deep Analysis
Multi-Source Intelligence
Tags:#cloud
Found this useful? Share it!