· 3 days ago · Dev.to
LLM-as-judge fluctuations disrupted DPO training signals for three weeks
TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0. Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight en
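One way to put a number on "the judge flips its own labels" is to send the identical comparison to the judge several times and count how many pairs come back with more than one distinct label. The sketch below is a rough illustration under assumptions, not the measurement used in the post: `flip_rate`, `make_noisy_judge`, and the `"A"`/`"B"` label convention are all hypothetical names, and the noisy judge is a stand-in for a real LLM call.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical types: a pair is (prompt, response_a, response_b),
# and a judge returns the label "A" or "B" for the preferred response.
Pair = Tuple[str, str, str]
Judge = Callable[[str, str, str], str]


def flip_rate(judge: Judge, pairs: Sequence[Pair], repeats: int = 5) -> float:
    """Fraction of pairs on which repeated identical calls disagree.

    Assumption: a pair counts as "flipped" if the judge returns more than
    one distinct label across `repeats` calls with the same inputs.
    """
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to detect a flip")
    if not pairs:
        return 0.0
    flipped = 0
    for prompt, a, b in pairs:
        labels = Counter(judge(prompt, a, b) for _ in range(repeats))
        if len(labels) > 1:
            flipped += 1
    return flipped / len(pairs)


def make_noisy_judge(noise: float, seed: int = 0) -> Judge:
    """Stand-in for an LLM judge (hypothetical).

    Prefers the longer response, then inverts that label with
    probability `noise` to simulate nondeterminism.
    """
    rng = random.Random(seed)

    def judge(prompt: str, a: str, b: str) -> str:
        preferred = "A" if len(a) >= len(b) else "B"
        if rng.random() < noise:
            return "B" if preferred == "A" else "A"
        return preferred

    return judge


if __name__ == "__main__":
    demo_pairs = [(f"task {i}", "short", "a longer answer") for i in range(200)]
    rate = flip_rate(make_noisy_judge(noise=0.1), demo_pairs, repeats=5)
    print(f"flip rate: {rate:.2%}")
```

Swapping the stand-in for a real judge call only requires a function with the same three-argument signature. Running a check like this on a held-out sample before training shows how much of the preference signal is noise.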
#cloud-computing #llm #dpo #machine-learning