arxiv_llm_reliability · Jun 21, 2026 · paper
Source brief
BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
arxiv.org · Jun 21, 2026
In brief
LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), th...
Feed lens
agentic evaluation