Story

arxiv_llm_reliability · Jun 21, 2026 · paper

โ† Live feed ๐Ÿ“ˆ Storylines ๐Ÿ“ฐ Daily recap ๐Ÿ—“๏ธ Weekly recap โœ‰๏ธ Email digest

Source brief

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

arxiv.org · Jun 21, 2026

In brief

LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), th...

Feed lens
agentic, evaluation
