BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

arxiv_llm_reliability · Jun 21, 2026 · paper

Source brief

arxiv.org · Jun 21, 2026

In brief

LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), th...
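To make position bias concrete, here is a minimal sketch of one common way to probe it: ask the judge about each pair twice, with the two responses swapped between slots, and check whether the same underlying response wins both times. This is an illustration only, not BabelJudge's protocol, which is not visible in the truncated brief. The `Judge` signature, the function `position_bias_stats`, and both toy judges are hypothetical names introduced here.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes (prompt, response_in_slot_A, response_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_bias_stats(
    judge: Judge, pairs: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Query the judge on each pair in both slot orders and summarize.

    consistency: fraction of pairs where the same response wins in both orders.
    slot_a_rate: fraction of all verdicts that picked slot A
                 (0.5 is expected if position plays no role).
    """
    n = 0
    consistent = 0
    slot_a_wins = 0
    for prompt, r1, r2 in pairs:
        v1 = judge(prompt, r1, r2)  # r1 shown in slot A
        v2 = judge(prompt, r2, r1)  # r2 shown in slot A
        n += 1
        # The same response won both times only if the winning slot flipped.
        if {v1, v2} == {"A", "B"}:
            consistent += 1
        slot_a_wins += (v1 == "A") + (v2 == "A")
    if n == 0:
        raise ValueError("pairs must not be empty")
    return {"consistency": consistent / n, "slot_a_rate": slot_a_wins / (2 * n)}


# Toy judges for demonstration only.
def always_a(prompt: str, a: str, b: str) -> str:
    """A maximally position-biased judge: always picks slot A."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """A position-independent (but length-biased) judge."""
    return "A" if len(a) > len(b) else "B"


demo_pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "xx", "x"),
]
biased = position_bias_stats(always_a, demo_pairs)
unbiased = position_bias_stats(prefers_longer, demo_pairs)
```

With the toy data, `always_a` never agrees with itself across the two orders and always picks slot A, while `prefers_longer` is fully consistent and picks slot A half the time. A real judge would sit somewhere in between. Note that raw accuracy against gold labels could look fine for a judge with a poor consistency score.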

Feed lens

agentic · evaluation
