LLM Digest

Story

arxiv_cs_ai · Jul 1, 2026 · paper

Source brief

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

arxiv.org · Jul 1, 2026
In brief

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and offi...

Feed lens

agenteval

Earlier in this thread (3 items)

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations