arxiv_cs_lg · Jun 12, 2026 · paper
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
In brief
As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular ap...
agentic, evaluation