Story
arxiv_llm_reliability · Jun 24, 2026 · paper
Source brief
Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
arxiv.org · Jun 24, 2026
In brief
Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still large...
Feed lens
agent evaluation