LLM Digest

Story

arxiv_llm_reliability · Jun 24, 2026 · paper

Source brief

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arxiv.org · Jun 24, 2026

In brief

Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still large...

Feed lens

agent · evaluation

Earlier in this thread (4 items)

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs