arxiv_llm_reliability · Jun 15, 2026 · paper
GRACE: Step-Level Benchmark for Faithful Reasoning over Context
In brief
Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet indivi...
evaluation