LLM Digest

Story

arxiv_cs_lg · Jun 30, 2026 · paper

Source brief

RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference

arxiv.org · Jun 30, 2026

In brief

Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy...

Feed lens

evaluation



Earlier in this thread · 4 items

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Show HN: Agent-estimate, how long a coding task takes, at agent speed