๐Ÿ“ฐ Story

arxiv_cs_ai ยท Jun 16, 2026 ยท paper

โ† Live feed ๐Ÿ“ˆ Storylines ๐Ÿ“ฐ Daily recap ๐Ÿ—“๏ธ Weekly recap ๐Ÿ”” RSS

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

In brief

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora...

evaluation
Read the original at arxiv.org โ†’Open in live feedDaily recap for 2026-06-16

Related stories 4 items