LLM Digest

Story

arxiv_cs_ai · Jun 16, 2026 · paper

Source brief

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

arxiv.org · Jun 16, 2026

In brief

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora...

Feed lens

evaluation



Earlier in this thread (4 items)

[AINews] How to land a job at a frontier lab (on Pretraining)

EMO: Pretraining mixture of experts for emergent modularity

Deploying Claude Across Financial Services

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition