May 10, 2026 · 8 min read

SEC EDGAR Dataset: A Complete Guide for AI and NLP Engineers

Everything you need to know about using SEC EDGAR filings as training data for LLMs and NLP models — what the data contains, why it is valuable, how to clean it, and where to download a ready-to-use dataset.

SEC EDGAR is the Securities and Exchange Commission's public filing database. Every US public company is required to submit regulatory documents here — annual reports, quarterly reports, event disclosures, and registration statements when going public. As of 2025, EDGAR contains over 20 million documents dating back to the early 1990s. For AI engineers and NLP researchers, it is one of the richest free sources of high-quality financial text available.

Why SEC Filings Are Valuable for AI Training

Most large language models are trained predominantly on web text, which skews toward general knowledge and informal writing. Financial and legal language is systematically underrepresented. Fine-tuning on SEC filings directly addresses this gap.

High signal-to-noise ratio: SEC filings face legal scrutiny. The language is precise, intentional, and verifiable in a way that web-scraped text is not.
Consistent structure: Every 10-K has a Management's Discussion and Analysis (MD&A) section. Every S-1 has risk factors. This consistency makes labeled training data easier to produce.
Forward-looking language: Risk factors and MD&A sections are rich with forward-looking statements — useful for forecasting and sentiment models.
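Public availability: EDGAR filings are public records that anyone can access and download free of charge. The filings themselves are written by the filing companies rather than the government, so the copyright exemption for US government works does not automatically apply to them; confirm reuse terms before commercial use.
Long-form reasoning: Filings are long, dense documents whose arguments span many pages, which makes them a useful training signal for models that need to reason over long contexts.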

Filing Types and What They Contain

Understanding the different filing types is essential for knowing what training signal each document provides.

S-1: Filed when a company plans to go public. Contains a comprehensive business description, historical financial statements, risk factors, and use of proceeds. Often the richest single document for understanding a company.
10-K: Annual report with audited financials, MD&A, risk factors, and business overview. The workhorse document for financial NLP.
10-Q: Quarterly report with unaudited financials and updated business commentary. High temporal frequency at shorter length.
8-K: Current report triggered by material events — earnings, M&A, leadership changes. High-frequency and event-driven.
DRS: Draft Registration Statement, filed confidentially before the public S-1. Reveals earlier-stage company thinking.
CORRESP: Correspondence between the SEC and the filer during document review. Exposes specific regulatory concerns — unusually informative for compliance NLP.
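
To see which of these form types a given company has actually filed, one option is the SEC's public submissions endpoint on data.sec.gov. The sketch below assumes that endpoint's column-oriented JSON layout (a `filings.recent` block with parallel `form`, `filingDate`, and `accessionNumber` lists) and that requests carry a descriptive `User-Agent` header. The function names are illustrative, not part of any library.

```python
import json
import urllib.request

# Form types discussed above; EDGAR reports them in the "form" field.
FORM_TYPES = {"S-1", "10-K", "10-Q", "8-K", "DRS", "CORRESP"}


def submissions_url(cik: int) -> str:
    """Build the submissions URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def fetch_submissions(cik: int, user_agent: str) -> dict:
    """Download the filing index for one company (needs network access)."""
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def recent_filings(submissions: dict, forms=FORM_TYPES) -> list[dict]:
    """Turn the column-oriented "recent" block into one dict per filing."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["filingDate"], recent["accessionNumber"])
    return [
        {"form": form, "date_filed": date, "accession": accession}
        for form, date, accession in rows
        if form in forms
    ]
```

Calling `recent_filings(fetch_submissions(cik, "Your Name you@example.com"))` would then list only the filing types covered in this guide, skipping forms such as insider-trading reports.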

The Problem With Raw EDGAR Data

EDGAR documents are available for free download, but raw EDGAR data requires significant engineering work before it is usable in ML pipelines.

Mixed formats: Documents come as HTML, XML, SGML, and plain text, depending on the filing year and the filing agent that prepared them.
XBRL noise: Modern filings embed inline XBRL tags throughout the text that must be stripped without losing surrounding context.
Encoding artifacts: Decades of filing systems have introduced garbled characters and inconsistent whitespace.
Boilerplate: Legal disclaimers, table-of-contents entries, cover pages, and exhibit lists add noise without training signal.
Context length: A single 10-K can exceed 200,000 words. Chunking is required for most model context windows.
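
The XBRL and encoding problems above can be sketched with the standard library alone. This is a minimal illustration, not a production cleaner: it assumes inline XBRL elements use the conventional `ix:` prefix, and it uses regular expressions where a real pipeline would more likely use an HTML parser. The name `clean_filing_html` is hypothetical.

```python
import html
import re
import unicodedata

# Hidden inline-XBRL metadata block: drop it together with its contents.
IX_HEADER = re.compile(r"<ix:header\b.*?</ix:header>", re.DOTALL | re.IGNORECASE)
# Inline ix:* wrappers around visible values: drop the tag, keep the text.
IX_TAG = re.compile(r"</?ix:[^>]*>", re.IGNORECASE)
# Any remaining HTML tag becomes a space so adjacent blocks do not merge.
ANY_TAG = re.compile(r"<[^>]+>")
SMART_QUOTES = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def clean_filing_html(raw: str) -> str:
    """Strip inline XBRL and HTML tags, then normalize characters and whitespace."""
    text = IX_HEADER.sub(" ", raw)
    text = IX_TAG.sub("", text)
    text = ANY_TAG.sub(" ", text)
    text = html.unescape(text)                  # &nbsp;, &amp;, ...
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    text = text.translate(SMART_QUOTES)
    return re.sub(r"\s+", " ", text).strip()
```

Removing `ix:` wrappers without inserting a space keeps tagged values attached to their surrounding text, so a tagged amount after a dollar sign stays in one piece. Boilerplate removal (cover pages, exhibit lists) is not covered here and usually needs filing-specific rules.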

What Clean EDGAR Training Data Looks Like

Chunked into model-ready segments: typically 1,000–2,000 words with 100–200 word overlap to preserve cross-chunk context.
JSONL format: one record per line, streaming-ready, no need to load the full dataset into memory.
Structured metadata per chunk: filing type, company name, CIK, ticker, date filed, and source URL on every record.
Normalized text: UTF-8 throughout, smart quotes standardized, XBRL stripped, boilerplate removed.
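
As a rough sketch of the chunking and JSONL steps described above, the following splits cleaned text into overlapping word windows and attaches filing metadata to each record. The function names and the `chunk_index` field are assumptions made for illustration; the defaults of 1,500 words with 150 words of overlap sit inside the ranges given above.

```python
import json


def chunk_words(text: str, size: int = 1500, overlap: int = 150) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def to_jsonl(chunks: list[str], metadata: dict) -> str:
    """Serialize chunks as JSON Lines, copying the metadata onto each record."""
    lines = [
        json.dumps({**metadata, "chunk_index": i, "text": chunk}, ensure_ascii=False)
        for i, chunk in enumerate(chunks)
    ]
    return "\n".join(lines)
```

Splitting on whitespace is the simplest choice; a pipeline that cares about sentence or section boundaries would likely split on those instead and use the word counts only as a budget.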

Loading an EDGAR Dataset in Python

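```python
# Requires the Hugging Face `datasets` library: pip install datasets
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("zorynthiq/zoryntiq-sec-filings")

# Each record: source, filing_type, company_name, ticker, cik, date_filed, text, url
# Print the metadata and the first 500 characters of the first record only
for record in ds["train"]:
    print(record["filing_type"], record["company_name"])
    print(record["text"][:500])
    break
```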

For fine-tuning, pass the text field directly as training examples. For retrieval-augmented generation, embed each chunk and store the metadata as retrieval context. For classification tasks, use the filing_type field as a ready-made label.
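
For the classification case, a minimal sketch of turning `filing_type` into integer labels might look like the following. It works on any list of record dicts with `filing_type` and `text` keys; the helper names are hypothetical, and the sample records in the usage note are made up.

```python
def build_label_map(records) -> dict[str, int]:
    """Assign a stable integer id to each distinct filing_type."""
    names = sorted({record["filing_type"] for record in records})
    return {name: index for index, name in enumerate(names)}


def to_classification_examples(records, label_map: dict[str, int]) -> list[dict]:
    """Pair each chunk's text with the integer label of its filing type."""
    return [
        {"text": record["text"], "label": label_map[record["filing_type"]]}
        for record in records
    ]
```

For example, with `records = [{"filing_type": "S-1", "text": "..."}, {"filing_type": "10-K", "text": "..."}]`, calling `build_label_map(records)` and then `to_classification_examples(records, label_map)` yields text and label pairs that most training loops can consume. Sorting the label names keeps the ids stable across runs. Note that `records` is iterated twice, so pass a list or a dataset rather than a one-shot generator.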

Where to Get the Zorynthiq SEC Filings Dataset

The Zorynthiq SEC Filings dataset is available free on Hugging Face. It contains 5,179 chunks extracted from 261 recent S-1, 10-K, 10-Q, 8-K, DRS, and CORRESP filings. It focuses on recently IPO'd and pre-IPO companies, which the dataset targets as a high-signal segment of EDGAR for AI training. Each chunk is approximately 1,500 words with 150-word overlap, normalized to a consistent JSONL schema.
