May 19, 2026 · 6 min read
SEC Filing Types for AI Engineers: S-1, 10-K, 10-Q, 8-K Explained
A practical breakdown of SEC filing types — S-1, 10-K, 10-Q, 8-K, DRS, CORRESP — for AI engineers building financial NLP models, RAG systems, or fine-tuning LLMs on regulatory text.
When working with SEC filings data for machine learning, not all filing types are equivalent. Each type has a distinct purpose, structure, and vocabulary — which means they have different value for different training objectives. Using them indiscriminately can reduce model quality; using them strategically can significantly improve it.
S-1 and S-1/A: The IPO Prospectus
The S-1 is filed when a company intends to go public. It is the most comprehensive single document in the EDGAR corpus — a full business description, three years of audited financial statements, risk factors, use of proceeds, competitive positioning, and management discussion.
For AI training, S-1 filings are particularly valuable because they describe business models in depth at a level of detail not repeated in subsequent 10-Ks. The language must be accessible to retail investors, making it less dense than pure regulatory text but still domain-rich. The S-1/A is an amendment filed after the initial S-1, typically addressing SEC comments or updating financials — training on both captures the revision process.
DRS: The Confidential Draft Registration
A Draft Registration Statement is filed confidentially before the public S-1. Under the JOBS Act, companies can submit a DRS and receive SEC feedback before going public. DRS filings are released publicly 15 days before the roadshow. They are less commonly known but valuable — they represent the earliest-stage IPO thinking and often contain sections that get revised or removed before the public S-1.
10-K: The Annual Report
The 10-K is the standard annual report for public companies. For AI training, it is the workhorse document. Key sections:
- Item 1 (Business): Company description, products, markets, competition.
- Item 1A (Risk Factors): Comprehensive list of material risks — rich with forward-looking language.
- Item 7 (MD&A): Management Discussion and Analysis — the narrative explanation of financial results, written by management and reviewed by auditors.
- Item 8 (Financial Statements): Audited balance sheet, income statement, cash flow statement.
10-K filings are very long, often exceeding 100,000 words. Chunking is required. The Zorynthiq dataset uses 1,500-word chunks with 150-word overlap to preserve cross-section context.
10-Q: The Quarterly Report
The 10-Q covers the same ground as the 10-K but quarterly, with unaudited financials. Companies file three 10-Qs per year (the fourth quarter is covered by the 10-K). For training purposes, 10-Qs are valuable for capturing business commentary at higher temporal frequency and identifying how companies update their risk language quarter to quarter. They are structurally similar to 10-Ks but shorter, making them easier to process end-to-end.
8-K: The Current Report
The 8-K is filed when a material event occurs. Common triggers:
- Earnings announcements (typically as an exhibit to the 8-K).
- M&A transactions — signing, closing, or termination.
- Leadership changes — CEO, CFO, board members.
- Regulatory actions or credit facility changes.
- Restatements of financial results.
8-Ks are short, event-driven documents. For AI training, they are valuable for event detection and classification tasks — teaching models to identify what type of material event occurred from unstructured text.
CORRESP: SEC Correspondence
CORRESP filings contain letters between the SEC staff and the company during the review of a registration statement or periodic report. An SEC comment letter identifies disclosures the staff found insufficient; the company response explains or revises the disclosure.
CORRESP filings are underused in financial NLP but highly informative. They reveal what SEC reviewers consider material — valuable for compliance classification models. The question-answer format (SEC comment, company response) is naturally structured for fine-tuning dialogue models.
Building a Balanced Training Set
For a general-purpose financial language model, we recommend weighting toward the filing types with the most training signal:
- 10-K filings: 40% — highest-density narrative financial text.
- S-1 filings: 30% — comprehensive, business-description-rich, accessible language.
- 10-Q filings: 15% — high temporal frequency, shorter format.
- 8-K filings: 10% — event-driven, diverse triggers.
- CORRESP and DEF 14A: 5% — specialized but high signal for specific tasks.
The Zorynthiq SEC Filings dataset includes all of these filing types across 261 recent filings, providing a balanced starting point for financial AI training. Download it free on HuggingFace.
Free dataset
Download the Zorynthiq SEC Filings dataset
5,179 LLM-ready chunks from 261 recent SEC filings. Free on HuggingFace.
Get free access