All posts

May 19, 2026 · 6 min read

SEC Filing Types for AI Engineers: S-1, 10-K, 10-Q, 8-K Explained

A practical breakdown of SEC filing types — S-1, 10-K, 10-Q, 8-K, DRS, CORRESP — for AI engineers building financial NLP models, RAG systems, or fine-tuning LLMs on regulatory text.

When working with SEC filings data for machine learning, not all filing types are equivalent. Each type has a distinct purpose, structure, and vocabulary — which means they have different value for different training objectives. Using them indiscriminately can reduce model quality; using them strategically can significantly improve it.

S-1 and S-1/A: The IPO Prospectus

The S-1 is filed when a company intends to go public. It is the most comprehensive single document in the EDGAR corpus — a full business description, three years of audited financial statements, risk factors, use of proceeds, competitive positioning, and management discussion.

For AI training, S-1 filings are particularly valuable because they describe business models in depth at a level of detail not repeated in subsequent 10-Ks. The language must be accessible to retail investors, making it less dense than pure regulatory text but still domain-rich. The S-1/A is an amendment filed after the initial S-1, typically addressing SEC comments or updating financials — training on both captures the revision process.

DRS: The Confidential Draft Registration

A Draft Registration Statement is filed confidentially before the public S-1. Under the JOBS Act, companies can submit a DRS and receive SEC feedback before going public. DRS filings are released publicly 15 days before the roadshow. They are less commonly known but valuable — they represent the earliest-stage IPO thinking and often contain sections that get revised or removed before the public S-1.

10-K: The Annual Report

The 10-K is the standard annual report for public companies. For AI training, it is the workhorse document. Key sections:

  • Item 1 (Business): Company description, products, markets, competition.
  • Item 1A (Risk Factors): Comprehensive list of material risks — rich with forward-looking language.
  • Item 7 (MD&A): Management Discussion and Analysis — the narrative explanation of financial results, written by management and reviewed by auditors.
  • Item 8 (Financial Statements): Audited balance sheet, income statement, cash flow statement.

10-K filings are very long, often exceeding 100,000 words. Chunking is required. The Zorynthiq dataset uses 1,500-word chunks with 150-word overlap to preserve cross-section context.

10-Q: The Quarterly Report

The 10-Q covers the same ground as the 10-K but quarterly, with unaudited financials. Companies file three 10-Qs per year (the fourth quarter is covered by the 10-K). For training purposes, 10-Qs are valuable for capturing business commentary at higher temporal frequency and identifying how companies update their risk language quarter to quarter. They are structurally similar to 10-Ks but shorter, making them easier to process end-to-end.

8-K: The Current Report

The 8-K is filed when a material event occurs. Common triggers:

  • Earnings announcements (typically as an exhibit to the 8-K).
  • M&A transactions — signing, closing, or termination.
  • Leadership changes — CEO, CFO, board members.
  • Regulatory actions or credit facility changes.
  • Restatements of financial results.

8-Ks are short, event-driven documents. For AI training, they are valuable for event detection and classification tasks — teaching models to identify what type of material event occurred from unstructured text.

CORRESP: SEC Correspondence

CORRESP filings contain letters between the SEC staff and the company during the review of a registration statement or periodic report. An SEC comment letter identifies disclosures the staff found insufficient; the company response explains or revises the disclosure.

CORRESP filings are underused in financial NLP but highly informative. They reveal what SEC reviewers consider material — valuable for compliance classification models. The question-answer format (SEC comment, company response) is naturally structured for fine-tuning dialogue models.

Building a Balanced Training Set

For a general-purpose financial language model, we recommend weighting toward the filing types with the most training signal:

  • 10-K filings: 40% — highest-density narrative financial text.
  • S-1 filings: 30% — comprehensive, business-description-rich, accessible language.
  • 10-Q filings: 15% — high temporal frequency, shorter format.
  • 8-K filings: 10% — event-driven, diverse triggers.
  • CORRESP and DEF 14A: 5% — specialized but high signal for specific tasks.

The Zorynthiq SEC Filings dataset includes all of these filing types across 261 recent filings, providing a balanced starting point for financial AI training. Download it free on HuggingFace.

Free dataset

Download the Zorynthiq SEC Filings dataset

5,179 LLM-ready chunks from 261 recent SEC filings. Free on HuggingFace.

Get free access