Datasets

Domain-specific datasets for training and evaluation.

Name: Zorynthiq SEC Filings Dataset
Creator: Zorynthiq
License: https://creativecommons.org/licenses/by/4.0/
Keywords: SEC filings, EDGAR, financial NLP, LLM training data, JSONL, IPO, S-1, 10-K

Each dataset covers a single domain. It is crawled from primary public sources, cleaned to a consistent schema, and packaged as JSONL for direct use in training pipelines. Finance is our first vertical.

Finance · Dataset 01

Zorynthiq SEC Filings

5,179 LLM-ready text chunks extracted from 261 high-signal SEC EDGAR filings, including S-1s, 10-Ks, 10-Qs, 8-Ks, draft registration statements (DRS), and other form types. Each chunk is ~1,500 words, with a 150-word overlap between consecutive chunks, and is cleaned and normalized for LLM training and financial NLP. The dataset focuses on recently IPO'd and pre-IPO companies.
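To make the chunk size and overlap concrete, here is a minimal sketch of a sliding-window word chunker. It is not the pipeline used to build this dataset: the function name `chunk_words` and the use of whitespace splitting to count words are assumptions made for illustration.

```python
def chunk_words(text: str, size: int = 1500, overlap: int = 150) -> list[str]:
    """Split text into chunks of up to `size` words.

    Each chunk after the first starts with the last `overlap` words of
    the previous chunk. Words are whitespace-separated tokens, which is
    an assumption and may differ from how the dataset counts words.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


# Synthetic input: 3,000 distinct tokens, not real filing text.
demo_text = " ".join(f"w{i}" for i in range(3000))
demo_chunks = chunk_words(demo_text)
```

With the defaults, the window advances 1,350 words at a time, so the final chunk of a document is usually shorter than 1,500 words.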

Free, open access

Schema

Field	Type	Description
source	string	Data source identifier
filing_type	string	Filing type, e.g. S-1, 10-K, 8-K
company_name	string	Full legal company name
ticker	string | null	Stock ticker symbol
cik	string	SEC Central Index Key
date_filed	string	Filing date, YYYY-MM-DD
text	string	Cleaned, normalized filing text chunk
url	string	Original EDGAR document URL
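The schema can be checked while reading the JSONL, as in the sketch below. The field names and types come from the schema table; the helper name `iter_records` and the sample record are hypothetical, and the sample values are placeholders rather than data from the dataset.

```python
import json
from datetime import date
from typing import Iterable, Iterator

# Field names and types as listed in the schema; ticker may be null.
SCHEMA = {
    "source": str,
    "filing_type": str,
    "company_name": str,
    "ticker": (str, type(None)),
    "cik": str,
    "date_filed": str,
    "text": str,
    "url": str,
}


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one validated record per non-empty JSONL line."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for field, expected in SCHEMA.items():
            if field not in record:
                raise ValueError(f"line {lineno}: missing field {field!r}")
            if not isinstance(record[field], expected):
                raise ValueError(f"line {lineno}: unexpected type for {field!r}")
        # Must parse as an ISO calendar date (the schema says YYYY-MM-DD);
        # raises ValueError otherwise.
        date.fromisoformat(record["date_filed"])
        yield record


# Placeholder record for illustration only; not taken from the dataset.
sample = json.dumps({
    "source": "example-source",
    "filing_type": "S-1",
    "company_name": "Example Corp",
    "ticker": None,
    "cik": "0000000000",
    "date_filed": "2024-01-31",
    "text": "Placeholder chunk text.",
    "url": "https://example.com/filing.htm",
})
records = list(iter_records([sample]))
```

Because an open file is itself an iterable of lines, the same function can be given a file handle, for example `open("filings.jsonl", encoding="utf-8")`, where the file name is hypothetical.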

Chunks: 5,179
Filings: 261
Format: JSONL
License: Free (CC BY 4.0)

Free and open. You'll be redirected to Hugging Face after signing up.

More datasets are in development across legal, medical, and scientific domains. Each will follow the same standard: single authoritative source, consistent JSONL schema, commercially licensable.