Datasets

Domain-specific datasets for training and evaluation.

Each dataset covers a single domain, crawled from primary public sources, cleaned to a consistent schema, and packaged as JSONL for direct use in training pipelines. Finance is our first vertical.

Finance · Dataset 01

Zorynthiq SEC Filings

5,179 LLM-ready text chunks extracted from 261 high-signal SEC EDGAR filings, including S-1s, 10-Ks, 10-Qs, 8-Ks, DRS (draft registration statement) submissions, and more. Each chunk is ~1,500 words with a 150-word overlap, cleaned and normalized for LLM training and financial NLP. The collection focuses on recently IPO'd and pre-IPO companies.
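For readers who want to reproduce similar chunking on their own documents, here is a minimal sketch of a sliding word window with the sizes quoted above (about 1,500 words per chunk, 150 words of overlap). It is an illustration only: the function name `chunk_words` is hypothetical, splitting on whitespace is an assumption, and the actual pipeline used to build this dataset may differ.

```python
def chunk_words(text: str, chunk_size: int = 1500, overlap: int = 150) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words.

    Assumption: words are whitespace-delimited. The dataset's real chunker
    is not documented here and may tokenize differently.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic 3,000-word input, only to show the window arithmetic.
sample = " ".join(f"w{i}" for i in range(3000))
print([len(c.split()) for c in chunk_words(sample)])  # [1500, 1500, 300]
```

The last 150 words of each chunk repeat as the first 150 words of the next, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.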

Free · open access

Schema

source: Data source identifier
filing_type: Filing type, e.g. S-1, 10-K, 8-K
company_name: Full legal company name
ticker: Stock ticker symbol
cik: SEC Central Index Key
date_filed: Filing date, YYYY-MM-DD
text: Cleaned, normalized filing text chunk
url: Original EDGAR document URL
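As a rough illustration of how the schema can be checked when loading the JSONL file, the sketch below reads one JSON object per line, verifies that all eight fields are present, and confirms that `date_filed` parses as YYYY-MM-DD. The helper name `read_records` and the sample record are made up for this example, and treating `date_filed` as a string is an assumption; field types are not specified above.

```python
import json
from datetime import date
from typing import Iterable, Iterator

# Field names taken from the schema listed above.
FIELDS = (
    "source", "filing_type", "company_name", "ticker",
    "cik", "date_filed", "text", "url",
)


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one validated record per non-empty JSONL line."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        # Assumption: date_filed is a YYYY-MM-DD string; raises ValueError if not.
        date.fromisoformat(record["date_filed"])
        yield record


# Placeholder record with invented values, only to exercise the reader.
example_line = json.dumps({
    "source": "example", "filing_type": "S-1",
    "company_name": "Example Corp", "ticker": "EXMP", "cik": "0000000000",
    "date_filed": "2024-01-31", "text": "placeholder text",
    "url": "https://example.com/filing",
})
records = list(read_records([example_line]))
print(records[0]["filing_type"])  # S-1
```

With a downloaded file, the same function accepts an open file handle, for example `read_records(open(path, encoding="utf-8"))`, where `path` is wherever you saved the JSONL.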

Chunks: 5,179
Filings: 261
Format: JSONL
License: Free

Free and open. After signing up, you'll be redirected to Hugging Face.

More datasets are in development across legal, medical, and scientific domains. Each will follow the same standard: single authoritative source, consistent JSONL schema, commercially licensable.