Datasets
Domain-specific datasets for training and evaluation.
Each dataset covers a single domain. The data is crawled from primary public sources, cleaned to a consistent schema, and packaged as JSONL for direct use in training pipelines. Finance is our first vertical.
Zorynthiq SEC Filings
5,179 LLM-ready text chunks extracted from 261 high-signal SEC EDGAR filings, including S-1s, 10-Ks, 10-Qs, 8-Ks, and DRS (draft registration statement) submissions. Each chunk is about 1,500 words, with a 150-word overlap between consecutive chunks, and is cleaned and normalized for LLM training and financial NLP. The collection focuses on recently IPO'd and pre-IPO companies.
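To make the chunk sizing concrete, here is a minimal sketch of overlapping word-window chunking. It is an illustration only: the function name `chunk_words`, the whitespace-based word splitting, and the handling of the final short chunk are all assumptions, not the pipeline actually used to build this dataset.

```python
def chunk_words(text, chunk_size=1500, overlap=150):
    """Split text into word windows that share `overlap` words with the previous window.

    Hypothetical helper: splitting on whitespace is an assumption about
    what counts as a "word" in the dataset description.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # each new window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the default settings, each window starts 1,350 words after the previous one, so the last 150 words of one chunk repeat as the first 150 words of the next. The final chunk may be shorter than 1,500 words.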
Free, open access
Schema
| Field | Type | Description |
| --- | --- | --- |
| source | string | Data source identifier |
| filing_type | string | Filing type, e.g. S-1, 10-K, 8-K |
| company_name | string | Full legal company name |
| ticker | string or null | Stock ticker symbol |
| cik | string | SEC Central Index Key |
| date_filed | string | Filing date, YYYY-MM-DD |
| text | string | Cleaned, normalized filing text chunk |
| url | string | Original EDGAR document URL |
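As a rough sketch of how the schema might be consumed, the snippet below parses JSONL lines and checks each record against the fields listed above. The helper names (`read_jsonl`, `validate_record`), the `SAMPLE` record, and all of its values are hypothetical placeholders, not real entries from the dataset. Only `ticker` is treated as nullable, following the schema.

```python
import json

# Field names and types taken from the schema; only `ticker` may be null.
FIELDS = {
    "source": str,
    "filing_type": str,
    "company_name": str,
    "ticker": str,
    "cik": str,
    "date_filed": str,
    "text": str,
    "url": str,
}
NULLABLE = {"ticker"}

# Placeholder record for illustration only; none of these values are real data.
SAMPLE = {
    "source": "placeholder-source",
    "filing_type": "S-1",
    "company_name": "Example Corp",
    "ticker": None,
    "cik": "0000000000",
    "date_filed": "2024-01-01",
    "text": "Placeholder filing text.",
    "url": "https://example.com/placeholder",
}


def read_jsonl(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def validate_record(record):
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None:
            if name not in NULLABLE:
                problems.append(f"{name} must not be null")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems
```

To check a downloaded file, pass an open file handle to `read_jsonl` (any iterable of lines works) and call `validate_record` on each record. The local file name is whatever you saved the download as.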
Chunks: 5,179
Filings: 261
Format: JSONL
License: Free
Free and open. After signing up, you'll be redirected to Hugging Face.
More datasets are in development across legal, medical, and scientific domains. Each will follow the same standard: single authoritative source, consistent JSONL schema, commercially licensable.