May 13, 2026 · 7 min read
How to Fine-Tune an LLM on SEC Filings Data
A practical guide to fine-tuning large language models on SEC filings — dataset selection, training approaches, a working code example, and evaluation tips for financial NLP tasks.
General-purpose LLMs like Llama, Mistral, and GPT are trained on diverse web text. They have broad knowledge but thin coverage of specialized domains. Financial language — MD&A prose, risk factor boilerplate, regulatory disclosure conventions — is systematically underrepresented in most pre-training corpora. Fine-tuning a model on SEC filings produces measurable improvements on financial NLP tasks without requiring expensive proprietary data.
Choosing the Right SEC Filings Dataset
Not all SEC filing datasets are equally useful for fine-tuning. Key quality dimensions to evaluate:
- Coverage: Does the dataset include multiple filing types (S-1, 10-K, 10-Q, 8-K) or just one? Multi-type datasets produce more generalizable models.
- Recency: EDGAR filings from the 1990s use different language conventions than modern filings. For most applications, filings from 2020 onward are preferable.
- Chunking quality: Chunks should respect natural text boundaries rather than cutting at fixed character counts.
- Metadata: Per-chunk metadata enables filtered training — for example, training only on 10-K filings or filings from specific sectors.
- Normalization quality: Encoding errors and residual HTML tags degrade model quality. Inspect a sample before committing to training.
The Zorynthiq SEC Filings dataset provides 5,179 chunks from 261 recent filings across all major filing types, with consistent JSONL schema and per-chunk metadata. It is free to download from HuggingFace.
Three Fine-Tuning Approaches
1. Continued Pre-Training (Domain Adaptation)
Train the model on raw filing text using the standard next-token prediction objective. This teaches the model the vocabulary, syntax, and conventions of regulatory language without task-specific labels. Use this when you want a general-purpose financial language model as a base for downstream tasks. This approach requires the most data — ideally 500M+ tokens. For smaller datasets, combine with a larger self-built EDGAR corpus from the bulk download system.
2. Supervised Fine-Tuning on Formatted Examples
Format each chunk as a prompt-completion pair for a specific task. Common financial NLP tasks suited to this approach:
- Section classification: Given a text chunk, predict which section of the filing it came from (MD&A, Risk Factors, Business Description).
- Filing type classification: Predict whether a chunk is from a 10-K, 10-Q, 8-K, or S-1.
- Financial sentiment: Label the outlook expressed in a chunk as positive, neutral, or negative.
- Named entity extraction: Extract company names, ticker symbols, and financial figures.
3. Instruction Tuning for Financial QA
Format filing chunks as context for question-answer pairs. This produces models that answer questions about companies from their regulatory filings — the foundation of financial RAG systems.
Working Code Example
from datasets import load_dataset
from transformers import (
AutoTokenizer, AutoModelForCausalLM,
TrainingArguments, Trainer,
DataCollatorForLanguageModeling,
)
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
ds = load_dataset("zorynthiq/zoryntiq-sec-filings", split="train")
def tokenize(examples):
return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
training_args = TrainingArguments(
output_dir="./sec-llm",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
save_strategy="epoch",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()Evaluation
After fine-tuning, evaluate on held-out filing chunks not in the training set.
- Perplexity on financial text: Should decrease significantly compared to the base model.
- Filing type classification accuracy: Test whether the model correctly identifies filing types from text alone.
- Named entity F1: Evaluate entity extraction against manually labeled examples.
- Human evaluation of financial summaries: Have a domain expert rate the quality of generated MD&A summaries.
Tips for Better Results
- Filter by recency: Use filings from 2020 onward for the most current language conventions.
- Oversample rare filing types: CORRESP and DRS filings are less common but highly informative — oversample them to ensure the model learns their conventions.
- Prepend metadata to context: Include the filing type and company name at the start of each chunk so the model learns to condition on filing context.
- Use LoRA for efficiency: Full fine-tuning a 7B+ model is expensive. LoRA with rank 16-32 achieves comparable results with significantly less memory.
Free dataset
Download the Zorynthiq SEC Filings dataset
5,179 LLM-ready chunks from 261 recent SEC filings. Free on HuggingFace.
Get free access