SEC filings dataset for RAG — SEC EDGAR Full-Text (RAG-ready) (JSON/JSONL download)

Overview

Every filing is pulled straight from the official SEC EDGAR APIs, then run through a clean-and-chunk pipeline built for retrieval: HTML and scripts stripped, entities decoded, whitespace collapsed, and the text split into overlapping chunks of roughly 760 characters, each with a section label and a stable chunk ID. Load it into a vector store and you have the grounding corpus for a finance copilot built on primary-source filings.
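For readers who want to reproduce the cleaning step on their own documents, here is a minimal sketch using only the Python standard library. The regular expressions and the function name `clean_filing_html` are assumptions for illustration; they approximate the steps named above and are not the pipeline that produced this dataset.

```python
import html
import re

# Assumption: these patterns approximate the cleaning steps described above.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_filing_html(raw: str) -> str:
    """Strip scripts/styles and tags, decode entities, collapse whitespace."""
    text = _SCRIPT_STYLE.sub(" ", raw)   # drop script/style blocks with their bodies
    text = _TAG.sub(" ", text)           # drop remaining tags, keep their text
    text = html.unescape(text)           # decode entities such as &amp; and &nbsp;
    return _WHITESPACE.sub(" ", text).strip()
```

Entities are decoded after tags are removed, so an escaped `&lt;b&gt;` in the filing text survives as literal text instead of being stripped as a tag. A regex-based stripper is fine for a sketch; a real HTML parser is the safer choice for malformed filings.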

1.4M+ filings (full EDGAR history) · ~95M chunks

Daily delta (new filings within ~6h of EDGAR acceptance)

JSONL + CSV on request

Schema

One row, fully labeled.

Field	Type	Description
chunk_id	string	Stable ID: accession_form_index. Use as vector-store primary key.
cik	string	10-digit zero-padded SEC Central Index Key.
company	string	Registrant name as filed.
ticker	string	Ticker symbol where mapped (may be empty for non-listed).
sic / sic_description	string	Standard Industrial Classification code + label.
form	string	Filing type: 10-K, 10-Q, 8-K, S-1, DEF 14A, etc.
filing_date / report_date	date	ISO dates: when filed / period covered.
accession_number	string	SEC accession number (dashed).
items	string	8-K item codes (e.g. 2.02,9.01) + item_descriptions.
section	string	Logical section label (Item 1. Business, Risk Factors, MD&A...).
chunk_index / char_count	int	Position within filing + chunk size.
text	string	Cleaned chunk text, ready to embed.
source_url	string	Canonical SEC archive URL for the source document.
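The schema above maps naturally onto a typed record. The sketch below is one way to load JSONL rows into a frozen dataclass. Which fields are optional is an assumption: `ticker` may be empty per the table, `items` applies to 8-K filings, and the remaining defaults are guesses. The names `FilingChunk`, `parse_chunk` and `read_chunks` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class FilingChunk:
    # Fields treated as required in this sketch.
    chunk_id: str
    cik: str
    company: str
    form: str
    filing_date: str
    accession_number: str
    section: str
    chunk_index: int
    char_count: int
    text: str
    source_url: str
    # Assumption: these may be empty or absent on some rows.
    ticker: str = ""
    sic: str = ""
    sic_description: str = ""
    report_date: str = ""
    items: str = ""


def parse_chunk(line: str) -> FilingChunk:
    """Parse one JSONL line; keys outside the dataclass are ignored."""
    row = json.loads(line)
    known = {k: row[k] for k in FilingChunk.__dataclass_fields__ if k in row}
    return FilingChunk(**known)


def read_chunks(path: str) -> Iterator[FilingChunk]:
    """Stream records from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield parse_chunk(line)
```

A row missing one of the required fields raises `TypeError`, which surfaces schema drift early instead of letting partial records reach the vector store.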

Sample record

A real row from the feed.

Below is one genuine record, exactly as it ships. Download the full sample (.jsonl).

{
  "chunk_id": "0000320193-25-000079_10-K_0001",
  "cik": "0000320193",
  "company": "Apple Inc.",
  "ticker": "AAPL",
  "sic": "3571",
  "sic_description": "Electronic Computers",
  "form": "10-K",
  "filing_date": "2025-10-31",
  "report_date": "2025-09-27",
  "accession_number": "0000320193-25-000079",
  "section": "Item 1. Business — Products",
  "chunk_index": 0,
  "char_count": 760,
  "source_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm",
  "text": "The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company's fiscal year is the 52- or 53-week period that ends on the last Saturday of September. Products iPhone iPhone (R) is the Company's line of smartphones based on its iOS operating system. The iPhone line includes iPhone 17 Pro, iPhone Air(TM), iPhone 17, iPhone 16 and iPhone 16e. Mac Mac (R) is the Company's line of personal computers based on its macOS (R) operating system. The Mac line includes laptops MacBook Air (R) and MacBook Pro (R) , as well as desktops iMac (R) , Mac mini (R) , Mac Studio (R) and Mac Pro (R) . iPad iPad (R) is the Company's line of multipurpose tablets based on"
}
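If you need the parts of a `chunk_id`, the sketch below splits the sample's `accession_form_index` layout. It assumes form names never contain an underscore; `split_chunk_id` is a hypothetical helper, not part of the feed. Note that in the sample the ID suffix is `0001` while `chunk_index` is `0`, so the code does not assume the two values are equal.

```python
def split_chunk_id(chunk_id: str) -> tuple[str, str, int]:
    """Split 'accession_form_index' into (accession, form, index suffix)."""
    # Accession numbers use dashes, so the first underscore ends the accession.
    accession, rest = chunk_id.split("_", 1)
    # The last underscore separates the form type from the numeric suffix.
    form, suffix = rest.rsplit("_", 1)
    return accession, form, int(suffix)


parts = split_chunk_id("0000320193-25-000079_10-K_0001")
# parts == ("0000320193-25-000079", "10-K", 1)
```

For most uses the ID is best treated as an opaque primary key; read `accession_number`, `form` and `chunk_index` from their own fields when they are available.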

Use cases

Finance copilots & research assistants

Earnings / risk-factor RAG

Quant signal extraction from text

Compliance & disclosure monitoring

Pricing

Buy it once, or stream the delta.

Sample

Free

A real labeled slice of each dataset, downloadable right now. No signup to look; email to get the full sample bundle.

Real records, full schema
JSONL format
Use it to test your pipeline

Download sample

One-time dump

$200–$2,000 one-time

Full dataset (or a filtered cut) delivered as a single JSONL/CSV bundle via download link. Price scales with size & filtering.

Full history, your filters
JSONL or CSV
Delivered within 48h
Optional pre-computed embeddings (SEC)

Request a quote

Monthly delta

$99–$499/mo

Initial full dump plus a recurring incremental feed of new/changed records. SEC daily, contracts weekly.

Initial dump included
Incremental delta feed
Schema-versioned
Cancel anytime

Start subscription

Honest note: bulk delivery is fulfilled manually. You request a cut, and we ship a download link (usually within 48h). That means you always get data tailored to your filters, and a real human if something's off.

FAQ

Questions, answered.

Where does the SEC data come from?

Directly from the SEC's official EDGAR APIs (data.sec.gov submissions + the public Archives). It is primary-source public data; we only clean, chunk and normalize it for retrieval.
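To check a record against the primary source, you can rebuild the EDGAR URLs from `cik` and `accession_number`. This is a small sketch with hypothetical function names. The Archives path layout matches the `source_url` in the sample record, and the submissions endpoint takes the 10-digit zero-padded CIK. The sketch only builds URLs and makes no network calls; if you do fetch, the SEC asks automated clients to send a descriptive User-Agent header.

```python
def submissions_url(cik: str) -> str:
    """URL of the data.sec.gov submissions JSON for a registrant."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def archive_folder_url(cik: str, accession_number: str) -> str:
    """URL of the EDGAR Archives folder that holds one filing's documents."""
    # Archives paths use the CIK without leading zeros
    # and the accession number without dashes.
    folder = accession_number.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{folder}/"


sample_source_url = (
    "https://www.sec.gov/Archives/edgar/data/320193/"
    "000032019325000079/aapl-20250927.htm"
)
folder_url = archive_folder_url("0000320193", "0000320193-25-000079")
# The sample record's source_url sits inside this folder.
is_inside = sample_source_url.startswith(folder_url)
```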

What chunking strategy do you use?

~760-character chunks with a small overlap, split on section boundaries where possible. Each chunk carries a stable chunk_id, section label and source_url so citations resolve back to the exact filing. We can deliver custom chunk sizes on request.
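As an illustration of the chunking described above, here is a minimal fixed-window sketch. The 80-character overlap is an assumption (the answer only says "small"), and the function names are hypothetical. Chunking each section separately is one simple way to keep chunks from crossing section boundaries; it is not necessarily how the dataset itself was produced.

```python
def chunk_text(text: str, size: int = 760, overlap: int = 80) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_sections(sections, size: int = 760, overlap: int = 80) -> list[dict]:
    """Chunk (label, body) pairs one section at a time."""
    rows = []
    for label, body in sections:
        for piece in chunk_text(body, size, overlap):
            rows.append({
                "section": label,
                "chunk_index": len(rows),   # position within the filing
                "char_count": len(piece),
                "text": piece,
            })
    return rows
```

With these defaults a 1,000-character section yields two chunks, and the last 80 characters of the first chunk repeat at the start of the second.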

Are embeddings included?

The base dataset ships text-only so you can embed with your own model. We can pre-compute embeddings (OpenAI text-embedding-3, Voyage, or Cohere) as an add-on.
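If you embed the text yourself, a thin batching layer keeps vectors paired with their `chunk_id`. The sketch below is deliberately model-agnostic: `embed` is a placeholder for whatever function wraps your embedding model, and the batch size of 64 is an arbitrary assumption.

```python
from typing import Callable, Iterable, Iterator, Sequence

# Hypothetical signature: takes a batch of texts, returns one vector per text.
Embedder = Callable[[Sequence[str]], Sequence[Sequence[float]]]


def _embed_batch(batch: list[dict], embed: Embedder) -> list[tuple]:
    vectors = embed([row["text"] for row in batch])
    if len(vectors) != len(batch):
        raise ValueError("embedder returned a different number of vectors")
    return [(row["chunk_id"], vec) for row, vec in zip(batch, vectors)]


def embed_rows(
    rows: Iterable[dict], embed: Embedder, batch_size: int = 64
) -> Iterator[tuple]:
    """Yield (chunk_id, vector) pairs, calling `embed` once per batch."""
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _embed_batch(batch, embed)
            batch = []
    if batch:  # flush the final partial batch
        yield from _embed_batch(batch, embed)
```

Each yielded pair can go straight into a vector-store upsert keyed on `chunk_id`.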

How fresh is the data?

Full historical back-catalog plus a daily delta that captures new filings within roughly six hours of EDGAR acceptance. Delta subscribers get an incremental JSONL feed.
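On the consumer side, an incremental feed is usually applied as an upsert keyed on `chunk_id`. The sketch below assumes delta rows use the same schema as the full dump and that the feed carries no deletion markers; neither is confirmed here. The in-memory dict stands in for your vector store, and `apply_delta` is a hypothetical name.

```python
import json
from typing import Iterable


def apply_delta(store: dict[str, dict], delta_lines: Iterable[str]) -> int:
    """Upsert JSONL delta rows into `store`; return how many rows changed."""
    changed = 0
    for line in delta_lines:
        if not line.strip():
            continue  # tolerate blank lines in the feed
        row = json.loads(line)
        key = row["chunk_id"]
        if store.get(key) != row:  # new chunk, or existing chunk with new content
            store[key] = row
            changed += 1
    return changed
```

Because unchanged rows are skipped, replaying the same delta file twice is harmless, and only the rows counted as changed need to be re-embedded.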

What forms are covered?

10-K, 10-Q and 8-K are fully chunked. S-1, DEF 14A, 13F and others are available on request.

Get access

Request the SEC filings dataset for RAG.

We'll email the full sample bundle and a delivery link, with a cut tailored to your filters. Delivery is hand-fulfilled within 48h.
