SEC 10-K / 10-Q / 8-K full-text — chunked, cleaned & embedded-ready (JSON / JSONL)
Every filing pulled straight from the SEC EDGAR official APIs, then run through a clean + chunk pipeline built for retrieval: HTML/scripts stripped, entities decoded, whitespace collapsed, split into ~760-char overlapping chunks with section labels and stable chunk IDs. Drop it into a vector store and you have a finance copilot grounded in primary-source filings.
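For a feel of what the cleaning stage involves, here is a minimal sketch of the three steps named above (strip HTML/scripts, decode entities, collapse whitespace). It is an illustration, not the production pipeline: the function name `clean_filing_html` is hypothetical, and regex-based tag stripping is a simplification that a real pipeline would likely replace with a proper HTML parser.

```python
import html
import re

# Hypothetical helper for illustration only; not the shipped pipeline.
_SCRIPT_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_filing_html(raw: str) -> str:
    """Strip scripts/styles and tags, decode entities, collapse whitespace."""
    text = _SCRIPT_RE.sub(" ", raw)   # drop <script>/<style> blocks with their content
    text = _TAG_RE.sub(" ", text)     # drop the remaining tags, keep their inner text
    text = html.unescape(text)        # decode entities such as &amp; and &nbsp;
    return _WS_RE.sub(" ", text).strip()  # collapse whitespace runs (incl. non-breaking spaces)


sample = "<p>Risk&nbsp;&amp; Factors</p><script>var x = 1;</script>\n<b>Item 1A</b>"
print(clean_filing_html(sample))
```

Entities are decoded after tags are removed, so an escaped `&lt;b&gt;` in the filing survives as literal text instead of being mistaken for markup.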
| Field | Type | Description |
|---|---|---|
| chunk_id | string | Stable ID: accession_form_index. Use as vector-store primary key. |
| cik | string | 10-digit zero-padded SEC Central Index Key. |
| company | string | Registrant name as filed. |
| ticker | string | Ticker symbol where mapped (may be empty for non-listed). |
| sic / sic_description | string | Standard Industrial Classification code + label. |
| form | string | Filing type: 10-K, 10-Q, 8-K, S-1, DEF 14A, etc. |
| filing_date / report_date | date | ISO dates: when filed / period covered. |
| accession_number | string | SEC accession number (dashed). |
| items / item_descriptions | string | 8-K item codes (e.g. 2.02,9.01) and their descriptions. |
| section | string | Logical section label (Item 1. Business, Risk Factors, MD&A...). |
| chunk_index / char_count | int | Position within filing + chunk size. |
| text | string | Cleaned chunk text, ready to embed. |
| source_url | string | Canonical SEC archive URL for the source document. |
Below is one genuine record, exactly as it ships. Download the full sample (.jsonl).
```json
{
  "chunk_id": "0000320193-25-000079_10-K_0001",
  "cik": "0000320193",
  "company": "Apple Inc.",
  "ticker": "AAPL",
  "sic": "3571",
  "sic_description": "Electronic Computers",
  "form": "10-K",
  "filing_date": "2025-10-31",
  "report_date": "2025-09-27",
  "accession_number": "0000320193-25-000079",
  "section": "Item 1. Business — Products",
  "chunk_index": 0,
  "char_count": 760,
  "source_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm",
  "text": "The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company's fiscal year is the 52- or 53-week period that ends on the last Saturday of September. Products iPhone iPhone (R) is the Company's line of smartphones based on its iOS operating system. The iPhone line includes iPhone 17 Pro, iPhone Air(TM), iPhone 17, iPhone 16 and iPhone 16e. Mac Mac (R) is the Company's line of personal computers based on its macOS (R) operating system. The Mac line includes laptops MacBook Air (R) and MacBook Pro (R) , as well as desktops iMac (R) , Mac mini (R) , Mac Studio (R) and Mac Pro (R) . iPad iPad (R) is the Company's line of multipurpose tablets based on"
}
```
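If you want to sanity-check a sample file before loading it into a vector store, something like the following may help. It is a sketch under assumptions: the helper names `load_chunks` and `cite` are hypothetical, the required-field list is taken from the field table above, and the citation format is just one possible choice.

```python
import json
from typing import Iterable

# Fields this sketch treats as mandatory; names come from the field table.
REQUIRED = (
    "chunk_id", "cik", "company", "form", "filing_date",
    "accession_number", "section", "chunk_index", "text", "source_url",
)


def load_chunks(lines: Iterable[str]) -> dict[str, dict]:
    """Parse JSONL lines into a dict keyed by chunk_id, skipping blank lines."""
    store: dict[str, dict] = {}
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        rec = json.loads(line)
        missing = [key for key in REQUIRED if key not in rec]
        if missing:
            raise ValueError(f"line {n}: missing fields {missing}")
        if rec["chunk_id"] in store:
            raise ValueError(f"line {n}: duplicate chunk_id {rec['chunk_id']}")
        store[rec["chunk_id"]] = rec
    return store


def cite(rec: dict) -> str:
    """Format a human-readable citation that points back to the filing."""
    return (f'{rec["company"]}, {rec["form"]} filed {rec["filing_date"]}, '
            f'{rec["section"]} ({rec["source_url"]})')
```

Typical use would be `store = load_chunks(open("sample.jsonl", encoding="utf-8"))`, where `sample.jsonl` stands in for whatever path you saved the sample to.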
A real labeled slice of each dataset, downloadable right now. No signup to look; email to get the full sample bundle.
Full dataset (or a filtered cut) delivered as a single JSONL/CSV bundle via download link. Price scales with size & filtering.
Initial full dump plus a recurring incremental feed of new/changed records: daily for SEC filings, weekly for contracts.
Honest note: bulk delivery is fulfilled manually. You request a cut and we ship a download link (usually within 48h). That means you always get a cut tailored to your filters, and a real human if something's off.
The data comes directly from the SEC's official EDGAR APIs (data.sec.gov submissions plus the public Archives). It is primary-source public data; we only clean, chunk and normalize it for retrieval.
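To check a record against the source yourself, the two EDGAR locations mentioned here can be derived from a record's `cik` and `accession_number`. The sketch below only builds URLs and makes no network calls; the function names are hypothetical. If you do fetch from EDGAR, note that the SEC asks automated clients to send a descriptive User-Agent header.

```python
def submissions_url(cik: str) -> str:
    """URL of the data.sec.gov submissions JSON for a registrant (CIK zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def archive_folder_url(cik: str, accession_number: str) -> str:
    """URL of the public Archives folder for one filing (unpadded CIK, accession without dashes)."""
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{accession_number.replace('-', '')}/"
    )


print(submissions_url("0000320193"))
print(archive_folder_url("0000320193", "0000320193-25-000079"))
```

For the sample Apple record, the folder URL produced this way is a prefix of the record's `source_url`, which is a quick consistency check you can run across a whole delivery.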
Filings are split into ~760-character chunks with a small overlap, on section boundaries where possible. Each chunk carries a stable chunk_id, section label and source_url so citations resolve back to the exact filing. We can deliver custom chunk sizes on request.
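As a rough illustration of the chunking scheme, here is a minimal sketch of fixed-size windows with overlap, applied per section so no chunk crosses a section boundary. The 80-character overlap is an assumed value (the overlap size is not stated here), and `chunk_text` / `chunk_sections` are hypothetical names, not the production code.

```python
def chunk_text(text: str, size: int = 760, overlap: int = 80) -> list[str]:
    """Split text into windows of at most `size` chars; consecutive windows share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def chunk_sections(sections: list[tuple[str, str]], size: int = 760,
                   overlap: int = 80) -> list[dict]:
    """Chunk each (label, body) section separately and number chunks across the filing."""
    records: list[dict] = []
    for label, body in sections:
        for piece in chunk_text(body, size, overlap):
            records.append({
                "section": label,
                "chunk_index": len(records),  # zero-based position within the filing
                "char_count": len(piece),
                "text": piece,
            })
    return records
```

Chunking per section keeps every chunk's `section` label unambiguous, at the cost of some short trailing chunks at the end of each section.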
The base dataset ships text-only so you can embed with your own model. We can pre-compute embeddings (OpenAI text-embedding-3, Voyage, or Cohere) as an add-on.
Coverage is the full historical back-catalog plus a daily delta that captures new filings within roughly six hours of EDGAR acceptance. Delta subscribers get an incremental JSONL feed.
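On the consuming side, a delta file can be folded into an existing store with a simple upsert keyed on `chunk_id`. This is a sketch that assumes delta records use the same schema as the base dataset and that a changed record reuses its original `chunk_id`; how removals are signalled is not covered here. The name `apply_delta` is hypothetical.

```python
import json
from typing import Iterable


def apply_delta(store: dict[str, dict], delta_lines: Iterable[str]) -> tuple[int, int]:
    """Upsert delta records into `store` by chunk_id; return (inserted, updated) counts."""
    inserted = updated = 0
    for line in delta_lines:
        if not line.strip():
            continue  # tolerate blank lines in the feed
        rec = json.loads(line)
        if rec["chunk_id"] in store:
            updated += 1
        else:
            inserted += 1
        store[rec["chunk_id"]] = rec  # the newest version of a record wins
    return inserted, updated
```

With a vector store, the same loop would re-embed and overwrite the vector for each updated `chunk_id` instead of assigning into a dict.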
10-K, 10-Q and 8-K are fully chunked. S-1, DEF 14A, 13F and others are available on request.
We'll email the full sample bundle and a delivery link, with a cut tailored to your filters. Delivery is fulfilled by hand, usually within 48h.