SEC EDGAR Full-Text (RAG-ready)

SEC 10-K / 10-Q / 8-K full-text — chunked, cleaned & embedded-ready (JSON / JSONL)

Overview

Every filing is pulled straight from the official SEC EDGAR APIs, then run through a clean-and-chunk pipeline built for retrieval: HTML and scripts stripped, entities decoded, whitespace collapsed, and the text split into ~760-character overlapping chunks with section labels and stable chunk IDs. Load it into a vector store and you have a finance copilot grounded in primary-source filings.
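
To make the cleaning steps concrete, here is a minimal sketch of what "strip HTML/scripts, decode entities, collapse whitespace" can look like using only the Python standard library. It is an illustration under assumptions, not the production pipeline: the names `_TextExtractor` and `clean_filing_html` are hypothetical, and real filings (inline XBRL, nested tables) usually need more handling than this.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; or &#8217; before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_filing_html(raw: str) -> str:
    """Hypothetical helper: return plain text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps block-level text apart; it can also insert
    # a space where an inline tag sits in the middle of a word.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_filing_html("<p>AT&amp;T   Inc.</p><script>var x=1;</script>")` yields `AT&T Inc.`.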

  • 1.4M+ filings (full EDGAR history) · ~95M chunks
  • Daily delta (new filings within ~6h of EDGAR acceptance)
  • JSONL, with CSV on request

Schema

One row, fully labeled.

Field | Type | Description
chunk_id | string | Stable ID: accession_form_index. Use as vector-store primary key.
cik | string | 10-digit zero-padded SEC Central Index Key.
company | string | Registrant name as filed.
ticker | string | Ticker symbol where mapped (may be empty for non-listed).
sic / sic_description | string | Standard Industrial Classification code + label.
form | string | Filing type: 10-K, 10-Q, 8-K, S-1, DEF 14A, etc.
filing_date / report_date | date | ISO dates: when filed / period covered.
accession_number | string | SEC accession number (dashed).
items | string | 8-K item codes (e.g. 2.02,9.01) + item_descriptions.
section | string | Logical section label (Item 1. Business, Risk Factors, MD&A...).
chunk_index / char_count | int | Position within filing + chunk size.
text | string | Cleaned chunk text, ready to embed.
source_url | string | Canonical SEC archive URL for the source document.

Sample record

A real row from the feed.

Below is one genuine record, exactly as it ships. Download the full sample (.jsonl).

{
  "chunk_id": "0000320193-25-000079_10-K_0001",
  "cik": "0000320193",
  "company": "Apple Inc.",
  "ticker": "AAPL",
  "sic": "3571",
  "sic_description": "Electronic Computers",
  "form": "10-K",
  "filing_date": "2025-10-31",
  "report_date": "2025-09-27",
  "accession_number": "0000320193-25-000079",
  "section": "Item 1. Business — Products",
  "chunk_index": 0,
  "char_count": 760,
  "source_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm",
  "text": "The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company's fiscal year is the 52- or 53-week period that ends on the last Saturday of September. Products iPhone iPhone (R) is the Company's line of smartphones based on its iOS operating system. The iPhone line includes iPhone 17 Pro, iPhone Air(TM), iPhone 17, iPhone 16 and iPhone 16e. Mac Mac (R) is the Company's line of personal computers based on its macOS (R) operating system. The Mac line includes laptops MacBook Air (R) and MacBook Pro (R) , as well as desktops iMac (R) , Mac mini (R) , Mac Studio (R) and Mac Pro (R) . iPad iPad (R) is the Company's line of multipurpose tablets based on"
}
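
Before loading rows into a vector store, it can help to sanity-check each JSONL line against the schema. The sketch below assumes the chunk_id layout `accession_form_index` described in the schema and splits on underscores; `parse_chunk_id` and `check_record` are hypothetical helper names, and the split rule is an assumption for forms whose names might themselves contain underscores.

```python
import json


def parse_chunk_id(chunk_id: str) -> tuple[str, str, int]:
    """Split 'accession_form_index' into its three parts (assumed layout)."""
    accession, rest = chunk_id.split("_", 1)
    form, index = rest.rsplit("_", 1)
    return accession, form, int(index)


def check_record(line: str) -> dict:
    """Parse one JSONL line and verify its IDs agree with each other."""
    rec = json.loads(line)
    accession, form, _index = parse_chunk_id(rec["chunk_id"])
    if accession != rec["accession_number"]:
        raise ValueError("chunk_id accession does not match accession_number")
    if form != rec["form"]:
        raise ValueError("chunk_id form does not match form")
    if not (len(rec["cik"]) == 10 and rec["cik"].isdigit()):
        raise ValueError("cik is not a 10-digit zero-padded string")
    # The sample row pairs an ID suffix of 0001 with chunk_index 0, so this
    # sketch does not assume the two numbers are equal.
    return rec
```

On the sample row above, `parse_chunk_id(rec["chunk_id"])` returns `("0000320193-25-000079", "10-K", 1)`.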

Use cases

Finance copilots & research assistants

Earnings / risk-factor RAG

Quant signal extraction from text

Compliance & disclosure monitoring

Pricing

Buy it once, or stream the delta.

Sample

Free

A real labeled slice of each dataset, downloadable right now. No signup to look; email to get the full sample bundle.

  • Real records, full schema
  • JSONL format
  • Use it to test your pipeline

Monthly delta

$99–$499/mo

Initial full dump plus a recurring incremental feed of new/changed records. Delta cadence: SEC filings daily, contracts weekly.

  • Initial dump included
  • Incremental delta feed
  • Schema-versioned
  • Cancel anytime

Honest note: bulk delivery is fulfilled manually. You send a request and we ship a download link, usually within 48h. That means you always get a cut tailored to your filters, and a real human if something's off.

FAQ

Questions, answered.

Where does the SEC data come from?

Directly from the SEC's official EDGAR APIs (data.sec.gov submissions + the public Archives). It is primary-source public data; we only clean, chunk and normalize it for retrieval.
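
If you want to trace a record back to those sources yourself, the two endpoints follow simple URL conventions. The sketch below builds the data.sec.gov submissions URL and the Archives folder URL from the `cik` and `accession_number` fields; the function names are hypothetical, and no request is sent. Note that the SEC asks automated clients to send a descriptive User-Agent header when they do fetch these URLs.

```python
def submissions_url(cik: str) -> str:
    """Submissions JSON for a registrant; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def archive_dir_url(cik: str, accession_number: str) -> str:
    """Archives folder for one filing: unpadded CIK, accession without dashes."""
    accession_nodash = accession_number.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{accession_nodash}/"
```

For the sample record, `archive_dir_url("0000320193", "0000320193-25-000079")` gives `https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/`, which is the folder portion of that row's source_url.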

What chunking strategy do you use?

~760-character chunks with a small overlap, split on section boundaries where possible. Each chunk carries a stable chunk_id, section label and source_url so citations resolve back to the exact filing. We can deliver custom chunk sizes on request.
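
As a rough illustration of fixed-size chunking with overlap, here is a minimal character-window sketch. It is a simplification: it does not split on section boundaries, the function name `chunk_text` is hypothetical, and the 80-character overlap is an assumed value, since the exact overlap is not stated here.

```python
def chunk_text(text: str, size: int = 760, overlap: int = 80) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end of the text, so the tail is
        # not emitted again as a tiny extra chunk.
        if start + size >= len(text):
            break
    return chunks
```

Each chunk begins with the last `overlap` characters of the previous one, so a sentence cut at a window edge still appears whole in at least one chunk when it is shorter than the overlap.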

Are embeddings included?

The base dataset ships text-only so you can embed with your own model. We can pre-compute embeddings (OpenAI text-embedding-3, Voyage, or Cohere) as an add-on.

How fresh is the data?

Full historical back-catalog plus a daily delta that captures new filings within roughly six hours of EDGAR acceptance. Delta subscribers get an incremental JSONL feed.
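
Because chunk_id is meant to serve as a primary key, applying a delta file can be as simple as an upsert keyed on it. The sketch below uses an in-memory dict as a stand-in for a vector store; `apply_delta` is a hypothetical name, and how the feed signals deleted records (if it does) is not covered.

```python
import json
from typing import Iterable


def apply_delta(store: dict[str, dict], delta_lines: Iterable[str]) -> int:
    """Upsert each JSONL record into `store` by chunk_id; return rows applied."""
    applied = 0
    for line in delta_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the feed file
        record = json.loads(line)
        # New chunk_ids are inserted; existing ones are overwritten.
        store[record["chunk_id"]] = record
        applied += 1
    return applied
```

In practice the same loop would call your vector store's upsert method and re-embed the `text` field of any record that changed.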

What forms are covered?

10-K, 10-Q and 8-K are fully chunked. S-1, DEF 14A, 13F and others are available on request.

Get access

Request the SEC filings dataset for RAG.

We'll email the full sample bundle and a delivery link, with a cut tailored to your filters. Delivery is fulfilled by hand, usually within 48h.