EdgeParse

Fastest PDF extraction engine. Rust-native. Zero GPU, zero JVM, zero OCR models.


Extract Markdown, JSON (with bounding boxes), and HTML from any born-digital PDF — deterministically, in milliseconds, on CPU.

  • How accurate is it? — 0.787 overall on the latest opendataloader.org PDF-to-Markdown benchmark, with the best score in every reported metric: reading order, tables, headings, paragraphs, text quality, table detection, and speed. Benchmark details
  • How fast? — 0.064 s/doc on the 200-document benchmark corpus on Apple M4 Max. Faster than OpenDataLoader, Docling, PyMuPDF4LLM, MarkItDown, and LiteParse.
  • Does it need a GPU or Java? — No. No JVM, no GPU, no OCR models, no Python runtime for the CLI. Single ~15 MB binary.
  • RAG / LLM pipelines? — Yes. Outputs structured Markdown for chunking and JSON with bounding boxes for citations, and preserves reading order across multi-column layouts. See integration examples

Available as a Rust library, CLI binary, Python package, Node.js package, and WebAssembly module for in-browser PDF parsing.


Get Started in 30 Seconds

Python (Python 3.9+):

pip install edgeparse
import edgeparse

# Markdown — ready for LLM context or RAG chunking
md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes — for citations and element-level control
data = edgeparse.convert("report.pdf", format="json")

CLI (Rust 1.85+):

cargo install edgeparse-cli
edgeparse report.pdf --format markdown --output-dir output/

Node.js (Node 18+):

npm install edgeparse
import { convert } from 'edgeparse';
const md = convert('report.pdf', { format: 'markdown' });

Release Channels

Tagged releases publish every supported distribution target through GitHub Actions:

| Channel | Artifact | Install / Pull |
|---|---|---|
| Rust crates | pdf-cos, edgeparse-core, edgeparse-cli | `cargo install edgeparse-cli` |
| Python SDK | edgeparse wheels + sdist | `pip install edgeparse` |
| Node.js SDK | edgeparse + 5 platform addons | `npm install edgeparse` |
| WebAssembly SDK | edgeparse-wasm | `npm install edgeparse-wasm` |
| CLI binaries | GitHub Release archives for macOS, Linux, Windows | GitHub Releases |
| Homebrew | raphaelmansuy/edgeparse tap | `brew tap raphaelmansuy/edgeparse && brew install edgeparse` |
| Containers | GHCR + Docker Hub multi-arch images | `docker pull ghcr.io/raphaelmansuy/edgeparse:0.2.1` |

Release automation and registry details: docs/07-cicd-publishing.md


What Problems Does This Solve?

| Problem | EdgeParse Solution | Status |
|---|---|---|
| PDF text loses reading order in multi-column layouts | XY-Cut++ algorithm preserves correct reading sequence across columns, sidebars, and mixed layouts | ✅ Shipped |
| Table extraction is broken (merged cells, borderless tables) | Ruling-line table detection + borderless cluster method; `--table-method cluster` for complex cases | ✅ Shipped |
| OCR/ML tools add 500 MB+ of dependencies to a simple PDF pipeline | Zero GPU, zero OCR models, zero JVM — single 15 MB binary, pure Rust | ✅ Shipped |
| Heading hierarchy is lost (all text looks the same) | Font-metric + geometry-based heading classifier; MHS score 0.553 on the current benchmark | ✅ Shipped |
| PDFs can carry hidden prompt injection payloads | AI safety filters: hidden text, off-page content, tiny-text, invisible OCG layers detected and stripped | ✅ Shipped |
| Need bounding boxes to cite sources in RAG answers | Every element (paragraph, heading, table, image) has [left, bottom, right, top] coordinates in PDF points | ✅ Shipped |
| In-browser PDF parsing uploads data to a server | WebAssembly build — full Rust engine in the browser, PDF data never leaves the device | ✅ Shipped |

Benchmark

Evaluated on 200 real-world PDFs — academic papers, financial reports, multi-column layouts, complex tables, mixed-language documents — running on Apple M4 Max.

Current comparison set

| Engine | NID ↑ | TEDS ↑ | MHS ↑ | PBF ↑ | TQS ↑ | TD F1 ↑ | Speed ↓ | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| EdgeParse | 0.889 | 0.596 | 0.553 | 0.559 | 0.920 | 0.901 | 0.064 s/doc | 0.787 |
| OpenDataLoader | 0.873 | 0.326 | 0.442 | 0.544 | 0.916 | 0.636 | 0.094 s/doc | 0.733 |
| Docling | 0.867 | 0.540 | 0.438 | 0.530 | 0.908 | 0.891 | 0.768 s/doc | 0.745 |
| PyMuPDF4LLM | 0.852 | 0.323 | 0.407 | 0.538 | 0.888 | 0.744 | 0.439 s/doc | 0.710 |
| EdgeParse (pre-frontier baseline) | 0.859 | 0.493 | 0.500 | 0.482 | 0.891 | 0.849 | 0.232 s/doc | 0.751 |
| MarkItDown | 0.808 | 0.193 | 0.001 | 0.362 | 0.861 | 0.558 | 0.149 s/doc | 0.564 |
| LiteParse | 0.815 | 0.000 | 0.001 | 0.383 | 0.887 | N/A | 0.196 s/doc | 0.564 |

EdgeParse now leads the entire comparison set on every reported benchmark metric, including speed. Relative to the previous EdgeParse baseline, the current pipeline increases reading-order accuracy, table structure similarity, paragraph boundaries, text quality, table-detection F1, and overall score while cutting latency from 0.232 to 0.064 s/doc.

When to choose what:

| Use case | Recommendation |
|---|---|
| Born-digital PDFs, latency matters, production deployment | EdgeParse — best accuracy/speed, zero dependencies |
| Complex scanned tables, GPU available, batch offline | Consider Docling or MinerU |
| Scanned documents requiring full OCR | Dedicated OCR pipeline |

Metrics

| Metric | What it measures |
|---|---|
| NID | Reading order accuracy — normalised index distance |
| TEDS | Table structure accuracy — tree-edit distance vs. ground truth |
| MHS | Heading hierarchy accuracy |
| PBF | Paragraph boundary F1 |
| TQS | Text quality score |
| TD F1 | Table detection F1 |
| Overall | Normalized aggregate benchmark score |
| Speed | Wall-clock seconds per document (full pipeline, 200 docs, parallel) |

Running the benchmark

cargo build --release
cd benchmark
uv sync
uv run python run.py          # EdgeParse on all 200 docs
uv run python compare_all.py  # Compare against 9 engines

Results → benchmark/prediction/edgeparse/ · HTML reports → benchmark/reports/

Regression thresholds

benchmark/thresholds.json defines minimum acceptable scores for CI:

{
  "nid": 0.85,
  "teds": 0.40,
  "mhs": 0.55,
  "table_detection_f1": 0.55,
  "elapsed_per_doc": 2.0
}
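A CI gate over these thresholds can be sketched in a few lines (illustrative only; the real comparison logic lives in the benchmark scripts, and `meets_thresholds` is a hypothetical helper, not part of the repo). Note that `elapsed_per_doc` is a ceiling while the quality metrics are floors:

```python
import json

def meets_thresholds(results: dict, thresholds: dict) -> bool:
    """Check a benchmark run against regression thresholds.

    Quality metrics (nid, teds, mhs, table_detection_f1) are minimums;
    elapsed_per_doc is a maximum.
    """
    for key, limit in thresholds.items():
        value = results[key]
        if key == "elapsed_per_doc":
            if value > limit:   # latency must stay under the ceiling
                return False
        elif value < limit:     # scores must stay above the floor
            return False
    return True

thresholds = json.loads(
    '{"nid": 0.85, "teds": 0.40, "mhs": 0.55,'
    ' "table_detection_f1": 0.55, "elapsed_per_doc": 2.0}'
)
current = {"nid": 0.889, "teds": 0.596, "mhs": 0.553,
           "table_detection_f1": 0.901, "elapsed_per_doc": 0.064}
print(meets_thresholds(current, thresholds))  # → True
```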

Capability Matrix

| Capability | Available | Notes |
|---|---|---|
| **Extraction** | | |
| Text with correct reading order | ✅ | XY-Cut++ across columns and sidebars |
| Bounding boxes for every element | ✅ | [left, bottom, right, top] in PDF points |
| Table extraction — ruling-line borders | ✅ | Default mode |
| Table extraction — borderless/cluster | ✅ | `--table-method cluster` |
| Heading hierarchy detection | ✅ | Numbered + unnumbered, all levels |
| List detection (numbered, bulleted, nested) | ✅ | |
| Image extraction with coordinates | ✅ | PNG or JPEG, embedded or external |
| Header / footer / watermark filtering | ✅ | |
| Multi-column layout support | ✅ | |
| CMap / ToUnicode font decoding | ✅ | Handles non-standard encodings |
| Tagged PDF structure tree | ✅ | `--use-struct-tree` preserves author intent |
| **Safety** | | |
| Hidden text filtering | ✅ | Prompt injection protection |
| Off-page content filtering | ✅ | |
| Tiny-text / invisible OCG layer filtering | ✅ | |
| PII sanitization | ✅ | `--sanitize` flag |
| **Output** | | |
| Markdown (GFM tables) | ✅ | |
| JSON with bounding boxes | ✅ | |
| HTML5 | ✅ | |
| Plain text (reading order preserved) | ✅ | |
| **SDKs** | | |
| Python (PyO3 native extension) | ✅ | Python 3.9+, pre-built wheels |
| Node.js (NAPI-RS native addon) | ✅ | Node 18+, pre-built addons |
| WebAssembly | ✅ | In-browser, no server required |
| Rust library | ✅ | `edgeparse-core` crate |
| CLI binary | ✅ | `edgeparse-cli` on crates.io |
| **Runtime** | | |
| GPU required | ❌ No | CPU only |
| JVM required | ❌ No | Pure Rust |
| OCR models required | ❌ No | Born-digital PDFs only |
| Parallel processing | ✅ | Rayon per-page parallelism |
| Deterministic / reproducible | ✅ | Same input → same output, always |

Installation

Python

pip install edgeparse

Python 3.9+. Pre-built wheels for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

CLI

cargo install edgeparse-cli

Requires Rust 1.85+.

Rust library

[dependencies]
edgeparse-core = "0.2.1"

Docs: docs.rs/edgeparse-core · docs.rs/edgeparse-cli

Node.js

npm install edgeparse

Node 18+. Pre-built native addons for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

Build from source

git clone https://github.com/raphaelmansuy/edgeparse.git
cd edgeparse
cargo build --release
# Binary: target/release/edgeparse

System requirements

  • macOS 12+, Linux (glibc 2.17+), or Windows 10+
  • ~15 MB binary (stripped release build)
  • No Java, no Python (for the CLI), no GPU

Python SDK

Package: edgeparse · Requires: Python 3.9+ · Source: sdks/python/

edgeparse.convert()

def convert(
    input_path: str | Path,
    *,
    format: str = "markdown",       # "markdown", "json", "html", "text"
    pages: str | None = None,        # e.g. "1,3,5-7"
    password: str | None = None,
    reading_order: str = "xycut",   # "xycut" or "off"
    table_method: str = "default",  # "default" or "cluster"
    image_output: str = "off",      # "off", "embedded", "external"
) -> str: ...

edgeparse.convert_file()

def convert_file(
    input_path: str | Path,
    output_dir: str | Path = "output",
    *,
    format: str = "markdown",
    pages: str | None = None,
    password: str | None = None,
) -> str: ...  # returns output file path

Examples

import edgeparse

# Basic Markdown extraction
md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes
json_str = edgeparse.convert("report.pdf", format="json")

# Password-protected, specific pages, borderless tables
md = edgeparse.convert(
    "secure.pdf",
    format="markdown",
    pages="1,3,5-7",
    password="secret",
    reading_order="xycut",
    table_method="cluster",
)

# Write output file  
out_path = edgeparse.convert_file("report.pdf", output_dir="output/", format="markdown")

CLI entry point (Python package)

edgeparse report.pdf -f markdown -o output/
edgeparse *.pdf --format json --output-dir out/ --pages "1-3"

Node.js SDK

Package: edgeparse · Requires: Node.js 18+ · Source: sdks/node/

convert()

import { convert } from 'edgeparse';

// Markdown
const md = convert('report.pdf', { format: 'markdown' });

// JSON with bounding boxes
const json = convert('report.pdf', { format: 'json' });

// With options
const result = convert('report.pdf', {
  format: 'markdown',
  pages: '1-5',
  readingOrder: 'xycut',
  tableMethod: 'cluster',
});

ConvertOptions

interface ConvertOptions {
  format?: 'markdown' | 'json' | 'html' | 'text';  // default: "markdown"
  pages?: string;         // e.g. "1,3,5-7"
  password?: string;
  readingOrder?: 'xycut' | 'off';        // default: "xycut"
  tableMethod?: 'default' | 'cluster';   // default: "default"
  imageOutput?: 'off' | 'embedded' | 'external';  // default: "off"
}

CLI (npm)

npx edgeparse report.pdf -f markdown -o output.md
npx edgeparse report.pdf --format json --pages "1-5"

WebAssembly SDK

EdgeParse compiles to WebAssembly — client-side PDF extraction in any modern browser with no server, no uploads, and no backend infrastructure.

  • Same Rust engine, identical output to CLI/Python/Node
  • PDF data never leaves the user's device (privacy by design)
  • Works offline after initial WASM load (~4 MB cached)
  • Zero infrastructure cost — static hosting only

Quick start

import init, { convert_to_string } from 'edgeparse-wasm';

await init();  // load WASM binary once

const bytes = new Uint8Array(await file.arrayBuffer());

const markdown = convert_to_string(bytes, 'markdown');
const json     = convert_to_string(bytes, 'json');
const html     = convert_to_string(bytes, 'html');

API

| Function | Returns | Description |
|---|---|---|
| `convert(bytes, format?, pages?, readingOrder?, tableMethod?)` | JS object | Structured PdfDocument with pages, elements, bounding boxes |
| `convert_to_string(bytes, format?, pages?, readingOrder?, tableMethod?)` | string | Formatted output (Markdown, JSON, HTML, or text) |
| `version()` | string | EdgeParse version |

Live demo

edgeparse.com/demo/ — drag-and-drop any PDF, all processing runs locally in your browser.

Build from source

curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
cd crates/edgeparse-wasm
wasm-pack build --target web --release
# Output: crates/edgeparse-wasm/pkg/

Full documentation: docs/09-wasm-sdk.md


CLI Reference

edgeparse [OPTIONS] <PDF_FILE>...

Core options

| Flag | Default | Description |
|---|---|---|
| `-o, --output-dir <DIR>` | | Write output files to this directory |
| `-f, --format <FMT>` | `json` | Output format(s), comma-separated |
| `-p, --password <PW>` | | Password for encrypted PDFs |
| `--pages <RANGE>` | | Page range, e.g. `"1,3,5-7"` |
| `-q, --quiet` | `false` | Suppress log output |
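The page-range syntax (comma-separated pages and inclusive ranges) can be expanded as sketched below. This is an illustrative parser under the documented `"1,3,5-7"` grammar, not the CLI's actual implementation:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec like "1,3,5-7" into [1, 3, 5, 6, 7]."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)       # inclusive range
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return sorted(set(pages))                      # dedupe, ascending

print(parse_pages("1,3,5-7"))  # → [1, 3, 5, 6, 7]
```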

Output format values

| Value | Description |
|---|---|
| `json` | Structured JSON with bounding boxes and element types (default) |
| `markdown` | Standard Markdown with GFM tables |
| `markdown-with-html` | Markdown with HTML table fallback for complex tables |
| `markdown-with-images` | Markdown with embedded or linked images |
| `html` | Full HTML5 document with semantic elements |
| `text` | Plain UTF-8 text, reading order preserved |

Multiple formats: --format markdown,json

Layout & extraction options

| Flag | Default | Description |
|---|---|---|
| `--reading-order <ALGO>` | `xycut` | Reading order: `xycut` or `off` |
| `--table-method <METHOD>` | `default` | Table detection: `default` (ruling lines) or `cluster` (borderless) |
| `--keep-line-breaks` | `false` | Preserve original line breaks within paragraphs |
| `--use-struct-tree` | `false` | Use tagged PDF structure tree when available |
| `--include-header-footer` | `false` | Include headers and footers in output |
| `--sanitize` | `false` | Enable PII sanitization |
| `--replace-invalid-chars <CH>` | `" "` | Replacement for invalid Unicode characters |

Image options

| Flag | Default | Description |
|---|---|---|
| `--image-output <MODE>` | `external` | `off`, `embedded` (base64), or `external` (files) |
| `--image-format <FMT>` | `png` | `png` or `jpeg` |
| `--image-dir <DIR>` | | Directory for extracted image files |

Separator options

| Flag | Description |
|---|---|
| `--markdown-page-separator <STR>` | String inserted between Markdown pages |
| `--text-page-separator <STR>` | String inserted between plain-text pages |
| `--html-page-separator <STR>` | String inserted between HTML pages |

Content safety options

| Flag | Default | Description |
|---|---|---|
| `--content-safety-off <FLAGS>` | | Disable safety filters: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg` |

Hybrid backend options

| Flag | Default | Description |
|---|---|---|
| `--hybrid <BACKEND>` | `off` | Hybrid backend: `off` or `docling-fast` |
| `--hybrid-mode <MODE>` | `auto` | Triage mode: `auto` or `full` |
| `--hybrid-url <URL>` | | Hybrid backend service URL |
| `--hybrid-timeout <MS>` | `30000` | Timeout in milliseconds |
| `--hybrid-fallback` | `false` | Fall back to local extraction on hybrid error |

Output Formats

| Format | Flag value | Best for |
|---|---|---|
| JSON with bounding boxes | `json` | RAG citations, element-level processing, source highlighting |
| Markdown (GFM) | `markdown` | LLM context windows, chunking pipelines, readable output |
| Markdown + HTML tables | `markdown-with-html` | Complex tables that don't render well in pure Markdown |
| Markdown + images | `markdown-with-images` | Documents where figures matter |
| HTML5 | `html` | Web display, accessibility pipelines |
| Plain text | `text` | Keyword search, simple NLP, legacy pipelines |

Combine formats: --format markdown,json

JSON output example

{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "content": "Introduction"
}

| Field | Description |
|---|---|
| `type` | Element type: heading, paragraph, table, list, image, caption |
| `id` | Unique identifier for cross-referencing |
| `page number` | 1-indexed page reference |
| `bounding box` | [left, bottom, right, top] in PDF points (72 pt = 1 inch) |
| `heading level` | Heading depth (1+) |
| `content` | Extracted text |

RAG / LLM Integration

EdgeParse is designed for AI pipelines. Every element has a bounding box and page number, so you can cite exact sources in answers.

Extract Markdown for chunking

import edgeparse

md = edgeparse.convert("report.pdf", format="markdown")

# Feed directly into an LLM
response = llm.invoke(f"Summarize this document:\n\n{md}")

Extract JSON for source citations

import json, edgeparse

data = json.loads(edgeparse.convert("report.pdf", format="json"))

for element in data["kids"]:
    if element["type"] == "paragraph":
        # element["bounding box"] → highlight location in original PDF
        # element["page number"] → link back to source page
        print(f"p.{element['page number']}: {element['content'][:80]}")

LangChain integration

EdgeParse has no official LangChain loader yet, but integrates trivially:

from langchain.schema import Document
import edgeparse, json

def load_pdf(path: str) -> list[Document]:
    data = json.loads(edgeparse.convert(path, format="json"))
    docs = []
    for el in data["kids"]:
        if el["type"] in ("paragraph", "heading"):
            docs.append(Document(
                page_content=el["content"],
                metadata={
                    "source": path,
                    "page": el["page number"],
                    "bbox": el["bounding box"],
                    "type": el["type"],
                }
            ))
    return docs

LlamaIndex integration

from llama_index.core import Document
import edgeparse, json

def edgeparse_reader(path: str) -> list[Document]:
    data = json.loads(edgeparse.convert(path, format="json"))
    return [
        Document(
            text=el["content"],
            metadata={"page": el["page number"], "source": path}
        )
        for el in data["kids"]
        if el.get("content")
    ]

Which output format for which use case?

| Use case | Recommended format | Why |
|---|---|---|
| Feed PDF to LLM | `markdown` | Clean structure, fits in context window |
| RAG with source citations | `json` | Bounding boxes enable "click-to-source" UX |
| Semantic chunking by section | `markdown` | Headings make natural chunk boundaries |
| Element-level filtering | `json` | Filter by type, page number, heading level |
| Web display | `html` | Styled output with semantic elements |

Agent Skill

EdgeParse ships as a Claude agent skill — a structured description that teaches Claude (and any compatible AI agent) how to extract PDF content on behalf of users.

# Add the EdgeParse skill to your agent environment
npx skills add raphaelmansuy/edgeparse --skill edgeparse

# Install the Python package
pip install edgeparse

The npx skills add command registers the skill in skills-lock.json:

{
  "version": 1,
  "skills": {
    "edgeparse": {
      "source": "raphaelmansuy/edgeparse",
      "sourceType": "github"
    }
  }
}

Once installed, the agent reads skills/edgeparse/SKILL.md and knows when to call edgeparse.convert(), which format to use for different tasks, and how to handle edge cases like encrypted PDFs, borderless tables, and multi-column layouts.

See docs/08-agent-skill.md for the full integration guide (LangChain, LlamaIndex, MCP, CrewAI patterns).


Architecture

Crate structure

edgeparse/
├── crates/
│   ├── pdf-cos/            # Low-level PDF object model (fork of lopdf 0.39)
│   ├── edgeparse-core/     # Core extraction engine (~90 source files)
│   │   └── src/
│   │       ├── api/        # ProcessingConfig, FilterConfig, BatchResult
│   │       ├── pdf/        # Loader, content stream parser, font/CMap decoding
│   │       ├── models/     # ContentElement, BoundingBox, TextChunk, PdfDocument
│   │       ├── pipeline/   # 20-stage orchestrator + Rayon parallel helpers
│   │       ├── output/     # Renderers: JSON, Markdown, HTML, text, CSV
│   │       ├── tagged/     # Tagged-PDF structure tree → McidMap
│   │       └── utils/      # XY-Cut++ algorithm, sanitizer, layout analysis
│   ├── edgeparse-cli/      # CLI binary (clap derive, 25+ flags)
│   ├── edgeparse-python/   # PyO3 native extension
│   └── edgeparse-node/     # NAPI-RS native addon
└── sdks/
    ├── python/             # Python packaging (maturin, pyproject.toml)
    └── node/               # npm packaging (TypeScript wrapper, tsup)

Processing pipeline (20 stages)

PDF file
    │
    ▼
pdf-cos                   ← xref parsing, object graph, encrypted streams
    │
    ▼
edgeparse-core::pdf       ← page tree, content stream operators, font decoding,
    │                       CMap/ToUnicode, image extraction, tagged PDF
    ▼
edgeparse-core::pipeline  ← 20 sequential/parallel stages:
    │  [page range] → [watermark] → [filter] → [table borders] →
    │  [cell assignment] → [boxed headings] → [column detection] →
    │  [TextLine assembly] → [TextBlock grouping] → [table clustering] →
    │  [header/footer] → [list detection] → [paragraph] → [figure] →
    │  [heading classification] → [XY-Cut++ reading order] →
    │  [list pass 2] → [caption/footnote/TOC] → [cross-page tables] →
    │  [element nesting] → [final reading order] → [sanitize]
    ▼
edgeparse-core::output    ← render to JSON / Markdown / HTML / text

Stages marked par_map_pages run in parallel via Rayon; cross-page stages run sequentially.


FAQ

What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. EdgeParse outputs structured JSON with bounding boxes for every element, handles multi-column layouts with XY-Cut++, and runs locally on CPU without a GPU or JVM. On the current 200-document benchmark it leads the comparison set in both overall score (0.787) and latency (0.064 s/doc). See RAG integration examples.

How do I cite PDF sources in RAG answers?

Every element in JSON output includes a bounding box ([left, bottom, right, top] in PDF points, 72 pt = 1 inch) and page number. Map the source chunk back to its bounding box to highlight the exact location in the original PDF — enabling "click-to-source" UX. No other non-OCR open-source parser provides bounding boxes for every element by default.
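As a sketch, mapping a stored bounding box onto a page image rendered at some DPI requires only a scale factor and a y-axis flip, since PDF user space puts the origin at the bottom-left (`bbox_to_pixels` is an illustrative helper, not part of the SDK; the page height comes from your renderer):

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert an EdgeParse [left, bottom, right, top] box (PDF points,
    bottom-left origin) into a top-left-origin pixel rectangle."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # 72 PDF points per inch
    return (
        round(left * scale),                     # x
        round((page_height_pt - top) * scale),   # y (flip the axis)
        round((right - left) * scale),           # width
        round((top - bottom) * scale),           # height
    )

# The heading from the JSON example, on a US Letter page (792 pt tall):
print(bbox_to_pixels([72.0, 700.0, 540.0, 730.0], 792, dpi=72))
# → (72, 62, 468, 30)
```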

How do I extract tables from PDF?

EdgeParse detects tables using border (ruling-line) analysis by default. For complex or borderless tables, add --table-method cluster (CLI) or table_method="cluster" (Python). This uses a text-clustering algorithm to detect table structure without visible borders. On the current benchmark it reaches 0.596 TEDS and 0.901 table-detection F1, both best in the published comparison set.

Does it work without sending data to the cloud?

Yes. EdgeParse runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The WebAssembly SDK also runs entirely in the browser: the PDF is processed client-side and never uploaded to any server. Ideal for legal, healthcare, and financial documents.

Does it handle multi-column layouts?

Yes. XY-Cut++ reading order analysis correctly sequences text across multi-column pages, sidebars, and mixed layouts. This works without any configuration change; it is the default reading order algorithm.

Does it need a GPU or Java?

No. EdgeParse is a pure Rust implementation. It requires no JVM, no GPU, no OCR models, and no Python runtime for the CLI binary. The CLI binary is ~15 MB stripped. On Apple M4 Max, it processes the 200-document benchmark corpus in about 12.7 seconds total.

How does it compare to OpenDataLoader, Docling, Marker, and PyMuPDF4LLM?

| vs. | EdgeParse advantage | Tradeoff |
|---|---|---|
| OpenDataLoader | Faster (0.064 vs 0.094 s/doc) with stronger table structure and heading recovery | OpenDataLoader remains close on text quality and paragraph boundaries |
| IBM Docling | Faster (0.064 vs 0.768 s/doc) with better TEDS and overall score in the current benchmark snapshot | Docling remains a viable OCR-heavy fallback for scanned documents |
| Marker | Faster and materially better on every published metric in this benchmark family | Marker supports scanned PDFs via Surya OCR |
| PyMuPDF4LLM | Faster (0.064 vs 0.439 s/doc) with stronger tables, headings, and reading order | PyMuPDF4LLM is simpler if you only need lightweight text extraction |

Does it support scanned PDFs?

Not directly — EdgeParse is built for born-digital PDFs with embedded fonts. It does not include an OCR engine. For scanned documents, use EdgeParse's hybrid backend option (--hybrid docling-fast) which routes complex pages to a Docling-Fast backend running locally.

What does "deterministic" mean?

Same input PDF → same output, every time. No stochastic models, no floating-point non-determinism from ML inference. This makes EdgeParse safe to use in CI pipelines, regression tests, and compliance workflows where reproducibility is required.
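This property is easy to assert in CI by hashing the output of repeated runs. The sketch below is generic (pass `edgeparse.convert` and its arguments as the extractor); `is_deterministic` is an illustrative helper, not an SDK function:

```python
import hashlib

def output_digest(extract, *args, **kwargs) -> str:
    """SHA-256 of an extractor's string output, for byte-for-byte comparison."""
    out = extract(*args, **kwargs)
    return hashlib.sha256(out.encode("utf-8")).hexdigest()

def is_deterministic(extract, *args, runs: int = 3, **kwargs) -> bool:
    """True if repeated runs produce identical output."""
    digests = {output_digest(extract, *args, **kwargs) for _ in range(runs)}
    return len(digests) == 1

# e.g. is_deterministic(edgeparse.convert, "report.pdf", format="markdown")
```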

How do I chunk PDFs for semantic search?

Use format="markdown". EdgeParse preserves heading hierarchy and table structure in Markdown output — headings make natural chunk boundaries for RecursiveCharacterTextSplitter (LangChain) or heading-based splitters. For element-level control, use format="json" and split on heading level boundaries or page number changes.
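A minimal heading-based splitter over the Markdown output might look like the sketch below (illustrative; it does not guard against `#` lines inside code fences, and production pipelines usually add a size-bounded splitter on top):

```python
def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.lstrip().startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\ntext\n## Method\nmore text"
print(split_by_headings(doc))  # → ['# Intro\ntext', '## Method\nmore text']
```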

Does the Python SDK run on Windows?

Yes. Pre-built wheels are available for Windows (x64). The Python package installs from PyPI with no compilation needed:

pip install edgeparse

Tutorials

Step-by-step guides with working examples live in tutorials/:

| Tutorial | Description |
|---|---|
| tutorials/01-cli.md | All CLI flags with working examples and output samples |
| tutorials/02-python-sdk.md | `pip install edgeparse` — full API, batch processing, JSON parsing |
| tutorials/03-nodejs-sdk.md | `npm install edgeparse` — TypeScript, CJS, and worker threads |
| tutorials/04-rust-library.md | `edgeparse-core` in your Rust project — config, models, Rayon |
| tutorials/05-output-formats.md | JSON schema, bounding boxes, Markdown variants, HTML, plain text |

Documentation

Technical documentation lives in docs/:

| Document | Description |
|---|---|
| docs/00-overview.md | Project overview, goals, and design philosophy |
| docs/01-architecture.md | Crate structure, module map, data-flow diagram |
| docs/02-pipeline.md | All 20 pipeline stages with ASCII diagrams |
| docs/03-data-model.md | Type hierarchy: ContentElement, BoundingBox, PdfDocument |
| docs/04-pdf-extraction.md | PDF loader, chunk parser, font/CMap decoding |
| docs/05-output-formats.md | JSON schema, Markdown renderer, HTML/text/CSV output |
| docs/06-sdk-integration.md | CLI flag reference, Python SDK API, Node.js SDK API, Batch API |
| docs/07-cicd-publishing.md | CI/CD publishing pipeline — how it works and how to configure it |
| docs/08-agent-skill.md | EdgeParse agent skill — `npx skills add`, SKILL.md structure, SDK patterns |
| docs/09-wasm-sdk.md | WebAssembly SDK — objectives, API, use cases, build instructions |

Project Layout

edgeparse/
├── LICENSE
├── CONTRIBUTING.md
├── README.md
├── Cargo.toml               # Rust workspace (5 members)
├── Cargo.lock
│
├── crates/
│   ├── pdf-cos/             # lopdf 0.39 fork — low-level PDF object model
│   ├── edgeparse-core/      # Core extraction engine (~90 source files)
│   ├── edgeparse-cli/       # CLI binary (clap, 25+ flags)
│   ├── edgeparse-wasm/      # WebAssembly build for browsers (wasm-bindgen)
│   ├── edgeparse-python/    # PyO3 native Python extension
│   └── edgeparse-node/      # NAPI-RS native Node.js addon
│
├── sdks/
│   ├── python/              # Python wheel packaging (maturin + pyproject.toml)
│   │   └── edgeparse/       # Pure-Python wrapper + CLI entry point
│   └── node/                # npm packaging (TypeScript + tsup + vitest)
│       └── src/             # index.ts, types.ts, cli.ts
│
├── benchmark/               # Evaluation suite
│   ├── run.py               # Benchmark runner (EdgeParse)
│   ├── compare_all.py       # Multi-engine comparison (9 engines)
│   ├── pyproject.toml
│   ├── thresholds.json      # Regression thresholds
│   ├── pdfs/                # Benchmark PDFs (200 docs)
│   ├── ground-truth/        # Reference Markdown and JSON annotations
│   ├── prediction/          # Per-engine output directories
│   ├── reports/             # HTML benchmark reports
│   └── src/                 # Python evaluators and engine adapters
│
├── docs/                    # Technical documentation (Markdown)
│
├── demo/                    # Interactive WASM demo (Vite + TypeScript)
│   └── src/                 # Demo application source
│
├── examples/
│   └── pdf/                 # Sample PDFs for quick testing
│       ├── lorem.pdf
│       ├── 1901.03003.pdf   # Academic paper (multi-column)
│       ├── 2408.02509v1.pdf # Academic paper
│       └── chinese_scan.pdf # CJK + scan example
│
├── benches/                 # Rust micro-benchmarks (criterion)
├── docker/                  # Dockerfile and Dockerfile.dev
├── scripts/                 # bench.sh, publish-crates.sh
└── tests/
    └── fixtures/            # Rust integration test fixtures

Contributing

See CONTRIBUTING.md. In short:

  1. Fork, branch from main
  2. cargo fmt && cargo clippy -- -D warnings
  3. Run the benchmark to check for regressions: cd benchmark && uv run python run.py --engine edgeparse
  4. Open a PR

Star History

Star History Chart

License

EdgeParse is licensed under the Apache License 2.0. See LICENSE for the full text.

The crates/pdf-cos/ directory is a fork of lopdf (MIT/Apache-2.0 dual-licensed).
Benchmark PDF documents (benchmark/pdfs/) are sourced from publicly available documents and are used solely for evaluation purposes.

About

EdgeParse converts any digital PDF into Markdown, JSON (with bounding boxes), HTML, or plain text — deterministically, without a JVM, without a GPU, and with best-in-class accuracy on the 200-document benchmark suite included in this repository.
