# PDF → structured JSON (pypdf + Donut OCR fallback)

Per-row — runs once for each row.

Tags: `extraction` · `llm` · `scales-large` · `image`
## Arguments

| name | type | description |
| --- | --- | --- |
| `pdf_bytes` | VARCHAR | PDF content (hex, base64, or raw bytes as VARCHAR) |
| `schema` | VARCHAR | JSON schema of the desired output shape |

## About
Parse a PDF into a JSON object matching a user-supplied schema —
works on both digital (text-layer) and scanned (image-only) PDFs.
## How it works
`parse_document(pdf_bytes, schema)` runs a two-stage pipeline:
1. **Text extraction** — picks the fastest path that produces real text:
- **Digital PDFs**: `pypdf` walks the text layer in-process. Cheap,
synchronous, zero external dependencies.
- **Scanned PDFs** (empty or near-empty text layer): automatically
falls back to the zoo's Donut OCR endpoint (`POST /document/ocr`).
The zoo rasterizes every page with PyMuPDF and runs the
`naver-clova-ix/donut-base` encoder-decoder with the SynthDoG
reading task. Multi-page PDFs are handled end-to-end.
2. **Schema-driven extraction** — the extracted text is handed to an
LLM cell running in JSON mode. The schema is injected into the
prompt so the model emits a single JSON object with all fields
present (using `null` for missing data), matching the declared
types.
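The field-completion guarantee in step 2 — every declared field present, `null` for anything missing — can be sketched as a post-LLM normalization pass. This is an illustrative helper, not the extension's internal code; the function name `coerce_to_schema` is hypothetical:

```python
import json

def coerce_to_schema(raw_model_output: str, schema_json: str) -> dict:
    """Reshape whatever JSON the model returned so that every field
    declared in the schema is present, with None (JSON null) standing
    in for anything the model omitted. Extra fields are dropped."""
    schema = json.loads(schema_json)
    data = json.loads(raw_model_output)
    return {field: data.get(field) for field in schema}
```

For example, a model response that omits `total` still comes back with the key present: `coerce_to_schema('{"vendor": "Acme"}', '{"vendor":"string","total":"decimal"}')` yields `{"vendor": "Acme", "total": None}`.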
The fallback is invisible to the caller — the same `parse_document`
call works on both kinds of input. Check the `ocr_used` flag in the
per-row output of the intermediate `pdf_to_text` cell if you need to
distinguish the two paths.
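The "empty or near-empty text layer" trigger for the OCR fallback can be sketched as a per-page character-count heuristic. The 20-characters-per-page threshold below is an illustrative assumption, not the pipeline's actual cutoff:

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when the pypdf text layer looks too thin to trust,
    i.e. the PDF is probably a scan. The threshold is illustrative;
    the real cutoff used by the extension may differ."""
    if not page_texts:
        return True
    total_chars = sum(len(t.strip()) for t in page_texts)
    return total_chars < min_chars_per_page * len(page_texts)
```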
## Input formats
`pdf_bytes` accepts any of the shapes DuckDB can hand down:
- Raw bytes or memoryview (BLOB columns)
- Base64 string
- Hex string, with or without the `\x` prefix
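One way the three accepted shapes could be normalized down to raw bytes — a sketch, not the extension's actual decoder; the try-hex-then-base64 dispatch and the `%PDF` magic-byte check are assumptions made for illustration:

```python
import base64
import binascii

def normalize_pdf_bytes(value):
    """Coerce the accepted input shapes (bytes-like, base64 string,
    hex string with or without a leading backslash-x) to raw bytes.
    Ambiguity between hex and base64 is resolved by checking which
    decoding yields the PDF magic bytes b'%PDF'."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value)
    s = value.strip()
    if s.startswith(r"\x"):          # Postgres-style bytea hex prefix
        s = s[2:]
    candidates = []
    try:
        candidates.append(bytes.fromhex(s))
    except ValueError:
        pass
    try:
        candidates.append(base64.b64decode(s, validate=True))
    except (binascii.Error, ValueError):
        pass
    for decoded in candidates:
        if decoded.startswith(b"%PDF"):
            return decoded
    raise ValueError("value is not recognizable as PDF content")
```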
## Usage
-- Extract invoice fields from every PDF in a table
SELECT
pdf_id,
parse_document(pdf_bytes,
'{"vendor":"string","total":"decimal","invoice_date":"date"}'
) AS fields
FROM uploaded_invoices;
```sql
-- Combine with LATERAL for native typed columns via CREATE EXTRACTION
CREATE EXTRACTION invoice_fields AS $${
    "vendor": "string",
    "total": "decimal",
    "invoice_date": "date"
}$$;

SELECT t.pdf_id, f.vendor, f.total, f.invoice_date
FROM uploaded_invoices t,
     LATERAL extract_invoice_fields_rows(
         parse_document(t.pdf_bytes, '{}') ->> 'text'
     ) f;
```
## Limitations
- Multi-page documents are flattened into a single text blob before
extraction. For very long documents, chunk upstream and merge
results.
- Nested schemas are accepted but flattened — the LLM will still emit
them, but the `returns_columns` machinery only materializes scalar
fields. Pair with `CREATE EXTRACTION` if you need typed columns.
- The OCR fallback costs a zoo round-trip (~1-3s per page on GPU),
so it's slower than the pypdf path. Keep scanned workloads batched.
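The chunk-upstream-and-merge advice for long documents might look like this in Python. The window size, overlap, and first-non-null merge policy are illustrative choices, not part of the extension:

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split a long extracted-text blob into overlapping windows so
    each one fits comfortably in the LLM context."""
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

def merge_chunk_results(results):
    """Merge per-chunk JSON objects: the first non-null value for each
    field wins, so values found early in the document are kept."""
    merged = {}
    for result in results:
        for field, value in result.items():
            if merged.get(field) is None and value is not None:
                merged[field] = value
    return merged
```

Each chunk would be passed through `extract_structured` (or `parse_document` on pre-split PDFs) separately, and the per-chunk JSON objects merged afterwards.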
## Related
- `CREATE EXTRACTION` — reusable typed schemas, saves rewriting the
`'{...}'` JSON on every call and gives you native DuckDB columns.
- `extract_structured(text, schema)` — the underlying text-path
extractor. Call directly when you already have plain text.
- Zoo endpoint `POST /document/ocr` — the OCR fallback in isolation,
if you want to handle the text-vs-JSON split in your own pipeline.