Extraction · scalar · returns JSON

PARSE_DOCUMENT

PDF → structured JSON (pypdf + Donut OCR fallback)

Per-row — runs once for each row.

extraction · llm · scales-large · image

Arguments

| name | type | description |
| --- | --- | --- |
| `pdf_bytes` | `VARCHAR` | PDF content (hex, base64, or raw bytes as VARCHAR) |
| `schema` | `VARCHAR` | JSON schema of the desired output shape |

About

Parse a PDF into a JSON object matching a user-supplied schema — works on both digital (text-layer) and scanned (image-only) PDFs.

## How it works

`parse_document(pdf_bytes, schema)` runs a two-stage pipeline:

1. **Text extraction** — picks the fastest path that produces real text:
   - **Digital PDFs**: `pypdf` walks the text layer in-process. Cheap, synchronous, zero external dependencies.
   - **Scanned PDFs** (empty or near-empty text layer): automatically falls back to the zoo's Donut OCR endpoint (`POST /document/ocr`). The zoo rasterizes every page with PyMuPDF and runs the `naver-clova-ix/donut-base` encoder-decoder with the SynthDoG reading task. Multi-page PDFs are handled end-to-end.
2. **Schema-driven extraction** — the extracted text is handed to an LLM cell running in JSON mode. The schema is injected into the prompt so the model emits a single JSON object with all fields present (using `null` for missing data), matching the declared types.

The fallback is invisible to the caller — the same `parse_document` call works on both kinds of input. Check the `ocr_used` flag in the per-row output of the intermediate `pdf_to_text` cell if you need to distinguish the two paths.
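The routing between the two stages can be sketched in Python. Everything here is illustrative: `extract_text_layer`, `ocr_pages`, and `llm_extract` stand in for pypdf, the zoo OCR endpoint, and the LLM cell respectively, and the 20-characters-per-page cutoff is an assumed heuristic, not the function's documented threshold.

```python
def near_empty(pages, min_chars_per_page=20):
    """Heuristic: treat the text layer as missing when total extracted
    text falls below a small per-page threshold (assumed cutoff)."""
    total = sum(len(p.strip()) for p in pages)
    return total < min_chars_per_page * max(len(pages), 1)

def parse_document(pdf, schema, extract_text_layer, ocr_pages, llm_extract):
    # Stage 1: pick the fastest path that produces real text.
    pages = extract_text_layer(pdf)      # pypdf walking the text layer
    ocr_used = near_empty(pages)
    if ocr_used:                         # scanned PDF: Donut OCR fallback
        pages = ocr_pages(pdf)           # POST /document/ocr
    text = "\n".join(pages)              # multi-page flattened to one blob
    # Stage 2: schema-driven extraction by the LLM cell in JSON mode.
    return {"ocr_used": ocr_used, "fields": llm_extract(text, schema)}
```

The injected callables make the decision logic testable without a PDF stack; in the real cell the `ocr_used` flag surfaces in the `pdf_to_text` per-row output.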
## Input formats

`pdf_bytes` accepts any of the shapes DuckDB can hand down:

- Raw bytes or memoryview (BLOB columns)
- Base64 string
- Hex string, with or without the `\x` prefix

## Usage

```sql
-- Extract invoice fields from every PDF in a table
SELECT
  pdf_id,
  parse_document(
    pdf_bytes,
    '{"vendor":"string","total":"decimal","invoice_date":"date"}'
  ) AS fields
FROM uploaded_invoices;

-- Combine with LATERAL for native typed columns via CREATE EXTRACTION
CREATE EXTRACTION invoice_fields AS $${
  "vendor": "string",
  "total": "decimal",
  "invoice_date": "date"
}$$;

SELECT t.pdf_id, f.vendor, f.total, f.invoice_date
FROM uploaded_invoices t,
     LATERAL extract_invoice_fields_rows(
       parse_document(t.pdf_bytes, '{}') ->> 'text'
     ) f;
```

## Limitations

- Multi-page documents are flattened into a single text blob before extraction. For very long documents, chunk upstream and merge results.
- Nested schemas are accepted but flattened — the LLM will still emit them, but the `returns_columns` machinery only materializes scalar fields. Pair with `CREATE EXTRACTION` if you need typed columns.
- The OCR fallback costs a zoo round-trip (~1-3s per page on GPU), so it's slower than the pypdf path. Keep scanned workloads batched.

## Related

- `CREATE EXTRACTION` — reusable typed schemas, saves rewriting the `'{...}'` JSON on every call and gives you native DuckDB columns.
- `extract_structured(text, schema)` — the underlying text-path extractor. Call directly when you already have plain text.
- Zoo endpoint `POST /document/ocr` — the OCR fallback in isolation, if you want to handle the text-vs-JSON split in your own pipeline.
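One plausible way to normalize the three accepted `pdf_bytes` shapes, sketched in Python. This is a guess at the behavior, not the engine's actual code: the sniffing order (hex before base64) and the `%PDF` magic-byte confirmation are assumptions, and `coerce_pdf_bytes` is a hypothetical name.

```python
import base64
import binascii

def coerce_pdf_bytes(value):
    """Normalize raw bytes / memoryview, base64, or hex input to bytes.
    Hypothetical sketch; the real normalization is internal to the cell."""
    if isinstance(value, memoryview):
        return value.tobytes()
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    s = value.strip()
    if s.startswith("\\x"):              # optional hex prefix
        s = s[2:]
    try:                                 # hex string?
        decoded = bytes.fromhex(s)
        if decoded.startswith(b"%PDF"):  # assumed magic-byte check
            return decoded
    except ValueError:
        pass
    try:                                 # base64 string?
        decoded = base64.b64decode(s, validate=True)
        if decoded.startswith(b"%PDF"):
            return decoded
    except binascii.Error:
        pass
    return value.encode("latin-1")       # raw bytes passed as VARCHAR
```

Checking the `%PDF` header disambiguates strings that happen to be valid in more than one encoding; anything that decodes to no recognizable PDF falls through as raw bytes.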
