PDF Extractor
By DorsalHub
Audited
dorsalhub/pdf-extractor
Fast, deterministic PDF extraction using pdfium.
Dorsal PDF Extractor
Fast, deterministic PDF extraction using pdfium. Get native text and bounding boxes without the overhead of heavy AI layout models. Tesseract OCR fallback for scanned pages.
Features
- Native Extraction: Extracts text and precise spatial coordinates (bounding boxes) directly from the PDF document.
- Tesseract OCR Fallback: Built-in optical character recognition (OCR) fallback for scanned or empty pages (requires pytesseract).
- Schema Compliant: Outputs standardized JSON matching the open/document-extraction schema.
- Interactive HTML Export: Outputs can be exported to interactive HTML wireframes via the Dorsal CLI (requires dorsalhub-adapters).
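The OCR fallback idea can be illustrated with a minimal sketch. This is not the extractor's actual code: the function name `extract_page_text` and the injected `run_ocr` callable are hypothetical, and the only behaviour taken from this page is that OCR runs on pages with no native text, and only when enabled.

```python
from typing import Callable


def extract_page_text(
    native_text: str,
    run_ocr: Callable[[], str],
    use_ocr: bool = False,
) -> str:
    """Return the page's native text, falling back to OCR only when it is empty.

    `run_ocr` is a hypothetical zero-argument callable standing in for an OCR
    engine call; it is invoked only when the fallback is actually needed.
    """
    if native_text.strip():
        # Native text exists, so OCR is never run for this page.
        return native_text
    if use_ocr:
        # No native text (e.g. a scanned page) and the fallback is enabled.
        return run_ocr()
    # No native text and the fallback is disabled: the page yields nothing.
    return ""
```

Passing the OCR step as a callable keeps the expensive call lazy, which matches the "fallback" framing: pages with native text never pay the OCR cost.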
Quick Start
Run the model directly against a local PDF file:
dorsal run dorsalhub/pdf-extractor ./document.pdf
Configuration Options
In the CLI, you can pass options to the model using the --opt (or -o) flag.
Example: Run the extractor with OCR fallback enabled for a French-language document:
dorsal run github:dorsalhub/pdf-extractor ./document.pdf --opt use_ocr=true --opt ocr_language=fra
Options:
- password (default: null): Password for decrypting protected PDFs.
- strict (default: false): Toggle strict parsing mode for PDF processing.
- use_ocr (default: false): Enable Tesseract OCR fallback for pages where no native text tokens are detected.
- ocr_language (default: "eng"): The language code to use for the OCR fallback engine.
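When the CLI is driven from a script, the options above can be assembled into repeated --opt flags. The helper below is a hypothetical sketch, not part of Dorsal: `build_run_command` is an invented name, and it assumes booleans are written as lowercase true/false, as in the example command above.

```python
import shlex
from typing import Dict, List, Optional, Union

OptionValue = Optional[Union[str, bool, int]]


def build_run_command(
    model: str, pdf_path: str, options: Dict[str, OptionValue]
) -> List[str]:
    """Build an argv list for `dorsal run` with one --opt flag per option."""
    argv = ["dorsal", "run", model, pdf_path]
    for key, value in options.items():
        if value is None:
            # Unset options (e.g. password) are left to the model's defaults.
            continue
        if isinstance(value, bool):
            # Assumption: the CLI expects lowercase true/false.
            value = "true" if value else "false"
        argv += ["--opt", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    cmd = build_run_command(
        "github:dorsalhub/pdf-extractor",
        "./document.pdf",
        {"use_ocr": True, "ocr_language": "fra", "password": None},
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with passwords or paths that contain spaces.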
Output Formats & Exporting
By default, the CLI outputs a validated JSON record to the current working directory.
You can export to other formats right from the CLI. For example, exporting to HTML:
dorsal run github:dorsalhub/pdf-extractor ./document.pdf --export=html
Output
This model produces a file annotation conforming to the Open Validation Schemas Document Extraction schema:
- Schema ID: open/document-extraction (v0.5.0)
- Key Fields:
  - extraction_type: Indicates the type of extraction (e.g., boxes, text, or mixed).
  - unit: Geometric unit, set to per_mille for standardized relative spatial mapping.
  - page_width / page_height: Absolute dimensions of the pages in pixels.
  - blocks: An array of elements containing the block_type, text, page_number, and box coordinates.
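Because unit is per_mille (parts per thousand), box coordinates are relative to the page and need scaling by page_width and page_height to get absolute positions. The sketch below shows that arithmetic. The box layout used here (keys x, y, width, height) is an assumption for illustration; this page does not specify the box structure, so check the schema before relying on it.

```python
from typing import Dict


def per_mille_to_pixels(
    box: Dict[str, float], page_width: float, page_height: float
) -> Dict[str, float]:
    """Scale a per-mille box (0-1000 relative units) to absolute page units.

    The keys x, y, width, height are hypothetical; the real schema may name
    or structure box coordinates differently.
    """
    return {
        # Horizontal values scale with the page width.
        "x": box["x"] * page_width / 1000,
        "width": box["width"] * page_width / 1000,
        # Vertical values scale with the page height.
        "y": box["y"] * page_height / 1000,
        "height": box["height"] * page_height / 1000,
    }
```

For example, a box starting at x = 100 on a page 800 pixels wide begins 80 pixels from the page's horizontal origin.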
Development
Running Tests
This repository uses pytest for integration testing.
pip install -e .
pytest
License
This project is licensed under the Apache 2.0 License.
Install
To use this model, you must have Dorsal installed in your environment:
pip install dorsalhub
Once installed, run the command below in your terminal to install the model:
dorsal model install dorsalhub/pdf-extractor
- Version: 0.1.0
- Published By: Dorsalhub Models
- Creation Date: 2026-02-26
- Last Modified Date: 2026-02-26
- Source Code: GitHub
Output
- Schema: open/document-extraction
Supported Media