olmOCR
Open-source toolkit from the Allen Institute for AI that converts PDFs and scanned documents into clean, structured Markdown.
About
olmOCR is a document-processing toolkit built by the Allen Institute for AI. It uses a 7-billion-parameter vision language model to convert PDFs, PNGs, and JPEGs into structured Markdown, preserving reading order across multi-column layouts, figures, tables, equations, and handwritten content. Headers, footers, and other page artifacts are stripped automatically. A free web demo is available, and the software is licensed under Apache 2.0 for commercial and personal use. Self-hosted processing costs under $200 per million pages; for teams without GPU infrastructure, third-party inference providers offer per-token pricing.
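To put the self-hosted figure in perspective, the sketch below turns the quoted ceiling of $200 per million pages into an upper bound for a given page count. The function name and the sample page counts are illustrative assumptions, not part of olmOCR; real costs depend on hardware and document mix and should come in below these numbers.

```python
# Upper-bound cost estimate based on the "under $200 per million pages" figure.
# The constant is the ceiling quoted in the description, not a measured price.
COST_CEILING_PER_MILLION_PAGES_USD = 200.0


def max_self_hosted_cost_usd(pages: int) -> float:
    """Return an upper bound (in USD) on self-hosted cost for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * COST_CEILING_PER_MILLION_PAGES_USD / 1_000_000


if __name__ == "__main__":
    # Hypothetical workloads, chosen only for illustration.
    for pages in (1_000, 50_000, 1_000_000):
        bound = max_self_hosted_cost_usd(pages)
        print(f"{pages:>9,} pages: under ${bound:,.2f}")
```

At this ceiling, a single page costs at most $0.0002, so a 50,000-page archive would stay under $10.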