The Filing Cabinet Problem

Every organization has one. A storage room, a shared drive, a Dropbox folder: somewhere there are thousands of documents that exist only as scans. Supplier invoices from before the accounting system went digital. Patient intake forms from a decade of paper processes. Lease agreements that were faxed, signed, scanned, and filed away. Customs declarations. Insurance claims. Building permits.

The data inside those documents is valuable. It is also trapped behind a wall of pixels. A scanned PDF is not a document in any meaningful sense; it is a photograph of a document, wrapped in a PDF container. You cannot search it. You cannot copy text from it. You cannot query a database for "all invoices over EUR 10,000 from 2023" when those invoices are flat images.

The traditional fix is OCR (optical character recognition). Run Tesseract, get text out. But raw OCR gives you a stream of characters with no structure.…
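
To make the gap between "text" and "data" concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample OCR string, the `Invoice` fields, the supplier names, and the `large_invoices` helper are all invented and do not come from a real document or library. It shows only that the query "all invoices over EUR 10,000 from 2023" is trivial once fields exist, and not expressible against a flat string.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Hypothetical raw OCR output for one invoice: a flat stream of characters.
# Invented sample text; nothing marks which number is the total, the
# invoice number, or a date.
RAW_OCR_TEXT = (
    "ACME GmbH Invoice No 2023-0417 Date 14.03.2023 "
    "Total EUR 12,450.00 Due 13.04.2023"
)


@dataclass
class Invoice:
    """Hypothetical structured record; field names are assumptions."""
    supplier: str
    number: str
    issued: date
    total_eur: Decimal


# The same kind of information once fields have been extracted (by whatever
# means). All values are made up for the example.
structured = [
    Invoice("ACME GmbH", "2023-0417", date(2023, 3, 14), Decimal("12450.00")),
    Invoice("Beispiel AG", "2022-1103", date(2022, 11, 3), Decimal("18200.00")),
    Invoice("Muster BV", "2023-0088", date(2023, 1, 20), Decimal("940.50")),
]


def large_invoices(invoices, year, threshold):
    """Return invoices issued in `year` with a total above `threshold`."""
    return [
        inv for inv in invoices
        if inv.issued.year == year and inv.total_eur > threshold
    ]


# A substring search on the raw text can confirm that "2023" appears, but it
# cannot say whether that is an issue date, a due date, or an invoice number.
print("2023" in RAW_OCR_TEXT)  # True, yet not an answer to the query

# With structure, the query is a simple filter.
matches = large_invoices(structured, 2023, Decimal("10000"))
print([inv.number for inv in matches])  # ['2023-0417']
```

The sketch deliberately leaves out the extraction step itself: how the fields get from pixels into records is the hard part.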