Optical character recognition works beautifully in the demo. You feed it a crisp, printed, well-aligned page, and it returns clean text. The trouble starts when you point it at the actual mailbag - the real population of documents an organisation receives, where the print is faded, the form was filled in by hand, a clause is struck through and rewritten in the margin, there's a signature across the box, and the whole thing was scanned slightly skewed on a busy Friday afternoon.

That gap - between OCR on a clean sample and extraction on real-world documents - is where most document-automation projects quietly underdeliver. Closing it is what separates a tidy proof of concept from a system you can run a regulated process on. And closing it means accepting a simple truth: reliable data extraction is not one model doing one thing. It's a pipeline.

What OCR is good at and where it breaks

OCR is a mature, fast, and cost-effective technology for what it was designed to do: convert clean, machine-printed text into characters. On a standard printed field, it's hard to beat on speed or price.

Its limits show up the moment a document stops being clean and printed:

Handwriting. Free-hand entries vary endlessly in style, slant, and legibility. Traditional OCR was never built for them; this is the domain of intelligent character recognition and, increasingly, language models that can read in context.

Signatures. A signature isn't text to be transcribed - it's a mark to be detected and located. Asking an OCR engine to "read" it is the wrong question.

Corrections and annotations. Strike-throughs, insertions, initialled amendments, and marginal notes carry real meaning that plain character recognition loses or misreads.

Stamps, watermarks, and overprints. Content layered over content confuses engines that assume a clean character on a clean background.

Degraded scans. Skew, low contrast, speckle, bleed-through, and partial pages all erode accuracy - often silently.

Variable layout. Documents that look similar but arrive from many different sources rarely place the same field in the same position on the page.

None of these are edge cases in the real world. They are the world. A system that only handles the clean subset has automated the easy part - call it 60% - and left the hard, costly remainder exactly where it was.

"Just use an LLM" isn't the whole answer either

The instinct now is to throw a frontier language model at everything, since modern models read handwriting and context impressively well. But running every field - including millions of clean, printed reference numbers - through a large model is slower and far more expensive than it needs to be, and it applies a heavyweight tool to a problem a lightweight one already solves. A single model, of any kind, is rarely the right choice across an entire document population that ranges from pristine print to scrawled marginalia.

The realisation that matters: the question isn't "OCR or LLM?" It's "which method, for this content, on this document?"

The adaptive approach: match the method to the document

The systems that actually perform treat extraction as a routed pipeline rather than a single pass. In practice that means a few coordinated stages.

Pre-processing first, because it multiplies everything downstream. Before any extraction, clean the image: correct skew, reduce noise, normalise contrast, re-orient rotated pages, and order multi-page documents correctly. Accuracy gains here are some of the cheapest available - every later stage performs better on a cleaner input.
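To make the pre-processing idea concrete, here is a minimal sketch of two of the steps named above: contrast normalisation and re-orienting an upside-down page. It assumes a grayscale image held as nested lists of 0-255 values, and the function names are illustrative rather than taken from any particular library; a production pipeline would normally lean on an imaging library and would also handle skew and noise.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [pixel for row in image for pixel in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch; return a copy.
        return [row[:] for row in image]
    scale = 255 / (hi - lo)
    return [[round((pixel - lo) * scale) for pixel in row] for row in image]


def rotate_180(image: Image) -> Image:
    """Re-orient a page that was scanned upside down."""
    return [row[::-1] for row in image[::-1]]


if __name__ == "__main__":
    faded = [[100, 120], [140, 160]]  # a low-contrast scan fragment
    print(stretch_contrast(faded))
    print(rotate_180([[1, 2], [3, 4]]))
```

The faded sample only spans 100 to 160; after stretching it uses the full range, which is the kind of cheap gain every later stage benefits from.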

Classify before you extract. Identify what each document is before deciding how to read it. A known, templated form can be handled differently from an unstructured letter, and classification lets the pipeline apply the right strategy rather than a one-size-fits-all guess.
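As a rough illustration of classifying before extracting, the sketch below scores a page's text against cue phrases and falls back to "unstructured" when nothing matches. The document types and cue phrases are hypothetical placeholders, and keyword matching is a deliberately simple stand-in for whatever classifier a real system would use.

```python
from typing import Dict, List

# Hypothetical document types and cue phrases, for illustration only.
CUES: Dict[str, List[str]] = {
    "claim_form": ["claim number", "policy number"],
    "invoice": ["invoice number", "amount due"],
}


def classify(text: str) -> str:
    """Return the document type whose cue phrases appear most often in the text."""
    lowered = text.lower()
    scores = {
        doc_type: sum(cue in lowered for cue in cues)
        for doc_type, cues in CUES.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    # No cue matched at all: treat it as free-form correspondence.
    return best_type if best_score > 0 else "unstructured"


if __name__ == "__main__":
    print(classify("INVOICE NUMBER 123 - Amount due: 40.00"))
    print(classify("Dear Sir, I am writing about my recent visit."))
```

The point of the fallback is that an unrecognised document gets the flexible strategy rather than being forced through a template that doesn't fit it.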

Route by content and quality. Send clean, printed fields to fast, inexpensive OCR. Send handwriting, corrections, and degraded or ambiguous content to a language model that can reason about context. Treat signatures as a detection task in their own right. The result is a system that is fast and cheap where it can be, and powerful where it has to be - instead of compromising on a single setting for the whole range.
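The routing rule described above can be sketched in a few lines. Everything here is an assumption made for illustration: the content labels, the quality score between 0 and 1, and the 0.8 threshold would all come from upstream stages and tuning in a real pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    OCR = "ocr"
    LANGUAGE_MODEL = "language_model"
    SIGNATURE_DETECTOR = "signature_detector"


@dataclass(frozen=True)
class Region:
    name: str
    content: str    # assumed labels: "printed", "handwritten", "correction", "signature"
    quality: float  # assumed scale: 0.0 (unusable scan) to 1.0 (clean)


MIN_OCR_QUALITY = 0.8  # hypothetical threshold, to be tuned on real documents


def route(region: Region) -> Method:
    """Pick the cheapest method that can be trusted with this region."""
    if region.content == "signature":
        # Signatures are detected and located, not transcribed.
        return Method.SIGNATURE_DETECTOR
    if region.content == "printed" and region.quality >= MIN_OCR_QUALITY:
        return Method.OCR
    # Handwriting, corrections, and degraded print need contextual reading.
    return Method.LANGUAGE_MODEL


if __name__ == "__main__":
    for region in (
        Region("reference_number", "printed", 0.95),
        Region("reference_number", "printed", 0.40),
        Region("notes", "handwritten", 0.90),
        Region("sign_off", "signature", 0.90),
    ):
        print(region.name, region.content, "->", route(region).value)
```

Note that the same printed field goes to different methods depending on scan quality, which is what routing by content and quality means in practice.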

Handle the unreadable honestly. When content genuinely can't be read with confidence, the right behaviour is to return the field empty and flagged for review - never to invent a plausible value. A fabricated entry that looks right is far more dangerous than an honest blank, because it passes silently into the downstream process. (This is where extraction and confidence scoring meet: the pipeline should know what it doesn't know.)
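One way to enforce that behaviour is a gate between extraction and the downstream process. The sketch below assumes each extracted value arrives with a confidence score between 0 and 1; the 0.9 threshold and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldResult:
    name: str
    value: Optional[str]  # None means "not read", never a guess
    confidence: float
    needs_review: bool


REVIEW_THRESHOLD = 0.9  # hypothetical; set per field and per process risk


def finalise(name: str, candidate: Optional[str], confidence: float) -> FieldResult:
    """Pass a value through only when it was read with enough confidence."""
    if candidate is None or confidence < REVIEW_THRESHOLD:
        # Drop the low-confidence candidate and flag the field for a person.
        return FieldResult(name, None, confidence, needs_review=True)
    return FieldResult(name, candidate, confidence, needs_review=False)


if __name__ == "__main__":
    print(finalise("date_of_birth", "1984-03-12", 0.97))
    print(finalise("date_of_birth", "1984-08-12", 0.55))
```

The low-confidence candidate is deliberately discarded rather than passed along with a warning, so a plausible-looking guess cannot slip into the downstream process unnoticed.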

Templated versus non-templated, and why it matters

A lot of document tooling assumes a fixed template: the field is always in this box, at these coordinates. That works right up until your documents arrive from many different originating systems, each with its own layout, at which point coordinate-based extraction quietly falls apart.

Real-world extraction has to handle both: the efficiency of templated extraction where layouts are known and stable, and the flexibility to extract the same logical field from documents that have never been seen in that exact form before. A capable system isn't tied to a template; it understands what a field means and finds it wherever it sits.
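A minimal sketch of handling both cases: use the template's coordinates when the layout is known, and otherwise find the field by its label. The template box, the label text, and the word positions are all invented for illustration, and "nearest word to the right of the label" is a crude stand-in for genuinely understanding what a field means.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page


Box = Tuple[float, float, float, float]  # left, top, right, bottom

# Hypothetical template and label definitions.
TEMPLATES: Dict[str, Dict[str, Box]] = {"form_a": {"reference": (100, 40, 200, 60)}}
LABELS: Dict[str, str] = {"reference": "ref:"}


def from_template(words: List[Word], box: Box) -> Optional[str]:
    """Read whatever words fall inside a fixed box."""
    left, top, right, bottom = box
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    inside.sort(key=lambda w: w.x)
    return " ".join(w.text for w in inside) or None


def from_label(words: List[Word], label: str, line_tolerance: float = 5.0) -> Optional[str]:
    """Find the label anywhere on the page and take the next word on its line."""
    anchors = [w for w in words if w.text.lower() == label]
    if not anchors:
        return None
    anchor = anchors[0]
    right_of = [
        w for w in words
        if w.x > anchor.x and abs(w.y - anchor.y) <= line_tolerance
    ]
    return min(right_of, key=lambda w: w.x).text if right_of else None


def extract(field: str, doc_type: str, words: List[Word]) -> Optional[str]:
    box = TEMPLATES.get(doc_type, {}).get(field)
    if box is not None:
        return from_template(words, box)
    return from_label(words, LABELS[field])


if __name__ == "__main__":
    known = [Word("Ref:", 50, 50), Word("AB123", 120, 50)]
    unseen = [Word("Dear", 10, 10), Word("Ref:", 300, 200), Word("AB123", 340, 201)]
    print(extract("reference", "form_a", known))
    print(extract("reference", "letter", unseen))
```

Both calls recover the same logical field, even though the second document has no template and puts the reference somewhere else entirely.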

What "good" looks like

If you're evaluating a document-extraction capability, these are the questions that cut to whether it can handle the real mailbag rather than the demo:

Does it pre-process images, or does extraction accuracy simply degrade on poor scans?

Does it use one method for everything, or route by document type and content quality?

How does it handle handwriting and corrections specifically - not just printed text?

Are signatures treated as detection, or wrongly forced through text recognition?

Can it extract from documents arriving in many different layouts, not just a fixed template?

What does it do when something is illegible - flag it, or guess?

A system that has good answers here is built for real documents. One that demos only on clean samples is showing you the 60% and hoping you don't ask about the rest.
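If you want to put the last question to the test, one simple approach is to score extracted fields against a hand-labelled sample and count honest blanks separately from wrong values. The sketch below assumes a blank is represented as None; the field names and values are made up.

```python
from collections import Counter
from typing import Dict, Optional


def score(truth: Dict[str, str], extracted: Dict[str, Optional[str]]) -> Counter:
    """Count correct values, flagged blanks, and wrong values that slipped through."""
    outcome: Counter = Counter()
    for field, expected in truth.items():
        got = extracted.get(field)
        if got is None:
            outcome["flagged_blank"] += 1  # honest: sent for review
        elif got == expected:
            outcome["correct"] += 1
        else:
            outcome["silent_error"] += 1   # dangerous: looks right, is wrong
    return outcome


if __name__ == "__main__":
    truth = {"reference": "AB123", "date": "2024-01-05", "name": "J Smith"}
    extracted = {"reference": "AB123", "date": None, "name": "J Smyth"}
    print(score(truth, extracted))
```

Run on real samples rather than clean ones, the silent-error count is the number to watch: it measures how often the system guesses instead of flagging.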

The takeaway

OCR isn't obsolete - it's one tool, excellent at one job. The mistake is treating it, or any single model, as the whole solution. Reliable extraction from the messy reality of handwriting, signatures, corrections, and poor scans comes from a pipeline that prepares the image, understands the document, routes each piece of content to the right method, and is honest about what it can't read.

That pipeline is what PromptX, VE3's intelligent document processing platform, is built around: image pre-processing, document classification, adaptive routing between high-speed OCR and frontier language models by content and quality, and explicit handling of the illegible rather than fabrication. The goal isn't to read the easy documents - anything can do that. It's to read the ones your process receives.