A vendor tells you their document-AI solution is "99% accurate." Take a moment with that number, because as stated it's close to meaningless. Ninety-nine per cent of what - fields, or whole documents? Measured on which documents - the clean ones, or the real mailbag? Established how, and adjudicated by whom? And the question that matters most: what happens on the day it isn't 99%?

For anyone procuring or owning an AI service in a regulated environment, this is the shift in mindset that separates a good outcome from a disappointing one. Accuracy is not a headline figure you accept at signing. It is something you specify, measure, and manage across the life of the contract. The difference between a promise and a programme is the difference between hoping the number holds and knowing what you'll do when it moves.

The problem with accuracy as a headline number

A single percentage compresses away everything you need to govern the service. It hides the unit of measurement, the population it was measured on, the method behind it, and - crucially - whether anyone is contractually obliged to maintain it. "Best-efforts accuracy" is not a commitment; it's a sentiment. What you want instead is a small set of defined, measurable indicators with consequences attached, written so that both sides can agree, months later, whether the target was met.

Getting there means being specific about four things most accuracy claims leave vague.

1. Define the unit: document-level vs field-level

This is the distinction that catches buyers out most often. Field-level accuracy measures the proportion of individual fields extracted correctly. Document-level accuracy measures the proportion of documents that are correct in their entirety - every field right.

The two diverge sharply. A system at 99% field-level accuracy, on a document with twenty fields, will on average get a field wrong on roughly one document in five - so its document-level accuracy could sit far lower than the impressive field number implies. In workflows where a single wrong field invalidates the whole record - which describes most regulated document processing - document-level accuracy is the figure that reflects real-world outcomes. Specify which unit your targets refer to, and be wary of a headline number that quietly uses the more flattering one.

2. Define the population

Accuracy measured on a curated sample of clean documents tells you little about performance on the documents you actually receive. The target should be tied to a representative population - the real mix of clean and degraded, printed and handwritten, templated and non-templated - or, better, reported per document type so you can see where performance is strong and where it isn't. An aggregate number across an unrepresentative sample is the easiest place for a claim to hide.

3. Define the method and the ground truth

A measurable target needs an agreed measurement method: how samples are drawn, how often, what counts as "correct," and - the hard part - what serves as ground truth. Define this up front, because it's where disputes live. If a field was genuinely illegible on the source document, is a blank the correct output or an error? (It should usually count as correct behaviour, provided the system flagged it rather than guessing.) Agree the adjudication process, the exclusions, and who arbitrates a contested case before you need it.

4. Treat accuracy as a trajectory, not a fixed point

A well-designed AI service improves as it encounters real data. That makes a single fixed accuracy figure for the whole contract term both unrealistic and unambitious - too high to guarantee on day one, too low to accept by year two. A more honest and more demanding structure is a committed improvement trajectory: a starting target that the system is expected to meet from go-live, rising through defined milestones to a higher target as it matures on production data. This rewards the vendor for genuine improvement and gives you a measurable expectation at each stage, rather than a flat promise that's either sandbagged or unachievable.

The closed loop that makes improvement real

An improvement trajectory is only credible if there's a mechanism behind it. "Continuous improvement" is a slogan until you can see the loop that produces it. A real one has five stages, and it's worth asking a vendor to walk you through theirs:

Capture. Every correction a human reviewer makes is recorded as a confirmed, labelled example of what the right answer was. This is the raw material for improvement, and it's a by-product of the review process you already run.

Diagnose. Errors are root-caused at the level of the specific field and document type, not treated as one undifferentiated pile. The question is always "what is failing, on which documents, and why" - a misread handwriting style, a new layout, a degraded scan source.

Fix. The remedy is targeted to the cause - a model or prompt adjustment, a new validation rule, a pre-processing change - rather than a blunt across-the-board retrain.

Regression-test. Before any change is deployed, it's tested against a held-out set to confirm it improves the target case without degrading anything that previously worked. This discipline is what stops "improvements" from quietly costing you accuracy elsewhere.

Validate in production. The change is deployed and the uplift confirmed on real traffic, then the loop repeats.

This is ordinary MLOps discipline, but it's exactly what's missing when a system's accuracy stalls or drifts. Which leads to the other reason the loop matters.

The loop also defends against drift

AI accuracy is not static even if you do nothing. As the input changes - a new originating system, a format revision, a change in scanning quality - performance can decay. This is model drift, and it's invisible until it shows up as errors downstream. The same monitoring that drives improvement is what detects decline: per-route and per-document-type accuracy tracking, and watching the distribution of confidence and corrections for early warning. A managed accuracy programme isn't only about climbing toward a higher target; it's about noticing, quickly, when you've started sliding away from the one you have.

Fair measurement and dispute resolution

Because accuracy targets carry consequences - service credits, remediation, escalation - the measurement regime has to be fair to both sides, or it becomes a source of friction rather than assurance. Agree in advance: the sampling and reporting cadence, the exclusions (genuinely illegible source content, out-of-scope document types), how a contested result is adjudicated and by whom, and what remediation a missed target triggers. The goal isn't to penalise; it's to make the commitment real and the relationship durable. A vendor confident in their programme will welcome a clear measurement regime, because it protects them as much as you.

What to require

If you're specifying or evaluating an AI accuracy commitment, insist on:

Which unit the target uses - document-level or field-level - and why.

The document population the target is measured against, ideally broken down by type.

A defined measurement method, ground-truth process, and adjudication route.

An improvement trajectory with milestones, not a single flat figure.

A described closed-loop improvement process - capture, diagnose, fix, regression-test, validate.

Drift monitoring, so decline is caught early, not after it reaches your downstream process.

A fair, pre-agreed dispute and remediation mechanism.

A vendor who can answer these is running a programme. One who offers a single confident percentage and little else is offering a promise - and you'll discover the difference only after you've signed.

The takeaway

Accuracy you can rely on is governed, not guaranteed. It's defined precisely enough to measure, tied to the documents you actually handle, expected to improve along a visible trajectory, sustained by a closed loop that turns human corrections into systemic gains, and protected by monitoring that catches drift before it bites. Treated that way, accuracy stops being a number you take on trust and becomes an outcome you can manage and evidence.

That is how PromptX, VE3's intelligent document processing platform, treats it: human-in-the-loop corrections feed a closed improvement loop that root-causes errors by field and document type, deploys targeted fixes under regression testing, and validates uplift in production - with accuracy measured and reported rather than asserted. In regulated document work, the right question to ask a vendor isn't "how accurate are you?" It's "how do you manage accuracy?" and the quality of the answer tells you most of what you need to know.