OCR works well in controlled conditions — structured forms, consistent fonts, clean scans. Production financial services environments are none of those things. Fact finds come from dozens of different providers. Valuation reports vary by platform. Client letters change format with every software update. Any extraction pipeline that assumes document consistency will fail repeatedly in production.
The classification problem comes first
Before any text is extracted, the system needs to know what kind of document it's looking at. A valuation report from platform A requires different extraction logic than the same report from platform B. A fact find completed digitally has a different structure than the same form filled by hand and scanned. Classification has to happen at ingestion — and it has to handle documents that don't match any known template.
We handle this with a two-stage approach: a primary classifier that predicts document type and provider, and a fallback that routes documents it does not recognize to a human review queue rather than attempting extraction. Each routing decision is logged so the classifier can be improved over time as new document variants appear.
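As a rough illustration, the routing step might look like the sketch below. Everything named here is hypothetical: the `Classification` record, the `KNOWN_TEMPLATES` registry, and the 0.90 confidence cut-off are assumptions made for the example, not a description of a specific production system. Treating a low classifier score as "unknown" is also an assumption.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("ingestion")


@dataclass(frozen=True)
class Classification:
    doc_type: str      # e.g. "valuation_report"
    provider: str      # e.g. "platform_a"
    confidence: float  # classifier score between 0.0 and 1.0


# Hypothetical registry of (doc_type, provider) pairs that have extraction logic.
KNOWN_TEMPLATES = {
    ("valuation_report", "platform_a"),
    ("valuation_report", "platform_b"),
    ("fact_find", "digital"),
}

# Assumed cut-off; a real value would be tuned against labelled documents.
MIN_CLASSIFIER_CONFIDENCE = 0.90


def route(doc_id: str, result: Classification) -> str:
    """Return "extract" for recognised documents, "human_review" otherwise."""
    known = (result.doc_type, result.provider) in KNOWN_TEMPLATES
    confident = result.confidence >= MIN_CLASSIFIER_CONFIDENCE
    decision = "extract" if (known and confident) else "human_review"
    # Log every routing decision so new document variants can be found later.
    logger.info(
        "doc=%s type=%s provider=%s confidence=%.2f decision=%s",
        doc_id, result.doc_type, result.provider, result.confidence, decision,
    )
    return decision
```

In this sketch a document from an unregistered provider goes to review even when the classifier is confident, because no extraction logic exists for it yet.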
Extraction confidence and validation
Every extracted field should carry a confidence score. Fields extracted with low confidence shouldn't be silently passed downstream — they should trigger a review checkpoint. The threshold for what counts as low confidence varies by field criticality: a client's date of birth requires higher confidence than a correspondence address.
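One way to express criticality-dependent thresholds is a per-field lookup, sketched below. The field names and threshold values are illustrative assumptions; real values would come from measuring extraction accuracy per field.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # extraction score between 0.0 and 1.0


# Assumed thresholds: more critical fields demand higher confidence.
FIELD_THRESHOLDS = {
    "date_of_birth": 0.98,
    "correspondence_address": 0.85,
}
DEFAULT_THRESHOLD = 0.95  # assumed default for fields not listed above


def needs_review(field: ExtractedField) -> bool:
    """True when the field's confidence is below the threshold for its name."""
    threshold = FIELD_THRESHOLDS.get(field.name, DEFAULT_THRESHOLD)
    return field.confidence < threshold


def split_for_review(
    fields: List[ExtractedField],
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Separate fields that can pass downstream from those needing a checkpoint."""
    accepted = [f for f in fields if not needs_review(f)]
    flagged = [f for f in fields if needs_review(f)]
    return accepted, flagged
```

With these assumed numbers, the same confidence of 0.92 is acceptable for a correspondence address but sends a date of birth to review.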
Validation logic should be independent of extraction logic. A client age of 847 might be extracted with high confidence because the text was clearly readable, but it should still fail validation. Extraction failures and validation failures therefore need separate handling: extraction failures indicate document quality or model problems, while validation failures usually indicate upstream data quality issues that need operational attention.
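The separation can be sketched as two independent checks whose results are reported separately. The plausible age range and the function names below are assumptions chosen for the example.

```python
from typing import List, Optional

MAX_PLAUSIBLE_AGE = 120  # assumed upper bound for a client age


def validate_age(age: int) -> Optional[str]:
    """Return an error message if the age is implausible, else None.

    This check deliberately ignores extraction confidence.
    """
    if age < 0 or age > MAX_PLAUSIBLE_AGE:
        return f"age {age} outside plausible range 0-{MAX_PLAUSIBLE_AGE}"
    return None


def check_age(age: int, confidence: float, threshold: float) -> List[str]:
    """List which failure modes apply; both can occur for the same field."""
    failures = []
    if confidence < threshold:
        failures.append("extraction")   # document quality or model problem
    if validate_age(age) is not None:
        failures.append("validation")   # likely upstream data quality issue
    return failures
```

Here `check_age(847, 0.99, 0.95)` reports only a validation failure: the value was read clearly, but it cannot be right.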
Designing for the exception rate
No extraction pipeline achieves 100% straight-through processing in a real financial services environment. Designing for the exception rate means building the review queue as a first-class part of the system, not as an afterthought. The queue needs to show reviewers exactly what was extracted, what was uncertain, and what failed — with enough document context to make a fast, accurate correction.
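A review queue entry could carry this information in a structure like the one below. This is a minimal sketch: the record layout, the status values, and the `source_snippet` and `page` fields are all hypothetical, chosen to show extracted, uncertain, and failed fields alongside document context.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldResult:
    name: str
    value: Optional[str]       # None when nothing could be extracted
    confidence: float
    status: str                # assumed values: "ok", "uncertain", "failed"
    source_snippet: str = ""   # surrounding text, so the reviewer has context
    page: Optional[int] = None


@dataclass
class ReviewItem:
    document_id: str
    document_type: str
    fields: List[FieldResult] = field(default_factory=list)

    def needing_attention(self) -> List[FieldResult]:
        """Fields the reviewer must look at: anything not marked "ok"."""
        return [f for f in self.fields if f.status != "ok"]


item = ReviewItem(
    document_id="doc-001",
    document_type="fact_find",
    fields=[
        FieldResult("client_name", "A. Example", 0.99, "ok"),
        FieldResult("date_of_birth", "1980-02-14", 0.71, "uncertain",
                    source_snippet="DOB: 14/02/1980", page=1),
        FieldResult("age", "847", 0.98, "failed",
                    source_snippet="Age: 847", page=1),
    ],
)
```

Keeping the confidently extracted fields in the same record lets the reviewer see the whole document at once rather than only the problem fields.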
The target isn't zero exceptions. It's a manageable exception rate with a review process that's fast enough to maintain operational throughput. A system that processes 85% of documents automatically with a 15-minute review cycle for exceptions usually outperforms a manual process in both speed and accuracy, even before further model improvement.