Optical Character Recognition technology, or OCR, has been around for quite some time.
It became mainstream back in the ’70s when a man named Ray Kurzweil developed a technology to help the visually impaired. He quickly realized the broad commercial implications of the technology, and so did Xerox, which purchased his company. From there, the technology saw broad adoption across all types of use cases.
At its simplest, OCR is a means to take an image and convert recognized characters to text. In the Enterprise Content Management (ECM) world, it is this technology that provides a broad range of metadata and content collection methods as documents are scanned and processed. Here are the basic legacy forms of OCR that can be leveraged:
- Full-Text OCR – converts the entire document image to text, allowing full-text search. With this OCR type, documents are typically converted to an Image+Text PDF, which can be crawled so that the content is fully searchable.
- Zone OCR – zoning extracts text from a specific location on the page. In this form of “templated” processing, specific OCR metadata can be extracted and mapped to an ECM system index field or column. This method is appropriate for structured documents that always have the data in the same location.
- Pattern Matching OCR – pattern matching is purely a method to filter or match patterns within OCR text. This technique can provide some capability for extracting data from unstructured or non-homogeneous documents. For example, you could extract a Social Security Number pattern (XXX-XX-XXXX) from the OCR text.
These forms of OCR are considered legacy methods of extraction. Although they can provide some value in any document process, they are purely data-driven and operate only at the text level.
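To make the zone and pattern-matching methods concrete, here is a minimal Python sketch. It assumes the OCR engine has already returned recognized words with page coordinates; the `Word` class, the sample data, and the zone coordinates are all hypothetical and not tied to any particular OCR product.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels

# Hypothetical OCR output for one page (made-up values for illustration).
WORDS = [
    Word("Invoice", 50, 40), Word("No:", 130, 40), Word("INV-1042", 180, 40),
    Word("SSN:", 50, 300), Word("123-45-6789", 110, 300),
]

def zone_text(words, left, top, right, bottom):
    """Zone OCR: join the words whose top-left corner falls inside the zone."""
    hits = [w for w in words if left <= w.x < right and top <= w.y < bottom]
    hits.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in hits)

# Pattern matching: the XXX-XX-XXXX shape of a Social Security Number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssns(text):
    """Return every substring of the OCR text that has the SSN shape."""
    return SSN_PATTERN.findall(text)

# Templated extraction: the zone is fixed, so it only works if the layout is.
invoice_number = zone_text(WORDS, left=170, top=30, right=300, bottom=60)

# Pattern extraction: works anywhere in the text, but knows nothing about context.
full_text = " ".join(w.text for w in WORDS)
ssns = find_ssns(full_text)
```

Note that the zone lookup depends entirely on a fixed layout, and the pattern match would accept any nine digits in that shape regardless of what they mean. Both operate on text alone.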
Enter OCR 2.0. Today, at Ephesoft, we use OCR as the bottom layer of our document analytics and intelligence stack. The OCR text is now pushed through algorithms that derive meaning from many dimensions: location, size, font, patterns, values, zones, numbers, and more (you can read about this patented technology here: Document Analytics and Why It Matters in Capture and OCR). So rather than being purely data-centric, or functioning only at the text layer, we now create an intelligence layer that can be used beyond text searching and metadata. And the best part? This technology has been extended to non-scanned files like Office documents.
Examples? See below:
- Multi-dimensional Classification – using that analysis capability (with OCR as algorithm input) and all the collected dimensions of the document, the document type or content type can now be accurately identified. As documents are fed into any system, they can be intelligently classified, and that information can then drive workflows, retention policies, security restrictions and more. You can see more on this topic in this video on Multidimensional Classification Technology: Machine Learning and Classification of Documents
- Machine Learning – legacy OCR technology provided no means to “get smarter” as documents were processed. Looking at the pure text alone, it either recognized it or it did not. With a machine learning layer, you now have a system that gets more efficient the more you use it. The key here is that learned intelligence must span documents; it cannot be tied to any one item. It’s this added efficiency that can drive usage and adoption through ease of use. You can see more on machine learning in the videos below:
- Document Analytics, Accuracy, and Extraction – with legacy OCR, extracting the information you need can be problematic at best. How do you raise confidence that the information you have is accurate? With an analysis engine, we look not just at the text, but at where it sits, what surrounds it, and known patterns or libraries. This added layer makes it possible to express higher confidence in data extraction and helps ensure you are putting the right data into your backend systems.
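As a rough illustration of what looking at "where it sits" and "what surrounds it" can mean, here is a toy Python sketch. It is not Ephesoft's patented technology: the `Token` class, the scoring weights, and the sample data are all assumptions made for this example. It simply shows a candidate value gaining confidence when it both matches a pattern and has the expected label to its left on the same line.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels

DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def score_candidate(token, tokens, label, max_gap=150):
    """Toy confidence score with arbitrary weights:
    0.5 for matching the date pattern, plus 0.5 if the expected
    label sits on the same line, close by, to the left."""
    score = 0.0
    if DATE_PATTERN.match(token.text):
        score += 0.5
    for other in tokens:
        same_line = abs(other.y - token.y) <= 5
        left_and_close = 0 < token.x - other.x <= max_gap
        if other.text.rstrip(":") == label and same_line and left_and_close:
            score += 0.5
            break
    return score

def best_match(tokens, label):
    """Pick the token with the highest combined score for the given label."""
    return max(tokens, key=lambda t: score_candidate(t, tokens, label))

# Hypothetical OCR tokens: two dates that pattern matching alone cannot tell apart.
TOKENS = [
    Token("Issued:", 50, 100), Token("01/02/2024", 130, 100),
    Token("Due:", 50, 130), Token("02/01/2024", 130, 130),
]

due = best_match(TOKENS, "Due")
```

Both dates match the pattern, so a text-only approach cannot separate them. Only the one next to the "Due" label receives the higher score, which is the kind of signal an analysis layer can use to express confidence.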