Multi-Dimensional Extraction for Unstructured Content

Ephesoft’s Patented Multi-Dimensional Extraction Methodology for Classifying Documents and Extracting Data

Guest Blog Excerpt from Jake Karnes – ECM Consultant at Zia Consulting

Executive Summary: At INNOVATE 2016, Ephesoft announced a new feature to be included in Transact and Insight: multi-dimensional extraction. This feature was described as using machine learning to improve extraction results from unstructured documents with minimal user intervention. While the description is very exciting, I was left wondering how the system accomplished this lofty goal.

Ephesoft was also proud to announce that it had received a US patent for the technology. I took it upon myself to read Ephesoft’s patent (US 9,384,264 B1) to find out what is going on behind the scenes. I’ll even share what I’ve learned so that no one else has to repeat my experience.

The patent actually describes much more than just multi-dimensional extraction. It includes information regarding machine learning, system architecture to handle massive scale, and much more. However, I found the most insightful information to be the description of what “multi-dimensional extraction” truly means, and how Ephesoft is implementing the concept.

Multi-dimensional extraction is the process of using different techniques in combination to determine which text on a page is the true value that should be extracted for further analysis or processing. The patent refers to these techniques as dimensions. Each dimension analyzes the document in a different way, and their analyses are combined to determine the best candidate for extraction.
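To make the idea of combining dimensions concrete, here is a minimal Python sketch. It is not taken from the patent and is not Ephesoft's implementation: the `Candidate` class, the three dimension functions, the weights, and the weighted-average combination are all assumptions chosen purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    """A piece of text found on a page, plus where it was found."""
    text: str
    page: int   # 1-based page number
    x: float    # horizontal position, 0.0 (left) to 1.0 (right)
    y: float    # vertical position, 0.0 (top) to 1.0 (bottom)


# Hypothetical dimensions: each returns a score between 0.0 and 1.0.
def format_dimension(c: Candidate) -> float:
    """Does the text look like a currency amount such as $1,250.00?"""
    pattern = r"\$?\d{1,3}(,\d{3})*\.\d{2}"
    return 1.0 if re.fullmatch(pattern, c.text) else 0.0


def location_dimension(c: Candidate) -> float:
    """Assumed rule: an invoice total tends to sit toward the bottom right."""
    return (c.x + c.y) / 2


def page_dimension(c: Candidate) -> float:
    """Assumed rule: the value is most likely to appear on the first page."""
    return 1.0 if c.page == 1 else 0.5


# Each dimension is paired with an illustrative weight (not from the patent).
DIMENSIONS: Dict[str, Tuple[Callable[[Candidate], float], float]] = {
    "format": (format_dimension, 0.5),
    "location": (location_dimension, 0.3),
    "page": (page_dimension, 0.2),
}


def combined_score(c: Candidate) -> float:
    """Weighted average of every dimension's score for one candidate."""
    total_weight = sum(weight for _, weight in DIMENSIONS.values())
    weighted = sum(fn(c) * weight for fn, weight in DIMENSIONS.values())
    return weighted / total_weight


def best_candidate(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=combined_score)


candidates = [
    Candidate("03/14/2016", page=1, x=0.8, y=0.1),
    Candidate("$1,250.00", page=1, x=0.9, y=0.9),
    Candidate("$12.50", page=1, x=0.2, y=0.4),
]
print(best_candidate(candidates).text)  # prints $1,250.00
```

No single dimension settles the question here: two candidates look like currency amounts, and only the combination with page position separates them. A trained system would presumably learn such weights from examples rather than hard-code them.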

Several categories of dimensions are described within the patent. Value dimensions provide analysis of each piece of text individually to determine if a value is correct for extraction using information such as page location, page number, text format, font size, font style, and more. Anchor dimensions expand the analysis to include the words surrounding a particular value. These dimensions look at the neighboring words/phrases to determine if expected keywords are found, and if the value’s neighbors match any of the neighbors found during training. Finally, there are some use-case-specific dimensions described as well. For example, addresses and zip codes can be analyzed further by ensuring their validity from a list of known, valid addresses and zip codes.
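The anchor and validation ideas above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the method described in the patent: the function names, the three-word window, the keyword-overlap score, and the tiny sample zip code set are all hypothetical.

```python
from typing import List, Set


def anchor_score(tokens: List[str], index: int,
                 keywords: Set[str], window: int = 3) -> float:
    """Fraction of expected keywords found near the token at `index`.

    Looks at up to `window` tokens on each side of the candidate value.
    """
    if not keywords:
        return 0.0
    start = max(0, index - window)
    neighbors = tokens[start:index] + tokens[index + 1:index + 1 + window]
    # Normalize case and strip simple punctuation before comparing.
    normalized = {t.lower().strip(":.,") for t in neighbors}
    return len(normalized & keywords) / len(keywords)


def zip_score(value: str, known_zips: Set[str]) -> float:
    """Use-case-specific check: is the value in a list of valid zip codes?"""
    return 1.0 if value in known_zips else 0.0


# Hypothetical keywords, standing in for neighbors seen during training.
total_keywords = {"total", "due"}

invoice_tokens = "Invoice Total Due : $1,250.00 Thank you".split()
shipping_tokens = "Ship to 92618 Irvine CA".split()

# The amount at index 4 is surrounded by both expected keywords.
print(anchor_score(invoice_tokens, 4, total_keywords))   # prints 1.0
# The number at index 2 has no expected keywords nearby.
print(anchor_score(shipping_tokens, 2, total_keywords))  # prints 0.0

# Sample lookup set; a real system would use a full reference list.
sample_zips = {"92618", "80301"}
print(zip_score("92618", sample_zips))  # prints 1.0
print(zip_score("99999", sample_zips))  # prints 0.0
```

Scores like these would then be one input among several, combined with the value dimensions to rank the candidates.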

Multi-dimensional extraction combines a wide variety of information to create a holistic approach to document-based extraction, which is strikingly similar to how humans perform the same task. Ephesoft has described a robust and flexible system for finding important information in unstructured text. I’m eager to see the performance of this feature in practice and expect to see improved extraction results from documents which previously proved difficult.

A much more in-depth discussion of the patent and multi-dimensional extraction can be found here.

Jake Karnes is an ECM Consultant at Zia Consulting. He extends and integrates Ephesoft and Alfresco to create complete content solutions. In addition to client integrations, Jake has helped create Zia stand-alone solutions such as mobile applications, mortgage automation, and analytic tools. He’s always eager to discuss software down to the finest details. You can find Jake on LinkedIn.