Presentation Discussion on Ephesoft’s Patented, Multi-dimensional Data Capture Methodology
Ray Azarm – Regional Sales Manager and ECM Veteran
Ray Azarm discusses Ephesoft’s patented, multi-dimensional data capture methodology, which is used in both the Ephesoft Transact and Insight platforms. The purpose of this session is to do a technical deep-dive into how we approach data extraction and why we believe it provides the highest overall confidence when extracting data from semi-structured or unstructured content.
We will first provide a bit of background on the multi-dimensional data capture methodology and machine learning, and why they matter today. Essentially, machine learning is a way to analyze data using algorithms that find insights without being explicitly programmed to do so. Machine learning has become critical for essentially the same reasons that data mining has become important: growing data volumes, a wider variety of data, and the need for accuracy.
The focus of this session is multi-dimensional data capture when reading data from semi-structured or unstructured content. Since much of this presentation is about dimensions, let’s first define the term.
As you know, the word dimension is often applied to a measurement of some kind, such as a length or a width. For our purpose today, a dimension is not only a measurement but, more importantly, the situation or position of an important factor as it relates to the other entities around it. As an example, when we read a field labeled “invoice number,” we can potentially make several inferences. Perhaps this is an invoice document type, and maybe we can expect to find a vendor name and address and possibly a ship-to address. Maybe there are line items on this document that further confirm it is an invoice.
All of these fields have a relationship to each other that must be accounted for when one looks at this document and draws conclusions about which elements to extract and how accurate each extraction is. These different relationships are what we essentially call dimensions on a document. For every one of these dimensions, we assign a confidence score (on a scale of zero to one), and ultimately the aggregate of these scores provides the overall confidence and accuracy rate that we are looking for in our data capture exercise.
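To make the idea of aggregating per-dimension scores concrete, here is a minimal sketch in Python. It is an illustration only and not Ephesoft’s actual implementation: the session does not say how the scores are combined, so the weighted average, the `DimensionScore` type, the `aggregate_confidence` function, and the dimension names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """One dimension's evidence for an extracted value (hypothetical type)."""
    name: str
    confidence: float    # between 0.0 and 1.0, as described in the session
    weight: float = 1.0  # assumed: relative importance of this dimension


def aggregate_confidence(scores: list[DimensionScore]) -> float:
    """Combine per-dimension scores into one overall confidence.

    Assumption: a weighted average is used here purely for illustration.
    The real aggregation method is not specified in the source.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    for score in scores:
        if not 0.0 <= score.confidence <= 1.0:
            raise ValueError(f"confidence out of range for {score.name!r}")
        if score.weight < 0.0:
            raise ValueError(f"negative weight for {score.name!r}")
    total_weight = sum(score.weight for score in scores)
    if total_weight == 0.0:
        raise ValueError("total weight must be greater than zero")
    weighted_sum = sum(score.confidence * score.weight for score in scores)
    return weighted_sum / total_weight


# Hypothetical dimensions for an "invoice number" field.
invoice_number_scores = [
    DimensionScore("label_match", 0.9),
    DimensionScore("vendor_block_nearby", 0.8),
    DimensionScore("line_items_present", 0.7),
]
overall = aggregate_confidence(invoice_number_scores)
print(f"overall confidence: {overall:.2f}")
```

A weighted average keeps the result on the same zero-to-one scale as the inputs, which makes it easy to compare against a review threshold. Other combinations, such as taking the minimum score or multiplying the scores, would be equally valid sketches under different assumptions.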