Ephesoft Advanced Capture solutions offer significant benefits for improving paper-based processes within organizations. This document provides a high-level overview of the capabilities of Ephesoft and the terminology used with this technology.
Ephesoft Guide to Forms Processing Terms
Forms Processing – Refers to the capture of information from documents through an automated process. Documents that enter an organization have little or no value until they have been identified and the metadata on them has been captured. Forms processing automates the scanning, reading and interpretation of the information on documents to help automate workflows and add value to an organization.
Scanning – Usually the first step of forms processing, in which paper documents are converted to images that can be read by the computer and routed electronically. Scanning can be done with low-speed or high-speed scanners depending on the user’s needs. For the best Optical Character Recognition results, scanning at 300 dots per inch is recommended.
PDF/PDF Plus Text – Refers to a standard format of an image file once scanned. PDF stands for Portable Document Format which is a document image standard created by Adobe Systems. A PDF Plus Text document contains an image of a document plus the associated text data which can be used for search and retrieval functions in order to locate the document or information on the document at a later date.
TIFF – Tagged Image File Format – Refers to a file type that holds an electronic image (or picture) of a document. A TIFF image can be used to route and store pictures of documents within a computer system. Many paper filing cabinets are converted to electronic format by converting their documents to TIFF files.
OCR – Optical Character Recognition – Refers to the capability of computers to read and interpret machine-printed characters on documents. For clean documents, about 95 out of 100 characters can be expected to be read correctly without human intervention. This number can be higher or lower depending on the quality of the original documents.
ICR – Intelligent Character Recognition – Similar to OCR, but relates to hand-printed information. In general, this technology works well only with forms designed for handprint recognition, which include structured boxes and drop-out form colors.
BCR – Bar Code Recognition – The ability to read and interpret the data from bar codes printed on paper documents. Several bar code types exist, including Code 39, Codabar, Interleaved 2 of 5, Code 93 and more.
OMR – Optical Mark Recognition – The ability to read boxes or circles with tick marks in them, as in a survey or questionnaire. The system looks for fill or no fill within the box to determine whether it is checked or not.
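The fill or no fill decision described above can be sketched as follows. This is a minimal illustration only, assuming the box has already been cropped from the page and binarized into rows of 0 (white) and 1 (dark) pixels. The function names and the 0.2 fill ratio are hypothetical values chosen for the example, not Ephesoft settings.

```python
def fill_ratio(box):
    """Return the fraction of dark pixels (1s) in a binarized box region."""
    total = sum(len(row) for row in box)
    if total == 0:
        return 0.0
    dark = sum(sum(row) for row in box)
    return dark / total


def is_checked(box, min_fill=0.2):
    """Treat the box as checked when enough of its pixels are dark.

    min_fill is a hypothetical tuning value used only for illustration.
    """
    return fill_ratio(box) >= min_fill


# Two toy 4x4 box regions: one empty, one with an X-shaped mark.
empty_box = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
ticked_box = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

print(is_checked(empty_box))
print(is_checked(ticked_box))
```

In practice the fill ratio would need tuning so that printed box outlines and scanner noise are not mistaken for a mark.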
Batch – A group of one or more documents processed at a time within the system.
Advanced Forms Processing – Generally refers to technologies for reading documents that go beyond simply reading a bar code or producing a PDF output file. These generally include free form (unstructured) data extraction and full page document classification.
Fixed Forms Processing – Refers to the original forms processing technology, which uses a fixed form template to OCR a document and extract metadata from it. The forms have to be identical and identified in advance, and the data has to be in the EXACT same location on each document for this technology to work. This technology has largely been replaced by unstructured extraction, which is much more efficient. Fixed forms extraction is still used in some rare cases.
Document Classification – The process whereby different document types are defined in order to be used for later processing.
Document Separation – The process of determining the total number of pages within each document and separating the pages of one document from those of the next. This is the same as determining where one document ends and the next one begins in a stream of images.
Image Classification – This technology is used to identify and separate document types in a stream of scanned pages. It uses fingerprint matching, that is, matching of the similar visual patterns of a particular form. It works well for standard predefined forms, but does not work for text-heavy documents since they have no form attributes to match against.
Full Page Text Classification – This technology is used to identify and separate document types in a stream of scanned pages. It can distinguish between document types to classify the documents and can also separate the documents even if they have a variable number of pages. It does this by reading the full text of each page and recognizing similarities between pages and documents much like a human would. The system can usually be trained by providing sample documents. The advantage of this technology is that bar code separator pages and bar code page identifiers are not needed to separate and classify documents.
Document Review – This is a process where the user is asked to review the classification results from the system. It is an exception process designed to review only those documents for which the system did not have enough confidence to determine the proper document type. The operator is shown only the exceptions and has the option to agree with the system’s choice or select another outcome. This process is significantly faster than performing the classification entirely manually.
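Finding where one document ends and the next begins can be sketched as below. This is an illustration under stated assumptions, not a description of how Ephesoft separates documents: each page is assumed to arrive already classified as a (document type, is-first-page) pair, and the function name and sample stream are hypothetical.

```python
def separate_documents(pages):
    """Group a stream of classified pages into documents.

    pages: list of (doc_type, is_first_page) tuples in scan order.
    A new document starts when a page is flagged as a first page
    or when the document type changes from the previous page.
    """
    documents = []
    for doc_type, is_first_page in pages:
        starts_new = (
            not documents
            or is_first_page
            or documents[-1]["type"] != doc_type
        )
        if starts_new:
            documents.append({"type": doc_type, "page_count": 0})
        documents[-1]["page_count"] += 1
    return documents


# Hypothetical stream: a 2-page invoice, a 1-page invoice, a 3-page application.
stream = [
    ("invoice", True),
    ("invoice", False),
    ("invoice", True),
    ("application", True),
    ("application", False),
    ("application", False),
]

for doc in separate_documents(stream):
    print(doc["type"], doc["page_count"])
```

The first-page flag is what allows two consecutive documents of the same type to be split; type changes alone would merge them.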
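The idea of comparing a page's full text against trained samples can be illustrated with a simple word-count similarity. This is a minimal sketch that assumes the page text has already been produced by OCR. It uses plain cosine similarity over word counts as a stand-in technique, not the classification method used by Ephesoft, and the sample texts and function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in a page of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_page(page_text, samples):
    """Return (best document type, similarity score) for a page.

    samples maps each document type to the text of a training sample.
    """
    page = word_counts(page_text)
    scores = {
        doc_type: cosine_similarity(page, word_counts(sample_text))
        for doc_type, sample_text in samples.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training samples, one per document type.
samples = {
    "Invoice": "invoice number invoice date total amount due remit payment",
    "Application": "application applicant name address employment history signature",
}

doc_type, score = classify_page(
    "Invoice Number 10482275 Total Amount Due 1250.00", samples
)
print(doc_type, round(score, 2))
```

A score of this kind is the sort of value that could be compared against a confidence threshold to decide whether a page goes to Document Review.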
Metadata – Refers to the information that is associated with the document images. Examples of metadata that can be associated with an image are document type, field names and data elements. For example, an image could be identified as an invoice and some metadata fields could include Invoice Date, Invoice Number, Total Amount, etc. The metadata would then be used to populate another system so that it can be used to automate a process.
Extraction – General term referring to the reading of data on a document image and pulling the data off to be used in another process. For example, an invoice image may have the fields of date, invoice number and total amount “extracted” and transferred to another system where they can be used to post the invoice information or to retrieve the invoice image later on if needed.
Free Form (unstructured) Data Extraction – Refers to the ability of a computer to read a full page of text and selectively extract data elements based on the characteristics and formats of the data surrounding them. For example, look anywhere on the page for the word “invoice” and, when you find that specific word, look to the right of it, pull off an 8-digit number and populate my database with it. These rules can be as simple or complex as needed to extract information from documents. The advantage of this technology is that the exact document type does not need to be known ahead of time, as the system is just looking for variable data on any page.
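The relationship between an image and its metadata can be pictured as a small record. The sketch below is purely illustrative: the class name, the file name and all field values are hypothetical and do not reflect an Ephesoft data format.

```python
from dataclasses import dataclass, field


@dataclass
class CapturedDocument:
    """A document image together with the metadata associated with it."""
    image_file: str                              # path to the scanned image
    document_type: str                           # result of classification
    fields: dict = field(default_factory=dict)   # field name -> extracted value


# Hypothetical invoice record with made-up values.
invoice = CapturedDocument(
    image_file="invoice_0001.tif",
    document_type="Invoice",
    fields={
        "Invoice Date": "2024-01-15",
        "Invoice Number": "10482275",
        "Total Amount": "1250.00",
    },
)

print(invoice.document_type, invoice.fields["Invoice Number"])
```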
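The “find the word invoice, then take the 8-digit number to its right” rule can be sketched with a regular expression. This is a minimal illustration that assumes the page text is already available from OCR. The pattern, the function name and the sample text are hypothetical and are not an Ephesoft extraction rule.

```python
import re

# The word "invoice", then any non-digit characters on the same line,
# then exactly eight digits (the trailing \b rejects longer numbers).
INVOICE_NUMBER = re.compile(r"\binvoice\b[^\d\n]*(\d{8})\b", re.IGNORECASE)


def extract_invoice_number(page_text):
    """Return the first 8-digit number found to the right of "invoice", or None."""
    match = INVOICE_NUMBER.search(page_text)
    return match.group(1) if match else None


# Made-up page text for illustration.
page_text = (
    "ACME Corp\n"
    "Invoice No: 10482275   Date: 01/15/2024\n"
    "Total: 1,250.00"
)

print(extract_invoice_number(page_text))
```

Because the rule keys on the surrounding word rather than on a page position, it keeps working when the number moves around from one layout to another.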
Document Validation – In this process, metadata that was automatically extracted from the document is manually reviewed by the operator. Data that was not read with confidence or that did not satisfy a table or field validation is presented to the operator to correct. This is an exception process only and is much quicker than entering the data in an entirely manual process.
Confidence Threshold – The setting to determine at what point an operator should review the results created by the automated computer system. For example, if a confidence level of 70% was set, the computer would pass all results that it was less than 70% confident on to the user to validate. This is an adjustable parameter which is used to fine tune the system and maximize throughput while reducing errors.
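The 70% example above can be sketched as a simple routing step. This is an illustration only: the function name, the result structure and the sample confidence values are hypothetical, and confidences are assumed to be expressed as fractions between 0 and 1.

```python
def route_results(results, threshold=0.70):
    """Split results into those passed automatically and those sent for review.

    Any result whose confidence is below the threshold goes to the operator.
    """
    accepted, needs_review = [], []
    for result in results:
        if result["confidence"] < threshold:
            needs_review.append(result)
        else:
            accepted.append(result)
    return accepted, needs_review


# Made-up extraction results for illustration.
results = [
    {"field": "Invoice Number", "value": "10482275", "confidence": 0.95},
    {"field": "Invoice Date", "value": "01/15/2024", "confidence": 0.70},
    {"field": "Total Amount", "value": "1,250.00", "confidence": 0.42},
]

accepted, needs_review = route_results(results)
print([r["field"] for r in accepted])
print([r["field"] for r in needs_review])
```

Raising the threshold sends more results to the operator and lowers the risk of errors; lowering it increases throughput at the cost of more unreviewed mistakes.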
Release – Generally refers to the final step of a forms processing system. Forms processing systems take in images and release images and data. The data that is released from the system can be passed to databases, workflow engines and other back-end systems. A release procedure is the formal process that takes the data and images from the forms processing system and places them into the repository or system that needs to store or act on the data.