Apache Spark and the Rise of Big Data

The idea of Big Data as we know it today was introduced in 2005, when Yahoo! built Hadoop, a distributed processing framework based on Google’s MapReduce programming model for dealing with large data sets. However, it was the 2010 introduction of Apache Spark as an open source project that really accelerated the ability to process extremely large data sets and gave rise to true Big Data solutions as we know them today.

Like MapReduce, Spark is an engine for processing data that can run in a distributed computing framework like Hadoop. But Spark can run much faster, partly because of its ability to work in memory, while MapReduce writes back to disk after each operation it performs.

MapReduce operates as an essentially linear sequence of steps. Kirk Borne, principal data scientist at Booz Allen Hamilton, describes it this way: “The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc.”

In contrast, Apache Spark enables programmers to develop more complex, multi-step data processing pipelines. Instead of executing strictly linearly, a Spark workflow forms a directed acyclic graph (DAG), which lets the engine schedule and optimize multiple processing steps together. Spark can also hold intermediate results in memory rather than writing them to disk, which is useful when running analysis on the same dataset multiple times, as you might want to do when analyzing a large document repository.
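As a toy sketch in plain Python (not Spark itself; the document set and the analyses are invented for illustration), this is the pattern that Spark’s in-memory caching enables: pay for an expensive pass over the raw data once, keep the result in memory, and then run several analyses against it without re-reading the source each time.

```python
# Toy illustration of in-memory reuse, the pattern Spark's cache()/persist()
# makes possible in a distributed cluster. Plain Python, invented data.

# A hypothetical "large" document repository.
documents = [
    "big data needs big tools",
    "spark keeps data in memory",
    "mapreduce writes data to disk",
]

# Step 1: one expensive pass to tokenize, done once and kept in memory.
tokens = [word for doc in documents for word in doc.split()]

# Step 2: several analyses reuse the same in-memory intermediate result,
# instead of re-reading and re-tokenizing the raw documents for each one
# (the MapReduce-style read / operate / write-back loop).
total_words = len(tokens)                              # 15
distinct_words = len(set(tokens))                      # 12
data_mentions = sum(1 for w in tokens if w == "data")  # 3
```

In real Spark, `tokens` would be a distributed dataset marked for caching, and each of the three analyses would be a separate action running against the cluster’s memory rather than against disk.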

The bottom line is that Spark can run up to 100 times faster in memory and 10 times faster on disk than MapReduce. It’s this type of functionality that enabled Ephesoft to build our Insight Big Documents platform. Insight is designed to utilize up to 16 characteristics of a document to automatically classify and extract information from it. And we are targeting Insight at repositories containing hundreds of thousands, millions, and even billions of pages. Insight is designed to provide real-time results to users searching through and analyzing the contents of these documents.

Apache Spark running on Hadoop is new to all of us at Ephesoft, our partners, and our customers. So, it was a very bold decision for us to pick this platform to build Insight on. Would it have been easier to develop Insight on a SQL database using data processing tools that were more familiar to us? Certainly, it would have been possible, but we wouldn’t have had everything we wanted. Easier is not always the right decision, especially when related to innovation. Sometimes in order to innovate, you need to do the right thing, not the easiest, which is what we feel we did with Insight.

We introduced Insight in 2015 and have started to bring it to market this year. The early feedback we’ve received, including a strategic investment from In-Q-Tel, which works closely with the U.S. intelligence community to identify innovative technology, leads us to believe we have taken the right approach by investing in development on the Spark platform. It not only meets our needs today, but offers plenty of power as we move forward with advanced document processing applications.