Hadoop is the de facto standard for implementing MapReduce applications. It is composed of one or more master nodes and any number of slave nodes. Hadoop simplifies distributed applications by adopting the view that “the data center is the computer,” and by providing map() and reduce() functions (defined by the programmer) that allow application developers to utilize those data centers. Hadoop implements the MapReduce paradigm efficiently and is quite simple to learn; it is a powerful tool for processing large amounts of data in the range of terabytes and petabytes. Both the Hadoop and Spark frameworks are open source and enable us to perform a huge volume of computations and data processing in distributed environments.
These frameworks enable scaling by providing a “scale-out” methodology. They can be set up to run intensive computations in the MapReduce paradigm on thousands of servers. Spark’s API is a higher-level abstraction than Hadoop’s API; for this reason, we are able to express Spark solutions in a single Java driver class.
Hadoop and Spark are two different distributed software frameworks. Hadoop is a MapReduce framework on which one may run jobs supporting the map(), combine(), and reduce() functions. The MapReduce paradigm works well for one-pass computation (first map(), then reduce()), but is inefficient for multipass algorithms. Spark is not a MapReduce framework, but it can easily support a MapReduce framework’s functionality; its API handles map() and reduce() functionality. Spark is not tied to a map phase followed by a reduce phase: a Spark job can be an arbitrary DAG (directed acyclic graph) of map and/or reduce/shuffle phases. Spark programs may run with or without Hadoop, and Spark may use HDFS (Hadoop Distributed File System) or other persistent storage for input/output. In a nutshell, for a given Spark program or job, the Spark engine creates a DAG of task stages to be performed on the cluster, while Hadoop/MapReduce creates a DAG with two predefined stages, map and reduce. DAGs created by Spark can contain any number of stages. This allows most Spark jobs to complete faster than they would in Hadoop/MapReduce, with simple jobs completing after just one stage and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs. As mentioned, Spark’s API is a higher-level abstraction than MapReduce/Hadoop. For example, a few lines of code in Spark might be equivalent to 30–40 lines of code in MapReduce/Hadoop.
Even though frameworks such as Hadoop and Spark are built on a “shared-nothing” paradigm, they do support sharing immutable data structures among all cluster nodes. In Hadoop, you may pass these values to mappers and reducers via Hadoop’s Configuration object; in Spark, you may share data structures among mappers and reducers by using Broadcast objects. In addition to read-only Broadcast objects, Spark supports accumulators, which are write-only from the point of view of tasks (only the driver program reads their values).
Hadoop and Spark provide the following benefits for big data processing:
- Hadoop and Spark are fault-tolerant (any node can go down without losing the result of the desired computation).
- Hadoop and Spark support large clusters of servers.
- In Spark and Hadoop, input data and processing are distributed (they support big data from the ground up).
- Computations are executed on a cluster of nodes in parallel.
- Hadoop is designed mainly for batch processing, while with enough memory (RAM), Spark may be used for near real-time processing.
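The shared read-only values and add-only accumulators described before this list can be illustrated with a small single-process sketch. This is plain Python, not the Spark or Hadoop API: the `Accumulator` class, the `country_names` table, and the sample records are all hypothetical names chosen for illustration. In a real cluster, the framework would ship the read-only table to every node and merge the accumulator updates back at the driver.

```python
from types import MappingProxyType


class Accumulator:
    """Add-only counter: tasks call add(); only the driver reads value."""

    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


# Read-only lookup table shared with every task (hypothetical sample data).
# MappingProxyType rejects item assignment, mimicking an immutable broadcast.
country_names = MappingProxyType({"US": "United States", "CA": "Canada"})
unknown_codes = Accumulator()


def map_record(record):
    """Map (user, country_code) to (user, country_name) via the shared table."""
    user, code = record
    name = country_names.get(code)
    if name is None:
        unknown_codes.add(1)  # side-channel count, not part of the map output
        name = "unknown"
    return (user, name)


records = [("alice", "US"), ("bob", "CA"), ("carol", "XX")]
mapped = [map_record(r) for r in records]
```

After the map step, `unknown_codes.value` reports how many records had an unrecognized code, without that count having to travel through the key-value output.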
So what are the core components of MapReduce/Hadoop?
- Input/output data consists of key-value pairs. Typically, keys are integers, longs, and strings, while values can be almost any data type (string, integer, long, sentence, special-format data, etc.).
- Data is partitioned over commodity nodes, filling racks in a data center.
- The software handles failures, restarts, and other interruptions. Known as fault tolerance, this is an important feature of Hadoop.
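As a rough illustration of how key-value pairs might be spread over nodes, the sketch below assigns each pair to a partition by hashing its key. This is an assumption made for illustration (hash partitioning by key), not a description of how HDFS places data blocks; the function names and sample pairs are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Return a stable partition index for a key."""
    # crc32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Group (key, value) pairs by the partition their key hashes to."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


pairs = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]
partitions = partition_pairs(pairs, num_partitions=3)
```

Because the partition depends only on the key, every pair with the same key lands in the same partition, which is what lets a single reducer see all values for that key.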
Hadoop and Spark provide more than map() and reduce() functionality: they provide a plug-in model for custom record reading, secondary data sorting, and much more.
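To make the idea of secondary data sorting concrete, here is a minimal single-process sketch: records are sorted on a composite key (natural key plus a secondary field) and then grouped on the natural key alone, so each group's values arrive already ordered. The sensor records and the `secondary_sort` helper are hypothetical; a real job would rely on the framework's own sort and grouping hooks rather than an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (sensor_id, timestamp, reading).
records = [
    ("s2", 3, 20.5),
    ("s1", 2, 11.0),
    ("s1", 1, 10.0),
    ("s2", 1, 19.0),
]


def secondary_sort(records):
    """Group readings by sensor, with each group ordered by timestamp."""
    # Sort on the composite key (sensor_id, timestamp); in a real job the
    # framework's shuffle sort would do the equivalent work.
    ordered = sorted(records, key=itemgetter(0, 1))
    # Group on the natural key only, so each "reducer call" sees one sensor
    # with its values already in timestamp order.
    return {
        sensor: [(ts, reading) for _, ts, reading in group]
        for sensor, group in groupby(ordered, key=itemgetter(0))
    }


grouped = secondary_sort(records)
```

The reducer logic itself never sorts anything: ordering comes entirely from the composite key, which is the point of the technique.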
A high-level view of the relationship between Spark, YARN, and Hadoop’s HDFS is illustrated in the figure below.
This relationship shows that there are many ways to run MapReduce and Spark using HDFS (and non-HDFS file systems). Here, the following keywords and terminology will be used:
- MapReduce refers to the general MapReduce framework paradigm.
- MapReduce/Hadoop refers to a specific implementation of the MapReduce framework using Hadoop.
- Spark refers to a specific implementation of Spark as a compute engine, using HDFS or another medium as persistent storage:
— Spark can run without Hadoop using standalone cluster mode (which may use HDFS, NFS, or another medium as a persistent data store).
— Spark can run with Hadoop using Hadoop’s YARN or MapReduce framework.
MapReduce/Hadoop has become the programming model of choice for processing large data sets. MapReduce can be used for any application that does not require tightly coupled parallel processing. Keep in mind that Hadoop is designed for MapReduce batch processing and is not an ideal solution for real-time processing. Do not expect to get your answers from Hadoop in 2 to 5 seconds; the smallest jobs might take 20+ seconds. Spark is a top-level Apache project that is well suited for near real-time processing, and it performs better with more RAM. With Spark, it is possible to run a job (such as biomarker analysis or Cox regression) that processes 200 million records in 25 to 35 seconds using a cluster of just 100 nodes. Typically, Hadoop jobs have a latency of 15 to 20 seconds, but this depends on the size and configuration of the Hadoop cluster.
An implementation of MapReduce (such as Hadoop) runs on a large cluster of commodity machines and is highly scalable. For example, a typical MapReduce computation processes many terabytes or petabytes of data on hundreds or thousands of machines. Programmers find MapReduce easy to use because it hides the messy details of parallelization, fault tolerance, data distribution, and load balancing, letting programmers focus on writing the two key functions, map() and reduce().
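The division of labor described here can be sketched in a few lines of plain Python. The programmer supplies only `map_fn` and `reduce_fn`; the hypothetical `run_job` helper stands in for everything the framework does (here, just an in-memory shuffle in a single process, with no parallelism or fault tolerance). The word-count task and the sample lines are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map(): emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_fn(word, counts):
    """reduce(): sum all counts collected for one word."""
    yield (word, sum(counts))


def run_job(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, then shuffle, then reduce."""
    # Shuffle: group every mapped value under its output key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce: call reduce_fn once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Input records are (line_number, text) key-value pairs.
lines = [(0, "the data center is the computer"), (1, "the computer scales out")]
word_counts = run_job(lines, map_fn, reduce_fn)
```

For the two sample lines, `word_counts` contains `("the", 3)` and `("computer", 2)`, with every other word counted once.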
The following are some of the major applications of MapReduce/Hadoop/Spark:
- Query log processing
- Crawling, indexing, and search
- Analytics, text processing, and sentiment analysis
- Machine learning (such as Markov chains and the Naive Bayes classifier)
- Recommendation systems
- Document clustering and classification
- Bioinformatics (alignment, recalibration, germline ingestion, and DNA/RNA sequencing)
- Genome analysis (biomarker analysis, and regression algorithms such as linear and Cox)