Apache Hadoop Class: Architecture
A quick overview of Hadoop’s architecture.
With the growth of data in our world and the birth of the buzzword BIG DATA, people started to ask themselves: how much can the hardware of a single computer handle when faced with the challenge of storing and processing such amounts of data?
The naive proposition was the use of super-computers, but even that involves quite a few challenges like:
- No general-purpose framework for parallel computing existed, the way general-purpose operating systems exist for single machines.
- High initial hardware cost, plus software maintenance that had to be taken care of.
- Not simple to scale horizontally.
So, Hadoop came to the rescue!
Here, we will cover all of the basic concepts of Hadoop architecture and its ecosystem.
At the end of the article, you will find a dictionary of the terms needed to understand Hadoop (these terms are marked as such in the text).
What is Hadoop?
According to Wikipedia: “Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.”
In other words, Apache Hadoop offers a scalable, flexible, and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power by leveraging commodity hardware.
Hadoop follows a Master-Slave Architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm which will be discussed later.
Master-Slave Architecture
Hadoop’s master-slave architecture can be separated into two layers: distributed data storage (the HDFS layer) and distributed data processing (the MapReduce layer).
HDFS Layer:
According to IBM: “HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.”
To understand this better, let’s assume we have a file of 200 MB. On a normal drive, the file would be saved in one piece. In Hadoop it’s different: the file is split into 4 blocks, each with a size of 64 MB (the block size can be configured) except the last one (8 MB). Each block is treated as a file stored on a different machine (Data Node).
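The split described above can be sketched in a few lines. This is a toy calculation, not Hadoop code; the 64 MB block size matches the example in the text:

```python
# Toy sketch of how HDFS splits a file into fixed-size blocks.
# All sizes are in MB; 64 MB is the block size used in the example above.

def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list[int]:
    """Return the sizes (in MB) of the blocks a file is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # every block is full-size except possibly the last one
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(200))  # [64, 64, 64, 8] -> 4 blocks, the last only 8 MB
```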
What about the failure of machines (Data Nodes)?
Before Hadoop 3.x, to avoid loss of data, copies of the data blocks are stored on different machines as well (which should be on different racks). This is called data replication. The replication factor property determines the number of replicas of each block (the default replication factor is 3). Hadoop 3.x introduces a new way to deal with failures called erasure coding, which replaces replication and saves up to 50% of the extra storage space in the cluster (Hadoop 3.x still supports the old method).
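The storage savings are easy to see with a back-of-the-envelope comparison: 3x replication versus a Reed-Solomon RS(6,3) erasure coding policy (the 6-data/3-parity split here is an illustrative example of such a policy):

```python
# Back-of-the-envelope storage math: replication vs. erasure coding.

def replication_storage(data_mb: float, replicas: int = 3) -> float:
    # every block is stored `replicas` times -> 3x means 200% extra space
    return data_mb * replicas

def erasure_coded_storage(data_mb: float, data_units: int = 6,
                          parity_units: int = 3) -> float:
    # RS(6,3): every 6 data units get 3 parity units -> only 50% extra space
    return data_mb * (data_units + parity_units) / data_units

print(replication_storage(200))     # 600 MB on disk for 200 MB of data
print(erasure_coded_storage(200))   # 300.0 MB on disk for the same data
```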
In the HDFS architecture there are two kinds of services: Master and Slave. Master services can communicate with each other, and in the same way, Slave services can communicate with each other.
Data Nodes: are the slave nodes in Hadoop HDFS. Data Nodes are inexpensive commodity hardware. They store blocks of a file.
Functions of HDFS Data Node:
- It is responsible for serving the client read/write requests.
- It performs block creation, replication, and deletion, based on the instruction from the Name Node.
- It sends a heartbeat to the Name Node to report the health of HDFS.
- It sends block reports to the Name Node to report the list of blocks it contains.
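The last two functions, heartbeats and block reports, can be modeled with a toy class. This is an illustration of the two messages, not real Hadoop code, and all names are made up:

```python
# Toy model of a Data Node's two periodic messages to the Name Node.

class DataNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.blocks: set[str] = set()  # IDs of blocks stored on this node

    def heartbeat(self) -> dict:
        # "I am alive" plus basic health status
        return {"node": self.node_id, "status": "alive"}

    def block_report(self) -> dict:
        # the full list of blocks this node currently holds
        return {"node": self.node_id, "blocks": sorted(self.blocks)}

dn = DataNode("dn-01")
dn.blocks.update({"blk_1", "blk_2"})
print(dn.heartbeat())     # {'node': 'dn-01', 'status': 'alive'}
print(dn.block_report())  # {'node': 'dn-01', 'blocks': ['blk_1', 'blk_2']}
```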
Plus, there are two services in Master mode:
Name Node: HDFS has only one Name Node, which is sometimes (by mistake, or not) called the master node. The Name Node is the centerpiece of the file system for all other nodes.
Functions of HDFS Name Node:
- It executes the file system namespace operations like opening, renaming, and closing files and directories.
- It manages and maintains the file system and has the metadata of all of the stored data within it.
- It determines the mapping of the blocks of a file to Data Nodes and keeps the location of each block of a file (when a client asks for data, the Name Node knows where to find it).
- It records each change made to the file system namespace.
- It takes care of the replication factor of all the blocks.
- The Name Node receives heartbeats and block reports from all Data Nodes, which ensure the Data Nodes are alive. If a Data Node fails, the Name Node chooses new Data Nodes for new replicas.
- It is responsible for receiving the client read/write requests. Note that breaking the original file into blocks happens on the client machine, not on the Name Node; likewise, the client machine writes the blocks directly to the Data Nodes once the Name Node provides the details about them.
- The Name Node machine can also run Data Node functionality.
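The block-to-Data-Node mapping at the heart of the Name Node can be pictured as a nested dictionary. This is only a conceptual illustration (the paths, block IDs, and node names are invented), not the real metadata format:

```python
# Toy picture of the Name Node's core metadata: which Data Nodes
# hold each block of each file (replication factor 3 in this example).

block_map = {
    "/data/file.txt": {
        "blk_1": ["dn-01", "dn-03", "dn-05"],
        "blk_2": ["dn-02", "dn-04", "dn-06"],
    }
}

def locate(path: str, block: str) -> list[str]:
    """What a client learns from the Name Node before reading a block."""
    return block_map[path][block]

# The client then contacts these Data Nodes directly to read blk_1:
print(locate("/data/file.txt", "blk_1"))  # ['dn-01', 'dn-03', 'dn-05']
```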
Secondary Name Node: Housekeeping and backup of the Name Node's metadata. It routinely connects to the Name Node (every hour) and pulls its metadata. It is not a hot standby: it is a helper to the primary Name Node, not a replacement for it. The Name Node and the Secondary Name Node should sit on different racks.
Note that before Hadoop 2 the High Availability feature did not exist, and the Name Node was a single point of failure. In Hadoop 2 there is a standby Name Node which switches to Active and connects to the Data Nodes when the original Name Node goes down (thanks to the replication of edits to a quorum of three Journal Nodes, which are not shown in this figure). ZooKeeper is the tool that detects that the Active Name Node is down and has to be switched.
Today, Hadoop 3.x allows three Name Nodes to run at the same time in the cluster in a hot standby configuration (but only one of them is active; the rest are passive nodes).
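The failover idea can be sketched as a tiny state machine. This is a conceptual toy (node names and the promotion rule are invented), not how Hadoop's ZooKeeper-based failover is implemented:

```python
# Toy model of active/standby Name Node failover: one active node,
# several standbys; when the active fails, a standby is promoted.

name_nodes = {"nn-1": "active", "nn-2": "standby", "nn-3": "standby"}

def failover(failed: str) -> str:
    """Mark the failed node as down and promote the first standby."""
    name_nodes[failed] = "down"
    promoted = next(n for n, s in name_nodes.items() if s == "standby")
    name_nodes[promoted] = "active"
    return promoted

print(failover("nn-1"))  # nn-2 becomes the new active Name Node
```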
HDFS sounds like a pretty good tool, but it has some limitations:
- It is not good for low-latency reads.
- It is not good for a large number of small files.
MapReduce Layer:
According to Wikipedia: “MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.”
To make that clear, consider the classic “Word Count” problem: we are asked to count the frequency of words in a file. That sounds pretty easy for a file of limited size, but what about huge files? This is where MapReduce comes into the picture. In MapReduce there are two phases: Map and Reduce.
The Map phase expects key/value pairs as input and produces its output as key/value pairs. In our case, in the first stage, a record reader reads every single line of the file. Each line’s content is then treated as a value with an ignored key (this is the mapper input). In the second stage, the mapper extracts each word from the line and emits the word as a key with a value of 1 (this collection of pairs is the mapper output).
Next, an intermediate phase called Sort takes the mapper output and sorts it in ascending order of keys (lexicographic ordering).
Now comes the Reduce phase (which has three sub-phases: merge, shuffle, and reducer). The sorted outputs of several mappers are merged into a single file on the Reduce machine. Then the shuffle phase aggregates duplicate keys from the input.
Finally, the reducer accepts the input from the shuffle phase and produces output in key/value pairs based on what it is programmed to do as per the problem statement.
One more thing to remember: in MapReduce, the input file path and the output file path are always in HDFS (meaning everything happens on top of HDFS).
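The Word Count flow described above can be simulated in memory in a few lines. Real MapReduce runs these phases distributed across machines on HDFS data; the logic, however, is the same:

```python
# In-memory simulation of Word Count: map -> sort -> shuffle -> reduce.
from itertools import groupby

lines = ["big data is big", "hadoop handles big data"]

# Map: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Sort: order the pairs by key (lexicographic ordering)
mapped.sort(key=lambda kv: kv[0])

# Shuffle + Reduce: group duplicate keys, then sum each group's values
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'handles': 1, 'is': 1}
```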
In Hadoop 1.x, MapReduce was responsible for resource management as well as data processing. It could only execute applications that followed the MapReduce model.
As shown in these figures, the Master Node contains the Job Tracker as well as the Name Node, and every Slave Node contains a Task Tracker (with Map and Reduce tasks) as well as a Data Node.
Job Tracker: It receives the requests for MapReduce execution from the client. The Job Tracker talks to the Name Node to learn the location of the data that will be used in processing, and the Name Node responds with the metadata of the required data.
Task Tracker: It takes the task and the code from the Job Tracker and applies the code to the file. The process of applying that code to the file is known as the Mapper. The Task Trackers work together.
But this architecture changed over the years: Hadoop 2.x wanted the ability to execute applications that did not necessarily follow the MapReduce model, and to be more generic in its execution model. That is why YARN came into the picture in Hadoop 2.x.
YARN Layer:
YARN, which stands for Yet Another Resource Negotiator, provides all of the above and even more. In the higher versions of Hadoop, YARN is responsible for the resource management part (see the figure below). In addition, YARN is isolated (YARN and HDFS are separate systems; YARN does not fetch any metadata from HDFS), scalable, and generic. In YARN there is no single point of failure, because it has multiple masters: if one fails, another master picks up and resumes the execution.
As can be seen, YARN uses the Name Node, Secondary Name Node, and Data Nodes like the old MapReduce layer (Hadoop 1.x), and introduces two new concepts, Resource Manager and Node Manager, instead of the Job Tracker and Task Tracker.
The essential idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into two concepts. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM).
Resource Manager: the ultimate authority that manages resources among all the applications in the system. It has two components: the Scheduler and the Applications Manager. The Scheduler is responsible for allocating resources to all running applications. The Applications Manager is responsible for accepting job submissions and taking care of the execution of the first container for each application's Application Master. In practice, when a client runs an application it adds the resource container requirements, and YARN does its best to grant a container on a suitable host.
Node Manager: responsible for a single machine's containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the Scheduler in the Resource Manager.
Application Master: According to Cloudera — “The Application Master is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating appropriate resource containers from the Resource Manager, tracking their status and monitoring progress.”
Dictionary
- Commodity Hardware: inexpensive, off-the-shelf computers that can be used to build a cluster.
- Cluster/ Grid: Interconnection of systems in a network.
- Node: A single instance of a computer (which only needs to provide its storage and processing power).
- Distributed System: A system composed of multiple autonomous computers that communicate through a computer network.
- High Availability (HA): a system that is available at any given moment is called a highly available system.
- Single Point Of Failure (SPOF): a part of a system that, if it fails, stops the entire system from working.
- Hot Standby: Uninterrupted failover. Hot standby is a redundant method in which one system runs simultaneously with an identical primary system. Upon failure of the primary system, the hot standby system immediately takes over, replacing the primary system.
Read More:
- What’s new in Hadoop 3: https://www.edureka.co/blog/hadoop-3/
- Hadoop High Availability & NameNode High: https://data-flair.training/blogs/hadoop-high-availability-tutorial/
- Master Nodes in Hadoop Cluster: https://www.dummies.com/programming/big-data/hadoop/master-nodes-in-hadoop-clusters/