Hadoop Tutorial


Hadoop Tutorial

In this Article, we will learn what is Hadoop, What is the need of Hadoop, Why Hadoop is most popular, How Hadoop works, Hadoop Architecture, data flow, Hadoop daemons, different flavors, an introduction of Hadoop components like HDFS, MapReduce, Yarn, etc.

First, lets start with Hadoop Introduction


Hadoop is an open source tool from the ASF – Apache Software Foundation. Open source project means it is freely available and even its source code can be changed as per the requirements. If certain functionality does not fulfill your requirement, you can change it according to your need. Most of Hadoop code is written by Yahoo, IBM, Facebook, Cloudera.

It provides an efficient framework for running jobs on multiple nodes of clusters.  Cluster means a group of systems connected via LAN.  Hadoop provides parallel processing of data as it works on multiple machines simultaneously.

It is inspired by Google, which has written a paper about the technologies it is using like Map-Reduce programming model as well as its file system (GFS). Hadoop was originally written for the Nutch search engine project when Doug cutting and his team were working on it but very soon, it became a top-level project due to its huge popularity.

Hadoop is an open source framework which is written in Java. But this does not mean you can code only in Java. You can code in C, C++, Perl, python, ruby etc. You can code in any language but it is recommended to code in java as you will have lower level control of the code.

It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is developed for processing of huge volume of data. Commodity hardware is the low-end hardware, they are cheap devices which are very economic. So Hadoop is very economic.

Hadoop can be setup on a single machine (pseudo-distributed mode), but the real power of Hadoop comes with a cluster of machines, it can be scaled to thousand nodes on the fly ie, without any downtime. We need not make any system down to add more systems in the cluster. To learn installation of Hadoop on a multi-node cluster, follow this installation guide.

  Why Hadoop?

Let us now understand why Hadoop is very popular, why Hadoop has captured more than 90% of big data market.

Hadoop is not only a storage system but is a platform for data storage as well as processing. It is scalable (more nodes can be added on the fly), Fault tolerant (Even if nodes go down, data can be processed by another node) and Open source (can modify the source code if required).

Following characteristics of Hadoop make is a unique platform:

  • Flexibility to store and mine any type of data whether it is structured, semi-structured or unstructured. It is not bounded by a single schema.
  • Excels at processing data of complex nature, its scale-out architecture divides workloads across multiple nodes. Another added advantage is that its flexible file-system eliminates ETL bottlenecks.
  • Scales economically, as discussed it can be deployed on commodity hardware. Apart from this its open-source nature guards against vendor lock.

Hadoop Architecture

Hadoop Introduction-hadoop architecture data flow tutorial

After understanding what Hadoop is, let us now see Hadoop architecture.

Hadoop works in master – slave fashion. There is a master node and there are n numbers of slave nodes where n can be 1000s. Master manages, maintains and monitors the slaves while slaves are the actual worker nodes. Master should be deployed on good configuration hardware and not just any commodity hardware as it is the centerpiece of Hadoop cluster.

Master just stores the meta-data (data about data) while slaves are the nodes which store the data. Data is stored distributedly in the cluster. The client connects with master node to perform any task. Read the Complete Article

Tags (2)