In this Hadoop tutorial, we discuss HDFS (Hadoop Distributed File System), a highly reliable storage system. HDFS is Hadoop's storage layer, and it provides high availability, reliability, and fault tolerance. It has been anticipated that 75% of the world's data would be stored in Hadoop HDFS by the end of 2017. This introductory guide gives an overview of what HDFS is and covers the basics: HDFS nodes, HDFS daemons, and so on.
Apache Hadoop HDFS is a distributed file system that provides redundant storage for very large files, in the range of terabytes and petabytes. HDFS stores data reliably. Each file is broken into blocks, and the blocks are distributed across the nodes in a cluster. Each block is then replicated, meaning that copies of it are created on different machines. If a machine goes down or crashes, the data can still be retrieved from the other machines that hold replicas. By default, 3 replicas of each block are kept on different machines, which makes HDFS highly fault-tolerant. Because the data is spread across many nodes, reads and writes can be served by several machines at once, which speeds up file access, and a user can reach the data from any machine in the cluster. For these reasons HDFS is widely used around the world as a platform for storing huge volumes and many varieties of data.
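The splitting and replication described above can be sketched in a few lines of Python. This is only an illustrative model, not HDFS code: the 128 MB block size is an assumed value (block size is configurable), the function names are made up for this example, and the round-robin placement is a simplification, since a real cluster uses its own placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in a real cluster)
REPLICATION = 3                 # default number of replicas per block, as stated in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file; the last block may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Return the blocks that still have a replica on some node other than the failed one."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
    placement = place_replicas(len(sizes), ["node1", "node2", "node3", "node4"])
    print(sizes)
    print(placement)
    print(readable_blocks(placement, failed_node="node2"))
```

With three replicas on distinct nodes, losing any single node still leaves every block readable, which is the fault-tolerance property the paragraph describes.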
Before working with HDFS, you must have Hadoop installed and running. To install and configure Hadoop, follow this Installation Guide.
HDFS has a master/slave architecture. There are two types of nodes in HDFS: the master (which runs the NameNode daemon) and the slaves (which run DataNode daemons). The master node maintains the data storage and processing management services of a distributed Hadoop cluster. The actual data in HDFS is stored on the slave nodes, and the data is also processed there.
The master is the centerpiece of HDFS. It stores the metadata of HDFS: all the information about the files stored in the file system, including where across the cluster each file's data is kept. For every file in HDFS, the master holds the details of its blocks and their locations. With this information, the master knows how a file is to be reconstructed from its blocks. The master is the most critical part of HDFS: if all the masters crash or go down, the HDFS cluster is considered down as well and becomes unusable.
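To make the master's role concrete, here is a minimal Python sketch of the kind of metadata it keeps. The `Master` class, its method names, and the block IDs are hypothetical and exist only for illustration; the real master stores far more (permissions, timestamps, and so on) and persists its metadata to disk.

```python
class Master:
    """Toy stand-in for the HDFS master: it holds metadata only, never file data."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of slave names holding a replica

    def add_file(self, path, block_ids):
        """Record which blocks make up a file, in order."""
        self.file_blocks[path] = list(block_ids)

    def add_replica(self, block_id, slave):
        """Record that a slave holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(slave)

    def locate(self, path):
        """Return (block ID, slaves holding it) pairs in file order."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


if __name__ == "__main__":
    master = Master()
    master.add_file("/logs/app.log", ["blk_1", "blk_2"])
    for block_id, slave in [("blk_1", "slave1"), ("blk_1", "slave2"), ("blk_2", "slave3")]:
        master.add_replica(block_id, slave)
    # A client would read the blocks in this order from any listed slave.
    print(master.locate("/logs/app.log"))
```

The ordered block list is what lets a file be reassembled, and the location sets are what tell a client which machines to contact. Losing this metadata means the blocks on the slaves can no longer be interpreted as files, which is why the master is so critical.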
The actual files, that is, the client's data, reside on the slaves. The most important function of a slave is to manage the storage attached to the node on which it runs. As described earlier, files in HDFS are broken into smaller blocks, and these blocks are distributed across the nodes in the cluster. The slaves within the cluster manage these file blocks. So that file system operations can be carried out, each slave sends the master information about the blocks it holds. An HDFS cluster has more than one slave, and the replicas of each block are created across them.
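The reporting relationship between slaves and master can be sketched as follows. The `Slave` class and the `build_block_map` function are hypothetical names used only for this example; they model the idea that each slave tells the master which blocks it holds, and that the master derives block locations from those reports.

```python
from collections import defaultdict


class Slave:
    """Toy stand-in for a slave node: it stores block data on its local storage."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> block contents (bytes)

    def store(self, block_id, data):
        """Keep a replica of a block on this slave."""
        self.blocks[block_id] = data

    def block_report(self):
        """List the IDs of all blocks this slave currently holds."""
        return sorted(self.blocks)


def build_block_map(slaves):
    """Master-side view: combine the slaves' reports into block ID -> slave names."""
    block_map = defaultdict(list)
    for slave in slaves:
        for block_id in slave.block_report():
            block_map[block_id].append(slave.name)
    return dict(block_map)


if __name__ == "__main__":
    s1, s2 = Slave("slave1"), Slave("slave2")
    s1.store("blk_1", b"first block")
    s1.store("blk_2", b"second block")
    s2.store("blk_1", b"first block")  # second replica of blk_1
    print(build_block_map([s1, s2]))
```

In this model the master never sees block contents, only the reports, which mirrors the division of labor described above: slaves hold data, the master holds the map.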
To learn the internals of the HDFS data read operation, follow this tutorial, which explains how data flows in HDFS while a file is being read.