Hadoop Heartbeat and Data Block Rebalancing


HDFS Data Storage Reliability

  • The primary objective of HDFS is to store data reliably, even in the presence of failures such as Name Node failures, data node failures, and network partitions.
  • Detection is the first step HDFS takes to overcome failures: HDFS uses heartbeat messages to detect loss of connectivity between Name Nodes and data nodes.

Hadoop Heartbeat

  • Several things can cause loss of connectivity between Name Nodes and data nodes. Each data node therefore sends periodic heartbeat messages to its Name Node, so that the Name Node can detect loss of connectivity when the messages stop arriving.
  • The Name Node marks data nodes that do not respond to heartbeats as dead and refrains from sending further requests to them.
  • Data stored on a dead data node is no longer available to an HDFS client from that node, which is effectively removed from the system.
  • If the death of a node causes the replication factor of data blocks to drop below their minimum value, the Name Node initiates additional replication to restore the normal state.
Figure: The HDFS heartbeat process
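
The heartbeat logic described above can be illustrated with a minimal Python sketch. It is not HDFS code: the class name NameNodeSketch, the 30-second timeout, and the node and block names are all hypothetical, chosen only to show how missed heartbeats lead to a dead-node decision and then to a list of blocks that need extra replicas.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeSketch:
    """Toy model of heartbeat tracking (hypothetical, not the HDFS API)."""
    timeout: float        # seconds of silence before a node counts as dead (assumed value)
    min_replicas: int     # minimum number of live replicas per block
    last_seen: dict = field(default_factory=dict)  # node -> time of last heartbeat
    blocks: dict = field(default_factory=dict)     # block -> set of nodes holding it

    def heartbeat(self, node, now):
        # Record that a heartbeat from this data node arrived at time `now`.
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node is dead if no heartbeat arrived within the timeout window.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

    def under_replicated(self, now):
        # Map each block to the number of replicas it is missing once
        # replicas on dead nodes are ignored.
        dead = self.dead_nodes(now)
        missing = {}
        for block, holders in self.blocks.items():
            live = holders - dead
            if len(live) < self.min_replicas:
                missing[block] = self.min_replicas - len(live)
        return missing


nn = NameNodeSketch(timeout=30.0, min_replicas=2)
for node in ("dn1", "dn2", "dn3"):
    nn.heartbeat(node, now=0.0)
nn.blocks = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}

# dn2 goes silent; dn1 and dn3 keep sending heartbeats.
nn.heartbeat("dn1", now=40.0)
nn.heartbeat("dn3", now=40.0)

print(nn.dead_nodes(now=45.0))        # {'dn2'}
print(nn.under_replicated(now=45.0))  # {'blk_1': 1, 'blk_2': 1}
```

In this sketch the Name Node side only records timestamps; the decision that a node is dead is made lazily when the state is queried, which keeps the example short. A real system would also need to schedule the additional replication itself.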

Data Block Rebalancing:

HDFS data blocks might not always be placed uniformly across data nodes, which means that the space on one or more data nodes can be underutilized.
HDFS supports rebalancing data blocks using various models:
  1. One model might move data blocks from one data node to another automatically if the free space on a data node falls too low.
  2. Another model might dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file occurs.
  3. HDFS also provides the hadoop balancer command for manual rebalancing tasks. A common reason to rebalance is the addition of new data nodes to a cluster.
When placing new blocks, Name Nodes consider various parameters before choosing the data nodes to receive them. Some of the considerations are:
  1. Block-replica writing policies
  2. Prevention of data loss due to installation or rack failure
  3. Reduction of cross-installation network I/O
  4. Uniform data spread across data nodes in a cluster
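
As an illustration of how these considerations can interact, the sketch below picks three data nodes for a new block. It is a simplified, hypothetical model rather than the HDFS implementation: it keeps the first replica on the writer's node, puts the second on a different rack to survive a rack failure, and keeps the third on that same remote rack to limit cross-rack traffic, preferring nodes with the most free space. The function name, node names, and free-space figures are assumptions, and the sketch assumes at least two racks with two nodes on the remote rack.

```python
def choose_targets(writer, racks, free):
    """Pick three data nodes for a new block (hypothetical sketch).

    racks: dict mapping node -> rack name
    free:  dict mapping node -> free space (any consistent unit)
    """
    # First replica: the node where the writer runs (no network transfer).
    first = writer

    # Second replica: the emptiest node on a different rack,
    # so that a single rack failure cannot destroy every copy.
    remote = [n for n in racks if racks[n] != racks[first]]
    second = max(remote, key=lambda n: free[n])

    # Third replica: another node on the same rack as the second,
    # which avoids one more cross-rack transfer.
    same_rack = [n for n in racks if racks[n] == racks[second] and n != second]
    third = max(same_rack, key=lambda n: free[n])

    return [first, second, third]


racks = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackB"}
free = {"dn1": 40, "dn2": 70, "dn3": 20, "dn4": 60}

print(choose_targets("dn1", racks, free))  # ['dn1', 'dn4', 'dn3']
```

Choosing the emptiest candidate at each step is one simple way to push toward a uniform spread; other tie-breaking rules, such as random selection among eligible nodes, would serve the same purpose.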
The cluster-rebalancing feature of HDFS is just one mechanism it uses to sustain the integrity of its data.
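
The idea behind rebalancing after a new data node joins can be sketched as follows. This is a hedged illustration, not the balancer's actual algorithm: the function plan_moves, the 10% threshold, and the capacity figures are assumptions. A node counts as over-utilized when its utilization exceeds the cluster average by more than the threshold, and as under-utilized when it falls below the average by more than the threshold; data is then moved greedily from the former to the latter.

```python
def plan_moves(used, capacity, threshold=0.10):
    """Plan (source, target, amount) moves toward the cluster average.

    used, capacity: dicts mapping node -> space used / total space.
    threshold: allowed deviation from average utilization (assumed 10%).
    """
    avg = sum(used.values()) / sum(capacity.values())
    used = dict(used)  # work on a copy so the caller's data is untouched

    over = [n for n in used if used[n] / capacity[n] > avg + threshold]
    under = [n for n in used if used[n] / capacity[n] < avg - threshold]

    moves = []
    for src in over:
        for dst in under:
            # Surplus above the average on the source, room below it on the target.
            surplus = used[src] - int(avg * capacity[src])
            room = int(avg * capacity[dst]) - used[dst]
            amount = min(surplus, room)
            if amount > 0:
                used[src] -= amount
                used[dst] += amount
                moves.append((src, dst, amount))
    return moves


# A new, empty data node has just been added to a two-node cluster.
used = {"dn1": 80, "dn2": 70, "dn_new": 0}
capacity = {"dn1": 100, "dn2": 100, "dn_new": 100}

print(plan_moves(used, capacity))
# [('dn1', 'dn_new', 30), ('dn2', 'dn_new', 20)]
```

After the planned moves every node sits at the cluster average of 50% in this example. The sketch counts abstract units of space; a real balancer moves whole blocks and must also respect the replica placement constraints listed above.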