An Introduction to Hadoop

What is Hadoop?

Hadoop is both a programming model and a framework for analyzing and processing big data in a distributed computing environment. In layman's terms, it acts as an operating system for big data spread across a distributed system. It enables the distributed processing of large data sets across clusters of computers using simple programming models, and it is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

A set of machines/nodes running HDFS and MapReduce constitutes a Hadoop cluster. For most application scenarios, Hadoop is linearly scalable, which means we can expect better performance simply by adding more nodes to the cluster.

Hadoop is an open-source implementation of ideas published by Google: HDFS is modeled on the Google File System (GFS), and Hadoop MapReduce is modeled on Google's MapReduce.

Hadoop automatically handles the partitioning of work and the combining of partial results, making it easy to implement a divide-and-conquer strategy. It also takes care of synchronization and provides high availability and fault tolerance. Note that Hadoop is not a real-time system: jobs take time to execute, but low latency is not Hadoop's main goal anyway.
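
To make the "partitioning of work and combining of partial results" concrete, here is a minimal sketch of the classic word-count example written against the org.apache.hadoop.mapreduce API. Hadoop runs the map function in parallel over splits of the input and then groups all values for the same key before calling the reduce function; the class names and comments below are illustrative assumptions, not taken from this text.

// Word-count sketch: Hadoop splits the input, runs mappers in parallel,
// then groups values by key and combines them in the reducers.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // The map step: runs independently on each split of the input data.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce step: Hadoop delivers all partial counts for a word here,
  // and the reducer combines them into the final result.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

Notice that the code contains no logic for splitting the input, moving data between nodes, retrying failed tasks, or synchronizing workers; Hadoop supplies all of that.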

Hadoop Characteristics

  • Robust: Since it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

  • Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster

  • Simple: Hadoop allows users to quickly write efficient parallel code; a minimal driver is sketched after this list.
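
As a rough illustration of that simplicity, the following driver sketch configures and submits the word-count job shown earlier. It assumes the WordCount classes from the previous sketch are on the classpath, and the input/output paths are placeholders passed on the command line.

// Driver sketch: configure the job, point it at input/output paths, and submit it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // Optional: reuse the reducer as a combiner for local pre-aggregation.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything else, such as scheduling tasks across the cluster, moving intermediate data, and recovering from node failures, is handled by the framework.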

The next section gives an overview of the main components of Hadoop.
