A non-technical guide to Hadoop's big-data analytics platform and its primary modules, including HDFS and MapReduce

Hadoop was literally the name of a toy elephant: specifically, the toy elephant belonging to the son of Doug Cutting, one of Hadoop's creators. But you're not here to learn how Hadoop got its name! Broadly speaking, Hadoop is a general-purpose, operating system-like platform for parallel computing.

I am sure I do not need to dwell on the severe limitations of a single system when it comes to processing all the big data floating around us: it is simply beyond the capacity of one machine. Hadoop provides a framework for processing this big data in parallel across many machines, much like the workloads supercomputers are used for.
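
To make the parallel-processing idea concrete, here is a toy sketch of the MapReduce pattern mentioned in the title. This is not Hadoop's actual API; it just imitates the core idea in plain Python: split the input into chunks, process each chunk independently (on a Hadoop cluster, each chunk would go to a different node), then merge the partial results.

```python
from collections import Counter

def map_word_counts(chunk):
    """Map step: count words in one chunk of text.

    In Hadoop, many of these map tasks run in parallel,
    each on its own slice of the data.
    """
    return Counter(chunk.split())

def reduce_word_counts(partial_counts):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total += partial
    return total

# Pretend each string is a block of a much larger file.
chunks = [
    "big data is big",
    "data moves fast",
    "big clusters process data",
]

partials = [map_word_counts(c) for c in chunks]  # independent, parallelizable
totals = reduce_word_counts(partials)
print(totals["big"], totals["data"])  # → 3 3
```

The key property is that the map step has no shared state, so the work scales out simply by adding more machines, which is exactly the horizontal scaling discussed below.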

But why can't we use supercomputers to parallelize the processing of big data? A few reasons:

  • There is no standardized operating system (or operating system-like framework) for supercomputers, making them less accessible to small and mid-sized organizations
  • High cost of both the initial purchase and regular maintenance
  • Hardware support is tied to a specific vendor, i.e., a company cannot procure the various individual components from different vendors and stack them together
  • In most cases, custom software needs to be developed to operate a supercomputer based on the specific use case
  • Not easy to scale horizontally

Hadoop comes to the rescue because it addresses all of the above limitations: it's an open-source, operating system-like platform for parallel processing, with strong community support and regular updates. It runs on commodity hardware, so there is no lock-in to a specific hardware vendor, and it requires no proprietary software.

There have been three major release lines of Hadoop since 2006: Hadoop 1, Hadoop 2, and Hadoop 3.

Let’s now look at Hadoop’s architecture in more detail — I will start with Hadoop 1, which will make it easier for us to understand Hadoop 2’s architecture later on. I will also assume some basic familiarity with the following terms: commodity hardware, cluster & cluster node, distributed system, and hot standby.

