Hadoop is an open-source framework that empowers you to store and process massive datasets across clusters of computers. It's essentially a big data management and processing tool, known for its ability to handle petabytes (that's thousands of terabytes!) of information.
With the advent of big data and cloud computing, the need to efficiently store and process large datasets is constantly increasing. Using a single resource for storage and processing is unviable due to the cost, the sheer amount of data being generated, and the risk of failure with probable loss of valuable data.
This is the situation that gave rise to Hadoop, an open-source platform for distributed storage and processing of large datasets on compute clusters. For distributed computing, Hadoop uses MapReduce; for distributed storage, it uses the Hadoop Distributed File System (HDFS).
Hadoop is mainly applied in areas where there is a large amount of data that requires processing and storage. Since Hadoop employs horizontal scaling, it is able to quickly scale and perform parallel processing.
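As a rough single-machine analogy for the MapReduce model (this is not Hadoop code), a word count can be sketched as a shell pipeline. Here `tr` plays the role of the map step, `sort` stands in for the shuffle, and `uniq -c` acts as the reduce step. The function name `word_count` and the sample input are made up for illustration.

```shell
#!/bin/sh
# Single-machine analogy for a MapReduce word count (illustration only).
word_count() {
    tr -s '[:space:]' '\n' |   # "map": emit one word per line
        sort |                 # "shuffle": bring identical words together
        uniq -c |              # "reduce": count each group of identical words
        awk '{print $2, $1}'   # format each result as "word count"
}

# Sample input; prints "big 3", "cluster 1" and "data 2" on separate lines.
printf 'big data big cluster\ndata big\n' | word_count
```

In Hadoop, the same three stages run in parallel on many machines, with each node working on the blocks of data stored locally.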
Common business use cases and applications of this technology include:
A prerequisite for this process is to have openssh-server and a Java Development Kit (JDK) installed on your machine.
This guide explores installing on Ubuntu.
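Before starting, it can help to confirm that the prerequisites are actually present. The sketch below only reports which commands are on the PATH; the helper name `check_prereq` is hypothetical, and the package names in the comment are the usual Ubuntu ones, assumed rather than taken from this guide.

```shell
#!/bin/sh
# Report whether each prerequisite command is on the PATH.
# On Ubuntu, missing ones can usually be installed with (assumed package names):
#   sudo apt update && sudo apt install openssh-server openjdk-8-jdk
check_prereq() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# ssh is the client, sshd comes with openssh-server, javac comes with the JDK.
for cmd in ssh sshd java javac; do
    check_prereq "$cmd"
done
```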
Below are the steps to follow when installing Hadoop:
Create a hadoop user and group hadoop.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
Switch user to the newly created hadoop user.
sudo su - hadoop
Set up and test passwordless SSH Keys since Hadoop requires SSH access for node management.
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
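If ssh localhost still asks for a password, one common cause is loose permissions on the key files, which sshd rejects when its StrictModes option is on. The following is a minimal sketch of a fix, assuming the default ~/.ssh location; `fix_ssh_perms` is a hypothetical helper name.

```shell
#!/bin/sh
# Tighten permissions on an SSH directory so sshd accepts the authorized key.
fix_ssh_perms() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir"
        return 0
    fi
    chmod 700 "$dir"
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
    echo "permissions set on $dir"
}

fix_ssh_perms "$HOME/.ssh"

# Non-interactive login test; BatchMode makes ssh fail instead of prompting:
#   ssh -o BatchMode=yes localhost true && echo "passwordless SSH works"
```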
Download Hadoop from the official Apache site. Then navigate to the downloaded archive, extract it, and move the extracted folder to /usr/local.
tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
Grant ownership of the folder to the hadoop user you created.
sudo chown -R hadoop:hadoop /usr/local/hadoop
Note: The folder name hadoop-3.3.0 depends on the version you download; it is not fixed.
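Because the folder name changes with the version, one option is to keep the version in a single variable and derive everything else from it. This is only a sketch: the variable names are made up for illustration, and the archive.apache.org URL layout is an assumption you should check against the mirror you actually download from. The script prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Derive archive and folder names from one version variable.
# HADOOP_VERSION and the URL layout are assumptions; adjust to your download.
HADOOP_VERSION=${HADOOP_VERSION:-3.3.0}
HADOOP_DIR="hadoop-$HADOOP_VERSION"
HADOOP_TARBALL="$HADOOP_DIR.tar.gz"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/$HADOOP_DIR/$HADOOP_TARBALL"

# Print the commands instead of executing them.
echo "wget $HADOOP_URL"
echo "tar -xzvf $HADOOP_TARBALL"
echo "sudo mv $HADOOP_DIR /usr/local/hadoop"
echo "sudo chown -R hadoop:hadoop /usr/local/hadoop"
```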
To configure your system for Hadoop, open the ~/.bashrc file in an editor, for example with the command nano ~/.bashrc, and add the following lines at the end. Once done, save and close the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Reload the ~/.bashrc file using the command source ~/.bashrc. At this point, Hadoop is installed.
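A quick way to confirm that the new environment took effect is to print the variables back. The sketch below assumes the variable names from the export lines above; `report_var` is a hypothetical helper name.

```shell
#!/bin/sh
# Show which of the Hadoop-related variables are set in the current shell.
report_var() {
    # Look up the variable whose name is passed in $1.
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "$1=$value"
    else
        echo "$1 is not set"
    fi
}

for name in JAVA_HOME HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
    report_var "$name"
done

# If the PATH entries took effect, this prints the installed version:
#   hadoop version
```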
Configure Hadoop's hadoop-env.sh file to set the JAVA_HOME variable. To open the file, run the command:
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add the line export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64. This should be the root directory of the Java installation, the same value set in ~/.bashrc.
Configure Hadoop's core-site.xml by adding the following lines:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdata</value>
</property>
</configuration>
To open the core-site.xml file, run the command:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Configure Hadoop's hdfs-site.xml by adding the following lines:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
To open the hdfs-site.xml file, run the command:
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Configure Hadoop's mapred-site.xml by copying the existing template mapred-site.xml.template (Hadoop 2.x only; Hadoop 3.x releases already ship a mapred-site.xml) and adding the following lines:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
To copy the template file into a new one (needed only on Hadoop 2.x, where mapred-site.xml does not exist yet) and edit it, run the commands:
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Configure Hadoop's yarn-site.xml by adding the following lines:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
To open the yarn-site.xml file, run the command:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
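After editing several XML files by hand, it is easy to leave a typo behind. As a rough check, the sketch below reads a property value back out of a configuration file. The helper `get_prop` is hypothetical and does simple line-based matching rather than real XML parsing, so it assumes each name and value tag sits on one line, as in the snippets above. On a working installation, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself sees.

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop config file.
# Usage: get_prop FILE PROPERTY_NAME
get_prop() {
    awk -v key="$2" '
        index($0, "<name>" key "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# HADOOP_CONF_DIR falls back to the directory used in this guide.
conf_dir=${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}
if [ -f "$conf_dir/core-site.xml" ]; then
    echo "fs.defaultFS = $(get_prop "$conf_dir/core-site.xml" fs.defaultFS)"
fi
if [ -f "$conf_dir/hdfs-site.xml" ]; then
    echo "dfs.replication = $(get_prop "$conf_dir/hdfs-site.xml" dfs.replication)"
fi
```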
Before starting the Hadoop cluster for the first time, format the namenode by running the command:
hdfs namenode -format
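Formatting is a one-time step: running it again on an existing installation wipes the HDFS metadata. A small guard like the one below can help avoid that. It assumes the NameNode keeps its metadata in the default location, dfs/name under hadoop.tmp.dir; the variable `HADOOP_TMP_DIR` and the function name are hypothetical, so point them at whatever you configured in core-site.xml.

```shell
#!/bin/sh
# Succeeds if the given NameNode directory already holds formatted metadata.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumed layout: <hadoop.tmp.dir>/dfs/name (adjust to your core-site.xml).
NAME_DIR="${HADOOP_TMP_DIR:-$HOME/hdata}/dfs/name"

if namenode_is_formatted "$NAME_DIR"; then
    echo "NameNode already formatted at $NAME_DIR; skipping"
else
    echo "No metadata at $NAME_DIR; run: hdfs namenode -format"
fi
```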
To start the HDFS service:
start-dfs.sh
To start the YARN service:
start-yarn.sh
To verify all services are running, run the command jps and visit the Hadoop web interface at http://localhost:9870 (Hadoop 2.x releases use http://localhost:50070 instead). The resource manager interface can be found at http://localhost:8088.
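To make the jps check less error-prone, the output can be compared against the daemons you expect. The sketch below assumes a single-node setup where both start-dfs.sh and start-yarn.sh have been run; `missing_daemons` is a hypothetical helper that prints every expected daemon that jps did not list.

```shell
#!/bin/sh
# Read jps output on stdin and print any expected daemon that is absent.
missing_daemons() {
    # jps prints "<pid> <name>"; keep only the names.
    running=$(awk '{print $2}')
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        echo "$running" | grep -qx "$daemon" || echo "$daemon"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
else
    echo "jps not found; is the JDK on your PATH?"
fi
```

No output from missing_daemons means all five daemons are up.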
This guide gives you a good initial feel for what Hadoop is all about and its use cases.
Thanks For Reading !!!
Originally published by Kimaru Thagana at Pluralsight