How to Install Apache Hadoop on Ubuntu 16.04

r00t April 30, 2018


In this tutorial we’ll learn how to install Apache Hadoop on Ubuntu 16.04. We will also install and configure its prerequisites. Apache Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open-source project in the big data field and is sponsored by the Apache Software Foundation.

I recommend using a minimal Ubuntu server setup as the basis for this tutorial. That can be a virtual or dedicated server image with a minimal Ubuntu 16.04 install from a web hosting company, or you can use our minimal server tutorial to install a server from scratch.

Install Apache Hadoop on Ubuntu 16.04

Step 1. First, ensure your system and apt package lists are fully up-to-date by running the following:

apt-get update -y
apt-get upgrade -y

Step 2. Installing Java.

As Hadoop is based on Java, we need to install a JDK on our machine first:

apt-get -y install openjdk-8-jdk-headless
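Once the installation finishes, you can confirm which Java version is active:

java -version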

Step 3. Adding a New Hadoop User.

It is recommended to create a regular user to configure and run Apache Hadoop. So, create a user named “hadoop” and set a password:

useradd -m -d /home/hadoop -s /bin/bash hadoop
passwd hadoop
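You can verify that the account was created correctly:

id hadoop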

Step 4. Installing Apache Hadoop on Ubuntu 16.04.

Perform the remaining steps as the hadoop user, so the installation ends up in its home directory:

su - hadoop

First, you can visit the Apache Hadoop downloads page to get the latest release, or you can just issue the following command in a terminal to download Hadoop 3.1.0:

wget http://apache.mirrors.tds.net/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz
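Apache publishes checksums for each release; optionally verify the archive before unpacking by comparing this output against the checksum listed on the Apache download site:

sha256sum hadoop-3.1.0.tar.gz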

Then extract the archive with the tar command and rename the resulting directory:

tar xvzf hadoop-3.1.0.tar.gz
mv hadoop-3.1.0 hadoop

Step 5. Configure Apache Hadoop.

We will be configuring Hadoop in pseudo-distributed mode. To start with, set the required environment variables in the hadoop user's ~/.bashrc file:

nano ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # Change this if your Java installation directory differs
export HADOOP_HOME=/home/hadoop/hadoop # Hadoop installation directory
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Apply the environment variables to the current session:

source ~/.bashrc
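If everything is in place, the hadoop binary should now be on the PATH and report its version:

hadoop version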

Next, edit the Hadoop environment file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Set the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
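If you are unsure where the JDK lives on your system, you can resolve it from the javac binary (assuming the JDK is on the PATH):

dirname $(dirname $(readlink -f $(which javac)))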

Hadoop has many configuration files, and which of them we need to edit depends on the cluster mode we are setting up (pseudo-distributed). They all live in one directory:

cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Edit hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
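Hadoop creates these directories on first use, but you can create them up front (as the hadoop user) so the paths and ownership are guaranteed to be correct:

mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode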

Edit mapred-site.xml (in Hadoop 3.x this file already exists; only older 2.x releases require copying it from mapred-site.xml.template first):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Now format the NameNode using the following command:

hdfs namenode -format
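The start-up scripts in the next step connect to localhost over SSH, so the hadoop user needs passwordless SSH access to the local machine. Assuming the openssh-server package is installed, a minimal setup looks like this:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys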

Finally, start the NameNode and DataNode daemons, and then the YARN daemons (ResourceManager and NodeManager), using the scripts Hadoop provides in its sbin directory:

cd $HADOOP_HOME/sbin/
start-dfs.sh
start-yarn.sh
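You can check that all daemons came up using the jps tool that ships with the JDK; in pseudo-distributed mode you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed:

jps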

Step 6. Configure the Firewall for Apache Hadoop.

In Hadoop 3.x the NameNode web UI listens on port 9870 and the ResourceManager UI on port 8088. Open both in ufw (run these as root, or prefix them with sudo):

ufw allow 9870/tcp
ufw allow 8088/tcp
ufw reload

Step 7. Accessing Apache Hadoop.

Apache Hadoop's web interfaces will be available on port 9870 (NameNode) and port 8088 (ResourceManager) by default. Open your favorite browser and navigate to http://your-ip-address:9870/
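Beyond the web UI, you can confirm that HDFS accepts commands from the terminal, for example by creating the hadoop user's home directory and listing the filesystem root:

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -ls /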

Congratulations! You have successfully installed and configured Apache Hadoop on your Ubuntu 16.04 server. Thanks for using this tutorial to install Apache Hadoop on an Ubuntu system.
