Ubuntu 20.04 Hadoop - LinuxConfig.org

Apache Hadoop is composed of multiple open source software packages that work together for distributed storage and distributed processing of big data. There are four main components to Hadoop:

  • Hadoop Common - the various software libraries that Hadoop depends on to run
  • Hadoop Distributed File System (HDFS) - a file system that allows for efficient distribution and storage of big data across a cluster of computers
  • Hadoop MapReduce - used for processing the data
  • Hadoop YARN - a resource manager that schedules jobs and allocates computing resources across the entire cluster

In this tutorial, we will go over the steps to install Hadoop version 3 on Ubuntu 20.04. This involves installing HDFS (NameNode and DataNode), YARN, and MapReduce on a single-node cluster configured in pseudo-distributed mode, which simulates a distributed cluster on a single machine. Each component of Hadoop (HDFS, YARN, MapReduce) will run on our node as a separate Java process.
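Since each component runs as its own Java process, one rough way to confirm the cluster is up is to look for the daemon names in the output of the JDK's `jps` tool. The sketch below assumes the usual daemon class names (NameNode, DataNode, ResourceManager, NodeManager); the helper name `missing_daemons` is made up for illustration and is not part of Hadoop.

```shell
# Report which expected Hadoop daemons are absent from jps-style output.
# Reads lines of the form "<pid> <ClassName>" on standard input.
# The daemon list below is an assumption based on a typical single-node setup.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # Compare the second field exactly, so "SecondaryNameNode"
    # is not mistaken for "NameNode".
    if ! printf '%s\n' "$jps_output" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      printf '%s\n' "$daemon"
    fi
  done
}

# Usage on a machine where jps is on the PATH:
#   jps | missing_daemons
```

If nothing is printed, all four daemons were found; any name that is printed is a process that did not start.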

In this tutorial you will learn:
  • How to add a user for the Hadoop environment
  • How to install the Java prerequisite
  • How to configure passwordless SSH
  • How to install Hadoop and configure the necessary XML files
  • How to start the Hadoop cluster
  • How to access the NameNode and ResourceManager web UIs

This is a companion discussion topic for the original entry at https://linuxconfig.org/ubuntu-20-04-hadoop

Just one note about the URL of the NameNode web UI.
In Hadoop 3.1.3 the port is 9870; Hadoop 3.x moved it from the 50070 used by Hadoop 2.x.
This was the only Hadoop tutorial that helped me get a clean install.
Thanks a lot.
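For anyone scripting around the port note above, here is a minimal sketch that picks the default NameNode web UI port from a version string. The function name `namenode_ui_port` is hypothetical, and it only covers the default ports; a cluster that overrides `dfs.namenode.http-address` in its configuration will differ.

```shell
# Map a Hadoop version string to the default NameNode web UI port.
# Hadoop 3.x defaults to 9870; Hadoop 2.x used 50070.
# Other versions are rejected rather than guessed.
namenode_ui_port() {
  case "$1" in
    2.*) echo 50070 ;;
    3.*) echo 9870 ;;
    *)   echo "unknown Hadoop version: $1" >&2; return 1 ;;
  esac
}

# Hypothetical check against a local single-node cluster (assumes curl is installed):
#   curl -sI "http://localhost:$(namenode_ui_port 3.1.3)/" | head -n 1
```

On a running cluster, `hdfs getconf -confKey dfs.namenode.http-address` should report the address actually in effect.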

One more thing to note: the command to download Hadoop might not work (at least for me it didn't), resulting in a 404 error.
A quick fix is to change the version in the download command from v3.1.3 to v3.1.4, or you can use the command below directly:

wget https://downloads.apache.org/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz

and it will work as expected. Have a great day!
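Since the 404 tends to come back whenever a release is retired from the main download site, a slightly more defensive version of the download could try a second location. This is only a sketch: the fallback to archive.apache.org is my assumption and not part of the tutorial, and the function names `hadoop_tarball_urls` and `fetch_hadoop` are made up for illustration.

```shell
# Print candidate download URLs for a given Hadoop version, one per line.
# The first is the location used above; the second (the Apache archive)
# is an assumed fallback for releases no longer on the main site.
hadoop_tarball_urls() {
  version="$1"
  file="hadoop-${version}.tar.gz"
  printf '%s\n' \
    "https://downloads.apache.org/hadoop/common/hadoop-${version}/${file}" \
    "https://archive.apache.org/dist/hadoop/common/hadoop-${version}/${file}"
}

# Download from the first candidate URL that exists.
fetch_hadoop() {
  for url in $(hadoop_tarball_urls "$1"); do
    # --spider makes wget check that the URL exists without downloading it.
    if wget -q --spider "$url"; then
      wget "$url"
      return
    fi
  done
  echo "hadoop-$1 not found at any candidate URL" >&2
  return 1
}

# Usage:
#   fetch_hadoop 3.1.4
```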