Ubuntu 20.04 Hadoop - LinuxConfig.org

Apache Hadoop consists of multiple open source software packages that work together to provide distributed storage and distributed processing of big data. Hadoop has four main components:

  • Hadoop Common - the various software libraries that Hadoop depends on to run
  • Hadoop Distributed File System (HDFS) - a file system that allows for efficient distribution and storage of big data across a cluster of computers
  • Hadoop MapReduce - a programming model and engine for processing the data in parallel across the cluster
  • Hadoop YARN - a resource management layer that allocates computing resources and schedules jobs for the entire cluster
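
To make the MapReduce idea more concrete, here is a minimal sketch of the model in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework runs the map and reduce steps on many machines and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the grouping done between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

Everything here runs in one process on one list, so it only shows the shape of the computation, not the distribution that Hadoop adds.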

In this tutorial, we will go over the steps to install Hadoop version 3 on Ubuntu 20.04. This involves installing HDFS (NameNode and DataNode), YARN, and MapReduce on a single-node cluster configured in pseudo-distributed mode, which simulates a distributed cluster on a single machine. Each component of Hadoop (HDFS, YARN, MapReduce) will run on our node as a separate Java process.
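
Because each daemon is a separate Java process, one way to check a pseudo-distributed setup is to look at the output of the JDK's jps tool, which prints one "<pid> <main class>" line per running JVM. The sketch below is an illustration, not part of the original tutorial. The set of expected daemon names is an assumption about a typical single-node Hadoop 3 setup after HDFS and YARN have been started, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Assumption: daemons typically seen on a single-node Hadoop 3 setup
# once HDFS and YARN are running. Adjust for your own configuration.
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def running_daemons(jps_output):
    """Return the expected daemon names that appear in jps output."""
    found = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each useful line looks like "<pid> <main class name>".
        if len(parts) >= 2 and parts[1] in EXPECTED_DAEMONS:
            found.add(parts[1])
    return found


def missing_daemons(jps_output):
    """Return the expected daemon names that are absent from jps output."""
    return EXPECTED_DAEMONS - running_daemons(jps_output)


if __name__ == "__main__":
    if shutil.which("jps") is None:
        print("jps not found; is a JDK installed and on PATH?")
    else:
        output = subprocess.run(
            ["jps"], capture_output=True, text=True, check=False
        ).stdout
        print("running:", sorted(running_daemons(output)))
        print("missing:", sorted(missing_daemons(output)))
```

Run it as the same user that started the Hadoop daemons, since jps normally lists only the JVMs that user is allowed to see.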

In this tutorial you will learn:
  • How to add a user for the Hadoop environment
  • How to install the Java prerequisite
  • How to configure passwordless SSH
  • How to install Hadoop and configure the necessary XML files
  • How to start the Hadoop cluster
  • How to access the NameNode and ResourceManager web UIs

This is a companion discussion topic for the original entry at https://linuxconfig.org/ubuntu-20-04-hadoop