Install Apache Hadoop and HBase on Ubuntu 24.04

Apache Hadoop is a distributed storage and processing framework built for handling large datasets across clusters of commodity hardware. Apache HBase is a NoSQL database that runs on top of HDFS (Hadoop Distributed File System) and provides real-time read/write access to big data. Together, they form the backbone of many data engineering pipelines where batch processing and random data access are both needed.

Original content from computingforgeeks.com - post 56920

This guide walks through a complete installation of Apache Hadoop 3.4.3 and Apache HBase 2.5.13 in pseudo-distributed mode on Ubuntu 24.04 LTS. We cover Java setup, HDFS and YARN configuration, HBase integration with HDFS, and firewall rules for the web interfaces.

Prerequisites

Before starting, confirm the following are in place:

  • A server or VM running Ubuntu 24.04 LTS with at least 4GB RAM and 2 CPU cores
  • Root or sudo access
  • OpenJDK 11 (installed in Step 1)
  • SSH server running locally (for Hadoop daemon management)
  • Ports 9870 (NameNode), 8088 (YARN), and 16010 (HBase Master) open if accessing web UIs remotely

Step 1: Install Java (OpenJDK 11) on Ubuntu 24.04

Both Hadoop and HBase require Java. OpenJDK 11 is the recommended version that works reliably with Hadoop 3.4.x and HBase 2.5.x. Install it from the default Ubuntu repositories.

sudo apt update
sudo apt install -y openjdk-11-jdk

Confirm the installation by checking the Java version:

java -version

The output should show OpenJDK 11 installed and active:

openjdk version "11.0.25" 2024-10-15
OpenJDK Runtime Environment (build 11.0.25+9-post-Ubuntu-1ubuntu124.04)
OpenJDK 64-Bit Server VM (build 11.0.25+9-post-Ubuntu-1ubuntu124.04, mixed mode, sharing)

Set the JAVA_HOME environment variable system-wide so Hadoop and HBase can locate the JDK. If you need a more detailed Java setup, check our guide on how to install Java on Ubuntu 24.04.

echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee /etc/profile.d/java.sh
source /etc/profile.d/java.sh
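The architecture-specific path above can also be derived instead of hard-coded, which helps if you later move to an arm64 machine. A minimal sketch; the sample path below stands in for the output of `readlink -f "$(command -v java)"` on a live system:

```shell
# Derive JAVA_HOME from the resolved path of the java binary instead of
# hard-coding the architecture suffix (amd64 vs arm64).
# On a live system: java_bin=$(readlink -f "$(command -v java)")
java_bin=/usr/lib/jvm/java-11-openjdk-amd64/bin/java   # sample resolved path
JAVA_HOME=$(dirname "$(dirname "$java_bin")")
echo "$JAVA_HOME"
```

Stripping the two trailing path components (`bin/java`) yields the JDK root, which is exactly what JAVA_HOME should point at.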

Step 2: Create a Dedicated Hadoop User

Running Hadoop services as root is a security risk. Create a dedicated hadoop user that owns all Hadoop and HBase files and processes.

sudo adduser --disabled-password --gecos "" hadoop
sudo usermod -aG sudo hadoop

Switch to the hadoop user and set up passwordless SSH to localhost. Hadoop uses SSH internally to manage its daemons, even in pseudo-distributed mode.

sudo su - hadoop
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Test the SSH connection to localhost. Accept the host key when prompted:
The trailing exit closes the remote session immediately after the login succeeds.

ssh localhost exit

If SSH is not installed, run sudo apt install -y openssh-server first.

Step 3: Download and Install Apache Hadoop 3.4.3

Download the latest Hadoop 3.4.3 binary release and extract it to /opt. All subsequent steps assume you are logged in as the hadoop user.

cd /tmp
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.3/hadoop-3.4.3.tar.gz
sudo tar -xzf hadoop-3.4.3.tar.gz -C /opt/
sudo mv /opt/hadoop-3.4.3 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop

Add Hadoop environment variables to the hadoop user’s profile so they persist across sessions:

echo 'export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
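Since several later steps silently fail when these variables are missing, it can be worth a quick sanity check. A small sketch; the two assignments at the top mimic a freshly sourced ~/.bashrc and should be dropped when checking a real session:

```shell
# Fail loudly if a variable the later steps depend on is unset.
# The assignments mimic a sourced ~/.bashrc; remove them on a real session.
HADOOP_HOME=/opt/hadoop
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
for v in JAVA_HOME HADOOP_HOME; do
  eval "val=\${$v:-}"
  if [ -n "$val" ]; then
    echo "$v=$val"
  else
    echo "WARNING: $v is not set"
  fi
done
```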

Also set JAVA_HOME in Hadoop’s own environment file:

vi /opt/hadoop/etc/hadoop/hadoop-env.sh

Find the JAVA_HOME line and set it to:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Verify Hadoop detects Java correctly:

hadoop version

You should see the Hadoop version and build details confirming a successful installation:

Hadoop 3.4.3
Source code repository git@github.com:apache/hadoop.git
Compiled by ... with protoc 3.21.12
From source with checksum ...
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.4.3.jar

Step 4: Configure HDFS (core-site.xml and hdfs-site.xml)

HDFS needs two configuration files. The first, core-site.xml, tells Hadoop where the NameNode runs. The second, hdfs-site.xml, controls replication and storage directories.

Create the data directories for the NameNode and DataNode:

sudo mkdir -p /opt/hadoop/data/namenode
sudo mkdir -p /opt/hadoop/data/datanode
sudo chown -R hadoop:hadoop /opt/hadoop/data

Edit the core-site.xml configuration file:

vi /opt/hadoop/etc/hadoop/core-site.xml

Replace the empty <configuration> block with the following. This sets the default filesystem to HDFS on localhost port 9000:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/data</value>
  </property>
</configuration>
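On a configured node, `hdfs getconf -confKey fs.defaultFS` is the proper way to read back a property and confirm the config took effect. As a standalone sketch of the same idea (using an inline copy of the core-site.xml above, so it runs without any daemons):

```shell
# Read a single property out of a Hadoop-style XML config with sed.
# On a configured node, prefer: hdfs getconf -confKey fs.defaultFS
conf='<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>'
# Find the <name> line, step to the next line, and extract the <value>.
echo "$conf" | sed -n '/<name>fs.defaultFS<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p}'
```

This relies on the name and value sitting on adjacent lines, which matches the formatting used throughout this guide.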

Now edit the HDFS configuration file:

vi /opt/hadoop/etc/hadoop/hdfs-site.xml

Set the replication factor to 1 (single-node setup) and specify the NameNode and DataNode storage paths:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/data/datanode</value>
  </property>
</configuration>

Step 5: Configure YARN (mapred-site.xml and yarn-site.xml)

YARN (Yet Another Resource Negotiator) handles job scheduling and resource management. Configure MapReduce to run on YARN and set up the NodeManager shuffle service.

Edit the MapReduce configuration:

vi /opt/hadoop/etc/hadoop/mapred-site.xml

Set the MapReduce framework to YARN and define the classpath:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

Edit the YARN configuration:

vi /opt/hadoop/etc/hadoop/yarn-site.xml

Enable the MapReduce shuffle service and set the classpath for YARN:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

Step 6: Format NameNode and Start Hadoop Services

Before the first startup, format the HDFS NameNode. This initializes the filesystem metadata. Only run this once – formatting an existing NameNode destroys all HDFS data.

hdfs namenode -format

The format command should complete with a message containing “Storage directory … has been successfully formatted.” Now start the HDFS daemons:

start-dfs.sh

Start the YARN resource manager and node manager:

start-yarn.sh

Verify all Hadoop daemons are running with the jps command:

jps

You should see five Java processes – NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps

If any daemon is missing, check the logs in /opt/hadoop/logs/ for errors. Common issues include incorrect JAVA_HOME, permission problems on data directories, or SSH not configured properly.
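Checking the jps output by eye gets tedious if you restart the stack often. A small helper can flag missing daemons; this sketch is fed canned jps-style output on stdin so it can be read and run standalone, but on a live node you would pipe real `jps` output into it:

```shell
# Flag any expected Hadoop daemon missing from jps-style output on stdin.
check_daemons() {
  running=$(awk '{print $2}')
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    if echo "$running" | grep -qx "$d"; then
      echo "OK $d"
    else
      echo "MISSING $d"
    fi
  done
}
# Canned sample input; on a live node use:  jps | check_daemons
printf '12345 NameNode\n12456 DataNode\n12890 Jps\n' | check_daemons
```

With the sample input, the helper reports NameNode and DataNode as OK and flags the other three daemons as MISSING.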

Test HDFS by creating a directory and listing it:

hdfs dfs -mkdir /test
hdfs dfs -ls /

The output confirms HDFS is operational:

Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2026-03-22 10:00 /test

The NameNode web UI is available at http://your-server-ip:9870 and the YARN ResourceManager UI at http://your-server-ip:8088.

Step 7: Download and Install Apache HBase 2.5.13

With Hadoop running, install HBase to add NoSQL capabilities on top of HDFS. Download the latest stable HBase 2.5.13 release. If you also work with real-time data streaming, consider setting up Apache Spark on Ubuntu alongside this stack.

cd /tmp
wget https://dlcdn.apache.org/hbase/2.5.13/hbase-2.5.13-bin.tar.gz
sudo tar -xzf hbase-2.5.13-bin.tar.gz -C /opt/
sudo mv /opt/hbase-2.5.13 /opt/hbase
sudo chown -R hadoop:hadoop /opt/hbase

Add HBase environment variables to the hadoop user’s shell profile:

echo 'export HBASE_HOME=/opt/hbase
export PATH=$PATH:$HBASE_HOME/bin' >> ~/.bashrc
source ~/.bashrc

Set JAVA_HOME in the HBase environment file:

vi /opt/hbase/conf/hbase-env.sh

Uncomment and set the JAVA_HOME line:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Step 8: Configure HBase to Use HDFS (hbase-site.xml)

Configure HBase to run in pseudo-distributed mode with data stored on HDFS rather than the local filesystem. This gives HBase the fault tolerance and scalability of HDFS.

vi /opt/hbase/conf/hbase-site.xml

Replace the empty configuration block with these settings. The hbase.rootdir points to HDFS, and hbase.cluster.distributed enables pseudo-distributed mode:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/hbase/zookeeper</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>

The hbase.unsafe.stream.capability.enforce property relaxes HBase's startup check that the underlying filesystem supports the hflush/hsync stream capabilities it needs for durable writes. Disabling it is a common workaround in single-node tutorial setups; in a production multi-node cluster, leave it at the default of true so HBase refuses to start on storage that cannot guarantee durability.

Step 9: Start HBase and Access the HBase Shell

Make sure Hadoop (HDFS and YARN) is already running before starting HBase. HBase depends on HDFS for storage.

start-hbase.sh

Verify HBase processes are running alongside the Hadoop daemons:

jps

You should see HMaster, HRegionServer, and HQuorumPeer (ZooKeeper) in addition to the Hadoop processes:

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
13001 HMaster
13102 HRegionServer
13203 HQuorumPeer
13304 Jps

Access the HBase shell to run commands interactively:

hbase shell

Test by creating a table, inserting a row, and reading it back:

hbase> create 'test_table', 'cf'
hbase> put 'test_table', 'row1', 'cf:col1', 'hello_hbase'
hbase> get 'test_table', 'row1'
hbase> exit

The get command returns the value you inserted, confirming HBase reads and writes are working through HDFS. The HBase Master web UI is available at http://your-server-ip:16010.
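The same smoke test can be scripted rather than typed interactively: the HBase shell accepts a script file, and the -n (non-interactive) flag makes it exit non-zero when a command fails, which is handy for automation. A sketch; the actual invocation is commented out since it needs the running cluster from Step 9:

```shell
# Write the smoke-test commands to a script file for non-interactive use.
cat > /tmp/hbase_smoke.txt <<'EOF'
create 'test_table', 'cf'
put 'test_table', 'row1', 'cf:col1', 'hello_hbase'
get 'test_table', 'row1'
disable 'test_table'
drop 'test_table'
EOF
# On a node with HBase running (Step 9):
#   hbase shell -n /tmp/hbase_smoke.txt
cat /tmp/hbase_smoke.txt
```

The disable/drop pair at the end cleans up the test table, so the script can be rerun safely.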

Step 10: Configure Firewall Rules for Hadoop and HBase

If you are accessing the web interfaces from a remote machine, open the required ports in UFW. For a detailed breakdown of UFW firewall commands, see our dedicated guide.

sudo ufw allow 9870/tcp comment 'HDFS NameNode Web UI'
sudo ufw allow 8088/tcp comment 'YARN ResourceManager Web UI'
sudo ufw allow 16010/tcp comment 'HBase Master Web UI'
sudo ufw allow 9000/tcp comment 'HDFS NameNode RPC'

Verify the rules were added:

sudo ufw status numbered

The output should list the four new rules along with any existing ones:

Status: active

     To                         Action      From
     --                         ------      ----
[ 1] 9870/tcp                   ALLOW IN    Anywhere       # HDFS NameNode Web UI
[ 2] 8088/tcp                   ALLOW IN    Anywhere       # YARN ResourceManager Web UI
[ 3] 16010/tcp                  ALLOW IN    Anywhere       # HBase Master Web UI
[ 4] 9000/tcp                   ALLOW IN    Anywhere       # HDFS NameNode RPC

The following table summarizes all ports used by this Hadoop and HBase setup:

Port    Service                        Protocol
9000    HDFS NameNode RPC              TCP
9870    HDFS NameNode Web UI           TCP
9864    HDFS DataNode Web UI           TCP
8088    YARN ResourceManager Web UI    TCP
8042    YARN NodeManager Web UI        TCP
16000   HBase Master RPC               TCP
16010   HBase Master Web UI            TCP
16020   HBase RegionServer RPC         TCP
16030   HBase RegionServer Web UI      TCP
2181    ZooKeeper client               TCP
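If you later need to open more of these ports, the list can drive the firewall commands directly. A sketch that prints the ufw invocations rather than running them, so you can review the output before piping it to sh; only open ports you actually need to reach remotely:

```shell
# Print one ufw command per port:service pair; review, then pipe to sh.
while IFS=: read -r port svc; do
  echo "sudo ufw allow ${port}/tcp comment '${svc}'"
done <<'EOF'
9864:HDFS DataNode Web UI
8042:YARN NodeManager Web UI
16030:HBase RegionServer Web UI
EOF
```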

Managing Hadoop and HBase Services

Use these commands to stop and start the services. Always stop HBase before stopping Hadoop since HBase depends on HDFS.

To stop all services in the correct order:

stop-hbase.sh
stop-yarn.sh
stop-dfs.sh

To start all services:

start-dfs.sh
start-yarn.sh
start-hbase.sh
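The ordering constraint is easy to encode once in a pair of helper functions instead of remembering it. A sketch; the calls to the actual service scripts are commented out so the ordering logic can be read (and run) standalone on any machine:

```shell
# Encode the start/stop ordering once: HBase last up, first down.
stack_up() {
  for s in start-dfs.sh start-yarn.sh start-hbase.sh; do
    echo "-> $s"
    # "$s"   # uncomment on a configured node
  done
}
stack_down() {
  for s in stop-hbase.sh stop-yarn.sh stop-dfs.sh; do
    echo "-> $s"
    # "$s"   # uncomment on a configured node
  done
}
stack_down
```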

Conclusion

You now have Apache Hadoop 3.4.3 and HBase 2.5.13 running in pseudo-distributed mode on Ubuntu 24.04. HDFS handles distributed storage, YARN manages compute resources, and HBase provides real-time NoSQL access on top of HDFS. This setup works for development, testing, and learning the Hadoop ecosystem before scaling to a multi-node cluster.

For production use, plan for a minimum 3-node cluster with dedicated NameNode and ResourceManager hosts, enable Kerberos authentication, configure HDFS high availability with a standby NameNode, and set up monitoring with tools like Prometheus and Grafana. Regular HDFS balancer runs and HBase compactions are essential for long-term performance.
