Install Apache Hadoop / HBase on CentOS

This guide covers the installation of Hadoop and HBase on CentOS / RHEL / Rocky Linux systems. HBase is an open-source, distributed, non-relational database developed under the Apache Software Foundation. It is written in Java and runs on top of the Hadoop Distributed File System (HDFS).

Original content from computingforgeeks.com - post 14574

Apache Hadoop is a widely used framework for distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It detects and handles failures at the application layer, so it can deliver a highly available service on top of a cluster of computers.

In our last tutorial, we covered the installation of Hadoop & HBase on Ubuntu. This article focuses on the installation, configuration and usage of Apache Hadoop / HBase on CentOS / Rocky Linux systems.

Installing Hadoop on CentOS / RHEL / Rocky Linux

Here are the steps used to install a single-node Hadoop cluster on CentOS / Rocky / AlmaLinux and other RHEL-based systems.

Step 1: Update System

Because Hadoop & HBase services listen on many ports, some of them dynamic, I recommend you install them on a server in a secure private network, disable Firewalld and set SELinux to permissive mode.

sudo systemctl disable --now firewalld
sudo setenforce 0
sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/g' /etc/selinux/config
cat /etc/selinux/config | grep SELINUX= | grep -v '#'

Update your CentOS 7 system before starting deployment of Hadoop and HBase.

sudo yum -y install epel-release
sudo yum -y install vim wget curl bash-completion
sudo yum -y update
sudo reboot

Step 2: Install Java Runtime

Install Java if it is missing on your CentOS 7 server.

### Java 11 ###
sudo yum -y install java-11-openjdk java-11-openjdk-devel

### Java 8 ###
sudo yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel

Validate that Java has been installed successfully.

$ java -version
openjdk version "11.0.20" 2023-07-18 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.20.0.8-1) (build 11.0.20+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.20.0.8-1) (build 11.0.20+8-LTS, mixed mode, sharing)

Set JAVA_HOME variable.

cat <<EOF | sudo tee /etc/profile.d/hadoop_java.sh
export JAVA_HOME=\$(dirname \$(dirname \$(readlink \$(readlink \$(which javac)))))
export PATH=\$PATH:\$JAVA_HOME/bin
EOF

Load the new $PATH and JAVA_HOME settings into your current shell.

source /etc/profile.d/hadoop_java.sh

Then test:

$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-11.0.20.0.8-3.el8_8.x86_64
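
The nested readlink calls above assume javac is reached through exactly two symlinks, which is what the alternatives system typically sets up. As a minimal sketch, assuming GNU readlink is available, the same lookup can be written with readlink -f, which follows the whole chain however long it is. The name resolve_java_home is a hypothetical helper used only for illustration.

```shell
# Resolve a JDK home directory from the path of its javac binary.
# resolve_java_home is a hypothetical helper, not part of the JDK or Hadoop.
resolve_java_home() {
    local javac_path="$1"
    local real_path
    # readlink -f follows every symlink in the chain (GNU coreutils).
    real_path="$(readlink -f "$javac_path")" || return 1
    # Strip the trailing /bin/javac to get the JDK root.
    dirname "$(dirname "$real_path")"
}

# Only run the lookup when javac is actually on PATH.
if command -v javac >/dev/null 2>&1; then
    echo "JAVA_HOME candidate: $(resolve_java_home "$(command -v javac)")"
else
    echo "javac not found on PATH; install a JDK first"
fi
```

The printed path should match the value of $JAVA_HOME shown above.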

Step 3: Create a user account for Hadoop

Let’s create a separate user for Hadoop so we have isolation between the Hadoop file system and the Unix file system.

sudo adduser hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop

Once the user is added, generate an SSH key pair for the user.

$ sudo su - hadoop
$ ssh-keygen -t rsa
 Generating public/private rsa key pair.
 Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
 Created directory '/home/hadoop/.ssh'.
 Enter passphrase (empty for no passphrase): 
 Enter same passphrase again: 
 Your identification has been saved in /home/hadoop/.ssh/id_rsa.
 Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
 The key fingerprint is:
 SHA256:mA1b0nzdKcwv/LPktvlA5R9LyNe9UWt+z1z0AjzySt4 hadoop@hbase
 The key's randomart image is:
 +---[RSA 2048]----+
 |                 |
 |       o   + . . |
 |      o + . = o o|
 |       O . o.o.o=|
 |      + S . *ooB=|
 |           o *=.B|
 |          . . *+=|
 |         o o o.O+|
 |          o E.=o=|
 +----[SHA256]-----+

Add this user’s key to the list of authorized SSH keys.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Verify that you can SSH using the added key.

$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:WTqP642Xijk3xtTb/zt32o0Q7PqYlxzwX+H/B72z4P4.
ECDSA key fingerprint is MD5:47:dc:17:78:63:f7:bc:12:72:70:4b:e3:2f:8a:c3:8d.
Are you sure you want to continue connecting (yes/no)? yes
Last login: Thu Sep 28 19:23:28 2023

Step 4: Download and Install Hadoop

Check for the most recent version of Hadoop before downloading the version specified here. As of this writing, this is version 3.3.6.

Save the recent version to a variable.

RELEASE="3.3.6"

Then download Hadoop archive to your local system.

wget https://dlcdn.apache.org/hadoop/common/hadoop-$RELEASE/hadoop-$RELEASE.tar.gz
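
Before extracting the archive, it can be worth comparing it against the SHA-512 digest that Apache publishes alongside each release. The following is a minimal sketch, assuming sha512sum from GNU coreutils is installed; verify_sha512 is a made-up helper name, and the expected digest has to be copied by hand from the Apache download page.

```shell
# Compare a file's SHA-512 digest with an expected hex string.
# verify_sha512 is a hypothetical helper; it returns 0 on match, 1 otherwise.
verify_sha512() {
    local file="$1" expected actual
    # Accept upper- or lower-case hex in the expected digest.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    actual="$(sha512sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Example usage (EXPECTED is a placeholder you must fill in yourself):
#   EXPECTED="paste-the-published-digest-here"
#   verify_sha512 "hadoop-$RELEASE.tar.gz" "$EXPECTED" \
#       && echo "checksum OK" || echo "checksum mismatch, do not extract"
```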

Extract the file.

tar -xzvf hadoop-$RELEASE.tar.gz

Move resulting directory to /usr/local/hadoop.

sudo mv hadoop-$RELEASE/ /usr/local/hadoop

Set HADOOP_HOME and add directory with Hadoop binaries to your $PATH.

cat <<EOF | sudo tee /etc/profile.d/hadoop_java.sh
export JAVA_HOME=\$(dirname \$(dirname \$(readlink \$(readlink \$(which javac)))))
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=\$HADOOP_HOME
export HADOOP_MAPRED_HOME=\$HADOOP_HOME
export YARN_HOME=\$HADOOP_HOME
export HADOOP_COMMON_HOME=\$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native
export PATH=\$PATH:\$JAVA_HOME/bin:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
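
The backslashes in the heredoc above matter: with an unquoted EOF delimiter, the shell expands variables and command substitutions while it writes the file, unless each dollar sign is escaped. Here is a small sketch of the difference, using a throwaway directory and a made-up GREETING variable. Quoting the delimiter ('EOF') is an alternative to escaping every dollar sign.

```shell
demo_dir="$(mktemp -d)"
GREETING="expanded-now"

# Unquoted delimiter: $GREETING is expanded while the file is written.
cat <<EOF > "$demo_dir/unquoted.sh"
export VALUE=$GREETING
EOF

# Quoted delimiter: the text is written verbatim, so expansion is
# deferred until the file is sourced.
cat <<'EOF' > "$demo_dir/quoted.sh"
export VALUE=$GREETING
EOF

cat "$demo_dir/unquoted.sh"   # prints: export VALUE=expanded-now
cat "$demo_dir/quoted.sh"     # prints: export VALUE=$GREETING
```

For a profile script, deferred expansion is usually what you want, so that JAVA_HOME is resolved at login rather than frozen at the moment the file was created.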

Source the file.

source /etc/profile.d/hadoop_java.sh

Confirm your Hadoop version.

$ hadoop version
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c
Compiled by ubuntu on 2023-06-18T08:22Z
Compiled on platform linux-x86_64
Compiled with protoc 3.7.1
From source with checksum 5652179ad55f76cb287d9c633bb53bbd
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar

Step 5: Configure Hadoop cluster

All your Hadoop configurations are located under /usr/local/hadoop/etc/hadoop/ directory.

A number of configuration files need to be modified to complete Hadoop installation on CentOS 7.

First edit JAVA_HOME in shell script hadoop-env.sh:

$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# Set JAVA_HOME - Line 54
export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))

Then configure:

1. core-site.xml

The core-site.xml file contains Hadoop cluster information used when starting up. These properties include:

The port number used for Hadoop instance
The memory allocated for file system
The memory limit for data storage
The size of Read / Write buffers.

Open core-site.xml

sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
      <description>The default file system URI</description>
   </property>
</configuration>
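
To double check what ended up in a Hadoop XML file, a property can be read back from the shell. This is a rough sketch that assumes each name and value sits on its own line, as in the snippet above; get_property is a hypothetical helper and not a Hadoop command. The demo writes its own sample file and uses fs.defaultFS, the current name of the property that older configurations spell fs.default.name.

```shell
# Print the <value> of the first property whose <name> matches.
# get_property is a made-up helper; it is not a real XML parser.
get_property() {
    local file="$1" name="$2"
    awk -v want="$name" '
        /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }
        /<value>/ { v = $0; gsub(/.*<value>|<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$file"
}

# Demo against a temporary file that mirrors the layout shown above.
demo_conf="$(mktemp)"
cat <<'EOF' > "$demo_conf"
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
EOF

get_property "$demo_conf" fs.defaultFS   # prints: hdfs://localhost:9000
```

On a running installation, hdfs getconf -confKey followed by the property name reports the value Hadoop actually resolved.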


2. hdfs-site.xml

This file needs to be configured for each host to be used in the cluster. This file holds information such as:

The namenode and datanode paths on the local filesystem.
The replication factor for data blocks

Using Dedicated data disk (optional)

In this setup, I want to store Hadoop data on a secondary disk, /dev/sdb.

$ lsblk
 NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
 sda      8:0    0 76.3G  0 disk 
 └─sda1   8:1    0 76.3G  0 part /
 sdb      8:16   0   50G  0 disk 
 sr0     11:0    1 1024M  0 rom

I’ll partition this disk and mount it on the /hadoop directory. Replace sdX in the commands below with your disk name (sdb in this example).

DISK="sdX"
sudo parted -s -- /dev/$DISK mklabel gpt
sudo parted -s -a optimal -- /dev/$DISK mkpart primary 0% 100%
sudo parted -s -- /dev/$DISK align-check optimal 1
sudo mkfs.xfs /dev/${DISK}1
sudo mkdir /hadoop
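
When building the partition path from a variable, the braces in ${DISK}1 are needed, because the shell reads $DISK1 as a different variable named DISK1. Device names that end in a digit (NVMe disks, for instance) also insert a p before the partition number. A minimal sketch of both rules, with partition_path as a hypothetical helper name:

```shell
# Build the device path of partition number $2 on disk $1.
# partition_path is a made-up helper for illustration only.
partition_path() {
    local disk="$1" index="$2"
    case "$disk" in
        # Names ending in a digit get a "p" separator: nvme0n1 -> nvme0n1p1
        *[0-9]) printf '/dev/%sp%s\n' "$disk" "$index" ;;
        # Classic names do not: sdb -> sdb1
        *)      printf '/dev/%s%s\n' "$disk" "$index" ;;
    esac
}

partition_path sdb 1       # prints: /dev/sdb1
partition_path nvme0n1 1   # prints: /dev/nvme0n1p1
```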

Update /etc/fstab file.

$ sudo vim /etc/fstab
/dev/sdX1 /hadoop xfs defaults 0 0

$ sudo mount -a

Confirm:

$ df -hT | grep /dev/sdb1
/dev/sdb1      xfs        50G   33M   50G   1% /hadoop

Configure data directory

Create directories for namenode and datanode.

sudo mkdir -p /hadoop/hdfs/{namenode,datanode}

Set ownership to hadoop user and group.

sudo chown -R hadoop:hadoop /hadoop

Now open the file:

sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Then add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
	
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///hadoop/hdfs/namenode</value>
   </property>
	
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///hadoop/hdfs/datanode</value>
   </property>
</configuration>


3. mapred-site.xml

This is where you set the MapReduce framework to use.

sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

Set like below.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

4. yarn-site.xml

Settings in this file override the default configuration for Hadoop YARN. It defines resource management and job scheduling logic.

sudo vim /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>


Step 6: Validate Hadoop Configurations

Initialize the Hadoop infrastructure store by formatting the NameNode as the hadoop user.

sudo su - hadoop
hdfs namenode -format
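
Formatting is meant to be run once: formatting a NameNode directory that already holds metadata discards that metadata. Below is a minimal sketch of a guard, assuming the NameNode directory configured in hdfs-site.xml above and relying on the current/VERSION file that formatting creates. The helper name namenode_is_formatted is hypothetical.

```shell
# Return 0 when the given NameNode directory has already been formatted.
# namenode_is_formatted is a made-up helper, not a Hadoop command.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Path taken from the hdfs-site.xml example in this guide.
NAME_DIR="/hadoop/hdfs/namenode"

if namenode_is_formatted "$NAME_DIR"; then
    echo "NameNode at $NAME_DIR is already formatted; skipping"
else
    echo "NameNode at $NAME_DIR is not formatted yet; run: hdfs namenode -format"
fi
```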


Test HDFS configurations.

$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [hbase]



Lastly, verify the YARN configuration:



$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers



Hadoop 3.x default Web UI ports:




NameNode - Default HTTP port is 9870.



ResourceManager - Default HTTP port is 8088.



MapReduce JobHistory Server - Default HTTP port is 19888.




You can check the ports used by Hadoop using:



$ ss -tunelp | grep java
tcp   LISTEN 0      256          0.0.0.0:40323      0.0.0.0:*    users:(("java",pid=6546,fd=382)) uid:1000 ino:83704 sk:3 <->
tcp   LISTEN 0      256          0.0.0.0:8040       0.0.0.0:*    users:(("java",pid=6546,fd=393)) uid:1000 ino:84720 sk:4 <->
tcp   LISTEN 0      2048         0.0.0.0:9864       0.0.0.0:*    users:(("java",pid=5868,fd=372)) uid:1000 ino:70960 sk:5 <->
tcp   LISTEN 0      256        127.0.0.1:9000       0.0.0.0:*    users:(("java",pid=5723,fd=349)) uid:1000 ino:69629 sk:6 <->
tcp   LISTEN 0      128          0.0.0.0:8042       0.0.0.0:*    users:(("java",pid=6546,fd=404)) uid:1000 ino:86340 sk:7 <->
tcp   LISTEN 0      256          0.0.0.0:9866       0.0.0.0:*    users:(("java",pid=5868,fd=342)) uid:1000 ino:71906 sk:8 <->
tcp   LISTEN 0      256          0.0.0.0:9867       0.0.0.0:*    users:(("java",pid=5868,fd=373)) uid:1000 ino:71319 sk:9 <->
tcp   LISTEN 0      128        127.0.0.1:46443      0.0.0.0:*    users:(("java",pid=5868,fd=343)) uid:1000 ino:71921 sk:a <->
tcp   LISTEN 0      128          0.0.0.0:9868       0.0.0.0:*    users:(("java",pid=6127,fd=343)) uid:1000 ino:72624 sk:b <->
tcp   LISTEN 0      128          0.0.0.0:9870       0.0.0.0:*    users:(("java",pid=5723,fd=341)) uid:1000 ino:69273 sk:c <->
tcp   LISTEN 0      128          0.0.0.0:8088       0.0.0.0:*    users:(("java",pid=6406,fd=366)) uid:1000 ino:76748 sk:e <->
tcp   LISTEN 0      128          0.0.0.0:13562      0.0.0.0:*    users:(("java",pid=6546,fd=403)) uid:1000 ino:84736 sk:f <->
tcp   LISTEN 0      256          0.0.0.0:8030       0.0.0.0:*    users:(("java",pid=6406,fd=393)) uid:1000 ino:86229 sk:10 <->
tcp   LISTEN 0      256          0.0.0.0:8031       0.0.0.0:*    users:(("java",pid=6406,fd=383)) uid:1000 ino:84665 sk:11 <->
tcp   LISTEN 0      256          0.0.0.0:8032       0.0.0.0:*    users:(("java",pid=6406,fd=403)) uid:1000 ino:86337 sk:12 <->
tcp   LISTEN 0      256          0.0.0.0:8033       0.0.0.0:*    users:(("java",pid=6406,fd=373)) uid:1000 ino:77315 sk:13 <->
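
To reduce that output to just the port numbers, a short awk filter can help. This is a sketch that assumes the column layout shown above (state in column 2, local address in column 5); listening_ports is a made-up helper name, and the sample lines are abbreviated copies of the output above.

```shell
# Read ss output on stdin and print the unique listening ports, sorted.
# listening_ports is a hypothetical helper for illustration.
listening_ports() {
    awk '$2 == "LISTEN" { n = split($5, parts, ":"); print parts[n] }' | sort -n | uniq
}

# Sample lines stand in for a live system.
sample='tcp   LISTEN 0      128          0.0.0.0:9870       0.0.0.0:*    users:(("java",pid=5723,fd=341))
tcp   LISTEN 0      256        127.0.0.1:9000       0.0.0.0:*    users:(("java",pid=5723,fd=349))
tcp   LISTEN 0      128          0.0.0.0:8088       0.0.0.0:*    users:(("java",pid=6406,fd=366))'

printf '%s\n' "$sample" | listening_ports
# prints 8088, 9000 and 9870, one per line
```

On a live system, the same filter can be fed directly: ss -tunelp | grep java | listening_ports.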



Access Hadoop Web Dashboard on http://ServerIP:9870.

Check Hadoop Cluster Overview at http://ServerIP:8088.

Test to see if you can create a directory in HDFS.



$ hadoop fs -mkdir /test
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2023-09-28 19:48 /test



Stopping Hadoop Services



Use the commands:



stop-dfs.sh
stop-yarn.sh



Installing HBase on CentOS / RHEL / Rocky Linux



You can choose to install HBase in Standalone Mode or Pseudo-Distributed Mode. The setup process is similar to our Hadoop installation. 



Step 1: Download and Install HBase



Check the latest release or stable release version before you download. For production use, I recommend you go with the stable release.



VER="2.5.5"
wget http://apache.mirror.gtcomm.net/hbase/stable/hbase-$VER-bin.tar.gz



Extract the downloaded HBase archive.



tar xvf hbase-$VER-bin.tar.gz
sudo mv hbase-$VER/ /usr/local/HBase/



Update your $PATH values.



cat <<EOF | sudo tee /etc/profile.d/hadoop_java.sh
export JAVA_HOME=\$(dirname \$(dirname \$(readlink \$(readlink \$(which javac)))))
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=\$HADOOP_HOME
export HADOOP_MAPRED_HOME=\$HADOOP_HOME
export YARN_HOME=\$HADOOP_HOME
export HADOOP_COMMON_HOME=\$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native
export HBASE_HOME=/usr/local/HBase
export PATH=\$PATH:\$JAVA_HOME/bin:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin:\$HBASE_HOME/bin
EOF



Update your shell environment values.



$ source /etc/profile.d/hadoop_java.sh
$ echo $HBASE_HOME
/usr/local/HBase



Edit JAVA_HOME in shell script hbase-env.sh:



$ sudo vim /usr/local/HBase/conf/hbase-env.sh
# Set JAVA_HOME - Line 27
export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))



Step 2: Configure HBase



We will configure HBase the same way we did Hadoop. All configuration files for HBase are located in the /usr/local/HBase/conf/ directory.



Option 1: Install HBase in Standalone Mode (Not recommended)



In standalone mode, all daemons (HMaster, HRegionServer, and ZooKeeper) run in a single JVM process.



Create HBase root directory.



sudo mkdir -p /hadoop/HBase/HFiles
sudo mkdir -p /hadoop/zookeeper
sudo chown -R hadoop:hadoop /hadoop/



Open the file for editing.



sudo vim /usr/local/HBase/conf/hbase-site.xml



Now add the following configurations between the <configuration> and </configuration> tags so the file looks like below.



<configuration>
   <property>
      <name>hbase.rootdir</name>
      <value>file:/hadoop/HBase/HFiles</value>
   </property>
	
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/hadoop/zookeeper</value>
   </property>
</configuration>



If you do not configure the hbase.rootdir property, HBase stores your data under /tmp/ by default.



Now start HBase by using start-hbase.sh script in HBase bin directory.



$ sudo su - hadoop
$ start-hbase.sh 
running master, logging to /usr/local/HBase/logs/hbase-hadoop-master-hbase.out



Option 2: Install HBase in Pseudo-Distributed Mode (Recommended)



The file: value of hbase.rootdir set in Option 1 runs HBase in Standalone Mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process.



To install HBase in Pseudo-Distributed Mode, set the values in hbase-site.xml to:



<configuration>
   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
   </property>
	
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/hadoop/zookeeper</value>
   </property>
   
   <property>
     <name>hbase.cluster.distributed</name>
     <value>true</value>
   </property>
</configuration>
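
The hbase.rootdir URI has to point at the same NameNode host and port that core-site.xml declares for HDFS (hdfs://localhost:9000 in this guide), otherwise HBase cannot reach its storage. A small sketch to compare the two values by hand; uri_authority and same_namenode are hypothetical helper names.

```shell
# Extract host:port from a URI such as hdfs://localhost:9000/hbase.
# uri_authority is a made-up helper for illustration.
uri_authority() {
    local uri="$1" rest
    rest="${uri#*://}"             # drop the scheme
    printf '%s\n' "${rest%%/*}"    # drop the path
}

# Succeed only when both URIs name the same host:port.
same_namenode() {
    [ "$(uri_authority "$1")" = "$(uri_authority "$2")" ]
}

if same_namenode "hdfs://localhost:9000" "hdfs://localhost:9000/hbase"; then
    echo "hbase.rootdir matches the HDFS address"
else
    echo "hbase.rootdir points at a different address"
fi
```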



In this setup, your data is stored in HDFS instead.



Ensure the ZooKeeper directory is created.



sudo mkdir -p /hadoop/zookeeper
sudo chown -R hadoop:hadoop /hadoop/



Now start HBase by using start-hbase.sh script in HBase bin directory.



$ sudo su - hadoop
$ start-hbase.sh 
localhost: running zookeeper, logging to /usr/local/HBase/logs/hbase-hadoop-zookeeper-hbase.out
running master, logging to /usr/local/HBase/logs/hbase-hadoop-master-hbase.out
: running regionserver, logging to /usr/local/HBase/logs/hbase-hadoop-regionserver-hbase.out



Check the HBase Directory in HDFS:



$ hadoop fs -ls /hbase
Found 9 items
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:19 /hbase/.tmp
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:19 /hbase/MasterProcWALs
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:19 /hbase/WALs
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:17 /hbase/corrupt
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:16 /hbase/data
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:16 /hbase/hbase
-rw-r--r--   1 hadoop supergroup         42 2023-04-07 09:16 /hbase/hbase.id
-rw-r--r--   1 hadoop supergroup          7 2023-04-07 09:16 /hbase/hbase.version
drwxr-xr-x   - hadoop supergroup          0 2023-04-07 09:17 /hbase/oldWALs



Step 3: Managing HMaster & HRegionServer



The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters, counting the primary.



The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode.



Backup Masters and additional RegionServers can be started and stopped using the scripts local-master-backup.sh and local-regionservers.sh respectively.



# Start backup HMaster
local-master-backup.sh start 2 

# Start multiple RegionServers
local-regionservers.sh start 3 




Each HMaster uses two ports (16000 and 16010 by default). The port offset is added to these ports, so with an offset of 2 the backup HMaster would use ports 16002 and 16012.




The following command starts 3 backup servers using ports 16002/16012, 16003/16013, and 16005/16015.



local-master-backup.sh start 2 3 5




Each RegionServer requires two ports, and the default ports are 16020 and 16030.




The following command starts four additional RegionServers, running on sequential ports starting at 16022/16032 (base ports 16020/16030 plus 2).



local-regionservers.sh start 2 3 4 5
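
The port arithmetic described above can be sketched in a few lines of shell. offset_ports is a made-up helper that only does the addition; it does not start or query any HBase process.

```shell
# Print the port pair for each offset, given the two base ports.
# offset_ports is a hypothetical helper for illustration.
offset_ports() {
    local first_base="$1" second_base="$2" offset
    shift 2
    for offset in "$@"; do
        printf '%d/%d\n' "$((first_base + offset))" "$((second_base + offset))"
    done
}

echo "Backup HMasters (offsets 2 3 5):"
offset_ports 16000 16010 2 3 5
# prints 16002/16012, 16003/16013, 16005/16015

echo "Additional RegionServers (offsets 2 3 4 5):"
offset_ports 16020 16030 2 3 4 5
# prints 16022/16032, 16023/16033, 16024/16034, 16025/16035
```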



To stop a server, replace start with stop in the command, followed by the offset of the server to stop. For example:



local-regionservers.sh stop 5



Starting HBase Shell



Hadoop and HBase should be running before you can use the HBase shell. Here is the correct order for starting the services.



start-all.sh
start-hbase.sh



Then use HBase shell.



hadoop@hbase:~$ hbase shell
....
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.5.5, r7ebd4381261fefd78fc2acf258a95184f4147cee, Thu Jun  1 17:42:49 PDT 2023
Took 0.0024 seconds
hbase:001:0>
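
Once the prompt is available, a quick way to confirm that reads and writes work is to create a throwaway table. The sketch below writes a few standard HBase shell commands to a temporary file and feeds them to hbase shell in non-interactive mode. The table name smoke_test and column family cf are made up for this example, and the commands only run if hbase is on your PATH.

```shell
# Write a short HBase shell script to a temporary file.
hbase_script="$(mktemp)"
cat <<'EOF' > "$hbase_script"
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:greeting', 'hello'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF

# Run it non-interactively only when the hbase command is available.
if command -v hbase >/dev/null 2>&1; then
    hbase shell -n < "$hbase_script"
else
    echo "hbase not found on PATH; commands saved in $hbase_script"
fi
```

The same five commands can also be typed one by one at the interactive prompt.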



Stopping HBase



stop-hbase.sh






Conclusion



You have successfully installed Hadoop and HBase on CentOS 7. Refer to the Apache Hadoop documentation and the Apache HBase book to learn more.


