
Install Spark on Ubuntu (20.04, 22.04, 24.04)

Apache Spark is an open-source, general-purpose distributed cluster-computing framework: a fast, unified analytics engine for big data and machine learning workloads.


Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Install Apache Spark on Ubuntu

Before we install Apache Spark on Ubuntu let’s update our system packages.

sudo apt update && sudo apt -y full-upgrade

Reboot the system if the upgrade requires it:

[ -f /var/run/reboot-required ] && sudo reboot -f

Now use the steps shown next to install Spark on Ubuntu.

Step 1: Install Java Runtime

Apache Spark requires Java to run, so let's make sure Java is installed on our Ubuntu system.

To install the default system JDK, along with curl and mlocate (which provides the locate command used later):

sudo apt install curl mlocate default-jdk -y

Verify Java version using the command:

$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

If the add-apt-repository command is missing, see Enable add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page, setting VER to the version number shown there.

VER=3.5.1
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz
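Since the version number appears twice in the download URL, it can help to build the URL from VER explicitly and inspect it before downloading. A minimal sketch (pure string expansion, no network access):

```shell
# Build the download URL from the version variable so switching
# versions only requires changing VER.
VER=3.5.1
URL="https://dlcdn.apache.org/spark/spark-${VER}/spark-${VER}-bin-hadoop3.tgz"
echo "${URL}"
```

The downloads page also publishes SHA-512 checksums for each release, which you can compare against the downloaded archive before extracting it.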

Extract the Spark tarball.

tar xvf spark-$VER-bin-hadoop3.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-$VER-bin-hadoop3/ /opt/spark 

Set Spark environment

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
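Editing ~/.bashrc by hand works fine, but if you re-run this guide the exports get appended again each time. A guarded append avoids duplicate lines; here is a minimal sketch, shown against a temporary file so it is safe to try as-is (substitute ~/.bashrc on a real system):

```shell
# Append each export line only if it is not already present in the file.
RC="$(mktemp)"    # stand-in for ~/.bashrc in this sketch
for line in 'export SPARK_HOME=/opt/spark' \
            'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'; do
  grep -qxF "$line" "$RC" || echo "$line" >> "$RC"
  grep -qxF "$line" "$RC" || echo "$line" >> "$RC"   # second pass is a no-op
done
grep -c 'SPARK_HOME' "$RC"   # -> 2 (each line added exactly once)
rm -f "$RC"
```

grep -qxF does a quiet, whole-line, fixed-string match, so the echo only runs when the exact line is absent.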

Activate the changes.

source ~/.bashrc

Step 3: Start standalone master server

You can now start a standalone master server using the start-master.sh command.

$ start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The master's web UI will be listening on TCP port 8080.

$ sudo ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 

The Web UI, reachable at http://<server-IP>:8080, looks like this:

[Screenshot: Apache Spark master Web UI]

My Spark URL is spark://ubuntu:7077.
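By default the master binds to the machine's hostname and the default ports. To pin these explicitly, you can set variables in $SPARK_HOME/conf/spark-env.sh (copy it from the bundled spark-env.sh.template first). A sketch; the values below are examples matching this walkthrough, so adjust them for your host:

```shell
# /opt/spark/conf/spark-env.sh -- example values, adjust for your host
SPARK_MASTER_HOST=ubuntu       # hostname or IP the master binds to
SPARK_MASTER_PORT=7077         # cluster port (used in the spark:// URL)
SPARK_MASTER_WEBUI_PORT=8080   # web UI port
```

Restart the master after editing this file for the changes to take effect.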

Step 4: Start the Spark Worker Process

In Spark 3.x the start-worker.sh command (named start-slave.sh in older releases) is used to start a Spark worker process, pointing it at the master URL.

$ start-worker.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If you don't have the script in your $PATH, you can locate it first.

$ sudo updatedb
$ locate start-worker.sh
/opt/spark/sbin/start-worker.sh

You can also use the absolute path to run the script.
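With the master and a worker running, you can smoke-test the cluster by submitting the SparkPi example that ships with Spark. A sketch, assuming Spark lives in /opt/spark and the master URL from Step 3 (spark://ubuntu:7077); the block prints a notice instead of failing if Spark is not at that path:

```shell
# Submit the bundled SparkPi example to the standalone master.
SPARK_SUBMIT=/opt/spark/bin/spark-submit
if [ -x "$SPARK_SUBMIT" ]; then
  "$SPARK_SUBMIT" \
    --master spark://ubuntu:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_*.jar 100
else
  echo "spark-submit not found at $SPARK_SUBMIT -- is Spark installed in /opt/spark?"
fi
```

On success the job's output includes an estimate of Pi, and the completed application appears in the master's web UI.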

Step 5: Using Spark shell

Use the spark-shell command to access Spark Shell.

$ /opt/spark/bin/spark-shell
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.20.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

If you’re more of a Python person, use pyspark.

$ /opt/spark/bin/pyspark
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://static.199.96.140.128.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1695933483734).
SparkSession available as 'spark'.
>>>

Shut down the worker and master Spark processes using the commands below (in older releases the first script is named stop-slave.sh); stop-all.sh stops both at once.

$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh

There you have it. Read more in the Spark documentation.
