
Install Spark on Ubuntu (20.04, 22.04, 24.04)

Apache Spark is an open-source, general-purpose distributed cluster-computing framework: a fast, unified analytics engine for big data and machine learning workloads.


Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Install Apache Spark on Ubuntu

Before we install Apache Spark on Ubuntu let’s update our system packages.

sudo apt update && sudo apt -y full-upgrade

Reboot the system if the upgrade requires it:

[ -f /var/run/reboot-required ] && sudo reboot -f

Now use the steps shown next to install Spark on Ubuntu.

Step 1: Install Java Runtime

Apache Spark requires Java to run, so let's make sure Java is installed on our Ubuntu system.

To install the default system JDK, along with curl and mlocate (which provides the locate command used later):

sudo apt install curl mlocate default-jdk -y

Verify Java version using the command:

$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

If the add-apt-repository command is missing, see Enable add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page, setting VER to the version number shown there.

VER=3.5.1
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz
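Since the version number appears twice in the download URL, it can help to build the URL from VER explicitly and inspect it before downloading. A minimal sketch (pure string expansion, no network access):

```shell
# Build the download URL from the version variable so switching
# versions only requires changing VER.
VER=3.5.1
URL="https://dlcdn.apache.org/spark/spark-${VER}/spark-${VER}-bin-hadoop3.tgz"
echo "${URL}"
```

The downloads page also publishes SHA-512 checksums for each release, which you can compare against the downloaded archive before extracting it.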

Extract the Spark tarball.

tar xvf spark-$VER-bin-hadoop3.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-$VER-bin-hadoop3/ /opt/spark 

Set Spark environment

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
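Editing ~/.bashrc by hand works fine, but if you re-run this guide the exports get appended again each time. A guarded append avoids duplicate lines; here is a minimal sketch, shown against a temporary file so it is safe to try as-is (substitute ~/.bashrc on a real system):

```shell
# Append each export line only if it is not already present in the file.
RC="$(mktemp)"    # stand-in for ~/.bashrc in this sketch
for line in 'export SPARK_HOME=/opt/spark' \
            'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'; do
  grep -qxF "$line" "$RC" || echo "$line" >> "$RC"
  grep -qxF "$line" "$RC" || echo "$line" >> "$RC"   # second pass is a no-op
done
grep -c 'SPARK_HOME' "$RC"   # -> 2 (each line added exactly once)
rm -f "$RC"
```

grep -qxF does a quiet, whole-line, fixed-string match, so the echo only runs when the exact line is absent.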

Activate the changes.

source ~/.bashrc

Step 3: Start standalone master server

You can now start a standalone master server using the start-master.sh command.

$ start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The master's web UI will be listening on TCP port 8080.

$ sudo ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 

The Web UI, reachable at http://<server-IP>:8080, looks like this:

[Screenshot: Apache Spark master Web UI]

My Spark URL is spark://ubuntu:7077.
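By default the master binds to the machine's hostname and the default ports. To pin these explicitly, you can set variables in $SPARK_HOME/conf/spark-env.sh (copy it from the bundled spark-env.sh.template first). A sketch; the values below are examples matching this walkthrough, so adjust them for your host:

```shell
# /opt/spark/conf/spark-env.sh -- example values, adjust for your host
SPARK_MASTER_HOST=ubuntu       # hostname or IP the master binds to
SPARK_MASTER_PORT=7077         # cluster port (used in the spark:// URL)
SPARK_MASTER_WEBUI_PORT=8080   # web UI port
```

Restart the master after editing this file for the changes to take effect.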

Step 4: Start the Spark Worker Process

In Spark 3.x the start-worker.sh command (named start-slave.sh in older releases) is used to start a Spark worker process, pointing it at the master URL.

$ start-worker.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If you don't have the script in your $PATH, you can locate it first.

$ sudo updatedb
$ locate start-worker.sh
/opt/spark/sbin/start-worker.sh

You can also use the absolute path to run the script.
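With the master and a worker running, you can smoke-test the cluster by submitting the SparkPi example that ships with Spark. A sketch, assuming Spark lives in /opt/spark and the master URL from Step 3 (spark://ubuntu:7077); the block prints a notice instead of failing if Spark is not at that path:

```shell
# Submit the bundled SparkPi example to the standalone master.
SPARK_SUBMIT=/opt/spark/bin/spark-submit
if [ -x "$SPARK_SUBMIT" ]; then
  "$SPARK_SUBMIT" \
    --master spark://ubuntu:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_*.jar 100
else
  echo "spark-submit not found at $SPARK_SUBMIT -- is Spark installed in /opt/spark?"
fi
```

On success the job's output includes an estimate of Pi, and the completed application appears in the master's web UI.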

Step 5: Using Spark shell

Use the spark-shell command to access Spark Shell.

$ /opt/spark/bin/spark-shell
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.20.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

If you’re more of a Python person, use pyspark.

$ /opt/spark/bin/pyspark
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://static.199.96.140.128.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1695933483734).
SparkSession available as 'spark'.
>>>

Shut down the worker and master Spark processes using the commands below (in older releases the first script is named stop-slave.sh); stop-all.sh stops both at once.

$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh

There you have it. Read more in the Spark documentation.
