Welcome to our guide on how to install Apache Spark on Ubuntu 20.04/18.04 & Debian 9/8/10. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It is a fast, unified analytics engine for big data and machine learning workloads.

Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Install Apache Spark on Ubuntu 20.04/18.04 / Debian 9/8/10

Before we install Apache Spark on Ubuntu / Debian, let’s update our system packages.

sudo apt update
sudo apt -y upgrade

Reboot the system after the upgrade if a reboot is required.

[ -f /var/run/reboot-required ] && sudo reboot -f

Now follow the steps below to install Spark on Ubuntu / Debian.

Step 1: Install Java

Apache Spark requires Java to run. Let's make sure Java is installed on our Ubuntu / Debian system.

To install the default system JDK, along with curl and mlocate (both used later in this guide):

sudo apt install curl mlocate default-jdk -y

Verify the Java version using the command:

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
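
Spark picks up Java from your PATH, but you can optionally pin JAVA_HOME as well. A minimal sketch, assuming the default-jdk package placed the JDK under /usr/lib/jvm/default-java (the usual location on Ubuntu / Debian):

# Optional: point JAVA_HOME at the default JDK (path assumes the default-jdk package)
echo 'export JAVA_HOME=/usr/lib/jvm/default-java' >> ~/.bashrc
source ~/.bashrc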

If the add-apt-repository command is missing on your system, check How to Install add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page. As of this update, this is 3.1.1.

curl -O https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Extract the Spark tarball.

tar xvf spark-3.1.1-bin-hadoop3.2.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-3.1.1-bin-hadoop3.2/ /opt/spark 

Set Spark environment variables

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Activate the changes.

source ~/.bashrc
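
To confirm that the Spark binaries are now on your PATH, you can print the version banner (spark-submit exits right after printing it):

spark-submit --version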

Step 3: Start a standalone master server

You can now start a standalone master server using the start-master.sh command.

$ start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The process will be listening on TCP port 8080, which serves the master Web UI.

$ sudo ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 
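
If you are reaching the server over the network and a host firewall such as ufw is enabled (an assumption; skip this if you do not use ufw), open the Web UI and master ports so workers and your browser can connect:

sudo ufw allow 8080/tcp
sudo ufw allow 7077/tcp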

The master Web UI looks like the screenshot below.

[Screenshot: Apache Spark master Web UI]

My Spark URL is spark://ubuntu:7077.
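
If ports 7077 or 8080 clash with other services on your machine, the master script accepts flags to override them. A sketch, assuming you want the Web UI on port 8081 instead:

start-master.sh --port 7077 --webui-port 8081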

Step 4: Start the Spark Worker process

The start-slave.sh command is used to start a Spark Worker process. Pass it the master URL shown in the previous step.

$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out
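
By default the worker offers all of the machine's cores and most of its memory to the master. The worker script also accepts flags to cap what it advertises; a sketch, assuming you want to limit it to 2 cores and 1 GB of RAM:

start-slave.sh spark://ubuntu:7077 --cores 2 --memory 1G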

If you don’t have the script in your $PATH, you can first locate it.

$ sudo updatedb
$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

You can also use the absolute path to run the script.
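
With the master and worker both running, you can optionally submit the bundled SparkPi example to confirm that the cluster accepts jobs. The jar name below matches the Spark 3.1.1 / Hadoop 3.2 download used in this guide; adjust it if your version differs:

/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://ubuntu:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar 100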

Step 5: Using Spark shell

Use the spark-shell command to access the Spark shell.

$ /opt/spark/bin/spark-shell
21/04/27 08:49:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:49:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513355938).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
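
As a quick sanity check, you can evaluate a small computation at the prompt, for example summing the integers 1 through 100 (the result should be 5050.0):

scala> sc.parallelize(1 to 100).sum()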

If you’re more of a Python person, use pyspark.

$ /opt/spark/bin/pyspark
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/04/27 08:50:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:50:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:50:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.8.5 (default, Jan 27 2021 15:41:15)
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513411109).
SparkSession available as 'spark'.
>>>
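
The same sanity check works from pyspark, for example:

>>> sc.parallelize(range(1, 101)).sum()
5050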

Shut down the master and slave (worker) Spark processes using the commands below.

$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
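
If both processes run on the same host, you can also stop them in one go with the bundled stop-all.sh script:

$SPARK_HOME/sbin/stop-all.sh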

There you have it. Read more in the official Spark documentation.
