Welcome to our guide on how to install Apache Spark on Ubuntu 19.04/18.04 and Debian 9/8/10. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It is a fast, unified analytics engine used for big data and machine learning processing.

Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Install Apache Spark on Ubuntu 19.04/18.04 / Debian 9/8/10

Before we install Apache Spark on Ubuntu / Debian, let’s update our system packages.

sudo apt update
sudo apt -y upgrade

Now follow the steps below to install Spark on Ubuntu / Debian.

Step 1: Install Java

Apache Spark requires Java to run, so let’s make sure Java is installed on our Ubuntu / Debian system.

For default system Java:

sudo apt install default-jdk

Verify the Java version using the command:

java -version
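
Spark also honors the JAVA_HOME environment variable if it is set. A minimal sketch for setting it explicitly follows; the path below is an assumption based on the default-jdk package and may differ on your system (check with update-alternatives --list java).

# Optional: point JAVA_HOME at the system JDK (adjust the path for your setup)
export JAVA_HOME=/usr/lib/jvm/default-java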

For Java 8 on Ubuntu 18.04:

sudo apt update
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer oracle-java8-set-default

If the add-apt-repository command is missing, see How to Install add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page. As of this writing, the latest release is 2.4.2.

curl -O https://www-us.apache.org/dist/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
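
Optionally, verify the integrity of the download before extracting it. A minimal sketch, assuming you compare the output manually against the SHA512 checksum published on the Spark downloads page:

# Compute the tarball's SHA512 checksum and compare it with the published value
sha512sum spark-2.4.2-bin-hadoop2.7.tgz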

Extract the Spark tarball.

tar xvf spark-2.4.2-bin-hadoop2.7.tgz

Move the Spark directory created by the extraction to /opt/spark.

sudo mv spark-2.4.2-bin-hadoop2.7/ /opt/spark 

Set the Spark environment

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Activate the changes.

source ~/.bashrc
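
Before starting any services, you can confirm the new environment is in effect. A quick check:

# Confirm SPARK_HOME is set and the Spark binaries are now on the PATH
echo $SPARK_HOME
spark-submit --version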

Step 3: Start a standalone master server

You can now start a standalone master server using the start-master.sh command.

# start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The process will be listening on TCP port 8080.

# ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 

Open the master’s web UI in your browser on port 8080 (http://<server-ip>:8080).

In this example, the Spark master URL is spark://ubuntu:7077.
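
By default the master binds to the machine’s hostname and uses port 7077 for the service and 8080 for the web UI. If you need different values, start-master.sh accepts options that are passed through to the master; a hedged sketch reusing this example’s hostname:

# Start the master with an explicit host, service port, and web UI port
start-master.sh --host ubuntu --port 7077 --webui-port 8080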

Step 4: Start the Spark worker process

The start-slave.sh command is used to start a Spark worker process, passing it the master URL.

$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If you don’t have the script in your $PATH, you can first locate it.

$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

You can also use the absolute path to run the script.
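
If you want to limit the resources a worker offers to the cluster, start-slave.sh accepts core and memory options; a short sketch using this example’s master URL:

# Start a worker that offers 2 cores and 2 GB of memory to the master
/opt/spark/sbin/start-slave.sh spark://ubuntu:7077 --cores 2 --memory 2g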

Step 5: Use the Spark shell

Use the spark-shell command to access the Spark shell.

# /opt/spark/bin/spark-shell
19/04/25 21:48:59 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:48:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:49:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://static.13.127.203.116.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1556221755866).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Hello Spark World")
Hello Spark World
scala> 
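
Note from the banner above that this shell ran with master = local[*], i.e. entirely on the local machine. To attach the shell to the standalone cluster started in Steps 3 and 4, pass the master URL:

# Launch the Spark shell against the standalone master instead of local mode
/opt/spark/bin/spark-shell --master spark://ubuntu:7077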

If you’re more of a Python person, use pyspark.

# /opt/spark/bin/pyspark
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/04/25 21:53:44 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:53:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 2.7.15rc1 (default, Nov 12 2018 14:31:15)
SparkSession available as 'spark'.
>>> 
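
The shells are convenient for exploration, but non-interactive jobs are normally submitted with spark-submit. A minimal sketch that runs the pi estimation example bundled with the distribution; the example path is an assumption based on the standard layout of the Spark tarball:

# Submit the bundled Python pi example to the standalone master
/opt/spark/bin/spark-submit --master spark://ubuntu:7077 /opt/spark/examples/src/main/python/pi.py 10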

Shut down the Spark master and slave (worker) processes using the commands below.

$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-master.sh
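
If the master and worker run on the same machine, Spark also ships a helper script that stops both at once; a quick sketch:

# Stop the master and all workers listed in conf/slaves (defaults to localhost)
$SPARK_HOME/sbin/stop-all.sh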

There you have it. Read more in the Spark documentation.