
Welcome to our guide on how to install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9/8. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework and a fast, unified analytics engine for big data and machine learning processing.

Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Install Apache Spark on Ubuntu 20.04/18.04 / Debian 10/9/8

Before we install Apache Spark on Ubuntu / Debian, let’s update our system packages.

sudo apt update
sudo apt -y upgrade

Now follow the steps below to install Spark on your Ubuntu / Debian system.

Step 1: Install Java

Apache Spark requires Java to run. Let's make sure Java is installed on our Ubuntu / Debian system.

For default system Java:

sudo apt install default-jdk

Verify Java version using the command:

java -version

If you specifically need Java 8 on Ubuntu 18.04, OpenJDK 8 is available from the standard repositories:

sudo apt update
sudo apt install openjdk-8-jdk
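With more than one Java version installed, you can switch the system default interactively using the standard update-alternatives tool:

sudo update-alternatives --config java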

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page. As of this update, the latest release is 2.4.5.

curl -O https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
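Optionally, verify the integrity of the download. Apache publishes a SHA-512 checksum alongside each release artifact; fetch it and compare the digest against your local file:

curl -O https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz.sha512
sha512sum spark-2.4.5-bin-hadoop2.7.tgz

The digest printed by sha512sum should match the value in the .sha512 file.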

Extract the Spark tarball.

tar xvf spark-2.4.5-bin-hadoop2.7.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark 

Set Spark environment variables

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Activate the changes.

source ~/.bashrc
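Confirm the variables are set and that the Spark scripts now resolve from your shell:

echo $SPARK_HOME
which start-master.sh

The second command should print /opt/spark/sbin/start-master.sh.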

Step 3: Start a standalone master server

You can now start a standalone master server using the start-master.sh command.

# start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The master web UI listens on TCP port 8080, while the master service itself accepts worker connections on port 7077.

# ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 

The master Web UI is reachable in a browser at http://<server-IP>:8080. In this setup, the Spark master URL shown there is spark://ubuntu:7077.
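If a firewall such as UFW is active on the server, open the relevant ports before browsing the UI or connecting workers from other machines (not needed on a single-node setup):

sudo ufw allow 8080/tcp
sudo ufw allow 7077/tcp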

Step 4: Start the Spark Worker Process

The start-slave.sh command is used to start a Spark worker process and register it with the master.

$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If you don’t have the script in your $PATH, you can first locate it.

$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

You can also use the absolute path to run the script.
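By default a worker offers all CPU cores and most of the machine's memory to Spark. The script accepts flags to cap this; for example, to offer 2 cores and 1 GB of memory:

/opt/spark/sbin/start-slave.sh -c 2 -m 1G spark://ubuntu:7077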

Step 5: Use the Spark Shell

Use the spark-shell command to access the interactive Spark shell.

# /opt/spark/bin/spark-shell
19/04/25 21:48:59 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:48:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:49:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://static.13.127.203.116.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1556221755866).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Hello Spark World")
Hello Spark World
scala> 
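To verify the shell can run a real computation, try a small distributed job. This sketch parallelizes the numbers 1 to 100 as an RDD and counts the even ones (the res variable number may differ in your session):

scala> sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
res1: Long = 50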

If you’re more of a Python person, use pyspark.

# /opt/spark/bin/pyspark
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/04/25 21:53:44 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 116.203.127.13 instead (on interface eth0)
19/04/25 21:53:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.1.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/04/25 21:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 2.7.15rc1 (default, Nov 12 2018 14:31:15)
SparkSession available as 'spark'.
>>> 
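The same quick check works in PySpark through the SparkSession. Here spark.range(100) builds a DataFrame of ids 0 through 99, so counting the even ids returns 50:

>>> spark.range(100).filter("id % 2 == 0").count()
50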

Shut down the master and worker Spark processes using the commands below.

$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-master.sh
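Alternatively, the stop-all.sh script in the same sbin directory stops the master and all workers started on the host in one go:

$ $SPARK_HOME/sbin/stop-all.sh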

There you have it. Read more in the official Spark documentation.
