Apache Spark is an open-source distributed general-purpose cluster-computing framework. It is a fast unified analytics engine used for big data and machine learning processing.
Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Install Apache Spark on Ubuntu
Before we install Apache Spark on Ubuntu, let’s update our system packages.
sudo apt update && sudo apt -y full-upgrade
Reboot the system if the upgrade requires it.
[ -f /var/run/reboot-required ] && sudo reboot -f
Now use the steps shown next to install Spark on Ubuntu.
Step 1: Install Java Runtime
Apache Spark requires Java to run. Let’s make sure Java is installed on our Ubuntu system.
For default system Java:
sudo apt install curl mlocate default-jdk -y
Verify Java version using the command:
$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
Step 2: Download Apache Spark
Download the latest release of Apache Spark from the downloads page.
VER=3.5.1
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz
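Optionally, verify the integrity of the download before extracting it. Apache publishes a `.sha512` checksum file alongside each release; the URL below follows the same pattern as the tarball and is assumed to exist for the version you picked.

```shell
# Download the published SHA-512 checksum and verify the tarball against it
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz.sha512
sha512sum -c spark-$VER-bin-hadoop3.tgz.sha512
```

`sha512sum -c` prints `OK` when the tarball matches the published checksum.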
Extract the Spark tarball.
tar xvf spark-$VER-bin-hadoop3.tgz
Move the Spark folder created after extraction to the /opt/ directory.
sudo mv spark-$VER-bin-hadoop3/ /opt/spark
Set Spark environment
Open your bashrc configuration file.
vim ~/.bashrc
Add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate the changes.
source ~/.bashrc
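A quick sanity check, assuming the exports above are in place, confirms the shell can now find the Spark binaries:

```shell
# Both values should point under /opt/spark if .bashrc was sourced correctly
echo "$SPARK_HOME"
command -v spark-shell
```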
Step 3: Start standalone master server
You can now start a standalone master server using the start-master.sh command.
$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out
The process will be listening on TCP port 8080.
$ sudo ss -tunelp | grep 8080
tcp LISTEN 0 1 *:8080 *:* users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <->
The web UI is available in a browser at http://&lt;server-ip&gt;:8080.
My Spark URL is spark://ubuntu:7077.
Step 4: Starting Spark Worker Process
The start-worker.sh script (named start-slave.sh in Spark releases before 3.1) starts a Spark worker process. Pass it the master URL:
$ start-worker.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out
If you don’t have the script in your $PATH, you can first locate it.
$ sudo updatedb
$ locate start-worker.sh
/opt/spark/sbin/start-worker.sh
You can also use the absolute path to run the script.
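To confirm the master and worker are wired together, a minimal smoke test is to submit the bundled SparkPi example to the cluster. The master URL and the examples jar path below match this walkthrough’s layout (/opt/spark, Spark 3.5.1 built for Scala 2.12); adjust them to your own version.

```shell
# Submit the SparkPi example that ships with Spark to the standalone master
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://ubuntu:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar 10
```

A line like `Pi is roughly 3.14...` in the driver output means the job ran on the cluster.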
Step 5: Using Spark shell
Use the spark-shell command to access Spark Shell.
$ /opt/spark/bin/spark-shell
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.20.1)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
If you’re more of a Python person, use pyspark.
$ /opt/spark/bin/pyspark
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/
Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://static.199.96.140.128.clients.your-server.de:4040
Spark context available as 'sc' (master = local[*], app id = local-1695933483734).
SparkSession available as 'spark'.
>>>
Easily shut down the master and worker Spark processes using the commands below.
$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh
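Alternatively, the sbin directory also ships a stop-all.sh helper that stops every Spark daemon launched on this host in one go:

```shell
# Stops the master and any workers started from this machine
$SPARK_HOME/sbin/stop-all.sh
```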
There you have it. Read more in the official Spark documentation.