Set up a local Spark data platform
Creating a Spark data platform on a local machine is a useful way to learn about Spark and Delta Lake when you do not have access to a cloud platform with technologies such as Databricks or Synapse/Fabric.
These are the steps you can use to set up your own Spark data platform.
This page documents how to set up a Spark data platform on a system running Arch Linux. For other Linux distros, adjust the commands to use the appropriate package manager. I have also done this previously on macOS; the steps are the same, but I used the Homebrew package manager rather than pacman.
Step 1 Update all your packages and install any dependencies.
sudo pacman -Syu tar wget
Step 2 Install Java
Spark runs on Java, so you will need a Java runtime; OpenJDK works well, and I am using version 17.
sudo pacman -S jre17-openjdk
Once installed, validate the version:
java --version
Which should output something like this:
openjdk 17.0.11 2024-04-16
OpenJDK Runtime Environment (build 17.0.11+9)
OpenJDK 64-Bit Server VM (build 17.0.11+9, mixed mode, sharing)
Step 3 Install Scala
If Scala is not installed on your system, download and install it. You will find instructions at https://www.scala-lang.org/download/
To do so, I copied the command provided on the Scala website and ran it as below:
curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup
Make sure that Scala is included in your system's PATH. On my system, Scala was installed at:
~/.local/share/coursier/bin
During the Scala install, there was a prompt indicating that the following would be added to my PATH:
Should we add ~/.local/share/coursier/bin to your PATH via ~/.profile, ~/.bash_profile
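If the installer does not do this for you, you can add the directory yourself, for example by appending this line to ~/.profile:
export PATH="$HOME/.local/share/coursier/bin:$PATH"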
To ensure this location was now in my active PATH, I ran:
source ~/.profile
I then validated this with:
echo $PATH
I also checked that Scala had been installed:
scala -version
To which I received the following output:
Scala code runner version 3.4.2 -- Copyright 2002-2024, LAMP/EPFL
I also received an error saying that JAVA_HOME was not set. To resolve this, simply run Scala by typing:
scala
and it will download the required version of Java if there was a problem earlier.
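Alternatively, you can set JAVA_HOME yourself in ~/.profile and point it at your OpenJDK install. On Arch this is typically /usr/lib/jvm/java-17-openjdk, but adjust the path to match your system:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk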
Step 4 Install Apache Spark
Go to the Spark downloads page and select the version you want; the latest is fine:
https://spark.apache.org/downloads.html
In this article I am downloading spark-3.5.4-bin-hadoop3.tgz and then extracting the tar archive as follows:
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xvf spark-3.5.4-bin-hadoop3.tgz
Move the extracted folder to wherever you want Spark to live on your system. I am moving mine to /usr/local/spark:
sudo mv spark-3.5.4-bin-hadoop3 /usr/local/spark
If you installed this in the same location as above, add /usr/local/spark/bin to your PATH.
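For example, assuming the /usr/local/spark location used here, you could append the following to ~/.profile and source it again:
export SPARK_HOME=/usr/local/spark
export PATH="$SPARK_HOME/bin:$PATH"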
Step 5 Verify your Spark installation
Run the Spark shell with the spark-shell command below and you should see something like the following (note this output was from an install with an older version of Spark than the one mentioned above):
spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.1
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.11)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Step 6 Verify the PySpark shell
To use the PySpark shell, enter the pyspark command below and you should see:
pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.1
/_/
Using Python version 3.12.3 (main, Apr 23 2024 09:16:07)
Spark context Web UI available at http://192.168.1.32:4040
Spark context available as 'sc' (master = local[*], app id = local-1717346117959).
SparkSession available as 'spark'.
>>>
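As a quick sanity check, you can run a small job straight from the prompt; the following should return 5:
spark.range(5).count()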
Step 7 Using Delta with PySpark
To run a Delta Lake and work with Delta files, you will find instructions here:
https://delta.io/learn/getting-started/
If you are using Python, ensure you have the correct Python libraries installed; we will do so below, installing them in a virtual environment.
Create a virtual environment:
python -m venv venv
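Activate it so that the library below installs into the environment rather than system-wide:
source venv/bin/activate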
Install the delta-spark library:
pip install delta-spark
You are now ready to create some Python scripts and build a Delta Lake on your own system.
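As a starting point, here is a minimal sketch based on the Delta Lake getting-started guide. The app name and the /tmp/delta-table path are just examples; use whatever you like:
import pyspark
from delta import configure_spark_with_delta_pip

# Build a SparkSession with the Delta Lake extensions enabled
builder = (
    pyspark.sql.SparkSession.builder.appName("local-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table, then read it back
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()
Run the script with python inside the activated virtual environment; the first run downloads the Delta jars, so it needs an internet connection.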
Here you can create a medallion architecture, storing your Delta files in separate bronze, silver, and gold folders if you like.
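A very rough sketch of that layout, reusing the spark session from the example above (the folder names, columns, and transformations are purely illustrative):
# Bronze: land the raw data as-is
raw = spark.createDataFrame(
    [(1, "2024-06-01", 10.0), (2, "2024-06-01", -3.0)],
    ["id", "event_date", "amount"],
)
raw.write.format("delta").mode("overwrite").save("lake/bronze/events")

# Silver: clean the data (here, drop negative amounts)
bronze = spark.read.format("delta").load("lake/bronze/events")
bronze.filter("amount >= 0").write.format("delta").mode("overwrite").save("lake/silver/events")

# Gold: aggregate for reporting
silver = spark.read.format("delta").load("lake/silver/events")
silver.groupBy("event_date").sum("amount").write.format("delta").mode("overwrite").save("lake/gold/daily_totals")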
Another thing to do is to start playing around with the Hive metastore.
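As a taster, with the session configured above you can register a Delta table under a name and query it with SQL; by default Spark keeps managed tables in a local spark-warehouse folder, and enabling Hive support on the builder is the next step if you want table definitions to persist across sessions:
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.range(0, 5).write.format("delta").mode("overwrite").saveAsTable("bronze.numbers")
spark.sql("SELECT count(*) FROM bronze.numbers").show()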
As you play with these, you will come to the realisation that Databricks, Synapse, and Fabric are all very impressive, but they are essentially UI wrappers around Apache Spark and the Delta file format, which is where the real magic lies.