Spark Standalone Deployment in AWS

Overview

The plan is to launch 4 identical EC2 server instances. One server will be the Master node and the other 3 will be worker nodes. On one of the worker nodes, we will install RStudio Server.

A server is the Master node only because it runs the master service; the other machines run the worker (“slave”) service and point to that master. This simple setup allows us to install the same Spark components on all 4 servers and then just add RStudio to one of them.

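The topology will look something like this: one Master node and three worker nodes pointing to it, with RStudio Server running on one of the workers.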

AWS EC2 Instances

Here are the details of the EC2 instance. Deploy just one at this point:

Type: t2.medium
OS: Ubuntu 22.04 LTS
Disk space: At least 20GB
Security group: Open the following ports: 8080 (Spark Master UI), 4040 (Spark application UI), 8088 (sparklyr UI) and 8787 (RStudio). Also open all TCP ports to the machines inside the security group.
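
The port rules above can also be applied from the command line. The sketch below only prints the AWS CLI commands rather than running them, so they can be reviewed first. It assumes the AWS CLI is installed and configured; the security group ID and the client address range are hypothetical placeholders, and the rule for traffic inside the security group is not covered.

```shell
# Hypothetical placeholders: substitute your own security group ID and
# the address range you will connect from.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
CLIENT_CIDR="${CLIENT_CIDR:-203.0.113.0/24}"

# Print one ingress rule command per port listed in the guide.
print_ingress_commands() {
  for port in 8080 4040 8088 8787; do
    echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr $CLIENT_CIDR"
  done
}

print_ingress_commands
```

Once the printed commands look right, they can be pasted into a terminal or piped to sh.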

Spark

Perform the steps in this section on all of the servers that will be part of the cluster.

Install Java 8

We will add the Java 8 repository, install Java 8 and set it as the default:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
sudo apt-get update

Alternatively, for example if the PPA above is not available, install OpenJDK version 8 by running

sudo apt install openjdk-8-jdk
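
After either install path it is worth confirming which Java is now the default. The following is a minimal sketch, not part of the original steps: java_major is a hypothetical helper that reduces a version string to its major number, treating the legacy "1.8" scheme as 8.

```shell
# Reduce a Java version string to its major version.
# "1.8.0_392" -> 8 (legacy scheme), "11.0.21" -> 11 (current scheme).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Report the default Java, if one is on the PATH.
if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted string.
  version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Java major version: $(java_major "$version")"
else
  echo "java not found on PATH"
fi
```

On a node prepared as described above, the reported major version should be 8.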

Download Spark

Download and unpack a pre-compiled version of Spark. Other releases are listed on the official Spark download page.

wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
tar -xvzf spark-3.5.5-bin-hadoop3.tgz
cd spark-3.5.5-bin-hadoop3

Create and launch AMI

We will create an image of the server. In Amazon, these are called AMIs; for more information, please see the User Guide.
Launch 3 instances of the AMI.
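
The same two steps can be done with the AWS CLI instead of the console. This sketch only prints the commands; the instance ID, AMI ID and image name are hypothetical placeholders, and in practice the AMI ID comes from the output of the first command. Options such as the key pair and security group are left out and would need to be added for a real launch.

```shell
# Hypothetical placeholders: use the ID of the server prepared above and
# the AMI ID returned by create-image.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"
AMI_ID="${AMI_ID:-ami-0123456789abcdef0}"

# Print the commands that image the prepared server and launch 3 copies.
print_ami_commands() {
  echo "aws ec2 create-image --instance-id $INSTANCE_ID --name spark-node-image"
  echo "aws ec2 run-instances --image-id $AMI_ID --count 3 --instance-type t2.medium"
}

print_ami_commands
```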

RStudio Server

Select one of the worker nodes to execute this section. Please check the RStudio download page for the latest version.

Install R

In order to get the latest R core, we will need to update the source list in Ubuntu.

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu jammy-cran40/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key 0x51716619e084dab9
gpg -a --export 0x51716619e084dab9 | sudo apt-key add -
sudo apt-get update

Now we can install R

sudo apt-get install r-base
sudo apt-get install gdebi-core

Install RStudio

We will download and install version 1.0.153 of RStudio Server. To find the latest version, please visit the RStudio website. In order to get the enhanced integration with Spark, RStudio version 1.0.44 or later will be needed.

wget https://download2.rstudio.org/rstudio-server-1.0.153-amd64.deb
sudo gdebi rstudio-server-1.0.153-amd64.deb

Install dependencies

Run the following commands

sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
sudo apt-get -y install libxml2-dev

Add default user

Run the following command to add a default user

sudo adduser rstudio-user

Start the Master node

Select one of the servers to become your Master node
Run the command that starts the master service

sudo spark-3.5.5-bin-hadoop3/sbin/start-master.sh

Close the terminal connection (optional)

Start Worker nodes

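Start the worker service (called the “slave” service in older Spark releases). Important: If you address the Spark Master node by IP address, use dots, not dashes, as separators (for example 172.30.1.94, not 172-30-1-94).

sudo spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://[Master node's IP address]:7077

For example, using the Master node's internal DNS name:

sudo spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://ip-172-30-1-94.us-west-2.compute.internal:7077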

Close the terminal connection (optional)
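
Because the master address is easy to mistype, a small helper can build it. The sketch below assumes the usual EC2 private DNS naming, where the name encodes the private IP address with dashes; dns_to_ip and master_url are hypothetical helper names, and 7077 is the master port used in the commands above.

```shell
# Convert an EC2 private DNS name such as
# ip-172-30-1-94.us-west-2.compute.internal into the dotted IP address
# it encodes. Input that does not match the pattern is returned unchanged.
dns_to_ip() {
  echo "$1" | sed -E 's/^ip-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\..*$/\1.\2.\3.\4/'
}

# Build the spark:// URL expected by the worker start script.
master_url() {
  echo "spark://$(dns_to_ip "$1"):7077"
}

master_url ip-172-30-1-94.us-west-2.compute.internal
```

The printed URL can be passed directly as the argument to the worker start script.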

Pre-load packages

Log into RStudio (port 8787)
Use ‘rstudio-user’

install.packages("sparklyr")

Connect to the Spark Master

Navigate to the Spark Master’s UI, typically on port 8080

Note the Spark Master URL
Log on to RStudio
Run the following code

library(sparklyr)

conf <- spark_config()
conf$spark.executor.memory <- "2GB"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "[Spark Master URL]",
                    version = "3.5.5",
                    config = conf,
                    spark_home = "/home/ubuntu/spark-3.5.5-bin-hadoop3/")
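
If the connection from RStudio fails, it can help to check from a terminal whether the cluster accepts jobs at all. The sketch below builds a command that submits the SparkPi example bundled with Spark through bin/run-example. It assumes Spark was unpacked under /home/ubuntu as in the download step; smoke_test_command is a hypothetical helper that only prints the command.

```shell
# Assumed install location, matching the download step in this guide.
SPARK_HOME="${SPARK_HOME:-/home/ubuntu/spark-3.5.5-bin-hadoop3}"

# Build the command that submits the bundled SparkPi example to the
# master URL given as the first argument.
smoke_test_command() {
  echo "$SPARK_HOME/bin/run-example --master $1 SparkPi 10"
}

# Print the command for the example master address used earlier.
smoke_test_command "spark://ip-172-30-1-94.us-west-2.compute.internal:7077"
```

While the job runs, the application should appear in the Spark Master's UI on port 8080.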