Spark Connect
Last updated: Thu Apr 16 17:04:40 2026
Intro
Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API. The separation between client and server lets Spark be leveraged from anywhere, and it allows R users to interact with a cluster from the comfort of their preferred environment, laptop or otherwise.
The Solution
The Spark Connect API is very different from “legacy” Spark, and using the Spark shell is no longer an option. We decided to use Python as the new interface; in turn, Python uses gRPC to interact with Spark.
We use reticulate to interact with the Python API. sparklyr extends the functionality, and the user experience, by providing the dplyr back-end, the DBI back-end, and the RStudio Connections pane integration.
The diagram below shows how sparklyr communicates with Spark Connect:
flowchart LR
  subgraph lp[test]
    subgraph r[R]
      sr[sparklyr]
      rt[reticulate]
    end
    subgraph ps[Python]
      dc[Spark Connect]
      g1[gRPC]
    end
  end
  subgraph db[Compute Cluster]
    sp[Spark]
  end
  sr <--> rt
  rt <--> dc
  g1 <-- Internet<br>Connection --> sp
  dc <--> g1
  style r fill:#fff,stroke:#666,color:#000
  style sr fill:#fff,stroke:#666,color:#000
  style rt fill:#fff,stroke:#666,color:#000
  style ps fill:#fff,stroke:#666,color:#000
  style lp fill:#fff,stroke:#666,color:#fff
  style db fill:#fff,stroke:#666,color:#000
  style sp fill:#fff,stroke:#666,color:#000
  style g1 fill:#fff,stroke:#666,color:#000
  style dc fill:#fff,stroke:#666,color:#000
Package Installation
To access Spark Connect, you will need to install the pysparklyr package in addition to sparklyr:
install.packages("pysparklyr")Initial setup
sparklyr will automatically set up a temporary Python environment for the duration of your R session using uv. This environment will include all the Python libraries needed to interact with the Spark cluster. No manual installation steps are required; simply call spark_connect() and everything will be ready to use.
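As a minimal sketch, with localhost standing in for your cluster's address, a first connection needs nothing beyond loading sparklyr:
library(sparklyr)

# No manual Python setup needed: on the first connection, sparklyr
# provisions a temporary uv-managed environment that includes PySpark
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect"
)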
Legacy setup via install_pyspark()
If uv is not available in your environment (for example, because your organization restricts its installation), you can fall back to setting up the Python environment manually using install_pyspark(). This will:
Create, or re-create, a persistent Python environment. Depending on your OS, it will create either a virtual environment or a Conda environment.
Install the needed Python libraries
To install the latest versions of all the libraries, use:
pysparklyr::install_pyspark()
It is recommended that the version of the PySpark library match the Spark version of your cluster. To do this, pass the Spark version in the version argument, for example:
pysparklyr::install_pyspark("4.1")We have seen Spark sessions crash when the version of PySpark and the version of Spark do not match. Specifically when a newer version of PySpark is used against an older version of Spark. If you are having issues with your connection, consider running install_pyspark() to match the cluster’s specific Spark version.
Connecting
To start a session with an open-source Spark cluster via Spark Connect, you will need to set the master and method values. The master will be an IP address and, optionally, a port; the connection URL uses the “sc://” protocol. For method, use “spark_connect”. Here is an example:
library(sparklyr)
sc <- spark_connect(
master = "sc://[Host IP(:Host Port - optional)]",
method = "spark_connect",
version = "[Version that matches your cluster]"
)
If version is not passed, then sparklyr will automatically choose the installed Python environment with the highest PySpark version. In a console message, sparklyr will let you know which environment it will use.
Run locally
It is possible to run Spark Connect on your own machine. We provide helper functions that let you set up and start/stop the service locally.
Start Spark Connect using:
pysparklyr::spark_connect_service_start("4.1")
#> Starting Spark Connect locally ...
#> openjdk version "17.0.18" 2026-01-20
#> OpenJDK Runtime Environment Homebrew (build 17.0.18+0)
#> OpenJDK 64-Bit Server VM Homebrew (build 17.0.18+0, mixed mode, sharing)
#> Retrieving version from PyPi.org
#> ✔ PyPi specs: 'pyspark' version 4.1.0, requires Python >=3.10 [45ms]
#>
#> ✔ Python environment: 'Managed `uv` environment' [563ms]
#>
#> starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to
#> /Users/edgar/spark/spark-4.1.1-bin-hadoop3/logs/spark-edgar-org.apache.spark.sql.connect.service.SparkConnectServer-1-edgarruiz-WL57.out
To connect to your local Spark cluster using Spark Connect, use localhost as the address for master:
library(sparklyr)
library(dplyr)
sc <- spark_connect("sc://localhost", method = "spark_connect", version = "4.1")
#> Retrieving version from PyPi.org
#> ✔ PyPi specs: 'pyspark' version 4.1.0, requires Python >=3.10 [41ms]
#>
#> ✔ Python environment: 'Managed `uv` environment' [4ms]
Now you are able to interact with your local Spark session.
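For example, you can copy a local data frame to the cluster and query it with dplyr. This is a minimal sketch; the mtcars_tbl table it creates is also the one used in the Machine Learning section below:
# Copy the local mtcars data frame into the Spark session
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Standard dplyr verbs are translated to Spark SQL and run on the cluster
mtcars_tbl |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))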
Machine Learning
Starting with Spark 4.0, Spark Connect has access to the full MLlib library. This means you can now train and apply machine learning models directly from R via sparklyr.
Here is an example that fits a linear regression model on the same Spark DataFrame:
fit <- mtcars_tbl |>
ml_linear_regression(mpg ~ .)
Spark Connect models have a specialized print output that differs from other sparklyr deployment types. It displays a summary of the model, including the coefficients and fit statistics such as RMSE, R², and mean absolute error:
fit
#>
#> ── MLib model: LinearRegressionModel ──
#>
#> ── Coefficients:
#> ◼ Intercept: 12.303 ◼ qsec: 0.821
#> ◼ cyl: -0.111 ◼ vs: 0.318
#> ◼ disp: 0.013 ◼ am: 2.52
#> ◼ hp: -0.021 ◼ gear: 0.655
#> ◼ drat: 0.787 ◼ carb: -0.199
#> ◼ wt: -3.715
#>
#> ── Summary:
#> ◼ coefficientStandardErrors: 1.045, 0.018, 0.022, 1.635,...
#> ◼ devianceResiduals: -3.451, 4.627
#> ◼ explainedVariance: 30.58
#> ◼ featuresCol: features
#> ◼ labelCol: label
#> ◼ meanAbsoluteError: 1.723
#> ◼ meanSquaredError: 4.609
#> ◼ objectiveHistory: 0
#> ◼ predictionCol: prediction
#> ◼ pValues: 0.916, 0.463, 0.335, 0.635,...
#> ◼ r2: 0.869
#> ◼ r2adj: 0.807
#> ◼ rootMeanSquaredError: 2.147
#> ◼ tValues: -0.107, 0.747, -0.987, 0.48...
Use ml_predict() to score new data with the fitted model:
pred <- ml_predict(fit, mtcars_tbl)
head(pred)
#> # Source: SQL [?? x 12]
#> # Database: spark_connection
#> mpg cyl disp hp drat wt qsec vs am gear carb prediction
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 22.6
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 22.1
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 26.3
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.2
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 17.7
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 20.4
spark_disconnect(sc)
For a full list of the supported MLlib functions in sparklyr, see the appendix.
The local Spark Connect service can now be stopped:
pysparklyr::spark_connect_service_stop(version = "4.1")
#>
#> ── Stopping Spark Connect
#> - Shutdown command sent
Additional setup details
If you wish to use your own Python environment, just make sure to load it before calling spark_connect(). If a Python environment is already loaded when you connect to your Spark cluster, sparklyr will use that environment instead. If you use your own Python environment, you will need the following libraries installed (a sketch of loading such an environment appears after the lists below):
pyspark
pandas
PyArrow
grpcio
google-api-python-client
grpcio_status
torch
torcheval
ML libraries (Optional):
torch
torcheval
scikit-learn
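As a sketch of loading your own environment, you can activate it with reticulate before connecting; the environment name below is hypothetical:
library(reticulate)
library(sparklyr)

# Load your own Python environment first ("my-pyspark-env" is a placeholder);
# sparklyr will then use it instead of creating a uv-managed one
use_virtualenv("my-pyspark-env", required = TRUE)

sc <- spark_connect(master = "sc://localhost", method = "spark_connect")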
Appendix: Supported MLlib Functions
The following sparklyr MLlib functions are supported via Spark Connect (requires Spark 4.0+):
The following feature transformation (ft_) functions are also supported: