Spark Connect

Last updated: Thu Apr 16 17:04:40 2026

Intro

Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API. This separation between client and server lets Spark be leveraged from almost anywhere, which allows R users to interact with a cluster from the comfort of their preferred environment, laptop or otherwise.

The Solution

The API is very different from “legacy” Spark, and using the Spark shell is no longer an option. We decided to use Python as the new interface; Python, in turn, uses gRPC to interact with Spark.

We use reticulate to interact with the Python API. sparklyr extends the functionality, and the user experience, by providing the dplyr back-end, the DBI back-end, and integration with RStudio’s Connections pane.

flowchart LR
  subgraph lp[Local machine]
    subgraph r[R]
      sr[sparklyr]
      rt[reticulate]
    end
    subgraph ps[Python]
      dc[Spark Connect]
      g1[gRPC]
    end
  end   
  subgraph db[Compute Cluster]
    sp[Spark]   
  end
  sr <--> rt
  rt <--> dc
  g1 <-- Internet<br>Connection --> sp
  dc <--> g1
  
  style r   fill:#fff,stroke:#666,color:#000
  style sr  fill:#fff,stroke:#666,color:#000
  style rt  fill:#fff,stroke:#666,color:#000
  style ps  fill:#fff,stroke:#666,color:#000
  style lp  fill:#fff,stroke:#666,color:#fff
  style db  fill:#fff,stroke:#666,color:#000
  style sp  fill:#fff,stroke:#666,color:#000
  style g1  fill:#fff,stroke:#666,color:#000
  style dc  fill:#fff,stroke:#666,color:#000

Figure 1: How sparklyr communicates with Spark Connect

Package Installation

To access Spark Connect, you will need to install the pysparklyr package in addition to sparklyr:

install.packages("pysparklyr")

Initial setup

sparklyr will automatically set up a temporary Python environment, for the duration of your R session, using uv. This environment includes all the Python libraries needed to interact with the Spark cluster. No manual installation steps are required; simply call spark_connect() and everything will be ready to use.
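As a minimal sketch, the first connection is all it takes; the uv-managed environment is provisioned on demand. The address below is a placeholder, so substitute your cluster's host:

```r
library(sparklyr)

# The first call provisions the temporary, uv-managed Python
# environment automatically, then opens the connection.
# "sc://localhost" is an illustrative address -- use your cluster's.
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect"
)
```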

Legacy setup via install_pyspark()

If uv is not available in your environment (for example, because your organization restricts its installation), you can fall back to setting up the Python environment manually using install_pyspark(). This will:

  • Create, or re-create, a persistent Python environment. Based on your OS, it will choose between a virtual environment and Conda.

  • Install the needed Python libraries

To install the latest versions of all the libraries, use:

pysparklyr::install_pyspark()

It is recommended that the version of the PySpark library match the Spark version of your cluster. To do this, pass the Spark version via the version argument, for example:

pysparklyr::install_pyspark("4.1")

We have seen Spark sessions crash when the versions of PySpark and Spark do not match, specifically when a newer version of PySpark is used against an older version of Spark. If you are having connection issues, consider running install_pyspark() with your cluster’s specific Spark version.

Connecting

To start a session with an open-source Spark cluster via Spark Connect, you will need to set the master and method values. master will be an IP address, with an optional port, using the “sc://” connection protocol. For method, use “spark_connect”. Here is an example:

library(sparklyr)

sc <- spark_connect(
  master = "sc://[Host IP(:Host Port - optional)]", 
  method = "spark_connect",
  version = "[Version that matches your cluster]"
  )

If version is not passed, sparklyr will automatically choose the installed Python environment with the highest PySpark version, and a console message will let you know which environment it will use.

Run locally

It is possible to run Spark Connect on your own machine. We provide helper functions that let you set up, start, and stop the service locally.

Start Spark Connect using:

pysparklyr::spark_connect_service_start("4.1")
#> Starting Spark Connect locally ...
#> openjdk version "17.0.18" 2026-01-20
#> OpenJDK Runtime Environment Homebrew (build 17.0.18+0)
#> OpenJDK 64-Bit Server VM Homebrew (build 17.0.18+0, mixed mode, sharing)
#> Retrieving version from PyPi.org
#> ✔ PyPi specs: 'pyspark' version 4.1.0, requires Python >=3.10 [45ms]
#> 
#> ℹ
#> ✔ Python environment: 'Managed `uv` environment' [563ms]
#> 
#>   starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to
#>   /Users/edgar/spark/spark-4.1.1-bin-hadoop3/logs/spark-edgar-org.apache.spark.sql.connect.service.SparkConnectServer-1-edgarruiz-WL57.out

To connect to your local Spark cluster using Spark Connect, use localhost as the address for master:

library(sparklyr)
library(dplyr)

sc <- spark_connect("sc://localhost", method = "spark_connect", version = "4.1")
#> Retrieving version from PyPi.org
#> ✔ PyPi specs: 'pyspark' version 4.1.0, requires Python >=3.10 [41ms]
#> 
#> ℹ
#> ✔ Python environment: 'Managed `uv` environment' [4ms]
#> 

Now, you are able to interact with your local Spark session:

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl |>
  group_by(am) |>
  summarise(mpg = mean(mpg, na.rm = TRUE))
#> # Source:   SQL [?? x 2]
#> # Database: spark_connection
#>      am   mpg
#>   <dbl> <dbl>
#> 1     0  17.1
#> 2     1  24.4
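Because this is a dplyr back-end, the pipeline above is translated to Spark SQL before being sent to the cluster. Standard dplyr tooling, such as show_query(), works here too and lets you inspect that translation:

```r
# Display the Spark SQL that the dplyr pipeline translates to,
# without executing it against the cluster
mtcars_tbl |>
  group_by(am) |>
  summarise(mpg = mean(mpg, na.rm = TRUE)) |>
  show_query()
```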

Machine Learning

Starting with Spark 4.0, Spark Connect has access to the full MLlib library. This means you can now train and apply machine learning models directly from R via sparklyr.

Here is an example that fits a linear regression model on the same Spark DataFrame:

fit <- mtcars_tbl |>
  ml_linear_regression(mpg ~ .)

Spark Connect models have a specialized print output that differs from that of other sparklyr deployment types. It displays a summary of the model, including the coefficients and fit statistics such as RMSE, R², and mean absolute error:

fit
#> 
#> ── MLib model: LinearRegressionModel ──
#> 
#> ── Coefficients:
#> ◼ Intercept:    12.303   ◼ qsec:         0.821 
#> ◼ cyl:         -0.111    ◼ vs:           0.318 
#> ◼ disp:         0.013    ◼ am:           2.52  
#> ◼ hp:          -0.021    ◼ gear:         0.655 
#> ◼ drat:         0.787    ◼ carb:        -0.199 
#> ◼ wt:          -3.715
#> 
#> ── Summary:
#> ◼ coefficientStandardErrors:  1.045, 0.018, 0.022, 1.635,...
#> ◼ devianceResiduals:          -3.451, 4.627                 
#> ◼ explainedVariance:          30.58                         
#> ◼ featuresCol:                features                      
#> ◼ labelCol:                   label                         
#> ◼ meanAbsoluteError:          1.723                         
#> ◼ meanSquaredError:           4.609                         
#> ◼ objectiveHistory:           0                             
#> ◼ predictionCol:              prediction                    
#> ◼ pValues:                    0.916, 0.463, 0.335, 0.635,...
#> ◼ r2:                         0.869                         
#> ◼ r2adj:                      0.807                         
#> ◼ rootMeanSquaredError:       2.147                         
#> ◼ tValues:                    -0.107, 0.747, -0.987, 0.48...

Use ml_predict() to score new data with the fitted model:

pred <- ml_predict(fit, mtcars_tbl)

head(pred)
#> # Source:   SQL [?? x 12]
#> # Database: spark_connection
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb prediction
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4       22.6
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4       22.1
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1       26.3
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1       21.2
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2       17.7
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1       20.4

For a full list of the supported MLlib functions in sparklyr, see the appendix.

The local Spark Connect service can now be stopped:

pysparklyr::spark_connect_service_stop(version = "4.1")
#> 
#> ── Stopping Spark Connect
#>   - Shutdown command sent

Additional setup details

If you wish to use your own Python environment, make sure to load it before calling spark_connect(). If a Python environment is already loaded when you connect to your Spark cluster, sparklyr will use that environment instead. Your environment will need the following libraries installed:

  • pyspark
  • pandas
  • PyArrow
  • grpcio
  • google-api-python-client
  • grpcio_status

ML libraries (Optional):

  • torch
  • torcheval
  • scikit-learn
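As a sketch of this workflow using reticulate: create a virtual environment with the required libraries, then load it before connecting. The environment name "spark-connect" is arbitrary and purely illustrative:

```r
library(reticulate)

# Create a persistent virtual environment with the required libraries
# ("spark-connect" is an arbitrary, illustrative name)
virtualenv_create(
  "spark-connect",
  packages = c(
    "pyspark", "pandas", "PyArrow", "grpcio",
    "google-api-python-client", "grpcio_status"
  )
)

# Load it *before* calling spark_connect(), so that sparklyr
# uses this environment instead of creating a temporary one
use_virtualenv("spark-connect")
```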

Appendix: Supported MLlib Functions

The following sparklyr MLlib functions are supported via Spark Connect (requires Spark 4.0+):

ml_aft_survival_regression() ml_binary_classification_evaluator() ml_bisecting_kmeans() ml_clustering_evaluator()
ml_cross_validator() ml_decision_tree_classifier() ml_decision_tree_regressor() ml_gbt_classifier()
ml_gbt_regressor() ml_generalized_linear_regression() ml_isotonic_regression() ml_kmeans()
ml_linear_regression() ml_logistic_regression() ml_multiclass_classification_evaluator() ml_pipeline()
ml_predict() ml_random_forest_classifier() ml_random_forest_regressor() ml_regression_evaluator()
ml_save() ml_transform()

The following feature transformation (ft_) functions are also supported:

ft_binarizer() ft_bucketed_random_projection_lsh() ft_bucketizer() ft_count_vectorizer()
ft_dct() ft_discrete_cosine_transform() ft_elementwise_product() ft_feature_hasher()
ft_hashing_tf() ft_idf() ft_imputer() ft_index_to_string()
ft_max_abs_scaler() ft_min_max_scaler() ft_minhash_lsh() ft_ngram()
ft_normalizer() ft_one_hot_encoder() ft_pca() ft_polynomial_expansion()
ft_quantile_discretizer() ft_r_formula() ft_regex_tokenizer() ft_robust_scaler()
ft_sql_transformer() ft_standard_scaler() ft_stop_words_remover() ft_string_indexer()
ft_tokenizer() ft_vector_assembler() ft_vector_indexer() ft_vector_slicer()
ft_word2vec()
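As an illustration, feature transformers apply directly to a Spark DataFrame. This sketch uses ft_binarizer() against the mtcars_tbl table from the earlier example, with an arbitrary 20 MPG threshold:

```r
# Add a 0/1 column flagging cars whose mpg exceeds the threshold;
# the computation runs on the Spark cluster, not in R
mtcars_tbl |>
  ft_binarizer(
    input_col = "mpg",
    output_col = "mpg_high",
    threshold = 20
  ) |>
  select(mpg, mpg_high) |>
  head(4)
```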