Model Data

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Exercise

Here’s an example where we use ml_linear_regression() to fit a linear regression model. We’ll use the built-in mtcars dataset to see whether we can predict a car’s fuel economy (mpg, miles per gallon) from its weight (wt) and the number of cylinders in its engine (cyl). We’ll assume that the relationship between mpg and each of our features is linear.
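
Written out, the linearity assumption corresponds to a model of roughly the following form. The symbols are notation introduced here for illustration, not sparklyr output: the β terms are the coefficients the fit estimates, and ε is an error term for car i.

```latex
\[
\mathrm{mpg}_i = \beta_0 + \beta_1\,\mathrm{wt}_i + \beta_2\,\mathrm{cyl}_i + \varepsilon_i
\]
```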

Initialize the environment

We will start by creating a local Spark session and loading the mtcars data frame into it.

library(sparklyr)
library(dplyr)  # provides select() and the other data frame verbs used below
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
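
Before moving on, it can help to confirm that the session is up and the table arrived. The following is a minimal sketch, assuming the `sc` and `mtcars_tbl` objects created above and that dplyr is installed. It uses only inspection helpers and changes nothing in Spark.

```r
library(sparklyr)
library(dplyr)

# Assumes `sc` and `mtcars_tbl` were created as shown above.
# Returns TRUE while the Spark session is alive.
connection_ok <- spark_connection_is_open(sc)
connection_ok

# List the tables registered in the Spark session.
# copy_to() names the table after the R object, so "mtcars" should appear.
spark_tables <- src_tbls(sc)
spark_tables

# The dimensions are computed by Spark; only two numbers come back to R.
tbl_dim <- sdf_dim(mtcars_tbl)
tbl_dim
```

When you have finished working with the cluster, spark_disconnect(sc) closes the session.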

Prepare the data

Spark provides data frame operations that make it easier to prepare data for modeling. In this case, we will use the sdf_random_split() command to divide the mtcars data into “training” and “test” sets.

partitions <- mtcars_tbl %>%
  select(mpg, wt, cyl) %>% 
  sdf_random_split(training = 0.5, test = 0.5, seed = 1099)

Note that the newly created partitions variable does not contain data. It is a list of pointers to where the split data lives within Spark, which means that no data is downloaded to the R session.
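
To see what "a pointer" means in practice, the sketch below inspects the partitions without pulling their rows into R, then downloads one explicitly. It assumes the `partitions` object created above. The name `training_local` is introduced here purely for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumes `partitions` was created with sdf_random_split() as shown above.
# It is a named list with one Spark data frame reference per split.
partition_names <- names(partitions)
partition_names

# Row counts are computed inside Spark; only the counts return to R.
n_train <- sdf_nrow(partitions$training)
n_test <- sdf_nrow(partitions$test)
c(training = n_train, test = n_test)

# collect() is the explicit step that downloads rows into a local tibble.
# `training_local` is a hypothetical name used only in this sketch.
training_local <- collect(partitions$training)
```

Because the split is random, the two partitions will rarely hold exactly half of the rows each, even with weights of 0.5.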

Fit the model

Next, we will fit a linear model to the training data set:

fit <- partitions$training %>%
  ml_linear_regression(mpg ~ .)

fit
#> Formula: mpg ~ .
#> 
#> Coefficients:
#> (Intercept)          wt         cyl 
#>   38.927395   -4.131014   -0.938832
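
To make the coefficients concrete, here is a small worked example in plain R that needs no Spark connection. It copies the coefficients printed above and applies them by hand to a hypothetical car. The names `coefs`, `new_car`, and `predicted_mpg` are introduced only for this sketch.

```r
# Coefficients copied from the printed fit above.
coefs <- c(intercept = 38.927395, wt = -4.131014, cyl = -0.938832)

# Hypothetical car: wt is recorded in units of 1000 lb, so 3.57 means 3,570 lb.
new_car <- c(wt = 3.57, cyl = 8)

# Prediction = intercept + (wt coefficient * weight) + (cyl coefficient * cylinders)
predicted_mpg <- coefs[["intercept"]] +
  coefs[["wt"]] * new_car[["wt"]] +
  coefs[["cyl"]] * new_car[["cyl"]]

round(predicted_mpg, 2)
#> [1] 16.67
```

Each additional 1000 lb of weight lowers the predicted mpg by about 4.13, and each additional cylinder lowers it by about 0.94, holding the other feature fixed.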

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit. The output reports the distribution of the residuals, the fitted coefficients, the R-squared, and the root mean squared error.

summary(fit)
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.4891 -1.5262 -0.1481  0.8508  6.3162 
#> 
#> Coefficients:
#> (Intercept)          wt         cyl 
#>   38.927395   -4.131014   -0.938832 
#> 
#> R-Squared: 0.8469
#> Root Mean Squared Error: 2.416

Use the model

We can use ml_predict() to create a Spark data frame that contains the predictions against the testing data set.

pred <- ml_predict(fit, partitions$test)

head(pred)
#> # Source: spark<?> [?? x 4]
#>     mpg    wt   cyl prediction
#>   <dbl> <dbl> <dbl>      <dbl>
#> 1  14.3  3.57     8      16.7 
#> 2  14.7  5.34     8       9.34
#> 3  15    3.57     8      16.7 
#> 4  15.2  3.44     8      17.2 
#> 5  15.2  3.78     8      15.8 
#> 6  15.5  3.52     8      16.9
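
One way to judge these predictions is to compute the root mean squared error on the test partition and compare it with the training value reported by summary(). The sketch below assumes the `pred` object created above and relies on dplyr translating the arithmetic to Spark SQL, so the aggregation runs in Spark and only a single row returns to R. The name `test_rmse` is introduced here for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumes `pred` is the Spark data frame returned by ml_predict() above.
# The squared errors are averaged inside Spark; collect() brings back one row.
test_rmse <- pred %>%
  summarise(rmse = sqrt(mean((mpg - prediction)^2, na.rm = TRUE))) %>%
  collect()

test_rmse
```

A test RMSE well above the training RMSE would suggest that the model does not generalize well, although with a dataset this small the estimate is noisy.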
