```
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
```

# Model Data

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within `sparklyr`. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

## Exercise

Here’s an example where we use `ml_linear_regression()` to fit a linear regression model. We’ll use the built-in `mtcars` dataset and see if we can predict a car’s fuel consumption (`mpg`) based on its weight (`wt`) and the number of cylinders the engine contains (`cyl`). We’ll assume in each case that the relationship between `mpg` and each of our features is linear.

### Initialize the environment

We will start by creating a local Spark session and loading the `mtcars` data frame into it.

### Prepare the data

Spark provides data frame operations that make it easier to prepare data for modeling. In this case, we will use the `sdf_random_split()` command to divide the `mtcars` data into “training” and “test” sets.

```
partitions <- mtcars_tbl %>%
  select(mpg, wt, cyl) %>%
  sdf_random_split(training = 0.5, test = 0.5, seed = 1099)
```

Note that the newly created `partitions` variable does not contain the data itself; it contains pointers to where the data was split within Spark. That means no data is downloaded to the R session.
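To see this in action, you can inspect the partitions without pulling the data into R. The sketch below assumes `dplyr` is attached alongside `sparklyr`; `sdf_nrow()` computes the row count inside Spark (only the count travels back to R), while `collect()` is the explicit step that downloads results when you do want them locally.

```
# Row counts are computed inside Spark; only a single number returns to R
sdf_nrow(partitions$training)
sdf_nrow(partitions$test)

# collect() explicitly downloads a (small) result set into the R session
training_sample <- partitions$training %>%
  head(5) %>%
  collect()
```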

### Fit the model

Next, we will fit a linear model to the training data set:

```
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ .)

fit
#> Formula: mpg ~ .
#>
#> Coefficients:
#> (Intercept)          wt         cyl
#>   38.927395   -4.131014   -0.938832
```

For linear regression models produced by Spark, we can use `summary()` to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.

```
summary(fit)
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.4891 -1.5262 -0.1481  0.8508  6.3162
#>
#> Coefficients:
#> (Intercept)          wt         cyl
#>   38.927395   -4.131014   -0.938832
#>
#> R-Squared: 0.8469
#> Root Mean Squared Error: 2.416
```

### Use the model

We can use `ml_predict()` to create a Spark data frame that contains the predictions against the testing data set.

```
pred <- ml_predict(fit, partitions$test)

head(pred)
#> # Source: spark<?> [?? x 4]
#>     mpg    wt   cyl prediction
#>   <dbl> <dbl> <dbl>      <dbl>
#> 1  14.3  3.57     8      16.7
#> 2  14.7  5.34     8       9.34
#> 3  15    3.57     8      16.7
#> 4  15.2  3.44     8      17.2
#> 5  15.2  3.78     8      15.8
#> 6  15.5  3.52     8      16.9
```
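A natural next step is to score the predictions against the held-out data. One way to do this, sketched below, is `ml_regression_evaluator()`, which computes the metric inside Spark; it assumes the default `prediction` output column produced by `ml_predict()` and uses `mpg` as the label column.

```
# RMSE of the model on the test partition, computed inside Spark.
# Assumes ml_predict()'s default output column name "prediction".
ml_regression_evaluator(
  pred,
  label_col = "mpg",
  prediction_col = "prediction",
  metric_name = "rmse"
)
```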

### Further reading

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with `dplyr` pipelines. To learn more, see the Machine Learning article on this site. For a list of Spark ML models available through `sparklyr`, visit Reference - ML.
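As a brief sketch of that chaining, the pipeline below derives a new feature with `dplyr` and feeds it straight into a model fit, all executed inside Spark. It assumes `dplyr` is attached, and the derived column `wt_per_cyl` is a hypothetical feature chosen purely for illustration.

```
# Feature engineering and model fitting in one pipeline, all in Spark.
# wt_per_cyl is a hypothetical derived feature for illustration only.
fit2 <- mtcars_tbl %>%
  mutate(wt_per_cyl = wt / cyl) %>%   # computed in Spark, not in R
  ml_linear_regression(mpg ~ wt_per_cyl)

summary(fit2)
```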