Creates the ‘label’ and ‘features’ columns

R/ml-prepare-dataset.R

ml_prepare_dataset

Description

Usage

 
ml_prepare_dataset( 
  x, 
  formula = NULL, 
  label = NULL, 
  features = NULL, 
  label_col = "label", 
  features_col = "features", 
  keep_original = TRUE, 
  ... 
)

Arguments

Arguments	Description
x	A `tbl_pyspark` object
formula	Used when `x` is a `tbl_spark`. R formula.
label	The name of the label column.
features	The name(s) of the feature columns as a character vector.
label_col	Label column name, as a length-one character vector.
features_col	Features column name, as a length-one character vector.
keep_original	Boolean flag that indicates whether the output keeps the original columns from `x`. Defaults to `TRUE`.
…	Added for backwards compatibility. Currently unused.

Details

At this time, ‘Spark ML Connect’ does not include a Vector Assembler transformer, so the main job of this function is to create a ‘Pyspark’ array column. Pipelines require a ‘label’ column and a ‘features’ column. Even though it is a single column in the dataset, the ‘features’ column contains all of the predictors inside an array. This function also creates a new ‘label’ column that copies the outcome variable. This makes it a lot easier to remove the ‘label’ and ‘outcome’ columns.
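
As a rough local analogy of the resulting layout, the sketch below builds the same two columns on an ordinary R data frame. It is not how the function is implemented: the real function works on Spark data, and the data frame, the `outcome` and `predictors` variables, and the use of a list column to stand in for a Spark array column are all assumptions made for illustration.

```r
# Local base-R analogy only; ml_prepare_dataset() itself operates on Spark data.
df <- data.frame(
  mpg = c(21.0, 22.8),
  wt  = c(2.62, 2.32),
  hp  = c(110, 93)
)

outcome    <- "mpg"          # hypothetical outcome variable
predictors <- c("wt", "hp")  # hypothetical predictor variables

# 'label' is a copy of the outcome variable.
df$label <- df[[outcome]]

# 'features' is a single column; each row holds all predictor values together.
# A list column stands in for the Spark array column here.
df$features <- lapply(seq_len(nrow(df)), function(i) {
  unlist(df[i, predictors], use.names = FALSE)
})

# Inspect the first row's feature vector.
str(df$features[[1]])
```

Each element of `features` is one numeric vector per row, which mirrors the idea of a single column that carries every predictor.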

Value

A `tbl_pyspark` containing either the original columns from `x` plus the ‘label’ and ‘features’ columns, or the ‘label’ and ‘features’ columns only.
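
The two output shapes could be requested roughly as in the sketch below. It assumes the package that provides `ml_prepare_dataset()` is attached and that `sc` is an already open Spark Connect connection. The helper name `prepare_mtcars` and the choice of `mtcars` columns are hypothetical. The calls are wrapped in a function so that nothing runs until a connection is supplied.

```r
# Hypothetical helper; `sc` is assumed to be an existing Spark Connect
# connection created elsewhere.
prepare_mtcars <- function(sc) {
  # Copy a small local data frame to Spark to get a remote table.
  tbl_mtcars <- dplyr::copy_to(sc, mtcars, overwrite = TRUE)

  list(
    # Formula interface; keep_original defaults to TRUE, so the original
    # columns are returned alongside 'label' and 'features'.
    with_original = ml_prepare_dataset(tbl_mtcars, formula = mpg ~ wt + hp),

    # Explicit label/features names; only 'label' and 'features' are returned.
    prepared_only = ml_prepare_dataset(
      tbl_mtcars,
      label = "mpg",
      features = c("wt", "hp"),
      keep_original = FALSE
    )
  )
}
```

Calling `prepare_mtcars(sc)` with a live connection would return both variants for comparison.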