Creates the ‘label’ and ‘features’ columns

R/ml-prepare-dataset.R

ml_prepare_dataset

Description

Creates the ‘label’ and ‘features’ columns

Usage

 
ml_prepare_dataset( 
  x, 
  formula = NULL, 
  label = NULL, 
  features = NULL, 
  label_col = "label", 
  features_col = "features", 
  keep_original = TRUE, 
  ... 
) 

Arguments

Arguments Description
x A tbl_pyspark object
formula An R formula. Used when x is a tbl_spark.
label The name of the label column.
features The name(s) of the feature columns as a character vector.
label_col Label column name, as a length-one character vector.
features_col Features column name, as a length-one character vector.
keep_original Logical flag indicating whether the output keeps the original columns from x. Defaults to TRUE.
... Added for backwards compatibility. Not in use today.

Details

At this time, ‘Spark ML Connect’ does not include a Vector Assembler transformer. The main job of this function is to create a ‘Pyspark’ array column. Pipelines require a ‘label’ column and a ‘features’ column. Even though it is a single column in the dataset, the ‘features’ column contains all of the predictors inside an array. This function also creates a new ‘label’ column that copies the outcome variable. This makes it much easier to remove the ‘label’ and ‘outcome’ columns.
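To make the shape of the result concrete, the following is a minimal local sketch in base R. It is only an analogue built on an ordinary data frame: it does not call Spark and is not how ml_prepare_dataset is implemented. The choice of mtcars, with am as the outcome and mpg and wt as predictors, is an assumption made for illustration.

```r
# Local analogue of the output shape only; no Spark involved.
# Assumed for illustration: outcome `am`, predictors `mpg` and `wt`.
df <- mtcars[1:3, c("am", "mpg", "wt")]

# The 'label' column is a copy of the outcome variable.
prepared <- data.frame(label = df$am)

# The 'features' column is a single column in which each row holds
# all of that row's predictors, packed into one numeric vector.
prepared$features <- lapply(
  seq_len(nrow(df)),
  function(i) c(df$mpg[i], df$wt[i])
)

str(prepared)
```

Each row of prepared has one label value and one features entry of length two, which mirrors the single array-valued column described above.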

Value

A tbl_pyspark containing either the original columns from x plus the ‘label’ and ‘features’ columns, or only the ‘label’ and ‘features’ columns, depending on keep_original.
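
Examples

The sketch below shows how a call might look. It is wrapped in a function so that nothing runs until a connection is supplied. The helper name prepare_mtcars is hypothetical. The sketch assumes that sc is an already open Spark Connect connection, that mtcars can be copied to it with dplyr::copy_to, and that am, mpg and wt are the columns of interest.

```r
# Hypothetical helper, for illustration only.
# Assumes `sc` is an open Spark Connect connection and that the
# dplyr and pysparklyr packages are installed.
prepare_mtcars <- function(sc) {
  # Copy a small local data set to the Spark session (assumed to be supported).
  tbl_mtcars <- dplyr::copy_to(sc, mtcars, overwrite = TRUE)

  # Build the 'label' and 'features' columns from a formula and
  # drop the original columns from the result.
  pysparklyr::ml_prepare_dataset(
    tbl_mtcars,
    formula = am ~ mpg + wt,
    keep_original = FALSE
  )
}
```

Per the Usage section, the outcome and predictors can instead be named directly with label = "am" and features = c("mpg", "wt").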