Creates the ‘label’ and ‘features’ columns
R/ml-prepare-dataset.R
ml_prepare_dataset
Description
Creates the ‘label’ and ‘features’ columns
Usage
ml_prepare_dataset(
x, formula = NULL,
label = NULL,
features = NULL,
label_col = "label",
features_col = "features",
keep_original = TRUE,
... )
Arguments
Arguments | Description |
---|---|
x | A tbl_pyspark object |
formula | Used when x is a tbl_spark . R formula. |
label | The name of the label column. |
features | The name(s) of the feature columns as a character vector. |
label_col | Label column name, as a length-one character vector. |
features_col | Features column name, as a length-one character vector. |
keep_original | Boolean flag that indicates if the output will contain, or not, the original columns from x . Defaults to TRUE . |
… | Added for backwards compatibility. Not in use today. |
Details
At this time, ‘Spark ML Connect’, does not include a Vector Assembler transformer. The main thing that this function does, is create a ‘Pyspark’ array column. Pipelines require a ‘label’ and ‘features’ columns. Even though it is is single column in the dataset, the ‘features’ column will contain all of the predictors insde an array. This function also creates a new ‘label’ column that copies the outcome variable. This makes it a lot easier to remove the ‘label’, and ‘outcome’ columns.
Value
A tbl_pyspark
, with either the original columns from x
, plus the ‘label’ and ‘features’ column, or, the ‘label’ and ‘features’ columns only.