Feature Transformation - Imputer (Estimator)

ft_imputer

Description

Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. This function requires Spark 2.2.0+.

Usage

ft_imputer(
  x,
  input_cols = NULL,
  output_cols = NULL,
  missing_value = NULL,
  strategy = "mean",
  uid = random_string("imputer_"),
  ...
)

Arguments

Arguments	Description
x	A `spark_connection`, `ml_pipeline`, or a `tbl_spark`.
input_cols	The names of the input columns.
output_cols	The names of the output columns.
missing_value	The placeholder for the missing values. All occurrences of `missing_value` will be imputed. Note that null values are always treated as missing.
strategy	The imputation strategy. Currently only “mean” and “median” are supported. If “mean”, then replace missing values using the mean value of the feature. If “median”, then replace missing values using the approximate median value of the feature. Default: mean
uid	A character string used to uniquely identify the feature transformer.
...	Optional arguments; currently unused.
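
As a minimal sketch of how these arguments fit together, the example below imputes one column with the default mean strategy and another with the median, using a custom `missing_value`. It assumes a local Spark installation (2.2.0 or later) is available; the data frame, table name, and column names are hypothetical and chosen only for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation (2.2.0 or later) is available.
sc <- spark_connect(master = "local")

# Hypothetical toy data: `age` contains an NA, and `income` uses -1 as a
# placeholder for an unknown value.
people <- data.frame(
  age    = c(25, NA, 40, 35),
  income = c(50000, 62000, -1, 58000)
)
people_tbl <- copy_to(sc, people, "people_demo", overwrite = TRUE)

# Mean imputation (the default strategy) of the missing value in `age`.
age_filled <- people_tbl %>%
  ft_imputer(input_cols = "age", output_cols = "age_imputed") %>%
  collect()

# Median imputation of `income`, treating -1 as the missing-value marker.
income_filled <- people_tbl %>%
  ft_imputer(
    input_cols = "income",
    output_cols = "income_imputed",
    missing_value = -1,
    strategy = "median"
  ) %>%
  collect()

spark_disconnect(sc)
```

The original columns are left in place; the imputed values are written to the columns named in `output_cols`.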

Details

In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
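
The following sketch compares the one-step call on a tbl_spark with an explicit fit followed by a transform, which should produce the same imputed column. It assumes a local Spark installation; the `scores` data and the table name are hypothetical. `ml_fit()` and `ml_transform()` are the general sparklyr functions for fitting an estimator and applying the resulting transformer.

```r
library(sparklyr)

# Assumption: a local Spark installation (2.2.0 or later) is available.
sc <- spark_connect(master = "local")

# Hypothetical toy data with one missing value.
scores <- data.frame(score = c(1, 2, NA, 6))
scores_tbl <- copy_to(sc, scores, "scores_demo", overwrite = TRUE)

# One step: passing a tbl_spark fits the imputer and transforms the table.
one_step <- ft_imputer(
  scores_tbl,
  input_cols = "score",
  output_cols = "score_imputed"
)

# Two steps: build the estimator from the connection, fit it, then transform.
imputer <- ft_imputer(sc, input_cols = "score", output_cols = "score_imputed")
fitted <- ml_fit(imputer, scores_tbl)
two_step <- ml_transform(fitted, scores_tbl)

# Bring both results into R before closing the connection.
one_step_local <- dplyr::collect(one_step)
two_step_local <- dplyr::collect(two_step)

spark_disconnect(sc)
```

The two-step form is useful when the fitted imputer needs to be reused on other data, for example to apply training-set statistics to a test set.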

Value

The object returned depends on the class of x. If it is a spark_connection, the function returns an ml_estimator object. If it is an ml_pipeline, it returns a pipeline with the estimator appended to it. If it is a tbl_spark, it returns a tbl_spark with the transformation applied to it.
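
The three cases can be checked with `inherits()`, as in the sketch below. It assumes a local Spark installation; the column names `x` and `x_imputed` and the table name are hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation (2.2.0 or later) is available.
sc <- spark_connect(master = "local")

# Case 1: a spark_connection yields an unfitted estimator.
est <- ft_imputer(sc, input_cols = "x", output_cols = "x_imputed")
is_estimator <- inherits(est, "ml_estimator")

# Case 2: an ml_pipeline yields a pipeline with the imputer stage appended.
pipeline <- ml_pipeline(sc) %>%
  ft_imputer(input_cols = "x", output_cols = "x_imputed")
is_pipeline <- inherits(pipeline, "ml_pipeline")

# Case 3: a tbl_spark yields a tbl_spark with the imputed column added.
# Hypothetical toy data with one missing value.
x_tbl <- copy_to(sc, data.frame(x = c(1, NA, 3)), "x_demo", overwrite = TRUE)
result <- ft_imputer(x_tbl, input_cols = "x", output_cols = "x_imputed")
is_tbl <- inherits(result, "tbl_spark")

spark_disconnect(sc)
```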

See Also

Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
