library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width")
ml_chisquare_test(iris_tbl, features = features, label = "Species")
#>        feature   label      p_value degrees_of_freedom statistic
#> 1  Petal_Width Species 0.000000e+00                 42 271.75000
#> 2 Petal_Length Species 0.000000e+00                 84 271.80000
#> 3 Sepal_Length Species 6.665987e-09                 68 156.26667
#> 4  Sepal_Width Species 6.016031e-05                 44  89.54629
Chi-square hypothesis testing for categorical data.
R/ml_stat.R
ml_chisquare_test
Description
Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
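To make the contingency-matrix step concrete, the same kind of test can be sketched locally in base R for a single feature. This is only an illustration using `stats::chisq.test()` on the local `iris` data frame, not a description of how sparklyr or Spark compute the result internally. It assumes the local column name `Petal.Width`, which appears as `Petal_Width` once the data is copied to Spark; the names `contingency`, `res`, and `df_expected` are chosen for this sketch.

```r
# Local illustration only: base R, no Spark connection required.
# Build the (feature, label) contingency matrix for one feature.
contingency <- table(iris$Petal.Width, iris$Species)

# Pearson's chi-squared test of independence on that matrix.
# Many cells hold small counts, so chisq.test() warns that the
# approximation may be inaccurate; the warning is suppressed here.
res <- suppressWarnings(chisq.test(contingency))

# Degrees of freedom = (distinct feature values - 1) * (label classes - 1)
df_expected <- (nrow(contingency) - 1) * (ncol(contingency) - 1)

res$statistic  # test statistic
res$parameter  # degrees of freedom
res$p.value    # p-value
```

Because the degrees of freedom depend on the number of distinct values a feature takes, features with more distinct values yield larger degrees of freedom when tested against the same label.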
Usage
ml_chisquare_test(x, features, label)
Arguments
Arguments | Description |
---|---|
x | A tbl_spark. |
features | The name(s) of the feature columns. This can also be the name of a single vector column created using ft_vector_assembler(). |
label | The name of the label column. |
Value
A data frame with one row for each (feature, label) pair with p-values, degrees of freedom, and test statistics.