Examples

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

partitions <- iris_tbl %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1111)

iris_training <- partitions$training
iris_test <- partitions$test

formula <- Species ~ .

# Train the models
kmeans_model <- ml_kmeans(iris_training, formula = formula)
b_kmeans_model <- ml_bisecting_kmeans(iris_training, formula = formula)
gmm_model <- ml_gaussian_mixture(iris_training, formula = formula)

# Predict
pred_kmeans <- ml_predict(kmeans_model, iris_test)
pred_b_kmeans <- ml_predict(b_kmeans_model, iris_test)
pred_gmm <- ml_predict(gmm_model, iris_test)
# Evaluate
ml_clustering_evaluator(pred_kmeans)
#> [1] 0.8736088
ml_clustering_evaluator(pred_b_kmeans)
#> [1] 0.4433206
ml_clustering_evaluator(pred_gmm)
#> [1] 0.8677525
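The three scores above can be collected into a small comparison table to pick the clustering with the best-separated clusters; a minimal sketch using only the objects created in the example:

# Compare the three models by their silhouette scores
scores <- data.frame(
  model = c("k-means", "bisecting k-means", "Gaussian mixture"),
  silhouette = c(
    ml_clustering_evaluator(pred_kmeans),
    ml_clustering_evaluator(pred_b_kmeans),
    ml_clustering_evaluator(pred_gmm)
  )
)
scores[order(-scores$silhouette), ]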
Spark ML - Clustering Evaluator
Source: R/ml_evaluation_clustering.R
ml_clustering_evaluator
Description
Evaluator for clustering results. The metric computes the Silhouette measure using the squared Euclidean distance. The Silhouette measures how consistent the points are within their clusters: it ranges from -1 to 1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.
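For background (this is the standard silhouette definition, not sparklyr-specific wording): with a(i) the mean squared Euclidean distance from point i to the other points in its own cluster, and b(i) the smallest mean squared distance from i to the points of any single other cluster, the per-point silhouette is

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

and the value reported by the evaluator is the mean of s(i) over all points in the dataset.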
Usage
ml_clustering_evaluator(
  x,
  features_col = "features",
  prediction_col = "prediction",
  metric_name = "silhouette",
  uid = random_string("clustering_evaluator_"),
  ...
)
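Written out with the defaults made explicit, the evaluation call from the example above is equivalent to the following (pred_kmeans is the prediction table created earlier):

ml_clustering_evaluator(
  pred_kmeans,
  features_col = "features",
  prediction_col = "prediction",
  metric_name = "silhouette"
)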
Arguments
Argument | Description
---|---
x | A spark_connection object or a tbl_spark containing the features and prediction columns. The latter should be the output of ml_predict().
features_col | Name of the features column.
prediction_col | Name of the prediction column.
metric_name | The performance metric. Currently only "silhouette" is supported.
uid | A character string used to uniquely identify the ML estimator.
... | Optional arguments; currently unused.
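If the prediction column in x does not use the default name, prediction_col can point the evaluator at it. A minimal sketch, assuming the column has been renamed with dplyr (the name "cluster" is purely illustrative):

pred_renamed <- dplyr::rename(pred_kmeans, cluster = prediction)
ml_clustering_evaluator(pred_renamed, prediction_col = "cluster")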
Value
The calculated performance metric
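Because the return value is a plain number, it is easy to use for simple model selection. A hypothetical sketch, reusing iris_training, iris_test, and formula from the example, that scores several choices of k for ml_kmeans (the helper function is illustrative, not part of sparklyr):

# Score a k-means model with a given number of clusters on the test split
silhouette_for_k <- function(k) {
  model <- ml_kmeans(iris_training, formula = formula, k = k)
  ml_clustering_evaluator(ml_predict(model, iris_test))
}
sapply(2:6, silhouette_for_k)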