library(modeldata)
library(dplyr) # provides %>%, select(), and glimpse()

data("small_fine_foods")

training_data %>%
  head(1) %>%
  as.list()
#> $product
#> [1] "B000J0LSBG"
#>
#> $review
#> [1] "this stuff is not stuffing its not good at all save your money"
#>
#> $score
#> [1] other
#> Levels: great other
Text modeling
This article builds on the concepts and techniques covered in other articles on this site. The example goes beyond the descriptive analysis in the Text Mining article: it shows how to pre-process and then model text data. It also expands on ML Pipelines by providing a more “real life” scenario of how and why to use pipelines.
Data
This article uses text data from the modeldata package. The Fine foods example data contains reviews of fine foods from Amazon. The package contains a training and a test set. The data consist of a product code, the text of the review, and the score. The score has two values: “great” and “other”.
We will start a local Spark session, and then copy both data sets to it.
library(sparklyr)

sc <- spark_connect(master = "local", version = "3.3")

sff_training_data <- copy_to(sc, training_data)
sff_testing_data <- copy_to(sc, testing_data)
Text transformers
Split into words (tokenizer)
We will split each review into individual words, or tokens. The ft_tokenizer() function returns an in-line list containing the individual words.
sff_training_data %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>%
  select(3:4)
#> # Source: spark<?> [?? x 2]
#> score word_list
#> <chr> <list>
#> 1 other <list [17]>
#> 2 great <list [100]>
#> 3 great <list [106]>
#> 4 great <list [36]>
#> 5 great <list [18]>
#> 6 great <list [30]>
#> 7 other <list [87]>
#> 8 great <list [54]>
#> 9 great <list [59]>
#> 10 great <list [44]>
#> # … with more rows
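To make the tokenizing step concrete, here is a minimal base R sketch that runs without Spark. It assumes the tokenizer lower-cases the text and splits it on whitespace; the `reviews` vector and the `tokenize()` helper are made up for illustration and are not part of sparklyr.

```r
# Made-up example text, not taken from the Fine foods data
reviews <- c(
  "This stuff is NOT stuffing",
  "Best gummy fruits ever"
)

# Lower-case the text, then split on runs of whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")

word_list <- tokenize(reviews)

word_list[[1]]
#> [1] "this"     "stuff"    "is"       "not"      "stuffing"

# One list element per review, mirroring the <list [n]> column above
lengths(word_list)
#> [1] 5 4
```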
Clean-up words (stop words)
Some words are very common in text, such as “the”, “and”, and “or”. These are called “stop words”. Stop words are usually not useful for analysis or modeling, so they are typically removed. That is exactly what ft_stop_words_remover() does. In addition to English, Spark has lists of stop words for several other languages. In the resulting table, notice that the number of words in wo_stop_words is lower than in word_list.
sff_training_data %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words"
  ) %>%
  select(3:5)
#> # Source: spark<?> [?? x 3]
#> score word_list wo_stop_words
#> <chr> <list> <list>
#> 1 other <list [17]> <list [9]>
#> 2 great <list [100]> <list [61]>
#> 3 great <list [106]> <list [67]>
#> 4 great <list [36]> <list [20]>
#> 5 great <list [18]> <list [9]>
#> 6 great <list [30]> <list [17]>
#> 7 other <list [87]> <list [58]>
#> 8 great <list [54]> <list [33]>
#> 9 great <list [59]> <list [36]>
#> 10 great <list [44]> <list [24]>
#> # … with more rows
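The filtering itself is simple, as the following base R sketch suggests. The `stop_words` vector is a tiny hypothetical list chosen for this example (Spark's built-in lists are much longer), and `remove_stop_words()` is an illustrative helper, not a sparklyr function.

```r
# Hypothetical, deliberately short stop word list
stop_words <- c("this", "is", "not", "the", "and", "or", "its", "at", "all", "your")

# Tokenized text, one character vector per review
word_list <- list(
  c("this", "stuff", "is", "not", "stuffing"),
  c("save", "your", "money")
)

# Keep only the words that are not in the stop word list
remove_stop_words <- function(tokens, stop_words) {
  lapply(tokens, function(words) words[!words %in% stop_words])
}

wo_stop_words <- remove_stop_words(word_list, stop_words)

wo_stop_words[[1]]
#> [1] "stuff"    "stuffing"

# Fewer words remain after filtering
lengths(word_list)
#> [1] 5 3
lengths(wo_stop_words)
#> [1] 2 2
```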
Index words (hash)
Text hashing maps a sequence of words, or “terms”, to their frequencies. The number of features (hash buckets) that the terms are mapped to is controlled by the num_features argument of ft_hashing_tf(). Because we are eventually going to use a logistic regression model, we will override the frequencies so that any non-zero count becomes 1. This is accomplished by setting the binary argument to TRUE.
sff_training_data %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words"
  ) %>%
  ft_hashing_tf(
    input_col = "wo_stop_words",
    output_col = "hashed_features",
    binary = TRUE,
    num_features = 1024
  ) %>%
  select(3:6)
#> # Source: spark<?> [?? x 4]
#> score word_list wo_stop_words hashed_features
#> <chr> <list> <list> <list>
#> 1 other <list [17]> <list [9]> <dbl [1,024]>
#> 2 great <list [100]> <list [61]> <dbl [1,024]>
#> 3 great <list [106]> <list [67]> <dbl [1,024]>
#> 4 great <list [36]> <list [20]> <dbl [1,024]>
#> 5 great <list [18]> <list [9]> <dbl [1,024]>
#> 6 great <list [30]> <list [17]> <dbl [1,024]>
#> 7 other <list [87]> <list [58]> <dbl [1,024]>
#> 8 great <list [54]> <list [33]> <dbl [1,024]>
#> 9 great <list [59]> <list [36]> <dbl [1,024]>
#> 10 great <list [44]> <list [24]> <dbl [1,024]>
#> # … with more rows
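The idea behind term hashing can be sketched in base R. This is only an illustration: `toy_hash()` is a made-up hash (sum of character codes modulo the number of buckets) and is not the hash function Spark uses, and `hash_terms()` is a hypothetical helper. It shows how each term lands in a fixed-length vector, and how a binary flag differs from a count.

```r
# Toy hash, for illustration only (NOT Spark's hash function)
toy_hash <- function(term, num_features) {
  sum(utf8ToInt(term)) %% num_features + 1 # +1 because R vectors are 1-indexed
}

# Map a vector of words to a fixed-length numeric vector
hash_terms <- function(words, num_features, binary = TRUE) {
  out <- numeric(num_features)
  for (w in words) {
    idx <- toy_hash(w, num_features)
    out[idx] <- if (binary) 1 else out[idx] + 1
  }
  out
}

words <- c("stuff", "stuffing", "stuff", "money")

counts <- hash_terms(words, num_features = 16, binary = FALSE)
flags <- hash_terms(words, num_features = 16, binary = TRUE)

# The output length is fixed by num_features, not by the number of words
length(flags)
#> [1] 16

# With binary = FALSE, the counts add up to the number of words
sum(counts)
#> [1] 4

# With binary = TRUE, every entry is either 0 or 1
all(flags %in% c(0, 1))
#> [1] TRUE
```

Different terms can fall into the same bucket, which is why a larger num_features reduces collisions at the cost of a wider feature vector.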
Normalize results
Finally, we normalize the hashed column using ft_normalizer().
sff_training_data %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words"
  ) %>%
  ft_hashing_tf(
    input_col = "wo_stop_words",
    output_col = "hashed_features",
    binary = TRUE,
    num_features = 1024
  ) %>%
  ft_normalizer(
    input_col = "hashed_features",
    output_col = "normal_features"
  ) %>%
  select(3:7)
#> # Source: spark<?> [?? x 5]
#> score word_list wo_stop_words hashed_features normal_features
#> <chr> <list> <list> <list> <list>
#> 1 other <list [17]> <list [9]> <dbl [1,024]> <dbl [1,024]>
#> 2 great <list [100]> <list [61]> <dbl [1,024]> <dbl [1,024]>
#> 3 great <list [106]> <list [67]> <dbl [1,024]> <dbl [1,024]>
#> 4 great <list [36]> <list [20]> <dbl [1,024]> <dbl [1,024]>
#> 5 great <list [18]> <list [9]> <dbl [1,024]> <dbl [1,024]>
#> 6 great <list [30]> <list [17]> <dbl [1,024]> <dbl [1,024]>
#> 7 other <list [87]> <list [58]> <dbl [1,024]> <dbl [1,024]>
#> 8 great <list [54]> <list [33]> <dbl [1,024]> <dbl [1,024]>
#> 9 great <list [59]> <list [36]> <dbl [1,024]> <dbl [1,024]>
#> 10 great <list [44]> <list [24]> <dbl [1,024]> <dbl [1,024]>
#> # … with more rows
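As a rough illustration of what normalizing does to one row, the base R sketch below scales a vector to unit length. It assumes the default setting of ft_normalizer(), p = 2 (the Euclidean norm); `l2_normalize()` and the `hashed` vector are made up for this example.

```r
# Divide a vector by its Euclidean (L2) length
l2_normalize <- function(x) {
  norm <- sqrt(sum(x^2))
  if (norm == 0) x else x / norm # leave an all-zero vector unchanged
}

# Made-up binary indicators with four active terms
hashed <- c(0, 1, 0, 1, 1, 1)
normal <- l2_normalize(hashed)

normal
#> [1] 0.0 0.5 0.0 0.5 0.5 0.5

# The normalized vector has length 1
sqrt(sum(normal^2))
#> [1] 1
```

After this step, a long review with many active terms and a short review with few both produce vectors of the same length, so review size alone does not dominate the features.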
ft_hashing_tf() outputs the index and frequency of each term. This is similar to how “dummy variables” are created for each discrete value of a categorical variable. It means that, for modeling, we only need one “column”, hashed_features. However, we will use normal_features in the model, since it is the normalized version derived from hashed_features.
Prepare the model with an ML Pipeline
The same set of complex transformations is needed for both modeling and predictions. Without a pipeline, we would have to duplicate the code for both. That is not ideal during development, because any change in the transformations has to be copied to both sets of code. This makes a compelling argument for using ML Pipelines.
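The "define once, apply everywhere" idea can be sketched in base R, without Spark. Here a pipeline is simply an ordered list of functions; `steps`, `run_steps()`, and the example text are all hypothetical and only stand in for what an ML Pipeline does at a much larger scale.

```r
# The pre-processing steps are defined in one place only
steps <- list(
  tokenize = function(x) strsplit(tolower(x), "\\s+"),
  drop_stop_words = function(x) {
    lapply(x, function(w) w[!w %in% c("the", "is", "a")])
  }
)

# Apply every step, in order, to whatever data is passed in
run_steps <- function(data, steps) {
  Reduce(function(acc, f) f(acc), steps, data)
}

train_text <- "The tea is great"
new_text <- "A bitter tea"

# The same definition serves both the training data and new data
run_steps(train_text, steps)[[1]]
#> [1] "tea"   "great"
run_steps(new_text, steps)[[1]]
#> [1] "bitter" "tea"
```

Changing a step in `steps` changes it for both calls at once, which is the property that makes pipelines attractive.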
We can initialize a pipeline using ml_pipeline(), and then pass the exact same steps used in the previous section. We then append the model formula via ft_r_formula(), followed by the model function, in this case ml_logistic_regression().
sff_pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words"
  ) %>%
  ft_hashing_tf(
    input_col = "wo_stop_words",
    output_col = "hashed_features",
    binary = TRUE,
    num_features = 1024
  ) %>%
  ft_normalizer(
    input_col = "hashed_features",
    output_col = "normal_features"
  ) %>%
  ft_r_formula(score ~ normal_features) %>%
  ml_logistic_regression()

sff_pipeline
#> Pipeline (Estimator) with 6 stages
#> <pipeline__87caaa39_2fa9_4708_a1e1_20ab570c8917>
#> Stages
#> |--1 Tokenizer (Transformer)
#> | <tokenizer__e3cf3ba6_f7e9_4a05_a41d_11963d70fd6c>
#> | (Parameters -- Column Names)
#> | input_col: review
#> | output_col: word_list
#> |--2 StopWordsRemover (Transformer)
#> | <stop_words_remover__3fc0bf48_9fa0_441a_9bb3_5a19ec72be0f>
#> | (Parameters -- Column Names)
#> | input_col: word_list
#> | output_col: wo_stop_words
#> |--3 HashingTF (Transformer)
#> | <hashing_tf__3fa3d087_39e8_4668_9921_28150a53412c>
#> | (Parameters -- Column Names)
#> | input_col: wo_stop_words
#> | output_col: hashed_features
#> |--4 Normalizer (Transformer)
#> | <normalizer__6d4d9c1c_7488_4a4d_8d42_d9830de4ee2f>
#> | (Parameters -- Column Names)
#> | input_col: hashed_features
#> | output_col: normal_features
#> |--5 RFormula (Estimator)
#> | <r_formula__4ae7b190_ce59_4d5f_b75b_cbd623e1a790>
#> | (Parameters -- Column Names)
#> | features_col: features
#> | label_col: label
#> | (Parameters)
#> | force_index_label: FALSE
#> | formula: score ~ normal_features
#> | handle_invalid: error
#> | stringIndexerOrderType: frequencyDesc
#> |--6 LogisticRegression (Estimator)
#> | <logistic_regression__46c6e5fb_7c70_44f2_a366_f0a7f94801e1>
#> | (Parameters -- Column Names)
#> | features_col: features
#> | label_col: label
#> | prediction_col: prediction
#> | probability_col: probability
#> | raw_prediction_col: rawPrediction
#> | (Parameters)
#> | aggregation_depth: 2
#> | elastic_net_param: 0
#> | family: auto
#> | fit_intercept: TRUE
#> | max_iter: 100
#> | maxBlockSizeInMB: 0
#> | reg_param: 0
#> | standardization: TRUE
#> | threshold: 0.5
#> | tol: 1e-06
Fit and predict
sff_pipeline is an ML Pipeline, which is essentially a set of steps to take; it can be thought of as a recipe. To actually fit the model, we use ml_fit(). This executes all of the transformations, and then fits the model. In other words, ml_fit() runs all of the steps in the pipeline. The output is an ML Pipeline Model.
sff_pipeline_model <- ml_fit(sff_pipeline, sff_training_data)
sff_pipeline_model is more than just a “fitted” model. It also contains all of the pre-processing steps, so any new data passed through it goes through the same transformations before the predictions are run. To execute the pipeline model against the test data, we use ml_transform().
sff_test_predictions <- sff_pipeline_model %>%
  ml_transform(sff_testing_data)

glimpse(sff_test_predictions)
#> Rows: ??
#> Columns: 12
#> Database: spark_connection
#> $ product <chr> "B005GXFP60", "B000G7V394", "B004WJAULO", "B003D4MBOS"…
#> $ review <chr> "These are the best tasting gummy fruits I have ever e…
#> $ score <chr> "great", "great", "other", "other", "great", "other", …
#> $ word_list <list> ["these", "are", "the", "best", "tasting", "gummy", "…
#> $ wo_stop_words <list> ["best", "tasting", "gummy", "fruits", "ever", "eaten…
#> $ hashed_features <list> <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ normal_features <list> <0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000…
#> $ features <list> <0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000…
#> $ label <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, …
#> $ rawPrediction <list> <8.570594, -8.570594>, <-0.1648486, 0.1648486>, <-1.9…
#> $ probability <list> <0.9998104359, 0.0001895641>, <0.4588809, 0.5411191>,…
#> $ prediction <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, …
Using ml_metrics_binary(), we can see how well the model performed.
ml_metrics_binary(sff_test_predictions)
#> # A tibble: 2 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.706
#> 2 pr_auc binary 0.567
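To help interpret roc_auc, the base R sketch below computes it by hand on a small made-up example. The metric can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The `label` and `score` vectors are invented for illustration and are not results from the model above; the `roc_auc()` helper uses the rank-based (Mann-Whitney) formula, which is one standard way to compute the area and not necessarily how Spark calculates it.

```r
# Made-up labels (1 = positive) and predicted scores
label <- c(1, 0, 1, 1, 0, 0)
score <- c(0.9, 0.4, 0.35, 0.8, 0.2, 0.7)

# Rank-based (Mann-Whitney) formula for the area under the ROC curve
roc_auc <- function(label, score) {
  r <- rank(score) # ties receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# 7 of the 9 positive/negative pairs are ordered correctly
roc_auc(label, score)
#> [1] 0.7777778
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering, which gives some context for the 0.706 reported above.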
Tune the model (optional)
The performance of the model may be acceptable, but there could be a desire to improve it. Hyperparameter tuning can be used to find out whether other function arguments give better results. A big advantage of using an ML Pipeline for the initial model is that we can reuse the exact same pipeline code to perform the tuning. The Grid Search Tuning article shows how to do this.