Introduction to tidymodels
R. Dimas Bagas Herlambang

33 minute read

The following presentation is produced by the team at Algoritma for its internal training. This presentation is intended for a restricted audience only. It may not be reproduced, distributed, translated, or adapted in any form outside these individuals and organizations without permission.

Outline

Why tidymodels Matters?

  • Things we think we’re doing right
  • Things we never think we could do better

Setting the Cross-Validation Scheme using rsample

  • Rethinking: Why do we need validation?
  • Tidy way to rsample-ing your dataset

Data Preprocessing using recipes

  • Rethinking: How should we treat train and test?
  • Reproducible preprocessing recipes

Model Fitting using parsnip

  • Rethinking: How many machine learning packages have you used?
  • One vegeta.. I mean package to rule them all: parsnip

Model Evaluation using yardstick

  • Rethinking: How do we measure the goodness of our model?
  • It’s always good to bring your own yardstick

Why tidymodels Matters?

Things we think we’re doing right

Sample splitting

# import libs
library(plotly)
library(randomForest)
library(ranger)
library(tidyverse)
library(tidymodels)
# import additional libs
library(lubridate)

# prepare example datasets
attrition <- read_csv("data_input/attrition.csv")
# set seed
set.seed(100)

# train rows
in_train <- sample(1:nrow(attrition), nrow(attrition) * 0.8)

# check target distribution in train
prop.table(table(attrition$attrition[in_train]))
#> 
#>        no       yes 
#> 0.8367347 0.1632653
# check target distribution in test
prop.table(table(attrition$attrition[-in_train]))
#> 
#>        no       yes 
#> 0.8469388 0.1530612

Numeric scaling

# scale age in full dataset
age_scaled <- scale(attrition$age)

# check mean and standard deviation
attr(age_scaled, "scaled:center")
#> [1] 36.92381
attr(age_scaled, "scaled:scale")
#> [1] 9.135373
# scale age in train dataset
age_train_scaled <- scale(attrition$age[in_train])

# check mean and standard deviation
attr(age_train_scaled, "scaled:center")
#> [1] 37.0085
attr(age_train_scaled, "scaled:scale")
#> [1] 9.070969
# scale age in test dataset
age_test_scaled <- scale(attrition$age[-in_train])

# check mean and standard deviation
attr(age_test_scaled, "scaled:center")
#> [1] 36.58503
attr(age_test_scaled, "scaled:scale")
#> [1] 9.396712

Things we never think we could do better

How we see model performance

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  1036  197
#>        yes  197   40
#>                                           
#>                Accuracy : 0.732           
#>                  95% CI : (0.7085, 0.7545)
#>     No Information Rate : 0.8388          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.009           
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.16878         
#>             Specificity : 0.84023         
#>          Pos Pred Value : 0.16878         
#>          Neg Pred Value : 0.84023         
#>              Prevalence : 0.16122         
#>          Detection Rate : 0.02721         
#>    Detection Prevalence : 0.16122         
#>       Balanced Accuracy : 0.50450         
#>                                           
#>        'Positive' Class : yes             
#> 
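
The code that produced this report is not shown in this excerpt. As a rough illustration only, the output above appears to follow the format of caret::confusionMatrix(), so a minimal sketch could look like the following; pred and test_actual are hypothetical factor vectors of predicted and actual classes on the test dataset:

# a minimal sketch, not the original code: classic confusion matrix report
# `pred` and `test_actual` are hypothetical factors of predicted and actual classes
library(caret)

confusionMatrix(
  data = pred,             # predicted classes
  reference = test_actual, # actual classes
  positive = "yes"         # treat "yes" (attrition) as the positive class
)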

How we use the Receiver Operating Characteristic (ROC) curve
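
The original slide presumably showed an ROC curve plot at this point. As a hedged illustration of the classic workflow (not the original code), an ROC curve could be drawn with the pROC package; test_actual and pred_prob below are hypothetical vectors of actual classes and predicted probabilities for the positive class:

# a minimal sketch, not the original code: classic ROC curve with pROC
# `test_actual` is a hypothetical factor of actual classes;
# `pred_prob` is a hypothetical numeric vector of P(attrition = "yes")
library(pROC)

roc_obj <- roc(response = test_actual, predictor = pred_prob)
plot(roc_obj)  # base plot of the ROC curve
auc(roc_obj)   # area under the curve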

Setting the Cross-Validation Scheme using rsample

Rethinking: Why do we need validation?

Tidy way to rsample-ing your dataset

rsample is the part of tidymodels that helps us split or resample our machine learning dataset.

There are many splitting and resampling approaches provided by rsample, as you can see in its full function reference page. In this introduction, we will use the two most general functions:

Initial splitting

Initial random sampling for splitting train and test could be done using initial_split():

# set seed
set.seed(100)

# create initial split
splitted <- initial_split(attrition, prop = 0.8)

# check train dataset
training(splitted)
#> # A tibble: 1,177 x 35
#>    attrition   age business_travel daily_rate department distance_from_h~
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 yes          41 travel_rarely         1102 sales                     1
#>  2 no           49 travel_frequen~        279 research_~                8
#>  3 yes          37 travel_rarely         1373 research_~                2
#>  4 no           33 travel_frequen~       1392 research_~                3
#>  5 no           27 travel_rarely          591 research_~                2
#>  6 no           32 travel_frequen~       1005 research_~                2
#>  7 no           59 travel_rarely         1324 research_~                3
#>  8 no           30 travel_rarely         1358 research_~               24
#>  9 no           38 travel_frequen~        216 research_~               23
#> 10 no           36 travel_rarely         1299 research_~               27
#> # ... with 1,167 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
# check test dataset
testing(splitted)
#> # A tibble: 293 x 35
#>    attrition   age business_travel daily_rate department distance_from_h~
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 no           29 travel_rarely         1389 research_~               21
#>  2 no           38 travel_rarely          371 research_~                2
#>  3 yes          36 travel_rarely         1218 sales                     9
#>  4 yes          34 travel_rarely          699 research_~                6
#>  5 no           53 travel_rarely         1282 research_~                5
#>  6 yes          24 travel_rarely          813 research_~                1
#>  7 no           36 travel_rarely          852 research_~                5
#>  8 no           27 travel_rarely         1240 research_~                2
#>  9 yes          41 travel_rarely         1360 research_~               12
#> 10 yes          48 travel_rarely          626 research_~                1
#> # ... with 283 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

But sometimes, simple random sampling is not enough:

# target distribution in full dataset
prop.table(table(attrition$attrition))
#> 
#>        no       yes 
#> 0.8387755 0.1612245
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8462192 0.1537808
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8088737 0.1911263

This is where we need the strata argument to use stratified random sampling:

# set seed
set.seed(100)

# create stratified initial split
splitted <- initial_split(attrition, prop = 0.8, strata = "attrition")

# target distribution in full dataset
prop.table(table(attrition$attrition))
#> 
#>        no       yes 
#> 0.8387755 0.1612245
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8385726 0.1614274
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8395904 0.1604096

Cross-sectional resampling

To use k-fold cross-validation splits, optionally with repetition, we could use vfold_cv():

# set seed
set.seed(100)

# create k-fold splits with repetition
resampled <- vfold_cv(attrition, v = 3, repeats = 2, strata = "attrition")

# quick check
resampled
#> #  3-fold cross-validation repeated 2 times using stratification 
#> # A tibble: 6 x 3
#>   splits            id      id2  
#>   <named list>      <chr>   <chr>
#> 1 <split [980/490]> Repeat1 Fold1
#> 2 <split [980/490]> Repeat1 Fold2
#> 3 <split [980/490]> Repeat1 Fold3
#> 4 <split [980/490]> Repeat2 Fold1
#> 5 <split [980/490]> Repeat2 Fold2
#> 6 <split [980/490]> Repeat2 Fold3

Each train and test dataset is stored in the splits column. We could use analysis() and assessment() to get the train and test datasets, respectively:

# check train dataset on an example split
analysis(resampled$splits[[1]])
#> # A tibble: 980 x 35
#>    attrition   age business_travel daily_rate department distance_from_h~
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 no           49 travel_frequen~        279 research_~                8
#>  2 yes          37 travel_rarely         1373 research_~                2
#>  3 no           33 travel_frequen~       1392 research_~                3
#>  4 no           27 travel_rarely          591 research_~                2
#>  5 no           59 travel_rarely         1324 research_~                3
#>  6 no           36 travel_rarely         1299 research_~               27
#>  7 no           29 travel_rarely          153 research_~               15
#>  8 no           31 travel_rarely          670 research_~               26
#>  9 no           34 travel_rarely         1346 research_~               19
#> 10 no           22 non_travel            1123 research_~               16
#> # ... with 970 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
# check test dataset on an example split
assessment(resampled$splits[[1]])
#> # A tibble: 490 x 35
#>    attrition   age business_travel daily_rate department distance_from_h~
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 yes          41 travel_rarely         1102 sales                     1
#>  2 no           32 travel_frequen~       1005 research_~                2
#>  3 no           30 travel_rarely         1358 research_~               24
#>  4 no           38 travel_frequen~        216 research_~               23
#>  5 no           35 travel_rarely          809 research_~               16
#>  6 yes          28 travel_rarely          103 research_~               24
#>  7 no           29 travel_rarely         1389 research_~               21
#>  8 no           32 travel_rarely          334 research_~                5
#>  9 no           38 travel_rarely          371 research_~                2
#> 10 yes          36 travel_rarely         1218 sales                     9
#> # ... with 480 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

Data Preprocessing using recipes

Rethinking: How should we treat train and test?

Reproducible preprocessing recipes

recipes is the part of tidymodels that helps us build reproducible data preprocessing.

There are many data preprocessing approaches provided by recipes, as you can see in its full function reference page. In this introduction, we will use several preprocessing steps relevant to our example dataset.

There are several steps that we could apply to our dataset; some are very fundamental and sometimes mandatory, while others are tuneable:

  • step_rm(): Manually remove unused columns.
  • step_nzv(): Automatically filter columns with near-zero variance.
  • step_string2factor(): Manually convert string columns to factors.
  • step_downsample(): Downsample to balance the target’s class distribution (tuneable).
  • step_center(): Normalize the mean of numeric column(s) to zero (tuneable).
  • step_scale(): Normalize the standard deviation of numeric column(s) to one (tuneable).
  • step_pca(): Shrink numeric column(s) into several PCA components (tuneable).

Designing your first preprocess recipes

  1. Initiate a recipe using recipe()
    • Define your formula in the first argument.
    • Supply a template dataset in the data argument.
  2. Pipe to every needed step_*(); always remember to put the steps in the proper order.
  3. After finishing with every needed step_*(), pipe to the prep() function to train your recipe:
    • It will automatically convert all character columns to factor;
    • If you used step_string2factor(), don’t forget to specify strings_as_factors = FALSE;
    • It will train the recipe on the dataset specified in recipe()’s data argument;
    • If you want to train on another dataset, you can supply the new dataset to the training argument and set the fresh argument to TRUE.

Let’s see an example of defining a recipe:

# define preprocess recipe from train dataset
rec <- recipe(attrition ~ ., data = training(splitted)) %>% 
  step_rm(employee_count, employee_number) %>%
  step_nzv(all_predictors()) %>% 
  step_string2factor(all_nominal(), -attrition) %>%
  step_string2factor(attrition, levels = c("yes", "no")) %>%
  step_downsample(attrition, ratio = 1/1, seed = 100) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.85) %>%
  prep(strings_as_factors = FALSE)

# quick check
rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         34
#> 
#> Training data contained 1177 data points and no missing data.
#> 
#> Operations:
#> 
#> Variables removed employee_count, employee_number [trained]
#> Sparse, unbalanced variable filter removed over_18, standard_hours [trained]
#> Factor variables from business_travel, ... [trained]
#> Factor variables from attrition [trained]
#> Down-sampling based on attrition [trained]
#> Centering for age, daily_rate, ... [trained]
#> Scaling for age, daily_rate, ... [trained]
#> PCA extraction with age, daily_rate, ... [trained]

There are two ways of obtaining the result from our recipe:

  • juice(): Extract the preprocessed dataset from a prep()-ed recipe. Normally, we will use this function to get the preprocessed train dataset.

  • bake(): Apply a recipe to a new dataset. Normally, we will use this function to preprocess a new dataset, such as the test dataset or a prediction dataset.

Here are some examples of how to get the preprocessed train and test datasets:

# get preprocessed train dataset
data_train <- juice(rec)

# quick check
data_train
#> # A tibble: 380 x 22
#>    business_travel department education_field gender job_role
#>    <fct>           <fct>      <fct>           <fct>  <fct>   
#>  1 travel_rarely   sales      life_sciences   female sales_e~
#>  2 travel_rarely   research_~ other           male   laborat~
#>  3 travel_rarely   research_~ life_sciences   male   laborat~
#>  4 travel_rarely   research_~ medical         male   researc~
#>  5 travel_frequen~ research_~ life_sciences   female researc~
#>  6 travel_rarely   sales      technical_degr~ male   sales_r~
#>  7 travel_rarely   research_~ medical         male   researc~
#>  8 travel_rarely   sales      marketing       male   sales_r~
#>  9 travel_rarely   research_~ technical_degr~ female researc~
#> 10 travel_rarely   research_~ life_sciences   male   laborat~
#> # ... with 370 more rows, and 17 more variables: marital_status <fct>,
#> #   over_time <fct>, attrition <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>,
#> #   PC04 <dbl>, PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>,
#> #   PC09 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>
# get preprocessed test dataset
data_test <- bake(rec, testing(splitted))

# quick check
data_test
#> # A tibble: 293 x 22
#>    attrition business_travel department education_field gender job_role
#>    <fct>     <fct>           <fct>      <fct>           <fct>  <fct>   
#>  1 no        travel_rarely   research_~ life_sciences   female manufac~
#>  2 no        non_travel      research_~ other           female manufac~
#>  3 yes       travel_rarely   sales      life_sciences   male   sales_r~
#>  4 no        travel_rarely   research_~ other           female manager 
#>  5 no        travel_rarely   sales      marketing       male   sales_e~
#>  6 no        travel_rarely   sales      marketing       female sales_r~
#>  7 no        travel_rarely   research_~ other           male   laborat~
#>  8 yes       travel_rarely   research_~ life_sciences   male   laborat~
#>  9 no        travel_frequen~ research_~ medical         female laborat~
#> 10 no        travel_rarely   research_~ life_sciences   male   researc~
#> # ... with 283 more rows, and 16 more variables: marital_status <fct>,
#> #   over_time <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>, PC04 <dbl>,
#> #   PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>, PC09 <dbl>,
#> #   PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>

Model Fitting using parsnip

Rethinking: How many machine learning packages have you used?

One vegeta.. I mean package to rule them all: parsnip

parsnip is the part of tidymodels that helps us with model fitting and prediction workflows.

There are many models supported by parsnip, as you can see in its full model list. In this introduction, we will use random forest as an example model.

There are two parts of defining a model that should be noted:

  • Defining the model’s specification: In this part, we need to define the model’s specification, such as mtry and trees for a random forest, through model-specific functions. For example, you can use rand_forest() to define a random forest specification. Make sure to check the full model list to see every model and its available arguments.

  • Defining the model’s engine: In this part, we need to define the model’s engine, which determines the package we will use to fit our model. This can be done using the set_engine() function. Note that in addition to defining which package we want to use as our engine, we could also pass package-specific arguments to this function.

This is an example of defining a random forest model using randomForest::randomForest() as our engine:

# set-up model specification
model_spec <- rand_forest(
  mode = "classification",
  mtry = ncol(data_train) - 2,
  trees = 500,
  min_n = 15
)

# set-up model engine
model_engine <- set_engine(
  object = model_spec,
  engine = "randomForest"
)

# quick check
model_engine
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   mtry = ncol(data_train) - 2
#>   trees = 500
#>   min_n = 15
#> 
#> Computational engine: randomForest

To fit our model, we have two options:

  • Formula interface
  • X-Y interface

Note that some packages behave differently between the formula and x-y interfaces. For example, randomForest::randomForest() would convert all of our categorical variables into dummy variables in the formula interface, but not in the x-y interface.

Fit using the formula interface with the fit() function:

# fit the model
model <- fit(
  object = model_engine,
  formula = attrition ~ .,
  data = data_train
)

# quick check
model
#> parsnip model object
#> 
#> 
#> Call:
#>  randomForest(x = as.data.frame(x), y = y, ntree = ~500, mtry = ~ncol(data_train) -      2, nodesize = ~15) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 20
#> 
#>         OOB estimate of  error rate: 28.95%
#> Confusion matrix:
#>     yes  no class.error
#> yes 141  49   0.2578947
#> no   61 129   0.3210526

Or through the x-y interface using the fit_xy() function:

# fit the model
model <- fit_xy(
  object = model_engine,
  x = select(data_train, -attrition),
  y = select(data_train, attrition)
)

# quick check
model
#> parsnip model object
#> 
#> 
#> Call:
#>  randomForest(x = as.data.frame(x), y = y, ntree = ~500, mtry = ~ncol(data_train) -      2, nodesize = ~15) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 20
#> 
#>         OOB estimate of  error rate: 27.37%
#> Confusion matrix:
#>     yes  no class.error
#> yes 138  52   0.2736842
#> no   52 138   0.2736842

In this workflow, it should be relatively easy to change the model engine.

Let’s try to fit the same model specification, but now using ranger::ranger():

# set-up other model engine
model_engine <- set_engine(
  object = model_spec,
  engine = "ranger",
  seed = 100,
  num.threads = parallel::detectCores() / 2,
  importance = "impurity"
)

# quick check
model_engine
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   mtry = ncol(data_train) - 2
#>   trees = 500
#>   min_n = 15
#> 
#> Engine-Specific Arguments:
#>   seed = 100
#>   num.threads = parallel::detectCores()/2
#>   importance = impurity
#> 
#> Computational engine: ranger

Now let’s try to fit the model, and see the result:

# fit the model
model <- fit(
  object = model_engine,
  formula = attrition ~ .,
  data = data_train
)

# quick check
model
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) -      2, num.trees = ~500, min.node.size = ~15, seed = ~100, num.threads = ~parallel::detectCores()/2,      importance = ~"impurity", verbose = FALSE, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      380 
#> Number of independent variables:  21 
#> Mtry:                             20 
#> Target node size:                 15 
#> Variable importance mode:         impurity 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1945066

Notice that ranger::ranger() doesn’t behave differently between fit() and fit_xy():

# fit the model
model <- fit_xy(
  object = model_engine,
  x = select(data_train, -attrition),
  y = select(data_train, attrition)
)

# quick check
model
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) -      2, num.trees = ~500, min.node.size = ~15, seed = ~100, num.threads = ~parallel::detectCores()/2,      importance = ~"impurity", verbose = FALSE, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      380 
#> Number of independent variables:  21 
#> Mtry:                             20 
#> Target node size:                 15 
#> Variable importance mode:         impurity 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1945066

To get the predictions, we could use predict() as usual, but note that it returns a tidy tibble instead of a vector (as in type = "class" cases) or a raw data.frame (as in type = "prob" cases).

In this way, the prediction results are very convenient for further use, such as simply recombining them with the original dataset:

# get prediction on test
predicted <- data_test %>% 
  bind_cols(predict(model, data_test)) %>% 
  bind_cols(predict(model, data_test, type = "prob"))

# quick check
predicted %>% 
  select(attrition, matches(".pred"))
#> # A tibble: 293 x 4
#>    attrition .pred_class .pred_yes .pred_no
#>    <fct>     <fct>           <dbl>    <dbl>
#>  1 no        no             0.340     0.660
#>  2 no        no             0.358     0.642
#>  3 yes       yes            0.509     0.491
#>  4 no        no             0.0884    0.912
#>  5 no        no             0.185     0.815
#>  6 no        no             0.448     0.552
#>  7 no        no             0.127     0.873
#>  8 yes       yes            0.748     0.252
#>  9 no        no             0.147     0.853
#> 10 no        yes            0.680     0.320
#> # ... with 283 more rows

Model Evaluation using yardstick

Rethinking: How do we measure the goodness of our model?

It’s always good to bring your own yardstick

yardstick is the part of tidymodels that helps us calculate model performance metrics.

There are many metrics available in yardstick, as you can see in its full function reference page. In this introduction, we will calculate some model performance metrics for a classification task as an example.

There are two ways of calculating model performance metrics, which differ in their input and output:

  • tibble approach: We pass a dataset containing the truth and estimate to the function, and it will return a tibble containing the results, e.g., the precision() function.
  • vector approach: We pass a vector as the truth and a vector as the estimate to the function, and it will return a vector showing the result, e.g., the precision_vec() function.

Note that some functions, like conf_mat(), only accept the tibble approach, since they do not return a vector of length one.

Let’s start by reporting the confusion matrix:

# show confusion matrix
predicted %>% 
  conf_mat(truth = attrition, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

Now, to calculate the performance metrics, let’s try the tibble approach, which also supports group_by():

# calculate accuracy
predicted %>% 
  accuracy(attrition, .pred_class)
#> # A tibble: 1 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.710
# calculate accuracy by group
predicted %>% 
  group_by(department) %>% 
  accuracy(attrition, .pred_class) %>% 
  ungroup()
#> # A tibble: 3 x 4
#>   department           .metric  .estimator .estimate
#>   <fct>                <chr>    <chr>          <dbl>
#> 1 human_resources      accuracy binary         0.636
#> 2 research_development accuracy binary         0.760
#> 3 sales                accuracy binary         0.605

Or use the vector approach, which is more flexible in general:

# metrics summary
predicted %>% 
  summarise(
    accuracy = accuracy_vec(attrition, .pred_class),
    sensitivity = sens_vec(attrition, .pred_class),
    specificity = spec_vec(attrition, .pred_class),
    precision = precision_vec(attrition, .pred_class)
  )
#> # A tibble: 1 x 4
#>   accuracy sensitivity specificity precision
#>      <dbl>       <dbl>       <dbl>     <dbl>
#> 1    0.710       0.723       0.707     0.321
# metrics summary by group
predicted %>% 
  group_by(department) %>% 
  summarise(
    accuracy = accuracy_vec(attrition, .pred_class),
    sensitivity = sens_vec(attrition, .pred_class),
    specificity = spec_vec(attrition, .pred_class),
    precision = precision_vec(attrition, .pred_class)
  ) %>% 
  ungroup()
#> # A tibble: 3 x 5
#>   department           accuracy sensitivity specificity precision
#>   <fct>                   <dbl>       <dbl>       <dbl>     <dbl>
#> 1 human_resources         0.636       1           0.556     0.333
#> 2 research_development    0.760       0.677       0.776     0.362
#> 3 sales                   0.605       0.786       0.569     0.262

Sometimes the model performance metrics can also help improve the model’s final results. For example, through the Receiver Operating Characteristic (ROC) curve, we could assess the effect of the probability threshold on sensitivity and specificity:

predicted %>% 
  roc_curve(attrition, .pred_yes) %>% 
  autoplot()

And since it returns a tibble, we could do further data wrangling to help us see it more clearly:

# get roc curve data on test dataset
pred_test_roc <- predicted %>%
  roc_curve(attrition, .pred_yes)

# quick check
pred_test_roc
#> # A tibble: 295 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#>  1  -Inf          0                 1
#>  2     0.0422     0                 1
#>  3     0.0487     0.00407           1
#>  4     0.0568     0.00813           1
#>  5     0.0674     0.0122            1
#>  6     0.0685     0.0163            1
#>  7     0.0705     0.0203            1
#>  8     0.0884     0.0244            1
#>  9     0.0922     0.0285            1
#> 10     0.0945     0.0325            1
#> # ... with 285 more rows

With some ggplot2:

# tidying
pred_test_roc <- pred_test_roc %>% 
  mutate_if(~ is.numeric(.), ~ round(., 4)) %>% 
  gather(metric, value, -.threshold)

# plot sensitivity-specificity trade-off
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) +
  geom_line(aes(colour = metric)) +
  labs(x = "Probability Threshold to be Classified as Positive", y = "Value", colour = "Metrics") +
  theme_minimal()

and plotly magic, it would be perfect:

# convert to plotly
ggplotly(p)
