Intro

Some of you are currently working on industry projects involving R and machine learning models. The new Tidymodels framework could be a useful addition to your toolkit for those projects. This notebook gives you a quick introduction to Tidymodels.

So far, we have learned data wrangling using the Tidyverse ecosystem. Tidymodels is to modeling what Tidyverse is to data wrangling. Tidymodels itself doesn’t implement any statistical or machine learning models; it is a framework for organizing all the tasks around modeling.

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

-Tidymodels Developers

A few core tidymodels packages are rsample (data splitting and resampling), parsnip (a unified interface to models), recipes (data pre-processing), workflows (bundling pre-processing and models), tune (hyper-parameter tuning), and yardstick (performance metrics).

This notebook introduces you to Tidymodels through an example. The workflow in this example closely follows the case study in the official Tidymodels Get Started guide. However, we will use a different dataset.

Predictive Modeling with Tidymodels

Import Data

We will use the German Credit Data for this exercise.

Let’s first load some R packages.

library(tidyverse)
library(tidymodels)

library(glmnet)     # engine for penalized logistic regression
library(ranger)     # engine for random forest
library(vip)        # for variable importance plots

Now let’s import the data. Note that the data file is not in CSV format: the values are space-separated, the file doesn’t contain column names, and most of the values are encoded as categorical codes (e.g. A11, A43). Click the dataset link above for more information on the data.

data_url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"

# the data file doesn't contain column names so add them
col_names <- c("chk_acct", "duration", "credit_his", 
               "purpose", "amount", "saving_acct",
               "present_emp", "installment_rate",
               "sex", "other_debtor", "present_resid",
               "property", "age", "other_install",
               "housing", "n_credits", "job",
               "n_people", "telephone", "foreign",
               "response")

german_credit <- read_delim(data_url, " ", col_names = col_names) %>%
  mutate_if(is.character, as.factor) %>%     # turn all character cols to factors/categorical variables
  mutate(response = as_factor(response - 1)) %>%     # 0: good credit; 1: bad credit
  mutate(response = fct_relevel(response, c("1", "0")))     # reorder factor level so 1 will be the "event"

# take a look at the data
glimpse(german_credit)
## Rows: 1,000
## Columns: 21
## $ chk_acct         <fct> A11, A12, A14, A11, A11, A14, A14, A12, A14, A12, A12~
## $ duration         <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24~
## $ credit_his       <fct> A34, A32, A34, A32, A33, A32, A32, A32, A32, A34, A32~
## $ purpose          <fct> A43, A43, A46, A42, A40, A46, A42, A41, A43, A40, A40~
## $ amount           <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059,~
## $ saving_acct      <fct> A65, A61, A61, A61, A61, A65, A63, A61, A64, A61, A61~
## $ present_emp      <fct> A75, A73, A74, A74, A73, A73, A75, A73, A74, A71, A72~
## $ installment_rate <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2,~
## $ sex              <fct> A93, A92, A93, A93, A93, A93, A93, A93, A91, A94, A92~
## $ other_debtor     <fct> A101, A101, A101, A103, A101, A101, A101, A101, A101,~
## $ present_resid    <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3,~
## $ property         <fct> A121, A121, A121, A122, A124, A124, A122, A123, A121,~
## $ age              <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 6~
## $ other_install    <fct> A143, A143, A143, A143, A143, A143, A143, A143, A143,~
## $ housing          <fct> A152, A152, A152, A153, A153, A153, A152, A151, A152,~
## $ n_credits        <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3,~
## $ job              <fct> A173, A173, A172, A173, A173, A172, A173, A174, A172,~
## $ n_people         <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ telephone        <fct> A192, A191, A191, A191, A191, A192, A191, A192, A191,~
## $ foreign          <fct> A201, A201, A201, A201, A201, A201, A201, A201, A201,~
## $ response         <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,~
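
As an optional sanity check, we can confirm that the releveling worked: "1" (bad credit) should now be the first level of the response factor, which yardstick treats as the "event" by default.

# confirm that "1" (bad credit) is the first level of the response factor
levels(german_credit$response)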

Split Data

Let’s split the data into training, validation, and test sets. The response variable is slightly imbalanced, so we use stratified random sampling. This is done with the help of the Tidymodels rsample package.

# the response variable is slightly imbalanced
german_credit %>% 
  count(response) %>%
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   response     n  prop
##   <fct>    <int> <dbl>
## 1 1          300   0.3
## 2 0          700   0.7

# set random seed for reproducibility
set.seed(123)

# use stratified random sample
# test and non-test split
splits <- initial_split(german_credit, strata = response)
splits
## <Analysis/Assess/Total>
## <750/250/1000>

# retrieving data from the split data
credit_other <- training(splits)
credit_test  <- testing(splits)

# training and validation split on non-test data
val_set <- validation_split(credit_other, 
                            strata = response, 
                            prop = 0.80)
val_set
## # Validation Set Split (0.8/0.2)  using stratification 
## # A tibble: 1 x 2
##   splits            id        
##   <list>            <chr>     
## 1 <split [602/148]> validation
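
Optionally, we can verify that the stratified sampling preserved the class balance in each partition. A quick check:

# response proportions in the non-test and test sets
# both should be close to the 30/70 split in the full data
credit_other %>% count(response) %>% mutate(prop = n / sum(n))
credit_test  %>% count(response) %>% mutate(prop = n / sum(n))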

Logistic Regression

We first build a penalized logistic regression model. We will use lasso here.

Given a model such as logistic regression, the Tidymodels parsnip package provides a unified interface (e.g. logistic_reg() and set_engine()) to different underlying modeling engines (e.g. glmnet, keras, or spark). This is a huge advantage: you no longer need to remember a different syntax for each R modeling package.

The Tidymodels recipes package helps with data pre-processing (e.g. step_dummy()).

The Tidymodels workflows package bundles the pre-processing and the model together.
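
As a small illustration of the unified interface, the same model specification could be pointed at a different engine (for example base R's glm, which fits an unpenalized logistic regression) without changing anything else in the workflow. This is only a sketch and is not used below.

# same parsnip model specification, different engine (a sketch, not used below)
logistic_reg() %>% 
  set_engine("glm")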

# build model
# note that the hyper-parameter penalty will be tuned
lr_mod <- 
  logistic_reg(penalty = tune(), mixture = 1) %>%     # mixture = 1 means pure lasso
  set_engine("glmnet")     # use the glmnet

# pre-process data
lr_recipe <- 
  recipe(response ~ ., data = credit_other) %>%     # response will be y/outcome
  step_dummy(all_nominal(), -all_outcomes()) %>%    # create dummies for all nominal vars excluding outcome var
  step_normalize(all_predictors())     # normalize (center and scale) all predictors

# combine modeling and data pre-processing
lr_workflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(lr_recipe)

We can now tune and train the model. We first create a grid of candidate values for the penalty hyper-parameter and train the model at each grid value. We use the area under the ROC curve (ROC AUC) on the validation set to measure model performance, so we plot ROC AUC against the penalty grid and then select the best model. These steps are all done with the help of the Tidymodels tune package.

# create a grid for tuning the penalty hyper-parameter
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

# train using a grid of penalty values
lr_res <- 
  lr_workflow %>% 
  tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

# plot ROC AUC against penalty grid
lr_plot <- 
  lr_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())

lr_plot

We then select the best model.

# take a look at top models first
lr_res %>% 
  show_best("roc_auc", n = 5)
## # A tibble: 5 x 7
##    penalty .metric .estimator  mean     n std_err .config              
##      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.000853 roc_auc binary     0.773     1      NA Preprocessor1_Model10
## 2 0.00137  roc_auc binary     0.773     1      NA Preprocessor1_Model12
## 3 0.000672 roc_auc binary     0.773     1      NA Preprocessor1_Model09
## 4 0.00108  roc_auc binary     0.772     1      NA Preprocessor1_Model11
## 5 0.00174  roc_auc binary     0.772     1      NA Preprocessor1_Model13

# pick the top one
lr_best <- 
  lr_res %>% 
  select_best(metric = "roc_auc")
lr_best
## # A tibble: 1 x 2
##    penalty .config              
##      <dbl> <chr>                
## 1 0.000853 Preprocessor1_Model10
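
As an aside, if we wanted to carry this penalty forward into a final lasso fit, we could plug it back into the workflow with finalize_workflow() and refit. A minimal sketch (the object names final_lr_workflow and final_lr_fit are just illustrative; we don't use this fit below since we end up choosing the random forest):

# plug the selected penalty back into the workflow and refit on the non-test data
# (a sketch only; not used in the rest of this notebook)
final_lr_workflow <- finalize_workflow(lr_workflow, lr_best)
final_lr_fit <- fit(final_lr_workflow, data = credit_other)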

We plot the ROC curve for the best model obtained from the lasso logistic regression.

lr_auc <- 
  lr_res %>% 
  collect_predictions(parameters = lr_best) %>%     # collect pred on val data under the best model
  roc_curve(response, .pred_1) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_auc)

Random Forest

We do almost exactly the same thing as we did for the logistic regression, except now we use a random forest model.

# random forest will be trained in parallel on multiple cores
# get the number of cores my computer has 
cores <- parallel::detectCores()

# build a random forest model
# two hyper-parameters will be tuned
rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 30) %>% 
  set_engine("ranger", num.threads = cores) %>%     # use ranger pacakge for training
  set_mode("classification")

# random forests need little data pre-processing; the recipe just specifies the model formula
rf_recipe <- recipe(response ~ ., data = credit_other)

# combine the model and the data pre-processing
rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)

# tune the model, evaluating each candidate on the validation set
# let tune_grid() auto-create a tuning grid of 25 candidate value sets
set.seed(345)
rf_res <- 
  rf_workflow %>% 
  tune_grid(val_set,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

# take a look at a few top models
rf_res %>% 
  show_best(metric = "roc_auc")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    13    37 roc_auc binary     0.781     1      NA Preprocessor1_Model12
## 2    12    17 roc_auc binary     0.781     1      NA Preprocessor1_Model13
## 3     9    40 roc_auc binary     0.780     1      NA Preprocessor1_Model03
## 4    19    21 roc_auc binary     0.776     1      NA Preprocessor1_Model05
## 5     3     5 roc_auc binary     0.775     1      NA Preprocessor1_Model22

# select the best one
rf_best <- 
  rf_res %>% 
  select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1    13    37 Preprocessor1_Model12
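
We can also visualize how the validation ROC AUC varies across the candidate mtry and min_n values, similar to the penalty plot we made for the lasso. A minimal sketch:

# plot validation ROC AUC against the tuned mtry and min_n values
autoplot(rf_res)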

We see the best model has a ROC AUC of 0.781.

Let’s plot its ROC curve together with the one obtained from the best logistic lasso model.

rf_auc <- 
  rf_res %>% 
  collect_predictions(parameters = rf_best) %>% # collect pred on val data under the best model
  roc_curve(response, .pred_1) %>% 
  mutate(model = "Random Forest")

# combine the ROC from the best random forest model with the one from logistic lasso
bind_rows(rf_auc, lr_auc) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = "plasma", end = .6)

The Last Fit

We see that the best logistic lasso and the best random forest have similar performance. We will fit a final model,

  1. using the random forest with the best hyper-parameters obtained above;
  2. using all non-test data, i.e., combining the training and validation data (we do this because our dataset is quite small).

# the last model
# mtry and min_n are obtained from inspecting the best random forest model above 
last_rf_mod <- 
  rand_forest(mtry = 13, min_n = 37, trees = 30) %>%     # hard code for now 
  set_engine("ranger", num.threads = cores, importance = "impurity") %>% 
  set_mode("classification")

# the last workflow
last_rf_workflow <- 
  rf_workflow %>% 
  update_model(last_rf_mod)

# the last fit
# note: we train on non-test data, 
# i.e., we combine previous training and validation data
set.seed(345)
last_rf_fit <- 
  last_rf_workflow %>% 
  last_fit(splits)

# collect performance metric on test data
last_rf_fit %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.744 Preprocessor1_Model1
## 2 roc_auc  binary         0.778 Preprocessor1_Model1

We see that the model performance (ROC AUC) on the test data is similar to what we obtained on the validation data (0.778 vs 0.781). That suggests our estimate of model performance generalizes to unseen data.

However, our model’s accuracy is just 0.744, which is not too impressive: if we naively predicted every response to be 0 (good credit), we would get about 0.7 accuracy, since 70% of the responses are 0.
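
To see where the errors come from, we could optionally inspect the confusion matrix on the test data. A quick sketch using yardstick:

# cross-tabulate predicted vs. actual classes on the test data
last_rf_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = response, estimate = .pred_class)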

Anyway, keep in mind that this notebook is for introducing you to Tidymodels. We haven’t carefully explored the data or tried many different models yet, so the merely OK performance is acceptable.

Let’s plot the feature importance graph.

# obtain importance scores from the final fitted model
last_rf_fit %>% 
  pluck(".workflow", 1) %>%      # extract the fitted workflow from the last fit
  pull_workflow_fit() %>%        # pull out the underlying parsnip/ranger fit
  vip(num_features = 20)         # plot the 20 most important variables

Let’s plot the final ROC curve too.

# plot ROC plot
last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(response, .pred_1) %>% 
  autoplot()