library(tidyverse)
library(tidymodels)
library(glmnet) # for logistic regression
library(ranger) # for random forest
library(vip) # for variable importance plots
A Quick Intro to TidyModels
Intro
Some of you are currently working on projects involving R and machine learning models. The relatively new Tidymodels framework could be a useful addition to your toolkit, so this notebook will give you a quick intro to Tidymodels.
So far, we have learned data wrangling using the Tidyverse ecosystem. Tidymodels is to modeling what Tidyverse is to data wrangling. Tidymodels itself doesn’t implement any statistical or machine learning models; rather, it is a framework for organizing all the tasks around modeling.
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Below are a few core tidymodels packages.
- tidymodels: a meta-package that installs and loads all the core tidymodels packages.
- rsample: handles data splitting (e.g. for training, validation and testing) and resampling (e.g. for bootstrapping).
- parsnip: a unified interface to different underlying R modeling packages
- recipes: an interface to data pre-processing (i.e. feature engineering)
- workflows: bundles pre-processing, modeling and post-processing together
- tune: helps with hyper-parameter tuning
- yardstick: measures model performance
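In practice, attaching the meta-package is usually all you need. Below is a minimal sketch; it assumes a reasonably recent tidymodels version, and tidymodels_prefer() is, to my understanding, a helper in the meta-package that resolves common function-name conflicts.
library(tidymodels)   # attaches rsample, parsnip, recipes, workflows, tune, yardstick, ...
tidymodels_prefer()   # resolve common name clashes in favor of tidymodels functions (assumed helper)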
This notebook will introduce you to Tidymodels through an example. The workflow in this example closely follows the case study in the official Tidymodels Get Started document, but we will use a different dataset.
Predictive Modeling with Tidymodels
Import Data
We will use the German Credit Data for this exercise. The goal is to classify customers as good or bad credit risk based on their attributes.
The R packages we need were loaded at the top of this notebook. Now let’s import the data. Note that the data file is not in CSV format: the values are space-separated. In addition, the file doesn’t contain column names, and most of the values are encoded. Click the dataset link above to get more information on the data.
<- "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
data_url
# the data file doesn't contain column names so add them
<- c("chk_acct", "duration", "credit_his",
col_names "purpose", "amount", "saving_acct",
"present_emp", "installment_rate",
"sex", "other_debtor", "present_resid",
"property", "age", "other_install",
"housing", "n_credits", "job",
"n_people", "telephone", "foreign",
"response")
german_credit <- read_delim(data_url, " ", col_names = col_names) %>%
  mutate_if(is.character, as.factor) %>%                  # turn all character cols to factors/categorical variables
  mutate(response = as_factor(response - 1)) %>%          # 0: good credit; 1: bad credit
  mutate(response = fct_relevel(response, c("1", "0")))   # reorder factor levels so 1/bad will be the "event"
# take a look at the data
glimpse(german_credit)
## Rows: 1,000
## Columns: 21
## $ chk_acct <fct> A11, A12, A14, A11, A11, A14, A14, A12, A14, A12, A12…
## $ duration <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24…
## $ credit_his <fct> A34, A32, A34, A32, A33, A32, A32, A32, A32, A34, A32…
## $ purpose <fct> A43, A43, A46, A42, A40, A46, A42, A41, A43, A40, A40…
## $ amount <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059,…
## $ saving_acct <fct> A65, A61, A61, A61, A61, A65, A63, A61, A64, A61, A61…
## $ present_emp <fct> A75, A73, A74, A74, A73, A73, A75, A73, A74, A71, A72…
## $ installment_rate <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2,…
## $ sex <fct> A93, A92, A93, A93, A93, A93, A93, A93, A91, A94, A92…
## $ other_debtor <fct> A101, A101, A101, A103, A101, A101, A101, A101, A101,…
## $ present_resid <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3,…
## $ property <fct> A121, A121, A121, A122, A124, A124, A122, A123, A121,…
## $ age <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 6…
## $ other_install <fct> A143, A143, A143, A143, A143, A143, A143, A143, A143,…
## $ housing <fct> A152, A152, A152, A153, A153, A153, A152, A151, A152,…
## $ n_credits <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3,…
## $ job <fct> A173, A173, A172, A173, A173, A172, A173, A174, A172,…
## $ n_people <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ telephone <fct> A192, A191, A191, A191, A191, A192, A191, A192, A191,…
## $ foreign <fct> A201, A201, A201, A201, A201, A201, A201, A201, A201,…
## $ response <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,…
Split Data
Let’s split the data for training, validation, and testing. The response variable (good or bad credit) is slightly imbalanced, so we use stratified random sampling to ensure each sub-dataset keeps roughly the same proportion of good and bad credit samples as the full dataset. This is done with the help of the Tidymodels rsample package.
(Note: there are other strategies for dealing with imbalanced data, such as upsampling or downsampling. They are available in the themis package as additional recipe steps; a rough sketch follows below.)
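For example, an upsampling step could be added to a recipe roughly as sketched below. This is only a sketch under my assumptions about the themis API (step_upsample() and its over_ratio argument); check the themis documentation before relying on it.
library(themis)   # assumed to be installed; provides extra recipe steps for class imbalance
# sketch only; this recipe is NOT used later in the notebook
balanced_recipe <- recipe(response ~ ., data = german_credit) %>%
  step_upsample(response, over_ratio = 1)   # oversample the minority class up to a 1:1 ratio (applied to training data only)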
# the response variable is slightly imbalanced
german_credit %>%
  count(response) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## response n prop
## <fct> <int> <dbl>
## 1 1 300 0.3
## 2 0 700 0.7
# set random seed for reproducibility
set.seed(123)
# use stratified random sample
# test and non-test split
splits <- initial_split(german_credit, strata = response)

splits
## <Training/Testing/Total>
## <750/250/1000>
# retrieve the non-test and test sets from the split
credit_other <- training(splits)
credit_test  <- testing(splits)
# training and validation split on non-test data
val_set <- validation_split(credit_other,
                            strata = response,
                            prop = 0.80)

val_set
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [600/150]> validation
Logistic Regression
We first build a penalized logistic regression model. We will use the lasso penalty in this example to encourage feature selection. The penalty hyper-parameter will be tuned.
A few R packages can fit logistic regression with the lasso, for example glmnet or LiblineaR. The Tidymodels parsnip package provides a unified interface (e.g. logistic_reg() and set_engine()) to the underlying modeling functions from different packages. This means you can switch between modeling packages easily, without worrying about each package's own syntax.
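For example, the same lasso specification could in principle be pointed at a different engine with a one-line change. The sketch below assumes the LiblineaR package is installed and that parsnip's LiblineaR engine accepts these arguments; translate() then shows how parsnip maps the unified arguments to each engine's own syntax.
# same parsnip specification, two computational engines (illustrative sketch)
lasso_glmnet    <- logistic_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")
lasso_liblinear <- logistic_reg(penalty = 0.01, mixture = 1) %>% set_engine("LiblineaR")

translate(lasso_glmnet)      # show the underlying glmnet call
translate(lasso_liblinear)   # show the underlying LiblineaR call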
The Tidymodels recipes package helps with data pre-processing, i.e. feature engineering (e.g. step_dummy() and step_normalize()), and the Tidymodels workflows package bundles the pre-processing and the model together.
# build model
# note that the hyper-parameter penalty will be tuned; tune() is a placeholder
lr_mod <-
  logistic_reg(penalty = tune(), mixture = 1) %>%   # mixture = 1 means pure lasso
  set_engine("glmnet")                              # use the glmnet engine
# pre-process data
lr_recipe <-
  recipe(response ~ ., data = credit_other) %>%    # response will be y/outcome
  step_dummy(all_nominal(), -all_outcomes()) %>%   # create dummies for all nominal vars excluding outcome var
  step_normalize(all_predictors())                 # normalize (center and scale) all predictors
# combine modeling and data pre-processing
lr_workflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(lr_recipe)
We can now tune and train the model. We first create a grid for tuning the penalty hyper-parameter. Since we only have one hyper-parameter to tune, we simply store the grid as a one-column tibble/dataframe; for more complex cases, consider using the Tidymodels dials package (a sketch follows below).
We then train the model on each grid value, using the area under the ROC curve (ROC AUC) to measure model performance on the validation set, and plot ROC AUC against the penalty grid. Finally, we select the best model. These steps are all done with the help of the Tidymodels tune package.
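For reference, a comparable grid could also be built with dials, roughly as sketched here (grid_regular() and penalty() come from dials, which is loaded with tidymodels; penalty() takes its range on the log10 scale). We stick with the hand-made tibble in the code that follows.
# sketch: a dials equivalent of the hand-made grid below
dials_grid <- grid_regular(penalty(range = c(-4, -1)),   # i.e. 10^-4 to 10^-1
                           levels = 30)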
# create a grid for tuning the penalty hyper-parameter
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
# train using a grid of penalty values
lr_res <-
  lr_workflow %>%
  tune_grid(val_set,                                    # use validation set for performance evaluation
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))              # use ROC AUC as performance metric
# plot ROC AUC against penalty grid
lr_plot <-
  lr_res %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())

lr_plot
We then select the best model, i.e., the one with the penalty
value that delivers the highest ROC AUC.
# take a look at top models first
lr_res %>%
  show_best(metric = "roc_auc", n = 5)
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0304 roc_auc binary 0.751 1 NA Preprocessor1_Model25
## 2 0.0240 roc_auc binary 0.749 1 NA Preprocessor1_Model24
## 3 0.0386 roc_auc binary 0.749 1 NA Preprocessor1_Model26
## 4 0.0189 roc_auc binary 0.746 1 NA Preprocessor1_Model23
## 5 0.0621 roc_auc binary 0.744 1 NA Preprocessor1_Model28
# pick the top one
lr_best <-
  lr_res %>%
  select_best(metric = "roc_auc")

lr_best
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.0304 Preprocessor1_Model25
We see the best model has a ROC AUC of 0.751, and it is the one with a penalty hyper-parameter value of 0.0304. Let’s plot the ROC curve for the best model we obtained from the lasso logistic regression.
lr_auc <-
  lr_res %>%
  collect_predictions(parameters = lr_best) %>%   # collect pred on val data under the best model
  roc_curve(response, .pred_1) %>%
  mutate(model = "Logistic Regression")
autoplot(lr_auc)
Random Forest
We do almost exactly the same thing as we did for the logistic regression, except that now we use a random forest model.
# random forest will be trained in parallel on multiple cores
# get the number of cores my computer has
cores <- parallel::detectCores()
# build a random forest model
# two hyper-parameters will be tuned
rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 30) %>%
  set_engine("ranger", num.threads = cores) %>%   # use the ranger package for training
  set_mode("classification")
# not much data pre-processing is needed for a random forest model (the recipe just specifies the formula)
rf_recipe <- recipe(response ~ ., data = credit_other)
# combine the model and the data pre-processing
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)
# train and tune on validation data
# auto-create a tuning grid with 25 sets of values
set.seed(345)

rf_res <-
  rf_workflow %>%
  tune_grid(val_set,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
# take a look at a few top models
rf_res %>%
  show_best(metric = "roc_auc")
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 20 32 roc_auc binary 0.826 1 NA Preprocessor1_Model25
## 2 14 17 roc_auc binary 0.823 1 NA Preprocessor1_Model19
## 3 17 6 roc_auc binary 0.821 1 NA Preprocessor1_Model22
## 4 19 14 roc_auc binary 0.818 1 NA Preprocessor1_Model24
## 5 15 28 roc_auc binary 0.815 1 NA Preprocessor1_Model20
# select the best one
rf_best <-
  rf_res %>%
  select_best(metric = "roc_auc")

rf_best
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 20 32 Preprocessor1_Model25
We see the best model has a ROC AUC of 0.826.
Let’s plot its ROC curve together with the one obtained from the best logistic lasso model.
rf_auc <-
  rf_res %>%
  collect_predictions(parameters = rf_best) %>%   # collect pred on val data under the best model
  roc_curve(response, .pred_1) %>%
  mutate(model = "Random Forest")
# combine the ROC from the best random forest model with the one from logistic lasso
bind_rows(rf_auc, lr_auc) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) +
geom_path(lwd = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal() +
scale_color_viridis_d(option = "plasma", end = .6)
The Last Fit
We see that the best logistic lasso and the best random forest have similar performance. We will fit a final model,
- using a random forest with the best hyper-parameters obtained above;
- using all non-test data, i.e., combining the training and validation data (we do this because our dataset is quite small).
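(Side note: tune also provides finalize_workflow(), which plugs the selected hyper-parameters into a workflow for us; a sketch is below. We don't take that route here because we also want to add the importance = "impurity" engine argument for the variable importance plot later.)
# sketch: an alternative way to fill in the tuned mtry and min_n
last_rf_workflow_alt <- rf_workflow %>%
  finalize_workflow(rf_best)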
# the last model
# mtry and min_n are obtained from the best random forest model above
mtry_best  <- (rf_best %>% pull(mtry))[[1]]
min_n_best <- (rf_best %>% pull(min_n))[[1]]

last_rf_mod <-
  rand_forest(mtry = mtry_best, min_n = min_n_best, trees = 30) %>%
  set_engine("ranger", num.threads = cores, importance = "impurity") %>%
  set_mode("classification")
# the last workflow
last_rf_workflow <-
  rf_workflow %>%
  update_model(last_rf_mod)
# the last fit
# note: we train on non-test data,
# i.e., we combine previous training and validation data
set.seed(345)
last_rf_fit <-
  last_rf_workflow %>%
  last_fit(splits)
# collect performance metrics on the test data
last_rf_fit %>%
  collect_metrics()
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.772 Preprocessor1_Model1
## 2 roc_auc binary 0.813 Preprocessor1_Model1
## 3 brier_class binary 0.158 Preprocessor1_Model1
We see the model performance (ROC AUC) on the test data is similar to what we obtained on the validation data (0.813 vs 0.826). That suggests our estimate of model performance generalizes to unseen data.
However, the model accuracy is just 0.772, which is not too impressive: if we naively guessed that all responses are 0 (good credit), we would get about 0.7 accuracy, given that 70% of the data has response 0.
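To see where that accuracy comes from, we can also look at the confusion matrix on the test set (conf_mat() is a yardstick function, and .pred_class is the hard-class prediction column saved by last_fit()):
# confusion matrix of the final model on the test data
last_rf_fit %>%
  collect_predictions() %>%
  conf_mat(truth = response, estimate = .pred_class)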
Keep in mind, though, that this notebook is meant to introduce you to Tidymodels. We haven’t carefully explored the data or tried different models yet, so this merely OK performance is acceptable for now.
Let’s plot the feature importance graph.
# obtain importance score
last_rf_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)
Let’s plot the final ROC curve too.
# plot ROC plot
last_rf_fit %>%
  collect_predictions() %>%
  roc_curve(response, .pred_1) %>%
  autoplot()