
Build a model

Tidymodels, virtually — Session 01

Alison Hill

What is Machine Learning?

Alzheimer's disease data

Data from a clinical trial of individuals with well-characterized cognitive impairment, and age-matched control participants.

# install.packages("modeldata")
library(modeldata)
library(dplyr) # for glimpse() and the pipe
data("ad_data")
alz <- ad_data
glimpse(alz)
# Rows: 333
# Columns: 131
# $ ACE_CD143_Angiotensin_Converti <dbl> 2.0031003, 1.561…
# $ ACTH_Adrenocorticotropic_Hormon <dbl> -1.3862944, -1.3…
# $ AXL <dbl> 1.09838668, 0.68…
# $ Adiponectin <dbl> -5.360193, -5.02…
# $ Alpha_1_Antichymotrypsin <dbl> 1.7404662, 1.458…
# $ Alpha_1_Antitrypsin <dbl> -12.631361, -11.…
# $ Alpha_1_Microglobulin <dbl> -2.577022, -3.24…
# $ Alpha_2_Macroglobulin <dbl> -72.65029, -154.…
# $ Angiopoietin_2_ANG_2 <dbl> 1.06471074, 0.74…
# $ Angiotensinogen <dbl> 2.510547, 2.4572…
# $ Apolipoprotein_A_IV <dbl> -1.427116, -1.66…
# $ Apolipoprotein_A1 <dbl> -7.402052, -7.04…
# $ Apolipoprotein_A2 <dbl> -0.26136476, -0.…
# $ Apolipoprotein_B <dbl> -4.624044, -6.74…
# $ Apolipoprotein_CI <dbl> -1.2729657, -1.2…
# $ Apolipoprotein_CIII <dbl> -2.312635, -2.34…
# $ Apolipoprotein_D <dbl> 2.0794415, 1.335…
# $ Apolipoprotein_E <dbl> 3.7545215, 3.097…
# $ Apolipoprotein_H <dbl> -0.15734908, -0.…
# $ B_Lymphocyte_Chemoattractant_BL <dbl> 2.296982, 1.6731…
# $ BMP_6 <dbl> -2.200744, -1.72…
# $ Beta_2_Microglobulin <dbl> 0.69314718, 0.47…
# $ Betacellulin <int> 34, 53, 49, 52, …
# $ C_Reactive_Protein <dbl> -4.074542, -6.64…
# $ CD40 <dbl> -0.7964147, -1.2…
# $ CD5L <dbl> 0.09531018, -0.6…
# $ Calbindin <dbl> 33.21363, 25.276…
# $ Calcitonin <dbl> 1.3862944, 3.610…
# $ CgA <dbl> 397.6536, 465.67…
# $ Clusterin_Apo_J <dbl> 3.555348, 3.0445…
# $ Complement_3 <dbl> -10.36305, -16.1…
# $ Complement_Factor_H <dbl> 3.573725, 3.6000…
# $ Connective_Tissue_Growth_Factor <dbl> 0.5306283, 0.587…
# $ Cortisol <dbl> 10.0, 12.0, 10.0…
# $ Creatine_Kinase_MB <dbl> -1.710172, -1.75…
# $ Cystatin_C <dbl> 9.041922, 9.0676…
# $ EGF_R <dbl> -0.1354543, -0.3…
# $ EN_RAGE <dbl> -3.688879, -3.81…
# $ ENA_78 <dbl> -1.349543, -1.35…
# $ Eotaxin_3 <int> 53, 62, 62, 44, …
# $ FAS <dbl> -0.08338161, -0.…
# $ FSH_Follicle_Stimulation_Hormon <dbl> -0.6516715, -1.6…
# $ Fas_Ligand <dbl> 3.1014922, 2.978…
# $ Fatty_Acid_Binding_Protein <dbl> 2.5208712, 2.247…
# $ Ferritin <dbl> 3.329165, 3.9329…
# $ Fetuin_A <dbl> 1.2809338, 1.193…
# $ Fibrinogen <dbl> -7.035589, -8.04…
# $ GRO_alpha <dbl> 1.381830, 1.3724…
# $ Gamma_Interferon_induced_Monokin <dbl> 2.949822, 2.7217…
# $ Glutathione_S_Transferase_alpha <dbl> 1.0641271, 0.867…
# $ HB_EGF <dbl> 6.559746, 8.7545…
# $ HCC_4 <dbl> -3.036554, -4.07…
# $ Hepatocyte_Growth_Factor_HGF <dbl> 0.58778666, 0.53…
# $ I_309 <dbl> 3.433987, 3.1354…
# $ ICAM_1 <dbl> -0.1907787, -0.4…
# $ IGF_BP_2 <dbl> 5.609472, 5.3471…
# $ IL_11 <dbl> 5.121987, 4.9367…
# $ IL_13 <dbl> 1.282549, 1.2694…
# $ IL_16 <dbl> 4.192081, 2.8763…
# $ IL_17E <dbl> 5.731246, 6.7058…
# $ IL_1alpha <dbl> -6.571283, -8.04…
# $ IL_3 <dbl> -3.244194, -3.91…
# $ IL_4 <dbl> 2.484907, 2.3978…
# $ IL_5 <dbl> 1.09861229, 0.69…
# $ IL_6 <dbl> 0.26936976, 0.09…
# $ IL_6_Receptor <dbl> 0.64279595, 0.43…
# $ IL_7 <dbl> 4.8050453, 3.705…
# $ IL_8 <dbl> 1.711325, 1.6755…
# $ IP_10_Inducible_Protein_10 <dbl> 6.242223, 5.6869…
# $ IgA <dbl> -6.812445, -6.37…
# $ Insulin <dbl> -0.6258253, -0.9…
# $ Kidney_Injury_Molecule_1_KIM_1 <dbl> -1.204295, -1.19…
# $ LOX_1 <dbl> 1.7047481, 1.526…
# $ Leptin <dbl> -1.5290628, -1.4…
# $ Lipoprotein_a <dbl> -4.268698, -4.93…
# $ MCP_1 <dbl> 6.740519, 6.8490…
# $ MCP_2 <dbl> 1.9805094, 1.808…
# $ MIF <dbl> -1.237874, -1.89…
# $ MIP_1alpha <dbl> 4.968453, 3.6901…
# $ MIP_1beta <dbl> 3.258097, 3.1354…
# $ MMP_2 <dbl> 4.478566, 3.7814…
# $ MMP_3 <dbl> -2.207275, -2.46…
# $ MMP10 <dbl> -3.270169, -3.64…
# $ MMP7 <dbl> -3.7735027, -5.9…
# $ Myoglobin <dbl> -1.89711998, -0.…
# $ NT_proBNP <dbl> 4.553877, 4.2195…
# $ NrCAM <dbl> 5.003946, 5.2094…
# $ Osteopontin <dbl> 5.356586, 6.0038…
# $ PAI_1 <dbl> 1.00350156, -0.0…
# $ PAPP_A <dbl> -2.902226, -2.81…
# $ PLGF <dbl> 4.442651, 4.0253…
# $ PYY <dbl> 3.218876, 3.1354…
# $ Pancreatic_polypeptide <dbl> 0.5787808, 0.336…
# $ Prolactin <dbl> 0.00000000, -0.5…
# $ Prostatic_Acid_Phosphatase <dbl> -1.620527, -1.73…
# $ Protein_S <dbl> -1.784998, -2.46…
# $ Pulmonary_and_Activation_Regulat <dbl> -0.8439701, -2.3…
# $ RANTES <dbl> -6.214608, -6.93…
# $ Resistin <dbl> -16.475315, -16.…
# $ S100b <dbl> 1.5618560, 1.756…
# $ SGOT <dbl> -0.94160854, -0.…
# $ SHBG <dbl> -1.897120, -1.56…
# $ SOD <dbl> 5.609472, 5.8141…
# $ Serum_Amyloid_P <dbl> -5.599422, -6.11…
# $ Sortilin <dbl> 4.908629, 5.4787…
# $ Stem_Cell_Factor <dbl> 4.174387, 3.7135…
# $ TGF_alpha <dbl> 8.649098, 11.331…
# $ TIMP_1 <dbl> 15.20465, 11.266…
# $ TNF_RII <dbl> -0.0618754, -0.3…
# $ TRAIL_R3 <dbl> -0.1829004, -0.5…
# $ TTR_prealbumin <dbl> 2.944439, 2.8332…
# $ Tamm_Horsfall_Protein_THP <dbl> -3.095810, -3.11…
# $ Thrombomodulin <dbl> -1.340566, -1.67…
# $ Thrombopoietin <dbl> -0.1026334, -0.6…
# $ Thymus_Expressed_Chemokine_TECK <dbl> 4.149327, 3.8101…
# $ Thyroid_Stimulating_Hormone <dbl> -3.863233, -4.82…
# $ Thyroxine_Binding_Globulin <dbl> -1.4271164, -1.6…
# $ Tissue_Factor <dbl> 2.04122033, 2.02…
# $ Transferrin <dbl> 3.332205, 2.8903…
# $ Trefoil_Factor_3_TFF3 <dbl> -3.381395, -3.91…
# $ VCAM_1 <dbl> 3.258097, 2.7080…
# $ VEGF <dbl> 22.03456, 18.601…
# $ Vitronectin <dbl> -0.04082199, -0.…
# $ von_Willebrand_Factor <dbl> -3.146555, -3.86…
# $ age <dbl> 0.9876238, 0.986…
# $ tau <dbl> 6.297754, 6.6592…
# $ p_tau <dbl> 4.348108, 4.8599…
# $ Ab_42 <dbl> 12.019678, 11.01…
# $ male <dbl> 0, 0, 1, 0, 0, 1…
# $ Genotype <fct> E3E3, E3E4, E3E4…
# $ Class <fct> Control, Control…

Alzheimer's disease data

  • N = 333

  • 1 categorical outcome: Class

  • 130 predictors

  • 126 protein measurements

  • also: age, male, Genotype
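A quick sketch to check how balanced the outcome is before modeling (output not shown here):

library(dplyr)
# tally the two levels of the outcome
alz %>% count(Class)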

What is the goal of machine learning?

Goal

Build models that

generate accurate predictions

for future, yet-to-be-seen data.

Max Kuhn & Kjell Johnson, http://www.feat.engineering/

This is our whole game, the vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.

We'll use this goal to drive learning of 3 core tidymodels packages (see the setup sketch after this list):

  • parsnip
  • yardstick
  • and rsample
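A minimal setup sketch, assuming the packages are installed: the tidymodels meta-package attaches parsnip, rsample, and yardstick (along with dplyr and friends), so one call covers everything used in this session.

# install.packages("tidymodels") # once, if needed
library(tidymodels) # attaches parsnip, rsample, yardstick, dplyr, ...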

🔨 Build models

with parsnip

Enter the parsnip package

parsnip

glm()

glm(Class ~ tau, family = binomial, data = alz)
#
# Call: glm(formula = Class ~ tau, family = binomial, data = alz)
#
# Coefficients:
# (Intercept) tau
# 13.664 -2.148
#
# Degrees of Freedom: 332 Total (i.e. Null); 331 Residual
# Null Deviance: 390.6
# Residual Deviance: 318.8 AIC: 322.8

So let's start with prediction. To predict, we need two things: a model to generate predictions, and data to predict on.

This type of formula interface may look familiar

How would we use parsnip to build this kind of logistic regression model?

To specify a model with parsnip

1. Pick a model

2. Set the engine

3. Set the mode (if needed)

To specify a model with parsnip

logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
# Logistic Regression Model Specification (classification)
#
# Computational engine: glm

To specify a model with parsnip

decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
# Decision Tree Model Specification (classification)
#
# Computational engine: C5.0

To specify a model with parsnip

nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
# K-Nearest Neighbor Model Specification (classification)
#
# Computational engine: kknn


1. Pick a model

All available models are listed at

https://www.tidymodels.org/find/parsnip/

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(penalty = NULL, mixture = NULL)


logistic_reg(
  mode = "classification", # "default" mode, if exists
  penalty = NULL,          # model hyper-parameter
  mixture = NULL           # model hyper-parameter
)


set_engine()

Adds an engine to power or implement the model.

logistic_reg() %>% set_engine(engine = "glm")


set_mode()

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

logistic_reg() %>% set_mode(mode = "classification")

Your turn 1

Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:

  • a decision tree model for classification

  • that uses the C5.0 engine.

Save it as tree_mod and look at the object. What is different about the output?

Hint: you'll need https://www.tidymodels.org/find/parsnip/

lr_mod <-
  logistic_reg() %>%
  set_engine(engine = "glm") %>%
  set_mode("classification")
lr_mod
# Logistic Regression Model Specification (classification)
#
# Computational engine: glm
tree_mod <-
  decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")
tree_mod
# Decision Tree Model Specification (classification)
#
# Computational engine: C5.0

Now we've built a model.

But, how do we use a model?

First - what does it mean to use a model?

Statistical models learn from the data.

Many learn model parameters, which can be useful as values for inference and interpretation.

Show of hands

How many people have fitted a statistical model with R?

A fitted model

lr_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz) %>%
  broom::tidy()
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 8.97 1.98 4.54 5.61e- 6
# 2 tau -4.01 0.456 -8.79 1.55e-18
# 3 VEGF 0.934 0.130 7.19 6.38e-13

"All models are wrong, but some are useful"

"All models are wrong, but some are useful"

Predict new data

alz_new <-
  tibble(tau = c(5, 6, 7),
         VEGF = c(15, 15, 15),
         Class = c("Control", "Control", "Impaired")) %>%
  mutate(Class = factor(Class, levels = c("Impaired", "Control")))
alz_new
# # A tibble: 3 x 3
# tau VEGF Class
# <dbl> <dbl> <fct>
# 1 5 15 Control
# 2 6 15 Control
# 3 7 15 Impaired

Show of hands

How many people have used a model to generate predictions with R?

Predict old data

tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz) %>%
  predict(new_data = alz) %>%
  mutate(true_class = alz$Class) %>%
  accuracy(truth = true_class,
           estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.856

Predict new data

out with the old...

tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz) %>%
  predict(new_data = alz) %>%
  mutate(true_class = alz$Class) %>%
  accuracy(truth = true_class,
           estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.856

in with the 🆕

tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz) %>%
  predict(new_data = alz_new) %>%
  mutate(true_class = alz_new$Class) %>%
  accuracy(truth = true_class,
           estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.667

fit()

Trains a model specification on data. Returns a fitted parsnip model.

fit(tree_mod, Class ~ tau + VEGF, data = alz)

Or, with the pipe, saving the result:

tree_fit <-
  tree_mod %>%              # parsnip model
  fit(Class ~ tau + VEGF,   # a formula
      data = alz            # dataframe
  )

predict()

Use a fitted model to predict new y values from data. Returns a tibble.

predict(tree_fit, new_data = alz_new)

tree_fit %>%
  predict(new_data = alz_new)
# # A tibble: 3 x 1
# .pred_class
# <fct>
# 1 Control
# 2 Impaired
# 3 Impaired

Axiom

The best way to measure a model's performance at predicting new data is to predict new data.

Data splitting

We refer to the group for which we know the outcome, and use to develop the algorithm, as the training set. We refer to the group for which we pretend we don’t know the outcome as the test set.

♻️ Resample models

with rsample

Enter the rsample package

rsample

initial_split()*

"Splits" data randomly into a single testing and a single training set.

initial_split(data, prop = 3/4)

* from rsample

alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_split
# <Analysis/Assess/Total>
# <300/33/333>


training() and testing()*

Extract training and testing sets from an rsplit

training(alz_split)
testing(alz_split)

* from rsample

alz_train <- training(alz_split)
alz_train
# # A tibble: 300 x 131
# ACE_CD143_Angio… ACTH_Adrenocort… AXL Adiponectin
# <dbl> <dbl> <dbl> <dbl>
# 1 2.00 -1.39 1.10 -5.36
# 2 1.56 -1.39 0.683 -5.02
# 3 1.52 -1.71 -0.145 -5.81
# 4 1.68 -1.61 0.683 -5.12
# 5 2.40 -0.968 0.191 -4.78
# 6 0.431 -1.27 -0.222 -5.22
# 7 0.946 -1.90 0.530 -6.12
# 8 0.708 -1.83 -0.327 -4.88
# 9 1.11 -1.97 0.191 -5.17
# 10 1.60 -1.51 0.449 -5.57
# # … with 290 more rows, and 127 more variables:
# # Alpha_1_Antichymotrypsin <dbl>,
# # Alpha_1_Antitrypsin <dbl>, Alpha_1_Microglobulin <dbl>,
# # Alpha_2_Macroglobulin <dbl>,
# # Angiopoietin_2_ANG_2 <dbl>, …

Quiz

Now that we have training and testing sets...

Which dataset do you think we use for fitting?

Which do we use for predicting?

Your turn 2

Fill in the blanks.

Use initial_split(), training(), and testing() to:

  1. Split alz into training and test sets. Save the rsplit!

  2. Extract the training data and fit your classification tree model.

  3. Predict the testing data, and save the true Class values.

  4. Measure the accuracy of your model with your test set.

Keep set.seed(100) at the start of your code.

set.seed(100) # Important!
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test <- testing(alz_split)
tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz_train) %>%
  predict(new_data = alz_test) %>%
  mutate(true_class = alz_test$Class) %>%
  accuracy(truth = true_class, estimate = .pred_class)

Goal of Machine Learning

🎯 generate accurate predictions

Now we have predictions from our model. What can we do with them? If we already know the truth, that is, the outcome variable that was observed, we can compare them!

Axiom

Better Model = Better Predictions (Lower error rate)

accuracy()*

Calculates the accuracy based on two columns in a dataframe:

The truth: $y_i$

The predicted estimate: $\hat{y}_i$

accuracy(data, truth, estimate)

* from yardstick

tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz_train) %>%
  predict(new_data = alz_test) %>%
  mutate(true_class = alz_test$Class) %>%
  accuracy(truth = true_class, estimate = .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.848

Your Turn 3

What would happen if you repeated this process? Would you get the same answers?

Note your accuracy from above. Then change your seed number and rerun just the last code chunk above. Do you get the same answer?

Try it a few times with a few different seeds.

# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.848

Quiz

Why is the new estimate different?
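A small sketch (not from the slides) makes the answer concrete: a different seed produces a different random split, so the held-out rows, and therefore the accuracy estimate, change.

set.seed(100)
split_a <- initial_split(alz, strata = Class, prop = .9)
set.seed(200)
split_b <- initial_split(alz, strata = Class, prop = .9)
# different seeds hold out different rows
identical(testing(split_a), testing(split_b)) # FALSE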

[Figure: data splitting repeated many times, with performance averaged across the splits (mean RMSE)]

Resampling

Let's resample 10 times

then compute the mean of the results...

# `acc` holds 10 test-set accuracies (computed by the for loop shown below)
acc %>% tibble::enframe(name = "accuracy")
# # A tibble: 10 x 2
# accuracy value
# <int> <dbl>
# 1 1 0.855
# 2 2 0.807
# 3 3 0.831
# 4 4 0.855
# 5 5 0.880
# 6 6 0.880
# 7 7 0.831
# 8 8 0.843
# 9 9 0.880
# 10 10 0.892
mean(acc)
# [1] 0.8554217

Guess

Which do you think is a better estimate?

The best result or the mean of the results? Why?
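A toy simulation (an illustration, not from the slides): each split's accuracy is a noisy draw around the model's true performance, so reporting the single best split is optimistically biased, while the mean is a steadier estimate.

# simulate 10 noisy accuracy estimates around a true value of 0.85
set.seed(123)
est <- 0.85 + rnorm(10, sd = 0.03)
max(est)  # the "best" split overstates performance
mean(est) # the mean stays close to the true value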

But also...

Fit with training set

Predict with testing set

Rinse and repeat?

There has to be a better way...

acc <- vector(length = 10, mode = "double")
for (i in 1:10) {
  new_split <- initial_split(alz)
  new_train <- training(new_split)
  new_test  <- testing(new_split)
  acc[i] <-
    lr_mod %>%
    fit(Class ~ tau + VEGF, data = new_train) %>%
    predict(new_test) %>%
    mutate(truth = new_test$Class) %>%
    accuracy(truth, .pred_class) %>%
    pull(.estimate)
}

The testing set is precious...

we can only use it once!

How can we use the training set to compare, evaluate, and tune models?

Cross-validation

V-fold cross-validation

vfold_cv(data, v = 10, ...)

Guess

How many times does an observation/row appear in the assessment set?
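One way to check, as a sketch on the built-in mtcars data: with vfold_cv(), every row lands in exactly one assessment set.

library(rsample)
set.seed(1)
folds <- vfold_cv(mtcars, v = 5)
# complement() returns the assessment-set row indices for a split
idx <- unlist(lapply(folds$splits, complement))
all(sort(idx) == seq_len(nrow(mtcars))) # TRUE: each row appears exactly once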

Quiz

If we use 10 folds, what percentage of our data will end up in the training set, and what percentage in the testing set, for each fold?

90% - training

10% - testing

Your Turn 4

Run the code below. What does it return?

set.seed(100)
alz_folds <-
  vfold_cv(alz_train, v = 10, strata = Class)
alz_folds
# # 10-fold cross-validation using stratification
# # A tibble: 10 x 2
# splits id
# <list> <chr>
# 1 <split [269/31]> Fold01
# 2 <split [269/31]> Fold02
# 3 <split [270/30]> Fold03
# 4 <split [270/30]> Fold04
# 5 <split [270/30]> Fold05
# 6 <split [270/30]> Fold06
# 7 <split [270/30]> Fold07
# 8 <split [270/30]> Fold08
# 9 <split [271/29]> Fold09
# 10 <split [271/29]> Fold10

We need a new way to fit

split1       <- alz_folds %>% pluck("splits", 1)
split1_train <- analysis(split1)   # the fold's "training" part
split1_test  <- assessment(split1) # the fold's "testing" part
tree_mod %>%
  fit(Class ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$Class) %>%
  accuracy(truth, .pred_class)
# rinse and repeat
split2 <- ...

fit_resamples()

Trains and tests a resampled model.

tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )
# # Resampling results
# # 10-fold cross-validation using stratification
# # A tibble: 10 x 4
# splits id .metrics .notes
# <list> <chr> <list> <list>
# 1 <split [269/31]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
# 2 <split [269/31]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
# 3 <split [270/30]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
# 4 <split [270/30]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
# 5 <split [270/30]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
# 6 <split [270/30]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
# 7 <split [270/30]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
# 8 <split [270/30]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
# 9 <split [271/29]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
# 10 <split [271/29]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>

collect_metrics()

Unnests the .metrics column from a tidymodels fit_resamples() result.

_results %>% collect_metrics(summarize = TRUE)

TRUE is actually the default; it averages across folds.

tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  ) %>%
  collect_metrics(summarize = FALSE)
# # A tibble: 20 x 4
# id .metric .estimator .estimate
# <chr> <chr> <chr> <dbl>
# 1 Fold01 accuracy binary 0.774
# 2 Fold01 roc_auc binary 0.692
# 3 Fold02 accuracy binary 0.839
# 4 Fold02 roc_auc binary 0.848
# 5 Fold03 accuracy binary 0.867
# 6 Fold03 roc_auc binary 0.852
# 7 Fold04 accuracy binary 0.8
# 8 Fold04 roc_auc binary 0.795
# 9 Fold05 accuracy binary 0.767
# 10 Fold05 roc_auc binary 0.744
# # … with 10 more rows

10-fold CV

10 different analysis/assessment sets

10 different models (trained on analysis sets)

10 different sets of performance statistics (on assessment sets)

Your Turn 5

Modify the code below to use fit_resamples and alz_folds to cross-validate the classification tree model. What is the ROC AUC that you collect at the end?

set.seed(100)
tree_mod %>%
  fit(Class ~ tau + VEGF,
      data = alz_train) %>%
  predict(new_data = alz_test) %>%
  mutate(true_class = alz_test$Class) %>%
  accuracy(truth = true_class, estimate = .pred_class)
set.seed(100)
tree_mod %>%
  fit_resamples(Class ~ tau + VEGF,
                resamples = alz_folds) %>%
  collect_metrics()
collect_metrics()
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 accuracy binary 0.854 10 0.0187
# 2 roc_auc binary 0.893 10 0.0120

How did we do?

tree_mod %>%
  fit(Class ~ tau + VEGF, data = alz_train) %>%
  predict(alz_test) %>%
  mutate(truth = alz_test$Class) %>%
  accuracy(truth, .pred_class)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.848
# versus the cross-validation estimates from Your Turn 5:
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 accuracy binary 0.854 10 0.0187
# 2 roc_auc binary 0.893 10 0.0120

What is Machine Learning?
