class: title-slide, center, bottom

# Resample Models

## Tidymodels, Virtually — Session 02

### Alison Hill

---
class: middle, center, frame

# Goal of Predictive Modeling

--

## 🔨 build .display[models] that

--

## 🎯 generate .display[accurate predictions]

--

## 🔮 for .display[future, yet-to-be-seen data]

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.

We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# Resample models

--

## with rsample

???

Enter the rsample package

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
background-image: url("images/saw.jpg")
background-size: contain
background-position: left
class: middle, right

.pull-right[

# *"Measure twice, <br>cut once"*

]

---
class: your-turn

# Your Turn 1

Run the first code chunk. Then fill in the blanks to

1. Create a split object that apportions 75% of `ames` to a training set and the remainder to a testing set.
2. Fit the `rt_spec` to the training set.
3. Predict with the testing set and compute the rmse of the fit.
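---
class: middle

# What a 75/25 split does

`initial_split()` defaults to `prop = 3/4`. A minimal base-R sketch of the idea (not rsample's actual implementation; the 100 rows here are made up, not `ames`):

```r
set.seed(123)
n <- 100                                        # pretend data with 100 rows
train_idx <- sample(n, size = floor(0.75 * n))  # 75 rows go to training
test_idx  <- setdiff(seq_len(n), train_idx)     # the other 25 go to testing
length(train_idx)  # 75
length(test_idx)   # 25
```

Every row lands in exactly one of the two sets, which is why `training()` and `testing()` never overlap.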
---

```r
new_split <- initial_split(ames)
new_train <- training(new_split)
new_test  <- testing(new_split)

rt_spec %>%
  fit(Sale_Price ~ ., data = new_train) %>%
  predict(new_test) %>%
  mutate(truth = new_test$Sale_Price) %>%
  rmse(truth, .pred)
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      43474.
```

---
class: your-turn

# Your Turn 2

What would happen if you repeated this process? Would you get the same answers?

Then rerun the last code chunk from Your Turn 1. Do you get the same answer? Try it a few times.
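---
class: middle

# Why the answer moves around

`initial_split()` draws a fresh random sample every time it runs, so each rerun trains and tests on different rows. A base-R sketch of the idea (toy indices, not `ames`):

```r
# two independent draws of "7 training rows out of 10" pick different rows
set.seed(1)
first_train <- sort(sample(10, 7))
set.seed(2)
second_train <- sort(sample(10, 7))
first_train
second_train
```

Different rows in means a different fitted model, and a different RMSE out.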
---

.pull-left[

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      39010.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      38326.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      42286.
```

]

--

.pull-right[

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      39431.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      43048.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      41873.
```

]

---
class: middle, center

# Quiz

Why is the new estimate different?

---
class: middle, center

# Data Splitting

--

<img src="figs/02-resample/unnamed-chunk-11-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-12-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-13-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-14-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-15-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-17-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-18-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/02-resample/unnamed-chunk-19-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-20-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-21-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-22-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-23-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-24-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-25-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-26-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean RMSE]

---
class: frame, center, middle

# Resampling

Let's resample 10 times then compute the mean of the results...

---

```r
rmses %>%
  tibble::enframe(name = "rmse")
# # A tibble: 10 x 2
#     rmse  value
#    <int>  <dbl>
#  1     1 38589.
#  2     2 40967.
#  3     3 41875.
#  4     4 44294.
#  5     5 42807.
#  6     6 36848.
#  7     7 36330.
#  8     8 40182.
#  9     9 41058.
# 10    10 39547.

mean(rmses)
# [1] 40249.72
```

---
class: middle, center

# Guess

Which do you think is a better estimate? The best result or the mean of the results? Why?

---
class: middle, center

# But also...

Fit with .display[training set]

Predict with .display[testing set]

--

Rinse and repeat?

---

# There has to be a better way...

```r
rmses <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(ames)
  new_train <- training(new_split)
  new_test  <- testing(new_split)
  rmses[i] <-
    rt_spec %>%
      fit(Sale_Price ~ ., data = new_train) %>%
      predict(new_test) %>%
      mutate(truth = new_test$Sale_Price) %>%
      rmse(truth, .pred) %>%
      pull(.estimate)
}
```

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[

## The .display[testing set] is precious...

## we can only use it once!
]

---
background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg)
background-size: 60%

---
class: middle, center, inverse

# Cross-validation

---
background-image: url(images/cross-validation/Slide2.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide3.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide4.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide5.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide6.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide7.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide8.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide9.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide10.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide11.png)
background-size: contain

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---
exclude: true

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/02-resample/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/02-resample/unnamed-chunk-31-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?

--

90% - training

10% - test

---
class: your-turn

# Your Turn 3

Run the code below. What does it return?

```r
set.seed(100)
cv_folds <-
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```
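---
class: middle

# What `vfold_cv()` is doing

A base-R sketch of plain v-fold assignment (ignoring the `strata` argument, and not rsample's actual code): every row is dealt into exactly one fold, and that fold is the only one where the row is assessed.

```r
set.seed(100)
n <- 20; v <- 5
fold <- sample(rep(seq_len(v), length.out = n))  # deal rows into v folds
# fold k's assessment set is the rows assigned to fold k
assess <- lapply(seq_len(v), function(k) which(fold == k))
lengths(assess)       # 4 rows per fold
sort(unlist(assess))  # 1..20: each row is assessed exactly once
```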
---

```r
set.seed(100)
cv_folds <-
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
# #  10-fold cross-validation using stratification
# # A tibble: 10 x 2
#    splits           id
#    <named list>     <chr>
#  1 <split [2K/221]> Fold01
#  2 <split [2K/221]> Fold02
#  3 <split [2K/220]> Fold03
#  4 <split [2K/220]> Fold04
#  5 <split [2K/220]> Fold05
#  6 <split [2K/220]> Fold06
#  7 <split [2K/220]> Fold07
#  8 <split [2K/219]> Fold08
#  9 <split [2K/219]> Fold09
# 10 <split [2K/218]> Fold10
```

---
class: middle

.center[

# We need a new way to fit

]

```r
split1       <- cv_folds %>% pluck("splits", 1)
split1_train <- training(split1)
split1_test  <- testing(split1)

rt_spec %>%
  fit(Sale_Price ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$Sale_Price) %>%
  rmse(truth, .pred)

# rinse and repeat
split2 <- ...
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.

]

```r
fit_resamples(
  rt_spec,
  Sale_Price ~ Gr_Liv_Area,
  resamples = cv_folds
)
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.
]

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  )
```

---

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  )
# #  10-fold cross-validation using stratification
# # A tibble: 10 x 4
#    splits           id     .metrics         .notes
#    <named list>     <chr>  <list>           <list>
#  1 <split [2K/221]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
#  2 <split [2K/221]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
#  3 <split [2K/220]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
#  4 <split [2K/220]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
#  5 <split [2K/220]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
#  6 <split [2K/220]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
#  7 <split [2K/220]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
#  8 <split [2K/219]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
#  9 <split [2K/219]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
# 10 <split [2K/218]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
```

---
class: middle, center

# `collect_metrics()`

Unnests the `.metrics` column from a tidymodels `fit_resamples()` result

```r
_results %>%
  collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  ) %>%
  collect_metrics(summarize = FALSE)
# # A tibble: 20 x 4
#    id     .metric .estimator .estimate
#    <chr>  <chr>   <chr>          <dbl>
#  1 Fold01 rmse    standard      60178.
#  2 Fold01 rsq     standard       0.430
#  3 Fold02 rmse    standard      58111.
#  4 Fold02 rsq     standard       0.339
#  5 Fold03 rmse    standard      61395.
#  6 Fold03 rsq     standard       0.426
#  7 Fold04 rmse    standard      54305.
#  8 Fold04 rsq     standard       0.474
#  9 Fold05 rmse    standard      56699.
# 10 Fold05 rsq     standard       0.522
# # … with 10 more rows
```

---
class: middle, center, frame

# 10-fold CV

### 10 different analysis/assessment sets

### 10 different models (trained on .display[analysis] sets)

### 10 different sets of performance statistics (on .display[assessment] sets)

---
class: your-turn

# Your Turn 4

Modify the code below to use `fit_resamples()` and `cv_folds` to cross-validate the regression tree model. Which RMSE do you collect at the end?

```r
set.seed(100)
rt_spec %>%
  fit(Sale_Price ~ ., data = new_train) %>%
  predict(new_test) %>%
  mutate(truth = new_test$Sale_Price) %>%
  rmse(truth, .pred)
```
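---
class: middle

# What the summary numbers mean

`collect_metrics()` with the default `summarize = TRUE` averages the per-fold estimates; the `std_err` column is (assuming tune's usual definition) the standard deviation of the fold estimates divided by the square root of the number of folds. A base-R check using the ten resample RMSEs printed earlier (rounded values, so the mean agrees only to the first decimal):

```r
fold_rmse <- c(38589, 40967, 41875, 44294, 42807,
               36848, 36330, 40182, 41058, 39547)
mean(fold_rmse)                          # 40249.7
sd(fold_rmse) / sqrt(length(fold_rmse))  # std-err-style spread across folds
```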
---

```r
set.seed(100)
rt_spec %>%
  fit_resamples(Sale_Price ~ .,
                resamples = cv_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric .estimator      mean     n  std_err
#   <chr>   <chr>          <dbl> <int>    <dbl>
# 1 rmse    standard   42257.       10 1044.
# 2 rsq     standard       0.718    10    0.0120
```

---

# How did we do?

```r
rt_spec %>%
  fit(Sale_Price ~ ., ames_train) %>%
  predict(ames_test) %>%
  mutate(truth = ames_test$Sale_Price) %>%
  rmse(truth, .pred)
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      40881.
```

```
# # A tibble: 2 x 5
#   .metric .estimator      mean     n  std_err
#   <chr>   <chr>          <dbl> <int>    <dbl>
# 1 rmse    standard   42257.       10 1044.
# 2 rsq     standard       0.718    10    0.0120
```

---
class: middle, center, inverse

# Other types of cross-validation

---
class: middle, center

# `vfold_cv()` - V-fold cross-validation

<img src="figs/02-resample/unnamed-chunk-42-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `loo_cv()` - Leave-one-out CV

<img src="figs/02-resample/loocv-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# `mc_cv()` - Monte Carlo (random) CV

(Test sets sampled without replacement)

<img src="figs/02-resample/mccv-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `bootstraps()`

(Analysis sets sampled with replacement; out-of-bag rows form the assessment set)

<img src="figs/02-resample/bootstrap-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center, frame

# yardstick

Functions that compute common model metrics

<tidymodels.github.io/yardstick/>

<iframe src="https://tidymodels.github.io/yardstick/" width="100%" height="400px"></iframe>

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

.pull-left[

```r
fit_resamples(
  object,
  resamples,
  ...,
* metrics = NULL,
  control = control_resamples()
)
```

]

.pull-right[

If `NULL`...
regression = `rmse` + `rsq`

classification = `accuracy` + `roc_auc`

]

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(rmse, rsq)
```

---
class: middle

.center[

# `fit_resamples()`

.fade[Trains and tests a model with cross-validation.]

]

.pull-left[

```r
fit_resamples(
  object,
  resamples,
  ...,
* metrics = metric_set(rmse, rsq),
  control = control_resamples()
)
```

]

---
class: middle, center, frame

# Metric Functions

<https://tidymodels.github.io/yardstick/reference/index.html>

<iframe src="https://tidymodels.github.io/yardstick/reference/index.html" width="100%" height="400px"></iframe>
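---
class: middle

# A metric by hand

yardstick's metric functions compute familiar formulas; RMSE, for instance, is the square root of the mean squared difference between truth and estimate. A base-R check with toy numbers (not the `ames` data):

```r
truth    <- c(200, 150, 300)
estimate <- c(210, 140, 280)
sqrt(mean((truth - estimate)^2))  # sqrt(200), about 14.14
```

yardstick's vector interface, `rmse_vec(truth, estimate)`, should return the same number.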