class: title-slide, center, bottom

# Resample Models

## Tidymodels, Virtually — Session 02

### Alison Hill

---
class: middle, center, frame

# Goal of Predictive Modeling

--

## 🔨 build .display[models] that

--

## 🎯 generate .display[accurate predictions]

--

## 🔮 for .display[future, yet-to-be-seen data]

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.

We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# Resample models

--

## with rsample

???

Enter the rsample package

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
background-image: url("images/saw.jpg")
background-size: contain
background-position: left
class: middle, right

.pull-right[

# *"Measure twice, <br>cut once"*

]

---
class: your-turn

# Your Turn 1

Run the first code chunk. Then fill in the blanks to

1. Create a split object that apportions 75% of `ames` to a training set and the remainder to a testing set.
2. Fit the `rt_spec` to the training set.
3. Predict with the testing set and compute the rmse of the fit.
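---
class: middle

# What a 75/25 split does

`initial_split()` defaults to `prop = 3/4`. A minimal base-R sketch of the idea (not rsample's actual implementation; the 100 rows here are made up, not `ames`):

```r
set.seed(123)
n <- 100                                        # pretend data with 100 rows
train_idx <- sample(n, size = floor(0.75 * n))  # 75 rows go to training
test_idx  <- setdiff(seq_len(n), train_idx)     # the other 25 go to testing
length(train_idx)  # 75
length(test_idx)   # 25
```

Every row lands in exactly one of the two sets, which is why `training()` and `testing()` never overlap.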
---

```r
new_split <- initial_split(ames)
new_train <- training(new_split)
new_test  <- testing(new_split)

rt_spec %>%
  fit(Sale_Price ~ ., data = new_train) %>%
  predict(new_test) %>%
  mutate(truth = new_test$Sale_Price) %>%
  rmse(truth, .pred)
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      43474.
```

---
class: your-turn

# Your Turn 2

What would happen if you repeated this process? Would you get the same answers?

Then rerun the last code chunk from Your Turn 1. Do you get the same answer? Try it a few times.
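---
class: middle

# Why the answer moves around

`initial_split()` draws a fresh random sample every time it runs, so each rerun trains and tests on different rows. A base-R sketch of the idea (toy indices, not `ames`):

```r
# two independent draws of "7 training rows out of 10" pick different rows
set.seed(1)
first_train <- sort(sample(10, 7))
set.seed(2)
second_train <- sort(sample(10, 7))
first_train
second_train
```

Different rows in means a different fitted model, and a different RMSE out.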
---

.pull-left[

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      39010.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      38326.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      42286.
```

]

--

.pull-right[

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      39431.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      43048.
```

```
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      41873.
```

]

---
class: middle, center

# Quiz

Why is the new estimate different?

---
class: middle, center

# Data Splitting

--

<img src="figs/02-resample/unnamed-chunk-11-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-12-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-13-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-14-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-15-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-17-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-18-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/02-resample/unnamed-chunk-19-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-20-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-21-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-22-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-23-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-24-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-25-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02-resample/unnamed-chunk-26-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean RMSE]

---
class: frame, center, middle

# Resampling

Let's resample 10 times then compute the mean of the results...

---

```r
rmses %>%
  tibble::enframe(name = "rmse")
# # A tibble: 10 x 2
#     rmse  value
#    <int>  <dbl>
#  1     1 38589.
#  2     2 40967.
#  3     3 41875.
#  4     4 44294.
#  5     5 42807.
#  6     6 36848.
#  7     7 36330.
#  8     8 40182.
#  9     9 41058.
# 10    10 39547.

mean(rmses)
# [1] 40249.72
```

---
class: middle, center

# Guess

Which do you think is a better estimate? The best result or the mean of the results? Why?

---
class: middle, center

# But also...

Fit with .display[training set]

Predict with .display[testing set]

--

Rinse and repeat?

---

# There has to be a better way...

```r
rmses <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(ames)
  new_train <- training(new_split)
  new_test  <- testing(new_split)
  rmses[i] <-
    rt_spec %>%
      fit(Sale_Price ~ ., data = new_train) %>%
      predict(new_test) %>%
      mutate(truth = new_test$Sale_Price) %>%
      rmse(truth, .pred) %>%
      pull(.estimate)
}
```

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[

## The .display[testing set] is precious...

## we can only use it once!
]

---
background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg)
background-size: 60%

---
class: middle, center, inverse

# Cross-validation

---
background-image: url(images/cross-validation/Slide2.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide3.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide4.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide5.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide6.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide7.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide8.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide9.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide10.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide11.png)
background-size: contain

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---
exclude: true

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/02-resample/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/02-resample/unnamed-chunk-31-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?

--

90% - training

10% - test

---
class: your-turn

# Your Turn 3

Run the code below. What does it return?

```r
set.seed(100)
cv_folds <-
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```
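---
class: middle

# What `vfold_cv()` is doing

A base-R sketch of plain v-fold assignment (ignoring the `strata` argument, and not rsample's actual code): every row is dealt into exactly one fold, and that fold is the only one where the row is assessed.

```r
set.seed(100)
n <- 20; v <- 5
fold <- sample(rep(seq_len(v), length.out = n))  # deal rows into v folds
# fold k's assessment set is the rows assigned to fold k
assess <- lapply(seq_len(v), function(k) which(fold == k))
lengths(assess)       # 4 rows per fold
sort(unlist(assess))  # 1..20: each row is assessed exactly once
```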
---

```r
set.seed(100)
cv_folds <-
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
# #  10-fold cross-validation using stratification
# # A tibble: 10 x 2
#    splits           id
#    <named list>     <chr>
#  1 <split [2K/221]> Fold01
#  2 <split [2K/221]> Fold02
#  3 <split [2K/220]> Fold03
#  4 <split [2K/220]> Fold04
#  5 <split [2K/220]> Fold05
#  6 <split [2K/220]> Fold06
#  7 <split [2K/220]> Fold07
#  8 <split [2K/219]> Fold08
#  9 <split [2K/219]> Fold09
# 10 <split [2K/218]> Fold10
```

---
class: middle

.center[

# We need a new way to fit

]

```r
split1       <- cv_folds %>% pluck("splits", 1)
split1_train <- training(split1)
split1_test  <- testing(split1)

rt_spec %>%
  fit(Sale_Price ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$Sale_Price) %>%
  rmse(truth, .pred)

# rinse and repeat
split2 <- ...
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.

]

```r
fit_resamples(
  rt_spec,
  Sale_Price ~ Gr_Liv_Area,
  resamples = cv_folds
)
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.
]

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  )
```

---

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  )
# #  10-fold cross-validation using stratification
# # A tibble: 10 x 4
#    splits           id     .metrics         .notes
#    <named list>     <chr>  <list>           <list>
#  1 <split [2K/221]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
#  2 <split [2K/221]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
#  3 <split [2K/220]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
#  4 <split [2K/220]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
#  5 <split [2K/220]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
#  6 <split [2K/220]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
#  7 <split [2K/220]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
#  8 <split [2K/219]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
#  9 <split [2K/219]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
# 10 <split [2K/218]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
```

---
class: middle, center

# `collect_metrics()`

Unnests the `.metrics` column from a tidymodels `fit_resamples()` result

```r
_results %>%
  collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---

```r
rt_spec %>%
  fit_resamples(
    Sale_Price ~ Gr_Liv_Area,
    resamples = cv_folds
  ) %>%
  collect_metrics(summarize = FALSE)
# # A tibble: 20 x 4
#    id     .metric .estimator .estimate
#    <chr>  <chr>   <chr>          <dbl>
#  1 Fold01 rmse    standard      60178.
#  2 Fold01 rsq     standard       0.430
#  3 Fold02 rmse    standard      58111.
#  4 Fold02 rsq     standard       0.339
#  5 Fold03 rmse    standard      61395.
#  6 Fold03 rsq     standard       0.426
#  7 Fold04 rmse    standard      54305.
#  8 Fold04 rsq     standard       0.474
#  9 Fold05 rmse    standard      56699.
# 10 Fold05 rsq     standard       0.522
# # … with 10 more rows
```

---
class: middle, center, frame

# 10-fold CV

### 10 different analysis/assessment sets

### 10 different models (trained on .display[analysis] sets)

### 10 different sets of performance statistics (on .display[assessment] sets)

---
class: your-turn

# Your Turn 4

Modify the code below to use `fit_resamples()` and `cv_folds` to cross-validate the regression tree model. Which RMSE do you collect at the end?

```r
set.seed(100)
rt_spec %>%
  fit(Sale_Price ~ ., data = new_train) %>%
  predict(new_test) %>%
  mutate(truth = new_test$Sale_Price) %>%
  rmse(truth, .pred)
```
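---
class: middle

# What the summary numbers mean

`collect_metrics()` with the default `summarize = TRUE` averages the per-fold estimates; the `std_err` column is (assuming tune's usual definition) the standard deviation of the fold estimates divided by the square root of the number of folds. A base-R check using the ten resample RMSEs printed earlier (rounded values, so the mean agrees only to the first decimal):

```r
fold_rmse <- c(38589, 40967, 41875, 44294, 42807,
               36848, 36330, 40182, 41058, 39547)
mean(fold_rmse)                          # 40249.7
sd(fold_rmse) / sqrt(length(fold_rmse))  # std-err-style spread across folds
```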
---

```r
set.seed(100)
rt_spec %>%
  fit_resamples(Sale_Price ~ .,
                resamples = cv_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric .estimator      mean     n  std_err
#   <chr>   <chr>          <dbl> <int>    <dbl>
# 1 rmse    standard   42257.       10 1044.
# 2 rsq     standard       0.718    10    0.0120
```

---

# How did we do?

```r
rt_spec %>%
  fit(Sale_Price ~ ., ames_train) %>%
  predict(ames_test) %>%
  mutate(truth = ames_test$Sale_Price) %>%
  rmse(truth, .pred)
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      40881.
```

```
# # A tibble: 2 x 5
#   .metric .estimator      mean     n  std_err
#   <chr>   <chr>          <dbl> <int>    <dbl>
# 1 rmse    standard   42257.       10 1044.
# 2 rsq     standard       0.718    10    0.0120
```

---
class: middle, center, inverse

# Other types of cross-validation

---
class: middle, center

# `vfold_cv()` - V-fold cross-validation

<img src="figs/02-resample/unnamed-chunk-42-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `loo_cv()` - Leave-one-out CV

<img src="figs/02-resample/loocv-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# `mc_cv()` - Monte Carlo (random) CV

(Test sets sampled without replacement)

<img src="figs/02-resample/mccv-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `bootstraps()`

(Analysis sets sampled with replacement; out-of-bag rows form the assessment set)

<img src="figs/02-resample/bootstrap-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center, frame

# yardstick

Functions that compute common model metrics

<tidymodels.github.io/yardstick/>

<iframe src="https://tidymodels.github.io/yardstick/" width="100%" height="400px"></iframe>

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

.pull-left[

```r
fit_resamples(
  object,
  resamples,
  ...,
* metrics = NULL,
  control = control_resamples()
)
```

]

.pull-right[

If `NULL`...
regression = `rmse` + `rsq`

classification = `accuracy` + `roc_auc`

]

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(rmse, rsq)
```

---
class: middle

.center[

# `fit_resamples()`

.fade[Trains and tests a model with cross-validation.]

]

.pull-left[

```r
fit_resamples(
  object,
  resamples,
  ...,
* metrics = metric_set(rmse, rsq),
  control = control_resamples()
)
```

]

---
class: middle, center, frame

# Metric Functions

<https://tidymodels.github.io/yardstick/reference/index.html>

<iframe src="https://tidymodels.github.io/yardstick/reference/index.html" width="100%" height="400px"></iframe>
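---
class: middle

# A metric by hand

yardstick's metric functions compute familiar formulas; RMSE, for instance, is the square root of the mean squared difference between truth and estimate. A base-R check with toy numbers (not the `ames` data):

```r
truth    <- c(200, 150, 300)
estimate <- c(210, 140, 280)
sqrt(mean((truth - estimate)^2))  # sqrt(200), about 14.14
```

yardstick's vector interface, `rmse_vec(truth, estimate)`, should return the same number.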