class: title-slide, center, bottom

# The Great Model-Off

## Tidymodels, Virtually — Session 06

### Alison Hill

---
class: middle, center

<img src="images/hotel.jpg" width="80%" style="display: block; margin: auto;" />

---

```r
library(tidyverse)
library(tidymodels)

# read in the data--------------------------------------------------------------
hotels <- read_csv('https://tidymodels.org/start/case-study/hotels.csv') %>%
  mutate_if(is.character, as.factor)

# data splitting----------------------------------------------------------------
set.seed(123)
splits <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test <- testing(splits)

# resample once-----------------------------------------------------------------
set.seed(234)
val_set <- validation_split(hotel_other, strata = children, prop = 0.80)
```

---

```r
glimpse(hotels)
# Rows: 50,000
# Columns: 23
# $ hotel                          <fct> City_Hotel, City_H…
# $ lead_time                      <dbl> 217, 2, 95, 143, 1…
# $ stays_in_weekend_nights        <dbl> 1, 0, 2, 2, 1, 2, …
# $ stays_in_week_nights           <dbl> 3, 1, 5, 6, 4, 2, …
# $ adults                         <dbl> 2, 2, 2, 2, 2, 2, …
# $ children                       <fct> none, none, none, …
# $ meal                           <fct> BB, BB, BB, HB, HB…
# $ country                        <fct> DEU, PRT, GBR, ROU…
# $ market_segment                 <fct> Offline_TA/TO, Dir…
# $ distribution_channel           <fct> TA/TO, Direct, TA/…
# $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, …
# $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, …
# $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, …
# $ reserved_room_type             <fct> A, D, A, A, F, A, …
# $ assigned_room_type             <fct> A, K, A, A, F, A, …
# $ booking_changes                <dbl> 0, 0, 2, 0, 0, 0, …
# $ deposit_type                   <fct> No_Deposit, No_Dep…
# $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, …
# $ customer_type                  <fct> Transient-Party, T…
# $ average_daily_rate             <dbl> 80.75, 170.00, 8.0…
# $ required_car_parking_spaces    <fct> none, none, none, …
# $ total_of_special_requests      <dbl> 1, 3, 2, 1, 4, 1, …
# $ arrival_date                   <date> 2016-09-01, 2017-…
```

---
class: middle

.pull-left[

```r
# training set proportions
hotel_other %>%
  count(children) %>%
  mutate(prop = n/sum(n))
# # A tibble: 2 x 3
#   children     n   prop
#   <fct>    <int>  <dbl>
# 1 children  3048 0.0813
# 2 none     34452 0.919
```

]

.pull-right[

```r
# test set proportions
hotel_test %>%
  count(children) %>%
  mutate(prop = n/sum(n))
# # A tibble: 2 x 3
#   children     n   prop
#   <fct>    <int>  <dbl>
# 1 children   990 0.0792
# 2 none     11510 0.921
```

]

---
class: middle, inverse, center

<img src="images/bird-turquoise.png" width="20%" style="display: block; margin: auto;" />

# Classification Challenge!

---
class: middle, center, frame

# Our Modeling Goal

Predict which hotel stays included children and/or babies

--

Based on the other characteristics of the stays, such as which hotel the guests stay at, how much they pay, etc.

---
class: middle, center, frame

# Your Challenge

Maximize area under the ROC curve (`roc_auc`)

---
class: middle

.pull-left[
<img src="images/two-birds2-alpha.png" width="568" style="display: block; margin: auto;" />
]

.pull-right[

## Work in groups

## Pick a model

## Tune!

## Select the top model

## FIN!

]

---
class: title-slide, center, bottom

# Our tidymodels

---
background-image: url(images/cranes.jpg)
background-position: left
background-size: contain
class: middle

.right-column[

# Choose from:

+ Penalized logistic regression

+ Decision tree

+ K-nearest neighbors

+ Random forest

+ *Any other classification model/engine you want from parsnip!*

]

---
class: middle

# Decision Tree Model

```r
tree_spec <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
```

Docs: https://tidymodels.github.io/parsnip/reference/decision_tree.html

---
class: middle

# Random Forest Model

```r
rf_spec <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
```

Docs: https://tidymodels.github.io/parsnip/reference/rand_forest.html

---
class: middle

# K-Nearest Neighbor Model

```r
knn_spec <-
  nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
```

Docs: https://tidymodels.github.io/parsnip/reference/nearest_neighbor.html

---
class: middle

# Lasso Logistic Regression Model

```r
lasso_spec <-
  logistic_reg(penalty = 0, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
```

Where:

+ `mixture = 0` is L2 (ridge) only, and

+ `mixture = 1` is L1 (lasso) only.

Docs: https://tidymodels.github.io/parsnip/reference/logistic_reg.html

---
background-image: url(images/bird-in-hand.jpg)
background-position: left
background-size: contain
class: middle, center

.pull-right[

# "A bird in the hand is worth two in the bush..."

]

---
class: middle, center, frame

# Our top model

Let's pick the best, then finalize with the test set together.
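
---
class: middle

# Tune, select, finalize: a sketch

One way the pick-then-finalize step might look, assuming the lasso model is the one a group chose. The recipe `hotel_rec`, the penalty grid, and the seed `345` are illustrative assumptions and not part of the original deck; the data setup is repeated from the first code slide only so the chunk runs on its own.

```r
library(tidyverse)
library(tidymodels)

# Same setup as the first code slide, repeated so this chunk is self-contained
hotels <- read_csv('https://tidymodels.org/start/case-study/hotels.csv') %>%
  mutate_if(is.character, as.factor)
set.seed(123)
splits <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
set.seed(234)
val_set <- validation_split(hotel_other, strata = children, prop = 0.80)

# Lasso spec with the penalty marked for tuning instead of fixed
lasso_tune_spec <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Assumed preprocessing: glmnet needs numeric predictors on a common scale
hotel_rec <-
  recipe(children ~ ., data = hotel_other) %>%
  step_date(arrival_date) %>%               # derive day-of-week, month, year
  step_rm(arrival_date) %>%                 # drop the raw date column
  step_dummy(all_nominal_predictors()) %>%  # factors to indicator columns
  step_zv(all_predictors()) %>%             # drop zero-variance columns
  step_normalize(all_predictors())          # center and scale

lasso_wf <-
  workflow() %>%
  add_recipe(hotel_rec) %>%
  add_model(lasso_tune_spec)

# Tune on the single validation split, scoring by ROC AUC
set.seed(345)
lasso_res <- tune_grid(
  lasso_wf,
  resamples = val_set,
  grid = tibble(penalty = 10^seq(-4, -1, length.out = 10)),
  metrics = metric_set(roc_auc)
)

# Keep the penalty with the highest validation ROC AUC
best_lasso <- select_best(lasso_res, metric = "roc_auc")

# Refit on all of hotel_other, then evaluate once on the test set
final_fit <-
  lasso_wf %>%
  finalize_workflow(best_lasso) %>%
  last_fit(splits)

collect_metrics(final_fit)
```

Any of the other specs (`tree_spec`, `rf_spec`, `knn_spec`) could take the place of the lasso here, with `tune()` placeholders on their own parameters and a recipe suited to that engine. The test set is only touched once, inside `last_fit()`.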