class: title-slide, center, bottom # Build A Model ## Tidymodels, Virtually — Session 01 ### Alison Hill --- class: center, middle, inverse # What is Machine Learning? ??? Machine Learning is usually thought of as a subfield of artificial intelligence that itself contains other hot sub-fields. Let's start somewhere familiar. I have a data set and I want to analyze it. The actual data set is named `ames` and it comes in the `AmesHousing` R package. No need to open your computers. Let's just discuss for a few minutes. --- class: middle # .center[AmesHousing] Descriptions of 2,930 houses sold in Ames, IA from 2006 to 2010, collected by the Ames Assessor’s Office. ```r # install.packages("AmesHousing") library(AmesHousing) ames <- make_ames() %>% dplyr::select(-matches("Qu")) ``` ??? `ames` contains descriptions of 2,930 houses sold in Ames, IA from 2006 to 2010. The data comes from the Ames Assessor’s Office and contains things like the square footage of a house, its lot size, and its sale price. 
--- class: middle ```r glimpse(ames) # Rows: 2,930 # Columns: 74 # $ MS_SubClass <fct> One_Story_1946_and_Newer_All_S… # $ MS_Zoning <fct> Residential_Low_Density, Resid… # $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 4… # $ Lot_Area <int> 31770, 11622, 14267, 11160, 13… # $ Street <fct> Pave, Pave, Pave, Pave, Pave, … # $ Alley <fct> No_Alley_Access, No_Alley_Acce… # $ Lot_Shape <fct> Slightly_Irregular, Regular, S… # $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, … # $ Utilities <fct> AllPub, AllPub, AllPub, AllPub… # $ Lot_Config <fct> Corner, Inside, Corner, Corner… # $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, … # $ Neighborhood <fct> North_Ames, North_Ames, North_… # $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm,… # $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, … # $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam… # $ House_Style <fct> One_Story, One_Story, One_Stor… # $ Overall_Cond <fct> Average, Above_Average, Above_… # $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, … # $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, … # $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, G… # $ Roof_Matl <fct> CompShg, CompShg, CompShg, Com… # $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, Brk… # $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, Brk… # $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, No… # $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0… # $ Exter_Cond <fct> Typical, Typical, Typical, Typ… # $ Foundation <fct> CBlock, CBlock, CBlock, CBlock… # $ Bsmt_Cond <fct> Good, Typical, Typical, Typica… # $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No… # $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, … # $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, … # $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, … # $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0… # $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324,… # $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 92… # $ Heating <fct> GasA, GasA, GasA, GasA, GasA, … # $ 
Heating_QC <fct> Fair, Typical, Typical, Excell… # $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, … # $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SB… # $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 92… # $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0,… # $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1… # $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, … # $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … # $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, … # $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, … # $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, … # $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … # $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, … # $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, … # $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, … # $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd… # $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, … # $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, … # $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, … # $ Garage_Cond <fct> Typical, Typical, Typical, Typ… # $ Paved_Drive <fct> Partial_Pavement, Paved, Paved… # $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0,… # $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 1… # $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0… # $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … # $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0,… # $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … # $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_… # $ Fence <fct> No_Fence, Minimum_Privacy, No_… # $ Misc_Feature <fct> None, None, Gar2, None, None, … # $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0,… # $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, … # $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, … # $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , … # $ Sale_Condition <fct> Normal, Normal, Normal, Normal… # $ Sale_Price <int> 215000, 105000, 172000, 244000… # $ Longitude <dbl> -93.61975, -93.61976, -93.6193… # $ 
Latitude <dbl> 42.05403, 42.05301, 42.05266, …
```

---
background-image: url(images/zestimate.png)
background-size: contain

---
class: middle, center, inverse

# What is the goal of predictive modeling?

---
class: middle, center, inverse

# What is the goal of machine learning?

---
class: middle, center, frame

# Goal

--

## 🔨 build .display[models] that

--

## 🎯 generate .display[accurate predictions]

--

## 🔮 for .display[future, yet-to-be-seen data]

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.

We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# 🔨 Build models

--

## with parsnip

???

Enter the parsnip package.

---
exclude: true
name: predictions
class: middle, center, frame

# Goal of Predictive Modeling

## 🔮 generate accurate .display[predictions]

---
class: middle

# .center[`lm()`]

```r
lm_ames <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)
lm_ames
#
# Call:
# lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)
#
# Coefficients:
# (Intercept)  Gr_Liv_Area
#     13289.6        111.7
```

???

So let's start with prediction. To predict, we need two things: a model to generate predictions, and data to predict on.

This type of formula interface may look familiar.

How would we use parsnip to build this kind of linear regression model?

---
name: step1
background-image: url("images/predicting/predicting.001.jpeg")
background-size: contain

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model]

2\. Set the .display[engine]

3\. Set the .display[mode] (if needed)

]

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

1\. Pick a .display[model]

.fade[
2\. Set the .display[engine]

3\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# 1\. Pick a .display[model]

All available models are listed at

<https://tidymodels.github.io/parsnip/articles/articles/Models.html>

<iframe src="https://tidymodels.github.io/parsnip/articles/articles/Models.html" width="504" height="400px"></iframe>

---
class: middle

.center[
# `linear_reg()`

Specifies a model that uses linear regression
]

```r
linear_reg(mode = "regression", penalty = NULL, mixture = NULL)
```

---
class: middle

.center[
# `linear_reg()`

Specifies a model that uses linear regression
]

```r
linear_reg(
  mode = "regression", # "default" mode, if one exists
  penalty = NULL,      # model hyper-parameter
  mixture = NULL       # model hyper-parameter
)
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

.fade[
1\. Pick a .display[model]
]

2\. Set the .display[engine]

.fade[
3\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# `set_engine()`

Adds an engine to power or implement the model.

```r
lm_spec %>% set_engine(engine = "lm", ...)
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

.fade[
1\. Pick a .display[model]

2\. Set the .display[engine]
]

3\. Set the .display[mode] (if needed)

]

---
class: middle, center

# `set_mode()`

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if the mode is already set in Step 1.

```r lm_spec %>% set_mode(mode = "regression") ``` --- class: your-turn # Your turn 1 Write a pipe that creates a model that uses `lm()` to fit a linear regression. Save it as `lm_spec` and look at the object. What does it return? *Hint: you'll need https://tidymodels.github.io/parsnip/articles/articles/Models.html*
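---
class: middle

The three steps also work for models other than linear regression. Here is a minimal sketch that reuses the regression tree from later in this session; the object name `rt_spec` is only illustrative, and the `rpart` package is assumed to be installed.

```r
library(parsnip)
library(dplyr)   # loaded for the %>% pipe

rt_spec <-
  decision_tree() %>%        # 1. pick a model
  set_engine("rpart") %>%    # 2. set the engine
  set_mode("regression")     # 3. set the mode (trees can do both, so it is needed here)

# Printing a specification shows the model type, mode, and engine;
# nothing has been fit yet
rt_spec
```

A specification is only a description of the model we intend to fit. No data is involved until we call `fit()`.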
---

```r
lm_spec <-
  linear_reg() %>%            # model type
  set_engine(engine = "lm")   # engine

lm_spec
# Linear Regression Model Specification (regression)
#
# Computational engine: lm
```

---
class: middle, center

# `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.

```r
fit(lm_spec, Sale_Price ~ Gr_Liv_Area, data = ames)
```

---
class: middle

.center[
# `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.
]

```r
fit(
  lm_spec,                    # parsnip model
  Sale_Price ~ Gr_Liv_Area,   # a formula
  data = ames                 # dataframe
)
```

---
class: middle

.center[
# `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.
]

```r
lm_spec %>%                       # parsnip model
  fit(Sale_Price ~ Gr_Liv_Area,   # a formula
      data = ames                 # dataframe
  )
```

---
class: your-turn

# Your turn 2

Double check. Does

```r
lm_fit <-
  lm_spec %>%
  fit(Sale_Price ~ Gr_Liv_Area, data = ames)

lm_fit
```

give the same results as

```r
lm(Sale_Price ~ Gr_Liv_Area, data = ames)
```
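---
class: middle

Besides comparing the printed output by eye, one way to check is to compare the coefficients directly. This is a sketch, assuming the `AmesHousing` package is installed and that the parsnip fit keeps the underlying `lm` object in its `fit` element; `lm_base` is just an illustrative name.

```r
library(parsnip)
library(dplyr)        # loaded for the %>% pipe
library(AmesHousing)

ames <- make_ames()

# Fit the same model two ways
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ Gr_Liv_Area, data = ames)

lm_base <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)

# Compare the coefficients of the engine-level fit with those from base R
same_coefs <- all.equal(coef(lm_fit$fit), coef(lm_base))
same_coefs
```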
--- ```r lm(Sale_Price ~ Gr_Liv_Area, data = ames) # # Call: # lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames) # # Coefficients: # (Intercept) Gr_Liv_Area # 13289.6 111.7 ``` --- ```r lm_fit # parsnip model object # # Fit time: 2ms # # Call: # stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data) # # Coefficients: # (Intercept) Gr_Liv_Area # 13289.6 111.7 ``` --- name: handout class: center, middle data `(x, y)` + model = fitted model --- class: center, middle # Show of hands How many people have used a fitted model to generate .display[predictions] with R? --- template: step1 --- name: step2 background-image: url("images/predicting/predicting.003.jpeg") background-size: contain --- class: middle, center # `predict()` Use a fitted model to predict new `y` values from data. Returns a tibble. ```r predict(lm_fit, new_data = new_homes) ``` --- ```r lm_fit %>% predict(new_data = ames) # # A tibble: 2,930 x 1 # .pred # <dbl> # 1 198255. # 2 113367. # 3 161731. # 4 248964. # 5 195239. # 6 192447. # 7 162736. # 8 156258. # 9 193787. # 10 214786. # # … with 2,920 more rows ``` --- ```r new_homes <- tibble(Gr_Liv_Area = c(334, 1126, 1442, 1500, 1743, 5642)) lm_fit %>% predict(new_data = new_homes) # # A tibble: 6 x 1 # .pred # <dbl> # 1 50595. # 2 139057. # 3 174352. # 4 180831. # 5 207972. # 6 643467. ``` --- name: lm-predict class: middle, center # Predictions <img src="figs/01-model/lm-predict-1.png" width="504" style="display: block; margin: auto;" /> --- class: your-turn # Your turn 3 Fill in the blanks. Use `predict()` to 1. Use your linear model to predict sale prices; save the tibble as `price_pred` 1. Add a pipe and use `mutate()` to add a column with the observed sale prices; name it `truth` *Hint: Be sure to remove every `_` before running the code!*
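---
class: middle

For a linear model, each prediction is just the intercept plus the slope times the new `x` value. The sketch below recomputes two predictions by hand and compares them with `predict()`. It assumes `AmesHousing` is installed; `by_hand` and `by_parsnip` are illustrative names.

```r
library(parsnip)
library(dplyr)        # loaded for %>% and tibble()
library(AmesHousing)

ames <- make_ames()

lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ Gr_Liv_Area, data = ames)

new_homes <- tibble(Gr_Liv_Area = c(1126, 1500))

# Predictions from parsnip
by_parsnip <- predict(lm_fit, new_data = new_homes)$.pred

# The same predictions from the fitted coefficients
b <- coef(lm_fit$fit)
by_hand <- b[["(Intercept)"]] + b[["Gr_Liv_Area"]] * new_homes$Gr_Liv_Area

all.equal(as.numeric(by_parsnip), as.numeric(by_hand))
```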
--- ```r lm_fit <- lm_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames) price_pred <- lm_fit %>% predict(new_data = ames) %>% mutate(truth = ames$Sale_Price) price_pred # # A tibble: 2,930 x 2 # .pred truth # <dbl> <int> # 1 198255. 215000 # 2 113367. 105000 # 3 161731. 172000 # 4 248964. 244000 # 5 195239. 189900 # 6 192447. 195500 # 7 162736. 213500 # 8 156258. 191500 # 9 193787. 236500 # 10 214786. 189000 # # … with 2,920 more rows ``` --- template: handout -- data `(x)` + fitted model = predictions --- template: predictions --- name: accurate-predictions class: middle, center, frame # Goal of Machine Learning ## 🎯 generate .display[accurate predictions] ??? Now we have predictions from our model. What can we do with them? If we already know the truth, that is, the outcome variable that was observed, we can compare them! --- class: middle, center, frame # Axiom Better Model = Better Predictions (Lower error rate) --- template: lm-predict --- class: middle, center # Residuals <img src="figs/01-model/lm-resid-1.png" width="504" style="display: block; margin: auto;" /> --- class: middle, center # RMSE Root Mean Squared Error - The standard deviation of the residuals about zero. $$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - {y}_i)^2 }$$ --- class: middle, center # `rmse()*` Calculates the RMSE based on two columns in a dataframe: The .display[truth]: `\({y}_i\)` The predicted .display[estimate]: `\(\hat{y}_i\)` ```r rmse(data, truth, estimate) ``` .footnote[`*` from `yardstick`] --- ```r lm_fit <- lm_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames) price_pred <- lm_fit %>% predict(new_data = ames) %>% mutate(price_truth = ames$Sale_Price) price_pred %>% * rmse(truth = price_truth, estimate = .pred) # # A tibble: 1 x 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 56505. 
``` --- template: step1 --- template: step2 --- name: step3 background-image: url("images/predicting/predicting.004.jpeg") background-size: contain --- template: handout -- data `(x)` + fitted model = predictions -- data `(y)` + predictions = metrics --- class: middle, center, inverse # A model doesn't have to be a straight line! --- exclude: true ```r rt_spec <- decision_tree() %>% set_engine(engine = "rpart") %>% set_mode("regression") rt_fit <- rt_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames) price_pred <- rt_fit %>% predict(new_data = ames) %>% mutate(price_truth = ames$Sale_Price) rmse(price_pred, truth = price_truth, estimate = .pred) ``` --- class: middle, center <img src="figs/01-model/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> --- class: middle, center <img src="figs/01-model/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" /> --- class: middle, inverse, center # Do you trust it? --- class: middle, inverse, center # Overfitting --- <img src="figs/01-model/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-33-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-34-1.png" width="504" style="display: block; margin: auto;" /> --- .pull-left[ <img src="figs/01-model/unnamed-chunk-36-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figs/01-model/unnamed-chunk-37-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: your-turn # Your turn 4 .left-column[ Take a minute and decide which model: 1. Has the smallest residuals 2. Will have lower prediction error. Why? ] .right-column[ <img src="figs/01-model/unnamed-chunk-38-1.png" width="50%" /><img src="figs/01-model/unnamed-chunk-38-2.png" width="50%" /> ]
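---
class: middle

The RMSE formula from earlier can also be written out directly. This sketch computes it by hand and compares the result with `rmse()` from yardstick. It assumes `AmesHousing` is installed; `rmse_by_hand` and `rmse_yardstick` are illustrative names.

```r
library(parsnip)
library(yardstick)
library(dplyr)        # loaded for %>% and mutate()
library(AmesHousing)

ames <- make_ames()

lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ Gr_Liv_Area, data = ames)

price_pred <-
  lm_fit %>%
  predict(new_data = ames) %>%
  mutate(price_truth = ames$Sale_Price)

# Square the residuals, average them, take the square root
rmse_by_hand <- sqrt(mean((price_pred$.pred - price_pred$price_truth)^2))

# The same metric from yardstick
rmse_yardstick <-
  rmse(price_pred, truth = price_truth, estimate = .pred)$.estimate

all.equal(rmse_by_hand, rmse_yardstick)
```

Note that this RMSE is computed on the same data used to fit the model.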
---

<img src="figs/01-model/unnamed-chunk-40-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/01-model/unnamed-chunk-41-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center, frame

# Axiom 1

The best way to measure a model's performance at predicting new data is to .display[predict new data].

---
class: middle, center, frame

# Goal of Machine Learning

--

## 🔨 build .display[models] that

--

## 🎯 generate .display[accurate predictions]

--

## 🔮 for .display[future, yet-to-be-seen data]

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

But we need new data...

---
class: middle, center, frame

# Data splitting

--

<img src="figs/01-model/all-split-1.png" width="864" style="display: block; margin: auto;" />

???

We refer to the group for which we know the outcome, and which we use to develop the algorithm, as the training set. We refer to the group for which we pretend we don’t know the outcome as the test set.

---
class: center, middle

# `initial_split()*`

"Splits" data randomly into a single testing and a single training set.

```r
initial_split(data, prop = 3/4)
```

.footnote[`*` from `rsample`]

---

```r
ames_split <- initial_split(ames, prop = 0.75)
ames_split
# <Analysis/Assess/Total>
# <2198/732/2930>
```

???

data splitting --- class: center, middle # `training()` and `testing()*` Extract training and testing sets from an rsplit ```r training(ames_split) testing(ames_split) ``` .footnote[`*` from `rsample`] --- ```r train_set <- training(ames_split) train_set # # A tibble: 2,198 x 74 # MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley # <fct> <fct> <dbl> <int> <fct> <fct> # 1 One_Story_… Resident… 141 31770 Pave No_A… # 2 One_Story_… Resident… 80 11622 Pave No_A… # 3 One_Story_… Resident… 81 14267 Pave No_A… # 4 One_Story_… Resident… 93 11160 Pave No_A… # 5 Two_Story_… Resident… 74 13830 Pave No_A… # 6 Two_Story_… Resident… 78 9978 Pave No_A… # 7 One_Story_… Resident… 41 4920 Pave No_A… # 8 One_Story_… Resident… 43 5005 Pave No_A… # 9 One_Story_… Resident… 39 5389 Pave No_A… # 10 Two_Story_… Resident… 60 7500 Pave No_A… # # … with 2,188 more rows, and 68 more variables: # # Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, # # Lot_Config <fct>, Land_Slope <fct>, … ``` --- class: middle, center # Quiz Now that we have training and testing sets... -- Which dataset do you think we use for .display[fitting]? -- Which do we use for .display[predicting]? --- template: step1 --- template: step2 --- template: step3 background-image: url("images/predicting/predicting.004.jpeg") background-size: contain --- name: holdout-step2 background-image: url("images/predicting/predicting.006.jpeg") background-size: contain --- name: holdout-step3 background-image: url("images/predicting/predicting.007.jpeg") background-size: contain --- name: holdout-step4 background-image: url("images/predicting/predicting.008.jpeg") background-size: contain --- name: holdout background-image: url("images/predicting/predicting.009.jpeg") background-size: contain --- class: your-turn # Your turn 5 Fill in the blanks. Use `initial_split()`, `training()`, `testing()`, `lm()` and `rmse()` to: 1. Split **ames** into training and test sets. Save the rsplit! 1. Extract the training data. 
Fit a linear model to it. Save the model! 1. Measure the RMSE of your linear model with your test set. Keep `set.seed(100)` at the start of your code.
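---
class: middle

A quick sanity check on any split: every row should land in exactly one of the two sets. The sketch below assumes `AmesHousing` is installed and adds a hypothetical `.row` id column (not part of the original data) so the two sets can be compared.

```r
library(rsample)
library(dplyr)        # loaded for %>%, mutate(), row_number()
library(AmesHousing)

set.seed(100)

# `.row` is an illustrative id column added only for this check
ames_id <- make_ames() %>% mutate(.row = row_number())

ames_split <- initial_split(ames_id, prop = 0.75)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# The two sets together account for every row...
nrow(ames_train) + nrow(ames_test) == nrow(ames_id)

# ...and no row appears in both
n_shared <- length(intersect(ames_train$.row, ames_test$.row))
n_shared
```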
--- ```r set.seed(100) # Important! ames_split <- initial_split(ames) ames_train <- training(ames_split) ames_test <- testing(ames_split) lm_fit <- lm_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames_train) price_pred <- lm_fit %>% predict(new_data = ames_test) %>% mutate(price_truth = ames_test$Sale_Price) rmse(price_pred, truth = price_truth, estimate = .pred) ``` RMSE = 53884.78; compare to 56504.88 --- class: middle, center .pull-left[ ### Training RMSE = 57367.26 <img src="figs/01-model/unnamed-chunk-49-1.png" width="504" style="display: block; margin: auto;" /> ] -- .pull-right[ ### Testing RMSE = 53884.78 <img src="figs/01-model/lm-test-resid-1.png" width="504" style="display: block; margin: auto;" /> ] --- name: holdout-handout class: center, middle old data `(x, y)` + model = fitted model -- new data `(x)` + fitted model = predictions -- new data `(y)` + predictions = metrics --- class: middle, center, inverse # Stratified sampling --- <img src="figs/01-model/unnamed-chunk-51-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-52-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-53-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-54-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-55-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-56-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-58-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-59-1.png" width="504" style="display: block; margin: auto;" /> --- <img src="figs/01-model/unnamed-chunk-60-1.png" width="504" style="display: block; margin: 
auto;" /> --- ```r set.seed(100) # Important! ames_split <- initial_split(ames, * strata = Sale_Price, * breaks = 4) ames_train <- training(ames_split) ames_test <- testing(ames_split) lm_fit <- lm_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames_train) price_pred <- lm_fit %>% predict(new_data = ames_test) %>% mutate(price_truth = ames_test$Sale_Price) rmse(price_pred, truth = price_truth, estimate = .pred) ``` --- class: inverse, middle, center # Key concepts fitting a model (aka training a model) predicting new data overfitting data splitting (+ stratified splits)
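
---
class: middle

To see what a stratified split buys us, we can compare the distribution of `Sale_Price` in the two sets. This is a sketch, assuming `AmesHousing` is installed; `split_summary` is an illustrative name. With stratification, the quartiles of the training and testing sets should look similar.

```r
library(rsample)
library(dplyr)        # loaded for %>%, bind_rows(), group_by(), summarise()
library(AmesHousing)

set.seed(100)
ames <- make_ames()

ames_split <- initial_split(ames, strata = Sale_Price, breaks = 4)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Summarise sale prices in each set, side by side
split_summary <-
  bind_rows(train = ames_train, test = ames_test, .id = "set") %>%
  group_by(set) %>%
  summarise(
    n      = n(),
    q25    = quantile(Sale_Price, 0.25, names = FALSE),
    median = median(Sale_Price),
    q75    = quantile(Sale_Price, 0.75, names = FALSE)
  )

split_summary
```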