Build A Model

class: title-slide, center, bottom

# Build A Model

## Tidymodels, Virtually &mdash; Session 01

### Alison Hill

---
class: center, middle, inverse

# What is Machine Learning?

???

Machine Learning is usually thought of as a subfield of artificial intelligence that itself contains other hot sub-fields.

Let's start somewhere familiar. I have a data set and I want to analyze it.

The actual data set is named `ames` and it comes in the `AmesHousing` R package. No need to open your computers. Let's just discuss for a few minutes.

---
class: middle

# .center[AmesHousing]

Descriptions of 2,930 houses sold in Ames, IA from 2006 to 2010, collected by the Ames Assessor’s Office.

```r
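# install.packages("AmesHousing")
library(AmesHousing)
library(dplyr) # provides %>%, select(), and glimpse()

# Build the processed data and drop columns whose names contain "Qu"
ames <- make_ames() %>% 
  dplyr::select(-matches("Qu"))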
```

???

`ames` contains descriptions of 2,930 houses sold in Ames, IA from 2006 to 2010. The data comes from the Ames Assessor’s Office and contains things like the square footage of a house, its lot size, and its sale price.

---
class: middle

```r
glimpse(ames)
# Rows: 2,930
# Columns: 74
# $ MS_SubClass        <fct> One_Story_1946_and_Newer_All_S…
# $ MS_Zoning          <fct> Residential_Low_Density, Resid…
# $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 4…
# $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13…
# $ Street             <fct> Pave, Pave, Pave, Pave, Pave, …
# $ Alley              <fct> No_Alley_Access, No_Alley_Acce…
# $ Lot_Shape          <fct> Slightly_Irregular, Regular, S…
# $ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, …
# $ Utilities          <fct> AllPub, AllPub, AllPub, AllPub…
# $ Lot_Config         <fct> Corner, Inside, Corner, Corner…
# $ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, …
# $ Neighborhood       <fct> North_Ames, North_Ames, North_…
# $ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm,…
# $ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, …
# $ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam…
# $ House_Style        <fct> One_Story, One_Story, One_Stor…
# $ Overall_Cond       <fct> Average, Above_Average, Above_…
# $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, …
# $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, …
# $ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, G…
# $ Roof_Matl          <fct> CompShg, CompShg, CompShg, Com…
# $ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, Brk…
# $ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, Brk…
# $ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, No…
# $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0…
# $ Exter_Cond         <fct> Typical, Typical, Typical, Typ…
# $ Foundation         <fct> CBlock, CBlock, CBlock, CBlock…
# $ Bsmt_Cond          <fct> Good, Typical, Typical, Typica…
# $ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No…
# $ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, …
# $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, …
# $ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, …
# $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0…
# $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324,…
# $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 92…
# $ Heating            <fct> GasA, GasA, GasA, GasA, GasA, …
# $ Heating_QC         <fct> Fair, Typical, Typical, Excell…
# $ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
# $ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SB…
# $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 92…
# $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0,…
# $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1…
# $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, …
# $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
# $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, …
# $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, …
# $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
# $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, …
# $ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, …
# $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, …
# $ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd…
# $ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, …
# $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
# $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, …
# $ Garage_Cond        <fct> Typical, Typical, Typical, Typ…
# $ Paved_Drive        <fct> Partial_Pavement, Paved, Paved…
# $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0,…
# $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 1…
# $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0…
# $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0,…
# $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# $ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_…
# $ Fence              <fct> No_Fence, Minimum_Privacy, No_…
# $ Misc_Feature       <fct> None, None, Gar2, None, None, …
# $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0,…
# $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, …
# $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, …
# $ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , …
# $ Sale_Condition     <fct> Normal, Normal, Normal, Normal…
# $ Sale_Price         <int> 215000, 105000, 172000, 244000…
# $ Longitude          <dbl> -93.61975, -93.61976, -93.6193…
# $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, …
```

---
background-image: url(images/zestimate.png)
background-size: contain

---
class: middle, center, inverse

# What is the goal of predictive modeling?

---
class: middle, center, inverse

# What is the goal of machine learning?

---
class: middle, center, frame

# Goal

## 🔨 build .display[models] that

## 🎯 generate .display[accurate predictions]

## 🔮 for .display[future, yet-to-be-seen data]

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.

We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# 🔨 Build models

## with parsnip

???

Enter the parsnip package

---
exclude: true
name: predictions
class: middle, center, frame

# Goal of Predictive Modeling

## 🔮 generate accurate .display[predictions]

---
class: middle

# .center[`lm()`]

```r
lm_ames <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)
lm_ames
# 
# Call:
# lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)
# 
# Coefficients:
# (Intercept)  Gr_Liv_Area  
#     13289.6        111.7
```

???

So let's start with prediction. To predict, we need two things: a model to generate predictions, and data to predict on.

This type of formula interface may look familiar.

How would we use parsnip to build this kind of linear regression model?

---
name: step1
background-image: url("images/predicting/predicting.001.jpeg")
background-size: contain

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model]

2\. Set the .display[engine]

3\. Set the .display[mode] (if needed)

]

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

1\. Pick a .display[model]
.fade[
2\. Set the .display[engine]

3\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# 1\. Pick a .display[model]

All available models are listed at

<https://tidymodels.github.io/parsnip/articles/articles/Models.html>

---
class: middle

.center[
# `linear_reg()`

Specifies a model that uses linear regression
]

```r
linear_reg(mode = "regression", penalty = NULL, mixture = NULL)
```

---
class: middle

.center[
# `linear_reg()`

Specifies a model that uses linear regression
]

```r
linear_reg(
  mode = "regression", # "default" mode, if exists
  penalty = NULL,      # model hyper-parameter
  mixture = NULL       # model hyper-parameter
  )
```
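As a hedged illustration of what the two hyper-parameters do, the sketch below sets them explicitly. It assumes the `glmnet` engine, which this deck does not otherwise use; in parsnip, `penalty` is the amount of regularization and `mixture = 1` requests a pure lasso penalty. The value `0.1` is arbitrary, and fitting this spec would also require the glmnet package.

```r
library(parsnip)
library(dplyr) # for %>%

# Assumption: glmnet engine, chosen only to show the hyper-parameters in use
lasso_spec <- 
  linear_reg(penalty = 0.1, mixture = 1) %>% # arbitrary penalty, pure lasso
  set_engine("glmnet")

lasso_spec
```

With the `lm` engine used in the rest of this session, both arguments can stay `NULL`.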

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[
.fade[
1\. Pick a .display[model]
]

2\. Set the .display[engine]

.fade[
3\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# `set_engine()`

Adds an engine to power or implement the model.

```r
lm_spec %>% set_engine(engine = "lm", ...)
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[
.fade[
1\. Pick a .display[model]

2\. Set the .display[engine]
]

3\. Set the .display[mode] (if needed)

]

---
class: middle, center

# `set_mode()`

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

```r
lm_spec %>% set_mode(mode = "regression")
```

---
class: your-turn

# Your turn 1

Write a pipe that creates a model that uses `lm()` to fit a linear regression. Save it as `lm_spec` and look at the object. What does it return?

*Hint: you'll need https://tidymodels.github.io/parsnip/articles/articles/Models.html*

---

```r
lm_spec <- 
   linear_reg() %>%          # model type
   set_engine(engine = "lm") # engine

lm_spec
# Linear Regression Model Specification (regression)
# 
# Computational engine: lm
```

---
class: middle, center

# `fit()`

Train a model by fitting it to data. Returns a parsnip model fit.

```r
fit(lm_spec, Sale_Price ~ Gr_Liv_Area, data = ames)
```

---
class: middle

.center[
# `fit()`

Train a model by fitting it to data. Returns a parsnip model fit.
]

```r
fit(
  lm_spec,                  # parsnip model
  Sale_Price ~ Gr_Liv_Area, # a formula
  data = ames               # dataframe
  )
```

---
class: middle

.center[
# `fit()`

Train a model by fitting it to data. Returns a parsnip model fit.
]

```r
lm_spec %>%                     # parsnip model
  fit(Sale_Price ~ Gr_Liv_Area, # a formula
      data = ames               # dataframe
  )
```

---
class: your-turn

# Your turn 2

Double check. Does

```r
lm_fit <- 
  lm_spec %>% 
  fit(Sale_Price ~ Gr_Liv_Area, 
      data = ames)
lm_fit
```

give the same results as

```r
lm(Sale_Price ~ Gr_Liv_Area, data = ames)
```

---

```r
lm(Sale_Price ~ Gr_Liv_Area, data = ames)
# 
# Call:
# lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)
# 
# Coefficients:
# (Intercept)  Gr_Liv_Area  
#     13289.6        111.7
```

---

```r
lm_fit
# parsnip model object
# 
# Fit time:  2ms 
# 
# Call:
# stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
# 
# Coefficients:
# (Intercept)  Gr_Liv_Area  
#     13289.6        111.7
```
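
To check the match numerically rather than by eye, here is a minimal sketch. It uses the built-in `mtcars` data as a stand-in for `ames` (an assumption, so that it runs without `AmesHousing`) and relies on the engine's own fitted object being stored in the `fit` element of a parsnip model fit.

```r
library(parsnip)
library(dplyr) # for %>%

lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

# mtcars stands in for ames here
parsnip_fit <- lm_spec %>% fit(mpg ~ wt, data = mtcars)
base_fit    <- lm(mpg ~ wt, data = mtcars)

# Compare the coefficients of the underlying lm object with plain lm()
coefs_match <- isTRUE(all.equal(coef(parsnip_fit$fit), coef(base_fit)))
coefs_match
```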

---
name: handout
class: center, middle

data `(x, y)` + model = fitted model

---
class: center, middle

# Show of hands

How many people have used a fitted model to generate .display[predictions] with R?

---
template: step1

---
name: step2
background-image: url("images/predicting/predicting.003.jpeg")
background-size: contain

---
class: middle, center

# `predict()`

Use a fitted model to predict new `y` values from data. Returns a tibble.

```r
predict(lm_fit, new_data = new_homes) 
```

---

```r
lm_fit %>% 
  predict(new_data = ames)
# # A tibble: 2,930 x 1
#      .pred
#      <dbl>
#  1 198255.
#  2 113367.
#  3 161731.
#  4 248964.
#  5 195239.
#  6 192447.
#  7 162736.
#  8 156258.
#  9 193787.
# 10 214786.
# # … with 2,920 more rows
```

---

```r
new_homes <- tibble(Gr_Liv_Area = c(334, 1126, 1442, 1500, 1743, 5642))
lm_fit %>% 
  predict(new_data = new_homes)
# # A tibble: 6 x 1
#     .pred
#     <dbl>
# 1  50595.
# 2 139057.
# 3 174352.
# 4 180831.
# 5 207972.
# 6 643467.
```
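
For a one-predictor linear model, each prediction is just intercept + slope × `Gr_Liv_Area`. The sketch below redoes that arithmetic with the rounded coefficients printed earlier (13289.6 and 111.7), so the results should land close to, but not exactly on, the `.pred` values above.

```r
# Rounded coefficients copied from the printed lm output
intercept <- 13289.6
slope     <- 111.7

gr_liv_area <- c(334, 1126, 1442, 1500, 1743, 5642)

# Approximate predictions; small gaps from .pred come from coefficient rounding
approx_pred <- intercept + slope * gr_liv_area
approx_pred
```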

---
name: lm-predict
class: middle, center

# Predictions

---
class: your-turn

# Your turn 3

Fill in the blanks. Use `predict()` to

1. Use your linear model to predict sale prices; save the tibble as `price_pred`  
1. Add a pipe and use `mutate()` to add a column with the observed sale prices; name it `truth`

*Hint: Be sure to remove every `_` before running the code!*

---

```r
lm_fit <- 
  lm_spec %>% 
  fit(Sale_Price ~ Gr_Liv_Area, 
      data = ames)

price_pred <- 
  lm_fit %>% 
  predict(new_data = ames) %>% 
  mutate(truth = ames$Sale_Price)

price_pred
# # A tibble: 2,930 x 2
#      .pred  truth
#      <dbl>  <int>
#  1 198255. 215000
#  2 113367. 105000
#  3 161731. 172000
#  4 248964. 244000
#  5 195239. 189900
#  6 192447. 195500
#  7 162736. 213500
#  8 156258. 191500
#  9 193787. 236500
# 10 214786. 189000
# # … with 2,920 more rows
```

---
template: handout

data `(x)` + fitted model = predictions

---
template: predictions

---
name: accurate-predictions
class: middle, center, frame

# Goal of Machine Learning

## 🎯 generate .display[accurate predictions]

???

Now we have predictions from our model. What can we do with them? If we already know the truth, that is, the outcome variable that was observed, we can compare them!

---
class: middle, center, frame

# Axiom

Better Model = Better Predictions (Lower error rate)

---
template: lm-predict

---
class: middle, center

# Residuals

---
class: middle, center

# RMSE

Root Mean Squared Error - The standard deviation of the residuals about zero.

$$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - {y}_i)^2 }$$
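
A small worked example of the formula in base R, using three made-up values (not taken from `ames`):

```r
# Toy numbers, invented for illustration only
truth    <- c(200, 150, 300)
estimate <- c(210, 140, 320)

resid        <- estimate - truth      # 10, -10, 20
rmse_by_hand <- sqrt(mean(resid^2))   # sqrt((100 + 100 + 400) / 3) = sqrt(200)
rmse_by_hand
# [1] 14.14214
```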

---
class: middle, center

# `rmse()*`

Calculates the RMSE based on two columns in a dataframe:

The .display[truth]: `${y}_i$`

The predicted .display[estimate]: `$\hat{y}_i$`

```r
rmse(data, truth, estimate)
```

.footnote[`*` from `yardstick`]

---

```r
lm_fit <- 
  lm_spec %>% 
  fit(Sale_Price ~ Gr_Liv_Area, 
      data = ames)

price_pred <- 
  lm_fit %>% 
  predict(new_data = ames) %>% 
  mutate(price_truth = ames$Sale_Price)

price_pred %>% 
* rmse(truth = price_truth, estimate = .pred)
# # A tibble: 1 x 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      56505.
```

---
template: step1

---
template: step2

---
name: step3
background-image: url("images/predicting/predicting.004.jpeg")
background-size: contain

---
template: handout

data `(x)` + fitted model = predictions

data `(y)` + predictions = metrics

---
class: middle, center, inverse

# A model doesn't have to be a straight&nbsp;line!

---
exclude: true

```r
rt_spec <- 
  decision_tree() %>%          
  set_engine(engine = "rpart") %>% 
  set_mode("regression")

rt_fit     <- rt_spec %>% 
              fit(Sale_Price ~ Gr_Liv_Area, 
                  data = ames)

price_pred <- rt_fit %>% 
              predict(new_data = ames) %>% 
              mutate(price_truth = ames$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

---
class: middle, inverse, center

# Do you trust it?

---
class: middle, inverse, center

# Overfitting

---

.pull-left[

]

.pull-right[
<img src="figs/01-model/unnamed-chunk-37-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your turn 4

.left-column[
Take a minute and decide which model:

1. Has the smallest residuals  
2. Will have lower prediction error. Why?  
]

.right-column[
<img src="figs/01-model/unnamed-chunk-38-1.png" width="50%" /><img src="figs/01-model/unnamed-chunk-38-2.png" width="50%" />

]

---
class: middle, center, frame

# Axiom 1

The best way to measure a model's performance at predicting new data is to .display[predict new data].
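
A minimal base R sketch of why this matters, using simulated data (the seed, sample sizes, and polynomial degree are arbitrary assumptions). A very flexible model never does worse than a simple one on the data it was trained on, so only the new-data column can tell us which model predicts better.

```r
set.seed(1) # arbitrary seed

# Simulated data: the true relationship is a straight line plus noise
n   <- 60
x   <- runif(n, 0, 10)
sim <- data.frame(x = x, y = 2 * x + rnorm(n, sd = 3))

train_sim <- sim[1:40, ]  # data the models get to see
new_sim   <- sim[41:60, ] # data held back

# RMSE of a fitted model on a given data frame
rmse_of <- function(model, data) {
  sqrt(mean((predict(model, newdata = data) - data$y)^2))
}

simple   <- lm(y ~ x, data = train_sim)
flexible <- lm(y ~ poly(x, 10), data = train_sim) # deliberately too flexible

results <- data.frame(
  model      = c("simple", "flexible"),
  train_rmse = c(rmse_of(simple, train_sim), rmse_of(flexible, train_sim)),
  new_rmse   = c(rmse_of(simple, new_sim), rmse_of(flexible, new_sim))
)
results
```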

---
class: middle, center, frame

# Goal of Machine Learning

## 🔨 build .display[models] that

## 🎯 generate .display[accurate predictions]

## 🔮 for .display[future, yet-to-be-seen data]

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

But need new data...

---
class: middle, center, frame

# Data splitting

???

We refer to the group for which we know the outcome, and use to develop the algorithm, as the training set. We refer to the group for which we pretend we don’t know the outcome as the test set.

---
class: center, middle

# `initial_split()*`

"Splits" data randomly into a single testing and a single training set.

```r
initial_split(data, prop = 3/4)
```

.footnote[`*` from `rsample`]

---

```r
ames_split <- initial_split(ames, prop = 0.75)
ames_split
# <Analysis/Assess/Total>
# <2198/732/2930>
```
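
Conceptually, a random split just samples row numbers. The base R sketch below shows the idea on the built-in `mtcars` data; it is an illustration only and not how rsample is implemented.

```r
set.seed(100) # arbitrary seed so the split is reproducible

n          <- nrow(mtcars)                      # 32 rows
train_rows <- sample(n, size = floor(0.75 * n)) # pick 24 rows at random

toy_train <- mtcars[train_rows, ]
toy_test  <- mtcars[-train_rows, ]  # everything not picked

c(train = nrow(toy_train), test = nrow(toy_test)) # 24 and 8
```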

???

data splitting

---
class: center, middle

# `training()` and `testing()*`

Extract training and testing sets from an rsplit

```r
training(ames_split)
testing(ames_split)
```

.footnote[`*` from `rsample`]

---

```r
train_set <- training(ames_split) 
train_set
# # A tibble: 2,198 x 74
#    MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley
#    <fct>       <fct>            <dbl>    <int> <fct>  <fct>
#  1 One_Story_… Resident…          141    31770 Pave   No_A…
#  2 One_Story_… Resident…           80    11622 Pave   No_A…
#  3 One_Story_… Resident…           81    14267 Pave   No_A…
#  4 One_Story_… Resident…           93    11160 Pave   No_A…
#  5 Two_Story_… Resident…           74    13830 Pave   No_A…
#  6 Two_Story_… Resident…           78     9978 Pave   No_A…
#  7 One_Story_… Resident…           41     4920 Pave   No_A…
#  8 One_Story_… Resident…           43     5005 Pave   No_A…
#  9 One_Story_… Resident…           39     5389 Pave   No_A…
# 10 Two_Story_… Resident…           60     7500 Pave   No_A…
# # … with 2,188 more rows, and 68 more variables:
# #   Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>,
# #   Lot_Config <fct>, Land_Slope <fct>, …
```

---
class: middle, center

# Quiz

Now that we have training and testing sets...

Which dataset do you think we use for .display[fitting]?

Which do we use for .display[predicting]?

---
template: step1

---
template: step2

---
template: step3
background-image: url("images/predicting/predicting.004.jpeg")
background-size: contain

---
name: holdout-step2
background-image: url("images/predicting/predicting.006.jpeg")
background-size: contain

---
name: holdout-step3
background-image: url("images/predicting/predicting.007.jpeg")
background-size: contain

---
name: holdout-step4
background-image: url("images/predicting/predicting.008.jpeg")
background-size: contain

---
name: holdout
background-image: url("images/predicting/predicting.009.jpeg")
background-size: contain

---
class: your-turn

# Your turn 5

Fill in the blanks.

Use `initial_split()`, `training()`, `testing()`, `lm()` and `rmse()` to:

1. Split **ames** into training and test sets. Save the rsplit!

1. Extract the training data. Fit a linear model to it. Save the model!

1. Measure the RMSE of your linear model with your test set.

Keep `set.seed(100)` at the start of your code.

---

```r
set.seed(100) # Important!

ames_split  <- initial_split(ames)
ames_train  <- training(ames_split)
ames_test   <- testing(ames_split)

lm_fit      <- lm_spec %>% 
               fit(Sale_Price ~ Gr_Liv_Area, 
                   data = ames_train)

price_pred  <- lm_fit %>% 
               predict(new_data = ames_test) %>% 
               mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

Test-set RMSE = 53884.78; compare to 56504.88 when the model was fit and evaluated on all of `ames`.

---
class: middle, center

.pull-left[

### Training RMSE = 57367.26
<img src="figs/01-model/unnamed-chunk-49-1.png" width="504" style="display: block; margin: auto;" />

]

.pull-right[

### Testing RMSE = 53884.78
<img src="figs/01-model/lm-test-resid-1.png" width="504" style="display: block; margin: auto;" />
]

---
name: holdout-handout
class: center, middle

old data `(x, y)` + model = fitted model

new data `(x)` + fitted model = predictions

new data `(y)` + predictions = metrics

---
class: middle, center, inverse

# Stratified sampling

---

<img src="figs/01-model/unnamed-chunk-56-1.png" width="504" style="display: block; margin: auto;" />

---
<img src="figs/01-model/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" />

---

```r
set.seed(100) # Important!

ames_split  <- initial_split(ames, 
*                            strata = Sale_Price,
*                            breaks = 4)
ames_train  <- training(ames_split)
ames_test   <- testing(ames_split)

lm_fit      <- lm_spec %>% 
               fit(Sale_Price ~ Gr_Liv_Area, 
                   data = ames_train)

price_pred  <- lm_fit %>% 
               predict(new_data = ames_test) %>% 
               mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```
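
To show what stratifying on a numeric outcome means, here is a rough base R sketch: bin the outcome into four quantile groups, then sample 75% of the rows within each bin. It uses the built-in `mtcars` data with `mpg` standing in for `Sale_Price`, and it is an illustration of the idea rather than rsample's actual algorithm.

```r
set.seed(100) # arbitrary seed

# Four bins based on the quartiles of the outcome
bins <- cut(mtcars$mpg,
            breaks = quantile(mtcars$mpg, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Within each bin, keep 75% of the row numbers for training
rows_by_bin <- split(seq_len(nrow(mtcars)), bins)
train_idx <- unlist(lapply(rows_by_bin, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

strat_train <- mtcars[train_idx, ]
strat_test  <- mtcars[-train_idx, ]

# Every bin contributes to both sets
table(bins[train_idx])
table(bins[-train_idx])
```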

---
class: inverse, middle, center

# Key concepts

fitting a model (aka training a model)

predicting new data

overfitting

data splitting (+ stratified splits)