class: title-slide, center, bottom

# Ensemble Models

## Tidymodels, Virtually — Session 04

### Alison Hill

---
class: middle, frame, center

# Decision Trees

To predict the outcome of a new data point:

Uses rules learned from splits

Each split maximizes information gain

---
class: middle, center

![](https://media.giphy.com/media/gj4ZruUQUnpug/source.gif)

---

<img src="figs/04-ensemble/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04-ensemble/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

How do we assess predictions here?

--

RMSE

---

<img src="figs/04-ensemble/rt-test-resid-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

.pull-left[
### LM RMSE = 53884.78

<img src="figs/04-ensemble/lm-test-resid-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
### Tree RMSE = 61687.24

<img src="figs/04-ensemble/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: inverse, middle, center

.pull-left[
<img src="figs/04-ensemble/lm-fig-1.svg" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figs/04-ensemble/dt-fig-1.svg" width="504" style="display: block; margin: auto;" />
]

---
class: middle, center

<img src="https://raw.githubusercontent.com/EmilHvitfeldt/blog/master/static/blog/2019-08-09-authorship-classification-with-tidymodels-and-textrecipes_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" />

https://www.hvitfeldt.me/blog/authorship-classification-with-tidymodels-and-textrecipes/

---
class: middle, center

<img src="https://www.kaylinpavlik.com/content/images/2019/12/dt-1.png" width="50%" style="display: block; margin: auto;" />

https://www.kaylinpavlik.com/classifying-songs-genres/

---
class: middle, center

<img src="https://a3.typepad.com/6a0105360ba1c6970c01b7c95c61fb970b-pi" width="40%" style="display: block; margin: auto;" />

.footnote[[tweetbotornot2](https://github.com/mkearney/tweetbotornot2)]

---
name: guess-the-animal
class: middle, center, inverse

<img src="http://www.atarimania.com/8bit/screens/guess_the_animal.gif" width="100%" style="display: block; margin: auto;" />

---
class: middle, center

# What makes a good guesser?

--

High information gain per question (can it fly?)

--

Clear features (feathers vs. is it "small"?)

--

Order matters

---
background-image: url(images/aus-standard-animals.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/aus-standard-tree.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/annotated-tree/annotated-tree.001.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.002.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.003.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.004.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.005.png)
background-size: cover

---
background-image: url(images/copyingandpasting-big.png)
background-size: contain
background-position: center
class: middle, center

---
background-image: url(images/so-dev-survey.png)
background-size: contain
background-position: center
class: middle, center

---

<img src="https://github.com/juliasilge/supervised-ML-case-studies-course/blob/master/img/remote_size.png?raw=true" width="80%" style="display: block; margin: auto;" />

.footnote[[Julia Silge](https://supervised-ml-course.netlify.com/)]

???

Notes: The specific question we are going to address is what makes a developer more likely to work remotely. Developers can work in their company offices or they can work remotely, and it turns out that there are specific characteristics of developers, such as the size of the company that they work for, how much experience they have, or where in the world they live, that affect how likely they are to be a remote developer.

---

# StackOverflow Data

```r
# read in the data
stackoverflow <- read_rds(here::here("materials/data/stackoverflow.rds"))

glimpse(stackoverflow)
# Rows: 1,150
# Columns: 20
# $ salary                               <dbl> 63750.00, 93…
# $ years_coded_job                      <int> 4, 9, 8, 3, …
# $ open_source                          <dbl> 0, 1, 1, 1, …
# $ hobby                                <dbl> 1, 1, 1, 0, …
# $ company_size_number                  <dbl> 20, 1000, 10…
# $ remote                               <fct> Remote, Remo…
# $ career_satisfaction                  <int> 8, 8, 5, 10,…
# $ data_scientist                       <dbl> 0, 0, 1, 0, …
# $ database_administrator               <dbl> 1, 0, 1, 0, …
# $ desktop_applications_developer       <dbl> 1, 0, 1, 0, …
# $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, …
# $ dev_ops                              <dbl> 0, 0, 0, 0, …
# $ embedded_developer                   <dbl> 0, 0, 0, 0, …
# $ graphic_designer                     <dbl> 0, 0, 0, 0, …
# $ graphics_programming                 <dbl> 0, 0, 0, 0, …
# $ machine_learning_specialist          <dbl> 0, 0, 0, 0, …
# $ mobile_developer                     <dbl> 0, 1, 0, 0, …
# $ quality_assurance_engineer           <dbl> 0, 0, 0, 0, …
# $ systems_administrator                <dbl> 1, 0, 1, 0, …
# $ web_developer                        <dbl> 0, 0, 0, 1, …
```

---

# Data Splitting & Resampling

```r
set.seed(100) # Important!
so_split <- initial_split(stackoverflow, strata = remote)
so_train <- training(so_split)
so_test  <- testing(so_split)

# use 10-fold CV
so_folds <- vfold_cv(so_train, strata = remote)
```

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model]

2\. Set the .display[engine]

3\. Set the .display[mode] (if needed)

]

---
class: middle, frame

# .center[To specify a classification tree with parsnip]

```r
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
```

---
class: your-turn

# Your turn 1

Here is our very-vanilla parsnip model specification for a decision tree (also in your Rmd)...

```r
vanilla_tree_spec <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
```

---
class: your-turn

# Your turn 1

Fill in the blanks to return the accuracy and ROC AUC for this model using 10-fold cross-validation.
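---
class: middle

# Aside: what do the 10 folds look like?

A minimal base-R sketch of the idea behind 10-fold cross-validation, not how `vfold_cv()` is implemented. It assumes a hypothetical training set of 864 rows and ignores stratification; `n`, `v`, and `fold_id` are names made up for this illustration.

```r
set.seed(100)
n <- 864  # hypothetical number of training rows
v <- 10   # number of folds

# Randomly assign every row to one of the v folds, as evenly as possible
fold_id <- sample(rep(seq_len(v), length.out = n))

# Rows held out for assessment in each fold; the rest are used for fitting
assessment_rows <- split(seq_len(n), fold_id)
fold_sizes <- lengths(assessment_rows)
fold_sizes

# Every row is held out exactly once across the folds
times_held_out <- tabulate(unlist(assessment_rows), nbins = n)
all(times_held_out == 1)
```

Each model is fit 10 times on about 90% of the rows and scored on the remaining 10%; `collect_metrics()` reports the average of those 10 scores.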
---

```r
set.seed(100)
vanilla_tree_spec %>% 
  fit_resamples(remote ~ ., resamples = so_folds) %>% 
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.643    10  0.0165
# 2 roc_auc  binary     0.668    10  0.0170
```

---
class: middle, center

# `args()`

Print the arguments for a **parsnip** model specification.

```r
args(decision_tree)
```

---
class: middle, center

# `decision_tree()`

Specifies a decision tree model

```r
decision_tree(tree_depth = 30, min_n = 20, cost_complexity = .01)
```

--

*either* mode works!

---
class: middle

.center[
# `decision_tree()`

Specifies a decision tree model
]

```r
decision_tree(
  tree_depth = 30,       # max tree depth
  min_n = 20,            # min data points in a node to split it
  cost_complexity = .01  # typically between 0 and 0.1
  )
```

---
class: middle, center

# `set_args()`

Change the arguments for a **parsnip** model specification.

```r
vanilla_tree_spec %>% set_args(tree_depth = 3)
```

---
class: middle

```r
decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification") %>% 
* set_args(tree_depth = 3)
# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle

```r
*decision_tree(tree_depth = 3) %>%
  set_engine("rpart") %>% 
  set_mode("classification")
# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle, center

# `tree_depth`

Cap the maximum tree depth.

A method to stop the tree early. Used to prevent overfitting.

```r
vanilla_tree_spec %>% set_args(tree_depth = 30)
```

---
class: middle, center
exclude: true

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-25-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-26-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `min_n`

Set minimum `n` to split at any node.

Another early stopping method. Used to prevent overfitting.

```r
vanilla_tree_spec %>% set_args(min_n = 20)
```

---
class: middle, center

# Quiz

What value of `min_n` would lead to the *most overfit* tree?

--

`min_n` = 1

---
class: middle, center, frame

# Recap: early stopping

| `parsnip` arg | `rpart` arg | default | overfit? |
|---------------|-------------|:-------:|:--------:|
| `tree_depth`  | `maxdepth`  |    30   |⬆️|
| `min_n`       | `minsplit`  |    20   |⬇️|

---
class: middle, center

# `cost_complexity`

Adds a cost or penalty to error rates of more complex trees.

A way to prune a tree. Used to prevent overfitting.

```r
vanilla_tree_spec %>% set_args(cost_complexity = .01)
```

--

Closer to zero ➡️ larger trees.

Higher penalty ➡️ smaller trees.

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-29-1.png" width="720" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-30-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-31-1.png" width="864" style="display: block; margin: auto;" />

---
name: bonsai
background-image: url(images/kari-shea-AVqh83jStMA-unsplash.jpg)
background-position: left
background-size: contain
class: middle

---
template: bonsai

.pull-right[
# Consider the bonsai

1. Small pot

1. Strong shears
]

---
template: bonsai

.pull-right[
# Consider the bonsai

1. ~~Small pot~~ .display[Early stopping]

1. ~~Strong shears~~ .display[Pruning]
]

---
class: middle, center, frame

# Recap: early stopping & pruning

| `parsnip` arg     | `rpart` arg | default | overfit? |
|-------------------|-------------|:-------:|:--------:|
| `tree_depth`      | `maxdepth`  |    30   |⬆️|
| `min_n`           | `minsplit`  |    20   |⬇️|
| `cost_complexity` | `cp`        |   .01   |⬇️|

---
class: middle, center

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> engine </th>
   <th style="text-align:left;"> parsnip </th>
   <th style="text-align:left;"> original </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> rpart </td>
   <td style="text-align:left;"> tree_depth </td>
   <td style="text-align:left;"> maxdepth </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rpart </td>
   <td style="text-align:left;"> min_n </td>
   <td style="text-align:left;"> minsplit </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rpart </td>
   <td style="text-align:left;"> cost_complexity </td>
   <td style="text-align:left;"> cp </td>
  </tr>
</tbody>
</table>

<https://rdrr.io/cran/rpart/man/rpart.control.html>

---
class: your-turn

# Your turn 2

Create a new classification tree model spec; call it `big_tree_spec`. Set the cost complexity to `0`, and the minimum number of data points in a node to split to be `1`.

Compare the metrics of the big tree to the vanilla tree: which one predicts held-out data better?

*Hint: you'll need https://tidymodels.github.io/parsnip/reference/decision_tree.html*
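---
class: middle

# Aside: the same knobs in `rpart`

A rough sketch of how the three arguments look when passed straight to the engine through `rpart.control()`. It uses the built-in `iris` data rather than the Stack Overflow data, and `minsplit = 2` with `minbucket = 1` are values chosen for this illustration; `default_fit`, `loose_fit`, and `n_leaves()` are hypothetical names.

```r
library(rpart)

# Engine defaults: maxdepth = 30, minsplit = 20, cp = 0.01
default_fit <- rpart(Species ~ ., data = iris, method = "class")

# Loosen early stopping and switch off the complexity penalty
loose_fit <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(maxdepth = 30, minsplit = 2, minbucket = 1, cp = 0)
)

# Count terminal nodes (leaves) in a fitted rpart tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

c(default = n_leaves(default_fit), loose = n_leaves(loose_fit))
```

Fewer restrictions should give a tree with at least as many leaves, which is what makes it prone to overfitting.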
--- ```r big_tree_spec <- * decision_tree(min_n = 1, cost_complexity = 0) %>% set_engine("rpart") %>% set_mode("classification") set.seed(100) # Important! big_tree_spec %>% fit_resamples(remote ~ ., resamples = so_folds) %>% collect_metrics() # # A tibble: 2 x 5 # .metric .estimator mean n std_err # <chr> <chr> <dbl> <int> <dbl> # 1 accuracy binary 0.564 10 0.0138 # 2 roc_auc binary 0.564 10 0.0138 ``` -- Compare to `vanilla`: accuracy = 0.64; ROC AUC = 0.67 --- exclude: true class: middle .center[ # Where is the fit? ] ```r big_tree ## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples ## # A tibble: 1 x 6 ## splits id .metrics .notes .predictions .workflow ## <list> <chr> <list> <list> <list> <list> ## 1 <split … train/… <tibble [… <tibbl… <tibble [28… <workflo… ``` --- exclude: true class: middle .center[ # Where is the fit? ] ```r get_tree_fit(big_tree) # parsnip model object # # Fit time: 55ms # n= 864 # # node), split, n, loss, yval, (yprob) # * denotes terminal node # # 1) root 864 432 Remote (0.50000000 0.50000000) # 2) salary>=89196.97 329 103 Remote (0.68693009 0.31306991) # 4) company_size_number< 15 54 5 Remote (0.90740741 0.09259259) # 8) company_size_number< 5.5 27 0 Remote (1.00000000 0.00000000) * # 9) company_size_number>=5.5 27 5 Remote (0.81481481 0.18518519) # 18) web_developer>=0.5 24 3 Remote (0.87500000 0.12500000) # 36) desktop_applications_developer< 0.5 22 2 Remote (0.90909091 0.09090909) # 72) career_satisfaction< 8.5 10 0 Remote (1.00000000 0.00000000) * # 73) career_satisfaction>=8.5 12 2 Remote (0.83333333 0.16666667) # 146) career_satisfaction>=9.5 8 0 Remote (1.00000000 0.00000000) * # 147) career_satisfaction< 9.5 4 2 Remote (0.50000000 0.50000000) # 294) salary< 97500 1 0 Remote (1.00000000 0.00000000) * # 295) salary>=97500 3 1 Not remote (0.33333333 0.66666667) # 590) mobile_developer>=0.5 1 0 Remote (1.00000000 0.00000000) * # 591) mobile_developer< 0.5 2 0 Not remote (0.00000000 1.00000000) * # 37) 
desktop_applications_developer>=0.5 2 1 Remote (0.50000000 0.50000000) # 74) salary< 121000 1 0 Remote (1.00000000 0.00000000) * # 75) salary>=121000 1 0 Not remote (0.00000000 1.00000000) * # 19) web_developer< 0.5 3 1 Not remote (0.33333333 0.66666667) # 38) open_source>=0.5 1 0 Remote (1.00000000 0.00000000) * # 39) open_source< 0.5 2 0 Not remote (0.00000000 1.00000000) * # 5) company_size_number>=15 275 98 Remote (0.64363636 0.35636364) # 10) years_coded_job>=7.5 196 58 Remote (0.70408163 0.29591837) # 20) open_source>=0.5 85 17 Remote (0.80000000 0.20000000) # 40) salary>=93750 79 13 Remote (0.83544304 0.16455696) # 80) graphics_programming< 0.5 76 11 Remote (0.85526316 0.14473684) # 160) machine_learning_specialist< 0.5 73 9 Remote (0.87671233 0.12328767) # 320) company_size_number>=750 26 1 Remote (0.96153846 0.03846154) # 640) salary>=113500 21 0 Remote (1.00000000 0.00000000) * # 641) salary< 113500 5 1 Remote (0.80000000 0.20000000) # 1282) years_coded_job>=11.5 4 0 Remote (1.00000000 0.00000000) * # 1283) years_coded_job< 11.5 1 0 Not remote (0.00000000 1.00000000) * # 321) company_size_number< 750 47 8 Remote (0.82978723 0.17021277) # 642) years_coded_job< 14.5 26 2 Remote (0.92307692 0.07692308) # 1284) career_satisfaction< 9.5 19 0 Remote (1.00000000 0.00000000) * # 1285) career_satisfaction>=9.5 7 2 Remote (0.71428571 0.28571429) # 2570) salary>=130500 4 0 Remote (1.00000000 0.00000000) * # 2571) salary< 130500 3 1 Not remote (0.33333333 0.66666667) # 5142) salary< 115000 1 0 Remote (1.00000000 0.00000000) * # 5143) salary>=115000 2 0 Not remote (0.00000000 1.00000000) * # 643) years_coded_job>=14.5 21 6 Remote (0.71428571 0.28571429) # 1286) career_satisfaction>=7.5 17 3 Remote (0.82352941 0.17647059) # 2572) hobby>=0.5 13 0 Remote (1.00000000 0.00000000) * # 2573) hobby< 0.5 4 1 Not remote (0.25000000 0.75000000) # 5146) years_coded_job< 17.5 1 0 Remote (1.00000000 0.00000000) * # 5147) years_coded_job>=17.5 3 0 Not remote (0.00000000 1.00000000) 
* # 1287) career_satisfaction< 7.5 4 1 Not remote (0.25000000 0.75000000) # 2574) hobby< 0.5 1 0 Remote (1.00000000 0.00000000) * # 2575) hobby>=0.5 3 0 Not remote (0.00000000 1.00000000) * # 161) machine_learning_specialist>=0.5 3 1 Not remote (0.33333333 0.66666667) # 322) years_coded_job< 13 1 0 Remote (1.00000000 0.00000000) * # 323) years_coded_job>=13 2 0 Not remote (0.00000000 1.00000000) * # 81) graphics_programming>=0.5 3 1 Not remote (0.33333333 0.66666667) # 162) salary< 131500 1 0 Remote (1.00000000 0.00000000) * # 163) salary>=131500 2 0 Not remote (0.00000000 1.00000000) * # 41) salary< 93750 6 2 Not remote (0.33333333 0.66666667) # 82) mobile_developer>=0.5 2 0 Remote (1.00000000 0.00000000) * # 83) mobile_developer< 0.5 4 0 Not remote (0.00000000 1.00000000) * # 21) open_source< 0.5 111 41 Remote (0.63063063 0.36936937) # 42) salary< 115870.5 49 12 Remote (0.75510204 0.24489796) # 84) salary>=109360 19 1 Remote (0.94736842 0.05263158) # 168) company_size_number>=60 15 0 Remote (1.00000000 0.00000000) * # 169) company_size_number< 60 4 1 Remote (0.75000000 0.25000000) # 338) career_satisfaction< 8.5 3 0 Remote (1.00000000 0.00000000) * # 339) career_satisfaction>=8.5 1 0 Not remote (0.00000000 1.00000000) * # 85) salary< 109360 30 11 Remote (0.63333333 0.36666667) # 170) graphics_programming< 0.5 28 9 Remote (0.67857143 0.32142857) # 340) hobby< 0.5 11 1 Remote (0.90909091 0.09090909) # 680) years_coded_job>=9.5 10 0 Remote (1.00000000 0.00000000) * # 681) years_coded_job< 9.5 1 0 Not remote (0.00000000 1.00000000) * # 341) hobby>=0.5 17 8 Remote (0.52941176 0.47058824) # 682) salary>=93500 15 6 Remote (0.60000000 0.40000000) # 1364) years_coded_job>=8.5 14 5 Remote (0.64285714 0.35714286) # 2728) career_satisfaction< 7.5 4 0 Remote (1.00000000 0.00000000) * # 2729) career_satisfaction>=7.5 10 5 Remote (0.50000000 0.50000000) # 5458) career_satisfaction>=9.5 2 0 Remote (1.00000000 0.00000000) * # 5459) career_satisfaction< 9.5 8 3 Not remote 
(0.37500000 0.62500000) # 10918) company_size_number< 60 1 0 Remote (1.00000000 0.00000000) * # 10919) company_size_number>=60 7 2 Not remote (0.28571429 0.71428571) # 21838) embedded_developer>=0.5 1 0 Remote (1.00000000 0.00000000) * # 21839) embedded_developer< 0.5 6 1 Not remote (0.16666667 0.83333333) # 43678) company_size_number>=7500 2 1 Remote (0.50000000 0.50000000) # 87356) salary>=101000 1 0 Remote (1.00000000 0.00000000) * # 87357) salary< 101000 1 0 Not remote (0.00000000 1.00000000) * # 43679) company_size_number< 7500 4 0 Not remote (0.00000000 1.00000000) * # 1365) years_coded_job< 8.5 1 0 Not remote (0.00000000 1.00000000) * # 683) salary< 93500 2 0 Not remote (0.00000000 1.00000000) * # 171) graphics_programming>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 43) salary>=115870.5 62 29 Remote (0.53225806 0.46774194) # 86) developer_with_stats_math_background< 0.5 57 24 Remote (0.57894737 0.42105263) # 172) systems_administrator>=0.5 5 0 Remote (1.00000000 0.00000000) * # 173) systems_administrator< 0.5 52 24 Remote (0.53846154 0.46153846) # 346) web_developer< 0.5 19 5 Remote (0.73684211 0.26315789) # 692) company_size_number< 5250 10 0 Remote (1.00000000 0.00000000) * # 693) company_size_number>=5250 9 4 Not remote (0.44444444 0.55555556) # 1386) salary>=131502 6 2 Remote (0.66666667 0.33333333) # 2772) salary< 165000 3 0 Remote (1.00000000 0.00000000) * # 2773) salary>=165000 3 1 Not remote (0.33333333 0.66666667) # 5546) salary>=175000 1 0 Remote (1.00000000 0.00000000) * # 5547) salary< 175000 2 0 Not remote (0.00000000 1.00000000) * # 1387) salary< 131502 3 0 Not remote (0.00000000 1.00000000) * # 347) web_developer>=0.5 33 14 Not remote (0.42424242 0.57575758) # 694) career_satisfaction>=9.5 2 0 Remote (1.00000000 0.00000000) * # 695) career_satisfaction< 9.5 31 12 Not remote (0.38709677 0.61290323) # 1390) company_size_number>=7500 8 3 Remote (0.62500000 0.37500000) # 2780) career_satisfaction< 7.5 3 0 Remote (1.00000000 0.00000000) * # 
2781) career_satisfaction>=7.5 5 2 Not remote (0.40000000 0.60000000) # 5562) mobile_developer< 0.5 3 1 Remote (0.66666667 0.33333333) # 11124) salary>=121500 2 0 Remote (1.00000000 0.00000000) * # 11125) salary< 121500 1 0 Not remote (0.00000000 1.00000000) * # 5563) mobile_developer>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 1391) company_size_number< 7500 23 7 Not remote (0.30434783 0.69565217) # 2782) salary>=128340 18 7 Not remote (0.38888889 0.61111111) # 5564) career_satisfaction>=5 16 7 Not remote (0.43750000 0.56250000) # 11128) company_size_number>=550 4 1 Remote (0.75000000 0.25000000) # 22256) years_coded_job>=14.5 3 0 Remote (1.00000000 0.00000000) * # 22257) years_coded_job< 14.5 1 0 Not remote (0.00000000 1.00000000) * # 11129) company_size_number< 550 12 4 Not remote (0.33333333 0.66666667) # 22258) salary< 157500 8 4 Remote (0.50000000 0.50000000) # 44516) years_coded_job< 16 2 0 Remote (1.00000000 0.00000000) * # 44517) years_coded_job>=16 6 2 Not remote (0.33333333 0.66666667) # 89034) salary>=151500 1 0 Remote (1.00000000 0.00000000) * # 89035) salary< 151500 5 1 Not remote (0.20000000 0.80000000) # 178070) career_satisfaction< 6.5 1 0 Remote (1.00000000 0.00000000) * # 178071) career_satisfaction>=6.5 4 0 Not remote (0.00000000 1.00000000) * # 22259) salary>=157500 4 0 Not remote (0.00000000 1.00000000) * # 5565) career_satisfaction< 5 2 0 Not remote (0.00000000 1.00000000) * # 2783) salary< 128340 5 0 Not remote (0.00000000 1.00000000) * # 87) developer_with_stats_math_background>=0.5 5 0 Not remote (0.00000000 1.00000000) * # 11) years_coded_job< 7.5 79 39 Not remote (0.49367089 0.50632911) # 22) salary< 114250 51 22 Remote (0.56862745 0.43137255) # 44) embedded_developer< 0.5 47 19 Remote (0.59574468 0.40425532) # 88) salary< 90954.55 7 1 Remote (0.85714286 0.14285714) # 176) years_coded_job>=1 6 0 Remote (1.00000000 0.00000000) * # 177) years_coded_job< 1 1 0 Not remote (0.00000000 1.00000000) * # 89) salary>=90954.55 40 18 Remote 
(0.55000000 0.45000000) # 178) hobby>=0.5 33 13 Remote (0.60606061 0.39393939) # 356) salary>=91250 32 12 Remote (0.62500000 0.37500000) # 712) salary< 99150 7 1 Remote (0.85714286 0.14285714) # 1424) data_scientist< 0.5 6 0 Remote (1.00000000 0.00000000) * # 1425) data_scientist>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 713) salary>=99150 25 11 Remote (0.56000000 0.44000000) # 1426) salary>=101000 22 8 Remote (0.63636364 0.36363636) # 2852) salary< 102750 3 0 Remote (1.00000000 0.00000000) * # 2853) salary>=102750 19 8 Remote (0.57894737 0.42105263) # 5706) salary>=112250 2 0 Remote (1.00000000 0.00000000) * # 5707) salary< 112250 17 8 Remote (0.52941176 0.47058824) # 11414) salary< 110500 15 6 Remote (0.60000000 0.40000000) # 22828) company_size_number>=5500 3 0 Remote (1.00000000 0.00000000) * # 22829) company_size_number< 5500 12 6 Remote (0.50000000 0.50000000) # 45658) career_satisfaction< 6.5 2 0 Remote (1.00000000 0.00000000) * # 45659) career_satisfaction>=6.5 10 4 Not remote (0.40000000 0.60000000) # 91318) web_developer>=0.5 7 3 Remote (0.57142857 0.42857143) # 182636) salary>=103500 6 2 Remote (0.66666667 0.33333333) # 365272) salary< 107500 3 0 Remote (1.00000000 0.00000000) * # 365273) salary>=107500 3 1 Not remote (0.33333333 0.66666667) # 730546) career_satisfaction>=8.5 1 0 Remote (1.00000000 0.00000000) * # 730547) career_satisfaction< 8.5 2 0 Not remote (0.00000000 1.00000000) * # 182637) salary< 103500 1 0 Not remote (0.00000000 1.00000000) * # 91319) web_developer< 0.5 3 0 Not remote (0.00000000 1.00000000) * # 11415) salary>=110500 2 0 Not remote (0.00000000 1.00000000) * # 1427) salary< 101000 3 0 Not remote (0.00000000 1.00000000) * # 357) salary< 91250 1 0 Not remote (0.00000000 1.00000000) * # 179) hobby< 0.5 7 2 Not remote (0.28571429 0.71428571) # 358) salary>=98500 3 1 Remote (0.66666667 0.33333333) # 716) company_size_number>=300 2 0 Remote (1.00000000 0.00000000) * # 717) company_size_number< 300 1 0 Not remote (0.00000000 
1.00000000) * # 359) salary< 98500 4 0 Not remote (0.00000000 1.00000000) * # 45) embedded_developer>=0.5 4 1 Not remote (0.25000000 0.75000000) # 90) years_coded_job>=6 1 0 Remote (1.00000000 0.00000000) * # 91) years_coded_job< 6 3 0 Not remote (0.00000000 1.00000000) * # 23) salary>=114250 28 10 Not remote (0.35714286 0.64285714) # 46) salary>=151500 2 0 Remote (1.00000000 0.00000000) * # 47) salary< 151500 26 8 Not remote (0.30769231 0.69230769) # 94) web_developer>=0.5 20 8 Not remote (0.40000000 0.60000000) # 188) company_size_number< 550 6 2 Remote (0.66666667 0.33333333) # 376) years_coded_job>=4.5 5 1 Remote (0.80000000 0.20000000) # 752) years_coded_job< 6.5 3 0 Remote (1.00000000 0.00000000) * # 753) years_coded_job>=6.5 2 1 Remote (0.50000000 0.50000000) # 1506) salary< 122500 1 0 Remote (1.00000000 0.00000000) * # 1507) salary>=122500 1 0 Not remote (0.00000000 1.00000000) * # 377) years_coded_job< 4.5 1 0 Not remote (0.00000000 1.00000000) * # 189) company_size_number>=550 14 4 Not remote (0.28571429 0.71428571) # 378) database_administrator>=0.5 1 0 Remote (1.00000000 0.00000000) * # 379) database_administrator< 0.5 13 3 Not remote (0.23076923 0.76923077) # 758) graphic_designer>=0.5 1 0 Remote (1.00000000 0.00000000) * # 759) graphic_designer< 0.5 12 2 Not remote (0.16666667 0.83333333) # 1518) career_satisfaction< 6.5 4 2 Remote (0.50000000 0.50000000) # 3036) salary>=125000 1 0 Remote (1.00000000 0.00000000) * # 3037) salary< 125000 3 1 Not remote (0.33333333 0.66666667) # 6074) hobby>=0.5 1 0 Remote (1.00000000 0.00000000) * # 6075) hobby< 0.5 2 0 Not remote (0.00000000 1.00000000) * # 1519) career_satisfaction>=6.5 8 0 Not remote (0.00000000 1.00000000) * # 95) web_developer< 0.5 6 0 Not remote (0.00000000 1.00000000) * # 3) salary< 89196.97 535 206 Not remote (0.38504673 0.61495327) # 6) company_size_number< 15 135 59 Remote (0.56296296 0.43703704) # 12) years_coded_job>=16.5 15 1 Remote (0.93333333 0.06666667) # 24) quality_assurance_engineer< 
0.5 14 0 Remote (1.00000000 0.00000000) * # 25) quality_assurance_engineer>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 13) years_coded_job< 16.5 120 58 Remote (0.51666667 0.48333333) # 26) graphic_designer>=0.5 6 0 Remote (1.00000000 0.00000000) * # 27) graphic_designer< 0.5 114 56 Not remote (0.49122807 0.50877193) # 54) salary< 6211.884 14 4 Remote (0.71428571 0.28571429) # 108) salary>=4184.408 7 0 Remote (1.00000000 0.00000000) * # 109) salary< 4184.408 7 3 Not remote (0.42857143 0.57142857) # 218) salary< 2202.32 4 1 Remote (0.75000000 0.25000000) # 436) database_administrator< 0.5 3 0 Remote (1.00000000 0.00000000) * # 437) database_administrator>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 219) salary>=2202.32 3 0 Not remote (0.00000000 1.00000000) * # 55) salary>=6211.884 100 46 Not remote (0.46000000 0.54000000) # 110) years_coded_job>=1.5 82 41 Remote (0.50000000 0.50000000) # 220) mobile_developer< 0.5 63 28 Remote (0.55555556 0.44444444) # 440) developer_with_stats_math_background< 0.5 56 22 Remote (0.60714286 0.39285714) # 880) career_satisfaction>=9.5 10 1 Remote (0.90000000 0.10000000) # 1760) systems_administrator< 0.5 7 0 Remote (1.00000000 0.00000000) * # 1761) systems_administrator>=0.5 3 1 Remote (0.66666667 0.33333333) # 3522) years_coded_job>=2.5 2 0 Remote (1.00000000 0.00000000) * # 3523) years_coded_job< 2.5 1 0 Not remote (0.00000000 1.00000000) * # 881) career_satisfaction< 9.5 46 21 Remote (0.54347826 0.45652174) # 1762) career_satisfaction< 5.5 3 0 Remote (1.00000000 0.00000000) * # 1763) career_satisfaction>=5.5 43 21 Remote (0.51162791 0.48837209) # 3526) systems_administrator>=0.5 9 2 Remote (0.77777778 0.22222222) # 7052) years_coded_job>=3.5 7 0 Remote (1.00000000 0.00000000) * # 7053) years_coded_job< 3.5 2 0 Not remote (0.00000000 1.00000000) * # 3527) systems_administrator< 0.5 34 15 Not remote (0.44117647 0.55882353) # 7054) database_administrator< 0.5 29 14 Remote (0.51724138 0.48275862) # 14108) salary>=25879.5 27 12 
Remote (0.55555556 0.44444444) # 28216) salary< 36939.39 3 0 Remote (1.00000000 0.00000000) * # 28217) salary>=36939.39 24 12 Remote (0.50000000 0.50000000) # 56434) hobby>=0.5 21 9 Remote (0.57142857 0.42857143) # 112868) years_coded_job>=2.5 19 7 Remote (0.63157895 0.36842105) # 225736) embedded_developer< 0.5 17 5 Remote (0.70588235 0.29411765) # 451472) web_developer>=0.5 16 4 Remote (0.75000000 0.25000000) # 902944) career_satisfaction>=6.5 13 2 Remote (0.84615385 0.15384615) # 1805888) company_size_number>=5.5 7 0 Remote (1.00000000 0.00000000) * # 1805889) company_size_number< 5.5 6 2 Remote (0.66666667 0.33333333) # 3611778) salary>=62812.5 3 0 Remote (1.00000000 0.00000000) * # 3611779) salary< 62812.5 3 1 Not remote (0.33333333 0.66666667) # 7223558) dev_ops>=0.5 1 0 Remote (1.00000000 0.00000000) * # 7223559) dev_ops< 0.5 2 0 Not remote (0.00000000 1.00000000) * # 902945) career_satisfaction< 6.5 3 1 Not remote (0.33333333 0.66666667) # 1805890) salary< 51693.55 1 0 Remote (1.00000000 0.00000000) * # 1805891) salary>=51693.55 2 0 Not remote (0.00000000 1.00000000) * # 451473) web_developer< 0.5 1 0 Not remote (0.00000000 1.00000000) * # 225737) embedded_developer>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 112869) years_coded_job< 2.5 2 0 Not remote (0.00000000 1.00000000) * # 56435) hobby< 0.5 3 0 Not remote (0.00000000 1.00000000) * # 14109) salary< 25879.5 2 0 Not remote (0.00000000 1.00000000) * # 7055) database_administrator>=0.5 5 0 Not remote (0.00000000 1.00000000) * # 441) developer_with_stats_math_background>=0.5 7 1 Not remote (0.14285714 0.85714286) # 882) web_developer< 0.5 2 1 Remote (0.50000000 0.50000000) # 1764) salary>=53806.45 1 0 Remote (1.00000000 0.00000000) * # 1765) salary< 53806.45 1 0 Not remote (0.00000000 1.00000000) * # 883) web_developer>=0.5 5 0 Not remote (0.00000000 1.00000000) * # 221) mobile_developer>=0.5 19 6 Not remote (0.31578947 0.68421053) # 442) salary>=65500 2 0 Remote (1.00000000 0.00000000) * # 443) 
salary< 65500 17 4 Not remote (0.23529412 0.76470588) # 886) salary< 43977.27 9 4 Not remote (0.44444444 0.55555556) # 1772) company_size_number< 5.5 6 2 Remote (0.66666667 0.33333333) # 3544) years_coded_job>=5.5 3 0 Remote (1.00000000 0.00000000) * # 3545) years_coded_job< 5.5 3 1 Not remote (0.33333333 0.66666667) # 7090) salary< 16819.58 1 0 Remote (1.00000000 0.00000000) * # 7091) salary>=16819.58 2 0 Not remote (0.00000000 1.00000000) * # 1773) company_size_number>=5.5 3 0 Not remote (0.00000000 1.00000000) * # 887) salary>=43977.27 8 0 Not remote (0.00000000 1.00000000) * # 111) years_coded_job< 1.5 18 5 Not remote (0.27777778 0.72222222) # 222) open_source>=0.5 1 0 Remote (1.00000000 0.00000000) * # 223) open_source< 0.5 17 4 Not remote (0.23529412 0.76470588) # 446) developer_with_stats_math_background>=0.5 1 0 Remote (1.00000000 0.00000000) * # 447) developer_with_stats_math_background< 0.5 16 3 Not remote (0.18750000 0.81250000) # 894) career_satisfaction< 8.5 10 3 Not remote (0.30000000 0.70000000) # 1788) web_developer< 0.5 3 1 Remote (0.66666667 0.33333333) # 3576) salary< 45500 2 0 Remote (1.00000000 0.00000000) * # 3577) salary>=45500 1 0 Not remote (0.00000000 1.00000000) * # 1789) web_developer>=0.5 7 1 Not remote (0.14285714 0.85714286) # 3578) salary>=68500 1 0 Remote (1.00000000 0.00000000) * # 3579) salary< 68500 6 0 Not remote (0.00000000 1.00000000) * # 895) career_satisfaction>=8.5 6 0 Not remote (0.00000000 1.00000000) * # 7) company_size_number>=15 400 130 Not remote (0.32500000 0.67500000) # 14) open_source>=0.5 130 54 Not remote (0.41538462 0.58461538) # 28) developer_with_stats_math_background< 0.5 120 53 Not remote (0.44166667 0.55833333) # 56) salary< 3482.286 3 0 Remote (1.00000000 0.00000000) * # 57) salary>=3482.286 117 50 Not remote (0.42735043 0.57264957) # 114) embedded_developer< 0.5 113 50 Not remote (0.44247788 0.55752212) # 228) web_developer< 0.5 19 7 Remote (0.63157895 0.36842105) # 456) salary>=69257.09 5 0 Remote 
(1.00000000 0.00000000) * # 457) salary< 69257.09 14 7 Remote (0.50000000 0.50000000) # 914) company_size_number>=550 9 3 Remote (0.66666667 0.33333333) # 1828) salary>=24225.52 8 2 Remote (0.75000000 0.25000000) # 3656) years_coded_job< 12.5 7 1 Remote (0.85714286 0.14285714) # 7312) data_scientist< 0.5 5 0 Remote (1.00000000 0.00000000) * # 7313) data_scientist>=0.5 2 1 Remote (0.50000000 0.50000000) # 14626) salary< 42335.7 1 0 Remote (1.00000000 0.00000000) * # 14627) salary>=42335.7 1 0 Not remote (0.00000000 1.00000000) * # 3657) years_coded_job>=12.5 1 0 Not remote (0.00000000 1.00000000) * # 1829) salary< 24225.52 1 0 Not remote (0.00000000 1.00000000) * # 915) company_size_number< 550 5 1 Not remote (0.20000000 0.80000000) # 1830) salary< 35654.64 2 1 Remote (0.50000000 0.50000000) # 3660) salary>=7271.167 1 0 Remote (1.00000000 0.00000000) * # 3661) salary< 7271.167 1 0 Not remote (0.00000000 1.00000000) * # 1831) salary>=35654.64 3 0 Not remote (0.00000000 1.00000000) * # 229) web_developer>=0.5 94 38 Not remote (0.40425532 0.59574468) # 458) salary>=86760.75 2 0 Remote (1.00000000 0.00000000) * # 459) salary< 86760.75 92 36 Not remote (0.39130435 0.60869565) # 918) salary< 80322.58 83 35 Not remote (0.42168675 0.57831325) # 1836) salary>=62951.61 20 8 Remote (0.60000000 0.40000000) # 3672) years_coded_job< 6.5 12 3 Remote (0.75000000 0.25000000) # 7344) data_scientist< 0.5 11 2 Remote (0.81818182 0.18181818) # 14688) years_coded_job>=3.5 5 0 Remote (1.00000000 0.00000000) * # 14689) years_coded_job< 3.5 6 2 Remote (0.66666667 0.33333333) # 29378) salary< 75000 5 1 Remote (0.80000000 0.20000000) # 58756) desktop_applications_developer< 0.5 3 0 Remote (1.00000000 0.00000000) * # 58757) desktop_applications_developer>=0.5 2 1 Remote (0.50000000 0.50000000) # 117514) salary< 64000 1 0 Remote (1.00000000 0.00000000) * # 117515) salary>=64000 1 0 Not remote (0.00000000 1.00000000) * # 29379) salary>=75000 1 0 Not remote (0.00000000 1.00000000) * # 7345) 
data_scientist>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 3673) years_coded_job>=6.5 8 3 Not remote (0.37500000 0.62500000) # 7346) salary>=73742.42 2 0 Remote (1.00000000 0.00000000) * # 7347) salary< 73742.42 6 1 Not remote (0.16666667 0.83333333) # 14694) years_coded_job>=10.5 2 1 Remote (0.50000000 0.50000000) # 29388) salary>=71609.85 1 0 Remote (1.00000000 0.00000000) * # 29389) salary< 71609.85 1 0 Not remote (0.00000000 1.00000000) * # 14695) years_coded_job< 10.5 4 0 Not remote (0.00000000 1.00000000) * # 1837) salary< 62951.61 63 23 Not remote (0.36507937 0.63492063) # 3674) salary< 41612.12 36 17 Not remote (0.47222222 0.52777778) # 7348) systems_administrator< 0.5 32 15 Remote (0.53125000 0.46875000) # 14696) company_size_number>=60 17 6 Remote (0.64705882 0.35294118) # 29392) company_size_number< 3000 11 2 Remote (0.81818182 0.18181818) # 58784) years_coded_job< 5.5 9 1 Remote (0.88888889 0.11111111) # 117568) hobby>=0.5 7 0 Remote (1.00000000 0.00000000) * # 117569) hobby< 0.5 2 1 Remote (0.50000000 0.50000000) # 235138) salary>=7884.305 1 0 Remote (1.00000000 0.00000000) * # 235139) salary< 7884.305 1 0 Not remote (0.00000000 1.00000000) * # 58785) years_coded_job>=5.5 2 1 Remote (0.50000000 0.50000000) # 117570) salary>=22446.61 1 0 Remote (1.00000000 0.00000000) * # 117571) salary< 22446.61 1 0 Not remote (0.00000000 1.00000000) * # 29393) company_size_number>=3000 6 2 Not remote (0.33333333 0.66666667) # 58786) career_satisfaction>=5 3 1 Remote (0.66666667 0.33333333) # 117572) salary< 25356.96 2 0 Remote (1.00000000 0.00000000) * # 117573) salary>=25356.96 1 0 Not remote (0.00000000 1.00000000) * # 58787) career_satisfaction< 5 3 0 Not remote (0.00000000 1.00000000) * # 14697) company_size_number< 60 15 6 Not remote (0.40000000 0.60000000) # 29394) desktop_applications_developer< 0.5 12 6 Remote (0.50000000 0.50000000) # 58788) years_coded_job< 2.5 4 1 Remote (0.75000000 0.25000000) # 117576) salary>=3964.176 3 0 Remote (1.00000000 
0.00000000) * # 117577) salary< 3964.176 1 0 Not remote (0.00000000 1.00000000) * # 58789) years_coded_job>=2.5 8 3 Not remote (0.37500000 0.62500000) # 117578) years_coded_job>=4.5 4 1 Remote (0.75000000 0.25000000) # 235156) salary< 40075.76 3 0 Remote (1.00000000 0.00000000) * # 235157) salary>=40075.76 1 0 Not remote (0.00000000 1.00000000) * # 117579) years_coded_job< 4.5 4 0 Not remote (0.00000000 1.00000000) * # 29395) desktop_applications_developer>=0.5 3 0 Not remote (0.00000000 1.00000000) * # 7349) systems_administrator>=0.5 4 0 Not remote (0.00000000 1.00000000) * # 3675) salary>=41612.12 27 6 Not remote (0.22222222 0.77777778) # 7350) salary>=47943.55 18 6 Not remote (0.33333333 0.66666667) # 14700) company_size_number>=60 10 5 Remote (0.50000000 0.50000000) # 29400) career_satisfaction< 7.5 3 0 Remote (1.00000000 0.00000000) * # 29401) career_satisfaction>=7.5 7 2 Not remote (0.28571429 0.71428571) # 58802) salary< 51537.3 1 0 Remote (1.00000000 0.00000000) * # 58803) salary>=51537.3 6 1 Not remote (0.16666667 0.83333333) # 117606) salary>=61303.03 2 1 Remote (0.50000000 0.50000000) # 235212) salary< 62451.61 1 0 Remote (1.00000000 0.00000000) * # 235213) salary>=62451.61 1 0 Not remote (0.00000000 1.00000000) * # 117607) salary< 61303.03 4 0 Not remote (0.00000000 1.00000000) * # 14701) company_size_number< 60 8 1 Not remote (0.12500000 0.87500000) # 29402) years_coded_job>=13.5 1 0 Remote (1.00000000 0.00000000) * # 29403) years_coded_job< 13.5 7 0 Not remote (0.00000000 1.00000000) * # 7351) salary< 47943.55 9 0 Not remote (0.00000000 1.00000000) * # 919) salary>=80322.58 9 1 Not remote (0.11111111 0.88888889) # 1838) company_size_number< 60 2 1 Remote (0.50000000 0.50000000) # 3676) salary< 84750 1 0 Remote (1.00000000 0.00000000) * # 3677) salary>=84750 1 0 Not remote (0.00000000 1.00000000) * # 1839) company_size_number>=60 7 0 Not remote (0.00000000 1.00000000) * # 115) embedded_developer>=0.5 4 0 Not remote (0.00000000 1.00000000) * # 29) 
developer_with_stats_math_background>=0.5 10 1 Not remote (0.10000000 0.90000000) # 58) data_scientist>=0.5 1 0 Remote (1.00000000 0.00000000) * # 59) data_scientist< 0.5 9 0 Not remote (0.00000000 1.00000000) * # 15) open_source< 0.5 270 76 Not remote (0.28148148 0.71851852) # 30) company_size_number< 60 87 34 Not remote (0.39080460 0.60919540) # 60) salary< 9543.386 7 1 Remote (0.85714286 0.14285714) # 120) years_coded_job< 5.5 5 0 Remote (1.00000000 0.00000000) * # 121) years_coded_job>=5.5 2 1 Remote (0.50000000 0.50000000) # 242) salary>=4881.72 1 0 Remote (1.00000000 0.00000000) * # 243) salary< 4881.72 1 0 Not remote (0.00000000 1.00000000) * # 61) salary>=9543.386 80 28 Not remote (0.35000000 0.65000000) # 122) career_satisfaction>=7.5 48 21 Not remote (0.43750000 0.56250000) # 244) salary>=40806.45 36 17 Remote (0.52777778 0.47222222) # 488) years_coded_job< 14.5 30 12 Remote (0.60000000 0.40000000) # 976) desktop_applications_developer>=0.5 7 1 Remote (0.85714286 0.14285714) # 1952) years_coded_job>=3.5 6 0 Remote (1.00000000 0.00000000) * # 1953) years_coded_job< 3.5 1 0 Not remote (0.00000000 1.00000000) * # 977) desktop_applications_developer< 0.5 23 11 Remote (0.52173913 0.47826087) # 1954) salary< 76500 21 9 Remote (0.57142857 0.42857143) # 3908) hobby< 0.5 5 0 Remote (1.00000000 0.00000000) * # 3909) hobby>=0.5 16 7 Not remote (0.43750000 0.56250000) # 7818) years_coded_job>=2.5 12 5 Remote (0.58333333 0.41666667) # 15636) mobile_developer>=0.5 3 0 Remote (1.00000000 0.00000000) * # 15637) mobile_developer< 0.5 9 4 Not remote (0.44444444 0.55555556) # 31274) years_coded_job>=6.5 4 1 Remote (0.75000000 0.25000000) # 62548) graphic_designer< 0.5 3 0 Remote (1.00000000 0.00000000) * # 62549) graphic_designer>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 31275) years_coded_job< 6.5 5 1 Not remote (0.20000000 0.80000000) # 62550) data_scientist>=0.5 1 0 Remote (1.00000000 0.00000000) * # 62551) data_scientist< 0.5 4 0 Not remote (0.00000000 1.00000000) 
* # 7819) years_coded_job< 2.5 4 0 Not remote (0.00000000 1.00000000) * # 1955) salary>=76500 2 0 Not remote (0.00000000 1.00000000) * # 489) years_coded_job>=14.5 6 1 Not remote (0.16666667 0.83333333) # 978) salary>=81418.01 1 0 Remote (1.00000000 0.00000000) * # 979) salary< 81418.01 5 0 Not remote (0.00000000 1.00000000) * # 245) salary< 40806.45 12 2 Not remote (0.16666667 0.83333333) # 490) years_coded_job>=17.5 1 0 Remote (1.00000000 0.00000000) * # 491) years_coded_job< 17.5 11 1 Not remote (0.09090909 0.90909091) # 982) web_developer< 0.5 1 0 Remote (1.00000000 0.00000000) * # 983) web_developer>=0.5 10 0 Not remote (0.00000000 1.00000000) * # 123) career_satisfaction< 7.5 32 7 Not remote (0.21875000 0.78125000) # 246) salary>=58250 15 5 Not remote (0.33333333 0.66666667) # 492) years_coded_job< 2.5 2 0 Remote (1.00000000 0.00000000) * # 493) years_coded_job>=2.5 13 3 Not remote (0.23076923 0.76923077) # 986) dev_ops>=0.5 1 0 Remote (1.00000000 0.00000000) * # 987) dev_ops< 0.5 12 2 Not remote (0.16666667 0.83333333) # 1974) mobile_developer>=0.5 1 0 Remote (1.00000000 0.00000000) * # 1975) mobile_developer< 0.5 11 1 Not remote (0.09090909 0.90909091) # 3950) salary< 64758.06 3 1 Not remote (0.33333333 0.66666667) # 7900) salary>=61827.96 1 0 Remote (1.00000000 0.00000000) * # 7901) salary< 61827.96 2 0 Not remote (0.00000000 1.00000000) * # 3951) salary>=64758.06 8 0 Not remote (0.00000000 1.00000000) * # 247) salary< 58250 17 2 Not remote (0.11764706 0.88235294) # 494) salary< 38104.84 5 2 Not remote (0.40000000 0.60000000) # 988) years_coded_job< 3.5 3 1 Remote (0.66666667 0.33333333) # 1976) hobby>=0.5 2 0 Remote (1.00000000 0.00000000) * # 1977) hobby< 0.5 1 0 Not remote (0.00000000 1.00000000) * # 989) years_coded_job>=3.5 2 0 Not remote (0.00000000 1.00000000) * # 495) salary>=38104.84 12 0 Not remote (0.00000000 1.00000000) * # 31) company_size_number>=60 183 42 Not remote (0.22950820 0.77049180) # 62) salary>=74983.75 38 15 Not remote (0.39473684 
0.60526316) # 124) career_satisfaction< 7.5 15 6 Remote (0.60000000 0.40000000) # 248) years_coded_job< 8 10 2 Remote (0.80000000 0.20000000) # 496) years_coded_job>=2.5 6 0 Remote (1.00000000 0.00000000) * # 497) years_coded_job< 2.5 4 2 Remote (0.50000000 0.50000000) # 994) company_size_number>=2750 2 0 Remote (1.00000000 0.00000000) * # 995) company_size_number< 2750 2 0 Not remote (0.00000000 1.00000000) * # 249) years_coded_job>=8 5 1 Not remote (0.20000000 0.80000000) # 498) salary>=87250 2 1 Remote (0.50000000 0.50000000) # 996) salary< 88250 1 0 Remote (1.00000000 0.00000000) * # 997) salary>=88250 1 0 Not remote (0.00000000 1.00000000) * # 499) salary< 87250 3 0 Not remote (0.00000000 1.00000000) * # 125) career_satisfaction>=7.5 23 6 Not remote (0.26086957 0.73913043) # 250) salary< 84500 17 6 Not remote (0.35294118 0.64705882) # 500) years_coded_job>=9.5 6 2 Remote (0.66666667 0.33333333) # 1000) company_size_number< 5250 3 0 Remote (1.00000000 0.00000000) * # 1001) company_size_number>=5250 3 1 Not remote (0.33333333 0.66666667) # 2002) years_coded_job< 15 1 0 Remote (1.00000000 0.00000000) * # 2003) years_coded_job>=15 2 0 Not remote (0.00000000 1.00000000) * # 501) years_coded_job< 9.5 11 2 Not remote (0.18181818 0.81818182) # 1002) years_coded_job< 2.5 4 2 Remote (0.50000000 0.50000000) # 2004) salary>=76500 3 1 Remote (0.66666667 0.33333333) # 4008) desktop_applications_developer< 0.5 2 0 Remote (1.00000000 0.00000000) * # 4009) desktop_applications_developer>=0.5 1 0 Not remote (0.00000000 1.00000000) * # 2005) salary< 76500 1 0 Not remote (0.00000000 1.00000000) * # 1003) years_coded_job>=2.5 7 0 Not remote (0.00000000 1.00000000) * # 251) salary>=84500 6 0 Not remote (0.00000000 1.00000000) * # 63) salary< 74983.75 145 27 Not remote (0.18620690 0.81379310) # 126) data_scientist>=0.5 11 4 Not remote (0.36363636 0.63636364) # 252) developer_with_stats_math_background< 0.5 7 3 Remote (0.57142857 0.42857143) # 504) salary< 58064.52 5 1 Remote 
(0.80000000 0.20000000) # 1008) salary>=8368.815 3 0 Remote (1.00000000 0.00000000) * # 1009) salary< 8368.815 2 1 Remote (0.50000000 0.50000000) # 2018) salary< 4698.282 1 0 Remote (1.00000000 0.00000000) * # 2019) salary>=4698.282 1 0 Not remote (0.00000000 1.00000000) * # 505) salary>=58064.52 2 0 Not remote (0.00000000 1.00000000) * # 253) developer_with_stats_math_background>=0.5 4 0 Not remote (0.00000000 1.00000000) * # 127) data_scientist< 0.5 134 23 Not remote (0.17164179 0.82835821) # 254) graphic_designer>=0.5 2 1 Remote (0.50000000 0.50000000) # 508) salary>=61750 1 0 Remote (1.00000000 0.00000000) * # 509) salary< 61750 1 0 Not remote (0.00000000 1.00000000) * # 255) graphic_designer< 0.5 132 22 Not remote (0.16666667 0.83333333) # 510) salary< 62970.43 104 20 Not remote (0.19230769 0.80769231) # 1020) salary>=53437.5 26 9 Not remote (0.34615385 0.65384615) # 2040) years_coded_job>=1.5 22 9 Not remote (0.40909091 0.59090909) # 4080) mobile_developer< 0.5 19 9 Not remote (0.47368421 0.52631579) # 8160) dev_ops< 0.5 17 8 Remote (0.52941176 0.47058824) # 16320) salary< 56125 4 0 Remote (1.00000000 0.00000000) * # 16321) salary>=56125 13 5 Not remote (0.38461538 0.61538462) # 32642) career_satisfaction>=7.5 8 3 Remote (0.62500000 0.37500000) # 65284) years_coded_job< 5 3 0 Remote (1.00000000 0.00000000) * # 65285) years_coded_job>=5 5 2 Not remote (0.40000000 0.60000000) # 130570) salary>=59569.89 3 1 Remote (0.66666667 0.33333333) # 261140) salary< 61875 2 0 Remote (1.00000000 0.00000000) * # 261141) salary>=61875 1 0 Not remote (0.00000000 1.00000000) * # 130571) salary< 59569.89 2 0 Not remote (0.00000000 1.00000000) * # 32643) career_satisfaction< 7.5 5 0 Not remote (0.00000000 1.00000000) * # 8161) dev_ops>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 4081) mobile_developer>=0.5 3 0 Not remote (0.00000000 1.00000000) * # 2041) years_coded_job< 1.5 4 0 Not remote (0.00000000 1.00000000) * # 1021) salary< 53437.5 78 11 Not remote (0.14102564 
0.85897436) # 2042) career_satisfaction>=9.5 9 3 Not remote (0.33333333 0.66666667) # 4084) company_size_number< 300 3 0 Remote (1.00000000 0.00000000) * # 4085) company_size_number>=300 6 0 Not remote (0.00000000 1.00000000) * # 2043) career_satisfaction< 9.5 69 8 Not remote (0.11594203 0.88405797) # 4086) salary< 30287.5 32 6 Not remote (0.18750000 0.81250000) # 8172) salary>=28881.05 1 0 Remote (1.00000000 0.00000000) * # 8173) salary< 28881.05 31 5 Not remote (0.16129032 0.83870968) # 16346) quality_assurance_engineer>=0.5 1 0 Remote (1.00000000 0.00000000) * # 16347) quality_assurance_engineer< 0.5 30 4 Not remote (0.13333333 0.86666667) # 32694) salary>=9837.028 18 4 Not remote (0.22222222 0.77777778) # 65388) salary< 11892.53 1 0 Remote (1.00000000 0.00000000) * # 65389) salary>=11892.53 17 3 Not remote (0.17647059 0.82352941) # 130778) developer_with_stats_math_background>=0.5 1 0 Remote (1.00000000 0.00000000) * # 130779) developer_with_stats_math_background< 0.5 16 2 Not remote (0.12500000 0.87500000) # 261558) career_satisfaction>=7.5 6 2 Not remote (0.33333333 0.66666667) # 523116) salary>=14315.08 3 1 Remote (0.66666667 0.33333333) # 1046232) years_coded_job>=2.5 2 0 Remote (1.00000000 0.00000000) * # 1046233) years_coded_job< 2.5 1 0 Not remote (0.00000000 1.00000000) * # 523117) salary< 14315.08 3 0 Not remote (0.00000000 1.00000000) * # 261559) career_satisfaction< 7.5 10 0 Not remote (0.00000000 1.00000000) * # 32695) salary< 9837.028 12 0 Not remote (0.00000000 1.00000000) * # 4087) salary>=30287.5 37 2 Not remote (0.05405405 0.94594595) # 8174) company_size_number>=3000 13 2 Not remote (0.15384615 0.84615385) # 16348) company_size_number< 7500 1 0 Remote (1.00000000 0.00000000) * # 16349) company_size_number>=7500 12 1 Not remote (0.08333333 0.91666667) # 32698) years_coded_job< 2.5 3 1 Not remote (0.33333333 0.66666667) # 65396) hobby< 0.5 1 0 Remote (1.00000000 0.00000000) * # 65397) hobby>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 32699) 
years_coded_job>=2.5 9 0 Not remote (0.00000000 1.00000000) * # 8175) company_size_number< 3000 24 0 Not remote (0.00000000 1.00000000) * # 511) salary>=62970.43 28 2 Not remote (0.07142857 0.92857143) # 1022) company_size_number>=7500 9 2 Not remote (0.22222222 0.77777778) # 2044) years_coded_job>=2.5 5 2 Not remote (0.40000000 0.60000000) # 4088) dev_ops< 0.5 3 1 Remote (0.66666667 0.33333333) # 8176) salary>=66875 2 0 Remote (1.00000000 0.00000000) * # 8177) salary< 66875 1 0 Not remote (0.00000000 1.00000000) * # 4089) dev_ops>=0.5 2 0 Not remote (0.00000000 1.00000000) * # 2045) years_coded_job< 2.5 4 0 Not remote (0.00000000 1.00000000) * # 1023) company_size_number< 7500 19 0 Not remote (0.00000000 1.00000000) * ``` .footnote[* see your `04-helpers.R` script] --- class: your-turn # Your turn 3 Let's combine bootstrapping with decision trees. Do **Round 1** on your handouts.
05:00
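
A minimal sketch of what Round 1 does in code, assuming the built-in `iris` data and the **rpart** package as stand-ins (neither is the workshop's Stack Overflow data or its helper code): draw bootstrap samples, then fit one tree to each.

```r
# Illustrative only: `iris` and rpart stand in for the workshop data and specs
library(rpart)

set.seed(100)
n <- nrow(iris)

# Draw 3 bootstrap samples (rows sampled with replacement) and fit a tree to each
boot_trees <- lapply(1:3, function(i) {
  rows <- sample(n, size = n, replace = TRUE)
  rpart(Species ~ ., data = iris[rows, ], method = "class")
})

# One fitted tree per bootstrap sample
length(boot_trees)
```

Because each tree sees a different resample, the fitted trees can differ in their splits, which is the variability the next round puts to use.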
---
exclude: true

---
class: middle

# The trouble with trees?

<img src="figs/04-ensemble/unnamed-chunk-38-1.png" width="33%" style="display: block; margin: auto;" /><img src="figs/04-ensemble/unnamed-chunk-38-2.png" width="33%" style="display: block; margin: auto;" /><img src="figs/04-ensemble/unnamed-chunk-38-3.png" width="33%" style="display: block; margin: auto;" />

---
class: your-turn

# Your turn 4

Now, let's add the aggregating part.

Do **Round 2** on your handouts.
05:00
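
The aggregating step can be sketched as a majority vote. The `votes` matrix below is made up for illustration (it is not output from any model in these slides); each row is a new observation and each column is one bootstrapped tree's prediction.

```r
# Hypothetical predictions: rows = 5 new observations, columns = 3 trees
votes <- matrix(
  c("Remote",     "Remote",     "Not remote",
    "Not remote", "Not remote", "Not remote",
    "Remote",     "Not remote", "Not remote",
    "Remote",     "Remote",     "Remote",
    "Not remote", "Remote",     "Remote"),
  ncol = 3, byrow = TRUE
)

# Majority vote per row: the most frequent class across trees wins
majority_vote <- function(x) names(which.max(table(x)))
ensemble_pred <- apply(votes, 1, majority_vote)
ensemble_pred
```

Using an odd number of trees with two classes avoids ties in this toy example.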
---
class: middle, center

# Your first ensemble!

<img src="images/orchestra.jpg" width="25%" style="display: block; margin: auto;" />

---
class: middle, frame, center

# Axiom

There is an inverse relationship between model *accuracy* and model *interpretability*.

---
class: middle, center

# `rand_forest()`

Specifies a random forest model

```r
rand_forest(mtry = 4, trees = 500, min_n = 1)
```

--

*either* mode works!

---
class: middle

.center[

# `rand_forest()`

Specifies a random forest model

]

```r
rand_forest(
  mtry = 4,    # predictors seen at each node
  trees = 500, # trees per forest
  min_n = 1    # smallest node allowed
)
```

---
class: your-turn

# Your turn 5

Create a new model spec called `rf_spec`, which will learn an ensemble of classification trees from our training data using the **ranger** package.

Compare the metrics of the random forest to your two single tree models (vanilla and big). Which predicts the test set better?

*Hint: you'll need https://tidymodels.github.io/parsnip/articles/articles/Models.html*
05:00
---

```r
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

set.seed(100)
rf_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.629    10  0.0214
# 2 roc_auc  binary     0.681    10  0.0202
```

---

.pull-left[
### Vanilla Decision Tree

```
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.643    10  0.0165
# 2 roc_auc  binary     0.668    10  0.0170
```

### Big Decision Tree

```
# # A tibble: 2 x 3
#   .metric  .estimator .estimate
#   <chr>    <chr>          <dbl>
# 1 accuracy binary         0.605
# 2 roc_auc  binary         0.605
```
]

.pull-right[
### Random Forest

```
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.640    10  0.0196
# 2 roc_auc  binary     0.681    10  0.0208
```
]

---
class: middle, center

`mtry`

The number of predictors that will be randomly sampled at each split when creating the tree models.

```r
rand_forest(mtry = 4)
```

**ranger** default = `floor(sqrt(num_predictors))`

---
class: your-turn

# Your turn 6

Challenge: Make 4 more random forest model specs that use 4, 8, 12, and 19 variables at each split, respectively. Which value maximizes the area under the ROC curve?

*Hint: you'll need https://tidymodels.github.io/parsnip/reference/rand_forest.html*
04:00
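
As a quick worked example of the **ranger** default, assuming the 19 predictors that the ranger printouts in this deck report ("Number of independent variables: 19"):

```r
# Default mtry for ranger: floor(sqrt(number of predictors))
num_predictors <- 19
default_mtry <- floor(sqrt(num_predictors))
default_mtry
```

Under that assumption the default works out to 4, which is consistent with the `mtry = 4` spec reporting the same resampled metrics as the untuned `rf_spec`. At the other extreme, `mtry = 19` lets every split consider all predictors, which is bagging.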
---

```r
rf4_spec <- rf_spec %>%
* set_args(mtry = 4)

set.seed(100)
rf4_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.629    10  0.0214
# 2 roc_auc  binary     0.681    10  0.0202
```

---

```r
rf8_spec <- rf_spec %>%
* set_args(mtry = 8)

set.seed(100)
rf8_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.633    10  0.0233
# 2 roc_auc  binary     0.671    10  0.0209
```

---

```r
rf12_spec <- rf_spec %>%
* set_args(mtry = 12)

set.seed(100)
rf12_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.630    10  0.0229
# 2 roc_auc  binary     0.664    10  0.0214
```

---

```r
rf19_spec <- rf_spec %>%
* set_args(mtry = 19)

set.seed(100)
rf19_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.628    10  0.0191
# 2 roc_auc  binary     0.659    10  0.0214
```

---
class: middle, center

<img src="figs/04-ensemble/unnamed-chunk-55-1.png" width="100%" style="display: block; margin: auto;" />

---

```r
treebag_spec <-
* rand_forest(mtry = .preds()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

set.seed(100)
treebag_spec %>%
  fit_resamples(remote ~ ., resamples = so_folds) %>%
  collect_metrics()
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.628    10  0.0191
# 2 roc_auc  binary     0.659    10  0.0214
```

---
class: center, middle

# `.preds()`

The number of columns in the data set that are associated with the predictors prior to dummy variable creation.
```r
rand_forest(mtry = .preds())
```

--

<https://tidymodels.github.io/parsnip/reference/descriptors.html>

---

.pull-left[
### Vanilla Decision Tree

```
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.643    10  0.0165
# 2 roc_auc  binary     0.668    10  0.0170
```

### Big Decision Tree

```
# # A tibble: 2 x 3
#   .metric  .estimator .estimate
#   <chr>    <chr>          <dbl>
# 1 accuracy binary         0.605
# 2 roc_auc  binary         0.605
```
]

.pull-right[
### Random Forest

```
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.640    10  0.0196
# 2 roc_auc  binary     0.681    10  0.0208
```

### Bagging

```
# # A tibble: 2 x 5
#   .metric  .estimator  mean     n std_err
#   <chr>    <chr>      <dbl> <int>   <dbl>
# 1 accuracy binary     0.627    10  0.0228
# 2 roc_auc  binary     0.656    10  0.0225
```
]

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

.fade[
1\. Pick a .display[model]
]

2\. Set the .display[engine]

.fade[
3\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# `set_engine()`

Adds to a model the R package that will train it.

```r
spec %>% set_engine(engine = "ranger", ...)
```

---
class: middle

.center[

# `set_engine()`

Adds to a model the R package that will train it.

]

```r
spec %>% set_engine(
  engine = "ranger", # package name in quotes
  ...                # optional arguments to pass to function
)
```

---
class: middle

.center[
.fade[

# `set_engine()`

Adds to a model the R package that will train it.
]
]

```r
rf_imp_spec <-
  rand_forest(mtry = 4) %>%
  set_engine("ranger", importance = 'impurity') %>%
  set_mode("classification")
```

---

```r
rf_imp_spec <-
  rand_forest(mtry = 4) %>%
  set_engine("ranger", importance = 'impurity') %>%
  set_mode("classification")

imp_fit <- rf_imp_spec %>%
  last_fit(remote ~ ., split = so_split)

imp_fit
# # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
# # A tibble: 1 x 6
#   splits    id      .metrics   .notes  .predictions .workflow
#   <list>    <chr>   <list>     <list>  <list>       <list>
# 1 <split …  train/… <tibble [… <tibbl… <tibble [28… <workflo…
```

---
class: middle

.center[

# `get_tree_fit()`

Gets the parsnip model object from the output of `last_fit()`

]

```r
get_tree_fit(imp_fit)
```

.footnote[in your `04-helpers.R` script]

---

```r
get_tree_fit(imp_fit)
# parsnip model object
#
# Fit time: 284ms
# Ranger result
#
# Call:
#  ranger::ranger(formula = formula, data = data, mtry = ~4, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#
# Type:                             Probability estimation
# Number of trees:                  500
# Sample size:                      864
# Number of independent variables:  19
# Mtry:                             4
# Target node size:                 10
# Variable importance mode:         impurity
# Splitrule:                        gini
# OOB prediction error (Brier s.):  0.2224028
```

---
class: middle, center

# `vip`

Plot variable importance.

<iframe src="https://koalaverse.github.io/vip/index.html" width="504" height="400px"></iframe>

---
class: middle, center

# `vip()`

Plot variable importance scores for the predictors in a model.

```r
vip(object, geom = "point", ...)
```

---
class: middle

.center[

# `vip()`

Plot variable importance scores for the predictors in a model.

]

```r
vip(
  object,       # fitted model object
  geom = "col", # one of "col", "point", "boxplot", "violin"
  ...
)
```

---

```r
imp_plot <- get_tree_fit(imp_fit)
vip::vip(imp_plot, geom = "point")
```

<img src="figs/04-ensemble/unnamed-chunk-71-1.png" width="504" style="display: block; margin: auto;" />

---
class: your-turn

# Your turn 7

Make a new model spec called `treebag_imp_spec` to fit a bagged classification tree model. Set the variable `importance` mode to "permutation". Plot the variable importance. Which variable was the most important?
03:00
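
To build intuition for the "permutation" mode, here is a hand-rolled sketch of the idea: shuffle one predictor at a time and see how much the error grows. It assumes the built-in `mtcars` data and a plain `lm()` regression purely for illustration. **ranger** does this internally on out-of-bag observations, so this is not its implementation, and `rmse()` is a helper defined only for this example.

```r
# Illustrative only: permutation importance by hand on `mtcars` with lm()
set.seed(100)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Helper for this sketch: root mean squared error of a model on a data frame
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, data))^2))
}
baseline <- rmse(fit, mtcars)

# Shuffle one predictor at a time and record how much the error grows
perm_importance <- sapply(c("wt", "hp", "qsec"), function(var) {
  shuffled <- mtcars
  shuffled[[var]] <- sample(shuffled[[var]])
  rmse(fit, shuffled) - baseline
})

sort(perm_importance, decreasing = TRUE)
```

A predictor whose shuffling barely changes the error contributes little to the predictions; a large increase marks an important one.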
---
class: middle

```r
treebag_imp_spec <-
  rand_forest(mtry = .preds()) %>%
  set_engine("ranger", importance = 'permutation') %>%
  set_mode("classification")

treebag_wf <- workflow() %>%
  add_formula(remote ~ .) %>%
  add_model(treebag_imp_spec)

imp_fit <- treebag_wf %>%
  last_fit(split = so_split)

imp_plot <- get_tree_fit(imp_fit)
imp_plot
```

---

```
# parsnip model object
#
# Fit time: 913ms
# Ranger result
#
# Call:
#  ranger::ranger(formula = formula, data = data, mtry = ~.preds(), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#
# Type:                             Probability estimation
# Number of trees:                  500
# Sample size:                      864
# Number of independent variables:  19
# Mtry:                             19
# Target node size:                 10
# Variable importance mode:         permutation
# Splitrule:                        gini
# OOB prediction error (Brier s.):  0.2354235
```