Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

To cite this case study please use:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). Predicting Annual Air Pollution (Version v1.0.0). https://github.com/opencasestudies/ocs-bp-air-pollution.

To access the GitHub Repository for this case study see here: https://github.com/opencasestudies/ocs-bp-air-pollution.

You may also access and download the data using our OCSdata package. To learn more about this package, including examples, see this link. Here is how you would install this package:

install.packages("OCSdata")

This case study is part of a series of public health case studies for the Bloomberg American Health Initiative.


The total reading time for this case study is calculated via koRpus and shown below:

Method: koRpus
Reading Time: 107 minutes

Readability Score:

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via koRpus. These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.

Text language: en 
index grade age
Flesch-Kincaid 10 15
FORCAST 10 15
SMOG 13 18

Please help us by filling out our survey.

Motivation


A variety of different sources contribute different types of pollutants to what we call air pollution.

Some sources are natural while others are anthropogenic (human derived):

Major types of air pollutants

  1. Gaseous - Carbon Monoxide (CO), Ozone (O3), Nitrogen Oxides (NO, NO2), Sulfur Dioxide (SO2)
  2. Particulate - small liquids and solids suspended in the air (includes lead; can include certain types of dust)
  3. Dust - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
  4. Biological - pollen, bacteria, viruses, mold spores

See here for more detail on the types of pollutants in the air.

Particulate pollution

Air pollution particulates are generally described by their size.

There are 3 major categories:

  1. Large Coarse Particulate Matter - has diameter of >10 micrometers (10 µm)

  2. Coarse Particulate Matter (called PM10-2.5) - has diameter of between 2.5 µm and 10 µm

  3. Fine Particulate Matter (called PM2.5) - has diameter of < 2.5 µm

PM10 includes any particulate matter <10 µm (both coarse and fine particulate matter)
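The size cutoffs above can be sketched as a small helper function (the function name and example diameters here are illustrative, not part of the case study data):

```r
# Categorize particulate matter by aerodynamic diameter in micrometers (um),
# using the cutoffs described above: fine < 2.5 um, coarse 2.5-10 um,
# large coarse > 10 um
pm_size_category <- function(diameter_um) {
  cut(diameter_um,
      breaks = c(0, 2.5, 10, Inf),
      labels = c("Fine (PM2.5)", "Coarse (PM10-2.5)", "Large coarse"),
      right  = FALSE)
}

pm_size_category(c(1, 5, 20))
# Fine (PM2.5)  Coarse (PM10-2.5)  Large coarse
```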

Here you can see how these sizes compare with a human hair:

[source]

The following plot shows the relative sizes of these different pollutants in micrometers (µm):

[source]

This table shows how deeply some of the smaller fine particles can penetrate within the human body:

Negative impact of particulate exposure on health

Exposure to air pollution is associated with higher rates of mortality in older adults and is known to be a risk factor for many diseases and conditions including but not limited to:

  1. Asthma - fine particle exposure (PM2.5) was found to be associated with higher rates of asthma in children
  2. Inflammation in type 1 diabetes - fine particle exposure (PM2.5) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
  3. Lung function and emphysema - higher concentrations of ozone (O3), nitrogen oxides (NOx), black carbon, and fine particles (PM2.5) at study baseline were significantly associated with greater increases in percent emphysema per 10 years
  4. Low birthweight - fine particle exposure (PM2.5) was associated with lower birth weight in full-term live births
  5. Viral Infection - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (PM2.5)

See this review article for more information about sources of air pollution and the influence of air pollution on health.

Sparse monitoring is problematic for Public Health

Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country.

However, as can be seen in the following figure, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources. We will see later when we evaluate the data, that even in certain relatively large cities there is only one monitor!

Furthermore, dramatic differences in pollution rates can be seen even within the same city. In fact, the term micro-environments describes environments within cities or counties which may vary greatly from one block to another.

[source]

This lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.

Machine learning offers a solution

An article published in the Environmental Health journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods.

Yanosky, J. D. et al. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. Environ Health 13, 63 (2014).

The authors of this article state that:

“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”

[source]

The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems. We will use similar methods to predict annual air pollution levels spatially within the US.

Main Question


Our main question:

  1. Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Learning Objectives


In this case study, we will walk you through importing data from CSV files and performing machine learning methods to predict our outcome variable of interest (in this case annual fine particle air pollution estimates).

We will especially focus on using packages and functions from the tidyverse, and more specifically the tidymodels package/ecosystem primarily developed and maintained by Max Kuhn and Davis Vaughan. Loading this package also loads other modeling-related packages, including rsample, recipes, parsnip, yardstick, workflows, and tune.

The tidyverse is a collection of packages created by RStudio. While some students may be familiar with base R or older R packages, the tidyverse packages make data science in R especially legible and intuitive.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Familiarity with the tidymodels ecosystem.
  2. Ability to evaluate correlation among predictor variables (corrplot and GGally).
  3. Ability to implement tidymodels packages such as rsample to split the data into training and testing sets as well as cross validation sets.
  4. Ability to use the recipes, parsnip, and workflows to train and test a linear regression model and random forest model.
  5. Demonstrate how to visualize geo-spatial data using ggplot2.

Statistical Learning Objectives:

  1. Basic understanding of the utility of machine learning for prediction and classification
  2. Understanding of the need for training and test sets
  3. Understanding of the utility of cross validation
  4. Understanding of random forest
  5. How to interpret root mean squared error (rmse) to assess performance for prediction

We will begin by loading the packages that we will need:

# Load packages for data import and data wrangling
library(here)
library(readr)
library(dplyr)
library(skimr)
library(summarytools)
library(magrittr)
# Load packages for making correlation plots
library(corrplot)
library(RColorBrewer)
library(GGally)
# Load packages for building machine learning algorithm
library(tidymodels)
library(workflows)
library(vip)
library(tune)
library(randomForest)
library(doParallel)
# Load packages for data visualization/creating map
library(ggplot2)
library(stringr)
library(tidyr)
library(lwgeom)
library(proxy) # needed for lwgeom
library(sf)
library(maps)
library(rnaturalearth)
library(rnaturalearthdata) # needed for rnaturalearth
library(rgeos)
library(patchwork)
# Load package for downloading the case study data files
library(OCSdata)

Packages used in this case study:

Package Use in this case study
here to easily load and save data
readr to import CSV files
dplyr to view/arrange/filter/select/compare specific subsets of data
skimr to get an overview of data
summarytools to get an overview of data in a different style
magrittr to use the %<>% piping operator
corrplot to make large correlation plots
GGally to make smaller correlation plots
tidymodels to load in a set of packages (broom, dials, infer, parsnip, purrr, recipes, rsample, tibble, yardstick)
rsample to split the data into testing and training sets; to split the training set for cross-validation
recipes to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are recipe(), prep(), and various transformation step_*() functions, as well as bake(), which extracts pre-processed training data (this used to require juice()) and applies recipe preprocessing steps to testing data). See here for more info.
parsnip an interface to create models (major functions are fit(), set_engine())
yardstick to evaluate the performance of models
broom to get tidy output for our model fit and performance
ggplot2 to make visualizations with multiple layers
dials to specify hyper-parameter tuning
tune to perform cross validation, tune hyper-parameters, and get performance metrics
workflows to create modeling workflow to streamline the modeling process
vip to create variable importance plots
randomForest to perform the random forest analysis
doParallel to fit cross validation samples in parallel
stringr to manipulate the text in the map data
tidyr to separate data within a column into multiple columns
rnaturalearth to get the geometry data for the earth to plot the US
maps to get map database data about counties to draw them on our US map
sf to convert the map data into a data frame
lwgeom to use the sf function to convert map geographical data
rgeos to use geometry data
patchwork to allow plots to be combined
OCSdata to access and download OCS data files

The first time we use a function, we will use :: to indicate which package it comes from. Unless we have overlapping function names between loaded packages, this is not necessary, but we will include it here to be informative about where the functions we use come from.
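For example, both of the following calls run the same function; the pkg::function() form simply makes the source package explicit (shown here with the built-in mtcars data):

```r
library(dplyr)

# Explicit namespace: works even if dplyr is installed but not attached
dplyr::glimpse(mtcars)

# Bare function name: works because library(dplyr) attached the package
glimpse(mtcars)
```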

Context


The State of Global Air is a report released every year to communicate the impact of air pollution on public health.

The State of Global Air 2019 report which uses data from 2017 stated that:

Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.

[source]

The report also stated that:

In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.

[source]

The State of Global Air 2018 report using data from 2016 which separated different types of air pollution, found that particulate pollution was particularly associated with mortality.

[source]

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

[source]

Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 Our Nation’s Air report from the US Environmental Protection Agency (EPA):

[source]

However, air pollution continues to contribute to health risk for Americans, in particular in regions with higher than national average rates of pollution that, at times, exceed the WHO’s recommended level. Thus, it is important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

You can see the current air quality conditions at this website, and you will notice variation across different cities.

For example, here are the conditions in Topeka, Kansas at the time this case study was created:

[source]

It reports particulate values using what is called the Air Quality Index (AQI). This calculator indicates that 114 AQI is equivalent to 40.7 ug/m3 and is considered unhealthy for sensitive individuals. Thus, some areas exceed the WHO annual exposure guideline (10 ug/m3), and this may adversely affect the health of people living in these locations.

Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines. Also, it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. For example, see this article for more details.

The monitor data that we will use in this case study come from a system of monitors in which roughly 90% are located within cities. Hence, there is an equity issue in terms of capturing the air pollution levels of more rural areas. To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in areas with little to no monitoring. Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:

[source]

This is what we aim to achieve in this case study.

Limitations


There are some important considerations regarding the data analysis in this case study to keep in mind:

  1. The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.

  2. Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution levels on the personal level.

  3. Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day and even hour. There are data sources that have finer levels of temporal data; however, we are interested in long term exposures, as these appear to be the most influential for health outcomes.

What are the data?


We are going to perform a type of machine learning called supervised machine learning to try to predict air pollution levels. This type of machine learning requires that we have real observed values of an outcome (in our case, air pollution) to guide, or supervise, our work. This will ultimately allow us to predict air pollution values in situations where we don’t have them. For more explanation of supervised machine learning (and how it compares to a different kind of machine learning called unsupervised machine learning, where we don’t have outcome values), refer to this.

When using supervised machine learning for prediction, there are two main types of data of interest:

  1. A continuous outcome variable that we want to predict
  2. A set of features (or predictor variables) that we use to predict the outcome variable

The outcome variable is what we are trying to predict. To build (or train) our model, we use both the outcome and features. The goal is to identify informative features that can explain a large amount of variation in our outcome variable. Using this model, we can then predict the outcome for new observations with the same features for which we have not observed the outcome.

As a simple example, imagine that we have data about the sales and characteristics of cars from last year and we want to predict which cars might sell well this year. We do not have the sales data yet for this year, but we do know the characteristics of our cars for this year. We can build a model of the characteristics that explained sales last year to estimate what cars might sell well this year. In this case, our outcome variable is the sales of cars, while the different characteristics of the cars make up our features.
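The car-sales example above can be sketched in R with simulated data (all variable names and numbers here are made up purely for illustration; this is not the case study's model):

```r
set.seed(2020)

# Hypothetical training data: last year's cars, with observed sales (outcome)
last_year <- data.frame(
  horsepower = runif(50, 80, 300),
  price      = runif(50, 15, 60)
)
last_year$sales <- 500 - 5 * last_year$price +
  0.5 * last_year$horsepower + rnorm(50, sd = 10)

# Train a model of the outcome (sales) on the features
fit <- lm(sales ~ horsepower + price, data = last_year)

# This year's cars: features are known, but sales are not yet observed
this_year <- data.frame(horsepower = c(150, 250), price = c(25, 45))
predict(fit, newdata = this_year)
```

The same logic carries over to the case study: the monitor PM2.5 values play the role of sales, and the population, road, and satellite variables play the role of the car characteristics.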

Start with a question


This is the most commonly missed step when developing a machine learning algorithm. Machine learning can very easily be turned into an engineering problem: just dump the outcome and the features into a black box algorithm and voilà! But this kind of thinking can lead to major problems. In general, good machine learning questions:

  1. Have a plausible explanation for why the features predict the outcome.
  2. Consider potential variation in both the features and the outcome over time.
  3. Are consistently re-evaluated on criteria 1 and 2 over time.

In this case study, we want to predict air pollution levels. To build this machine learning algorithm, our outcome variable is average annual fine particulate matter (PM2.5) captured from air pollution monitors in the contiguous US in 2008. Our features (or predictor variables) include data about population density, road density, urbanization levels, and NASA satellite data.

Our outcome variable


The monitor data that we will be using comes from gravimetric monitors (see picture below) operated by the US Environmental Protection Agency (EPA).

[image courtesy of Kirsten Koehler]

These monitors use a filtration system to specifically capture fine particulate matter.

[source]

The weight of this particulate matter is manually measured daily or weekly. For the EPA standard operating procedure for PM gravimetric analysis in 2008, we refer the reader to here.

For more on Gravimetric analysis, you can expand here

Gravimetric analysis is also used for emission testing. The same idea applies: a fresh filter is applied and the desired amount of time passes, then the filter is removed and weighed.

There are other monitoring systems that can provide hourly measurements, but we will not be using data from these monitors in our analysis. Gravimetric analysis is considered to be among the most accurate methods for measuring particulate matter.

In our data set, the value column indicates the PM2.5 monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors. The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m3). Recall the WHO exposure guideline is < 10 ug/m3 on average annually for PM2.5.

Our features (predictor variables)


There are 48 features with values for each of the 876 monitors (observations). The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

Click here to see a table about the set of features
Variable Details
id Monitor number
– the county number is indicated before the decimal
– the monitor number is indicated after the decimal
Example: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors
fips Federal information processing standard number for the county where the monitor is located
– 5 digit id code for counties (zero is often the first value and sometimes is not shown)
– the first 2 numbers indicate the state
– the last three numbers indicate the county
Example: Alabama’s state code is 01 because it is first alphabetically
(note: Alaska and Hawaii are not included because they are not part of the contiguous US)
Lat Latitude of the monitor in degrees
Lon Longitude of the monitor in degrees
state State where the monitor is located
county County where the monitor is located
city City where the monitor is located
CMAQ Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
– A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict air pollution
– Does not use any of the PM2.5 gravimetric monitoring data. (There is a version that does use the gravimetric monitoring data, but not this one!)
– Data from the EPA
zcta Zip Code Tabulation Area where the monitor is located
– Postal Zip codes are converted into “generalized areal representations” that are non-overlapping
– Data from the 2010 Census
zcta_area Land area of the zip code area in meters squared
– Data from the 2010 Census
zcta_pop Population in the zip code area
– Data from the 2010 Census
imp_a500 Impervious surface measure
– Within a circle with a radius of 500 meters around the monitor
– Impervious surface are roads, concrete, parking lots, buildings
– This is a measure of development
imp_a1000 Impervious surface measure
– Within a circle with a radius of 1000 meters around the monitor
imp_a5000 Impervious surface measure
– Within a circle with a radius of 5000 meters around the monitor
imp_a10000 Impervious surface measure
– Within a circle with a radius of 10000 meters around the monitor
imp_a15000 Impervious surface measure
– Within a circle with a radius of 15000 meters around the monitor
county_area Land area of the county of the monitor in meters squared
county_pop Population of the county of the monitor
Log_dist_to_prisec Log (Natural log) distance to a primary or secondary road from the monitor
– Highway or major road
log_pri_length_5000 Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highways only
log_pri_length_10000 Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highways only
log_pri_length_15000 Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highways only
log_pri_length_25000 Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highways only
log_prisec_length_500 Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_1000 Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_5000 Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_10000 Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_15000 Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_25000 Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highway and secondary roads
log_nei_2008_pm25_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
popdens_county Population density (number of people per kilometer squared area of the county)
popdens_zcta Population density (number of people per kilometer squared area of zcta)
nohs Percentage of people in the zcta area where the monitor is located that do not have a high school degree
– Data from the Census
somehs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education
– Data from the Census
hs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree
– Data from the Census
somecollege Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education
– Data from the Census
associate Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree
– Data from the Census
bachelor Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor’s degree
– Data from the Census
grad Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree
– Data from the Census
pov Percentage of people in the zcta area where the monitor is located that lived in poverty in 2008
– Data from the Census
hs_orless Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
urc2013 2013 Urban-rural classification of the county where the monitor is located
– 6 category variable - 1 is totally urban 6 is completely rural
– Data from the National Center for Health Statistics
urc2006 2006 Urban-rural classification of the county where the monitor is located
– 6 category variable - 1 is totally urban 6 is completely rural
– Data from the National Center for Health Statistics
aod Aerosol Optical Depth measurement from a NASA satellite
– based on the diffraction of a laser
– used as a proxy of particulate pollution
– unit-less - higher value indicates more pollution
– Data from NASA
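A note on the many log_* features above: in R, log() computes the natural log by default, so a raw value (e.g. a road length in meters) can be recovered with exp(). The specific length used here is a hypothetical example:

```r
# Natural-log transform, as used for the road length and emissions features
road_length_m <- 12500          # hypothetical primary road length in meters
log_road <- log(road_length_m)  # natural log, matching the log_* features

# Reversing the transform recovers the original value
exp(log_road)                   # 12500
```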

Many of these features have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:

Data Import


All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change.

We have one CSV file that contains both our single outcome variable and all of our features (or predictor variables). You can download this file using the OCSdata package:

# install.packages("OCSdata")
OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())

If you have trouble using the package, you can also find this data on our GitHub repository. Or you can download it more directly by clicking here.

We have created a “raw” subdirectory within a directory called “data” in the working directory of our RStudio project.

Next, we import our data into R so that we can explore it further. We will call our data object pm for particulate matter. We import the data using the read_csv() function from the readr package.

We will use the here package to make it easier to find the data file.

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.

pm <- readr::read_csv(here("data","raw", "pm25_data.csv"))

We will save this data as an rda file for later in an “imported” subdirectory of the data directory.

save(pm, file = here::here("data", "imported", "imported_pm.rda"))

Data Exploration and Wrangling


If you are following along but stopped, you could start here by first loading the data like so:

load(here::here("data", "imported", "imported_pm.rda"))

If you skipped the data import section click here.

First you need to install the OCSdata package:

install.packages("OCSdata")

Then, you may download and load the imported data .rda file using the following code:

OCSdata::imported_data("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_pm.rda"))

If the package does not work for you, an RDA file (stands for R data) of the data can be found on our GitHub repository. Or you can download it more directly by clicking here.

To load the downloaded data into your environment, you may double click on the .rda file in RStudio or use the load() function.

To copy and paste our code below, place the downloaded file in your current working directory within a subdirectory called “imported” within a subdirectory called “data”. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "imported", "imported_pm.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.


The first step in performing any data analysis is to explore the data.

For example, we might want to better understand the variables included in the data, as we may learn about important details about the data that we should keep in mind as we try to predict our outcome variable.

First, let’s just get a general sense of our data. We can do that using the glimpse() function of the dplyr package (it is also in the tibble package).

We will also use the %>% pipe, which passes the output of one step along as the input to the next step.

This will make more sense when we have multiple sequential steps using the same data object.

To use the pipe notation we need to install and load dplyr as well.

For example, here we start with the pm data object and “pipe” it as input into the glimpse() function. The output is an overview of what is in the pm object, such as the number of rows and columns, all the column names, the data types for each column, and the first few values in each column. The output below is scrollable so you can see everything from the glimpse() function.

# Scroll through the output!
pm %>%
  dplyr::glimpse()
Rows: 876
Columns: 50
$ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

We can see that there are 876 monitors (rows) and that we have 50 total variables (columns) - one of which is the outcome variable. In this case, the outcome variable is called value.

Notice that some of the variables that we would think of as factors (or categorical data) are currently of class character as indicated by the <chr> just to the right of the column names/variable names in the glimpse() output. This means that the variable values are character strings, such as words or phrases.

The other variables are of class <dbl>, which stands for double precision. This indicates that they are numeric and can have decimal values. In contrast, integer values do not allow decimal numbers. Here is a link for more information on double precision numeric values.
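As a quick standalone illustration (not using the pm data), you can check how R classifies these types yourself:

```r
# Doubles allow decimal values; integers (note the L suffix) do not
typeof(10.5)      # "double"
typeof(10L)       # "integer"
is.numeric(10.5)  # TRUE - both doubles and integers are numeric
```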

Another common data class is factor, which is abbreviated like this: <fct>. A factor is a variable with a set of unique levels that have no appreciable order. For example, we can have a numeric value that is just an ID, which we want interpreted as a unique level and not as the number it would typically indicate. This would be useful for several of our variables:

  1. the monitor ID (id)
  2. the Federal Information Processing Standard number for the county where the monitor was located (fips)
  3. the zip code tabulation area (zcta)

None of the values actually have any real numeric meaning, so we want to make sure that R does not interpret them as if they do.

So let’s convert these variables into factors. We can do this using the across() function of the dplyr package and the as.factor() base function. The across() function has two main arguments: (i) the columns you want to operate on and (ii) the function or list of functions to apply to each column.

In this case, we are also using the magrittr assignment pipe (or double pipe) that looks like this: %<>%, from the magrittr package. This allows us to use the pm data as input while also reassigning the output to the same data object name.

# Scroll through the output!
pm %<>%
  dplyr::mutate(across(c(id, fips, zcta), as.factor)) 

glimpse(pm)
Rows: 876
Columns: 50
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003,…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <fct> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

Great! Now we can see that these variables are now factors as indicated by <fct> after the variable name.

skimr package


The skim() function of the skimr package is also really helpful for getting a general sense of your data. It provides summary statistics about the variables in the data set.

# Scroll through the output!
skimr::skim(pm)
Data summary
Name pm
Number of rows 876
Number of columns 50
_______________________
Column type frequency:
character 3
factor 3
numeric 44
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 49 0
county 0 1 3 20 0 471 0
city 0 1 4 48 0 607 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
id 0 1 FALSE 876 100: 1, 102: 1, 103: 1, 104: 1
fips 0 1 FALSE 569 170: 12, 603: 10, 261: 9, 107: 8
zcta 0 1 FALSE 842 475: 3, 110: 2, 160: 2, 290: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
value 0 1 10.81 2.580000e+00 3.02 9.27 11.15 12.37 2.316000e+01 ▂▆▇▁▁
lat 0 1 38.48 4.620000e+00 25.47 35.03 39.30 41.66 4.840000e+01 ▁▃▅▇▂
lon 0 1 -91.74 1.496000e+01 -124.18 -99.16 -87.47 -80.69 -6.804000e+01 ▃▂▃▇▃
CMAQ 0 1 8.41 2.970000e+00 1.63 6.53 8.62 10.24 2.313000e+01 ▃▇▃▁▁
zcta_area 0 1 183173481.91 5.425989e+08 15459.00 14204601.75 37653560.50 160041508.25 8.164821e+09 ▇▁▁▁▁
zcta_pop 0 1 24227.58 1.777216e+04 0.00 9797.00 22014.00 35004.75 9.539700e+04 ▇▇▃▁▁
imp_a500 0 1 24.72 1.934000e+01 0.00 3.70 25.12 40.22 6.961000e+01 ▇▅▆▃▂
imp_a1000 0 1 24.26 1.802000e+01 0.00 5.32 24.53 38.59 6.750000e+01 ▇▅▆▃▁
imp_a5000 0 1 19.93 1.472000e+01 0.05 6.79 19.07 30.11 7.460000e+01 ▇▆▃▁▁
imp_a10000 0 1 15.82 1.381000e+01 0.09 4.54 12.36 24.17 7.209000e+01 ▇▃▂▁▁
imp_a15000 0 1 13.43 1.312000e+01 0.11 3.24 9.67 20.55 7.110000e+01 ▇▃▁▁▁
county_area 0 1 3768701992.12 6.212830e+09 33703512.00 1116536297.50 1690826566.50 2878192209.00 5.194723e+10 ▇▁▁▁▁
county_pop 0 1 687298.44 1.293489e+06 783.00 100948.00 280730.50 743159.00 9.818605e+06 ▇▁▁▁▁
log_dist_to_prisec 0 1 6.19 1.410000e+00 -1.46 5.43 6.36 7.15 1.045000e+01 ▁▁▃▇▁
log_pri_length_5000 0 1 9.82 1.080000e+00 8.52 8.52 10.05 10.73 1.205000e+01 ▇▂▆▅▂
log_pri_length_10000 0 1 10.92 1.130000e+00 9.21 9.80 11.17 11.83 1.302000e+01 ▇▂▇▇▃
log_pri_length_15000 0 1 11.50 1.150000e+00 9.62 10.87 11.72 12.40 1.359000e+01 ▆▂▇▇▃
log_pri_length_25000 0 1 12.24 1.100000e+00 10.13 11.69 12.46 13.05 1.436000e+01 ▅▃▇▇▃
log_prisec_length_500 0 1 6.99 9.500000e-01 6.21 6.21 6.21 7.82 9.400000e+00 ▇▁▂▂▁
log_prisec_length_1000 0 1 8.56 7.900000e-01 7.60 7.60 8.66 9.20 1.047000e+01 ▇▅▆▃▁
log_prisec_length_5000 0 1 11.28 7.800000e-01 8.52 10.91 11.42 11.83 1.278000e+01 ▁▁▃▇▃
log_prisec_length_10000 0 1 12.41 7.300000e-01 9.21 11.99 12.53 12.94 1.385000e+01 ▁▁▃▇▅
log_prisec_length_15000 0 1 13.03 7.200000e-01 9.62 12.59 13.13 13.57 1.441000e+01 ▁▁▃▇▅
log_prisec_length_25000 0 1 13.82 7.000000e-01 10.13 13.38 13.92 14.35 1.523000e+01 ▁▁▃▇▆
log_nei_2008_pm25_sum_10000 0 1 3.97 2.350000e+00 0.00 2.15 4.29 5.69 9.120000e+00 ▆▅▇▆▂
log_nei_2008_pm25_sum_15000 0 1 4.72 2.250000e+00 0.00 3.47 5.00 6.35 9.420000e+00 ▃▃▇▇▂
log_nei_2008_pm25_sum_25000 0 1 5.67 2.110000e+00 0.00 4.66 5.91 7.28 9.650000e+00 ▂▂▇▇▃
log_nei_2008_pm10_sum_10000 0 1 4.35 2.320000e+00 0.00 2.69 4.62 6.07 9.340000e+00 ▅▅▇▇▂
log_nei_2008_pm10_sum_15000 0 1 5.10 2.180000e+00 0.00 3.87 5.39 6.72 9.710000e+00 ▂▃▇▇▂
log_nei_2008_pm10_sum_25000 0 1 6.07 2.010000e+00 0.00 5.10 6.37 7.52 9.880000e+00 ▁▂▆▇▃
popdens_county 0 1 551.76 1.711510e+03 0.26 40.77 156.67 510.81 2.682191e+04 ▇▁▁▁▁
popdens_zcta 0 1 1279.66 2.757490e+03 0.00 101.15 610.35 1382.52 3.041884e+04 ▇▁▁▁▁
nohs 0 1 6.99 7.210000e+00 0.00 2.70 5.10 8.80 1.000000e+02 ▇▁▁▁▁
somehs 0 1 10.17 6.200000e+00 0.00 5.90 9.40 13.90 7.220000e+01 ▇▂▁▁▁
hs 0 1 30.32 1.140000e+01 0.00 23.80 30.75 36.10 1.000000e+02 ▂▇▂▁▁
somecollege 0 1 21.58 8.600000e+00 0.00 17.50 21.30 24.70 1.000000e+02 ▆▇▁▁▁
associate 0 1 7.13 4.010000e+00 0.00 4.90 7.10 8.80 7.140000e+01 ▇▁▁▁▁
bachelor 0 1 14.90 9.710000e+00 0.00 8.80 12.95 19.22 1.000000e+02 ▇▂▁▁▁
grad 0 1 8.91 8.650000e+00 0.00 3.90 6.70 11.00 1.000000e+02 ▇▁▁▁▁
pov 0 1 14.95 1.133000e+01 0.00 6.50 12.10 21.22 6.590000e+01 ▇▅▂▁▁
hs_orless 0 1 47.48 1.675000e+01 0.00 37.92 48.65 59.10 1.000000e+02 ▁▃▇▃▁
urc2013 0 1 2.92 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
urc2006 0 1 2.97 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
aod 0 1 43.70 1.956000e+01 5.00 31.66 40.17 49.67 1.430000e+02 ▃▇▁▁▁

Notice how there is a column called n_missing about the number of values that are missing.

This is also indicated by the complete_rate column (the proportion of values that are not missing, i.e., 1 - n_missing/number of observations).

In our data set, it looks like our data do not contain any missing data.
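If you would like to confirm this directly, a quick check (a sketch, assuming the pm object is loaded as above) is to count the NA values yourself:

```r
# Total number of missing (NA) values across the entire data set
sum(is.na(pm))

# Missing values per column, showing only columns with at least one NA
na_counts <- colSums(is.na(pm))
na_counts[na_counts > 0]
```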

Also notice how the function provides separate tables of summary statistics for each data type: character, factor and numeric.

Next, the n_unique column shows us the number of unique values for each of our columns. We can see that there are 49 states represented in the data.

We can see that for many variables there are many low values, as the distribution shows two peaks: one near zero and another at a higher value.

This is true for the imp variables (measures of development), the nei variables (measures of emission sources) and the road density variables.

We can also see that the range of some of the variables is very large, in particular the area and population related variables.

Let’s take a look to see which states are included using the distinct() function of the dplyr package:

pm %>% 
  dplyr::distinct(state) 

Scroll through the output:

# A tibble: 49 × 1
   state               
   <chr>               
 1 Alabama             
 2 Arizona             
 3 Arkansas            
 4 California          
 5 Colorado            
 6 Connecticut         
 7 Delaware            
 8 District Of Columbia
 9 Florida             
10 Georgia             
11 Idaho               
12 Illinois            
13 Indiana             
14 Iowa                
15 Kansas              
16 Kentucky            
17 Louisiana           
18 Maine               
19 Maryland            
20 Massachusetts       
21 Michigan            
22 Minnesota           
23 Mississippi         
24 Missouri            
25 Montana             
26 Nebraska            
27 Nevada              
28 New Hampshire       
29 New Jersey          
30 New Mexico          
31 New York            
32 North Carolina      
33 North Dakota        
34 Ohio                
35 Oklahoma            
36 Oregon              
37 Pennsylvania        
38 Rhode Island        
39 South Carolina      
40 South Dakota        
41 Tennessee           
42 Texas               
43 Utah                
44 Vermont             
45 Virginia            
46 Washington          
47 West Virginia       
48 Wisconsin           
49 Wyoming             

It looks like “District of Columbia” is being included as a state. We can see that Alaska and Hawaii are not included in the data.
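If you are curious how the monitors are distributed across these states, one quick sketch (assuming pm is loaded) uses the count() function of the dplyr package:

```r
# Number of monitors per state, most heavily monitored states first
pm %>%
  dplyr::count(state, sort = TRUE)
```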

Let’s also take a look to see how many monitors there are in a few cities. We can use the filter() function of the dplyr package to do so. For example, let’s look at Albuquerque, New Mexico.

pm %>% dplyr::filter(city == "Albuquerque")
# A tibble: 2 × 50
  id      value fips    lat   lon state county city   CMAQ zcta  zcta_…¹ zcta_…²
  <fct>   <dbl> <fct> <dbl> <dbl> <chr> <chr>  <chr> <dbl> <fct>   <dbl>   <dbl>
1 35001.…  5.98 35001  35.1 -107. New … Berna… Albu…  10.1 87109  2.62e7   40432
2 35001.…  5.91 35001  35.1 -107. New … Berna… Albu…  10.1 87108  1.52e7   38647
# … with 38 more variables: imp_a500 <dbl>, imp_a1000 <dbl>, imp_a5000 <dbl>,
#   imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>, county_pop <dbl>,
#   log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …
# ℹ Use `colnames()` to see all variable names

We can see that there were only two monitors in the city of Albuquerque in 2006. Let’s compare this with Baltimore.

pm %>% filter(city == "Baltimore")
# A tibble: 5 × 50
  id      value fips    lat   lon state county city   CMAQ zcta  zcta_…¹ zcta_…²
  <fct>   <dbl> <fct> <dbl> <dbl> <chr> <chr>  <chr> <dbl> <fct>   <dbl>   <dbl>
1 24510.…  12.2 24510  39.3 -76.6 Mary… Balti… Balt…  10.9 21251  4.61e5     934
2 24510.…  12.5 24510  39.3 -76.7 Mary… Balti… Balt…  10.9 21215  1.76e7   60161
3 24510.…  12.8 24510  39.3 -76.5 Mary… Balti… Balt…  10.9 21224  2.45e7   49134
4 24510.…  14.3 24510  39.2 -76.6 Mary… Balti… Balt…  10.9 21226  2.57e7    7561
5 24510.…  13.3 24510  39.3 -76.6 Mary… Balti… Balt…  10.9 21202  4.11e6   22832
# … with 38 more variables: imp_a500 <dbl>, imp_a1000 <dbl>, imp_a5000 <dbl>,
#   imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>, county_pop <dbl>,
#   log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …
# ℹ Use `colnames()` to see all variable names

There were, in contrast, five monitors for the city of Baltimore, even though if we take a look at the land area and population of the counties for Baltimore and Albuquerque, we can see that they were very similar.

pm %>% 
  filter(city == "Baltimore") %>% 
  dplyr::select(county_area:county_pop)
# A tibble: 5 × 2
  county_area county_pop
        <dbl>      <dbl>
1   209643241     620961
2   209643241     620961
3   209643241     620961
4   209643241     620961
5   209643241     620961
pm %>% 
  filter(city == "Albuquerque") %>%
  select(county_area:county_pop)
# A tibble: 2 × 2
  county_area county_pop
        <dbl>      <dbl>
1  3006530549     662564
2  3006530549     662564

In fact, the county containing Albuquerque had a larger population. Thus the monitoring coverage for Albuquerque was not as thorough as it was for Baltimore.

This may be due to the fact that the monitor values were lower in Albuquerque. It is interesting to note here that the CMAQ values are quite similar for both cities.
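To see this side by side, a quick sketch (assuming pm is loaded) pulls the monitor values and CMAQ estimates for both cities:

```r
# Compare observed monitor values with the CMAQ model estimates
pm %>%
  dplyr::filter(city %in% c("Albuquerque", "Baltimore")) %>%
  dplyr::select(city, value, CMAQ)
```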

Evaluate correlation


In prediction analyses, it is also useful to evaluate if any of the variables are correlated. Why should we care about this?

If we are using a linear regression to model our data, then we might run into a problem called multicollinearity which can lead us to misinterpret what is really predictive of our outcome variable. This phenomenon occurs when the predictor variables actually predict one another. See this case study for a deeper explanation about this.

Another reason we should look out for correlation is that we don’t want to include redundant variables. This can add unnecessary noise to our algorithm causing a reduction in prediction accuracy, and it can cause our algorithm to be unnecessarily slower. Finally, it can also make it difficult to interpret what variables are actually predictive.

Let’s first take a look at all of our numeric variables with the corrplot package. The corrplot package is a great option for looking at correlation among possible predictors, and it is particularly useful if we have many predictors.

First, we calculate the Pearson correlation coefficients between all features pairwise using the cor() function of the stats package (which is loaded automatically). Then we use the corrplot::corrplot() function. The tl.cex = 0.5 argument controls the size of the text label.

PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)

Nice! Now we can see which variables show a positive (blue) or negative (red) correlation to one another. Variables that show very little correlation with one another appear white or lightly colored. We can see that each variable is perfectly correlated with itself, which is why there is a line of blue squares diagonally across the plot.

We can also plot the absolute value of the Pearson correlation coefficients using the abs() function from base R and change the order of the columns. This can be helpful if we aren’t interested in the direction of the correlation and just want to see which variables have a relationship with one another.

corrplot(abs(PM_cor), order = "hclust", tl.cex = 0.5, cl.lim = c(0, 1))

Nice, this is a bit easier to read now, as the different colors are less distracting and we can focus on the intensity. Notice that the red side of the legend no longer informs us of much: because we plotted the absolute values of the correlations, they are all positive and thus all blue. The darker the blue, the stronger the correlation.

There are several options for ordering the variables. See here for more options. Here we will use the “hclust” option for ordering by hierarchical clustering - which will order the variables by how similar they are to one another.

The cl.lim = c(0, 1) argument limits the color label to be between 0 and 1.

We can see that the development (imp) variables are correlated with each other, as we might expect. We also see that the road density variables seem to be correlated with each other, and the emission variables seem to be correlated with each other.

Also notice that none of the predictors are highly correlated with our outcome variable (value).
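If you prefer a list to a plot, one way to pull out the most strongly correlated pairs (a sketch, assuming PM_cor was computed as above; the 0.9 cutoff is arbitrary) is:

```r
# Find pairs of distinct variables with |r| > 0.9
# upper.tri() keeps each pair once and drops the diagonal
high_cor <- which(abs(PM_cor) > 0.9 & upper.tri(PM_cor), arr.ind = TRUE)
data.frame(
  var1 = rownames(PM_cor)[high_cor[, "row"]],
  var2 = colnames(PM_cor)[high_cor[, "col"]],
  r    = PM_cor[high_cor]
)
```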

We can also take a closer look using the ggcorr() function and the ggpairs() function of the GGally package.

To select our variables of interest we can use the select() function with the contains() helper of the tidyselect package (re-exported by dplyr).

First let’s look at the imp/development variables. We can change the default color palette (palette = "RdBu") and add on correlation coefficients to the plot (label = TRUE).

select(pm, contains("imp")) %>%
  ggcorr(palette = "RdBu", label = TRUE)

select(pm, contains("imp")) %>%
  ggpairs()

Indeed, we can now see more clearly that imp_a1000 and imp_a500 are highly correlated, as well as imp_a10000 and imp_a15000. We also get a sense of how the data points vary across the range of values. Note that in this plot red indicates positive correlation values and blue indicates negative correlation values. This is in contrast to our previous plot.

Next, let’s take a look at the road density data:

select(pm, contains("pri")) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

We can see that many of the road density variables are highly correlated with one another, while others are less so. Again note that in this plot red indicates positive correlation values and blue indicates negative correlation values.

Finally let’s look at the emission variables.

select(pm, contains("nei")) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

select(pm, contains("nei")) %>%
  ggpairs()

We can see some fairly high correlation values as well.

We would also expect that the population density data might correlate with some of these variables. Let’s take a look.

pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county, 
       log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county, 
       log_pri_length_10000, imp_a10000, county_pop) %>%
  ggpairs()

Interesting! These variables don’t appear to be highly correlated, so we might need variables from each of the categories to predict our monitor PM2.5 pollution values.

Because some variables in our data have extreme values, it might be good to apply a log transformation, which can affect our estimates of correlation.

pm %>%
  mutate(log_popdens_county= log(popdens_county)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
       log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

pm %>%
  mutate(log_popdens_county= log(popdens_county)) %>%
  mutate(log_pop_county = log(county_pop)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
       log_pri_length_10000, imp_a10000, log_pop_county) %>%
  ggpairs()

Indeed this increased the correlation, but variables from each of these categories may still prove to be useful for prediction.

Now that we have a sense of what our data are, we can get started with building a machine learning model to predict air pollution.

First let’s save our data again because we did wrangle it just a tad. This time we will save it to a subdirectory of the “data” directory called “wrangled”. This is a good practice in general for data analyses to keep your data organized. We will also save a csv version as this is often useful to give to collaborators. To do this we will use the write_csv() function of the readr package.

save(pm, file = here::here("data", "wrangled", "wrangled_pm.rda"))
write_csv(pm, file = here::here("data", "wrangled", "wrangled_pm.csv"))

What is machine learning?


You may have learned about the central dogma of statistics: that you sample from a population.

Then you use the sample to try to guess what is happening in the population.

For prediction, we have a similar sampling problem.

But now we are trying to build a rule that can be used to predict a single observation’s value of some characteristic using characteristics of the other observations.

Let’s make this more concrete.

If you recall from the What are the data? section above, when we are using machine learning for prediction, our data consists of:

  1. A continuous outcome variable that we want to predict
  2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable

We will use \(Y\) to denote the outcome variable and \(X = (X_1, \dots, X_p)\) to denote \(p\) different features (or predictor variables). Because our outcome variable is continuous (as opposed to categorical), we are interested in a particular type of machine learning algorithm.

Our goal is to build a machine learning algorithm that uses the features \(X\) as input and predicts the outcome variable (here, air pollution levels) in situations where we do not know the outcome variable.

The way we do this is to use data where we have both the features \((X_1=x_1, \dots X_p=x_p)\) and the actual outcome \(Y\) data to train a machine learning algorithm to predict the outcome, which we call \(\hat{Y}\).

When we say train a machine learning algorithm we mean that we estimate a function \(f\) that uses the predictor variables \(X\) as input or \(\hat{Y} = f(X)\).

ML as an optimization problem

If we are doing a good job, then our predicted outcome \(\hat{Y}\) should closely match our actual outcome \(Y\) that we observed.

In this way, we can think of machine learning (ML) as an optimization problem that tries to minimize the distance between \(\hat{Y} = f(X)\) and \(Y\).

\[d(Y - f(X))\]

The choice of distance metric \(d(\cdot)\) can be the mean of the absolute or squared difference, or something more complicated.

Much of statistics and computer science is focused on defining \(f\) and \(d\).
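As a concrete standalone illustration, two common choices of \(d(\cdot)\) are the mean absolute error and the root mean squared error, computed here on toy observed and predicted values:

```r
# Toy observed outcomes and predictions
y    <- c(10.2, 9.5, 12.1, 11.0)
yhat <- c(9.8, 10.0, 11.5, 11.3)

mae  <- mean(abs(y - yhat))       # mean absolute difference
rmse <- sqrt(mean((y - yhat)^2))  # root of the mean squared difference
mae   # 0.45
rmse  # about 0.464
```

The smaller these values, the closer the predictions \(\hat{Y}\) are to the observed outcomes \(Y\).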

The parts of an ML problem

To set up and solve a (standard) machine learning (ML) problem, you need a few components:

  1. A data set to train from.
  2. An algorithm or set of algorithms you can use to try values of \(f\).
  3. A distance metric \(d\) for measuring how close \(Y\) is to \(\hat{Y}\).
  4. A definition of what a “good” distance is.

While each of these components is a technical problem, there has been a ton of work addressing those technical details. The most pressing open issue in machine learning is realizing that though these are technical steps they are not objective steps. In other words, how you choose the data, algorithm, metric, and definition of “good” says what you value and can dramatically change the results. A couple of cases where this was a big deal are:

  1. Machine learning for recidivism - people built ML models to predict who would re-commit a crime. But these predictions were based on historically biased data which led to biased predictions about who would commit new crimes.
  2. Deciding how self driving cars should act - self driving cars will have to make decisions about how to drive, who they might injure, and how to avoid accidents. Depending on our choices for \(f\) and \(d\) these might lead to wildly different kinds of self driving cars. Try out the moralmachine to see how this looks in practice.

Now that we know a bit more about machine learning, let’s build a model to predict air pollution levels using the tidymodels framework.

Machine learning with tidymodels


The goal is to build a machine learning algorithm that uses the features as input and predicts an outcome variable (here, air pollution levels) in the situation where we do not know the outcome variable.

The way we do this is to use data where we have both the input and output data to train a machine learning algorithm.

To train a machine learning algorithm, we will use the tidymodels package ecosystem.

Overview


The tidymodels ecosystem


To perform our analysis we will be using the tidymodels suite of packages. You may be familiar with the older packages caret or mlr which are also for machine learning and modeling but are not a part of the tidyverse. Max Kuhn describes tidymodels like this:

“Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret. The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do.”

There are many R packages in the tidymodels ecosystem, which assist with various steps in the process of building a machine learning algorithm. These are the main packages, but there are others.

This is a schematic of how these packages work together to build a machine learning algorithm:

Here we can see that after exploring and splitting the data, we perform the initial modeling stages of variable assignment and pre-processing, as shown in the blue box, while the green box indicates the steps required to train the model: specifying the model, fitting the model, and tuning it.

Benefits of tidymodels


The two major benefits of tidymodels are:

  1. Standardized workflow/format/notation across different types of machine learning algorithms

Because different algorithms were developed by different people, they often require different notations for the data and the model. Without a standardized interface, testing multiple algorithms would require the painstaking process of reformatting the data to be compatible with each one.

  2. Easy modification of pre-processing, algorithm choice, and hyper-parameter tuning, making optimization straightforward

Modifying a piece of the overall process is now much easier, because many of the steps are specified with the tidymodels packages in a standardized manner. Thus the entire process can be rerun after a simple change to pre-processing without much difficulty.

tidymodels Steps


Splitting the data


The first step after data exploration in machine learning analysis is to split the data into training and testing data sets.

The training data set will be used to build and tune our model. This is the data that the model “learns” on. The testing data set will be used to evaluate the performance of our model in a more generalizable way. What do we mean by “generalizable”?

Remember that our main goal is to use our model to be able to predict air pollution levels in areas where there are no gravimetric monitors.

Therefore, if our model is really good at predicting air pollution with the data that we use to build it, it might not do the best job for the areas where there are few to no monitors.

If we only evaluated the model on the data used to build it, we might see really good prediction accuracy and assume that we were going to do a good job estimating air pollution any time we use our model, but in fact this would likely not be the case. This situation is what we call overfitting.

Overfitting happens when we end up modeling not only the major relationships in our data but also the noise within our data.


If we get good prediction with our testing set, then we know that our model can be applied to other data and will likely perform well. We will discuss this more later.

We will not touch the testing set until we have completed optimizing our model with the training set. This will allow us to have a less biased evaluation of how well our model can do with other data besides the data used in the training set to build the model. Ideally, you would also want a completely independent data set to further test the performance of your model.

To split the data into training and testing, we will use the initial_split() function in the rsample package to specify how we want to split our data.

If you skipped previous sections click here for more information on how to obtain and load the data.

First you need to install the OCSdata package:

install.packages("OCSdata")

Then, you may download and load the wrangled data .rda file using the following code:

OCSdata::wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))

If the package does not work for you, you may also download this .rda file by clicking this link here.

To load the downloaded data into your environment, you may double-click on the .rda file in RStudio or use the load() function.

To copy and paste our code below, place the downloaded .rda file in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data”. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_pm.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.


set.seed(1234)
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
<Training/Testing/Total>
<584/292/876>

A couple of notes from the code above:

  • Typically, data are split with the majority of the observations for training and a smaller portion for testing. The default with this function is 3/4 (75%) of the observations for training and 1/4 (25%) for testing, so the proportion does not need to be specified. However, you can change it using the prop argument, which we will do here for illustrative purposes. People often use 80% (4/5) for training and 20% (1/5) for testing, or some other similar proportion like the 2/3 for training and 1/3 for testing that we use here.

Click here to learn more about how people decide what split proportion to use.

Having more training data helps the model to train on a greater variety of observations. However, having more testing data helps to see how generalizable the model is and allows for better comparisons of different models. The need for each may depend on your data and how much variability it has and the size of your data. For smaller datasets, setting aside a larger portion for testing can be beneficial to avoid having a very small testing dataset. Here’s a paper that describes more about this topic.

  • Since the split is performed randomly, it is a good idea to use the set.seed() function in base R to ensure that if you rerun your code, your split will be the same next time.
  • We can see the number of monitors in our training, testing, and original data by typing in the name of our split object. The result will look like this: <training data sample number, testing data sample number, original sample number>
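As a small illustration of the prop argument (using a made-up data frame rather than our pm data), compare the default 3/4 split with an 80/20 split:

```r
library(rsample)

set.seed(1234)
toy <- data.frame(x = rnorm(100), y = rnorm(100))

default_split <- initial_split(toy)             # default: prop = 3/4
custom_split  <- initial_split(toy, prop = 4/5) # 80% training, 20% testing

nrow(training(default_split)) # 75 rows for training
nrow(training(custom_split))  # 80 rows for training
```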

Now, you can also specify a variable to stratify by with the strata argument. This is useful if you have imbalanced categorical variables and you would like to intentionally make sure that there are similar numbers of samples of the rarer categories in both the testing and training sets. Otherwise the split is performed completely at random.

According to the documentation for the rsample package:

The strata argument causes the random sampling to be conducted within the stratification variable. This can help ensure that the number of data points in the training data is equivalent to the proportions in the original data set.

In the case of our data set, perhaps we would like our training set to have similar proportions of monitors from each of the states as in the initial data. This might be useful if we want our model to be generalizable across all of the states.

We can see that indeed there are different proportions of monitors in each state by using the count() function of the dplyr package.

count(pm, state)

Scroll through the output:

# A tibble: 49 × 2
   state                    n
   <chr>                <int>
 1 Alabama                 24
 2 Arizona                 17
 3 Arkansas                16
 4 California              85
 5 Colorado                15
 6 Connecticut             14
 7 Delaware                 7
 8 District Of Columbia     3
 9 Florida                 29
10 Georgia                 28
11 Idaho                    7
12 Illinois                38
13 Indiana                 36
14 Iowa                    20
15 Kansas                  10
16 Kentucky                22
17 Louisiana               17
18 Maine                    1
19 Maryland                15
20 Massachusetts           16
21 Michigan                30
22 Minnesota               17
23 Mississippi             12
24 Missouri                13
25 Montana                 16
26 Nebraska                 7
27 Nevada                   4
28 New Hampshire            7
29 New Jersey              23
30 New Mexico              10
31 New York                24
32 North Carolina          35
33 North Dakota             4
34 Ohio                    44
35 Oklahoma                10
36 Oregon                  17
37 Pennsylvania            32
38 Rhode Island             5
39 South Carolina          14
40 South Dakota             9
41 Tennessee                3
42 Texas                   27
43 Utah                    14
44 Vermont                  4
45 Virginia                20
46 Washington               8
47 West Virginia           14
48 Wisconsin               21
49 Wyoming                 12

If our data set were large enough it might be nice then to stratify by state using the strata = "state" argument in initial_split(), but our data is unfortunately not large enough.
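Although our data set is too small to stratify by state, here is a hypothetical sketch with made-up data showing what strata does: the rare category "B" ends up in similar proportions in both the training and testing sets.

```r
library(rsample)

set.seed(1234)
# Made-up data with an imbalanced grouping variable
toy <- data.frame(group = rep(c("A", "B"), times = c(160, 40)),
                  x = rnorm(200))

strat_split <- initial_split(toy, prop = 2/3, strata = group)

# "B" makes up roughly 20% of both sets, mirroring the original data
table(training(strat_split)$group)
table(testing(strat_split)$group)
```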

Importantly, the initial_split() function only determines which rows of our pm data frame should be assigned for training or testing; it does not actually split the data.

To extract the training and testing data we can use the training() and testing() functions, also from the rsample package.

train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
 
# Scroll through the output!
count(train_pm, state)
# A tibble: 49 × 2
   state                    n
   <chr>                <int>
 1 Alabama                 13
 2 Arizona                 12
 3 Arkansas                 8
 4 California              55
 5 Colorado                10
 6 Connecticut             12
 7 Delaware                 3
 8 District Of Columbia     2
 9 Florida                 22
10 Georgia                 20
# … with 39 more rows
# ℹ Use `print(n = ...)` to see more rows
count(test_pm, state)
# A tibble: 47 × 2
   state                    n
   <chr>                <int>
 1 Alabama                 11
 2 Arizona                  5
 3 Arkansas                 8
 4 California              30
 5 Colorado                 5
 6 Connecticut              2
 7 Delaware                 4
 8 District Of Columbia     1
 9 Florida                  7
10 Georgia                  8
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows

Preparing for pre-processing the data


After splitting the data, the next step is to process the training and testing data so that the data are compatible and optimized to be used with the model.

In order to start this, we need to think about what each of the aspects of our data might do in the model. For example, is this particular column of data what we would consider the outcome of interest? Or are these values possibly helpful for predicting the outcome? This process is described as assigning variables to specific roles within the model.

We will then do what is called pre-processing to prepare the data so it is ready. This involves things like scaling variables and removing redundant variables.

This process is also called feature engineering.

To do this in tidymodels, we will create what’s called a “recipe” using the recipes package: a standardized format for a sequence of pre-processing steps. This can be very useful because it makes testing out different pre-processing steps or different algorithms with the same pre-processing very easy and reproducible. Creating a recipe specifies how a data frame of predictors should be created - it specifies which variables will be used and which pre-processing steps will be applied, but it does not execute these steps or create the data frame of predictors.

Step 1: Specify variables roles with recipe() function

The first thing to do to create a recipe is to specify which variables we will be using as our outcome and predictors using the recipe() function. In terms of the metaphor of baking, we can think of this as listing our ingredients. Translating this to the recipes package, we use the recipe() function to assign roles to all the variables.

Let’s try the simplest recipe with no pre-processing steps: simply list the outcome and predictor variables.

We can do so in two ways:

  1. Using formula notation
  2. Assigning roles to each variable

Let’s look at the first way using formula notation, which looks like this:

outcome(s) ~ predictor(s)

In the case of multiple predictors, or a multivariate situation with two outcomes, use a plus sign:

outcome1 + outcome2 ~ predictor1 + predictor2

If we want to include all predictors we can use a period like so:

outcome_variable_name ~ .

Now with our data, we will start by making a recipe for our training data. If you recall, the continuous outcome variable is value (the average annual gravimetric monitor PM2.5 concentration in ug/m3). Our features (or predictor variables) are all the other variables except the monitor ID, which is an id variable.

The reason not to include the id variable is that it combines the county number with a number designating which particular monitor (of those in that county) the values came from. Since this number is arbitrary, the county information is already given in the data, and each monitor has only one value in the value variable, nothing is gained by including this variable as a predictor and it may instead introduce noise. However, it is useful to keep the variable around so we can take a look at what is happening later. We will show you what to do in this case in just a bit.

To summarize this step, we will use the recipe() function to assign roles to all the variables:

We will describe this step by step and then show all the steps together.

In the simplest case, we might use all predictors like this:

simple_rec <- train_pm %>%
  recipes::recipe(value ~ .)

simple_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         49

We see a recipe has been created with 1 outcome variable and 49 predictor variables (or features). Also, notice how we named the output of recipe(). The naming convention for recipe objects is *_rec or rec.

Now, let’s get back to the id variable. Instead of including it as a predictor variable, we could also use the update_role() function of the recipes package.

simple_rec <- train_pm %>%
  recipes::recipe(value ~ .) %>%
  recipes::update_role(id, new_role = "id variable")

simple_rec
Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

Click here to learn more about working with id variables

This option works well with the newer workflows package. However, id variables are often dropped from analyses that do not use this newer package, as they can make the process difficult when using the parsnip package alone, because new levels (or possible values) may be introduced with the testing data.

We could also specify the outcome and predictors in the same way as we just specified the id variable. Please see here for examples of other roles for variables. The role can actually be any value.

The order is important here, as we first make all variables predictors and then override this role for the outcome and id variable. We will use the everything() function of the dplyr package to start with all of the variables in train_pm.

simple_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable")

simple_rec
Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

We can view our recipe in more detail using the base summary() function.

summary(simple_rec)
# A tibble: 50 × 4
   variable type    role        source  
   <chr>    <chr>   <chr>       <chr>   
 1 id       nominal id variable original
 2 value    numeric outcome     original
 3 fips     nominal predictor   original
 4 lat      numeric predictor   original
 5 lon      numeric predictor   original
 6 state    nominal predictor   original
 7 county   nominal predictor   original
 8 city     nominal predictor   original
 9 CMAQ     numeric predictor   original
10 zcta     nominal predictor   original
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows

Step 2: Specify the pre-processing steps with step*() functions

Next, we use the step*() functions from the recipes package to specify pre-processing steps.

This link and this link show the many options for recipe step functions.

There are step functions for a variety of purposes:

  1. Imputation – filling in missing values based on the existing data
  2. Transformation – changing all values of a variable in the same way, typically to make it more normal or easier to interpret
  3. Discretization – converting continuous values into discrete or nominal values - binning for example to reduce the number of possible levels (However this is generally not advisable!)
  4. Encoding / Creating Dummy Variables – creating a numeric code for categorical variables (More on one-hot and Dummy Variables encoding)
  5. Data type conversions – which means changing from integer to factor or numeric to date etc.
  6. Interaction term addition to the model – which means that we would be modeling for predictors that would influence the capacity of each other to predict the outcome
  7. Normalization – centering and scaling the data to a similar range of values
  8. Dimensionality Reduction/ Signal Extraction – reducing the space of features or predictors to a smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
  9. Filtering – filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
  10. Row operations – performing functions on the values within the rows (ex. rearranging, filtering, imputing)
  11. Checking functions – Gut checks to look for missing values, to look at the variable classes etc.

All of the step functions look like step_*() with the * replaced with a name, except for the check functions which look like check_*().

There are several ways to select what variables to apply steps to:

  1. Using tidyselect methods: contains(), matches(), starts_with(), ends_with(), everything(), num_range()
  2. Using the type: all_nominal(), all_numeric() , has_type()
  3. Using the role: all_predictors(), all_outcomes(), has_role()
  4. Using the name - use the actual name of the variable/variables of interest
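To see these selectors in action, here is a small sketch with a made-up data frame (the variable names here are hypothetical, not from our pm data) that selects variables by name, by type combined with role, and by type:

```r
library(recipes)

set.seed(1234)
toy <- data.frame(outcome = rnorm(10),
                  age     = rnorm(10, mean = 40, sd = 5),
                  height  = rnorm(10, mean = 170, sd = 10),
                  city    = rep(c("a", "b"), 5))

rec <- recipe(outcome ~ ., data = toy) %>%
  step_log(height) %>%                               # by name
  step_normalize(all_numeric(), -all_outcomes()) %>% # by type, excluding the outcome role
  step_dummy(all_nominal())                          # by type

rec
```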

Let’s try adding some steps to our recipe.

We might want to modify some of our categorical variables so that they can be used with certain algorithms, like regression, that require only numeric values.

We can do this with the step_dummy() function and the one_hot = TRUE argument to use a method called One-hot encoding.

One-hot encoding means that we do not simply encode our categorical variables numerically as a simple 1,2,3, as our numeric assignments can be interpreted by algorithms as having a particular rank or order. Instead, new binary variables made up of 1s and 0s are used to arbitrarily assign a numeric value that has no apparent order. Note that while there is a different but similar method to do this referred to as dummy variables, one-hot encoded variables are also sometimes called dummy variables.

For more information about what one-hot encoding is, you can expand here.

For example, say we only had three city values for the city variable: “Tucson”, “Denver”, and “New York”.

City
Tucson
New York
Tucson
Denver

This would be replaced by three new variables: one for whether the value was “Tucson”, one for “Denver”, and one for “New York”. Each would be made up of zeros and ones. A one in the new Tucson variable would indicate that “Tucson” was indeed the value, and a zero would indicate that it was instead “New York” or “Denver”. Essentially there is one “hot” value possible per row across the new variables, where the value will be one.

Tucson Denver New York
1 0 0
0 0 1
1 0 0
0 1 0

Here we will create such variables to replace the current state, county, and city categorical data. Similarly, the ZCTA values (which are currently a factor class) are not intended to be interpreted as numeric, as they are like zip codes, so we want to encode them as well.

simple_rec %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE)
Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

Operations:

Dummy variables from state, county, city, zcta

Click here to see how the variables would change if our recipe stopped here with simply the one-hot encoding.

To create these data we used steps that we will demonstrate later; for now we will just show you what the new encoded variables look like.

length(names(train_pm))
[1] 50
length(names(baked_one_hot_rec)) # many more variables!
[1] 1733
train_pm %>% select(zcta, city, state, county) %>% head
# A tibble: 6 × 4
  zcta  city          state      county          
  <fct> <chr>         <chr>      <chr>           
1 46805 Fort Wayne    Indiana    Allen           
2 54520 Not in a city Wisconsin  Forest          
3 92506 Riverside     California Riverside       
4 45711 Not in a city Ohio       Athens          
5 45217 St. Bernard   Ohio       Hamilton        
6 21251 Baltimore     Maryland   Baltimore (City)
# let's look at a few
baked_one_hot_rec %>% select("zcta_X54520", "city_Not.in.a.city", "state_Indiana") %>% head()
# A tibble: 6 × 3
  zcta_X54520 city_Not.in.a.city state_Indiana
        <dbl>              <dbl>         <dbl>
1           0                  0             1
2           1                  1             0
3           0                  0             0
4           0                  1             0
5           0                  0             0
6           0                  0             0

Our fips variable includes a numeric code for state and county - and therefore is essentially a proxy for county. Since we already have county, we will just use it and keep the fips ID as another ID variable.

We can remove the fips variable from the predictors using update_role() to make sure that the role is no longer "predictor". We can actually make the role anything we want, so we will choose something identifiable.

simple_rec %>%
  update_role("fips", new_role = "county id")
Recipe

Inputs:

        role #variables
   county id          1
 id variable          1
     outcome          1
   predictor         47

We might also want to remove variables that appear to be redundant and are highly correlated with others, as we know from our exploratory data analysis that many of our variables are correlated with one another. We can do this using the step_corr() function.

We don’t want to remove some of our variables, like the CMAQ and aod variables. We can specify this using the - sign before the names of these variables, like so:

simple_rec %>%
  step_corr(all_predictors(), - CMAQ, - aod)
Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

Operations:

Correlation filter on all_predictors(), -CMAQ, -aod

It is also a good idea to remove variables with near-zero variance, which can be done with the step_nzv() function.

Variables have low variance if all the values are very similar, the values are very sparse, or if they are highly imbalanced. Again we don’t want to remove our CMAQ and aod variables.

simple_rec %>%
  step_nzv(all_predictors(), - CMAQ, - aod)
Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

Operations:

Sparse, unbalanced variable filter on all_predictors(), -CMAQ, -aod

Click here to learn about examples where you might have near-zero variance variables
  1. Similar Values - If the population density was nearly the same for every zcta that contained a monitor, then knowing the population density near our monitor would contribute little to our model in assisting us to predict monitor air pollution values.
  2. Sparse Data - If all of the monitors were in locations where the populations did not attend graduate school, then these values would mostly be zero, and again this would do very little to help us distinguish our air pollution monitors. When many of the values are zero, this is also called sparse data.
  3. Imbalanced Data - If nearly all of the monitors were located in one particular state, and all the others only had one monitor each, then the real predictive value would simply be in knowing if a monitor is located in that particular state or not. In this case we don’t want to remove our variable, we just want to simplify it.

See this blog post about why removing near-zero variance variables isn’t always a good idea if we think that a variable might be especially informative.

Let’s put all this together now.

Remember: it is important to add the steps to the recipe in an order that makes sense just like with a cooking recipe.

First, we are going to create numeric values for our categorical variables, then we will look at correlation and near-zero variance. Again, we do not want to remove the CMAQ and aod variables, so we can make sure they are kept in the model by excluding them from those steps. If we specifically wanted to remove a predictor we could use step_rm().

simple_rec %<>%
  update_role("fips", new_role = "county id") %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
  step_corr(all_predictors(), - CMAQ, - aod)%>%
  step_nzv(all_predictors(), - CMAQ, - aod)
  
simple_rec
Recipe

Inputs:

        role #variables
   county id          1
 id variable          1
     outcome          1
   predictor         47

Operations:

Dummy variables from state, county, city, zcta
Correlation filter on all_predictors(), -CMAQ, -aod
Sparse, unbalanced variable filter on all_predictors(), -CMAQ, -aod

Running the pre-processing


Step 1: Update the recipe with training data using prep()


The next major function of the recipes package is prep(). This function updates the recipe object based on the training data: it estimates the parameters (the quantities and statistics required by the steps) for pre-processing and updates the variable roles, as some of the predictors may be removed. This makes the recipe ready to use on other data sets. It does not necessarily execute the pre-processing itself; however, we will specify an argument for it to do this so that we can take a look at the pre-processed data.

There are some important arguments to know about:

  1. training - you must supply a training data set to estimate parameters for pre-processing operations (recipe steps) - this may already be included in your recipe - as is the case for us
  2. fresh - if fresh=TRUE, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe (default is FALSE)
  3. verbose - if verbose=TRUE, shows the progress as the steps are evaluated and the size of the pre-processed training set (default is FALSE)
  4. retain - if retain=TRUE, then the pre-processed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and do not want to rerun the prep() on the previous steps. However this can make the recipe size large. This is necessary if you want to actually look at the pre-processed data (default is TRUE)

Let’s try out the prep() function:

prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE )
oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.26 Mb  in memory.
names(prepped_rec)
 [1] "var_info"       "term_info"      "steps"          "template"      
 [5] "levels"         "retained"       "requirements"   "tr_info"       
 [9] "orig_lvls"      "last_term_info"

There are also lots of useful things to check out in the output of prep(). You can see:

  1. the steps that were run
  2. the original variable info (var_info)
  3. the updated variable info after pre-processing (term_info)
  4. the new levels of the variables
  5. the original levels of the variables (orig_lvls)
  6. info about the training data set size and completeness (tr_info)

Note: You may see the prep.recipe() function in material that you read about the recipes package. This is referring to the prep() function of the recipes package.

Step 2: Extract pre-processed training data using bake()


Since we retained our pre-processed training data (i.e. prep(retain = TRUE)), we can take a look at it by using the bake() function of the recipes package. The bake() function allows us to apply our modeling steps (in this case just pre-processing on the training data) and see what they would do to the data.

Let’s bake!

Since we don’t have new data (we aren’t looking at the testing data), we need to specify this with new_data = NULL.

# Scroll through the output!
baked_train <- bake(prepped_rec, new_data = NULL)
glimpse(baked_train)
Rows: 584
Columns: 37
$ id                          <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value                       <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips                        <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat                         <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon                         <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ                        <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area                   <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop                    <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500                    <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000                  <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area                 <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop                  <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec          <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000        <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500       <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000      <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000      <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000     <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county              <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta                <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs                        <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs                      <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs                          <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege                 <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate                   <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor                    <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad                        <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov                         <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless                   <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006                     <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod                         <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California            <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …

Note: this process used to require the juice() function.
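The same fitted recipe can later be applied to genuinely new data by passing it to new_data. As a minimal sketch (here, test_pm is a hypothetical name for a testing split; the actual split object in your workflow may be named differently):

```r
# Apply the same pre-processing steps to new data (e.g., a testing split).
# test_pm is a placeholder name for illustration only.
baked_test <- bake(prepped_rec, new_data = test_pm)
```

Because the recipe was prepped on the training data, any statistics it learned (for example, which dummy variables to keep) are reused as-is rather than re-estimated on the new data.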

For easy comparison's sake, here is our original data:

# Scroll through the output!
glimpse(pm)
Rows: 876
Columns: 50
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003,…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <fct> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

Notice how we only have 37 variables now instead of 50! Three of these are not predictors: our two ID variables (fips and the actual monitor ID (id)) and our outcome (value). Thus we only have 34 predictors now. We can also see that we no longer have any categorical variables. Variables like state are gone, and only the dummy variables state_California and city_Not.in.a.city remain, as these were the only categorical values common enough to survive the near-zero variance filter.
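If you want to see exactly which variables changed, you can compare the column names of the original and baked data frames directly:

```r
# Columns present in the original data but removed during pre-processing
setdiff(names(pm), names(baked_train))

# Columns created during pre-processing (the retained dummy variables)
setdiff(names(baked_train), names(pm))
```

The first call lists the dropped variables (the categorical variables and the correlated or near-zero-variance predictors), and the second lists the new dummy variables the recipe created.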

We can see that California has more monitors than any other state.

pm %>% count(state) 

Scroll through the output:

# A tibble: 49 × 2
   state                    n
   <chr>                <int>
 1 Alabama                 24
 2 Arizona                 17
 3 Arkansas                16
 4 California              85
 5 Colorado                15
 6 Connecticut             14
 7 Delaware                 7
 8 District Of Columbia     3
 9 Florida                 29
10 Georgia                 28
11 Idaho                    7
12 Illinois                38
13 Indiana                 36
14 Iowa                    20
15 Kansas                  10
16 Kentucky                22
17 Louisiana               17
18 Maine                    1
19 Maryland                15
20 Massachusetts           16
21 Michigan                30
22 Minnesota               17
23 Mississippi             12
24 Missouri                13
25 Montana                 16
26 Nebraska                 7
27 Nevada                   4
28 New Hampshire            7
29 New Jersey              23
30 New Mexico              10
31 New York                24
32 North Carolina          35
33 North Dakota             4
34 Ohio                    44
35 Oklahoma                10
36 Oregon                  17
37 Pennsylvania            32
38 Rhode Island             5
39 South Carolina          14
40 South Dakota             9
41 Tennessee                3
42 Texas                   27
43 Utah                    14
44 Vermont                  4
45 Virginia                20
46 Washington               8
47 West Virginia           14
48 Wisconsin               21
49 Wyoming                 12

We can also see that more monitors were listed as "Not in a city" than were located in any single city.

pm %>% count(city)

Scroll through the output:

# A tibble: 607 × 2
    city                                                 n
    <chr>                                            <int>
  1 Aberdeen                                             1
  2 Akron                                                2
  3 Albany                                               3
  4 Albuquerque                                          2
  5 Alexandria                                           1
  6 Allen Park                                           1
  7 Altamont                                             1
  8 Alton                                                1
  9 Amarillo                                             1
 10 Anadarko                                             1
 11 Anaheim                                              1
 12 Anderson                                             1
 13 Annandale                                            1
 14 Apache Junction                                      1
 15 Apple Valley                                         1
 16 Appleton                                             1
 17 Arden-Arcade                                         1
 18 Arlington                                            1
 19 Arnold                                               1
 20 Asheville                                            1
 21 Ashland                                              2
 22 Atascadero                                           1
 23 Athens-Clarke County (Remainder)                     1
 24 Atlanta                                              2
 25 Atlantic City                                        1
 26 Augusta-Richmond County (Remainder)                  2
 27 Aurora                                               1
 28 Austin                                               1
 29 Azusa                                                1
 30 Bakersfield                                          3
 31 Baltimore                                            5
 32 Batavia                                              1
 33 Baton Rouge                                          1
 34 Bay City                                             1
 35 Bayport                                              1
 36 Baytown                                              1
 37 Beaver Falls                                         1
 38 Beckley                                              1
 39 Belle Glade                                          1
 40 Bellevue                                             1
 41 Beltsville                                           1
 42 Bend                                                 1
 43 Bennington                                           1
 44 Bensley                                              1
 45 Big Bear City                                        1
 46 Billings                                             1
 47 Birmingham                                           2
 48 Bismarck                                             1
 49 Bladensburg                                          1
 50 Blair                                                1
 51 Blue Ash                                             1
 52 Blue Island                                          1
 53 Boise (corporate name Boise City)                    1
 54 Boone                                                1
 55 Boston                                               4
 56 Boulder                                              1
 57 Boulevard                                            1
 58 Bountiful                                            1
 59 Braidwood                                            1
 60 Brawley                                              1
 61 Bridgeport                                           1
 62 Brigham City                                         1
 63 Bristol                                              2
 64 Brockton                                             1
 65 Brook Park                                           1
 66 Brookings                                            1
 67 Brunswick                                            1
 68 Bryson City (RR name Bryson)                         1
 69 Buffalo                                              1
 70 Burbank                                              1
 71 Burlington                                           2
 72 Burns                                                1
 73 Butte-Silver Bow (Remainder)                         1
 74 Calexico                                             1
 75 Camden                                               1
 76 Candor                                               1
 77 Canton                                               2
 78 Carlisle                                             1
 79 Carlstadt                                            3
 80 Cary                                                 1
 81 Casa Grande                                          1
 82 Cedar Rapids                                         2
 83 Cedarhurst                                           1
 84 Central Point                                        1
 85 Chalmette                                            1
 86 Champaign                                            1
 87 Chapel Hill                                          1
 88 Charleroi                                            1
 89 Charleston                                           2
 90 Charlotte                                            2
 91 Chattanooga                                          1
 92 Chelmsford (Chelmsford Center)                       1
 93 Chester                                              2
 94 Cheyenne                                             1
 95 Chicago                                              5
 96 Chickasaw                                            1
 97 Chicopee                                             1
 98 Childersburg                                         1
 99 Chula Vista                                          1
100 Cicero                                               1
101 Cincinnati                                           3
102 Clairton                                             1
103 Claremont                                            1
104 Clarion                                              1
105 Clarksburg                                           1
106 Clearwater                                           1
107 Cleveland                                            6
108 Clinton                                              2
109 Clive                                                1
110 Clovis                                               1
111 Cockeysville                                         1
112 Cody                                                 1
113 Coloma                                               1
114 Colorado Springs                                     1
115 Columbia                                             1
116 Columbia Falls                                       1
117 Columbus                                             4
118 Columbus (Remainder)                                 3
119 Colusa                                               1
120 Commerce City                                        1
121 Concord                                              1
122 Conway                                               1
123 Corcoran                                             1
124 Cornwall                                             1
125 Corpus Christi                                       2
126 Cottage Grove                                        1
127 Cottonwood West                                      1
128 Council Bluffs                                       1
129 Covington                                            1
130 Crossett                                             1
131 Crossville                                           1
132 Dale                                                 1
133 Dallas                                               3
134 Danbury                                              1
135 Darrington                                           1
136 Davenport                                            3
137 Davie                                                1
138 Dayton                                               1
139 Dearborn                                             1
140 Decatur                                              2
141 Delray Beach                                         1
142 Dentsville (Dents)                                   1
143 Denver                                               3
144 Des Moines                                           1
145 Des Plaines                                          1
146 Detroit                                              5
147 Doraville                                            1
148 Dothan                                               1
149 Douglas                                              1
150 Dover                                                1
151 Duluth                                               2
152 Durham                                               1
153 East Chicago                                         1
154 East Farmingdale                                     1
155 East Hartford                                        2
156 East Highland Park                                   1
157 East Providence                                      1
158 East Ridge                                           1
159 East Saint Louis                                     1
160 East Syracuse                                        1
161 Edgewood                                             1
162 El Cajon                                             1
163 El Centro                                            1
164 El Dorado                                            1
165 El Paso                                              3
166 Elgin                                                1
167 Elizabeth                                            2
168 Elizabethtown                                        1
169 Elkhart                                              1
170 Emmetsburg                                           1
171 Erie                                                 1
172 Escondido                                            1
173 Essex                                                1
174 Eugene                                               2
175 Eureka                                               2
176 Evansville                                           3
177 Fairfield                                            1
178 Fairhope                                             1
179 Fairmont                                             1
180 Fall River                                           1
181 Farmington                                           1
182 Farrell                                              1
183 Fayetteville                                         1
184 Ferry Pass                                           1
185 Flagstaff                                            1
186 Flint                                                1
187 Follansbee                                           1
188 Fontana                                              1
189 Forest Park                                          1
190 Fort Collins                                         1
191 Fort Defiance                                        1
192 Fort Lee                                             1
193 Fort Myers                                           1
194 Fort Pierce                                          1
195 Fort Smith                                           1
196 Fort Wayne                                           1
197 Fort Worth                                           2
198 Frankfort                                            1
199 Franklin                                             1
200 Freemansburg                                         1
201 Fremont                                              1
202 Fresno                                               2
203 Gadsden                                              1
204 Gainesville                                          3
205 Galloway (Township of)                               1
206 Garden City                                          1
207 Gary                                                 3
208 Gastonia                                             1
209 Gilroy                                               1
210 Glen Burnie                                          1
211 Goldsboro                                            1
212 Gordon                                               1
213 Grand Island                                         1
214 Grand Junction                                       1
215 Grand Rapids                                         2
216 Granite City                                         2
217 Grants Pass                                          1
218 Grass Valley                                         1
219 Great Falls                                          1
220 Greater Upper Marlboro                               1
221 Greeley                                              1
222 Green Bay                                            1
223 Greensboro                                           1
224 Greensburg                                           1
225 Greenville                                           3
226 Greenwich (Township of)                              1
227 Grenada                                              1
228 Griffith                                             1
229 Groveton                                             1
230 Gulfport                                             1
231 Hamilton                                             1
232 Hammond                                              3
233 Hampton                                              1
234 Harrison Township                                    1
235 Harrisville                                          1
236 Hattiesburg                                          1
237 Haverhill                                            1
238 Helena                                               2
239 Helena Valley West Central                           1
240 Hernando                                             1
241 Hickory                                              1
242 Highland                                             1
243 Highland Heights                                     1
244 Hillsboro                                            1
245 Hobbs                                                1
246 Holland                                              1
247 Hollister                                            1
248 Hollywood                                            1
249 Homestead                                            1
250 Hoover                                               1
251 Hopewell (Township of)                               1
252 Hot Springs (Hot Springs National Park)              1
253 Houston                                              2
254 Huntington                                           1
255 Huntsville                                           1
256 Indianapolis                                         1
257 Indianapolis (Remainder)                             4
258 Indio                                                1
259 Iowa City                                            1
260 Ironton                                              1
261 Jackson                                              2
262 Jacksonville                                         2
263 Jamesville                                           1
264 Jasper                                               3
265 Jean                                                 1
266 Jeffersonville                                       1
267 Jenison                                              1
268 Jersey City                                          1
269 Jerseyville                                          1
270 Johnstown                                            1
271 Joliet                                               1
272 Kalamazoo                                            1
273 Kalispell                                            1
274 Kansas City                                          3
275 Keeler                                               1
276 Keene                                                1
277 Kenansville                                          1
278 Kenner                                               1
279 Kennesaw                                             1
280 Keokuk                                               1
281 Kinston                                              1
282 Kokomo                                               1
283 La Crosse                                            1
284 La Grande                                            1
285 Lackawanna                                           1
286 Laconia                                              1
287 Ladue                                                1
288 Lafayette                                            3
289 Lake Charles                                         1
290 Lakeland                                             1
291 Lakeport                                             1
292 Lakeview                                             1
293 Lancaster                                            2
294 Lander                                               1
295 Lansing                                              1
296 Las Cruces                                           1
297 Las Vegas                                            1
298 Laurel                                               1
299 Lawrence                                             1
300 Leander                                              1
301 Lebanon                                              2
302 Leeds                                                1
303 Lexington                                            1
304 Lexington-Fayette (corporate name for Lexington)     2
305 Libby                                                1
306 Liberty                                              1
307 Lincoln                                              1
308 Lindon                                               1
309 Little Rock                                          2
310 Littleton                                            1
311 Live Oak                                             1
312 Livermore                                            1
313 Livonia                                              1
314 Logan                                                1
315 Long Beach                                           2
316 Longmont                                             1
317 Los Angeles                                          1
318 Louisville                                           4
319 Luna Pier                                            1
320 Lynchburg                                            1
321 Lynn                                                 1
322 Lynwood                                              1
323 Macon                                                2
324 Madison                                              1
325 Magna                                                1
326 Mamaroneck                                           1
327 Manistee                                             1
328 Marble City Community                                1
329 Maricopa                                             1
330 Marion                                               2
331 Marrero                                              1
332 Martinsburg                                          1
333 Marysville                                           1
334 McAlester                                            1
335 McCook                                               1
336 McDonald                                             1
337 McLean                                               1
338 Medford                                              1
339 Melbourne                                            1
340 Mena                                                 1
341 Merced                                               1
342 Meridian                                             1
343 Mesa                                                 1
344 Miami                                                2
345 Michigan City                                        1
346 Middlesborough (corporate name for Middlesboro)      1
347 Middletown                                           2
348 Midlothian                                           1
349 Milwaukee                                            5
350 Mingo Junction                                       1
351 Minneapolis                                          2
352 Mira Loma                                            1
353 Mission                                              1
354 Mission Viejo                                        1
355 Missoula                                             1
356 Modesto                                              1
357 Mojave                                               1
358 Monroe                                               1
359 Montgomery                                           1
360 Morgantown                                           1
361 Morristown                                           1
362 Moundsville                                          1
363 Muncie                                               1
364 Muscatine                                            1
365 Muscle Shoals                                        1
366 Muskegon                                             1
367 Muskogee                                             1
368 Naperville                                           1
369 Nashua                                               1
370 Natchez                                              1
371 New Albany                                           1
372 New Haven                                            5
373 New Paris                                            1
374 New York                                             9
375 Newark                                               2
376 Newburgh                                             1
377 Newburgh Heights                                     1
378 Newport                                              1
379 Niagara Falls                                        1
380 Nogales                                              1
381 Norfolk                                              1
382 Normal                                               1
383 Norristown                                           1
384 North Braddock                                       1
385 North Brunswick Township                             1
386 North Charleston                                     1
387 North Las Vegas                                      1
388 North Little Rock                                    1
389 Northbrook                                           1
390 Norwalk                                              1
391 Norwich                                              1
392 Norwood                                              1
393 Not in a city                                      103
394 Oak Park                                             1
395 Oakland                                              2
396 Oakridge                                             1
397 Odessa                                               1
398 Ogden                                                1
399 Ogden Dunes (Wickliffe)                              1
400 Oglesby                                              1
401 Oklahoma City                                        2
402 Omaha                                                2
403 Onamia                                               1
404 Ontario                                              1
405 Orlando                                              1
406 Overland Park                                        1
407 Paducah                                              1
408 Painesville                                          1
409 Palm Springs                                         1
410 Palm Springs North                                   1
411 Panama City                                          1
412 Pasadena                                             1
413 Pascagoula                                           1
414 Paterson                                             1
415 Pawtucket                                            1
416 Peach Springs                                        1
417 Pelham                                               1
418 Pendleton                                            1
419 Pennsauken (Pensauken)                               1
420 Peoria                                               1
421 Phenix City                                          1
422 Philadelphia                                         5
423 Phillipsburg                                         1
424 Phoenix                                              3
425 Pico Rivera                                          1
426 Pikeville                                            1
427 Pinedale                                             1
428 Pinehurst (Pine Creek)                               1
429 Pinson                                               1
430 Piru                                                 1
431 Pittsboro                                            1
432 Pittsburgh                                           1
433 Pittsfield                                           1
434 Platteville                                          1
435 Pleasant Prairie                                     1
436 Pompano Beach                                        1
437 Port Arthur                                          1
438 Port Huron                                           1
439 Portland                                             2
440 Portola                                              1
441 Portsmouth                                           2
442 Potosi                                               1
443 Potsdam                                              1
444 Powder Springs                                       1
445 Prescott Valley                                      1
446 Presque Isle                                         1
447 Providence                                           2
448 Provo                                                1
449 Pryor (corporate name Pryor Creek)                   1
450 Pueblo                                               1
451 Quincy                                               2
452 Rahway                                               1
453 Raleigh                                              1
454 Rapid City                                           2
455 Ravenna                                              1
456 Redding                                              1
457 Redwood City                                         1
458 Reno                                                 1
459 Reseda                                               1
460 Richfield                                            1
461 Richmond                                             1
462 Ridge Wood Heights                                   1
463 Ridgecrest                                           1
464 Rio Rancho Estates                                   1
465 Riverside                                            1
466 Roanoke                                              2
467 Rochester                                            2
468 Rock Island Arsenal (U.S. Army)                      1
469 Rock Springs                                         1
470 Rockford                                             1
471 Rockwell                                             1
472 Rocky Mount                                          1
473 Rome                                                 1
474 Roseville                                            1
475 Roswell                                              1
476 Roxborough Park                                      1
477 Royal Palm Beach                                     1
478 Rubidoux                                             1
479 Russellville                                         1
480 Rutland                                              1
481 Sacramento                                           2
482 Saint Petersburg                                     1
483 Salinas                                              1
484 Salt Lake City                                       2
485 San Andreas                                          1
486 San Antonio                                          2
487 San Bernardino                                       1
488 San Diego                                            2
489 San Francisco                                        1
490 San Jose                                             1
491 San Luis Obispo                                      1
492 Sandersville                                         1
493 Sanford                                              1
494 Santa Barbara                                        1
495 Santa Fe                                             1
496 Santa Maria                                          1
497 Santa Rosa                                           1
498 Sault Ste. Marie                                     2
499 Savannah                                             1
500 Schiller Park                                        1
501 Scottsbluff                                          1
502 Scottsdale                                           1
503 Scranton                                             1
504 Seaford                                              1
505 Searcy                                               1
506 Seattle                                              2
507 Seeley Lake                                          1
508 Seven Oaks                                           1
509 Shakopee                                             1
510 Sharonville                                          1
511 Shasta Lake                                          1
512 Sheffield                                            1
513 Shepherdsville                                       1
514 Sheridan                                             2
515 Shreveport                                           1
516 Silver City                                          1
517 Simi Valley                                          1
518 Sioux City                                           1
519 Sioux Falls                                          2
520 Soddy-Daisy                                          1
521 South Bend                                           2
522 South Charleston                                     1
523 South Padre Island                                   1
524 Spanish Fork                                         1
525 Spokane                                              1
526 Springdale                                           1
527 Springfield                                          5
528 Spruce Pine                                          1
529 St. Bernard                                          1
530 St. Cloud                                            1
531 St. Joseph                                           1
532 St. Louis                                            4
533 St. Louis Park                                       1
534 St. Paul                                             3
535 State College                                        1
536 Ste. Genevieve                                       1
537 Steubenville                                         1
538 Stockton                                             1
539 Stuttgart                                            1
540 Summit                                               1
541 Suncook                                              1
542 Swansea                                              1
543 Tacoma                                               1
544 Tallahassee                                          1
545 Tampa                                                1
546 Taylors                                              1
547 Tecumseh                                             1
548 Terre Haute                                          2
549 Texarkana                                            1
550 Theodore                                             1
551 Thomaston                                            1
552 Thompson Falls                                       1
553 Thousand Oaks                                        1
554 Toledo                                               3
555 Toms River                                           1
556 Tooele                                               1
557 Topeka                                               1
558 Trenton                                              1
559 Truckee                                              1
560 Tucson                                               2
561 Tulsa                                                2
562 Tupelo                                               1
563 Tuscaloosa                                           1
564 Ukiah                                                1
565 Underhill (Town of)                                  1
566 Union City                                           1
567 Valdosta                                             1
568 Vallejo                                              1
569 Valrico                                              1
570 Vancouver                                            1
571 Victorville                                          1
572 Vienna                                               1
573 Vinton                                               1
574 Virginia                                             1
575 Virginia Beach                                       1
576 Visalia                                              1
577 Warner Robins                                        1
578 Warren                                               1
579 Washington                                           4
580 Waterbury                                            1
581 Waterloo                                             1
582 Watertown                                            1
583 Waukesha                                             1
584 Waynesville                                          1
585 Weirton                                              2
586 West Orange                                          1
587 West Yellowstone                                     1
588 Westfield                                            1
589 Westport                                             1
590 Wheeling                                             1
591 Whitefish                                            1
592 Wichita                                              3
593 Wilmington                                           1
594 Winston-Salem                                        2
595 Winter Park                                          1
596 Wood River                                           1
597 Woodland                                             1
598 Worcester                                            2
599 Wyandotte                                            1
600 Yakima                                               1
601 Yellow Springs                                       1
602 York                                                 1
603 Youngstown                                           2
604 Ypsilanti                                            1
605 Yuba City                                            1
606 Yuma                                                 1
607 Zion                                                 1

Note: Recall that you must specify the retain = TRUE argument of the prep() function in order to use bake() to see the pre-processed training data.

Step 3: Extract pre-processed testing data using bake()


According to the tidymodels documentation:

bake() takes a trained recipe and applies the operations to a data set to create a design matrix. For example: it applies the centering to new data sets using these means used to create the recipe.

Therefore, if you wanted to look at the pre-processed testing data you would use the bake() function of the recipes package. (You generally want to leave your testing data alone, but it is good to look for issues like the introduction of NA values).

Let’s bake!

# Scroll through the output!
baked_test_pm <- recipes::bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
Rows: 292
Columns: 37
$ id                          <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value                       <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips                        <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat                         <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon                         <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ                        <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area                   <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop                    <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500                    <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000                  <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area                 <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop                  <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec          <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000         <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000        <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500       <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000      <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000      <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000     <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county              <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta                <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs                        <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs                      <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs                          <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege                 <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate                   <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor                    <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad                        <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov                         <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless                   <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006                     <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod                         <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city          <dbl> NA, NA, NA, 0, 1, 1, 1, NA, NA, NA, 0, NA,…

Notice that our city_Not.in.a.city variable appears to contain NA values. Why might that be?

Ah! Perhaps it is because some of the city levels in the test set were not previously seen in the training set!

Let’s check using the set operations of the dplyr package. We can compare which cities differ between the training and test sets.

traincities <- train_pm %>% distinct(city)
testcities <- test_pm %>% distinct(city)

# get the number of cities in the training set that do not appear in the test set
dim(dplyr::setdiff(traincities, testcities))
[1] 381   1
# get the number of cities that appear in both sets
dim(dplyr::intersect(traincities, testcities))
[1] 55  1

Indeed, there are lots of cities in one data set that are not in the other!

So, let’s go back to our pm data set and modify the city variable to take just two values, "In a city" or "Not in a city", using the case_when() function of dplyr. This function allows you to vectorize multiple if_else() statements.

pm %>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))
# A tibble: 876 × 50
   id     value fips    lat   lon state county city   CMAQ zcta  zcta_…¹ zcta_…²
   <fct>  <dbl> <fct> <dbl> <dbl> <chr> <chr>  <chr> <dbl> <fct>   <dbl>   <dbl>
 1 1003.…  9.60 1003   30.5 -87.9 Alab… Baldw… In a…  8.10 36532  1.91e8   27829
 2 1027.… 10.8  1027   33.3 -85.8 Alab… Clay   In a…  9.77 36251  3.74e8    5103
 3 1033.… 11.2  1033   34.8 -87.7 Alab… Colbe… In a…  9.40 35660  1.67e7    9042
 4 1049.… 11.7  1049   34.3 -86.0 Alab… DeKalb In a…  8.53 35962  2.04e8    8300
 5 1055.… 12.4  1055   34.0 -86.0 Alab… Etowah In a…  9.24 35901  1.54e8   20045
 6 1069.… 10.5  1069   31.2 -85.4 Alab… Houst… In a…  9.12 36303  1.63e8   30217
 7 1073.… 15.6  1073   33.6 -86.8 Alab… Jeffe… In a… 10.2  35207  2.69e7    9010
 8 1073.… 12.4  1073   33.3 -87.0 Alab… Jeffe… Not … 10.2  35111  1.66e8   16140
 9 1073.… 11.1  1073   33.5 -87.3 Alab… Jeffe… Not …  8.16 35444  3.86e8    3699
10 1073.… 13.1  1073   33.5 -86.5 Alab… Jeffe… In a…  9.30 35094  1.49e8   14212
# … with 866 more rows, 38 more variables: imp_a500 <dbl>, imp_a1000 <dbl>,
#   imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
#   county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Alternatively, you could create a custom step function to do this and add it to your recipe, but that is beyond the scope of this case study.

We will need to repeat all the steps (splitting the data, pre-processing, etc.) since the levels of our variables have now changed.

While we are doing this, note that we might also have this issue for the county variable.

The county variable appears to get dropped due to either correlation or near-zero variance.

It is likely due to near-zero variance, because county is the most granular of these geographic categorical variables and is therefore sparse.
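To see why county is a near-zero-variance candidate, we can count monitors per county. This is a quick sketch (not part of the original analysis), assuming the pm data set as loaded earlier; after dummy coding, a county represented by a single monitor becomes an indicator column that is almost entirely zeros, which is exactly what step_nzv() filters out.

```r
library(dplyr)

# Sketch: how many counties are there, and what fraction contain
# only a single monitor? A high fraction means the dummy-coded
# county columns would be sparse and unbalanced.
pm %>%
  count(county, sort = TRUE) %>%
  summarize(n_counties = n(),
            prop_single_monitor = mean(n == 1))
```

(Counties in different states can share a name, so this is only a rough check, but it illustrates the sparsity.)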

pm %<>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))

set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
<Training/Testing/Total>
<584/292/876>
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

Question Opportunity

See if you can come up with the code for the new recipe.


Click here to reveal the code for the new recipe.
novel_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
    step_corr(all_numeric()) %>%
    step_nzv(all_numeric()) 

novel_rec
Recipe

Inputs:

        role #variables
   county id          1
 id variable          1
     outcome          1
   predictor         47

Operations:

Dummy variables from state, county, city, zcta
Correlation filter on all_numeric()
Sparse, unbalanced variable filter on all_numeric()

Now let’s prep the new recipe on our training data and try baking our test data again.

Question Opportunity

Do you recall how to pre-process and extract the pre-processed training data?


Click here to reveal the answer.
prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.27 Mb  in memory.
baked_train <- bake(prepped_rec, new_data = NULL)

# Scroll through the output!
glimpse(baked_train)
Rows: 584
Columns: 38
$ id                          <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value                       <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips                        <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat                         <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon                         <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ                        <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area                   <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop                    <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500                    <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000                  <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area                 <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop                  <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec          <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000        <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500       <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000      <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000      <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000     <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_prisec_length_25000     <dbl> 13.98749, 13.15082, 13.44293, 13.58697, 14…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county              <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta                <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs                        <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs                      <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs                          <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege                 <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate                   <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor                    <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad                        <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov                         <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless                   <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006                     <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod                         <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California            <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …

And now, let’s try baking our test set to see if we still have NA values.

# Scroll through the output!
baked_test_pm <- bake(prepped_rec, new_data = test_pm)

glimpse(baked_test_pm)
Rows: 292
Columns: 38
$ id                          <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value                       <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips                        <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat                         <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon                         <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ                        <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area                   <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop                    <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500                    <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000                  <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area                 <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop                  <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec          <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000         <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000        <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500       <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000      <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000      <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000     <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_prisec_length_25000     <dbl> 13.79973, 13.70026, 13.85550, 14.45221, 13…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county              <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta                <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs                        <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs                      <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs                          <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege                 <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate                   <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor                    <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad                        <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov                         <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless                   <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006                     <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod                         <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …

Great, now we no longer have NA values! :)

Note: be careful if you use the skip option for some of the pre-processing steps. The juice() function (an older alternative that you can still use if you prefer it to bake() with new_data = NULL) will show all of the results, ignoring skip = TRUE, while the bake() function will not necessarily apply those steps to new data.
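To make the skip behavior concrete, here is a minimal sketch (assuming the train_pm and test_pm objects from above). step_filter() in recipes has skip = TRUE by default, so it is applied to the training data during prep() but skipped when bake() is applied to new data; the value 20 is an arbitrary illustrative threshold.

```r
library(recipes)

# A recipe with a row-filtering step; skip = TRUE is the default
# for step_filter(), so it only affects the training data.
rec_skip <- recipe(value ~ ., data = train_pm) %>%
  step_filter(value < 20) %>%
  prep(retain = TRUE)

nrow(juice(rec_skip))                    # fewer rows: the filter was applied
nrow(bake(rec_skip, new_data = test_pm)) # all 292 test rows: the step was skipped
```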

Specifying the model


So far we have used the rsample package to split the data, and the recipes package to assign variable roles and to specify and prep our pre-processing (as well as to optionally extract the pre-processed data).

We will now use the parsnip package (which is similar to the earlier caret package - hence why it is named after a vegetable) to specify our model.

There are four things we need to define about our model:

  1. The type of model (using specific functions in parsnip like rand_forest(), logistic_reg() etc.)
  2. The package or engine that we will use to implement the type of model selected (using the set_engine() function)
  3. The mode of learning - classification or regression (using the set_mode() function)
  4. Any arguments necessary for the model/package selected (using the set_args() function - for example, the mtry = argument for random forest, which is the number of variables to be considered as candidates for splitting at each tree node)

Let’s walk through these steps one by one. For our case, we are going to start our analysis with a linear regression (a very common starting point for modeling) but we will demonstrate how we can try different models. We will also show how to model with the Random Forest method (which is very widely used) later on.

The first step is to define what type of model we would like to use. See here for modeling options in parsnip.

PM_model <- parsnip::linear_reg() # PM was used in the name for particulate matter
PM_model
Linear Regression Model Specification (regression)

Computational engine: lm 

OK. So far, all we have defined is that we want to use a linear regression…
Let’s tell parsnip more about what we want.

We would like to use the ordinary least squares method to fit our linear regression. So we will tell parsnip that we want to use the lm package to implement our linear regression (there are actually many options, such as rstan, glmnet, keras, and sparklyr). See here for a description of the differences between these engines and how to use them with parsnip.

We will do so by using the set_engine() function of the parsnip package.

lm_PM_model <- PM_model  %>%
  parsnip::set_engine("lm")

lm_PM_model
Linear Regression Model Specification (regression)

Computational engine: lm 

In some cases a package can perform either classification or regression, so it is a good idea to specify which mode you intend to use. Here, we aim to predict air pollution levels, a continuous outcome. You can set the mode with the set_mode() function of the parsnip package, by using either set_mode("classification") or set_mode("regression").

lm_PM_model <- PM_model  %>%
  parsnip::set_engine("lm") %>%
  set_mode("regression")

lm_PM_model
Linear Regression Model Specification (regression)

Computational engine: lm 
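The three calls above can also be chained in one expression, and as a preview of step 4 (set_args()), here is a hedged sketch of how a random forest specification with the mtry argument might look. The mtry value of 10 is purely illustrative, and the random forest model is covered properly later in the case study.

```r
# One-chunk version of the linear model specification above:
lm_PM_model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm") %>%
  parsnip::set_mode("regression")

# Sketch of set_args() from step 4 using a random forest
# (demonstrated properly later on); mtry is the number of
# predictors sampled as split candidates at each tree node:
rf_sketch <- parsnip::rand_forest() %>%
  parsnip::set_engine("randomForest") %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_args(mtry = 10)
```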

Fitting the model


We can use the parsnip package with a newer package called workflows to fit our model.

The workflows package allows us to keep track of both our pre-processing steps and our model specification. It also allows us to implement fancier optimizations in an automated way and it can also handle post-processing operations.

We begin by creating a workflow using the workflow() function in the workflows package.

Next, we use add_recipe() (our pre-processing specifications) and we add our model with the add_model() function – both functions from the workflows package.

Note: We do not need to actually prep() our recipe before using workflows!

If you recall, novel_rec is the recipe we previously created with the recipes package, and lm_PM_model was created when we specified our model with the parsnip package. Here, we combine everything together into a workflow.

PM_wflow <- workflows::workflow() %>%
            workflows::add_recipe(novel_rec) %>%
            workflows::add_model(lm_PM_model)
PM_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Ah, nice. Notice how it tells us about both our pre-processing steps and our model specifications.

Next, we “prepare the recipe” (or estimate the parameters) and fit the model to our training data all at once. Printing the output, we can see the coefficients of the model.

PM_wflow_fit <- parsnip::fit(PM_wflow, data = train_pm)
PM_wflow_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)                          lat  
                  2.936e+02                    3.261e-02  
                        lon                         CMAQ  
                  1.586e-02                    2.463e-01  
                  zcta_area                     zcta_pop  
                 -3.433e-10                    1.013e-05  
                   imp_a500                   imp_a15000  
                  5.064e-03                   -3.066e-03  
                county_area                   county_pop  
                 -2.324e-11                   -7.576e-08  
         log_dist_to_prisec          log_pri_length_5000  
                  6.214e-02                   -2.006e-01  
       log_pri_length_25000        log_prisec_length_500  
                 -5.411e-02                    2.204e-01  
     log_prisec_length_1000       log_prisec_length_5000  
                  1.154e-01                    2.374e-01  
    log_prisec_length_10000      log_prisec_length_25000  
                 -3.436e-02                    5.224e-01  
log_nei_2008_pm10_sum_10000  log_nei_2008_pm10_sum_15000  
                  1.829e-01                   -2.355e-02  
log_nei_2008_pm10_sum_25000               popdens_county  
                  2.403e-02                    2.203e-05  
               popdens_zcta                         nohs  
                 -2.132e-06                   -2.983e+00  
                     somehs                           hs  
                 -2.956e+00                   -2.962e+00  
                somecollege                    associate  
                 -2.967e+00                   -2.999e+00  
                   bachelor                         grad  
                 -2.979e+00                   -2.978e+00  
                        pov                    hs_orless  
                  1.859e-03                           NA  
                    urc2006                          aod  
                  2.577e-01                    1.535e-02  
           state_California           city_Not.in.a.city  
                  3.114e+00                   -4.250e-02  

Click here to see the steps that the workflows package performs that used to be required

Previously, the processed training data (baked_train), as opposed to the raw training data, would be required to fit the model.

In this case, we would actually also need to write the model again! Recall that id and fips are ID variables and that values is our outcome of interest (the air pollution measure at each monitor). It is nice that workflows keeps track of this!

baked_train_ready <- baked_train %>% 
  select(-id, -fips)

PM_fit <- lm_PM_model %>% 
  parsnip::fit(value ~., data = baked_train_ready)

Assessing the model fit


After we fit our model, we can use the broom package to look at the output from the fitted model in an easy/tidy way.

The tidy() function returns a tidy data frame with coefficients from the model (one row per coefficient).

Many other broom functions currently only work with parsnip objects, not raw workflows objects.

However, we can use the tidy function if we first use the extract_fit_parsnip() function which is imported as part of the workflows package from the hardhat package (also part of tidymodels).

wflowoutput <- PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  broom::tidy() 
wflowoutput
# A tibble: 36 × 5
   term         estimate std.error statistic       p.value
   <chr>           <dbl>     <dbl>     <dbl>         <dbl>
 1 (Intercept)  2.94e+ 2  1.18e+ 2     2.49  0.0130       
 2 lat          3.26e- 2  2.28e- 2     1.43  0.153        
 3 lon          1.59e- 2  1.01e- 2     1.58  0.115        
 4 CMAQ         2.46e- 1  3.97e- 2     6.20  0.00000000108
 5 zcta_area   -3.43e-10  1.60e-10    -2.15  0.0320       
 6 zcta_pop     1.01e- 5  5.33e- 6     1.90  0.0578       
 7 imp_a500     5.06e- 3  7.42e- 3     0.683 0.495        
 8 imp_a15000  -3.07e- 3  1.16e- 2    -0.263 0.792        
 9 county_area -2.32e-11  1.97e-11    -1.18  0.238        
10 county_pop  -7.58e- 8  9.29e- 8    -0.815 0.415        
# … with 26 more rows
# ℹ Use `print(n = ...)` to see more rows

We have fit our model on our training data, which means we have created a model to predict values of air pollution based on the predictors that we have included. Yay!

One last thing before we leave this section. We are often interested in getting a sense of which variables are the most important in our model. We can explore variable importance using the vip() function of the vip package. This function creates a bar plot of variable importance scores for each predictor variable (or feature) in a model, ordered from largest to smallest importance.

Notice again that we need to use the extract_fit_parsnip() function.

Let’s take a look at the top 10 contributing variables:

PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

The location of the monitor (being in California versus another state), the CMAQ model, and the aod satellite information appear to be the most important for predicting the air pollution at a given monitor.

Indeed, if we plot monitor values for those in California relative to other states, we can see that there are some high values for monitors in California. This may be playing a role in what we are seeing. Here we assume that you have some experience plotting with the ggplot2 package. If not, please see this case study.

Click here for an introduction about this package if you are new to using ggplot2

The ggplot2 package is generally intuitive for beginners because it is based on a grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.

The critical part of making graphics using ggplot2 is that the data needs to be in a tidy format. Given that we have just spent time putting our data in tidy format, we are primed to take advantage of all that ggplot2 has to offer!

We will show how it is easy to pipe tidy data (output) as input to other functions that create plots. This all works because we are working within the tidyverse.

What is the ggplot() function? As explained by Hadley Wickham:

The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.

ggplot2 Terminology:

  • ggplot - the main function where you specify the dataset and variables to plot (this is where we define the x and y variable names)
  • geoms - geometric objects
    • e.g. geom_point(), geom_bar(), geom_line(), geom_histogram()
  • aes - aesthetics
    • shape, transparency, color, fill, line types
  • scales - define how your data will be plotted
    • continuous, discrete, log, etc

The function aes() is an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.

Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.

baked_train_ready %>% 
  mutate(state_California = as.factor(state_California)) %>%
  mutate(state_California = recode(state_California, 
                                   "0" = "Not California", 
                                   "1" = "California")) %>%
  ggplot(aes(x = state_California, y = value)) + 
  geom_boxplot() +
  geom_jitter(width = .05) + 
  xlab("Location of Monitor")

Model performance


In this next section, our goal is to assess the overall model performance. The way we do this is to compare the similarity between the predicted estimates of the outcome variable produced by the model and the true outcome variable values.

If you recall the What is machine learning? section, we showed how to think about machine learning (ML) as an optimization problem that tries to minimize the distance between our predicted outcome \(\hat{Y} = f(X)\) and actual outcome \(Y\) using our features (or predictor variables) \(X\) as input to a function \(f\) that we want to estimate.

\[d(Y - \hat{Y})\]

As our goal in this section is to assess overall model performance, we will now talk about different distance metrics that you can use.

First, let’s pull out our predicted outcome values \(\hat{Y} = f(X)\) from the models we fit (using different approaches).

wf_fit <- PM_wflow_fit %>% 
  extract_fit_parsnip()

wf_fitted_values <- fitted(wf_fit[["fit"]])
head(wf_fitted_values)
        1         2         3         4         5         6 
12.186782  9.139406 12.646119 10.377628 11.909934  9.520860 

Alternatively, we can get the fitted values using the augment() function of the broom package using the output from workflows:

wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) %>% 
  select(value, .fitted:.std.resid)

head(wf_fitted_values)
# A tibble: 6 × 6
  value .fitted   .hat .sigma   .cooksd .std.resid
  <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
1 11.7    12.2  0.0370   2.05 0.0000648     -0.243
2  6.96    9.14 0.0496   2.05 0.00179       -1.09 
3 13.3    12.6  0.0484   2.05 0.000151       0.322
4 10.7    10.4  0.0502   2.05 0.0000504      0.183
5 14.5    11.9  0.0243   2.05 0.00113        1.26 
6 12.2     9.52 0.476    2.04 0.0850         1.81 

Note that because we use the actual workflow here, we can (and actually need to) use the raw data instead of the pre-processed data.

values_pred_train <- 
  predict(PM_wflow_fit, train_pm) %>% 
  bind_cols(train_pm %>% select(value, fips, county, id)) 

values_pred_train
# A tibble: 584 × 5
   .pred value fips  county           id        
   <dbl> <dbl> <fct> <chr>            <fct>     
 1 12.2  11.7  18003 Allen            18003.0004
 2  9.14  6.96 55041 Forest           55041.0007
 3 12.6  13.3  6065  Riverside        6065.1003 
 4 10.4  10.7  39009 Athens           39009.0003
 5 11.9  14.5  39061 Hamilton         39061.8001
 6  9.52 12.2  24510 Baltimore (City) 24510.0006
 7 12.6  11.2  6061  Placer           6061.0006 
 8 10.3   6.98 6065  Riverside        6065.5001 
 9  8.74  6.61 44003 Kent             44003.0002
10 10.0  11.6  37111 McDowell         37111.0004
# … with 574 more rows
# ℹ Use `print(n = ...)` to see more rows

Visualizing model performance


Now, we can compare the predicted outcome values (or fitted values) \(\hat{Y}\) to the actual outcome values \(Y\) that we observed:

wf_fitted_values %>% 
  ggplot(aes(x =  value, y = .fitted)) + 
  geom_point() + 
  xlab("actual outcome values") + 
  ylab("predicted outcome values")

OK, so our range of the predicted outcome values appears to be smaller than the real values. We could probably do a bit better.

Quantifying model performance


Next, let’s use different distance functions \(d(\cdot)\) to assess how far off our predicted outcome \(\hat{Y} = f(X)\) and actual outcome \(Y\) values are from each other:

\[d(Y - \hat{Y})\]

As mentioned, there are entire scholarly fields of research dedicated to identifying different distance metrics \(d(\cdot)\) for machine learning applications. However, when performing prediction with a continuous outcome \(Y\), a few of the most commonly used distance metrics are:

  1. mean absolute error (mae)

\[MAE = \frac{\sum_{i=1}^{n}{|\hat{y_i}- y_i|}}{n}\]

  2. R squared (rsq) – this is also known as the coefficient of determination, which is the squared correlation between truth and estimate

This is calculated as 1 minus the ratio of the residual sum of squares (\(SS_{res}\)) to the total sum of squares (\(SS_{tot}\)):

\[RSQ = R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

\[SS_{tot} = \sum_{i=1}^{n}{(y_i- \bar{y})}^2\]

The total sum of squares is proportional to the variance of the data. It is calculated as the sum of the squared differences between each true value (\(y_i\)) and the mean of the true values (\(\bar{y}\)).

\[SS_{res} = \sum_{i=1}^{n}{(y_i- \hat{y_i})}^2\]

The sum of squares of residuals is calculated as the sum of the squared differences between each predicted value (\(\hat{y_i}\), sometimes written \(f_i\)) and the true value (\(y_i\)).

  3. root mean squared error (rmse)

\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y_i}- y_i)}^2}{n}}\]

One way to calculate these metrics within the tidymodels framework is to use the yardstick package using the metrics() function.

Note that you may obtain different results depending on your version of R and the package versions you are using. See Session Info section to learn what we used.

yardstick::metrics(wf_fitted_values, 
                   truth = value, estimate = .fitted)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.98 
2 rsq     standard       0.392
3 mae     standard       1.47 

Alternatively if you only wanted one metric you could use the mae(), rsq(), or rmse() functions, respectively.

yardstick::mae(wf_fitted_values, 
               truth = value, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        1.47

The lower the error values, the better the performance. RMSE and MAE can range from zero to infinity, so we aren’t doing too badly. The MAE value suggests that the average difference between the predicted value and the real value was 1.47 ug/m3. The range of the values was 3-22 in the training data, so this is a relatively small amount.

The difference between the RMSE and the MAE can indicate the variance of the errors. Since the RMSE and MAE were similar, the errors were quite consistent across the range of values, and large errors (where our prediction would have been really far off) are unlikely to have occurred.

The R squared value indicates how much of the variability in the outcome could be explained by the predictors in the model. This value ranges from 0 to 1 (sometimes -1 depending on the method used for calculation). A value of 1 would indicate that the model perfectly predicted the outcome. Our value indicates that 39% of the variability of the air pollution measures could be explained by the model. So we could maybe do a bit better. See here for more information.
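To make the three formulas above concrete, here is a minimal base-R sketch that computes the same metrics by hand. The toy vectors are made up for illustration; they are not our monitor data.

```r
# Toy true and predicted outcome values (hypothetical, for illustration only)
y     <- c(10, 12, 9, 15, 11)
y_hat <- c(11, 11, 10, 13, 12)

mae  <- mean(abs(y_hat - y))                           # mean absolute error
rmse <- sqrt(mean((y_hat - y)^2))                      # root mean squared error
rsq  <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)  # 1 - SSres/SStot

round(c(mae = mae, rmse = rmse, rsq = rsq), 3)
```

On real model output, yardstick is the more convenient route; this sketch just unpacks the arithmetic. Note that yardstick’s rsq uses the squared correlation between truth and estimate, which can differ slightly from the 1 - SSres/SStot definition used here.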

Cross validation


Until now we have used everything in our “training” dataset (and have not touched the “testing” dataset) from the rsample package to build our machine learning (ML) model \(\hat{Y} = f(X)\) (or to estimate \(f\) using the features or predictor variable \(X\)).

Here, we take this beyond the simple split into training and testing data sets. We will use the rsample package again in order to further implement what are called cross validation techniques. This is also called re-sampling or repartitioning.

Note: we are not actually getting new samples from the underlying distribution so the term re-sampling is a bit of a misnomer.

Cross validation splits our training data into multiple training data sets to allow for a deeper assessment of the accuracy of the model.

Here is a visualization of the concept for cross validation/resampling/repartitioning from Max Kuhn:

Technically, creating our testing and training set out of our original training data is sometimes considered a form of cross validation, called the holdout method. The reason we do this is so we can get a better sense of the accuracy of our model using data that we did not train it on.

However, we can actually do a better job of optimizing our model for accuracy if we also perform another type of cross validation on the newly defined training set that we just created. There are many cross validation methods and most can be easily implemented using rsample package. Here, we will use a very popular method called either k-fold or v-fold cross validation.

This method essentially involves performing the holdout method iteratively with the training data.

First, the training set is divided into \(v\) (often also called \(k\)) equally sized smaller pieces.

Next, the model is trained iteratively on \(v-1\) subsets of the data (holding out a different fold each time until every fold has served as the held-out set) to get a sense of the performance of the model. This is really useful for fine tuning specific aspects of the model in a process called model tuning, which we will learn about in the next section.
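The iteration just described can be sketched in a few lines of base R. This toy example (12 made-up rows, 4 folds) only illustrates the partitioning idea; it is not what rsample does internally:

```r
set.seed(1234)
n <- 12   # pretend training set size (hypothetical)
v <- 4    # number of folds

# Randomly assign each row to exactly one of the v folds
fold <- sample(rep(1:v, length.out = n))

for (k in 1:v) {
  assess_rows   <- which(fold == k)  # held out for performance assessment
  analysis_rows <- which(fold != k)  # used to fit the model this iteration
  # fitting on analysis_rows and predicting on assess_rows would go here
}

table(fold)  # each fold holds n / v = 3 rows
```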

Here is a visualization of how the folds are created:

Note: People typically ignore spatial dependence with cross validation of air pollution monitoring data in the air pollution field, so we will do the same. However, it might make sense to leave out blocks of monitors rather than random individual monitors to help account for some spatial dependence.

Creating the \(v\)-folds using rsample


The vfold_cv() function of the rsample package can be used to parse the training data into folds for \(v\)-fold cross validation.

  • The v argument specifies the number of folds to create.
  • The repeats argument specifies if any samples should be repeated across folds - default is FALSE
  • The strata argument specifies a variable to stratify samples across folds - just like in initial_split().

Again, because these are created at random, we need to use the base set.seed() function in order to obtain the same results each time we knit this document. Generally speaking, using 10 folds is good practice, but this depends on the variability within your data. We are going to use 4 for the sake of expediency in this demonstration.

set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
#  4-fold cross-validation 
# A tibble: 4 × 2
  splits            id   
  <list>            <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4
pull(vfold_pm, splits)
[[1]]
<Analysis/Assess/Total>
<438/146/584>

[[2]]
<Analysis/Assess/Total>
<438/146/584>

[[3]]
<Analysis/Assess/Total>
<438/146/584>

[[4]]
<Analysis/Assess/Total>
<438/146/584>

Now we can see that we have created 4 folds of the data and we can see how many values were set aside for testing (called assessing for cross validation sets) and training (called analysis for cross validation sets) within each fold.

Once the folds are created they can be used to evaluate performance by fitting the model to each of the re-samples that we created:

Assessing model performance on \(v\)-folds using tune


We can fit the model to our cross validation folds using the fit_resamples() function of the tune package, by specifying our workflow object and the cross validation fold object we just created. See here for more information.

resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm)

We can now take a look at various performance metrics based on the fit of our cross validation “resamples”. To do this we will use the show_best() function of the tune package.

tune::show_best(resample_fit, metric = "rmse")
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard    2.12     4  0.0444 Preprocessor1_Model1

Here we can see the mean RMSE value across all four folds. The function is called show_best() because it is also used for model tuning, where it shows the parameter combination with the best performance; we will discuss this more later in the case study when we test different models with different parameters. For now, this gives us a more nuanced estimate of the RMSE for this single model by looking at its performance across subsets of the training data.

Data Analysis


If you have been following along but stopped and are starting here, you can load the wrangled data using the following command:

load(here::here( "data", "wrangled", "wrangled_pm.rda"))

If you skipped the previous sections, click here for a step-by-step guide on how to download and load the data.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data .rda file using the following function:

wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))

If the package does not work for you, you may also download this .rda file by clicking this link here.

To load the downloaded data into your environment, you may double click on the .rda file in RStudio or use the load() function.

To copy and paste our code below, place the downloaded .rda file in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data”. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_pm.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.


Then you can modify the data to match what we did for the previous model: prepare the city variable and split the data into testing and training sets. (You will see later that this is required due to the number of levels of the city variable in the original data.)

pm %<>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))

set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
<Training/Testing/Total>
<584/292/876>
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

We will also split our data into cross validation folds:

set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
#  4-fold cross-validation 
# A tibble: 4 × 2
  splits            id   
  <list>            <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4

In the previous section, we demonstrated how to build a machine learning model (specifically a linear regression model) to predict air pollution with the tidymodels framework.

In the next few sections, we will demonstrate a very different kind of machine learning model. This will allow us to see if that model has better prediction performance.

Random Forest


Now, to try to see if we can get better prediction performance, we are going to predict our outcome variable (air pollution) using a decision tree method called random forest.

A decision tree is a tool to partition data or anything really, based on a series of sequential (often binary) decisions, where the decisions are chosen based on their ability to optimally split the data.

Here you can see a simple example:

[source]

In the case of random forest, multiple decision trees are created - hence the name forest, and each tree is built using a random subset of the training data (with replacement) - hence the full name random forest. This random aspect helps to keep the algorithm from overfitting the data.

The mean of the predictions from each of the trees is used in the final output.
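For example, here is a hypothetical sketch of that final averaging step; the per-tree predictions below are made up, standing in for the output of real fitted trees:

```r
# Each column holds one (made-up) tree's predictions for four observations
tree_preds <- cbind(tree1 = c(10, 12, 9, 15),
                    tree2 = c(11, 11, 10, 14),
                    tree3 = c(12, 13, 8, 16))

# The random forest regression prediction is the per-observation mean
forest_pred <- rowMeans(tree_preds)
forest_pred
# 11 12  9 15
```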

Overall, a major distinction from our last regression model is that random forest allows us to use our categorical data largely as is. There is no need to recode these predictors to be numerical. In our case, we are going to use the random forest method of the randomForest package.

This package is currently not compatible with categorical variables that have more than 53 levels. See here for the documentation about when this was updated from 25 levels. Thus we will remove the zcta and county variables. This is also why it is good that we modified the city variable down to two levels: it originally had nearly 600, which would not have been compatible with this package.

Note that the step_novel() function is necessary here for the state variable to get all cross validation folds to work, because there will be different levels included in each fold test and training sets. The new levels for some of the test sets would otherwise result in an error.

According to the documentation for the recipes package:

step_novel creates a specification of a recipe step that will assign a previously unseen factor level to a new value.

RF_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor")%>%
    update_role(value, new_role = "outcome")%>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_novel("state") %>%
    step_string2factor("state", "county", "city") %>%
    step_rm("county") %>%
    step_rm("zcta") %>%
    step_corr(all_numeric())%>%
    step_nzv(all_numeric())

The rand_forest() function of the parsnip package has three important arguments that act as an interface for the different possible engines to perform a random forest analysis:

  1. mtry - The number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3.
  2. min_n - The minimum number of data points in a node that are required for the node to be split further.
  3. trees - the number of trees in the ensemble
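As a quick aside, that default for mtry is easy to compute yourself. In this hypothetical sketch, the predictor count p = 30 is made up for illustration:

```r
# randomForest's regression default for mtry: floor(p / 3), but at least 1
p <- 30                              # hypothetical number of predictors
default_mtry <- max(floor(p / 3), 1)
default_mtry
# 10
```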

We will start by trying an mtry value of 10 and a min_n value of 3. As you might imagine it is a bit difficult to know what to choose. This is where the tuning process that we just started to describe will come in helpful as we can test different models with different values. However, first let’s just start with these values.

Now that we have our recipe (RF_rec), let’s specify the model with rand_forest() from parsnip.

PMtree_model <- parsnip::rand_forest(mtry = 10, min_n = 3)
PMtree_model
Random Forest Model Specification (unknown)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: ranger 

Next, we set the engine and mode:

Note that you could also use the ranger or spark packages instead of randomForest. If you were to use the ranger package to implement the random forest analysis you would need to specify an importance argument to be able to evaluate predictor importance. The options are impurity or permutation.

These other packages have different advantages and disadvantages. For example, ranger and spark are not as limited in the number of categories allowed for categorical variables. For more information see their documentation: here for ranger, here for spark, and here for randomForest.

See here for more documentation about implementing these engine options with tidymodels. Note that there are also other R packages for implementing random forest algorithms, but these three packages (ranger, spark, and randomForest) are currently compatible with tidymodels.

We also need to specify with the set_mode() function that our outcome variable (air pollution) is continuous.

RF_PM_model <- PMtree_model %>%
  set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Then, we put this all together into a workflow:

Question Opportunity

See if you can come up with the code to do this.


Click here to reveal the answer.
RF_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(RF_PM_model)

RF_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Finally, we fit the data to the model:

Question Opportunity

Do you recall how to do this?


Click here to reveal the answer.
RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)

If you get an error “Can not handle categorical predictors with more than 53 categories.” then you should scroll up a bit and make sure that you removed the categorical variables that have more than 53 categories as this method can’t handle such variables at this time.


RF_wflow_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────

Call:
 randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10,      x), nodesize = min_rows(~3, x)) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 2.698043
                    % Var explained: 58.29

Let’s take a look at the top 10 contributing variables:

Question Opportunity

See if you can recall how to do this.


Click here to reveal the answer.
RF_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

Interesting! As in the previous model, the CMAQ values and the state where the monitor was located (being in California or not) are among the most important predictors. However, in the previous model predictors about the education levels of the communities where the monitors were located were also among the most important, while now population density and proximity to sources of emissions and roads are among the top ten.

Now let’s take a look at model performance by fitting the data using cross validation:

Question Opportunity

See if you can recall how to do this.


Click here to reveal the answer.
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   1.67      4  0.101  Preprocessor1_Model1
2 rsq     standard   0.591     4  0.0514 Preprocessor1_Model1

Model Comparison


Now let’s compare the performance of our random forest model with our linear regression model. If you have been following along, you can type this to take a look:

# our initial linear regression model:
collect_metrics(resample_fit)

For those starting here, we will tell you that our first model had a cross validation mean rmse value of 2.12. It looks like the random forest model had a much lower rmse value of 1.67. This suggests that this model is better at predicting air pollution values. In addition, the R squared value is much higher (it was 31% and is now 59%), suggesting that more of the variance of the air pollution values can be explained by the random forest model.

Question Opportunity

Do you recall how the RMSE is calculated?


Click here to reveal the answer. \[RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y_i}- y_i)}^2}{n}}\]

If we tuned our random forest model based on the number of trees or the value for mtry (which is “The number of predictors that will be randomly sampled at each split when creating the tree models”), we might get a model with even better performance.

However, our cross validated mean rmse value of 1.67 is quite good because our range of true outcome values is much larger: (4.298, 23.161).

Model tuning


Hyperparameters are often things that we need to specify about a model. For example, the number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models, called mtry, is a hyperparameter. The default number for regression analyses is the number of predictors divided by 3. Instead of arbitrarily specifying this, we can try to determine the best option for model performance by a process called tuning.

Now let’s try some tuning.

Let’s take a closer look at the mtry and min_n hyperparameters in our Random Forest model.

We aren’t exactly sure what values of mtry and min_n achieve good accuracy yet keep our model generalizable for other data.

This is when our cross validation methods become really handy because now we can test out different values for each of these hyperparameters to assess what values seem to work best for model performance on these resamples of our training set data.

Previously we specified our model like so:

RF_PM_model <- 
  parsnip::rand_forest(mtry = 10, min_n = 3) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Now instead of specifying a value for the mtry and min_n arguments, we can use the tune() function of the tune package like so: mtry = tune(). This indicates that these hyperparameters are to be tuned.

tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")
    
tune_RF_model
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  min_n = tune()

Computational engine: randomForest 

Again we will add this to a workflow; the only difference here is that we are using the tune_RF_model model specification instead of RF_PM_model:

RF_tune_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(tune_RF_model)
RF_tune_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  min_n = tune()

Computational engine: randomForest 

Now we can use the tune_grid() function of the tune package to evaluate different combinations of values for mtry and min_n using our cross validation samples of our training set (vfold_pm) to see what combination of values performs best.

To use this function we will specify the workflow using the object argument and the cross validation samples using the resamples argument. The grid argument specifies how many possible options for each hyperparameter should be attempted.

By default 10 different values will be attempted for each hyperparameter that is being tuned.

We can use the doParallel package to fit all of these models to our cross validation samples faster. If you are performing this on a computer with multiple cores or processors, then different models with different hyperparameter values can be fit to the cross validation samples simultaneously across different cores or processors.

You can see how many cores you have access to on your system using the detectCores() function in the parallel package.

parallel::detectCores()
[1] 8

The registerDoParallel() function will use the number of cores specified with the cores = argument, or it will automatically assign one-half of the number of cores detected by the parallel package.

We need to use set.seed() here because the values chosen for mtry and min_n may vary if we perform this evaluation again, because they are chosen semi-randomly (meaning that they are within a range of reasonable values but still random).

Note: this step will take some time.

doParallel::registerDoParallel(cores = 2)
set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
# Tuning results
# 4-fold cross-validation 
# A tibble: 4 × 4
  splits            id    .metrics          .notes          
  <list>            <chr> <list>            <list>          
1 <split [438/146]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
2 <split [438/146]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
3 <split [438/146]> Fold3 <tibble [40 × 6]> <tibble [1 × 3]>
4 <split [438/146]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>

There were issues with some computations:

  - Warning(s) x1: 36 columns were requested but there were 35 predictors in the dat...

Run `show_notes(.Last.tune.result)` for more information.

See the tune getting started guide for more information about implementing this in tidymodels.

If you wanted more control over this process, you could specify the possible options for mtry and min_n passed to the tune_grid() function by using the grid_*() functions of the dials package to create a more specific grid.

By default the values for the hyperparameters being tuned are chosen semi-randomly (meaning that they are within a range of reasonable values but still random).
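For example, a regular grid of candidate values could be built with the dials package. This is just a sketch and was not part of the original analysis; the upper bound of 35 for mtry is an assumption based on the number of predictors in our data.

```r
library(dials)

# A regular grid: 5 evenly spaced values per hyperparameter (25 combinations)
RF_grid <- grid_regular(
  mtry(range = c(1, 35)),   # predictors sampled at each split
  min_n(range = c(2, 40)),  # minimum observations in a terminal node
  levels = 5
)
```

This grid could then be supplied to tune_grid() via the grid argument in place of the number 20.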

Now we can use the collect_metrics() function again to take a look at what happened with our cross validation tests. We can see the different values chosen for mtry and min_n and the mean rmse and rsq values across the cross validation samples.

tune_RF_results %>%
  collect_metrics()
# A tibble: 40 × 8
    mtry min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1    12    33 rmse    standard   1.72      4  0.0866 Preprocessor1_Model01
 2    12    33 rsq     standard   0.562     4  0.0466 Preprocessor1_Model01
 3    27    35 rmse    standard   1.69      4  0.102  Preprocessor1_Model02
 4    27    35 rsq     standard   0.563     4  0.0511 Preprocessor1_Model02
 5    22    40 rmse    standard   1.71      4  0.106  Preprocessor1_Model03
 6    22    40 rsq     standard   0.556     4  0.0543 Preprocessor1_Model03
 7     1    27 rmse    standard   2.03      4  0.0501 Preprocessor1_Model04
 8     1    27 rsq     standard   0.440     4  0.0245 Preprocessor1_Model04
 9     6    32 rmse    standard   1.77      4  0.0756 Preprocessor1_Model05
10     6    32 rsq     standard   0.552     4  0.0435 Preprocessor1_Model05
# … with 30 more rows
# ℹ Use `print(n = ...)` to see more rows

We can now use the show_best() function (which we used previously to get a better estimate of the RMSE using the cross validation folds of the training data) as it was truly intended: to see what values for min_n and mtry resulted in the best performance.

show_best(tune_RF_results, metric = "rmse", n = 1)
# A tibble: 1 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    32    11 rmse    standard    1.65     4   0.113 Preprocessor1_Model10

There we have it… looks like an mtry of 32 and a min_n of 11 had the best rmse value. You can verify this in the above output, but it is easier to just pull this row out using this function. We can see that the mean rmse value across the cross validation sets was 1.65. Before tuning it was 1.67 with a fairly similar std_err, so the performance may be slightly improved. In other words, the model was slightly better at predicting air pollution values.

Final model performance evaluation


Now that we have decided that we have reasonable performance with our training data using the Random Forest method and tuning, we can stop building our model and evaluate performance with our testing data. Note that you might often need to try a variety of methods to optimize your model.

Here, we will use the random forest model that we built to predict values for the monitors in the testing data and we will use the values for mtry and min_n that we just determined based on our tuning analysis to achieve the best performance.

So, first we need to specify these values in a workflow. We can use the select_best() function of the tune package to grab the values that were determined to be best for mtry and min_n.

tuned_RF_values <- select_best(tune_RF_results, "rmse")
tuned_RF_values
# A tibble: 1 × 3
   mtry min_n .config              
  <int> <int> <chr>                
1    32    11 Preprocessor1_Model10

Now we can finalize the model/workflow that we used for tuning with these values.

RF_tuned_wflow <-RF_tune_wflow %>%
  tune::finalize_workflow(tuned_RF_values)

With the workflows package, we can use the splitting information for our original data pm_split to fit the final model on the full training set and then evaluate it on the testing data using the last_fit() function of the tune package. No pre-processing steps are required.

The results will show the performance using the testing data.

overallfit <- tune::last_fit(RF_tuned_wflow, pm_split)
 # or
overallfit <- RF_tuned_wflow %>%
  tune::last_fit(pm_split)

The overallfit output has a lot of really useful information about the model, the testing and training data split, and the predictions for the testing data.

To see the performance on the test data we can use the collect_metrics() function like we did before.

collect_metrics(overallfit)
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       1.71  Preprocessor1_Model1
2 rsq     standard       0.612 Preprocessor1_Model1

Awesome! We can see that our rmse of 1.71 is quite similar to that of our training data cross validation sets (where the rmse was 1.65). We achieved quite good performance, which suggests that we could predict air pollution at other locations with more sparse monitoring based on our predictors with reasonable accuracy; however, some of our predictors involve monitoring itself, so the accuracy would likely be lower.

Now if you wanted to take a look at the predicted values for the test set (the 292 rows with predictions out of the 876 original monitor values) you can use the collect_predictions() function of the tune package:

test_predictions <- collect_predictions(overallfit)
test_predictions

# A tibble: 292 × 5
    id               .pred  .row value .config             
    <chr>            <dbl> <int> <dbl> <chr>               
  1 train/test split 11.9      3 11.2  Preprocessor1_Model1
  2 train/test split 11.8      5 12.4  Preprocessor1_Model1
  3 train/test split 11.1      6 10.5  Preprocessor1_Model1
  4 train/test split 13.1      7 15.6  Preprocessor1_Model1
  5 train/test split 12.0      8 12.4  Preprocessor1_Model1
  6 train/test split 10.5      9 11.1  Preprocessor1_Model1
  7 train/test split 11.2     14 11.8  Preprocessor1_Model1
  8 train/test split 11.2     16 10.0  Preprocessor1_Model1
  9 train/test split 11.6     18 12.0  Preprocessor1_Model1
 10 train/test split 11.9     20 13.2  Preprocessor1_Model1
# … with 282 more rows
# ℹ Use `print(n = ...)` to see more rows

Nice!
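As a quick sanity check (not part of the original analysis), the test set rmse reported by collect_metrics() can be recomputed directly from these predictions with the yardstick package:

```r
library(yardstick)

# Recompute RMSE from the observed (value) and predicted (.pred) columns
test_predictions %>%
  rmse(truth = value, estimate = .pred)
```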

Data Visualization


If you have been following along but stopped and are starting here, you can load the wrangled data using the following command:

load(here::here("data", "wrangled", "wrangled_pm.rda"))

If you skipped previous sections click here for more information on how to obtain and load the data.

First you need to install the OCSdata package:

install.packages("OCSdata")

Then, you may download and load the wrangled data .rda file using the following code:

OCSdata::wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))

If the package does not work for you, you may also download this .rda file by clicking this link here.

To load the downloaded data into your environment, you may double click on the .rda file in RStudio or use the load() function.

To copy and paste our code below, place the downloaded .rda file in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data”. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_pm.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.


Our main question for this case study was:

Can we predict regional annual average air pollution concentrations by zip code using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Thus far, we have built a machine learning (ML) model to predict fine particulate matter air pollution levels based on our predictor variables (or features).

Now, let’s make a plot of our predicted outcome values (\(\hat{Y}\)) and the actual outcome values (\(Y\)) we observed.

First, let’s start by making a plot of our monitors. To do this, we will use the following packages to create a map of the US:

  1. sf - the simple features package helps to convert geographical coordinates into geometry variables which are useful for making 2D plots
  2. maps - this package contains geographical outlines and plotting functions to create plots with maps
  3. rnaturalearth - this allows for easy interaction with map data from Natural Earth, which is a public domain map dataset
  4. rgeos - this package interfaces with the Geometry Engine-Open Source (GEOS) which is also helpful for coordinate conversion

We will start by getting an outline of the US with the ne_countries() function of the rnaturalearth package, which will return polygons of the countries in the Natural Earth dataset.

world <- ne_countries(scale = "medium", returnclass = "sf")
glimpse(world)
Rows: 241
Columns: 64
$ scalerank  <int> 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 5, 3, 1, 1, 1, 1, 1, 1,…
$ featurecla <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ labelrank  <dbl> 5, 3, 3, 6, 6, 6, 6, 4, 2, 6, 4, 4, 5, 6, 6, 2, 4, 5, 6, 2,…
$ sovereignt <chr> "Netherlands", "Afghanistan", "Angola", "United Kingdom", "…
$ sov_a3     <chr> "NL1", "AFG", "AGO", "GB1", "ALB", "FI1", "AND", "ARE", "AR…
$ adm0_dif   <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,…
$ level      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ type       <chr> "Country", "Sovereign country", "Sovereign country", "Depen…
$ admin      <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ adm0_a3    <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ geou_dif   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ geounit    <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ gu_a3      <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ su_dif     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ subunit    <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ su_a3      <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_diff   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ name       <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_long  <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_a3     <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_name   <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_group  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ abbrev     <chr> "Aruba", "Afg.", "Ang.", "Ang.", "Alb.", "Aland", "And.", "…
$ postal     <chr> "AW", "AF", "AO", "AI", "AL", "AI", "AND", "AE", "AR", "ARM…
$ formal_en  <chr> "Aruba", "Islamic State of Afghanistan", "People's Republic…
$ formal_fr  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ note_adm0  <chr> "Neth.", NA, NA, "U.K.", NA, "Fin.", NA, NA, NA, NA, "U.S.A…
$ note_brk   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Multiple claim…
$ name_sort  <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_alt   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ mapcolor7  <dbl> 4, 5, 3, 6, 1, 4, 1, 2, 3, 3, 4, 4, 1, 7, 2, 1, 3, 1, 2, 3,…
$ mapcolor8  <dbl> 2, 6, 2, 6, 4, 1, 4, 1, 1, 1, 5, 5, 2, 5, 2, 2, 1, 6, 2, 2,…
$ mapcolor9  <dbl> 2, 8, 6, 6, 1, 4, 1, 3, 3, 2, 1, 1, 2, 9, 5, 2, 3, 5, 5, 1,…
$ mapcolor13 <dbl> 9, 7, 1, 3, 6, 6, 8, 3, 13, 10, 1, NA, 7, 11, 5, 7, 4, 8, 8…
$ pop_est    <dbl> 103065, 28400000, 12799293, 14436, 3639453, 27153, 83888, 4…
$ gdp_md_est <dbl> 2258.0, 22270.0, 110300.0, 108.9, 21810.0, 1563.0, 3660.0, …
$ pop_year   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lastcensus <dbl> 2010, 1979, 1970, NA, 2001, NA, 1989, 2010, 2010, 2001, 201…
$ gdp_year   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ economy    <chr> "6. Developing region", "7. Least developed region", "7. Le…
$ income_grp <chr> "2. High income: nonOECD", "5. Low income", "3. Upper middl…
$ wikipedia  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fips_10    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ iso_a2     <chr> "AW", "AF", "AO", "AI", "AL", "AX", "AD", "AE", "AR", "AM",…
$ iso_a3     <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ iso_n3     <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ un_a3      <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ wb_a2      <chr> "AW", "AF", "AO", NA, "AL", NA, "AD", "AE", "AR", "AM", "AS…
$ wb_a3      <chr> "ABW", "AFG", "AGO", NA, "ALB", NA, "ADO", "ARE", "ARG", "A…
$ woe_id     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_is <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ adm0_a3_us <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ adm0_a3_un <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_wb <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ continent  <chr> "North America", "Asia", "Africa", "North America", "Europe…
$ region_un  <chr> "Americas", "Asia", "Africa", "Americas", "Europe", "Europe…
$ subregion  <chr> "Caribbean", "Southern Asia", "Middle Africa", "Caribbean",…
$ region_wb  <chr> "Latin America & Caribbean", "South Asia", "Sub-Saharan Afr…
$ name_len   <dbl> 5, 11, 6, 8, 7, 5, 7, 20, 9, 7, 14, 10, 23, 22, 17, 9, 7, 1…
$ long_len   <dbl> 5, 11, 6, 8, 7, 13, 7, 20, 9, 7, 14, 10, 27, 35, 19, 9, 7, …
$ abbrev_len <dbl> 5, 4, 4, 4, 4, 5, 4, 6, 4, 4, 9, 4, 7, 10, 6, 4, 5, 4, 4, 5…
$ tiny       <dbl> 4, NA, NA, NA, NA, 5, 5, NA, NA, NA, 3, NA, NA, 2, 4, NA, N…
$ homepart   <dbl> NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1,…
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((-69.89912 1..., MULTIPOLYGON (…

Here you can see the data about the countries in the world. Notice the geometry variable. This is used to create the outlines that we want.

Now we can use the geom_sf() function of the ggplot2 package to create a visualization of the simple features (the geometry coordinates found in the geometry variable).

ggplot(data = world) +
    geom_sf() 

So now we can see that we have outlines of all the countries in the world.

We want to limit this just to the coordinates for the US. We will do this based on the coordinates we found on Wikipedia. According to this link, these are the latitude and longitude bounds of the continental US:

  • top = 49.3457868 # north lat
  • left = -124.7844079 # west long
  • right = -66.9513812 # east long
  • bottom = 24.7433195 # south lat
ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)

Now we just have a plot that is mostly limited to the outline of the US.

Now we will use the geom_point() function of the ggplot2 package to add a scatter plot on top of the map. We want to show where the monitors are located based on the latitude and longitude values in the data.

ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)+
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred")

Nice!

Now let’s add county lines.

County graphical data is available from the maps package. The sf package, which again is short for simple features, creates a data frame from this graphical data so that we can work with it.

counties <- sf::st_as_sf(maps::map("county", plot = FALSE,
                                   fill = TRUE))

counties
Simple feature collection with 3076 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.6813 ymin: 25.12993 xmax: -67.00742 ymax: 49.38323
Geodetic CRS:  WGS 84
First 10 features:
                 ID                           geom
1   alabama,autauga MULTIPOLYGON (((-86.50517 3...
2   alabama,baldwin MULTIPOLYGON (((-87.93757 3...
3   alabama,barbour MULTIPOLYGON (((-85.42801 3...
4      alabama,bibb MULTIPOLYGON (((-87.02083 3...
5    alabama,blount MULTIPOLYGON (((-86.9578 33...
6   alabama,bullock MULTIPOLYGON (((-85.66866 3...
7    alabama,butler MULTIPOLYGON (((-86.8604 31...
8   alabama,calhoun MULTIPOLYGON (((-85.74313 3...
9  alabama,chambers MULTIPOLYGON (((-85.59416 3...
10 alabama,cherokee MULTIPOLYGON (((-85.46812 3...

Now we will use this data within the geom_sf() function to add this to our plot. We will also add a title using the ggtitle() function, as well as remove axis ticks and titles using the theme() function of the ggplot2 package.

monitors <- ggplot(data = world) +
    geom_sf(data = counties, fill = NA, color = gray(.5))+
      coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE) +
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred") +
    ggtitle("Monitor Locations") +
    theme(axis.title.x=element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank())

monitors

Great!

Now, let’s add a fill at the county-level for the true monitor values of air pollution.

First, we need the county map data that we just obtained and our air pollution data to have similarly formatted county names so that we can combine the datasets together.

We can see that in the county data the counties are listed after the state name and a comma. In addition, they are all lower case.

head(counties)
Simple feature collection with 6 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
Geodetic CRS:  WGS 84
               ID                           geom
1 alabama,autauga MULTIPOLYGON (((-86.50517 3...
2 alabama,baldwin MULTIPOLYGON (((-87.93757 3...
3 alabama,barbour MULTIPOLYGON (((-85.42801 3...
4    alabama,bibb MULTIPOLYGON (((-87.02083 3...
5  alabama,blount MULTIPOLYGON (((-86.9578 33...
6 alabama,bullock MULTIPOLYGON (((-85.66866 3...

In contrast, our air pollution pm data shows county names in title case, with the first letter capitalized.

dplyr::pull(pm, county) %>%
  head()
[1] "Baldwin" "Clay"    "Colbert" "DeKalb"  "Etowah"  "Houston"

We can use the separate() function of the tidyr package to separate the ID variable of our counties data into two variables based on the comma as a separator.

counties %<>% 
  tidyr::separate(ID, into = c("state", "county"), sep = ",")

head(counties)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
Geodetic CRS:  WGS 84
    state  county                           geom
1 alabama autauga MULTIPOLYGON (((-86.50517 3...
2 alabama baldwin MULTIPOLYGON (((-87.93757 3...
3 alabama barbour MULTIPOLYGON (((-85.42801 3...
4 alabama    bibb MULTIPOLYGON (((-87.02083 3...
5 alabama  blount MULTIPOLYGON (((-86.9578 33...
6 alabama bullock MULTIPOLYGON (((-85.66866 3...

Now we just need to convert the names in the new county variable of the counties data to title case. We can use the str_to_title() function of the stringr package to do this.

counties[["county"]] <- stringr::str_to_title(counties[["county"]])

Great! Now the county information is the same for the counties and pm data.

We can use the inner_join() function of the dplyr package to join the datasets together based on the county variables in each. This function keeps only the rows with county values present in both datasets.

map_data <- dplyr::inner_join(counties, pm, by = "county")

glimpse(map_data)
Rows: 3,926
Columns: 52
$ state.x                     <chr> "alabama", "alabama", "alabama", "alabama"…
$ county                      <chr> "Baldwin", "Bibb", "Bibb", "Butler", "Butl…
$ id                          <fct> 1003.001, 13021.0007, 13021.0012, 39017.00…
$ value                       <dbl> 9.597647, 12.253134, 12.233673, 13.918079,…
$ fips                        <fct> 1003, 13021, 13021, 39017, 39017, 13059, 1…
$ lat                         <dbl> 30.49800, 32.77746, 32.80541, 39.49380, 39…
$ lon                         <dbl> -87.88141, -83.64110, -83.54352, -84.35430…
$ state.y                     <chr> "Alabama", "Georgia", "Georgia", "Ohio", "…
$ city                        <chr> "Fairhope", "Macon", "Macon", "Middletown"…
$ CMAQ                        <dbl> 8.098836, 11.716801, 11.716801, 11.321991,…
$ zcta                        <fct> 36532, 31206, 31020, 45044, 45015, 30605, …
$ zcta_area                   <dbl> 190980522, 72325015, 276913325, 98746815, …
$ zcta_pop                    <dbl> 27829, 29072, 2541, 52822, 12038, 39952, 5…
$ imp_a500                    <dbl> 0.01730104, 35.64359862, 0.28200692, 27.44…
$ imp_a1000                   <dbl> 1.4096021, 24.4824827, 0.2973616, 28.28871…
$ imp_a5000                   <dbl> 3.3360118, 8.7317283, 2.4691097, 22.697407…
$ imp_a10000                  <dbl> 1.9879187, 9.1999720, 1.9873487, 11.901559…
$ imp_a15000                  <dbl> 1.4386207, 6.4619966, 3.6435089, 8.4052921…
$ county_area                 <dbl> 4117521611, 646879637, 646879637, 12096684…
$ county_pop                  <dbl> 182265, 155547, 155547, 368130, 368130, 11…
$ log_dist_to_prisec          <dbl> 4.648181, 7.635438, 7.576493, 5.959728, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 10.215058, 10.659655, 9.747731, …
$ log_pri_length_10000        <dbl> 9.210340, 12.116408, 11.653566, 10.734382,…
$ log_pri_length_15000        <dbl> 9.630228, 12.591833, 12.313347, 11.172817,…
$ log_pri_length_25000        <dbl> 11.32735, 13.12475, 13.02383, 12.14238, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 6.214608, 8.104926, 7.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 7.600902, 9.180109, 8.…
$ log_prisec_length_5000      <dbl> 10.81504, 11.61283, 11.34202, 11.63677, 11…
$ log_prisec_length_10000     <dbl> 11.88680, 13.23991, 12.47152, 12.62117, 12…
$ log_prisec_length_15000     <dbl> 12.20572, 13.74532, 13.37704, 13.26170, 13…
$ log_prisec_length_25000     <dbl> 13.41395, 14.28483, 14.25295, 14.16745, 14…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.3180354, 5.5380414, 0.1972941, 7.2307845…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967359, 5.538041, 5.536822, 7.408732, 5.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 5.986616, 5.986572, 7.538614, 7.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.3558851, 5.6035407, 0.9849611, 7.2785763…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.2678341, 5.6035407, 5.5971561, 7.4899282…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.6287284, 6.0302900, 6.0347672, 7.6330042…
$ popdens_county              <dbl> 44.265706, 240.457407, 240.457407, 304.323…
$ popdens_zcta                <dbl> 145.716431, 401.963277, 9.176156, 534.9235…
$ nohs                        <dbl> 3.300, 8.700, 6.800, 1.300, 3.500, 3.800, …
$ somehs                      <dbl> 4.900, 24.700, 17.500, 3.300, 7.700, 6.800…
$ hs                          <dbl> 25.10, 32.60, 46.60, 33.40, 33.40, 17.80, …
$ somecollege                 <dbl> 19.7, 12.9, 18.7, 23.0, 23.2, 16.1, 18.8, …
$ associate                   <dbl> 8.200, 4.800, 5.000, 8.700, 8.100, 5.100, …
$ bachelor                    <dbl> 25.30, 8.90, 4.00, 22.20, 16.20, 25.40, 5.…
$ grad                        <dbl> 13.50, 7.30, 1.30, 8.10, 7.90, 25.00, 3.10…
$ pov                         <dbl> 6.100, 38.800, 26.800, 0.900, 6.900, 12.10…
$ hs_orless                   <dbl> 33.30, 66.00, 70.90, 38.00, 44.60, 28.40, …
$ urc2013                     <dbl> 4, 4, 4, 2, 2, 4, 6, 2, 4, 1, 1, 1, 3, 4, …
$ urc2006                     <dbl> 5, 4, 4, 2, 2, 4, 6, 2, 4, 1, 1, 1, 3, 4, …
$ aod                         <dbl> 37.36364, 36.25000, 30.45455, 48.36364, 51…
$ geom                        <MULTIPOLYGON [°]> MULTIPOLYGON (((-87.93757 3..…

Nice! We can see that we have added a geom variable to the pm data.

Now we can use this to color the counties in our plot based on the value variable of our pm data, which you may recall is the actual monitor data for fine particulate air pollution at each monitor.

We can do so using the scale_fill_gradientn() function of the ggplot2 package, which creates a color gradient based on a variable. In this case it is the variable that was specified as the fill in the aes() function of the geom_sf() function. We specified that it would be the value variable of the pm data.

This scale_fill_gradientn() function also allows you to specify the colors, what to do about NA values (whether they should be a specific color or transparent), and the breaks, limits, labels, and name/title for the color gradient legend.

truth <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = value)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("True PM 2.5 levels") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

truth

Nice!

Now let’s do the same with our predicted outcome values.

Let’s grab both the testing and training predicted outcome values so that we have as much data as possible.

First we need to fit our final model to the training data to be able to get the predictions for the monitors included in the training set. We previously did this using the last_fit() function, but its output makes it difficult to grab the predicted values for the training data, and it is also difficult to get the id variables for the testing data.

Thus we will use the fit() and predict() functions of the parsnip package to do this, like so:

Question Opportunity

Why do we not need pre-processed data?


Click here to reveal the answer.

Since we are using a workflow, the data will be pre-processed when it is fit as well.


RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)


values_pred_train <- predict(RF_final_train_fit, train_pm) %>% 
  bind_cols(train_pm %>% select(value, fips, county, id)) 

values_pred_train
# A tibble: 584 × 5
   .pred value fips  county           id        
   <dbl> <dbl> <fct> <chr>            <fct>     
 1 11.9  11.7  18003 Allen            18003.0004
 2  8.14  6.96 55041 Forest           55041.0007
 3 13.8  13.3  6065  Riverside        6065.1003 
 4 11.0  10.7  39009 Athens           39009.0003
 5 13.8  14.5  39061 Hamilton         39061.8001
 6 12.5  12.2  24510 Baltimore (City) 24510.0006
 7 12.0  11.2  6061  Placer           6061.0006 
 8  8.00  6.98 6065  Riverside        6065.5001 
 9  7.64  6.61 44003 Kent             44003.0002
10 11.2  11.6  37111 McDowell         37111.0004
# … with 574 more rows
# ℹ Use `print(n = ...)` to see more rows
values_pred_test <- predict(RF_final_test_fit, test_pm) %>% 
  bind_cols(test_pm %>% select(value, fips, county, id)) 

values_pred_test
# A tibble: 292 × 5
   .pred value fips  county     id       
   <dbl> <dbl> <fct> <chr>      <fct>    
 1  11.6  11.2 1033  Colbert    1033.1002
 2  12.0  12.4 1055  Etowah     1055.001 
 3  11.1  10.5 1069  Houston    1069.0003
 4  14.0  15.6 1073  Jefferson  1073.0023
 5  12.1  12.4 1073  Jefferson  1073.1005
 6  11.3  11.1 1073  Jefferson  1073.1009
 7  11.5  11.8 1073  Jefferson  1073.5003
 8  11.1  10.0 1097  Mobile     1097.0003
 9  11.9  12.0 1101  Montgomery 1101.0007
10  12.9  13.2 1113  Russell    1113.0001
# … with 282 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we can combine this data for the predictions for all monitors using the bind_rows() function of the dplyr package, which will essentially append the second dataset to the first.

all_pred <- bind_rows(values_pred_test, values_pred_train)

all_pred
# A tibble: 876 × 5
   .pred value fips  county     id       
   <dbl> <dbl> <fct> <chr>      <fct>    
 1  11.6  11.2 1033  Colbert    1033.1002
 2  12.0  12.4 1055  Etowah     1055.001 
 3  11.1  10.5 1069  Houston    1069.0003
 4  14.0  15.6 1073  Jefferson  1073.0023
 5  12.1  12.4 1073  Jefferson  1073.1005
 6  11.3  11.1 1073  Jefferson  1073.1009
 7  11.5  11.8 1073  Jefferson  1073.5003
 8  11.1  10.0 1097  Mobile     1097.0003
 9  11.9  12.0 1101  Montgomery 1101.0007
10  12.9  13.2 1113  Russell    1113.0001
# … with 866 more rows
# ℹ Use `print(n = ...)` to see more rows

Great! As we can see, there are 876 values, as we would expect for all of the monitors. We can use the county variable to combine this with the counties data, like we did with the pm data previously, so that we can use the predicted values (.pred) as a color scheme for our map.

map_data <- inner_join(counties, all_pred, by = "county")

pred <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = .pred)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("Predicted PM 2.5 levels") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

pred

Now we will use the patchwork package to combine our last two plots. This allows us to combine plots using the + or / operators: + places plots side by side, while / stacks them top to bottom.
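As a tiny self-contained illustration of these two operators (using toy plots built from the built-in mtcars data, not the case study plots):

```r
library(ggplot2)
library(patchwork)

# Two simple scatter plots to combine
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

p1 + p2  # side by side
p1 / p2  # stacked top to bottom
```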

Now let’s just combine the truth plot and the prediction plots together:

truth/pred

We can see that the predicted fine particle air pollution values in (ug/m3) are quite similar to the true values measured by the actual gravimetric monitors. We can also see that southern California has some large counties with worse pollution (as they are yellow and thus have much higher particulate matter levels).

Let’s add some text to our plot to explain it a bit more. We can do so using the plot_annotation() function of the patchwork package. The theme argument of this function takes the same theme information, specified with the theme() function of the ggplot2 package, as when creating ggplot2 plots.

(truth/pred) + 
  plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
                  theme = theme(plot.title = element_text(size =12, face = "bold"), 
                                plot.subtitle = element_text(size = 8)))

Summary


Synopsis


In this case study, we explored gravimetric monitoring data of fine particulate matter air pollution (outcome variable). Our goal was to be able to predict air pollution levels where we only had predictor variables (or features), without having observed a corresponding measurement of air pollution.

Our learning objectives were:

  • Introduce concepts in machine learning
  • Demonstrate how to build a machine learning model with tidymodels
  • Demonstrate how to visualize geo-spatial data using ggplot2

Using the machine learning models built in this case study, we could now extend this approach to predict air pollution levels in areas with poor monitoring, to help identify regions where populations may be especially at risk for the health effects of air pollution.

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in this article that many additional considerations would be involved to adequately understand the data enough to recommend policy changes.

Here are some visual summaries about what we learned about using tidymodels to perform prediction analyses.

First the minimal steps required:

Here is a guide for more advanced analyses involving preprocessing, cross validation, or tuning:

Click here for more on what we learned with tidymodels

Here, we provide an overview of the tidymodels framework.

We performed the major steps of machine learning that we introduced in the beginning of the data analysis:

  1. Data exploration

We used packages like skimr, summarytools, corrplot, and GGally to better understand our data. These packages can tell us how many missing values each variable has (if any), the class of each variable, the distribution of values for each variable, the sparsity of each variable, and the level of correlation between variables.
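As a quick hedged sketch of this exploration step (run on the built-in mtcars data rather than the case study's pm data):

```r
library(skimr)     # variable-by-variable summaries
library(corrplot)  # correlation matrix visualization

# One row per variable: class, missing values, and distribution summary
skim(mtcars)

# Visualize pairwise correlations between the numeric variables
corrplot(cor(mtcars), tl.cex = 0.7)
```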

  2. Data splitting

We used the rsample package to first perform an initial split of our data into two pieces: a training set and a testing set. The training set was used to optimize the model, while the testing set was used only to evaluate the performance of our final model. We also used the rsample package to create cross validation subsets of our training data. This allowed us to better assess the performance of our tested models using our training data.
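A minimal sketch of this splitting step (shown on the built-in mtcars data; the case study applied the same pattern to the pm data with a 2/3 training proportion):

```r
library(rsample)
set.seed(1234)  # make the random split reproducible

# Initial split: 2/3 of rows for training, 1/3 held out for testing
data_split <- initial_split(mtcars, prop = 2/3)
train_set  <- training(data_split)
test_set   <- testing(data_split)

# Cross validation subsets of the training data
cv_folds <- vfold_cv(train_set, v = 4)
```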

  3. Variable assignment and pre-processing

We used the recipes package to assign variable roles (such as outcome, predictor, and id variable). We also used this package to create a recipe for pre-processing our training and testing data. This involved steps such as: step_dummy to create dummy numeric encodings of our categorical variables, step_corr to remove highly correlated variables, and step_nzv to remove near zero variance variables that would contribute little to our model and potentially add noise. We learned that once our recipe was created and prepped using prep(), we could extract the pre-processed training data or our pre-processed testing data using bake(). We also learned that if we used the newer workflows package, we did not need to use the prep() or bake() functions, but that it is still useful to know how to do so if we want to look at our data and how the recipe is influencing it more deeply.
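The recipe pattern described above can be sketched on toy data like so (the column names value, id, and group are stand-ins for the case study's variables):

```r
library(recipes)

# Toy data standing in for the training set: `value` is the outcome,
# `id` is an identifier, and the remaining columns are predictors
set.seed(123)
train_df <- data.frame(
  id    = 1:20,
  value = rnorm(20, mean = 10),
  group = sample(c("a", "b", "c"), 20, replace = TRUE),
  x1    = rnorm(20),
  x2    = rnorm(20)
)

rec <- recipe(train_df) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable") %>%
  step_dummy(group) %>%                    # numeric encodings of categories
  step_corr(all_numeric_predictors()) %>%  # drop highly correlated predictors
  step_nzv(all_numeric_predictors())       # drop near zero variance predictors

# prep() estimates each step from the training data; bake() applies them
prepped     <- prep(rec, training = train_df)
baked_train <- bake(prepped, new_data = NULL)
```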

  4. Model specification, fitting, tuning and performance evaluation using the training data

We learned that the model needs to first be fit to the training data. We learned that in both classification and prediction, the model is fit to the training data and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. We learned that we specify the model and its specifications using the parsnip package and that we also use this package to fit the model using the fit() function. We learned that if we just use parsnip to fit the model, then we need to use the pre-processed training data (output from bake()). We learned that we can use the raw training data if we use the workflows package to create a workflow that pre-processes our data for us.

We learned that if the model fits well then the estimated values will be very similar to the true outcome variable values in our training data. We learned that we can assess model performance using the yardstick package with the metrics() function, or the tune package and the collect_metrics() function (required if using cross validation or tuning). We also learned that we can use subsets of our training data (which we created with the rsample package) to perform cross validation to get a better estimate of the performance of our model using our training data, as we want our results to be generalizable and to perform well with other data, not just our training data. We used the fit_resamples() function of the tune package to fit our model on our different training data subsets and the collect_metrics() function (also of the tune package) to evaluate model performance using these subsets. We also learned that we can potentially improve model performance by tuning aspects of the model called hyperparameters to determine the best option for model performance. We learned that we can do this using the tune and dials packages, evaluating the performance of our model with the different hyperparameter options and the training data subsets that we used for cross validation. After we tested several different methods to model our data, we compared them to choose the best performing model as our final model.
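Condensed into code, this specification-and-tuning pattern looks roughly like the following (a toy regression so the sketch is self-contained; the engine, fold count, and grid size are illustrative assumptions rather than the case study's exact settings):

```r
library(tidymodels)

# Toy regression data and cross validation folds
set.seed(123)
toy   <- data.frame(y = rnorm(60), a = rnorm(60), b = rnorm(60), c = rnorm(60))
folds <- vfold_cv(toy, v = 3)

# Random forest specification with hyperparameters marked for tuning
rf_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

rf_wflow <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(rf_spec)

# Evaluate candidate hyperparameter values across the folds
rf_res <- tune_grid(rf_wflow, resamples = folds, grid = 4)
collect_metrics(rf_res)
select_best(rf_res, metric = "rmse")
```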

  5. Overall model performance evaluation

Once we chose our final model, we evaluated the final model performance on the testing data using the last_fit() function of the tune package. This gives us a better estimate of how well the model will predict or classify the outcome variable of interest with new independent data. Ideally, one would also perform an evaluation with independent data to get a sense of how generalizable the model is to other data sources.

We also saw that we can use the collect_predictions() function of the tune package to get the predictions for our test data. We saw that we can get more detailed prediction data using the predict() function of the parsnip package.
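Those last two steps can be sketched as follows (a self-contained toy example using a linear model rather than the case study's random forest; the object names here are placeholders):

```r
library(tidymodels)

set.seed(123)
toy       <- data.frame(y = rnorm(80), a = rnorm(80), b = rnorm(80))
toy_split <- initial_split(toy, prop = 2/3)

lm_wflow <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# last_fit() trains on the training portion and evaluates once on the test set
final_res <- last_fit(lm_wflow, split = toy_split)
collect_metrics(final_res)      # test-set performance metrics
collect_predictions(final_res)  # .pred alongside the true outcome values

# predict() on a fitted workflow returns the same predictions directly
fitted_wflow <- fit(lm_wflow, data = training(toy_split))
predict(fitted_wflow, new_data = testing(toy_split))
```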

Suggested Homework


Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.

Additional Information


Session info


sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] OCSdata_1.0.2             patchwork_1.1.1          
 [3] rgeos_0.5-9               sp_1.5-0                 
 [5] rnaturalearthdata_0.1.0   rnaturalearth_0.1.0      
 [7] maps_3.4.0                sf_1.0-7                 
 [9] proxy_0.4-27              lwgeom_0.2-8             
[11] stringr_1.4.0             doParallel_1.0.17        
[13] iterators_1.0.14          foreach_1.5.2            
[15] randomForest_4.7-1.1      vip_0.3.2                
[17] yardstick_1.0.0           workflowsets_1.0.0       
[19] workflows_1.0.0           tune_1.0.0               
[21] tidyr_1.2.0               tibble_3.1.8             
[23] rsample_1.0.0             recipes_1.0.1            
[25] purrr_0.3.4               parsnip_1.0.0            
[27] modeldata_1.0.0           infer_1.0.2              
[29] dials_1.0.0               scales_1.2.0             
[31] broom_1.0.0               tidymodels_1.0.0         
[33] GGally_2.1.2              ggplot2_3.3.6            
[35] RColorBrewer_1.1-3        corrplot_0.92            
[37] summarytools_1.0.1        skimr_2.1.4              
[39] dplyr_1.0.9               readr_2.1.2              
[41] koRpus.lang.en_0.1-4      koRpus_0.13-8            
[43] sylly_0.1-6               read.so_0.1.1            
[45] wordcountaddin_0.3.0.9000 magrittr_2.0.3           
[47] here_1.0.1                knitr_1.39               

loaded via a namespace (and not attached):
 [1] backports_1.4.1    plyr_1.8.7         repr_1.1.4         splines_4.2.0     
 [5] listenv_0.8.0      usethis_2.1.6      pryr_0.1.5         digest_0.6.29     
 [9] htmltools_0.5.3    magick_2.7.3       fansi_1.0.3        checkmate_2.1.0   
[13] tzdb_0.3.0         remotes_2.4.2      globals_0.15.1     gower_1.0.0       
[17] matrixStats_0.62.0 vroom_1.5.7        hardhat_1.2.0      colorspace_2.0-3  
[21] xfun_0.31          tcltk_4.2.0        crayon_1.5.1       jsonlite_1.8.0    
[25] survival_3.3-1     glue_1.6.2         gtable_0.3.0       ipred_0.9-13      
[29] future.apply_1.9.0 rapportools_1.1    DBI_1.1.3          Rcpp_1.0.9        
[33] units_0.8-0        bit_4.0.4          GPfit_1.0-8        lava_1.6.10       
[37] prodlim_2019.11.13 httr_1.4.3         wk_0.6.0           ellipsis_0.3.2    
[41] farver_2.1.1       pkgconfig_2.0.3    reshape_0.8.9      nnet_7.3-17       
[45] sass_0.4.2         utf8_1.2.2         labeling_0.4.2     tidyselect_1.1.2  
[49] rlang_1.0.4        DiceDesign_1.9     reshape2_1.4.4     munsell_0.5.0     
[53] tools_4.2.0        cachem_1.0.6       cli_3.3.0          generics_0.1.3    
[57] evaluate_0.15      fastmap_1.1.0      yaml_2.3.5         bit64_4.0.5       
[61] fs_1.5.2           pander_0.6.5       s2_1.1.0           future_1.27.0     
[65] compiler_4.2.0     rstudioapi_0.13    curl_4.3.2         e1071_1.7-11      
[69] lhs_1.1.5          bslib_0.4.0        stringi_1.7.8      highr_0.9         
[73] lattice_0.20-45    Matrix_1.4-1       classInt_0.4-7     vctrs_0.4.1       
[77] pillar_1.8.0       lifecycle_1.0.1    furrr_0.3.0        jquerylib_0.1.4   
[81] data.table_1.14.2  sylly.en_0.1-3     R6_2.5.1           KernSmooth_2.23-20
[85] gridExtra_2.3      parallelly_1.32.1  codetools_0.2-18   MASS_7.3-58       
[89] assertthat_0.2.1   rprojroot_2.0.3    withr_2.5.0        hms_1.1.1         
[93] grid_4.2.0         rpart_4.1.16       timeDate_4021.104  class_7.3-20      
[97] rmarkdown_2.14     lubridate_1.8.0    base64enc_0.1-3   

Estimate of RMarkdown Compilation Time:

About 208 - 218 seconds

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.

Acknowledgments


We would like to acknowledge Roger Peng, Megan Latshaw, and Kirsten Koehler for assisting in framing the major direction of the case study.

We would like to acknowledge Michael Breshock for his contributions to this case study and developing the OCSdata package.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

---
title: "Open Case Studies: Predicting Annual Air Pollution"
css: style.css
output:
  html_document:
    includes:
       in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/icon-bahi.png");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

<!-- Open all links in new tab-->  
<base target="_blank"/>  
<div id="google_translate_element"></div>

<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>

<script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>


```{r setup, include=FALSE}
library(knitr)
library(here)
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)

rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```


#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "main_plot_maps.png"))
```

####

#### {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

####

#### {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.

####

#### {.reference_block}

To cite this case study please use:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-air-pollution](https://github.com/opencasestudies/ocs-bp-air-pollution/). Predicting Annual Air Pollution (Version v1.0.0).

####

To access the GitHub Repository for this case study see here: https://github.com/opencasestudies/ocs-bp-air-pollution.

You may also access and download the data using our `OCSdata` package. To learn more about this package, including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

This case study is part of a series of public health case studies for the [Bloomberg American Health  Initiative](https://americanhealth.jhu.edu/open-case-studies).

***

The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below: 

```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
  dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
  dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
  dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```

***

**Readability Score: **

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age. 

```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>%  # reading table into dataframe, selecting relevant factors
  tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
                                     age = round(as.numeric(age))
                                     )
knitr::kable(df, format="markdown")
```

***

Please help us by filling out our survey.


<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>


# **Motivation**
***
A variety of different sources contribute different types of pollutants to what we call air pollution. 

Some sources are natural while others are anthropogenic (human derived):

<p align="center">
<img width="600" src="https://www.nps.gov/subjects/air/images/Sources_Graphic_Huge.jpg?maxwidth=1200&maxheight=1200&autorotate=false">
</p>

##### [[source]](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.nps.gov%2Fsubjects%2Fair%2Fsources.htm&psig=AOvVaw2v7AVxSF8ZSAPEhNudVtbN&ust=1585770966217000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPDN66q_xegCFQAAAAAdAAAAABAD){target="_blank"}

### Major types of air pollutants

1) **Gaseous** - Carbon Monoxide (CO), Ozone (O~3~), Nitrogen Oxides (NO, NO~2~), Sulfur Dioxide (SO~2~)
2) **Particulate** - small liquids and solids suspended in the air (includes lead and can include certain types of dust)
3) **Dust** - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
4) **Biological** - pollen, bacteria, viruses, mold spores

See [here](http://www.redlogenv.com/worker-safety/part-1-dust-and-particulate-matter) for more detail on the types of pollutants in the air.


### Particulate pollution 

Air pollution particulates are generally described by their **size**.

There are 3 major categories:

1) **Large Coarse** Particulate Matter - has a diameter of >10 micrometers (10 µm) 

2) **Coarse** Particulate Matter (called **PM~10-2.5~**) - has a diameter between 2.5 µm and 10 µm

3) **Fine** Particulate Matter (called **PM~2.5~**) - has a diameter of < 2.5 µm 

**PM~10~** includes any particulate matter <10 µm (both coarse and fine particulate matter)
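These size cutoffs can be written down as a simple rule. Here is a minimal sketch (the `pm_category()` helper is ours, for illustration only):

```{r, eval=FALSE}
# Hypothetical helper classifying particles by diameter in micrometers (um)
pm_category <- function(diameter_um) {
  dplyr::case_when(
    diameter_um < 2.5  ~ "fine (PM2.5)",
    diameter_um <= 10  ~ "coarse (PM10-2.5)",
    TRUE               ~ "large coarse (> 10 um)"
  )
}
pm_category(c(1, 5, 20))
```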

Here you can see how these sizes compare with a human hair:

```{r, echo = FALSE, out.width= "600 px"}
knitr::include_graphics(here::here("img", "pm2.5_scale_graphic-color_2.jpg"))
```

##### [[source]](https://www.epa.gov/pm-pollution/particulate-matter-pm-basics){target="_blank"}

<!-- <p align="center"> -->
<!--   <img width="500" src="https://www.sensirion.com/images/sensirion-specialist-article-figure-1-cdd70.jpg"> -->
<!-- </p> -->


<u>The following plot shows the relative sizes of these different pollutants in micrometers (µm):</u>

```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "particulate-size-chart.png"))
```

##### [[source]](https://en.wikipedia.org/wiki/Particulates){target="_blank"}


<u>This table shows how deeply some of the smaller fine particles can penetrate within the human body:</u>

```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "sizes.jpg"))
```

##### [[source]](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full){target="_blank"}


### Negative impact of particulate exposure on health 

Exposure to air pollution is associated with higher rates of [mortality](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5783186/){target="_blank"} in older adults and is known to be a risk factor for many diseases and conditions including but not limited to:

1) [Asthma](https://www.ncbi.nlm.nih.gov/pubmed/29243937){target="_blank"} - fine particle exposure (**PM~2.5~**) was found to be associated with higher rates of asthma in children
2) [Inflammation in type 1 diabetes](https://www.ncbi.nlm.nih.gov/pubmed/31419765){target="_blank"} - fine particle exposure (**PM~2.5~**) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
3) [Lung function and emphysema](https://www.ncbi.nlm.nih.gov/pubmed/31408135){target="_blank"} - higher concentrations of ozone (O~3~), nitrogen oxides (NO~x~), black carbon, and fine particle exposure **PM~2.5~** , at study baseline were significantly associated with greater increases in percent emphysema per 10 years 
4) [Low birthweight](https://www.ncbi.nlm.nih.gov/pubmed/31386643){target="_blank"} - fine particle exposure (**PM~2.5~**) was associated with lower birth weight in full-term live births
5) [Viral Infection](https://www.tandfonline.com/doi/full/10.1080/08958370701665434){target="_blank"} - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (**PM~2.5~**)

See this [review article](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full){target="_blank"} for more information about sources of air pollution and the influence of air pollution on health.

### Sparse monitoring is problematic for Public Health

Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country. 

However, as can be seen in the following figure, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources. We will see later, when we evaluate the data, that even certain relatively large cities have only one monitor!

Furthermore, dramatic differences in pollution rates can be seen even within the same city. In fact, the term *micro-environment* describes environments within cities or counties that may vary greatly from one block to another.

```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "map_of_monitors.jpg"))
```

##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}

This lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations. 


### Machine learning offers a solution

An [article](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"} published in the *Environmental Health* journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods. 

```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "thepaper.png"))
```

##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}

#### {.reference_block}
Yanosky, J. D. et al. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. *Environ Health* 13, 63 (2014).

####

The authors of this article state that:

> "Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations." 

##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}

The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems. 
We will use similar methods to predict annual air pollution levels spatially within the US.


# **Main Question**
***

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as satellite pollution data and chemical modeling data?

####

# **Learning Objectives**
***

In this case study, we will walk you through importing data from CSV files and applying machine learning methods to predict our outcome variable of interest (in this case, annual fine particle air pollution estimates). 

We will especially focus on using packages and functions from the [`tidyverse`](https://www.tidyverse.org/){target="_blank"}, and more specifically the [`tidymodels`](https://cran.r-project.org/web/packages/tidymodels/tidymodels.pdf){target="_blank"} package/ecosystem primarily developed and maintained by [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} and [Davis Vaughan](https://resources.rstudio.com/authors/davis-vaughan){target="_blank"}. 
This package loads other modeling-related packages such as `rsample`, `recipes`, `parsnip`, `yardstick`, `workflows`, and `tune`. 

The tidyverse is a collection of packages created by RStudio. 
While some students may be more familiar with base R programming, the tidyverse packages make data science in R especially legible and intuitive.


```{r, echo = FALSE, fig.show = "hold", out.width = "20%", fig.align = "default"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
include_graphics("https://pbs.twimg.com/media/DkBFpSsW4AIyyIN.png")
```

The skills, methods, and concepts that students will be familiar with by the end of this case study are:


<u>**Data Science Learning Objectives:**</u> 
  
1. Familiarity with the tidymodels ecosystem.
2. Ability to evaluate correlation among predictor variables (`corrplot` and `GGally`).
3. Ability to implement tidymodels packages such as `rsample` to split the data into training and testing sets as well as cross validation sets.
4. Ability to use the `recipes`, `parsnip`, and `workflows` packages to train and test a linear regression model and a random forest model.
5. Demonstrate how to visualize geo-spatial data using `ggplot2`.
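As a brief preview of objective 3, splitting data with `rsample` looks like this (a sketch with a toy data frame, not the case study data):

```{r, eval=FALSE}
library(rsample)
set.seed(1234)
# Toy data standing in for the real monitor data
toy <- data.frame(value = rnorm(30), pop_density = runif(30))
toy_split <- initial_split(toy, prop = 2/3)  # 2/3 training, 1/3 testing
nrow(training(toy_split))  # 20 rows used to build the model
nrow(testing(toy_split))   # 10 rows held out for evaluation
```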

<u>**Statistical Learning Objectives:**</u>  
  
1. Basic understanding of the utility of machine learning for prediction and classification
2. Understanding of the need for training and test sets
3. Understanding of the utility of cross validation
4. Understanding of random forest
5. How to interpret root mean squared error (rmse) to assess performance for prediction
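For objective 5, the root mean squared error is just the square root of the average squared difference between observed and predicted values; a minimal sketch with made-up numbers:

```{r, eval=FALSE}
observed  <- c(10.2, 8.5, 12.1)  # made-up monitor values (ug/m^3)
predicted <- c(9.8, 9.1, 11.5)   # made-up model predictions
sqrt(mean((observed - predicted)^2))  # RMSE, about 0.54, in the units of the outcome
```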

We will begin by loading the packages that we will need:

```{r}
# Load packages for data import and data wrangling
library(here)
library(readr)
library(dplyr)
library(skimr)
library(summarytools)
library(magrittr)
# Load packages for making correlation plots
library(corrplot)
library(RColorBrewer)
library(GGally)
# Load packages for building machine learning algorithm
library(tidymodels)
library(workflows)
library(vip)
library(tune)
library(randomForest)
library(doParallel)
# Load packages for data visualization/creating map
library(ggplot2)
library(stringr)
library(tidyr)
library(lwgeom)
library(proxy) # needed for lwgeom
library(sf)
library(maps)
library(rnaturalearth)
library(rnaturalearthdata) # needed for rnaturalearth
library(rgeos)
library(patchwork)
# Load package for downloading the case study data files
library(OCSdata)
```


 <u>**Packages used in this case study:** </u>

Package   | Use in this case study                                                                      
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import CSV files
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to view/arrange/filter/select/compare specific subsets of data 
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data
[summarytools](https://cran.r-project.org/web/packages/summarytools/index.html){target="_blank"}      | to get an overview of data in a different style
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator 
[corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html){target="_blank"} | to make large correlation plots
[GGally](https://cran.r-project.org/web/packages/GGally/GGally.pdf){target="_blank"} | to make smaller correlation plots  
[tidymodels](https://www.tidymodels.org){target="_blank"} | to load in a set of packages (broom, dials, infer, parsnip, purrr, recipes, rsample, tibble, yardstick)
[rsample](https://tidymodels.github.io/rsample/articles/Basics.html){target="_blank"}   | to split the data into testing and training sets; to split the training set for cross-validation  
[recipes](https://tidymodels.github.io/recipes/){target="_blank"}   | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are `recipe()`, `prep()`, and various transformation `step_*()` functions, as well as `bake()`, which extracts pre-processed training data (this used to require `juice()`) and applies recipe preprocessing steps to testing data). See [here](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} for more info.
[parsnip](https://tidymodels.github.io/parsnip/){target="_blank"}   | an interface to create models (major functions are `fit()`, `set_engine()`)
[yardstick](https://tidymodels.github.io/yardstick/){target="_blank"}   | to evaluate the performance of models
[broom](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/){target="_blank"} | to get tidy output for our model fit and performance
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[dials](https://www.tidyverse.org/blog/2019/10/dials-0-0-3/){target="_blank"} | to specify hyper-parameter tuning
[tune](https://tune.tidymodels.org/){target="_blank"} | to perform cross validation, tune hyper-parameters, and get performance metrics
[workflows](https://www.rdocumentation.org/packages/workflows/versions/0.1.1){target="_blank"}| to create modeling workflow to streamline the modeling process
[vip](https://cran.r-project.org/web/packages/vip/vip.pdf){target="_blank"} | to create variable importance plots
[randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf){target="_blank"} | to perform the random forest analysis
[doParallel](https://cran.r-project.org/web/packages/doParallel/doParallel.pdf) | to fit cross validation samples in parallel 
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text in the map data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns
[rnaturalearth](https://cran.r-project.org/web/packages/rnaturalearth/README.html){target="_blank"} | to get the geometry data for the earth to plot the US
[maps](https://cran.r-project.org/web/packages/maps/maps.pdf){target="_blank"} | to get map database data about counties to draw them on our US map
[sf](https://r-spatial.github.io/sf/){target="_blank"} | to convert the map data into a data frame
[lwgeom](https://cran.r-project.org/web/packages/lwgeom/lwgeom.pdf){target="_blank"} | to use the `sf` function to convert map geographical data
[rgeos](https://cran.r-project.org/web/packages/rgeos/rgeos.pdf){target="_blank"} | to use geometry data
[patchwork](https://cran.r-project.org/web/packages/patchwork/patchwork.pdf){target="_blank"} | to allow plots to be combined
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
___


The first time we use a function, we will use the `::` to indicate which package we are using. 
Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
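For example, here is a minimal illustration using the built-in `mtcars` data:

```{r, eval=FALSE}
dplyr::select(mtcars, mpg, cyl)  # explicit about which package the function comes from
select(mtcars, mpg, cyl)         # equivalent once dplyr is loaded with library(dplyr)
```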


# **Context**
***

The [State of Global Air](https://www.stateofglobalair.org/){target="_blank"} is a report released every year to communicate the impact of air pollution on public health. 

The [State of Global Air 2019 report](https://www.stateofglobalair.org/sites/default/files/soga_2019_report.pdf){target="_blank"}, which uses data from 2017, stated that:

> Air pollution is the **fifth** leading risk factor for mortality worldwide. It is responsible for more
deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity.
Each year, **more** people die from air pollution–related disease than from road **traffic injuries** or **malaria**.

<p align="center">
<img width="600" src="https://www.healtheffects.org/sites/default/files/SoGA-Figures-01.jpg">
</p>

##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_report.pdf){target="_blank"}

The report also stated that:

> In 2017, air pollution is estimated to have contributed to close to 5 million
deaths globally — nearly **1 in every 10 deaths**.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017deaths.png"))
```

##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf){target="_blank"}

The [State of Global Air 2018 report](https://www.stateofglobalair.org/sites/default/files/soga-2018-report.pdf){target="_blank"}, which used data from 2016 and separated different types of air pollution, found that **particulate pollution was particularly associated with mortality**.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017mortality.png"))
```

##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga-2018-report.pdf){target="_blank"}

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

> More than **90%** of people worldwide live in areas **exceeding** the World Health Organization (WHO) **Guideline** for healthy air. More than half live in areas that do not even meet WHO's least-stringent air quality target.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","PMworld.png"))
```

##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf){target="_blank"}

Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 [*Our Nation's Air*](https://gispub.epa.gov/air/trendsreport/2019/#home){target="_blank"} report from the US Environmental Protection Agency (EPA): 

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "US.png"))
```

##### [[source]](https://gispub.epa.gov/air/trendsreport/2019/documentation/AirTrends_Flyer.pdf){target="_blank"}

However, air pollution **continues to contribute to health risk for Americans**, in particular in **regions with higher than national average rates** of pollution that, at times, exceed the WHO's recommended level. 
Thus, it is important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

You can see the current air quality conditions at this [website](https://aqicn.org/city/usa/){target="_blank"}, and you will notice variation across different cities.

For example, here are the conditions in Topeka, Kansas at the time this case study was created:

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "Kansas.png"))
```

##### [[source]](https://aqicn.org/city/usa/){target="_blank"}

It reports particulate values using what is called the [Air Quality Index](https://www.airnow.gov/index.cfm?action=aqibasics.aqi){target="_blank"} (AQI).
This [calculator](https://airnow.gov/index.cfm?action=airnow.calculator){target="_blank"} indicates that 114 AQI is equivalent to 40.7 ug/m^3^ and is considered unhealthy for sensitive individuals.
Thus, some areas exceed the WHO annual exposure guideline (10 ug/m^3^), and this may adversely affect the health of people living in these locations.
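The AQI maps a concentration interval linearly onto an index interval, so the conversion can be inverted by hand. Here is a minimal sketch for the single AQI 101-150 breakpoint (the `aqi_to_pm25()` helper is ours, for illustration, and it assumes the EPA breakpoint of 35.5-55.4 ug/m^3^ for that index range):

```{r, eval=FALSE}
# Hypothetical helper inverting the EPA's linear AQI formula within one
# breakpoint interval (AQI 101-150, assumed to map to 35.5-55.4 ug/m^3):
aqi_to_pm25 <- function(aqi, I_lo = 101, I_hi = 150, C_lo = 35.5, C_hi = 55.4) {
  (C_hi - C_lo) / (I_hi - I_lo) * (aqi - I_lo) + C_lo
}
aqi_to_pm25(114)  # roughly 40.8 ug/m^3, close to the calculator's 40.7
```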

Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines. 
Also, it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. 
For example, see this [article](https://www.nejm.org/doi/full/10.1056/NEJMoa1702747){target="_blank"} for more details.

The monitor data that we will use in this case study come from a system of monitors in which roughly 90% are located within cities. 
Hence, there is an **equity issue** in terms of capturing the air pollution levels of more rural areas. 
To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in **areas with little to no monitoring**. 
Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:

<p align="center">
  <img width="600" src="https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/SAWOEGBXMVGQ7AS5PZ6UUOX6FY.png">
</p>

##### [[source]](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.washingtonpost.com%2Fbusiness%2F2019%2F10%2F23%2Fair-pollution-is-getting-worse-data-show-more-people-are-dying%2F&psig=AOvVaw3v-ZDTBPnLP2MYtKf3Undj&ust=1585784479068000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPCyn9fxxegCFQAAAAAdAAAAABAd){target="_blank"}

This is what we aim to achieve in this case study.

# **Limitations**
***

There are some important considerations regarding the data analysis in this case study to keep in mind: 

1. The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.

2. Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution exposure at the individual level.

3. Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day and even hour. There are data sources that have finer levels of temporal data; however, we are interested in long term exposures, as these appear to be the most influential for health outcomes.


# **What are the data?** {#whatarethedata}
***

We are going to use a type of machine learning called supervised machine learning to try to predict air pollution levels. This type of machine learning requires that we have real values of an outcome (in our case, air pollution) to guide or supervise our work. This will ultimately allow us to predict air pollution values in places or times where we don't have them. 
For more explanation of what supervised machine learning is (and how it compares to a different kind of machine learning called unsupervised machine learning, where we don't have outcome values), see [this article](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d){target="_blank"}.
 
When using supervised machine learning for prediction, there are two main types of data of interest:

1. A **continuous** outcome variable that we want to predict 
2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable

The **outcome variable** is what we are trying to **predict**. 
To build (or train) our model, we use both the outcome and features.
The goal is to identify informative features that can explain a large amount of variation in our outcome variable. 
Using this model, we can then predict the outcome for new observations that have the same features but for which we have not observed the outcome. 

As a simple example, imagine that we have data about the sales and characteristics of cars from last year and we want to predict which cars might sell well this year. 
We do not have the sales data yet for this year, but we do know the characteristics of our cars for this year. 
We can build a model of the characteristics that explained sales last year to estimate what cars might sell well this year. 
In this case, our outcome variable is the sales of cars, while the different characteristics of the cars make up our features.
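With made-up numbers, this car sales example can be sketched as:

```{r, eval=FALSE}
# Hypothetical data: last year's sales (thousands of units) and car features
last_year <- data.frame(sales = c(120, 80, 45, 95),
                        mpg   = c(35, 28, 20, 32),
                        price = c(22, 30, 55, 26))
this_year <- data.frame(mpg = c(33, 24), price = c(25, 40))

fit <- lm(sales ~ mpg + price, data = last_year)  # train on last year's outcome
predict(fit, newdata = this_year)                 # predict this year's sales
```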

### **Start with a question**
***

This is the most commonly missed step when developing a machine learning algorithm. 
Machine learning can very easily be turned into an engineering problem. 
Just dump the outcome and the features into a [black box algorithm](https://en.wikipedia.org/wiki/Black_box) and voilà! 
But this kind of thinking can lead to major problems. In general, good machine learning questions:

1. Have a plausible explanation for why the features predict the outcome. 
2. Consider potential variation in both the features and the outcome over time.
3. Are consistently re-evaluated on criteria 1 and 2 over time. 

In this case study, we want to **predict** air pollution levels. 
To build this machine learning algorithm, our **outcome variable** is average annual fine particulate matter (PM~2.5~) captured from air pollution monitors in the contiguous US in 2008. 
Our **features** (or predictor variables) include data about population density, road density, urbanization levels, and NASA satellite data. 



### **Our outcome variable**
***

The monitor data that we will be using comes from **[gravimetric monitors](https://publiclab.org/wiki/filter-pm){target="_blank"}** (see picture below) operated by the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}.

```{r, echo = FALSE, out.width="100px"}
knitr::include_graphics(here::here("img","monitor.png"))
```

##### [image courtesy of [Kirsten Koehler](https://www.jhsph.edu/faculty/directory/profile/2928/kirsten-koehler)]

These monitors use a filtration system to specifically capture fine particulate matter. 

```{r, echo = FALSE, out.width="150px"}
knitr::include_graphics(here::here("img","filter.png"))
```

##### [[source]](https://publiclab.org/wiki/filter-pm){target="_blank"}

The weight of this particulate matter is manually measured daily or weekly. 
For the EPA standard operating procedure for PM gravimetric analysis in 2008, we refer the reader to [here](https://www3.epa.gov/ttnamti1/files/ambient/pm25/spec/RTIGravMassSOPFINAL.pdf){target="_blank"}.

#### {.click_to_expand_block}

<details><summary>For more on Gravimetric analysis, you can expand here </summary>

Gravimetric analysis is also used for [emission testing](https://www.mt.com/us/en/home/applications/Laboratory_weighing/emissions-testing-particulate-matter.html){target="_blank"}. 
The same idea applies: a fresh filter is applied and the desired amount of time passes, then the filter is removed and weighed. 

There are [other monitoring systems](https://www.sensirion.com/en/about-us/newsroom/sensirion-specialist-articles/particulate-matter-sensing-for-air-quality-measurements/){target="_blank"} that can provide hourly measurements, but we will not be using data from these monitors in our analysis. 
Gravimetric analysis is considered to be among the most accurate methods for measuring particulate matter.

</details>

####

In our data set, the `value` column indicates the PM~2.5~ monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors. 
The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m^3^).
Recall the WHO exposure guideline is < 10 ug/m^3^ on average annually for PM~2.5~.
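As a quick sanity check once the data are in hand, we could count how many monitor averages exceed this guideline. Here is a minimal sketch using made-up values (the real check would use the `value` column of our data):

```{r}
# Made-up annual PM2.5 averages (ug/m^3) for five hypothetical monitors
toy_pm <- c(8.5, 12.1, 9.9, 15.3, 10.0)

# How many exceed the WHO annual guideline of 10 ug/m^3?
sum(toy_pm > 10)
```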

### **Our features (predictor variables)**
***

There are 48 features with values for each of the 876 monitors (observations). 
The data comes from the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}, the [National Aeronautics and Space Administration (NASA)](https://www.nasa.gov/){target="_blank"}, the US [Census](https://www.census.gov/about/what/census-at-a-glance.html){target="_blank"}, and the [National Center for Health Statistics (NCHS)](https://www.cdc.gov/nchs/about/index.htm){target="_blank"}.

#### {.click_to_expand_block}

<details><summary> Click here to see a table about the set of features </summary>

Variable   | Details                                                                        
---------- |-------------
**id**  | Monitor number  <br> -- the county number is indicated before the decimal <br> -- the monitor number is indicated after the decimal <br>  **Example**: 1073.0023  is Jefferson county (1073) and .0023 one of 8 monitors 
**fips** | Federal information processing standard number for the county where the monitor is located <br> -- 5 digit id code for counties (zero is often the first value and sometimes is not shown) <br> -- the first 2 numbers indicate the state <br> -- the last three numbers indicate the county <br>  **Example**: Alabama's state code is 01 because it is first alphabetically <br> (note: Alaska and Hawaii are not included because they are not part of the contiguous US)  
**Lat** | Latitude of the monitor in degrees  
**Lon** | Longitude of the monitor in degrees  
**state** | State where the monitor is located
**county** | County where the monitor is located
**city** | City where the monitor is located
**CMAQ**  | Estimated values of air pollution from a computational model called [**Community Multiscale Air Quality (CMAQ)**](https://www.epa.gov/cmaq){target="_blank"} <br> --  A monitoring system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution <br> -- ***Does not use any of the PM~2.5~ gravimetric monitoring data.*** (There is a version that does use the gravimetric monitoring data, but not this one!) <br> -- Data from the EPA
**zcta** | [Zip Code Tabulation Area](https://en.wikipedia.org/wiki/ZIP_Code_Tabulation_Area){target="_blank"} where the monitor is located <br> -- Postal Zip codes are converted into "generalized areal representations" that are non-overlapping  <br> -- Data from the 2010 Census  
**zcta_area** | Land area of the zip code area in meters squared  <br> -- Data from the 2010 Census  
**zcta_pop** | Population in the zip code area  <br> -- Data from the 2010 Census  
**imp_a500** | Impervious surface measure <br> -- Within a circle with a radius of 500 meters around the monitor <br> -- Impervious surface are roads, concrete, parking lots, buildings <br> -- This is a measure of development 
**imp_a1000** | Impervious surface measure <br> --  Within a circle with a radius of 1000 meters around the monitor
**imp_a5000** | Impervious surface measure <br> --  Within a circle with a radius of 5000 meters around the monitor  
**imp_a10000** | Impervious surface measure <br> --  Within a circle with a radius of 10000 meters around the monitor   
**imp_a15000** | Impervious surface measure <br> --  Within a circle with a radius of 15000 meters around the monitor  
**county_area** | Land area of the county of the monitor in meters squared  
**county_pop** | Population of the county of the monitor  
**Log_dist_to_prisec** | Log (Natural log) distance to a primary or secondary road from the monitor <br> -- Highway or major road  
**log_pri_length_5000** | Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_10000** | Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_15000** | Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_25000** | Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) <br> -- Highways only  
**log_prisec_length_500** | Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_1000** | Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_5000** | Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_10000** | Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_15000** | Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_25000** | Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads      
**log_nei_2008_pm25_sum_10000** | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)    
**log_nei_2008_pm25_sum_15000** | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)     
**log_nei_2008_pm25_sum_25000** | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)     
**log_nei_2008_pm10_sum_10000** | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)      
**log_nei_2008_pm10_sum_15000**| Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)      
**log_nei_2008_pm10_sum_25000** | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)      
**popdens_county** | Population density (number of people per kilometer squared area of the county)
**popdens_zcta** | Population density (number of people per kilometer squared area of zcta)
**nohs** | Percentage of people in zcta area where the monitor is that **do not have a high school degree** <br> -- Data from the Census
**somehs** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was **some high school education** <br> -- Data from the Census
**hs** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing a **high school degree** <br> -- Data from the Census  
**somecollege** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing **some college education** <br> -- Data from the Census 
**associate** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing an **associate degree** <br> -- Data from the Census 
**bachelor** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was a **bachelor's degree** <br> -- Data from the Census 
**grad** | Percentage of people in zcta area where the monitor whose highest formal educational attainment was a **graduate degree** <br> -- Data from the Census 
**pov** | Percentage of people in zcta area where the monitor is that lived in [**poverty**](https://aspe.hhs.gov/2008-hhs-poverty-guidelines) in 2008 <br> -- Data from the Census  
**hs_orless** |  Percentage of people in zcta area where the monitor whose highest formal educational attainment was a **high school degree or less** (sum of nohs, somehs, and hs)  
**urc2013** | [2013 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban 6 is completely rural <br>  -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}     
**urc2006** | [2006 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_154.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban 6 is completely rural <br> -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}     
**aod** | Aerosol Optical Depth measurement from a NASA satellite <br> -- based on the diffraction of a laser <br> -- used as a proxy of particulate pollution <br> -- unit-less - higher value indicates more pollution <br> -- Data from NASA  

</details>

####
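Since many of the features above are natural-log transformed, note that in R the `log()` function computes the natural log (base e) by default:

```{r}
log(100)        # natural log (base e)
log10(100)      # base-10 log
exp(log(100))   # exp() inverts the natural log, returning 100
```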

Many of these features have to do with the circular area around the monitor called the "buffer". These are illustrated in the following figure:

```{r, echo = FALSE, out.width = "800px",}
knitr::include_graphics(here::here("img", "regression.png"))
```

##### [[source]](https://www.ncbi.nlm.nih.gov/pubmed/15292906){target="_blank"}



# **Data Import**
***

All of our data was previously collected by a [researcher](http://www.biostat.jhsph.edu/~rpeng/) at the [Johns Hopkins School of Public Health](https://www.jhsph.edu/) who studies air pollution and climate change.  

We have one CSV file that contains both our single **outcome variable** and all of our **features** (or predictor variables). You can download this file using the `OCSdata` package:

```{r, eval=FALSE}
# install.packages("OCSdata")
OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())
```

If you have trouble using the package, you can also find this data on our [GitHub repository](https://github.com/opencasestudies/ocs-bp-air-pollution/tree/master/data/raw). Or you can download it more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-air-pollution/master/data/raw/pm25_data.csv).

We have created a "raw" subdirectory within a directory called "data" in the working directory of our RStudio project.

Next, we import our data into R so that we can explore the data further. 
We will call our data object `pm` for particulate matter. 
We import the data using the `read_csv()` function from the `readr` package. 

We will use the `here` package to make it easier to find the data file.

#### {.click_to_expand_block}

<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

####


```{r}
pm <- readr::read_csv(here("data","raw", "pm25_data.csv"))
```

We will save this data as an rda file for later in an "imported" subdirectory of the data directory.

```{r}
save(pm, file = here::here("data", "imported", "imported_pm.rda"))
```


# **Data Exploration and Wrangling**
***
If you are following along but stopped, you could start here by first loading the data like so:

```{r}
load(here::here("data", "imported", "imported_pm.rda"))
```

#### {.click_to_expand_block}

<details> <summary> If you skipped the data import section click here. </summary>

First you need to install the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

Then, you may download and load the imported data `.rda` file using the following code:

```{r, eval=FALSE}
OCSdata::imported_data("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_pm.rda"))
```

If the package does not work for you, an RDA file (stands for R data) of the data can be found on our [GitHub repository](https://github.com/opencasestudies/ocs-bp-air-pollution/tree/master/data/imported). Or you can download it more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-air-pollution/master/data/imported/imported_pm.rda).

To load the downloaded data into your environment, you may double click on the `.rda` file in RStudio or use the `load()` function.

To copy and paste our code below, place the downloaded file in your current working directory within a subdirectory called "imported" within a subdirectory called "data". We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "imported", "imported_pm.rda"))
```

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

</details>

####

The first step in performing any data analysis is to explore the data. 

For example, we might want to better understand the variables included in the data, as we may learn important details that we should keep in mind as we try to predict our outcome variable.

First, let's just get a general sense of our data. 
We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).

We will also use the `%>%` pipe, which can be used to define the input for later sequential steps. 

This will make more sense when we have multiple sequential steps using the same data object. 

To use the pipe notation we need to install and load `dplyr` as well.

For example, here we start with the `pm` data object and "pipe" it as input into the `glimpse()` function. 
The output is an overview of what is in the `pm` object, such as the number of rows and columns, all the column names, the data types for each column, and the first few values in each column. 
The output below is scrollable so you can see everything from the `glimpse()` function. 

#### {.scrollable }

```{r}
# Scroll through the output!
pm %>%
  dplyr::glimpse()
```

####

We can see that there are 876 monitors (rows) and that we have 50 total variables (columns) - one of which is the outcome variable. In this case, the outcome variable is called `value`. 

Notice that some of the variables that we would think of as factors (or categorical data) are currently of class character as indicated by the `<chr>` just to the right of the column names/variable names in the `glimpse()` output. This means that the variable values are character strings, such as words or phrases. 

The other variables are of class `<dbl>`, which stands for double precision which indicates that they are numeric and that they have decimal values. In contrast, one could have integer values which would not allow for decimal numbers. Here is a [link](https://en.wikipedia.org/wiki/Double-precision_floating-point_format){target="_blank"} for more information on double precision numeric values.

Another common data class is factor which is abbreviated like this: `<fct>`. A factor is something that has unique levels but there is no appreciable order to the levels. For example we can have a numeric value that is just an id that we want to be interpreted as just a unique level and not as the number that it would typically indicate. This would be useful for several of our variables:

1. the monitor ID (`id`)
2. the Federal Information Processing Standard number for the county where the monitor was located (`fips`)
3. the zip code tabulation area (`zcta`)

None of the values actually have any real numeric meaning, so we want to make sure that R does not interpret them as if they do. 
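A toy example (with made-up fips-style codes) shows why this matters: numeric summaries of codes like these are meaningless, while a factor treats each code as a level whose occurrences we can count.

```{r}
# Made-up fips-style codes stored as numbers
fips_toy <- c(1073, 1073, 4013, 6037)

# As numbers, summaries like the mean are meaningless
mean(fips_toy)

# As a factor, each code is just a level we can tabulate
table(as.factor(fips_toy))
```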

So let's convert these variables into factors. 
We can do this using the `across()` function of the `dplyr` package and the `as.factor()` base function. 
The `across()` function has two main arguments: (i) the columns you want to operate on and (ii) the function or list of functions to apply to each column. 

In this case, we are also using the `magrittr` assignment pipe (or double pipe), which looks like this: `%<>%`. 
This allows us to use the `pm` data as input, while also reassigning the output to the same data object name.

#### {.scrollable }

```{r}
# Scroll through the output!
pm %<>%
  dplyr::mutate(across(c(id, fips, zcta), as.factor)) 

glimpse(pm)
```

####

Great! Now we can see that these variables are now factors as indicated by `<fct>` after the variable name.



### **`skim` package**
***

The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.
By design, it provides summary statistics about variables in the data set. 


#### {.scrollable }

```{r}
# Scroll through the output!
skimr::skim(pm)
```

####

Notice how there is a column called `n_missing` about the number of values that are missing. 

This is also reflected in the `complete_rate` column (the proportion of values that are not missing, i.e. 1 - n_missing/number of observations). 

In our data set, it looks like our data do not contain any missing data. 
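For reference, `n_missing` and `complete_rate` can also be computed directly. A minimal sketch on a made-up vector:

```{r}
# Made-up vector with one missing value
x <- c(5, NA, 7, 9)

sum(is.na(x))                # n_missing: 1
sum(!is.na(x)) / length(x)   # complete_rate: 0.75
```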

Also notice how the function provides separate tables of summary statistics for each data type: character, factor and numeric. 

Next, the `n_unique` column shows us the number of unique values for each of our columns. 
We can see that there are 49 states represented in the data.

We can see that for many variables there are many low values as the distribution shows two peaks, one near zero and another with a higher value. 

This is true for the `imp` variables (measures of development), the `nei` variables (measures of emission sources) and the road density variables. 

We can also see that the range of some of the variables is very large, in particular the area and population related variables.


Let's take a look to see which states are included using the `distinct()` function of the `dplyr` package:

```{r, eval = FALSE} 
pm %>% 
  dplyr::distinct(state) 
```


Scroll through the output:

#### {.scrollable }
```{r, echo = FALSE}
# Scroll through the output!
pm %>% 
  distinct(state) %>%
# this allows us to show the full output in the rendered rmarkdown
 print(n = 1e3)
```
####

It looks like "District of Columbia" is being included as a state. 
We can see that Alaska and Hawaii are not included in the data.

Let's also take a look to see how many monitors there are in a few cities. We can use the `filter()` function of the `dplyr` package to do so. For example, let's look at Albuquerque, New Mexico. 

```{r}
pm %>% dplyr::filter(city == "Albuquerque")

```

We can see that there were only two monitors in the city of Albuquerque in 2008. Let's compare this with Baltimore.

```{r}
pm %>% filter(city == "Baltimore")

```

In contrast, there were five monitors for the city of Baltimore, even though, if we take a look at the land area and population of the counties for Baltimore and Albuquerque, we can see that they had very similar land areas and populations.

```{r}
pm %>% 
  filter(city == "Baltimore") %>% 
  dplyr::select(county_area:county_pop)
pm %>% 
  filter(city == "Albuquerque") %>%
  select(county_area:county_pop)

```

In fact, the county containing Albuquerque had a larger population. Thus the measurements for Albuquerque were not as thorough as they were for Baltimore.

This may be due to the fact that the monitor values were lower in Albuquerque. It is interesting to note here that the CMAQ values are quite similar for both cities.


## **Evaluate correlation**
***

In prediction analyses, it is also useful to evaluate if any of the variables are correlated. Why should we care about this?

If we are using a linear regression to model our data, then we might run into a problem called multicollinearity which can lead us to misinterpret what is really predictive of our outcome variable. This phenomenon occurs when the predictor variables actually predict one another. See [this case study](https://opencasestudies.github.io/ocs-bp-RTC-analysis/) for a deeper explanation about this. 

Another reason we should look out for correlation is that we don't want to include redundant variables. This can add unnecessary noise to our algorithm causing a reduction in prediction accuracy, and it can cause our algorithm to be unnecessarily slower. Finally, it can also make it difficult to interpret what variables are actually predictive.

Let's first take a look at all of our numeric variables with the `corrplot` package.
The `corrplot` package is a great option for looking at correlation among possible predictors, and is particularly useful if we have many predictors. 

First, we calculate the Pearson correlation coefficients between all features pairwise using the `cor()` function of the `stats` package (which is loaded automatically). Then we use the `corrplot::corrplot()` function. The `tl.cex = 0.5` argument controls the size of the text label. 

```{r}
PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)
```
Nice! Now we can see which variables show a positive (blue) or negative (red) correlation to one another. Variables that show very little correlation with one another appear white or lightly colored. We can see that each variable is perfectly correlated with itself, which is why there is a line of blue squares diagonally across the plot.

We can also plot the absolute value of the Pearson correlation coefficients using the `abs()` function from base R and change the order of the columns. This can be helpful if we aren't interested in the direction of the correlation and just want to see which variables have a relationship with one another.

```{r}
corrplot(abs(PM_cor), order = "hclust", tl.cex = 0.5, cl.lim = c(0, 1))

```
Nice, this is a bit easier to read now, as the different colors are less distracting - we just want to focus on intensity. Notice that the red side of the legend no longer informs us of much: because we have plotted the absolute values of the correlations, all values are positive and blue. The darker the blue, the stronger the correlation.

There are several options for ordering the variables. See [here](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) for more options. Here we will use the "hclust" option for ordering by [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) - which will order the variables by how similar they are to one another.

The `cl.lim = c(0, 1)` argument limits the color label to be between 0 and 1. 
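To get a feel for what the "hclust" ordering does, here is a rough sketch on made-up data (`corrplot`'s exact distance and linkage choices may differ): variables are clustered on a correlation-based distance, so correlated variables end up next to each other in the plot.

```{r}
# Made-up data: a & b move together, u & v move together
set.seed(1)
a <- rnorm(100); b <- a + rnorm(100, sd = 0.2)
u <- rnorm(100); v <- u + rnorm(100, sd = 0.2)
m <- cor(cbind(a, b, u, v))

# Cluster on a correlation-based distance; correlated variables become neighbors
hc <- hclust(as.dist(1 - abs(m)))
colnames(m)[hc$order]
```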


We can see that the development (`imp`) variables are correlated with each other, as we might expect. 
We also see that the road density variables seem to be correlated with each other, and the emission variables seem to be correlated with each other. 


Also notice that none of the predictors are highly correlated with our outcome variable (`value`).

We can also take a closer look using the `ggcorr()` and `ggpairs()` functions of the `GGally` package. 

To select our variables of interest we can use the `select()` function with the `contains()` function of the `tidyr` package. 

First let's look at the `imp`/development variables. 
We can change the default color palette (`palette = "RdBu"`) and add on 
correlation coefficients to the plot (`label = TRUE`).

```{r, out.width = "400px"}
select(pm, contains("imp")) %>%
  ggcorr(palette = "RdBu", label = TRUE)

select(pm, contains("imp")) %>%
  ggpairs()
```



Indeed, we can now see more clearly that `imp_a1000` and `imp_a500` are highly correlated, as are `imp_a10000` and `imp_a15000`. We also get a sense of how the data points vary across the range of values. Note that in this plot red indicates positive correlation values and blue indicates negative correlation values. This is in contrast to our previous plot.

Next, let's take a look at the road density data:

```{r, fig.width=12}
select(pm, contains("pri")) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
       layout.exp=2, label = TRUE)
```

We can see that many of the road density variables are highly correlated with one another, while others are less so. Again note that in this plot red indicates positive correlation values and blue indicates negative correlation values.

Finally let's look at the emission variables.

```{r}
select(pm, contains("nei")) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

select(pm, contains("nei")) %>%
  ggpairs()
```
We can see some fairly high correlation values as well.


We would also expect the population density data might correlate with some of these variables. 
Let's take a look.

```{r}
pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county, 
       log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county, 
       log_pri_length_10000, imp_a10000, county_pop) %>%
  ggpairs()
```


Interesting - these variables don't appear to be highly correlated, therefore we might need variables from each of the categories to predict our monitor PM~2.5~ pollution values.

Because some variables in our data have extreme values, it might be good to take a log transformation, as extreme values can affect our estimates of correlation. 

```{r}
pm %>%
  mutate(log_popdens_county= log(popdens_county)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
       log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

pm %>%
  mutate(log_popdens_county= log(popdens_county)) %>%
  mutate(log_pop_county = log(county_pop)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
       log_pri_length_10000, imp_a10000, log_pop_county) %>%
  ggpairs()
```

Indeed this increased the correlation, but variables from each of these categories may still prove to be useful for prediction.

Now that we have a sense of what our data are, we can get started with building a machine learning model to predict air pollution. 

First let's save our data again because we did wrangle it just a tad. This time we will save it to a subdirectory of the "data" directory called "wrangled". This is a good practice in general for data analyses to keep your data organized. We will also save a csv version as this is often useful to give to collaborators. To do this we will use the `write_csv()` function of the `readr` package.

```{r, eval = FALSE}
save(pm, file = here::here("data", "wrangled", "wrangled_pm.rda"))
write_csv(pm, file = here::here("data", "wrangled", "wrangled_pm.csv"))
```

# **What is machine learning?**  {#whatisml}
***

You may have learned about the central dogma of statistics: that you sample from a population.

![](img/cdi1.png)

Then you use the sample to try to guess what is happening in the population.

![](img/cdi2.png)

For prediction, we have a similar sampling problem.

![](img/cdp1.png)

But now we are trying to build a rule that can be used to predict a single observation's value of some characteristic using characteristics of the other observations. 

![](img/cdp2.png)

Let's make this more concrete.

If you recall from the [What are the data?](#whatarethedata) section above, when we are using machine learning for prediction, our data consists of: 

1. A **continuous** outcome variable that we want to predict 
2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable

We will use $Y$ to denote the outcome variable and $X = (X_1, \dots, X_p)$ to denote $p$ different features (or predictor variables). 
Because our outcome variable is **continuous** (as opposed to categorical), we are interested in a particular type of machine learning algorithm. 

Our goal is to build a machine learning algorithm that uses the features $X$ as input and predicts an outcome variable (or air pollution levels) in the situation where we do not know the outcome variable. 

The way we do this is to use data where we have both the features $(X_1=x_1, \dots X_p=x_p)$ and the actual outcome $Y$ data to _train_ a machine learning algorithm to predict the outcome, which we call $\hat{Y}$.  

When we say train a machine learning algorithm we mean that we estimate a function $f$ that uses the predictor variables $X$ as input or $\hat{Y} = f(X)$. 
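As a minimal sketch of this idea (not the model we will build later), we could estimate $f$ with a simple linear model on made-up data and then use it to predict the outcome for a new observation:

```{r}
# Made-up training data with a known relationship plus noise
set.seed(2)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)

# "Train": estimate f from the data
f_hat <- lm(y ~ x)

# "Predict": y-hat for a new observation with x = 5 (the true value is 13)
predict(f_hat, newdata = data.frame(x = 5))
```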

## **ML as an optimization problem**

If we are doing a good job, then our predicted outcome $\hat{Y}$ should closely match our actual outcome $Y$ that we observed. 

In this way, we can think of machine learning (ML) as an optimization problem that tries to minimize the distance between $\hat{Y} = f(X)$ and $Y$. 

$$d(Y - f(X))$$

The choice of distance metric $d(\cdot)$ can be the mean of the absolute or squared difference or something more complicated. 

Much of the fields of statistics and computer science are focused on defining $f$ and $d$.
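Two common choices for $d$ can be computed directly in R. Here is a small sketch on made-up observed and predicted values:

```{r}
y     <- c(10, 12,  9, 15)   # observed outcomes
y_hat <- c(11, 11, 10, 13)   # predicted outcomes

mean(abs(y - y_hat))         # mean absolute error: 1.25
sqrt(mean((y - y_hat)^2))    # root mean squared error: ~1.32
```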

## **The parts of an ML problem**

To set up a machine learning (ML) problem, we need a few components.
To solve a (standard) machine learning problem you need: 

1. A data set to train from. 
2. An algorithm or set of algorithms you can use to try values of $f$.
3. A distance metric $d$ for measuring how close $Y$ is to $\hat{Y}$.
4. A definition of what a "good" distance is.
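As a sketch of how these four components fit together, here is a tiny base R example (simulated data, `lm()` as the algorithm, RMSE as the distance, and an arbitrary cutoff as the definition of "good"; none of this is part of our air pollution analysis):

```{r}
set.seed(1234)
# 1. A data set to train from (simulated here)
dat <- data.frame(x = 1:50)
dat$y <- 2 * dat$x + rnorm(50, sd = 3)

# 2. An algorithm for trying values of f (here, ordinary least squares)
f_hat <- lm(y ~ x, data = dat)

# 3. A distance metric d between Y and Y-hat (here, RMSE)
rmse <- sqrt(mean((dat$y - predict(f_hat))^2))

# 4. A definition of what a "good" distance is (an arbitrary cutoff)
rmse < 5
```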

While each of these components is a _technical_ problem, there has been a ton of work addressing those technical details. The most pressing open issue in machine learning is realizing that though these are _technical_ steps they are not _objective_ steps. In other words, how you choose the data, algorithm, metric, and definition of "good" says what you value and can dramatically change the results. A couple of cases where this was a big deal are: 

1. [Machine learning for recidivism](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) - people built ML models to predict who would re-commit a crime. But these predictions were based on historically biased data which led to biased predictions about who would commit new crimes. 
2. [Deciding how self driving cars should act](https://www.nature.com/articles/d41586-018-07135-0) - self driving cars will have to make decisions about how to drive, who they might injure, and how to avoid accidents. Depending on our choices for $f$ and $d$ these might lead to wildly different kinds of self driving cars. Try out the [moralmachine](http://moralmachine.mit.edu/) to see how this looks in practice. 

Now that we know a bit more about machine learning, let's build a model to predict air pollution levels using the `tidymodels` framework. 

# **Machine learning with `tidymodels`**
***
The goal is to build a machine learning algorithm that uses the features as input and predicts an outcome variable (or air pollution levels) in the situation where we do not know the outcome variable. 

The way we do this is to use data where we have both the input and output data to _train_ a machine learning algorithm. 

To train a machine learning algorithm, we will use the `tidymodels` package ecosystem. 

## **Overview**
***

### **The `tidymodels` ecosystem**
***

To perform our analysis we will be using the `tidymodels` suite of packages. 
You may be familiar with the older packages `caret` or `mlr` which are also for machine learning and modeling but are not a part of the `tidyverse`. 
[Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} describes `tidymodels` like this:

> "Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret.
The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do."

There are many R packages in the `tidymodels` ecosystem, which assist with various steps in the process of building a machine learning algorithm. These are the main packages, but there are others.

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","simpletidymodels.png"))
```

This is a schematic of how these packages work together to build a machine learning algorithm:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","MachineLearning.png"))
```

Here we can see that after exploring and splitting the data, we perform the initial modeling stages of variable assignment and pre-processing, as shown in the blue box, while the green box indicates the steps required to train the model: specifying the model, fitting the model, and tuning it. 

### **Benefits of `tidymodels`**
***

The two major benefits of `tidymodels` are: 

1. Standardized workflow/format/notation across different types of machine learning algorithms  

Because different algorithms were developed by different people, they often require different notations and data formats. Without a standardized interface, testing multiple algorithms would require the painstaking process of reformatting the data to be compatible with each one.

2. Can easily modify pre-processing, algorithm choice, and hyper-parameter tuning making optimization easy  

Modifying a piece of the overall process is easier than before because many of the steps are specified with the `tidymodels` packages in a convenient, modular manner. Thus the entire process can be rerun after a simple change, such as to the pre-processing, without much difficulty.

## **`tidymodels` Steps**
***

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","Updated_tidymodels_basics.png"))
```

## **Splitting the data**
***

The first step after data exploration in machine learning analysis is to [split the data](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7){target="_blank"} into **training** and **testing** data sets. 

The training data set will be used to build and tune our model. 
This is the data that the model "learns" on. 
The testing data set will be used to evaluate the performance of our model in a more generalizable way. What do we mean by "generalizable"?

Remember that our main goal is to use our model to be able to predict air pollution levels in areas where there are no gravimetric monitors. 

Therefore, if our model is really good at predicting air pollution with the data that we use to build it, it might not do the best job for the areas where there are few to no monitors. 

In that case we would see really good prediction accuracy while building the model and might assume that we were going to do a good job estimating air pollution any time we use our model, but in fact this would likely not be the case. 
This situation is what we call **[overfitting](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6){target="_blank"}**.

Overfitting happens when we end up modeling not only the major relationships in our data but also the noise within our data. 
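A small base R simulation (separate from our air pollution data) illustrates this: a very flexible model will always fit the training data at least as well as a simple one, but because it also models the noise, it tends to do worse on new data:

```{r}
set.seed(1234)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)          # true relationship is linear plus noise
x_new <- runif(n, 0, 10)             # new data from the same process
y_new <- 2 + 0.5 * x_new + rnorm(n)

simple  <- lm(y ~ x)
complex <- lm(y ~ poly(x, 10))       # flexible model that can chase the noise

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# The complex model always wins on the data it was trained on...
rmse(y, predict(simple))
rmse(y, predict(complex))

# ...but typically loses on new data it has never seen
rmse(y_new, predict(simple,  newdata = data.frame(x = x_new)))
rmse(y_new, predict(complex, newdata = data.frame(x = x_new)))
```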

```{r, echo=FALSE}
knitr::include_graphics("https://miro.medium.com/max/1110/1*tBErXYVvTw2jSUYK7thU2A.png")
```

##### [[source]](https://miro.medium.com/max/1110/1*tBErXYVvTw2jSUYK7thU2A.png){target="_blank"}

If we get good prediction with our testing set, then we know that our model can be applied to other data and will likely perform well. We will discuss this more later.

We will not touch the testing set until we have completed optimizing our model with the training set. 
This will allow us to have a less biased evaluation of how well our model can do with other data besides the data used in the training set to build the model. 
**Ideally, you would also want a completely independent data set to further test the performance of your model.**

To split the data into training and testing, we will use the `initial_split()` function in the `rsample` package to specify how we want to split our data.


```{r, echo=FALSE}

knitr::include_graphics(here::here("img","split.png"))
```


#### {.click_to_expand_block}

<details> <summary> If you skipped previous sections click here for more information on how to obtain and load the data. </summary>

First you need to install the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

Then, you may download and load the wrangled data `.rda` file using the following code:

```{r, eval=FALSE}
OCSdata::wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))
```

If the package does not work for you, you may also download this `.rda` file by clicking this link [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-air-pollution/master/data/wrangled/wrangled_pm.rda).

To load the downloaded data into your environment, you may double click on the `.rda` file in RStudio or use the `load()` function.

To copy and paste our code below, place the downloaded `.rda` file in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data". We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_pm.rda"))
```

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

</details>

####

```{r}
set.seed(1234)
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
```

A couple of notes from the code above: 

- Typically, data are split with the majority of the observations for training and a smaller portion for testing. The default with this function is 3/4 (75%) of the observations for training and 1/4 (25%) for testing, so the proportion does not need to be specified. However, you can change it using the `prop` argument, which we will do here for illustrative purposes. People often use 80% (4/5) for training and 20% (1/5) for testing, or some other similar proportion, like the 2/3 for training and 1/3 for testing that we use here. 

#### {.click_to_expand_block}

<details><summary> Click here to learn more about how people decide what split proportion to use. </summary>

Having more training data helps the model to train on a greater variety of observations. However, having more testing data helps to see how generalizable the model is and allows for better comparisons of different models. The need for each may depend on how much variability your data has and on its size. For smaller datasets, setting aside a larger portion for testing can be beneficial to avoid ending up with a very small testing dataset. Here's a [paper](https://onlinelibrary.wiley.com/doi/full/10.1002/sam.11583) that describes more about this topic.

</details>

####

- Since the split is performed randomly, it is a good idea to use the `set.seed()` function in base R to ensure that if you rerun your code, your split will be the same next time.
- We can see the number of monitors in our training, testing, and original data by typing in the name of our split object. The result will look like this:
<training data sample number, testing data sample number, original sample number> 
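Conceptually, `initial_split()` just randomly partitions the row indices; here is a base R sketch of a 2/3 to 1/3 split (using the built-in `mtcars` data as a stand-in for `pm`, purely to show the idea):

```{r}
set.seed(1234)
n <- nrow(mtcars)
train_rows <- sample(n, size = floor(2/3 * n))  # randomly pick 2/3 of the rows

train_df <- mtcars[train_rows, ]
test_df  <- mtcars[-train_rows, ]               # everything else is for testing

c(training = nrow(train_df), testing = nrow(test_df), total = n)
```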

Now, you can also specify a variable to stratify by with the `strata` argument. 
This is useful if you have imbalanced categorical variables and you would like to intentionally make sure that there are a similar number of samples of the rarer categories in both the testing and training sets. 
Otherwise the split is performed randomly. 

According to the [documentation](https://www.rdocumentation.org/packages/rsample/versions/0.0.5/topics/initial_split) for the `rsample` package:

> The strata argument causes the random sampling to be conducted within the stratification variable. This can help ensure that the number of data points in the training data is equivalent to the proportions in the original data set.

In the case with our data set, perhaps we would like our training set to have similar proportions of monitors from each of the states as in the initial data. 
This might be useful if we want our model to be generalizable across all of the states.
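The idea behind stratified sampling can be sketched in base R by sampling within each group separately (here using `mtcars$cyl` as a stand-in stratification variable, just to show the mechanics):

```{r}
set.seed(1234)
# Sample 2/3 of the row indices *within* each stratum
strata_rows <- split(seq_len(nrow(mtcars)), mtcars$cyl)
train_rows <- unlist(lapply(strata_rows,
                            function(rows) sample(rows, floor(2/3 * length(rows)))))

# Each stratum keeps roughly the same share in the training set
table(mtcars$cyl[train_rows])
table(mtcars$cyl)
```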

We can see that indeed there are different proportions of monitors in each state by using the `count()` function of the `dplyr` package. 

```{r, eval = FALSE}
count(pm, state)
```

Scroll through the output:

#### {.scrollable }

```{r, echo=FALSE}
# Scroll through the output!
count(pm, state) %>%
  print(n = 1e3)
```
####

If our data set were large enough it might be nice then to stratify by state using the `strata = "state"` argument in `initial_split()`, but our data is unfortunately not large enough. 

Importantly, the `initial_split()` function only determines which rows of our `pm` data frame should be assigned for training or testing; it does not actually split the data. 

To extract the testing and training data, we can use the `training()` and `testing()` functions, also from the `rsample` package.

#### {.scrollable }
```{r}
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
 
# Scroll through the output!
count(train_pm, state)
count(test_pm, state)
```
####



## **Preparing for pre-processing the data**
***

After splitting the data, the next step is to process the training and testing data so that the data are compatible and optimized to be used with the model. 

In order to start this, we need to think about what role each aspect of our data should play in the model. For example, is this particular column of data what we would consider the outcome of interest? Or are these data values possibly helpful for predicting the outcome? This process is described as assigning variables to specific roles within the model.

We will then do what is called pre-processing to prepare the data so it is ready. This involves things like scaling variables and removing redundant variables. 

This process is also called feature engineering.

To do this in `tidymodels`, we will create what's called a "recipe" using the `recipes` package, which is a standardized format for a sequence of steps for pre-processing the data.
This can be very useful because it makes testing out different pre-processing steps or different algorithms with the same pre-processing very easy and reproducible.
Creating a recipe specifies **how a data frame of predictors should be created** - it specifies what variables to be used and the pre-processing steps, but it **does not execute these steps** or create the data frame of predictors.

### Step 1: Specify variables roles with `recipe()` function

The first thing to do to create a recipe is to specify which variables we will be using as our outcome and predictors using the `recipe()` function. 
In terms of the metaphor of baking, we can think of this as listing our ingredients. 
Translating this to the `recipes` package, we use the `recipe()` function to assign roles to all the variables. 

Let's try the simplest recipe with no pre-processing steps: simply list the outcome and predictor variables.

We can do so in two ways:  

1) Using formula notation  
2) Assigning roles to each variable  

Let's look at the first way using formula notation, which looks like this:  

outcome(s) ~ predictor(s)  

In the case of multiple predictors, or a multivariate situation with two outcomes, use a plus sign:  

outcome1 + outcome2 ~ predictor1 + predictor2  

If we want to include all predictors we can use a period like so:  

outcome_variable_name ~ .  

Now with our data, we will start by making a recipe for our training data.
If you recall, the continuous outcome variable is `value` (the average annual gravimetric monitor PM~2.5~ concentration in ug/m^3^). 
Our features (or predictor variables) are all the other variables except the monitor ID, which is an `id` variable.

The reason not to include the `id` variable is that this variable includes the county number and a number designating which particular monitor the values came from (of the monitors in that county). 
Since this number is arbitrary, the county information is already given in the data, and each monitor only has one value in the `value` variable, nothing is gained by including this variable; it may instead introduce noise. 
However, it is useful to keep this data to take a look at what is happening later. 
We will show you what to do in this case in just a bit.

To summarize this step, we will use the `recipe()` function to assign roles to all the variables: 

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","Starting_a_recipe_recipes1.png"))
```

We will describe this step by step and then show all the steps together.

In the simplest case, we might use all predictors like this:

```{r}
simple_rec <- train_pm %>%
  recipes::recipe(value ~ .)

simple_rec
```

We see a recipe has been created with 1 outcome variable and 49 predictor variables (or features). 
Also, notice how we named the output of `recipe()`. 
The naming convention for recipe objects is `*_rec` or `rec`. 

Now, let's get back to the `id` variable. 
Instead of including it as a predictor variable, we could also use the `update_role()` function of the `recipes` package.

```{r}
simple_rec <- train_pm %>%
  recipes::recipe(value ~ .) %>%
  recipes::update_role(id, new_role = "id variable")

simple_rec
```

#### {.click_to_expand_block}

<details><summary> Click here to learn more about the working with `id` variables </summary>

This option works well with the newer `workflows` package. However, `id` variables are often dropped from analyses that do not use this newer package, as they can make the process difficult when using the `parsnip` package alone, because new levels (or possible values) may be introduced with the testing data.

</details>

####

We could also specify the outcome and predictors in the same way as we just specified the id variable. 
Please see [here](https://tidymodels.github.io/recipes/reference/recipe.html){target="_blank"} for examples of other roles for variables. 
The role can actually be any value. 

The order is important here, as we first make all variables predictors and then override this role for the outcome and `id` variable. 
We will use the `everything()` function of the `dplyr` package to start with all of the variables in `train_pm`.

```{r}
simple_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable")

simple_rec
```

We can view our recipe in more detail using the base `summary()` function.

```{r}
summary(simple_rec)
```




### Step 2: Specify the pre-processing steps with `step*()` functions

Next, we use the `step*()` functions from the `recipe` package to specify pre-processing steps. 

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","Making_a_recipe_recipes2.png"))
```

**This [link](https://tidymodels.github.io/recipes/reference/index.html){target="_blank"} and this [link](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} show the many options for recipe step functions.**

<u>There are step functions for a variety of purposes:</u>

1. [**Imputation**](https://en.wikipedia.org/wiki/Imputation_(statistics)){target="_blank"} -- filling in missing values based on the existing data 
2. [**Transformation**](https://en.wikipedia.org/wiki/Data_transformation_(statistics)){target="_blank"} -- changing all values of a variable in the same way, typically to make it more normal or easier to interpret
3. [**Discretization**](https://en.wikipedia.org/wiki/Discretization_of_continuous_features){target="_blank"} -- converting continuous values into discrete or nominal values - binning for example to reduce the number of possible levels (However this is generally not advisable!)
4. [**Encoding / Creating Dummy Variables**](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)){target="_blank"} -- creating a numeric code for categorical variables
([**More on one-hot and Dummy Variables encoding**](https://medium.com/p/b5840be3c41a/responses/show){target="_blank"})
5. [**Data type conversions**](https://cran.r-project.org/web/packages/hablar/vignettes/convert.html){target="_blank"}  -- which means changing from integer to factor or numeric to date etc.
6. [**Interaction**](https://statisticsbyjim.com/regression/interaction-effects/){target="_blank"}  term addition to the model -- which means that we would be modeling for predictors that would influence the capacity of each other to predict the outcome
7. [**Normalization**](https://en.wikipedia.org/wiki/Normalization_(statistics)){target="_blank"} -- centering and scaling the data to a similar range of values
8. [**Dimensionality Reduction/ Signal Extraction**](https://en.wikipedia.org/wiki/Dimensionality_reduction){target="_blank"} -- reducing the space of features or predictors to a smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
9. **Filtering** -- filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
10. [**Row operations**](https://tartarus.org/gareth/maths/Linear_Algebra/row_operations.pdf){target="_blank"} -- performing functions on the values within the rows  (ex. rearranging, filtering, imputing)
11. **Checking functions** -- Gut checks to look for missing values, to look at the variable classes etc.

All of the step functions look like `step_*()` with the `*` replaced with a name, except for the check functions which look like `check_*()`.

There are several ways to select what variables to apply steps to:  

1. Using `tidyselect` methods: `contains()`, `matches()`, `starts_with()`, `ends_with()`, `everything()`, `num_range()`  
2. Using the type: `all_nominal()`, `all_numeric()` , `has_type()` 
3. Using the role: `all_predictors()`, `all_outcomes()`, `has_role()`
4. Using the name - use the actual name of the variable/variables of interest  
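For example, a hypothetical normalization step could use type- and role-based selectors to center and scale all numeric predictors at once (a sketch only; we will not add this step to our actual recipe):

```{r, eval=FALSE}
simple_rec %>%
  step_normalize(all_numeric(), -all_outcomes())
```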

Let's try adding some steps to our recipe.


We might want to modify some of our categorical variables so that they can be used with certain algorithms, like regression, that require only numeric values. 

We can do this with the `step_dummy()` function and the `one_hot = TRUE` argument to use a method called [One-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/){target="_blank"}.


One-hot encoding means that we do not simply encode our categorical variables numerically as a simple 1,2,3, as our numeric assignments can be interpreted by algorithms as having a particular rank or order. 
Instead, new binary variables made up of 1s and 0s are used to arbitrarily assign a numeric value that has no apparent order. Note that while there is a different but similar method to do this referred to as dummy variables, one-hot encoded variables are also sometimes called dummy variables.
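The same idea can be seen in base R with `model.matrix()`, which builds one 0/1 indicator column per level of a factor (a toy example, unrelated to our data):

```{r}
city <- factor(c("Tucson", "New York", "Tucson", "Denver"))

# The "- 1" drops the intercept so that every level gets its own 0/1 column
model.matrix(~ city - 1)
```

Each row has exactly one "hot" 1 in the column matching its original city value.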

#### {.click_to_expand_block}

<details><summary>For more information about what one-hot encoding is, you can expand here. </summary>

For example, say we only had three city values for the city variable: "Tucson", "Denver", and "New York".

| City 	|
|:---:	|
| Tucson 	|
| New York 	|
| Tucson 	|
| Denver 	|

This would be replaced by three new variables: one for whether the value was "Tucson", one for whether it was "Denver", and one for whether it was "New York". Each would be made up of zeros and ones. A one in the new Tucson variable would indicate that "Tucson" was indeed the value, and a zero would indicate that it was instead "New York" or "Denver". Essentially there is one "hot" value possible for each new variable, where the value will be one.


| Tucson 	| Denver 	| New York 	|
|:---:	|:---:	|:---:	|
| 1 	| 0 	| 0 	|
| 0 	| 0 	| 1 	|
| 1 	| 0 	| 0 	|
| 0 	| 1 	| 0 	|

</details>

####

Here we will create such variables to replace the current `state`, `county`, and `city` categorical variables. Similarly, the ZCTA values (which are currently a factor class) are not intended to be interpreted as numeric, as they can be thought of like zip codes, and thus we want to encode `zcta` as well. 

```{r}
simple_rec %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE)
```

#### {.click_to_expand_block}

<details><summary> Click here to see how the variables would change if our recipe stopped here with simply the one-hot encoding. </summary>

To create the data we will use steps that we will demonstrate later, for now we will just show you what the new encoded variables look like.

```{r, echo = FALSE, include = FALSE}
one_hot_rec <- recipe(train_pm) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable")
one_hot_rec %<>% step_dummy(state, county, city, zcta, one_hot = TRUE)
prepped_one_hot_rec <- prep(one_hot_rec , verbose = TRUE, retain = TRUE )
baked_one_hot_rec <- bake(prepped_one_hot_rec, new_data = NULL)
```





```{r}
length(names(train_pm))
length(names(baked_one_hot_rec)) # many more variables!

train_pm %>% select(zcta, city, state, county) %>% head

# let's look at a few
baked_one_hot_rec %>% select("zcta_X54520", "city_Not.in.a.city", "state_Indiana") %>% head()

```


</details>

####

Our `fips` variable includes a numeric code for state and county - and therefore is essentially a proxy for county.
Since we already have county, we will just use it and keep the `fips` ID as another ID variable.

We can remove the `fips` variable from the predictors using `update_role()` to make sure that its role is no longer `"predictor"`. 
The role can actually be anything we want, so we will make it something identifiable.

```{r}
simple_rec %>%
  update_role("fips", new_role = "county id")
```

We might also want to remove variables that appear to be redundant and are highly correlated with others, as we know from our exploratory data analysis that many of our variables are correlated with one another. 
We can do this using the `step_corr()` function.

We don't want to remove some of our variables, like `CMAQ` and `aod`. We can keep them by placing a `-` sign before their names like so:

```{r}
simple_rec %>%
  step_corr(all_predictors(), - CMAQ, - aod)
```


It is also a good idea to remove variables with near-zero variance, which can be done with the `step_nzv()` function. 

Variables have low variance if all the values are very similar, if the values are very sparse, or if they are highly imbalanced. Again, we don't want to remove our `CMAQ` and `aod` variables.

```{r}
simple_rec %>%
  step_nzv(all_predictors(), - CMAQ, - aod)
```

#### {.click_to_expand_block}

<details><summary> Click here to learn about examples where you might have near-zero variance variables</summary>

1) **Similar Values** - If the population density was nearly the same for every zcta that contained a monitor, then knowing the population density near our monitor would contribute little to our model in assisting us to predict monitor air pollution values. 
2) **Sparse Data** - If all of the monitors were in locations where the populations did not attend graduate school, then these values would mostly be zero, and again this would do very little to help us distinguish our air pollution monitors. When many of the values are zero, this is also called sparse data.  
3) **Imbalanced Data** - If nearly all of the monitors were located in one particular state and all the other states had only one monitor each, then the real predictive value would simply be in knowing whether a monitor is located in that particular state or not. In this case we don't want to remove the variable, we just want to simplify it.

See this [blog post](https://www.r-bloggers.com/near-zero-variance-predictors-should-we-remove-them/){target="_blank"} about why removing near-zero variance variables isn't always a good idea if we think that a variable might be especially informative.

</details>

####
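As a rough sketch of the kind of heuristic behind such filters (this is an illustration of the idea, not the exact rule that `step_nzv()` uses), we can compare the frequency of the most common value to that of the second most common:

```{r}
# Ratio of the most common value's frequency to the second most common
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) return(Inf)   # a constant variable has no second value
  tab[[1]] / tab[[2]]
}

sparse <- c(rep(0, 98), 1, 2)        # almost all zeros
varied <- rnorm(100)                 # plenty of variation

freq_ratio(sparse)   # very large: near-zero variance suspect
freq_ratio(varied)   # low: many distinct values
```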

Let's put all this together now. 

**Remember: it is important to add the steps to the recipe in an order that makes sense just like with a cooking recipe.**

First, we are going to create numeric values for our categorical variables, then we will look at correlation and near-zero variance. 
Again, we do not want to remove the `CMAQ` and `aod` variables, so we can make sure they are kept in the model by excluding them from those steps. 
If we specifically wanted to remove a predictor we could use `step_rm()`.

```{r}
simple_rec %<>%
  update_role("fips", new_role = "county id") %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
  step_corr(all_predictors(), - CMAQ, - aod)%>%
  step_nzv(all_predictors(), - CMAQ, - aod)
  
simple_rec
```



## **Running the pre-processing**
***

### **Step 1: Update the recipe with training data using `prep()`**
***

The next major function of the `recipes` package is `prep()`.
This function updates the recipe object based on the training data. 
It estimates parameters (the quantities and statistics required by the steps) for pre-processing and updates the variable roles, since some of the predictors may be removed; this makes the recipe ready to use on other data sets. 
It **does not necessarily execute the pre-processing itself**; however, we will specify an argument for it to do this so that we can take a look at the pre-processed data.


There are some important arguments to know about:

1. `training` - you must supply a training data set to estimate parameters for pre-processing operations (recipe steps) - this may already be included in your recipe - as is the case for us
2. `fresh` - if `fresh=TRUE`, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe (default is `FALSE`)
3. `verbose` - if `verbose=TRUE`, shows the progress as the steps are evaluated and the size of the pre-processed training set (default is `FALSE`)
4. `retain` - if `retain=TRUE`, then the pre-processed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and do not want to rerun the `prep()` on the previous steps. However this can make the recipe size large. This is necessary if you want to actually look at the pre-processed data (default is `TRUE`)

Let's try out the `prep()` function: 

```{r}
prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE )
names(prepped_rec)
```

There are also lots of useful things to check out in the output of `prep()`.
You can see:

1. the `steps` that were run  
2. the original variable info (`var_info`)  
3. the updated variable info after pre-processing (`term_info`)
4. the new `levels` of the variables 
5. the original levels of the variables (`orig_lvls`)
6. info about the training data set size and completeness (`tr_info`)

**Note**: You may see the `prep.recipe()` function in material that you read about the `recipes` package. This is referring to the `prep()` function of the `recipes` package.


### **Step 2: Extract pre-processed training data using `bake()`**
***


Since we retained our pre-processed training data (i.e. `prep(retain=TRUE)`), we can take a look at it by using the `bake()` function of the `recipes` package. The `bake()` function allows us to apply our modeling steps (in this case just pre-processing the training data) and see what it would do to the data.

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","training_preprocessing_recipes3.png"))
```

Let's bake! 

Since we don't have new data (we aren't looking at the testing data), we need to specify this with `new_data = NULL`.

#### {.scrollable }
```{r}
# Scroll through the output!
baked_train <- bake(prepped_rec, new_data = NULL)
glimpse(baked_train)
```
####

**Note**- this process used to require the `juice()` function.

For easy comparison sake - here is our original data:

#### {.scrollable }

```{r}
# Scroll through the output!
glimpse(pm)
```
####

Notice how we only have 36 variables now instead of 50! 
Two of these are our ID variables (`fips` and the actual monitor ID (`id`)) and one is our outcome (`value`). 
Thus we only have 33 predictors now. 
We can also see that we no longer have any categorical variables. 
Variables like `state` are gone and only `state_California` remains, as it was the only state indicator variable whose variance was above the near-zero threshold.


We can see that California had the largest number of monitors compared to the other states.

```{r, eval = FALSE}
pm %>% count(state) 
```


Scroll through the output:

#### {.scrollable }

```{r, echo = FALSE}
pm %>% count(state)  %>%
  print(n = 1e3)
```

####


We can also see that there were more monitors listed as `"Not in a city"` than any city. 

```{r, eval = FALSE}
pm %>% count(city)
```

Scroll through the output:

#### {.scrollable }

```{r, echo=FALSE}
pm %>% count(city) %>%
  print(n = 1e3)
```

####

**Note**: Recall that you must specify the `retain = TRUE` argument of the `prep()` function to use `bake()` to see the pre-processed training data.

### **Step 3: Extract pre-processed testing data using `bake()`**
***

According to the `tidymodels` documentation:

> `bake()` takes a trained recipe and applies the operations to a data set to create a design matrix.
 For example: it applies the centering to new data sets using these means used to create the recipe.

Therefore, if you wanted to look at the pre-processed testing data you would use the `bake()` function of the `recipes` package.
(You generally want to leave your testing data alone, but it is good to look for issues like the introduction of NA values).

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","testing_preprocessing_recipes4.png"))
```

Let's bake! 

#### {.scrollable }
```{r,}
# Scroll through the output!
baked_test_pm <- recipes::bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
```
####


Notice that our `city_Not.in.a.city` variable seems to contain `NA` values. 
Why might that be?

Ah! Perhaps it is because some of our levels were not previously seen in the training set!
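
This is standard factor behavior in R: when a factor's levels are fixed (here, by the training data), any value outside that set is coerced to `NA`. Here is a minimal base R illustration (the city name `"Gotham"` is made up):

```r
# Levels learned from the (hypothetical) training data
train_levels <- c("In a city", "Not in a city")

# A value unseen during training becomes NA
new_city <- factor("Gotham", levels = train_levels)
new_city
# [1] <NA>
```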

Let's take a look using the [set operations](https://www.probabilitycourse.com/chapter1/1_2_2_set_operations.php){target="_blank"} of the `dplyr` package. 
We can take a look at cities that were different between the test and training set.

```{r}
traincities <- train_pm %>% distinct(city)
testcities <- test_pm %>% distinct(city)

#get the number of cities that were different
dim(dplyr::setdiff(traincities, testcities))

#get the number of cities that overlapped
dim(dplyr::intersect(traincities, testcities))
```

Indeed, there are lots of different cities in our test data that are not in our training data!


So, let's go back to our `pm` data set and modify the `city` variable to just be values of `in a city` or `not in a city` using the `case_when()` function of `dplyr`.
This function allows you to vectorize multiple `if_else()` statements.

```{r}
pm %>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))
```

Alternatively you could create a [custom step function](https://recipes.tidymodels.org/articles/Custom_Steps.html){target="_blank"} to do this and add this to your recipe, but that is beyond the scope of this case study. 

We will need to repeat all the steps (splitting the data, pre-processing, etc) as the levels of our variables have now changed. 

While we are doing this, we might also have this issue for `county`. 

The `county` variable appears to get dropped due to either correlation or near-zero variance. 

It is likely due to near-zero variance, because `county` is the most granular of these geographic categorical variables and is therefore likely sparse.

```{r}
pm %<>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))

set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
```


#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

See if you can come up with the code for the new recipe.

***

<details> <summary> Click here to reveal the code for the new recipe. </summary>


```{r}
novel_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
    step_corr(all_numeric()) %>%
    step_nzv(all_numeric()) 
```
</details>

***

####

```{r}
novel_rec
```



Now let's retrain our training data with the new model recipe and try baking our test data again.



#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Do you recall how to pre-process and extract the pre-processed training data?

***

<details> <summary> Click here to reveal the answer. </summary>

```{r}
prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
baked_train <- bake(prepped_rec, new_data = NULL)
```
</details> 

***

####


#### {.scrollable }
```{r}
# Scroll through the output!
glimpse(baked_train)
```

####

And now, let's try baking our test set to see if we still have `NA` values.

#### {.scrollable }

```{r}
# Scroll through the output!
baked_test_pm <- bake(prepped_rec, new_data = test_pm)

glimpse(baked_test_pm)
```

####

Great, now we no longer have `NA` values! :)

**Note**: if you use the `skip` option for some of the pre-processing steps, be careful: 
the older `juice()` function (which you can still use if you prefer it to `bake()`) will show all of the results, ignoring `skip = TRUE`, while the 
`bake()` function will not necessarily conduct these steps on the new data.


## **Specifying the model**
***

So far we have used the packages `rsample` to split the data and `recipes` to assign variable types, and to specify and prep our pre-processing (as well as to optionally extract the pre-processed data).

We will now use the `parsnip` package (the successor to the `caret` package - hence its vegetable-themed name) to specify our model.

There are four things we need to define about our model:  

1. The **type** of model (using specific functions in parsnip like `rand_forest()`, `logistic_reg()` etc.)  
2. The package or **engine** that we will use to implement the type of model selected (using the `set_engine()` function) 
3. The **mode** of learning - classification or regression (using the `set_mode()` function) 
4. Any **arguments** necessary for the model/package selected (using the `set_args()`function -  for example the `mtry =` argument for random forest which is the number of variables to be used as options for splitting at each tree node)
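
As a preview of how these four pieces chain together, here is a hedged sketch (the engine and the `mtry = 10` value are placeholders for illustration, not recommendations):

```r
library(parsnip)

rf_sketch <- rand_forest() %>%        # 1. type of model
  set_engine("randomForest") %>%      # 2. engine/package
  set_mode("regression") %>%          # 3. mode of learning
  set_args(mtry = 10)                 # 4. model arguments (placeholder value)

rf_sketch
```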

Let's walk through these steps one by one. 
For our case, we are going to start our analysis with a linear regression (a very common starting point for modeling) but we will demonstrate how we can try different models. We will also show how to model with the Random Forest method (which is very widely used) later on. 

The first step is to define what type of model we would like to use. 
See [here](https://www.tidymodels.org/find/parsnip/){target="_blank"} for modeling options in `parsnip`.


```{r}
PM_model <- parsnip::linear_reg() # PM was used in the name for particulate matter
PM_model
```

OK. So far, all we have defined is that we want to use a linear regression...  
Let's tell `parsnip` more about what we want.

We would like to use the [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) method to fit our linear regression. 
So we will tell `parsnip` that we want to use the `lm` package to implement our linear regression (there are many options actually such as [`rstan`](https://cran.r-project.org/web/packages/rstan/vignettes/rstan.html){target="_blank"}  [`glmnet`](https://cran.r-project.org/web/packages/glmnet/index.html){target="_blank"}, [`keras`](https://keras.rstudio.com/){target="_blank"}, and [`sparklyr`](https://therinspark.com/starting.html#starting-sparklyr-hello-world){target="_blank"}). See [here](https://parsnip.tidymodels.org/reference/linear_reg.html) for a description of the differences and using these different engines with `parsnip`.

We will do so by using the `set_engine()` function of the `parsnip` package.

```{r}
lm_PM_model <- PM_model  %>%
  parsnip::set_engine("lm")

lm_PM_model
```

Some packages can perform either classification or regression, so it is a good idea to specify which mode you intend to use. 
Here, we aim to predict air pollution levels, a continuous outcome. 
You can do this with the `set_mode()` function of the `parsnip` package, by using either `set_mode("classification")` or `set_mode("regression")`.

```{r}
lm_PM_model <- PM_model  %>%
  parsnip::set_engine("lm") %>%
  set_mode("regression")

lm_PM_model
```

## **Fitting the model**
***

We can  use the `parsnip` package with a newer package called `workflows` to fit our model. 

The `workflows` package allows us to keep track of both our pre-processing steps and our model specification. It also allows us to implement fancier optimizations in an automated way and it can also handle post-processing operations. 


We begin by creating a workflow using the `workflow()` function in the `workflows` package. 

Next, we use `add_recipe()` (our pre-processing specifications) and we add our model with the `add_model()` function -- both functions from the `workflows` package.

**Note**: We do not need to actually `prep()` our recipe before using workflows!

If you recall `novel_rec` is the recipe we previously created with the `recipes` package and `lm_PM_model` was created when we specified our model with the `parsnip` package.
Here, we combine everything together into a workflow. 

```{r}
PM_wflow <- workflows::workflow() %>%
            workflows::add_recipe(novel_rec) %>%
            workflows::add_model(lm_PM_model)
PM_wflow
```

Ah, nice. 
Notice how it tells us about both our pre-processing steps and our model specifications.

Next, we "prepare the recipe" (or estimate the parameters) and fit the model to our training data all at once. 
Printing the output, we can see the coefficients of the model.

```{r}
PM_wflow_fit <- parsnip::fit(PM_wflow, data = train_pm)
PM_wflow_fit
```

#### {.click_to_expand_block}

<details><summary> Click here to see the steps that the `workflows` package performs that used to be required </summary>

Previously, the processed training data (`baked_train`), as opposed to the raw training data, would be required to fit the model.

In this case, we would actually also need to write out the model formula again! 
Recall that `id` and `fips` are ID variables and that `value` is our outcome of interest (the air pollution measure at each monitor). It is nice that `workflows` keeps track of this!

```{r}
baked_train_ready <- baked_train %>% 
  select(-id, -fips)

PM_fit <- lm_PM_model %>% 
  parsnip::fit(value ~., data = baked_train_ready)
```

</details>

####

## **Assessing the model fit**
***

After we fit our model, we can use the `broom` package to look at the output from the fitted model in an easy/tidy way.   

The `tidy()` function returns a tidy data frame with coefficients from the model (one row per coefficient).

Many other `broom` functions currently only work with `parsnip` objects, not raw `workflows` objects. 

However, we can use the `tidy` function if we first use the `extract_fit_parsnip()` function which is imported as part of the `workflows` package from the [`hardhat` package](https://hardhat.tidymodels.org/) (also part of tidymodels).

```{r}
wflowoutput <- PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  broom::tidy() 
```


```{r}
wflowoutput
```

We have fit our model on our training data, which means we have created a model to predict values of air pollution based on the predictors that we have included. Yay!

One last thing before we leave this section. 
We often are interested in getting a sense of which variables are the most important in our model. 
We can explore the variable importance using the `vip()` function of the `vip` package. 
This function creates a bar plot of variable importance scores for each predictor variable (or feature) in a model. 
The bar plot is ordered by importance (largest to smallest). 


Notice again that we need to use the `extract_fit_parsnip()` function.

Let's take a look at the top 10 contributing variables:

```{r}
PM_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)
```

The location of the monitor (being in California versus another state), the `CMAQ` model estimate, and the `aod` satellite information appear to be the most important for predicting the air pollution at a given monitor.

Indeed, if we plot monitor values for those in California relative to other states, we can see that there are some high values for monitors in California. This may be playing a role in what we are seeing. Here we assume that you have some experience plotting with the `ggplot2` package. If not, please see this [case study](https://www.opencasestudies.org/ocs-bp-co2-emissions/).

#### {.click_to_expand_block}

<details><summary> Click here for an introduction about this package if you are  new to using `ggplot2` </summary>


The [ggplot2 package](http://ggplot2.tidyverse.org) is generally intuitive for beginners because it is based on a  [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html) or the `gg` in `ggplot2`. 
The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.

The critical part to making graphics using `ggplot2` is the data needs to be in a _tidy_ format. 
Given that we have just spent time putting our data in _tidy_ format, we are primed to take advantage of all that `ggplot2` has to offer! 

We will show how it is easy to pipe _tidy_ data (output) as input to other functions that create plots. 
This all works because we are working 
within the _tidyverse_. 

**What is the `ggplot()` function?** 
As explained by Hadley Wickham:

> The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.

`ggplot2` Terminology: 

- **ggplot** - the main function where you specify the dataset and variables to plot (this is where we define the `x` and
`y` variable names)
- **geoms** - geometric objects
    - e.g. `geom_point()`, `geom_bar()`, `geom_line()`, `geom_histogram()`
- **aes** - aesthetics
    - shape, transparency, color, fill, line types
- **scales** - define how your data will be plotted
    - continuous, discrete, log, etc

The function `aes()` is an aesthetic mapping function inside the `ggplot()` object. 
We use this function to specify plot attributes (e.g. `x` and `y` variable names) that will not change as we add more layers.  

Anything that goes in the `ggplot()` object becomes a global setting. 
From there, we use the `geom` objects to add more layers to the base `ggplot()` object. 
These will define what we are interested in illustrating using the data.

</details>

####


```{r}

baked_train_ready %>% 
  mutate(state_California = as.factor(state_California)) %>%
  mutate(state_California = recode(state_California, 
                                   "0" = "Not California", 
                                   "1" = "California")) %>%
  ggplot(aes(x = state_California, y = value)) + 
  geom_boxplot() +
  geom_jitter(width = .05) + 
  xlab("Location of Monitor")

```

## **Model performance**
***

In this next section, our goal is to assess the overall model performance. 
The way we do this is to compare the similarity between the predicted estimates of the outcome variable produced by the model and the true outcome variable values. 

If you recall the [What is machine learning?](#whatisml) section, we showed how to think about machine learning (ML) as an optimization problem that tries to minimize the distance between our predicted outcome $\hat{Y} = f(X)$ and actual outcome $Y$ using our features (or predictor variables) $X$ as input to a function $f$ that we want to estimate. 

$$d(Y - \hat{Y})$$

As our goal in this section is to assess overall model performance, we will now talk about different distance metrics that you can use. 

First, let's pull out our predicted outcome values $\hat{Y} = f(X)$ from the models we fit (using different approaches). 


```{r}
wf_fit <- PM_wflow_fit %>% 
  extract_fit_parsnip()

wf_fitted_values <- fitted(wf_fit[["fit"]])
head(wf_fitted_values)
```

Alternatively, we can get the fitted values using the `augment()` function of the `broom` package using the output from `workflows`: 

```{r}
wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) %>% 
  select(value, .fitted:.std.resid)

head(wf_fitted_values)
```

Note that because we use the actual workflow here, we can (and actually need to) use the raw data instead of the pre-processed data.

```{r}
values_pred_train <- 
  predict(PM_wflow_fit, train_pm) %>% 
  bind_cols(train_pm %>% select(value, fips, county, id)) 

values_pred_train
```

### **Visualizing model performance**
***

Now, we can compare the predicted outcome values (or fitted values) $\hat{Y}$ to the actual outcome values $Y$ that we observed: 

```{r}
wf_fitted_values %>% 
  ggplot(aes(x =  value, y = .fitted)) + 
  geom_point() + 
  xlab("actual outcome values") + 
  ylab("predicted outcome values")
```

OK, so our range of the predicted outcome values appears to be smaller than the real values. 
We could probably do a bit better.

### **Quantifying model performance**
***

Next, let's use different distance functions $d(\cdot)$ to assess how far off our predicted outcome $\hat{Y} = f(X)$ and actual outcome $Y$ values are from each other: 

$$d(Y - \hat{Y})$$

As mentioned, there are entire scholarly fields of research dedicated to identifying different distance metrics $d(\cdot)$ for machine learning applications. 
However, when performing prediction with a continuous outcome $Y$, a few of the most commonly used distance metrics are: 

1. mean absolute error (`mae`)  

$$MAE = \frac{\sum_{i=1}^{n}{|\hat{y}_i - y_i|}}{n}$$


2. R squared (`rsq`) -- this is also known as the coefficient of determination, which is the squared correlation between truth and estimate

This is calculated as 1 minus the ratio of the residual sum of squares ($SS_{res}$) to the total sum of squares ($SS_{tot}$):


$$RSQ = R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

$$SS_{tot} = \sum_{i=1}^{n}{(y_i- \bar{y})}^2$$

The total sum of squares is proportional to the variance of the data. It is calculated as the sum of the squared deviations of each true value ($y_i$) from the mean of the true values ($\bar{y}$).

$$SS_{res} = \sum_{i=1}^{n}{(y_i- \hat{y}_i)}^2$$

The residual sum of squares is calculated as the sum of the squared differences between each true value ($y_i$) and its predicted value ($\hat{y}_i$, sometimes written $f_i$). 


3. [root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation){target="_blank"} (`rmse`)   

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y}_i - y_i)}^2}{n}}$$
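
To make these formulas concrete, here is a minimal base R sketch that computes each metric by hand on toy values (the numbers are made up for illustration; note that `yardstick::rsq()` uses the squared-correlation form of $R^2$, while `yardstick::rsq_trad()` matches the $1 - SS_{res}/SS_{tot}$ form shown above):

```r
# Toy true and predicted values (made up for illustration)
y     <- c(3, 10, 12, 22)
y_hat <- c(5,  9, 14, 20)

mae  <- mean(abs(y_hat - y))          # mean absolute error
rmse <- sqrt(mean((y_hat - y)^2))     # root mean squared error

ss_res <- sum((y - y_hat)^2)          # residual sum of squares
ss_tot <- sum((y - mean(y))^2)        # total sum of squares
rsq    <- 1 - ss_res / ss_tot         # coefficient of determination

c(mae = mae, rmse = rmse, rsq = rsq)
```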




One way to calculate these metrics within the `tidymodels` framework is with the `metrics()` function of the `yardstick` package. 

Note that you may obtain different results depending on your version of R and the package versions you are using. See [Session Info](#sessioninfo) section to learn what we used.

```{r}
yardstick::metrics(wf_fitted_values, 
                   truth = value, estimate = .fitted)
```

Alternatively, if you only want one metric, you could use the `mae()`, `rsq()`, or `rmse()` function individually. 

```{r}
yardstick::mae(wf_fitted_values, 
               truth = value, estimate = .fitted)
```

The lower the error values, the better the performance. RMSE and MAE can range from zero to infinity, so we aren't doing too badly. The MAE value suggests that the average difference between the predicted value and the real value was `r round(yardstick::mae(wf_fitted_values, truth = value, estimate = .fitted)$.estimate, 2)` ug/m3. The range of values was 3-22 in the training data, so this is a relatively small amount.

The difference between the RMSE and the MAE can indicate the variance of the errors. Since the RMSE and MAE were similar, the errors were quite consistent across the range of values, and large errors (where our prediction would have been really far off) are unlikely to have occurred.

The [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination) value indicates how much of the variability in the outcome could be explained by the predictors in the model. This value ranges from 0 to 1 (sometimes -1 depending on the method used for calculation); a value of 1 would indicate that the model perfectly predicted the outcome. Our value indicates that `r round(yardstick::rsq(wf_fitted_values, truth = value, estimate = .fitted)$.estimate, 2)*100`% of the variability of the air pollution measures could be explained by the model, so we could maybe do a bit better. See [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4678365/) for more information. 

## **Cross validation**
***

Until now we have used everything in our "training" dataset (and have not touched the "testing" dataset) from the `rsample` package to build our machine learning (ML) model $\hat{Y} = f(X)$ (or to estimate $f$ using the features or predictor variable $X$). 

Here, we take this beyond the simple split into training and testing data sets. 
We will use the `rsample` package again in order to further implement what are called [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} techniques. This is also called **re-sampling** or **repartitioning**.  

**Note**: we are not actually getting new samples from the underlying distribution so the term re-sampling is a bit of a misnomer.

[Cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} splits our training data into multiple training data sets to allow for a deeper assessment of the accuracy of the model.

Here is a visualization of the concept for cross validation/resampling/repartitioning from [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"}:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","resampling.png"))
```

Technically creating our testing and training set out of our original training data is sometimes considered a form of [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"}, called the holdout method. 
The reason we do this is so we can get a better sense of the accuracy of our model using data that we did not train it on. 

However, we can actually do a better job of optimizing our model for accuracy if we also perform another type of [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} on the newly defined training set that we just created. 
There are many [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} methods and most can be easily implemented using `rsample` package. 
Here, we will use a very popular method called either [k-fold or v-fold cross validation](https://machinelearningmastery.com/k-fold-cross-validation/){target="_blank"}. 

This method essentially involves performing the holdout method iteratively with the training data. 

First, the training set is divided into $v$ (often also called $k$) equally sized smaller pieces. 

Next, the model is trained on $v - 1$ of the subsets iteratively (holding out a different fold each time, until every fold has served as the held-out set) to get a sense of the performance of the model. 
This is really useful for fine-tuning specific aspects of the model in a process called model tuning, which we will learn about in the next section. 
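
Conceptually, assigning folds is just randomly partitioning the row indices of the training data. Here is a minimal base R sketch of the idea (12 pretend rows and 4 folds; this is not how `rsample` stores folds internally):

```r
set.seed(1234)
n <- 12   # pretend training set size
v <- 4    # number of folds

# Randomly assign each row to one of the v folds
fold <- sample(rep(1:v, length.out = n))

# For fold 1: the other folds form the analysis (training) set,
# and fold 1 itself is the assessment (testing) set
analysis_idx   <- which(fold != 1)
assessment_idx <- which(fold == 1)
length(analysis_idx)    # 9 rows used for fitting
length(assessment_idx)  # 3 rows held out
```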

Here is a visualization of how the folds are created:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "vfold.png"))
```

**Note**: People typically ignore spatial dependence with cross validation of air pollution monitoring data in the air pollution field, so we will do the same. However, it might make sense to leave out blocks of monitors rather than random individual monitors to help account for some spatial dependence.

### **Creating the $v$-folds using `rsample`**
***

The [`vfold_cv()`](https://tidymodels.github.io/rsample/reference/vfold_cv.html){target="_blank"} function of the `rsample` package can be used to parse the training data into folds for $v$-fold cross validation.

- The `v` argument specifies the number of folds to create.
- The `repeats` argument specifies the number of times to repeat the entire $v$-fold partitioning (the default is `1`, i.e. no repeats).
- The `strata` argument specifies a variable to stratify samples across folds - just like in `initial_split()`.

Again, because these are created at random, we need to use the base `set.seed()` function in order to obtain the same results each time we knit this document. 
Generally speaking using 10 folds is good practice, but this depends on the variability within your data. 
We are going to use 4 for the sake of expediency in this demonstration. 

```{r}
set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
pull(vfold_pm, splits)
```

Now we can see that we have created 4 folds of the data, and we can see how many values were set aside for testing (called the *assessment* set in cross validation) and training (called the *analysis* set in cross validation) within each fold.

Once the folds are created they can be used to evaluate performance by fitting the model to each of the re-samples that we created:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "cross_validation.png"))
```

### **Assessing model performance on $v$-folds using `tune`**
***

We can fit the model to our cross validation folds using the `fit_resamples()` function of the `tune` package, by specifying our `workflow` object and the cross validation fold object we just created. 
See [here](https://tidymodels.github.io/tune/reference/fit_resamples.html){target="_blank"} for more information.


```{r}
resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm)
```


We can now take a look at various performance metrics based on the fit of our cross validation "resamples". 
To do this we will use the `show_best()` function of the `tune` package.

```{r}
tune::show_best(resample_fit, metric = "rmse")
```

Here we can see the mean `RMSE` value across all four folds. The function is called `show_best()` because it is also used for model tuning, where it shows the parameter combination with the best performance; we will discuss this more later in the case study when we test different models with different parameters. For now, however, this gives us a more nuanced estimate of the RMSE for this single model by looking at performance across subsets of the training data.


# **Data Analysis**
***

If you have been following along but stopped and are starting here you could load the wrangled data using the following command:

```{r}
load(here::here( "data", "wrangled", "wrangled_pm.rda"))
```

#### {.click_to_expand_block}

<details> <summary> If you skipped the previous sections, click here for a step-by-step guide on how to download and load the data. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data `.rda` file using the following function:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))
```

If the package does not work for you, you may also download this `.rda` file by clicking this link [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-air-pollution/master/data/wrangled/wrangled_pm.rda).

To load the downloaded data into your environment, you may double click on the `.rda` file in Rstudio or using the `load()` function.

To copy and paste our code below, place the downloaded `.rda` file in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data". We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_pm.rda"))
```

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

</details>

####

Then you can modify the data to match what we did for the previous model, preparing the `city` variable and splitting the data into testing and training sets. (You will see later that this is required due to the number of levels of the `city` variable in the original data.)

```{r}
pm %<>%
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))

set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
```

We will also split our data into cross validation folds:

```{r}
set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
```

In the previous section, we demonstrated how to build a machine learning model (specifically a linear regression model) to predict air pollution with the `tidymodels` framework. 

In the next few sections, we will demonstrate a very different kind of machine learning model, which will allow us to see whether it has better prediction performance.


## **Random Forest**
***

Now, to try to see if we can get better prediction performance, we are going to predict our outcome variable (air pollution) using a decision tree method called [random forest](https://en.wikipedia.org/wiki/Random_forest){target="_blank"}.

A [decision tree](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb){target="_blank"} is a tool to partition data or anything really, based on a series of sequential (often binary) decisions, where the decisions are chosen based on their ability to optimally split the data.

Here you can see a simple example:

```{r, echo = FALSE}
knitr::include_graphics("https://miro.medium.com/max/1000/1*LMoJmXCsQlciGTEyoSN39g.jpeg")
```

##### [[source]](https://towardsdatascience.com/understanding-random-forest-58381e0602d2){target="_blank"}

In the case of [random forest](https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9){target="_blank"}, multiple decision trees are created - hence the name forest, and each tree is built using a random subset of the training data (with replacement) - hence the full name random forest. This random aspect helps to keep the algorithm from overfitting the data.

The mean of the predictions from each of the trees is used in the final output.

```{r, echo = FALSE}
knitr::include_graphics("https://miro.medium.com/max/1400/0*f_qQPFpdofWGLQqc.png")
```
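As a small illustration of this averaging (a toy sketch using the built-in `mtcars` data, assuming the `randomForest` package is installed; this is not part of the case study analysis):

```{r, eval = FALSE}
library(randomForest)

set.seed(1)
toy_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 100)

# With predict.all = TRUE we get each tree's individual prediction;
# the forest prediction is the mean across the 100 trees.
preds <- predict(toy_fit, newdata = mtcars[1:2, ], predict.all = TRUE)
rowMeans(preds$individual)  # matches preds$aggregate
```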


Overall, a major distinction from our last regression model is that random forest allows us to use our categorical data largely as is. There is no need to recode these predictors to be numerical. In our case, we are going to use the random forest method of the `randomForest` package.

This package is currently not compatible with categorical variables that have more than 53 levels. See [here](https://cran.r-project.org/web/packages/randomForest/NEWS) for the documentation about when this limit was raised from 25 levels. Thus we will remove the `zcta` and `county` variables. This is also why it is good that we modified the city variable to have two levels; there were originally nearly 600 levels, which would not have been compatible with this package.

Note that the `step_novel()` function is necessary here for the `state` variable in order to get all the cross validation folds to work, because different levels will be included in the test and training sets of each fold. The new levels in some of the test sets would otherwise result in an error.

According to the [documentation](https://www.rdocumentation.org/packages/recipes/versions/0.1.13/topics/step_novel) for the `recipes` package:

> step_novel creates a specification of a recipe step that will assign a previously unseen factor level to a new value.

```{r}
RF_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_novel("state") %>%
    step_string2factor("state", "county", "city") %>%
    step_rm("county") %>%
    step_rm("zcta") %>%
    step_corr(all_numeric()) %>%
    step_nzv(all_numeric())
```

The `rand_forest()` function of the `parsnip` package has three important arguments that act as an interface for the different possible engines to perform a random forest analysis:

1. `mtry` - The number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3. 
2. `min_n` - The minimum number of data points in a node that are required for the node to be split further.
3. `trees` - The number of trees in the ensemble.

We will start by trying an `mtry` value of 10 and a `min_n` value of 3. As you might imagine, it is a bit difficult to know what values to choose. This is where the tuning process that we just started to describe will come in handy, as we can test different models with different values. However, first let's just start with these values.

Now that we have our recipe (`RF_rec`), let's specify the model with `rand_forest()` from `parsnip`.

```{r}
PMtree_model <- parsnip::rand_forest(mtry = 10, min_n = 3)
PMtree_model
```

Next, we set the engine and mode:

Note that you could also use the `ranger` or `spark` packages instead of `randomForest`.
If you were to use the `ranger` package to implement the random forest analysis you would need to specify an `importance` argument to be able to evaluate predictor importance.  The options are `impurity` or `permutation`.

These other packages have different advantages and disadvantages - for example, `ranger` and `spark` are less limiting regarding the number of categories for categorical variables. For more information see their documentation: [here](https://cran.r-project.org/web/packages/ranger/ranger.pdf) for `ranger`, [here](http://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests) for `spark`, and [here](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) for `randomForest`.

See [here](https://parsnip.tidymodels.org/reference/rand_forest.html) for more documentation about implementing these engine options with tidymodels. Note that there are also [other](https://www.linkedin.com/pulse/different-random-forest-packages-r-madhur-modi/) R packages for implementing random forest algorithms, but these three packages (`ranger`, `spark`, and `randomForest`) are currently compatible with `tidymodels`.

We also need to specify with the `set_mode()` function that our outcome variable (air pollution) is continuous. 

```{r}

RF_PM_model <- PMtree_model %>%
  set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model
```

Then, we put this all together into a `workflow`: 

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

See if you can come up with the code to do this.

***
<details> <summary> Click here to reveal the answer. </summary>

```{r}
RF_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(RF_PM_model)

```
</details> 

***

####

```{r}
RF_wflow
```


Finally, we fit the data to the model:

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Do you recall how to do this?

***
<details> <summary> Click here to reveal the answer. </summary>

```{r}
RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)
```

If you get the error "Can not handle categorical predictors with more than 53 categories.", then you should scroll up a bit and make sure that you removed the categorical variables that have more than 53 categories, as this method cannot handle such variables at this time.

</details> 

***

####

```{r}
RF_wflow_fit
```

Let's take a look at the top 10 contributing variables:

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

See if you can recall how to do this.


```{r, echo = FALSE}
RF_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

```


***
<details> <summary> Click here to reveal the answer. </summary>

```{r}
RF_wflow_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)
```
</details>

***

####


Interesting! As in the previous model, the CMAQ values and the state where the monitor was located (whether or not it was in California) are among the most important predictors. However, whereas predictors about the education levels of the communities where the monitors were located were among the most important in the previous model, we now see that population density and proximity to sources of emissions and roads are among the top ten.


Now let's take a look at model performance by fitting the data using cross validation:

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

See if you can recall how to do this.

***
<details> <summary> Click here to reveal the answer. </summary>

```{r, eval = FALSE}
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
```

</details>

***

####

```{r, echo = FALSE}
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
```

## **Model Comparison**
***

Now let's compare the performance of our model with our linear regression model. If you have been following along, you can type this to take a look:

```{r, eval = FALSE}
# our initial linear regression model:
collect_metrics(resample_fit)
```

For those starting here, we will tell you that our first model had a cross validation mean `rmse` value of `r round(collect_metrics(resample_fit)$mean[1], 2)`.
It looks like the random forest model had a much lower `rmse` value of `r round(collect_metrics(resample_RF_fit)$mean[1], 2)`. This suggests that this model is better at predicting air pollution values. In addition, the R squared value is much higher (it was `r round(collect_metrics(resample_fit)$mean[2], 2)*100`% and is now `r round(collect_metrics(resample_RF_fit)$mean[2], 2)*100`%), suggesting that more of the variance of the air pollution values can be explained by the new random forest model.

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

Do you recall how the RMSE is calculated?

***

<details> <summary> Click here to reveal the answer. </summary>
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y}_i - y_i)}^2}{n}}$$
</details>

***

####
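To make the formula concrete, here is a minimal sketch (with made-up numbers, using only base R) of computing the RMSE by hand:

```{r, eval = FALSE}
truth <- c(10, 12, 8, 15)  # observed outcome values (y)
pred  <- c(11, 10, 9, 14)  # predicted values (y-hat)

# root of the mean of the squared differences
sqrt(mean((pred - truth)^2))
```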

If we tuned our random forest model based on the number of trees or the value for `mtry` (which is "The number of predictors that will be randomly sampled at each split when creating the tree models"), we might get a model with even better performance.

However, our cross validated mean rmse value of `r round(collect_metrics(resample_RF_fit)$mean[1], 2)` is quite good, as it is small relative to the range of true outcome values: (`r round(range(test_pm$value),3)`).


## **Model tuning**
***

[Hyperparameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/) are values that we need to specify about a model before it is trained, because they are not learned from the data. For example, the number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models, called `mtry`, is a hyperparameter. The default number for regression analyses is the number of predictors divided by 3. Instead of arbitrarily specifying this, we can try to determine the best option for model performance by a process called tuning.


Now let's try some tuning.

Let's take a closer look at the `mtry` and `min_n` hyperparameters in our random forest model.

We aren't exactly sure what values of `mtry` and `min_n` achieve good accuracy yet keep our model generalizable for other data.

This is when our cross validation methods become really handy because now we can test out different values for each of these hyperparameters to assess what values seem to work best for model performance on these resamples of our training set data.

Previously we specified our model like so:
```{r}
RF_PM_model <- 
  parsnip::rand_forest(mtry = 10, min_n = 3) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model
```

Now instead of specifying a value for the `mtry` and `min_n` arguments, we can use the `tune()` function of the `tune` package like so: `mtry = tune()`. This indicates that these hyperparameters are to be tuned. 

```{r}

tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")
    
tune_RF_model

```


Again we will add this to a workflow; the only difference here is that we are using a different model specification, with `tune_RF_model` instead of `RF_PM_model`:

```{r}

RF_tune_wflow <- workflows::workflow() %>%
  workflows::add_recipe(RF_rec) %>%
  workflows::add_model(tune_RF_model)
RF_tune_wflow

```


Now we can use the `tune_grid()` function of the `tune` package to evaluate different combinations of values for `mtry` and `min_n` using our cross validation samples of our training set (`vfold_pm`) to see what combination of values performs best.

To use this function, we will specify the workflow with the `object` argument and the cross validation samples with the `resamples` argument. The `grid` argument specifies how many possible options for each argument should be attempted.

By default 10 different values will be attempted for each hyperparameter that is being tuned.

We can use the `doParallel` package to fit all of these models to our cross validation samples faster. If you are performing this on a computer with multiple cores or processors, then different models with different hyperparameter values can be fit to the cross validation samples simultaneously across different cores or processors.

You can see how many cores you have access to on your system using the `detectCores()` function in the `parallel` package. 

```{r}
parallel::detectCores()
```

The `registerDoParallel()` function will use the number of cores specified using the `cores =` argument, or it will assign it automatically to one-half of the number of cores detected by the `parallel` package.

We need to use `set.seed()` here because the values chosen for `mtry` and `min_n` may vary if we perform this evaluation again, because they are chosen semi-randomly (meaning that they are within a range of reasonable values but still random).

Note: this step will take some time.

```{r}
doParallel::registerDoParallel(cores = 2)
set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
```


See [the tune getting started guide ](https://tidymodels.github.io/tune/articles/getting_started.html){target="_blank"} for more information about implementing this in `tidymodels`.

If you wanted more control over this process, you could specify the different possible options for `mtry` and `min_n` passed to the `tune_grid()` function by using the `grid_*()` functions of the `dials` package to create a more specific grid.

By default, the values for the hyperparameters being tuned are chosen semi-randomly (meaning that they are within a range of reasonable values but still random).
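For example, a sketch of a manually specified grid might look like this (the ranges shown here are arbitrary choices for illustration):

```{r, eval = FALSE}
# a regular grid of candidate hyperparameter values
RF_grid <- dials::grid_regular(
  dials::mtry(range = c(2, 30)),
  dials::min_n(range = c(2, 10)),
  levels = 5)

# this grid could then be passed to tune_grid() via the grid argument:
# tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = RF_grid)
```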


Now we can use the `collect_metrics()` function again to take a look at what happened with our cross validation tests. We can see the different values chosen for `mtry` and `min_n` and the mean rmse and rsq values across the cross validation samples.

```{r}
tune_RF_results %>%
  collect_metrics()
```

We can now use the `show_best()` function (which we used previously to get a better estimate of the RMSE using the cross validation folds of the training data) as it was **truly intended**: to see what values for `min_n` and `mtry` resulted in the best performance.

```{r}
show_best(tune_RF_results, metric = "rmse", n = 1)
```

There we have it... looks like an `mtry` of `r show_best(tune_RF_results, metric = "rmse", n = 1)$mtry` and a `min_n` of `r show_best(tune_RF_results, metric = "rmse", n = 1)$min_n` had the best `rmse` value. You can verify this in the above output, but it is easier to just pull this row out using this function. We can see that the mean `rmse` value across the cross validation sets was `r round(show_best(tune_RF_results, metric = "rmse", n = 1)$mean, 2)`. Before tuning it was `r round(collect_metrics(resample_RF_fit)$mean[1], 2)` with a fairly similar `std_err`, so the performance may be slightly improved. In other words, the model was slightly better at predicting air pollution values.


## **Final model performance evaluation**
***

Now that we have decided that we have reasonable performance with our training data using the Random Forest method and tuning, we can stop building our model and evaluate performance with our testing data. Note that you might often need to try a variety of methods to optimize your model.

Here, we will use the random forest model that we built to predict values for the monitors in the testing data and we will use the values for `mtry` and `min_n` that we just determined based on our tuning analysis to achieve the best performance.

So, first we need to specify these values in a workflow. We can use the `select_best()` function of the `tune` package to grab the values that were determined to be best for `mtry` and `min_n`.



```{r}

tuned_RF_values <- select_best(tune_RF_results, "rmse")
tuned_RF_values
```

Now we can finalize the model/workflow that we used for tuning with these values.


```{r}
RF_tuned_wflow <- RF_tune_wflow %>%
  tune::finalize_workflow(tuned_RF_values)
```


With the `workflows` package, we can use the splitting information for our original data, `pm_split`, to fit the final model on the full training set and then evaluate it on the testing data using the `last_fit()` function of the `tune` package. No pre-processing steps are required.

The results will show the performance using the testing data.


```{r}
overallfit <- tune::last_fit(RF_tuned_wflow, pm_split)
 # or
overallfit <- RF_tuned_wflow %>%
  tune::last_fit(pm_split)
```

The `overallfit` output has a lot of really useful information about the model, the testing and training data split, and the predictions for the testing data.

To see the performance on the test data we can use the `collect_metrics()` function like we did before.

```{r}
collect_metrics(overallfit)
```

Awesome! We can see that our rmse of `r round(collect_metrics(overallfit)$.estimate, 2)[1]` is quite similar to that of our training data cross validation sets (where the rmse was `r round(show_best(tune_RF_results, metric = "rmse", n = 1)$mean, 2)`). We achieved quite good performance, which suggests that we could predict values at other locations with sparser monitoring based on our predictors with reasonable accuracy; however, some of our predictors involve monitoring itself, so the accuracy in such locations would likely be lower.

Now if you wanted to take a look at the predicted values for the test set (the 292 rows with predictions out of the 876 original monitor values) you can use the  `collect_predictions()` function of the `tune` package:

```{r}
test_predictions <- collect_predictions(overallfit)
```

```{r, eval = FALSE}
test_predictions
```

#### {.scrollable }
```{r, echo =FALSE}
test_predictions %>%
  print(n = 1e3)
```

####

Nice!

# **Data Visualization**
***

If you have been following along but stopped and are starting here, you can load the wrangled data using the following command:

```{r}
load(here::here("data", "wrangled", "wrangled_pm.rda"))
```

#### {.click_to_expand_block}

<details> <summary> If you skipped previous sections click here for more information on how to obtain and load the data. </summary>

First you need to install the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

Then, you may download and load the wrangled data `.rda` file using the following code:

```{r, eval=FALSE}
OCSdata::wrangled_rda("ocs-bp-air-pollution", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_pm.rda"))
```

If the package does not work for you, you may also download this `.rda` file by clicking this link [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-air-pollution/master/data/wrangled/wrangled_pm.rda).

To load the downloaded data into your environment, you may double click on the `.rda` file in RStudio or use the `load()` function.

To copy and paste our code below, place the downloaded `.rda` file in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data". We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_pm.rda"))
```

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

</details>

####


Our main question for this case study was:  

> Can we predict regional annual average air pollution concentrations by zip-code using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Thus far, we have built a machine learning (ML) model to predict fine particulate matter air pollution levels based on our predictor variables (or features).

Now, let's make a plot of our predicted outcome values ($\hat{Y}$) and the actual outcome values ($Y$) that we observed.

First, let's start by making a plot of our monitors. 
To do this, we will use the following packages to create a map of the US:

1. `sf` - the simple features package helps to convert geographical coordinates into `geometry` variables which are useful for making 2D plots
2. `maps` - this package contains geographical outlines and plotting functions to create plots with maps 
3. `rnaturalearth`- this allows for easy interaction with map data from [Natural Earth](http://www.naturalearthdata.com/) which is a public domain map dataset
4. `rgeos` - this package interfaces with the Geometry Engine-Open Source (`GEOS`) which is also helpful for coordinate conversion

We will start by getting an outline of the US with the `ne_countries()` function of the `rnaturalearth` package, which will return polygons of the countries in the [Natural Earth](http://www.naturalearthdata.com/) dataset.

```{r}

world <- ne_countries(scale = "medium", returnclass = "sf")
glimpse(world)

```


Here you can see the data about the countries in the world. Notice the `geometry` variable. This is used to create the outlines that we want. 

Now we can use the `geom_sf()` function of the `ggplot2` package to create a visual of the simple features (the geometry coordinates found in the `geometry` variable).

```{r}
ggplot(data = world) +
    geom_sf() 

```

So now we can see that we have outlines of all the countries in the world.

We want to limit this just to the coordinates for the US. We will do this based on the coordinates we found on Wikipedia. According to this [link](https://en.wikipedia.org/wiki/List_of_extreme_points_of_the_United_States#Westernmost){target="_blank"}, these are the latitude and longitude bounds of the continental US:

- top = 49.3457868 # north lat
- left = -124.7844079 # west long
- right = -66.9513812 # east long
- bottom =  24.7433195 # south lat

```{r}

ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)
```

Now we have a plot that is mostly limited to the outline of the US.

Now we will use the `geom_point()` function of the `ggplot2` package to add a scatter plot on top of the map. We want to show where the monitors are located based on the latitude and longitude values in the data.

```{r}
ggplot(data = world) +
    geom_sf() +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE)+
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred")

```

Nice!

Now let's add county lines.

County graphical data is available from the `maps` package.
The `sf` package (which, again, is short for simple features) creates a data frame from this graphical data so that we can work with it.

```{r}
counties <- sf::st_as_sf(maps::map("county", plot = FALSE,
                                   fill = TRUE))

counties
```

Now we will use this data within the `geom_sf()` function to add this to our plot.  We will also add a title using the `ggtitle()` function, as well as remove axis ticks and titles using the `theme()` function of the `ggplot2` package.

```{r}
monitors <- ggplot(data = world) +
    geom_sf(data = counties, fill = NA, color = gray(.5))+
      coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), 
             expand = FALSE) +
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
               shape = 23, fill = "darkred") +
    ggtitle("Monitor Locations") +
    theme(axis.title.x=element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank())

monitors
```

Great!

Now, let's add a fill at the county-level for the true monitor values of air pollution.

First, we need the county map data that we just obtained and our air pollution data to have similarly formatted county names so that we can combine the datasets.

We can see that in the `counties` data, the counties are listed after the state name and a comma. In addition, they are all lower case.


```{r}
head(counties)
```

In contrast, our air pollution `pm` data shows counties as titles with the first letter as upper case. 

```{r}
dplyr::pull(pm, county) %>%
  head()
```

We can use the `separate()` function of the `tidyr` package to separate the `ID` variable of our `counties` data into two variables based on the comma as a separator.

```{r}
counties %<>% 
  tidyr::separate(ID, into = c("state", "county"), sep = ",")

head(counties)
```

Now we just need to convert the names in the new `county` variable of the `counties` data to title case. We can use the `str_to_title()` function of the `stringr` package to do this.

```{r}
counties[["county"]] <- stringr::str_to_title(counties[["county"]])
```

Great! Now the county information is the same for the `counties` and `pm` data.

We can use the `inner_join()` function of the `dplyr` package to join the datasets together based on the `county` variables in each. This function will keep only the rows with `county` values present in both datasets.

```{r}
map_data <- dplyr::inner_join(counties, pm, by = "county")

glimpse(map_data)

```

Nice! We can see that we have added a `geom` variable to the `pm` data.

Now we can use this to color the counties in our plot based on the `value` variable of our `pm` data, which you may recall is the actual monitor data for fine particulate air pollution at each monitor. 

We can do so using the `scale_fill_gradientn()` function of the `ggplot2` package, which creates a color gradient based on a variable. In this case, it is the variable that was specified as the `fill` in the `aes()` function within the `geom_sf()` function. We specified that it would be the `value` variable of the `pm` data.

This `scale_fill_gradientn()` function also allows you to specify the colors, what to do about NA values (whether they should be a specific color or transparent), and the breaks, limits, labels, and name/title on the legend for the color gradient.

```{r}
truth <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = value)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("True PM 2.5 levels") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

truth

```

Nice!

Now let's do the same with our predicted outcome values.

Let's grab both the testing and training predicted outcome values so that we have as much data as possible. 

First we need to fit our training data with our final model to be able to get the predictions for the monitors included in the training set. We did this using the `last_fit()` function, but the output of this makes it difficult to grab the predicted values for the training data, and it is also difficult to get the id variables for the testing data. 


Thus we will use the `fit()` and `predict()` functions of the `parsnip` package to do this like so:

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

Why do we not need pre-processed data?

***

<details> <summary> Click here to reveal the answer. </summary>

Since we are using a workflow, the data will be pre-processed when it is fit as well.

</details>

***

####


```{r}

RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)


values_pred_train <- predict(RF_final_train_fit, train_pm) %>% 
  bind_cols(train_pm %>% select(value, fips, county, id)) 

values_pred_train

values_pred_test <- predict(RF_final_test_fit, test_pm) %>% 
  bind_cols(test_pm %>% select(value, fips, county, id)) 

values_pred_test
```

Now we can combine this data for the predictions for all monitors using the `bind_rows()` function of the `dplyr` package, which will essentially append the second dataset to the first.

```{r}
all_pred <- bind_rows(values_pred_test, values_pred_train)

all_pred
```

Great! As we can see, there are 876 values, as we would expect for all of the monitors. We can use the `county` variable to combine this with the `counties` data, like we did with the `pm` data previously, so that we can use the predicted values as a color scheme for our map.


```{r}
map_data <- inner_join(counties, all_pred, by = "county")

pred <- ggplot(data = world) +
  coord_sf(xlim = c(-125,-66),
           ylim = c(24.5, 50),
           expand = FALSE) +
  geom_sf(data = map_data, aes(fill = .pred)) +
  scale_fill_gradientn(colours = topo.colors(7),
                       na.value = "transparent",
                       breaks = c(0, 10, 20),
                       labels = c(0, 10, 20),
                       limits = c(0, 23.5),
                       name = "PM ug/m3") +
  ggtitle("Predicted PM 2.5 levels") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

pred
```

Now we will use the `patchwork` package to combine our last two plots. This allows us to combine plots using the `+` or the `/` . The `+` will place plots side by side and the `/` will place plots top to bottom.
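For example, a side-by-side layout of the two maps (not evaluated here) would look like:

```{r, eval = FALSE}
# the + operator places the two maps side by side
truth + pred
```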


Now let's just combine the truth plot and the prediction plots together:
```{r}
truth/pred

```

We can see that the predicted fine particle air pollution values (in ug/m3) are quite similar to the true values measured by the actual gravimetric monitors. We can also see that southern California has some large counties with worse pollution (as they are yellow and thus have much higher particulate matter levels).

Let's add some text to our plot to explain it a bit more. We can do so using the `plot_annotation()` function of the `patchwork` package. The `theme` argument of this function takes the same theme information using the `theme()` function of the `ggplot2` package as when creating `ggplot2`plots.

```{r}
(truth/pred) + 
  plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
                  theme = theme(plot.title = element_text(size =12, face = "bold"), 
                                plot.subtitle = element_text(size = 8)))

```


```{r, echo = FALSE, message=FALSE, eval=FALSE, include = FALSE}
png(here::here("img", "main_plot_maps.png"), 
    height = 1500, width = 2000, res = 300)
(truth/pred) + 
  plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
                  theme = theme(plot.title = element_text(size =12, face = "bold"),
                                plot.subtitle = element_text(size = 8)))
dev.off()
```

# **Summary**
***

## **Synopsis**
***

In this case study, we explored gravimetric monitoring data of fine particulate matter air pollution (outcome variable). 
Our goal was to be able to predict air pollution in locations where we only had predictor variables (or features), without having observed a corresponding measurement of air pollution.

Our learning objectives were: 

- Introduce concepts in machine learning
- Demonstrate how to build a machine learning model with `tidymodels`
- Demonstrate how to visualize geo-spatial data using `ggplot2`

The machine learning model built in this case study could now be extended to predict air pollution levels in areas with poor monitoring, helping to identify regions where populations may be especially at risk for the health effects of air pollution.  

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in this [article](https://www.nejm.org/doi/full/10.1056/NEJMoa1702747){target="_blank"} that many additional considerations are involved in understanding the data well enough to recommend policy changes.

Here are some visual summaries of what we learned about using `tidymodels` to perform prediction analyses.

First the minimal steps required:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","Updated_tidymodels_basics.png"))
```


Here is a guide for more advanced analyses involving preprocessing, cross validation, or tuning:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","full_tidymodels_overview.png"))
```


#### {.click_to_expand_block}

<details><summary> Click here for more on what we learned with `tidymodels` </summary>

Here, we provide an overview of the `tidymodels` framework. 

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","ecosystem.png"))
```


We performed the major steps of machine learning that we introduced in the beginning of the data analysis:  

1. Data exploration  

We used packages like `skimr`, `summarytools`, `corrplot`, and `GGally` to better understand our data. These packages can tell us how many missing values each variable has (if any), the class of each variable, the distribution of values for each variable, the sparsity of each variable, and the level of correlation between variables.  
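
A quick first pass over a new data set with these packages might look like the following sketch (not evaluated here; `df` stands in for whichever data frame you are exploring):

```{r, eval = FALSE}
library(skimr)
library(summarytools)
library(GGally)

skim(df)            # classes, missingness, and distributions for each variable
dfSummary(df)       # a similar overview in a different style
GGally::ggcorr(df)  # pairwise correlations among numeric variables
```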

2. Data splitting 

We used the `rsample` package to first perform an initial split of our data into two pieces: a training set and a testing set. The training set was used to optimize the model, while the testing set was used only to evaluate the performance of our final model. We also used the `rsample` package to create cross validation subsets of our training data. This allowed us to better assess the performance of our tested models using our training data.  
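
As a compact reminder, these splitting steps can be sketched as follows (schematic and not evaluated here; `df`, the 2/3 training proportion, and 10 folds are illustrative placeholders):

```{r, eval = FALSE}
library(rsample)

set.seed(1234)
data_split <- initial_split(df, prop = 2/3)  # 2/3 training, 1/3 testing
train_df   <- training(data_split)
test_df    <- testing(data_split)

# cross validation subsets of the training data (here, 10 folds)
vfold_df <- vfold_cv(train_df, v = 10)
```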

3. Variable assignment and pre-processing   

We used the `recipes` package to assign variable roles (such as outcome, predictor, and id variable). We also used this package to create a recipe for pre-processing our training and testing data. This involved steps such as: `step_dummy()` to create dummy numeric encodings of our categorical variables, `step_corr()` to remove highly correlated variables, and `step_nzv()` to remove near-zero-variance variables that would contribute little to our model and could add noise. We learned that once our recipe was created and prepped using `prep()`, we could extract the pre-processed training data or pre-processed testing data using `bake()`. We also learned that if we used the newer `workflows` package, we did not need to use the `prep()` or `bake()` functions, but that it is still useful to know how to do so if we want to look more deeply at our data and how the recipe influences it.  
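
A minimal recipe along these lines could be sketched as follows (not evaluated here; the `value` outcome and `id` variable names are placeholders, and the selectors are illustrative):

```{r, eval = FALSE}
library(recipes)

simple_rec <- recipe(train_df) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable") %>%
  step_dummy(all_nominal(), -has_role("id variable")) %>%   # numeric encodings of categorical variables
  step_corr(all_predictors(), -has_role("id variable")) %>% # drop highly correlated predictors
  step_nzv(all_predictors(), -has_role("id variable"))      # drop near-zero-variance predictors

prepped_rec <- prep(simple_rec, training = train_df)
baked_train <- bake(prepped_rec, new_data = NULL)     # pre-processed training data
baked_test  <- bake(prepped_rec, new_data = test_df)  # pre-processed testing data
```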

4. Model specification, fitting, tuning and performance evaluation using the training data  

We learned that the model first needs to be fit to the training data. In both classification and prediction, the model is fit to the training data, and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. We specify the model using the `parsnip` package, and we also use this package to fit the model with the `fit()` function. If we use `parsnip` alone to fit the model, we need to use the pre-processed training data (the output of `bake()`); if we instead use the `workflows` package to create a workflow that pre-processes the data for us, we can use the raw training data.   
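
These two routes can be sketched as follows (not evaluated here; a linear regression is used as a simple example, and the recipe and baked data objects are placeholders standing in for those built earlier in the case study):

```{r, eval = FALSE}
library(parsnip)
library(workflows)

# specify the model, its engine, and its mode
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# route 1: fit with parsnip alone, using the pre-processed (baked) training data
lm_fit <- fit(lm_spec, value ~ ., data = baked_train)

# route 2: a workflow pre-processes the raw training data for us
lm_wflow <- workflow() %>%
  add_recipe(simple_rec) %>%
  add_model(lm_spec)
lm_wflow_fit <- fit(lm_wflow, data = train_df)
```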

We learned that if the model fits well, then the estimated values will be very similar to the true outcome variable values in our training data. We can assess model performance using the `metrics()` function of the `yardstick` package, or the `collect_metrics()` function of the `tune` package (required if using cross validation or tuning). We also learned that we can use subsets of our training data (which we created with the `rsample` package) to perform cross validation and get a better estimate of the performance of our model, as we want our results to be generalizable and to perform well with other data, not just our training data. We used the `fit_resamples()` function of the `tune` package to fit our model on the different training data subsets, and the `collect_metrics()` function (also of the `tune` package) to evaluate model performance across these subsets. We also learned that we can potentially improve model performance by tuning aspects of the model called [hyperparameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/){target="_blank"}. We can do this using the `tune` and `dials` packages, evaluating the performance of the model under different hyperparameter options with the training data subsets we used for cross validation. After testing several different methods to model our data, we compared them and chose the best performing model as our final model.  
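
These resampling and tuning steps can be sketched like so (not evaluated here; the workflow, folds, and grid size are illustrative placeholders from the earlier sketches):

```{r, eval = FALSE}
library(parsnip)
library(workflows)
library(tune)
library(dials)

# evaluate a workflow across the cross validation folds
resample_fit <- fit_resamples(lm_wflow, resamples = vfold_df)
collect_metrics(resample_fit)

# mark random forest hyperparameters for tuning instead of fixing them
rf_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

rf_wflow <- workflow() %>%
  add_recipe(simple_rec) %>%
  add_model(rf_spec)

rf_tune_res <- tune_grid(rf_wflow, resamples = vfold_df, grid = 10)
show_best(rf_tune_res, metric = "rmse")
```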


5. Overall model performance evaluation  

Once we chose our final model, we evaluated its performance on the testing data using the `last_fit()` function of the `tune` package. This gives us a better estimate of how well the model will predict or classify the outcome variable of interest with new independent data. **Ideally one would also perform an evaluation with independent data to provide a sense of how generalizable the model is to other data sources.**

We also saw that we can use the `collect_predictions()` function of the `tune` package to get the predictions for our test data. We saw that we can get more detailed prediction data using the `predict()` function of the `parsnip` package.
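
In code, this final evaluation step can be sketched as (not evaluated here; the workflow and split objects are placeholders for those created earlier):

```{r, eval = FALSE}
library(tune)

# fit the final workflow on the full training set and
# evaluate it once on the held-out testing set
final_fit <- last_fit(rf_wflow, split = data_split)

collect_metrics(final_fit)      # test-set performance estimates
collect_predictions(final_fit)  # per-observation test-set predictions
```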

</details>

####


## **Suggested Homework**
***

Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.


# **Additional Information**
***

## **Helpful Links**
***

1. A review of [tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/){target="_blank"}  
2. A [course on tidymodels](https://juliasilge.com/blog/tidymodels-ml-course/){target="_blank"} by Julia Silge  
3. [More examples, explanations, and info about tidymodels development](https://www.tidymodels.org/learn/){target="_blank"} from the developers  
4. A guide for [pre-processing with recipes](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/){target="_blank"}  
5. A [guide](https://briatte.github.io/ggcorr/){target="_blank"} for using GGally to create correlation plots  
6. A [guide](https://www.tidyverse.org/blog/2018/11/parsnip-0-0-1/){target="_blank"} for using parsnip to try different algorithms or engines  
7. A [list of recipe functions](https://tidymodels.github.io/recipes/reference/index.html){target="_blank"}  
8. A great blog post about [cross validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6){target="_blank"}  
9. A discussion about [evaluating model performance](https://medium.com/@limavallantin/metrics-to-measure-machine-learning-model-performance-e8c963665476){target="_blank"} for a deeper explanation about how to evaluate model performance  
10. [RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}
11. An [explanation](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d){target="_blank"} of supervised vs unsupervised machine learning and the bias-variance trade-off.
12. A thorough [explanation](https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202#:~:text=Principal%20component%20analysis%20(PCA)%20is,variables%20that%20successively%20maximize%20variance.){target="_blank"} of principal component analysis.
13. If you have access, this is a great [discussion](https://www.tandfonline.com/doi/abs/10.1080/00031305.1984.10483183){target="_blank"}  about the difference between independence, orthogonality, and lack of correlation.
14. Great [video explanation](https://youtu.be/_UVHneBUBW0){target="_blank"} of PCA.  

<u>Terms and concepts covered:</u>  

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[Imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)){target="_blank"}  
[Transformation](https://en.wikipedia.org/wiki/Data_transformation_(statistics)){target="_blank"}  
[Discretization](https://en.wikipedia.org/wiki/Discretization_of_continuous_features){target="_blank"}  
[Dummy Variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)){target="_blank"}  
[One-Hot Encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/){target="_blank"}  
[Data Type Conversions](https://cran.r-project.org/web/packages/hablar/vignettes/convert.html){target="_blank"}  
[Interaction](https://statisticsbyjim.com/regression/interaction-effects/){target="_blank"}  
[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)){target="_blank"}  
[Dimensionality Reduction/Signal Extraction](https://en.wikipedia.org/wiki/Dimensionality_reduction){target="_blank"}  
[Row Operations](https://tartarus.org/gareth/maths/Linear_Algebra/row_operations.pdf){target="_blank"}  
[Near Zero Variance](https://www.r-bloggers.com/near-zero-variance-predictors-should-we-remove-them/){target="_blank"}  
[Parameters and Hyperparameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/){target="_blank"}   
[Supervised and Unsupervised Learning](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d){target="_blank"}  
[Principal Component Analysis](https://medium.com/@savastamirko/pca-a-linear-transformation-f8aacd4eb007){target="_blank"}  
[Linear Combinations](https://www.mathbootcamps.com/linear-combinations-vectors/){target="_blank"}  
[Decision Tree](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb){target="_blank"}  
[Random Forest](https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9){target="_blank"}  


 <u>**Packages used in this case study:** </u>

Package   | Use in this case study                                                                      
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to view/arrange/filter/select/compare specific subsets of the data 
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data
[summarytools](https://cran.r-project.org/web/packages/summarytools/index.html){target="_blank"}      | to get an overview of data in a different style
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator 
[corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html){target="_blank"} | to make large correlation plots
[GGally](https://cran.r-project.org/web/packages/GGally/GGally.pdf){target="_blank"} | to make smaller correlation plots  
[rsample](https://tidymodels.github.io/rsample/articles/Basics.html){target="_blank"}   | to split the data into testing and training sets and to split the training set for cross-validation  
[recipes](https://tidymodels.github.io/recipes/){target="_blank"}   | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are `recipe()`, `prep()`, various transformation `step_*()` functions, and `bake()`, which extracts the pre-processed training data (this used to require `juice()`) and applies the recipe pre-processing steps to testing data). See [here](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} for more info.
[parsnip](https://tidymodels.github.io/parsnip/){target="_blank"}   | an interface to create models (major functions are  `fit()`, `set_engine()`)
[yardstick](https://tidymodels.github.io/yardstick/){target="_blank"}   | to evaluate the performance of models
[broom](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/){target="_blank"} | to get tidy output for our model fit and performance
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[dials](https://www.tidyverse.org/blog/2019/10/dials-0-0-3/){target="_blank"} | to specify hyper-parameter tuning
[tune](https://tune.tidymodels.org/){target="_blank"} | to perform cross validation, tune hyper-parameters, and get performance metrics
[workflows](https://www.rdocumentation.org/packages/workflows/versions/0.1.1){target="_blank"} | to create modeling workflow to streamline the modeling process
[vip](https://cran.r-project.org/web/packages/vip/vip.pdf){target="_blank"} | to create variable importance plots
[randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf){target="_blank"} | to perform the random forest analysis
[doParallel](https://cran.r-project.org/web/packages/doParallel/doParallel.pdf) | to fit cross validation samples in parallel 
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text in the map data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns
[rnaturalearth](https://cran.r-project.org/web/packages/rnaturalearth/README.html){target="_blank"} | to get the geometry data for the earth to plot the US
[maps](https://cran.r-project.org/web/packages/maps/maps.pdf){target="_blank"} | to get map database data about counties to draw them on our US map
[sf](https://r-spatial.github.io/sf/){target="_blank"}  | to convert the map data into a data frame
[lwgeom](https://cran.r-project.org/web/packages/lwgeom/lwgeom.pdf){target="_blank"} | to use the `sf` function to convert the map geographical data
[rgeos](https://cran.r-project.org/web/packages/rgeos/rgeos.pdf){target="_blank"} | to use geometry data
[patchwork](https://cran.r-project.org/web/packages/patchwork/patchwork.pdf){target="_blank"} | to allow plots to be combined


## **Session info**
***


```{r}
sessionInfo()
```

**Estimate of RMarkdown Compilation Time: **

```{r, echo=FALSE}
rmarkdown:::perf_timer_stop("render")
pts = rmarkdown:::perf_timer_summary()
cat("About", round(pts$time[1]/1000 + 5), "-", round(pts$time[1]/1000 + 15),"seconds")
```

This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary across machines and operating systems.

## **Acknowledgments**
***

We would like to acknowledge [Roger Peng](http://www.biostat.jhsph.edu/~rpeng/), [Megan Latshaw](https://www.jhsph.edu/faculty/directory/profile/1708/megan-weil-latshaw), and [Kirsten Koehler](https://www.jhsph.edu/faculty/directory/profile/2928/kirsten-koehler) for assisting in framing the major direction of the case study.

We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and developing the `OCSdata` package.

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 


<script type='text/javascript' id='clustrmaps' src='//cdn.clustrmaps.com/map_v2.js?cl=080808&w=a&t=tt&d=7YT-EDGa4MUwQXSQxqfv9Nd9Nt852b7plGdS6UQJO1Q&co=ffffff&cmo=3acc3a&cmn=ff5353&ct=808080'></script>
