Motivation

Health policy in the United States is complicated, and several forms of healthcare coverage exist, including both coverage by federal goverment-led healthcare policy, and by private insurance companies. Before making any inference about the relationship between health condition and health policy, it is important for us to have a general idea about healthcare economics in the United States. Thus, we are interested in getting sense of healthcare coverage and healthcare spending across States. More specifically, the questions are:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?
  2. How does the spending distribution change across geographic regions in the United States?
  3. Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?

In this case study, we’ll walk you through collecting data, importing data, cleaning data, wrangling data, and visualizing the data, using well-established and commonly used packages, including datasets, tidyr, dplyr, ggplot2, and ggrepel.

What is the data?

Image source from US Department of Health and Human Services

Healthcare data

We will be using the data from the Henry J Kaiser Family Foundation (KFF).

We have downloaded, re-named and saved these files in the GitHub repository under the data/KFF/ directory.

Now, before we dig into the data analysis, we need to introduce a set of R packages that we will use to analyze the data.

Data Import

Introduction to “Tidy data”

The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.”

Another way of putting it is that it’s a set of packages that are useful specifically for data manipulation, exploration and visualization with a common philosophy.

What is this common philosophy?

The common philosophy is called “tidy” data. It is a standard way of mapping the meaning of a dataset to its structure.

In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Below, we are interested in transforming the table on the right to the the table on the left, which is considered “tidy”.

Working with tidy data is useful because it creates a structured way of organizing data values within a data set. This makes the data analysis process more efficient and simplifies the development of data analysis tools that work together. In this way, you can focus on the problem you are investigating, rather than the uninteresting logistics of data.

1. What is in the tidyverse?

We can install and load the set of R packages using install.packages("tidyverse") function.

When we load the tidyverse package using library(tidyverse), there are six core R packages that load:

  • readr, for data import.
  • tidyr, for data tidying.
  • dplyr, for data wrangling.
  • ggplot2, for data visualisation.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.

Here, we load in the tidyverse.

These packages are highlighted in bold here:

Because these packages all share the “tidy” philosophy, the data analysis workflow is easier as you move from package to package.

Here, we will focus on the readr, tidyr and dplyr R packages to import data, to transform data to the “tidy” format, and to wrangle data.

Next, we will give a brief description of the features in each of these packages.

There are several base R functions that allow you read in data into R, which you may be familiar with such as read.table(), read.csv(), and read.delim(). Instead of using these, we will use the functions in the readr R package. The main reasons for this are

  1. Compared to equivalent base R functions, the functions in readr are around 10x faster.
  2. You can specify the column types (e.g character, integer, double, logical, date, time, etc)
  3. All parsing problems are recorded in a data frame.

Read data using the readr R package

The main functions in readr are:

readr functions Description
read_delim() reads in a flat file data with a given character to separate fields
read_csv() reads in a CSV file
read_tsv() reads in a file with values separated by tabs
read_lines() reads only a certain number of lines from the file
read_file() reads a complete file into a string
write_csv() writes data frame to CSV

A useful cheatsheet for the functions in the readr package can be found on RStudio’s website:

1. Read in data

Read in health healthcare coverage data

Let’s try reading in some data. We will begin by reading in the healthcare-coverage.csv data.

If we want to see what the header of the file looks like, we can use the read_lines() function to peak at the first few lines.

 [1] "\"Title: Health Insurance Coverage of the Total Population | The Henry J. Kaiser Family Foundation\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
 [2] "\"Timeframe: 2013 - 2016\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
 [3] "\"Location\",\"2013__Employer\",\"2013__Non-Group\",\"2013__Medicaid\",\"2013__Medicare\",\"2013__Other Public\",\"2013__Uninsured\",\"2013__Total\",\"2014__Employer\",\"2014__Non-Group\",\"2014__Medicaid\",\"2014__Medicare\",\"2014__Other Public\",\"2014__Uninsured\",\"2014__Total\",\"2015__Employer\",\"2015__Non-Group\",\"2015__Medicaid\",\"2015__Medicare\",\"2015__Other Public\",\"2015__Uninsured\",\"2015__Total\",\"2016__Employer\",\"2016__Non-Group\",\"2016__Medicaid\",\"2016__Medicare\",\"2016__Other Public\",\"2016__Uninsured\",\"2016__Total\""
 [4] "\"United States\",\"155696900\",\"13816000\",\"54919100\",\"40876300\",\"6295400\",\"41795100\",\"313401200\",\"154347500\",\"19313000\",\"61650400\",\"41896500\",\"5985000\",\"32967500\",\"316159900\",\"155965800\",\"21816500\",\"62384500\",\"43308400\",\"6422300\",\"28965900\",\"318868500\",\"157381500\",\"21884400\",\"62303400\",\"44550200\",\"6192200\",\"28051900\",\"320372000\""                                                                                                                                                                           
 [5] "\"Alabama\",\"2126500\",\"174200\",\"869700\",\"783000\",\"85600\",\"724800\",\"4763900\",\"2202800\",\"288900\",\"891900\",\"718400\",\"143900\",\"522200\",\"4768000\",\"2218000\",\"291500\",\"911400\",\"719100\",\"174600\",\"519400\",\"4833900\",\"2263800\",\"262400\",\"997000\",\"761200\",\"128800\",\"420800\",\"4834100\""                                                                                                                                                                                                                                      
 [6] "\"Alaska\",\"364900\",\"24000\",\"95000\",\"55200\",\"60600\",\"102200\",\"702000\",\"345300\",\"26800\",\"130100\",\"55300\",\"37300\",\"100800\",\"695700\",\"355700\",\"22300\",\"128100\",\"60900\",\"47700\",\"90500\",\"705300\",\"324400\",\"20300\",\"145400\",\"68200\",\"55600\",\"96900\",\"710800\""                                                                                                                                                                                                                                                             
 [7] "\"Arizona\",\"2883800\",\"170800\",\"1346100\",\"842000\",\"N/A\",\"1223000\",\"6603100\",\"2835200\",\"333500\",\"1639400\",\"911100\",\"N/A\",\"827100\",\"6657200\",\"2766500\",\"278400\",\"1711500\",\"949000\",\"189300\",\"844800\",\"6739500\",\"3010700\",\"377000\",\"1468400\",\"1028000\",\"172500\",\"833700\",\"6890200\""                                                                                                                                                                                                                                     
 [8] "\"Arkansas\",\"1128800\",\"155600\",\"600800\",\"515200\",\"67600\",\"436800\",\"2904800\",\"1176500\",\"231700\",\"639200\",\"479400\",\"82000\",\"287200\",\"2896000\",\"1293700\",\"200200\",\"641400\",\"484500\",\"63700\",\"268400\",\"2953000\",\"1290900\",\"252900\",\"618600\",\"490000\",\"67500\",\"225500\",\"2945300\""                                                                                                                                                                                                                                        
 [9] "\"California\",\"17747300\",\"1986400\",\"8344800\",\"3828500\",\"675400\",\"5594100\",\"38176400\",\"17703700\",\"2778800\",\"9618800\",\"4049000\",\"634400\",\"3916700\",\"38701300\",\"17718300\",\"3444200\",\"10138100\",\"4080100\",\"752700\",\"2980600\",\"39113900\",\"18116200\",\"3195400\",\"9853800\",\"4436000\",\"556100\",\"3030800\",\"39188300\""                                                                                                                                                                                                         
[10] "\"Colorado\",\"2852500\",\"426300\",\"697300\",\"549700\",\"118100\",\"654000\",\"5297800\",\"2489400\",\"397900\",\"1053700\",\"619500\",\"214000\",\"602900\",\"5377400\",\"2706000\",\"346900\",\"1036600\",\"708000\",\"148000\",\"475700\",\"5421300\",\"2872600\",\"370000\",\"855800\",\"692400\",\"190100\",\"528400\",\"5509200\""                                                                                                                                                                                                                                  

It looks like the first two lines are descriptive and are not useful. We will tell R to skip reading these in using the skip argument in read_csv(). The third line looks like it contains the column names and starting on the fourth line is where the data starts.

# A tibble: 6 x 29
  Location `2013__Employer` `2013__Non-Grou… `2013__Medicaid` `2013__Medicare`
  <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1 United …        155696900         13816000         54919100         40876300
2 Alabama           2126500           174200           869700           783000
3 Alaska             364900            24000            95000            55200
4 Arizona           2883800           170800          1346100           842000
5 Arkansas          1128800           155600           600800           515200
6 Califor…         17747300          1986400          8344800          3828500
# … with 24 more variables: `2013__Other Public` <chr>,
#   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
#   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
#   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
#   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
#   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
#   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
#   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
#   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>
# A tibble: 6 x 29
  Location `2013__Employer` `2013__Non-Grou… `2013__Medicaid` `2013__Medicare`
  <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1 <NA>                   NA               NA               NA               NA
2 *Uninsu…               NA               NA               NA               NA
3 <NA>                   NA               NA               NA               NA
4 For exa…               NA               NA               NA               NA
5 <NA>                   NA               NA               NA               NA
6 *N/A*: …               NA               NA               NA               NA
# … with 24 more variables: `2013__Other Public` <chr>,
#   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
#   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
#   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
#   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
#   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
#   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
#   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
#   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

It looks like we now have the right header, but there are a bunch of NAs in the end of the data frame because most of it isn’t useful data.

Let’s take a closer look at the last 30 lines

# A tibble: 30 x 29
   Location `2013__Employer` `2013__Non-Grou… `2013__Medicaid` `2013__Medicare`
   <chr>               <dbl>            <dbl>            <dbl>            <dbl>
 1 "Washin…          3541600           309000          1026800           879000
 2 "West V…           841300            42600           382500           329400
 3 "Wiscon…          3154500           225300           907600           812900
 4 "Wyomin…           305900            19500            74200            65400
 5 "Notes"                NA               NA               NA               NA
 6 "The ma…               NA               NA               NA               NA
 7  <NA>                  NA               NA               NA               NA
 8 "In thi…               NA               NA               NA               NA
 9  <NA>                  NA               NA               NA               NA
10 "Data e…               NA               NA               NA               NA
# … with 20 more rows, and 24 more variables: `2013__Other Public` <chr>,
#   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
#   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
#   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
#   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
#   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
#   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
#   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
#   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

It looks like there is a line with a string Notes in it and everything below that line should not be read in. We can use the n_max argument here.

# A tibble: 6 x 29
  Location `2013__Employer` `2013__Non-Grou… `2013__Medicaid` `2013__Medicare`
  <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1 Vermont            317700            26200           123400            96600
2 Virginia          4661600           364800           773200           968000
3 Washing…          3541600           309000          1026800           879000
4 West Vi…           841300            42600           382500           329400
5 Wiscons…          3154500           225300           907600           812900
6 Wyoming            305900            19500            74200            65400
# … with 24 more variables: `2013__Other Public` <chr>,
#   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
#   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
#   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
#   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
#   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
#   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
#   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
#   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

That’s better!

Read in healthcare spending data

Now because we are also going to want to use in healthcare-spending.csv, let’s read it in now.

# A tibble: 6 x 25
  Location `1991__Total He… `1992__Total He… `1993__Total He… `1994__Total He…
  <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1 Vermont              1330             1421             1522             1625
2 Virginia            14829            15599            16634            17637
3 Washing…            12674            13859            14523            15303
4 West Vi…             4672             5159             5550             5891
5 Wiscons…            12694            13669            14636            15532
6 Wyoming              1023             1067             1171             1265
# … with 20 more variables: `1995__Total Health Spending` <dbl>, `1996__Total
#   Health Spending` <dbl>, `1997__Total Health Spending` <dbl>, `1998__Total
#   Health Spending` <dbl>, `1999__Total Health Spending` <dbl>, `2000__Total
#   Health Spending` <dbl>, `2001__Total Health Spending` <dbl>, `2002__Total
#   Health Spending` <dbl>, `2003__Total Health Spending` <dbl>, `2004__Total
#   Health Spending` <dbl>, `2005__Total Health Spending` <dbl>, `2006__Total
#   Health Spending` <dbl>, `2007__Total Health Spending` <dbl>, `2008__Total
#   Health Spending` <dbl>, `2009__Total Health Spending` <dbl>, `2010__Total
#   Health Spending` <dbl>, `2011__Total Health Spending` <dbl>, `2012__Total
#   Health Spending` <dbl>, `2013__Total Health Spending` <dbl>, `2014__Total
#   Health Spending` <dbl>

2. Take a glimpse() at your data

One last thing in this section. One way to look at our data would be to use head() or tail(), as we just saw. Another one you might have heard of is the str() function. One you might not have heard of is the glimpse() function. It’s used for a special type of object in R called a tibble. Let’s read the help file to learn more.

It’s kind of like print() where it shows you columns running down the page. Let’s try it out. If we look at our data, say the coverage data frame, we see that it is not “tidy”:

Rows: 52
Columns: 29
$ Location             <chr> "United States", "Alabama", "Alaska", "Arizona",…
$ `2013__Employer`     <dbl> 155696900, 2126500, 364900, 2883800, 1128800, 17…
$ `2013__Non-Group`    <dbl> 13816000, 174200, 24000, 170800, 155600, 1986400…
$ `2013__Medicaid`     <dbl> 54919100, 869700, 95000, 1346100, 600800, 834480…
$ `2013__Medicare`     <dbl> 40876300, 783000, 55200, 842000, 515200, 3828500…
$ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "67600", "67…
$ `2013__Uninsured`    <dbl> 41795100, 724800, 102200, 1223000, 436800, 55941…
$ `2013__Total`        <dbl> 313401200, 4763900, 702000, 6603100, 2904800, 38…
$ `2014__Employer`     <dbl> 154347500, 2202800, 345300, 2835200, 1176500, 17…
$ `2014__Non-Group`    <dbl> 19313000, 288900, 26800, 333500, 231700, 2778800…
$ `2014__Medicaid`     <dbl> 61650400, 891900, 130100, 1639400, 639200, 96188…
$ `2014__Medicare`     <dbl> 41896500, 718400, 55300, 911100, 479400, 4049000…
$ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "82000", "6…
$ `2014__Uninsured`    <dbl> 32967500, 522200, 100800, 827100, 287200, 391670…
$ `2014__Total`        <dbl> 316159900, 4768000, 695700, 6657200, 2896000, 38…
$ `2015__Employer`     <dbl> 155965800, 2218000, 355700, 2766500, 1293700, 17…
$ `2015__Non-Group`    <dbl> 21816500, 291500, 22300, 278400, 200200, 3444200…
$ `2015__Medicaid`     <dbl> 62384500, 911400, 128100, 1711500, 641400, 10138…
$ `2015__Medicare`     <dbl> 43308400, 719100, 60900, 949000, 484500, 4080100…
$ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "63700",…
$ `2015__Uninsured`    <dbl> 28965900, 519400, 90500, 844800, 268400, 2980600…
$ `2015__Total`        <dbl> 318868500, 4833900, 705300, 6739500, 2953000, 39…
$ `2016__Employer`     <dbl> 157381500, 2263800, 324400, 3010700, 1290900, 18…
$ `2016__Non-Group`    <dbl> 21884400, 262400, 20300, 377000, 252900, 3195400…
$ `2016__Medicaid`     <dbl> 62303400, 997000, 145400, 1468400, 618600, 98538…
$ `2016__Medicare`     <dbl> 44550200, 761200, 68200, 1028000, 490000, 443600…
$ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "67500",…
$ `2016__Uninsured`    <dbl> 28051900, 420800, 96900, 833700, 225500, 3030800…
$ `2016__Total`        <dbl> 320372000, 4834100, 710800, 6890200, 2945300, 39…

Read the State information using the datasets R package

Since our goal is to get sense of the health expenditure, including healthcare coverage and healthcare spending, across States, it would be nice add some information about each state. Namely, the state abbreviation and state region (i.e. north, south, etc).

For this we use the state dataset in the datasets R package.

Before we begin, let’s look at what states are there:

 [1] "United States"        "Alabama"              "Alaska"              
 [4] "Arizona"              "Arkansas"             "California"          
 [7] "Colorado"             "Connecticut"          "Delaware"            
[10] "District of Columbia" "Florida"              "Georgia"             
[13] "Hawaii"               "Idaho"                "Illinois"            
[16] "Indiana"              "Iowa"                 "Kansas"              
[19] "Kentucky"             "Louisiana"            "Maine"               
[22] "Maryland"             "Massachusetts"        "Michigan"            
[25] "Minnesota"            "Mississippi"          "Missouri"            
[28] "Montana"              "Nebraska"             "Nevada"              
[31] "New Hampshire"        "New Jersey"           "New Mexico"          
[34] "New York"             "North Carolina"       "North Dakota"        
[37] "Ohio"                 "Oklahoma"             "Oregon"              
[40] "Pennsylvania"         "Rhode Island"         "South Carolina"      
[43] "South Dakota"         "Tennessee"            "Texas"               
[46] "Utah"                 "Vermont"              "Virginia"            
[49] "Washington"           "West Virginia"        "Wisconsin"           
[52] "Wyoming"             

We see there are more than 50 states because “United States” and “District of Columbia” are both included.

Let’s look what states are inside the state dataset.

 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"       

Ah, ok. So let’s start by dealing with DC as a special case.

We will deal with the “United States” in the next section.

Data Wrangling

What is “Tidy Data”?

Glance at “Tidy Data”

A subset of the data analysis process can be thought about in the following way:

where each of these steps needs its own tools and software to complete.

After we import the data into R, if we are going to take advantage of the “tidyverse”, this means we need to transform the data into a form that is “tidy”. If you recall, in tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

For example, consider the following dataset:

Here:

  • each row represents one company (row names are companies)
  • each column represent one time point
  • the stock prices are defined for each row/column pair

Alternatively, a data set can be structured in the following way:

  • each row represents one time point (but no row names)
  • the first column defines the time variable and the last three columns contain the stock prices for three companies

In both cases, the data is the same, but the structure is different. This can be frustrating to deal with as an analyst because the meaning of the values (rows and columns) in the two data sets are different. Providing a standardized way of organizing values within a data set would alleviate a major portion of this frustration.

For motivation, a tidy version of the stock data we looked at above looks like this: (we’ll learn how the functions work in just a moment)

In this “tidy” data set, we have three columns representing three variables (time, company name and stock price). Every row represents contains one stock price from a particular time and for a specific company.

If we consider our coverage dataframe, we see it is also not in a tidy format. Each row contains information about the coverage level by Location across years and types of coverage.

# A tibble: 5 x 5
  Location   `2013__Employer` `2013__Non-Grou… `2013__Medicaid` `2013__Medicare`
  <chr>                 <dbl>            <dbl>            <dbl>            <dbl>
1 United St…        155696900         13816000         54919100         40876300
2 Alabama             2126500           174200           869700           783000
3 Alaska               364900            24000            95000            55200
4 Arizona             2883800           170800          1346100           842000
5 Arkansas            1128800           155600           600800           515200

Now, let’s use the tidyr R package to transform our data into a tidy format.

The tidyr R package

1. What is the tidyr R package ?

tidyr is an R package that transforms data sets to a tidy format.

This package is installed and loaded when you load the tidyverse using library(tidyverse). However, you can also just load the library by itself.

The main functions in tidyr are:

tidyr functions Description
gather() takes multiple columns, and gathers them into key-value pairs, making “wide” data longer
separate() turns a single character column into multiple columns, making “long” data wider
spread() spread rows into multiple columns, transforming “long” data into “wide” format

We’ll explore what it means to go between a “wide” and “long” data format using gather() , separate(), and spread().

A tidyr cheatsheet for the functions in the tidyr package can be found on RStudio’s website:

2. Convert data from wide format to long format using gather()

Let’s start by looking at the gather() help file

This function gathers multiple columns and collapses them into new key-value pairs. This transform data from wide format into a long format.

  • The key is the name of the new column that you are creating which contains the values of the column headings that you are gathering
  • The value is the name of the new column that will contain the values themselves
  • The third argument defines the columns to gather

For example, here we create a column titled year_type and coverage. We also want to keep the Location column as it is because it also contains observational level data.

# A tibble: 1,456 x 3
   Location             year_type      tot_coverage
   <chr>                <chr>          <chr>       
 1 United States        2013__Employer 155696900   
 2 Alabama              2013__Employer 2126500     
 3 Alaska               2013__Employer 364900      
 4 Arizona              2013__Employer 2883800     
 5 Arkansas             2013__Employer 1128800     
 6 California           2013__Employer 17747300    
 7 Colorado             2013__Employer 2852500     
 8 Connecticut          2013__Employer 2030500     
 9 Delaware             2013__Employer 473700      
10 District of Columbia 2013__Employer 324300      
# … with 1,446 more rows

Now we see each row contains one observation. Namely, a Location, a year_type and coverage. It would be nice to separate out the information in the year_type column into two columns. We can implement same techniques to the healthcare spending dataset.

Convert healthcare spending data to a long format (tidy format)

Let’s do the same for the spending data. In this case I will use year and spending for the key and value. We also want to keep Location like before.

# A tibble: 1,248 x 3
   Location             year                        tot_spending
   <chr>                <chr>                              <dbl>
 1 United States        1991__Total Health Spending       675896
 2 Alabama              1991__Total Health Spending        10393
 3 Alaska               1991__Total Health Spending         1458
 4 Arizona              1991__Total Health Spending         9269
 5 Arkansas             1991__Total Health Spending         5632
 6 California           1991__Total Health Spending        81438
 7 Colorado             1991__Total Health Spending         8460
 8 Connecticut          1991__Total Health Spending        10950
 9 Delaware             1991__Total Health Spending         1938
10 District of Columbia 1991__Total Health Spending         2800
# … with 1,238 more rows

We will explore how to do that in the Data Wrangling section below. For now let’s learn more about the tidyr package.

3. Convert data from long format to wide format using spread()

In contrast to gathering multiple columns into key-value pairs, we can spread a key-value pair across multiple columns.

The function spread() does just that. It transforms data from a long format into a wide format.

  • The key is the name of the column in your data set that contains the values of the column headings that you are spreading across multiple columns
  • The value is the name of the column that contains the values for the multiple columns
# A tibble: 52 x 29
   Location `2013__Employer` `2013__Medicaid` `2013__Medicare` `2013__Non-Grou…
   <chr>    <chr>            <chr>            <chr>            <chr>           
 1 Alabama  2126500          869700           783000           174200          
 2 Alaska   364900           95000            55200            24000           
 3 Arizona  2883800          1346100          842000           170800          
 4 Arkansas 1128800          600800           515200           155600          
 5 Califor… 17747300         8344800          3828500          1986400         
 6 Colorado 2852500          697300           549700           426300          
 7 Connect… 2030500          532000           475300           126800          
 8 Delaware 473700           192700           141300           25100           
 9 Distric… 324300           174900           59900            30400           
10 Florida  8023400          3190900          3108800          968200          
# … with 42 more rows, and 24 more variables: `2013__Other Public` <chr>,
#   `2013__Total` <chr>, `2013__Uninsured` <chr>, `2014__Employer` <chr>,
#   `2014__Medicaid` <chr>, `2014__Medicare` <chr>, `2014__Non-Group` <chr>,
#   `2014__Other Public` <chr>, `2014__Total` <chr>, `2014__Uninsured` <chr>,
#   `2015__Employer` <chr>, `2015__Medicaid` <chr>, `2015__Medicare` <chr>,
#   `2015__Non-Group` <chr>, `2015__Other Public` <chr>, `2015__Total` <chr>,
#   `2015__Uninsured` <chr>, `2016__Employer` <chr>, `2016__Medicaid` <chr>,
#   `2016__Medicare` <chr>, `2016__Non-Group` <chr>, `2016__Other
#   Public` <chr>, `2016__Total` <chr>, `2016__Uninsured` <chr>

In the real world, analyzing data rarely involves data that can be easily imported and ready for analysis. According to Wikipedia:

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

As you may see in class or here from data scientists on Twitter, one of the most time-consuming aspects of the data analysis process is “data wrangling”. This is also is a trendy term for cleaning up a messy data set.

R provides incredibly powerful and flexible language for data wrangling. However, the syntax is somewhat hard to get used to. We will therefore introducing a package that makes the syntax much more like the English language. This package is dplyr.

The dplyr R package

1. What is the dplyr R package ?

dplyr is a powerful R-package to transform and summarize tabular data with rows and columns.

The package contains a set of functions (or “verbs”) to perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.

In addition, dplyr contains a useful function to perform another common task which is the is the “split-apply-combine” concept. We will discuss that in a little bit.

2. Compare dplyr R package compare with base functions R

If you are familiar with R, you are probably familiar with base R functions such as split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate(). Compared to base functions in R, the functions in dplyr are easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames instead of just vectors.

The important dplyr verbs to remember are:

dplyr verbs Description
select() select columns
filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns
summarize() summarize values
group_by() allows for group operations in the “split-apply-combine” concept

3. Pipe operator: %>%

Before we go any further, let’s introduce the pipe operator: %>%. In our stocks example, we briefly saw this symbol. It is called the pipe operator. dplyr imports this operator from another package (magrittr) see help file here. This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of of piping is to read the functions from left to right.

Now in stocks example, we pipe the stocks data frame to the function that will gather multiple columns into key-value pairs.

dplyr verbs in action: separate(), unite(), …

First, let’s separate the year_type column in the coverage dataset to two columns: year and health coverage type.

To do this, we will use the separate() function in the tidyr package.

Note:

  • separate() = separate one column into multiple columns
  • unite() = unite multiple columns into one

Learn separate() and unite() in the spending dataset

# A tibble: 1,456 x 4
   Location             year  type     tot_coverage
   <chr>                <chr> <chr>    <chr>       
 1 United States        2013  Employer 155696900   
 2 Alabama              2013  Employer 2126500     
 3 Alaska               2013  Employer 364900      
 4 Arizona              2013  Employer 2883800     
 5 Arkansas             2013  Employer 1128800     
 6 California           2013  Employer 17747300    
 7 Colorado             2013  Employer 2852500     
 8 Connecticut          2013  Employer 2030500     
 9 Delaware             2013  Employer 473700      
10 District of Columbia 2013  Employer 324300      
# … with 1,446 more rows

We see that we now have two columns, except the year column was converted to a character. If we look at the help file ?separate, we see we can use the convert=TRUE argument to convert the character to an integer.

# A tibble: 1,456 x 4
   Location              year type     tot_coverage
   <chr>                <int> <chr>    <chr>       
 1 United States         2013 Employer 155696900   
 2 Alabama               2013 Employer 2126500     
 3 Alaska                2013 Employer 364900      
 4 Arizona               2013 Employer 2883800     
 5 Arkansas              2013 Employer 1128800     
 6 California            2013 Employer 17747300    
 7 Colorado              2013 Employer 2852500     
 8 Connecticut           2013 Employer 2030500     
 9 Delaware              2013 Employer 473700      
10 District of Columbia  2013 Employer 324300      
# … with 1,446 more rows

Next, we see that the tot_coverage column is also a character. Gah!

Let’s fix that. We can use the mutate_at() function to do this. We are asking R to take tot_coverage column and convert it to an integer and then replace the old column with the new converted column

# A tibble: 1,456 x 6
   Location              year type     tot_coverage abb   region   
   <chr>                <int> <chr>           <int> <chr> <fct>    
 1 United States         2013 Employer    155696900 <NA>  <NA>     
 2 Alabama               2013 Employer      2126500 AL    South    
 3 Alaska                2013 Employer       364900 AK    West     
 4 Arizona               2013 Employer      2883800 AZ    West     
 5 Arkansas              2013 Employer      1128800 AR    South    
 6 California            2013 Employer     17747300 CA    West     
 7 Colorado              2013 Employer      2852500 CO    West     
 8 Connecticut           2013 Employer      2030500 CT    Northeast
 9 Delaware              2013 Employer       473700 DE    South    
10 District of Columbia  2013 Employer       324300 DC    South    
# … with 1,446 more rows

The coverage data looks good now. We see that there are different years and different types of healthcare coverage.

Also, you may want to link the coverage data with our location information.

# A tibble: 1,456 x 6
   Location              year type     tot_coverage abb   region   
   <chr>                <int> <chr>           <int> <chr> <fct>    
 1 United States         2013 Employer    155696900 <NA>  <NA>     
 2 Alabama               2013 Employer      2126500 AL    South    
 3 Alaska                2013 Employer       364900 AK    West     
 4 Arizona               2013 Employer      2883800 AZ    West     
 5 Arkansas              2013 Employer      1128800 AR    South    
 6 California            2013 Employer     17747300 CA    West     
 7 Colorado              2013 Employer      2852500 CO    West     
 8 Connecticut           2013 Employer      2030500 CT    Northeast
 9 Delaware              2013 Employer       473700 DE    South    
10 District of Columbia  2013 Employer       324300 DC    South    
# … with 1,446 more rows

(*) What is the range of years and types of healthcare in the coverage dataset?

              
               2013 2014 2015 2016
  Employer       52   52   52   52
  Medicaid       52   52   52   52
  Medicare       52   52   52   52
  Non-Group      52   52   52   52
  Other Public   52   52   52   52
  Total          52   52   52   52
  Uninsured      52   52   52   52

Implement separate() and unite() in the spending dataset

Next, we will look at the spending data. We see the year column has information that we do not want. We only care about the year.

# A tibble: 1,248 x 3
   Location             year                        tot_spending
   <chr>                <chr>                              <dbl>
 1 United States        1991__Total Health Spending       675896
 2 Alabama              1991__Total Health Spending        10393
 3 Alaska               1991__Total Health Spending         1458
 4 Arizona              1991__Total Health Spending         9269
 5 Arkansas             1991__Total Health Spending         5632
 6 California           1991__Total Health Spending        81438
 7 Colorado             1991__Total Health Spending         8460
 8 Connecticut          1991__Total Health Spending        10950
 9 Delaware             1991__Total Health Spending         1938
10 District of Columbia 1991__Total Health Spending         2800
# … with 1,238 more rows

Let’s use the separate() function with convert=TRUE to separate the year column into columns. Then, we introduce another dplyr action verb: select().

The two most basic functions are select() and filter() which selects columns and filters rows, respectively.

4. Select columns using select()

In the separate() function, we create two new columns called year and name. Then, we ask to return all the columns, except name. To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing).

# A tibble: 1,248 x 3
   Location              year tot_spending
   <chr>                <int>        <dbl>
 1 United States         1991       675896
 2 Alabama               1991        10393
 3 Alaska                1991         1458
 4 Arizona               1991         9269
 5 Arkansas              1991         5632
 6 California            1991        81438
 7 Colorado              1991         8460
 8 Connecticut           1991        10950
 9 Delaware              1991         1938
10 District of Columbia  1991         2800
# … with 1,238 more rows

The function select() is much more powerful though. To select a range of columns by name, use the “:” (colon) operator

# A tibble: 1,456 x 2
    year type    
   <int> <chr>   
 1  2013 Employer
 2  2013 Employer
 3  2013 Employer
 4  2013 Employer
 5  2013 Employer
 6  2013 Employer
 7  2013 Employer
 8  2013 Employer
 9  2013 Employer
10  2013 Employer
# … with 1,446 more rows

To select all columns that start with the character string “t”, use the function starts_with()

# A tibble: 1,456 x 2
   type     tot_coverage
   <chr>           <int>
 1 Employer    155696900
 2 Employer      2126500
 3 Employer       364900
 4 Employer      2883800
 5 Employer      1128800
 6 Employer     17747300
 7 Employer      2852500
 8 Employer      2030500
 9 Employer       473700
10 Employer       324300
# … with 1,446 more rows

Some additional options to select columns based on a specific criteria include

  1. ends_with() = Select columns that end with a character string
  2. contains() = Select columns that contain a character string
  3. matches() = Select columns that match a regular expression
  4. one_of() = Select columns names that are from a group of names

5. Select rows using filter()

Let’s say we want to know how many peopled had health insurance coverage in Maryland?

First, we can filter the rows for years in 2007.

# A tibble: 28 x 6
   Location  year type         tot_coverage abb   region
   <chr>    <int> <chr>               <int> <chr> <fct> 
 1 Maryland  2013 Employer          3172400 MD    South 
 2 Maryland  2013 Non-Group          320800 MD    South 
 3 Maryland  2013 Medicaid           889800 MD    South 
 4 Maryland  2013 Medicare           751500 MD    South 
 5 Maryland  2013 Other Public       124400 MD    South 
 6 Maryland  2013 Uninsured          682000 MD    South 
 7 Maryland  2013 Total             5940900 MD    South 
 8 Maryland  2014 Employer          3558800 MD    South 
 9 Maryland  2014 Non-Group          361700 MD    South 
10 Maryland  2014 Medicaid           807900 MD    South 
# … with 18 more rows

Note: you can use the Boolean operators (e.g. >, <, >=, <=, !=, %in%) to create logical tests.

For example, if we wanted only years after 2014, we can add a second criteria within filter():

# A tibble: 14 x 6
   Location  year type         tot_coverage abb   region
   <chr>    <int> <chr>               <int> <chr> <fct> 
 1 Maryland  2015 Employer          3431400 MD    South 
 2 Maryland  2015 Non-Group          371400 MD    South 
 3 Maryland  2015 Medicaid           856800 MD    South 
 4 Maryland  2015 Medicare           705500 MD    South 
 5 Maryland  2015 Other Public       141200 MD    South 
 6 Maryland  2015 Uninsured          394300 MD    South 
 7 Maryland  2015 Total             5900500 MD    South 
 8 Maryland  2016 Employer          3210600 MD    South 
 9 Maryland  2016 Non-Group          443000 MD    South 
10 Maryland  2016 Medicaid           926300 MD    South 
11 Maryland  2016 Medicare           827000 MD    South 
12 Maryland  2016 Other Public       153800 MD    South 
13 Maryland  2016 Uninsured          372100 MD    South 
14 Maryland  2016 Total             5932800 MD    South 

(*) Has the number of uninsured has increased or decreased in Maryland between 2013 and 2016?

# A tibble: 4 x 6
  Location  year type      tot_coverage abb   region
  <chr>    <int> <chr>            <int> <chr> <fct> 
1 Maryland  2013 Uninsured       682000 MD    South 
2 Maryland  2014 Uninsured       343000 MD    South 
3 Maryland  2015 Uninsured       394300 MD    South 
4 Maryland  2016 Uninsured       372100 MD    South 

What happened between 2013 and 2014?

Probably this is due to ACA

6. Arrange or re-order rows using arrange()

Now, let’s say we want to see the states ordered from lowest to highest tot_coverage.

To arrange (or re-order) rows by a particular column you’ll use the arrange() function:

# A tibble: 1,456 x 6
   Location       year type         tot_coverage abb   region       
   <chr>         <int> <chr>               <int> <chr> <fct>        
 1 Vermont        2013 Other Public         9900 VT    Northeast    
 2 Vermont        2014 Other Public         9900 VT    Northeast    
 3 Rhode Island   2013 Other Public        12100 RI    Northeast    
 4 Wyoming        2014 Other Public        13600 WY    West         
 5 Delaware       2013 Other Public        13800 DE    South        
 6 Vermont        2016 Other Public        14600 VT    Northeast    
 7 New Hampshire  2013 Other Public        15100 NH    Northeast    
 8 Wyoming        2016 Other Public        16400 WY    West         
 9 Vermont        2015 Other Public        16500 VT    Northeast    
10 North Dakota   2014 Other Public        17300 ND    North Central
# … with 1,446 more rows

(*) In 2016, what were the top three states with the largest Employer type of healthcare coverage?

Hint: use the desc() function inside of arrange() to order rows in a descending order.

# A tibble: 3 x 6
  Location    year type     tot_coverage abb   region   
  <chr>      <int> <chr>           <int> <chr> <fct>    
1 California  2016 Employer     18116200 CA    West     
2 Texas       2016 Employer     13607200 TX    South    
3 New York    2016 Employer      9767500 NY    Northeast

7. Join two datasets using join()

Here, we’re going to demonstrate how to join two datasets using series of join() function, including left_join(), right_join(), inner_join(), …

Up until now, we have been working with three datasets coverage and spending separately. Next, we will combine these together.

If we want to combine, say, coverage and spending together, we have to decide a few things. Both share a Location column and a year column. However, the range of years is different between datasets.


2013 2014 2015 2016 
 364  364  364  364 

1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 
  52   52   52   52   52   52   52   52   52   52   52   52   52   52   52   52 
2007 2008 2009 2010 2011 2012 2013 2014 
  52   52   52   52   52   52   52   52 

Do we want a dataset with all the years available or only a portion of it? Because there is spending information from 1991-2014, and coverage information from 2013-2016.

dplyr has a list of join functions that are useful to combine datasets. To read more about them, Jenny Bryan has created a nice cheatsheet.

If we look at the help file

We see there are several options for us to pick from. Let’s try one out. We’ll start with left_join() and see what that does.

# A tibble: 6 x 7
  Location       year type     tot_coverage abb   region tot_spending
  <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
2 Alabama        2013 Employer      2126500 AL    South         33788
3 Alaska         2013 Employer       364900 AK    West           7684
4 Arizona        2013 Employer      2883800 AZ    West          41481
5 Arkansas       2013 Employer      1128800 AR    South         20500
6 California     2013 Employer     17747300 CA    West         278168
# A tibble: 6 x 7
  Location       year type  tot_coverage abb   region        tot_spending
  <chr>         <int> <chr>        <int> <chr> <fct>                <dbl>
1 Vermont        2016 Total       622500 VT    Northeast               NA
2 Virginia       2016 Total      8175000 VA    South                   NA
3 Washington     2016 Total      7297300 WA    West                    NA
4 West Virginia  2016 Total      1814100 WV    South                   NA
5 Wisconsin      2016 Total      5766100 WI    North Central           NA
6 Wyoming        2016 Total       571700 WY    West                    NA

What did it do? We see that the new hc dataset includes all the years from 2013-2016 (as that is the range of years in coverage), but because the spending dataset only goes to 2014, the tot_spending is reported as NA for years 2015 and 2016.

What about a right_join()?

# A tibble: 6 x 7
  Location       year type     tot_coverage abb   region tot_spending
  <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
2 Alabama        2013 Employer      2126500 AL    South         33788
3 Alaska         2013 Employer       364900 AK    West           7684
4 Arizona        2013 Employer      2883800 AZ    West          41481
5 Arkansas       2013 Employer      1128800 AR    South         20500
6 California     2013 Employer     17747300 CA    West         278168
# A tibble: 6 x 7
  Location       year type  tot_coverage abb   region tot_spending
  <chr>         <int> <chr>        <int> <chr> <fct>         <dbl>
1 Vermont        2012 <NA>            NA <NA>  <NA>           5827
2 Virginia       2012 <NA>            NA <NA>  <NA>          58535
3 Washington     2012 <NA>            NA <NA>  <NA>          51443
4 West Virginia  2012 <NA>            NA <NA>  <NA>          16270
5 Wisconsin      2012 <NA>            NA <NA>  <NA>          46158
6 Wyoming        2012 <NA>            NA <NA>  <NA>           4518

Here, we see every row in the spending dataset is there, but with NAs for the years that there was no coverage data.

There is also a full_join() and inner_join(). If we want the intersection of years from coverage and spending (meaning only 2013 and 2014), we should use inner_join().

# A tibble: 6 x 7
  Location       year type     tot_coverage abb   region tot_spending
  <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
2 Alabama        2013 Employer      2126500 AL    South         33788
3 Alaska         2013 Employer       364900 AK    West           7684
4 Arizona        2013 Employer      2883800 AZ    West          41481
5 Arkansas       2013 Employer      1128800 AR    South         20500
6 California     2013 Employer     17747300 CA    West         278168
# A tibble: 6 x 7
  Location       year type  tot_coverage abb   region        tot_spending
  <chr>         <int> <chr>        <int> <chr> <fct>                <dbl>
1 Vermont        2014 Total       617000 VT    Northeast             6389
2 Virginia       2014 Total      8258800 VA    South                62847
3 Washington     2014 Total      7085000 WA    West                 55819
4 West Virginia  2014 Total      1825500 WV    South                17491
5 Wisconsin      2014 Total      5747200 WI    North Central        50109
6 Wyoming        2014 Total       572000 WY    West                  4856

Yes, that’s what we want!

Next, if we are only interested in looking at US states, we can remove the rows corresponding to the Location == "United States"

Another problem is that inside our hc dataset, we have seen there are multiple types of healthcare coverage.


    Employer     Medicaid     Medicare    Non-Group Other Public        Total 
         102          102          102          102          102          102 
   Uninsured 
         102 

The total type is not really a formal type of healthcare coverage. It really represents just the total number of people in the state. This is useful information and we can include it as a column called tot_pop. How can we do this?

Well, one way would be to use the join functions again in dplyr.

# A tibble: 102 x 3
   Location              year tot_coverage
   <chr>                <int>        <int>
 1 Alabama               2013      4763900
 2 Alaska                2013       702000
 3 Arizona               2013      6603100
 4 Arkansas              2013      2904800
 5 California            2013     38176400
 6 Colorado              2013      5297800
 7 Connecticut           2013      3578900
 8 Delaware              2013       909300
 9 District of Columbia  2013       652100
10 Florida               2013     19429000
# … with 92 more rows
# A tibble: 612 x 8
   Location         year type    tot_coverage abb   region  tot_spending tot_pop
   <chr>           <int> <chr>          <int> <chr> <fct>          <dbl>   <int>
 1 Alabama          2013 Employ…      2126500 AL    South          33788  4.76e6
 2 Alaska           2013 Employ…       364900 AK    West            7684  7.02e5
 3 Arizona          2013 Employ…      2883800 AZ    West           41481  6.60e6
 4 Arkansas         2013 Employ…      1128800 AR    South          20500  2.90e6
 5 California       2013 Employ…     17747300 CA    West          278168  3.82e7
 6 Colorado         2013 Employ…      2852500 CO    West           34090  5.30e6
 7 Connecticut      2013 Employ…      2030500 CT    Northe…        34223  3.58e6
 8 Delaware         2013 Employ…       473700 DE    South           9038  9.09e5
 9 District of Co…  2013 Employ…       324300 DC    South           7443  6.52e5
10 Florida          2013 Employ…      8023400 FL    South         150547  1.94e7
# … with 602 more rows

We can check to make sure that the total is no longer listed as a type of healthcare coverage.


    Employer     Medicaid     Medicare    Non-Group Other Public    Uninsured 
         102          102          102          102          102          102 

We are now ready to try answering our first question that we asked:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

Let’s pick out the type==Employer and year==2013.

We see there is a strong relationship. However, we also see that healthcare coverage and spending is also strongly related to population size

This means we need to take into account the population size of each state when we are comparing the healthcare coverage and spending.

8. Add columns using mutate()

Instead of the absolute number of people who are covered (tot_coverage), we will calculate the proportion of people who are coverage in each state, year and type.

For this, we will use the mutate() function in dplyr.

# A tibble: 612 x 9
   Location  year type  tot_coverage abb   region tot_spending tot_pop
   <chr>    <int> <chr>        <int> <chr> <fct>         <dbl>   <int>
 1 Alabama   2013 Empl…      2126500 AL    South         33788  4.76e6
 2 Alaska    2013 Empl…       364900 AK    West           7684  7.02e5
 3 Arizona   2013 Empl…      2883800 AZ    West          41481  6.60e6
 4 Arkansas  2013 Empl…      1128800 AR    South         20500  2.90e6
 5 Califor…  2013 Empl…     17747300 CA    West         278168  3.82e7
 6 Colorado  2013 Empl…      2852500 CO    West          34090  5.30e6
 7 Connect…  2013 Empl…      2030500 CT    North…        34223  3.58e6
 8 Delaware  2013 Empl…       473700 DE    South          9038  9.09e5
 9 Distric…  2013 Empl…       324300 DC    South          7443  6.52e5
10 Florida   2013 Empl…      8023400 FL    South        150547  1.94e7
# … with 602 more rows, and 1 more variable: prop_coverage <dbl>

We need to add another column to our dataset. We will add the spending per capita (or spending per person) in dollars and name this column spending_capita.

How we will do this?

The tot_spending column is reported in millions (1e6). Therefore, to calculate spending_capita we will need to adjust for this scaling factor to report it on the original scale (just dollars) and then divide by tot_pop.

Now we are ready to go back to our first question.

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

Yes, it looks like there is a relationship for Employer healthcare coverage in 2013.

We will continue to explore the other types of coverages later on. For now, we get back to to learning more action verbs in dplyr.

Our second question that we were interested in was:

  1. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?

To answer these questions, we need to learn how to calculate summary statistics in our data.

9. Create summaries of columns using summarize()

The summarize() function in dplyr will create summary statistics for a given column in the data frame such as finding the max, min, average. For example, to compute the average spending per capita, we can apply the mean() function to the column spending_captia and call the summary value avg_spending_capita.

# A tibble: 1 x 1
  avg_spending_capita
                <dbl>
1               8246.

There are many other summary statistics you could consider such sd(), min(), median(), mean(), sum(), n() (returns the length of vector), first() (returns first value in vector), last() (returns last value in vector) and n_distinct() (number of distinct values in vector).

Also note, this is the average across all states, and all years. This is not very informative.

If you recall, our question asked about which states spent the most, so we want an average spending per capita for each state.

For this, we need to introduce another function in dplyr called group_by().

10. Group operations using group_by()

The group_by() verb is and incredibly powerful function in dplyr. As we mentioned before it’s related to concept of “split-apply-combine”.

In our example above, we want to split the data frame by some variable (e.g. Location), apply a function to the individual data frames (mean) and then combine the output back into a summary data frame.

Let’s see how that would look

# A tibble: 51 x 2
   Location             avg_spending_capita
   <chr>                              <dbl>
 1 Alabama                            7244.
 2 Alaska                            11331.
 3 Arizona                            6397.
 4 Arkansas                           7324.
 5 California                         7416.
 6 Colorado                           6602.
 7 Connecticut                        9730.
 8 Delaware                          10127.
 9 District of Columbia              11698.
10 Florida                            7945.
# … with 41 more rows

That’s better. Here we are averaging across the years 2013 and 2014.

(*) What are the top 3 states that have the largest average spending per capita? What about the top 3 states with the smallest average spending per capita?

# A tibble: 3 x 2
  Location avg_spending_capita
  <chr>                  <dbl>
1 Utah                   5842.
2 Arizona                6397.
3 Georgia                6513.
# A tibble: 3 x 2
  Location             avg_spending_capita
  <chr>                              <dbl>
1 District of Columbia              11698.
2 Alaska                            11331.
3 Massachusetts                     10535.

(*) How does the spending distribution change across geographic regions in the United States?

Hint: Calculate the mean and standard deviation of spending per capita for each geographic region in the US.

# A tibble: 4 x 3
  region        avg_spending_capita sd_spending_capita
  <fct>                       <dbl>              <dbl>
1 North Central               8404.               541.
2 Northeast                   9592.               546.
3 South                       7994.              1273.
4 West                        7498.              1329.

Another way to visualize distributions is to use boxplots.

Create four boxplots representing the spending per capita distribution for each of the four regions using the boxplot() function in R.

Now that we have our data in a tidy format, next, we will learn about how to do this using the ggplot2 R package in the tidyverse.

Data Visualization

As you have already seen, there are many functions available in base R that can create plots (e.g. plot(), boxplot()). Others include: hist(), qqplot(), etc. These functions are great because they come with a basic installation of R and can be quite powerful when you need a quick visualization of something when you are exploring data.

We are choosing to introduce ggplot2 because, in our opinion, it’s one of the simplest ways for beginners to create relatively complicated plots that are intuitive and aesthetically pleasing.

The ggplot2 R package

The reasons ggplot2 is generally intuitive for beginners is the use of grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.

The critical part to making graphics using ggplot2 is the data needs to be in a tidy format. Given that we have just spend the last two lectures learning about how to work with tidy data, we are primed to take advantage of all that ggplot2 has to offer!

We will show how it’s easy to pipe tidy data (output) as input to other functions that creates plots. This all works because we are working within the tidyverse.

ggplot2 cheatsheet

The cheatsheet looks like the following:

1. What is the ggplot() function?

As explained by Hadley Wickham:

the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

ggplot2 Terminology

  • ggplot - the main function where you specify the data set and variables to plot (this is where we define the x and y variable names)
  • geoms - geometric objects
    • e.g. geom_point(), geom_bar(), geom_line(), geom_histogram()
  • aes - aesthetics
    • shape, transparency, color, fill, linetype
  • scales - define how your data will be plotted
    • continuous, discrete, log, etc

There are three ways to initialize a ggplot() object.

An empty ggplot object

A ggplot object associated with a dataset

or a ggplot object with a dataset and x and y defined

2. Create scatter plots using geom_point()

The function aes() is an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.

Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.

If you recall, our first question that we were interested in was

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

Before, we were using base R to create something like this:

Let’s re-create this plot with ggplot2 using the geom_point() geometry.

We used the xlab() and ylab() functions in ggplot2 to specify the x-axis and y-axis labels.

Note, we do not have to assign (<-) the plot to anything:

It’s also simple to fit a linear regression model and plot it on top of scatter plot using the geom_smooth() (or stat_smooth()) functions.

The standard error bounds are computed and included in the plot.

It would be nice to know which state is represented by which state. For this, we will introduce another geom called geom_text().

3. Add layers of text using geom_text()

In our dataset, we have information about the abbreviation for each state. We could add the abbreviations for each state next to the point on the plot to assess which states have a higher or lower coverage for a given amount of money they spend per capita.

That is cool, but it would be even better if we could nudge the text over a bit. Let’s look at the help file for geom_text():

We see there is an argument called nudge_x and nudge_y. We can use these to nudge the text over a bit so the text is not directly on top of the points.

4. Facet across a variable using facet_wrap

Ok, getting back to our original question:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

We saw there was a positive relationship, but this was only for one type of healthcare coverage (Employer) and one year. What about the other types?

For this, we will introduce facets. The idea of faceting is to stratify the data by some variable and make the same plot for each strata.

For example, if we wanted to facet by the type variable, we will add a layer to our ggplot() object using the facet_grid() or facet_wrap() functions. The function expects the row and column variables to be separated by a ~.

We see that the proportion of people covered have different scales in the y-axis. Let’s read the help file to see if there is some way to not restrict the y-axis to be the same.

Yes, we see there is an argument called scales that can be free_y, (free columns), free_x (free rows), and free (both). Let’s try free_y and look at a different year (year=="2014"):

Given we know Other Public refers to the military or Veterans Administration, we can see states like HI, VA, NV have a larger proportion of military or VA Other Public type coverage. While a state like AK has a similar proportion of Other Public coverage, it has a much larger spending per capita.

We also see a negative relationship with the Uninsured type. The more states spend, the less uninsured people in the state.

5. Create boxplots using geom_boxplot()

Next, let’s revisit the second question.

  1. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?

Let’s try making a boxplot with ggplot2. If you recall, the way to do this in base R was:

Now, we introduce the geom_boxplot() function. Note, we needed to tell ggplot2 what needs to be along the x and y axis in aes().

6. Facet by two variables using facet_grid

  1. Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?

Let’s try faceting by both year and type. Note that we can facet by rows putting a column name before the ~ and facet by columns putting a column name after the ~. We are also using facet_grid() instead of facet_wrap().

Summary

The total healthcare expenditure is associated with the population. To make a fair comparison, we create “healthcare expenditure per capita.” Further, the exploratory analysis via data visualization showed higher spending in healthcare per capita is positively associated with higher employer coverage proportion and is negatively associated with the proportion of uninsured population across the States.

---
title: "Open Case Studies: Exploring health expenditure using state-level data in the United States"
output:
  html_document:
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('./img/SummaryPlot.png')

```

# Motivation  

Health policy in the United States is complicated, and several 
forms of healthcare coverage exist, including both coverage by federal 
goverment-led healthcare policy, and by private insurance companies. Before making any inference about 
the relationship between health condition and health policy, it is important for us to 
have a general idea about healthcare economics in the United States. Thus, we are interested in 
getting sense of healthcare coverage and healthcare spending across States. More specifically, the questions are:  

1. Is there a relationship between healthcare coverage and healthcare spending in the United States?   
2. How does the spending distribution change across geographic regions in the United States?  
3. Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?  

In this case study, we'll walk you through collecting data, 
importing data, cleaning data, wrangling data, and visualizing the data, using 
well-established and commonly used packages, including `datasets`, `tidyr`, `dplyr`, `ggplot2`, and `ggrepel`.    
 
# What is the data?    

```{r out.width = "95%", echo = FALSE, out.width='90%'}
knitr::include_graphics("https://aspe.hhs.gov/system/files/images-reports-basic/70441/fig1.jpg")
```
[Image source from US Department of Health and Human Services](https://aspe.hhs.gov/basic-report/overview-uninsured-united-states-summary-2011-current-population-survey)

## Healthcare data

We will be using the data from the [Henry J Kaiser Family Foundation (KFF)](https://www.kff.org). 

* [Health Insurance Coverage of the Total Population](https://www.kff.org/other/state-indicator/total-population/) - Includes years 2013-2016
* [Health Care Expenditures by State of Residence (in millions)](https://www.kff.org/other/state-indicator/health-care-expenditures-by-state-of-residence-in-millions/) - Includes years 1991-2014
 
We have downloaded, re-named and saved these files in the 
[GitHub repository](https://github.com/opencasestudies/ocs-healthexpenditure) under the `data/KFF/` directory. 

Now, before we dig into the data analysis, we need to introduce a set of R packages that we will use to analyze the data. 


# Data Import 
 
## Introduction to "Tidy data"

The [tidyverse](https://www.tidyverse.org) is _"an opinionated 
collection of R packages designed for data science. All packages 
share an underlying philosophy and common APIs."_ 

Another way of putting it is that it's a set of packages 
that are useful specifically for data manipulation, 
exploration and visualization with a common philosophy. 

#### What is this common philosophy? 

The common philosophy is called _"tidy"_ data. It is 
a standard way of mapping the meaning of a dataset
to its structure.

In _tidy_ data:

* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.

```{r out.width = "95%", echo = FALSE}
knitr::include_graphics("http://r4ds.had.co.nz/images/tidy-1.png")
```

Below, we are interested in transforming the table on 
the right to the the table on the left, which is 
considered "tidy". 

```{r out.width = "95%", echo = FALSE}
knitr::include_graphics("http://r4ds.had.co.nz/images/tidy-9.png")
```

Working with tidy data is useful because it creates a structured way of
organizing data values within a data set. This makes the data analysis 
process more efficient and simplifies the development of data analysis tools
that work together. In this way, you can focus on the problem you are
investigating, rather than the uninteresting logistics of data.  

### 1. What is in the `tidyverse`? 

We can install and load the set of R packages using 
`install.packages("tidyverse")` function. 

When we load the tidyverse package using `library(tidyverse)`, 
there are six core R packages that load:

* [readr](http://readr.tidyverse.org), for data import.
* [tidyr](http://tidyr.tidyverse.org), for data tidying.
* [dplyr](http://dplyr.tidyverse.org), for data wrangling.
* [ggplot2](http://ggplot2.tidyverse.org), for data visualisation.
* [purrr](http://purrr.tidyverse.org), for functional programming.
* [tibble](http://tibble.tidyverse.org), for tibbles, a modern re-imagining of data frames.

Here, we load in the tidyverse. 
```{r, message=FALSE}
library(tidyverse)
```

These packages are highlighted in bold here: 

```{r out.width = "95%", echo = FALSE}
knitr::include_graphics("https://rviews.rstudio.com/post/2017-06-09-What-is-the-tidyverse_files/tidyverse1.png")
```

Because these packages all share the "tidy" philosophy, 
the data analysis workflow is easier as you move from 
package to package. 

Here, we will focus on the `readr`,
`tidyr` and `dplyr` R packages to import data, 
to transform data to the "tidy" format, 
and to wrangle data. 

Next, we will give a brief description of the 
features in each of these packages. 

There are several base R functions that allow you 
read in data into R, which you may be familiar 
with such as `read.table()`, `read.csv()`, 
and `read.delim()`. Instead of using these, 
we will use the functions in the 
[readr](https://readr.tidyverse.org/articles/readr.html)
R package. The main reasons for this are 

1. Compared to equivalent base R functions, the 
functions in `readr` are around 10x faster. 
2. You can specify the column types (e.g 
character, integer, double, logical, date, 
time, etc)
3. All parsing problems are recorded in 
a data frame. 

## Read data using the `readr` R package  

```{r, message=FALSE}
library(readr)
```

The main functions in `readr` are: 

`readr` functions | Description |
--- | ---------------------------------------------------------------------------------------- |
`read_delim()` | reads in a flat file data with a given character to separate fields |
`read_csv()` | reads in a CSV file |
`read_tsv()` | reads in a file with values separated by tabs |
`read_lines()` | reads only a certain number of lines from the file |
`read_file()` | reads a complete file into a string |
`write_csv()` | writes data frame to CSV | 

A useful cheatsheet for the functions in the
`readr` package can be found on RStudio's website: 

![](https://www.rstudio.com/wp-content/uploads/2018/08/data-import.png)



### 1. Read in data 

#### Read in health  healthcare coverage data
Let's try reading in some data. We will begin by
reading in the `healthcare-coverage.csv` data. 

If we want to see what the header of the file looks like, 
we can use the `read_lines()` function to peak at the 
first few lines. 

```{r warning=FALSE,message=FALSE}
read_lines(file = "./data/KFF/healthcare-coverage.csv", n_max = 10)
```

It looks like the first two lines are descriptive 
and are not useful. We will tell R to skip reading 
these in using the `skip` argument in 
`read_csv()`. The third line looks like it contains the
column names and starting on the fourth line is
where the data starts. 

```{r, message=FALSE}
coverage <- read_csv("./data/KFF/healthcare-coverage.csv", 
                     skip = 2, col_names = TRUE)
head(coverage)
tail(coverage)
```

It looks like we now have the right header, but
there are a bunch of NAs in the end of the data 
frame because most of it isn't useful data. 

Let's take a closer look at the last 30 lines
```{r warning=FALSE,message=FALSE}
tail(coverage, n=30)
```

It looks like there is a line with a string 
`Notes` in it and everything below that line
should not be read in. We can use the `n_max` 
argument here.

```{r, message=FALSE}
coverage <- read_csv("./data/KFF/healthcare-coverage.csv", 
                     skip = 2, col_names = TRUE)
coverage <- read_csv("./data/KFF/healthcare-coverage.csv", 
                     skip = 2, col_names = TRUE,
                     n_max  = which(coverage$Location == "Notes")-1)
tail(coverage)
```

That's better! 

#### Read in healthcare spending data

Now because we are also going to want to 
use in `healthcare-spending.csv`, let's 
read it in now. 

```{r, message=FALSE}
spending <- read_csv("./data/KFF/healthcare-spending.csv", 
                     skip = 2, col_names = TRUE)
spending <- read_csv("./data/KFF/healthcare-spending.csv", 
                     skip = 2, col_names = TRUE,
                     n_max  = which(spending$Location == "Notes")-1)
tail(spending)
```


### 2. Take a `glimpse()` at your data

One last thing in this section. 
One way to look at our data would be to use 
`head()` or `tail()`, as we just saw. 
Another one you might have heard of is the
`str()` function. One you might not have 
heard of is the `glimpse()` function. It's
used for a special type of object in R called 
a `tibble`. Let's read the help file to learn
more. 

```{r, eval=FALSE}
?tibble::tibble
```

It's kind of like `print()` where it shows you 
columns running down the page. Let's try it out. 
If we look at our data, say the `coverage` 
data frame, we see that it is not _"tidy"_: 
```{r warning=FALSE,message=FALSE}
glimpse(coverage)
```

## Read the State information using the `datasets` R package 

Since our goal is to get sense of the health expenditure, including healthcare coverage and 
healthcare spending, **across States**, 
it would be nice add some information about each state. 
Namely, the state abbreviation and state region 
(i.e. north, south, etc). 

For this we use the 
[state](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/state.html)
dataset in the `datasets` R package. 

Before we begin, let's look at what states are there: 

```{r}
unique(coverage$Location)
```

We see there are more than 50 states because 
"United States" and "District of Columbia" 
are both included. 

Let's look what states are inside the `state` dataset. 
```{r}
library(datasets)
data(state)
unique(state.name)
```

Ah, ok. So let's start by dealing with
DC as a special case. 

```{r}
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South"))
state.name <- c(state.name, "District of Columbia")
```

We will deal with the "United States" in 
the next section. 

# Data Wrangling 

## What is "Tidy Data"?  

#### Glance at "Tidy Data"
A subset of the data analysis process can be thought
about in the following way:

```{r out.width = "95%", echo = FALSE}
knitr::include_graphics("http://r4ds.had.co.nz/diagrams/data-science.png")
```

where each of these steps needs its own 
tools and software to complete. 

After we import the data into R, if we are 
going to take advantage of the _"tidyverse"_, 
this means we need to _transform_ the data 
into a form that is _"tidy"_. If you recall, 
in _tidy_ data:

* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.

For example, consider the following dataset: 

![](https://github.com/datasciencelabs/2016/raw/master/lectures/wrangling/pics/stocks-by-company.png)

Here:  

* each row represents one company (row names are companies)
* each column represent one time point
* the stock prices are defined for each row/column pair

Alternatively, a data set can be structured in the following way:

* each row represents one time point (but no row names)
* the first column defines the time variable and the last three columns contain the stock prices for three companies 

![](https://github.com/datasciencelabs/2016/raw/master/lectures/wrangling/pics/stocks-by-time.png)

In both cases, the data is the same, but the structure is 
different. This can be  _frustrating_ to deal with as an 
analyst because the meaning of the values (rows and columns)
in the two data sets are different. Providing a standardized 
way of organizing values within a data set would alleviate 
a major portion of this frustration.  

For motivation, a _tidy_ version of the stock data we 
looked at above looks like this: (we'll learn how the
functions work in just a moment)

![](https://github.com/datasciencelabs/2016/raw/master/lectures/wrangling/pics/stocks-tidy.png)

In this "tidy" data set, we have three columns representing 
three variables (time, company name and stock price). 
Every row represents contains one stock price from a 
particular time and for a specific company. 

If we consider our `coverage` dataframe, we see it 
is also not in a tidy format. Each row contains information
about the coverage level by `Location` across years and 
types of coverage. 

```{r}
coverage[1:5, 1:5]
```

Now, let's use the `tidyr` R package to transform
our data into a _tidy_ format. 

## The `tidyr` R package

### 1. What is the `tidyr` R package ?

[`tidyr`](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
is an R package that transforms data sets to a tidy format. 

This package is installed and loaded when you load 
the `tidyverse` using `library(tidyverse)`. However, 
you can also just load the library by itself. 

```{r, message=FALSE}
library(tidyr)
```

The main functions in `tidyr` are: 

`tidyr` functions | Description |
--- |  ---------------------------------------------------------------------------------------- |
`gather()` | takes multiple columns, and gathers them into key-value pairs, making "wide" data longer |
`separate()` | turns a single character column into multiple columns, making "long" data wider |
`spread()` | spread rows into multiple columns, transforming "long" data into "wide" format | 

We'll explore what it means to go between a "wide" and "long" 
data format using `gather()` , `separate()`, and `spread()`. 

A [`tidyr` cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
for the functions in the `tidyr` package can be 
found on RStudio's website: 

### 2. Convert data from wide format to long format using `gather()` 

Let's start by looking at the `gather()` help file

```{r, eval=FALSE}
?gather
```

This function gathers multiple columns and collapses them into new 
*key-value* pairs. This transform data from _wide_ format into 
a _long_  format. 

* The `key` is the name of the _new_ column that you are creating which 
contains the values of the column headings that you are gathering 
* The `value` is the name of the _new_ column that will contain the values
themselves
* The third argument defines the columns to gather

For example, here we create a column titled 
`year_type` and `coverage`. We also want to keep 
the `Location` column as it is because it also contains
observational level data.

```{r}
coverage <- gather(coverage, "year_type", "tot_coverage", -Location)
coverage
```

Now we see each row contains one observation. 
Namely, a `Location`, a `year_type` and `coverage`. 
It would be nice to separate out the information 
in the `year_type` column into two columns. We can 
implement same techniques to the healthcare spending 
dataset. 

#### Convert healthcare spending data to a long format (tidy format)

Let's do the same for the `spending` data. In this 
case I will use `year` and `spending` for
the `key` and `value`. We also want to keep `Location`
like before. 

```{r}
spending <- gather(spending, "year", "tot_spending", -Location)
spending
```


We will 
explore how to do that in the Data Wrangling section
below. For now let's learn more about the `tidyr` 
package. 

### 3. Convert data from long format to wide format using `spread()` 

In contrast to *gathering* multiple columns into key-value pairs, we can 
*spread* a key-value pair across multiple columns.  

The function `spread()` does just that. It transforms data from a _long_
format into a _wide_ format. 

* The `key` is the name of the column in your data set that 
contains the values of the column headings that you are spreading across 
multiple columns
* The `value` is the name of the column that contains the values for the 
multiple columns


```{r}
spread(coverage, year_type, tot_coverage)
```


In the real world, analyzing data rarely involves 
data that can be easily imported and ready for 
analysis. According to Wikipedia:

> Data munging or data wrangling is loosely the process 
of manually converting or mapping data from one "raw" 
form into another format that allows for more convenient 
consumption of the data with the help of semi-automated 
tools.

As you may see in class or here from data scientists on Twitter, 
one of the most time-consuming aspects of the data analysis 
process is "data wrangling". This is also 
is a trendy term for 
_cleaning up a messy data set_. 

R provides incredibly powerful and flexible language 
for data wrangling. However, the syntax is somewhat 
hard to get used to. We will therefore introducing 
a package that makes the syntax much more like 
the English language. This package is `dplyr`. 

## The `dplyr` R package

### 1. What is the `dplyr` R package ? 

[`dplyr`](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) 
is a powerful R-package to transform and summarize 
tabular data with rows and columns. 

The package contains a set of functions 
(or "verbs") to perform common data manipulation
operations such as filtering for rows, selecting 
specific columns, re-ordering rows, adding new 
columns and summarizing data. 

In addition, `dplyr` contains a useful function to
perform another common task which is the is the 
"split-apply-combine" concept.  We will discuss 
that in a little bit. 

### 2. Compare `dplyr` R package compare with base functions R 

If you are familiar with R, you are probably familiar 
with base R functions such as `split()`, `subset()`, 
`apply()`, `sapply()`, `lapply()`, `tapply()` and 
`aggregate()`. Compared to base functions in R, the 
functions in `dplyr` are easier to work with, are 
more consistent in the syntax and are targeted for 
data analysis around data frames instead of just vectors. 

The important `dplyr` verbs to remember are: 


`dplyr` verbs | Description |
--- | ------------------------------------------------------------------------------------ |
`select()` | select columns  |
`filter()` | filter rows |
`arrange()` | re-order or arrange rows |
`mutate()` | create new columns |
`summarize()` | summarize values |
`group_by()` | allows for group operations in the "split-apply-combine" concept |


### 3. Pipe operator: %>%

Before we go any further, let's introduce the 
pipe operator: `%>%`. In our `stocks` example,
we briefly saw this symbol. It is called the
pipe operator. `dplyr` imports
this operator from another package 
(`magrittr`)
[see help file here](http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html). 
This operator allows you to pipe the output 
from one function to the input of another
function. Instead of nesting functions 
(reading from the inside to the 
outside), the idea of of piping is to 
read the functions from left to right. 

Now in `stocks` example, we pipe the `stocks`
data frame to the function that will 
gather multiple columns into key-value pairs. 

![](https://github.com/datasciencelabs/2016/raw/master/lectures/wrangling/pics/stocks-tidy.png)




####  `dplyr` verbs in action: `separate()`, `unite()`, ...

First, let's separate the `year_type` column 
in the `coverage` dataset to two columns:
`year` and health coverage `type`. 

To do this, we will use the `separate()` 
function in the `tidyr` package. 

**Note**: 

* `separate()` = separate one column into multiple columns
* `unite()` = unite multiple columns into one

#### Learn `separate()` and `unite()` in the `spending` dataset

```{r}
coverage %>% 
  separate(year_type, sep="__", 
           into=c("year", "type"))
```

We see that we now have two columns, except 
the `year` column was converted to a character. 
If we look at the help file `?separate`, we see
we can use the `convert=TRUE` argument to 
convert the character to an integer. 

```{r}
coverage <- 
  coverage %>% 
  separate(year_type, sep="__", 
           into=c("year", "type"), 
           convert = TRUE)
coverage
```

Next, we see that the `tot_coverage` column is 
also a character. Gah! 

Let's fix that. We can use the `mutate_at()` 
function to do this. We are asking R to take
`tot_coverage` column and convert it to an
integer and then replace the old column with 
the new converted column 

```{r}
coverage <- 
  coverage %>% 
  mutate_at("tot_coverage", as.integer)

# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]

coverage
```

The `coverage` data looks good now. We see 
that there are different `year`s and different 
`types` of healthcare coverage. 

Also, you may want to link the coverage data with our location information. 
```{r}
# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]

coverage
```

#### (*) What is the range of years and types of healthcare in the `coverage` dataset?   

```{r}
table(coverage$type, coverage$year)
```


#### Implement  `separate()` and `unite()` in the `spending` dataset      

Next, we will look at the `spending` data. 
We see the `year` column has information that 
we do not want. We only care about the year. 

```{r}
spending
```

Let's use the `separate()` function with `convert=TRUE` 
to separate the `year` column into columns. Then, we 
introduce another `dplyr` action verb: `select()`. 

The two most basic functions are `select()` and 
`filter()` which selects columns and filters 
rows, respectively. 

### 4. Select columns using `select()`

In the `separate()` function, we create two
new columns called `year` and `name`. Then, 
we ask to return all the columns, except 
`name`. To select all the columns *except* a 
specific column, use the "-" (subtraction) operator 
(also known as negative indexing). 

```{r warning=FALSE,message=FALSE}
spending <- 
  spending %>% 
  separate(year, sep="__", into=c("year", "name"), convert = TRUE) %>% 
  select(-name)
spending
```

The function `select()` is much more 
powerful though. To select a range 
of columns by name, use the ":" (colon) operator

```{r warning=FALSE,message=FALSE}
coverage %>% 
  select(year:type)
```

To select all columns that start with the 
character string "t", use the function `starts_with()`

```{r warning=FALSE,message=FALSE}
coverage %>% 
  select(starts_with("t"))
```

Some additional options to select columns based 
on a specific criteria include

1. `ends_with()` = Select columns that end with 
a character string
2. `contains()` = Select columns that contain 
a character string
3. `matches()` = Select columns that match a 
regular expression
4. `one_of()` = Select columns names that are 
from a group of names


### 5. Select rows using `filter()`

Let's say we want to know how many peopled 
had health insurance coverage in Maryland? 

First, we can filter the rows for years in 2007. 

```{r warning=FALSE,message=FALSE}
coverage %>% 
  filter(Location == "Maryland")
```

**Note**: you can use the Boolean operators 
(e.g. `>`, `<`, `>=`, `<=`, `!=`, `%in%`) 
to create logical tests.

For example, if we wanted only years 
after 2014, we can add a second criteria within `filter()`: 

```{r warning=FALSE,message=FALSE}
coverage %>% 
  filter(Location == "Maryland", 
         year > 2014)
```

#### (*) Has the number of uninsured has increased or decreased in Maryland between 2013 and 2016?  

```{r warning=FALSE,message=FALSE}
coverage %>% 
  filter(Location == "Maryland", 
         type == "Uninsured")
```

What happened between 2013 and 2014?   

[Probably this is due to ACA](https://en.wikipedia.org/wiki/Patient_Protection_and_Affordable_Care_Act)

### 6. Arrange or re-order rows using `arrange()` 

Now, let's say we want to see the states ordered from 
lowest to highest `tot_coverage`.

To arrange (or re-order) rows by a particular 
column you'll use the `arrange()` function:

```{r warning=FALSE,message=FALSE}
coverage %>% 
    arrange(tot_coverage)
```


#### (*) In 2016, what were the top three states with the largest `Employer` type of healthcare coverage? 

**Hint**: use the `desc()` function inside of
`arrange()` to order rows in a descending order. 

```{r warning=FALSE,message=FALSE}
coverage %>% 
  filter(Location != "United States", year == 2016, type == "Employer") %>% 
  arrange(desc(tot_coverage)) %>% 
  head(n=3)
```


### 7. Join two datasets using `join()`  


Here, we're going to demonstrate how to 
join two datasets using series of `join()` function, including 
`left_join()`, `right_join()`, `inner_join()`, ...  

Up until now, we have been working with three
datasets `coverage` and `spending` separately. 
Next, we will combine these together.  

If we want to combine, say, `coverage` and `spending` 
together, we have to decide a few things. Both 
share a `Location` column and a `year` column. 
However, the range of `years` is different between
datasets. 

```{r}
table(coverage$year)
table(spending$year)
```

Do we want a dataset with all the years available
or only a portion of it? Because there is spending
information from 1991-2014, 
and coverage information from 2013-2016.

`dplyr` has a list of `join` functions that are 
useful to combine datasets. To read more about them, 
Jenny Bryan has created a nice 
[cheatsheet](http://stat545.com/bit001_dplyr-cheatsheet.html). 

If we look at the help file
```{r}
?dplyr::join
```

We see there are several options for us to pick from. 
Let's try one out. We'll start with `left_join()`
and see what that does. 

```{r}
hc <- left_join(coverage, spending, by = c("Location", "year"))
head(hc)
tail(hc)
```

What did it do? We see that the new `hc` dataset
includes all the years from 2013-2016 (as that 
is the range of years in `coverage`), but because
the `spending` dataset only goes to 2014, the 
`tot_spending` is reported as NA for years 2015 and
2016. 

#### What about a `right_join()`? 
```{r}
hc <- right_join(coverage, spending, by = c("Location", "year"))
head(hc)
tail(hc)
```

Here, we see every row in the spending dataset is 
there, but with NAs for the years that there was no 
coverage data. 

There is also a `full_join()` and 
`inner_join()`. If we want the intersection of 
`years` from `coverage` and `spending` (meaning only 
2013 and 2014), we should use `inner_join()`. 

```{r}
hc <- inner_join(coverage, spending, by = c("Location", "year"))
head(hc)
tail(hc)
```

Yes, that's what we want! 

Next, if we are only interested in looking at US 
states, we can remove the rows corresponding to 
the `Location == "United States"`

```{r}
hc <- hc %>% 
  filter(Location != "United States")
```

Another problem is that inside our `hc` 
dataset, we have seen there are 
multiple `types` of healthcare coverage.

```{r}
table(hc$type)
```

The `total` type is not really a formal type of
healthcare coverage. It really represents just 
the total number of people in the state. This is 
useful information and we can include it as a 
column called `tot_pop`. How can we do this? 

Well, one way would be to use the `join` functions
again in `dplyr`. 

```{r}
pop <- hc %>% 
  filter(type == "Total") %>% 
  select(Location, year, tot_coverage)
pop

hc <- hc %>% 
  filter(type != "Total") %>% 
  left_join(pop, by = c("Location", "year")) %>% 
  rename(tot_coverage = tot_coverage.x, tot_pop = tot_coverage.y)
hc
```

We can check to make sure that the `total`
is no longer listed as a `type` of healthcare
coverage. 

```{r}
table(hc$type)
```

We are now ready to try answering our first 
question that we asked: 

> 1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

Let's pick out the `type==Employer` 
and `year==2013`. 

```{r}
hc.employer.2013 <- hc %>%
  filter(type == "Employer", year == "2013")
plot(hc.employer.2013$tot_spending, 
     hc.employer.2013$tot_coverage, log = "xy", 
     xlab = "spending", ylab = "coverage")
```

We see there is a strong relationship. However, 
we also see that healthcare coverage and spending 
is also strongly related to population size 

```{r, fig.width=10, fig.height=4}
par(mfrow=c(1,2))
plot(hc.employer.2013$tot_pop, 
     hc.employer.2013$tot_coverage, log = "xy", 
     xlab = "population size", ylab = "coverage")
plot(hc.employer.2013$tot_pop, 
     hc.employer.2013$tot_spending, log = "xy", 
     xlab = "population size", ylab = "spending")
```

This means we need to take into account the 
population size of each state when we are 
comparing the healthcare coverage and spending. 

### 8. Add columns using `mutate()`

Instead of the absolute number of people who 
are covered (`tot_coverage`), we will calculate
the proportion of people who are coverage in 
each state, year and type. 

For this, we will use the `mutate()` function 
in `dplyr`. 

```{r}
hc <- hc %>% 
    mutate(prop_coverage = tot_coverage/tot_pop) 
hc
```

We need to add another column to our dataset.
We will add the spending per capita (or spending 
per person) in dollars and name this column 
`spending_capita`. 

#### How we will do this?

The `tot_spending` column is reported 
in millions (1e6). Therefore, to calculate 
`spending_capita` we will need to adjust for this
scaling factor to report it on the original scale
(just dollars) and then divide by `tot_pop`.

```{r}
hc <- hc %>% 
  mutate(spending_capita = (tot_spending*1e6) / tot_pop)
```

Now we are ready to go back to our first question.

> 1. Is there a relationship between healthcare coverage and healthcare spending in the United States? 

```{r}
hc.employer.2013 <- hc %>%
  filter(type == "Employer", year == "2013")
plot(hc.employer.2013$spending_capita, 
     hc.employer.2013$prop_coverage, log = "xy", 
     xlab = "spending per capita", 
     ylab = "proportion of Employer coverage")
```

Yes, it looks like there is a relationship for 
`Employer` healthcare coverage in 2013. 

We will continue to explore the other
types of coverages later on. For now, we 
get back to to learning more action verbs 
in `dplyr`. 

Our second question that we were interested 
in was: 

> 2. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?

To answer these questions, we need to learn how 
to calculate summary statistics in our data. 

### 9.  Create summaries of columns using `summarize()`

The `summarize()` function in `dplyr` 
will create summary statistics for a given 
column in the data frame 
such as finding the max, min, average. 
For example, to compute the average spending 
per capita, we can apply the `mean()` function 
to the column `spending_captia` and call the 
summary value `avg_spending_capita`. 

```{r}
hc %>% 
  summarize(avg_spending_capita = mean(spending_capita))
```

There are many other summary statistics you 
could consider such `sd()`, `min()`, `median()`, 
`mean()`, `sum()`, `n()` (returns the length of vector), 
`first()` (returns first value in vector), 
`last()` (returns last value in vector) and 
`n_distinct()` (number of distinct values in vector). 

Also note, this is the average across all states,
and all years. This is not very informative. 

If you recall, our question asked about 
_which states_ spent the most, so we want 
an average spending per capita for each state. 

For this, we need to introduce another function in 
`dplyr` called `group_by()`. 

### 10. Group operations using `group_by()`

The `group_by()` verb is and incredibly powerful
function in `dplyr`. As we mentioned before
it's related to concept of "split-apply-combine". 

In our example above, we want to split the data 
frame by some variable (e.g. `Location`), 
apply a function to the individual 
data frames (`mean`) and then combine the output
back into a summary data frame. 

Let's see how that would look

```{r}
hc %>% 
  group_by(Location) %>%
  summarize(avg_spending_capita = mean(spending_capita))
```

That's better. Here we are averaging across the
years 2013 and 2014. 

#### (*) What are the top 3 states that have the largest average spending per capita? What about the top 3 states with the smallest average spending per capita? 

```{r}
# smallest 
hc %>% 
  group_by(Location) %>%
  summarize(avg_spending_capita = mean(spending_capita)) %>% 
  arrange(avg_spending_capita) %>% 
  head(n=3)

# largest 
hc %>% 
  group_by(Location) %>%
  summarize(avg_spending_capita = mean(spending_capita)) %>% 
  arrange(desc(avg_spending_capita)) %>% 
  head(n=3)
```

#### (*) How does the spending distribution change across geographic regions in the United States? 

**Hint**: Calculate the mean and standard deviation of
spending per capita for each geographic region in the US. 

```{r}
hc %>% 
  group_by(region) %>%
  summarize(avg_spending_capita = mean(spending_capita), 
            sd_spending_capita = sd(spending_capita))
```

Another way to visualize distributions is to use boxplots. 

Create four boxplots representing the spending per capita 
distribution for each of the four regions using 
the `boxplot()` function in R.

```{r}
boxplot(hc$spending_capita ~ hc$region)
```

Now that we have our data in a `tidy` format, next, 
we will learn about how to do this using the 
`ggplot2` R package in the `tidyverse`. 

# Data Visualization  

As you have already seen, there are many functions available
in base R that can create plots (e.g. `plot()`, `boxplot()`). 
Others include: `hist()`, `qqplot()`, etc. These 
functions are great because they come with a basic installation 
of R and can be quite powerful when you need a quick visualization 
of something when you are exploring data. 

We are choosing to introduce `ggplot2` because, in our 
opinion, it's one of the simplest ways for beginners to 
create relatively complicated plots that are intuitive 
and aesthetically pleasing. 

## The `ggplot2` R package

The reasons [`ggplot2`](http://ggplot2.tidyverse.org) 
is generally intuitive for beginners is the use of 
[grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html) 
or the `gg` in `ggplot2`. The idea is that you can construct
many sentences by learning just a few nouns, adjectives,
and verbs. There are specific "words" that we will need to 
learn and once we do, you will be able to create 
(or "write") hundreds of different plots. 

The critical part to making graphics using `ggplot2` is the 
data needs to be in a _tidy_ format. Given that we have 
just spend the last two lectures learning about how to 
work with _tidy_ data, we are primed to take 
advantage of all that `ggplot2` has to offer! 

We will show how it's easy to pipe _tidy_ data
(output) as input to other functions that creates
plots. This all works because we are working 
within the _tidyverse_. 

#### `ggplot2` cheatsheet

The [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) 
looks like the following:
```{r, echo=FALSE}
knitr::include_graphics("./doc/ggplot2cheatsheet.PNG")

```


### 1. What is the `ggplot()` function? 

As explained by Hadley Wickham: 

> the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

#### `ggplot2` Terminology 
* **ggplot** - the main function where you specify the data set and variables to plot (this is where we define the `x` and
`y` variable names)
* **geoms** - geometric objects
    * e.g. `geom_point()`, `geom_bar()`, `geom_line()`, `geom_histogram()`
* **aes** - aesthetics
    * shape, transparency, color, fill, linetype
* **scales** - define how your data will be plotted
    * continuous, discrete, log, etc

There are three ways to initialize a `ggplot()` object. 

An empty ggplot object
```{r}
library(ggplot2)
p <- ggplot()         
```

A ggplot object associated with a dataset
```{r}
p <- hc %>% 
      filter(year==2014) %>% 
      ggplot()
```
or a ggplot object with a dataset and `x` and `y` defined

```{r}
p <- hc %>% 
      filter(year==2014) %>% 
      ggplot(aes(x = spending_capita, y = prop_coverage))
```

```{r, eval = FALSE}
p
```

### 2. Create scatter plots using `geom_point()` 

The function `aes()` is an aesthetic mapping 
function inside the `ggplot()` object. We 
use this function to specify plot attributes 
(e.g. `x` and `y` variable names) that 
will not change as we add more layers.  

Anything that goes in the `ggplot()` object becomes
a global setting. From there, we use the `geom`
objects to add more layers to the base `ggplot()` 
object. These will define what we are interested in 
illustrating using the data.  

If you recall, our first question that we were
interested in was 

> 1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

Before, we were using base R to create something
like this: 

```{r}
hc.employer.2013 <- hc %>% 
  filter(type == "Employer", year == "2013")

plot(hc.employer.2013$spending_capita, 
     hc.employer.2013$prop_coverage, 
     xlab = "spending per capita", 
     ylab = "coverage proportion")
```

Let's  re-create this plot with `ggplot2` 
using the `geom_point()` geometry. 

```{r}
p <- hc %>%
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage)) 
p + geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion")
```

We used the `xlab()` and `ylab()` functions
in `ggplot2` to specify the x-axis and y-axis
labels. 

**Note**, we do not have to assign (`<-`) the plot 
to anything: 

```{r}
hc %>%
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion")
```


It's also simple to fit a linear regression model 
and plot it on top of scatter plot using the 
`geom_smooth()` (or `stat_smooth()`) functions. 

```{r}
hc %>%
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red")
```

The standard error bounds are computed and included 
in the plot. 


It would be nice to know which state is represented
by which state. For this, we will introduce another 
_geom_ called `geom_text()`. 

### 3. Add layers of text using `geom_text()`

In our dataset, we have information about the 
abbreviation for each state. We could add the 
abbreviations for each state next to the point on 
the plot to assess which states have a higher or
lower coverage for a given amount of money they 
spend per capita. 

```{r}
hc %>% 
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb))
```

That is cool, but it would be even better if we 
could _nudge_ the text over a bit. Let's look at 
the help file for `geom_text()`: 

```{r, eval=FALSE}
?ggplot2::geom_text
```

We see there is an argument called 
`nudge_x` and `nudge_y`. We can use these 
to _nudge_ the text over a bit so the 
text is not directly on top of the points. 

```{r}
hc %>% 
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), nudge_x = 150)
```

#### (*) Color each point (or state) by what region they are from.  

```{r}
## add your code here

hc %>% 
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage, 
             color = region)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), nudge_x = 150)
```

#### (*) Try to explore the package `ggrepel`, and check if you can improve the quality of visualization using the *geom_text_repel* from `ggrepel` instead of *x_nudge* from `ggplot2`.  

```{r}
## add your code here  

library(ggrepel)

hc %>% 
  filter(type == "Employer", year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage, 
             color = region)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text_repel(aes(label=abb))  
```


### 4. Facet across a variable using `facet_wrap`

Ok, getting back to our original question: 

> 1. Is there a relationship between healthcare coverage and healthcare spending in the United States? 

We saw there was a positive relationship,
but this was only for one type of healthcare 
coverage (`Employer`) and one year. 
What about the other types? 

For this, we will introduce `facets`. The idea 
of _faceting_ is to stratify the data by some 
variable and make the same plot for each strata. 

For example, if we wanted to _facet_ by the 
`type` variable, we will add a layer to our 
`ggplot()` object using the `facet_grid()` or 
`facet_wrap()` functions. The function expects
the row and column variables to be separated 
by a `~`. 

```{r, fig.width=12, fig.height=8}
hc %>%
  filter(year == "2013") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage, 
             color = region)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text_repel(aes(label=abb)) + 
  facet_wrap(~type)
```

We see that the proportion of people covered 
have different scales in the y-axis. Let's 
read the help file to see if there is some way 
to not restrict the y-axis to be the same. 
```{r}
?ggplot2::facet_grid
```

Yes, we see there is an argument called 
`scales` that can be `free_y`, (free columns),
`free_x` (free rows), and `free` (both). 
Let's try `free_y` and look at a different
year (`year=="2014"`): 

```{r, fig.width=12, fig.height=6}
hc %>%
  filter(year == "2014") %>% 
  ggplot(aes(x = spending_capita, y = prop_coverage, 
             color = region)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text_repel(aes(label=abb)) + 
  facet_wrap(~type, scales="free_y")
```

Given we know `Other Public` refers to the 
military or Veterans Administration, we can see 
states like HI, VA, NV have a larger 
proportion of military or VA `Other Public` 
type coverage. While a state like AK has a 
similar proportion of `Other Public` coverage, 
it has a much larger spending per capita. 

We also see a negative relationship with the 
`Uninsured` type. The more states spend, the
less uninsured people in the state. 

### 5. Create boxplots using `geom_boxplot()`

Next, let's revisit the second question. 

> 2. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?

Let's try making a boxplot with `ggplot2`. 
If you recall, the way to do this in base R was: 

```{r}
boxplot(hc$spending_capita ~ hc$region)
```

Now, we introduce the `geom_boxplot()` 
function. Note, we needed to tell `ggplot2`
what needs to be along the x and y axis in
`aes()`. 

```{r}
hc %>% 
  ggplot(aes(x = region, y = spending_capita)) + 
  geom_boxplot()
```

### 6. Facet by two variables using `facet_grid`

> 3. Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?   

Let's try faceting by both `year` and `type`. 
Note that we can facet by rows putting a 
column name before the `~` and facet by 
columns putting a column name after the `~`. 
We are also using `facet_grid()` instead of 
`facet_wrap()`.

```{r, fig.width=12, fig.height=8}
p <- hc %>%
  ggplot(aes(x = spending_capita, y = prop_coverage, 
             color = region)) + 
  geom_point() + 
  xlab("spending per capita") + 
  ylab("coverage proportion") + 
  geom_smooth(method = "lm", col = "red") + 
  geom_text_repel(aes(label=abb)) 

p + facet_grid(year~type, scales="free")
p + facet_grid(year~type)
```


# Summary   

The total healthcare expenditure is associated with 
the population. To make a fair comparison, 
we create "healthcare expenditure per capita." 
Further, the exploratory analysis via data visualization showed 
higher spending in healthcare per capita 
is positively associated with higher 
employer coverage proportion and is 
negatively associated with the proportion 
of uninsured population across the States. 





