Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

To cite this case study please use:

Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-co2-emissions. Exploring CO2 emissions across time (Version v1.0.0).

To access the GitHub repository for this case study see here: https://github.com//opencasestudies/ocs-bp-co2-emissions.

You may also access and download the data using our OCSdata package. To learn more about this package including examples, see this link. Here is how you would install this package:

install.packages("OCSdata")

This case study is part of a series of public health case studies for the Bloomberg American Health Initiative.


The total reading time for this case study is calculated via koRpus and shown below:

Reading Time Method
70 minutes koRpus

Readability Score:

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via koRpus. These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.

Text language: en 
index grade age
Flesch-Kincaid 9 14
FORCAST 10 15
SMOG 11 16

Please help us by filling out our survey.

Motivation


This case study explores how different countries have contributed to Carbon Dioxide (CO2) emissions over time and how CO2 emission rates may relate to increasing global temperatures and increased rates of natural disasters and storms. We used this report from the EPA as the basis for motivating this case study, as it provides background information about how CO2 emissions and other greenhouse gases have influenced the climate and weather patterns.

CO2 makes up the largest proportion of greenhouse gas emissions in the United States:

[source]

A variety of sources and sectors contribute to greenhouse gas emissions:

[source]

Transportation and Electricity contribute the most metric tons of CO2:

[source]

So why should we pay attention to greenhouse gases?

According to the US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report:

Greenhouse gases absorb infrared radiation, thereby trapping heat in the atmosphere and making the planet warmer. The most important greenhouse gases directly emitted by humans include carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), and several fluorine-containing halogenated substances. Although CO2, CH4, and N2O occur naturally in the atmosphere, human activities have changed their atmospheric concentrations. From the pre- industrial era (i.e., ending about 1750) to 2018, concentrations of these greenhouse gases have increased globally by 46, 165, and 23 percent, respectively (IPCC 2013; NOAA/ESRL 2019a, 2019b, 2019c).

* IPCC stands for the Intergovernmental Panel on Climate Change

In fact, there are many signs that our planet is experiencing warmer temperatures:

[source]

The connection between greenhouse gas levels and global temperatures and the influence of increased global temperatures on human health are motivated by these reports:

The National Climate Assessment Report states that:

Heat-trapping gases already in the atmosphere have committed us to a hotter future with more climate-related impacts over the next few decades. The magnitude of climate change beyond the next few decades depends primarily on the amount of heat-trapping gases that human activities emit globally, now and in the future.

See the following links for more information about how greenhouse gases have influenced global temperatures: 1) The EPA report on green house gases
2) The National Climate Assessment (NCA) summary from 2014) 3) The World101 website about how countries are adapting to climate change

Main Questions


Our main questions:

  1. How have global CO2 emission rates changed over time? In particular for the US, and how does the US compare to other countries?
  2. Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated?

Learning Objectives


In this case study, we will explore CO2 emission data from around the world. We will also focus on the US specifically to evaluate patterns of temperatures and natural disaster activity.

This case study will particularly focus on how to use different datasets that span different ranges of time, as well as how to create visualizations of patterns over time. We will especially focus on using packages and functions from the tidyverse, such as dplyr, tidyr, and ggplot2.

The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Importing data from various types of Excel files and CSV files
  2. Apply action verbs in dplyr for data wrangling
  3. How to pivot between “long” and “wide” datasets
  4. Joining together multiple datasets using dplyr
  5. How to create effective longitudinal data visualizations with ggplot2
  6. How to add text, color, and labels to ggplot2 plots
  7. How to create faceted ggplot2 plots

Statistical Learning Objectives:

  1. Introduction to correlation coefficient as a summary statistic
  2. Relationship between correlation and linear regression
  3. Correlation is not causation


We will begin by loading the packages that we will need:

library(here)
library(readxl)
library(readr)
library(dplyr)
library(magrittr)
library(stringr)
library(purrr)
library(tidyr)
library(forcats)
library(ggplot2)
library(directlabels)
library(ggrepel)
library(broom)
library(patchwork)
library(OCSdata)

Packages used in this case study:

Package Use in this case study
here to easily load and save data
readxl to import the Excel file data
readr to import the csv file data
dplyr to view and wrangle the data, by modifying variables, renaming variables, selecting variables, creating variables, and arranging values within a variable
magrittr to use and reassign data objects using the %<>%pipe operator
stringr to select only the first 4 characters of date data
purrr to apply a function on a list of tibbles (tibbles are the tidyverse version of a data frame)
tidyr to drop rows with NA values from a tibble
forcats to reorder the levels of a factor
ggplot2 to make visualizations
directlabels to add labels to plots easily
ggrepel to add labels that don’t overlap to plots
broom to make the output form statistical tests easier to work with
patchwork to combine plots
OCSdata to access and download OCS data files

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

Context


Now we will describe a bit more background about greenhouse gas emissions and the potential influence of these emissions on public health.

Greenhouse gas emissions are due to both natural processes and anthropogenic (human-derived) activities.

These emissions are one of the contributing factors to rising global temperatures, which can have a great influence on public health as illustrated in the following image:

[source]

According to the US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report:

Gases in the atmosphere can contribute to climate change both directly and indirectly. Direct effects occur when the gas itself absorbs radiation. Indirect radiative forcing occurs when chemical transformations of the substance produce other greenhouse gases, when a gas influences the atmospheric lifetimes of other gases, and/or when a gas affects atmospheric processes that alter the radiative balance of the earth (e.g., affect cloud formation or albedo).

The Global Warming Potential (GWP) compares the ability of a greenhouse gas to trap heat in the atmosphere relative to another gas.

The GWP of a greenhouse gas is defined as the ratio of the accumulated radiative forcing within a specific time horizon caused by emitting 1 kilogram of the gas, relative to that of the reference gas CO2 (IPCC 2013). Therefore GWP-weighted emissions are provided in million metric tons of CO2 equivalent (MMT CO2 Eq.)

[source]

CO2 is actually the least heat-trapping gas of the greenhouse gases:

[source]

However, because CO2 is so much more abundant and stays in the atmosphere so much longer than other greenhouse gases, it has been the largest contributor to global warming. See here for more details.

It is also important to keep in mind that there is a lag between greenhouse gas emissions and temperature changes that we experience because much of Earth’s thermal energy (and CO2) gets stored in the ocean.

Due to a process called thermal inertia, the heat stored in the ocean will eventually be transfered to the surface of the Earth long after the gases were emitted that resulted in the increased ocean temperature.

See here for more explanation.

Furthermore, rising CO2 levels in the ocean also influence ocean acidity:

[source]

As CO2 levels rise in the ocean, the pH becomes more acidic, which makes it difficult for organisms to maintain their shells or skeletons that are made of calcium carbonate, thus making it more difficult for these organisms to survive and impacting their role in the ecosystem and food chain.

Furthermore, greenhouse gas emissions are believed to influence weather patterns as shown in this report.

Indeed, events with high levels of precipitation which can induce flooding and property damage are generally increasing around the country:

Limitations


An important limitation regarding this data analysis to keep in mind is the datasets only include countries and years in which countries were reporting such information to the agencies that collected the data. Thus, the data are incomplete. For example, while we have a fairly good sense of CO2 emissions globally for later years, additional emissions were also produced by countries that are not included in the data.

What are the data?


In this case study we will be using data related to CO2 emissions, as well as other data that may influence, be influenced or relate to CO2 emissions. Most of our data is from Gapminder that was originally obtained from the World Bank.

In addition, we will use some data that is specific to the United States from the National Oceanic and Atmospheric Administration (NOAA), which is an agency that collects weather and climate data.

Data Time span Source Original Source Description Citation
CO2 emissions 1751-2014 Gapminder Carbon Dioxide Information Analysis Center (CDIAC) CO2 emissions in tonnes or metric tons (equivalent to approximately 2,204.6 pounds) per person by country NA
GDP per capita (percent yearly growth) 1801-2019 Gapminder World Bank Growth Domestic Product (which is an overall measure of the health of nation’s economy) per person by country NA
Energy use per person 1960-2015 Gapminder World Bank Use of primary energy before transformation to other end-use fuels, by country NA
US Natural Disasters 1980-2019 The National Oceanic and Atmospheric Administration (NOAA) The National Oceanic and Atmospheric Administration (NOAA) US data about:
– Droughts
– Floods
– Freezes
– Severe Storms
– Tropical Cyclones
– Wildfires
– Winter Storms
NOAA National Centers for Environmental Information (NCEI) U.S. Billion-Dollar Weather and Climate Disasters (2020). https://www.ncdc.noaa.gov/billions/, DOI: 10.25921/stkw-7w73
Temperature 1895-2019 The National Oceanic and Atmospheric Administration (NOAA) The National Oceanic and Atmospheric Administration (NOAA) US National yearly average temperature (in Fahrenheit) from 1895 to 2019 NOAA National Centers for Environmental information, Climate at a Glance: National Time Series, published June 2020, retrieved on June 26, 2020 from https://www.ncdc.noaa.gov/cag/

To obtain the temperature data, the annual average temperatures were selected as shown in this image:

[source]

Importantly, notice that the data we would like to use span different time periods:

Data Time span
CO2 emissions 1751 to 2014
GDP per capita (yearly growth) 1801 to 2019
Energy use per person 1960 to 2015
US Natural Disasters 1980 to 2019
Temperature 1895 to 2019

We will explore more about this a bit later.

Question Opportunity

What concerns might arise about reliability and variation of measurement practices over time?

Data Import


In our case, we downloaded the data for the files from the various sources as indicated in the table above and put them within a “raw” subdirectory of a “data” directory for our project. If you use an RStudio project, then you can use the here() function of the here package to make the path for importing this data simpler. The here package automatically starts looking for files based on where you have a .Rproj file which is created when you start a new RStudio project. We can specify that we want to look for the “yearly_co2_emissions_1000_tonnes.xlsx” file within the “raw” directory within the “data” directory within a directory where our .Rproj file is located by separating the names of these directories using commas and listing “data” first.


Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.


To read in the files that were downloaded from the various sources as indicated in the table above, we will use the read_xlsx() and read_xls() functions of the readxl package to import the data from the .xlsx and .xls files, respectively. We will also use the here() function of the here package to more easily specify the path to our files relative to the directory where the .Rproj file is located.

CO2_emissions <- readxl::read_xlsx(here("data","raw", "yearly_co2_emissions_1000_tonnes.xlsx"))
gdp_growth    <- readxl::read_xlsx(here("data", "raw", "gdp_per_capita_yearly_growth.xlsx"))
energy_use    <- readxl::read_xlsx(here("data", "raw", "energy_use_per_person.xlsx"))

If you had trouble downloading these files, you can do so at our GitHub repo or more directly by clicking here, here, and here.

You may also download these files using the OCSdata package:

# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-co2-emissions", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" subfolder 
# in your current working directory

We will use the read_csv() function of the readr package to import the data from the .csv files.

However, for these files there are some lines that we would like to not import because the number of columns differ for some rows. If we don’t account for this, then we may end up importing fewer columns of the data that we would like.

In the first 5 rows shown below in the data/disasters.csv file, you can see that the first two rows does not have the same number of columns as the subsequent rows and are just (sub)titles.

To do this, we can skip rows using the skip = 2 argument of the read_csv() function.

us_disaster <- readr::read_csv(here("data", "raw", "disasters.csv"), skip = 2)

If you had trouble downloading this file, you can do so at our GitHub repo or more directly by clicking here.

Now looking at the data/temperature.csv file, we see that the first four lines do not have the same number of columns as the subsequent lines.

We will skip importing all 4 lines by using skip = 4. We can also replace all instances of "-99" with NA using the na = "-99" argument of the read_csv() function. The “-99” needs to be in quotation marks because this argument expects characters.


Click here for an explanation about data types in R and about character strings.

There are several classes of data in R programming, meaning that certain objects will be treated or interpreted differently. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter “a” or the number “3”. If data are of class character, than the numeric values will not be processed like a numeric value in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).

A variable that is a factor has a set of particular values called levels (this can be numbers or characters). Even if these are numeric, they will be interpreted as levels (i.e., as if they were characters) not as mathematical numbers. The values of a factor are assumed to have a particular ordering; by default the order is alphabetical, but this is not always the correct/intuitive ordering. You can modify the order of these levels with the forcats package.


us_temperature <- readr::read_csv(here("data", "raw", "temperature.csv"), skip = 4, na = "-99")

If you had trouble downloading this file, you can do so at our GitHub repo or more directly by clicking here.

Great! now we have imported all of the data that we will need.

To allow users to skip import we will save the data as an RDA file:

save(CO2_emissions, 
     gdp_growth,
     energy_use, 
     us_disaster, 
     us_temperature, 
     file = here::here("data", "imported", "co2_data_imported.rda"))

Data Wrangling


If you have been following along but stopped, we could load our imported data like so:

load(here::here("data", "imported", "co2_data_imported.rda"))

If you skipped the data import section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the imported data using the following code:

imported_data("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "co2_data_imported.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “imported” within a directory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "imported", "co2_data_imported.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Next, we take a look at our data that we just imported. We will need to do some data wrangling to allow us to evaluate how CO2 emissions have changed over time and how emissions may relate to energy use, GDP, etc. Let’s explore how to do that with useful functions and packages from the tidyverse.

Yearly CO2 Emissions


First, let’s take a look at the CO2 data (CO2_emissions). We can use the slice_head() function of the dplyr package to see just the first rows of our data. We can specify how many rows we would like to see by using the n = argument.

We will use the %>% pipe from the magrittr package (although it is also imported by other tidyverse packages, like dplyr), which can be used to define the input for later sequential steps. This will make more sense when we have multiple sequential steps using the same data object.

CO2_emissions %>%
  slice_head(n = 3)
# A tibble: 3 x 265
  country  `1751` `1752` `1753` `1754` `1755` `1756` `1757` `1758` `1759` `1760`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Afghani~     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
2 Albania      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
3 Algeria      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
# ... with 254 more variables: `1761` <dbl>, `1762` <dbl>, `1763` <dbl>,
#   `1764` <dbl>, `1765` <dbl>, `1766` <dbl>, `1767` <dbl>, `1768` <dbl>,
#   `1769` <dbl>, `1770` <dbl>, `1771` <dbl>, `1772` <dbl>, `1773` <dbl>,
#   `1774` <dbl>, `1775` <dbl>, `1776` <dbl>, `1777` <dbl>, `1778` <dbl>,
#   `1779` <dbl>, `1780` <dbl>, `1781` <dbl>, `1782` <dbl>, `1783` <dbl>,
#   `1784` <dbl>, `1785` <dbl>, `1786` <dbl>, `1787` <dbl>, `1788` <dbl>,
#   `1789` <dbl>, `1790` <dbl>, `1791` <dbl>, `1792` <dbl>, `1793` <dbl>, ...

Another useful function is slice_sample() to look at a selection of random rows using pseudorandom numbers for the index of rows to show. To continue to get the same random values or for others to get the same values, we need to set a seed first. We can do this with the set.seed() base function. We just specify a number with this function and that will allow us to get the same subset of values from the slice_sample() function. If two different people ran this code (without set.seed()), they would each see a different subset of rows. For data exploration, this isn’t a huge deal, but if we’d like separate analysts running the same code to see the same output, we will use set.seed(). If we changed set.seed(123) to set.seed(333), we would obtain a different random sample of rows.

set.seed(123)

CO2_emissions %>%
  slice_sample(n = 3)
# A tibble: 3 x 265
  country  `1751` `1752` `1753` `1754` `1755` `1756` `1757` `1758` `1759` `1760`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Sri Lan~     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
2 Tuvalu       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
3 Banglad~     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
# ... with 254 more variables: `1761` <dbl>, `1762` <dbl>, `1763` <dbl>,
#   `1764` <dbl>, `1765` <dbl>, `1766` <dbl>, `1767` <dbl>, `1768` <dbl>,
#   `1769` <dbl>, `1770` <dbl>, `1771` <dbl>, `1772` <dbl>, `1773` <dbl>,
#   `1774` <dbl>, `1775` <dbl>, `1776` <dbl>, `1777` <dbl>, `1778` <dbl>,
#   `1779` <dbl>, `1780` <dbl>, `1781` <dbl>, `1782` <dbl>, `1783` <dbl>,
#   `1784` <dbl>, `1785` <dbl>, `1786` <dbl>, `1787` <dbl>, `1788` <dbl>,
#   `1789` <dbl>, `1790` <dbl>, `1791` <dbl>, `1792` <dbl>, `1793` <dbl>, ...

Question Opportunity

Try setting a different seed to see the difference in the output.

OK, we see each country is represented along one row and each column contains yearly CO2 emissions. We also see that there are a lot of NA values.

We can also use the glimpse() function of the dplyr package to view our data. This allows us to see all of our variables at once. We will see a tiny bit of each variable/column with the data displayed on the right.

# Scroll through the output!
CO2_emissions %>%
  dplyr::glimpse()
Rows: 192
Columns: 265
$ country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Ant~
$ `1751`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1752`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1753`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1754`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1755`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1756`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1757`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1758`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1759`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1760`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1761`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1762`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1763`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1764`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1765`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1766`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1767`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1768`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1769`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1770`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1771`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1772`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1773`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1774`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1775`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1776`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1777`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1778`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1779`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1780`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1781`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1782`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1783`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1784`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1785`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1786`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1787`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1788`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1789`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1790`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1791`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1792`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1793`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1794`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1795`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1796`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1797`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1798`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1799`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1800`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1801`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1802`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1803`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1804`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1805`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1806`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1807`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 169, NA, NA, NA, NA, NA, N~
$ `1808`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1809`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1810`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1811`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1812`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1813`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1814`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1815`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1816`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1817`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1818`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ `1819`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 253, NA, NA, NA, NA, NA, N~
$ `1820`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 334, NA, NA, NA, NA, NA, N~
$ `1821`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 359, NA, NA, NA, NA, NA, N~
$ `1822`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 367, NA, NA, NA, NA, NA, N~
$ `1823`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 348, NA, NA, NA, NA, NA, N~
$ `1824`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 400, NA, NA, NA, NA, NA, N~
$ `1825`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 403, NA, NA, NA, NA, NA, N~
$ `1826`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 458, NA, NA, NA, NA, NA, N~
$ `1827`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 477, NA, NA, NA, NA, NA, N~
$ `1828`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 458, NA, NA, NA, NA, NA, N~
$ `1829`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 477, NA, NA, NA, NA, NA, N~
$ `1830`  <dbl> NA, NA, NA, NA, NA, NA, NA, 0.032, NA, 495.000, 0.308, NA, NA,~
$ `1831`  <dbl> NA, NA, NA, NA, NA, NA, NA, 3.84e-02, NA, 4.80e+02, 3.70e-01, ~
$ `1832`  <dbl> NA, NA, NA, NA, NA, NA, NA, 2.56e-02, NA, 5.13e+02, 2.47e-01, ~
$ `1833`  <dbl> NA, NA, NA, NA, NA, NA, NA, 0.032, NA, 429.000, 0.308, NA, NA,~
$ `1834`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 587, NA, NA, NA, NA, NA, N~
$ `1835`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 634, NA, NA, NA, NA, NA, N~
$ `1836`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 675, NA, NA, NA, NA, NA, N~
$ `1837`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 708, NA, NA, NA, NA, NA, N~
$ `1838`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 851, NA, NA, NA, NA, NA, N~
$ `1839`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1060, NA, NA, NA, NA, NA, ~
$ `1840`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1170, NA, NA, NA, NA, NA, ~
$ `1841`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1320, NA, NA, NA, NA, NA, ~
$ `1842`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1460, NA, NA, NA, NA, NA, ~
$ `1843`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1270, NA, NA, NA, NA, NA, ~
$ `1844`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1600, NA, NA, NA, NA, NA, ~
$ `1845`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1800, NA, NA, NA, NA, NA, ~
$ `1846`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2120, NA, NA, NA, NA, NA, ~
$ `1847`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2080, NA, NA, NA, NA, NA, ~
$ `1848`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2340, NA, NA, NA, NA, NA, ~
$ `1849`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2260, NA, NA, NA, NA, NA, ~
$ `1850`  <dbl> NA, NA, NA, NA, NA, NA, NA, 0.198, NA, 2330.000, 1.910, NA, NA~
$ `1851`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2340, NA, NA, NA, NA, NA, ~
$ `1852`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2810, NA, NA, NA, NA, NA, ~
$ `1853`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 3230, NA, NA, NA, NA, NA, ~
$ `1854`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 3180, NA, NA, NA, NA, NA, ~
$ `1855`  <dbl> NA, NA, NA, NA, NA, NA, NA, 6.01e-01, NA, 3.70e+03, 5.80e+00, ~
$ `1856`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4240, NA, NA, NA, NA, NA, ~
$ `1857`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4880, NA, NA, NA, NA, NA, ~
$ `1858`  <dbl> NA, NA, NA, NA, NA, NA, NA, 8.44e-01, NA, 7.25e+03, 8.14e+00, ~
$ `1859`  <dbl> NA, NA, NA, NA, NA, NA, NA, 8.95e-01, NA, 5.87e+03, 8.64e+00, ~
$ `1860`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.18, 279.00, 6150.00, 11.40, NA, ~
$ `1861`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.5, 510.0, 6380.0, 14.5, NA, NA, ~
$ `1862`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.36, 356.00, 6360.00, 13.10, NA, ~
$ `1863`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.42, 400.00, 5880.00, 13.70, NA, ~
$ `1864`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.59, 268.00, 5080.00, 15.40, NA, ~
$ `1865`  <dbl> NA, NA, NA, NA, NA, NA, NA, 1.52, 422.00, 5360.00, 14.70, NA, ~
$ `1866`  <dbl> NA, NA, NA, NA, NA, NA, NA, 4.81, 697.00, 3600.00, 46.40, NA, ~
$ `1867`  <dbl> NA, NA, NA, NA, NA, NA, NA, 5.52, 895.00, 4920.00, 53.20, NA, ~
$ `1868`  <dbl> NA, NA, NA, NA, NA, NA, NA, 4.59, 733.00, 6080.00, 44.30, NA, ~
$ `1869`  <dbl> NA, NA, NA, NA, NA, NA, NA, 6.23, 642.00, 6490.00, 60.10, NA, ~
$ `1870`  <dbl> NA, NA, NA, NA, NA, NA, NA, 6.76, 601.00, 7370.00, 65.20, NA, ~
$ `1871`  <dbl> NA, NA, NA, NA, NA, NA, NA, 9.12, 693.00, 10200.00, 88.00, NA,~
$ `1872`  <dbl> NA, NA, NA, NA, NA, NA, NA, 9.36, 708.00, 10000.00, 90.40, NA,~
$ `1873`  <dbl> NA, NA, NA, NA, NA, NA, NA, 8.79, 869.00, 10700.00, 84.80, NA,~
$ `1874`  <dbl> NA, NA, NA, NA, NA, NA, NA, 10.7, 891.0, 9160.0, 103.0, NA, NA~
$ `1875`  <dbl> NA, NA, NA, NA, NA, NA, NA, 12.3, 829.0, 7870.0, 119.0, NA, NA~
$ `1876`  <dbl> NA, NA, NA, NA, NA, NA, NA, 15.2, 931.0, 8100.0, 147.0, NA, NA~
$ `1877`  <dbl> NA, NA, NA, NA, NA, NA, NA, 15.6, 1070.0, 7290.0, 150.0, NA, N~
$ `1878`  <dbl> NA, NA, NA, NA, NA, NA, NA, 20.3, 968.0, 7250.0, 196.0, NA, NA~
$ `1879`  <dbl> NA, NA, NA, NA, NA, NA, NA, 20.9, 1460.0, 8870.0, 201.0, NA, N~
$ `1880`  <dbl> NA, NA, NA, NA, NA, NA, NA, 24.5, 2210.0, 23700.0, 236.0, NA, ~
$ `1881`  <dbl> NA, NA, NA, NA, NA, NA, NA, 25.80, 1770.00, 10300.00, 249.00, ~
$ `1882`  <dbl> NA, NA, NA, NA, NA, NA, NA, 27.20, 2010.00, 10600.00, 262.00, ~
$ `1883`  <dbl> NA, NA, NA, NA, NA, NA, NA, 30.90, 2430.00, 11800.00, 298.00, ~
$ `1884`  <dbl> NA, NA, NA, NA, NA, NA, NA, 31.4, 2570.0, 11500.0, 303.0, NA, ~
$ `1885`  <dbl> NA, NA, NA, NA, NA, NA, NA, 34.20, 2910.00, 12100.00, 330.00, ~
$ `1886`  <dbl> NA, NA, NA, NA, NA, NA, NA, 35.10, 2890.00, 11400.00, 338.00, ~
$ `1887`  <dbl> NA, NA, NA, NA, NA, NA, 1090.0, 37.1, 3040.0, 12300.0, 358.0, ~
$ `1888`  <dbl> NA, NA, NA, NA, NA, NA, 891.0, 38.7, 3530.0, 12000.0, 373.0, N~
$ `1889`  <dbl> NA, NA, NA, NA, NA, NA, 1760.0, 41.8, 3430.0, 12900.0, 403.0, ~
$ `1890`  <dbl> NA, NA, NA, NA, NA, NA, 1370.0, 47.3, 3550.0, 13000.0, 457.0, ~
$ `1891`  <dbl> NA, NA, NA, NA, NA, NA, 939.0, 52.1, 4010.0, 15000.0, 503.0, N~
$ `1892`  <dbl> NA, NA, NA, NA, NA, NA, 1390.0, 55.1, 4150.0, 14500.0, 532.0, ~
$ `1893`  <dbl> NA, NA, NA, NA, NA, NA, 1550.0, 64.6, 3970.0, 17700.0, 624.0, ~
$ `1894`  <dbl> NA, NA, NA, NA, NA, NA, 1990.0, 65.8, 4360.0, 18100.0, 635.0, ~
$ `1895`  <dbl> NA, NA, NA, NA, NA, NA, 2270.0, 75.6, 4590.0, 20400.0, 730.0, ~
$ `1896`  <dbl> NA, NA, NA, NA, NA, NA, 2310, 77, 4510, 21300, 743, NA, NA, NA~
$ `1897`  <dbl> NA, NA, NA, NA, NA, NA, 2080, 89, 4980, 23000, 859, NA, NA, NA~
$ `1898`  <dbl> NA, NA, NA, NA, NA, NA, 2350.0, 99.9, 5620.0, 24500.0, 964.0, ~
$ `1899`  <dbl> NA, NA, NA, NA, NA, NA, 2920, 116, 5790, 24800, 1120, NA, NA, ~
$ `1900`  <dbl> NA, NA, NA, NA, NA, NA, 2070, 131, 10200, 27700, 1270, NA, NA,~
$ `1901`  <dbl> NA, NA, NA, NA, NA, NA, 2490, 135, 11400, 28400, 1300, NA, NA,~
$ `1902`  <dbl> NA, NA, NA, NA, NA, NA, 2820, 130, 11400, 25700, 1260, NA, NA,~
$ `1903`  <dbl> NA, NA, NA, NA, NA, NA, 2860, 127, 11200, 25600, 1230, NA, NA,~
$ `1904`  <dbl> NA, NA, NA, NA, NA, NA, 3800, 142, 11600, 26900, 1370, NA, NA,~
$ `1905`  <dbl> NA, NA, NA, NA, NA, NA, 3990, 126, 12100, 28100, 1220, NA, NA,~
$ `1906`  <dbl> NA, NA, NA, NA, NA, NA, 6260, 144, 14400, 33600, 1390, NA, NA,~
$ `1907`  <dbl> NA, NA, NA, NA, NA, NA, 6260, 161, 15500, 42200, 1560, NA, NA,~
$ `1908`  <dbl> NA, NA, NA, NA, NA, NA, 7620, 162, 16800, 59000, 1570, NA, NA,~
$ `1909`  <dbl> NA, NA, NA, NA, NA, NA, 5940, 172, 14600, 42200, 1660, NA, NA,~
$ `1910`  <dbl> NA, NA, NA, NA, NA, NA, 8910, 168, 17500, 57600, 1620, NA, NA,~
$ `1911`  <dbl> NA, NA, NA, NA, NA, NA, 9950, 174, 19300, 48100, 1680, NA, NA,~
$ `1912`  <dbl> NA, NA, NA, NA, NA, NA, 9490, 198, 20800, 50000, 1910, NA, NA,~
$ `1913`  <dbl> NA, NA, NA, NA, NA, NA, 10200, 215, 22400, 59700, 2070, NA, NA~
$ `1914`  <dbl> NA, NA, NA, NA, NA, NA, 8680, 194, 24500, 48900, 1870, NA, NA,~
$ `1915`  <dbl> NA, NA, NA, NA, NA, NA, 6950, 178, 21800, 34900, 1720, NA, NA,~
$ `1916`  <dbl> NA, NA, 3.67, NA, NA, NA, 4990.00, 189.00, 19300.00, 8040.00, ~
$ `1917`  <dbl> NA, NA, 7.33, NA, NA, NA, 2230.00, 174.00, 20800.00, 3450.00, ~
$ `1918`  <dbl> NA, NA, 18.3, NA, NA, NA, 2520.0, 69.5, 23000.0, 3340.0, 671.0~
$ `1919`  <dbl> NA, NA, 18.3, NA, NA, NA, 3730.0, 59.4, 21800.0, 3020.0, 573.0~
$ `1920`  <dbl> NA, NA, 22.0, NA, NA, NA, 5900.0, 54.2, 25800.0, 14500.0, 523.~
$ `1921`  <dbl> NA, NA, 25.7, NA, NA, NA, 5540.0, 58.7, 23200.0, 19400.0, 567.~
$ `1922`  <dbl> NA, NA, 25.7, NA, NA, NA, 7300.0, 71.6, 24400.0, 18600.0, 692.~
$ `1923`  <dbl> NA, NA, 14.7, NA, NA, NA, 8450.0, 79.1, 24900.0, 17800.0, 764.~
$ `1924`  <dbl> NA, NA, 29.3, NA, NA, NA, 11000.0, 94.3, 27100.0, 20100.0, 910~
$ `1925`  <dbl> NA, NA, 33.0, NA, NA, NA, 11200.0, 93.1, 28300.0, 19000.0, 898~
$ `1926`  <dbl> NA, NA, 40.3, NA, NA, NA, 11300.0, 135.0, 27900.0, 18600.0, 13~
$ `1927`  <dbl> NA, NA, 58.7, NA, NA, NA, 13400.0, 168.0, 28900.0, 20100.0, 16~
$ `1928`  <dbl> NA, NA, 73.30, NA, NA, NA, 12800.00, 186.00, 26300.00, 21200.0~
$ `1929`  <dbl> NA, NA, 80.70, NA, NA, NA, 13100.00, 201.00, 23700.00, 24200.0~
$ `1930`  <dbl> NA, NA, 84.30, NA, NA, NA, 12800.00, 273.00, 22000.00, 18900.0~
$ `1931`  <dbl> NA, NA, 99.00, NA, NA, NA, 12900.00, 328.00, 19600.00, 18100.0~
$ `1932`  <dbl> NA, NA, 114.00, NA, NA, NA, 13100.00, 369.00, 20400.00, 15200.~
$ `1933`  <dbl> NA, 7.33, 121.00, NA, NA, NA, 13200.00, 412.00, 21600.00, 1420~
$ `1934`  <dbl> NA, 7.33, 139.00, NA, NA, NA, 14300.00, 499.00, 22700.00, 1380~
$ `1935`  <dbl> NA, 18.3, 132.0, NA, NA, NA, 14000.0, 565.0, 25300.0, 13900.0,~
$ `1936`  <dbl> NA, 128.0, 51.3, NA, NA, NA, 15100.0, 648.0, 27100.0, 13600.0,~
$ `1937`  <dbl> NA, 297.0, 69.7, NA, NA, NA, 16700.0, 662.0, 28900.0, 15300.0,~
$ `1938`  <dbl> NA, 348, 33, NA, NA, NA, 16400, 699, 28100, 5790, 6750, NA, 34~
$ `1939`  <dbl> NA, 433.00, 161.00, NA, NA, NA, 17400.00, 707.00, 32200.00, 63~
$ `1940`  <dbl> NA, 693, 238, NA, NA, NA, 15900, 848, 29100, 7350, 8190, NA, 2~
$ `1941`  <dbl> NA, 627, 312, NA, NA, NA, 14000, 745, 34600, 7980, 7190, NA, 2~
$ `1942`  <dbl> NA, 744, 499, NA, NA, NA, 13500, 513, 36500, 8560, 4950, NA, 2~
$ `1943`  <dbl> NA, 462, 469, NA, NA, NA, 14100, 655, 35000, 9620, 6320, NA, 2~
$ `1944`  <dbl> NA, 154, 499, NA, NA, NA, 14000, 613, 34200, 9400, 5920, NA, 2~
$ `1945`  <dbl> NA, 121, 616, NA, NA, NA, 13700, 649, 32700, 4570, 6270, NA, 3~
$ `1946`  <dbl> NA, 484, 763, NA, NA, NA, 13700, 730, 35500, 12800, 7040, NA, ~
$ `1947`  <dbl> NA, 928.00, 744.00, NA, NA, NA, 14500.00, 878.00, 38000.00, 17~
$ `1948`  <dbl> NA, 704.00, 803.00, NA, NA, NA, 17400.00, 935.00, 38500.00, 24~
$ `1949`  <dbl> 14.70, 1020.00, 909.00, NA, NA, NA, 15400.00, 1060.00, 37700.0~
$ `1950`  <dbl> 84.3, 297.0, 3790.0, NA, 187.0, NA, 30000.0, 1180.0, 54800.0, ~
$ `1951`  <dbl> 91.7, 403.0, 4140.0, NA, 249.0, NA, 35000.0, 1280.0, 59100.0, ~
$ `1952`  <dbl> 91.7, 374.0, 3890.0, NA, 312.0, NA, 36100.0, 1370.0, 60300.0, ~
$ `1953`  <dbl> 106.0, 414.0, 4000.0, NA, 275.0, NA, 35200.0, 1450.0, 59500.0,~
$ `1954`  <dbl> 106.0, 502.0, 4160.0, NA, 348.0, NA, 36800.0, 1590.0, 67900.0,~
$ `1955`  <dbl> 154.0, 664.0, 4610.0, NA, 414.0, NA, 39600.0, 1800.0, 70700.0,~
$ `1956`  <dbl> 183.0, 840.0, 5000.0, NA, 502.0, NA, 44300.0, 1970.0, 73100.0,~
$ `1957`  <dbl> 293.0, 1510.0, 5540.0, NA, 620.0, 22.0, 47700.0, 2160.0, 74600~
$ `1958`  <dbl> 330.0, 1200.0, 5220.0, NA, 594.0, 29.3, 44200.0, 2310.0, 77700~
$ `1959`  <dbl> 385.0, 1440.0, 5670.0, NA, 620.0, 29.3, 49000.0, 2430.0, 83800~
$ `1960`  <dbl> 414.0, 2020.0, 6160.0, NA, 550.0, 36.7, 48800.0, 2530.0, 88200~
$ `1961`  <dbl> 491.0, 2280.0, 6070.0, NA, 455.0, 47.7, 51200.0, 2600.0, 90600~
$ `1962`  <dbl> 689.0, 2460.0, 5670.0, NA, 1180.0, 103.0, 53700.0, 2730.0, 949~
$ `1963`  <dbl> 708.0, 2080.0, 5430.0, NA, 1150.0, 84.3, 50100.0, 2930.0, 1010~
$ `1964`  <dbl> 840.0, 2020.0, 5650.0, NA, 1220.0, 91.7, 55700.0, 3120.0, 1090~
$ `1965`  <dbl> 1010.0, 2170.0, 6600.0, NA, 1190.0, 150.0, 58900.0, 3310.0, 12~
$ `1966`  <dbl> 1090.0, 2550.0, 8430.0, NA, 1550.0, 348.0, 63100.0, 3490.0, 12~
$ `1967`  <dbl> 1280, 2680, 8440, NA, 994, 565, 65500, 3650, 129000, 40000, 35~
$ `1968`  <dbl> 1220, 3070, 9060, NA, 1670, 990, 69100, 3750, 135000, 42400, 3~
$ `1969`  <dbl> 942, 3250, 11300, NA, 2790, 1260, 77300, 3910, 142000, 44700, ~
$ `1970`  <dbl> 1.67e+03, 3.74e+03, 1.51e+04, NA, 3.58e+03, 4.62e+02, 8.27e+04~
$ `1971`  <dbl> 1.90e+03, 4.35e+03, 1.87e+04, NA, 3.41e+03, 4.25e+02, 8.89e+04~
$ `1972`  <dbl> 1.53e+03, 5.64e+03, 2.83e+04, NA, 4.51e+03, 3.74e+02, 9.02e+04~
$ `1973`  <dbl> 1.64e+03, 5.29e+03, 3.83e+04, NA, 4.88e+03, 3.30e+02, 9.41e+04~
$ `1974`  <dbl> 1.92e+03, 4.35e+03, 3.19e+04, NA, 4.87e+03, 4.29e+02, 9.56e+04~
$ `1975`  <dbl> 2.13e+03, 4.59e+03, 3.20e+04, NA, 4.42e+03, 7.08e+02, 9.49e+04~
$ `1976`  <dbl> 1.99e+03, 4.95e+03, 3.92e+04, NA, 3.29e+03, 4.03e+02, 9.98e+04~
$ `1977`  <dbl> 2.39e+03, 5.72e+03, 4.19e+04, NA, 3.53e+03, 4.66e+02, 1.01e+05~
$ `1978`  <dbl> 2160, 6490, 62500, NA, 5410, 491, 103000, 5810, 202000, 57500,~
$ `1979`  <dbl> 2240, 7590, 45600, NA, 5500, 407, 111000, 5850, 205000, 61600,~
$ `1980`  <dbl> 1760, 5170, 66500, NA, 5350, 143, 109000, 6080, 221000, 52300,~
$ `1981`  <dbl> 1980.0, 7340.0, 46400.0, NA, 5280.0, 106.0, 102000.0, 5970.0, ~
$ `1982`  <dbl> 2100, 7310, 39300, NA, 4650, 293, 103000, 6080, 234000, 53900,~
$ `1983`  <dbl> 2520.0, 7630.0, 52600.0, NA, 5120.0, 84.3, 105000.0, 6170.0, 2~
$ `1984`  <dbl> 2830.0, 7830.0, 71100.0, NA, 5010.0, 147.0, 107000.0, 6230.0, ~
$ `1985`  <dbl> 3510.0, 7880.0, 72800.0, NA, 4700.0, 249.0, 101000.0, 6710.0, ~
$ `1986`  <dbl> 3140, 8060, 76300, NA, 4660, 249, 104000, 6730, 240000, 54100,~
$ `1987`  <dbl> 3120, 7440, 84100, NA, 5820, 275, 115000, 7020, 256000, 57700,~
$ `1988`  <dbl> 2870, 7330, 83900, NA, 5130, 286, 121000, 7210, 261000, 53300,~
$ `1989`  <dbl> 2780.0, 8980.0, 80000.0, NA, 5010.0, 286.0, 117000.0, 7060.0, ~
$ `1990`  <dbl> 2610, 5520, 77000, 407, 5120, 282, 112000, 6620, 264000, 57700~
$ `1991`  <dbl> 2440, 4290, 79000, 407, 5090, 268, 117000, 6380, 261000, 61600~
$ `1992`  <dbl> 1390, 2520, 80100, 407, 5200, 264, 121000, 5830, 268000, 56700~
$ `1993`  <dbl> 1350, 2340, 82200, 411, 5780, 271, 118000, 2560, 277000, 57100~
$ `1994`  <dbl> 1290, 1930, 86400, 407, 3890, 268, 122000, 2710, 278000, 57100~
$ `1995`  <dbl> 1240, 2090, 95300, 425, 11000, 275, 128000, 3410, 282000, 5980~
$ `1996`  <dbl> 1180, 2020, 97100, 455, 10500, 293, 135000, 2560, 302000, 6320~
$ `1997`  <dbl> 1100, 1540, 87300, 466, 7380, 308, 138000, 3230, 306000, 62700~
$ `1998`  <dbl> 1040, 1750, 107000, 491, 7310, 319, 140000, 3360, 317000, 6370~
$ `1999`  <dbl> 821, 2980, 92000, 513, 9160, 330, 147000, 3010, 325000, 61900,~
$ `2000`  <dbl> 774, 3020, 87900, 524, 9540, 345, 142000, 3470, 329000, 62300,~
$ `2001`  <dbl> 818, 3220, 84200, 524, 9730, 348, 134000, 3540, 325000, 65900,~
$ `2002`  <dbl> 1070, 3750, 89900, 532, 12700, 370, 125000, 3040, 341000, 6710~
$ `2003`  <dbl> 1200, 4290, 91600, 535, 9060, 403, 135000, 3430, 336000, 72200~
$ `2004`  <dbl> 950, 4170, 88500, 561, 18800, 422, 158000, 3640, 343000, 72400~
$ `2005`  <dbl> 1330, 4250, 107000, 576, 19200, 429, 162000, 4350, 350000, 742~
$ `2006`  <dbl> 1650, 3900, 101000, 546, 22300, 444, 175000, 4380, 365000, 722~
$ `2007`  <dbl> 2270, 3930, 109000, 539, 25200, 469, 175000, 5060, 372000, 697~
$ `2008`  <dbl> 4210, 4370, 110000, 539, 25700, 480, 189000, 5560, 386000, 690~
$ `2009`  <dbl> 6770, 4380, 121000, 517, 27800, 510, 180000, 4360, 395000, 627~
$ `2010`  <dbl> 8460, 4600, 119000, 517, 29100, 524, 188000, 4220, 391000, 675~
$ `2011`  <dbl> 12200, 5240, 121000, 491, 30300, 513, 192000, 4920, 392000, 65~
$ `2012`  <dbl> 10800, 4910, 130000, 488, 33400, 524, 192000, 5690, 388000, 62~
$ `2013`  <dbl> 10000, 5060, 134000, 477, 32600, 524, 190000, 5500, 372000, 62~
$ `2014`  <dbl> 9810, 5720, 145000, 462, 34800, 532, 204000, 5530, 361000, 587~

We can also see that we have a large tibble.

CO2_emissions %>%
  class()
[1] "tbl_df"     "tbl"        "data.frame"

This is the object that is created when we read in the data with readr. A tibble (or tbl_df) is the tidyverse version of a data.frame object. Similar to data.frame, it is a table with variable information arranged as columns, and individual observations arranged as rows. However some nice differences are they do not change variable names or data types and they give more messages when something is wrong (e.g. when a variable does not exist), which forces the analyst to confront problems earlier. Tibbles also give us information about the class of each variable.

For example the country variable is made up of character (abbreviated as chr) values.

CO2_emissions %>%
  select(country)
# A tibble: 192 x 1
   country            
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 Andorra            
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Australia          
10 Austria            
# ... with 182 more rows

We see that we have 192 rows different country variables and CO2 emission values for 264 different years (from 1751 to 2014).

names(CO2_emissions)
  [1] "country" "1751"    "1752"    "1753"    "1754"    "1755"    "1756"   
  [8] "1757"    "1758"    "1759"    "1760"    "1761"    "1762"    "1763"   
 [15] "1764"    "1765"    "1766"    "1767"    "1768"    "1769"    "1770"   
 [22] "1771"    "1772"    "1773"    "1774"    "1775"    "1776"    "1777"   
 [29] "1778"    "1779"    "1780"    "1781"    "1782"    "1783"    "1784"   
 [36] "1785"    "1786"    "1787"    "1788"    "1789"    "1790"    "1791"   
 [43] "1792"    "1793"    "1794"    "1795"    "1796"    "1797"    "1798"   
 [50] "1799"    "1800"    "1801"    "1802"    "1803"    "1804"    "1805"   
 [57] "1806"    "1807"    "1808"    "1809"    "1810"    "1811"    "1812"   
 [64] "1813"    "1814"    "1815"    "1816"    "1817"    "1818"    "1819"   
 [71] "1820"    "1821"    "1822"    "1823"    "1824"    "1825"    "1826"   
 [78] "1827"    "1828"    "1829"    "1830"    "1831"    "1832"    "1833"   
 [85] "1834"    "1835"    "1836"    "1837"    "1838"    "1839"    "1840"   
 [92] "1841"    "1842"    "1843"    "1844"    "1845"    "1846"    "1847"   
 [99] "1848"    "1849"    "1850"    "1851"    "1852"    "1853"    "1854"   
[106] "1855"    "1856"    "1857"    "1858"    "1859"    "1860"    "1861"   
[113] "1862"    "1863"    "1864"    "1865"    "1866"    "1867"    "1868"   
[120] "1869"    "1870"    "1871"    "1872"    "1873"    "1874"    "1875"   
[127] "1876"    "1877"    "1878"    "1879"    "1880"    "1881"    "1882"   
[134] "1883"    "1884"    "1885"    "1886"    "1887"    "1888"    "1889"   
[141] "1890"    "1891"    "1892"    "1893"    "1894"    "1895"    "1896"   
[148] "1897"    "1898"    "1899"    "1900"    "1901"    "1902"    "1903"   
[155] "1904"    "1905"    "1906"    "1907"    "1908"    "1909"    "1910"   
[162] "1911"    "1912"    "1913"    "1914"    "1915"    "1916"    "1917"   
[169] "1918"    "1919"    "1920"    "1921"    "1922"    "1923"    "1924"   
[176] "1925"    "1926"    "1927"    "1928"    "1929"    "1930"    "1931"   
[183] "1932"    "1933"    "1934"    "1935"    "1936"    "1937"    "1938"   
[190] "1939"    "1940"    "1941"    "1942"    "1943"    "1944"    "1945"   
[197] "1946"    "1947"    "1948"    "1949"    "1950"    "1951"    "1952"   
[204] "1953"    "1954"    "1955"    "1956"    "1957"    "1958"    "1959"   
[211] "1960"    "1961"    "1962"    "1963"    "1964"    "1965"    "1966"   
[218] "1967"    "1968"    "1969"    "1970"    "1971"    "1972"    "1973"   
[225] "1974"    "1975"    "1976"    "1977"    "1978"    "1979"    "1980"   
[232] "1981"    "1982"    "1983"    "1984"    "1985"    "1986"    "1987"   
[239] "1988"    "1989"    "1990"    "1991"    "1992"    "1993"    "1994"   
[246] "1995"    "1996"    "1997"    "1998"    "1999"    "2000"    "2001"   
[253] "2002"    "2003"    "2004"    "2005"    "2006"    "2007"    "2008"   
[260] "2009"    "2010"    "2011"    "2012"    "2013"    "2014"   

Recall, the values are emissions in metric tons, also called tonnes. Scrolling through the glimpse() function above, we can also see that there are fewer NA values for later years.

In this next code chunk, we will introduce the %<>% operator from the magrittr package. This allows us to use our CO2_emissions data and reassign it to a modified version at the same time. Let’s modify CO2_emissions to make it more usable for making visualizations. Specifically, we will use the pivot_longer() function of the dplyr package to convert our data into what is called “long” format. This is also sometimes referred to as “narrow” format.

This means that we will have more rows and fewer columns than our current format.

Right now our data is in what is called “wide” format. In wide format, each variable is listed as its own column. In contrast, in long format, variables maybe collapsed into a column that identifies the variables and a column of values. See here for more information about the difference between the two formats.

We want to collapse all of the values for the emission data across the different individual year variables into one new Emissions variable. We will identify what year they are from by creating a new Year variable. The cols = argument allows us to specify which columns we want to pivot (or not pivot) to create these new columns. We want to keep our country data as an ID variable so we will exclude it using the - sign, by default all other columns will be used.

CO2_emissions  %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Emissions")

set.seed(123)

CO2_emissions %>%
  slice_sample(n = 6)
# A tibble: 6 x 3
  country     Year  Emissions
  <chr>       <chr>     <dbl>
1 Bahamas     1832       NA  
2 Montenegro  1843       NA  
3 Mongolia    1892       NA  
4 Samoa       1791       NA  
5 Azerbaijan  1867       53.2
6 Timor-Leste 2010      235  

Question Opportunity

Think a moment about what the dimensions of the CO2_emissions tibble are now and why? How would you check this?

Hint : Checking has something to do with a unique aspect about tibbles.

Let’s say we also want to rename the country variable to be capitalized. To do this, we can use the rename() function of the dplyr package to rename this variable. When renaming variables the syntax is new-name = old-name, where the new name is listed first before the =.

You may also note that the Year variable is currently of class type character. We would like to change it to be numeric. To do this we will use the mutate() function, which is also part of the dplyr package. This function allows us to create and modify variables. We will also use this function to create a variable called Label which will have "CO2 Emissions (Metric Tons)" as the value for every row, to be used when we create plots later.

CO2_emissions  %<>%
   dplyr::rename(Country = country) %>%
   dplyr::mutate(Year = as.numeric(Year),
                 Label = "CO2 Emissions (Metric Tons)")

Now let’s take a look to see how our data has changed:

set.seed(123)

CO2_emissions %>%
  slice_sample(n = 6)
# A tibble: 6 x 4
  Country      Year Emissions Label                      
  <chr>       <dbl>     <dbl> <chr>                      
1 Bahamas      1832      NA   CO2 Emissions (Metric Tons)
2 Montenegro   1843      NA   CO2 Emissions (Metric Tons)
3 Mongolia     1892      NA   CO2 Emissions (Metric Tons)
4 Samoa        1791      NA   CO2 Emissions (Metric Tons)
5 Azerbaijan   1867      53.2 CO2 Emissions (Metric Tons)
6 Timor-Leste  2010     235   CO2 Emissions (Metric Tons)

Great, we can see that now the Year variable is of class double (abbreviated dbl), which is a numeric class.

Now, let’s take a look at the Country variable to check if there is anything unexpected. We will use the distinct() function of the dplyr package to view the unique values only. Finally, we use the pull() function of the dplyr package to extract the values from the column (this is similar to using the $ base R syntax e.g. CO2_emission$Country).

# Scroll through the output!
CO2_emissions %>%
  distinct(Country) %>%
  pull()
  [1] "Afghanistan"                    "Albania"                       
  [3] "Algeria"                        "Andorra"                       
  [5] "Angola"                         "Antigua and Barbuda"           
  [7] "Argentina"                      "Armenia"                       
  [9] "Australia"                      "Austria"                       
 [11] "Azerbaijan"                     "Bahamas"                       
 [13] "Bahrain"                        "Bangladesh"                    
 [15] "Barbados"                       "Belarus"                       
 [17] "Belgium"                        "Belize"                        
 [19] "Benin"                          "Bhutan"                        
 [21] "Bolivia"                        "Bosnia and Herzegovina"        
 [23] "Botswana"                       "Brazil"                        
 [25] "Brunei"                         "Bulgaria"                      
 [27] "Burkina Faso"                   "Burundi"                       
 [29] "Cambodia"                       "Cameroon"                      
 [31] "Canada"                         "Cape Verde"                    
 [33] "Central African Republic"       "Chad"                          
 [35] "Chile"                          "China"                         
 [37] "Colombia"                       "Comoros"                       
 [39] "Congo, Dem. Rep."               "Congo, Rep."                   
 [41] "Costa Rica"                     "Cote d'Ivoire"                 
 [43] "Croatia"                        "Cuba"                          
 [45] "Cyprus"                         "Czech Republic"                
 [47] "Denmark"                        "Djibouti"                      
 [49] "Dominica"                       "Dominican Republic"            
 [51] "Ecuador"                        "Egypt"                         
 [53] "El Salvador"                    "Equatorial Guinea"             
 [55] "Eritrea"                        "Estonia"                       
 [57] "Ethiopia"                       "Fiji"                          
 [59] "Finland"                        "France"                        
 [61] "Gabon"                          "Gambia"                        
 [63] "Georgia"                        "Germany"                       
 [65] "Ghana"                          "Greece"                        
 [67] "Grenada"                        "Guatemala"                     
 [69] "Guinea"                         "Guinea-Bissau"                 
 [71] "Guyana"                         "Haiti"                         
 [73] "Honduras"                       "Hungary"                       
 [75] "Iceland"                        "India"                         
 [77] "Indonesia"                      "Iran"                          
 [79] "Iraq"                           "Ireland"                       
 [81] "Israel"                         "Italy"                         
 [83] "Jamaica"                        "Japan"                         
 [85] "Jordan"                         "Kazakhstan"                    
 [87] "Kenya"                          "Kiribati"                      
 [89] "Kuwait"                         "Kyrgyz Republic"               
 [91] "Lao"                            "Latvia"                        
 [93] "Lebanon"                        "Lesotho"                       
 [95] "Liberia"                        "Libya"                         
 [97] "Liechtenstein"                  "Lithuania"                     
 [99] "Luxembourg"                     "Macedonia, FYR"                
[101] "Madagascar"                     "Malawi"                        
[103] "Malaysia"                       "Maldives"                      
[105] "Mali"                           "Malta"                         
[107] "Marshall Islands"               "Mauritania"                    
[109] "Mauritius"                      "Mexico"                        
[111] "Micronesia, Fed. Sts."          "Moldova"                       
[113] "Mongolia"                       "Montenegro"                    
[115] "Morocco"                        "Mozambique"                    
[117] "Myanmar"                        "Namibia"                       
[119] "Nauru"                          "Nepal"                         
[121] "Netherlands"                    "New Zealand"                   
[123] "Nicaragua"                      "Niger"                         
[125] "Nigeria"                        "North Korea"                   
[127] "Norway"                         "Oman"                          
[129] "Pakistan"                       "Palau"                         
[131] "Palestine"                      "Panama"                        
[133] "Papua New Guinea"               "Paraguay"                      
[135] "Peru"                           "Philippines"                   
[137] "Poland"                         "Portugal"                      
[139] "Qatar"                          "Romania"                       
[141] "Russia"                         "Rwanda"                        
[143] "Samoa"                          "Sao Tome and Principe"         
[145] "Saudi Arabia"                   "Senegal"                       
[147] "Serbia"                         "Seychelles"                    
[149] "Sierra Leone"                   "Singapore"                     
[151] "Slovak Republic"                "Slovenia"                      
[153] "Solomon Islands"                "Somalia"                       
[155] "South Africa"                   "South Korea"                   
[157] "South Sudan"                    "Spain"                         
[159] "Sri Lanka"                      "St. Kitts and Nevis"           
[161] "St. Lucia"                      "St. Vincent and the Grenadines"
[163] "Sudan"                          "Suriname"                      
[165] "Swaziland"                      "Sweden"                        
[167] "Switzerland"                    "Syria"                         
[169] "Tajikistan"                     "Tanzania"                      
[171] "Thailand"                       "Timor-Leste"                   
[173] "Togo"                           "Tonga"                         
[175] "Trinidad and Tobago"            "Tunisia"                       
[177] "Turkey"                         "Turkmenistan"                  
[179] "Tuvalu"                         "Uganda"                        
[181] "Ukraine"                        "United Arab Emirates"          
[183] "United Kingdom"                 "United States"                 
[185] "Uruguay"                        "Uzbekistan"                    
[187] "Vanuatu"                        "Venezuela"                     
[189] "Vietnam"                        "Yemen"                         
[191] "Zambia"                         "Zimbabwe"                      

These all look as expected!

Yearly Growth in GDP per Capita


Let’s take a look at the next dataset (gdp_growth) that we imported.

gdp_growth %>%
  slice_head(n = 3)
# A tibble: 3 x 220
  country       `1801`   `1802`   `1803`   `1804`   `1805`   `1806`   `1807`
  <chr>          <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Afghanistan NA       NA       NA       NA       NA       NA       NA      
2 Albania      0.104    0.104    0.104    0.104    0.104    0.104    0.104  
3 Algeria     -0.00247 -0.00247 -0.00247 -0.00247 -0.00247 -0.00247 -0.00247
# ... with 212 more variables: `1808` <dbl>, `1809` <dbl>, `1810` <dbl>,
#   `1811` <dbl>, `1812` <dbl>, `1813` <dbl>, `1814` <dbl>, `1815` <dbl>,
#   `1816` <dbl>, `1817` <dbl>, `1818` <dbl>, `1819` <dbl>, `1820` <dbl>,
#   `1821` <dbl>, `1822` <dbl>, `1823` <dbl>, `1824` <dbl>, `1825` <dbl>,
#   `1826` <dbl>, `1827` <dbl>, `1828` <dbl>, `1829` <dbl>, `1830` <dbl>,
#   `1831` <dbl>, `1832` <dbl>, `1833` <dbl>, `1834` <dbl>, `1835` <dbl>,
#   `1836` <dbl>, `1837` <dbl>, `1838` <dbl>, `1839` <dbl>, `1840` <dbl>, ...

How many rows and columns are there are there? We can easily check by using the base dim() function, which evaluates the dimensions of an object.

dim(gdp_growth)
[1] 194 220

Interesting, it’s 194 rows (as opposed to 50688 above). We will deal with this and other differences in the sets of countries a bit later on. There are also 220 columns with a country column and a set of columns corresponding to different years.

names(gdp_growth)
  [1] "country" "1801"    "1802"    "1803"    "1804"    "1805"    "1806"   
  [8] "1807"    "1808"    "1809"    "1810"    "1811"    "1812"    "1813"   
 [15] "1814"    "1815"    "1816"    "1817"    "1818"    "1819"    "1820"   
 [22] "1821"    "1822"    "1823"    "1824"    "1825"    "1826"    "1827"   
 [29] "1828"    "1829"    "1830"    "1831"    "1832"    "1833"    "1834"   
 [36] "1835"    "1836"    "1837"    "1838"    "1839"    "1840"    "1841"   
 [43] "1842"    "1843"    "1844"    "1845"    "1846"    "1847"    "1848"   
 [50] "1849"    "1850"    "1851"    "1852"    "1853"    "1854"    "1855"   
 [57] "1856"    "1857"    "1858"    "1859"    "1860"    "1861"    "1862"   
 [64] "1863"    "1864"    "1865"    "1866"    "1867"    "1868"    "1869"   
 [71] "1870"    "1871"    "1872"    "1873"    "1874"    "1875"    "1876"   
 [78] "1877"    "1878"    "1879"    "1880"    "1881"    "1882"    "1883"   
 [85] "1884"    "1885"    "1886"    "1887"    "1888"    "1889"    "1890"   
 [92] "1891"    "1892"    "1893"    "1894"    "1895"    "1896"    "1897"   
 [99] "1898"    "1899"    "1900"    "1901"    "1902"    "1903"    "1904"   
[106] "1905"    "1906"    "1907"    "1908"    "1909"    "1910"    "1911"   
[113] "1912"    "1913"    "1914"    "1915"    "1916"    "1917"    "1918"   
[120] "1919"    "1920"    "1921"    "1922"    "1923"    "1924"    "1925"   
[127] "1926"    "1927"    "1928"    "1929"    "1930"    "1931"    "1932"   
[134] "1933"    "1934"    "1935"    "1936"    "1937"    "1938"    "1939"   
[141] "1940"    "1941"    "1942"    "1943"    "1944"    "1945"    "1946"   
[148] "1947"    "1948"    "1949"    "1950"    "1951"    "1952"    "1953"   
[155] "1954"    "1955"    "1956"    "1957"    "1958"    "1959"    "1960"   
[162] "1961"    "1962"    "1963"    "1964"    "1965"    "1966"    "1967"   
[169] "1968"    "1969"    "1970"    "1971"    "1972"    "1973"    "1974"   
[176] "1975"    "1976"    "1977"    "1978"    "1979"    "1980"    "1981"   
[183] "1982"    "1983"    "1984"    "1985"    "1986"    "1987"    "1988"   
[190] "1989"    "1990"    "1991"    "1992"    "1993"    "1994"    "1995"   
[197] "1996"    "1997"    "1998"    "1999"    "2000"    "2001"    "2002"   
[204] "2003"    "2004"    "2005"    "2006"    "2007"    "2008"    "2009"   
[211] "2010"    "2011"    "2012"    "2013"    "2014"    "2015"    "2016"   
[218] "2017"    "2018"    "2019"   

Yes, no other columns in this dataset.

Next, we will use the pivot_longer() to transform the data to long format, similar to what we did in the previous section.

We will also again change the country variable to be Country by using the rename() function, and we will make the Year variable numeric using the mutate() function.

Question Opportunity

Using what you just learned about pivot_longer(), rename(), and mutate() and without scrolling up, try to come up with the code to do the wrangling for this data.


Click here to reveal the code.
gdp_growth %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "gdp_growth") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "GDP Growth/Capita (%)") %>%
  rename(GDP = gdp_growth)

Now let’s see how this data has changed:

gdp_growth %>%
  slice_head(n = 6)
# A tibble: 6 x 4
  Country      Year   GDP Label                
  <chr>       <dbl> <dbl> <chr>                
1 Afghanistan  1801    NA GDP Growth/Capita (%)
2 Afghanistan  1802    NA GDP Growth/Capita (%)
3 Afghanistan  1803    NA GDP Growth/Capita (%)
4 Afghanistan  1804    NA GDP Growth/Capita (%)
5 Afghanistan  1805    NA GDP Growth/Capita (%)
6 Afghanistan  1806    NA GDP Growth/Capita (%)
gdp_growth %>%
  count(Year)
# A tibble: 219 x 2
    Year     n
   <dbl> <int>
 1  1801   194
 2  1802   194
 3  1803   194
 4  1804   194
 5  1805   194
 6  1806   194
 7  1807   194
 8  1808   194
 9  1809   194
10  1810   194
# ... with 209 more rows

Again let’s check that the Country variable only contains values we would expect.

# Scroll through the output!
gdp_growth %>%
  distinct(Country) %>%
  pull()
  [1] "Afghanistan"                    "Albania"                       
  [3] "Algeria"                        "Andorra"                       
  [5] "Angola"                         "Antigua and Barbuda"           
  [7] "Argentina"                      "Armenia"                       
  [9] "Australia"                      "Austria"                       
 [11] "Azerbaijan"                     "Bahamas"                       
 [13] "Bahrain"                        "Bangladesh"                    
 [15] "Barbados"                       "Belarus"                       
 [17] "Belgium"                        "Belize"                        
 [19] "Benin"                          "Bhutan"                        
 [21] "Bolivia"                        "Bosnia and Herzegovina"        
 [23] "Botswana"                       "Brazil"                        
 [25] "Brunei"                         "Bulgaria"                      
 [27] "Burkina Faso"                   "Burundi"                       
 [29] "Cambodia"                       "Cameroon"                      
 [31] "Canada"                         "Cape Verde"                    
 [33] "Central African Republic"       "Chad"                          
 [35] "Chile"                          "China"                         
 [37] "Colombia"                       "Comoros"                       
 [39] "Congo, Dem. Rep."               "Congo, Rep."                   
 [41] "Costa Rica"                     "Cote d'Ivoire"                 
 [43] "Croatia"                        "Cuba"                          
 [45] "Cyprus"                         "Czech Republic"                
 [47] "Denmark"                        "Djibouti"                      
 [49] "Dominica"                       "Dominican Republic"            
 [51] "Ecuador"                        "Egypt"                         
 [53] "El Salvador"                    "Equatorial Guinea"             
 [55] "Eritrea"                        "Estonia"                       
 [57] "Ethiopia"                       "Fiji"                          
 [59] "Finland"                        "France"                        
 [61] "Gabon"                          "Gambia"                        
 [63] "Georgia"                        "Germany"                       
 [65] "Ghana"                          "Greece"                        
 [67] "Grenada"                        "Guatemala"                     
 [69] "Guinea"                         "Guinea-Bissau"                 
 [71] "Guyana"                         "Haiti"                         
 [73] "Honduras"                       "Hungary"                       
 [75] "Iceland"                        "India"                         
 [77] "Indonesia"                      "Iran"                          
 [79] "Iraq"                           "Ireland"                       
 [81] "Israel"                         "Italy"                         
 [83] "Jamaica"                        "Japan"                         
 [85] "Jordan"                         "Kazakhstan"                    
 [87] "Kenya"                          "Kiribati"                      
 [89] "Kuwait"                         "Kyrgyz Republic"               
 [91] "Lao"                            "Latvia"                        
 [93] "Lebanon"                        "Lesotho"                       
 [95] "Liberia"                        "Libya"                         
 [97] "Liechtenstein"                  "Lithuania"                     
 [99] "Luxembourg"                     "Macedonia, FYR"                
[101] "Madagascar"                     "Malawi"                        
[103] "Malaysia"                       "Maldives"                      
[105] "Mali"                           "Malta"                         
[107] "Marshall Islands"               "Mauritania"                    
[109] "Mauritius"                      "Mexico"                        
[111] "Micronesia, Fed. Sts."          "Moldova"                       
[113] "Monaco"                         "Mongolia"                      
[115] "Montenegro"                     "Morocco"                       
[117] "Mozambique"                     "Myanmar"                       
[119] "Namibia"                        "Nauru"                         
[121] "Nepal"                          "Netherlands"                   
[123] "New Zealand"                    "Nicaragua"                     
[125] "Niger"                          "Nigeria"                       
[127] "North Korea"                    "Norway"                        
[129] "Oman"                           "Pakistan"                      
[131] "Palau"                          "Palestine"                     
[133] "Panama"                         "Papua New Guinea"              
[135] "Paraguay"                       "Peru"                          
[137] "Philippines"                    "Poland"                        
[139] "Portugal"                       "Qatar"                         
[141] "Romania"                        "Russia"                        
[143] "Rwanda"                         "Samoa"                         
[145] "San Marino"                     "Sao Tome and Principe"         
[147] "Saudi Arabia"                   "Senegal"                       
[149] "Serbia"                         "Seychelles"                    
[151] "Sierra Leone"                   "Singapore"                     
[153] "Slovak Republic"                "Slovenia"                      
[155] "Solomon Islands"                "Somalia"                       
[157] "South Africa"                   "South Korea"                   
[159] "South Sudan"                    "Spain"                         
[161] "Sri Lanka"                      "St. Kitts and Nevis"           
[163] "St. Lucia"                      "St. Vincent and the Grenadines"
[165] "Sudan"                          "Suriname"                      
[167] "Swaziland"                      "Sweden"                        
[169] "Switzerland"                    "Syria"                         
[171] "Tajikistan"                     "Tanzania"                      
[173] "Thailand"                       "Timor-Leste"                   
[175] "Togo"                           "Tonga"                         
[177] "Trinidad and Tobago"            "Tunisia"                       
[179] "Turkey"                         "Turkmenistan"                  
[181] "Tuvalu"                         "Uganda"                        
[183] "Ukraine"                        "United Arab Emirates"          
[185] "United Kingdom"                 "United States"                 
[187] "Uruguay"                        "Uzbekistan"                    
[189] "Vanuatu"                        "Venezuela"                     
[191] "Vietnam"                        "Yemen"                         
[193] "Zambia"                         "Zimbabwe"                      

Also looks good!

Energy Use per Person


Now let’s take a look at the energy use per person data (energy_use) using slice_head() and glimpse().

energy_use %>%
  slice_head(n = 3)
# A tibble: 3 x 57
  country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Albania     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
2 Algeria     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
3 Angola      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
# ... with 46 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
#   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, ...

energy_use %>%
  glimpse()
Rows: 169
Columns: 57
$ country <chr> "Albania", "Algeria", "Angola", "Antigua and Barbuda", "Argent~
$ `1960`  <dbl> NA, NA, NA, NA, NA, NA, 3060, 1550, NA, NA, NA, NA, NA, NA, 25~
$ `1961`  <dbl> NA, NA, NA, NA, NA, NA, 3120, 1550, NA, NA, NA, NA, NA, NA, 25~
$ `1962`  <dbl> NA, NA, NA, NA, NA, NA, 3170, 1680, NA, NA, NA, NA, NA, NA, 28~
$ `1963`  <dbl> NA, NA, NA, NA, NA, NA, 3280, 1820, NA, NA, NA, NA, NA, NA, 30~
$ `1964`  <dbl> NA, NA, NA, NA, NA, NA, 3350, 1860, NA, NA, NA, NA, NA, NA, 30~
$ `1965`  <dbl> NA, NA, NA, NA, NA, NA, 3460, 1850, NA, NA, NA, NA, NA, NA, 31~
$ `1966`  <dbl> NA, NA, NA, NA, NA, NA, 3550, 1900, NA, NA, NA, NA, NA, NA, 30~
$ `1967`  <dbl> NA, NA, NA, NA, NA, NA, 3690, 1920, NA, NA, NA, NA, NA, NA, 31~
$ `1968`  <dbl> NA, NA, NA, NA, NA, NA, 3760, 2050, NA, NA, NA, NA, NA, NA, 35~
$ `1969`  <dbl> NA, NA, NA, NA, NA, NA, 3790, 2180, NA, NA, NA, NA, NA, NA, 38~
$ `1970`  <dbl> NA, NA, NA, NA, NA, NA, 4060, 2420, NA, NA, NA, NA, NA, NA, 41~
$ `1971`  <dbl> 785.0, 232.0, 556.0, NA, 1380.0, NA, 3990.0, 2510.0, NA, NA, 6~
$ `1972`  <dbl> 866.0, 261.0, 584.0, NA, 1380.0, NA, 4040.0, 2630.0, NA, NA, 5~
$ `1973`  <dbl> 763.0, 305.0, 568.0, NA, 1410.0, NA, 4260.0, 2830.0, NA, NA, 8~
$ `1974`  <dbl> 777.0, 319.0, 565.0, NA, 1420.0, NA, 4290.0, 2730.0, NA, NA, 9~
$ `1975`  <dbl> 827.0, 330.0, 536.0, NA, 1380.0, NA, 4350.0, 2650.0, NA, NA, 8~
$ `1976`  <dbl> 891, 367, 515, NA, 1400, NA, 4410, 2870, NA, NA, 9580, 98, NA,~
$ `1977`  <dbl> 924.0, 399.0, 494.0, NA, 1420.0, NA, 4670.0, 2800.0, NA, NA, 8~
$ `1978`  <dbl> 1010.0, 477.0, 527.0, NA, 1430.0, NA, 4630.0, 2890.0, NA, NA, ~
$ `1979`  <dbl> 864.0, 586.0, 518.0, NA, 1480.0, NA, 4680.0, 3140.0, NA, NA, 8~
$ `1980`  <dbl> 1150, 579, 511, NA, 1490, NA, 4740, 3070, NA, NA, 7790, 103, N~
$ `1981`  <dbl> 989, 611, 497, NA, 1430, NA, 4690, 2900, NA, NA, 8300, 102, NA~
$ `1982`  <dbl> 967, 771, 473, NA, 1420, NA, 4820, 2830, NA, NA, 9070, 105, NA~
$ `1983`  <dbl> 1000, 808, 469, NA, 1420, NA, 4560, 2840, NA, NA, 8500, 105, N~
$ `1984`  <dbl> 1020, 776, 458, NA, 1450, NA, 4650, 2950, NA, NA, 8830, 104, N~
$ `1985`  <dbl> 917, 786, 470, NA, 1360, NA, 4600, 3050, NA, NA, 9920, 107, NA~
$ `1986`  <dbl> 964, 862, 462, NA, 1420, NA, 4620, 3060, NA, NA, 10300, 111, N~
$ `1987`  <dbl> 922, 828, 461, NA, 1480, NA, 4770, 3170, NA, NA, 9520, 107, NA~
$ `1988`  <dbl> 928, 850, 467, NA, 1500, NA, 4700, 3200, NA, NA, 10500, 114, N~
$ `1989`  <dbl> 896, 820, 465, NA, 1440, NA, 5000, 3140, NA, NA, 10200, 117, N~
$ `1990`  <dbl> 813, 856, 483, 1480, 1410, 2180, 5060, 3240, 3170, 2520, 10600~
$ `1991`  <dbl> 573, 884, 480, NA, 1430, 2320, 4930, 3420, 3090, NA, 10100, 11~
$ `1992`  <dbl> 418, 884, 467, NA, 1480, 1200, 4960, 3250, 2460, NA, 10800, 11~
$ `1993`  <dbl> 412, 868, 468, NA, 1470, 652, 5150, 3260, 2180, NA, 11100, 123~
$ `1994`  <dbl> 441, 819, 459, NA, 1540, 420, 5090, 3230, 1950, NA, 11600, 126~
$ `1995`  <dbl> 417, 839, 445, NA, 1540, 511, 5130, 3370, 1810, NA, 11400, 134~
$ `1996`  <dbl> 448, 798, 445, NA, 1580, 562, 5390, 3580, 1510, NA, 11100, 132~
$ `1997`  <dbl> 385, 805, 443, NA, 1610, 594, 5470, 3550, 1440, NA, 12200, 135~
$ `1998`  <dbl> 427, 821, 430, NA, 1650, 610, 5550, 3610, 1490, NA, 12400, 138~
$ `1999`  <dbl> 576, 864, 439, NA, 1660, 594, 5610, 3590, 1370, NA, 11900, 137~
$ `2000`  <dbl> 580, 866, 437, NA, 1660, 656, 5640, 3570, 1400, NA, 12000, 139~
$ `2001`  <dbl> 597, 856, 442, NA, 1560, 657, 5450, 3760, 1410, NA, 11700, 149~
$ `2002`  <dbl> 660, 904, 447, NA, 1500, 618, 5570, 3770, 1410, NA, 11500, 150~
$ `2003`  <dbl> 648, 949, 466, NA, 1590, 657, 5570, 3970, 1480, NA, 11600, 155~
$ `2004`  <dbl> 715, 948, 462, 1530, 1720, 698, 5600, 4010, 1540, 2060, 10900,~
$ `2005`  <dbl> 720, 974, 431, 1530, 1710, 843, 5560, 4090, 1600, 2110, 11700,~
$ `2006`  <dbl> 707, 1030, 456, 1580, 1840, 865, 5710, 4080, 1560, 2100, 11600~
$ `2007`  <dbl> 680, 1070, 470, 1600, 1850, 973, 5870, 4020, 1410, 2070, 11200~
$ `2008`  <dbl> 711, 1070, 491, NA, 1920, 1030, 5960, 4030, 1520, NA, 11300, 1~
$ `2009`  <dbl> 732, 1150, 514, NA, 1850, 904, 5860, 3800, 1330, NA, 10300, 18~
$ `2010`  <dbl> 729, 1110, 521, NA, 1910, 863, 5790, 4050, 1280, NA, 10200, 20~
$ `2011`  <dbl> 765, 1140, 522, NA, 1930, 944, 5750, 3920, 1370, NA, 9910, 206~
$ `2012`  <dbl> 688, 1220, 553, NA, 1920, 1030, 5570, 3890, 1470, NA, 9660, 21~
$ `2013`  <dbl> 801, 1240, 534, NA, 1950, 1000, 5460, 3920, 1470, NA, 10400, 2~
$ `2014`  <dbl> 808, 1320, 545, NA, 2020, 1020, 5330, 3760, 1500, NA, 10600, 2~
$ `2015`  <dbl> NA, NA, NA, NA, NA, NA, 5480, 3800, NA, NA, NA, NA, NA, NA, 46~

Looks like we have 169 rows and 57 columns where we have a country column and again a set of years. To wrangle the energy_use data, we will again convert the data to long format, rename some variables, and mutate the Year data to be numeric.

Question Opportunity

Again try to come up with the code on your own to wrangle the data.


Click here to reveal the code.
energy_use %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "energy_use") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "Energy Use (kg, oil-eq./capita)") %>%
  rename(Energy = energy_use)

set.seed(123)

energy_use %>%
  slice_sample(n = 3)
# A tibble: 3 x 4
  Country             Year Energy Label                          
  <chr>              <dbl>  <dbl> <chr>                          
1 Dominican Republic  2014    734 Energy Use (kg, oil-eq./capita)
2 Ecuador             2006    671 Energy Use (kg, oil-eq./capita)
3 Turkey              1997   1170 Energy Use (kg, oil-eq./capita)

Now we will check the Country variable:

# Scroll through the output!
energy_use %>%
  distinct(Country) %>%
  pull()
  [1] "Albania"                        "Algeria"                       
  [3] "Angola"                         "Antigua and Barbuda"           
  [5] "Argentina"                      "Armenia"                       
  [7] "Australia"                      "Austria"                       
  [9] "Azerbaijan"                     "Bahamas"                       
 [11] "Bahrain"                        "Bangladesh"                    
 [13] "Barbados"                       "Belarus"                       
 [15] "Belgium"                        "Belize"                        
 [17] "Benin"                          "Bhutan"                        
 [19] "Bolivia"                        "Bosnia and Herzegovina"        
 [21] "Botswana"                       "Brazil"                        
 [23] "Brunei"                         "Bulgaria"                      
 [25] "Cambodia"                       "Cameroon"                      
 [27] "Canada"                         "Cape Verde"                    
 [29] "Chile"                          "China"                         
 [31] "Colombia"                       "Comoros"                       
 [33] "Congo, Dem. Rep."               "Congo, Rep."                   
 [35] "Costa Rica"                     "Cote d'Ivoire"                 
 [37] "Croatia"                        "Cuba"                          
 [39] "Cyprus"                         "Czech Republic"                
 [41] "Denmark"                        "Djibouti"                      
 [43] "Dominica"                       "Dominican Republic"            
 [45] "Ecuador"                        "Egypt"                         
 [47] "El Salvador"                    "Equatorial Guinea"             
 [49] "Eritrea"                        "Estonia"                       
 [51] "Ethiopia"                       "Fiji"                          
 [53] "Finland"                        "France"                        
 [55] "Gabon"                          "Gambia"                        
 [57] "Georgia"                        "Germany"                       
 [59] "Ghana"                          "Greece"                        
 [61] "Grenada"                        "Guatemala"                     
 [63] "Guinea-Bissau"                  "Guyana"                        
 [65] "Haiti"                          "Honduras"                      
 [67] "Hungary"                        "Iceland"                       
 [69] "India"                          "Indonesia"                     
 [71] "Iran"                           "Iraq"                          
 [73] "Ireland"                        "Israel"                        
 [75] "Italy"                          "Jamaica"                       
 [77] "Japan"                          "Jordan"                        
 [79] "Kazakhstan"                     "Kenya"                         
 [81] "Kiribati"                       "Kuwait"                        
 [83] "Kyrgyz Republic"                "Latvia"                        
 [85] "Lebanon"                        "Lesotho"                       
 [87] "Libya"                          "Lithuania"                     
 [89] "Luxembourg"                     "Macedonia, FYR"                
 [91] "Malaysia"                       "Maldives"                      
 [93] "Malta"                          "Marshall Islands"              
 [95] "Mauritius"                      "Mexico"                        
 [97] "Moldova"                        "Mongolia"                      
 [99] "Montenegro"                     "Morocco"                       
[101] "Mozambique"                     "Myanmar"                       
[103] "Namibia"                        "Nepal"                         
[105] "Netherlands"                    "New Zealand"                   
[107] "Nicaragua"                      "Niger"                         
[109] "Nigeria"                        "North Korea"                   
[111] "Norway"                         "Oman"                          
[113] "Pakistan"                       "Palau"                         
[115] "Panama"                         "Paraguay"                      
[117] "Peru"                           "Philippines"                   
[119] "Poland"                         "Portugal"                      
[121] "Qatar"                          "Romania"                       
[123] "Russia"                         "Samoa"                         
[125] "Sao Tome and Principe"          "Saudi Arabia"                  
[127] "Senegal"                        "Serbia"                        
[129] "Seychelles"                     "Singapore"                     
[131] "Slovak Republic"                "Slovenia"                      
[133] "Solomon Islands"                "South Africa"                  
[135] "South Korea"                    "South Sudan"                   
[137] "Spain"                          "Sri Lanka"                     
[139] "St. Kitts and Nevis"            "St. Lucia"                     
[141] "St. Vincent and the Grenadines" "Sudan"                         
[143] "Suriname"                       "Swaziland"                     
[145] "Sweden"                         "Switzerland"                   
[147] "Syria"                          "Tajikistan"                    
[149] "Tanzania"                       "Thailand"                      
[151] "Timor-Leste"                    "Togo"                          
[153] "Tonga"                          "Trinidad and Tobago"           
[155] "Tunisia"                        "Turkey"                        
[157] "Turkmenistan"                   "Ukraine"                       
[159] "United Arab Emirates"           "United Kingdom"                
[161] "United States"                  "Uruguay"                       
[163] "Uzbekistan"                     "Vanuatu"                       
[165] "Venezuela"                      "Vietnam"                       
[167] "Yemen"                          "Zambia"                        
[169] "Zimbabwe"                      

Looks good!

US Specific Data


Now we will take a look at the US data about disasters and temperature.

Disasters


First, we consider the disasters that have occurred in the US.

us_disaster
# A tibble: 40 x 57
    Year `Drought Count` `Drought Cost` `Drought Lower 75` `Drought Upper 75`
   <dbl>           <dbl>          <dbl>              <dbl>              <dbl>
 1  1980               1           33.2               26.4               39.6
 2  1981               0            0                  0                  0  
 3  1982               0            0                  0                  0  
 4  1983               1            7.8                5.5                9  
 5  1984               0            0                  0                  0  
 6  1985               0            0                  0                  0  
 7  1986               1            4.2                3.5                5  
 8  1987               0            0                  0                  0  
 9  1988               1           44.4               33.8               54  
10  1989               1            6.4                5.6                7.4
# ... with 30 more rows, and 52 more variables: `Drought Lower 90` <dbl>,
#   `Drought Upper 90` <dbl>, `Drought Lower 95` <dbl>,
#   `Drought Upper 95` <dbl>, `Flooding Count` <dbl>, `Flooding Cost` <dbl>,
#   `Flooding Lower 75` <dbl>, `Flooding Upper 75` <dbl>,
#   `Flooding Lower 90` <dbl>, `Flooding Upper 90` <dbl>,
#   `Flooding Lower 95` <dbl>, `Flooding Upper 95` <dbl>, `Freeze Count` <dbl>,
#   `Freeze Cost` <dbl>, `Freeze Lower 75` <dbl>, `Freeze Upper 75` <dbl>, ...

We are specifically interested in the Year and the variables that contain the word "Count". The other variables represent an estimate of the economic cost in billions of dollars, as well as the upper and lower bounds for simulations used to estimate the economic cost, which show the level of uncertainty in these estimates (at three different levels of confidence) as the true cost is unknown. See here for more information about the data. For this analysis, we will focus just on the number of disasters that occurred each year.

We will select our variables of interest using the select() and contains() functions in the dplyr package. Since we are selecting for variables with the word "Count" we need to use quotation marks around it.

Selecting for the variable Year does not require quotes because it is the full name of one of the existing variables.

us_disaster %<>%
           select(Year, contains("Count"))

us_disaster %>%
  slice_head(n = 6)
# A tibble: 6 x 8
   Year `Drought Count` `Flooding Count` `Freeze Count` `Severe Storm Count`
  <dbl>           <dbl>            <dbl>          <dbl>                <dbl>
1  1980               1                1              0                    0
2  1981               0                0              1                    1
3  1982               0                0              0                    2
4  1983               1                2              1                    0
5  1984               0                0              0                    2
6  1985               0                0              1                    0
# ... with 3 more variables: `Tropical Cyclone Count` <dbl>,
#   `Wildfire Count` <dbl>, `Winter Storm Count` <dbl>

Now we want to create a new variable that will be the sum of all the different types of disasters for each year.

We don’t want to include the Year variable in our sum, so we can exclude it using the select function. To perform the sum for each year, we can use the base rowSums() function.

yearly_disasters <- us_disaster %>% 
                      select(-Year) %>%
                      rowSums()

yearly_disasters
 [1]  3  2  3  5  2  5  2  0  1  5  3  4  7  5  6  5  4  3 10  5  4  2  4  7  5
[26]  6  7  5 12  7  6 16 11  9  8 10 15 16 14 14

We could then add this to our us_diaster tibble like so using the bind_cols function of the dplyr package:

us_disaster %>% bind_cols(Disaters = yearly_disasters)
# A tibble: 40 x 9
    Year `Drought Count` `Flooding Count` `Freeze Count` `Severe Storm Count`
   <dbl>           <dbl>            <dbl>          <dbl>                <dbl>
 1  1980               1                1              0                    0
 2  1981               0                0              1                    1
 3  1982               0                0              0                    2
 4  1983               1                2              1                    0
 5  1984               0                0              0                    2
 6  1985               0                0              1                    0
 7  1986               1                0              0                    1
 8  1987               0                0              0                    0
 9  1988               1                0              0                    0
10  1989               1                0              1                    1
# ... with 30 more rows, and 4 more variables: `Tropical Cyclone Count` <dbl>,
#   `Wildfire Count` <dbl>, `Winter Storm Count` <dbl>, Disaters <dbl>

However, we can actually create and add this new variable directly to the us_disaster tibble by using the mutate() function of dplyr and using the . notation.

We need to use the . notation to indicate that we are using the data that we already used as input (on the left side of the pipe) to our mutate() function (on the right side of our pipe), which in this case is the entire us_disaster tibble for our select() function. The output from the select() function will be used for the rowSums() function.

us_disaster %<>%
  mutate(Disasters = rowSums(select(., -Year)))

us_disaster %>%
  glimpse()
Rows: 40
Columns: 9
$ Year                     <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 198~
$ `Drought Count`          <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, ~
$ `Flooding Count`         <dbl> 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, ~
$ `Freeze Count`           <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, ~
$ `Severe Storm Count`     <dbl> 0, 1, 2, 0, 2, 0, 1, 0, 0, 1, 1, 1, 4, 1, 1, ~
$ `Tropical Cyclone Count` <dbl> 1, 0, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1, 2, 0, 1, ~
$ `Wildfire Count`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, ~
$ `Winter Storm Count`     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 2, ~
$ Disasters                <dbl> 3, 2, 3, 5, 2, 5, 2, 0, 1, 5, 3, 4, 7, 5, 6, ~

Great, now we are going to remove some of these variables and just keep the variables of interest using select().

We are also going to add a new variable called Country to indicate that this data is from the United States. Again this will create a new variable where every value is United States.

us_disaster %<>%
  dplyr::select(Year, Disasters) %>%
  mutate(Country = "United States") %>%
  pivot_longer(cols = c(-Country, -Year),
               names_to = "Indicator",
               values_to = "Value") %>%
  mutate(Label = "Number of Disasters")

us_disaster %>%
  slice_head(n = 6)
# A tibble: 6 x 5
   Year Country       Indicator Value Label              
  <dbl> <chr>         <chr>     <dbl> <chr>              
1  1980 United States Disasters     3 Number of Disasters
2  1981 United States Disasters     2 Number of Disasters
3  1982 United States Disasters     3 Number of Disasters
4  1983 United States Disasters     5 Number of Disasters
5  1984 United States Disasters     2 Number of Disasters
6  1985 United States Disasters     5 Number of Disasters

Great, this looks good now.

Question Opportunity

This dataset was slightly different from the other datasets and therefore required slightly different wrangling. Why was it necessary to exclude the Year variable from the pivot_longer() function? What would happen if we did not exclude Year?

Temperature


Next, we consider the temperature in the US over time.

us_temperature %>%
  slice_head(n = 6)
# A tibble: 6 x 3
    Date Value Anomaly
   <dbl> <dbl>   <dbl>
1 189512  50.3   -1.68
2 189612  52.0   -0.03
3 189712  51.6   -0.46
4 189812  51.4   -0.59
5 189912  51.0   -1.01
6 190012  52.8    0.75

So a few things need to be fixed here.

First, the Date column looks a bit strange. The format of the numbers look like the year followed by the number 12 (representing 12 months).

We want to change this to only keep the first 4 characters in the Date variable string values.

However, first let’s make sure that indeed all of the Date variables are 6 characters long and that they all end with the number 12.

We can use a couple of functions in the stringr package to do this. This package is used for working with character strings. The str_length() function can be used to check the length of each value, while the str_ends() function can be used to check that all the values end with "12".

Let’s start with the str_length() function. These functions in the stringr package require a character vector. Thus we need to pull the values for the Date variable first using the pull() function of the dplyr package.

us_temperature %>%
  pull(Date) %>%
  str_length()
  [1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
 [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
 [75] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
[112] 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Great! It looks like all of the values are 6 characters long.

Now let’s check that they all end with "12". We just need to specify what pattern to look for.

us_temperature %>%
  pull(Date) %>%
  str_ends(pattern = "12")
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[121] TRUE TRUE TRUE TRUE TRUE

Great! Since all of the values are TRUE we know that all of the values in the Date variable end with "12".

It’s a good idea to always check that your data is as you expect.

Now we can use the str_sub() function of the stringr package to remove the "12" from each Date value.

We just need to indicate the start and stop characters.

In this case the start would be 1 and the 4th character would be where we want to stop, so we would use start = 1, stop = 4. We can do this inside of the mutate() function to modify the Date variable. In doing so, we will not need to use pull() to pull the values for the Date variable.

us_temperature %<>%
  mutate(Date = str_sub(Date, start = 1, end = 4))

us_temperature
# A tibble: 125 x 3
   Date  Value Anomaly
   <chr> <dbl>   <dbl>
 1 1895   50.3   -1.68
 2 1896   52.0   -0.03
 3 1897   51.6   -0.46
 4 1898   51.4   -0.59
 5 1899   51.0   -1.01
 6 1900   52.8    0.75
 7 1901   51.9   -0.15
 8 1902   51.6   -0.43
 9 1903   50.6   -1.4 
10 1904   51.2   -0.86
# ... with 115 more rows

We also want to remove the Anomaly variable, which is an indicator of how different the national average temperature for that year was from the average temperature from 1901-2000 which was 52.02°F.

Then, we also want to create a Country variable. We will also change the name of the Date variable to Year so that it will be consistent with our other datasets. We also also want the Year to be numeric. We can accomplish both renaming and changing to numeric by using the mutate() function.

We also want to create an Indicator variable so that we can later tell what data the values in this tibble represent if we combine it with other tibbles and a Label variable, so that we will have informative labels if we make a plot with this data later.

Finally, we remove the Date variable and also order the columns just like the other us data using the select() function.

us_temperature %<>%
  dplyr::select(-Anomaly) %>%
  mutate(Country = "United States",
         Year = as.numeric(Date),
         Indicator = "Temperature",
         Label = "Temperature (Fahrenheit)") %>%
  select(Year, Country, Indicator, Value, Label)

us_temperature %>%
  slice_head(n = 6)
# A tibble: 6 x 5
   Year Country       Indicator   Value Label                   
  <dbl> <chr>         <chr>       <dbl> <chr>                   
1  1895 United States Temperature  50.3 Temperature (Fahrenheit)
2  1896 United States Temperature  52.0 Temperature (Fahrenheit)
3  1897 United States Temperature  51.6 Temperature (Fahrenheit)
4  1898 United States Temperature  51.4 Temperature (Fahrenheit)
5  1899 United States Temperature  51.0 Temperature (Fahrenheit)
6  1900 United States Temperature  52.8 Temperature (Fahrenheit)

Joining data


Now that we have wrangled the individual datasets, we are ready to put everything together. Specifically, we will join the individual datasets into one tibble using *_join() functions available in the dplyr package.

Before we begin though, we will need to make sure that there is at least one variable/column that has the same name across all datasets to be joined. Such variables with common names are called keys for joining your data.

These are the by="x1" arguments below where x1 is the name of the column in both the a and b datasets that we will join together.

[source]

There are several types of *_join() functions to consider. The full_join() function keeps all rows from both tibbles that are being joined and adds NA values as necessary if there are values within the key for either of the tibbles that is is not in the key of the other tibble.

We use the full_join() function as we have different time spans for each dataset and we would like to retain as much data as possible.

The full_join() function will simply create NA values for any of the years that are not in one of the datasets.

First, we check using the base summary() function that there are column names that are consistent in each dataset that we wish to combine.

summary(CO2_emissions)
   Country               Year        Emissions           Label          
 Length:50688       Min.   :1751   Min.   :       0   Length:50688      
 Class :character   1st Qu.:1817   1st Qu.:     550   Class :character  
 Mode  :character   Median :1882   Median :    4390   Mode  :character  
                    Mean   :1882   Mean   :   83808                     
                    3rd Qu.:1948   3rd Qu.:   31925                     
                    Max.   :2014   Max.   :10300000                     
                                   NA's   :33772                        
summary(gdp_growth)
   Country               Year           GDP             Label          
 Length:42486       Min.   :1801   Min.   :-67.500   Length:42486      
 Class :character   1st Qu.:1855   1st Qu.:  0.133   Class :character  
 Mode  :character   Median :1910   Median :  0.633   Mode  :character  
                    Mean   :1910   Mean   :  1.302                     
                    3rd Qu.:1965   3rd Qu.:  2.160                     
                    Max.   :2019   Max.   :145.000                     
                                   NA's   :2392                        
summary(energy_use)
   Country               Year          Energy            Label          
 Length:9464        Min.   :1960   Min.   :    9.58   Length:9464       
 Class :character   1st Qu.:1974   1st Qu.:  505.75   Class :character  
 Mode  :character   Median :1988   Median : 1185.00   Mode  :character  
                    Mean   :1988   Mean   : 2238.82                     
                    3rd Qu.:2001   3rd Qu.: 3030.00                     
                    Max.   :2015   Max.   :22000.00                     
                                   NA's   :3544                         

Question Opportunity

What variable or variables might we want to use to join our data by?


Click here to see an explanation for what variable or variables to join by after you have thought about it.

The Country, and Year variables are present in all of the datasets with values that overlap. Although Label is also present in the datasets, the values do not overlap. We can see that the minimum and maximum year is different for nearly all the datasets.

Next, we need to specify what columns/variables we will be joining by using the by = argument in the full_join() function, (recall that this variable is called the “key”)

data_wide <- CO2_emissions %>%
  full_join(gdp_growth, by = c("Country", "Year", "Label")) %>%
  full_join(energy_use, by = c("Country", "Year", "Label"))

set.seed(123)

data_wide %>%
  slice_sample(n = 6)
# A tibble: 6 x 6
  Country                   Year Emissions Label                      GDP Energy
  <chr>                    <dbl>     <dbl> <chr>                    <dbl>  <dbl>
1 Angola                    1899        NA GDP Growth/Capita (%)    0.425     NA
2 Central African Republic  1974        NA GDP Growth/Capita (%)   -4.92      NA
3 Bahamas                   1832        NA CO2 Emissions (Metric ~ NA         NA
4 Montenegro                1843        NA CO2 Emissions (Metric ~ NA         NA
5 Croatia                   2015        NA Energy Use (kg, oil-eq~ NA         NA
6 Israel                    1885        NA GDP Growth/Capita (%)    0.582     NA

Click here to see an explanation for another option that works well for large numbers of tibbles

We can also do the same thing by using the reduce() function of the purrr package. This takes a list of elements (which can be tibbles) and then applies a function that requires two inputs iteratively using the first pair of elements and creating a single element and then applying the function again to the output element and the next listed element and so on and so forth.

For example we will use a list of tibbles and the full_join() function which requires two tibbles to combine. This will first combine CO2_emissions and gdp_growth and then take the resulting joined tibble and combine this with the energy_use tibble.

You can see that this is a great option if you have many datasets to combine!

data_wide <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year", "Label"))

set.seed(123)

data_wide %>%
  slice_sample(n = 6)
# A tibble: 6 x 6
  Country                   Year Emissions Label                      GDP Energy
  <chr>                    <dbl>     <dbl> <chr>                    <dbl>  <dbl>
1 Angola                    1899        NA GDP Growth/Capita (%)    0.425     NA
2 Central African Republic  1974        NA GDP Growth/Capita (%)   -4.92      NA
3 Bahamas                   1832        NA CO2 Emissions (Metric ~ NA         NA
4 Montenegro                1843        NA CO2 Emissions (Metric ~ NA         NA
5 Croatia                   2015        NA Energy Use (kg, oil-eq~ NA         NA
6 Israel                    1885        NA GDP Growth/Capita (%)    0.582     NA

data_wide %>%
  glimpse()
Rows: 102,638
Columns: 6
$ Country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ Year      <dbl> 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, ~
$ Emissions <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ Label     <chr> "CO2 Emissions (Metric Tons)", "CO2 Emissions (Metric Tons)"~
$ GDP       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ Energy    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

Nice, looks good!

We will also make a long version of this data, where we will create an new variable called Indicator that will indicate what dataset the data came from.

Question Opportunity

Try to come up with the code to do this.


Click here to reveal the code.
data_long <- data_wide %>%
  pivot_longer(cols = c(-Country, -Year, -Label),
               names_to = "Indicator",
               values_to = "Value")

set.seed(123)

data_long %>%
  slice_sample(n = 6)
# A tibble: 6 x 5
  Country      Year Label                       Indicator   Value
  <chr>       <dbl> <chr>                       <chr>       <dbl>
1 Ethiopia     1829 GDP Growth/Capita (%)       GDP        0.0197
2 Tanzania     1820 CO2 Emissions (Metric Tons) Energy    NA     
3 South Sudan  1907 CO2 Emissions (Metric Tons) GDP       NA     
4 Bangladesh   1931 GDP Growth/Capita (%)       GDP       -2     
5 Mongolia     1805 GDP Growth/Capita (%)       Emissions NA     
6 Spain        1805 CO2 Emissions (Metric Tons) Emissions NA     

We will now combine this data with the US data about disasters and temperatures.

us_disaster %>%
  slice_head(n = 6)
# A tibble: 6 x 5
   Year Country       Indicator Value Label              
  <dbl> <chr>         <chr>     <dbl> <chr>              
1  1980 United States Disasters     3 Number of Disasters
2  1981 United States Disasters     2 Number of Disasters
3  1982 United States Disasters     3 Number of Disasters
4  1983 United States Disasters     5 Number of Disasters
5  1984 United States Disasters     2 Number of Disasters
6  1985 United States Disasters     5 Number of Disasters
us_temperature %>%
  slice_head(n = 6)
# A tibble: 6 x 5
   Year Country       Indicator   Value Label                   
  <dbl> <chr>         <chr>       <dbl> <chr>                   
1  1895 United States Temperature  50.3 Temperature (Fahrenheit)
2  1896 United States Temperature  52.0 Temperature (Fahrenheit)
3  1897 United States Temperature  51.6 Temperature (Fahrenheit)
4  1898 United States Temperature  51.4 Temperature (Fahrenheit)
5  1899 United States Temperature  51.0 Temperature (Fahrenheit)
6  1900 United States Temperature  52.8 Temperature (Fahrenheit)

We will now use the bind_rows() function of the dplyr package which will just append the us_temperature data and the us_disaster data after the data_long data.

data_long <-
  list(data_long, us_disaster, us_temperature) %>%
  bind_rows() %>%
  mutate(Country = as.factor(Country))

We also converted the Country column to a factor in the last line of the code chunk.

We can check the top and bottom of the new data_long tibble to see that our us_temperature data is at the bottom. To see the end of our tibble we can use slice_tail() function of the dplyr package.

data_long %>%
  slice_head(n = 6)
# A tibble: 6 x 5
  Country      Year Label                       Indicator Value
  <fct>       <dbl> <chr>                       <chr>     <dbl>
1 Afghanistan  1751 CO2 Emissions (Metric Tons) Emissions    NA
2 Afghanistan  1751 CO2 Emissions (Metric Tons) GDP          NA
3 Afghanistan  1751 CO2 Emissions (Metric Tons) Energy       NA
4 Afghanistan  1752 CO2 Emissions (Metric Tons) Emissions    NA
5 Afghanistan  1752 CO2 Emissions (Metric Tons) GDP          NA
6 Afghanistan  1752 CO2 Emissions (Metric Tons) Energy       NA
data_long %>%
  slice_tail(n = 6)
# A tibble: 6 x 5
  Country        Year Label                    Indicator   Value
  <fct>         <dbl> <chr>                    <chr>       <dbl>
1 United States  2014 Temperature (Fahrenheit) Temperature  52.5
2 United States  2015 Temperature (Fahrenheit) Temperature  54.4
3 United States  2016 Temperature (Fahrenheit) Temperature  54.9
4 United States  2017 Temperature (Fahrenheit) Temperature  54.6
5 United States  2018 Temperature (Fahrenheit) Temperature  53.5
6 United States  2019 Temperature (Fahrenheit) Temperature  52.7
set.seed(123)

data_long %>%
  slice_sample(n = 10)
# A tibble: 10 x 5
   Country      Year Label                       Indicator   Value
   <fct>       <dbl> <chr>                       <chr>       <dbl>
 1 Ethiopia     1829 GDP Growth/Capita (%)       GDP        0.0197
 2 Tanzania     1820 CO2 Emissions (Metric Tons) Energy    NA     
 3 South Sudan  1907 CO2 Emissions (Metric Tons) GDP       NA     
 4 Bangladesh   1931 GDP Growth/Capita (%)       GDP       -2     
 5 Mongolia     1805 GDP Growth/Capita (%)       Emissions NA     
 6 Spain        1805 CO2 Emissions (Metric Tons) Emissions NA     
 7 Germany      1858 GDP Growth/Capita (%)       Emissions NA     
 8 Fiji         1837 CO2 Emissions (Metric Tons) GDP       NA     
 9 Jamaica      1823 CO2 Emissions (Metric Tons) Emissions NA     
10 Iceland      1926 CO2 Emissions (Metric Tons) Emissions NA     

Click here for details about the difference between full_join() and bind_rows()

The difference between this function and the full_join() function is that the bind_rows() function will essentially just append each dataset to each other, whereas the full_join() function collapses data that is comparable. Here, you will see an example of what the data would have been like for data_wide if we had made it using bind_rows() and if full_join() had been used but was not joined by the Label variable. Since the Label variable has unique values for each type of Indicator, this causes the full_join() result to be the same as bind_rows().

Let’s consider an example and look at the values for China in the year of 1980.

First we will use the bind_rows() function which automatically creates NA values for any variable that is missing from a data object that is added by combining the data object with another that contains that missing variable using this function.

data_wide_br <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  bind_rows()

data_wide_br %>%
 filter(Country == "China",
           Year == 1980)
# A tibble: 3 x 6
  Country  Year Emissions Label                             GDP Energy
  <chr>   <dbl>     <dbl> <chr>                           <dbl>  <dbl>
1 China    1980   1470000 CO2 Emissions (Metric Tons)     NA        NA
2 China    1980        NA GDP Growth/Capita (%)            2.16     NA
3 China    1980        NA Energy Use (kg, oil-eq./capita) NA       609

We see that we have three rows of data.

Now we will use the full_join() function two ways. First we will combine by Country and Year and Label.

data_wide_fj_label <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year", "Label"))

data_wide_fj_label %>%
  filter(Country == "China", Year == "1980")
# A tibble: 3 x 6
  Country  Year Emissions Label                             GDP Energy
  <chr>   <dbl>     <dbl> <chr>                           <dbl>  <dbl>
1 China    1980   1470000 CO2 Emissions (Metric Tons)     NA        NA
2 China    1980        NA GDP Growth/Capita (%)            2.16     NA
3 China    1980        NA Energy Use (kg, oil-eq./capita) NA       609

Again we have 3 rows of data. The data produced by bind_rows() and full_join() is identical (which we can check by using the setequal() function of the dplyr package) and has the same dimensions (which we can check by using the dim() base function).

dim(data_wide_br)
[1] 102638      6
dim(data_wide_fj_label)
[1] 102638      6
setequal(data_wide_fj_label, data_wide_br)
[1] TRUE

However, now we will join by only Country and Year:

data_wide_fj <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year"))

data_wide_fj %>%
  filter(Country == "China", Year == "1980")
# A tibble: 1 x 8
  Country  Year Emissions Label.x                       GDP Label.y Energy Label
  <chr>   <dbl>     <dbl> <chr>                       <dbl> <chr>    <dbl> <chr>
1 China    1980   1470000 CO2 Emissions (Metric Tons)  2.16 GDP Gr~    609 Ener~

Now we see that we have only a single row. The data that corresponds to the same year and country has been collapsed into a single but wider row.

This is something to keep in mind when you are wrangling your data. The choice of what function to use and how should depend on how you want the data to be after you combine the different sources of data together.


We have a few more things to do before we leave the data wrangling section.

We will create a new variable called Region that will indicate if the data is about the United States or a different country based on the values in the Country variable. To do this, we will use the case_when() function of the dplyr package.

For example, if the Country variable is equal to "United States" the value for the new variable will also be "United States", where as if the Country variable is not equal to "United States" but is some other character string value, such as "Afghanistan", then the value for the new variable will be "Rest of the World". We can specify that something is not equal by using the != operator.

The new values for the new variable Region are indicated after the specific conditional statements by using the ~ symbol.

data_long %<>%
  mutate(Region = case_when(Country == "United States" ~ "United States",
                            Country != "United States" ~ "Rest of the World"))

data_long  %>%
  arrange(Country) %>%
  slice_head(n = 6)
# A tibble: 6 x 6
  Country      Year Label                       Indicator Value Region          
  <fct>       <dbl> <chr>                       <chr>     <dbl> <chr>           
1 Afghanistan  1751 CO2 Emissions (Metric Tons) Emissions    NA Rest of the Wor~
2 Afghanistan  1751 CO2 Emissions (Metric Tons) GDP          NA Rest of the Wor~
3 Afghanistan  1751 CO2 Emissions (Metric Tons) Energy       NA Rest of the Wor~
4 Afghanistan  1752 CO2 Emissions (Metric Tons) Emissions    NA Rest of the Wor~
5 Afghanistan  1752 CO2 Emissions (Metric Tons) GDP          NA Rest of the Wor~
6 Afghanistan  1752 CO2 Emissions (Metric Tons) Energy       NA Rest of the Wor~

We can also remove rows for countries with NA values using the drop_na() function of the tidyr package to drop all years with missing data.

data_long_with_miss <-
  data_long %>%
  arrange(Country)

data_long %<>%
  drop_na() %>%
  arrange(Country)

You can see that by removing the NA values the data for Afghanistan starts at 1949 instead of 1751.

data_long %>%
  slice_head(n = 6)
# A tibble: 6 x 6
  Country      Year Label                       Indicator Value Region          
  <fct>       <dbl> <chr>                       <chr>     <dbl> <chr>           
1 Afghanistan  1949 CO2 Emissions (Metric Tons) Emissions  14.7 Rest of the Wor~
2 Afghanistan  1950 CO2 Emissions (Metric Tons) Emissions  84.3 Rest of the Wor~
3 Afghanistan  1951 CO2 Emissions (Metric Tons) Emissions  91.7 Rest of the Wor~
4 Afghanistan  1952 CO2 Emissions (Metric Tons) Emissions  91.7 Rest of the Wor~
5 Afghanistan  1953 CO2 Emissions (Metric Tons) Emissions 106   Rest of the Wor~
6 Afghanistan  1954 CO2 Emissions (Metric Tons) Emissions 106   Rest of the Wor~

Question Opportunity

Using only data from the US, calculate the first and last year that a value was reported for each variable (e.g. CO2 emissions, energy use, etc).

Hint : Use the group_by() and summarize() functions in the dplyr package.

To allow users to skip import and wrangling we will save the data as an RDA file as well as a CSV file as this is often useful to send our data to collaborators. We will save this in a “wrangled” subdirectory of our “data” directory of our working directory.

save(data_long, file = here::here("data", "wrangled", "wrangled_data.rda"))
readr::write_csv(data_long, path = here::here("data","wrangled", "wrangled_data.csv"))

Data Visualization


If you have been following along but stopped, we could load our wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you skipped the data wrangling section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data using the following code:

wrangled_rda("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Now we will create some simple plots to examine the CO2 emissions over time using the ggplot2 package.

As you may have already seen, there are many functions available in base R that can create plots (e.g. plot(), boxplot()). Others include: hist(), qqplot(), etc. These functions are great because they come with a basic installation of R and can be quite powerful when you need a quick visualization of something when you are exploring data.

We are choosing to introduce ggplot2 because, in our opinion, it is one of the simplest ways for beginners to create relatively complicated plots that are intuitive and aesthetically pleasing.

The ggplot2 R package


The reasons ggplot2 is generally intuitive for beginners is the use of grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will introduce, and once you are comfortable with them, you will be able to create (or “write”) hundreds of different plots.

The critical part to making graphics using ggplot2 is the data needs to be in a tidy format. Given that we have just spent time putting our data in tidy format, we are primed to take advantage of all that ggplot2 has to offer!

We will show how it is easy to pipe tidy data (output) as input to other functions that create plots. This all works because we are working within the tidyverse.

What is the ggplot() function?

As explained by Hadley Wickham:

the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

ggplot2 Terminology

  • ggplot - the main function where you specify the dataset and variables to plot (this is where we define the x and y variable names)
  • geoms - geometric objects
    • e.g. geom_point(), geom_bar(), geom_line(), geom_histogram()
  • aes - aesthetics
    • shape, transparency, color, fill, line types
  • scales - define how your data will be plotted
    • continuous, discrete, log, etc

The function aes() is an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.

Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.

CO2 Emissions

Let’s start by plotting the CO2 emissions over time. Because our dataset contains other variables, we first need to filter our data to only include the CO2 emissions data by using the filter() function of the dplyr package. To use this function we need to specify what value (e.g. Emissions) we want for a given variable or column (e.g. Indicator).

In this case, we filter to keep all rows where the Indicator variable is equal to the word Emissions. Notice that this needs to be in quotes, while the variable name does not.

data_long %>%
  filter(Indicator == "Emissions")
# A tibble: 16,916 x 6
   Country      Year Label                       Indicator Value Region         
   <fct>       <dbl> <chr>                       <chr>     <dbl> <chr>          
 1 Afghanistan  1949 CO2 Emissions (Metric Tons) Emissions  14.7 Rest of the Wo~
 2 Afghanistan  1950 CO2 Emissions (Metric Tons) Emissions  84.3 Rest of the Wo~
 3 Afghanistan  1951 CO2 Emissions (Metric Tons) Emissions  91.7 Rest of the Wo~
 4 Afghanistan  1952 CO2 Emissions (Metric Tons) Emissions  91.7 Rest of the Wo~
 5 Afghanistan  1953 CO2 Emissions (Metric Tons) Emissions 106   Rest of the Wo~
 6 Afghanistan  1954 CO2 Emissions (Metric Tons) Emissions 106   Rest of the Wo~
 7 Afghanistan  1955 CO2 Emissions (Metric Tons) Emissions 154   Rest of the Wo~
 8 Afghanistan  1956 CO2 Emissions (Metric Tons) Emissions 183   Rest of the Wo~
 9 Afghanistan  1957 CO2 Emissions (Metric Tons) Emissions 293   Rest of the Wo~
10 Afghanistan  1958 CO2 Emissions (Metric Tons) Emissions 330   Rest of the Wo~
# ... with 16,906 more rows

We also need to sum the emissions across countries for each year. Here, we use the group_by() and summarize() function that we previously learned about.

data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value))
# A tibble: 264 x 2
    Year Emissions
   <dbl>     <dbl>
 1  1751      9360
 2  1752      9360
 3  1753      9360
 4  1754      9370
 5  1755      9370
 6  1756     10000
 7  1757     10000
 8  1758     10000
 9  1759     10000
10  1760     10000
# ... with 254 more rows

Then, we use the aes() argument of the ggplot() function to define that our x-axis will be the Year variable, the y-axis will be the emission Value variable, and that our data should be grouped or separated by the Country variable.

data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions))

Looks like we got a blank plot. What happened? We need to tell R what type of plot we want. To do that, we need to add another layer to define how we want the plot to look. We do so by using the + sign in between each command.

Line plots

To tell R what type of plot we want, we need to add another layer to our ggplot object. To add a type of plot, we can use one of the many geom_* functions in ggplot2. For example, type geom into the RStudio console and you will see many options to scroll through.

Here, we will use the geom_line() function because we would like to create a line plot. We also use the size argument in geom_line() to control the size of the line.

data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5)

Wow, the CO2 is really rising sharply!

Finally, let’s make this plot really nice by adding a few final touches. To change title, caption, and the axis labels, we can use the labs() function. Again, notice that a plus sign is used between each layer that we add to the plot. To make CO2 appear with a subscript we can use ~CO[2]~.

data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)")

Next, we use the theme() function to change the font size of the x-axis, y-axis, axis titles, and the caption as shown below. To know what to call each element of the plot in this function to change the size type ?theme() in the console.

You will see a very large list that includes other plot aspects like the background and the legend. This function can be used to modify your plot to your specifications.

data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))

We can clearly see that global CO2 emissions have dramatically risen since 1900.

Notice, we used the function theme_linedraw() of ggplot2 to change the general appearance of the plot.

Useful tip: You can type theme_ in the RStudio console to see the various plot theme options available.

Great! We’ve created our first plot.

Before we leave this section, let’s save this theme so we do not have to keep typing the same code in future plots.

my_theme <-
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))

In this way, we can just add another layer to our plot with the my_theme we created with the specifications of the sizes of the title and axis labels.

CO2_world <-
  data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)") +
  my_theme

We are also saving the plot to an object called CO2_world. To show the plot we simply type the name of the object:

CO2_world

Now let’s say we wanted to save this plot.

We could do so using the using the save() function to save this to a “plot” directory in our working directory as an RDA file and we can use the png() function to save a png for collaborators. We need to use dev.off() function to close the graphical device that we will use to create the png version of the plot so that we are ready to make another plot like this.

save(CO2_world, file =here::here("plots", "CO2_world.rda"))
png(here::here("plots", "CO2_world.png"))
CO2_world
dev.off()
png 
  2 

One thing that would be nice to know is which countries are contributing the most or the least to CO2 emissions. Let’s continue to explore the data to investigate how CO2 emissions from individual countries have changed over time.

Next, we go back to using our data_long dataset. Here, we use the group argument in aes() which controls whether a line should be drawn

data_long %>%
  filter(Indicator == "Emissions") %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
  geom_line() +
  ylab("Emissions") +
  my_theme

We can see that many countries show a dramatic increase in emissions over time with a handful of countries with particularly high levels.

Question Opportunity

  • What happens when you do not use the group = Country argument?
  • What other aesthetic (i.e. aes() arguments) can be changed?

Since we have many overlapping lines, we will make our lines slightly transparent by using the alpha argument.

This takes values from 0 to 1, where 0 is completely transparent and 1 is completely opaque.

We also add our my_theme controlling the size of the title and axis labels.

CO2_countries <-
  data_long %>%
  filter(Indicator == "Emissions") %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
  geom_line(alpha = 0.4) +
  labs(title = "Country" ~CO[2]~ "Emissions per Year (1751-2014)",
     caption = "Limited to reporting countries",
           y = "Emissions (Metric Tonnes)") +
  my_theme

CO2_countries

Our plot is starting to look really good. One question you still might have is which country corresponds to which line? Which line indicates the emissions in the US?

Adding color

We can add another “layer” on top of our first plot to add a blue line just for the US data. To do this we need to indicate what data we would like to plot, so we need to filter for just the US data and then we need to indicate that it will be colored by Country, even though in this case we only have one line to color. The default color would be a salmon pink color, but we would like blue. So we will use the scale_color_manual() function to manually choose the color that we want by using scale_colour_manual(values = c("blue")). Often you might use red to highlight a subset of the data, however, this can be difficult for viewers with certain types of colorblindness.

Notice how the color name needs to be in quotes and that the argument values = is used to specify what color values to use.

We can add this line to the plot in two ways.

Useful tip: Instead of retyping the original code to create the CO2_countries plot, we can just use the + operator:

CO2_countries +
  geom_line(data = data_long %>%
              filter(Indicator == "Emissions",
                     Country == "United States"),
              aes(x = Year, y = Value, color = Country)) +
  scale_colour_manual(values = c("blue"))

It looks like the US has long been the largest CO2 emission producing country until recently, when the US was surpassed by another country.

Let’s figure out who are the top 10 emission producing countries in 2014. Here, we filter the data for the year 2014, which was the final year of the data. Then, we can make a rank variable based on the Value variable for the amount of emissions produced.

There are many functions in the dplyr package for ranking values that are based on the SQL or specifically SQL rank functions. SQL is another programming language for managing large amounts of data. The difference in the rank functions mostly has to do with how to deal with ties in the data.
We will use dense_rank(), as we do not want gaps between ranks.

We want to do this in descending order because we want to rank by largest to smallest, so we will use the desc() function of the dplyr package. Then, we will arrange the output by rank using the arrange() function of the dplyr package.

top_10_count <-
  data_long %>%
  filter(Indicator == "Emissions", Year == 2014) %>%
  mutate(rank = dense_rank(desc(Value))) %>%
  filter(rank <= 10) %>%
  arrange(rank)

top_10_count
# A tibble: 10 x 7
   Country        Year Label                       Indicator  Value Region  rank
   <fct>         <dbl> <chr>                       <chr>      <dbl> <chr>  <int>
 1 China          2014 CO2 Emissions (Metric Tons) Emissions 1.03e7 Rest ~     1
 2 United States  2014 CO2 Emissions (Metric Tons) Emissions 5.25e6 Unite~     2
 3 India          2014 CO2 Emissions (Metric Tons) Emissions 2.24e6 Rest ~     3
 4 Russia         2014 CO2 Emissions (Metric Tons) Emissions 1.71e6 Rest ~     4
 5 Japan          2014 CO2 Emissions (Metric Tons) Emissions 1.21e6 Rest ~     5
 6 Germany        2014 CO2 Emissions (Metric Tons) Emissions 7.2 e5 Rest ~     6
 7 Iran           2014 CO2 Emissions (Metric Tons) Emissions 6.49e5 Rest ~     7
 8 Saudi Arabia   2014 CO2 Emissions (Metric Tons) Emissions 6.01e5 Rest ~     8
 9 South Korea    2014 CO2 Emissions (Metric Tons) Emissions 5.87e5 Rest ~     9
10 Canada         2014 CO2 Emissions (Metric Tons) Emissions 5.37e5 Rest ~    10

We can see that China is now the top emission producing country.

Question Opportunity

What are the bottom 10 emission producing countries in 2014?

Let’s make a plot of just these top ten countries.

To do this, we need to filter the data to just these top countries by using the %in% operator to only keep countries in our Country variable that are also in the Country variable within top_10_count. We can use the pull() function also of the dplyr package to specifically grab just the Country data out of top_10_count.

Since we have 10 countries we will want to differentiate them by color.

To color our plot we will use the viridis color palette which is compatible with color-blindness by using the scale_color_viridis_d() function. This function is available by loading the ggplot2 package. There are a few variations for discrete values as _d, or binned continuous values as _b, or continuous values as _c. See here for more information.

Top10b <- data_long %>%
  filter(Country %in% pull(top_10_count, Country)) %>%
  filter(Indicator == "Emissions") %>%
  filter(Year >= 1900) %>%
  ggplot(aes(x = Year, y = Value, color = Country)) +
    geom_line() +
    scale_color_viridis_d() +
    labs(title = "Top 10 Emissions-producing Countries in 2010 (1900-2014)",
         subtitle = "Ordered by Emissions Produced in 2014",
         y = "Emissions (Metric Tons)",
         x = "Year") +
    my_theme

Top10b

It’s still a bit difficult to tell which line corresponds to which country. So, let’s add a text label directly to the plot. .

Adding text labels

One way to do this is to add text layer to our plot using the geom_text() function of the ggplot2 package. We need to first specify what variable or column we will use. However, we only want to pull the text labels for the top ten countries in the last year. To do this, we use the last() function of the dplyr package.

Then, we need to indicate that our text label will be based on the Country variable using the aes() aesthetics mapping argument. We will also get rid of our legend since we will not need it anymore, by using the theme() function of the ggplot2 package.

Top10b +
  geom_text(data = data_long %>%
              filter(Country %in% pull(top_10_count, Country)) %>%
              filter(Indicator == "Emissions") %>%
              filter(Year == last(Year)),
            aes(label = Country)) +
  theme(legend.position = "none")

Not bad, but some of the labels are overlapping and difficult to read. We can use the check_overlap = TRUE argument within the geom_text() function to remove overlapping variables. Also, we can expand the plot area horizontally so that the names are not cutoff by using scale_x_continuous(expand = c(0.2,0)). This takes a vector with two values. The first value indicates what percentage to expand the x axis in both directions. In our case we will expand by 15 percent. The second value indicates what absolute value to expand the limit of the x axis. In our case c(0.15,0) will achieve a similar result as c(0, 17), as the range of values from 1990 to 2014 is 114 years and 15% of this is 17.

Top10b +
  geom_text(data = data_long %>%
              filter(Country %in% pull(top_10_count, Country)) %>%
              filter(Indicator == "Emissions") %>%
              filter(Year == last(Year)),
            aes(label = Country),
            check_overlap = TRUE) +
  scale_x_continuous(expand = c(0.15, 0)) +
  theme(legend.position = "none")

This is easier to read now, but it also causes us to lose some of the labels. There are several alternative ways we can keep all of our labels and make them easier to read. The first package we will show is called directlabels.

The most simple option is to use the direct.label() function, which will automatically add labels at the end of the lines. However, it is a bit difficult to see some of our labels as they get automatically sized to fit the plot.

direct.label(Top10b) +
  scale_x_continuous(expand = c(0.3, 0))

Alternatively this can be done within the ggplot2 framework by layering using the geom_dl() function.

Top10b +
  scale_x_continuous(expand = c(0.3, 0)) +
  geom_dl(aes(label = Country), method = list("last.bumpup")) +
  theme(legend.position = "none")

This is more legible now. We have all 10 countries names listed and they are in order of the last data point and they are relatively close to the lines that they correspond to.

Another option is to use a different method in the directlables package. Here is a list of options.

For example, the "angled.boxes" method looks nice for some plots but does not work very well for our plot:

direct.label(Top10b, method = list("angled.boxes")) +
  scale_x_continuous(expand = c(0.3, 0))

However the "last.polygons" method works quite well:

direct.label(Top10b, method = list("last.polygons")) +
  scale_x_continuous(expand = c(0.3, 0))

The second package is the ggrepel package, which is especially good for crowded labels that might overlap one another. It allows for more control than the directlabels package.

Specifically, we will use the geom_text_repel() function from the ggrepel package. Just like with geom_text(), first we need to specify what data we want to include. Then, we specify with the aes() argument that our label will be based on the Country variable and we again specify what variable to use for our x axis and y axis, so that we indicate where the labels should be plotted.

Top10b +
  geom_text_repel(data = data_long %>%
                    filter(Country %in% pull(top_10_count, Country)) %>%
                    filter(Indicator == "Emissions") %>%
                    filter(Year == last(Year)),
                  aes(label = Country, x = Year, y = Value)) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0.3, 0))

You can see that this package creates segments that connect the label to the line.

There are many arguments to use to style your labels just the way that you want:

[source]

See the ggrepel vignette for more details.

Let’s play around with some of these options in the table above.

Top10b +
  geom_text_repel(data = data_long %>%
                    filter(Country %in% pull(top_10_count, Country)) %>%
                    filter(Indicator == "Emissions") %>%
                    filter(Year == last(Year)),
                  aes(label = Country, x = Year, y = Value),
                  nudge_x = 10,
                  hjust = 1,
                  vjust = 1,
                  segment.size = 0.25,
                  force = 1) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0.3, 0)) +
  scale_y_continuous(expand = c(0.3, 0))

Nice, that looks pretty good. For fun, let’s try showing our data in an entirely different way.

Tile plots

This time we will create a geom_tile plot.

To create this plot we will filter our data to include only the Countries included in the Country variable of the top_10_count.

Then, we will use the fct_reorder() function of the forcats package to order our countries based on the last emission value in 2014.

To use this function, the variable that is to be reordered is listed first. The variable that is being used to determine the order is listed second. Finally, a function to apply to the variable listed second is listed third. This function is used to determine the order. In this case, we want to determine the last value of the Value variable using the last() function (recall that this is also a function of the dplyr package). Then, the Country variable will be ordered by the last value of the Value variable.

To color our plot we will use the viridis color palette again but this time we will use the scale_fill_viridis_c() – recall that the _c indicates a continuous scale. See the scale_viridis reference for more information.

Top10t <-
  data_long %>%
  filter(Country %in% pull(top_10_count, Country)) %>%
  filter(Indicator == "Emissions") %>%
  filter(Year >= 1900) %>%
  ggplot(aes(x = Year, y = fct_reorder(Country, Value, last))) +
    geom_tile(aes(fill = log(Value))) +
    scale_fill_viridis_c()

Finally, let’s clean up the axes and the axes labels:

Top10t <- Top10t +
  scale_x_continuous(breaks = seq(1900, 2014, by = 5),
                     labels = seq(1900, 2014, by = 5)) +
  labs(title = "Top 10 " ~CO[2]~ "Emission-producing Countries",
       subtitle = "Ordered by Emissions Produced in 2014",
       fill = "Ln(CO2 Emissions (Metric Tonnes))") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, angle = 90, color = "black"),
        axis.text.y = element_text(size = 12, color = "black"),
        axis.title = element_blank(),
        plot.caption = element_text(size = 12),
        plot.title = element_text(size = 16),
        legend.position = "bottom")

Top10t

Now let’s say we wanted to save this plot.

We could do so using the using the save() function to save this to a “plot” directory in our working directory as an RDA file and we can use the png() function to save a png for collaborators. We need to use dev.off() function to close the graphical device that we will use to create the png version of the plot so that we are ready to make another plot like this.

save(Top10t, file =here::here("plots", "Top10t.rda"))
png(here::here("plots", "Top10t.png"))
Top10t
dev.off()
png 
  2 

We see that Germany had very low emission rates at the end of World War II. We also see that the US has consistently had high emission rates since 1900, but the emission rates in China recently surpassed that of the US. The portions of the plot that are white indicate that there is no emission data for that country.

Question Opportunity

Think about what the pros and cons are of tile and line plots. In what situations would a tile plot be a better choice for effective scientific communication? In what situations would a line plot be a better choice?

More than one variable


Now, we will visualize all the variables in our dataset.

Faceted plots

Here, we use the facet_wrap() function of the ggplot2 package, which plots multiple subplots simultaneously.

To use facet_wrap() with the option for a different y-axis scale for each subplot, we need to set the scales argument equal to "free_y". We can also indicate where we would like the label for the subplots to be located by using the strip.position argument.

ggplot(data_long, aes(x = Year, y = Value, group = Country)) +
  geom_line(alpha = 0.2) +
  geom_line(data = data_long %>%
              filter(Country == "United States"),
              aes(x = Year, y = Value, color = Country)) +
  scale_colour_manual(values = c("blue")) +
  labs(title = "Distribution of Indicators by Year and Value",
       y = "Indicator Value") +
  my_theme +
  theme(strip.text = element_text(size = 16, face = "bold")) +
  facet_wrap(Indicator ~ .,
             scales = "free_y",
             strip.position = "right",
             ncol = 1)

Notice that we can change the size or style of the font for these labels using the strip.text = argument of the theme() function. We can also specify how many rows or columns we would like the subplots to be shown.

We can also facet by more than one variable (e.g. Indicator and Region to show the data from the US compared to other countries).

In this case we want the same y-axis to be used across the rows. We will use the facet_grid() function this time instead of facet_wrap() because of the way that the two facet variables are displayed. The facet_grid() function will as you might expect, create an output of plots that are displayed in a grid.

The syntax here is to put the name of the two variables on the left or right side of the ~ symbol, which tells you to facet by rows (left) or columns (right).

First, we will filter out the data about disasters and temperature as this is only for the US, by using the filter() function and ! indicates that we want only values of the Indicator variable not in the list containing "Disasters" and "Temperature".

data_long %>%
  filter(!(Indicator %in% c("Disasters", "Temperature"))) %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
    geom_line() +
    facet_grid(Indicator ~ Region, scales = "free_y") +
    labs(title = "Distribution of Indicators by Year and Value",
         y = "Indicator Value") +
    my_theme +
    theme(strip.text = element_text(size = 16, face = "bold"))

From these plots we can see that each type of data spans a different time span.

Question Opportunity

What happens when you create the same plot with facet_wrap()? Why might this be preferable in certain cases?

Question Opportunity

Calculate the total number of countries per year reporting C02 emissions, energy use and GDP. Plot this summary statistic (y-axis) across the years (x-axis). What do you see?

Hint : Use the tally() function in the dplyr package.

Line segment plots

There are also some other common visualization techniques that are good for showing the difference between a set of observations and a mean value across time.

One of those is a line segment plot. For simplicity, let’s focus only on the data from the US. Let’s calculate the mean across all years for the CO2 emission and temperature from 1980 to 2010. Recall that our Indicator variable describes what kind of data we have (Emissions, Temperature, GDP, Energy, Disasters). We will calculate the mean for each Indicator set of data by first grouping by this variable and then calculating the mean of the values of the Value variable for each set of data. We will call this new variable Mean. Then, we calculate the difference between each observation and the mean and create a new variable for these values called Diff_from_mean. Again, all of this is performed for each group of data separately. Once we have created the new variables, we want to use the ungroup() function so that we no longer perform functions on subsets of the data based on the Indicator variable. Finally, we will also create a factor variable about the sign of the Diff_from_mean value to distinguish positive or negative changes. We will use this to color our plots.

data_long_us <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2010) %>%
  group_by(Indicator) %>%
  mutate(Mean = mean(Value), Diff_from_mean = Value - Mean) %>%
  ungroup() %>%
  mutate(Diff_color = sign(Diff_from_mean)) %>%
  mutate(Diff_color = as.factor(Diff_color))
glimpse(data_long_us)
Rows: 155
Columns: 9
$ Country        <fct> "United States", "United States", "United States", "Uni~
$ Year           <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1~
$ Label          <chr> "CO2 Emissions (Metric Tons)", "CO2 Emissions (Metric T~
$ Indicator      <chr> "Emissions", "Emissions", "Emissions", "Emissions", "Em~
$ Value          <dbl> 4720000, 4540000, 4310000, 4340000, 4480000, 4490000, 4~
$ Region         <chr> "United States", "United States", "United States", "Uni~
$ Mean           <dbl> 5134194, 5134194, 5134194, 5134194, 5134194, 5134194, 5~
$ Diff_from_mean <dbl> -414193.548, -594193.548, -824193.548, -794193.548, -65~
$ Diff_color     <fct> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,~

Next, we use the geom_segment() function to draw a straight line between points (x, y) and (xend, yend). In our case, this creates a plot that shows a bar for the difference between the observation and the mean across all the years.

data_long_us %>%
  filter(Indicator %in% c("Emissions", "Temperature", "Disasters")) %>%
  ggplot(aes(x = Year, y = Value)) +
    geom_segment(aes(x = Year, y = Value, xend = Year,
                     yend = Mean, color = Diff_color),
                 size = 3.25) +
  scale_color_manual(values = c("blue", "red")) +
  geom_hline(aes(yintercept = Mean), linetype = 1, color = "black") +
  facet_wrap(Indicator ~ ., scales = "free_y", ncol = 1) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90),
         axis.title = element_blank(),
    legend.position = "none")  +
  labs(title = "US Disasters, Emissions, and Temperatures (1990-2010)",
    subtitle = "Indicator Mean of 1990-2010 Represented by Solid Black Line")

We can see from this plot that overall there has been an increase in disasters, emissions and temperature in the most recent years.

Question Opportunity

What trends do you see in GDP and energy use across time?

Scatter plots

Next, let’s zoom in on two of the variables: CO2 emissions and temperature. We use years between 1980 and 2014 as we have values for all of those years for these two variables.

We know that the datasets do not span the same amount of time. So let’s limit this plot to only the years where the data overlaps for both CO2 emissions and temperature.

We use the geom_point() function to create a scatter plot between the x and y variable defined in aes() where x is time and y is one of the two variables. We also add a line on top of the scatter plot that smooths the trend from the points. The smoother we used here is loess locally estimated scatterplot smoothing, a type of local polynomial regression fitting. This is a nonparametric regression that is also called a “moving regression” where subsets of points that are close to one another (hence the term local) are used in a least squares linear or nonlinear fit. Thus this results in a fit that may curve with the data.

CO2_temp_US_facet <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2014,
         Indicator %in% c("Emissions", "Temperature")) %>%
  ggplot(aes(x = Year, y = Value)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  scale_x_continuous(breaks = seq(1980, 2014, by = 5),
                     labels = seq(1980, 2014, by = 5)) +
  facet_wrap(Label ~ ., scales = "free_y", ncol = 1) +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, angle = 90, color = "black"),
        axis.text.y = element_text(size = 12, color = "black"),
        strip.text.x = element_text(size = 14),
         axis.title = element_blank(),
         plot.title = element_text(size = 16))
  labs(title = "US Emissions and Temperatures (1980-2014)")
$title
[1] "US Emissions and Temperatures (1980-2014)"

attr(,"class")
[1] "labels"
CO2_temp_US_facet

Note, we are showing a different theme here, namely the theme_classic() theme. We can see that there are similar patterns of CO2 emission levels and average annual temperatures.

We will save this plot now like so to our “plots” directory:

save(CO2_temp_US_facet, file =here::here("plots", "CO2_temp_US_facet.rda"))
png(here::here("plots", "CO2_temp_US_facet.png"))
CO2_temp_US_facet
dev.off()
png 
  2 

Question Opportunity

  • Try show a similar plot without filtering by years and faceting by the other three variables: energy use, GDP and disasters. Are there other variables that look like they might have a similar pattern to CO2 emissions?
  • Try using a different smoother in geom_smooth().

Next, instead of looking at the variables separately in faceted plots, let’s look at the relationship between CO2 emissions and other variables directly. Thus, it is useful to have each of these indicators as their own variable.

We can do this by using pivot_wider() to transform our long data table into a wide format.

wide_US <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2014) %>%
  select(-Label) %>%
  pivot_wider(names_from = Indicator, values_from = Value)

Let’s save this data as an rda file for future use and as a csv file, as this is often useful for collaborators. We will save this in a “wrangled” subdirectory of our “data” directory of our working directory.

save(wide_US, file = here::here("data", "wrangled", "wrangled_US_data.rda"))
readr::write_csv(wide_US, path = here::here("data", "wrangled", "wrangled_US_data.csv"))

Now we can specify which indicators we want to look at, so now we can specifically look at emissions and temperature.

CO2_temp_US <-
  wide_US %>%
  ggplot(aes(x = Emissions, y = Temperature)) +
    geom_point() +
    theme_classic() +
    theme(axis.text.x = element_text(size = 12, color = "black"),
          axis.text.y = element_text(size = 12, color = "black"),
           axis.title = element_text(size = 14),
           plot.title = element_text(size = 16)) +
    labs(title = "US Emissions and Temperature (1980-2014)",
         x = "Emissions (Metric Tonnes)",
         y = "Temperature (Fahrenheit)")


CO2_temp_US

It might be helpful to add a trend line to this. We can do so by using the geom_smooth() function of the ggplot2 package.

If we want to look at a linear trend we need to specify the method using the method = lm argument. This adds a line to the data based on a linear model of the data using the lm function of the stats package. We will discuss the se = FALSE argument later.

We can just add this to the plot object that we just created to create a plot with this trend line.

CO2_temp_US <- CO2_temp_US + geom_smooth(method = "lm", se = FALSE)
CO2_temp_US 

Indeed, it does look like there is a positive, linear trend.

We will also save this plot:

save(CO2_temp_US, file =here::here("plots", "CO2_temp_US.rda"))
png(here::here("plots", "CO2_temp_US.png"))
CO2_temp_US
dev.off()
png 
  2 

Question Opportunity

  • Make similar plots for between CO2 emissions and other variables.
  • Are there other pairs variables that look like they might have a similar pattern to CO2 emissions?
  • Does this match what we saw above?
  • Do these trend look linear or non-linear?

Now that we see that there might be a linear relationship between CO2 emissions and temperature, let’s learn about some statistical techniques to measure the strength of that relationship.

Data Analysis


If you are following along and stopped you could load the data you will need like so:

load(here::here("data", "wrangled", "wrangled_US_data.rda"))

If you skipped the previous sections click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data using the following code:

wrangled_rda("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_US_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data (called wrangled_US_data.rda) can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_US_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



In this section, we are going to introduce some ways to better understand how two variables move together. We will focus on the CO2 emissions and temperature, but you will be encouraged to explore the relationship between CO2 emissions and the other variables.

Basic summary statistics


We can always calculate the sample mean and variance for two variables.

wide_US %>%
  summarize(mean(Emissions), mean(Temperature), sd(Emissions), sd(Temperature))
# A tibble: 1 x 4
  `mean(Emissions)` `mean(Temperature)` `sd(Emissions)` `sd(Temperature)`
              <dbl>               <dbl>           <dbl>             <dbl>
1          5142286.                52.9         450549.             0.891

These are useful, but on their own they do not summarize whether or not there is a relationship between Emissions and Temperature (also these are on different scales entirely).

What else could we use? Next, we are going to learn about the correlation coefficient, which is a summary statistic that describes how two variables are related or move together.

Correlation coefficient


We can use the correlation coefficient. Here, we are using this summary statistic to measure the strength of a linear relationship between two variables.

If we plot one variable on the x-axis and the other variable on the y-axis, we can see:

  1. The strength of the relationship - based on how well the points form a line
  2. The direction of the relationship - based on if the points progress upward or downward

If the variables point upward in a very clear line, then there is a strong positive relationship. If the points do not really form a line, then there is a weak linear relationship or no linear relationship. There may however be a nonlinear relationship if the points create a different but defined shape.

See here for more information on nonlinear relationships.

If the points form a downward sloping line, then there is a negative relationship.

[source]

The numbers below each plot above are called correlation coefficients. They range from -1 to 1. A value of zero indicates that there is no correlation between the variables. While a value of 1 or -1 indicates perfect correlation, the closer the coefficient is to 1 or -1, the stronger the relationship. The sign of the coefficient indicates the direction of the relationship. If there is a negative relationship then the variables show opposing changes from each other - as one gets larger the other gets smaller. If the sign is positive, then the variables increase similarly.

We previously made this plot:

load(here::here("plots", "CO2_temp_US.rda"))
CO2_temp_US

Let’s calculate the Pearson’s correlation coefficient called “Rho” \(\rho\) between CO2 emissions and temperature in the US. There are a few ways to calculate a correlation coefficient and this is one of the most common.

Formally, if we have a pair of observations \((x_1, y_1), \dots, (x_n,y_n)\), the correlation coefficient \(\rho\) between \(x\) and \(y\) is defined as

\[ \rho = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i-\mu_x}{\sigma_x} \right)\left( \frac{y_i-\mu_y}{\sigma_y} \right) \] where \(\mu_x, \mu_y\) are the means of \(x_1,\dots, x_n\) and \(y_1, \dots, y_n\), respectively, and \(\sigma_x, \sigma_y\) are the standard deviations.

Therefore, we can standardize the two variables and essentially average (the denominator is n-1) the standardized values to calculate the correlation coefficient rho.

Here we will manually perform the calculation. We will use the tally() function of the dplyr package to get the number of samples \(n\). In this case this is equivalent to the number of rows in the wide_US tibble. We need to then use the pull() function to specifically grab the value out of the tibble that is created from using this function. As you can see from using the base class() function that this is a tbl_df which is short for tibble data frame (the tidyverse version of a data frame) rather than just a number.

tally(wide_US)
# A tibble: 1 x 1
      n
  <int>
1    35
class(tally(wide_US))
[1] "tbl_df"     "tbl"        "data.frame"

When we check the class after using the pull() function we see that it is an integer.

pull(tally(wide_US), n)
[1] 35
class(pull(tally(wide_US), n))
[1] "integer"

We will also use the base scale() function to standardize the Emissions and Temperature values.

wide_US %>%
  summarize(rho = (1/(pull(tally(wide_US), n) -1)) *(sum(scale(Emissions) * scale(Temperature)))) %>%
  pull(rho)
[1] 0.4711717

Alternatively, you can use the cor() function in base R like so:

wide_US %>%
  summarize(r = cor(x = Emissions,
                    y = Temperature, 
               method = "pearson")) %>%
  pull(r)
[1] 0.4711717

If you want to learn more about why this is the calculation to determine the strength of the relationship between two variables, see this link.

Question Opportunity

There are different types of correlation coefficients. Look at the help file for the cor() function by typing ?cor() into the RStudio Console to learn more and then try a different method.

What are the differences?

Question Opportunity

  • Try calculating the correlation coefficient between CO2 emissions and the other variables.
  • What do you expect?

To test if the association between a pair of variables is statistically significant, you can use the cor.test() of the stats package to calculate the correlation coefficient, as well as confidence intervals for correlation coefficients.

We can use the tidy() function of the broom package to make the output more usable for working with in R. The role of this function is to pull numeric values from outputs and create a data frame of the values.

cor.test(pull(wide_US, Emissions),
         pull(wide_US, Temperature))

    Pearson's product-moment correlation

data:  pull(wide_US, Emissions) and pull(wide_US, Temperature)
t = 3.0686, df = 33, p-value = 0.004277
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1636155 0.6952523
sample estimates:
      cor 
0.4711717 
broom::tidy(cor.test(pull(wide_US, Emissions),
         pull(wide_US, Temperature)))
# A tibble: 1 x 8
  estimate statistic p.value parameter conf.low conf.high method     alternative
     <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
1    0.471      3.07 0.00428        33    0.164     0.695 Pearson's~ two.sided  

We see that the correlation coefficient quantifying the strength of the linear relationship between C02 emissions and temperature is statistically significant.

Relationship between correlation and linear regression


Let’s briefly discuss the relationship between correlation and linear regression, which is further described in the Introduction to Data Science book.

We can use a regression line to predict a random variable \(Y\) given that we have gathered or observed some data about another variable \(X=x\). The regression line formally is defined as:

\[ \left( \frac{Y-\mu_Y}{\sigma_Y} \right) = \rho \left( \frac{x-\mu_X}{\sigma_X} \right) \] where \(\mu_X\) and \(\sigma_X\) (\(\mu_Y\) and \(\sigma_Y\)) are the mean and standard deviation of \(X\) (\(Y\)), and \(\rho\) is correlation between \(X\) and \(Y\). If \(x\) is larger than \(\mu_X\), then for every \(\sigma_X\), then \(Y\) will also increase \(\rho\) standard deviations above \(\mu_Y\).

Re-organizing the terms so that \(Y\) is on the left side and everything else is on the right side, we get:

\[ Y = \mu_Y + \rho \left( \frac{x-\mu_X}{\sigma_X} \right) \sigma_Y \] Thinking about some extreme examples:

  • If \(\rho\) = 0 (i.e. no correlation), we ignore the \(x\) term entirely and only predict \(Y\) using the mean \(\mu_Y\).
  • If \(\rho\) = 1 (or -1) (i.e. perfect correlation), the regression line predicts an increase (or decrease) that is the same number of SDs.
  • If \(\rho\) is between -1 and 1, then we predict using both terms on the right hand side.

To add regression lines to plots, we will need the above formula in the form:

\[ y= b + mx \mbox{ with slope } m = \rho \frac{\sigma_y}{\sigma_x} \mbox{ and intercept } b=\mu_y - m \mu_x \]

In our example, we can calculate the slope and intercept using the formula above and plot the line.

wide_US_summary <-
  wide_US %>%
  summarize(mu_x = mean(Emissions), sd_x = sd(Emissions),
            mu_y = mean(Temperature), sd_y = sd(Temperature),
            rho = cor(Emissions, Temperature),
            slope = rho * sd_y / sd_x,
            intercept = mu_y - rho * sd_y / sd_x * mu_x)

wide_US %>%
  ggplot(aes(x = Emissions, y = Temperature)) +
    geom_point() +
    geom_abline(slope = wide_US_summary$slope,
                intercept = wide_US_summary$intercept) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))

Note: In the plot above, we use the scale of the original variables (CO2 emission and temperature), but the formula above implies that standardization of the variables (i.e. subtracting the mean and dividing by the standard deviation) allows the regression line to have an intercept of 0 and slope equal to \(\rho\). A similar plot in standardized units is given below.

CO2_temp_US_scaled<-wide_US %>%
  ggplot(aes(x = scale(Emissions), y = scale(Temperature))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
  labs(title = "US" ~ CO[2]~ "Emissions and Temperature (1980-2014)",
         y = "Scaled Temperature (Fahrenheit)",
         x = "Scaled Emissions (Metric Tonnes)") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 14),
       axis.title.y = element_text(size = 14),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))

CO2_temp_US_scaled

Notice that we also use the geom_smooth(method = "lm") function that we previously used. Again, this adds a line corresponding to the slope and intercept from the lm() function from the stats R package.

Question Opportunity

What does se = FALSE mean? Try turning it to TRUE. What happens?


Click here for the answer. se stands for standard error. The gray shading shows the confidence interval of the smooth line.

Limitations of Correlation


While correlation is useful in many settings to understand how two variables move together, correlation is not always a useful summary. For example, here are ways it might not be useful:

  • A linear relationship might not be the best way to capture the relationship.
  • If an individual were interested in understanding a causal relationship between two variables, as correlation does not imply causation. Another way of stating this is that simply showing there is a linear trend over time does not imply there is a causal relationship between these two variables.

As you can see from this plot, often data may show a similar pattern over time by random chance. See this website for more examples.

[source]

In this example and in our case study, data was collected over time. This type of data will often have what is called autocorrleation or serial correlation. This means that data points from one year to the next may be similar to one another or have some sort of internal structure related to time (such as seasonality). Let’s think about our CO2 emission and temperature data. You may be able to see how one year of CO2 emissions might be fairly similar to next year, because the number of sources (such as industrial factories and cars) in the US will change slightly from year to year, but they will be related to that of the previous years. Anytime we look at correlation between two variables that each have autocorrelation, this can result in a higher likelihood of correlation between these variables.

Indeed if we look at correlation between two random variables with autocorrelation we can see an inflation in the correlation rho values between the two variables, as compared to that of variables that do not have autocorrelation.

This is something to keep in mind when you evaluate how two things are related to one another over time. There are methods that have been developed to account for this in time series analysis (analyzing data over time). This does not mean that two variables (such as CO2 and temperature) may not be in fact correlated, it just means that we need to account for the autocorrelation within each variable, however this is beyond the scope of this case study.

Summary


Summary Plot


The last thing we will do here is to create a plot that summarizes our major findings. We will use the plot_layout() function of the patchwork R package. The patchwork allows you to create a plot layout based on mathematical-like formulas. As you can see in this example we want the CO2_world and Top10t plot on top and we want another row with the CO2_temp_US_facet and the CO2_temp_US plot on the bottom. The plot_layout() function of the patchwork package then allows us to specify heights and widths for the plots.

We will also save the figure using the grDevices png() function. We can specify the name of our plot file and where we want it to be saved using the here() function of the here package. In this case in the img subdirectory of the directory containing our .Rproj file. We can also specify the size of the plot and the resolution using the res argument. The grDevices dev.off() function is necessary to close the graphics device.

This will include a few plots that we made previously in other sections. We saved these in the “plots” directory and will load these now for users who stopped and restarted or started at the data analysis section.

load(here::here("plots", "CO2_world.rda"))
load(here::here("plots", "Top10t.rda"))
load(here::here("plots", "CO2_temp_US_facet.rda"))
png(here::here("img", "mainplot.png"), units = "in", width = 12, height = 10, res = 300)
(CO2_world | Top10t) / (CO2_temp_US_facet | CO2_temp_US_scaled) +
  plot_layout(widths = c(1, 2),
              heights = unit(c(4, 10), c('cm', 'cm')))
dev.off()

Synopsis


In this case study we evaluated CO2 emissions from as far back as 1751 for some countries to 2014. We discovered that global levels of CO2 emissions have dramatically increased over time. We also learned that some countries have been responsible for particularly high levels.

We also took a look at how CO2 emissions might relate to other factors, such as temperature, energy use, and natural disasters. We learned that we can summarize the relationship between two sets of data using correlation coefficients. We also learned that we can use regression to predict or describe how changes in one variable may influence changes in another variable. Importantly, we also learned that just because two variables show strong correlation or show an association, it does not necessarily indicate that they are causally related.

However, there is quite a bit of scientific evidence to indicate that in fact CO2 emissions trap heat and lead to increased global temperatures. Yet, it is important to realize that there are other factors involved in the relationship between US CO2 emissions and US annual average temperatures. For example there are CO2 emissions from other countries in the atmosphere, there are other greenhouse gases, there is already existing CO2 in the atmosphere that will continue to trap heat for many years, and finally there is heat trapped in the ocean due to previous emissions that will cause delayed changes in surface temperatures. However, it is vital that we work around the globe to reduce future greenhouse gas emissions to mitigate the increased temperatures that we will experience due to previous and existing CO2 emissions, so that the warming temperatures aren’t as extreme as they could be. Furthermore, we need to prepare for increased rates of natural disasters and how these may influence people around the world. Evidence suggests that impoverished people are the most affected by disasters. We need to be particularly mindful of this as we prepare for the future.

Suggested Homework


Ask students to create a plot with labels showing the countries with the lowest CO2 emission levels.

Ask students to plot CO2 emissions and other variables (e.g. energy use) on a scatter plot, calculate the Pearson’s correlation coefficient, and discuss results.

Additional information


Session Info


sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] OCSdata_1.0.2             patchwork_1.1.1          
 [3] broom_0.7.11              ggrepel_0.9.1            
 [5] directlabels_2021.1.13    ggplot2_3.3.5            
 [7] forcats_0.5.1             tidyr_1.1.4              
 [9] purrr_0.3.4               stringr_1.4.0            
[11] dplyr_1.0.7               readr_2.1.1              
[13] readxl_1.3.1              koRpus.lang.en_0.1-4     
[15] koRpus_0.13-8             sylly_0.1-6              
[17] read.so_0.1.1             wordcountaddin_0.3.0.9000
[19] magrittr_2.0.2            knitr_1.37               
[21] here_1.0.1               

loaded via a namespace (and not attached):
 [1] httr_1.4.2        sass_0.4.0        splines_4.1.2     bit64_4.0.5      
 [5] vroom_1.5.7       jsonlite_1.8.0    viridisLite_0.4.0 bslib_0.3.1      
 [9] assertthat_0.2.1  highr_0.9         cellranger_1.1.0  yaml_2.3.5       
[13] remotes_2.4.2     pillar_1.7.0      backports_1.4.1   lattice_0.20-45  
[17] glue_1.6.1        quadprog_1.5-8    digest_0.6.29     colorspace_2.0-2 
[21] htmltools_0.5.2   Matrix_1.3-4      pkgconfig_2.0.3   scales_1.1.1     
[25] tzdb_0.2.0        tibble_3.1.6      mgcv_1.8-38       generics_0.1.1   
[29] farver_2.1.0      usethis_2.1.5     ellipsis_0.3.2    withr_2.5.0      
[33] cli_3.2.0         crayon_1.5.0      evaluate_0.15     fs_1.5.2         
[37] fansi_1.0.2       nlme_3.1-153      tools_4.1.2       data.table_1.14.2
[41] hms_1.1.1         lifecycle_1.0.1   munsell_0.5.0     compiler_4.1.2   
[45] jquerylib_0.1.4   rlang_1.0.1       grid_4.1.2        rstudioapi_0.13  
[49] labeling_0.4.2    rmarkdown_2.11    gtable_0.3.0      DBI_1.1.2        
[53] curl_4.3.2        R6_2.5.1          sylly.en_0.1-3    fastmap_1.1.0    
[57] bit_4.0.4         utf8_1.2.2        rprojroot_2.0.2   stringi_1.7.6    
[61] parallel_4.1.2    Rcpp_1.0.8        vctrs_0.3.8       tidyselect_1.1.2 
[65] xfun_0.29        

Estimate of RMarkdown Compilation Time:

About 36 - 46 seconds

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.

Acknowledgments


We would like to acknowledge Megan Latshaw for assisting in framing the major direction of the case study.

We would like to acknowledge Qier Meng and Michael Breshock for their contributions to this case study.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

---
title: "Open Case Studies: Exploring CO2 emissions across time"
css: style.css
output:
  html_document:
    includes:
      in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---

<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/icon-bahi.png");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

<html lang="en-US">
<body>

<!-- Open all links in new tab-->  
<base target="_blank"/> 

<div id="google_translate_element"></div>

<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>

<script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>



```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)

rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```

#### {.outline }
```{r, echo = FALSE, out.width = "800 px", dpi=300}
knitr::include_graphics(here::here("img", "mainplot.png"))
```

####

#### {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

####

#### {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

####

#### {.reference_block}

To cite this case study please use:

Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-co2-emissions. Exploring CO2 emissions across time (Version v1.0.0).

####

To access the GitHub repository for this case study see here: https://github.com//opencasestudies/ocs-bp-co2-emissions.

You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).

***

The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below: 

```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
  dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
  dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
  dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```

***

**Readability Score: **

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age. 

```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>%  # reading table into dataframe, selecting relevant factors
  tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
                                     age = round(as.numeric(age))
                                     )
knitr::kable(df, format="markdown")
```

***

Please help us by filling out our survey.


<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>


# **Motivation**
*** 

This case study explores how different countries have contributed to Carbon Dioxide (CO2) emissions over time and how CO2 emission rates may relate to increasing global temperatures and increased rates of natural disasters and storms.
We used this [report from the EPA](https://www.epa.gov/report-environment/greenhouse-gases){target="_blank"} as the basis for motivating this case study, as it provides background information about how CO2 emissions and other greenhouse gases have influenced the climate and weather patterns.

CO2 makes up the largest proportion of greenhouse gas emissions in the United States:


```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "emissions.jpg"))
```

##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}

A variety of sources and sectors contribute to greenhouse gas emissions:


```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "sector.png"))
```

##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}

Transportation and Electricity contribute the most metric tons of CO2:

```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "sources_pie.jpg"))
```

##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}


So why should we pay attention to greenhouse gases?

According to the [US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}: 

> Greenhouse gases absorb infrared radiation, thereby trapping heat in the atmosphere and making the planet warmer. The most important greenhouse gases directly emitted by humans include carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), and several fluorine-containing halogenated substances. Although CO2, CH4, and N2O occur naturally in the atmosphere, human activities have changed their atmospheric concentrations. From the pre- industrial era (i.e., ending about 1750) to 2018, concentrations of these greenhouse gases have increased globally by 46, 165, and 23 percent, respectively (IPCC 2013; NOAA/ESRL 2019a, 2019b, 2019c). 

\* IPCC stands for the Intergovernmental Panel on Climate Change

In fact, there are many signs that our planet is experiencing warmer temperatures:

```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "warming.png"))
```

##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}

The connection between greenhouse gas levels and global temperatures and the influence of increased global temperatures on human health are motivated by these reports:

#### {.reference_block}

- Melillo, J.M., T.C. Richmond, and G.W. Yohe (eds.). 2014. Climate change impacts in the United States: The third National Climate Assessment. U.S. Global Change Research Program.  

- 2020. “Inventory of US Greenhouse Gas Emissions and Sinks: 1990--2018.” EPA 430-R-20-002, Tech. Rep. https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks.


####

The [National Climate Assessment Report](https://data.globalchange.gov/report/nca3-overview){target="_blank"} states that:

> Heat-trapping gases already in the atmosphere have committed us to a hotter future with more climate-related impacts over the next few decades. The magnitude of climate change beyond the next few decades depends primarily on the amount of heat-trapping gases that human activities emit globally, now and in the future.

See the following links for more information about how greenhouse gases have influenced global temperatures:
1) The EPA [report](https://www.epa.gov/report-environment/greenhouse-gases){target="_blank"} on green house gases  
2) The National Climate Assessment (NCA) [summary from 2014](https://nca2014.globalchange.gov/){target="_blank"}) 
3) The [World101 website](https://world101.cfr.org/global-era-issues/climate-change/climate-change-adaptations){target="_blank"} about how countries are adapting to climate change

# **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main questions: </u></b>

1. How have global CO2 emission rates changed over time? In particular for the US, and how does the US compare to other countries? 
2. Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated? 

####

# **Learning Objectives** 
*** 

In this case study, we will explore CO2 emission data from around the world. 
We will also focus on the US specifically to evaluate patterns of temperatures and natural disaster activity. 

This case study will particularly focus on how to use different datasets that span different ranges of time, as well as how to create visualizations of patterns over time. 
We will especially focus on using packages and functions from the [`tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as `dplyr`, `tidyr`, and `ggplot2`. 

The tidyverse is a library of packages created by RStudio. 
While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

<u>**Data Science Learning Objectives:**</u>  

1. Importing data from various types of Excel files and CSV files
2. Apply action verbs in `dplyr` for data wrangling
3. How to pivot between "long" and "wide" datasets
4. Joining together multiple datasets using `dplyr`
5. How to create effective longitudinal data visualizations with `ggplot2`
6. How to add text, color, and labels to `ggplot2` plots
7. How to create faceted `ggplot2` plots

<u>**Statistical Learning Objectives:**</u>  

1. Introduction to correlation coefficient as a summary statistic
2. Relationship between correlation and linear regression
3. Correlation is not causation

```{r, out.width = "20%", echo = FALSE, fig.align = "center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

*** 


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readxl)
library(readr)
library(dplyr)
library(magrittr)
library(stringr)
library(purrr)
library(tidyr)
library(forcats)
library(ggplot2)
library(directlabels)
library(ggrepel)
library(broom)
library(patchwork)
library(OCSdata)
```

<u>**Packages used in this case study:** </u>


 Package   | Use in this case study                                                                         
---------- |-------------
[`here`](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[`readxl`](https://readxl.tidyverse.org/){target="_blank"}  | to import the Excel file data
[`readr`](https://readr.tidyverse.org/){target="_blank"}  | to import the csv file data
[`dplyr`](https://dplyr.tidyverse.org/){target="_blank"}  |  to view and wrangle the data, by modifying variables, renaming variables, selecting variables, creating variables, and arranging values within a variable   
[`magrittr`](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}  |  to use and reassign data objects using the `%<>%`pipe operator
[`stringr`](https://stringr.tidyverse.org/){target="_blank"}  | to select only the first 4 characters of date data
[`purrr`](https://purrr.tidyverse.org/){target="_blank"}  | to apply a function on a list of tibbles (tibbles are the tidyverse version of a data frame)  
[`tidyr`](https://tidyr.tidyverse.org/){target="_blank"}  | to drop rows with `NA` values from a tibble
[`forcats`](https://forcats.tidyverse.org/){target="_blank"}  | to reorder the levels of a factor
[`ggplot2`](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations
[`directlabels`](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} | to add labels to plots easily
[`ggrepel`](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"} | to add labels that don't overlap to plots
[`broom`](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/) | to make the output form statistical tests easier to work with
[`patchwork`](https://github.com/thomasp85/patchwork){target="_blank"}  | to combine plots
[`OCSdata`](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files

The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


# **Context**
*** 

Now we will describe a bit more background about greenhouse gas emissions and the potential influence of these emissions on public health. 

Greenhouse gas emissions are due to both natural processes and anthropogenic (human-derived) activities. 

These emissions are one of the contributing factors to rising global temperatures, which can have a great influence on [public health](https://www.epa.gov/climate-indicators/understanding-connections-between-climate-change-and-human-health){target="_blank"}  as illustrated in the following image:

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "climate_change_health_impacts.jpg"))
```

##### [[source]](https://www.cdc.gov/climateandhealth/effects/default.htm){target="_blank"}


According to the [US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}:

> Gases in the atmosphere can contribute to climate change both directly and indirectly. Direct effects occur when the gas itself absorbs radiation. Indirect radiative forcing occurs when chemical transformations of the substance produce other greenhouse gases, when a gas influences the atmospheric lifetimes of other gases, and/or when a gas affects atmospheric processes that alter the radiative balance of the earth (e.g., affect cloud formation or [albedo](https://en.wikipedia.org/wiki/Albedo){target="_blank"}). 

The **Global Warming Potential (GWP)** compares the **ability of a greenhouse gas to trap heat in the atmosphere relative to another gas**.

>The GWP of a greenhouse gas is defined as the ratio of the accumulated radiative forcing within a specific time horizon caused by emitting 1 kilogram of the gas, relative to that of the reference gas CO2 (IPCC 2013). Therefore GWP-weighted emissions are provided in million metric tons of CO2 equivalent (MMT CO2 Eq.)

##### [[source]](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}

CO2 is actually the least heat-trapping gas of the greenhouse gases:

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "GWP.png"))
```

##### [[source]](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}

However, because CO2 is so much more abundant and stays in the atmosphere so much longer than other greenhouse gases, it has been the largest contributor to global warming.
See [here](https://www.ucsusa.org/resources/why-does-co2-get-more-attention-other-gases#:~:text=CO2%20sticks%20around,oxide%20(N2O)){target="_blank"} for more details.

It is also important to keep in mind that there is a [lag](https://earthobservatory.nasa.gov/blogs/climateqa/would-gw-stop-with-greenhouse-gases/) between greenhouse gas emissions and temperature changes that we experience because much of Earth's thermal energy (and CO2) gets stored in the ocean. 

Due to a process called [thermal inertia](https://en.wikipedia.org/wiki/Volumetric_heat_capacity#Thermal_inertia), the heat stored in the ocean will eventually be transfered to the surface of the Earth long after the gases were emitted that resulted in the increased ocean temperature.

See [here](https://earthobservatory.nasa.gov/blogs/climateqa/would-gw-stop-with-greenhouse-gases/) for more explanation.

Furthermore, rising CO2 levels in the ocean also influence ocean acidity:


```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "oceans.png"))
```

##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}


As CO2 levels rise in the ocean, the pH becomes more acidic, which makes it difficult for organisms to maintain their shells or skeletons that are made of calcium carbonate, thus making it more difficult for these organisms to survive and impacting their role in the ecosystem and food chain. 


Furthermore, greenhouse gas emissions are believed to influence weather patterns as shown in this [report](https://data.globalchange.gov/report/nca3-overview){target="_blank"}. 

Indeed, events with high levels of precipitation which can induce flooding and property damage are generally increasing around the country:

```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "storms.png"))
```

##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}


# **Limitations**
*** 

An important limitation regarding this data analysis to keep in mind is the datasets only include countries and years in which countries were reporting such information to the agencies that collected the data. 
Thus, the data are incomplete. 
For example, while we have a fairly good sense of CO2 emissions globally for later years, additional emissions were also produced by countries that are not included in the data.


# **What are the data?**
*** 

In this case study we will be using data related to CO2 emissions, as well as other data that may influence, be influenced or relate to CO2 emissions. 
Most of our data is from [Gapminder](https://www.gapminder.org/data/){target="_blank"} that was originally obtained from the [World Bank](https://www.worldbank.org/en/what-we-do){target="_blank"}.

In addition, we will use some data that is specific to the United States from the [National Oceanic and Atmospheric Administration (NOAA)](https://www.noaa.gov/){target="_blank"}, which is an agency that collects weather and climate data.



Data   | Time span | Source  | Original Source   | Description | Citation                                                                    
-----------|---------------|-------------|-------------|----------------------------|--------
**CO2 emissions**  |1751-2014 | [Gapminder](https://www.gapminder.org/data/){target="_blank"}  | [Carbon Dioxide Information Analysis Center (CDIAC)](https://cdiac.ess-dive.lbl.gov/){target="_blank"}  |  CO2 emissions in tonnes or metric tons (equivalent to approximately 2,204.6 pounds) per person by country| NA
**GDP per capita (percent yearly growth)** | 1801-2019| [Gapminder](https://www.gapminder.org/data/){target="_blank"}  | [World Bank](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG){target="_blank"}  |  [Growth Domestic Product](https://www.investopedia.com/terms/g/gdp.asp#:~:text=Gross%20Domestic%20Product%20(GDP)%20is%20the%20monetary%20value%20of%20all,expenditures%2C%20production%2C%20or%20incomes.){target="_blank"}  (which is an overall measure of the health of nation's economy) per person by country| NA
**Energy use per person** |1960-2015 | [Gapminder](https://www.gapminder.org/data/){target="_blank"}  | [World Bank](https://data.worldbank.org/indicator/EG.USE.PCAP.KG.OE){target="_blank"}  |  Use of primary energy before transformation to other end-use fuels, by country | NA
**US Natural Disasters** | 1980-2019 | [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/billions/time-series){target="_blank"}| [The National Oceanic and Atmospheric Administration (NOAA) ](https://www.ncdc.noaa.gov/billions/time-series){target="_blank"}|  US data about: <br> -- Droughts <br> -- Floods <br> -- Freezes <br> -- Severe Storms <br> -- Tropical Cyclones <br> -- Wildfires<br> -- Winter Storms | NOAA National Centers for Environmental Information (NCEI) U.S. Billion-Dollar Weather and Climate Disasters (2020). https://www.ncdc.noaa.gov/billions/, DOI: 10.25921/stkw-7w73
**Temperature**  | 1895-2019|  [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"}  | [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"} | US National yearly average temperature (in Fahrenheit) from 1895 to 2019 | NOAA National Centers for Environmental information, Climate at a Glance: National Time Series, published June 2020, retrieved on June 26, 2020 from https://www.ncdc.noaa.gov/cag/


To obtain the temperature data, the annual average temperatures were selected as shown in this image:
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "temp.png"))
```

##### [[source]](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"}


Importantly, notice that the data we would like to use span different time periods:

Data   | Time span                                                                     
---------- |-------------
**CO2 emissions**  |1751 to 2014 
**GDP per capita (yearly growth)** | 1801 to 2019
**Energy use per person** |1960 to 2015 
**US Natural Disasters** | 1980 to 2019 
**Temperature**  | 1895 to 2019

We will explore more about this a bit later. 

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What concerns might arise about reliability and variation of measurement practices over time?

####

# **Data Import**
*** 
In our case, we downloaded the data for the files from the various sources as indicated in the table above and put them within a "raw" subdirectory of a "data" directory for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for importing this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project. We can specify that we want to look for the "yearly_co2_emissions_1000_tonnes.xlsx" file within the "raw" directory within the "data" directory within a directory where our `.Rproj` file is located by separating the names of these directories using commas and listing "data" first. 

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>

***

To read in the files that were downloaded from the various sources as indicated in the table above, we will use the `read_xlsx()` and `read_xls()` functions of the `readxl` package to import the data from the `.xlsx` and `.xls` files, respectively. We will also use the `here()` function of the `here` package to more easily specify the path to our files relative to the directory where the .Rproj file is located. 

```{r}
CO2_emissions <- readxl::read_xlsx(here("data","raw", "yearly_co2_emissions_1000_tonnes.xlsx"))
gdp_growth    <- readxl::read_xlsx(here("data", "raw", "gdp_per_capita_yearly_growth.xlsx"))
energy_use    <- readxl::read_xlsx(here("data", "raw", "energy_use_per_person.xlsx"))
```

If you had trouble downloading these files, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw/) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/yearly_co2_emissions_1000_tonnes.xlsx), [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/gdp_per_capita_yearly_growth.xlsx), and [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/energy_use_per_person.xlsx).

You may also download these files using the `OCSdata` package:

```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-co2-emissions", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" subfolder 
# in your current working directory
```

We will use the `read_csv()` function of the `readr` package to import the data from the `.csv` files.

However, for these files there are some lines that we would like to not import because the number of columns differ for some rows. If we don't account for this, then we may end up importing fewer columns of the data that we would like.

In the first 5 rows shown below in the `data/disasters.csv` file, you can see that the first two rows does not have the same number of columns as the subsequent rows and are just (sub)titles. 

```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "Disasters.png"))
```

To do this, we can skip rows using the `skip = 2` argument of the `read_csv()` function. 

```{r}
us_disaster <- readr::read_csv(here("data", "raw", "disasters.csv"), skip = 2)
```

If you had trouble downloading this file, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/disasters.csv).

Now looking at the `data/temperature.csv` file, we see that the first four lines do not have the same number of columns as the subsequent lines. 

```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "tempdata.png"))
```

We will skip importing all 4 lines by using `skip = 4`. 
We can also replace all instances of `"-99"` with `NA` using the `na = "-99"` argument of the `read_csv()` function.
The "-99" needs to be in quotation marks because this argument expects characters.

***
<details> <summary> Click here for an explanation about data types in R and about character strings.</summary>

There are several [classes of data in R programming](https://en.wikipedia.org/wiki/R_(programming_language)), meaning that certain objects will be treated or interpreted differently. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3". If data are of class character, than the numeric values will not be processed like a numeric value in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).

A variable that is a factor has a set of particular values called levels (this can be numbers or characters). Even if these are numeric, they will be interpreted as levels (i.e., as if they were characters) not as mathematical numbers. The values of a factor are assumed to have a particular ordering; by default the order is alphabetical, but this is not always the correct/intuitive ordering. You can modify the order of these levels with the `forcats` package.

</details> 
***

```{r}
us_temperature <- readr::read_csv(here("data", "raw", "temperature.csv"), skip = 4, na = "-99")
```

If you had trouble downloading this file, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/temperature.csv).

Great! now we have imported all of the data that we will need.

To allow users to skip import we will save the data as an RDA file:

```{r, eval = FALSE}
save(CO2_emissions, 
     gdp_growth,
     energy_use, 
     us_disaster, 
     us_temperature, 
     file = here::here("data", "imported", "co2_data_imported.rda"))
```

# **Data Wrangling**
*** 
If you have been following along but stopped, we could load our imported data like so:

```{r}
load(here::here("data", "imported", "co2_data_imported.rda"))
```

***
<details> <summary> If you skipped the data import section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the imported data using the following code:

```{r, eval=FALSE}
imported_data("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "co2_data_imported.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/imported/co2_data_imported.rda). Download this file and then place it in your current working directory within a subdirectory called "imported" within a directory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.

```{r}
load(here::here("data", "imported", "co2_data_imported.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>
***


Next, we take a look at our data that we just imported. 
We will need to do some data wrangling to allow us to evaluate how CO2 emissions have changed over time and how emissions may relate to energy use, GDP, etc.
Let's explore how to do that with useful functions and packages from the `tidyverse`. 

## **Yearly CO~2~ Emissions**
***

First, let's take a look at the CO2 data (`CO2_emissions`). 
We can use the `slice_head()` function of the `dplyr` package to see just the first rows of our data. 
We can specify how many rows we would like to see by using the `n =` argument. 

We will use the `%>%` pipe from the `magrittr` package (although it is also imported by other `tidyverse` packages, like `dplyr`), which can be used to define the input for later sequential steps. 
This will make more sense when we have multiple sequential steps using the same data object. 

```{r}
CO2_emissions %>%
  slice_head(n = 3)
```

Another useful function is `slice_sample()` to look at a **selection of random rows** using [pseudorandom](https://en.wikipedia.org/wiki/Pseudorandomness){target="_blank"} numbers for the index of rows to show. To continue to get the same random values or for others to get the same values, we need to set a seed first. We can do this with the `set.seed()` base function. We just specify a number with this function and that will allow us to get the same subset of values from the `slice_sample()` function. If two different people ran this code (without set.seed()), they would each see a different subset of rows. For data exploration, this isn't a huge deal, but if we'd like separate analysts running the same code to see the same output, we will use set.seed(). If we changed set.seed(123) to set.seed(333), we would obtain a different random sample of rows. 

```{r}
set.seed(123)

CO2_emissions %>%
  slice_sample(n = 3)
```

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

Try setting a different seed to see the difference in the output.

####

OK, we see each country is represented along one row and each column contains yearly CO2 emissions. 
We also see that there are a lot of `NA` values.

We can also use the `glimpse()` function of the `dplyr` package to view our data. 
This allows us to see all of our variables at once. 
We will see a tiny bit of each variable/column with the data displayed on the right.

#### {.scrollable }
```{r}
# Scroll through the output!
CO2_emissions %>%
  dplyr::glimpse()
```
####


We can also see that we have a large [tibble](https://tibble.tidyverse.org/). 
```{r}
CO2_emissions %>%
  class()
```

This is the object that is created when we read in the data with `readr`. 
A tibble (or `tbl_df`) is the `tidyverse` version of a `data.frame` object. 
Similar to `data.frame`, it is a table with variable information arranged as columns, and individual observations arranged as rows. 
However some nice differences are they do not change variable names or data types and they give more messages when something is wrong (e.g. when a variable does not exist), which forces the analyst to confront problems earlier. 
Tibbles also give us information about the class of each variable. 

For example the `country` variable is made up of character (abbreviated as `chr`) values.
```{r}
CO2_emissions %>%
  select(country)
```

We see that we have `r nrow(CO2_emissions)` rows different country variables and CO2 emission values for `r ncol(CO2_emissions) - 1` different years (from 1751 to 2014). 
```{r}
names(CO2_emissions)
```

Recall, the values are emissions in metric tons, also called tonnes.
Scrolling through the `glimpse()` function above, we can also see that there are fewer `NA` values for later years.

In this next code chunk, we will introduce the `%<>%` operator from the `magrittr` package. 
This allows us to use our `CO2_emissions` data and reassign it to a modified version at the same time. 
Let's modify `CO2_emissions` to make it more usable for making visualizations. 
Specifically, we will use the `pivot_longer()` function of the `dplyr` package to convert our data into what is called **"long"** format. This is also sometimes referred to as **"narrow"** format.

This means that we will have more rows and fewer columns than our current format.

Right now our data is in what is called **"wide"** format. 
In wide format, each variable is listed as its own column. 
In contrast, in long format, variables maybe collapsed into a column that identifies the variables and a column of values. 
See [here](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"} for more information about the difference between the two formats.

We want to collapse all of the values for the emission data across the different individual year variables into one new `Emissions` variable. We will identify what year they are from by creating a new `Year` variable. The `cols =` argument allows us to specify which columns we want to pivot (or not pivot) to create these new columns. We want to keep our `country` data as an ID variable so we will exclude it using the `-` sign, by default all other columns will be used.

```{r}

CO2_emissions  %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Emissions")

set.seed(123)

CO2_emissions %>%
  slice_sample(n = 6)
```


#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Think a moment about what the dimensions of the  `CO2_emissions` tibble are now and why? How would you check this?

<b><u> Hint </u></b>: Checking has something to do with a unique aspect about tibbles. 

####


Let's say we also want to rename the `country` variable to be capitalized.
To do this, we can use the `rename()` function of the `dplyr` package to rename this variable. 
When renaming variables the syntax is `new-name = old-name`, where the new name is listed first before the `=`. 
 
You may also note that the `Year` variable is currently of class type character. We would like to change it to be numeric. To do this we will use the `mutate()` function, which is also part of the `dplyr` package. This function allows us to create and modify variables. We will also use this function to create a variable called `Label` which will have `"CO2 Emissions (Metric Tons)"` as the value for every row, to be used when we create plots later.


```{r}
CO2_emissions  %<>%
   dplyr::rename(Country = country) %>%
   dplyr::mutate(Year = as.numeric(Year),
                 Label = "CO2 Emissions (Metric Tons)")
```

Now let's take a look to see how our data has changed:

```{r}
set.seed(123)

CO2_emissions %>%
  slice_sample(n = 6)
```
Great, we can see that now the `Year` variable is of class double (abbreviated `dbl`), which is a numeric class.

Now, let's take a look at the `Country` variable to check if there is anything unexpected. 
We will use the `distinct()` function of the `dplyr` package to view the unique values only.
Finally, we use the `pull()` function of the `dplyr` package to extract the values from the column (this is similar to using the `$` base R syntax e.g. `CO2_emission$Country`). 

#### {.scrollable }
```{r}
# Scroll through the output!
CO2_emissions %>%
  distinct(Country) %>%
  pull()
```
####

These all look as expected!


## **Yearly Growth in GDP per Capita**
***
Let's take a look at the next dataset (`gdp_growth`) that we imported. 

```{r}
gdp_growth %>%
  slice_head(n = 3)
```

How many rows and columns are there are there? We can easily check by using the base `dim()` function, which evaluates the dimensions of an object.

```{r}
dim(gdp_growth)
```

Interesting, it's `r nrow(gdp_growth)` rows (as opposed to `r nrow(CO2_emissions)` above). 
We will deal with this and other differences in the sets of countries a bit later on.
There are also `r ncol(gdp_growth)` columns with a `country` column and a set of columns corresponding to different years. 

```{r}
names(gdp_growth)
```

Yes, no other columns in this dataset. 

Next, we will use the `pivot_longer()` to transform the data to long format, similar to what we did in the previous section.

We will also again change the `country` variable to be `Country` by using the `rename()` function, and we will make the `Year` variable numeric using the `mutate()` function. 

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Using what you just learned about `pivot_longer()`, `rename()`, and `mutate()` and without scrolling up, try to come up with the code to do the wrangling for this data.

####

***
<details> <summary> Click here to reveal the code. </summary>

```{r, eval = FALSE}
gdp_growth %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "gdp_growth") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "GDP Growth/Capita (%)") %>%
  rename(GDP = gdp_growth)
```  

</details>
***

```{r, echo = FALSE}
gdp_growth %<>%
  pivot_longer(cols = -country,
           names_to = "Year",
          values_to = "gdp_growth") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
        Label = "GDP Growth/Capita (%)") %>%
  rename(GDP = gdp_growth)
```

Now let's see how this data has changed:

```{r}
gdp_growth %>%
  slice_head(n = 6)

gdp_growth %>%
  count(Year)
```

Again let's check that the `Country` variable only contains values we would expect.

#### {.scrollable }
```{r}
# Scroll through the output!
gdp_growth %>%
  distinct(Country) %>%
  pull()
```
####

Also looks good!

## **Energy Use per Person**
***

Now let's take a look at the energy use per person data (`energy_use`) using `slice_head()` and `glimpse()`. 

```{r}
energy_use %>%
  slice_head(n = 3)
```

#### {.scrollable}
```{r}
energy_use %>%
  glimpse()
```
####

Looks like we have `r nrow(energy_use)` rows and `r ncol(energy_use)` columns where we have a `country` column and again a set of years. 
To wrangle the `energy_use` data, we will again convert the data to long format, rename some variables, and mutate the `Year` data to be numeric.

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Again try to come up with the code on your own to wrangle the data.

####

***
<details> <summary> Click here to reveal the code. </summary>

```{r, eval = FALSE}
energy_use %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "energy_use") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "Energy Use (kg, oil-eq./capita)") %>%
  rename(Energy = energy_use)
```
</details>
***

```{r, echo = FALSE}
energy_use %<>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "energy_use") %>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "Energy Use (kg, oil-eq./capita)") %>%
  rename(Energy = energy_use)
```


```{r}
set.seed(123)

energy_use %>%
  slice_sample(n = 3)
```

Now we will check the `Country` variable:

#### {.scrollable }
```{r}
# Scroll through the output!
energy_use %>%
  distinct(Country) %>%
  pull()
```
####

Looks good!

## **US Specific Data**
***

Now we will take a look at the US data about disasters and temperature.

### **Disasters**
***

First, we consider the disasters that have occurred in the US. 
```{r}
us_disaster
```

We are specifically interested in the `Year` and the variables that contain the word `"Count"`. The other variables represent an estimate of the economic cost in billions of dollars, as well as the upper and lower bounds for simulations used to estimate the economic cost, which show the level of uncertainty in these estimates (at three different levels of confidence) as the true cost is unknown. See [here](https://www.ncdc.noaa.gov/billions/time-series) for more information about the data. For this analysis, we will focus just on the number of disasters that occurred each year. 

We will select our variables of interest using the `select()` and `contains()` functions in the `dplyr` package. 
Since we are selecting for variables with the word `"Count"` we need to use quotation marks around it. 

Selecting for the variable `Year` does not require quotes because it is the full name of one of the existing variables.


```{r}
us_disaster %<>%
           select(Year, contains("Count"))

us_disaster %>%
  slice_head(n = 6)
```

Now we want to create a new variable that will be the sum of all the different types of disasters for each year. 

We don't want to include the `Year` variable in our sum, so we can exclude it using the `select` function. To perform the sum for each year, we can use the base `rowSums()` function. 

```{r}
yearly_disasters <- us_disaster %>% 
                      select(-Year) %>%
                      rowSums()

yearly_disasters
```

We could then add this to our `us_diaster` tibble like so using the `bind_cols` function of the `dplyr` package:

```{r}
us_disaster %>% bind_cols(Disaters = yearly_disasters)
```

However, we can actually create and add this new variable directly to the `us_disaster` tibble by using the `mutate()` function of `dplyr` and using the `.` notation.

We need to use the `.` notation to indicate that we are using the data that we already used as input (on the left side of the pipe) to our `mutate()` function (on the right side of our pipe), which in this case is the entire `us_disaster` tibble for our `select()` function. The output from the `select()` function will be used for the `rowSums()` function. 

```{r}
us_disaster %<>%
  mutate(Disasters = rowSums(select(., -Year)))

us_disaster %>%
  glimpse()
```

Great, now we are going to remove some of these variables and just keep the variables of interest using `select()`.

We are also going to add a new variable called `Country` to indicate that this data is from the United States. Again this will create a new variable where every value is `United States`.

```{r}
us_disaster %<>%
  dplyr::select(Year, Disasters) %>%
  mutate(Country = "United States") %>%
  pivot_longer(cols = c(-Country, -Year),
               names_to = "Indicator",
               values_to = "Value") %>%
  mutate(Label = "Number of Disasters")

us_disaster %>%
  slice_head(n = 6)
```
Great, this looks good now. 

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

This dataset was slightly different from the other datasets and therefore required slightly different wrangling.
Why was it necessary to exclude the `Year` variable from the `pivot_longer()` function?
What would happen if we did not exclude `Year`?

####

### **Temperature**
***

Next, we consider the temperature in the US over time.  
```{r}
us_temperature %>%
  slice_head(n = 6)
```
So a few things need to be fixed here. 

First, the `Date` column looks a bit strange. The format of the numbers look like the year followed by the number 12 (representing 12 months).

We want to change this to only keep the first 4 characters in the `Date` variable string values. 

However, first let's make sure that indeed all of the `Date` variables are 6 characters long and that they all end with the number 12.

We can use a couple of functions in the `stringr` package to do this. This package is used for working with character strings.  The `str_length()` function can be used to check the length of each value, while the `str_ends()` function can be used to check that all the values end with `"12"`.

Let's start with the `str_length()` function. These functions in the `stringr` package require a character vector. Thus we need to pull the values for the `Date` variable first using the `pull()` function of the `dplyr` package. 

```{r}
us_temperature %>%
  pull(Date) %>%
  str_length()
```
Great! It looks like all of the values are 6 characters long.

Now let's check that they all end with `"12"`. We just need to specify what pattern to look for. 

```{r}
us_temperature %>%
  pull(Date) %>%
  str_ends(pattern = "12")

```
Great! Since all of the values are `TRUE` we know that all of the values in the `Date` variable end with `"12"`.

It's a good idea to always check that your data is as you expect. 

Now we can use the `str_sub()` function of the `stringr` package to remove the `"12"` from each `Date` value.

We just need to indicate the start and stop characters. 

In this case the start would be 1 and the 4th character would be where we want to stop, so we would use `start = 1, stop = 4`. We can do this inside of the `mutate()` function to modify the `Date` variable. In doing so, we will not need to use `pull()` to pull the values for the `Date` variable.

```{r}
us_temperature %<>%
  mutate(Date = str_sub(Date, start = 1, end = 4))

us_temperature
```
We also want to remove the `Anomaly` variable, which is an indicator of how different the national average temperature for that year was from the average temperature from 1901-2000 which was 52.02&deg;F. 

Then, we also want to create a `Country` variable. 
We will also change the name of the `Date` variable to `Year` so that it will be consistent with our other datasets. We also also want the `Year` to be numeric. 
We can accomplish both renaming and changing to numeric by using the `mutate()` function.

We also want to create an `Indicator` variable so that we can later tell what data the values in this tibble represent if we combine it with other tibbles and a `Label` variable, so that we will have informative labels if we make a plot with this data later. 

Finally, we remove the `Date` variable and also order the columns just like the other us data using the `select()` function.

```{r}
us_temperature %<>%
  dplyr::select(-Anomaly) %>%
  mutate(Country = "United States",
         Year = as.numeric(Date),
         Indicator = "Temperature",
         Label = "Temperature (Fahrenheit)") %>%
  select(Year, Country, Indicator, Value, Label)

us_temperature %>%
  slice_head(n = 6)
```


## **Joining data**
***

Now that we have wrangled the individual datasets, we are ready to put everything together. 
Specifically, we will _join_ the individual datasets into one tibble using `*_join()` functions available in the `dplyr` package. 

Before we begin though, we will need to make sure that there is at least one variable/column that has the same name across all datasets to be joined. Such variables with common names are called **keys** for joining your data.

These are the `by="x1"` arguments below where `x1` is the name of the column in both the `a` and `b` datasets that we will join together. 

```{r, echo = FALSE, out.width = "500 px"}
knitr::include_graphics(here::here("img", "join.png"))
```

##### [[source]](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf){target="_blank"}

There are several types of `*_join()` functions to consider. The `full_join()` function keeps all rows from both tibbles that are being joined and adds `NA` values as necessary if there are values within the key for either of the tibbles that is is not in the key of the other tibble. 

We use the `full_join()` function as we have different time spans for each dataset and we would like to retain as much data as possible. 

The `full_join()` function will simply create `NA` values for any of the years that are not in one of the datasets. 

First, we check using the base `summary()` function that there are column names that are consistent in each dataset that we wish to combine.

```{r}
summary(CO2_emissions)
summary(gdp_growth)
summary(energy_use)
```

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What variable or variables might we want to use to join our data by?

####

***
<details> <summary> Click here to see an explanation for what variable or variables to join by after you have thought about it. </summary>


The `Country`, and `Year` variables are present in all of the datasets with values that overlap. Although `Label` is also present in the datasets, the values do not overlap. We can see that the minimum and maximum year is different for nearly all the datasets.


Next, we need to specify what columns/variables we will be joining by using the `by =` argument in the `full_join()` function, (recall that this variable is called the "key")

```{r}
data_wide <- CO2_emissions %>%
  full_join(gdp_growth, by = c("Country", "Year", "Label")) %>%
  full_join(energy_use, by = c("Country", "Year", "Label"))

set.seed(123)

data_wide %>%
  slice_sample(n = 6)
```

***
<details> <summary> Click here to see an explanation for another option that works well for large numbers of tibbles </summary>

We can also do the same thing by using the `reduce()` function of the `purrr` package.  This takes a list of elements (which can be tibbles) and then applies a function that requires two inputs iteratively using the first pair of elements and creating a single element and then applying the function again to the output element and the next listed element and so on and so forth.

For example we will use a list of tibbles and the `full_join()` function which requires two tibbles to combine. This will first combine `CO2_emissions` and  `gdp_growth` and then take the resulting joined tibble and combine this with the `energy_use` tibble. 


You can see that this is a great option if you have many datasets to combine!

```{r}
data_wide <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year", "Label"))

set.seed(123)

data_wide %>%
  slice_sample(n = 6)
```

</details> 
</details>
***

```{r}
data_wide %>%
  glimpse()
```

Nice, looks good!


We will also make a long version of this data, where we will create an new variable called `Indicator` that will indicate what dataset the data came from.

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Try to come up with the code to do this.

####

***
<details> <summary> Click here to reveal the code. </summary>


```{r}
data_long <- data_wide %>%
  pivot_longer(cols = c(-Country, -Year, -Label),
               names_to = "Indicator",
               values_to = "Value")
```

</details> 
***

```{r}
set.seed(123)

data_long %>%
  slice_sample(n = 6)
```


We will now combine this data with the US data about disasters and temperatures.


```{r}
us_disaster %>%
  slice_head(n = 6)
us_temperature %>%
  slice_head(n = 6)
```

We will now use the `bind_rows()` function of the `dplyr` package which will just append the `us_temperature` data and the `us_disaster` data after the `data_long` data. 

```{r}
data_long <-
  list(data_long, us_disaster, us_temperature) %>%
  bind_rows() %>%
  mutate(Country = as.factor(Country))
```

We also converted the `Country` column to a factor in the last line of the code chunk. 

We can check the top and bottom of the new `data_long` tibble to see that our `us_temperature` data is at the bottom. To see the end of our tibble we can use `slice_tail()` function of the `dplyr` package.

```{r}
data_long %>%
  slice_head(n = 6)

data_long %>%
  slice_tail(n = 6)

set.seed(123)

data_long %>%
  slice_sample(n = 10)
```

***
<details> <summary> Click here for details about the difference between `full_join()` and `bind_rows()` </summary>

The difference between this function and the `full_join()` function is that the `bind_rows()` function will essentially just append each dataset to each other, whereas the `full_join()` function collapses data that is comparable. 
Here, you will see an example of what the data would have been like for `data_wide` if we had made it using `bind_rows()` and if `full_join()` had been used but was not joined by the `Label` variable.
Since the `Label` variable has unique values for each type of `Indicator`, this causes the `full_join()` result to be the same as `bind_rows()`. 

Let's consider an example and look at the values for China in the year of 1980.

First we will use the `bind_rows()` function which automatically creates `NA` values for any variable that is missing from a data object that is added by combining the data object with another that contains that missing variable using this function.

```{r}
data_wide_br <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  bind_rows()

data_wide_br %>%
 filter(Country == "China",
           Year == 1980)
```

We see that we have three rows of data.

Now we will use the `full_join()` function two ways. First we will combine by `Country` and `Year` and `Label`. 

```{r}
data_wide_fj_label <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year", "Label"))

data_wide_fj_label %>%
  filter(Country == "China", Year == "1980")
```

Again we have 3 rows of data. The data produced by  `bind_rows()` and `full_join()` is identical (which we can check by using the `setequal()` function of the `dplyr` package) and has the same dimensions (which we can check by using the `dim()` base function).

```{r}
dim(data_wide_br)
dim(data_wide_fj_label)
setequal(data_wide_fj_label, data_wide_br)
```

However, now we will join by only `Country` and `Year`:

```{r}
data_wide_fj <-
  list(CO2_emissions, gdp_growth, energy_use) %>%
  reduce(full_join, by = c("Country", "Year"))

data_wide_fj %>%
  filter(Country == "China", Year == "1980")
```

Now we see that we have only a single row. The data that corresponds to the same year and country has been collapsed into a single but wider row. 

This is something to keep in mind when you are wrangling your data. The choice of what function to use and how should depend on how you want the data to be after you combine the different sources of data together.

</details>  
***

We have a few more things to do before we leave the data wrangling section. 

We will create a new variable called `Region` that will indicate if the data is about the United States or a different country based on the values in the `Country` variable. 
To do this, we will use the `case_when()` function of the `dplyr` package. 

For example, if the `Country` variable is equal to `"United States"` the value for the new variable will also be `"United States"`, where as if the `Country` variable is not equal to `"United States"` but is some other character string value, such as `"Afghanistan"`, then the value for the new variable will be `"Rest of the World"`. We can specify that something is not equal by using the `!=` operator.

The new values for the new variable `Region` are indicated after the specific conditional statements by using the `~` symbol. 


```{r}
data_long %<>%
  mutate(Region = case_when(Country == "United States" ~ "United States",
                            Country != "United States" ~ "Rest of the World"))

data_long  %>%
  arrange(Country) %>%
  slice_head(n = 6)
```

We can also remove rows for countries with `NA` values using the `drop_na()` function of the `tidyr` package to drop all years with missing data.

```{r}
data_long_with_miss <-
  data_long %>%
  arrange(Country)

data_long %<>%
  drop_na() %>%
  arrange(Country)
```

You can see that by removing the NA values the data for Afghanistan starts at 1949 instead of 1751.

```{r}
data_long %>%
  slice_head(n = 6)
```

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Using only data from the US, calculate the first and last year that a value was reported for each variable (e.g. CO2 emissions, energy use, etc). 

<b><u> Hint </u></b>: Use the `group_by()` and `summarize()` functions in the `dplyr` package. 

####

To allow users to skip import and wrangling we will save the data as an RDA file as well as a CSV file as this is often useful to send our data to collaborators. We will save this in a "wrangled" subdirectory of our "data" directory of our working directory.

```{r, eval = FALSE}
save(data_long, file = here::here("data", "wrangled", "wrangled_data.rda"))
readr::write_csv(data_long, path = here::here("data","wrangled", "wrangled_data.csv"))
```


# **Data Visualization**
*** 

If you have been following along but stopped, we could load our wrangled data like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> If you skipped the data wrangling section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/wrangled/wrangled_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***

</details>
***

Now we will create some simple plots to examine the CO2 emissions over time using the `ggplot2` package.

As you may have already seen, there are many functions available in base R that can create plots (e.g. `plot()`, `boxplot()`). 
Others include: `hist()`, `qqplot()`, etc. 
These functions are great because they come with a basic installation of R and can be quite powerful when you need a quick visualization of something when you are exploring data. 

We are choosing to introduce `ggplot2` because, in our opinion, it is one of the simplest ways for beginners to create relatively complicated plots that are intuitive and aesthetically pleasing. 

## **The `ggplot2` R package**
***

The reasons [`ggplot2`](http://ggplot2.tidyverse.org) is generally intuitive for beginners is the use of [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.html) or the `gg` in `ggplot2`. 
The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. 
There are specific "words" that we will introduce, and once you are comfortable with them, you will be able to create (or "write") hundreds of different plots. 

The critical part to making graphics using `ggplot2` is the data needs to be in a _tidy_ format. 
Given that we have just spent time putting our data in _tidy_ format, we are primed to take advantage of all that `ggplot2` has to offer! 

We will show how it is easy to pipe _tidy_ data (output) as input to other functions that create plots. 
This all works because we are working 
within the _tidyverse_. 

#### What is the `ggplot()` function? 

As explained by Hadley Wickham: 

> the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.

#### `ggplot2` Terminology 

* **ggplot** - the main function where you specify the dataset and variables to plot (this is where we define the `x` and
`y` variable names)
* **geoms** - geometric objects
    * e.g. `geom_point()`, `geom_bar()`, `geom_line()`, `geom_histogram()`
* **aes** - aesthetics
    * shape, transparency, color, fill, line types
* **scales** - define how your data will be plotted
    * continuous, discrete, log, etc

The function `aes()` is an aesthetic mapping function inside the `ggplot()` object. 
We use this function to specify plot attributes (e.g. `x` and `y` variable names) that will not change as we add more layers.  

Anything that goes in the `ggplot()` object becomes a global setting. 
From there, we use the `geom` objects to add more layers to the base `ggplot()` object. 
These will define what we are interested in illustrating using the data.  

## **CO2 Emissions**

Let's start by plotting the CO2 emissions over time. 
Because our dataset contains other variables, we first need to filter our data to only include the CO2 emissions data by using the `filter()` function of the `dplyr` package. 
To use this function we need to specify what value (e.g. `Emissions`) we want for a given variable or column (e.g. `Indicator`).  

In this case, we filter to keep all rows where the `Indicator` variable is equal to the word `Emissions`. 
Notice that this needs to be in quotes, while the variable name does not.

```{r}
data_long %>%
  filter(Indicator == "Emissions")
```

We also need to sum the emissions across countries for each year. 
Here, we use the `group_by()` and `summarize()` function that we previously learned about. 

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value))
```

Then, we use the `aes()` argument of the `ggplot()` function to define that our x-axis will be the `Year` variable, the y-axis will be the emission `Value` variable, and that our data should be grouped or separated by the `Country` variable. 

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions))
```

Looks like we got a blank plot. 
What happened? 
We need to tell R what _type of plot_ we want. 
To do that, we need to add another layer to define how we want the plot to look. 
We do so by using the `+` sign in between each command. 

### Line plots

To tell R what type of plot we want, we need to add another _layer_ to our ggplot object. 
To add a type of plot, we can use one of the many `geom_*` functions in `ggplot2`.
For example, type `geom` into the RStudio console and you will see many options to scroll through.

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "geom_.png"))
```

Here, we will use the `geom_line()` function because we would like to create a line plot.
We also use the `size` argument in `geom_line()` to control the size of the line. 

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5)
```
Wow, the CO2 is really rising sharply!

Finally, let's make this plot really nice by adding a few final touches. 
To change title, caption, and the axis labels, we can use the `labs()` function. 
Again, notice that a plus sign is used between each layer that we add to the plot. 
To make CO2 appear with a subscript we can use `~CO[2]~`.  

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)")
```

Next, we use the `theme()` function to change the font size of the x-axis, y-axis, axis titles, and the caption as shown below. 
To know what to call each element of the plot in this function to change the size type `?theme()` in the console. 

You will see a very large list that includes other plot aspects like the background and the legend. 
This function can be used to modify your plot to your specifications. 

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))
```

We can clearly see that global CO2 emissions have dramatically risen since 1900.

Notice, we used the function `theme_linedraw()` of `ggplot2` to change the general appearance of the plot. 

**Useful tip**: You can type `theme_` in the RStudio console to see the various plot theme options available.

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "themes.png"))
```

Great! We've created our first plot. 

Before we leave this section, let's save this theme so we do not have to keep typing the same code in future plots. 

```{r}
my_theme <-
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))
```

In this way, we can just add another _layer_ to our plot with the `my_theme` we created with the specifications of the sizes of the title and axis labels. 

```{r}
CO2_world <-
  data_long %>%
  filter(Indicator == "Emissions") %>%
  group_by(Year) %>%
  summarize(Emissions = sum(Value)) %>%
  ggplot(aes(x = Year, y = Emissions)) +
    geom_line(size = 1.5) +
    labs(title = "World " ~CO[2]~ " Emissions per Year (1751-2014)",
         caption = "Limited to reporting countries",
         y = "Emissions (Metric Tonnes)") +
  my_theme
```

We are also saving the plot to an object called `CO2_world`. 
To show the plot we simply type the name of the object: 

```{r}
CO2_world
```
Now let's say we wanted to save this plot.

We could do so using the using the  `save()` function to save this to a "plot" directory in our working directory as an RDA file and we can use the `png()` function to save a png for collaborators. We need to use `dev.off()` function to close the graphical device that we will use to create the png version of the plot so that we are ready to make another plot like this.

```{r}
save(CO2_world, file =here::here("plots", "CO2_world.rda"))
png(here::here("plots", "CO2_world.png"))
CO2_world
dev.off()
```


One thing that would be nice to know is which countries are contributing the most or the least to CO2 emissions. 
Let's continue to explore the data to investigate how CO2 emissions from individual countries have changed over time. 

Next, we go back to using our `data_long` dataset. 
Here, we use the `group` argument in `aes()` which controls whether a line should be drawn 

```{r}
data_long %>%
  filter(Indicator == "Emissions") %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
  geom_line() +
  ylab("Emissions") +
  my_theme
```

We can see that many countries show a dramatic increase in emissions over time with a handful of countries with particularly high levels. 

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

- What happens when you do not use the `group = Country` argument? 
- What other aesthetic  (i.e. `aes()` arguments) can be changed?

####

Since we have many overlapping lines, we will make our lines slightly transparent by using the `alpha` argument. 

This takes values from 0 to 1, where 0 is completely transparent and 1 is completely opaque. 

We also add our `my_theme` controlling the size of the title and axis labels. 

```{r}
CO2_countries <-
  data_long %>%
  filter(Indicator == "Emissions") %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
  geom_line(alpha = 0.4) +
  labs(title = "Country" ~CO[2]~ "Emissions per Year (1751-2014)",
     caption = "Limited to reporting countries",
           y = "Emissions (Metric Tonnes)") +
  my_theme

CO2_countries
```

Our plot is starting to look really good. 
One question you still might have is which country corresponds to which line? 
Which line indicates the emissions in the US? 


### Adding color 

We can add another "layer" on top of our first plot to add a blue line just for the US data. 
To do this we need to indicate what data we would like to plot, so we need to filter for just the US data and then we need to indicate that it will be colored by Country, even though in this case we only have one line to color. 
The default color would be a salmon pink color, but we would like blue. 
So we will use the `scale_color_manual()` function to manually choose the color that we want by using `scale_colour_manual(values = c("blue"))`. Often you might use red to highlight a subset of the data, however, this can be difficult for viewers with certain types of colorblindness. 

Notice how the color name needs to be in quotes and that the argument `values =` is used to specify what color values to use.

We can add this line to the plot in two ways. 

**Useful tip**: Instead of retyping the original code to create the `CO2_countries` plot, we can just use the `+` operator: 

```{r}
CO2_countries +
  geom_line(data = data_long %>%
              filter(Indicator == "Emissions",
                     Country == "United States"),
              aes(x = Year, y = Value, color = Country)) +
  scale_colour_manual(values = c("blue"))
```

It looks like the US has long been the largest CO2 emission producing country until recently, when the US was surpassed by another country. 

Let's figure out who are the top 10 emission producing countries in 2014. 
Here, we filter the data for the year 2014, which was the final year of the data. 
Then, we can make a rank variable based on the `Value` variable for the amount of emissions produced. 

There are many functions in the `dplyr` package for ranking values that are based on the [SQL](https://en.wikipedia.org/wiki/SQL){target="_blank"} or specifically [SQL rank functions](https://www.sqlshack.com/overview-of-sql-rank-functions/){target="_blank"}. 
SQL is another programming language for managing large amounts of data. 
The difference in the rank functions mostly has to do with how to deal with ties in the data.  
We will use `dense_rank()`, as we do not want gaps between ranks.

```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "rank.png"))
```

We want to do this in descending order because we want to rank by largest to smallest, so we will use the `desc()` function of the `dplyr` package. 
Then, we will arrange the output by rank using the `arrange()` function of the `dplyr` package. 

```{r}
top_10_count <-
  data_long %>%
  filter(Indicator == "Emissions", Year == 2014) %>%
  mutate(rank = dense_rank(desc(Value))) %>%
  filter(rank <= 10) %>%
  arrange(rank)

top_10_count
```

We can see that China is now the top emission producing country.

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What are the bottom 10 emission producing countries in 2014?

####

Let's make a plot of *just these top ten countries*. 

To do this, we need to filter the data to just these top countries by using the `%in%` operator to only keep countries in our `Country` variable that are also in the `Country` variable within `top_10_count`. 
We can use the `pull()` function also of the `dplyr` package to specifically grab just the `Country` data out of `top_10_count`.

Since we have 10 countries we will want to differentiate them by color. 

To color our plot we will use the `viridis` color palette which is compatible with color-blindness by using the `scale_color_viridis_d()` function. 
This function is available by loading the `ggplot2` package. 
There are a few variations for discrete values as `_d`, or binned continuous values as `_b`, or continuous values as `_c`. 
See [here](https://ggplot2.tidyverse.org/reference/scale_viridis.html) for more information.


```{r}
Top10b <- data_long %>%
  filter(Country %in% pull(top_10_count, Country)) %>%
  filter(Indicator == "Emissions") %>%
  filter(Year >= 1900) %>%
  ggplot(aes(x = Year, y = Value, color = Country)) +
    geom_line() +
    scale_color_viridis_d() +
    labs(title = "Top 10 Emissions-producing Countries in 2010 (1900-2014)",
         subtitle = "Ordered by Emissions Produced in 2014",
         y = "Emissions (Metric Tons)",
         x = "Year") +
    my_theme

Top10b
```

It's still a bit difficult to tell which line corresponds to which country. 
So, let's add a text label directly to the plot. . 

### Adding text labels

One way to do this is to add text layer to our plot using the `geom_text()` function of the `ggplot2` package. 
We need to first specify what variable or column we will use. 
However, we only want to pull the text labels for the top ten countries in the last year. 
To do this, we use the `last()` function of the `dplyr` package.

Then, we need to indicate that our text label will be based on the `Country` variable using the `aes()` aesthetics mapping argument. 
We will also get rid of our legend since we will not need it anymore, by using the `theme()` function of the `ggplot2` package.

```{r}
Top10b +
  geom_text(data = data_long %>%
              filter(Country %in% pull(top_10_count, Country)) %>%
              filter(Indicator == "Emissions") %>%
              filter(Year == last(Year)),
            aes(label = Country)) +
  theme(legend.position = "none")
```

Not bad, but some of the labels are overlapping and difficult to read.
We can use the `check_overlap = TRUE` argument within the `geom_text()` function to remove overlapping variables. 
Also, we can expand the plot area horizontally so that the names are not cutoff by using `scale_x_continuous(expand = c(0.2,0))`. This takes a vector with two values. The first value indicates what percentage to expand the x axis in both directions. In our case we will expand by 15 percent. The second value indicates what absolute value to expand the limit of the x axis. In our case `c(0.15,0)` will achieve a similar result as `c(0, 17)`, as the range of values from 1990 to 2014 is 114 years and 15% of this is 17. 

```{r}

Top10b +
  geom_text(data = data_long %>%
              filter(Country %in% pull(top_10_count, Country)) %>%
              filter(Indicator == "Emissions") %>%
              filter(Year == last(Year)),
            aes(label = Country),
            check_overlap = TRUE) +
  scale_x_continuous(expand = c(0.15, 0)) +
  theme(legend.position = "none")
```

This is easier to read now, but it also causes us to lose some of the labels. 
There are several alternative ways we can keep all of our labels and make them easier to read. The first package we will show is called `directlabels`.

The most simple option is to use the `direct.label()` function, which will automatically add labels at the end of the lines. 
However, it is a bit difficult to see some of our labels as they get automatically sized to fit the plot.

```{r}
direct.label(Top10b) +
  scale_x_continuous(expand = c(0.3, 0))
```

Alternatively this can be done within the `ggplot2` framework by layering using the `geom_dl()` function.

```{r}
Top10b +
  scale_x_continuous(expand = c(0.3, 0)) +
  geom_dl(aes(label = Country), method = list("last.bumpup")) +
  theme(legend.position = "none")
```

This is more legible now. 
We have all 10 countries names listed and they are in order of the last data point and they are relatively close to the lines that they correspond to. 

Another option is to use a different method in the `directlables` package. 
[Here](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} is a list of options.

For example, the `"angled.boxes"` method looks nice for some plots but does not work very well for our plot:

```{r}
direct.label(Top10b, method = list("angled.boxes")) +
  scale_x_continuous(expand = c(0.3, 0))
```

However the `"last.polygons"` method works quite well:

```{r}
direct.label(Top10b, method = list("last.polygons")) +
  scale_x_continuous(expand = c(0.3, 0))
```

The second package is the `ggrepel` package, which is especially good for crowded labels that might overlap one another. 
It allows for more control than the `directlabels` package. 

Specifically, we will use the `geom_text_repel()` function from the `ggrepel` package. 
Just like with `geom_text()`, first we need to specify what data we want to include. 
Then, we specify with the `aes()` argument that our label will be based on the `Country` variable and we again specify what variable to use for our x axis and y axis, so that we indicate where the labels should be plotted. 

```{r}
Top10b +
  geom_text_repel(data = data_long %>%
                    filter(Country %in% pull(top_10_count, Country)) %>%
                    filter(Indicator == "Emissions") %>%
                    filter(Year == last(Year)),
                  aes(label = Country, x = Year, y = Value)) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0.3, 0))
```
You can see that this package creates segments that connect the label to the line.

There are many arguments to use to style your labels just the way that you want:

```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "ggrepel.png"))
```

##### [[source]](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"}

See [the ggrepel vignette](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"} for more details.

Let's play around with some of these options in the table above. 
```{r}
Top10b +
  geom_text_repel(data = data_long %>%
                    filter(Country %in% pull(top_10_count, Country)) %>%
                    filter(Indicator == "Emissions") %>%
                    filter(Year == last(Year)),
                  aes(label = Country, x = Year, y = Value),
                  nudge_x = 10,
                  hjust = 1,
                  vjust = 1,
                  segment.size = 0.25,
                  force = 1) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0.3, 0)) +
  scale_y_continuous(expand = c(0.3, 0))
```

Nice, that looks pretty good.
For fun, let's try showing our data in an entirely different way. 


### Tile plots

This time we will create a `geom_tile` plot.

To create this plot we will filter our data to include only the Countries included in the `Country` variable of the `top_10_count`. 

Then, we will use the `fct_reorder()` function of the `forcats` package to order our countries based on the last emission value in 2014.

To use this function, the variable that is to be reordered is listed first. 
The variable that is being used to determine the order is listed second. 
Finally, a function to apply to the variable listed second is listed third.
This function is used to determine the order. 
In this case, we want to determine the last value of the `Value` variable using the `last()` function (recall that this is also a function of the `dplyr` package). 
Then, the `Country` variable will be ordered by the last value of the `Value` variable.

To color our plot we will use the `viridis` color palette again but this time we will use the `scale_fill_viridis_c()` -- recall that the `_c` indicates a continuous scale. 
See [the scale_viridis reference](https://ggplot2.tidyverse.org/reference/scale_viridis.html) for more information.


```{r}
Top10t <-
  data_long %>%
  filter(Country %in% pull(top_10_count, Country)) %>%
  filter(Indicator == "Emissions") %>%
  filter(Year >= 1900) %>%
  ggplot(aes(x = Year, y = fct_reorder(Country, Value, last))) +
    geom_tile(aes(fill = log(Value))) +
    scale_fill_viridis_c()
```


Finally, let's clean up the axes and the axes labels: 

```{r}
Top10t <- Top10t +
  scale_x_continuous(breaks = seq(1900, 2014, by = 5),
                     labels = seq(1900, 2014, by = 5)) +
  labs(title = "Top 10 " ~CO[2]~ "Emission-producing Countries",
       subtitle = "Ordered by Emissions Produced in 2014",
       fill = "Ln(CO2 Emissions (Metric Tonnes))") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, angle = 90, color = "black"),
        axis.text.y = element_text(size = 12, color = "black"),
        axis.title = element_blank(),
        plot.caption = element_text(size = 12),
        plot.title = element_text(size = 16),
        legend.position = "bottom")

Top10t
```

Now let's say we wanted to save this plot.

We could do so using the using the  `save()` function to save this to a "plot" directory in our working directory as an RDA file and we can use the `png()` function to save a png for collaborators. We need to use `dev.off()` function to close the graphical device that we will use to create the png version of the plot so that we are ready to make another plot like this.

```{r}
save(Top10t, file =here::here("plots", "Top10t.rda"))
png(here::here("plots", "Top10t.png"))
Top10t
dev.off()
```

We see that Germany had very low emission rates at the end of World War II. 
We also see that the US has consistently had high emission rates since 1900, but the emission rates in China recently surpassed that of the US. 
The portions of the plot that are white indicate that there is no emission data for that country.

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

Think about what the pros and cons are of tile and line plots. In what situations would a tile plot be a better choice for effective scientific communication? In what situations would a line plot be a better choice?

####

## **More than one variable**
***

Now, we will visualize all the variables in our dataset.


### Faceted plots

Here, we use the `facet_wrap()` function of the `ggplot2` package, which plots multiple subplots simultaneously. 

To use `facet_wrap()` with the option for a different y-axis scale for each subplot, we need to set the `scales` argument equal to `"free_y"`. 
We can also indicate where we would like the label for the subplots to be located by using the `strip.position` argument. 

```{r,fig.width=10, fig.height=10}
ggplot(data_long, aes(x = Year, y = Value, group = Country)) +
  geom_line(alpha = 0.2) +
  geom_line(data = data_long %>%
              filter(Country == "United States"),
              aes(x = Year, y = Value, color = Country)) +
  scale_colour_manual(values = c("blue")) +
  labs(title = "Distribution of Indicators by Year and Value",
       y = "Indicator Value") +
  my_theme +
  theme(strip.text = element_text(size = 16, face = "bold")) +
  facet_wrap(Indicator ~ .,
             scales = "free_y",
             strip.position = "right",
             ncol = 1)
```

Notice that we can change the size or style of the font for these labels using the `strip.text =` argument of the `theme()` function. 
We can also specify how many rows or columns we would like the subplots to be shown. 

We can also facet by more than one variable (e.g. `Indicator` and `Region` to show the data from the US compared to other countries).

In this case we want the same y-axis to be used across the rows. We will use the `facet_grid()` function this time instead of `facet_wrap()` because of the way that the two facet variables are displayed. The `facet_grid()` function will as you might expect, create an output of plots that are displayed in a grid. 

The syntax here is to put the name of the two variables on the left or right side of the `~` symbol, which tells you to facet by rows (left) or columns (right).  

First, we will filter out the data about disasters and temperature as this is only for the US, by using the `filter()` function and `!`  indicates that we want only values of the `Indicator` variable not in the list containing `"Disasters"` and `"Temperature"`.

```{r,fig.width=10, fig.height=10}
data_long %>%
  filter(!(Indicator %in% c("Disasters", "Temperature"))) %>%
  ggplot(aes(x = Year, y = Value, group = Country)) +
    geom_line() +
    facet_grid(Indicator ~ Region, scales = "free_y") +
    labs(title = "Distribution of Indicators by Year and Value",
         y = "Indicator Value") +
    my_theme +
    theme(strip.text = element_text(size = 16, face = "bold"))
```



From these plots we can see that each type of data spans a different time span.


#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What happens when you create the same plot with `facet_wrap()`? 
Why might this be preferable in certain cases?

####

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Calculate the total number of countries per year reporting C02 emissions, energy use and GDP. 
Plot this summary statistic (y-axis) across the years (x-axis). 
What do you see? 

<b><u> Hint </u></b>: Use the `tally()` function in the `dplyr` package. 

####

### Line segment plots

There are also some other common visualization techniques that are good for showing the difference between a set of observations and a mean value across time. 

One of those is a line segment plot.
For simplicity, let's focus only on the data from the US. 
Let's calculate the mean across all years for the CO2 emission and temperature from 1980 to 2010.
Recall that our `Indicator` variable describes what kind of data we have (Emissions, Temperature, GDP, Energy, Disasters).
We will calculate the mean for each `Indicator` set of data by first grouping by this variable and then calculating the mean of the values of the `Value` variable for each set of data. 
We will call this new variable `Mean`. 
Then, we calculate the difference between each observation and the mean and create a new variable for these values called `Diff_from_mean`.  Again, all of this is performed for each group of data separately.
Once we have created the new variables, we want to use the `ungroup()` function so that we no longer perform functions on subsets of the data based on the `Indicator` variable.
Finally, we will also create a factor variable about the sign of the `Diff_from_mean` value to distinguish positive or negative changes. 
We will use this to color our plots.

```{r}
data_long_us <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2010) %>%
  group_by(Indicator) %>%
  mutate(Mean = mean(Value), Diff_from_mean = Value - Mean) %>%
  ungroup() %>%
  mutate(Diff_color = sign(Diff_from_mean)) %>%
  mutate(Diff_color = as.factor(Diff_color))
```

```{r}
glimpse(data_long_us)
```

Next, we use the `geom_segment()` function to draw a straight line between points (`x`, `y`) and (`xend`, `yend`). 
In our case, this creates a plot that shows a bar for the difference between the observation and the mean across all the years.

```{r, fig.width=6, fig.height=6}
data_long_us %>%
  filter(Indicator %in% c("Emissions", "Temperature", "Disasters")) %>%
  ggplot(aes(x = Year, y = Value)) +
    geom_segment(aes(x = Year, y = Value, xend = Year,
                     yend = Mean, color = Diff_color),
                 size = 3.25) +
  scale_color_manual(values = c("blue", "red")) +
  geom_hline(aes(yintercept = Mean), linetype = 1, color = "black") +
  facet_wrap(Indicator ~ ., scales = "free_y", ncol = 1) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90),
         axis.title = element_blank(),
    legend.position = "none")  +
  labs(title = "US Disasters, Emissions, and Temperatures (1990-2010)",
    subtitle = "Indicator Mean of 1990-2010 Represented by Solid Black Line")
```
We can see from this plot that overall there has been an increase in disasters, emissions and temperature in the most recent years. 

#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What trends do you see in GDP and energy use across time? 

####


### Scatter plots

Next, let's zoom in on two of the variables: CO2 emissions and temperature. 
We use years between 1980 and 2014 as we have values for all of those years for these two variables. 

We know that the datasets do not span the same amount of time. 
So let's limit this plot to only the years where the data overlaps for both CO2 emissions and temperature.

We use the `geom_point()` function to create a scatter plot between the `x` and `y` variable defined in `aes()` where `x` is time and `y` is one of the two variables. 
We also add a line on top of the scatter plot that smooths the trend from the points. The smoother we used here is `loess` [locally estimated scatterplot smoothing](https://en.wikipedia.org/wiki/Local_regression){target="_blank"}, a type of [local polynomial regression](https://en.wikipedia.org/wiki/Local_regression){target="_blank"} fitting. This is a nonparametric regression that is also called a "moving regression" where subsets of points that are close to one another (hence the term local) are used in a least squares  linear or nonlinear fit. Thus this results in a fit that may curve with the data.

```{r}
CO2_temp_US_facet <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2014,
         Indicator %in% c("Emissions", "Temperature")) %>%
  ggplot(aes(x = Year, y = Value)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  scale_x_continuous(breaks = seq(1980, 2014, by = 5),
                     labels = seq(1980, 2014, by = 5)) +
  facet_wrap(Label ~ ., scales = "free_y", ncol = 1) +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, angle = 90, color = "black"),
        axis.text.y = element_text(size = 12, color = "black"),
        strip.text.x = element_text(size = 14),
         axis.title = element_blank(),
         plot.title = element_text(size = 16))
  labs(title = "US Emissions and Temperatures (1980-2014)")

CO2_temp_US_facet
```

Note, we are showing a different `theme` here, namely the `theme_classic()` theme. 
We can see that there are similar patterns of CO2 emission levels and average annual temperatures.

We will save this plot now like so to our "plots" directory:
```{r}
save(CO2_temp_US_facet, file =here::here("plots", "CO2_temp_US_facet.rda"))
png(here::here("plots", "CO2_temp_US_facet.png"))
CO2_temp_US_facet
dev.off()
```

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

- Try show a similar plot without filtering by years and faceting by the other three variables: energy use, GDP and disasters. Are there other variables that look like they might have a similar pattern to CO2 emissions? 
- Try using a different smoother in `geom_smooth()`. 

####

Next, instead of looking at the variables separately in faceted plots, let's look at the relationship between CO2 emissions and other variables directly. Thus, it is useful to have each of these indicators as their own variable.

We can do this by using `pivot_wider()` to transform our long data table into a wide format. 

```{r}
wide_US <-
  data_long %>%
  filter(Country == "United States", Year >= 1980, Year <= 2014) %>%
  select(-Label) %>%
  pivot_wider(names_from = Indicator, values_from = Value)
```

Let's save this data as an rda file for future use and as a csv file, as this is often useful for collaborators.
We will save this in a "wrangled" subdirectory of our "data" directory of our working directory.

```{r, eval = FALSE}
save(wide_US, file = here::here("data", "wrangled", "wrangled_US_data.rda"))
readr::write_csv(wide_US, path = here::here("data", "wrangled", "wrangled_US_data.csv"))
```

Now we can specify which indicators we want to look at, so now we can specifically look at emissions and temperature.

```{r}
CO2_temp_US <-
  wide_US %>%
  ggplot(aes(x = Emissions, y = Temperature)) +
    geom_point() +
    theme_classic() +
    theme(axis.text.x = element_text(size = 12, color = "black"),
          axis.text.y = element_text(size = 12, color = "black"),
           axis.title = element_text(size = 14),
           plot.title = element_text(size = 16)) +
    labs(title = "US Emissions and Temperature (1980-2014)",
         x = "Emissions (Metric Tonnes)",
         y = "Temperature (Fahrenheit)")


CO2_temp_US
```

It might be helpful to add a trend line to this. We can do so by using the `geom_smooth()` function of the `ggplot2` package.

If we want to look at a linear trend we need to specify the method using the `method = lm` argument. This adds a line to the data based on a linear model of the data using the `lm` function of the `stats` package. We will discuss the se = FALSE argument later. 

We can just add this to the plot object that we just created to create a plot with this trend line.

```{r}
CO2_temp_US <- CO2_temp_US + geom_smooth(method = "lm", se = FALSE)
CO2_temp_US 
```

Indeed, it does look like there is a positive, linear trend. 

We will also save this plot:

```{r}
save(CO2_temp_US, file =here::here("plots", "CO2_temp_US.rda"))
png(here::here("plots", "CO2_temp_US.png"))
CO2_temp_US
dev.off()
```

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

- Make similar plots for between CO2 emissions and other variables. 
- Are there other pairs variables that look like they might have a similar pattern to CO2 emissions?
- Does this match what we saw above? 
- Do these trend look linear or non-linear? 

####

Now that we see that there might be a linear relationship between CO2 emissions and temperature, let's learn about some statistical techniques to measure the strength of that relationship. 




# **Data Analysis**
***
 
If you are following along and stopped you could load the data you will need like so:

```{r}
load(here::here("data", "wrangled", "wrangled_US_data.rda"))
```

***
<details> <summary> If you skipped the previous sections click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_US_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data (called `wrangled_US_data.rda`) can be found [here](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/wrangled/wrangled_US_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.

```{r}
load(here::here("data", "wrangled", "wrangled_US_data.rda"))
```


***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>
***



In this section, we are going to introduce some ways to better understand how two variables move together. 
We will focus on the CO2 emissions and temperature, but you will be encouraged to explore the relationship between CO2 emissions and the other variables. 

### **Basic summary statistics**
***

We can always calculate the sample mean and variance for two variables. 

```{r}
wide_US %>%
  summarize(mean(Emissions), mean(Temperature), sd(Emissions), sd(Temperature))
```

These are useful, but on their own they do not summarize whether or not there is a relationship between `Emissions` and `Temperature` (also these are on different scales entirely). 

What else could we use? 
Next, we are going to learn about the correlation coefficient, which is a summary statistic that describes how two variables are related or move together. 

### **Correlation coefficient**
***

We can use the [correlation coefficient](https://rafalab.github.io/dsbook/regression.html#corr-coefl){target="_blank"}. 
Here, we are using this summary statistic to measure the strength of a _linear_ relationship between two variables. 

If we plot one variable on the x-axis and the other variable on the y-axis, we can see:

1. The strength of the relationship - based on how well the points form a line  
2. The direction of the relationship - based on if the points progress upward or downward 

If the variables point upward in a very clear line, then there is a strong positive relationship. 
If the points do not really form a line, then there is a weak linear relationship or no linear relationship. There may however be a nonlinear relationship if the points create a different but defined shape. 

See [here](https://towardsdatascience.com/estimating-non-linear-correlation-in-r-62c6571cb1db){target="_blank"} for more information on nonlinear relationships.

If the points form a downward sloping line, then there is a negative relationship.

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics('https://www.mathsisfun.com/data/images/correlation-examples.svg')
```

##### [[source]](https://www.mathsisfun.com/data/correlation.html){target="_blank"}

The numbers below each plot above are called correlation coefficients. 
They range from -1 to 1. 
A value of zero indicates that there is no correlation between the variables. 
While a value of 1 or -1 indicates perfect correlation, the closer the coefficient is to 1 or -1, the stronger the relationship. 
The sign of the coefficient indicates the direction of the relationship. 
If there is a negative relationship then the variables show opposing changes from each other - as one gets larger the other gets smaller. 
If the sign is positive, then the variables increase similarly. 

We previously made this plot:

```{r}
load(here::here("plots", "CO2_temp_US.rda"))
CO2_temp_US

```

Let's calculate the Pearson's correlation coefficient called "Rho" $\rho$ between CO2 emissions and temperature in the US. There are a few ways to calculate a correlation coefficient and this is one of the most common.

Formally, if we have a pair of observations $(x_1, y_1), \dots, (x_n,y_n)$, the correlation coefficient $\rho$ between $x$ and $y$ is defined as 

$$
\rho = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i-\mu_x}{\sigma_x} \right)\left( \frac{y_i-\mu_y}{\sigma_y} \right)
$$
where $\mu_x, \mu_y$ are the means of $x_1,\dots, x_n$ and $y_1, \dots, y_n$, respectively, and $\sigma_x, \sigma_y$ are the standard deviations. 

Therefore, we can standardize the two variables and essentially average (the denominator is n-1) the standardized values to calculate the correlation coefficient `rho`. 

Here we will manually perform the calculation. We will use the `tally()` function of the `dplyr` package to get the number of samples $n$. In this case this is equivalent to the number of rows in the `wide_US` tibble. We need to then use the `pull()` function to specifically grab the value out of the tibble that is created from using this function. As you can see from using the base `class()` function that this is a `tbl_df` which is short for tibble data frame (the tidyverse version of a data frame) rather than just a number. 

```{r}
tally(wide_US)
class(tally(wide_US))
```

When we check the class after using the `pull()` function we see that it is an integer.

```{r}
pull(tally(wide_US), n)
class(pull(tally(wide_US), n))
```

We will also use the base `scale()` function to standardize the `Emissions` and `Temperature` values.
```{r}
wide_US %>%
  summarize(rho = (1/(pull(tally(wide_US), n) -1)) *(sum(scale(Emissions) * scale(Temperature)))) %>%
  pull(rho)
```

Alternatively, you can use the `cor()` function in base R like so:

```{r}
wide_US %>%
  summarize(r = cor(x = Emissions,
                    y = Temperature, 
               method = "pearson")) %>%
  pull(r)
```

If you want to learn more about why this is the calculation to determine the strength of the relationship between two variables, see this [link](https://bcheggeseth.github.io/Stat155Notes/two-quantitative-variables.html). 

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

There are different types of correlation coefficients. Look at the help file for the `cor()` function by typing `?cor()` into the RStudio Console to learn more and then try a different `method`. 

What are the differences? 

####


#### {.think_question_block}
<b><u> Question Opportunity </u></b>

- Try calculating the correlation coefficient between CO2 emissions and the other variables. 
- What do you expect? 

####

To test if the association between a pair of variables is [statistically significant](https://en.wikipedia.org/wiki/Statistical_significance), you can use the `cor.test()` of the `stats` package to calculate the correlation coefficient, as well as confidence intervals for correlation coefficients.

We can use the `tidy()` function of the `broom` package to make the output more usable for working with in R. The role of this function is to pull numeric values from outputs and create a data frame of the values.

```{r}

cor.test(pull(wide_US, Emissions),
         pull(wide_US, Temperature))

broom::tidy(cor.test(pull(wide_US, Emissions),
         pull(wide_US, Temperature)))
```

We see that the correlation coefficient quantifying the strength of the linear relationship between C02 emissions and temperature is statistically significant.

### **Relationship between correlation and linear regression**
***

Let's briefly discuss the relationship between correlation and linear regression, which is further described in the [Introduction to Data Science book](https://rafalab.github.io/dsbook/regression.html). 

We can use a regression line to predict a random variable $Y$ given that we have gathered or observed some data about another variable $X=x$. 
The regression line formally is defined as:

$$ \left( \frac{Y-\mu_Y}{\sigma_Y} \right) = \rho \left( \frac{x-\mu_X}{\sigma_X} \right) $$
where $\mu_X$ and $\sigma_X$ ($\mu_Y$ and $\sigma_Y$) are the mean and standard deviation of $X$ ($Y$), and $\rho$ is correlation between $X$ and $Y$. 
If $x$ is larger than $\mu_X$, then for every $\sigma_X$, then $Y$ will also increase $\rho$ standard deviations above $\mu_Y$. 

Re-organizing the terms so that $Y$ is on the left side and everything else is on the right side, we get:

$$ Y = \mu_Y + \rho \left( \frac{x-\mu_X}{\sigma_X} \right) \sigma_Y $$
Thinking about some extreme examples: 

- If $\rho$ = 0 (i.e. no correlation), we ignore the $x$ term entirely and only predict $Y$ using the mean $\mu_Y$. 
- If $\rho$ = 1 (or -1) (i.e. perfect correlation), the regression line predicts an increase (or decrease) that is the same number of SDs. 
- If $\rho$ is between -1 and 1, then we predict using both terms on the right hand side. 

To add regression lines to plots, we will need the above formula in the form: 

$$
y= b + mx \mbox{ with slope } m = \rho \frac{\sigma_y}{\sigma_x} \mbox{ and intercept } b=\mu_y - m \mu_x
$$

In our example, we can calculate the slope and intercept using the formula above and plot the line. 
```{r}
wide_US_summary <-
  wide_US %>%
  summarize(mu_x = mean(Emissions), sd_x = sd(Emissions),
            mu_y = mean(Temperature), sd_y = sd(Temperature),
            rho = cor(Emissions, Temperature),
            slope = rho * sd_y / sd_x,
            intercept = mu_y - rho * sd_y / sd_x * mu_x)

wide_US %>%
  ggplot(aes(x = Emissions, y = Temperature)) +
    geom_point() +
    geom_abline(slope = wide_US_summary$slope,
                intercept = wide_US_summary$intercept) +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 12),
       axis.title.y = element_text(size = 12),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))
```

**Note**: In the plot above, we use the scale of the original variables (CO2 emission and temperature), but the formula above implies that standardization of the variables (i.e. subtracting the mean and dividing by the standard deviation) allows the regression line to have an intercept of 0 and slope equal to $\rho$. A similar plot in standardized units is given below.  

```{r} 
CO2_temp_US_scaled<-wide_US %>%
  ggplot(aes(x = scale(Emissions), y = scale(Temperature))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
  labs(title = "US" ~ CO[2]~ "Emissions and Temperature (1980-2014)",
         y = "Scaled Temperature (Fahrenheit)",
         x = "Scaled Emissions (Metric Tonnes)") +
  theme_linedraw() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
       axis.title.x = element_text(size = 14),
       axis.title.y = element_text(size = 14),
       plot.caption = element_text(size = 12),
         plot.title = element_text(size = 16))

CO2_temp_US_scaled
```

Notice that we also use the `geom_smooth(method = "lm")` function that we previously used. Again, this adds a line corresponding to the slope and intercept from the `lm()` function from the `stats` R package. 



#### {.think_question_block}
<b><u> Question Opportunity </u></b>

What does `se = FALSE` mean? Try turning it to TRUE. What happens? 

####
***
<details> <summary> Click here for the answer.</summary>
`se` stands for standard error. The gray shading shows the [confidence interval](https://stattrek.com/regression/slope-confidence-interval.aspx){target="_blank"} of the smooth line.
</details>
***

### **Limitations of Correlation**
***

While correlation is useful in many settings to understand how two variables move together, correlation is not always a useful summary. 
For example, here are ways it might not be useful: 

- A linear relationship might not be the best way to capture the relationship.
- If an individual were interested in understanding a _causal_ relationship between two variables, as [correlation does not imply causation](https://dfrieds.com/math/correlation-does-not-imply-causation.html){target="_blank"}. Another way of stating this is that simply showing there is a linear trend over time does not imply there is a causal relationship between these two variables.  

As you can see from this plot, often data may show a similar pattern over time by random chance. 
See this [website](https://www.tylervigen.com/spurious-correlations){target="_blank"} for more examples.

```{r, echo = FALSE, out.width="600px"}
knitr::include_graphics(here::here("img", "causation.png"))
```

##### [[source]](http://tylervigen.com/spurious-correlations){target="_blank"}

In this example and in our case study, data was collected over time. This type of data will often have what is called [autocorrleation](https://en.wikipedia.org/wiki/Autocorrelation){target="_blank"} or serial correlation. This means that data points from one year to the next may be similar to one another or have some sort of internal structure related to time (such as seasonality). Let's think about our CO2 emission and temperature data. You may be able to see how one year of CO2 emissions might be fairly similar to next year, because the number of sources (such as industrial factories and cars) in the US will change slightly from year to year, but they will be related to that of the previous years. Anytime we look at correlation between two variables that each have autocorrelation, this can result in a higher likelihood of correlation between these variables. 

Indeed if we look at correlation between two random variables with autocorrelation we can see an inflation in the correlation rho values between the two variables, as compared to that of variables that do not have autocorrelation.

```{r, echo = FALSE}
hist(replicate(5000, { cor(diffinv(rnorm(99)),diffinv(rnorm(99))) }), breaks=100, main="Correlation between autocorrelated variables", xlab="rho")
hist(replicate(5000, { cor(rnorm(100),rnorm(100)) }), breaks=100, main="Correlation between variables with no autocorrelation", xlab="rho")
```

This is something to keep in mind when you evaluate how two things are related to one another over time. There are [methods](https://online.stat.psu.edu/stat462/node/188/) that have been developed to account for this in [time series](https://en.wikipedia.org/wiki/Time_series) analysis (analyzing data over time). This does not mean that two variables (such as CO2 and temperature) may not be in fact correlated, it just means that we need to account for the autocorrelation within each variable, however this is beyond the scope of this case study.

# **Summary**
*** 

## **Summary Plot**
***

The last thing we will do here is to create a plot that summarizes our major findings. 
We will use the `plot_layout()` function of the `patchwork` R package. The `patchwork` allows you to create a plot layout based on mathematical-like formulas. As you can see in this example we want the `CO2_world` and `Top10t` plot on top and we want another row with the `CO2_temp_US_facet` and  the `CO2_temp_US` plot on the bottom. The plot_layout() function of the `patchwork` package then allows us to specify heights and widths for the plots.

We will also save the figure using the `grDevices` `png()` function. We can specify the name of our plot file and where we want it to be saved using the `here()` function of the `here` package. In this case in the `img` subdirectory of the directory containing our .Rproj file. We can also specify the size of the plot and the resolution using the `res` argument.  The `grDevices` `dev.off()` function is necessary to close the graphics device.

This will include a few plots that we made previously in other sections. We saved these in the "plots" directory and will load these now for users who stopped and restarted or started at the data analysis section.

```{r}
load(here::here("plots", "CO2_world.rda"))
load(here::here("plots", "Top10t.rda"))
load(here::here("plots", "CO2_temp_US_facet.rda"))
```

```{r, eval = FALSE}
png(here::here("img", "mainplot.png"), units = "in", width = 12, height = 10, res = 300)
(CO2_world | Top10t) / (CO2_temp_US_facet | CO2_temp_US_scaled) +
  plot_layout(widths = c(1, 2),
              heights = unit(c(4, 10), c('cm', 'cm')))
dev.off()
```

```{r, echo = FALSE, out.width = "800 px", dpi=300}
knitr::include_graphics(here::here("img", "mainplot.png"))
```



## **Synopsis**
***

In this case study we evaluated CO2 emissions from as far back as 1751 for some countries to 2014. We discovered that global levels of CO2 emissions have dramatically increased over time. We also learned that some countries have been responsible for particularly high levels. 

We also took a look at how CO2 emissions might relate to other factors, such as temperature, energy use, and natural disasters. We learned that we can summarize the relationship between two sets of data using **correlation coefficients**. We also learned that we can use **regression** to predict or describe how changes in one variable may influence changes in another variable. Importantly, we also learned that just because two variables show strong correlation or show an association, it does not necessarily indicate that they are causally related. 

However, there is quite a bit of scientific evidence to indicate that in fact CO2 emissions trap heat and lead to increased global temperatures. Yet, it is important to realize that there are other factors involved in the relationship between US CO2 emissions and US annual average temperatures. For example there are CO2 emissions from other countries in the atmosphere, there are other greenhouse gases, there is already existing CO2 in the atmosphere that will continue to trap heat for many years, and finally there is heat trapped in the ocean due to previous emissions that will cause delayed changes in surface temperatures. However, it is vital that we work around the globe to reduce future greenhouse gas emissions to mitigate the increased temperatures that we will experience due to previous and existing CO2 emissions, so that the warming temperatures aren't as extreme as they could be. Furthermore, we need to prepare for increased rates of natural disasters and how these may influence people around the world. Evidence suggests that impoverished people are the [most affected by disasters](https://ourworldindata.org/natural-disasters){target="_blank"}. We need to be particularly mindful of this as we prepare for the future.  



# **Suggested Homework**
***

Ask students to create a plot with labels showing the countries with the lowest CO2 emission levels.

Ask students to plot CO2 emissions and other variables (e.g. energy use) on a scatter plot, calculate the Pearson's correlation coefficient, and discuss results. 

# **Additional information**
***

## **Helpful Links**
***

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}
[Introduction to correlation](https://www.mathsisfun.com/data/correlation.html){target="_blank"}
[Correlation coefficient](https://rafalab.github.io/dsbook/regression.html#corr-coefl){target="_blank"}   
[Correlation does not imply causation](https://dfrieds.com/math/correlation-does-not-imply-causation.html){target="_blank"}  
[Regression](https://rafalab.github.io/dsbook/regression.html){target="_blank"}  
[Locally estimated scatterplot smoothing](https://en.wikipedia.org/wiki/Local_regression){target="_blank"}  
[Local polynomial regression](https://en.wikipedia.org/wiki/Local_regression){target="_blank"}  
[Autocorrleation](https://en.wikipedia.org/wiki/Autocorrelation){target="_blank"}  
[Time series](https://en.wikipedia.org/wiki/Time_series){target="_blank"}   
[Methods to account for autocorrelation](https://online.stat.psu.edu/stat462/node/188/){target="_blank"}  
[US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}   
[National Climate Assessment Report](https://data.globalchange.gov/report/nca3-overview){target="_blank"}    
[Greenhouse gases](https://www.epa.gov/report-environment/greenhouse-gases){target="_blank"} 
[Climate change](https://world101.cfr.org/global-era-issues/climate-change/climate-change-adaptations){target="_blank"}


<u>**Packages used in this case study:** </u>

 Package   | Use in this case study                                                                        
---------- |-------------
[`here`](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[`readxl`](https://readxl.tidyverse.org/){target="_blank"}  | to import the excel file data
[`readr`](https://readr.tidyverse.org/){target="_blank"}  | to import the csv file data
[`dplyr`](https://dplyr.tidyverse.org/){target="_blank"}  |  o view and wrangle the data, by modifying variables, renaming variables, selecting variables, creating variables, and arranging values within a variable   
[`magrittr`](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}  |  to use and reassign data objects using the `%<>%`pipe operator
[`stringr`](https://stringr.tidyverse.org/){target="_blank"}  | to select only the first 4 characters of date data
[`purrr`](https://purrr.tidyverse.org/){target="_blank"}  | to apply a function on a list of tibbles (tibbles are the tidyverse version of a data frame)  
[`tidyr`](https://tidyr.tidyverse.org/){target="_blank"}  | to drop rows with `NA` values from a tibble
[`forcats`](https://forcats.tidyverse.org/){target="_blank"}  | to reorder the levels of a factor
[`ggplot2`](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations
[`directlabels`](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} | to add labels to plots easily
[`ggrepel`](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"} | to add labels that don't overlap to plots
[`broom`](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/) | to make the output form statistical tests easier to work with
[`patchwork`](https://github.com/thomasp85/patchwork){target="_blank"}  | to combine plots


## **Session Info**
***

```{r}
sessionInfo()
```

**Estimate of RMarkdown Compilation Time: **

```{r, echo=FALSE}
rmarkdown:::perf_timer_stop("render")
pts = rmarkdown:::perf_timer_summary()
cat("About", round(pts$time[1]/1000 + 5), "-", round(pts$time[1]/1000 + 15),"seconds")
```

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems. 

## **Acknowledgments**
***

We would like to acknowledge [Megan Latshaw](https://www.jhsph.edu/faculty/directory/profile/1708/megan-weil-latshaw) for assisting in framing the major direction of the case study.

We would like to acknowledge [Qier Meng](https://www.opencasestudies.org/authors/qmeng/) and [Michael Breshock](https://mbreshock.github.io/) for their contributions to this case study. 

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 


<script type='text/javascript' id='clustrmaps' src='//cdn.clustrmaps.com/map_v2.js?cl=080808&w=a&t=tt&d=rkeJy7szR2zgko6SzQvOjgSAKPG6aHwfgP338ysqAyE&co=ffffff&cmo=3acc3a&cmn=ff5353&ct=808080'></script>