Chapter 2 Open Case Study Infrastructure
2.1 Learning Objectives
In this chapter we will discuss the overall infrastructure of the Open Case Studies platform, which includes:
- The Open Case Studies (OCS) website
- Methods to provide feedback
- A search tool to find case studies
- The OCS GitHub organization
- Our R package called
OCSdata
2.2 OCS Website
The OpenCaseStudies website describes the mission of the Open Case Studies project, the history of its inception, current and previous members of the OCS team, an archive of talks and blog posts and other information.
Links to all of our case studies can be found on the Open Case Studies website. The case studies are listed in a searchable table that will be detailed further in the following section.
2.3 Feedback
We are continually striving to make our case studies better. Please contact us if you have ideas for suggestions for the project or new ideas for how case studies can be used in the classroom.
Also please let us know if you notice typos or errors, or if you are interested in getting involved.
2.3.1 Email Form
The website contains a contact email form that may be used to send a message to Open Case Studies to ask a question or provide suggestions.
2.3.2 Survey
There is also a survey available on our website and also within the case studies themselves that allows us to do research on case study use.
The survey should take no more than 10 minutes to complete. Your feedback helps us learn more about how to improve the data science education experience. Part of this includes getting a better understanding of who is using our case studies and how so that we can better design our case studies. We would greatly appreciate you filling it out if you have the time!
2.4 OCS Case Study Search Tool
The website also includes the case study search tool to aid instructors in finding appropriate case studies for their learning objectives. Accessing the search tool and how to use it is described in more detail below.
This diagram illustrates the workflow of accessing a case study from the OCS website through the case study search table. From the table, users can use the provided links to view the original static case studies, interactive case studies, and the GitHub repositories for each. Users may find all case study source files in the case study repository, as well as instructions on how to use the case study.
2.4.1 Interactive Case Studies
The interactive versions of the case studies are a recent development. These versions include live tutorials through quizzes and interactive coding exercises with real-time feedback. The interactive case studies were made using the learnr
and gradethis
packages.
If you’d like to learn more about the interactive case studies, graduate student Qier Meng discusses the interactive versions in further detail in the video below:
If you’d like to learn more about the interactive case studies you can read this thesis by former graduate student Michael Breshock Breshock (2021).
The Open Case Study Search tool can be found at the bottom of the OCS Website. The tool consists of a table with searchable columns and each row describing an individual case study. This searchable table is designed to aid instructors in identifying appropriate case studies for their learning objectives. The columns are organized as such:
- The “Case Study” column contains the case study name and a link to the static and interactive versions of the case study (if available)
- The “GitHub Repository” column provides links to the case study’s associated GitHub repository that contains all case study source files, data, code, and more
- The “Packages” column details all the R packages used in the case study, and can help identify if a case study teaches a specific data import, wrangling, analysis, or visualization skill
- The “Objectives” column details the learning objectives of each case study (e.g. importing data from PDF files, reshaping data, specific statistical analysis, etc.)
- The “Category” column lists the source of funding or project that the case study is associated with
The main two columns likely to be helpful in identifying appropriate case studies are the “Packages” and “Objectives” columns. Users may search for keywords across all columns using the overall search bar, otherwise users can search individual columns of interest.
This table can be used to access all case study resources:
This video provides a live demonstration on how to use the search tool:
2.5 Open Case Studies GitHub Organization
GitHub is a website and cloud service that enables developers to store, manage, and track changes to their code. OCS uses GitHub for both development and distribution purposes. Users have complete access to all case study material through our OCS GitHub page where each case study is hosted in an individual repository. The repository contains all the materials needed for the case study. This includes the case study text to be distributed to students, the data used in the case study (discussed below), additional documents and references, and brief guidelines on case study use.
Data included in the GitHub repository is available in multiple formats to enable modular use of the case studies. This diagram explains the case study data folder structure and how data is categorized into different sub-folders:
Data included in the GitHub repository is available in multiple formats to facilitate modularization of the case studies as described below. To use the case study data, you can download the GitHub repository directly or use the OCSdata
R package described below.
2.6 OCSdata
To simplify the process of accessing the data required for each case study, we have created the OCSdata
R package. Briefly, the OCSdata
package creates a new folder called “OCSdata” where it downloads the data needed for a specific case study. Users can download the data in its original raw format or in various processed formats that correspond to different stages of data wrangling and cleaning. This allows users to perform the data exploration and wrangling or the data visualization and analysis sections of the case study without having to process the data from the raw files. For some of the case studies, the OCSdata
package also downloads extra source data that is not used in the case study.
The following are the main functions to import data in various formats using the OCSdata
package. Each function is described in more detail in the OCSdata
package documentation.
Data Folder | Case Study Section | OCSdata Function |
---|---|---|
raw | Data Import | raw_data |
imported | Data Exploration, Data Wrangling | imported_data |
wrangled | Data Visualization, Data Analysis | wrangled_csv , wrangled_rda |
simpler_import | Data Import | simpler_import_data |
extra | Suggested Homework (?) | extra_data |
The package source files and documentation are also available on GitHub.
2.6.1 Getting Started with OCSdata
The OCSdata
package is available on the package repository CRAN. It requires R 3.5 or higher and can be installed in R as follows:
install.packages("OCSdata") #only run once to install package
library(OCSdata) #run every new R session to load package
2.6.2 Downloading raw data
The raw_data
function will download the raw data files that can be imported into R.
The first argument is the name of the case study. A list of case study names can be found in the package documentation online or by typing ?raw_data
in R.
The outpath
argument is a string specifying the folder where the data should be downloaded. To download the data to a folder named “OCS_data” in the current working directory, you can supply getwd()
to the output
argument. If nothing is provided for the argument, you will be prompted to enter 1, 2, or 3 to download the data into the current director, to specify the download path, or to cancel, respectively.
In the following example, we download the raw data for the “Opioids in the United States” case study to the current directory.
raw_data("ocs-bp-opioid-rural-urban", outpath = getwd())
2.6.3 Downloading data in other formats
The OCSdata
package can be used to download the data in various processed formats that may be helpful in skipping certain case study sections and focusing on data wrangling and/or analysis and visualization. All of the functions take the same arguments described above.
2.6.3.1 Simpler import
The simpler_import_data()
function will download raw data files that have been converted to file formats that are easier to import into R, typically .csv. Some case studies offer this option when the original raw files require a more complicated import step.
simpler_import_data("ocs-bp-opioid-rural-urban", outpath = getwd())
2.6.3.2 Importing data as R objects
The imported_data()
function will download raw data files in .rda format. This means the data have already been imported into R objects. This can be used to skip the data import section and start directly with data wrangling. The R objects files can be imported into R by either double clicking on the files in RStudio or using the load()
function as follows.
imported_data("ocs-bp-opioid-rural-urban", outpath = getwd()) #download data in .rda format
= "~/OCS_data/data/imported/land_area.rda" #specify download directory
file_path load(file_path) #load R objects
2.6.3.3 Importing wrangled data
The following functions will download the data files that have already been wrangled and are ready to be analyzed. These come in both .csv and .rda formats.
Download as csv files:
wrangled_csv("ocs-bp-opioid-rural-urban", outpath = getwd())
Downloading as R objects:
wrangled_rda("ocs-bp-opioid-rural-urban", outpath = getwd())
2.6.4 Downloading extra data
Some case studies have extra data are not used in the case study but can be used to explore the case study subject from different perspectives. These data but can This data can be downloaded using the extra_data()
function.
extra_data("ocs-bp-opioid-rural-urban", outpath = getwd())
2.6.5 Downloading all case study data
The zip_ocs()
function will download the all of the repository files in a .zip folder and unzip them into a specified directory. This includes the case study data in all the formats detailed above (raw, simpler_import, imported, wrangled, and extra). It also includes the case study .Rmd file, which can be modified by instructors as needed. We recommend using this method over cloning or forking (terms that you may be familiar with if you are familiar with Git and GitHub), as this will not result in the user getting all of our git history.
If you choose to fork the repository you will automatically generate a repository on GitHub and your repository will have connections to the original case study. This can be helpful for pulling any changes to the original case study. It can also be helpful if you want to send edits to the original case study in the form of what is called a pull request.
If you clone the case study repository, you can set it up on GitHub as well with a few more steps and you will not preserve any connection to the original case study repository.
Again, don’t worry if all these terms are new to you. You can just use the zip_ocs()
function instead. Otherwise take a look at Hester (n.d.) to learn more.
zip_ocs("ocs-bp-opioid-rural-urban", outpath = getwd())
2.6.6 Fork or clone the case study repository
If instead users are familiar with Git and GitHub and want to fork or clone the case study repository, this can also easily be done using the OCSdata
package. The clone_ocs()
function of the OCSdata
package can be used to do either. If the fork_repo
function is set to TRUE
it will fork the repo, otherwise, by default, it will clone the repository. These functions will result in the same outcome as using GitHub to clone or fork the repo.
Again you can also specify the outpath
location as in the previous description about the zip_ocs()
function.
# clone a repository
::clone_ocs(casestudy = "ocs-bp-diet")
OCSdata
#fork a repository
::clone_ocs(casestudy = "ocs-bp-diet", fork_repo = "TRUE") OCSdata
However, using these functions will involve the users getting all of our git history so we suggest that users use the zip_ocs()
function (described in the above section) of OCSdata
instead.
If you’d like to learn more about the OCSdata
package or the OCS GitHub organization page, you can read this thesis by former graduate student Michael Breshock Breshock (2021).
2.7 Session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.2 magrittr_2.0.2 bookdown_0.24 htmltools_0.5.0
## [5] tools_4.0.2 yaml_2.2.1 jquerylib_0.1.4 stringi_1.5.3
## [9] rmarkdown_2.10 highr_0.8 knitr_1.33 stringr_1.4.0
## [13] digest_0.6.25 xfun_0.26 rlang_0.4.10 evaluate_0.14