Biomedical Open Case Studies: Building a Neural Network-Based Epigenetic Clock in R Using scorcher

{{< include _main_image.qmd >}}

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, does not claim to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts. In addition, due to size constraints, datasets used within a case study may be a subset of the original/full dataset.

License information

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

This work is funded through the National Institutes of Health, specifically the National Institute of General Medical Sciences: Grant Number 1R25GM160622.


To access the GitHub repository for this case study see here: https://github.com/opencasestudies/ocs-bio-ai-deep-learning



To cite this case study, please use:

{Fill In Authors}. {(YEAR)}. {github source url}. {Title} {(Version)}.

Motivation


DNA methylation-based epigenetic clocks have transformed aging research by providing quantitative markers of biological age that often outperform chronological age in predicting morbidity, frailty, and mortality. Although most established clocks to date rely on penalized linear regression (e.g., elastic net), recent work suggests that deep learning can capture richer nonlinear relationships among DNA methylation sites, which may improve predictive accuracy.

This case study walks learners through reconstructing a neural network epigenetic clock model in the R statistical programming language. Learners will use scorcher, an R package designed to make developing deep learning models easier. The goal is to give learners hands-on experience working with DNA methylation data, introduce epigenetic clock models, teach modern neural network construction in R, and highlight practical modeling issues such as overfitting, interpretability, and biological validation.

Definition

Epigenetic clocks are statistical or machine learning models that estimate an individual’s biological age by learning systematic patterns in DNA methylation levels at specific genomic locations (CpG sites). These models leverage age-associated changes in methylation to produce an age estimate that can reflect not only chronological time but also aspects of biological aging, health status, and disease risk.

Images are very helpful within this section….

NEED IMAGE


Main Question


Our main question(s)

  1. Can a neural network-based epigenetic clock accurately predict chronological age from publicly available DNA methylation data?

  2. How does a deep learning-based epigenetic clock compare to a more traditional, elastic net-based clock when trained on the same harmonized data?

  3. What practical challenges arise when applying deep learning models to high-dimensional biological data, and how can these challenges be addressed in a reproducible R workflow?


Learning Objectives


In this case study, we will explore how deep learning methods can be applied to high-dimensional biological data, using DNA methylation-based epigenetic clocks as our motivation.

This case study will focus in particular on the practical and conceptual challenges that arise when building neural network models for omics-scale biological data, including data preprocessing, regularization, model evaluation, and comparison with classical statistical approaches such as elastic net regression. We will illustrate these concepts within the R ecosystem using the R package scorcher.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science/Bioinformatics Learning Objectives:

  1. Work with high-dimensional biological datasets and understand common data structures used in -omics analyses.

  2. Implement reproducible data preprocessing and wrangling workflows in R for large-scale datasets.

  3. Train, tune, and evaluate deep learning-based predictive models using modern machine learning tools such as scorcher.

  4. Compare model performance using appropriate validation strategies and metrics.

  5. Reason about computational tradeoffs, including memory usage, runtime, and the use of GPUs.

Statistical Learning Objectives:

  1. Explain the bias-variance tradeoff in high-dimensional prediction problems.

  2. Describe penalized regression methods (e.g., elastic net) and their role in feature selection.

  3. Understand loss functions, regularization, and optimization in neural network training.

  4. Design and interpret cross-validation and study-level validation strategies.

  5. Critically assess interpretability versus complexity when building prediction models.

Biological/Topical Learning Objectives:

  1. Describe the biological basis of DNA methylation and its relationship to aging.

  2. Explain what epigenetic clocks are, how they are constructed, and their historical development.

  3. Understand sources of biological and technical variation in methylation data.

  4. Learn how to preprocess and wrangle large DNA methylation datasets.

  5. Interpret predictive models in the context of biological plausibility and discuss their limitations.


We will begin by loading the packages that we will need:

#add library() calls
library(here)
here() starts at /home
Package Use
{Package Name} {Package use}
{Package Name} {Package use}

The first time we use a function, we will use the :: operator to indicate which package it comes from. This prefix is only required when two loaded packages export functions with the same name, but we include it here to make clear where each function comes from.
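As a small illustration of this convention, the sketch below calls the same function with and without its package prefix. It uses only the stats package that ships with R, and the vector ages holds made-up values rather than case-study data.

```r
# Hypothetical example values (not case-study data)
ages <- c(34, 51, 47, 62, 29)

# Explicit form: package::function states that median() comes from stats
explicit_median <- stats::median(ages)

# Implicit form: works because stats is attached by default in R sessions
implicit_median <- median(ages)

# Both calls resolve to the same function, so the results match
identical(explicit_median, implicit_median)
```

The explicit form also works when the package is installed but has not been attached with library().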

Context


What are Epigenetic Clocks?

DNA methylation is an epigenetic mechanism where methyl groups attach to DNA, affecting gene regulation without changing the DNA sequence. Across the genomes of many organisms, methylation levels change systematically with age. Early studies found that these changes were predictable enough to allow for the construction of statistical models that estimate age from methylation alone. These models became known as epigenetic clocks.

Epigenetic clocks are predictive models that estimate an organism’s biological age based on methylation patterns at specific regions in DNA (i.e., CpG sites, where a guanine nucleotide directly follows a cytosine nucleotide).

Early clocks focused on predicting chronological age, but over time, researchers have developed “next-generation” clocks that correlate not just with chronological age, but also with health outcomes, disease risk, and mortality. Today, epigenetic clocks play roles in:

  • Aging research, as biomarkers of biological aging
  • Exposure science, tracking environmental/behavioral influences
  • Clinical prediction, including mortality risk and disease progression
  • Comparative genomics, identifying conserved aging signatures across species

History of Epigenetic Clocks

In 2011, a study showed that DNA methylation in saliva could predict chronological age with an average error of ~5.2 years. Following this, in 2013, two landmark clocks were published: the blood-based Hannum clock and the cross-tissue Horvath clock. As the design of such clocks evolved, newer models now include biological signals coming from biomarkers, disease risk factors, and other lifestyle predictors to estimate biological age or age acceleration, which may better reflect functional aging, morbidity, and mortality.

Table here from nature paper that lists clocks, tissues, programming language, and model?

Timeline diagram here?

Moving Beyond Classic Clock Models

Nearly all canonical clocks, including the Horvath, Hannum, PhenoAge, and GrimAge clocks, are based on penalized linear regression models. Because there are often far more CpG sites (~850k) than available samples (###), a traditional (unpenalized) linear regression model has no unique solution and cannot be fit directly. Penalized models, such as the elastic net, instead identify subsets of sites (~100-500) whose methylation values are predictive of chronological age. These models have several advantages. Namely, they:

  • Have good performance in high-dimensional settings,
  • Perform automatic feature selection,
  • Are interpretable,
  • Are computationally inexpensive,
  • Generalize reasonably across studies and tissues.
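To make the "more features than samples" problem concrete, here is a minimal base R sketch on simulated data. Everything in it is an assumption for illustration: the sample size, the number of features, the true coefficients, and the penalty value lambda are all made up. It also swaps in ridge regression (an L2 penalty only) as the penalized model because ridge has a closed-form solution; the elastic net used by published clocks adds an L1 term, which is what produces feature selection.

```r
set.seed(1)

# Simulated stand-in for methylation data: 20 samples, 50 features (p > n)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("cpg", seq_len(p))

# Hypothetical truth: only the first three features carry signal
beta_true <- c(2, -1.5, 1, rep(0, p - 3))
age <- 50 + as.vector(X %*% beta_true) + rnorm(n, sd = 0.5)

# Ordinary least squares: the design matrix has rank at most n, so the
# remaining coefficients are not estimable and lm() reports them as NA
ols_fit <- lm(age ~ X)
n_undefined <- sum(is.na(coef(ols_fit)))

# Ridge regression: adding lambda to the diagonal of X'X makes the
# linear system solvable for any lambda > 0, even when p > n
lambda <- 1
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- age - mean(age)
ridge_coef <- solve(crossprod(Xc) + lambda * diag(p), crossprod(Xc, yc))

n_undefined                  # number of coefficients lm() could not estimate
all(is.finite(ridge_coef))   # every ridge coefficient is defined
```

In this sketch the unpenalized fit leaves some coefficients undefined, while the penalized fit returns a finite estimate for every feature. Ridge shrinks coefficients but does not set them to zero, so it does not by itself select a subset of sites.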

However, the biology of epigenetics is complex, and methylation patterns arise from regulatory networks, chromatin structures, and biological interactions that are often nonlinear. Additive linear models ignore interactions between CpG sites, even though the genome is organized as a set of physically interacting DNA strands in three-dimensional space. They also struggle with multicollinearity among related features, which often leads penalization methods to arbitrarily select one CpG from a group of similarly behaving sites. More flexible modeling approaches may better capture the complexity of methylation dynamics, and recent work has begun moving beyond linear models to deep learning for this reason. Specifically, deep learning models can:

  • Model nonlinear relationships,
  • Capture interactions between CpG sites,
  • Incorporate multi-omic features,
  • Leverage architectures reflecting genomic structure.
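The point about interactions can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of real methylation data: the two sites cpg_a and cpg_b, the effect size, and the noise level are hypothetical, and an explicit product term in lm() stands in for the kind of structure a flexible model such as a neural network could learn without being told about it.

```r
set.seed(2)

# Simulated methylation-like values in [0, 1] for two hypothetical sites
n <- 500
cpg_a <- runif(n)
cpg_b <- runif(n)

# Hypothetical outcome driven purely by the interaction between the sites
age <- 40 + 60 * (cpg_a - 0.5) * (cpg_b - 0.5) + rnorm(n, sd = 1)

# Additive linear model: each site contributes independently
additive_fit <- lm(age ~ cpg_a + cpg_b)

# Model that also includes the product term cpg_a:cpg_b
interaction_fit <- lm(age ~ cpg_a * cpg_b)

# Proportion of variance explained by each model
r2_additive <- summary(additive_fit)$r.squared
r2_interaction <- summary(interaction_fit)$r.squared
c(additive = r2_additive, interaction = r2_interaction)
```

In this setup the additive model explains almost none of the variance, while the model with the product term explains most of it. With hundreds of thousands of CpG sites, enumerating every pairwise product by hand is impractical, which is one motivation for models that learn such structure from the data.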

Recently, {cite Bioc2025_torch-clock repository here} demonstrated how to build a deep learning-based epigenetic clock in R, highlighting growing interest in such approaches. However, deep learning-based pipelines introduce barriers for many researchers in genomics. This case study explores how to build a neural network-based epigenetic clock using the R package scorcher.

NEED VIDEO


What are the data?


Variable Details
variable1 Variable info
– more details
– more details
Example: Content content
variable2 Variable info
– more details
– more details
Example: Content content

Limitations


There are some important considerations regarding this data analysis to keep in mind:

  1. {FILL IN}
  2. {FILL IN}

Ethical Considerations


There are some important ethical considerations when working with data relating to this case study’s main questions.

  1. {FILL IN}
  2. {FILL IN}

Data Import


pm <- readr::read_csv(here::here("data", "raw", "pm25_data.csv"))
Rows: 876 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): state, county, city
dbl (47): id, value, fips, lat, lon, CMAQ, zcta, zcta_area, zcta_pop, imp_a5...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
save(pm, file = here::here("data", "imported", "pm25_data_imported.rda"))

Data Wrangling


If you have been following along but stopped, we could load our imported data like so:

load(here::here("data", "imported", "pm25_data_imported.rda"))

If you skipped the data import section click here.

An RDA version (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “imported” within a directory called “data” to use the following code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "imported", "pm25_data_imported.rda"))

To allow users to skip import and wrangling, we will save the data as an RDA file and also as a CSV file, which is often a convenient format for sending data to collaborators. We will save both files in a “wrangled” subdirectory of the “data” directory within our working directory.

# Save the data as an RDA file within the wrangled data subdirectory
save(pm, file = here::here("data", "wrangled", "wrangled_data.rda"))
# Save the data as a CSV file within the wrangled data subdirectory
readr::write_csv(pm, file = here::here("data", "wrangled",
                                       "wrangled_data.csv"))
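The practical difference between the two formats shows up when the files are read back in. The sketch below is a self-contained illustration that writes to a temporary directory instead of the project folders. It uses a made-up data frame called toy, and it swaps in base R's write.csv() and read.csv() for the readr functions so that it runs without extra packages.

```r
# Hypothetical small data frame standing in for the wrangled data
toy <- data.frame(id = 1:3, value = c(0.12, 0.55, 0.91))

# Write both formats to a temporary directory so nothing is overwritten
out_dir <- tempfile("wrangled_")
dir.create(out_dir)
rda_path <- file.path(out_dir, "toy.rda")
csv_path <- file.path(out_dir, "toy.csv")

save(toy, file = rda_path)
write.csv(toy, file = csv_path, row.names = FALSE)

# Keep a copy for comparison, then remove the object from the workspace
original <- toy
rm(toy)

# load() restores the object under its original name and returns that name
restored_names <- load(rda_path)

# A CSV stores no object name, so the result must be assigned on read
toy_csv <- read.csv(csv_path)

identical(toy, original)      # the RDA round trip recreates the object
all.equal(toy_csv, original)  # the CSV round trip recovers the same values
```

Because load() recreates objects under the names they were saved with, it can silently overwrite an existing object of the same name, which is worth keeping in mind when loading files from collaborators.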

Data Visualization


If you have been following along but stopped, we could load our wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you skipped the previous sections click here.

An RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to use the following code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))


Data Analysis


If you have been following along but stopped, we could load our wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you skipped the previous sections click here.

An RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to use the following code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))

Question opportunity

Pose a question about the data analysis.
Answer You can use a collapsible details section to provide an answer

We might use a column margin note to add an equation or some other annotation for learners or educators.


Summary


Synopsis



Summary Plot



Suggested Homework



Additional Information


Session Info


devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2026-02-03
 pandoc   3.1.1 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 bit           4.0.5   2022-11-15 [1] RSPM (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] RSPM (R 4.3.0)
 cachem        1.0.8   2023-05-01 [1] RSPM (R 4.3.0)
 cli           3.6.5   2025-04-23 [1] CRAN (R 4.3.2)
 crayon        1.5.2   2022-09-29 [1] RSPM (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] RSPM (R 4.3.0)
 digest        0.6.34  2024-01-11 [1] RSPM (R 4.3.0)
 ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.3.0)
 evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.3.2)
 fansi         1.0.6   2023-12-08 [1] RSPM (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] RSPM (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
 glue          1.7.0   2024-01-09 [1] RSPM (R 4.3.0)
 here        * 1.0.2   2025-09-15 [1] CRAN (R 4.3.2)
 hms           1.1.3   2023-03-21 [1] RSPM (R 4.3.0)
 htmltools     0.5.7   2023-11-03 [1] RSPM (R 4.3.0)
 htmlwidgets   1.6.4   2023-12-06 [1] RSPM (R 4.3.0)
 httpuv        1.6.14  2024-01-26 [1] RSPM (R 4.3.0)
 jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.3.2)
 knitr         1.50    2025-03-16 [1] CRAN (R 4.3.2)
 later         1.3.2   2023-12-06 [1] RSPM (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] RSPM (R 4.3.0)
 mime          0.12    2021-09-28 [1] RSPM (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] RSPM (R 4.3.0)
 pkgbuild      1.4.3   2023-12-10 [1] RSPM (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.3.0)
 pkgload       1.4.1   2025-09-23 [1] CRAN (R 4.3.2)
 profvis       0.3.8   2023-05-02 [1] RSPM (R 4.3.0)
 promises      1.2.1   2023-08-10 [1] RSPM (R 4.3.0)
 purrr         1.0.2   2023-08-10 [1] RSPM (R 4.3.0)
 R6            2.6.1   2025-02-15 [1] CRAN (R 4.3.2)
 Rcpp          1.0.12  2024-01-09 [1] RSPM (R 4.3.0)
 readr         2.1.5   2024-01-10 [1] RSPM (R 4.3.0)
 remotes       2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0)
 rlang         1.1.6   2025-04-11 [1] CRAN (R 4.3.2)
 rmarkdown     2.25    2023-09-18 [1] RSPM (R 4.3.0)
 rprojroot     2.1.1   2025-08-26 [1] CRAN (R 4.3.2)
 sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 shiny         1.8.0   2023-11-17 [1] RSPM (R 4.3.0)
 stringi       1.8.3   2023-12-11 [1] RSPM (R 4.3.0)
 stringr       1.5.1   2023-11-14 [1] RSPM (R 4.3.0)
 tibble        3.3.0   2025-06-08 [1] CRAN (R 4.3.2)
 tidyselect    1.2.0   2022-10-10 [1] RSPM (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] RSPM (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] RSPM (R 4.3.0)
 usethis       2.2.3   2024-02-19 [1] RSPM (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] RSPM (R 4.3.0)
 vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.3.0)
 vroom         1.6.5   2023-12-05 [1] RSPM (R 4.3.0)
 xfun          0.55    2025-12-16 [1] CRAN (R 4.3.2)
 xtable        1.8-4   2019-04-21 [1] RSPM (R 4.3.0)
 yaml          2.3.12  2025-12-10 [1] CRAN (R 4.3.2)

 [1] /usr/local/lib/R/site-library
 [2] /usr/local/lib/R/library

──────────────────────────────────────────────────────────────────────────────

Acknowledgments


We would also like to acknowledge the National Institute of General Medical Sciences for funding this work (1R25GM160622).

Icons are from iconpacks.