Data Science

While completing my Ph.D. in Psychological Sciences at the University of North Carolina Wilmington, I became interested in data science topics and techniques. Under the mentorship of Dr. Dale Cohen, I developed technical discipline and a principled approach to statistical programming in R.

Since then, data science has become central to my professional practice. I specialize in building compelling data visualizations, producing polished static and dynamic documents, designing automated data workflows, and working with larger-than-memory data sets. I also actively integrate AI tools into my work, not as a shortcut, but to raise the ceiling on what’s achievable in analysis, visualization, and documentation.

Visualizations

The creation of visualizations is perhaps my favorite area of data science. A great visualization should be accessible to professional audiences regardless of their statistical background, while remaining aesthetically polished. For organizations, standardizing visual elements such as fonts, color palettes, and output dimensions helps ensure reports and presentations feel cohesive and consistent. My approach draws on both psychological principles (e.g., perception, color accessibility, and bias reduction) and data science tools that streamline production. My two primary tools are R (particularly ggplot2) and Adobe Illustrator.

Standardized Visuals

Researchers and organizations routinely rely on common visualization types like bar, histogram, line, box, and scatter plots to communicate findings. Producing these efficiently and consistently is critical to a healthy workflow, particularly when reports and presentations need to share a unified visual identity.

To address this at the Center for Collegiate Mental Health at Penn State University (CCMH), I developed a suite of R functions built on ggplot2 that generate standardized, CCMH-specific plots. These functions are available in the CCMHr GitHub repository. Each function exposes over 40 customizable arguments that control dimensions, titles, legends, text, and more, while defaulting to CCMH-standard settings. This means a researcher can produce a properly formatted plot with minimal code, while still having the flexibility to override any default when a specific use case demands it.
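
The defaults-with-overrides pattern described above can be sketched in a few lines of base R. This is a hypothetical illustration, not CCMHr's actual implementation: the `house_defaults` list, its values, and the `resolve_settings()` helper are assumptions made up for this example.

```r
# Hypothetical house-style defaults (illustrative values, not CCMH's real settings)
house_defaults <- list(
  plot.width   = 12,
  plot.height  = 9,
  plot.units   = "in",
  color        = "#D0B38F",
  y.grid.major = FALSE
)

# Merge caller-supplied overrides into the defaults.
# Any setting the caller does not mention keeps its default value.
resolve_settings <- function(..., defaults = house_defaults) {
  overrides <- list(...)
  unknown <- setdiff(names(overrides), names(defaults))
  if (length(unknown) > 0) {
    stop("Unknown setting(s): ", paste(unknown, collapse = ", "))
  }
  utils::modifyList(defaults, overrides)
}

# No arguments: every default is used
str(resolve_settings())

# Override only what a specific use case demands
settings <- resolve_settings(plot.width = 8, plot.height = 6)
str(settings)
```

Rejecting unknown names is a design choice in this sketch: a misspelled argument fails loudly instead of being silently ignored.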

To illustrate, I’ll walk through a few examples using a dataset from my personal coin collection that examines the number of coins I own from select countries in Latin America and Asia (Cuba, Mexico, Venezuela, China, and Japan). You can read more about my coin collection on the Personal Information page.

First, here is a standard ggplot2 column plot:


# Library 
library(ggplot2)

# Create data set 
df.coins <- data.frame(countries = c("Japan", 
                                     "China", 
                                     "Cuba", 
                                     "Mexico", 
                                     "Venezuela"), 
                       coin_number = c(18, 
                                       29, 
                                       39, 
                                       17, 
                                       14))

# Make countries a factor 
df.coins$countries <- factor(df.coins$countries, 
                             levels = c("Cuba", 
                                        "China", 
                                        "Japan", 
                                        "Mexico", 
                                        "Venezuela"))

# Column plot using only ggplot2
coin.plot <- ggplot2::ggplot(df.coins, 
                             ggplot2::aes(x = countries, 
                                          y = coin_number)) +
  ggplot2::geom_col(fill = "#D0B38F") +
  ggplot2::labs(title = "Number of coins from specific countries",
                x = "Countries",
                y = "Number of coins")

ggplot2::ggsave(coin.plot, 
                filename = "visuals/01-ggplot2-example.png", 
                width = 12,
                height = 9,
                units = "in")

The visualization is readable, but the grid lines and panel background make it visually busy, and certain elements, like the y-axis label, are difficult to read. Now, here is the same plot produced using my CCMHr plotting function, with most arguments left at their defaults:


# Packages
library(CCMHr)

# Plot
CCMHr::plot_column(data = df.coins,
                   x.var = countries,
                   y.var = coin_number,
                   color = "#D0B38F",
                   save = TRUE,
                   path = "ccmh-basic-example.png",
                   plot.width = 12,
                   plot.height = 9,
                   plot.units = "in",
                   plot.title = "Number of coins from specific countries",
                   x.title = "Countries",
                   y.title = "Number\nof coins")

With only a few lines of code, the plot is cleaner, more polished, and formatted consistently with other CCMH outputs. Adding further detail is equally straightforward. For instance, here are some other aesthetics I would add to improve the visualization:


CCMHr::plot_column(data = df.coins,
                   x.var = countries,
                   y.var = coin_number,
                   color = "#D0B38F",
                   save = TRUE,
                   path = "ccmh-advance-example.png",
                   plot.width = 8,
                   plot.height = 6,
                   plot.units = "in",
                   plot.title = "Number of coins from specific countries",
                   x.title = "Countries",
                   y.title = "Number\nof coins", 
                   y.min = 0, # Specify minimum y limit
                   y.max = 40, # Specify maximum y limit
                   y.breaks = seq(0, 40, 10), # Specify breaks for grid lines
                   y.expand = c(0, 0), # Remove axis padding so the y axis starts at 0
                   column.width = 0.75, # Reduce column width
                   y.grid.major = TRUE, # Add major grid lines
                   y.grid.minor = TRUE, # Add minor grid lines
                   y.axis.line = FALSE, # Remove y axis line
                   x.axis.line = FALSE, # Remove x axis line
                   remove.axis.ticks = TRUE, # Remove axis tick marks
                   column.text = TRUE, # Add column text
                   column.text.color = "#292929", # Column text color
                   column.text.position = ggplot2::position_dodge(width = 0.75), # Column text position
                   column.text.vjust = 2, # Column text vertical adjustment
                   column.text.hjust = 0.5) # Column text horizontal adjustment

Of course, no set of predefined arguments can anticipate every need. When a customization falls outside what the function natively supports, the plot.elements arguments (such as plot.element1) allow researchers to pass raw ggplot2 code directly, granting access to ggplot2's full range of functionality without sacrificing the standardized baseline. The example below uses plot.element1 to modify the axis title text color for a special report:


CCMHr::plot_column(data = df.coins,
                   x.var = countries,
                   y.var = coin_number,
                   color = "#D0B38F",
                   save = TRUE,
                   path = "ccmh-plotelements-example.png",
                   plot.width = 8,
                   plot.height = 6,
                   plot.units = "in",
                   plot.title = "Number of coins from specific countries",
                   x.title = "Countries",
                   y.title = "Number\nof coins", 
                   y.min = 0, # Specify minimum y limit
                   y.max = 40, # Specify maximum y limit
                   y.breaks = seq(0, 40, 10), # Specify breaks for grid lines
                   y.expand = c(0, 0), # Remove axis padding so the y axis starts at 0
                   column.width = 0.75, # Reduce column width
                   y.grid.major = TRUE, # Add major grid lines
                   y.grid.minor = TRUE, # Add minor grid lines
                   y.axis.line = FALSE, # Remove y axis line
                   x.axis.line = FALSE, # Remove x axis line
                   remove.axis.ticks = TRUE, # Remove axis tick marks
                   column.text = TRUE, # Add column text
                   column.text.color = "#292929", # Column text color
                   column.text.position = ggplot2::position_dodge(width = 0.75), # Column text position
                   column.text.vjust = 2, # Column text vertical adjustment
                   column.text.hjust = 0.5, # Column text horizontal adjustment
                   plot.element1 = ggplot2::theme(axis.title = ggplot2::element_text(color = "red"))) # Make axis title text color red
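
One way a pass-through argument like this can work is sketched below. This is a hypothetical illustration and not CCMHr's actual code: the `plot_column_sketch()` function, its `extra` argument, and the `theme_minimal()` baseline are assumptions chosen for this example. It relies on the fact that adding a list to a ggplot object adds each element of the list in order.

```r
library(ggplot2)

# Same coin data used in the examples above
df.coins <- data.frame(
  countries   = c("Japan", "China", "Cuba", "Mexico", "Venezuela"),
  coin_number = c(18, 29, 39, 17, 14)
)

# Hypothetical wrapper: build a standardized baseline, then add any extra
# ggplot2 components supplied by the caller.
plot_column_sketch <- function(data, x, y, fill = "#D0B38F", extra = list()) {
  baseline <- ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_col(fill = fill) +
    theme_minimal()
  # Caller-supplied components are added last, so a theme() call passed in
  # 'extra' is applied on top of the baseline theme.
  baseline + extra
}

# Baseline only
p.default <- plot_column_sketch(df.coins, x = "countries", y = "coin_number")

# Baseline plus a raw ggplot2 customization
p.custom <- plot_column_sketch(
  df.coins,
  x = "countries",
  y = "coin_number",
  extra = list(theme(axis.title = element_text(color = "red")))
)
```

Because the extras are added after the baseline, the standardized look stays the default and the caller's changes apply only where they are explicitly requested.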

Below are examples of published visuals produced using CCMHr plotting functions:

Image from a blog titled "Discrimination and the Development of Working Alliance in College Counseling Clients – A Pilot Study at a Single College Counseling Center"

Image from a blog titled "Diagnostic Prevalence and Trends in College Counseling"

This graph has no title, but it examines changes over time in the prevalence of different racial/ethnic groups receiving treatment at university and college counseling centers. Image from a blog titled "College Counseling Centers are Increasingly Treating a Greater Percentage of Students who Represent Diverse Identities: 11-Year Trends"

Custom Visuals

Standardized functions are built for efficiency and consistency, but some projects call for something more tailored, or a design that aligns with a specific report's aesthetic. In these cases, I build fully custom visuals from the ground up using ggplot2 and Adobe Illustrator.

Here is an example of a custom plot built entirely in ggplot2:


# Coin Composite example 
  
# Data 
df.composite <- data.frame(composite = c("Silver", 
                                         "Copper", 
                                         "Bronze", 
                                         "Other"), 
                           percent = c(0.5492228, 
                                       0.238342, 
                                       0.119171, 
                                       0.09326425))

# Factor
df.composite$composite <- factor(df.composite$composite, 
                                 levels = rev(c("Silver", 
                                                "Copper", 
                                                "Bronze", 
                                                "Other")))

# Data frames related to creating plot
df.composite.empty <- df.composite |>
  dplyr::mutate(percent = 0)

df.composite2 <- plyr::rbind.fill(df.composite, 
                                  df.composite.empty)

df.line <- df.composite2 |>
  dplyr::mutate(percent = ifelse(percent != 0, 1, 0))

# Plot 
composite.plot <- ggplot2::ggplot(df.composite2, 
                                  ggplot2::aes(x = composite,
                                               y = percent) ) +
  ggplot2::geom_line(data = df.line, 
                     ggplot2::aes(x = composite,
                                  y = percent),
                     lineend = "round", 
                     lwd = 24, 
                     color = "#292929") +
  ggplot2::geom_line(data = df.line, 
                     ggplot2::aes(x = composite,
                                  y = percent),
                     lineend = "round", 
                     lwd = 22, 
                     color = "#fdfdfd") +
  ggplot2::geom_line(color = "#D0B38F", 
                     lineend = "round", 
                     lwd = 19) +
  ggplot2::ylim(-0.025, 1.025) +
  ggplot2::labs(title = "What's My Collection Made Of? A Breakdown by Metal") +
  ggplot2::theme(legend.position = "none",
                 panel.background = ggplot2::element_blank(), 
                 axis.text.x = ggplot2::element_blank(), 
                 axis.title = ggplot2::element_blank(), 
                 plot.title = ggplot2::element_text(hjust = 0.5, 
                                                    face = "bold",
                                                    color = "#292929",
                                                    size = 20),
                 axis.text.y = ggplot2::element_text(family = "sans", 
                                                     face = "bold",
                                                     color = "#292929",
                                                     size = 15),
                 axis.ticks = ggplot2::element_blank()) +
  ggplot2::coord_flip() + 
  ggplot2::geom_text(data = df.composite,
                     ggplot2::aes(label = paste0(format(round(percent, digits = 3) * 100, nsmall = 1), "%")),
                     hjust = 0.9,
                     family = "sans",
                     fontface = "bold",
                     size = 5,
                     color = "#292929")

ggplot2::ggsave(composite.plot,
                filename = "custom-example.png",
                width = 9.5,
                height = 6,
                units = "in")

Below are examples of published custom visuals:

Image from a blog titled "The Prevalence of Discrimination Within Different Identities"

We call this a nested column plot. Image from the CCMH 2025 Annual Report's Special Topic titled "Students with Financial Insecurity: Prevalence and Associations with Employment, Extracurricular Activities, and Psychological Distress"

Process Visuals

Not all visuals communicate results! Some communicate contextual details about a project. Process visuals describe the structure behind the data: how it was collected, how variables are defined, or how a cleaning or analytical pipeline works. Done well, these visuals help audiences understand and trust the analysis before they ever see a finding. I typically create process visuals in Adobe Illustrator, which gives me precise control over layout and design.

Below is an example of a process visual:

This is a figure CCMH uses to display the different variables we collect from clients.

Data Art Visuals

Data art sits at the intersection of analytics and design. I often use vector graphics (images built from scalable shapes rather than pixels) to represent specific constructs like people, institutions, or abstract concepts, and I arrange them to tell a narrative. The number, color, size, and arrangement of the graphics sometimes encode statistical properties (e.g., the prevalence of something, a point in time).

Below is an example that visualizes the national dropout rate among college students from enrollment to graduation, using data from IPEDS. Vectors were obtained from Adobe Stock, and edits were made in Adobe Illustrator.

This visual shows students progressing toward the goal of graduating from college. The figure symbolizes this journey by depicting students walking along a light grey path that visibly narrows from left to right, representing attrition as students move through each milestone. Three rings mark key time points along the path: starting college, continuing enrollment after the first year (75.4%), and graduating within 6 years (61.0%). The first ring serves as the 100% baseline and carries no percentage label. The size of each subsequent ring corresponds to the percentage of students who reach that time point. The rings progress from light to medium to dark blue, mirroring a broader color progression across the entire visual.

The number of student figures visible between each pair of rings also corresponds to the percentage of students who will reach the next time point, with 10 figures representing 100%. From the start through graduation, students carry a backpack whose shade deepens from light to dark blue as they progress. After passing through the final ring, the student figure is shown wearing a dark blue graduation cap, signifying the transition from an enrolled student to a graduate.
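
The mapping from percentages to figure counts can be worked through in R. This is a sketch under one assumption that is mine rather than stated above: counts are rounded to the nearest whole figure. The column names are also made up for illustration.

```r
# Percentages reported in the visual (the first ring is the 100% baseline)
milestones <- data.frame(
  time_point = c("Start college",
                 "Continue after first year",
                 "Graduate within 6 years"),
  percent    = c(100, 75.4, 61.0)
)

# 10 figures represent 100%, so each figure stands for 10 percentage points.
# Assumption: round to the nearest whole figure.
milestones$figures <- round(milestones$percent / 10)

# Percentage points lost since the previous time point
milestones$lost_since_previous <- c(NA, -diff(milestones$percent))

milestones
```

Under this rounding assumption, the two later time points map to 8 and 6 figures, and the path loses 24.6 and then 14.4 percentage points between rings.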

Documentation

Documentation is critical to disseminating research findings, building project manuals, and communicating the scope and structure of analytical work. My primary tools for documentation are Quarto and Adobe Illustrator. Quarto is an open-source scientific and technical publishing system that supports R, Python, and other languages, which makes it well-suited for dynamic, reproducible reports. Adobe Illustrator is better suited for work that demands a more polished visual presentation, such as executive summaries and infographics.

Static Documents

Static documents present fixed information that, once produced, remains unchanged unless the document is manually updated. They are well-suited for special project reports, technical manuals, and any deliverable where the underlying data and conclusions are final.

At CCMH, I developed a technical manual for the CCMHr package that documents its functions and arguments, and I update it manually as the package evolves. I have also produced static reports detailing project scope, data collection methods, analyses, and conclusions. Many of these documents are internal, though CCMH is introducing a series of quarterly data briefs focused on topics in collegiate mental health.

Dynamic Documents

Dynamic documents update automatically when new data is introduced or the document is re-rendered, making them essential for reports that need to be refreshed on a regular cadence. At CCMH, the data science team has developed several dynamic reporting workflows. My contributions to these vary across projects.

One strong use case is the Annual Report, where the structure and narrative remain largely consistent year over year, but statistical summaries update automatically as new data is incorporated. Only general conclusions and any new variables require manual revision. The CCMH Annual Reports are available here.
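
The idea of narrative text whose numbers refresh with the data can be sketched as follows. This is a minimal illustration, not the Annual Report's actual code: the `toy_data` values, the column names, and the `summary_sentence()` helper are all made up for this example.

```r
# Toy data standing in for a yearly extract (values are made up for illustration)
toy_data <- data.frame(
  year  = c(2023, 2023, 2023, 2024, 2024, 2024, 2024),
  score = c(1.2, 1.8, 1.5, 1.4, 2.0, 1.6, 1.8)
)

# Build the sentence from whatever data is supplied, so the reported
# numbers change whenever the underlying data does.
summary_sentence <- function(data, report_year) {
  current <- data[data$year == report_year, ]
  sprintf(
    "In %d, %d clients were included, with a mean score of %.2f.",
    report_year, nrow(current), mean(current$score)
  )
}

summary_sentence(toy_data, 2023)
summary_sentence(toy_data, 2024)
```

In a dynamic document, a helper like this would typically be called from inline code, so the surrounding prose stays fixed while the statistics are recomputed on each render.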

Dynamic documents are also well-suited for producing reports at scale. CCMH partners with over 800 university and college counseling centers, most of which receive a comprehensive report tailored to their center each year. Generating these manually would be time-consuming and error-prone, so we use dynamic reporting procedures to automate their production. Because centers vary on what data they collect, the reports incorporate conditional logic to determine what content is displayed and how it is visualized. An example of a comprehensive report is available here.
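
The conditional logic described above can be sketched in base R. This is a hypothetical illustration rather than CCMH's production workflow: the center IDs, column names, section names, and output file names are assumptions made up for this example.

```r
# Toy per-center data; NA marks a measure the center does not collect
# (all values are made up for illustration)
centers <- data.frame(
  center_id     = c("A", "B", "C"),
  mean_distress = c(1.6, 1.9, 1.4),
  mean_sessions = c(5.2, NA, 4.8),
  stringsAsFactors = FALSE
)

# Decide which report sections a center receives based on the data it has
sections_for <- function(center_row) {
  sections <- "overview"  # every report gets an overview
  if (!is.na(center_row$mean_distress)) sections <- c(sections, "distress")
  if (!is.na(center_row$mean_sessions)) sections <- c(sections, "sessions")
  sections
}

# Build one parameter set per center
report_params <- lapply(seq_len(nrow(centers)), function(i) {
  row <- centers[i, ]
  list(
    center_id   = row$center_id,
    sections    = sections_for(row),
    output_file = paste0("report-", row$center_id, ".pdf")
  )
})
names(report_params) <- centers$center_id

# Center B does not collect session data, so that section is omitted
report_params$B$sections
```

Each parameter set could then be handed to a parameterized render step, for example the quarto R package's `quarto_render()` with its `execute_params` argument, so that one template produces every center's report.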

Scrollytelling

Scrollytelling is a digital storytelling technique that uses interactive and animated elements to guide readers through a narrative as they scroll down a page. Unlike a traditional webpage or blog post, the experience is paced and structured, closer to a guided presentation than a passive read.

At CCMH, I developed a scrollytelling piece that explains the Clinical Load Index (CLI), a complex data construct that helps university and college counseling centers assess whether their service capacity aligns with the expectations of students, administrators, and other stakeholders. The project combined R, Adobe Illustrator, Adobe Stock, Quarto, and Closeread, with careful attention to how information was sequenced and revealed throughout the scroll experience.

Executive Summaries and Infographics

Executive summaries and infographics distill complex findings into concise, visually compelling formats, making them effective tools for communicating with administrators, stakeholders, and other audiences who need the key takeaways without the full technical detail. I use Adobe Illustrator, Adobe Stock, and R to produce these materials, combining data visualizations with a clear narrative structure. All executive summaries and infographics I have created at CCMH are for internal use.
