---
title: "Amounts and proportions"
date: "2025-09-15"
date_end: "2025-09-19"
---
For this example, we're going to use real world data to demonstrate some different ways to visualize amounts and proportions. We'll use data from the CDC and the Social Security Administration about the number of daily births in the United States from 1994–2014. [FiveThirtyEight reported a story using this data in 2016](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/) and they posted relatively CSV files [on GitHub](https://github.com/fivethirtyeight/data/tree/master/births), so we can download and use those.
If you want to follow along with this example, you can download the data directly from [GitHub](https://github.com/fivethirtyeight/data/tree/master/births) or by using these links (you'll likely need to right click on these and choose "Save Link As…"):
- [{{< fa file-csv >}} `US_births_1994-2003_CDC_NCHS.csv`](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv)
- [{{< fa file-csv >}} `US_births_2000-2014_SSA.csv`](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv)
## Live coding example
<div class="ratio ratio-16x9">
<iframe src="https://www.youtube.com/embed/zrT-ThV6U6M" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0"></iframe>
</div>
::: {.callout-important}
### Slight differences from the video
This is a slightly cleaned up version of the code from the video.
:::
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width = 6, fig.height = 3.6, fig.align = "center", collapse = TRUE)
set.seed(1234)
options(dplyr.summarise.inform = FALSE)
```
## Load data
There are two CSV files:
- `US_births_1994-2003_CDC_NCHS.csv` contains U.S. births data for the years 1994 to 2003, as provided by the Centers for Disease Control and Prevention’s National Center for Health Statistics.
- `US_births_2000-2014_SSA.csv` contains U.S. births data for the years 2000 to 2014, as provided by the Social Security Administration.
Since the two datasets overlap in 2000–2003, we use Social Security Administration data for those years.
We downloaded the data from GitHub and placed the CSV files in a folder named `data`. We'll then load them with `read_csv()` and combine them into one data frame.
```{r load-libraries-data-fake, eval=FALSE}
library(tidyverse)
library(scales) # For nice labels in charts
births_1994_1999 <- read_csv("data/US_births_1994-2003_CDC_NCHS.csv") |>
# Ignore anything after 2000
filter(year < 2000)
births_2000_2014 <- read_csv("data/US_births_2000-2014_SSA.csv")
births_combined <- bind_rows(births_1994_1999, births_2000_2014)
```
```{r load-libraries-data-real, include=FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(scales)
births_1994_1999 <- read_csv(here::here(
"files", "data", "external_data",
"US_births_1994-2003_CDC_NCHS.csv"
)) |>
# Ignore anything after 2000
filter(year < 2000)
births_2000_2014 <- read_csv(here::here(
"files", "data", "external_data",
"US_births_2000-2014_SSA.csv"
))
births_combined <- bind_rows(births_1994_1999, births_2000_2014)
```
## Wrangle data
Let's look at the first few rows of the data to see what we're working with:
```{r head-data}
head(births_combined)
```
The columns for year and births seem straightforward and ready to use. The columns for month and day of the week could be improved if we changed them to text (i.e. January instead of 1; Tuesday instead of 3). To fix this, we can convert these columns to categorical variables, or factors in R. We can also specify that these categories (or factors) are ordered, meaning that Feburary comes after January, etc. Without ordering, R will plot them alphabetically, which isn't very helpful.
We'll make a new dataset named `births` that's based on the combined births data, but with some new columns added:
```{r wrangle-data}
# The c() function lets us make a list of values
month_names <- c("January", "February", "March", "April", "May", "June", "July",
"August", "September", "October", "November", "December")
day_names <- c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday")
births <- births_combined |>
# Make month an ordered factor, using the month_name list as labels
mutate(month = factor(month, labels = month_names, ordered = TRUE)) |>
mutate(day_of_week = factor(day_of_week, labels = day_names, ordered = TRUE),
date_of_month_categorical = factor(date_of_month)) |>
# Add a column indicating if the day is on a weekend
mutate(weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"), TRUE, FALSE))
head(births)
```
If you look at the data now, you can see the columns are changed and have different types. `year` and `date_of_month` are still numbers, but `month`, and `day_of_week` are ordered factors (`ord`) and `date_of_month_categorical` is a regular factor (`fct`). Technically it's also ordered, but because it's already alphabetical (i.e. 2 naturally comes after 1), we don't need to force it to be in the right order.
Our `births` data is now clean and ready to go!
## Bar plot
First we can look at a bar chart showing the total number of births each day. We need to make a smaller summarized dataset and then we'll plot it:
```{r plot-bar-chart}
total_births_weekday <- births |>
group_by(day_of_week) |>
summarize(total = sum(births))
ggplot(data = total_births_weekday,
mapping = aes(x = day_of_week, y = total, fill = day_of_week)) +
geom_col() +
# Turn off the fill legend because it's redundant
guides(fill = "none")
```
If we fill by day of the week, we get 7 different colors, which is fine (I guess), but doesn't really help tell a story. The main story here is that there are far fewer births during weekends. If we create a new column that flags if a row is Saturday or Sunday, we can fill by that column instead:
```{r plot-bar-chart-weekend}
total_births_weekday <- births |>
group_by(day_of_week) |>
summarize(total = sum(births)) |>
mutate(weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"), TRUE, FALSE))
ggplot(data = total_births_weekday,
mapping = aes(x = day_of_week, y = total, fill = weekend)) +
geom_col()
```
Neat! Those default colors are kinda ugly, though, so let's use the principles of preattentive processing and contrast to highlight the weekend bars:
```{r plot-bar-chart-weekend-better}
ggplot(data = total_births_weekday,
mapping = aes(x = day_of_week, y = total, fill = weekend)) +
geom_col() +
# Use grey and orange
scale_fill_manual(values = c("grey70", "#f2ad22")) +
# Use commas instead of scientific notation
scale_y_continuous(labels = label_comma()) +
# Turn off the legend since the title shows what the orange is
guides(fill = "none") +
labs(title = "Weekends are unpopular times for giving birth",
x = NULL, y = "Total births")
```
## Lollipop chart
Since the ends of the bars are often the most important part of the graph, we can use a lollipop chart to emphasize them. We'll keep all the same code from our bar chart and make a few changes:
- Color by weekend instead of fill by weekend, since points and lines are colored in ggplot, not filled
- Switch `scale_fill_manual()` to `scale_color_manual()` and turn off the `color` legend in the `guides()` layer
- Switch `geom_col()` to `geom_pointrange()`. The `geom_pointrange()` layer requires two additional aesthetics: `ymin` and `ymax` for the ends of the lines that come out of the point. Here we'll set `ymin` to 0 so it starts at the x-axis, and we'll set `ymax` to `total` so it ends at the point.
```{r plot-lollipop-chart-weekend-better}
ggplot(data = total_births_weekday,
mapping = aes(x = day_of_week, y = total, color = weekend)) +
geom_pointrange(aes(ymin = 0, ymax = total),
# Make the lines a little thicker and the dots a little bigger
fatten = 5, size = 1.5) +
# Use grey and orange
scale_color_manual(values = c("grey70", "#f2ad22")) +
# Use commas instead of scientific notation
scale_y_continuous(labels = label_comma()) +
# Turn off the legend since the title shows what the orange is
guides(color = "none") +
labs(title = "Weekends are unpopular times for giving birth",
x = NULL, y = "Total births")
```
## Strip plot
However, we want to \#barbarplots! (Though they're arguably okay here, since they show totals and not averages). Let's show all the data with points. We'll use the full dataset now, map x to weekday, y to births, and change `geom_col()` to `geom_point()`. We'll tell `geom_point()` to jitter the points randomly.
```{r strip-plot}
ggplot(data = births,
mapping = aes(x = day_of_week, y = births, color = weekend)) +
scale_color_manual(values = c("grey70", "#f2ad22")) +
geom_point(size = 0.5, position = position_jitter(height = 0)) +
guides(color = "none")
```
There are some interesting points in the low ends, likely because of holidays like Labor Day and Memorial Day (for the Mondays) and Thanksgiving (for the Thursday). If we had a column that indicated whether a day was a holiday, we could color by that and it would probably explain most of those low numbers. Unfortunately we don't have that column, and it'd be hard to make. Some holidays are constant (Halloween is always October 31), but some aren't (Thanksgiving is the fourth Thursday in November, so we'd need to find out which November 20-somethingth each year is the fourth Thursday, and good luck doing that at scale).
## Beeswarm plot
We can add some structure to these points if we use the [{ggbeeswarm} package](https://github.com/eclarke/ggbeeswarm), with either `geom_beeswarm()` or `geom_quasirandom()`. `geom_quasirandom()` actually works better here since there are so many points—`geom_beeswarm()` makes the clusters of points way too wide.
```{r beeswarm-plot}
library(ggbeeswarm)
ggplot(data = births,
mapping = aes(x = day_of_week, y = births, color = weekend)) +
scale_color_manual(values = c("grey70", "#f2ad22")) +
# Make these points suuuper tiny
geom_quasirandom(size = 0.0001) +
guides(color = "none")
```
## Heatmap
Finally, let's use something non-traditional to show the average births by day in a somewhat proportional way. We can calculate the average number of births every day and then make a heatmap that fills each square by that average, thus showing the relative differences in births per day.
To do this, we need to make a summarized data frame with `group_by() |> summarize()` to calculate the average number of births by month and day of the month (i.e. average for January 1, January 2, etc.).
We'll then make a sort of calendar with date of the month on the x axis, month on the y axis, with heat map squares filled by the daily average. We'll use `geom_tile()` to add squares for each day, and then add some extra scale, coordinates, and theme layers to clean up the plot:
```{r plot-heatmap, fig.width=10, fig.height=4}
avg_births_month_day <- births |>
group_by(month, date_of_month_categorical) |>
summarize(avg_births = mean(births))
ggplot(data = avg_births_month_day,
# By default, the y-axis will have December at the top, so use fct_rev() to reverse it
mapping = aes(x = date_of_month_categorical, y = fct_rev(month), fill = avg_births)) +
geom_tile() +
# Add viridis colors
scale_fill_viridis_c(option = "inferno", labels = label_comma()) +
# Add nice labels
labs(x = "Day of the month", y = NULL,
title = "Average births per day",
subtitle = "1994-2014",
fill = "Average births") +
# Force all the tiles to have equal widths and heights
coord_equal() +
# Use a cleaner theme
theme_minimal()
```
Neat! There are some really interesting trends here. Most obvious, probably, is that very few people are born on New Year's Day, July 4th, Halloween, Thanksgiving, and Christmas.
```{r lowest-days}
avg_births_month_day |>
arrange(avg_births)
```
The days with the highest average are in mid-September (lol my birthday is #2), likely because that's about 9 months after the first week of January. July 7th at #7 is odd and I have no idea why it might be so popular `r emoji::emoji("shrug")`.
```{r highest-days}
avg_births_month_day |>
arrange(desc(avg_births))
```
The funniest trend is the very visible dark column for the 13th of every month. People *really* don't want to give birth on the 13th.