Describing Data: Tables and Visualizations
You have clean data. Now you need to describe it — what’s in the dataset? What patterns are visible? This chapter teaches you to build frequency tables and publication-ready charts using the same music dataset.
Describing data is not just a technical step — it’s communication. Every table and chart in this chapter answers a question that a reader would have about your data.
Load Your Clean Data
library(tidyverse)
music <- readRDS("data/music_data_clean.RDS")

Frequency Tables
A frequency table answers the simplest question: How many of each category do I have?
One-Variable Frequency Table
genre_counts <- music |>
  count(playlist_genre, sort = TRUE) |>  # count and sort by frequency
  mutate(
    percent = n / sum(n) * 100,          # add percentage column
    percent = round(percent, 1)          # round to one decimal
  )
genre_counts

This tells you the distribution of genres in the dataset. Notice that some genres have far more songs than others; that’s important context for any analysis.
Two-Variable Cross-Tabulation
A cross-tabulation shows how two categorical variables relate to each other. This is the foundation for the chi-square test in Chapter 6.
cross_tab <- music |>
  filter(!is.na(mode_label)) |>          # remove rows where mode is missing
  count(playlist_genre, mode_label) |>   # count each combination
  pivot_wider(                           # reshape from long to wide format
    names_from = mode_label,
    values_from = n,
    values_fill = 0                      # fill missing combinations with 0
  )
cross_tab

What Does pivot_wider() Do?
pivot_wider() reshapes your data from “long” format (one row per combination) to “wide” format (one row per genre, with columns for each mode). This makes the table easier to read — it looks like the cross-tabulations you’ve seen in textbooks.
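To make the long-to-wide reshaping concrete, here is a minimal sketch on a three-row toy table instead of the music data. The counts are invented for illustration; only the column names mirror the ones used above.

```r
library(tibble)
library(tidyr)

# Toy long-format counts (invented values, not from the music dataset).
# There is deliberately no rap/Minor row, so that combination is "missing".
long_counts <- tribble(
  ~playlist_genre, ~mode_label, ~n,
  "pop",           "Major",     3,
  "pop",           "Minor",     1,
  "rap",           "Major",     2
)

wide_counts <- long_counts |>
  pivot_wider(
    names_from  = mode_label,  # each mode becomes its own column
    values_from = n,           # cell values come from the counts
    values_fill = 0            # missing combinations become 0 instead of NA
  )

wide_counts
```

The result has one row per genre and one column per mode. Without values_fill = 0, the rap/Minor cell would be NA, which is easy to misread as "unknown" when it really means "no songs".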
Adding Proportions
Raw counts are useful, but proportions tell a richer story:
music |>
  filter(!is.na(mode_label)) |>
  count(playlist_genre, mode_label) |>
  group_by(playlist_genre) |>
  mutate(
    prop = n / sum(n),        # proportion within each genre
    prop = round(prop, 3)     # round to 3 decimals
  ) |>
  ungroup()

Now you can see not just how many songs in each genre are major vs. minor, but what proportion. If pop is 70% major and edm is 55% major, that’s a meaningful difference, even if pop has more total songs.
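The role of group_by() in that pipeline is easy to miss, so here is a small sketch on invented counts. They are chosen to reproduce the 70% and 55% figures in the example sentence and are not taken from the music dataset.

```r
library(dplyr)
library(tibble)

# Toy counts (invented values, not from the music dataset)
toy_counts <- tribble(
  ~playlist_genre, ~mode_label, ~n,
  "pop",           "Major",     7,
  "pop",           "Minor",     3,
  "edm",           "Major",     11,
  "edm",           "Minor",     9
)

toy_props <- toy_counts |>
  group_by(playlist_genre) |>
  mutate(prop = n / sum(n)) |>   # sum(n) is the genre total because of group_by()
  ungroup()

toy_props

# Check: within each genre the proportions add up to 1
toy_props |>
  group_by(playlist_genre) |>
  summarise(total = sum(prop))
```

If you dropped the group_by() line, sum(n) would be the grand total of all 30 toy songs, and each proportion would describe a cell's share of the whole table rather than its share of the genre.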
The Grammar of Graphics (ggplot2)
R’s ggplot2 package builds charts in layers. Instead of clicking through menus like in Excel, you describe what you want:
ggplot(data, aes(x = ..., y = ..., fill = ...)) +
  geom_*() +
  labs() +
  theme()
Each piece:
| Component | What It Does | Example |
|---|---|---|
| ggplot() | Sets up the canvas and data | ggplot(music, aes(x = playlist_genre)) |
| aes() | Maps variables to visual properties | aes(x = genre, fill = mode) |
| geom_*() | Adds the visual layer (bars, points, etc.) | geom_col(), geom_boxplot() |
| labs() | Adds titles and labels | labs(title = "Genre Distribution") |
| theme() | Controls appearance | theme_minimal() |
Think of it like building a sandwich: you start with the bread (the canvas), add the main ingredient (the geometry), then add toppings (labels, colors, themes).
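The layering idea can be seen by building a plot one piece at a time and saving each stage. This is only a sketch: the toy data frame and the object names (p_canvas, p_bars, p_final) are made up for the example.

```r
library(ggplot2)

# Toy data (invented counts, not from the music dataset)
toy <- data.frame(
  genre = c("pop", "rap", "rock"),
  n     = c(5, 3, 2)
)

p_canvas <- ggplot(toy, aes(x = genre, y = n))   # canvas and mappings, no geometry yet
p_bars   <- p_canvas + geom_col()                # add the bars
p_final  <- p_bars +
  labs(title = "Toy Genre Counts", x = NULL, y = "Number of Songs") +
  theme_minimal()                                # labels and theme go on top

length(p_canvas$layers)  # 0: an empty canvas draws axes but no data
length(p_final$layers)   # 1: only geom_col() adds a drawing layer
p_final
```

Printing p_canvas on its own shows axes with nothing drawn on them, a quick way to confirm that ggplot() and aes() only set things up and that a geom is what puts data on the page.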
Bar Chart: Genre Distribution
Let’s build a chart in stages — from basic to publication-ready.
Stage 1: The Bare Minimum
ggplot(genre_counts, aes(x = playlist_genre, y = n)) +
  geom_col()

This works, but it’s ugly and hard to read. Let’s improve it step by step.
Stage 2: Sort the Bars
Unsorted bars make it harder to compare. Use fct_reorder() to sort by frequency:
ggplot(genre_counts, aes(x = fct_reorder(playlist_genre, n), y = n)) +
  geom_col() +
  coord_flip()   # horizontal bars are easier to read for category labels

Stage 3: Add Color and Labels
ggplot(genre_counts, aes(x = fct_reorder(playlist_genre, n), y = n, fill = playlist_genre)) +
  geom_col(show.legend = FALSE) +           # hide legend (redundant with axis)
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +     # colorblind-friendly palette
  labs(
    title = "Distribution of Songs by Genre",
    subtitle = "Billboard/Spotify Music Dataset (n = 1,792)",
    x = NULL,                               # remove redundant axis label
    y = "Number of Songs"
  ) +
  theme_minimal(base_size = 12)

Stage 4: Publication-Ready
ggplot(genre_counts, aes(x = fct_reorder(playlist_genre, n), y = n, fill = playlist_genre)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = paste0(n, " (", percent, "%)")),        # add count labels
            hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +   # room for labels
  labs(
    title = "Distribution of Songs by Genre",
    subtitle = "Billboard/Spotify Music Dataset (n = 1,792)",
    x = NULL,
    y = "Number of Songs",
    caption = "Source: Spotify API via coursepackR"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()                        # remove horizontal grid lines
  )

That’s the difference between a homework chart and a professional one. Every element serves a purpose: the sort order aids comparison, the labels provide exact values, and the clean theme reduces visual noise.
Stacked Proportional Bar Chart
A stacked proportional chart shows how the composition of one variable changes across categories of another. This is perfect for showing genre × mode relationships:
music |>
  filter(!is.na(mode_label)) |>
  ggplot(aes(x = fct_infreq(playlist_genre), fill = mode_label)) +
  geom_bar(position = "fill") +                      # "fill" makes it proportional
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::percent) +     # show as percentages
  labs(
    title = "Proportion of Major vs. Minor Mode by Genre",
    subtitle = "Billboard/Spotify Music Dataset",
    x = NULL,
    y = "Proportion",
    fill = "Mode",
    caption = "Source: Spotify API via coursepackR"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Use this chart when you want to compare the composition (not the count) across groups. It answers: “Does the proportion of major vs. minor mode differ across genres?” If all bars look roughly the same, there’s probably no relationship. If they differ visibly, that’s worth testing statistically (Chapter 6).
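To see what position = "fill" does to the numbers, you can ask ggplot2 for the values it computed. The sketch below uses invented toy data with 10 pop and 20 edm songs, and relies on layer_data(), a ggplot2 helper that returns the data behind a plot layer.

```r
library(ggplot2)

# Toy data (invented, not from the music dataset): one row per song
toy_songs <- data.frame(
  playlist_genre = c(rep("pop", 10), rep("edm", 20)),
  mode_label     = c(rep("Major", 7), rep("Minor", 3),
                     rep("Major", 11), rep("Minor", 9))
)

p_fill <- ggplot(toy_songs, aes(x = playlist_genre, fill = mode_label)) +
  geom_bar(position = "fill")

# layer_data() returns the values ggplot2 computed for drawing the bars
bars <- layer_data(p_fill)
bars$height <- bars$ymax - bars$ymin   # each segment's share of its bar

bars[, c("x", "count", "height")]
```

Every bar ends at 1 even though edm has twice as many toy songs as pop. The count column still holds the raw numbers, but the chart itself no longer shows them, so a count-based table or chart is worth reporting alongside it.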
Boxplot: Comparing a Numeric Variable Across Groups
Boxplots show the distribution of a continuous variable across categories. They reveal medians, spread, and outliers at a glance.
music |>
  ggplot(aes(x = fct_reorder(playlist_genre, energy, .fun = median),
             y = energy,
             fill = playlist_genre)) +
  geom_boxplot(show.legend = FALSE, alpha = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Energy Levels by Genre",
    subtitle = "Higher values indicate more energetic tracks",
    x = NULL,
    y = "Energy (0–1 scale)",
    caption = "Source: Spotify API via coursepackR"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Reading a boxplot:
- The thick line in the middle is the median (50th percentile)
- The box spans the 25th to 75th percentile (the middle 50% of values)
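- The whiskers extend to the most extreme values that lie within 1.5× the interquartile range (the length of the box) from each edge of the box; anything beyond is an outlier (shown as dots)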
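The pieces of a boxplot can also be computed by hand, which helps when you need to cite them in writing. The sketch below uses nine invented energy values (not from the music dataset) and follows the default rule geom_boxplot() applies: quartiles from quantile(), with outliers defined as points more than 1.5 × IQR beyond the box.

```r
# Toy energy values (invented, not from the music dataset)
energy <- c(0.05, 0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80)

q1  <- quantile(energy, 0.25, names = FALSE)  # lower edge of the box
med <- median(energy)                         # thick middle line
q3  <- quantile(energy, 0.75, names = FALSE)  # upper edge of the box
iqr <- q3 - q1                                # length of the box

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

inside   <- energy[energy >= lower_fence & energy <= upper_fence]
whiskers <- range(inside)   # whiskers stop at the most extreme non-outlying points
outliers <- energy[energy < lower_fence | energy > upper_fence]

c(q1 = q1, median = med, q3 = q3)
whiskers
outliers
```

Here the box runs from 0.50 to 0.70 with a median of 0.60, the whiskers reach 0.40 and 0.80, and 0.05 is the only point drawn as an outlier dot. Note that the whiskers end at actual data values, not at the fences themselves.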
Writing Your Interpretation
Numbers and charts don’t speak for themselves. Your assignment requires a 2–3 paragraph interpretation that cites specific numbers. Here’s how to structure it:
Paragraph 1: What the data contains
“The dataset includes 1,792 songs spanning six genres. Pop and rap are the most represented genres, accounting for X% and Y% of the dataset respectively, while latin represents the smallest share at Z%.”
Paragraph 2: Key patterns
“The cross-tabulation reveals notable differences in musical mode across genres. Pop songs are predominantly in major mode (X%), while [genre] shows a more even split (Y% major, Z% minor). This suggests that genre conventions may influence compositional choices around modality.”
Paragraph 3: Implications
“These descriptive patterns warrant further investigation. The apparent relationship between genre and mode will be tested statistically in the inferencing phase using a chi-square test of independence.”
“Most pop songs are in major mode” is weak. “72.3% of pop songs are in major mode, compared to 56.1% of edm tracks” is strong. Specific numbers make your writing credible and verifiable.
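One way to keep cited numbers accurate is to pull them from a table with code instead of retyping them. The sketch below is illustrative only: toy_props is a made-up two-row table whose values simply echo the example sentence above, and in practice you would use your own proportions table.

```r
# Toy proportions (illustrative values echoing the example sentence,
# not computed from the music dataset)
toy_props <- data.frame(
  playlist_genre = c("pop", "edm"),
  prop_major     = c(0.723, 0.561)
)

# Look up each value by genre name instead of retyping it
pop_major <- toy_props$prop_major[toy_props$playlist_genre == "pop"]
edm_major <- toy_props$prop_major[toy_props$playlist_genre == "edm"]

# %.1f formats to one decimal place; %% prints a literal percent sign
sentence <- sprintf(
  "%.1f%% of pop songs are in major mode, compared to %.1f%% of edm tracks.",
  pop_major * 100, edm_major * 100
)
sentence
```

If the underlying data changes, rerunning the code updates the sentence, so the numbers in your interpretation cannot drift away from the numbers in your tables.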
Try It Yourself
These exercises map directly to the Describing Data [R] assignment:
- Build a frequency table for playlist_genre with both counts and percentages. Format it so a reader can immediately see which genre is most and least common.
- Create a cross-tabulation of playlist_genre × mode_label with both counts and proportions.
- Build a horizontal bar chart that is publication-ready: sorted bars, colorblind-friendly palette, count labels, full titles and captions.
- Create a stacked proportional chart showing the proportion of major vs. minor mode within each genre.
- Choose one additional visualization (boxplot, scatter plot, or another chart type) and write 2–3 sentences justifying why you chose it and what it reveals.
- Write your interpretation: 2–3 paragraphs that cite specific numbers from your tables and describe the patterns you see.
For your final portfolio, you’ll create a frequency table and bar chart using your own variables — not genre and mode, but whatever you coded in your content analysis. The ggplot2 syntax is identical. Only the variable names change. The interpretation paragraphs become part of your Results section.