Data Wrangling: Import, Diagnose, Clean, Export
Real data is messy. The music dataset you’re about to work with has missing values, duplicate rows, inconsistent labels, and out-of-range numbers — on purpose. This chapter teaches you how to find and fix these problems using R.
This is the single most practical R skill you’ll learn: every dataset needs cleaning before analysis, and the cleaning process is the same regardless of the topic.
The Four-Step Wrangling Pipeline
Import → Diagnose → Clean → Export
Every data wrangling task follows this sequence. By the end of this chapter, you’ll have taken a messy CSV file and produced a clean .RDS file ready for visualization and statistical testing.
Step 1: Import
First, load your packages and import the data:
library(tidyverse) # for data manipulation and visualization
library(janitor) # for clean_names() — fixes messy column headers
music_raw <- read_csv("data/music_data_raw.csv")
music_raw <- music_raw |> clean_names()Let’s see what we’re working with:
dim(music_raw)
names(music_raw)|> Symbol?
The |> is called a pipe. It takes the result from the left side and passes it to the function on the right side. Read it as “and then.” So music_raw |> clean_names() means “take music_raw and then clean the names.”
You’ll also see %>% in some tutorials — it does the same thing. We use |> because it’s built into modern R.
Step 2: Diagnose
Before cleaning, you need to know what’s wrong. Think of this as a doctor’s exam — you check vital signs before prescribing treatment.
Check the Structure
glimpse(music_raw)Look at the data types: - <chr> = text (character) - <dbl> = number (double/decimal) - <date> = date
Check for Missing Values
music_raw |>
summarize(across(everything(), ~sum(is.na(.)))) |>
pivot_longer(everything(), names_to = "column", values_to = "missing") |>
filter(missing > 0) |>
arrange(desc(missing))This tells you which columns have missing data and how much. A few missing values are normal. Columns with most values missing might need to be dropped entirely.
Check for Duplicates
cat("Total rows:", nrow(music_raw), "\n")
cat("Unique rows:", nrow(distinct(music_raw)), "\n")
cat("Duplicates:", nrow(music_raw) - nrow(distinct(music_raw)), "\n")Check Categorical Variables
For text/categorical columns, check the unique values. Inconsistent labels are one of the most common data problems.
music_raw |> count(playlist_genre, sort = TRUE)
music_raw |> count(mode, sort = TRUE)Check Numeric Variables
music_raw |>
select(danceability, energy, valence, tempo, track_popularity) |>
summary()Look for values that don’t make sense: - danceability, energy, and valence should be between 0 and 1 - tempo should be a positive number (usually 50–250 BPM) - track_popularity should be between 0 and 100
Step 3: Clean
Now fix the problems you found. The order matters — handle duplicates first, then labels, then values.
3a: Remove Duplicates
music_clean <- music_raw |> distinct()
cat("Rows before:", nrow(music_raw), "\n")
cat("Rows after:", nrow(music_clean), "\n")3b: Standardize Text Labels
If your genre column has inconsistent capitalization or spelling (e.g., “Pop” and “pop” and “POP”), fix it:
music_clean <- music_clean |>
mutate(playlist_genre = str_to_lower(playlist_genre))
music_clean |> count(playlist_genre, sort = TRUE)3c: Handle Problematic Values
Use case_when() to recode or fix specific values. This function works like a series of if/then rules:
music_clean <- music_clean |>
mutate(
mode_label = case_when(
mode == 0 ~ "minor",
mode == 1 ~ "major",
TRUE ~ NA_character_ # anything else becomes NA
)
)
music_clean |> count(mode_label)case_when() Works
case_when() checks conditions from top to bottom. The first condition that’s TRUE determines the result. The TRUE ~ at the end is a catch-all — it handles anything that didn’t match the earlier rules. Think of it as “if none of the above, then…”
3d: Convert to Factors
A factor is R’s way of storing a categorical variable — it remembers both the values and their labels. This matters for visualization and statistical tests.
music_clean <- music_clean |>
mutate(
playlist_genre = factor(playlist_genre),
mode_label = factor(mode_label)
)
levels(music_clean$playlist_genre)
levels(music_clean$mode_label)Step 4: Export
Save your cleaned data as an .RDS file. RDS is R’s native format — it preserves factor levels, data types, and everything else about your cleaned dataset.
saveRDS(music_clean, "data/music_data_clean.RDS")
music_verify <- readRDS("data/music_data_clean.RDS")
cat("Saved successfully:", nrow(music_verify), "rows,", ncol(music_verify), "columns\n").RDS Instead of .csv?
When you save as CSV, R forgets which columns are factors and what their levels are. When you save as RDS, everything is preserved. Since you just spent time setting up your factors, you don’t want to lose that work. Always save your cleaned data as RDS.
The Complete Pipeline (Summary)
Here’s the entire wrangling process in one code block, without the diagnostic checks:
library(tidyverse)
library(janitor)
music_raw <- read_csv("data/music_data_raw.csv") |> clean_names()
music_clean <- music_raw |>
distinct() |> # remove duplicates
mutate(playlist_genre = str_to_lower(playlist_genre)) |> # standardize labels
mutate(
mode_label = case_when( # create labeled mode
mode == 0 ~ "minor",
mode == 1 ~ "major",
TRUE ~ NA_character_
)
) |>
mutate(
playlist_genre = factor(playlist_genre), # set factors
mode_label = factor(mode_label)
)
saveRDS(music_clean, "data/music_data_clean.RDS")Try It Yourself
These exercises directly prepare you for the Data Wrangling [R] assignment:
Diagnose another column: Pick a numeric variable from the dataset (e.g.,
energyorvalence). Usesummary()to check its range. Are there any values outside the expected 0–1 range?Practice
case_when(): Create a new variable calledpopularity_levelthat categorizestrack_popularityinto three groups:- “low” (0–33)
- “medium” (34–66)
- “high” (67–100)
Run the full pipeline: Starting from
read_csv(), apply all four steps to produce a clean.RDSfile. Verify it loads correctly withreadRDS().
For your own data, you’ll follow this exact pipeline — but with your CSV file and your variables. The assignment uses music_data_raw.csv as practice. Your final portfolio uses your own coded data. Same skills, different dataset.