Practice: Data Wrangling with Pokemon

Apply the full Import → Diagnose → Clean → Export pipeline to a new dataset

Download the Practice File

Download the .qmd to work through in RStudio:

The first half of this page walks through the dataset together. The second half is yours to complete independently.

Part 1: Setting the Stage

Why a New Dataset?

You learned the four-step wrangling pipeline in the Data Wrangling chapter using the music dataset. That chapter taught you the functions. This practice tests whether you can apply them — to data you haven’t seen, with column names you don’t recognize, and problems you have to find yourself.

The pipeline is always the same:

Import → Diagnose → Clean → Export

If you can wrangle Pokemon data, you can wrangle your content analysis data. Same skills, different columns.

The Dataset

Today’s data comes from the TidyTuesday project — a weekly community data practice used by R learners and professionals worldwide. It contains stats, types, and physical attributes for 949 Pokemon across all generations.

This dataset has:

Text columns (type_1, type_2, color_1, egg_group_1) — for practicing count() and factor()
Numeric columns (hp, attack, defense, speed, weight, height) — for practicing summary() and range checks
Real missing values — not every Pokemon has a secondary type, so type_2 and egg_group_2 have legitimate NAs
Enough rows (949) to feel like real data without being overwhelming

Importing Together

Start by loading packages and pulling the data from the URL:

library(tidyverse)
library(janitor)

pokemon_raw <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv"
)

pokemon <- pokemon_raw |>
  clean_names()

Reading Data from a URL

read_csv() works with URLs, not just local files. R downloads the data directly into memory. This is how many open datasets are shared — no file download needed.

Check what we’re working with:

dim(pokemon)
names(pokemon)

You should see 949 rows and 22 columns. Each row is one Pokemon.

Diagnosing Together

Let’s walk through the diagnostic checks as a class.

Structure

glimpse(pokemon)

Notice the mix of types:

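<chr> = text (character): names, types, colors, egg groups, URLs
<dbl> = numbers (double): height and weight, and also whole-number stats like hp, attack, defense, and speed, because read_csv() stores every numeric column as a double by default
<int> = whole numbers (integer): appears only when a column is explicitly read or converted as an integer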
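The type tags reflect how read_csv() guessed each column. The sketch below uses a hypothetical two-row sample (not the real file) and assumes readr 2.0 or later, where I() marks a string as literal data. It shows one way to check a guessed type and to request integer storage when you want it.

```r
library(readr)

# Hypothetical sample data standing in for the real CSV
sample_csv <- I("pokemon,hp,height\nbulbasaur,45,0.7\ncharmander,39,0.6\n")

# Default behaviour: let readr guess every column type
guessed <- read_csv(sample_csv, show_col_types = FALSE)

# Explicit request: store hp as an integer, guess the rest
typed <- read_csv(
  sample_csv,
  col_types = cols(hp = col_integer()),
  show_col_types = FALSE
)

class(guessed$hp)  # whole numbers are guessed as "numeric" (double)
class(typed$hp)    # "integer" because we asked for it
```

Either storage type works with summary() and where(is.numeric), so the distinction rarely matters for this practice.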

Missing Values

colSums(is.na(pokemon))

You’ll find missing values in type_2, color_2, and egg_group_2. This is not an error — not every Pokemon has a secondary type. But you need to decide what to do about it before analysis.

You’ll also find NAs in base_experience. Some Pokemon simply don’t have this stat recorded.
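With 22 columns, the colSums() output can be long. As a minimal sketch on a made-up three-row table (the values are illustrative, not taken from the dataset), the same count can be reshaped so that only the columns with missing values are listed, largest first.

```r
library(dplyr)
library(tidyr)

# Toy data with the same kind of gaps as the real dataset
toy <- tibble(
  pokemon = c("bulbasaur", "charmander", "squirtle"),
  type_1 = c("grass", "fire", "water"),
  type_2 = c("poison", NA, NA),
  base_experience = c(64, 62, NA)
)

na_summary <- toy |>
  # Count NAs in every column (one row, one column per variable)
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  # Reshape to one row per column name
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  # Keep only the columns that actually have gaps
  filter(n_missing > 0) |>
  arrange(desc(n_missing))

na_summary
```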

Duplicates

cat("Total rows:", nrow(pokemon), "\n")
cat("Unique rows:", nrow(distinct(pokemon)), "\n")
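If the two counts differ, you will want to see which rows repeat. The following is a small sketch on invented data; grouping by every column and keeping groups larger than one is one way to list the repeated rows, not the only one.

```r
library(dplyr)

# Toy data with one fully repeated row
toy <- tibble(
  pokemon = c("pikachu", "eevee", "pikachu"),
  hp = c(35, 55, 35)
)

# Same comparison as above, stored as a single number
n_duplicates <- nrow(toy) - nrow(distinct(toy))

# List every row that appears more than once
dupes <- toy |>
  group_by(across(everything())) |>
  filter(n() > 1) |>
  ungroup()

n_duplicates
dupes
```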

Numeric Ranges

pokemon |>
  select(
    hp,
    attack,
    defense,
    speed,
    weight,
    height,
    base_experience
  ) |>
  summary()

Check the mins and maxes. Do they make sense? Weight and height have huge ranges because Pokemon range from tiny to enormous. The combat stats (hp, attack, defense, speed) typically run 1–255.
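Reading summary() by eye works for a handful of columns. As a hedged sketch, the 1 to 255 expectation can also be written as a filter that returns only the rows breaking it. The table and the "typo_mon" row are invented for illustration.

```r
library(dplyr)

# Toy data: the last row has an hp value that looks like a typo
toy <- tibble(
  pokemon = c("chansey", "shuckle", "typo_mon"),
  hp = c(250, 20, 2500),
  defense = c(5, 230, 40)
)

# Keep rows where any checked stat falls outside 1 to 255
out_of_range <- toy |>
  filter(if_any(c(hp, defense), ~ !between(.x, 1, 255)))

out_of_range
```

An empty result means every checked value is inside the expected range. Rows with NA in a checked column are not flagged by this filter, so run the missing-value check separately.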

What Needs Cleaning?

Here’s what we found:

| Issue | Column(s) | Decision |
| --- | --- | --- |
| Missing secondary type | `type_2` | Fill with `"none"` (absence is meaningful) |
| Missing base experience | `base_experience` | Impute with median |
| No factor types | `type_1`, `generation_id` | Convert after cleaning |
| No derived variables | n/a | Create `power_tier` and `size_class` with `case_when()` |

This is typical. Not every dataset has dramatic errors. Sometimes cleaning is just structuring the data for the analysis you want to run.

Part 2: Your Turn

Everything below this line is yours. The downloaded .qmd file has empty code chunks for each step. Fill them in, render, and verify your output.

Clean the Data

Remove duplicates

pokemon_clean <- pokemon |>
  distinct()

Even if there are none, write the code. It’s part of the pipeline.

Handle missing values

Fill missing type_2 with "none" and impute missing numerics with the column median:

pokemon_clean <- pokemon_clean |>
  mutate(
    type_2 = ifelse(is.na(type_2), "none", type_2)
  ) |>
  mutate(
    across(
      where(is.numeric),
      ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
    )
  )
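Note that where(is.numeric) selects every numeric column, including identifier columns. If you would rather impute only the columns you decided on, a minimal sketch on invented data looks like this. Naming base_experience alone is an assumption made here for illustration.

```r
library(dplyr)

# Toy data with gaps in two numeric columns
toy <- tibble(
  id = c(1, 2, 3, 4),
  base_experience = c(64, NA, 240, 100),
  weight = c(6.9, 8.5, NA, 90.5)
)

imputed <- toy |>
  mutate(
    across(
      c(base_experience),  # only the columns named here are touched
      ~ if_else(is.na(.x), median(.x, na.rm = TRUE), .x)
    )
  )

imputed$base_experience  # the NA is replaced by the median of 64, 240, 100
imputed$weight           # untouched, so its NA is still there
```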

When Missing Data Isn’t an Error

In the music dataset, missing values usually meant something went wrong during collection. Here, a missing type_2 means the Pokemon only has one type. The decision to fill it with "none" vs. leave it as NA depends on your analysis plan. We fill it here so count() and factor() work cleanly.
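To see what the choice changes in practice, here is a small sketch on invented data. It uses tidyr's replace_na(), which does the same job as the ifelse() call above, and compares the factor levels before and after filling.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  pokemon = c("bulbasaur", "charmander", "squirtle"),
  type_2 = c("poison", NA, NA)
)

# Fill the gaps with an explicit label
filled <- toy |>
  mutate(type_2 = replace_na(type_2, "none"))

toy |> count(type_2)     # NA gets its own row in the count
filled |> count(type_2)  # "none" gets its own row instead

levels(factor(toy$type_2))     # NA is not a level: only "poison"
levels(factor(filled$type_2))  # "none" and "poison"
```

Left as NA, single-type Pokemon drop out of the factor levels. Filled with "none", they form a category you can count and plot.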

Create a power tier with `case_when()`

Create power_tier from base_experience:

"low" if below 100
"mid" if 100 to 199
"high" if 200 or above

pokemon_clean <- pokemon_clean |>
  mutate(
    power_tier = case_when(
      base_experience < 100  ~ "low",
      base_experience < 200  ~ "mid",
      base_experience >= 200 ~ "high",
      TRUE                   ~ NA_character_
    )
  )

pokemon_clean |>
  count(power_tier)

The TRUE ~ Catch-All

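The TRUE ~ NA_character_ line handles anything that didn’t match above, including any remaining NA values. Rows that match no condition become NA even without it, so the line does not change the result. What it does is make the fallback explicit, so anyone reading the code can see that unmatched rows were considered rather than forgotten. Always include a catch-all.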
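A short sketch on an invented vector shows the catch-all in action. The "unclassified" label is a hypothetical alternative, useful when you want unmatched rows to stand out in count() instead of blending in with other NAs.

```r
library(dplyr)

base_experience <- c(64, 150, 240, NA)

# No catch-all: the NA input matches nothing and becomes NA
implicit <- case_when(
  base_experience < 100 ~ "low",
  base_experience < 200 ~ "mid",
  base_experience >= 200 ~ "high"
)

# Explicit catch-all returning NA: same result, clearer intent
explicit <- case_when(
  base_experience < 100 ~ "low",
  base_experience < 200 ~ "mid",
  base_experience >= 200 ~ "high",
  TRUE ~ NA_character_
)

# Catch-all with a visible label (hypothetical choice)
labelled <- case_when(
  base_experience < 100 ~ "low",
  base_experience < 200 ~ "mid",
  base_experience >= 200 ~ "high",
  TRUE ~ "unclassified"
)

identical(implicit, explicit)
labelled
```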

Create a size class

Create size_class from weight:

"light" if under 25
"medium" if 25 to 100
"heavy" if over 100

pokemon_clean <- pokemon_clean |>
  mutate(
    size_class = case_when(
      weight < 25   ~ "light",
      weight <= 100 ~ "medium",
      weight > 100  ~ "heavy",
      TRUE          ~ NA_character_
    )
  )

pokemon_clean |>
  count(size_class)
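Cut points are where case_when() rules most often go wrong, so it helps to test the exact boundary values. The sketch below wraps the same rules in a helper named classify_size(), a hypothetical name introduced here, and feeds it values on either side of 25 and 100.

```r
library(dplyr)

# Hypothetical helper holding the same rules as above
classify_size <- function(weight) {
  case_when(
    weight < 25 ~ "light",
    weight <= 100 ~ "medium",
    weight > 100 ~ "heavy",
    TRUE ~ NA_character_
  )
}

# Values just below, on, and just above each cut point, plus a missing value
boundaries <- tibble(weight = c(24.9, 25, 100, 100.1, NA)) |>
  mutate(size_class = classify_size(weight))

boundaries
```

Exactly 25 and exactly 100 both land in "medium", which matches the written rule of "25 to 100".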

Convert to factors

pokemon_clean <- pokemon_clean |>
  mutate(
    type_1 = factor(type_1),
    type_2 = factor(type_2),
    power_tier = factor(power_tier),
    size_class = factor(size_class),
    generation_id = factor(generation_id)
  )

levels(pokemon_clean$type_1)
levels(pokemon_clean$power_tier)
levels(pokemon_clean$size_class)
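By default factor() sorts levels alphabetically, so power_tier comes out as high, low, mid. If you want tables and plots to read low, mid, high, one option is to pass the order yourself. This is a sketch on an invented vector using base R only.

```r
power_tier <- c("mid", "high", "low", "mid")

# Default: levels are sorted alphabetically
default_order <- factor(power_tier)
levels(default_order)

# Explicit: levels follow the order you supply
custom_order <- factor(power_tier, levels = c("low", "mid", "high"))
levels(custom_order)

# Counts are reported in level order
table(custom_order)
```

The same levels argument works inside mutate(), for example for size_class with light, medium, heavy.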

Export

saveRDS(pokemon_clean, "pokemon_clean.RDS")

pokemon_verify <- readRDS("pokemon_clean.RDS")

cat(
  "Rows:", nrow(pokemon_verify),
  "| Columns:", ncol(pokemon_verify), "\n"
)

levels(pokemon_verify$power_tier)
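The check on levels() matters because an RDS file keeps factor definitions, while a CSV file stores only text. The sketch below illustrates the difference on an invented two-row data frame written to temporary files, so nothing in your project folder is touched.

```r
# Toy data with a factor whose level order was set by hand
toy <- data.frame(
  pokemon = c("bulbasaur", "charizard"),
  power_tier = factor(c("low", "high"), levels = c("low", "mid", "high"))
)

# Round trip through RDS
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
restored <- readRDS(rds_path)

identical(toy, restored)     # the object comes back unchanged
levels(restored$power_tier)  # level order survives

# Round trip through CSV for comparison
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# Rebuilding the factor from CSV text loses the custom order
# and the unused "mid" level
levels(factor(from_csv$power_tier))
```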

The Complete Pipeline

Here’s everything in one block, without the diagnostic checks:

library(tidyverse)
library(janitor)

pokemon_raw <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv"
)

pokemon_clean <- pokemon_raw |>
  clean_names() |>
  distinct() |>
  mutate(
    type_2 = ifelse(is.na(type_2), "none", type_2)
  ) |>
  mutate(
    across(
      where(is.numeric),
      ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
    )
  ) |>
  mutate(
    power_tier = case_when(
      base_experience < 100  ~ "low",
      base_experience < 200  ~ "mid",
      base_experience >= 200 ~ "high",
      TRUE                   ~ NA_character_
    ),
    size_class = case_when(
      weight < 25   ~ "light",
      weight <= 100 ~ "medium",
      weight > 100  ~ "heavy",
      TRUE          ~ NA_character_
    )
  ) |>
  mutate(
    type_1 = factor(type_1),
    type_2 = factor(type_2),
    power_tier = factor(power_tier),
    size_class = factor(size_class),
    generation_id = factor(generation_id)
  )

saveRDS(pokemon_clean, "pokemon_clean.RDS")

Reflection

What was different about doing this yourself versus watching a demo?
Which function gave you the most trouble? What did you do to fix it?
What’s your plan for completing the Data Wrangling assignment before the deadline?

Connection to Your Assignment

The Data Wrangling assignment uses music_data_raw.csv and follows this exact pipeline. If you completed this practice, you’ve already done the hard part — same functions, same order, different column names.
