Practice: Data Wrangling with Pokemon

Apply the full Import → Diagnose → Clean → Export pipeline to a new dataset

Tip: Download the Practice File

Download the .qmd to work through in RStudio:

pokemon-practice.qmd

The first half of this page walks through the dataset together. The second half is yours to complete independently.


Part 1: Setting the Stage

Why a New Dataset?

You learned the four-step wrangling pipeline in the Data Wrangling chapter using the music dataset. That chapter taught you the functions. This practice tests whether you can apply them — to data you haven’t seen, with column names you don’t recognize, and problems you have to find yourself.

The pipeline is always the same:

Import → Diagnose → Clean → Export

If you can wrangle Pokemon data, you can wrangle your content analysis data. Same skills, different columns.

The Dataset

Today’s data comes from the TidyTuesday project — a weekly community data practice used by R learners and professionals worldwide. It contains stats, types, and physical attributes for 949 Pokemon across all generations.

This dataset has:

  • Text columns (type_1, type_2, color_1, egg_group_1) — for practicing count() and factor()
  • Numeric columns (hp, attack, defense, speed, weight, height) — for practicing summary() and range checks
  • Real missing values — not every Pokemon has a secondary type, so type_2 and egg_group_2 have legitimate NAs
  • Enough rows (949) to feel like real data without being overwhelming

Importing Together

Start by loading packages and pulling the data from the URL:

library(tidyverse)
library(janitor)

pokemon_raw <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv"
)

pokemon <- pokemon_raw |>
  clean_names()

Note: Reading Data from a URL

read_csv() works with URLs, not just local files. R downloads the data directly into memory. This is how many open datasets are shared — no file download needed.
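A minimal sketch of the same idea on toy data: read_csv() takes a local path, a URL, or (in readr 2.0 and later) literal CSV text wrapped in I(). The two-row CSV below is invented for illustration and is not part of the Pokemon file.

```r
library(readr)

# Hypothetical two-row CSV, invented for illustration
toy_csv <- "Name,Type 1,HP\nBulbasaur,grass,45\nCharmander,fire,39\n"

# I() tells read_csv() to treat the string as data rather than a file path
toy <- read_csv(I(toy_csv), show_col_types = FALSE)

dim(toy)    # rows, then columns
names(toy)  # raw column names, before clean_names()
```

The call has the same shape whichever source you use; only the first argument changes.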

Check what we’re working with:

dim(pokemon)
names(pokemon)

You should see 949 rows and 22 columns. Each row is one Pokemon.

Diagnosing Together

Let’s walk through the diagnostic checks as a class.

Structure

glimpse(pokemon)

Notice the mix of types:

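  • <chr> = text (character): names, types, colors, egg groups, URLs
  • <dbl> = numbers: by default, read_csv() reads every numeric column as a double, so whole-number stats like hp, attack, defense, and speed show <dbl> alongside height and weight
  • <int> = whole numbers: appears only if you ask for integer columns on import (for example, with col_types)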

Missing Values

colSums(is.na(pokemon))

You’ll find missing values in type_2, color_2, and egg_group_2. This is not an error — not every Pokemon has a secondary type. But you need to decide what to do about it before analysis.

You’ll also find NAs in base_experience. Some Pokemon simply don’t have this stat recorded.
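Raw NA counts are easier to judge next to the share of rows they represent. The sketch below uses a four-row toy table (invented for illustration, not the real data) to compute both.

```r
library(dplyr)

# Toy data, invented for illustration: two Pokemon have no secondary type
toy <- tibble(
  pokemon = c("bulbasaur", "charmander", "squirtle", "pidgey"),
  type_1  = c("grass", "fire", "water", "normal"),
  type_2  = c("poison", NA, NA, "flying"),
  base_experience = c(64, 62, NA, 50)
)

# Count of NAs per column (the same check used above)
na_counts <- colSums(is.na(toy))

# Share of NAs per column: the mean of TRUE/FALSE is a proportion
na_share <- toy |>
  summarise(across(everything(), ~ mean(is.na(.x))))

na_counts
na_share
```

A column that is half missing calls for a different decision than one missing a single value, which is why the share is worth a look before choosing between filling and imputing.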

Duplicates

cat("Total rows:", nrow(pokemon), "\n")
cat("Unique rows:", nrow(distinct(pokemon)), "\n")
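If the two counts ever differ, it helps to look at the repeated rows before dropping them. A small sketch on toy data (invented for illustration), using base R's duplicated() inside filter():

```r
library(dplyr)

# Toy data with one exact repeat
toy <- tibble(
  pokemon = c("pikachu", "eevee", "pikachu"),
  type_1  = c("electric", "normal", "electric")
)

# Rows that exactly repeat an earlier row
dupes <- toy |> filter(duplicated(toy))

# distinct() keeps the first copy of each row
toy_unique <- toy |> distinct()

dupes
nrow(toy_unique)
```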

Categories

pokemon |>
  count(type_1, sort = TRUE)

18 primary types. Water is the most common. This is a clean categorical variable — no inconsistent labels, no typos. Not every dataset will be this cooperative.

pokemon |>
  count(generation_id)

Generations 1 and 5 have the most Pokemon. This column is numeric, but it represents a category — a good candidate for factor().
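To see why the conversion matters, here is a sketch on a toy generation_id column (values invented for illustration). While the column is numeric, R will compute an average generation without complaint; as a factor, it is treated as a set of labels.

```r
library(dplyr)

toy <- tibble(generation_id = c(1, 1, 2, 5, 5, 5))

# As a number: R averages the generations, which means nothing
mean_gen <- mean(toy$generation_id)

# As a factor: each value becomes a category label
toy <- toy |>
  mutate(generation_id = factor(generation_id))

levels(toy$generation_id)
gen_counts <- toy |> count(generation_id)
gen_counts
```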

Numeric Ranges

pokemon |>
  select(
    hp,
    attack,
    defense,
    speed,
    weight,
    height,
    base_experience
  ) |>
  summary()

Check the mins and maxes. Do they make sense? Weight and height have huge ranges because Pokemon range from tiny to enormous. The combat stats (hp, attack, defense, speed) typically run 1–255.
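summary() shows the extremes but not which rows produced them. One way to pull those rows out is filter() with between(), sketched here on toy values (invented for illustration) against the 1–255 range mentioned above.

```r
library(dplyr)

# Toy data: two of the four hp values fall outside 1-255
toy <- tibble(
  pokemon = c("a", "b", "c", "d"),
  hp      = c(45, 0, 255, 300)
)

# between() is inclusive at both ends, so 255 passes and 0 does not
out_of_range <- toy |>
  filter(!between(hp, 1, 255))

out_of_range
```

Rows where hp is NA are not returned by this filter, so run the range check alongside the missing-value check rather than instead of it.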


What Needs Cleaning?

Here’s what we found:

Issue | Column(s) | Decision
Missing secondary type | type_2 | Fill with "none" (absence is meaningful)
Missing base experience | base_experience | Impute with median
No factor types | type_1, generation_id | Convert after cleaning
No derived variables | (new) | Create power_tier and size_class with case_when()

This is typical. Not every dataset has dramatic errors. Sometimes cleaning is just structuring the data for the analysis you want to run.


Part 2: Your Turn

Everything below this line is yours. The downloaded .qmd file has empty code chunks for each step. Fill them in, render, and verify your output.


Clean the Data

Remove duplicates

pokemon_clean <- pokemon |>
  distinct()

Even if there are none, write the code. It’s part of the pipeline.

Handle missing values

Fill missing type_2 with "none" and impute missing numerics with the column median:

pokemon_clean <- pokemon_clean |>
  mutate(
    type_2 = ifelse(is.na(type_2), "none", type_2)
  ) |>
  mutate(
    across(
      where(is.numeric),
      ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
    )
  )

Note: When Missing Data Isn’t an Error

In the music dataset, missing values usually meant something went wrong during collection. Here, a missing type_2 means the Pokemon only has one type. The decision to fill it with "none" vs. leave it as NA depends on your analysis plan. We fill it here so count() and factor() work cleanly.
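The difference is easy to see on a toy column (values invented for illustration). This sketch uses dplyr's coalesce() as an alternative to the ifelse() call above; both replace NA with "none".

```r
library(dplyr)

toy <- tibble(type_2 = c("poison", NA, NA, "flying"))

# Left as NA: count() reports the missing values in their own NA row
toy |> count(type_2)

# Filled: coalesce() swaps each NA for "none"
toy_filled <- toy |>
  mutate(type_2 = coalesce(type_2, "none"))

filled_counts <- toy_filled |> count(type_2)
filled_counts

# factor() leaves NA out of the levels but keeps "none" as a level
levels(factor(toy$type_2))
levels(factor(toy_filled$type_2))
```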

Create a power tier with case_when()

Create power_tier from base_experience:

  • "low" if below 100
  • "mid" if 100 to 199
  • "high" if 200 or above

pokemon_clean <- pokemon_clean |>
  mutate(
    power_tier = case_when(
      base_experience < 100  ~ "low",
      base_experience < 200  ~ "mid",
      base_experience >= 200 ~ "high",
      TRUE                   ~ NA_character_
    )
  )

pokemon_clean |>
  count(power_tier)

Tip: The TRUE ~ Catch-All

The TRUE ~ NA_character_ line handles anything that didn’t match above, including any remaining NA values. Without it, unmatched rows still become NA, but silently, which is confusing to debug. Writing the catch-all makes that outcome explicit, so always include one.
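To watch the catch-all at work, the sketch below runs the same conditions on four toy values (invented for illustration), one of them NA. The "unknown" label is a hypothetical stand-in used only to make the fall-through visible; the practice file uses NA_character_.

```r
library(dplyr)

toy <- tibble(base_experience = c(64, 142, 240, NA))

toy <- toy |>
  mutate(
    power_tier = case_when(
      base_experience < 100  ~ "low",
      base_experience < 200  ~ "mid",
      base_experience >= 200 ~ "high",
      # NA compared with a number is NA, so the NA row lands here.
      # "unknown" is a hypothetical label chosen for this demo.
      TRUE                   ~ "unknown"
    )
  )

toy
```

In dplyr 1.1.0 and later, the same catch-all can be written as .default = "unknown" in place of the TRUE ~ line.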

Create a size class

Create size_class from weight:

  • "light" if under 25
  • "medium" if 25 to 100
  • "heavy" if over 100

pokemon_clean <- pokemon_clean |>
  mutate(
    size_class = case_when(
      weight < 25   ~ "light",
      weight <= 100 ~ "medium",
      weight > 100  ~ "heavy",
      TRUE          ~ NA_character_
    )
  )

pokemon_clean |>
  count(size_class)

Convert to factors

pokemon_clean <- pokemon_clean |>
  mutate(
    type_1 = factor(type_1),
    type_2 = factor(type_2),
    power_tier = factor(power_tier),
    size_class = factor(size_class),
    generation_id = factor(generation_id)
  )

levels(pokemon_clean$type_1)
levels(pokemon_clean$power_tier)
levels(pokemon_clean$size_class)
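When you check the levels, note that factor() sorts them alphabetically unless told otherwise, so power_tier comes out as high, low, mid. If the order matters for tables or plots, one option is to pass levels explicitly. This is a sketch on a toy vector (invented for illustration), not a required step of the practice:

```r
tier <- c("low", "high", "mid", "low")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(tier))
default_levels

# Explicit: levels follow the order you give
tier_ordered <- factor(tier, levels = c("low", "mid", "high"))
levels(tier_ordered)

# Tables and plots follow the level order
table(tier_ordered)
```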

Export

saveRDS(pokemon_clean, "pokemon_clean.RDS")

pokemon_verify <- readRDS("pokemon_clean.RDS")

cat(
  "Rows:", nrow(pokemon_verify),
  "| Columns:", ncol(pokemon_verify), "\n"
)

levels(pokemon_verify$power_tier)
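The reason to check levels() after reloading is that RDS keeps R-specific structure such as factors, while a CSV stores only text and numbers. A minimal sketch on a toy data frame (invented for illustration), written to temporary files so nothing in your project folder is touched:

```r
library(readr)

toy <- data.frame(
  pokemon    = c("bulbasaur", "mewtwo"),
  power_tier = factor(c("low", "high"), levels = c("low", "mid", "high"))
)

# Temporary paths, so this demo leaves no files behind in the project
rds_path <- tempfile(fileext = ".RDS")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write_csv(toy, csv_path)

from_rds <- readRDS(rds_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

class(from_rds$power_tier)   # still a factor
levels(from_rds$power_tier)  # level order preserved
class(from_csv$power_tier)   # plain character after the CSV round trip
```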

The Complete Pipeline

Here’s everything in one block, without the diagnostic checks:

library(tidyverse)
library(janitor)

pokemon_raw <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv"
)

pokemon_clean <- pokemon_raw |>
  clean_names() |>
  distinct() |>
  mutate(
    type_2 = ifelse(is.na(type_2), "none", type_2)
  ) |>
  mutate(
    across(
      where(is.numeric),
      ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
    )
  ) |>
  mutate(
    power_tier = case_when(
      base_experience < 100  ~ "low",
      base_experience < 200  ~ "mid",
      base_experience >= 200 ~ "high",
      TRUE                   ~ NA_character_
    ),
    size_class = case_when(
      weight < 25   ~ "light",
      weight <= 100 ~ "medium",
      weight > 100  ~ "heavy",
      TRUE          ~ NA_character_
    )
  ) |>
  mutate(
    type_1 = factor(type_1),
    type_2 = factor(type_2),
    power_tier = factor(power_tier),
    size_class = factor(size_class),
    generation_id = factor(generation_id)
  )

saveRDS(pokemon_clean, "pokemon_clean.RDS")

Reflection

  1. What was different about doing this yourself versus watching a demo?
  2. Which function gave you the most trouble? What did you do to fix it?
  3. What’s your plan for completing the Data Wrangling assignment before the deadline?

Important: Connection to Your Assignment

The Data Wrangling assignment uses music_data_raw.csv and follows this exact pipeline. If you completed this practice, you’ve already done the hard part — same functions, same order, different column names.