Data Collection: Consistency at Scale

Your codebook is ready. Your spreadsheet is set up. Now you need to code your full sample — somewhere between 50 and 200 items, depending on your sampling plan. This chapter covers the practical realities of doing that well.

The Coder Drift Problem

Here’s what happens to every researcher: the first 20 items are easy. You’re focused, your definitions are fresh, and every coding decision feels clear. By item 75, things get fuzzy. You start second-guessing yourself. You code something as 2 that you would have coded as 1 an hour ago. This is coder drift — the gradual, unconscious shift in how you apply your coding rules.

Coder drift is normal. It’s also dangerous, because inconsistent coding produces unreliable data, and unreliable data produces meaningless results.

Two Weapons Against Drift

1. Anchor Re-Coding

At the start of every coding session, re-code 5 items you’ve already coded — your anchor examples from Chapter 1. If your new codes match the originals, you’re consistent. If they don’t, stop and review your decision rules before continuing.
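The anchor check is just an item-by-item comparison between your new codes and the originals. A minimal sketch in Python (the item IDs and code values here are invented for illustration):

```python
# Compare this session's anchor re-codes against the original codes.
# Item IDs and codes are made-up examples, not course data.
original = {"A01": 1, "A02": 2, "A03": 1, "A04": 3, "A05": 2}
recoded  = {"A01": 1, "A02": 2, "A03": 2, "A04": 3, "A05": 2}

matches = sum(1 for item in original if original[item] == recoded[item])
agreement = matches / len(original)

print(f"Anchor agreement: {matches}/{len(original)} ({agreement:.0%})")
if agreement < 1.0:
    mismatched = [i for i in original if original[i] != recoded[i]]
    print("Review decision rules for:", ", ".join(mismatched))
```

Anything below 5/5 agreement is your cue to stop and re-read the relevant decision rules before coding new items.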

2. The Decision Log

Every time you encounter an ambiguous case, write it down in your coder_notes column. Don’t just pick a code and move on — document why you chose it. At the end of each session, review your log. If you see patterns (e.g., “I keep struggling with items that are partially sarcastic”), add a new decision rule to your codebook.
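One low-tech way to surface those patterns is to tally recurring words across your coder_notes entries at the end of a session. A rough sketch in Python, using invented notes:

```python
from collections import Counter
import re

# Tally words across decision-log notes to spot recurring trouble spots.
# The notes below are invented examples.
notes = [
    "unsure if sarcastic or sincere",
    "partially sarcastic; coded literal meaning",
    "mixed topic, picked dominant frame",
    "sarcastic headline but serious body",
]

words = Counter(w for note in notes for w in re.findall(r"[a-z]+", note.lower()))
recurring = [(w, c) for w, c in words.most_common() if c > 1]
print(recurring)
```

Here "sarcastic" shows up three times, which is exactly the kind of pattern that should prompt a new decision rule in your codebook.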

Batch Workflow

Code in focused batches rather than marathon sessions. Here’s a realistic workflow:

Before Each Session (5 minutes)

  1. Open your codebook — have it visible on screen or printed
  2. Re-code your 5 anchor examples
  3. Review any decision log entries from last session

During Each Session (45–60 minutes)

  1. Code 30–40 items per session (more than this and accuracy drops)
  2. For each item: read/watch → assign codes → enter in spreadsheet → note anything ambiguous
  3. Save your spreadsheet after every 10 items

After Each Session (5 minutes)

  1. Review your decision log entries
  2. Update your codebook if you added any new rules
  3. Record your count: “Session 3 complete — items 61–95 coded”

Source-Specific Tips

Depending on your media source, you may need different retrieval strategies:

| Source | How to Access | What to Save |
|---|---|---|
| News articles (Nexis Uni) | Download in batches of 20–50; save as text or PDF | Headline, date, source, full text |
| Social media posts | Screenshot or use platform export tools | Post text, date, engagement counts, account name |
| YouTube/video | Watch and code in real time; note timestamps | Title, channel, date, duration, view count |
| Podcasts/audio | Listen and code; use timestamps for reference | Episode title, date, duration, key segments |
Important: Save Everything

Always keep the original source material accessible. You may need to go back and re-check a coding decision. For news articles, save the PDFs or text files. For social media, save screenshots. For video/audio, note the URL and timestamp.

Timeline Planning

You need your data collection complete before the R assignments begin. Let’s calculate your timeline:

| Your Number | Value |
|---|---|
| Total target sample size | ______ |
| Items coded so far (from pilot) | ______ |
| Items remaining | ______ |
| Sessions needed (remaining ÷ 35) | ______ |
| Sessions per day you can realistically do | ______ |
| Days needed | ______ |
| Target completion date | ______ (aim for Friday, April 3) |
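The arithmetic behind the table takes only a few lines to sanity-check. A sketch in Python with invented numbers (a 150-item target and two sessions per day are assumptions, not course requirements):

```python
import math

# Timeline arithmetic from the planning table, with made-up numbers.
target_sample = 150
coded_in_pilot = 20
batch_size = 35          # items per session, per the batch workflow above
sessions_per_day = 2

remaining = target_sample - coded_in_pilot      # items still to code
sessions = math.ceil(remaining / batch_size)    # round up: partial batches count
days = math.ceil(sessions / sessions_per_day)

print(f"{remaining} items left -> {sessions} sessions -> {days} days")
```

Note the rounding up: 130 remaining items at 35 per session is four sessions, not 3.7, because a partial batch still takes a sitting.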
Warning: The Hard Deadline

Your Data Wrangling [R] assignment is due April 10. You need a completed, clean dataset in CSV format before you can start that assignment. Working backward: finish coding by April 3, spend April 4–5 cleaning and checking your spreadsheet, and start the R assignment on April 6.

Quality Self-Check

Before you move to R, run a quality self-check on your completed spreadsheet.
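The exact checks are course-specific, but spreadsheet self-checks typically look for blank cells, out-of-range codes, and duplicate IDs. A sketch in Python (the column names, item IDs, and valid code set are invented examples, not your codebook):

```python
# Typical spreadsheet checks on exported coding data (invented examples).
rows = [
    {"item_id": "001", "tone_code": "2"},
    {"item_id": "002", "tone_code": ""},    # blank cell
    {"item_id": "002", "tone_code": "5"},   # duplicate ID, out-of-range code
]

valid_codes = {"1", "2", "3"}               # assumed coding scheme

blanks = [r["item_id"] for r in rows if not r["tone_code"].strip()]
bad = [r["item_id"] for r in rows
       if r["tone_code"].strip() and r["tone_code"] not in valid_codes]
ids = [r["item_id"] for r in rows]
dupes = sorted({i for i in ids if ids.count(i) > 1})

print("blank codes:", blanks)
print("invalid codes:", bad)
print("duplicate ids:", dupes)
```

Every ID that turns up in one of these lists is a row to fix in the spreadsheet now, while the source material is still fresh, rather than mid-analysis in R.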

How to Export as CSV

  • Excel: File → Save As → choose “CSV (Comma delimited)” from the format dropdown
  • Google Sheets: File → Download → Comma Separated Values (.csv)

Name your file descriptively: my_coding_data_raw.csv (the “raw” reminds you this is pre-cleaning).
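After exporting, it is worth confirming that the CSV parses cleanly and has the shape you expect. A self-contained sketch in Python (a tiny in-memory string stands in for your real my_coding_data_raw.csv, and the column names are invented):

```python
import csv
import io

# A tiny in-memory CSV stands in for my_coding_data_raw.csv.
exported = "item_id,tone_code\n001,2\n002,1\n"

reader = csv.reader(io.StringIO(exported))
header = next(reader)       # first row: column names
rows = list(reader)         # remaining rows: one per coded item

print(f"{len(rows)} rows, {len(header)} columns")
```

If the row count here doesn't match your session log ("Session 3 complete — items 61–95 coded"), something went wrong in the export.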

Try It Yourself

  1. Calculate your personal data collection timeline using the table above. Is your target realistic? If not, what can you adjust — batch size, sessions per day, or sample size (consult your professor before reducing sample size)?

  2. Open your coding spreadsheet and run the quality self-check. Fix any issues you find now, before you import to R.

Tip: Connection to Your Project

The CSV file you produce at the end of this chapter is the input for everything that follows. Chapters 4–6 will teach you R using the class music dataset, but you’ll apply those same techniques to your own CSV for the final portfolio. A clean dataset here means smooth analysis later. A messy dataset means debugging headaches in R.