Data Collection: Consistency at Scale
Your codebook is ready. Your spreadsheet is set up. Now you need to code your full sample — somewhere between 50 and 200 items, depending on your sampling plan. This chapter covers the practical realities of doing that well.
The Coder Drift Problem
Here’s what happens to every researcher: the first 20 items are easy. You’re focused, your definitions are fresh, and every coding decision feels clear. By item 75, things get fuzzy. You start second-guessing yourself. You code something as 2 that you would have coded as 1 an hour ago. This is coder drift — the gradual, unconscious shift in how you apply your coding rules.
Coder drift is normal. It’s also dangerous, because inconsistent coding produces unreliable data, and unreliable data produces meaningless results.
Two Weapons Against Drift
1. Anchor Re-Coding
At the start of every coding session, re-code 5 items you’ve already coded — your anchor examples from Chapter 1. If your new codes match the originals, you’re consistent. If they don’t, stop and review your decision rules before continuing.
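If you keep your anchor codes in your spreadsheet, you can compare by eye, but the comparison is also easy to script. Here is a minimal sketch in Python (later chapters use R; the logic is identical), with made-up item IDs and codes standing in for your real anchors:

```python
# Anchor-consistency check. Item IDs and code values below are
# hypothetical -- substitute your own 5 anchor items.

# Codes you originally assigned to your anchor items.
anchor_original = {"A01": 1, "A02": 2, "A03": 1, "A04": 3, "A05": 2}

# Codes you just assigned at the start of today's session.
anchor_today = {"A01": 1, "A02": 2, "A03": 2, "A04": 3, "A05": 2}

# Flag any item where today's code disagrees with the original.
mismatches = {item: (orig, anchor_today[item])
              for item, orig in anchor_original.items()
              if anchor_today[item] != orig}

if mismatches:
    print("Drift detected -- review your decision rules before coding:")
    for item, (orig, new) in mismatches.items():
        print(f"  {item}: original={orig}, today={new}")
else:
    print("All 5 anchors match -- consistent, keep coding.")
```

Any mismatch is your signal to stop and re-read the relevant codebook definitions before starting the session.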
2. The Decision Log
Every time you encounter an ambiguous case, write it down in your coder_notes column. Don’t just pick a code and move on — document why you chose it. At the end of each session, review your log. If you see patterns (e.g., “I keep struggling with items that are partially sarcastic”), add a new decision rule to your codebook.
Batch Workflow
Code in focused batches rather than marathon sessions. Here’s a realistic workflow:
Before Each Session (5 minutes)
- Open your codebook — have it visible on screen or printed
- Re-code your 5 anchor examples
- Review any decision log entries from last session
During Each Session (45–60 minutes)
- Code 30–40 items per session (more than this and accuracy drops)
- For each item: read/watch → assign codes → enter in spreadsheet → note anything ambiguous
- Save your spreadsheet after every 10 items
After Each Session (5 minutes)
- Review your decision log entries
- Update your codebook if you added any new rules
- Record your count: “Session 3 complete — items 61–95 coded”
Source-Specific Tips
Depending on your media source, you may need different retrieval strategies:
| Source | How to Access | What to Save |
|---|---|---|
| News articles (Nexis Uni) | Download in batches of 20–50; save as text or PDF | Headline, date, source, full text |
| Social media posts | Screenshot or use platform export tools | Post text, date, engagement counts, account name |
| YouTube/video | Watch and code in real time; note timestamps | Title, channel, date, duration, view count |
| Podcasts/audio | Listen and code; use timestamps for reference | Episode title, date, duration, key segments |
Always keep the original source material accessible. You may need to go back and re-check a coding decision. For news articles, save the PDFs or text files. For social media, save screenshots. For video/audio, note the URL and timestamp.
Timeline Planning
You need your data collection complete before the R assignments begin. Let’s calculate your timeline:
| Your Number | Value |
|---|---|
| Total target sample size | ______ |
| Items coded so far (from pilot) | ______ |
| Items remaining | ______ |
| Sessions needed (remaining ÷ 35) | ______ |
| Sessions per day you can realistically do | ______ |
| Days needed | ______ |
| Target completion date | ______ (aim for Friday, April 3) |
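Worked through with illustrative numbers (the 150-item target and pilot count below are assumptions, not your figures), the table's arithmetic looks like this as a quick Python sketch:

```python
import math

# Illustrative inputs -- replace with your own numbers from the table.
total_sample = 150      # total target sample size
coded_so_far = 25       # items already coded in the pilot
per_session = 35        # midpoint of the 30-40 items-per-session range
sessions_per_day = 1    # what you can realistically sustain

remaining = total_sample - coded_so_far                      # 125 items
sessions_needed = math.ceil(remaining / per_session)         # 4 sessions
days_needed = math.ceil(sessions_needed / sessions_per_day)  # 4 days

print(f"{remaining} items -> {sessions_needed} sessions -> {days_needed} days")
```

Note the `ceil`: a partial session is still a session, so always round up before counting backward from your target date.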
Your Data Wrangling [R] assignment is due April 10. You need a completed, clean dataset in CSV format before you can start that assignment. Working backward: finish coding by April 3, spend April 4–5 cleaning and checking your spreadsheet, and start the R assignment on April 6.
Quality Self-Check
Before you move to R, run a careful self-check on your completed spreadsheet: every row should have a unique item ID, no coding cells should be blank, every code should be a value defined in your codebook, and your column names should be short, lowercase, and free of spaces.
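The most common spreadsheet problems (blank cells, duplicate IDs, out-of-range codes) can also be caught with a short script once you have a CSV. A minimal sketch in Python, where the column names `item_id` and `code1` and the valid range 1–3 are assumptions to adapt to your own codebook:

```python
import csv

# Valid values for the example code column -- an assumption; use the
# values your codebook actually defines.
VALID_CODES = {"1", "2", "3"}

def quality_check(path):
    """Return a list of problems found in the coding CSV at `path`."""
    problems = []
    seen_ids = set()
    with open(path, newline="", encoding="utf-8") as f:
        # Data rows start at spreadsheet row 2 (row 1 is the header).
        for i, row in enumerate(csv.DictReader(f), start=2):
            item_id = row.get("item_id", "").strip()
            if not item_id:
                problems.append(f"row {i}: missing item_id")
            elif item_id in seen_ids:
                problems.append(f"row {i}: duplicate item_id {item_id}")
            else:
                seen_ids.add(item_id)
            if row.get("code1", "").strip() not in VALID_CODES:
                problems.append(f"row {i}: code1 not in {sorted(VALID_CODES)}")
    return problems
```

Run it on your exported file and fix anything it reports before touching R; an empty list back means the basic checks pass.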
How to Export as CSV
- Excel: File → Save As → choose “CSV (Comma delimited)” from the format dropdown
- Google Sheets: File → Download → Comma Separated Values (.csv)
Name your file descriptively: my_coding_data_raw.csv (the “raw” reminds you this is pre-cleaning).
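After exporting, it is worth confirming that the file actually parses cleanly, since a stray comma or an unquoted field can silently shift cells out of their columns. A small sketch using only Python's standard library (the function name is my own; the filename comes from this chapter):

```python
import csv

def check_csv(path):
    """Parse a CSV and report row count, column count, and ragged rows."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    # Data rows are numbered from 2 because row 1 is the header.
    # A "ragged" row has a different cell count than the header.
    ragged = [i for i, r in enumerate(data, start=2) if len(r) != len(header)]
    return len(data), len(header), ragged

# Example call with this chapter's suggested filename:
# n_rows, n_cols, ragged = check_csv("my_coding_data_raw.csv")
```

If `ragged` comes back non-empty, open those rows in your spreadsheet program and look for unescaped commas or merged cells before importing to R.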
Try It Yourself
Calculate your personal data collection timeline using the table above. Is your target realistic? If not, what can you adjust — batch size, sessions per day, or sample size (consult your professor before reducing sample size)?
Open your coding spreadsheet and run the quality self-check. Fix any issues you find now, before you import to R.
The CSV file you produce at the end of this chapter is the input for everything that follows. Chapters 4–6 will teach you R using the class music dataset, but you’ll apply those same techniques to your own CSV for the final portfolio. A clean dataset here means smooth analysis later. A messy dataset means debugging headaches in R.