Data Collection: Consistency at Scale
Your codebook is ready. Your spreadsheet is set up. Now you need to code your full sample — somewhere between 50 and 200 items, depending on your sampling plan. This chapter covers the practical realities of doing that well.
The Coder Drift Problem
Here’s what happens to every researcher: the first 20 items are easy. You’re focused, your definitions are fresh, and every coding decision feels clear. By item 75, things get fuzzy. You start second-guessing yourself. You code something as 2 that you would have coded as 1 an hour ago. This is coder drift — the gradual, unconscious shift in how you apply your coding rules.
Coder drift is normal. It’s also dangerous, because inconsistent coding produces unreliable data, and unreliable data produces meaningless results.
Two Weapons Against Drift
1. Anchor Re-Coding
At the start of every coding session, re-code 5 items you’ve already coded — your anchor examples from Chapter 1. If your new codes match the originals, you’re consistent. If they don’t, stop and review your decision rules before continuing.
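If you keep your anchor codes in your spreadsheet, you can compare by eye, but the comparison is also easy to script. Here is a minimal sketch in Python (later chapters use R; the logic is identical), with made-up item IDs and codes standing in for your real anchors:

```python
# Anchor-consistency check. Item IDs and code values below are
# hypothetical -- substitute your own 5 anchor items.

# Codes you originally assigned to your anchor items.
anchor_original = {"A01": 1, "A02": 2, "A03": 1, "A04": 3, "A05": 2}

# Codes you just assigned at the start of today's session.
anchor_today = {"A01": 1, "A02": 2, "A03": 2, "A04": 3, "A05": 2}

# Flag any item where today's code disagrees with the original.
mismatches = {item: (orig, anchor_today[item])
              for item, orig in anchor_original.items()
              if anchor_today[item] != orig}

if mismatches:
    print("Drift detected -- review your decision rules before coding:")
    for item, (orig, new) in mismatches.items():
        print(f"  {item}: original={orig}, today={new}")
else:
    print("All 5 anchors match -- consistent, keep coding.")
```

Any mismatch is your signal to stop and re-read the relevant codebook definitions before starting the session.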
2. The Decision Log
Every time you encounter an ambiguous case, write it down in your coder_notes column. Don’t just pick a code and move on — document why you chose it. At the end of each session, review your log. If you see patterns (e.g., “I keep struggling with items that are partially sarcastic”), add a new decision rule to your codebook.
Batch Workflow
Code in focused batches rather than marathon sessions. Here’s a realistic workflow:
Before Each Session (5 minutes)
- Open your codebook — have it visible on screen or printed
- Re-code your 5 anchor examples
- Review any decision log entries from last session
During Each Session (45–60 minutes)
- Code 30–40 items per session (more than this and accuracy drops)
- For each item: read/watch → assign codes → enter in spreadsheet → note anything ambiguous
- Save your spreadsheet after every 10 items
After Each Session (5 minutes)
- Review your decision log entries
- Update your codebook if you added any new rules
- Record your count: “Session 3 complete — items 61–95 coded”
Source-Specific Tips
Depending on your media source, you may need different retrieval strategies:
| Source | How to Access | What to Save |
|---|---|---|
| News articles (Nexis Uni) | Download in batches of 20–50; save as text or PDF | Headline, date, source, full text |
| Social media posts | Screenshot or use platform export tools | Post text, date, engagement counts, account name |
| YouTube/video | Watch and code in real time; note timestamps | Title, channel, date, duration, view count |
| Podcasts/audio | Listen and code; use timestamps for reference | Episode title, date, duration, key segments |
Always keep the original source material accessible. You may need to go back and re-check a coding decision. For news articles, save the PDFs or text files. For social media, save screenshots. For video/audio, note the URL and timestamp.
Timeline Planning
You need your data collection complete before the R assignments begin. Let’s calculate your timeline:
| Your Number | Value |
|---|---|
| Total target sample size | ______ |
| Items coded so far (from pilot) | ______ |
| Items remaining | ______ |
| Sessions needed (remaining ÷ 35) | ______ |
| Sessions per day you can realistically do | ______ |
| Days needed | ______ |
| Target completion date | ______ (aim for Friday, April 3) |
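Worked through with illustrative numbers (the 150-item target and pilot count below are assumptions, not your figures), the table's arithmetic looks like this as a quick Python sketch:

```python
import math

# Illustrative inputs -- replace with your own numbers from the table.
total_sample = 150      # total target sample size
coded_so_far = 25       # items already coded in the pilot
per_session = 35        # midpoint of the 30-40 items-per-session range
sessions_per_day = 1    # what you can realistically sustain

remaining = total_sample - coded_so_far                      # 125 items
sessions_needed = math.ceil(remaining / per_session)         # 4 sessions
days_needed = math.ceil(sessions_needed / sessions_per_day)  # 4 days

print(f"{remaining} items -> {sessions_needed} sessions -> {days_needed} days")
```

Note the `ceil`: a partial session is still a session, so always round up before counting backward from your target date.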
Your Data Wrangling [R] assignment is due April 10. You need a completed, clean dataset in CSV format before you can start that assignment. Working backward: finish coding by April 3, spend April 4–5 cleaning and checking your spreadsheet, and start the R assignment on April 6.
Quality Self-Check
Before you move to R, run a careful self-check on your completed spreadsheet: every row should have a unique item ID, no coding cells should be blank, every code should be a value defined in your codebook, and your column names should be short, lowercase, and free of spaces.
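The most common spreadsheet problems (blank cells, duplicate IDs, out-of-range codes) can also be caught with a short script once you have a CSV. A minimal sketch in Python, where the column names `item_id` and `code1` and the valid range 1–3 are assumptions to adapt to your own codebook:

```python
import csv

# Valid values for the example code column -- an assumption; use the
# values your codebook actually defines.
VALID_CODES = {"1", "2", "3"}

def quality_check(path):
    """Return a list of problems found in the coding CSV at `path`."""
    problems = []
    seen_ids = set()
    with open(path, newline="", encoding="utf-8") as f:
        # Data rows start at spreadsheet row 2 (row 1 is the header).
        for i, row in enumerate(csv.DictReader(f), start=2):
            item_id = row.get("item_id", "").strip()
            if not item_id:
                problems.append(f"row {i}: missing item_id")
            elif item_id in seen_ids:
                problems.append(f"row {i}: duplicate item_id {item_id}")
            else:
                seen_ids.add(item_id)
            if row.get("code1", "").strip() not in VALID_CODES:
                problems.append(f"row {i}: code1 not in {sorted(VALID_CODES)}")
    return problems
```

Run it on your exported file and fix anything it reports before touching R; an empty list back means the basic checks pass.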
How to Export as CSV
- Excel: File → Save As → choose “CSV (Comma delimited)” from the format dropdown
- Google Sheets: File → Download → Comma Separated Values (.csv)
Name your file descriptively: my_coding_data_raw.csv (the “raw” reminds you this is pre-cleaning).
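After exporting, it is worth confirming that the file actually parses cleanly, since a stray comma or an unquoted field can silently shift cells out of their columns. A small sketch using only Python's standard library (the function name is my own; the filename comes from this chapter):

```python
import csv

def check_csv(path):
    """Parse a CSV and report row count, column count, and ragged rows."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    # Data rows are numbered from 2 because row 1 is the header.
    # A "ragged" row has a different cell count than the header.
    ragged = [i for i, r in enumerate(data, start=2) if len(r) != len(header)]
    return len(data), len(header), ragged

# Example call with this chapter's suggested filename:
# n_rows, n_cols, ragged = check_csv("my_coding_data_raw.csv")
```

If `ragged` comes back non-empty, open those rows in your spreadsheet program and look for unescaped commas or merged cells before importing to R.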
Try It Yourself
Calculate your personal data collection timeline using the table above. Is your target realistic? If not, what can you adjust — batch size, sessions per day, or sample size (consult your professor before reducing sample size)?
Open your coding spreadsheet and run the quality self-check. Fix any issues you find now, before you import to R.
The CSV file you produce at the end of this chapter is the input for everything that follows. Chapters 4–6 will teach you R using the class music dataset, but you’ll apply those same techniques to your own CSV for the final portfolio. A clean dataset here means smooth analysis later. A messy dataset means debugging headaches in R.