# Glossary {.unnumbered}

Key terms from this workbook, defined in plain English.

Alternative Hypothesis (H₁): The claim that there is a relationship between two variables. This is what you’re hoping to find evidence for. If the p-value is small enough, you reject the null hypothesis in favor of this one.
APA Format: The citation and writing style used by the American Psychological Association. In research methods, it refers specifically to how you report statistical results — e.g., χ²(5) = 23.41, p < .001.
Categorical Variable: A variable whose values are groups or labels, not numbers on a scale. Examples: genre (pop, rock, rap), tone (positive, neutral, negative), mode (major, minor).
Chi-Square Test of Independence: A statistical test that asks: “Are these two categorical variables related, or are the patterns I see just due to chance?” It compares what you observed to what you’d expect if there were no relationship.
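
As a rough illustration of the test described above, here is a minimal base R sketch. The counts are made up for the example (they are not data from this workbook), and the object names `counts` and `result` are placeholders.

```r
# Made-up counts for illustration only: genre (rows) by tone (columns)
counts <- matrix(
  c(30, 10, 10,
    10, 30, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    genre = c("pop", "rock"),
    tone  = c("positive", "neutral", "negative")
  )
)

# Run the chi-square test of independence
result <- chisq.test(counts)

result$expected   # counts you'd expect if genre and tone were unrelated
result$statistic  # the chi-square value
result$parameter  # degrees of freedom: (2 - 1) * (3 - 1) = 2
result$p.value    # p-value for the test
```

Comparing `counts` with `result$expected` shows the observed-versus-expected comparison the test is built on.
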
Codebook: A document that defines every variable in your study — what it means, how you measure it, and what values it can take. Think of it as the instruction manual for your data.
Coder Drift: The gradual, unconscious change in how you apply coding rules over time. Item 1 and item 100 might get coded differently even though the rules haven’t changed. Anchor examples and decision logs help prevent this.
Contingency Table: A table showing the count for every combination of two categorical variables. Also called a cross-tabulation. It’s the input for a chi-square test.
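
One way to build a contingency table in base R is with `table()`. The sketch below uses a tiny invented data frame; `songs` and its column names are assumptions for illustration, not this workbook's dataset.

```r
# Invented example data: six songs, each coded for genre and mode
songs <- data.frame(
  genre = c("pop", "pop", "rock", "rock", "rap", "pop"),
  mode  = c("major", "minor", "minor", "minor", "major", "major")
)

# Count every combination of genre and mode
tab <- table(genre = songs$genre, mode = songs$mode)
tab

# Add row and column totals for easier reading
addmargins(tab)
```
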
Continuous Variable: A variable measured on a numeric scale where values can fall anywhere in a range. Examples: energy (0–1), tempo (BPM), word count.
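Cramér’s V: A measure of how strongly two categorical variables are associated, ranging from 0 (no association) to 1 (perfect association). Cohen’s benchmarks (.10 = small, .30 = medium, .50 = large) apply when the smaller dimension of the table has two categories; the thresholds get lower for larger tables.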
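
Cramér’s V can be computed by hand from the chi-square statistic as V = sqrt(χ² / (N × (k − 1))), where N is the total count and k is the smaller of the number of rows and columns. The sketch below assumes made-up counts; the variable names are placeholders.

```r
# Made-up 2 x 3 table of counts, for illustration only
counts <- matrix(
  c(30, 10, 10,
    10, 30, 10),
  nrow = 2, byrow = TRUE
)

# Chi-square statistic (correct = FALSE turns off the 2 x 2 continuity correction)
chi_sq <- unname(chisq.test(counts, correct = FALSE)$statistic)

n <- sum(counts)       # total number of observations
k <- min(dim(counts))  # smaller of (number of rows, number of columns)

cramers_v <- sqrt(chi_sq / (n * (k - 1)))
cramers_v
```
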
CSV (Comma-Separated Values): A simple file format where each row is a line of text and columns are separated by commas. Spreadsheet programs and R can both read and write CSV files.
Data Wrangling: The process of importing raw data, finding problems (missing values, duplicates, inconsistent labels), fixing those problems, and saving the result. Also called data cleaning or data munging.
Degrees of Freedom (df): A number that describes the “size” of your statistical test. For a chi-square test on a contingency table, df = (number of rows − 1) × (number of columns − 1). For example, a table with 3 rows and 4 columns has df = 2 × 3 = 6.
Effect Size: A measure of how strong a relationship is, separate from whether it’s statistically significant. A relationship can be real but tiny, or real and large. Cramér’s V is one effect size measure.
Factor: R’s data type for categorical variables. A factor stores each value as an integer code, along with the set of allowed labels (called “levels”). Setting your categorical columns as factors ensures R treats them correctly in charts and tests.
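
A short sketch of how a factor behaves, using invented tone codes. Setting `levels` explicitly controls the order R uses in tables and charts.

```r
# Invented example values
tone <- c("negative", "positive", "neutral", "positive")

# Convert to a factor with an explicit level order
tone_f <- factor(tone, levels = c("positive", "neutral", "negative"))

levels(tone_f)      # the allowed labels, in the order you set
as.integer(tone_f)  # the integer codes stored behind the labels
table(tone_f)       # counts follow the level order, not alphabetical order
```
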
Frequency Table: A table showing how many times each value of a variable appears in the dataset. Usually includes both raw counts and percentages.
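
A minimal base R sketch of a frequency table with counts and percentages. The `genre` values are invented, and `freq` is a placeholder name.

```r
# Invented example values
genre <- c("pop", "rock", "pop", "rap", "pop", "rock", "rap", "pop")

# Raw counts for each value
counts <- table(genre)

# Combine counts and percentages into one table
freq <- data.frame(
  genre   = names(counts),
  n       = as.vector(counts),
  percent = round(100 * as.vector(counts) / sum(counts), 1)
)
freq
```
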
ggplot2: An R package for creating data visualizations. It uses the “grammar of graphics” — you build charts in layers by specifying data, visual mappings (aesthetics), and geometric shapes (geoms).
GitHub Pages: A free hosting service from GitHub that turns a repository’s files into a public website. You push your rendered Quarto Book to GitHub, enable Pages, and your research report becomes a live URL.
Null Hypothesis (H₀): The claim that there is no relationship between two variables — any pattern you see is just random chance. The default assumption; you need statistical evidence to reject it.
Operational Definition: The specific, measurable procedure for assigning a value to a variable. “Tone is how positive or negative the article feels” is a conceptual definition. “Tone is coded as 1 (positive), 2 (neutral), or 3 (negative) based on the dominant framing in the headline and first paragraph” is an operational definition.
p-value: The probability of seeing data at least as extreme as yours if the null hypothesis were true. A small p-value (< .05) means results like yours would be unlikely if only chance were at work, so you reject the null hypothesis. It is NOT the probability that your hypothesis is true.
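
To see where a p-value comes from, the sketch below computes one directly for a chi-square test. The statistic (20) and degrees of freedom (2) are made-up inputs chosen for illustration.

```r
# Made-up inputs: a chi-square statistic of 20 with 2 degrees of freedom
chi_sq <- 20
df <- 2

# Probability of a statistic at least this large if the null hypothesis were true
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value

# Compare against the conventional .05 threshold
p_value < 0.05
```
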
Pipe (|>): An R operator that takes the result from the left side and passes it as the first argument to the function on the right side. Read it as “and then.”
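
The sketch below shows the same calculation written two ways, with nested function calls and with the pipe. It assumes R 4.1 or later, where `|>` is available; the numbers are invented.

```r
# Invented word counts
word_counts <- c(120, 85, 240, 60)

# Nested version: read from the inside out
nested <- round(mean(sqrt(word_counts)), 1)

# Piped version: take word_counts, and then sqrt, and then mean, and then round
piped <- word_counts |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```
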
Quarto: An open-source publishing tool that renders documents mixing text and code into PDFs, websites, presentations, and books. The successor to R Markdown.
Quarto Book: A Quarto project that combines multiple .qmd files into a single, multi-chapter document — rendered as both a PDF and a navigable website.
RDS: R’s native file format for saving data. Unlike CSV, RDS preserves factor levels, data types, and column attributes. Always save cleaned data as RDS.
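
A small round-trip sketch of the difference between RDS and CSV. It writes to temporary files, and the data frame `clean` is an invented stand-in for your cleaned data.

```r
# Invented cleaned data: a factor with three levels, one of them unused
clean <- data.frame(
  tone = factor(
    c("positive", "negative"),
    levels = c("positive", "neutral", "negative")
  )
)

# Temporary file paths so the example leaves nothing behind in your project
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(clean, rds_path)
write.csv(clean, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

levels(from_rds$tone)  # all three levels come back, in the original order
class(from_csv$tone)   # the CSV copy comes back without the factor levels
```
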
Standardized Residual: A number showing how much a specific cell in a contingency table deviates from what you’d expect under the null hypothesis. Values above +2 or below -2 indicate cells that are notably different from expected — these are the combinations “driving” your chi-square result.
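
In base R, `chisq.test()` returns standardized residuals in the `stdres` element of its result. The sketch below uses made-up counts and flags the cells beyond ±2.

```r
# Made-up counts for illustration only: genre (rows) by tone (columns)
counts <- matrix(
  c(30, 10, 10,
    10, 30, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    genre = c("pop", "rock"),
    tone  = c("positive", "neutral", "negative")
  )
)

# Standardized residual for every cell
std_res <- chisq.test(counts)$stdres
round(std_res, 2)

# TRUE marks cells that differ notably from what you'd expect
abs(std_res) > 2
```
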
Statistical Significance: A result is “statistically significant” when the p-value falls below a chosen threshold (usually .05). It means a pattern this strong would be unlikely to appear if the null hypothesis were true. It does NOT mean the finding is important, meaningful, or large.
Tidyverse: A collection of R packages (dplyr, ggplot2, tidyr, readr, stringr, etc.) that share a consistent design philosophy and work well together. The main toolkit for data wrangling and visualization in this course.