Glossary {.unnumbered}
Key terms from this workbook, defined in plain English.
- Alternative Hypothesis (H₁)
- The claim that there is a relationship between two variables. This is what you’re hoping to find evidence for. If the p-value is small enough, you reject the null hypothesis in favor of this one.
- APA Format
- The citation and writing style used by the American Psychological Association. In research methods, it refers specifically to how you report statistical results — e.g., χ²(5) = 23.41, p < .001.
- Categorical Variable
- A variable whose values are groups or labels, not numbers on a scale. Examples: genre (pop, rock, rap), tone (positive, neutral, negative), mode (major, minor).
- Chi-Square Test of Independence
- A statistical test that asks: “Are these two categorical variables related, or are the patterns I see just due to chance?” It compares what you observed to what you’d expect if there were no relationship.
- Codebook
- A document that defines every variable in your study — what it means, how you measure it, and what values it can take. Think of it as the instruction manual for your data.
- Coder Drift
- The gradual, unconscious change in how you apply coding rules over time. Item 1 and item 100 might get coded differently even though the rules haven’t changed. Anchor examples and decision logs help prevent this.
- Contingency Table
- A table showing the count for every combination of two categorical variables. Also called a cross-tabulation. It’s the input for a chi-square test.
- Continuous Variable
- A variable measured on a numeric scale where values can fall anywhere in a range. Examples: energy (0–1), tempo (BPM). Counts such as word count are whole numbers, but are usually treated the same way in analysis.
- Cramér’s V
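- A measure of how strongly two categorical variables are associated, ranging from 0 (no association) to 1 (perfect association). Cohen’s benchmarks: .10 = small, .30 = medium, .50 = large. These cutoffs apply when the smaller dimension of the table has two categories; for larger tables the cutoffs are lower.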
- CSV (Comma-Separated Values)
- A simple file format where each row is a line of text and columns are separated by commas. Spreadsheet programs and R can both read and write CSV files.
- Data Wrangling
- The process of importing raw data, finding problems (missing values, duplicates, inconsistent labels), fixing those problems, and saving the result. Also called data cleaning or data munging.
- Degrees of Freedom (df)
- A number that describes the “size” of your statistical test. For a chi-square test on a contingency table, df = (number of rows - 1) × (number of columns - 1).
- Effect Size
- A measure of how strong a relationship is, separate from whether it’s statistically significant. A relationship can be real but tiny, or real and large. Cramér’s V is one effect size measure.
- Factor
- R’s data type for categorical variables. A factor stores both the values and their labels (called “levels”). Setting your categorical columns as factors ensures R treats them correctly in charts and tests.
- Frequency Table
- A table showing how many times each value of a variable appears in the dataset. Usually includes both raw counts and percentages.
- ggplot2
- An R package for creating data visualizations. It uses the “grammar of graphics” — you build charts in layers by specifying data, visual mappings (aesthetics), and geometric shapes (geoms).
- GitHub Pages
- A free hosting service from GitHub that turns a repository’s files into a public website. You push your rendered Quarto Book to GitHub, enable Pages, and your research report becomes a live URL.
- Null Hypothesis (H₀)
- The claim that there is no relationship between two variables — any pattern you see is just random chance. The default assumption; you need statistical evidence to reject it.
- Operational Definition
- The specific, measurable procedure for assigning a value to a variable. “Tone is how positive or negative the article feels” is a conceptual definition. “Tone is coded as 1 (positive), 2 (neutral), or 3 (negative) based on the dominant framing in the headline and first paragraph” is an operational definition.
- p-value
- The probability of seeing data at least as extreme as yours if the null hypothesis were true. A small p-value (< .05) means results like yours would be unlikely if only chance were at work, so you reject the null hypothesis. It is NOT the probability that the null hypothesis is true, or that your hypothesis is true.
- Pipe (`|>`)
- An R operator that takes the result from the left side and passes it as the first argument to the function on the right side. Read it as “and then.”
- Quarto
- An open-source publishing tool that renders documents mixing text and code into PDFs, websites, presentations, and books. The successor to R Markdown.
- Quarto Book
- A Quarto project that combines multiple `.qmd` files into a single, multi-chapter document, rendered as both a PDF and a navigable website.
- RDS
- R’s native file format for saving data. Unlike CSV, RDS preserves factor levels, data types, and column attributes. Always save cleaned data as RDS.
- Standardized Residual
- A number showing how much a specific cell in a contingency table deviates from what you’d expect under the null hypothesis. Values above +2 or below -2 indicate cells that are notably different from expected — these are the combinations “driving” your chi-square result.
- Statistical Significance
- A result is “statistically significant” when the p-value falls below a chosen threshold (usually .05). It means the pattern in the data is unlikely to have occurred by chance alone. It does NOT mean the finding is important, meaningful, or large.
- Tidyverse
- A collection of R packages (`dplyr`, `ggplot2`, `tidyr`, `readr`, `stringr`, etc.) that share a consistent design philosophy and work well together. The main toolkit for data wrangling and visualization in this course.
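
Several of the terms above (contingency table, chi-square test of independence, degrees of freedom, Cramér’s V, standardized residual) fit together in one short workflow. The sketch below shows one way they might connect in base R. The counts are made up purely for illustration, and the object names (`counts`, `result`, `cramers_v`, `std_res`) are placeholders, not names used elsewhere in this workbook.

```r
# Made-up counts for illustration only: 120 songs cross-classified
# by genre (rows) and mode (columns). This is the contingency table.
counts <- matrix(
  c(30, 10,
    20, 20,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(genre = c("pop", "rock", "rap"),
                  mode  = c("major", "minor"))
)

# Chi-square test of independence (base R, stats package)
result <- chisq.test(counts)

# Degrees of freedom: (rows - 1) x (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Cramér's V: sqrt(chi-square / (n * (k - 1))),
# where k is the smaller of the number of rows and columns
n <- sum(counts)
k <- min(dim(counts))
cramers_v <- sqrt(unname(result$statistic) / (n * (k - 1)))

# Standardized residuals: cells beyond +2 or -2 stand out
std_res <- result$stdres

print(result)
print(df)
print(round(cramers_v, 2))
print(round(std_res, 2))
```

With these made-up counts, the result would be reported in APA format as χ²(2) = 20.00, p < .001, with Cramér’s V = .41. The cells in `std_res` with values beyond ±2 are the genre and mode combinations driving that result.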