Chapter 14: The Surprise Detector

Learning Objectives

  • Understand the logic of null hypothesis significance testing (NHST)
  • Interpret p-values correctly (and avoid common misinterpretations)
  • Conduct chi-square tests for categorical relationships in R
  • Calculate and interpret confidence intervals
  • Distinguish between statistical significance and practical significance
  • Recognize when hypothesis testing is appropriate and when it’s not
  • Report statistical results following APA guidelines

You’ve visualized your data. You created a scatterplot showing that songs with negative lyric sentiment tend to chart higher than songs with positive sentiment. You faceted a boxplot and noticed that high-intensity songs appear more common among top-charting hits than low-intensity songs.

But here’s the uncomfortable question: Could these patterns be flukes?

With a dataset of 200 songs, random variation creates patterns. If you shuffle sentiment labels randomly and plot the results, you’ll sometimes see apparent differences between groups—not because sentiment actually matters, but because chance produces noise that looks like signal.

Statistical inference is how we distinguish signal from noise. It asks: “If nothing real were happening—if sentiment truly didn’t affect chart performance—how surprised should we be to see a pattern this strong?”

This chapter teaches you to quantify surprise.

The Logic of Hypothesis Testing

Hypothesis testing feels backward at first. You don’t directly prove what you believe. Instead, you assume the opposite (the null hypothesis), then calculate how surprising your data would be if that assumption were true.

An analogy: Imagine you suspect a coin is unfair—weighted to land heads more often. You can’t prove it’s unfair directly. But you can flip it 100 times and see what happens.

  • Null hypothesis (H₀): The coin is fair (50/50 heads/tails)
  • Alternative hypothesis (H₁): The coin is unfair

You flip 100 times. Result: 65 heads, 35 tails.

Question: If the coin were truly fair, how often would random chance produce 65 or more heads out of 100 flips?

Answer (from probability theory): About 0.3% of the time.

That’s rare. If the null hypothesis (fair coin) were true, you’d almost never see 65 heads. So either: 1. The coin is fair, and you witnessed a very unlikely event (0.3% chance), OR 2. The coin is unfair

Most researchers conclude: The null hypothesis is probably false. The coin is likely unfair.

This is the logic of null hypothesis significance testing (NHST): Assume nothing’s happening, calculate how surprising your data are, and reject the null if the data are sufficiently surprising.

P-Values: Quantifying Surprise

The p-value is the probability of observing data at least as extreme as yours, assuming the null hypothesis is true.

Coin example: p = 0.003 (0.3%)

Music example: You test whether negative sentiment songs chart differently than positive sentiment songs.

  • H₀: Sentiment and chart position are unrelated
  • H₁: Sentiment and chart position are related

You run a statistical test. Result: p = 0.02 (2%)

Interpretation: If sentiment truly didn’t matter (H₀ is true), you’d see a difference this large or larger only 2% of the time due to random chance.

That’s surprising. Most researchers would reject the null hypothesis and conclude that sentiment and chart position are related.

The 0.05 Threshold (and Why It’s Arbitrary)

By convention, researchers use p < 0.05 as the threshold for “statistical significance.”

  • p < 0.05 → “Statistically significant” → Reject null hypothesis
  • p ≥ 0.05 → “Not statistically significant” → Fail to reject null

But this threshold is arbitrary. There’s nothing magical about 5%. It’s a convention that emerged historically and stuck. Some fields use stricter thresholds (p < 0.01 or p < 0.001). Others are moving away from bright-line cutoffs entirely.

What p = 0.049 vs. p = 0.051 really means: Almost nothing. The first barely crosses the threshold; the second barely misses. Yet one is declared “significant” and the other “not significant.” This is why many statisticians now recommend reporting exact p-values and focusing on effect sizes rather than binary yes/no decisions.

What P-Values Do NOT Mean

P-values are wildly misinterpreted. Here’s what they do not tell you:

❌ “p = 0.03 means there’s a 3% chance the null hypothesis is true”

Wrong. The null hypothesis is either true or false (you just don’t know which). P-values tell you about the probability of data, not hypotheses. Correct interpretation: “If H₀ were true, we’d see data this extreme 3% of the time.”

❌ “p = 0.03 means there’s a 97% chance the alternative hypothesis is true”

Wrong. P-values say nothing about the probability of H₁ being true.

❌ “p = 0.03 means the effect is large or important”

Wrong. Statistical significance doesn’t imply practical significance. A tiny, trivial effect can be statistically significant with a large enough sample. A large, important effect can be non-significant with a small sample.

❌ “p > 0.05 means there’s no effect”

Wrong. Failing to find significance doesn’t prove the null. It means your data aren’t strong enough evidence to reject it. Absence of evidence isn’t evidence of absence.

✓ Correct interpretation: “If the null hypothesis were true, we’d observe data at least this extreme [p × 100]% of the time due to random variation.”

Chi-Square Test: Categorical Relationships

The chi-square test (χ²) tests whether two categorical variables are independent or related.

Use case: Do lyric sentiment and chart performance category (top 10 vs. not top 10) relate to each other?

The Research Question

Hypothesis: Negative sentiment songs are more likely to chart in the top 10 than positive sentiment songs.

Null hypothesis (H₀): Sentiment and top-10 status are independent (unrelated).

Alternative hypothesis (H₁): Sentiment and top-10 status are related.

Creating the Contingency Table

First, create a binary “top 10” variable and cross-tabulate it with sentiment:

library(tidyverse)

music_data <- music_data %>%
  mutate(top10 = if_else(Max_Rank <= 10, "Top 10", "Not Top 10"))

# Create contingency table
table(music_data$Lyric_Sentiment, music_data$top10)

Example output:

                Not Top 10  Top 10
  Positive           35       15
  Negative           25       25
  Neutral            20       10
  Mixed              30       20

Observation: Negative songs have a higher proportion in top 10 (25/50 = 50%) than Positive songs (15/50 = 30%).

Question: Is this difference statistically significant, or could it be chance?

Running the Chi-Square Test

# Chi-square test
chisq.test(music_data$Lyric_Sentiment, music_data$top10)

Output:

    Pearson's Chi-squared test

data:  music_data$Lyric_Sentiment and music_data$top10
X-squared = 8.42, df = 3, p-value = 0.038

Interpretation:

  • χ² = 8.42: The test statistic (measures how much the observed frequencies differ from what we’d expect if variables were independent)
  • df = 3: Degrees of freedom (based on table dimensions: [rows - 1] × [columns - 1] = [4 - 1] × [2 - 1] = 3)
  • p = 0.038: If sentiment and chart position were unrelated, we’d see a chi-square value this large or larger about 3.8% of the time by chance

Conclusion: p < 0.05, so we reject the null hypothesis. Lyric sentiment and top-10 chart performance appear to be related.

Assumptions and Warnings

Chi-square tests have assumptions:

  1. Independence: Each observation is independent (one song can’t be coded twice)
  2. Expected frequencies: Each cell in the contingency table should have an expected frequency ≥ 5

Check expected frequencies:

chisq.test(music_data$Lyric_Sentiment, music_data$top10)$expected

If any cell has expected frequency < 5, the test may be unreliable. Consider: - Combining categories (e.g., merge “Neutral” and “Mixed”) - Using Fisher’s exact test instead

Effect Size: Cramér’s V

Statistical significance tells you whether a relationship exists. Effect size tells you how strong it is.

Cramér’s V measures association strength for categorical variables: - V = 0: No association - V = 0.1: Small effect - V = 0.3: Medium effect - V = 0.5: Large effect

library(rcompanion)

cramerV(music_data$Lyric_Sentiment, music_data$top10)

Example output: V = 0.21

Interpretation: A small-to-medium effect. Sentiment and chart position are related, but the relationship isn’t overwhelmingly strong.

Confidence Intervals: Estimating Uncertainty

P-values tell you whether to reject the null. Confidence intervals tell you the range of plausible values for an effect.

Example: What’s the average chart position for negative sentiment songs?

# Calculate mean chart position for negative songs
negative_songs <- music_data %>%
  filter(Lyric_Sentiment == "Negative")

mean(negative_songs$Max_Rank)  # Point estimate

Output: Mean = 32.4

But this is just one sample. The true population mean (if we coded all songs ever) is unknown. A 95% confidence interval gives a range of plausible values.

# Calculate 95% CI
t.test(negative_songs$Max_Rank)$conf.int

Output: [28.1, 36.7]

Interpretation: We’re 95% confident that the true average chart position for negative sentiment songs lies between 28.1 and 36.7.

What “95% confident” means: If we repeated this study 100 times with different random samples, about 95 of those confidence intervals would contain the true population mean. It does not mean there’s a 95% probability the true mean is in this specific interval.

Comparing Groups with Confidence Intervals

Compare negative vs. positive songs:

# Negative songs
negative_ci <- t.test(negative_songs$Max_Rank)$conf.int
negative_ci

# Positive songs
positive_songs <- music_data %>%
  filter(Lyric_Sentiment == "Positive")

positive_ci <- t.test(positive_songs$Max_Rank)$conf.int
positive_ci

Output: - Negative: [28.1, 36.7] - Positive: [40.2, 52.8]

Interpretation: The confidence intervals don’t overlap. This suggests negative songs chart significantly higher (lower numbers = better) than positive songs.

Running Statistical Tests in R

Chi-Square Test (Categorical × Categorical)

chisq.test(music_data$Lyric_Sentiment, music_data$top10)

T-Test (Comparing Two Group Means)

# Are negative songs faster (higher tempo) than positive songs?
t.test(Tempo ~ Lyric_Sentiment, 
       data = music_data %>% filter(Lyric_Sentiment %in% c("Negative", "Positive")))

Output:

    Welch Two Sample t-test

data:  Tempo by Lyric_Sentiment
t = 2.13, df = 87.5, p-value = 0.036
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.2  22.8
sample estimates:
mean in group Negative mean in group Positive 
                 128.5                   116.5

Interpretation: - Negative songs have higher average tempo (128.5 BPM) than positive songs (116.5 BPM) - Difference: 12 BPM - 95% CI for difference: [1.2, 22.8] - p = 0.036, so the difference is statistically significant

Correlation (Two Numeric Variables)

# Is there a relationship between tempo and chart peak?
cor.test(music_data$Tempo, music_data$Max_Rank)

Output:

    Pearson's product-moment correlation

data:  music_data$Tempo and music_data$Max_Rank
t = -1.92, df = 198, p-value = 0.056
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.267  0.003
sample estimates:
       cor 
-0.135

Interpretation: - Weak negative correlation (r = -0.135): Faster songs tend to chart slightly higher - p = 0.056 (just above 0.05 threshold), so not quite statistically significant by conventional standards - 95% CI includes zero [-0.267, 0.003], suggesting the relationship might not exist in the population

Statistical Significance vs. Practical Significance

A finding can be statistically significant (unlikely to be chance) but practically trivial (too small to matter).

Example: You test whether negative songs have higher tempo. Result:

  • Difference: 2.1 BPM (negative songs average 121.1, positive songs average 119.0)
  • p = 0.003 (highly statistically significant)
  • Sample size: 10,000 songs

Is 2.1 BPM meaningful? Listeners can’t perceive a 2 BPM difference. DJs wouldn’t care. This is statistically significant but practically trivial.

Conversely: A difference of 30 BPM (very noticeable) might not be statistically significant if your sample is only 15 songs. Lack of significance doesn’t mean the effect doesn’t exist—just that you don’t have enough data to detect it reliably.

Takeaway: Always report effect sizes and confidence intervals alongside p-values. Statistical significance tells you whether there’s an effect; effect size tells you how much.

When Hypothesis Testing Isn’t Appropriate

Hypothesis tests work when you have: 1. A clear null hypothesis 2. Random or representative sampling 3. Sufficient sample size 4. Data that meet test assumptions

When NOT to use hypothesis testing:

Exploratory analysis: If you’re exploring data without specific predictions, hypothesis tests are misleading. You’re fishing for patterns, which inflates false positives.

Convenience samples: If your 200 songs aren’t a random sample of all music (they’re probably not), p-values don’t generalize to the population. They only describe this specific dataset.

Tiny samples: With n = 10, even large effects may not reach significance. The test lacks power.

Multiple comparisons: If you test 20 hypotheses at p < 0.05, you’d expect 1 false positive by chance. Adjust for multiple testing or report all tests conducted.

Reporting Statistical Results

Follow APA format for transparency:

Chi-square:

“A chi-square test revealed a significant relationship between lyric sentiment and top-10 chart status, χ²(3, N = 200) = 8.42, p = .038, Cramér’s V = 0.21. Negative sentiment songs were more likely to chart in the top 10 (50%) than positive sentiment songs (30%).”

T-test:

“An independent samples t-test showed that negative sentiment songs (M = 128.5 BPM, SD = 24.3) had significantly higher tempo than positive sentiment songs (M = 116.5 BPM, SD = 22.1), t(87.5) = 2.13, p = .036, 95% CI [1.2, 22.8], d = 0.51.”

Correlation:

“Tempo and chart peak position were weakly negatively correlated, r(198) = -.135, p = .056, 95% CI [-.267, .003], indicating that faster songs tended to chart slightly higher, though this relationship did not reach statistical significance.”

What to include: - Test name - Test statistic and degrees of freedom - Exact p-value (not just “p < .05”) - Effect size - Descriptive statistics (means, SDs) - Confidence intervals when available


Practice: Statistical Inference

Exercise 14.1: Interpreting P-Values

For each p-value, write a correct interpretation:

  1. p = 0.03
  2. p = 0.18
  3. p = 0.001

Exercise 14.2: Chi-Square Test

Test whether Emotional_Intensity (Low/Medium/High) relates to top-10 chart status.

  1. Create a contingency table
  2. Run the chi-square test
  3. Interpret the p-value
  4. Calculate Cramér’s V
  5. Write an APA-formatted results sentence

Exercise 14.3: Confidence Intervals

Calculate 95% confidence intervals for average tempo in each sentiment category. Do the intervals overlap? What does that tell you?


Exercise 14.4: T-Test

Test whether songs that reached #1 have different average tempo than songs that peaked at #2-10. Report:

  1. Descriptive statistics (means, SDs, sample sizes)
  2. Test statistic and p-value
  3. 95% CI for the difference
  4. Effect size (Cohen’s d)
  5. Interpretation in plain language

Reflection Questions

  1. The 0.05 Threshold: If p = 0.048, you reject the null. If p = 0.052, you don’t. Does this bright line make sense, or should we think about evidence more continuously? What are the consequences of treating 0.05 as a hard cutoff?

  2. Publication Bias: Studies with p < 0.05 are more likely to be published than studies with p > 0.05. How does this distort our understanding of what’s true? If you found p = 0.08, would you report it honestly or keep testing until you got p < 0.05?

  3. Significance vs. Importance: You can have a statistically significant finding that’s practically trivial (2 BPM difference in tempo). How should researchers balance these? When does statistical significance matter, and when should we prioritize effect size?


Chapter Summary

This chapter introduced statistical inference and hypothesis testing:

  • Null hypothesis significance testing (NHST): Assume nothing is happening (H₀), calculate how surprising the data are, reject H₀ if sufficiently surprising
  • P-value: Probability of observing data at least this extreme if H₀ were true
  • p < 0.05: Conventional threshold for “statistical significance,” but arbitrary
  • Chi-square test: Tests independence of two categorical variables
  • Effect sizes (Cramér’s V, Cohen’s d): Measure strength of relationships, complementing p-values
  • Confidence intervals: Range of plausible values for a parameter (e.g., mean, difference, correlation)
  • 95% CI interpretation: If we repeated the study 100 times, ~95 of those intervals would contain the true value
  • Statistical vs. practical significance: Statistical significance doesn’t imply importance; small effects can be significant with large samples
  • Common p-value misinterpretations: P-values don’t tell you the probability H₀ is true, or that H₁ is true, or that the effect is large
  • When not to test: Exploratory analysis, convenience samples, tiny samples, multiple comparisons without correction
  • Reporting standards: Include test statistic, df, exact p-value, effect size, descriptives, CIs

Key Terms

  • Alternative hypothesis (H₁): The claim that there is an effect or relationship
  • Chi-square test (χ²): Statistical test for independence of categorical variables
  • Confidence interval (CI): Range of plausible values for a population parameter
  • Cramér’s V: Effect size measure for categorical associations (0 = none, 1 = perfect)
  • Effect size: Magnitude of a relationship or difference (distinct from statistical significance)
  • Null hypothesis (H₀): The claim that there is no effect or relationship
  • Null hypothesis significance testing (NHST): Framework for testing whether data are surprising under H₀
  • P-value: Probability of observing data at least as extreme as yours if H₀ is true
  • Practical significance: Whether an effect is large enough to matter in real-world contexts
  • Statistical significance: Whether a result is unlikely to have occurred by chance alone (conventionally p < 0.05)
  • Type I error: False positive (rejecting true H₀)
  • Type II error: False negative (failing to reject false H₀)

Looking Ahead

Chapter 15 (Interpreting the Call) moves beyond mechanical hypothesis testing to the art of interpretation. You’ve run statistical tests—now you need to make sense of what they mean. This chapter covers effect sizes in context, statistical power, what to do when results are “marginally significant” (p ≈ 0.06), how to handle unexpected findings, and the difference between exploratory and confirmatory analysis. Most importantly, it teaches honest, nuanced interpretation: acknowledging limitations, considering alternative explanations, and avoiding overstating findings.