11 Inferential Analysis

11.1 Overview

Descriptive statistics help summarize your dataset. However, they cannot answer questions about whether observed differences or relationships are meaningful beyond the sample. Inferential statistics are used to evaluate whether patterns observed in a sample are likely to generalize to the larger population.

In social science research, inferential statistics are used to:

Compare means between groups (e.g., do anxiety levels differ by gender?)
Evaluate associations between variables (e.g., is time spent gaming associated with social phobia?)
Predict outcomes using one or more variables (e.g., do platform use and gender predict satisfaction with life?)

This chapter introduces the foundational inferential methods for hypothesis testing in media and communication research. These include:

Chi-Square Test of Independence
Single Sample T-Test
Independent Samples T-Test
One-Way ANOVA
Two-Way ANOVA
ANCOVA (Analysis of Covariance)
Simple and Multiple Linear Regression
Logistic Regression

Each section includes:

Conceptual explanation
Justification for use
Complete R code (tidyverse-compatible)
Statistical output and APA-style reporting
Notes on assumptions and effect sizes

We use the gaming_anxiety.csv dataset for all examples. This dataset includes responses from over 13,000 gamers who completed demographic and psychological scales (GAD, SWL, SPIN), and reported on game habits, platforms, and motivations for play.

11.2 Chi-Square Test of Independence

Concept

The Chi-Square Test of Independence evaluates whether two categorical variables are statistically associated. It is used when both the independent and dependent variables are nominal (unordered categories).

Research Question (RQ)

Is there a relationship between gender identity and gaming platform?

Variables

Gender: Male, Female, Other
Platform: PC, Console, Mobile

Load the Dataset

gaming_data <- read.csv("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/data.csv", encoding = "ISO-8859-1")

Code: Chi-Square Test

library(tidyverse)

# Tabulate Gender × Platform
table_chi <- table(gaming_data$Gender, gaming_data$Platform)

# Run chi-square test
chisq.test(table_chi)

Output

    Pearson's Chi-squared test

data:  table_chi
X-squared = 70.448, df = 4, p-value = 1.826e-14

Interpretation

The p-value < .001 indicates a statistically significant association between gender and gaming platform.
This means platform preference differs by gender more than would be expected by chance.
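A significant chi-square says nothing about how strong the association is, and the test assumes reasonably large expected counts. The following is a minimal sketch of both checks. The counts in toy_counts are hypothetical, invented for illustration, and are not the survey data; the same steps should carry over to table_chi.

```r
# Hypothetical counts, invented for illustration (not from the survey)
toy_counts <- matrix(
  c(40, 30, 10,
    25, 35, 20,
    15, 15, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Gender   = c("Male", "Female", "Other"),
    Platform = c("PC", "Console", "Mobile")
  )
)

chi_res <- chisq.test(toy_counts)

# Assumption check: expected cell counts should generally be 5 or more
min_expected <- min(chi_res$expected)

# Effect size: Cramer's V = sqrt(chi-square / (N * (k - 1))),
# where k is the smaller of the number of rows and columns
n_total   <- sum(toy_counts)
k_smaller <- min(dim(toy_counts))
cramers_v <- sqrt(unname(chi_res$statistic) / (n_total * (k_smaller - 1)))

min_expected
cramers_v
```

Cramér's V ranges from 0 to 1. With a sample as large as this one, even a very small V can accompany a very small p-value, so it is worth reporting both.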

APA Style

A Chi-Square Test of Independence showed a significant association between gender and gaming platform use, χ²(4, N = 13,000+) = 70.45, p < .001.

11.3 Single Sample T-Test

Concept

The Single Sample T-Test compares the sample mean of one variable to a known or hypothetical population value.

Research Question (RQ)

Is the average anxiety score (GAD_T) in this sample significantly different from the general population mean of 5.0?

Code: Single Sample T-Test

t.test(gaming_data$GAD_T, mu = 5)

Output

    One Sample t-test

data:  gaming_data$GAD_T
t = 5.2185, df = 13463, p-value = 1.831e-07
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 5.132353 5.291593
sample estimates:
mean of x 
 5.211973

Interpretation

The average GAD score in this sample (5.21) is significantly higher than the population average of 5.0.
The p-value (p < .001) suggests that this difference is unlikely to be due to random chance.
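Statistical significance in a sample this large does not imply a large effect. As a rough sketch, Cohen's d for a one-sample test can be recovered from the t statistic and degrees of freedom printed in the output above, using d = t / sqrt(n). The variable names below are illustrative.

```r
# Values copied from the t.test() output above
t_value <- 5.2185
df      <- 13463
n       <- df + 1   # for a one-sample t-test, df = n - 1

# Cohen's d for a one-sample test: d = t / sqrt(n)
cohens_d <- t_value / sqrt(n)
round(cohens_d, 3)   # about 0.045

# Two-sided p-value recovered from t and df
p_value <- 2 * pt(-abs(t_value), df)
p_value
```

A d of roughly 0.045 is far below the conventional threshold of 0.2 for a small effect, so the sample mean is only trivially higher than 5.0 in practical terms.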

11.3.1 APA Style

A single-sample t-test showed that the mean GAD score (M = 5.21) was significantly higher than the population mean of 5.0, t(13,463) = 5.22, p < .001, 95% CI [5.13, 5.29].

11.4 Independent Samples T-Test

Concept

An independent samples t-test compares the means of two separate groups to determine whether they are statistically different from each other. It is used when:

You have one categorical independent variable with two groups (e.g., gender)
You have one continuous dependent variable (e.g., GAD_T)

Research Question (RQ):
Do anxiety scores differ by gender identity?

Example: GAD_T by Gender

We will use t.test() to test whether the average GAD score differs between gender groups.

# Load tidyverse
library(tidyverse)

# Filter to only Male and Other
gaming_gender_subset <- gaming_data %>%
  filter(Gender %in% c("Male", "Other"))

# Run the t-test
t.test(GAD_T ~ Gender, data = gaming_gender_subset)

Output

    Welch Two Sample t-test

data:  GAD_T by Gender
t = -4.5944, df = 51.177, p-value = 2.864e-05
alternative hypothesis: true difference in means between group Male and group Other is not equal to 0
95 percent confidence interval:
 -6.490253 -2.543269
sample estimates:
 mean in group Male mean in group Other 
           5.060162            9.576923

Interpretation

t-value: The test statistic (t = -4.59) reflects the size of the difference relative to the variability in the data.
p-value: The probability of observing a difference at least this large if the two population means were equal (p < .001), indicating a statistically significant difference between male and other-identifying participants.
95% CI: The true difference in population means (Male minus Other) is likely between -6.49 and -2.54.
Group Means: Male gamers report lower GAD scores (M = 5.06) than Other-identifying gamers (M = 9.58).
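By default, t.test() runs Welch's version, which does not assume equal variances; that is why the degrees of freedom above are fractional. The sketch below contrasts Welch's test with the pooled-variance version and computes Cohen's d. It uses simulated data: the group sizes, means, and standard deviations are assumptions chosen for illustration, not values from the survey.

```r
set.seed(2024)

# Simulated, unbalanced two-group data (illustrative values only)
toy <- data.frame(
  Gender = rep(c("Male", "Other"), times = c(200, 30)),
  GAD_T  = c(rnorm(200, mean = 5, sd = 4.5), rnorm(30, mean = 9, sd = 6))
)

# Welch's t-test (default) vs. the classic pooled-variance t-test
welch  <- t.test(GAD_T ~ Gender, data = toy)
pooled <- t.test(GAD_T ~ Gender, data = toy, var.equal = TRUE)

# Cohen's d using the pooled standard deviation
m <- tapply(toy$GAD_T, toy$Gender, mean)
s <- tapply(toy$GAD_T, toy$Gender, sd)
n <- tapply(toy$GAD_T, toy$Gender, length)

sd_pooled <- sqrt(((n["Male"] - 1) * s["Male"]^2 +
                   (n["Other"] - 1) * s["Other"]^2) / (sum(n) - 2))
cohens_d  <- unname((m["Male"] - m["Other"]) / sd_pooled)

welch$parameter    # fractional Welch degrees of freedom
pooled$parameter   # n1 + n2 - 2
cohens_d
```

When group sizes and variances are as unequal as they are for Male versus Other in this dataset, Welch's test is generally the safer choice.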

APA Reporting Style

A Welch’s t-test indicated a significant difference in anxiety scores between male and other-identifying gamers, t(51.18) = -4.59, p < .001, 95% CI [-6.49, -2.54]. Other-identifying participants (M = 9.58) reported significantly higher GAD scores than male participants (M = 5.06).

11.5 One-Way ANOVA

Concept

Analysis of Variance (ANOVA) compares the means of three or more groups. It tests whether any of the group means are different from the others.

Research Question (RQ):
Does gaming platform (PC, Console, Mobile) affect social anxiety scores (SPIN_T)?

Example: SPIN_T by Platform

# Run ANOVA model
anova_model <- aov(SPIN_T ~ Platform, data = gaming_data)
summary(anova_model)

Output

               Df  Sum Sq Mean Sq F value Pr(>F)  
Platform        2    1261   630.3   3.477 0.0309 *
Residuals   12811 2322676   181.3                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
650 observations deleted due to missingness

F value: The ratio of between-group to within-group variance
p-value: Indicates whether at least one group differs from the others (p = .031)
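The F value and an effect size can both be reproduced by hand from the sums of squares in the ANOVA table above. The sketch below recomputes F and then eta squared (the share of total variance attributable to platform); the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ss_platform <- 1261
ss_residual <- 2322676
df_platform <- 2
df_residual <- 12811

# F = mean square between / mean square within
f_value <- (ss_platform / df_platform) / (ss_residual / df_residual)

# Eta squared = SS_between / SS_total
eta_sq <- ss_platform / (ss_platform + ss_residual)

round(f_value, 2)   # about 3.48, matching the table up to rounding
eta_sq              # about 0.0005
```

An eta squared of about 0.0005 means platform accounts for roughly 0.05% of the variance in SPIN_T, a negligible effect despite the significant p-value.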

Post-Hoc Test: Tukey’s HSD

If the ANOVA is significant, you need to run post-hoc tests to identify which groups differ.

# Tukey post-hoc test
TukeyHSD(anova_model)

Output

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = SPIN_T ~ Platform, data = gaming_data)

$Platform
                                                 diff       lwr       upr     p adj
PC-Console (PS, Xbox, ...)                  -1.643023 -3.833938  0.547893 0.1840353
Smartphone / Tablet-Console (PS, Xbox, ...)  4.164071 -3.057782 11.385925 0.3667104
Smartphone / Tablet-PC

APA Reporting Style

A one-way ANOVA revealed a statistically significant effect of platform on social phobia scores, F(2, 12811) = 3.48, p = .031. Tukey post-hoc comparisons indicated that while smartphone/tablet users scored higher than console and PC users, no pairwise differences reached statistical significance.

11.6 Two-Way ANOVA

Concept

A Two-Way ANOVA tests the effects of two categorical independent variables on a continuous outcome, including whether there is an interaction between them.

Research Question (RQ)

Do anxiety scores differ by gender and platform, and is there an interaction between the two?

Code: Two-Way ANOVA

anova_2way <- aov(GAD_T ~ Gender * Platform, data = gaming_data)
summary(anova_2way)

Example Output

                   Df Sum Sq Mean Sq F value Pr(>F)    
Gender              2   5341  2670.4 122.414 <2e-16 ***
Platform            2     85    42.7   1.958  0.141    
Gender:Platform     4    138    34.4   1.579  0.177    
Residuals       13455 293515    21.8                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation

A significant main effect of gender: GAD_T varies by gender.
No significant effect of platform or the gender × platform interaction.
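For designs with more than one factor, partial eta squared is a common effect size: each term's sum of squares divided by that term's sum of squares plus the residual sum of squares. A minimal sketch using the values printed in the table above:

```r
# Sums of squares copied from the two-way ANOVA table above
ss_terms <- c(Gender = 5341, Platform = 85, `Gender:Platform` = 138)
ss_residual <- 293515

# Partial eta squared = SS_effect / (SS_effect + SS_residual)
partial_eta_sq <- ss_terms / (ss_terms + ss_residual)
round(partial_eta_sq, 4)
```

Even the highly significant gender effect explains under 2% of the variance once the other terms are set aside, which is worth stating alongside the F test.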

11.6.1 APA Style

A two-way ANOVA revealed a significant main effect of gender on GAD scores, F(2, 13455) = 122.41, p < .001. No significant main effect was found for platform (p = .141), and the interaction between gender and platform was also not significant (p = .177).

11.7 ANCOVA: Analysis of Covariance

Concept

ANCOVA examines the effect of a categorical independent variable on a continuous outcome while statistically controlling for another continuous variable (covariate).

Research Question (RQ)

Does platform predict social phobia scores after controlling for hours spent gaming?

Code: ANCOVA

ancova_model <- aov(SPIN_T ~ Platform + Hours, data = gaming_data)
summary(ancova_model)

Output

               Df  Sum Sq Mean Sq F value   Pr(>F)    
Platform        2    1188     594   3.288   0.0373 *  
Hours           1    5561    5561  30.772 2.96e-08 ***
Residuals   12782 2309737     181                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
678 observations deleted due to missingness

Interpretation

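Platform is a significant predictor of SPIN_T in a model that also includes hours. Note that aov() computes sequential (Type I) sums of squares, so with the formula SPIN_T ~ Platform + Hours the Platform row is tested before Hours enters the model; listing the covariate first (SPIN_T ~ Hours + Platform) gives the test of Platform adjusted for Hours.
Hours is also a significant covariate, meaning it explains part of the variance in SPIN_T beyond what platform explains.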
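Because aov() tests terms in the order they appear in the formula, term order matters whenever the factor and the covariate are correlated. The sketch below illustrates this with simulated data (all values and the lowercase variable names are assumptions for illustration) and shows drop1() as one base R way to get a test of each term adjusted for the other.

```r
set.seed(42)
n <- 300

# Simulated data in which hours played differs by platform (illustrative only)
platform <- factor(sample(c("PC", "Console", "Mobile"), n, replace = TRUE))
hours    <- 20 + 5 * (platform == "PC") + rnorm(n, sd = 8)
spin     <- 18 + 0.15 * hours + 2 * (platform == "Mobile") + rnorm(n, sd = 10)
toy      <- data.frame(platform, hours, spin)

# Same model, two term orders
fit_platform_first <- aov(spin ~ platform + hours, data = toy)
fit_hours_first    <- aov(spin ~ hours + platform, data = toy)

# Sequential (Type I) tables: each term is adjusted only for terms listed before it
seq_platform_first <- anova(fit_platform_first)
seq_hours_first    <- anova(fit_hours_first)

# Marginal tests: each term adjusted for all other terms in the model
marginal <- drop1(fit_platform_first, test = "F")

seq_platform_first
seq_hours_first
marginal
```

In the drop1() table, the platform row matches the platform row from the model with hours listed first. That is the test that corresponds to "platform after controlling for hours."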

APA Style

An ANCOVA with platform as the factor and weekly hours played as the covariate showed a significant effect of platform on SPIN scores, F(2, 12782) = 3.29, p = .037. Weekly gaming hours also significantly predicted SPIN scores after accounting for platform, F(1, 12782) = 30.77, p < .001.

11.8 Simple Linear Regression

Concept

Simple linear regression models the relationship between a single predictor (independent variable) and a continuous outcome (dependent variable). It estimates how much the outcome changes on average when the predictor increases by one unit.

This technique is foundational in statistical modeling. It is used when:

You have two numeric variables
You want to understand how one variable predicts another
You want to assess direction, magnitude, and significance of an association

Linear regression assumes a linear relationship, normally distributed residuals, and homoscedasticity (equal variance of residuals across levels of the predictor).
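These assumptions can be checked from a fitted model's residuals. The following is a rough sketch on simulated data (the variables and values are invented for illustration): a Shapiro-Wilk test for normality of residuals, and a simple Glejser-style check that regresses absolute residuals on fitted values to look for changing spread. Diagnostic plots are usually more informative than either test, especially in large samples.

```r
set.seed(123)
n <- 500

# Simulated predictor and outcome (illustrative values only)
hours <- runif(n, min = 0, max = 60)
swl   <- 20 - 0.02 * hours + rnorm(n, sd = 7)

fit <- lm(swl ~ hours)
res <- residuals(fit)

# Normality of residuals (shapiro.test accepts 3 to 5000 observations)
shapiro_p <- shapiro.test(res)$p.value

# Rough homoscedasticity check: do absolute residuals trend with fitted values?
spread_check <- lm(abs(res) ~ fitted(fit))
spread_p     <- summary(spread_check)$coefficients[2, "Pr(>|t|)"]

shapiro_p
spread_p
```

Small p-values here would suggest a departure from the assumption in question. With more than 13,000 cases, as in this chapter's dataset, such tests flag even trivial departures, so they are best read alongside residual plots.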

Research Question (RQ):

Does the number of hours spent gaming predict satisfaction with life?

Code: Simple Linear Regression

# Simple linear regression model
model_simple <- lm(SWL_T ~ Hours, data = gaming_data)

# View model summary
summary(model_simple)

Output

Call:
lm(formula = SWL_T ~ Hours, data = gaming_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.8780  -5.7898   0.1771   6.1404  21.5312 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 19.8817056  0.0653373 304.293  < 2e-16 ***
Hours       -0.0036766  0.0008863  -4.148 3.37e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.22 on 13432 degrees of freedom
  (30 observations deleted due to missingness)
Multiple R-squared:  0.001279,  Adjusted R-squared:  0.001205 
F-statistic: 17.21 on 1 and 13432 DF,  p-value: 3.371e-05

Intercept = 19.88: The predicted SWL_T score for someone who plays 0 hours/week
Slope (Hours) = -0.0037: For every additional hour of gameplay, life satisfaction decreases by 0.0037 points, on average
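To see what a slope this small means in practice, the fitted equation can be applied by hand. The sketch below plugs a few hypothetical values of weekly hours into the intercept and slope from the output above.

```r
# Coefficients copied from the regression output above
intercept <- 19.8817056
slope     <- -0.0036766

# Hypothetical weekly hours to compare
hours <- c(0, 10, 40)

# Predicted SWL_T = intercept + slope * Hours
predicted_swl <- intercept + slope * hours
predicted_swl

# Predicted difference between 40 and 0 hours per week
diff_40_vs_0 <- predicted_swl[3] - predicted_swl[1]
diff_40_vs_0   # about -0.15 points
```

A full 40 hours of weekly play corresponds to a predicted drop of only about 0.15 points in SWL_T, which is consistent with the very small R² reported for this model.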

APA Reporting Style

A simple linear regression found that hours spent gaming significantly predicted satisfaction with life, b = -0.0037, t(13,432) = -4.15, p < .001. The model was statistically significant, F(1, 13432) = 17.21, p < .001, but the effect size was small (R² = .0013).

11.9 Multiple Linear Regression

Concept

Linear regression estimates the relationship between one outcome and one or more predictors. It helps you predict an outcome variable (e.g., SWL_T) from explanatory variables (e.g., Hours, Gender).

Research Question (RQ):

Do hours and gender predict life satisfaction?

11.9.1 Code: Multiple Linear Regression

# Regression model
model <- lm(SWL_T ~ Hours + Gender, data = gaming_data)
summary(model)
# View tidy output
library(broom)
tidy(model)

Output (Simplified)

term          estimate      std.error     statistic   p.value
(Intercept)   19.037025769  0.2706934594  70.326878   0.0000000000
Hours         -0.003168531  0.0008958389  -3.536943   0.0004061515
GenderMale     0.896374344  0.2776521546   3.228408   0.0012478040
GenderOther   -3.166372560  1.0572902969  -2.994800   0.0027512578

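Intercept: Predicted SWL_T score when Hours = 0 and Gender = “Female” (the reference category)
Hours: For each additional hour of gaming, life satisfaction decreases slightly, holding gender constant
GenderMale: Male participants report significantly higher SWL_T scores than females
GenderOther: Participants identifying as “Other” report significantly lower SWL_T scores than females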
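R picks the reference category for a categorical predictor automatically: the first factor level, which is alphabetical by default. That is why the output has GenderMale and GenderOther rows but no row for Female. The sketch below uses simulated data (all values and the Gender_ref_male column are invented for illustration) to show how relevel() changes the reference group without changing the model's fit.

```r
set.seed(1)

# Simulated data with the same variable names as the chapter (illustrative values)
toy <- data.frame(
  Gender = factor(rep(c("Female", "Male", "Other"), times = c(40, 40, 20))),
  Hours  = round(runif(100, min = 0, max = 60))
)
toy$SWL_T <- 20 - 0.01 * toy$Hours +
  ifelse(toy$Gender == "Male", 1, 0) + rnorm(100, sd = 7)

# Default: the first level is the reference category
levels(toy$Gender)
fit_default <- lm(SWL_T ~ Hours + Gender, data = toy)
names(coef(fit_default))

# Make "Male" the reference category instead
toy$Gender_ref_male <- relevel(toy$Gender, ref = "Male")
fit_male_ref <- lm(SWL_T ~ Hours + Gender_ref_male, data = toy)
names(coef(fit_male_ref))
```

Both models produce identical predictions; only the comparison each gender coefficient represents changes. Choosing the reference group is therefore a reporting decision, and the interpretation should always name it.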

APA Reporting Style

A multiple linear regression was conducted to examine whether weekly hours played and gender identity predicted satisfaction with life. The overall model was statistically significant, F(3, 13,430) = 14.36, p < .001, but explained only a small proportion of variance in SWL_T (R² = .003, adjusted R² = .003).

Weekly hours played significantly predicted satisfaction with life, b = –0.0032, t = –3.54, p < .001, suggesting a very small negative relationship. Gender was also a significant predictor: male participants reported significantly higher SWL_T scores than female participants, b = 0.90, t = 3.23, p = .001, while participants identifying as “Other” reported significantly lower SWL_T scores than females, b = –3.17, t = –2.99, p = .003.

11.10 Logistic Regression

Concept

Logistic Regression predicts a binary (two-category) outcome from one or more predictors.

Research Question (RQ)

Are hours played and anxiety levels associated with whether someone plays on a PC (vs. other platforms)?

Create Binary Outcome

gaming_data <- gaming_data %>%
  mutate(is_pc = if_else(Platform == "PC", 1, 0))

Run Logistic Regression

log_model <- glm(is_pc ~ Hours + GAD_T, data = gaming_data, family = binomial)
summary(log_model)

Output

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = is_pc ~ Hours + GAD_T, family = binomial, data = gaming_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.015716   0.136520  29.415   <2e-16 ***
Hours        0.003833   0.004854   0.790    0.430    
GAD_T       -0.019881   0.013150  -1.512    0.131    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2439.6  on 13433  degrees of freedom
Residual deviance: 2436.9  on 13431  degrees of freedom
  (30 observations deleted due to missingness)
AIC: 2442.9

Number of Fisher Scoring iterations: 7

Interpretation

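Hours: The positive coefficient points toward higher odds of being a PC gamer with more hours played, but the effect is not statistically significant (p = .430)
GAD_T: The negative coefficient points toward slightly lower odds of being a PC gamer with higher anxiety, but the effect is not statistically significant (p = .131)
Coefficients are on the log-odds scale; exponentiating them gives odds ratios, where OR < 1 indicates a negative relationship and OR > 1 a positive one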
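Logistic regression coefficients are log-odds, so they are usually converted before interpretation. The sketch below takes the estimates and deviances printed in the output above and computes odds ratios, the baseline probability implied by the intercept, and a likelihood-ratio test of the model against the null model. Variable names are illustrative.

```r
# Estimates copied from the glm() output above
coefs <- c("(Intercept)" = 4.015716, Hours = 0.003833, GAD_T = -0.019881)

# Odds ratios: exponentiate the log-odds coefficients
odds_ratios <- exp(coefs[c("Hours", "GAD_T")])
round(odds_ratios, 3)

# Predicted probability of playing on PC when Hours = 0 and GAD_T = 0
baseline_prob <- plogis(coefs[["(Intercept)"]])
round(baseline_prob, 3)   # about 0.982

# Likelihood-ratio test: null deviance minus residual deviance, df = 2 predictors
lr_chisq <- 2439.6 - 2436.9
lr_p     <- pchisq(lr_chisq, df = 2, lower.tail = FALSE)
lr_p      # about .26; the printed deviances are rounded
```

Both odds ratios sit very close to 1, and the baseline probability of about 98% shows that nearly everyone in the sample plays on PC. An outcome this lopsided leaves little variation for any predictor to explain.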

APA Style

A logistic regression was conducted to predict the likelihood of being a PC gamer based on hours played and anxiety levels. Neither predictor was statistically significant: Hours played, b = 0.0038, p = .43; GAD_T, b = -0.0199, p = .13. The model was not a significant improvement over the null model, χ²(2) = 2.72, p = .257.

11.11 Summary

In this chapter, you learned how to:

Conduct and interpret a full suite of inferential tests
Choose tests based on variable type and research question
Report statistical results using APA 7 format
Control for covariates and model binary outcomes

This prepares you for real-world research questions that involve group comparisons, associations, predictions, and interactions—all within the framework of social science methodology.
