12 Data Visualization in R

12.1 Overview

Data visualization is a core component of social science research. It allows researchers to explore, communicate, and explain patterns in data using graphical displays. Well-designed graphics reveal trends, support arguments, and make research findings accessible to a wide range of audiences.

In this chapter, you will learn how to:

Visualize the distribution of variables using histograms and density plots
Compare group differences using boxplots and violin plots
Explore relationships between variables using scatterplots
Use statistical overlays and enhanced labeling with advanced ggplot2 extensions
Create clear, reproducible, and publication-ready figures

All visualizations use ggplot2, the foundational R package for the grammar of graphics, and are written using tidyverse principles. Each section includes code, annotated examples, interpretation, and APA-style presentation guidance.

12.2 Visualizing Distributions

Purpose

When working with numeric variables, the first step is to understand how the values are distributed. Are most values clustered around a center? Are there extreme outliers? Is the data skewed left or right?

Visualizations such as histograms and density plots are effective tools for examining:

Shape (symmetry, skewness)
Spread (range, variability)
Peaks (modes)
Outliers

These tools are especially useful when describing:

Weekly hours spent gaming (Hours)
Scale scores such as anxiety (GAD_T), satisfaction with life (SWL_T), or social phobia (SPIN_T)

Histograms

A histogram displays the frequency of observations in specified bins (ranges of values). It helps identify:

How often values occur
Whether the distribution is normal or skewed
Where potential cutoffs or outliers lie

Example: Histogram of Weekly Hours Played

# Load required libraries
library(tidyverse)

# Import data
gaming_data <- read.csv("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/data.csv", encoding = "ISO-8859-1")

# Create histogram of gaming hours
ggplot(gaming_data, aes(x = Hours)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Weekly Hours Spent Gaming",
    x = "Hours per Week",
    y = "Number of Participants"
  ) +
  theme_minimal()

Interpretation

This plot displays how many participants reported each range of weekly gaming hours.
A peak around 10–30 hours suggests a typical playtime.
A long right tail indicates that some participants report very high playtimes (outliers).
If necessary, you can limit the x-axis to reduce the impact of extreme values:

# Truncate extreme values
ggplot(gaming_data, aes(x = Hours)) +
  geom_histogram(binwidth = 5, fill = "tomato", color = "white") +
  coord_cartesian(xlim = c(0, 100)) +
  labs(title = "Distribution of Gaming Hours (Capped at 100)", x = "Hours", y = "Count") +
  theme_minimal()

You can also create a filtered data set to use moving forward that removes all users who report impossible weekly gaming hours (i.e., over 168 hours)

# Filter for reponses higher than 168 hours a week
gaming_data <- gaming_data %>%
  filter(Hours <= 168)

# Visual remaining values
ggplot(gaming_data, aes(x = Hours)) +
  geom_histogram(binwidth = 5, fill = "green", color = "white") +
  labs(title = "Distribution of Gaming Hours (Capped at 100)", x = "Hours", y = "Count") +
  theme_minimal()

Density Plots

Density plots are smoothed versions of histograms that show the probability distribution of a numeric variable. They are useful for comparing shapes across groups.

Example: Density of GAD_T Scores

ggplot(gaming_data, aes(x = GAD_T)) +
  geom_density(fill = "skyblue", alpha = 0.4) +
  labs(
    title = "Density Plot of GAD Scores",
    x = "GAD Total Score",
    y = "Density"
  ) +
  theme_minimal()

Interpretation

The curve illustrates where values are most common.
Peaks represent modes, while the width of the curve shows variability.
The shape of the distribution (e.g., skewed right) can suggest non-normality.

Comparing Distributions Across Groups

You can use grouped density plots or facet wrapping to compare distributions across categories such as gender or platform.

Example: Compare Life Satisfaction (SWL_T) by Gender

ggplot(gaming_data, aes(x = SWL_T, fill = Gender)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distribution of SWL Scores by Gender",
    x = "SWL Total Score",
    y = "Density"
  ) +
  theme_minimal()

Alternative: Separate Plots with `facet_wrap()`

ggplot(gaming_data, aes(x = SWL_T)) +
  geom_density(fill = "steelblue", alpha = 0.6) +
  facet_wrap(~ Gender) +
  labs(
    title = "SWL Score Distribution by Gender Identity",
    x = "Satisfaction With Life Score",
    y = "Density"
  ) +
  theme_minimal()

These visualizations help researchers quickly assess patterns and outliers in scale scores and behavioral data. Before conducting any hypothesis tests, visualizing your variables ensures your assumptions and interpretations are grounded in the structure of your data.

12.3 Visualizing Group Comparisons

Purpose

Many research questions in communication studies and psychology involve comparing means or distributions across different groups. For example:

Do male and female participants differ in their anxiety scores?
Does social phobia vary by gaming platform?

When your outcome variable is continuous (e.g., scale scores) and your grouping variable is categorical (e.g., gender, platform), you can visualize the comparison using boxplots or violin plots.

These plots help answer:

Are group means different?
Are there differences in spread or outliers?
Is one group more variable or skewed than another?

This section introduces two primary methods for visualizing these comparisons.

Boxplots

Boxplots summarize distributions using five statistics:

Minimum
1st quartile (25th percentile)
Median (50th percentile)
3rd quartile (75th percentile)
Maximum (excluding outliers)

Outliers are plotted as individual points. The box itself contains the middle 50% of the data.

Example: GAD_T Scores by Gender

ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Anxiety Scores by Gender",
    x = "Gender",
    y = "GAD Total Score"
  ) +
  theme_minimal()

12.3.1 Interpretation

Medians are represented by bold horizontal lines inside each box.
The interquartile range (IQR) shows the middle 50% of responses.
Circles above or below the whiskers represent outliers—participants with unusually high or low anxiety scores.
This plot allows you to visually assess whether anxiety differs by gender before performing a statistical test (e.g., t-test or ANOVA).

12.3.2 Violin Plots

Violin plots combine boxplots and density plots to display the distribution and shape of the data for each group.

They are particularly useful when:

The number of observations varies across groups
You want to emphasize distribution shape
You’re preparing a figure for publication or presentation

Example: SWL_T Scores by Gender

ggplot(gaming_data, aes(x = Gender, y = SWL_T)) +
  geom_violin(fill = "lightcoral", alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white") +  # Overlay boxplot
  labs(
    title = "Life Satisfaction by Gender",
    x = "Gender",
    y = "SWL Total Score"
  ) +
  theme_minimal()

Interpretation

The width of the violin at each score level indicates relative frequency.
Wider areas represent more common values, while narrower areas represent less frequent scores.
Adding a boxplot inside the violin (as shown above) combines precision (medians, quartiles) with shape-based insight.

Group Comparisons Across Platforms

You can use the same visual tools to compare more than two groups.

Example: SPIN_T Scores by Platform

ggplot(gaming_data, aes(x = Platform, y = SPIN_T)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Social Phobia Scores by Gaming Platform",
    x = "Platform",
    y = "SPIN Total Score"
  ) +
  theme_minimal()

If you expect a non-normal distribution or have small group sizes (e.g., few smartphone users), violin plots may be more appropriate.

Optional: Add Data Points with `geom_jitter()`

To better show individual data points, add geom_jitter() to your plot:

ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
  geom_boxplot(fill = "skyblue", outlier.shape = NA) +  # Hide default outliers
  geom_jitter(width = 0.2, alpha = 0.3) +               # Add raw points
  labs(title = "GAD Scores by Gender with Raw Data") +
  theme_minimal()

This is especially useful when:

Sample sizes are small
You want to emphasize data variability
You are preparing exploratory visuals for peer review

By visually comparing groups before performing inferential tests (such as t-tests or ANOVA), you develop a richer understanding of your data and ensure that your statistical analysis is appropriate for the underlying distribution and group structure.

12.4 Visualizing Associations

Purpose

Many research questions in social science explore the relationship between two numeric variables. For example:

Do people who spend more time gaming report lower life satisfaction?
Is there a relationship between hours played and social phobia?

When both variables are continuous, the appropriate visualization is a scatterplot. You can also use smoothing lines to identify trends.

This section teaches how to:

Create scatterplots using ggplot2
Add regression lines with geom_smooth()
Interpret linear trends and potential outliers

Scatterplots

A scatterplot displays the relationship between two numeric variables. Each point represents a single observation.

Example: Hours vs. SWL_T

We will examine whether gaming hours predict life satisfaction.

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  labs(
    title = "Gaming Hours and Life Satisfaction",
    x = "Weekly Hours Played",
    y = "Satisfaction With Life Score"
  ) +
  theme_minimal()

Interpretation

Each point represents one participant.
Clusters suggest where most responses fall.
Outliers appear as isolated points far from the others.
This plot alone does not indicate whether the relationship is statistically significant—it only visualizes it.

Adding a Regression Line

You can use geom_smooth(method = "lm") to overlay a linear regression line.

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
  labs(
    title = "Linear Fit: Gaming Hours and Life Satisfaction",
    x = "Weekly Hours Played",
    y = "Satisfaction With Life Score"
  ) +
  theme_minimal()

method = "lm" fits a straight line (linear model).
se = TRUE adds a shaded area showing the 95% confidence interval.

Interpretation

A downward slope suggests a negative correlation (more hours, less satisfaction).
If the slope is flat, the variables are likely unrelated.
A narrow confidence band means the estimate is more precise.

Associations with Anxiety and Social Phobia

You can explore other relationships in the same way.

Example: Hours vs. GAD_T

ggplot(gaming_data, aes(x = Hours, y = GAD_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Gaming Hours and Generalized Anxiety",
    x = "Weekly Hours Played",
    y = "GAD Score"
  ) +
  theme_minimal()

Example: Hours vs. SPIN_T

ggplot(gaming_data, aes(x = Hours, y = SPIN_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, color = "darkgreen") +
  labs(
    title = "Gaming Hours and Social Phobia",
    x = "Weekly Hours Played",
    y = "SPIN Score"
  ) +
  theme_minimal()

Optional: Highlight Outliers

If outliers are influencing the regression line, consider limiting the x-axis to a more typical range:

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, color = "darkblue") +
  coord_cartesian(xlim = c(0, 100)) +
  labs(
    title = "Life Satisfaction vs. Gaming Hours (0–100 hours)",
    x = "Weekly Hours Played",
    y = "SWL Score"
  ) +
  theme_minimal()

This makes the pattern among most participants more visible by trimming extreme values.

Scatterplots allow you to explore associations visually before running formal statistical tests. Combined with correlation and regression (covered in Chapters 4 and 6), they provide critical insight into your data.

12.5 Enhancing Visuals with Labels, Themes, and Extensions

Purpose

A graph is only as good as it is readable. In social science research—especially in publications, presentations, and reports—well-designed visuals are essential for communicating results clearly and professionally.

This section shows you how to:

Add labels and annotations
Customize themes and fonts
Use helpful ggplot2 extensions to highlight findings
Export your figures for use in slides, papers, or web-based reports

These enhancements are especially useful when visualizing comparisons or trends that require clear explanation.

Adding Labels and Annotations

Use labs() to define titles, subtitles, and axis labels. Use annotate() or geom_text() to add labels for emphasis.

Example: Annotated Scatterplot with Regression Line

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
  labs(
    title = "Relationship Between Weekly Gaming and Life Satisfaction",
    subtitle = "Linear regression with 95% confidence interval",
    x = "Weekly Hours Played",
    y = "Satisfaction With Life Score"
  ) +
  theme_minimal()

You can also annotate directly on the plot:

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
  annotate("text", x = 150, y = 10, label = "Negative correlation", color = "red") +
  theme_minimal()

Highlighting Results with `gghighlight`

The gghighlight package allows you to emphasize parts of your data based on a condition.

# install.packages("gghighlight")
library(gghighlight)

ggplot(gaming_data, aes(x = Hours, y = SWL_T, color = Gender)) +
  geom_point() +
  gghighlight(Gender == "Other") +
  labs(title = "Highlighting Gender = 'Other'") +
  theme_minimal()

This is useful when showcasing a specific subset of data.

Avoiding Overlapping Labels with `ggrepel`

The ggrepel package makes labels readable by pushing them apart.

# install.packages("ggrepel")
library(ggrepel)

top_hours <- gaming_data %>%
  filter(Hours > 84)

ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
  geom_point(alpha = 0.3) +
  geom_point(data = top_hours, color = "red") +
  geom_text_repel(data = top_hours, aes(label = Gender)) +
  labs(title = "Top Gamers Labeled by Gender") +
  theme_minimal()

This approach is ideal when highlighting outliers or individuals in small samples.

Enhancing Style with `ggthemes`

The ggthemes package provides pre-built styles that improve appearance with minimal effort.

# install.packages("ggthemes")
library(ggthemes)

ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
  geom_boxplot(fill = "lightblue") +
  theme_economist() +
  labs(title = "GAD Scores by Gender") +
  theme(legend.position = "none")

Other useful themes include:

theme_minimal() – clean and simple
theme_classic() – traditional
theme_bw() – high-contrast black and white
theme_light() – soft contrast

Adding Statistical Context with `ggstatsplot`

The ggstatsplot package integrates statistical test results directly into your visuals.

# install.packages("ggstatsplot")
library(ggstatsplot)

ggbetweenstats(data = gaming_data, x = Gender, y = GAD_T)

This creates a publication-ready boxplot with:

Group means and confidence intervals
p-values and effect sizes
Optional interpretation text

Exporting Plots

You can save any ggplot2 object using ggsave().

# Save last plot as a PNG
ggsave("figures/swl_vs_hours.png", width = 6, height = 4, dpi = 300)

You can specify image format (.png, .pdf, .svg) and dimensions.

Enhancing your figures ensures that your findings are not only correct, but clear and compelling. When you visualize data in research, presentation, or journalism, these enhancements distinguish your work.

11 Inferential Analysis

13 Reporting Results Using R

12 Data Visualization in R

12.1 Overview

12.2 Visualizing Distributions

Purpose

Histograms

Example: Histogram of Weekly Hours Played

Interpretation

Density Plots

Example: Density of GAD_T Scores

Interpretation

Comparing Distributions Across Groups

Example: Compare Life Satisfaction (SWL_T) by Gender

Alternative: Separate Plots with facet_wrap()

12.3 Visualizing Group Comparisons

Purpose

Boxplots

Example: GAD_T Scores by Gender

12.3.1 Interpretation

12.3.2 Violin Plots

Example: SWL_T Scores by Gender

Interpretation

Group Comparisons Across Platforms

Example: SPIN_T Scores by Platform

Optional: Add Data Points with geom_jitter()

12.4 Visualizing Associations

Purpose

Scatterplots

Example: Hours vs. SWL_T

Interpretation

Adding a Regression Line

Interpretation

Associations with Anxiety and Social Phobia

Example: Hours vs. GAD_T

Example: Hours vs. SPIN_T

Optional: Highlight Outliers

12.5 Enhancing Visuals with Labels, Themes, and Extensions

Purpose

Adding Labels and Annotations

Example: Annotated Scatterplot with Regression Line

Highlighting Results with gghighlight

Avoiding Overlapping Labels with ggrepel

Enhancing Style with ggthemes

Adding Statistical Context with ggstatsplot

Exporting Plots

Alternative: Separate Plots with `facet_wrap()`

Optional: Add Data Points with `geom_jitter()`

Highlighting Results with `gghighlight`

Avoiding Overlapping Labels with `ggrepel`

Enhancing Style with `ggthemes`

Adding Statistical Context with `ggstatsplot`