12 Data Visualization in R
12.1 Overview
Data visualization is a core component of social science research. It allows researchers to explore, communicate, and explain patterns in data using graphical displays. Well-designed graphics reveal trends, support arguments, and make research findings accessible to a wide range of audiences.
In this chapter, you will learn how to:
- Visualize the distribution of variables using histograms and density plots
- Compare group differences using boxplots and violin plots
- Explore relationships between variables using scatterplots
- Use statistical overlays and enhanced labeling with advanced
ggplot2
extensions - Create clear, reproducible, and publication-ready figures
All visualizations use ggplot2
, the foundational R package for the grammar of graphics, and are written using tidyverse principles. Each section includes code, annotated examples, interpretation, and APA-style presentation guidance.
12.2 Visualizing Distributions
Purpose
When working with numeric variables, the first step is to understand how the values are distributed. Are most values clustered around a center? Are there extreme outliers? Is the data skewed left or right?
Visualizations such as histograms and density plots are effective tools for examining:
- Shape (symmetry, skewness)
- Spread (range, variability)
- Peaks (modes)
- Outliers
These tools are especially useful when describing:
- Weekly hours spent gaming (
Hours
) - Scale scores such as anxiety (
GAD_T
), satisfaction with life (SWL_T
), or social phobia (SPIN_T
)
Histograms
A histogram displays the frequency of observations in specified bins (ranges of values). It helps identify:
- How often values occur
- Whether the distribution is normal or skewed
- Where potential cutoffs or outliers lie
Example: Histogram of Weekly Hours Played
# Import data
gaming_data <- read.csv("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/data.csv", encoding = "ISO-8859-1")
# Create histogram of gaming hours
ggplot(gaming_data, aes(x = Hours)) +
geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Weekly Hours Spent Gaming",
x = "Hours per Week",
y = "Number of Participants"
) +
theme_minimal()
Interpretation
- This plot displays how many participants reported each range of weekly gaming hours.
- A peak around 10–30 hours suggests a typical playtime.
- A long right tail indicates that some participants report very high playtimes (outliers).
- If necessary, you can limit the x-axis to reduce the impact of extreme values:
# Truncate extreme values
ggplot(gaming_data, aes(x = Hours)) +
geom_histogram(binwidth = 5, fill = "tomato", color = "white") +
coord_cartesian(xlim = c(0, 100)) +
labs(title = "Distribution of Gaming Hours (Capped at 100)", x = "Hours", y = "Count") +
theme_minimal()
You can also create a filtered data set to use moving forward that removes all users who report impossible weekly gaming hours (i.e., over 168 hours)
# Filter for reponses higher than 168 hours a week
gaming_data <- gaming_data %>%
filter(Hours <= 168)
# Visual remaining values
ggplot(gaming_data, aes(x = Hours)) +
geom_histogram(binwidth = 5, fill = "green", color = "white") +
labs(title = "Distribution of Gaming Hours (Capped at 100)", x = "Hours", y = "Count") +
theme_minimal()
Density Plots
Density plots are smoothed versions of histograms that show the probability distribution of a numeric variable. They are useful for comparing shapes across groups.
Example: Density of GAD_T Scores
ggplot(gaming_data, aes(x = GAD_T)) +
geom_density(fill = "skyblue", alpha = 0.4) +
labs(
title = "Density Plot of GAD Scores",
x = "GAD Total Score",
y = "Density"
) +
theme_minimal()
Interpretation
- The curve illustrates where values are most common.
- Peaks represent modes, while the width of the curve shows variability.
- The shape of the distribution (e.g., skewed right) can suggest non-normality.
Comparing Distributions Across Groups
You can use grouped density plots or facet wrapping to compare distributions across categories such as gender or platform.
Example: Compare Life Satisfaction (SWL_T) by Gender
ggplot(gaming_data, aes(x = SWL_T, fill = Gender)) +
geom_density(alpha = 0.5) +
labs(
title = "Distribution of SWL Scores by Gender",
x = "SWL Total Score",
y = "Density"
) +
theme_minimal()
Alternative: Separate Plots with facet_wrap()
ggplot(gaming_data, aes(x = SWL_T)) +
geom_density(fill = "steelblue", alpha = 0.6) +
facet_wrap(~ Gender) +
labs(
title = "SWL Score Distribution by Gender Identity",
x = "Satisfaction With Life Score",
y = "Density"
) +
theme_minimal()
These visualizations help researchers quickly assess patterns and outliers in scale scores and behavioral data. Before conducting any hypothesis tests, visualizing your variables ensures your assumptions and interpretations are grounded in the structure of your data.
12.3 Visualizing Group Comparisons
Purpose
Many research questions in communication studies and psychology involve comparing means or distributions across different groups. For example:
- Do male and female participants differ in their anxiety scores?
- Does social phobia vary by gaming platform?
When your outcome variable is continuous (e.g., scale scores) and your grouping variable is categorical (e.g., gender, platform), you can visualize the comparison using boxplots or violin plots.
These plots help answer:
- Are group means different?
- Are there differences in spread or outliers?
- Is one group more variable or skewed than another?
This section introduces two primary methods for visualizing these comparisons.
Boxplots
Boxplots summarize distributions using five statistics:
- Minimum
- 1st quartile (25th percentile)
- Median (50th percentile)
- 3rd quartile (75th percentile)
- Maximum (excluding outliers)
Outliers are plotted as individual points. The box itself contains the middle 50% of the data.
Example: GAD_T Scores by Gender
ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Anxiety Scores by Gender",
x = "Gender",
y = "GAD Total Score"
) +
theme_minimal()
12.3.1 Interpretation
- Medians are represented by bold horizontal lines inside each box.
- The interquartile range (IQR) shows the middle 50% of responses.
- Circles above or below the whiskers represent outliers—participants with unusually high or low anxiety scores.
- This plot allows you to visually assess whether anxiety differs by gender before performing a statistical test (e.g., t-test or ANOVA).
12.3.2 Violin Plots
Violin plots combine boxplots and density plots to display the distribution and shape of the data for each group.
They are particularly useful when:
- The number of observations varies across groups
- You want to emphasize distribution shape
- You’re preparing a figure for publication or presentation
Example: SWL_T Scores by Gender
ggplot(gaming_data, aes(x = Gender, y = SWL_T)) +
geom_violin(fill = "lightcoral", alpha = 0.7) +
geom_boxplot(width = 0.1, fill = "white") + # Overlay boxplot
labs(
title = "Life Satisfaction by Gender",
x = "Gender",
y = "SWL Total Score"
) +
theme_minimal()
Interpretation
- The width of the violin at each score level indicates relative frequency.
- Wider areas represent more common values, while narrower areas represent less frequent scores.
- Adding a boxplot inside the violin (as shown above) combines precision (medians, quartiles) with shape-based insight.
Group Comparisons Across Platforms
You can use the same visual tools to compare more than two groups.
Example: SPIN_T Scores by Platform
ggplot(gaming_data, aes(x = Platform, y = SPIN_T)) +
geom_boxplot(fill = "lightgreen") +
labs(
title = "Social Phobia Scores by Gaming Platform",
x = "Platform",
y = "SPIN Total Score"
) +
theme_minimal()
If you expect a non-normal distribution or have small group sizes (e.g., few smartphone users), violin plots may be more appropriate.
Optional: Add Data Points with geom_jitter()
To better show individual data points, add geom_jitter()
to your plot:
ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
geom_boxplot(fill = "skyblue", outlier.shape = NA) + # Hide default outliers
geom_jitter(width = 0.2, alpha = 0.3) + # Add raw points
labs(title = "GAD Scores by Gender with Raw Data") +
theme_minimal()
This is especially useful when:
- Sample sizes are small
- You want to emphasize data variability
- You are preparing exploratory visuals for peer review
By visually comparing groups before performing inferential tests (such as t-tests or ANOVA), you develop a richer understanding of your data and ensure that your statistical analysis is appropriate for the underlying distribution and group structure.
12.4 Visualizing Associations
Purpose
Many research questions in social science explore the relationship between two numeric variables. For example:
Do people who spend more time gaming report lower life satisfaction?
Is there a relationship between hours played and social phobia?
When both variables are continuous, the appropriate visualization is a scatterplot. You can also use smoothing lines to identify trends.
This section teaches how to:
- Create scatterplots using
ggplot2
- Add regression lines with
geom_smooth()
- Interpret linear trends and potential outliers
Scatterplots
A scatterplot displays the relationship between two numeric variables. Each point represents a single observation.
Example: Hours vs. SWL_T
We will examine whether gaming hours predict life satisfaction.
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
labs(
title = "Gaming Hours and Life Satisfaction",
x = "Weekly Hours Played",
y = "Satisfaction With Life Score"
) +
theme_minimal()
Interpretation
- Each point represents one participant.
- Clusters suggest where most responses fall.
- Outliers appear as isolated points far from the others.
- This plot alone does not indicate whether the relationship is statistically significant—it only visualizes it.
Adding a Regression Line
You can use geom_smooth(method = "lm")
to overlay a linear regression line.
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
labs(
title = "Linear Fit: Gaming Hours and Life Satisfaction",
x = "Weekly Hours Played",
y = "Satisfaction With Life Score"
) +
theme_minimal()
-
method = "lm"
fits a straight line (linear model). -
se = TRUE
adds a shaded area showing the 95% confidence interval.
Interpretation
- A downward slope suggests a negative correlation (more hours, less satisfaction).
- If the slope is flat, the variables are likely unrelated.
- A narrow confidence band means the estimate is more precise.
Optional: Highlight Outliers
If outliers are influencing the regression line, consider limiting the x-axis to a more typical range:
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE, color = "darkblue") +
coord_cartesian(xlim = c(0, 100)) +
labs(
title = "Life Satisfaction vs. Gaming Hours (0–100 hours)",
x = "Weekly Hours Played",
y = "SWL Score"
) +
theme_minimal()
This makes the pattern among most participants more visible by trimming extreme values.
Scatterplots allow you to explore associations visually before running formal statistical tests. Combined with correlation and regression (covered in Chapters 4 and 6), they provide critical insight into your data.
12.5 Enhancing Visuals with Labels, Themes, and Extensions
Purpose
A graph is only as good as it is readable. In social science research—especially in publications, presentations, and reports—well-designed visuals are essential for communicating results clearly and professionally.
This section shows you how to:
- Add labels and annotations
- Customize themes and fonts
- Use helpful
ggplot2
extensions to highlight findings - Export your figures for use in slides, papers, or web-based reports
These enhancements are especially useful when visualizing comparisons or trends that require clear explanation.
Adding Labels and Annotations
Use labs()
to define titles, subtitles, and axis labels. Use annotate()
or geom_text()
to add labels for emphasis.
Example: Annotated Scatterplot with Regression Line
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
labs(
title = "Relationship Between Weekly Gaming and Life Satisfaction",
subtitle = "Linear regression with 95% confidence interval",
x = "Weekly Hours Played",
y = "Satisfaction With Life Score"
) +
theme_minimal()
You can also annotate directly on the plot:
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
annotate("text", x = 150, y = 10, label = "Negative correlation", color = "red") +
theme_minimal()
Highlighting Results with gghighlight
The gghighlight
package allows you to emphasize parts of your data based on a condition.
# install.packages("gghighlight")
library(gghighlight)
ggplot(gaming_data, aes(x = Hours, y = SWL_T, color = Gender)) +
geom_point() +
gghighlight(Gender == "Other") +
labs(title = "Highlighting Gender = 'Other'") +
theme_minimal()
This is useful when showcasing a specific subset of data.
Avoiding Overlapping Labels with ggrepel
The ggrepel
package makes labels readable by pushing them apart.
# install.packages("ggrepel")
library(ggrepel)
top_hours <- gaming_data %>%
filter(Hours > 84)
ggplot(gaming_data, aes(x = Hours, y = SWL_T)) +
geom_point(alpha = 0.3) +
geom_point(data = top_hours, color = "red") +
geom_text_repel(data = top_hours, aes(label = Gender)) +
labs(title = "Top Gamers Labeled by Gender") +
theme_minimal()
This approach is ideal when highlighting outliers or individuals in small samples.
Enhancing Style with ggthemes
The ggthemes
package provides pre-built styles that improve appearance with minimal effort.
# install.packages("ggthemes")
library(ggthemes)
ggplot(gaming_data, aes(x = Gender, y = GAD_T)) +
geom_boxplot(fill = "lightblue") +
theme_economist() +
labs(title = "GAD Scores by Gender") +
theme(legend.position = "none")
Other useful themes include:
-
theme_minimal()
– clean and simple -
theme_classic()
– traditional -
theme_bw()
– high-contrast black and white -
theme_light()
– soft contrast
Adding Statistical Context with ggstatsplot
The ggstatsplot
package integrates statistical test results directly into your visuals.
# install.packages("ggstatsplot")
library(ggstatsplot)
ggbetweenstats(data = gaming_data, x = Gender, y = GAD_T)
This creates a publication-ready boxplot with:
- Group means and confidence intervals
- p-values and effect sizes
- Optional interpretation text
Exporting Plots
You can save any ggplot2
object using ggsave()
.
# Save last plot as a PNG
ggsave("figures/swl_vs_hours.png", width = 6, height = 4, dpi = 300)
You can specify image format (.png, .pdf, .svg) and dimensions.
Enhancing your figures ensures that your findings are not only correct, but clear and compelling. When you visualize data in research, presentation, or journalism, these enhancements distinguish your work.