9 Data Management
9.1 Defining Data
What is Data?
In research, data refers to information collected to answer questions, test hypotheses, or explore patterns. Data can take many forms, including numbers, text, and categories, and understanding these forms is essential for effective analysis. In R, which you will use through RStudio, tabular data is stored in data frames, where rows represent individual observations and columns represent variables.
What is Data in Mass Communication Research?
In mass communication research, data often comes from audience surveys, digital platforms, gaming environments, or content analyses. The gaming-anxiety.csv dataset is an example of survey data collected from gamers, which includes psychological scale responses (e.g., anxiety, satisfaction with life), game behaviors (e.g., hours played, streaming frequency), and demographics (e.g., age, gender, location). These data can be used to study relationships between psychological well-being and gaming behavior.
Qualitative vs. Quantitative Data
In the gaming-anxiety.csv dataset, variables can be classified as either qualitative or quantitative.
Qualitative Data: Qualitative data are non-numerical and typically describe categories or characteristics. In this dataset, Game, Platform, Playstyle, and Gender are qualitative variables. These variables describe how respondents play games or how they identify, without involving numeric values.
Quantitative Data: Quantitative data are numerical and allow for statistical analysis. In this dataset, variables such as Age, Hours (spent gaming per week), GAD1 to GAD7 (Generalized Anxiety Disorder items), and SWL1 to SWL5 (Satisfaction With Life items) are quantitative. These values allow researchers to calculate scores and identify patterns in gamer well-being.
9.2 Variables and Observations
In RStudio, datasets are displayed in tabular format where columns represent variables and rows represent observations.
Variables: Variables are measurable characteristics or data fields. In the gaming-anxiety.csv dataset, variables include Age, Game, GAD1, SWL2, and SPIN_T. Each variable corresponds to a different question or category from the survey. For example, GAD1 records how often a respondent felt nervous, while Platform indicates their gaming device.
Observations: Observations are the individual entries or survey responses. Each row in gaming-anxiety.csv represents one gamer’s full set of responses, including their psychological scores, gaming behavior, and demographic information. These are the units of analysis.
Explanation of Data Types
The dataset includes a variety of data types, each requiring different analytical techniques:
Nominal Data: These are unordered categories. For example, Game, Gender, and Platform are nominal because they describe types without implying rank.
Ordinal Data: These categories have a logical order. While not directly labeled as such, variables like highestleague or Likert-scale items (e.g., from the whyplay or GADE questions) might represent ordinal responses, depending on how the data were collected.
Discrete Data: Discrete numeric data are countable and often whole numbers. Variables such as streams (number of times streamed) and SPIN_T (Social Phobia Inventory total score) are discrete because they take only whole-number values: one is a count and the other is a sum of item scores.
Continuous Data: These are numeric values that can take any value in a range. Hours (hours spent gaming per week) and Age are continuous, as they represent measurements that can vary on a spectrum.
Dichotomous or Binary Data: These variables have only two values (e.g., Yes/No, Accept/Reject). In this dataset, accept is an example of a dichotomous variable indicating whether the participant accepted the terms of the study.
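The data types above map onto R classes roughly as follows. This is a minimal sketch on an invented data frame: the `responses` object, the `League` column (standing in for a variable like highestleague), and every value are assumptions for illustration, not the real survey data.

```r
# Toy responses; all values are invented for illustration
responses <- data.frame(
  Platform = factor(c("PC", "Console", "PC")),          # nominal: unordered factor
  League   = factor(c("Gold", "Bronze", "Silver"),
                    levels = c("Bronze", "Silver", "Gold"),
                    ordered = TRUE),                    # ordinal: ordered factor
  streams  = c(0L, 3L, 12L),                            # discrete: integer
  Hours    = c(10.5, 2, 31.25),                         # continuous: double
  accept   = c(TRUE, TRUE, FALSE)                       # dichotomous: logical
)

# Show the class R assigned to each column
str(responses)

# Ordered factors support comparisons; unordered factors do not
below_gold <- responses$League < "Gold"
```

Storing a variable with the class that matches its measurement level helps R choose sensible defaults later, for example when tabulating or plotting.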
9.3 Inputting Data
In RStudio, entering or importing data is an essential first step in any research project. Most datasets, like gaming-anxiety.csv, come in CSV (comma-separated values) format, which can be easily read into R using functions like read_csv() from the readr package.
Data Structures in R
Data structures are fundamental in R programming as they organize and store the data that one works with for analyses, visualizations, and other computational tasks. Understanding these structures is critical for effective manipulation of data and implementing various algorithms (Wickham & Grolemund, 2017). Below are the primary data structures that R provides.
Vectors
Vectors are one-dimensional arrays used to hold elements of a single data type. This could be numeric, character, or logical data types. Vectors are often used for operations that require the application of a function to each element in the data set (Maindonald & Braun, 2010).
Vectors can be created using the c() function, which combines elements into a vector.
# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
# Creating a character vector
character_vector <- c("apple", "banana", "cherry")
# Creating a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
You can perform various operations on vectors like addition, subtraction, or applying a function to each element.
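As a minimal sketch of such operations, the example below uses an invented `hours` vector (the name and values are assumptions). Arithmetic and comparisons in R are vectorized, so they apply to every element without an explicit loop.

```r
# Hypothetical weekly gaming hours for five respondents
hours <- c(2, 5.5, 10, 0, 7)

# Arithmetic is applied element by element
hours_per_day <- hours / 7

# A comparison returns a logical vector of the same length
heavy <- hours > 5

# Summary functions reduce the vector to a single value
mean_hours <- mean(hours)
n_heavy <- sum(heavy)   # TRUE counts as 1, FALSE as 0

# sapply() applies any function to each element in turn
rounded <- sapply(hours, round)
```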
Data Frames
Data frames serve as the fundamental data structure for data analysis in R. They are similar to matrices but allow different types of variables in different columns, which makes them extremely versatile (Chambers, 2008).
Data frames can be created using the data.frame() function.
# Creating a data frame
df <- data.frame(Name = c("Alice", "Bob"), Age = c(23, 45), Gender = c("F", "M"))
Various operations like subsetting, merging, and sorting can be performed on data frames.
# Subsetting data frame by column
subset_df <- df[, c("Name", "Age")]
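Sorting and merging can be sketched in base R along the same lines. The `platforms` data frame and its values are hypothetical and exist only to give merge() something to join.

```r
# Same small data frame as in the example above
df <- data.frame(Name = c("Alice", "Bob"), Age = c(23, 45), Gender = c("F", "M"))

# Sorting: order() returns row positions, here from oldest to youngest
df_sorted <- df[order(df$Age, decreasing = TRUE), ]

# Merging: a second (hypothetical) table that shares the Name column
platforms <- data.frame(Name = c("Bob", "Alice"), Platform = c("PC", "Console"))
df_merged <- merge(df, platforms, by = "Name")
```

merge() matches rows on the shared key, so the row order of the two inputs does not need to agree.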
Lists
Lists are an ordered collection of objects, which can be of different types and structures, including vectors, matrices, and even other lists (Wickham & Grolemund, 2017).
Lists can be created using the list() function.
Lists can be modified by adding, deleting, or updating list elements.
# Creating a list to modify (illustrative values)
my_list <- list(Name = "Alice", Age = 23)
# Updating a list element
my_list$Name <- "Bob"
# Adding a new list element
my_list$Email <- "bob@email.com"
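Deleting and extracting elements work in a similar way. A minimal sketch, with an invented list whose names and values are assumptions:

```r
# A list can mix types: a string, a number, and a numeric vector
my_list <- list(Name = "Bob", Age = 30, Scores = c(3, 1, 2))

# Extract one element by name; [[ ]] returns the element itself
scores <- my_list[["Scores"]]

# Delete an element by assigning NULL to it
my_list$Age <- NULL

# Check what remains
remaining <- names(my_list)
```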
By understanding these primary data structures, students in Mass Communications can gain a strong foundation for more complex data analyses relevant to their field, whether it involves analyzing large sets of textual data, audience metrics, or other forms of media data.
Importing Data from a File
When working with larger datasets, such as CSV files, importing data into R is more efficient. A CSV (Comma Separated Values) file stores tabular data as plain text, making it easy to exchange data between programs. Below are several ways to import the gaming-anxiety.csv dataset into R.
Use read.csv from Base R
The read.csv() function is part of base R and can be used to import CSV files directly into your environment:
# Reading the gaming-anxiety dataset using read.csv from base R
csv_base <- read.csv("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/gaming-anxiety.csv", header = TRUE, stringsAsFactors = FALSE)
This code imports the dataset from the URL provided. The header = TRUE argument indicates that the first row contains variable names, and stringsAsFactors = FALSE prevents character strings from being converted to factors (this has been the default since R 4.0.0).
Use write.csv() to write a data frame to a CSV file.
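A minimal round-trip sketch of writing and re-reading a file. The `scores` data frame is invented, and the file goes to a temporary location so nothing in your project folder is touched.

```r
# Hypothetical data frame to save
scores <- data.frame(id = 1:3, Hours = c(4, 12.5, 20))

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(scores, out_file, row.names = FALSE)

# Read the file back to confirm it was written as expected
scores_back <- read.csv(out_file)
```

In a real project you would replace `out_file` with a path such as "data/scores.csv" (a hypothetical path).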
Use read_csv from the readr Package
The readr package provides an alternative function, read_csv(), which offers better performance and flexibility:
# Install the readr package if it's not already installed
# install.packages("readr")
# Load the readr package
library(readr)
# Reading the gaming-anxiety dataset using read_csv from readr
csv_readr <- read_csv("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/gaming-anxiety.csv")
The read_csv() function is typically faster than read.csv() on large files. It returns a tibble and prints the column types it guessed, which makes it easier to confirm that each variable was read as intended.
Use write_csv() to write a data frame to a CSV file.
Use fread from the data.table Package
For very large datasets, fread() from the data.table package is a faster alternative:
# Install the data.table package if it's not already installed
# install.packages("data.table")
# Load the data.table package
library(data.table)
# Reading the gaming-anxiety dataset using fread from data.table
csv_datatable <- fread("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/gaming-anxiety.csv")
The fread() function provides high-speed reading for large CSV files, making it ideal for processing extensive datasets.
Use fwrite() to write a data frame to a CSV file.
Use vroom from the vroom Package
The fastest method for reading rectangular data that I know of is vroom() from the vroom package:
# Install the vroom package if it's not already installed
# install.packages("vroom")
# Load the vroom package
library(vroom)
# Reading the gaming-anxiety dataset using vroom from the vroom package
csv_vroom <- vroom("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/gaming-anxiety.csv")
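The vroom() function is among the fastest readers for delimited files such as .csv, partly because it indexes the file first and loads values lazily, only when they are used.
Use vroom_write() with delim = "," to write a data frame to a CSV file (its default delimiter is a tab).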
9.4 Manipulating Data
Data manipulation is a crucial aspect of preparing datasets for analysis. In RStudio, the dplyr package, part of the tidyverse ecosystem, provides powerful, intuitive functions for transforming, summarizing, and reshaping data. This section introduces dplyr and demonstrates how to manipulate data using examples from the gaming-anxiety.csv dataset, which contains gamers’ psychological scale responses, gaming behaviors, and demographics.
The dplyr Package
Introducing Tidyverse
Tidyverse is a collection of R packages designed for data science, which share an underlying design philosophy and programming style. The dplyr package is part of the tidyverse and is widely used for data manipulation tasks such as filtering rows, selecting columns, grouping data, and summarizing statistics.
To get started, load the tidyverse (or specifically dplyr) into your R environment:
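A minimal sketch of that step, assuming dplyr is already installed (run install.packages("dplyr") once if it is not):

```r
# Load dplyr on its own; library(tidyverse) would attach it
# together with the other core tidyverse packages
library(dplyr)

# Confirm that the package is attached to the current session
dplyr_loaded <- "package:dplyr" %in% search()
```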
Load the gaming_anxiety dataset from an online file
# Load the data.table package
library("data.table")
# Load the gaming_anxiety dataset
gaming_anxiety <- fread("https://github.com/SIM-Lab-SIUE/SIM-Lab-SIUE.github.io/raw/refs/heads/main/research-methods/data/gaming-anxiety.csv")
The Pipe Operator %>%
The pipe operator %>% passes the result of one function into the next as its first argument. This allows you to build operations in readable, sequential steps instead of nesting function calls inside one another.
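To illustrate the contrast between nested and piped calls, here is a minimal sketch on an invented data frame (`toy` and its values are assumptions, not the survey data). Both forms produce the same result; the piped form reads top to bottom in the order the steps run.

```r
library(dplyr)

# Hypothetical data: game label and weekly hours
toy <- data.frame(Game = c("A", "B", "A", "C"), Hours = c(10, 25, 40, 5))

# Nested form: read from the inside out
nested <- arrange(filter(toy, Hours > 8), desc(Hours))

# Piped form: each step receives the previous result
piped <- toy %>%
  filter(Hours > 8) %>%
  arrange(desc(Hours))

# The two approaches give identical output
same_result <- identical(nested, piped)
```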
Important dplyr Commands
01. summarize()
Calculates summary statistics across the entire dataset or within groups.
02. count()
Counts the frequency of unique values in a column.
03. group_by()
Groups data by one or more variables, typically followed by summarize() or mutate().
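A minimal sketch of commands 01 to 03, using an invented stand-in for a few rows of the survey (the `toy` object and all values are assumptions; with the real data you would start from gaming_anxiety instead):

```r
library(dplyr)

# Hypothetical rows; column names follow the survey, values are invented
toy <- data.frame(
  Game  = c("League of Legends", "Starcraft 2", "League of Legends", "Other"),
  Hours = c(10, 25, 40, 5)
)

# 01. One-row summary over the whole table
overall <- toy %>% summarize(mean_hours = mean(Hours))

# 02. Frequency of each Game value
game_counts <- toy %>% count(Game)

# 03. Group first, then summarize within each group
by_game <- toy %>%
  group_by(Game) %>%
  summarize(mean_hours = mean(Hours), respondents = n())
```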
04. ungroup()
Removes grouping structure after a grouped operation.
05. mutate()
Creates new variables or modifies existing ones.
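As a sketch of commands 04 and 05 together, the example below adds one ordinary column and one group-wise column, then drops the grouping. The data and the new column names (`hours_per_day`, `game_mean`) are hypothetical.

```r
library(dplyr)

# Hypothetical rows; values are invented
toy <- data.frame(Game = c("A", "A", "B"), Hours = c(10, 30, 7))

with_new <- toy %>%
  mutate(hours_per_day = Hours / 7) %>%   # new column computed row by row
  group_by(Game) %>%
  mutate(game_mean = mean(Hours)) %>%     # mean within each Game
  ungroup()                               # later steps see the whole table again
```

Calling ungroup() at the end is a defensive habit: it prevents a leftover grouping from silently changing the result of a later summarize() or mutate().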
06. rowwise()
Applies operations across columns within individual rows.
07. filter()
Selects rows based on logical conditions.
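A minimal sketch of filter() with invented data (the `toy` object and values are assumptions). Conditions separated by commas must all be true for a row to be kept.

```r
library(dplyr)

# Hypothetical rows; one respondent has a missing Age
toy <- data.frame(
  Age      = c(19, 34, 22, NA),
  Hours    = c(10, 25, 40, 5),
  Platform = c("PC", "PC", "Console", "PC")
)

# Keep PC players who game at least 10 hours per week
pc_heavy <- toy %>% filter(Platform == "PC", Hours >= 10)

# Keep only rows where Age was recorded
has_age <- toy %>% filter(!is.na(Age))
```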
08. distinct()
Returns unique rows based on selected columns.
09. slice()
Selects rows by position index.
10. slice_sample()
Randomly selects a number of rows.
gaming_anxiety %>%
  slice_sample(n = 5)
11. slice_min() and slice_max()
Selects rows with the minimum or maximum value in a column.
12. arrange()
Sorts the data by column values in ascending order.
13. desc()
Used inside arrange() to sort values in descending order.
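Sorting and picking extreme rows can be sketched as follows, again on invented data (`toy` and its values are assumptions):

```r
library(dplyr)

# Hypothetical rows
toy <- data.frame(Game = c("A", "B", "C"), Hours = c(25, 5, 40))

# 12. Ascending order is the default
ascending <- toy %>% arrange(Hours)

# 13. Wrap the column in desc() to reverse the order
descending <- toy %>% arrange(desc(Hours))

# 11. slice_max() keeps the row(s) with the largest value
top_one <- toy %>% slice_max(Hours, n = 1)
```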
14. pull()
Extracts a single column as a vector.
15. select()
Selects specific columns from a dataset.
16. relocate()
Changes the order of columns in the dataset.
17. across()
Applies a function to multiple columns simultaneously.
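Commands 14 to 17 can be sketched on a small invented table. The column names mirror the survey, but the `toy` object and all values are assumptions.

```r
library(dplyr)

# Hypothetical rows
toy <- data.frame(
  Age  = c(19, 34, 22),
  GAD1 = c(0, 3, 3),
  GAD2 = c(1, 2, 3),
  SWL1 = c(5, 2, 7)
)

# 14. pull() returns a plain vector rather than a data frame
ages <- toy %>% pull(Age)

# 15. select() keeps only the named columns
gad_only <- toy %>% select(GAD1, GAD2)

# 16. relocate() moves a column; with no position given it goes first
swl_first <- toy %>% relocate(SWL1)

# 17. across() applies mean() to every column whose name starts with "GAD"
gad_means <- toy %>% summarize(across(starts_with("GAD"), mean))
```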
18. c_across()
Used inside rowwise() to perform operations across selected columns in each row.
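A common use of rowwise() with c_across() is to total several scale items for each respondent. This is a minimal sketch on invented data; the `GAD_total` column name is hypothetical and only three items are shown.

```r
library(dplyr)

# Hypothetical item responses for two respondents
toy <- data.frame(GAD1 = c(0, 3), GAD2 = c(1, 2), GAD3 = c(2, 3))

scored <- toy %>%
  rowwise() %>%                                      # treat each row as its own group
  mutate(GAD_total = sum(c_across(GAD1:GAD3))) %>%   # sum the items within the row
  ungroup()                                          # return to a regular table
```

Without rowwise(), sum() would add up every value in the selected columns rather than working one respondent at a time.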
19. rename()
Renames one or more columns.
20. n()
Returns the count of rows in each group, often used inside summarize().
21. mean(), median(), sum(), sd()
Common summary functions used inside summarize() or mutate().
9.5 Cleaning the Data to Include Only Valid Survey Responses
The full dataset contains 14250 response IDs, but only 13464 are real, valid survey responses. Some rows were added to simulate incomplete or fake data to teach cleaning and filtering. These steps will help you remove the simulated responses and retain only the clean, complete data needed for analysis.
Step 01: Identify the Structure of the Data
Before filtering, examine how the dataset is organized. You should look at:
- Missing values
- Incomplete rows
- Common patterns in legitimate responses
Counting the rows with nrow(gaming_anxiety) should return 14250, but only 13464 of those rows are valid responses.
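One way to carry out this inspection is sketched below on a tiny invented stand-in (`toy_responses` and its values are assumptions). With the real data you would run the same calls on gaming_anxiety.

```r
# Hypothetical stand-in for the survey: rows 2 and 3 are incomplete
toy_responses <- data.frame(
  Game  = c("Skyrim", NA, "Other", "Destiny"),
  Age   = c(21, 30, NA, 25),
  GAD1  = c(0, NA, 2, 1),
  SWL1  = c(5, 3, NA, 6),
  SPIN1 = c(1, NA, 0, 2)
)

# Total number of rows, valid or not
n_rows <- nrow(toy_responses)

# Missing values per column
missing_per_column <- colSums(is.na(toy_responses))

# Rows with no missing values in any column
n_complete <- sum(complete.cases(toy_responses))
```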
Step 02: Filter Out Incomplete Responses
Now, remove simulated responses. We’ll assume that valid responses have:
- A non-missing value for GAD1, SWL1, and SPIN1
- A listed Game and Age
This combination ensures we’re only keeping actual, completed surveys.
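The filter implied by these assumptions can be sketched as follows. The example runs on an invented stand-in (`toy_responses`, `valid_toy`, and all values are hypothetical), and it assumes that an unlisted Game appears either as NA or as an empty string; check how missing values are actually coded in your copy of the data.

```r
library(dplyr)

# Hypothetical stand-in: row 2 has missing items, row 3 has a blank Game
toy_responses <- data.frame(
  Game  = c("Skyrim", NA, "", "Destiny"),
  Age   = c(21, 30, 22, 25),
  GAD1  = c(0, NA, 2, 1),
  SWL1  = c(5, 3, 4, 6),
  SPIN1 = c(1, NA, 0, 2)
)

# Keep rows that satisfy every validity condition
valid_toy <- toy_responses %>%
  filter(
    !is.na(GAD1), !is.na(SWL1), !is.na(SPIN1),  # key scale items answered
    !is.na(Game), Game != "",                   # a game is listed
    !is.na(Age)                                 # age is recorded
  )

n_valid <- nrow(valid_toy)
```

Applied to the real data, the same filter() call on gaming_anxiety, with the result stored as valid_gaming_data, gives the object used in Step 03. Whether it returns exactly 13464 rows depends on how the simulated rows were built, so treat the row count as the check.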
Step 03: Check Your Work
After filtering, check that the dataset now contains exactly 13464 responses.
nrow(valid_gaming_data)
You now have a clean version of the dataset that includes only complete and authentic responses. This version, valid_gaming_data, is the one you’ll use in the next chapters on descriptive statistics, inferential tests, and data visualization.