Sampling
The Universe in a Drop of Water
In 1854, a devastating cholera outbreak gripped the Soho district of London. The prevailing theory of the time, the “miasma” theory, held that the disease was spread through “bad air.” A physician named John Snow, however, was skeptical. He suspected the source was contaminated water. To test his idea, he did not need to analyze every drop of water in London or interview every single resident. Instead, he engaged in a brilliant act of sampling. He meticulously mapped the locations of the cholera deaths and found they clustered around a single public water pump on Broad Street. He then took samples of water from that pump and, upon examining it under a microscope, found evidence of the contamination he suspected. By studying a carefully selected subset of the environment, Snow was able to draw a powerful conclusion about the entire outbreak, leading to the removal of the pump handle and a swift decline in new cases.
This historical episode is a powerful illustration of the logic that lies at the heart of all empirical research: the logic of sampling. In most research, it is impossible, impractical, or simply unnecessary to study every single member of a group of interest. We cannot survey every voter in a country, analyze every news article ever published, or observe every family’s media habits. Instead, we study a smaller, manageable subset—a sample—and seek to draw conclusions about the larger group, or population, from which it was drawn. The entire process of inference, of making claims about the whole based on evidence from a part, rests on the quality of that sample. A poorly chosen sample, like a movie trailer that shows only the two exciting minutes from a dull two-hour film, can be profoundly misleading. A well-chosen sample, however, can act like a miniature, high-fidelity portrait of the larger population, allowing us to understand the universe by studying a single drop of water.
This chapter is dedicated to the principles and techniques of sampling. It is a journey into one of the most foundational and consequential stages of the research workflow. We will begin by defining the core concepts of population, sample, and sampling frame, and explore the crucial goals of representativeness and generalizability. We will then delve into the two major families of sampling techniques. First, we will examine probability sampling, the gold standard for quantitative research, which uses the power of random selection to generate samples that can accurately mirror a population. Second, we will explore non-probability sampling, a set of techniques essential for qualitative and exploratory research, where the goal is not to generalize to a population but to gain deep, targeted insights. Finally, we will discuss the practical realities of sampling error, the logic of confidence intervals, and the new challenges and opportunities for sampling that have emerged in the complex landscape of the digital age. Understanding the logic of selection is not just a technical skill; it is the key to determining the reach and credibility of your research findings.
The Logic of Sampling: Representativeness and Generalizability
The primary goal of many research studies, particularly those within the social scientific paradigm, is to produce findings that are generalizable. Generalization is the process by which a researcher takes conclusions derived from observing a sample and extends those conclusions to the entire, unobserved population. For example, when a polling organization reports that 52% of a sample of 1,200 likely voters supports a particular candidate, they are generalizing that finding to the entire population of tens of millions of likely voters. The degree to which this leap of inference is justified depends entirely on how the sample was selected and, specifically, on its representativeness.
A sample is considered representative if it is a microcosm of the population from which it is drawn—if it accurately reflects the characteristics of the population in approximately the same proportion. If a population of university students is 60% female and 40% male, a representative sample of those students should also be approximately 60% female and 40% male. The same would hold true for other relevant characteristics, such as age, race, socioeconomic status, and year in school. A sample that fails to mirror the population in these ways is considered biased, and any generalizations made from it are likely to be inaccurate. This was the fatal flaw of the infamous 1936 Literary Digest poll, which predicted a landslide presidential victory for Alf Landon over Franklin Roosevelt. The poll’s sample was drawn from telephone directories and automobile registration lists, which in the midst of the Great Depression systematically overrepresented wealthier Americans and excluded the poorer voters who overwhelmingly supported Roosevelt. The sample was massive—over two million people—but it was not representative, and thus its prediction was spectacularly wrong.
The process of sampling, therefore, begins with a series of careful definitions. The first step is to precisely define the target population, which consists of all the objects, events, or people of a certain type about which the researcher seeks knowledge. This definition must be specific, setting clear boundaries that separate who or what is of interest from who or what is not. A population of “married couples” is too vague. A more precise definition might be “opposite-sex married couples, living in the same residence in the United States, who have been married for between five and ten years and have at least one child under the age of 18.” This level of specificity is crucial for the next step: creating a sampling frame.
A sampling frame is the actual list of all the elements or units in the population from which the sample will be selected. It is the operationalization of the population definition. For a study of current members of the American Sociological Association, the sampling frame would be the organization’s official membership roster. For a study of news articles from a particular newspaper, the sampling frame would be a complete archive of all articles published in that paper during a specific time period. The quality of a sample can be no better than the quality of its sampling frame. An incomplete or inaccurate list will produce a biased sample, regardless of how carefully the selection process is conducted. For example, if the sampling frame for a city’s residents is the local telephone book, it will systematically exclude people with unlisted numbers and those who only use mobile phones, a problem known as undercoverage. The time spent carefully defining the population and constructing or obtaining the best possible sampling frame is a critical investment in the ultimate validity of a study’s findings.
Probability Sampling: The Gold Standard of Generalization
How can a researcher be confident that their sample is truly representative of the population? The most powerful strategy for overcoming the obstacles of bias and achieving a representative sample is to use a probability sampling technique. A probability sample is one in which every element in the population has a known, non-zero, and calculable probability of being included in the sample. The mechanism that makes this possible is random selection. In a random selection process, chance alone determines which elements from the sampling frame are chosen. This process systematically eliminates the influence of researcher bias (e.g., a surveyor in a mall who consciously or unconsciously avoids certain types of people) and allows the laws of probability to do the work of creating a sample that, in the long run, will mirror the population.
The idea of leaving something as important as sample selection to chance can seem counterintuitive. Our culture often warns us to “leave nothing to chance.” In sampling, however, chance is our greatest ally. It is the guarantor of fairness and the mathematical foundation upon which the entire logic of statistical inference is built. All probability sampling methods require a complete and accurate sampling frame. While there are several variations, they all share this core commitment to random selection.
Simple Random Sampling
This is the most basic and straightforward form of probability sampling, and it serves as the theoretical foundation for all others. In a simple random sample, every element in the sampling frame has an equal chance of being selected, and every possible combination of elements has an equal chance of being the final sample. The process is analogous to placing the name of every person in the population into a very large hat, mixing them thoroughly, and drawing out the desired number of names for the sample.
In practice, this is typically done using a computer. The researcher first numbers every element in the sampling frame. A random number generator is then used to produce as many random numbers as the desired sample size, and the elements whose numbers were drawn are included in the sample. While simple random sampling is the “purest” form of probability sampling, it can be tedious and impractical for very large populations, which has led to the development of more efficient alternatives.
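To make the procedure concrete, here is a minimal sketch in Python, assuming the sampling frame is already available as a list; the frame of hypothetical student IDs and the sample size of 500 are invented for illustration.

```python
# A minimal sketch of simple random sampling. The sampling frame here is a
# hypothetical list of 10,000 student IDs; in a real study it would be the
# actual membership roster, archive, or other complete list.
import random

sampling_frame = [f"student_{i:05d}" for i in range(1, 10_001)]
sample_size = 500

# random.sample() draws without replacement, so every element has an equal
# chance of selection and every possible combination of 500 elements has an
# equal chance of being the final sample.
simple_random_sample = random.sample(sampling_frame, sample_size)
print(simple_random_sample[:5])
```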
Systematic Random Sampling
A systematic random sample is often a more efficient alternative to a simple random sample, especially when dealing with a long sampling frame. The process begins in the same way, with a complete list of the population. The researcher then calculates a sampling interval (denoted as k) by dividing the population size by the desired sample size. A random starting point is then selected between 1 and k. From that starting point, every kth element on the list is selected for inclusion in the sample.
For example, imagine a researcher has a sampling frame of 10,000 employees at a large corporation and wants to draw a sample of 500. The sampling interval would be 20 (10,000 / 500 = 20). The researcher would then use a random number generator to select a starting number between 1 and 20. If the number 13 is chosen, the sample would consist of the 13th, 33rd, 53rd, 73rd (and so on) employees on the list until 500 have been selected. In most cases, a systematic sample is functionally equivalent to a simple random sample. The only potential pitfall is if the sampling frame has a hidden periodic pattern that happens to align with the sampling interval, which could introduce a systematic bias. For instance, if a list of houses is organized by street corner, and every 20th house is a corner lot, a sampling interval of 20 would result in a sample of only corner-lot houses.
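The same example can be sketched in code. The frame of 10,000 hypothetical employee IDs and the sample of 500 mirror the numbers above; only the random starting point differs from run to run.

```python
# A minimal sketch of systematic random sampling using the chapter's numbers:
# 10,000 employees, a desired sample of 500, so a sampling interval k of 20.
import random

sampling_frame = [f"employee_{i:05d}" for i in range(1, 10_001)]  # hypothetical frame
sample_size = 500
k = len(sampling_frame) // sample_size       # sampling interval: 10,000 / 500 = 20

start = random.randint(1, k)                 # random starting point between 1 and k
# Take every kth element from the starting point. Positions in the prose are
# 1-based, so subtract 1 when indexing the Python list.
systematic_sample = [sampling_frame[i - 1] for i in range(start, len(sampling_frame) + 1, k)]
print(len(systematic_sample), systematic_sample[:3])
```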
Stratified Sampling
Sometimes, a researcher wants to ensure that specific subgroups within a population are adequately represented in the sample. This is particularly important when a subgroup of interest is relatively small. A simple random sample might, by chance, underrepresent or even completely miss the members of this small group. Stratified sampling is a technique designed to prevent this.
The process begins by dividing, or stratifying, the population into mutually exclusive and homogeneous subgroups, or strata, based on a characteristic of interest (e.g., gender, race, age group, geographic region). A separate random sample (either simple or systematic) is then drawn from within each stratum. This guarantees that the final sample will include members from each subgroup.
In proportionate stratified sampling, the number of elements drawn from each stratum is proportional to that stratum’s representation in the total population. If a university’s student body is 15% seniors, a proportionate stratified sample would ensure that 15% of the sample consists of seniors. In disproportionate stratified sampling, a researcher might intentionally “oversample” a small subgroup to ensure they have a large enough number of cases from that group to conduct meaningful statistical analysis. When using this technique, the results must be statistically weighted later to correct for the oversampling and accurately reflect the total population.
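A proportionate version of this procedure might look like the following sketch. The strata, population counts, and total sample size are hypothetical; the seniors stratum is set at 15% of the population to echo the example above.

```python
# A minimal sketch of proportionate stratified sampling over a hypothetical
# student body of 10,000, grouped by class year. Seniors make up 15%.
import random

population = (
    [(f"fr_{i}", "freshman") for i in range(3000)]
    + [(f"so_{i}", "sophomore") for i in range(2750)]
    + [(f"ju_{i}", "junior") for i in range(2750)]
    + [(f"se_{i}", "senior") for i in range(1500)]
)
total_sample_size = 400

# Group the frame into strata, then draw a simple random sample from each
# stratum, sized in proportion to that stratum's share of the population.
strata = {}
for student, year in population:
    strata.setdefault(year, []).append(student)

stratified_sample = []
for year, members in strata.items():
    n_stratum = round(total_sample_size * len(members) / len(population))
    stratified_sample.extend(random.sample(members, n_stratum))

print(len(stratified_sample))  # 400 total; 15% of them (60) are seniors
```

For a disproportionate design, the per-stratum sizes would be set directly rather than proportionally, and weights would be applied at the analysis stage to restore the population proportions.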
Cluster Sampling
What happens when it is impossible or impractical to construct a complete sampling frame for a population? This is often the case for large, geographically dispersed populations, like all public high school teachers in the United States. It would be a monumental task to compile a single list of every teacher. Cluster sampling is a multi-stage technique designed for precisely these situations.
Instead of sampling individuals, the researcher first samples larger, naturally occurring groups, or clusters, in which the individuals are found. The process works in stages, moving from larger clusters to smaller ones. To sample high school teachers, a researcher might:
Obtain a list of all school districts in the country (the first-stage clusters) and draw a random sample of districts.
For each selected district, obtain a list of all high schools (the second-stage clusters) and draw a random sample of schools.
For each selected school, obtain a list of all teachers (the final sampling frame) and draw a simple random sample of teachers.
Cluster sampling is often more efficient and less expensive than simple random sampling for large populations. However, it also tends to have a higher degree of sampling error, because error is introduced at each stage of the sampling process.
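The staged procedure above can be sketched as follows. The nested structure of districts, schools, and teachers is entirely hypothetical, and the stage sizes (10 districts, 2 schools each, 5 teachers each) are arbitrary choices for illustration.

```python
# A minimal sketch of multi-stage cluster sampling over hypothetical data.
# Note that a teacher-level list is only ever needed for the schools that
# were actually selected, which is the method's practical appeal.
import random

# Hypothetical nested frame: district -> school -> list of teachers.
districts = {
    f"district_{d}": {
        f"school_{d}_{s}": [f"teacher_{d}_{s}_{t}" for t in range(40)]
        for s in range(5)
    }
    for d in range(100)
}

sampled_districts = random.sample(list(districts), 10)              # stage 1: districts
sampled_teachers = []
for d in sampled_districts:
    sampled_schools = random.sample(list(districts[d]), 2)          # stage 2: schools
    for s in sampled_schools:
        sampled_teachers.extend(random.sample(districts[d][s], 5))  # stage 3: teachers

print(len(sampled_teachers))  # 10 districts x 2 schools x 5 teachers = 100
```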
Non-Probability Sampling: When Generalization Is Not the Goal
In many research situations, particularly in qualitative or exploratory studies, a sampling frame is not available, or the primary goal is not to produce findings that are statistically generalizable to a larger population. In these cases, researchers use non-probability sampling methods. In non-probability sampling, the probability of any given element being selected is unknown, and the selection process is not random. The findings from these samples cannot be used to make statistical inferences about a population, but they can provide valuable, in-depth, and targeted insights that are essential for many research questions.
Convenience Sampling
Also known as accidental or haphazard sampling, convenience sampling involves selecting participants based on their easy availability to the researcher. This is the least rigorous of all sampling methods but is very common in communication research, especially for preliminary or exploratory studies. Examples include surveying students in a large university lecture course, interviewing people who happen to be walking through a public park, or analyzing the first 50 comments on a news website. The major disadvantage of convenience sampling is that it is highly susceptible to selection bias; the people who are “convenient” are often not representative of any larger population.
Purposive Sampling
Also called judgmental sampling, purposive sampling is a technique in which the researcher uses their own knowledge and judgment to select cases that are most informative for the study’s purpose. The researcher intentionally targets individuals who are known to possess specific characteristics or expertise relevant to the research question. For example, if a researcher wants to understand the communication strategies of successful social movement leaders, they would not sample randomly from the population; they would purposively seek out and interview individuals who are recognized as leaders in that field. This method is common in qualitative research where the goal is to gain deep insight from a small, information-rich sample.
Snowball Sampling
Snowball sampling, also known as network or respondent-assisted sampling, is a referral-based technique used to find participants in hard-to-reach or hidden populations for which no sampling frame exists. This method is particularly useful for studying stigmatized or marginalized groups, such as undocumented immigrants, members of an underground subculture, or individuals with a rare medical condition. The researcher starts by identifying and interviewing a few key informants who are members of the population. These initial participants are then asked to refer the researcher to other members of their network. The sample “snowballs” as each new participant leads to others. The primary limitation of this method is that it tends to sample people who are well-connected within a social network, potentially missing those who are more isolated.
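As a rough illustration, the referral process can be modeled as a walk outward through a network. The seeds and referral lists below are entirely hypothetical; real referral chains depend on trust and consent rather than a stored graph.

```python
# A minimal sketch of snowball sampling over a hypothetical referral network.
# This only illustrates how the sample grows outward from a few key informants.
seed_informants = ["seed_A", "seed_B"]
referrals = {
    "seed_A": ["p1", "p2"],
    "seed_B": ["p2", "p3", "p4"],
    "p1": ["p5"],
    "p2": ["p1", "p6"],
    "p3": [],
    "p4": ["p6", "p7"],
    "p5": [], "p6": ["p8"], "p7": [], "p8": [],
}

sample, to_contact = [], list(seed_informants)
target_size = 8

while to_contact and len(sample) < target_size:
    person = to_contact.pop(0)
    if person in sample:
        continue
    sample.append(person)                              # "interview" this participant
    to_contact.extend(r for r in referrals.get(person, []) if r not in sample)

print(sample)  # the sample snowballs outward from the seeds through their networks
```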
Quota Sampling
Quota sampling is the non-probability equivalent of stratified sampling. As in stratified sampling, the researcher begins by identifying relevant subgroups in the population and determining the proportion of the population that falls into each subgroup (e.g., based on census data for age, gender, and race). The researcher then sets a “quota” for the number of participants to be recruited from each subgroup to match these population proportions. The crucial difference is that the participants who fill these quotas are not selected randomly. They are typically recruited using convenience methods. For example, a mall interviewer might be told to survey 20 men and 30 women. They will then approach people in the mall until they have met those specific quotas. While quota sampling can create a sample that appears representative on the surface on a few key characteristics, it is still subject to the selection biases of convenience sampling and cannot be used for statistical generalization.
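The mall example can be expressed as a simple quota-filling loop. The stream of passersby is simulated here, which is exactly the point: whoever happens to come along fills the quota, so selection within each group is convenience, not chance.

```python
# A minimal sketch of quota sampling using the chapter's quotas of 20 men and
# 30 women. The simulated "passersby" stand in for mall visitors; selection
# within each quota is not random in any statistically useful sense.
import random

quotas = {"man": 20, "woman": 30}
recruited = {"man": [], "woman": []}

def next_passerby(i):
    """Hypothetical stand-in for whoever the interviewer approaches next."""
    return {"id": i, "gender": random.choice(["man", "woman"])}

i = 0
while any(len(recruited[g]) < quotas[g] for g in quotas):
    visitor = next_passerby(i)
    i += 1
    g = visitor["gender"]
    if len(recruited[g]) < quotas[g]:      # accept only while that quota is still open
        recruited[g].append(visitor["id"])

print({g: len(members) for g, members in recruited.items()})  # {'man': 20, 'woman': 30}
```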
Sampling Error, Confidence, and Sample Size
Even the most meticulously designed probability sample will almost never be a perfect mirror of the population. Imagine drawing a small handful of marbles from a large jar containing an equal number of red and blue marbles. By pure chance, your handful might contain slightly more red marbles or slightly more blue ones. This natural, random variation between a sample statistic (the percentage of red marbles in your hand) and the population parameter (the true 50/50 split in the jar) is called sampling error. It is an unavoidable feature of sampling, an acknowledgment that we are working with incomplete information.
While we cannot eliminate sampling error, the power of probability theory is that it allows us to account for it and to quantify our uncertainty. This is done through the calculation of confidence intervals and confidence levels.
A confidence interval, often reported in the media as the “margin of error,” provides a range of values within which the true population parameter is likely to fall. When a poll reports that a candidate has 46% support with a margin of error of +/- 3%, they are stating a confidence interval of 43% to 49%. They are acknowledging that the true level of support in the population is probably not exactly 46%, but is very likely somewhere within that range.
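Behind the reported margin of error is the normal approximation to the sampling distribution of a proportion. At the conventional 95% confidence level, a commonly used form is:

$$
\text{margin of error} \approx 1.96 \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
$$

where \(\hat{p}\) is the sample proportion and \(n\) is the sample size. With \(\hat{p} = 0.46\) and a hypothetical sample of about 1,000 respondents, this works out to roughly 0.031, or about 3 percentage points, matching the margin of error reported above.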
The confidence level expresses how certain we are that the true population value lies within that calculated interval. The standard confidence level used in most social science research is 95%. A 95% confidence level means that if we were to draw 100 different random samples from the same population and calculate a confidence interval for each one, we would expect the true population parameter to fall within our interval in 95 of those 100 samples.
The size of the confidence interval—our margin of error—is influenced by two main factors: the variability within the population and the size of our sample. For a highly diverse, or heterogeneous, population, we need a larger sample to capture that variability accurately than we would for a very uniform, or homogeneous, population. The most direct way a researcher can increase the precision of their estimates (i.e., narrow the confidence interval) is by increasing the sample size. A larger sample provides more information and thus reduces the uncertainty caused by sampling error. However, there is a point of diminishing returns; quadrupling the sample size is required to cut the margin of error in half, which can be very costly. Determining the appropriate sample size is a balancing act between the desired level of statistical precision and the practical constraints of time and resources.
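The trade-off between sample size and precision can be checked with a few lines of arithmetic. The sample sizes below are arbitrary, and the calculation uses the worst-case proportion of 0.5.

```python
# A quick check of the "quadruple the sample to halve the margin of error"
# rule, using the standard 95% margin-of-error approximation for a proportion.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (300, 1200, 4800):
    print(n, round(margin_of_error(n) * 100, 1), "percentage points")
# 300 -> ~5.7, 1200 -> ~2.8, 4800 -> ~1.4: each quadrupling roughly halves it
```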
Sampling in the Digital Age: New Frontiers and New Problems
The rise of the internet and social media has radically transformed the landscape of communication research, presenting both unprecedented opportunities and profound new challenges for sampling. Researchers now have access to vast streams of “big data” generated by millions of users, but the traditional principles of sampling are often difficult, if not impossible, to apply in this new environment.
The most significant challenge is the breakdown of the traditional sampling frame. For most social media platforms, a complete and accurate list of all users—the full population—is simply not available to researchers. The total population of Twitter or Facebook is unknown and constantly in flux. This means that a true simple random sample of all users is not possible. Researchers often rely on data collected through a platform’s Application Programming Interface (API), which provides structured access to a portion of the platform’s data. Twitter’s “streaming API,” for example, provides access to a random sample of about 1% of all public tweets in real time. While this is a form of random sampling, it is a sample of tweets, not a sample of users, and it is still only a fraction of the total conversation.
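As a purely conceptual illustration (not the platform's actual API or sampling mechanism), keeping roughly 1% of an incoming stream might look like the following sketch, with the stream itself simulated.

```python
# A conceptual sketch only: rate-based random sampling of an incoming stream,
# in the spirit of "keep about 1% of items as they arrive." The stream is
# simulated; this is not the real Twitter/X API or its internal sampling.
import random

def simulated_stream(n_items=100_000):
    """Hypothetical stand-in for a real-time stream of public posts."""
    for i in range(n_items):
        yield {"post_id": i, "text": f"post number {i}"}

SAMPLE_RATE = 0.01  # keep roughly 1% of items

sampled_posts = [post for post in simulated_stream() if random.random() < SAMPLE_RATE]
print(len(sampled_posts))  # roughly 1,000 of the 100,000 simulated posts
```

Even with a perfectly random rate, the result is a sample of posts rather than of users, which is exactly the caution raised above.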
This reality means that many large-scale digital studies, even those involving millions of data points, are effectively relying on large and complex convenience samples. The data is “found,” not systematically sampled from a known population. This introduces several potential biases that researchers must acknowledge.
Population Bias: The population of users on any given social media platform is not representative of the general population. Users of platforms like Twitter, for example, tend to be younger, more urban, and more educated than the population as a whole.
Self-Selection Bias: The content people choose to post is not a random sample of their thoughts or behaviors. People present a curated version of themselves online.
Data Availability Bias: Not all data is equally accessible. Users with private accounts are excluded from most data collection. Furthermore, users who choose to enable features like geotagging their posts have been shown to be demographically different from users who do not.
This new environment does not invalidate digital research, but it does demand a heightened sense of methodological transparency and humility. It is incumbent upon the modern researcher to be clear about the limitations of their digital samples and to be appropriately cautious when making claims about the generalizability of their findings. The logic of sampling remains as crucial as ever, but its application requires a new set of critical considerations for the unique nature of our networked world.
Conclusion: The Foundation of Inference
The selection of a sample is one of the most consequential decisions a researcher will make. It is the foundation upon which all claims of inference and generalization are built. A carefully constructed probability sample can provide a remarkably accurate portrait of a large and complex population, allowing us to make confident claims about the whole by observing just a small part. A thoughtfully selected non-probability sample can offer deep, rich, and targeted insights into a specific phenomenon or community, providing a level of understanding that a broad survey could never achieve.
The choice of a sampling strategy is not a mere technicality; it is a direct and logical extension of the research question and the overall goals of the study. A researcher who seeks to produce statistically generalizable findings must embrace the rigor and logic of probability sampling. A researcher who seeks to explore a new area or understand a subjective experience must master the targeted and strategic logic of non-probability sampling. In every case, the researcher must be a critical and transparent steward of their data, fully aware of the strengths and limitations that their sampling decisions impose on their conclusions. In the end, the quality of our knowledge is inextricably linked to the quality of our samples.
Journal Prompts
The chapter opens with the story of John Snow and the Broad Street pump—an example of how sampling can reveal powerful truths about a whole system. Reflect on a time you formed a strong opinion or insight based on a small piece of evidence (e.g., a social media post, a conversation, a single article). Was that sample representative of the broader reality? What does this example teach you about the risks or rewards of inference from a small sample?
Imagine you are planning a study on how college students interact with AI tools like ChatGPT. Would you choose a probability sampling method or a non-probability one? Why? Consider your research goals—do you want to generalize to all college students or understand a specific group more deeply? Explain your choice and what trade-offs it involves in terms of access, time, cost, and generalizability.
Much of today’s research relies on digital data—tweets, posts, videos, and online surveys. This chapter explains how population bias, self-selection bias, and data availability bias can distort digital research. Choose one of these forms of bias and describe how it might affect a study of online news consumption or streaming habits. What could a researcher do to acknowledge or reduce that bias?