
The phrase "four out of five dentists recommend" appeared in advertising throughout my childhood, and I accepted it as meaningful evidence without ever asking the obvious question: which five? A claim built on five data points tells you almost nothing reliable. You could survey five friends about their preferred lunch, and whatever they said would represent nothing beyond those five people. Understanding what sample size means for the reliability of a conclusion is one of the most useful things you can take from statistics, because the sample size is almost always the number missing from the claims that depend on it most.
The Law of Large Numbers
The law of large numbers states that as a sample grows, its average tends toward the true population average. Flip a fair coin 10 times and you might get 7 heads, an apparent rate of 70%. Flip it 1,000 times and the share of heads will almost always land within a few percentage points of 50%. The coin hasn't changed; your estimate of its properties has just become more reliable with more data.
The implication: small samples can produce results that look dramatic but are explained entirely by chance variation. An early clinical trial with 30 patients showing a 40% improvement rate might, by chance, have recruited an unusually healthy group — or the improvement might be a real treatment effect. With 30 patients, you can't tell. With 3,000 patients, the estimate is much more reliable.
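The coin example can be checked with a short sketch. It is an illustration, not a proof: the function names, the checkpoints, and the fixed seed are all choices made for this example. The first function simulates flips and records the share of heads as the count grows; the second computes the exact binomial chance of seeing a lopsided result.

```python
import random
from math import comb

def running_heads_rate(checkpoints, seed=0):
    """Flip a simulated fair coin and record the share of heads at each checkpoint."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    heads = 0
    flips = 0
    rates = {}
    for n in sorted(checkpoints):
        while flips < n:
            heads += rng.random() < 0.5  # True counts as 1
            flips += 1
        rates[n] = heads / n
    return rates

def prob_at_least(k, n, p=0.5):
    """Exact binomial probability of k or more heads in n flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

rates = running_heads_rate([10, 100, 1_000, 100_000])
for n, rate in rates.items():
    print(f"{n:>7} flips: {rate:.1%} heads")

print(f"P(7+ heads in 10 flips)      = {prob_at_least(7, 10):.3f}")
print(f"P(700+ heads in 1,000 flips) = {prob_at_least(700, 1000):.2e}")
```

Seven or more heads in 10 flips happens about 17% of the time with a perfectly fair coin, while 70% heads in 1,000 flips is so unlikely that it would be strong evidence the coin is biased.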
Confidence Intervals
A confidence interval (CI) quantifies the uncertainty in an estimate due to sample size. A study of 50 people showing 60% responding positively might have a 95% CI of 46%–74%. The true rate could plausibly be anywhere in that range. A study of 5,000 people showing the same 60% rate would have a much narrower CI, perhaps 58.6%–61.4%. The wider the CI, the less information the study provides.
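The two intervals quoted above can be reproduced with the simplest textbook formula, the normal-approximation (Wald) interval. This is a minimal sketch that assumes that method; the function name `proportion_ci` is made up for this example, and other methods (such as the Wilson interval) behave better when samples are very small or rates are near 0% or 100%.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 60% positive responses at two very different sample sizes
for successes, n in [(30, 50), (3000, 5000)]:
    low, high = proportion_ci(successes, n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
# n=50: 46.4% to 73.6%
# n=5000: 58.6% to 61.4%
```

The margin of error shrinks with the square root of the sample size, which is why 100 times as many people only narrows the interval by a factor of 10.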
Reputable statistical work typically reports confidence intervals alongside point estimates. When a study reports only a single number without a CI, ask what the sample size was and what the plausible range of true values might be.
The Hot Hand and Small Samples
A basketball player makes five consecutive three-pointers. Do they have a "hot hand", genuinely performing better than usual, or is this normal statistical variation? Research on hot hand claims in sports has found that what appears to be streakiness is usually consistent with the variation expected from a fixed underlying probability. For a player who makes 40% of three-pointers, any given run of five attempts has only about a 1% chance of being all makes (0.4^5 ≈ 0.01), but over hundreds of attempts random sequences of makes and misses will produce streaks of five or more that look uncanny but aren't.
We're not good at intuiting what random variation looks like. Sequences of coin flips feel non-random because our intuition expects alternation (HTHTHTHT) rather than runs (HHHTTTHH). As exact sequences, both are equally likely from a fair coin (each has probability 1/256), but runs feel deliberate rather than random.
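How often should a 40% shooter produce a five-make streak purely by chance? The sketch below computes it exactly, assuming every attempt is independent with the same probability of success (the "no hot hand" model). The function name and the attempt counts of 20 and 500 are illustrative choices, not figures from any study.

```python
def prob_streak(n_shots, streak_len, p_make):
    """Probability of at least one run of streak_len straight makes in n_shots independent attempts."""
    # state[k] = probability of currently sitting on a run of exactly k makes,
    # with no full streak seen so far
    state = [0.0] * streak_len
    state[0] = 1.0
    reached = 0.0  # probability that a full streak has already occurred
    for _ in range(n_shots):
        new_state = [0.0] * streak_len
        for k, prob in enumerate(state):
            new_state[0] += prob * (1 - p_make)   # a miss resets the run
            if k + 1 == streak_len:
                reached += prob * p_make          # a make completes the streak
            else:
                new_state[k + 1] += prob * p_make # a make extends the run
        state = new_state
    return reached

for shots in (5, 20, 500):
    print(f"{shots:>3} attempts: {prob_streak(shots, 5, 0.40):.1%} chance of a 5-make streak")
```

Under these assumptions a streak is rare in any single short stretch but close to inevitable somewhere across several hundred attempts, which is exactly the situation in which fans notice it.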
Publication Bias Amplifies the Problem
In academic research, studies with significant results are more likely to be published than studies with null results. If 20 researchers independently test a false hypothesis at the standard 5% significance threshold, you expect one of them, on average, to find a statistically significant result by chance alone, and the probability that at least one does is about 64%. That one positive study gets published; the 19 null results go into file drawers. The published record looks like strong evidence for something that doesn't exist.
This is not hypothetical. It has been documented systematically in psychology, nutrition science, and pharmaceutical research. Pre-registration of trial designs (specifying the hypothesis and analysis plan before collecting data) partially addresses this by making null results visible.
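The arithmetic behind the 20-researcher scenario is short enough to check directly. This sketch assumes the 20 tests are independent and that each has exactly a 5% false-positive rate; the variable names are chosen for this example.

```python
from math import comb

alpha = 0.05      # conventional significance threshold
n_studies = 20    # independent tests of a hypothesis that is actually false

# Average number of false positives across the 20 studies
expected_false_positives = n_studies * alpha

# Chance that at least one study comes up "significant"
p_at_least_one = 1 - (1 - alpha) ** n_studies

# Chance that exactly one does (binomial probability)
p_exactly_one = comb(n_studies, 1) * alpha * (1 - alpha) ** (n_studies - 1)

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one):          {p_at_least_one:.1%}")
print(f"P(exactly one):           {p_exactly_one:.1%}")
```

One false positive is the average, not a guarantee: there is roughly a 64% chance of at least one spurious "discovery" and roughly a 36% chance that all 20 studies correctly find nothing.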
When Small Samples Are Unavoidable
For rare conditions, unique populations, or resource-constrained settings, large samples are sometimes impossible. In these cases, small studies are acceptable as preliminary evidence — hypothesis-generating rather than hypothesis-confirming. They should be replicated before strong conclusions are drawn, and findings should be presented with their uncertainty made explicit. A small study that shows a promising result is a reason to do a larger study, not a reason to change practice. Our probability calculator shows the range of outcomes you'd expect from a binomial process at different sample sizes — a practical way to see how much variation is normal from small samples.
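For readers who prefer code, the same idea can be sketched in a few lines. This is not the calculator's implementation, only an illustration under stated assumptions: the function names are invented here, and the "range" is taken to be the central 95% of a binomial distribution.

```python
from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials (0 < p < 1), computed via logs to avoid overflow."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def central_range(n, p, coverage=0.95):
    """Success counts (low, high) that bound the central `coverage` share of outcomes."""
    tail = (1 - coverage) / 2
    cum, low = 0.0, 0
    for k in range(n + 1):            # walk up from 0 until the lower tail is used up
        cum += binomial_pmf(k, n, p)
        if cum > tail:
            low = k
            break
    cum, high = 0.0, n
    for k in range(n, -1, -1):        # walk down from n until the upper tail is used up
        cum += binomial_pmf(k, n, p)
        if cum > tail:
            high = k
            break
    return low, high

for n in (10, 100, 1000):
    low, high = central_range(n, 0.5)
    print(f"n={n:>4}: {low} to {high} successes ({low / n:.0%} to {high / n:.0%})")
```

With a true rate of 50%, 10 trials can plausibly yield anything from 2 to 8 successes, a spread that narrows sharply as the number of trials grows.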
What Happened When Scientists Tried to Replicate Each Other's Work
Between 2011 and 2015, the Open Science Collaboration attempted to replicate 100 published psychology studies, all of which had appeared in respected peer-reviewed journals and had been cited as evidence for real effects. Only about 36% of the replications produced results that were statistically significant in the same direction as the original. Many of the most widely cited findings failed to hold up when tested with larger samples under pre-registered protocols by independent researchers.
The causes were multiple, but small sample sizes were central. Many of the original studies had been run with 20 to 50 participants, enough to detect a very large effect but not a modest one. When chance variation in a small sample happened to produce a significant result, it got published. The studies that found nothing, equally informative if not more so, went unpublished. The replication crisis didn't reveal that the original researchers were fraudulent. It revealed that the incentive structure of academic publishing had systematically selected for underpowered studies that found something dramatic over well-powered studies that found something reliable.
Statistical Power: Can the Study Even Find What It's Looking For?
Statistical power is the probability that a study will detect a real effect if one exists. A study with 80% power has an 80% chance of finding a true effect of the assumed size and a 20% chance of missing it (a false negative). Power depends on sample size, effect size, and the significance threshold. The problem with many small studies is that they have sufficient power to detect only very large effects. If the true effect is modest, as most real-world effects in psychology, nutrition, and social science are, a study of 30 people is more likely to miss it than to detect it, regardless of how carefully it is conducted.
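To put rough numbers on this, the sketch below approximates power for one common design: comparing the means of two equal-sized groups with a two-sided test. It uses a normal approximation, and the effect size is a standardized mean difference (Cohen's d). The function name, the design, and the example effect sizes are all assumptions made for illustration; real power analyses usually rely on dedicated software and the exact test being planned.

```python
from math import sqrt
from statistics import NormalDist

def power_two_groups(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means (z-test approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # about 1.96 when alpha is 0.05
    shift = effect_size * sqrt(n_per_group / 2)  # expected size of the test statistic
    # Probability the statistic lands beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# A modest effect (d = 0.3) and a very large one (d = 1.2) at different sample sizes
for d, n in [(0.3, 15), (0.3, 175), (1.2, 15)]:
    print(f"d={d}, {n} per group: power = {power_two_groups(d, n):.0%}")
```

Under these assumptions, 30 people split into two groups give only about a 13% chance of detecting a modest effect of d = 0.3, while about 175 per group are needed to reach the conventional 80%.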
Calculating the required sample size for a study before collecting data, known as a power analysis, is standard practice in well-designed research. It asks: given what we know about the likely effect size and the level of certainty we require, how many participants do we need? Skipping this step and instead collecting data until a significant result appears, then stopping, is known as optional stopping. It is a form of "p-hacking", and it is one of the most reliable ways to produce published findings that don't replicate.
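Both halves of that paragraph can be sketched in code. The first function is a simple power analysis for a two-group comparison using a normal approximation. The second simulates studies in which there is no real effect and the researcher tests the data repeatedly, stopping at the first significant result. The function names, the peeking schedule (every 10 observations, up to 100), the number of simulated studies, and the seed are all illustrative assumptions.

```python
import random
from math import ceil
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def required_n_per_group(effect_size, power=0.80, alpha=0.05):
    """Per-group sample size for a two-sided, two-group comparison (normal approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

def false_positive_rate(peek_every, max_n, n_sims=2000, alpha=0.05, seed=0):
    """Share of null studies declared significant when testing every peek_every observations."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability
    z_crit = Z.inv_cdf(1 - alpha / 2)
    significant = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)        # true mean is 0: there is no real effect
            if n % peek_every == 0:
                z_stat = total / n ** 0.5   # z-test of the mean, known SD of 1
                if abs(z_stat) > z_crit:
                    significant += 1
                    break                   # stop collecting at the first "hit"
    return significant / n_sims

print("Needed per group for d=0.3:", required_n_per_group(0.3))  # 175
print("Needed per group for d=0.5:", required_n_per_group(0.5))  # 63

one_look = false_positive_rate(peek_every=100, max_n=100)
ten_looks = false_positive_rate(peek_every=10, max_n=100)
print(f"False positives, one planned test:   {one_look:.1%}")
print(f"False positives, peeking every 10:   {ten_looks:.1%}")
```

With a single planned test, the false-positive rate stays near the nominal 5%. Peeking ten times and stopping at the first significant result pushes it several times higher, even though no individual test was run incorrectly.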
