Correlation Does Not Mean Causation

A few years back I came across a report circulating in health circles claiming that counties with higher organic food sales showed significantly elevated rates of autism diagnoses. The correlation in the data was real: both had been rising over the same two decades. The report didn't quite say that organic food caused autism, but it didn't need to; the juxtaposition did the work. I spent about ten seconds feeling vaguely troubled before realising what was actually happening: both trends shared a common driver, which was time. An enormous number of things increased between 2000 and 2020. Organic food sales rose. Autism diagnoses rose. So did smartphone ownership, streaming subscriptions, and the number of coffee shops per high street. Any two series that climb over the same period will correlate, whether or not they have anything to do with each other. Without a causal mechanism, a correlation like this is noise dressed up as signal.

What Correlation Actually Tells You

A correlation coefficient (written as r) measures the strength and direction of the linear relationship between two variables, on a scale from −1 to +1. A value of +1 means they move in perfect lockstep: one goes up, so does the other, proportionally. A value of −1 means perfect inverse movement. Zero means no linear relationship, though the variables may still be related in some non-linear way. A strong correlation like r = 0.9 means the relationship is reliable enough to predict one variable from the other, but it says absolutely nothing about why the relationship exists.
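
To make the scale concrete, here is a minimal sketch of the standard Pearson formula in plain Python. The function name `pearson_r` and the three tiny datasets are made up for illustration; they are not taken from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance, up to a factor of n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
lockstep = pearson_r(x, [2, 4, 6, 8, 10])   # y = 2x, perfect positive relationship
inverse = pearson_r(x, [10, 8, 6, 4, 2])    # y falls as x rises
u_shaped = pearson_r(x, [4, 1, 0, 1, 4])    # y = (x - 3)^2, related but not linearly

print(lockstep, inverse, u_shaped)
```

The third dataset is fully determined by x, yet its r comes out at zero, because r only measures how well a straight line fits.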

Correlation can arise because A causes B, B causes A, a third variable C causes both, or the two happen to move together by chance, which is especially likely in small datasets or when you search long enough through enough variables. The correlation coefficient alone cannot distinguish between these possibilities; that takes study design and outside knowledge.

The Hospital That Looked Worse on Paper

Here is an example that surprised me when I first came across it. Hospital-level data often shows that hospitals with higher admission rates have worse patient outcomes. The obvious read: these hospitals are less competent. The more likely explanation: these hospitals admit sicker patients, and sicker patients have worse outcomes wherever they are treated. Much of the correlation between high admission rates and poor recovery is driven by a third variable, patient severity, that the raw data doesn't capture. Once you control for patient severity, the apparent relationship often shrinks and can even reverse. The hospitals that look worst on raw statistics are frequently the ones doing the hardest work.
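
The reversal is easier to believe with numbers in front of you. The sketch below uses two hypothetical hospitals, A and B, with recovery counts invented purely for illustration. Hospital A takes mostly severe cases; Hospital B takes mostly mild ones.

```python
# Hypothetical counts, invented for illustration: (recovered, treated)
hospitals = {
    "A": {"mild": (95, 100), "severe": (630, 900)},
    "B": {"mild": (810, 900), "severe": (60, 100)},
}

def rate(recovered, treated):
    """Recovery rate as a fraction between 0 and 1."""
    return recovered / treated

def overall_rate(strata):
    """Recovery rate with all severity groups pooled together."""
    recovered = sum(r for r, _ in strata.values())
    treated = sum(t for _, t in strata.values())
    return rate(recovered, treated)

for name, strata in hospitals.items():
    print(f"Hospital {name}: overall {overall_rate(strata):.1%}")
    for severity, (recovered, treated) in strata.items():
        print(f"  {severity:<6} {rate(recovered, treated):.1%}")
```

With these made-up figures, Hospital A has the lower overall recovery rate but the higher rate among mild patients and among severe patients separately. The pooled comparison is really a comparison of case mix.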

This is why performance comparisons in healthcare, education, and criminal justice are so difficult to do fairly from aggregate data. The correlation tells you what happened. It doesn't tell you why — and the why contains the entire story.

The Confounding Variable Problem

A confounding variable influences both variables being measured, creating a relationship between them that isn't direct. People who eat more vegetables also tend to exercise more, earn more, and sleep better. All of these independently affect health outcomes. A study showing that vegetable consumption correlates with lower rates of cardiovascular disease cannot, on its own, tell you whether the vegetables are doing the work, or whether the vegetable-eaters are simply healthier across multiple dimensions simultaneously.
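
A small simulation shows how a confounder can manufacture a correlation. In this sketch, a hypothetical `lifestyle` score drives both vegetable intake and heart health, and vegetables are given no direct effect at all. Every name and number here is an assumption chosen for illustration, not a model of real nutrition data.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def partial_r(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson_r(xs, ys), pearson_r(xs, zs), pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
lifestyle = [rng.gauss(0, 1) for _ in range(n)]           # the confounder
vegetables = [c + rng.gauss(0, 1) for c in lifestyle]     # depends on lifestyle only
heart_health = [c + rng.gauss(0, 1) for c in lifestyle]   # depends on lifestyle only

raw = pearson_r(vegetables, heart_health)
adjusted = partial_r(vegetables, heart_health, lifestyle)
print(f"raw r = {raw:.2f}, r after adjusting for lifestyle = {adjusted:.2f}")
```

The raw correlation should come out clearly positive, while the adjusted one should sit close to zero. The adjustment only works because the simulation measured the confounder; in a real study, that is the hard part.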

Randomised controlled trials address this by randomly assigning participants to groups, which distributes all confounders, measured and unmeasured, roughly equally between them, provided the groups are large enough. That's why RCTs are the gold standard for establishing causation. Observational studies can identify associations and control for the confounders that researchers thought to measure; they cannot control for the ones no one thought to ask about.
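
The balancing effect of random assignment can be sketched in a few lines. Here each simulated participant has a hypothetical `hidden` trait that nobody measures. In the self-selected version, people with a higher trait are more likely to opt in (the 80% and 20% figures are arbitrary assumptions); in the randomised version, a coin flip decides.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable
hidden = [rng.gauss(0, 1) for _ in range(10_000)]  # unmeasured trait

# Observational setting: opting in depends on the hidden trait
self_selected = [rng.random() < (0.8 if h > 0 else 0.2) for h in hidden]

# Randomised setting: a coin flip, independent of everything else
randomised = [rng.random() < 0.5 for _ in hidden]

def imbalance(trait, in_treatment):
    """Difference in average trait between treatment and control groups."""
    treated = [t for t, g in zip(trait, in_treatment) if g]
    control = [t for t, g in zip(trait, in_treatment) if not g]
    return mean(treated) - mean(control)

gap_self = imbalance(hidden, self_selected)
gap_random = imbalance(hidden, randomised)
print(f"self-selected gap: {gap_self:.2f}, randomised gap: {gap_random:.2f}")
```

The self-selected groups should differ substantially on the hidden trait before any treatment is given, while the randomised groups should differ only by chance. Note that the code never looks at `hidden` when randomising, which is the whole point.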

When Correlation Does Provide Real Evidence

Correlation is not useless — it's the starting point. What matters is whether the association holds up under scrutiny: whether it's consistent across multiple independent studies, whether the effect grows with greater exposure (a dose-response relationship), whether there's a plausible biological or mechanical explanation, and whether alternative explanations can be ruled out one by one. Tobacco and lung cancer took decades to establish causally because each of these criteria had to be satisfied. The correlation was visible within years; the causal case required sustained work across multiple research groups and disciplines.

Understanding base rates and conditional probabilities is part of evaluating whether a claimed correlation is likely to reflect something real or a false positive. Our probability calculator covers these concepts — the same logic that underlies whether a correlation in a study is likely to be genuine or an artefact of sample size and multiple testing.
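
The multiple-testing problem can be demonstrated with pure noise. This sketch generates one random target variable and 200 unrelated random candidates for a sample of 20, then counts how many candidates clear the conventional 5% significance bar. The threshold of 0.444 is the approximate two-tailed critical value of |r| for 20 observations; the sample size and candidate count are arbitrary choices for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

rng = random.Random(3)   # fixed seed so the run is repeatable
n_people = 20            # a small sample
n_candidates = 200       # unrelated variables searched through
critical_r = 0.444       # approx. 5% two-tailed threshold for |r| at n = 20

target = [rng.gauss(0, 1) for _ in range(n_people)]
candidates = [[rng.gauss(0, 1) for _ in range(n_people)]
              for _ in range(n_candidates)]

rs = [pearson_r(c, target) for c in candidates]
false_hits = sum(abs(r) > critical_r for r in rs)
strongest = max(rs, key=abs)
print(f"{false_hits} of {n_candidates} noise variables look 'significant'")
print(f"strongest spurious correlation: r = {strongest:.2f}")
```

Since every variable is random noise, each "significant" result is a false positive. On average about 5% of the candidates, so roughly ten here, should clear the bar by luck alone.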

Regression to the Mean — A Subtle Trap

One of the subtler ways correlation misleads is through an effect called regression to the mean. Extreme values tend to be followed by less extreme values simply because exceptional outcomes contain a luck component. A student who scores in the top 5% one week tends to score lower the next — not because anything changed, but because performing exceptionally requires things to go unusually well, and that doesn't happen consistently. This creates the illusion of causation: intervene after an unusually bad result, observe a subsequent improvement, conclude the intervention worked. Many ineffective interventions have persisted for years on exactly this logic.
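
Regression to the mean falls out of a very simple model: score = stable skill + fresh luck each week. In the sketch below nothing happens between the two weeks, no teaching, no intervention. The distributions (mean skill 70, spread 10, luck spread 10) are assumptions picked for illustration.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 10_000
skill = [rng.gauss(70, 10) for _ in range(n)]    # stable ability
week1 = [s + rng.gauss(0, 10) for s in skill]    # skill plus this week's luck
week2 = [s + rng.gauss(0, 10) for s in skill]    # same skill, new luck

# Rank students by week-1 score and take the bottom and top 5%
ranked = sorted(range(n), key=lambda i: week1[i])
cutoff = n // 20
bottom, top = ranked[:cutoff], ranked[-cutoff:]

top_w1 = mean(week1[i] for i in top)
top_w2 = mean(week2[i] for i in top)
bottom_w1 = mean(week1[i] for i in bottom)
bottom_w2 = mean(week2[i] for i in bottom)

print(f"top 5%:    week 1 = {top_w1:.1f}, week 2 = {top_w2:.1f}")
print(f"bottom 5%: week 1 = {bottom_w1:.1f}, week 2 = {bottom_w2:.1f}")
```

The top group should still beat the average in week 2, just by less, and the bottom group should "improve" without anyone doing anything. Any intervention applied to the bottom group between the two weeks would have appeared to work.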

Reverse Causation — Another Common Mistake

Sometimes the causal arrow runs in the opposite direction from the one implied. In medical research, sick people often eat less, so poor diet can appear correlated with illness even when the illness is driving the dietary change, not the other way around. A related trap is the hidden common cause. Suppose a survey found that people who carry an umbrella are more likely to get wet than people who don't. That sounds absurd until you recognise that people carry umbrellas because they expect rain. The weather causes both the umbrella and the wetness; the umbrella doesn't cause the wetness. Strictly speaking, that is confounding rather than a reversed arrow, but the mistake is the same: reading the direction of cause off the correlation.

How to Read Correlation Claims Critically

When a headline tells you "people who do X have better Y outcomes", the questions worth asking take seconds but change everything. Was this a randomised trial (which can support a causal claim) or an observational study (association only)? What confounders might explain the relationship, that is, what else differs between people who do X and people who don't? How large was the sample, and how large is the reported effect? Was the hypothesis pre-specified, or found by searching through many possible correlations until one looked significant? Could reverse causation explain it, so that Y causes X rather than X causing Y? Asking these questions consistently makes you a meaningfully more reliable interpreter of statistical claims, which in health reporting, financial journalism, and public policy are never in short supply.

Related calculators

Fraction Calculator

Percentage Increase / Decrease Calculator

Related articles

Mean Median Mode Explained

Construction Measurement Mistakes

Everyday Maths You Use Without Realising

How Misleading Graphs Change Perception

How Much Paint Do You Really Need

How To Calculate Probability