From (what else?) a debunking of one of Gladwell’s heroes:
In statistics, you can’t judge the predictive oomph of anything without knowing the population prevalence of the event or condition you’re studying. Here’s a simple way to see how easy it is to fall into what they call, in the field, “base-rate neglect”: Suppose you’re told that a man named John is extremely well-educated, smokes a pipe, and wears tweed jackets with patches on the sleeve—is he more likely to be a particle physicist or a janitor? A physicist, you immediately think. But you’d likely be wrong, because janitors are common and particle physicists rare. The chances that you’d happen upon a very well-educated, tweed wearing, pipe-smoking janitor are higher than those that you’d meet a physicist who meets the same profile.
This ends up being a crucial skill in understanding public policy, educational research, personal medical decisions, whatever. And most people get thrown for a loop by it every time. It’s going to be one of the things we cover in our statistical literacy course.
The classic example, of course, is a medical test. Say the accuracy of a test for Cancer X is 90%, meaning 90% of people who have it test positive and 90% of people who don't have it test negative. Now say that the prevalence of that form of cancer is 0.5% of the population over 40. And let’s say we test EVERYONE in the population.
You get back a positive result from the test. Assuming no other information, what’s the chance that you have Cancer X?
Most people think that a positive result means they have a 90% chance of having it. In reality, a positive result in this case means you have a 4.3% chance.
To understand how this works, consider a population of 2000 above-40 adults. Out of those 2000 people, 10 actually have Cancer X. Nine of those people get positive results, per the test accuracy.
The other 1990 people don’t have it, but 10% of them get mistaken positive results. That’s 199 people.
So 208 people get positive results back, but only 9 of them actually have it. If you get back a positive result, your chances that you actually have it are 9/208, or about 4%.
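The head count above can be checked with a few lines of Python. This is only a sketch: the function name `positive_test_counts` and its parameters are made up for illustration, and it assumes "90% accuracy" means that 90% of people with the cancer test positive and 10% of people without it test positive by mistake.

```python
def positive_test_counts(population, prevalence, sensitivity, false_positive_rate):
    """Return (true_positives, false_positives) as whole-person counts."""
    sick = round(population * prevalence)       # people who really have it
    healthy = population - sick                 # everyone else
    true_positives = round(sick * sensitivity)
    false_positives = round(healthy * false_positive_rate)
    return true_positives, false_positives


# Numbers from the example: 2000 adults over 40, 0.5% prevalence,
# 90% of the sick test positive, 10% of the healthy test positive.
tp, fp = positive_test_counts(2000, 0.005, 0.90, 0.10)
chance_if_positive = tp / (tp + fp)
print(tp, fp, f"{chance_if_positive:.1%}")  # prints: 9 199 4.3%
```

Counting whole people rather than multiplying probabilities mirrors the way the example is laid out, which tends to be the easier version to explain to a non-statistician.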
Now say you test only people that have a family history of Cancer X and demonstrate symptom Y. And let’s say the prevalence in that population is 5%.
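Take 2000 people from that group: 100 now have Cancer X, and 90 of them test positive. Of the 1900 who don’t have it, 10% (190 people) test positive by mistake. The equation goes from 9/208 to 90/280, so a positive result now means about a 32% chance that you actually have it. Same test, very different answer.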
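To see how much the answer depends on who gets tested, the same arithmetic can be written as a function of prevalence. Again a hedged sketch: `positive_predictive_value` is an illustrative name, the 90% and 10% defaults carry over the reading of "accuracy" used in the example, and the 20% and 50% prevalence values are invented purely to show the trend.

```python
def positive_predictive_value(prevalence, sensitivity=0.90, false_positive_rate=0.10):
    """Chance that a person with a positive result actually has the condition."""
    true_pos = prevalence * sensitivity                  # share of everyone: sick and positive
    false_pos = (1 - prevalence) * false_positive_rate   # share of everyone: healthy but positive
    return true_pos / (true_pos + false_pos)


# 0.5% and 5% are the two populations in the example;
# 20% and 50% are hypothetical values added for comparison.
for prevalence in (0.005, 0.05, 0.20, 0.50):
    ppv = positive_predictive_value(prevalence)
    print(f"prevalence {prevalence:.1%}: positive result means {ppv:.1%} chance")
```

The test itself never changes in this loop. Only the base rate does, and that alone moves the meaning of a positive result.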
Life or death stuff for those making medical decisions, and crucial for understanding much research. But almost no one knows it. It’s things like this that have got me to delve into teaching statistical literacy.