Sometimes a single poll diverges from the pack of results generated by everyone else. How can you tell when the pollster is doing a better job of picking up a new trend versus simply being wrong?
Peter Kellner offers an instructive account of how these events occur, using as an example a poll that shows two political parties in a neck-and-neck race while every other poll has one party ahead. It’s a UK example, but that doesn’t matter; the value of the essay is its discussion of how to weigh poll respondents who say they have no preference, how small samples affect conclusions, and the like.
The article is worth reading both for its clarity and its modesty. Kellner’s own polling firm disagrees with the outlier poll, but he remains balanced and gentlemanly throughout his critique.
The time frame of a statistic is often edited to mislead the listener
I was watching a baseball game with a mathematician friend, during which the announcer said of a batter “He’s been on a hot streak, with 6 hits in his last 19 at bats”.
My friend said “Which means he has 6 hits in his last 20 or more at bats”.
Of course this was true, because if the batter’s hot streak went back farther than his last 19 at bats, the announcer would have extended the range of his statistic accordingly. The announcer was trying to make a point, and a 7-for-20 or 8-for-21 run of hitting sounds hotter than a mere 6-for-19.
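The announcer’s trick is easy to mechanize. Here is a minimal sketch in Python (the at-bat record is randomly generated for illustration, not real data) that scans every trailing window of a season and reports whichever one makes the hitter look hottest:

```python
import random

def hottest_streak(at_bats, min_len=10):
    """Scan every trailing window of a hit/out record (1 = hit, 0 = out)
    and return the (average, hits, length) of the most flattering one."""
    best = None
    for length in range(min_len, len(at_bats) + 1):
        window = at_bats[-length:]
        hits = sum(window)
        avg = hits / length
        if best is None or avg > best[0]:
            best = (avg, hits, length)
    return best

# A career .250 hitter over a season's worth of at bats:
random.seed(1)
season = [1 if random.random() < 0.25 else 0 for _ in range(500)]
avg, hits, length = hottest_streak(season)
print(f"He has {hits} hits in his last {length} at bats ({avg:.3f})")
```

By construction the cherry-picked window can never look worse than the player’s overall record, which is the whole point of the trick: the endpoint was chosen after looking at the data.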
Pundits often use this trick to make ominous-sounding pronouncements that are in fact content-light, for example when forecasting who will or will not get elected president. Hillary Clinton will never be in the Oval Office, by the way, because no candidate whose first name starts with an H has been elected President in the past 64.5 years.
Arbitrary ranges can also be used to make the case for a historical trend that may not actually be substantive. At the end of the 1980s, some Republicans crowed that Democrats had lost “5 of the past 6” presidential elections. Today, some Democrats brag that Republicans have failed to win the popular vote in precisely the same arbitrary timespan.
Politicians also snooker voters with arbitrary ranges. If Mayor Jones says that crime is down for 5 straight months, s/he may well be covering over the fact that crime was up in the months before that. Be particularly sceptical when a politician’s pet initiative, launched long ago, is said to have “not really gotten off the ground” until the precise moment that the referenced time range of an upbeat statistic begins.
Obviously, range-based statistics have to have some starting point, and that’s fine as long as it is meaningful: a new person comes into office, a new law is passed, the fiscal year’s budget comes into force, a war begins or ends, etc. If there isn’t a qualitative demarcation point like that at the start of a referenced time range, the person quoting the statistic is probably either fooling himself or trying to fool you.
Howard Wainer and Harris Zwerling provide a valuable lesson in how not to be fooled by small samples when trying to understand school performance.
Imagine that I told you with great excitement that I had flipped a coin “repeatedly” and it came up heads 100% of the time. You’d probably be surprised and wonder what could explain such an unlikely event: Was the coin two-headed? Was it of unbalanced weight? Did I have some weird method of flipping a coin such that it only came up heads?
But then imagine I told you that by “repeatedly” I mean that I flipped it twice. It came up heads both times, so that’s 100% heads. You’d be exasperated that I was excited about something so trivial, because such a result is a million miles from rare: with a fair coin there is a one-in-four chance of it. 100% of only two coin flips coming up heads is at best yawn-inducing.
Hold that example in your head and then consider the fact that whenever schools are ranked on student performance, small schools are always over-represented at the very top of the list. If you are a proponent of small schools, you might explain this as due to the fact that such schools are more home-like, that the teachers give students more individual attention, and that the kids have a sense of community. If you’re an opponent of small schools, you might argue that small schools do well because they tend to be exclusive places that screen out kids with disabilities and expel kids who pose behavior problems.
But if you knew the brilliant work of Wainer and Zwerling, you would recognize that the over-representation of small schools at the high end of performance is exactly as impressive as flipping a coin twice and having it come up heads 100% of the time, i.e., not at all.
The link between the two examples is small sample size. Small samples are more prone to extreme scores. Even though a fair coin will come up heads 50% of the time, having it come up heads on 100% of only two flips is common. In contrast, flipping a coin 20 times and having it come up 100% heads would be shocking. The larger the sample of coin flips gets, the closer the result is to the boring old true score of 50% heads.
Small schools by definition have fewer students than big schools. That means they will be more prone to extreme scores on any measurement, whether it’s academic performance, shoe size or degree of enjoyment of pistachio ice cream. The average test scores of 50-kid schools will be more likely to be very high than are the average scores of 500-kid schools even if the students in the two types of schools are perfectly identical in terms of ability.
Not incidentally, the tendency toward extreme scores in small samples is bidirectional. That’s why small schools are also over-represented at the very bottom of the list when schools are ranked on performance measures. As Wainer and Zwerling point out, the Gates Foundation was one of many charities that invested in small schools after looking only at the top of achievement lists. If they’d looked at both ends of the distribution and seen all the small schools with extremely poor scores, they would have known that there was nothing more complex at work than small samples being prone to extreme scores.
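You can see the whole phenomenon in a few lines of simulation. The sketch below (all numbers invented for illustration) gives every student a score drawn from the same distribution, so that no school is “really” better than any other, and then ranks 50-kid schools against 500-kid schools:

```python
import random

random.seed(0)

def school_mean(n_students):
    """Average test score of one school whose students all draw
    from the identical distribution (mean 500, sd 100)."""
    return sum(random.gauss(500, 100) for _ in range(n_students)) / n_students

# 500 small schools (50 kids each) and 500 large schools (500 kids each).
schools = [("small", school_mean(50)) for _ in range(500)] + \
          [("large", school_mean(500)) for _ in range(500)]
schools.sort(key=lambda s: s[1], reverse=True)

top25 = [size for size, _ in schools[:25]]
bottom25 = [size for size, _ in schools[-25:]]
print("small schools in top 25:   ", top25.count("small"))
print("small schools in bottom 25:", bottom25.count("small"))
```

Even though every student is statistically identical, the small schools crowd both the top and the bottom of the ranking, because a 50-student average wanders much farther from 500 than a 500-student average does.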
It’s always nice when you learn, years later, that a student actually remembers something you said. It’s even nicer when it’s something you’d forgotten ever saying.
Had coffee this week with someone I had in class maybe 18 years ago, now doing a fairly senior job in Washington. She said her current job had convinced her that I was right to say (as she recalls): “Never believe a statistic you didn’t make up yourself.”
Alejandro Hope, until recently in charge of organized-crime analysis at CISEN (the Mexican intelligence service) shreds the latest nonsense numbers from the UN Office on Drugs and Crime.
Part of the problem with drug policymaking at every level is its almost complete detachment from reality. In particular, statistics about drug volumes and revenues are more or less invented at random, then passed around like bad pennies.
I have this flagged under “Lying in Politics,” but “lying” isn’t really the mot juste. This is more what Harry Frankfort calls “bullsh*tting”: the statements are false, but not really intended to deceive. Stating them, and citing them, are ritual gestures: socially meaningful, but essentially content-free. (And debunking them is one of the central themes Jon Caulkins, Angela Hawken, and I pursue in Drugs and Drug Policy, whose subtitle should really be, not “what everyone needs to know,” but “what everyone needs to stop believing.”
Footnote: More from Alejandro at his new blog (mentioned earlier by Keith). It’s called “Plata o plomo” (“Silver or lead”), the proverbial choice offered to law enforcement officials by Latin American drug traffickers: take our money or face our bullets.