Imagine that I told you with great excitement that I had flipped a coin “repeatedly” and it came up heads 100% of the time. You’d probably be surprised and wonder what could explain such an unlikely event: Was the coin two-headed? Was it of unbalanced weight? Did I have some weird method of flipping a coin such that it only came up heads?

But then imagine I told you that by “repeatedly” I mean that I flipped it twice. It came up heads both times so that’s 100% heads. You’d be exasperated that I was excited about something so trivial because clearly such a result is a million miles from rare: 100% of only two coin flips coming up heads is at best yawn-inducing.

Hold that example in your head and then consider the fact that whenever schools are ranked on student performance, small schools are always over-represented at the very top of the list. If you are a proponent of small schools, you might explain this as due to the fact that such schools are more home-like, that the teachers give students more individual attention, and that the kids have a sense of community. If you’re an opponent of small schools, you might argue that small schools do well because they tend to be exclusive places that screen out kids with disabilities and expel kids who pose behavior problems.

But if you knew the brilliant work of Wainer and Zwerling, you would recognize that the over-representation of small schools at the high end of performance is exactly as impressive as flipping a coin twice and having it come up heads 100% of the time, i.e., not at all.

The link between the two examples is small sample size. Small samples are more prone to extreme scores. Even though a fair coin will come up heads 50% of the time, having it come up heads on 100% of only two flips is common (it happens one time in four). In contrast, flipping a coin 20 times and having it come up 100% heads would be shocking (the odds are about one in a million). The larger the sample of coin flips gets, the closer the result is to the boring old true score of 50% heads.
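The arithmetic is easy to check. A minimal sketch (the only input is the fair coin’s 50% heads probability):

```python
# Probability that a fair coin comes up heads on every one of n flips is 0.5 ** n.
for n in (2, 20):
    p = 0.5 ** n
    print(f"{n} flips, all heads: p = {p} (about 1 in {round(1 / p):,})")
```

Two flips of all heads is a 1-in-4 event; twenty flips of all heads is roughly 1 in a million.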

Small schools by definition have fewer students than big schools. That means they will be more prone to extreme scores on any measurement, whether it’s academic performance, shoe size or degree of enjoyment of pistachio ice cream. The average test scores of 50-kid schools will be more likely to be very high than are the average scores of 500-kid schools *even if the students in the two types of schools are perfectly identical in terms of ability*.

Not incidentally, the tendency toward extreme scores in small samples is bidirectional. That’s why small schools are also over-represented at the very bottom of the list when schools are ranked on performance measures. As Wainer and Zwerling point out, the Gates Foundation was one of many charities that invested in small schools after looking only at the top of achievement lists. If they’d looked at both ends of the distribution and seen all the small schools with extremely poor scores, they would have known that there was nothing more complex at work than small samples being prone to extreme scores.
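The claim is easy to verify by simulation. In this hypothetical sketch (the school sizes of 50 and 500, the score distribution of mean 100 and SD 15, and the counts are all assumptions for illustration), every student is drawn from the identical distribution, yet small schools crowd both tails of the ranking:

```python
import random

random.seed(0)
SMALL, BIG, N_EACH = 50, 500, 200

def school_mean(size):
    # Every student's score comes from the same distribution (mean 100, SD 15),
    # so any difference between school types is pure sampling noise.
    return sum(random.gauss(100, 15) for _ in range(size)) / size

schools = ([("small", school_mean(SMALL)) for _ in range(N_EACH)]
           + [("big", school_mean(BIG)) for _ in range(N_EACH)])
schools.sort(key=lambda s: s[1])  # rank all 400 schools by average score

bottom20 = [kind for kind, _ in schools[:20]]
top20 = [kind for kind, _ in schools[-20:]]
print("small schools among the bottom 20:", bottom20.count("small"))
print("small schools among the top 20:", top20.count("small"))
```

Both the top and bottom of the ranking end up dominated by 50-student schools, purely because their averages are noisier.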

A really strong effect in a small sample size can point you in an interesting direction for doing a bigger study, but that’s it. It should never be taken at face value, and one of the things I always look for when I hear about some interesting study is the sample size.

Interesting post, and I only know enough stats to be wary of them, but are you confusing the idea of small class size with small schools?

Also, what about measures that aren’t test scores, like attendance and how many books they read? I think the new frontier in the ed wars should be in consumer surveys!

And aren’t we really talking about money here? (again? always?)

It doesn’t matter what is measured, and it doesn’t matter whether you are looking at small class samples or small school samples: small samples will be more likely to sit at the extremes of the distribution of whatever you are measuring than larger samples will.

Over time though, won’t the rankings also be more unstable for these schools if the small sample is really the cause of their high position?

Yes. In any given year, small schools will be over-represented in the top ranks, but the specific small schools in that group can change every year. Big schools, in contrast, would be unlikely to move dramatically in either direction from year to year because their performance is more precisely measured each year.
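That churn can also be simulated. A hypothetical sketch (same assumed setup as before: 200 schools of 50 students each, identical ability, two independent years of scores) counts how many of one year’s top-20 schools stay on top the next year:

```python
import random

random.seed(1)
N_SCHOOLS, SIZE = 200, 50

def year_means():
    # Re-draw every student's score each year: same true ability, fresh noise.
    return [sum(random.gauss(100, 15) for _ in range(SIZE)) / SIZE
            for _ in range(N_SCHOOLS)]

year1, year2 = year_means(), year_means()
top1 = set(sorted(range(N_SCHOOLS), key=lambda i: year1[i])[-20:])
top2 = set(sorted(range(N_SCHOOLS), key=lambda i: year2[i])[-20:])
print("schools in the top 20 both years:", len(top1 & top2))
```

Because the rankings are driven by noise, the overlap hovers around what chance alone predicts (about 2 of 20, since the top 20 is the top 10%), not the 20 of 20 that a stable “good schools” story would require.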

Alternatively, if a small school, with an even smaller faculty, lucks into a good sample of teachers, its success could persist for some years and would actually represent a real, if difficult to replicate, source of genuinely good performance: good staff.

But isn’t there a way to correct for that? It seemed to me that stats class was all about figuring out what kind of error you had and working around it somehow. I am sorry to say, I don’t remember it well enough to try now.

You can’t “correct for it,” but you can report confidence intervals around the estimates, which would at least lead sophisticated readers to know to disregard some findings that would otherwise seem compelling.
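A minimal sketch of what such reporting shows (assuming a student-level score SD of 15 and the usual normal-approximation 95% interval of ±1.96 standard errors — both assumptions for illustration):

```python
import math

SD = 15  # assumed student-level standard deviation of scores
for n in (50, 500):
    # Standard error of the school's mean shrinks with the square root of n.
    half_width = 1.96 * SD / math.sqrt(n)
    print(f"n = {n}: school mean ± {half_width:.1f} points (95% CI)")
```

A 50-student school’s average carries an interval roughly three times wider (about ±4.2 points) than a 500-student school’s (about ±1.3), which is exactly why its ranking bounces around more.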

Sounds good to me! I vaguely remember something about a 95/5… must dig out textbook.

I just read this in Kahneman’s *Thinking, Fast and Slow* and found it fascinating. Those who have a hard time separating out the “yabbut, what about…” with respect to schools should read Kahneman on kidney cancer. Highest rates of incidence in sparsely populated counties. And… lowest rates of incidence in sparsely populated counties.

And if the coin flip analogy isn’t helpful for you, I found Kahneman’s marbles in a jar analogy much more helpful.

The great Kahneman does discuss this, but it’s Wainer and Zwerling’s work he is summarizing (indeed he credits them for the kidney cancer example in the book).

True.

Keith, I thought your discussion of this issue resonated with your usual intelligence, but a troubling moral problem kept bugging me. I’ve read pieces discussing the statistics of roadside bombs vs. bullets as effective killers of our young people, but the issue seemed to me to be the killing. In your discussion, what looms in the background is this odious business of national or statewide testing of our young people. In my view this is the worst thing to happen to education in my lifetime, which is almost eighty years. I’m sure you are familiar with all the attacks against this practice, but it seems that big business and the Gates people have the upper hand.

While I think Keith has pointed to a valid statistical issue, it does seem to me that you’re on to something in widening the context here. My own hobby horse when it comes to “school reform” is the impression that the one attribute held in common by more “successful” operations is a higher ratio of adults in actual contact with students to students. That’s achieved in several ways; in charter schools, for example, by lowering teacher pay and hiring more teachers, and/or not enrolling special-needs students, and/or flattening administrative structures, and/or bringing in non-certified volunteer aides, etc. (That’s when there isn’t outright cheating, of course.) I’d love to see someone do a summary/comparative study that deals only with that one variable, assuming it hasn’t been done yet.

The extremes may be self-perpetuating. We’re top! (by random variation): this encourages students and teachers to perform to stay up, and the school attracts genuinely above-average new students. We’re bottom! (again by random variation): students and teachers get discouraged, parents start looking for exits, recruitment falls off. Quite a lot of the great inequality of outcome we see everywhere comes from positive feedback loops amplifying smaller random variations.

Yes — and increased/decreased confidence and morale may also play a role.

And, unfortunately, this needs repeating in the current climate even though most of us do know it: all of this discussion assumes that ‘school performance’ is a single, easy-to-measure number. Which is clearly false.

[By the way, I have to say, Brett has posted something logical, clear, and reasonably insightful! Let’s hope this keeps up!]