Well, it’s coming up on election day and the news is wall-to-wall polls, so it’s time for the biennial lecture on polling errors.

Seven polls have come out in October. Kean was ahead only in one, and that one was released ten days ago. If you toss out the Zogby poll which has Menendez up by ten points (I’m having less and less confidence in Zogby’s numbers), Menendez’s average lead is only like 3 or 4 points. The statisticians will note that that’s probably not a statistically significant margin. But when all the polls are coming up with the same narrow margin, I think you can say that Menendez is now back on top.

No, the statisticians probably won’t note any such thing. The sampling error falls in inverse proportion to the square root of the sample size. A poll with 2000 respondents has half the sampling error of a poll with 500 respondents.
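To put numbers on that, here is a minimal sketch of the textbook margin-of-error formula for a simple random sample. It assumes a 95% confidence level (z = 1.96) and the worst case of a 50% share; the function name is only for illustration.

```python
import math

def margin_of_error(n, share=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points, for one
    candidate's share in a simple random sample of n respondents."""
    return 100 * z * math.sqrt(share * (1 - share) / n)

moe_500 = margin_of_error(500)    # roughly plus/minus 4.4 points
moe_2000 = margin_of_error(2000)  # roughly plus/minus 2.2 points
print(f"n=500: +/-{moe_500:.1f}   n=2000: +/-{moe_2000:.1f}")
```

Quadrupling the sample halves the error, which is why precision gets expensive quickly.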

Four polls with 500 respondents each can be treated as a single poll with 2000 respondents. So if we assume that the six polls have sample sizes that produce sampling errors of between plus/minus 3 and plus/minus 4, which is the usual range, then the sampling error for the collection of six polls would be roughly plus/minus 3.5 divided by the square root of 6, which is plus/minus 1.4. A lead of three to four points is therefore (barely) outside the range of sampling error. (If you had the actual polls in front of you, you’d want to weight each one by its sample size and figure out a weighted average lead, and then figure a sampling error from the pooled sample sizes.)
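The pooling arithmetic can be sketched as follows. The poll figures in the example are made up purely for illustration (they are not the actual New Jersey polls), and the margin uses the worst-case 50% share, so treat this as a rough sketch rather than what any pollster actually does.

```python
import math

def pool_polls(polls, z=1.96):
    """Pool polls given as (sample_size, share_a, share_b) tuples.

    Returns the sample-size-weighted lead of A over B and the 95% margin of
    error on a single share for the pooled sample (both in percentage
    points), plus the pooled sample size.
    """
    total_n = sum(n for n, _, _ in polls)
    share_a = sum(n * a for n, a, _ in polls) / total_n
    share_b = sum(n * b for n, _, b in polls) / total_n
    moe = 100 * z * math.sqrt(0.25 / total_n)  # worst case: share of 50%
    return 100 * (share_a - share_b), moe, total_n

# The back-of-envelope shortcut from the text: six polls at about +/-3.5 each.
shortcut = 3.5 / math.sqrt(6)

# Hypothetical data: four identical polls of 500 behave like one poll of 2000.
lead, moe, total_n = pool_polls([(500, 0.48, 0.44)] * 4)
print(f"shortcut: +/-{shortcut:.1f}")
print(f"pooled n={total_n}, lead={lead:.1f}, share margin +/-{moe:.1f}")
```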

That’s good news if you want Menendez to win.

But it’s only modified good news.

First, there’s nothing magic about the estimated error band. Some journalists write as if any margin outside the band is certain, and any margin inside it is meaningless (the phrase you see is “virtually tied”). That’s just wrong. What the error band means is that if there had been a second poll taken at the same time — same questions, same sampling frame, same interviewers — nineteen times out of twenty the results of the second poll would differ from the results of the first by less than the plus/minus amount.

So if we’re told that X is ahead of Y in the latest poll by 50%-45%, plus or minus 3%, that means that if twenty more polls had been done in the same way at the same time, in nineteen cases the results would have been between 53-42 and 47-48 the other way. So being up 4 points in a survey with a sampling error of plus/minus 3 isn’t a lead outside the confidence interval: because the plus/minus applies to each candidate’s share separately, your lead needs to be about twice the plus/minus figure to give you that 95% assurance.
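For anyone who wants to check the "twice the plus/minus" rule, here is a small sketch using the standard variance of a difference between two shares drawn from the same sample. The sample size of 1067 is my assumption: it is simply the size that gives about plus/minus 3 on a single share.

```python
import math

def lead_margin(share_a, share_b, n, z=1.96):
    """Approximate 95% margin of error, in points, on the lead (A minus B)
    when both shares come from the same simple random sample of size n."""
    diff = share_a - share_b
    # Variance of the difference of two shares from one multinomial sample.
    variance = (share_a + share_b - diff ** 2) / n
    return 100 * z * math.sqrt(variance)

n = 1067  # assumed sample size, giving about plus/minus 3 on one share
lead_moe = lead_margin(0.50, 0.45, n)
print(f"lead: 5.0 points, margin on the lead: +/-{lead_moe:.1f}")
```

The margin on the lead comes out a little under 6 points (slightly less than double, because some respondents are undecided), so a 5-point lead is still inside it.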

On the other hand, a lead within the confidence interval is still, statistically, a lead: you’d rather be up 2 plus/minus 3 than down 2 plus/minus 3. If anyone wanted to bother, you could compute the actual probability that the measured lead is entirely due to sampling error: that’s the p-value so beloved of the social science journals. But I’ve never seen it done for polling results.
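For the curious, here is a rough sketch of that calculation under a normal approximation. Strictly speaking, what it returns is how often a truly tied race would produce a measured lead at least this large, in either direction. The sample size is an assumption, and the function name is mine.

```python
import math

def tie_p_value(share_a, share_b, n):
    """Two-sided p-value for the measured lead under the hypothesis that
    the race is really tied (normal approximation, simple random sample)."""
    diff = share_a - share_b
    # If the race is tied, the lead has variance (share_a + share_b) / n.
    se = math.sqrt((share_a + share_b) / n)
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

p_value = tie_p_value(0.50, 0.45, 1067)  # 50-45 in a poll of about 1067
p_tied = tie_p_value(0.47, 0.47, 1067)   # a measured tie
print(f"50-45: p = {p_value:.3f}   47-47: p = {p_tied:.3f}")
```

A 50-45 result in a poll of that size gives a p-value a bit under 0.10: not "significant" at the conventional 5% level, but a long way from meaningless.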

Second, and more important, *sampling error is only one kind of error, and rarely the largest*. Systematic, or non-sampling, errors are the sources of inaccuracy that won’t go away if you do the same poll twice. The sampling frame may not accurately reflect the population that will actually vote. The non-response group (people selected to be sampled but who couldn’t be reached or who wouldn’t answer) may be tilted toward one candidate or the other. The questions or the interviewers may impart some bias. (If you ask about the Iraq war before you ask who the respondent intends to vote for this year, that’s going to give the Democrat a couple of extra points.) The voter might misreport his intentions or change his mind. And of course the votes as counted may differ from the votes as cast, by an unknowable margin.

So a polling lead “outside the margin of error” doesn’t mean that your guy is actually 95% likely to win. That’s why polls tend to use small samples. Given the non-sampling errors, there’s not much point in increasing sample size much past 1000 if all you want to do is guess the horserace. The great advantage of larger samples isn’t increased precision about the end result: it’s that the subgroups — suburban women with children, for example — get big enough so you can start having confidence in the story they’re telling.

Everyone should read this, especially those writers who insist on calling anything within the confidence limits a "virtual dead heat."

One of my own pet peeves is the fact that results are rounded to the nearest whole percent. A 52-48 lead could be anything from 52.4-47.6, or 4.8%, down to 51.6-48.4, or 3.2%. That looks like a big difference.

Outstanding diary. Should be repeated monthly until we get it.

One question, and one comment:

Question: I've always wondered about the "plus/minus" reported for polls. Is this the standard error, or the halfwidth of the confidence interval (which would be approximately twice the standard error)? I gather from this post that the plus/minus is the halfwidth of the confidence interval for the individual candidates' vote shares. What may perhaps explain my confusion is that this means that the same plus/minus figure is only half of the halfwidth of the confidence interval for the size of the lead, because the candidates' vote shares are perfectly negatively correlated.

Comment: You write that "If anyone wanted to bother, you could compute the actual probability that the measured lead is entirely due to sampling error…" You can't do that. You can compute the probability that we would have gotten a measurement like this had the race in fact been tied. But that's different. Only a Bayesian with a prior distribution for the size of the lead could do what you propose. This is somewhat pedantic, I realize, but I think it is also important: A non-Bayesian simply can't make statements like "your guy is actually 95% likely to win." And the media discussion of changes in polling numbers from week to week would be a lot more informative if they could be a bit Bayesian about it, using prior polls to form priors for the current distribution. But that'll be the day…
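A minimal sketch of what "a bit Bayesian" could look like, assuming a normal prior for the lead built from earlier polls and a normal likelihood for the new poll. All the numbers are hypothetical and the function names are mine.

```python
import math

def posterior_lead(prior_mean, prior_sd, observed_lead, lead_se):
    """Combine a normal prior on the lead with a new poll's measured lead
    (normal likelihood). Returns the posterior mean and standard deviation."""
    w_prior = 1 / prior_sd ** 2   # precision of the prior
    w_data = 1 / lead_se ** 2     # precision of the new poll
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * observed_lead)
    return post_mean, math.sqrt(post_var)

def prob_ahead(mean, sd):
    """Probability that a normally distributed lead is greater than zero."""
    return 0.5 * math.erfc(-mean / (sd * math.sqrt(2)))

# Hypothetical: earlier polls suggest a lead of 3 (sd 2); the new poll
# shows a lead of 5 with a standard error of 3.
post_mean, post_sd = posterior_lead(3.0, 2.0, 5.0, 3.0)
p_ahead = prob_ahead(post_mean, post_sd)
print(f"posterior lead {post_mean:.1f} +/- {post_sd:.1f}, P(ahead) = {p_ahead:.3f}")
```

Only with a prior like this does a statement such as "the probability that X is really ahead" have a definite meaning, and even then it covers sampling error alone.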

You write: "So if we're told that X is ahead of Y in the latest poll by 50%-45%, plus or minus 3%, that means that if twenty more polls had been done in the same way at the same time, in nineteen cases the results would have been between 53-42 and 47-48 the other way."

No. If the poll were done 20 times, then in 19 cases the results would have been within 3% of THE TRUTH (the truth being what one would find if one could have polled everyone in your population of interest, instead of just a sample). Your description is only true if that first sample happened to nail the true population breakdown on the head.
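A quick simulation makes the point concrete. This sketch assumes an idealized simple random sample from a population where the true share is known, which no real poll has; the parameter values are arbitrary.

```python
import random

def simulate_coverage(true_share=0.50, n=1000, trials=2000, seed=0):
    """Fraction of simulated polls whose result lands within the 95% margin
    of error of the TRUE share (not of some earlier poll's result)."""
    rng = random.Random(seed)
    moe = 1.96 * (true_share * (1 - true_share) / n) ** 0.5
    hits = 0
    for _ in range(trials):
        # Each respondent independently supports the candidate with
        # probability true_share.
        votes = sum(rng.random() < true_share for _ in range(n))
        if abs(votes / n - true_share) <= moe:
            hits += 1
    return hits / trials

coverage = simulate_coverage()
print(f"share of polls within the margin of the truth: {coverage:.3f}")
```

The fraction should come out close to nineteen in twenty.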

The sampling error used to estimate the quoted confidence intervals is the sampling error which would be associated with truly random sampling of the population of voters. The difficulty of actually doing a "random sample" — it's impossible, frankly — leads pollsters to use more practical methods, like dialing random phone numbers, and to compensate for the sampling bias by stratifying the sample in various ways and checking the stratification of the sample against known characteristics of the population.

A properly stratified sample actually has a lower sampling error than a random sample would have; by stratifying, you reduce the probability of an outlying sample. But calculating the actual sampling error in a stratified sample can be quite complicated.

Stratification means that you make an effort to see that your sample has similar proportions to the population in regard to certain characteristics: partisan identification, geographic residence, income or class, race, for example. To the extent that voting correlates with these characteristics, policing the stratification of the sample will reduce the sampling error, and this reduction can be very substantial.

So, the quoted sampling error is an overestimate of sampling error, per se. But, given the other sources of error you cite, it is regarded as a useful CYA metaphor.
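To illustrate the size of the effect, here is a sketch comparing a simple random sample with an idealized stratified sample using proportional allocation. The three strata and their support levels are invented, and real polls usually weight after the fact rather than sample this cleanly, so take the numbers as illustrative only.

```python
import math

def srs_margin(strata, n, z=1.96):
    """95% margin (points) for a simple random sample of size n.
    strata: list of (population_weight, support_for_candidate)."""
    p = sum(w * s for w, s in strata)  # overall support
    return 100 * z * math.sqrt(p * (1 - p) / n)

def stratified_margin(strata, n, z=1.96):
    """95% margin (points) for a stratified sample with proportional
    allocation: only the within-stratum variance remains."""
    within = sum(w * s * (1 - s) for w, s in strata)
    return 100 * z * math.sqrt(within / n)

# Hypothetical strata: two partisan blocs and a split group of independents.
strata = [(0.4, 0.85), (0.2, 0.50), (0.4, 0.15)]
srs_moe = srs_margin(strata, 1000)
strat_moe = stratified_margin(strata, 1000)
print(f"simple random: +/-{srs_moe:.1f}   stratified: +/-{strat_moe:.1f}")
```

The more strongly the vote correlates with the stratifying characteristic, the bigger the reduction.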

I haven't seen the specific Zogby poll(s) you discuss, but in the past I've always seen Zogby poll results described without "not sure" or "don't know" as a possible answer. If the Zogby folks are still doing that, it may account for their numbers being screwy.

One other note: You can only combine surveys done around the same time into a larger total sample if the questions are exactly the same.

I know that may seem like an obvious point, but not all vote questions are asked in the same way, and not all are preceded by the same questions (which can make a difference, as Mark points out).

Isn't one advantage of looking at a large population of polls that their systematic errors should tend to cancel?