Note on study design

Statistical power is what you need to make sure you can measure effects too small to care about.

Author: Mark Kleiman

Professor of Public Policy at the NYU Marron Institute for Urban Management and editor of the Journal of Drug Policy Analysis. Teaches the methods of policy analysis as applied to drug abuse control and crime control policy, working out the implications of two principles: that swift and certain sanctions don't have to be severe to be effective, and that well-designed threats usually don't have to be carried out.

Books:
- Drugs and Drug Policy: What Everyone Needs to Know (with Jonathan Caulkins and Angela Hawken)
- When Brute Force Fails: How to Have Less Crime and Less Punishment (Princeton, 2009; named one of the "books of the year" by The Economist)
- Against Excess: Drug Policy for Results (Basic, 1993)
- Marijuana: Costs of Abuse, Costs of Control (Greenwood, 1989)

29 thoughts on “Note on study design”

  1. Well, low statistical power combined with a publication bias in favor of “statistically significant” results is also a recipe for a literature with exaggerated effect estimates. For instance: suppose effects in the range of -10 to +10 units (of something) would be “expected” under the null hypothesis, and the true effect is +3. Then any statistically significant result would involve an effect estimate more than three times too large. Andrew Gelman has written on this several times; see for instance here:
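This exaggeration mechanism (what Gelman calls a "Type M error") can be simulated directly, using the numbers above: true effect +3, noise calibrated so that estimates of ±10 are expected under the null.

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 3.0    # the real effect, in the comment's units
SE = 5.0             # noise scale: under the null, ~95% of estimates land in [-10, +10]
CUTOFF = 1.96 * SE   # two-sided significance threshold at alpha = 0.05

# Simulate many studies; publication bias keeps only the "significant" ones
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]
published = [e for e in estimates if abs(e) > CUTOFF]

print(f"share reaching significance: {len(published) / len(estimates):.2f}")
print(f"mean published estimate: {statistics.mean(published):.1f}")
```

Every published estimate necessarily exceeds 9.8 in magnitude, so the "significant" literature averages more than three times the true effect of 3, just as the comment says.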

    1. One of my bugaboos is that bad language leads to bad thinking, even in scientific fields.
      The term “statistically significant” is a perfect example of this. I’ve seen an astonishing amount of literature that seems to think “statistically significant” means “significant, as proved by statistics”.

      [Note that this is a completely different problem from the OTHER, more technical, problem one sees in this field, the “we see an effect that is clear but is not statistically significant”, ie “we’ll see what we want to see in the data, regardless of everything we’ve been told about stochasticity”…
      This second problem is the one that is drummed into students repeatedly, but in the real world, I suspect it is the former problem that is actually more deadly.]

  2. The health benefits of aspirin in preventing heart disease and cancer are small, but definite. It required enormous samples (n>10,000) to prove this.

    In medical research, it’s unethical to continue a double-blind trial beyond the point of overwhelming probability, so in principle you should never reach certainty.

    Don’t knock plodding methodology. Look at the harm done by the shoddy Reinhart-Rogoff paper on debt ratios. An update to the Latin tag: Falsus in cellula L47, falsus in omnibus.
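For a rough sense of why small but real benefits of the kind described above force enormous trials, here is the standard normal-approximation sample-size calculation; the 2.0% vs. 1.5% event rates are hypothetical, chosen only for illustration, not taken from any actual aspirin trial.

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size to detect a difference between two
    event rates, via the usual normal-approximation formula."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# Hypothetical rates: 2.0% of controls vs 1.5% of treated have an event
print(n_per_arm(0.02, 0.015))
```

Shaving half a percentage point off a 2% event rate already pushes the trial past ten thousand participants per arm, which is the scale the comment describes.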

    1. “Debt ratios?”

      What about the unemployed?
      What about having to suffer right-wingers' smug assertions (for at least the next two decades) that it has been proved that debt should never be greater than 90% of GDP?

  3. “If your experiment requires statistical analysis, you ought to have done a better experiment.” – Rutherford

    1. That pretty much rules out modern particle physics, but there’s some real truth to it. Physical scientists and (as Katja points out below) logical scientists have a much easier life than biological and social scientists.

  4. …or when it costs hundreds or thousands of dollars to run a single subject.*

    *Hollah at my neuroscience homies! Yo!

  5. Having mentioned John Ioannidis in previous discussions of this general topic, it seems appropriate to mention what he says at the end of the Atlantic article:
    “Science is a noble endeavor, but it’s also a low-yield endeavor,” he says. “I’m not sure that more than a very small percentage of medical research is ever likely to lead to major improvements in clinical outcomes and quality of life. We should be very comfortable with that fact.”
    “Low yield” refers to breakthroughs, which are rare. Most progress in medicine occurs in small to moderate size steps, steps worth taking but which are neither leaps nor bounds. New treatments are a little bit better than existing treatments, and detecting the improvement requires sample sizes large enough to detect them. Most research leads to minor improvements which are nevertheless worth making.
    Some laboratory scientists do not need statistical tests commonly used in clinical research. If you are running a microbiology lab, you have ten trillion E. coli organisms in your test tube, and you do not need to get ten trillion consent forms from them to experiment upon them. E. coli organisms are all pretty much alike, and the experimenter who changes the nutrient composition of a petri dish can count on having only one source of variation to explain the outcome of the experiment. Ernest Rutherford did not need to worry about sample sizes when working with atoms. He had plenty of them to work with.
    But clinical experiments have to contend with multiple sources of variation, only one of which is the experimental intervention. This obliges the researcher to use statistical tests and to strive to recruit sufficient numbers of participants in order to have enough power to detect differences in outcome attributable to the intervention of interest. The effects are small enough to get lost amid other sources of variation, but may still be large enough to care about. Hence the need for statistical power.
    Speaking of noble endeavors, let us remember that William Gosset developed the t-test in order to improve the quality of Guinness Stout. Remember this the next time someone tells you that statistics has nothing to do with serving the needs of real life.
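The point about power amid multiple sources of variation can be sketched with a quick simulation; the 0.2-standard-deviation effect and the two trial sizes are illustrative, not drawn from any particular study.

```python
import random
import statistics

random.seed(0)

def power_sim(n: int, effect: float = 0.2, sd: float = 1.0, trials: int = 1000) -> float:
    """Fraction of simulated two-arm trials whose t statistic clears 1.96,
    for a small treatment effect buried in subject-to-subject variation."""
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0.0, sd) for _ in range(n)]
        treated = [random.gauss(effect, sd) for _ in range(n)]
        se = ((statistics.variance(control) + statistics.variance(treated)) / n) ** 0.5
        diff = statistics.mean(treated) - statistics.mean(control)
        if abs(diff / se) > 1.96:
            hits += 1
    return hits / trials

print(power_sim(n=20))    # small trial: usually misses the real effect
print(power_sim(n=500))   # large trial: usually finds it
```

The effect is perfectly real in every simulated trial; only the large trials have enough power to separate it from the between-subject noise.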

    1. “Low yield” refers to breakthroughs, which are rare. Most progress in medicine occurs in small to moderate size steps, steps worth taking but which are neither leaps nor bounds. New treatments are a little bit better than existing treatments, and detecting the improvement requires sample sizes large enough to detect them. Most research leads to minor improvements which are nevertheless worth making.

      It seems to me that this is depressing only to the extent that you view biology purely as a technology, a way to improve people’s lives. If that is your mindset, you’re probably best off as a doctor; but I suspect most of the people doing this “low-yield” work view what they are doing as biology, and the joy is not only in figuring out that alpha-whatitsname gooses the immune system to fight cancer better, it’s also in all the steps before then: learning exactly how the immune system works, learning that the way the immune system evolved in mammals is different in these significant ways from the way it evolved in arthropods, learning that this signaling molecule fills in the gap between “step A occurs and then somehow step C occurs”, and so on.

      I don’t want to be mean, but I suspect that it is the fields that are trying hardest to generate breakthroughs, and which are actually based on precious little real science (so there is none of the joy of learning small things along the way), which generate the most nonsense — most obviously nutritional epidemiology. The equivalent sort of study, but through alternative means (in this case physiology rather than epidemiology), seems to generate much less glamor, but does generate a steady stream of minor facts that are actually a secure basis for moving forward.

      1. Quite true that basic biology research need not result in dramatic cures of human disease in order to be well worth pursuing and funding. Very large pictures are built up of thousands of tiny pixels and later assembled into portraits which are lovely to behold.

        A quick example is the immunoglobulin superfamily, which not only mediates immune responses but also contains families of adhesion molecules which help cells stick together during development and later during maintenance of tissues. IIRC Metchnikoff came across phagocytosis while pursuing an understanding of how multicellular organisms evolved out of single cell organisms.

        The attempts to model a “War on Cancer” on the Apollo moon landings or on the Manhattan project illustrate what happens when you want big breakthroughs before understanding the underlying principles involved in governing the phenomena you are trying to control.

        1. I agree with all you said up to the criticism of “The War on Cancer”.

          IMHO this campaign is a perfect example of my point about biological research. The “War on Cancer” started off generating a huge amount of information about yeast genetics, cytology, and molecular biology — and received a fair bit of pillorying for it. But forty years on, this low-level scutwork forms the basis for everything from understanding how Gleevec works to understanding cancer genomics. I don’t know if there was ever a single “breakthrough” in “The War on Cancer”, but there has been long, sustained improvement, by a small amount each and every year. But pretty much all of it (IMHO) was built upon a foundation constructed by biologists interested in biology (and, of course, helped by physicists interested in physics, chemists interested in chemistry, engineers interested in making better computers, etc.), NOT based on the idea of a single breakthrough.

          From what I can tell, as an outsider just aware of the history of science over the past fifty years, and as someone who reads papers from a variety of disciplines, and with friends running high tech companies, science funding is one of the best functioning parts of the US, perhaps because the real decision makers understand my point above regarding foundations vs breakthroughs. The politicians may natter about breakthroughs, and may sell the project on those grounds, but fortunately (at least so far) when it comes time to spend the money, much of it goes to the unexciting foundational science and engineering. We can argue about whether the split is optimal, and you know my biases would be towards more of the foundational and less of the “immediately applied”, but the US still seems to have a vastly healthier balance in this respect than any other country I can think of.

          1. The “War on Cancer” as narrated in Siddhartha Mukherjee’s Emperor of All Maladies was conceived by Mary Lasker as a moon shot which should cure the disease within the next decade or so. Richard Nixon, who took it on as a favorite project, had no patience with basic scientists who wanted to research the biology of malignant cells, and sought no-nonsense managers who would just get the job done in a practical way. Basic scientists objected that the moon landings had been possible because their counterparts in physics had sufficiently discovered the principles of fluid mechanics and thermodynamics, but cancer biologists had not yet mastered the equivalent levels of understanding how cancer cells grow and metastasize. James Watson objected to the War on Cancer setup because he saw it as trying to go to the moon without knowing Newton’s laws of motion.

            I do not think we disagree on the principles, only on the narrative of how the War on Cancer was conceived and promoted in 1970 when the funding mechanisms were enacted by Congress following lobbying by advocacy groups who misunderstood the need for basic research before clinical successes could be expected.

  6. Consider this example: estimating how close a 5 km diameter asteroid hurtling through space will come to the Earth. The difference between missing us by one foot versus hitting us is a trivial distance in space terms, but we want to measure it with enormous statistical power (coming in this case not from big samples but from incredibly reliable measurement) because of the consequences of our estimate being even that slightly off.

    1. Keith said: “The difference between missing us by one foot versus hitting us is a trivial distance in space terms, but we want to measure it with enormous statistical power…”

      Bad model design. If the asteroid misses Earth by just “one foot,” it will nonetheless have non-trivial impacts on the atmosphere. I suspect we want to rule out an approach closer than 100 miles for any solution, but that’s just a WAG on my part.

      1. You know, I almost wrote it that way for the very reason you say, but realized it is correct either way, the difference between an asteroid making contact with us and not is large enough to care about.

        1. A cute cartoon but a bit of a straw man because even basic statistics classes teach all about adjusting for multiple comparisons.

          However, it is true that physical therapy for back pain is effective provided that the physical therapist is named Helga.

          1. Oh, no question they do. People who write headlines, OTOH, don’t take even basic statistics classes…

            However, you mistook the point, I think: You’re going to reach ‘statistical significance’ at a 95% confidence level one time out of 20 by chance even for utterly unrelated statistical studies. They don’t have to be “multiple comparisons”; one utterly worthless drug out of 20 is going to look good at that level.

            A 95% confidence level is really only good enough for screening, and should never be regarded as proof of causality.
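The one-in-twenty arithmetic is easy to check, both analytically and by simulation (a minimal sketch; the 20-drug screen is hypothetical):

```python
import random

random.seed(1)

ALPHA = 0.05
N_DRUGS = 20

# Analytic: chance that at least one of 20 worthless drugs clears p < .05
print(1 - (1 - ALPHA) ** N_DRUGS)

# Simulation: under the null, each drug's p-value is uniform on [0, 1]
trials = 50_000
hits = sum(
    any(random.random() < ALPHA for _ in range(N_DRUGS))
    for _ in range(trials)
)
print(hits / trials)
```

Both come out near 0.64: screening twenty duds at the 95% level, you will "discover" something almost two times in three.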

          2. One thing (out of many) that has always puzzled me is the origin of the whole 5% business. I remember that in my first statistics class there was a considerable discussion of types of errors and whatnot, and that what was considered significant should depend on the consequences of being wrong and so on. Then suddenly all that disappeared in a puff of smoke and there was only 5% left. What happened?

          3. byomtov asks, “What happened?”

            What happened is something that happens far too often. People do not bother to go read the primary source, preferring instead a more accessible secondary source.

            In his Statistical Methods for Research Workers, Fisher mentioned one chance in twenty as being a reasonable cutoff to avoid too many blind alleys. However for Fisher, 5% was the starting point, not the end point. Once the investigator has decided that there is (probably) something going on that is not attributable to sampling, she should now be trying to figure out what is going on.

            Somehow (I blame journal associate editors) this got grossly oversimplified to, “5% means it is statistically significant and we will publish the result.”

            Anyone who has studied statistics seriously knows that 5% as a blanket probability for rejecting a true null is just silly. Different sorts of tests should have different alpha levels. Diagnostic tests, as an example, should have fairly non-strenuous alphas. The consequence of rejecting the null is to look to explain an effect that is not present, or to compensate for a modeling problem that is not really there. This is a pretty low cost, and the cost of not rejecting a false null is potentially much higher.

            But what the journals teach students is that if you can’t do a proper weighing of consequences, you are safe with alpha = 5%, because it is what most everyone else uses. If you’re wrong, there will be lots of relatively soft corpses to land on.
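The point about matching alpha to consequences can be made concrete with a toy expected-cost calculation; the costs, the 50-50 prior, and the detectability figure below are all invented purely for illustration.

```python
from statistics import NormalDist

Z = NormalDist()

def expected_cost(alpha: float, ncp: float, p_real: float,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected cost of a two-sided test at a given alpha: false alarms
    when the null is true, misses when the effect (noncentrality ncp) is real."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    power = (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    return (1 - p_real) * alpha * cost_fp + p_real * (1 - power) * cost_fn

# Diagnostic-style setting: a miss costs 10x a false alarm,
# the effect is only moderately detectable (noncentrality 2), 50-50 prior
for a in (0.01, 0.05, 0.20):
    print(f"alpha={a:.2f}  expected cost={expected_cost(a, 2.0, 0.5, 1.0, 10.0):.2f}")
```

In this setting the loosest alpha is the cheapest, which is exactly the comment's point about non-strenuous alphas for diagnostic tests; reverse the costs and stricter alphas win.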

          4. I had to take an information sciences class as a part of my accounting education. It mostly served to prove that everything non-accounting taught at the business school really needs to be canned. We talked about data mining and how looking at the data could help you find complementary products that should be cross marketed. The example used was a finding by a supermarket that on Thursday evenings and Saturday afternoons the sale of both diapers and beer went up, varying together in a statistically significant way. The hypothesis was that was when young fathers went to shop, and there were all sorts of proposals about how to display the two items together in order to generate more sales. I could not get the instructor or the students to understand that with as many comparisons as the example implied were being made they were guaranteed to generate thousands of entirely spurious correlations. The response was, “But it’s there in the data.”

            This was also an advanced undergraduate class in which 80% of your grade was based on multiple choice exams. Just burn the business schools to the ground.
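The guarantee of spurious hits is easy to demonstrate (a sketch with invented sales data; every series is independent by construction, so every "significant" correlation it finds is spurious):

```python
import itertools
import random
import statistics

random.seed(7)

N_PRODUCTS = 80   # weekly sales series for 80 unrelated products
N_WEEKS = 52
CRIT = 1.96 / N_WEEKS ** 0.5   # approximate |r| cutoff for p < .05

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Independent random sales: no pair is truly related
sales = [[random.gauss(100, 15) for _ in range(N_WEEKS)] for _ in range(N_PRODUCTS)]

n_pairs = N_PRODUCTS * (N_PRODUCTS - 1) // 2
spurious = sum(
    abs(pearson(a, b)) > CRIT for a, b in itertools.combinations(sales, 2)
)
print(f"{spurious} 'significant' pairs out of {n_pairs}, all spurious")
```

With 3,160 pairwise comparisons, roughly 5% clear the significance bar by chance alone: well over a hundred diapers-and-beer stories, every one of them noise.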

          5. JMN,

            The phenomenon you describe is unfortunately absurdly common.

            Sports announcers make a healthy living from it.

  7. One of the blessings of working in computer science and mathematics is that you have a better chance of avoiding statistics than in other disciplines. 🙂

    1. You really need to read SJ Gould’s Full House.

      It is a wonderful description of a statistical way of thinking about the world. Very much a frequentist viewpoint, very much NOT about hypothesis testing. It would help if you like baseball, however.


Comments are closed.