How Can We Tell if a Treatment Really Works?

Separating the effects of treatments from selection effects (i.e., characteristics of patients that bias the estimate of a treatment's effect) is an enduring challenge of medical research. The randomized clinical trial is often seen as a simple solution, but in truth there is usually selection within trials. Instrumental variables analysis is one way to handle this situation, and Dr. Aaron Carroll here presents the most lucid short explanation of this statistical technique that I have seen (and I am flattered he chose one of my own studies as his example).

Author: Keith Humphreys

Keith Humphreys is the Esther Ting Memorial Professor of Psychiatry at Stanford University and an Honorary Professor of Psychiatry at King's College London. His research, teaching and writing have focused on addictive disorders, self-help organizations (e.g., breast cancer support groups, Alcoholics Anonymous), evaluation research methods, and public policy related to health care, mental illness, veterans, drugs, crime and correctional systems. Professor Humphreys has written over 300 scholarly articles, monographs and books, which have been cited over thirteen thousand times by scientific colleagues. He is a regular contributor to The Washington Post and has also written for The New York Times, The Wall Street Journal, Washington Monthly, the San Francisco Chronicle, The Guardian (UK), The Telegraph (UK), Times Higher Education (UK), Crossbow (UK) and other media outlets.

19 thoughts on “How Can We Tell if a Treatment Really Works?”

  1. Intention to treat (ITT) analysis can confuse people because it subtly changes the research question from what people expect it to be. An "as treated" analysis, comparing patients who receive A versus those who receive B, is more intuitively appealing to most of us because it seems to answer the question "What is the difference in outcomes between intervention A and intervention B?" But ITT can give a more accurate estimate of the effect of assignment to treatment, that is, of the "treatment plans" formulated when a patient is first seen.

    There is actually some pretty good science to suggest that abstinence alone can prevent teen pregnancy and STDs. Imagine going to a high school and randomizing students to either abstinence (perhaps teaching the girls to say "Not tonight, dear, I have a headache") or to usual adolescent behavior as the control group. You can anticipate that there will be some crossovers from one group to the other group during the course of your study. If you analyze by ITT, you will derive a more conservative estimate of the effect size than you would obtain by analyzing the "as treated" groups.

    ITT actually is more useful to public health types because it gives a better idea of what will happen when you attempt to implement a health intervention in a population. Some of the acrimony surrounding the debates over this sensitive topic in recent years arises from the fact that one group judges the effectiveness of abstinence education using ITT, and the other group insists on analyzing only the "as treated" or "per protocol" data.

    Understanding how ITT changes the research question in a subtle but consequential way is helpful in grasping how it is best applied in making decisions that affect real world outcomes.
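The dilution I described can be sketched with a toy simulation. All numbers here are made up for illustration (event rates, effect size, and a 25% crossover rate in each arm); the point is only how the two analyses diverge when some subjects cross over:

```python
import random

random.seed(0)

N = 10_000          # subjects per arm (made-up)
BASE_RATE = 0.40    # assumed event rate without the intervention
TRUE_EFFECT = 0.15  # assumed reduction in event rate for those who actually get it
CROSSOVER = 0.25    # assumed fraction of each arm that crosses over

itt_events = {"A": 0, "B": 0}   # events counted by ASSIGNED arm (ITT)
at_events = {"A": 0, "B": 0}    # events counted by arm actually RECEIVED
at_n = {"A": 0, "B": 0}

for assigned in ("A", "B"):
    for _ in range(N):
        received = assigned
        if random.random() < CROSSOVER:  # non-compliance: behaves like the other arm
            received = "B" if assigned == "A" else "A"
        rate = BASE_RATE - (TRUE_EFFECT if received == "A" else 0.0)
        event = random.random() < rate
        itt_events[assigned] += event
        at_events[received] += event
        at_n[received] += 1

itt_diff = itt_events["B"] / N - itt_events["A"] / N
at_diff = at_events["B"] / at_n["B"] - at_events["A"] / at_n["A"]
print(f"as-treated estimate ~ {at_diff:.3f}; ITT estimate ~ {itt_diff:.3f} (diluted toward zero)")
```

With these parameters the as-treated analysis recovers roughly the true 0.15 difference, while the expected ITT difference is 0.15 * (0.75 - 0.25) = 0.075: the conservative, "what happens when you assign this plan" answer.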

    1. Very thoughtful comment Ed, thanks.

      On principle, I never say things like "method X is more useful" because I think that always depends on the context and the question, but otherwise I agree with you about what ITT can teach, and also that it's very important for consumers of science to understand that a trial does not evaluate treatment X versus treatment Y, but rather the effect of randomizing people to treatment X versus treatment Y.

      1. Context-sensitive is exactly right; come to think of it, this is true of most things in life.

        Thinking about evidence-based medicine and deciding what works, I wonder if you have any insights into how the Cochrane collaboration does its meta-analyses. Specifically, most high-quality clinical trials will include a sample size calculation in which the authors decide how many patients are needed in order to detect a given effect size, assuming a certain variance, type 1 error, and type 2 error. It has been well-argued that meta-analyses should calculate an optimal information size for detecting an effect size of interest, which ought to be at least as large as the sample needed for a single well-powered randomized trial, and larger still to compensate for heterogeneity between trials. It seems that Cochrane never does this in its meta-analyses, and you never see such a thing in their protocols.

        Also, I understand that the funding for Cochrane reviews is proportional to the number of included trials, which creates a bit of an incentive to include all trials which meet the protocol criteria, regardless of quality or whether they ever ought to have been conducted in the first place. So they report again and again that most included trials were of low quality, but often proceed to combine their data anyway.

        It's OK by me if you do not have an answer to these two questions, but not calculating an optimal information size seems like a real weakness in their system. If I did not rely so heavily on Cochrane reviews for my work I would not be so concerned here.
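To make concrete what I mean by an optimal information size, here is a rough sketch of my own (illustrative only, not Cochrane's method or any published TSA formula): the standard two-proportion sample-size calculation, inflated by 1/(1 - I^2) to compensate for between-trial heterogeneity.

```python
from statistics import NormalDist

def required_information_size(p_control, p_treat, alpha=0.05, power=0.80, i_squared=0.0):
    """Total participants needed to detect a drop from p_control to p_treat,
    inflated for between-trial heterogeneity (I^2). Illustrative sketch only."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided type 1 error
    z_b = NormalDist().inv_cdf(power)           # power (1 - type 2 error)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    n_per_arm = (z_a + z_b) ** 2 * var / (p_control - p_treat) ** 2
    return 2 * n_per_arm / (1 - i_squared)      # heterogeneity inflation

single_trial = required_information_size(0.30, 0.20)                  # one well-powered RCT
meta = required_information_size(0.30, 0.20, i_squared=0.50)          # meta-analysis, I^2 = 50%
print(round(single_trial), round(meta))
```

With these made-up event rates, a single trial needs about 580 participants, while a meta-analysis facing 50% heterogeneity needs about twice that before its pooled estimate deserves the same confidence.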

        1. Funnily enough, I am doing my first Cochrane review right now. What I can tell you so far (and to be clear, this is new to me so I could be wrong) is that the standards have gone up a lot since it started: you have to rate every trial for bias and document inclusion/exclusion much more thoroughly. So they may make progress on the issue you raise even if their track record doesn't suggest it.

          I am not sure about the funding per trial, I have not heard of that so can't say if it's so or not (I am not personally being funded on the one I am working on).

          1. They have always been pretty good about rating trials for bias, but have often included trials with unclear or fairly high risks of bias with respect to allocation concealment, blinding, and incomplete data. I agree that they are more discriminating now than in past years, and are excluding some studies with markers of significant bias.

            Still, I wonder why they do not use methods like trial sequential analysis (TSA) more often. TSA brings to meta-analyses the same safeguards that interim analyses bring to individual clinical trials: it adjusts the significance thresholds as data accumulate, so that the risk of prematurely concluding that an intervention is beneficial when analyses are repeated (or stopped early for benefit) is reduced. A couple of interesting papers have discussed how important this can be:

            Brok J, Thorlund K, et al. Trial sequential analysis reveals insufficient information size and potentially false positive results in many meta-analyses. J Clin Epidemiol 2008;61:763–9.

            Brok J, Thorlund K, et al. Apparently conclusive meta-analyses may be inconclusive—trial sequential analysis adjustment of random error risk due to repetitive testing of accumulating data in apparently conclusive neonatal meta-analyses. Int J Epidemiol 2009;38:287–98.

            Anyway, I am glad that you are working on a Cochrane and wish that they would pay you something for your trouble!

          2. This is really fascinating; I am going to read more about this. It would be good to have a TSA in my life that had value.

        2. IANAD, but the biggest problem with Cochrane, and meta-analyses generally, is one outside their control: the non-publication of trials that gave inconclusive (boring) results or, worse, results unhelpful to the commercial sponsor. Epidemiologist and scourge of bad science Ben Goldacre has been campaigning, including at Cochrane conferences, for compulsory publication of the results of all trials; almost half of all trials have never been published. Sounds like a very good cause. Senator Warren is a notable and heavyweight backer of data transparency in medical research.

          1. Some of the Cochrane reviewers have adopted a strong position on trial registration, and will exclude any studies published after about 2010 which were not registered. That helps to control selective outcome reporting, but does not in itself achieve Dr. Goldacre's goals.

            This is one more reason to really like Sen. Warren.

            Most industry-sponsored trials are of pretty high quality, the problem being that the results are often never published. Cochrane reviews often have a section for "ongoing studies" which you can check. You can find out, for example, that there were studies of the use of bone morphogenetic protein for tibial fractures, and can see that several years after completion, the results had not been published. You can then speculate that if BMP had been fantastically effective in promoting fracture healing, the sponsor would have told the world. Sometimes speculations are well-founded.

          2. We can't mandate publication unless we are willing to send federal agents into universities to kick down doors, repossess data and then hire authors to write it up and publish it somewhere (presuming any journal wants to publish it).

            In contrast, what can be done is to require registration of the trial, its central hypothesis, and its analysis plan. Bob Kaplan's analysis of cardiology trials shows that when this happens the number of significant results drops dramatically, suggesting quite a bit of data fishing has been going on.

          3. The feds can make any funding conditional on timely publication of any previous funded work. Tenure committees can impose the same test. The thing is to shift non-publication into the "unprofessional behaviour" box.

            "Publish" here can't mean "in a peer reviewed journal." Unless someone sets up the Journal of Insignificant Results. Peer review there would presumably mean "run past Maureen's spell checker".

            Idea for a short story: a failed experiment that is sort-of published in this way, and the hero spots that by accident the clowns have made an amazing breakthrough on a bigger problem.

          4. Grant committees have long weighed prior publications in review; people who haven't published from prior grants tend to get worse scores for their lack of productivity, which is reasonable because public funds are not there to satisfy private curiosity. For many reasons, though, this norm could not be formalized without an enormous regulatory apparatus and a willingness to accept significant collateral damage to certain subfields and certain types of scientists (particularly women and people at lower-resource institutions).

            Tenure committees on principle will ignore government diktats regarding what professors may and may not publish, which would be in their eyes an intolerable intrusion on academic freedom. And companies don't need to care what tenure committees say because we have no authority over them. Neither do they need government funds to do their research.

            And as for having results published without peer review in second rate journals by second-hand authors assigned to write up studies they were not a part of, that is not the way to scientific progress.

          5. "… that is not the way to scientific progress." Goldacre's sound point is that negative and inconclusive results – as in most of mediaeval alchemy – are an important part of scientific progress. In this case, the incentives of science as a social enterprise – rewarding positive results published in prestigious journals – are deeply misaligned with its goals as a cognitive enterprise, which requires systematic publication (somehow) of failures. Please put more thought into finding a solution rather than saying it can't be done.

          6. Negative results are indeed important and the journal where I serve as regional editor makes a point of publishing them (so your comment that I should be "putting more thought in" is ignorant of my record and at a personal level, ungentlemanly).

            You were proposing punishment of scientists who didn't publish their results within what you, or a central government body, considered a timely period. That is what I termed unworkable, and you did not respond to the critique. Do you have a response or not? Your approach requires heavy monitoring, potentially data seizure and publication by others who don't understand the original study, and an adjudication system with an appeals process. How fast is "timely" in particle physics, in spider monkey genome research, and in psychotherapy research, for example, and who classifies each study as belonging to which field with which time frame? What about interdisciplinary work, what time frame is allowed there? And if a researcher gets pregnant, or their child is diagnosed with cancer, or they lose their job, where do they appeal the punishment that your committee hands down for their lack of timeliness? How do you ensure that this process doesn't fall unevenly on particular groups (e.g., women, who do most caregiving), which it almost certainly will? All of this will cost money and make doing science more difficult at a time when many young scientists are quitting due to low funding and heavy regulation. Is that cost acceptable to you?

          7. Apologies. You had not however mentioned the good work you are personally doing as an editor. It's not fair to expect a gentleman to check up on such things, breaking down doors in the night.

            My two suggestions – I'm not expert enough to call them proposals – were for making federal grant funding "conditional on timely publication of any previous funded work" and scrutiny by tenure committees. I don't see how you get from these to jackboots except by the straw man fallacy. The grant suggestion requires a check in the database to see if there are previous project grants with results still not published. That would bear more heavily on established researchers than young ones. It doesn't seem a very strong requirement to me.

            There are obviously force majeure reasons for non-publication, as for not carrying the work out in the first place. ("The rabies vaccine animal trial was abandoned after animal rights activists freed the rats, leading to a citywide public health crisis.") What's the difference between the cases? A sensibly operated system will recognize and allow for these problems; a badly operated one will turn them into yet another barrier. I don't think this possibility is an argument against shifting the culture towards an expectation of universal publication as the professional thing to do.

            The real difficulty seems to be peer review (a thankless chore for reviewers at the best of times, I assume worse than watching paint dry for negative results) and the incentives of journal editors towards the positive.

  2. This video was better than typical popularizations, but it might mislead naive viewers into thinking there are bigger problems with ITT than there are. Failure to comply in RCTs analyzed by ITT does not "bias" the causal interpretation of results, which is rigorously valid with respect to ITT. Under the strong null (no treatment effect) non-compliance creates no signal. It does, however, dilute the estimate of the direct effect of treatment. In cases where the non-compliance is reasonably thought not to be strongly related to what the outcomes would have been (e.g. members of a control group getting PSA screening on their own), one can make a fairly good estimate of the effects of actual treatment on outcomes (e.g. death from prostate cancer) by correcting for the dilution.

    1. The dilution effect could be enough in marginal cases to tip the results into statistical significance. The significance boundaries are arbitrary, but they play an important part in the scientific ecosystem as filters.

      1. In the simple dilution case, the correction just consists of a multiplicative factor. It multiplies the entire confidence interval, error bars and all, so it has no effect on whether 0 is included in that interval. A closely related but not quite identical fact is that it doesn't change the p-value. So what the correction can do is make an effect look more important, but it cannot change the statistical significance. That's one of many reminders that the focus on p-values with arbitrary cutoffs is at best out of proportion to what it should be, and at worst irrational.
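A toy numerical check of this point, with made-up numbers (a hypothetical diluted ITT estimate and standard error, an assumed 60% net compliance fraction, and a normal approximation throughout): dividing both the estimate and its standard error by the compliance fraction rescales the confidence interval but leaves the z statistic, and hence the p-value, untouched.

```python
from statistics import NormalDist

effect_itt, se_itt = 0.06, 0.025   # hypothetical diluted (ITT) estimate and its SE
compliance = 0.60                  # assumed net compliance fraction

def p_value(effect, se):
    """Two-sided p-value under a normal approximation."""
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Simple dilution correction: scale the estimate AND its SE by the same factor
effect_corr, se_corr = effect_itt / compliance, se_itt / compliance

ci = (effect_itt - 1.96 * se_itt, effect_itt + 1.96 * se_itt)
ci_corr = (effect_corr - 1.96 * se_corr, effect_corr + 1.96 * se_corr)

print(f"ITT:       {effect_itt:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f}), "
      f"p = {p_value(effect_itt, se_itt):.4f}")
print(f"corrected: {effect_corr:.3f}, 95% CI ({ci_corr[0]:.3f}, {ci_corr[1]:.3f}), "
      f"p = {p_value(effect_corr, se_corr):.4f}")
```

The corrected estimate and its interval are larger by the factor 1/0.60, but the p-value is identical, so the correction can never flip a result across a significance threshold.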

Comments are closed.