The Subjects in the Studies that Guide Your Medical Care May Not Be Like You at All

The basic logic of evidence-based medicine is that researchers study health care interventions in rigorous clinical trials and then clinicians apply the subset of interventions with good outcomes to patients in the real world. However, this simple formulation glosses over the possibility that the research subjects in the clinical trial may differ from real-world patients in ways that make the results of questionable generalizability.

The primary reason this occurs is that clinical trials often forbid enrollment by many patients who are treated in our health care system, including, for example, anyone who is over the age of 60, has multiple medical conditions, or is taking other medications. This makes the clinical trial easier to conduct, but it can also result in a research sample that is completely unlike real-world health care recipients. If, for example, a new medication has been FDA-approved based on a clinical trial that excluded anyone who was already taking another medication, any adverse medication interactions won’t come to light until patients start experiencing them in the health care system.

This would be less of a problem if clinical trials carefully described who was excluded from the study and why. To see how well researchers are doing on this score, my colleagues and I reviewed the 20 most highly cited recent clinical trials for each of 14 prevalent medical conditions (i.e., 280 trials in all).

Our study appears in JAMA Internal Medicine today. The punchline: About half of even the most highly cited clinical trials provide no information at all on how many patients were excluded and why. Among the subset that do reveal this critical information, the average study reports not enrolling about 40% of those patients with the disease who would get the treatment the study is evaluating in real-world clinical practice. By definition, these non-enrolled patients are different from those in the trial, because they were ruled out of participating by set criteria rather than at random.

You might wonder how we figure out whether the results of these clinical trials generalize to those who were turned away by the study eligibility criteria (or who were offered enrollment and refused). The general answer is that the treatments that performed well in the clinical trial are broadly provided in the health care system, and if someone notices bad outcomes accumulating among the type of patient who wasn’t in the original study, then we know the trial result has been over-generalized and what is beneficial to some patients is useless or harmful to others.

If that sounds a bit scary, it’s because it is.

Comments

  1. David Isaac says

    What you say is true, which is why outcome research using (often retrospective) data from disease registry cohorts is growing in popularity. One example is the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program of regional cancer registries. These studies can compare the outcomes for patients given different treatments, who have multiple chronic illnesses in addition to the target condition, and often use data from Medicare for prescriptions, hospitalizations, and emergency department treatments. Some of these studies, in order to use linked Medicare data, include only patients 65 years and older.

    What is your impression of these types of studies?

    • Keith Humphreys says

      I have long supported efforts such as SEER, as well as post-approval marketing in relatively captive health care systems with good informatics (e.g., Kaiser, the VA). That said, I would still like RCTs to use exclusion criteria more minimally and only with better justification than is commonly the case.

  2. Warren Terra says

    You’re not wrong, but do you have a suggestion for improvement? Sure, some of these criteria might be relaxed – age, for example – but others make excellent sense. To maximize your statistical power, you need to reduce the number of variables. Once some of your study participants have potential complicating factors, you need more people with the same or similar complicating factors, on both sides of the treatment/control divide – or you live with increased noise in the results.

    Yes, it’s an assumption that the effect clearly perceived under ideal conditions will also be present (if less clearly perceived) in more complex circumstances. Yes, you will fail to detect possible interactions and their complications. There are undoubtedly all sorts of things we could do better to follow up the use of drugs and treatments, especially new drugs and treatments, and especially in real-world, complicated cases unlikely to have been perfectly reflected in the trials. And, as David Isaac says, large-scale analysis and meta-analysis may perceive issues no trial of limited scale could have predicted. But these are proposals subsequent to the trials. Do you really feel great changes to the trials themselves are necessary or desirable?

  3. Keith Humphreys says

    Warren wrote: To maximize your statistical power, you need to reduce the number of variables.

    This is widely assumed and believed, but is often not correct. Power is only increased if heterogeneity in outcome is reduced. What we have shown in some of our work is that exclusions often INCREASE heterogeneity and thereby reduce statistical power. This is because it is hard for researchers to predict what will influence outcome. I have been struck repeatedly as a grant reviewer by proposals that say “We don’t know if this treatment works” and then propose 10 exclusions…remarkable, we don’t even know if there is a main effect but can already specify 10 interaction effects.

    There are many alternatives to narrowly designed trials. We know this because even within fields the average number of exclusions varies substantially by country (German trials are exclusive, UK trials not at all) and because the exclusivity of trials varies within a country over time.

    There is also the option of conducting some large, simple trials such as Sir Richard Peto has proposed.
    http://globalhealthtrials.tghn.org/site_media/media/articles/trialprotocoltool/SOURCE/Extras/Stats/DLSLargeTrials.pdf
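The point about power and heterogeneity can be made concrete with a rough calculation. The sketch below uses the standard normal approximation for a two-arm comparison of means; the effect size, standard deviations, and sample sizes are hypothetical numbers chosen only for illustration, not figures from any trial discussed here.

```python
from statistics import NormalDist


def two_arm_power(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two arm means.

    effect    : true difference in mean outcome between arms
    sd        : outcome standard deviation within each arm (the heterogeneity)
    n_per_arm : participants per arm
    """
    z = NormalDist()
    standard_error = sd * (2.0 / n_per_arm) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / standard_error
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical scenarios (all numbers are assumptions):
# a broad trial, the same trial if outcomes were more variable,
# and a narrow trial whose exclusions cost recruits without lowering the SD.
power_low_sd = two_arm_power(effect=5.0, sd=10.0, n_per_arm=100)
power_high_sd = two_arm_power(effect=5.0, sd=15.0, n_per_arm=100)
power_fewer_patients = two_arm_power(effect=5.0, sd=10.0, n_per_arm=60)

print(round(power_low_sd, 2), round(power_high_sd, 2), round(power_fewer_patients, 2))
```

Under these assumptions, an exclusion helps only if it actually lowers the outcome standard deviation; if it merely shrinks the sample, or raises the standard deviation, power falls.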

    • Dennis says

      Beyond this, Keith, the real story is usually in the interactions rather than the main effects. You can’t detect interactions you eliminate by design.

      • Keith Humphreys says

        @Dennis. That is one of Peto’s key arguments for large, simple trials. If you have a large sample, interactions go from being a threat to a resource, because you can estimate them reliably. Thus, if you have a huge sample, even a study with no main effect could be very informative.

        • Dennis says

          Peto’s right.

          In the presence of interactions, main effects are almost never interesting by themselves. Once you have told the story of the interactions, the main effects are a consequence.

          In fact, the name main effect is a misnomer. They are actually marginal effects — that is, the effects averaged over the other factors. Marginal effects are most useful when the factors are additive — when there is no interaction. In the presence of interaction, conclusions from the margins have to be “adjusted” for the non-additivity anyway. Just describe the interaction and be done with it.
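A small worked example may help show why a "main effect" is really a marginal effect. The cell means below are invented for illustration, and the two strata are assumed to be equally sized, so the marginal effect is a simple average of the stratum-specific effects.

```python
# Hypothetical mean improvement scores for each (arm, stratum) cell.
# All values are made up for illustration.
cell_means = {
    ("treatment", "no_comorbidity"): 10.0,
    ("control", "no_comorbidity"): 4.0,
    ("treatment", "comorbidity"): 2.0,
    ("control", "comorbidity"): 8.0,
}


def simple_effect(stratum):
    """Treatment minus control within one stratum."""
    return cell_means[("treatment", stratum)] - cell_means[("control", stratum)]


effect_without = simple_effect("no_comorbidity")  # 6.0: treatment helps
effect_with = simple_effect("comorbidity")        # -6.0: treatment harms

# Marginal ("main") effect: the average over strata, assuming equal sizes.
marginal_effect = (effect_without + effect_with) / 2

# Interaction: how much the treatment effect differs between strata.
interaction = effect_without - effect_with

print(marginal_effect, interaction)
```

In this made-up case the marginal effect is zero even though the treatment matters a great deal in both strata. A trial that excluded the comorbid stratum would report an effect of 6.0 and say nothing about the patients it harms.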

  4. alnval says

    What also might be interesting would be to look at the pattern of professional responses to the JAMA article. I suspect it might yield some leads as to how better to educate the professional community in addressing (before the fact) the issues your article points to.

  5. Ed Whitney says

    Something else (for a different paper) would be how many patients end up in clinical trials who would have been ineligible by the initial protocol. Recruiting participants into RCTs is hard enough as it is, and trials often fail to accrue the number of participants they had targeted in their power calculations. When recruitment falters, are people entered into the trial with comorbidities or other factors which were listed in the exclusion criteria for the trial at its inception?

    The protocols at clinicaltrials.gov are often rather vague and do not necessarily say how the exclusion criteria were determined. Often, they do not want to enroll people who are, to use a technical term, funny in the head, but the protocols do not seem to say that all potential participants had their heads examined. As trials progress, one wonders if criteria were relaxed in participants recruited later than in those recruited earlier.

    But exclusions do make sense in circumstances other than age. There is controversy about the effectiveness of vertebroplasty for osteoporotic vertebral fractures, especially after two randomized trials in the New England Journal showed no difference in outcome between sham and real vertebroplasty. A different trial in the Lancet did show a difference between vertebroplasty and continued conservative treatment. The controversy has arisen over the way that the New England Journal paper may have included people with healed fractures or with pain arising from structures other than the fractured vertebral body; they did not have a criterion of localized tenderness for inclusion into the trial, and their MRI criteria were also a potential problem. The Lancet trial was probably a bit better at selecting people with fresh fractures for which the procedure may be effective.

    Since a recent review of hospital records published in Spine showed that mortality was lower in osteoporotic fractures treated with vertebroplasty than in fractures not treated, the question becomes one of more than academic interest. Hospital record reviews are notoriously subject to bias, but the hazard ratios were so large, and the distribution of comorbidities actually somewhat greater in the vertebroplasty group, that dismissing vertebroplasty on the basis of the sham-controlled randomized trials could be questionable indeed.

    What I am driving at is that this stuff truly matters to real people. A methodologically impeccable study which answers an irrelevant question is not necessarily preferable to a methodologically problematic study which answers a relevant question.

    Congratulations on publishing a timely research paper. It is on its way to my files right now.

    • Keith Humphreys says

      Thanks Ed. We have discussed the trial recruitment issue in a number of our papers. Some people argue that exclusion criteria are necessary to reduce cost, for example excluding those who live far away/have plans to move/are residentially unstable so that follow-up will be cheaper. But every time you do that, you are risking an extension of recruitment time to get your sample, and that also has a cost.

      I forget what proportion of NIH trials do not get their target sample, but it’s not small. That also means that the study is underpowered.

      The situation you describe of changing criteria as the study goes along no doubt happens. Another issue is that comparisons of the original IRB-approved protocol and the study reports show that many studies drop criteria between those two points, meaning either that they were approved to exclude some people and didn’t, or that they did exclude them but do not reveal that to the paper’s reading audience.

  6. J. Michael Neal says

    Am I correct in thinking that these exclusions also eliminate or at least greatly reduce the probability of finding treatments that would be beneficial in excluded populations but are less powerful in the ones that are included?

    • Keith Humphreys says

      Yes, less or more powerful. In the alcohol field, trials tend to exclude African-Americans, people who are socio-residentially unstable, people who are poor, people who also use illegal drugs, and people with psychiatric and medical comorbidities. As one clinician who works in the public sector put it, “I treat all the patients researchers don’t include in their studies”.

      Note also that because those excluded patients could have different outcomes, the use of criteria can change a study’s results, making treatment look more or less effective than it is.

  7. Ed Whitney says

    An important question about the eligibility criteria for these conditions: how many of them represent clear indications or contraindications to the intervention being studied? For example, in the trials of surgery vs. nonoperative management of lumbar disc herniation, cauda equina syndrome was an exclusion criterion, because this represents a surgical emergency and a clear indication for immediate surgery before it is too late. The principle of equipoise requires that the trial only randomize patients in whom the best treatment is uncertain; if some of the exclusions are for conditions in which equipoise is removed, the external validity is not really compromised. I did notice that there were no physicians among the authors; having some specialists in the areas being studied would be a big help, because they can recognize which exclusion criteria are appropriate and which are arbitrary. Some comorbidities could be generally recognized contraindications to one or another of the interventions in the studies, but it takes a content area expert to tell which conditions would remove equipoise and which are more for the convenience of analyzing the data.

    • Keith Humphreys says

      This is correct, we don’t need to know whether a treatment works in cases where there is no uncertainty (by the doctor or patient).

      Van Spall’s review in JAMA showed that most exclusion criteria are not well justified. Often it’s just habit, an implicit belief that this is a no-cost methodological decision that makes a trial more sciency. When I reviewed grants regularly I used to ask proposal authors for a justification for their criteria. I never said they could not use them, just asked for a reason. In no case did anyone ever give a reason; usually they said “We cut and pasted that; come to think of it, it doesn’t need to be in there”.

      • Ed Whitney says

        Keith:
        I was wondering about the eTable referred to in the text of the article. I have been clicking on it and it leads to an error message. However, when I click on the “Data Supplement” item which is to the right of the text of the article, it does lead to a document with the search terms, but not with the references used for the analysis. Maybe this is not the same document.
        Is the eTable a different document from the “Data Supplement”? I contacted the technical support folks and hope that they will soon repair the link.
        This does bring up an opportunity to rant for a moment about something rather maddening in journal publishing. Often there is supplementary content online, which is available at the journal website. But more often than warranted, there is a link to supplemental content which is not functioning or which is not available. The Van Spall article in JAMA 2007 is a good example; in the Methods section, the “Data Sources” paragraph for the search process leads to a link which has now gone and joined the choir invisible. A 2006 article in a different prominent journal refers the reader to an appendix at “ArticlePlus” which is not at the journal website, and not available anywhere that I can find. The e-mail of the corresponding author also has ceased to be; a request for the ArticlePlus supplement bounced back as undeliverable. I realize that six years is a very long time, practically an eternity, but these data supplements need to be better maintained by highly cited journals.

        • Keith Humphreys says

          Hi Ed — the supplemental material is just the search terms and not the cites for the 280 clinical trials. We didn’t think those were important to include but if you email me I can get them to you.

  8. J.m.g. says

    Can you say more about your understanding of the term evidence based medicine?

    I’ve been reading Dr. Nortin Hadler’s great books, such as “The Last Well Person” and I do not think he uses the term the way you do.

    Perhaps I misunderstand him, you, or both, but my gloss on one important point in his work is that we excessively fixate on effect studies and fail to focus on meaningful, holistic endpoints (paradigmatic example: fetishizing lowered cholesterol with statins while ignoring the vast evidence that only a tiny subgroup of the population actually seems to live any longer thanks to statins; similarly, whole volumes can be written on what exactly “early detection” of cancer buys you, or whether early detection is simply an artifact of more and more testing of more and more people, many of whom will now suffer the consequences of Type I (false-positive) errors and the associated interventions).

    • Keith Humphreys says

      I haven’t read that book, but there is nothing in what I said about evidence-based medicine that is inconsistent with looking at meaningful endpoints.

  9. James Wimberley says

    Is the problem alleviated by the somewhat seedy practice of carrying out the trials in Third-World countries and enrolling everybody who will take the money? Of course, the other medical conditions that Delhi slumdwellers may have will be wildly different from those of the rich-world patients the drug is probably aimed at: parasites vs. obesity, say.

    • Keith Humphreys says

      You can do this without being seedy. I believe one of Peto’s aspirin trials enrolled over 10,000 people in India to take a daily aspirin or not — no exclusions at all and very minimal baseline data collection, like one page.

    • Warren Terra says

      Surely it would test a reviewer’s goodwill and trust if the people who performed a study and were most invested in a positive outcome (emotionally, career-wise, or monetarily) were to trim the study participants post-facto?

      Though, of course, things of this sort do happen; they can even be of benefit. Famously, it was determined that if a couple of previous studies that had demonstrated no benefit versus control were trimmed to only the African-American participants, a benefit could be found, which was later replicated in a focused trial (problematically, this has resulted in the granting of a patent for the more lucrative formulation and marketing of a medicine that combines two drugs available cheaply as generics). Parsing a mixed group of participants to reveal a more homogenous subgroup may reveal benefits overlooked in the larger group, benefits that are real and are worth knowing about. On the other hand, if you can figure out 20 different ways to parse the study participants, there’s a fair chance one such subgroup will show a P-value of less than 0.05 – which could easily let you fool yourself or others into thinking that you’ve achieved statistical significance, especially if you can come up with a narrative to justify focusing on that one subgroup.
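The "20 different ways to parse the study participants" concern can be put in numbers. The sketch below assumes the 20 subgroup tests are independent and that there is no true effect in any of them; real subgroups usually overlap, so the exact figure would differ, but the direction of the problem is the same. The Bonferroni threshold is shown as one common, conservative correction, not as the only option.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests


n_subgroups = 20  # the number of post hoc subgroup analyses in the example

# With no correction, a spurious "significant" subgroup is more likely than not.
fwer_uncorrected = prob_at_least_one_false_positive(n_subgroups)

# Bonferroni: test each subgroup at alpha / n to keep the overall rate near 0.05.
bonferroni_alpha = 0.05 / n_subgroups
fwer_bonferroni = prob_at_least_one_false_positive(n_subgroups, bonferroni_alpha)

print(round(fwer_uncorrected, 3), bonferroni_alpha, round(fwer_bonferroni, 3))
```

Under the independence assumption, the uncorrected chance of at least one false positive is roughly 64%, which is why a subgroup finding generally needs replication in a focused trial before it is believed.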

    • Keith Humphreys says

      What they are worried about is a bit different, post randomization loss (or omission). Our group studies what happens before that point.

  10. paul says

    How can you challenge the conventional wisdom that exclusions make studies easier to run and analyze? I’d say that modern computational tools have made teasing out the details of effects and confounders relatively straightforward, but then we have the spectacle of major economists (who are generally way better than clinicians at analysis) making a bunch of newbie errors, and it makes you wonder.

    • Keith Humphreys says

      @paul: In part one does it by showing that the CW is wrong. I was highly impressed at a VA Cooperative study planning meeting in which the data guys said “Every exclusion criteria is a threat to sample recruitment, let’s go over this list again and see how many we can knock out”. They were trained to bring up a fact — there is a cost that the CW doesn’t acknowledge.

      Another part of the case is ethical/moral: many exclusions tend to inadvertently cut out people of color, which is something almost every scientist recognizes is unjust. Indeed, I see studies where scientists are making huge efforts at recruiting a diverse sample and then shooting themselves in the foot with exclusion criteria. Concrete example from the alcohol field: People often rule out participation of patients who live more than X miles away. Well, African-Americans travel further for care than other racial groups, so if you do that you are inadvertently and disproportionately excluding them. When you show that empirically, people get it, and because they want to do the right thing, they are often persuaded to change on that basis alone.

  11. says

    I had HepC, took the 48-week treatment with health insurance, and it didn’t work. Then my insurance ran out. I found a clinical trial (at clinicaltrials.gov) and had to talk my way in. It was 100 miles from my home, and I had to prove I could get there every week. I was drug tested, but they didn’t care about pot, which is the only drug I was on.
    The trial worked and I am cured!

  12. NCG says

    I was wondering, with all our marvelous new technology, could there be a way to enroll and study people just through their family practitioner or gynecologist?

    It’s funny too because I see ads in the paper where they’re looking for people who already have x, y, or z condition.

    I suppose it takes a special kind of person to let someone else put untested medicines in their body. One might argue though, it may be almost a duty? Now I am having guilt!

  13. Ed Whitney says

    Another curveball relates to how many eligible patients agree to become participants in clinical trials. People may be reluctant to enter trials, especially placebo controlled trials, and this has been a barrier to getting enough power (not to mention external validity) for interpreting a study. People who agree to sign the consent forms, after having had both alternatives explained to them, are different from other people, even if the most liberal inclusion criteria are applied.

    Sibai et al (J Bone Joint Surg Am. 2012;94 Suppl 1(E):49-55) reported on a trial they wanted to do for the treatment of unstable distal radial fractures. They had 300 eligible patients they were prepared to randomize, but only 13 (4%) agreed to participate.

    They could have been the world’s worst salesmen, and maybe the Godfather would have had better luck in recruiting patients into the trial, but refusal to participate can make matters difficult for even the best researchers to get the job done.

    • Keith Humphreys says

      We have a paper on predictors of refusal among eligible, and one of the key issues is how different the two conditions are. When they are more stark, people are more likely to have preferences, which leads them either to refuse randomization or quit after randomization if they don’t like what they got.

      • paul says

        Do you still have the problems with clinicians breaking randomization to get the patients “who really need it” into the new-intervention arm? (Which may even be rational behavior in the presence of a zero lower bound.)

        • Keith Humphreys says

          paul: Again following Sir Richard’s reasoning, if the physician is convinced a patient really needs a treatment, then that patient should not be referred to the RCT in the first place. RCTs are for those patients who are at a point of uncertainty, as jointly defined by the patient and the physician.

          • James Wimberley says

            Look at Richard Crew’s comment above. Some patients in RCTs – especially for life-threatening conditions, as in the early days of HIV research – are desperate to get access to a new treatment, even if it’s only a 50% chance of enrolment in the new-intervention arm, and an unknown probability of it working.

      • Ed Whitney says

        Or to put Paul’s question another way, what do you need to see to convince you that there has been airtight concealment of the allocation list? I look for phrases like “sequentially numbered opaque envelopes” or “central randomization by a person not otherwise involved in the study.” But what clues might raise one of your eyebrows and raise suspicions that allocation concealment has been compromised?

  14. Keith Humphreys says

    central randomization by a person not otherwise involved in the study

    Using a publicly available random number list (not dice, coins, colored balls from a jar) that is reproducible for full audit by groups not associated with the study.

    • Dennis says

      The problem with using a publicly available RNT (like the RAND book of 1,000,000 random digits and 10,000 Normal deviates) is precisely that it IS public. If full specification of the mechanism is given (starting point, reading direction, number of digits used per read) the groups can be recreated by anyone with access.

      I would prefer to see a specified PRNG used, with usage mechanism specified, but the seed recorded and held confidential until the end of the trial. That can be fully audited after the fact by anyone (or everyone) concerned, but cannot (in theory) be broken ahead of time.
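The seeded-PRNG idea can be sketched in a few lines. This is only an illustration under stated assumptions: the seed value, arm labels, and function name are hypothetical, and Python's built-in generator (a Mersenne Twister) is not designed to resist a determined attacker, so a real trial would presumably rely on a vetted randomization service rather than this script.

```python
import random

ARMS = ("treatment", "control")  # hypothetical arm labels


def allocation_sequence(seed, n_participants):
    """Regenerate the full allocation list from a recorded seed.

    A private Random instance is used so that nothing else in the
    program can disturb the sequence.
    """
    rng = random.Random(seed)
    return [
        ARMS[0] if rng.random() < 0.5 else ARMS[1]
        for _ in range(n_participants)
    ]


# The seed is recorded and held confidentially until the trial ends.
confidential_seed = 20120907  # hypothetical value

during_trial = allocation_sequence(confidential_seed, 12)

# After the trial, an auditor given the seed and this procedure
# can regenerate the list and compare it with what was actually done.
audit_copy = allocation_sequence(confidential_seed, 12)
print(during_trial == audit_copy)
```

The design choice is the one described in the comment: the mechanism is fully specified in advance, while the seed stays secret until the audit, so the list is reproducible afterward without being predictable beforehand.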

  15. Ed Whitney says

    Dennis:

    What are your thoughts on block size? Do you think it is an acceptable trade-off to increase block size (making the next allocation less predictable), at the (possibly acceptable) price of allocation imbalance up to half of the block size? It seems like a good deal, but maybe there are downsides.

    Also, do you see minimization as a good allocation strategy, at least under some circumstances? How clearly do you think you need to know and identify strong predictors of prognosis in order to use it?

    We are getting into what may look like pretty fine grained stuff here, but I never pass up the opportunity to pick someone’s brain on issues where my thinking can be refined.

    Finally, do you think that you need to count the angels dancing on the head of a pin through an actual enumeration, or is sampling an acceptable method? :)
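The block-size trade-off raised in this comment can be explored with a short simulation. This is a sketch assuming two arms, equal allocation within each permuted block, and a recruiter who knows both the block size and all earlier allocations; the seed and sizes are arbitrary, and the function names are illustrative.

```python
import random


def permuted_block_sequence(seed, n_blocks, block_size):
    """Two-arm permuted-block allocation: each block is half T, half C."""
    if block_size % 2:
        raise ValueError("block_size must be even for two equal arms")
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = ["T"] * (block_size // 2) + ["C"] * (block_size // 2)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence


def max_imbalance(sequence):
    """Largest |#T - #C| seen at any point during enrollment."""
    diff = worst = 0
    for arm in sequence:
        diff += 1 if arm == "T" else -1
        worst = max(worst, abs(diff))
    return worst


def count_predictable(sequence, block_size):
    """Allocations that are certain given earlier ones in the same block."""
    half = block_size // 2
    predictable = 0
    for start in range(0, len(sequence), block_size):
        t_seen = c_seen = 0
        for arm in sequence[start:start + block_size]:
            if t_seen == half or c_seen == half:
                predictable += 1  # one arm is full, so this one is forced
            if arm == "T":
                t_seen += 1
            else:
                c_seen += 1
    return predictable


# 200 participants either way: 100 blocks of 2, or 25 blocks of 8.
small_blocks = permuted_block_sequence(seed=1, n_blocks=100, block_size=2)
large_blocks = permuted_block_sequence(seed=1, n_blocks=25, block_size=8)

print(max_imbalance(small_blocks), count_predictable(small_blocks, 2))
print(max_imbalance(large_blocks), count_predictable(large_blocks, 8))
```

With blocks of 2, every second allocation is forced, so half the list is predictable, while imbalance never exceeds 1. With blocks of 8, imbalance can reach 4 but far fewer allocations are forced. Varying the block size at random is a common way to soften the trade-off further, although that variant is not shown here.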