A Misdirected COMPAS

I’m generally a fan of the use of actuarial tools in criminal justice. These tools help estimate the risk of a given event (say, failure to appear in court or rearrest) based on factors that are statistically correlated with that event. Part of my support for actuarial tools comes from my belief that criminal justice is as susceptible (if not more so) to implicit bias as other human endeavors, and that, without necessarily meaning to, judges and prosecutors, who hold the majority of power in the criminal justice system, will implicitly make different evaluations of people based on their race. Discretionary decisions are hard to review. Actuarial tools can guide or replace that discretion with something that, in theory, is more accurate, transparent, and fair. Actuarial tools also fit well into the overall move to justify practices with evidence.

I am therefore extremely troubled to learn (via Doug Berman) about this study of the COMPAS risk assessment tool undertaken by ProPublica (methodology here), which found that, as used in Florida, its risk predictions were biased on the basis of race. COMPAS, which stands for Correctional Offender Management Profiling for Alternative Sanctions, is the intellectual property of a private corporation, Northpointe. Jurisdictions using COMPAS enter data on a number of issues, and a proprietary, non-public algorithm converts these data points into a risk score. By comparing COMPAS risk scores with criminal records, ProPublica reporters were able to assess how accurate the predictions were. The results? “In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways,” over-estimating risk for black defendants (when compared with actual reoffending rates) and under-estimating risk for white defendants (when compared with actual reoffending rates). This means, in practical terms, that black defendants were unjustly denied favorable release terms and white defendants were unjustly granted them.

Because I am, in general, in favor of risk assessment tools, I want to take this opportunity to clarify that being in favor of evidence-based practices and risk-assessment tools does not mean that all tools are equal, nor that they can be installed without modification or monitoring. It is extremely important to have good tools, good training, and good practices.

There are a few best practices that should always be followed.

Validation. Validation means looking at how the risk assessment tool actually did at predicting results, and changing the tool as necessary. It’s not just best practice to validate risk assessment tools–it’s malpractice not to validate them. Judicial discretion is hard to review because we can’t read judges’ minds and know what went into a given decision. But actuarial tools can be checked, and we can see how good they are at predicting the future. Unless we check, we’re just putting our blind faith in something because it has a number on it. That’s not evidence-based practice; that’s numerology. Even tools that are valid in one jurisdiction are not necessarily valid in another, due to differences (often unobserved) between one place and the next. If the tool doesn’t work, stop using it until you change it. After you use it, keep checking. Institutions, not journalists, need to take the responsibility on themselves to see whether what the tool is telling them is actually true, and whether there are unexplained racial disparities.
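The validation check described above can be sketched in a few lines. This is a hypothetical illustration with invented risk bands, predicted rates, and outcomes (none of it is COMPAS data): compare each band's predicted rearrest rate against what actually happened.

```python
from collections import defaultdict

def calibration_by_band(cases):
    """Compare each risk band's predicted rearrest rate to the observed rate.

    cases: iterable of (band, predicted_rate, rearrested) tuples.
    Returns {band: (predicted, observed)}. A validated tool has
    observed roughly equal to predicted in every band, and stays
    that way under periodic re-checking.
    """
    outcomes, predicted = defaultdict(list), {}
    for band, pred, rearrested in cases:
        outcomes[band].append(rearrested)
        predicted[band] = pred
    return {b: (predicted[b], sum(v) / len(v)) for b, v in outcomes.items()}

# Invented follow-up data for eight defendants:
cases = [
    ("low", 0.15, False), ("low", 0.15, False), ("low", 0.15, False),
    ("low", 0.15, True),
    ("high", 0.60, True), ("high", 0.60, True),
    ("high", 0.60, False), ("high", 0.60, False),
]
result = calibration_by_band(cases)
print(result)  # low: predicted 15% vs. observed 25%; high: 60% vs. 50%
```

In practice the same comparison would be run on thousands of cases, and re-run on a schedule, since a tool that was calibrated at adoption can drift as the population or local practices change.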

Use open tools, open data, and open methods. Part of the problem with COMPAS is that the tool itself is proprietary. It’s difficult to check without knowing how the data is scored (and scoring, not the question list, is the key–it gives you the verdict about risk that drives decisions). I think all tools should be open, all data collected should be auditable, and we should know how things are scored. All of this should also be challengeable. This is just basic Daubert reasoning for you lawyers out there.

Test on subpopulations. Given the disparate impact criminal justice has had on race and class (etc.), we should always do what ProPublica did: control for known risk factors and see if there are disparate racial impacts. This just seems like a no-brainer. The integrity of the criminal justice system–and people’s faith in it–demand no less.
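The subpopulation test amounts to computing error rates separately per group, which is the core of what ProPublica did. A minimal sketch, with invented records chosen to mirror the disparity pattern the article describes:

```python
def error_rates_by_group(records):
    """Per-group error check: false positive rate (non-reoffenders labeled
    high risk) and false negative rate (reoffenders labeled low risk).

    records: iterable of (group, labeled_high_risk, reoffended) tuples.
    """
    counts = {}
    for group, high_risk, reoffended in records:
        c = counts.setdefault(group, {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
        if reoffended:
            c["pos"] += 1
            c["fn"] += 0 if high_risk else 1
        else:
            c["neg"] += 1
            c["fp"] += 1 if high_risk else 0
    return {g: {"fpr": c["fp"] / c["neg"], "fnr": c["fn"] / c["pos"]}
            for g, c in counts.items()}

# Invented records: group A is over-flagged, group B under-flagged.
records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", False, False),
    ("A", True, True), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", False, False),
    ("B", False, True), ("B", True, True),
]
rates = error_rates_by_group(records)
print(rates)  # A: fpr 0.5, fnr 0.0 -- B: fpr 0.25, fnr 0.5
```

Similar false positive and false negative rates across groups is one natural fairness check; overall accuracy alone can look fine while hiding exactly this kind of asymmetry.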

I don’t think this example is fatal to the idea of risk assessment tools, just a cautionary note that you don’t check your skepticism at the door. Using the tools means accepting their limitations and knowing what they were designed to do–and what they can’t do. Don’t pound in a nail with the handle of a screwdriver. If you do use the tools, the benefits only accrue with changes to how you do business–including collecting data and changing operations based on that data. The main benefit of such tools is that they can be checked much more easily than discretionary decisions made elsewhere in the criminal justice system. But to do that, we need to have more openness with them–and administrators who understand that risk assessment is an iterative process that requires management and reassessment.

Author: W. David Ball

W. David Ball is an Associate Professor at Santa Clara School of Law. He writes and teaches primarily in the fields of criminal law and criminal procedure, with a special focus on sentencing and corrections. He also serves as the Co-Chair of the Corrections Committee of the American Bar Association.

8 thoughts on “A Misdirected COMPAS”

  1. Isn't there a larger problem with statistical risk assessment, in that it will accurately reflect the discriminatory forces facing people of different races caught up in the criminal justice system? Take recidivism. Black and white ex-cons will face different levels of hostility from prospective employers, different possibilities of financial support from family and friends, and so on. It would not be surprising if this were reflected in a higher risk score for blacks. If you are using the statistics to allocate support, fine. Not if you are using them to screen job applicants.

    1. Hi James–

      Thanks so much for your reply. Following the lead of Jennifer Skeem, I think it's useful to distinguish between bias (inaccuracy based on non-relevant factors) and disparate impact (scores that might be related to relevant factors but that affect some populations more severely than others). It's definitely true that some tools can "bake in"/embed racial disparities through factors like age at first arrest: black youths are more likely to be arrested, so if this factor is included, they will be more likely to receive more severe sanctions, which, ceteris paribus, not only makes the predictive power of arrest endogenous but also means (since future criminality is partly a function of time behind bars) that it will affect recidivism as well. What's troubling about this ProPublica story is that it appears to be a function of bias–although on Twitter another reader has posted this analysis (https://twitter.com/RAVerBruggen/status/734853560576442368), which, if I had quant skills enough and time, I could better assess.

      As for employment, the study with which I'm most familiar suggests that race actually dominates record. Devah Pager (http://scholar.harvard.edu/files/pager/files/pager_ajs.pdf) submitted resumes and found that white ex-cons were more likely to receive job interviews than black applicants without criminal records.

    2. If you're trying to predict criminal behavior, it's important to include the effect of racism in your model. After all, a black ex-con who's rejected from jobs and housing due to his race is more likely to resort to criminal activity than a white ex-con who's given a 2nd/3rd/4th/5th chance. If you don't include effects for race and ethnicity, your model might work in Never Never Land with no racists, but will not work in reality, no matter what decision you're trying to help with the model.
      Many statistical techniques provide estimates of the error each model has (https://en.wikipedia.org/wiki/Out-of-bag_error). So you can measure how much worse your model will be if you remove race/ethnicity/gender information. Then you can ask yourself, "Am I OK with making worse decisions than I might make otherwise, so that I can pretend that life is fair?"

      1. I don't want to give the wrong impression: it's important to check your own model to avoid the sorts of errors Northpointe made. A model that is wrong in one direction for one set and in another direction for another set is not a great model.
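The point in the comment above about measuring how much worse a model gets when a feature is removed can be illustrated with a toy comparison. This is a hypothetical sketch with invented reoffense rates, using the Brier score (mean squared error of probability forecasts) rather than out-of-bag error:

```python
def brier(pred_actual_pairs):
    """Mean squared error of probability forecasts; lower is better."""
    return sum((p - y) ** 2 for p, y in pred_actual_pairs) / len(pred_actual_pairs)

# Invented population: group X reoffends 60% of the time, group Y 20%.
data = [("X", 1)] * 6 + [("X", 0)] * 4 + [("Y", 1)] * 2 + [("Y", 0)] * 8

# Model A keeps the group feature and predicts each group's own rate;
# Model B omits it and predicts the pooled rate for everyone.
group_rate = {"X": 0.6, "Y": 0.2}
pooled_rate = sum(y for _, y in data) / len(data)  # 0.4

with_feature = brier([(group_rate[g], y) for g, y in data])    # 0.20
without_feature = brier([(pooled_rate, y) for g, y in data])   # 0.24
print(with_feature, without_feature)
```

The gap between the two scores quantifies the accuracy cost of dropping the feature, which is exactly the trade-off the commenter says you should measure before deciding whether the cost is worth paying.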

    3. Here is the abstract to the Skeem (et al) paper I was referencing, which I think is relevant also to Adrian Martin's comment.
      http://papers.ssrn.com/sol3/Papers.cfm?abstract_i

      Gender, Risk Assessment, and Sanctioning: The Cost of Treating Women Like Men

      Abstract:
      Increasingly, jurisdictions across the U.S. are using risk assessment instruments to scaffold efforts to unwind mass incarceration without compromising public safety. Despite promising results, critics oppose the use of these instruments to inform sentencing and correctional decisions. One argument is that the use of instruments that include gender as a risk factor will discriminate against men in sanctioning. Based on a sample of 14,310 federal offenders, we empirically test the predictive fairness of an instrument that omits gender, the Post Conviction Risk Assessment (PCRA). We found that the PCRA strongly predicts arrests for both genders — but overestimates women’s likelihood of recidivism. For a given PCRA score, the predicted probability of arrest — which is based on combining both genders — is too high for women. Although gender neutrality is an obviously appealing concept, it may translate into instrument bias and overly harsh sanctions for women. With respect to the moral question of disparate impact, we found that women obtain slightly lower mean scores on the PCRA than men (d=.32); this difference is wholly attributable to men’s greater criminal history, a factor already embedded in sentencing guidelines.

  2. Even if the claimed accuracy (68%) were true, isn't that kinda rotten for a predictive tool that's being used to make decisions about individuals' liberty? (I know, there aren't many alternatives, and few that at least pretend to be unbiased, but still.) What concerns me even more is the ultimately self-fulfilling aspect of any disparate impact. Putting someone in jail for longer doesn't (apparently) reduce their likelihood of committing crimes in the future, and may increase it, especially for people with previously light records.

  3. This sort of model is something I know something about. (I'm a life insurance actuary, and this is a classic underwriting model.)

    First point: ProPublica's analysis is troubling, but you always need to compare models to alternatives. If the alternative is "everyone arrested stays in jail for a week", or "the judge decides without the model score," this may be worse even if model scores are quite bad.

    Second point: Robert VerBruggen has a critical point that is easy to see once you see it. For an example, let's estimate the chances that a dish is "inedibly spicy" using a model. We could put in a bunch of factors, do the ProPublica analysis, and note that dishes with cayenne pepper were more likely to be rated "inedibly spicy" when they weren't, and dishes with mustard were less likely to be rated "inedibly spicy" when they actually were. Now, however, note another fact: many more dishes including cayenne pepper were rated inedibly spicy. (Let's say 70% of cayenne dishes, and 20% of mustard dishes, were rated "inedibly spicy"; thus 30% of cayenne dishes, and 80% of mustard dishes, were rated "can be eaten comfortably.") Now assume the ratings had a consistent error: 10% of the dishes in each rated category were in the wrong one. Then 7% of all cayenne dishes were rated "inedibly spicy" despite being edible, vs. 2% of mustard dishes; and 8% of all mustard dishes were rated "comfortable to eat" despite being inedibly spicy, vs. 3% of cayenne dishes.

    But it isn't a prejudice against cayenne that's the problem; it's a constant error rate, applied to a larger group.

    Third point: it seems like a very bad idea–one any underwriting textbook will warn you about–to have self-reported, easily modified information in your model. That is going to mean social knowledge of the "right" answers makes the model degrade rapidly. (If the "right" answer to "how much do you think about pot" is "occasionally", not "never" or "frequently", some social groups will know that and some won't.)
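The cayenne/mustard arithmetic in the second point above can be reproduced directly. A sketch under the stated assumptions (70% and 20% of dishes rated spicy, a flat 10% of each rated category misclassified):

```python
def mislabel_shares(rated_spicy, error_rate=0.10):
    """With a flat error rate applied to each rated category, return:
    (a) the share of ALL dishes wrongly rated "inedibly spicy", and
    (b) the share of truly edible dishes wrongly rated spicy --
        the ProPublica-style conditional measure.
    """
    wrongly_spicy = error_rate * rated_spicy                     # edible, rated spicy
    truly_edible = wrongly_spicy + (1 - error_rate) * (1 - rated_spicy)
    return wrongly_spicy, wrongly_spicy / truly_edible

cayenne = mislabel_shares(0.70)  # 7% of all dishes; ~21% of the edible ones
mustard = mislabel_shares(0.20)  # 2% of all dishes; ~3% of the edible ones
```

Identical error rates, very different conditional mislabel rates: the group with the higher base rate of "spicy" ratings ends up with far more of its edible dishes wrongly flagged, which is the base-rate point the commenter (and VerBruggen) is making.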
