In our original post, “Why Lenders Shouldn’t ‘Just Use SHAP’ To Explain Machine Learning Credit Models,” we raised several issues lenders face when trying to explain their machine learning (ML) credit models with “just” the popular open-source package SHAP. Computer scientist Scott Lundberg, one of the primary authors of SHAP, has been gracious with his time and responded to many of our questions and comments on his GitHub page. We agree with most of what Scott says, but we want to clarify our position on a few issues.
Mainly, we want to clarify that one must take care to handle the transition between margin space and score space correctly, so that the explanations generated accurately reflect how a model-based decision is made. Although this may seem like a technical nit, it is critically important both for generating accurate adverse action reasons and for doing adequate fair lending analysis. It’s also why we spent so much time and care developing the core explainability math inside ZAML: so that lenders can accurately assess the reasons for a model-based decision in score space, without requiring unrealistic assumptions such as variable independence, or that missing values are missing completely at random.
Let’s first consider the process of generating adverse action reasons. Under the Equal Credit Opportunity Act and Regulation B, when an applicant is denied a loan, the lender must respond with an adverse action letter notifying the applicant of the denial of credit and listing the principal reasons they were denied. These are provided for two purposes: first, so that the applicant can see if there are any errors in the information provided to the lender, and, second, so that the applicant can figure out what to do to raise their likelihood of being approved in the future.
Scott calls out one of our statements in his response and raises an important issue. We say:
If you compute the set of weighted key factors in margin space, you'll get a very different set of factors and weights than if you compute them in score space, which is where banks derive their top five explanations for rejecting a borrower.
And then Scott says:
[The additivity of margin space] is the same reason I often encourage people to think about explanations in margin space and not just use probability space... I talked with Zest about this and it seems like it could be better to do the explanations in score space for finance, but that conclusion is not 100% clear cut.
We respectfully disagree.
It is important to understand why the distinction between margin space and score space matters as you consider a method for generating adverse action reasons. The lender needs to tell the applicant the five most important reasons they were denied, but what does “most important” actually mean? Lenders approve a fixed fraction, say k percent, of all applications; that is, they approve the highest-ranked k percent of the applications they receive. That's a ranking problem, so the natural space in which to discuss it is the “score space,” in which the distribution of outputs is uniform. The applicant does not benefit from knowing which five values would most reduce the gap between their margin-space score and an abstract threshold corresponding to the desired rank. What matters to the applicant is rank, and so lenders are required to disclose which five factors would most reduce the difference between the rank of their current application and the threshold acceptance rank.
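To make the ranking point concrete, here is a minimal Python sketch under assumed data. The population margins are invented for illustration, `percentile_rank` is a hypothetical helper (not part of SHAP or ZAML), and we assume a higher margin means a stronger application. It maps a margin-space score to an empirical percentile and compares the rank gained from the same margin improvement at two different starting points.

```python
from bisect import bisect_right

# Hypothetical margin-space scores for a tiny applicant population.
# Assumption: higher margin = stronger application.
population_margins = sorted(
    [-2.0, -1.5, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, 0.5, 2.0]
)


def percentile_rank(margin, population=population_margins):
    """Fraction of the population scoring at or below `margin` (empirical CDF)."""
    return bisect_right(population, margin) / len(population)


# The same +0.5 improvement in margin space, from two starting points.
gain_dense = percentile_rank(-1.0 + 0.5) - percentile_rank(-1.0)
gain_sparse = percentile_rank(0.5 + 0.5) - percentile_rank(0.5)

print(f"rank gained in the crowded region: {gain_dense:.2f}")
print(f"rank gained in the sparse region:  {gain_sparse:.2f}")
```

In this toy population, an identical margin improvement passes many applicants where scores are crowded together and none where they are sparse, which is why a factor's importance for approval can look different in rank terms than in margin terms.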
Thus, an explainer powering adverse action reasons needs to provide accurate “score space” reasons. While reasoning in margin space is convenient, ultimately financial services applications of ML explainability require something different. This issue isn’t limited to finance applications; it applies to any modeling problem that requires a probability assignment.
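The difference between the two spaces can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the three-feature logistic model (`WEIGHTS`, `BIAS`), the applicant and baseline vectors, and the `shapley` helper, which computes exact Shapley values by brute force against a single baseline point rather than a background distribution. It is not how SHAP or ZAML computes explanations.

```python
import math
from itertools import permutations


def shapley(value_fn, x, baseline):
    """Exact Shapley values of value_fn at x relative to one baseline point.

    Averages each feature's marginal contribution over every ordering in
    which features are switched from their baseline value to their value in x.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)
        prev = value_fn(z)
        for i in order:
            z[i] = x[i]
            cur = value_fn(z)
            phi[i] += cur - prev
            prev = cur
    return [p / len(orderings) for p in phi]


# Hypothetical toy model: additive in margin (log-odds) space.
WEIGHTS = [1.5, 1.0, 0.5]
BIAS = -1.0


def margin(z):
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, z))


def probability(z):
    return 1.0 / (1.0 + math.exp(-margin(z)))


def shares(phi):
    """Each factor's share of the total absolute attribution."""
    total = sum(abs(p) for p in phi)
    return [abs(p) / total for p in phi]


applicant = [2.0, 1.0, -1.0]  # hypothetical feature values
baseline = [0.0, 0.0, 0.0]    # hypothetical reference point

phi_margin = shapley(margin, applicant, baseline)
phi_prob = shapley(probability, applicant, baseline)

print("margin-space shares:     ", [round(s, 3) for s in shares(phi_margin)])
print("probability-space shares:", [round(s, 3) for s in shares(phi_prob)])
```

In both cases the attributions sum to the change in output (margin or probability) between baseline and applicant, but each factor's share of the total shifts between the two spaces. The ordering of factors happens to be preserved here because the toy model is additive in margin space; that should not be assumed for models with interactions.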
As Scott rightly points out, score space explanations are available from the various explainers within SHAP, but the score space explanations generated by SHAP are an approximation. Per the SHAP documentation, TreeExplainer probability outputs (which are required for applications like generating adverse action reasons in credit, as explained above) are only available when the variables upon which the model depends are statistically independent. This just isn't realistic for a credit risk scenario, in which many of the variables are dependent on each other and any adequate risk model must capture that fact.
To see how this is problematic in a domain like credit risk, consider two common input variables: total debt to income (DTI) and revolving credit utilization. These variables are not independent: revolving credit utilization usually leads to a monthly payment obligation and is therefore usually a component of total debt to income. Imposing an independence assumption means you can't explain one of machine learning's greatest strengths: the ability of ML models to capture interactions among variables.
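A rough Python sketch of the dependence between these two variables, using synthetic data only. The data-generating process is an assumption invented for illustration (revolving payments set to at most 30% of income, plus some other non-negative debt), and `pearson` and `count_impossible` are hypothetical helpers. It compares the real pairing of utilization and DTI with a pairing that treats them as independent by shuffling one column.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical synthetic applicants: DTI contains a revolving component
# driven by utilization, plus other (non-negative) debt obligations.
n = 1000
utilization = [rng.uniform(0.0, 1.0) for _ in range(n)]
revolving_component = [0.3 * u for u in utilization]
other_debt = [rng.uniform(0.0, 0.3) for _ in range(n)]
dti = [r + o for r, o in zip(revolving_component, other_debt)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def count_impossible(revolving, total_dti):
    """Rows where total DTI is smaller than its own revolving component."""
    return sum(1 for r, d in zip(revolving, total_dti) if d < r)


# Treating the features as independent: pair each applicant's utilization
# with a DTI value taken from some other applicant.
shuffled_dti = dti[:]
rng.shuffle(shuffled_dti)

print("correlation, real pairs:    ", round(pearson(utilization, dti), 2))
print("correlation, shuffled pairs:", round(pearson(utilization, shuffled_dti), 2))
print("impossible rows, real:      ", count_impossible(revolving_component, dti))
print("impossible rows, shuffled:  ", count_impossible(revolving_component, shuffled_dti))
```

Under these assumptions the real pairs are strongly correlated and never contradictory, while the independently paired rows include applicants whose total DTI is lower than the revolving portion of it, combinations a model would never see in practice.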
GradientExplainer, the part of SHAP that can be used for explaining continuous models like neural networks, assumes the input features are independent as well. From the SHAP README page, "If we approximate the model with a linear function between each background data sample and the current input to be explained, and we assume the input features are independent then expected gradients will compute approximate SHAP values." Unfortunately, feature independence is not a safe assumption in credit risk, and, as far as we can tell, not a necessary assumption to make here.
KernelExplainer, a more sophisticated implementation of LIME that computes Shapley values, suffers from a different flaw. It makes the assumption that missing values can be filled with an average, as though they are missing at random. This is not valid in most real-world datasets, and it is especially not valid in datasets arising from financial services applications. To see why, consider the meaning of a common credit risk variable, an applicant’s credit score. That someone has a missing credit score provides information about the distribution of the other signals corresponding to the application. A missing credit score usually indicates a lack of credit history, which, in turn, suggests that many of the variables associated with a credit history will be differently distributed for that population than they would be for the population in general. This means that creating a population of artificial completions for applications from which the credit score is omitted is more complicated than it may at first seem. KernelExplainer will erroneously impute a spectrum of values drawn from the population as a whole. This will inevitably lead to providing the wrong adverse action reasons to the consumer.
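The distributional point can be shown with a tiny hand-built example in Python. The applicant records and the `tradelines` feature are invented for illustration, and the sketch does not call SHAP; it only compares what a population-wide average says about applicants with no credit score against what those applicants actually look like.

```python
from statistics import mean

# Hypothetical applicants: (credit_score or None, number_of_tradelines).
applicants = [
    (720, 12), (680, 9), (750, 15), (640, 7), (700, 10), (610, 6),
    (None, 0), (None, 1), (None, 0), (None, 2),
]

tradelines_all = [t for _, t in applicants]
tradelines_no_score = [t for s, t in applicants if s is None]
observed_scores = [s for s, _ in applicants if s is not None]

# What a population-wide fill would assume about a no-score applicant...
population_mean_tradelines = mean(tradelines_all)
mean_filled_score = mean(observed_scores)

# ...versus what applicants with a missing score actually look like.
no_score_mean_tradelines = mean(tradelines_no_score)

print("mean tradelines, whole population:", population_mean_tradelines)
print("mean tradelines, missing score:   ", no_score_mean_tradelines)
print("score a mean fill would assign:   ", round(mean_filled_score, 1))
```

In this toy data, filling from the whole population describes a no-score applicant as having several tradelines and a mid-range score, when the applicants who actually lack a score have almost no credit history at all.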
The same issues that come up when considering how to generate adverse action reasons also come up when considering fairness. The Equal Credit Opportunity Act requires lenders to make decisions without regard to race and ethnicity, gender, and other protected statuses. The act further requires lenders to identify disparities in approval rate and pricing terms. If disparate impact exists, e.g., if the approval rate or pricing for applicants within a protected class is unfavorable when compared to the unprotected baseline, the lender must quantify and understand the drivers of that disparity, and mitigate them or document them accurately. The law provides for stiff enforcement penalties. You can easily see how quantifying the drivers of a difference in approval rate also requires reasoning in “score space”: the factors that drive an applicant to be approved or not are based on the rank ordering of the applicant’s credit risk.
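As a simple illustration of why approval-rate disparity is a rank-based quantity, the following Python sketch approves the top fraction of a small, invented applicant pool and compares approval rates across two groups. The group labels "A" and "B", the scores, the 40% approval fraction, and both helper functions are hypothetical; this is not a fair lending methodology, only the arithmetic behind an approval-rate gap.

```python
# Hypothetical applicants: (group label, model score); higher score is better.
applicants = [
    ("A", 0.91), ("A", 0.85), ("A", 0.78), ("A", 0.66), ("A", 0.52),
    ("B", 0.88), ("B", 0.61), ("B", 0.49), ("B", 0.40), ("B", 0.35),
]


def approve_top_fraction(pool, fraction):
    """Approve the highest-ranked `fraction` of the pool by score."""
    n_approved = round(len(pool) * fraction)
    ranked = sorted(pool, key=lambda a: a[1], reverse=True)
    return ranked[:n_approved]


def approval_rate(pool, approved, group):
    """Share of applicants in `group` who were approved."""
    in_group = sum(1 for g, _ in pool if g == group)
    approved_in_group = sum(1 for g, _ in approved if g == group)
    return approved_in_group / in_group


approved = approve_top_fraction(applicants, 0.40)
rate_a = approval_rate(applicants, approved, "A")
rate_b = approval_rate(applicants, approved, "B")

print("approval rate, group A:", rate_a)
print("approval rate, group B:", rate_b)
print("approval-rate gap:     ", round(rate_a - rate_b, 2))
```

Whether an applicant is approved depends only on where their score falls in the ranking relative to the cutoff, so explaining the gap between the two groups means explaining differences in rank, not differences in margin.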
The point of all this is not to discredit the work of our esteemed colleague Dr. Lundberg. (BTW, congratulations, Scott, on successfully defending your dissertation and receiving your Ph.D.) Clearly, a team of data scientists with enough time and care can make the improvements and accommodations required to use open-source packages like SHAP safely in financial services.
But we think it’s important to understand limitations, assumptions, and safe operating parameters before applying algorithms and techniques, especially in a domain like credit, where significant life-changing events are at stake, such as the ability to own a home or to get financing for a car you need to drive to work. We spent significant resources to develop the core explainability math inside ZAML so that lenders can accurately assess the reasons for a model-based decision, and thereby safely make use of the significantly better predictive power offered by modern machine learning techniques.