Federal banking regulators have yet to issue official rules about the use of AI and machine learning in credit underwriting. But on Monday the Office of the Comptroller of the Currency for the first time explicitly called out the use of AI and machine learning in credit in its semi-annual report on the key issues facing the national banking system.
Small stone, big ripple. The financial services industry is increasingly adopting machine learning (ML) for a range of applications. ML models are powerful at predicting outcomes because they can consider more data than traditional models, apply sophisticated math to evaluate many variables and the relationships among them, and continually refine their underlying algorithms to improve predictive power. ML technologies have the potential to bring more unbanked and underbanked consumers into the financial system, enhance access to responsible credit, and contribute positively to the overall safety and soundness of the financial system.
Increased predictive power, however, comes with increased model risk and complexity. In its report this week, the OCC wrote: “Bank management should be aware of the potential fair lending risk with the use of AI or alternative data in their efforts to increase efficiencies and effectiveness of underwriting. It is important to understand and monitor underwriting and pricing models to identify potential disparate impact and other fair lending issues. New technology and systems for evaluating and determining creditworthiness, such as machine learning, may add complexity while limiting transparency. Bank management should be able to explain and defend underwriting and modeling decisions.”
At ZestFinance, we couldn’t agree more about the fundamental need to “explain and defend” complicated ML models. Our ZAML software tools are expressly designed to help lenders do both as they transition from conventional underwriting to transparent ML. ZAML software quickly renders the inner workings of ML models transparent from creation through deployment. You can use the tools to monitor model health at run time and trust that the results are fair and accurate. A handful of other techniques in the market also claim to solve ML’s “black-box” problem, but they’re often snake oil, providing inconsistent, inaccurate, and computationally expensive explanations. They can also easily fail to spot race- and gender-based discrimination. Our ZAML fair lending tool can both find racial or gender disparities and, within seconds, produce less discriminatory versions of the model.
Several innovative lenders are already deploying ML models for credit underwriting, but many more are waiting for more clarity about how the current regulatory framework applies to machine learning-powered credit models. The OCC and Federal Reserve last issued supervisory guidance around managing model risk in 2011, largely before ML was used in financial services, and thus the guidance does not explicitly address ML models.
To help bridge this gap, and facilitate the industry’s transition to ML, we’ve developed the following FAQ to address questions about how ML fits within the existing Model Risk Management guidance, especially in credit underwriting. We’ve spent a lot of time in Washington to understand the concerns of the agencies, legislators, and their staffs. As we see it, the goals for ML underwriting are to ensure safety and soundness in the financial system, increase access to credit, and minimize the risk of being sued or fined for violating fair lending and equal credit laws. The accompanying Q&A sets forth our current views on best practices for the responsible adoption of ML underwriting consistent with the relevant portions of the 2011 guidance.
OVERVIEW OF MODEL RISK MANAGEMENT
Is it acceptable to use ML models in high stakes financial services decision-making?
- Yes. Nothing in the guidance precludes the use of ML models. The guidance applies to a financial institution’s use of any “model,” which it defines as a “quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.” ML fits squarely within this definition.
- Fairness, anti-discrimination, and safety and soundness goals tend to support the use of more predictive models, including ML models. As many as fifty million Americans have incomplete or inaccurate credit bureau data. Millions of these consumers are denied access to credit by lenders using conventional, static credit scoring techniques because those models often inaccurately predict default risk. ML models are resilient to incomplete data, able to consider more variables, and capable of creating models that more accurately assess credit risk. Consequently, ML’s enhanced predictive power has the potential to safely expand access to credit while reducing losses and systemic risk.
- However, ML-based credit risk models must be validated, documented, and monitored using methods appropriate to the modeling approach selected in order to comply with the principles articulated in the guidance. As discussed below, conventional validation approaches are not sufficient to evaluate ML models. ML model developers and institutions should take care to conform their practices to the principles in the guidance regarding Model Development, Implementation, and Use, and Model Evaluation and Verification, using techniques robust enough to assess and explain the performance of ML models.
MODEL DEVELOPMENT, IMPLEMENTATION, AND USE
Can you use as many variables as desired in a model?
- Yes. The guidance does not address, or limit, the number of variables that may be used in a model, and nothing in the guidance suggests that using fewer variables necessarily decreases risk. ML models can consider many more variables than traditional methods, which is a key reason why ML models often provide greater predictive power, and deliver superior results, compared to traditional models.
- The same data review and documentation practices outlined in the guidance still apply to ML models even though ML models consider many more variables than traditional models. As the guidance indicates, “there should be rigorous assessment of data quality and relevance, and appropriate documentation. Developers should be able to demonstrate that such data and information are suitable for the model.”
Can model developers analyze vastly more variables and still comply with the guidance?
- Yes. As the guidance states: “Developers should be able to demonstrate that such data and information are suitable for the model and that they are consistent with the theory behind the approach and with the chosen methodology. If data proxies are used, they should be carefully identified, justified, and documented.”
- ML models consider hundreds or even thousands of variables, so it may be impractical to review all of them manually. An automated variable review may be the most effective way to support comprehensive analysis and documentation of the data and the model. Automated variable review methods should identify and document data issues that could raise questions about the predictive power, fairness, and safety and soundness of a model. Notably, variables should be reviewed for unexpected and/or inconsistent distributions, mappings, and other data degradation issues that can lead to model misbehavior. ML models also detect patterns and relationships among variables that no human would detect, and this continuously evolving multivariate analysis is what makes assessing the data during the development phase difficult. The guidance calls for documentation of these review methods and descriptions of the assumptions and theoretical basis for their use.
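As a rough illustration of what an automated variable review might record, the following minimal sketch summarizes each variable and flags a few common data issues. The function names, the sample columns, and the 20% missing-rate limit are all hypothetical choices made for this example, not requirements from the guidance or a description of any particular product.

```python
from statistics import mean, pstdev

def review_variable(name, values, max_missing_rate=0.2):
    """Summarize one variable and flag common data-quality issues.

    `values` is a list in which None marks a missing entry. The 20%
    missing-rate limit is an illustrative assumption, not a regulatory
    threshold.
    """
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / len(values) if values else 1.0
    flags = []
    if missing_rate > max_missing_rate:
        flags.append("high_missing_rate")
    if present and len(set(present)) == 1:
        flags.append("constant_value")
    numeric = [v for v in present if isinstance(v, (int, float))]
    if present and 0 < len(numeric) < len(present):
        flags.append("mixed_types")  # e.g. numbers mixed with strings
    summary = {
        "variable": name,
        "n": len(values),
        "missing_rate": round(missing_rate, 4),
        "flags": flags,
    }
    # Only report numeric statistics when every present value is numeric.
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric),
                       mean=mean(numeric), std=pstdev(numeric))
    return summary

def review_dataset(columns):
    """Review every column in a {name: values} mapping."""
    return [review_variable(name, values) for name, values in columns.items()]

# Hypothetical development data.
columns = {
    "months_at_address": [12, 48, None, 7, 120],
    "inquiries_6m": [0, 0, 0, 0, 0],
    "stated_income": [52000, None, None, "n/a", 61000],
}
report = review_dataset(columns)
for row in report:
    print(row)
```

The resulting list of summaries could be written straight into model documentation, so the review and its record are produced in the same step.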
What methods are permissible for assessing the soundness of an ML model?
- The guidance does not prescribe any specific method for validating any model, including a machine learning model. Nonetheless, the guidance sets out a core framework for effective model validation: evaluation of conceptual soundness, ongoing monitoring, and outcomes analysis.
- Regarding soundness, certain conventional evaluation methods described in the guidance would, if applied to ML models, be ineffective and would likely produce misleading results. For example, one of the testing methods identified by the guidance is sensitivity analysis. Common implementations of sensitivity analysis include exploring all combinations of inputs and permuting these inputs one-by-one (univariate permutation) in order to understand the influence of each variable (or a combination thereof) on model scores. Exploring all combinations of inputs (exhaustive search) is computationally infeasible for most ML models. Univariate permutation (permuting inputs one-by-one), while more computationally tractable, yields incorrect results for ML models that capture and evaluate multivariate interactions.
- Effective ML model evaluation techniques should be efficient and tractable, and designed to test how ML models actually work. Such techniques should also assess the impact of multivariate interactions because ML models evaluate such interactions. Appropriate methods of evaluating ML models include techniques derived from game theory, multivariate calculus, and probabilistic simulation.
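To make the interaction problem concrete, here is a minimal sketch on a hypothetical three-input scoring function. It compares a one-at-a-time sensitivity check (a simplified stand-in for univariate permutation) with exact Shapley values, which are one well-known game-theoretic attribution method. Shapley values are used here purely as an illustration of the "game theory" family; the toy model, the baseline, and all function names are assumptions for this example.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # Hypothetical score with an interaction: x[0] and x[1] matter only together.
    return x[0] * x[1] + 0.5 * x[2]

def mix(x, baseline, subset):
    """Take the features in `subset` from x and all others from baseline."""
    return [x[i] if i in subset else baseline[i] for i in range(len(x))]

def one_at_a_time(model, x, baseline):
    """Univariate sensitivity: move one input at a time away from the baseline."""
    base = model(baseline)
    return [model(mix(x, baseline, {i})) - base for i in range(len(x))]

def shapley_values(model, x, baseline):
    """Exact Shapley attribution. Cost grows as 2**n, so this suits tiny n only."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = model(mix(x, baseline, s | {i})) - model(mix(x, baseline, s))
                total += weight * gain
        phi.append(total)
    return phi

x, baseline = [1, 1, 1], [0, 0, 0]
univariate = one_at_a_time(toy_model, x, baseline)
shapley = shapley_values(toy_model, x, baseline)
print("one-at-a-time:", univariate)
print("shapley:      ", shapley)
```

In this toy case the one-at-a-time check assigns no influence to the first two inputs, even though together they account for most of the score change, while the Shapley attributions sum to the full difference between the score and the baseline score. Exact enumeration is exponential in the number of inputs, so practical tools rely on approximations.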
How do the guidance’s monitoring standards apply to ML models?
- The guidance calls for ongoing model monitoring: “Such monitoring confirms that the model is appropriately implemented and is being used and is performing as intended…” The guidance further states: “Many of the tests employed as part of model development should be included in ongoing monitoring and be conducted on a regular basis to incorporate additional information as it becomes available.”
- A thorough approach for monitoring ML models should include:
- Input distribution monitoring: Recent model input data may be compared with model training data to determine whether incoming credit applications are significantly different from model training data. The more that live data differs from training data, the less accurate the model is likely to be. This data comparison is typically done by looking at variable distributions and ensuring recent data is drawn from a distribution similar to that of the model training data. For ML models, multivariate input variable distributions should be monitored to identify input data where combinations of values that were unlikely to appear together during model development are now occurring in production. Systems for monitoring model inputs should trigger alerts to monitors or validators when they spot anomalies or shifts that exceed pre-defined safe bounds.
- Missing input data monitoring: Comprehensive model monitoring should include monitoring for missing input data. Model input data comes from a variety of sources, some of which are retrieved over networks from third parties. Data sources could become unavailable in production. A complete model monitoring program should monitor and trigger alerts to monitors and validators when the rate of missing data, and its impact on model outputs and downstream business outcomes, exceed pre-defined thresholds.
- Output distribution monitoring: Model outputs should be monitored by comparing distributions of model scores over time. Monitoring systems should compute statistics that establish the degree to which the score distribution has shifted from the scores generated by the model in prior periods such as those contained in training and validation data sets.
- Execution failure monitoring: Error and warning alerts generated during model execution can indicate flaws in model code that may affect model outputs. Such alerts should, therefore, be closely monitored, the causes of such alerts should be investigated and identified, and appropriate remediation should be implemented where necessary.
- Latency monitoring: Model response times should be monitored to ensure model execution code and infrastructure meet the latency requirements of applications and workflows that rely on model outputs. Models that perform slowly or with unreliable execution time may cause intermittent timing issues, which can result in the generation of inaccurate scores. Establishing clear latency objectives and pre-defined alert thresholds should be part of a comprehensive model monitoring management program.
- Economic performance monitoring: A complete ML model monitoring solution should include business dashboards that enable analysts to configure or pre-define alert triggers on key performance indicators such as default rate, approval rate, and volumes. Substantial changes in these indicators can signal operational issues with model execution and, at a minimum, should be investigated and understood in order to manage risk.
- Reason code stability: Reason codes explain the key drivers of a model’s score. Reason code distributions should be monitored because material changes to the distributions can indicate a change in the character of the applicant population or even in the decision-making logic of the ML model.
- Fair lending analysis: Machine learning models can develop unintended biases for a variety of reasons. Relatedly, like any model, ML models can result in disparities between protected classes. To ensure that all applicants are treated fairly and in a non-discriminatory manner, it is important to monitor loan approvals, declines, and default rates across protected classes. Because of the possibility of bias and the advanced predictive fit of ML models, monitoring of these models should occur in real time.
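Several of the monitors above reduce to comparing a recent distribution with a reference distribution and alerting when the gap exceeds a pre-defined bound. The sketch below shows one common way to do that, the population stability index (PSI), plus a missing-rate check. It is a univariate illustration only, so it does not cover the multivariate input monitoring described above. The bin edges, the 0.25 PSI limit (a widely used rule of thumb), the 10% missing-rate limit, and all names are assumptions for this example.

```python
from collections import Counter
from math import log

def psi(expected, actual, floor=1e-6):
    """Population stability index between two lists of category labels.

    Works for binned numeric inputs, binned scores, or reason codes.
    `floor` avoids log(0) when a category is absent from one sample.
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for key in set(e_counts) | set(a_counts):
        e = max(e_counts[key] / len(expected), floor)
        a = max(a_counts[key] / len(actual), floor)
        total += (a - e) * log(a / e)
    return total

def bin_values(values, edges):
    """Map numeric values to bin indexes; None gets its own 'missing' bin."""
    return ["missing" if v is None else sum(v >= edge for edge in edges)
            for v in values]

def missing_rate(values):
    return sum(v is None for v in values) / len(values)

def check(name, train, live, edges, psi_limit=0.25, missing_limit=0.10):
    """Return alert messages when drift or missing data exceeds the limits."""
    alerts = []
    drift = psi(bin_values(train, edges), bin_values(live, edges))
    if drift > psi_limit:
        alerts.append(f"{name}: PSI {drift:.3f} exceeds {psi_limit}")
    rate = missing_rate(live)
    if rate > missing_limit:
        alerts.append(f"{name}: missing rate {rate:.0%} exceeds {missing_limit:.0%}")
    return alerts

# Hypothetical training-time and production values for one model input.
train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live = [0.7, 0.8, 0.9, 0.8, None, None, 0.9, 0.95]
edges = [0.25, 0.5, 0.75]

stable_alerts = check("utilization", train, train, edges)
shifted_alerts = check("utilization", train, live, edges)
print(stable_alerts)
print(shifted_alerts)
```

Because `psi` takes plain category labels, the same function could be pointed at binned model scores or at reason codes to track output and reason-code stability over time.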
Should model monitoring include automation?
- Yes. The guidance states: “monitoring should continue periodically over time, with a frequency appropriate to the nature of the model.” Given the complexity of ML models, automated model monitoring, which can run concurrently with model operations, is essential to meet the expectations set by the guidance, especially when combined with multivariate input monitoring and alerts. Changes to input and output distributions should be monitored in real time to identify problems promptly and reflected in periodic reports.
How should model outcomes be analyzed?
- As the guidance recommends, model outcomes should be thoroughly understood prior to adoption and deployment of any new model, including ML models. Because ML models can consider many more data points than traditional models, traditional tools such as manual review of partial dependence plots can be cumbersome or inaccurate. Such tools can also miss crucial aspects of ML model behavior, such as the influence of variable interactions. In addition to understanding fully how a model arrives at a score, it is important to understand the swap sets generated by switching to a new model: which applicants will now be approved (swap-ins) and which will now be denied (swap-outs). While the quantity of applicants is important, so is the quality of applicants. Outcomes validation methods should include an examination of the distribution of values for all model attributes of swap-ins and swap-outs, as well as a comparison with populations already accepted and with known credit performance.
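The swap-set comparison described for outcomes analysis can be sketched in a few lines. The applicant records, score cutoffs, and attribute names below are hypothetical and exist only to show the mechanics of splitting applicants into swap-ins, swap-outs, and applicants approved under both models, then comparing an attribute across those groups.

```python
from statistics import mean

def swap_sets(applicants, old_approves, new_approves):
    """Split applicants by how the approval decision changes between models."""
    swap_ins, swap_outs, kept = [], [], []
    for a in applicants:
        old, new = old_approves(a), new_approves(a)
        if new and not old:
            swap_ins.append(a)      # denied before, approved now
        elif old and not new:
            swap_outs.append(a)     # approved before, denied now
        elif old and new:
            kept.append(a)          # approved under both models
    return swap_ins, swap_outs, kept

def attribute_means(group, attributes):
    """Average each attribute over a group (None for an empty group)."""
    return {attr: (mean(a[attr] for a in group) if group else None)
            for attr in attributes}

# Hypothetical applicants: a legacy score and a new model's default probability.
applicants = [
    {"id": 1, "old_score": 700, "new_pd": 0.04, "utilization": 0.30},
    {"id": 2, "old_score": 640, "new_pd": 0.05, "utilization": 0.35},
    {"id": 3, "old_score": 680, "new_pd": 0.15, "utilization": 0.90},
    {"id": 4, "old_score": 600, "new_pd": 0.30, "utilization": 0.95},
    {"id": 5, "old_score": 650, "new_pd": 0.07, "utilization": 0.45},
]

swap_ins, swap_outs, kept = swap_sets(
    applicants,
    old_approves=lambda a: a["old_score"] >= 660,   # assumed legacy cutoff
    new_approves=lambda a: a["new_pd"] <= 0.08,     # assumed new cutoff
)
print("swap-ins: ", [a["id"] for a in swap_ins])
print("swap-outs:", [a["id"] for a in swap_outs])
print("swap-in means:", attribute_means(swap_ins, ["utilization"]))
print("kept means:   ", attribute_means(kept, ["utilization"]))
```

A fuller analysis would compare complete attribute distributions rather than means, and would bring in known credit performance for the already-approved population.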
GOVERNANCE, POLICIES, AND CONTROLS
How do the guidance’s documentation requirements apply to ML models?
- As the guidance states, “documentation of model development and validation should be sufficiently detailed so that parties unfamiliar with a model can understand how the model operates, its limitations, and its key assumptions.”
- Meeting the requirement for thorough documentation of advanced modeling techniques can be challenging for model developers: ML models can process many more variables than traditional models, ML algorithms often have many tunable parameters, and ML “ensembles” can combine many variables with many tunable parameters. All of these must be thoroughly documented so the model can be reproduced.
- These issues largely do not apply to logistic regression-based underwriting models, which are easier to understand and explain but less predictive.
- In the case of ML models, documenting how a model operates, its limitations, and its key assumptions requires using explainability techniques that accurately reveal how the model reached its decisions and why.
- Entities should ensure that they use explainability methods that accurately explain how a model operates. Most commonly used explainability methods are unable to provide accurate explanations. For example, some methods, such as leave-one-covariate-out (LOCO), local interpretable model-agnostic explanations (LIME), and permutation impact (PI), look only at model inputs and outputs, as opposed to the internal structure of a model. Probing the model only externally in this way is an imperfect process that can lead to mistakes and inaccuracies. Similarly, methods that analyze refitted and/or proxy models (e.g., LOCO and LIME), as opposed to the actual final model, have limited accuracy. Explainability methods that use “drop one” or permutation impact methods (e.g., LOCO and PI) rely on univariate analysis, which fails to properly capture feature interactions and correlation effects. Finally, methods that rely on subjective judgement (e.g., LIME) create explanations that are both difficult to reproduce and overly reliant on the initial judgement. These errors in explanation cause model accuracy to suffer. Even slight inaccuracies in explanations can lead to models that discriminate against protected classes, are unstable, and/or produce high default rates. Appropriate explainability methods rely on mathematical analyses of the underlying model itself, including high-order interactions, and do not need subjective judgement.
Should model documentation include automation?
- Yes. Although the guidance is silent on whether model documentation may be generated automatically, automated model documentation is the most practical solution for ML models. ML model development is complex, and operationalizing and monitoring ML models is even harder. It is not feasible for a human, unaided, to keep track of all that was done to ensure proper model development, testing and validation. There are tools to automate model documentation for review by model developers, compliance teams, and other stakeholders in the model risk governance process. Given the number of variables in ML models, automated documentation is likely to provide a higher degree of accuracy and completeness than manual documentation. In general, participants in model risk management should not rely upon manually generated documentation for ML models.
Are there other best practices for ML model risk management?
- Yes. The guidance makes clear that the quality of a bank’s model development, testing, and validation process turns in large part on “the extent and clarity of documentation.” Therefore, model documentation should be clear, comprehensive, and complete so that others can quickly and accurately revise or reproduce the model and verification steps. Documentation should explain the business rationale for adopting a model and enable validation of its regulatory compliance.
- Records of model development decisions and data artifacts should be kept together so that a model may be more easily adjusted, recalibrated, or redeveloped when conditions change. Such artifacts include development data, data transformation code, modeling notebooks, source code and development files, the final model code, model verification testing code, and documentation.
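One simple way to keep development artifacts together and support reproducibility is to generate a manifest that records each artifact alongside a content hash. The sketch below is only an illustration of that idea: the manifest layout, the role names, and the model name are hypothetical, and the demo uses throwaway files in place of real development data and code.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def build_manifest(model_name, artifacts):
    """Record file name, size, and SHA-256 for each artifact, keyed by role."""
    entries = []
    for role, path in sorted(artifacts.items()):
        data = Path(path).read_bytes()
        entries.append({
            "role": role,
            "file": Path(path).name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"model": model_name, "artifacts": entries}

# Demo with temporary files standing in for real artifacts.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    code_file = Path(tmp) / "model.py"
    data_file.write_bytes(b"income,default\n52000,0\n")
    code_file.write_bytes(b"def score(x):\n    return 0.5\n")
    manifest = build_manifest("underwriting_v2", {
        "training_data": data_file,
        "model_code": code_file,
    })

print(json.dumps(manifest, indent=2))
```

Regenerating the manifest later and comparing hashes gives a quick check that the documented data and code are the ones actually used to rebuild the model.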