
Fair’s Fair? Notions of Unfairness in Underwriting Models

As data-driven underwriting models become more prevalent and calls for legislation on data usage grow louder, the time has come to examine the models more closely. Tom Fletcher, PartnerRe’s Global Head of Data Science Consulting, explores the concept of “fairness” in data modeling for the life insurance industry and provides some first principles for assessing predictive models for fairness.

Recently, there has been a paradigm shift in insurance underwriting: diverse data sources are being used more and more. The how has been driven by improvements in models and in Artificial Intelligence (AI) and Machine Learning (ML) technologies and techniques. The why is all about meeting changing customer expectations – these datasets enable insurers to increase prediction accuracy while improving the customer experience by reducing the use of time-consuming questionnaires and invasive tests. This has helped to achieve efficiencies and make insurance more available, as well as helping to augment decisions in full medical underwriting. In addition to prescription history and medical records in the US, newer datasets are being used in accelerated underwriting, including public records, motor vehicle information, credit attributes, electronic health records and more.

To many, the way these systems work is opaque. When it is unclear how decisions are made, there is a tendency to assume the worst. Do they have a case? Are we fully confident in our models and methods, or do we need a new way of assessing them to ensure fairness?

Good data vs. bad data

It’s understandable that many groups, such as consumer advocates, news outlets and social media commentators, see claims about the capabilities of AI on the news and conflate the theoretically possible with what’s actually happening now.

While other industries may adopt a less conservative stance towards data usage, such as in retail and marketing where product recommendations are based on previous purchases and customer data scraped from social media, it is unlikely that insurers would engage in similar practices. Though it is technically possible to scrape social media accounts for health-related information, such actions would be considered imprudent within the insurance sector.

Underwriters typically avoid data of this unproven nature, preferring trusted data sources with verifiable levels of accuracy. To do otherwise is to risk creating an invalid model – something that can carry significant commercial and reputational risk. So, while it may be the “wild west” for data in some industries, it’s often not widely recognized just how conservative insurers are when it comes to data.

Finally, there’s plenty of data regulation already in place or on its way for those using datasets in general, from the Fair Credit Reporting Act (FCRA) in the United States to the EU’s recently passed Artificial Intelligence Act. Even now, the National Association of Insurance Commissioners (NAIC) in the US has developed a bulletin on data usage, and the State of Colorado is finalizing a regulation to support its recent legislation on the use of external data.

How should we respond?

So, there’s no problem, right? Well – not quite.

To begin with, ignoring this issue is not likely to be a good strategy. After all, opacity itself is a risk – both in terms of market success and corporate reputation. If you were challenged tomorrow about the fairness and validity of your predictive models, could you respond convincingly?

Even if your models are 100% valid and defensible, you should still be able to demonstrate that this is the case. You should be confident enough to stand by their fairness and accuracy. To do that, you will need to have at least two things:

  1. A clear understanding of what “fairness” means in the context of your models and their predictions. What is your individual definition of fairness, aligned with your business, cultural and customer values and the jurisdictions you operate in?
  2. A way to check your models, from inputs to outputs and everything in between, to be sure that they don’t result in decisions that fail to meet your own, previously defined, concept of ‘fair’. What are acceptable inputs? What are acceptable output tolerances?

What does “fair” really mean?

One of the paradoxes within our industry, whether it be life, auto or home insurance, is that lawful discrimination is inherently fair. Insurers discriminate based on risk – and customers with lower risk can expect to pay less than those with higher risk.

But where discrimination extends unfairly to significant subgroups – particularly those with protected characteristics – it becomes unlawful and is clearly a problem that needs to be addressed. In some cases, the differences a model produces will be invalid and we’ll clearly see that it discriminates unfairly against a particular subgroup. In other cases, the differences between groups uncovered within models may be at least partially valid, and insurers will need to make a call as to what they want to adjust within their models to increase actual – and perceived – fairness.

To complicate things further, it’s important to note that fairness can be perceived differently by different individuals, making it harder to land on a single definition that will satisfy everyone.

However, we can identify four common concepts of (un)fairness:

  • Unlawful Discrimination: Including the direct use of information related to a protected class in underwriting decisions.
  • “Proxy” Discrimination: Here a seemingly neutral factor stands in for membership of a protected class. While the proxy value might appear to predict the risk-based outcome, it is actually predicting the protected class. Proxies can mask unfair discrimination and, consequently, their use – whether intentional or not – is illegal in many jurisdictions.
  • Technical Biases: These can be shown to exist when the model “works” differently for different protected classes. Technical biases can then result in disparate outcomes depending on the subgroup being measured.
  • Disparate Outcomes: Similar to adverse impacts or disparate impacts, disparate outcomes refer to instances when a model delivers a disproportionately negative outcome for different protected characteristics without justification.1

Our goal as an industry is to better identify and remove any factors that unfairly discriminate. A good starting point would be to thoroughly review relevant laws and regulations, ensuring our approach aligns with official definitions and guidelines. A number of organizations are working on frameworks, for example: the Center for Economic Justice defines disparate impact as “practices that have the same effect as disparate treatment or intentional discrimination against protected classes.”2 However, as of the time of writing, there is no single, comprehensive framework available to assist us – and that means we’ll have to do some thinking on our own.

Our responsibility

Navigating these gray areas requires careful thought and clear principles. Insurance, as an industry, can (and I believe should) be viewed as a social good – something to which everyone should have fair access. Yet even with fair access to insurance, historic inequalities remain unresolved, and addressing them is the responsibility of our collective society. Our society has a moral obligation to be fair or, at the very least, to do no harm.

Evaluating real or perceived fairness in predictive models

This is where the real work begins. Once we have developed our clear principles of fairness, we then need to critically evaluate our models to ensure that they hold up. First, we need to ensure our models are valid – that they measure what they purport to measure and that there is a statistically or conceptually relevant relationship to the end result. We also need to challenge ourselves to collect the requisite data to ensure fairness. Where sensitive data is not collected, there are several techniques we can use to infer protected classes and help us establish whether or not a model is fair. Once we have inferred protected class, we must analyze the model’s predictions to ensure that we aren’t allowing unfair discrimination to lurk unnoticed.
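For readers who want something concrete: one widely discussed technique for inferring protected class in the US is Bayesian Improved Surname Geocoding (BISG), which combines surname-based and geography-based probabilities via Bayes’ rule. The sketch below illustrates only the core calculation – every name, ZIP code and probability in it is made up, not real census data – and any production use would need validated reference tables, documented assumptions and legal review.

```python
# Toy sketch of a BISG-style inference (illustrative only).
# Core idea: P(group | surname, geo) is proportional to
#            P(group | surname) * P(group | geo) / P(group),
# assuming surname and geography are independent given group.

# All tables below are invented for illustration - NOT real census figures.
prior = {"group_1": 0.6, "group_2": 0.4}

p_group_given_surname = {
    "SMITH":  {"group_1": 0.7, "group_2": 0.3},
    "GARCIA": {"group_1": 0.2, "group_2": 0.8},
}

p_group_given_geo = {
    "12345": {"group_1": 0.5, "group_2": 0.5},
    "67890": {"group_1": 0.8, "group_2": 0.2},
}

def infer_group_probabilities(surname: str, zip_code: str) -> dict:
    """Return normalized posterior probabilities over the hypothetical groups."""
    unnormalized = {
        g: p_group_given_surname[surname][g]
           * p_group_given_geo[zip_code][g]
           / prior[g]
        for g in prior
    }
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Example: combine surname and geography evidence for one applicant.
print(infer_group_probabilities("GARCIA", "67890"))
```

The resulting probabilities are typically carried through the subsequent fairness analysis as weights rather than hard labels, so that the uncertainty in the inference is not thrown away.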

Flexibility is the key

Trying to develop a set of hard and fast rules by which every model can be assessed for fairness is simply unworkable. Different definitions of what’s fair, and the multiplicity of ways in which different models can deliver results that are demonstrably or apparently unfair, mean that models need to be analyzed individually using a range of tools and techniques. So, be prepared to test your models under privilege with your legal team; examine your results with an open mind.

Uncovering technical bias

Throughout this article, I use the word “bias” only in its technical sense: when I talk about technical bias in underwriting models, I am specifically looking for cases where the model doesn’t work as well for one group as for another. Insurers want to ensure that two similarly situated individuals (in terms of health, finances and so on) who receive the same score also share the same predicted risk. I recognize that some use “bias” interchangeably with “unfair” or “discriminatory”. That is a broad and rather vague definition of the word, and one that is invariably negative.

What to look out for and test against

While it should go without saying, a model should leverage relevant data and be valid in its predictions. But differences in predicted outputs between subgroups, including those with protected characteristics, do not necessarily invalidate a model. As we can see in Figure 1, Group A has systematically higher average Risk Scores, and therefore higher average Outcomes (Y), than Group B – but this is simply the result of the model producing accurate predictions. A person with the same score, but belonging to a different subgroup, would receive the same predicted result from the model. Setting a threshold for accept/decline may result in an adverse impact to one group, but this may be defensible due to differences in the underlying risks (for example, a higher prevalence of medical conditions). Here, there is no evidence of model bias.

Figure 1: Example of a valid model using simulated data, where the model accurately predicts the desired result. The ovals (i.e., ellipses) represent the 95% confidence range for the Outcome (Y-axis) and Risk Score (X-axis). The upward tilt of each oval shows that the Outcome and Risk Score are positively related – as the Risk Score increases, so does the Outcome. The width of the oval indicates how much the Risk Score and Outcome vary within each group. Group A has higher average Risk Scores than Group B, due to underlying differences between the groups. The model is still considered valid; if you were to compare any two people with the same Risk Score, one from Group A and the other from Group B, they would have the same predicted output (i.e., Outcome) from the model.
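To make the pattern in Figure 1 concrete, here is a minimal simulation (my own sketch, not the data behind the figure; the means, slope and threshold are arbitrary). Two groups are given different average risk scores but a single shared score-to-outcome relationship, so equal scores yield equal predictions even though one decline threshold affects the groups differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Group A has higher average risk scores than Group B (as in Figure 1),
# but both groups share ONE data-generating process linking score to outcome.
score_a = rng.normal(loc=60, scale=10, size=n)
score_b = rng.normal(loc=50, scale=10, size=n)

def outcome(score: np.ndarray) -> np.ndarray:
    # The outcome depends only on the risk score, not on group membership.
    return 0.8 * score + rng.normal(scale=5.0, size=score.shape)

y_a, y_b = outcome(score_a), outcome(score_b)

# Fit one pooled model: a simple least-squares line on the combined data.
scores = np.concatenate([score_a, score_b])
y = np.concatenate([y_a, y_b])
slope, intercept = np.polyfit(scores, y, deg=1)

# Same score -> same predicted outcome, regardless of group.
print("Predicted outcome at score 55:", round(slope * 55 + intercept, 2))

# A single decline threshold still declines more of Group A, reflecting a real
# difference in underlying risk rather than bias in the model itself.
threshold = 65
print("Decline rate, Group A:", np.mean(score_a > threshold))
print("Decline rate, Group B:", np.mean(score_b > threshold))
```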

When assessing the validity of a model, the presence of a proxy factor related to a protected class is indeed a significant concern. We can see in Figure 2 a clear example of proxy discrimination. This seemingly neutral factor (X) is predictive of an outcome (Y), but it only gets its predictive power by being able to distinguish between groups. The use of proxies such as X, whether unintentional or not, is unfairly discriminatory.

Figure 2: Example of a model that appears invalid, using simulated data. The ovals represent the 95% confidence range for the Outcome (Y-axis) and Variable X (X-axis). The oval for Group B sits much higher on the plot than the oval for Group A, showing that Group B has much higher values of Variable X. The limited overlap between the ovals indicates that Variable X differs substantially between the two groups. The model uses Variable X to predict the Outcome (Y-axis), but Variable X is actually just a proxy for distinguishing between Group A and Group B, rather than a true predictor of the Outcome.
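A simple diagnostic for this situation – sketched below on simulated data, and in practice something to run on inferred group labels and under legal privilege rather than exactly as shown – is to check how much of a variable’s predictive power survives once group membership is controlled for. If the coefficient on X collapses when a group indicator is added, X may be earning its apparent predictive power as a proxy.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 4_000

# Simulate the Figure 2 situation: the outcome is driven by group membership,
# and X is merely a noisy stand-in for group.
group = rng.integers(0, 2, size=n)               # 0 = Group A, 1 = Group B
x = 10 * group + rng.normal(scale=2.0, size=n)   # X mostly separates the groups
y = 5 * group + rng.normal(scale=1.0, size=n)    # outcome depends on group, not on X

df = pd.DataFrame({"y": y, "x": x, "group": group})

# On its own, X looks highly predictive of the outcome...
print(smf.ols("y ~ x", data=df).fit().params)

# ...but once group is controlled for, X's coefficient collapses toward zero -
# a warning sign that X was acting as a proxy rather than a genuine risk factor.
print(smf.ols("y ~ x + C(group)", data=df).fit().params)
```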

Looking for bias? There are two common types to watch out for:

  • Slope bias: Slope bias occurs when the relationship between the model score and the predicted outputs is not the same for two groups – the relationship is stronger (a steeper slope) for one group than for another.
  • Intercept bias: In some cases, the relationship (slope) is parallel for both groups, but there is a level difference. In other words, the overall mean risk is higher for one group than for another. A simple way to test for both types is sketched below.
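One common way to check for both at once, sketched here on simulated data with the statsmodels formula API, is to regress the outcome on the score, a group indicator and their interaction. Informally, a significant interaction term points to slope bias, while a significant group main effect with a near-zero interaction points to intercept bias (the effect sizes below are arbitrary and chosen only to illustrate the second case).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 3_000

group = rng.integers(0, 2, size=n)            # 0 = Group A, 1 = Group B
score = rng.normal(loc=50, scale=10, size=n)

# Simulated intercept bias: the slope is the same for both groups, but
# Group B's outcomes sit 4 units higher at every score level.
y = 0.6 * score + 4 * group + rng.normal(scale=3.0, size=n)

df = pd.DataFrame({"y": y, "score": score, "group": group})

# "score * C(group)" expands to: score + C(group) + score:C(group).
fit = smf.ols("y ~ score * C(group)", data=df).fit()
print(fit.summary().tables[1])

# Reading the coefficient table (informally):
#   significant score:C(group) interaction -> slope bias
#   significant C(group) main effect only  -> intercept bias
```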


Which comes first – the input or the outcome?

Where should we focus? There is plenty of debate about whether we need to get inputs right first, or whether we need to look directly at outcomes. As always, the answer is not so simple.

Let’s take as an example a model that uses several inputs, including application questions, prescription history, lab results and so on. One of the application questions asks, “do you have diabetes?” Let’s assume as well that the prevalence of diabetes differs, perhaps by race or sex. Our question then becomes, should we remove this single input? Shouldn’t the focus be on the whole story of the individual and not one single attribute? Or does the increased risk predicted by the presence of diabetes justify its inclusion?

Perhaps we should focus on the predicted outputs from the model instead? If we see adverse outcomes affecting certain groups, we can then adjust our model to address this.

The best strategy probably involves a bit of both. Run a qualitative review process (developing a robust form of governance here will prove valuable) for your inputs, ensuring that they align with your agreed philosophy for “what’s fair and reasonable”. Then run a quantitative analysis to identify unfair results and dig deeper to understand any inherent problems in your model.
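As one concrete (and deliberately simple) form of that quantitative pass, the sketch below computes approval rates by group and an adverse impact ratio from made-up decision counts. The four-fifths benchmark referenced in the comments comes from US employment-selection guidance rather than insurance regulation, so anything below it should be read as a flag for deeper investigation, not as a verdict of unfair discrimination.

```python
import pandas as pd

# Hypothetical underwriting decisions by (inferred) group; all counts are made up.
decisions = pd.DataFrame({
    "group":    ["A"] * 1000 + ["B"] * 1000,
    "approved": [True] * 820 + [False] * 180 + [True] * 700 + [False] * 300,
})

# Approval rate per group.
approval_rates = decisions.groupby("group")["approved"].mean()
print(approval_rates)

# Adverse impact ratio: each group's approval rate relative to the most favored
# group. Ratios well below ~0.8 (the employment-law "four-fifths rule") are
# commonly treated as a trigger for further analysis of the model and its inputs.
air = approval_rates / approval_rates.max()
print(air)
```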

Don’t focus on variables in isolation

Checking individual variables is fine, but remember that they may not operate in the same way when brought together. Sometimes variables correlate with each other. Other times, dependencies may cause one variable to operate differently in the presence of another. Using a multivariate approach therefore has several benefits. It can provide a more comprehensive picture of how multiple variables jointly affect outcomes. And where individual variables are creating differences, using multiple variables may help to soften the impact.
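A lightweight first multivariate pass – again a sketch using simulated, made-up inputs – is a correlation screen across candidate variables and an inferred group flag: strong correlations with the group flag point to candidate proxies, while strong correlations between inputs warn that single-variable conclusions may not carry over to the multivariate model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2_000

# Simulated inputs; the names and relationships are invented for illustration.
group = rng.integers(0, 2, size=n)                             # inferred group flag
bmi = rng.normal(loc=27, scale=4, size=n)
credit_attr = 600 + 30 * group + rng.normal(scale=40, size=n)  # tracks group membership
rx_count = rng.poisson(lam=2 + 0.05 * (bmi - 27).clip(min=0))  # loosely tied to bmi

df = pd.DataFrame({
    "group": group,
    "bmi": bmi,
    "credit_attr": credit_attr,
    "rx_count": rx_count,
})

# Pairwise correlations: large values in the "group" row/column flag candidate
# proxies; large values between inputs flag variables that travel together.
print(df.corr().round(2))
```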

Analyzing variables – some points to consider:

  • Evaluations can become overwhelmingly complicated (and counterintuitive)
  • Distributions may be non-normal
  • Relationships to risk can be non-linear
  • Interactions may be present
  • There may be correlations with other unseen variables, resulting in confounding variables
  • There are significant sampling issues with protected class data (e.g., underrepresentation, biases in self-reporting)
  • There is no guarantee that solving issues with a single variable will resolve decision differences
  • Underwriting almost always requires multiple variables to be taken into account, unless you are setting up a knock-out variable within a rules engine

Digging deeper into disproportionate outcomes

Analyzing models in this way will frequently reveal disproportionate outcomes for different groups. Some groups may have significantly higher levels of declines, for example, or have a much higher chance of being placed into less favorable underwriting categories. For a model to deliver disproportionate outcomes in any meaningful sense, however, these differences must be demonstrably unfair. Most legal frameworks use wording that approximates to either “without justification” or “disproportionate relative to the underlying risk”. For example, Colorado Senate Bill 21-169 (US) considers the use of a model to be unfairly discriminatory when it “results in a disproportionately negative outcome for such classification or classifications, which negative outcome exceeds the reasonable correlation to the underlying insurance practice”.

While the NAIC’s Model Unfair Trade Practices Act (#880-4) in the United States uses slightly different wording – prohibiting life insurers from charging different rates for policies when individuals share the same risk factors (i.e., life expectancy and class) – it too equates unfair discrimination with a disproportionate outcome.

In conclusion: Now’s not the time to be caught sleeping

This is a complex issue, and not something that can be wrapped up in a brief article. But hopefully this discussion has at least surfaced some of the key issues.

Insurers can’t afford to ignore the issue of fairness and discrimination in the datasets that are increasingly relied upon. Kneejerk reactions risk unintended consequences. In the end, our goals align closely with those of regulators and customers: to ensure our products are fairly priced and accessible. It is both reasonable and achievable to safeguard your business interests by ensuring that the risk assessment models used do not unfairly discriminate.

References

[1] Colorado Revised Statutes: “that use results in a disproportionately negative outcome for such classification or classifications, which negative outcome exceeds the reasonable correlation to the underlying insurance practice” – Colo. Rev. Stat. 10-3-1104.9

[2] The Center for Economic Justice, Call to Insurers and Insurance Regulators, June 18, 2020, p. 2 (note).

Contact PartnerRe

Please contact us if you would like to find out more about bias and fairness in data modeling.

Opinions expressed herein are solely those of the author.  This article is for general information, education and discussion purposes only. It does not constitute legal or professional advice and does not necessarily reflect, in whole or in part, any corporate position, opinion or view of PartnerRe or its affiliates.
