Explainable AI with Shap

Explainable AI is a key concept in Machine Learning/AI: explaining why your model makes the predictions it does. It helps build trust in how a model behaves. In this blog, we cover how you can use a game-theory-based method called Shapley values to explain what's happening inside an ML model.

Let's assume you are tasked with developing an ML model to predict credit card defaulters. You clean and transform the data at your disposal and properly cross-validate your results. However, the C-suite isn't impressed: although your scores (precision/recall/F1) are great, they cannot tell why the model flags someone as a potential defaulter. You can show them the feature importance scores (which operate at a global level), yet something more is desired. In other words, what is required to convince stakeholders is an explanation of why the ML model makes the predictions it does. This would increase their trust in the model, and this process of providing explanations is called model explainability.

Explainability is very important if one is working in regulated sectors like healthcare, trade, etc. In such domains, the data science teams not only work on understanding the data and model building but also try to explain why the models made those decisions.
So, what can you do now? For higher interpretability, you could switch to some variant of a linear model, which would let you explain individual predictions as well. But that comes at the cost of performance. Ideally, you want to keep the performance of the complex model while still being able to explain its predictions. This is where certain concepts borrowed from the field of game theory come in. Let's understand them in the following sections.

GAME THEORY

Imagine you have differently skilled workers collaborating for some collective reward. How should the reward be divided fairly among them? This is what game theory tries to answer. One possible solution is to calculate the marginal contribution of every worker.

DIVIDING PAYOFF FAIRLY

Before diving into the mathematics of these marginal contributions, let's consider three workers A, B & C working together on a project of developing a web application. Our task is to find the marginal contribution of every worker in order to fairly compensate everyone. The fair compensation can be derived by calculating the marginal contributions, aka payoffs, and the formula (the Shapley value of worker i) is:

φ_i(v) = Σ_{S ⊆ N\{i}} [ |S|! · (|N| − |S| − 1)! / |N|! ] · [ v(S ∪ {i}) − v(S) ]

Let’s break down the formula first. Here, the game (collaborative) is the development of the web app and N is the set of the workers i.e. {A,B,C}. The payoff function is defined by v(S) which gives us the payoff for any subset of workers. For now, we want to understand how much A should be paid. To do it, you decide to use the formula above.

Therefore, N = {A,B,C} and i = A. The above formula can be rearranged into the following form (an explanation is given below):

φ_i(v) = (1/|N|) · Σ_{S ⊆ N\{i}} [ 1 / C(|N|−1, |S|) ] · [ v(S ∪ {i}) − v(S) ]

where C(|N|−1, |S|) counts the subsets of size |S| that can be formed from the workers other than i.

Now let's turn our attention to the term v(S∪{i}) − v(S) on the right-hand side (and don't forget the summation). What it tells us is:

  1. Consider all possible subsets of workers, both with and without player A.
  2. Calculate their payoffs with A i.e. v(S∪{i}) and without A i.e. v(S). Their difference represents the marginal value of A.
  3. Add up all the (suitably scaled) marginal values and we get the marginal contribution of A.

STEP 1

Possible subsets without A = {Φ, {B}, {C}, {B,C}}. Reminder: the number of possible subsets of a set with n elements is 2^n. It might look like we are ignoring order here, i.e. we are not concerned with whether B or C started their work first. That should not matter, because from A's perspective it is irrelevant whether B or C started earlier. So you can evaluate the payoff function once with A and once without A, and track how much was contributed once A came into the picture.
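For concreteness, here is a small Python sketch (using itertools, with the toy worker names from this example) that enumerates exactly these subsets:

```python
from itertools import chain, combinations

# Coalitions the remaining workers can form without A.
others = ["B", "C"]
subsets = list(
    chain.from_iterable(combinations(others, r) for r in range(len(others) + 1))
)
print(subsets)       # [(), ('B',), ('C',), ('B', 'C')]
print(len(subsets))  # 2**2 = 4 possible subsets
```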

STEP 2

The payoff function v(S) is nothing but the function learned by the model from data. The difference v(S∪{i}) − v(S) can be denoted Δv. We will have four such values, one for each of the four subsets. For the subset {B}, we get Δv_{B,A}, which tells us how much A contributes given that only B has worked on the project so far.

STEP 3 

This step tells us to add the marginal values after scaling. The scaling term is the combinatorial coefficient 1/C(|N|−1, |S|) from the rearranged formula above.

What is the need for scaling, you might wonder? It averages out the effect of the rest of the team members for every subset size while computing the marginal value of A within each subset. The coefficient counts the number of possible combinations of each subset size drawn from the set excluding worker A. For the subset {B}, Δv_{B,A} gets a scale value of 1/2, because there are C(2,1) = 2 subsets of size one that exclude A.

There is one final scaling factor, and that is |N|, the total number of workers, i.e. 3 here. It is there to average out the effect of the group size. Combining the two, the subset {B} ends up with an overall weight of (1/3)·(1/2) = 1/6. In this way, you finally get the marginal contribution, or the Shapley value, of worker A.
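To make the whole calculation concrete, below is a minimal brute-force sketch of the Shapley computation for the three-worker example. The payoff numbers are made up purely for illustration; only the weighting scheme matters here.

```python
from itertools import combinations
from math import factorial

# Toy payoff table for the web-app example: each coalition of workers maps to
# the value it can deliver on its own. The numbers are made up for illustration.
payoff = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 25,
    frozenset("AB"): 45,
    frozenset("AC"): 50,
    frozenset("BC"): 55,
    frozenset("ABC"): 90,
}

def shapley_value(player, players, v):
    """Weighted sum of the player's marginal contributions over all subsets."""
    n = len(players)
    others = [p for p in players if p != player]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Scaling term |S|! * (|N| - |S| - 1)! / |N|!  (equivalently
            # 1 / (|N| * C(|N| - 1, |S|)) from the rearranged formula).
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (v[s | {player}] - v[s])
    return total

players = ["A", "B", "C"]
for p in players:
    print(p, round(shapley_value(p, players, payoff), 2))
# The three values sum to v({A,B,C}) = 90, i.e. the full payoff is split fairly.
```

Note that the three values always add up to v({A,B,C}); this efficiency property is what makes Shapley values a fair way to split the total payoff.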

APPLICATIONS 

How is this all transferable to the ML domain? It turns out the workers are nothing but the features one feeds to the model. And the payoff function calculated for every subset is nothing but the function learned by the model from data during the training phase. To understand it better, let us take the Adult Census Income dataset from the UCI repository. All the information about attributes is explained on their website.

Problem Statement 

Predict whether a person's income exceeds $50K/year based on census attributes (a binary classification problem). If income exceeds $50K, the record is labeled 1, else 0.

Model Fitting

We have fitted a gradient boosting model using the LightGBM library. Results (precision, recall & F1 score) are shown below for both classes.
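As a rough sketch of this step (the exact preprocessing and hyperparameters from the original experiment aren't shown here, so the choices below are illustrative), the fitting could look like this, using the pre-encoded copy of the Adult dataset that ships with the shap package:

```python
import shap
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# shap ships a numerically encoded copy of the Adult Census Income dataset.
X, y = shap.datasets.adult()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Gradient boosting classifier; hyperparameters here are illustrative defaults.
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# Precision, recall and F1 for both classes.
print(classification_report(y_test, model.predict(X_test)))
```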

Shapley values

Now, using the SHAP library, we will understand how to generate Shapley values and explain the model predictions for our problem. It is important to remember that this library will give us approximate values and not exact values. Since we have used a tree-based model, we will be using the TreeSHAP implementation for our purpose.

We start by initializing an explainer object with TreeSHAP over the model and then generating Shapley values for our target set.
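In code, this step could look roughly like the following (assuming the model and X_test from the fitting step above):

```python
import shap

# Build a TreeSHAP explainer over the fitted LightGBM model and compute
# (approximate) Shapley values for the test set.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# For binary LightGBM models, older SHAP versions return a list with one
# array per class; shap_values[1] then holds values for the positive class.
# Newer versions may return a single array instead.
```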

Explaining individual predictions

The force_plot() method helps us visualize the impact of different features on the prediction. We will be looking at one record (X_test.iloc[0,:]) and taking its corresponding Shapley values (shap_values[1][0,:]) to generate the plot shown below.
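A sketch of that call (the indices match the record and class discussed here; whether expected_value is a scalar or a per-class array depends on the SHAP version):

```python
shap.initjs()  # loads the JS needed for the interactive plot in a notebook

shap.force_plot(
    explainer.expected_value[1],  # base value for the positive class
    shap_values[1][0, :],         # Shapley values for the first test record
    X_test.iloc[0, :],            # the record's raw feature values
)
```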

Here, the value in bold (-2.28) is the model's prediction on the log-odds scale. It is important to keep in mind that LightGBM trees are built on the log-odds scale, and the output is only transformed to probabilities for predict_proba(). A negative value simply means that the model is more likely to predict 0 than 1. Features driving the prediction are colored red and blue, with red ones pushing the model score higher and blue ones pushing it lower. Features located closest to the red/blue boundary are the ones with the highest impact, and the size of each colored bar is proportional to that impact.
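As a quick sanity check (a minimal sketch, assuming the per-class list output used above), the base value plus the record's Shapley values reproduces the raw log-odds score, and the sigmoid maps it back to the probability that predict_proba() reports:

```python
import numpy as np

log_odds = explainer.expected_value[1] + shap_values[1][0, :].sum()
probability = 1 / (1 + np.exp(-log_odds))  # sigmoid: log-odds -> probability

# Should (approximately) match the model's own probability for class 1.
print(log_odds, probability, model.predict_proba(X_test.iloc[[0]])[0, 1])
```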

Now, let's check how Shapley values are distributed across different feature values. Consider the image below. It shows the summary plot, where features (Relationship, Age, etc.) are listed on the Y-axis with their values color coded (red = high, blue = low), and their respective Shapley values are on the X-axis. A high Shapley value means the feature contributes more towards our event of interest, and vice versa. If we consider the feature Capital Gain, we can infer that high values for it are generally associated with positive classifications. You might also spot a bias in the feature Sex, where the value of 1 (Male) corresponds more often to positive events.
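The plot in question can be generated with something like the following (again assuming the per-class Shapley value list from above):

```python
# Global summary: one dot per record per feature, positioned by Shapley value
# and colored by the underlying feature value (red = high, blue = low).
shap.summary_plot(shap_values[1], X_test)
```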

DRAWBACKS

Like anything, Shapley values aren't perfect. Some of their noteworthy shortcomings are discussed below:

  1. Handling of missing feature values: What does it even mean for a feature to be missing? Replacing the missing entries with 0 rarely makes practical sense. To get around this, such feature values are replaced with their expected value over the whole dataset. But in reality, the expected value might not be realistic; at best it is an educated guess for the missing feature(s).
  2. Handling of correlated features: Another prominent drawback is the handling of correlated features. In many cases, one might observe that the attributions for certain correlated features are close to zero. This happens due to a phenomenon called 'correlation bias'. For example, while training with a highly correlated set of features {a,b,c}, the ML model might assign high importance to an arbitrary member of the set, say feature 'a'. This can happen because Shapley values try to make sense of the ML model, not the data.

Overall, Shapley values are immensely valuable for explaining ML models, as long as you keep their limitations in mind. They are not perfect, but they work great when applied correctly.

Hello, Bijit here! I am a data scientist with more than 4 years of experience across different domains (SaaS, real estate, edtech). When I am not browsing through data, you can find me playing soccer, cooking, or sleeping.
