Chapter #2: Feature Importance With SHAP

Hi folks! 👋

In the previous post I talked about EDA and, in particular, statistical bias. Now it's feature importance's turn.

What’s feature importance?

Feature importance is the idea of explaining how much each individual feature in a training data set contributes to the model, using a score called the importance score.

Example

Let’s take, for example, the product review data set. It consists of multiple different features, and I am trying to build a product sentiment prediction model out of that data set. I would be interested in understanding which features play a role in the final model; that is where feature importance comes into the picture.

Why should I consider it?

Some features from my data set could be more relevant to the final model than others. Using feature importance, I can rank the individual features by their importance and contribution to the final model. Feature importance allows me to evaluate how useful or valuable a feature is, relative to the other features in the same data set.

SHAP framework

Here, feature importance is computed with a very popular open-source framework called SHAP (SHapley Additive exPlanations). The framework itself is based on Shapley values, a concept at the core of cooperative game theory.

How does SHAP work?

To understand how SHAP works, consider a game: multiple players are involved, and the play has a specific outcome, which could be either a win or a loss. Shapley values allow me to attribute the outcome of the game to the individual players involved in it.
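
To make the game analogy concrete, here is a minimal sketch in plain Python that brute-forces the Shapley value of each player in a hypothetical three-player game (the payoff numbers are made up purely for illustration). Each value is the weighted average of that player's marginal contribution across every possible coalition:

```python
from itertools import combinations
from math import factorial

# A toy 3-player cooperative game (hypothetical payoffs).
# v[S] is the value that coalition S achieves together.
players = ("A", "B", "C")
v = {
    (): 0,
    ("A",): 10, ("B",): 20, ("C",): 30,
    ("A", "B"): 40, ("A", "C"): 50, ("B", "C"): 60,
    ("A", "B", "C"): 90,
}

def shapley_value(player):
    """Weighted average of `player`'s marginal contribution over all coalitions."""
    others = [p for p in players if p != player]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            without = tuple(sorted(subset))
            with_player = tuple(sorted(subset + (player,)))
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            total += weight * (v[with_player] - v[without])
    return total

# The three values sum to v(A, B, C) = 90: the full outcome is split among the players.
for p in players:
    print(f"Shapley value of {p}: {shapley_value(p)}")
```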

But how can this be translated into the ML world?

In our case, the individual players would be the individual features that make up the data set, and the outcome of the play would be the model prediction. Therefore, it becomes possible to explain how a prediction relates to the individual feature values.
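
In code the translation is quite direct. Below is a minimal sketch assuming a tabular data set and a tree-based scikit-learn model; the synthetic features, the RandomForestRegressor, and the choice of shap.TreeExplainer are my own assumptions for illustration, not something the framework mandates:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data, not the actual product review data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # 4 hypothetical features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The features are the "players"; the model prediction is the "outcome".
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)               # shape: (n_rows, n_features)
```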

Using the SHAP framework, I can provide both local and global explanations. In detail:

  • a local explanation indicates how each feature value contributes to a single prediction;
  • a global explanation takes a much more comprehensive view, looking at how the features contribute to the model’s predictions across the entire data set (both are sketched in code below).
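
Continuing the sketch above, here is a hedged illustration of the two kinds of explanation: the per-row attributions act as the local explanation, while ranking features by their mean absolute SHAP value is one common (assumed) convention for the global view:

```python
# Continuing the sketch above (shap_values has shape (n_rows, n_features)).

# Local explanation: the attributions for one single prediction.
row = 0
print("Attributions for row 0:", shap_values[row])

# Local accuracy: base value + attributions recover that row's prediction.
base = float(np.ravel(explainer.expected_value)[0])
print(base + shap_values[row].sum(), "vs", model.predict(X[[row]])[0])

# Global explanation: aggregate over the whole data set, e.g. rank the
# features by their mean absolute SHAP value.
global_importance = np.abs(shap_values).mean(axis=0)
print("Features ranked by importance:", np.argsort(global_importance)[::-1])
```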

The SHAP framework is also exhaustive: it considers all possible combinations of feature values along with all possible outcomes. This exhaustive nature can make SHAP very time-intensive, but in return it provides guarantees of consistency and local accuracy.
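
To get a feel for why it is time-intensive, the number of feature coalitions that exact Shapley computation has to consider grows exponentially with the number of features (illustrative arithmetic only):

```python
# Each added feature doubles the number of coalitions to evaluate.
for n_features in (5, 10, 20, 30):
    print(f"{n_features} features -> {2 ** n_features:,} coalitions")
```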

Additional material

If you’d like to learn more about SHAP, navigate to this link.

In the next post, I’ll go through a hands-on tutorial on how to compute these feature importance scores on AWS SageMaker. See you there!
