# Model Explanation
Model explanation techniques make black-box models understandable by building interpretable local approximations or by running counterfactual analyses.
## LIME (Local Interpretable Model-agnostic Explanations)
LIME explains individual predictions by building a local surrogate model — a simple, interpretable model that approximates the black box in the neighbourhood of a specific prediction.
```mermaid
graph LR
    A[Input Instance] --> B[Generate Perturbations]
    B --> C[Get Black Box Predictions]
    C --> D[Fit Local Linear Model]
    D --> E[Feature Importance for This Prediction]
```
### How LIME Works

1. Select the instance to explain
2. Perturb the instance — create variations by modifying feature values
3. Predict each perturbation using the black-box model
4. Weight perturbations by proximity to the original instance
5. Fit a simple model (linear regression) on the weighted perturbations
6. Extract the feature weights as the explanation
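The six steps above can be sketched from scratch in a few lines of NumPy and scikit-learn. This is a toy illustration, not the `lime` library: `black_box`, the Gaussian perturbation scale, and the kernel width are all invented for the sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-in black box: a nonlinear scoring function
def black_box(X):
    return 1 / (1 + np.exp(-(X[:, 0] ** 2 - X[:, 1])))

x0 = np.array([1.0, 2.0])                      # 1. instance to explain

# 2. Perturb: sample variations around x0
Z = x0 + rng.normal(scale=0.5, size=(1000, 2))

# 3. Predict each perturbation with the black box
y = black_box(Z)

# 4. Weight perturbations by proximity (exponential kernel)
d = np.linalg.norm(Z - x0, axis=1)
w = np.exp(-(d ** 2) / 0.5 ** 2)

# 5. Fit a weighted linear surrogate centred on x0
surrogate = Ridge(alpha=1.0).fit(Z - x0, y, sample_weight=w)

# 6. The surrogate's coefficients are the local explanation
print(dict(zip(['feature_0', 'feature_1'], surrogate.coef_)))
```

In practice the `lime` package, shown next, implements these steps with many refinements (sparse surrogates, discretisation, kernel-width heuristics).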
```python
import lime
import lime.lime_tabular

# Create explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=['Non-Default', 'Default'],
    mode='classification'
)

# Explain a single prediction
exp = explainer.explain_instance(
    X_test.iloc[0].values,
    model.predict_proba,
    num_features=10
)
exp.show_in_notebook()
```
**LIME Limitations**

- Explanations are local — they may not generalize
- Different perturbation strategies can give different explanations
- Works best with tabular data; less reliable for images/text
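The second limitation can be demonstrated with a toy experiment: fitting two local surrogates around the same point with different perturbation widths yields noticeably different coefficients. Everything here (the black box, the point, the widths) is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Nonlinear black box: the local linear picture depends on how far we look
def black_box(X):
    return np.sin(3 * X[:, 0]) + X[:, 1]

x0 = np.array([0.5, 0.0])

def local_coefs(scale):
    """Fit a local surrogate using Gaussian perturbations of a given width."""
    Z = x0 + rng.normal(scale=scale, size=(2000, 2))
    w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / scale ** 2)
    return Ridge(alpha=1e-3).fit(Z - x0, black_box(Z), sample_weight=w).coef_

narrow = local_coefs(0.1)   # tight neighbourhood
wide = local_coefs(1.0)     # broad neighbourhood
print("narrow:", narrow)    # strong slope for feature 0
print("wide:  ", wide)      # the sine averages out, slope shrinks
```

Both explanations are "correct" for their neighbourhood, which is why the choice of perturbation strategy matters when reading LIME output.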
## Counterfactual Fairness
Counterfactual fairness asks: "What would the prediction be if this person belonged to a different protected group?"
\[P(\hat{Y}_{S \leftarrow s'} = y \mid X = x, S = s) = P(\hat{Y}_{S \leftarrow s} = y \mid X = x, S = s)\]
A model is counterfactually fair if changing the protected attribute (and its causal descendants) does not change the prediction.
### Practical Application
```python
import pandas as pd
import numpy as np

def counterfactual_check(model, instance, protected_col, values_to_test):
    """Check how predictions change when protected attribute changes."""
    results = []
    original = instance.copy()

    for val in values_to_test:
        modified = original.copy()
        modified[protected_col] = val
        pred = model.predict_proba(modified.values.reshape(1, -1))[0]
        results.append({
            protected_col: val,
            'P(Favourable)': pred[1],
            'P(Unfavourable)': pred[0]
        })

    results_df = pd.DataFrame(results)
    print("Counterfactual Analysis:")
    print(results_df)

    max_diff = results_df['P(Favourable)'].max() - results_df['P(Favourable)'].min()
    print(f"\nMax prediction difference: {max_diff:.4f}")
    if max_diff > 0.05:
        print("⚠️ Model may not be counterfactually fair!")

    return results_df
```
## Counterfactual Explanations for Users
Counterfactuals also help explain decisions to end users: "Your loan was rejected. If your income were $5,000 higher, it would have been approved."
```python
def find_counterfactual(model, instance, target_class, feature_ranges, step_size=0.01):
    """Greedily search each feature for the smallest change that flips the prediction."""
    current = instance.copy()

    for feature, (min_val, max_val) in feature_ranges.items():
        original_val = current[feature]

        # Try increasing the feature towards its maximum
        for delta in np.arange(step_size, max_val - original_val, step_size):
            current[feature] = original_val + delta
            if model.predict(current.values.reshape(1, -1))[0] == target_class:
                print(f"Change {feature}: {original_val:.2f} → {current[feature]:.2f}")
                return current

        # Try decreasing the feature towards its minimum
        for delta in np.arange(step_size, original_val - min_val, step_size):
            current[feature] = original_val - delta
            if model.predict(current.values.reshape(1, -1))[0] == target_class:
                print(f"Change {feature}: {original_val:.2f} → {current[feature]:.2f}")
                return current

        current[feature] = original_val  # reset before trying the next feature

    print("No counterfactual found within given ranges")
    return None
```
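A runnable end-to-end sketch of the same idea, on a toy loan model: the data, the logistic model, the feature names, and the $500 step are all invented for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy loan data (income and debt in thousands of dollars)
X = pd.DataFrame({
    'income_k': rng.normal(50, 15, 500),
    'debt_k': rng.normal(10, 5, 500),
})
y = (X['income_k'] - 0.5 * X['debt_k'] + rng.normal(0, 5, 500) > 45).astype(int)
model = LogisticRegression(max_iter=1000).fit(X.values, y)

# A rejected applicant
applicant = pd.Series({'income_k': 40.0, 'debt_k': 12.0})
print("Current decision:", model.predict(applicant.values.reshape(1, -1))[0])

# Walk income upward in $500 steps until the decision flips
trial = applicant.copy()
for income in np.arange(applicant['income_k'], 80.0, 0.5):
    trial['income_k'] = income
    if model.predict(trial.values.reshape(1, -1))[0] == 1:
        print(f"Approved if income were ${(income - applicant['income_k']) * 1000:,.0f} higher")
        break
```

The output is exactly the kind of statement shown above: a minimal, actionable change expressed in the user's own terms.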