Proxy Features¶
What Are Proxy Features?¶
Removing protected features from your dataset does not guarantee your model is bias-free: seemingly neutral features can serve as proxies for protected ones through correlation.
Common Misconception
"If I'm not using any protected features, I can't have bias." Wrong. Proxy features smuggle protected information back into the model.
Common Proxy Relationships¶
| Proxy Feature | What It Reveals |
|---|---|
| Tax paid | Income level |
| Zip code | Race / ethnicity |
| Shopping patterns | Gender / marital status |
| Disposable income | Gender / marital status |
| Salary + age combined | Gender / promotions |
| Car engine size | Buyer's gender |
American Express (2019)
AmEx decreased credit limits for customers because "other customers who used their card at establishments where you recently shopped have a poor repayment history." Shopping location was a proxy for socio-economic status.
Methods to Detect Proxy Features¶
1. Linear Regression¶
The simplest approach: regress one feature against another and check \(R^2\).
As the fit improves, the residual standard error

\[ SE = \sqrt{\frac{\sum_{i}(y_i - \hat{y}_i)^2}{n - 2}} \]

shrinks toward zero. If \(R^2 \approx 1\) (near-perfect fit), the features are collinear: one is a proxy for the other.
Tip
Run regression with each protected feature as the dependent variable against all other features. High \(R^2\) values indicate proxy candidates.
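The tip above can be sketched with scikit-learn on synthetic data. The feature names and the data-generating process below are invented for illustration: `zip_code_index` is deliberately constructed as a noisy function of the protected feature, while `income` is independent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: `zip_code_index` leaks the protected feature,
# `income` does not (both names are hypothetical).
protected = rng.integers(0, 2, size=500).astype(float)
zip_code_index = 3.0 * protected + rng.normal(0, 0.3, size=500)
income = rng.normal(50_000, 10_000, size=500)

# Regress the protected feature against each candidate feature and
# inspect R^2; values near 1 flag proxy candidates.
r2_scores = {}
for name, x in [("zip_code_index", zip_code_index), ("income", income)]:
    model = LinearRegression().fit(x.reshape(-1, 1), protected)
    r2_scores[name] = model.score(x.reshape(-1, 1), protected)
    print(f"{name}: R^2 = {r2_scores[name]:.3f}")
```

With this construction the proxy scores close to 1 and the independent feature close to 0, mirroring the rule of thumb above.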
2. Variance Inflation Factor (VIF)¶
VIF measures multicollinearity: each feature \(i\) is regressed against all the others, and the resulting \(R_i^2\) is converted to

\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \]
| VIF Value | Interpretation |
|---|---|
| 1 | No collinearity |
| 1–5 | Moderate |
| > 5 | High collinearity — likely proxy |
| > 10 | Severe — strong proxy |
```python
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def detect_proxies_vif(df, protected_col):
    """Detect proxy candidates using VIF."""
    # Include the protected feature alongside all numeric features
    features = df.select_dtypes(include=[np.number]).columns.tolist()
    vif_data = pd.DataFrame()
    vif_data["Feature"] = features
    vif_data["VIF"] = [
        variance_inflation_factor(df[features].values, i)
        for i in range(len(features))
    ]
    print(f"Proxy candidates for '{protected_col}':")
    print(vif_data.sort_values("VIF", ascending=False))
    return vif_data
```
VIF Limitation
VIF only captures linear relationships. If a proxy relationship is non-linear (or hidden in a categorical encoding), VIF may miss it. Use mutual information for non-linear relationships.
3. Linear Association Using Variance¶
From the paper "Hunting for Discriminatory Proxies in Linear Regression Models", the association between a candidate feature \(x\) and a protected feature \(z\) can be measured as

\[ \mathrm{Assoc}(x, z) = \frac{\mathrm{Cov}(x, z)^2}{\mathrm{Var}(x)\,\mathrm{Var}(z)} \]

This is the square of the Pearson correlation coefficient: higher values indicate stronger proxies.
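A minimal sketch of this measure as the squared Pearson correlation; the function name `linear_association` and the synthetic vectors are our own, not from the paper.

```python
import numpy as np

def linear_association(x, z):
    """Squared Pearson correlation: Cov(x, z)^2 / (Var(x) * Var(z))."""
    cov = np.cov(x, z, ddof=0)[0, 1]
    return cov**2 / (np.var(x) * np.var(z))

rng = np.random.default_rng(1)
z = rng.normal(size=1000)
x_proxy = 2.0 * z + rng.normal(0, 0.1, size=1000)  # strong proxy
x_noise = rng.normal(size=1000)                    # independent

print(linear_association(x_proxy, z))  # close to 1
print(linear_association(x_noise, z))  # close to 0
```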
4. Cosine Similarity¶
Cosine similarity measures the angle between two feature vectors; it is a quick, scale-free check that complements the regression-based methods:
```python
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

def detect_proxies_cosine(df, protected_col):
    """Detect proxies using cosine similarity."""
    features = df.select_dtypes(include=[np.number]).columns
    protected_vec = df[protected_col].values.reshape(1, -1)
    results = {}
    for col in features:
        if col != protected_col:
            sim = cosine_similarity(
                protected_vec,
                df[col].values.reshape(1, -1)
            )[0][0]
            results[col] = sim
    results_df = pd.DataFrame(
        results.items(), columns=['Feature', 'Cosine Similarity']
    ).sort_values('Cosine Similarity', ascending=False)
    print(f"Cosine similarity with '{protected_col}':")
    print(results_df)
    return results_df
```
When to Use Which Method
- VIF / Linear Regression: Good for linear relationships
- Cosine Similarity: A cheap, scale-free check; like correlation, it mainly captures linear association
- Mutual Information: Best for capturing any type of dependency, linear or non-linear
5. Mutual Information¶
Mutual information measures how much information one variable reveals about another — it captures any relationship (linear or non-linear):
```python
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import numpy as np

def detect_proxies_mi(df, protected_col):
    """Detect proxies using mutual information."""
    # mutual_info_classif expects a discrete target, so the protected
    # column should be categorical (e.g., label-encoded).
    features = [c for c in df.select_dtypes(include=[np.number]).columns
                if c != protected_col]
    mi_scores = mutual_info_classif(
        df[features], df[protected_col], random_state=42
    )
    results = pd.DataFrame({
        'Feature': features,
        'Mutual Information': mi_scores
    }).sort_values('Mutual Information', ascending=False)
    print(f"Mutual information with '{protected_col}':")
    print(results)
    return results
```
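To see why this matters, the hypothetical dataset below contains a feature whose spread, not its mean, differs by group. Correlation scores it near zero, while mutual information flags it. All names and the data-generating process are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 1000

# Hypothetical protected attribute and features (synthetic data)
gender = rng.integers(0, 2, size=n)
# Non-linear proxy: same mean in both groups, very different spread,
# so Pearson correlation stays near zero but MI is substantial.
spend_volatility = np.where(gender == 1,
                            rng.normal(0, 2.0, n),
                            rng.normal(0, 0.1, n))
salary = rng.normal(60_000, 15_000, n)  # genuinely independent

df = pd.DataFrame({"spend_volatility": spend_volatility,
                   "salary": salary})
print(f"corr with gender: "
      f"{np.corrcoef(spend_volatility, gender)[0, 1]:.3f}")

mi = dict(zip(df.columns, mutual_info_classif(df, gender,
                                              random_state=42)))
for name, score in mi.items():
    print(f"{name}: MI = {score:.3f}")
```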
Detection Strategy¶
```mermaid
graph TD
    A[Identify Protected Features] --> B[Run VIF Analysis]
    B --> C{VIF > 5?}
    C -->|Yes| D[Flag as Proxy Candidate]
    C -->|No| E[Run Cosine Similarity]
    E --> F{Similarity > 0.9?}
    F -->|Yes| D
    F -->|No| G[Run Mutual Information]
    G --> H{MI Score High?}
    H -->|Yes| D
    H -->|No| I[Feature is Independent]
    D --> J[Business Review]
    J --> K[Remove or Transform]
```
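The cascade can be sketched as a single function. This is a simplified pairwise version (each candidate is checked against the protected feature only, rather than against all other features), and the thresholds and the `proxy_verdict` name are illustrative assumptions, not established defaults.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import mutual_info_classif

def proxy_verdict(df, protected_col, vif_thresh=5.0,
                  cos_thresh=0.9, mi_thresh=0.1):
    """One pass of the detection cascade, pairwise against the
    protected feature. Thresholds are illustrative, not standards."""
    z = df[protected_col].to_numpy(dtype=float)
    verdicts = {}
    for col in df.columns:
        if col == protected_col:
            continue
        x = df[col].to_numpy(dtype=float)
        # Step 1: pairwise VIF = 1 / (1 - R^2) of x regressed on z
        r2 = LinearRegression().fit(z.reshape(-1, 1), x).score(
            z.reshape(-1, 1), x)
        if 1.0 / max(1.0 - r2, 1e-12) > vif_thresh:
            verdicts[col] = "proxy candidate (VIF)"
            continue
        # Step 2: cosine similarity between the raw vectors
        cos = np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))
        if abs(cos) > cos_thresh:
            verdicts[col] = "proxy candidate (cosine)"
            continue
        # Step 3: mutual information as the catch-all
        mi = mutual_info_classif(x.reshape(-1, 1), z.astype(int),
                                 random_state=0)[0]
        verdicts[col] = ("proxy candidate (MI)" if mi > mi_thresh
                         else "independent")
    return verdicts

# Synthetic demo: `tax_paid` is built as a linear proxy for `gender`
rng = np.random.default_rng(7)
g = rng.integers(0, 2, 400)
demo = pd.DataFrame({
    "gender": g,
    "tax_paid": 5.0 * g + rng.normal(0, 0.2, 400),  # linear proxy
    "noise": rng.normal(size=400),                  # unrelated
})
verdicts = proxy_verdict(demo, "gender")
print(verdicts)
```

Any feature that reaches "Business Review" in the flowchart would then be removed or transformed by a human decision, not automatically.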
Check After Feature Engineering Too
Proxy features can be created during feature engineering. For example, `disposable_income = total_income - expenses` can become a proxy for gender or marital status even if those features were already removed.
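A small synthetic illustration of this effect; the population model (married households spending more) is invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical population: expenses differ systematically by marital
# status, even though neither raw input encodes it directly.
married = rng.integers(0, 2, size=n)          # protected, later dropped
total_income = rng.normal(60_000, 8_000, n)
expenses = rng.normal(25_000, 3_000, n) + 12_000 * married

# Feature engineering step - the protected column is not used here,
# yet the result inherits its signal through `expenses`.
disposable_income = total_income - expenses

corr = np.corrcoef(disposable_income, married)[0, 1]
print(f"corr(disposable_income, married) = {corr:.2f}")
```

Even after dropping `married` from the training data, the engineered feature carries a clearly non-zero correlation with it, so the detection methods above should be re-run after every feature-engineering pass.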