
Proxy Features

What Are Proxy Features?

Removing protected features from your dataset does not guarantee your model is bias-free. Independent features can serve as proxies for protected features through correlation.

Common Misconception

"If I'm not using any protected features, I can't have bias." Wrong. Proxy features smuggle protected information back into the model.

Common Proxy Relationships

| Proxy Feature | What It Reveals |
|---|---|
| Tax paid | Income level |
| Zip code | Race / ethnicity |
| Shopping patterns | Gender / marital status |
| Disposable income | Gender / marital status |
| Salary + age combined | Gender / promotions |
| Car engine size | Buyer's gender |

American Express (2019)

AmEx decreased credit limits for customers because "other customers who used their card at establishments where you recently shopped have a poor repayment history." Shopping location was a proxy for socio-economic status.

Methods to Detect Proxy Features

1. Linear Regression

The simplest approach: regress one feature against another and check \(R^2\).

\[X_1 = \lambda + a \times X_2\]

Standard error:

\[SE = \sqrt{1 - R^2_{adj}} \times \sigma_y\]

If \(R^2 \approx 1\) (perfect fit), the features are collinear — one is a proxy for the other.

Tip

Run regression with each protected feature as the dependent variable against all other features. High \(R^2\) values indicate proxy candidates.
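The tip above can be sketched as a small scan (hypothetical DataFrame and column names; thresholds are yours to calibrate): regress the protected feature on each other numeric feature, one at a time, and rank by \(R^2\).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_proxy_scan(df, protected_col):
    """Regress the protected feature on each other numeric feature
    and report R^2; values near 1 flag proxy candidates."""
    y = df[protected_col].values
    scores = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        if col == protected_col:
            continue
        X = df[[col]].values
        # .score() returns R^2 of the fitted single-feature regression
        scores[col] = LinearRegression().fit(X, y).score(X, y)
    return pd.Series(scores).sort_values(ascending=False)
```

On synthetic data where one column is a noisy multiple of the protected feature, that column rises to the top of the ranking while unrelated columns score near zero.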

2. Variance Inflation Factor (VIF)

VIF measures multicollinearity by computing \(R^2\) for each feature regressed against all others:

\[VIF = \frac{1}{1 - R^2}\]
| VIF Value | Interpretation |
|---|---|
| 1 | No collinearity |
| 1–5 | Moderate collinearity |
| > 5 | High collinearity — likely proxy |
| > 10 | Severe — strong proxy |
```python
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def detect_proxies_vif(df, protected_col):
    """Detect proxy features using VIF."""
    # Include the protected feature alongside all numeric features
    features = df.select_dtypes(include=[np.number]).columns.tolist()
    X = df[features].values

    vif_data = pd.DataFrame({
        "Feature": features,
        "VIF": [variance_inflation_factor(X, i) for i in range(len(features))],
    })

    print(f"Proxy candidates for '{protected_col}':")
    print(vif_data.sort_values("VIF", ascending=False))
    return vif_data
```

VIF Limitation

VIF only captures linear relationships. It does account for combinations of features (each feature is regressed on all the others jointly), but a proxy formed through a non-linear relationship will slip past it. Use mutual information to detect non-linear dependencies.

3. Linear Association Using Variance

From the paper "Hunting for Discriminatory Proxies in Linear Regression Models":

\[Assoc = \frac{cov(X_1, X_2)^2}{Var(X_1) \times Var(X_2)}\]

This is the square of the Pearson correlation coefficient: values close to 1 indicate a stronger proxy relationship, and squaring makes the measure symmetric and sign-independent.
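A minimal sketch of this measure, verifying numerically that it equals the squared Pearson correlation (toy data; note the matching `ddof=1` so covariance and variance use the same normalization):

```python
import numpy as np

def linear_assoc(x1, x2):
    """Assoc = cov(X1, X2)^2 / (Var(X1) * Var(X2)).

    Equals the squared Pearson correlation coefficient when the
    covariance and variances use the same ddof."""
    cov = np.cov(x1, x2)[0, 1]          # np.cov defaults to ddof=1
    return cov**2 / (np.var(x1, ddof=1) * np.var(x2, ddof=1))
```

For any two samples, `linear_assoc(x1, x2)` matches `np.corrcoef(x1, x2)[0, 1] ** 2` up to floating-point error.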

4. Cosine Similarity

Cosine similarity measures the angle between two feature vectors — an uncentered analogue of correlation. It is a quick, scale-invariant check that works even when features have not been mean-centered:

\[Similarity = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum A_i B_i}{\sqrt{\sum A_i^2} \times \sqrt{\sum B_i^2}}\]
```python
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

def detect_proxies_cosine(df, protected_col):
    """Detect proxies using cosine similarity."""
    features = df.select_dtypes(include=[np.number]).columns
    protected_vec = df[protected_col].values.reshape(1, -1)

    results = {}
    for col in features:
        if col != protected_col:
            sim = cosine_similarity(
                protected_vec,
                df[col].values.reshape(1, -1)
            )[0][0]
            results[col] = sim

    results_df = pd.DataFrame(
        results.items(), columns=['Feature', 'Cosine Similarity']
    ).sort_values('Cosine Similarity', ascending=False)

    print(f"Cosine similarity with '{protected_col}':")
    print(results_df)
    return results_df
```

When to Use Which Method

  • VIF / Linear Regression: Good for linear relationships
  • Cosine Similarity: A fast, scale-invariant check; like correlation, it mainly captures linear association
  • Mutual Information: Best for capturing any type of dependency, linear or not

5. Mutual Information

Mutual information measures how much information one variable reveals about another — it captures any relationship (linear or non-linear):

\[MI(X; Y) = \sum_{x}\sum_{y} p(x, y) \log\frac{p(x, y)}{p(x) \cdot p(y)}\]
```python
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import numpy as np

def detect_proxies_mi(df, protected_col):
    """Detect proxies using mutual information."""
    features = [c for c in df.select_dtypes(include=[np.number]).columns
                if c != protected_col]

    # mutual_info_classif expects a discrete (categorical) target,
    # which protected attributes usually are
    mi_scores = mutual_info_classif(
        df[features], df[protected_col], random_state=42
    )

    results = pd.DataFrame({
        'Feature': features,
        'Mutual Information': mi_scores
    }).sort_values('Mutual Information', ascending=False)

    print(f"Mutual information with '{protected_col}':")
    print(results)
    return results
```

Detection Strategy

```mermaid
graph TD
    A[Identify Protected Features] --> B[Run VIF Analysis]
    B --> C{VIF > 5?}
    C -->|Yes| D[Flag as Proxy Candidate]
    C -->|No| E[Run Cosine Similarity]
    E --> F{Similarity > 0.9?}
    F -->|Yes| D
    F -->|No| G[Run Mutual Information]
    G --> H{MI Score High?}
    H -->|Yes| D
    H -->|No| I[Feature is Independent]
    D --> J[Business Review]
    J --> K[Remove or Transform]
```
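The cascade above can be sketched as a simple classifier. The VIF and cosine thresholds come from the diagram; the MI cutoff is an assumption you would calibrate against your own data:

```python
def classify_feature(vif, cosine_sim, mi,
                     vif_thresh=5.0, cos_thresh=0.9, mi_thresh=0.1):
    """Apply the detection cascade: flag a feature as a proxy candidate
    as soon as any stage exceeds its threshold, otherwise treat it as
    independent. The MI threshold is an assumption, not a standard."""
    if vif > vif_thresh:
        return "proxy candidate (VIF)"
    if abs(cosine_sim) > cos_thresh:
        return "proxy candidate (cosine)"
    if mi > mi_thresh:
        return "proxy candidate (MI)"
    return "independent"
```

Candidates flagged here still go to business review before being removed or transformed, exactly as in the diagram.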

Check After Feature Engineering Too

Proxy features can be created during feature engineering. For example, disposable_income = total_income - expenses can become a proxy for gender or marital status even if those features were already removed.
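A minimal synthetic illustration of this trap (fabricated toy data, hypothetical column names): even when `total_income` is independent of the protected attribute, subtracting a gender-linked `expenses` column creates a derived feature that carries the gender signal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
gender = rng.integers(0, 2, n)                       # protected attribute (0/1)
total_income = rng.normal(5000, 500, n)              # independent of gender here
expenses = rng.normal(3000, 200, n) + 800 * gender   # gender-linked spending

df = pd.DataFrame({'total_income': total_income, 'expenses': expenses})

# Feature engineering step: the new column inherits the gender signal
df['disposable_income'] = df['total_income'] - df['expenses']

# Correlate every column with the (already removed) protected attribute
corr = df.corrwith(pd.Series(gender)).abs()
```

Here `disposable_income` correlates substantially with gender while `total_income` does not, so proxy checks have to be re-run after every feature-engineering step.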

