# Ch 7: Accountability in AI

## Introduction
You've built a fair, explainable model and pushed it to production. Everything looks great. Then over time, performance starts to dip — not just accuracy, but fairness metrics too. What happened?
The answer: drift. The distribution of live data shifts over time, and if left unattended, this degrades model performance on every dimension.
## Types of Degradation
```mermaid
graph TD
    A[Model in Production] --> B{Performance Drop?}
    B -->|Features shifted| C[Data Drift]
    B -->|Target relationship changed| D[Concept Drift]
    B -->|Training ≠ Production| E[Production Skew]
    C --> F[Retrain / Recalibrate]
    D --> F
    E --> G[Debug Pipeline]
```
| Type | What Changes | Example |
|---|---|---|
| Data Drift | Input feature distributions (\(P(X)\)) or label priors (\(P(Y)\)) | Post-COVID income patterns shift |
| Concept Drift | Relationship between features and target (\(P(Y\|X)\)) | "Good credit" criteria evolve |
| Production Skew | Training vs live environment differences | Bugs, missing features, pipeline errors |
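A minimal way to detect the first row of this table, data drift in \(P(X)\), is to compare a live feature sample against a training-time baseline with a two-sample statistical test. The sketch below uses SciPy's Kolmogorov–Smirnov test; the income figures, sample sizes, and significance threshold are illustrative assumptions, not values from this chapter.

```python
# Sketch: flag data drift by comparing a live feature sample against
# a training-time baseline with a two-sample Kolmogorov-Smirnov test.
# Distributions and the alpha threshold are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.normal(50_000, 12_000, size=5_000)  # income at training time
live = rng.normal(58_000, 15_000, size=5_000)      # post-shift live income

def drifted(train: np.ndarray, prod: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if P(X) appears to have shifted (KS two-sample test)."""
    statistic, p_value = stats.ks_2samp(train, prod)
    return bool(p_value < alpha)

print(drifted(baseline, live))       # clear mean/variance shift -> True
print(drifted(baseline, baseline))   # identical sample -> False
```

In practice you would run a check like this per feature on each monitoring window, then route alerts into the retrain/recalibrate branch of the diagram above.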
## Why Monitoring Matters for RAI
**Fairness Degrades Too**
Standard monitoring tracks accuracy. But a model that's fair at deployment can become unfair as demographics shift. Monitor fairness metrics alongside accuracy.
Monitoring should track:
- ✅ Overall accuracy and performance
- ⚖️ Fairness metrics (SPD, DI, equalized odds) per protected group
- 🔍 Feature importance stability (SHAP/IV drift)
- 🔒 Privacy parameters (do they need refreshing?)
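The second checklist item, tracking fairness metrics per protected group, can be sketched with statistical parity difference (SPD): the gap in positive-prediction rates between groups. The group encoding and the 0.1 alert threshold below are illustrative assumptions.

```python
# Sketch: compute statistical parity difference (SPD) per monitoring
# window and alert when it drifts outside tolerance. Group encoding
# (0 = unprivileged, 1 = privileged) and the 0.1 threshold are
# illustrative assumptions.
import numpy as np

def statistical_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """P(pred=1 | unprivileged group) - P(pred=1 | privileged group)."""
    rate_unpriv = y_pred[group == 0].mean()
    rate_priv = y_pred[group == 1].mean()
    return float(rate_unpriv - rate_priv)

# Toy monitoring window: predictions and protected-group membership.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

spd = statistical_parity_difference(y_pred, group)  # 0.75 - 0.25 = 0.50
print(f"SPD = {spd:+.2f}")
if abs(spd) > 0.1:  # illustrative alert threshold
    print("fairness alert: SPD outside tolerance")
```

The same pattern extends to disparate impact or equalized-odds gaps: compute the metric on each window, compare against a tolerance set at deployment, and alert on drift just as you would for accuracy.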
## Regulatory Context
**Fed SR 11-7 Guidelines**
The Federal Reserve recommends:
- Verify that the model is being used and is performing as intended
- Monitor continuously to evaluate whether changes warrant adjustment, redevelopment, or replacement
- Validate any extension of the model beyond its original scope
Next: Data Drift →