Ch 8: Data & Model Privacy

Introduction

In 2006, Netflix released about 100 million "anonymized" movie ratings for a competition. Within two weeks, researchers Narayanan and Shmatikov showed that subscribers could be re-identified by cross-referencing the ratings with public IMDb reviews, exposing viewing histories and, by inference, political preferences and other sensitive personal information.

Anonymization Is Not Enough

Latanya Sweeney estimated that 87% of the US population can be uniquely identified using just three attributes: 5-digit ZIP code, gender, and date of birth. Removing names is therefore not sufficient when such quasi-identifiers remain in the data.
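
To make the idea concrete, the sketch below counts how many records in a small, made-up table are the only ones with their (ZIP code, gender, date of birth) combination. The records and the `unique_fraction` helper are hypothetical illustrations, not Sweeney's data or method.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, gender, date_of_birth).
records = [
    ("02139", "F", "1985-03-14"),
    ("02139", "F", "1985-03-14"),
    ("02139", "M", "1990-07-01"),
    ("94110", "F", "1972-11-30"),
    ("94110", "M", "1972-11-30"),
]

def unique_fraction(rows):
    """Return the share of rows whose quasi-identifier tuple occurs exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

print(unique_fraction(records))  # 3 of the 5 rows are unique -> 0.6
```

On real data the same count is usually run over several candidate attribute sets, since the risky combination is not always obvious in advance.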

What Is Private Data?

GDPR defines personal data as any information relating to an identified or identifiable natural person. Examples include:

  • Social media content (text, images, videos)
  • Location data from devices
  • Device usage patterns and shopping behavior
  • Financial information and health records
  • Relationship information

Types of Attacks

| Attack Type | Attacker Knowledge | Goal |
| --- | --- | --- |
| White box | Has model parameters | Leak data, cause misclassification |
| Black box | No model knowledge | Reverse-engineer model or data |
| Grey box | Partial knowledge | Combination of above |
| Linkage attack | Public datasets | Re-identify anonymized records |
| Model inversion | Model access | Reconstruct training data |
| Data poisoning | Training pipeline access | Degrade model performance |


```mermaid
graph TD
    A[Privacy Threats] --> B[Data Attacks]
    A --> C[Model Attacks]

    B --> B1[Linkage Attack]
    B --> B2[Re-identification]

    C --> C1[Model Inversion]
    C --> C2[Membership Inference]
    C --> C3[Data Poisoning]

    D[Defenses] --> D1[Differential Privacy]
    D --> D2[Federated Learning]
    D --> D3[Encryption]
```
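
A linkage attack can be sketched in a few lines: join an "anonymized" release with a public dataset on the attributes they share. Everything below is a toy assumption: the two datasets, the field names, and the `link` helper are invented for illustration and do not reproduce any published attack.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02139", "gender": "F", "dob": "1985-03-14", "diagnosis": "asthma"},
    {"zip": "94110", "gender": "M", "dob": "1972-11-30", "diagnosis": "diabetes"},
]

# Hypothetical public dataset (for example, a voter roll) that includes names.
public = [
    {"name": "Alice Example", "zip": "02139", "gender": "F", "dob": "1985-03-14"},
    {"name": "Bob Example", "zip": "94110", "gender": "M", "dob": "1972-11-30"},
    {"name": "Carol Example", "zip": "94110", "gender": "F", "dob": "1960-01-02"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

def link(anonymized_rows, public_rows, keys=QUASI_IDENTIFIERS, sensitive="diagnosis"):
    """Re-identify anonymized rows that match exactly one public row on `keys`."""
    # Index the public rows by their quasi-identifier tuple.
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anonymized_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match is a re-identification
            matches.append((names[0], row[sensitive]))
    return matches

print(link(anonymized, public))
# [('Alice Example', 'asthma'), ('Bob Example', 'diabetes')]
```

The attacker needs no access to the model or the original database, only a second dataset that shares enough attributes with the release.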

Defense Approaches

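| Technique | What It Protects | How |
| --- | --- | --- |
| Differential Privacy | Individual records | Add calibrated noise to query results or training updates |
| Federated Learning | Raw data | Train without centralizing data |
| k-Anonymity | Identity | Ensure each record shares its quasi-identifier values with at least k-1 others |
| l-Diversity | Sensitive attributes | Ensure at least l distinct sensitive values per group |
| t-Closeness | Distribution of sensitive attributes | Keep each group's distribution within distance t of the overall distribution |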
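
As a rough illustration of the k-anonymity and l-diversity rows, the sketch below measures both properties on a small generalized table. The rows and helper names are hypothetical, and the l-diversity check uses the simplest variant (counting distinct sensitive values per group).

```python
from collections import defaultdict

# Hypothetical generalized table: (zip_prefix, age_range, diagnosis).
# The last column is the sensitive attribute; the rest are quasi-identifiers.
rows = [
    ("021**", "30-39", "asthma"),
    ("021**", "30-39", "flu"),
    ("021**", "30-39", "asthma"),
    ("941**", "50-59", "diabetes"),
    ("941**", "50-59", "diabetes"),
]

def group_by_quasi_identifiers(table):
    """Map each quasi-identifier tuple to the sensitive values in its group."""
    groups = defaultdict(list)
    for *quasi, sensitive in table:
        groups[tuple(quasi)].append(sensitive)
    return groups

def k_anonymity(table):
    """Largest k such that every group contains at least k records."""
    return min(len(v) for v in group_by_quasi_identifiers(table).values())

def l_diversity(table):
    """Largest l such that every group has at least l distinct sensitive values."""
    return min(len(set(v)) for v in group_by_quasi_identifiers(table).values())

print(k_anonymity(rows), l_diversity(rows))  # 2 1
```

The toy table is 2-anonymous but only 1-diverse: everyone in the `941**` group has the same diagnosis, so knowing that a person is in that group reveals the sensitive value even though the individual record cannot be singled out.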

Privacy Early in the Pipeline

Add privacy protections as early as possible in your data pipeline. If data is made private before entering your datastore, all downstream consumption inherits the protection.
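
One way to picture this is a sanitizing step at the ingestion boundary, so that only coarsened records ever reach storage. The sketch below is a minimal example under assumed field names (`name`, `email`, `zip`, `dob`, `gender`, `purchase_total`); generalization of this kind is just one possible protection and does not by itself guarantee privacy.

```python
def sanitize(record):
    """Drop direct identifiers and coarsen quasi-identifiers before storage."""
    return {
        "zip_prefix": record["zip"][:3] + "**",  # keep only the first 3 digits
        "birth_year": record["dob"][:4],         # keep only the year of YYYY-MM-DD
        "gender": record["gender"],
        "purchase_total": record["purchase_total"],
        # "name" and "email" are intentionally not copied.
    }

def ingest(raw_records, datastore):
    """Apply sanitize() to each record at the pipeline boundary, then store it."""
    for record in raw_records:
        datastore.append(sanitize(record))

# Hypothetical raw event arriving from an upstream source.
raw = [{
    "name": "Alice Example",
    "email": "alice@example.com",
    "zip": "02139",
    "dob": "1985-03-14",
    "gender": "F",
    "purchase_total": 42.5,
}]

store = []  # stands in for the real datastore
ingest(raw, store)
print(store)
# [{'zip_prefix': '021**', 'birth_year': '1985', 'gender': 'F', 'purchase_total': 42.5}]
```

Because every downstream job reads from `store`, none of them can see the name, email, full ZIP code, or exact birth date.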

