ML Systems Engineering — Field Guide

A self-study curriculum for staff-level ML engineering roles ($200K–$750K+) at Netflix, Google, Meta, Anthropic, and beyond.


Why This Exists

The gap between "I understand ML concepts" and "I can build production ML systems at scale" is where the highest-paying roles in tech live. This wiki bridges that gap — covering the full stack from distributed training to serving, from data pipelines to recommendation systems.

Every section is built around what top companies actually ask for in their job postings and what their engineering blogs reveal about how they work.

The Target

Netflix Research Scientist 5/6 — $466,000–$750,000/year

Required: Python, TensorFlow, PyTorch

Nice to have: Java, Scala, Spark, Hive, JAX, Flink, Hadoop

The posting calls for deep expertise in LLM development, post-training (fine-tuning, distillation), distributed training, reinforcement learning, personalization, and recommender systems.

This isn't just Netflix. The same skill set maps to:

  • Google — ML Infrastructure, Search, Ads
  • Meta — AI Research, Recommendation Systems
  • Anthropic / OpenAI — Post-training, RLHF, Safety
  • Spotify — Personalization, ML Platform
  • Apple — On-device ML, Foundation Models

Curriculum Map

┌─────────────────────────────────────────────────────────────┐
│                   ML SYSTEMS ENGINEERING                    │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ 1. POST-     │ 2. DISTRIB-  │ 3. INFERENCE │ 4. DATA AT     │
│ TRAINING     │ UTED         │ & SERVING    │ SCALE          │
│              │ TRAINING     │              │                │
│ • SFT        │ • FSDP       │ • vLLM       │ • Spark SQL    │
│ • RLHF       │ • Tensor     │ • KV Cache   │ • Hive         │
│ • DPO        │   Parallel   │ • Quantize   │ • Feature      │
│ • GRPO       │ • Ray        │ • Batching   │   Stores       │
│ • LoRA       │ • Checkpoint │              │ • Training     │
│ • Distill    │ • MFU        │              │   Data         │
├──────────────┼──────────────┼──────────────┼────────────────┤
│ 5. RECSYS    │ 6. TRANS-    │ 7. RL FOR    │ 8. MLOPS       │
│              │ FORMERS      │ LLMs         │                │
│ • Collab     │ • Attention  │ • PPO        │ • Experiment   │
│   Filtering  │ • Position   │ • GRPO       │   Tracking     │
│ • Two-Tower  │   Encoding   │ • Reward     │ • CI/CD        │
│ • Semantic   │ • MoE        │   Modeling   │ • A/B Testing  │
│   IDs        │ • Modern     │ • On/Off     │                │
│ • Netflix    │   Archs      │   Policy     │                │
│   Personal.  │              │              │                │
└──────────────┴──────────────┴──────────────┴────────────────┘

How to Use This Wiki

  1. Read section overviews first — each section starts with "what this is and why it matters"
  2. Depth pages — detailed explanations with code examples, math where needed
  3. Case studies — real implementations from Netflix, Google, Meta tech blogs
  4. Reading list — papers, blogs, and courses ranked by priority
  5. Target roles — specific job postings mapped to wiki sections

Progress Tracker

Section                      Status          Pages
1. LLM Post-Training         🟢 Started      7
2. Distributed Training      🟡 Scaffolded   5
3. ML Inference & Serving    🟡 Scaffolded   4
4. Data at Scale             🟡 Scaffolded   4
5. Recommender Systems       🟡 Scaffolded   4
6. Transformers Deep Dive    🟡 Scaffolded   4
7. RL for LLMs               🟡 Scaffolded   4
8. MLOps & Infrastructure    🟡 Scaffolded   3

Built by Jose Pineda — preparing for what's next.