1. LLM Post-Training¶
What Is Post-Training?¶
Pre-training gives an LLM broad linguistic ability and general knowledge by training on massive text corpora. Post-training is everything that happens after — the phase that transforms a general model into one that follows instructions, aligns with human preferences, and meets production reliability requirements.
| Pre-Training | Post-Training | Production |
|---|---|---|
| Broad knowledge | Aligned behavior | Deployed model |
| Trillions of tokens on web text | SFT → RLHF/DPO/GRPO, domain adaptation | Serving via vLLM, A/B tested |
Why It Matters¶
- Pre-training costs $10M–$100M+ and takes months → you don't do this yourself
- Post-training costs $1K–$100K and takes days → this is where you add value
- Every production LLM (ChatGPT, Claude, Gemini) goes through extensive post-training
- Netflix's entire ML research team is focused on post-training open-weight models for their specific use cases
The Post-Training Landscape (2025-2026)¶
| Method | Type | Signal | Key Idea |
|---|---|---|---|
| SFT | Supervised | Dense (per-token) | Train on curated (prompt, response) pairs |
| RLHF | RL-based | Sparse (reward) | Train reward model on preferences, then optimize policy with PPO |
| DPO | Offline | Preference pairs | Skip the reward model — optimize directly on preferred vs rejected |
| GRPO | On-policy RL | Sparse (reward) | Group-relative scoring, no critic network needed |
| LoRA | Parameter-efficient | Any of above | Train only low-rank adapter weights, not full model |
| Distillation | Supervised | Teacher logits | Train small model to mimic large model's outputs |
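The "group-relative scoring" idea behind GRPO can be shown concretely. A minimal sketch (not DeepSeek's implementation): sample several completions for one prompt, score each with a reward function, and normalize each reward against the group's mean and standard deviation — this normalized value serves as the advantage, so no learned critic network is needed.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its group's
    mean and standard deviation instead of using a critic network."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled completions of one prompt
rewards = [1.0, 0.0, 0.5, 0.5]
advs = group_relative_advantages(rewards)
# Best completion gets a positive advantage, worst a negative one;
# advantages sum to ~0 within the group.
```

Completions scoring above their group's mean are reinforced, those below are penalized — the group itself provides the baseline.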
The Evolution¶
2022: SFT → RLHF (InstructGPT)
"Fine-tune, then align with human feedback"
2023: SFT → DPO (Zephyr, Neural Chat)
"Skip the reward model, optimize preferences directly"
2025: SFT → GRPO (DeepSeek-R1)
"On-policy RL without a critic — group-relative scoring"
2026: SFT → GRPO + Domain Adaptation (Netflix, Google)
"Post-training as a product differentiator"
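The DPO step in the timeline above ("skip the reward model") reduces to a single closed-form loss over preference pairs. A minimal sketch for one pair, using illustrative log-probability values: the loss is the negative log-sigmoid of the policy's log-ratio between chosen and rejected responses, measured relative to a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of beta times the
    (policy minus reference) log-ratio of chosen over rejected."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical sequence log-probs: the policy favors the chosen response
# more strongly than the reference does, so the loss is small.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
```

Gradient descent on this loss pushes the policy toward the preferred response directly, with no reward model or RL rollout loop.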
Netflix's Approach¶
Netflix built an internal post-training framework that supports SFT, DPO, RL, and distillation. Key insight from their blog:
"SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from 'offline training loop' to 'multi-stage, on-policy orchestration.'"
Their framework handles:
- Standard chat/instruction fine-tuning
- Custom transformers trained on non-NLP sequences (member interaction events)
- RL loops with business-defined reward metrics
- Models with expanded vocabularies (semantic IDs for catalog items)
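Expanding a model's vocabulary, as in the last bullet, means growing the embedding matrix to make room for new token IDs. A minimal NumPy sketch under assumed details (Netflix's blog does not describe their initialization): new rows are initialized near the mean of existing embeddings, a common heuristic that keeps new tokens in-distribution before fine-tuning.

```python
import numpy as np

def expand_embeddings(emb: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Append n_new embedding rows for new tokens (e.g. semantic IDs for
    catalog items), initialized near the mean of the existing rows."""
    rng = np.random.default_rng(seed)
    mean = emb.mean(axis=0)
    new_rows = mean + 0.01 * rng.standard_normal((n_new, emb.shape[1]))
    return np.vstack([emb, new_rows])

# Toy example: a 100-token vocabulary with 16-dim embeddings gains 8 IDs
emb = expand_embeddings(np.zeros((100, 16)), n_new=8)
# emb.shape is now (108, 16)
```

In practice the output projection (LM head) must be resized the same way so the model can also predict the new tokens.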
What To Learn (Priority Order)¶
1. SFT — the foundation; understand loss masking, chat templates, data curation
2. LoRA — parameter-efficient fine-tuning; you'll use this constantly
3. DPO — the simplest preference optimization method
4. GRPO — the current frontier (DeepSeek-R1)
5. RLHF/PPO — the classic approach; important historically and conceptually
6. Distillation — increasingly important for deployment (large → small model transfer)
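The "loss masking" mentioned for SFT is worth seeing concretely. A minimal sketch with made-up token IDs: prompt positions are set to an ignore index (−100 is the convention PyTorch's cross-entropy uses) so the loss is computed only on the response tokens — the model learns to produce answers, not to reproduce prompts.

```python
def sft_labels(input_ids, prompt_len, ignore_index=-100):
    """Build SFT labels: mask prompt positions with ignore_index so only
    response tokens contribute to the cross-entropy loss."""
    return [ignore_index if i < prompt_len else tok
            for i, tok in enumerate(input_ids)]

# Hypothetical tokenized (prompt + response): first 4 tokens are the prompt
labels = sft_labels([101, 2023, 2003, 1996, 3433, 102], prompt_len=4)
# -> [-100, -100, -100, -100, 3433, 102]
```

Without this mask the model spends capacity modeling prompt text, which dilutes the instruction-following signal.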