# 2. Distributed Training
## Why This Matters
Modern LLMs don't fit on a single GPU. A 70B-parameter model needs ~140GB just for its weights in fp16. Training needs several times more on top of that: gradients and optimizer states add at least 3-4x the weight memory, and with standard mixed-precision Adam (fp32 master weights plus momentum and variance) the total reaches roughly 16 bytes per parameter, before counting activations. Distributed training splits this across multiple GPUs and nodes.
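As a rough check on those numbers, here is a minimal back-of-envelope sketch in Python. It assumes fp16 weights for inference and the common mixed-precision Adam layout (fp16 weights and gradients, fp32 master weights, fp32 momentum and variance, roughly 16 bytes per parameter); activation memory and the number of GPUs used for illustration are assumptions, not part of the original text.

```python
GB = 1e9  # decimal gigabytes, to match the ~140GB figure above

def model_memory_gb(num_params: float) -> dict[str, float]:
    """Rough memory footprint for weights vs. mixed-precision Adam training."""
    inference = num_params * 2                     # fp16 weights only
    training = num_params * (2 + 2 + 4 + 4 + 4)    # fp16 weights + fp16 grads
                                                   # + fp32 master weights
                                                   # + fp32 Adam momentum + variance
    return {"inference_gb": inference / GB, "training_gb": training / GB}

mem = model_memory_gb(70e9)
print(f"fp16 weights:            {mem['inference_gb']:.0f} GB")   # ~140 GB
print(f"mixed-precision training: {mem['training_gb']:.0f} GB")   # ~1120 GB

# Even ignoring activations, that is far beyond a single 80GB accelerator,
# which is why the state has to be sharded across many devices.
num_gpus = 16  # illustrative cluster size
print(f"per GPU across {num_gpus} GPUs: {mem['training_gb'] / num_gpus:.0f} GB")
```

Under these assumptions the training state alone is on the order of a terabyte, which motivates the sharding and parallelism strategies discussed in the rest of this section.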