Distributed Training Across Heterogeneous GPUs with FSDP2 and Ray
ml systems distributed-training
Work in Progress — This post is still being written. I’ll be documenting the design, implementation, and lessons learned from building a distributed training pipeline.
Motivation
Most LLMs these days don’t fit on consumer GPUs. I have a measly 8 GB of VRAM on my old GTX 1070, which means training anything meaningful requires distributing the work across multiple GPUs.
The vast majority of training clusters use NVIDIA hardware, so I built this pipeline to test whether I can provision an efficient training cluster from heterogeneous GPUs and orchestrate it properly.
Experimental Goals
This is a rigorous experiment testing distributed ML training using:
- PyTorch FSDP2 — fully sharded data parallelism
- Ray Train — distributed training orchestration
- Kubernetes + KubeRay — heterogeneous GPU cluster management
- Vast.ai spot instances — cost-effective GPU provisioning
POC Experiments
I’m planning to verify this pipeline across several dimensions:
Cluster & Scheduling
- KubeRay heterogeneous groups — pools, placement, tier isolation
- Kueue fair scheduling — concurrent submissions, quota enforcement
- RayJob lifecycle — submit / monitor / cancel / no orphaned pods
Backend Strategy
- FSDP2 multi-node — NCCL rendezvous across K8s pods
- FSDP2 + CPU offload — graceful degradation vs. OOM
- DeepSpeed ZeRO-3 — checkpoint consolidation across shards
- ZeRO-Infinity (if NVMe) — load model exceeding total GPU VRAM
Strategy Selection
- Auto-selection — correct backend chosen per model size
- Fallback — FSDP2 OOM → retry with ZeRO-3
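To make the selection and fallback ideas concrete, here is a minimal sketch of how they might look. Everything in it is an assumption for illustration: the `Cluster` type, the `select_strategy` and `run_with_fallback` helpers, the strategy names, and the 80% headroom factor are all hypothetical, not the pipeline's actual code. The only borrowed fact is the common rule of thumb that mixed-precision Adam training holds about 16 bytes of model state per parameter. `MemoryError` stands in for the CUDA out-of-memory exception so the sketch runs without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Rule of thumb for mixed-precision Adam: 2 bytes (fp16 weights) + 2 (fp16 grads)
# + 12 (fp32 master weights, momentum, variance) = 16 bytes per parameter.
BYTES_PER_PARAM = 16


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of the GPU pool."""
    gpu_vram_gb: Sequence[float]   # one entry per GPU; sizes may differ
    has_nvme: bool = False
    usable_fraction: float = 0.8   # assumed headroom for activations and buffers


def select_strategy(num_params: int, cluster: Cluster) -> str:
    """Pick a backend name from model size alone (illustrative policy)."""
    state_gb = num_params * BYTES_PER_PARAM / 1e9
    shard_gb = state_gb / len(cluster.gpu_vram_gb)
    # Even sharding is limited by the smallest GPU in a heterogeneous pool.
    budget_gb = min(cluster.gpu_vram_gb) * cluster.usable_fraction
    if shard_gb <= budget_gb:
        return "fsdp2"
    # Model state larger than all GPU memory combined: needs NVMe offload.
    if state_gb > sum(cluster.gpu_vram_gb) and cluster.has_nvme:
        return "zero_infinity"
    return "fsdp2_cpu_offload"


def run_with_fallback(
    train_fn: Callable[[str], T], strategies: Sequence[str]
) -> Tuple[str, T]:
    """Try each strategy in order, moving on when one runs out of memory."""
    last_error = None
    for name in strategies:
        try:
            return name, train_fn(name)
        except MemoryError as err:  # stand-in for the CUDA OOM exception
            last_error = err
    raise RuntimeError("all strategies ran out of memory") from last_error
```

With these assumptions, a 1B-parameter model on an 8 GB plus 24 GB pair does not fit under plain sharding, because the 8 GB shard exceeds the smaller card's budget, so the selector picks CPU offload. A call such as `run_with_fallback(train_fn, ["fsdp2", "zero3"])` mirrors the FSDP2 to ZeRO-3 retry listed above.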
Researcher Interface
- End-to-end submission — plain Python in, job ID + MLflow out
- Fault tolerance — kill worker mid-run, checkpoint resumes
- MLflow logging — metrics correct, not duplicated across workers
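One common way to keep metrics from being duplicated across workers is to let only rank 0 write them. The sketch below illustrates that check under stated assumptions: `RankZeroLogger` and `simulate_workers` are hypothetical names, and the sink is an in-memory list rather than a real tracking server. In an actual Ray Train worker the rank would presumably come from `ray.train.get_context().get_world_rank()` and the sink could be `mlflow.log_metrics`, but neither is wired in here.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, Dict[str, float]]


class RankZeroLogger:
    """Forwards metrics to a sink only when running on rank 0 (hypothetical helper)."""

    def __init__(self, rank: int, sink: Callable[[Dict[str, float], int], None]):
        self.rank = rank
        self.sink = sink

    def log(self, metrics: Dict[str, float], step: int) -> bool:
        # Every worker calls log(); only rank 0 actually writes.
        if self.rank != 0:
            return False
        self.sink(metrics, step)
        return True


def simulate_workers(world_size: int, steps: int) -> List[Record]:
    """Run `world_size` fake workers that all log the same metric each step."""
    records: List[Record] = []

    def sink(metrics: Dict[str, float], step: int) -> None:
        records.append((step, dict(metrics)))

    loggers = [RankZeroLogger(rank, sink) for rank in range(world_size)]
    for step in range(steps):
        for logger in loggers:
            logger.log({"loss": 1.0 / (step + 1)}, step)
    return records
```

Simulating four workers over three steps should leave exactly three records, one per step. Note that this only removes duplicates: if each worker computes a different local loss, the values would still need to be reduced across ranks before rank 0 logs them.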
More details coming soon.