Peter Bacalso

Distributed Training Across Heterogeneous GPUs with FSDP2 and Ray

ml systems distributed-training

Work in Progress — This post is still being written. I’ll be documenting the design, implementation, and lessons learned from building a distributed training pipeline.

Motivation

Most LLMs these days don’t fit on a single consumer GPU. I have a measly 8GB of VRAM on my old GTX 1070, which means training anything meaningful requires distributing the work across multiple GPUs.
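To put a rough number on that gap, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam, which is commonly accounted at about 16 bytes of model state per parameter, and it ignores activations entirely. The helper name `model_state_gib` is made up for illustration.

```python
# Rough model-state memory for mixed-precision Adam training.
# Assumption: 2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 optimizer
# state (master weights, momentum, variance) = 16 B per parameter.
# Activations, buffers, and fragmentation are ignored, so real usage is higher.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params: int) -> float:
    """Estimate the model-state footprint in GiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM / 2**30


if __name__ == "__main__":
    sizes = [("125M", 125_000_000), ("1B", 1_000_000_000), ("7B", 7_000_000_000)]
    for name, params in sizes:
        gib = model_state_gib(params)
        print(f"{name}: about {gib:.1f} GiB of model state vs. 8 GiB of VRAM")
```

Under these assumptions a 1B-parameter model already needs more model state than an 8GB card can hold, before any activations are counted.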

The vast majority of training clusters use NVIDIA hardware, so I built this pipeline to test whether I can provision an efficient training cluster from heterogeneous GPUs and orchestrate it properly.

Experimental Goals

This is a rigorous experiment testing distributed ML training using:

POC Experiments

I’m planning to verify this pipeline across several dimensions:

Cluster & Scheduling

  1. KubeRay heterogeneous groups — pools, placement, tier isolation
  2. Kueue fair scheduling — concurrent submissions, quota enforcement
  3. RayJob lifecycle — submit / monitor / cancel / no orphaned pods

Backend Strategy

  1. FSDP2 multi-node — NCCL rendezvous across K8s pods
  2. FSDP2 + CPU offload — graceful degradation vs. OOM
  3. DeepSpeed ZeRO-3 — checkpoint consolidation across shards
  4. ZeRO-Infinity (if NVMe) — load model exceeding total GPU VRAM

Strategy Selection

  1. Auto-selection — correct backend chosen per model size
  2. Fallback — FSDP2 OOM → retry with ZeRO-3
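As a sketch of what auto-selection and fallback could look like, the snippet below picks a backend from an estimated memory requirement and retries with ZeRO-3 when an FSDP2 attempt runs out of memory. Everything here is an assumption for illustration: the backend labels, the 80% headroom factor, the thresholds, and the `Cluster`, `select_backend`, and `run_with_fallback` names are hypothetical, not the pipeline's actual logic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class OutOfMemory(RuntimeError):
    """Stand-in for a backend OOM error (e.g. torch.cuda.OutOfMemoryError)."""


@dataclass(frozen=True)
class Cluster:
    gpu_vram_gib: Sequence[float]  # one entry per GPU; sizes may differ
    has_nvme: bool = False


HEADROOM = 0.8  # assumed fraction of VRAM usable for model state

# Assumed fallback map: an FSDP2 OOM triggers one retry with ZeRO-3.
FALLBACK = {"fsdp2": "zero3", "fsdp2_cpu_offload": "zero3"}


def select_backend(required_gib: float, cluster: Cluster) -> str:
    """Pick a backend label from placeholder heuristics."""
    if not cluster.gpu_vram_gib:
        raise ValueError("cluster has no GPUs")
    # With even sharding, the smallest GPU bounds the per-GPU shard size.
    shard_gib = required_gib / len(cluster.gpu_vram_gib)
    if shard_gib <= HEADROOM * min(cluster.gpu_vram_gib):
        return "fsdp2"
    if required_gib <= sum(cluster.gpu_vram_gib):
        return "fsdp2_cpu_offload"
    if cluster.has_nvme:
        return "zero_infinity"
    raise ValueError("model exceeds total GPU VRAM and no NVMe is available")


def run_with_fallback(train: Callable[[str], str], backend: str) -> str:
    """Run train(backend); on OOM, retry once with the mapped fallback."""
    try:
        return train(backend)
    except OutOfMemory:
        fallback = FALLBACK.get(backend)
        if fallback is None:
            raise  # no fallback configured, surface the original OOM
        return train(fallback)
```

Here `train` is any callable that takes a backend label and returns a job result. On a mixed 8GB + 24GB cluster, a 10 GiB requirement would map to `fsdp2` under these thresholds, while 20 GiB would map to `fsdp2_cpu_offload` because the even shard no longer fits on the smaller card.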

Researcher Interface

  1. End-to-end submission — plain Python in, job ID + MLflow out
  2. Fault tolerance — kill worker mid-run, checkpoint resumes
  3. MLflow logging — metrics correct, not duplicated across workers
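One common way to avoid duplicated metrics is to log from a single rank only. The sketch below shows that guard in plain Python; it assumes the launcher exposes the global rank in a `RANK` environment variable (as torchrun-style launchers do), and `world_rank` and `log_metrics_once` are hypothetical helper names. In a real run, `log_fn` could wrap `mlflow.log_metrics(metrics, step=step)`.

```python
import os
from typing import Callable, Mapping


def world_rank(env: Mapping[str, str] = os.environ) -> int:
    """Read the global rank from RANK; fall back to 0 when it is unset."""
    return int(env.get("RANK", "0"))


def log_metrics_once(
    metrics: Mapping[str, float],
    step: int,
    log_fn: Callable[[Mapping[str, float], int], None],
    rank: int,
) -> bool:
    """Call log_fn on rank 0 only; return True if this worker logged."""
    if rank != 0:
        return False  # every other worker stays silent
    log_fn(metrics, step)
    return True


if __name__ == "__main__":
    # Simulate four workers reporting the same step with a list-backed logger.
    logged = []
    for rank in range(4):
        log_metrics_once(
            {"loss": 0.5}, 10, lambda m, s: logged.append((s, dict(m))), rank
        )
    print(len(logged))  # one entry, written by rank 0
```

This guard only handles deduplication. The values logged are whatever rank 0 holds locally, so a metric that should reflect all workers would need to be reduced across ranks before logging.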

More details coming soon.