Distributed Training Across Heterogeneous GPUs with FSDP2 and Ray

May 11, 2026 · 1 min read

ml systems distributed-training

Work in Progress — This post is still being written. I’ll be documenting the design, implementation, and lessons learned from building a distributed training pipeline.

Motivation

Most LLMs these days don’t fit on consumer GPUs. I have a measly 8 GB of VRAM on my old GTX 1070, which means training anything meaningful requires distributing the work across multiple GPUs.

The vast majority of training clusters use NVIDIA hardware, so I built this pipeline to test whether I can provision an efficient training cluster from heterogeneous GPUs and orchestrate it properly.

Experimental Goals

This experiment tests distributed ML training with the following stack:

PyTorch FSDP2 — fully sharded data parallelism
Ray Train — distributed training orchestration
Kubernetes + KubeRay — heterogeneous GPU cluster management
Vast.ai spot instances — cost-effective GPU provisioning

POC Experiments

I’m planning to verify this pipeline across several dimensions:

Cluster & Scheduling

KubeRay heterogeneous groups — pools, placement, tier isolation
Kueue fair scheduling — concurrent submissions, quota enforcement
RayJob lifecycle — submit / monitor / cancel / no orphaned pods

Backend Strategy

FSDP2 multi-node — NCCL rendezvous across K8s pods
FSDP2 + CPU offload — graceful degradation vs. OOM
DeepSpeed ZeRO-3 — checkpoint consolidation across shards
ZeRO-Infinity (if NVMe) — load model exceeding total GPU VRAM
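
To make the offload and ZeRO-Infinity cases above more concrete, here is a rough back-of-the-envelope sketch of per-GPU memory for model states under full sharding. It assumes mixed-precision training with Adam, commonly estimated at about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights plus the two Adam moments). It also assumes shards are split evenly, so the smallest GPU is the bottleneck, and it ignores activations, buffers, and fragmentation. The function names, the headroom factor, and the GPU sizes are illustrative and not part of the pipeline.

```python
from typing import Sequence

BYTES_PER_PARAM = 16  # assumption: mixed-precision Adam (2 + 2 + 12 bytes)
GIB = 1024 ** 3


def sharded_state_gib(num_params: int, world_size: int) -> float:
    """Approximate per-GPU model-state memory in GiB, assuming weights,
    gradients, and optimizer state are sharded evenly across all ranks."""
    return num_params * BYTES_PER_PARAM / world_size / GIB


def fits_on_all_gpus(
    num_params: int,
    gpu_vram_gib: Sequence[float],
    headroom: float = 0.8,  # hypothetical safety margin left for activations
) -> bool:
    """True if an even shard fits within `headroom` of the smallest GPU."""
    shard = sharded_state_gib(num_params, len(gpu_vram_gib))
    return shard <= headroom * min(gpu_vram_gib)
```

With a hypothetical mixed pool of 8, 24, 24, and 48 GiB cards, a 1B-parameter model needs roughly 3.7 GiB of model state per GPU and fits, while a 7B-parameter model needs roughly 26 GiB per GPU and does not, which is where CPU or NVMe offload would have to take over.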

Strategy Selection

Auto-selection — correct backend chosen per model size
Fallback — FSDP2 OOM → retry with ZeRO-3
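
As a minimal sketch of how selection and fallback could fit together: the code below picks a backend from an estimated per-GPU footprint (however it is obtained) and retries on out-of-memory. Only the FSDP2 to ZeRO-3 fallback comes from the plan above. The backend names, the headroom threshold, the `OutOfMemory` exception, and the `train_fn` callable are all hypothetical placeholders.

```python
from typing import Callable, List, Optional, Tuple


class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a GPU OOM error; with PyTorch one would
    likely catch torch.cuda.OutOfMemoryError instead."""


# From the plan above: an FSDP2 OOM is retried with ZeRO-3.
FALLBACK = {"fsdp2": "zero3"}


def select_backend(
    shard_gib: float,
    min_vram_gib: float,
    has_nvme: bool = False,
    headroom: float = 0.8,  # assumption: keep 20% of VRAM for activations
) -> str:
    """Pick a backend from the estimated per-GPU model-state footprint."""
    if shard_gib <= headroom * min_vram_gib:
        return "fsdp2"
    return "zero_infinity" if has_nvme else "fsdp2_cpu_offload"


def run_with_fallback(
    train_fn: Callable[[str], str], backend: str
) -> Tuple[str, List[str]]:
    """Run train_fn(backend); on OOM, follow the FALLBACK chain.
    Returns the result and the list of backends that were attempted."""
    tried: List[str] = []
    current: Optional[str] = backend
    while current is not None:
        tried.append(current)
        try:
            return train_fn(current), tried
        except OutOfMemory:
            current = FALLBACK.get(current)
    raise OutOfMemory(f"all backends exhausted: {tried}")
```

Returning the list of attempted backends makes it easy to assert in a POC test that the fallback actually fired, rather than inferring it from logs.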

Researcher Interface

End-to-end submission — plain Python in, job ID + MLflow out
Fault tolerance — kill worker mid-run, checkpoint resumes
MLflow logging — metrics correct, not duplicated across workers
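
One common way to avoid duplicated metrics is to log from a single worker only. The sketch below is an assumption about how this could be wired, not the pipeline's actual code: the helper name is hypothetical, the rank would presumably come from `ray.train.get_context().get_world_rank()` inside a Ray Train worker, and `log_metric` could be `mlflow.log_metric`. Both are passed in as parameters so the example runs without either library installed.

```python
from typing import Callable, Mapping


def make_rank_zero_logger(
    world_rank: int,
    log_metric: Callable[[str, float, int], None],
) -> Callable[[Mapping[str, float], int], bool]:
    """Return a log(metrics, step) function that only writes on rank 0.

    Other ranks skip the call, so each metric is recorded once per step
    instead of once per worker.
    """

    def log(metrics: Mapping[str, float], step: int) -> bool:
        if world_rank != 0:
            return False  # non-zero ranks never touch the tracking backend
        for key, value in metrics.items():
            log_metric(key, value, step)
        return True

    return log


# Usage with a stand-in recorder instead of a real tracking server.
recorded = []
for rank in range(4):
    log = make_rank_zero_logger(rank, lambda k, v, s: recorded.append((k, v, s)))
    log({"loss": 0.5}, step=10)
```

Note that rank 0 only sees its own local values, so a metric like loss would need to be averaged across workers before logging if it should reflect the whole batch.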

More details coming soon.