Session
Energy/grid-aware distributed training using PyTorch Distributed, FSDP, and async checkpointing
We train EcoAgent using PyTorch's elastic and distributed features. Jobs are launched via torchrun, whose rendezvous mechanism lets the worker group scale dynamically; process groups use the NCCL backend. Models are wrapped in FSDP (FULL_SHARD) with bfloat16 mixed precision, CPU offload, backward prefetch, and optional activation checkpointing on A100 GPUs. Data is sharded across ranks via DistributedSampler, and the training loop employs gradient accumulation, gradient clipping, and NaN/Inf checks. We integrate torch.profiler for CPU/CUDA profiling and optional PEFT (LoRA) for parameter-efficient fine-tuning.
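As a concrete illustration, here is a minimal sketch of that FSDP setup. The stand-in nn.TransformerEncoder model, its dimensions, and the torchrun flags in the comment are assumptions for the example, not the actual EcoAgent configuration:

```python
# Launched elastically, e.g.:
#   torchrun --nnodes=1:4 --nproc_per_node=8 --rdzv_backend=c10d \
#       --rdzv_endpoint=$HOST:29500 train.py
import functools

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in model; the real policy model and its layer class are not shown here.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=4
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # shard params, grads, optim state
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    cpu_offload=CPUOffload(offload_params=True),      # park sharded params on CPU
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap comm with backward pass
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={nn.TransformerEncoderLayer},
    ),
    device_id=torch.cuda.current_device(),
)

# Optional activation checkpointing: recompute block activations in backward.
apply_activation_checkpointing(
    model, check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer)
)
```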
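The inner loop, again as a hedged sketch: `dataset`, `compute_loss`, and the hyperparameters are placeholders, and `model` is assumed to be the FSDP-wrapped module from the snippet above.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset)              # shard data across ranks
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps, max_norm = 4, 1.0                     # illustrative hyperparameters

for epoch in range(3):
    sampler.set_epoch(epoch)                       # reshuffle shards each epoch
    optimizer.zero_grad(set_to_none=True)
    for step, batch in enumerate(loader):
        loss = compute_loss(model, batch)          # hypothetical loss helper
        if not torch.isfinite(loss):               # NaN/Inf check: drop bad batch
            optimizer.zero_grad(set_to_none=True)
            continue
        (loss / accum_steps).backward()            # gradient accumulation
        if (step + 1) % accum_steps == 0:
            model.clip_grad_norm_(max_norm)        # FSDP-aware gradient clipping
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```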
On Kubernetes, we use a Kubeflow PyTorchJob CRD with a headless Master and auto-scalable workers (min/max replicas tied to grid load via custom annotations). All pods mount a shared PVC for async checkpointing: rank 0 writes per-epoch model and optimizer checkpoints plus a metadata.json. At startup, the metadata is broadcast to all ranks, and training resumes via FSDP's FULL_STATE_DICT context and scatter_full_optim_state_dict, even if the cluster size has changed. This end-to-end design delivers resilient, energy-aware, memory-efficient distributed training.
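A sketch of the save/resume path follows. The PVC mount point, file names, and metadata schema are assumptions for illustration; the FULL_STATE_DICT context, full_optim_state_dict, and scatter_full_optim_state_dict are the PyTorch FSDP APIs named above.

```python
import json, os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

CKPT_DIR = "/mnt/checkpoints"  # assumed shared-PVC mount point

def save_checkpoint(model, optimizer, epoch):
    # Gather a full (unsharded) state dict, materialized on rank 0 only.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        model_sd = model.state_dict()
    optim_sd = FSDP.full_optim_state_dict(model, optimizer)
    if dist.get_rank() == 0:
        torch.save(model_sd, os.path.join(CKPT_DIR, f"model_{epoch}.pt"))
        torch.save(optim_sd, os.path.join(CKPT_DIR, f"optim_{epoch}.pt"))
        with open(os.path.join(CKPT_DIR, "metadata.json"), "w") as f:
            json.dump({"epoch": epoch}, f)

def resume(model, optimizer):
    # Rank 0 reads metadata, then broadcasts it so all ranks agree on the epoch.
    meta = [None]
    meta_path = os.path.join(CKPT_DIR, "metadata.json")
    if dist.get_rank() == 0 and os.path.exists(meta_path):
        with open(meta_path) as f:
            meta[0] = json.load(f)
    dist.broadcast_object_list(meta, src=0)
    if meta[0] is None:
        return 0  # no checkpoint yet: fresh start
    epoch = meta[0]["epoch"]
    # Every rank loads the full model state dict from the shared PVC; FSDP
    # re-shards it locally, which is what lets resumption survive a change
    # in world size.
    model_sd = torch.load(
        os.path.join(CKPT_DIR, f"model_{epoch}.pt"), map_location="cpu"
    )
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
        model.load_state_dict(model_sd)
    # Rank 0 holds the full optimizer state; scatter re-shards it per rank.
    full_osd = None
    if dist.get_rank() == 0:
        full_osd = torch.load(
            os.path.join(CKPT_DIR, f"optim_{epoch}.pt"), map_location="cpu"
        )
    sharded_osd = FSDP.scatter_full_optim_state_dict(full_osd, model)
    optimizer.load_state_dict(sharded_osd)
    return epoch + 1
```

Because scatter_full_optim_state_dict re-shards the rank-0 optimizer state for whatever world size is currently running, the job can pick up cleanly after the autoscaler adds or removes workers.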

Keval Shah
Pebble - AI Researcher
San Francisco, California, United States