
In the following example, we show how the environment variables Konduktor sets make it possible to run PyTorch distributed training on two nodes.
name: torch-distributed

resources:
    image_id: nvcr.io/nvidia/pytorch:23.10-py3
    accelerators: H100:8
    cpus: 60
    memory: 500
    labels:
      kueue.x-k8s.io/queue-name: user-queue
      maxRunDurationSeconds: "3200"

num_nodes: 2

run: |
    git clone https://github.com/roanakb/pytorch-distributed-resnet
    cd pytorch-distributed-resnet
    mkdir -p data saved_models
    cd data && wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz
    cd ..
    python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS_PER_NODE \
        --nnodes=$NUM_NODES --node_rank=$RANK --master_addr=$MASTER_ADDR \
        --master_port=8008 resnet_ddp.py --num_epochs 20
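Under the hood, the launcher derives the world size and each process's global rank from these flags: it spawns one process per GPU, so the world size is the node count times the GPUs per node. A minimal sketch of that arithmetic (the values below are assumed for illustration, standing in for what the scheduler would inject on the second node of this job):

```python
import os

# Hypothetical values for the second node of a 2-node, 8-GPU-per-node job.
env = {"NUM_NODES": "2", "NUM_GPUS_PER_NODE": "8", "RANK": "1"}

num_nodes = int(env["NUM_NODES"])
gpus_per_node = int(env["NUM_GPUS_PER_NODE"])
node_rank = int(env["RANK"])

# One process per GPU; each process gets a distinct global rank.
world_size = num_nodes * gpus_per_node
global_ranks = [node_rank * gpus_per_node + local_rank
                for local_rank in range(gpus_per_node)]

print(world_size)    # 16
print(global_ranks)  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Each spawned process uses its global rank together with `MASTER_ADDR` and `--master_port` to rendezvous with the rank 0 worker.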

Environment Variables

Konduktor passes the following environment variables to each node so that distributed frameworks such as PyTorch can be launched without extra configuration.
MASTER_ADDR
    The FQDN of the rank 0 worker, e.g. test-1234-workers-0-0.test-1234
NODE_HOST_IPS
    A comma-separated list of the FQDNs of all workers, e.g. test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234
RANK
    The global rank of the node within a job.
NUM_NODES
    The total number of nodes/workers.
NUM_GPUS_PER_NODE
    The number of GPUs per node.
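Inside a training script, these variables can be read with `os.environ`. A minimal sketch, assuming the single-node fallback values are only illustrative defaults (they are not values Konduktor sets):

```python
import os

# Read the variables injected by the scheduler, with illustrative
# single-node fallbacks so the snippet also runs outside a cluster.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
node_rank = int(os.environ.get("RANK", "0"))
num_nodes = int(os.environ.get("NUM_NODES", "1"))

# NODE_HOST_IPS lists every worker's FQDN, comma separated;
# the rank 0 entry is the master.
hosts = os.environ.get("NODE_HOST_IPS", master_addr).split(",")

assert len(hosts) == num_nodes
print(hosts[node_rank])  # this worker's own FQDN
```

Because every worker sees the same `NODE_HOST_IPS`, each one can locate its peers and identify its own entry by `RANK`.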