
In the following example, we show how the environment variables Konduktor sets make it possible to run PyTorch distributed training on two nodes.
name: torch-distributed

resources:
    image_id: nvcr.io/nvidia/pytorch:23.10-py3
    accelerators: H100:8
    cpus: 60
    memory: 500
    labels:
      kueue.x-k8s.io/queue-name: user-queue
      maxRunDurationSeconds: "3200"

num_nodes: 2

run: |
    git clone https://github.com/roanakb/pytorch-distributed-resnet
    cd pytorch-distributed-resnet
    mkdir -p data saved_models
    cd data && wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz
    cd ..
    python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS_PER_NODE \
        --nnodes=$NUM_NODES --node_rank=$RANK --master_addr=$MASTER_ADDR \
        --master_port=8008 resnet_ddp.py --num_epochs 20
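Under the hood, the launcher derives the world size and each process's global rank from these flags: it spawns one process per GPU, so the world size is the node count times the GPUs per node. A minimal sketch of that arithmetic (the values below are assumed for illustration, standing in for what the scheduler would inject on the second node of this job):

```python
import os

# Hypothetical values for the second node of a 2-node, 8-GPU-per-node job.
env = {"NUM_NODES": "2", "NUM_GPUS_PER_NODE": "8", "RANK": "1"}

num_nodes = int(env["NUM_NODES"])
gpus_per_node = int(env["NUM_GPUS_PER_NODE"])
node_rank = int(env["RANK"])

# One process per GPU; each process gets a distinct global rank.
world_size = num_nodes * gpus_per_node
global_ranks = [node_rank * gpus_per_node + local_rank
                for local_rank in range(gpus_per_node)]

print(world_size)    # 16
print(global_ranks)  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Each spawned process uses its global rank together with `MASTER_ADDR` and `--master_port` to rendezvous with the rank 0 worker.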

Environment Variables

Konduktor passes the following environment variables to each node so that distributed frameworks such as PyTorch can be launched without extra configuration.
MASTER_ADDR
    The FQDN of the rank 0 worker, e.g. test-1234-workers-0-0.test-1234
NODE_HOST_IPS
    A comma-separated list of the FQDNs of all workers, e.g. test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234
RANK
    The global rank of the node within a job.
NUM_NODES
    The total number of nodes/workers.
NUM_GPUS_PER_NODE
    The number of GPUs per node.
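Inside a training script, these variables can be read with `os.environ`. A minimal sketch, assuming the single-node fallback values are only illustrative defaults (they are not values Konduktor sets):

```python
import os

# Read the variables injected by the scheduler, with illustrative
# single-node fallbacks so the snippet also runs outside a cluster.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
node_rank = int(os.environ.get("RANK", "0"))
num_nodes = int(os.environ.get("NUM_NODES", "1"))

# NODE_HOST_IPS lists every worker's FQDN, comma separated;
# the rank 0 entry is the master.
hosts = os.environ.get("NODE_HOST_IPS", master_addr).split(",")

assert len(hosts) == num_nodes
print(hosts[node_rank])  # this worker's own FQDN
```

Because every worker sees the same `NODE_HOST_IPS`, each one can locate its peers and identify its own entry by `RANK`.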