# Distributed Multi-Node Jobs

> Jobs deployed via Konduktor can be scaled up to run on multiple nodes.

The following example uses environment variables set by Konduktor to run PyTorch distributed training across two nodes.

```yaml
name: torch-distributed

resources:
    image_id: nvcr.io/nvidia/pytorch:23.10-py3
    accelerators: H100:8
    cpus: 60
    memory: 500
    labels:
      kueue.x-k8s.io/queue-name: user-queue
      maxRunDurationSeconds: "3200"

num_nodes: 2

run: |
    git clone https://github.com/roanakb/pytorch-distributed-resnet
    cd pytorch-distributed-resnet
    # Download CIFAR-10 into ./data, then return to the repo root where resnet_ddp.py lives
    mkdir -p data saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz
    cd ..
    python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS_PER_NODE \
    --nnodes=$NUM_NODES --node_rank=$RANK --master_addr=$MASTER_ADDR \
    --master_port=8008 resnet_ddp.py --num_epochs 20
```
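
To make the mapping between the environment variables and the launcher flags explicit, here is a minimal Python sketch that assembles the same command from an environment mapping. The helper name `build_launch_command` and its parameters are hypothetical and not part of Konduktor; the flags are the ones used in the `run` section above.

```python
import shlex
from typing import List, Mapping


def build_launch_command(
    env: Mapping[str, str],
    script: str = "resnet_ddp.py",
    master_port: int = 8008,
) -> List[str]:
    """Assemble the launcher invocation from the variables used in the example.

    `env` is any mapping that contains NUM_GPUS_PER_NODE, NUM_NODES, RANK,
    and MASTER_ADDR (for instance os.environ inside a running job).
    """
    return [
        "python3", "-m", "torch.distributed.launch",
        f"--nproc_per_node={env['NUM_GPUS_PER_NODE']}",  # one process per GPU
        f"--nnodes={env['NUM_NODES']}",
        f"--node_rank={env['RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={master_port}",
        script, "--num_epochs", "20",
    ]


# Illustrative values only; inside a job you would pass os.environ instead.
example_env = {
    "NUM_GPUS_PER_NODE": "8",
    "NUM_NODES": "2",
    "RANK": "1",
    "MASTER_ADDR": "test-1234-workers-0-0.test-1234",
}
print(shlex.join(build_launch_command(example_env)))
```

Every node runs the same command; only the value of `RANK` differs between nodes, which is what lets a single task definition describe the whole job.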

### Environment Variables

Konduktor passes the following environment variables to each node of a job so that distributed launchers are easy to configure, [as in PyTorch](https://pytorch.org/docs/stable/elastic/run.html#environment-variables).

| Environment variable | Description                                                                                                 |
| -------------------- | ----------------------------------------------------------------------------------------------------------- |
| `MASTER_ADDR`        | The FQDN of the rank 0 worker, e.g. `test-1234-workers-0-0.test-1234`                                       |
| `NODE_HOST_IPS`      | A comma-separated list of FQDNs, e.g. `test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234`     |
| `RANK`               | The global rank of this node (worker) within the job, passed as `--node_rank` in the example above.         |
| `NUM_NODES`          | The total number of nodes (workers) in the job.                                                             |
| `NUM_GPUS_PER_NODE`  | The number of GPUs on each node.                                                                            |
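
As a rough illustration of how a script might consume these variables directly, the sketch below parses them into a small structure and derives the total process count. The `ClusterEnv` class and `read_cluster_env` function are hypothetical names, not a Konduktor API. The sketch assumes that `RANK` is a per-node index (as it is used with `--node_rank` in the example above), that one process runs per GPU, and that process ranks are assigned contiguously per node.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ClusterEnv:
    master_addr: str
    hosts: List[str]
    node_rank: int
    num_nodes: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Assumption: one training process per GPU on every node.
        return self.num_nodes * self.gpus_per_node

    def global_rank(self, local_rank: int) -> int:
        # Assumption: process ranks are laid out contiguously, node by node.
        return self.node_rank * self.gpus_per_node + local_rank


def read_cluster_env(env: Mapping[str, str] = os.environ) -> ClusterEnv:
    """Parse the variables from the table above out of `env`."""
    hosts = [h.strip() for h in env["NODE_HOST_IPS"].split(",") if h.strip()]
    return ClusterEnv(
        master_addr=env["MASTER_ADDR"],
        hosts=hosts,
        node_rank=int(env["RANK"]),
        num_nodes=int(env["NUM_NODES"]),
        gpus_per_node=int(env["NUM_GPUS_PER_NODE"]),
    )


# Illustrative values matching the two-node, eight-GPU example on this page.
cfg = read_cluster_env({
    "MASTER_ADDR": "test-1234-workers-0-0.test-1234",
    "NODE_HOST_IPS": "test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234",
    "RANK": "1",
    "NUM_NODES": "2",
    "NUM_GPUS_PER_NODE": "8",
})
print(cfg.world_size, cfg.global_rank(local_rank=0))
```

Passing the environment in as a mapping keeps the parsing logic testable outside a cluster; inside a job, calling `read_cluster_env()` with no arguments reads from `os.environ`.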
