In the following example, we show how we set environment variables that make it possible to run PyTorch distributed on two nodes.
Environment Variables
Konduktor passes the following environment variables to your job so that distributed frameworks such as PyTorch are easy to launch.

| Environment variable | Description |
|---|---|
| `MASTER_ADDR` | The FQDN of the rank 0 worker, e.g. `test-1234-workers-0-0.test-1234` |
| `NODE_HOST_IPS` | A comma-separated list of worker FQDNs, e.g. `test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234` |
| `RANK` | The global rank within a job |
| `NUM_NODES` | The total number of nodes/workers |
| `NUM_GPUS_PER_NODE` | The number of GPUs per node |
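
As a rough illustration of how these variables might be consumed, the sketch below turns them into a `torchrun` command line. It is not taken from the Konduktor source. The helper name `build_torchrun_command`, the script name `train.py`, and the port `29500` are hypothetical, and the sketch assumes that `RANK` holds one value per node so that it can be passed as `--node_rank`.

```python
import shlex


def build_torchrun_command(env, script="train.py", master_port=29500):
    """Build a torchrun argument list from the variables in the table above.

    `script` and `master_port` are illustrative defaults, not values that
    Konduktor provides.
    """
    # NODE_HOST_IPS is a comma-separated list of worker FQDNs.
    hosts = [host for host in env["NODE_HOST_IPS"].split(",") if host]
    num_nodes = int(env["NUM_NODES"])
    if len(hosts) != num_nodes:
        raise ValueError(
            f"NUM_NODES={num_nodes} but NODE_HOST_IPS lists {len(hosts)} hosts"
        )

    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={int(env['NUM_GPUS_PER_NODE'])}",
        # Assumption: RANK identifies this node among the workers.
        f"--node_rank={int(env['RANK'])}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Example values mirroring the table; GPU count is made up for the demo.
    example_env = {
        "MASTER_ADDR": "test-1234-workers-0-0.test-1234",
        "NODE_HOST_IPS": (
            "test-1234-workers-0-0.test-1234,"
            "test-1234-workers-0-1.test-1234"
        ),
        "RANK": "0",
        "NUM_NODES": "2",
        "NUM_GPUS_PER_NODE": "8",
    }
    print(shlex.join(build_torchrun_command(example_env)))
```

Inside a real job you would pass `os.environ` instead of the example dictionary. Checking `NODE_HOST_IPS` against `NUM_NODES` is an optional sanity check that fails early if the two disagree.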