When running distributed training with torchrun across multiple nodes or processes, all workers can log metrics to the same experiment using a shared run_id. This enables unified tracking of distributed training runs.
How It Works
Pluto supports attaching multiple processes to the same experiment run. When the first process calls pluto.init(), a new run is created. Subsequent processes with the same run_id will attach to that existing run instead of creating new ones.
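
The create-or-attach behavior described above can be pictured with a small in-memory model. This is a toy sketch only: ToyServer and ToyRun are hypothetical names invented for illustration, and the real bookkeeping happens on the Pluto server, not in your process.

```python
from dataclasses import dataclass


@dataclass
class ToyRun:
    id: int        # stand-in for the server-assigned numeric run ID
    run_id: str    # the user-provided external run ID
    resumed: bool  # True if this handle attached to an existing run


class ToyServer:
    """Hypothetical in-memory stand-in for the tracking server."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def init(self, run_id: str) -> ToyRun:
        # A known run_id attaches to the existing run.
        if run_id in self._ids:
            return ToyRun(self._ids[run_id], run_id, resumed=True)
        # An unknown run_id creates a new run with a fresh numeric ID.
        self._ids[run_id] = len(self._ids) + 1
        return ToyRun(self._ids[run_id], run_id, resumed=False)


server = ToyServer()
first = server.init("job-42")   # first caller creates the run
second = server.init("job-42")  # later caller attaches to it
print(first.resumed, second.resumed, first.id == second.id)
```

Note that the creator is whichever process reaches init first, which is not necessarily rank 0.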
Setting Up Distributed Logging
1. Set a Shared Run ID
Before launching your distributed job, set the PLUTO_RUN_ID environment variable to a unique identifier shared by every process in the job.
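
One way to do this from a Python launcher script is sketched below. The helper names (make_run_id, launcher_env, launch) and the ID format are assumptions made for this example; only the PLUTO_RUN_ID variable name comes from this page.

```python
import os
import subprocess
import uuid
from datetime import datetime, timezone


def make_run_id(prefix: str) -> str:
    """Build an ID that is unique per launch: prefix, UTC timestamp, short UUID."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{prefix}-{stamp}-{uuid.uuid4().hex[:8]}"


def launcher_env(run_id: str) -> dict[str, str]:
    """Copy the current environment and add the shared run ID."""
    env = dict(os.environ)
    env["PLUTO_RUN_ID"] = run_id
    return env


def launch(command: list[str], run_id: str) -> int:
    """Run a launcher command with PLUTO_RUN_ID set so child processes inherit it."""
    return subprocess.run(command, env=launcher_env(run_id), check=True).returncode
```

A call such as `launch(["torchrun", "--nproc_per_node=8", "train.py"], make_run_id("llm-finetune"))` would start the workers with the variable already set; `train.py` and the flag value are placeholders. On multi-node jobs the ID should be generated once and passed to every node, since each node's launcher has its own environment.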
When using konduktor launch, PLUTO_RUN_ID is set to the job name by default, so you don't need to set it explicitly yourself.
2. Initialize Pluto in Your Training Script
In your training script, initialize Pluto after setting up the distributed process group.
3. Log Metrics with Rank Prefixes
To distinguish metrics from different processes, prefix them with the rank, for example logging rank 0's loss under a name such as rank0/loss.
Complete Example
Here's a full example for a Konduktor task that runs distributed training with Pluto logging:
Run Properties
The run object provides useful properties for distributed scenarios:

| Property | Type | Description |
|---|---|---|
| run.resumed | bool | True if this process attached to an existing run |
| run.run_id | str | The user-provided external run ID |
| run.id | int | The server-assigned numeric run ID |
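
These properties can be used for a quick sanity check at startup. The sketch below assumes only the three attributes listed in the table; describe_attachment is a hypothetical helper, and a SimpleNamespace stands in for a real run object so the snippet runs without Pluto installed.

```python
from types import SimpleNamespace


def describe_attachment(run, rank: int, expected_run_id: str) -> str:
    """Check that this worker is on the expected run and summarize how it joined."""
    if run.run_id != expected_run_id:
        raise RuntimeError(
            f"rank {rank} is on run {run.run_id!r}, expected {expected_run_id!r}"
        )
    # resumed is False only for the process that created the run,
    # which is whichever process initialized first (not necessarily rank 0).
    action = "attached to existing" if run.resumed else "created"
    return f"[rank {rank}] {action} run {run.run_id!r} (server id {run.id})"


# Stand-in object with the same attribute names as the table above.
demo_run = SimpleNamespace(id=7, run_id="job-42", resumed=True)
print(describe_attachment(demo_run, rank=1, expected_run_id="job-42"))
```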
Environment Variables
Pluto recognizes the following environment variables for distributed logging:

| Variable | Description |
|---|---|
| PLUTO_RUN_ID | Primary environment variable for shared run identification |
| MLOP_RUN_ID | Fallback environment variable (for compatibility) |
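
The precedence in the table can be mirrored in your own code, for instance to log or validate the run ID before initializing. This is a sketch of the documented order, not Pluto's internal logic; resolve_run_id is a hypothetical helper, and treating an empty value as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_run_id(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return PLUTO_RUN_ID if set, else MLOP_RUN_ID, else None."""
    env = os.environ if env is None else env
    for name in ("PLUTO_RUN_ID", "MLOP_RUN_ID"):
        value = env.get(name)
        if value:  # assumption: empty strings count as "not set"
            return value
    return None


print(resolve_run_id({"PLUTO_RUN_ID": "job-42", "MLOP_RUN_ID": "legacy"}))
```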
Best Practices
- Set run_id before launching: Ensure PLUTO_RUN_ID is set before calling torchrun so all processes inherit the same value.
- Use rank prefixes: Prefix metrics with the rank to distinguish data from different processes in the dashboard.
- Log config from rank 0 only: Pass config only from rank 0 to avoid duplicate metadata.
- Unique run IDs: Include a timestamp or UUID in your run ID to ensure each training run is distinct.
- Handle the name parameter: The name parameter is only used when creating a new run. Processes that resume an existing run will ignore this parameter (a warning is logged to indicate this).
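
The rank-prefix and rank-0-config practices above can be captured in a few small helpers. This is a minimal sketch: the helper names and the `rank{N}/` prefix format are assumptions for illustration, and the snippet deliberately avoids calling Pluto so it stays runnable on its own. RANK is the environment variable that torchrun sets for each worker.

```python
import os
from typing import Mapping, Optional


def get_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the global rank set by torchrun, defaulting to 0 outside torchrun."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def prefix_metrics(metrics: dict[str, float], rank: int) -> dict[str, float]:
    """Rename each metric to 'rank<N>/<name>' so workers do not collide."""
    return {f"rank{rank}/{name}": value for name, value in metrics.items()}


def init_kwargs(rank: int, name: str, config: dict) -> dict:
    """Build keyword arguments for initialization; only rank 0 carries config."""
    kwargs: dict = {"name": name}
    if rank == 0:
        kwargs["config"] = config
    return kwargs


rank = get_rank({"RANK": "3"})
print(prefix_metrics({"loss": 0.5}, rank))
print(init_kwargs(rank, "demo", {"lr": 1e-4}))
```

In a training script, the dictionary from init_kwargs would be unpacked into pluto.init() along with whatever other arguments your setup requires, and the output of prefix_metrics would be passed to the logging call your script already uses.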