When you launch a training job with `torchrun` across multiple nodes or processes, all workers can log metrics to the same experiment using a shared `run_id`. This enables unified tracking of distributed training runs.
How It Works
Pluto supports attaching multiple processes to the same experiment run. When the first process calls `pluto.init()`, a new run is created. Subsequent processes with the same `run_id` will attach to that existing run instead of creating new ones.
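As a minimal sketch of that behavior (assuming `pluto.init()` accepts `run_id` and `name` keywords and returns a run object with the `resumed` flag described under Run Properties below; the ID value is illustrative):

```python
import pluto

# The first process to call init() with this ID creates the run.
run = pluto.init(run_id="exp-001", name="distributed-demo")
print(run.resumed)  # False: this process created the run

# Any later process passing the same run_id attaches to the existing run
# instead of creating a new one; in those processes run.resumed is True.
```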
Setting Up Distributed Logging
1. Set a Shared Run ID
Before launching your distributed job, set the `PLUTO_RUN_ID` environment variable to a unique identifier:
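For example, from a small Python launcher (the script name, process count, and run-ID scheme are placeholders, not part of Pluto's API):

```python
import os
import subprocess
import time

# Build a unique, shared run ID; a timestamp keeps separate launches distinct.
env = os.environ.copy()
env["PLUTO_RUN_ID"] = f"dist-train-{int(time.time())}"

# torchrun inherits the environment, so every worker sees the same run ID.
subprocess.run(
    ["torchrun", "--nproc_per_node=4", "train.py"],
    env=env,
    check=True,
)
```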
2. Initialize Pluto in Your Training Script
In your training script, initialize Pluto after setting up the distributed process group:
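A sketch of that initialization (it assumes Pluto picks up `PLUTO_RUN_ID` from the environment, per the Environment Variables section below; passing `config=None` on non-zero ranks is an assumption about the API):

```python
import torch.distributed as dist
import pluto

# torchrun starts one process per GPU; set up the process group first.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()

# Every rank calls init(). With PLUTO_RUN_ID set, one process creates the run
# and the others attach to it. Config is passed from rank 0 only to avoid
# duplicate metadata (see Best Practices).
run = pluto.init(
    name="resnet50-ddp",  # only used by the process that creates the run
    config={"lr": 0.1, "epochs": 90} if rank == 0 else None,
)
```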
3. Log Metrics with Rank Prefixes
To distinguish metrics from different processes, prefix them with the rank:
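Continuing the snippet above, and assuming a wandb-style `run.log(dict)` call (the exact logging method is an assumption; check Pluto's API reference):

```python
# `run` and `rank` come from the initialization snippet in step 2.
for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value standing in for your training loss
    run.log({f"rank{rank}/loss": loss, f"rank{rank}/step": step})
```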
Complete Example
Here's a full example for a Konduktor task that runs distributed training with Pluto logging:
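A sketch of the training script such a task would launch via `torchrun`; the Konduktor task definition itself is omitted, and the model, data, and `run.log` call are illustrative stand-ins:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import pluto


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, etc. for every worker.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # All ranks attach to the run identified by PLUTO_RUN_ID.
    run = pluto.init(
        name="konduktor-ddp-job",  # used only by the creating process
        config={"lr": 1e-3, "steps": 1000} if rank == 0 else None,  # rank 0 only (assumed API)
    )

    # Toy model and synthetic data, standing in for your real training setup.
    model = DDP(nn.Linear(128, 10).to(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(1000):
        x = torch.randn(32, 128, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # run.log is assumed to be wandb-style; see the note in step 3.
        run.log({f"rank{rank}/loss": loss.item()})

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```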
Run Properties
The run object provides useful properties for distributed scenarios:

| Property | Type | Description |
|---|---|---|
| `run.resumed` | `bool` | `True` if this process attached to an existing run |
| `run.run_id` | `str` | The user-provided external run ID |
| `run.id` | `int` | The server-assigned numeric run ID |
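These make it easy to gate one-time work, for instance (a sketch; assumes `pluto.init()` as above):

```python
import pluto

run = pluto.init(name="distributed-demo")

if run.resumed:
    print(f"attached to existing run {run.run_id}")
else:
    # Only the process that created the run lands here; safe for one-time setup.
    print(f"created run {run.run_id} (server id: {run.id})")
```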
Alternative: Pass run_id Directly
Instead of using the environment variable, you can pass `run_id` directly to `pluto.init()`:
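For example (`MY_JOB_ID` is a placeholder for an identifier set by your own scheduler, not a Pluto-recognized variable):

```python
import os
import pluto

# Derive the shared ID from your own environment instead of PLUTO_RUN_ID.
run = pluto.init(run_id=os.environ["MY_JOB_ID"], name="distributed-demo")
```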
Environment Variables
Pluto recognizes the following environment variables for distributed logging:

| Variable | Description |
|---|---|
| `PLUTO_RUN_ID` | Primary environment variable for shared run identification |
| `MLOP_RUN_ID` | Fallback environment variable (for compatibility) |
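The precedence in the table corresponds to a lookup like this (a sketch of equivalent logic, not Pluto's internal implementation):

```python
import os

# PLUTO_RUN_ID wins when both are set; MLOP_RUN_ID is a compatibility fallback.
run_id = os.environ.get("PLUTO_RUN_ID") or os.environ.get("MLOP_RUN_ID")
```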
Best Practices
- **Set `run_id` before launching:** Ensure `PLUTO_RUN_ID` is set before calling `torchrun` so all processes inherit the same value.
- **Use rank prefixes:** Prefix metrics with the rank to distinguish data from different processes in the dashboard.
- **Log config from rank 0 only:** Pass `config` only from rank 0 to avoid duplicate metadata.
- **Unique run IDs:** Include a timestamp or UUID in your run ID to ensure each training run is distinct (see the launcher sketch after this list).
- **Handle the `name` parameter:** The `name` parameter is only used when creating a new run. Processes that resume an existing run will ignore this parameter (a warning is logged to indicate this).
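A launcher sketch combining those points, using a UUID for uniqueness (script name and `torchrun` arguments are placeholders):

```python
import os
import subprocess
import uuid

# A UUID suffix keeps runs distinct even if two jobs launch in the same second.
env = os.environ.copy()
env["PLUTO_RUN_ID"] = f"resnet50-{uuid.uuid4().hex[:8]}"

subprocess.run(
    ["torchrun", "--nproc_per_node=8", "train.py"],
    env=env,
    check=True,
)
```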