import os, subprocess, sys
import torch, torch.nn as nn, torch.optim as optim
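# Job identity comes from the environment: the job name and completion index
# scope the remote checkpoint prefix, and the restart-attempt counter tells us
# whether this is a fresh run or a retry that should resume.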
JOB_NAME = os.environ.get("KONDUKTOR_JOB_NAME", "job")
ATTEMPT = int(os.environ.get("RESTART_ATTEMPT", "0"))
IDX = os.environ.get("JOB_COMPLETION_INDEX", "0")
BUCKET = "my-konduktor-bucket"
PREFIX = f"checkpoints/{JOB_NAME}"
REMOTE = f"s3://{BUCKET}/{PREFIX}/idx_{IDX}"
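# Local scratch paths: TMP is where a checkpoint is written before upload,
# RESUME is where the downloaded latest.pt lands.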
TMP = "/tmp/ckpt.pt"
RESUME = "/tmp/resume.pt"
def sh(cmd: str):
    subprocess.check_call(cmd, shell=True)
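# Assumes the aws CLI is available in the container and credentials are supplied
# (e.g. via the pod environment or an IAM role). check_call raises
# CalledProcessError on any non-zero exit, which try_resume() relies on.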
# Upload checkpoints to S3
def upload(step: int):
    name = f"step_{step:06d}.pt"
    sh(f"aws s3 cp {TMP} {REMOTE}/{name} --only-show-errors")     # versioned
    sh(f"aws s3 cp {TMP} {REMOTE}/latest.pt --only-show-errors")  # stable pointer
    print(f"Saved {name} and latest.pt")
# Resume from latest checkpoint if available
def try_resume():
    try:
        sh(f"aws s3 cp {REMOTE}/latest.pt {RESUME} --only-show-errors")
        ckpt = torch.load(RESUME, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start = int(ckpt.get("step", 0)) + 1
        print(f"Resumed from latest.pt @ step {start}")
        return start
    except subprocess.CalledProcessError:
        # nothing remote yet; start fresh
        return 0
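# model and opt are module-level globals defined just below; Python resolves
# them at call time, and try_resume() is only called after they exist.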
# Tiny model: the training itself is trivial; the point is the checkpoint/restart plumbing.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 1))
opt = optim.SGD(model.parameters(), lr=0.1)
loss = nn.MSELoss()
print(f"ATTEMPT={ATTEMPT} REMOTE={REMOTE}")
start = try_resume() if ATTEMPT > 0 else 0
for step in range(start, 501):
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    opt.zero_grad()
    out = model(x)
    step_loss = loss(out, y)
    step_loss.backward()
    opt.step()
    if step % 20 == 0:
        print(f"[{step}] loss={step_loss.item():.4f}")
    # save every 100 steps
    if step > 0 and step % 100 == 0:
        torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, TMP)
        upload(step)
    # Force a crash on the first attempt once we have a good checkpoint
    if ATTEMPT == 0 and step == 200:
        print("Intentionally crashing after saving step_000200.pt (attempt 0).")
        sys.exit(1)  # non-zero exit -> pod fails -> Job restarts (max_restarts must be >= 1)
print("Training complete")