> **Experimental.** This feature is new and may change in backwards-incompatible ways before it stabilizes. The `KonduktorResult` condition type, the `$KONDUKTOR_OUTPUT_DIR/result.json` contract, and the `Job.result()` / `konduktor result` surfaces are the most likely to shift based on early feedback. Please share feedback in the Trainy Discord or file an issue.
Konduktor jobs can report a JSON-serializable value back to the launching
process. This is useful for hyperparameter sweeps, Optuna trials, and any
workflow where the launcher needs to read an objective / metric / artifact
pointer from the run.
## The contract

Every Konduktor-launched pod has:

- An `emptyDir` volume mounted at `/konduktor/output`.
- An environment variable `KONDUKTOR_OUTPUT_DIR` pointing at that directory.

To report a value, write JSON to `$KONDUKTOR_OUTPUT_DIR/result.json`:
```shell
# shell workload — no library install required
echo '{"val_loss": 0.42}' > "$KONDUKTOR_OUTPUT_DIR/result.json"
```

```python
# python workload — stdlib only
import json, os

with open(f"{os.environ['KONDUKTOR_OUTPUT_DIR']}/result.json", "w") as f:
    json.dump({"val_loss": 0.42}, f)
```

```python
# python workload — with konduktor installed (optional sugar)
import konduktor

konduktor.report({"val_loss": 0.42})
```
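The same contract is easy to wrap for scripts that also run outside Konduktor (local debugging, CI). Below is a minimal stdlib sketch; `report_result` is a hypothetical helper name, not part of the konduktor SDK:

```python
import json
import os
import sys


def report_result(value):
    """Write `value` as JSON to $KONDUKTOR_OUTPUT_DIR/result.json.

    Hypothetical helper, not part of the konduktor SDK. Falls back to
    stdout when the variable is unset (e.g. a local run), so the same
    training script works in and out of Konduktor.
    """
    payload = json.dumps(value)
    out_dir = os.environ.get("KONDUKTOR_OUTPUT_DIR")
    if out_dir is None:
        # Not running under Konduktor: just print the payload.
        print(payload, file=sys.stdout)
        return
    with open(os.path.join(out_dir, "result.json"), "w") as f:
        f.write(payload)


if __name__ == "__main__":
    report_result({"val_loss": 0.42})
```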
## Retrieving the value (Python API)

`konduktor.launch()` returns a `Job` handle (a `str` subclass, back-compatible with existing `job_name = konduktor.launch(task)` code). New methods:
```python
import konduktor

task = konduktor.Task(
    name="sweep-trial-0",
    run='echo \'{"val_loss": 0.42}\' > "$KONDUKTOR_OUTPUT_DIR/result.json"',
)
task.set_resources(konduktor.Resources(cpus=1, memory=2, image_id="ubuntu"))

job = konduktor.launch(task)
value = job.result(timeout=600)  # blocks until the job terminates
print(value)  # {"val_loss": 0.42}
```
`job.wait(timeout=...)` returns the terminal state (`"succeeded"` or `"failed"`) without parsing a value.
## Retrieving the value (CLI)

`konduktor launch --wait` blocks until completion and prints the result JSON:

```console
$ konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
{"val_loss": 0.123}
```

`konduktor result <job-name>` fetches the result of a terminal job:

```console
$ konduktor result sweep-trial-0-abcd
{"val_loss": 0.42}
```
Both commands exit non-zero on job failure, malformed result, or timeout.
## Failure modes

| Error | Raised when |
|---|---|
| `JobFailedError` | Job exited non-zero or never reported a result |
| `ResultTooLargeError` | Result exceeds 4 KB (kubelet’s `terminationMessagePath` cap) |
| `MalformedResultError` | `result.json` wasn’t valid JSON |
| `NoResultReportedError` | Job succeeded but didn’t write `result.json` |
| `ResultTimeoutError` | `timeout` elapsed before the job was terminal |
If your result is larger than 4 KB, return a summary (e.g. a checkpoint
path), not the full artifact.
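One way to honor that cap is to measure the serialized payload before reporting and swap in a pointer when it is too big. A hedged stdlib sketch (`summarize_if_large` and the checkpoint path are illustrative, not part of the SDK):

```python
import json

# kubelet terminationMessagePath cap cited above.
RESULT_CAP_BYTES = 4096


def summarize_if_large(value, checkpoint_path):
    """Return `value` if its JSON fits under the cap, else a pointer summary.

    Hypothetical helper: `checkpoint_path` is wherever the full artifact
    was persisted (object storage, shared volume, ...).
    """
    payload = json.dumps(value)
    if len(payload.encode("utf-8")) <= RESULT_CAP_BYTES:
        return value
    return {"checkpoint": checkpoint_path, "truncated": True}


small = summarize_if_large({"val_loss": 0.42}, "s3://bucket/ckpt.pt")
# A 4096-element loss history serializes well past 4 KB, so it collapses
# to the checkpoint pointer.
big = summarize_if_large({"history": [0.0] * 4096}, "s3://bucket/ckpt.pt")
```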
## Multi-node jobs

In a multi-node JobSet, only worker 0 of the first replicated job reports. Other workers’ termination messages are ignored. This matches Konduktor’s `$master_addr` convention.
## Multikueue

This feature works unchanged under multikueue. The pod runs in a worker cluster, the `trainy-controller` reconciler patches a `KonduktorResult` condition onto the worker-side JobSet, and Kueue’s JobSet multikueue adapter syncs that condition to the manager-side JobSet. Your client only talks to the manager cluster.
## Try it on a local Kind cluster

You can exercise the full end-to-end flow against a local Kind cluster without touching any cloud resources. It takes about five minutes from scratch.
### Build the controller image

```shell
cd trainy-controller
make docker-build IMG=trainy-controller:local
```
### Stand up Kind with Kueue + JobSet

```shell
cd ..
bash tests/kind_install.sh
```
This installs the versions Konduktor is tested against (Kueue v0.15.2, JobSet v0.10.1) plus VictoriaLogs and OTEL for log streaming.

### Load the image and create the Kueue queue

```shell
kind load docker-image trainy-controller:local
kubectl apply -f tests/smoke_tests/single-clusterqueue-setup.yaml
```
### Deploy the controller

```shell
cd trainy-controller
make deploy IMG=trainy-controller:local
kubectl -n trainy-controller-system wait --for=condition=available \
  --timeout=120s deployment/trainy-controller-controller-manager
```
If you don’t have cert-manager installed locally, disable the JobSet defaulting webhook:

```shell
kubectl -n trainy-controller-system set env \
  deployment/trainy-controller-controller-manager ENABLE_WEBHOOKS=false
```
### Install the SDK and launch a test job

```shell
cd ..
poetry install --with dev
poetry run konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
```
Expected stdout:

```
{"val_loss": 0.123}
```

### Inspect the KonduktorResult condition on the JobSet

```shell
kubectl get jobset -o jsonpath='{.items[0].status.conditions}' | jq .
```
You should see a `KonduktorResult` condition whose message is the JSON value your workload wrote — this is what `job.result()` is reading under the hood. Under multikueue, Kueue’s JobSet adapter copies this entire status struct (including our condition) to the manager cluster for free.
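To see what that condition carries, here is a hedged stdlib sketch of pulling the payload out of a conditions list shaped like the `jq` output above. `extract_result` is an illustrative name, not the actual client code:

```python
import json


def extract_result(conditions):
    """Return the parsed JSON payload from a KonduktorResult condition.

    `conditions` is the list under .status.conditions on the JobSet.
    Illustrative sketch only, not the real konduktor client logic.
    """
    for cond in conditions:
        if cond.get("type") == "KonduktorResult" and cond.get("status") == "True":
            return json.loads(cond["message"])
    return None  # no result reported


# Example conditions array, shaped like the kubectl/jq output above.
conditions = [
    {"type": "Completed", "status": "True", "message": "jobset completed"},
    {"type": "KonduktorResult", "status": "True",
     "message": '{"val_loss": 0.123}'},
]
result = extract_result(conditions)  # {"val_loss": 0.123}
```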
## Use case: Optuna sweep
The primitive pairs naturally with an Optuna hyperparameter sweep:
```python
import konduktor
import optuna

study = optuna.create_study(storage="postgresql://...", study_name="my-sweep")

for _ in range(50):
    trial = study.ask()
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    task = build_task(lr=lr)

    job = konduktor.launch(task, detach_run=True)
    try:
        val_loss = job.result(timeout=3600)["val_loss"]
        study.tell(trial, val_loss)
    except konduktor.JobFailedError:
        study.tell(trial, state=optuna.trial.TrialState.FAIL)
```
### With a Hydra sweeper
> The `hydra-konduktor-launcher` plugin referenced below is a forthcoming companion package — it ships separately from `konduktor` so core users aren’t forced to install Hydra. The example below describes its intended shape. Until it’s released, the plain-Python snippet above gets you equivalent behavior without Hydra.
If you already orchestrate sweeps with `hydra-core` (either the upstream `hydra-optuna-sweeper` or a custom sweeper), you can swap Hydra’s default `BasicLauncher` for `KonduktorLauncher` and get one Konduktor job per trial without touching your sweeper.
Install:

```shell
pip install konduktor[hydra]  # pulls hydra-core + hydra-konduktor-launcher
```
Point Hydra at Konduktor:

```yaml
# configs/train.yaml
defaults:
  - override /hydra/launcher: konduktor
  - override /hydra/sweeper: optuna  # or your custom sweeper

hydra:
  launcher:
    # Each trial becomes one Konduktor job with these resources.
    image_id: ghcr.io/my-org/training:latest
    resources:
      cpus: 4
      memory: 16
      accelerators: H100:1
    labels:
      kueue.x-k8s.io/queue-name: user-queue
```
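For reference, the sweeper side of the config uses standard `hydra-optuna-sweeper` options. A hedged sketch of a matching block, with illustrative values that mirror the plain-Python sweep above:

```yaml
hydra:
  sweeper:
    direction: minimize  # we minimize val_loss
    n_trials: 50
    params:
      # log-scale search over the learning rate
      lr: tag(log, interval(1e-5, 1e-2))
```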
Your training code stays unchanged — the plugin wraps your `@hydra.main` function and captures its return value the same way Hydra’s default `BasicLauncher` does:
```python
# train.py
import hydra


@hydra.main(version_base=None)
def train(cfg):
    val_loss = run_training(cfg)
    return val_loss  # launcher surfaces this as the job's result


if __name__ == "__main__":
    train()
```
Run the sweep:

```shell
python train.py -m hparams_search=my_sweep
```
Optuna’s Postgres-backed study and any `OptunaPruningCallback`-style intermediate metric reporting both keep working unchanged — they talk to Postgres directly from inside each trial, not through Konduktor.