Experimental. This feature is new and may change in backwards-incompatible ways before it stabilizes. The KonduktorResult condition type, the $KONDUKTOR_OUTPUT_DIR/result.json contract, and the Job.result() / konduktor result surfaces are the most likely to shift based on early feedback. Please share feedback in the Trainy Discord or file an issue.
Konduktor jobs can report a JSON-serializable value back to the launching process. This is useful for hyperparameter sweeps, Optuna trials, and any workflow where the launcher needs to read an objective / metric / artifact pointer from the run.

The contract

Every Konduktor-launched pod has:
  • An emptyDir volume mounted at /konduktor/output.
  • An environment variable KONDUKTOR_OUTPUT_DIR pointing at that directory.
To report a value, write JSON to $KONDUKTOR_OUTPUT_DIR/result.json:
# shell workload — no library install required
echo '{"val_loss": 0.42}' > "$KONDUKTOR_OUTPUT_DIR/result.json"
# python workload — stdlib only
import json, os
with open(f"{os.environ['KONDUKTOR_OUTPUT_DIR']}/result.json", "w") as f:
    json.dump({"val_loss": 0.42}, f)
# python workload — with konduktor installed (optional sugar)
import konduktor
konduktor.report({"val_loss": 0.42})
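All three snippets reduce to "write JSON to one path". As an illustrative stdlib-only sketch, a small helper can also enforce the 4 KB size cap described under Failure modes below and write atomically so readers never see a partial file; `report_result` is hypothetical, not part of the konduktor package:

```python
import json
import os
import tempfile

MAX_RESULT_BYTES = 4096  # kubelet terminationMessagePath cap (see Failure modes)

def report_result(value):
    """Serialize value to $KONDUKTOR_OUTPUT_DIR/result.json, enforcing the size cap."""
    payload = json.dumps(value)
    if len(payload.encode()) > MAX_RESULT_BYTES:
        raise ValueError(
            f"result is {len(payload.encode())} bytes; cap is {MAX_RESULT_BYTES}"
        )
    # Fall back to a temp dir so the sketch runs outside a Konduktor pod too.
    out_dir = os.environ.get("KONDUKTOR_OUTPUT_DIR", tempfile.mkdtemp())
    path = os.path.join(out_dir, "result.json")
    # Write to a sibling temp file, then rename: os.replace is atomic, so a
    # concurrent reader sees either no file or complete JSON, never a torn write.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(payload)
    os.replace(tmp_path, path)
    return path

path = report_result({"val_loss": 0.42})
print(open(path).read())  # -> {"val_loss": 0.42}
```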

Retrieving the value (Python API)

konduktor.launch() returns a Job handle, a str subclass that stays backward compatible with existing job_name = konduktor.launch(task) code. New methods:
import konduktor

task = konduktor.Task(
    name="sweep-trial-0",
    run='echo \'{"val_loss": 0.42}\' > "$KONDUKTOR_OUTPUT_DIR/result.json"',
)
task.set_resources(konduktor.Resources(cpus=1, memory=2, image_id="ubuntu"))

job = konduktor.launch(task)
value = job.result(timeout=600)   # blocks until the job terminates
print(value)   # {"val_loss": 0.42}
job.wait(timeout=...) returns the terminal state ("succeeded" or "failed") without parsing a value.

Retrieving the value (CLI)

konduktor launch --wait blocks until completion and prints the result JSON:
$ konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
{"val_loss": 0.123}
konduktor result <job-name> fetches the result of a terminal job:
$ konduktor result sweep-trial-0-abcd
{"val_loss": 0.42}
Both commands exit non-zero on job failure, malformed result, or timeout.

Failure modes

| Error | Raised when |
| --- | --- |
| JobFailedError | Job exited non-zero or never reported a result |
| ResultTooLargeError | Result exceeds 4 KB (kubelet’s terminationMessagePath cap) |
| MalformedResultError | result.json wasn’t valid JSON |
| NoResultReportedError | Job succeeded but didn’t write result.json |
| ResultTimeoutError | Timeout elapsed before the job was terminal |
If your result is larger than 4 KB, return a summary (e.g. a checkpoint path), not the full artifact.
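These exceptions surface on the client side when the result is inspected. As an illustrative sketch (the real exception classes live in the konduktor package; this stdlib-only version just mirrors the rules in the table), the parsing logic looks roughly like:

```python
import json

MAX_RESULT_BYTES = 4096  # kubelet terminationMessagePath cap

# Stand-ins for the konduktor exception classes listed above.
class ResultTooLargeError(Exception): pass
class MalformedResultError(Exception): pass
class NoResultReportedError(Exception): pass

def parse_result(message):
    """Turn a raw termination message into a Python value, or raise."""
    if not message:
        raise NoResultReportedError("job succeeded but wrote no result.json")
    size = len(message.encode())
    if size > MAX_RESULT_BYTES:
        raise ResultTooLargeError(f"{size} bytes exceeds {MAX_RESULT_BYTES}")
    try:
        return json.loads(message)
    except json.JSONDecodeError as exc:
        raise MalformedResultError(str(exc)) from exc

print(parse_result('{"val_loss": 0.42}'))  # -> {'val_loss': 0.42}
```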

Multi-node jobs

In a multi-node JobSet, only worker 0 of the first replicated job reports. Other workers’ termination messages are ignored. This matches Konduktor’s $master_addr convention.
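In practice that means only rank 0 should bother writing result.json. A minimal sketch, assuming a torchrun-style RANK environment variable (substitute whatever rank variable your distributed launcher sets; `maybe_report` is illustrative):

```python
import json
import os

def maybe_report(value):
    """Write result.json on worker 0 only; other ranks no-op.

    RANK is the torchrun-style rank variable (an assumption here); swap in
    your launcher's equivalent. Returns True if a result was written.
    """
    if int(os.environ.get("RANK", "0")) != 0:
        # Non-zero workers' termination messages are ignored anyway.
        return False
    path = os.path.join(os.environ["KONDUKTOR_OUTPUT_DIR"], "result.json")
    with open(path, "w") as f:
        json.dump(value, f)
    return True
```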

Multikueue

This feature works unchanged under multikueue. The pod runs in a worker cluster, the trainy-controller reconciler patches a KonduktorResult condition onto the worker-side JobSet, and Kueue’s JobSet multikueue adapter syncs that condition to the manager-side JobSet. Your client only talks to the manager cluster.

Try it on a local Kind cluster

You can exercise the full end-to-end flow against a local Kind cluster without touching any cloud resources; it takes about five minutes from scratch.
1. Build the controller image

cd trainy-controller
make docker-build IMG=trainy-controller:local
2. Stand up Kind with Kueue + JobSet

cd ..
bash tests/kind_install.sh
This installs the versions Konduktor is tested against (Kueue v0.15.2, JobSet v0.10.1) plus VictoriaLogs and OTEL for log streaming.
3. Load the image and create the Kueue queue

kind load docker-image trainy-controller:local
kubectl apply -f tests/smoke_tests/single-clusterqueue-setup.yaml
4. Deploy the controller

cd trainy-controller
make deploy IMG=trainy-controller:local
kubectl -n trainy-controller-system wait --for=condition=available \
  --timeout=120s deployment/trainy-controller-controller-manager
If you don’t have cert-manager installed locally, disable the JobSet defaulting webhook:
kubectl -n trainy-controller-system set env \
  deployment/trainy-controller-controller-manager ENABLE_WEBHOOKS=false
5. Install the SDK and launch a test job

cd ..
poetry install --with dev

poetry run konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
Expected stdout:
{"val_loss": 0.123}
6. Inspect the KonduktorResult condition on the JobSet

kubectl get jobset -o jsonpath='{.items[0].status.conditions}' | jq .
You should see a KonduktorResult condition whose message is the JSON value your workload wrote — this is what job.result() is reading under the hood. Under multikueue, Kueue’s JobSet adapter copies this entire status struct (including our condition) to the manager cluster for free.
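If you want to script against that condition directly (say, from kubectl output saved to a file), the extraction boils down to finding the KonduktorResult entry and JSON-decoding its message. A stdlib-only sketch over a trimmed sample of the conditions array:

```python
import json

# Trimmed sample of what the kubectl jsonpath query above returns; the
# KonduktorResult message field carries the raw result JSON as a string.
raw = (
    '[{"type": "Completed", "status": "True", "message": "jobset completed"},'
    ' {"type": "KonduktorResult", "status": "True",'
    ' "message": "{\\"val_loss\\": 0.123}"}]'
)

conditions = json.loads(raw)
# Find the KonduktorResult condition and decode its message payload.
result = next(
    json.loads(c["message"]) for c in conditions if c["type"] == "KonduktorResult"
)
print(result)  # -> {'val_loss': 0.123}
```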

Use case: Optuna sweep

The primitive pairs naturally with an Optuna hyperparameter sweep:
import konduktor
import optuna

study = optuna.create_study(storage="postgresql://...", study_name="my-sweep")

for _ in range(50):
    trial = study.ask()
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    task = build_task(lr=lr)
    job = konduktor.launch(task, detach_run=True)
    try:
        val_loss = job.result(timeout=3600)["val_loss"]
        study.tell(trial, val_loss)
    except konduktor.JobFailedError:
        study.tell(trial, state=optuna.trial.TrialState.FAIL)
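The loop above runs one trial at a time; since job.result() simply blocks, a thread pool parallelizes trials without any extra machinery. A sketch with a stub objective standing in for the launch-and-collect round trip (replace `launch_and_score`'s body with konduktor.launch(task) followed by job.result()):

```python
from concurrent.futures import ThreadPoolExecutor

def launch_and_score(lr):
    # Placeholder objective; in the real sweep this would build a task,
    # call konduktor.launch(task), then return job.result()["val_loss"].
    return (lr - 0.003) ** 2

lrs = [0.001, 0.003, 0.01]
# Each worker thread blocks in job.result() independently, so trials run
# as concurrent Konduktor jobs; map preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(launch_and_score, lrs))

best_score, best_lr = min(zip(scores, lrs))
print(best_lr)  # -> 0.003
```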

With a Hydra sweeper

The hydra-konduktor-launcher plugin referenced below is a forthcoming companion package — it ships separately from konduktor so core users aren’t forced to install Hydra. The example below describes its intended shape. Until it’s released, the plain-Python snippet above gets you equivalent behavior without Hydra.
If you already orchestrate sweeps with hydra-core (either the upstream hydra-optuna-sweeper or a custom sweeper), you can swap Hydra’s default BasicLauncher for KonduktorLauncher and get one Konduktor job per trial without touching your sweeper. Install:
pip install konduktor[hydra]   # pulls hydra-core + hydra-konduktor-launcher
Point Hydra at Konduktor:
# configs/train.yaml
defaults:
  - override /hydra/launcher: konduktor
  - override /hydra/sweeper: optuna   # or your custom sweeper

hydra:
  launcher:
    # Each trial becomes one Konduktor job with these resources.
    image_id: ghcr.io/my-org/training:latest
    resources:
      cpus: 4
      memory: 16
      accelerators: H100:1
    labels:
      kueue.x-k8s.io/queue-name: user-queue
Your training code stays unchanged — the plugin wraps your @hydra.main function and captures its return value the same way Hydra’s default BasicLauncher does:
# train.py
import hydra

@hydra.main(version_base=None)
def train(cfg):
    val_loss = run_training(cfg)
    return val_loss   # launcher surfaces this as the job's result

if __name__ == "__main__":
    train()
Run the sweep:
python train.py -m hparams_search=my_sweep
Optuna’s Postgres-backed study and any OptunaPruningCallback-style intermediate metric reporting both keep working unchanged — they talk to Postgres directly from inside each trial, not through Konduktor.