> **Experimental.** This feature is new and may change in backwards-incompatible ways before it stabilizes. The `KonduktorResult` condition type, the `$KONDUKTOR_OUTPUT_DIR/result.json` contract, and the `Job.result()` / `konduktor result` surfaces are the most likely to shift based on early feedback. Please share feedback in the Trainy Discord or file an issue.
Konduktor jobs can report a JSON-serializable value back to the launching
process. This is useful for hyperparameter sweeps, Optuna trials, and any
workflow where the launcher needs to read an objective / metric / artifact
pointer from the run.
## The contract

Every Konduktor-launched pod has:

- An `emptyDir` volume mounted at `/konduktor/output`.
- An environment variable `KONDUKTOR_OUTPUT_DIR` pointing at that directory.

To report a value, write JSON to `$KONDUKTOR_OUTPUT_DIR/result.json`:
```shell
# shell workload — no library install required
echo '{"val_loss": 0.42}' > "$KONDUKTOR_OUTPUT_DIR/result.json"
```

```python
# python workload — stdlib only
import json, os

with open(f"{os.environ['KONDUKTOR_OUTPUT_DIR']}/result.json", "w") as f:
    json.dump({"val_loss": 0.42}, f)
```

```python
# python workload — with konduktor installed (optional sugar)
import konduktor

konduktor.report({"val_loss": 0.42})
```
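The same contract is easy to wrap for scripts that also run outside Konduktor (local debugging, CI). Below is a minimal stdlib sketch; `report_result` is a hypothetical helper name, not part of the konduktor SDK:

```python
import json
import os
import sys


def report_result(value):
    """Write `value` as JSON to $KONDUKTOR_OUTPUT_DIR/result.json.

    Hypothetical helper, not part of the konduktor SDK. Falls back to
    stdout when the variable is unset (e.g. a local run), so the same
    training script works in and out of Konduktor.
    """
    payload = json.dumps(value)
    out_dir = os.environ.get("KONDUKTOR_OUTPUT_DIR")
    if out_dir is None:
        # Not running under Konduktor: just print the payload.
        print(payload, file=sys.stdout)
        return
    with open(os.path.join(out_dir, "result.json"), "w") as f:
        f.write(payload)


if __name__ == "__main__":
    report_result({"val_loss": 0.42})
```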
## Retrieving the value (Python API)

`konduktor.launch()` returns a `Job` handle (a `str` subclass, back-compatible with existing `job_name = konduktor.launch(task)` code). New methods:
```python
import konduktor

task = konduktor.Task(
    name="sweep-trial-0",
    run='echo \'{"val_loss": 0.42}\' > "$KONDUKTOR_OUTPUT_DIR/result.json"',
)
task.set_resources(konduktor.Resources(cpus=1, memory=2, image_id="ubuntu"))

job = konduktor.launch(task)
value = job.result(timeout=600)  # blocks until the job terminates
print(value)  # {"val_loss": 0.42}
```
`job.wait(timeout=...)` returns the terminal state (`"succeeded"` or `"failed"`) without parsing a value.
## Retrieving the value (CLI)

`konduktor launch --wait` blocks until completion and prints the result JSON:

```console
$ konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
{"val_loss": 0.123}
```

`konduktor result <job-name>` fetches the result of a terminal job:

```console
$ konduktor result sweep-trial-0-abcd
{"val_loss": 0.42}
```
Both commands exit non-zero on job failure, malformed result, or timeout.
## Failure modes

| Error | Raised when |
|---|---|
| `JobFailedError` | Job exited non-zero or never reported a result |
| `ResultTooLargeError` | Result exceeds 4 KB (kubelet’s `terminationMessagePath` cap) |
| `MalformedResultError` | `result.json` wasn’t valid JSON |
| `NoResultReportedError` | Job succeeded but didn’t write `result.json` |
| `ResultTimeoutError` | `timeout` elapsed before the job was terminal |
If your result is larger than 4 KB, return a summary (e.g. a checkpoint
path), not the full artifact.
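One way to honor that cap is to measure the serialized payload before reporting and swap in a pointer when it is too big. A hedged stdlib sketch (`summarize_if_large` and the checkpoint path are illustrative, not part of the SDK):

```python
import json

# kubelet terminationMessagePath cap cited above.
RESULT_CAP_BYTES = 4096


def summarize_if_large(value, checkpoint_path):
    """Return `value` if its JSON fits under the cap, else a pointer summary.

    Hypothetical helper: `checkpoint_path` is wherever the full artifact
    was persisted (object storage, shared volume, ...).
    """
    payload = json.dumps(value)
    if len(payload.encode("utf-8")) <= RESULT_CAP_BYTES:
        return value
    return {"checkpoint": checkpoint_path, "truncated": True}


small = summarize_if_large({"val_loss": 0.42}, "s3://bucket/ckpt.pt")
# A 4096-element loss history serializes well past 4 KB, so it collapses
# to the checkpoint pointer.
big = summarize_if_large({"history": [0.0] * 4096}, "s3://bucket/ckpt.pt")
```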
## Multi-node jobs

In a multi-node JobSet, only worker 0 of the first replicated job reports. Other workers’ termination messages are ignored. This matches Konduktor’s `$master_addr` convention.
## Multikueue

This feature works unchanged under multikueue. The pod runs in a worker cluster, the `trainy-controller` reconciler patches a `KonduktorResult` condition onto the worker-side JobSet, and Kueue’s JobSet multikueue adapter syncs that condition to the manager-side JobSet. Your client only talks to the manager cluster.
## Try it on a local Kind cluster

You can exercise the full end-to-end flow against a local Kind cluster without touching any cloud resources. It takes about five minutes from scratch.
### Build the controller image

```shell
cd trainy-controller
make docker-build IMG=trainy-controller:local
```
### Stand up Kind with Kueue + JobSet

```shell
cd ..
bash tests/kind_install.sh
```
This installs the versions Konduktor is tested against (Kueue v0.15.2, JobSet v0.10.1) plus VictoriaLogs and OTEL for log streaming.

### Load the image and create the Kueue queue

```shell
kind load docker-image trainy-controller:local
kubectl apply -f tests/smoke_tests/single-clusterqueue-setup.yaml
```
### Deploy the controller

```shell
cd trainy-controller
make deploy IMG=trainy-controller:local
kubectl -n trainy-controller-system wait --for=condition=available \
  --timeout=120s deployment/trainy-controller-controller-manager
```
If you don’t have cert-manager installed locally, disable the JobSet defaulting webhook:

```shell
kubectl -n trainy-controller-system set env \
  deployment/trainy-controller-controller-manager ENABLE_WEBHOOKS=false
```
### Install the SDK and launch a test job

```shell
cd ..
poetry install --with dev
poetry run konduktor launch --wait -y tests/test_yamls/return_value_shell.yaml
```
Expected stdout:

```
{"val_loss": 0.123}
```

### Inspect the KonduktorResult condition on the JobSet

```shell
kubectl get jobset -o jsonpath='{.items[0].status.conditions}' | jq .
```
You should see a `KonduktorResult` condition whose message is the JSON value your workload wrote — this is what `job.result()` is reading under the hood. Under multikueue, Kueue’s JobSet adapter copies this entire status struct (including our condition) to the manager cluster for free.
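To see what that condition carries, here is a hedged stdlib sketch of pulling the payload out of a conditions list shaped like the `jq` output above. `extract_result` is an illustrative name, not the actual client code:

```python
import json


def extract_result(conditions):
    """Return the parsed JSON payload from a KonduktorResult condition.

    `conditions` is the list under .status.conditions on the JobSet.
    Illustrative sketch only, not the real konduktor client logic.
    """
    for cond in conditions:
        if cond.get("type") == "KonduktorResult" and cond.get("status") == "True":
            return json.loads(cond["message"])
    return None  # no result reported


# Example conditions array, shaped like the kubectl/jq output above.
conditions = [
    {"type": "Completed", "status": "True", "message": "jobset completed"},
    {"type": "KonduktorResult", "status": "True",
     "message": '{"val_loss": 0.123}'},
]
result = extract_result(conditions)  # {"val_loss": 0.123}
```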
## Use case: Optuna sweep
The primitive pairs naturally with an Optuna hyperparameter sweep:
```python
import konduktor
import optuna

study = optuna.create_study(storage="postgresql://...", study_name="my-sweep")

for _ in range(50):
    trial = study.ask()
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    task = build_task(lr=lr)

    job = konduktor.launch(task, detach_run=True)
    try:
        val_loss = job.result(timeout=3600)["val_loss"]
        study.tell(trial, val_loss)
    except konduktor.JobFailedError:
        study.tell(trial, state=optuna.trial.TrialState.FAIL)
```
### With a Hydra sweeper
> The `hydra-konduktor-launcher` plugin referenced below is a forthcoming companion package — it ships separately from `konduktor` so core users aren’t forced to install Hydra. The example below describes its intended shape. Until it’s released, the plain-Python snippet above gets you equivalent behavior without Hydra.
If you already orchestrate sweeps with `hydra-core` (either the upstream `hydra-optuna-sweeper` or a custom sweeper), you can swap Hydra’s default `BasicLauncher` for `KonduktorLauncher` and get one Konduktor job per trial without touching your sweeper.
Install:

```shell
pip install konduktor[hydra]  # pulls hydra-core + hydra-konduktor-launcher
```
Point Hydra at Konduktor:

```yaml
# configs/train.yaml
defaults:
  - override /hydra/launcher: konduktor
  - override /hydra/sweeper: optuna  # or your custom sweeper

hydra:
  launcher:
    # Each trial becomes one Konduktor job with these resources.
    image_id: ghcr.io/my-org/training:latest
    resources:
      cpus: 4
      memory: 16
      accelerators: H100:1
    labels:
      kueue.x-k8s.io/queue-name: user-queue
```
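For reference, the sweeper side of the config uses standard `hydra-optuna-sweeper` options. A hedged sketch of a matching block, with illustrative values that mirror the plain-Python sweep above:

```yaml
hydra:
  sweeper:
    direction: minimize  # we minimize val_loss
    n_trials: 50
    params:
      # log-scale search over the learning rate
      lr: tag(log, interval(1e-5, 1e-2))
```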
Your training code stays unchanged — the plugin wraps your `@hydra.main` function and captures its return value the same way Hydra’s default `BasicLauncher` does:
```python
# train.py
import hydra


@hydra.main(version_base=None)
def train(cfg):
    val_loss = run_training(cfg)
    return val_loss  # launcher surfaces this as the job's result


if __name__ == "__main__":
    train()
```
Run the sweep:

```shell
python train.py -m hparams_search=my_sweep
```
Optuna’s Postgres-backed study and any `OptunaPruningCallback`-style intermediate metric reporting both keep working unchanged — they talk to Postgres directly from inside each trial, not through Konduktor.