At a glance
  • Simple resource declarations and bash UX
```yaml
# my_task.yaml
name: tune
num_nodes: 2  # scale up your workload

resources:
  cpus: 15
  memory: 90
  accelerators: H100:8
  image_id: gcr.io/k8s-staging-jobset/pytorch-mnist:latest
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    maxRunDurationSeconds: "3200"

run: |
  set -e
  NCCL_DEBUG=INFO torchrun --rdzv_id=123 --nnodes=$NUM_NODES --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=1234 --node_rank=$RANK /workspace/mnist.py
```
Run with:

```shell
$ konduktor launch my_task.yaml
```
  • Out-of-the-box observability: Trainy provides telemetry (Prometheus metrics and logs) for tracking cluster performance, utilization, and health.
Getting Started

  • Setup: Installation & Configuration
  • Quickstart: Launch your first job

Examples

  • Many Parallel Jobs: Schedule multiple jobs in parallel for batch inference
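Launching many jobs amounts to generating one task file per unit of work and calling `konduktor launch` on each. A minimal sketch of that pattern is below; the shard count, file names, and inference script path are hypothetical, and the launch commands are echoed rather than executed so the sketch runs without a cluster:

```shell
#!/bin/sh
# Sketch: fan out one task per input shard for batch inference.
# Shard count, file names, and script path are hypothetical examples.
set -e

for shard in 0 1 2 3; do
  # Write a minimal task file per shard (fields follow the example above).
  cat > "infer_${shard}.yaml" <<EOF
name: infer-${shard}
resources:
  accelerators: H100:1
run: |
  python /workspace/batch_infer.py --shard ${shard}
EOF
  # In practice this would be: konduktor launch "infer_${shard}.yaml"
  echo konduktor launch "infer_${shard}.yaml"
done
```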

  • Multi-node Jobs: Scale up the resources for a PyTorch distributed job across multiple machines
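Scaling out is driven by `num_nodes`: each node's container receives `$NUM_NODES`, `$MASTER_ADDR`, and `$RANK` at runtime, which `torchrun` consumes for rendezvous. A minimal sketch, assuming the same task as the example above scaled to 4 nodes, and assuming one process per GPU (`--nproc_per_node=8` matches the 8 requested H100s, unlike the single-process example above):

```yaml
# Sketch: the example task scaled from 2 to 4 nodes
name: tune
num_nodes: 4  # one container per machine

resources:
  accelerators: H100:8  # GPUs requested per node

run: |
  set -e
  # nproc_per_node=8 is an assumption: one worker process per GPU
  torchrun --rdzv_id=123 --nnodes=$NUM_NODES --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR --master_port=1234 --node_rank=$RANK \
    /workspace/mnist.py
```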

  • Interactive Workloads: SSH and connect VSCode to your GPU containers for debugging and development via Tailscale

  • Observability: Monitor and troubleshoot your workloads with our curated Grafana dashboards