Experimental features are new and their interface and implementation may change at any time. Expect sharp edges.
<COMPANY>.trainy.us
General Deployments: Deploy any containerized application with automatic horizontal scaling and health checks. Accessible at <COMPANY>2.trainy.us
Launch a deployment
To launch a deployment, use thekonduktor serve launch command shown below.
VLLM
- Deployment:
- App Deployment
- Service:
- App Service
- PodAutoscaler: (optional)
- KPA (Knative-based Pod Autoscaler)
GENERAL
- Deployment:
- App Deployment
- Service:
- App Service
- PodAutoscaler: (optional)
- HPA (Horizontal Pod Autoscaler)
konduktor launch task.yamls for jobs, except serving includes an extra section for replicas, ports, and health endpoint probing. For full, detailed examples of deployment.yaml, check out the bottom of this page.
Check status
To view your deployments, use thekonduktor serve status command.
Include --all-users or -u to see all deployments from all users in the cluster.

--direct to display direct IP endpoints instead of trainy.us endpoints.
--direct every time, you can modify ~/.konduktor/config.yaml as a permanent toggle for konduktor serve status --direct with:

Down a deployment
To delete a deployment, use thekonduktor serve down command.
Include --all or -a to down all deployments from all users in the cluster.
Accessing Deployments
trainy.us endpoints use https while direct IP endpoints use http. Requests through trainy.us timeout after 60 seconds.
vLLM (Aibrix)
Completion API:top destination for tech companies, but it's also a hub for innovation and creativity. So, it's no surprise that the city has a vibrant food scene. From the iconic Golden Gate Bridge to the bustling streets of the Financial District, San Francisco offers a unique blend of culture, history, and modernity. When it comes to food, the city is known for its diverse cuisine, which reflects ...
Chat Completion API
Okay, so I need to help write a random number generator function in Python. Hmm, where do I start? I remember that Python has a module called random which provides functions for generating random numbers. So maybe I should use that. Let me think about what functions are available there.\n\nFirst, there's random.randint(a, b), which returns a random integer N between a and b, inclusive. That's useful. Then...
General
Basic API:Hello from Konduktor Serve!
Health Probe API
Hello from health probe!
Autoscaling
Usekonduktor serve status to monitor the autoscaling process. The autoscaling process could take a few minutes, especially with a cold start from 0.
Scale-Up Behavior
- 0 —> 1: PA (Pod Autoscaler) triggers scale up immediately after the first request to the deployment endpoint.
- 1 —> N: PA triggers scale up based on average request rate metrics collected over a 30-second window.
Scale-Down Behavior
vLLM (Aibrix) Deployments: - stair-step scale-down- N —> N-1: PA triggers scale down based on average request rate metrics collected over a 30-second window. Grace period of 30 mins per pod.
- 1 —> 0: PA triggers scale down to zero replicas after 30 minutes of no requests to the model.
- N —> 0: PA triggers a direct scale down to zero replicas after 20 minutes of no requests to the deployment.
GPU Scheduling Behavior
Observed GKE Behavior:- GKE’s GPU scheduling can be inefficient and may not always utilize nodes optimally.
- GKE spins up new nodes even when existing nodes have sufficient GPU capacity.
Benchmarks
vLLM (Aibrix) Deployments: Throughput/Latency/Tokens




Example YAMLs
Schema
- Deployment Schema - Complete reference for deployment.yaml fields
General
- Simple (no autoscaling + default port (8000) + no health probing)
- Complex (autoscaling + custom port + custom health probing endpoint)