# Konduktor Serve Launch Deployment Yamls

> Schema and examples for your <b>konduktor serve launch</b> `deployment.yaml`

## Konduktor Serve Launch Deployment Yaml

#### Schema

```yaml
name: <string>                    # required

envs:                             # optional
  key: value
workdir: <string>                 # optional

resources:
  cpus: <int/float/string>        # required; unit is vCPUs
  memory: <int/string>            # required; unit is GiB
  # with vLLM deployments, image_id must be vllm/vllm-openai:v0.7.1
  image_id: <string>              # required
  accelerators: <string>          # optional
  labels:
    kueue.x-k8s.io/queue-name: <string>  # required

serving:
  # if min_replicas != max_replicas, autoscaling is enabled automatically
  min_replicas: <int>             # required
  max_replicas: <int>             # optional; defaults to min_replicas
  ports: <int>                    # optional; defaults to 8000
  # general deployments only; omit probe entirely with vLLM deployments
  probe: <string>                 # optional; defaults to None (no health probing)

file_mounts:                      # optional
  /remote/path: ./local/path

run: |
  # VLLM DEPLOYMENTS ONLY          # required
  python3 -m vllm.entrypoints.openai.api_server \
    --model <string> \
    --max-model-len <int> \
    --tensor-parallel-size <int>  # required with GPUs > 1; otherwise exclude
```
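
As a rough illustration of how the schema fills in for a general (non-vLLM) deployment, here is a minimal sketch. Everything concrete in it is an assumption rather than something the schema prescribes: the name `hello-server`, the `python:3.11-slim` image, the queue name `user-queue`, the use of Python's built-in `http.server` as the workload, and the idea that `probe` takes an HTTP path and `run` accepts an arbitrary shell command.

```yaml
name: hello-server                          # hypothetical deployment name

resources:
  cpus: 1                                   # 1 vCPU
  memory: 2                                 # 2 GiB
  image_id: python:3.11-slim                # assumed image for this sketch
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed queue name; use your own

serving:
  min_replicas: 1                           # max_replicas omitted, so it defaults to 1
  ports: 8000
  probe: /                                  # assumed to be an HTTP path; general deployments only

run: |
  # assumed workload: a static file server listening on the serving port
  python3 -m http.server 8000
```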

#### Details

**Both**

<a href="/launch-file-sync" target="_blank">`workdir`</a> (optional)

* local directory to sync into the remote workdir before running commands

`resources: cpus` (required)

* default and only unit is vCPU
* accepts int, float, or numeric string:
  * e.g. 2, 3.1, 0.2, "4.2", "0.5" (0.5 vCPU is converted to 500m, i.e. 500 millicores)
* unsupported formats:
  * "500m"
  * "4 vCPU"
  * anything containing letters
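
The accepted and rejected `cpus` forms listed above can be sketched as follows. The alternatives are commented out so the fragment stays a single valid mapping; the values themselves are illustrative.

```yaml
resources:
  cpus: 0.5          # float: half a vCPU (500m)
  # cpus: 2          # int: also accepted
  # cpus: "4.2"      # numeric string: also accepted
  # cpus: "500m"     # unsupported: contains a letter
  # cpus: "4 vCPU"   # unsupported: contains letters
```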

`resources: memory` (required)

* default and only unit is GiB (Gi)
* accepts int or numeric string:
  * 2, 16, etc (16 → 16Gi)
* unsupported formats:
  * 8.5
  * "8.5"
  * floats
  * "16G"
  * "0.5 T"
  * anything containing letters
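
A short sketch of the `memory` rules above, with illustrative values. Only whole numbers are accepted, so fractional amounts have to be rounded to an integer number of GiB.

```yaml
resources:
  memory: 16         # int: 16 GiB (16Gi)
  # memory: "32"     # numeric string: also accepted
  # memory: 8.5      # unsupported: float
  # memory: "16G"    # unsupported: contains a letter
```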

<a href="https://kueue.sigs.k8s.io/docs/tasks/run/jobs/#1-define-the-job" target="_blank">`labels: kueue.x-k8s.io/queue-name`</a> (required)

* required Kueue queue label so the scheduler knows which queue to place the JobSet in

`serving: min_replicas` (required)
`serving: max_replicas` (optional)

* if `min_replicas` != `max_replicas`, autoscaling is enabled automatically; `max_replicas` defaults to `min_replicas` when omitted
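
For example, the following `serving` block (replica counts chosen only for illustration) enables autoscaling because the two bounds differ. Dropping `max_replicas`, or setting it equal to `min_replicas`, would give a fixed-size deployment instead.

```yaml
serving:
  min_replicas: 1
  max_replicas: 3    # differs from min_replicas, so autoscaling is enabled
  ports: 8000        # the documented default, written out for clarity
```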

<a href="/launch-file-sync" target="_blank">`file_mounts`</a> (optional)

* mapping of local files/dirs to remote paths; Konduktor syncs them before launching your job
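
A minimal sketch of `workdir` and `file_mounts` used together. All paths here are hypothetical; each `file_mounts` key is the remote path and its value is the local path, as in the schema above.

```yaml
workdir: ./my-service                       # hypothetical local directory synced to the remote workdir

file_mounts:
  /app/config/settings.yaml: ./settings.yaml   # remote path: local file
  /app/assets: ./assets                        # remote path: local directory
```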

**vLLM (Aibrix) Specific**

`resources: image_id` (required)

* must be a `vllm/vllm-openai` image, e.g. `vllm/vllm-openai:v0.7.1` or another version, because the deployment relies on vLLM's OpenAI-compatible API server

`serving: probe` (exclude)

* vLLM's OpenAI-compatible API server supports only `/health` as a health endpoint, so leave `probe` out entirely; it then defaults to `/health`

`run` (required)

* `python3 -m vllm.entrypoints.openai.api_server` (required)
  * `--model` (required)
    * some models, such as Llama 3.1, require authentication with a Hugging Face token, which can be passed into the deployment using <a href="/secrets" target="_blank">Konduktor Secrets</a>
    * ex. `konduktor secret create --kind=env --inline HUGGING_FACE_HUB_TOKEN=hf_ABC123 my-hf-token`
  * `--max-model-len` (required)
  * `--tensor-parallel-size` (required with GPUs > 1; otherwise optional)
* See the [vLLM OpenAI-compatible server documentation](https://docs.vllm.ai/en/v0.4.1/serving/openai_compatible_server.html#command-line-arguments-for-the-server) for more information on the `vllm.entrypoints.openai.api_server` arguments
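
Putting the vLLM-specific rules together, a complete `deployment.yaml` might look like the sketch below. Treat the specifics as assumptions: the name `vllm-demo`, the queue name `user-queue`, the `H100:2` accelerator string (the schema only says `accelerators` is a string), and the small public model `facebook/opt-125m` with its 2048-token context are all placeholders chosen for illustration.

```yaml
name: vllm-demo                             # hypothetical deployment name

resources:
  cpus: 4
  memory: 32
  image_id: vllm/vllm-openai:v0.7.1         # required image for vLLM deployments
  accelerators: H100:2                      # assumed "<type>:<count>" format
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed queue name; use your own

serving:
  min_replicas: 1
  max_replicas: 2                           # differs from min_replicas, so autoscaling is enabled
  ports: 8000
  # probe is intentionally omitted for vLLM deployments

run: |
  python3 -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --max-model-len 2048 \
    --tensor-parallel-size 2
```

Here `--tensor-parallel-size 2` matches the two GPUs requested; with a single GPU the flag would be left out.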
