# Run Generator with vLLM-Omni

> Start the `vllm/vllm-omni:cosmos3` Docker server, set tensor-parallel and CFG/Ulysses options for Super, POST to the vision and action endpoints, toggle guardrails per request, and use a deploy config to disable guardrails server-wide.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb`

---

The production path for the Cosmos 3 Generator serves `nvidia/Cosmos3-Nano` or `nvidia/Cosmos3-Super` from the prebuilt `vllm/vllm-omni:cosmos3` image. The server is launched with `vllm serve … --omni --model-class-name Cosmos3OmniDiffusersPipeline` and exposes OpenAI-compatible `/v1/images/generations` and `/v1/videos` routes on port 8000.

<Info>
Cosmos 3 Generator support is being upstreamed in [vllm-project/vllm-omni#3454](https://github.com/vllm-project/vllm-omni/pull/3454). Until that PR merges, `vllm/vllm-omni:cosmos3` is the image that covers every modality (vision, sound, action); the PR-branch install covers only text-to-image, text-to-video, and image-to-video.
</Info>

## Prerequisites

| Requirement | Notes |
| --- | --- |
| Linux + NVIDIA GPU | Ampere, Hopper, or Blackwell |
| Hugging Face auth | Gated Cosmos3 checkpoints: `uvx hf@latest auth login` or `HF_TOKEN` |
| Docker + NVIDIA runtime | `--runtime nvidia --gpus all` for the server container |
| Local media paths | Mount host directories and set `--allowed-local-media-path` so the server can read conditioning images, videos, and action files |

Shared cookbook setup (CUDA driver pairing, HF cache mounts) lives on the [Cookbook environment setup](/cookbook-environment) page.

## Start the server

Pull the image once:

```bash
docker pull vllm/vllm-omni:cosmos3
```

<Steps>
<Step title="Cosmos3-Nano (single GPU)">

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800
```

</Step>
<Step title="Cosmos3-Super (tensor parallel + offload)">

`Cosmos3-Super` (64B) typically needs multiple GPUs. `--tensor-parallel-size` shards weights; `--enable-layerwise-offload` moves transformer blocks between CPU and GPU (lower peak VRAM, higher latency, more host RAM).

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --port 8000 \
  --init-timeout 1800
```

Set `--tensor-parallel-size` to the number of GPUs you allocate.

</Step>
<Step title="Verify readiness">

The process prints `Application startup complete.` when the API is ready. Probe models:

```bash
curl http://localhost:8000/v1/models
```

</Step>
</Steps>
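
For scripted setups, the readiness check can be automated by polling `/v1/models` until it answers. The following is a minimal sketch, not part of the cookbooks: `probe_models` and `wait_until_ready` are hypothetical helper names, and the 1800-second default simply mirrors the `--init-timeout 1800` value used above.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_models(base_url: str) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or non-JSON body: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 1800.0,  # assumption: same budget as --init-timeout
    poll_s: float = 5.0,
    probe: Callable[[str], bool] = probe_models,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the server until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Calling `wait_until_ready()` before the first generation request avoids sending work to a server that is still downloading or loading weights. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.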

### Parallelism options (Super and Nano)

| Flag | Effect |
| --- | --- |
| `--tensor-parallel-size N` | Shard model weights across `N` GPUs |
| `--enable-layerwise-offload` | Offload transformer blocks CPU↔GPU between steps |
| `--cfg-parallel-size 2` | Run positive and negative CFG branches on two GPUs in parallel |
| `--ulysses-degree 2` | Ulysses sequence parallelism across the sequence dimension |

<Warning>
When combining flags, provision GPUs for the product  
`tensor_parallel_size × cfg_parallel_size × ulysses_degree`.
</Warning>
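
The GPU budget implied by the warning above is a plain product of the three flags. A small sketch to sanity-check an allocation before launching; `required_gpus` is a hypothetical helper, and treating an unset flag as `1` is an assumption:

```python
def required_gpus(
    tensor_parallel_size: int = 1,
    cfg_parallel_size: int = 1,
    ulysses_degree: int = 1,
) -> int:
    """GPUs to provision: tensor_parallel_size x cfg_parallel_size x ulysses_degree."""
    sizes = {
        "tensor_parallel_size": tensor_parallel_size,
        "cfg_parallel_size": cfg_parallel_size,
        "ulysses_degree": ulysses_degree,
    }
    for name, value in sizes.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    return tensor_parallel_size * cfg_parallel_size * ulysses_degree


# Tensor parallel 4 combined with CFG parallel 2 and Ulysses degree 2.
print(required_gpus(4, 2, 2))  # 16
```

For example, `--tensor-parallel-size 4` combined with `--cfg-parallel-size 2` already needs 8 GPUs, even without Ulysses parallelism.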

With CFG parallel, control guidance strength through the request field `guidance_scale`. Do **not** use `true_cfg_scale` with these Cosmos3 examples.

Example Nano launch with CFG parallel, shown without the Docker wrapper (as when vLLM-Omni is installed from source):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --cfg-parallel-size 2 \
  --port 8000 \
  --init-timeout 1800
```

### PR-branch install (three vision modes only)

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 \
  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
```

Then run `vllm serve` directly (no `docker run … vllm/vllm-omni:cosmos3` wrapper) with the same `--omni` and `--model-class-name` flags.

The request flow against a running server looks like this:

```text
Client (curl / requests)
        │
        ▼
POST /v1/images/generations  ──► PNG (base64 in JSON)
POST /v1/videos/sync         ──► MP4 bytes (blocking)
POST /v1/videos              ──► job id → poll → /content or action in final JSON
        │
        ▼
vllm serve (Docker: vllm/vllm-omni:cosmos3)
  --omni --model-class-name Cosmos3OmniDiffusersPipeline
```

## Vision generation endpoints

| Mode | Endpoint | Request notes | Response |
| --- | --- | --- | --- |
| Text to image | `POST /v1/images/generations` | JSON body | Base64 PNG in JSON |
| Text to video | `POST /v1/videos/sync` | Multipart form fields | MP4 body |
| Image to video | `POST /v1/videos/sync` | Upload `input_reference` image | MP4 body |
| Video to video | `POST /v1/videos/sync` | Upload source video; set conditioning frames in `extra_params` | MP4 body |
| Video with sound | `POST /v1/videos/sync` | `generate_sound=true` (+ optional `sound_duration`) | MP4 body |

If you run Nano and Super as separate servers, give each its own base URL:

```bash
export COSMOS3_VLLM_NANO_BASE_URL=http://localhost:8000
export COSMOS3_VLLM_SUPER_BASE_URL=http://localhost:8001
```
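
A client script can pick up these variables and fall back to the ports shown above. This is a sketch under assumptions: `base_url_for` is a hypothetical helper, and the fallback URLs only mirror the `export` lines rather than any default defined by the cookbooks.

```python
import os
from typing import Mapping

# Environment variable and fallback URL per model; fallbacks mirror the exports above.
_BASE_URLS = {
    "nano": ("COSMOS3_VLLM_NANO_BASE_URL", "http://localhost:8000"),
    "super": ("COSMOS3_VLLM_SUPER_BASE_URL", "http://localhost:8001"),
}


def base_url_for(model: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve the server base URL for 'nano' or 'super', without a trailing slash."""
    try:
        variable, fallback = _BASE_URLS[model.lower()]
    except KeyError:
        raise ValueError(f"unknown model {model!r}; expected 'nano' or 'super'") from None
    return env.get(variable, fallback).rstrip("/")


print(base_url_for("nano", env={}))  # http://localhost:8000
```

Stripping the trailing slash keeps later joins such as `base_url + "/v1/videos/sync"` from producing a double slash.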

### Text-to-video (sync)

<RequestExample>

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=189" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=6.0" \
  --form-string "flow_shift=10.0" \
  --form-string "seed=0" \
  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
  -H "Accept: video/mp4" \
  -o cosmos3_t2v.mp4
```

</RequestExample>

Audiovisual cookbooks use structured JSON prompts from `cookbooks/cosmos3/generator/audiovisual/assets/prompts/` with the same fields; default sampling in the vLLM-Omni notebook is 35 steps, `guidance_scale=6.0`, `flow_shift=10.0`, 189 frames at 24 FPS, 1280×720.
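
The same request can be issued from Python. The sketch below is illustrative rather than the notebook's own code: `build_t2v_fields` and `post_video_sync` are hypothetical names, the one-hour timeout is an arbitrary choice, and `requests` is assumed to be installed. Each value is sent as a plain multipart text field, which is what `--form-string` does in curl.

```python
import json


def build_t2v_fields(
    prompt: str,
    negative_prompt: str = "blurry, distorted, low quality",
    guardrails: bool = True,
) -> dict[str, str]:
    """Form fields mirroring the curl example; every value is sent as text."""
    extra_params = {
        "use_resolution_template": False,
        "use_duration_template": False,
        "guardrails": guardrails,
    }
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "size": "1280x720",
        "num_frames": "189",
        "fps": "24",
        "num_inference_steps": "35",
        "guidance_scale": "6.0",
        "flow_shift": "10.0",
        "seed": "0",
        "extra_params": json.dumps(extra_params),  # JSON travels as one string field
    }


def post_video_sync(base_url: str, fields: dict[str, str], out_path: str) -> None:
    """POST the fields to /v1/videos/sync and write the returned MP4 bytes to disk."""
    import requests  # third-party; imported here so the builder above stays dependency-free

    # (None, value) tuples make requests emit text parts without a filename.
    multipart = {name: (None, value) for name, value in fields.items()}
    response = requests.post(
        f"{base_url}/v1/videos/sync",
        files=multipart,
        headers={"Accept": "video/mp4"},
        timeout=3600,  # assumption: generous budget for a blocking generation call
    )
    response.raise_for_status()
    with open(out_path, "wb") as handle:
        handle.write(response.content)
```

With a server running, `post_video_sync("http://localhost:8000", build_t2v_fields("A small warehouse robot moves a blue box across a clean floor."), "cosmos3_t2v.mp4")` reproduces the curl call. At the defaults, 189 frames at 24 FPS is a clip of roughly 7.9 seconds.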

### Image-to-video

Add the conditioning file:

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=..." \
  --form-string "size=1280x720" \
  ... \
  -F "input_reference=@/path/to/image.jpg" \
  -H "Accept: video/mp4" \
  -o cosmos3_i2v.mp4
```
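
In Python, the `-F "input_reference=@…"` upload corresponds to a file part with a filename and content type. A minimal sketch, assuming the `requests` tuple convention `(filename, content, content_type)`; `reference_part` is a hypothetical helper, and restricting uploads to image and video MIME types is an assumption rather than a documented server rule.

```python
import mimetypes
from pathlib import Path


def reference_part(path: str) -> tuple[str, bytes, str]:
    """Build a (filename, content, content_type) tuple for an input_reference upload."""
    file_path = Path(path)
    mime, _ = mimetypes.guess_type(file_path.name)
    # Assumption: conditioning inputs are images or videos, so reject anything else early.
    if mime is None or not mime.startswith(("image/", "video/")):
        raise ValueError(f"unsupported conditioning file type for {file_path.name!r}")
    return (file_path.name, file_path.read_bytes(), mime)
```

The tuple can be combined with text fields in one multipart body, for example `files={"prompt": (None, "..."), "input_reference": reference_part("/path/to/image.jpg")}` passed to `requests.post`.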

### Text-to-image

Image requests use JSON (`extra_args` for Cosmos-specific toggles) rather than multipart `extra_params`:

```python
import requests

body = {
    "prompt": "...",
    "size": "1280x720",
    "n": 1,
    "num_inference_steps": 35,
    "guidance_scale": 6.0,
    "flow_shift": 10.0,
    "seed": 0,
    "extra_args": {
        "use_resolution_template": False,
        "guardrails": True,
    },
}
response = requests.post("http://localhost:8000/v1/images/generations", json=body, timeout=600)
response.raise_for_status()  # surface HTTP errors instead of silently discarding them
```
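
The image endpoint returns the PNG as base64 inside JSON, so the client has to decode it before saving. The sketch below assumes the OpenAI-style layout `{"data": [{"b64_json": "..."}]}`; that field path is an assumption based on the route being OpenAI-compatible, and `save_first_image` is a hypothetical helper.

```python
import base64
from pathlib import Path
from typing import Any, Mapping

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def save_first_image(payload: Mapping[str, Any], out_path: str) -> int:
    """Decode the first image in an images response and write it; return the byte count."""
    encoded = payload["data"][0]["b64_json"]  # assumed OpenAI-style response layout
    raw = base64.b64decode(encoded)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded image does not start with the PNG signature")
    Path(out_path).write_bytes(raw)
    return len(raw)
```

Pass it the parsed JSON body of the `/v1/images/generations` response, for example `save_first_image(resp.json(), "cosmos3_t2i.png")` where `resp` is the object returned by `requests.post`.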

<Tip>
Use `--form-string` for text fields (`prompt`, `negative_prompt`, `extra_params`). With `-F`, curl interprets a leading `@` or `<` as a file reference and treats `;` as the start of per-field options such as `;type=`, so JSON values can be truncated or misread.
</Tip>

## Action generation endpoints

Action modes condition on `domain_name` and exchange video/action sequences. Embodiment dimensions and semantics are documented on [Action modality](/action-modality).

| `action_mode` | Typical endpoint | Input | Output |
| --- | --- | --- | --- |
| `forward_dynamics` | `POST /v1/videos` (async) or `POST /v1/videos/sync` | Image + action chunk | Video |
| `inverse_dynamics` | `POST /v1/videos` (async) | Video + instruction | Predicted action in completed job JSON |
| `policy` | `POST /v1/videos` (async) | Image + instruction | Video + action chunk |

Cookbook forward-dynamics jobs POST multipart to `/v1/videos`, poll `GET /v1/videos/{id}`, then download `GET /v1/videos/{id}/content` for the MP4.
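
The submit, poll, download sequence can be sketched as a small polling loop. This is illustrative, not the cookbook's implementation: `wait_for_video_job` is a hypothetical helper, and the `status` field with values `completed` and `failed` is an assumption modeled on OpenAI-style video job objects, so check the actual job JSON before relying on it.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_video_job(
    job_id: str,
    get_json: Callable[[str], Mapping[str, Any]],
    poll_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll GET /v1/videos/{id} until the job finishes; return the final job JSON."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_json(f"/v1/videos/{job_id}")
        status = job.get("status")  # assumed field name and values
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"video job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video job {job_id} still {status!r} after {timeout_s}s")
        sleep(poll_s)
```

With `requests`, `get_json` could be `lambda path: requests.get(base_url + path, timeout=30).json()`. Once the job completes, fetch the MP4 from `GET /v1/videos/{id}/content`; for inverse dynamics, read the predicted action from the returned job JSON instead.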

<ParamField body="extra_params (JSON)" type="object">
Action-related keys include `action_mode`, `domain_name` (e.g. `av`, `droid_lerobot`, `umi`), `action_chunk_size`, `image_size`, `view_point`, inline `action` array or `action_path`, and optional `raw_action_dim` for inverse dynamics.
</ParamField>

Example forward-dynamics `extra_params` shape (AV):

```json
{
  "action_mode": "forward_dynamics",
  "domain_name": "av",
  "action_chunk_size": 60,
  "image_size": [320, 576],
  "view_point": 0,
  "action": [[...]],
  "guardrails": false
}
```

Inverse dynamics sets `action_mode` to `inverse_dynamics`, `raw_action_dim` to `9` for AV ego pose, and uploads the source clip as `input_reference` (`video/mp4`).
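
Because `extra_params` travels as a JSON string, it helps to build it in one place for each action mode. The sketch below uses only the keys listed above; `build_action_extra_params` is a hypothetical helper, and treating `action` and `action_path` as mutually exclusive is an assumption drawn from the "inline `action` array or `action_path`" wording.

```python
import json
from typing import Any, Optional, Sequence

ACTION_MODES = {"forward_dynamics", "inverse_dynamics", "policy"}


def build_action_extra_params(
    action_mode: str,
    domain_name: str,
    *,
    action_chunk_size: Optional[int] = None,
    image_size: Optional[Sequence[int]] = None,
    view_point: Optional[int] = None,
    action: Optional[Sequence[Sequence[float]]] = None,
    action_path: Optional[str] = None,
    raw_action_dim: Optional[int] = None,
    guardrails: bool = False,
) -> str:
    """Serialize action-related extra_params to the JSON string sent in the form field."""
    if action_mode not in ACTION_MODES:
        raise ValueError(f"action_mode must be one of {sorted(ACTION_MODES)}")
    if action is not None and action_path is not None:
        # Assumption: supply the action inline or by path, not both.
        raise ValueError("pass either action or action_path, not both")
    params: dict[str, Any] = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "guardrails": guardrails,
    }
    optional = {
        "action_chunk_size": action_chunk_size,
        "image_size": list(image_size) if image_size is not None else None,
        "view_point": view_point,
        "action": action,
        "action_path": action_path,
        "raw_action_dim": raw_action_dim,
    }
    # Drop unset keys; explicit zeros such as view_point=0 are kept.
    params.update({key: value for key, value in optional.items() if value is not None})
    return json.dumps(params)
```

For the AV inverse-dynamics case described above, `build_action_extra_params("inverse_dynamics", "av", raw_action_dim=9)` yields the string to place in the `extra_params` form field alongside the uploaded `input_reference` clip.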

Mount the repo (or action asset directory) into the container and keep paths visible under `--allowed-local-media-path`.

## Common request fields

| Field | Purpose |
| --- | --- |
| `prompt` | Positive prompt (plain text or JSON string for structured prompts) |
| `negative_prompt` | Concepts to avoid (video modes) |
| `size` | `<width>x<height>` (e.g. `1280x720`) |
| `num_frames`, `fps` | Video length and frame rate |
| `num_inference_steps` | Diffusion denoising steps |
| `guidance_scale` | CFG scale for Cosmos3 (not `true_cfg_scale`) |
| `flow_shift` | Scheduler flow-shift |
| `seed` | Reproducibility |
| `max_sequence_length` | Prompt token cap (default `512`; longer prompts truncated) |
| `input_reference` | Conditioning image or video file |
| `generate_sound` | `true` for synchronized audio |
| `extra_params` | JSON Cosmos3 options (action, guardrails, templates, v2v conditioning) |
| `extra_args` | Image-endpoint Cosmos3 options |

## Guardrails

Cosmos3 ships safety guardrails that screen prompts and blur faces in outputs.

**Per request**: set `guardrails` inside `extra_params` (video) or `extra_args` (image):

```bash
--form-string 'extra_params={"guardrails":false,"use_resolution_template":false,"use_duration_template":false}'
```

Action cookbooks commonly set `"guardrails": false` for robotics and AV rollouts.

**Server-wide disable**: when guardrails are disabled in the deploy config, the guardrail models are not loaded, so a per-request `guardrails: true` cannot re-enable them. A future release may add `--cosmos3-no-guardrails`; until then, pass a deploy config:

```yaml
# no_guardrails.yaml
async_chunk: false
stages:
  - stage_id: 0
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    model_class_name: Cosmos3OmniDiffusersPipeline
    model_config:
      guardrails: false
      offload_guardrail_models: false
```

```bash
vllm serve nvidia/Cosmos3-Nano --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --deploy-config no_guardrails.yaml \
  --allowed-local-media-path / \
  --port 8000
```

## Notebook-oriented server layout

Action notebooks often bind host port **8001** to container **8000** and pin a single GPU:

```bash
docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true

docker run -d --name cosmos3-vllm-omni-notebook \
  --runtime nvidia --gpus '"device=0"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD:/workspace" \
  -p 8001:8000 --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
    --omni \
    --model-class-name Cosmos3OmniDiffusersPipeline \
    --allowed-local-media-path / \
    --port 8000 \
    --init-timeout 1800

export COSMOS3_VLLM_BASE_URL=http://localhost:8001
curl http://localhost:8001/v1/models
```

Outputs default to `outputs/cosmos3_action_vllm/` (action) or `cookbooks/cosmos3/generator/audiovisual/outputs/notebooks/` (audiovisual).

## Troubleshooting

| Symptom | Check |
| --- | --- |
| Server never ready | Increase `--init-timeout`; confirm HF cache and model download; `docker logs` |
| `403` / model not found | Hugging Face login and license acceptance for gated repos |
| Local file not found | Volume mount and `--allowed-local-media-path` cover the path used in requests |
| Truncated `extra_params` | Use `--form-string`, not `-F`, for JSON fields |
| OOM on Super | Raise `--tensor-parallel-size`, add `--enable-layerwise-offload`, or reduce resolution/frame count |

See [Troubleshooting](/troubleshooting) for CUDA driver and container pairing.

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
HF auth, Docker image pull, and GPU verification shared across backends.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
Full endpoint field lists, `action_mode` values, and curl constraints.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics across Framework and vLLM-Omni with `domain_name` conditioning.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
End-to-end text/image/video (+ sound) notebooks using `run_with_vllm_omni.ipynb`.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to pick vLLM-Omni vs Diffusers vs Cosmos Framework.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Published vLLM-Omni latency by GPU, resolution, and tensor-parallel width.
</Card>
</CardGroup>
