# Quickstart

> Minimal first-run commands for Generator (Diffusers text-to-video, vLLM-Omni curl) and Reasoner (vLLM serve + OpenAI chat completion), including HF login and expected success signals.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json`

---


Cosmos 3 exposes two runtime surfaces: **Generator** (diffusion outputs via Diffusers or vLLM-Omni) and **Reasoner** (text outputs via vLLM chat completions). Both pull the same gated Hugging Face checkpoints but use different install and serve paths. This page runs one minimal call per surface (text-to-video for Generator, image captioning for Reasoner); full environment matrices live on [Installation](/installation) and [Cookbook environment setup](/cookbook-environment).

## Prerequisites

| Requirement | Notes |
| --- | --- |
| Linux + NVIDIA GPU | Ampere, Hopper, or Blackwell; BF16 tested |
| `uv`, `git`, `git-lfs` | Framework/vLLM paths need `uv >= 0.11.3` for `--torch-backend=cu130` |
| Hugging Face access | Gated `nvidia/Cosmos3-*` repos |
| CUDA pairing | Match driver to `cu130` (CUDA 13) or `cu128` (CUDA 12.8); do not rely on `--torch-backend=auto` for vLLM |
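
The `uv >= 0.11.3` requirement can be checked before anything is installed. The helper below is a minimal sketch, not part of the Cosmos tooling: the function names are hypothetical, and it assumes `uv --version` prints text of the form `uv 0.11.3 (...)`.

```python
import re

MIN_UV = (0, 11, 3)  # minimum stated in the prerequisites table


def parse_uv_version(version_output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from text such as 'uv 0.11.3 (abc 2025-01-01)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def uv_is_new_enough(version_output: str, minimum=MIN_UV) -> bool:
    # Tuples compare numerically per component, so 0.9.30 sorts below 0.11.3,
    # which a plain string comparison would get wrong.
    return parse_uv_version(version_output) >= minimum
```

In practice the input would come from something like `subprocess.run(["uv", "--version"], capture_output=True, text=True).stdout`.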

<Warning>
vLLM wheels are paired to a CUDA minor version. On CUDA 12.x use `vllm==0.19.1` with `--torch-backend=cu128`; on CUDA 13.x use `vllm==0.21.0` with `--torch-backend=cu130`.
</Warning>
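
The pairing in the warning above can be written down as a small lookup so that setup scripts never fall back to `--torch-backend=auto`. This is an illustrative sketch: the function names are hypothetical, and only the two pairings listed on this page are encoded.

```python
# CUDA major version -> (vLLM pin, torch backend), copied from the warning above.
PAIRINGS = {
    12: ("vllm==0.19.1", "cu128"),
    13: ("vllm==0.21.0", "cu130"),
}


def pick_vllm_pairing(cuda_version: str) -> tuple[str, str]:
    """Map a CUDA version string such as '13.0' to the matching pin and backend."""
    major = int(cuda_version.split(".")[0])
    if major not in PAIRINGS:
        raise ValueError(f"no pairing listed for CUDA {cuda_version}")
    return PAIRINGS[major]


def install_command(cuda_version: str) -> str:
    """Render the uv command line for the detected CUDA version."""
    pin, backend = pick_vllm_pairing(cuda_version)
    return f'uv pip install --torch-backend={backend} "{pin}"'
```

The version string is assumed to be read by hand or by script from the `nvidia-smi` header; anything outside CUDA 12.x or 13.x raises rather than guessing.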

## Authenticate with Hugging Face

Create a token with access to the Cosmos 3 collection, then authenticate before the first checkpoint download:

```bash
uvx hf@latest auth login
```

Alternatively set `HF_TOKEN` in the environment. Use `HF_HOME` when you want a shared or larger cache directory.

<Check>
**Success:** `hf auth whoami` prints your account name, or the first `from_pretrained` / `vllm serve` model download completes, with no 401/403 errors from Hugging Face.
</Check>
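
Before a long download, it can help to print which of the two environment variables mentioned above are in effect. The sketch below only inspects the environment and makes no network call; `hf_settings` is a hypothetical helper, and the fallback path assumes the usual Hugging Face default of `~/.cache/huggingface`.

```python
import os
from pathlib import Path


def hf_settings(env=None) -> dict:
    """Summarise HF_TOKEN and HF_HOME without contacting Hugging Face."""
    env = os.environ if env is None else env
    # Assumed default cache root when HF_HOME is unset.
    default_home = str(Path.home() / ".cache" / "huggingface")
    return {
        "token_in_env": bool(env.get("HF_TOKEN")),
        "hf_home": env.get("HF_HOME") or default_home,
    }
```

A `token_in_env` value of `False` does not mean you are logged out: a token saved by the login command lives on disk, not in the environment.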

## Generator: Diffusers text-to-video

Install a Python 3.13 venv with Diffusers and a CUDA-matched `torch` build (example uses CUDA 13):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate av cosmos_guardrail huggingface_hub imageio imageio-ffmpeg \
  torch torchvision transformers
```

From `cookbooks/cosmos3/generator/audiovisual/`, run a minimal 720p text-to-video pass using the checked-in structured prompts:

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

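# Load the checked-in structured prompts; the context managers close the files promptly.
with open("assets/prompts/text2video/robot_kitchen.json") as f:
    prompt = json.load(f)
with open("assets/negative_prompts/text2video/neg_prompt.json") as f:
    negative = json.load(f)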

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "/tmp/cosmos3_t2v_diffusers.mp4", fps=24)
```

<Note>
The first run downloads `nvidia/Cosmos3-Nano` and then executes every diffusion step, so long step times are expected and do not indicate a hang. For a plain string prompt without JSON assets, pass `prompt="..."` directly, as in the root README quickstart.
</Note>

<Check>
**Success:** `/tmp/cosmos3_t2v_diffusers.mp4` exists and plays; GPU memory stays allocated during denoising; `torch.cuda.is_available()` returns `True` after install.
</Check>
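
Two quick sanity checks follow from the parameters above: the expected clip length is `num_frames / fps`, and the output file should exist and be non-empty. The helper names below are hypothetical; this is a sketch of the arithmetic and the file check, not a playback test.

```python
from pathlib import Path


def clip_seconds(num_frames: int, fps: float) -> float:
    """Duration of a clip in seconds, e.g. 189 frames at 24 fps."""
    return num_frames / fps


def output_ok(path: str) -> bool:
    """True when the file exists and contains at least one byte."""
    candidate = Path(path)
    return candidate.is_file() and candidate.stat().st_size > 0


print(clip_seconds(189, 24))  # 7.875 seconds for the Diffusers example
```

A player reporting a very different duration for `/tmp/cosmos3_t2v_diffusers.mp4` would suggest the export `fps` and the generation `fps` diverged.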

## Generator: vLLM-Omni curl

Start the official Docker image (all Generator modalities; API on port 8000):

```bash
docker pull vllm/vllm-omni:cosmos3

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

<Check>
**Server ready:** The log shows `Application startup complete.` and `curl http://localhost:8000/v1/models` returns model metadata.
</Check>

Send a blocking text-to-video request (writes MP4 to disk):

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=4.0" \
  --form-string "seed=42" \
  -o cosmos3_t2v_output.mp4
```

<Tip>
Use `--form-string` for `prompt`, `negative_prompt`, and `extra_params`. With `-F`, curl treats `;` as a content-type separator and can truncate JSON prompt values.
</Tip>

For cookbook-aligned structured prompts from the audiovisual folder, POST JSON-serialized prompt objects the same way the notebook does (`prompt` and `negative_prompt` as `json.dumps(...)` form fields, `flow_shift=10.0`, `guidance_scale=6.0`, `num_frames=189`).
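
The structured-prompt request can be assembled in Python before handing it to `curl`. The sketch below is an assumption-laden illustration: `build_t2v_form` and `curl_form_args` are hypothetical helpers, the prompt keys in the usage note are placeholders (the real schema is in the checked-in JSON assets), and sending `flow_shift` as a top-level form field is assumed from the parameter list above.

```python
import json


def build_t2v_form(prompt_obj: dict, negative_obj: dict, **overrides) -> dict[str, str]:
    """Form fields for /v1/videos/sync with JSON-serialized prompt objects."""
    fields = {
        "prompt": json.dumps(prompt_obj),
        "negative_prompt": json.dumps(negative_obj),
        "size": "1280x720",
        "num_frames": "189",
        "fps": "24",
        "num_inference_steps": "35",
        "guidance_scale": "6.0",
        "flow_shift": "10.0",  # assumption: accepted as a top-level field
    }
    fields.update({name: str(value) for name, value in overrides.items()})
    return fields


def curl_form_args(fields: dict[str, str]) -> list[str]:
    """Expand fields into --form-string pairs so ';' in JSON is passed literally."""
    args: list[str] = []
    for name, value in fields.items():
        args += ["--form-string", f"{name}={value}"]
    return args
```

For example, `curl_form_args(build_t2v_form({"scene": "a; b"}, {"avoid": ["blur"]}, seed=42))` yields an argument list that can be appended to a `curl` invocation through `subprocess.run` without shell quoting.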

<Check>
**Success:** `cosmos3_t2v_output.mp4` is non-empty; HTTP 200 with `video/mp4` body; server logs show request completion without OOM.
</Check>
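
A non-empty file is not proof of success on its own: without `--fail`, `curl -o` also writes an error response body to the output path. A minimal sketch of a stricter check looks for the `ftyp` box type that MP4 writers normally place in bytes 4 to 8 of the file. The function names are hypothetical.

```python
def looks_like_mp4(head: bytes) -> bool:
    """True when the first box of the file is an ISO base media 'ftyp' box."""
    return len(head) >= 8 and head[4:8] == b"ftyp"


def check_download(path: str) -> bool:
    """Read only the first 8 bytes of the file and test the signature."""
    with open(path, "rb") as handle:
        return looks_like_mp4(handle.read(8))
```

If `check_download("cosmos3_t2v_output.mp4")` returns `False`, opening the file as text usually reveals the server's error message.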

## Reasoner: vLLM serve and chat completion

Install vLLM plus the Cosmos 3 plugin (CUDA 13 example):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

If the build reports DeepGEMM unavailable:

```bash
export VLLM_USE_DEEP_GEMM=0
```

Start a single-GPU Nano Reasoner server:

```bash
CUDA_VISIBLE_DEVICES=0 \
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$(dirname "$(pwd)")" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8000
```
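
Flags that carry JSON, such as `--hf-overrides` and `--media-io-kwargs`, are easy to mis-quote in a shell. When launching from Python, one option is to build the argument list directly and serialize the JSON with `json.dumps`. This sketch mirrors the command above; `reasoner_serve_argv` is a hypothetical helper, and it uses only the flags shown on this page.

```python
import json
from pathlib import Path


def reasoner_serve_argv(model="nvidia/Cosmos3-Nano", port=8000, media_root=None) -> list[str]:
    """Argument list equivalent to the single-GPU serve command above."""
    # Path.cwd().parent matches the shell expression $(dirname "$(pwd)").
    root = Path.cwd().parent if media_root is None else Path(media_root)
    overrides = {"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}
    return [
        "vllm", "serve", model,
        "--hf-overrides", json.dumps(overrides),
        "--tensor-parallel-size", "1",
        "--mm-encoder-tp-mode", "data",
        "--async-scheduling",
        "--allowed-local-media-path", str(root),
        "--media-io-kwargs", json.dumps({"video": {"num_frames": -1}}),
        "--port", str(port),
    ]
```

The list can be passed to `subprocess.Popen` with `CUDA_VISIBLE_DEVICES` set in the child environment; no shell is involved, so the JSON needs no extra quoting.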

<Note>
First startup may compile CUDA graphs for several minutes. Poll readiness with `curl -fsS http://127.0.0.1:8000/health` (cookbook notebooks wait until this succeeds).
</Note>
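
The readiness poll described in the note can be sketched with the standard library alone. The timeout and interval defaults below are illustrative assumptions, not values taken from the cookbook notebooks, and both function names are hypothetical.

```python
import time
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True when a GET to the URL returns HTTP 200; any connection error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False


def wait_until_ready(probe, timeout_s=900.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_ok("http://127.0.0.1:8000/health"))`. Passing the probe as a callable keeps the loop testable without a running server.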

Query with the OpenAI-compatible client (Qwen3-VL-style multimodal messages):

```python
import openai

image_url = (
    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/"
    "assets/cosmos3/inputs/vision/robot_153.jpg"
)

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Caption the image in detail."},
            ],
        }
    ],
    max_tokens=4096,
    seed=0,
)

print(response.choices[0].message.content)
```

Equivalent `curl` against chat completions:

```bash
curl -sS http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$(curl -sS http://localhost:8000/v1/models | python3 -c 'import sys,json; print(json.load(sys.stdin)["data"][0]["id"])')"'",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg"}},
        {"type": "text", "text": "Caption the image in detail."}
      ]
    }],
    "max_tokens": 4096,
    "seed": 0
  }'
```
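
When scripting around the `curl` call, the assistant text sits at `choices[0].message.content` in the response JSON. The helper below is a sketch with a hypothetical name; treating a top-level `error` key as a failure is an assumption about how the server reports problems.

```python
import json


def assistant_text(response_body: str) -> str:
    """Return the first choice's message content, or raise if it is missing or empty."""
    payload = json.loads(response_body)
    if "error" in payload:  # assumed error envelope
        raise RuntimeError(f"server returned an error: {payload['error']}")
    choices = payload.get("choices") or []
    if not choices:
        raise RuntimeError("response contains no choices")
    content = choices[0]["message"].get("content") or ""
    if not content.strip():
        raise RuntimeError("assistant content is empty")
    return content
```

Raising on empty content turns the "non-empty assistant text" success signal into something a CI job can assert on.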

<Check>
**Success:** Non-empty assistant `content` in the JSON response; `/v1/models` lists the served checkpoint; local `file://` media works only under paths allowed by `--allowed-local-media-path`.
</Check>
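
For local media, the request needs a `file://` URL whose path lies under the directory given to `--allowed-local-media-path`. The sketch below checks that containment on the client side before building the message part; `local_image_part` is a hypothetical helper, and the server still performs its own check.

```python
from pathlib import Path


def local_image_part(path: str, allowed_root: str) -> dict:
    """Build an image_url content part for a local file under allowed_root."""
    resolved = Path(path).resolve()
    root = Path(allowed_root).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside the allowed media root {root}")
    # as_uri() renders an absolute path as a file:// URL.
    return {"type": "image_url", "image_url": {"url": resolved.as_uri()}}
```

The returned dictionary can replace the remote `image_url` entry in the `messages` list of the Python example above.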

## Surface comparison

```text
                    ┌─────────────────────────────────────┐
  Text / vision in  │  Reasoner (vLLM)                    │  Text out
  ─────────────────►│  Cosmos3ReasonerForConditionalGen   │──────────────►
                    └─────────────────────────────────────┘

                    ┌─────────────────────────────────────┐
  Text / vision in  │  Generator                          │  MP4 / PNG / action
  ─────────────────►│  Diffusers (in-process)             │──────────────►
                    │  vLLM-Omni (OpenAI /v1/videos/sync) │
                    └─────────────────────────────────────┘
```

| Surface | Minimal path | Default model | Primary success signal |
| --- | --- | --- | --- |
| Generator | Diffusers `Cosmos3OmniPipeline` | `nvidia/Cosmos3-Nano` | MP4 written by `export_to_video` |
| Generator | vLLM-Omni `POST /v1/videos/sync` | `nvidia/Cosmos3-Nano` | `Application startup complete.` + MP4 bytes |
| Reasoner | `vllm serve` + `/v1/chat/completions` | `nvidia/Cosmos3-Nano` | `/health` OK + non-empty assistant text |

## Verify GPU and servers

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

For any OpenAI-compatible server on port 8000:

```bash
curl http://localhost:8000/v1/models
```
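
The `/v1/models` response lists served checkpoints under `data[*].id`, which is the same field the `curl` example reads with inline Python. A small sketch for scripts that need the IDs (the function name is hypothetical):

```python
import json


def served_model_ids(models_body: str) -> list[str]:
    """Return every model id from a /v1/models response body."""
    payload = json.loads(models_body)
    return [entry["id"] for entry in payload.get("data", [])]
```

An empty list means the server answered but is not serving a model, which is worth distinguishing from a connection failure.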

## Related pages

<CardGroup>
  <Card title="Installation" href="/installation">
    Prerequisites, CUDA driver pairing, venv and Docker setup, and environment verification beyond these minimal commands.
  </Card>
  <Card title="Choose an integration" href="/choose-integration">
    When to use Diffusers, vLLM-Omni, vLLM, Framework, or Transformers by goal.
  </Card>
  <Card title="Reasoner and Generator" href="/reasoner-and-generator">
    MoT modes, inputs/outputs, and which surface fits understanding vs generation.
  </Card>
  <Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
    Full Diffusers pipeline modes, schedulers, and notebook walkthrough.
  </Card>
  <Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
    Super tensor parallelism, action endpoints, and guardrail toggles.
  </Card>
  <Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
    Serve flags, video frame kwargs, and reasoning-format prompts.
  </Card>
  <Card title="Troubleshooting" href="/troubleshooting">
    CUDA/driver mismatches, `torch.cuda.is_available()` returning `False`, libxcb import errors on headless machines, and the DeepGEMM workaround.
  </Card>
</CardGroup>