# NVIDIA Cosmos 3 Documentation

> Source-grounded reference for Cosmos 3 omnimodal world models: Reasoner and Generator runtime surfaces, Hugging Face checkpoints, integration paths (Diffusers, vLLM-Omni, vLLM, Cosmos Framework), runnable cookbooks, and OpenAI-compatible serving APIs for Physical AI developers.

## Context Links

- [Agent index](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms.txt)
- [Human interactive docs](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9)
- [GitHub repository](https://github.com/NVIDIA/cosmos)

## Repository Metadata

- Repository: NVIDIA/cosmos
- Generated: 2026-06-01T20:39:21.817Z
- Updated: 2026-06-01T20:39:30.764Z
- Runtime: Grok CLI
- Format: Documentation
- Pages: 25

## Page Index

- 01. [Overview](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/01-overview.md) - Cosmos 3 omnimodal world model surfaces (Reasoner vs Generator), primary entry points, supported modalities, and the shortest path to a first generation or reasoning call.
- 02. [Installation](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/02-installation.md) - Prerequisites (Linux, NVIDIA GPU, uv, Hugging Face auth), CUDA driver pairing (cu130/cu128), venv and Docker setup paths, and environment verification commands.
- 03. [Quickstart](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/03-quickstart.md) - Minimal first-run commands for Generator (Diffusers text-to-video, vLLM-Omni curl) and Reasoner (vLLM serve + OpenAI chat completion), including HF login and expected success signals.
- 04. [Choose an integration](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/04-choose-an-integration.md) - Decision matrix for Diffusers, vLLM-Omni, vLLM, Transformers (coming soon), and Cosmos Framework by goal: research, production inference, training, or evaluation.
- 05. [Reasoner and Generator](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/05-reasoner-and-generator.md) - MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface.
- 06. [Model family](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/06-model-family.md) - Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving.
- 07. [Input and output specifications](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/07-input-and-output-specifications.md) - Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs.
- 08. [Action modality](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/08-action-modality.md) - Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows.
- 09. [Cookbook environment setup](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/09-cookbook-environment-setup.md) - Shared uv/Docker setup for all backends: HF auth, CUDA backend tags, Cosmos Framework clone/sync, Diffusers venv, vLLM + vllm-cosmos3 plugin, vLLM-Omni Docker image, and GPU verification probes.
- 10. [Run Generator with Diffusers](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/10-run-generator-with-diffusers.md) - Install Cosmos3OmniPipeline dependencies, configure UniPC scheduler flow_shift, run text-to-image/video and image-to-video with structured JSON prompts, and export MP4 outputs.
- 11. [Run Generator with vLLM-Omni](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/11-run-generator-with-vllm-omni.md) - Start vllm/vllm-omni:cosmos3 Docker server, tensor-parallel and CFG/Ulysses options for Super, POST vision/action endpoints, guardrails toggles, and deploy-config for server-wide guardrail disable.
- 12. [Run Generator with Cosmos Framework](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/12-run-generator-with-cosmos-framework.md) - Clone cosmos-framework, uv sync cu130-train/cu128-train groups, torchrun cosmos_framework.scripts.inference with parallelism presets, checkpoint-path, and JSON input specs from cookbook assets.
- 13. [Run Generator action workflows](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/13-run-generator-action-workflows.md) - Forward dynamics (image + action trajectory) and inverse dynamics (video + instruction) across Framework torchrun and vLLM-Omni multipart /v1/videos requests with domain_name and action_mode extra_params.
- 14. [Run Reasoner with vLLM](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/14-run-reasoner-with-vllm.md) - Install vllm-cosmos3 plugin, serve Cosmos3ReasonerForConditionalGeneration with mm-encoder and media-io-kwargs, Qwen3-VL-compatible chat messages, and reasoning-format prompt suffix.
- 15. [Run Reasoner with Cosmos Framework](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/15-run-reasoner-with-cosmos-framework.md) - Build reasoner JSON inputs (model_mode, vision_path, enable_sound), run cosmos_framework.scripts.inference with latency preset, and read reasoner_text.txt outputs; scale Nano to Super via torchrun.
- 16. [vLLM-Omni API reference](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/16-vllm-omni-api-reference.md) - OpenAI-compatible endpoints (/v1/images/generations, /v1/videos, /v1/videos/sync), request fields (prompt, size, num_frames, guidance_scale, extra_params), action_mode values, and curl --form-string constraints.
- 17. [Diffusers pipeline reference](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/17-diffusers-pipeline-reference.md) - Cosmos3OmniPipeline.from_pretrained modes (text-to-image, text-to-video, image-to-video, text-to-video-with-sound), key call arguments, export_to_video, and torch-backend install pairing.
- 18. [Reasoner vLLM configuration](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/18-reasoner-vllm-configuration.md) - vllm serve flags: hf-overrides architectures, tensor-parallel-size, mm-encoder-tp-mode, async-scheduling, allowed-local-media-path, media-io-kwargs, VLLM_USE_DEEP_GEMM, and vLLM/cu130 version pairs.
- 19. [Sampling and prompt parameters](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/19-sampling-and-prompt-parameters.md) - Generator prompt-upsampling defaults, Reasoner sampling tables (with/without reasoning), structured JSON prompt schema, Qwen3-VL message shape, and redacted_reasoning format instruction.
- 20. [Audiovisual cookbook recipes](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/20-audiovisual-cookbook-recipes.md) - End-to-end notebooks for text-to-image, text-to-video, image-to-video with optional sound across Diffusers, Cosmos Framework, and vLLM-Omni; asset layout under assets/prompts and assets/images.
- 21. [Action cookbook recipes](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/21-action-cookbook-recipes.md) - Forward-dynamics and inverse-dynamics notebooks for AV, DROID, and UMI with checked-in trajectories, LeRobot sample data, and Framework vs vLLM-Omni output directories.
- 22. [Reasoner cookbook recipes](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/22-reasoner-cookbook-recipes.md) - Runnable workflows for captioning, temporal localization, embodied/common-sense reasoning, 2D grounding, describe-anything, action CoT, physical plausibility, and situation understanding with bundled media assets.
- 23. [Inference benchmarks](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/23-inference-benchmarks.md) - Published latency tables for Cosmos3-Nano/Super Generator (PyTorch, vLLM-Omni, Diffusers by GPU/resolution/TP) and Reasoner vLLM serving metrics (TTFT, throughput at concurrency tiers).
- 24. [Troubleshooting](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/24-troubleshooting.md) - CUDA/driver mismatches, NGC container selection, torch.cuda unavailable fixes, libxcb headless imports, uv version and --torch-backend errors, and VLLM_USE_DEEP_GEMM workaround.
- 25. [Ecosystem, license, and release](https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/25-ecosystem-license-and-release.md) - Related Cosmos projects (Framework, Curator, Evaluator), OpenMDW-1.1 license terms, known model limitations, release cadence pointers, and third-party dependency notices.

## Source File Index

- `.gitignore`
- `cookbooks/cosmos3/cosmos3-model-architecture.png`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json`
- `cookbooks/cosmos3/generator/action/assets/actions/umi.json`
- `cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json`
- `cookbooks/cosmos3/generator/action/assets/images/av_0.jpg`
- `cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/car_driving.jpg`
- `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/humanoid_robot.jpg`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/image2video/neg_prompt.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/coastal_road_audio.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/humanoid_robot.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/car_colliding.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_pouring_water_audio.json`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/reasoner/assets/action_cot_driving_scene.mp4`
- `cookbooks/cosmos3/reasoner/assets/common_sense_reasoning.mp4`
- `cookbooks/cosmos3/reasoner/assets/describe_anything.png`
- `cookbooks/cosmos3/reasoner/assets/grounding_2d.png`
- `cookbooks/cosmos3/reasoner/assets/physical_plausibility.mp4`
- `cookbooks/cosmos3/reasoner/assets/robot_planning.png`
- `cookbooks/cosmos3/reasoner/assets/robotics_next_action.mp4`
- `cookbooks/cosmos3/reasoner/assets/temporal_localization_1.mp4`
- `cookbooks/cosmos3/reasoner/assets/video_caption.mp4`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`
- `inference_benchmarks.md`
- `LICENSE`
- `README.md`
- `RELEASE.md`

---

## 01. Overview

> Cosmos 3 omnimodal world model surfaces (Reasoner vs Generator), primary entry points, supported modalities, and the shortest path to a first generation or reasoning call.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/01-overview.md
- Generated: 2026-06-01T20:21:06.971Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/cosmos3-model-architecture.png`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`

---
title: "Overview"
description: "Cosmos 3 omnimodal world model surfaces (Reasoner vs Generator), primary entry points, supported modalities, and the shortest path to a first generation or reasoning call."
---

Cosmos 3 is a family of omnimodal world models exposed as two runtime surfaces: **Reasoner** (autoregressive text output from text and vision input) and **Generator** (diffusion output for vision, sound, and action). Runnable paths go through Hugging Face Diffusers, vLLM-Omni, vLLM, and the separate [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) checkout, which the cookbooks under `cookbooks/cosmos3/` reference.

## Runtime surfaces

Cosmos 3 routes workloads through a shared Mixture-of-Transformers (MoT) backbone but switches operating mode by task:

| Surface | Inputs | Outputs | Typical workloads |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image/video) | Text | Captioning, temporal localization, grounding, embodied and common-sense reasoning, physical plausibility, situation understanding |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Text-to-image/video, image-to-video, video-to-video, synchronized sound, forward/inverse dynamics, policy rollouts |

<Note>
Reasoner production serving loads only the reasoner path (`Cosmos3ReasonerForConditionalGeneration` via vLLM). Generator production serving loads the full omni checkpoint (reasoner + diffusion) through vLLM-Omni or Diffusers.
</Note>

```mermaid
flowchart TB
  subgraph inputs [Inputs]
    T[Text]
    V[Vision image/video]
    S[Sound]
    A[Action JSON]
  end

  subgraph mot [Cosmos 3 MoT backbone]
    AR[Reasoner mode — causal AR transformer]
    DM[Generator mode — diffusion transformer]
    mRoPE[Shared mRoPE across modalities]
  end

  subgraph outputs [Outputs]
    TXT[Text]
    IMG[Image JPG]
    VID[Video MP4]
    AUD[Stereo AAC in MP4]
    ACT[Action JSON]
  end

  T --> AR
  V --> AR
  AR --> TXT

  T --> DM
  V --> DM
  S --> DM
  A --> DM
  mRoPE --- AR
  mRoPE --- DM
  DM --> IMG
  DM --> VID
  DM --> AUD
  DM --> ACT
```

In **Reasoner mode**, language and visual understanding tokens use causal self-attention for next-token prediction. In **Generator mode**, noisy image, video, audio, and action tokens are denoised with full attention so multimodal outputs stay coherent. Both modes share transformer blocks and a unified 3D multi-dimensional rotary position embedding (mRoPE) for spatial and temporal structure.
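
The difference between the two attention patterns can be illustrated with plain boolean masks. This is a conceptual sketch only, not the Cosmos 3 implementation: the function names are hypothetical, and real models apply such masks inside the attention kernel rather than as Python lists.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))  # lower-triangular: next-token prediction
    print()
    print(render(full_mask(4)))    # dense: joint denoising over all tokens
```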

<Frame caption="Cosmos 3 MoT architecture: shared backbone, Reasoner AR path vs Generator diffusion path.">
  ![Cosmos 3 model architecture](cookbooks/cosmos3/cosmos3-model-architecture.png)
</Frame>

## Supported modalities and formats

| Direction | Types | Formats / notes |
| --- | --- | --- |
| **Inputs** | Text; text + image; text + video; text + image + action | Text string; JPG/PNG/JPEG/WEBP; MP4; JSON action arrays |
| **Outputs** | Image, video, sound, action state, text | JPG; MP4; stereo AAC muxed into MP4 when generated with video; JSON actions; text string |
| **Vision conditioning** | Resolution tiers 256p / 480p / 720p | 720p: 1280×720; 480p: 832×480; 256p: 320×192; video conditioning uses 5 frames at matching resolution |
| **Action conditioning** | Embodiment-dependent dims | Examples: camera/AV 9D; DROID/UMI 10D; humanoid 29D (AgiBot) |
| **Generation defaults** | Resolution, aspect, timing | Default 480p, 16:9, 24 FPS, 189 frames; prompts under ~300 words recommended |

Action workflows treat action as tokens between visual states (9D pose deltas plus grasp state where applicable). Forward dynamics predicts future video from a start image and trajectory; inverse dynamics predicts trajectories from video; policy mode returns video plus a predicted action chunk.
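
Before sending a trajectory to the Generator, it can help to check that every step has the width expected for the embodiment. The sketch below assumes a trajectory is a plain list of per-step float vectors; the actual JSON layout of the cookbook action files is not shown on this page, and the dictionary keys are illustrative labels rather than confirmed `domain_name` values.

```python
# Per-step action widths quoted on this page. The keys are illustrative labels,
# not confirmed `domain_name` values.
ACTION_DIMS = {"av": 9, "droid": 10, "umi": 10, "humanoid": 29}


def check_trajectory(embodiment: str, trajectory: list[list[float]]) -> int:
    """Return the number of steps, or raise ValueError on a malformed trajectory."""
    expected = ACTION_DIMS[embodiment]  # KeyError for an unknown embodiment label
    if not trajectory:
        raise ValueError("trajectory is empty")
    for index, step in enumerate(trajectory):
        if len(step) != expected:
            raise ValueError(
                f"step {index} has {len(step)} values, expected {expected} for {embodiment}"
            )
    return len(trajectory)
```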

## Primary entry points

Integrations map to goals rather than a single runtime:

| Goal | Entry point | How you invoke it |
| --- | --- | --- |
| Generator research / Python iteration | **Diffusers** `Cosmos3OmniPipeline` | `from_pretrained("nvidia/Cosmos3-Nano")` then `pipe(...)` |
| Generator production API | **vLLM-Omni** | `vllm serve nvidia/Cosmos3-Nano --omni --model-class-name Cosmos3OmniDiffusersPipeline` |
| Reasoner production API | **vLLM** + `vllm-cosmos3` plugin | `vllm serve` with `Cosmos3ReasonerForConditionalGeneration` overrides |
| Native PyTorch inference / training hooks | **Cosmos Framework** | `torchrun -m cosmos_framework.scripts.inference` from `packages/cosmos3` checkout |
| Reasoner research (HF) | **Transformers** | Coming soon |

<Warning>
Match the CUDA driver, `--torch-backend` (`cu130` or `cu128`), and vLLM version as a set. `--torch-backend=auto` is not reliable for vLLM installs, so the cookbook vLLM paths pin explicit pairs (for example `cu130` with `vllm==0.21.0`). Diffusers-only installs can use `auto`.
</Warning>

### Repository layout

:::files
cosmos/                          # This repo — cookbooks, benchmarks, docs pointers
├── README.md                      # Model family, I/O specs, quickstarts
├── inference_benchmarks.md        # Generator latency + Reasoner serving tables
└── cookbooks/cosmos3/
    ├── README.md                  # Shared uv/Docker backend setup
    ├── generator/
    │   ├── audiovisual/           # t2i, t2v, i2v (+ sound) notebooks
    │   └── action/                # Forward / inverse dynamics notebooks
    └── reasoner/                  # Image reasoning (Framework); image+video (vLLM)

packages/cosmos3/                  # Created during setup — cloned cosmos-framework
└── .venv/                         # Framework torchrun interpreter
:::

Checkpoints live on Hugging Face (`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`, task-specific Super variants, and `nvidia/Cosmos3-Nano-Policy-DROID`). Authenticate before first download:

```bash
uvx hf@latest auth login
```

## Shortest path to a first call

<Steps>
<Step title="Authenticate to Hugging Face">
Run `uvx hf@latest auth login` (or set `HF_TOKEN`) so gated `nvidia/Cosmos3-*` weights can download. Optional: set `HF_HOME` for a shared cache location.
</Step>
<Step title="Pick a surface and backend">
Use Generator + Diffusers for the fastest local Python generation without a server, or Generator + vLLM-Omni / Reasoner + vLLM when you need an OpenAI-compatible HTTP API on port 8000.
</Step>
<Step title="Run one minimal command">
Follow the tab for your surface. Expect a long first run while weights download; diffusion Generator runs are compute-heavy by design.
</Step>
</Steps>

<Tabs>
<Tab title="Generator — Diffusers (Python)">
```python
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.utils import export_to_video

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

result = pipe(
    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
    num_frames=189,
    height=720,
    width=1280,
    fps=24.0,
)

export_to_video(result.video, "cosmos3_t2v.mp4", fps=24, macro_block_size=1)
```

Success signal: an MP4 file on disk after all denoising steps complete (not an immediate return).
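
The call above hard-codes a 720p size. A small helper can derive the same keyword arguments from the resolution tiers listed under "Supported modalities and formats" and report the resulting clip length. This is a sketch: `TIER_SIZES`, `clip_seconds`, and `call_kwargs` are hypothetical names, and the keyword names simply mirror the call shown above.

```python
# (width, height) per tier, copied from the modalities table on this page.
TIER_SIZES = {"720p": (1280, 720), "480p": (832, 480), "256p": (320, 192)}


def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback length of a clip in seconds."""
    return num_frames / fps


def call_kwargs(tier: str, num_frames: int = 189, fps: float = 24.0) -> dict:
    """Build height/width/num_frames/fps arguments for a given tier."""
    width, height = TIER_SIZES[tier]
    return {"height": height, "width": width, "num_frames": num_frames, "fps": fps}


if __name__ == "__main__":
    print(call_kwargs("480p"))
    print(clip_seconds(189, 24.0))  # 7.875 seconds at the documented defaults
```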
</Tab>
<Tab title="Generator — vLLM-Omni (HTTP)">
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "guidance_scale=4.0" \
  -o cosmos3_t2v_output.mp4
```

Success signals: server log shows `Application startup complete.`; `curl http://localhost:8000/v1/models` lists the model; sync POST writes MP4 bytes.
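
To script the readiness check instead of watching the log, a client can poll `/v1/models` until a model is listed. The sketch below uses only the standard library and assumes the endpoint returns the usual OpenAI-style `{"data": [{"id": ...}]}` payload; the function names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url: str, timeout_s: float = 5.0) -> list[str]:
    """GET {base_url}/v1/models and return the listed model IDs."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, attempts=60, delay_s=5.0,
                     fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until at least one model is listed; raise TimeoutError otherwise."""
    for _ in range(attempts):
        try:
            model_ids = fetch(base_url)
        except (OSError, ValueError):
            # Connection refused, timeouts, and malformed JSON all mean "not ready yet".
            model_ids = []
        if model_ids:
            return model_ids
        sleep(delay_s)
    raise TimeoutError(f"no models listed at {base_url} after {attempts} attempts")
```

Calling `wait_until_ready("http://localhost:8000")` before the first POST avoids sending a generation request while weights are still loading.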
</Tab>
<Tab title="Reasoner — vLLM (HTTP)">
```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"

vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

```python
import openai

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/robot.jpg"}},
            {"type": "text", "text": "Caption the image in detail."},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

Success signal: non-empty `message.content` from `/v1/chat/completions`. Messages follow Qwen3-VL-compatible image/video content shapes.
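
For video inputs, the message shape differs only in the media part. The helper below is a sketch that assumes the server accepts a `video_url` content part alongside `image_url`, as vLLM's OpenAI-compatible multimodal API generally does; the helper names and the extension-based dispatch are illustrative, not part of the cookbook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

VIDEO_SUFFIXES = {".mp4"}  # MP4 is the video format listed on this page


def media_part(url: str) -> dict:
    """Pick a video_url or image_url content part based on the file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in VIDEO_SUFFIXES:
        return {"type": "video_url", "video_url": {"url": url}}
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(url: str, question: str) -> dict:
    """Build one user turn: media first, then the text instruction."""
    return {
        "role": "user",
        "content": [media_part(url), {"type": "text", "text": question}],
    }
```

The result of `user_message(...)` can be passed as the single element of `messages` in the `chat.completions.create` call above.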
</Tab>
</Tabs>

Cookbook quickstarts under `cookbooks/cosmos3/` mirror these paths with checked-in JSON prompts—for example audiovisual Generator examples use `assets/prompts/text2video/robot_kitchen.json`, and Reasoner Framework smoke tests write `reasoner_text.txt` under an output directory.

## Model family (at a glance)

| Checkpoint | Size | Focus |
| --- | ---: | --- |
| `nvidia/Cosmos3-Nano` | 16B | Compact omnimodal model for understanding, simulation, and action reasoning |
| `nvidia/Cosmos3-Super` | 64B | Frontier-scale omnimodal model |
| `nvidia/Cosmos3-Super-Text2Image` | 64B | High-fidelity text-to-image |
| `nvidia/Cosmos3-Super-Image2Video` | 64B | Image-to-video |
| `nvidia/Cosmos3-Nano-Policy-DROID` | 16B | Vision-language robot policy for DROID |

Super Generator serving typically needs multi-GPU tensor parallelism (`--tensor-parallel-size`) and optional layerwise offload; Nano fits single-GPU cookbook defaults.
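
As a rough illustration, the serve command for Nano and Super can be assembled from the same template, adding `--tensor-parallel-size` only when more than one GPU is used. The flags are the ones shown on this page; the helper name and the GPU count in the example are assumptions, since the required parallelism for Super is not stated here.

```python
import shlex


def omni_serve_argv(model: str, tensor_parallel_size: int = 1, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` command line for Generator serving."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    argv = [
        "vllm", "serve", model,
        "--omni",
        "--model-class-name", "Cosmos3OmniDiffusersPipeline",
        "--port", str(port),
    ]
    if tensor_parallel_size > 1:
        argv += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return argv


if __name__ == "__main__":
    # The value 4 is a placeholder, not a documented requirement.
    print(shlex.join(omni_serve_argv("nvidia/Cosmos3-Super", tensor_parallel_size=4)))
```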

## Ecosystem and constraints

| Project | Role |
| --- | --- |
| [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) | Training, native `torchrun` inference, `vllm-cosmos3` plugin source |
| [Cosmos Curator](https://github.com/NVIDIA/cosmos-curator) | Physical AI data curation |
| [Cosmos Evaluator](https://github.com/NVIDIA/cosmos-evaluator) | Automated evaluation for generation and reasoning outputs |

Cosmos 3 can show temporal inconsistency, motion artifacts, sound–video misalignment, and imperfect action consistency. Safety-critical or physically grounded deployment needs validation beyond model outputs. Source and weights use the [OpenMDW-1.1](https://openmdw.ai/license/1-1/) license.

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, CUDA pairing, venv and Docker setup, GPU verification.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run commands for Generator and Reasoner with expected success signals.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT modes, shared mRoPE, and when to use each surface.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Diffusers vs vLLM-Omni vs vLLM vs Cosmos Framework by goal.
</Card>
<Card title="Model family" href="/model-family">
Full checkpoint catalog and serving tradeoffs.
</Card>
<Card title="Cookbook environment" href="/cookbook-environment">
Shared backend setup for all `cookbooks/cosmos3` notebooks.
</Card>
</CardGroup>

---

## 02. Installation

> Prerequisites (Linux, NVIDIA GPU, uv, Hugging Face auth), CUDA driver pairing (cu130/cu128), venv and Docker setup paths, and environment verification commands.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/02-installation.md
- Generated: 2026-06-01T20:21:17.718Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `.gitignore`

---
title: "Installation"
description: "Prerequisites (Linux, NVIDIA GPU, uv, Hugging Face auth), CUDA driver pairing (cu130/cu128), venv and Docker setup paths, and environment verification commands."
---

Cosmos 3 installs through **uv-managed Python 3.13 virtual environments** for Diffusers, Cosmos Framework, and vLLM Reasoner paths, or through the **`vllm/vllm-omni:cosmos3` Docker image** for Generator production serving. Every path requires **Linux**, **NVIDIA GPU access**, **Hugging Face authentication** for gated checkpoints, and a **CUDA driver / PyTorch backend pair** (`cu130` or `cu128`) that matches what `nvidia-smi` reports.

## Prerequisites

| Requirement | Details |
| --- | --- |
| Operating system | Linux (documented and tested for Cosmos 3) |
| GPU | NVIDIA Ampere, Hopper, or Blackwell with working driver |
| Package manager | [`uv`](https://docs.astral.sh/uv/getting-started/installation/) **≥ 0.11.3** (older versions fail on framework `pyproject.toml` and newer `--torch-backend` values such as `cu130`) |
| Version control | `git` and `git-lfs` |
| Model access | Hugging Face account with access to gated Cosmos 3 repos |
| Disk | Tens of GiB for venv, uv cache, and model weights (Nano downloads plus CUDA deps) |

<Warning>
Upgrade `uv` before any install if you see `a value is required for '--torch-backend'` or accepted-values lists stopping at `cu129`: run `uv self update` or reinstall from https://astral.sh/uv.
</Warning>
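
A setup script can check the minimum `uv` version up front rather than waiting for the error. The sketch below assumes `uv --version` prints a line containing a dotted `major.minor.patch` version; the function names are hypothetical. It compares versions as integer tuples, because a plain string comparison would rank `0.9.30` above `0.11.3`.

```python
import re

MIN_UV = (0, 11, 3)  # minimum version stated in the prerequisites table


def parse_uv_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from `uv --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups())


def uv_is_new_enough(output: str) -> bool:
    """True when the reported version is at least MIN_UV."""
    return parse_uv_version(output) >= MIN_UV
```

Feed it the captured output of `uv --version`, for example via `subprocess.run(["uv", "--version"], capture_output=True, text=True).stdout`.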

### Hugging Face authentication

Authenticate once before the first model download:

```bash
uvx hf@latest auth login
```

Alternatively set a token in the environment:

```bash
export HF_TOKEN=<your_token>
```

<ParamField body="HF_HOME" type="string">
Optional cache root for checkpoints. Notebooks default to `~/.cache/huggingface`; point this at a volume with enough space for downloads of tens of GiB.
</ParamField>
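
A quick preflight check can report which cache directory will be used and how much space is free there. This sketch reads only the two environment variables described above; the function names are hypothetical. It cannot see a login stored by the CLI, so a missing `HF_TOKEN` does not by itself mean you are unauthenticated.

```python
import os
import shutil
from pathlib import Path


def hf_settings(env=None) -> dict:
    """Resolve the token flag and cache root from environment variables."""
    env = os.environ if env is None else env
    cache_root = Path(env.get("HF_HOME") or "~/.cache/huggingface").expanduser()
    return {"has_token": bool(env.get("HF_TOKEN")), "cache_root": cache_root}


def free_gib(path: Path) -> float:
    """Free space in GiB on the filesystem that holds (or will hold) path."""
    probe = path
    while not probe.exists():
        probe = probe.parent  # walk up to the nearest existing directory
    return shutil.disk_usage(probe).free / 2**30


if __name__ == "__main__":
    settings = hf_settings()
    print(settings, f"{free_gib(settings['cache_root']):.1f} GiB free")
```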

### Cosmos Framework checkout access

Cosmos Framework and vLLM Reasoner installs pull from `NVIDIA/cosmos-framework` (HTTPS or SSH). For SSH:

```bash
export COSMOS3_GIT_URL=git@github.com:NVIDIA/cosmos-framework.git
```

The cookbook clones into `packages/cosmos3` at the repo root; that directory is **gitignored** and created locally.

## CUDA driver and backend pairing

System CUDA and PyTorch’s CUDA build must align. Confirm with:

```bash
nvidia-smi
python -c "import torch; print(torch.version.cuda)"
```

| Driver CUDA | Backend tag | Typical pairing |
| --- | --- | --- |
| 13.x | `cu130` | `vllm==0.21.0`, `COSMOS3_UV_GROUP=cu130-train`, Diffusers `--torch-backend=cu130` |
| 12.x | `cu128` | `vllm==0.19.1`, `COSMOS3_UV_GROUP=cu128-train`, Diffusers `--torch-backend=cu128` |

<Note>
CUDA **13** is recommended; **12.8** is supported. vLLM does not publish wheels for every CUDA minor version — **do not rely on `--torch-backend=auto` for vLLM**; pick the explicit pair above.
</Note>

For Diffusers-only venvs, `--torch-backend=auto` lets uv match the driver. Without it, uv may install the newest CUDA wheel (`cu130`), which fails on older drivers with `The NVIDIA driver on your system is too old` and leaves `torch.cuda.is_available()` returning `False`.
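
The pairing table can also be applied mechanically. The sketch below reads the `CUDA Version:` field from the `nvidia-smi` header and looks up the matching backend tag, vLLM pin, and uv group. The function names are hypothetical, and the lookup follows the table as written, keyed on the major version only.

```python
import re

# Pairings copied from the table above:
# driver CUDA major -> (backend tag, vLLM pin, uv group).
PAIRINGS = {
    13: ("cu130", "vllm==0.21.0", "cu130-train"),
    12: ("cu128", "vllm==0.19.1", "cu128-train"),
}


def driver_cuda_major(nvidia_smi_output: str) -> int:
    """Return the major CUDA version reported in the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.\d+", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1))


def pick_pairing(nvidia_smi_output: str) -> tuple[str, str, str]:
    """Map nvidia-smi output to the documented (backend, vLLM pin, uv group)."""
    major = driver_cuda_major(nvidia_smi_output)
    if major not in PAIRINGS:
        raise ValueError(f"no documented pairing for driver CUDA {major}.x")
    return PAIRINGS[major]
```

Pass it the captured standard output of `nvidia-smi`; an unsupported driver raises `ValueError` instead of silently choosing a backend.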

### NGC base containers (optional)

When using NVIDIA NGC PyTorch images instead of bare-metal uv:

| CUDA | Container |
| --- | --- |
| 13 | `nvcr.io/nvidia/pytorch:25.09-py3` |
| 12 | `nvcr.io/nvidia/pytorch:25.06-py3` |

## Setup paths overview

```text
                    ┌─────────────────────────────────────┐
                    │  Linux + NVIDIA GPU + HF auth       │
                    └─────────────────┬───────────────────┘
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          │                           │                           │
          v                           v                           v
   ┌──────────────┐           ┌──────────────┐            ┌─────────────────┐
   │ Diffusers    │           │ Cosmos       │            │ vLLM Reasoner   │
   │ .venv (uv)   │           │ Framework    │            │ .venv (uv)      │
   │ Generator    │           │ packages/    │            │ + vllm-cosmos3  │
   │ research     │           │ cosmos3/     │            │ plugin          │
   └──────────────┘           └──────────────┘            └─────────────────┘
          │                           │
          │                           │
          └───────────────┬───────────┘
                          v
                 ┌────────────────────┐
                 │ vLLM-Omni Docker   │
                 │ vllm/vllm-omni:    │
                 │ cosmos3            │
                 │ Generator serving  │
                 └────────────────────┘
```

| Path | Surface | Install method | Default artifact |
| --- | --- | --- | --- |
| Diffusers venv | Generator | `uv venv` + `uv pip install` at repo root or `COSMOS3_DIFFUSERS_VENV` | `.venv` or `.venv-cosmos3-diffusers` |
| Cosmos Framework | Generator, Reasoner | `git clone` + `uv sync --all-extras --group=cu130-train` | `packages/cosmos3/.venv` |
| vLLM venv | Reasoner | `uv venv` + `vllm` + `vllm-cosmos3` | `.venv` at repo root |
| vLLM-Omni Docker | Generator | `docker pull` + `docker run` | Prebuilt image, port 8000 |

Backend-specific runbooks live under `cookbooks/cosmos3/README.md`; this page covers shared install mechanics.

## Virtual environment setup

<Tabs>
<Tab title="Diffusers (Generator)">

<Steps>
<Step title="Create and activate venv">

From the `cosmos` repo root:

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
```

Notebooks may use a dedicated path:

```bash
export COSMOS3_DIFFUSERS_VENV=/path/to/.venv-cosmos3-diffusers
export COSMOS3_TORCH_BACKEND=cu130   # or cu128 on CUDA 12.x drivers
```

</Step>
<Step title="Install dependencies">

```bash
uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers
```

For driver auto-detection (Diffusers only): replace `--torch-backend=cu130` with `--torch-backend=auto`.

</Step>
</Steps>

</Tab>
<Tab title="Cosmos Framework">

<Steps>
<Step title="Clone framework">

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3
```

Or set `COSMOS3_REPO` / `COSMOS3_GIT_URL` before notebook cells run the clone for you.

</Step>
<Step title="Sync training extras">

Inference imports training extras today; use the `*-train` group matching your driver:

```bash
export GIT_LFS_SKIP_SMUDGE=1

# CUDA 13 driver (default):
uv sync --all-extras --group=cu130-train

# CUDA 12.x driver:
# uv sync --all-extras --group=cu128-train
```

Notebooks honor `COSMOS3_UV_GROUP` (default `cu130-train`). Export `COSMOS3_UV_GROUP=cu128-train` on CUDA 12.x before launch.

</Step>
<Step title="Use the venv">

Activate `packages/cosmos3/.venv`, or call `packages/cosmos3/.venv/bin/python` / `packages/cosmos3/.venv/bin/torchrun` by absolute path without activating.

</Step>
</Steps>

</Tab>
<Tab title="vLLM (Reasoner)">

<Steps>
<Step title="Create venv">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
```

</Step>
<Step title="Install vLLM and plugin">

```bash
# CUDA 13 driver:
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"

# CUDA 12.x driver:
# uv pip install --torch-backend=cu128 "vllm==0.19.1" \
#   "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

The Reasoner notebook also installs `transformers-cosmos3` from the same framework checkout when building a local venv beside `packages/cosmos3`.

</Step>
<Step title="Optional DeepGEMM workaround">

If the build reports DeepGEMM unavailable:

```bash
export VLLM_USE_DEEP_GEMM=0
```

</Step>
</Steps>

<Tip>
When launching `.venv/bin/vllm` without activating the venv, keep `.venv/bin` on `PATH` so FlashInfer’s JIT build can find `ninja` in the venv.
</Tip>

</Tab>
<Tab title="vLLM-Omni (venv, partial)">

Until upstream PRs merge all modalities, the **Docker image** is the supported full-modality build. For text-to-image, text-to-video, and image-to-video from the PR branch:

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 \
  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
# CUDA 12.x: use --torch-backend=cu128 instead
```

Then run `vllm serve` directly without the Docker wrapper.

</Tab>
</Tabs>

### Headless graphics libraries

On minimal servers, imports may fail with `libxcb.so.1: cannot open shared object file`. On Debian or Ubuntu, install the missing libraries (as root or with `sudo`):

```bash
apt-get install -y libxcb1 libgl1 libglib2.0-0
```

## Docker setup (vLLM-Omni Generator)

<Steps>
<Step title="Pull image">

```bash
docker pull vllm/vllm-omni:cosmos3
```

</Step>
<Step title="Run Cosmos3-Nano server">

Mount Hugging Face cache and any host directory with local media or action files:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800
```

Success signal: log line **`Application startup complete.`**

</Step>
<Step title="Run Cosmos3-Super (optional)">

On four GPUs with layer offload:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --port 8000 \
  --init-timeout 1800
```

Parallelism degrees multiply: ensure GPU count ≥ `tensor_parallel_size × cfg_parallel_size × ulysses_degree`.

</Step>
</Steps>

Generator notebooks expect a running server and optional endpoint overrides:

```bash
export COSMOS3_VLLM_BASE_URL=http://localhost:8000
export COSMOS3_VLLM_NANO_BASE_URL=http://localhost:8000
export COSMOS3_VLLM_SUPER_BASE_URL=http://localhost:8000
```
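
A notebook-side lookup for these overrides could be sketched as follows. The three variable names come from the exports above; the helper name `generator_base_url` and the precedence order (model-specific variable, then the shared variable, then `http://localhost:8000`) are assumptions for illustration, not confirmed notebook behavior.

```python
import os

# Fallback matches the port published by the docker run commands above.
DEFAULT_BASE_URL = "http://localhost:8000"


def generator_base_url(model: str, env=None) -> str:
    """Pick the vLLM-Omni base URL for "nano" or "super".

    Assumed precedence (hypothetical, not taken from the notebooks):
    COSMOS3_VLLM_<MODEL>_BASE_URL, then COSMOS3_VLLM_BASE_URL, then the default.
    """
    env = os.environ if env is None else env
    specific = f"COSMOS3_VLLM_{model.upper()}_BASE_URL"
    url = env.get(specific) or env.get("COSMOS3_VLLM_BASE_URL") or DEFAULT_BASE_URL
    # Drop a trailing slash so callers can append "/v1/..." safely.
    return url.rstrip("/")
```

Calling `generator_base_url("super")` with no variables set falls back to the local default, so a single-server setup needs no exports at all.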

## Environment verification

### PyTorch GPU probe (venv paths)

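Run from the directory that contains the venv: the repo root for the Diffusers or vLLM `.venv`, or `packages/cosmos3` for the Framework `.venv`. With the venv activated, plain `python` works in place of `.venv/bin/python`.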

```bash
.venv/bin/python - <<'PY'
import torch

print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

<Check>
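Expected: `cuda available: True`, non-zero `device count`, and a GPU name on device 0. `False` usually means a **cu130 wheel on a CUDA 12.x driver**. Reinstall with the CUDA 12 variant for your path: `--torch-backend=cu128` (Diffusers), `--group=cu128-train` (Framework), or `vllm==0.19.1` with `--torch-backend=cu128` (vLLM).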
</Check>

Framework notebooks run the same probe with `CUDA_VISIBLE_DEVICES` respected inside `packages/cosmos3`.

### vLLM / vLLM-Omni server probe

```bash
curl http://localhost:8000/v1/models
```

A JSON model list confirms the OpenAI-compatible API is up.
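
The same probe can be scripted with the Python standard library. This is a minimal sketch that assumes the usual OpenAI-style response shape (`{"data": [{"id": ...}]}`); the function names are illustrative.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list[str]:
    """GET <base_url>/v1/models and return the served model IDs."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

With a server running, `fetch_model_ids()` should return a non-empty list; a connection error means the server is not listening yet.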

### Driver vs wheel cross-check

```bash
nvidia-smi
python -c "import torch; print(torch.version.cuda)"
```

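The `CUDA Version` field in the `nvidia-smi` header reports the newest CUDA runtime the driver supports. Its major version should match the major version printed by `torch.version.cuda` and the `cu130` or `cu128` install path you chose.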
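
The comparison can be written down as a small check. The sketch below takes the two version strings as inputs (for example `"13.0"` from the `nvidia-smi` header and `"12.8"` from `torch.version.cuda`); the helper names are hypothetical, and the major-version rule simply restates the guidance on this page.

```python
def parse_cuda_version(text: str) -> tuple[int, int]:
    """Split a version string such as "12.8" into (major, minor)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def backend_tag(cuda_version: str) -> str:
    """Map a CUDA version string to a uv torch-backend tag, e.g. "12.8" -> "cu128"."""
    major, minor = parse_cuda_version(cuda_version)
    return f"cu{major}{minor}"


def wheel_matches_driver(driver_cuda: str, wheel_cuda: str) -> bool:
    """True when the driver and the installed wheel share a CUDA major version."""
    return parse_cuda_version(driver_cuda)[0] == parse_cuda_version(wheel_cuda)[0]
```

For example, `wheel_matches_driver("12.4", "13.0")` is `False`, which corresponds to the cu130-wheel-on-a-CUDA-12.x-driver failure described above.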

## Common install variables

| Variable | Used by | Purpose |
| --- | --- | --- |
| `HF_TOKEN` | All backends | Hugging Face auth when not using `hf auth login` |
| `HF_HOME` | Framework, Diffusers notebooks | Checkpoint cache location |
| `HF_HUB_DISABLE_XET` | Diffusers notebook | Disables XET transfer (notebook default) |
| `COSMOS3_UV_GROUP` | Framework notebooks | `cu130-train` or `cu128-train` for `uv sync` |
| `COSMOS3_TORCH_BACKEND` | Diffusers notebook | `cu130` or `cu128` for `uv pip install` |
| `COSMOS3_REPO` | Framework / vLLM notebooks | Framework checkout path (default `packages/cosmos3`) |
| `GIT_LFS_SKIP_SMUDGE` | Framework `uv sync` | Skips LFS smudge for unused test artifacts |
| `VLLM_USE_DEEP_GEMM` | vLLM Reasoner | Set `0` when DeepGEMM is unavailable |
| `COSMOS3_VLLM_BASE_URL` | vLLM-Omni notebooks | Generator API base URL |

Local runtime outputs under `cookbooks/cosmos3/**/outputs/` and `packages/` are gitignored — safe to delete and regenerate.

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Per-backend install detail, Framework clone/sync, and notebook env vars shared across Reasoner and Generator cookbooks.
</Card>
<Card title="Quickstart" href="/quickstart">
First-run commands after install: Diffusers generation, vLLM-Omni curl, and Reasoner chat completion.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Pick Diffusers, vLLM-Omni, vLLM, or Cosmos Framework by research vs production goal.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
CUDA/driver mismatches, NGC containers, `torch.cuda` false negatives, libxcb, uv version errors, and DeepGEMM workarounds.
</Card>
</CardGroup>

---

## 03. Quickstart

> Minimal first-run commands for Generator (Diffusers text-to-video, vLLM-Omni curl) and Reasoner (vLLM serve + OpenAI chat completion), including HF login and expected success signals.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/03-quickstart.md
- Generated: 2026-06-01T20:21:30.564Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json`

---
title: "Quickstart"
description: "Minimal first-run commands for Generator (Diffusers text-to-video, vLLM-Omni curl) and Reasoner (vLLM serve + OpenAI chat completion), including HF login and expected success signals."
---

Cosmos 3 exposes two runtime surfaces that share Hugging Face checkpoint access but use different install and serve paths: **Generator** (diffusion outputs via Diffusers or vLLM-Omni) and **Reasoner** (text outputs via vLLM chat completions). This page runs one minimal call per path: text-to-video for the Generator (once with Diffusers, once with vLLM-Omni) and image captioning for the Reasoner. Full environment matrices live on [Installation](/installation) and [Cookbook environment setup](/cookbook-environment).

## Prerequisites

| Requirement | Notes |
| --- | --- |
| Linux + NVIDIA GPU | Ampere, Hopper, or Blackwell; BF16 tested |
| `uv`, `git`, `git-lfs` | Framework/vLLM paths need `uv >= 0.11.3` for `--torch-backend=cu130` |
| Hugging Face access | Gated `nvidia/Cosmos3-*` repos |
| CUDA pairing | Match driver to `cu130` (CUDA 13) or `cu128` (CUDA 12.8); do not rely on `--torch-backend=auto` for vLLM |

<Warning>
vLLM wheels are paired to a CUDA minor version. On CUDA 12.x use `vllm==0.19.1` with `--torch-backend=cu128`; on CUDA 13.x use `vllm==0.21.0` with `--torch-backend=cu130`.
</Warning>

## Authenticate with Hugging Face

Create a token with access to the Cosmos 3 collection, then authenticate before the first checkpoint download:

```bash
uvx hf@latest auth login
```

Alternatively set `HF_TOKEN` in the environment. Use `HF_HOME` when you want a shared or larger cache directory.

<Check>
**Success:** `hf auth whoami` succeeds, or the first `from_pretrained` / `vllm serve` model download completes without 401/403 errors from Hugging Face.
</Check>

## Generator: Diffusers text-to-video

Install a Python 3.13 venv with Diffusers and a CUDA-matched `torch` build (example uses CUDA 13):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate av cosmos_guardrail huggingface_hub imageio imageio-ffmpeg \
  torch torchvision transformers
```

From `cookbooks/cosmos3/generator/audiovisual/`, run a minimal 720p text-to-video pass using the checked-in structured prompts:

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

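# Load the structured prompt assets; context managers close the files promptly.
with open("assets/prompts/text2video/robot_kitchen.json") as f:
    prompt = json.load(f)
with open("assets/negative_prompts/text2video/neg_prompt.json") as f:
    negative = json.load(f)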

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "/tmp/cosmos3_t2v_diffusers.mp4", fps=24)
```

<Note>
The first run downloads `nvidia/Cosmos3-Nano` and walks every diffusion step—long step times are expected, not a hang. For a plain string prompt without JSON assets, pass `prompt="..."` directly as in the root README quickstart.
</Note>

<Check>
**Success:** `/tmp/cosmos3_t2v_diffusers.mp4` exists and plays; GPU memory stays allocated during denoising; `torch.cuda.is_available()` returns `True` after install.
</Check>

## Generator: vLLM-Omni curl

Start the official Docker image (all Generator modalities; API on port 8000):

```bash
docker pull vllm/vllm-omni:cosmos3

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

<Check>
**Server ready:** Log line `Application startup complete.` and `curl http://localhost:8000/v1/models` returns model metadata.
</Check>

Send a blocking text-to-video request (writes MP4 to disk):

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=4.0" \
  --form-string "seed=42" \
  -o cosmos3_t2v_output.mp4
```

<Tip>
Use `--form-string` for `prompt`, `negative_prompt`, and `extra_params`. With `-F`, curl treats `;` as a content-type separator and can truncate JSON prompt values.
</Tip>

For cookbook-aligned structured prompts from the audiovisual folder, POST JSON-serialized prompt objects the same way the notebook does (`prompt` and `negative_prompt` as `json.dumps(...)` form fields, `flow_shift=10.0`, `guidance_scale=6.0`, `num_frames=189`).
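
A minimal sketch of assembling those form fields in Python is shown here. The field names and values come from the paragraph above and the curl example; treating `flow_shift` as a top-level form field (rather than part of `extra_params`) is an assumption, and the helper name is hypothetical.

```python
import json


def structured_t2v_fields(prompt_obj: dict, negative_obj: dict) -> dict[str, str]:
    """Build string form fields for a structured text-to-video request."""
    return {
        # Structured prompts travel as JSON strings inside ordinary form fields.
        "prompt": json.dumps(prompt_obj),
        "negative_prompt": json.dumps(negative_obj),
        "size": "1280x720",
        "num_frames": "189",
        "fps": "24",
        "guidance_scale": "6.0",
        # Assumption: sent as a top-level field; the notebook may place it elsewhere.
        "flow_shift": "10.0",
    }
```

Because every value is already a plain string, the fields can be sent as multipart form data by any HTTP client, which sidesteps the `;` parsing issue that affects curl's `-F`.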

<Check>
**Success:** `cosmos3_t2v_output.mp4` is non-empty; HTTP 200 with `video/mp4` body; server logs show request completion without OOM.
</Check>

## Reasoner: vLLM serve and chat completion

Install vLLM plus the Cosmos 3 plugin (CUDA 13 example):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

If the build reports DeepGEMM unavailable:

```bash
export VLLM_USE_DEEP_GEMM=0
```

Start a single-GPU Nano Reasoner server:

```bash
CUDA_VISIBLE_DEVICES=0 \
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$(dirname "$(pwd)")" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8000
```

<Note>
First startup may compile CUDA graphs for several minutes. Poll readiness with `curl -fsS http://127.0.0.1:8000/health` (cookbook notebooks wait until this succeeds).
</Note>
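
A readiness loop equivalent to that polling can be sketched with the standard library. The function names are illustrative, and the 1800-second default only mirrors the `--init-timeout 1800` value used for vLLM-Omni on the Installation page; the right budget for the Reasoner is an assumption.

```python
import time
import urllib.error
import urllib.request


def health_ok(url: str = "http://127.0.0.1:8000/health", timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(probe=health_ok, timeout_s: float = 1800.0, interval_s: float = 5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

`wait_until_ready()` returns `True` as soon as `/health` responds, so a script can block on it before sending the first chat completion.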

Query with the OpenAI-compatible client (Qwen3-VL-style multimodal messages):

```python
import openai

image_url = (
    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/"
    "assets/cosmos3/inputs/vision/robot_153.jpg"
)

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Caption the image in detail."},
            ],
        }
    ],
    max_tokens=4096,
    seed=0,
)

print(response.choices[0].message.content)
```

Equivalent `curl` against chat completions:

```bash
curl -sS http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$(curl -sS http://localhost:8000/v1/models | python3 -c 'import sys,json; print(json.load(sys.stdin)["data"][0]["id"])')"'",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg"}},
        {"type": "text", "text": "Caption the image in detail."}
      ]
    }],
    "max_tokens": 4096,
    "seed": 0
  }'
```

<Check>
**Success:** Non-empty assistant `content` in the JSON response; `/v1/models` lists the served checkpoint; local `file://` media works only under paths allowed by `--allowed-local-media-path`.
</Check>
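
For local media, a client-side pre-check can catch paths the server would reject. The sketch below builds a `file://` URL and verifies it sits under the directory passed to `--allowed-local-media-path`; the helper is hypothetical, and vLLM's own path validation may differ in detail.

```python
from pathlib import Path


def local_media_url(path: str, allowed_root: str) -> str:
    """Return a file:// URL for path, or raise if it is outside allowed_root."""
    media = Path(path).resolve()
    root = Path(allowed_root).resolve()
    if not media.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{media} is outside allowed media root {root}")
    return media.as_uri()
```

The returned string can replace the HTTPS URL in the `image_url` content part of the chat message shown earlier.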

## Surface comparison

```text
                    ┌─────────────────────────────────────┐
  Text / vision in  │  Reasoner (vLLM)                    │  Text out
  ─────────────────►│  Cosmos3ReasonerForConditionalGen   │──────────────►
                    └─────────────────────────────────────┘

                    ┌─────────────────────────────────────┐
  Text / vision in  │  Generator                          │  MP4 / PNG / action
  ─────────────────►│  Diffusers (in-process)             │──────────────►
                    │  vLLM-Omni (OpenAI /v1/videos/sync) │
                    └─────────────────────────────────────┘
```

| Surface | Minimal path | Default model | Primary success signal |
| --- | --- | --- | --- |
| Generator | Diffusers `Cosmos3OmniPipeline` | `nvidia/Cosmos3-Nano` | MP4 written by `export_to_video` |
| Generator | vLLM-Omni `POST /v1/videos/sync` | `nvidia/Cosmos3-Nano` | `Application startup complete.` + MP4 bytes |
| Reasoner | `vllm serve` + `/v1/chat/completions` | `nvidia/Cosmos3-Nano` | `/health` OK + non-empty assistant text |

## Verify GPU and servers

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

For any OpenAI-compatible server on port 8000:

```bash
curl http://localhost:8000/v1/models
```

## Related pages

<CardGroup>
  <Card title="Installation" href="/installation">
    Prerequisites, CUDA driver pairing, venv and Docker setup, and environment verification beyond these minimal commands.
  </Card>
  <Card title="Choose an integration" href="/choose-integration">
    When to use Diffusers, vLLM-Omni, vLLM, Framework, or Transformers by goal.
  </Card>
  <Card title="Reasoner and Generator" href="/reasoner-and-generator">
    MoT modes, inputs/outputs, and which surface fits understanding vs generation.
  </Card>
  <Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
    Full Diffusers pipeline modes, schedulers, and notebook walkthrough.
  </Card>
  <Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
    Super tensor parallelism, action endpoints, and guardrail toggles.
  </Card>
  <Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
    Serve flags, video frame kwargs, and reasoning-format prompts.
  </Card>
  <Card title="Troubleshooting" href="/troubleshooting">
    CUDA/driver mismatches, `torch.cuda` false, libxcb headless imports, and DeepGEMM workaround.
  </Card>
</CardGroup>

---

## 04. Choose an integration

> Decision matrix for Diffusers, vLLM-Omni, vLLM, Transformers (coming soon), and Cosmos Framework by goal: research, production inference, training, or evaluation.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/04-choose-an-integration.md
- Generated: 2026-06-01T20:21:24.282Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/reasoner/README.md`

---
title: "Choose an integration"
description: "Decision matrix for Diffusers, vLLM-Omni, vLLM, Transformers (coming soon), and Cosmos Framework by goal: research, production inference, training, or evaluation."
---

Cosmos 3 exposes two runtime surfaces—**Reasoner** (text/vision in, text out) and **Generator** (multimodal in, vision/sound/action out)—and five integration paths: **Diffusers**, **vLLM-Omni**, **vLLM**, **Transformers** (Reasoner, coming soon), and **Cosmos Framework** (native PyTorch via `cosmos_framework.scripts.inference` and `torchrun`). Pick the surface first, then match your goal (research, production serving, training, or evaluation) to the backend that owns that workflow in this repository and its cookbooks.

## Decision matrix by goal

| Goal | Surface | Integration | Entry point | Notes |
| --- | --- | --- | --- | --- |
| Generator research or model development | Generator | **Diffusers** | `Cosmos3OmniPipeline.from_pretrained` | Loads the full checkpoint (reasoner + diffusion + media tokenizers); Python-first inspection and modification |
| Generator production inference | Generator | **vLLM-Omni** | `vllm serve … --omni --model-class-name Cosmos3OmniDiffusersPipeline` | OpenAI-compatible `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync`; prefer `vllm/vllm-omni:cosmos3` for all modalities |
| Reasoner research or model development | Reasoner | **Transformers** (coming soon) | — | Planned Hugging Face path for prompts, processors, and model behavior |
| Reasoner production inference | Reasoner | **vLLM** | `vllm serve` + `Cosmos3ReasonerForConditionalGeneration` | OpenAI-compatible chat completions; Qwen3-VL-compatible messages |
| Runnable setup, training, or evaluation | Both | **Cosmos Framework** | `torchrun -m cosmos_framework.scripts.inference` | Clone `NVIDIA/cosmos-framework`, then `uv sync --all-extras --group=cu130-train` (or `cu128-train`); covers Reasoner, Generator audiovisual, and action cookbooks |
| Latency comparison across engines | Generator | **Cosmos Framework** (PyTorch), **vLLM-Omni**, **Diffusers** | See `inference_benchmarks.md` | Benchmarks label Framework OSS inference as **PyTorch** |

<Note>
For **understanding tasks with text output** (captioning, grounding, planning), use **Reasoner + vLLM**, not vLLM-Omni. vLLM-Omni loads the full omni checkpoint for diffusion generation; Reasoner vLLM serves only `Cosmos3ReasonerForConditionalGeneration`.
</Note>

## Pick a surface, then a backend

```text
                    Cosmos 3
                        |
          +-------------+-------------+
          |                           |
      Reasoner                    Generator
   (text out)              (vision / sound / action out)
          |                           |
    +-----+-----+             +-------+-------+
    |           |             |       |       |
Transformers  vLLM      Diffusers  vLLM-Omni  Cosmos Framework
 (soon)    (production)  (research) (production)  (torchrun / train)
    |           |             |       |       |
    +-----------+-------------+-------+-------+
                        |
              Cosmos Framework (all cookbook paths)
```

### Reasoner integrations

| Integration | Best for | API / runtime | Cookbook coverage |
| --- | --- | --- | --- |
| **vLLM** | Production serving, video + image workloads | `vllm serve` with `--hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}'`; chat completions | `run_with_vllm.ipynb` — captioning, temporal localization, embodied reasoning, grounding, action CoT, physical plausibility |
| **Transformers** | Research (planned) | Hugging Face inference | Not yet available; see cookbook env **Transformers (coming soon)** |
| **Cosmos Framework** | Native inference, scaling Nano → Super, benchmark hooks | `cosmos_framework.scripts.inference` with `--parallelism-preset=latency`; JSON inputs with `model_mode: "reasoner"` | `run_with_cosmos_framework.ipynb` — image-focused (`vision_path`); video examples documented under vLLM |

Reasoner vLLM installs pair **CUDA driver ↔ torch backend ↔ vLLM version**: `cu130` with `vllm==0.21.0`, or `cu128` with `vllm==0.19.1`, plus the `vllm-cosmos3` plugin from `cosmos-framework`. Do not rely on `--torch-backend=auto` for vLLM wheels.
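
The pairing can be captured as a lookup that emits the matching install command. This is a sketch: the version pins and plugin URL are the ones documented on this page, while the table and function names are illustrative.

```python
# Driver CUDA major version -> torch backend tag and pinned vLLM release.
PAIRINGS = {
    13: {"torch_backend": "cu130", "vllm": "0.21.0"},
    12: {"torch_backend": "cu128", "vllm": "0.19.1"},
}

PLUGIN = (
    "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git"
    "#subdirectory=packages/vllm-cosmos3"
)


def reasoner_install_command(driver_cuda_major: int) -> str:
    """Return the uv command for the given driver CUDA major version."""
    if driver_cuda_major not in PAIRINGS:
        raise ValueError(f"no documented pairing for CUDA {driver_cuda_major}.x")
    pairing = PAIRINGS[driver_cuda_major]
    return (
        f'uv pip install --torch-backend={pairing["torch_backend"]} '
        f'"vllm=={pairing["vllm"]}" "{PLUGIN}"'
    )
```

Raising on an unknown major version, rather than falling back to `auto`, keeps the explicit pairing that the vLLM path requires.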

### Generator integrations

| Integration | Best for | API / runtime | Cookbook coverage |
| --- | --- | --- | --- |
| **Diffusers** | Research, training, pipeline experimentation | `Cosmos3OmniPipeline`; modes: `text-to-image`, `text-to-video`, `image-to-video`, `text-to-video-with-sound` | Audiovisual only (`run_with_diffusers.ipynb`) |
| **vLLM-Omni** | Production image/video/sound/action serving | Docker `vllm/vllm-omni:cosmos3` or PR-branch install; endpoints in README quickstart | Audiovisual + action (`run_with_vllm_omni.ipynb`, `run_fd_with_vllm.ipynb`, `run_id_with_vllm.ipynb`) |
| **Cosmos Framework** | Full modality matrix, multi-GPU torchrun, OSS benchmark path | `torchrun -m cosmos_framework.scripts.inference` with JSON specs and `--checkpoint-path` | Audiovisual + forward/inverse dynamics |

<Warning>
vLLM-Omni upstreaming is in progress. The **`vllm/vllm-omni:cosmos3`** image supports every Generator modality (including video-to-video, sound, and action). A PR-branch pip install may expose only text-to-image, text-to-video, and image-to-video until follow-up PRs merge.
</Warning>

Action workflows (forward dynamics, inverse dynamics, policy) require **`extra_params`** fields such as `action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, and optionally `action_path`. Forward dynamics can use synchronous `POST /v1/videos/sync`; policy and inverse dynamics use async `POST /v1/videos` to retrieve predicted action chunks.
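
The request shape can be sketched as follows. The `extra_params` keys and the sync/async endpoint split come from the paragraph above; the workflow labels (`"forward_dynamics"`, `"inverse_dynamics"`, `"policy"`) are local names for this sketch only, and the accepted strings for `action_mode` and `domain_name` are not specified here, so take them from the action cookbook.

```python
import json

SYNC_ENDPOINT = "/v1/videos/sync"
ASYNC_ENDPOINT = "/v1/videos"


def action_endpoint(workflow: str) -> str:
    """Choose the endpoint for a workflow label (labels are hypothetical)."""
    if workflow == "forward_dynamics":
        return SYNC_ENDPOINT
    if workflow in ("inverse_dynamics", "policy"):
        return ASYNC_ENDPOINT
    raise ValueError(f"unknown workflow: {workflow}")


def action_extra_params(action_mode: str, domain_name: str, raw_action_dim: int,
                        action_chunk_size: int, action_path=None) -> str:
    """Serialize the extra_params form field as a JSON string."""
    params = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action_chunk_size": action_chunk_size,
    }
    if action_path is not None:
        # Optional trajectory file, per the action cookbook assets.
        params["action_path"] = action_path
    return json.dumps(params)
```

The resulting string is sent as the `extra_params` form field alongside the usual prompt fields.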

## Integration profiles

### Diffusers (Generator)

Install a dedicated venv with `diffusers` from Git, `accelerate`, `cosmos_guardrail`, and `transformers`, pinning `--torch-backend` to your driver (`cu130` or `cu128`; `auto` is acceptable here for torch). The pipeline loads **`nvidia/Cosmos3-Nano`** or **`nvidia/Cosmos3-Super`** and returns PIL images or tensors exportable via `export_to_video`.

| Attribute | Value |
| --- | --- |
| Checkpoint scope | Full omni model (reasoner + diffusion + tokenizers) |
| Typical use | Notebook iteration, scheduler tuning (`UniPCMultistepScheduler`, `flow_shift`), structured JSON prompts |
| Not in cookbooks | Action forward/inverse dynamics (use Framework or vLLM-Omni) |
| Benchmark label | **Diffusers** in `inference_benchmarks.md` |

### vLLM-Omni (Generator)

Serves **`Cosmos3OmniDiffusersPipeline`** behind OpenAI-compatible HTTP. Success signal: log line `Application startup complete.` Verify with `curl http://localhost:8000/v1/models`.

| Parallelism option | Purpose |
| --- | --- |
| `--tensor-parallel-size N` | Split weights (required for Super at scale) |
| `--enable-layerwise-offload` | CPU/GPU block offload (latency ↔ memory) |
| `--cfg-parallel-size 2` | Parallel CFG branches; set `guidance_scale` per request |
| `--ulysses-degree 2` | Sequence-parallel attention |

GPU budget: `tensor_parallel_size × cfg_parallel_size × ulysses_degree` must not exceed the number of available GPUs.
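
The budget rule is plain multiplication, as this small check shows. It is a sketch: the function names are illustrative, and treating an unset flag as a factor of 1 is an assumption.

```python
def required_gpus(tensor_parallel_size: int = 1, cfg_parallel_size: int = 1,
                  ulysses_degree: int = 1) -> int:
    """GPUs consumed by one server: the product of the three parallelism degrees."""
    degrees = (tensor_parallel_size, cfg_parallel_size, ulysses_degree)
    if any(d < 1 for d in degrees):
        raise ValueError("parallelism degrees must be >= 1")
    return tensor_parallel_size * cfg_parallel_size * ulysses_degree


def fits(available_gpus: int, **degrees: int) -> bool:
    """True when the requested configuration fits on the available GPUs."""
    return required_gpus(**degrees) <= available_gpus
```

For instance, `--tensor-parallel-size 2 --cfg-parallel-size 2 --ulysses-degree 2` needs 8 GPUs, so `fits(4, tensor_parallel_size=2, cfg_parallel_size=2, ulysses_degree=2)` is `False`.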

### vLLM (Reasoner)

Serves **`Cosmos3ReasonerForConditionalGeneration`** for chat completions. Key flags:

| Flag | Role |
| --- | --- |
| `--mm-encoder-tp-mode data` | Data-parallel visual encoder |
| `--async-scheduling` | Throughput-oriented scheduling |
| `--allowed-local-media-path` | Required for local `file://` media |
| `--media-io-kwargs '{"video": {"num_frames": -1}}'` | Let processor see all frames before downstream sampling |

If DeepGEMM is unavailable: `export VLLM_USE_DEEP_GEMM=0` before `vllm serve`.

### Transformers (Reasoner, coming soon)

Transformers is documented as the future Python-first Reasoner path, the counterpart to Diffusers on the Generator side. Cookbooks and environment setup reserve a **Transformers** section; no runnable Reasoner Transformers notebook ships in this repo yet.

### Cosmos Framework (both surfaces)

Clone `https://github.com/NVIDIA/cosmos-framework.git` to `packages/cosmos3`, then:

```bash
export GIT_LFS_SKIP_SMUDGE=1
uv sync --all-extras --group=cu130-train   # or cu128-train on CUDA 12.x
```

Inference imports training extras today—use `*-train` groups. Notebooks honor `COSMOS3_UV_GROUP` (default `cu130-train`).

| Workflow | Command pattern |
| --- | --- |
| Generator audiovisual | `torchrun --nproc-per-node=1 -m cosmos_framework.scripts.inference --parallelism-preset=throughput -i <spec.json> -o <out> --checkpoint-path Cosmos3-Nano` |
| Reasoner | `.venv/bin/python -m cosmos_framework.scripts.inference --parallelism-preset=latency -i <reasoner.json> -o <out> --checkpoint-path Cosmos3-Nano --benchmark` |
| Super scale-out | Increase `--nproc-per-node` / `torchrun` world size per notebook |

Ecosystem role: **Cosmos Framework** is the end-to-end Physical AI framework for training and serving; **Cosmos Curator** and **Cosmos Evaluator** sit beside it for data curation and automated evaluation. Post-training recipes for vision, action, and reasoner adaptation are marked **[Coming Soon]** in the root README.

## Cookbook backend map

| Cookbook area | Cosmos Framework | Diffusers | vLLM-Omni | vLLM | Transformers |
| --- | :---: | :---: | :---: | :---: | :---: |
| Generator · audiovisual | ✓ | ✓ | ✓ | — | — |
| Generator · action (fd / id) | ✓ | — | ✓ | — | — |
| Reasoner | ✓ (image-primary) | — | — | ✓ (image + video) | soon |

Shared environment steps (HF auth, CUDA tags, Docker pull, GPU probe) live in the Cosmos3 cookbooks environment guide—install only the backend sections you need.

## Benchmarks and evaluation

`inference_benchmarks.md` compares Generator engines:

| Benchmark label | Integration |
| --- | --- |
| **PyTorch** | Cosmos Framework OSS reference inference (CUDA graphs where supported) |
| **vLLM-Omni** | Total pipeline time at 720p on listed GPUs |
| **Diffusers** | End-to-end `Cosmos3OmniPipeline` without custom CUDA graphs |

Reasoner benchmarks cover **vLLM** serving only (TTFT, latency, throughput at concurrency 1/64/128/256). Empty benchmark cells mean **not yet measured**, not unsupported.

For evaluation pipelines beyond latency tables, route to **Cosmos Evaluator** in the ecosystem and Framework-side workflows as they ship.

## Common fork points

<AccordionGroup>
<Accordion title="I need the fastest path to one Generator video">
Use **Quickstart** paths: Diffusers `Cosmos3OmniPipeline` in-process, or **vLLM-Omni** `curl` against `/v1/videos/sync`. First Diffusers run downloads Nano and runs full diffusion steps—long wall times are expected.
</Accordion>
<Accordion title="I am building a production API behind load balancers">
**Generator → vLLM-Omni** (HTTP, guardrails, tensor parallel for Super). **Reasoner → vLLM** (chat completions, multimodal messages). Keep surfaces on separate services.
</Accordion>
<Accordion title="I am modifying schedulers, prompts, or model code">
**Generator → Diffusers** for in-notebook changes; **Reasoner → Transformers** when available. Use **Cosmos Framework** when you need `torchrun`, JSON job specs, or parity with NVIDIA training stacks.
</Accordion>
<Accordion title="I need action-conditioned rollouts or robot policies">
**Cosmos Framework** or **vLLM-Omni** only. Diffusers cookbooks do not cover action. Pass `domain_name`, `action_mode`, and trajectory files per action cookbook assets.
</Accordion>
<Accordion title="I need Reasoner outputs on long videos">
Prefer **vLLM** Reasoner cookbook (`run_with_vllm.ipynb`). Framework Reasoner cookbooks currently emphasize **image** inputs via `vision_path`.
</Accordion>
</AccordionGroup>

## Environment prerequisites (all paths)

- Linux, NVIDIA GPU (Ampere, Hopper, Blackwell tested at BF16)
- Hugging Face gated-model auth: `uvx hf@latest auth login`
- CUDA **13.x → `cu130`** or **12.x → `cu128`** driver pairing for Framework, vLLM, and explicit torch installs
- Framework + vLLM: git access to `NVIDIA/cosmos-framework` (for `vllm-cosmos3` plugin and Framework clone)
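
The CUDA pairing in the list above can be written as a small lookup. This is a minimal sketch: the helper name `cuda_backend_tag` is hypothetical, and it encodes only the two pairings stated here (13.x to `cu130`, 12.x to `cu128`).

```python
def cuda_backend_tag(cuda_version: str) -> str:
    """Map a CUDA version string such as "13.0" to the backend tag listed above."""
    major = int(cuda_version.split(".")[0])
    # Only the two pairings documented on this page are encoded.
    tags = {13: "cu130", 12: "cu128"}
    if major not in tags:
        raise ValueError(f"No documented backend tag for CUDA {cuda_version}")
    return tags[major]


print(cuda_backend_tag("13.0"))
print(cuda_backend_tag("12.8"))
```

Raising on any other major version keeps an unsupported driver from silently selecting the wrong wheel index.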

<Tip>
Install backends in isolation—Diffusers venv, vLLM venv, Framework checkout, and vLLM-Omni Docker—so CUDA/torch/vLLM version pins do not conflict. The cookbook environment page lists verification probes for each.
</Tip>

## Related pages

<CardGroup>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT surfaces, modality matrix, and when to call each runtime.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup, CUDA tags, and GPU verification for every backend.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run commands per integration.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Latency tables across PyTorch, vLLM-Omni, Diffusers, and Reasoner vLLM.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Cosmos3OmniPipeline modes and export paths.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Docker serve, parallelism, guardrails, and action endpoints.
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
Serve flags, chat message shape, and reasoning prompt suffix.
</Card>
<Card title="Run Generator with Cosmos Framework" href="/run-generator-cosmos-framework">
torchrun inference, presets, and JSON specs.
</Card>
</CardGroup>

---

## 05. Reasoner and Generator

> MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/05-reasoner-and-generator.md
- Generated: 2026-06-01T20:22:47.243Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/cosmos3-model-architecture.png`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/action/README.md`

---
title: "Reasoner and Generator"
description: "MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface."
---

Cosmos 3 is a single Mixture-of-Transformers (MoT) checkpoint that exposes two runtime surfaces: an autoregressive **Reasoner** path (causal attention over language and visual-understanding tokens) and a diffusion **Generator** path (full attention over noisy vision, audio, and action tokens). Integrations select the active path through serve flags, pipeline class, or Framework `model_mode`; both paths share transformer layers and a unified 3D multi-dimensional rotary position embedding (mRoPE).

## Two runtime surfaces

| Surface | Inputs | Outputs | Primary workloads |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image or video) | Text | Captioning, temporal localization, grounding, embodied and common-sense reasoning, action forecasting, physical plausibility, situation understanding |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Text-to-image/video, image-to-video, video-to-video, synchronized sound, forward/inverse dynamics, policy rollouts, synthetic data |

<Info>
Reasoner and Generator are not separate model families. They are two forward modes through the same Cosmos 3 weights, distinguished by which token subsequence is active and which attention mask applies.
</Info>

## MoT architecture

Cosmos 3 combines an autoregressive (AR) transformer subsequence for reasoning with a diffusion (DM) subsequence for multimodal generation. The stack repeats **L** shared layers; each layer applies layer norm, **shared multimodal attention**, and an MLP. Token encoders sit upstream:

| Subsequence | Encoders | Token types |
| --- | --- | --- |
| **AR (Reasoner)** | Vision encoder (ViT) | Visual understanding tokens `v^AR` |
| **AR (Reasoner)** | Language tokenizer | Language tokens `l`, plus specials such as `EOS` and `BOG` |
| **DM (Generator)** | Vision encoder (VAE) | Noisy vision tokens `v^DM` |
| **DM (Generator)** | Audio encoder | Sound tokens `s` |
| **DM (Generator)** | Action encoder | Action tokens `a` |

<Frame caption="Cosmos 3 MoT diagram: Reasoner Mode (causal AR) vs Generator Mode (full DM attention), shared layers, and attention mask regions.">
![Cosmos 3 model architecture](/cookbooks/cosmos3/cosmos3-model-architecture.png)
</Frame>

```mermaid
flowchart TB
  subgraph encoders["Input encoders"]
    ViT["ViT vision encoder → v^AR"]
    Lang["Language tokenizer → l, EOS, BOG"]
    VAE["VAE vision encoder → v^DM noisy"]
    Audio["Audio encoder → s"]
    Action["Action encoder → a"]
  end

  subgraph stack["Shared transformer × L"]
    direction TB
    subgraph reasonerPath["Reasoner Mode — AR subsequence"]
      LN_AR["Layer norm"]
      Causal["Causal self-attention<br/>Attn(Q_AR, K_AR, V_AR)"]
      MLP_AR["MLP → next language tokens"]
    end
    subgraph genPath["Generator Mode — DM subsequence"]
      LN_DM["Layer norm"]
      Full["Full attention<br/>Attn(Q_DM, [K_AR;K_DM], [V_AR;V_DM])"]
      MLP_DM["MLP → denoised v^DM, s, a"]
    end
    MM["Shared multimodal attention block"]
  end

  ViT --> reasonerPath
  Lang --> reasonerPath
  VAE --> genPath
  Audio --> genPath
  Action --> genPath
  Causal --> MM
  Full --> MM
```

### Reasoner mode (autoregressive)

In Reasoner mode, language and visual-understanding tokens flow through the AR subsequence. Attention within AR is **causal** (lower-triangular mask): each AR query attends only to prior AR keys and values. AR queries are **masked from DM keys** — the reasoner path cannot read noisy diffusion tokens. The forward pass performs next-token prediction for perception, planning, and world reasoning tasks.

Production Reasoner serving loads only the reasoner head:

```shell
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Cosmos Framework selects the same path with an explicit mode flag in the input JSON:

```json
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://example.com/robot_153.jpg",
  "enable_sound": false
}
```

<Warning>
On the current Framework Reasoner path, set `"enable_sound": false` in reasoner JSON inputs. Omitting it can trigger strict argument-validation failures. Framework Reasoner quickstarts also expect image conditioning via `vision_path`; video-heavy workflows are documented primarily against the vLLM Reasoner cookbook.
</Warning>
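
One way to avoid the validation failure described above is to generate Reasoner specs from a helper that always writes `enable_sound`. The sketch below is illustrative only: `build_reasoner_spec` is a hypothetical name, and the field set is copied from the JSON example on this page rather than from a Framework schema.

```python
import json


def build_reasoner_spec(name: str, prompt: str, vision_path: str) -> dict:
    """Build a Framework Reasoner input with enable_sound always present and false."""
    return {
        "model_mode": "reasoner",
        "name": name,
        "prompt": prompt,
        "vision_path": vision_path,
        # Written explicitly: omitting this key can fail strict argument validation.
        "enable_sound": False,
    }


spec = build_reasoner_spec(
    name="robot_image",
    prompt="Describe what is happening in this image in one sentence.",
    vision_path="https://example.com/robot_153.jpg",
)
print(json.dumps(spec, indent=2))
```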

Reasoner requests follow **Qwen3-VL-compatible** chat messages (`image_url`, `video_url`, `text`). For chain-of-thought style answers, append the `redacted_reasoning` format instruction to the user prompt (see [Sampling and prompt parameters](/sampling-and-prompt-parameters)).
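
The message shape can be sketched as a plain request body for the OpenAI-compatible chat completions route (`/v1/chat/completions` on the port used in the serve command above). The helper name `build_reasoner_chat_request`, the `max_tokens` value, and the example URL are assumptions for illustration; placing media parts before the text part is a common convention, not a documented requirement.

```python
import json
from typing import Optional


def build_reasoner_chat_request(
    model: str,
    prompt: str,
    image_url: Optional[str] = None,
    video_url: Optional[str] = None,
    max_tokens: int = 256,
) -> dict:
    """Assemble a chat-completions body with image_url / video_url / text parts."""
    content = []
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if video_url is not None:
        content.append({"type": "video_url", "video_url": {"url": video_url}})
    content.append({"type": "text", "text": prompt})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


body = build_reasoner_chat_request(
    model="nvidia/Cosmos3-Nano",
    prompt="Describe what is happening in this image in one sentence.",
    image_url="https://example.com/robot_153.jpg",  # placeholder URL
)
print(json.dumps(body, indent=2))
```

Send the resulting body with any HTTP client. Sampling fields such as `temperature` belong at the top level next to `max_tokens`.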

### Generator mode (diffusion)

In Generator mode, noisy image, video, audio, and action tokens occupy the DM subsequence. DM queries use **full attention**: each DM query attends to the concatenated AR and DM keys and values (`[K_AR; K_DM]`, `[V_AR; V_DM]`). That lets conditioning text and visual understanding tokens influence denoising while AR tokens remain blind to DM state. The output is coherent multimodal media — images, MP4 video (optionally with AAC sound), and JSON action chunks.
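
The two masking rules (causal within AR, full attention for DM queries over both subsequences) can be illustrated with a toy boolean mask. This is a simplified sketch, not the model's implementation: it assumes AR tokens are laid out before DM tokens in one sequence, and `build_attention_mask` is a hypothetical helper.

```python
def build_attention_mask(n_ar: int, n_dm: int) -> list:
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens 0..n_ar-1 are AR tokens; tokens n_ar..n_ar+n_dm-1 are DM tokens.
    """
    total = n_ar + n_dm
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < n_ar:
                # AR query: causal, so only AR keys at or before q are visible.
                # Since k <= q < n_ar, DM keys are never visible here.
                mask[q][k] = k <= q
            else:
                # DM query: full attention over [K_AR; K_DM].
                mask[q][k] = True
    return mask


for row in build_attention_mask(n_ar=3, n_dm=2):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

With three AR tokens and two DM tokens, the first three rows form a lower triangle with zeros in the DM columns, and the last two rows are all ones.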

Typical Generator integrations load the **full** Cosmos 3 checkpoint (reasoner + diffusion paths + media tokenizers):

| Integration | Entry class / command | API shape |
| --- | --- | --- |
| Diffusers | `Cosmos3OmniPipeline.from_pretrained(...)` | Python `pipe(...)` → PIL image or video tensor |
| vLLM-Omni | `vllm serve ... --omni --model-class-name Cosmos3OmniDiffusersPipeline` | OpenAI-compatible `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync` |
| Cosmos Framework | `torchrun -m cosmos_framework.scripts.inference` | JSON input specs under cookbook `assets/` |

The Diffusers research install notes point out that the pipeline loads the reasoner path, diffusion generation path, and media tokenizers even when you call only generation APIs.

### Shared mRoPE and layers

Both modes reuse the same transformer depth, multimodal attention layers, and a unified **3D mRoPE** that encodes spatial and temporal structure across modalities. mRoPE gives consistent position coding when the model reasons over images, video frames, audio streams, and action trajectories in one sequence — whether those tokens are processed causally (Reasoner) or denoised with full context (Generator).

## Input and output contracts

| Contract | Reasoner | Generator |
| --- | --- | --- |
| Text in | Prompts, questions, instructions | Structured JSON scene prompts (often upsampled), negative prompts |
| Vision in | Images, videos (Qwen3-VL message URLs or Framework `vision_path`) | Conditioning images/videos, VAE-encoded frames |
| Sound in | Not supported (Reasoner accepts text and vision only) | Optional input soundtrack for video-to-video-with-sound |
| Action in | Reasoning about actions (forecasting, CoT) | Trajectories for forward dynamics; video+instruction for inverse dynamics and policy |
| Text out | Captions, JSON boxes, labels, chain-of-thought | — |
| Vision out | — | JPG/PNG images, MP4 video |
| Sound out | — | Stereo AAC at 48 kHz muxed into MP4 when enabled |
| Action out | — | JSON action values (policy, inverse dynamics) |

Supported generation settings (resolution tiers 256p–720p, aspect ratios, frame rates, frame counts) apply to Generator outputs. Reasoner sampling parameters (`temperature`, `top_p`, `top_k`, `presence_penalty`) differ for plain answers versus explicit reasoning prompts — see [Sampling and prompt parameters](/sampling-and-prompt-parameters).

Action semantics for Generator workflows (9D ego pose, 10D DROID/UMI end-effector+gripper, `domain_name`, `action_mode`) are documented on [Action modality](/action-modality).

## When to use each surface

| Goal | Use | Avoid |
| --- | --- | --- |
| Understand a scene, localize events, ground objects, judge physics | **Reasoner** | Generator endpoints (they return media, not analysis text) |
| Produce or simulate visuals, sound, or robot trajectories | **Generator** | Reasoner-only vLLM serve (no diffusion denoising) |
| Text answers from images/video in production | Reasoner + **vLLM** (`Cosmos3ReasonerForConditionalGeneration`) | vLLM-Omni (loads full omni checkpoint; heavier for understanding-only) |
| Images/video/audio/action in production | Generator + **vLLM-Omni** | Reasoner vLLM (text-only chat completions) |
| Python-first Generator research | **Diffusers** `Cosmos3OmniPipeline` | — |
| Python-first Reasoner research | Transformers (coming soon) | — |
| Native PyTorch for either surface, training, evaluation | **Cosmos Framework** `cosmos_framework.scripts.inference` | — |

<Note>
vLLM-Omni loads the full checkpoint including the Qwen3-VL-based reasoner path **and** the diffusion path. For understanding-only tasks that return text, prefer [Run Reasoner with vLLM](/run-reasoner-vllm) instead of vLLM-Omni.
</Note>

Benchmarks treat the surfaces separately: Generator tables report **diffusion-path latency** (seconds per t2i/t2v/i2v); Reasoner tables report **vLLM serving metrics** (TTFT, request latency, throughput under concurrency), not denoising step time.

## Inference backends by surface

| Backend | Reasoner | Generator (audiovisual) | Generator (action) |
| --- | :---: | :---: | :---: |
| Cosmos Framework | ✓ | ✓ | ✓ |
| Diffusers | — | ✓ | — |
| Transformers | coming soon | — | — |
| vLLM | ✓ | — | — |
| vLLM-Omni | — | ✓ | ✓ |
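
The matrix above can be mirrored as a lookup for routing code. A minimal sketch: the surface keys (`reasoner`, `generator_audiovisual`, `generator_action`) and the helper name `backends_for` are made up for this example, and Transformers is left out because it is listed as coming soon.

```python
# Supported surfaces per backend, copied from the table above.
SUPPORT = {
    "Cosmos Framework": {"reasoner", "generator_audiovisual", "generator_action"},
    "Diffusers": {"generator_audiovisual"},
    "vLLM": {"reasoner"},
    "vLLM-Omni": {"generator_audiovisual", "generator_action"},
}


def backends_for(surface: str) -> list:
    """Return the backends that support a surface, sorted by name."""
    known = set().union(*SUPPORT.values())
    if surface not in known:
        raise ValueError(f"Unknown surface: {surface}")
    return sorted(name for name, surfaces in SUPPORT.items() if surface in surfaces)


print(backends_for("generator_action"))
print(backends_for("reasoner"))
```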

Framework Reasoner runs commonly use `--parallelism-preset=latency` on a single GPU (Nano) or `torchrun` across four GPUs (Super). Generator Framework runs typically use `--parallelism-preset=throughput`. Diffusers and vLLM-Omni Generator quickstarts target `nvidia/Cosmos3-Nano` or `nvidia/Cosmos3-Super` with matching tensor-parallel and offload flags for the 64B checkpoint.

## Representative workflows

### Reasoner workflows

| Workflow | Inputs | Output type |
| --- | --- | --- |
| Caption | Video | Text |
| Temporal localization | Video, query | Text or JSON timestamps |
| Embodied / common-sense reasoning | Video, question | Text |
| 2D grounding | Image, prompt | JSON bounding boxes |
| Describe anything | Image, marked subjects | JSON or text attributes |
| Action CoT | Image or video, prompt | Text or JSON trajectories |
| Physical plausibility | Video, prompt | Label |
| Situation understanding | Video, question | Text |

Runnable notebooks: `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`, `run_with_cosmos_framework.ipynb`.

### Generator workflows

| Workflow | Inputs | Outputs |
| --- | --- | --- |
| Text-to-image / text-to-video | Text | Vision (optional sound) |
| Image-to-video | Text, image | Vision (optional sound) |
| Video-to-video | Text, video | Vision (optional sound) |
| Forward dynamics | Text, image, action trajectory | Vision |
| Policy / inverse dynamics | Text, image or video | Action + vision |

Audiovisual cookbooks live under `cookbooks/cosmos3/generator/audiovisual/`; action cookbooks under `cookbooks/cosmos3/generator/action/` with `action_mode` values `forward_dynamics`, `inverse_dynamics`, and `policy` on vLLM-Omni.

## Checkpoint and modality scope

| Checkpoint | Size | Typical surface |
| --- | ---: | --- |
| `nvidia/Cosmos3-Nano` | 16B | Both Reasoner and Generator from a single set of omnimodal weights |
| `nvidia/Cosmos3-Super` | 64B | Same; requires multi-GPU serve/generate |
| `nvidia/Cosmos3-Super-Text2Image` | 64B | Generator-focused text-to-image |
| `nvidia/Cosmos3-Super-Image2Video` | 64B | Generator-focused image-to-video |
| `nvidia/Cosmos3-Nano-Policy-DROID` | 16B | Vision-language robot policy (DROID) |

Task-specific HF variants narrow Generator capability; general omnimodal understanding and simulation still route through Nano/Super with the correct surface selected at serve or `model_mode` time.

## Related pages

<CardGroup>
  <Card title="Overview" href="/overview">
    Cosmos 3 surfaces, modalities, and the shortest path to a first Reasoner or Generator call.
  </Card>
  <Card title="Choose an integration" href="/choose-integration">
    Pick Diffusers, vLLM-Omni, vLLM, Framework, or Transformers by research vs production goal.
  </Card>
  <Card title="Input and output specifications" href="/input-output-specifications">
    Resolution tiers, frame counts, prompt limits, and output formats per modality.
  </Card>
  <Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
    Serve `Cosmos3ReasonerForConditionalGeneration` and issue Qwen3-VL chat requests.
  </Card>
  <Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
    OpenAI-compatible image/video/action generation and guardrail toggles.
  </Card>
  <Card title="Action modality" href="/action-modality">
    Embodiment dimensions, `domain_name`, and forward/inverse/policy action modes.
  </Card>
</CardGroup>

---

## 06. Model family

> Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/06-model-family.md
- Generated: 2026-06-01T20:22:20.814Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `inference_benchmarks.md`
- `cookbooks/cosmos3/generator/action/README.md`

---
title: "Model family"
description: "Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving."
---

Cosmos 3 ships five gated Hugging Face checkpoints under the [NVIDIA Cosmos 3 collection](https://huggingface.co/collections/nvidia/cosmos3). Each repo ID (`nvidia/Cosmos3-*`) is the canonical `from_pretrained`, `vllm serve`, and `--checkpoint-path` string across Diffusers, vLLM-Omni, vLLM, and Cosmos Framework. Omnimodal **Nano** (16B) and **Super** (64B) checkpoints expose both **Reasoner** (text out) and **Generator** (vision/sound/action out) surfaces; the three specialized variants narrow the Generator path to a single modality family or robot policy.

## Checkpoint catalog

| Hugging Face ID | Params | Primary capability | Typical surfaces |
| --- | ---: | --- | --- |
| [`nvidia/Cosmos3-Nano`](https://huggingface.co/nvidia/Cosmos3-Nano) | 16B | Compact omnimodal world model: multimodal understanding, world simulation, future prediction, action reasoning, Physical AI | Reasoner + Generator (full omni stack) |
| [`nvidia/Cosmos3-Super`](https://huggingface.co/nvidia/Cosmos3-Super) | 64B | Frontier-scale omnimodal world model with the same modality coverage at higher capacity | Reasoner + Generator (full omni stack) |
| [`nvidia/Cosmos3-Super-Text2Image`](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image) | 64B | High-fidelity text-to-image generation | Generator (image output) |
| [`nvidia/Cosmos3-Super-Image2Video`](https://huggingface.co/nvidia/Cosmos3-Super-Image2Video) | 64B | Temporally coherent image-to-video generation | Generator (video output) |
| [`nvidia/Cosmos3-Nano-Policy-DROID`](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID) | 16B | Vision-language robot policy for DROID manipulation and control | Generator (action + vision; policy-oriented) |

<Note>
All Cosmos 3 model repos are gated. Authenticate before the first download (`uvx hf@latest auth login` or `HF_TOKEN`). Set `HF_HOME` when you need a shared or larger cache volume.
</Note>

## Omnimodal vs specialized checkpoints

```text
                    ┌─────────────────────────────────────┐
                    │     nvidia/Cosmos3-Nano (16B)       │
                    │     nvidia/Cosmos3-Super (64B)      │
                    └──────────────┬──────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        Reasoner mode        Generator mode      (shared MoT weights)
     text/vision → text   multimodal → vision,
                          sound, action

  Specialized (64B / 16B) — Generator-focused subsets:
  • Cosmos3-Super-Text2Image     → text → image
  • Cosmos3-Super-Image2Video    → image → video
  • Cosmos3-Nano-Policy-DROID    → DROID policy / control
```

**Nano** and **Super** are the checkpoints documented end-to-end in this repository’s cookbooks: audiovisual generation, action forward/inverse dynamics, and Reasoner understanding workflows all reference `Cosmos3-Nano` or `Cosmos3-Super` by default. The **Text2Image**, **Image2Video**, and **Policy-DROID** repos are first-class catalog entries for deployments that want a narrower Generator specialization without loading the full omni diffusion stack for every request.

## Reasoner vs Generator on the same checkpoint

A single omnimodal weight file supports two runtime modes. Integration choice determines which path loads:

| Surface | Inputs | Outputs | Load pattern |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image/video) | Text | vLLM with `Cosmos3ReasonerForConditionalGeneration` override; Framework `model_mode: "reasoner"` |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Diffusers `Cosmos3OmniPipeline`; vLLM-Omni `Cosmos3OmniDiffusersPipeline`; Framework generator JSON specs |

<Info>
vLLM-Omni for Generator loads the **full** checkpoint (reasoner + diffusion paths). For text-only understanding at scale, serve Reasoner through **vLLM** with the architecture override instead of vLLM-Omni — lower memory and an OpenAI chat-completions API.
</Info>

Reasoner workloads in cookbooks include captioning, temporal localization, embodied/common-sense reasoning, 2D grounding, describe-anything, action chain-of-thought, physical plausibility, and situation understanding. Generator workflows span text-to-image/video (with optional sound), image-to-video, video-to-video, forward dynamics, inverse dynamics, and action policy rollouts.

## Size and serving tradeoffs

### Nano (16B) — default for development and single-GPU serving

| Dimension | Nano behavior |
| --- | --- |
| **GPU footprint** | Fits single-GPU Reasoner and Generator paths in cookbooks (`--tensor-parallel-size 1`, `torchrun --nproc-per-node=1`) |
| **Latency** | Fastest omnimodal Generator latencies in published tables; Reasoner TTFT/throughput benchmarks use `nvidia/Cosmos3-Nano` |
| **Coverage** | Action cookbooks (AV, DROID, UMI forward/inverse dynamics) and audiovisual examples target Nano |
| **When to choose** | Prototyping, constrained hardware, action-world-model experiments, high-throughput Reasoner at moderate concurrency |

Example Reasoner serve (single GPU):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Example Generator serve (vLLM-Omni Docker, single node):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

### Super (64B) — quality-first, multi-GPU serving

| Dimension | Super behavior |
| --- | --- |
| **GPU footprint** | Reasoner cookbook and vLLM-Omni docs use **4-way tensor parallelism** (`--tensor-parallel-size 4`) as the tested Super configuration |
| **Memory relief** | `--enable-layerwise-offload` trades latency for lower peak VRAM by moving transformer blocks between CPU and GPU |
| **Extra parallelism** | Optional `--cfg-parallel-size` (CFG branches) and `--ulysses-degree` (sequence parallel); GPU count must cover `tensor_parallel × cfg_parallel × ulysses` |
| **Latency** | Generator benchmarks show longer diffusion times than Nano at the same resolution; expect higher quality, not higher FPS |
| **When to choose** | Production Reasoner quality (default in `run_with_vllm.ipynb`), 720p Generator at scale, frontier audiovisual fidelity |
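
The GPU-count rule in the table (`tensor_parallel × cfg_parallel × ulysses`) is a plain product, which is easy to check before launching a serve command. The helper names below are hypothetical; the arguments correspond to `--tensor-parallel-size`, `--cfg-parallel-size`, and `--ulysses-degree`.

```python
def required_gpus(tensor_parallel: int = 1, cfg_parallel: int = 1, ulysses: int = 1) -> int:
    """Return the GPU count needed to cover all three parallelism degrees."""
    for name, value in (
        ("tensor_parallel", tensor_parallel),
        ("cfg_parallel", cfg_parallel),
        ("ulysses", ulysses),
    ):
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return tensor_parallel * cfg_parallel * ulysses


def fits(available_gpus: int, **degrees: int) -> bool:
    """True when the node has enough GPUs for the requested layout."""
    return available_gpus >= required_gpus(**degrees)


print(required_gpus(tensor_parallel=4))                         # 4
print(required_gpus(tensor_parallel=4, cfg_parallel=2))         # 8
print(fits(4, tensor_parallel=4, cfg_parallel=2))               # False
```

For example, adding 2-way CFG parallelism to the tested 4-way tensor-parallel Super layout needs eight GPUs, so it does not fit a four-GPU node.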

Example Super Generator (four GPUs + offload):

```bash
vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --allowed-local-media-path / \
  --port 8000
```

Cosmos Framework scales Super by increasing `--nproc-per-node` and setting `--checkpoint-path Cosmos3-Super` (audiovisual quickstart pattern).

### Specialized checkpoints

| Checkpoint | Serving implication |
| --- | --- |
| **Super-Text2Image** | Deploy when the fleet only serves `POST /v1/images/generations` or Diffusers `text-to-image` (`num_frames=1`); avoids carrying full video/action modules if your integration supports variant-specific weights |
| **Super-Image2Video** | Deploy for image-conditioned video only; pairs with `input_reference` uploads and i2v sampler settings |
| **Nano-Policy-DROID** | 16B policy head for DROID manipulation; distinct from general **Nano** action forward/inverse dynamics cookbooks, which use the omnimodal Generator on `Cosmos3-Nano` |

<Warning>
This repository’s runnable notebooks standardize on **Cosmos3-Nano** and **Cosmos3-Super** omnimodal IDs. Before production rollout on Text2Image, Image2Video, or Policy-DROID, confirm your serving stack (Diffusers pipeline class, vLLM-Omni model card, Framework checkpoint resolver) accepts the specialized repo ID.
</Warning>

## Integration matrix by checkpoint

| Backend | Nano | Super | Text2Image / Image2Video / Policy-DROID |
| --- | --- | --- | --- |
| **Diffusers** (`Cosmos3OmniPipeline`) | Documented (`nvidia/Cosmos3-Nano`) | Documented (`nvidia/Cosmos3-Super`) | Use matching HF ID if pipeline supports variant configs |
| **vLLM-Omni** | Default Docker quickstart | TP=4 + optional offload | Not covered in cookbooks |
| **vLLM** (Reasoner) | Quickstart + benchmarks | Notebook default (TP=4) | N/A (Reasoner-only path) |
| **Cosmos Framework** | `--checkpoint-path Cosmos3-Nano` | `--checkpoint-path Cosmos3-Super` + multi-GPU `torchrun` | Confirm in Framework docs before training/inference |

Disk planning: cookbook setup notes that **Nano** downloads plus CUDA dependencies can consume tens of GiB; **Super** multiplies weight storage and often requires four GPUs for the documented serving paths.
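
For a rough lower bound on weight storage, multiply the parameter count by bytes per parameter. The sketch below assumes every parameter is stored in BF16 (2 bytes) and ignores media tokenizers, caches, and CUDA dependencies, so actual repository sizes may differ; `bf16_weight_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3  # bytes per GiB


def bf16_weight_gib(params_billion: float) -> float:
    """Estimate weight storage in GiB, assuming 2 bytes per parameter (BF16)."""
    return params_billion * 1e9 * 2 / GIB


for name, params in (("Nano", 16), ("Super", 64)):
    print(f"{name}: ~{bf16_weight_gib(params):.1f} GiB of weights")
```

Under these assumptions Nano needs roughly 30 GiB and Super roughly 119 GiB for weights alone, which matches the guidance that Nano lands in the tens of GiB and Super multiplies it.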

## Benchmark anchors (omnimodal checkpoints)

Published numbers in [`inference_benchmarks.md`](inference_benchmarks.md) compare **Nano** and **Super** Generator diffusion latency (seconds) across PyTorch, vLLM-Omni, and Diffusers at 256p/480p/720p and tensor-parallel widths 1/4/8. **Nano Reasoner** tables report vLLM TTFT, request latency, and throughput at client concurrency 1/64/128/256 — not diffusion time.

Representative Generator signals (720p t2v, seconds, lower is better):

| GPU | Engine | Nano 720p/1 | Super 720p/1 (where measured) |
| --- | --- | ---: | ---: |
| B200 | PyTorch | 114.85 | 407.50 |
| B200 | vLLM-Omni | 107.84 | 390.28 |
| H100 NVL | Diffusers | 324.20 | — |

Representative Reasoner signals (Nano, B200, Input 50 / Output 100 / Video 1 FPS):

| Metric | Concurrency 1 | Concurrency 256 |
| --- | ---: | ---: |
| TTFT (ms) | 115.27 | 2549.79 |
| Output token throughput (Tok/s) | 180.16 | 2701.08 |

Empty benchmark cells mean **not yet measured**, not unsupported. See the full tables on the inference benchmarks page.
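
The representative numbers above are easier to compare as ratios. This sketch only re-uses the values from the two tables on this page; the dictionary layout and the `ratio` helper are illustrative.

```python
# 720p t2v seconds on B200, copied from the Generator table above.
GENERATOR_720P_T2V_SECONDS = {
    "PyTorch": {"nano": 114.85, "super": 407.50},
    "vLLM-Omni": {"nano": 107.84, "super": 390.28},
}

# Nano Reasoner on B200, keyed by client concurrency.
REASONER_NANO_B200 = {
    "ttft_ms": {1: 115.27, 256: 2549.79},
    "tok_per_s": {1: 180.16, 256: 2701.08},
}


def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio, kept as a function so the comparisons read clearly."""
    return numerator / denominator


for engine, seconds in GENERATOR_720P_T2V_SECONDS.items():
    print(f"{engine}: Super takes {ratio(seconds['super'], seconds['nano']):.2f}x Nano's time")

throughput_gain = ratio(REASONER_NANO_B200["tok_per_s"][256], REASONER_NANO_B200["tok_per_s"][1])
ttft_growth = ratio(REASONER_NANO_B200["ttft_ms"][256], REASONER_NANO_B200["ttft_ms"][1])
print(f"Concurrency 256 vs 1: throughput {throughput_gain:.2f}x, TTFT {ttft_growth:.2f}x")
```

On these figures Super takes about 3.5 to 3.6 times as long as Nano per 720p clip, and going from concurrency 1 to 256 raises Reasoner throughput about 15 times while TTFT grows about 22 times.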

## Choosing a checkpoint

<Steps>
<Step title="Pick the surface">
Need text understanding (caption, VQA, grounding, planning)? → Reasoner on **Nano** or **Super**. Need images, video, sound, or action outputs? → Generator on an omnimodal or specialized checkpoint.
</Step>
<Step title="Pick the size">
Single GPU or fastest iteration → **Nano**. Maximum quality or Reasoner notebook defaults → **Super** with 4× tensor parallel.
</Step>
<Step title="Pick specialization">
Fleet serves only t2i or only i2v at 64B → consider **Super-Text2Image** or **Super-Image2Video**. DROID manipulation policy at 16B → **Nano-Policy-DROID**. General robotics world models (forward/inverse dynamics across AV, DROID, UMI) → omnimodal **Nano** Generator.
</Step>
<Step title="Match the integration">
Research Generator → Diffusers or Framework. Production Generator API → vLLM-Omni. Production Reasoner API → vLLM + `vllm-cosmos3`. Align CUDA pairs (`cu130`/`cu128`) with the cookbook environment guide before download.
</Step>
</Steps>
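
The first three steps can be condensed into a small selection helper. This is a sketch of the decision order only: `choose_checkpoint` and the specialization keys are hypothetical, and it does not check whether your serving stack accepts a specialized repo ID.

```python
from typing import Optional

# Specialized Generator checkpoints from the catalog on this page.
SPECIALIZED = {
    "text2image": "nvidia/Cosmos3-Super-Text2Image",
    "image2video": "nvidia/Cosmos3-Super-Image2Video",
    "droid_policy": "nvidia/Cosmos3-Nano-Policy-DROID",
}


def choose_checkpoint(surface: str, single_gpu: bool, specialization: Optional[str] = None) -> str:
    """Pick a Hugging Face repo ID from surface, hardware budget, and specialization."""
    if surface not in {"reasoner", "generator"}:
        raise ValueError(f"Unknown surface: {surface}")
    if specialization is not None:
        if surface != "generator":
            raise ValueError("Specialized checkpoints narrow the Generator path only")
        if specialization not in SPECIALIZED:
            raise ValueError(f"Unknown specialization: {specialization}")
        return SPECIALIZED[specialization]
    # Omnimodal default: Nano for one GPU, Super for multi-GPU quality.
    return "nvidia/Cosmos3-Nano" if single_gpu else "nvidia/Cosmos3-Super"


print(choose_checkpoint("reasoner", single_gpu=True))
print(choose_checkpoint("generator", single_gpu=False, specialization="image2video"))
```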

## Related pages

<CardGroup>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT modes, shared mRoPE, and when to call each surface on the same weights.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Diffusers vs vLLM-Omni vs vLLM vs Framework by deployment goal.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Nano/Super Generator latency tables and Nano Reasoner serving metrics.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, action dimensions, and prompt limits per modality.
</Card>
<Card title="Action modality" href="/action-modality">
Embodiment dims, `domain_name`, and `action_mode` for Generator action workflows.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
HF auth, CUDA backend tags, and backend-specific install paths for each checkpoint size.
</Card>
</CardGroup>

---

## 07. Input and output specifications

> Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/07-input-and-output-specifications.md
- Generated: 2026-06-01T20:22:51.267Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/humanoid_robot.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/humanoid_robot.jpg`
- `cookbooks/cosmos3/reasoner/assets/video_caption.mp4`
- `cookbooks/cosmos3/reasoner/assets/grounding_2d.png`

---
title: "Input and output specifications"
description: "Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs."
---

Cosmos 3 exposes two runtime surfaces: **Reasoner** (text and vision in, text out) and **Generator** (text, vision, sound, and action in; vision, sound, and action out). The modality contracts are defined in the repository README, audiovisual cookbook assets under `cookbooks/cosmos3/generator/audiovisual/`, and Reasoner examples under `cookbooks/cosmos3/reasoner/`. Generator integrations map the same contracts through Cosmos Framework JSON specs, `Cosmos3OmniPipeline` arguments, or vLLM-Omni `size` / `num_frames` / `fps` fields.

## Surface summary

| Surface | Inputs | Outputs | Typical formats |
| --- | --- | --- | --- |
| **Reasoner** | Text, image, video | Text | Plain string; JSON for grounding or localization tasks |
| **Generator** | Text, image, video, sound, action | Image, video, sound, action | JPG/PNG image, MP4 video, AAC muxed into MP4, JSON action arrays |

```text
                    ┌─────────────────────────────────────┐
  text / vision ──► │          Cosmos 3 Reasoner          │ ──► text
                    └─────────────────────────────────────┘

  text / vision / sound / action
                    ┌─────────────────────────────────────┐
                ──► │         Cosmos 3 Generator          │ ──► vision (+ optional sound, action)
                    └─────────────────────────────────────┘
```

## Generator: input types and formats

| Input type | Composition | File / payload format |
| --- | --- | --- |
| Text only | Text-to-image, text-to-video | Plain string or structured JSON prompt (see below) |
| Text + image | Image-to-video | JPG, PNG, JPEG, WEBP (`IMAGE_EXTENSIONS` in audiovisual notebooks) |
| Text + video | Video-to-video | MP4 via `input_reference` (vLLM-Omni) or `vision_path` (Framework) |
| Text + image + action | Forward dynamics, policy | Image + JSON action array (`action_path` on server) |
| Text + video + instruction | Inverse dynamics | MP4 + text instruction |

**Framework inference JSON** (written by audiovisual notebooks) carries generation controls alongside the prompt:

| Field | Role | Cookbook default |
| --- | --- | --- |
| `model_mode` | Workflow: `text2image`, `text2video`, `image2video` | Per example |
| `name` | Run identifier / output subdirectory | Per example |
| `prompt` | Structured JSON string (compact-serialized asset file) | Asset under `assets/prompts/` |
| `vision_path` | Conditioning image (image2video), repo-relative | e.g. `assets/images/image2video/car_driving.jpg` |
| `enable_sound` | Request synchronized audio generation | `false` or `true` |
| `num_steps` | Diffusion denoising steps | `35` |
| `guidance` | CFG strength (Framework) / `guidance_scale` (Diffusers, vLLM-Omni) | `6.0` |
| `shift` | Scheduler flow shift / `flow_shift` | `10.0` |
| `fps` | Output frame rate | `24` |
| `num_frames` | Video length in frames (`1` for text-to-image) | `189` (video), `1` (image) |
| `resolution` | Tier string: `"256"`, `"480"`, `"720"` | `"720"` in cookbooks |
| `aspect_ratio` | Comma-separated pair, e.g. `"16,9"` | `"16,9"` |
| `seed` | Reproducibility | `0` |

**Reasoner Framework input** uses a smaller schema, for example:

```json
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://…/robot_153.jpg",
  "enable_sound": false
}
```

Set `enable_sound` to `false` on the current Reasoner Framework path to avoid strict argument-validation failures noted in the Reasoner cookbook README.

## Generator: output types and formats

| Output | Format | Notes |
| --- | --- | --- |
| Image | JPG (Framework); PNG base64 (vLLM-Omni `/v1/images/generations`) | `text-to-image` uses `num_frames=1` |
| Video | MP4 | Exported with `export_to_video` (Diffusers) or returned as `video/mp4` bytes (vLLM-Omni sync endpoint) |
| Sound | Stereo AAC at 48 kHz | Muxed into MP4 when sound is enabled |
| Action | JSON numeric arrays | Policy / inverse dynamics return predicted chunks; forward dynamics returns video only |

## Resolution tiers and pixel dimensions

Cosmos 3 supports three resolution tiers. **Default tier is 480p**; **default aspect ratio is 16:9**.

| Tier | 16:9 pixels (W×H) | Used in |
| --- | --- | --- |
| **256p** | 320×192 | Diffusers benchmarks; cookbook `payload_dimensions` for `resolution: "256"` |
| **480p** | 832×480 | Model default tier; Diffusers benchmarks |
| **720p** | 1280×720 | README vision conditioning; cookbook assets and quickstarts |

Benchmark tables in `inference_benchmarks.md` label these as **256p/1**, **480p/1**, and **720p/1** (height tier / aspect-ratio index). Standard video benchmarks use **189 frames at 24 FPS** unless a resolution tier limits frame count.

<Note>
Audiovisual notebook helpers currently resolve pixel sizes only for **`resolution: "720"`** and **`resolution: "256"`** with **`aspect_ratio: "16,9"`**. Other tier/ratio pairs are supported at the model level (per README) but require explicit `height`/`width` (Diffusers), `size` (vLLM-Omni), or Framework fields you set yourself.
</Note>

### Mapping tiers to API fields

| Integration | How you set resolution |
| --- | --- |
| **Cosmos Framework** | `resolution` + `aspect_ratio` in inference JSON |
| **Diffusers** | `height`, `width` (e.g. `720`, `1280`) |
| **vLLM-Omni** | `size` as `<width>x<height>` (e.g. `1280x720`) |

Checked-in structured prompts also embed explicit pixels under `resolution.W` / `resolution.H` (cookbook assets use **1280×720** for 16:9 video and image prompts).
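
The tier-to-field mapping above can be sketched as a small lookup. This is a minimal illustration, not cookbook code: `TIER_16_9` and `resolution_fields` are hypothetical names, and the pixel sizes are copied from the 16:9 tier table on this page.

```python
# Hypothetical helper: 16:9 pixel sizes per tier as (width, height),
# copied from the resolution tier table on this page.
TIER_16_9 = {"256": (320, 192), "480": (832, 480), "720": (1280, 720)}


def resolution_fields(tier: str) -> dict:
    """Return the resolution settings each integration expects for a 16:9 tier."""
    width, height = TIER_16_9[tier]
    return {
        # Cosmos Framework: tier string plus comma-separated aspect ratio.
        "framework": {"resolution": tier, "aspect_ratio": "16,9"},
        # Diffusers: explicit height and width keyword arguments.
        "diffusers": {"height": height, "width": width},
        # vLLM-Omni: a single "<width>x<height>" string.
        "vllm_omni": {"size": f"{width}x{height}"},
    }


print(resolution_fields("720"))
```

Other aspect ratios are not covered by this lookup; for those, set the explicit pixel fields yourself as the note above describes.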

## Aspect ratios

| Aspect ratio | Default? |
| --- | --- |
| 16:9 | Yes |
| 4:3 | Supported |
| 1:1 | Supported |
| 3:4 | Supported |
| 9:16 | Supported |

In Framework JSON and prompt assets, encode ratios with a **comma** separator (e.g. `"aspect_ratio": "16,9"`), not a colon. vLLM-Omni and Diffusers examples in this repo use explicit `size` or `height`/`width` for 16:9 rather than enumerating every ratio.

Optional template toggles on vLLM-Omni (`extra_params.use_resolution_template`, `use_duration_template`) let the server inject resolution/duration hints; cookbooks often disable them and pass full structured JSON instead.

## Frame rates and frame counts

| Parameter | Supported values | Default |
| --- | --- | --- |
| **FPS** | 10, 16, 24, 30 | 24 |
| **Frame count** | 5–300 | 189 |

**Duration relationship:** at 24 FPS, 189 frames is about **7.9 seconds** of video (189 / 24 = 7.875). Prompt assets for shorter clips declare matching metadata; for example, `humanoid_robot.json` uses `"duration": "7s"` and `"fps": 24` for a seven-second scene description.
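
As a quick check of the arithmetic, the sketch below computes clip length from frame count and frame rate. `clip_duration_seconds` is a hypothetical helper name used only for illustration.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds, treating every frame as lasting 1/fps seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


print(clip_duration_seconds(189, 24))  # 7.875
```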

| Workflow | Typical `num_frames` |
| --- | --- |
| Text-to-image | `1` |
| Text-to-video / image-to-video (cookbooks) | `189` |
| vLLM-Omni README curl example | `81` (valid within 5–300) |
| Action forward dynamics (AV) | 60 frames @ 10 FPS (per action cookbook) |
| Action forward dynamics (DROID) | 16 frames @ 15 FPS per chunk |
| Action forward dynamics (UMI) | 16 frames @ 20 FPS per chunk |

Action robotics notebooks run **autoregressive chunks** (e.g. five 16-frame DROID chunks); each chunk video includes its conditioning frame at index 0, which downstream stitching drops before concatenation.
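
The stitching step can be sketched as follows. This is an illustration under stated assumptions, not the notebook code: `stitch_chunks` is a hypothetical name, frames are represented as opaque list items, and whether the first chunk keeps its conditioning frame is exposed as a flag because the text above does not settle it.

```python
def stitch_chunks(chunks, keep_first_conditioning_frame=True):
    """Concatenate per-chunk frame lists, dropping each chunk's frame 0.

    Frame 0 of every chunk is its conditioning frame. For later chunks it
    repeats the last frame of the previous chunk, so it is dropped.
    Assumption: the first chunk's conditioning frame is kept by default.
    """
    stitched = []
    for index, frames in enumerate(chunks):
        keep_index0 = keep_first_conditioning_frame and index == 0
        stitched.extend(frames if keep_index0 else frames[1:])
    return stitched


# Two toy chunks whose boundary frame (2) appears in both.
print(stitch_chunks([[0, 1, 2], [2, 3, 4]]))
```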

## Vision conditioning

| Setting | Specification |
| --- | --- |
| **Spatial size** | Matches tier: 1280×720 (720p), 832×480 (480p), 320×192 (256p) |
| **Video conditioning frames** | **5 frames** at the matching resolution |
| **Image conditioning** | Single reference image (`vision_path` or `input_reference`) |
| **Video-to-video** | Source MP4 plus `condition_frame_indexes_vision` and `condition_video_keep` in vLLM-Omni `extra_params` |

Image2video cookbooks ship JPEG conditioning frames under `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/` (e.g. `humanoid_robot.jpg` paired with `humanoid_robot.json`).

## Prompt formats and length limits

### Plain text

Short natural-language strings work for quickstarts (Diffusers `prompt="…"`, vLLM-Omni `prompt` form field). For world generation, **fewer than 300 words** is recommended.

### Structured JSON prompts

Production audiovisual flows serialize rich scene JSON (subjects, lighting, cinematography, temporal segments) and pass it as the `prompt` string. Example top-level keys from `robot_draping.json` and `humanoid_robot.json`:

| Key group | Examples |
| --- | --- |
| Scene | `subjects`, `background_setting`, `lighting`, `aesthetics` |
| Motion / time | `actions`, `segments`, `temporal_caption`, `duration`, `fps` |
| Output geometry | `resolution` (`W`, `H`), `aspect_ratio` |
| Caption | `comprehensive_t2i_caption` (text-to-image) |

Cookbooks load assets with `compact_json_file()` and send `json.dumps(..., separators=(",", ":"))` so the model receives a single-line JSON string.
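
A minimal sketch of that serialization step, using only the standard library. `compact_json` and `compact_json_text` are hypothetical stand-ins written for this page; they are not the cookbook's `compact_json_file()` implementation, whose exact signature is not shown here.

```python
import json
from pathlib import Path


def compact_json(obj) -> str:
    """Serialize to a single-line JSON string with no spaces after separators."""
    return json.dumps(obj, separators=(",", ":"))


def compact_json_text(path) -> str:
    """Load a structured prompt asset from disk and return it compact-serialized."""
    return compact_json(json.loads(Path(path).read_text(encoding="utf-8")))


# Toy prompt fragment using keys named in the table above.
prompt = compact_json({"fps": 24, "resolution": {"W": 1280, "H": 720}})
print(prompt)
```

The resulting string is what you would pass as the `prompt` value.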

### Token limits (vLLM-Omni)

<ParamField body="max_sequence_length" type="integer">
Maximum prompt tokens kept for conditioning. Cosmos 3 default is **512**; longer prompts are truncated with a warning, shorter prompts padded.
</ParamField>

Prompt upsampling (Generator) uses separate LLM sampling defaults (`max_tokens` 20000, etc.) documented on the sampling page; these are independent of `max_sequence_length`.

## Sound output

| Property | Value |
| --- | --- |
| Codec | AAC |
| Channels | Stereo |
| Sample rate | 48 kHz |
| Container | Muxed into output MP4 |

Enable sound per integration:

| Integration | Flag |
| --- | --- |
| Cosmos Framework | `"enable_sound": true` in inference JSON |
| Diffusers | `enable_sound=True` on `Cosmos3OmniPipeline` (`text-to-video-with-sound` mode) |
| vLLM-Omni | `generate_sound=true` on `/v1/videos` or `/v1/videos/sync` |

## Action inputs and outputs (summary)

Action modality uses JSON arrays of pose deltas; embodiment dimensionality varies (camera 9D, AV 9D, DROID/UMI 10D, humanoid 29D per README). vLLM-Omni passes `action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, and `action_path` through `extra_params`. See the action modality page for semantics; this page only lists I/O shapes.

| `action_mode` | Primary input | Primary output |
| --- | --- | --- |
| `forward_dynamics` | Image + action chunk | Video (sync API) |
| `inverse_dynamics` | Video + instruction | Video + predicted action chunk (async API) |
| `policy` | Image + instruction | Video + predicted action chunk (async API) |

## Reasoner: input and output

Reasoner follows **Qwen3-VL-compatible** chat messages: `image_url` and `video_url` content parts plus text. Outputs are **text** (or JSON embedded in text for grounding/localization).

| Input format | Example |
| --- | --- |
| Remote image URL | `https://…/robot_153.jpg` |
| Local media | `file://` paths with `--allowed-local-media-path` on the vLLM server |
| Cookbook media assets | `video_caption.mp4`, `grounding_2d.png`, and other assets under `cookbooks/cosmos3/reasoner/assets/` |

| Parameter | Cookbook usage |
| --- | --- |
| `max_tokens` | `4096` in Reasoner vLLM examples |
| Video frame ingestion | `--media-io-kwargs '{"video": {"num_frames": -1}}'` so the processor considers all frames before downstream sampling |

Framework Reasoner currently expects **image** inputs via `vision_path`; video-heavy workflows are documented against vLLM in the Reasoner cookbook README.
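
For the vLLM path, a request body in the OpenAI chat-completions shape might look like the sketch below. It only builds the payload and sends nothing. `reasoner_request` is a hypothetical helper, and the model ID and image URL are placeholders to replace with your served model name and a real asset.

```python
import json


def reasoner_request(model: str, image_url: str, question: str, max_tokens: int = 4096) -> dict:
    """Build a chat-completions body with one image part and one text part."""
    return {
        "model": model,
        "max_tokens": max_tokens,  # cookbook examples use 4096
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


# Placeholder values; substitute the model ID your server reports.
body = reasoner_request(
    model="<served-model-id>",
    image_url="https://example.com/image.jpg",
    question="Describe what is happening in this image in one sentence.",
)
print(json.dumps(body, indent=2))
```

For video input, the same structure would carry a `video_url` content part in place of `image_url`.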

## Precision and platform constraints

| Constraint | Value |
| --- | --- |
| Precision | BF16 tested |
| Operating system | Linux |
| GPU | NVIDIA Ampere, Hopper, Blackwell |

## Related pages

<CardGroup>
  <Card title="Reasoner and Generator" href="/reasoner-and-generator">
    When to use each surface and how MoT modes differ.
  </Card>
  <Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
    Prompt upsampling, Reasoner sampling tables, and JSON schema details.
  </Card>
  <Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
    Request fields, `extra_params`, and endpoint mapping.
  </Card>
  <Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
    `Cosmos3OmniPipeline` modes and call arguments.
  </Card>
  <Card title="Action modality" href="/action-modality">
    Embodiment dimensions, `domain_name`, and action workflow modes.
  </Card>
  <Card title="Audiovisual cookbooks" href="/audiovisual-cookbooks">
    End-to-end Generator examples with checked-in prompts and images.
  </Card>
</CardGroup>

---

## 08. Action modality

> Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/08-action-modality.md
- Generated: 2026-06-01T20:24:26.058Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/assets/actions/umi.json`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json`
- `cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`

---
title: "Action modality"
description: "Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows."
---

Cosmos 3 Generator treats **action** as a first-class modality. Action tokens encode transitions between consecutive visual states and are denoised alongside vision (and optionally audio) in Generator mode. At inference time, select the workflow through `model_mode` (Cosmos Framework JSONL) or through `action_mode` plus `domain_name` (vLLM-Omni `extra_params`). Checked-in workflows under `cookbooks/cosmos3/generator/action/` exercise forward and inverse dynamics for AV (9D), DROID (10D), and UMI (10D); the root README documents additional embodiment sizes, including humanoid 29D.

## Action token semantics

In Generator mode, the diffusion path denoises image, video, audio, and **action** tokens with full attention, sharing the same transformer stack and 3D mRoPE as other modalities. The action cookbook defines tokens as **transitions between consecutive visual states**, not absolute world poses in isolation.

| Concept | Meaning |
| --- | --- |
| Token semantics | One action token per inter-frame transition (pose delta, gripper change, etc.) |
| Unified pose core | **9D** = 3D translation + **6D continuous rotation** (`rot6d`) between consecutive states |
| Grasp / hand | **1D** open–close for parallel grippers; **15D** human hand (3D × 5 fingers) where applicable |
| On-disk interchange | JSON array of rows: `[[d₀…dₙ₋₁], …]` with row count and `d` fixed per embodiment |
| Framework field | `model_mode`: `forward_dynamics`, `inverse_dynamics` (and other Generator modes outside action) |
| Serving field | `action_mode` in `extra_params`: `forward_dynamics`, `inverse_dynamics`, `policy` |

<Note>
Reasoner workflows predict **text** (next action, action CoT, etc.) from vision; they do not use `domain_name` / `action_mode`. Action **generation** and rollouts are Generator-side.
</Note>

## Embodiment dimensions

Supported conditioning sizes are embodiment-specific. Cookbooks ship runnable examples for three domains; the project README lists the full matrix.

### Cookbook-covered embodiments

| Embodiment | `domain_name` (examples) | Vector | Composition | Generation duration (cookbook) |
| --- | --- | ---: | --- | --- |
| Autonomous vehicle | `av` | **9D** | Ego pose delta (translation + rot6d) | 60 frames @ 10 FPS |
| [DROID](https://arxiv.org/abs/2403.12945) | `droid_lerobot` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 frames @ 15 FPS |
| UMI | `umi` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 frames @ 20 FPS |

DROID forward-dynamics uses multiview LeRobot data (`assets/droid_lerobot_example/`), with post-processing described as multiview concatenation, `to-OpenCV`, and normalization. UMI stores a long trajectory in `assets/actions/umi.json` (rows of 10 floats); notebooks split it into **16-action chunks** for autoregressive rollouts.

### Additional embodiments (README)

| Setting | Dimensionality | Notes |
| --- | ---: | --- |
| Camera motion | 9D | Same pose-delta family as AV |
| Autonomous vehicle | 9D | Listed separately in I/O spec; cookbook uses `av` |
| Egocentric motion | 57D | Documented; no checked-in action cookbook yet |
| Single-arm robot (DROID / UR / Fractal / Bridge / UMI) | 10D | DROID and UMI have examples |
| Dual-arm robot (dual DROID) | 20D | Documented; no cookbook example yet |
| Humanoid (AgiBot) | **29D** | Documented; action cookbook TODO lists more embodiments |

Dedicated policy checkpoint: **[Cosmos3-Nano-Policy-DROID](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID)** (16B) for DROID manipulation policy, separate from general Cosmos3-Nano action dynamics demos.

### 9D and 10D layout

**9D row** (AV and pose component of robotics):

```text
[tx, ty, tz, r1, r2, r3, r4, r5, r6]   # meters + rot6d
```

**10D row** (DROID / UMI): 9D pose delta + 1D gripper grasp state.

AV trajectories are often produced from absolute camera-to-world poses (OpenCV convention, meters) via `pose_abs_to_rel` in `cosmos_framework.data.vfm.action.pose_utils`, with:

- `rotation_format="rot6d"`
- `pose_convention="backward_framewise"`
- `translation_scale=1.35` (AV cookbook convention)

That yields **`[T−1, 9]`** relative rows for **`T`** visual frames. Checked-in AV files such as `assets/actions/av_traj_forward.json` are JSON arrays of 9-float rows.

## Action workflow modes

Three Generator **action modes** map to different inputs and outputs. They align with `model_mode` in Framework JSONL and `action_mode` in vLLM-Omni requests.

| Mode | `action_mode` / `model_mode` | Primary input | Primary output | Typical endpoint |
| --- | --- | --- | --- | --- |
| **Forward dynamics** | `forward_dynamics` | Start **image** + action trajectory | **Video** rollout | `POST /v1/videos` (cookbooks) or `POST /v1/videos/sync` (README) |
| **Inverse dynamics** | `inverse_dynamics` | **Video** + text prompt | Predicted **action** chunk (+ job metadata) | Async `POST /v1/videos` |
| **Action policy** | `policy` | **Image** + instruction | **Video** + predicted **action** chunk | Async `POST /v1/videos` |

```text
                    domain_name + embodiment dim
                              │
  forward_dynamics:  image + action[] ──► video
  inverse_dynamics:  video + prompt    ──► action[]  (no action_path in JSONL)
  policy:            image + prompt    ──► video + action[]
```

<Info>
**Forward dynamics** conditions on a known trajectory and predicts future observations. **Inverse dynamics** predicts the trajectory that explains an input video. **Policy** predicts actions (and optionally rollout video) from context and language.
</Info>

### Forward dynamics (`fd`)

- **Inputs:** `vision_path` (conditioning image), `action_path` or inline `action` array, optional `prompt`.
- **Outputs:** `vision.mp4` under the run directory; forward jobs in cookbooks do not require `action` in the completed response.
- **Frame count:** `num_frames = action_chunk_size + 1` (e.g. 61 for AV with chunk size 60).
- **Autoregressive robotics / UMI:** Later chunks use the **last generated frame** from the previous chunk as the next conditioning image (DROID: 5×16 actions; UMI: all 16-action segments in `umi.json`).
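
The frame-count rule and the autoregressive loop above can be sketched as follows. This is an outline under assumptions, not the notebook code: `split_into_chunks`, `num_frames_for`, and `rollout` are hypothetical names, and `generate_chunk` stands in for whichever backend call produces one chunk of frames.

```python
def split_into_chunks(actions, chunk_size):
    """Split a trajectory into consecutive fixed-size action chunks."""
    if len(actions) % chunk_size != 0:
        raise ValueError(f"{len(actions)} actions is not a multiple of {chunk_size}")
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]


def num_frames_for(action_chunk_size):
    """One conditioning frame plus one generated frame per action transition."""
    return action_chunk_size + 1


def rollout(first_image, actions, chunk_size, generate_chunk):
    """Run chunks in sequence, conditioning each on the previous chunk's last frame.

    generate_chunk(image, chunk) is a caller-supplied (hypothetical) callable
    that returns a list of frames whose first entry is the conditioning image.
    """
    conditioning = first_image
    chunk_videos = []
    for chunk in split_into_chunks(actions, chunk_size):
        frames = generate_chunk(conditioning, chunk)
        chunk_videos.append(frames)
        conditioning = frames[-1]  # last generated frame seeds the next chunk
    return chunk_videos


print(num_frames_for(60))  # AV: chunk size 60
```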

### Inverse dynamics (`id`)

- **Inputs:** `vision_path` points to an **MP4**; JSONL has **no** `action_path`.
- **Outputs:** Predicted ego-motion trajectory; vLLM-Omni returns `action` on the completed job (`shape`, `dtype`, `data`). Notebooks mirror Framework as `sample_outputs.json`.
- **AV example:** `raw_action_dim: 9`, `action_chunk_size: 60`, `domain_name: "av"`, `view_point: "ego_view"`.
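
Reading the returned `action` object might look like the sketch below. It assumes `shape` is a list of integers whose last entry is the per-row dimension and that `data` holds the values as a flat or nested list; neither layout is confirmed on this page, so check a real response before relying on it. `action_rows` is a hypothetical helper.

```python
import math


def _flatten(values):
    """Yield scalars from an arbitrarily nested list as floats."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from _flatten(value)
        else:
            yield float(value)


def action_rows(action: dict) -> list:
    """Reshape an action payload into rows of length shape[-1].

    Assumption: the element count of `data` equals the product of `shape`.
    """
    shape = list(action["shape"])
    flat = list(_flatten(action["data"]))
    if len(flat) != math.prod(shape):
        raise ValueError(f"expected {math.prod(shape)} values, got {len(flat)}")
    dim = shape[-1]
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


# Toy payload: 2 rows of 3 values (a real AV result would use 9 per row).
print(action_rows({"shape": [2, 3], "dtype": "float32", "data": [1, 2, 3, 4, 5, 6]}))
```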

### Policy

- Documented in README: image + instruction → video + predicted action chunk.
- Use async `POST /v1/videos` and read action from the completed result (same pattern as inverse dynamics).
- Example `domain_name` values in docs include `bridge_orig_lerobot` and `camera_pose` (see vLLM-Omni Cosmos 3 serving examples).

## `domain_name` conditioning

`domain_name` tells the model which **embodiment parser**, normalization, chunking, and view geometry apply. It must stay consistent with action row dimensionality and `action_chunk_size`.

| `domain_name` | Used in | `action_chunk_size` | `image_size` | `view_point` (examples) |
| --- | --- | ---: | ---: | --- |
| `av` | AV fd / id cookbooks | 60 | 480 | `ego_view` |
| `droid_lerobot` | DROID fd cookbook | 16 | 480 | From dataset (`viewpoint`) |
| `umi` | UMI fd cookbook | 16 | 256 | `ego_view` |

README also references `bridge_orig_lerobot`, `camera_pose`, and other robot/AV/camera variants in [vLLM-Omni Cosmos 3 online serving examples](https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/cosmos3).

<Warning>
Mismatching `domain_name`, row length, or `raw_action_dim` produces server-side errors or invalid rollouts. UMI notebooks assert every row has length **10** before chunking.
</Warning>
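
A client-side check in the spirit of that assertion could look like the sketch below. `DOMAIN_SPECS` and `validate_actions` are hypothetical names; the row widths and chunk sizes are the cookbook values listed on this page for `av`, `droid_lerobot`, and `umi` only.

```python
# Row width and chunk size per domain, taken from the cookbook tables on this page.
DOMAIN_SPECS = {
    "av": {"dim": 9, "chunk": 60},
    "droid_lerobot": {"dim": 10, "chunk": 16},
    "umi": {"dim": 10, "chunk": 16},
}


def validate_actions(domain_name, actions):
    """Check row width and chunk alignment; return the number of whole chunks."""
    if domain_name not in DOMAIN_SPECS:
        raise ValueError(f"no spec recorded for domain_name={domain_name!r}")
    spec = DOMAIN_SPECS[domain_name]
    for index, row in enumerate(actions):
        if len(row) != spec["dim"]:
            raise ValueError(f"row {index} has {len(row)} values, expected {spec['dim']}")
    if len(actions) % spec["chunk"] != 0:
        raise ValueError(f"{len(actions)} rows is not a multiple of {spec['chunk']}")
    return len(actions) // spec["chunk"]


print(validate_actions("umi", [[0.0] * 10 for _ in range(32)]))
```

Running a check like this before sending a request surfaces mismatches locally instead of as a server-side error.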

## JSONL input spec (Cosmos Framework)

Each inference line is one JSON object (JSONL). Shared fields across action cookbooks:

<ParamField body="domain_name" type="string" required>
Embodiment key (`av`, `droid_lerobot`, `umi`, …).
</ParamField>

<ParamField body="model_mode" type="string" required>
`forward_dynamics` or `inverse_dynamics` for action workflows.
</ParamField>

<ParamField body="action_chunk_size" type="integer" required>
Number of action transitions in the chunk (60 AV, 16 robotics/UMI).
</ParamField>

<ParamField body="vision_path" type="string" required>
Absolute path to conditioning **image** (fd) or **video** (id).
</ParamField>

<ParamField body="action_path" type="string">
Required for **forward_dynamics** only; path to JSON action array.
</ParamField>

<ParamField body="fps" type="integer" required>
Output video frame rate (10 AV, 15 DROID, 20 UMI in cookbooks).
</ParamField>

<ParamField body="image_size" type="integer" required>
Short-edge resolution tier (480 AV/DROID, 256 UMI). vLLM-Omni may infer the canvas from this value when no explicit `size` is given.
</ParamField>

<ParamField body="view_point" type="string" required>
Camera geometry hint, e.g. `ego_view` or dataset `viewpoint`.
</ParamField>

<ParamField body="prompt" type="string">
Task text; AV examples use *"You are an autonomous vehicle planning system."*; DROID uses dataset `ai_caption`.
</ParamField>

<ParamField body="seed" type="integer">
Per-run reproducibility seed.
</ParamField>

Example AV forward-dynamics record (conceptual):

```json
{
  "name": "av_forward",
  "domain_name": "av",
  "model_mode": "forward_dynamics",
  "action_chunk_size": 60,
  "action_path": "/path/to/av_traj_forward.json",
  "vision_path": "/path/to/av_0.jpg",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
```
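
Writing such a record as JSONL from Python could look like this sketch. The paths are the same placeholders as in the record above, and `to_jsonl` is a hypothetical helper; the only rule it encodes is one JSON object per line.

```python
import json

record = {
    "name": "av_forward",
    "domain_name": "av",
    "model_mode": "forward_dynamics",
    "action_chunk_size": 60,
    "action_path": "/path/to/av_traj_forward.json",  # placeholder path
    "vision_path": "/path/to/av_0.jpg",  # placeholder path
    "fps": 10,
    "image_size": 480,
    "view_point": "ego_view",
    "prompt": "You are an autonomous vehicle planning system.",
    "seed": 0,
}


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(item) + "\n" for item in records)


jsonl_text = to_jsonl([record])
print(jsonl_text, end="")
```

Save `jsonl_text` to a `.jsonl` file and pass that file to the entrypoint with `-i`.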

Framework entrypoint:

```bash
torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i action_forward_dynamics_av_custom.jsonl \
  -o /tmp/cosmos3_action_fd \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## vLLM-Omni `extra_params`

Multipart `POST /v1/videos` sends vision via `input_reference` and packs Cosmos-specific options in `extra_params` (JSON string; use `curl --form-string` so semicolons are not stripped).

<ParamField body="action_mode" type="string" required>
`forward_dynamics`, `inverse_dynamics`, or `policy`.
</ParamField>

<ParamField body="domain_name" type="string" required>
Same semantics as JSONL.
</ParamField>

<ParamField body="action_chunk_size" type="integer" required>
Matches JSONL / trajectory length.
</ParamField>

<ParamField body="action" type="array" required>
Required for forward dynamics: inline JSON trajectory (cookbooks load `action_path` into this field).
</ParamField>

<ParamField body="raw_action_dim" type="integer">
Set for inverse dynamics when the server must know output width (AV id: **9**).
</ParamField>

<ParamField body="image_size" type="integer" required>
Resolution tier for action canvas.
</ParamField>

<ParamField body="view_point" type="string" required>
View geometry for conditioning.
</ParamField>

<ParamField body="guardrails" type="boolean">
Cookbooks often set `false` for robotics/UMI; default product guardrails apply when omitted.
</ParamField>

Forward-dynamics request shape (from notebooks): `num_frames = action_chunk_size + 1`, `fps` from the record, `guidance_scale=1.0`, `flow_shift=10.0`, plus `extra_params` above. Poll `GET /v1/videos/{id}` until `completed`, then `GET /v1/videos/{id}/content` for MP4 bytes.
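
The request shape can be sketched as a function that assembles the text form fields. It builds values only and sends nothing. `forward_dynamics_form` is a hypothetical helper, and the form field names are taken from the names used on this page, so confirm them against the vLLM-Omni API reference before use. The conditioning image would be attached separately as the `input_reference` file part.

```python
import json


def forward_dynamics_form(record: dict, actions: list) -> dict:
    """Build text form fields for a forward-dynamics POST /v1/videos request."""
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": record["domain_name"],
        "action_chunk_size": record["action_chunk_size"],
        "action": actions,  # inline trajectory loaded from action_path
        "image_size": record["image_size"],
        "view_point": record["view_point"],
    }
    return {
        "prompt": record.get("prompt", ""),
        "num_frames": str(record["action_chunk_size"] + 1),
        "fps": str(record["fps"]),
        "guidance_scale": "1.0",
        "flow_shift": "10.0",
        # extra_params travels as a JSON string inside the multipart form.
        "extra_params": json.dumps(extra_params, separators=(",", ":")),
    }


example = forward_dynamics_form(
    {"domain_name": "av", "action_chunk_size": 60, "fps": 10,
     "image_size": 480, "view_point": "ego_view"},
    actions=[[0.0] * 9 for _ in range(60)],
)
print(example["num_frames"], example["fps"])
```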

| Mode | Returns `action` in job result? | Sync endpoint |
| --- | --- | --- |
| `forward_dynamics` | Optional; cookbooks consume video only | `POST /v1/videos/sync` supported per README |
| `inverse_dynamics` | **Yes** (required for id notebooks) | Async only |
| `policy` | **Yes** | Async only |

Start the server with `--allowed-local-media-path` covering conditioning media and action JSON paths (Docker: mount repo at `/workspace`).

## Asset layout

:::files
cookbooks/cosmos3/generator/action/
├── README.md
├── assets/
│   ├── actions/          # av_traj_*.json (9D), umi.json (10D)
│   ├── images/           # av_0.jpg, umi.png, …
│   ├── videos/           # av_*.mp4 for inverse dynamics
│   └── droid_lerobot_example/  # LeRobot layout + meta/info.json
├── run_fd_with_cosmos_framework.ipynb
├── run_fd_with_vllm.ipynb
├── run_id_with_cosmos_framework.ipynb
└── run_id_with_vllm.ipynb
:::

Outputs default to `outputs/cosmos3_action_vllm/` (vLLM) or framework package output trees under `packages/cosmos3/outputs/cookbooks/...`.

## Verification signals

| Check | Expected |
| --- | --- |
| Action JSON row width | AV (fd and id): 9 floats per row; DROID/UMI (fd): 10 floats per row |
| AV trajectory from poses | `pose_abs_to_rel` → `[T−1, 9]` |
| UMI chunking | `len(umi_action) % 16 == 0` |
| vLLM-Omni forward dynamics | `vision.mp4` in `<output>/<name>/` |
| vLLM-Omni inverse dynamics | `final.json` contains `action` with `data`; `sample_outputs.json` written |
| Frame alignment | `num_frames == action_chunk_size + 1` |

<Tip>
For AV, visualize predicted or input trajectories with `pose_rel_to_abs` using the same `rot6d`, `backward_framewise`, and `translation_scale=1.35` convention as forward-dynamics prep.
</Tip>

## Related pages

<CardGroup>
<Card title="Run Generator action workflows" href="/run-generator-action">
Step-by-step forward and inverse dynamics with Framework torchrun and vLLM-Omni multipart requests.
</Card>
<Card title="Action cookbook recipes" href="/action-cookbooks">
Notebook index for AV, DROID, and UMI with checked-in trajectories and output directories.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
`/v1/videos` fields, `action_mode` values, and `curl --form-string` constraints.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Global I/O types, action conditioning matrix, and resolution tiers.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT surfaces: when to use Generator action vs Reasoner text outputs.
</Card>
<Card title="Model family" href="/model-family">
Cosmos3-Nano, Super, and Cosmos3-Nano-Policy-DROID checkpoints.
</Card>
</CardGroup>

---

## 09. Cookbook environment setup

> Shared uv/Docker setup for all backends: HF auth, CUDA backend tags, Cosmos Framework clone/sync, Diffusers venv, vLLM + vllm-cosmos3 plugin, vLLM-Omni Docker image, and GPU verification probes.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/09-cookbook-environment-setup.md
- Generated: 2026-06-01T20:23:21.988Z

### Source Files

- `cookbooks/cosmos3/README.md`
- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb`
- `.gitignore`

---
title: "Cookbook environment setup"
description: "Shared uv/Docker setup for all backends: HF auth, CUDA backend tags, Cosmos Framework clone/sync, Diffusers venv, vLLM + vllm-cosmos3 plugin, vLLM-Omni Docker image, and GPU verification probes."
---

`cookbooks/cosmos3/README.md` is the canonical environment guide for every Cosmos3 Reasoner and Generator notebook. Each backend uses a separate install path (framework checkout under `packages/`, a repo-root `.venv` for Diffusers or vLLM, or the `vllm/vllm-omni:cosmos3` image). Pick one backend, complete its section, then run the cookbook that links to it.

## Backend map

| Backend | Install surface | Primary cookbooks |
| --- | --- | --- |
| Cosmos Framework | `packages/cosmos3/.venv` via `uv sync --group=cu130-train` or `cu128-train` | Reasoner, Generator (audiovisual, action) |
| Diffusers | Repo-root `.venv` via `uv pip install --torch-backend=…` | Generator (audiovisual) |
| Transformers | Coming soon | Reasoner |
| vLLM + `vllm-cosmos3` | Repo-root `.venv` | Reasoner |
| vLLM-Omni | Docker `vllm/vllm-omni:cosmos3` (or PR-branch venv) | Generator (audiovisual, action) |

```mermaid
flowchart TB
  subgraph cosmos_repo["cosmos repo checkout"]
    readme["cookbooks/cosmos3/README.md"]
    nb["*.ipynb cookbooks"]
  end
  subgraph fw["packages/cosmos3 — cosmos-framework"]
    uv_sync["uv sync --group=cu130-train | cu128-train"]
    fw_venv[".venv — torchrun / python -m cosmos_framework.scripts.inference"]
  end
  subgraph local_venv["repo-root .venv"]
    diff["Diffusers + Cosmos3OmniPipeline"]
    vllm_r["vllm + vllm-cosmos3 plugin"]
  end
  subgraph docker["Docker"]
    omni["vllm/vllm-omni:cosmos3 — vllm serve --omni"]
  end
  hf["Hugging Face gated models"]
  readme --> fw
  readme --> local_venv
  readme --> docker
  nb --> fw_venv
  nb --> diff
  nb --> vllm_r
  nb --> omni
  uv_sync --> fw_venv
  hf --> fw_venv
  hf --> diff
  hf --> omni
```

<Note>
The framework checkout lives under `packages/` (gitignored). Notebooks resolve `COSMOS3_REPO` from `packages/cosmos3` or `packages/cosmos-framework` when `pyproject.toml` and `cosmos_framework` are present.
</Note>
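
That lookup can be approximated with the sketch below. It is not the notebooks' code: `resolve_cosmos3_repo` is a hypothetical function, and it assumes `cosmos_framework` exists as an entry at the top level of the checkout, which the note does not spell out.

```python
import os
from pathlib import Path


def resolve_cosmos3_repo(repo_root, env=None):
    """Return the first candidate checkout that looks like cosmos-framework, else None."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("COSMOS3_REPO"):
        candidates.append(Path(env["COSMOS3_REPO"]))  # explicit override wins
    candidates.append(Path(repo_root) / "packages" / "cosmos3")
    candidates.append(Path(repo_root) / "packages" / "cosmos-framework")
    for candidate in candidates:
        has_pyproject = (candidate / "pyproject.toml").is_file()
        # Assumption: the package directory sits at the checkout root.
        has_package = (candidate / "cosmos_framework").exists()
        if has_pyproject and has_package:
            return candidate
    return None


print(resolve_cosmos3_repo("."))
```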

## Prerequisites

| Requirement | Detail |
| --- | --- |
| OS / GPU | Linux with NVIDIA GPU access |
| Tools | `uv`, `git`, `git-lfs` |
| Hugging Face | Gated Cosmos3 model access; authenticate before first download |
| Framework / vLLM plugin | SSH access to `git@github.com:NVIDIA/cosmos-framework.git` when cloning the framework or installing `vllm-cosmos3` from that repo |
| Disk | Tens of GiB for venvs, `uv` cache, and model weights |

<Steps>
<Step title="Authenticate to Hugging Face">

```bash
uvx hf@latest auth login
```

Or set a token for non-interactive runs:

```bash
export HF_TOKEN=<your_token>
```

Optional: redirect the model cache to a larger disk with `HF_HOME`.

</Step>
<Step title="Confirm uv version">

Cosmos Framework and the notebooks require **`uv >= 0.11.3`**. Older `uv` builds fail on `[tool.uv.audit]` and may not accept `--torch-backend=cu130`.

```bash
uv self update
```

</Step>
<Step title="Match CUDA backend tags to the driver">

Several backends pin a CUDA build of `torch` / `vllm` that must match the NVIDIA driver. Do not rely on `--torch-backend=auto` for vLLM cookbook installs.

| Driver CUDA | Backend tag | vLLM pin (Reasoner) |
| --- | --- | --- |
| 13.x | `cu130` | `vllm==0.21.0` |
| 12.x | `cu128` | `vllm==0.19.1` |

Framework notebooks use dependency groups `cu130-train` (default) or `cu128-train` instead of bare `cu130` / `cu128`.

</Step>
</Steps>
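
The pairing table can be expressed as a lookup for install scripts. This is a sketch with hypothetical names (`BACKENDS`, `backend_for_driver`); the tags, dependency groups, and vLLM pins are the values listed in the steps above, and how you obtain the driver's CUDA version string is left to you.

```python
# Values copied from the driver pairing table in the steps above.
BACKENDS = {
    13: {"tag": "cu130", "uv_group": "cu130-train", "vllm": "vllm==0.21.0"},
    12: {"tag": "cu128", "uv_group": "cu128-train", "vllm": "vllm==0.19.1"},
}


def backend_for_driver(cuda_version: str) -> dict:
    """Map a driver CUDA version such as '12.8' to its backend tag and pins."""
    major = int(cuda_version.split(".")[0])
    if major not in BACKENDS:
        raise ValueError(f"no backend tag listed for CUDA {cuda_version}")
    return BACKENDS[major]


print(backend_for_driver("12.8"))
```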

## Shared environment variables

| Variable | Default | When to override |
| --- | --- | --- |
| `COSMOS3_UV_GROUP` | `cu130-train` | `cu128-train` on CUDA 12.x drivers (Framework notebooks) |
| `COSMOS3_TORCH_BACKEND` | `cu130` | `cu128` for Diffusers notebook installs |
| `COSMOS3_REPO` | Auto: `packages/cosmos3` or `packages/cosmos-framework` | Custom framework checkout path |
| `HF_HOME` | `~/.cache/huggingface` | Shared or high-capacity cache |
| `GIT_LFS_SKIP_SMUDGE` | unset | Set to `1` during Framework `uv sync` to skip optional LFS test blobs |
| `VLLM_USE_DEEP_GEMM` | enabled in build | `export VLLM_USE_DEEP_GEMM=0` if DeepGEMM is unavailable |
| `UV_PROJECT_ENVIRONMENT` | Framework `.venv` path | Separate venv location for large installs |
| `CUDA_VISIBLE_DEVICES` | Notebook-specific | GPU selection for inference |

## Cosmos Framework

Native PyTorch inference uses a **cosmos-framework** checkout. From the `cosmos` repo root:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3
```

Inference imports training extras today, so sync the **`*-train`** group that matches your driver:

```bash
export GIT_LFS_SKIP_SMUDGE=1

# CUDA 13 driver (default):
uv sync --all-extras --group=cu130-train

# CUDA 12.x driver:
# uv sync --all-extras --group=cu128-train
```

Result: `packages/cosmos3/.venv`. Run commands after `source .venv/bin/activate` or via `.venv/bin/python` / `.venv/bin/torchrun`.

<Tip>
Set `export COSMOS3_UV_GROUP=cu128-train` before launching Framework notebooks on CUDA 12.x systems so install cells pick the correct group.
</Tip>

## Diffusers

Generator audiovisual notebooks use a **repo-root** managed Python 3.13 venv:

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate

uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers
```

For CUDA 12.x, use `--torch-backend=cu128` instead of `cu130`. The root README quickstart uses `--torch-backend=auto` for Diffusers only; cookbook notebooks pin `COSMOS3_TORCH_BACKEND` explicitly.

<Warning>
On headless hosts, imports may fail with `libxcb.so.1`. Install `libxcb1`, `libgl1`, and `libglib2.0-0` before running pipelines (see [Troubleshooting](/troubleshooting)).
</Warning>

## Transformers

Transformers-based Reasoner inference is documented as **coming soon** in the cookbooks guide; no install steps are published yet.

## vLLM (Reasoner)

OpenAI-compatible **reasoning** serving requires vLLM plus the **`vllm-cosmos3`** plugin (registers `Cosmos3ReasonerForConditionalGeneration`):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate

# CUDA 13 driver:
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"

# CUDA 12.x driver:
# uv pip install --torch-backend=cu128 "vllm==0.19.1" \
#   "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

If DeepGEMM is unavailable in your build:

```bash
export VLLM_USE_DEEP_GEMM=0
```

<Info>
When launching `.venv/bin/vllm` without activating the venv, keep `.venv/bin` on `PATH` so FlashInfer’s JIT build can find `ninja` in the venv.
</Info>

## vLLM-Omni (Generator)

The recommended path is the prebuilt image **`vllm/vllm-omni:cosmos3`** (all modalities in the cookbooks):

```bash
docker pull vllm/vllm-omni:cosmos3
```

**Cosmos3-Nano** (port 8000):

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

**Cosmos3-Super** (tensor parallel + optional layer offload):

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --port 8000
```

If you installed vLLM-Omni from the upstreaming PR branch instead, run the same `vllm serve ... --omni ...` command on the host without the Docker wrapper.

## Verify the environment

### PyTorch GPU probe (Framework, Diffusers, vLLM venvs)

```bash
.venv/bin/python - <<'PY'
import torch

print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Success: `cuda available: True` and a valid device name. `False` usually means a `cu130` wheel on a CUDA 12.x driver — switch to `cu128` / `cu128-train` per the tables above.

### vLLM / vLLM-Omni server probe

With the server listening on port 8000:

```bash
curl http://localhost:8000/v1/models
```

vLLM-Omni logs `Application startup complete.` when the API is ready.
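
For scripted startup, the same probe can be polled until the server answers. This is a sketch using only the standard library; `fetch_models` and `wait_until_ready` are hypothetical helper names, and the 1800 s default timeout is an assumption, not a documented value.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_models(base_url: str) -> Optional[dict]:
    """Return the parsed /v1/models response, or None if the server is not answering yet."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a body that is not valid JSON.
        return None


def wait_until_ready(
    base_url: str,
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    probe: Callable[[str], Optional[dict]] = fetch_models,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the server until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        models = probe(base_url)
        if models is not None:
            return models
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} did not answer within {timeout_s} s")
        sleep(interval_s)
```

Call `wait_until_ready("http://localhost:8000")` after `docker run` and before sending the first generation request.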

## Runtime layout and ignored paths

```text
cosmos/                          # this repo
├── cookbooks/cosmos3/
│   ├── README.md                # environment guide (this page’s source)
│   ├── generator/…/*.ipynb
│   └── reasoner/…/*.ipynb
├── packages/                    # gitignored — framework clone
│   └── cosmos3/
│       ├── .venv/
│       └── cosmos_framework/
└── .venv/                       # gitignored — Diffusers or vLLM (repo root)
```

`.gitignore` excludes `packages/`, `.venv`, cookbook `outputs/`, and `**/env.sh` (machine-specific secrets).

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, CUDA pairing, and top-level verification from the root README.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run commands after environment setup.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Pick Diffusers, vLLM-Omni, vLLM, or Cosmos Framework by goal.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Driver mismatches, `uv` version errors, libxcb, and DeepGEMM workarounds.
</Card>
</CardGroup>

---

## 10. Run Generator with Diffusers

> Install Cosmos3OmniPipeline dependencies, configure UniPC scheduler flow_shift, run text-to-image/video and image-to-video with structured JSON prompts, and export MP4 outputs.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/10-run-generator-with-diffusers.md
- Generated: 2026-06-01T20:25:11.042Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json`
- `cookbooks/cosmos3/README.md`

---
title: "Run Generator with Diffusers"
description: "Install Cosmos3OmniPipeline dependencies, configure UniPC scheduler flow_shift, run text-to-image/video and image-to-video with structured JSON prompts, and export MP4 outputs."
---

Generator audiovisual workflows in this repository call Hugging Face `Cosmos3OmniPipeline` from a dedicated Python 3.13 venv, swap in `UniPCMultistepScheduler` with `flow_shift`, pass structured scene JSON via `json.dumps`, and write PNG or MP4 under `cookbooks/cosmos3/generator/audiovisual`. The canonical runnable path is [`run_with_diffusers.ipynb`](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb); the audiovisual README quickstart matches the same API with fewer moving parts.

## Prerequisites

| Requirement | Detail |
| --- | --- |
| OS / GPU | Linux with NVIDIA Ampere, Hopper, or Blackwell GPU |
| Python tooling | `uv` ≥ 0.11.3, `git`, `git-lfs` |
| Model access | Gated Hugging Face repos (`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`) |
| Auth | `uvx hf@latest auth login` or `HF_TOKEN` |
| CUDA pairing | Pin `--torch-backend=cu130` (CUDA 13 driver) or `cu128` (CUDA 12.x) — see [Cookbook environment setup](/cookbook-environment) |

<Warning>
On headless Linux, imports may fail with `libxcb.so.1: cannot open shared object file`. Install `libxcb1`, `libgl1`, and `libglib2.0-0` before running the pipeline.
</Warning>

Work from `cookbooks/cosmos3/generator/audiovisual` so relative asset paths resolve.

## Install Cosmos3OmniPipeline dependencies

Shared install steps live in [Cookbook environment setup — Diffusers](/cookbook-environment). For a standalone venv at the repo root:

<Steps>
<Step title="Create and activate the venv">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
```

</Step>
<Step title="Install packages with a CUDA-matched torch backend">

```bash
uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers
```

Use `--torch-backend=cu128` when your driver reports CUDA 12.x.

</Step>
<Step title="Verify GPU visibility">

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and a valid device name.

</Step>
</Steps>

The notebook installs into `.venv-cosmos3-diffusers` by default (`COSMOS3_DIFFUSERS_VENV`), registers a Jupyter kernel named `Cosmos3 Diffusers (Python 3.13)`, and requires that kernel for all inference cells.

<Note>
`--torch-backend=auto` in the root README quickstart lets uv pick a CUDA wheel; on mismatched drivers this yields `torch.cuda.is_available() == False`. Prefer an explicit `cu130` or `cu128` tag as in the cookbooks guide.
</Note>

## Runtime layout

```text
cookbooks/cosmos3/generator/audiovisual/
├── assets/
│   ├── prompts/          # structured JSON per modality
│   ├── negative_prompts/ # video modes only
│   └── images/           # image2video conditioning
├── run_with_diffusers.ipynb
└── outputs/notebooks/    # default COSMOS3_AUDIOVISUAL_OUTPUT_ROOT
```

```mermaid
flowchart LR
  subgraph inputs [Inputs]
    P[JSON prompt file]
    N[negative_prompt.json]
    I[conditioning image]
  end
  subgraph pipeline [Cosmos3OmniPipeline]
    L[from_pretrained]
    S[UniPCMultistepScheduler flow_shift]
    G[pipe denoise call]
  end
  subgraph outputs [Outputs]
    PNG[PNG text2image]
    MP4[MP4 via export_to_video or encode_video]
  end
  P --> L
  N --> G
  I --> G
  L --> S --> G
  G --> PNG
  G --> MP4
```

## Configure UniPC scheduler and flow_shift

After `Cosmos3OmniPipeline.from_pretrained`, replace the default scheduler:

```python
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=10.0
)
```

Cookbook defaults use `flow_shift` (alias `shift`) **10.0**, matching vLLM-Omni `flow_shift` in the audiovisual README. Re-apply the scheduler per run if the payload overrides `shift`:

```python
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=payload["shift"]
)
```

## Structured JSON prompts

Scene prompts are JSON objects (subjects, lighting, cinematography, temporal fields), not plain strings. Pass them as compact JSON strings:

```python
import json

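with open("assets/prompts/text2video/robot_kitchen.json") as f:
    prompt = json.load(f)
with open("assets/negative_prompts/text2video/neg_prompt.json") as f:
    negative = json.load(f)

# `pipe` is the Cosmos3OmniPipeline created in "Load the pipeline".
result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    # ...remaining sampling arguments; see "Text-to-video quickstart"
)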
```

| Field group | Typical keys |
| --- | --- |
| Scene | `subjects`, `background_setting`, `lighting`, `aesthetics` |
| Motion (video) | `actions`, `segments`, `temporal_caption`, `cinematography` |
| Output hints | `resolution` (`W`/`H`), `aspect_ratio` (e.g. `"16,9"`), `duration`, `fps` |

Text-to-image prompts may use `comprehensive_t2i_caption` instead of `temporal_caption`. Negative prompts for **text-to-video** and **image-to-video** live under `assets/negative_prompts/<mode>/neg_prompt.json`. Text-to-image runs use an empty `negative_prompt`.
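
The per-mode rules above can be sketched as a small helper. The keys in `prompt_obj` come from the table, but the value shapes (lists versus strings) are assumptions made for illustration, and `build_prompt_args` is a hypothetical name rather than a cookbook function.

```python
import json
from typing import Optional

# Hypothetical scene: keys follow the table above, value shapes are assumed.
prompt_obj = {
    "subjects": ["a small warehouse robot"],
    "background_setting": "a clean warehouse floor",
    "lighting": "soft overhead light",
    "actions": ["moves a blue box across the floor"],
    "temporal_caption": "The robot lifts a blue box and carries it to a shelf.",
    "aspect_ratio": "16,9",
    "fps": 24,
}


def build_prompt_args(mode: str, prompt: dict, negative: Optional[dict] = None) -> dict:
    """Return the `prompt` / `negative_prompt` strings for a pipeline call."""
    if mode == "text2image":
        # Text-to-image runs use an empty negative prompt.
        negative_str = ""
    else:
        # Video modes pass the negative prompt as a JSON string as well.
        negative_str = json.dumps(negative if negative is not None else {})
    return {"prompt": json.dumps(prompt), "negative_prompt": negative_str}
```

The result can be splatted into the call, for example `pipe(**build_prompt_args("text2video", prompt, negative), num_frames=189, ...)`.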

Disable template injection when prompts already encode resolution and duration:

<ParamField body="add_resolution_template" type="bool">
When `False`, do not append resolution templates to the prompt (cookbook default).
</ParamField>

<ParamField body="add_duration_template" type="bool">
When `False`, do not append duration templates (cookbook default).
</ParamField>

## Default sampling parameters

| Parameter | Cookbook value | Maps to `pipe()` |
| --- | ---: | --- |
| `num_steps` | 35 | `num_inference_steps` |
| `guidance` | 6.0 | `guidance_scale` |
| `shift` | 10.0 | `UniPCMultistepScheduler` `flow_shift` |
| `fps` | 24 | `fps` |
| `num_frames` | 189 | `num_frames` (~7.9 s at 24 FPS) |
| `resolution` + `aspect_ratio` | `720` + `16,9` | `height=720`, `width=1280` |
| `seed` | 1234 | `torch.Generator(device="cuda").manual_seed(...)` |

189 frames at 24 FPS aligns with the standard Cosmos3 video profile in the root README.
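
The mapping in the table can be sketched as a translation function. `payload_to_pipe_kwargs` is a hypothetical helper, and it assumes `resolution` is the frame height with the width derived from `aspect_ratio`, which matches the one documented row (`720` + `16,9`) but is unverified for other ratios.

```python
def payload_to_pipe_kwargs(payload: dict) -> dict:
    """Translate cookbook-style payload keys into pipeline call arguments."""
    ratio_w, ratio_h = (int(part) for part in payload["aspect_ratio"].split(","))
    height = int(payload["resolution"])
    # Assumption: `resolution` is the height; width follows from the ratio.
    width = height * ratio_w // ratio_h
    # `shift` and `seed` are left out on purpose: they configure the
    # scheduler and the torch.Generator rather than the pipe() call.
    return {
        "num_inference_steps": payload["num_steps"],
        "guidance_scale": payload["guidance"],
        "fps": payload["fps"],
        "num_frames": payload["num_frames"],
        "height": height,
        "width": width,
    }


defaults = {
    "num_steps": 35,
    "guidance": 6.0,
    "shift": 10.0,
    "fps": 24,
    "num_frames": 189,
    "resolution": 720,
    "aspect_ratio": "16,9",
    "seed": 1234,
}
kwargs = payload_to_pipe_kwargs(defaults)
duration_s = defaults["num_frames"] / defaults["fps"]  # 189 / 24 = 7.875 seconds
```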

## Load the pipeline

```python
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # quickstart style
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)
```

| Checkpoint alias | Hugging Face ID |
| --- | --- |
| `Cosmos3-Nano` | `nvidia/Cosmos3-Nano` |
| `Cosmos3-Super` | `nvidia/Cosmos3-Super` |

The notebook loads with `safety_checker=None`, `enable_safety_checker=True`, optional `HF_TOKEN`, then `pipe.to("cuda")`. Super needs substantially more VRAM; expect longer first-run downloads and denoising time.

## Text-to-video quickstart

From `cookbooks/cosmos3/generator/audiovisual`:

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

with open("assets/prompts/text2video/robot_kitchen.json") as f:
    prompt = json.load(f)
with open("assets/negative_prompts/text2video/neg_prompt.json") as f:
    negative = json.load(f)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "/tmp/cosmos3_t2v_diffusers.mp4", fps=24, macro_block_size=1)
```

<Check>
Success: the MP4 is written at the target path. The first run also downloads `Cosmos3-Nano` into `HF_HOME`. Long per-step logs during the 35 denoising steps are expected and do not indicate a hang.
</Check>

## Generator modes

| Mode | `num_frames` | Conditioning | Sound | Output |
| --- | ---: | --- | --- | --- |
| Text-to-image | 1 | — | off | PNG (`result.video[0].save`) |
| Text-to-video | 189 (default) | `image=None` | optional | MP4 |
| Image-to-video | 189 (default) | `load_image(...)` | optional | MP4 |
| Text-to-video with sound | 189 | — | `enable_sound=True` | MP4 + AAC via `encode_video` |

The root README names the Diffusers modes `text-to-image`, `text-to-video`, `image-to-video`, and `text-to-video-with-sound`. Sound requires a checkpoint that includes sound modules. When `result.sound` is present, mux it with `encode_video(..., audio=result.sound, audio_sample_rate=pipe.sound_tokenizer.config.sampling_rate)`.

### Text-to-image

```python
result = pipe(
    prompt=json.dumps(prompt_obj),
    negative_prompt="",
    num_frames=1,
    height=720,
    width=1280,
    num_inference_steps=35,
    guidance_scale=6.0,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
result.video[0].save("robot_draping.png")
```

Example prompt: `assets/prompts/text2image/robot_draping.json`.

### Image-to-video

```python
from diffusers.utils import load_image

image = load_image("assets/images/image2video/car_driving.jpg")
result = pipe(
    prompt=json.dumps(prompt_obj),
    negative_prompt=json.dumps(negative_obj),
    image=image,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "car_driving.mp4", fps=24, macro_block_size=1)
```

Pair prompts under `assets/prompts/image2video/` with images under `assets/images/image2video/`.

## Notebook workflow

`run_with_diffusers.ipynb` runs these stages in order: configure paths → install venv/kernel → verify CUDA → preview assets → `create_payload(use_case, backend="diffusers")` → `run_diffusers_payload(...)` → `view_run(...)`.

| Use case key | Model | Mode |
| --- | --- | --- |
| `t2i` | Nano | text2image |
| `t2v_nano_noaudio` | Nano | text2video |
| `t2vs` | Nano | text2video + sound |
| `i2v_nano_noaudio` | Nano | image2video |
| `i2vs` | Nano | image2video + sound |
| `t2i_super` / `t2v_super_noaudio` / `i2v_super_noaudio` | Super | same modes |

Outputs default to `outputs/notebooks/diffusers/<use_case>/`.

## Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `COSMOS3_DIFFUSERS_VENV` | `<repo>/.venv-cosmos3-diffusers` | Dedicated venv path |
| `COSMOS3_TORCH_BACKEND` | `cu130` | `uv pip install --torch-backend` |
| `COSMOS3_AUDIOVISUAL_OUTPUT_ROOT` | `.../outputs/notebooks` | Payloads and media |
| `HF_HOME` | `~/.cache/huggingface` | Model cache |
| `CUDA_VISIBLE_DEVICES` | `0` | GPU selection |
| `HF_TOKEN` | unset | Gated model download |
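
As a rough illustration of how these variables and their defaults fit together, the sketch below resolves them into one settings dictionary. `load_settings` and `output_dir` are hypothetical helpers; the notebook's own resolution logic may differ.

```python
import os
from pathlib import Path
from typing import Mapping


def load_settings(repo_root: Path, env: Mapping[str, str] = os.environ) -> dict:
    """Resolve the documented environment variables, falling back to the table defaults."""
    audiovisual = repo_root / "cookbooks" / "cosmos3" / "generator" / "audiovisual"
    return {
        "venv": Path(env.get("COSMOS3_DIFFUSERS_VENV", repo_root / ".venv-cosmos3-diffusers")),
        "torch_backend": env.get("COSMOS3_TORCH_BACKEND", "cu130"),
        "output_root": Path(
            env.get("COSMOS3_AUDIOVISUAL_OUTPUT_ROOT", audiovisual / "outputs" / "notebooks")
        ),
        "hf_home": Path(env.get("HF_HOME", Path.home() / ".cache" / "huggingface")),
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES", "0"),
        "hf_token": env.get("HF_TOKEN"),  # None when unset
    }


def output_dir(settings: dict, use_case: str) -> Path:
    """Diffusers outputs land under <output_root>/diffusers/<use_case>/."""
    return settings["output_root"] / "diffusers" / use_case
```

For example, `output_dir(load_settings(Path.cwd()), "t2v_nano_noaudio")` gives the directory where that use case would write its media under these assumptions.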

## Troubleshooting

| Symptom | Mitigation |
| --- | --- |
| `torch.cuda.is_available()` is `False` | Match `--torch-backend` to driver (`cu130` / `cu128`); see [Troubleshooting](/troubleshooting) |
| `libxcb.so.1` on import | Install X11/GL libs listed above |
| `uv` rejects `--torch-backend=cu130` | Upgrade `uv` to ≥ 0.11.3 |
| Kernel mismatch in notebook | Switch to `Cosmos3 Diffusers (Python 3.13)` and run the restore cell |
| OOM on Super | Use Nano first; Super needs multi-GPU serving paths via [Run Generator with vLLM-Omni](/run-generator-vllm-omni) |

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv install, CUDA backend tags, and GPU verification for Diffusers, Framework, and vLLM backends.
</Card>
<Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
`Cosmos3OmniPipeline.from_pretrained` modes, call arguments, and `export_to_video` details.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON schema, prompt-upsampling defaults, and template flags.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
Notebook index for Diffusers, Framework, and vLLM-Omni with asset layout.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to prefer Diffusers vs vLLM-Omni vs Cosmos Framework.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, aspect ratios, and output formats.
</Card>
</CardGroup>

---

## 11. Run Generator with vLLM-Omni

> Start vllm/vllm-omni:cosmos3 Docker server, tensor-parallel and CFG/Ulysses options for Super, POST vision/action endpoints, guardrails toggles, and deploy-config for server-wide guardrail disable.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/11-run-generator-with-vllm-omni.md
- Generated: 2026-06-01T20:24:01.912Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb`

---
title: "Run Generator with vLLM-Omni"
description: "Start vllm/vllm-omni:cosmos3 Docker server, tensor-parallel and CFG/Ulysses options for Super, POST vision/action endpoints, guardrails toggles, and deploy-config for server-wide guardrail disable."
---

The Cosmos 3 Generator production path serves `nvidia/Cosmos3-Nano` or `nvidia/Cosmos3-Super` through the prebuilt `vllm/vllm-omni:cosmos3` image with `vllm serve … --omni --model-class-name Cosmos3OmniDiffusersPipeline`, exposing OpenAI-compatible `/v1/images/generations` and `/v1/videos` routes on port 8000.

<Info>
Cosmos 3 Generator support is upstreaming in [vllm-project/vllm-omni#3454](https://github.com/vllm-project/vllm-omni/pull/3454). Until merge, `vllm/vllm-omni:cosmos3` is the image with every modality (vision, sound, action); the PR-branch install covers only text-to-image, text-to-video, and image-to-video.
</Info>

## Prerequisites

| Requirement | Notes |
| --- | --- |
| Linux + NVIDIA GPU | Ampere, Hopper, or Blackwell |
| Hugging Face auth | Gated Cosmos3 checkpoints: `uvx hf@latest auth login` or `HF_TOKEN` |
| Docker + NVIDIA runtime | `--runtime nvidia --gpus all` for the server container |
| Local media paths | Mount host directories and set `--allowed-local-media-path` so the server can read conditioning images, videos, and action files |

Shared cookbook setup (CUDA driver pairing, HF cache mounts) lives on the [Cookbook environment setup](/cookbook-environment) page.

## Start the server

Pull the image once:

```bash
docker pull vllm/vllm-omni:cosmos3
```

<Steps>
<Step title="Cosmos3-Nano (single GPU)">

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800
```

</Step>
<Step title="Cosmos3-Super (tensor parallel + offload)">

`Cosmos3-Super` (64B) typically needs multiple GPUs. `--tensor-parallel-size` shards weights; `--enable-layerwise-offload` moves transformer blocks between CPU and GPU (lower peak VRAM, higher latency, more host RAM).

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --port 8000 \
  --init-timeout 1800
```

Set `--tensor-parallel-size` to the number of GPUs you allocate.

</Step>
<Step title="Verify readiness">

The process prints `Application startup complete.` when the API is ready. Probe models:

```bash
curl http://localhost:8000/v1/models
```

</Step>
</Steps>

### Parallelism options (Super and Nano)

| Flag | Effect |
| --- | --- |
| `--tensor-parallel-size N` | Shard model weights across `N` GPUs |
| `--enable-layerwise-offload` | Offload transformer blocks CPU↔GPU between steps |
| `--cfg-parallel-size 2` | Run positive and negative CFG branches on two GPUs in parallel |
| `--ulysses-degree 2` | Ulysses sequence parallelism across the sequence dimension |

<Warning>
When combining flags, provision GPUs for the product  
`tensor_parallel_size × cfg_parallel_size × ulysses_degree`.
</Warning>
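
The GPU budget is plain multiplication. A small sketch (the function name is hypothetical) makes the arithmetic explicit before you pick `--gpus`:

```python
def required_gpus(
    tensor_parallel_size: int = 1,
    cfg_parallel_size: int = 1,
    ulysses_degree: int = 1,
) -> int:
    """GPUs needed when the three parallelism flags are combined."""
    degrees = {
        "tensor_parallel_size": tensor_parallel_size,
        "cfg_parallel_size": cfg_parallel_size,
        "ulysses_degree": ulysses_degree,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return tensor_parallel_size * cfg_parallel_size * ulysses_degree
```

With `--tensor-parallel-size 4 --cfg-parallel-size 2`, for instance, `required_gpus(4, 2)` evaluates to 8.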

With CFG parallel, set the guidance strength through the request's `guidance_scale` field. Do **not** use `true_cfg_scale` with these Cosmos3 examples.

Example Nano serve with CFG parallel (no Docker wrapper if installed from source):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --cfg-parallel-size 2 \
  --port 8000 \
  --init-timeout 1800
```

### PR-branch install (three vision modes only)

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 \
  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
```

Then run `vllm serve` directly (no `docker run … vllm/vllm-omni:cosmos3` wrapper) with the same `--omni` and `--model-class-name` flags.

Once a server is running, clients reach it through three routes:

```text
Client (curl / requests)
        │
        ▼
POST /v1/images/generations  ──► PNG (base64 in JSON)
POST /v1/videos/sync         ──► MP4 bytes (blocking)
POST /v1/videos              ──► job id → poll → /content or action in final JSON
        │
        ▼
vllm serve (Docker: vllm/vllm-omni:cosmos3)
  --omni --model-class-name Cosmos3OmniDiffusersPipeline
```

## Vision generation endpoints

| Mode | Endpoint | Request and response notes |
| --- | --- | --- |
| Text to image | `POST /v1/images/generations` | Base64 PNG in JSON |
| Text to video | `POST /v1/videos/sync` | MP4 body |
| Image to video | `POST /v1/videos/sync` | Upload `input_reference` image |
| Video to video | `POST /v1/videos/sync` | Upload source video; set conditioning frames in `extra_params` |
| Video with sound | `POST /v1/videos/sync` | `generate_sound=true` (+ optional `sound_duration`) |

Point separate Nano and Super servers at different bases with:

```bash
export COSMOS3_VLLM_NANO_BASE_URL=http://localhost:8000
export COSMOS3_VLLM_SUPER_BASE_URL=http://localhost:8001
```
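
A client can then choose the base URL from the model ID. This is a sketch only: `base_url_for` is a hypothetical helper, and the fallback ports (8000 for Nano, 8001 for Super) are assumptions copied from the example exports above.

```python
import os
from typing import Mapping


def base_url_for(model_id: str, env: Mapping[str, str] = os.environ) -> str:
    """Pick the server base URL for a Cosmos3 checkpoint."""
    if model_id.endswith("Cosmos3-Super"):
        return env.get("COSMOS3_VLLM_SUPER_BASE_URL", "http://localhost:8001")
    # Anything else is routed to the Nano server in this sketch.
    return env.get("COSMOS3_VLLM_NANO_BASE_URL", "http://localhost:8000")
```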

### Text-to-video (sync)

<RequestExample>

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=189" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=6.0" \
  --form-string "flow_shift=10.0" \
  --form-string "seed=0" \
  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
  -H "Accept: video/mp4" \
  -o cosmos3_t2v.mp4
```

</RequestExample>

Audiovisual cookbooks use structured JSON prompts from `cookbooks/cosmos3/generator/audiovisual/assets/prompts/` with the same fields; default sampling in the vLLM-Omni notebook is 35 steps, `guidance_scale=6.0`, `flow_shift=10.0`, 189 frames at 24 FPS, 1280×720.
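
The same request can be sent from Python. The sketch below mirrors the curl fields one to one; `build_t2v_fields` and `post_t2v` are hypothetical helper names, and sending each field as `(None, value)` through `requests`' `files=` argument is one way to produce multipart form fields without a filename. The 3600 s timeout is an arbitrary choice.

```python
import json
from typing import Union


def build_t2v_fields(
    prompt: Union[str, dict],
    negative_prompt: str = "",
    *,
    guardrails: bool = True,
    seed: int = 0,
) -> dict:
    """Build the multipart text fields for POST /v1/videos/sync."""
    return {
        # Structured prompts are sent as a JSON string.
        "prompt": prompt if isinstance(prompt, str) else json.dumps(prompt),
        "negative_prompt": negative_prompt,
        "size": "1280x720",
        "num_frames": "189",
        "fps": "24",
        "num_inference_steps": "35",
        "guidance_scale": "6.0",
        "flow_shift": "10.0",
        "seed": str(seed),
        "extra_params": json.dumps(
            {
                "use_resolution_template": False,
                "use_duration_template": False,
                "guardrails": guardrails,
            }
        ),
    }


def post_t2v(base_url: str, fields: dict, out_path: str) -> None:
    """Send the request and write the MP4 body to disk."""
    import requests  # third-party; imported lazily so the builder stays dependency-free

    resp = requests.post(
        f"{base_url}/v1/videos/sync",
        files={name: (None, value) for name, value in fields.items()},
        headers={"Accept": "video/mp4"},
        timeout=3600,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as fh:
        fh.write(resp.content)
```

Because the JSON is serialized in Python rather than typed on a shell command line, this route also sidesteps shell quoting of `extra_params`.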

### Image-to-video

Add the conditioning file:

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=..." \
  --form-string "size=1280x720" \
  ... \
  -F "input_reference=@/path/to/image.jpg" \
  -H "Accept: video/mp4" \
  -o cosmos3_i2v.mp4
```

### Text-to-image

Image requests use JSON (`extra_args` for Cosmos-specific toggles) rather than multipart `extra_params`:

```python
import requests

body = {
    "prompt": "...",
    "size": "1280x720",
    "n": 1,
    "num_inference_steps": 35,
    "guidance_scale": 6.0,
    "flow_shift": 10.0,
    "seed": 0,
    "extra_args": {
        "use_resolution_template": False,
        "guardrails": True,
    },
}
resp = requests.post("http://localhost:8000/v1/images/generations", json=body, timeout=600)
resp.raise_for_status()  # the JSON response carries the PNG as base64
```

<Tip>
Use `--form-string` for text fields (`prompt`, `negative_prompt`, `extra_params`). With `-F`, curl treats `;` as a content-type separator and can truncate JSON values.
</Tip>

## Action generation endpoints

Action modes condition on `domain_name` and exchange video/action sequences. Embodiment dimensions and semantics are documented on [Action modality](/action-modality).

| `action_mode` | Typical endpoint | Input | Output |
| --- | --- | --- | --- |
| `forward_dynamics` | `POST /v1/videos` (async) or `POST /v1/videos/sync` | Image + action chunk | Video |
| `inverse_dynamics` | `POST /v1/videos` (async) | Video + instruction | Predicted action in completed job JSON |
| `policy` | `POST /v1/videos` (async) | Image + instruction | Video + action chunk |

Cookbook forward-dynamics jobs POST multipart to `/v1/videos`, poll `GET /v1/videos/{id}`, then download `GET /v1/videos/{id}/content` for the MP4.
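
The poll step can be sketched as a loop. The `status` field and its `completed` / `failed` values are assumptions borrowed from OpenAI-style video job responses, so check them against a real response from your server. `wait_for_video` and `content_path` are hypothetical helpers, and `get_json` is any callable that performs the GET and returns parsed JSON.

```python
import time
from typing import Callable


def wait_for_video(
    job_id: str,
    get_json: Callable[[str], dict],
    timeout_s: float = 3600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll GET /v1/videos/{id} until the job finishes and return the final job JSON."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_json(f"/v1/videos/{job_id}")
        status = job.get("status")  # assumed field name
        if status == "completed":  # assumed terminal value
            return job
        if status == "failed":  # assumed terminal value
            raise RuntimeError(f"video job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video job {job_id} still {status!r} after {timeout_s} s")
        sleep(interval_s)


def content_path(job_id: str) -> str:
    """Route that serves the finished MP4."""
    return f"/v1/videos/{job_id}/content"
```

With `requests`, `get_json` could be `lambda path: requests.get(base_url + path, timeout=30).json()`; after `wait_for_video` returns, download `base_url + content_path(job_id)` to get the MP4.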

<ParamField body="extra_params (JSON)" type="object">
Action-related keys include `action_mode`, `domain_name` (e.g. `av`, `droid_lerobot`, `umi`), `action_chunk_size`, `image_size`, `view_point`, inline `action` array or `action_path`, and optional `raw_action_dim` for inverse dynamics.
</ParamField>

Example forward-dynamics `extra_params` shape (AV):

```json
{
  "action_mode": "forward_dynamics",
  "domain_name": "av",
  "action_chunk_size": 60,
  "image_size": [320, 576],
  "view_point": 0,
  "action": [[...]],
  "guardrails": false
}
```

Inverse dynamics sets `action_mode` to `inverse_dynamics`, `raw_action_dim` to `9` for AV ego pose, and uploads the source clip as `input_reference` (`video/mp4`).

Mount the repo (or action asset directory) into the container and keep paths visible under `--allowed-local-media-path`.

## Common request fields

| Field | Purpose |
| --- | --- |
| `prompt` | Positive prompt (plain text or JSON string for structured prompts) |
| `negative_prompt` | Concepts to avoid (video modes) |
| `size` | `<width>x<height>` (e.g. `1280x720`) |
| `num_frames`, `fps` | Video length and frame rate |
| `num_inference_steps` | Diffusion denoising steps |
| `guidance_scale` | CFG scale for Cosmos3 (not `true_cfg_scale`) |
| `flow_shift` | Scheduler flow-shift |
| `seed` | Reproducibility |
| `max_sequence_length` | Prompt token cap (default `512`; longer prompts truncated) |
| `input_reference` | Conditioning image or video file |
| `generate_sound` | `true` for synchronized audio |
| `extra_params` | JSON Cosmos3 options (action, guardrails, templates, v2v conditioning) |
| `extra_args` | Image-endpoint Cosmos3 options |

## Guardrails

Cosmos3 ships safety guardrails that screen prompts and blur faces in outputs.

**Per request** — set `guardrails` inside `extra_params` (video) or `extra_args` (image):

```bash
--form-string 'extra_params={"guardrails":false,"use_resolution_template":false,"use_duration_template":false}'
```

Action cookbooks commonly set `"guardrails": false` for robotics and AV rollouts.

**Server-wide disable** — guardrail models are not loaded; per-request `guardrails: true` cannot re-enable them. Pass a deploy config (a future release may add `--cosmos3-no-guardrails`):

```yaml
# no_guardrails.yaml
async_chunk: false
stages:
  - stage_id: 0
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    model_class_name: Cosmos3OmniDiffusersPipeline
    model_config:
      guardrails: false
      offload_guardrail_models: false
```

```bash
vllm serve nvidia/Cosmos3-Nano --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --deploy-config no_guardrails.yaml \
  --allowed-local-media-path / \
  --port 8000
```

## Notebook-oriented server layout

Action notebooks often bind host port **8001** to container **8000** and pin a single GPU:

```bash
docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true

docker run -d --name cosmos3-vllm-omni-notebook \
  --runtime nvidia --gpus '"device=0"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD:/workspace" \
  -p 8001:8000 --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
    --omni \
    --model-class-name Cosmos3OmniDiffusersPipeline \
    --allowed-local-media-path / \
    --port 8000 \
    --init-timeout 1800

export COSMOS3_VLLM_BASE_URL=http://localhost:8001
curl http://localhost:8001/v1/models
```

Outputs default to `outputs/cosmos3_action_vllm/` (action) or `cookbooks/cosmos3/generator/audiovisual/outputs/notebooks/` (audiovisual).

## Troubleshooting

| Symptom | Check |
| --- | --- |
| Server never ready | Increase `--init-timeout`; confirm HF cache and model download; `docker logs` |
| `403` / model not found | Hugging Face login and license acceptance for gated repos |
| Local file not found | Volume mount and `--allowed-local-media-path` cover the path used in requests |
| Truncated `extra_params` | Use `--form-string`, not `-F`, for JSON fields |
| OOM on Super | Raise `--tensor-parallel-size`, add `--enable-layerwise-offload`, or reduce resolution/frame count |

See [Troubleshooting](/troubleshooting) for CUDA driver and container pairing.

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
HF auth, Docker image pull, and GPU verification shared across backends.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
Full endpoint field lists, `action_mode` values, and curl constraints.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics across Framework and vLLM-Omni with `domain_name` conditioning.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
End-to-end text/image/video (+ sound) notebooks using `run_with_vllm_omni.ipynb`.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to pick vLLM-Omni vs Diffusers vs Cosmos Framework.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Published vLLM-Omni latency by GPU, resolution, and tensor-parallel width.
</Card>
</CardGroup>

---

## 12. Run Generator with Cosmos Framework

> Clone cosmos-framework, uv sync cu130-train/cu128-train groups, torchrun cosmos_framework.scripts.inference with parallelism presets, checkpoint-path, and JSON input specs from cookbook assets.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/12-run-generator-with-cosmos-framework.md
- Generated: 2026-06-01T20:24:41.535Z

### Source Files

- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `README.md`

---
title: "Run Generator with Cosmos Framework"
description: "Clone cosmos-framework, uv sync cu130-train/cu128-train groups, torchrun cosmos_framework.scripts.inference with parallelism presets, checkpoint-path, and JSON input specs from cookbook assets."
---

Generator audiovisual and action workflows in this repository run through the **Cosmos Framework** checkout (`cosmos_framework.scripts.inference`), launched with `torchrun` for multi-GPU diffusion or `python -m` for single-process action runs. Cookbooks under `cookbooks/cosmos3/generator/` supply structured JSON prompts, conditioning images, action trajectories, and example `torchrun` invocations against `Cosmos3-Nano` and `Cosmos3-Super`.

## When to use this path

| Goal | Cosmos Framework | Alternative in this repo |
| --- | --- | --- |
| Research-style PyTorch inference with full checkpoint control | Yes | Diffusers (`Cosmos3OmniPipeline`) |
| Production OpenAI-compatible serving | No | vLLM-Omni |
| Training, evaluation, omni-model recipes | Yes (framework repo) | — |

The framework path installs the training dependency groups (`*-train`) because the current inference entrypoint depends on those modules.

## Prerequisites

<Steps>
<Step title="Host and access">

- Linux with NVIDIA GPU (Ampere, Hopper, or Blackwell per product docs).
- [`uv`](https://docs.astral.sh/uv/getting-started/installation/) **≥ 0.11.3**, `git`, and `git-lfs`.
- Hugging Face access to gated Cosmos3 repos: `uvx hf@latest auth login` or `export HF_TOKEN=...`.
- Read access to [NVIDIA/cosmos-framework](https://github.com/NVIDIA/cosmos-framework) (HTTPS or SSH clone URL).

</Step>
<Step title="CUDA driver pairing">

Match the `uv` dependency group to your driver CUDA major version:

| Driver CUDA | `uv sync` group | Set before notebooks |
| --- | --- | --- |
| 13.x | `cu130-train` | `export COSMOS3_UV_GROUP=cu130-train` (default) |
| 12.x | `cu128-train` | `export COSMOS3_UV_GROUP=cu128-train` |

Only `cu130-train` and `cu128-train` are defined in the framework `pyproject.toml`. A CUDA 12.x driver with the default `cu130-train` group typically yields `cuda available: False` in the verify step.

</Step>
</Steps>
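
The driver-to-group pairing in the table above can be sketched as a small helper. This is only an illustration: `pick_uv_group` is a hypothetical name, not part of the cookbooks, and it assumes you pass in the CUDA version string that `nvidia-smi` reports for the driver.

```python
def pick_uv_group(driver_cuda_version: str) -> str:
    """Map a driver CUDA version string (e.g. "13.0") to a uv dependency group.

    Hypothetical helper: only the two groups named in this page are known.
    """
    major = int(driver_cuda_version.strip().split(".")[0])
    groups = {13: "cu130-train", 12: "cu128-train"}
    if major not in groups:
        raise ValueError(f"no uv group documented for CUDA {major}.x drivers")
    return groups[major]


# Example: export the result as COSMOS3_UV_GROUP before running notebooks.
print(pick_uv_group("12.8"))
```

Raising on unknown major versions is a design choice of this sketch; it avoids silently falling back to the `cu130-train` default on an unsupported driver.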

Shared backend setup for all cookbooks lives in [Cookbook environment setup](/cookbook-environment).

## Install Cosmos Framework

From the `cosmos` repository root, clone (or reuse) the framework tree and sync dependencies:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3

# Skip LFS smudge for lerobot test artifacts the cookbooks do not need.
export GIT_LFS_SKIP_SMUDGE=1

# CUDA 13 driver (default):
uv sync --all-extras --group=cu130-train

# CUDA 12.x driver:
# uv sync --all-extras --group=cu128-train
```

<Note>
The cookbooks README clones into `packages/cosmos3`. The audiovisual notebook also accepts `packages/cosmos-framework` if that path already contains `pyproject.toml` and `cosmos_framework/`.
</Note>

The install creates `.venv` at `packages/cosmos3/.venv`. Either activate it (`source .venv/bin/activate`) or call `.venv/bin/torchrun` and `.venv/bin/python` by absolute path.

Optional: point `UV_PROJECT_ENVIRONMENT` at a large-disk venv path before `uv sync` (audiovisual notebook pattern).

## Verify GPU and Python

```bash
cd packages/cosmos3
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and a device name before running generation.

## Inference entrypoint

```text
packages/cosmos3/.venv/bin/torchrun  →  -m cosmos_framework.scripts.inference
                                              │
                    ┌─────────────────────────┼─────────────────────────┐
                    ▼                         ▼                         ▼
            Audiovisual JSON           Action JSONL              (Reasoner — other page)
            text2image / t2v / i2v     forward_dynamics /
                                       inverse_dynamics
```

Core CLI shape:

```bash
torchrun --nproc-per-node=<N> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=<preset> \
  -i <input.json|.jsonl> \
  -o <output_dir> \
  --checkpoint-path <Cosmos3-Nano|Cosmos3-Super> \
  [--seed=0] [--benchmark]
```

| Flag | Role |
| --- | --- |
| `-i` | Input spec: single JSON file or JSONL (one JSON object per line for multi-run specs) |
| `-o` | Output root directory |
| `--checkpoint-path` | Hugging Face checkpoint id, e.g. `Cosmos3-Nano`, `Cosmos3-Super` |
| `--parallelism-preset` | Framework parallelism profile (see below) |
| `--seed` | Reproducibility seed (examples use `0`) |
| `--benchmark` | Write timing metadata (`benchmark.json`) — used in action notebooks |
| `--video-save-quality` | Video encode quality (action AV example uses `8`) |
| `--image_size` | Output size hint for action runs (e.g. `480`) |

Action cookbooks sometimes set distributed env vars manually and call `.venv/bin/python -m cosmos_framework.scripts.inference` with `RANK=0 WORLD_SIZE=1` instead of `torchrun`.
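
The CLI shape above can also be assembled programmatically, for example when launching several runs from a script. The sketch below is an assumption-laden illustration: `build_inference_cmd` is a hypothetical helper, and it only uses flags listed in the table above.

```python
import shlex


def build_inference_cmd(
    input_spec: str,
    output_dir: str,
    checkpoint: str = "Cosmos3-Nano",
    preset: str = "throughput",
    nproc: int = 1,
    seed: int = 0,
    benchmark: bool = False,
    launcher: str = "torchrun",
) -> list[str]:
    """Return the argv list for a torchrun launch of the inference module.

    Hypothetical helper; flag names are copied from the table above.
    """
    cmd = [
        launcher,
        f"--nproc-per-node={nproc}",
        "-m", "cosmos_framework.scripts.inference",
        f"--parallelism-preset={preset}",
        "-i", input_spec,
        "-o", output_dir,
        "--checkpoint-path", checkpoint,
        f"--seed={seed}",
    ]
    if benchmark:
        cmd.append("--benchmark")
    return cmd


cmd = build_inference_cmd(
    "assets/prompts/text2video/robot_kitchen.json",
    "/tmp/cosmos3_t2v_framework",
)
# Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps arguments with spaces intact when the command is handed to `subprocess.run`.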

## Parallelism presets

| Preset | Typical Generator use | Launch pattern |
| --- | --- | --- |
| `throughput` | Audiovisual text-to-image, text-to-video, image-to-video | `torchrun --nproc-per-node=$COSMOS3_NUM_GPUS` (notebook default **4**) |
| `latency` | Action forward/inverse dynamics | Single GPU: `python -m` or `torchrun --nproc-per-node=1` |

Audiovisual runs also pass `--master-addr` and `--master-port` (notebook allocates free ports per workflow). Quickstart text-to-video uses `--nproc-per-node=1`.

## Quickstart: text-to-video (Nano)

After framework install, from `cookbooks/cosmos3/generator/audiovisual/`:

```bash
# Use the torchrun from the framework venv. The path below is relative to the
# repo root; from this directory, adjust it or use an absolute path.
packages/cosmos3/.venv/bin/torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i assets/prompts/text2video/robot_kitchen.json \
  -o /tmp/cosmos3_t2v_framework \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

<Check>
First run downloads `Cosmos3-Nano` via Hugging Face. Diffusion over 189 frames at 720p is compute-heavy; long step times are expected.
</Check>

For **Cosmos3-Super**, set `--checkpoint-path Cosmos3-Super` and increase `--nproc-per-node` to match available GPUs (notebook uses the same `throughput` preset with multi-GPU `torchrun`).

## Cookbook assets layout

Audiovisual prompts and conditioning media live under `cookbooks/cosmos3/generator/audiovisual/assets/`:

:::files
cookbooks/cosmos3/generator/audiovisual/assets/
├── prompts/
│   ├── text2image/          # e.g. robot_draping.json
│   ├── text2video/          # e.g. robot_kitchen.json, robot_pouring_water_audio.json
│   └── image2video/         # e.g. car_driving.json, coastal_road_audio.json
├── negative_prompts/
│   ├── text2video/neg_prompt.json
│   └── image2video/neg_prompt.json
└── images/image2video/      # e.g. car_driving.jpg, coastal_road_audio.jpg
:::

Prompt files are **structured JSON scene specs** (subjects, cinematography, `temporal_caption`, `resolution`, `fps`, etc.), not plain strings. The quickstart passes the prompt file directly to `-i`; the full notebook wraps the same files into framework payload JSON with sampling fields.

Action examples use `cookbooks/cosmos3/generator/action/assets/` (images, videos, trajectories) and write specs under `packages/cosmos3/outputs/cookbooks/cosmos3/generator/action/inputs/`.

## Audiovisual input spec (notebook payload)

The audiovisual notebook builds per-run JSON under `outputs/notebooks/pytorch/payloads/<use_case>.json` with a consistent schema:

| Field | Typical value | Notes |
| --- | --- | --- |
| `model_mode` | `text2image`, `text2video`, `image2video` | Selects Generator modality |
| `prompt` | Compact JSON **string** of the scene spec file | From `assets/prompts/...` |
| `negative_prompt` | Compact JSON string or `""` | From `assets/negative_prompts/<mode>/neg_prompt.json`; empty for text2image |
| `enable_sound` | `true` / `false` | Sound-bearing prompts use dedicated asset pairs |
| `num_steps` | `35` | Diffusion steps |
| `guidance` | `6.0` | CFG scale |
| `shift` | `10.0` | Scheduler flow shift |
| `fps` | `24` | |
| `num_frames` | `189` (video), `1` (text2image) | ~7.9 s at 24 FPS for default video |
| `resolution` | `"720"` | |
| `aspect_ratio` | `"16,9"` | Comma-separated pair in cookbook payloads |
| `seed` | `0` | |
| `vision_path` | Relative path to conditioning image | Required for `image2video` |

Example payload fragment (text-to-video, no audio):

```json
{
  "model_mode": "text2video",
  "name": "t2v_nano_noaudio",
  "prompt": "{...compact scene JSON...}",
  "negative_prompt": "{...compact neg prompt JSON...}",
  "enable_sound": false,
  "num_steps": 35,
  "guidance": 6.0,
  "shift": 10.0,
  "fps": 24,
  "num_frames": 189,
  "resolution": "720",
  "aspect_ratio": "16,9",
  "seed": 0
}
```

Image-to-video adds `vision_path` relative to the payload file directory (e.g. path into `assets/images/image2video/`).
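
As a rough illustration of how a scene spec becomes a payload, the sketch below wraps a spec dictionary into the schema from the table above. It is not the notebook's code: `compact` and `build_payload` are hypothetical helpers, the scene spec content is a placeholder, and the defaults are simply the typical values listed above.

```python
import json


def compact(obj) -> str:
    """Serialize a scene spec to a compact JSON string (no extra whitespace)."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


def build_payload(name, model_mode, scene_spec, negative_spec=None,
                  enable_sound=False, vision_path=None):
    """Wrap a scene spec into a payload dict using the typical values above."""
    is_image = model_mode == "text2image"
    payload = {
        "model_mode": model_mode,
        "name": name,
        "prompt": compact(scene_spec),
        # text2image uses an empty negative prompt in the cookbook payloads.
        "negative_prompt": "" if negative_spec is None else compact(negative_spec),
        "enable_sound": enable_sound,
        "num_steps": 35,
        "guidance": 6.0,
        "shift": 10.0,
        "fps": 24,
        "num_frames": 1 if is_image else 189,
        "resolution": "720",
        "aspect_ratio": "16,9",
        "seed": 0,
    }
    if model_mode == "image2video":
        if vision_path is None:
            raise ValueError("image2video requires vision_path")
        payload["vision_path"] = vision_path
    return payload


# Placeholder scene spec; real specs come from assets/prompts/.
payload = build_payload("t2v_nano_noaudio", "text2video",
                        {"temporal_caption": "placeholder caption"})
duration_s = payload["num_frames"] / payload["fps"]  # 189 / 24 = 7.875 seconds
```

The `prompt` field stays a string containing JSON, not a nested object, which matches the payload fragment shown above.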

### Notebook asset matrix

| Use case key | Checkpoint | Mode | Sound |
| --- | --- | --- | --- |
| `t2i` | Cosmos3-Nano | text2image | off |
| `t2i_super` | Cosmos3-Super | text2image | off |
| `t2v_nano_noaudio` | Cosmos3-Nano | text2video | off |
| `t2vs` | Cosmos3-Nano | text2video | on |
| `i2v_nano_noaudio` | Cosmos3-Nano | image2video | off |
| `i2vs` | Cosmos3-Nano | image2video | on |
| `t2v_super_noaudio` | Cosmos3-Super | text2video | off |
| `i2v_super_noaudio` | Cosmos3-Super | image2video | off |

Run pattern (text-to-image on Nano):

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0,1,2,3 \
  .venv/bin/torchrun \
  --nproc-per-node=4 \
  --master-addr=127.0.0.1 \
  --master-port=<free_port> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i /path/to/t2i.json \
  -o /path/to/output/t2i \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## Scale checkpoints and GPUs

| Checkpoint | Size | Cookbook GPU hint |
| --- | ---: | --- |
| `Cosmos3-Nano` | 16B | Quickstart: 1 GPU; audiovisual notebook default: 4 GPUs |
| `Cosmos3-Super` | 64B | Same `throughput` preset; raise `--nproc-per-node` to available GPU count |

Set `export COSMOS3_NUM_GPUS=4` and `export CUDA_VISIBLE_DEVICES=0,1,2,3` before notebook cells, or pass `--nproc-per-node` explicitly in shell commands.

## Action Generator (forward and inverse dynamics)

Action workflows use **JSONL** specs (one JSON object per line) and the `latency` preset. They are documented in depth on [Run Generator action workflows](/run-generator-action); summary for Framework-only runs:

**Forward dynamics** (`model_mode`: `forward_dynamics`): start image + `action_path` + `domain_name` (`av`, `droid_lerobot`, `umi`, …). Output video per run:

```text
<output_dir>/<name>/vision.mp4
```

**Inverse dynamics** (`model_mode`: `inverse_dynamics`): input `vision_path` video only; predicted action in `<output_dir>/<name>/sample_outputs.json` under `outputs[0].content["action"]`.

Example AV forward-dynamics record:

```json
{
  "action_chunk_size": 60,
  "action_path": "/abs/path/to/av_traj_forward.json",
  "domain_name": "av",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "model_mode": "forward_dynamics",
  "name": "av_forward",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0,
  "vision_path": "/abs/path/to/av_0.jpg"
}
```

Run:

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0 \
  .venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/generator/action/inputs/action_forward_dynamics_av_custom.jsonl \
  -o outputs/cookbooks/cosmos3/generator/action/action_forward_dynamics_av_custom \
  --checkpoint-path Cosmos3-Nano \
  --video-save-quality 8 \
  --image_size 480 \
  --seed 0 \
  --benchmark
```

Embodiment dimensions and FPS defaults for AV / DROID / UMI are summarized in the action cookbook README (9D–10D pose deltas, 60 actions @ 10 FPS for AV, etc.).

## Outputs and verification

| Workflow | Output location | Success signal |
| --- | --- | --- |
| Audiovisual | `-o` directory; `*.mp4` or `*.png` under run subfolders | Generated media files; notebook `view_run()` skips `*_preview.mp4` |
| Forward dynamics | `<output>/<name>/vision.mp4` | MP4 exists per JSONL `name` |
| Inverse dynamics | `<output>/<name>/sample_outputs.json` | `action` array in first output content |
| With `--benchmark` | `benchmark.json` under output root | Timing averages in JSON |

Hugging Face weights cache under `HF_HOME` (default `~/.cache/huggingface`).
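
The forward-dynamics success signal in the table above (one MP4 per JSONL `name`) can be checked with a short script. This is a minimal sketch; `missing_forward_outputs` is a hypothetical helper, and treating an empty file as missing is an assumption of the sketch.

```python
import json
from pathlib import Path


def missing_forward_outputs(jsonl_path, output_dir):
    """Return run names from a JSONL spec that lack a non-empty vision.mp4."""
    missing = []
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        name = json.loads(line)["name"]
        video = Path(output_dir) / name / "vision.mp4"
        if not video.is_file() or video.stat().st_size == 0:
            missing.append(name)
    return missing
```

Call it with the same paths passed to `-i` and `-o`; an empty list means every run in the spec produced a video.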

## Useful environment variables

| Variable | Default / role |
| --- | --- |
| `COSMOS3_REPO` | Framework checkout path (`packages/cosmos3`) |
| `COSMOS3_UV_GROUP` | `cu130-train` or `cu128-train` |
| `COSMOS3_UV_ENV` / `UV_PROJECT_ENVIRONMENT` | Path to `.venv` used by `torchrun` |
| `COSMOS3_NUM_GPUS` | `4` in audiovisual notebook |
| `CUDA_VISIBLE_DEVICES` | GPU indices for the run |
| `COSMOS3_MASTER_ADDR` | `127.0.0.1` for distributed audiovisual |
| `COSMOS3_*_MASTER_PORT` | Per-workflow free ports in notebook |
| `HF_HOME` / `HF_TOKEN` | Model download cache and auth |
| `GIT_LFS_SKIP_SMUDGE` | `1` during `uv sync` |

Action notebooks may require a one-time kernel restart after `configure_cosmos_framework_runtime_env()` updates `LD_LIBRARY_PATH` for CUDA and FFmpeg libraries.

## Troubleshooting

<Warning>
**Headless import errors** (`libxcb.so.1`): install `libxcb1 libgl1 libglib2.0-0` on minimal Linux images.
</Warning>

| Symptom | Mitigation |
| --- | --- |
| `cuda available: False` after sync | Switch to `cu128-train` on CUDA 12.x drivers; confirm with `nvidia-smi` |
| `uv` parse / `--torch-backend` errors | Upgrade `uv` to ≥ 0.11.3 (`uv self update`) |
| Clone / sync failures on LFS blobs | Keep `GIT_LFS_SKIP_SMUDGE=1` for cookbook installs |
| Missing `torchrun` | Run install cell; use `$COSMOS3_UV_ENV/bin/torchrun` explicitly |
| Super OOM | Reduce resolution in payload, use fewer frames, or add GPUs via `--nproc-per-node` |

See [Troubleshooting](/troubleshooting) for cross-backend CUDA and container notes.

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup, HF auth, and CUDA group selection for all backends.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
Full notebook matrix for text-to-image, text-to-video, and image-to-video with optional sound.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics JSONL specs, domains, and Framework vs vLLM-Omni paths.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, prompt limits, and output formats.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON prompt schema and Generator sampling defaults.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to prefer Framework vs Diffusers vs vLLM-Omni.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Python-first alternative without a framework checkout.
</Card>
</CardGroup>

---

## 13. Run Generator action workflows

> Forward dynamics (image + action trajectory) and inverse dynamics (video + instruction) across Framework torchrun and vLLM-Omni multipart /v1/videos requests with domain_name and action_mode extra_params.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/13-run-generator-action-workflows.md
- Generated: 2026-06-01T20:26:05.049Z

### Source Files

- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json`
- `cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4`

---
title: "Run Generator action workflows"
description: "Forward dynamics (image + action trajectory) and inverse dynamics (video + instruction) across Framework torchrun and vLLM-Omni multipart /v1/videos requests with domain_name and action_mode extra_params."
---

Cosmos3-Nano Generator action workflows live under `cookbooks/cosmos3/generator/action/`: JSONL input specs drive `cosmos_framework.scripts.inference` (native PyTorch), while the same specs map to multipart `POST /v1/videos` jobs on vLLM-Omni with `extra_params.action_mode` and `extra_params.domain_name`. Forward dynamics (`forward_dynamics`) rolls out video from a conditioning image plus an action trajectory; inverse dynamics (`inverse_dynamics`) predicts ego-motion from an input video and instruction.

## Modes and embodiments

| Mode | Framework `model_mode` | vLLM `extra_params.action_mode` | Inputs | Outputs (cookbooks) |
| --- | --- | --- | --- | --- |
| Forward dynamics | `forward_dynamics` | `forward_dynamics` | Start image (`vision_path`) + action file (`action_path`) or inline `action` | `vision.mp4` per run |
| Inverse dynamics | `inverse_dynamics` | `inverse_dynamics` | Input video (`vision_path`), no `action_path` | `sample_outputs.json` with predicted action |

Embodiment conditioning uses `domain_name` (and usually `view_point`). Checked-in examples:

| Embodiment | `domain_name` | Action dim | Chunk / duration (cookbooks) |
| --- | --- | --- | --- |
| Autonomous vehicle | `av` | 9D ego pose | 60 actions, 10 FPS → `num_frames` 61 |
| DROID (LeRobot) | `droid_lerobot` | 10D (9D pose + gripper) | 16 actions per chunk, 5 autoregressive chunks |
| UMI | `umi` | 10D | 16 actions per chunk, all chunks in `assets/actions/umi.json` |

Pose semantics (9D pose = 3D translation + 6D rotation, plus grasp encoding) are documented in the action cookbook README; AV trajectories use `rot6d`, `backward_framewise`, and `translation_scale=1.35` when visualizing.

## Cookbook layout and assets

```text
cookbooks/cosmos3/generator/action/
├── README.md
├── run_fd_with_cosmos_framework.ipynb   # FD: AV, DROID, UMI (Framework)
├── run_id_with_cosmos_framework.ipynb   # ID: AV (Framework)
├── run_fd_with_vllm.ipynb               # FD: AV, DROID, UMI (vLLM-Omni)
├── run_id_with_vllm.ipynb               # ID: AV (vLLM-Omni)
└── assets/
    ├── images/          # av_0.jpg, umi.png, …
    ├── actions/         # av_traj_*.json, umi.json
    ├── videos/          # av_0.mp4, av_1.mp4 (inverse dynamics)
    └── droid_lerobot_example/   # LeRobot sample for robotics FD
```

Environment setup (HF auth, CUDA groups, Framework clone, vLLM-Omni Docker) is centralized in the [Cosmos3 cookbooks environment setup](/cookbook-environment); action notebooks assume that baseline.

## JSONL input spec (both backends)

Each line is one JSON object. Framework reads `model_mode`; vLLM maps the same run to `extra_params.action_mode`.

**Forward dynamics (AV example)** — one shared start frame, three trajectories:

```json
{
  "name": "av_left",
  "model_mode": "forward_dynamics",
  "domain_name": "av",
  "view_point": "ego_view",
  "vision_path": "/abs/path/to/assets/images/av_0.jpg",
  "action_path": "/abs/path/to/assets/actions/av_traj_left.json",
  "action_chunk_size": 60,
  "fps": 10,
  "image_size": 480,
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
```

**Inverse dynamics (AV)** — video only; action is predicted:

```json
{
  "name": "av_inverse_0",
  "model_mode": "inverse_dynamics",
  "domain_name": "av",
  "view_point": "ego_view",
  "vision_path": "/abs/path/to/assets/videos/av_0.mp4",
  "action_chunk_size": 60,
  "fps": 10,
  "image_size": 480,
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
```

<Note>
Framework resolves relative paths against the JSONL file directory. vLLM notebooks write **absolute** `vision_path` values because request cells read records directly; action trajectories for FD are embedded in `extra_params.action` (loaded from `action_path`) rather than sent as a separate upload.
</Note>
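
A JSONL spec for the three AV trajectories could be generated along these lines. This is a sketch under assumptions: the helper names are hypothetical, the `/abs/path/to/assets` root is the same placeholder used in the examples above, and the trajectory file name for the right turn (`av_traj_right.json`) is inferred from the `av_traj_*.json` pattern rather than confirmed.

```python
import json
from pathlib import PurePosixPath

ASSETS = PurePosixPath("/abs/path/to/assets")  # placeholder root, as above


def av_forward_record(direction: str) -> dict:
    """Build one forward-dynamics record for an AV trajectory direction."""
    return {
        "name": f"av_{direction}",
        "model_mode": "forward_dynamics",
        "domain_name": "av",
        "view_point": "ego_view",
        "vision_path": str(ASSETS / "images" / "av_0.jpg"),
        # File naming assumed from the av_traj_*.json pattern.
        "action_path": str(ASSETS / "actions" / f"av_traj_{direction}.json"),
        "action_chunk_size": 60,
        "fps": 10,
        "image_size": 480,
        "prompt": "You are an autonomous vehicle planning system.",
        "seed": 0,
    }


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


records = [av_forward_record(d) for d in ("forward", "left", "right")]
jsonl_text = to_jsonl(records)
```

Writing `jsonl_text` to a file gives an input suitable for `-i`; all three records share the same start frame and differ only in `name` and `action_path`.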

## Run with Cosmos Framework

Prerequisites: Linux + NVIDIA GPU, Hugging Face access to `Cosmos3-Nano`, a Framework checkout (notebooks default to `packages/cosmos3`), and `uv sync --all-extras --group=cu130-train` (or `cu128-train`, depending on the driver).

### Entry command

The README quickstart uses `torchrun`; notebooks invoke the same module with single-process distributed env vars:

```bash
torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i <path/to/spec.jsonl> \
  -o <output_dir> \
  --checkpoint-path Cosmos3-Nano \
  --image_size 480 \
  --video-save-quality 8 \
  --seed 0
```

AV forward dynamics (batch JSONL):

```bash
export COSMOS3_AV_FD_INPUT=.../inputs/action_forward_dynamics_av_custom.jsonl
export COSMOS3_AV_FD_OUTPUT=.../action_forward_dynamics_av_custom

CUDA_VISIBLE_DEVICES=0 \
MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \
  .venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i "$COSMOS3_AV_FD_INPUT" \
  -o "$COSMOS3_AV_FD_OUTPUT" \
  --checkpoint-path Cosmos3-Nano \
  --image_size 480 \
  --video-save-quality 8 \
  --seed 0 \
  --benchmark
```

Inverse dynamics uses the same flags with `action_inverse_dynamics_av_custom.jsonl` and writes **`sample_outputs.json`** per run (not `vision.mp4`):

```text
<output_dir>/<name>/sample_outputs.json
  └── outputs[0].content["action"]   # [T-1, 9] for AV
```
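
Reading the predicted action back might look like the sketch below. It assumes `sample_outputs.json` is a JSON object with a top-level `outputs` list whose entries carry a `content` dictionary, which is how the path notation above reads; the helper names are hypothetical.

```python
import json
from pathlib import Path


def read_predicted_action(output_dir, name):
    """Load the predicted action array for one inverse-dynamics run.

    Assumes the layout <output_dir>/<name>/sample_outputs.json with the
    action stored at outputs[0].content["action"] as nested lists.
    """
    path = Path(output_dir) / name / "sample_outputs.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["outputs"][0]["content"]["action"]


def action_shape(action):
    """Return (rows, columns) of a nested-list action array."""
    rows = len(action)
    cols = len(action[0]) if rows else 0
    return rows, cols
```

For AV runs the second dimension should be 9, so `action_shape` gives a quick sanity check before visualizing poses.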

### Autoregressive robotics and UMI

DROID and UMI forward dynamics run **one JSONL line per 16-action chunk**. Chunk 0 conditions on a ground-truth image; later chunks condition on the **last frame** of the previous chunk’s generated video (extracted with ffmpeg in the notebook). Robotics uses `--no-guardrails` and `domain_name: droid_lerobot`; UMI uses `domain_name: umi`, `image_size: 256`, and `fps: 20`.
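
The chunking scheme described above can be sketched as follows. This is not the notebook's implementation: both helpers are hypothetical, the `last_frame.png` file name is invented for illustration, and rejecting trajectories whose length is not a multiple of the chunk size is a choice of this sketch.

```python
def split_into_chunks(actions, chunk_size=16):
    """Split a [T, D] nested-list trajectory into consecutive chunks."""
    if len(actions) % chunk_size != 0:
        raise ValueError("trajectory length must be a multiple of chunk_size")
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]


def conditioning_image(chunk_index, first_image, output_dir, run_prefix):
    """Pick the conditioning image path for a chunk.

    Chunk 0 uses the ground-truth image; later chunks use the last frame of
    the previous chunk's video. The frame file name here is hypothetical;
    the notebook extracts that frame with ffmpeg.
    """
    if chunk_index == 0:
        return first_image
    return f"{output_dir}/{run_prefix}_{chunk_index - 1}/last_frame.png"


# Five DROID-style chunks: 80 actions of 10 dimensions each.
trajectory = [[0.0] * 10 for _ in range(80)]
chunks = split_into_chunks(trajectory)
```

Because each chunk depends on the previous chunk's output, the runs must execute sequentially rather than as one parallel batch.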

Default Framework output root (unless `COSMOS3_OUTPUT_ROOT` is set):

```text
packages/cosmos3/outputs/cookbooks/cosmos3/generator/action/
├── inputs/…jsonl
└── action_forward_dynamics_av_custom/<name>/vision.mp4
```

## Run with vLLM-Omni

### Start server

From the `cosmos` repo root (mount repo + HF cache; allow local media):

```bash
docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true

docker run -d --name cosmos3-vllm-omni-notebook \
  --runtime nvidia --gpus '"device=0"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD:/workspace" \
  -p 8001:8000 --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
    --omni \
    --model-class-name Cosmos3OmniDiffusersPipeline \
    --allowed-local-media-path / \
    --port 8000 \
    --init-timeout 1800

curl http://localhost:8001/v1/models
```

Notebooks default to `COSMOS3_VLLM_BASE_URL=http://localhost:8001` and write outputs under `outputs/cosmos3_action_vllm/`.

### Multipart `POST /v1/videos`

Action jobs use the **async** video endpoint (poll until `status=completed`). Forward dynamics then downloads MP4 bytes from `/v1/videos/{id}/content`; inverse dynamics reads predicted action from the completed job JSON.

```mermaid
sequenceDiagram
  participant NB as action notebook
  participant API as vLLM-Omni /v1/videos
  participant FS as run_dir

  NB->>API: POST multipart (prompt, num_frames, extra_params, input_reference)
  API-->>NB: job id
  loop until completed
    NB->>API: GET /v1/videos/{id}
    API-->>NB: status, progress, action (ID)
  end
  alt forward_dynamics
    NB->>API: GET /v1/videos/{id}/content
    API-->>FS: vision.mp4
  else inverse_dynamics
    NB->>FS: action.json, sample_outputs.json
  end
```
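
The polling loop in the diagram can be sketched independently of any HTTP library by injecting the status call. Everything here is illustrative: `poll_until_completed` is a hypothetical helper, and the `failed` status value is an assumption, since this page only confirms `completed`.

```python
import time


def poll_until_completed(fetch_job, job_id, interval_s=2.0, timeout_s=1800.0,
                         sleep=time.sleep):
    """Call fetch_job(job_id) until the job reports status == "completed".

    fetch_job is any callable returning the job JSON as a dict, for example
    a wrapper around GET /v1/videos/{id}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed failure value, not confirmed here
            raise RuntimeError(f"job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not completed after {timeout_s} s")
        sleep(interval_s)
```

In a real client, `fetch_job` would issue the GET request against the server base URL and parse the JSON body; forward-dynamics runs then download `/v1/videos/{id}/content`, while inverse-dynamics runs read the action from the returned job JSON.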

<Warning>
Use `curl --form-string` for `prompt` and `extra_params`, not `-F`, when values contain `;` (see root README). Encode `extra_params` as a single JSON string.
</Warning>

### `extra_params` for action

| Field | Forward dynamics | Inverse dynamics |
| --- | --- | --- |
| `action_mode` | `forward_dynamics` | `inverse_dynamics` |
| `domain_name` | e.g. `av`, `droid_lerobot`, `umi` | e.g. `av` |
| `action_chunk_size` | Matches JSONL | Matches JSONL |
| `image_size` | e.g. `480` (AV/DROID), `256` (UMI) | e.g. `480` |
| `view_point` | e.g. `ego_view` or dataset viewpoint | e.g. `ego_view` |
| `action` | 2D array loaded from `action_path` | omitted |
| `raw_action_dim` | omitted (FD) | `9` for AV ID notebooks |
| `guardrails` | `false` in cookbook requests | `false` in cookbook requests |

Top-level form fields used in notebooks (in addition to `input_reference` file upload):

<ParamField body="prompt" type="string" required>
Instruction or domain prompt (AV: autonomous-vehicle planning string; robotics: dataset `ai_caption`).
</ParamField>

<ParamField body="num_frames" type="integer" required>
Set to `action_chunk_size + 1` (e.g. 61 for AV).
</ParamField>

<ParamField body="fps" type="integer" required>
Embodiment frame rate (AV: 10; UMI: 20; DROID: from dataset `conditioning_fps`).
</ParamField>

<ParamField body="num_inference_steps" type="integer">
Cookbooks use `30`.
</ParamField>

<ParamField body="guidance_scale" type="float">
Cookbooks use `1.0` for action runs.
</ParamField>

<ParamField body="flow_shift" type="float">
Cookbooks use `10.0`.
</ParamField>

<ParamField body="extra_params" type="string (JSON)" required>
Stringified JSON object; must include `action_mode` and `domain_name`.
</ParamField>

**Forward dynamics** uploads the conditioning **image** as `input_reference`. **Inverse dynamics** uploads the **video** (`video/mp4`). Robotics/UMI FD requests omit `size`/`width`/`height`; the server derives the canvas from `image_size` with aspect-preserving padding.

Example FD `extra_params` payload shape:

```json
{
  "action_mode": "forward_dynamics",
  "domain_name": "av",
  "action_chunk_size": 60,
  "image_size": 480,
  "view_point": "ego_view",
  "action": [[...9 floats...], ...],
  "guardrails": false
}
```
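
Putting the form fields and `extra_params` together, a forward-dynamics request body could be prepared as in the sketch below. `build_fd_form_fields` is a hypothetical helper; it only uses field names and cookbook values listed on this page and leaves the `input_reference` file upload to whichever HTTP client you use.

```python
import json


def build_fd_form_fields(record, action, num_inference_steps=30,
                         guidance_scale=1.0, flow_shift=10.0):
    """Build the text form fields for a forward-dynamics POST /v1/videos.

    record is one JSONL spec object; action is the 2D array loaded from
    its action_path. All values are strings, as multipart fields are text.
    """
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": record["domain_name"],
        "action_chunk_size": record["action_chunk_size"],
        "image_size": record["image_size"],
        "view_point": record["view_point"],
        "action": action,
        "guardrails": False,
    }
    return {
        "prompt": record["prompt"],
        # num_frames is action_chunk_size + 1 (61 for AV).
        "num_frames": str(record["action_chunk_size"] + 1),
        "fps": str(record["fps"]),
        "num_inference_steps": str(num_inference_steps),
        "guidance_scale": str(guidance_scale),
        "flow_shift": str(flow_shift),
        # extra_params is sent as a single JSON string.
        "extra_params": json.dumps(extra_params),
    }


record = {
    "domain_name": "av", "action_chunk_size": 60, "image_size": 480,
    "view_point": "ego_view", "fps": 10,
    "prompt": "You are an autonomous vehicle planning system.",
}
fields = build_fd_form_fields(record, [[0.0] * 9 for _ in range(60)])
```

Serializing `extra_params` in one `json.dumps` call keeps the nested `action` array intact and avoids the quoting problems that hand-built strings tend to cause.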

The root README notes that FD may also use synchronous `POST /v1/videos/sync` when only video is returned; the action cookbooks standardize on async `POST /v1/videos` plus `/content` download for parity with inverse-dynamics job polling.

### vLLM output layout

```text
outputs/cosmos3_action_vllm/
├── inputs/action_forward_dynamics_av_custom.jsonl
├── action_forward_dynamics_av_custom/<name>/
│   ├── response.json
│   ├── final.json
│   └── vision.mp4
├── action_inverse_dynamics_av_custom/<name>/
│   ├── response.json
│   ├── final.json
│   ├── action.json
│   └── sample_outputs.json
└── action_forward_dynamics_robotics_custom/ …
```

## Verification

<Steps>
<Step title="Confirm environment">
HF login succeeds; Framework `.venv` or vLLM container sees GPU; `curl http://localhost:8001/v1/models` returns Cosmos3-Nano metadata (vLLM path).
</Step>
<Step title="Run AV forward dynamics">
Three trajectories (`av_forward`, `av_left`, `av_right`) from one `av_0.jpg` produce three `vision.mp4` files under `action_forward_dynamics_av_custom/`.
</Step>
<Step title="Run AV inverse dynamics">
`av_0.mp4` and `av_1.mp4` produce `sample_outputs.json` per run; visualize predicted 9D poses with the notebook pose utilities.
</Step>
<Step title="Optional robotics / UMI">
DROID: 5 autoregressive chunks of 16 actions with last-frame conditioning; UMI: chunks stitched into one preview MP4. Expect guardrails disabled and `--no-guardrails` on Framework robotics runs.
</Step>
</Steps>

<Check>
Success signals: JSONL specs print in notebook setup cells; Framework exits 0 with `vision.mp4` or `sample_outputs.json` under the configured output root; vLLM jobs reach `status: completed` and FD runs save non-empty `vision.mp4`.
</Check>

## Framework vs vLLM-Omni

| Concern | Cosmos Framework | vLLM-Omni |
| --- | --- | --- |
| Mode field | `model_mode` in JSONL | `action_mode` in `extra_params` |
| Action input | `action_path` on disk | `action` array inside `extra_params` (FD) |
| Inference API | `python -m cosmos_framework.scripts.inference` | `POST /v1/videos` + poll + optional `/content` |
| Default outputs | Under Framework checkout `outputs/cookbooks/...` | `outputs/cosmos3_action_vllm/` |
| Best for | Training-aligned batch runs, `--benchmark`, `--no-guardrails` CLI | OpenAI-compatible serving, same prompts as production |

## Related pages

<CardGroup>
<Card title="Action modality" href="/action-modality">
Action token semantics, embodiment dimensions, and `domain_name` conditioning.
</Card>
<Card title="Action cookbook recipes" href="/action-cookbooks">
Notebook index with asset paths and output directories.
</Card>
<Card title="Run Generator with Cosmos Framework" href="/run-generator-cosmos-framework">
General Framework inference flags and parallelism presets.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Server startup, guardrails, and vision endpoints.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
`/v1/videos` fields, `action_mode` values, and curl constraints.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup for Framework and vLLM-Omni.
</Card>
</CardGroup>

---

## 14. Run Reasoner with vLLM

> Install vllm-cosmos3 plugin, serve Cosmos3ReasonerForConditionalGeneration with mm-encoder and media-io-kwargs, Qwen3-VL-compatible chat messages, and reasoning-format prompt suffix.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/14-run-reasoner-with-vllm.md
- Generated: 2026-06-01T20:39:21.810Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`
- `cookbooks/cosmos3/reasoner/assets/robotics_next_action.mp4`
- `cookbooks/cosmos3/reasoner/assets/temporal_localization_1.mp4`

---
title: "Run Reasoner with vLLM"
description: "Install vllm-cosmos3 plugin, serve Cosmos3ReasonerForConditionalGeneration with mm-encoder and media-io-kwargs, Qwen3-VL-compatible chat messages, and reasoning-format prompt suffix."
---

The Cosmos 3 Reasoner runs as an OpenAI-compatible `vllm serve` process: the `vllm-cosmos3` plugin registers `Cosmos3ReasonerForConditionalGeneration`, multimodal inputs use Qwen3-VL-style chat messages, and clients call `POST /v1/chat/completions` for text outputs from images and videos.

## Architecture

```mermaid
sequenceDiagram
    participant Client as OpenAI client
    participant API as vLLM /v1
    participant Plugin as vllm-cosmos3
    participant Model as Cosmos3ReasonerForConditionalGeneration

    Client->>API: POST /v1/chat/completions
    Note over Client,API: image_url / video_url + text prompt
    API->>Plugin: Cosmos3ReasonerForConditionalGeneration
    Plugin->>Model: mm encoder + AR decode
    Model-->>API: text tokens
    API-->>Client: choices[0].message.content
```

| Component | Role |
| --- | --- |
| `vllm-cosmos3` | Registers Reasoner architecture and processors from `cosmos-framework` |
| `Cosmos3ReasonerForConditionalGeneration` | Autoregressive Reasoner path (text out only) |
| `--mm-encoder-tp-mode data` | Data-parallel visual encoder for multimodal workloads |
| `--media-io-kwargs` | Server-side video frame ingestion before downstream sampling |

<Note>
Reasoner vLLM loads only the understanding path. For image/video/sound/action generation, use vLLM-Omni instead.
</Note>
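
A request body for the flow in the diagram might be built as follows. This is a sketch, not the cookbook client: `build_video_chat_request` is a hypothetical helper, the question text and the `/workspace` mount path are placeholders, and the `video_url` content part follows the multimodal message shape that vLLM's OpenAI-compatible server accepts.

```python
import json
from pathlib import PurePosixPath


def build_video_chat_request(model, video_path, question, max_tokens=512):
    """Build a /v1/chat/completions body with one video and one text part.

    video_path must be an absolute path under the directory the server
    allows via --allowed-local-media-path.
    """
    video_url = "file://" + str(PurePosixPath(video_path))
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


body = build_video_chat_request(
    "nvidia/Cosmos3-Nano",
    "/workspace/cookbooks/cosmos3/reasoner/assets/robotics_next_action.mp4",
    "What should the robot do next?",  # placeholder question
)
request_json = json.dumps(body)
```

POSTing `request_json` to `/v1/chat/completions` should return the answer in `choices[0].message.content`, as shown in the diagram.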

## Prerequisites

| Requirement | Detail |
| --- | --- |
| OS / GPU | Linux; NVIDIA Ampere, Hopper, or Blackwell |
| Package manager | `uv` (see cookbook environment guide) |
| Hugging Face | Gated access to `nvidia/Cosmos3-Nano` and/or `nvidia/Cosmos3-Super` |
| CUDA pairing | `cu130` + `vllm==0.21.0` (CUDA 13 driver) or `cu128` + `vllm==0.19.1` (CUDA 12.x) |

Authenticate before the first download:

```bash
uvx hf@latest auth login
```

## Install vLLM and vllm-cosmos3

<Steps>
<Step title="Create a Python 3.13 venv">

From the `cosmos` repository root (or any working directory):

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
```

</Step>
<Step title="Install the CUDA-matched vLLM stack">

<Tabs>
<Tab title="CUDA 13 (cu130)">

```bash
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

</Tab>
<Tab title="CUDA 12.x (cu128)">

```bash
uv pip install --torch-backend=cu128 "vllm==0.19.1" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

</Tab>
</Tabs>

The notebook path can also install from a local `packages/cosmos3` checkout (`transformers-cosmos3` + `vllm-cosmos3`) after cloning `cosmos-framework`.

</Step>
<Step title="Verify GPU visibility">

```bash
.venv/bin/python - <<'PY'
import torch
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

</Step>
</Steps>

<Warning>
`--torch-backend=auto` is unreliable for vLLM wheels. Match the install pair to the CUDA version that `nvidia-smi` reports for the driver; otherwise `torch.cuda.is_available()` returns `False`.
</Warning>
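
The pairing rule from the prerequisites table can be written as a small lookup. This is a sketch only: `pick_install_pair` is a hypothetical helper, and it assumes you pass in the CUDA version string printed in the `nvidia-smi` header (for example `13.0`).

```python
def pick_install_pair(driver_cuda: str) -> tuple[str, str]:
    """Map the driver's CUDA version to (torch backend, vLLM pin).

    Only the two pairs listed in the prerequisites table are covered.
    """
    major = int(driver_cuda.strip().split(".")[0])
    if major == 13:
        return ("cu130", "vllm==0.21.0")
    if major == 12:
        return ("cu128", "vllm==0.19.1")
    raise ValueError(f"No documented install pair for CUDA {driver_cuda}")
```

The returned values slot into `uv pip install --torch-backend=<backend> "<pin>" ...` as shown in the install tabs.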

If the build reports DeepGEMM unavailable:

```bash
export VLLM_USE_DEEP_GEMM=0
```

When invoking `.venv/bin/vllm` without activating the venv, keep `.venv/bin` on `PATH` so FlashInfer can find `ninja`.

## Serve the Reasoner

Override Hugging Face `architectures` so vLLM loads the Reasoner class instead of the full omnimodal checkpoint default.

### Cosmos3-Nano (single GPU)

From the repo root, with media under `cookbooks/cosmos3` reachable via `file://` URLs:

```bash
CUDA_VISIBLE_DEVICES=0 \
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$(dirname "$(pwd)")" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8000
```

### Cosmos3-Super (4 GPUs)

The Reasoner cookbook notebook defaults to Super with tensor parallelism:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve nvidia/Cosmos3-Super \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path /path/to/cookbooks/cosmos3 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8001
```

| Flag | Purpose |
| --- | --- |
| `--hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}'` | Select Reasoner weights and config |
| `--tensor-parallel-size` | GPU count for model parallelism (`1` Nano, `4` Super in cookbooks) |
| `--mm-encoder-tp-mode data` | Data parallelism for the multimodal visual encoder |
| `--async-scheduling` | Async request scheduling (recommended in cookbooks) |
| `--allowed-local-media-path` | Parent directory allowed for `file://` image/video paths in requests |
| `--media-io-kwargs '{"video": {"num_frames": -1}}'` | Let the processor see all frames before downstream frame sampling |

<Info>
First startup compiles CUDA graphs and can take several minutes. Poll readiness with `curl -fsS http://127.0.0.1:8000/health` or `curl http://localhost:8000/v1/models`; substitute port `8001` when following the Super example.
</Info>
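
For scripted launches, the same readiness poll can be done from Python with the standard library. This is a minimal sketch: `http_ok` and `wait_until_ready` are hypothetical helper names, and the default timeout and interval are arbitrary choices rather than cookbook values.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True when a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 900.0, interval_s: float = 5.0
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example: wait_until_ready(lambda: http_ok("http://127.0.0.1:8000/health"))
```

Passing the probe as a callable keeps the loop independent of the endpoint, so the same function works for `/health` and `/v1/models`.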

## Query with chat completions

Clients use the OpenAI SDK against `http://localhost:<port>/v1` with any API key (for example `EMPTY`).

### Image caption

```python
import openai

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg"
                    },
                },
                {"type": "text", "text": "Caption the image in detail."},
            ],
        }
    ],
    max_tokens=4096,
    seed=0,
)
print(response.choices[0].message.content)
```

### Local video (file URL)

Resolve cookbook assets to `file://` URIs. The server path in `--allowed-local-media-path` must contain the file.

```python
from pathlib import Path

video_path = "cookbooks/cosmos3/reasoner/assets/temporal_localization_1.mp4"
video_url = Path(video_path).resolve().as_uri()

# Reuses `client` and `model` from the image caption example.
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": "Describe the video in detail."},
            ],
        }
    ],
    max_tokens=4096,
    extra_body={"mm_processor_kwargs": {"fps": 4, "do_sample_frames": True}},
)
print(response.choices[0].message.content)
```
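
A request whose file sits outside `--allowed-local-media-path` is rejected by the server, so it can help to check the path on the client first. The sketch below assumes the client and server share a filesystem; `to_allowed_file_url` is a hypothetical helper, and `allowed_root` should mirror whatever directory you passed to the server flag.

```python
from pathlib import Path


def to_allowed_file_url(media_path: str, allowed_root: str) -> str:
    """Resolve `media_path` to a file:// URI, failing early if it is outside `allowed_root`."""
    resolved = Path(media_path).resolve()
    root = Path(allowed_root).resolve()
    if not resolved.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{resolved} is not under {root}")
    return resolved.as_uri()
```

The result can be passed directly as the `url` of a `video_url` or `image_url` content part.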

Bundled examples under `cookbooks/cosmos3/reasoner/assets/` include `robotics_next_action.mp4`, `temporal_localization_1.mp4`, `video_caption.mp4`, and task-specific images for grounding and planning.

## Qwen3-VL-compatible messages

Reasoner requests follow Qwen3-VL message conventions: multimodal content is an ordered list of typed parts (`image_url`, `video_url`, `text`). A system message is optional:

```json
[
  {
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}]
  },
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
      {"type": "text", "text": "List notable events with approximate timestamps."}
    ]
  }
]
```

| Content type | Field shape |
| --- | --- |
| `image_url` | `{"url": "<https or file:// path>"}` |
| `video_url` | `{"url": "<https or file:// path>"}` |
| `text` | `{"text": "<prompt>"}` |
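
The content shapes above can be assembled with a small builder. This is a sketch: `build_user_message` is a hypothetical helper, and placing media parts before the text simply follows the ordering used in the examples on this page.

```python
from typing import Any


def build_user_message(
    text: str,
    image_urls: tuple[str, ...] = (),
    video_urls: tuple[str, ...] = (),
) -> dict[str, Any]:
    """Assemble one user message: media parts first, then the text prompt."""
    content: list[dict[str, Any]] = []
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    for url in video_urls:
        content.append({"type": "video_url", "video_url": {"url": url}})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}
```

Pass the result inside the `messages` list of `client.chat.completions.create(...)`.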

For per-request video frame rate and sampling, pass `extra_body`:

<ParamField body="mm_processor_kwargs" type="object">
Server-side multimodal processor overrides. Cookbooks use `{"fps": 4, "do_sample_frames": true}` for video understanding workloads.
</ParamField>

## Reasoning-format prompt suffix

For chain-of-thought style outputs, append this instruction to the user text (tags are literal in model outputs):

```text
Answer the question using the following format:

<think>
Your reasoning.
</think>

Write your final answer immediately after the </think> tag.
```

Embodied examples (for example `robotics_next_action.mp4`) use the same pattern inline. Parse structured JSON after `</think>` when the task requests trajectories or boxes.
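
Splitting the reasoning from the final answer takes only a few lines. The sketch below assumes the model follows the format literally, with a single `</think>` closing tag; `split_reasoning` and `parse_json_answer` are hypothetical helper names.

```python
import json
from typing import Any


def split_reasoning(output: str) -> tuple[str, str]:
    """Split model text into (reasoning, answer) around the closing </think> tag."""
    head, sep, tail = output.partition("</think>")
    if not sep:
        return "", output.strip()  # no reasoning block present
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def parse_json_answer(answer: str) -> Any:
    """Parse the answer as JSON; return None when it is not valid JSON."""
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return None
```

Returning `None` rather than raising lets callers fall back to the raw text when a task does not produce structured output.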

## Sampling parameters

| Parameter | Without reasoning | With reasoning |
| --- | ---: | ---: |
| `temperature` | `0.7` | `0.6` |
| `top_p` | `0.8` | `0.95` |
| `top_k` | `20` | `20` |
| `repetition_penalty` | `1.0` | `1.0` |
| `presence_penalty` | `1.5` | `0.0` |

Reasoning-heavy cookbook cells pass the “with reasoning” set explicitly:

```python
client.chat.completions.create(
    model=model,
    messages=[...],
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20, "repetition_penalty": 1.0},
)
```
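
To avoid retyping the two presets, they can be kept in one place. This is a sketch with a hypothetical helper name, `sampling_kwargs`; the values are copied from the sampling table, and `top_k` and `repetition_penalty` go in `extra_body` as in the call above.

```python
from typing import Any

# Presets copied from the sampling table (key: whether reasoning format is used).
_SAMPLING: dict[bool, dict[str, float]] = {
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
            "repetition_penalty": 1.0, "presence_penalty": 1.5},
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
           "repetition_penalty": 1.0, "presence_penalty": 0.0},
}


def sampling_kwargs(reasoning: bool) -> dict[str, Any]:
    """Return keyword arguments for chat.completions.create()."""
    params = dict(_SAMPLING[reasoning])  # copy so the presets stay untouched
    extra = {key: params.pop(key) for key in ("top_k", "repetition_penalty")}
    return {**params, "extra_body": extra}
```

Usage: `client.chat.completions.create(model=model, messages=messages, max_tokens=4096, **sampling_kwargs(reasoning=True))`.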

## Verification

| Check | Command / signal |
| --- | --- |
| Server up | `curl http://localhost:8000/v1/models` returns model list |
| Health | `curl -fsS http://127.0.0.1:8000/health` succeeds |
| End-to-end | Chat completion returns non-empty `choices[0].message.content` |

Published Nano Reasoner serving metrics (TTFT, latency, throughput at concurrency 1/64/128/256) are in the repository inference benchmarks doc.

## Cookbook notebook

`cookbooks/cosmos3/reasoner/run_with_vllm.ipynb` installs the stack, launches Super on four GPUs by default (`VLLM_PORT=8001`), waits on `/health`, and runs captioning, temporal localization, embodied reasoning, grounding, action CoT, and related workflows. To use Nano, change only the server launch cell (`nvidia/Cosmos3-Nano`, `--tensor-parallel-size 1`, one GPU) and client `base_url` port; prompts resolve `MODEL` from `client.models.list()`.

## Related pages

<CardGroup>
<Card title="Cookbook environment" href="/cookbook-environment">
Shared uv setup, CUDA tags, HF auth, and vLLM install for Reasoner and Generator backends.
</Card>
<Card title="Reasoner vLLM configuration" href="/reasoner-vllm-configuration">
Flag reference for serve options, DeepGEMM, and CUDA version pairs.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Reasoner sampling tables, message schema, and reasoning-format details.
</Card>
<Card title="Reasoner cookbooks" href="/reasoner-cookbooks">
End-to-end notebook workflows and bundled `assets/` media.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run Reasoner serve plus chat completion.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
CUDA/driver mismatches, `torch.cuda` failures, and `VLLM_USE_DEEP_GEMM`.
</Card>
</CardGroup>

---

## 15. Run Reasoner with Cosmos Framework

> Build reasoner JSON inputs (model_mode, vision_path, enable_sound), run cosmos_framework.scripts.inference with latency preset, and read reasoner_text.txt outputs; scale Nano to Super via torchrun.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/15-run-reasoner-with-cosmos-framework.md
- Generated: 2026-06-01T20:26:01.179Z

### Source Files

- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/reasoner/assets/robot_planning.png`
- `cookbooks/cosmos3/reasoner/assets/describe_anything.png`
- `README.md`


Cosmos 3 Reasoner inference on the native PyTorch path runs through `cosmos_framework.scripts.inference` in a [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) checkout (`packages/cosmos3`). Each job takes one JSON input file (`-i`) and writes its text output to `{output_dir}/{name}/reasoner_text.txt`. Cookbook Reasoner runs set `--parallelism-preset=latency` and `COSMOS_TRAINING=false`.

## Prerequisites

| Requirement | Notes |
| --- | --- |
| Linux + NVIDIA GPU | Ampere, Hopper, or Blackwell per project support matrix |
| `uv`, `git`, `git-lfs` | Framework install uses `uv sync` |
| Hugging Face access | Gated Cosmos3 repos; `uvx hf@latest auth login` or `HF_TOKEN` |
| Framework checkout access | HTTPS or SSH to `NVIDIA/cosmos-framework` |
| Disk | Nano download plus CUDA deps can use tens of GiB |

Full shared setup (CUDA `cu130-train` / `cu128-train` pairing, clone path, GPU verify) lives on the cookbook environment page. Install the framework venv at `packages/cosmos3/.venv` before running commands below.

<Steps>
<Step title="Install Cosmos Framework">

From the `cosmos` repo root:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3

export GIT_LFS_SKIP_SMUDGE=1
uv sync --all-extras --group=cu130-train   # use cu128-train on CUDA 12.x drivers
```

</Step>
<Step title="Verify GPU">

```bash
cd packages/cosmos3   # from the cosmos repo root; skip if you are already in the checkout
.venv/bin/python - <<'PY'
import torch
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
PY
```

</Step>
</Steps>

## Reasoner input JSON

Point `-i` at a single JSON file. Cookbook Reasoner inputs set `model_mode` to `"reasoner"` and include `enable_sound: false`; the shipped examples fail argument validation without it.

### Core fields

| Field | Required | Type | Role |
| --- | --- | --- | --- |
| `model_mode` | yes | string | Must be `"reasoner"` |
| `name` | yes | string | Output subdirectory name under `-o` |
| `prompt` | yes | string | User instruction or task text |
| `enable_sound` | yes (today) | bool | Set `false` for current Reasoner path |
| `vision_path` | no | string | HTTP(S) URL or local path to an image |

### Optional sampling fields

Capability prompts in `run_with_cosmos_framework.ipynb` also use:

| Field | Example use |
| --- | --- |
| `max_new_tokens` | `4096` for long captions and structured outputs |
| `do_sample` | `true` for trajectory / chain-of-thought prompts |
| `temperature` | `0.6` with reasoning-format prompts |
| `top_p` | `0.95` with reasoning |
| `top_k` | `20` |
| `repetition_penalty` | `1.0` |
| `presence_penalty` | `0.0` with reasoning (vs `1.5` without in README sampling table) |

<ParamField body="model_mode" type="string" required>
Must be `"reasoner"`. Selects autoregressive text output (not Generator diffusion modes).
</ParamField>

<ParamField body="vision_path" type="string">
Image URL or filesystem path. Framework Reasoner currently treats this as a PIL image input. Video Reasoner workflows belong on the vLLM path.
</ParamField>

<ParamField body="enable_sound" type="boolean" required>
Set `false` for all current cookbook Reasoner JSON. Omitting the field or setting it to `true` fails strict argument validation on the shipped path.
</ParamField>

### Minimal examples

Text-only:

```json
{
  "model_mode": "reasoner",
  "name": "nano_text",
  "prompt": "Describe a modern robotics research laboratory in one sentence.",
  "enable_sound": false
}
```

Image-conditioned:

```json
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg",
  "enable_sound": false
}
```

Local cookbook assets (for example `robot_planning.png`, `describe_anything.png`, `grounding_2d.png`) resolve under `cookbooks/cosmos3/reasoner/assets/`.

## Run inference

Work from the framework checkout (`cd packages/cosmos3`). The entrypoint is `python -m cosmos_framework.scripts.inference` (or `.venv/bin/torchrun` for multi-GPU Super).

```text
cosmos repo
└── packages/cosmos3/          # COSMOS3_REPO, run commands here
    ├── .venv/
    └── outputs/cookbooks/cosmos3/reasoner/...
```

### Cosmos3-Nano (single GPU)

Nano fits one GPU. Export distributed env vars even for `WORLD_SIZE=1` (cookbook pattern):

```bash
cd packages/cosmos3

mkdir -p outputs/cookbooks/cosmos3/reasoner/inputs
cat > outputs/cookbooks/cosmos3/reasoner/inputs/robot_image.json <<'JSON'
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg",
  "enable_sound": false
}
JSON

COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=0 \
MASTER_ADDR=127.0.0.1 MASTER_PORT=29501 RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \
.venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/reasoner/inputs/robot_image.json \
  -o outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_image \
  --checkpoint-path Cosmos3-Nano \
  --seed=0 \
  --benchmark
```

| CLI flag | Reasoner cookbook value | Notes |
| --- | --- | --- |
| `--parallelism-preset` | `latency` | Generator audiovisual examples use `throughput`; Reasoner cookbooks use `latency` |
| `-i` | path to input JSON | One sample per invocation |
| `-o` | output run directory | Per-sample folder is `{name}/` inside this directory |
| `--checkpoint-path` | `Cosmos3-Nano` or `Cosmos3-Super` | Short names; weights download from Hugging Face on first run |
| `--seed` | `0` | Reproducibility |
| `--benchmark` | optional | Writes `benchmark.json` next to sample outputs with timing aggregates |
| `COSMOS_TRAINING` | `false` | Set for Reasoner cookbook runs |

### Cosmos3-Super (multi-GPU)

`Cosmos3-Super` (64B) needs multiple GPUs. The Reasoner cookbook points to `.venv/bin/torchrun`; the audiovisual Framework notebook uses four processes for Super (`COSMOS3_NUM_GPUS=4`, `CUDA_VISIBLE_DEVICES=0,1,2,3`).

```bash
cd packages/cosmos3

export CUDA_VISIBLE_DEVICES=0,1,2,3
export COSMOS3_NUM_GPUS=4
export COSMOS3_MASTER_ADDR=127.0.0.1
export COSMOS3_MASTER_PORT=29502

COSMOS_TRAINING=false \
.venv/bin/torchrun \
  --nproc-per-node="$COSMOS3_NUM_GPUS" \
  --master-addr="$COSMOS3_MASTER_ADDR" \
  --master-port="$COSMOS3_MASTER_PORT" \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/reasoner/inputs/robot_image.json \
  -o outputs/cookbooks/cosmos3/reasoner/super/cosmos_framework_image \
  --checkpoint-path Cosmos3-Super \
  --seed=0 \
  --benchmark
```

<Note>
Match `--nproc-per-node` to the number of visible GPUs. vLLM Reasoner serves Super with `--tensor-parallel-size 4` on four GPUs; align Framework GPU count with your hardware and memory headroom.
</Note>
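
The Nano and Super invocations differ only in the launcher, so a script can build either command from the GPU count. This is a sketch: `build_inference_cmd` is a hypothetical helper that uses only the flags shown in the two commands above, and it does not set environment variables.

```python
def build_inference_cmd(
    input_json: str,
    output_dir: str,
    checkpoint: str,
    num_gpus: int = 1,
    master_addr: str = "127.0.0.1",
    master_port: int = 29502,
    benchmark: bool = True,
) -> list[str]:
    """Build the argv list for a Reasoner run (python for 1 GPU, torchrun otherwise)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    module = ["-m", "cosmos_framework.scripts.inference"]
    if num_gpus == 1:
        launcher = [".venv/bin/python", *module]
    else:
        launcher = [
            ".venv/bin/torchrun",
            f"--nproc-per-node={num_gpus}",
            f"--master-addr={master_addr}",
            f"--master-port={master_port}",
            *module,
        ]
    cmd = launcher + [
        "--parallelism-preset=latency",
        "-i", input_json,
        "-o", output_dir,
        "--checkpoint-path", checkpoint,
        "--seed=0",
    ]
    if benchmark:
        cmd.append("--benchmark")
    return cmd
```

Run the result with `subprocess.run(cmd, check=True, env=...)` from the framework checkout. The environment still needs `COSMOS_TRAINING=false`, and the single-GPU case also needs the distributed variables shown in the Nano command.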

## Output layout

For `-o outputs/.../cosmos_framework_image` and input `"name": "robot_image"`:

```text
outputs/.../cosmos_framework_image/
├── benchmark.json              # when --benchmark is set (run-level aggregates)
└── robot_image/
    └── reasoner_text.txt       # model text output
```

<ResponseField name="reasoner_text.txt" type="string">
Plain-text Reasoner completion for the prompt (and vision conditioning when `vision_path` is set). Read with `cat` or your notebook display helper.
</ResponseField>

With `--benchmark`, inspect timing:

```bash
cat outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_image/benchmark.json
```

The notebook prints the `average` object from that file when present.
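
Outside the notebook, both files can be read with a few lines of Python. This is a sketch: `read_run` is a hypothetical helper, and it assumes `benchmark.json` is a JSON object with a top-level `average` key, which is inferred from the notebook behavior described above.

```python
import json
from pathlib import Path
from typing import Any


def read_run(output_dir: str, name: str) -> tuple[str, Any]:
    """Return (reasoner text, benchmark `average` object or None) for one sample."""
    run = Path(output_dir)
    text = (run / name / "reasoner_text.txt").read_text(encoding="utf-8")
    bench_path = run / "benchmark.json"
    average = None
    if bench_path.is_file():  # only present when --benchmark was set
        average = json.loads(bench_path.read_text(encoding="utf-8")).get("average")
    return text, average
```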

## Capability workflows (image)

`run_with_cosmos_framework.ipynb` builds inputs under `outputs/.../reasoner/nano/inputs/` and `inputs/capabilities/`, then runs Nano with the same CLI pattern. Bundled assets drive local `vision_path` values.

| Input `name` | Task | Asset / vision |
| --- | --- | --- |
| `image_caption_detail` | Detailed caption | Remote robot image URL |
| `robot_planning` | Subtask plan for manipulation | `assets/robot_planning.png` |
| `ground_load_bbox` | 2D bounding box JSON | `assets/grounding_2d.png` |
| `describe_marked_subjects` | Describe marked subjects (JSON) | `assets/describe_anything.png` |
| `trajectory_bowl` / `trajectory_flower` | 2D gripper trajectory + `<think>` reasoning format | `action_cot_trajectory.png`, `robot_planning.png` |

Trajectory prompts append the reasoning-format block from the main README (`<think>` tags) and set sampling fields (`do_sample`, `temperature`, `top_p`, etc.).

<Warning>
Framework Reasoner expects **images** via `vision_path`. Video assets under `cookbooks/cosmos3/reasoner/assets/*.mp4` are exercised in `run_with_vllm.ipynb`, not the Framework image path.
</Warning>

## Environment overrides

| Variable | Default | Purpose |
| --- | --- | --- |
| `COSMOS3_REPO` | `<cosmos>/packages/cosmos3` | Framework checkout root |
| `COSMOS3_UV_GROUP` | `cu130-train` | `cu128-train` on CUDA 12.x drivers |
| `COSMOS3_OUTPUT_ROOT` | `.../reasoner/nano` under framework outputs | Notebook output base |
| `CUDA_VISIBLE_DEVICES` | `0` (Nano) | GPU selection |
| `HF_HOME` | `~/.cache/huggingface` | Model cache location |
| `COSMOS_TRAINING` | `false` for Reasoner | Disables training code paths for inference |

## Troubleshooting

| Symptom | Likely fix |
| --- | --- |
| `cuda available: False` after `uv sync` | Wrong `COSMOS3_UV_GROUP` for driver; use `cu128-train` on CUDA 12.x |
| Reasoner JSON validation error | Add `"enable_sound": false` |
| Video input not working | Use [Run Reasoner with vLLM](/run-reasoner-vllm) for video; keep Framework on images |
| `uv` / `--torch-backend` errors | Upgrade `uv` (≥ 0.11.3 per cookbook notes); see troubleshooting page |
| First run slow / large download | `Cosmos3-Nano` / `Cosmos3-Super` fetch from Hugging Face; ensure `HF_TOKEN` or `hf auth login` |

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Clone Framework, `uv sync` train groups, and verify GPU before Reasoner runs.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT surfaces: Reasoner (text out) vs Generator (vision/sound/action out).
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
OpenAI-compatible Reasoner for image and video, including Super on four GPUs.
</Card>
<Card title="Reasoner cookbook recipes" href="/reasoner-cookbooks">
Captioning, grounding, temporal localization, and video workflows with bundled media.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Reasoner sampling tables and `<think>` reasoning-format prompt suffix.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to pick Framework vs vLLM vs Diffusers for research or serving.
</Card>
</CardGroup>

---

## 16. vLLM-Omni API reference

> OpenAI-compatible endpoints (/v1/images/generations, /v1/videos, /v1/videos/sync), request fields (prompt, size, num_frames, guidance_scale, extra_params), action_mode values, and curl --form-string constraints.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/16-vllm-omni-api-reference.md
- Generated: 2026-06-01T20:26:43.394Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`


Cosmos 3 Generator production serving exposes an OpenAI-compatible HTTP API on the vLLM-Omni server (`vllm serve … --omni --model-class-name Cosmos3OmniDiffusersPipeline`). Vision generation uses `POST /v1/images/generations` (JSON) and `POST /v1/videos/sync` (multipart, blocking MP4). Action and long-running jobs use asynchronous `POST /v1/videos` with `GET /v1/videos/{id}` polling and optional `GET /v1/videos/{id}/content` for the rendered video.

<Note>
The `vllm/vllm-omni:cosmos3` Docker image ships all modalities (text-to-image, text-to-video, image-to-video, video-to-video, video-with-sound, action). A PR-branch install of vLLM-Omni currently covers only text-to-image, text-to-video, and image-to-video until upstream merges complete.
</Note>

## Endpoint map

| Method | Path | Use when | Response |
| --- | --- | --- | --- |
| `POST` | `/v1/images/generations` | Text-to-image | JSON with base64 PNG in `data[0].b64_json` |
| `POST` | `/v1/videos/sync` | Text/image/video generation; forward dynamics (video-only output) | Raw MP4 bytes (`Accept: video/mp4`) |
| `POST` | `/v1/videos` | Policy, inverse dynamics, or chunked action jobs that return action data | JSON job handle with `id` |
| `GET` | `/v1/videos/{id}` | Poll async job | JSON with `status`, `progress`, and `action` when complete |
| `GET` | `/v1/videos/{id}/content` | Download completed video | Raw MP4 bytes |
| `GET` | `/v1/models` | Verify server readiness | OpenAI-style model list |

```text
Client                          vLLM-Omni (port 8000)
  |-- POST /v1/images/generations --> JSON { data[].b64_json }
  |-- POST /v1/videos/sync ---------> MP4 body (blocking)
  |-- POST /v1/videos -------------> { id }
  |       +-- GET /v1/videos/{id} --> { status, progress, action? }
  |       +-- GET .../content -----> MP4 body
```

## Prerequisites

Start the server with local media paths allowed when using `action_path` or mounted cookbook assets:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

The API is ready when logs show `Application startup complete.` Confirm with `curl http://localhost:8000/v1/models`.

For Super (64B), add `--tensor-parallel-size` and optionally `--enable-layerwise-offload`. CFG and Ulysses parallelism use `--cfg-parallel-size` and `--ulysses-degree`; set request `guidance_scale` for CFG strength (do not use `true_cfg_scale`).

## `POST /v1/images/generations`

:::endpoint POST /v1/images/generations Text-to-image; returns base64-encoded PNG in JSON.

This endpoint accepts **`Content-Type: application/json`** (not multipart). Cosmos-specific image options go in `extra_args`, not `extra_params`.

<ParamField body="prompt" type="string" required>
Positive text prompt. Cookbooks often pass a JSON-encoded structured prompt string.
</ParamField>

<ParamField body="size" type="string" required>
Output resolution as `widthxheight` (for example `1280x720` for 720p 16:9).
</ParamField>

<ParamField body="n" type="integer">
Number of images; cookbooks use `1`.
</ParamField>

<ParamField body="num_inference_steps" type="integer">
Diffusion denoising steps.
</ParamField>

<ParamField body="guidance_scale" type="number">
Classifier-free guidance scale for Cosmos 3. Use this field with `--cfg-parallel-size`; do not send `true_cfg_scale`.
</ParamField>

<ParamField body="flow_shift" type="number">
Scheduler flow-shift value (Diffusers cookbooks commonly use `10.0`).
</ParamField>

<ParamField body="seed" type="integer">
Reproducibility seed.
</ParamField>

<ParamField body="extra_args" type="object">
Cosmos 3 image options such as `use_resolution_template` and `guardrails`.
</ParamField>

<ResponseField name="data" type="array">
Each element includes `b64_json` (PNG bytes, base64-encoded).
</ResponseField>

:::

<RequestExample>

```bash title="curl text-to-image"
curl -sS -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A robotics laboratory with a manipulator arm.",
    "size": "1280x720",
    "n": 1,
    "num_inference_steps": 35,
    "guidance_scale": 6.0,
    "flow_shift": 10.0,
    "seed": 0,
    "extra_args": {
      "use_resolution_template": false,
      "guardrails": true
    }
  }'
```

</RequestExample>
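
The response carries the PNG as base64 text, so it has to be decoded before it can be viewed. A minimal sketch, where `save_first_image` is a hypothetical helper and `response_json` is the parsed JSON body of the request above:

```python
import base64
from pathlib import Path
from typing import Any


def save_first_image(response_json: dict[str, Any], out_path: str) -> Path:
    """Decode data[0].b64_json and write the PNG bytes to `out_path`."""
    png_bytes = base64.b64decode(response_json["data"][0]["b64_json"])
    out = Path(out_path)
    out.write_bytes(png_bytes)
    return out
```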

## `POST /v1/videos/sync`

:::endpoint POST /v1/videos/sync Blocking video generation; returns MP4 bytes directly.

Send **`multipart/form-data`**. Set `Accept: video/mp4` to receive the encoded file in the response body.

| Mode | Required inputs | Notable fields |
| --- | --- | --- |
| Text-to-video | `prompt` | `size`, `num_frames`, `fps`, `guidance_scale` |
| Image-to-video | `prompt`, `input_reference` (image file) | Same sampling fields |
| Video-to-video | `prompt`, `input_reference` (video file) | `extra_params`: `condition_frame_indexes_vision`, `condition_video_keep` |
| Video with sound | `prompt` | `generate_sound=true`, `sound_duration` (seconds, often `num_frames / fps`) |
| Forward dynamics | `input_reference` (image), action in `extra_params` | `action_mode`: `forward_dynamics` |

<ParamField body="prompt" type="string" required>
Positive text prompt (plain string or JSON-encoded structured prompt).
</ParamField>

<ParamField body="negative_prompt" type="string">
Concepts or artifacts to avoid.
</ParamField>

<ParamField body="size" type="string">
Output resolution as `widthxheight` (for example `1280x720`, `832x480`, `320x192`).
</ParamField>

<ParamField body="num_frames" type="string | integer">
Video length in frames. Supported range is 5–300; default in model settings is 189. Action cookbooks often set `action_chunk_size + 1`.
</ParamField>

<ParamField body="fps" type="string | integer">
Frame rate in frames per second. Listed values are 10, 16, 24, or 30 (default 24); the action examples set embodiment-specific rates instead: 10 for AV forward dynamics, 15 for DROID, and 20 for UMI.
</ParamField>

<ParamField body="num_inference_steps" type="string | integer">
Diffusion denoising steps (README example: 35; action examples: 30).
</ParamField>

<ParamField body="guidance_scale" type="string | number">
CFG scale. Audiovisual examples use `6.0`; action examples use `1.0`.
</ParamField>

<ParamField body="flow_shift" type="string | number">
Scheduler flow-shift (commonly `10.0`).
</ParamField>

<ParamField body="seed" type="string | integer">
Reproducibility seed.
</ParamField>

<ParamField body="max_sequence_length" type="integer">
Maximum prompt tokens kept for conditioning (Cosmos 3 default `512`). Longer prompts are truncated with a warning.
</ParamField>

<ParamField body="input_reference" type="file">
Uploaded image or video for image-to-video, video-to-video, and action conditioning. Use curl `-F input_reference=@/path/to/file`.
</ParamField>

<ParamField body="generate_sound" type="string">
Set to `true` to mux a stereo AAC soundtrack (48 kHz) into the output MP4.
</ParamField>

<ParamField body="sound_duration" type="string">
Soundtrack length in seconds; cookbooks set `num_frames / fps` to three decimal places.
</ParamField>

<ParamField body="extra_params" type="string (JSON)">
JSON-encoded Cosmos 3 options (see table below). Serialize compactly when using curl `--form-string`.
</ParamField>

:::

<RequestExample>

```bash title="curl text-to-video"
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  -H "Accept: video/mp4" \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=4.0" \
  --form-string "seed=42" \
  -o cosmos3_t2v_output.mp4
```

</RequestExample>

<RequestExample>

```python title="requests text-to-video with extra_params"
import json
import requests

response = requests.post(
    "http://localhost:8000/v1/videos/sync",
    data={
        "prompt": json.dumps({"scene": "robot kitchen"}),
        "negative_prompt": json.dumps({"avoid": "blur"}),
        "size": "1280x720",
        "num_frames": "189",
        "fps": "24",
        "num_inference_steps": "35",
        "guidance_scale": "6.0",
        "flow_shift": "10.0",
        "seed": "0",
        "extra_params": json.dumps({
            "use_resolution_template": False,
            "use_duration_template": False,
            "guardrails": True,
        }),
    },
    headers={"Accept": "video/mp4"},
)
response.raise_for_status()
# Write the MP4 bytes and close the file handle deterministically.
with open("/tmp/cosmos3_t2v.mp4", "wb") as f:
    f.write(response.content)
```

</RequestExample>
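
For the video-with-sound mode, `sound_duration` is derived from the clip length. A small sketch of that convention: `sound_fields` is a hypothetical helper, and the keys it returns are the form fields documented for this endpoint.

```python
def sound_fields(num_frames: int, fps: int) -> dict[str, str]:
    """Form fields that request a soundtrack matching the clip length."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return {
        "num_frames": str(num_frames),
        "fps": str(fps),
        "generate_sound": "true",
        # Cookbook convention: num_frames / fps, three decimal places.
        "sound_duration": f"{num_frames / fps:.3f}",
    }
```

Merge the result into the `data` dictionary of a `/v1/videos/sync` request alongside `prompt` and the other sampling fields.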

## `POST /v1/videos` (async jobs)

:::endpoint POST /v1/videos Asynchronous video/action jobs with polling.

Use this endpoint when the response includes **predicted action** data (policy or inverse dynamics) or when generation time exceeds comfortable HTTP timeouts. The initial response is JSON with an `id`. Poll `GET /v1/videos/{id}` until `status` is `completed`, then fetch video bytes from `GET /v1/videos/{id}/content` if needed. Terminal failure states are `failed` and `cancelled`.

| Mode | `action_mode` in `extra_params` | `input_reference` | Primary output |
| --- | --- | --- | --- |
| Policy | `policy` | Image + instruction | Video + action chunk (read from completed job JSON) |
| Inverse dynamics | `inverse_dynamics` | Video + instruction | Predicted ego/action trajectory in `action` |
| Forward dynamics | `forward_dynamics` | Image + action chunk | Video (`/content`); action may be absent |

Cookbook forward-dynamics flows POST the same multipart fields as sync, but target `/v1/videos` and poll every two seconds until completion.

:::

### Async job lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant API as vLLM-Omni /v1/videos
    Client->>API: POST multipart (prompt, extra_params, input_reference)
    API-->>Client: { id }
    loop Until completed
        Client->>API: GET /v1/videos/{id}
        API-->>Client: { status, progress }
    end
    Client->>API: GET /v1/videos/{id}/content
    API-->>Client: MP4 bytes
    Note over Client: action fields read from final GET /v1/videos/{id}
```
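
The polling loop in the diagram can be sketched as follows. `wait_for_job` and its `get_job` callable are hypothetical: `get_job` is expected to wrap `GET /v1/videos/{id}` and return the parsed JSON. The two-second interval follows the cookbook; the timeout is an arbitrary default.

```python
import time
from typing import Any, Callable

TERMINAL_FAILURES = {"failed", "cancelled"}


def wait_for_job(
    get_job: Callable[[], dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 1800.0,
) -> dict[str, Any]:
    """Poll until the job completes; raise on failure, cancellation, or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_job()
        status = job.get("status")
        if status == "completed":
            return job  # action fields, when present, are read from this JSON
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not complete in time")
        time.sleep(interval_s)


# Example (assumes the `requests` package and a known job_id):
# job = wait_for_job(lambda: requests.get(f"http://localhost:8000/v1/videos/{job_id}").json())
```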

## `extra_params` reference

Pass `extra_params` as a **single JSON object** serialized into one form field. Keys used in Cosmos 3 cookbooks and the root README:

| Key | Applies to | Purpose |
| --- | --- | --- |
| `action_mode` | Action | `policy`, `inverse_dynamics`, or `forward_dynamics` |
| `domain_name` | Action | Embodiment selector: `av`, `droid_lerobot`, `umi`, `bridge_orig_lerobot`, `camera_pose`, and variants in vLLM-Omni online-serving examples |
| `action_chunk_size` | Action | Trajectory length in action steps (AV example: 60; DROID/UMI: 16) |
| `raw_action_dim` | Action | Action vector width (inverse dynamics AV example: `9`; UMI: `10`) |
| `action_path` | Action | Server-readable path to a JSON action file (requires `--allowed-local-media-path`) |
| `action` | Action | Inline JSON array of action rows (used in forward-dynamics notebooks instead of `action_path`) |
| `image_size` | Action | Conditioning resolution tier (for example `480`) |
| `view_point` | Action | Camera/view identifier (for example `ego_view`) |
| `condition_frame_indexes_vision` | Video-to-video | Which source frames stay as clean vision conditioning |
| `condition_video_keep` | Video-to-video | Conditioning retention policy for source video |
| `use_resolution_template` | Vision | When `false`, do not wrap prompt with resolution template |
| `use_duration_template` | Vision | When `false`, do not wrap prompt with duration template |
| `guardrails` | All | When `false`, skip prompt screening and face blurring for that request |

<Warning>
Use curl `--form-string` for `prompt`, `negative_prompt`, and `extra_params`. With `-F`, curl treats `;` as a MIME parameter separator and **silently truncates** values that contain semicolons (common inside JSON).
</Warning>

<Tip>
Reserve `-F` for file uploads such as `input_reference=@/path/to/image.jpg`, and send every text field with `--form-string`.
</Tip>

### Action `extra_params` examples

**AV forward dynamics** (async job, 60-step chunk at 10 FPS):

```json
{
  "action_mode": "forward_dynamics",
  "domain_name": "av",
  "action_chunk_size": 60,
  "image_size": 480,
  "view_point": "ego_view",
  "action": [[...]],
  "guardrails": false
}
```

Form fields alongside `extra_params`: `num_frames` = `61` (`action_chunk_size + 1`), `fps` = `10`, `guidance_scale` = `1.0`, `flow_shift` = `10.0`, plus `input_reference` image upload.
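
The relationship between `action_chunk_size` and `num_frames` is easier to keep consistent when both are derived from the action array. The helper below is a minimal sketch: `forward_dynamics_fields` is a hypothetical name, the zero-filled action rows are placeholders rather than real trajectories, and the `input_reference` image is still uploaded separately as a file part.

```python
import json


def forward_dynamics_fields(actions: list[list[float]], fps: int = 10) -> dict[str, str]:
    """Build the text form fields for an AV forward-dynamics request.

    `actions` is the inline action array, one row per action step.
    """
    chunk = len(actions)
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": "av",
        "action_chunk_size": chunk,
        "image_size": 480,
        "view_point": "ego_view",
        "action": actions,
        "guardrails": False,
    }
    return {
        # One JSON object serialized into a single form field.
        "extra_params": json.dumps(extra_params, separators=(",", ":")),
        "num_frames": str(chunk + 1),  # action_chunk_size + 1
        "fps": str(fps),
        "guidance_scale": "1.0",
        "flow_shift": "10.0",
    }


# Placeholder trajectory: 60 steps of a 9D AV action vector.
fields = forward_dynamics_fields([[0.0] * 9 for _ in range(60)])
print(fields["num_frames"])  # 61
```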

**AV inverse dynamics** (async job, video input):

```json
{
  "action_mode": "inverse_dynamics",
  "domain_name": "av",
  "action_chunk_size": 60,
  "image_size": 480,
  "view_point": "ego_view",
  "raw_action_dim": 9,
  "guardrails": false
}
```

On completion, read `action` from the final `GET /v1/videos/{id}` JSON (notebooks expect `action.data` for inverse dynamics).
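
Reading the prediction defensively avoids a `KeyError` when a job finishes without action output. This sketch assumes only the layout described above (`action.data` on the completed job JSON); `extract_action_rows` is a hypothetical helper, and the exact schema should be checked against a real response.

```python
def extract_action_rows(job: dict) -> list:
    """Return action rows from a completed job's JSON, or [] when absent.

    Assumes the layout this page describes: job["action"]["data"].
    """
    if job.get("status") != "completed":
        raise ValueError(f"job is not completed: {job.get('status')!r}")
    action = job.get("action")
    if not isinstance(action, dict):
        return []  # forward dynamics jobs may carry no action field
    return list(action.get("data") or [])
```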

**DROID / UMI forward dynamics** use `domain_name` values `droid_lerobot` and `umi` respectively, with `action_chunk_size` 16 and embodiment-specific FPS (15 for DROID, 20 for UMI in cookbook specs).

## Embodiment quick reference

Action cookbooks align `domain_name`, frame counts, and action dimensionality as follows:

| Embodiment | `domain_name` (examples) | Action dim | Typical `action_chunk_size` | FPS in cookbooks |
| --- | --- | ---: | ---: | ---: |
| Autonomous vehicle | `av` | 9D ego pose | 60 | 10 |
| DROID | `droid_lerobot` | 10D (9D pose + 1D gripper) | 16 | 15 |
| UMI | `umi` | 10D | 16 | 20 |

See the action modality page for full embodiment semantics and additional `domain_name` values referenced in vLLM-Omni online-serving examples.
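
The quick-reference table can be kept as a lookup and used to sanity-check an inline `action` array before sending it. This is a sketch, not server behavior: `Embodiment`, `EMBODIMENTS`, and `check_actions` are hypothetical names, whether the server enforces these shapes is not stated here, and `chunk_seconds` assumes one action step per frame interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    action_dim: int
    chunk_size: int
    fps: int

    @property
    def chunk_seconds(self) -> float:
        """Time span of one action chunk, assuming one step per frame interval."""
        return self.chunk_size / self.fps


# Values copied from the quick-reference table above.
EMBODIMENTS = {
    "av": Embodiment(action_dim=9, chunk_size=60, fps=10),
    "droid_lerobot": Embodiment(action_dim=10, chunk_size=16, fps=15),
    "umi": Embodiment(action_dim=10, chunk_size=16, fps=20),
}


def check_actions(domain_name: str, actions: list[list[float]]) -> None:
    """Raise ValueError if `actions` does not match the table's shape."""
    spec = EMBODIMENTS[domain_name]
    if len(actions) != spec.chunk_size:
        raise ValueError(f"expected {spec.chunk_size} rows, got {len(actions)}")
    for i, row in enumerate(actions):
        if len(row) != spec.action_dim:
            raise ValueError(f"row {i}: expected {spec.action_dim} values, got {len(row)}")


print(EMBODIMENTS["av"].chunk_seconds)   # 6.0
print(EMBODIMENTS["umi"].chunk_seconds)  # 0.8
```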

## Guardrails

Cosmos 3 enables safety guardrails by default (prompt screening and face blurring). Disable per request:

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box." \
  --form-string 'extra_params={"guardrails":false,"use_resolution_template":false,"use_duration_template":false}' \
  -o cosmos3_t2v.mp4
```

To disable guardrails server-wide, pass a deploy config with `model_config.guardrails: false` via `vllm serve … --deploy-config no_guardrails.yaml`. In this mode the guardrail models are not loaded, so a per-request `guardrails: true` cannot re-enable them.

## Resolution and sampling defaults

| Setting | Supported / default |
| --- | --- |
| Resolution tiers | 256p (`320x192`), 480p (`832x480`), 720p (`1280x720`); default tier 480p |
| Aspect ratios | 16:9, 4:3, 1:1, 3:4, 9:16; default 16:9 |
| Frame count | 5–300; default 189 |
| Frame rates | 10, 16, 24, 30 FPS; default 24 |
| Prompt length | Fewer than 300 words recommended for generation prompts |

External API field definitions also appear in the [vLLM-Omni Image Generation API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/image_generation_api/) and [Videos API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/videos_api/) documentation; Cosmos-specific behavior is concentrated in `extra_params`, `extra_args`, and `action_mode`.
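
A client-side pre-check against the table above can catch out-of-range settings before a long generation starts. This is a sketch only: `check_video_settings` is a hypothetical helper and the server's own validation may differ. The limits apply to video requests, since text-to-image uses `num_frames=1`. The action cookbooks on this page run at 15 and 20 FPS, so treat the FPS check as a warning rather than a hard rule.

```python
SUPPORTED_FPS = {10, 16, 24, 30}
MIN_FRAMES, MAX_FRAMES = 5, 300
MAX_PROMPT_WORDS = 300  # a recommendation in the table, not a hard limit


def check_video_settings(num_frames: int = 189, fps: int = 24, prompt: str = "") -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        problems.append(f"num_frames={num_frames} is outside {MIN_FRAMES}-{MAX_FRAMES}")
    if fps not in SUPPORTED_FPS:
        problems.append(f"fps={fps} is not one of {sorted(SUPPORTED_FPS)}")
    words = len(prompt.split())
    if words >= MAX_PROMPT_WORDS:
        problems.append(f"prompt has {words} words; fewer than {MAX_PROMPT_WORDS} is recommended")
    return problems


print(check_video_settings())                       # []
print(len(check_video_settings(num_frames=4, fps=25)))  # 2
```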

## Related pages

<CardGroup>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Docker server startup, tensor parallelism, CFG/Ulysses, and deploy-config guardrail disable.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics multipart requests with `domain_name` and chunked robotics generation.
</Card>
<Card title="Action modality" href="/action-modality">
Embodiment dimensions, policy/inverse/forward dynamics semantics, and `domain_name` conditioning.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame rates, vision conditioning frame counts, and sound output format.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON prompts, upsampling defaults, and template toggles mirrored in `extra_params`.
</Card>
<Card title="Audiovisual cookbooks" href="/audiovisual-cookbooks">
End-to-end `run_with_vllm_omni.ipynb` recipes for image and video with optional sound.
</Card>
</CardGroup>

---

## 17. Diffusers pipeline reference

> Cosmos3OmniPipeline.from_pretrained modes (text-to-image, text-to-video, image-to-video, text-to-video-with-sound), key call arguments, export_to_video, and torch-backend install pairing.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/17-diffusers-pipeline-reference.md
- Generated: 2026-06-01T20:27:09.984Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/car_colliding.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/car_driving.jpg`

---
title: "Diffusers pipeline reference"
description: "Cosmos3OmniPipeline.from_pretrained modes (text-to-image, text-to-video, image-to-video, text-to-video-with-sound), key call arguments, export_to_video, and torch-backend install pairing."
---

`Cosmos3OmniPipeline` is the Hugging Face Diffusers entry point for Cosmos 3 Generator audiovisual workflows. It loads full omnimodal checkpoints (`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`), runs diffusion with a UniPC scheduler and configurable `flow_shift`, accepts compact JSON scene prompts, and returns frames that you save as a PNG (single frame), as a silent MP4 via `export_to_video`, or as an MP4 with audio via `encode_video` when sound is enabled.

## Checkpoints and `from_pretrained`

Cookbooks map logical model names to Hugging Face IDs:

| Cookbook name | `from_pretrained` ID |
| --- | --- |
| `Cosmos3-Nano` | `nvidia/Cosmos3-Nano` |
| `Cosmos3-Super` | `nvidia/Cosmos3-Super` |

The omnimodal Nano and Super checkpoints cover text-to-image, text-to-video, image-to-video, and sound-capable video generation in one pipeline. Specialized Super variants (`Cosmos3-Super-Text2Image`, `Cosmos3-Super-Image2Video`) exist for task-focused serving; the audiovisual Diffusers cookbook uses the full omni checkpoints above.

<ParamField body="model_id" type="string" required>
Hugging Face repo id, e.g. `nvidia/Cosmos3-Nano`.
</ParamField>

<ParamField body="torch_dtype" type="torch.dtype">
Use `torch.bfloat16` (BF16 is the tested Generator precision).
</ParamField>

<ParamField body="device_map" type="string">
Optional in minimal quickstarts: `device_map="cuda"` loads weights directly on GPU. The cookbook instead calls `pipe.to("cuda")` after load.
</ParamField>

<ParamField body="safety_checker" type="object | None">
Cookbook sets `safety_checker=None` with `enable_safety_checker=True` so guardrails still run through the pipeline’s safety path.
</ParamField>

<ParamField body="token" type="string | None">
Pass `HF_TOKEN` when not using `hf auth login`; gated Cosmos3 repos require Hugging Face authentication.
</ParamField>

<CodeGroup>
```python title="README quickstart"
import torch
from diffusers import Cosmos3OmniPipeline

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```

```python title="Cookbook pattern"
import os

import torch
from diffusers import Cosmos3OmniPipeline

pipe = Cosmos3OmniPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    safety_checker=None,
    enable_safety_checker=True,
    token=os.environ.get("HF_TOKEN") or None,
)
pipe.to("cuda")
```
</CodeGroup>

<Note>
The first run downloads weights from Hugging Face. Diffusion at 720p with 35 steps is compute-heavy; long per-step times on the first generation are expected, not a hang.
</Note>

## Generation modes

README documents four Diffusers modes. The audiovisual notebook implements them with internal `model_mode` values and `enable_sound`:

| Diffusers mode | Notebook `model_mode` | Conditioning | `num_frames` | Sound |
| --- | --- | --- | ---: | --- |
| Text-to-image | `text2image` | Prompt only | `1` | Off |
| Text-to-video | `text2video` | Prompt + optional negative JSON | `189` (default) | `enable_sound=False` |
| Image-to-video | `image2video` | Prompt + `image=` + negative JSON | `189` | `enable_sound=False` |
| Text-to-video-with-sound | `text2video` | Same as T2V | `189` | `enable_sound=True` |

Image-to-video-with-sound uses `image2video` with `enable_sound=True`. Super cookbook sections run T2V and I2V without audio; Nano runs the full matrix including sound.

```text
                     ┌───────────────────┐
  JSON prompt ──────►│ Cosmos3Omni       │
  negative (video)   │ Pipeline.__call__ │──► result.video (frames)
  image (i2v)        └─────────┬─────────┘    result.sound (optional)
                               │
       ┌───────────────────────┼───────────────────────┐
       ▼                       ▼                       ▼
 enable_sound=True &     otherwise (video)       text2image
 sound present                                   (num_frames=1)
       │                       │                       │
       ▼                       ▼                       ▼
 encode_video            export_to_video         result.video[0].save
 (AAC in MP4)            (silent MP4)            (.png)
```

## Install and `--torch-backend` pairing

Create a Python 3.13 venv and install Diffusers from the upstream git ref plus media dependencies. Pin `--torch-backend` to the CUDA major version your NVIDIA driver supports.

| Driver CUDA | `--torch-backend` | Notes |
| --- | --- | --- |
| 13.x | `cu130` | Default in cookbooks (`COSMOS3_TORCH_BACKEND=cu130`). |
| 12.x | `cu128` | Use on CUDA 12.8 drivers. |
| Detect at install | `auto` | README quickstart option; uv picks a matching `torch` wheel. Without any pin, uv may install `cu130` and `torch.cuda.is_available()` returns `False` on older drivers. |

<Tabs>
<Tab title="Cookbook (explicit backend)">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate

uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate av cosmos_guardrail huggingface_hub \
  imageio imageio-ffmpeg torch torchvision transformers
```

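On CUDA 12.x systems, run the same command with `--torch-backend=cu128`. The cookbooks parameterize this value as `COSMOS3_TORCH_BACKEND` (default `cu130`), so there you can `export COSMOS3_TORCH_BACKEND=cu128` instead.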

</Tab>
<Tab title="README quickstart (auto backend)">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=auto \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate av cosmos_guardrail huggingface_hub \
  imageio imageio-ffmpeg torch torchvision transformers
```

</Tab>
</Tabs>

<Warning>
Requires `uv >= 0.11.3` for `cu130` backend recognition. Headless Linux may need `libxcb1`, `libgl1`, and `libglib2.0-0` before importing the pipeline.
</Warning>
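
The driver-to-backend pairing in the table above can be expressed as a small lookup. This is a sketch under stated assumptions: `pick_torch_backend` is a hypothetical helper, and it expects the driver CUDA version as a string (for example the value shown in the `nvidia-smi` header), which you supply yourself.

```python
def pick_torch_backend(driver_cuda: str) -> str:
    """Map a driver CUDA version string such as "12.8" to a --torch-backend tag."""
    major = int(driver_cuda.strip().split(".")[0])
    if major == 13:
        return "cu130"
    if major == 12:
        return "cu128"
    # Only the two pairings documented above are covered.
    raise ValueError(f"no documented backend for driver CUDA {driver_cuda!r}")


print(pick_torch_backend("13.0"))  # cu130
print(pick_torch_backend("12.8"))  # cu128
```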

Verify GPU visibility after install:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
PY
```

## Scheduler

Replace the pipeline scheduler before generation. Cookbooks use `UniPCMultistepScheduler` with `flow_shift` (default **10.0**, aligned with vLLM-Omni `flow_shift`):

```python
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=10.0
)
```

Per-run payloads can override `shift`; the runner re-applies `flow_shift=payload["shift"]` before each call.

## `__call__` parameters

Shared cookbook defaults (`FIXED_SAMPLING`):

| Parameter | Default | Role |
| --- | ---: | --- |
| `num_inference_steps` | `35` | Diffusion steps |
| `guidance_scale` | `6.0` | Classifier-free guidance |
| `fps` | `24` | Output frame rate (video modes) |
| `num_frames` | `189` | ~7.9 s at 24 FPS for T2V/I2V |
| `seed` | `1234` | Via `torch.Generator(device="cuda").manual_seed(...)` |

<ParamField body="prompt" type="string" required>
Scene description. Cookbooks pass **compact JSON** (`json.dumps(..., separators=(",", ":"))`) from files under `assets/prompts/{text2image,text2video,image2video}/`. Plain text strings also work in the README minimal example.
</ParamField>

<ParamField body="negative_prompt" type="string">
Structured JSON for video modes, loaded from `assets/negative_prompts/{mode}/neg_prompt.json`. Text-to-image uses `negative_prompt=""`.
</ParamField>

<ParamField body="image" type="PIL.Image | None">
Required for image-to-video. Load with `load_image(path)`; cookbook sets `image=None` for text-only modes.
</ParamField>

<ParamField body="height" type="int">
With `width`, derived from payload `resolution` + `aspect_ratio`. Cookbook supports `720` + `16,9` → **720×1280** and `256` + `16,9` → **192×320** (height×width).
</ParamField>

<ParamField body="width" type="int">
See `height`.
</ParamField>

<ParamField body="num_frames" type="int">
`1` for text-to-image; `189` for standard video examples.
</ParamField>

<ParamField body="fps" type="float">
Output FPS for video modes; passed to export helpers.
</ParamField>

<ParamField body="num_inference_steps" type="int">
Diffusion step count.
</ParamField>

<ParamField body="guidance_scale" type="float">
CFG strength.
</ParamField>

<ParamField body="enable_sound" type="bool">
`True` for text-to-video-with-sound (and image-to-video-with-sound). Requires a checkpoint with sound modules; when `result.sound` is present, mux audio with `encode_video`.
</ParamField>

<ParamField body="add_resolution_template" type="bool">
Cookbook sets `False` when height/width are explicit.
</ParamField>

<ParamField body="add_duration_template" type="bool">
Cookbook sets `False` when `num_frames` and `fps` are explicit.
</ParamField>

<ParamField body="generator" type="torch.Generator">
CUDA generator for reproducible seeds.
</ParamField>

### Text-to-image

```python
result = pipe(
    prompt=payload["prompt"],
    negative_prompt="",
    num_frames=1,
    height=720,
    width=1280,
    num_inference_steps=35,
    guidance_scale=6.0,
    add_resolution_template=False,
    add_duration_template=False,
    generator=generator,
)
result.video[0].save("output.png")
```

README notes single-frame generation returns a PIL-accessible frame via `result.video[0]`; the cookbook saves a PNG from that frame.

### Text-to-video and image-to-video

```python
import json

from diffusers.utils import load_image

image = load_image("assets/images/image2video/car_driving.jpg")  # i2v only

result = pipe(
    prompt=json.dumps(prompt_dict, separators=(",", ":")),
    negative_prompt=json.dumps(neg_dict, separators=(",", ":")),
    image=image,  # None for t2v
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=generator,
)
```

### Text-to-video-with-sound

Set `enable_sound=True`. After `__call__`, if `result.sound is not None`, mux with `encode_video` using the pipeline sound tokenizer sample rate; otherwise fall back to silent `export_to_video`.

## Exporting outputs

| Output | Helper | When |
| --- | --- | --- |
| Silent MP4 | `export_to_video(result.video, path, fps=24, macro_block_size=1)` | Video modes without audio |
| MP4 + AAC | `encode_video(result.video, fps=..., output_path=..., audio=result.sound, audio_sample_rate=pipe.sound_tokenizer.config.sampling_rate)` | `enable_sound=True` and sound returned |
| PNG | `result.video[0].save(path)` | `num_frames=1` |

Stereo AAC at 48 kHz is the documented sound output spec when generated with video. Match `fps` in export calls to the `fps` passed into `pipe()`.
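
The export table reads as a three-way branch on frame count and sound availability. The sketch below only names the helper to call rather than calling it, because the import location of `encode_video` is not given on this page; `choose_export` is a hypothetical function.

```python
def choose_export(num_frames: int, enable_sound: bool, sound) -> str:
    """Return which export path applies: 'png', 'encode_video', or 'export_to_video'."""
    if num_frames == 1:
        return "png"  # result.video[0].save(path)
    if enable_sound and sound is not None:
        return "encode_video"  # MP4 with AAC audio
    # Covers silent runs and sound runs where no audio came back.
    return "export_to_video"


print(choose_export(1, False, None))    # png
print(choose_export(189, True, None))   # export_to_video
```

Checking `sound is not None` separately from `enable_sound` reproduces the silent fallback described above for sound runs that return no audio.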

<Check>
Success signal: MP4 or PNG written to disk; cookbook prints `generated in X.Xs` and `wrote {path}`. For sound runs, confirm `result.sound is not None` before calling `encode_video`.
</Check>

## Structured prompt assets

Prompts under `cookbooks/cosmos3/generator/audiovisual/assets/prompts/` are JSON scene specs (subjects, cinematography, `temporal_caption`, `resolution`, `fps`, etc.). Example text-to-video prompt: `assets/prompts/text2video/car_colliding.json` (720p, 24 FPS, 7s duration in metadata). Image-to-video pairs prompts like `assets/prompts/image2video/car_driving.json` with conditioning images such as `assets/images/image2video/car_driving.jpg`.

Pass prompts to the pipeline as **serialized JSON strings**, not raw dicts, for parity with the cookbooks and vLLM-Omni.
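
A minimal sketch of that serialization step, using the same `separators=(",", ":")` form the cookbooks use. `load_prompt` is a hypothetical helper name, and the path you pass is your own checkout's asset path.

```python
import json
from pathlib import Path


def load_prompt(path) -> str:
    """Read a JSON prompt file and return it as a compact JSON string."""
    prompt_dict = json.loads(Path(path).read_text(encoding="utf-8"))
    # Compact separators drop the whitespace json.dumps adds by default.
    return json.dumps(prompt_dict, separators=(",", ":"))


print(json.dumps({"fps": 24, "resolution": "720"}, separators=(",", ":")))
# {"fps":24,"resolution":"720"}
```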

## Operational notes

- **Guardrails**: `cosmos_guardrail` is an install dependency; safety checking stays enabled in cookbook loads.
- **Memory**: Switching Nano ↔ Super in one process deletes the cached pipeline and calls `torch.cuda.empty_cache()` before reloading.
- **Upstream API docs**: Mode-specific examples and additional kwargs are documented in [Cosmos 3 Diffusers API](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos3).

## Related pages

<CardGroup>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Step-by-step install, scheduler setup, and first MP4 from the audiovisual cookbook.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared `cu130`/`cu128` matrix, HF auth, and GPU verification for all backends.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON prompt schema, negative prompts, and upsampling defaults.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, aspect ratios, and sound output specs.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
Full `run_with_diffusers.ipynb` walkthrough for every Nano/Super asset set.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
CUDA/driver mismatches, `torch.cuda` false negatives, libxcb, and uv backend errors.
</Card>
</CardGroup>

---

## 18. Reasoner vLLM configuration

> vllm serve flags: hf-overrides architectures, tensor-parallel-size, mm-encoder-tp-mode, async-scheduling, allowed-local-media-path, media-io-kwargs, VLLM_USE_DEEP_GEMM, and vLLM/cu130 version pairs.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/18-reasoner-vllm-configuration.md
- Generated: 2026-06-01T20:27:13.547Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`

---
title: "Reasoner vLLM configuration"
description: "vllm serve flags: hf-overrides architectures, tensor-parallel-size, mm-encoder-tp-mode, async-scheduling, allowed-local-media-path, media-io-kwargs, VLLM_USE_DEEP_GEMM, and vLLM/cu130 version pairs."
---

Cosmos 3 Reasoner production inference runs through `vllm serve` on gated Hugging Face checkpoints (`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`), with the `vllm-cosmos3` plugin registering `Cosmos3ReasonerForConditionalGeneration` and cookbook-tested flags for multimodal tensor parallelism, local `file://` media, and video frame ingestion.

## Install and CUDA version pairs

Create a Python 3.13 venv and install a **matched** `torch` backend and `vllm` wheel. vLLM does not ship wheels for every CUDA minor version, so `--torch-backend=auto` is unreliable for Reasoner serving — pin the pair that matches `nvidia-smi` driver CUDA.

| Driver CUDA | `uv` torch backend | `vllm` version |
| --- | --- | --- |
| 13.x | `cu130` | `0.21.0` |
| 12.x | `cu128` | `0.19.1` |

<Tabs>
<Tab title="CUDA 13 (cu130)">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate

uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

</Tab>
<Tab title="CUDA 12.x (cu128)">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate

uv pip install --torch-backend=cu128 "vllm==0.19.1" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

</Tab>
</Tabs>

The Reasoner notebook also installs `transformers-cosmos3` from a local `cosmos-framework` checkout (`packages/transformers-cosmos3` alongside `packages/vllm-cosmos3`). The git URL install path above is sufficient for a minimal server; clone the framework when you need the full notebook dependency set.

<Warning>
Installing `cu130` wheels on a CUDA 12.x driver yields `torch.cuda.is_available() == False` and server startup failures. Switch to the `cu128` / `vllm==0.19.1` pair instead of using `--torch-backend=auto`.
</Warning>

Authenticate to Hugging Face before the first serve (gated model repos):

```bash
uvx hf@latest auth login
```

## Reference `vllm serve` commands

Cookbooks use a **full** flag set for image and video Reasoner workloads. The root README quickstart omits some multimodal flags; prefer the cookbook commands below for parity with `run_with_vllm.ipynb` and `cookbooks/cosmos3/reasoner/README.md`.

### Cosmos3-Nano (single GPU)

```bash
CUDA_VISIBLE_DEVICES=0 \
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$(dirname "$(pwd)")" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8000
```

When launching from `cookbooks/cosmos3/reasoner/`, `$(dirname "$(pwd)")` resolves to `cookbooks/cosmos3`, covering `file://` paths under that tree.

### Cosmos3-Super (4 GPUs)

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve nvidia/Cosmos3-Super \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$COSMOS3_MEDIA_ROOT" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8001
```

Set `COSMOS3_MEDIA_ROOT` to the cookbook media root (for example `…/cosmos/cookbooks/cosmos3` in `run_with_vllm.ipynb`).

| Checkpoint | Typical GPUs | `--tensor-parallel-size` | Default port in docs |
| --- | --- | ---: | ---: |
| `nvidia/Cosmos3-Nano` | 1 | `1` | `8000` |
| `nvidia/Cosmos3-Super` | 4 | `4` | `8001` (notebook) |

<Note>
The first server start compiles CUDA graphs and can take several minutes. Poll readiness with `curl -fsS http://127.0.0.1:<port>/health` or `curl http://localhost:<port>/v1/models`.
</Note>

```text
Client (OpenAI /v1/chat/completions)
        │
        ▼
vllm serve nvidia/Cosmos3-{Nano|Super}
  + vllm-cosmos3 → Cosmos3ReasonerForConditionalGeneration
  + mm-encoder / media-io / allowed-local-media-path
        │
        ▼
Text output (Qwen3-VL-compatible multimodal messages in)
```

## Serve flag reference

| Flag | Role |
| --- | --- |
| `--hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}'` | Loads the Reasoner head; without it, the omnimodal checkpoint's default architecture is the wrong one for Reasoner (text-out) serving. |
| `--tensor-parallel-size` | Shards model weights across GPUs (`1` for Nano, `4` for Super in cookbooks). |
| `--mm-encoder-tp-mode data` | Uses data parallelism for the multimodal visual encoder. |
| `--async-scheduling` | Enables async scheduling (used in all Reasoner cookbook serve lines). |
| `--allowed-local-media-path` | Required prefix allowlist when requests use local `file://` image or video URLs. |
| `--media-io-kwargs '{"video": {"num_frames": -1}}'` | Lets the processor consider all available frames before downstream frame sampling. |
| `--port` | HTTP port for the OpenAI-compatible API (cookbook examples use `8000` or `8001`). |

<ParamField body="--hf-overrides" type="JSON string" required>
Overrides Hugging Face config at load time. Reasoner serving sets `architectures` to `Cosmos3ReasonerForConditionalGeneration`, registered by the `vllm-cosmos3` plugin.
</ParamField>

<ParamField body="--tensor-parallel-size" type="integer" required>
Number of GPUs for tensor-parallel inference. It must match the number of devices in `CUDA_VISIBLE_DEVICES`, and the chosen checkpoint must fit in the available GPU memory. The README quickstart omits the flag for single-GPU Nano.
</ParamField>

<ParamField body="--mm-encoder-tp-mode" type="string">
Cookbooks set `data` for data-parallel multimodal encoder execution alongside tensor-parallel language weights.
</ParamField>

<ParamField body="--async-scheduling" type="boolean (flag)">
Present on all documented Reasoner serve commands; omit only if you are deliberately matching a minimal README-only experiment.
</ParamField>

<ParamField body="--allowed-local-media-path" type="filesystem path">
Directory prefix allowed for local media paths in chat messages. Must be a parent of every `file://` path you send. Remote `https://` URLs do not require this flag but still need network access from the server.
</ParamField>

<ParamField body="--media-io-kwargs" type="JSON string">
Server-side media I/O options. Cookbooks pass `{"video": {"num_frames": -1}}` so video ingestion does not cap frames before the client or processor applies sampling.
</ParamField>
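
When the server is launched from Python rather than a shell, building the flags as an argv list avoids nested JSON quoting. This is a sketch: `build_serve_argv` is a hypothetical helper, `/data/cosmos3` is a placeholder media root, and the flag set is the cookbook one listed above.

```python
import json
import shlex


def build_serve_argv(model: str, tensor_parallel_size: int, media_root: str, port: int) -> list[str]:
    """Assemble the cookbook flag set as an argv list (no shell quoting needed)."""
    return [
        "vllm", "serve", model,
        "--hf-overrides", json.dumps({"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--mm-encoder-tp-mode", "data",
        "--async-scheduling",
        "--allowed-local-media-path", media_root,
        "--media-io-kwargs", json.dumps({"video": {"num_frames": -1}}),
        "--port", str(port),
    ]


argv = build_serve_argv("nvidia/Cosmos3-Nano", 1, "/data/cosmos3", 8000)
print(shlex.join(argv))  # shell-safe string for logging or copy/paste
```

The list can be passed directly to `subprocess.Popen(argv, env=...)`, with `CUDA_VISIBLE_DEVICES` set in `env` to match `tensor_parallel_size`.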

### Minimal serve (README quickstart)

The repository README documents a shorter Nano command without `--tensor-parallel-size`, `--mm-encoder-tp-mode`, or `--media-io-kwargs`:

```bash
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Use the full cookbook flag set when serving local videos or matching benchmark/cookbook behavior.

## Environment variables

### `VLLM_USE_DEEP_GEMM`

If the vLLM build reports DeepGEMM as unavailable, disable it before starting the server:

```bash
export VLLM_USE_DEEP_GEMM=0
vllm serve nvidia/Cosmos3-Nano ...
```

### Process and build helpers

| Variable | Purpose |
| --- | --- |
| `CUDA_VISIBLE_DEVICES` | Restricts which GPUs the server binds (`0` for Nano, `0,1,2,3` for Super in cookbooks). |
| `TMPDIR` | Notebook sets `/tmp/${USER:-vllm}-vllm` for vLLM temp files. |
| `VLLM_PORT` / `VLLM_LOG_FILE` | Notebook overrides for background `setsid` launch and log tailing. |

<Tip>
When invoking `.venv/bin/vllm` without activating the venv, keep `.venv/bin` on `PATH`. FlashInfer JIT builds shell out to `ninja`, which lives in the venv.
</Tip>

## Client configuration (server complement)

The server exposes OpenAI-compatible `/v1/chat/completions`. Clients use `api_key="EMPTY"` and `base_url="http://localhost:<port>/v1"`.

- Resolve the model id dynamically: `client.models.list().data[0].id`.
- Image and video content follow Qwen3-VL-style message blocks (`image_url`, `video_url`).
- For local video in the notebook, convert paths with `Path(...).resolve().as_uri()` and ensure the path stays under `--allowed-local-media-path`.
- Per-request video sampling can be passed via `extra_body`, for example `{"mm_processor_kwargs": {"fps": 4, "do_sample_frames": True}}`.

Reasoning-style outputs use a prompt suffix with `redacted_reasoning` tags; sampling defaults differ with and without that format — see the sampling parameters page.
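
The local-video bullets above combine into a short client-side sketch. `to_file_uri` and `video_request` are hypothetical helpers; the allowlist check mirrors `--allowed-local-media-path` on the client so a bad path fails before the request is sent, and the `fps` value is the example from this page.

```python
from pathlib import Path


def to_file_uri(path: str, allowed_root: str) -> str:
    """Resolve `path` and return a file:// URI, refusing paths outside allowed_root."""
    resolved = Path(path).resolve()
    root = Path(allowed_root).resolve()
    if not resolved.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{resolved} is outside {root}")
    return resolved.as_uri()


def video_request(uri: str, question: str, fps: int = 4) -> dict:
    """Build keyword arguments for client.chat.completions.create (caller adds model)."""
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": uri}},
                {"type": "text", "text": question},
            ],
        }],
        # Per-request video sampling travels in extra_body.
        "extra_body": {"mm_processor_kwargs": {"fps": fps, "do_sample_frames": True}},
    }
```

A call would then look like `client.chat.completions.create(model=model, **video_request(uri, "Describe the clip."))`, with `client` and `model` obtained as in the smoke request on this page.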

## Verification

<Steps>
<Step title="Confirm GPU visibility">

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
PY
```

</Step>
<Step title="Start the server">

Run the Nano or Super command from above with matching `CUDA_VISIBLE_DEVICES` and `--tensor-parallel-size`.

</Step>
<Step title="Check the API">

```bash
curl -fsS http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```

</Step>
<Step title="Send a smoke request">

```python
import openai

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Caption the image in detail."},
        ],
    }],
    max_tokens=4096,
    seed=0,
)
print(response.choices[0].message.content)
```

</Step>
</Steps>

Published Reasoner serving metrics (TTFT, request latency, throughput at concurrency 1/64/128/256) for `nvidia/Cosmos3-Nano` via vLLM are in the inference benchmarks doc; concurrency there is **client** concurrency, not `--tensor-parallel-size`.

## Related pages

<CardGroup>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
End-to-end install, serve, chat completion, and reasoning-format prompts.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared vLLM + vllm-cosmos3 install, CUDA tags, and GPU verification.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Reasoner `top_p` / `temperature` tables and Qwen3-VL message shape.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
CUDA/driver mismatches, `uv` backend errors, and `VLLM_USE_DEEP_GEMM` workaround.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Cosmos3-Nano Reasoner vLLM TTFT and throughput under load.
</Card>
</CardGroup>

---

## 19. Sampling and prompt parameters

> Generator prompt-upsampling defaults, Reasoner sampling tables (with/without reasoning), structured JSON prompt schema, Qwen3-VL message shape, and redacted_reasoning format instruction.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/19-sampling-and-prompt-parameters.md
- Generated: 2026-06-01T20:28:23.477Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_pouring_water_audio.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/image2video/neg_prompt.json`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`

---
title: "Sampling and prompt parameters"
description: "Generator prompt-upsampling defaults, Reasoner sampling tables (with/without reasoning), structured JSON prompt schema, Qwen3-VL message shape, and redacted_reasoning format instruction."
---

Cosmos 3 splits prompt and sampling configuration across two surfaces: the **Generator** accepts dense structured JSON (often produced by prompt upsampling) plus diffusion denoising parameters (`guidance_scale`, `num_inference_steps`, `flow_shift`, `seed`), while the **Reasoner** uses Qwen3-VL-compatible chat messages with autoregressive sampling (`temperature`, `top_p`, `top_k`, penalties) and an optional `redacted_reasoning` output format for chain-of-thought tasks.

## Generator prompt upsampling

Prompt upsampling expands a short scene description into a dense structured JSON prompt suitable for diffusion conditioning. The repository documents these **upsampling** defaults (autoregressive text generation used to author prompts, not diffusion sampling):

| Parameter | Value |
| --- | ---: |
| `max_tokens` | `20000` |
| `temperature` | `0.7` |
| `top_p` | `0.8` |
| `top_k` | `20` |
| `repetition_penalty` | `1.0` |
| `presence_penalty` | `1.5` |
| `seed` | `3407` |

Checked-in examples under `cookbooks/cosmos3/generator/audiovisual/assets/prompts/` are already upsampled; you can reuse them directly or regenerate prompts with the same parameter profile.

## Generator diffusion sampling

Audiovisual cookbooks share a `FIXED_SAMPLING` block for **diffusion** generation across Diffusers, vLLM-Omni, and Cosmos Framework paths:

| Parameter | Cookbook default | Notes |
| --- | ---: | --- |
| `num_steps` / `num_inference_steps` | `35` | Denoising steps |
| `guidance` / `guidance_scale` | `6.0` | Classifier-free guidance; use `guidance_scale`, not `true_cfg_scale` |
| `shift` / `flow_shift` | `10.0` | UniPC scheduler flow shift (Diffusers sets `flow_shift` on `UniPCMultistepScheduler`) |
| `fps` | `24` | Output frame rate |
| `num_frames` | `189` | ~7.9 s at 24 FPS |
| `resolution` | `"720"` | Maps to 1280×720 (width × height) when `aspect_ratio` is `"16,9"` |
| `aspect_ratio` | `"16,9"` | Comma-separated form used in payloads |
| `seed` | `1234` (Diffusers), `0` (vLLM-Omni / Framework) | Backend-specific in notebooks |

Supported payload resolution mapping in notebooks:

| `resolution` | `aspect_ratio` | Height × width |
| --- | --- | --- |
| `720` | `16,9` | 720 × 1280 |
| `256` | `16,9` | 192 × 320 |

README quickstart examples may use different values (for example `guidance_scale=4.0`, `num_frames=81` in curl samples). Treat cookbook `FIXED_SAMPLING` as the reference for asset-driven runs; override per request for experiments.
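
As a minimal sketch, the resolution table and the frame-count arithmetic above can be checked in a few lines. `RESOLUTION_MAP`, `resolve_size`, and `clip_seconds` are illustrative names and not notebook helpers; only the two table rows listed on this page are included.

```python
# (resolution, aspect_ratio) -> (height, width), copied from the table above.
RESOLUTION_MAP = {
    ("720", "16,9"): (720, 1280),
    ("256", "16,9"): (192, 320),
}


def resolve_size(resolution, aspect_ratio):
    """Look up (height, width); raises KeyError for combinations not listed above."""
    return RESOLUTION_MAP[(resolution, aspect_ratio)]


def clip_seconds(num_frames, fps):
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


height, width = resolve_size("720", "16,9")
duration = clip_seconds(189, 24)
print(f"{width}x{height}, {duration:.3f} s")  # 1280x720, 7.875 s
```

The default of 189 frames at 24 FPS works out to 7.875 s, which the table rounds to ~7.9 s.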

<Warning>
Action Generator workflows (forward/inverse dynamics) use `guidance_scale=1.0` in the action cookbooks, not the audiovisual default of `6.0`.
</Warning>

### Template toggles

Cookbooks disable built-in prompt templates so structured JSON carries resolution and duration:

```python
add_resolution_template=False
add_duration_template=False
```

vLLM-Omni passes the equivalent toggles in `extra_params`, under `use_*` names and alongside `guardrails`:

```json
{"use_resolution_template": false, "use_duration_template": false, "guardrails": true}
```

## Structured JSON prompt schema

Generator audiovisual prompts are JSON objects checked in under `cookbooks/cosmos3/generator/audiovisual/assets/prompts/{text2image,text2video,image2video}/`. They are passed to the model as **compact JSON strings** (`json.dumps(..., separators=(",", ":"))`), not as raw prose.

### Top-level fields

| Field | Typical use | Present in |
| --- | --- | --- |
| `subjects` | Array of scene entities with per-subject attributes | All modes |
| `background_setting` | Environment description | All modes |
| `lighting` | `conditions`, `direction`, `shadows`, `illumination_effect` | All modes |
| `aesthetics` | `composition`, `color_scheme`, `mood_atmosphere`, `patterns` | All modes |
| `cinematography` | Camera framing, motion, DOF, lens | All modes |
| `style_medium`, `artistic_style`, `context` | Medium and narrative context | All modes |
| `text_and_signage_elements` | On-screen text (often `[]`) | All modes |
| `resolution` | `{"H": int, "W": int}` | All modes |
| `aspect_ratio` | e.g. `"16,9"` (comma, not colon) | All modes |
| `actions` | Timed action beats (`time`, `description`) | Video prompts |
| `segments` | Shot segments with `time_range`, `key_changes`, `camera` | Video prompts |
| `transitions` | Edit transitions (often `[]`) | Video prompts |
| `temporal_caption` | Dense timeline prose | `text2video`, `image2video` |
| `comprehensive_t2i_caption` | Single rich caption | `text2image` |
| `duration`, `fps` | Clip timing metadata | Video prompts |
| `audio_description` | Sound design prose | Prompts with sound |
| `subject_details` | Extra keyed detail (e.g. fabric, pins) | Some `text2image` |
| `quadrant_scan` | Spatial quadrant descriptions | Some `text2image` |
| `physical_realism` | Physics constraints prose | Some negative prompts |

### Per-subject fields

Each `subjects[]` entry commonly includes:

`description`, `appearance_details`, `relationship`, `location`, `relative_size`, `orientation`, `pose`, `action`, `state_changes`, and optional humanoid fields (`clothing`, `expression`, `gender`, `age`, `skin_tone_and_texture`, `facial_features`, `number_of_subjects`, `number_of_arms`, `number_of_legs`). Image prompts may add `number_of_hands`, `number_of_fingers`.

### Negative prompts

Video modes (`text2video`, `image2video`) pair a structured positive prompt with a structured **negative** JSON under `assets/negative_prompts/{mode}/neg_prompt.json`. The negative schema mirrors the positive layout (subjects, lighting, cinematography, `temporal_caption`, etc.) and describes artifacts to avoid. **Text-to-image** runs use an empty `negative_prompt` string in cookbooks.

### Serialization and payloads

```text
prompt file (.json)
    → compact_json_file() → payload["prompt"] (string)
    → pipeline / API (json.dumps(prompt) at call site)
```
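
The compact serialization step can be sketched with the standard library alone. `compact_json` and `load_compact_prompt` are hypothetical stand-ins for the notebook's `compact_json_file()`, whose exact implementation is not shown on this page.

```python
import json
from pathlib import Path


def compact_json(obj):
    """Serialize with no whitespace after separators, as the cookbooks do."""
    return json.dumps(obj, separators=(",", ":"))


def load_compact_prompt(path):
    """Read a structured prompt file and return it as a compact JSON string."""
    return compact_json(json.loads(Path(path).read_text(encoding="utf-8")))


# Small illustrative object using field names from the schema table.
example = {"aspect_ratio": "16,9", "resolution": {"H": 720, "W": 1280}, "transitions": []}
compact = compact_json(example)
print(compact)
```

Only structural whitespace is removed; the comma inside the `"16,9"` string value is untouched, and `json.loads(compact)` returns the original object.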

Notebook `create_payload()` writes a sidecar JSON per use case:

```json
{
  "model_mode": "text2video",
  "name": "t2v_nano_noaudio",
  "prompt": "{...compact structured JSON...}",
  "negative_prompt": "{...compact negative JSON or \"\"}",
  "enable_sound": false,
  "num_steps": 35,
  "guidance": 6.0,
  "shift": 10.0,
  "fps": 24,
  "num_frames": 189,
  "resolution": "720",
  "aspect_ratio": "16,9",
  "seed": 1234
}
```

Image-to-video payloads add `vision_path` (relative image path). Preview helpers resolve captions via `temporal_caption`, `comprehensive_t2i_caption`, or `extra.prompt`.

## Reasoner sampling parameters

Reasoner outputs are autoregressive text. Documented defaults depend on whether the task uses explicit chain-of-thought formatting.

### Without reasoning

Use for captioning, VQA, grounding, and other direct-answer tasks. The table below lists the documented defaults. vLLM cookbook calls often omit explicit sampling kwargs and rely on server defaults, while setting `max_tokens=4096` and `seed=0` on image examples.

| Parameter | Value |
| --- | ---: |
| `temperature` | `0.7` |
| `top_p` | `0.8` |
| `top_k` | `20` |
| `repetition_penalty` | `1.0` |
| `presence_penalty` | `1.5` |

### With reasoning

Use when the user prompt includes the `redacted_reasoning` format block (see below). Cookbooks and Framework inputs set:

| Parameter | Value |
| --- | ---: |
| `temperature` | `0.6` |
| `top_p` | `0.95` |
| `top_k` | `20` |
| `repetition_penalty` | `1.0` |
| `presence_penalty` | `0.0` |

vLLM example (OpenAI client):

```python
client.chat.completions.create(
    model=MODEL,
    messages=[...],
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20, "repetition_penalty": 1.0},
)
```
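
One way to keep the two tables from drifting apart is to store both profiles and build the request arguments from a single switch. This is a sketch under the assumption that `top_k` and `repetition_penalty` always travel in `extra_body`, as in the example above; `reasoner_request_kwargs` is an illustrative name, not a cookbook helper.

```python
# Sampling profiles copied from the two tables on this page.
WITHOUT_REASONING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                     "repetition_penalty": 1.0, "presence_penalty": 1.5}
WITH_REASONING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                  "repetition_penalty": 1.0, "presence_penalty": 0.0}


def reasoner_request_kwargs(with_reasoning, max_tokens=4096):
    """Build keyword arguments for client.chat.completions.create(...)."""
    profile = WITH_REASONING if with_reasoning else WITHOUT_REASONING
    return {
        "max_tokens": max_tokens,
        "temperature": profile["temperature"],
        "top_p": profile["top_p"],
        "presence_penalty": profile["presence_penalty"],
        # Non-standard OpenAI fields are passed through extra_body.
        "extra_body": {
            "top_k": profile["top_k"],
            "repetition_penalty": profile["repetition_penalty"],
        },
    }


kwargs = reasoner_request_kwargs(with_reasoning=True)
```

A call site would then look like `client.chat.completions.create(model=MODEL, messages=messages, **kwargs)`.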

Cosmos Framework Reasoner JSON can mirror the same fields (`do_sample`, `temperature`, `top_p`, `top_k`, `repetition_penalty`, `presence_penalty`, `max_new_tokens`) on trajectory and planning inputs.

### Video multimodal processing

Video requests in the vLLM Reasoner notebook commonly pass:

```python
extra_body={"mm_processor_kwargs": {"fps": 4, "do_sample_frames": True}}
```

Pair with server `--media-io-kwargs '{"video": {"num_frames": -1}}'` so the encoder can see all frames before downstream sampling.

## Qwen3-VL message shape

Reasoner serving follows **Qwen3-VL-compatible** chat conventions. Production vLLM uses `Cosmos3ReasonerForConditionalGeneration` with an OpenAI-compatible `/v1/chat/completions` API.

### Recommended layout

Include an optional `system` message, then a `user` message whose `content` is a **multimodal array** (vision first, then text):

```json
[
  {
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}]
  },
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": "https://example.com/video.mp4"},
      {"type": "text", "text": "List the notable events with approximate timestamps."}
    ]
  }
]
```

### URL forms used in cookbooks

| Modality | Content block | URL shape |
| --- | --- | --- |
| Image | `{"type": "image_url", "image_url": {"url": "<url>"}}` | HTTPS or `file://` from `Path(...).resolve().as_uri()` |
| Video | `{"type": "video_url", "video_url": {"url": "<url>"}}` | Same; local video requires `--allowed-local-media-path` on the server |

The README shows a flat `video_url` string, as in the layout example above, for illustration. Cookbook clients use the nested `{"url": ...}` object, so match the cookbook shape when copying runnable examples.

Image-only quickstarts may omit `system` and use a single `user` turn:

```python
messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "Caption the image in detail."},
    ],
}]
```
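
The message conventions above (vision block first, then text, nested `{"url": ...}` objects, optional `system` turn) can be wrapped in a small builder. The function names here are illustrative assumptions, not part of the cookbooks.

```python
from pathlib import Path


def media_block(kind, url):
    """Return a content block in the nested {"url": ...} form used by cookbook clients."""
    if kind not in ("image", "video"):
        raise ValueError(f"unsupported media kind: {kind}")
    key = f"{kind}_url"
    return {"type": key, key: {"url": url}}


def local_uri(path):
    """Convert a local path to a file:// URI via Path.resolve().as_uri()."""
    return Path(path).resolve().as_uri()


def build_messages(kind, url, text, system=None):
    """Vision block first, then text; the system turn is optional."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": [{"type": "text", "text": system}]})
    messages.append({
        "role": "user",
        "content": [media_block(kind, url), {"type": "text", "text": text}],
    })
    return messages


messages = build_messages(
    "video",
    "https://example.com/video.mp4",
    "List the notable events with approximate timestamps.",
    system="You are a helpful assistant.",
)
```

For local media, pass `local_uri("clip.mp4")` as the URL; the server still needs `--allowed-local-media-path` to read it.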

## `redacted_reasoning` format instruction

For chain-of-thought Reasoner tasks (embodied next-action, action CoT trajectories, planning), append this block to the **user text** (after the task-specific instruction):

```text
Answer the question using the following format:

<think>
Your reasoning.
</think>

Write your final answer immediately after the </think> tag.
```

The model emits its reasoning between the `<think>` and `</think>` tags (the `redacted_reasoning` format). Parsers split on `</think>` and read JSON or prose from the suffix. Notebook helpers strip fenced code and extract JSON arrays for trajectory visualization.

<Note>
Apply the **with reasoning** sampling table whenever this format block is present. Set `presence_penalty` to `0.0` (not `1.5`) so reasoning tokens are not suppressed.
</Note>

Compact single-line variant (equivalent intent):

```text
Answer the question using the following format: <think> Your reasoning. </think> Write your final answer immediately after the </think> tag.
```
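
The append-then-parse flow can be sketched as follows. This is a hypothetical re-implementation, not the notebook helpers themselves: it assumes the response contains at most one `</think>` tag and that any JSON answer is either bare or wrapped in a single fenced block.

```python
import json
import re

FORMAT_BLOCK = (
    "Answer the question using the following format:\n\n"
    "<think>\nYour reasoning.\n</think>\n\n"
    "Write your final answer immediately after the </think> tag."
)

FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown
FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)


def with_reasoning_format(task_text):
    """Append the format block after the task-specific instruction."""
    return f"{task_text}\n\n{FORMAT_BLOCK}"


def split_reasoning(response_text):
    """Split on the closing tag and return (reasoning, answer)."""
    head, sep, tail = response_text.partition("</think>")
    if not sep:  # no closing tag: treat the whole response as the answer
        return "", response_text.strip()
    return head.replace("<think>", "").strip(), tail.strip()


def extract_json(answer):
    """Strip an optional fenced wrapper, then parse the remaining JSON."""
    match = FENCED.search(answer)
    return json.loads(match.group(1) if match else answer)


response = ("<think>\nThe arm should move left.\n</think>\n"
            + FENCE + "json\n[[0.1, 0.2], [0.3, 0.4]]\n" + FENCE)
reasoning, answer = split_reasoning(response)
trajectory = extract_json(answer)
```

Requests built with `with_reasoning_format(...)` should use the with-reasoning sampling table.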

## Quick reference: which parameters where

```text
┌─────────────────────────────────────────────────────────────────┐
│ Generator (diffusion)                                           │
│  • Structured JSON prompt + optional negative JSON              │
│  • num_inference_steps, guidance_scale, flow_shift, seed        │
│  • num_frames, fps, size/height/width                           │
│  • Upsampling (separate AR pass): README table, seed 3407       │
├─────────────────────────────────────────────────────────────────┤
│ Reasoner (autoregressive)                                       │
│  • Qwen3-VL messages (image_url / video_url + text)             │
│  • temperature, top_p, top_k, repetition/presence penalties     │
│  • Optional redacted_reasoning wrapper + with-reasoning table   │
│  • mm_processor_kwargs for video frame sampling                 │
└─────────────────────────────────────────────────────────────────┘
```

## Related pages

<CardGroup>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, prompt length limits, and output formats that bound sampling choices.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
`Cosmos3OmniPipeline` calls with structured JSON prompts and scheduler `flow_shift`.
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
Serve Reasoner and send Qwen3-VL chat completions with local media paths.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
Request fields for `guidance_scale`, `flow_shift`, `extra_params`, and curl `--form-string` rules.
</Card>
<Card title="Reasoner cookbooks" href="/reasoner-cookbooks">
Runnable captioning, grounding, CoT, and situation-understanding examples with bundled media.
</Card>
<Card title="Audiovisual cookbooks" href="/audiovisual-cookbooks">
End-to-end Generator notebooks and asset layout under `assets/prompts`.
</Card>
</CardGroup>

---

## 20. Audiovisual cookbook recipes

> End-to-end notebooks for text-to-image, text-to-video, image-to-video with optional sound across Diffusers, Cosmos Framework, and vLLM-Omni; asset layout under assets/prompts and assets/images.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/20-audiovisual-cookbook-recipes.md
- Generated: 2026-06-01T20:28:37.389Z

### Source Files

- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/coastal_road_audio.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json`
- `README.md`

---
title: "Audiovisual cookbook recipes"
description: "End-to-end notebooks for text-to-image, text-to-video, image-to-video with optional sound across Diffusers, Cosmos Framework, and vLLM-Omni; asset layout under assets/prompts and assets/images."
---

The Generator audiovisual cookbooks under `cookbooks/cosmos3/generator/audiovisual/` ship three Jupyter walkthroughs plus checked-in structured JSON prompts and conditioning images. Each notebook runs the same eight use-case keys (`ASSET_SETS`) through a different backend: in-process `Cosmos3OmniPipeline` (Diffusers), `torchrun -m cosmos_framework.scripts.inference` (Cosmos Framework), or HTTP calls to a running vLLM-Omni server (`/v1/images/generations` and `/v1/videos/sync`).

## Cookbook location

| Artifact | Path |
| --- | --- |
| Folder README and CLI quickstarts | `cookbooks/cosmos3/generator/audiovisual/README.md` |
| Diffusers notebook | `run_with_diffusers.ipynb` |
| Cosmos Framework notebook | `run_with_cosmos_framework.ipynb` |
| vLLM-Omni notebook | `run_with_vllm_omni.ipynb` |
| Shared environment guide | `cookbooks/cosmos3/README.md` |

Run notebooks from the audiovisual folder (or with that folder as the working directory) so relative `assets/...` paths resolve. The root `README.md` examples table links each notebook to nbviewer renders on `main`.

## Three backends at a glance

| Backend | Notebook | Runtime | Primary APIs |
| --- | --- | --- | --- |
| Diffusers | `run_with_diffusers.ipynb` | Dedicated venv + Jupyter kernel `Cosmos3 Diffusers (Python 3.13)` | `Cosmos3OmniPipeline.from_pretrained`, `export_to_video` / `encode_video` |
| Cosmos Framework | `run_with_cosmos_framework.ipynb` | Framework checkout `packages/cosmos3` (or `packages/cosmos-framework`) + `uv sync` | `torchrun -m cosmos_framework.scripts.inference` |
| vLLM-Omni | `run_with_vllm_omni.ipynb` | External Docker or host server on port 8000 | `POST /v1/images/generations`, `POST /v1/videos/sync` via `curl` |

<Note>
All three notebooks share the same `ASSET_SETS` manifest and default 720p 16:9 sampling block. Nano examples run first; Super covers text-to-image, plus text-to-video and image-to-video **without** audio.
</Note>

```text
assets/prompts/*.json  +  assets/images/*.jpg (I2V)
            │
            ▼
    create_payload(use_case)  →  outputs/notebooks/{backend}/payloads/{use_case}.json
            │
            ├── diffusers/     → Cosmos3OmniPipeline  → .png / .mp4
            ├── pytorch/       → torchrun inference  → framework output tree
            └── vllm_omni/     → curl OpenAI-compat  → .png / .mp4
```

## Asset layout

Checked-in inputs live under `assets/`. Prompts are modality-scoped, and each image-to-video prompt has a matching JPEG under `assets/images/image2video/`.

:::files
cookbooks/cosmos3/generator/audiovisual/
├── README.md
├── run_with_diffusers.ipynb
├── run_with_cosmos_framework.ipynb
├── run_with_vllm_omni.ipynb
├── assets/
│   ├── prompts/
│   │   ├── text2image/
│   │   │   └── robot_draping.json
│   │   ├── text2video/
│   │   │   ├── robot_kitchen.json
│   │   │   ├── robot_pouring_water_audio.json
│   │   │   └── car_colliding.json
│   │   └── image2video/
│   │       ├── car_driving.json
│   │       ├── coastal_road_audio.json
│   │       └── humanoid_robot.json
│   ├── images/image2video/
│   │   ├── car_driving.jpg
│   │   ├── coastal_road_audio.jpg
│   │   └── humanoid_robot.jpg
│   └── negative_prompts/
│       ├── text2video/neg_prompt.json
│       └── image2video/neg_prompt.json
└── outputs/notebooks/          # created at runtime (gitignored in practice)
    ├── diffusers/
    ├── pytorch/
    └── vllm_omni/
:::

### Prompt files wired into notebooks

The notebooks’ `ASSET_SETS` table selects these prompt/image pairs. Extra JSON files under `assets/prompts/`, such as `car_colliding.json` and `humanoid_robot.json`, are available for custom payloads but are not in the default manifest:

| Use-case key | Model | Mode | Prompt | Image (I2V only) | Sound |
| --- | --- | --- | --- | --- | --- |
| `t2i` | Cosmos3-Nano | text2image | `text2image/robot_draping.json` | — | off |
| `t2i_super` | Cosmos3-Super | text2image | `text2image/robot_draping.json` | — | off |
| `t2v_nano_noaudio` | Cosmos3-Nano | text2video | `text2video/robot_kitchen.json` | — | off |
| `t2vs` | Cosmos3-Nano | text2video | `text2video/robot_pouring_water_audio.json` | — | on |
| `i2v_nano_noaudio` | Cosmos3-Nano | image2video | `image2video/car_driving.json` | `car_driving.jpg` | off |
| `i2vs` | Cosmos3-Nano | image2video | `image2video/coastal_road_audio.json` | `coastal_road_audio.jpg` | on |
| `t2v_super_noaudio` | Cosmos3-Super | text2video | `text2video/robot_kitchen.json` | — | off |
| `i2v_super_noaudio` | Cosmos3-Super | image2video | `image2video/car_driving.json` | `car_driving.jpg` | off |

Video modes load a modality-specific negative prompt from `assets/negative_prompts/{text2video|image2video}/neg_prompt.json`. Text-to-image runs pass an empty negative prompt string.
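
The manifest lookup and the negative-prompt rule can be sketched as below. The dictionary layout is a hypothetical shape for two of the table rows; the real `ASSET_SETS` structure in the notebooks may differ.

```python
from pathlib import Path

ASSETS = Path("assets")

# Hypothetical manifest shape covering two rows of the table above.
ASSET_SETS = {
    "t2i": {"model": "Cosmos3-Nano", "mode": "text2image",
            "prompt": "text2image/robot_draping.json", "image": None, "sound": False},
    "i2vs": {"model": "Cosmos3-Nano", "mode": "image2video",
             "prompt": "image2video/coastal_road_audio.json",
             "image": "coastal_road_audio.jpg", "sound": True},
}


def resolve_inputs(use_case):
    """Return (prompt_path, negative_prompt_path_or_None, image_path_or_None)."""
    entry = ASSET_SETS[use_case]
    mode = entry["mode"]
    prompt = ASSETS / "prompts" / entry["prompt"]
    # Video modes have a negative prompt file; text-to-image uses an empty string instead.
    negative = None
    if mode != "text2image":
        negative = ASSETS / "negative_prompts" / mode / "neg_prompt.json"
    image = None
    if entry["image"]:
        image = ASSETS / "images" / "image2video" / entry["image"]
    return prompt, negative, image


prompt, negative, image = resolve_inputs("i2vs")
```

A `None` negative path signals that the caller should pass `""` as `negative_prompt`.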

## Structured JSON prompts

Generator cookbooks expect **structured JSON** scene specifications, not a single free-text line. Prompts are passed to inference as compact JSON strings (`json.dumps(..., separators=(",", ":"))`).

### Common fields

| Field | Role |
| --- | --- |
| `subjects` | Array of scene entities with description, appearance, pose, action, spatial placement |
| `background_setting` | Environment description |
| `lighting`, `aesthetics`, `cinematography` | Look, mood, camera behavior |
| `actions`, `segments` | Time-ranged motion beats |
| `temporal_caption` | Video: single narrative caption across time |
| `comprehensive_t2i_caption` | Image: consolidated caption (e.g. `robot_draping.json`) |
| `audio_description` | Sound design text when generating synchronized audio |
| `resolution` | `{"W": 1280, "H": 720}` in checked-in assets |
| `aspect_ratio` | `"16,9"` (comma-separated, matches cookbook payload convention) |
| `duration`, `fps` | Present on video prompt JSON (e.g. `"7s"`, `24`) |

Sound-enabled runs use prompts that include a rich `audio_description` (engine/tire noise, impacts, braking screech, etc.). The `enable_sound` / `generate_sound` flag on the inference call must still be set to actually emit audio.

### Inference payload shape

Each use case writes a runner payload JSON under `outputs/notebooks/{backend}/payloads/{use_case}/{use_case}.json`:

| Field | Default (all notebooks) | Notes |
| --- | --- | --- |
| `model_mode` | `text2image`, `text2video`, or `image2video` | Drives pipeline branch |
| `prompt` | Compact JSON string from asset file | |
| `negative_prompt` | Compact JSON or `""` | Empty for T2I |
| `enable_sound` | per `ASSET_SETS` | Maps to `enable_sound` (Diffusers/Framework) or `generate_sound` (vLLM-Omni) |
| `num_steps` | `35` | |
| `guidance` | `6.0` | CFG strength |
| `shift` | `10.0` | UniPC `flow_shift` |
| `fps` | `24` | |
| `num_frames` | `189` (video); `1` for Framework T2I | ~7.9 s at 24 fps |
| `resolution` | `"720"` | Maps to 1280×720 (width × height) when `aspect_ratio` is `"16,9"` |
| `aspect_ratio` | `"16,9"` | |
| `seed` | `1234` (Diffusers), `0` (Framework, vLLM-Omni) | |
| `vision_path` | relative path | Image2video only; resolved beside payload file |

Image2video payloads store `vision_path` relative to the payload directory so Framework and vLLM-Omni can resolve local files inside the mounted workspace.
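
Resolving a payload-relative `vision_path` can be sketched with `pathlib`. The helper names and the example paths under `/work` are hypothetical; the only behavior taken from this page is that the path is interpreted relative to the payload file's directory.

```python
import os
from pathlib import Path


def relative_vision_path(payload_file, image_file):
    """Express an image path relative to the directory holding the payload file."""
    return os.path.relpath(image_file, start=Path(payload_file).parent)


def resolve_vision_path(payload_file, vision_path):
    """Resolve a payload-relative vision_path back to an absolute path."""
    candidate = Path(vision_path)
    if candidate.is_absolute():
        return candidate
    return (Path(payload_file).parent / candidate).resolve()


# Hypothetical locations used only to show the round trip.
payload_file = "/work/out/payloads/i2vs.json"
image_file = "/work/assets/images/image2video/coastal_road_audio.jpg"

stored = relative_vision_path(payload_file, image_file)
resolved = resolve_vision_path(payload_file, stored)
```

Storing the relative form keeps the payload valid when the workspace is mounted at a different location inside a container.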

## Default sampling and templates

All backends disable resolution and duration templates in cookbook runs:

- Diffusers: `add_resolution_template=False`, `add_duration_template=False`
- vLLM-Omni `extra_params`: `"use_resolution_template": false`, `"use_duration_template": false`, `"guardrails": true`

Diffusers additionally sets `UniPCMultistepScheduler` with `flow_shift` from the payload (default `10.0`).

## Notebook workflow pattern

Every use-case section follows the same three steps:

<Steps>
<Step title="Create payload">
Run `create_payload("<use_case_key>", backend="diffusers"|"pytorch"|"vllm_omni")` to materialize payload JSON and print paths. Environment variables such as `COSMOS3_DIFFUSERS_T2I_INPUT` / `_OUTPUT` are set for Framework bash cells.
</Step>
<Step title="Run inference">
Execute the backend-specific run cell (`run_diffusers_payload`, `torchrun` bash, or `run_vllm_payload`).
</Step>
<Step title="View results">
Call `view_run(output_dir)` to embed PNG or MP4 in the notebook.
</Step>
</Steps>

Default output root: `cookbooks/cosmos3/generator/audiovisual/outputs/notebooks/` (override with `COSMOS3_AUDIOVISUAL_OUTPUT_ROOT`). Generated media paths:

`outputs/notebooks/{diffusers|pytorch|vllm_omni}/{use_case_key}/{use_case_key}.png` or `.mp4`
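
The path pattern above can be expressed as a small helper. This sketch assumes that `COSMOS3_AUDIOVISUAL_OUTPUT_ROOT` replaces the whole `outputs/notebooks/` root and that text-to-image is the only mode producing a `.png`; `output_path` is an illustrative name.

```python
import os
from pathlib import Path

DEFAULT_ROOT = "cookbooks/cosmos3/generator/audiovisual/outputs/notebooks"


def output_path(backend, use_case, model_mode, env=None):
    """Build {root}/{backend}/{use_case}/{use_case}.png|.mp4."""
    env = os.environ if env is None else env
    # Assumption: the env var, when set, replaces the default root entirely.
    root = Path(env.get("COSMOS3_AUDIOVISUAL_OUTPUT_ROOT", DEFAULT_ROOT))
    suffix = ".png" if model_mode == "text2image" else ".mp4"
    return root / backend / use_case / f"{use_case}{suffix}"


image_out = output_path("diffusers", "t2i", "text2image", env={})
video_out = output_path("vllm_omni", "i2vs", "image2video",
                        env={"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT": "/tmp/cosmos3"})
```

Passing `env` explicitly keeps the helper testable without touching the process environment.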

## Run with Diffusers

**Environment:** `COSMOS3_DIFFUSERS_VENV` (default `<cosmos-root>/.venv-cosmos3-diffusers`), `COSMOS3_TORCH_BACKEND` (`cu130` or `cu128`), Hugging Face auth.

<Steps>
<Step title="Install">
`uv venv` + `uv pip install` diffusers (git), torch, accelerate, `cosmos_guardrail`, ipykernel; register kernel `cosmos3-diffusers`.
</Step>
<Step title="Switch kernel">
Select **Cosmos3 Diffusers (Python 3.13)** and run the restore-environment cell so `sys.executable` matches the venv.
</Step>
<Step title="Generate">
`Cosmos3OmniPipeline.from_pretrained("nvidia/Cosmos3-Nano"|"nvidia/Cosmos3-Super")` with `torch.bfloat16`, `flow_shift` on scheduler, then `run_diffusers_payload`. Video with sound uses `encode_video` when `result.sound` is present.
</Step>
</Steps>

**CLI quickstart** (from folder README): load `assets/prompts/text2video/robot_kitchen.json`, `num_frames=189`, `1280x720`, `enable_sound=False`, `export_to_video` to `/tmp/cosmos3_t2v_diffusers.mp4`.

## Run with Cosmos Framework

**Environment:** clone `cosmos-framework` to `packages/cosmos3`, `uv sync --all-extras --group=cu130-train` (or `cu128-train` via `COSMOS3_UV_GROUP`), `COSMOS3_NUM_GPUS` (default `4`), `CUDA_VISIBLE_DEVICES`.

Inference command shape (each use case):

```bash
# <USE_CASE> is a placeholder for the use case in the variable name.
# Pass "Cosmos3-Super" to --checkpoint-path for the Super checkpoint.
torchrun --nproc-per-node="$COSMOS3_NUM_GPUS" \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i "$COSMOS3_PYTORCH_<USE_CASE>_INPUT" \
  -o "$COSMOS3_PYTORCH_<USE_CASE>_OUTPUT" \
  --checkpoint-path "Cosmos3-Nano" \
  --seed=0
```

**CLI quickstart:** single-GPU text-to-video from `assets/prompts/text2video/robot_kitchen.json` to `/tmp/cosmos3_t2v_framework` with `--checkpoint-path Cosmos3-Nano`.

<Warning>
Headless Linux may need `libxcb1`, `libgl1`, `libglib2.0-0` if imports fail with `libxcb.so.1` errors (documented in notebooks and root README).
</Warning>

## Run with vLLM-Omni

The notebook does **not** start the server; it assumes a Docker (or host) `vllm serve` process is already listening.

| Modality | Endpoint | Response |
| --- | --- | --- |
| Text-to-image | `POST {base}/v1/images/generations` | JSON with `data[0].b64_json` → PNG |
| Text-to-video / image-to-video | `POST {base}/v1/videos/sync` | Raw `video/mp4` body |

<ParamField body="COSMOS3_VLLM_NANO_BASE_URL" type="string">
Base URL for Nano (default `http://localhost:8000`). Normalized to `{url}/v1`.
</ParamField>

<ParamField body="COSMOS3_VLLM_SUPER_BASE_URL" type="string">
Separate base URL when Nano and Super run on different ports.
</ParamField>

Video form fields mirror Diffusers: `prompt`, `negative_prompt`, `size` (`1280x720`), `num_frames`, `fps`, `num_inference_steps`, `guidance_scale`, `flow_shift`, `seed`, `extra_params` JSON. Image-to-video adds `input_reference=@/path/to/image`. Audio: `generate_sound=true` and `sound_duration` derived from `num_frames / fps`.
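
The mapping from notebook payload keys to the form field names listed above can be sketched as a pure function. The value encodings are assumptions (for example, sending `generate_sound` as the string `"true"` and `sound_duration` as a float), and the `input_reference` file upload is left out; `video_form_fields` is an illustrative name.

```python
import json


def video_form_fields(payload, width=1280, height=720):
    """Translate a notebook payload into form fields for POST /v1/videos/sync."""
    fields = {
        "prompt": payload["prompt"],
        "negative_prompt": payload["negative_prompt"],
        "size": f"{width}x{height}",
        "num_frames": payload["num_frames"],
        "fps": payload["fps"],
        "num_inference_steps": payload["num_steps"],
        "guidance_scale": payload["guidance"],
        "flow_shift": payload["shift"],
        "seed": payload["seed"],
        # extra_params is sent as a JSON string.
        "extra_params": json.dumps({
            "use_resolution_template": False,
            "use_duration_template": False,
            "guardrails": True,
        }),
    }
    if payload.get("enable_sound"):
        # Assumed encodings; sound_duration follows num_frames / fps.
        fields["generate_sound"] = "true"
        fields["sound_duration"] = payload["num_frames"] / payload["fps"]
    return fields


payload = {"prompt": "{}", "negative_prompt": "", "enable_sound": True,
           "num_steps": 35, "guidance": 6.0, "shift": 10.0,
           "fps": 24, "num_frames": 189, "seed": 0}
fields = video_form_fields(payload)
```

With the cookbook defaults, `sound_duration` comes out to 7.875 seconds.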

**Server quickstart** (README): `docker run` `vllm/vllm-omni:cosmos3` serving `nvidia/Cosmos3-Nano` with `--allowed-local-media-path /` and workspace mount; Super adds `--tensor-parallel-size` and `--enable-layerwise-offload`.

For image-to-video via HTTP only, post to `/v1/videos/sync` with `files={"input_reference": ...}` as in the README Python snippet.

## Verification signals

| Check | Success signal |
| --- | --- |
| Diffusers GPU cell | `cuda available: True`, expected kernel python path |
| Framework `uv sync` | `$COSMOS3_UV_ENV/bin/python` exists |
| vLLM endpoint cell | Prints `images/generations` and `videos/sync` URLs per model |
| Generation | `wrote .../outputs/notebooks/.../{use_case}.mp4` or `.png` |
| Viewer | Inline video/image in notebook; no `_preview`/`_browser` suffix pollution |

## Prerequisites (all notebooks)

- Linux with an NVIDIA GPU, plus access to the gated Hugging Face models (`uvx hf auth login` or `HF_TOKEN`)
- `uv >= 0.11.3` and CUDA backend tag matching driver (`cu130` / `cu128`)
- Cosmos Framework / vLLM paths: access to `NVIDIA/cosmos-framework` where applicable
- Sufficient disk for weights, uv cache, and outputs

Centralized setup steps: [Cookbook environment setup](/cookbook-environment).

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup for Diffusers, Framework, and vLLM-Omni backends.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
`Cosmos3OmniPipeline` flags, scheduler, and export paths.
</Card>
<Card title="Run Generator with Cosmos Framework" href="/run-generator-cosmos-framework">
`torchrun` inference, parallelism presets, checkpoint paths.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Docker serve, tensor parallel, guardrails, deploy config.
</Card>
<Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
Pipeline modes and call arguments.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
OpenAI-compatible video/image endpoints and form fields.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON schema and template toggles.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, and sound output specs.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to pick Diffusers vs Framework vs vLLM-Omni.
</Card>
</CardGroup>

---

## 21. Action cookbook recipes

> Forward-dynamics and inverse-dynamics notebooks for AV, DROID, and UMI with checked-in trajectories, LeRobot sample data, and Framework vs vLLM-Omni output directories.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/21-action-cookbook-recipes.md
- Generated: 2026-06-01T20:28:28.294Z

### Source Files

- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb`
- `cookbooks/cosmos3/generator/action/assets/images/av_0.jpg`
- `README.md`

---
title: "Action cookbook recipes"
description: "Forward-dynamics and inverse-dynamics notebooks for AV, DROID, and UMI with checked-in trajectories, LeRobot sample data, and Framework vs vLLM-Omni output directories."
---

The `cookbooks/cosmos3/generator/action/` tree ships four Jupyter notebooks that run **Cosmos3-Nano** action Generator workflows on two backends: native **Cosmos Framework** (`cosmos_framework.scripts.inference`) and **vLLM-Omni** (`POST /v1/videos`). Forward dynamics (image + trajectory → video) covers AV, DROID, and UMI; inverse dynamics (AV video → predicted ego-motion) covers AV only. Checked-in assets under `assets/` supply trajectories, images, videos, and a trimmed LeRobot DROID sample. Each notebook writes JSONL specs under an `inputs/` subtree and artifacts under backend-specific output roots.

## Notebook map

| Task | Backend | Notebook | Primary outputs |
| --- | --- | --- | --- |
| Forward dynamics | Cosmos Framework | `run_fd_with_cosmos_framework.ipynb` | Per-run `vision.mp4`; robotics/UMI stitched MP4s |
| Forward dynamics | vLLM-Omni | `run_fd_with_vllm.ipynb` | Same layout under `outputs/cosmos3_action_vllm/` |
| Inverse dynamics | Cosmos Framework | `run_id_with_cosmos_framework.ipynb` | Per-run `sample_outputs.json` (predicted action) |
| Inverse dynamics | vLLM-Omni | `run_id_with_vllm.ipynb` | Same + API debug JSON (`response.json`, `final.json`, `action.json`) |

Environment setup (HF auth, CUDA groups, Docker server, framework clone) is centralized in the [Cookbook environment setup](/cookbook-environment) page; the action folder README links to those sections for each backend.

## Forward vs inverse dynamics

| Mode | `model_mode` / `action_mode` | Inputs | Model output | Cookbook embodiments |
| --- | --- | --- | --- | --- |
| Forward dynamics | `forward_dynamics` | Start image (`vision_path`) + action JSON (`action_path`) | Generated video (`vision.mp4`) | AV (3 trajectories), DROID (autoregressive chunks), UMI (autoregressive chunks) |
| Inverse dynamics | `inverse_dynamics` | Input video only (`vision_path`; no `action_path`) | Predicted action in `sample_outputs.json` | AV only (`av_0.mp4`, `av_1.mp4`) |

Forward dynamics rolls out future observations conditioned on ego or end-effector motion. Inverse dynamics recovers the ego-motion trajectory that explains each input driving clip.

## Embodiment parameters

The action README defines pose semantics and generation timing per embodiment:

| Embodiment | `domain_name` | Action dim | Chunk size in cookbooks | FPS | `image_size` |
| --- | --- | --- | --- | --- | --- |
| Autonomous vehicle | `av` | 9D ego pose | 60 | 10 | 480 |
| DROID (LeRobot) | `droid_lerobot` | 10D (9D pose + 1D gripper) | 16 per chunk, 5 chunks (80 frames) | 15 (from dataset) | 480 |
| UMI | `umi` | 10D | 16 per chunk; checked-in `umi.json` has 32 rows → 2 chunks | 20 | 256 |

Action rows are JSON arrays of floats. AV trajectories use `rot6d` rotation, `backward_framewise` pose convention, and `translation_scale=1.35` when converting absolute camera poses via `cosmos_framework.data.vfm.action.pose_utils.pose_abs_to_rel`.

## Checked-in assets

:::files
cookbooks/cosmos3/generator/action/
├── README.md
├── run_fd_with_cosmos_framework.ipynb
├── run_fd_with_vllm.ipynb
├── run_id_with_cosmos_framework.ipynb
├── run_id_with_vllm.ipynb
└── assets/
    ├── images/
    │   ├── av_0.jpg          # AV FD start frame
    │   ├── av_1.jpg
    │   └── umi.png           # UMI FD start frame
    ├── videos/
    │   ├── av_0.mp4          # AV ID inputs
    │   ├── av_1.mp4
    │   └── umi.mp4
    ├── actions/
    │   ├── av_traj_forward.json
    │   ├── av_traj_left.json
    │   ├── av_traj_right.json
    │   └── umi.json          # 32×10D UMI actions
    └── droid_lerobot_example/   # LeRobot v3.0 sample (1 episode, 3 cameras)
        ├── meta/info.json
        ├── data/chunk-000/file-000.parquet
        └── videos/observation.image.*/
:::

DROID forward-dynamics notebooks load this tree with `DROIDLeRobotDataset` from `cosmos_framework.data.vfm.action.datasets`, extract 16-action chunks, write per-chunk action JSON under `inputs/`, and use the dataset’s `ai_caption` as the generation prompt.
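
The chunking arithmetic for the robotics and UMI embodiments can be sketched independently of the dataset classes. `chunk_actions` is an illustrative helper, and dropping a trailing partial chunk is an assumption rather than documented notebook behavior.

```python
def chunk_actions(rows, chunk_size=16):
    """Split action rows into consecutive fixed-size chunks.

    Assumption: a trailing partial chunk is dropped rather than padded.
    """
    usable = len(rows) - len(rows) % chunk_size
    return [rows[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Stand-in for umi.json: 32 rows of 10 floats each.
umi_rows = [[0.0] * 10 for _ in range(32)]
umi_chunks = chunk_actions(umi_rows)

# Stand-in for the DROID sample: 80 rows of 10 floats each.
droid_rows = [[0.0] * 10 for _ in range(80)]
droid_chunks = chunk_actions(droid_rows)

print(len(umi_chunks), len(droid_chunks))  # 2 5
```

This matches the embodiment table: 32 UMI rows give 2 chunks, and 80 DROID frames give 5 chunks of 16.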

## Output directories

Backends use separate default roots so Framework and vLLM runs do not collide.

| Backend | Default root | Override env var |
| --- | --- | --- |
| Cosmos Framework | `packages/cosmos3/outputs/cookbooks/cosmos3/generator/action/` | `COSMOS3_OUTPUT_ROOT` |
| vLLM-Omni | `<cosmos-repo>/outputs/cosmos3_action_vllm/` | `COSMOS3_VLLM_OUTPUT_ROOT` |

Both layouts place derived specs in `inputs/` and run artifacts in named subfolders:

```text
<output-root>/
├── inputs/
│   ├── action_forward_dynamics_av_custom.jsonl
│   ├── action_forward_dynamics_robotics_custom.jsonl
│   ├── action_forward_dynamics_umi_custom.jsonl
│   ├── action_inverse_dynamics_av_custom.jsonl
│   ├── robotics_droid_action_chunk_XX.json
│   ├── umi_action_chunk_XX_10d.json
│   └── … conditioning PNGs for autoregressive chunks
├── action_forward_dynamics_av_custom/<name>/vision.mp4
├── action_forward_dynamics_robotics_custom/<chunk_name>/vision.mp4
│   └── robotics_action_cond_stitched.mp4
├── action_forward_dynamics_umi_custom/<chunk_name>/vision.mp4
│   └── umi_action_cond_stitched.mp4
└── action_inverse_dynamics_av_custom/<name>/sample_outputs.json
```

Framework notebooks also export `COSMOS3_*_INPUT` / `COSMOS3_*_OUTPUT` env vars for bash inference cells. vLLM notebooks default `COSMOS3_VLLM_BASE_URL` to `http://localhost:8001` (Docker maps container port 8000 → host 8001).

## JSONL input spec

Each inference run is one JSON object per line. Shared fields across action cookbooks:

| Field | Forward dynamics | Inverse dynamics |
| --- | --- | --- |
| `name` | Run identifier (output subfolder) | Same |
| `model_mode` | `forward_dynamics` | `inverse_dynamics` |
| `domain_name` | `av`, `droid_lerobot`, or `umi` | `av` |
| `vision_path` | Absolute path to start image | Absolute path to input video |
| `action_path` | Absolute path to action JSON array | Omitted |
| `action_chunk_size` | 60 (AV), 16 (robotics/UMI) | 60 |
| `fps` | 10 / 15 / 20 | 10 |
| `image_size` | 480 or 256 (UMI) | 480 |
| `view_point` | `ego_view` or dataset viewpoint | `ego_view` |
| `prompt` | Task text (AV planning string; DROID caption; UMI: `"mouse arrangement"`) | AV planning string |
| `seed` | Per-run seed | `0` |

AV forward dynamics reuses one start image (`assets/images/av_0.jpg`) with three trajectories (`av_forward`, `av_left`, `av_right`).
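
As a rough illustration of the spec above, the following sketch writes a three-line AV forward-dynamics JSONL file. The helper name `av_fd_record`, the placeholder prompt, and the temporary output location are assumptions for this example; the field names and AV values come from the table.

```python
import json
import tempfile
from pathlib import Path


def av_fd_record(name, vision_path, action_path, prompt, seed=0):
    """Build one AV forward-dynamics record using the field values from the spec table."""
    return {
        "name": name,
        "model_mode": "forward_dynamics",
        "domain_name": "av",
        "vision_path": str(Path(vision_path).resolve()),  # the spec asks for absolute paths
        "action_path": str(Path(action_path).resolve()),
        "action_chunk_size": 60,
        "fps": 10,
        "image_size": 480,
        "view_point": "ego_view",
        "prompt": prompt,
        "seed": seed,
    }


assets = Path("cookbooks/cosmos3/generator/action/assets")
records = [
    av_fd_record(f"av_{direction}", assets / "images/av_0.jpg",
                 assets / f"actions/av_traj_{direction}.json",
                 prompt="placeholder AV planning string")
    for direction in ("forward", "left", "right")
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "action_forward_dynamics_av_custom.jsonl"
    # One JSON object per line, as the spec requires.
    jsonl_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")
    lines = jsonl_path.read_text().splitlines()

print(len(lines), json.loads(lines[0])["name"])
```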

## Run with Cosmos Framework

<Steps>
<Step title="Install and configure">

Clone or reuse `packages/cosmos3`, run `uv sync --all-extras --group=cu130-train` (or `cu128-train` on CUDA 12.x drivers), authenticate to Hugging Face, and set `COSMOS3_CHECKPOINT_PATH=Cosmos3-Nano` if needed. See [Run Generator with Cosmos Framework](/run-generator-cosmos-framework).

</Step>
<Step title="Open the matching notebook">

- Forward: `run_fd_with_cosmos_framework.ipynb` (AV, DROID autoregressive, UMI autoregressive)
- Inverse: `run_id_with_cosmos_framework.ipynb` (AV videos only)

</Step>
<Step title="Run inference">

AV forward dynamics invokes the framework entrypoint once per JSONL file:

```bash
.venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i "$COSMOS3_AV_FD_INPUT" \
  -o "$COSMOS3_AV_FD_OUTPUT" \
  --checkpoint-path "$COSMOS3_CHECKPOINT_PATH" \
  --video-save-quality 8 \
  --image_size 480 \
  --seed 0 \
  --benchmark
```

DROID and UMI forward dynamics loop over five and two chunks, respectively. Each iteration calls inference with `--no-guardrails`, then uses ffmpeg to extract the last generated frame as the next chunk’s conditioning image. The robotics loop uses `--parallelism-preset=latency` and stitches its chunks into `robotics_action_cond_stitched.mp4` (80 frames); the UMI loop stitches into `umi_action_cond_stitched.mp4` (32 frames from two 16-action chunks).

Inverse dynamics runs the same entrypoint against `action_inverse_dynamics_av_custom.jsonl` and reads predicted actions from `<output>/<name>/sample_outputs.json` at `outputs[0].content["action"]`.
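
A minimal sketch of reading the predicted actions back, assuming only the access path stated above (`outputs[0].content["action"]`). The helper name `load_predicted_actions` and the synthetic file contents are hypothetical; a real `sample_outputs.json` may carry additional keys.

```python
import json
import tempfile
from pathlib import Path


def load_predicted_actions(run_dir):
    """Read <run_dir>/sample_outputs.json and return outputs[0].content["action"]."""
    data = json.loads((Path(run_dir) / "sample_outputs.json").read_text())
    return data["outputs"][0]["content"]["action"]


# Demonstrate against a synthetic file with the same nesting (60 rows of 9D AV actions).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "av_inverse_0"
    run_dir.mkdir()
    fake = {"outputs": [{"content": {"action": [[0.0] * 9 for _ in range(60)]}}]}
    (run_dir / "sample_outputs.json").write_text(json.dumps(fake))
    predicted = load_predicted_actions(run_dir)

print(len(predicted), len(predicted[0]))
```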

</Step>
</Steps>

<Note>
Framework notebooks clone `https://github.com/NVIDIA/cosmos-framework.git` into `packages/cosmos3` when missing and set `GIT_LFS_SKIP_SMUDGE=1` during `uv sync` so LeRobot LFS blobs in upstream test data do not break installs.
</Note>

## Run with vLLM-Omni

<Steps>
<Step title="Start the server">

From the `cosmos` repo root, start the official image (mount HF cache and workspace, allow local media):

```bash
docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true

docker run -d --name cosmos3-vllm-omni-notebook \
  --runtime nvidia --gpus '"device=0"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD:/workspace" \
  -p 8001:8000 --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
    --omni \
    --model-class-name Cosmos3OmniDiffusersPipeline \
    --allowed-local-media-path / \
    --port 8000

curl http://localhost:8001/v1/models
```

</Step>
<Step title="Configure notebook paths">

`run_fd_with_vllm.ipynb` / `run_id_with_vllm.ipynb` resolve `COSMOS_ROOT`, default `COSMOS3_VLLM_OUTPUT_ROOT` to `outputs/cosmos3_action_vllm`, and poll `COSMOS3_VLLM_BASE_URL`.

</Step>
<Step title="Submit requests">

Forward dynamics posts multipart jobs to `POST /v1/videos` with `input_reference` (image) and JSON `extra_params` embedding the action array inline (not only `action_path`):

<ParamField body="action_mode" type="string" required>
`forward_dynamics` or `inverse_dynamics`.
</ParamField>

<ParamField body="domain_name" type="string" required>
Embodiment key: `av`, `droid_lerobot`, or `umi`.
</ParamField>

<ParamField body="action_chunk_size" type="integer" required>
Must match trajectory length (60 for AV, 16 for robotics/UMI chunks).
</ParamField>

<ParamField body="action" type="array" required>
Forward dynamics only: JSON action trajectory loaded from `action_path`.
</ParamField>

<ParamField body="image_size" type="integer">
Canvas short side; robotics omits explicit `size` so vLLM pads to aspect ratio.
</ParamField>

<ParamField body="view_point" type="string">
e.g. `ego_view` or DROID dataset `viewpoint`.
</ParamField>

<ParamField body="raw_action_dim" type="integer">
Inverse dynamics AV runs set `9`.
</ParamField>

<ParamField body="guardrails" type="boolean">
Cookbooks set `false` for reproducible demos; robotics FD explicitly disables guardrails.
</ParamField>

Top-level form fields used in notebooks: `prompt`, `num_frames` (`action_chunk_size + 1`), `fps`, `num_inference_steps` (30), `guidance_scale` (1.0), `flow_shift` (10.0), `seed`, `extra_params` (JSON string).
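
The field list above can be sketched as a small payload builder. This is an illustration under stated assumptions, not the notebook's code: `build_fd_form` is a hypothetical helper, the prompt is a placeholder, and sending numeric values as strings is an assumption about multipart encoding. The image itself would travel as a separate file part named `input_reference`.

```python
import json


def build_fd_form(prompt, actions, domain_name, fps, seed=0, image_size=None, view_point="ego_view"):
    """Assemble the text form fields for a forward-dynamics job (image part not included)."""
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": domain_name,
        "action_chunk_size": len(actions),  # must match the trajectory length
        "action": actions,                  # trajectory embedded inline
        "view_point": view_point,
        "guardrails": False,                # the cookbooks set this to false for demos
    }
    if image_size is not None:
        extra_params["image_size"] = image_size
    return {
        "prompt": prompt,
        "num_frames": str(len(actions) + 1),  # action_chunk_size + 1
        "fps": str(fps),
        "num_inference_steps": "30",
        "guidance_scale": "1.0",
        "flow_shift": "10.0",
        "seed": str(seed),
        "extra_params": json.dumps(extra_params),  # JSON string, per the field list
    }


# Synthetic 60-row, 9D AV trajectory.
form = build_fd_form("placeholder AV planning prompt",
                     [[0.0] * 9 for _ in range(60)], "av", fps=10, image_size=480)
print(form["num_frames"], len(json.loads(form["extra_params"])["action"]))
```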

</Step>
<Step title="Poll and download">

Forward dynamics polls `GET /v1/videos/{id}` until `completed`, then downloads `GET /v1/videos/{id}/content` to `<run_dir>/vision.mp4`. Inverse dynamics uses the same async job flow; predicted actions land in `sample_outputs.json` (and optional `action.json`). DROID/UMI autoregressive loops match the Framework pattern: last frame → next chunk’s PNG via ffmpeg.
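
The poll-then-download flow can be sketched with an injectable status function, which keeps the loop testable without a running server. Everything named here is hypothetical: `wait_for_job` is not a notebook helper, and the `failed` and `cancelled` terminal states are assumptions; only the `completed` state comes from the text above.

```python
import time


def wait_for_job(get_status, job_id, poll_seconds=2.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status(job_id) until it returns "completed", raising on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status == "completed":
            return status
        if status in ("failed", "cancelled"):  # assumed terminal states
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Fake status source standing in for GET /v1/videos/{id}; no network access needed.
fake_statuses = iter(["queued", "in_progress", "completed"])
print(wait_for_job(lambda _job_id: next(fake_statuses), "job-123", sleep=lambda _s: None))
```

Against a live server, `get_status` would wrap an HTTP GET on `/v1/videos/{id}` and return the job's status string; the exact response field that holds it should be checked against the API reference.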

</Step>
</Steps>

```mermaid
sequenceDiagram
    participant NB as action notebook
    participant API as vLLM-Omni :8001
    participant FS as output run_dir

    NB->>API: POST /v1/videos (image + extra_params)
    API-->>NB: job id
    loop until completed
        NB->>API: GET /v1/videos/{id}
        API-->>NB: status / progress
    end
    NB->>API: GET /v1/videos/{id}/content
    API-->>FS: vision.mp4 (FD) or action payload (ID)
    NB->>FS: sample_outputs.json (ID)
```

<Warning>
Inverse dynamics returns an action chunk in the job result; the README notes policy and inverse modes use asynchronous `POST /v1/videos` (not `/v1/videos/sync`). Forward dynamics can use sync in general deployments; these cookbooks use the async poll pattern for all video jobs.
</Warning>

## Autoregressive robotics and UMI

Both forward-dynamics backends share the same chunking model:

```text
Chunk 0: GT image + action JSON  →  vision.mp4
         └─ extract frame N ──► conditioning PNG for chunk 1
Chunk 1..K-1: repeat with updated vision_path
Stitch: concat per-chunk MP4s → *_stitched.mp4
```

| Example | Chunks | Actions per chunk | Total generated frames | Stitched artifact |
| --- | --- | --- | --- | --- |
| DROID LeRobot | 5 | 16 | 80 | `robotics_action_cond_stitched.mp4` |
| UMI | 2 (from 32-row `umi.json`) | 16 | 32 | `umi_action_cond_stitched.mp4` |

DROID chunk 0 saves the ground-truth first frame from `DROIDLeRobotDataset` to `robotics_droid_autoregressive_input_chunk_00.png`. Later chunks use paths `robotics_droid_autoregressive_input_chunk_XX.png` populated from the previous chunk’s generated video.
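
The chunking model can be sketched as a planning step that splits a trajectory into fixed-size chunks and names each chunk's conditioning image. `plan_chunks` is a hypothetical helper, and dropping a trailing partial chunk is an assumption of this sketch; the two-digit file naming follows the DROID paths described above.

```python
def plan_chunks(actions, chunk_size=16, prefix="robotics_droid_autoregressive_input_chunk"):
    """Split actions into full chunks and record where each chunk's conditioning image comes from."""
    plan = []
    # Step through full chunks only; a trailing partial chunk is ignored in this sketch.
    for start in range(0, len(actions) - chunk_size + 1, chunk_size):
        idx = start // chunk_size
        plan.append({
            "chunk": idx,
            "actions": actions[start:start + chunk_size],
            "conditioning_png": f"{prefix}_{idx:02d}.png",
            "source": "ground truth" if idx == 0 else f"last frame of chunk {idx - 1}",
        })
    return plan


# Synthetic 80-row, 10D trajectory, matching the DROID example's five 16-action chunks.
plan = plan_chunks([[0.0] * 10 for _ in range(80)])
for step in plan:
    print(step["chunk"], step["conditioning_png"], step["source"])
```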

## Verification signals

| Check | Expected signal |
| --- | --- |
| vLLM server | `curl http://localhost:8001/v1/models` returns model metadata |
| Framework GPU | Notebook verify cell reports `cuda available: True` |
| AV FD | Three runs under `action_forward_dynamics_av_custom/*/vision.mp4` |
| AV ID | `sample_outputs.json` per `av_inverse_0` / `av_inverse_1` with `outputs[0].content.action` |
| DROID FD | Five chunk folders + optional 80-frame stitched MP4 |
| UMI FD | Two chunk folders + 32-frame stitched MP4 |

Trajectory visualization cells plot AV ego paths (3D frustum + bird’s-eye) from input JSON (FD) or predicted actions (ID).

## Choosing a backend

| Goal | Prefer |
| --- | --- |
| Match training/inference stack, batch JSONL, local checkpoint control | Cosmos Framework notebooks |
| OpenAI-compatible HTTP API, Docker-only deploy, no framework venv | vLLM-Omni notebooks |
| Embed action inline in HTTP without mounting `action_path` on server | vLLM (`extra_params.action` array) |
| Per-chunk subprocess control, `--no-guardrails` CLI flag | Framework (robotics/UMI loops) |

Detailed API fields and curl constraints live on [vLLM-Omni API reference](/vllm-omni-api-reference) and [Run Generator action workflows](/run-generator-action). Action token semantics and embodiment dimensions are documented on [Action modality](/action-modality).

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup for Framework and vLLM-Omni before running action notebooks.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics CLI/API patterns beyond the notebook walkthrough.
</Card>
<Card title="Action modality" href="/action-modality">
Action token semantics, embodiment dimensions, and `domain_name` conditioning.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Server flags, guardrails, and multipart video endpoints used by `run_*_with_vllm.ipynb`.
</Card>
<Card title="Run Generator with Cosmos Framework" href="/run-generator-cosmos-framework">
`cosmos_framework.scripts.inference` parallelism presets and checkpoint paths.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
`POST /v1/videos`, `extra_params`, and `action_mode` values.
</Card>
</CardGroup>

---

## 22. Reasoner cookbook recipes

> Runnable workflows for captioning, temporal localization, embodied/common-sense reasoning, 2D grounding, describe-anything, action CoT, physical plausibility, and situation understanding with bundled media assets.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/22-reasoner-cookbook-recipes.md
- Generated: 2026-06-01T20:28:38.252Z

### Source Files

- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/reasoner/assets/common_sense_reasoning.mp4`
- `cookbooks/cosmos3/reasoner/assets/action_cot_driving_scene.mp4`
- `cookbooks/cosmos3/reasoner/assets/physical_plausibility.mp4`
- `README.md`


The Cosmos 3 Reasoner cookbooks under `cookbooks/cosmos3/reasoner/` ship two runnable notebooks—`run_with_vllm.ipynb` (OpenAI chat completions over image/video, default **Cosmos3-Super** on 4 GPUs) and `run_with_cosmos_framework.ipynb` (JSON inputs through `cosmos_framework.scripts.inference`, **Cosmos3-Nano** on 1 GPU)—plus checked-in MP4/PNG assets in `assets/` and a quickstart in `README.md`. Environment setup for both backends is centralized in the shared [Cookbook environment setup](/cookbook-environment) page.

## Cookbook layout

```text
cookbooks/cosmos3/reasoner/
├── README.md                          # Quickstarts for Framework and vLLM
├── run_with_vllm.ipynb                # Image + video recipes (Super, vLLM)
├── run_with_cosmos_framework.ipynb  # Text + image recipes (Nano, Framework)
└── assets/                            # Bundled media for notebook cells
    ├── video_caption.mp4
    ├── temporal_localization_1.mp4
    ├── temporal_localization_2.mp4
    ├── robotics_next_action.mp4
    ├── drive_scene_next_action.mp4
    ├── assisted_task_next_action.mp4
    ├── common_sense_reasoning.mp4
    ├── action_cot_driving_scene.mp4
    ├── physical_plausibility.mp4
    ├── situation_understanding.mp4
    ├── robot_planning.png
    ├── grounding_2d.png
    ├── describe_anything.png
    └── action_cot_trajectory.png
```

<Note>
Cosmos Framework Reasoner currently treats `vision_path` as a PIL **image** input. Video-heavy recipes (captioning over MP4, temporal localization, embodied video reasoning, physical plausibility, situation understanding) run in `run_with_vllm.ipynb`. Image workflows overlap across both notebooks.
</Note>

## Choose a backend

| Goal | Notebook | Model (default) | API / entrypoint | Media |
| --- | --- | --- | --- | --- |
| Full image + video recipe set, production-style serving | `run_with_vllm.ipynb` | `nvidia/Cosmos3-Super`, TP=4 | `vllm serve` + OpenAI `chat.completions` | Local `file://` URLs under `cookbooks/cosmos3` via `--allowed-local-media-path` |
| Native PyTorch inference, JSON batch inputs, parsed overlays in-notebook | `run_with_cosmos_framework.ipynb` | `Cosmos3-Nano`, 1 GPU | `python -m cosmos_framework.scripts.inference` | `vision_path` (image URL or local path); outputs in `reasoner_text.txt` |
| Fast CLI smoke test | `README.md` | Nano (either backend) | Copy-paste commands without opening a notebook | Remote robot image URL or local assets |

Transformers-based Reasoner inference is listed as coming soon in the cookbooks README; use vLLM or Cosmos Framework today.

## Recipe catalog

Recipes map to the Reasoner use-case table in the repository README. The Asset(s) column lists files under `cookbooks/cosmos3/reasoner/assets/` unless noted.

| Workflow | Modality | Asset(s) | vLLM notebook | Framework notebook |
| --- | --- | --- | --- | --- |
| Image caption (detail) | Image | Remote `robot_153.jpg` (vLLM); same URL in Framework JSON | Image Caption | `image_caption_detail.json` |
| Video caption | Video | `video_caption.mp4` | Video Caption | — |
| Temporal localization — action segments | Video | `temporal_localization_1.mp4` | Temporal Localization | — |
| Temporal localization — event timeline | Video | `temporal_localization_2.mp4` | Event Timeline | — |
| Temporal localization — timestamp query | Video | `temporal_localization_2.mp4` (reused) | Timestamp Query | — |
| Temporal localization — interval QA | Video | `temporal_localization_2.mp4` (reused) | Interval Question | — |
| Embodied — robotics next action | Video | `robotics_next_action.mp4` | Robotics Next Action | — |
| Embodied — drive scene next action | Video | `drive_scene_next_action.mp4` | Drive Scene Next Action | — |
| Embodied — robot planning | Image | `robot_planning.png` | Robot Planning | `robot_planning.json` |
| Embodied — assisted task next action | Video | `assisted_task_next_action.mp4` | Assisted Task Next Action | — |
| Common-sense reasoning | Video | `common_sense_reasoning.mp4` | Common Sense Reasoning | — |
| 2D grounding | Image | `grounding_2d.png` | 2D Grounding (+ box overlay) | `ground_load_bbox.json` |
| Describe anything | Image | `describe_anything.png` | Describe Anything | `describe_marked_subjects.json` |
| Action CoT — gripper trajectory | Image | `action_cot_trajectory.png` | Trajectory Coordinates (+ point overlay) | `trajectory_bowl.json` |
| Action CoT — flower task trajectory | Image | `robot_planning.png` | Trajectory Coordinates (second cell) | `trajectory_flower.json` |
| Action CoT — driving scene | Video | `action_cot_driving_scene.mp4` | Driving Scene | — |
| Physical plausibility | Video | `physical_plausibility.mp4` | Physical Plausibility Analysis | — |
| Situation understanding | Video | `situation_understanding.mp4` | Situation Understanding | — |
| Text-only smoke test | Text | — | — | `nano_text.json` |
| Image one-liner smoke test | Image | Remote `robot_153.jpg` | — | `nano_image.json` |

## Shared request conventions

Reasoner calls follow **Qwen3-VL-compatible** chat messages: vision items use `image_url` or `video_url`; prompts are sibling `text` parts in the user message.

### Sampling defaults

| Parameter | Without explicit reasoning | With `redacted_reasoning` / `<think>` prompts |
| --- | ---: | ---: |
| `temperature` | `0.7` | `0.6` |
| `top_p` | `0.8` | `0.95` |
| `top_k` | `20` | `20` |
| `repetition_penalty` | `1.0` | `1.0` |
| `presence_penalty` | `1.5` | `0.0` |

Notebook cells commonly set `max_tokens=4096` (vLLM) or `max_new_tokens: 4096` (Framework JSON). Action CoT and driving-scene cells in the vLLM notebook pass the “with reasoning” sampler values explicitly.
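
One way to keep the two sampler columns straight is a small preset helper. This is a sketch, not notebook code: `SAMPLERS` and `chat_kwargs` are hypothetical names. It assumes `top_k` and `repetition_penalty` are passed through `extra_body`, since they are not standard OpenAI request fields and vLLM's OpenAI-compatible server accepts them as extra sampling parameters.

```python
# Values copied from the sampling defaults table.
SAMPLERS = {
    "default": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                "repetition_penalty": 1.0, "presence_penalty": 1.5},
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                  "repetition_penalty": 1.0, "presence_penalty": 0.0},
}


def chat_kwargs(reasoning, max_tokens=4096):
    """Return keyword arguments for an OpenAI-style chat completion call."""
    s = SAMPLERS["reasoning" if reasoning else "default"]
    return {
        "temperature": s["temperature"],
        "top_p": s["top_p"],
        "presence_penalty": s["presence_penalty"],
        "max_tokens": max_tokens,
        # Non-standard sampling fields go through extra_body.
        "extra_body": {"top_k": s["top_k"], "repetition_penalty": s["repetition_penalty"]},
    }


print(chat_kwargs(reasoning=True))
```

Video cells that also pass `mm_processor_kwargs` would need to merge it into the same `extra_body` dict rather than overwrite it.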

### Explicit reasoning format

Append this instruction when you want chain-of-thought before the final answer:

```text
Answer the question using the following format:

<think>
Your reasoning.
</think>

Write your final answer immediately after the </think> tag.
```

Common-sense, assisted-task, robotics next-action, trajectory, and driving-scene cells use this pattern.
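
A minimal sketch of working with this format: one helper appends the instruction to a prompt and another separates the reasoning from the final answer in a reply. `with_reasoning` and `split_reasoning` are hypothetical names, and joining prompt and instruction with a blank line is an assumption of this example.

```python
REASONING_SUFFIX = (
    "Answer the question using the following format:\n\n"
    "<think>\nYour reasoning.\n</think>\n\n"
    "Write your final answer immediately after the </think> tag."
)


def with_reasoning(prompt):
    """Append the reasoning-format instruction to a task prompt."""
    return f"{prompt}\n\n{REASONING_SUFFIX}"


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty when no closing think tag is present."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


reply = "<think>\nThe arm is holding the cup.\n</think>\nPlace the cup on the shelf."
reasoning, answer = split_reasoning(reply)
print(answer)
```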

### Video processor kwargs (vLLM)

Video examples pass frame sampling through `extra_body`:

```python
extra_body={"mm_processor_kwargs": {"fps": 4, "do_sample_frames": True}}
```

The vLLM server is started with `--media-io-kwargs '{"video": {"num_frames": -1}}'` so the encoder can consume full frame streams when needed.

## Run with vLLM (`run_with_vllm.ipynb`)

<Steps>
<Step title="Install vLLM and cosmos3 plugins">

Clone or reuse `packages/cosmos3` from [NVIDIA/cosmos-framework](https://github.com/NVIDIA/cosmos-framework.git), create `.venv`, and install the CUDA-matched pair (`cu130` → `vllm==0.21.0`, `cu128` → `vllm==0.19.1`) plus `transformers-cosmos3` and `vllm-cosmos3` from the framework checkout. See [Cookbook environment setup](/cookbook-environment) § vLLM.

</Step>
<Step title="Launch Cosmos3-Super">

Default notebook server (4× GPU):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve nvidia/Cosmos3-Super \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --async-scheduling \
  --allowed-local-media-path "$(dirname "$(pwd)")" \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8001
```

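The `$(dirname "$(pwd)")` expression assumes you launch the server from `cookbooks/cosmos3/reasoner/`, so `--allowed-local-media-path` resolves to `cookbooks/cosmos3` (the parent of `reasoner/`) and `file://` URLs for bundled assets resolve. For **Nano** on one GPU, use the README quickstart on port `8000` with `--tensor-parallel-size 1`.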

</Step>
<Step title="Query with OpenAI client">

```python
import openai
from pathlib import Path

client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")
MODEL = client.models.list().data[0].id

video_path = "cookbooks/cosmos3/reasoner/assets/video_caption.mp4"
video_url = Path(video_path).resolve().as_uri()

response = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": "Describe the video in detail."},
        ],
    }],
    max_tokens=4096,
    extra_body={"mm_processor_kwargs": {"fps": 4, "do_sample_frames": True}},
)
print(response.choices[0].message.content)
```

Helper functions in the notebook: `asset_path(name)` → `assets/<name>`, `asset_url(name)` → `file://` URI for requests.

</Step>
</Steps>

<Warning>
When switching checkpoints, change only the **server launch cell** and the **client `base_url` port**; request cells resolve `MODEL` dynamically via `client.models.list()`.
</Warning>

### Structured outputs in-notebook

| Recipe | Parse helper | Coordinate space |
| --- | --- | --- |
| 2D grounding | `parse_boxes()` — strips markdown fences, loads JSON array/object | Boxes normalized **0–1000**, scaled to image pixels for PIL overlay |
| Action CoT trajectories | `parse_points()` — extracts `point_2d` entries | Pixel coordinates drawn on image preview |

Grounding prompt (shared with Framework): `Locate the accurate bounding box of the load as a whole. Return a json.`

Describe-anything prompt expects JSON with keys `subject_id`, `category`, and `caption`.
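
The grounding post-processing described in the table can be approximated as follows. This is not the notebook's `parse_boxes()`; it is a hypothetical re-sketch (`parse_boxes_sketch`, `to_pixels`) that assumes each item stores its box under a `bbox_2d` key in `(x1, y1, x2, y2)` order, which should be checked against actual model output.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way so it can sit inside this example


def parse_boxes_sketch(text):
    """Strip an optional markdown fence, then load a JSON array (or wrap a single object)."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        cleaned = cleaned[len(FENCE):]
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    data = json.loads(cleaned)
    return data if isinstance(data, list) else [data]


def to_pixels(box, width, height):
    """Scale an (x1, y1, x2, y2) box from the 0-1000 grid to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 * width / 1000), round(y1 * height / 1000),
            round(x2 * width / 1000), round(y2 * height / 1000))


# Synthetic reply wrapped in a fence, as a model might return it.
reply = FENCE + 'json\n[{"bbox_2d": [100, 200, 500, 800], "label": "load"}]\n' + FENCE
for item in parse_boxes_sketch(reply):
    print(item["label"], to_pixels(item["bbox_2d"], width=2000, height=1000))
```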

## Run with Cosmos Framework (`run_with_cosmos_framework.ipynb`)

<Steps>
<Step title="Install framework">

From repo root, clone into `packages/cosmos3` and sync training extras (required by the inference import path):

```bash
export GIT_LFS_SKIP_SMUDGE=1
cd packages/cosmos3
uv sync --all-extras --group=cu130-train   # or cu128-train on CUDA 12.x
```

Override with `export COSMOS3_UV_GROUP=cu128-train` before running the notebook on CUDA 12.x drivers.

</Step>
<Step title="Write Reasoner JSON inputs">

Every input requires `"model_mode": "reasoner"` and **`"enable_sound": false`** to avoid strict validation failures on the current Reasoner path.

```json
{
  "model_mode": "reasoner",
  "name": "nano_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg",
  "enable_sound": false
}
```

Capability inputs are written under `packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/inputs/capabilities/` (paths configurable via `COSMOS3_OUTPUT_ROOT` / `COSMOS3_INPUT_DIR`).
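
For illustration, the input file above could be generated programmatically. The helper `reasoner_input` and the temporary directory are assumptions of this sketch; the required `model_mode` and `enable_sound` values and the example prompt come from the JSON shown above.

```python
import json
import tempfile
from pathlib import Path


def reasoner_input(name, prompt, vision_path=None, **extra):
    """Build a Reasoner input dict with the two fields every input requires."""
    record = {"model_mode": "reasoner", "name": name, "prompt": prompt, "enable_sound": False}
    if vision_path is not None:
        record["vision_path"] = vision_path  # image URL or local path
    record.update(extra)  # e.g. sampling overrides
    return record


record = reasoner_input(
    "nano_image",
    "Describe what is happening in this image in one sentence.",
    vision_path="https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_153.jpg",
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / f"{record['name']}.json"
    path.write_text(json.dumps(record, indent=2))
    written = json.loads(path.read_text())

print(written["model_mode"], written["enable_sound"])
```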

</Step>
<Step title="Run inference">

```bash
cd packages/cosmos3
COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=0 \
MASTER_ADDR=127.0.0.1 MASTER_PORT=29501 RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \
.venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/reasoner/nano/inputs/nano_image.json \
  -o outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_nano_image \
  --checkpoint-path Cosmos3-Nano \
  --seed=0 \
  --benchmark
```

Text output path:

```text
<output_dir>/<name>/reasoner_text.txt
```

The notebook’s display helper renders prompt, media, model text, and parsed boxes or trajectory points side by side.

</Step>
</Steps>

Action CoT JSON inputs set `"do_sample": true` with `temperature: 0.6`, `top_p: 0.95`, `top_k: 20`, `repetition_penalty: 1.0`, `presence_penalty: 0.0` to match the “with reasoning” sampler table.

<Info>
The Reasoner README notes scaling **Nano → Super** with `.venv/bin/torchrun` for Framework; the shipped framework notebook cells run **Nano on a single GPU** (`WORLD_SIZE=1`). For Super-scale Framework runs, see [Run Reasoner with Cosmos Framework](/run-reasoner-cosmos-framework).
</Info>

## Prompt patterns by workflow

### Captioning

| Target | Prompt (representative) |
| --- | --- |
| Image | `Caption the image in detail.` |
| Video | `Describe the video in detail.` |

### Temporal localization

| Variant | Prompt highlights | Time format |
| --- | --- | --- |
| Action segments | List action segments; JSON with `start`, `end`, `caption` | `seconds` |
| Event timeline | Notable events in JSON | `mm:ss.ff` |
| Timestamp query | Natural-language event description → interval | `mm:ss.ff`, keys `start`, `end` |
| Interval QA | `What happened between 00:05.64 and 00:17.49?` | — |
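
Timestamps in the `mm:ss.ff` format are easy to convert to seconds when comparing model output against ground truth. The helper name `mmss_to_seconds` is hypothetical, and the sketch assumes the fractional part is a decimal fraction of a second, as in the interval QA prompt above.

```python
def mmss_to_seconds(stamp):
    """Convert an mm:ss.ff timestamp such as "00:05.64" to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


start = mmss_to_seconds("00:05.64")
end = mmss_to_seconds("00:17.49")
print(f"interval length: {end - start:.2f} s")
```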

### Embodied reasoning

| Scenario | Prompt (abbrev.) |
| --- | --- |
| Robotics next action | `What can be the next immediate action?` + reasoning format |
| Drive scene | Autonomous-vehicle planner: observe critical objects, next action and trajectory |
| Robot planning | `The task is to put flower into the red bottle. Generate a plan consisting of subtasks...` |
| Assisted task | Overall printer-cartridge task + current step `place old ink_cartridge` → next action with reasoning format |

### Common sense, physical plausibility, situation

| Workflow | Prompt (representative) |
| --- | --- |
| Common sense | `Can the countertop support the weight of the juicers?` + reasoning format |
| Physical plausibility | Judge object permanence / trajectories; answer `(A) Possible` or `(B) Impossible`; ignore simulation quality and experimental “rising wall” |
| Situation understanding | `What is the person doing with the skillet? What will the person likely do next in this situation?` |

### Action CoT

| Variant | Task string | Output shape |
| --- | --- | --- |
| Bowl trajectory | `Move the pink bowl to the right` | JSON `{"point_2d": [x, y], "label": "gripper trajectory"}` + reasoning format |
| Flower trajectory | `Put flower into the red bottle` | Same JSON schema on `robot_planning.png` |
| Driving scene | Step-by-step critical objects for safe navigation | Reasoning format over `action_cot_driving_scene.mp4` |

## Verification signals

| Check | vLLM | Cosmos Framework |
| --- | --- | --- |
| Server ready | `curl -fsS http://127.0.0.1:8001/health` (notebook default port) | Inference exits 0; `reasoner_text.txt` exists |
| Model loaded | `client.models.list()` returns Cosmos checkpoint id | Log shows checkpoint download / load for `Cosmos3-Nano` |
| Media found | `asset_path(...)` assert passes at notebook start | `vision_path` resolves (local asset or HTTPS URL) |
| Structured parse | Red boxes / trajectory dots render in notebook | Display cell shows overlay when JSON parses |

<Check>
Successful first run: the image caption cell prints a multi-sentence description, and the Framework smoke test writes a non-empty `reasoner_text.txt` under the configured output root.
</Check>

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup, HF auth, CUDA `cu130`/`cu128` pairs, and vLLM plugin install for Reasoner notebooks.
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
Production `vllm serve` flags, Qwen3-VL message shape, and reasoning-format suffix reference.
</Card>
<Card title="Run Reasoner with Cosmos Framework" href="/run-reasoner-cosmos-framework">
JSON input schema, `enable_sound`, latency preset, and Super multi-GPU torchrun patterns.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Reasoner sampler tables and structured JSON / chat message conventions.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
When to use Reasoner (text out) vs Generator surfaces.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Pick vLLM vs Cosmos Framework vs Diffusers by deployment goal.
</Card>
</CardGroup>

---

## 23. Inference benchmarks

> Published latency tables for Cosmos3-Nano/Super Generator (PyTorch, vLLM-Omni, Diffusers by GPU/resolution/TP) and Reasoner vLLM serving metrics (TTFT, throughput at concurrency tiers).

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/23-inference-benchmarks.md
- Generated: 2026-06-01T20:29:47.636Z

### Source Files

- `inference_benchmarks.md`
- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb`


Published latency and serving numbers for Cosmos3 live in the repository root file `inference_benchmarks.md`. Generator sections report diffusion-path latency in **seconds** for text-to-image (t2i), text-to-video (t2v), and image-to-video (i2v) across **PyTorch**, **vLLM-Omni**, and **Diffusers**. The Reasoner section reports **vLLM** autoregressive serving metrics—time to first token (TTFT), end-to-end request latency, and throughput—at client concurrency tiers 1, 64, 128, and 256. Results are filled in incrementally; **empty cells mean that GPU, engine, resolution, or tensor-parallel combination has not been measured yet**, not that it is unsupported.

## Benchmark inventory

| Section | Surface | Checkpoint | Metrics |
| --- | --- | --- | --- |
| Cosmos3-Nano generator | Generator (diffusion) | `nvidia/Cosmos3-Nano` (16B) | Latency (s) by GPU, engine, resolution tier, TP width |
| Cosmos3-Super generator | Generator (diffusion) | `nvidia/Cosmos3-Super` (64B) | Same modalities and engines at larger scale |
| Cosmos3-Nano reasoner | Reasoner (VLM) | `nvidia/Cosmos3-Nano` | TTFT, request latency (ms), req/s and tok/s at concurrency 1/64/128/256 |

<Info>
Canonical tables with all measured cells are maintained in [`inference_benchmarks.md`](https://github.com/nvidia/cosmos/blob/main/inference_benchmarks.md) on the main branch. This page documents methodology, column semantics, and the published numbers as of the current checkout.
</Info>

## How to read generator tables

### Column semantics

Generator table headers use **`{resolution}p/{gpu_count}`**:

| Column pattern | Meaning |
| --- | --- |
| `256p/1`, `480p/1`, `720p/1` | Single-GPU run at 320×192, 832×480, or 1280×720 |
| `256p/4`, `720p/8`, etc. | Tensor parallelism across 4 or 8 GPUs |
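
When post-processing the tables, a column header can be decoded mechanically. The sketch below uses hypothetical names (`RESOLUTIONS`, `parse_column`); the pixel sizes are the ones listed in the table above.

```python
# Resolution tiers and their pixel sizes, as listed in the column semantics table.
RESOLUTIONS = {256: (320, 192), 480: (832, 480), 720: (1280, 720)}


def parse_column(header):
    """Decode a "{resolution}p/{gpu_count}" header into tier, pixel size, and GPU count."""
    tier, gpus = header.split("/")
    return {"tier": tier, "size": RESOLUTIONS[int(tier.rstrip("p"))], "gpus": int(gpus)}


print(parse_column("720p/8"))
```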

### Engines compared

| Engine | What the number measures | Resolution coverage in tables |
| --- | --- | --- |
| **PyTorch** | Average generation (sampling) time from OSS reference inference; CUDA Graphs enabled where supported | All resolution/TP columns where measured |
| **vLLM-Omni** | Total pipeline time | Primarily **720p** on supported GPUs |
| **Diffusers** | End-to-end time via Hugging Face `Cosmos3OmniPipeline` without custom CUDA graphs | **256p/1**, **480p/1**, **720p/1** (single-GPU only) |

### Shared workload settings

All generator runs use **BF16**, **batch size 1**, and **identical prompts, seeds, and sampler settings** across engines where noted. Video workloads follow the standard Cosmos3 profile: **189 frames at 24 FPS** (about 7.9 seconds), matching cookbook defaults such as `num_frames=189` in audiovisual Generator notebooks.

<Warning>
vLLM-Omni figures are tied to the upcoming public vLLM-Omni release and may change before GA. Values marked with **(*)** on H100 NVL are pre-release measurements.
</Warning>

## Cosmos3-Nano generator

### Text-to-video (t2v)

Latency in **seconds**.

| GPU | Engine | 256p/1 | 256p/4 | 256p/8 | 480p/1 | 480p/4 | 480p/8 | 720p/1 | 720p/4 | 720p/8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **RTX PRO 6000 Blackwell** | PyTorch | | | | | | | 786.37 | 225.45 | |
| | vLLM-Omni | | | | | | | 369.67 | 114.30 | 68.66 |
| | Diffusers | 11.20 | | | 112.00 | | | 392.00 | | |
| **H20** | PyTorch | | | | | | | 931.39 | 268.88 | 157.71 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 30.20 | | | 258.00 | | | 926.00 | | |
| **H100 NVL** | PyTorch | | | 3.95 | 84.12 | | | 297.27 | 94.15 | 61.63 |
| | vLLM-Omni | | | | | | | 311.13 | 88.25(*) | 54.01(*) |
| | Diffusers | 11.00 | | | 90.00 | | | 324.20 | | |
| **H200 NVL** | PyTorch | | | | | | | 244.39 | 77.35 | 45.70 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 74.00 | | | 276.20 | | |
| **H100 80GB HBM3** | PyTorch | 7.61 | | | 59.83 | | | 207.78 | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 68.00 | | | 240.00 | | |
| **H200 141GB HBM3** | PyTorch | | 3.34 | 3.19 | | | 13.97 | 214.28 | 67.48 | 41.26 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 67.00 | | | 239.60 | | |
| **B200** | PyTorch | 4.56 | 2.78 | 2.79 | | | | 114.85 | 39.75 | 26.27 |
| | vLLM-Omni | | | | | | | 107.84 | 35.29 | 22.87 |
| | Diffusers | 7.00 | | | 36.80 | | | 117.00 | | |
| **B300** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 39.40 | | | 63.40 | | | 139.40 | | |

On **B200**, vLLM-Omni at **720p/8** (22.87 s) is roughly **5× faster** than single-GPU Diffusers at **720p/1** (117.00 s) for t2v. Most of that gap comes from tensor parallelism rather than the engine: vLLM-Omni at 720p/1 on the same GPU takes 107.84 s. Moving from **720p/1** to **720p/4** cuts Nano t2v latency by roughly 2.9× on the same GPU (for example, B200 PyTorch: 114.85 s → 39.75 s).
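
To compare tensor-parallel configurations on equal terms, it can help to convert latencies into speedup and parallel efficiency (speedup divided by GPU count). The sketch below does this for the B200 PyTorch row of the table above; the helper name `tp_scaling` is illustrative.

```python
def tp_scaling(single_gpu_s: float, multi_gpu_s: float, num_gpus: int) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) for a tensor-parallel run."""
    speedup = single_gpu_s / multi_gpu_s
    return speedup, speedup / num_gpus


# B200, PyTorch, Nano t2v at 720p (seconds, keyed by GPU count, from the table above).
latency_s = {1: 114.85, 4: 39.75, 8: 26.27}

for gpus in (4, 8):
    speedup, efficiency = tp_scaling(latency_s[1], latency_s[gpus], gpus)
    print(f"TP={gpus}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
# TP=4: 2.89x speedup, 72% efficiency
# TP=8: 4.37x speedup, 55% efficiency
```

Efficiency below 100% is expected: each added GPU contributes communication overhead, which is why the latency drop from 4 to 8 GPUs is smaller than from 1 to 4.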

### Image-to-video (i2v)

Latency in **seconds**.

| GPU | Engine | 256p/1 | 256p/4 | 256p/8 | 480p/1 | 480p/4 | 480p/8 | 720p/1 | 720p/4 | 720p/8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **RTX PRO 6000 Blackwell** | PyTorch | | | | | | | 788.80 | 226.25 | 127.79 |
| | vLLM-Omni | | | | | | | 375.01 | 119.27 | 73.57 |
| | Diffusers | 12.00 | | | 112.00 | | | 397.00 | | |
| **H20** | PyTorch | | | | | | | | | 158.10 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 31.00 | | | 258.00 | | | 925.00 | | |
| **H100 NVL** | PyTorch | | | 3.99 | 84.50 | 28.69 | | 298.57 | 95.76 | 60.58 |
| | vLLM-Omni | | | | | | | 286.33 | 92.23(*) | 58.02(*) |
| | Diffusers | 11.00 | | | 91.00 | | | 325.20 | | |
| **H200 NVL** | PyTorch | | | | | | | 246.62 | 77.69 | 45.99 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 74.00 | | | 275.20 | | |
| **H100 80GB HBM3** | PyTorch | 7.64 | | | | | | 207.87 | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 68.00 | | | 239.80 | | |
| **H200 141GB HBM3** | PyTorch | | 3.37 | 3.17 | | | 14.07 | 214.80 | 67.14 | 41.00 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 9.00 | | | 67.20 | | | 240.00 | | |
| **B200** | PyTorch | 4.60 | 2.77 | 2.81 | | | 9.66 | 113.90 | 40.01 | 26.58 |
| | vLLM-Omni | | | | | | | 110.19 | 37.76 | 25.68 |
| | Diffusers | | | | | | | 116.00 | | |
| **B300** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 28.60 | | | 65.60 | | | 139.60 | | |

### Text-to-image (t2i)

Latency in **seconds**. t2i workloads are shortest; they are useful for sanity-checking a GPU stack before running full 189-frame video jobs.

| GPU | Engine | 256p/1 | 256p/4 | 256p/8 | 480p/1 | 480p/4 | 480p/8 | 720p/1 | 720p/4 | 720p/8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **RTX PRO 6000 Blackwell** | PyTorch | | | | | | | 7.12 | 3.18 | 2.70 |
| | vLLM-Omni | | | | | | | 4.99 | 2.32 | 1.96 |
| | Diffusers | 2.00 | | | 4.00 | | | 5.00 | | |
| **H20** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 3.00 | | | 6.00 | | | 10.00 | | |
| **H100 NVL** | PyTorch | | 2.45 | | | | | 4.21 | 2.57 | 2.64 |
| | vLLM-Omni | | | | | | | 3.44 | 1.83 | 1.90 |
| | Diffusers | 3.00 | | | 3.00 | | | 4.00 | | |
| **H200 NVL** | PyTorch | | | | | | | 3.58 | 2.62 | 2.64 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 3.00 | | | 3.00 | | | 4.00 | | |
| **H100 80GB HBM3** | PyTorch | 3.01 | | | | | | 3.45 | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 3.00 | | | 3.00 | | | 4.00 | | |
| **H200 141GB HBM3** | PyTorch | | 2.59 | 2.70 | | 2.78 | | 3.28 | 2.84 | 2.77 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 3.00 | | | 3.00 | | | 4.00 | | |
| **B200** | PyTorch | | | 2.59 | 2.75 | | 2.56 | 2.87 | 2.58 | 2.62 |
| | vLLM-Omni | | | | | | | 1.77 | 2.20 | 3.41 |
| | Diffusers | | | | | | | 3.00 | | |
| **B300** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 36.20 | | | | | | 41.00 | | |

<Tip>
At **256p**, multi-GPU tensor parallelism on **B300** can underperform single-GPU due to small-workload TP overhead; single-GPU is recommended at that resolution tier.
</Tip>

## Cosmos3-Super generator

Super benchmarks use the same modalities, engines, and column layout as Nano. Expect **longer runtimes** than Nano at the same resolution because of the **64B** checkpoint. Early Super coverage is narrower: **vLLM-Omni** and **Diffusers** runs exist primarily on **B200** and select **H200** configurations.

### Text-to-video (t2v)

Latency in **seconds**.

| GPU | Engine | 256p/1 | 256p/4 | 256p/8 | 480p/1 | 480p/4 | 480p/8 | 720p/1 | 720p/4 | 720p/8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **RTX PRO 6000 Blackwell** | PyTorch | | | | | | | | | 427.16 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | | | | | | | | | |
| **H20** | PyTorch | | | | | | | | | 492.41 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | | | | | | | | | |
| **H100 NVL** | PyTorch | | | | | 101.27 | | | 330.04 | 186.19 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | | | | | | | | | |
| **H200 NVL** | PyTorch | | | | | | | | 258.34 | 139.37 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 33.00 | | | 286.80 | | | 1036.00 | | |
| **H100 80GB HBM3** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | | | | | | | | | |
| **H200 141GB HBM3** | PyTorch | | | | | 70.27 | 41.78 | | 224.43 | 123.49 |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 31.00 | | | 251.60 | | | 886.20 | | |
| **B200** | PyTorch | | 5.59 | | 114.38 | 35.73 | 21.39 | 407.50 | 118.38 | 65.93 |
| | vLLM-Omni | | | | | | | 390.28 | 113.31 | 62.11 |
| | Diffusers | | | | 127.20 | | | 414.40 | | |
| **B300** | PyTorch | | | | | | | | | |
| | vLLM-Omni | | | | | | | | | |
| | Diffusers | 54.20 | | | 155.40 | | | 424.80 | | |

### Image-to-video (i2v) and text-to-image (t2i) highlights

| Modality | GPU | Engine | Notable measurement |
| --- | --- | --- | --- |
| i2v 720p/8 | B200 | PyTorch | 65.91 s |
| i2v 720p/8 | B200 | vLLM-Omni | 64.82 s |
| i2v 720p/1 | H200 NVL | Diffusers | 1034.60 s |
| t2i 720p/1 | B200 | vLLM-Omni | 6.02 s |
| t2i 720p/4 | B200 | PyTorch | 4.28 s |

Full Super **i2v** and **t2i** grids are in `inference_benchmarks.md` under [Cosmos3-Super Generator](https://github.com/nvidia/cosmos/blob/main/inference_benchmarks.md#cosmos3-super-generator).

## Cosmos3-Nano reasoner (vLLM)

Reasoner benchmarks measure **autoregressive text generation** from vision and text inputs—not diffusion sampling. All runs use checkpoint **`nvidia/Cosmos3-Nano`** served through **vLLM**, with metrics collected by the **AIPerf** client.

### Workload matrix

Each GPU section in `inference_benchmarks.md` contains **four** workload tables:

| Workload label | Input tokens | Output tokens | Video FPS | Typical use |
| --- | ---: | ---: | ---: | --- |
| Input 50 / Output 1 / Video 1 FPS | 50 | 1 | 1 | Minimal single-token response |
| Input 50 / Output 1 / Video 2 FPS | 50 | 1 | 2 | Higher video sampling rate |
| Input 50 / Output 100 / Video 1 FPS | 50 | 100 | 1 | Short caption / VQA-style answer |
| Input 50 / Output 100 / Video 2 FPS | 50 | 100 | 2 | Short caption / VQA-style answer with denser video input |

### Metrics

| Metric | Unit | Interpretation |
| --- | --- | --- |
| **Time To First Token (TTFT)** | ms | Latency until the first output token is emitted; lower is better |
| **Request Latency** | ms | End-to-end time per request; for Output 1 workloads, equals TTFT |
| **Request Throughput** | Req/s | Completed requests per second; higher is better |
| **Output Token Throughput** | Tok/s | Generated tokens per second; for Output 1, matches request throughput |
| **Request Count** | requests | Total requests in the benchmark window |

<ParamField body="concurrency" type="integer">
Number of simultaneous client-side requests issued by AIPerf (1, 64, 128, or 256). This is **not** the vLLM tensor-parallel GPU count.
</ParamField>

### Cross-GPU comparison (Input 50 / Output 100 / Video 1 FPS)

Representative serving numbers at **concurrency 64**—a mid-tier load for captioning-style **100-token** outputs:

| GPU | TTFT (ms) | Request latency (ms) | Req/s | Tok/s |
| --- | ---: | ---: | ---: | ---: |
| RTX PRO 6000 Blackwell | 2280.43 | 9309.93 | 6.85 | 684.76 |
| H20 | 6607.80 | 18990.48 | 3.37 | 336.55 |
| H100 NVL | 2890.81 | 9192.19 | 6.94 | 694.37 |
| H200 NVL | 1948.06 | 5284.58 | 12.07 | 1206.86 |
| H100 80GB HBM3 (SXM) | 1720.73 | 5251.56 | 12.14 | 1213.61 |
| H200 141GB HBM3 | 2060.05 | 4965.09 | 12.84 | 1284.38 |
| B200 | 1106.35 | 2736.53 | 23.28 | 2328.01 |
| B300 | 1070.93 | 2657.06 | 23.96 | 2396.35 |

At **concurrency 1** on the same workload, **B300** reaches **203.29 Tok/s** with **490.11 ms** request latency versus **71.22 Tok/s** on RTX PRO 6000 Blackwell.
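
The latency and throughput columns are linked. Assuming the client keeps a fixed number of requests in flight (closed-loop load, which is an assumption about the AIPerf setup rather than something stated in the benchmark file), Little's law predicts Req/s ≈ concurrency / mean request latency, and Tok/s ≈ Req/s × output tokens. The sketch below checks three rows of the table above against that estimate; the function name is illustrative.

```python
def expected_throughput(concurrency: int, latency_ms: float, output_tokens: int) -> tuple[float, float]:
    """Estimate (req/s, tok/s) from Little's law for a closed-loop client."""
    req_per_s = concurrency / (latency_ms / 1000.0)
    return req_per_s, req_per_s * output_tokens


# (request latency in ms, measured Req/s) at concurrency 64, Output 100, from the table above.
measured = {
    "RTX PRO 6000 Blackwell": (9309.93, 6.85),
    "H200 NVL": (5284.58, 12.07),
    "B200": (2736.53, 23.28),
}

for gpu, (latency_ms, measured_rps) in measured.items():
    est_rps, est_tps = expected_throughput(64, latency_ms, 100)
    print(f"{gpu}: estimated {est_rps:.2f} req/s ({est_tps:.0f} tok/s), measured {measured_rps} req/s")
```

The estimates land within about 1% of the measured values, which makes this a quick consistency check when adding new rows.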

### Concurrency scaling pattern

TTFT rises with client concurrency because queued prefills compete for the same server, although the growth is not strictly linear: in the B200 example below, TTFT roughly doubles from concurrency 64 to 128 but grows only about 20% from 128 to 256. For **Output 1** workloads, request latency equals TTFT at every concurrency tier. For **Output 100** workloads, request latency exceeds TTFT because generation continues after the first token.

Example — **B200**, Input 50 / Output 100 / Video 1 FPS:

| Metric | C=1 | C=64 | C=128 | C=256 |
| --- | ---: | ---: | ---: | ---: |
| TTFT (ms) | 115.27 | 1106.35 | 2111.97 | 2549.79 |
| Request latency (ms) | 553.01 | 2736.53 | 5001.20 | 9279.25 |
| Output tok/s | 180.16 | 2328.01 | 2523.07 | 2701.08 |

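Throughput often peaks between **concurrency 128 and 256**, but the gains are small relative to the latency cost: on B200, moving from C=64 to C=256 adds about 16% output throughput (2328.01 → 2701.08 Tok/s) while request latency grows about 3.4× (2736.53 → 9279.25 ms). Latency budgets therefore usually cap the usable concurrency before throughput does.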
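
One way to turn a scaling table into a serving decision is to pick the highest-throughput concurrency whose request latency still fits a latency budget. The sketch below applies that rule to the B200 table above; the budgets (3000 ms, 10000 ms, 500 ms) and the helper name are hypothetical examples, not recommended SLAs.

```python
# B200, Input 50 / Output 100 / Video 1 FPS (values from the table above).
b200 = {
    1: {"latency_ms": 553.01, "tok_s": 180.16},
    64: {"latency_ms": 2736.53, "tok_s": 2328.01},
    128: {"latency_ms": 5001.20, "tok_s": 2523.07},
    256: {"latency_ms": 9279.25, "tok_s": 2701.08},
}


def best_concurrency(results: dict, latency_budget_ms: float):
    """Highest-throughput concurrency whose request latency fits the budget, or None."""
    within_budget = {c: r for c, r in results.items() if r["latency_ms"] <= latency_budget_ms}
    if not within_budget:
        return None
    return max(within_budget, key=lambda c: within_budget[c]["tok_s"])


print(best_concurrency(b200, 3000))   # 64
print(best_concurrency(b200, 10000))  # 256
print(best_concurrency(b200, 500))    # None
```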

### Per-GPU tables

<AccordionGroup>
<Accordion title="RTX PRO 6000 Blackwell, H20, H100 NVL, H200 NVL">
Full four-workload tables for these GPUs are in `inference_benchmarks.md` under [Cosmos3-Nano Reasoner](https://github.com/nvidia/cosmos/blob/main/inference_benchmarks.md#cosmos3-nano-reasoner).
</Accordion>
<Accordion title="H100 80GB HBM3 (SXM), H200 141GB HBM3, B200, B300">
Full four-workload tables for datacenter Blackwell and Hopper SKUs are in the same Reasoner section of `inference_benchmarks.md`.
</Accordion>
</AccordionGroup>

## Choosing hardware and integration from benchmarks

```text
                    ┌─────────────────────────────────────┐
                    │     Goal: pick stack from tables    │
                    └─────────────────┬───────────────────┘
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          ▼                           ▼                           ▼
   Lowest latency t2v          Fastest iteration          Production VLM QPS
   at 720p (datacenter)        (single GPU, t2i)          (Reasoner, high C)
          │                           │                           │
          ▼                           ▼                           ▼
   B200/B300 + vLLM-Omni         Diffusers 256p/1            B200/B300 vLLM
   720p/8 TP (Nano ~23–27s)     (Nano ~2–11s t2i)          (23+ req/s @ C=64)
          │                           │
          └───────── PyTorch OSS if you need Framework parity ─────────┘
```

| Goal | Favor | Example from tables |
| --- | --- | --- |
| Fastest Nano **720p t2v** | vLLM-Omni + multi-GPU TP | B200 720p/8: **22.87 s** (vLLM-Omni) vs **114.85 s** (PyTorch 720p/1) |
| Simplest single-GPU video | Diffusers | B200 720p/1 t2v: **117.00 s** |
| Highest Reasoner **tok/s** | B200/B300 at concurrency 64–256 | B300 Output 100 / V1 FPS @ C=64: **2396 Tok/s** |
| Super quality at scale | B200 + TP | Super t2v B200 720p/8: **62.11 s** (vLLM-Omni) |

<Note>
Benchmarks characterize reference workloads under controlled settings. Your prompts, guardrails, resolution, frame count, and server flags will shift absolute numbers. Use tables for **relative** comparisons across GPUs, engines, and TP widths on the same modality.
</Note>

## Reproducing or extending measurements

<Steps>
<Step title="Match generator workload settings">
Use BF16, batch size 1, 189 frames at 24 FPS for video, and fixed seeds/prompts across PyTorch, vLLM-Omni, and Diffusers runs. Align resolution tiers with 256p (320×192), 480p (832×480), and 720p (1280×720).
</Step>
<Step title="Run the integration under test">
- **PyTorch / Framework:** OSS inference benchmarking path; Framework cookbooks also support `--benchmark` on `cosmos_framework.scripts.inference` for local timing.
- **vLLM-Omni:** Docker `vllm/vllm-omni:cosmos3` with `--tensor-parallel-size` for Super; see audiovisual vLLM-Omni notebook.
- **Diffusers:** `Cosmos3OmniPipeline` end-to-end without CUDA graphs.
</Step>
<Step title="Run Reasoner serving benchmarks">
Serve `nvidia/Cosmos3-Nano` with Reasoner architecture overrides (`Cosmos3ReasonerForConditionalGeneration`), then drive load with the AIPerf client at concurrency 1, 64, 128, and 256.
</Step>
<Step title="Publish new cells">
Add results to `inference_benchmarks.md` in the appropriate Generator or Reasoner table; leave cells empty until a combination is actually measured.
</Step>
</Steps>
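
For local timing of a single-GPU run, a small harness with warm-up iterations and explicit synchronization is usually enough. The sketch below is a generic illustration, not the harness used for the published numbers; `time_runs` and its parameters are hypothetical names. When timing GPU work, pass `torch.cuda.synchronize` as `sync` so asynchronous kernels are included in the measurement.

```python
import statistics
import time
from typing import Callable, Optional


def time_runs(run: Callable[[], object], warmup: int = 1, repeats: int = 3,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time `run` after warm-up iterations and return latency stats in seconds."""
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        if sync:
            sync()  # drain pending GPU work before starting the clock
        start = time.perf_counter()
        run()
        if sync:
            sync()  # wait for GPU work launched by run() to finish
        samples.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(samples), "min_s": min(samples), "runs": len(samples)}


# Stand-in workload; replace with a pipeline call that uses a fixed seed and prompt.
stats = time_runs(lambda: sum(range(100_000)), warmup=1, repeats=3)
print(stats)
```

Reporting the mean matches the "average generation time" convention used for the PyTorch rows; the minimum is included because it is less sensitive to one-off stalls.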

## Related pages

<CardGroup>
<Card title="Choose an integration" href="/choose-integration">
Compare PyTorch, vLLM-Omni, Diffusers, and Framework paths that these benchmarks measure.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, and precision defaults that define generator workloads.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Tensor parallelism and Docker serve flags behind vLLM-Omni benchmark configurations.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Cosmos3OmniPipeline setup for Diffusers end-to-end latency numbers.
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
Serve Cosmos3-Nano Reasoner—the stack used for TTFT and throughput tables.
</Card>
<Card title="Reasoner vLLM configuration" href="/reasoner-vllm-configuration">
Serve flags (TP, mm-encoder, media-io-kwargs) that affect Reasoner serving performance.
</Card>
</CardGroup>

---

## 24. Troubleshooting

> CUDA/driver mismatches, NGC container selection, torch.cuda unavailable fixes, libxcb headless imports, uv version and --torch-backend errors, and VLLM_USE_DEEP_GEMM workaround.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/24-troubleshooting.md
- Generated: 2026-06-01T20:29:31.763Z

### Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`


Cosmos 3 setup failures usually trace to a CUDA driver and PyTorch wheel mismatch, an outdated `uv` that cannot parse framework config or accept `cu130`, missing X11 libraries on headless hosts, or vLLM Reasoner builds that require disabling DeepGEMM. The tables below map symptoms to backend-specific fixes for Diffusers, Cosmos Framework `uv sync`, vLLM, and the `vllm/vllm-omni:cosmos3` container.

## Symptom map

```text
Symptom                          Likely cause                    Fix surface
─────────────────────────────────────────────────────────────────────────────
"The NVIDIA driver is too old"   torch cu130 on CUDA 12 driver  --torch-backend / COSMOS3_*
cuda available: False            Same mismatch                    cu128 / cu128-train
--torch-backend invalid/cu129    uv < 0.11.3                      uv self update
libxcb.so.1 missing              Headless X11 libs                apt-get libxcb1 ...
DeepGEMM unavailable             vLLM build / GPU combo           VLLM_USE_DEEP_GEMM=0
vllm serve fails after install   cu130+vllm 0.21 on CUDA 12       cu128 + vllm 0.19.1
```

## CUDA driver and PyTorch alignment

Supported driver CUDA lines are **13.x** (recommended) and **12.x** (12.8 wheels). The driver CUDA reported by `nvidia-smi` and the CUDA baked into PyTorch must agree on the **major** version.

| Check | Command |
| --- | --- |
| Driver CUDA | `nvidia-smi` (top-right CUDA Version) |
| PyTorch CUDA | `python -c "import torch; print(torch.version.cuda)"` |
| GPU visible | `python -c "import torch; print(torch.cuda.is_available())"` |

| Driver CUDA | Diffusers / generic `uv pip` | Cosmos Framework `uv sync` | vLLM Reasoner |
| --- | --- | --- | --- |
| 13.x | `--torch-backend=cu130` or `COSMOS3_TORCH_BACKEND=cu130` | `--group=cu130-train` or `COSMOS3_UV_GROUP=cu130-train` | `--torch-backend=cu130 "vllm==0.21.0"` |
| 12.x | `--torch-backend=cu128` or `COSMOS3_TORCH_BACKEND=cu128` | `--group=cu128-train` or `COSMOS3_UV_GROUP=cu128-train` | `--torch-backend=cu128 "vllm==0.19.1"` |

<Warning>
vLLM does not publish wheels for every CUDA minor version. For Reasoner installs, **`--torch-backend=auto` is not reliable** — always pick the `cu130`/`vllm==0.21.0` or `cu128`/`vllm==0.19.1` pair that matches your driver.
</Warning>

<Tip>
For **Diffusers** Generator installs only, `uv pip install --torch-backend=auto` lets `uv` detect the NVIDIA driver and select a matching `torch`/`torchvision` build. Without it, `uv pip install torch` defaults to the newest CUDA wheel (`cu130`), which fails on pre-CUDA-13 drivers.
</Tip>
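
The pairing table can be applied mechanically once the driver CUDA version is known. The sketch below parses the `CUDA Version: X.Y` field that `nvidia-smi` prints in its header and returns the matching row; the helper name, the dictionary layout, and the sample header line are illustrative assumptions, while the backend values are copied from the table above.

```python
import re

# Pairings copied from the table above, keyed by driver CUDA major version.
BACKENDS = {
    13: {"torch_backend": "cu130", "uv_group": "cu130-train", "vllm": "vllm==0.21.0"},
    12: {"torch_backend": "cu128", "uv_group": "cu128-train", "vllm": "vllm==0.19.1"},
}


def backends_for_driver(nvidia_smi_output: str) -> dict:
    """Pick install settings from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    major = int(match.group(1))
    if major not in BACKENDS:
        raise ValueError(f"driver CUDA {major}.x is outside the supported 12.x/13.x lines")
    return BACKENDS[major]


# Abbreviated sample of an nvidia-smi header line (illustrative values).
sample = "| NVIDIA-SMI 570.86.10    Driver Version: 570.86.10    CUDA Version: 12.8 |"
print(backends_for_driver(sample))
```

On a live host, the input could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.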

## `torch.cuda.is_available()` is False

### Typical error

```text
The NVIDIA driver on your system is too old
```

PyTorch was built for a newer CUDA than the host driver supports. `uv pip install torch` without `--torch-backend` pulls CUDA 13 (`cu130`) by default.

<Steps>
<Step title="Confirm the mismatch">

Run both checks and compare major CUDA versions:

```bash
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```

</Step>
<Step title="Reinstall torch with a matching backend">

<Tabs>
<Tab title="Diffusers venv">

```bash
source .venv/bin/activate
uv pip install --torch-backend=auto torch torchvision
# or pin explicitly:
# uv pip install --torch-backend=cu128 torch torchvision
```

Set `COSMOS3_TORCH_BACKEND=cu128` before re-running the Diffusers cookbook install cell if you use the notebook env vars.

</Tab>
<Tab title="Cosmos Framework">

```bash
export COSMOS3_UV_GROUP=cu128-train   # CUDA 12.x driver
cd packages/cosmos3
uv sync --all-extras --group="$COSMOS3_UV_GROUP"
```

Default notebooks use `cu130-train`; change the group **before** the configuration cell on CUDA 12.x systems.

</Tab>
<Tab title="vLLM Reasoner">

```bash
source .venv/bin/activate
# CUDA 12.x driver:
uv pip install --torch-backend=cu128 "vllm==0.19.1" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"
```

</Tab>
</Tabs>

</Step>
<Step title="Verify GPU access">

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and at least one device.

</Step>
</Steps>
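
The comparison in the first step can also be scripted. The sketch below takes the two version strings (the driver CUDA version from `nvidia-smi` and the value of `torch.version.cuda`, which is `None` on CPU-only builds) and reports which case applies. The function name and messages are illustrative; it only applies the major-version rule described above.

```python
from typing import Optional


def diagnose(driver_cuda: str, torch_cuda: Optional[str]) -> str:
    """Compare nvidia-smi's CUDA version with torch.version.cuda by major version."""
    if torch_cuda is None:
        return "CPU-only torch build: reinstall with an explicit --torch-backend"
    driver_major = int(driver_cuda.split(".")[0])
    torch_major = int(torch_cuda.split(".")[0])
    if torch_major > driver_major:
        return (f"torch CUDA {torch_cuda} is newer than driver CUDA {driver_cuda}: "
                f"reinstall with a CUDA {driver_major}.x backend")
    if torch_major < driver_major:
        return (f"torch CUDA {torch_cuda} is older than driver CUDA {driver_cuda}: "
                "majors differ from the pairing table")
    return "major versions match"


print(diagnose("12.8", "13.0"))
print(diagnose("13.0", "13.0"))
print(diagnose("12.8", None))
```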

## NGC base container selection

When you run inside NVIDIA NGC PyTorch images instead of a local `uv` venv, match the container tag to your target CUDA line:

| Target CUDA | NGC image |
| --- | --- |
| CUDA 13 | `nvcr.io/nvidia/pytorch:25.09-py3` |
| CUDA 12 | `nvcr.io/nvidia/pytorch:25.06-py3` |

Pair the container with the same `cu130` / `cu128` backend choices as bare-metal installs. Generator production serving via vLLM-Omni uses the separate prebuilt image `vllm/vllm-omni:cosmos3` (not the NGC PyTorch tags).

## vLLM-Omni Docker

| Issue | Remediation |
| --- | --- |
| Server not ready | Wait for `Application startup complete.` in container logs |
| Local media not found | Mount host paths and pass `--allowed-local-media-path` covering those files |
| Super OOM | Add `--tensor-parallel-size 4 --enable-layerwise-offload` for `nvidia/Cosmos3-Super` |

Minimal Nano serve pattern:

```bash
docker pull vllm/vllm-omni:cosmos3

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

Verify the API: `curl http://localhost:8000/v1/models`.
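
Because model loading can take a while, scripts that start the container usually need to wait before sending requests. The sketch below polls the same `/v1/models` endpoint until it answers. It assumes the response follows the OpenAI-style list shape (`{"data": [{"id": ...}]}`); the function names, timeout, and interval are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_models(url: str) -> dict:
    """GET the model list and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0,
                     fetch: Callable[[str], dict] = fetch_models) -> list:
    """Poll `url` until it answers, then return the served model IDs."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            body = fetch(url)
            return [model["id"] for model in body.get("data", [])]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{url} not ready after {timeout_s} s")
            time.sleep(interval_s)


# Usage (requires the container above to be running):
# print(wait_until_ready("http://localhost:8000/v1/models"))
```

The `fetch` parameter exists so the polling logic can be exercised without a running server.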

## Headless import: `libxcb.so.1`

On headless servers and minimal containers, importing Diffusers pipelines or Framework modules can fail because a dependency links against X11/graphics libraries that are not installed:

```text
libxcb.so.1: cannot open shared object file
```

Install the system packages:

```bash
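# Refresh package lists first (needed in fresh containers); prefix with sudo if not running as root.
apt-get update
apt-get install -y libxcb1 libgl1 libglib2.0-0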
```

Cookbooks for Diffusers and Cosmos Framework audiovisual generation document this under prerequisites.
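
A quick pre-flight check can confirm whether the shared libraries are visible before importing a pipeline. The sketch below uses the standard-library `ctypes.util.find_library`; the short names map to the apt packages above (`xcb` to `libxcb1`, `GL` to `libgl1`, `glib-2.0` to `libglib2.0-0`). The helper name is illustrative, and the result depends on the linker tools available on the host, so treat a "missing" report as a hint rather than proof.

```python
import ctypes.util

# Short library names for the apt packages above: libxcb1, libgl1, libglib2.0-0.
REQUIRED_LIBS = ("xcb", "GL", "glib-2.0")


def missing_system_libs(names=REQUIRED_LIBS) -> list:
    """Return the library names that find_library cannot locate."""
    return [name for name in names if ctypes.util.find_library(name) is None]


missing = missing_system_libs()
print("missing:", missing or "none")
```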

## `uv` version and `--torch-backend` errors

Cosmos Framework requires **`uv >= 0.11.3`** (enforced in its `pyproject.toml`). Older `uv` builds fail in three common ways:

| Error pattern | Cause |
| --- | --- |
| Parse failure on `[tool.uv.audit]` | `uv` too old for framework `pyproject.toml` |
| `a value is required for '--torch-backend'` | Flag present but value rejected |
| Accepted backends stop at `cu129` | `uv` predates `cu130` support |

Upgrade:

```bash
uv self update
# or reinstall from https://astral.sh/uv
```

Confirm version: `uv --version` (must be ≥ 0.11.3).
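
Version strings should be compared numerically, not as text: `"0.9.28"` sorts after `"0.11.3"` as a string even though it is the older release. The sketch below parses `uv --version` style output into a tuple before comparing; the helper names are illustrative.

```python
import re

MIN_UV = (0, 11, 3)


def parse_uv_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from `uv --version` output such as 'uv 0.11.3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized uv version string: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def uv_is_new_enough(version_output: str, minimum: tuple = MIN_UV) -> bool:
    """True when the parsed version is at least `minimum`."""
    return parse_uv_version(version_output) >= minimum


print(uv_is_new_enough("uv 0.11.3"))  # True
print(uv_is_new_enough("uv 0.9.28"))  # False
```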

### Environment variables for notebooks

| Variable | Default | Use on CUDA 12.x |
| --- | --- | --- |
| `COSMOS3_TORCH_BACKEND` | `cu130` | `cu128` (Diffusers) |
| `COSMOS3_UV_GROUP` | `cu130-train` | `cu128-train` (Framework) |

Export these **before** the notebook configuration or install cells.
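
The export order matters because a configuration cell that reads the environment falls back to the CUDA 13 defaults for anything not yet set. The sketch below illustrates that fallback behavior; it is not the notebooks' actual code, and `resolve_settings` is a hypothetical helper.

```python
import os

# Defaults from the table above; both assume a CUDA 13.x driver.
DEFAULTS = {"COSMOS3_TORCH_BACKEND": "cu130", "COSMOS3_UV_GROUP": "cu130-train"}


def resolve_settings(environ=os.environ) -> dict:
    """Return each variable's exported value, falling back to its default."""
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


# Simulate a CUDA 12.x host that exported only the Diffusers override.
print(resolve_settings({"COSMOS3_TORCH_BACKEND": "cu128"}))
# {'COSMOS3_TORCH_BACKEND': 'cu128', 'COSMOS3_UV_GROUP': 'cu130-train'}
```

In the simulated case the Framework group silently stays on `cu130-train`, which is the mismatch the table is meant to prevent.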

### Framework install: git-LFS mirror failures

If `uv sync` fails on optional git-LFS test artifacts in the Framework mirror:

```bash
export GIT_LFS_SKIP_SMUDGE=1
uv sync --all-extras --group=cu130-train
```

## `VLLM_USE_DEEP_GEMM` workaround

vLLM Reasoner serving may report that **DeepGEMM is unavailable** on your GPU or build. Disable DeepGEMM before `vllm serve`:

```bash
export VLLM_USE_DEEP_GEMM=0

vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Apply the same export in notebook shells that launch the background vLLM server.

<Note>
When launching with `.venv/bin/vllm` without activating the venv, ensure `.venv/bin` is on `PATH`. FlashInfer’s JIT kernel build invokes `ninja`, which lives in the venv.
</Note>
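
When the server is launched from Python (for example, from a notebook cell), both requirements can be handled by building the child environment explicitly. The sketch below prepends the venv's `bin` directory to `PATH` and sets `VLLM_USE_DEEP_GEMM=0`; the helper name and the `.venv/bin` location are assumptions about the local layout.

```python
import os


def vllm_env(venv_bin: str, base_env=None) -> dict:
    """Copy the environment, put the venv's bin directory first on PATH, and disable DeepGEMM."""
    env = dict(os.environ if base_env is None else base_env)
    env["PATH"] = venv_bin + os.pathsep + env.get("PATH", "")
    env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


env = vllm_env(os.path.abspath(".venv/bin"))
print(env["PATH"].split(os.pathsep)[0], env["VLLM_USE_DEEP_GEMM"])

# To launch the server with this environment (add the flags from the command above):
# import subprocess
# subprocess.Popen([".venv/bin/vllm", "serve", "nvidia/Cosmos3-Nano", "--port", "8000"], env=env)
```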

## Backend quick reference

| Backend | CUDA pairing | `auto` backend? |
| --- | --- | --- |
| Diffusers Generator | `COSMOS3_TORCH_BACKEND` / `--torch-backend` | Yes (`auto` recommended in README quickstart) |
| Cosmos Framework | `COSMOS3_UV_GROUP` (`cu130-train` / `cu128-train`) | No — pick group explicitly |
| vLLM Reasoner | `cu130`+`vllm==0.21.0` or `cu128`+`vllm==0.19.1` | No |
| vLLM-Omni Generator | Prebuilt `vllm/vllm-omni:cosmos3` image | N/A (container CUDA is fixed) |

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, CUDA driver pairing, venv and Docker setup, and first verification commands.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker paths for every backend, HF auth, and GPU verification probes.
</Card>
<Card title="Reasoner vLLM configuration" href="/reasoner-vllm-configuration">
`vllm serve` flags, tensor parallelism, `VLLM_USE_DEEP_GEMM`, and cu130/cu128 version pairs.
</Card>
<Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
`Cosmos3OmniPipeline` modes and `--torch-backend` install pairing.
</Card>
</CardGroup>

---

## 25. Ecosystem, license, and release

> Related Cosmos projects (Framework, Curator, Evaluator), OpenMDW-1.1 license terms, known model limitations, release cadence pointers, and third-party dependency notices.

- Page Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/pages/25-ecosystem-license-and-release.md
- Generated: 2026-06-01T20:29:44.179Z

### Source Files

- `README.md`
- `LICENSE`
- `RELEASE.md`
- `cookbooks/cosmos3/README.md`


The **NVIDIA/cosmos** repository ships Cosmos 3 cookbooks, benchmarks, and integration quickstarts under **OpenMDW-1.1**, while training, curation, and automated evaluation live in sibling repositories (**cosmos-framework**, **cosmos-curator**, **cosmos-evaluator**). Install paths in this repo pull additional third-party packages (Diffusers, vLLM, vLLM-Omni, `cosmos_guardrail`, Hugging Face tooling); review each dependency’s license before production use.

## Cosmos platform map

NVIDIA Cosmos is an open platform of world models, datasets, and tools for Physical AI (robots, autonomous vehicles, smart infrastructure). This repository focuses on **Cosmos 3** runnable examples and serving guides; the broader platform splits operational concerns across dedicated projects.

```mermaid
flowchart TB
  subgraph cosmos_repo["NVIDIA/cosmos (this repo)"]
    CB["cookbooks/cosmos3"]
    BM["inference_benchmarks.md"]
    QS["README quickstarts"]
  end

  subgraph models["Model artifacts"]
    HF["Hugging Face: nvidia/cosmos3 collection"]
  end

  subgraph framework["cosmos-framework"]
    INF["cosmos_framework.scripts.inference"]
    TRN["Training recipes (Coming Soon)"]
    VLLM3["packages/vllm-cosmos3"]
  end

  subgraph data["cosmos-curator"]
    CUR["Processing, annotation, filtering, deduplication"]
  end

  subgraph eval["cosmos-evaluator"]
    EV["World generation and reasoning evaluation"]
  end

  CB --> INF
  CB --> HF
  QS --> HF
  VLLM3 --> CB
  TRN -.-> CB
  CUR -.-> TRN
  EV -.-> INF
```

| Project | Repository | Role in a Physical AI workflow |
| --- | --- | --- |
| **Cosmos (this repo)** | [NVIDIA/cosmos](https://github.com/NVIDIA/cosmos) | Cosmos 3 cookbooks, `inference_benchmarks.md`, Diffusers/vLLM/vLLM-Omni quickstarts, and example assets |
| **Cosmos Framework** | [NVIDIA/cosmos-framework](https://github.com/NVIDIA/cosmos-framework) | End-to-end setup, native PyTorch inference (`torchrun`), training, and evaluation workflows; hosts `vllm-cosmos3` |
| **Cosmos Curator** | [NVIDIA/cosmos-curator](https://github.com/NVIDIA/cosmos-curator) | Distributed data curation: processing, annotation, filtering, deduplication |
| **Cosmos Evaluator** | [NVIDIA/cosmos-evaluator](https://github.com/NVIDIA/cosmos-evaluator) | Automated evaluation of world generation and world reasoning outputs |

<Info>
Cookbooks that use Cosmos Framework or vLLM require access to `git@github.com:NVIDIA/cosmos-framework.git` (or HTTPS clone). Framework setup is documented in [Cookbook environment setup](/cookbook-environment).
</Info>

### How this repo connects to Framework

| Integration goal | Primary surface in this repo | Where Framework fits |
| --- | --- | --- |
| Generator / Reasoner research with full checkpoint | Diffusers `Cosmos3OmniPipeline` | Optional; Framework exposes `cosmos_framework.scripts.inference` with parallelism presets |
| Production Generator API | vLLM-Omni (`vllm/vllm-omni:cosmos3`) | Same checkpoints; Framework for batch `torchrun` jobs |
| Production Reasoner API | vLLM + `vllm-cosmos3` from Framework | Plugin registers `Cosmos3ReasonerForConditionalGeneration` |
| Training, post-training, task eval | Not in this repo yet | Framework; README marks post-training recipes **[Coming Soon]** |

Clone path used by cookbooks:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3
export GIT_LFS_SKIP_SMUDGE=1
uv sync --all-extras --group=cu130-train   # or cu128-train on CUDA 12.x
```

## OpenMDW-1.1 license

NVIDIA Cosmos **source code and models** are released under the [OpenMDW License Agreement, version 1.1 (OpenMDW-1.1)](https://openmdw.ai/license/1-1/). The full text is in the repository root `LICENSE`. Cookbook notebooks declare `SPDX-License-Identifier: OpenMDW-1.1`.

### Scope: Model Materials

Under the agreement, **Model Materials** means:

1. One or more machine learning models (architecture and parameters), and  
2. All related artifacts (associated data, documentation, and software) provided under the agreement.

### Grants and distribution

| Topic | Terms |
| --- | --- |
| **Permission** | Granted free of charge: you may deal in the Model Materials without restriction, including under all copyright, patent, database, and trade secret rights, subject to compliance with the agreement |
| **Distribution** | If you distribute any portion of Model Materials, include (1) a copy of the agreement and (2) all applicable copyright and origin notices from the materials |
| **Outputs** | No restrictions or obligations on use, modification, or sharing of **outputs** generated by using the Model Materials |
| **Patent retaliation** | Rights terminate if you file, maintain, or voluntarily participate in a lawsuit asserting the Model Materials infringe patent or copyright—unless that suit responds to a corresponding suit first brought against you |

### Disclaimers and your responsibilities

The Model Materials are provided **“AS IS”** without warranty (merchantability, fitness, title, non-infringement, accuracy, latent defects) to the fullest extent permitted by law.

You are solely responsible for:

1. Clearing rights of other persons that may apply to the Model Materials or any use thereof (including copyrights or other rights embodied in the materials)  
2. Obtaining necessary consents, permissions, or other rights for any use  
3. Performing due diligence or other investigations into the Model Materials or anything incorporated therein  

Providers of the Model Materials are not liable for claims arising from the materials or their use.

### Custom licensing

For a license outside OpenMDW-1.1, contact **[cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com)**.

<Warning>
OpenMDW-1.1 governs NVIDIA Cosmos source and models in this repository. **Third-party packages** installed by setup commands (PyTorch, Diffusers, vLLM, Hugging Face Hub, guardrails, and others) remain under their own licenses—see [Third-party dependencies](#third-party-dependencies).
</Warning>

## Known model limitations

Cosmos 3 can produce artifacts in long, high-resolution, or physically complex outputs. Documented failure modes include:

| Category | Examples |
| --- | --- |
| **Temporal / motion** | Temporal inconsistency, unstable camera or object motion |
| **Multimodal alignment** | Inaccurate sound–video alignment, imperfect action–state consistency |
| **Geometry / physics** | Object morphing, inaccurate 3D structure, implausible physical dynamics |

<Warning>
Applications that require physically grounded simulation, **safety-critical control**, or complex multi-agent behavior need additional validation, guardrails, and system-level safety analysis before deployment—not reliance on raw model output alone.
</Warning>

### Safety guardrails (Generator)

Cosmos 3 Generator integrations ship **safety guardrails** (`cosmos_guardrail` in Diffusers installs) that screen prompts and blur faces in generated output. vLLM-Omni exposes per-request control via `extra_params.guardrails` (default on in several cookbooks; action robotics examples often disable guardrails for throughput). Server-wide disable uses a deploy config (`guardrails: false` in `model_config`); a dedicated `--cosmos3-no-guardrails` flag is noted as a future release item in the README.

Disabling guardrails does not remove the model limitations above; it only changes prompt/output screening behavior.

## Release cadence and version history

### Where to look

| Artifact | Location | Contents |
| --- | --- | --- |
| **Release cadence table** | `RELEASE.md` | Prior platform milestones with dates |
| **Cosmos 3 announcement** | `README.md` → News | May 31, 2026 release: Hugging Face collection, Framework workflows, technical report link |
| **Inference benchmarks** | `inference_benchmarks.md` | Generator latency and Reasoner serving metrics (updated incrementally) |

### Documented milestones (`RELEASE.md`)

| Version | Description | Date |
| --- | --- | --- |
| v1.0 | Initial diffusion and autoregressive WFMs release | 2025-01-06 |
| v0.1 | Initial tokenizer release | 2024-11-06 |

`RELEASE.md` references detailed notes at `release_notes/v0p1.md`, but that path is not present in the current repository checkout. Until release notes are published in-tree, treat `RELEASE.md` and the README News section as the authoritative sources for dates.

### Cosmos 3 (current generation)

:::updates
@update Cosmos 3 — May 31, 2026 — Models published in the [NVIDIA Cosmos 3 Hugging Face collection](https://huggingface.co/collections/nvidia/cosmos3). [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) provides runnable setup, inference, training, and evaluation workflows. Technical report: [Cosmos 3 Technical Report](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf).
:::

### In-repo capabilities still marked Coming Soon

| Capability | Status in README / cookbooks |
| --- | --- |
| Post-training recipes (vision, action, reasoner) + task-specific evaluation | Coming Soon (Framework) |
| Reasoner with Transformers | Coming Soon |
| vLLM-Omni upstream (all modalities in stock `vllm-omni`) | Partial upstreaming via [vllm-omni#3454](https://github.com/vllm-project/vllm-omni/pull/3454); `vllm/vllm-omni:cosmos3` Docker image is the full-modality official build until merge |

## Third-party dependencies

> This project may download and install additional third-party open source software projects. Review the license terms of those projects before use.

Setup commands in the README and `cookbooks/cosmos3/README.md` commonly install or reference:

| Dependency | Typical use | Install / source pointer |
| --- | --- | --- |
| **PyTorch / torchvision** | All GPU backends | `uv pip` with `--torch-backend=cu130` or `cu128` |
| **Diffusers** (git) | Generator research, `Cosmos3OmniPipeline` | `git+https://github.com/huggingface/diffusers.git` |
| **transformers**, **accelerate**, **huggingface_hub** | Model loading, HF auth | Pip alongside Diffusers or vLLM |
| **vLLM** | Reasoner production serving | `vllm==0.21.0` (cu130) or `vllm==0.19.1` (cu128) |
| **vllm-cosmos3** | Reasoner architecture plugin | `cosmos-framework.git#subdirectory=packages/vllm-cosmos3` |
| **vLLM-Omni** | Generator OpenAI-compatible API | `vllm/vllm-omni:cosmos3` Docker image or PR-branch pip install |
| **cosmos_guardrail** | Generator prompt/output safety | Diffusers venv install list |
| **av**, **imageio**, **imageio-ffmpeg** | Video I/O | Diffusers path |
| **NVIDIA NGC PyTorch containers** | Recommended base images | `nvcr.io/nvidia/pytorch:25.09-py3` (CUDA 13) or `25.06-py3` (CUDA 12) |
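The CUDA-dependent rows of the table can be captured in a small lookup, which may help keep install scripts consistent. This is a sketch: the `BackendPins` type and `pins_for` helper are hypothetical names, the values are transcribed from the table above rather than read from any Cosmos configuration file, and pairing `cu130` with the CUDA 13 image and `cu128` with the CUDA 12 image is an assumption based on the CUDA major versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendPins:
    torch_backend: str  # value passed to uv's --torch-backend
    vllm_spec: str      # pinned vLLM requirement
    ngc_image: str      # recommended NGC PyTorch base image


# Values transcribed from the dependency table; update them when the table changes.
PINS = {
    "cu130": BackendPins("cu130", "vllm==0.21.0", "nvcr.io/nvidia/pytorch:25.09-py3"),
    "cu128": BackendPins("cu128", "vllm==0.19.1", "nvcr.io/nvidia/pytorch:25.06-py3"),
}


def pins_for(backend: str) -> BackendPins:
    """Return the pinned versions for a CUDA backend tag, or raise ValueError."""
    try:
        return PINS[backend]
    except KeyError:
        raise ValueError(
            f"unsupported backend {backend!r}; expected one of {sorted(PINS)}"
        ) from None


if __name__ == "__main__":
    pins = pins_for("cu130")
    print(pins.vllm_spec, pins.ngc_image)
```

Failing loudly on an unknown tag is deliberate: mixing a vLLM pin from one backend with a PyTorch build from the other is the kind of mismatch that tends to surface only at import time.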

### Sample and asset dependencies

Action cookbooks include a **LeRobot-format DROID** sample under `cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/`. Reasoner cookbooks may fetch vision assets from **`nvidia-cosmos/cosmos-dependencies`** (see reasoner README input URLs). Those assets are governed by their respective repositories and licenses, not by OpenMDW-1.1 alone.

### Architectural third-party mentions

- Generator vLLM-Omni loads a checkpoint that includes a **Qwen3-VL-based** reasoner path alongside the diffusion path.
- Reasoner serving follows **Qwen3-VL-compatible** chat message conventions for image and video inputs.
- Hugging Face **gated** Cosmos3 model repos require authentication (`uvx hf auth login` or `HF_TOKEN`).
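Because gated repos fail at download time when no credential is present, a pre-flight check can save a long wait. The sketch below looks for a token in `HF_TOKEN` first and then in a cached token file. The helper name `find_hf_token` is hypothetical, and the fallback path (`$HF_HOME/token`, defaulting to `~/.cache/huggingface/token`) is an assumption about where a login stores the token; installations that override the token path will need a different location.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(
    env: Mapping[str, str] = os.environ,
    token_file: Optional[Path] = None,
) -> Optional[str]:
    """Return a Hugging Face token from HF_TOKEN or a cached token file, else None."""
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    if token_file is None:
        # Assumed default cache location; the hub client may be configured differently.
        hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
        token_file = Path(hf_home) / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


if __name__ == "__main__":
    # Report presence only; never print the token itself.
    if find_hf_token() is None:
        print("No Hugging Face token found: run `uvx hf auth login` or set HF_TOKEN.")
    else:
        print("Hugging Face token found.")
```

A token being present does not guarantee access: the account behind it still has to have accepted the terms of the gated repo.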

<Note>
The Cosmos Framework requires **`uv >= 0.11.3`** for `pyproject.toml` parsing and `--torch-backend` values such as `cu130`. Older `uv` versions fail sync/install with opaque errors, so upgrade with `uv self update` before starting Framework work.
</Note>
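The version floor in the note above can be checked before running any Framework command. The following is a minimal sketch, assuming `uv --version` prints a line containing a dotted `major.minor.patch` version; the helper names `parse_uv_version` and `uv_is_new_enough` are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Tuple

MIN_UV = (0, 11, 3)  # minimum version stated in the note above


def parse_uv_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from the first dotted version in the text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.groups())


def uv_is_new_enough(output: str, minimum: Tuple[int, ...] = MIN_UV) -> bool:
    """Compare numerically, so that 0.11.10 ranks above 0.11.3."""
    return parse_uv_version(output) >= minimum


if __name__ == "__main__":
    if shutil.which("uv") is None:
        print("uv is not on PATH")
    else:
        result = subprocess.run(
            ["uv", "--version"], capture_output=True, text=True, check=True
        )
        if uv_is_new_enough(result.stdout):
            print("uv version is sufficient")
        else:
            print("uv is too old: run `uv self update`")
```

Comparing integer tuples rather than strings matters here, since a plain string comparison would rank `0.9.30` above `0.11.3`.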

## Related pages

<CardGroup>
  <Card title="Overview" href="/overview">
    Cosmos 3 Reasoner vs Generator surfaces, modalities, and the shortest first-run path.
  </Card>
  <Card title="Choose an integration" href="/choose-integration">
    Decision matrix for Diffusers, vLLM-Omni, vLLM, Framework, and coming-soon Transformers.
  </Card>
  <Card title="Cookbook environment setup" href="/cookbook-environment">
    Shared uv/Docker setup, HF auth, CUDA tags, and Framework clone/sync.
  </Card>
  <Card title="Inference benchmarks" href="/inference-benchmarks">
    Published Generator latency and Reasoner vLLM serving tables.
  </Card>
  <Card title="Troubleshooting" href="/troubleshooting">
    CUDA/driver pairing, NGC containers, uv version, and DeepGEMM workarounds.
  </Card>
</CardGroup>

---