# Choose an integration

> Decision matrix for Diffusers, vLLM-Omni, vLLM, Transformers (coming soon), and Cosmos Framework by goal: research, production inference, training, or evaluation.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/reasoner/README.md`

---

Cosmos 3 exposes two runtime surfaces, **Reasoner** (text/vision in, text out) and **Generator** (multimodal in, vision/sound/action out), and five integration paths: **Diffusers**, **vLLM-Omni**, **vLLM**, **Transformers** (Reasoner, coming soon), and **Cosmos Framework** (native PyTorch via `cosmos_framework.scripts.inference` and `torchrun`). Pick the surface first, then match your goal (research, production serving, training, or evaluation) to the backend that covers that workflow in this repository's cookbooks.

## Decision matrix by goal

| Goal | Surface | Integration | Entry point | Notes |
| --- | --- | --- | --- | --- |
| Generator research or model development | Generator | **Diffusers** | `Cosmos3OmniPipeline.from_pretrained` | Loads the full checkpoint (reasoner + diffusion + media tokenizers); Python-first inspection and modification |
| Generator production inference | Generator | **vLLM-Omni** | `vllm serve … --omni --model-class-name Cosmos3OmniDiffusersPipeline` | OpenAI-compatible `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync`; prefer `vllm/vllm-omni:cosmos3` for all modalities |
| Reasoner research or model development | Reasoner | **Transformers** (coming soon) | — | Planned Hugging Face path for prompts, processors, and model behavior |
| Reasoner production inference | Reasoner | **vLLM** | `vllm serve` + `Cosmos3ReasonerForConditionalGeneration` | OpenAI-compatible chat completions; Qwen3-VL-compatible messages |
| Runnable setup, training, or evaluation | Both | **Cosmos Framework** | `torchrun -m cosmos_framework.scripts.inference` | Clone `NVIDIA/cosmos-framework`, `uv sync --group=cu130-train` (or `cu128-train`); covers Reasoner, Generator audiovisual, and action cookbooks |
| Latency comparison across engines | Generator | **Cosmos Framework** (PyTorch), **vLLM-Omni**, **Diffusers** | See `inference_benchmarks.md` | Benchmarks label Framework OSS inference as **PyTorch** |

<Note>
For **text-only understanding** (captioning, grounding, planning), use **Reasoner + vLLM**, not vLLM-Omni. vLLM-Omni loads the full omni checkpoint for diffusion generation; Reasoner vLLM serves only `Cosmos3ReasonerForConditionalGeneration`.
</Note>
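
The matrix above can be read as a lookup from (surface, goal) to an integration. The following is a minimal sketch of that reading; `DECISION_MATRIX`, `FRAMEWORK_GOALS`, and `choose_integration` are illustrative names that do not exist in any Cosmos package, and the goal keywords are simplified labels for the table rows.

```python
# Illustrative encoding of the decision matrix; not part of any Cosmos package.
DECISION_MATRIX = {
    ("generator", "research"): "Diffusers",
    ("generator", "production"): "vLLM-Omni",
    ("reasoner", "research"): "Transformers (coming soon)",
    ("reasoner", "production"): "vLLM",
}

# Goals the matrix routes to Cosmos Framework on either surface.
FRAMEWORK_GOALS = {"setup", "training", "evaluation"}


def choose_integration(surface: str, goal: str) -> str:
    """Return the integration named in the matrix for a surface and goal."""
    surface, goal = surface.lower(), goal.lower()
    if surface not in {"reasoner", "generator"}:
        raise ValueError(f"unknown surface: {surface!r}")
    if goal in FRAMEWORK_GOALS:
        return "Cosmos Framework"
    try:
        return DECISION_MATRIX[(surface, goal)]
    except KeyError:
        raise ValueError(f"unknown goal: {goal!r}") from None
```

For example, `choose_integration("Reasoner", "production")` follows the Note above and returns `"vLLM"` rather than vLLM-Omni.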

## Pick a surface, then a backend

```text
                            Cosmos 3
                                |
             +------------------+------------------+
             |                                     |
         Reasoner                              Generator
        (text out)                   (vision / sound / action out)
             |                                     |
      +------+------+               +--------------+--------------+
      |             |               |              |              |
Transformers      vLLM          Diffusers      vLLM-Omni  Cosmos Framework
   (soon)     (production)     (research)    (production) (torchrun / train)
      |             |               |              |              |
      +-------------+---------------+--------------+--------------+
                                    |
                  Cosmos Framework (all cookbook paths)
```

### Reasoner integrations

| Integration | Best for | API / runtime | Cookbook coverage |
| --- | --- | --- | --- |
| **vLLM** | Production serving, video + image workloads | `vllm serve` with `--hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}'`; chat completions | `run_with_vllm.ipynb` — captioning, temporal localization, embodied reasoning, grounding, action CoT, physical plausibility |
| **Transformers** | Research (planned) | Hugging Face inference | Not yet available; see cookbook env **Transformers (coming soon)** |
| **Cosmos Framework** | Native inference, scaling Nano → Super, benchmark hooks | `cosmos_framework.scripts.inference` with `--parallelism-preset=latency`; JSON inputs with `model_mode: "reasoner"` | `run_with_cosmos_framework.ipynb` — image-focused (`vision_path`); video examples documented under vLLM |

Reasoner vLLM installs must keep the **CUDA driver, torch backend, and vLLM version** in step: use `cu130` with `vllm==0.21.0`, or `cu128` with `vllm==0.19.1`, and add the `vllm-cosmos3` plugin from `cosmos-framework`. Do not rely on `--torch-backend=auto` for vLLM wheels.
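
The pairing can be captured in a small table so an install script fails early on an unsupported combination. This is a sketch only: the two version pins are copied from the paragraph above, while `VLLM_PINS` and `vllm_requirement` are hypothetical helper names.

```python
# Version pairs taken from the text above; the helper itself is hypothetical.
VLLM_PINS = {
    "cu130": "vllm==0.21.0",
    "cu128": "vllm==0.19.1",
}


def vllm_requirement(torch_backend: str) -> str:
    """Return the vLLM pin that matches an explicit torch backend tag."""
    if torch_backend == "auto":
        # The text above advises against 'auto' for vLLM wheels.
        raise ValueError("choose cu130 or cu128 explicitly instead of 'auto'")
    try:
        return VLLM_PINS[torch_backend]
    except KeyError:
        raise ValueError(f"no documented vLLM pin for {torch_backend!r}") from None
```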

### Generator integrations

| Integration | Best for | API / runtime | Cookbook coverage |
| --- | --- | --- | --- |
| **Diffusers** | Research, training, pipeline experimentation | `Cosmos3OmniPipeline`; modes: `text-to-image`, `text-to-video`, `image-to-video`, `text-to-video-with-sound` | Audiovisual only (`run_with_diffusers.ipynb`) |
| **vLLM-Omni** | Production image/video/sound/action serving | Docker `vllm/vllm-omni:cosmos3` or PR-branch install; endpoints in README quickstart | Audiovisual + action (`run_with_vllm_omni.ipynb`, `run_fd_with_vllm.ipynb`, `run_id_with_vllm.ipynb`) |
| **Cosmos Framework** | Full modality matrix, multi-GPU torchrun, OSS benchmark path | `torchrun -m cosmos_framework.scripts.inference` with JSON specs and `--checkpoint-path` | Audiovisual + forward/inverse dynamics |

<Warning>
vLLM-Omni upstreaming is in progress. The **`vllm/vllm-omni:cosmos3`** image supports every Generator modality (including video-to-video, sound, and action). A PR-branch pip install may expose only text-to-image, text-to-video, and image-to-video until follow-up PRs merge.
</Warning>

Action workflows (forward dynamics, inverse dynamics, policy) require **`extra_params`** fields such as `action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, and optionally `action_path`. Forward dynamics can use synchronous `POST /v1/videos/sync`; policy and inverse dynamics use async `POST /v1/videos` to retrieve predicted action chunks.
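
As a rough illustration of the routing described above, the sketch below assembles the `extra_params` fields and picks an endpoint. Several things here are assumptions rather than documented behavior: the mode string `"forward_dynamics"`, the placement of `extra_params` at the top level of the request body, and the helper name `build_action_request`. A real request also needs the prompt and media fields shown in the action cookbooks, which are omitted here.

```python
from typing import Any, Dict, Optional, Tuple

SYNC_ENDPOINT = "/v1/videos/sync"  # synchronous route, usable for forward dynamics
ASYNC_ENDPOINT = "/v1/videos"      # async route for policy and inverse dynamics

# Hypothetical mode value; check the action cookbooks for the real strings.
FORWARD_DYNAMICS = "forward_dynamics"


def build_action_request(
    action_mode: str,
    domain_name: str,
    raw_action_dim: int,
    action_chunk_size: int,
    action_path: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (endpoint, partial request body) for an action workflow."""
    extra_params: Dict[str, Any] = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action_chunk_size": action_chunk_size,
    }
    # action_path is optional, so it is only sent when provided.
    if action_path is not None:
        extra_params["action_path"] = action_path
    # Forward dynamics may use the sync route; other modes go to the async route.
    endpoint = SYNC_ENDPOINT if action_mode == FORWARD_DYNAMICS else ASYNC_ENDPOINT
    return endpoint, {"extra_params": extra_params}
```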

## Integration profiles

### Diffusers (Generator)

Install a dedicated venv with `diffusers` from Git, `accelerate`, `cosmos_guardrail`, and `transformers`, pinning `--torch-backend` to your driver (`cu130` or `cu128`; `auto` is acceptable here for torch). The pipeline loads **`nvidia/Cosmos3-Nano`** or **`nvidia/Cosmos3-Super`** and returns PIL images or tensors exportable via `export_to_video`.

| Attribute | Value |
| --- | --- |
| Checkpoint scope | Full omni model (reasoner + diffusion + tokenizers) |
| Typical use | Notebook iteration, scheduler tuning (`UniPCMultistepScheduler`, `flow_shift`), structured JSON prompts |
| Not in cookbooks | Action forward/inverse dynamics (use Framework or vLLM-Omni) |
| Benchmark label | **Diffusers** in `inference_benchmarks.md` |

### vLLM-Omni (Generator)

Serves **`Cosmos3OmniDiffusersPipeline`** behind OpenAI-compatible HTTP. The server is ready once the log prints `Application startup complete.`; confirm with `curl http://localhost:8000/v1/models`.
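
The same readiness check can be scripted. The sketch below uses only the Python standard library and assumes the server returns the usual OpenAI-style model list (`{"data": [{"id": ...}]}`); `parse_model_ids` and `list_served_models` are hypothetical helper names.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list:
    """GET {base_url}/v1/models and return the ids of the served models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

`list_served_models()` raises `urllib.error.URLError` while the server is still starting, so a caller would typically retry until it returns a non-empty list.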

| Parallelism option | Purpose |
| --- | --- |
| `--tensor-parallel-size N` | Split weights (required for Super at scale) |
| `--enable-layerwise-offload` | CPU/GPU block offload (latency ↔ memory) |
| `--cfg-parallel-size 2` | Parallel CFG branches; set `guidance_scale` per request |
| `--ulysses-degree 2` | Sequence-parallel attention |

GPU budget: the product `tensor_parallel_size × cfg_parallel_size × ulysses_degree` must not exceed the number of available GPUs.
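
A quick worked check of that budget, as a sketch: the function names are illustrative, and each degree defaults to 1 on the assumption that an unset option means no parallelism along that axis.

```python
def required_gpus(tensor_parallel_size: int = 1,
                  cfg_parallel_size: int = 1,
                  ulysses_degree: int = 1) -> int:
    """Multiply the three parallelism degrees into a GPU count."""
    degrees = {
        "tensor_parallel_size": tensor_parallel_size,
        "cfg_parallel_size": cfg_parallel_size,
        "ulysses_degree": ulysses_degree,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    return tensor_parallel_size * cfg_parallel_size * ulysses_degree


def fits(available_gpus: int, **degrees: int) -> bool:
    """True when the requested layout fits on the available devices."""
    return required_gpus(**degrees) <= available_gpus
```

With `--tensor-parallel-size 2 --cfg-parallel-size 2 --ulysses-degree 2` the product is 8, so `fits(4, tensor_parallel_size=2, cfg_parallel_size=2, ulysses_degree=2)` is `False`, while dropping the Ulysses degree to 1 fits on four GPUs.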

### vLLM (Reasoner)

Serves **`Cosmos3ReasonerForConditionalGeneration`** for chat completions. Key flags:

| Flag | Role |
| --- | --- |
| `--mm-encoder-tp-mode data` | Data-parallel visual encoder |
| `--async-scheduling` | Throughput-oriented scheduling |
| `--allowed-local-media-path` | Required for local `file://` media |
| `--media-io-kwargs '{"video": {"num_frames": -1}}'` | Let processor see all frames before downstream sampling |

If DeepGEMM is unavailable: `export VLLM_USE_DEEP_GEMM=0` before `vllm serve`.
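
The flags above can be assembled programmatically, which avoids shell-quoting mistakes in the JSON arguments. This is a sketch under assumptions: the checkpoint identifier and media directory are caller-supplied placeholders, and `reasoner_serve_argv` and `serve_env` are hypothetical helpers. Only the flags and the environment variable come from this page.

```python
import json
import shlex


def reasoner_serve_argv(model: str, media_root: str) -> list:
    """Build the `vllm serve` argument list using the flags listed above."""
    return [
        "vllm", "serve", model,
        "--hf-overrides",
        json.dumps({"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}),
        "--mm-encoder-tp-mode", "data",
        "--async-scheduling",
        "--allowed-local-media-path", media_root,
        "--media-io-kwargs", json.dumps({"video": {"num_frames": -1}}),
    ]


def serve_env(base_env: dict, deep_gemm_available: bool) -> dict:
    """Copy the environment and disable DeepGEMM when it is not installed."""
    env = dict(base_env)
    if not deep_gemm_available:
        env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


# Placeholder arguments; substitute a real checkpoint id and media directory.
argv = reasoner_serve_argv("<reasoner-checkpoint>", "/data/media")
print(shlex.join(argv))  # shell-safe command line for copy and paste
```

Passing `argv` and `serve_env(os.environ, deep_gemm_available=False)` to `subprocess.run` would launch the server without going through a shell.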

### Transformers (Reasoner, coming soon)

Transformers is documented as the future Python-first Reasoner path, the counterpart to Diffusers on the Generator side. Cookbooks and environment setup reserve a **Transformers** section, but no runnable Reasoner Transformers notebook ships in this repo yet.

### Cosmos Framework (both surfaces)

Clone `https://github.com/NVIDIA/cosmos-framework.git` to `packages/cosmos3`, then:

```bash
export GIT_LFS_SKIP_SMUDGE=1
uv sync --all-extras --group=cu130-train   # or cu128-train on CUDA 12.x
```

Inference currently imports training extras, so use the `*-train` groups even for inference-only setups. Notebooks honor `COSMOS3_UV_GROUP` (default `cu130-train`).
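
A script that mirrors the notebooks' group selection might look like the sketch below. The precedence (an explicit `COSMOS3_UV_GROUP` wins over a detected CUDA major version) is an assumption, and `resolve_uv_group` is a hypothetical helper. The group names and the default come from this page.

```python
import os
from typing import Mapping, Optional

DEFAULT_UV_GROUP = "cu130-train"
GROUP_BY_CUDA_MAJOR = {13: "cu130-train", 12: "cu128-train"}


def resolve_uv_group(cuda_major: Optional[int] = None,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the uv dependency group for `uv sync --group=...`."""
    env = os.environ if env is None else env
    override = env.get("COSMOS3_UV_GROUP")
    if override:
        return override  # assumed precedence: explicit override wins
    if cuda_major is None:
        return DEFAULT_UV_GROUP
    try:
        return GROUP_BY_CUDA_MAJOR[cuda_major]
    except KeyError:
        raise ValueError(f"no documented group for CUDA {cuda_major}.x") from None
```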

| Workflow | Command pattern |
| --- | --- |
| Generator audiovisual | `torchrun --nproc-per-node=1 -m cosmos_framework.scripts.inference --parallelism-preset=throughput -i <spec.json> -o <out> --checkpoint-path Cosmos3-Nano` |
| Reasoner | `.venv/bin/python -m cosmos_framework.scripts.inference --parallelism-preset=latency -i <reasoner.json> -o <out> --checkpoint-path Cosmos3-Nano --benchmark` |
| Super scale-out | Increase `--nproc-per-node` / `torchrun` world size per notebook |
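
The command patterns in the table differ only in launcher, preset, and the `--benchmark` flag, which the following sketch makes explicit. `framework_inference_argv` is a hypothetical helper; the flags, presets, and launchers are copied from the table, and the spec and output paths are placeholders.

```python
def framework_inference_argv(surface: str,
                             spec_path: str,
                             output_dir: str,
                             checkpoint: str = "Cosmos3-Nano",
                             nproc_per_node: int = 1) -> list:
    """Build the Cosmos Framework inference command for one surface."""
    if surface == "generator":
        # Generator runs go through torchrun; raise nproc_per_node to scale out.
        launcher = ["torchrun", f"--nproc-per-node={nproc_per_node}"]
        preset, extra = "throughput", []
    elif surface == "reasoner":
        # The Reasoner pattern in the table uses the venv interpreter directly,
        # so nproc_per_node is ignored here.
        launcher = [".venv/bin/python"]
        preset, extra = "latency", ["--benchmark"]
    else:
        raise ValueError(f"unknown surface: {surface!r}")
    return launcher + [
        "-m", "cosmos_framework.scripts.inference",
        f"--parallelism-preset={preset}",
        "-i", spec_path,
        "-o", output_dir,
        "--checkpoint-path", checkpoint,
    ] + extra
```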

Ecosystem role: **Cosmos Framework** is the end-to-end Physical AI framework for training and serving; **Cosmos Curator** and **Cosmos Evaluator** sit beside it for data curation and automated evaluation. Post-training recipes for vision, action, and reasoner adaptation are marked **[Coming Soon]** in the root README.

## Cookbook backend map

| Cookbook area | Cosmos Framework | Diffusers | vLLM-Omni | vLLM | Transformers |
| --- | :---: | :---: | :---: | :---: | :---: |
| Generator · audiovisual | ✓ | ✓ | ✓ | — | — |
| Generator · action (fd / id) | ✓ | — | ✓ | — | — |
| Reasoner | ✓ (image-primary) | — | — | ✓ (image + video) | soon |

Shared environment steps (HF auth, CUDA tags, Docker pull, GPU probe) live in the Cosmos3 cookbooks environment guide—install only the backend sections you need.
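
To decide which backend sections to install, the map above can be queried in both directions. The sketch below encodes only the checked cells (Transformers is left out because it is not yet available); the area keys and function name are illustrative.

```python
# Checked cells from the cookbook backend map; keys are illustrative labels.
COOKBOOK_BACKENDS = {
    "generator-audiovisual": {"Cosmos Framework", "Diffusers", "vLLM-Omni"},
    "generator-action": {"Cosmos Framework", "vLLM-Omni"},
    "reasoner": {"Cosmos Framework", "vLLM"},
}


def areas_covered_by(backend: str) -> list:
    """List the cookbook areas that ship a notebook for the given backend."""
    return sorted(area for area, backends in COOKBOOK_BACKENDS.items()
                  if backend in backends)


# Backends that appear in every cookbook area.
everywhere = set.intersection(*COOKBOOK_BACKENDS.values())
```

Here `everywhere` contains only `"Cosmos Framework"`, which matches its role as the one backend spanning both surfaces.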

## Benchmarks and evaluation

`inference_benchmarks.md` compares Generator engines:

| Benchmark label | Integration |
| --- | --- |
| **PyTorch** | Cosmos Framework OSS reference inference (CUDA graphs where supported) |
| **vLLM-Omni** | Total pipeline time at 720p on listed GPUs |
| **Diffusers** | End-to-end `Cosmos3OmniPipeline` without custom CUDA graphs |

Reasoner benchmarks cover **vLLM** serving only (TTFT, latency, throughput at concurrency 1/64/128/256). Empty benchmark cells mean **not yet measured**, not unsupported.

For evaluation pipelines beyond latency tables, use **Cosmos Evaluator** from the wider ecosystem, along with Framework-side workflows as they ship.

## Common fork points

<AccordionGroup>
<Accordion title="I need the fastest path to one Generator video">
Use **Quickstart** paths: Diffusers `Cosmos3OmniPipeline` in-process, or **vLLM-Omni** `curl` against `/v1/videos/sync`. First Diffusers run downloads Nano and runs full diffusion steps—long wall times are expected.
</Accordion>
<Accordion title="I am building a production API behind load balancers">
**Generator → vLLM-Omni** (HTTP, guardrails, tensor parallel for Super). **Reasoner → vLLM** (chat completions, multimodal messages). Keep surfaces on separate services.
</Accordion>
<Accordion title="I am modifying schedulers, prompts, or model code">
**Generator → Diffusers** for in-notebook changes; **Reasoner → Transformers** when available. Use **Cosmos Framework** when you need `torchrun`, JSON job specs, or parity with NVIDIA training stacks.
</Accordion>
<Accordion title="I need action-conditioned rollouts or robot policies">
**Cosmos Framework** or **vLLM-Omni** only. Diffusers cookbooks do not cover action. Pass `domain_name`, `action_mode`, and trajectory files as shown in the action cookbook assets.
</Accordion>
<Accordion title="I need Reasoner outputs on long videos">
Prefer the **vLLM** Reasoner cookbook (`run_with_vllm.ipynb`). Framework Reasoner cookbooks currently emphasize **image** inputs via `vision_path`.
</Accordion>
</AccordionGroup>

## Environment prerequisites (all paths)

- Linux, NVIDIA GPU (Ampere, Hopper, Blackwell tested at BF16)
- Hugging Face gated-model auth: `uvx hf@latest auth login`
- CUDA **13.x → `cu130`** or **12.x → `cu128`** driver pairing for Framework, vLLM, and explicit torch installs
- Framework + vLLM: git access to `NVIDIA/cosmos-framework` (for `vllm-cosmos3` plugin and Framework clone)

<Tip>
Install backends in isolation—Diffusers venv, vLLM venv, Framework checkout, and vLLM-Omni Docker—so CUDA/torch/vLLM version pins do not conflict. The cookbook environment page lists verification probes for each.
</Tip>

## Related pages

<CardGroup>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT surfaces, modality matrix, and when to call each runtime.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup, CUDA tags, and GPU verification for every backend.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run commands per integration.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Latency tables across PyTorch, vLLM-Omni, Diffusers, and Reasoner vLLM.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Cosmos3OmniPipeline modes and export paths.
</Card>
<Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
Docker serve, parallelism, guardrails, and action endpoints.
</Card>
<Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
Serve flags, chat message shape, and reasoning prompt suffix.
</Card>
<Card title="Run Generator with Cosmos Framework" href="/run-generator-cosmos-framework">
torchrun inference, presets, and JSON specs.
</Card>
</CardGroup>
