# Model family

> Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`
- `inference_benchmarks.md`
- `cookbooks/cosmos3/generator/action/README.md`

---

---
title: "Model family"
description: "Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving."
---

Cosmos 3 ships five gated Hugging Face checkpoints under the [NVIDIA Cosmos 3 collection](https://huggingface.co/collections/nvidia/cosmos3). Each repo ID (`nvidia/Cosmos3-*`) is the canonical `from_pretrained`, `vllm serve`, and `--checkpoint-path` string across Diffusers, vLLM-Omni, vLLM, and Cosmos Framework. Omnimodal **Nano** (16B) and **Super** (64B) checkpoints expose both **Reasoner** (text out) and **Generator** (vision/sound/action out) surfaces; the three specialized variants narrow the Generator path to a single modality family or robot policy.

## Checkpoint catalog

| Hugging Face ID | Params | Primary capability | Typical surfaces |
| --- | ---: | --- | --- |
| [`nvidia/Cosmos3-Nano`](https://huggingface.co/nvidia/Cosmos3-Nano) | 16B | Compact omnimodal world model: multimodal understanding, world simulation, future prediction, action reasoning, Physical AI | Reasoner + Generator (full omni stack) |
| [`nvidia/Cosmos3-Super`](https://huggingface.co/nvidia/Cosmos3-Super) | 64B | Frontier-scale omnimodal world model with the same modality coverage at higher capacity | Reasoner + Generator (full omni stack) |
| [`nvidia/Cosmos3-Super-Text2Image`](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image) | 64B | High-fidelity text-to-image generation | Generator (image output) |
| [`nvidia/Cosmos3-Super-Image2Video`](https://huggingface.co/nvidia/Cosmos3-Super-Image2Video) | 64B | Temporally coherent image-to-video generation | Generator (video output) |
| [`nvidia/Cosmos3-Nano-Policy-DROID`](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID) | 16B | Vision-language robot policy for DROID manipulation and control | Generator (action + vision; policy-oriented) |

<Note>
All Cosmos 3 model repos are gated. Authenticate before the first download (`uvx hf@latest auth login` or `HF_TOKEN`). Set `HF_HOME` when you need a shared or larger cache volume.
</Note>

## Omnimodal vs specialized checkpoints

```text
                    ┌─────────────────────────────────────┐
                    │     nvidia/Cosmos3-Nano (16B)       │
                    │     nvidia/Cosmos3-Super (64B)      │
                    └──────────────┬──────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        Reasoner mode        Generator mode      (shared MoT weights)
     text/vision → text   multimodal → vision,
                          sound, action

  Specialized (64B / 16B) — Generator-focused subsets:
  • Cosmos3-Super-Text2Image     → text → image
  • Cosmos3-Super-Image2Video    → image → video
  • Cosmos3-Nano-Policy-DROID    → DROID policy / control
```

**Nano** and **Super** are the checkpoints documented end-to-end in this repository’s cookbooks: audiovisual generation, action forward/inverse dynamics, and Reasoner understanding workflows all reference `Cosmos3-Nano` or `Cosmos3-Super` by default. The **Text2Image**, **Image2Video**, and **Policy-DROID** repos are first-class catalog entries for deployments that want a narrower Generator specialization without loading the full omni diffusion stack for every request.

## Reasoner vs Generator on the same checkpoint

A single omnimodal weight file supports two runtime modes. Integration choice determines which path loads:

| Surface | Inputs | Outputs | Load pattern |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image/video) | Text | vLLM with `Cosmos3ReasonerForConditionalGeneration` override; Framework `model_mode: "reasoner"` |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Diffusers `Cosmos3OmniPipeline`; vLLM-Omni `Cosmos3OmniDiffusersPipeline`; Framework generator JSON specs |

<Info>
vLLM-Omni for Generator loads the **full** checkpoint (reasoner + diffusion paths). For text-only understanding at scale, serve Reasoner through **vLLM** with the architecture override instead of vLLM-Omni — lower memory and an OpenAI chat-completions API.
</Info>

Reasoner workloads in cookbooks include captioning, temporal localization, embodied/common-sense reasoning, 2D grounding, describe-anything, action chain-of-thought, physical plausibility, and situation understanding. Generator workflows span text-to-image/video (with optional sound), image-to-video, video-to-video, forward dynamics, inverse dynamics, and action policy rollouts.

## Size and serving tradeoffs

### Nano (16B) — default for development and single-GPU serving

| Dimension | Nano behavior |
| --- | --- |
| **GPU footprint** | Fits single-GPU Reasoner and Generator paths in cookbooks (`--tensor-parallel-size 1`, `torchrun --nproc-per-node=1`) |
| **Latency** | Fastest omnimodal Generator latencies in published tables; Reasoner TTFT/throughput benchmarks use `nvidia/Cosmos3-Nano` |
| **Coverage** | Action cookbooks (AV, DROID, UMI forward/inverse dynamics) and audiovisual examples target Nano |
| **When to choose** | Prototyping, constrained hardware, action-world-model experiments, high-throughput Reasoner at moderate concurrency |

Example Reasoner serve (single GPU):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --tensor-parallel-size 1 \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Example Generator serve (vLLM-Omni Docker, single node):

```bash
vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

### Super (64B) — quality-first, multi-GPU serving

| Dimension | Super behavior |
| --- | --- |
| **GPU footprint** | Reasoner cookbook and vLLM-Omni docs use **4-way tensor parallelism** (`--tensor-parallel-size 4`) as the tested Super configuration |
| **Memory relief** | `--enable-layerwise-offload` trades latency for lower peak VRAM by moving transformer blocks between CPU and GPU |
| **Extra parallelism** | Optional `--cfg-parallel-size` (CFG branches) and `--ulysses-degree` (sequence parallel); GPU count must cover `tensor_parallel × cfg_parallel × ulysses` |
| **Latency** | Generator benchmarks show longer diffusion times than Nano at the same resolution; expect higher quality, not higher FPS |
| **When to choose** | Production Reasoner quality (default in `run_with_vllm.ipynb`), 720p Generator at scale, frontier audiovisual fidelity |

Example Super Generator (four GPUs + offload):

```bash
vllm serve nvidia/Cosmos3-Super \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --tensor-parallel-size 4 \
  --enable-layerwise-offload \
  --allowed-local-media-path / \
  --port 8000
```

Cosmos Framework scales Super by increasing `--nproc-per-node` and setting `--checkpoint-path Cosmos3-Super` (audiovisual quickstart pattern).

### Specialized checkpoints

| Checkpoint | Serving implication |
| --- | --- |
| **Super-Text2Image** | Deploy when the fleet only serves `POST /v1/images/generations` or Diffusers `text-to-image` (`num_frames=1`); avoids carrying full video/action modules if your integration supports variant-specific weights |
| **Super-Image2Video** | Deploy for image-conditioned video only; pairs with `input_reference` uploads and i2v sampler settings |
| **Nano-Policy-DROID** | 16B policy head for DROID manipulation; distinct from general **Nano** action forward/inverse dynamics cookbooks, which use the omnimodal Generator on `Cosmos3-Nano` |

<Warning>
This repository’s runnable notebooks standardize on **Cosmos3-Nano** and **Cosmos3-Super** omnimodal IDs. Before production rollout on Text2Image, Image2Video, or Policy-DROID, confirm your serving stack (Diffusers pipeline class, vLLM-Omni model card, Framework checkpoint resolver) accepts the specialized repo ID.
</Warning>

## Integration matrix by checkpoint

| Backend | Nano | Super | Text2Image / Image2Video / Policy-DROID |
| --- | --- | --- | --- |
| **Diffusers** (`Cosmos3OmniPipeline`) | Documented (`nvidia/Cosmos3-Nano`) | Documented (`nvidia/Cosmos3-Super`) | Use matching HF ID if pipeline supports variant configs |
| **vLLM-Omni** | Default Docker quickstart | TP=4 + optional offload | Not covered in cookbooks |
| **vLLM** (Reasoner) | Quickstart + benchmarks | Notebook default (TP=4) | N/A (Reasoner-only path) |
| **Cosmos Framework** | `--checkpoint-path Cosmos3-Nano` | `--checkpoint-path Cosmos3-Super` + multi-GPU `torchrun` | Confirm in Framework docs before training/inference |

Disk planning: cookbook setup notes that **Nano** downloads plus CUDA dependencies can consume tens of GiB; **Super** multiplies weight storage and often requires four GPUs for the documented serving paths.

## Benchmark anchors (omnimodal checkpoints)

Published numbers in [`inference_benchmarks.md`](inference_benchmarks.md) compare **Nano** and **Super** Generator diffusion latency (seconds) across PyTorch, vLLM-Omni, and Diffusers at 256p/480p/720p and tensor-parallel widths 1/4/8. **Nano Reasoner** tables report vLLM TTFT, request latency, and throughput at client concurrency 1/64/128/256 — not diffusion time.

Representative Generator signals (720p t2v, seconds, lower is better):

| GPU | Engine | Nano 720p/1 | Super 720p/1 (where measured) |
| --- | --- | ---: | ---: |
| B200 | PyTorch | 114.85 | 407.50 |
| B200 | vLLM-Omni | 107.84 | 390.28 |
| H100 NVL | Diffusers | 324.20 | — |

Representative Reasoner signals (Nano, B200, Input 50 / Output 100 / Video 1 FPS):

| Metric | Concurrency 1 | Concurrency 256 |
| --- | ---: | ---: |
| TTFT (ms) | 115.27 | 2549.79 |
| Output token throughput (Tok/s) | 180.16 | 2701.08 |

Empty benchmark cells mean **not yet measured**, not unsupported. See the full tables on the inference benchmarks page.

## Choosing a checkpoint

<Steps>
<Step title="Pick the surface">
Need text understanding (caption, VQA, grounding, planning)? → Reasoner on **Nano** or **Super**. Need images, video, sound, or action outputs? → Generator on an omnimodal or specialized checkpoint.
</Step>
<Step title="Pick the size">
Single GPU or fastest iteration → **Nano**. Maximum quality or Reasoner notebook defaults → **Super** with 4× tensor parallel.
</Step>
<Step title="Pick specialization">
Fleet serves only t2i or only i2v at 64B → consider **Super-Text2Image** or **Super-Image2Video**. DROID manipulation policy at 16B → **Nano-Policy-DROID**. General robotics world models (forward/inverse dynamics across AV, DROID, UMI) → omnimodal **Nano** Generator.
</Step>
<Step title="Match the integration">
Research Generator → Diffusers or Framework. Production Generator API → vLLM-Omni. Production Reasoner API → vLLM + `vllm-cosmos3`. Align CUDA pairs (`cu130`/`cu128`) with the cookbook environment guide before download.
</Step>
</Steps>

## Related pages

<CardGroup>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT modes, shared mRoPE, and when to call each surface on the same weights.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Diffusers vs vLLM-Omni vs vLLM vs Framework by deployment goal.
</Card>
<Card title="Inference benchmarks" href="/inference-benchmarks">
Nano/Super Generator latency tables and Nano Reasoner serving metrics.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, action dimensions, and prompt limits per modality.
</Card>
<Card title="Action modality" href="/action-modality">
Embodiment dims, `domain_name`, and `action_mode` for Generator action workflows.
</Card>
<Card title="Cookbook environment setup" href="/cookbook-environment">
HF auth, CUDA backend tags, and backend-specific install paths for each checkpoint size.
</Card>
</CardGroup>
