# Overview

> Cosmos 3 omnimodal world model surfaces (Reasoner vs Generator), primary entry points, supported modalities, and the shortest path to a first generation or reasoning call.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/cosmos3-model-architecture.png`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/reasoner/README.md`

---


Cosmos 3 is the omnimodal world model family documented in this repository. It exposes two runtime surfaces: **Reasoner**, which produces autoregressive text from text and vision inputs, and **Generator**, which produces diffusion outputs for vision, sound, and action. Runnable paths go through Hugging Face Diffusers, vLLM-Omni, vLLM, and a separate [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) checkout that the cookbooks under `cookbooks/cosmos3/` reference.

## Runtime surfaces

Cosmos 3 routes workloads through a shared Mixture-of-Transformers (MoT) backbone but switches operating mode by task:

| Surface | Inputs | Outputs | Typical workloads |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image/video) | Text | Captioning, temporal localization, grounding, embodied and common-sense reasoning, physical plausibility, situation understanding |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Text-to-image/video, image-to-video, video-to-video, synchronized sound, forward/inverse dynamics, policy rollouts |

<Note>
Reasoner production serving loads only the reasoner path (`Cosmos3ReasonerForConditionalGeneration` via vLLM). Generator production serving loads the full omni checkpoint (reasoner + diffusion) through vLLM-Omni or Diffusers.
</Note>

```mermaid
flowchart TB
  subgraph inputs [Inputs]
    T[Text]
    V[Vision image/video]
    S[Sound]
    A[Action JSON]
  end

  subgraph mot [Cosmos 3 MoT backbone]
    AR[Reasoner mode — causal AR transformer]
    DM[Generator mode — diffusion transformer]
    mRoPE[Shared mRoPE across modalities]
  end

  subgraph outputs [Outputs]
    TXT[Text]
    IMG[Image JPG]
    VID[Video MP4]
    AUD[Stereo AAC in MP4]
    ACT[Action JSON]
  end

  T --> AR
  V --> AR
  AR --> TXT

  T --> DM
  V --> DM
  S --> DM
  A --> DM
  mRoPE --- AR
  mRoPE --- DM
  DM --> IMG
  DM --> VID
  DM --> AUD
  DM --> ACT
```

In **Reasoner mode**, language and visual understanding tokens use causal self-attention for next-token prediction. In **Generator mode**, noisy image, video, audio, and action tokens are denoised with full attention so multimodal outputs stay coherent. Both modes share transformer blocks and a unified 3D multi-dimensional rotary position embedding (mRoPE) for spatial and temporal structure.
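
The difference between the two attention patterns can be shown with a toy mask. This is a minimal sketch in plain Python, not Cosmos 3 code: the function names are hypothetical, and how the real model combines these patterns across text, vision, sound, and action tokens is not covered here.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Row i may attend to column j only when j <= i (next-token prediction)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """Every token may attend to every other token (joint denoising)."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))  # lower-triangular pattern
    print()
    print(render(full_mask(4)))    # every position visible
```

The causal mask is lower-triangular, so each token sees only itself and earlier tokens. The full mask lets every noisy token see all others during a denoising step.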

<Frame caption="Cosmos 3 MoT architecture: shared backbone, Reasoner AR path vs Generator diffusion path.">
  ![Cosmos 3 model architecture](cookbooks/cosmos3/cosmos3-model-architecture.png)
</Frame>

## Supported modalities and formats

| Direction | Types | Formats / notes |
| --- | --- | --- |
| **Inputs** | Text; text + image; text + video; text + image + action | Text string; JPG/PNG/JPEG/WEBP; MP4; JSON action arrays |
| **Outputs** | Image, video, sound, action state, text | JPG; MP4; stereo AAC muxed into MP4 when generated with video; JSON actions; text string |
| **Vision conditioning** | Resolution tiers 256p / 480p / 720p | 720p: 1280×720; 480p: 832×480; 256p: 320×192; video conditioning uses 5 frames at matching resolution |
| **Action conditioning** | Embodiment-dependent dims | Examples: camera/AV 9D; DROID/UMI 10D; humanoid 29D (AgiBot) |
| **Generation defaults** | Resolution, aspect, timing | Default 480p, 16:9, 24 FPS, 189 frames; prompts under ~300 words recommended |
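
The resolution tiers and timing defaults in the table reduce to simple arithmetic. The sketch below copies the values from the table; the constant and function names are hypothetical and are not part of any Cosmos 3 API.

```python
# Width x height per tier, copied from the table above.
RESOLUTION_TIERS = {
    "720p": (1280, 720),
    "480p": (832, 480),
    "256p": (320, 192),
}
DEFAULT_FPS = 24
DEFAULT_NUM_FRAMES = 189


def clip_seconds(num_frames: int = DEFAULT_NUM_FRAMES, fps: int = DEFAULT_FPS) -> float:
    """Duration of a generated clip in seconds."""
    return num_frames / fps


def size_string(tier: str) -> str:
    """Format a tier as 'WIDTHxHEIGHT', the shape used by the HTTP `size` field."""
    width, height = RESOLUTION_TIERS[tier]
    return f"{width}x{height}"


if __name__ == "__main__":
    print(clip_seconds())        # 7.875
    print(clip_seconds(81))      # 3.375
    print(size_string("720p"))   # 1280x720
```

With the defaults, a generation is just under 8 seconds long. An 81-frame request at 24 FPS yields about 3.4 seconds.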

Action workflows treat action as tokens between visual states (9D pose deltas plus grasp state where applicable). Forward dynamics predicts future video from a start image and trajectory; inverse dynamics predicts trajectories from video; policy mode returns video plus a predicted action chunk.
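
Because action dimensionality depends on the embodiment, it can help to check a trajectory before sending it to the model. The sketch below is illustrative only. It assumes the JSON payload is a list of per-step vectors, which the source does not spell out, and the embodiment keys and the `load_actions` helper are hypothetical names.

```python
import json

# Dimensions from the table above; the dictionary keys are illustrative labels.
ACTION_DIMS = {"camera": 9, "av": 9, "droid": 10, "umi": 10, "humanoid": 29}


def load_actions(raw_json: str, embodiment: str) -> list[list[float]]:
    """Parse a JSON action array and check every step has the expected width.

    Assumption: the payload is a list of steps, each a flat list of numbers.
    """
    expected = ACTION_DIMS[embodiment]
    steps = json.loads(raw_json)
    if not isinstance(steps, list) or not steps:
        raise ValueError("expected a non-empty list of action steps")
    for index, step in enumerate(steps):
        if not isinstance(step, list) or len(step) != expected:
            raise ValueError(
                f"step {index}: expected {expected} values for '{embodiment}'"
            )
    return [[float(value) for value in step] for step in steps]


if __name__ == "__main__":
    payload = json.dumps([[0.0] * 10 for _ in range(3)])
    print(len(load_actions(payload, "droid")))  # 3
```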

## Primary entry points

Each integration serves a different goal; no single runtime covers every workload:

| Goal | Entry point | How you invoke it |
| --- | --- | --- |
| Generator research / Python iteration | **Diffusers** `Cosmos3OmniPipeline` | `from_pretrained("nvidia/Cosmos3-Nano")` then `pipe(...)` |
| Generator production API | **vLLM-Omni** | `vllm serve nvidia/Cosmos3-Nano --omni --model-class-name Cosmos3OmniDiffusersPipeline` |
| Reasoner production API | **vLLM** + `vllm-cosmos3` plugin | `vllm serve` with `Cosmos3ReasonerForConditionalGeneration` overrides |
| Native PyTorch inference / training hooks | **Cosmos Framework** | `torchrun -m cosmos_framework.scripts.inference` from `packages/cosmos3` checkout |
| Reasoner research (HF) | **Transformers** | Coming soon |

<Warning>
Keep the CUDA driver, the `--torch-backend` value (`cu130` or `cu128`), and the vLLM version consistent with each other. vLLM installs do not reliably work with `--torch-backend=auto`. Diffusers installs can use `auto` for torch, but the cookbook vLLM paths pin explicit pairs (for example `cu130` with `vllm==0.21.0`).
</Warning>
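
One way to keep the pairing explicit in setup scripts is to build the install command from a pinned pair and refuse `auto`. This is a minimal sketch: `vllm_install_command` is a hypothetical helper, and the only pair taken from the source is `cu130` with `vllm==0.21.0`.

```python
import shlex


def vllm_install_command(torch_backend: str, vllm_version: str) -> str:
    """Build a `uv pip install` command with an explicit torch backend."""
    if torch_backend == "auto":
        raise ValueError("pin an explicit backend such as cu130 or cu128 for vLLM")
    args = [
        "uv", "pip", "install",
        f"--torch-backend={torch_backend}",
        f"vllm=={vllm_version}",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(vllm_install_command("cu130", "0.21.0"))
```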

### Repository layout

:::files
cosmos/                          # This repo — cookbooks, benchmarks, docs pointers
├── README.md                      # Model family, I/O specs, quickstarts
├── inference_benchmarks.md        # Generator latency + Reasoner serving tables
└── cookbooks/cosmos3/
    ├── README.md                  # Shared uv/Docker backend setup
    ├── generator/
    │   ├── audiovisual/           # t2i, t2v, i2v (+ sound) notebooks
    │   └── action/                # Forward / inverse dynamics notebooks
    └── reasoner/                  # Image reasoning (Framework); image+video (vLLM)

packages/cosmos3/                  # Created during setup — cloned cosmos-framework
└── .venv/                         # Framework torchrun interpreter
:::

Checkpoints are hosted on Hugging Face: `nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`, the task-specific Super variants, and `nvidia/Cosmos3-Nano-Policy-DROID`. Authenticate before the first download:

```bash
uvx hf@latest auth login
```

## Shortest path to a first call

<Steps>
<Step title="Authenticate to Hugging Face">
Run `uvx hf@latest auth login` (or set `HF_TOKEN`) so gated `nvidia/Cosmos3-*` weights can download. Optional: set `HF_HOME` for a shared cache location.
</Step>
<Step title="Pick a surface and backend">
Use Generator + Diffusers for the quickest route to local Python generation without a server. Use Generator + vLLM-Omni or Reasoner + vLLM when you need an OpenAI-compatible HTTP API on port 8000.
</Step>
<Step title="Run one minimal command">
Follow the tab for your surface. Expect a long first run while weights download; diffusion Generator runs are compute-heavy by design.
</Step>
</Steps>

<Tabs>
<Tab title="Generator — Diffusers (Python)">
```python
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.utils import export_to_video

# Load the Nano checkpoint in bfloat16 directly onto the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Text-to-video at 720p: 189 frames at 24 FPS is just under 8 seconds.
result = pipe(
    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
    num_frames=189,
    height=720,
    width=1280,
    fps=24.0,
)

# Write the frames to MP4 at the same frame rate used for generation.
export_to_video(result.video, "cosmos3_t2v.mp4", fps=24, macro_block_size=1)
```

Success signal: an MP4 file appears on disk once all denoising steps complete. The call does not return immediately.
</Tab>
<Tab title="Generator — vLLM-Omni (HTTP)">
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000
```

```bash
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "guidance_scale=4.0" \
  -o cosmos3_t2v_output.mp4
```

Success signals: the server log shows `Application startup complete.`, `curl http://localhost:8000/v1/models` lists the model, and the sync POST writes the MP4 bytes to `cosmos3_t2v_output.mp4`.
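
Scripts that start the server and then send a request can poll `/v1/models` until the model is listed. The following is a minimal sketch using only the standard library. `wait_for_server` and `fetch_model_ids` are hypothetical helpers, and the timeout and polling interval are arbitrary assumptions rather than recommended values.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url: str) -> list[str]:
    """GET {base_url}/v1/models and return the listed model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_server(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    fetch=fetch_model_ids,
    sleep=time.sleep,
) -> list[str]:
    """Poll until the server lists at least one model, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            model_ids = fetch(base_url)
            if model_ids:
                return model_ids
        except (urllib.error.URLError, ConnectionError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no models listed at {base_url} after {timeout_s}s")
        sleep(poll_s)
```

Calling `wait_for_server()` before the sync POST avoids sending a request while weights are still loading. The `fetch` and `sleep` parameters exist so the loop can be exercised without a live server.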
</Tab>
<Tab title="Reasoner — vLLM (HTTP)">
```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
  "vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3"

vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

```python
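import openai

# The client requires a non-empty key; a local vLLM server started without
# --api-key ignores it.
client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    # Use whichever model id the server reports instead of hard-coding it.
    model=client.models.list().data[0].id,
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL: replace it with a reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/robot.jpg"}},
            {"type": "text", "text": "Caption the image in detail."},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)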
```

Success signal: non-empty `message.content` from `/v1/chat/completions`. Messages follow Qwen3-VL-compatible image/video content shapes.
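
For video input, a message can be built in the same way with a `video_url` content part, which is the shape vLLM's OpenAI-compatible server uses for video. This sketch assumes the Cosmos 3 Reasoner plugin accepts that part unchanged. The `video_message` helper and the file path in the usage comment are hypothetical.

```python
def video_message(video_url: str, question: str) -> list[dict]:
    """Build a single-turn chat payload with one video part and one text part."""
    return [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": question},
        ],
    }]


# Usage with the client from the example above (path is a placeholder):
# messages = video_message("file:///workspace/clip.mp4", "Describe the robot's actions.")
# response = client.chat.completions.create(model=model_id, messages=messages)
```

Local `file://` URLs only resolve when the server is started with a matching `--allowed-local-media-path`, as in the serve command above.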
</Tab>
</Tabs>

Cookbook quickstarts under `cookbooks/cosmos3/` mirror these paths with checked-in JSON prompts. For example, the audiovisual Generator examples use `assets/prompts/text2video/robot_kitchen.json`, and the Reasoner Framework smoke tests write `reasoner_text.txt` under an output directory.

## Model family (at a glance)

| Checkpoint | Size | Focus |
| --- | ---: | --- |
| `nvidia/Cosmos3-Nano` | 16B | Compact omnimodal model for understanding, simulation, and action reasoning |
| `nvidia/Cosmos3-Super` | 64B | Frontier-scale omnimodal model |
| `nvidia/Cosmos3-Super-Text2Image` | 64B | High-fidelity text-to-image |
| `nvidia/Cosmos3-Super-Image2Video` | 64B | Image-to-video |
| `nvidia/Cosmos3-Nano-Policy-DROID` | 16B | Vision-language robot policy for DROID |

Super Generator serving typically needs multi-GPU tensor parallelism (`--tensor-parallel-size`) and optional layerwise offload; Nano fits single-GPU cookbook defaults.
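
A rough weight-memory estimate shows why Super usually needs more than one GPU. The sketch below assumes that the sizes in the table are parameter counts, that weights are stored in bfloat16 (2 bytes per parameter), and that tensor parallelism shards them evenly. It ignores activations, caches, and any auxiliary components, so real requirements are higher. The function names are hypothetical.

```python
BYTES_PER_PARAM_BF16 = 2


def weight_gb(params_billions: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


def per_gpu_weight_gb(params_billions: float, tensor_parallel_size: int) -> float:
    """Weight footprint per GPU, assuming perfectly even sharding."""
    return weight_gb(params_billions) / tensor_parallel_size


if __name__ == "__main__":
    print(weight_gb(16))              # 32.0  (Nano)
    print(weight_gb(64))              # 128.0 (Super)
    print(per_gpu_weight_gb(64, 8))   # 16.0
```

Under these assumptions, Nano's weights take about 32 GB and Super's about 128 GB, which is more than a single 80 GB accelerator holds before any working memory is counted.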

## Ecosystem and constraints

| Project | Role |
| --- | --- |
| [Cosmos Framework](https://github.com/NVIDIA/cosmos-framework) | Training, native `torchrun` inference, `vllm-cosmos3` plugin source |
| [Cosmos Curator](https://github.com/NVIDIA/cosmos-curator) | Physical AI data curation |
| [Cosmos Evaluator](https://github.com/NVIDIA/cosmos-evaluator) | Automated evaluation for generation and reasoning outputs |

Cosmos 3 can show temporal inconsistency, motion artifacts, sound–video misalignment, and imperfect action consistency. Safety-critical or physically grounded deployment needs validation beyond model outputs. Source and weights use the [OpenMDW-1.1](https://openmdw.ai/license/1-1/) license.

## Related pages

<CardGroup>
<Card title="Installation" href="/installation">
Prerequisites, CUDA pairing, venv and Docker setup, GPU verification.
</Card>
<Card title="Quickstart" href="/quickstart">
Minimal first-run commands for Generator and Reasoner with expected success signals.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT modes, shared mRoPE, and when to use each surface.
</Card>
<Card title="Choose an integration" href="/choose-integration">
Diffusers vs vLLM-Omni vs vLLM vs Cosmos Framework by goal.
</Card>
<Card title="Model family" href="/model-family">
Full checkpoint catalog and serving tradeoffs.
</Card>
<Card title="Cookbook environment" href="/cookbook-environment">
Shared backend setup for all `cookbooks/cosmos3` notebooks.
</Card>
</CardGroup>