# Reasoner and Generator

> MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/cosmos3-model-architecture.png`
- `cookbooks/cosmos3/reasoner/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/action/README.md`

---

---
title: "Reasoner and Generator"
description: "MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface."
---

Cosmos 3 is a single Mixture-of-Transformers (MoT) checkpoint that exposes two runtime surfaces: an autoregressive **Reasoner** path (causal attention over language and visual-understanding tokens) and a diffusion **Generator** path (full attention over noisy vision, audio, and action tokens). Integrations select the active path through serve flags, pipeline class, or Framework `model_mode`; both paths share transformer layers and a unified 3D multi-dimensional rotary position embedding (mRoPE).

## Two runtime surfaces

| Surface | Inputs | Outputs | Primary workloads |
| --- | --- | --- | --- |
| **Reasoner** | Text, vision (image or video) | Text | Captioning, temporal localization, grounding, embodied and common-sense reasoning, action forecasting, physical plausibility, situation understanding |
| **Generator** | Text, vision, sound, action | Vision, sound, action | Text-to-image/video, image-to-video, video-to-video, synchronized sound, forward/inverse dynamics, policy rollouts, synthetic data |

<Info>
Reasoner and Generator are not separate model families. They are two forward modes through the same Cosmos 3 weights, distinguished by which token subsequence is active and which attention mask applies.
</Info>

## MoT architecture

Cosmos 3 combines an autoregressive (AR) transformer subsequence for reasoning with a diffusion (DM) subsequence for multimodal generation. The stack repeats **L** shared layers; each layer applies layer norm, **shared multimodal attention**, and an MLP. Token encoders sit upstream:

| Subsequence | Encoders | Token types |
| --- | --- | --- |
| **AR (Reasoner)** | Vision encoder (ViT) | Visual understanding tokens `v^AR` |
| **AR (Reasoner)** | Language tokenizer | Language tokens `l`, plus specials such as `EOS` and `BOG` |
| **DM (Generator)** | Vision encoder (VAE) | Noisy vision tokens `v^DM` |
| **DM (Generator)** | Audio encoder | Sound tokens `s` |
| **DM (Generator)** | Action encoder | Action tokens `a` |

<Frame caption="Cosmos 3 MoT diagram: Reasoner Mode (causal AR) vs Generator Mode (full DM attention), shared layers, and attention mask regions.">
![Cosmos 3 model architecture](/cookbooks/cosmos3/cosmos3-model-architecture.png)
</Frame>

```mermaid
flowchart TB
  subgraph encoders["Input encoders"]
    ViT["ViT vision encoder → v^AR"]
    Lang["Language tokenizer → l, EOS, BOG"]
    VAE["VAE vision encoder → v^DM noisy"]
    Audio["Audio encoder → s"]
    Action["Action encoder → a"]
  end

  subgraph stack["Shared transformer × L"]
    direction TB
    subgraph reasonerPath["Reasoner Mode — AR subsequence"]
      LN_AR["Layer norm"]
      Causal["Causal self-attention<br/>Attn(Q_AR, K_AR, V_AR)"]
      MLP_AR["MLP → next language tokens"]
    end
    subgraph genPath["Generator Mode — DM subsequence"]
      LN_DM["Layer norm"]
      Full["Full attention<br/>Attn(Q_DM, [K_AR;K_DM], [V_AR;V_DM])"]
      MLP_DM["MLP → denoised v^DM, s, a"]
    end
    MM["Shared multimodal attention block"]
  end

  ViT --> reasonerPath
  Lang --> reasonerPath
  VAE --> genPath
  Audio --> genPath
  Action --> genPath
  Causal --> MM
  Full --> MM
```

### Reasoner mode (autoregressive)

In Reasoner mode, language and visual-understanding tokens flow through the AR subsequence. Attention within AR is **causal** (lower-triangular mask): each AR query attends only to prior AR keys and values. AR queries are **masked from DM keys** — the reasoner path cannot read noisy diffusion tokens. The forward pass performs next-token prediction for perception, planning, and world reasoning tasks.

Production Reasoner serving loads only the reasoner head:

```shell
vllm serve nvidia/Cosmos3-Nano \
  --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
  --async-scheduling \
  --allowed-local-media-path / \
  --port 8000
```

Cosmos Framework selects the same path with an explicit mode flag in the input JSON:

```json
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://example.com/robot_153.jpg",
  "enable_sound": false
}
```

<Warning>
On the current Framework Reasoner path, set `"enable_sound": false` in reasoner JSON inputs. Omitting it can trigger strict argument-validation failures. Framework Reasoner quickstarts also expect image conditioning via `vision_path`; video-heavy workflows are documented primarily against the vLLM Reasoner cookbook.
</Warning>

Reasoner requests follow **Qwen3-VL-compatible** chat messages (`image_url`, `video_url`, `text`). For chain-of-thought style answers, append the `redacted_reasoning` format instruction to the user prompt (see [Sampling and prompt parameters](/sampling-and-prompt-parameters)).

### Generator mode (diffusion)

In Generator mode, noisy image, video, audio, and action tokens occupy the DM subsequence. DM queries use **full attention**: each DM query attends to the concatenated AR and DM keys and values (`[K_AR; K_DM]`, `[V_AR; V_DM]`). That lets conditioning text and visual understanding tokens influence denoising while AR tokens remain blind to DM state. The output is coherent multimodal media — images, MP4 video (optionally with AAC sound), and JSON action chunks.

Typical Generator integrations load the **full** Cosmos 3 checkpoint (reasoner + diffusion paths + media tokenizers):

| Integration | Entry class / command | API shape |
| --- | --- | --- |
| Diffusers | `Cosmos3OmniPipeline.from_pretrained(...)` | Python `pipe(...)` → PIL image or video tensor |
| vLLM-Omni | `vllm serve ... --omni --model-class-name Cosmos3OmniDiffusersPipeline` | OpenAI-compatible `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync` |
| Cosmos Framework | `torchrun -m cosmos_framework.scripts.inference` | JSON input specs under cookbook `assets/` |

Diffusers research installs note that the pipeline includes the reasoner path, diffusion generation path, and media tokenizers even when you only call generation APIs.

### Shared mRoPE and layers

Both modes reuse the same transformer depth, multimodal attention layers, and a unified **3D mRoPE** that encodes spatial and temporal structure across modalities. mRoPE gives consistent position coding when the model reasons over images, video frames, audio streams, and action trajectories in one sequence — whether those tokens are processed causally (Reasoner) or denoised with full context (Generator).

## Input and output contracts

| Contract | Reasoner | Generator |
| --- | --- | --- |
| Text in | Prompts, questions, instructions | Structured JSON scene prompts (often upsampled), negative prompts |
| Vision in | Images, videos (Qwen3-VL message URLs or Framework `vision_path`) | Conditioning images/videos, VAE-encoded frames |
| Sound in | Not on Reasoner output path | Optional input soundtrack for video-to-video-with-sound |
| Action in | Reasoning about actions (forecasting, CoT) | Trajectories for forward dynamics; video+instruction for inverse dynamics and policy |
| Text out | Captions, JSON boxes, labels, chain-of-thought | — |
| Vision out | — | JPG/PNG images, MP4 video |
| Sound out | — | Stereo AAC at 48 kHz muxed into MP4 when enabled |
| Action out | — | JSON action values (policy, inverse dynamics) |

Supported generation settings (resolution tiers 256p–720p, aspect ratios, frame rates, frame counts) apply to Generator outputs. Reasoner sampling parameters (`temperature`, `top_p`, `top_k`, `presence_penalty`) differ for plain answers versus explicit reasoning prompts — see [Sampling and prompt parameters](/sampling-and-prompt-parameters).

Action semantics for Generator workflows (9D ego pose, 10D DROID/UMI end-effector+gripper, `domain_name`, `action_mode`) are documented on [Action modality](/action-modality).

## When to use each surface

| Goal | Use | Avoid |
| --- | --- | --- |
| Understand a scene, localize events, ground objects, judge physics | **Reasoner** | Generator endpoints (they return media, not analysis text) |
| Produce or simulate visuals, sound, or robot trajectories | **Generator** | Reasoner-only vLLM serve (no diffusion denoising) |
| Text answers from images/video in production | Reasoner + **vLLM** (`Cosmos3ReasonerForConditionalGeneration`) | vLLM-Omni (loads full omni checkpoint; heavier for understanding-only) |
| Images/video/audio/action in production | Generator + **vLLM-Omni** | Reasoner vLLM (text-only chat completions) |
| Python-first Generator research | **Diffusers** `Cosmos3OmniPipeline` | — |
| Python-first Reasoner research | Transformers (coming soon) | — |
| Native PyTorch for either surface, training, evaluation | **Cosmos Framework** `cosmos_framework.scripts.inference` | — |

<Note>
vLLM-Omni loads the full checkpoint including the Qwen3-VL-based reasoner path **and** the diffusion path. For understanding-only tasks that return text, prefer [Run Reasoner with vLLM](/run-reasoner-vllm) instead of vLLM-Omni.
</Note>

Benchmarks treat the surfaces separately: Generator tables report **diffusion-path latency** (seconds per t2i/t2v/i2v); Reasoner tables report **vLLM serving metrics** (TTFT, request latency, throughput under concurrency), not denoising step time.

## Inference backends by surface

| Backend | Reasoner | Generator (audiovisual) | Generator (action) |
| --- | :---: | :---: | :---: |
| Cosmos Framework | ✓ | ✓ | ✓ |
| Diffusers | — | ✓ | — |
| Transformers | coming soon | — | — |
| vLLM | ✓ | — | — |
| vLLM-Omni | — | ✓ | ✓ |

Framework Reasoner runs commonly use `--parallelism-preset=latency` on a single GPU (Nano) or `torchrun` across four GPUs (Super). Generator Framework runs typically use `--parallelism-preset=throughput`. Diffusers and vLLM-Omni Generator quickstarts target `nvidia/Cosmos3-Nano` or `nvidia/Cosmos3-Super` with matching tensor-parallel and offload flags for the 64B checkpoint.

## Representative workflows

### Reasoner workflows

| Workflow | Inputs | Output type |
| --- | --- | --- |
| Caption | Video | Text |
| Temporal localization | Video, query | Text or JSON timestamps |
| Embodied / common-sense reasoning | Video, question | Text |
| 2D grounding | Image, prompt | JSON bounding boxes |
| Describe anything | Image, marked subjects | JSON or text attributes |
| Action CoT | Image or video, prompt | Text or JSON trajectories |
| Physical plausibility | Video, prompt | Label |
| Situation understanding | Video, question | Text |

Runnable notebooks: `cookbooks/cosmos3/reasoner/run_with_vllm.ipynb`, `run_with_cosmos_framework.ipynb`.

### Generator workflows

| Workflow | Inputs | Outputs |
| --- | --- | --- |
| Text-to-image / text-to-video | Text | Vision (optional sound) |
| Image-to-video | Text, image | Vision (optional sound) |
| Video-to-video | Text, video | Vision (optional sound) |
| Forward dynamics | Text, image, action trajectory | Vision |
| Policy / inverse dynamics | Text, image or video | Action + vision |

Audiovisual cookbooks live under `cookbooks/cosmos3/generator/audiovisual/`; action cookbooks under `cookbooks/cosmos3/generator/action/` with `action_mode` values `forward_dynamics`, `inverse_dynamics`, and `policy` on vLLM-Omni.

## Checkpoint and modality scope

| Checkpoint | Size | Typical surface |
| --- | ---: | --- |
| `nvidia/Cosmos3-Nano` | 16B | Both Reasoner and Generator in one omnimodal weights file |
| `nvidia/Cosmos3-Super` | 64B | Same; requires multi-GPU serve/generate |
| `nvidia/Cosmos3-Super-Text2Image` | 64B | Generator-focused text-to-image |
| `nvidia/Cosmos3-Super-Image2Video` | 64B | Generator-focused image-to-video |
| `nvidia/Cosmos3-Nano-Policy-DROID` | 16B | Vision-language robot policy (DROID) |

Task-specific HF variants narrow Generator capability; general omnimodal understanding and simulation still route through Nano/Super with the correct surface selected at serve or `model_mode` time.

## Related pages

<CardGroup>
  <Card title="Overview" href="/overview">
    Cosmos 3 surfaces, modalities, and the shortest path to a first Reasoner or Generator call.
  </Card>
  <Card title="Choose an integration" href="/choose-integration">
    Pick Diffusers, vLLM-Omni, vLLM, Framework, or Transformers by research vs production goal.
  </Card>
  <Card title="Input and output specifications" href="/input-output-specifications">
    Resolution tiers, frame counts, prompt limits, and output formats per modality.
  </Card>
  <Card title="Run Reasoner with vLLM" href="/run-reasoner-vllm">
    Serve `Cosmos3ReasonerForConditionalGeneration` and issue Qwen3-VL chat requests.
  </Card>
  <Card title="Run Generator with vLLM-Omni" href="/run-generator-vllm-omni">
    OpenAI-compatible image/video/action generation and guardrail toggles.
  </Card>
  <Card title="Action modality" href="/action-modality">
    Embodiment dimensions, `domain_name`, and forward/inverse/policy action modes.
  </Card>
</CardGroup>
