# Input and output specifications

> Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/humanoid_robot.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/humanoid_robot.jpg`
- `cookbooks/cosmos3/reasoner/assets/video_caption.mp4`
- `cookbooks/cosmos3/reasoner/assets/grounding_2d.png`

---

---
title: "Input and output specifications"
description: "Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs."
---

Cosmos 3 exposes two runtime surfaces—**Reasoner** (text and vision in, text out) and **Generator** (text, vision, sound, and action in; vision, sound, action, and text out)—with modality contracts defined in the repository README, audiovisual cookbook assets under `cookbooks/cosmos3/generator/audiovisual/`, and Reasoner examples under `cookbooks/cosmos3/reasoner/`. Generator integrations map the same contracts through Cosmos Framework JSON specs, `Cosmos3OmniPipeline` arguments, or vLLM-Omni `size` / `num_frames` / `fps` fields.

## Surface summary

| Surface | Inputs | Outputs | Typical formats |
| --- | --- | --- | --- |
| **Reasoner** | Text, image, video | Text | Plain string; JSON for grounding or localization tasks |
| **Generator** | Text, image, video, action | Image, video, sound, action | JPG/PNG image, MP4 video, AAC muxed into MP4, JSON action arrays |

```text
                    ┌─────────────────────────────────────┐
  text / vision ──► │           Cosmos 3 Reasoner          │ ──► text
                    └─────────────────────────────────────┘

  text / vision / sound / action
                    ┌─────────────────────────────────────┐
              ───► │          Cosmos 3 Generator            │ ──► vision (+ optional sound, action)
                    └─────────────────────────────────────┘
```

## Generator: input types and formats

| Input type | Composition | File / payload format |
| --- | --- | --- |
| Text only | Text-to-image, text-to-video | Plain string or structured JSON prompt (see below) |
| Text + image | Image-to-video | JPG, PNG, JPEG, WEBP (`IMAGE_EXTENSIONS` in audiovisual notebooks) |
| Text + video | Video-to-video | MP4 via `input_reference` (vLLM-Omni) or `vision_path` (Framework) |
| Text + image + action | Forward dynamics, policy | Image + JSON action array (`action_path` on server) |
| Text + video + instruction | Inverse dynamics | MP4 + text instruction |

**Framework inference JSON** (written by audiovisual notebooks) carries generation controls alongside the prompt:

| Field | Role | Cookbook default |
| --- | --- | --- |
| `model_mode` | Workflow: `text2image`, `text2video`, `image2video` | Per example |
| `name` | Run identifier / output subdirectory | Per example |
| `prompt` | Structured JSON string (compact-serialized asset file) | Asset under `assets/prompts/` |
| `vision_path` | Conditioning image (image2video), repo-relative | e.g. `assets/images/image2video/car_driving.jpg` |
| `enable_sound` | Request synchronized audio generation | `false` or `true` |
| `num_steps` | Diffusion denoising steps | `35` |
| `guidance` | CFG strength (Framework) / `guidance_scale` (Diffusers, vLLM-Omni) | `6.0` |
| `shift` | Scheduler flow shift / `flow_shift` | `10.0` |
| `fps` | Output frame rate | `24` |
| `num_frames` | Video length in frames (`1` for text-to-image) | `189` (video), `1` (image) |
| `resolution` | Tier string: `"256"`, `"480"`, `"720"` | `"720"` in cookbooks |
| `aspect_ratio` | Comma-separated pair, e.g. `"16,9"` | `"16,9"` |
| `seed` | Reproducibility | `0` |

**Reasoner Framework input** uses a smaller schema, for example:

```json
{
  "model_mode": "reasoner",
  "name": "robot_image",
  "prompt": "Describe what is happening in this image in one sentence.",
  "vision_path": "https://…/robot_153.jpg",
  "enable_sound": false
}
```

Set `enable_sound` to `false` on the current Reasoner Framework path to avoid strict argument-validation failures noted in the Reasoner cookbook README.

## Generator: output types and formats

| Output | Format | Notes |
| --- | --- | --- |
| Image | JPG (Framework); PNG base64 (vLLM-Omni `/v1/images/generations`) | `text-to-image` uses `num_frames=1` |
| Video | MP4 | Exported with `export_to_video` (Diffusers) or returned as `video/mp4` bytes (vLLM-Omni sync endpoint) |
| Sound | Stereo AAC at 48 kHz | Muxed into MP4 when sound is enabled |
| Action | JSON numeric arrays | Policy / inverse dynamics return predicted chunks; forward dynamics returns video only |

## Resolution tiers and pixel dimensions

Cosmos 3 supports three resolution tiers. **Default tier is 480p**; **default aspect ratio is 16:9**.

| Tier | 16:9 pixels (H×W) | Used in |
| --- | --- | --- |
| **256p** | 320×192 | Diffusers benchmarks; cookbook `payload_dimensions` for `resolution: "256"` |
| **480p** | 832×480 | Model default tier; Diffusers benchmarks |
| **720p** | 1280×720 | README vision conditioning; cookbook assets and quickstarts |

Benchmark tables in `inference_benchmarks.md` label these as **256p/1**, **480p/1**, and **720p/1** (height tier / aspect-ratio index). Standard video benchmarks use **189 frames at 24 FPS** unless a resolution tier limits frame count.

<Note>
Audiovisual notebook helpers currently resolve pixel sizes only for **`resolution: "720"`** and **`resolution: "256"`** with **`aspect_ratio: "16,9"`**. Other tier/ratio pairs are supported at the model level (per README) but require explicit `height`/`width` (Diffusers), `size` (vLLM-Omni), or Framework fields you set yourself.
</Note>

### Mapping tiers to API fields

| Integration | How you set resolution |
| --- | --- |
| **Cosmos Framework** | `resolution` + `aspect_ratio` in inference JSON |
| **Diffusers** | `height`, `width` (e.g. `720`, `1280`) |
| **vLLM-Omni** | `size` as `<width>x<height>` (e.g. `1280x720`) |

Checked-in structured prompts also embed explicit pixels under `resolution.W` / `resolution.H` (cookbook assets use **1280×720** for 16:9 video and image prompts).

## Aspect ratios

| Aspect ratio | Default? |
| --- | --- |
| 16:9 | Yes |
| 4:3 | Supported |
| 1:1 | Supported |
| 3:4 | Supported |
| 9:16 | Supported |

In Framework JSON and prompt assets, encode ratios with a **comma** separator (e.g. `"aspect_ratio": "16,9"`), not a colon. vLLM-Omni and Diffusers examples in this repo use explicit `size` or `height`/`width` for 16:9 rather than enumerating every ratio.

Optional template toggles on vLLM-Omni (`extra_params.use_resolution_template`, `use_duration_template`) let the server inject resolution/duration hints; cookbooks often disable them and pass full structured JSON instead.

## Frame rates and frame counts

| Parameter | Supported values | Default |
| --- | --- | --- |
| **FPS** | 10, 16, 24, 30 | 24 |
| **Frame count** | 5–300 | 189 |

**Duration relationship:** at 24 FPS, 189 frames is about **7.9 seconds** of video. Shorter clips in prompt assets declare matching metadata—for example `humanoid_robot.json` uses `"duration": "7s"` and `"fps": 24` for a seven-second scene description.

| Workflow | Typical `num_frames` |
| --- | --- |
| Text-to-image | `1` |
| Text-to-video / image-to-video (cookbooks) | `189` |
| vLLM-Omni README curl example | `81` (valid within 5–300) |
| Action forward dynamics (AV) | 60 frames @ 10 FPS (per action cookbook) |
| Action forward dynamics (DROID) | 16 frames @ 15 FPS per chunk |
| Action forward dynamics (UMI) | 16 frames @ 20 FPS per chunk |

Action robotics notebooks run **autoregressive chunks** (e.g. five 16-frame DROID chunks); each chunk video includes its conditioning frame at index 0, which downstream stitching drops before concatenation.

## Vision conditioning

| Setting | Specification |
| --- | --- |
| **Spatial size** | Matches tier: 1280×720 (720p), 832×480 (480p), 320×192 (256p) |
| **Video conditioning frames** | **5 frames** at the matching resolution |
| **Image conditioning** | Single reference image (`vision_path` or `input_reference`) |
| **Video-to-video** | Source MP4 plus `condition_frame_indexes_vision` and `condition_video_keep` in vLLM-Omni `extra_params` |

Image2video cookbooks ship JPEG conditioning frames under `cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/` (e.g. `humanoid_robot.jpg` paired with `humanoid_robot.json`).

## Prompt formats and length limits

### Plain text

Short natural-language strings work for quickstarts (Diffusers `prompt="…"`, vLLM-Omni `prompt` form field). For world generation, **fewer than 300 words** is recommended.

### Structured JSON prompts

Production audiovisual flows serialize rich scene JSON (subjects, lighting, cinematography, temporal segments) and pass it as the `prompt` string. Example top-level keys from `robot_draping.json` and `humanoid_robot.json`:

| Key group | Examples |
| --- | --- |
| Scene | `subjects`, `background_setting`, `lighting`, `aesthetics` |
| Motion / time | `actions`, `segments`, `temporal_caption`, `duration`, `fps` |
| Output geometry | `resolution` (`W`, `H`), `aspect_ratio` |
| Caption | `comprehensive_t2i_caption` (text-to-image) |

Cookbooks load assets with `compact_json_file()` and send `json.dumps(..., separators=(",", ":"))` so the model receives a single-line JSON string.

### Token limits (vLLM-Omni)

<ParamField body="max_sequence_length" type="integer">
Maximum prompt tokens kept for conditioning. Cosmos 3 default is **512**; longer prompts are truncated with a warning, shorter prompts padded.
</ParamField>

Prompt upsampling (Generator) uses separate LLM sampling defaults (`max_tokens` 20000, etc.) documented on the sampling page—not the same as `max_sequence_length`.

## Sound output

| Property | Value |
| --- | --- |
| Codec | AAC |
| Channels | Stereo |
| Sample rate | 48 kHz |
| Container | Muxed into output MP4 |

Enable sound per integration:

| Integration | Flag |
| --- | --- |
| Cosmos Framework | `"enable_sound": true` in inference JSON |
| Diffusers | `enable_sound=True` on `Cosmos3OmniPipeline` (`text-to-video-with-sound` mode) |
| vLLM-Omni | `generate_sound=true` on `/v1/videos` or `/v1/videos/sync` |

## Action inputs and outputs (summary)

Action modality uses JSON arrays of pose deltas; embodiment dimensionality varies (camera 9D, AV 9D, DROID/UMI 10D, humanoid 29D per README). vLLM-Omni passes `action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, and `action_path` through `extra_params`. See the action modality page for semantics; this page only lists I/O shapes.

| `action_mode` | Primary input | Primary output |
| --- | --- | --- |
| `forward_dynamics` | Image + action chunk | Video (sync API) |
| `inverse_dynamics` | Video + instruction | Video + predicted action chunk (async API) |
| `policy` | Image + instruction | Video + predicted action chunk (async API) |

## Reasoner: input and output

Reasoner follows **Qwen3-VL-compatible** chat messages: `image_url` and `video_url` content parts plus text. Outputs are **text** (or JSON embedded in text for grounding/localization).

| Input format | Example |
| --- | --- |
| Remote image URL | `https://…/robot_153.jpg` |
| Local media | `file://` paths with `--allowed-local-media-path` on the vLLM server |
| Video | `video_caption.mp4`, `grounding_2d.png`, and other assets under `cookbooks/cosmos3/reasoner/assets/` |

| Parameter | Cookbook usage |
| --- | --- |
| `max_tokens` | `4096` in Reasoner vLLM examples |
| Video frame ingestion | `--media-io-kwargs '{"video": {"num_frames": -1}}'` so the processor considers all frames before downstream sampling |

Framework Reasoner currently expects **image** inputs via `vision_path`; video-heavy workflows are documented against vLLM in the Reasoner cookbook README.

## Precision and platform constraints

| Constraint | Value |
| --- | --- |
| Precision | BF16 tested |
| Operating system | Linux |
| GPU | NVIDIA Ampere, Hopper, Blackwell |

## Related pages

<CardGroup>
  <Card title="Reasoner and Generator" href="/reasoner-and-generator">
    When to use each surface and how MoT modes differ.
  </Card>
  <Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
    Prompt upsampling, Reasoner sampling tables, and JSON schema details.
  </Card>
  <Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
    Request fields, `extra_params`, and endpoint mapping.
  </Card>
  <Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
    `Cosmos3OmniPipeline` modes and call arguments.
  </Card>
  <Card title="Action modality" href="/action-modality">
    Embodiment dimensions, `domain_name`, and action workflow modes.
  </Card>
  <Card title="Audiovisual cookbooks" href="/audiovisual-cookbooks">
    End-to-end Generator examples with checked-in prompts and images.
  </Card>
</CardGroup>
