# Run Generator with Cosmos Framework

> Clone `cosmos-framework`, run `uv sync` with the `cu130-train` or `cu128-train` group, then launch `cosmos_framework.scripts.inference` through `torchrun` with a parallelism preset, `--checkpoint-path`, and JSON input specs from the cookbook assets.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `README.md`

---


Generator audiovisual and action workflows in this repository run through the **Cosmos Framework** checkout (`cosmos_framework.scripts.inference`), launched with `torchrun` for multi-GPU diffusion or `python -m` for single-process action runs. Cookbooks under `cookbooks/cosmos3/generator/` supply structured JSON prompts, conditioning images, action trajectories, and example `torchrun` invocations against `Cosmos3-Nano` and `Cosmos3-Super`.

## When to use this path

| Goal | Cosmos Framework | Alternative in this repo |
| --- | --- | --- |
| Research-style PyTorch inference with full checkpoint control | Yes | Diffusers (`Cosmos3OmniPipeline`) |
| Production OpenAI-compatible serving | No | vLLM-Omni |
| Training, evaluation, omni-model recipes | Yes (framework repo) | — |

The framework path installs the training extras (the `*-train` groups) because the current inference entrypoint depends on those modules.

## Prerequisites

<Steps>
<Step title="Host and access">

- Linux with NVIDIA GPU (Ampere, Hopper, or Blackwell per product docs).
- [`uv`](https://docs.astral.sh/uv/getting-started/installation/) **≥ 0.11.3**, `git`, and `git-lfs`.
- Hugging Face access to gated Cosmos3 repos: `uvx hf@latest auth login` or `export HF_TOKEN=...`.
- Read access to [NVIDIA/cosmos-framework](https://github.com/NVIDIA/cosmos-framework) (HTTPS or SSH clone URL).

</Step>
<Step title="CUDA driver pairing">

Match the `uv` dependency group to your driver CUDA major version:

| Driver CUDA | `uv sync` group | Set before notebooks |
| --- | --- | --- |
| 13.x | `cu130-train` | `export COSMOS3_UV_GROUP=cu130-train` (default) |
| 12.x | `cu128-train` | `export COSMOS3_UV_GROUP=cu128-train` |

Only `cu130-train` and `cu128-train` are defined in the framework `pyproject.toml`. A CUDA 12.x driver with the default `cu130-train` group typically yields `cuda available: False` in the verify step.

</Step>
</Steps>
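The driver-to-group pairing in the table above can be expressed as a small lookup. This is a minimal sketch: `pick_uv_group` is a hypothetical helper, not part of the framework, and it assumes you already know the driver's CUDA version as a string (for example from the `nvidia-smi` banner).

```python
def pick_uv_group(driver_cuda_version: str) -> str:
    """Map a driver CUDA version such as "12.8" to a documented uv group.

    Hypothetical helper: only the two groups named in this page are known.
    """
    major = driver_cuda_version.strip().split(".")[0]
    groups = {"13": "cu130-train", "12": "cu128-train"}
    if major not in groups:
        raise ValueError(
            f"no uv group documented for driver CUDA {driver_cuda_version!r}"
        )
    return groups[major]


print(pick_uv_group("13.0"))  # cu130-train
print(pick_uv_group("12.8"))  # cu128-train
```

The returned value is what you would export as `COSMOS3_UV_GROUP` or pass to `uv sync --group=...`.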

Shared backend setup for all cookbooks lives in [Cookbook environment setup](/cookbook-environment).

## Install Cosmos Framework

From the `cosmos` repository root, clone (or reuse) the framework tree and sync dependencies:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3

# Skip LFS smudge for lerobot test artifacts the cookbooks do not need.
export GIT_LFS_SKIP_SMUDGE=1

# CUDA 13 driver (default):
uv sync --all-extras --group=cu130-train

# CUDA 12.x driver:
# uv sync --all-extras --group=cu128-train
```

<Note>
The cookbooks README clones into `packages/cosmos3`. The audiovisual notebook also accepts `packages/cosmos-framework` if that path already contains `pyproject.toml` and `cosmos_framework/`.
</Note>
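The acceptance rule in the note (a directory counts as a framework checkout when it holds both `pyproject.toml` and `cosmos_framework/`) might look like the sketch below. `find_framework_checkout` is a hypothetical name, and the order in which the two candidate paths are tried is an assumption rather than the notebook's documented behavior.

```python
from pathlib import Path
from typing import Optional

# Candidate locations named on this page; the search order is an assumption.
CANDIDATE_DIRS = ("packages/cosmos3", "packages/cosmos-framework")


def find_framework_checkout(repo_root: Path) -> Optional[Path]:
    """Return the first candidate that looks like a framework checkout."""
    for rel in CANDIDATE_DIRS:
        candidate = repo_root / rel
        has_project = (candidate / "pyproject.toml").is_file()
        has_package = (candidate / "cosmos_framework").is_dir()
        if has_project and has_package:
            return candidate
    return None


# Prints a path when run from a repo root with a checkout, otherwise None.
print(find_framework_checkout(Path(".")))
```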

The install creates the environment at `packages/cosmos3/.venv`. Either activate it (`source .venv/bin/activate`) or call `.venv/bin/torchrun` and `.venv/bin/python` by explicit path.

Optional: to keep the environment on a larger disk, set `UV_PROJECT_ENVIRONMENT` to the desired venv path before running `uv sync` (the audiovisual notebook uses this pattern).

## Verify GPU and Python

```bash
cd packages/cosmos3
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and a device name before running generation.

## Inference entrypoint

```text
packages/cosmos3/.venv/bin/torchrun  →  -m cosmos_framework.scripts.inference
                                              │
                    ┌─────────────────────────┼─────────────────────────┐
                    ▼                         ▼                         ▼
            Audiovisual JSON           Action JSONL              (Reasoner — other page)
            text2image / t2v / i2v     forward_dynamics /
                                       inverse_dynamics
```

Core CLI shape:

```bash
torchrun --nproc-per-node=<N> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=<preset> \
  -i <input.json|.jsonl> \
  -o <output_dir> \
  --checkpoint-path <Cosmos3-Nano|Cosmos3-Super> \
  [--seed=0] [--benchmark]
```

| Flag | Role |
| --- | --- |
| `-i` | Input spec: single JSON file or JSONL (one JSON object per line for multi-run specs) |
| `-o` | Output root directory |
| `--checkpoint-path` | Hugging Face checkpoint id, e.g. `Cosmos3-Nano`, `Cosmos3-Super` |
| `--parallelism-preset` | Framework parallelism profile (see below) |
| `--seed` | Reproducibility seed (examples pass `--seed=0` or `--seed 0`) |
| `--benchmark` | Write timing metadata (`benchmark.json`) — used in action notebooks |
| `--video-save-quality` | Video encode quality (action AV example uses `8`) |
| `--image_size` | Output size hint for action runs (e.g. `480`) |

Action cookbooks sometimes set distributed env vars manually and call `.venv/bin/python -m cosmos_framework.scripts.inference` with `RANK=0 WORLD_SIZE=1` instead of `torchrun`.
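When scripting many runs, it can help to assemble the command line programmatically. The following is a sketch only: `build_inference_command` is a hypothetical helper, it emits just the flags listed in the table above, and the default `torchrun` path assumes the working directory is the framework checkout.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    input_spec: str,
    output_dir: str,
    checkpoint: str,
    preset: str = "throughput",
    nproc_per_node: int = 1,
    seed: Optional[int] = 0,
    benchmark: bool = False,
    torchrun: str = ".venv/bin/torchrun",  # assumes cwd is the checkout
) -> List[str]:
    """Build an argv list mirroring the core CLI shape shown above."""
    cmd = [
        torchrun,
        f"--nproc-per-node={nproc_per_node}",
        "-m", "cosmos_framework.scripts.inference",
        f"--parallelism-preset={preset}",
        "-i", input_spec,
        "-o", output_dir,
        "--checkpoint-path", checkpoint,
    ]
    if seed is not None:
        cmd.append(f"--seed={seed}")
    if benchmark:
        cmd.append("--benchmark")
    return cmd


cmd = build_inference_command(
    "assets/prompts/text2video/robot_kitchen.json",
    "/tmp/cosmos3_t2v_framework",
    "Cosmos3-Nano",
)
print(shlex.join(cmd))  # shell-quoted form for logging or copy-paste
```

Returning a list rather than a string lets the result go straight to `subprocess.run(cmd, check=True)` without shell quoting concerns.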

## Parallelism presets

| Preset | Typical Generator use | Launch pattern |
| --- | --- | --- |
| `throughput` | Audiovisual text-to-image, text-to-video, image-to-video | `torchrun --nproc-per-node=$COSMOS3_NUM_GPUS` (notebook default **4**) |
| `latency` | Action forward/inverse dynamics | Single GPU: `python -m` or `torchrun --nproc-per-node=1` |

Audiovisual runs also pass `--master-addr` and `--master-port` (notebook allocates free ports per workflow). Quickstart text-to-video uses `--nproc-per-node=1`.
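One common way to obtain a free port for `--master-port` is to bind a socket to port `0` and read back the port the OS assigned. This is a generic sketch, not necessarily how the notebook allocates its ports, and `find_free_port` is a hypothetical name.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on `host` and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


port = find_free_port()
print(f"--master-addr=127.0.0.1 --master-port={port}")
```

The socket is closed before `torchrun` starts, so another process could in principle claim the port in between; for local notebook use that window is usually acceptable.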

## Quickstart: text-to-video (Nano)

After framework install, from `cookbooks/cosmos3/generator/audiovisual/`:

```bash
# The torchrun path below is relative to the repo root; from this directory
# use ../../../../packages/cosmos3/.venv/bin/torchrun or an absolute path.
packages/cosmos3/.venv/bin/torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i assets/prompts/text2video/robot_kitchen.json \
  -o /tmp/cosmos3_t2v_framework \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

<Check>
First run downloads `Cosmos3-Nano` via Hugging Face. Diffusion over 189 frames at 720p is compute-heavy; long step times are expected.
</Check>

For **Cosmos3-Super**, set `--checkpoint-path Cosmos3-Super` and increase `--nproc-per-node` to match available GPUs (notebook uses the same `throughput` preset with multi-GPU `torchrun`).

## Cookbook assets layout

Audiovisual prompts and conditioning media live under `cookbooks/cosmos3/generator/audiovisual/assets/`:

:::files
cookbooks/cosmos3/generator/audiovisual/assets/
├── prompts/
│   ├── text2image/          # e.g. robot_draping.json
│   ├── text2video/          # e.g. robot_kitchen.json, robot_pouring_water_audio.json
│   └── image2video/         # e.g. car_driving.json, coastal_road_audio.json
├── negative_prompts/
│   ├── text2video/neg_prompt.json
│   └── image2video/neg_prompt.json
└── images/image2video/      # e.g. car_driving.jpg, coastal_road_audio.jpg
:::

Prompt files are **structured JSON scene specs** (subjects, cinematography, `temporal_caption`, `resolution`, `fps`, etc.), not plain strings. The quickstart passes the prompt file directly to `-i`; the full notebook wraps the same files into framework payload JSON with sampling fields.

Action examples use `cookbooks/cosmos3/generator/action/assets/` (images, videos, trajectories) and write specs under `packages/cosmos3/outputs/cookbooks/cosmos3/generator/action/inputs/`.

## Audiovisual input spec (notebook payload)

The audiovisual notebook builds per-run JSON under `outputs/notebooks/pytorch/payloads/<use_case>.json` with a consistent schema:

| Field | Typical value | Notes |
| --- | --- | --- |
| `model_mode` | `text2image`, `text2video`, `image2video` | Selects Generator modality |
| `prompt` | Compact JSON **string** of the scene spec file | From `assets/prompts/...` |
| `negative_prompt` | Compact JSON string or `""` | From `assets/negative_prompts/<mode>/neg_prompt.json`; empty for text2image |
| `enable_sound` | `true` / `false` | Sound-bearing prompts use dedicated asset pairs |
| `num_steps` | `35` | Diffusion steps |
| `guidance` | `6.0` | CFG scale |
| `shift` | `10.0` | Scheduler flow shift |
| `fps` | `24` | |
| `num_frames` | `189` (video), `1` (text2image) | ~7.9 s at 24 FPS for default video |
| `resolution` | `"720"` | |
| `aspect_ratio` | `"16,9"` | Comma-separated pair in cookbook payloads |
| `seed` | `0` | |
| `vision_path` | Relative path to conditioning image | Required for `image2video` |

Example payload fragment (text-to-video, no audio):

```json
{
  "model_mode": "text2video",
  "name": "t2v_nano_noaudio",
  "prompt": "{...compact scene JSON...}",
  "negative_prompt": "{...compact neg prompt JSON...}",
  "enable_sound": false,
  "num_steps": 35,
  "guidance": 6.0,
  "shift": 10.0,
  "fps": 24,
  "num_frames": 189,
  "resolution": "720",
  "aspect_ratio": "16,9",
  "seed": 0
}
```

Image-to-video adds `vision_path` relative to the payload file directory (e.g. path into `assets/images/image2video/`).
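The schema above could be assembled along these lines. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the defaults are copied from the table, the compaction via `json.dumps(..., separators=(",", ":"))` is an assumed equivalent of what the notebook does, and the scene dictionary is a placeholder rather than a real cookbook asset.

```python
import json
from typing import Any, Dict, Optional


def build_payload(
    name: str,
    model_mode: str,
    scene_spec: Dict[str, Any],
    negative_spec: Optional[Dict[str, Any]] = None,
    enable_sound: bool = False,
    vision_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Wrap a structured scene spec into a framework payload dict."""

    def compact(obj: Dict[str, Any]) -> str:
        # The prompt fields hold the spec as a compact JSON *string*.
        return json.dumps(obj, separators=(",", ":"))

    payload: Dict[str, Any] = {
        "model_mode": model_mode,
        "name": name,
        "prompt": compact(scene_spec),
        "negative_prompt": compact(negative_spec) if negative_spec else "",
        "enable_sound": enable_sound,
        "num_steps": 35,
        "guidance": 6.0,
        "shift": 10.0,
        "fps": 24,
        "num_frames": 1 if model_mode == "text2image" else 189,
        "resolution": "720",
        "aspect_ratio": "16,9",
        "seed": 0,
    }
    if model_mode == "image2video":
        if vision_path is None:
            raise ValueError("image2video requires vision_path")
        payload["vision_path"] = vision_path
    return payload


# Placeholder scene spec for illustration only.
scene = {"temporal_caption": "A robot arm wipes a kitchen counter."}
payload = build_payload("t2v_nano_noaudio", "text2video", scene)
print(payload["num_frames"] / payload["fps"])  # 7.875 seconds of video
```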

### Notebook asset matrix

| Use case key | Checkpoint | Mode | Sound |
| --- | --- | --- | --- |
| `t2i` | Cosmos3-Nano | text2image | off |
| `t2i_super` | Cosmos3-Super | text2image | off |
| `t2v_nano_noaudio` | Cosmos3-Nano | text2video | off |
| `t2vs` | Cosmos3-Nano | text2video | on |
| `i2v_nano_noaudio` | Cosmos3-Nano | image2video | off |
| `i2vs` | Cosmos3-Nano | image2video | on |
| `t2v_super_noaudio` | Cosmos3-Super | text2video | off |
| `i2v_super_noaudio` | Cosmos3-Super | image2video | off |

Run pattern (text-to-image on Nano):

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0,1,2,3 \
  .venv/bin/torchrun \
  --nproc-per-node=4 \
  --master-addr=127.0.0.1 \
  --master-port=<free_port> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i /path/to/t2i.json \
  -o /path/to/output/t2i \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## Scale checkpoints and GPUs

| Checkpoint | Size | Cookbook GPU hint |
| --- | ---: | --- |
| `Cosmos3-Nano` | 16B | Quickstart: 1 GPU; audiovisual notebook default: 4 GPUs |
| `Cosmos3-Super` | 64B | Same `throughput` preset; raise `--nproc-per-node` to available GPU count |

Set `export COSMOS3_NUM_GPUS=4` and `export CUDA_VISIBLE_DEVICES=0,1,2,3` before notebook cells, or pass `--nproc-per-node` explicitly in shell commands.
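A quick consistency check between the two variables can catch a mismatch before `torchrun` starts. The sketch below is illustrative: `resolve_num_gpus` is a hypothetical helper, and the rule that the process count must not exceed the number of visible devices is an assumption about a sensible setup, not a documented framework check.

```python
from typing import Mapping


def resolve_num_gpus(env: Mapping[str, str], default: int = 4) -> int:
    """Read COSMOS3_NUM_GPUS (default 4) and compare with visible devices."""
    num_gpus = int(env.get("COSMOS3_NUM_GPUS", default))
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    if visible:
        visible_count = len([d for d in visible.split(",") if d.strip()])
        if num_gpus > visible_count:
            raise ValueError(
                f"COSMOS3_NUM_GPUS={num_gpus} but only "
                f"{visible_count} device(s) are visible"
            )
    return num_gpus


example_env = {"COSMOS3_NUM_GPUS": "4", "CUDA_VISIBLE_DEVICES": "0,1,2,3"}
print(resolve_num_gpus(example_env))  # 4
```

In a real session you would pass `os.environ` instead of the example mapping.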

## Action Generator (forward and inverse dynamics)

Action workflows use **JSONL** specs (one JSON object per line) and the `latency` preset. They are documented in depth on [Run Generator action workflows](/run-generator-action); summary for Framework-only runs:

**Forward dynamics** (`model_mode`: `forward_dynamics`) takes a start image (`vision_path`), an `action_path`, and a `domain_name` (`av`, `droid_lerobot`, `umi`, …). Each run writes one video:

```text
<output_dir>/<name>/vision.mp4
```

**Inverse dynamics** (`model_mode`: `inverse_dynamics`) takes only a `vision_path` video as input and writes the predicted action to `<output_dir>/<name>/sample_outputs.json` under `outputs[0].content["action"]`.
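Reading the predicted action back might look like the following sketch. `read_predicted_action` is a hypothetical helper, and it assumes the accessor above maps onto plain JSON nesting, that is, a top-level `outputs` list whose first entry has a `content` object with an `action` key.

```python
import json
from pathlib import Path
from typing import Any


def read_predicted_action(output_dir: str, name: str) -> Any:
    """Return outputs[0].content["action"] from a run's sample_outputs.json.

    Assumes the file is plain JSON with that nesting; adjust if the
    actual layout differs.
    """
    path = Path(output_dir) / name / "sample_outputs.json"
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return data["outputs"][0]["content"]["action"]
```

Call it with the same `-o` directory and JSONL `name` used for the run.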

Example AV forward-dynamics record:

```json
{
  "action_chunk_size": 60,
  "action_path": "/abs/path/to/av_traj_forward.json",
  "domain_name": "av",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "model_mode": "forward_dynamics",
  "name": "av_forward",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0,
  "vision_path": "/abs/path/to/av_0.jpg"
}
```
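Because the action entrypoint reads JSONL, a record like the one above has to be serialized onto a single line. A minimal sketch follows; `write_jsonl` is a hypothetical helper, and the paths inside the record are the same placeholders used in the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Write one JSON object per line and return the number of records."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as fh:
        for record in records:
            # json.dumps without indent keeps each record on a single line.
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count


record = {
    "action_chunk_size": 60,
    "action_path": "/abs/path/to/av_traj_forward.json",  # placeholder path
    "domain_name": "av",
    "fps": 10,
    "image_size": 480,
    "view_point": "ego_view",
    "model_mode": "forward_dynamics",
    "name": "av_forward",
    "prompt": "You are an autonomous vehicle planning system.",
    "seed": 0,
    "vision_path": "/abs/path/to/av_0.jpg",  # placeholder path
}

with tempfile.TemporaryDirectory() as tmp:
    spec_path = Path(tmp) / "inputs" / "av_forward.jsonl"
    written = write_jsonl([record], str(spec_path))
    line_count = len(spec_path.read_text(encoding="utf-8").splitlines())
    print(written, line_count)  # 1 1
```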

Run:

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0 \
  .venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/generator/action/inputs/action_forward_dynamics_av_custom.jsonl \
  -o outputs/cookbooks/cosmos3/generator/action/action_forward_dynamics_av_custom \
  --checkpoint-path Cosmos3-Nano \
  --video-save-quality 8 \
  --image_size 480 \
  --seed 0 \
  --benchmark
```

Embodiment dimensions and FPS defaults for AV / DROID / UMI are summarized in the action cookbook README (9D–10D pose deltas, 60 frames @ 10 FPS for AV, etc.).

## Outputs and verification

| Workflow | Output location | Success signal |
| --- | --- | --- |
| Audiovisual | `-o` directory; `*.mp4` or `*.png` under run subfolders | Generated media files; notebook `view_run()` skips `*_preview.mp4` |
| Forward dynamics | `<output>/<name>/vision.mp4` | MP4 exists per JSONL `name` |
| Inverse dynamics | `<output>/<name>/sample_outputs.json` | `action` array in first output content |
| With `--benchmark` | `benchmark.json` under output root | Timing averages in JSON |

Hugging Face weights cache under `HF_HOME` (default `~/.cache/huggingface`).
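The forward-dynamics success signal from the table can be checked mechanically after a batch run. This is a sketch: `missing_forward_outputs` is a hypothetical helper that only tests for the presence of `vision.mp4` and does not validate the video contents.

```python
from pathlib import Path
from typing import Iterable, List


def missing_forward_outputs(output_dir: str, names: Iterable[str]) -> List[str]:
    """Return the run names that lack <output_dir>/<name>/vision.mp4."""
    root = Path(output_dir)
    return [n for n in names if not (root / n / "vision.mp4").is_file()]
```

Pass the `name` values from the JSONL spec; an empty list means every run produced its video.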

## Useful environment variables

| Variable | Default / role |
| --- | --- |
| `COSMOS3_REPO` | Framework checkout path (`packages/cosmos3`) |
| `COSMOS3_UV_GROUP` | `cu130-train` or `cu128-train` |
| `COSMOS3_UV_ENV` / `UV_PROJECT_ENVIRONMENT` | Path to `.venv` used by `torchrun` |
| `COSMOS3_NUM_GPUS` | `4` in audiovisual notebook |
| `CUDA_VISIBLE_DEVICES` | GPU indices for the run |
| `COSMOS3_MASTER_ADDR` | `127.0.0.1` for distributed audiovisual |
| `COSMOS3_*_MASTER_PORT` | Per-workflow free ports in notebook |
| `HF_HOME` / `HF_TOKEN` | Model download cache and auth |
| `GIT_LFS_SKIP_SMUDGE` | `1` during `uv sync` |

Action notebooks may require a one-time kernel restart after `configure_cosmos_framework_runtime_env()` updates `LD_LIBRARY_PATH` for CUDA and FFmpeg libraries.

## Troubleshooting

<Warning>
**Headless import errors** (`libxcb.so.1`): install `libxcb1 libgl1 libglib2.0-0` on minimal Linux images.
</Warning>

| Symptom | Mitigation |
| --- | --- |
| `cuda available: False` after sync | Switch to `cu128-train` on CUDA 12.x drivers; confirm with `nvidia-smi` |
| `uv` parse / `--torch-backend` errors | Upgrade `uv` to ≥ 0.11.3 (`uv self update`) |
| Clone / sync failures on LFS blobs | Keep `GIT_LFS_SKIP_SMUDGE=1` for cookbook installs |
| Missing `torchrun` | Run install cell; use `$COSMOS3_UV_ENV/bin/torchrun` explicitly |
| Super OOM | Reduce resolution in payload, use fewer frames, or add GPUs via `--nproc-per-node` |

See [Troubleshooting](/troubleshooting) for cross-backend CUDA and container notes.

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv/Docker setup, HF auth, and CUDA group selection for all backends.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
Full notebook matrix for text-to-image, text-to-video, and image-to-video with optional sound.
</Card>
<Card title="Run Generator action workflows" href="/run-generator-action">
Forward and inverse dynamics JSONL specs, domains, and Framework vs vLLM-Omni paths.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, prompt limits, and output formats.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON prompt schema and Generator sampling defaults.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to prefer Framework vs Diffusers vs vLLM-Omni.
</Card>
<Card title="Run Generator with Diffusers" href="/run-generator-diffusers">
Python-first alternative without a framework checkout.
</Card>
</CardGroup>
