# Run Generator with Cosmos Framework

> Clone cosmos-framework, uv sync cu130-train/cu128-train groups, torchrun cosmos_framework.scripts.inference with parallelism presets, checkpoint-path, and JSON input specs from cookbook assets.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `cookbooks/cosmos3/README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb`
- `cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb`
- `README.md`

---

Generator audiovisual and action workflows in this repository run through the **Cosmos Framework** checkout (`cosmos_framework.scripts.inference`), launched with `torchrun` for multi-GPU diffusion or `python -m` for single-process action runs. Cookbooks under `cookbooks/cosmos3/generator/` supply structured JSON prompts, conditioning images, action trajectories, and example `torchrun` invocations against `Cosmos3-Nano` and `Cosmos3-Super`.

## When to use this path

| Goal | Cosmos Framework | Alternative in this repo |
| --- | --- | --- |
| Research-style PyTorch inference with full checkpoint control | Yes | Diffusers (`Cosmos3OmniPipeline`) |
| Production OpenAI-compatible serving | No | vLLM-Omni |
| Training, evaluation, omni-model recipes | Yes (framework repo) | None |

The framework path pulls in training extras at install time (`*-train` groups) because the current inference entrypoint depends on those modules.

## Prerequisites

- Linux with NVIDIA GPU (Ampere, Hopper, or Blackwell per product docs).
- [`uv`](https://docs.astral.sh/uv/getting-started/installation/) **≥ 0.11.3**, `git`, and `git-lfs`.
- Hugging Face access to gated Cosmos3 repos: `uvx hf@latest auth login` or `export HF_TOKEN=...`.
- Read access to [NVIDIA/cosmos-framework](https://github.com/NVIDIA/cosmos-framework) (HTTPS or SSH clone URL).

Match the `uv` dependency group to your driver CUDA major version:

| Driver CUDA | `uv sync` group | Set before notebooks |
| --- | --- | --- |
| 13.x | `cu130-train` | `export COSMOS3_UV_GROUP=cu130-train` (default) |
| 12.x | `cu128-train` | `export COSMOS3_UV_GROUP=cu128-train` |

Only `cu130-train` and `cu128-train` are defined in the framework `pyproject.toml`. A CUDA 12.x driver with the default `cu130-train` group typically yields `cuda available: False` in the verify step.

Shared backend setup for all cookbooks lives in [Cookbook environment setup](/cookbook-environment).

## Install Cosmos Framework

From the `cosmos` repository root, clone (or reuse) the framework tree and sync dependencies:

```bash
mkdir -p packages
git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos3
cd packages/cosmos3

# Skip LFS smudge for lerobot test artifacts the cookbooks do not need.
export GIT_LFS_SKIP_SMUDGE=1

# CUDA 13 driver (default):
uv sync --all-extras --group=cu130-train

# CUDA 12.x driver:
# uv sync --all-extras --group=cu128-train
```

The cookbooks README clones into `packages/cosmos3`. The audiovisual notebook also accepts `packages/cosmos-framework` if that path already contains `pyproject.toml` and `cosmos_framework/`.

The install creates `.venv` at `packages/cosmos3/.venv`. Either activate it (`source .venv/bin/activate`) or call `.venv/bin/torchrun` and `.venv/bin/python` by absolute path.

Optional: point `UV_PROJECT_ENVIRONMENT` at a large-disk venv path before `uv sync` (audiovisual notebook pattern).

## Verify GPU and Python

```bash
cd packages/cosmos3
.venv/bin/python - <<'PY'
import torch

print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and a device name before running generation.

## Inference entrypoint

```text
packages/cosmos3/.venv/bin/torchrun → -m cosmos_framework.scripts.inference
                              │
    ┌─────────────────────────┼─────────────────────────┐
    ▼                         ▼                         ▼
Audiovisual JSON        Action JSONL          (Reasoner, other page)
text2image / t2v / i2v  forward_dynamics / inverse_dynamics
```

Core CLI shape:

```bash
torchrun --nproc-per-node=<num_gpus> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=<preset> \
  -i <input_spec> \
  -o <output_dir> \
  --checkpoint-path <checkpoint> \
  [--seed=0] [--benchmark]
```

| Flag | Role |
| --- | --- |
| `-i` | Input spec: single JSON file or JSONL (one JSON object per line for multi-run specs) |
| `-o` | Output root directory |
| `--checkpoint-path` | Hugging Face checkpoint id, e.g. `Cosmos3-Nano`, `Cosmos3-Super` |
| `--parallelism-preset` | Framework parallelism profile (see below) |
| `--seed` / `--seed=0` | Reproducibility seed |
| `--benchmark` | Write timing metadata (`benchmark.json`); used in action notebooks |
| `--video-save-quality` | Video encode quality (action AV example uses `8`) |
| `--image_size` | Output size hint for action runs (e.g. `480`) |

Action cookbooks sometimes set distributed env vars manually and call `.venv/bin/python -m cosmos_framework.scripts.inference` with `RANK=0 WORLD_SIZE=1` instead of `torchrun`.

## Parallelism presets

| Preset | Typical Generator use | Launch pattern |
| --- | --- | --- |
| `throughput` | Audiovisual text-to-image, text-to-video, image-to-video | `torchrun --nproc-per-node=$COSMOS3_NUM_GPUS` (notebook default **4**) |
| `latency` | Action forward/inverse dynamics | Single GPU: `python -m` or `torchrun --nproc-per-node=1` |

Audiovisual runs also pass `--master-addr` and `--master-port` (notebook allocates free ports per workflow). Quickstart text-to-video uses `--nproc-per-node=1`.

## Quickstart: text-to-video (Nano)

After framework install, from `cookbooks/cosmos3/generator/audiovisual/`:

```bash
# Use the framework venv torchrun (from repo root, adjust path if needed):
packages/cosmos3/.venv/bin/torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i assets/prompts/text2video/robot_kitchen.json \
  -o /tmp/cosmos3_t2v_framework \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

First run downloads `Cosmos3-Nano` via Hugging Face. Diffusion over 189 frames at 720p is compute-heavy; long step times are expected.

For **Cosmos3-Super**, set `--checkpoint-path Cosmos3-Super` and increase `--nproc-per-node` to match available GPUs (notebook uses the same `throughput` preset with multi-GPU `torchrun`).
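
The Nano and Super launches differ only in a few flag values, so the command can be assembled programmatically. The following is a minimal sketch, not part of the cookbooks: `build_inference_command` is a hypothetical helper, and it uses only the flags shown in the quickstart above.

```python
import shlex


def build_inference_command(
    input_spec,
    output_dir,
    checkpoint="Cosmos3-Nano",
    preset="throughput",
    nproc_per_node=1,
    seed=0,
    torchrun="packages/cosmos3/.venv/bin/torchrun",
):
    """Return the argv list for one inference run (hypothetical helper)."""
    return [
        torchrun,
        f"--nproc-per-node={nproc_per_node}",
        "-m", "cosmos_framework.scripts.inference",
        f"--parallelism-preset={preset}",
        "-i", str(input_spec),
        "-o", str(output_dir),
        "--checkpoint-path", checkpoint,
        f"--seed={seed}",
    ]


# Same values as the Nano quickstart command.
cmd = build_inference_command(
    "assets/prompts/text2video/robot_kitchen.json",
    "/tmp/cosmos3_t2v_framework",
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. For Super, the same call with `checkpoint="Cosmos3-Super"` and a larger `nproc_per_node` mirrors the paragraph above.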

## Cookbook assets layout

Audiovisual prompts and conditioning media live under `cookbooks/cosmos3/generator/audiovisual/assets/`:

:::files
cookbooks/cosmos3/generator/audiovisual/assets/
├── prompts/
│   ├── text2image/        # e.g. robot_draping.json
│   ├── text2video/        # e.g. robot_kitchen.json, robot_pouring_water_audio.json
│   └── image2video/       # e.g. car_driving.json, coastal_road_audio.json
├── negative_prompts/
│   ├── text2video/neg_prompt.json
│   └── image2video/neg_prompt.json
└── images/image2video/    # e.g. car_driving.jpg, coastal_road_audio.jpg
:::

Prompt files are **structured JSON scene specs** (subjects, cinematography, `temporal_caption`, `resolution`, `fps`, etc.), not plain strings. The quickstart passes the prompt file directly to `-i`; the full notebook wraps the same files into framework payload JSON with sampling fields.

Action examples use `cookbooks/cosmos3/generator/action/assets/` (images, videos, trajectories) and write specs under `packages/cosmos3/outputs/cookbooks/cosmos3/generator/action/inputs/`.

## Audiovisual input spec (notebook payload)

The audiovisual notebook builds per-run JSON under `outputs/notebooks/pytorch/payloads/.json` with a consistent schema:

| Field | Typical value | Notes |
| --- | --- | --- |
| `model_mode` | `text2image`, `text2video`, `image2video` | Selects Generator modality |
| `prompt` | Compact JSON **string** of the scene spec file | From `assets/prompts/...` |
| `negative_prompt` | Compact JSON string or `""` | From `assets/negative_prompts//neg_prompt.json`; empty for text2image |
| `enable_sound` | `true` / `false` | Sound-bearing prompts use dedicated asset pairs |
| `num_steps` | `35` | Diffusion steps |
| `guidance` | `6.0` | CFG scale |
| `shift` | `10.0` | Scheduler flow shift |
| `fps` | `24` | |
| `num_frames` | `189` (video), `1` (text2image) | ~7.9 s at 24 FPS for default video |
| `resolution` | `"720"` | |
| `aspect_ratio` | `"16,9"` | Comma-separated pair in cookbook payloads |
| `seed` | `0` | |
| `vision_path` | Relative path to conditioning image | Required for `image2video` |

Example payload fragment (text-to-video, no audio):

```json
{
  "model_mode": "text2video",
  "name": "t2v_nano_noaudio",
  "prompt": "{...compact scene JSON...}",
  "negative_prompt": "{...compact neg prompt JSON...}",
  "enable_sound": false,
  "num_steps": 35,
  "guidance": 6.0,
  "shift": 10.0,
  "fps": 24,
  "num_frames": 189,
  "resolution": "720",
  "aspect_ratio": "16,9",
  "seed": 0
}
```

Image-to-video adds `vision_path` relative to the payload file directory (e.g. path into `assets/images/image2video/`).
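
The wrapping step described above can be sketched in a few lines of Python. This is an illustration rather than the notebook's actual code: `build_payload` is a hypothetical helper, the scene and negative-prompt contents are placeholders, and the sampling values are the typical ones from the table.

```python
import json


def build_payload(name, model_mode, scene, negative=None, *,
                  enable_sound=False, vision_path=None):
    """Wrap a scene spec into a payload dict (hypothetical helper)."""
    compact = {"separators": (",", ":")}
    payload = {
        "model_mode": model_mode,
        "name": name,
        # The scene spec is embedded as a compact JSON string, not a nested object.
        "prompt": json.dumps(scene, **compact),
        # Empty string when no negative prompt is supplied (the text2image case).
        "negative_prompt": "" if negative is None else json.dumps(negative, **compact),
        "enable_sound": enable_sound,
        "num_steps": 35,
        "guidance": 6.0,
        "shift": 10.0,
        "fps": 24,
        "num_frames": 1 if model_mode == "text2image" else 189,
        "resolution": "720",
        "aspect_ratio": "16,9",
        "seed": 0,
    }
    if model_mode == "image2video":
        if vision_path is None:
            raise ValueError("image2video payloads need vision_path")
        payload["vision_path"] = vision_path
    return payload


# Placeholder contents; real specs are loaded from assets/prompts/... files.
scene = {"temporal_caption": "A robot arm tidies a kitchen counter.", "fps": 24}
negative = {"note": "placeholder negative prompt"}
payload = build_payload("t2v_nano_noaudio", "text2video", scene, negative)
print(json.dumps(payload, indent=2))
```

With these defaults a video run covers 189 / 24 = 7.875 s, which the table rounds to about 7.9 s.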

### Notebook asset matrix

| Use case key | Checkpoint | Mode | Sound |
| --- | --- | --- | --- |
| `t2i` | Cosmos3-Nano | text2image | off |
| `t2i_super` | Cosmos3-Super | text2image | off |
| `t2v_nano_noaudio` | Cosmos3-Nano | text2video | off |
| `t2vs` | Cosmos3-Nano | text2video | on |
| `i2v_nano_noaudio` | Cosmos3-Nano | image2video | off |
| `i2vs` | Cosmos3-Nano | image2video | on |
| `t2v_super_noaudio` | Cosmos3-Super | text2video | off |
| `i2v_super_noaudio` | Cosmos3-Super | image2video | off |

Run pattern (text-to-image on Nano):

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0,1,2,3 \
.venv/bin/torchrun \
  --nproc-per-node=4 \
  --master-addr=127.0.0.1 \
  --master-port=<free_port> \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i /path/to/t2i.json \
  -o /path/to/output/t2i \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## Scale checkpoints and GPUs

| Checkpoint | Size | Cookbook GPU hint |
| --- | ---: | --- |
| `Cosmos3-Nano` | 16B | Quickstart: 1 GPU; audiovisual notebook default: 4 GPUs |
| `Cosmos3-Super` | 64B | Same `throughput` preset; raise `--nproc-per-node` to available GPU count |

Set `export COSMOS3_NUM_GPUS=4` and `export CUDA_VISIBLE_DEVICES=0,1,2,3` before notebook cells, or pass `--nproc-per-node` explicitly in shell commands.

## Action Generator (forward and inverse dynamics)

Action workflows use **JSONL** specs (one JSON object per line) and the `latency` preset. They are documented in depth on [Run Generator action workflows](/run-generator-action); summary for Framework-only runs:

**Forward dynamics** (`model_mode`: `forward_dynamics`): start image + `action_path` + `domain_name` (`av`, `droid_lerobot`, `umi`, …). Output video per run:

```text
//vision.mp4
```

**Inverse dynamics** (`model_mode`: `inverse_dynamics`): input `vision_path` video only; predicted action in `//sample_outputs.json` under `outputs[0].content["action"]`.
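
Reading the predicted action back out of an inverse-dynamics run can be sketched as follows. `read_predicted_action` is a hypothetical helper that follows the `outputs[0].content["action"]` access path stated above. The demo file and its 2x2 action array are made up for illustration, and the real array shape depends on the embodiment.

```python
import json
import tempfile
from pathlib import Path


def read_predicted_action(sample_outputs_path):
    """Return outputs[0].content["action"] from a sample_outputs.json file."""
    data = json.loads(Path(sample_outputs_path).read_text(encoding="utf-8"))
    return data["outputs"][0]["content"]["action"]


# Self-contained demo with a fabricated file; real values come from a run.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sample_outputs.json"
    fake = {"outputs": [{"content": {"action": [[0.0, 0.1], [0.2, 0.3]]}}]}
    demo.write_text(json.dumps(fake), encoding="utf-8")
    action = read_predicted_action(demo)
    print(len(action), "action rows")
```

A missing key raises `KeyError`, which is usually the quickest signal that the run did not finish or that the file came from a different workflow.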

Example AV forward-dynamics record:

```json
{
  "action_chunk_size": 60,
  "action_path": "/abs/path/to/av_traj_forward.json",
  "domain_name": "av",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "model_mode": "forward_dynamics",
  "name": "av_forward",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0,
  "vision_path": "/abs/path/to/av_0.jpg"
}
```

Run:

```bash
cd packages/cosmos3
CUDA_VISIBLE_DEVICES=0 \
.venv/bin/python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  -i outputs/cookbooks/cosmos3/generator/action/inputs/action_forward_dynamics_av_custom.jsonl \
  -o outputs/cookbooks/cosmos3/generator/action/action_forward_dynamics_av_custom \
  --checkpoint-path Cosmos3-Nano \
  --video-save-quality 8 \
  --image_size 480 \
  --seed 0 \
  --benchmark
```

Embodiment dimensions and FPS defaults for AV / DROID / UMI are summarized in the action cookbook README (9D–10D pose deltas, 60 frames @ 10 FPS for AV, etc.).

## Outputs and verification

| Workflow | Output location | Success signal |
| --- | --- | --- |
| Audiovisual | `-o` directory; `*.mp4` or `*.png` under run subfolders | Generated media files; notebook `view_run()` skips `*_preview.mp4` |
| Forward dynamics | `//vision.mp4` | MP4 exists per JSONL `name` |
| Inverse dynamics | `//sample_outputs.json` | `action` array in first output content |
| With `--benchmark` | `benchmark.json` under output root | Timing averages in JSON |

Hugging Face weights cache under `HF_HOME` (default `~/.cache/huggingface`).
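
The audiovisual success signal in the table above can be checked from a script. The sketch below is an assumption-laden illustration, not the notebook's `view_run()`: `find_generated_media` is a hypothetical helper that collects `*.mp4` and `*.png` files under an output root and skips `*_preview.mp4`. The file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def find_generated_media(output_root):
    """List *.mp4 and *.png files under output_root, skipping *_preview.mp4."""
    found = []
    for path in sorted(Path(output_root).rglob("*")):
        if path.suffix not in {".mp4", ".png"}:
            continue
        if path.name.endswith("_preview.mp4"):
            continue
        found.append(path)
    return found


# Demo against a temporary directory with fabricated file names.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "t2v_nano_noaudio"
    run_dir.mkdir()
    for name in ("vision.mp4", "vision_preview.mp4", "notes.txt"):
        (run_dir / name).touch()
    media = [p.name for p in find_generated_media(tmp)]
    print(media)
```

An empty result after a run that exited cleanly is worth treating as a failure, since the table lists generated media files as the success signal.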

## Useful environment variables

| Variable | Default / role |
| --- | --- |
| `COSMOS3_REPO` | Framework checkout path (`packages/cosmos3`) |
| `COSMOS3_UV_GROUP` | `cu130-train` or `cu128-train` |
| `COSMOS3_UV_ENV` / `UV_PROJECT_ENVIRONMENT` | Path to `.venv` used by `torchrun` |
| `COSMOS3_NUM_GPUS` | `4` in audiovisual notebook |
| `CUDA_VISIBLE_DEVICES` | GPU indices for the run |
| `COSMOS3_MASTER_ADDR` | `127.0.0.1` for distributed audiovisual |
| `COSMOS3_*_MASTER_PORT` | Per-workflow free ports in notebook |
| `HF_HOME` / `HF_TOKEN` | Model download cache and auth |
| `GIT_LFS_SKIP_SMUDGE` | `1` during `uv sync` |

Action notebooks may require a one-time kernel restart after `configure_cosmos_framework_runtime_env()` updates `LD_LIBRARY_PATH` for CUDA and FFmpeg libraries.

## Troubleshooting

**Headless import errors** (`libxcb.so.1`): install `libxcb1 libgl1 libglib2.0-0` on minimal Linux images.

| Symptom | Mitigation |
| --- | --- |
| `cuda available: False` after sync | Switch to `cu128-train` on CUDA 12.x drivers; confirm with `nvidia-smi` |
| `uv` parse / `--torch-backend` errors | Upgrade `uv` to ≥ 0.11.3 (`uv self update`) |
| Clone / sync failures on LFS blobs | Keep `GIT_LFS_SKIP_SMUDGE=1` for cookbook installs |
| Missing `torchrun` | Run install cell; use `$COSMOS3_UV_ENV/bin/torchrun` explicitly |
| Super OOM | Reduce resolution in payload, use fewer frames, or add GPUs via `--nproc-per-node` |

See [Troubleshooting](/troubleshooting) for cross-backend CUDA and container notes.

## Related pages

- Shared uv/Docker setup, HF auth, and CUDA group selection for all backends.
- Full notebook matrix for text-to-image, text-to-video, and image-to-video with optional sound.
- Forward and inverse dynamics JSONL specs, domains, and Framework vs vLLM-Omni paths.
- Resolution tiers, frame counts, prompt limits, and output formats.
- Structured JSON prompt schema and Generator sampling defaults.
- When to prefer Framework vs Diffusers vs vLLM-Omni.
- Python-first alternative without a framework checkout.