# Run Generator with Diffusers

> Install Cosmos3OmniPipeline dependencies, configure UniPC scheduler flow_shift, run text-to-image/video and image-to-video with structured JSON prompts, and export MP4 outputs.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/audiovisual/README.md`
- `cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb`
- `cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json`
- `cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json`
- `cookbooks/cosmos3/README.md`

---


The Generator audiovisual workflows in this repository call the Hugging Face Diffusers `Cosmos3OmniPipeline` from a dedicated Python 3.13 venv. They swap in `UniPCMultistepScheduler` with `flow_shift`, pass structured scene JSON through `json.dumps`, and write PNG or MP4 outputs under `cookbooks/cosmos3/generator/audiovisual`. The canonical runnable path is [`run_with_diffusers.ipynb`](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb); the audiovisual README quickstart uses the same API with fewer moving parts.

## Prerequisites

| Requirement | Detail |
| --- | --- |
| OS / GPU | Linux with NVIDIA Ampere, Hopper, or Blackwell GPU |
| Python tooling | `uv` ≥ 0.11.3, `git`, `git-lfs` |
| Model access | Gated Hugging Face repos (`nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super`) |
| Auth | `uvx hf@latest auth login` or `HF_TOKEN` |
| CUDA pairing | Pin `--torch-backend=cu130` (CUDA 13 driver) or `cu128` (CUDA 12.x) — see [Cookbook environment setup](/cookbook-environment) |

<Warning>
On headless Linux, imports may fail with `libxcb.so.1: cannot open shared object file`. Install `libxcb1`, `libgl1`, and `libglib2.0-0` before running the pipeline.
</Warning>
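
As a pre-flight check for the warning above, the following is a minimal sketch that asks the dynamic linker whether those shared libraries can be found. The helper name `missing_system_packages` and the mapping from library short names to package names are assumptions for illustration, and `ctypes.util.find_library` is only a heuristic, so treat a clean result as a hint rather than a guarantee.

```python
import ctypes.util

# Short library names as understood by ctypes.util.find_library, mapped to the
# packages named in the warning. The mapping itself is an assumption.
REQUIRED_LIBS = {
    "xcb": "libxcb1",
    "GL": "libgl1",
    "glib-2.0": "libglib2.0-0",
}


def missing_system_packages(required=None):
    """Return the package names whose shared library cannot be located."""
    required = REQUIRED_LIBS if required is None else required
    return [
        package
        for lib_name, package in required.items()
        if ctypes.util.find_library(lib_name) is None
    ]


if __name__ == "__main__":
    missing = missing_system_packages()
    if missing:
        print("Missing system libraries; try installing:", " ".join(missing))
    else:
        print("All listed system libraries were found.")
```

Running this before the first pipeline import surfaces the problem as a short message instead of a stack trace.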

Work from `cookbooks/cosmos3/generator/audiovisual` so relative asset paths resolve.

## Install Cosmos3OmniPipeline dependencies

Shared install steps live in [Cookbook environment setup — Diffusers](/cookbook-environment). For a standalone venv at the repo root:

<Steps>
<Step title="Create and activate the venv">

```bash
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
```

</Step>
<Step title="Install packages with a CUDA-matched torch backend">

```bash
uv pip install --torch-backend=cu130 \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers
```

Use `--torch-backend=cu128` when your driver reports CUDA 12.x.

</Step>
<Step title="Verify GPU visibility">

```bash
.venv/bin/python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
```

Expect `cuda available: True` and a valid device name.

</Step>
</Steps>

The notebook installs into `.venv-cosmos3-diffusers` by default (`COSMOS3_DIFFUSERS_VENV`), registers a Jupyter kernel named `Cosmos3 Diffusers (Python 3.13)`, and requires that kernel for all inference cells.

<Note>
`--torch-backend=auto` in the root README quickstart lets uv pick a CUDA wheel. If the chosen wheel does not match the installed driver, `torch.cuda.is_available()` can return `False`. Prefer an explicit `cu130` or `cu128` tag, as in the cookbooks guide.
</Note>
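
One way to choose the tag explicitly is to read the CUDA version that the driver reports. The sketch below parses the `CUDA Version: X.Y` field from the `nvidia-smi` banner and maps it to the two tags this guide mentions. The function names are hypothetical, and the mapping covers only `cu130` and `cu128`; any other version raises an error rather than guessing.

```python
import re
import subprocess
from typing import Optional


def backend_for_cuda_version(version: str) -> str:
    """Map a driver-reported CUDA version such as '12.8' to a uv torch-backend tag."""
    major = int(version.split(".")[0])
    if major == 13:
        return "cu130"
    if major == 12:
        return "cu128"
    raise ValueError(f"No torch backend listed in this guide for CUDA {version}")


def detect_cuda_version() -> Optional[str]:
    """Read the 'CUDA Version: X.Y' field from the nvidia-smi banner, if present."""
    try:
        banner = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # no driver tools on PATH, or nvidia-smi failed
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", banner)
    return match.group(1) if match else None
```

For example, `backend_for_cuda_version(detect_cuda_version() or "12.8")` produces a value that can be passed to `uv pip install --torch-backend=...`; the `"12.8"` fallback is only a placeholder.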

## Runtime layout

```text
cookbooks/cosmos3/generator/audiovisual/
├── assets/
│   ├── prompts/          # structured JSON per modality
│   ├── negative_prompts/ # video modes only
│   └── images/           # image2video conditioning
├── run_with_diffusers.ipynb
└── outputs/notebooks/    # default COSMOS3_AUDIOVISUAL_OUTPUT_ROOT
```

```mermaid
flowchart LR
  subgraph inputs [Inputs]
    P[JSON prompt file]
    N[negative_prompt.json]
    I[conditioning image]
  end
  subgraph pipeline [Cosmos3OmniPipeline]
    L[from_pretrained]
    S[UniPCMultistepScheduler flow_shift]
    G[pipe denoise call]
  end
  subgraph outputs [Outputs]
    PNG[PNG text2image]
    MP4[MP4 via export_to_video or encode_video]
  end
  P --> L
  N --> G
  I --> G
  L --> S --> G
  G --> PNG
  G --> MP4
```

## Configure UniPC scheduler and flow_shift

After `Cosmos3OmniPipeline.from_pretrained`, replace the default scheduler:

```python
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=10.0
)
```

Cookbook defaults set `flow_shift` (named `shift` in cookbook payloads) to **10.0**, matching the vLLM-Omni `flow_shift` in the audiovisual README. Re-apply the scheduler before any run whose payload overrides `shift`:

```python
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=payload["shift"]
)
```
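
The per-run override can be wrapped in a small helper that falls back to the cookbook default when the payload has no `shift` key. This is a sketch only: `apply_flow_shift` is a hypothetical name, and the scheduler class is passed in as an argument so the helper does not import Diffusers itself.

```python
DEFAULT_FLOW_SHIFT = 10.0  # cookbook default quoted above


def apply_flow_shift(pipe, scheduler_cls, payload):
    """Rebuild pipe.scheduler with the payload's shift, or the cookbook default.

    `scheduler_cls` is expected to provide a `from_config` classmethod that
    accepts a `flow_shift` override, as in the snippets above.
    """
    shift = float(payload.get("shift", DEFAULT_FLOW_SHIFT))
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config, flow_shift=shift)
    return shift
```

With a loaded pipeline this would be called as `apply_flow_shift(pipe, UniPCMultistepScheduler, payload)` just before `pipe(...)`.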

## Structured JSON prompts

Scene prompts are JSON objects (subjects, lighting, cinematography, temporal fields), not plain strings. Pass them as compact JSON strings:

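```python
import json

# Load the structured prompt and negative prompt as Python objects.
with open("assets/prompts/text2video/robot_kitchen.json") as f:
    prompt = json.load(f)
with open("assets/negative_prompts/text2video/neg_prompt.json") as f:
    negative = json.load(f)

# `pipe` is the loaded Cosmos3OmniPipeline; serialize each object to a string.
result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    # ... remaining sampling arguments
)
```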

| Field group | Typical keys |
| --- | --- |
| Scene | `subjects`, `background_setting`, `lighting`, `aesthetics` |
| Motion (video) | `actions`, `segments`, `temporal_caption`, `cinematography` |
| Output hints | `resolution` (`W`/`H`), `aspect_ratio` (e.g. `"16,9"`), `duration`, `fps` |

Text-to-image prompts may use `comprehensive_t2i_caption` instead of `temporal_caption`. Negative prompts for **text-to-video** and **image-to-video** live under `assets/negative_prompts/<mode>/neg_prompt.json`. Text-to-image runs use an empty `negative_prompt`.
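
To make the serialization step concrete, here is a minimal sketch that builds a prompt object in code and turns it into the single-line string passed as `prompt`. Only the key names come from the table above; the values and their types (for example, a list for `subjects`) are invented for illustration and are not taken from the repository's prompt files.

```python
import json

# Hypothetical scene values; only the key names come from the table above.
prompt_obj = {
    "subjects": ["a small wheeled robot"],
    "background_setting": "a tidy home kitchen",
    "lighting": "soft morning light",
    "temporal_caption": "The robot rolls to the counter and picks up a mug.",
    "aspect_ratio": "16,9",
    "fps": 24,
}

# json.dumps without `indent` emits one line, however the source file was formatted.
prompt_text = json.dumps(prompt_obj)

# The string parses back to the same object, so nothing is lost in transit.
assert json.loads(prompt_text) == prompt_obj
```

For real runs, start from the JSON files under `assets/prompts/` rather than hand-written objects.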

Disable template injection when prompts already encode resolution and duration:

<ParamField body="add_resolution_template" type="bool">
When `False`, do not append resolution templates to the prompt (cookbook default).
</ParamField>

<ParamField body="add_duration_template" type="bool">
When `False`, do not append duration templates (cookbook default).
</ParamField>

## Default sampling parameters

| Parameter | Cookbook value | Maps to `pipe()` |
| --- | ---: | --- |
| `num_steps` | 35 | `num_inference_steps` |
| `guidance` | 6.0 | `guidance_scale` |
| `shift` | 10.0 | `UniPCMultistepScheduler` `flow_shift` |
| `fps` | 24 | `fps` |
| `num_frames` | 189 | `num_frames` (~7.9 s at 24 FPS) |
| `resolution` + `aspect_ratio` | `720` + `16,9` | `height=720`, `width=1280` |
| `seed` | 1234 | `torch.Generator(device="cuda").manual_seed(...)` |

189 frames at 24 FPS aligns with the standard Cosmos3 video profile in the root README.
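
The mapping in the table can be sketched as a small translation function. `payload_to_pipe_kwargs` is a hypothetical helper, not the notebook's own code, and it assumes the width follows from the height and aspect ratio by plain integer arithmetic, which reproduces the 720 and `16,9` to 1280 case shown above.

```python
def payload_to_pipe_kwargs(payload: dict) -> dict:
    """Translate cookbook payload keys into the pipe() keyword names in the table."""
    height = int(payload["resolution"])
    ratio_w, ratio_h = (int(part) for part in payload["aspect_ratio"].split(","))
    return {
        "num_inference_steps": payload["num_steps"],
        "guidance_scale": payload["guidance"],
        "fps": payload["fps"],
        "num_frames": payload["num_frames"],
        "height": height,
        # Assumption: width = height * W / H, using integer division.
        "width": height * ratio_w // ratio_h,
    }


# Cookbook defaults from the table. `shift` and `seed` are not pipe() keywords:
# they feed the scheduler and the torch.Generator respectively.
defaults = {
    "num_steps": 35,
    "guidance": 6.0,
    "shift": 10.0,
    "fps": 24,
    "num_frames": 189,
    "resolution": "720",
    "aspect_ratio": "16,9",
    "seed": 1234,
}

kwargs = payload_to_pipe_kwargs(defaults)
clip_seconds = kwargs["num_frames"] / kwargs["fps"]  # 189 / 24 = 7.875 seconds
```

Other aspect ratios may need rounding to sizes the model supports; this sketch does not attempt that.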

## Load the pipeline

```python
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # quickstart style
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)
```

| Checkpoint alias | Hugging Face ID |
| --- | --- |
| `Cosmos3-Nano` | `nvidia/Cosmos3-Nano` |
| `Cosmos3-Super` | `nvidia/Cosmos3-Super` |

The notebook loads with `safety_checker=None`, `enable_safety_checker=True`, optional `HF_TOKEN`, then `pipe.to("cuda")`. Super needs substantially more VRAM; expect longer first-run downloads and denoising time.

## Text-to-video quickstart

From `cookbooks/cosmos3/generator/audiovisual`:

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

prompt = json.load(open("assets/prompts/text2video/robot_kitchen.json"))
negative = json.load(open("assets/negative_prompts/text2video/neg_prompt.json"))

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

result = pipe(
    prompt=json.dumps(prompt),
    negative_prompt=json.dumps(negative),
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "/tmp/cosmos3_t2v_diffusers.mp4", fps=24, macro_block_size=1)
```

<Check>
Success: an MP4 is written at the target path. The first run also downloads `Cosmos3-Nano` into `HF_HOME`. Long per-step logs across the 35 denoising steps are expected and do not indicate a hang.
</Check>

## Generator modes

| Mode | `num_frames` | Conditioning | Sound | Output |
| --- | ---: | --- | --- | --- |
| Text-to-image | 1 | — | off | PNG (`result.video[0].save`) |
| Text-to-video | 189 (default) | `image=None` | optional | MP4 |
| Image-to-video | 189 (default) | `load_image(...)` | optional | MP4 |
| Text-to-video with sound | 189 | — | `enable_sound=True` | MP4 + AAC via `encode_video` |

Diffusers mode names in the root README: `text-to-image`, `text-to-video`, `image-to-video`, `text-to-video-with-sound`. Sound requires checkpoints with sound modules; mux with `encode_video(..., audio=result.sound, audio_sample_rate=pipe.sound_tokenizer.config.sampling_rate)` when `result.sound` is present.

### Text-to-image

```python
result = pipe(
    prompt=json.dumps(prompt_obj),
    negative_prompt="",
    num_frames=1,
    height=720,
    width=1280,
    num_inference_steps=35,
    guidance_scale=6.0,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
result.video[0].save("robot_draping.png")
```

Example prompt: `assets/prompts/text2image/robot_draping.json`.

### Image-to-video

```python
from diffusers.utils import load_image

image = load_image("assets/images/image2video/car_driving.jpg")
result = pipe(
    prompt=json.dumps(prompt_obj),
    negative_prompt=json.dumps(negative_obj),
    image=image,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
export_to_video(result.video, "car_driving.mp4", fps=24, macro_block_size=1)
```

Pair prompts under `assets/prompts/image2video/` with images under `assets/images/image2video/`.

## Notebook workflow

`run_with_diffusers.ipynb` sequences: configure paths → install venv/kernel → verify CUDA → preview assets → `create_payload(use_case, backend="diffusers")` → `run_diffusers_payload(...)` → `view_run(...)`.

| Use case key | Model | Mode |
| --- | --- | --- |
| `t2i` | Nano | text2image |
| `t2v_nano_noaudio` | Nano | text2video |
| `t2vs` | Nano | text2video + sound |
| `i2v_nano_noaudio` | Nano | image2video |
| `i2vs` | Nano | image2video + sound |
| `t2i_super` / `t2v_super_noaudio` / `i2v_super_noaudio` | Super | same modes |

Outputs default to `outputs/notebooks/diffusers/<use_case>/`.

## Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `COSMOS3_DIFFUSERS_VENV` | `<repo>/.venv-cosmos3-diffusers` | Dedicated venv path |
| `COSMOS3_TORCH_BACKEND` | `cu130` | `uv pip install --torch-backend` |
| `COSMOS3_AUDIOVISUAL_OUTPUT_ROOT` | `.../outputs/notebooks` | Payloads and media |
| `HF_HOME` | `~/.cache/huggingface` | Model cache |
| `CUDA_VISIBLE_DEVICES` | `0` | GPU selection |
| `HF_TOKEN` | unset | Gated model download |
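
As an illustration of how the output root combines with the per-use-case layout `outputs/notebooks/diffusers/<use_case>/`, the sketch below resolves a run directory from the environment. The helper name `output_dir` is hypothetical, and the relative fallback `outputs/notebooks` assumes the working directory is `cookbooks/cosmos3/generator/audiovisual`.

```python
import os
from pathlib import Path


def output_dir(use_case: str, environ=os.environ) -> Path:
    """Return <output root>/diffusers/<use_case> for a notebook use case."""
    # Fallback is relative to the audiovisual cookbook directory (assumption).
    root = Path(environ.get("COSMOS3_AUDIOVISUAL_OUTPUT_ROOT", "outputs/notebooks"))
    return root / "diffusers" / use_case
```

For example, `output_dir("t2v_nano_noaudio")` resolves under the default root unless `COSMOS3_AUDIOVISUAL_OUTPUT_ROOT` is set. The function only computes the path and does not create it.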

## Troubleshooting

| Symptom | Mitigation |
| --- | --- |
| `torch.cuda.is_available()` is `False` | Match `--torch-backend` to driver (`cu130` / `cu128`); see [Troubleshooting](/troubleshooting) |
| `libxcb.so.1` on import | Install `libxcb1`, `libgl1`, and `libglib2.0-0` (see Prerequisites) |
| `uv` rejects `--torch-backend=cu130` | Upgrade `uv` to ≥ 0.11.3 |
| Kernel mismatch in notebook | Switch to `Cosmos3 Diffusers (Python 3.13)` and run the restore cell |
| OOM on Super | Start with Nano; for Super, use the multi-GPU serving path in [Run Generator with vLLM-Omni](/run-generator-vllm-omni) |

## Related pages

<CardGroup>
<Card title="Cookbook environment setup" href="/cookbook-environment">
Shared uv install, CUDA backend tags, and GPU verification for Diffusers, Framework, and vLLM backends.
</Card>
<Card title="Diffusers pipeline reference" href="/diffusers-pipeline-reference">
`Cosmos3OmniPipeline.from_pretrained` modes, call arguments, and `export_to_video` details.
</Card>
<Card title="Sampling and prompt parameters" href="/sampling-and-prompt-parameters">
Structured JSON schema, prompt-upsampling defaults, and template flags.
</Card>
<Card title="Audiovisual cookbook recipes" href="/audiovisual-cookbooks">
Notebook index for Diffusers, Framework, and vLLM-Omni with asset layout.
</Card>
<Card title="Choose an integration" href="/choose-integration">
When to prefer Diffusers vs vLLM-Omni vs Cosmos Framework.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Resolution tiers, frame counts, aspect ratios, and output formats.
</Card>
</CardGroup>
