Agent-readable docs

NVIDIA Cosmos 3 Documentation

Source-grounded reference for Cosmos 3 omnimodal world models: Reasoner and Generator runtime surfaces, Hugging Face checkpoints, integration paths (Diffusers, vLLM-Omni, vLLM, Cosmos Framework), runnable cookbooks, and OpenAI-compatible serving APIs for Physical AI developers.

Pages

Overview - Cosmos 3 omnimodal world model surfaces (Reasoner vs Generator), primary entry points, supported modalities, and the shortest path to a first generation or reasoning call.
Installation - Prerequisites (Linux, NVIDIA GPU, uv, Hugging Face auth), CUDA driver pairing (cu130/cu128), venv and Docker setup paths, and environment verification commands.
Quickstart - Minimal first-run commands for Generator (Diffusers text-to-video, vLLM-Omni curl) and Reasoner (vLLM serve + OpenAI chat completion), including HF login and expected success signals.
Choose an integration - Decision matrix for Diffusers, vLLM-Omni, vLLM, Transformers (coming soon), and Cosmos Framework by goal: research, production inference, training, or evaluation.
Reasoner and Generator - MoT architecture modes: autoregressive Reasoner (text/vision in, text out) vs diffusion Generator (multimodal in, vision/sound/action out), shared mRoPE, and when to use each surface.
Model family - Checkpoint catalog (Nano 16B, Super 64B, Text2Image, Image2Video, Nano-Policy-DROID), Hugging Face IDs, capability focus, and size tradeoffs for serving.
Input and output specifications - Supported input/output types and formats, resolution tiers (256p–720p), aspect ratios, frame rates/counts, vision conditioning frame counts, prompt length limits, and sound output specs.
Action modality - Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows.
Cookbook environment setup - Shared uv/Docker setup for all backends: HF auth, CUDA backend tags, Cosmos Framework clone/sync, Diffusers venv, vLLM + vllm-cosmos3 plugin, vLLM-Omni Docker image, and GPU verification probes.
Run Generator with Diffusers - Install Cosmos3OmniPipeline dependencies, configure UniPC scheduler flow_shift, run text-to-image/video and image-to-video with structured JSON prompts, and export MP4 outputs.
Run Generator with vLLM-Omni - Start vllm/vllm-omni:cosmos3 Docker server, tensor-parallel and CFG/Ulysses options for Super, POST vision/action endpoints, guardrails toggles, and deploy-config for server-wide guardrail disable.
Run Generator with Cosmos Framework - Clone cosmos-framework, uv sync cu130-train/cu128-train groups, torchrun cosmos_framework.scripts.inference with parallelism presets, checkpoint-path, and JSON input specs from cookbook assets.
Run Generator action workflows - Forward dynamics (image + action trajectory) and inverse dynamics (video + instruction) across Framework torchrun and vLLM-Omni multipart /v1/videos requests with domain_name and action_mode extra_params.
Run Reasoner with vLLM - Install vllm-cosmos3 plugin, serve Cosmos3ReasonerForConditionalGeneration with mm-encoder and media-io-kwargs, Qwen3-VL-compatible chat messages, and reasoning-format prompt suffix.
Run Reasoner with Cosmos Framework - Build reasoner JSON inputs (model_mode, vision_path, enable_sound), run cosmos_framework.scripts.inference with latency preset, and read reasoner_text.txt outputs; scale Nano to Super via torchrun.
vLLM-Omni API reference - OpenAI-compatible endpoints (/v1/images/generations, /v1/videos, /v1/videos/sync), request fields (prompt, size, num_frames, guidance_scale, extra_params), action_mode values, and curl --form-string constraints.
Diffusers pipeline reference - Cosmos3OmniPipeline.from_pretrained modes (text-to-image, text-to-video, image-to-video, text-to-video-with-sound), key call arguments, export_to_video, and torch-backend install pairing.
Reasoner vLLM configuration - vllm serve flags: hf-overrides architectures, tensor-parallel-size, mm-encoder-tp-mode, async-scheduling, allowed-local-media-path, media-io-kwargs, VLLM_USE_DEEP_GEMM, and vLLM/cu130 version pairs.
Sampling and prompt parameters - Generator prompt-upsampling defaults, Reasoner sampling tables (with/without reasoning), structured JSON prompt schema, Qwen3-VL message shape, and redacted_reasoning format instruction.
Audiovisual cookbook recipes - End-to-end notebooks for text-to-image, text-to-video, image-to-video with optional sound across Diffusers, Cosmos Framework, and vLLM-Omni; asset layout under assets/prompts and assets/images.
Action cookbook recipes - Forward-dynamics and inverse-dynamics notebooks for AV, DROID, and UMI with checked-in trajectories, LeRobot sample data, and Framework vs vLLM-Omni output directories.
Reasoner cookbook recipes - Runnable workflows for captioning, temporal localization, embodied/common-sense reasoning, 2D grounding, describe-anything, action CoT, physical plausibility, and situation understanding with bundled media assets.
Inference benchmarks - Published latency tables for Cosmos3-Nano/Super Generator (PyTorch, vLLM-Omni, Diffusers by GPU/resolution/TP) and Reasoner vLLM serving metrics (TTFT, throughput at concurrency tiers).
Troubleshooting - CUDA/driver mismatches, NGC container selection, torch.cuda unavailable fixes, libxcb headless imports, uv version and --torch-backend errors, and VLLM_USE_DEEP_GEMM workaround.
Ecosystem, license, and release - Related Cosmos projects (Framework, Curator, Evaluator), OpenMDW-1.1 license terms, known model limitations, release cadence pointers, and third-party dependency notices.

Complete Markdown

The complete agent-readable Markdown files are published separately from this HTML page.