# Action modality

> Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/assets/actions/umi.json`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json`
- `cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`

---

Cosmos 3 Generator treats **action** as a first-class modality: action tokens encode transitions between consecutive visual states, are denoised alongside vision (and optionally audio) in Generator mode, and are selected at inference time through `model_mode` (Cosmos Framework JSONL) or `action_mode` plus `domain_name` (vLLM-Omni `extra_params`). Checked-in workflows under `cookbooks/cosmos3/generator/action/` exercise forward and inverse dynamics for AV (9D), DROID (10D), and UMI (10D); the root README documents additional embodiment sizes including humanoid 29D.

## Action token semantics

In Generator mode, the diffusion path denoises image, video, audio, and **action** tokens with full attention, sharing the same transformer stack and 3D mRoPE as other modalities. The action cookbook defines tokens as **transitions between consecutive visual states**, not absolute world poses in isolation.

| Concept | Meaning |
| --- | --- |
| Token semantics | One action token per inter-frame transition (pose delta, gripper change, etc.) |
| Unified pose core | **9D** = 3D translation + **6D continuous rotation** (`rot6d`) between consecutive states |
| Grasp / hand | **1D** open–close for parallel grippers; **15D** human hand (3D × 5 fingers) where applicable |
| On-disk interchange | JSON array of rows: `[[d₀…dₙ₋₁], …]` with row count and `d` fixed per embodiment |
| Framework field | `model_mode`: `forward_dynamics`, `inverse_dynamics` (and other Generator modes outside action) |
| Serving field | `action_mode` in `extra_params`: `forward_dynamics`, `inverse_dynamics`, `policy` |

Reasoner workflows predict **text** (next action, action CoT, etc.) from vision; they do not use `domain_name` / `action_mode`. Action **generation** and rollouts are Generator-side.

## Embodiment dimensions

Supported conditioning sizes are embodiment-specific. Cookbooks ship runnable examples for three domains; the project README lists the full matrix.

### Cookbook-covered embodiments

| Embodiment | `domain_name` (examples) | Vector | Composition | Generation duration (cookbook) |
| --- | --- | ---: | --- | --- |
| Autonomous vehicle | `av` | **9D** | Ego pose delta (translation + rot6d) | 60 frames @ 10 FPS |
| [DROID](https://arxiv.org/abs/2403.12945) | `droid_lerobot` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 frames @ 15 FPS |
| UMI | `umi` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 frames @ 20 FPS |

DROID forward-dynamics uses multiview LeRobot data (`assets/droid_lerobot_example/`), with post-processing described as multiview concatenation, `to-OpenCV`, and normalization. UMI stores a long trajectory in `assets/actions/umi.json` (rows of 10 floats); notebooks split it into **16-action chunks** for autoregressive rollouts.
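
The chunking step described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `split_into_chunks` is a hypothetical helper name, and the trajectory here is a synthetic stand-in for the rows that would be loaded from `umi.json`. The row-width and divisibility checks mirror the assertions the page attributes to the UMI notebooks.

```python
from typing import List, Sequence

ACTION_DIM = 10  # UMI row width (9D pose delta + 1D gripper)
CHUNK_SIZE = 16  # actions per forward-dynamics chunk


def split_into_chunks(
    rows: Sequence[Sequence[float]],
    chunk_size: int = CHUNK_SIZE,
    dim: int = ACTION_DIM,
) -> List[List[List[float]]]:
    """Validate row width, then cut the trajectory into fixed-size chunks."""
    for index, row in enumerate(rows):
        if len(row) != dim:
            raise ValueError(f"row {index} has {len(row)} values, expected {dim}")
    if len(rows) % chunk_size != 0:
        raise ValueError(
            f"{len(rows)} rows is not a multiple of chunk size {chunk_size}"
        )
    return [
        [list(row) for row in rows[start:start + chunk_size]]
        for start in range(0, len(rows), chunk_size)
    ]


# Synthetic stand-in for a trajectory loaded from a JSON array of 10-float rows.
trajectory = [[0.0] * ACTION_DIM for _ in range(48)]
chunks = split_into_chunks(trajectory)
print(len(chunks), len(chunks[0]), len(chunks[0][0]))
```

Each resulting chunk is one forward-dynamics request; rejecting a ragged or non-divisible trajectory up front avoids sending a short final chunk to the model.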

### Additional embodiments (README)

| Setting | Dimensionality | Notes |
| --- | ---: | --- |
| Camera motion | 9D | Same pose-delta family as AV |
| Autonomous vehicle | 9D | Listed separately in I/O spec; cookbook uses `av` |
| Egocentric motion | 57D | Documented; no checked-in action cookbook yet |
| Single-arm robot (DROID / UR / Fractal / Bridge / UMI) | 10D | DROID and UMI have examples |
| Dual-arm robot (dual DROID) | 20D | Documented; no cookbook example yet |
| Humanoid (AgiBot) | **29D** | Documented; action cookbook TODO lists more embodiments |

Dedicated policy checkpoint: **[Cosmos3-Nano-Policy-DROID](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID)** (16B) for DROID manipulation policy, separate from general Cosmos3-Nano action dynamics demos.

### 9D and 10D layout

**9D row** (AV and pose component of robotics):

```text
[tx, ty, tz, r1, r2, r3, r4, r5, r6]   # meters + rot6d
```

**10D row** (DROID / UMI): 9D pose delta + 1D gripper grasp state.

AV trajectories are often produced from absolute camera-to-world poses (OpenCV convention, meters) via `pose_abs_to_rel` in `cosmos_framework.data.vfm.action.pose_utils`, with:

- `rotation_format="rot6d"`
- `pose_convention="backward_framewise"`
- `translation_scale=1.35` (AV cookbook convention)

That yields **`[T−1, 9]`** relative rows for **`T`** visual frames. Checked-in AV files such as `assets/actions/av_traj_forward.json` are JSON arrays of 9-float rows.

## Action workflow modes

Three Generator **action modes** map to different inputs and outputs. They align with `model_mode` in Framework JSONL and `action_mode` in vLLM-Omni requests.

| Mode | `action_mode` / `model_mode` | Primary input | Primary output | Typical endpoint |
| --- | --- | --- | --- | --- |
| **Forward dynamics** | `forward_dynamics` | Start **image** + action trajectory | **Video** rollout | `POST /v1/videos` (cookbooks) or `POST /v1/videos/sync` (README) |
| **Inverse dynamics** | `inverse_dynamics` | **Video** + text prompt | Predicted **action** chunk (+ job metadata) | Async `POST /v1/videos` |
| **Action policy** | `policy` | **Image** + instruction | **Video** + predicted **action** chunk | Async `POST /v1/videos` |

```text
                   domain_name + embodiment dim
                                │
forward_dynamics:  image + action[]  ──►  video
inverse_dynamics:  video + prompt    ──►  action[]   (no action_path in JSONL)
policy:            image + prompt    ──►  video + action[]
```

**Forward dynamics** conditions on a known trajectory and predicts future observations. **Inverse dynamics** predicts the trajectory that explains an input video. **Policy** predicts actions (and optionally rollout video) from context and language.

### Forward dynamics (`fd`)

- **Inputs:** `vision_path` (conditioning image), `action_path` or inline `action` array, optional `prompt`.
- **Outputs:** `vision.mp4` under the run directory; forward jobs in cookbooks do not require `action` in the completed response.
- **Frame count:** `num_frames = action_chunk_size + 1` (e.g. 61 for AV with chunk size 60).
- **Autoregressive robotics / UMI:** Later chunks use the **last generated frame** from the previous chunk as the next conditioning image (DROID: 5×16 actions; UMI: all 16-action segments in `umi.json`).

### Inverse dynamics (`id`)

- **Inputs:** `vision_path` points to an **MP4**; JSONL has **no** `action_path`.
- **Outputs:** Predicted ego-motion trajectory; vLLM-Omni returns `action` on the completed job (`shape`, `dtype`, `data`). Notebooks mirror Framework as `sample_outputs.json`.
- **AV example:** `raw_action_dim: 9`, `action_chunk_size: 60`, `domain_name: "av"`, `view_point: "ego_view"`.
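
The autoregressive forward-dynamics pattern and the `num_frames = action_chunk_size + 1` rule can be illustrated with a small sketch. Everything named here is hypothetical: `rollout` is an illustrative helper, `generate_chunk` stands in for whatever call actually produces a video (Framework run or vLLM-Omni request), and `fake_generate_chunk` is a placeholder that labels frames with strings. The sketch also assumes that the first returned frame is the conditioning image, which the source does not state.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in type; a real frame would be an image array or a file path
Chunk = Sequence[Sequence[float]]


def rollout(
    first_frame: Frame,
    chunks: Sequence[Chunk],
    generate_chunk: Callable[[Frame, Chunk], List[Frame]],
) -> List[List[Frame]]:
    """Chain chunks so each one is conditioned on the previous chunk's last frame."""
    conditioning = first_frame
    videos: List[List[Frame]] = []
    for index, chunk in enumerate(chunks):
        frames = generate_chunk(conditioning, chunk)
        expected = len(chunk) + 1  # num_frames = action_chunk_size + 1
        if len(frames) != expected:
            raise ValueError(
                f"chunk {index}: got {len(frames)} frames, expected {expected}"
            )
        videos.append(frames)
        conditioning = frames[-1]  # hand the last generated frame to the next chunk
    return videos


def fake_generate_chunk(image: Frame, chunk: Chunk) -> List[Frame]:
    """Placeholder generator: conditioning frame plus one labeled frame per action."""
    return [image] + [f"{image}>a{i}" for i in range(len(chunk))]


# DROID-style schedule from the cookbook description: 5 chunks of 16 ten-dim actions.
droid_chunks = [[[0.0] * 10 for _ in range(16)] for _ in range(5)]
videos = rollout("frame0", droid_chunks, fake_generate_chunk)
print(len(videos), len(videos[0]))
```

Swapping `fake_generate_chunk` for a real generation call keeps the handoff logic unchanged; only the frame type and the I/O differ.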

### Policy

- Documented in README: image + instruction → video + predicted action chunk.
- Use async `POST /v1/videos` and read action from the completed result (same pattern as inverse dynamics).
- Example `domain_name` values in docs include `bridge_orig_lerobot` and `camera_pose` (see vLLM-Omni Cosmos 3 serving examples).

## `domain_name` conditioning

`domain_name` tells the model which **embodiment parser**, normalization, chunking, and view geometry apply. It must stay consistent with action row dimensionality and `action_chunk_size`.

| `domain_name` | Used in | `action_chunk_size` | `image_size` | `view_point` (examples) |
| --- | --- | ---: | ---: | --- |
| `av` | AV fd / id cookbooks | 60 | 480 | `ego_view` |
| `droid_lerobot` | DROID fd cookbook | 16 | 480 | From dataset (`viewpoint`) |
| `umi` | UMI fd cookbook | 16 | 256 | `ego_view` |

README also references `bridge_orig_lerobot`, `camera_pose`, and other robot/AV/camera variants in [vLLM-Omni Cosmos 3 online serving examples](https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/cosmos3).

Mismatching `domain_name`, row length, or `raw_action_dim` produces server-side errors or invalid rollouts. UMI notebooks assert every row has length **10** before chunking.

## JSONL input spec (Cosmos Framework)

Each inference line is one JSON object (JSONL). Shared fields across action cookbooks:

- `domain_name`: Embodiment key (`av`, `droid_lerobot`, `umi`, …).
- `model_mode`: `forward_dynamics` or `inverse_dynamics` for action workflows.
- `action_chunk_size`: Number of action transitions in the chunk (60 AV, 16 robotics/UMI).
- `vision_path`: Absolute path to conditioning **image** (fd) or **video** (id).
- `action_path`: Required for **forward_dynamics** only; path to JSON action array.
- `fps`: Output video frame rate (10 AV, 15 DROID, 20 UMI in cookbooks).
- `image_size`: Short-edge resolution tier (480 AV/DROID, 256 UMI). vLLM may infer canvas from this without explicit `size`.
- `view_point`: Camera geometry hint, e.g. `ego_view` or dataset `viewpoint`.
- `prompt`: Task text; AV examples use *"You are an autonomous vehicle planning system."*; DROID uses dataset `ai_caption`.
- `seed`: Per-run reproducibility seed.

Example AV forward-dynamics record (conceptual):

```json
{
  "name": "av_forward",
  "domain_name": "av",
  "model_mode": "forward_dynamics",
  "action_chunk_size": 60,
  "action_path": "/path/to/av_traj_forward.json",
  "vision_path": "/path/to/av_0.jpg",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
```

Framework entrypoint:

```bash
torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i action_forward_dynamics_av_custom.jsonl \
  -o /tmp/cosmos3_action_fd \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## vLLM-Omni `extra_params`

Multipart `POST /v1/videos` sends vision via `input_reference` and packs Cosmos-specific options in `extra_params` (JSON string; use `curl --form-string` so semicolons are not stripped).

- `action_mode`: `forward_dynamics`, `inverse_dynamics`, or `policy`.
- `domain_name`: Same semantics as JSONL.
- `action_chunk_size`: Matches JSONL / trajectory length.
- `action`: For forward dynamics: inline JSON trajectory (cookbooks load `action_path` into this field).
- `raw_action_dim`: Set for inverse dynamics when the server must know output width (AV id: **9**).
- `image_size`: Resolution tier for action canvas.
- `view_point`: View geometry for conditioning.
- Guardrail toggle: Cookbooks often set `false` for robotics/UMI; default product guardrails apply when omitted.

Forward-dynamics request shape (from notebooks): `num_frames = action_chunk_size + 1`, `fps` from the record, `guidance_scale=1.0`, `flow_shift=10.0`, plus `extra_params` above. Poll `GET /v1/videos/{id}` until `completed`, then `GET /v1/videos/{id}/content` for MP4 bytes.

| Mode | Returns `action` in job result? | Sync endpoint |
| --- | --- | --- |
| `forward_dynamics` | Optional; cookbooks consume video only | `POST /v1/videos/sync` supported per README |
| `inverse_dynamics` | **Yes** (required for id notebooks) | Async only |
| `policy` | **Yes** | Async only |

Start the server with `--allowed-local-media-path` covering conditioning media and action JSON paths (Docker: mount repo at `/workspace`).

## Asset layout

:::files
cookbooks/cosmos3/generator/action/
├── README.md
├── assets/
│   ├── actions/                 # av_traj_*.json (9D), umi.json (10D)
│   ├── images/                  # av_0.jpg, umi.png, …
│   ├── videos/                  # av_*.mp4 for inverse dynamics
│   └── droid_lerobot_example/   # LeRobot layout + meta/info.json
├── run_fd_with_cosmos_framework.ipynb
├── run_fd_with_vllm.ipynb
├── run_id_with_cosmos_framework.ipynb
└── run_id_with_vllm.ipynb
:::

Outputs default to `outputs/cosmos3_action_vllm/` (vLLM) or framework package output trees under `packages/cosmos3/outputs/cookbooks/...`.

## Verification signals

| Check | Expected |
| --- | --- |
| Action JSON row width | AV/id: 9 floats; DROID/UMI fd: 10 floats |
| AV trajectory from poses | `pose_abs_to_rel` → `[T−1, 9]` |
| UMI chunking | `len(umi_action) % 16 == 0` |
| vLLM forward fd | `vision.mp4` in `//` |
| vLLM inverse id | `final.json` contains `action` with `data`; `sample_outputs.json` written |
| Frame alignment | `num_frames == action_chunk_size + 1` |

For AV, visualize predicted or input trajectories with `pose_rel_to_abs` using the same `rot6d`, `backward_framewise`, and `translation_scale=1.35` convention as forward-dynamics prep.

## Related pages

- Step-by-step forward and inverse dynamics with Framework torchrun and vLLM-Omni multipart requests.
- Notebook index for AV, DROID, and UMI with checked-in trajectories and output directories.
- `/v1/videos` fields, `action_mode` values, and `curl --form-string` constraints.
- Global I/O types, action conditioning matrix, and resolution tiers.
- MoT surfaces: when to use Generator action vs Reasoner text outputs.
- Cosmos3-Nano, Super, and Cosmos3-Nano-Policy-DROID checkpoints.