# Action modality

> Action token semantics, embodiment dimensions (AV 9D, DROID 10D, UMI 10D, humanoid 29D), policy/inverse/forward dynamics modes, and domain_name conditioning for Generator action workflows.

- Repository: NVIDIA/cosmos
- GitHub: https://github.com/NVIDIA/cosmos
- Human docs: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9
- Complete Markdown: https://grok-wiki.com/public/docs/nvidia-cosmos-82de3e90abd9/llms-full.txt

## Source Files

- `README.md`
- `cookbooks/cosmos3/generator/action/README.md`
- `cookbooks/cosmos3/generator/action/assets/actions/umi.json`
- `cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json`
- `cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json`
- `cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb`

---


Cosmos 3 Generator treats **action** as a first-class modality. Action tokens encode transitions between consecutive visual states and are denoised alongside vision (and optionally audio) in Generator mode. The action workflow is selected at inference time through `model_mode` (Cosmos Framework JSONL) or through `action_mode` plus `domain_name` (vLLM-Omni `extra_params`). Checked-in workflows under `cookbooks/cosmos3/generator/action/` exercise forward and inverse dynamics for AV (9D), DROID (10D), and UMI (10D); the root README documents additional embodiment sizes, including humanoid 29D.

## Action token semantics

In Generator mode, the diffusion path denoises image, video, audio, and **action** tokens together with full attention; action tokens share the same transformer stack and 3D mRoPE as the other modalities. The action cookbook defines each token as a **transition between consecutive visual states**, not as an absolute world pose.

| Concept | Meaning |
| --- | --- |
| Token semantics | One action token per inter-frame transition (pose delta, gripper change, etc.) |
| Unified pose core | **9D** = 3D translation + **6D continuous rotation** (`rot6d`) between consecutive states |
| Grasp / hand | **1D** open–close for parallel grippers; **15D** human hand (3D × 5 fingers) where applicable |
| On-disk interchange | JSON array of rows: `[[d₀…dₙ₋₁], …]` with row count and `d` fixed per embodiment |
| Framework field | `model_mode`: `forward_dynamics`, `inverse_dynamics` (and other Generator modes outside action) |
| Serving field | `action_mode` in `extra_params`: `forward_dynamics`, `inverse_dynamics`, `policy` |

<Note>
Reasoner workflows predict **text** (next action, action CoT, etc.) from vision; they do not use `domain_name` / `action_mode`. Action **generation** and rollouts are Generator-side.
</Note>

## Embodiment dimensions

Supported conditioning sizes are embodiment-specific. Cookbooks ship runnable examples for three domains; the project README lists the full matrix.

### Cookbook-covered embodiments

| Embodiment | `domain_name` (examples) | Vector | Composition | Generation duration (cookbook) |
| --- | --- | ---: | --- | --- |
| Autonomous vehicle | `av` | **9D** | Ego pose delta (translation + rot6d) | 60 actions (61 frames) @ 10 FPS |
| [DROID](https://arxiv.org/abs/2403.12945) | `droid_lerobot` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 actions (17 frames) per chunk @ 15 FPS |
| UMI | `umi` | **10D** | 9D end-effector pose + **1D** gripper grasp | 16 actions (17 frames) per chunk @ 20 FPS |

DROID forward-dynamics uses multiview LeRobot data (`assets/droid_lerobot_example/`), with post-processing described as multiview concatenation, `to-OpenCV`, and normalization. UMI stores a long trajectory in `assets/actions/umi.json` (rows of 10 floats); notebooks split it into **16-action chunks** for autoregressive rollouts.

### Additional embodiments (README)

| Setting | Dimensionality | Notes |
| --- | ---: | --- |
| Camera motion | 9D | Same pose-delta family as AV |
| Autonomous vehicle | 9D | Listed separately in I/O spec; cookbook uses `av` |
| Egocentric motion | 57D | Documented; no checked-in action cookbook yet |
| Single-arm robot (DROID / UR / Fractal / Bridge / UMI) | 10D | DROID and UMI have examples |
| Dual-arm robot (dual DROID) | 20D | Documented; no cookbook example yet |
| Humanoid (AgiBot) | **29D** | Documented; action cookbook TODO lists more embodiments |

Dedicated policy checkpoint: **[Cosmos3-Nano-Policy-DROID](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID)** (16B) for DROID manipulation policy, separate from general Cosmos3-Nano action dynamics demos.

### 9D and 10D layout

**9D row** (AV and pose component of robotics):

```text
[tx, ty, tz, r1, r2, r3, r4, r5, r6]   # meters + rot6d
```

**10D row** (DROID / UMI): 9D pose delta + 1D gripper grasp state.
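
To make the row layout concrete, the following is a minimal sketch of splitting a row into its parts and decoding the six rotation values with Gram-Schmidt, the usual decoding for continuous 6D rotations. The helper names `split_row` and `rot6d_to_basis` are hypothetical, and the ordering of the six values (first 3-vector followed by second 3-vector) is an assumption; check `pose_utils` for the convention Cosmos actually uses.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def split_row(row):
    """Split a 9D or 10D action row into (translation, rot6d, gripper)."""
    if len(row) not in (9, 10):
        raise ValueError(f"expected 9 or 10 values, got {len(row)}")
    translation = list(row[:3])
    rot6d = list(row[3:9])
    gripper = row[9] if len(row) == 10 else None  # None for 9D (AV) rows
    return translation, rot6d, gripper


def rot6d_to_basis(r6):
    """Gram-Schmidt decode of a 6D rotation into three orthonormal vectors.

    Assumption: r6 holds vector a1 (3 values) followed by a2 (3 values).
    Whether the returned vectors are rows or columns of the rotation
    matrix depends on the producer's convention.
    """
    if len(r6) != 6:
        raise ValueError(f"expected 6 values, got {len(r6)}")
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    return [b1, b2, b3]
```

The identity rotation is `[1, 0, 0, 0, 1, 0]` under this layout, which is a quick sanity check for a stationary frame pair.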

AV trajectories are often produced from absolute camera-to-world poses (OpenCV convention, meters) via `pose_abs_to_rel` in `cosmos_framework.data.vfm.action.pose_utils`, with:

- `rotation_format="rot6d"`
- `pose_convention="backward_framewise"`
- `translation_scale=1.35` (AV cookbook convention)

That yields **`[T−1, 9]`** relative rows for **`T`** visual frames. Checked-in AV files such as `assets/actions/av_traj_forward.json` are JSON arrays of 9-float rows.

## Action workflow modes

Three Generator **action modes** map to different inputs and outputs. They align with `model_mode` in Framework JSONL and `action_mode` in vLLM-Omni requests.

| Mode | `action_mode` / `model_mode` | Primary input | Primary output | Typical endpoint |
| --- | --- | --- | --- | --- |
| **Forward dynamics** | `forward_dynamics` | Start **image** + action trajectory | **Video** rollout | `POST /v1/videos` (cookbooks) or `POST /v1/videos/sync` (README) |
| **Inverse dynamics** | `inverse_dynamics` | **Video** + text prompt | Predicted **action** chunk (+ job metadata) | Async `POST /v1/videos` |
| **Action policy** | `policy` | **Image** + instruction | **Video** + predicted **action** chunk | Async `POST /v1/videos` |

```text
                    domain_name + embodiment dim
                              │
  forward_dynamics:  image + action[] ──► video
  inverse_dynamics:  video + prompt    ──► action[]  (no action_path in JSONL)
  policy:            image + prompt    ──► video + action[]
```

<Info>
**Forward dynamics** conditions on a known trajectory and predicts future observations. **Inverse dynamics** predicts the trajectory that explains an input video. **Policy** predicts actions (and optionally rollout video) from context and language.
</Info>

### Forward dynamics (`fd`)

- **Inputs:** `vision_path` (conditioning image), `action_path` or inline `action` array, optional `prompt`.
- **Outputs:** `vision.mp4` under the run directory; forward jobs in cookbooks do not require `action` in the completed response.
- **Frame count:** `num_frames = action_chunk_size + 1` (e.g. 61 for AV with chunk size 60).
- **Autoregressive robotics / UMI:** Later chunks use the **last generated frame** from the previous chunk as the next conditioning image (DROID: 5×16 actions; UMI: all 16-action segments in `umi.json`).
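
The chunked rollout described in the last bullet can be sketched as below. `generate_chunk` is a hypothetical callable standing in for one forward-dynamics call (Framework or vLLM-Omni), and the sketch assumes the returned clip has `action_chunk_size + 1` frames with the conditioning image as its first frame.

```python
def chunk_actions(rows, chunk_size=16):
    """Split a long trajectory into equal chunks of `chunk_size` rows."""
    if len(rows) % chunk_size != 0:
        raise ValueError(
            f"{len(rows)} rows is not a multiple of chunk size {chunk_size}"
        )
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def autoregressive_rollout(first_image, rows, generate_chunk, chunk_size=16):
    """Roll out a long trajectory chunk by chunk.

    generate_chunk(image, actions) -> list of frames is a placeholder for
    a single forward-dynamics request; it is not a Cosmos API.
    """
    frames = [first_image]
    image = first_image
    for actions in chunk_actions(rows, chunk_size):
        clip = generate_chunk(image, actions)
        # num_frames = action_chunk_size + 1
        if len(clip) != len(actions) + 1:
            raise ValueError("clip length does not match chunk size + 1")
        frames.extend(clip[1:])  # assumed: clip[0] repeats the conditioning frame
        image = clip[-1]         # last generated frame conditions the next chunk
    return frames
```

With two 16-action chunks this yields 33 frames in total: the start image plus 16 generated frames per chunk.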

### Inverse dynamics (`id`)

- **Inputs:** `vision_path` points to an **MP4**; JSONL has **no** `action_path`.
- **Outputs:** Predicted action trajectory (ego motion for AV). vLLM-Omni returns `action` on the completed job (`shape`, `dtype`, `data`); the notebooks also write `sample_outputs.json` to mirror the Framework output.
- **AV example:** `raw_action_dim: 9`, `action_chunk_size: 60`, `domain_name: "av"`, `view_point: "ego_view"`.

### Policy

- Documented in README: image + instruction → video + predicted action chunk.
- Use async `POST /v1/videos` and read action from the completed result (same pattern as inverse dynamics).
- Example `domain_name` values in docs include `bridge_orig_lerobot` and `camera_pose` (see vLLM-Omni Cosmos 3 serving examples).

## `domain_name` conditioning

`domain_name` tells the model which **embodiment parser**, normalization, chunking, and view geometry apply. It must stay consistent with action row dimensionality and `action_chunk_size`.

| `domain_name` | Used in | `action_chunk_size` | `image_size` | `view_point` (examples) |
| --- | --- | ---: | ---: | --- |
| `av` | AV fd / id cookbooks | 60 | 480 | `ego_view` |
| `droid_lerobot` | DROID fd cookbook | 16 | 480 | From dataset (`viewpoint`) |
| `umi` | UMI fd cookbook | 16 | 256 | `ego_view` |

README also references `bridge_orig_lerobot`, `camera_pose`, and other robot/AV/camera variants in [vLLM-Omni Cosmos 3 online serving examples](https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/cosmos3).

<Warning>
Mismatching `domain_name`, row length, or `raw_action_dim` produces server-side errors or invalid rollouts. UMI notebooks assert every row has length **10** before chunking.
</Warning>
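
A client-side check can catch these mismatches before a request is sent. The sketch below hard-codes the three cookbook domains from the tables above; the `DOMAIN_SPECS` dictionary and `check_action_request` function are hypothetical names, not part of Cosmos Framework or vLLM-Omni.

```python
# Values copied from the cookbook tables above (row width, chunk size,
# image size). Other domain_name values are not covered by this sketch.
DOMAIN_SPECS = {
    "av": {"dim": 9, "action_chunk_size": 60, "image_size": 480},
    "droid_lerobot": {"dim": 10, "action_chunk_size": 16, "image_size": 480},
    "umi": {"dim": 10, "action_chunk_size": 16, "image_size": 256},
}


def check_action_request(domain_name, rows, action_chunk_size, image_size):
    """Return a list of problems; an empty list means the inputs agree."""
    spec = DOMAIN_SPECS.get(domain_name)
    if spec is None:
        return [f"domain_name {domain_name!r} is not covered by this sketch"]
    problems = []
    bad_rows = [i for i, row in enumerate(rows) if len(row) != spec["dim"]]
    if bad_rows:
        problems.append(f"rows {bad_rows[:5]} are not {spec['dim']}D")
    if action_chunk_size != spec["action_chunk_size"]:
        problems.append(
            f"action_chunk_size {action_chunk_size} != {spec['action_chunk_size']}"
        )
    if len(rows) != action_chunk_size:
        problems.append(
            f"{len(rows)} rows given for action_chunk_size {action_chunk_size}"
        )
    if image_size != spec["image_size"]:
        problems.append(f"image_size {image_size} != {spec['image_size']}")
    return problems
```

Returning a list rather than raising on the first mismatch makes it easier to report every inconsistency in a record at once.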

## JSONL input spec (Cosmos Framework)

Each inference line is one JSON object (JSONL). Shared fields across action cookbooks:

<ParamField body="domain_name" type="string" required>
Embodiment key (`av`, `droid_lerobot`, `umi`, …).
</ParamField>

<ParamField body="model_mode" type="string" required>
`forward_dynamics` or `inverse_dynamics` for action workflows.
</ParamField>

<ParamField body="action_chunk_size" type="integer" required>
Number of action transitions in the chunk (60 AV, 16 robotics/UMI).
</ParamField>

<ParamField body="vision_path" type="string" required>
Absolute path to conditioning **image** (fd) or **video** (id).
</ParamField>

<ParamField body="action_path" type="string">
Required for **forward_dynamics** only; path to JSON action array.
</ParamField>

<ParamField body="fps" type="integer" required>
Output video frame rate (10 AV, 15 DROID, 20 UMI in cookbooks).
</ParamField>

<ParamField body="image_size" type="integer" required>
Short-edge resolution tier (480 AV/DROID, 256 UMI). vLLM may infer canvas from this without explicit `size`.
</ParamField>

<ParamField body="view_point" type="string" required>
Camera geometry hint, e.g. `ego_view` or dataset `viewpoint`.
</ParamField>

<ParamField body="prompt" type="string">
Task text; AV examples use *"You are an autonomous vehicle planning system."*; DROID uses dataset `ai_caption`.
</ParamField>

<ParamField body="seed" type="integer">
Per-run reproducibility seed.
</ParamField>

Example AV forward-dynamics record (conceptual):

```json
{
  "name": "av_forward",
  "domain_name": "av",
  "model_mode": "forward_dynamics",
  "action_chunk_size": 60,
  "action_path": "/path/to/av_traj_forward.json",
  "vision_path": "/path/to/av_0.jpg",
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
```
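
Records like this one can be assembled and serialized from Python. The sketch below enforces the `action_path` rule (present for forward dynamics, absent for inverse dynamics) and emits one JSON object per line. `make_record` and `dumps_jsonl` are hypothetical helper names, and the paths are placeholders.

```python
import json

ACTION_MODES = ("forward_dynamics", "inverse_dynamics")


def make_record(name, model_mode, **fields):
    """Build one JSONL record, checking the action_path rule per mode."""
    if model_mode not in ACTION_MODES:
        raise ValueError(f"unsupported model_mode {model_mode!r}")
    has_action_path = "action_path" in fields
    if model_mode == "forward_dynamics" and not has_action_path:
        raise ValueError("forward_dynamics needs action_path")
    if model_mode == "inverse_dynamics" and has_action_path:
        raise ValueError("inverse_dynamics must not set action_path")
    return {"name": name, "model_mode": model_mode, **fields}


def dumps_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


record = make_record(
    "av_forward",
    "forward_dynamics",
    domain_name="av",
    action_chunk_size=60,
    action_path="/path/to/av_traj_forward.json",  # placeholder path
    vision_path="/path/to/av_0.jpg",              # placeholder path
    fps=10,
    image_size=480,
    view_point="ego_view",
    prompt="You are an autonomous vehicle planning system.",
    seed=0,
)
jsonl_text = dumps_jsonl([record])
```

Writing `jsonl_text` to a file gives an input suitable for the `-i` argument of the inference script.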

Framework entrypoint:

```bash
torchrun --nproc-per-node=1 \
  -m cosmos_framework.scripts.inference \
  --parallelism-preset=throughput \
  -i action_forward_dynamics_av_custom.jsonl \
  -o /tmp/cosmos3_action_fd \
  --checkpoint-path Cosmos3-Nano \
  --seed=0
```

## vLLM-Omni `extra_params`

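Multipart `POST /v1/videos` sends vision via `input_reference` and packs Cosmos-specific options into `extra_params`, a JSON string. With `curl`, pass it via `--form-string` so the value is sent literally; plain `-F` gives special meaning to a leading `@` or `<` and to `;type=` inside the value.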

<ParamField body="action_mode" type="string" required>
`forward_dynamics`, `inverse_dynamics`, or `policy`.
</ParamField>

<ParamField body="domain_name" type="string" required>
Same semantics as JSONL.
</ParamField>

<ParamField body="action_chunk_size" type="integer" required>
Matches JSONL / trajectory length.
</ParamField>

<ParamField body="action" type="array">
Required for forward dynamics: inline JSON trajectory (cookbooks load `action_path` into this field). Not sent for inverse dynamics, where the action chunk is the output.
</ParamField>

<ParamField body="raw_action_dim" type="integer">
Set for inverse dynamics when the server must know output width (AV id: **9**).
</ParamField>

<ParamField body="image_size" type="integer" required>
Resolution tier for action canvas.
</ParamField>

<ParamField body="view_point" type="string" required>
View geometry for conditioning.
</ParamField>

<ParamField body="guardrails" type="boolean">
Cookbooks often set `false` for robotics/UMI; default product guardrails apply when omitted.
</ParamField>

Forward-dynamics request shape (from notebooks): `num_frames = action_chunk_size + 1`, `fps` from the record, `guidance_scale=1.0`, `flow_shift=10.0`, plus `extra_params` above. Poll `GET /v1/videos/{id}` until `completed`, then `GET /v1/videos/{id}/content` for MP4 bytes.
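
The request shape above can be sketched as two small helpers: one that builds the text form fields and one that polls until the job completes. This is an illustration, not the notebook code. The `prompt` form field, the `status` key in the job response, and the `failed` status value are assumptions; `get_job` is a hypothetical callable wrapping `GET /v1/videos/{id}`. The conditioning image still has to be attached separately as `input_reference`.

```python
import json
import time


def build_fd_form(record, actions, guidance_scale=1.0, flow_shift=10.0):
    """Build the text fields of a forward-dynamics multipart request."""
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": record["domain_name"],
        "action_chunk_size": record["action_chunk_size"],
        "action": actions,  # inline trajectory loaded from action_path
        "image_size": record["image_size"],
        "view_point": record["view_point"],
    }
    return {
        "prompt": record.get("prompt", ""),  # assumed field name
        "num_frames": str(record["action_chunk_size"] + 1),
        "fps": str(record["fps"]),
        "guidance_scale": str(guidance_scale),
        "flow_shift": str(flow_shift),
        "extra_params": json.dumps(extra_params),  # sent as a JSON string
    }


def poll_until_completed(get_job, video_id, interval_s=2.0, timeout_s=600.0,
                         sleep=time.sleep):
    """Call get_job(video_id) until its status is "completed".

    Assumes the response is a dict with a "status" key and that "failed"
    marks an unrecoverable job; both are guesses about the response shape.
    """
    waited = 0.0
    while True:
        job = get_job(video_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"job {video_id} failed: {job}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {video_id} not completed after {waited}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `get_job` and `sleep` keeps the polling logic independent of any particular HTTP client and makes it easy to exercise without a running server.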

| Mode | Returns `action` in job result? | Sync endpoint |
| --- | --- | --- |
| `forward_dynamics` | Optional; cookbooks consume video only | `POST /v1/videos/sync` supported per README |
| `inverse_dynamics` | **Yes** (required for id notebooks) | Async only |
| `policy` | **Yes** | Async only |

Start the server with `--allowed-local-media-path` covering conditioning media and action JSON paths (Docker: mount repo at `/workspace`).

## Asset layout

:::files
cookbooks/cosmos3/generator/action/
├── README.md
├── assets/
│   ├── actions/          # av_traj_*.json (9D), umi.json (10D)
│   ├── images/           # av_0.jpg, umi.png, …
│   ├── videos/           # av_*.mp4 for inverse dynamics
│   └── droid_lerobot_example/  # LeRobot layout + meta/info.json
├── run_fd_with_cosmos_framework.ipynb
├── run_fd_with_vllm.ipynb
├── run_id_with_cosmos_framework.ipynb
└── run_id_with_vllm.ipynb
:::

Outputs default to `outputs/cosmos3_action_vllm/` for vLLM runs, or to Framework output trees under `packages/cosmos3/outputs/cookbooks/...`.

## Verification signals

| Check | Expected |
| --- | --- |
| Action JSON row width | AV (fd and id): 9 floats; DROID/UMI (fd): 10 floats |
| AV trajectory from poses | `pose_abs_to_rel` → `[T−1, 9]` |
| UMI chunking | `len(umi_action) % 16 == 0` |
| vLLM forward dynamics | `vision.mp4` in `<output>/<name>/` |
| vLLM inverse dynamics | `final.json` contains `action` with `data`; `sample_outputs.json` written |
| Frame alignment | `num_frames == action_chunk_size + 1` |

<Tip>
For AV, visualize predicted or input trajectories with `pose_rel_to_abs` using the same `rot6d`, `backward_framewise`, and `translation_scale=1.35` convention as forward-dynamics prep.
</Tip>

## Related pages

<CardGroup>
<Card title="Run Generator action workflows" href="/run-generator-action">
Step-by-step forward and inverse dynamics with Framework torchrun and vLLM-Omni multipart requests.
</Card>
<Card title="Action cookbook recipes" href="/action-cookbooks">
Notebook index for AV, DROID, and UMI with checked-in trajectories and output directories.
</Card>
<Card title="vLLM-Omni API reference" href="/vllm-omni-api-reference">
`/v1/videos` fields, `action_mode` values, and `curl --form-string` constraints.
</Card>
<Card title="Input and output specifications" href="/input-output-specifications">
Global I/O types, action conditioning matrix, and resolution tiers.
</Card>
<Card title="Reasoner and Generator" href="/reasoner-and-generator">
MoT surfaces: when to use Generator action vs Reasoner text outputs.
</Card>
<Card title="Model family" href="/model-family">
Cosmos3-Nano, Super, and Cosmos3-Nano-Policy-DROID checkpoints.
</Card>
</CardGroup>
