# Cookbook: Hardware-Aware Model Recommendations via hwfit

> How services/hwfit/ scans GPU/CPU/RAM, fits GGUF/FP8/AWQ candidates against the box, and proposes a download-and-serve plan — the practical heart of the "click to install a local LLM" pitch, adapted from the llmfit library.

- Repository: pewdiepie-archdaemon/odysseus
- GitHub: https://github.com/pewdiepie-archdaemon/odysseus
- Human wiki: https://grok-wiki.com/public/wiki/pewdiepie-archdaemon-odysseus-8b8805c93124
- Complete Markdown: https://grok-wiki.com/public/wiki/pewdiepie-archdaemon-odysseus-8b8805c93124/llms-full.txt

## Source Files

- `services/hwfit/hardware.py`
- `services/hwfit/fit.py`
- `services/hwfit/models.py`
- `services/hwfit/image_models.py`
- `routes/cookbook_routes.py`
- `routes/cookbook_helpers.py`
- `routes/hwfit_routes.py`

---

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [services/hwfit/hardware.py](services/hwfit/hardware.py)
- [services/hwfit/fit.py](services/hwfit/fit.py)
- [services/hwfit/models.py](services/hwfit/models.py)
- [services/hwfit/image_models.py](services/hwfit/image_models.py)
- [services/hwfit/data/hf_models.json](services/hwfit/data/hf_models.json)
- [routes/hwfit_routes.py](routes/hwfit_routes.py)
- [routes/cookbook_routes.py](routes/cookbook_routes.py)
- [licenses/llmfit-MIT-LICENSE.txt](licenses/llmfit-MIT-LICENSE.txt)
- [README.md](README.md)
</details>

The Cookbook tab is Odysseus's "click to install a local LLM" surface, and `services/hwfit/` is the brain behind it. It probes the box (NVIDIA, AMD, Apple, or remote Windows over SSH), reads a curated catalog of hundreds of Hugging Face models, simulates how each one would fit in VRAM at various quantizations, scores them on quality/speed/fit/context, and hands the ranked list back to the UI so the user can click "Download" or "Serve" on a model that will actually run. The library is adapted from Alex Jones's open-source `llmfit` (MIT; see [licenses/llmfit-MIT-LICENSE.txt:1-3](licenses/llmfit-MIT-LICENSE.txt) and [README.md:13](README.md)) and rebuilt inside Odysseus as a provider-neutral, BYOC sizer that targets local engines (vLLM, llama.cpp, Ollama) rather than any hosted API.

The interesting part is not the catalog — it's the arithmetic. The same model can fit, half-fit, or not fit depending on quant, KV-cache context, whether GPUs are identical, and whether vLLM can tensor-parallel across them. `hwfit` encodes those constraints in roughly 800 lines of pure Python, with no GPU code, so it runs on the web server and produces an answer in milliseconds.

## What the probe actually reads

`detect_system()` is the single entry point. It returns a flat dict of total/available RAM (GB), CPU cores, CPU model, GPU name, total VRAM, per-GPU detail, and a `backend` string drawn from `{cuda, rocm, cpu_x86, cpu_arm, mps}`. The detection is cached per host for 30 minutes (`CACHE_TTL = 1800`) because, as the comment notes, "hardware rarely changes; use the Rescan button to force a re-probe."
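The per-host caching described above can be sketched as follows. This is a minimal illustration, not the repository's code: only `CACHE_TTL = 1800` comes from the source, while `cached_detect`, `probe`, and the `_cache` layout are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL = 1800  # seconds (30 minutes), the value quoted from the source

# Hypothetical cache layout: host -> (timestamp, probe result)
_cache: Dict[str, Tuple[float, dict]] = {}


def cached_detect(host: str,
                  probe: Callable[[str], dict],
                  force: bool = False,
                  now: Callable[[], float] = time.monotonic) -> dict:
    """Return the cached probe for `host` unless it is stale or `force` is set."""
    entry = _cache.get(host)
    t = now()
    if not force and entry is not None and t - entry[0] < CACHE_TTL:
        return entry[1]          # still fresh: skip the expensive probe
    result = probe(host)         # the "Rescan" path passes force=True
    _cache[host] = (t, result)
    return result
```

Passing the clock in as `now` is only there to make the expiry testable without sleeping.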

NVIDIA detection runs `nvidia-smi --query-gpu=memory.total,name --format=csv,noheader,nounits` and parses one row per device, using the row position as the CUDA index that later gets pinned via `CUDA_VISIBLE_DEVICES`. The remote-SSH path is hardened against the classic non-interactive PATH problem: if the first call comes back empty on a remote, it retries through `bash -lc` with `/usr/local/cuda/bin` added, and as a last resort tries `nvidia-smi` by absolute path in three common locations. When the binary is there but the driver isn't talking (the post-update / no-reboot case), it surfaces the NVML error string into `gpu_error` so the UI can say "GPU driver error" instead of the misleading "No GPU". See [services/hwfit/hardware.py:71-129](services/hwfit/hardware.py).
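As a rough illustration of the parsing step, the sketch below turns the CSV output of that `nvidia-smi` query into per-GPU records, using the row position as the index. The function name and record fields are assumptions for the example; `nvidia-smi` reports `memory.total` in MiB when `nounits` is set.

```python
from typing import Dict, List


def parse_nvidia_smi(csv_text: str) -> List[Dict]:
    """Parse `memory.total,name` rows (MiB, no header, no units) into GPU records."""
    rows = [ln for ln in csv_text.splitlines() if ln.strip()]
    gpus = []
    for index, line in enumerate(rows):
        mem_mib, _, name = line.partition(",")
        try:
            vram_gb = float(mem_mib.strip()) / 1024.0
        except ValueError:
            continue  # not a number, e.g. an error string from a broken driver
        gpus.append({
            "index": index,              # row position doubles as the CUDA index
            "name": name.strip(),
            "vram_gb": round(vram_gb, 1),
        })
    return gpus


sample = "24564, NVIDIA GeForce RTX 4090\n24576, NVIDIA GeForce RTX 3090\n"
gpus = parse_nvidia_smi(sample)
```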

AMD detection walks `/sys/class/drm/card*/device/` looking for `vendor == 0x1002`, and is intentionally nuanced about APUs. Discrete cards report real VRAM in `mem_info_vram_total`; Strix Halo and similar unified-memory SoCs report a tiny `vram_total` with the real pool in `mem_info_vis_vram_total`, so the code takes the max of the two and only falls back to `mem_info_gtt_total` if both are zero. The `unified_memory` flag is set when `vis_vram >= vram`, and the code explicitly does not cap it at system RAM because BIOS-carved UMA is physically backed but invisible to `/proc/meminfo`. See [services/hwfit/hardware.py:132-204](services/hwfit/hardware.py).
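The VRAM selection rule from that paragraph fits in a few lines. The sketch below assumes the three sysfs values have already been read as byte counts; `pick_amd_vram` and its return shape are hypothetical, and the `unified_memory` rule is copied from the text as stated.

```python
def pick_amd_vram(vram_total: int, vis_vram_total: int, gtt_total: int) -> dict:
    """Choose the usable memory pool for an AMD device (all inputs in bytes)."""
    best = max(vram_total, vis_vram_total)   # APUs hide the real pool in vis_vram
    if best == 0:
        best = gtt_total                     # last resort: GTT size
    return {
        "vram_gb": round(best / 1024**3, 1),
        # Rule as described in the text: flagged when vis_vram >= vram.
        "unified_memory": vis_vram_total >= vram_total,
    }


GIB = 1024**3
discrete = pick_amd_vram(24 * GIB, 256 * 1024**2, 32 * GIB)   # small visible window
apu = pick_amd_vram(512 * 1024**2, 96 * GIB, 16 * GIB)        # unified-memory SoC
```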

Windows is detected differently: one giant PowerShell command bundles `Win32_OperatingSystem`, `Win32_Processor`, `nvidia-smi`, and `Win32_VideoController` into a single JSON blob, because round-tripping multiple commands over SSH to Windows is slow ([services/hwfit/hardware.py:286-360](services/hwfit/hardware.py)).

Sources: [services/hwfit/hardware.py:6-37](services/hwfit/hardware.py), [services/hwfit/hardware.py:286-457](services/hwfit/hardware.py)

## Tensor-parallel pools: why a mixed box gets split

vLLM only tensor-parallels across **identical** GPUs. A workstation with `1×4090 + 2×3090` cannot serve a model across all three — it has to pick a homogeneous subset. `_group_gpus()` does exactly that: it groups by `(name, round(vram_gb))`, carries each GPU's CUDA index, and sorts the resulting groups by total VRAM descending, so the largest single-tensor-parallel pool becomes the default serving target.

```text
detected: [4090(24), 3090(24), 3090(24)]
                │
                ▼
groups: [{name:"3090", count:2, vram_each:24, indices:[1,2], vram_total:48},
         {name:"4090", count:1, vram_each:24, indices:[0],   vram_total:24}]
         ▲                                                       ▲
         largest pool → default serve target              still selectable
```
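The grouping shown in the diagram can be reproduced with a short sketch. It follows the described key `(name, round(vram_gb))` and the descending sort by total VRAM; the function name `group_gpus` and the input record shape are assumptions, not the signature of the real `_group_gpus()`.

```python
from collections import defaultdict
from typing import Dict, List


def group_gpus(gpus: List[Dict]) -> List[Dict]:
    """Bucket GPUs into homogeneous pools, largest total VRAM first."""
    buckets = defaultdict(list)
    for gpu in gpus:
        buckets[(gpu["name"], round(gpu["vram_gb"]))].append(gpu["index"])
    groups = [
        {"name": name, "count": len(indices), "vram_each": vram,
         "indices": indices, "vram_total": vram * len(indices)}
        for (name, vram), indices in buckets.items()
    ]
    groups.sort(key=lambda g: g["vram_total"], reverse=True)
    return groups


detected = [
    {"index": 0, "name": "4090", "vram_gb": 24.0},
    {"index": 1, "name": "3090", "vram_gb": 24.0},
    {"index": 2, "name": "3090", "vram_gb": 24.0},
]
groups = group_gpus(detected)
```

The carried `indices` are what a serve step could later join into `CUDA_VISIBLE_DEVICES` for the chosen pool.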

The grouped list flows back to the UI as `system.gpu_groups`, and the `/api/hwfit/models` route lets the caller pick a pool via `gpu_group` and clamp the count via `gpu_count` ([routes/hwfit_routes.py:131-172](routes/hwfit_routes.py)). When `gpu_count` is set, `system.gpu_only = True` is flipped on, which is consequential: it tells the fit step to refuse offload-to-RAM fallbacks for that ranking ([services/hwfit/fit.py:226-230](services/hwfit/fit.py)). Without it, a 96 GB GPU would still list a 175 GB model because the model "fits" by spilling most of its layers to system RAM — the comment calls out that exact bug.

Sources: [services/hwfit/hardware.py:40-68](services/hwfit/hardware.py), [routes/hwfit_routes.py:131-175](routes/hwfit_routes.py)

## The catalog and its quant math

`get_models()` lazy-loads `services/hwfit/data/hf_models.json` once and caches it ([services/hwfit/models.py:162-173](services/hwfit/models.py)). Each entry carries provider, parameter count (raw or `"7B"`/`"355M"`), context length, native quantization, optional `gguf_sources` (a list of HF repos that ship GGUF variants), MoE metadata (`is_moe`, `active_parameters`), and so on. Parameter parsing has a deliberate trap-handler: a bare number ≥ 1,000,000 is treated as a raw parameter count, and `"355"` is treated as 0.355 B, because otherwise a 355M model would sort above every 70B model ([services/hwfit/models.py:52-83](services/hwfit/models.py)).
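A hedged sketch of the parameter-parsing rule follows. The text only states two facts about bare numbers (values ≥ 1,000,000 are raw counts, and `"355"` means 0.355 B), so treating every smaller bare number as millions is an assumption of this example, as is the name `parse_params_b`.

```python
def parse_params_b(value) -> float:
    """Return a parameter count in billions from "7B", "355M", or a bare number."""
    text = str(value).strip().upper()
    if text.endswith("B"):
        return float(text[:-1])
    if text.endswith("M"):
        return float(text[:-1]) / 1000.0
    number = float(text)
    if number >= 1_000_000:
        return number / 1e9      # raw parameter count, e.g. 7000000000
    # ASSUMPTION: a small bare number is a count in millions ("355" -> 0.355 B).
    return number / 1000.0
```

With this rule a 355M model sorts below a 70B one instead of being misread as 355 B.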

The fit math hinges on three tables, all keyed by quant label:

| Quant | Bytes/param (memory) | Speed multiplier | Quality penalty |
|---|---|---|---|
| F16 / BF16 | 2.0 | 0.60 | 0 |
| FP8 | 1.0 | 0.85 | 0 |
| Q8_0 | 1.0 | 0.80 | 0 |
| Q6_K | 0.75 | 0.95 | −1 |
| Q5_K_M | 0.625 | 1.00 | −2 |
| Q4_K_M / Q4_0 | 0.5 | 1.15 | −5 |
| Q3_K_M | 0.375 | 1.25 | −8 |
| Q2_K | 0.25 | 1.35 | −12 |
| AWQ-4bit / GPTQ-Int4 | 0.5 | 1.20 | −3 |
| AWQ-8bit / GPTQ-Int8 | 1.0 | 0.85 | 0 |
| mlx-4/6/8-bit | 0.5 / 0.75 / 1.0 | 1.15 / 1.0 / 0.85 | −4 / −1 / 0 |

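Memory estimation is a one-liner: `pb * bpp + 0.000008 * active_params * ctx + 0.5`, where `pb` is the total parameter count in billions, `bpp` is the bytes-per-param figure from the table above, `ctx` is the context length in tokens, and the result is in GB. The first term is the weights, the second approximates the KV cache, and a flat 0.5 GB is added on top. The KV term uses **active** params, not total, because for MoE only the active experts have KV state: total VRAM is dominated by weights, but speed and KV are dominated by what actually runs per token ([services/hwfit/models.py:86-101](services/hwfit/models.py)).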
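To make the formula concrete, here is a small worked example. It assumes parameter counts are expressed in billions and the result is in GB; `estimate_memory_gb` is a hypothetical name, and only the constants come from the quoted one-liner.

```python
def estimate_memory_gb(params_b: float, bytes_per_param: float,
                       active_params_b: float, ctx: int) -> float:
    """Weights + KV-cache approximation + flat 0.5 GB, per the quoted one-liner."""
    weights = params_b * bytes_per_param
    kv_cache = 0.000008 * active_params_b * ctx
    return weights + kv_cache + 0.5


# Dense 70B at 4-bit (0.5 bytes/param) with an 8192-token context:
# 35.0 weights + ~4.59 KV + 0.5 = ~40.09 GB
dense_70b = estimate_memory_gb(70, 0.5, 70, 8192)

# Dense 7B at 4-bit with a 4096-token context: ~4.23 GB
dense_7b = estimate_memory_gb(7, 0.5, 7, 4096)
```

For an MoE model, `params_b` would be the total and `active_params_b` the smaller active count, so the weights term grows while the KV term does not.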

Sources: [services/hwfit/models.py:5-101](services/hwfit/models.py)

## How a single model gets ranked

`analyze_model()` is the per-model worker. The flow is more nuanced than a single budget check because of two subtleties — prequantized formats have fixed bit-widths (you can't try Q3 on an AWQ-4bit), and GGUF cannot be sharded across GPUs the way vLLM-served safetensors can.

```mermaid
flowchart TD
    A[model + system] --> B{prequantized?<br/>AWQ/GPTQ/FP8/MLX}
    B -- yes --> C[use native quant only]
    B -- no --> D{target_quant set?}
    D -- yes --> E[try target_quant]
    D -- no --> F[default Q4_K_M]
    C --> G[_try_quant_at]
    E --> G
    F --> G
    G --> H{fits in VRAM?}
    H -- yes --> I[run_mode=gpu]
    H -- no --> J{fits in RAM + has GPU?}
    J -- yes --> K[run_mode=cpu_offload]
    J -- no --> L[halve ctx, retry until 1024]
    L --> M{fit found?}
    M -- no --> N[return too_tight badge]
    M -- yes --> O[fit_level + speed + composite score]
    I --> O
    K --> O
```

The "shard or not" decision is the one most likely to surprise a reader of the README. `effective_vram` is `single_gpu_vram` for GGUF/dense builds (the sizer treats llama.cpp-served GGUF as single-GPU), but full multi-GPU VRAM for prequantized formats served by vLLM, *even when the same model also lists a GGUF alternate download* ([services/hwfit/fit.py:236-247](services/hwfit/fit.py)). A `2×24GB` box ranks a 70B model as runnable in AWQ-4bit (~35 GB of weights across both cards) but not in Q4_K_M GGUF (won't fit on one 24 GB card).
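The `2×24GB` example can be checked with a short sketch. The prefix list mirrors the prequantized families named in the flowchart; the function name and its arguments are illustrative assumptions rather than the actual code in `fit.py`.

```python
PREQUANTIZED_PREFIXES = ("AWQ", "GPTQ", "FP8", "MLX")  # families from the flowchart


def effective_vram(quant: str, gpu_count: int, vram_each_gb: float) -> float:
    """Pooled VRAM for prequantized quants, single-GPU VRAM for everything else."""
    if quant.upper().startswith(PREQUANTIZED_PREFIXES):
        return gpu_count * vram_each_gb
    return vram_each_gb


weights_gb = 70 * 0.5  # 70B at 4 bits per weight = 35 GB, for either format

awq_budget = effective_vram("AWQ-4bit", 2, 24)   # pooled across both cards
gguf_budget = effective_vram("Q4_K_M", 2, 24)    # limited to one card
```

The same 35 GB of weights is under the pooled budget and over the single-card one, which is why the two formats rank differently on the same box.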

If nothing fits, the model isn't dropped — it's returned with `fit_level: "too_tight"` and `run_mode: "no_fit"` so the UI can render a red row. Without that, editing the manual-hardware sliders upward never revealed bigger models, because they were filtered out before the user could see what *would* fit ([services/hwfit/fit.py:278-303](services/hwfit/fit.py)).
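The search loop in the flowchart, including the `too_tight` fallback, might look roughly like this. It is a sketch under stated assumptions: "fits in RAM" is read as the whole estimate fitting in system RAM, the CPU-only branch is omitted, and `find_fit` with its return fields is a hypothetical stand-in for the real `_try_quant_at` logic.

```python
def find_fit(params_b, bpp, active_b, ctx, vram_gb, ram_gb,
             gpu_only=False, min_ctx=1024):
    """Walk the flowchart: GPU first, then RAM offload, then halve the context."""
    need = 0.0
    while ctx >= min_ctx:
        need = params_b * bpp + 0.000008 * active_b * ctx + 0.5
        if vram_gb > 0 and need <= vram_gb:
            return {"run_mode": "gpu", "ctx": ctx, "need_gb": round(need, 2)}
        # ASSUMPTION: offload is allowed when the estimate fits in system RAM
        # and the caller has not requested a GPU-only ranking.
        if vram_gb > 0 and not gpu_only and need <= ram_gb:
            return {"run_mode": "cpu_offload", "ctx": ctx, "need_gb": round(need, 2)}
        ctx //= 2
    # Nothing fit: keep the model visible, with the size it would have needed
    # at the smallest context tried.
    return {"run_mode": "no_fit", "fit_level": "too_tight",
            "ctx": 0, "need_gb": round(need, 2)}


shrunk = find_fit(7, 2.0, 7, 32768, vram_gb=16, ram_gb=64, gpu_only=True)
offload = find_fit(70, 0.5, 70, 8192, vram_gb=24, ram_gb=64)
too_big = find_fit(70, 0.5, 70, 8192, vram_gb=24, ram_gb=64, gpu_only=True)
```

In this example the 7B F16 model only fits after one context halving, and the 70B model flips from `cpu_offload` to `no_fit` once `gpu_only` is set.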

When a fit *is* found, four sub-scores are computed and weighted by use case:

| Use case | quality | speed | fit | context |
|---|---|---|---|---|
| general | 0.45 | 0.30 | 0.15 | 0.10 |
| coding | 0.50 | 0.20 | 0.15 | 0.15 |
| reasoning | 0.55 | 0.15 | 0.15 | 0.15 |
| chat | 0.40 | 0.35 | 0.15 | 0.10 |
| multimodal | 0.50 | 0.20 | 0.15 | 0.15 |
| embedding | 0.30 | 0.40 | 0.20 | 0.10 |

Each sub-score has its own rule:

- `_quality_score` rewards larger param counts on a bucketed curve (30 → 95 from <1B to ≥40B), nudges scores for known family names (`+3` deepseek, `+2` qwen/llama, `+1` mistral/gemma), and applies the per-quant penalty.
- `_speed_score` estimates tok/s using a real bandwidth table (`GPU_BANDWIDTH` covers 60+ NVIDIA / AMD / datacenter cards from `5090` through `mi300x`), then divides by per-use-case targets: 40 tok/s is "good" for chat/coding, 25 for reasoning, 200 for embedding.
- `_fit_score` plateaus at 100 between 50% and 80% VRAM utilization and drops sharply above 90% (the "marginal" zone).
- `_context_score` rewards hitting the use case's context target (4096 chat, 8192 coding/reasoning, 512 embedding).
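Combining the four sub-scores is a plain weighted sum. The sketch below copies the weights from the use-case table; the function name `composite_score` and the 0-100 example inputs are assumptions for illustration.

```python
# (quality, speed, fit, context) weights per use case, from the table above
WEIGHTS = {
    "general":    (0.45, 0.30, 0.15, 0.10),
    "coding":     (0.50, 0.20, 0.15, 0.15),
    "reasoning":  (0.55, 0.15, 0.15, 0.15),
    "chat":       (0.40, 0.35, 0.15, 0.10),
    "multimodal": (0.50, 0.20, 0.15, 0.15),
    "embedding":  (0.30, 0.40, 0.20, 0.10),
}


def composite_score(use_case: str, quality: float, speed: float,
                    fit: float, context: float) -> float:
    """Weighted sum of the four sub-scores for the given use case."""
    wq, ws, wf, wc = WEIGHTS[use_case]
    return wq * quality + ws * speed + wf * fit + wc * context


# The same sub-scores rank differently depending on the use case:
coding = composite_score("coding", 80, 60, 100, 100)   # 40 + 12 + 15 + 15 = 82
chat = composite_score("chat", 80, 60, 100, 100)       # 32 + 21 + 15 + 10 = 78
```

Every row of weights sums to 1, so the composite stays on the same scale as the sub-scores.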

Sources: [services/hwfit/fit.py:9-49](services/hwfit/fit.py), [services/hwfit/fit.py:62-160](services/hwfit/fit.py), [services/hwfit/fit.py:212-356](services/hwfit/fit.py)

## Speed estimation: bandwidth, not benchmarks

The single most useful number in the response is `speed_tps`, and it comes from physics, not measurements. For a recognized GPU, `_estimate_speed` does:

```text
model_gb   = active_params_b * bytes_per_param
raw_tps    = (gpu_bandwidth / model_gb) * 0.55
            * (1.0 dense | 0.8 MoE | 0.5 cpu_offload)
```

The 0.55 efficiency factor is the realized fraction of peak memory bandwidth a typical transformer decode loop achieves. MoE gets 0.8× because routing overhead eats into the bandwidth win. CPU offload halves it because PCIe is the new bottleneck. If the GPU isn't in the lookup, the code falls back to a backend-keyed constant `k / pb * speed_mult`, with `k = 220` for CUDA, 180 for ROCm, 90 for ARM CPU, 70 for x86 CPU ([services/hwfit/fit.py:62-88](services/hwfit/fit.py)). The estimates are deliberately rough — the goal is to separate "60 tok/s, fine" from "3 tok/s, painful" at a glance.
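Both estimation paths can be sketched in a few lines. The constants come from the text; the function names, the `mode` argument (which treats dense, MoE, and CPU offload as mutually exclusive, something the pseudocode does not settle), and the 1008 GB/s example bandwidth are assumptions of this illustration rather than values read from `GPU_BANDWIDTH`.

```python
MODE_FACTOR = {"dense": 1.0, "moe": 0.8, "cpu_offload": 0.5}
FALLBACK_K = {"cuda": 220, "rocm": 180, "cpu_arm": 90, "cpu_x86": 70}


def estimate_tps(active_params_b: float, bytes_per_param: float,
                 gpu_bandwidth_gbps: float, mode: str = "dense") -> float:
    """Bandwidth-bound decode estimate for a GPU found in the lookup table."""
    model_gb = active_params_b * bytes_per_param
    return (gpu_bandwidth_gbps / model_gb) * 0.55 * MODE_FACTOR[mode]


def fallback_tps(backend: str, params_b: float, speed_mult: float) -> float:
    """Backend-keyed constant used when the GPU is not in the table."""
    return FALLBACK_K[backend] / params_b * speed_mult


# 7B at 0.5 bytes/param on a card with ~1008 GB/s of memory bandwidth:
known_gpu = estimate_tps(7, 0.5, 1008)      # 288 * 0.55 = ~158.4 tok/s
unknown_gpu = fallback_tps("cuda", 7, 1.15) # 220 / 7 * 1.15 = ~36.1 tok/s
```

The gap between the two numbers for the same model shows how coarse the fallback is, which is consistent with the stated goal of separating "fine" from "painful" rather than predicting benchmarks.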

## The image-model side path

`image_models.py` is a separate hard-coded registry of 15 diffusion models (FLUX, SDXL, SD 3.5, Qwen-Image, HunyuanImage, Tongyi Z-Image), each with `vram_bf16` / `vram_fp8` / `vram_q4` rows and a `quant_repos` map pointing to community FP8/Q4 weight repos. Ranking is dramatically simpler: try BF16, then FP8, then Q4, accept the first that fits under 90% of GPU VRAM, and label the headroom as `perfect`/`good`/`tight`/`no_fit`. There is no per-quant memory formula; the VRAM numbers are precomputed because diffusion models have hand-tuned offload strategies that don't follow `params × bytes_per_param` ([services/hwfit/image_models.py:6-374](services/hwfit/image_models.py)). The `/api/hwfit/image-models` route also forces single-GPU VRAM because diffusion pipelines don't tensor-parallel ([routes/hwfit_routes.py:177-202](routes/hwfit_routes.py)).
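The "first precision that fits under 90%" rule can be sketched as below. The field names `vram_bf16` / `vram_fp8` / `vram_q4` come from the text; the function name, the example VRAM figures, and the omission of the headroom labels (whose thresholds the text does not give) are choices made for this illustration.

```python
from typing import Optional, Tuple


def pick_image_quant(model: dict, gpu_vram_gb: float) -> Tuple[Optional[str], Optional[float]]:
    """Return the first of BF16 -> FP8 -> Q4 whose precomputed VRAM fits in 90% of the GPU."""
    budget = gpu_vram_gb * 0.9
    for label, key in (("BF16", "vram_bf16"), ("FP8", "vram_fp8"), ("Q4", "vram_q4")):
        need = model.get(key)
        if need is not None and need <= budget:
            return label, need
    return None, None  # corresponds to the `no_fit` label


# Made-up VRAM figures for a hypothetical diffusion model (GB):
example_model = {"vram_bf16": 24, "vram_fp8": 13, "vram_q4": 8}

on_32gb = pick_image_quant(example_model, 32)  # budget 28.8 GB
on_24gb = pick_image_quant(example_model, 24)  # budget 21.6 GB
on_8gb = pick_image_quant(example_model, 8)    # budget 7.2 GB
```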

## The "what if" simulator

`_apply_manual_hardware()` in `routes/hwfit_routes.py` is a deliberate redesign of the original additive behavior. The previous version added a fake "1× 400 GB" GPU to a detected `2× 70 GB` setup and then averaged: per-GPU cap went from 70 to 180 GB (= 540/3), so GGUF models larger than that still didn't surface — the "cap stuck at detected level" bug. The current code **replaces** the GPU configuration entirely, building a single homogeneous pool with the entered `vram_each` as the literal per-GPU cap. RAM-mode wipes GPUs and reroutes everything through `cpu_x86`. Two more switches — `ignore_detected_gpu` and `ignore_detected_ram` — let the UI strip the live box's contribution without entering manual values, which is what powers the "Suggest models for an RTX 5090 I don't own yet" workflow ([routes/hwfit_routes.py:9-83](routes/hwfit_routes.py), [routes/hwfit_routes.py:113-124](routes/hwfit_routes.py)).
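The arithmetic behind the old bug and the fix is easy to reproduce. Both helpers below are hypothetical reconstructions of the two behaviors as described in the paragraph, not the code of `_apply_manual_hardware()`.

```python
from typing import Dict, List


def additive_cap(detected_gb: List[float], manual_gb: List[float]) -> float:
    """Old behavior as described: append the manual GPUs, then average per GPU."""
    combined = detected_gb + manual_gb
    return sum(combined) / len(combined)


def replaced_pool(count: int, vram_each_gb: float) -> Dict:
    """Current behavior as described: one homogeneous pool with a literal per-GPU cap."""
    return {"count": count, "vram_each": vram_each_gb,
            "vram_total": count * vram_each_gb}


old_cap = additive_cap([70, 70], [400])   # (70 + 70 + 400) / 3 = 180 GB per GPU
new_pool = replaced_pool(1, 400)          # per-GPU cap is exactly the entered 400 GB
```

Under the additive scheme a model needing more than 180 GB on one device still would not surface, even though the user asked about a 400 GB card.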

A quieter detail: `rank_models()` sorts twice. First by composite score to pick the *visible set* of N, then by whatever the user clicked (params, vram, context). If it sorted once by the user's column, sorting by `params` would truncate to the biggest models that don't even fit, while sorting by `vram` would truncate to the smallest — the score-first prefilter keeps the visible cohort stable as the user re-sorts ([services/hwfit/fit.py:453-462](services/hwfit/fit.py)).
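A minimal sketch of the two-stage sort follows. The record fields, the `limit` parameter, and the descending order of the second sort are assumptions; only the score-first, user-column-second structure comes from the text.

```python
from typing import Dict, List


def rank_visible(results: List[Dict], limit: int, sort_key: str = "score") -> List[Dict]:
    """Pick the top `limit` rows by score, then re-order that cohort by `sort_key`."""
    visible = sorted(results, key=lambda r: r["score"], reverse=True)[:limit]
    if sort_key != "score":
        visible.sort(key=lambda r: r[sort_key], reverse=True)
    return visible


results = [
    {"name": "tiny",  "params": 1,   "score": 40},
    {"name": "mid",   "params": 8,   "score": 85},
    {"name": "large", "params": 32,  "score": 80},
    {"name": "huge",  "params": 400, "score": 5},   # does not fit, scores poorly
]

by_params = rank_visible(results, limit=2, sort_key="params")
```

Sorting once by `params` and truncating to two rows would have shown `huge` and `large`; the score-first pass keeps `mid` and `large` visible and only changes their order.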

Sources: [routes/hwfit_routes.py:9-83](routes/hwfit_routes.py), [services/hwfit/fit.py:368-463](services/hwfit/fit.py)

## From recommendation to "click to install"

`hwfit` only computes; the Cookbook glues the ranked result to action. The route layer exposes two GETs (`/api/hwfit/system` and `/api/hwfit/models`) that return JSON the front-end renders as a model list. When the user clicks Download, the request lands at `POST /api/cookbook/api/model/download` in `routes/cookbook_routes.py`, which writes a generated bash (or PowerShell, for Windows remotes) runner script, `scp`s it to the target host if remote, and launches it inside a fresh `tmux` session named `cookbook-<hex>`. The runner installs `huggingface_hub` if missing, opportunistically pulls in `hf_transfer` for the parallel Rust downloader, and falls back to plain `snapshot_download(...)` when either the CLI binary or the Rust path is unavailable ([routes/cookbook_routes.py:307-535](routes/cookbook_routes.py)). The "Serve" button takes the same shape but writes a different runner (vLLM, llama.cpp, or Ollama, depending on the engine) wired to the GPU subset the user picked ([routes/cookbook_routes.py:731-833](routes/cookbook_routes.py)).

```text
┌──────────────┐  GET /api/hwfit/system   ┌────────────────────┐
│  Cookbook    │ ───────────────────────► │ hwfit.hardware     │
│  UI          │                          │  (probe + cache)   │
│              │  GET /api/hwfit/models   └────────────────────┘
│              │ ───────────────────────► ┌────────────────────┐
│              │   ◄── ranked JSON ────── │ hwfit.fit          │
│              │                          │  (rank_models)     │
│              │                          └────────────────────┘
│ click Down…  │                          ┌────────────────────┐
│              │ POST /api/cookbook/...   │ cookbook_routes    │
│              │ ───────────────────────► │  tmux + ssh + scp  │
└──────────────┘                          │  hf download / vllm│
                                          └────────────────────┘
```

The separation is intentional: `services/hwfit/` is a pure-Python library with no FastAPI, no shell, no SSH, no auth — just hardware introspection and arithmetic. Everything stateful (sessions, tokens, tmux logs, admin-gating, validation) lives in the route layer.

## What builders should notice

A few things make this design portable to other projects:

- **The catalog is data, not code.** `hf_models.json` is a flat list of JSON records; the only logic baked into it is the field schema. `scripts/add_hwfit_models.py` exists for bulk-adding from HuggingFace search results, which means the recommender can grow without rebuilding the package.
- **Provider-neutral by construction.** Nothing in `services/hwfit/` calls an Anthropic, OpenAI, or HuggingFace inference API. It targets weights at rest (`gguf_sources`, `quant_repos`) and local engines (vLLM, llama.cpp, Ollama). Swap the catalog file or feed it a different `system` dict and it ranks against a different box, including hypothetical ones.
- **A sane fallback chain per branch.** Detection has three nvidia-smi paths, AMD has VRAM-vs-vis-VRAM-vs-GTT, Windows has nvidia-smi-then-WMI, downloads have hf-CLI-then-Python-then-pip-install. The pattern is consistent: try the fast/correct thing, then the slow/correct thing, then the "best we can do."
- **The "too tight" badge.** Surfacing what *doesn't* fit, with the exact GB it would need, is what makes the simulator usable. Without it, the recommender can only ever shrink the user's mental model of what's possible.

A wiki page can only point at the code; the math is in the source. The interesting takeaway is that "click to install a local LLM" is mostly a sizing problem, and sizing is a 200-line problem if you treat quantization, KV cache, and tensor-parallel pool homogeneity as first-class concepts instead of edge cases.
