# Memory & Skills: ChromaDB + Skill Extraction That Evolves

> How the memory subsystem combines ChromaDB vector storage, fastembed ONNX embeddings, keyword fallback, and a skill_extractor that distills recurring patterns into reusable skills — the mechanism behind the "your agent gets better over time" claim.

- Repository: pewdiepie-archdaemon/odysseus
- GitHub: https://github.com/pewdiepie-archdaemon/odysseus
- Human wiki: https://grok-wiki.com/public/wiki/pewdiepie-archdaemon-odysseus-8b8805c93124
- Complete Markdown: https://grok-wiki.com/public/wiki/pewdiepie-archdaemon-odysseus-8b8805c93124/llms-full.txt

## Source Files

- `services/memory/memory.py`
- `services/memory/memory_vector.py`
- `services/memory/skill_extractor.py`
- `services/memory/skills.py`
- `src/memory_vector.py`
- `src/chroma_client.py`
- `src/embeddings.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [services/memory/memory.py](services/memory/memory.py)
- [services/memory/memory_vector.py](services/memory/memory_vector.py)
- [services/memory/skill_extractor.py](services/memory/skill_extractor.py)
- [services/memory/skills.py](services/memory/skills.py)
- [services/memory/skill_format.py](services/memory/skill_format.py)
- [services/memory/memory_extractor.py](services/memory/memory_extractor.py)
- [src/chroma_client.py](src/chroma_client.py)
- [src/embeddings.py](src/embeddings.py)
- [src/memory_vector.py](src/memory_vector.py)
</details>


Most "agent memory" implementations are a vector store and a system prompt. Odysseus splits the problem into two: **memory** — the agent's working set of facts about you — and **skills** — short, replayable procedures the agent has *learned to do* from prior sessions. Both are stored locally, both index into the same ChromaDB collection family, and both degrade cleanly when the embedding backend disappears.

This page walks the mechanism: how a memory entry ends up in a Chroma `odysseus_memories` collection, how the embedding pipeline falls back from a remote HTTP server to a local fastembed ONNX model, how the keyword-only path keeps working when neither is available, and how `skill_extractor.py` distills "the agent took 2+ rounds or 2+ tool calls" into a SKILL.md file the next session can find.

## The Two-Track Architecture

Memory and skills are kept structurally separate on purpose: memories are facts (free text, owner-stamped, optionally vectorised), skills are typed procedures (frontmatter + structured markdown) stored as files on disk.

```mermaid
flowchart LR
    subgraph Agent["Agent runtime"]
        AL["agent_loop.py<br/>(rounds, tool calls)"]
        SP["system prompt"]
    end

    subgraph Memory["Memory subsystem"]
        MM["MemoryManager<br/>(memory.json)"]
        MV["MemoryVectorStore<br/>collection: odysseus_memories"]
        MX["memory_extractor.py<br/>(periodic LLM tidy)"]
    end

    subgraph Skills["Skills subsystem"]
        SM["SkillsManager<br/>data/skills/&lt;cat&gt;/&lt;name&gt;/SKILL.md"]
        SX["skill_extractor.py<br/>(post-run distillation)"]
        SU["_usage.json sidecar"]
    end

    subgraph Embed["Embedding + storage"]
        EC["get_embedding_client()<br/>HTTP → FastEmbed fallback"]
        CC["get_chroma_client()<br/>chromadb.HttpClient"]
    end

    AL -->|"finish_round()"| SX
    AL --> MM
    SP <-->|"index_for()"| SM
    SP <-->|"get_relevant_memories()"| MM

    MM --> MV
    MV --> EC
    MV --> CC
    MX --> MM

    SX --> SM
    SM --> SU
```

Sources: [services/memory/memory.py:35-360](), [services/memory/memory_vector.py:15-176](), [services/memory/skills.py:62-271](), [services/memory/skill_extractor.py:51-209]()

## ChromaDB: the Vector Store Layer

The vector store is intentionally thin. `MemoryVectorStore` opens a single Chroma collection called `odysseus_memories` configured with `hnsw:space=cosine` and stores **pre-computed** embeddings — Chroma does no embedding of its own. The same applies to RAG; only the collection name differs.

```python
# services/memory/memory_vector.py
self._collection = client.get_or_create_collection(
    name=self.COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"},
)
```

Five operational details worth noting:

| Behaviour | Detail | Source |
|---|---|---|
| Health latch | `_healthy=False` on init failure; every op is a no-op when unhealthy, so a missing Chroma daemon doesn't crash agent runs. | [services/memory/memory_vector.py:23-49]() |
| Dedup on add | Before `add()`, the store does a `get(ids=[memory_id])` and skips if present — same memory id is idempotent. | [services/memory/memory_vector.py:65-79]() |
| Similarity convert | Chroma returns cosine *distance*. The code returns `round(1.0 - distance, 4)` so the rest of the app reasons in similarity space. | [services/memory/memory_vector.py:107-114]() |
| Rebuild path | `rebuild()` deletes and recreates the collection, then re-embeds in batches of 100 — used after bulk imports or schema migrations. | [services/memory/memory_vector.py:134-175]() |
| Near-duplicate gate | `find_similar(text, threshold=0.92)` is the gate before adding new auto-extracted memories so the index doesn't bloat with near-duplicates. | [services/memory/memory_vector.py:116-132]() |
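
The health latch, the idempotent add, and the distance-to-similarity conversion can be illustrated together. This is a minimal sketch rather than the repository's code: `FakeCollection` and `SketchVectorStore` are hypothetical stand-ins so the example runs without a Chroma server, and only the `get(ids=...)` / `add(...)` call shapes mirror the behaviour described in the table.

```python
from typing import Dict, List, Optional, Tuple


class FakeCollection:
    """In-memory stand-in for a Chroma collection (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[List[float], str]] = {}

    def get(self, ids: List[str]) -> Dict[str, List[str]]:
        # Return only the ids that are already stored.
        return {"ids": [i for i in ids if i in self.rows]}

    def add(self, ids: List[str], embeddings: List[List[float]], documents: List[str]) -> None:
        for memory_id, embedding, document in zip(ids, embeddings, documents):
            self.rows[memory_id] = (embedding, document)


class SketchVectorStore:
    def __init__(self, collection: Optional[FakeCollection]) -> None:
        # Health latch: a missing backend marks the store unhealthy instead of raising.
        self._collection = collection
        self._healthy = collection is not None

    def add(self, memory_id: str, embedding: List[float], text: str) -> bool:
        if not self._healthy:
            return False  # every operation is a no-op when unhealthy
        if self._collection.get(ids=[memory_id])["ids"]:
            return False  # same id already stored: adding is idempotent
        self._collection.add(ids=[memory_id], embeddings=[embedding], documents=[text])
        return True


def distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance into a similarity score, rounded to 4 places."""
    return round(1.0 - distance, 4)
```

Returning `False` instead of raising is what lets callers ignore the vector store entirely when Chroma is down.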

The Chroma client itself is a singleton over an HTTP transport — no in-process Chroma, no DuckDB — pointed at `CHROMADB_HOST:CHROMADB_PORT` (defaults `localhost:8100`). Importing `chromadb` is optional; if it isn't installed the factory raises a `RuntimeError` with the install hint rather than failing silently.

```python
# src/chroma_client.py
_client = chromadb.HttpClient(host=host, port=port)
_client.heartbeat()                  # immediate health check at startup
```
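
The optional import and the singleton behaviour described above might look roughly like the following. This is a sketch under assumptions: `load_optional` and `get_client_sketch` are hypothetical names, the install hint text is invented, and only `chromadb.HttpClient(host=..., port=...)` and `heartbeat()` come from the excerpt.

```python
import importlib
from typing import Any, Optional

_client: Optional[Any] = None  # module-level singleton


def load_optional(module_name: str, install_hint: str) -> Any:
    """Import an optional dependency, or raise RuntimeError carrying an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(f"{module_name} is not installed; try: {install_hint}") from exc


def get_client_sketch(host: str = "localhost", port: int = 8100) -> Any:
    """Create the HTTP client once and hand back the same object on later calls."""
    global _client
    if _client is None:
        chromadb = load_optional("chromadb", "pip install chromadb")
        client = chromadb.HttpClient(host=host, port=port)
        client.heartbeat()  # raises if the server is unreachable
        _client = client    # cache only after the health check passed
    return _client
```

Caching the client only after `heartbeat()` succeeds means a failed startup probe does not leave a broken singleton behind.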

Sources: [src/chroma_client.py:16-48](), [services/memory/memory_vector.py:15-115]()

## fastembed: the Zero-Config Fallback

The embedding layer is where the "BYOC/BYOK friendly" claim shows up. `get_embedding_client()` tries an OpenAI-compatible HTTP endpoint first (Ollama, vLLM, llama.cpp, anything that speaks `POST /v1/embeddings`), and if that fails it loads a local fastembed ONNX model (default `sentence-transformers/all-MiniLM-L6-v2`, ~50 MB) cached under `data/fastembed_cache/`.

The fallback isn't reattempted on every call. Once an HTTP probe fails, a process-level latch `_http_embed_down = True` is tripped so subsequent RAG / memory / tool calls don't pay the ~3 s connect timeout every time:

```python
# src/embeddings.py
if not _http_embed_down:
    try:
        client = EmbeddingClient()
        client.get_sentence_embedding_dimension()  # health check
        return client
    except Exception:
        _http_embed_down = True
        # ...fall through to FastEmbedClient
```

`reset_http_embed_state()` is the explicit way to re-probe after the admin panel saves a new endpoint — without it, an Ollama instance that came up *after* startup would never be picked up.
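
The latch-and-reset pattern can be shown in isolation. In this sketch the HTTP client is replaced by an injected factory (`make_http`, a hypothetical parameter) and `LocalStub` stands in for the local ONNX client, so the example runs without any network or model download.

```python
from typing import Callable

_http_embed_down = False  # process-level latch, tripped on the first failed probe


class LocalStub:
    """Hypothetical stand-in for the local ONNX embedding client."""


def get_client(make_http: Callable[[], object]) -> object:
    """Try the HTTP backend unless the latch is set; otherwise use the local one."""
    global _http_embed_down
    if not _http_embed_down:
        try:
            return make_http()  # in the real code this includes a health check
        except Exception:
            _http_embed_down = True  # remember the failure, skip future probes
    return LocalStub()


def reset_http_embed_state() -> None:
    """Clear the latch so the next call probes the HTTP backend again."""
    global _http_embed_down
    _http_embed_down = False
```

After one failed probe, later calls go straight to the local client until `reset_http_embed_state()` is called.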

Both clients expose the same minimal surface (`encode(texts, normalize_embeddings=True)`, `get_sentence_embedding_dimension()`), so `MemoryVectorStore` doesn't know which backend it's using. The HTTP path sends texts in batches of 64; fastembed embeds inline; both L2-normalise their output so the cosine collection's distances are well-behaved.
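
Batching and L2 normalisation are simple enough to sketch in plain Python. The helper names below (`batched`, `l2_normalize`, `cosine_distance`) are illustrative and not taken from `src/embeddings.py`; the batch size of 64 is the one quoted above.

```python
import math
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """For unit-length inputs, cosine distance is 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

With unit-length vectors, distances from a cosine-space index stay in a predictable range, which is what makes a fixed similarity threshold meaningful across backends.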

Sources: [src/embeddings.py:27-87](), [src/embeddings.py:90-143](), [src/embeddings.py:163-213]()

## Keyword Fallback: Memory Without Vectors

When Chroma is unreachable *and* fastembed isn't installed, the memory subsystem still works: it falls back to a keyword scorer inside `MemoryManager`. `get_relevant_memories()` is a hand-built lexical ranker (set overlap rather than BM25 proper): tokenize, compute Jaccard overlap, then apply category-aware boosts.

The query is first classified into one of `identity / contact / preference / task / fact` by keyword presence, and matching memories get a multiplicative boost:

```python
# services/memory/memory.py
if query_type == "contact":
    has_contact_info = any(word in memory_text for word in [
        "@gmail.com", "@", ".com", "phone", "number",
        "address", "http", "www", "tel:"])
    if has_contact_info:
        final_score *= 1.4   # 40% boost
```

The identity case is special: for an identity query, every memory that looks like a name (capitalised pair, `"i'm"`, `"my name"`, `"call me"`) is pre-seeded at score 0.9 *regardless* of token overlap. This is the bit that keeps "what's my name?" working even when the question shares no tokens with the stored fact.
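
The scoring flow (classify, Jaccard overlap, identity pre-seed, contact boost) can be sketched as follows. The hint lists, the tokenizer regex, and the function names are assumptions made for this example; the real classifier covers more categories and also detects capitalised name pairs, which this sketch omits.

```python
import re
from typing import List, Set, Tuple

# Hypothetical keyword lists; the real vocabulary lives in services/memory/memory.py.
QUERY_HINTS = {
    "identity": ("who am i", "my name"),
    "contact": ("email", "phone", "address"),
}
NAME_MARKERS = ("i'm", "my name", "call me")


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(query: str) -> str:
    lowered = query.lower()
    for query_type, hints in QUERY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return query_type
    return "fact"


def rank(query: str, memories: List[str]) -> List[Tuple[float, str]]:
    query_type = classify(query)
    query_tokens = tokenize(query)
    scored = []
    for text in memories:
        lowered = text.lower()
        score = jaccard(query_tokens, tokenize(text))
        if query_type == "identity" and any(m in lowered for m in NAME_MARKERS):
            score = max(score, 0.9)  # pre-seed regardless of token overlap
        if query_type == "contact" and ("@" in lowered or "phone" in lowered):
            score *= 1.4  # multiplicative category boost
        if score > 0:
            scored.append((round(score, 4), text))
    return sorted(scored, reverse=True)
```

In this sketch, `rank("what's my name?", ["Call me Sam", "Prefers dark mode"])` surfaces only the name memory, even though the query and the stored fact share no tokens.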

This same module also recognises inline `remember: X` commands via a single regex (`remember | memorize | save | note | store`) before any LLM is involved — the cheapest possible path for explicit user saves.
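
A single-regex command parser along these lines is easy to sketch. The exact pattern in `memory.py` is not reproduced here; `REMEMBER_RE` and `parse_remember` are assumptions that only reuse the five verbs listed above.

```python
import re
from typing import Optional

# Assumed pattern: one of the five verbs at the start, an optional colon, then the fact.
REMEMBER_RE = re.compile(
    r"^\s*(?:remember|memorize|save|note|store)\b\s*:?\s*(?P<fact>.+)$",
    re.IGNORECASE,
)


def parse_remember(message: str) -> Optional[str]:
    """Return the text to save, or None when the message is not a save command."""
    match = REMEMBER_RE.match(message)
    return match.group("fact").strip() if match else None
```

The `\b` after the verb group keeps words such as "notebook" from being treated as a "note" command.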

Sources: [services/memory/memory.py:81-99](), [services/memory/memory.py:263-359]()

## Skill Extraction: How "It Gets Better"

The skill extractor is the piece behind the README's claim that the agent learns over time. It runs **after** an agent loop finishes, and the threshold is intentionally low:

```python
# services/memory/skill_extractor.py
if round_count < 2 and tool_count < 2:
    return None   # nothing complex enough to distill
```

A single-round, single-tool answer is treated as not worth turning into a skill. Two or more of either, and the extractor takes the last 12 messages of the session, hands them to the LLM with a tightly worded system prompt that is more about *refusal* than instruction, and asks for a JSON skill object — or the bare word `null`.
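
The gate and the message window combine into a few lines. `select_context` and `MAX_MESSAGES` are hypothetical names for this sketch; the condition and the window of 12 are the ones described above.

```python
from typing import Dict, List, Optional

MAX_MESSAGES = 12  # only the tail of the session is sent to the LLM


def select_context(
    messages: List[Dict[str, str]], round_count: int, tool_count: int
) -> Optional[List[Dict[str, str]]]:
    """Return the messages to distill, or None when the run was too simple."""
    if round_count < 2 and tool_count < 2:
        return None  # single round and at most one tool call: skip extraction
    return messages[-MAX_MESSAGES:]
```

Because the two conditions are joined with `and`, a run qualifies as soon as either the round count or the tool count reaches 2.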

The prompt has explicit anti-patterns the model is told to reject:

- The real work happened **outside the computer** (user did it physically, in person, on another device).
- A one-off, personal, context-specific task that won't recur.
- A pure Q&A or explanation with no transferable method.
- The agent failed or gave up.

This conservative shape is the point. The library bloats fast if every agent run produces a skill; the prompt is engineered to bias toward `null`.

### Post-processing the LLM Response

Real models don't always cooperate. The extractor handles several known failure modes:

| Failure mode | Mitigation | Source |
|---|---|---|
| `null` or empty | Bail silently — debug log only | [services/memory/skill_extractor.py:120-125]() |
| `<think>...</think>` (R1-class models) | `strip_think(prose=True, prompt_echo=True)` removes reasoning preamble | [services/memory/skill_extractor.py:127-137]() |
| Markdown code fences around JSON | Strip first ` ``` ` line and trailing fence | [services/memory/skill_extractor.py:140-142]() |
| JSON embedded in surrounding prose | Slice from first `{` to last `}` | [services/memory/skill_extractor.py:143-149]() |
| Low-confidence "maybe" skills | Drop anything below `MIN_CONFIDENCE = 0.6` | [services/memory/skill_extractor.py:44-45](), [services/memory/skill_extractor.py:163-172]() |
| Duplicate titles | Case-insensitive title match against existing skills, drop | [services/memory/skill_extractor.py:174-179]() |

Successful extractions fire a `skill_added` event on the bus so the UI can show a toast without polling.
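
The mitigations in the table chain naturally into one parsing function. The sketch below is an approximation under assumptions: it uses a plain regex instead of the repository's `strip_think` helper, and the `title` and `confidence` keys are assumed field names for the JSON skill object.

```python
import json
import re
from typing import Any, Dict, Iterable, Optional

MIN_CONFIDENCE = 0.6
FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def parse_skill_reply(raw: str, existing_titles: Iterable[str] = ()) -> Optional[Dict[str, Any]]:
    text = THINK_RE.sub("", raw or "").strip()  # drop reasoning preamble
    if not text or text.lower() == "null":
        return None  # the model declined to produce a skill
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the trailing fence
        text = "\n".join(lines).strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        skill = json.loads(text[start:end + 1])  # slice JSON out of surrounding prose
        confidence = float(skill.get("confidence", 0.0))
    except (json.JSONDecodeError, TypeError, ValueError):
        return None
    if confidence < MIN_CONFIDENCE:
        return None  # low-confidence "maybe" skill
    title = str(skill.get("title", "")).strip().lower()
    if not title or title in {t.strip().lower() for t in existing_titles}:
        return None  # missing or duplicate title
    return skill
```

Every failure path returns `None`, which matches the extractor's bias toward saving nothing.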

Sources: [services/memory/skill_extractor.py:15-41](), [services/memory/skill_extractor.py:51-209]()

## Skill Storage: SKILL.md on Disk, Not in a DB

Where memories live in `memory.json` and a vector index, skills live as actual files: `data/skills/<category>/<name>/SKILL.md`, each with YAML frontmatter and a structured body (`When to Use / Procedure / Pitfalls / Verification`). The format is deliberately inspired by Hermes' skills format — human-editable, git-friendly, and round-trippable.
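
To make the on-disk layout concrete, here is a small sketch that builds the file path and renders a SKILL.md body. The frontmatter keys, the `##` heading level, and the helper names are assumptions; only the directory layout and the four section titles come from the description above.

```python
from pathlib import Path
from typing import Dict

SECTIONS = ("When to Use", "Procedure", "Pitfalls", "Verification")


def skill_path(root: Path, category: str, name: str) -> Path:
    """Build <root>/<category>/<name>/SKILL.md."""
    return root / category / name / "SKILL.md"


def render_skill_md(frontmatter: Dict[str, str], body: Dict[str, str]) -> str:
    """Render simple key/value frontmatter followed by the four fixed sections."""
    lines = ["---"]
    # Plain "key: value" lines; values needing YAML quoting are out of scope here.
    lines += [f"{key}: {value}" for key, value in frontmatter.items()]
    lines.append("---")
    for section in SECTIONS:
        lines += ["", f"## {section}", "", body.get(section, "")]
    return "\n".join(lines).rstrip() + "\n"
```

Keeping the section order fixed is one way to make the file round-trippable: a parser can split on the known headings and recover the same structure.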

Hot, churn-prone data is kept *out* of the file. Usage counters and audit verdicts live in a sidecar `data/skills/_usage.json` keyed by skill name:

> "Usage counters (`uses`, `last_used`) live in a sidecar so the SKILL.md content doesn't churn on every retrieval." — [services/memory/skill_format.py:43-44]()

`SkillsManager.add_skill()` does free dedup on every LLM-authored save: it tokenises name + description + when-to-use + procedure, runs Jaccard against every existing skill, and at `>= 0.82` overlap it bumps the existing skill's `uses` counter and returns it with `_deduped=True` instead of writing a new file. User-authored skills bypass the gate ("a human asked for it").

```python
# services/memory/skills.py
if _jaccard(cand, ex) >= 0.82:
    self.record_use(s["name"])
    return {**s, "_deduped": True, "_duplicate_of": s.get("name")}
```
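
The helper behind that check is ordinary set-overlap arithmetic. A self-contained sketch follows; the field names in `FIELDS`, the tokenizer regex, and `find_duplicate` are assumptions, while the 0.82 threshold is the one quoted above.

```python
import re
from typing import Dict, List, Optional, Set

DEDUP_THRESHOLD = 0.82
# Assumed field names for the four text parts that feed the token set.
FIELDS = ("name", "description", "when_to_use", "procedure")


def skill_tokens(skill: Dict[str, str]) -> Set[str]:
    text = " ".join(str(skill.get(field, "")) for field in FIELDS)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicate(candidate: Dict[str, str], existing: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the first existing skill whose token overlap reaches the threshold."""
    cand = skill_tokens(candidate)
    for skill in existing:
        if jaccard(cand, skill_tokens(skill)) >= DEDUP_THRESHOLD:
            return skill
    return None
```

For scale: two skills sharing three of four distinct tokens score 0.75, which is below the threshold, so only very close rewrites are collapsed.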

### What the Agent Actually Sees

`index_for(owner, active_toolsets, platform)` is the function that produces the lightweight `[{name, description, category, status}]` list injected into the system prompt. The filtering rules are worth flagging because they aren't obvious:

- **Published** skills always included.
- **Drafts** are excluded — *except* drafts written by the teacher-escalation loop (`source == "teacher-escalation"`). The teacher loop's whole job is for the student to find a new procedure on the very next turn; gating it behind a manual publish click would defeat the loop.
- **`requires_toolsets`** hides a skill unless every required toolset is active.
- **`fallback_for_toolsets`** does the opposite — hides the skill *when* a named toolset is active, so a "manual scp" fallback disappears when the real SSH tool is loaded.
- **Platform gate** — `platforms: [linux, macos]` excludes the skill on Windows.
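
The filtering rules above can be expressed as a single predicate. This sketch leaves out owner filtering and assumes the metadata keys (`status`, `source`, `requires_toolsets`, `fallback_for_toolsets`, `platforms`) are plain dictionary fields; `is_visible` and `index_sketch` are hypothetical names, not the repository's functions.

```python
from typing import Any, Dict, Iterable, List, Set


def is_visible(skill: Dict[str, Any], active_toolsets: Set[str], platform: str) -> bool:
    status = skill.get("status", "draft")
    if status != "published":
        # Drafts are hidden unless the teacher-escalation loop wrote them.
        if not (status == "draft" and skill.get("source") == "teacher-escalation"):
            return False
    if not set(skill.get("requires_toolsets", [])) <= active_toolsets:
        return False  # a required toolset is missing
    if set(skill.get("fallback_for_toolsets", [])) & active_toolsets:
        return False  # the real tool is loaded, so hide the fallback
    platforms = skill.get("platforms")
    if platforms and platform not in platforms:
        return False  # platform gate
    return True


def index_sketch(
    skills: Iterable[Dict[str, Any]], active_toolsets: Iterable[str], platform: str
) -> List[Dict[str, str]]:
    active = set(active_toolsets)
    visible = [s for s in skills if is_visible(s, active, platform)]
    visible.sort(key=lambda s: (s.get("category", ""), s["name"]))
    return [
        {
            "name": s["name"],
            "description": s.get("description", ""),
            "category": s.get("category", ""),
            "status": s.get("status", "draft"),
        }
        for s in visible
    ]
```

Note that `requires_toolsets` uses a subset test while `fallback_for_toolsets` uses an intersection test: the first needs all named toolsets, the second reacts to any one of them.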

Sources: [services/memory/skills.py:276-365](), [services/memory/skills.py:494-545](), [services/memory/skill_format.py:7-44]()

## Audit & Tidy: The Self-Cleaning Loop

`memory_extractor.py` carries a complementary mechanism for memories. After each LLM turn, recent messages are sent to the model with an extraction prompt; the audit pass periodically rewrites vague entries, consolidates duplicates, and removes junk.

The audit is expensive (30–120 s LLM call) so it's gated by a fingerprint:

```python
# services/memory/memory_extractor.py
items = sorted(
    (str(e.get("id", "")), e.get("text", ""), e.get("category", ""))
    for e in entries
)
# sha256 over id+text+category — any add/edit/delete invalidates it
```

The fingerprint per owner is persisted in `memory_tidy_state.json` next to `memory.json`. If the current fingerprint matches the last successful audit, the LLM call is skipped — "running the LLM again on an already-clean list was wasting 30-120s per call and occasionally timing out on the second pass."
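
The skip logic amounts to comparing a stored hash with a freshly computed one. In the sketch below, serialising the sorted tuples with `json.dumps` before hashing is an assumption (the excerpt only states that the hash covers id, text, and category), and `needs_audit` is a hypothetical helper operating on an in-memory dict instead of `memory_tidy_state.json`.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def fingerprint(entries: Iterable[Dict[str, Any]]) -> str:
    """Order-independent SHA-256 over each entry's id, text, and category."""
    items = sorted(
        (str(e.get("id", "")), e.get("text", ""), e.get("category", ""))
        for e in entries
    )
    payload = json.dumps(items, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_audit(entries: Iterable[Dict[str, Any]], state: Dict[str, str], owner: str) -> bool:
    """True when the owner's memories changed since the last successful audit."""
    return state.get(owner) != fingerprint(entries)
```

Sorting before hashing makes the fingerprint independent of storage order, so only real adds, edits, or deletes trigger the expensive LLM call.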

Sources: [services/memory/memory_extractor.py:1-58]()

## Tradeoffs and Surprising Details

A few things in this design that a README-only reader would miss:

- **Two parallel implementations.** `services/memory/memory_vector.py` and `src/memory_vector.py` are byte-identical at the time of writing. The `services/` tree is the newer modular layout; `src/` is the legacy package still imported by older callers. The same is true of `memory.py` in both trees.
- **Ownership is enforced by strict equality.** `SkillsManager.load(owner=…)` filters with `s.get("owner") == owner`. An earlier predicate also let skills with *no* owner through, which leaked legacy skills to every authenticated user — the inline comment calls this out as a fixed security regression ([services/memory/skills.py:265-270]()).
- **Skills can be `requires_toolsets` or `fallback_for_toolsets`.** This is a small but real mechanism for toolset-aware procedure routing — a skill describing "how to send mail by raw SMTP" can be marked as fallback for the Gmail toolset and only surface when Gmail isn't loaded.
- **No SentenceTransformer dependency.** `EmbeddingClient.encode` is "drop-in" for SentenceTransformer's interface but never imports it. fastembed is the only true local dependency and it ships ONNX, not PyTorch — keeping the install graph light.
- **Confidence is applied at two levels.** The extractor drops anything below 0.6 ([services/memory/skill_extractor.py:44-45]()); at retrieval time the score is multiplied by `1.0 + confidence * 0.1`, a boost of up to 10% at confidence 1.0, so higher-confidence skills win ties ([services/memory/skills.py:604]()).
- **The "skill index" the agent reads is sorted by `(category, name)`.** Skill ordering in the prompt is deterministic, which matters for prompt caching — the same skill set produces the same prefix bytes turn after turn.
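
Two of the points in this list, strict owner equality and the confidence multiplier, are small enough to sketch directly. `visible_to` and `boosted` are hypothetical helper names; the equality predicate and the `1.0 + confidence * 0.1` factor are the ones quoted above.

```python
from typing import Any, Dict, Iterable, List


def visible_to(skills: Iterable[Dict[str, Any]], owner: str) -> List[Dict[str, Any]]:
    """Strict equality: skills without an owner are not shown to anyone."""
    return [s for s in skills if s.get("owner") == owner]


def boosted(score: float, confidence: float) -> float:
    """Scale a retrieval score by the confidence multiplier."""
    return score * (1.0 + confidence * 0.1)
```

With equal base scores, a skill at confidence 0.9 outranks one at 0.6, but the multiplier is too small to overturn a clearly better match.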

## What Builders Should Notice

The pattern here is portable and provider-neutral by design. Three composable layers — Chroma + a pluggable embedding client + a keyword fallback — give the system three failure-graceful modes (full vector, vector with local ONNX, no vector at all). The skill extractor on top is essentially a *prompt-engineered classifier* whose first job is to say "this isn't worth saving" — the conservatism is what keeps the library small enough to inject into every system prompt.

The split between dynamic memory (JSON + Chroma) and procedural memory (markdown files on disk) is the load-bearing decision. Memories are short, owner-scoped, and easy to re-embed; skills are versioned, hand-editable, and intentionally durable. The fact that one of them is a vector store and the other is a directory of files is exactly the right mismatch.

Sources: [services/memory/memory_vector.py:90-114](), [services/memory/skills.py:494-545](), [services/memory/skill_extractor.py:15-45]()