# Invariants, Failure Modes & Safe-Change Rules

> The core invariants: state lives exclusively in iii-engine (no local SQLite or Postgres required); the worker suppresses unhandledRejection to survive iii SDK 30s timeouts under write bursts; BM25 index is always present so search never fully fails; circuit-breakers isolate provider outages; and the sdk-guard hook prevents recursive hook invocations. Safe-change rules: embedding dimension changes require index migration; adding a new function requires both registerXxx in index.ts and a tools-registry entry; provider fallback order is config-driven, not hardcoded.

- Repository: rohitg00/agentmemory
- GitHub: https://github.com/rohitg00/agentmemory
- Human wiki: https://grok-wiki.com/public/wiki/rohitg00-agentmemory-94f173bce1dc
- Complete Markdown: https://grok-wiki.com/public/wiki/rohitg00-agentmemory-94f173bce1dc/llms-full.txt

## Source Files

- `src/index.ts`
- `src/providers/circuit-breaker.ts`
- `src/hooks/sdk-guard.ts`
- `src/functions/migrate-vector-index.ts`
- `src/health/monitor.ts`
- `src/health/thresholds.ts`
- `src/mcp/tools-registry.ts`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/index.ts](src/index.ts)
- [src/providers/circuit-breaker.ts](src/providers/circuit-breaker.ts)
- [src/providers/resilient.ts](src/providers/resilient.ts)
- [src/providers/fallback-chain.ts](src/providers/fallback-chain.ts)
- [src/providers/index.ts](src/providers/index.ts)
- [src/hooks/sdk-guard.ts](src/hooks/sdk-guard.ts)
- [src/functions/migrate-vector-index.ts](src/functions/migrate-vector-index.ts)
- [src/functions/search.ts](src/functions/search.ts)
- [src/state/vector-index.ts](src/state/vector-index.ts)
- [src/health/monitor.ts](src/health/monitor.ts)
- [src/health/thresholds.ts](src/health/thresholds.ts)
- [src/mcp/tools-registry.ts](src/mcp/tools-registry.ts)
</details>

This page documents the structural invariants that agentmemory relies on to stay correct, the failure modes each invariant guards against, and the rules contributors must follow to change the system safely. Understanding these constraints lets you predict what will break (and why) before touching any subsystem.

The design philosophy is *graceful degradation*: a crashed embedding provider must not stop BM25 search; a slow iii-engine write must not kill the worker process; a recursive hook invocation must not burn tokens in an infinite loop. Each of the invariants below is the concrete mechanism that enforces one of those goals.

---

## Core Invariants

### 1. State lives exclusively in iii-engine (StateKV)

All persistent data — observations, memories, sessions, actions, leases, signals — is stored through `StateKV`, which delegates every read and write to the iii-engine. There is no local SQLite, no local Postgres, and no file-level persistence beyond the optional in-process BM25/vector snapshot written through `IndexPersistence` (itself backed by `kv.set`).

This means:
- The worker is stateless beyond what iii-engine holds. A process restart replays nothing locally; state survives because iii-engine holds it.
- The optional `IndexPersistence` is a read-through cache of the hot search index, not a source of truth. Losing it forces a `rebuildIndex` pass but does not lose any memories.

Sources: [src/index.ts:188-213]()

---

### 2. `unhandledRejection` is suppressed to survive iii SDK 30s timeouts

Under sustained write load (for example, Claude Code hooks firing across many projects), `state::set` can occasionally exceed the SDK's 30-second internal timeout and produce a rejected Promise that no call-site `.catch()` catches. Without a process-level handler, Node.js would terminate the long-lived worker.

The worker installs a throttled global handler at startup:

```typescript
// src/index.ts:119-129
let lastUnhandledLogAt = 0;
process.on("unhandledRejection", (reason) => {
  const now = Date.now();
  if (now - lastUnhandledLogAt < 60_000) return;
  lastUnhandledLogAt = now;
  const r = reason as { code?: string; function_id?: string; message?: string };
  console.warn(
    `[agentmemory] unhandledRejection (suppressed):`,
    r?.code ? `${r.code} ${r.function_id ?? ""} ${r.message ?? ""}`.trim() : reason,
  );
});
```

The handler **logs at most once per minute** (the throttle prevents log storms during write bursts) and then lets the process **continue**. Call sites that attach their own `.catch()` have already surfaced their errors; the global handler is the last safety net for rejections that escape them.

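**Failure mode if removed:** Node.js (v15 and later) treats an unhandled rejection as an uncaught exception by default, so a single timeout under load would exit the process with a non-zero code, taking down the long-lived worker until something restarts it.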

Sources: [src/index.ts:112-129]()

---

### 3. BM25 index is always present — search never fully fails

`getSearchIndex()` in `src/functions/search.ts` uses lazy initialization with a module-level singleton and never returns `null`:

```typescript
// src/functions/search.ts:16-19
let index: SearchIndex | null = null
export function getSearchIndex(): SearchIndex {
  if (!index) index = new SearchIndex()
  return index
}
```

The vector index, by contrast, is `null` when no embedding provider is configured; in that case the worker logs `BM25-only mode` at boot. `HybridSearch` is constructed with both indexes and short-circuits vector scoring when the vector index is absent. This means:

- BM25 keyword search always works, even with zero providers configured.
- Semantic (cosine) search is layered on top when an embedding provider is present.
- A provider outage at runtime degrades to BM25 without a crash.
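
The degradation path above can be sketched as follows. This is a minimal illustration and not the repository's `HybridSearch`: the `KeywordIndex` and `VectorLike` interfaces, the `hybridScores` function, and the 50/50 weighting are all assumptions made for the example.

```typescript
// Hypothetical interfaces for illustration; not the repository's actual types.
interface KeywordIndex {
  search(query: string): Map<string, number>; // id -> keyword (BM25-style) score
}
interface VectorLike {
  search(queryEmbedding: Float32Array): Map<string, number>; // id -> cosine score
}

// Combine keyword and vector scores. Vector scoring is skipped entirely when
// the vector index or the query embedding is absent (BM25-only mode).
function hybridScores(
  keyword: KeywordIndex,
  vector: VectorLike | null,
  query: string,
  queryEmbedding: Float32Array | null,
  vectorWeight = 0.5, // assumed weighting, chosen only for this example
): Map<string, number> {
  const keywordScores = keyword.search(query);
  if (!vector || !queryEmbedding) return keywordScores; // short-circuit

  const combined = new Map<string, number>();
  keywordScores.forEach((score, id) => {
    combined.set(id, (1 - vectorWeight) * score);
  });
  vector.search(queryEmbedding).forEach((score, id) => {
    combined.set(id, (combined.get(id) ?? 0) + vectorWeight * score);
  });
  return combined;
}
```

Because the keyword path never depends on the vector path, a missing or failing embedder changes ranking quality but not availability.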

Sources: [src/functions/search.ts:12-19](), [src/index.ts:193-195, 324-334]()

---

### 4. Circuit-breakers isolate provider outages

Every LLM provider call is wrapped in a `ResilientProvider`, which owns a `CircuitBreaker` instance. The breaker has three states:

```
closed → (≥3 failures within 60s) → open → (30s recovery) → half-open → (1 success) → closed
                                                                           ↓ failure
                                                                          open
```

```typescript
// src/providers/circuit-breaker.ts:23-30
constructor(opts?: CircuitBreakerOptions) {
  this.failureThreshold = Math.max(1, Math.floor(positiveFinite(opts?.failureThreshold, 3)));
  this.failureWindowMs = positiveFinite(opts?.failureWindowMs, 60_000);
  this.recoveryTimeoutMs = positiveFinite(opts?.recoveryTimeoutMs, 30_000);
}
```

When the circuit is open, `ResilientProvider.call()` throws `"circuit_breaker_open"` immediately without making a network call. Functions that call compress/summarize (`registerCompressFunction`, `registerSummarizeFunction`, etc.) propagate this error rather than hanging on a downed provider.

When multiple fallback providers are configured (via `AGENTMEMORY_FALLBACK_PROVIDERS`), `FallbackChainProvider` tries each one in order. `ResilientProvider` wraps the entire chain, so the breaker records a failure only after every provider in the chain has failed.
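
The state machine can be sketched in a few lines. This is not the code in `src/providers/circuit-breaker.ts`; the class name `SketchCircuitBreaker`, its method names, and the injected clock are assumptions made so the transitions can be exercised without real timers. Only the three defaults (3 failures, 60 s window, 30 s recovery) come from the constructor shown above.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical three-state breaker, written for illustration only.
class SketchCircuitBreaker {
  private state: BreakerState = "closed";
  private failureTimes: number[] = []; // timestamps of failures inside the window
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly failureWindowMs = 60_000,
    private readonly recoveryTimeoutMs = 30_000,
    private readonly now: () => number = () => Date.now(),
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  // True when a call may go out. An open breaker moves to half-open once the
  // recovery timeout has elapsed, which admits a trial call.
  isAllowed(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failureTimes = [];
  }

  recordFailure(): void {
    const t = this.now();
    if (this.state === "half-open") {
      this.trip(t); // a failed trial call reopens the circuit
      return;
    }
    // Keep only failures that are still inside the sliding window.
    this.failureTimes = this.failureTimes.filter((f) => t - f < this.failureWindowMs);
    this.failureTimes.push(t);
    if (this.failureTimes.length >= this.failureThreshold) this.trip(t);
  }

  private trip(at: number): void {
    this.state = "open";
    this.openedAt = at;
    this.failureTimes = [];
  }
}
```

A caller would check `isAllowed()` before the network call, fail fast when it returns `false`, and report the outcome through `recordSuccess()` or `recordFailure()`.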

Sources: [src/providers/circuit-breaker.ts:13-82](), [src/providers/resilient.ts:4-37](), [src/providers/fallback-chain.ts:4-31]()

---

### 5. The sdk-guard hook prevents recursive hook invocations

When agentmemory spawns a Claude session internally (e.g., via the agent-sdk provider for compress/summarize), the child Claude Code session inherits all parent hook scripts. If a child session's hooks fire and call back into `/agentmemory/*`, the result is unbounded recursion that burns tokens and creates ghost sessions.

Two signals identify an SDK-child context. Either one is sufficient, so hook scripts must check both:

```typescript
// src/hooks/sdk-guard.ts:20-26
export function isSdkChildContext(payload: unknown): boolean {
  if (process.env.AGENTMEMORY_SDK_CHILD === "1") return true;
  if (!payload || typeof payload !== "object") return false;
  const p = payload as Record<string, unknown>;
  if (p["entrypoint"] === "sdk-ts") return true;
  return false;
}
```

Signal 1 — `AGENTMEMORY_SDK_CHILD=1`: set by the agent-sdk provider before spawning `query()`, inherited by all child processes.  
Signal 2 — `payload.entrypoint === "sdk-ts"`: written by Claude Code into hook stdin when the session was launched by the Agent SDK.

**Any hook script must call `isSdkChildContext(payload)` before doing any work and return silently when it is true.**
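
A hook body that follows this rule might look like the sketch below. The guard is restated with the environment passed in as a parameter so the example is self-contained and testable; `isSdkChild`, `runHook`, and the `forward` callback are hypothetical names, and a real hook would import `isSdkChildContext` from `src/hooks/sdk-guard.ts` instead.

```typescript
type Env = Record<string, string | undefined>;

// Restatement of the guard with an injected environment (assumption: the
// real function reads process.env directly, as shown above).
function isSdkChild(payload: unknown, env: Env): boolean {
  if (env["AGENTMEMORY_SDK_CHILD"] === "1") return true;
  if (!payload || typeof payload !== "object") return false;
  return (payload as Record<string, unknown>)["entrypoint"] === "sdk-ts";
}

// Hypothetical hook body. `forward` stands in for whatever the hook normally
// does, for example sending the payload to an agentmemory endpoint.
function runHook(
  payload: unknown,
  env: Env,
  forward: (payload: unknown) => void,
): "skipped" | "forwarded" {
  if (isSdkChild(payload, env)) return "skipped"; // return silently, no side effects
  forward(payload);
  return "forwarded";
}
```

The guard runs before any other work, so a child session never reaches the code path that could spawn another child.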

Sources: [src/hooks/sdk-guard.ts:1-26]()

---

## Failure Modes

### Embedding dimension mismatch corrupts search silently

`cosineSimilarity` in `VectorIndex` returns `0` when the two arrays have different lengths:

```typescript
// src/state/vector-index.ts:9-11
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) return 0;
  ...
}
```

A mismatch between a stored vector's dimension and the query vector's dimension causes that observation to score zero on every query — it silently disappears from results without an error. The system guards against this at two sites:

1. **Write site** (`vectorIndexAddGuarded`): checks that `embedding.length` equals `ep.dimensions` before calling `vi.add()`. On a mismatch it logs a warning and skips the item.
2. **Persistence load** (`src/index.ts:368-409`): `VectorIndex.validateDimensions()` walks every persisted vector and refuses to restore the index if any mismatches are found. The worker either throws a fatal error (forcing operator action) or discards the stale index when `AGENTMEMORY_DROP_STALE_INDEX=true`.
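
Both guards reduce to a length comparison at different points in the lifecycle. The sketch below mirrors them in spirit only: `SketchVectorIndex`, `addGuarded`, and `findDimensionMismatches` are hypothetical names, and the real `vectorIndexAddGuarded` and `validateDimensions` may have different signatures and return values.

```typescript
// Hypothetical minimal vector store, used only for this sketch.
class SketchVectorIndex {
  private readonly vectors = new Map<string, Float32Array>();

  add(id: string, embedding: Float32Array): void {
    this.vectors.set(id, embedding);
  }

  size(): number {
    return this.vectors.size;
  }

  // Load-time check: return the ids whose stored dimension differs from `expected`.
  findDimensionMismatches(expected: number): string[] {
    const bad: string[] = [];
    this.vectors.forEach((vector, id) => {
      if (vector.length !== expected) bad.push(id);
    });
    return bad;
  }
}

// Write-site check: skip (and warn about) an embedding of the wrong dimension
// rather than storing a vector that would score zero on every query.
function addGuarded(
  index: SketchVectorIndex,
  id: string,
  embedding: Float32Array,
  expectedDimensions: number,
  warn: (message: string) => void = (message) => console.warn(message),
): boolean {
  if (embedding.length !== expectedDimensions) {
    warn(`skipping ${id}: got ${embedding.length} dimensions, expected ${expectedDimensions}`);
    return false;
  }
  index.add(id, embedding);
  return true;
}
```

A non-empty result from `findDimensionMismatches` corresponds to the case where the worker refuses to restore the persisted index.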

Sources: [src/state/vector-index.ts:9-11, 77-90](), [src/index.ts:362-409](), [src/functions/search.ts:55-87]()

### Index rebuild blocks the viewer server if awaited

`rebuildIndex` iterates every observation across every session and awaits an embedding provider call per record. On a large corpus with a rate-limited endpoint this can take hours. The worker fires it as a **fire-and-forget** void:

```typescript
// src/index.ts:423-431
void rebuildIndex(kv)
  .then((indexCount) => {
    if (indexCount > 0) {
      bootLog(`Search index rebuilt: ${indexCount} entries`);
      indexPersistence.scheduleSave();
    }
  })
  .catch((err) => {
    console.warn(`[agentmemory] Failed to rebuild search index:`, err);
  });
```

If the rebuild were awaited, the viewer server would stay unbound until it finished. With fire-and-forget, the viewer starts immediately and search runs with partial coverage until the rebuild completes.

Sources: [src/index.ts:412-431]()

### Health monitor thresholds

`evaluateHealth` in `src/health/thresholds.ts` classifies the worker into three states: `healthy`, `degraded`, or `critical`. Default thresholds:

| Metric | warn | critical |
|---|---|---|
| Event-loop lag | 100 ms | 500 ms |
| CPU usage | 80% | 90% |
| Heap usage | 80% | 95% |
| Engine connection | reconnecting | disconnected / failed |

KV connectivity is actively probed each cycle via a `set`+`get` round-trip with a 5-second timeout. A `kv_probe_failed` alert is raised if either the write or read times out.
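
The threshold table can be read as a two-pass classification: any critical breach wins, then any warn breach. The sketch below encodes that reading. It is not `evaluateHealth`: the `HealthSample` shape, the `classify` name, the `"connected"` state, and the use of `>=` at each boundary are assumptions.

```typescript
type HealthStatus = "healthy" | "degraded" | "critical";
type Connection = "connected" | "reconnecting" | "disconnected" | "failed";

// Hypothetical sample shape; the repository's snapshot type may differ.
interface HealthSample {
  eventLoopLagMs: number;
  cpuPercent: number;
  heapPercent: number;
  connection: Connection;
}

interface Threshold {
  warn: number;
  critical: number;
}

// Default thresholds copied from the table above.
const THRESHOLDS: Record<"eventLoopLagMs" | "cpuPercent" | "heapPercent", Threshold> = {
  eventLoopLagMs: { warn: 100, critical: 500 },
  cpuPercent: { warn: 80, critical: 90 },
  heapPercent: { warn: 80, critical: 95 },
};

function classify(sample: HealthSample): HealthStatus {
  const checks: Array<[number, Threshold]> = [
    [sample.eventLoopLagMs, THRESHOLDS.eventLoopLagMs],
    [sample.cpuPercent, THRESHOLDS.cpuPercent],
    [sample.heapPercent, THRESHOLDS.heapPercent],
  ];
  const critical =
    sample.connection === "disconnected" ||
    sample.connection === "failed" ||
    checks.some(([value, t]) => value >= t.critical);
  if (critical) return "critical";

  const degraded =
    sample.connection === "reconnecting" ||
    checks.some(([value, t]) => value >= t.warn);
  return degraded ? "degraded" : "healthy";
}
```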

Sources: [src/health/thresholds.ts:13-21, 33-80](), [src/health/monitor.ts:48-64]()

---

## Safe-Change Rules

### Embedding dimension changes require index migration

Switching to an embedding provider that declares a different `dimensions` value makes the existing persisted index incompatible. The system refuses to load cross-dimension indexes at startup. Safe procedure:

1. Run `mem::migrate-vector-index` (backed by `migrateVectorIndex` in `src/functions/migrate-vector-index.ts`). It re-embeds every memory and every session's observations against the new provider in a fresh `VectorIndex`, with per-session isolation so one bad session does not abort the rest.
2. Inspect `MigrateVectorIndexResult.failed` and `failedSessions` before swapping the live index.
3. Only then switch the `AGENTMEMORY_EMBEDDING_PROVIDER` env var.

Do **not** set `AGENTMEMORY_DROP_STALE_INDEX=true` on a production install unless you accept losing all vector search history. The flag is meant for development resets.
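
Step 2 of the procedure can be expressed as a small gate. This is a sketch under assumptions: only the field names `failed` and `failedSessions` come from the text above. Their types (`number` and `string[]`), the `MigrationOutcome` interface, and the `describeOutcome` helper are hypothetical.

```typescript
// Assumed shape of the migration result; the real MigrateVectorIndexResult
// may carry more fields or different types.
interface MigrationOutcome {
  failed: number;           // assumed: count of records that could not be re-embedded
  failedSessions: string[]; // assumed: ids of sessions whose migration aborted
}

// Decide whether the freshly built index is complete enough to go live.
function describeOutcome(result: MigrationOutcome): { safeToSwap: boolean; reason: string } {
  if (result.failed > 0) {
    return { safeToSwap: false, reason: `${result.failed} record(s) failed to re-embed` };
  }
  if (result.failedSessions.length > 0) {
    return { safeToSwap: false, reason: `sessions failed: ${result.failedSessions.join(", ")}` };
  }
  return { safeToSwap: true, reason: "no failures reported" };
}
```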

Sources: [src/functions/migrate-vector-index.ts:44-152](), [src/index.ts:362-409]()

### Adding a new function requires two registrations

Every new capability must be registered in two places:

1. **`src/index.ts`**: call `registerXxxFunction(sdk, kv, ...)` inside `main()`. Without this, the iii-engine never knows the function exists.
2. **`src/mcp/tools-registry.ts`**: add a `McpToolDef` entry to the appropriate version array (`CORE_TOOLS`, `V050_TOOLS`, etc.) and include it in `getAllTools()`. Without this, the MCP surface (used by agents and the `npx @agentmemory/mcp` adapter) does not expose the tool.

The two registrations are deliberately separate: an internal function can exist without an MCP surface, but an MCP tool with no backing function will fail silently or throw at invocation time.

Sources: [src/index.ts:204-306](), [src/mcp/tools-registry.ts:931-948]()

### Provider fallback order is config-driven, not hardcoded

`createFallbackProvider` in `src/providers/index.ts` reads the primary provider from config and appends fallbacks from `loadFallbackConfig()` (driven by `AGENTMEMORY_FALLBACK_PROVIDERS`). The chain order is:

```
primary provider → fallback[0] → fallback[1] → ...
```

`FallbackChainProvider.tryAll()` tries each in sequence; the first success wins. Do not hardcode a fallback inside any individual provider implementation — add it to the config-driven chain instead. This keeps the fallback topology observable and operator-controlled without code changes.
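
The first-success-wins behavior can be sketched as below. The `SketchProvider` interface, `tryInOrder`, and `parseFallbackNames` are hypothetical, and the comma-separated format assumed for `AGENTMEMORY_FALLBACK_PROVIDERS` is a guess for illustration; the actual parsing lives in `loadFallbackConfig()`.

```typescript
// Hypothetical provider shape for this sketch.
interface SketchProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Try each provider in order and return the first success. If every provider
// fails, throw one error that names each provider and its failure.
async function tryInOrder(providers: SketchProvider[], prompt: string): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}

// Turn a comma-separated list into ordered provider names (assumed format).
function parseFallbackNames(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}
```

Because the order comes from a single parsed list, changing the topology is a configuration edit and the chain logic stays untouched.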

Sources: [src/providers/index.ts:32-59](), [src/providers/fallback-chain.ts:18-30]()

### New hook scripts must guard against SDK-child recursion

Any new Claude Code hook script that calls an agentmemory endpoint must import `isSdkChildContext` from `src/hooks/sdk-guard.ts` and return without action when it returns `true`. Omitting this guard causes the hook to fire inside agent-sdk child sessions, potentially triggering the exact endpoint that spawned the child session and creating an unbounded invocation loop.

Sources: [src/hooks/sdk-guard.ts:1-26]()

---

## Boundary Summary

```text
┌────────────────────────────────────────────────────────────┐
│                       agentmemory worker                   │
│                                                            │
│  ┌──────────────┐    ┌────────────────────────────────┐   │
│  │  BM25 index  │    │  VectorIndex (optional)        │   │
│  │  (always)    │    │  null when no embedder set     │   │
│  └──────┬───────┘    └───────────────┬────────────────┘   │
│         │                            │ guarded write       │
│         └────────────┬───────────────┘  (dim check)       │
│                      ▼                                     │
│              HybridSearch (BM25 + vector + graph)          │
│                      │                                     │
│  ┌───────────────────▼────────────────────────────────┐   │
│  │                StateKV → iii-engine                 │   │
│  │  (source of truth; no local DB)                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                            │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  ResilientProvider (CircuitBreaker)                  │   │
│  │    └─ FallbackChainProvider (config-driven order)   │   │
│  │         └─ anthropic | openai | openrouter | ...    │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                            │
│  process.on("unhandledRejection") ← suppresses SDK        │
│    30s timeout rejections (log-throttled, never rethrow)   │
└────────────────────────────────────────────────────────────┘
         ↑ hook invocations
         │ isSdkChildContext() guard prevents recursion
         └─ sdk-guard.ts
```

---

## Summary

The five invariants — iii-engine as the sole state store, suppressed `unhandledRejection` for timeout survival, always-present BM25 for search availability, circuit-breakers for provider isolation, and the sdk-guard hook for recursion prevention — collectively ensure that no single component failure terminates the worker or corrupts search. Contributors must respect the dimension-migration rule when changing embedding providers, the dual-registration rule when adding functions, and the config-driven fallback rule when modifying provider topology. The system is intentionally designed so that the degraded path (BM25-only, no LLM provider) still provides useful recall.

Sources: [src/index.ts:112-129, 193-195, 362-409](), [src/hooks/sdk-guard.ts:1-26]()
