# The Bug Hall of Shame: Deadlocks, Bedrock Bombs & Discord Disasters

> The CHANGELOG reads like a confessional: the expansion recursion guard needed a sub-agent deadlock guard; AWS Bedrock rejected requests that contained messages with empty content arrays until a fix in v0.10.0; the /lossless native command description was too long and Discord was silently truncating it during slash-command registration; a stat-fail loop could leave a conversation permanently stuck in transparent passthrough mode (no compaction ever ran); and bootstrap replay floods could inject thousands of duplicate messages. Each bug was quietly fixed in a release between v0.10.0 and v0.11.2.

- Repository: Martian-Engineering/lossless-claw
- GitHub: https://github.com/Martian-Engineering/lossless-claw
- Human wiki: https://grok-wiki.com/public/wiki/martian-engineering-lossless-claw-a94e8135853e
- Complete Markdown: https://grok-wiki.com/public/wiki/martian-engineering-lossless-claw-a94e8135853e/llms-full.txt

## Source Files

- `CHANGELOG.md`
- `src/tools/lcm-expansion-recursion-guard.ts`
- `src/assembler.ts`
- `test/bootstrap-flood-regression.test.ts`
- `test/regression-2026-03-17.test.ts`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [CHANGELOG.md](CHANGELOG.md)
- [src/tools/lcm-expansion-recursion-guard.ts](src/tools/lcm-expansion-recursion-guard.ts)
- [src/assembler.ts](src/assembler.ts)
- [src/plugin/lcm-command.ts](src/plugin/lcm-command.ts)
- [test/bootstrap-flood-regression.test.ts](test/bootstrap-flood-regression.test.ts)
- [test/regression-2026-03-17.test.ts](test/regression-2026-03-17.test.ts)
</details>

Every production system accumulates a rogues' gallery of bugs that only surface under real-world conditions: the ones that slip past every code review and unit test until a user hits them at the worst possible moment. Lossless Claw is no exception. Its CHANGELOG reads less like a feature log and more like a confessional: five fixes, spread across releases v0.10.0 through v0.11.2, each quietly closed a different catastrophic failure mode, each one a small masterwork of "who could have predicted this."

This page digs into the juicy specifics — what actually went wrong, why it was subtle, and how each fix closed the gap without introducing new problems.

---

## Bug 1: The Sub-Agent Deadlock That Wasn't (Until It Was)

**Version fixed:** 0.11.x (via `lcm-expansion-recursion-guard.ts`)

### What happened

The `lcm_expand_query` tool lets an AI agent spawn sub-agents to handle expensive retrieval work. The recursion guard was supposed to prevent sub-agents from re-entering `lcm_expand_query`, but it only blocked on *depth*. It had no concept of *concurrency*.

The specific failure mode: two calls from the same origin session could simultaneously acquire what should have been a single expansion lane. Both would proceed. Both would stamp delegated contexts. Each would then wait for the other to finish. Classic deadlock.

### The fix

Two separate guard mechanisms were added:

**Recursion guard** (`evaluateExpansionRecursionGuard`): blocks any session whose stamped `expansionDepth >= EXPANSION_DELEGATION_DEPTH_CAP` (hardcoded to 1). A second-time block on the same `requestId` is classified as `"idempotent_reentry"` rather than `"depth_cap"` so telemetry can distinguish retries from genuine recursive calls.

**Concurrency guard** (`acquireExpansionConcurrencySlot`): tracks a single active `requestId` per origin session in `activeRequestIdByOriginSessionKey`. A second concurrent caller gets `EXPANSION_CONCURRENCY_BLOCKED` immediately:

```typescript
// src/tools/lcm-expansion-recursion-guard.ts:255-266
if (activeRequestId && activeRequestId !== requestId) {
  return {
    blocked: true,
    code: EXPANSION_CONCURRENCY_ERROR_CODE,
    reason: "origin_session_in_flight",
    message:
      `${EXPANSION_CONCURRENCY_ERROR_CODE}: Another lcm_expand_query delegation is already ` +
      `in flight for origin session (${originSessionKey}; activeRequestId=${activeRequestId}). ` +
      buildExpansionConcurrencyRecoveryGuidance(originSessionKey),
    // ... remaining fields elided
  };
}
```

The recovery message even tells the blocked sub-agent what to do: use `lcm_grep` or `lcm_describe` as immediate fallbacks instead of waiting.
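The recursion guard's distinction between a first block and a retried block can be sketched as follows. Only `EXPANSION_DELEGATION_DEPTH_CAP` and the two reason strings come from the source; `evaluateDepthGuard`, `blockedRequestIds`, and the result shape are hypothetical simplifications, not the signature of `evaluateExpansionRecursionGuard`.

```typescript
// Illustrative sketch only. The cap value (1) and the reason strings are from
// the source; every other name here is an assumption made for the example.
const EXPANSION_DELEGATION_DEPTH_CAP = 1;

type GuardDecision =
  | { blocked: false }
  | { blocked: true; reason: "depth_cap" | "idempotent_reentry" };

// Request IDs that have already been blocked once (assumed in-memory bookkeeping).
const blockedRequestIds = new Set<string>();

function evaluateDepthGuard(expansionDepth: number, requestId: string): GuardDecision {
  // Sessions below the cap may delegate.
  if (expansionDepth < EXPANSION_DELEGATION_DEPTH_CAP) {
    return { blocked: false };
  }
  // A repeat block for the same request is a retry, not a new recursive call.
  if (blockedRequestIds.has(requestId)) {
    return { blocked: true, reason: "idempotent_reentry" };
  }
  blockedRequestIds.add(requestId);
  return { blocked: true, reason: "depth_cap" };
}

const first = evaluateDepthGuard(1, "req-1");    // blocked: depth_cap
const retry = evaluateDepthGuard(1, "req-1");    // blocked: idempotent_reentry
const topLevel = evaluateDepthGuard(0, "req-2"); // not blocked
```

Keeping the two reasons separate means a retry storm from one stuck sub-agent does not look like many distinct recursion attempts in telemetry.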

### The sneaky part

The guard maintains three separate in-memory maps: one for delegated contexts, one for blocked request IDs (for idempotency detection), and one for the active slot. All three are module-level singletons, reset only in tests. A process restart clears them, but within a running gateway any code path that throws or hangs before calling `releaseExpansionConcurrencySlot` would leave a ghost slot that blocks every later expansion from that origin session: a potential follow-up bug if sub-agents are long-lived.
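One common way to make slot release robust is to pair acquisition with a `finally` block. The sketch below is an assumption about how a caller could do that, not the project's actual code: `tryAcquireSlot`, `releaseSlot`, and `withExpansionSlot` are hypothetical names, and the map is a stand-in for `activeRequestIdByOriginSessionKey`.

```typescript
// Hypothetical stand-in for the active-slot map described above.
const activeRequestIdByOrigin = new Map<string, string>();

type SlotResult = { ok: true } | { ok: false; activeRequestId: string };

function tryAcquireSlot(originSessionKey: string, requestId: string): SlotResult {
  const active = activeRequestIdByOrigin.get(originSessionKey);
  if (active !== undefined && active !== requestId) {
    return { ok: false, activeRequestId: active };
  }
  activeRequestIdByOrigin.set(originSessionKey, requestId);
  return { ok: true };
}

function releaseSlot(originSessionKey: string, requestId: string): void {
  // Only the current holder releases, so a stale caller cannot free another request's slot.
  if (activeRequestIdByOrigin.get(originSessionKey) === requestId) {
    activeRequestIdByOrigin.delete(originSessionKey);
  }
}

async function withExpansionSlot<T>(
  originSessionKey: string,
  requestId: string,
  work: () => Promise<T>,
): Promise<T> {
  const slot = tryAcquireSlot(originSessionKey, requestId);
  if (!slot.ok) {
    throw new Error(`slot busy: activeRequestId=${slot.activeRequestId}`);
  }
  try {
    return await work();
  } finally {
    // Runs on success and on failure, so a throwing delegation cannot leak the slot.
    releaseSlot(originSessionKey, requestId);
  }
}
```

A `finally` block covers thrown errors but not work that never settles; a hung sub-agent would still need a timeout or lease expiry on top of this.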

Sources: [src/tools/lcm-expansion-recursion-guard.ts:61-63](src/tools/lcm-expansion-recursion-guard.ts), [src/tools/lcm-expansion-recursion-guard.ts:247-278](src/tools/lcm-expansion-recursion-guard.ts)

---

## Bug 2: AWS Bedrock's Invisible Content Rejection

**Version fixed:** v0.10.0 (PR #606)

### What happened

AWS Bedrock Converse API has strict opinions about message content: if `content` is an empty array `[]` for a `user` or `toolResult` message, it rejects the entire request with:

> `The content field in the Message object at messages.N is empty. Add a ContentBlock object to the content field and try again.`

The pre-existing empty-content filter in the assembler only checked the `assistant` role. User and toolResult messages with momentarily empty content arrays (possible during content transformation pipelines) sailed right through to Bedrock, which then rejected the whole request. The error surfaced as a provider-side API rejection, far from its cause in the assembly pipeline.

### The fix

A unified `isEmptyMessageContent` helper was added to `src/assembler.ts` that handles every role:

```typescript
// src/assembler.ts:151-170
export function isEmptyMessageContent(message: {
  role?: unknown;
  content?: unknown;
}): boolean {
  if (!message) return true;
  const content = message.content;
  if (content === undefined || content === null) return true;
  if (Array.isArray(content)) {
    if (content.length === 0) return true;          // ← the new universal guard
    if (message.role === "assistant") {
      if (isThinkingOnlyContent(content)) return true;
      if (isBlankContent(content)) return true;
    }
    return false;
  }
  if (typeof content === "string") {
    return content.trim() === "";
  }
  return false;
}
```

The comment in the source is unusually candid about the asymmetric gap: *"The pre-existing filter only protected the assistant role, leaving an asymmetric gap when an empty user/toolResult shape is momentarily produced upstream."*
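To see the role-independent rule in action, here is a small self-contained sketch. `hasNoContent` is a simplified stand-in written for this example (it omits the assistant-only thinking and blank-content checks), and applying it as a plain `filter` over the assembled list is an assumption about how the helper is used.

```typescript
type Message = { role: "user" | "assistant" | "toolResult"; content: unknown };

// Simplified stand-in for isEmptyMessageContent: role-independent checks only.
function hasNoContent(message: Message): boolean {
  const { content } = message;
  if (content === undefined || content === null) return true;
  if (Array.isArray(content)) return content.length === 0;
  if (typeof content === "string") return content.trim() === "";
  return false;
}

const assembled: Message[] = [
  { role: "user", content: [{ type: "text", text: "hello" }] },
  { role: "toolResult", content: [] }, // the shape Bedrock rejects
  { role: "user", content: "   " },    // blank string
  { role: "assistant", content: [{ type: "text", text: "hi" }] },
];

// Only the first and last messages survive the filter.
const sendable = assembled.filter((message) => !hasNoContent(message));
```

Before the fix, the equivalent check ran only for `assistant` messages, so the empty `toolResult` entry above would have reached the provider.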

### The sneaky part

Bedrock's rejection message uses the phrase *"Add a ContentBlock object"* — which means if you were watching logs, you'd see a provider error about a missing content block, with no indication that lossless-claw's assembly pipeline was the source. The bug was effectively anonymous in production traces.

Sources: [src/assembler.ts:129-170](src/assembler.ts), [CHANGELOG.md](CHANGELOG.md) (v0.10.0, PR #606)

---

## Bug 3: Discord Silently Ate the `/lossless` Command

**Version fixed:** v0.11.2 (PR #672)

### What happened

Discord enforces a hard limit on slash-command description length at registration: 1 to 100 characters for chat-input commands. The `/lossless` command was registered with a description longer than that.

According to the CHANGELOG, the over-long description was silently truncated during slash-command registration. No error or warning reached the plugin, and the command could appear in users' command menus with a clipped description, or not appear at all.

### The fix

The description was shortened to fit within Discord's limit. The current value in the source:

```typescript
// src/plugin/lcm-command.ts:2232-2233
description:
  "Lossless Claw health, backups, compaction, junk review, and doctor tools.",
```

This is short enough to pass Discord's validator without truncation. The command also exposes the native name `"lossless"` (mapping to `/lossless` in Discord) via `nativeNames.default`.
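A registration-time check along the following lines could catch an over-long description before it reaches Discord. The 100-character ceiling is Discord's documented limit for chat-input command descriptions; `checkCommandDescription` and `DISCORD_DESCRIPTION_MAX` are hypothetical names that do not exist in lossless-claw, and measuring with `.length` (UTF-16 code units) is an assumption about how the limit is counted.

```typescript
// Discord documents a 1-100 character description for chat-input commands.
const DISCORD_DESCRIPTION_MAX = 100;

// Hypothetical helper: reports whether a description fits the limit.
function checkCommandDescription(description: string): { ok: boolean; length: number } {
  const length = description.length; // assumption: UTF-16 code units
  return { ok: length >= 1 && length <= DISCORD_DESCRIPTION_MAX, length };
}

// The shortened description from the source passes the check.
const current = checkCommandDescription(
  "Lossless Claw health, backups, compaction, junk review, and doctor tools.",
);

// A deliberately over-long description fails it.
const tooLong = checkCommandDescription("x".repeat(DISCORD_DESCRIPTION_MAX + 1));
```

Running a check like this in a unit test turns a silent provider-side truncation into a loud local failure.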

### The sneaky part

There was no error, no log entry, no exception. The description was simply truncated during registration. The command would appear to register successfully on the OpenClaw side while being broken or missing on the Discord side. This category of bug, a provider-side silent failure with no feedback, is particularly nasty because the symptom (broken command) appears nowhere near the cause (description too long at registration time).

Sources: [src/plugin/lcm-command.ts:2225-2234](src/plugin/lcm-command.ts), [CHANGELOG.md](CHANGELOG.md) (v0.11.2, PR #672)

---

## Bug 4: The Stat-Fail Loop That Killed Compaction Forever

**Version fixed:** v0.11.0 (PR #685)

### What happened

This one has a chain-of-custody worthy of a true crime doc:

1. **PR #649** added a graceful fallback in `afterTurn`: when `stat(sessionFile)` fails, return `hasOverlap:true` to allow live persistence to continue. The expectation was that the `refreshAfterTurnBootstrapState` hook would then refresh the checkpoint on the next call.

2. **The bug**: that hook calls `refreshBootstrapState`, which also calls `stat(sessionFile)` — and also throws on failure. The catch block in the hook swallowed the error silently. So `conversation_bootstrap_state` remained `NULL`.

3. **The consequence**: every subsequent `afterTurn` re-entered the slow path with `reason="checkpoint-missing"`. Checkpoint-missing is explicitly excluded from `allowNoAnchorImport`. The conversation got **permanently stuck**.

4. **The stuck state**: once stuck, the assembler's safe-fallback returned `params.messages` verbatim — raw, uncompacted messages — because no DB anchor could be established. Compaction never ran again. The context window filled up. The host's emergency overflow truncation became the only safety net.

### The fix

When the stat-fail slow path is hit, a placeholder `conversation_bootstrap_state` row is now written directly via `summaryStore.upsertConversationBootstrapState` — bypassing `stat()` entirely — so the contract "permissive return ⟹ checkpoint exists" is restored. Subsequent turns recover from `offset=0` once the transcript becomes statable again, routing through the DB-anchor reconciliation path so already-persisted messages aren't replayed.
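The restored contract ("permissive return implies a checkpoint exists") can be illustrated with a heavily simplified sketch. Everything here is hypothetical: the `Map` stands in for the `conversation_bootstrap_state` table, `upsertBootstrapState` stands in for `summaryStore.upsertConversationBootstrapState`, the `placeholder` flag is invented for the example, and `stat` is injected so a failure can be simulated without touching the filesystem.

```typescript
type BootstrapState = { offset: number; placeholder: boolean };

// In-memory stand-in for the conversation_bootstrap_state table.
const bootstrapStates = new Map<string, BootstrapState>();

function upsertBootstrapState(conversationId: string, state: BootstrapState): void {
  bootstrapStates.set(conversationId, state);
}

// Returns the permissive result on stat failure, or undefined when stat
// succeeds and the normal path (not sketched here) takes over.
function afterTurnOnStatFailure(
  conversationId: string,
  stat: () => unknown, // throws when the session file cannot be stat'ed
): { hasOverlap: boolean } | undefined {
  try {
    stat();
    return undefined;
  } catch {
    // Write the placeholder without calling stat() again, so the
    // checkpoint exists by the time the permissive result is returned.
    upsertBootstrapState(conversationId, { offset: 0, placeholder: true });
    return { hasOverlap: true };
  }
}

const result = afterTurnOnStatFailure("conv-1", () => {
  throw new Error("ENOENT");
});
const stored = bootstrapStates.get("conv-1"); // placeholder at offset 0
```

The key property is that the failure branch never depends on a second `stat()` call, which is exactly where the original refresh hook fell over.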

The CHANGELOG entry for this fix is the most detailed of the five: three dense paragraphs walk through the causal chain, leaving little ambiguity about what went wrong.

Sources: [CHANGELOG.md](CHANGELOG.md) (v0.11.0, PR #685)

---

## Bug 5: Bootstrap Replay Floods

**Version fixed:** v0.10.0 (PR #640), tested in `test/bootstrap-flood-regression.test.ts`

### What happened

When a gateway restarts and reconnects to an existing conversation, LCM bootstraps by reading the session transcript and importing messages into its SQLite DB. The bug: if the stored checkpoint was stale (wrong `mtime`, wrong `size`, mismatched hash), the bootstrap `reconcileSessionTail` path would treat the entire transcript as "new" content to import.

On a long conversation with thousands of messages, this meant **thousands of duplicate rows** being injected into LCM's DB on every restart. Every row existed twice (or more). Compaction would process duplicates as real history. Expand queries would surface duplicate summaries. The DB bloated silently.

The two conditions that triggered this:
- The `maintain()` function rewrote the JSONL transcript but didn't update the bootstrap checkpoint (the PR #280 bug that the flood test covers).
- Any situation where checkpoint state diverged from actual file state (gateway crash, OS-level filesystem quirk, etc.).

### The fix

Two defenses were added:

**Checkpoint update**: `maintain()` now updates the bootstrap checkpoint after a successful transcript rewrite, so the next bootstrap sees the post-rewrite state as current and imports 0 messages.

**Import cap**: `reconcileSessionTail` now enforces a cap of `max(existingDbCount × 0.2, 50)` rows per bootstrap. If a reconcile would import more than the cap, it aborts with `reason: "reconcile import capped"` and imports 0 messages — protecting the DB even when the checkpoint is fully stale.
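As a worked example of the cap arithmetic, the sketch below computes `max(existingDbCount × 0.2, 50)` and applies it to a candidate import. The function and constant names are hypothetical, and rounding the 20% term down with `Math.floor` is an assumption; the source does not say how fractional caps are handled.

```typescript
const RECONCILE_CAP_FRACTION = 0.2; // 20% of rows already in the DB
const RECONCILE_CAP_FLOOR = 50;     // minimum cap for small conversations

function reconcileImportCap(existingDbCount: number): number {
  // Assumption: fractional caps are rounded down.
  return Math.max(Math.floor(existingDbCount * RECONCILE_CAP_FRACTION), RECONCILE_CAP_FLOOR);
}

function decideReconcile(
  existingDbCount: number,
  candidateCount: number,
): { imported: number; reason?: string } {
  const cap = reconcileImportCap(existingDbCount);
  if (candidateCount > cap) {
    // Abort the whole import rather than importing a partial, misleading slice.
    return { imported: 0, reason: "reconcile import capped" };
  }
  return { imported: candidateCount };
}

// Stale checkpoint on a 1000-row conversation: cap is 200, so a 1000-row replay is refused.
const flood = decideReconcile(1000, 1000);

// Small conversation: cap is max(20, 50) = 50, so a 30-row tail is imported.
const normalTail = decideReconcile(100, 30);
```

The floor of 50 keeps short conversations from being blocked by a tiny percentage-based cap, while the 20% term scales the allowance for long ones.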

The regression test covers both defenses separately and together:

```typescript
// test/bootstrap-flood-regression.test.ts:323-325
expect(
  boot2.reason,
  "should report import cap was hit",
).toBe("reconcile import capped");
```

In the combined case, a valid checkpoint update blocks the flood, a corrupted checkpoint trips the cap, and the two together provide defense in depth.

Sources: [test/bootstrap-flood-regression.test.ts:212-447](test/bootstrap-flood-regression.test.ts), [CHANGELOG.md](CHANGELOG.md) (v0.10.0, PR #640)

---

## Pattern: Why These Bugs Are Interesting

All five bugs share a structural property: **the failure was silent (or surfaced with no pointer to its origin) and the symptom was distant from the cause**.

| Bug | Silent failure | Distant symptom |
|---|---|---|
| Sub-agent deadlock | No deadlock error; sub-agents just hang | Stalled user turn |
| Bedrock content rejection | API error with no pointer to LCM pipeline | Provider-level failure |
| Discord command truncation | Registration "succeeds" on LCM side | Command missing in Discord UI |
| Stat-fail loop | Compaction never runs; no error logged | Context window overflow later |
| Bootstrap flood | Duplicate rows silently inserted | DB bloat, wrong summaries |

Each fix added either an explicit guard (the concurrency slot, the import cap, the unified empty-content filter) or restored a broken contract (the checkpoint-after-maintain, the placeholder checkpoint-on-stat-fail). None of the fixes was large (most are under 20 lines), but each required understanding a subtle invariant that existed only in the original author's head until the bug surfaced.

The test suite now covers four of the five with dedicated regression files, which makes it much harder for the same class of bug to re-enter silently.

Sources: [CHANGELOG.md](CHANGELOG.md) (v0.10.0–v0.11.2 entries), [test/bootstrap-flood-regression.test.ts](test/bootstrap-flood-regression.test.ts), [test/regression-2026-03-17.test.ts](test/regression-2026-03-17.test.ts)
