# Test Suite & Reproducibility Harness

> What tests/test_smoke.py covers (accept path, banned-token injection, unknown-behavior binding, protected-type addition, disallowed change kind, promote lifecycle) and how the harness/ scripts (run_corpus.py, run_adversarial.py, run_future_event.py, extractor_recall.py, rollback_precondition.py, compare.py, report.py, invariants.py) regenerate the paper result files in harness/results/. Explains the LLM-free invariant enforced by the harness (ANTHROPIC_API_KEY must be unset) and the SELFGRAPH_OBJECTTYPE_MATCH=literal vs. relaxed condition that produces corpus.literal.jsonl vs. corpus.relaxed.jsonl.

- Repository: yoheinakajima/activegraph-selfgraph
- GitHub: https://github.com/yoheinakajima/activegraph-selfgraph
- Human wiki: https://grok-wiki.com/public/wiki/yoheinakajima-activegraph-selfgraph-41747ef30393
- Complete Markdown: https://grok-wiki.com/public/wiki/yoheinakajima-activegraph-selfgraph-41747ef30393/llms-full.txt

## Source Files

- `tests/test_smoke.py`
- `tests/test_harness.py`
- `harness/reproduce.sh`
- `harness/run_corpus.py`
- `harness/invariants.py`
- `REPRODUCE.md`
- `harness/results/CANONICAL_SHAS.txt`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [tests/test_smoke.py](tests/test_smoke.py)
- [tests/test_harness.py](tests/test_harness.py)
- [harness/invariants.py](harness/invariants.py)
- [harness/run_corpus.py](harness/run_corpus.py)
- [harness/run_adversarial.py](harness/run_adversarial.py)
- [harness/rollback_precondition.py](harness/rollback_precondition.py)
- [harness/extractor_recall.py](harness/extractor_recall.py)
- [harness/compare.py](harness/compare.py)
- [harness/report.py](harness/report.py)
- [harness/reproduce.sh](harness/reproduce.sh)
- [REPRODUCE.md](REPRODUCE.md)
- [harness/results/CANONICAL_SHAS.txt](harness/results/CANONICAL_SHAS.txt)
</details>


This page covers the two-layer verification system in `activegraph-selfgraph`: a fast pytest suite (`tests/`) that validates the core propose→validate→sandbox lifecycle without any external dependencies, and a measurement harness (`harness/`) that regenerates the result files cited in the paper from a cold start in under one minute. Together they establish that selfgraph's self-modification pipeline is both correct and reproducible on any machine with Python 3.11 and no API key.

The central design constraint is **LLM-freedom**: every measurable number the paper cites comes from a deterministic, network-free pipeline. The harness enforces this as a hard invariant: it refuses to start if `ANTHROPIC_API_KEY` is set (unless the explicit `SELFGRAPH_HARNESS_ALLOW_LLM=1` override is also set). As a result, the canonical SHA-256 fingerprints recorded in `harness/results/CANONICAL_SHAS.txt` can be independently verified without a model subscription.

---

## tests/test_smoke.py — The Unit Test Layer

`tests/test_smoke.py` is the fast, self-contained correctness test for the core `selfgraph` modules. Every test builds a fresh in-memory `Graph` (via `_fresh()`), ingests one or two files with `ingest_paths`, and runs `extract_capabilities(use_llm=False)` to produce a deterministic capability graph before exercising a specific behavior.

Sources: [tests/test_smoke.py:26-46]()

### Covered test scenarios

| Test function | What it checks |
|---|---|
| `test_ingest_and_extract` | `ingest_paths` produces `File` + `Chunk` objects; `extract_capabilities` emits ≥6 `Capability` and ≥4 `AuthorityRule` nodes |
| `test_proposal_accepted` | A benign goal clears `validate_proposal` and `sandbox_apply` adds objects |
| `test_proposal_rejected_when_banned_token_injected` | Manually injecting `subprocess.Popen(['rm', '-rf', '/'])` into a proposal's change list triggers a `banned-token` violation |
| `test_proposal_rejected_for_unknown_behavior` | A `bind_behavior` change referencing a behavior that doesn't exist in the graph triggers `unknown-behavior` |
| `test_proposal_rejected_for_protected_type_add` | Adding `AuthorityRule` or `Capability` objects directly causes ≥2 `protected-type` violations |
| `test_proposal_rejected_for_unknown_change_kind` | `spawn_subprocess` + `add_policy` changes produce both `disallowed-kind` and `permission-escalation` violations |
| `test_sandbox_promote_changes_main_graph` | `sandbox_apply(promote=True)` materializes objects on the live graph and sets the proposal status to `"applied"` |
| `test_validate_proposal_mutate_status_false` | `validate_proposal(mutate_status=False)` returns a passing report without advancing a `"draft"` proposal's status |
| `test_sandbox_sqlite_fork_isolates_main_graph` | With a SQLite-backed `Runtime`, `sandbox_apply(promote=False)` uses the real `Runtime.fork` path (label starts with `sqlite-fork@`) and leaves the live graph byte-identical to before |
| `test_promote_lifecycle_requires_validated_status` | `sandbox_apply(promote=True)` on a still-`"draft"` proposal raises `ValueError` — the lifecycle gate cannot be bypassed |

Sources: [tests/test_smoke.py:53-207]()

### Proposal lifecycle enforced by the tests

```text
         propose_patch_for
               │
          status = "draft"
               │
      validate_proposal()
          ┌────┴────┐
       ok=True   ok=False
          │          │
   status="validated"  status="rejected"
          │
   sandbox_apply(promote=True)
          │
   status="applied"
```

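The state transitions are enforced by convention rather than by a dedicated state machine, but two tests pin them down: `test_promote_lifecycle_requires_validated_status` covers the forward path (promotion is refused unless the status is `"validated"`), and `test_validate_proposal_mutate_status_false` covers the non-mutating path (a dry-run validation leaves a `"draft"` proposal untouched). Sources: [tests/test_smoke.py:148-207](), [REPRODUCE.md:259-268]()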
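The lifecycle in the diagram can be illustrated with a small self-contained model. This is a sketch only: the `Proposal` class and the `validate` and `promote` functions below are hypothetical stand-ins, not the signatures of `validate_proposal` or `sandbox_apply` in the repository.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """Toy stand-in for a proposal; only the status field matters here."""
    changes: list = field(default_factory=list)
    status: str = "draft"


def validate(proposal: Proposal, ok: bool, mutate_status: bool = True) -> bool:
    """Record the guardrail verdict; leave status alone when mutate_status is False."""
    if mutate_status:
        proposal.status = "validated" if ok else "rejected"
    return ok


def promote(proposal: Proposal) -> None:
    """Refuse to apply anything that has not passed validation."""
    if proposal.status != "validated":
        raise ValueError(f"cannot promote a proposal in status {proposal.status!r}")
    proposal.status = "applied"


p = Proposal()
validate(p, ok=True, mutate_status=False)  # dry run: status stays "draft"
status_after_dry_run = p.status
try:
    promote(p)                             # still "draft", so this is refused
    promote_refused = False
except ValueError:
    promote_refused = True
validate(p, ok=True)                       # draft -> validated
promote(p)                                 # validated -> applied
```

The two tests named above correspond to the `promote_refused` branch and the `status_after_dry_run` check respectively.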

---

## tests/test_harness.py — Harness Building Blocks

`tests/test_harness.py` exercises the harness infrastructure itself without running the full corpus (which is slow by unit-test standards). Its tests verify that the helper functions used by `run_corpus.py` are deterministic and correct.

| Test | Verified property |
|---|---|
| `test_path_class_runtime_vs_selfgraph` | Paths inside the installed `activegraph` package directory (or `module://activegraph` pseudo-paths) map to `"runtime"`; everything else maps to `"selfgraph"` |
| `test_generate_goal_set_is_deterministic_and_sorted` | Two graphs built identically produce the same goal sequence, sorted by `(node_type, name, template_index)` |
| `test_generate_goal_set_records_provenance` | Every goal row carries `derived_from_node_id`, `derived_from_node_type`, and `derived_from_path_class` |
| `test_classify_change_matches_citation_taxonomy` | `classify_change` returns the exact four category strings: `grounded-in-extracted`, `built-in-scaffold`, `self-authored`, `domain-new` |
| `test_objecttype_match_flag_literal_excludes_runtime_object_types` | Under `SELFGRAPH_OBJECTTYPE_MATCH=literal` the extractor emits zero `ObjectType` nodes whose `source_file_path` lives in the activegraph package |
| `test_objecttype_match_flag_invalid_value_raises` | An unrecognized flag value (e.g. `"lenient"`) raises `ValueError` immediately — a typo cannot silently shift results |
| `test_relaxed_extractor_catches_runtime_object_types` | Under `relaxed` mode, at least one `ObjectType` node (including `"company"` from the diligence pack) is emitted with an activegraph package path |
| `test_run_goal_emits_expected_row_shape` | A full propose→validate→sandbox call via `run_goal` returns a row with all required keys and `live_graph_unchanged=True` |

Sources: [tests/test_harness.py:33-218]()

---

## harness/invariants.py — The LLM-Free Gate

Every harness entry point imports and calls `require_no_llm_env()` as its first action. The function checks whether `ANTHROPIC_API_KEY` is present in the environment and exits with code 64 if it is, unless `SELFGRAPH_HARNESS_ALLOW_LLM=1` is also set (which permits an LLM-augmented variant while printing a loud warning).

```python
# harness/invariants.py:19-52
def require_no_llm_env() -> None:
    if not os.environ.get("ANTHROPIC_API_KEY"):
        return
    if os.environ.get(_OVERRIDE_VAR) == "1":
        # warn and proceed — results will NOT match canonical shas
        return
    # ... structured error message + sys.exit(64)
```

The rationale: `selfgraph/extract.py` has an optional LLM augmentation pass gated on `ANTHROPIC_API_KEY`. If the key is set, the extractor produces an LLM-shaped graph, making the output non-deterministic and breaking both SHA reproducibility and the "no API key required" claim in `REPRODUCE.md`. The `*.meta.json` companion file each run writes carries `"llm_augment_active": false` as an audit stamp.

Sources: [harness/invariants.py:1-52](), [REPRODUCE.md:40-77]()
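The excerpt above elides the warning and error paths. A minimal runnable sketch of the same three-way decision might look like the following. The environment variable names and the exit code 64 come from the description above; the function names `llm_gate_decision` and `require_no_llm_env_sketch`, and the message wording, are assumptions made for illustration.

```python
import os
import sys
from typing import Mapping

OVERRIDE_VAR = "SELFGRAPH_HARNESS_ALLOW_LLM"  # name taken from the text above
EXIT_LLM_ENV = 64                             # exit code taken from the text above


def llm_gate_decision(env: Mapping[str, str]) -> str:
    """Return 'clean', 'override', or 'blocked' for a given environment."""
    if not env.get("ANTHROPIC_API_KEY"):
        return "clean"
    if env.get(OVERRIDE_VAR) == "1":
        return "override"
    return "blocked"


def require_no_llm_env_sketch(env: Mapping[str, str] = os.environ) -> None:
    """Hypothetical gate: warn on override, exit 64 when blocked."""
    decision = llm_gate_decision(env)
    if decision == "override":
        print("WARNING: LLM augmentation allowed; results will not match canonical SHAs",
              file=sys.stderr)
    elif decision == "blocked":
        print(f"ANTHROPIC_API_KEY is set; unset it or set {OVERRIDE_VAR}=1 to proceed",
              file=sys.stderr)
        sys.exit(EXIT_LLM_ENV)
```

Separating the pure decision from the side effects keeps the gate testable without touching the real process environment.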

---

## SELFGRAPH_OBJECTTYPE_MATCH — The A/B Control Variable

The extractor in `selfgraph/extract.py` recognizes `ObjectType` declarations in two modes, selected by the environment variable `SELFGRAPH_OBJECTTYPE_MATCH`:

| Value | Regex active | What it captures | Corpus output |
|---|---|---|---|
| `literal` | `add_object("Cap", ...)` capitalized literal only | Selfgraph-repo ObjectTypes only | `corpus.literal.jsonl` (BEFORE) |
| `relaxed` (default) | Above + `ObjectType(name="...", ...)` constructor calls | Also activegraph runtime pack ObjectTypes | `corpus.relaxed.jsonl` (AFTER) |

Any other value raises `ValueError` immediately — a typo cannot silently produce a shifted result. Sources: [REPRODUCE.md:124-160](), [tests/test_harness.py:111-162]()
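To make the two modes concrete, here is a rough sketch of a mode-switched matcher with strict flag validation. The two regular expressions are hypothetical approximations of the forms named in the table; they are not the patterns used by `selfgraph/extract.py`, and `object_type_names` is an illustrative name.

```python
import re

# Hypothetical patterns that mirror the two forms named in the table.
LITERAL_RE = re.compile(r'add_object\(\s*"([A-Z]\w*)"')
CONSTRUCTOR_RE = re.compile(r'ObjectType\(\s*name\s*=\s*"(\w+)"')


def object_type_names(source: str, mode: str = "relaxed") -> list[str]:
    """Return ObjectType names found in source under the given match mode."""
    if mode not in ("literal", "relaxed"):
        # Fail fast so a typo cannot silently select a different condition.
        raise ValueError(
            f"SELFGRAPH_OBJECTTYPE_MATCH must be 'literal' or 'relaxed', got {mode!r}"
        )
    names = set(LITERAL_RE.findall(source))
    if mode == "relaxed":
        names |= set(CONSTRUCTOR_RE.findall(source))
    return sorted(names)  # sort the set so output order is stable


sample = '''
graph.add_object("Proposal", {"goal": "g"})
ObjectType(name="company", fields=[])
'''
literal_names = object_type_names(sample, "literal")
relaxed_names = object_type_names(sample, "relaxed")
```

In this toy input the `relaxed` mode picks up the constructor-declared `company` type that `literal` mode skips, which is the same kind of difference that separates the two corpus files.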

The effect on corpus size is significant: the BEFORE condition produces **45 goals** (0 runtime-derived), while the AFTER condition produces **72 goals** (27 runtime-derived + 45 selfgraph-derived). The A/B cleanliness invariant — verified by `harness/compare.py` at the end of every cold run — requires that the selfgraph-derived grounding row be byte-identical across both conditions (27/45 in both). This single-variable guarantee is what makes the runtime-derived 18/27 (66.7%) grounding finding a causal result rather than a confound. Sources: [REPRODUCE.md:141-162]()

---

## harness/run_corpus.py — The Main Measurement Pipeline

`run_corpus.py` is the backbone of the reproducibility harness. It runs the full propose→validate→sandbox measurement loop and streams results to a JSONL file.

### Pipeline steps

```text
build_graph()
  ├── ingest_paths(["selfgraph", "README.md", "demo.py"])
  ├── ingest_module_docs("activegraph", max_submodules=40)
  ├── ingest_paths([activegraph/packs], max_bytes=400_000)
  └── extract_capabilities(use_llm=False)
         │
generate_goal_set(graph)
  └── every Capability + ObjectType × ("monitor {name}", "track {name}", "configure {name}")
      sorted by (node_type, node_name, template_index)
         │
for each goal → run_goal(graph, runtime, goal_row)
  ├── propose_patch_for(graph, goal)
  ├── classify_change() for each change  [origin taxonomy]
  ├── validate_proposal(graph, pid)      [guardrail report]
  └── sandbox_apply(graph, pid, runtime, promote=False)
      ├── assert fork_label starts with "sqlite-fork@"
      └── assert live_graph_unchanged
         │
write corpus.jsonl + run.meta.json
assert fork_violations == [] and isolation_violations == []
```

Sources: [harness/run_corpus.py:1-359]()
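The goal-generation step of the pipeline can be sketched as a cross product of nodes and templates followed by a stable sort. The three templates and the sort key come from the diagram above; the plain-dict node representation and the function name `generate_goals` are simplifications assumed for this example, not the real `generate_goal_set` interface.

```python
TEMPLATES = ("monitor {name}", "track {name}", "configure {name}")


def generate_goals(nodes: list[dict]) -> list[dict]:
    """nodes: dicts with 'id', 'type', 'name' (a simplified stand-in for graph nodes)."""
    rows = []
    for node in nodes:
        for index, template in enumerate(TEMPLATES):
            rows.append({
                "goal": template.format(name=node["name"]),
                "derived_from_node_id": node["id"],
                "derived_from_node_type": node["type"],
                "_sort_key": (node["type"], node["name"], index),
            })
    # Sorting by (node_type, node_name, template_index) removes any
    # dependence on the order in which nodes were discovered.
    rows.sort(key=lambda row: row["_sort_key"])
    for row in rows:
        del row["_sort_key"]
    return rows


nodes = [
    {"id": "Object#2", "type": "ObjectType", "name": "Proposal"},
    {"id": "Object#1", "type": "Capability", "name": "ingest"},
]
goals = generate_goals(nodes)
```

Because each node contributes exactly three goals in this sketch, goal counts are always a multiple of three, which is consistent with the 45 and 72 goal totals reported for the two conditions.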

### Per-row JSONL schema

Each row records:

- **Goal provenance**: `goal`, `derived_from_node_id`, `derived_from_node_type`, `derived_from_node_name`, `derived_from_source`, `derived_from_path_class`
- **Proposal details**: `proposal_id`, `used_fallback_scaffold`, `n_changes`
- **Origin taxonomy**: `origin_counts` (`grounded-in-extracted`, `built-in-scaffold`, `self-authored`, `domain-new`) and `per_change` list
- **Grounding edges**: `patch_modifies` (list of `{target_id, target_name, target_source, target_path_class}`) and `n_patch_modifies`
- **Guardrail**: `guardrail.ok`, `guardrail.n_violations`, `guardrail.violation_kinds`
- **Sandbox**: `sandbox.fork_label`, `sandbox.fork_path`, `sandbox.n_added_objects`, `sandbox.n_added_relations`, `sandbox.live_graph_unchanged`

Sources: [harness/run_corpus.py:145-255]()
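When consuming the JSONL output, a light structural check per row can catch truncated or hand-edited files early. The sketch below assumes the dotted names in the list above (for example `sandbox.fork_label`) denote nested objects, which the source does not state explicitly; `check_row` and all values in `sample_row` are made up for illustration.

```python
import json

# Top-level keys named in the schema list above.
REQUIRED_KEYS = {
    "goal", "derived_from_node_id", "derived_from_node_type",
    "derived_from_path_class", "proposal_id", "n_changes",
    "origin_counts", "patch_modifies", "n_patch_modifies",
    "guardrail", "sandbox",
}
REQUIRED_SANDBOX_KEYS = {"fork_label", "live_graph_unchanged"}


def check_row(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty means it looks sane)."""
    row = json.loads(line)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(row))]
    sandbox = row.get("sandbox", {})
    problems += [f"missing sandbox key: {k}"
                 for k in sorted(REQUIRED_SANDBOX_KEYS - set(sandbox))]
    if not str(sandbox.get("fork_label", "")).startswith("sqlite-fork@"):
        problems.append("fork_label does not start with sqlite-fork@")
    if sandbox.get("live_graph_unchanged") is not True:
        problems.append("live graph changed")
    if row.get("n_patch_modifies") != len(row.get("patch_modifies", [])):
        problems.append("n_patch_modifies disagrees with patch_modifies")
    return problems


# Illustrative row; every value here is invented for the example.
sample_row = {
    "goal": "monitor ingest",
    "derived_from_node_id": "Object#1",
    "derived_from_node_type": "Capability",
    "derived_from_path_class": "selfgraph",
    "proposal_id": "Object#9",
    "n_changes": 2,
    "origin_counts": {"grounded-in-extracted": 1, "self-authored": 1},
    "patch_modifies": [{"target_id": "Object#1", "target_name": "ingest"}],
    "n_patch_modifies": 1,
    "guardrail": {"ok": True, "n_violations": 0, "violation_kinds": []},
    "sandbox": {"fork_label": "sqlite-fork@example", "live_graph_unchanged": True},
}
problems = check_row(json.dumps(sample_row))
```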

### Determinism measures

Two sources of cross-machine non-determinism were eliminated (documented in `REPRODUCE.md`):

1. Wall-clock fields (`t_start`/`t_end`) were removed from rows.
2. `os.walk` in `ingest_paths` and `pkgutil.walk_packages` in `ingest_module_docs` now sort output before processing, pinning `Object#N` IDs across machines. A `set()` of regex captures in `extract.py` is also sorted.

Sources: [REPRODUCE.md:108-121]()
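The directory-walk fix relies on a documented property of `os.walk`: in top-down mode, sorting `dirnames` in place controls the order in which subdirectories are visited. A minimal sketch of that technique follows; `walk_sorted` is an illustrative helper, not the code in `ingest_paths`.

```python
import os
import tempfile


def walk_sorted(root: str) -> list[str]:
    """Return file paths under root in an order that does not depend on the filesystem."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends
        for name in sorted(filenames):
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return found


with tempfile.TemporaryDirectory() as tmp:
    # Create files in a deliberately unsorted order.
    for rel in ("b/z.py", "a/y.py", "a/x.py", "c.py"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    listing = [p.replace(os.sep, "/") for p in walk_sorted(tmp)]
```

If object IDs are assigned in visit order, as the `Object#N` naming suggests, a stable visit order is what keeps those IDs identical across machines.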

---

## harness/run_adversarial.py — Guardrail Safety Slice

`run_adversarial.py` mechanically generates unsafe proposals across every violation class and verifies that `validate_proposal` catches them. It does not hand-write adversarial payloads — it enumerates them from the validator's own constants (`_BANNED_TOKENS`, `_PROTECTED_TYPES`, `ALLOWED_KINDS`).

### Generators

| Generator | Method | Violation class |
|---|---|---|
| `gen_banned_token_attempts` | One attempt per token in `guardrails._BANNED_TOKENS`; injects the token into an `add_object` data blob | `banned-token` |
| `gen_unknown_behavior_attempts` | 5 attempts binding synthetic behavior names not present in any ingested source | `unknown-behavior` |
| `gen_protected_type_attempts` | One attempt per protected type (`AuthorityRule`, `Capability`) | `protected-type` |
| `gen_disallowed_kind_attempts` | One attempt using `spawn_subprocess` change kind | `disallowed-kind` |
| `gen_permission_escalation_attempts` | One attempt adding a policy with `can_approve` | `permission-escalation` |

After running all attempts, the script cross-checks `corpus.relaxed.jsonl` for false positives (benign proposals that were incorrectly rejected). The output is a confusion-style table: `n_attempts`, `n_caught`, and `gap` per class. Any under-catch is a **result** recorded in the paper, not a bug silently patched. Sources: [harness/run_adversarial.py:1-315]()
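The enumerate-then-tally approach can be sketched with a toy validator. The constants and `toy_validate` below are hypothetical stand-ins for the validator's real constants and for `validate_proposal`; the toy validator deliberately misses one token so that the table shows a non-zero gap.

```python
from collections import Counter

# Hypothetical constants standing in for the validator's own lists.
BANNED_TOKENS = ("subprocess.Popen", "os.system", "eval(")
PROTECTED_TYPES = ("AuthorityRule", "Capability")


def toy_validate(change: dict) -> set[str]:
    """Return the violation kinds a toy validator reports for one change."""
    kinds = set()
    if change["kind"] == "add_object":
        if change["type"] in PROTECTED_TYPES:
            kinds.add("protected-type")
        # Deliberately checks only the first two tokens to demonstrate a gap.
        if any(tok in str(change.get("data", "")) for tok in BANNED_TOKENS[:2]):
            kinds.add("banned-token")
    return kinds


# Attempts are generated from the constants rather than written by hand.
attempts = [("banned-token", {"kind": "add_object", "type": "Note", "data": f"x = {tok}"})
            for tok in BANNED_TOKENS]
attempts += [("protected-type", {"kind": "add_object", "type": t}) for t in PROTECTED_TYPES]

n_attempts, n_caught = Counter(), Counter()
for expected, change in attempts:
    n_attempts[expected] += 1
    n_caught[expected] += expected in toy_validate(change)

table = {k: {"n_attempts": n_attempts[k], "n_caught": n_caught[k],
             "gap": n_attempts[k] - n_caught[k]} for k in n_attempts}
```

Generating attempts from the same constants the validator uses means that adding a new banned token automatically adds a new adversarial case.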

---

## harness/rollback_precondition.py — Promote + Replay Verification

This script measures the precondition for the paper's rollback claim: that every self-modification is a real logged event on the same event log that replay reconstructs.

It reproduces the full `corpus.relaxed` pipeline (same ingest, extract, and `generate_goal_set`), then promotes every guardrail-validated proposal inside an isolated `Runtime.fork`. On the reference machine this covers **n=72 promotions** including all 9 `bind_behavior` proposals. Each fork shares the SQLite file with the main pipeline graph but operates under a distinct `run_id`, so it neither contaminates other trials nor mutates the main pipeline.

Per trial it records:

- `n_promote_events`: number of `actor="promote"` events appended (expected: `n_changes + 1`)
- `all_changes_logged`: whether promote produced enough logged events for every allowed-kind change
- `replay_byte_identical`: whether opening a fresh `SQLiteEventStore` for the fork's `run_id` and replaying events up to (but not including) the first promote-actor event reconstructs a snapshot byte-identical to the pre-promote snapshot

Reference machine result: 72/72 `all_changes_logged`, 72/72 `replay_byte_identical`. Sources: [harness/rollback_precondition.py:1-71](), [REPRODUCE.md:164-205]()
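The replay check can be illustrated with an in-memory event log. This sketch is an assumption-laden toy model: the event shape, the `replay` fold, and the idea that the extra promote event is a status update are all invented here, and nothing below uses `SQLiteEventStore` or `Runtime.fork`.

```python
import json


def replay(events: list[dict]) -> dict:
    """Fold add_object events into a snapshot {object_id: data}."""
    snapshot = {}
    for event in events:
        if event["op"] == "add_object":
            snapshot[event["id"]] = event["data"]
    return snapshot


def snapshot_bytes(snapshot: dict) -> bytes:
    """Canonical serialization so snapshots can be compared byte for byte."""
    return json.dumps(snapshot, sort_keys=True).encode("utf-8")


def replay_before_first_promote(events: list[dict]) -> dict:
    """Replay events up to, but not including, the first promote-actor event."""
    cut = next((i for i, e in enumerate(events) if e["actor"] == "promote"), len(events))
    return replay(events[:cut])


log = [
    {"actor": "ingest", "op": "add_object", "id": "Object#1", "data": {"type": "File"}},
    {"actor": "extract", "op": "add_object", "id": "Object#2", "data": {"type": "Capability"}},
]
pre_promote = snapshot_bytes(replay(log))

# One event per change plus one further event (assumed here to be a status update).
log.append({"actor": "promote", "op": "add_object", "id": "Object#3", "data": {"type": "Monitor"}})
log.append({"actor": "promote", "op": "set_status", "id": "Proposal#1", "data": "applied"})

replay_byte_identical = snapshot_bytes(replay_before_first_promote(log)) == pre_promote
n_promote_events = sum(e["actor"] == "promote" for e in log)
```

If truncating the log at the first promote event reproduces the earlier snapshot exactly, then the promotion was fully captured as appended events, which is the precondition for rolling it back.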

---

## harness/extractor_recall.py — Discovery Recall Measurement

`extractor_recall.py` quantifies the extraction fidelity bottleneck described in §7 of the paper. It uses Python's `ast` module (independent of the extractor's regex) as the ground-truth denominator.

**Behavior denominator**: every function in the activegraph package decorated with `@behavior`, `@llm_behavior`, or `@relation_behavior` (top-level name or attribute tail, with or without a call).

**ObjectType denominator**: union of `add_object("<X>", ...)` first-positional string literals and `ObjectType(name="<X>", ...)` keyword string literals.

The extractor is run twice — once under `SELFGRAPH_OBJECTTYPE_MATCH=literal`, once under `=relaxed` — and recall, missed names, and any runtime-derived false positives (e.g. `hello` from a code-template literal in `activegraph/packs/scaffold.py`) are reported. Output goes to `harness/results/extractor_recall.json`. Sources: [harness/extractor_recall.py:1-80]()
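The behavior denominator described above can be approximated with a few lines of `ast` code. This is a sketch of the general technique rather than the script's implementation: `decorator_name`, `behavior_functions`, and `recall` are illustrative names, and the sample source is invented.

```python
import ast
from typing import Optional

BEHAVIOR_DECORATORS = {"behavior", "llm_behavior", "relation_behavior"}


def decorator_name(node: ast.expr) -> Optional[str]:
    """Resolve @name, @pkg.name, @name(...), and @pkg.name(...) to 'name'."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def behavior_functions(source: str) -> set[str]:
    """Names of functions carrying one of the behavior decorators."""
    tree = ast.parse(source)
    return {
        fn.name
        for fn in ast.walk(tree)
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef))
        and any(decorator_name(d) in BEHAVIOR_DECORATORS for d in fn.decorator_list)
    }


def recall(extracted: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth names that the extractor found."""
    return len(extracted & ground_truth) / len(ground_truth) if ground_truth else 1.0


sample = '''
@behavior
def a(): pass

@ag.llm_behavior(model="m")
def b(): pass

@staticmethod
def c(): pass
'''
truth = behavior_functions(sample)
```

Parsing with `ast` never executes the source, so undefined names such as `ag` in the sample are harmless, and the denominator stays independent of the regex-based extractor it is measuring.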

---

## harness/compare.py — A/B Cleanliness Enforcement

`compare.py` reads the BEFORE (`corpus.literal.jsonl`) and AFTER (`corpus.relaxed.jsonl`) files and prints a side-by-side table covering: goal set size by path class, grounding rate overall and split, fallback-scaffold rate, origin mix, and sandbox regression metrics.

The critical output is the **A/B cleanliness invariant** check at the end:

```python
# harness/compare.py:163-173
bs = b["by_class_grounded"].get("selfgraph", (0, 0))
as_ = a["by_class_grounded"].get("selfgraph", (0, 0))
if bs == as_:
    print("→ identical: single-variable A/B holds")
    return 0
# else: MISMATCH — exits non-zero
```

If the selfgraph-derived grounding row differs between conditions, `reproduce.sh` exits non-zero, blocking the result. Sources: [harness/compare.py:156-173]()
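For context, the `by_class_grounded` mapping compared in that excerpt could be built from corpus rows roughly as follows. The sketch assumes a goal counts as grounded when it has at least one `patch_modifies` edge, which is an interpretation rather than something the source states; the row counts reuse the figures quoted earlier on this page, and `make_rows` is a hypothetical helper.

```python
from collections import defaultdict


def by_class_grounded(rows: list[dict]) -> dict[str, tuple[int, int]]:
    """Map path class -> (grounded goals, total goals)."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = counts[row["derived_from_path_class"]]
        bucket[0] += row["n_patch_modifies"] > 0  # assumed definition of "grounded"
        bucket[1] += 1
    return {cls: (grounded, total) for cls, (grounded, total) in counts.items()}


def make_rows(path_class: str, grounded: int, total: int) -> list[dict]:
    """Hypothetical helper that fabricates rows with a known grounding split."""
    return [{"derived_from_path_class": path_class, "n_patch_modifies": int(i < grounded)}
            for i in range(total)]


before = make_rows("selfgraph", 27, 45)
after = make_rows("selfgraph", 27, 45) + make_rows("runtime", 18, 27)

b, a = by_class_grounded(before), by_class_grounded(after)
ab_holds = b.get("selfgraph", (0, 0)) == a.get("selfgraph", (0, 0))
runtime_rate = round(100 * a["runtime"][0] / a["runtime"][1], 1)
```

With these inputs the selfgraph row is `(27, 45)` in both conditions, so the invariant holds, and the runtime grounding rate works out to 66.7%.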

---

## harness/report.py — Corpus Aggregate Summary

`report.py` reads a corpus JSONL file (defaulting to `corpus.relaxed.jsonl`) and prints a flat summary covering: corpus shape, grounding rate by path class, fallback-scaffold rate, origin mix, guardrail outcomes with violation kind counts, and sandbox isolation metrics. It is invoked standalone after a run to inspect individual conditions:

```bash
PYTHONPATH=. python -m harness.report
PYTHONPATH=. python -m harness.report harness/results/corpus.literal.jsonl
```

Sources: [harness/report.py:1-143]()

---

## harness/reproduce.sh — Cold-Start Entry Point

`reproduce.sh` is the single command that regenerates all six result files from scratch and verifies them against `CANONICAL_SHAS.txt`. It runs in six numbered steps:

```bash
bash harness/reproduce.sh
```

| Step | Script | Env var | Output |
|---|---|---|---|
| 1 | `harness.run_corpus` | `SELFGRAPH_OBJECTTYPE_MATCH=literal` | `corpus.literal.jsonl` |
| 2 | `harness.run_corpus` | `SELFGRAPH_OBJECTTYPE_MATCH=relaxed` | `corpus.relaxed.jsonl` |
| 3 | `harness.run_adversarial` | relaxed | `adversarial.jsonl` |
| 4 | `harness.rollback_precondition` | relaxed | `rollback.jsonl` |
| 5 | `harness.run_future_event` | relaxed | `future_event.jsonl` |
| 6 | `harness.extractor_recall` | (runs both modes internally) | `extractor_recall.json` |

After the six steps, the script runs `harness.compare` (enforcing the A/B invariant) and then checks each regenerated file's `sha256sum | head -c 16` against the values sourced from `CANONICAL_SHAS.txt`. If any SHA mismatches or the A/B invariant fails, it exits non-zero. Sources: [harness/reproduce.sh:1-139]()

### Canonical SHA fingerprints

| File | sha256[:16] | Condition |
|---|---|---|
| `corpus.literal.jsonl` | `57a86e94ba5e211d` | `SELFGRAPH_OBJECTTYPE_MATCH=literal` |
| `corpus.relaxed.jsonl` | `3277086cf459e945` | `SELFGRAPH_OBJECTTYPE_MATCH=relaxed` |
| `adversarial.jsonl` | `09b408bd369dc89d` | relaxed |
| `rollback.jsonl` | `ff7353d410ea7379` | relaxed |
| `future_event.jsonl` | `8418183932468a18` | relaxed |
| `extractor_recall.json` | `82a971df7a9ad03c` | both modes inside |

Sources: [harness/results/CANONICAL_SHAS.txt:1-37]()
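The same 16-character prefix check can be reproduced in Python on systems without `sha256sum`. This is an illustrative sketch: `sha256_prefix` and `verify` are assumed names, and the demo hashes a throwaway file rather than the real result files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_prefix(path: Path, n_hex: int = 16) -> str:
    """First n_hex hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()[:n_hex]


def verify(results_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return names of files whose prefix does not match (empty list means success)."""
    return [name for name, prefix in expected.items()
            if sha256_prefix(results_dir / name) != prefix]


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp)
    payload = b'{"goal": "monitor ingest"}\n'
    (results / "demo.jsonl").write_bytes(payload)
    expected = {"demo.jsonl": hashlib.sha256(payload).hexdigest()[:16]}

    mismatches_clean = verify(results, expected)
    (results / "demo.jsonl").write_bytes(payload + b"\n")  # simulate drift
    mismatches_drifted = verify(results, expected)
```

To check a real run, `expected` would be filled from the table above and `results_dir` pointed at `harness/results/`.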

---

## Quick-start reference

```bash
# Run the pytest suite: smoke tests and harness tests (fast, no API key needed)
python -m pytest tests/

# Full cold reproduction — all six result files, SHA check, A/B check
# (must have ANTHROPIC_API_KEY unset)
pip install -r requirements.txt
bash harness/reproduce.sh

# Inspect a single condition's aggregate
PYTHONPATH=. python -m harness.report harness/results/corpus.relaxed.jsonl

# Side-by-side A/B table without running reproduce
PYTHONPATH=. python -m harness.compare \
    harness/results/corpus.literal.jsonl \
    harness/results/corpus.relaxed.jsonl

# Deliberately run an LLM-augmented variant (shas will diverge)
SELFGRAPH_HARNESS_ALLOW_LLM=1 ANTHROPIC_API_KEY=sk-... \
    python -m harness.run_corpus
```

The entire measured loop — `propose`, `guardrails`, `sandbox`, `classify_change`, and all harness scripts — makes zero model calls. The only file that touches a model is `selfgraph/extract.py`, and only when `ANTHROPIC_API_KEY` is set; the harness blocks that path at startup to ensure the canonical SHAs are never silently shaped by an LLM. Sources: [REPRODUCE.md:53-77](), [harness/invariants.py:19-52]()
