# What Gets Put In The Box

> How Graphify decides which files count, skips sensitive or noisy inputs, converts Office and Google Workspace files, handles transcripts, and caches work for later runs.

- Repository: safishamsi/graphify
- GitHub: https://github.com/safishamsi/graphify
- Human wiki: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d
- Complete Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/llms-full.txt

## Source Files

- `graphify/detect.py`
- `graphify/ingest.py`
- `graphify/google_workspace.py`
- `graphify/transcribe.py`
- `graphify/cache.py`
- `tests/test_detect.py`
- `tests/test_google_workspace.py`
- `tests/test_transcribe.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/detect.py](graphify/detect.py)
- [graphify/ingest.py](graphify/ingest.py)
- [graphify/google_workspace.py](graphify/google_workspace.py)
- [graphify/transcribe.py](graphify/transcribe.py)
- [graphify/cache.py](graphify/cache.py)
- [graphify/__main__.py](graphify/__main__.py)
- [pyproject.toml](pyproject.toml)
- [README.md](README.md)
- [tests/test_detect.py](tests/test_detect.py)
- [tests/test_google_workspace.py](tests/test_google_workspace.py)
- [tests/test_transcribe.py](tests/test_transcribe.py)
- [tests/test_cache.py](tests/test_cache.py)
</details>

Graphify's "box" is the set of files it will turn into graph input. The box is not just "everything under this folder." Graphify first walks the tree, avoids common junk, skips likely secrets, converts some files into readable Markdown sidecars, and remembers previous work so later runs can focus on what changed.

No generated wiki context, `STRATEGY.md`, or `docs/solutions/**` files were present in this checkout. The Compound Engineering profile was used only as page-shape guidance; the implementation claims below come from repository code and tests.

## The Short Mental Model

Think of Graphify as packing a moving box:

```text
folder on disk
  -> walk files, but skip trash piles and secret drawers
  -> classify remaining files by type
  -> convert Office / Google shortcuts when possible
  -> keep videos as media inputs and transcripts as text outputs
  -> hash and cache extracted results for next time
```

The important boundary is that file selection is provider-neutral. Detection, skipping, conversion, transcript caching, and manifest comparison happen locally. Later semantic extraction can use different configured backends, but the "what counts as input" layer is not tied to one model provider.

Sources: [graphify/detect.py:862-1005](), [graphify/__main__.py:2978-3017](), [README.md:357-362]()

## File Types Graphify Recognizes

Graphify classifies files into five buckets: `code`, `document`, `paper`, `image`, and `video`. The extension sets are declared near the top of `detect.py`, and `classify_file()` applies them in a fixed order.

| Bucket | Examples | Notes |
|---|---|---|
| `code` | `.py`, `.ts`, `.tsx`, `.js`, `.go`, `.rs`, `.java`, `.sql`, `.json`, shell files | Extensionless scripts can count as code when a supported shebang is found. |
| `document` | `.md`, `.mdx`, `.txt`, `.rst`, `.html`, `.yaml`, `.yml`, `.docx`, `.xlsx`, `.gdoc`, `.gsheet`, `.gslides` | Office and Google Workspace files may be converted before extraction. |
| `paper` | `.pdf`, or Markdown/text that looks academic | Markdown/text becomes `paper` only after enough paper-like signals are found. |
| `image` | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.svg` | Images are semantic extraction inputs. |
| `video` | `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`, `.m4v`, `.mp3`, `.wav`, `.m4a`, `.ogg` | Covers audio as well as video. Media files are counted as files, but their bytes are not counted as readable words during detection. |

A small but important edge case: PDFs inside Xcode asset catalogs are skipped (the classifier returns `None` for them), because those PDFs are usually vector assets rather than papers.

```python
# graphify/detect.py
if ext in PAPER_EXTENSIONS:
    if any(part.endswith(tuple(_ASSET_DIR_MARKERS)) for part in path.parts):
        return None
    return FileType.PAPER
```
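
The overall classification flow can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `detect.py`: the function name `classify_sketch`, the trimmed extension sets, the check order, the shebang hints, and the asset-directory suffixes are all hypothetical, and the paper-signal heuristic for Markdown/text is left out.

```python
from pathlib import Path
from typing import Optional

# Trimmed, illustrative extension sets (assumed; the real sets are larger).
CODE_EXTS = {".py", ".ts", ".tsx", ".js", ".go", ".rs", ".java", ".sql", ".json", ".sh"}
DOC_EXTS = {".md", ".mdx", ".txt", ".rst", ".html", ".yaml", ".yml",
            ".docx", ".xlsx", ".gdoc", ".gsheet", ".gslides"}
PAPER_EXTS = {".pdf"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}
MEDIA_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi", ".m4v",
              ".mp3", ".wav", ".m4a", ".ogg"}
ASSET_DIR_SUFFIXES = (".xcassets", ".imageset")   # assumed Xcode markers
SHEBANG_HINTS = ("python", "bash", "sh", "node")  # assumed interpreter names


def classify_sketch(path: Path, first_line: str = "") -> Optional[str]:
    """Return a bucket name, or None when the file should not be classified."""
    ext = path.suffix.lower()
    if ext in CODE_EXTS:
        return "code"
    if ext in PAPER_EXTS:
        # PDFs inside asset catalogs are vector assets, not papers.
        if any(part.endswith(ASSET_DIR_SUFFIXES) for part in path.parts):
            return None
        return "paper"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in DOC_EXTS:
        return "document"
    if ext in MEDIA_EXTS:
        return "video"
    # Extensionless scripts count as code only with a recognized shebang.
    if not ext and first_line.startswith("#!"):
        if any(hint in first_line for hint in SHEBANG_HINTS):
            return "code"
    return None
```

For example, `classify_sketch(Path("bin/run"), "#!/usr/bin/env python3")` lands in the `code` bucket, while an extensionless file without a shebang is left out.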

Sources: [graphify/detect.py:18-33](), [graphify/detect.py:289-316](), [tests/test_detect.py:6-33](), [tests/test_detect.py:301-315]()

## Walking The Tree Without Eating The Build Folder

`detect()` starts from a scan root and builds a `files` dictionary with all five buckets. It uses `os.walk()`, prunes known-noise directories before descending, skips lockfiles, then classifies each remaining file.

The always-skipped directory list includes dependency folders, virtual environments, language build outputs, framework caches, coverage reports, visual regression bundles, Storybook builds, Graphify's own output, and `.worktrees`. This prevents generated output from becoming false architecture input.

Graphify does not blindly skip every dot directory. Tests show `.github/` is allowed, while `.next/` and `.graphify/` are still skipped. That means useful hidden project configuration can enter the box, but framework caches and Graphify's own cache do not.
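
The pruning idea can be illustrated with a short sketch. It assumes a small, hypothetical skip list (`SKIP_DIRS`) and lockfile set; the real lists in `detect.py` are longer, and `walk_inputs` is a name invented for this example.

```python
import os
from pathlib import Path

# Illustrative subsets only (assumed); the real lists cover many more names.
SKIP_DIRS = {"node_modules", ".venv", "dist", "build", ".next",
             ".graphify", "graphify-out", ".worktrees"}
LOCKFILES = {"package-lock.json", "Cargo.lock", "poetry.lock"}


def walk_inputs(root: Path) -> list[Path]:
    """Collect candidate files, pruning noise directories before descending."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] edits the list in place, so os.walk
        # never enters the removed directories at all.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if name not in LOCKFILES:
                found.append(Path(dirpath) / name)
    return found
```

Because the skip list names specific directories instead of matching every dot-prefixed name, a folder like `.github/` passes through while `.next/` does not.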

Sources: [graphify/detect.py:537-575](), [graphify/detect.py:896-925](), [tests/test_detect.py:372-459](), [tests/test_detect.py:647-657]()

## Ignore, Include, And Exclude Rules

Graphify supports project-level filtering with `.graphifyignore`. If `.graphifyignore` is absent, it falls back to `.gitignore`; if both exist, `.graphifyignore` wins. Patterns are loaded from the nearest VCS root down to the scan root, so subdirectory scans inside a repo still inherit repo-level ignore rules.

The matching behavior follows important gitignore ideas:

| Rule | Behavior |
|---|---|
| Last match wins | Later patterns override earlier ones. |
| `!` negation | A negated pattern can re-include something already ignored. |
| Parent exclusion still wins | A file cannot be rescued if an ancestor directory remains excluded. |
| CLI excludes | `extra_excludes` are appended last, so command-line excludes override ignore files. |
| Include file | `.graphifyinclude` can opt hidden files or directories into traversal, but sensitive files and hard-skipped noise dirs are still excluded. |

Tests cover all of these decisions, including `.gitignore` fallback, `.graphifyignore` precedence, negation behavior, and explicit extra excludes.
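
The rules in the table can be sketched with a simplified matcher. This is an approximation built on `fnmatch`, not Graphify's matcher or full gitignore semantics (for instance, `*` here also matches `/`); the function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def _last_match(rel: str, patterns: list[str]) -> bool:
    """Evaluate patterns in order; the last one that matches decides."""
    ignored = False
    name = rel.rsplit("/", 1)[-1]
    for raw in patterns:
        negated = raw.startswith("!")
        pat = (raw[1:] if negated else raw).rstrip("/")
        if fnmatchcase(rel, pat) or fnmatchcase(name, pat):
            ignored = not negated
    return ignored


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """A path is ignored if any ancestor is ignored, or if it is itself."""
    parts = PurePosixPath(rel_path).parts
    for i in range(1, len(parts)):
        # Parent exclusion wins: a negation cannot rescue a child.
        if _last_match("/".join(parts[:i]), patterns):
            return True
    return _last_match(rel_path, patterns)


ignore_file = ["*.log", "!keep.log", "build/", "!build/keep.txt"]
cli_excludes = ["keep.log"]  # appended last, so they override the file
```

With `ignore_file` alone, `keep.log` is re-included by the negation; with `ignore_file + cli_excludes`, the trailing exclude ignores it again.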

Sources: [graphify/detect.py:618-731](), [graphify/detect.py:734-840](), [graphify/detect.py:876-885](), [tests/test_detect.py:89-121](), [tests/test_detect.py:464-503](), [tests/test_detect.py:611-672]()

## Sensitive Inputs Stay Out

Before a classified file is accepted, Graphify checks whether the path looks sensitive. There are two layers:

1. Parent directory names such as `.ssh`, `.gnupg`, `.aws`, `.gcloud`, `secrets`, `.secrets`, and `credentials`.
2. Filename patterns such as `.env`, private key and certificate extensions, passwords, secrets, tokens, `.netrc`, `.pgpass`, and common cloud credential names.

This check happens after ignore filtering and before conversion or word counting. Sensitive paths are recorded in `skipped_sensitive` rather than added to the input buckets.

The tests are intentionally specific: `api_token.txt`, `oauth_token.json`, `app_secret.yaml`, `passwords.py`, SSH keys, and `config/secrets/db.json` are flagged, while false friends like `tokenizer.py` and `tokenize.py` are not.
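
A two-layer check of this kind might look like the sketch below. The regular expression is an assumption written to reproduce the examples named above, not the pattern set in `detect.py`, and `looks_sensitive` is a hypothetical name.

```python
import re
from pathlib import PurePosixPath

# Layer 1: parent directory names taken from the list above.
SENSITIVE_DIRS = {".ssh", ".gnupg", ".aws", ".gcloud",
                  "secrets", ".secrets", "credentials"}

# Layer 2: filename patterns (assumed). The keyword must be delimited by
# a separator or the string boundary, so "tokenizer" is not a hit.
SENSITIVE_NAME = re.compile(
    r"(^\.env(\..+)?$)"
    r"|(\.(pem|key|p12|pfx)$)"
    r"|(^id_(rsa|dsa|ecdsa|ed25519)(\.pub)?$)"
    r"|((^|[_\-.])(passwords?|secrets?|tokens?)([_\-.]|$))"
    r"|(^\.(netrc|pgpass)$)",
    re.IGNORECASE,
)


def looks_sensitive(rel_path: str) -> bool:
    """Return True when a parent directory or the filename looks secret."""
    path = PurePosixPath(rel_path)
    if any(part in SENSITIVE_DIRS for part in path.parts[:-1]):
        return True
    return SENSITIVE_NAME.search(path.name) is not None
```

Requiring a delimiter around the keyword is what separates `api_token.txt` from false friends such as `tokenizer.py`.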

Sources: [graphify/detect.py:39-61](), [graphify/detect.py:82-91](), [graphify/detect.py:935-940](), [tests/test_detect.py:506-558]()

## Office Files Become Markdown Sidecars

`.docx` and `.xlsx` files are first classified as documents, then converted into Markdown files under `graphify-out/converted/`. Graphify does this because the later extraction path needs readable text, not raw Office binaries.

For Word documents, `docx_to_markdown()` reads paragraphs, maps heading styles to Markdown headings, maps list styles to bullets, and serializes tables as Markdown tables. For Excel workbooks, `xlsx_to_markdown()` reads each sheet and turns non-empty rows into sheet sections and tables.
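
The table serialization step can be illustrated without the optional Office libraries. The sketch below assumes the rows have already been read from a sheet; `rows_to_markdown` is a hypothetical helper, not the signature of `xlsx_to_markdown()`, and the exact Markdown layout Graphify emits may differ.

```python
def rows_to_markdown(title: str, rows: list[list[object]]) -> str:
    """Turn the non-empty rows of one sheet into a section with a table."""
    # Drop rows where every cell is empty.
    rows = [r for r in rows if any(c not in (None, "") for c in r)]
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row: list[object]) -> str:
        # Escape pipes so cell text cannot break the table layout.
        cells = ["" if c is None else str(c).replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [f"## {title}", "", fmt(header), "|" + "---|" * width]
    lines += [fmt(r) for r in body]
    return "\n".join(lines)
```

A sheet with no non-empty rows yields an empty string, which mirrors the "conversion produced no text" case in which the original file is skipped.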

If conversion produces no text, or the optional libraries are missing, the original Office file is skipped with a note suggesting `pip install graphifyy[office]`.

Sources: [graphify/detect.py:334-371](), [graphify/detect.py:374-401](), [graphify/detect.py:494-520](), [graphify/detect.py:963-974](), [pyproject.toml:50-59]()

## Google Workspace Shortcuts Are Opt-In

Google Drive desktop files such as `.gdoc`, `.gsheet`, and `.gslides` are shortcuts, not document content. By default, Graphify classifies them as documents but skips them with a message telling the user to pass `--google-workspace` or set `GRAPHIFY_GOOGLE_WORKSPACE=1`.

When enabled, Graphify reads the shortcut JSON, extracts a Drive file ID from fields like `doc_id`, `file_id`, `fileId`, `id`, `resource_id`, or the URL, then exports the real content through the `gws` CLI. Google Docs export as Markdown, Slides export as plain text, and Sheets export as `.xlsx` before passing through the spreadsheet-to-Markdown converter.
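
The ID lookup can be sketched as follows. The field order comes from the list above; the `url` key name and the `/d/<id>` URL shape are assumptions for illustration, and `drive_file_id` is a hypothetical name. The export through `gws` is not shown.

```python
import json
import re
from typing import Optional

ID_FIELDS = ("doc_id", "file_id", "fileId", "id", "resource_id")
# Assumed URL shape: .../d/<file-id>/...
_URL_ID = re.compile(r"/d/([A-Za-z0-9_-]+)")


def drive_file_id(shortcut_text: str) -> Optional[str]:
    """Extract a Drive file ID from the JSON body of a shortcut file."""
    data = json.loads(shortcut_text)
    for field in ID_FIELDS:
        value = data.get(field)
        if isinstance(value, str) and value:
            return value
    # Fall back to parsing the ID out of the document URL.
    match = _URL_ID.search(str(data.get("url", "")))
    return match.group(1) if match else None
```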

The converted sidecar includes frontmatter that records the source file, source type, Google file ID, export MIME type, source URL, and a hash of the Google account email when present. That account hash preserves traceability without storing the raw email in the sidecar.
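
A sketch of the account-hash idea is shown below. SHA-256 over a normalized email and the frontmatter key names are assumptions; the source only states that a hash of the email is stored instead of the raw address.

```python
import hashlib


def account_hash(email: str) -> str:
    """Hash a normalized email so the sidecar never stores it verbatim."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def sidecar_frontmatter(meta: dict[str, str]) -> str:
    """Render metadata as a simple YAML-style frontmatter block."""
    lines = ["---"]
    lines += [f'{key}: "{value}"' for key, value in meta.items()]
    lines += ["---", ""]
    return "\n".join(lines)


frontmatter = sidecar_frontmatter({
    "source_file": "Plan.gdoc",           # hypothetical key names
    "source_type": "google_doc",
    "account_hash": account_hash("Dev@Example.com"),
})
```

Two sidecars exported under the same account share the same hash, so they can be correlated without exposing the address itself.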

Sources: [graphify/google_workspace.py:1-29](), [graphify/google_workspace.py:63-91](), [graphify/google_workspace.py:94-122](), [graphify/google_workspace.py:129-147](), [graphify/google_workspace.py:150-223](), [graphify/detect.py:942-962](), [tests/test_google_workspace.py:7-31](), [tests/test_google_workspace.py:33-75]()

## URLs, Web Pages, PDFs, Images, And YouTube Adds

`ingest.py` handles content added by URL. It first classifies the URL as tweet, arXiv, GitHub, YouTube, PDF, image, or generic webpage. PDFs and images are downloaded as binary files. Web pages, tweets, and arXiv pages are saved as annotated Markdown with YAML frontmatter. YouTube URLs are handed to the video downloader in `transcribe.py`.
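
The first step, URL classification, might be sketched like this. The host names, suffix lists, and check order are assumptions for illustration; `classify_url` is a hypothetical name, and the URL validation and `safe_fetch` calls are deliberately left out.

```python
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def classify_url(url: str) -> str:
    """Map a URL to one of the ingest kinds described above."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    def on(domain: str) -> bool:
        # Match the domain itself or any subdomain, never a lookalike suffix.
        return host == domain or host.endswith("." + domain)

    if on("twitter.com") or on("x.com"):
        return "tweet"
    if on("arxiv.org"):
        return "arxiv"
    if on("github.com"):
        return "github"
    if on("youtube.com") or on("youtu.be"):
        return "youtube"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith(IMAGE_SUFFIXES):
        return "image"
    return "webpage"
```

Checking hosts before file suffixes means a URL on a known site keeps its site-specific handling even when its path ends in `.pdf`.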

The URL path is security-aware: `ingest()` validates URLs before fetching, and the lower-level fetches go through `graphify.security.safe_fetch` or `safe_fetch_text`.

Sources: [graphify/ingest.py:64-81](), [graphify/ingest.py:84-100](), [graphify/ingest.py:136-162](), [graphify/ingest.py:165-207](), [graphify/ingest.py:218-269]()

## Video And Transcript Handling

Graphify detects audio/video files, but detection does not count their bytes as words. The transcript layer is separate.

`transcribe.py` can transcribe a local media file or a URL. For URLs, it validates the URL, downloads an audio-only stream through `yt-dlp`, and names the downloaded file from a stable URL hash. For transcription, it uses `faster-whisper` locally with a model name from `GRAPHIFY_WHISPER_MODEL`, defaulting to `base`.

Caching is simple: if `graphify-out/transcripts/<media-stem>.txt` already exists, `transcribe()` returns that path unless `force=True`. `transcribe_all()` processes a list and skips files that fail, warning instead of stopping the whole batch.
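
The cache-by-existence rule and the tolerant batch loop can be sketched as below. The transcription call is injected as a plain callable (`run_model`, a hypothetical parameter) so the example runs without `faster-whisper`; the function names are also invented for this sketch.

```python
from pathlib import Path
from typing import Callable, Iterable


def transcribe_cached(media: Path, out_dir: Path,
                      run_model: Callable[[Path], str],
                      force: bool = False) -> Path:
    """Reuse <out_dir>/<media-stem>.txt when it exists, unless forced."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{media.stem}.txt"
    if target.exists() and not force:
        return target  # cache hit: the model is never invoked
    target.write_text(run_model(media), encoding="utf-8")
    return target


def transcribe_many(files: Iterable[Path], out_dir: Path,
                    run_model: Callable[[Path], str]) -> list[Path]:
    """Transcribe a batch, warning on failures instead of aborting."""
    done = []
    for media in files:
        try:
            done.append(transcribe_cached(media, out_dir, run_model))
        except Exception as exc:
            print(f"warning: could not transcribe {media}: {exc}")
    return done
```

One consequence of keying on the media stem is that two files with the same stem in different folders would share a transcript path in this sketch.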

The Whisper prompt is also local and provider-neutral. It uses `GRAPHIFY_WHISPER_PROMPT` if set; otherwise it formats up to five graph "god node" labels into a domain hint, or falls back to a punctuation/paragraph prompt.
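
The prompt selection order can be sketched as follows. Only the environment variable name and the five-label cap come from the text above; the prompt wording and the name `whisper_prompt` are assumptions.

```python
import os
from typing import Mapping, Optional, Sequence

# Assumed wording; the real fallback prompt may differ.
FALLBACK_PROMPT = "Use proper punctuation and paragraph breaks."


def whisper_prompt(god_node_labels: Sequence[str],
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the prompt: explicit override, then domain hint, then fallback."""
    env = os.environ if env is None else env
    override = env.get("GRAPHIFY_WHISPER_PROMPT")
    if override:
        return override
    labels = [label for label in god_node_labels if label][:5]
    if labels:
        # Assumed phrasing for the domain hint.
        return "Technical discussion about " + ", ".join(labels) + "."
    return FALLBACK_PROMPT
```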

Sources: [graphify/transcribe.py:9-18](), [graphify/transcribe.py:43-90](), [graphify/transcribe.py:93-113](), [graphify/transcribe.py:116-183](), [tests/test_transcribe.py:22-54](), [tests/test_transcribe.py:68-110](), [tests/test_detect.py:353-369]()

## Caching: Remember The Work, Not The Whole Run

Graphify has two related memory systems: extraction caches and the manifest.

The extraction cache stores result JSON under `graphify-out/cache/{kind}/{hash}.json`, where `kind` is usually `ast` or `semantic`. The hash is based on file content plus the path relative to the cache root, which makes cache entries portable across checkout directories. For Markdown, Graphify strips YAML frontmatter before hashing, so metadata-only changes do not invalidate extraction results.
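
A sketch of such a cache key is shown below. SHA-256, the NUL separator, and the frontmatter-stripping rule are assumptions; `cache_key` and `strip_frontmatter` are hypothetical names, not the functions in `cache.py`.

```python
import hashlib
from pathlib import Path


def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' ... '---' block, if one is present."""
    if not text.startswith("---\n"):
        return text
    end = text.find("\n---\n", 3)
    return text if end == -1 else text[end + len("\n---\n"):]


def cache_key(path: Path, root: Path) -> str:
    """Hash the relative path plus the content (minus Markdown frontmatter)."""
    data = path.read_bytes()
    if path.suffix.lower() in {".md", ".mdx"}:
        text = data.decode("utf-8", errors="replace")
        data = strip_frontmatter(text).encode("utf-8")
    # The relative path keeps keys stable across checkout locations.
    rel = path.resolve().relative_to(root.resolve()).as_posix()
    digest = hashlib.sha256()
    digest.update(rel.encode("utf-8"))
    digest.update(b"\0")  # separator so path and content cannot blur
    digest.update(data)
    return digest.hexdigest()
```

Under these assumptions, the same file at the same relative path in two different checkouts produces the same key, and editing only the frontmatter of a Markdown file leaves the key unchanged.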

A stat index at `graphify-out/cache/stat-index.json` avoids rereading unchanged files. If file size and `mtime_ns` match the previous entry, Graphify reuses the previous hash. If the stat data changes, it rereads and hashes the file.
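
The stat-index shortcut can be sketched with an in-memory dictionary standing in for `stat-index.json`. The entry layout and the name `hash_with_stat_index` are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def hash_with_stat_index(path: Path, index: dict) -> str:
    """Reuse the stored hash when size and mtime_ns are unchanged."""
    st = path.stat()
    key = str(path)
    entry = index.get(key)
    if (entry is not None
            and entry["size"] == st.st_size
            and entry["mtime_ns"] == st.st_mtime_ns):
        return entry["hash"]  # fast path: the file is not read
    # Slow path: read and hash the file, then refresh the index entry.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    index[key] = {"size": st.st_size, "mtime_ns": st.st_mtime_ns,
                  "hash": digest}
    return digest
```

The trade-off is the usual one for stat-based caches: a change that preserves both size and `mtime_ns` would go unnoticed, in exchange for skipping file reads on the common unchanged case.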

Semantic caching groups nodes, edges, and hyperedges by `source_file`, then saves one cache entry per source file. During extraction, cached semantic results are merged directly, and only uncached files go to fresh semantic extraction.
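
The grouping step can be sketched as follows. The extraction shape (a dict with `nodes`, `edges`, and `hyperedges` lists whose items carry `source_file`) follows the description above; the helper name and the handling of items without a source file are assumptions.

```python
from collections import defaultdict


def group_by_source(extraction: dict) -> dict:
    """Split one extraction result into per-source-file cache payloads."""
    grouped = defaultdict(
        lambda: {"nodes": [], "edges": [], "hyperedges": []}
    )
    for kind in ("nodes", "edges", "hyperedges"):
        for item in extraction.get(kind, []):
            source = item.get("source_file")
            if source:  # items without a source file are not cached here
                grouped[source][kind].append(item)
    return dict(grouped)
```

Each value in the returned mapping is what would be written as one cache entry, so a later run can merge the entries for unchanged files and extract only the rest.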

Sources: [graphify/cache.py:17-41](), [graphify/cache.py:97-146](), [graphify/cache.py:149-190](), [graphify/cache.py:193-245](), [graphify/cache.py:263-329](), [tests/test_cache.py:19-51](), [tests/test_cache.py:79-128](), [graphify/__main__.py:3045-3135]()

## Incremental Runs: What Changed Since Last Time

The manifest is the second memory system. `save_manifest()` writes file mtimes plus separate `ast_hash` and `semantic_hash` values. This separation matters because `graphify update` can refresh AST-only code information without pretending that semantic document extraction is also current.

`detect_incremental()` runs a normal detection pass, loads the previous manifest, and returns changed files separately from unchanged files. It has a fast path for unchanged mtimes and matching hashes, and a slower path that compares content hashes when mtimes move. It also reports deleted files so the graph builder can prune old sources.
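
The comparison logic can be sketched with plain dictionaries. The manifest shape (one `mtime` and one `hash` per path) is simplified from the separate `ast_hash` and `semantic_hash` values, and `diff_manifest` and its parameters are hypothetical names.

```python
from typing import Callable


def diff_manifest(previous: dict, current_mtimes: dict,
                  hash_file: Callable[[str], str]):
    """Return (changed, unchanged, deleted) paths versus the last manifest."""
    changed, unchanged = [], []
    for path, mtime in current_mtimes.items():
        old = previous.get(path)
        if old is None:
            changed.append(path)        # new file
        elif old["mtime"] == mtime:
            unchanged.append(path)      # fast path: no hashing needed
        elif hash_file(path) == old["hash"]:
            unchanged.append(path)      # touched, but content is identical
        else:
            changed.append(path)
    deleted = [p for p in previous if p not in current_mtimes]
    return changed, unchanged, deleted
```

Passing the hasher as a callable keeps hashing lazy: it runs only for files whose mtime moved, which is the slow path described above.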

During full extraction, `__main__.py` is careful not to stamp semantic success for documents, papers, or images whose semantic chunks failed. That keeps failed files eligible for retry on the next incremental run.

Sources: [graphify/detect.py:1021-1091](), [graphify/detect.py:1094-1126](), [graphify/__main__.py:2983-3017](), [graphify/__main__.py:3152-3166](), [tests/test_detect.py:270-299](), [tests/test_detect.py:560-610]()

## What Does Not Go In The Box

Graphify deliberately leaves out several things:

| Input | Why it stays out |
|---|---|
| Known dependency/build/cache directories | They are generated or redundant, not source knowledge. |
| Lockfiles such as `package-lock.json`, `Cargo.lock`, `poetry.lock` | They are large generated dependency state, not usually useful graph input. |
| Sensitive-looking paths | Secrets should not enter extraction. |
| Google Workspace shortcuts without opt-in | Shortcut files are pointers and may require authenticated export. |
| Failed Office/Google conversions | Graphify needs readable text sidecars. |
| Video bytes in word counts | Media becomes useful after transcription, not by counting binary data. |
| `graphify-out/converted/` sidecars during the original walk | This prevents Graphify from re-processing its own conversion output. |

Sources: [graphify/detect.py:537-565](), [graphify/detect.py:887-891](), [graphify/detect.py:929-934](), [graphify/detect.py:937-974](), [tests/test_detect.py:318-343]()

## Summary

Graphify's input box is built in layers: recognize useful file types, avoid obvious noise, respect ignore rules, refuse likely secrets, convert unreadable-but-supported formats into Markdown, handle media through transcript files, and use caches plus manifests so repeated runs only redo necessary work. This keeps the architecture portable: local file discovery and caching are independent of the model backend, while semantic extraction can run through whichever configured provider or local backend the user chooses.

Sources: [graphify/detect.py:862-1005](), [graphify/cache.py:263-329](), [README.md:357-362]()
