# Data-first exploration

> Row/column reads, index navigation (.by), pagination (.first/.last/.sample), PATCH writes, import/export, and large-table safety limits.

- Repository: timescale/tigerfs
- GitHub: https://github.com/timescale/tigerfs
- Human docs: https://grok-wiki.com/public/docs/timescale-tigerfs-60719456a5c3
- Complete Markdown: https://grok-wiki.com/public/docs/timescale-tigerfs-60719456a5c3/llms-full.txt

## Source Files

- `docs/data-first.md`
- `internal/tigerfs/db/query.go`
- `internal/tigerfs/db/pipeline.go`
- `internal/tigerfs/fs/operations.go`
- `internal/tigerfs/format/convert.go`
- `test/integration/crud_test.go`

---

---
title: "Data-first exploration"
description: "Row/column reads, index navigation (.by), pagination (.first/.last/.sample), PATCH writes, import/export, and large-table safety limits."
---

Data-first mode maps each PostgreSQL table under the mount to a directory tree: primary keys become path segments, row bodies are readable as `.json`/`.csv`/`.tsv`/`.yaml` files or as per-column files, and dot-capability directories (`.by`, `.filter`, `.first`, `.export`, `.import`) compile into SQL in `internal/tigerfs/db/pipeline.go` and `internal/tigerfs/fs/operations.go`. Reads always query PostgreSQL; metadata caches do not store row content.

```text
/mnt/<schema>/<table>/
├── <pk>                    # Row directory (or <pk>.json / .tsv / …)
│   ├── <col>               # Numeric types: no extension
│   ├── <col>.txt           # text/varchar
│   ├── <col>.json          # json/jsonb
│   └── <col>.bin           # bytea
├── .by/                    # Indexed column navigation
├── .filter/                # Any-column equality filters
├── .first/  .last/  .sample/
├── .columns/  .order/  .export/
├── .import/                # Bulk load (not pipeline-composable)
└── .info/                  # ddl, schema, columns, count
```

<Info>
File-first workspaces (markdown/plaintext under `.build/`) reject workspace-level `.by`, `.filter`, and `.export` paths. Use `/.tables/<workspace>/` for data-first access to the backing table without bypassing history semantics on normal workspace files.
</Info>

## Prerequisites

- A mounted TigerFS database (see [Quickstart](/quickstart)).
- Default schema tables visible under the mount root (or an explicit schema path).
- For large tables: configure `dir_listing_limit`, `dir_filter_limit`, and `query_timeout` before relying on `ls` over millions of rows.

## Reading rows and columns

| Path pattern | Operation | Backend |
|--------------|-----------|---------|
| `/<table>/<pk>.json` | Full row, JSON | `db.GetRow` + `format.RowToJSON` |
| `/<table>/<pk>.tsv` / `.csv` / `.yaml` | Full row, tabular/YAML | `format.RowTo*` |
| `/<table>/<pk>/<col>` or `<col>.txt` | Single column | `db.GetColumn` + `format.ConvertValueToText` |
| `/<table>/.info/count` | Row count | Metadata query |
| `/<table>/.info/columns` | Column names (one per line) | Column cache |

```bash
ls /mnt/db/users/                    # PKs (capped by dir_listing_limit)
cat /mnt/db/users/42.json            # Entire row
cat /mnt/db/users/42/email.txt       # One column
cat /mnt/db/users/.info/count        # Total rows
```

Row reads go through `Operations.ReadFile` → `readRowFile` / `readColumnFile`, which call the database on every read. Column files append a trailing newline on output; empty writes to a column set NULL (`TestCRUDNullHandling`, `TestFSOperations_WriteFile_NullColumn`).

### Row-as-directory dot formats

Inside `/<table>/<pk>/`, dot-prefixed files expose the full row in alternate formats:

```bash
cat /mnt/db/users/42/.json
cat /mnt/db/users/42/.csv
```

## Index navigation with `.by/`

`.by/` exposes only columns that have indexes. Listing is index-backed and uses `dir_listing_limit` for distinct-value queries (same cap as table row listings in the shared operations path).

```bash
ls /mnt/db/orders/.by/                      # Indexed columns
ls /mnt/db/orders/.by/status/               # Distinct indexed values
ls /mnt/db/orders/.by/status/pending/       # Rows matching status=pending
cat /mnt/db/orders/.by/customer_id/7.json   # Row via index path
```

Each `.by/<column>/<value>/` segment adds an **indexed** filter (`FilterCondition.Indexed = true` in `db.QueryParams`). Filters AND-combine with prior `.by/` and `.filter/` segments.

### Composite indexes

Two equivalent styles when a multi-column index exists:

```bash
# Sequential (two filters)
.by/last_name/Smith/.by/first_name/John/

# Composite (single segment, matches index column order)
.by/last_name.first_name/Smith.John/
```

Composite syntax is preferred when it matches an existing composite index definition.

## Pagination: `.first/`, `.last/`, `.sample/`

Pagination capabilities appear as numeric subdirectories. They apply at the current pipeline context and push down to SQL (`LimitFirst`, `LimitLast`, `LimitSample` in `db/pipeline.go`).

| Capability | Default ordering | SQL effect |
|------------|------------------|------------|
| `.first/N/` | Primary key ascending | `ORDER BY pk ASC LIMIT N` |
| `.last/N/` | Primary key descending | `ORDER BY pk DESC LIMIT N` |
| `.sample/N/` | Random subset | `ORDER BY RANDOM() LIMIT N` |

```bash
ls /mnt/db/events/.first/100/              # First 100 PKs
ls /mnt/db/events/.last/50/                # Last 50 PKs
ls /mnt/db/events/.sample/25/              # Random 25 PKs
```

### Nested pagination

Chaining limits builds subqueries (`NeedsSubquery()` / `buildNestedPipelineSQL`):

| Path | Meaning |
|------|---------|
| `.first/100/.last/50/` | Last 50 of the first 100 rows (rows 51–100 by PK order) |
| `.last/100/.first/50/` | First 50 of the last 100 rows |
| `.first/1000/.sample/50/` | Random 50 drawn from the first 1000 |

Post-limit filters are allowed (e.g. `.first/100/.filter/status/active/` filters within the capped set).

### Ordering interaction

After `.order/<col>/`, only `.first/`, `.last/`, `.sample/`, `.columns/`, and `.export/` may follow. Use `.first/` for ascending sort results and `.last/` for descending on the ordered column.

## Pipeline queries and export

Capability segments between the table root and a row listing or export file form one pipeline. `FSContext` in `internal/tigerfs/fs/context.go` tracks filters, order, limits, column projection, and terminal state; `ToQueryParams()` feeds `db.QueryRowsPipeline` / `QueryRowsWithDataPipeline`.

```bash
cat /mnt/db/orders/.by/customer_id/42/.by/status/pending/.order/created_at/.last/10/.export/json
```

Rough SQL shape:

```sql
SELECT * FROM orders
WHERE customer_id = $1 AND status = $2
ORDER BY created_at DESC
LIMIT 10
```

### Terminal and redundant chains

Rules enforced at parse time (`path.go`, `FSContext.CanAdd*`):

| Rule | Example |
|------|---------|
| `.export/` is terminal | Nothing after `.export/json` |
| After `.columns/`, only `.export/` | `.columns/id,name/.filter/...` invalid |
| No double `.first/` or `.last/` | Use a single smaller limit |
| No limit after `.sample/` | Use `.sample/N/` with smaller N |
| No second `.order/` | Second replaces first; use one `.order/` |
| `.import/` not in pipelines | Bulk import is a separate write path |

Export formats under `.export/`: `csv`, `tsv`, `json`, `yaml`, plus `.with-headers/` variants for csv/tsv. Reading an export path runs the full pipeline and serializes all matching rows.

<Note>
Deep pipeline grammar, `PathType` resolution, and ADR-level SQL pushdown are documented on [Capability directories](/capability-directories).
</Note>

## `.by/` vs `.filter/`

| Aspect | `.by/` | `.filter/` |
|--------|--------|------------|
| Columns listed | Indexed columns only | All table columns |
| Value listing | Index scan, `dir_listing_limit` cap | `DISTINCT` query, `dir_filter_limit` cap |
| Performance | Predictable index use | May scan large tables |
| Typical use | Known indexed lookups | Ad-hoc equality on any column |

```bash
ls /mnt/db/users/.by/email/          # Fast value listing
ls /mnt/db/users/.filter/notes/      # Any column; may hit safety limits
```

For `.filter/<col>/` on very large tables, when the cached row estimate exceeds `dir_filter_limit` (default 100,000), legacy FUSE mounts surface a `.table-too-large` indicator file instead of enumerating distinct values. Direct paths still work:

```bash
cat /mnt/db/large_events/.filter/type/click/.first/100/.export/json
```

## Writing data (PATCH semantics)

Structured row formats use **PATCH** semantics: only columns present in the payload are updated; omitted columns are unchanged (`format.ParseJSON`, `ParseCSVWithHeader`, `ParseTSVWithHeader`; `Operations.writeRowFile` → `db.UpdateRow`).

| Write target | Semantics | Notes |
|--------------|-----------|-------|
| `/<pk>.json` / `.yaml` / `.csv` / `.tsv` | PATCH by column keys / header row | Insert if row missing; format suffix required for new rows |
| `/<pk>/<col>` or `<col>.txt` | Single-column replace | Empty content → NULL |
| `/<pk>` (no extension), existing row | PUT in schema column order | Bare path; TSV values in column order |
| `/<pk>` (no extension), new row | Rejected | Use `/<pk>.tsv` (or `.json`, etc.) to avoid NFS inode conflicts |

```bash
echo 'new@example.com' > /mnt/db/users/42/email.txt
echo '{"name":"Ada"}' > /mnt/db/users/42.json
echo -e 'email\tname\na@b.com\tAda' > /mnt/db/users/42.tsv
mkdir /mnt/db/users/99 && echo 'x@y.com' > /mnt/db/users/99/email.txt
rm -r /mnt/db/users/99/
```

Writes invalidate stat/path caches for the table schema (`statCache.invalidate`); cross-mount reads still see fresh row data from PostgreSQL.

<Warning>
Text primary keys ending in `.json`, `.csv`, `.tsv`, or `.yaml` are ambiguous with format extensions. Prefer the row-directory view or escape-aware tooling.
</Warning>

## Bulk import and export

Import is **write-only** under `.import/` and is not composable with pipeline segments.

| Mode path | Behavior |
|-----------|----------|
| `.import/.sync/<file>.csv` | Upsert: update existing PKs, insert new |
| `.import/.overwrite/<file>.csv` | Truncate table, then load |
| `.import/.append/<file>.csv` | Insert additional rows only |

Formats: `csv`, `tsv`, `json`, `yaml` under each mode; `.no-headers/` subpaths accept csv/tsv without a header row (column order from table schema).

```bash
cat > /mnt/db/staging/.import/.overwrite/users.csv <<'EOF'
id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
EOF
```

Export reads use `.export/<fmt>` at the end of a pipeline (or on a bare table with implicit limits). Verified modes in integration tests: sync, append, overwrite (`TestFSOperations_Import_*`).

## Large-table safety limits

Configuration lives in `~/.config/tigerfs/config.yaml` and `TIGERFS_*` environment variables (`internal/tigerfs/config/config.go`).

| Key | Default | Effect |
|-----|---------|--------|
| `dir_listing_limit` | `1000` | Max PKs returned in `ls` on a table; max distinct values listed under `.by/<col>/` |
| `dir_filter_limit` | `100000` | Max distinct values fetched for `.filter/<col>/`; tables above estimate skip value listing (`.table-too-large` on legacy FUSE) |
| `dir_writing_limit` | `100000` | Bulk write safety threshold |
| `query_timeout` | `30s` | Statement timeout for queries |
| `max_pipeline_depth` | `10` | After this depth, pipeline capability dirs are hidden (rows still list) to stop runaway `find`/agent recursion |
| `no_filename_extensions` | `false` | Disable `.txt`/`.json`/`.bin` column suffixes |

```yaml
filesystem:
  dir_listing_limit: 1000
  dir_filter_limit: 100000
  query_timeout: 30s
```

```bash
export TIGERFS_DIR_LISTING_LIMIT=5000
export TIGERFS_DIR_FILTER_LIMIT=100000
export TIGERFS_QUERY_TIMEOUT=60s
```

<Steps>
<Step title="Explore a large table safely">
Use pagination instead of bare `ls`:

```bash
ls /mnt/db/events/.first/100/
cat /mnt/db/events/.first/100/.export/json
```
</Step>
<Step title="Filter without value listing">
When `ls .filter/col/` shows `.table-too-large`, jump directly to the value:

```bash
cat /mnt/db/events/.filter/type/click/.last/50/.export/csv
```
</Step>
<Step title="Verify row count">
```bash
cat /mnt/db/events/.info/count
```
</Step>
</Steps>

Legacy Linux FUSE (`legacy_fuse: true`) additionally refuses full table `ls` when `pg_class` row estimate exceeds `dir_listing_limit` (logged as "Table too large for directory listing"). The default shared `fs.Operations` path (including macOS NFS) returns up to `dir_listing_limit` PKs plus capability directories.

## File-first backing tables via `.tables/`

Synthesized workspaces store rows in the `tigerfs` schema. Data-first operations on that data use:

```bash
ls /mnt/db/.tables/notes/
cat /mnt/db/.tables/notes/.info/columns
```

<Warning>
Writes under `.tables/<workspace>/` bypass workspace history triggers. They do not appear in `.log/`, `.history/`, or undo paths. Use the file-first workspace for normal edits; reserve `.tables/` for diagnostics and repairs.
</Warning>

## Verification signals

| Action | Expected signal |
|--------|-----------------|
| `cat …/<pk>.json` after edit | Updated keys only; other fields unchanged (PATCH) |
| `cat …/.info/count` after import overwrite | Count matches imported rows |
| `ls …/.by/<indexed_col>/` | Only indexed columns appear |
| Pipeline export | Single JSON array / CSV with header, no trailing capabilities |
| Blocked workspace pipeline | Error with hint mentioning `.tables/` |

Integration coverage: `test/integration/crud_test.go` (insert/select/update/delete, partial updates), `test/integration/pipeline_test.go` (`.by`, nested pagination, `.sample`), `test/integration/fs_operations_test.go` (import modes, PATCH via pipeline column write), `test/integration/synthesized_test.go` (workspace vs `.tables/` gates).

## Related pages

<CardGroup>
<Card title="Capability directories" href="/capability-directories">
Pipeline grammar, chaining matrix, SQL pushdown, and PathType resolution.
</Card>
<Card title="Data formats reference" href="/data-formats-reference">
TSV/CSV/JSON/YAML encoding, PATCH details, NULL rules, and type extensions.
</Card>
<Card title="File-first and data-first" href="/file-first-and-data-first">
When mounts are workspaces vs raw tables and how mode detection works.
</Card>
<Card title="Consistency and caching" href="/consistency-and-caching">
Why reads are always fresh and which metadata caches invalidate on writes.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full config struct, env vars, and precedence.
</Card>
</CardGroup>
