From f155783dfc4b57efc41d310bdfce23aea4ec939e Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 21:51:40 -0700 Subject: [PATCH 1/7] docs: plan of record for #305 facet-count coherence Joint Claude+Codex plan for computing multi-filter facet counts from the #299 bitmask index (avoiding the 39M-row membership scan). Covers the masks histogram approach, native DuckDB benchmarks, source handling, the #306 missing-pid prerequisite, the honesty rule (never baseline under active filters), a 4-phase rollout, and risks. Refs #305, #304, #306, #276. Co-Authored-By: Claude Opus 4.8 (1M context) --- PLAN_305_facet_counts.md | 128 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) create mode 100644 PLAN_305_facet_counts.md diff --git a/PLAN_305_facet_counts.md b/PLAN_305_facet_counts.md new file mode 100644 index 0000000..22f4c7a --- /dev/null +++ b/PLAN_305_facet_counts.md @@ -0,0 +1,128 @@ +# Plan — #305 Facet-count correctness & coherence (counts beyond the single-filter cube) + +**Status:** plan of record · **Authors:** Claude + Codex (joint, 2026-06-19) · **Tracks:** #305 (meta), #304 (live bug), #306 (data bug), #276 (semantics), #230, #286 + +## Problem + +At the global/zoomed-out view, facet **counts** only cross-filter for a **single** active +filter (served by the precomputed `facet_tree_cross_filter` cube via `effectiveSingleFilter()`, +`explorer.qmd:3374`). With **2+ active selections** the cube is skipped and tree-dim counts +revert to the **unfiltered baseline** (`explorer.qmd:3557-3562`) — deliberately, to avoid a +~39M-row `sample_facet_membership` GROUP BY that stalls DuckDB-WASM. So the **map is correctly +filtered** (#299 bitmask) but the **count numbers are wrong** for multi-filter at world view. +Zoomed-in is already correct (live membership path, `explorer.qmd:3564-3598`). Reported by +Eric in #304 (e.g. *Material: Anthropogenic* + *Specimen Type: Artifact*). + +## Core idea — count per node FROM the bitmasks (no membership scan) + +For a count dimension D over a qualifying pid set Q: + +```sql +SELECT nb.concept_uri AS value, COUNT(*) AS count +FROM s +JOIN read_parquet('${node_bits_url}') nb ON nb.facet_type = '' + AND (s._mask & (1::BIGINT << nb.bit_index)) <> 0 +WHERE +GROUP BY nb.concept_uri +``` + +- Masks are **one row per pid** → `COUNT(*)` == `COUNT(DISTINCT pid)` (do **not** add `COUNT(DISTINCT)` — wasted work). +- Membership encodes **ancestor closure** → child bit ⇒ parent bit, so **parent ≥ child** holds for free. +- The bitwise join is against `facet_node_bits` (≤63 rows/dim), not the 39M-row membership table. + +**Cross-filter semantics (confirmed correct):** OR among selected nodes *within* a dimension; +AND *across* dimensions; **omit the target dimension's own predicate** (exclude-self), matching +the UI contract (`explorer.qmd:3305`) and the existing cube (`build_frontend_derived.py:460`). +Special case: **zero selected sources** ⇒ impossible filter for tree targets, but the source +dimension's own histogram must still exclude source and stay constrained by the three trees. + +## Feasibility (Codex native DuckDB benchmarks, real 6M-pid data) + +| Query | Time | +|---|---| +| One unfiltered 22-node histogram | ~0.17s | +| All three tree dims via one 56-node join | ~0.69s | +| Eric's Material+Specimen-Type → context histogram | ~0.02s | +| Embedded source predicate/group | ~0.014s | +| `facets_v3` semi-join (rejected approach) | ~0.145s | + +Strongly supports feasibility. **Worst case ≈ 6M × 56 = ~336M cheap bit-tests**, not merely a +6M-row scan → **WASM cold+warm benchmarking remains a hard release gate** (the app has documented +DuckDB-WASM concurrency/deadlock sensitivity). + +## Source handling + +Add `source` as a plain **VARCHAR** column to a **new complete per-pid index** (source is +exclusive, not multi-valued → a mask is wrong; a `facets_v3` semi-join adds a second 6M-row +read + hash join). New schema: + +``` +pid · source · material_mask · context_mask · object_type_mask · build_id · schema_version +``` + +## Data-correctness prerequisite (#306) + +The current `sample_facet_masks` is built from `membership`, so it **omits ~29,917 located +samples** with coordinates but no tree membership (`facets_v3` = 6,026,242 vs masks = 5,996,325). +The new index must start from **`samp_geo`**, LEFT JOIN masks, and emit **zero masks** for +no-membership pids. Tracked as **#306** (must land in Phase 1). + +## Honesty rule (non-negotiable) + +**Never fall back to the unfiltered baseline when filters are active** — that is exactly the +#304 bug. `applyFacetCounts(..., null)` means baseline (`explorer.qmd:1381`). On any +miss/error under active filters, show **"—" / "unavailable" / stale styling / hidden counts** — +never authoritative-looking baseline numbers. Apply all dimension maps **atomically after one +stale-guard check** (`facetCountsReqId`); never partially repaint dimensions. + +## Phased plan + +### Phase 1 — Artifact + semantic foundation +- Build the complete per-pid index above. Start from `samp_geo`; include zero-mask pids (#306). +- Add `source` VARCHAR; extend the build **fingerprint** to cover the pid/source universe + (today it fingerprints membership only → would go stale silently once source/coverage change). +- AI-free validator: pid completeness vs `facets_v3`; source equality; node-bit coverage; + mask ≡ membership for pids that have membership; build_id consistency. +- **Publish under a new immutable filename/tag — do NOT overwrite the cached `202608` artifact.** +- Ratify the count contract (#276) as **membership / "anywhere in tree"** — already the shipped + semantics, so this is confirmation, not a redesign. + +### Phase 2 — Fix #304 at global view +- Generalized **direct-control filter snapshot** (today's `describeCrossFilters()` zeroes tree + selections at global view — the root cause). Generalize `effectiveSingleFilter()` to all selections. +- Route **multi-filter / full-tree-mode / global / no-search** counts through one mask-index + histogram query (the `tree_counts ∪ source_counts` shape). +- Keep the single-filter cube as a **pure optimization**, not a required semantic path. +- Apply dimension maps atomically after the stale check. On failure → unavailable, **not** baseline. +- **Benchmark cold + warm DuckDB-WASM** across empty / narrow / broad / all-pid qualifying sets; + set a concrete interaction budget before release. + +### Phase 3 — Extend (not premature unification) +- Add **viewport** support (identical padded bbox, `VIEWPORT_PAD_FACTOR = 0.3`). +- Add **search** via `search_pids`. +- Characterization-test the new path against the existing membership path; **then** retire the + membership count implementation. (Immediate unification would balloon the first fix into + viewport + search + flat-mode + stale-update refactoring — defer.) + +### Phase 4 — Verification + cleanup +- Unit-test predicate generation (exclude-self, zero-source states). +- Data validator gates from Phase 1, run in CI. +- Playwright: Eric's exact URL, multi-value-in-one-dim, source+tree, impossible combos, + global↔viewport transitions, superseded requests. +- Remove the misleading active-filter baseline fallback everywhere. +- Codex adversarial review; staging verify on real R2 data; deploy smoke gate. + +## Constraints / landmines +- **63-bit ceiling:** current dims are 19 / 22 / 15 nodes — ample headroom. Keep the build + hard-fail; design a future multiword schema but **do not implement now**. +- **Flat / mixed render mode:** keep the mask path gated OFF (subtree-membership semantics are + wrong for a flat dim) — defer to existing paths, as the cube already does. +- **Immutable data:** new tag/filename; never overwrite a cached artifact. + +## Top 3 risks +1. **WASM latency / connection starvation** — credible SQL, unmeasured in-browser; ~336M bit-tests worst case. +2. **Stale / incomplete index generations** — #306 missing pids + membership-only build_id; mixed cached generations → convincing-but-false counts. +3. **Honesty & state-transition failures** — baseline-as-truth fallback + async camera transitions unless results applied atomically. + +--- +*Joint plan: Claude (code trace, drafting) + Codex (native benchmarks, source design, #306 discovery, honesty rule, phasing). 2026-06-19.* From 473ec29eebce5e62e375ee6b99ddb2291908a5b7 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:08:09 -0700 Subject: [PATCH 2/7] #306/#305: complete per-pid facet index (sample_facet_index) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Build sample_facet_index: one row per LOCATED sample (samp_geo), not only those with tree membership. sample_facet_masks is built FROM membership and silently omits ~29,917 located samples with no tree concept (#306); the multi-filter global-view count path (#304/#305) must count over the whole located universe, so it scans this complete index instead. - start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the no-membership pids (correct: in no subtree, but still located + still counts toward source/total) - add `source` VARCHAR (exclusive, not multi-valued -> not a mask) - build_id = ":": membership half matches facet_node_bits (mask-bit interpretation gate); coverage half fingerprints the samp_geo (pid, source) universe so #306-class drift can't go stale silently (today's membership-only id is blind to it) - schema_version column for forward-compat Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2 (located universe completeness, #306), source==facets_v2, structured single build_id matching node_bits, masks bit-identical to sample_facet_masks for shared pids, and zero-mask for every no-membership pid. Tests: a located no-membership sample (root-only material) — present + zero masked in the index, absent from masks; coverage-id changes on source flip; all corruption gates bite. 29/29 pass. Docs: SERIALIZATIONS.md §4.12 ratifies the #276 count contract (membership / "anywhere in tree") served by this index. Refs #305 #306 #304 #276 --- SERIALIZATIONS.md | 37 ++++++ scripts/build_frontend_derived.py | 92 +++++++++++++- scripts/validate_frontend_derived.py | 86 +++++++++++++ tests/test_frontend_derived.py | 174 +++++++++++++++++++++++++++ 4 files changed, 386 insertions(+), 3 deletions(-) diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index b5a9be5..bc9e0d1 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -122,6 +122,7 @@ builder — a fresh build is NOT bit-for-bit identical to them (see | `isamples_202601_sample_facets_v2.parquet` | `(pid, source, material, context, object_type, label, description, place_name)` — all VARCHAR scalars; each facet column is a single URI per sample (not an array) | 63 MB | 6.0 M | wide | Search Explorer multi-dim facet filtering | QUERY_SPEC §3.3, §5.1 | | `isamples_202601_facet_summaries.parquet` | Baseline `(facet_type, facet_value, scheme, count)` | 2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 | | `isamples_202601_facet_cross_filter.parquet` | Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a | +| `_sample_facet_index.parquet` | Complete per-pid facet index `(pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version)` — **one row per located sample**, including samples with no tree membership (zero-masked, #306). Scanned by the multi-filter global-view count path (#304/#305). | ~60 MB | 6.0 M | wide (membership + samp_geo) | Interactive Explorer multi-filter facet counts | §4.12 below | ### Tier: vocabulary labels @@ -296,6 +297,42 @@ for the alias when you want "latest." - **Consumer**: `scripts/enrich_wide_with_oc_thumbnails.py` uses OC narrow to fill thumbnails into 202604 unified wide. Also used directly in PQG benchmark work. - **Future**: these become the prototype upstream for per-source sidecars (see §3, bottom row). +### 4.12 `_sample_facet_index.parquet` (complete per-pid facet index, #305/#306) + +- **Role**: The single artifact the **multi-filter** global-view count path scans + (#304/#305). Where `sample_facet_masks` is built FROM `membership` and therefore + silently omits located samples that carry no tree concept (~29,917 in the 202608 + generation — **#306**), this index starts from `samp_geo` (the authoritative + located set) and is **complete**: one row per located `pid`. Samples with no + membership are present and **zero-masked**. +- **Headline schema** (7 cols): `pid (VARCHAR), source (VARCHAR), material_mask + (BIGINT), context_mask (BIGINT), object_type_mask (BIGINT), build_id (VARCHAR), + schema_version (INTEGER)`. `source` is a plain VARCHAR (source is exclusive, not + multi-valued, so a mask would be wrong). Each `*_mask` bit `(1 << bit_index)` is + set iff the pid is a member of that node, under the bit assignment in + `facet_node_bits`; membership encodes ancestor closure, so a parent bit is set for + the whole subtree. +- **`build_id` is structured** as `":"`: + - the **membership half** equals `facet_node_bits.build_id` — the explorer must + only interpret the mask bits when the two agree (else the bits are read under a + foreign assignment); + - the **coverage half** is a fingerprint of the `(pid, source)` universe over + `samp_geo`, so a changed source value or located-pid set (the #306 drift class) + changes `build_id` instead of going stale silently. +- **Count contract (ratifies #276)**: a facet count is **membership / "anywhere in + the tree"** — a sample counts toward a node if that node is anywhere in the + sample's asserted-concept subtree closure (OR within a dimension, AND across + dimensions, **excluding the target dimension's own predicate**). This is the + already-shipped cube/UI semantics; this index does not change it, it serves the + same contract to the *multi-filter* case the single-filter cube cannot reach. +- **Validation**: `validate_frontend_derived.py --index ` asserts: schema, + one row per pid, `pid` set == `sample_facets_v2.pid` (the located universe), + `source` == facets_v2 source, single structured `build_id` whose membership half + matches `node_bits`, masks bit-identical to `sample_facet_masks` for shared pids, + and that every no-membership pid is zero-masked. +- **Immutability**: published under a **new** versioned filename — it is a new + artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. + ## 5. URL convention All substrate files live under `https://data.isamples.org/` — a diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index d92cff0..661a78e 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -19,6 +19,7 @@ - {tag}_facet_summaries.parquet facet_type, facet_value, scheme, count - {tag}_facet_cross_filter.parquet filter_source/material/context/object_type, facet_type, facet_value, count - {tag}_wide_h3.parquet wide + h3_res4/6/8 (large; built only on --only wide_h3) + - {tag}_sample_facet_index.parquet pid, source, material_mask, context_mask, object_type_mask(BIGINT), build_id, schema_version(INT) — COMPLETE per-pid index over ALL located samples (incl. #306 no-membership pids, zero-masked); the multi-filter global-view count path (#304/#305) scans this - {tag}_manifest.json provenance + per-output rowcount/schema/sha256 MATERIAL SELECTION (issue #265/#271): the broad SKOS root @@ -64,11 +65,16 @@ ARTIFACTS = ["sample_facets_v2", "samples_map_lite", "h3_summaries", "facet_summaries", "facet_cross_filter", "wide_h3", "sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"] + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"] # #293: max tree nodes per dim that fit in a signed BIGINT mask (bits 0..62). # Live max is 22 (context); guard so a future vocab explosion fails loudly # instead of silently overflowing a mask bit. MASK_MAX_BITS = 63 +# #305/#306: schema version for the complete per-pid facet index. Bump when the +# column set / semantics of sample_facet_index change so the explorer can refuse +# an index it doesn't understand instead of mis-reading it. +INDEX_SCHEMA_VERSION = 1 # Shared SQL expression for sample_facets_v2.description (#277 part 2). # Appends space-joined concept labels (IC labels across all 4 concept dims) @@ -508,6 +514,35 @@ def membership_build_id(con): FROM membership""").fetchone()[0] +def coverage_build_id(con): + # #305/#306: fingerprint of the COMPLETE per-pid universe the index covers — + # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree + # membership. membership_build_id() alone is blind to this: an index built from + # the SAME membership but a changed source value or a changed located-pid set + # (exactly the #306 class of drift — located pids with no membership) would + # carry an identical membership id and go stale SILENTLY. Order-independent XOR + # of per-row hashes; samp_geo.pid is unique (hard-checked in build_base_tables), + # so no XOR cancellation. COALESCE(source,'') so a NULL→'' source flip is caught. + return con.sql(""" + SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || COALESCE(source, ''))), 0) + AS VARCHAR) + FROM samp_geo""").fetchone()[0] + + +def index_build_id(con): + # #305/#306: the index's generation identity = ":". + # The two halves are deliberately separable so the explorer can BOTH: + # 1. gate the mask-bit interpretation by matching the membership half against + # facet_node_bits.build_id (the bit assignment is a pure function of + # membership — the index masks are only meaningful under the SAME node_bits + # generation), and + # 2. detect coverage/source drift via the coverage half (would otherwise be + # invisible to a membership-only id). + # ':' never appears in either half (both are decimal hash strings), so a single + # split is unambiguous. node_bits/masks keep their plain membership id. + return f"{membership_build_id(con)}:{coverage_build_id(con)}" + + def build_facet_node_bits(con, out, build_id): # #293: authoritative concept_uri -> bit_index assignment per tree dim. The # explorer loads this to turn a node selection into a bitmask and filter @@ -559,6 +594,51 @@ def build_sample_facet_masks(con, out, build_id): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +def build_sample_facet_index(con, out, build_id): + # #305/#306: the COMPLETE per-pid facet index — one row for EVERY located + # sample (samp_geo), not only those with tree membership. This is the artifact + # the multi-filter global-view count path (#304) scans: it must be able to count + # over the whole located universe, including samples that carry NO tree concept. + # + # WHY this exists separately from sample_facet_masks: masks is built FROM + # `membership`, so it silently OMITS located samples with no membership row + # (~29,917 in the 202608 generation — #306). Counting/ filtering off masks alone + # undercounts the located universe. Here we start from samp_geo (the authoritative + # located set), LEFT JOIN the same membership-derived masks, and emit a ZERO mask + # for no-membership pids (a zero mask matches no node bit → contributes 0 to every + # tree facet, which is correct: the sample is in no subtree — but it IS still a + # located sample and still counts toward `source` and the located total). + # + # `source` is a plain VARCHAR (source is exclusive, not multi-valued — a mask + # would be wrong, a facets_v3 semi-join would add a second 6M-row read). The + # mask columns are bit-identical to sample_facet_masks for pids that have + # membership (validated). bit assignment == build_facet_node_bits / masks. + con.execute(f"""COPY ( + WITH nb AS ( + SELECT facet_type, concept_uri, + (1::BIGINT << (ROW_NUMBER() OVER (PARTITION BY facet_type ORDER BY concept_uri) - 1)) AS bitval + FROM (SELECT DISTINCT facet_type, concept_uri FROM membership) + ), + masks AS ( + SELECT m.pid, + COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END), 0)::BIGINT AS material_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END), 0)::BIGINT AS context_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END), 0)::BIGINT AS object_type_mask + FROM membership m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri + GROUP BY m.pid + ) + SELECT sg.pid, + sg.source::VARCHAR AS source, + COALESCE(mk.material_mask, 0)::BIGINT AS material_mask, + COALESCE(mk.context_mask, 0)::BIGINT AS context_mask, + COALESCE(mk.object_type_mask, 0)::BIGINT AS object_type_mask, + '{build_id}' AS build_id, + {INDEX_SCHEMA_VERSION}::INTEGER AS schema_version + FROM samp_geo sg LEFT JOIN masks mk ON mk.pid = sg.pid + ORDER BY sg.pid + ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") + + def file_meta(con, path): n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0] schema = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchall()] @@ -623,7 +703,8 @@ def emit(name, fn): # Hierarchy artifacts (#281/#282) — need vocab_labels for the SKOS tree. HIER_ARTIFACTS = {"sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"} + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"} if any(want(a) for a in HIER_ARTIFACTS): if not args.vocab_labels: # Fail loud if the user EXPLICITLY asked for a hierarchy artifact @@ -641,10 +722,15 @@ def emit(name, fn): # #293 bitmask filter artifacts — needs membership (above). node_bits # and masks share a node-set build_id so the explorer only uses the mask # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks"): + if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): _bid = membership_build_id(con) emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) + # #305/#306: complete per-pid index (every located pid + source, + # zero-mask for no-membership pids). Its build_id embeds the SAME + # membership id as node_bits (mask-bit interpretation gate) PLUS a + # coverage id over samp_geo's (pid, source) universe (staleness gate). + emit("sample_facet_index", lambda o: build_sample_facet_index(con, o, index_build_id(con))) if not args.no_manifest: log("hashing inputs/outputs for manifest…", t0) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index debd8e4..17d3904 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -13,6 +13,8 @@ python scripts/validate_frontend_derived.py \ --facets URL --map-lite URL --summaries URL --cross-filter URL \ --h3 URL4 URL6 URL8 + # hierarchy + #305/#306 complete index (auto-discovered with --dir/--tag, or): + python scripts/validate_frontend_derived.py --dir DIR --tag TAG --index INDEX.parquet """ import argparse, hashlib, json, os, sys import duckdb @@ -55,6 +57,7 @@ def main(): ap.add_argument("--tree-cross-filter", help="facet_tree_cross_filter parquet (#290/#293); optional") ap.add_argument("--node-bits", help="facet_node_bits parquet (#293); optional") ap.add_argument("--masks", help="sample_facet_masks parquet (#293); optional") + ap.add_argument("--index", help="sample_facet_index parquet (#305/#306); optional") ap.add_argument("--wide", help="source wide parquet — enables the SEMANTIC gate " "(re-derive and diff the written files against a fresh build)") ap.add_argument("--min-rows", type=int, default=1_000_000, @@ -300,6 +303,7 @@ def _opt(name, attr): treexf = _opt("facet_tree_cross_filter", "tree_cross_filter") nodebits = _opt("facet_node_bits", "node_bits") masks = _opt("sample_facet_masks", "masks") + index = _opt("sample_facet_index", "index") if tree: T = f"read_parquet('{tree}')" # parent ≥ child for every edge, every dim (distinct-pid UNION semantics — @@ -477,6 +481,88 @@ def _opt(name, attr): + (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a))""") check(f"masks filter == membership ({dim})", d == 0, f"{d} pids differ for a real {dim} node") + # --- #305/#306 complete per-pid facet index (sample_facet_index) --- + # The multi-filter global-view count path scans THIS file. It must cover the + # ENTIRE located universe (facets_v2 = samp_geo), carry the correct source per + # pid, zero-mask the no-membership pids (#306), and be bit-identical to + # sample_facet_masks for pids that DO have membership — otherwise counts drift. + if index: + IX = f"read_parquet('{index}')" + # schema/types contract (the explorer reads these columns positionally-ish) + ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] + EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), + ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), + ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), + ("schema_version", "INTEGER")] + check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # one row per pid + ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") + check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") + # COMPLETENESS (#306): index pid SET == facets_v2 pid SET (the located universe). + # This is the whole point — masks omits no-membership located pids; the index + # must not. A symmetric set-diff catches both missing and extra pids. + ix_diff = scalar(f"""SELECT + (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""") + check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, + f"{ix_diff} pids differ between index and facets_v2") + # SOURCE equality: index.source must equal facets_v2.source for every pid + # (IS NOT DISTINCT FROM so NULL==NULL). A source-value drift is exactly what + # the coverage half of build_id is meant to catch — assert it directly too. + src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid + WHERE i.source IS DISTINCT FROM f.source""") + check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") + # single build_id, structured ":" + ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") + check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE build_id NOT LIKE '%:%'") + check("index: build_id is ':'", ix_bid_fmt == 0, + "build_id(s) missing the ':' coverage separator") + # single schema_version (sanity — the explorer keys behavior off it) + ix_sv = scalar(f"SELECT COUNT(DISTINCT schema_version) FROM {IX}") + check("index: single schema_version", ix_sv == 1, f"{ix_sv} distinct schema_versions") + if nodebits and ix_bids == 1: + # the membership half of the index build_id MUST equal node_bits.build_id + # so the mask bits are interpreted under the SAME assignment (else counts + # are computed against a stale/foreign bit layout — silently wrong). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}')") + if nb_bid == 1: + membership_half_matches = scalar( + f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " + f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") + check("index build_id membership-half == node_bits build_id", + bool(membership_half_matches), + "index masks would be read under a different bit assignment than node_bits") + if masks: + MK = f"read_parquet('{masks}')" + # For pids that ALSO appear in masks (i.e. have membership), the three + # mask columns must be byte-identical — the index is a superset, not a + # re-computation that could drift. + mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") + check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, + f"{mk_diff} shared pids with differing masks") + # The ONLY pids allowed to differ between the two files are the + # no-membership located pids (#306): present in index, absent from masks, + # and they MUST be zero-masked. Verify (a) the extra pids are exactly the + # zero-mask ones and (b) none are in masks. + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) + AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, + f"{extra_nonzero} index-only pids carry a non-zero mask (should be 0)") + elif mem: + # No masks file to diff against, but we can still assert the #306 invariant + # directly off membership: every index pid NOT in membership is zero-masked. + M = f"read_parquet('{mem}')" + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {M} m WHERE m.pid=i.pid) + AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + check("index: no-membership pids zero-masked (#306, vs membership)", extra_nonzero == 0, + f"{extra_nonzero} no-membership pids carry a non-zero mask") + print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True for name, passed, detail in R: diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 94e05bc..73b5701 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -519,6 +519,180 @@ def test_facet_masks_only_builds(tmp_path): assert (tmp_path / "t_sample_facet_masks.parquet").exists() +# --------------------------------------------------------------------------- +# sample_facet_index — COMPLETE per-pid index (#305/#306) +# Reuses the tree vocab but adds a LOCATED sample with NO tree membership +# (root-only material, null context/object) — the exact #306 class that +# sample_facet_masks silently omits. The index must include it, zero-masked. +# --------------------------------------------------------------------------- +# (pid, source, material_rids, context_rids, object_rids) +INDEX_SAMPLES = [ + ("s1", "A", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s2", "A", [3], [11], [21]), # rock, earthinterior, othersolidobject (full membership) + ("s3", "B", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s-nomem", "C", [1], None, None), # material ROOT-only -> NULL material -> NO membership; located +] + + +def build_index_fixture(wide, vocab): + """Tree fixture + a located no-membership sample (the #306 case).""" + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + ic_rows = " UNION ALL ".join( + f"SELECT 'IdentifiedConcept' AS otype, '{uri}' AS pid, {rid}::BIGINT AS row_id, NULL::VARCHAR AS n, " + f"'lbl' AS label, NULL::VARCHAR AS description, NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL AS geometry, NULL::BIGINT[] AS p__has_material_category, NULL::BIGINT[] AS p__has_context_category, " + f"NULL::BIGINT[] AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords" + for rid, uri in TREE_CONCEPTS) + msr = [] + for i, (pid, src, m, c, o) in enumerate(INDEX_SAMPLES): + msr.append( + f"SELECT 'MaterialSampleRecord' AS otype, '{pid}' AS pid, NULL::BIGINT AS row_id, '{src}' AS n, " + f"'label {pid}' AS label, 'desc {pid}' AS description, ['plc-{pid}']::VARCHAR[] AS place_name, " + f"NULL::TIMESTAMP AS result_time, ST_AsWKB(ST_Point({10.0+i},{40.0+i})) AS geometry, " + f"{_arr(m)} AS p__has_material_category, {_arr(c)} AS p__has_context_category, " + f"{_arr(o)} AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords") + con.execute(f"COPY ({ic_rows} UNION ALL {' UNION ALL '.join(msr)}) TO '{wide}' (FORMAT PARQUET)") + edges = " UNION ALL ".join( + f"SELECT '{u}' uri, {('NULL' if p is None else repr(p))}::VARCHAR broader, 'data_v1' uri_form" + for u, p in TREE_EDGES) + con.execute(f"COPY ({edges}) TO '{vocab}' (FORMAT PARQUET)") + con.close() + + +def _build_index(tmp_path, wide, vocab, tag="t"): + cmd = [sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", tag, + "--skip", "wide_h3", "--no-manifest", "--vocab-labels", vocab] + return subprocess.run(cmd, capture_output=True, text=True) + + +def test_sample_facet_index_complete_and_zero_masked(tmp_path): + """#305/#306: the index covers EVERY located pid (incl. the no-membership one, + which masks omits), carries source, and zero-masks no-membership pids while + staying bit-identical to masks for membership pids.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = _build_index(tmp_path, wide, vocab) + assert r.returncode == 0, f"index build failed:\n{r.stdout}\n{r.stderr}" + con = duckdb.connect() + IX = f"read_parquet('{tmp_path / 't_sample_facet_index.parquet'}')" + MK = f"read_parquet('{tmp_path / 't_sample_facet_masks.parquet'}')" + F = f"read_parquet('{tmp_path / 't_sample_facets_v2.parquet'}')" + + # COMPLETENESS: index has all 4 located pids; masks has only the 3 with membership + assert con.sql(f"SELECT COUNT(*) FROM {IX}").fetchone()[0] == 4 + assert con.sql(f"SELECT COUNT(*) FROM {MK}").fetchone()[0] == 3 + # index pid set == facets_v2 pid set (the located universe) + diff = con.sql(f"""SELECT (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""").fetchone()[0] + assert diff == 0, "index pid set != facets_v2 located universe" + # the no-membership located sample IS present and FULLY zero-masked (#306) + row = con.sql(f"SELECT material_mask, context_mask, object_type_mask, source FROM {IX} WHERE pid='s-nomem'").fetchone() + assert row is not None, "s-nomem (located, no membership) missing from index — the #306 bug" + assert row[:3] == (0, 0, 0), f"s-nomem should be zero-masked, got {row[:3]}" + assert row[3] == "C", f"s-nomem source should be 'C', got {row[3]!r}" + # masks bit-identical for shared pids + mk_diff = con.sql(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""").fetchone()[0] + assert mk_diff == 0, "index masks drift from sample_facet_masks for shared pids" + # build_id is structured ":" and its membership half == node_bits + bid = con.sql(f"SELECT DISTINCT build_id FROM {IX}").fetchall() + assert len(bid) == 1 and ":" in bid[0][0], f"index build_id not single/structured: {bid}" + nb_bid = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 't_facet_node_bits.parquet'}')").fetchone()[0] + assert bid[0][0].split(":", 1)[0] == nb_bid, "index build_id membership-half != node_bits build_id" + + +def test_sample_facet_index_coverage_id_changes_with_source(tmp_path): + """The coverage half of build_id must change when source/coverage changes even + though membership is identical (the silent-staleness hole the plan calls out).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab, tag="a").returncode == 0 + con = duckdb.connect() + bid_a = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'a_sample_facet_index.parquet'}')").fetchone()[0] + + # rebuild with a flipped source on s-nomem (same membership, different coverage) + wide2 = str(tmp_path / "wide2.parquet") + con.execute(f"""COPY ( + SELECT * REPLACE (CASE WHEN pid='s-nomem' THEN 'Z' ELSE n END AS n) FROM read_parquet('{wide}') + ) TO '{wide2}' (FORMAT PARQUET)""") + assert _build_index(tmp_path, wide2, vocab, tag="b").returncode == 0 + bid_b = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'b_sample_facet_index.parquet'}')").fetchone()[0] + + mem_a, cov_a = bid_a.split(":", 1) + mem_b, cov_b = bid_b.split(":", 1) + assert mem_a == mem_b, "membership half should be unchanged (membership identical)" + assert cov_a != cov_b, "coverage half should change when a source value changes" + + +def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # clean: validator (incl. the new index gates) must PASS + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", + "--min-rows", "1", "--wide", wide], capture_output=True, text=True) + assert v.returncode == 0, f"validator failed on clean index fixture:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + # corrupt 1: drop the no-membership pid (re-introduce the #306 bug) -> completeness gate fires + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT * FROM read_parquet('{ix}') WHERE pid <> 's-nomem') + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v2 = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v2.returncode != 0 and "index.pid == facets_v2.pid" in v2.stdout, \ + f"completeness gate failed to catch a dropped located pid:\n{v2.stdout}" + + +def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): + """A no-membership pid carrying a non-zero mask is a corruption the gate must catch.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN 1::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "zero-masked" in v.stdout, \ + f"zero-mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_source_drift_caught(tmp_path): + """index.source must equal facets_v2.source; a drifted source must fail the gate.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, + CASE WHEN pid='s1' THEN 'WRONG' ELSE source END AS source, + material_mask, context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "index.source == facets_v2.source" in v.stdout, \ + f"source-equality gate failed to catch drift:\n{v.stdout}" + + +def test_sample_facet_index_only_builds(tmp_path): + """--only sample_facet_index must produce the file (and its prereqs build).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", + "--no-manifest", "--vocab-labels", vocab, + "--only", "sample_facet_index"], capture_output=True, text=True) + assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" + assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ + f"--only sample_facet_index produced no file:\n{r.stdout}" + + def test_scheme_corruption_caught(tmp_path): wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") assert _build(tmp_path, wide).returncode == 0 From bf3da2a0293ece399874c6819ad89eae9654f904 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:19:10 -0700 Subject: [PATCH 3/7] #306/#305: harden index validator + fingerprint (Codex round 1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address Codex adversarial review of the sample_facet_index Phase-1 diff: - #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id halves from the written siblings (coverage from sample_facets_v2, membership from sample_facet_membership) and asserts equality — a stale or hand-edited coverage id now FAILS at build time even though node_bits is unchanged. Runtime explorer handshake documented as a Phase 2 item. - #2 validator metadata/NULL gates: schema_version must equal the exact contract version (999 was accepted); build_id must match the exact shape __:__ ("a colon somewhere" accepted :bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0). - #3 independent mask check: re-derive masks from membership + node_bits for EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the optional sample_facet_masks file. - #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is built, so --only sample_facet_index can't ship an uninterpretable mask file. - #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+ COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with a sentinel byte so NULL≠''; honest non-cryptographic caveat. - SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the dup check, broke pid joins, and was absent from the coverage hash). Tests: +5 adversarial (null mask, schema_version=999, malformed build_id, stale coverage id, null pid) — all gates bite. 34/34 pass. Refs #305 #306 --- SERIALIZATIONS.md | 21 +++-- scripts/build_frontend_derived.py | 81 ++++++++++++++----- scripts/validate_frontend_derived.py | 111 ++++++++++++++++++++------- tests/test_frontend_derived.py | 105 +++++++++++++++++++++++-- 4 files changed, 257 insertions(+), 61 deletions(-) diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index bc9e0d1..2488f09 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -325,11 +325,22 @@ for the alias when you want "latest." dimensions, **excluding the target dimension's own predicate**). This is the already-shipped cube/UI semantics; this index does not change it, it serves the same contract to the *multi-filter* case the single-filter cube cannot reach. -- **Validation**: `validate_frontend_derived.py --index ` asserts: schema, - one row per pid, `pid` set == `sample_facets_v2.pid` (the located universe), - `source` == facets_v2 source, single structured `build_id` whose membership half - matches `node_bits`, masks bit-identical to `sample_facet_masks` for shared pids, - and that every no-membership pid is zero-masked. +- **Validation**: `validate_frontend_derived.py --index ` is an INDEPENDENT + oracle — it asserts schema + NOT-NULL columns, one row per pid, `pid` set == + `sample_facets_v2.pid` (the located universe), `source` == facets_v2 source, + `schema_version` == the exact contract version, a well-formed single `build_id`, + and then **recomputes** both `build_id` halves from the written sibling files (the + coverage half from `sample_facets_v2`, the membership half from + `sample_facet_membership`) and asserts equality — so a stale/edited coverage id is + caught at build time even though `node_bits` is unchanged. Masks are re-derived + directly from `membership` + `node_bits` for **every** index pid (zero rows + included) and symmetric-diffed against the file. +- **Runtime staleness handshake (Phase 2)**: the *explorer* loads the index by URL + and so cannot, on its own, know the expected coverage half. The load-time refusal + of a stale-coverage index requires an external trusted pointer (e.g. the expected + composite id co-published in the manifest / a `current` alias the explorer reads); + that handshake lands with the count-query consumer in Phase 2. Build-time + validation (above) is the gate for now. - **Immutability**: published under a **new** versioned filename — it is a new artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index 661a78e..13a1d5f 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -208,12 +208,19 @@ def build_base_tables(con, wide, t0): n_geo = con.sql("SELECT COUNT(*) FROM samp_geo").fetchone()[0] n_dup = con.sql("SELECT COUNT(*) FROM (SELECT pid FROM samp_geo GROUP BY pid HAVING COUNT(*)>1)").fetchone()[0] n_icdup = con.sql("SELECT COUNT(*) FROM (SELECT row_id FROM ic GROUP BY row_id HAVING COUNT(*)>1)").fetchone()[0] - log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} duplicate_concept_row_ids={n_icdup:,}", t0) - if n_dup or n_icdup: - # HARD fail (Codex): non-unique keys make the output grain wrong (inflated + # #305/#306 (Codex): the GROUP BY...HAVING COUNT>1 dup check above collapses all + # NULL pids into ONE group, so a SINGLE NULL pid slips through — yet a NULL pid + # breaks every pid join (joins drop it) and is silently absent from the coverage + # fingerprint. Reject it explicitly. + n_null = con.sql("SELECT COUNT(*) FROM samp_geo WHERE pid IS NULL").fetchone()[0] + log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} " + f"null_pids={n_null:,} duplicate_concept_row_ids={n_icdup:,}", t0) + if n_dup or n_icdup or n_null: + # HARD fail (Codex): non-unique/NULL keys make the output grain wrong (inflated # facet counts, ambiguous joins, non-total ORDER BY pid). Abort before writing. raise SystemExit( - f"FATAL: non-unique keys — duplicate_pids={n_dup}, duplicate_concept_row_ids={n_icdup}. " + f"FATAL: non-unique/NULL keys — duplicate_pids={n_dup}, null_pids={n_null}, " + f"duplicate_concept_row_ids={n_icdup}. " f"Output grain/joins would be wrong; refusing to write.") @@ -499,6 +506,32 @@ def build_facet_tree_cross_filter(con, out): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +# --- generation fingerprints (#293, #305/#306) ----------------------------- +# The per-row token expressions are module constants so the VALIDATOR can +# recompute the exact same fingerprint independently from the written sibling +# files (membership, sample_facets_v2) and assert the index's build_id — i.e. +# the builder is not the sole authority on the index's generation identity. +# chr(31) (US) separates fields. A NULL source is encoded with a distinct +# sentinel byte so NULL and '' can never collide (Codex #5). +MEMBERSHIP_TOKEN_EXPR = "pid || chr(31) || facet_type || chr(31) || concept_uri" +COVERAGE_TOKEN_EXPR = ("pid || chr(31) || CASE WHEN source IS NULL THEN chr(0) ELSE chr(1) END " + "|| chr(31) || COALESCE(source, '')") + + +def _fingerprint(con, relation, token_expr): + # Order-independent generation fingerprint over `relation`. Combines THREE + # order-independent accumulators — XOR, SUM (HUGEINT, no overflow), and COUNT — + # of per-row hashes. XOR alone cancels identical rows and is vulnerable to a + # 2-row swap; SUM+COUNT defeat that. NOT a cryptographic digest: this defends + # against accidental drift / stale generations, not an adversary who engineers a + # multi-row hash collision. Grain is unique (validated), so the trio is stable. + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f" COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) " + f"FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def membership_build_id(con): # #293 (Codex P1, r2): a fingerprint of the FULL membership generation — not # just the node set. Both node_bits (positional bit assignment) and masks are @@ -506,12 +539,8 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). Order-independent XOR of per-row hashes (membership grain is - # unique per (pid,facet_type,concept_uri) — validated — so no XOR cancellation). - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || facet_type || chr(31) || concept_uri)), 0) - AS VARCHAR) - FROM membership""").fetchone()[0] + # masks file). membership grain is unique per (pid,facet_type,concept_uri). + return _fingerprint(con, "membership", MEMBERSHIP_TOKEN_EXPR) def coverage_build_id(con): @@ -519,14 +548,10 @@ def coverage_build_id(con): # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree # membership. membership_build_id() alone is blind to this: an index built from # the SAME membership but a changed source value or a changed located-pid set - # (exactly the #306 class of drift — located pids with no membership) would - # carry an identical membership id and go stale SILENTLY. Order-independent XOR - # of per-row hashes; samp_geo.pid is unique (hard-checked in build_base_tables), - # so no XOR cancellation. COALESCE(source,'') so a NULL→'' source flip is caught. - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || COALESCE(source, ''))), 0) - AS VARCHAR) - FROM samp_geo""").fetchone()[0] + # (exactly the #306 class of drift — located pids with no membership) would carry + # an identical membership id and go stale SILENTLY. NULL vs '' source are encoded + # distinctly (sentinel byte) so a NULL↔'' flip changes the fingerprint. + return _fingerprint(con, "samp_geo", COVERAGE_TOKEN_EXPR) def index_build_id(con): @@ -538,8 +563,9 @@ def index_build_id(con): # generation), and # 2. detect coverage/source drift via the coverage half (would otherwise be # invisible to a membership-only id). - # ':' never appears in either half (both are decimal hash strings), so a single - # split is unambiguous. node_bits/masks keep their plain membership id. + # Each half is "__" (decimal + underscore only); ':' never + # appears in either half, so split(':', 1) is unambiguous. node_bits/masks keep + # their plain membership id (== the membership half here, for the match gate). return f"{membership_build_id(con)}:{coverage_build_id(con)}" @@ -724,7 +750,20 @@ def emit(name, fn): # path when the two are from the same generation (Codex P1). if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): _bid = membership_build_id(con) - emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + # facet_node_bits is the bit-INTERPRETATION file: sample_facet_masks + # AND sample_facet_index store raw mask bits that are meaningless + # without it. So node_bits is FORCE-emitted whenever either consumer + # is built — even under `--only sample_facet_masks/_index` — so a + # build never ships an orphan mask file (Codex #4). The shared _bid + # lets the explorer gate the two on the same generation (Codex #293). + force_node_bits = want("sample_facet_masks") or want("sample_facet_index") + if want("facet_node_bits") or force_node_bits: + build_node = lambda o: build_facet_node_bits(con, o, _bid) + if want("facet_node_bits"): + emit("facet_node_bits", build_node) + else: + build_node(p("facet_node_bits")); produced.append(p("facet_node_bits")) + log("facet_node_bits ✓ (auto-paired with masks/index)", t0) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) # #305/#306: complete per-pid index (every located pid + source, # zero-mask for no-membership pids). Its build_id embeds the SAME diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index 17d3904..84a4294 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -487,14 +487,35 @@ def _opt(name, attr): # pid, zero-mask the no-membership pids (#306), and be bit-identical to # sample_facet_masks for pids that DO have membership — otherwise counts drift. if index: + # Import the builder's canonical schema version + fingerprint token exprs so + # the validator is an INDEPENDENT oracle (Codex #1/#5): it recomputes the + # expected build_id from the written sibling files and asserts equality, + # rather than trusting whatever the builder stamped. + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, + COVERAGE_TOKEN_EXPR) + + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + IX = f"read_parquet('{index}')" - # schema/types contract (the explorer reads these columns positionally-ish) + # schema/types contract (the explorer reads these columns by name) ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), ("schema_version", "INTEGER")] check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # NOT-NULL contract (source MAY be NULL — it mirrors facets_v2; everything + # else must be present). A NULL mask is a real corruption (Codex #2): the + # zero-mask check below uses IS DISTINCT FROM so it also rejects NULL. + nn_bad = scalar(f"""SELECT COUNT(*) FROM {IX} WHERE pid IS NULL + OR material_mask IS NULL OR context_mask IS NULL OR object_type_mask IS NULL + OR build_id IS NULL OR schema_version IS NULL""") + check("index: required columns non-NULL", nn_bad == 0, f"{nn_bad} rows with NULL pid/mask/build_id/version") # one row per pid ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") @@ -507,20 +528,42 @@ def _opt(name, attr): check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, f"{ix_diff} pids differ between index and facets_v2") # SOURCE equality: index.source must equal facets_v2.source for every pid - # (IS NOT DISTINCT FROM so NULL==NULL). A source-value drift is exactly what - # the coverage half of build_id is meant to catch — assert it directly too. + # (IS DISTINCT FROM so NULL==NULL). A source-value drift is exactly what the + # coverage half of build_id is meant to catch — assert it directly too. src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid WHERE i.source IS DISTINCT FROM f.source""") check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") - # single build_id, structured ":" + # schema_version must be EXACTLY the contract version (Codex #2: "one distinct + # value" let 999 through). The explorer refuses an index it can't interpret. + ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") + check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, + f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") + # build_id must be a SINGLE value of the EXACT shape "__:__" + # — exactly two ':'-separated halves, each exactly three '_'-separated decimals. + # (Codex #2: "a colon somewhere" accepted ":bogus:extra".) ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") - ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE build_id NOT LIKE '%:%'") - check("index: build_id is ':'", ix_bid_fmt == 0, - "build_id(s) missing the ':' coverage separator") - # single schema_version (sanity — the explorer keys behavior off it) - ix_sv = scalar(f"SELECT COUNT(DISTINCT schema_version) FROM {IX}") - check("index: single schema_version", ix_sv == 1, f"{ix_sv} distinct schema_versions") + bid_re = r'^[0-9]+_[0-9]+_[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") + check("index: build_id is well-formed ':'", ix_bid_fmt == 0, + "build_id(s) not exactly __:__") + # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is + # recomputed from facets_v2 (the located (pid, source) universe — equivalent + # to samp_geo) and, when membership is present, the membership half from it. + # This is the gate that actually catches coverage/source staleness at + # validation time — a membership-only id (or a hand-edited coverage half) + # FAILS here even though node_bits is unchanged. + if ix_bids == 1: + ix_bid = scalar(f"SELECT MIN(build_id) FROM {IX}") + exp_cov = _fp(F, COVERAGE_TOKEN_EXPR) + got_cov = ix_bid.split(":", 1)[1] if ":" in ix_bid else "" + check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, + f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") + if mem: + exp_mem = _fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) + got_mem = ix_bid.split(":", 1)[0] + check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, + f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)") if nodebits and ix_bids == 1: # the membership half of the index build_id MUST equal node_bits.build_id # so the mask bits are interpreted under the SAME assignment (else counts @@ -533,35 +576,45 @@ def _opt(name, attr): check("index build_id membership-half == node_bits build_id", bool(membership_half_matches), "index masks would be read under a different bit assignment than node_bits") - if masks: + # INDEPENDENT mask re-derivation (Codex #3): re-derive the expected masks for + # EVERY index pid directly from membership + node_bits (zero for pids with no + # membership) and symmetric-diff against the index. This validates the masks + # without trusting the optional sample_facet_masks file, AND covers the + # zero-mask rows that a masks-only diff never sees. + if mem and nodebits: + M = f"read_parquet('{mem}')" + NB = f"read_parquet('{nodebits}')" + ref = (f"WITH nb AS (SELECT facet_type, concept_uri, (1::BIGINT << bit_index) AS bitval FROM {NB}), " + f"mm AS (SELECT m.pid, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END),0)::BIGINT material_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END),0)::BIGINT context_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END),0)::BIGINT object_type_mask " + f"FROM {M} m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri GROUP BY m.pid) " + f"SELECT i.pid, COALESCE(mm.material_mask,0)::BIGINT material_mask, " + f"COALESCE(mm.context_mask,0)::BIGINT context_mask, " + f"COALESCE(mm.object_type_mask,0)::BIGINT object_type_mask " + f"FROM (SELECT pid FROM {IX}) i LEFT JOIN mm ON mm.pid=i.pid") + fil = f"SELECT pid, material_mask, context_mask, object_type_mask FROM {IX}" + mm_diff = scalar(f"SELECT (SELECT COUNT(*) FROM (({ref}) EXCEPT ({fil}))) " + f"+ (SELECT COUNT(*) FROM (({fil}) EXCEPT ({ref})))") + check("index masks == re-derived from membership+node_bits (all pids)", mm_diff == 0, + f"{mm_diff} index rows disagree with an independent re-derivation") + elif masks: + # Fallback when membership/node_bits aren't both available: diff shared + # pids against the masks file and assert no-membership extras are zero. MK = f"read_parquet('{masks}')" - # For pids that ALSO appear in masks (i.e. have membership), the three - # mask columns must be byte-identical — the index is a superset, not a - # re-computation that could drift. mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid WHERE i.material_mask IS DISTINCT FROM m.material_mask OR i.context_mask IS DISTINCT FROM m.context_mask OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, f"{mk_diff} shared pids with differing masks") - # The ONLY pids allowed to differ between the two files are the - # no-membership located pids (#306): present in index, absent from masks, - # and they MUST be zero-masked. Verify (a) the extra pids are exactly the - # zero-mask ones and (b) none are in masks. extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) - AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") + AND (i.material_mask IS DISTINCT FROM 0 OR i.context_mask IS DISTINCT FROM 0 + OR i.object_type_mask IS DISTINCT FROM 0)""") check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, - f"{extra_nonzero} index-only pids carry a non-zero mask (should be 0)") - elif mem: - # No masks file to diff against, but we can still assert the #306 invariant - # directly off membership: every index pid NOT in membership is zero-masked. - M = f"read_parquet('{mem}')" - extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i - WHERE NOT EXISTS (SELECT 1 FROM {M} m WHERE m.pid=i.pid) - AND (i.material_mask <> 0 OR i.context_mask <> 0 OR i.object_type_mask <> 0)""") - check("index: no-membership pids zero-masked (#306, vs membership)", extra_nonzero == 0, - f"{extra_nonzero} no-membership pids carry a non-zero mask") + f"{extra_nonzero} index-only pids carry a non-zero/NULL mask (should be 0)") print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 73b5701..a6140aa 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -211,6 +211,24 @@ def test_duplicate_pid_hard_fails(tmp_path): f"builder should hard-fail on duplicate pids:\n{r.stdout}\n{r.stderr}" +def test_null_pid_hard_fails(tmp_path): + """Codex (#305 SQL gap): a SINGLE NULL pid slips the GROUP BY...HAVING COUNT>1 + dup check but breaks every pid join + the coverage fingerprint. Reject it.""" + wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + nullp = str(tmp_path / "wide_nullpid.parquet") + # append one located row with a NULL pid + con.execute(f"""COPY ( + SELECT * FROM read_parquet('{wide}') + UNION ALL + SELECT * REPLACE (NULL::VARCHAR AS pid) FROM read_parquet('{wide}') WHERE pid='m-real-first' + ) TO '{nullp}' (FORMAT PARQUET)""") + con.close() + r = _build(tmp_path, nullp) + assert r.returncode != 0 and "null_pids=1" in (r.stdout + r.stderr).lower(), \ + f"builder should hard-fail on a NULL pid:\n{r.stdout}\n{r.stderr}" + + def test_manifest_emitted(tmp_path): import json wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") @@ -648,7 +666,8 @@ def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): - """A no-membership pid carrying a non-zero mask is a corruption the gate must catch.""" + """A no-membership pid carrying a non-zero mask is a corruption the gate must + catch. With membership+node_bits present the independent re-derivation fires.""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) assert _build_index(tmp_path, wide, vocab).returncode == 0 @@ -660,8 +679,78 @@ def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], capture_output=True, text=True) - assert v.returncode != 0 and "zero-masked" in v.stdout, \ - f"zero-mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + assert v.returncode != 0 and "re-derived from membership" in v.stdout, \ + f"mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_null_mask_caught(tmp_path): + """Codex #2: a NULL mask escaped the old `<> 0` predicate (NULL=unknown). The + NOT-NULL gate must catch it.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN NULL::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "non-NULL" in v.stdout, \ + f"NOT-NULL gate failed to catch a NULL mask:\n{v.stdout}" + + +def test_sample_facet_index_bad_schema_version_caught(tmp_path): + """Codex #2: schema_version=999 passed the old 'one distinct value' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + build_id, 999::INTEGER AS schema_version FROM read_parquet('{ix}')) + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "schema_version ==" in v.stdout, \ + f"schema_version gate failed to catch 999:\n{v.stdout}" + + +def test_sample_facet_index_malformed_build_id_caught(tmp_path): + """Codex #2: ':bogus:extra' passed the old 'a colon somewhere' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # keep a correct membership half, append junk extra ':' segments + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':bogus:extra' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "well-formed" in v.stdout, \ + f"build_id format gate failed to catch a malformed id:\n{v.stdout}" + + +def test_sample_facet_index_stale_coverage_id_caught(tmp_path): + """Codex #1: a build_id whose coverage half doesn't match the actual (pid,source) + universe (e.g. membership-only id, or hand-edited coverage) must fail — even + though node_bits is unchanged.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # replace the coverage half with a structurally-valid but WRONG fingerprint + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':1_2_3' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "coverage-half" in v.stdout, \ + f"coverage staleness gate failed to catch a wrong coverage id:\n{v.stdout}" def test_sample_facet_index_source_drift_caught(tmp_path): @@ -681,8 +770,10 @@ def test_sample_facet_index_source_drift_caught(tmp_path): f"source-equality gate failed to catch drift:\n{v.stdout}" -def test_sample_facet_index_only_builds(tmp_path): - """--only sample_facet_index must produce the file (and its prereqs build).""" +def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): + """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits + (the bit-interpretation file) is force-emitted alongside it. The resulting pair + must pass the validator (which needs node_bits to re-derive the masks).""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", @@ -690,7 +781,9 @@ def test_sample_facet_index_only_builds(tmp_path): "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ - f"--only sample_facet_index produced no file:\n{r.stdout}" + f"--only sample_facet_index produced no index file:\n{r.stdout}" + assert (tmp_path / "t_facet_node_bits.parquet").exists(), \ + f"--only sample_facet_index did not auto-pair facet_node_bits (orphan):\n{r.stdout}" def test_scheme_corruption_caught(tmp_path): From a308fe2288d678f50aac2012227e3c49a651e26f Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:24:32 -0700 Subject: [PATCH 4/7] #306/#305: close index validator fail-open gaps (Codex round 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - r2 #1: index validation now FAILS CLOSED when sample_facet_membership or facet_node_bits is absent — those siblings are what the build_id recompute and mask re-derivation depend on, so a missing sibling must gate, not run a silent partial pass (Codex reproduced a corrupt index + restamped node_bits passing once the siblings were removed). They're always co-produced, so --dir/--tag satisfies it. - r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the membership-half match — 0 or >1 distinct ids previously skipped the match entirely (fail-open). Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly; coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected; node_bits force-emit has no double-build and masks/index can't ship without node_bits (even with --skip). Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass. Refs #305 #306 --- scripts/validate_frontend_derived.py | 19 +++++++++++++-- tests/test_frontend_derived.py | 36 ++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+), 2 deletions(-) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index 84a4294..d4aac72 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -502,6 +502,16 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint return f"{x}_{s}_{n}" IX = f"read_parquet('{index}')" + # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated + # without BOTH sample_facet_membership and facet_node_bits — they are what + # the independent build_id recompute and mask re-derivation depend on. If + # either is absent we must FAIL rather than silently run a partial (fail-open) + # validation that would pass a corrupt index/node_bits pair. They are always + # co-produced by a build, so --dir/--tag (or passing both) satisfies this. + check("index validation has membership + node_bits (required to gate)", + bool(mem and nodebits), + "pass --membership and --node-bits (or --dir/--tag with both present); " + "the index can't be validated in isolation") # schema/types contract (the explorer reads these columns by name) ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), @@ -568,8 +578,13 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint # the membership half of the index build_id MUST equal node_bits.build_id # so the mask bits are interpreted under the SAME assignment (else counts # are computed against a stale/foreign bit layout — silently wrong). - nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}')") - if nb_bid == 1: + # (Codex r2 #2): assert node_bits has EXACTLY ONE non-NULL build_id FIRST — + # 0 or >1 distinct ids previously skipped the match entirely (fail-open). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}') WHERE build_id IS NOT NULL") + nb_null = scalar(f"SELECT COUNT(*) FROM read_parquet('{nodebits}') WHERE build_id IS NULL") + check("node_bits: exactly one non-NULL build_id", nb_bid == 1 and nb_null == 0, + f"{nb_bid} distinct non-NULL build_ids, {nb_null} NULLs (want 1 / 0)") + if nb_bid == 1 and nb_null == 0: membership_half_matches = scalar( f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index a6140aa..bcf0152 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -770,6 +770,42 @@ def test_sample_facet_index_source_drift_caught(tmp_path): f"source-equality gate failed to catch drift:\n{v.stdout}" +def test_sample_facet_index_validation_requires_siblings(tmp_path): + """Codex r2 #1: validating the index WITHOUT membership+node_bits must FAIL + (fail-closed), not silently run a partial pass. Reproduces the bypass where a + corrupt index + restamped node_bits passed once the siblings were removed.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the siblings the index validation depends on + os.remove(tmp_path / "t_sample_facet_membership.parquet") + os.remove(tmp_path / "t_facet_node_bits.parquet") + os.remove(tmp_path / "t_sample_facet_masks.parquet") + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "required to gate" in v.stdout, \ + f"index validation should fail-closed without membership+node_bits:\n{v.stdout}" + + +def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): + """Codex r2 #2: a node_bits file with !=1 non-NULL build_id previously skipped + the membership-half match entirely. It must now FAIL.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + nb = str(tmp_path / "t_facet_node_bits.parquet") + con = duckdb.connect(); tmp_nb = nb + ".tmp" + # give node_bits TWO distinct build_ids (one row restamped) + con.execute(f"""COPY (SELECT facet_type, concept_uri, bit_index, + CASE WHEN bit_index = (SELECT MIN(bit_index) FROM read_parquet('{nb}')) + THEN 'OTHER_GEN' ELSE build_id END AS build_id + FROM read_parquet('{nb}')) TO '{tmp_nb}' (FORMAT PARQUET)"""); os.replace(tmp_nb, nb) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "exactly one non-NULL build_id" in v.stdout, \ + f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" + + def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits (the bit-interpretation file) is force-emitted alongside it. The resulting pair From 23cefacb44916470204b258e07e974e19f2bbbb2 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:29:47 -0700 Subject: [PATCH 5/7] #306/#305: force-emit full fast-path bundle with index (Codex round 3) Codex r3 caught a regression my own fixes introduced: --only sample_facet_index emitted index + node_bits but NOT sample_facet_membership, so the new validator gate (which requires membership as its independent oracle) would reject that output. Index validation legitimately also needs sample_facets_v2 (the #306 completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against an existing full build, by design. - force_dep(): whenever masks or index is built, force-emit BOTH sample_facet_membership and facet_node_bits (the bundle the index validator depends on), exactly once, recorded for the manifest. - test now proves the real workflow: full build -> --only sample_facet_index rebuild into the same dir -> re-validate PASSES (not just file-existence). Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every bypass) with no new skip path. 36/36 pass. Refs #305 #306 --- scripts/build_frontend_derived.py | 47 ++++++++++++++++++------------- tests/test_frontend_derived.py | 26 +++++++++++------ 2 files changed, 46 insertions(+), 27 deletions(-) diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index 13a1d5f..bb1ac3c 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -741,29 +741,38 @@ def emit(name, fn): log("SKIP hierarchy artifacts: pass --vocab-labels ", t0) else: build_concept_membership(con, args.wide, args.vocab_labels, t0) - emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + # The mask fast-path bundle = {membership, node_bits, masks?, index}. + # sample_facet_masks AND sample_facet_index store raw mask bits that are + # uninterpretable without facet_node_bits, and that the validator can only + # gate by re-deriving from sample_facet_membership. So whenever masks or + # index is requested, FORCE-emit membership + node_bits too (even under + # `--only sample_facet_index`) — otherwise the build ships an artifact its + # own validator must reject (Codex #4 / r3). force_dep() builds a not-wanted + # artifact exactly once and records it for the manifest. + need_fastpath = want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index") + force_deps = want("sample_facet_masks") or want("sample_facet_index") + def force_dep(name, fn): + if want(name): + emit(name, fn) + elif p(name) not in produced: + fn(p(name)); produced.append(p(name)); log(f"{name} ✓ (auto-paired with masks/index)", t0) + + if force_deps: + force_dep("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + else: + emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) emit("facet_tree_summaries", lambda o: build_facet_tree_summaries(con, o)) # #290/#293 cross-filter cube — needs membership (above) + samp_geo (source). emit("facet_tree_cross_filter", lambda o: build_facet_tree_cross_filter(con, o)) - # #293 bitmask filter artifacts — needs membership (above). node_bits - # and masks share a node-set build_id so the explorer only uses the mask - # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index"): + # #293 bitmask filter artifacts — needs membership (above). node_bits and + # masks share a node-set build_id so the explorer only uses the mask path + # when the two are from the same generation (Codex P1). + if need_fastpath: _bid = membership_build_id(con) - # facet_node_bits is the bit-INTERPRETATION file: sample_facet_masks - # AND sample_facet_index store raw mask bits that are meaningless - # without it. So node_bits is FORCE-emitted whenever either consumer - # is built — even under `--only sample_facet_masks/_index` — so a - # build never ships an orphan mask file (Codex #4). The shared _bid - # lets the explorer gate the two on the same generation (Codex #293). - force_node_bits = want("sample_facet_masks") or want("sample_facet_index") - if want("facet_node_bits") or force_node_bits: - build_node = lambda o: build_facet_node_bits(con, o, _bid) - if want("facet_node_bits"): - emit("facet_node_bits", build_node) - else: - build_node(p("facet_node_bits")); produced.append(p("facet_node_bits")) - log("facet_node_bits ✓ (auto-paired with masks/index)", t0) + if force_deps: + force_dep("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + else: + emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) # #305/#306: complete per-pid index (every located pid + source, # zero-mask for no-membership pids). Its build_id embeds the SAME diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index bcf0152..42ac846 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -806,20 +806,30 @@ def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" -def test_sample_facet_index_only_auto_pairs_node_bits(tmp_path): - """Codex #4: --only sample_facet_index must NOT ship an orphan — facet_node_bits - (the bit-interpretation file) is force-emitted alongside it. The resulting pair - must pass the validator (which needs node_bits to re-derive the masks).""" +def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): + """Codex #4/r3: --only sample_facet_index must NOT ship an orphan. It + force-emits its whole fast-path bundle — facet_node_bits (bit interpretation) + AND sample_facet_membership (the validator's independent oracle). A partial + rebuild into a directory holding a prior FULL build must then re-validate + cleanly (the real iterate-on-one-artifact workflow).""" wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") build_index_fixture(wide, vocab) + # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # 2) rebuild ONLY the index into the same dir r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", "--no-manifest", "--vocab-labels", vocab, "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" - assert (tmp_path / "t_sample_facet_index.parquet").exists(), \ - f"--only sample_facet_index produced no index file:\n{r.stdout}" - assert (tmp_path / "t_facet_node_bits.parquet").exists(), \ - f"--only sample_facet_index did not auto-pair facet_node_bits (orphan):\n{r.stdout}" + # the bundle the index needs to be validatable was (re)emitted, not orphaned + for sib in ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet"): + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not emit {sib}:\n{r.stdout}" + # 3) the rebuilt artifacts validate against the existing full-build siblings + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode == 0, f"--only sample_facet_index rebuild failed validation:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout def test_scheme_corruption_caught(tmp_path): From a20a94f0f691e221811ee141179b081745635951 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 19 Jun 2026 22:32:57 -0700 Subject: [PATCH 6/7] #306/#305: prove index-only rebuild RE-EMITS its bundle (Codex round 4) Delete index+node_bits+membership after the full build and before the --only sample_facet_index rebuild, so a stale full-build file can't satisfy the existence checks. Retains the re-emission + validator-pass assertions. Codex r4 confirmed the implementation correct: index-only and masks-only emit membership+node_bits exactly once, no duplicate emissions in a full build, partial-rebuild validation passes, manifest records each forced dep once, and no invocation produces masks/index without node_bits+membership. 36/36 pass. Refs #305 #306 --- tests/test_frontend_derived.py | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 42ac846..8ef109c 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -816,15 +816,21 @@ def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): build_index_fixture(wide, vocab) # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the bundle the index-only rebuild must RE-EMIT, so a stale full-build + # file can't satisfy the existence checks (Codex r4: prove re-emission, not + # leftover presence). + bundle = ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet") + for sib in bundle: + os.remove(tmp_path / sib) # 2) rebuild ONLY the index into the same dir r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", "--no-manifest", "--vocab-labels", vocab, "--only", "sample_facet_index"], capture_output=True, text=True) assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" # the bundle the index needs to be validatable was (re)emitted, not orphaned - for sib in ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", - "t_sample_facet_membership.parquet"): - assert (tmp_path / sib).exists(), f"--only sample_facet_index did not emit {sib}:\n{r.stdout}" + for sib in bundle: + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not re-emit {sib}:\n{r.stdout}" # 3) the rebuilt artifacts validate against the existing full-build siblings v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], capture_output=True, text=True) From 39dc101f20c8312d80d17cae656ba9cf0e0634d0 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Sat, 20 Jun 2026 06:10:17 -0700 Subject: [PATCH 7/7] #305/#306: keep membership_build_id format-compatible with deployed node_bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio — but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's facetIndexReady preflight matches the index's membership-half against the deployed node_bits.build_id. The trio format would never match, leaving multi-filter counts permanently "unavailable" against already-published artifacts. Revert membership_build_id to the bare order-independent XOR (grain is unique → no cancellation, as the original argued). Keep the richer trio ONLY for coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore ":__". Validator recomputes the membership half with bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape. Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id, so the index can be published standalone (new filename, non-destructive) without republishing node_bits/masks/membership. 36/36 pass. Refs #305 #306 --- scripts/build_frontend_derived.py | 22 +++++++++++++++++----- scripts/validate_frontend_derived.py | 19 ++++++++++++------- 2 files changed, 29 insertions(+), 12 deletions(-) diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index bb1ac3c..dd96baf 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -539,8 +539,19 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). membership grain is unique per (pid,facet_type,concept_uri). - return _fingerprint(con, "membership", MEMBERSHIP_TOKEN_EXPR) + # masks file). + # + # FORMAT IS A DEPLOYED CONTRACT (#305): the live facet_node_bits / sample_facet_masks + # (202608) carry this id as a BARE bit_xor decimal, and the explorer's facetIndexReady + # preflight matches the index's membership-half against the DEPLOYED node_bits.build_id. + # So this MUST stay the plain order-independent XOR — do NOT switch it to the + # _fingerprint trio (that would change the string and break generation-matching + # against every already-published artifact). membership grain is unique per + # (pid,facet_type,concept_uri) — validated — so no XOR cancellation. The richer + # trio is reserved for coverage_build_id, which is a NEW id with no compat constraint. + return con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({MEMBERSHIP_TOKEN_EXPR})), 0) AS VARCHAR) FROM membership" + ).fetchone()[0] def coverage_build_id(con): @@ -563,9 +574,10 @@ def index_build_id(con): # generation), and # 2. detect coverage/source drift via the coverage half (would otherwise be # invisible to a membership-only id). - # Each half is "__" (decimal + underscore only); ':' never - # appears in either half, so split(':', 1) is unambiguous. node_bits/masks keep - # their plain membership id (== the membership half here, for the match gate). + # FORMATS DIFFER BY DESIGN: membership half is the BARE bit_xor decimal (the + # deployed node_bits/masks contract — see membership_build_id), coverage half is + # the richer "__" trio (a new id). ':' appears in neither half + # (decimals + underscores only), so split(':', 1) is unambiguous. return f"{membership_build_id(con)}:{coverage_build_id(con)}" diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index d4aac72..754c9d0 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -495,12 +495,16 @@ def _opt(name, attr): from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, COVERAGE_TOKEN_EXPR) - def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint (coverage trio) x, s, n = con.sql( f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() return f"{x}_{s}_{n}" + def _xor_fp(relation, token_expr): # mirror membership_build_id (bare XOR — deployed contract) + return str(con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({token_expr})), 0) AS VARCHAR) FROM {relation}").fetchone()[0]) + IX = f"read_parquet('{index}')" # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated # without BOTH sample_facet_membership and facet_node_bits — they are what @@ -548,15 +552,16 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") - # build_id must be a SINGLE value of the EXACT shape "__:__" - # — exactly two ':'-separated halves, each exactly three '_'-separated decimals. - # (Codex #2: "a colon somewhere" accepted ":bogus:extra".) + # build_id must be a SINGLE value of the EXACT shape ":__" + # — the membership half is the BARE bit_xor decimal (deployed node_bits contract), + # the coverage half is the three '_'-separated decimal trio. (Codex #2: "a colon + # somewhere" accepted ":bogus:extra".) ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") - bid_re = r'^[0-9]+_[0-9]+_[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + bid_re = r'^[0-9]+:[0-9]+_[0-9]+_[0-9]+$' ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") check("index: build_id is well-formed ':'", ix_bid_fmt == 0, - "build_id(s) not exactly __:__") + "build_id(s) not exactly :__") # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is # recomputed from facets_v2 (the located (pid, source) universe — equivalent # to samp_geo) and, when membership is present, the membership half from it. @@ -570,7 +575,7 @@ def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") if mem: - exp_mem = _fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) + exp_mem = _xor_fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) # bare XOR (deployed contract) got_mem = ix_bid.split(":", 1)[0] check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)")