diff --git a/PLAN_305_facet_counts.md b/PLAN_305_facet_counts.md new file mode 100644 index 0000000..22f4c7a --- /dev/null +++ b/PLAN_305_facet_counts.md @@ -0,0 +1,128 @@ +# Plan — #305 Facet-count correctness & coherence (counts beyond the single-filter cube) + +**Status:** plan of record · **Authors:** Claude + Codex (joint, 2026-06-19) · **Tracks:** #305 (meta), #304 (live bug), #306 (data bug), #276 (semantics), #230, #286 + +## Problem + +At the global/zoomed-out view, facet **counts** only cross-filter for a **single** active +filter (served by the precomputed `facet_tree_cross_filter` cube via `effectiveSingleFilter()`, +`explorer.qmd:3374`). With **2+ active selections** the cube is skipped and tree-dim counts +revert to the **unfiltered baseline** (`explorer.qmd:3557-3562`) — deliberately, to avoid a +~39M-row `sample_facet_membership` GROUP BY that stalls DuckDB-WASM. So the **map is correctly +filtered** (#299 bitmask) but the **count numbers are wrong** for multi-filter at world view. +Zoomed-in is already correct (live membership path, `explorer.qmd:3564-3598`). Reported by +Eric in #304 (e.g. *Material: Anthropogenic* + *Specimen Type: Artifact*). + +## Core idea — count per node FROM the bitmasks (no membership scan) + +For a count dimension D over a qualifying pid set Q: + +```sql +SELECT nb.concept_uri AS value, COUNT(*) AS count +FROM s +JOIN read_parquet('${node_bits_url}') nb ON nb.facet_type = '' + AND (s._mask & (1::BIGINT << nb.bit_index)) <> 0 +WHERE +GROUP BY nb.concept_uri +``` + +- Masks are **one row per pid** → `COUNT(*)` == `COUNT(DISTINCT pid)` (do **not** add `COUNT(DISTINCT)` — wasted work). +- Membership encodes **ancestor closure** → child bit ⇒ parent bit, so **parent ≥ child** holds for free. +- The bitwise join is against `facet_node_bits` (≤63 rows/dim), not the 39M-row membership table. + +**Cross-filter semantics (confirmed correct):** OR among selected nodes *within* a dimension; +AND *across* dimensions; **omit the target dimension's own predicate** (exclude-self), matching +the UI contract (`explorer.qmd:3305`) and the existing cube (`build_frontend_derived.py:460`). +Special case: **zero selected sources** ⇒ impossible filter for tree targets, but the source +dimension's own histogram must still exclude source and stay constrained by the three trees. + +## Feasibility (Codex native DuckDB benchmarks, real 6M-pid data) + +| Query | Time | +|---|---| +| One unfiltered 22-node histogram | ~0.17s | +| All three tree dims via one 56-node join | ~0.69s | +| Eric's Material+Specimen-Type → context histogram | ~0.02s | +| Embedded source predicate/group | ~0.014s | +| `facets_v3` semi-join (rejected approach) | ~0.145s | + +Strongly supports feasibility. **Worst case ≈ 6M × 56 = ~336M cheap bit-tests**, not merely a +6M-row scan → **WASM cold+warm benchmarking remains a hard release gate** (the app has documented +DuckDB-WASM concurrency/deadlock sensitivity). + +## Source handling + +Add `source` as a plain **VARCHAR** column to a **new complete per-pid index** (source is +exclusive, not multi-valued → a mask is wrong; a `facets_v3` semi-join adds a second 6M-row +read + hash join). New schema: + +``` +pid · source · material_mask · context_mask · object_type_mask · build_id · schema_version +``` + +## Data-correctness prerequisite (#306) + +The current `sample_facet_masks` is built from `membership`, so it **omits ~29,917 located +samples** with coordinates but no tree membership (`facets_v3` = 6,026,242 vs masks = 5,996,325). +The new index must start from **`samp_geo`**, LEFT JOIN masks, and emit **zero masks** for +no-membership pids. Tracked as **#306** (must land in Phase 1). + +## Honesty rule (non-negotiable) + +**Never fall back to the unfiltered baseline when filters are active** — that is exactly the +#304 bug. `applyFacetCounts(..., null)` means baseline (`explorer.qmd:1381`). On any +miss/error under active filters, show **"—" / "unavailable" / stale styling / hidden counts** — +never authoritative-looking baseline numbers. Apply all dimension maps **atomically after one +stale-guard check** (`facetCountsReqId`); never partially repaint dimensions. + +## Phased plan + +### Phase 1 — Artifact + semantic foundation +- Build the complete per-pid index above. Start from `samp_geo`; include zero-mask pids (#306). +- Add `source` VARCHAR; extend the build **fingerprint** to cover the pid/source universe + (today it fingerprints membership only → would go stale silently once source/coverage change). +- AI-free validator: pid completeness vs `facets_v3`; source equality; node-bit coverage; + mask ≡ membership for pids that have membership; build_id consistency. +- **Publish under a new immutable filename/tag — do NOT overwrite the cached `202608` artifact.** +- Ratify the count contract (#276) as **membership / "anywhere in tree"** — already the shipped + semantics, so this is confirmation, not a redesign. + +### Phase 2 — Fix #304 at global view +- Generalized **direct-control filter snapshot** (today's `describeCrossFilters()` zeroes tree + selections at global view — the root cause). Generalize `effectiveSingleFilter()` to all selections. +- Route **multi-filter / full-tree-mode / global / no-search** counts through one mask-index + histogram query (the `tree_counts ∪ source_counts` shape). +- Keep the single-filter cube as a **pure optimization**, not a required semantic path. +- Apply dimension maps atomically after the stale check. On failure → unavailable, **not** baseline. +- **Benchmark cold + warm DuckDB-WASM** across empty / narrow / broad / all-pid qualifying sets; + set a concrete interaction budget before release. + +### Phase 3 — Extend (not premature unification) +- Add **viewport** support (identical padded bbox, `VIEWPORT_PAD_FACTOR = 0.3`). +- Add **search** via `search_pids`. +- Characterization-test the new path against the existing membership path; **then** retire the + membership count implementation. (Immediate unification would balloon the first fix into + viewport + search + flat-mode + stale-update refactoring — defer.) + +### Phase 4 — Verification + cleanup +- Unit-test predicate generation (exclude-self, zero-source states). +- Data validator gates from Phase 1, run in CI. +- Playwright: Eric's exact URL, multi-value-in-one-dim, source+tree, impossible combos, + global↔viewport transitions, superseded requests. +- Remove the misleading active-filter baseline fallback everywhere. +- Codex adversarial review; staging verify on real R2 data; deploy smoke gate. + +## Constraints / landmines +- **63-bit ceiling:** current dims are 19 / 22 / 15 nodes — ample headroom. Keep the build + hard-fail; design a future multiword schema but **do not implement now**. +- **Flat / mixed render mode:** keep the mask path gated OFF (subtree-membership semantics are + wrong for a flat dim) — defer to existing paths, as the cube already does. +- **Immutable data:** new tag/filename; never overwrite a cached artifact. + +## Top 3 risks +1. **WASM latency / connection starvation** — credible SQL, unmeasured in-browser; ~336M bit-tests worst case. +2. **Stale / incomplete index generations** — #306 missing pids + membership-only build_id; mixed cached generations → convincing-but-false counts. +3. **Honesty & state-transition failures** — baseline-as-truth fallback + async camera transitions unless results applied atomically. + +--- +*Joint plan: Claude (code trace, drafting) + Codex (native benchmarks, source design, #306 discovery, honesty rule, phasing). 2026-06-19.* diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md index b5a9be5..2488f09 100644 --- a/SERIALIZATIONS.md +++ b/SERIALIZATIONS.md @@ -122,6 +122,7 @@ builder — a fresh build is NOT bit-for-bit identical to them (see | `isamples_202601_sample_facets_v2.parquet` | `(pid, source, material, context, object_type, label, description, place_name)` — all VARCHAR scalars; each facet column is a single URI per sample (not an array) | 63 MB | 6.0 M | wide | Search Explorer multi-dim facet filtering | QUERY_SPEC §3.3, §5.1 | | `isamples_202601_facet_summaries.parquet` | Baseline `(facet_type, facet_value, scheme, count)` | 2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 | | `isamples_202601_facet_cross_filter.parquet` | Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a | +| `_sample_facet_index.parquet` | Complete per-pid facet index `(pid, source, material_mask, context_mask, object_type_mask, build_id, schema_version)` — **one row per located sample**, including samples with no tree membership (zero-masked, #306). Scanned by the multi-filter global-view count path (#304/#305). | ~60 MB | 6.0 M | wide (membership + samp_geo) | Interactive Explorer multi-filter facet counts | §4.12 below | ### Tier: vocabulary labels @@ -296,6 +297,53 @@ for the alias when you want "latest." - **Consumer**: `scripts/enrich_wide_with_oc_thumbnails.py` uses OC narrow to fill thumbnails into 202604 unified wide. Also used directly in PQG benchmark work. - **Future**: these become the prototype upstream for per-source sidecars (see §3, bottom row). +### 4.12 `_sample_facet_index.parquet` (complete per-pid facet index, #305/#306) + +- **Role**: The single artifact the **multi-filter** global-view count path scans + (#304/#305). Where `sample_facet_masks` is built FROM `membership` and therefore + silently omits located samples that carry no tree concept (~29,917 in the 202608 + generation — **#306**), this index starts from `samp_geo` (the authoritative + located set) and is **complete**: one row per located `pid`. Samples with no + membership are present and **zero-masked**. +- **Headline schema** (7 cols): `pid (VARCHAR), source (VARCHAR), material_mask + (BIGINT), context_mask (BIGINT), object_type_mask (BIGINT), build_id (VARCHAR), + schema_version (INTEGER)`. `source` is a plain VARCHAR (source is exclusive, not + multi-valued, so a mask would be wrong). Each `*_mask` bit `(1 << bit_index)` is + set iff the pid is a member of that node, under the bit assignment in + `facet_node_bits`; membership encodes ancestor closure, so a parent bit is set for + the whole subtree. +- **`build_id` is structured** as `":"`: + - the **membership half** equals `facet_node_bits.build_id` — the explorer must + only interpret the mask bits when the two agree (else the bits are read under a + foreign assignment); + - the **coverage half** is a fingerprint of the `(pid, source)` universe over + `samp_geo`, so a changed source value or located-pid set (the #306 drift class) + changes `build_id` instead of going stale silently. +- **Count contract (ratifies #276)**: a facet count is **membership / "anywhere in + the tree"** — a sample counts toward a node if that node is anywhere in the + sample's asserted-concept subtree closure (OR within a dimension, AND across + dimensions, **excluding the target dimension's own predicate**). This is the + already-shipped cube/UI semantics; this index does not change it, it serves the + same contract to the *multi-filter* case the single-filter cube cannot reach. +- **Validation**: `validate_frontend_derived.py --index ` is an INDEPENDENT + oracle — it asserts schema + NOT-NULL columns, one row per pid, `pid` set == + `sample_facets_v2.pid` (the located universe), `source` == facets_v2 source, + `schema_version` == the exact contract version, a well-formed single `build_id`, + and then **recomputes** both `build_id` halves from the written sibling files (the + coverage half from `sample_facets_v2`, the membership half from + `sample_facet_membership`) and asserts equality — so a stale/edited coverage id is + caught at build time even though `node_bits` is unchanged. Masks are re-derived + directly from `membership` + `node_bits` for **every** index pid (zero rows + included) and symmetric-diffed against the file. +- **Runtime staleness handshake (Phase 2)**: the *explorer* loads the index by URL + and so cannot, on its own, know the expected coverage half. The load-time refusal + of a stale-coverage index requires an external trusted pointer (e.g. the expected + composite id co-published in the manifest / a `current` alias the explorer reads); + that handshake lands with the count-query consumer in Phase 2. Build-time + validation (above) is the gate for now. +- **Immutability**: published under a **new** versioned filename — it is a new + artifact name and never overwrites a cached `sample_facet_masks` or any prior tag. + ## 5. URL convention All substrate files live under `https://data.isamples.org/` — a diff --git a/explorer.qmd b/explorer.qmd index d52666d..2ea754c 100644 --- a/explorer.qmd +++ b/explorer.qmd @@ -578,6 +578,9 @@ format: .facet-row.zero { opacity: 0.4; } .facet-row.zero:hover { opacity: 0.65; } .facet-count.recomputing { opacity: 0.55; font-style: italic; } + /* #304/#305: counts can't be computed for this filter combo (index + unpublished/inconsistent or query failed) — honest dash, NOT a baseline. */ + .facet-count.count-unavailable { opacity: 0.5; font-style: italic; cursor: help; } /* #281/#282 Material tree (preview) — compact rows */ .filter-body .facet-treeroot { font-weight: 600; font-size: 12px; line-height: 1.5; padding: 0 0 2px; color: #444; } .filter-body .facet-treenode { font-size: 12px; line-height: 1.5; } @@ -749,9 +752,20 @@ Cesium.Ion.defaultAccessToken = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOi // localStorage ISAMPLES_DATA_BASE (sticky per-browser override) // Defaults to the production R2 origin, so shipped builds are unchanged. R2_BASE = (() => { + const DEFAULT = "https://data.isamples.org"; const raw = new URLSearchParams(location.search).get('data_base') || (typeof localStorage !== 'undefined' && localStorage.getItem('ISAMPLES_DATA_BASE')) - || "https://data.isamples.org"; + || DEFAULT; + // The base is interpolated into `read_parquet('${url}/...')` in many queries, so + // a value containing a single quote (or backslash) could break out of the SQL + // string literal (Codex P2 — data_base is user-controllable via ?data_base= / + // localStorage). Reject any such value and fall back to the default. Also require + // an http(s) or root-relative form (httpfs only range-fetches absolute http(s); + // a relative is resolved against origin below). + if (/['"\\\x00-\x1f]/.test(raw) || !(/^https?:\/\//.test(raw) || raw.startsWith('/'))) { + if (raw !== DEFAULT) console.warn('Ignoring unsafe/invalid data_base override; using default origin:', raw); + return DEFAULT; + } // DuckDB-WASM's httpfs only range-fetches ABSOLUTE http(s) URLs. A root- // relative override (?data_base=/data) is read as a virtual-filesystem // glob and fails with "No files found that match the pattern" — silently, @@ -816,6 +830,15 @@ tree_cross_filter_url = `${R2_BASE}/isamples_202608_facet_tree_cross_filter.parq // facetFilterSQL falls back to the membership scan (no regression). masks_url = `${R2_BASE}/isamples_202608_sample_facet_masks.parquet` node_bits_url = `${R2_BASE}/isamples_202608_facet_node_bits.parquet` +// #304/#305 complete per-pid facet index: one row per LOCATED sample (incl. the +// ~29,917 with no tree membership that sample_facet_masks omits, #306) + source. +// The multi-filter global-view count path scans this with the node_bits bitmask +// predicate so 2+ active filters get correct counts instead of reverting to the +// unfiltered baseline (#304). Best-effort: if it isn't published/consistent, +// facetIndexReady stays false and the multi-filter path shows "unavailable" +// rather than a misleading baseline (the honesty rule — never baseline under +// active filters). +index_url = `${R2_BASE}/isamples_202608_sample_facet_index.parquet` // Canonical palette — see issue #113. Path-relative so this works under // both isamples.org (custom domain at root) and project-pages fork @@ -1390,6 +1413,8 @@ function applyFacetCounts(facetKey, countsMap) { } el.textContent = `(${Number(count).toLocaleString()})`; el.classList.remove('recomputing'); + el.classList.remove('count-unavailable'); // #304/#305: real count clears the dash + el.removeAttribute('title'); const row = document.querySelector(`.facet-row[data-facet="${facetKey}"][data-value="${CSS.escape(value)}"]`); if (row) row.classList.toggle('zero', count === 0); @@ -1400,6 +1425,23 @@ function markFacetCountsRecomputing() { document.querySelectorAll('.facet-count').forEach(el => el.classList.add('recomputing')); } +// #304/#305: the honesty rule — when active filters are in play but we cannot +// compute correct cross-filtered counts (the complete index is unpublished / +// inconsistent, or a count query failed), we must NOT fall back to the unfiltered +// baseline (that misleading baseline IS the #304 bug). Show an explicit +// "unavailable" dash instead. Scoped to the given dim keys (default: all). +function markFacetCountsUnavailable(facetKeys = null) { + const keys = facetKeys || ['source', 'material', 'context', 'object_type']; + for (const key of keys) { + document.querySelectorAll(`.facet-count[data-facet="${key}"]`).forEach(el => { + el.textContent = '(—)'; + el.classList.remove('recomputing'); + el.classList.add('count-unavailable'); + el.title = 'Count unavailable for this filter combination'; + }); + } +} + // === URL State: encode/decode globe state in hash fragment === // (decode side — parseNum + readHash — extracted to assets/js/explorer-utils.js, PR3) function buildHash(v) { @@ -1775,11 +1817,26 @@ db = { // null and facetFilterSQL falls back to the membership scan — so this is safe to // ship before sample_facet_masks / facet_node_bits are published. nodeBitsReady = { + // Two SEPARATE readiness signals (Codex P2 #4 — they were wrongly coupled): + // __nodeBitsMap + __nodeBitsBuild → the concept_uri→bit map + its generation. + // Valid as soon as node_bits itself is. The #304/#305 COUNT path + // (sample_facet_index) needs ONLY this, not masks. + // __nodeBits → the SAME map, but advertised only after the masks file + // preflights & generation-matches. facetFilterSQL's mask FILTER path reads + // masks_url, so it must not run until masks is proven present/consistent. + // A missing/mismatched masks file must NOT disable the (valid) count bundle. window.__nodeBits = null; + window.__nodeBitsMap = null; + window.__nodeBitsBuild = null; + let map, nbBuild; + // STEP 1: node_bits — its OWN try, so the count-path values it publishes are + // never cleared by a downstream masks failure (Codex r2 P1: the masks query + // previously threw into a shared catch that nulled __nodeBitsMap/__nodeBitsBuild, + // re-coupling the two and disabling valid counts). try { const rows = await db.query( `SELECT facet_type, concept_uri, bit_index, build_id FROM read_parquet('${node_bits_url}')`); - const map = { material: new Map(), context: new Map(), object_type: new Map() }; + map = { material: new Map(), context: new Map(), object_type: new Map() }; const nbBuilds = new Set(); for (const r of rows) { if (map[r.facet_type]) map[r.facet_type].set(r.concept_uri, Number(r.bit_index)); @@ -1789,25 +1846,114 @@ nodeBitsReady = { // node_bits must carry exactly ONE build_id (a mixed file is corrupt; don't // let a last-row-wins value coincidentally match masks — Codex r2). if (!haveBits || nbBuilds.size !== 1) return false; - const nbBuild = [...nbBuilds][0]; - // Codex P1.1 + P1.2: PREFLIGHT the masks file AND require its build_id to - // match node_bits. This proves masks is present/readable (else the mask - // query path would fail with no retry) and that both artifacts are from the - // SAME generation (else a stale-cached masks file would map bits wrong). - // Only then advertise readiness; otherwise facetFilterSQL uses membership. + nbBuild = [...nbBuilds][0]; + // node_bits is valid on its own → publish the bit map + generation NOW so the + // count path (which depends only on node_bits + the index) runs even if masks + // is absent. facetIndexReady consumes window.__nodeBitsBuild. + window.__nodeBitsMap = map; + window.__nodeBitsBuild = nbBuild; + } catch (err) { + console.warn('node_bits preflight failed; facetFilterSQL + count path use fallbacks:', err); + window.__nodeBits = null; + window.__nodeBitsMap = null; + window.__nodeBitsBuild = null; + return false; + } + // STEP 2: masks — a SEPARATE try. A masks failure here must leave __nodeBitsMap / + // __nodeBitsBuild intact (count path stays enabled) and only withhold __nodeBits + // (the masks-gated FILTER signal for facetFilterSQL). Codex P1.1 + P1.2: masks + // must be present/readable AND the SAME generation as node_bits, else + // facetFilterSQL uses the membership fallback. + try { const mrows = await db.query( `SELECT DISTINCT build_id FROM read_parquet('${masks_url}')`); const maskBuilds = mrows.map(r => r.build_id); if (maskBuilds.length !== 1 || maskBuilds[0] !== nbBuild) { - console.warn('masks/node_bits build_id mismatch or missing; using membership fallback', + console.warn('masks/node_bits build_id mismatch; using membership fallback (count path unaffected)', { nbBuild, maskBuilds }); return false; } window.__nodeBits = map; return true; } catch (err) { - console.warn('node_bits/masks preflight failed; facetFilterSQL will use the membership fallback:', err); - window.__nodeBits = null; + console.warn('masks preflight failed; facetFilterSQL uses membership fallback (count path unaffected):', err); + return false; + } +} +``` + +```{ojs} +//| echo: false +//| output: false +// #304/#305: preflight the complete per-pid facet index (sample_facet_index). The +// multi-filter global-view count path scans it; it must be (a) present/readable, +// (b) schema_version we understand, (c) from the SAME node_bits generation as the +// bit map (build_id membership half == window.__nodeBitsBuild) so the mask bits +// mean what we think, AND (d) COVERAGE-COMPLETE — its pid count equals the located +// universe (facets) so a stale/incomplete index (the #306 drift class, which a +// membership-only id is blind to) can't silently undercount. The full coverage +// fingerprint is enforced at BUILD time (validate_frontend_derived); this is the +// cheap runtime handshake the plan calls for (SERIALIZATIONS §4.12). Only then +// advertise readiness; otherwise the multi-filter path shows "unavailable" instead +// of a baseline (honesty rule). Depends on nodeBitsReady for __nodeBitsBuild. +facetIndexReady = { + const _ = nodeBitsReady; // sequence after the node_bits preflight + window.__facetIndexReady = false; + const INDEX_SCHEMA_VERSION = 1; // mirrors build_frontend_derived.INDEX_SCHEMA_VERSION + try { + const nbBuild = window.__nodeBitsBuild; + if (!nbBuild) return false; // no usable bit map → index path can't run + const rows = await db.query( + `SELECT DISTINCT build_id, schema_version FROM read_parquet('${index_url}')`); + if (!rows || rows.length !== 1) return false; // missing / mixed generations + const sv = Number(rows[0].schema_version); + if (sv !== INDEX_SCHEMA_VERSION) { + console.warn('sample_facet_index schema_version unsupported; multi-filter counts unavailable', sv); + return false; + } + const membershipHalf = String(rows[0].build_id).split(':', 1)[0]; + if (membershipHalf !== nbBuild) { + console.warn('sample_facet_index/node_bits generation mismatch; multi-filter counts unavailable', + { indexMembershipHalf: membershipHalf, nbBuild }); + return false; + } + // (d) runtime coverage handshake: the index must cover the SAME located + // universe the counts are about. Compare the per-SOURCE histogram of the + // index to facet_summaries' source rows (which the builder computes as + // GROUP BY source over the SAME located set, samp_geo) — a symmetric diff + // that catches a stale/partial index (per-source count drift) and SOURCE + // drift, which a bare total-row-count check would miss (Codex r2). + // facet_summaries is ~2 KB and already loaded at boot, so this is a near-free + // check; an earlier draft scanned the 60 MB facets_v3 and cost ~8.6 s of + // DuckDB-WASM connection time at boot (measured in-browser) — exactly the + // single-connection starvation this app guards against. It is still a CHEAP + // proxy, NOT a complete staleness check: it does NOT detect a same-source, + // same-cardinality PID swap (mismatch stays 0). That residual is closed by + // (i) the generation-id match above — the membership half is a hash over + // membership *including pid*, so any swap of a pid that HAS membership changes + // it, leaving only swaps of the #306 no-membership pids — and (ii) the + // BUILD-TIME full coverage fingerprint + pid-set equality gate + // (validate_frontend_derived, SERIALIZATIONS §4.12), the authoritative check. + const cov = await db.query(` + WITH i AS (SELECT source AS v, COUNT(*) c FROM read_parquet('${index_url}') + WHERE NULLIF(TRIM(source), '') IS NOT NULL GROUP BY source), + f AS (SELECT facet_value AS v, count AS c + FROM read_parquet('${facet_summaries_url}') WHERE facet_type = 'source') + SELECT (SELECT COUNT(*) FROM (SELECT * FROM i EXCEPT SELECT * FROM f)) + + (SELECT COUNT(*) FROM (SELECT * FROM f EXCEPT SELECT * FROM i)) AS mismatch`); + if (Number(cov[0].mismatch) !== 0) { + console.warn('sample_facet_index per-source coverage != located universe; multi-filter counts unavailable', + { mismatch: Number(cov[0].mismatch) }); + return false; + } + window.__facetIndexReady = true; + // readiness can flip true AFTER boot already computed counts once (lost the + // async race) — reconcile any active multi-filter now (Codex P1 boot race). + if (typeof window.__onFacetIndexReady === 'function') window.__onFacetIndexReady(); + return true; + } catch (err) { + console.warn('sample_facet_index preflight failed; multi-filter global counts will show unavailable:', err); + window.__facetIndexReady = false; return false; } } @@ -3422,6 +3568,115 @@ zoomWatcher = { } } + // #304/#305: a DIRECT snapshot of every active selection read from the controls — + // unlike describeCrossFilters(), it does NOT zero tree selections at global view + // (that zeroing is the #304 root cause). Returns per-dim selected values (source + // strings; tree concept_uris), an active-dim count, and the source-impossible + // flag. Tree dims contribute only in full tree mode (flat dims keep facets_v3 + // paths). source: [] means "no constraint" (all or none selected — disambiguated + // by sourceImpossible). + function directFilterSnapshot() { + const sourceTotal = document.querySelectorAll('#sourceFilter input[type="checkbox"]').length; + const sources = getActiveSources(); + const sourceImpossible = sourceTotal > 0 && sources.length === 0; + // a partial source selection is a real constraint; all/none → no positive list + const sourceVals = (sources.length > 0 && sources.length < sourceTotal) ? sources : []; + const snap = { source: sourceVals, sourceImpossible, material: [], context: [], object_type: [] }; + for (const key of TREE_DIM_KEYS) { + if (treeActive(key)) snap[key] = treeSelection(key) || []; + } + // dims that impose a constraint (drives "is there anything to cross-filter?") + let active = sourceImpossible || sourceVals.length > 0 ? 1 : 0; + for (const key of TREE_DIM_KEYS) if (snap[key].length > 0) active++; + snap.activeCount = active; + return snap; + } + + // #304/#305: build the cross-filter WHERE conditions over sample_facet_index + // (alias-free; columns resolve to the index) for the OTHER active dims, EXCLUDING + // `excludeKey` (exclude-self). Tree dims → `(_mask & ) <> 0` from the + // node_bits bit map; source → `source IN (...)`. Returns null if a tree selection + // can't be turned into a mask (bit map absent / unknown node) so the caller can + // bail honestly instead of under-constraining. A source-impossible state zeroes + // every NON-source target (the source dim's own histogram still computes). + function maskIndexWhere(snap, excludeKey) { + // #304/#305: the count path uses __nodeBitsMap (valid as soon as node_bits is) + // — NOT __nodeBits, which is gated on the masks FILTER artifact (Codex #4). + const bits = (typeof window !== 'undefined') ? window.__nodeBitsMap : null; + const conds = []; + for (const key of TREE_DIM_KEYS) { + if (key === excludeKey) continue; + const sel = snap[key]; + if (!sel || sel.length === 0) continue; + if (!bits || !bits[key]) return null; + let mask = 0n; + for (const uri of sel) { + const idx = bits[key].get(uri); + if (idx === undefined) return null; // unknown node → bail + mask |= (1n << BigInt(idx)); + } + if (mask === 0n) return null; + conds.push(`(${key}_mask & ${mask}) <> 0`); // BigInt → decimal BIGINT literal + } + if (excludeKey !== 'source') { + if (snap.sourceImpossible) return ['1=0']; // no source selectable → 0 for tree targets + if (snap.source.length > 0) { + conds.push(`source IN (${snap.source.map(v => `'${escSql(v)}'`).join(',')})`); + } + } + return conds; + } + + // #304/#305: compute multi-filter cross-filtered counts for ALL dims off the + // complete per-pid index and apply them ATOMICALLY after a single stale check. + // Returns: 'ok' (applied), 'superseded' (a newer request won), 'unavailable' + // (index/bitmap usable but a query failed → honesty: caller shows the dash), or + // 'fallthrough' (index/bitmap not usable → caller uses legacy paths). + async function applyMaskIndexCounts(snap, dims, myReq) { + if (!window.__facetIndexReady || !window.__nodeBitsMap) return 'fallthrough'; + // Build ONE query (UNION ALL of per-dim arms) rather than 4 concurrent + // full-index scans — the documented DuckDB-WASM single-connection + // starvation risk (Codex P2). One statement = one connection slot. Each arm + // carries its OWN exclude-self cross-filter WHERE. Common output shape + // (dim, value, count) so one pass groups them. + const arms = []; + for (const d of dims) { + const conds = maskIndexWhere(snap, d.key); + if (conds === null) return 'unavailable'; // selected node missing a bit → can't compute honestly + const whereSQL = conds.length ? `WHERE ${conds.join(' AND ')}` : ''; + if (d.key === 'source') { + // source target: COUNT(*) per source over the index (one row/pid), + // constrained by the tree dims only (exclude-self). + const w = conds.length ? `${whereSQL} AND source IS NOT NULL` : 'WHERE source IS NOT NULL'; + arms.push(`SELECT 'source' AS dim, source AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s ${w} GROUP BY source`); + } else { + // tree target: JOIN node_bits; each (pid,node) the pid is a member of + // yields one row, so COUNT(*) per concept_uri == distinct pids with + // that bit. Membership encodes ancestor closure → parent ≥ child. + arms.push(`SELECT '${d.key}' AS dim, nb.concept_uri AS value, COUNT(*) AS count + FROM read_parquet('${index_url}') s + JOIN read_parquet('${node_bits_url}') nb + ON nb.facet_type = '${d.key}' + AND (s.${d.key}_mask & (1::BIGINT << nb.bit_index)) <> 0 + ${whereSQL} GROUP BY nb.concept_uri`); + } + } + try { + const rows = await db.query(arms.join('\nUNION ALL\n')); + if (myReq !== facetCountsReqId) return 'superseded'; + const grouped = { source: new Map(), material: new Map(), context: new Map(), object_type: new Map() }; + for (const r of rows) if (grouped[r.dim]) grouped[r.dim].set(r.value, Number(r.count)); + // atomic repaint: all dims at once, only after the single stale check + for (const d of dims) applyFacetCounts(d.key, grouped[d.key]); + return 'ok'; + } catch (err) { + if (myReq !== facetCountsReqId) return 'superseded'; + console.warn('mask-index cross-filter counts failed; counts unavailable (not baseline):', err); + return 'unavailable'; + } + } + async function updateCrossFilteredCounts(myReq) { if (myReq !== facetCountsReqId) return; const { dims, activeDims, totalActiveValues, sourceImpossible } = describeCrossFilters(); @@ -3490,6 +3745,32 @@ zoomWatcher = { if (myReq !== facetCountsReqId) return; } + // #304/#305: MULTI-filter (or multi-value single-dim, or source-impossible, + // or a single-filter cube MISS) at global view + full tree mode + no search. + // The single-filter cube above can't serve these, and the legacy slow path + // below reverts every tree dim to the unfiltered BASELINE at global view + // (the #304 bug). Route through the complete per-pid index instead — correct + // multi-filter counts in one columnar bitmask scan, applied atomically. + // HONESTY RULE: on any failure show "unavailable", NEVER the baseline. A + // 'fallthrough' (index/bitmap not usable, e.g. not yet published) defers to + // the legacy paths unchanged so this is safe to ship before the index lands. + if (bboxSQL === null && !searchIsActive() && TREE_DIM_KEYS.every(treeActive)) { + const snap = directFilterSnapshot(); + if (snap.activeCount >= 1) { + const res = await applyMaskIndexCounts(snap, dims, myReq); + if (res === 'superseded' || res === 'ok') return; + // res is 'unavailable' OR 'fallthrough' (index not ready / a node has + // no bit). At global + full tree + active filters there is NO correct + // legacy path — the legacy slow path reverts tree dims to the + // unfiltered BASELINE, which IS the #304 bug. Honesty rule: show the + // dash, never baseline. (When readiness flips true later, + // __onFacetIndexReady re-runs this and the dash becomes real counts.) + if (myReq !== facetCountsReqId) return; // defensive stale check (Codex) + markFacetCountsUnavailable(); + return; + } + } + // Baseline early-return only applies when there is no filter AND no // spatial constraint. In a non-global view with no facet filter, B1 // still wants per-value counts scoped to what's visible — fall @@ -4342,6 +4623,13 @@ zoomWatcher = { window.__onFilteredClustersReady = () => { if (hasFacetFilters()) reconcileGlobeForFilters(); }; + // #304/#305: when the facet-index preflight settles true AFTER boot already + // computed counts (async race), recompute so any filter that fell back to the + // "unavailable" dash becomes correct counts without a user action. Refresh + // UNCONDITIONALLY — hasFacetFilters() excludes SOURCE selections (Codex r2 P1), + // and refreshFacetCounts is debounced + facetCountsReqId-guarded, so a redundant + // no-filter refresh just recomputes the baseline harmlessly. + window.__onFacetIndexReady = () => { refreshFacetCounts(); }; setTimeout(() => { if (window.__filteredClustersReady === true && hasFacetFilters()) { const h = viewer.camera.positionCartographic.height; diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index d92cff0..dd96baf 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -19,6 +19,7 @@ - {tag}_facet_summaries.parquet facet_type, facet_value, scheme, count - {tag}_facet_cross_filter.parquet filter_source/material/context/object_type, facet_type, facet_value, count - {tag}_wide_h3.parquet wide + h3_res4/6/8 (large; built only on --only wide_h3) + - {tag}_sample_facet_index.parquet pid, source, material_mask, context_mask, object_type_mask(BIGINT), build_id, schema_version(INT) — COMPLETE per-pid index over ALL located samples (incl. #306 no-membership pids, zero-masked); the multi-filter global-view count path (#304/#305) scans this - {tag}_manifest.json provenance + per-output rowcount/schema/sha256 MATERIAL SELECTION (issue #265/#271): the broad SKOS root @@ -64,11 +65,16 @@ ARTIFACTS = ["sample_facets_v2", "samples_map_lite", "h3_summaries", "facet_summaries", "facet_cross_filter", "wide_h3", "sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"] + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"] # #293: max tree nodes per dim that fit in a signed BIGINT mask (bits 0..62). # Live max is 22 (context); guard so a future vocab explosion fails loudly # instead of silently overflowing a mask bit. MASK_MAX_BITS = 63 +# #305/#306: schema version for the complete per-pid facet index. Bump when the +# column set / semantics of sample_facet_index change so the explorer can refuse +# an index it doesn't understand instead of mis-reading it. +INDEX_SCHEMA_VERSION = 1 # Shared SQL expression for sample_facets_v2.description (#277 part 2). # Appends space-joined concept labels (IC labels across all 4 concept dims) @@ -202,12 +208,19 @@ def build_base_tables(con, wide, t0): n_geo = con.sql("SELECT COUNT(*) FROM samp_geo").fetchone()[0] n_dup = con.sql("SELECT COUNT(*) FROM (SELECT pid FROM samp_geo GROUP BY pid HAVING COUNT(*)>1)").fetchone()[0] n_icdup = con.sql("SELECT COUNT(*) FROM (SELECT row_id FROM ic GROUP BY row_id HAVING COUNT(*)>1)").fetchone()[0] - log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} duplicate_concept_row_ids={n_icdup:,}", t0) - if n_dup or n_icdup: - # HARD fail (Codex): non-unique keys make the output grain wrong (inflated + # #305/#306 (Codex): the GROUP BY...HAVING COUNT>1 dup check above collapses all + # NULL pids into ONE group, so a SINGLE NULL pid slips through — yet a NULL pid + # breaks every pid join (joins drop it) and is silently absent from the coverage + # fingerprint. Reject it explicitly. + n_null = con.sql("SELECT COUNT(*) FROM samp_geo WHERE pid IS NULL").fetchone()[0] + log(f"samp={n_samp:,} samp_geo={n_geo:,} duplicate_pids={n_dup:,} " + f"null_pids={n_null:,} duplicate_concept_row_ids={n_icdup:,}", t0) + if n_dup or n_icdup or n_null: + # HARD fail (Codex): non-unique/NULL keys make the output grain wrong (inflated # facet counts, ambiguous joins, non-total ORDER BY pid). Abort before writing. raise SystemExit( - f"FATAL: non-unique keys — duplicate_pids={n_dup}, duplicate_concept_row_ids={n_icdup}. " + f"FATAL: non-unique/NULL keys — duplicate_pids={n_dup}, null_pids={n_null}, " + f"duplicate_concept_row_ids={n_icdup}. " f"Output grain/joins would be wrong; refusing to write.") @@ -493,6 +506,32 @@ def build_facet_tree_cross_filter(con, out): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +# --- generation fingerprints (#293, #305/#306) ----------------------------- +# The per-row token expressions are module constants so the VALIDATOR can +# recompute the exact same fingerprint independently from the written sibling +# files (membership, sample_facets_v2) and assert the index's build_id — i.e. +# the builder is not the sole authority on the index's generation identity. +# chr(31) (US) separates fields. A NULL source is encoded with a distinct +# sentinel byte so NULL and '' can never collide (Codex #5). +MEMBERSHIP_TOKEN_EXPR = "pid || chr(31) || facet_type || chr(31) || concept_uri" +COVERAGE_TOKEN_EXPR = ("pid || chr(31) || CASE WHEN source IS NULL THEN chr(0) ELSE chr(1) END " + "|| chr(31) || COALESCE(source, '')") + + +def _fingerprint(con, relation, token_expr): + # Order-independent generation fingerprint over `relation`. Combines THREE + # order-independent accumulators — XOR, SUM (HUGEINT, no overflow), and COUNT — + # of per-row hashes. XOR alone cancels identical rows and is vulnerable to a + # 2-row swap; SUM+COUNT defeat that. NOT a cryptographic digest: this defends + # against accidental drift / stale generations, not an adversary who engineers a + # multi-row hash collision. Grain is unique (validated), so the trio is stable. + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f" COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) " + f"FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def membership_build_id(con): # #293 (Codex P1, r2): a fingerprint of the FULL membership generation — not # just the node set. Both node_bits (positional bit assignment) and masks are @@ -500,12 +539,46 @@ def membership_build_id(con): # change that would alter either artifact (new/dropped pids, re-mapped concepts, # AND node-set changes). Embedding this id in both lets the explorer refuse the # mask path unless the two are from the SAME generation (guards a stale-cached - # masks file). Order-independent XOR of per-row hashes (membership grain is - # unique per (pid,facet_type,concept_uri) — validated — so no XOR cancellation). - return con.sql(""" - SELECT CAST(COALESCE(bit_xor(hash(pid || chr(31) || facet_type || chr(31) || concept_uri)), 0) - AS VARCHAR) - FROM membership""").fetchone()[0] + # masks file). + # + # FORMAT IS A DEPLOYED CONTRACT (#305): the live facet_node_bits / sample_facet_masks + # (202608) carry this id as a BARE bit_xor decimal, and the explorer's facetIndexReady + # preflight matches the index's membership-half against the DEPLOYED node_bits.build_id. + # So this MUST stay the plain order-independent XOR — do NOT switch it to the + # _fingerprint trio (that would change the string and break generation-matching + # against every already-published artifact). membership grain is unique per + # (pid,facet_type,concept_uri) — validated — so no XOR cancellation. The richer + # trio is reserved for coverage_build_id, which is a NEW id with no compat constraint. + return con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({MEMBERSHIP_TOKEN_EXPR})), 0) AS VARCHAR) FROM membership" + ).fetchone()[0] + + +def coverage_build_id(con): + # #305/#306: fingerprint of the COMPLETE per-pid universe the index covers — + # the (pid, source) pairs over the LOCATED set (samp_geo), independent of tree + # membership. membership_build_id() alone is blind to this: an index built from + # the SAME membership but a changed source value or a changed located-pid set + # (exactly the #306 class of drift — located pids with no membership) would carry + # an identical membership id and go stale SILENTLY. NULL vs '' source are encoded + # distinctly (sentinel byte) so a NULL↔'' flip changes the fingerprint. + return _fingerprint(con, "samp_geo", COVERAGE_TOKEN_EXPR) + + +def index_build_id(con): + # #305/#306: the index's generation identity = ":". + # The two halves are deliberately separable so the explorer can BOTH: + # 1. gate the mask-bit interpretation by matching the membership half against + # facet_node_bits.build_id (the bit assignment is a pure function of + # membership — the index masks are only meaningful under the SAME node_bits + # generation), and + # 2. detect coverage/source drift via the coverage half (would otherwise be + # invisible to a membership-only id). + # FORMATS DIFFER BY DESIGN: membership half is the BARE bit_xor decimal (the + # deployed node_bits/masks contract — see membership_build_id), coverage half is + # the richer "__" trio (a new id). ':' appears in neither half + # (decimals + underscores only), so split(':', 1) is unambiguous. + return f"{membership_build_id(con)}:{coverage_build_id(con)}" def build_facet_node_bits(con, out, build_id): @@ -559,6 +632,51 @@ def build_sample_facet_masks(con, out, build_id): ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") +def build_sample_facet_index(con, out, build_id): + # #305/#306: the COMPLETE per-pid facet index — one row for EVERY located + # sample (samp_geo), not only those with tree membership. This is the artifact + # the multi-filter global-view count path (#304) scans: it must be able to count + # over the whole located universe, including samples that carry NO tree concept. + # + # WHY this exists separately from sample_facet_masks: masks is built FROM + # `membership`, so it silently OMITS located samples with no membership row + # (~29,917 in the 202608 generation — #306). Counting/ filtering off masks alone + # undercounts the located universe. Here we start from samp_geo (the authoritative + # located set), LEFT JOIN the same membership-derived masks, and emit a ZERO mask + # for no-membership pids (a zero mask matches no node bit → contributes 0 to every + # tree facet, which is correct: the sample is in no subtree — but it IS still a + # located sample and still counts toward `source` and the located total). + # + # `source` is a plain VARCHAR (source is exclusive, not multi-valued — a mask + # would be wrong, a facets_v3 semi-join would add a second 6M-row read). The + # mask columns are bit-identical to sample_facet_masks for pids that have + # membership (validated). bit assignment == build_facet_node_bits / masks. + con.execute(f"""COPY ( + WITH nb AS ( + SELECT facet_type, concept_uri, + (1::BIGINT << (ROW_NUMBER() OVER (PARTITION BY facet_type ORDER BY concept_uri) - 1)) AS bitval + FROM (SELECT DISTINCT facet_type, concept_uri FROM membership) + ), + masks AS ( + SELECT m.pid, + COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END), 0)::BIGINT AS material_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END), 0)::BIGINT AS context_mask, + COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END), 0)::BIGINT AS object_type_mask + FROM membership m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri + GROUP BY m.pid + ) + SELECT sg.pid, + sg.source::VARCHAR AS source, + COALESCE(mk.material_mask, 0)::BIGINT AS material_mask, + COALESCE(mk.context_mask, 0)::BIGINT AS context_mask, + COALESCE(mk.object_type_mask, 0)::BIGINT AS object_type_mask, + '{build_id}' AS build_id, + {INDEX_SCHEMA_VERSION}::INTEGER AS schema_version + FROM samp_geo sg LEFT JOIN masks mk ON mk.pid = sg.pid + ORDER BY sg.pid + ) TO '{out}' (FORMAT PARQUET, COMPRESSION ZSTD)""") + + def file_meta(con, path): n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}')").fetchone()[0] schema = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchall()] @@ -623,7 +741,8 @@ def emit(name, fn): # Hierarchy artifacts (#281/#282) — need vocab_labels for the SKOS tree. HIER_ARTIFACTS = {"sample_facet_membership", "facet_tree_summaries", - "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks"} + "facet_tree_cross_filter", "facet_node_bits", "sample_facet_masks", + "sample_facet_index"} if any(want(a) for a in HIER_ARTIFACTS): if not args.vocab_labels: # Fail loud if the user EXPLICITLY asked for a hierarchy artifact @@ -634,17 +753,44 @@ def emit(name, fn): log("SKIP hierarchy artifacts: pass --vocab-labels ", t0) else: build_concept_membership(con, args.wide, args.vocab_labels, t0) - emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + # The mask fast-path bundle = {membership, node_bits, masks?, index}. + # sample_facet_masks AND sample_facet_index store raw mask bits that are + # uninterpretable without facet_node_bits, and that the validator can only + # gate by re-deriving from sample_facet_membership. So whenever masks or + # index is requested, FORCE-emit membership + node_bits too (even under + # `--only sample_facet_index`) — otherwise the build ships an artifact its + # own validator must reject (Codex #4 / r3). force_dep() builds a not-wanted + # artifact exactly once and records it for the manifest. + need_fastpath = want("facet_node_bits") or want("sample_facet_masks") or want("sample_facet_index") + force_deps = want("sample_facet_masks") or want("sample_facet_index") + def force_dep(name, fn): + if want(name): + emit(name, fn) + elif p(name) not in produced: + fn(p(name)); produced.append(p(name)); log(f"{name} ✓ (auto-paired with masks/index)", t0) + + if force_deps: + force_dep("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) + else: + emit("sample_facet_membership", lambda o: build_sample_facet_membership(con, o)) emit("facet_tree_summaries", lambda o: build_facet_tree_summaries(con, o)) # #290/#293 cross-filter cube — needs membership (above) + samp_geo (source). emit("facet_tree_cross_filter", lambda o: build_facet_tree_cross_filter(con, o)) - # #293 bitmask filter artifacts — needs membership (above). node_bits - # and masks share a node-set build_id so the explorer only uses the mask - # path when the two are from the same generation (Codex P1). - if want("facet_node_bits") or want("sample_facet_masks"): + # #293 bitmask filter artifacts — needs membership (above). node_bits and + # masks share a node-set build_id so the explorer only uses the mask path + # when the two are from the same generation (Codex P1). + if need_fastpath: _bid = membership_build_id(con) - emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + if force_deps: + force_dep("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) + else: + emit("facet_node_bits", lambda o: build_facet_node_bits(con, o, _bid)) emit("sample_facet_masks", lambda o: build_sample_facet_masks(con, o, _bid)) + # #305/#306: complete per-pid index (every located pid + source, + # zero-mask for no-membership pids). Its build_id embeds the SAME + # membership id as node_bits (mask-bit interpretation gate) PLUS a + # coverage id over samp_geo's (pid, source) universe (staleness gate). + emit("sample_facet_index", lambda o: build_sample_facet_index(con, o, index_build_id(con))) if not args.no_manifest: log("hashing inputs/outputs for manifest…", t0) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index debd8e4..754c9d0 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -13,6 +13,8 @@ python scripts/validate_frontend_derived.py \ --facets URL --map-lite URL --summaries URL --cross-filter URL \ --h3 URL4 URL6 URL8 + # hierarchy + #305/#306 complete index (auto-discovered with --dir/--tag, or): + python scripts/validate_frontend_derived.py --dir DIR --tag TAG --index INDEX.parquet """ import argparse, hashlib, json, os, sys import duckdb @@ -55,6 +57,7 @@ def main(): ap.add_argument("--tree-cross-filter", help="facet_tree_cross_filter parquet (#290/#293); optional") ap.add_argument("--node-bits", help="facet_node_bits parquet (#293); optional") ap.add_argument("--masks", help="sample_facet_masks parquet (#293); optional") + ap.add_argument("--index", help="sample_facet_index parquet (#305/#306); optional") ap.add_argument("--wide", help="source wide parquet — enables the SEMANTIC gate " "(re-derive and diff the written files against a fresh build)") ap.add_argument("--min-rows", type=int, default=1_000_000, @@ -300,6 +303,7 @@ def _opt(name, attr): treexf = _opt("facet_tree_cross_filter", "tree_cross_filter") nodebits = _opt("facet_node_bits", "node_bits") masks = _opt("sample_facet_masks", "masks") + index = _opt("sample_facet_index", "index") if tree: T = f"read_parquet('{tree}')" # parent ≥ child for every edge, every dim (distinct-pid UNION semantics — @@ -477,6 +481,161 @@ def _opt(name, attr): + (SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a))""") check(f"masks filter == membership ({dim})", d == 0, f"{d} pids differ for a real {dim} node") + # --- #305/#306 complete per-pid facet index (sample_facet_index) --- + # The multi-filter global-view count path scans THIS file. It must cover the + # ENTIRE located universe (facets_v2 = samp_geo), carry the correct source per + # pid, zero-mask the no-membership pids (#306), and be bit-identical to + # sample_facet_masks for pids that DO have membership — otherwise counts drift. + if index: + # Import the builder's canonical schema version + fingerprint token exprs so + # the validator is an INDEPENDENT oracle (Codex #1/#5): it recomputes the + # expected build_id from the written sibling files and asserts equality, + # rather than trusting whatever the builder stamped. + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from build_frontend_derived import (INDEX_SCHEMA_VERSION, MEMBERSHIP_TOKEN_EXPR, + COVERAGE_TOKEN_EXPR) + + def _fp(relation, token_expr): # mirror build_frontend_derived._fingerprint (coverage trio) + x, s, n = con.sql( + f"SELECT COALESCE(bit_xor(hash({token_expr})), 0), " + f"COALESCE(SUM(hash({token_expr})::HUGEINT), 0), COUNT(*) FROM {relation}").fetchone() + return f"{x}_{s}_{n}" + + def _xor_fp(relation, token_expr): # mirror membership_build_id (bare XOR — deployed contract) + return str(con.sql( + f"SELECT CAST(COALESCE(bit_xor(hash({token_expr})), 0) AS VARCHAR) FROM {relation}").fetchone()[0]) + + IX = f"read_parquet('{index}')" + # GATING PRECONDITION (Codex r2 #1): the index cannot be fully validated + # without BOTH sample_facet_membership and facet_node_bits — they are what + # the independent build_id recompute and mask re-derivation depend on. If + # either is absent we must FAIL rather than silently run a partial (fail-open) + # validation that would pass a corrupt index/node_bits pair. They are always + # co-produced by a build, so --dir/--tag (or passing both) satisfies this. + check("index validation has membership + node_bits (required to gate)", + bool(mem and nodebits), + "pass --membership and --node-bits (or --dir/--tag with both present); " + "the index can't be validated in isolation") + # schema/types contract (the explorer reads these columns by name) + ix_sch = [(r[0], r[1]) for r in con.sql(f"DESCRIBE SELECT * FROM {IX}").fetchall()] + EXP_IX = [("pid", "VARCHAR"), ("source", "VARCHAR"), + ("material_mask", "BIGINT"), ("context_mask", "BIGINT"), + ("object_type_mask", "BIGINT"), ("build_id", "VARCHAR"), + ("schema_version", "INTEGER")] + check("index schema matches contract", ix_sch == EXP_IX, f"got {ix_sch}") + # NOT-NULL contract (source MAY be NULL — it mirrors facets_v2; everything + # else must be present). A NULL mask is a real corruption (Codex #2): the + # zero-mask check below uses IS DISTINCT FROM so it also rejects NULL. + nn_bad = scalar(f"""SELECT COUNT(*) FROM {IX} WHERE pid IS NULL + OR material_mask IS NULL OR context_mask IS NULL OR object_type_mask IS NULL + OR build_id IS NULL OR schema_version IS NULL""") + check("index: required columns non-NULL", nn_bad == 0, f"{nn_bad} rows with NULL pid/mask/build_id/version") + # one row per pid + ixdup = scalar(f"SELECT COUNT(*) FROM (SELECT pid FROM {IX} GROUP BY pid HAVING COUNT(*)>1)") + check("index: one row per pid", ixdup == 0, f"{ixdup} duplicate pids in index") + # COMPLETENESS (#306): index pid SET == facets_v2 pid SET (the located universe). + # This is the whole point — masks omits no-membership located pids; the index + # must not. A symmetric set-diff catches both missing and extra pids. + ix_diff = scalar(f"""SELECT + (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""") + check("index.pid == facets_v2.pid (complete located universe, #306)", ix_diff == 0, + f"{ix_diff} pids differ between index and facets_v2") + # SOURCE equality: index.source must equal facets_v2.source for every pid + # (IS DISTINCT FROM so NULL==NULL). A source-value drift is exactly what the + # coverage half of build_id is meant to catch — assert it directly too. + src_bad = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {F} f ON i.pid=f.pid + WHERE i.source IS DISTINCT FROM f.source""") + check("index.source == facets_v2.source", src_bad == 0, f"{src_bad} pids with mismatched source") + # schema_version must be EXACTLY the contract version (Codex #2: "one distinct + # value" let 999 through). The explorer refuses an index it can't interpret. + ix_sv_bad = scalar(f"SELECT COUNT(*) FROM {IX} WHERE schema_version IS DISTINCT FROM {INDEX_SCHEMA_VERSION}") + check(f"index: schema_version == {INDEX_SCHEMA_VERSION}", ix_sv_bad == 0, + f"{ix_sv_bad} rows with schema_version != {INDEX_SCHEMA_VERSION}") + # build_id must be a SINGLE value of the EXACT shape ":__" + # — the membership half is the BARE bit_xor decimal (deployed node_bits contract), + # the coverage half is the three '_'-separated decimal trio. (Codex #2: "a colon + # somewhere" accepted ":bogus:extra".) + ix_bids = scalar(f"SELECT COUNT(DISTINCT build_id) FROM {IX}") + check("index: single build_id", ix_bids == 1, f"{ix_bids} distinct build_ids (want 1)") + bid_re = r'^[0-9]+:[0-9]+_[0-9]+_[0-9]+$' + ix_bid_fmt = scalar(f"SELECT COUNT(*) FROM {IX} WHERE regexp_matches(build_id, '{bid_re}') = FALSE") + check("index: build_id is well-formed ':'", ix_bid_fmt == 0, + "build_id(s) not exactly :__") + # INDEPENDENT build_id recomputation (Codex #1/#5): the coverage half is + # recomputed from facets_v2 (the located (pid, source) universe — equivalent + # to samp_geo) and, when membership is present, the membership half from it. + # This is the gate that actually catches coverage/source staleness at + # validation time — a membership-only id (or a hand-edited coverage half) + # FAILS here even though node_bits is unchanged. + if ix_bids == 1: + ix_bid = scalar(f"SELECT MIN(build_id) FROM {IX}") + exp_cov = _fp(F, COVERAGE_TOKEN_EXPR) + got_cov = ix_bid.split(":", 1)[1] if ":" in ix_bid else "" + check("index build_id coverage-half == recomputed from facets_v2", got_cov == exp_cov, + f"coverage half {got_cov!r} != expected {exp_cov!r} (stale/edited coverage)") + if mem: + exp_mem = _xor_fp(f"read_parquet('{mem}')", MEMBERSHIP_TOKEN_EXPR) # bare XOR (deployed contract) + got_mem = ix_bid.split(":", 1)[0] + check("index build_id membership-half == recomputed from membership", got_mem == exp_mem, + f"membership half {got_mem!r} != expected {exp_mem!r} (stale/foreign generation)") + if nodebits and ix_bids == 1: + # the membership half of the index build_id MUST equal node_bits.build_id + # so the mask bits are interpreted under the SAME assignment (else counts + # are computed against a stale/foreign bit layout — silently wrong). + # (Codex r2 #2): assert node_bits has EXACTLY ONE non-NULL build_id FIRST — + # 0 or >1 distinct ids previously skipped the match entirely (fail-open). + nb_bid = scalar(f"SELECT COUNT(DISTINCT build_id) FROM read_parquet('{nodebits}') WHERE build_id IS NOT NULL") + nb_null = scalar(f"SELECT COUNT(*) FROM read_parquet('{nodebits}') WHERE build_id IS NULL") + check("node_bits: exactly one non-NULL build_id", nb_bid == 1 and nb_null == 0, + f"{nb_bid} distinct non-NULL build_ids, {nb_null} NULLs (want 1 / 0)") + if nb_bid == 1 and nb_null == 0: + membership_half_matches = scalar( + f"SELECT (SELECT split_part(MIN(build_id), ':', 1) FROM {IX}) " + f"= (SELECT MIN(build_id) FROM read_parquet('{nodebits}'))") + check("index build_id membership-half == node_bits build_id", + bool(membership_half_matches), + "index masks would be read under a different bit assignment than node_bits") + # INDEPENDENT mask re-derivation (Codex #3): re-derive the expected masks for + # EVERY index pid directly from membership + node_bits (zero for pids with no + # membership) and symmetric-diff against the index. This validates the masks + # without trusting the optional sample_facet_masks file, AND covers the + # zero-mask rows that a masks-only diff never sees. + if mem and nodebits: + M = f"read_parquet('{mem}')" + NB = f"read_parquet('{nodebits}')" + ref = (f"WITH nb AS (SELECT facet_type, concept_uri, (1::BIGINT << bit_index) AS bitval FROM {NB}), " + f"mm AS (SELECT m.pid, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='material' THEN nb.bitval END),0)::BIGINT material_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='context' THEN nb.bitval END),0)::BIGINT context_mask, " + f"COALESCE(bit_or(CASE WHEN m.facet_type='object_type' THEN nb.bitval END),0)::BIGINT object_type_mask " + f"FROM {M} m JOIN nb ON nb.facet_type=m.facet_type AND nb.concept_uri=m.concept_uri GROUP BY m.pid) " + f"SELECT i.pid, COALESCE(mm.material_mask,0)::BIGINT material_mask, " + f"COALESCE(mm.context_mask,0)::BIGINT context_mask, " + f"COALESCE(mm.object_type_mask,0)::BIGINT object_type_mask " + f"FROM (SELECT pid FROM {IX}) i LEFT JOIN mm ON mm.pid=i.pid") + fil = f"SELECT pid, material_mask, context_mask, object_type_mask FROM {IX}" + mm_diff = scalar(f"SELECT (SELECT COUNT(*) FROM (({ref}) EXCEPT ({fil}))) " + f"+ (SELECT COUNT(*) FROM (({fil}) EXCEPT ({ref})))") + check("index masks == re-derived from membership+node_bits (all pids)", mm_diff == 0, + f"{mm_diff} index rows disagree with an independent re-derivation") + elif masks: + # Fallback when membership/node_bits aren't both available: diff shared + # pids against the masks file and assert no-membership extras are zero. + MK = f"read_parquet('{masks}')" + mk_diff = scalar(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""") + check("index masks == sample_facet_masks (shared pids)", mk_diff == 0, + f"{mk_diff} shared pids with differing masks") + extra_nonzero = scalar(f"""SELECT COUNT(*) FROM {IX} i + WHERE NOT EXISTS (SELECT 1 FROM {MK} m WHERE m.pid=i.pid) + AND (i.material_mask IS DISTINCT FROM 0 OR i.context_mask IS DISTINCT FROM 0 + OR i.object_type_mask IS DISTINCT FROM 0)""") + check("index: no-membership extra pids are zero-masked (#306)", extra_nonzero == 0, + f"{extra_nonzero} index-only pids carry a non-zero/NULL mask (should be 0)") + print(f"\n{'CHECK':<44} {'RESULT':<6} DETAIL\n" + "-" * 90) ok = True for name, passed, detail in R: diff --git a/tests/test_frontend_derived.py b/tests/test_frontend_derived.py index 94e05bc..8ef109c 100644 --- a/tests/test_frontend_derived.py +++ b/tests/test_frontend_derived.py @@ -211,6 +211,24 @@ def test_duplicate_pid_hard_fails(tmp_path): f"builder should hard-fail on duplicate pids:\n{r.stdout}\n{r.stderr}" +def test_null_pid_hard_fails(tmp_path): + """Codex (#305 SQL gap): a SINGLE NULL pid slips the GROUP BY...HAVING COUNT>1 + dup check but breaks every pid join + the coverage fingerprint. Reject it.""" + wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + nullp = str(tmp_path / "wide_nullpid.parquet") + # append one located row with a NULL pid + con.execute(f"""COPY ( + SELECT * FROM read_parquet('{wide}') + UNION ALL + SELECT * REPLACE (NULL::VARCHAR AS pid) FROM read_parquet('{wide}') WHERE pid='m-real-first' + ) TO '{nullp}' (FORMAT PARQUET)""") + con.close() + r = _build(tmp_path, nullp) + assert r.returncode != 0 and "null_pids=1" in (r.stdout + r.stderr).lower(), \ + f"builder should hard-fail on a NULL pid:\n{r.stdout}\n{r.stderr}" + + def test_manifest_emitted(tmp_path): import json wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") @@ -519,6 +537,307 @@ def test_facet_masks_only_builds(tmp_path): assert (tmp_path / "t_sample_facet_masks.parquet").exists() +# --------------------------------------------------------------------------- +# sample_facet_index — COMPLETE per-pid index (#305/#306) +# Reuses the tree vocab but adds a LOCATED sample with NO tree membership +# (root-only material, null context/object) — the exact #306 class that +# sample_facet_masks silently omits. The index must include it, zero-masked. +# --------------------------------------------------------------------------- +# (pid, source, material_rids, context_rids, object_rids) +INDEX_SAMPLES = [ + ("s1", "A", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s2", "A", [3], [11], [21]), # rock, earthinterior, othersolidobject (full membership) + ("s3", "B", [2], [11], [21]), # mineral, earthinterior, othersolidobject (full membership) + ("s-nomem", "C", [1], None, None), # material ROOT-only -> NULL material -> NO membership; located +] + + +def build_index_fixture(wide, vocab): + """Tree fixture + a located no-membership sample (the #306 case).""" + con = duckdb.connect(); con.execute("INSTALL spatial; LOAD spatial;") + ic_rows = " UNION ALL ".join( + f"SELECT 'IdentifiedConcept' AS otype, '{uri}' AS pid, {rid}::BIGINT AS row_id, NULL::VARCHAR AS n, " + f"'lbl' AS label, NULL::VARCHAR AS description, NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL AS geometry, NULL::BIGINT[] AS p__has_material_category, NULL::BIGINT[] AS p__has_context_category, " + f"NULL::BIGINT[] AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords" + for rid, uri in TREE_CONCEPTS) + msr = [] + for i, (pid, src, m, c, o) in enumerate(INDEX_SAMPLES): + msr.append( + f"SELECT 'MaterialSampleRecord' AS otype, '{pid}' AS pid, NULL::BIGINT AS row_id, '{src}' AS n, " + f"'label {pid}' AS label, 'desc {pid}' AS description, ['plc-{pid}']::VARCHAR[] AS place_name, " + f"NULL::TIMESTAMP AS result_time, ST_AsWKB(ST_Point({10.0+i},{40.0+i})) AS geometry, " + f"{_arr(m)} AS p__has_material_category, {_arr(c)} AS p__has_context_category, " + f"{_arr(o)} AS p__has_sample_object_type, NULL::BIGINT[] AS p__keywords") + con.execute(f"COPY ({ic_rows} UNION ALL {' UNION ALL '.join(msr)}) TO '{wide}' (FORMAT PARQUET)") + edges = " UNION ALL ".join( + f"SELECT '{u}' uri, {('NULL' if p is None else repr(p))}::VARCHAR broader, 'data_v1' uri_form" + for u, p in TREE_EDGES) + con.execute(f"COPY ({edges}) TO '{vocab}' (FORMAT PARQUET)") + con.close() + + +def _build_index(tmp_path, wide, vocab, tag="t"): + cmd = [sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", tag, + "--skip", "wide_h3", "--no-manifest", "--vocab-labels", vocab] + return subprocess.run(cmd, capture_output=True, text=True) + + +def test_sample_facet_index_complete_and_zero_masked(tmp_path): + """#305/#306: the index covers EVERY located pid (incl. the no-membership one, + which masks omits), carries source, and zero-masks no-membership pids while + staying bit-identical to masks for membership pids.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + r = _build_index(tmp_path, wide, vocab) + assert r.returncode == 0, f"index build failed:\n{r.stdout}\n{r.stderr}" + con = duckdb.connect() + IX = f"read_parquet('{tmp_path / 't_sample_facet_index.parquet'}')" + MK = f"read_parquet('{tmp_path / 't_sample_facet_masks.parquet'}')" + F = f"read_parquet('{tmp_path / 't_sample_facets_v2.parquet'}')" + + # COMPLETENESS: index has all 4 located pids; masks has only the 3 with membership + assert con.sql(f"SELECT COUNT(*) FROM {IX}").fetchone()[0] == 4 + assert con.sql(f"SELECT COUNT(*) FROM {MK}").fetchone()[0] == 3 + # index pid set == facets_v2 pid set (the located universe) + diff = con.sql(f"""SELECT (SELECT COUNT(*) FROM (SELECT pid FROM {F} EXCEPT SELECT pid FROM {IX})) + + (SELECT COUNT(*) FROM (SELECT pid FROM {IX} EXCEPT SELECT pid FROM {F}))""").fetchone()[0] + assert diff == 0, "index pid set != facets_v2 located universe" + # the no-membership located sample IS present and FULLY zero-masked (#306) + row = con.sql(f"SELECT material_mask, context_mask, object_type_mask, source FROM {IX} WHERE pid='s-nomem'").fetchone() + assert row is not None, "s-nomem (located, no membership) missing from index — the #306 bug" + assert row[:3] == (0, 0, 0), f"s-nomem should be zero-masked, got {row[:3]}" + assert row[3] == "C", f"s-nomem source should be 'C', got {row[3]!r}" + # masks bit-identical for shared pids + mk_diff = con.sql(f"""SELECT COUNT(*) FROM {IX} i JOIN {MK} m ON i.pid=m.pid + WHERE i.material_mask IS DISTINCT FROM m.material_mask + OR i.context_mask IS DISTINCT FROM m.context_mask + OR i.object_type_mask IS DISTINCT FROM m.object_type_mask""").fetchone()[0] + assert mk_diff == 0, "index masks drift from sample_facet_masks for shared pids" + # build_id is structured ":" and its membership half == node_bits + bid = con.sql(f"SELECT DISTINCT build_id FROM {IX}").fetchall() + assert len(bid) == 1 and ":" in bid[0][0], f"index build_id not single/structured: {bid}" + nb_bid = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 't_facet_node_bits.parquet'}')").fetchone()[0] + assert bid[0][0].split(":", 1)[0] == nb_bid, "index build_id membership-half != node_bits build_id" + + +def test_sample_facet_index_coverage_id_changes_with_source(tmp_path): + """The coverage half of build_id must change when source/coverage changes even + though membership is identical (the silent-staleness hole the plan calls out).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab, tag="a").returncode == 0 + con = duckdb.connect() + bid_a = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'a_sample_facet_index.parquet'}')").fetchone()[0] + + # rebuild with a flipped source on s-nomem (same membership, different coverage) + wide2 = str(tmp_path / "wide2.parquet") + con.execute(f"""COPY ( + SELECT * REPLACE (CASE WHEN pid='s-nomem' THEN 'Z' ELSE n END AS n) FROM read_parquet('{wide}') + ) TO '{wide2}' (FORMAT PARQUET)""") + assert _build_index(tmp_path, wide2, vocab, tag="b").returncode == 0 + bid_b = con.sql(f"SELECT DISTINCT build_id FROM read_parquet('{tmp_path / 'b_sample_facet_index.parquet'}')").fetchone()[0] + + mem_a, cov_a = bid_a.split(":", 1) + mem_b, cov_b = bid_b.split(":", 1) + assert mem_a == mem_b, "membership half should be unchanged (membership identical)" + assert cov_a != cov_b, "coverage half should change when a source value changes" + + +def test_sample_facet_index_validator_passes_and_gates_bite(tmp_path): + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # clean: validator (incl. the new index gates) must PASS + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", + "--min-rows", "1", "--wide", wide], capture_output=True, text=True) + assert v.returncode == 0, f"validator failed on clean index fixture:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + # corrupt 1: drop the no-membership pid (re-introduce the #306 bug) -> completeness gate fires + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT * FROM read_parquet('{ix}') WHERE pid <> 's-nomem') + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v2 = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v2.returncode != 0 and "index.pid == facets_v2.pid" in v2.stdout, \ + f"completeness gate failed to catch a dropped located pid:\n{v2.stdout}" + + +def test_sample_facet_index_nonzero_mask_on_nomem_caught(tmp_path): + """A no-membership pid carrying a non-zero mask is a corruption the gate must + catch. With membership+node_bits present the independent re-derivation fires.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN 1::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "re-derived from membership" in v.stdout, \ + f"mask gate failed to catch a non-zero no-membership pid:\n{v.stdout}" + + +def test_sample_facet_index_null_mask_caught(tmp_path): + """Codex #2: a NULL mask escaped the old `<> 0` predicate (NULL=unknown). The + NOT-NULL gate must catch it.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, + CASE WHEN pid='s-nomem' THEN NULL::BIGINT ELSE material_mask END AS material_mask, + context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "non-NULL" in v.stdout, \ + f"NOT-NULL gate failed to catch a NULL mask:\n{v.stdout}" + + +def test_sample_facet_index_bad_schema_version_caught(tmp_path): + """Codex #2: schema_version=999 passed the old 'one distinct value' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + build_id, 999::INTEGER AS schema_version FROM read_parquet('{ix}')) + TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "schema_version ==" in v.stdout, \ + f"schema_version gate failed to catch 999:\n{v.stdout}" + + +def test_sample_facet_index_malformed_build_id_caught(tmp_path): + """Codex #2: ':bogus:extra' passed the old 'a colon somewhere' check.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # keep a correct membership half, append junk extra ':' segments + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':bogus:extra' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "well-formed" in v.stdout, \ + f"build_id format gate failed to catch a malformed id:\n{v.stdout}" + + +def test_sample_facet_index_stale_coverage_id_caught(tmp_path): + """Codex #1: a build_id whose coverage half doesn't match the actual (pid,source) + universe (e.g. membership-only id, or hand-edited coverage) must fail — even + though node_bits is unchanged.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + # replace the coverage half with a structurally-valid but WRONG fingerprint + con.execute(f"""COPY (SELECT pid, source, material_mask, context_mask, object_type_mask, + split_part(build_id, ':', 1) || ':1_2_3' AS build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "coverage-half" in v.stdout, \ + f"coverage staleness gate failed to catch a wrong coverage id:\n{v.stdout}" + + +def test_sample_facet_index_source_drift_caught(tmp_path): + """index.source must equal facets_v2.source; a drifted source must fail the gate.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + ix = str(tmp_path / "t_sample_facet_index.parquet") + con = duckdb.connect(); tmp_i = ix + ".tmp" + con.execute(f"""COPY (SELECT pid, + CASE WHEN pid='s1' THEN 'WRONG' ELSE source END AS source, + material_mask, context_mask, object_type_mask, build_id, schema_version + FROM read_parquet('{ix}')) TO '{tmp_i}' (FORMAT PARQUET)"""); os.replace(tmp_i, ix) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "index.source == facets_v2.source" in v.stdout, \ + f"source-equality gate failed to catch drift:\n{v.stdout}" + + +def test_sample_facet_index_validation_requires_siblings(tmp_path): + """Codex r2 #1: validating the index WITHOUT membership+node_bits must FAIL + (fail-closed), not silently run a partial pass. Reproduces the bypass where a + corrupt index + restamped node_bits passed once the siblings were removed.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the siblings the index validation depends on + os.remove(tmp_path / "t_sample_facet_membership.parquet") + os.remove(tmp_path / "t_facet_node_bits.parquet") + os.remove(tmp_path / "t_sample_facet_masks.parquet") + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "required to gate" in v.stdout, \ + f"index validation should fail-closed without membership+node_bits:\n{v.stdout}" + + +def test_sample_facet_index_bad_node_bits_build_id_caught(tmp_path): + """Codex r2 #2: a node_bits file with !=1 non-NULL build_id previously skipped + the membership-half match entirely. It must now FAIL.""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + assert _build_index(tmp_path, wide, vocab).returncode == 0 + nb = str(tmp_path / "t_facet_node_bits.parquet") + con = duckdb.connect(); tmp_nb = nb + ".tmp" + # give node_bits TWO distinct build_ids (one row restamped) + con.execute(f"""COPY (SELECT facet_type, concept_uri, bit_index, + CASE WHEN bit_index = (SELECT MIN(bit_index) FROM read_parquet('{nb}')) + THEN 'OTHER_GEN' ELSE build_id END AS build_id + FROM read_parquet('{nb}')) TO '{tmp_nb}' (FORMAT PARQUET)"""); os.replace(tmp_nb, nb) + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode != 0 and "exactly one non-NULL build_id" in v.stdout, \ + f"node_bits build_id gate failed to catch a multi-id file:\n{v.stdout}" + + +def test_sample_facet_index_only_auto_pairs_bundle(tmp_path): + """Codex #4/r3: --only sample_facet_index must NOT ship an orphan. It + force-emits its whole fast-path bundle — facet_node_bits (bit interpretation) + AND sample_facet_membership (the validator's independent oracle). A partial + rebuild into a directory holding a prior FULL build must then re-validate + cleanly (the real iterate-on-one-artifact workflow).""" + wide = str(tmp_path / "wide.parquet"); vocab = str(tmp_path / "vocab.parquet") + build_index_fixture(wide, vocab) + # 1) full build so the validator's non-index siblings (facets_v2, h3, …) exist + assert _build_index(tmp_path, wide, vocab).returncode == 0 + # delete the bundle the index-only rebuild must RE-EMIT, so a stale full-build + # file can't satisfy the existence checks (Codex r4: prove re-emission, not + # leftover presence). + bundle = ("t_sample_facet_index.parquet", "t_facet_node_bits.parquet", + "t_sample_facet_membership.parquet") + for sib in bundle: + os.remove(tmp_path / sib) + # 2) rebuild ONLY the index into the same dir + r = subprocess.run([sys.executable, BUILD, "--wide", wide, "--outdir", str(tmp_path), "--tag", "t", + "--no-manifest", "--vocab-labels", vocab, + "--only", "sample_facet_index"], capture_output=True, text=True) + assert r.returncode == 0, f"{r.stdout}\n{r.stderr}" + # the bundle the index needs to be validatable was (re)emitted, not orphaned + for sib in bundle: + assert (tmp_path / sib).exists(), f"--only sample_facet_index did not re-emit {sib}:\n{r.stdout}" + # 3) the rebuilt artifacts validate against the existing full-build siblings + v = subprocess.run([sys.executable, VALIDATE, "--dir", str(tmp_path), "--tag", "t", "--min-rows", "1"], + capture_output=True, text=True) + assert v.returncode == 0, f"--only sample_facet_index rebuild failed validation:\n{v.stdout}\n{v.stderr}" + assert "index.pid == facets_v2.pid" in v.stdout + + def test_scheme_corruption_caught(tmp_path): wide = str(tmp_path / "wide.parquet"); build_fixture_wide(wide, "blob") assert _build(tmp_path, wide).returncode == 0