#304/#305 Phase 2: correct multi-filter facet counts at global view (stacked on #307)#308
Draft
rdhyee wants to merge 12 commits into
Draft
#304/#305 Phase 2: correct multi-filter facet counts at global view (stacked on #307)#308rdhyee wants to merge 12 commits into
rdhyee wants to merge 12 commits into
Conversation
Joint Claude+Codex plan for computing multi-filter facet counts from the isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the masks histogram approach, native DuckDB benchmarks, source handling, the isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under active filters), a 4-phase rollout, and risks. Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index) Build sample_facet_index: one row per LOCATED sample (samp_geo), not only those with tree membership. sample_facet_masks is built FROM membership and silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole located universe, so it scans this complete index instead. - start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the no-membership pids (correct: in no subtree, but still located + still counts toward source/total) - add `source` VARCHAR (exclusive, not multi-valued -> not a mask) - build_id = "<membership_id>:<coverage_id>": membership half matches facet_node_bits (mask-bit interpretation gate); coverage half fingerprints the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale silently (today's membership-only id is blind to it) - schema_version column for forward-compat Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2 (located universe completeness, isamplesorg#306), source==facets_v2, structured single build_id matching node_bits, masks bit-identical to sample_facet_masks for shared pids, and zero-mask for every no-membership pid. Tests: a located no-membership sample (root-only material) — present + zero masked in the index, absent from masks; coverage-id changes on source flip; all corruption gates bite. 29/29 pass. Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership / "anywhere in tree") served by this index. Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
… (Codex round 1) Address Codex adversarial review of the sample_facet_index Phase-1 diff: - #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id halves from the written siblings (coverage from sample_facets_v2, membership from sample_facet_membership) and asserts equality — a stale or hand-edited coverage id now FAILS at build time even though node_bits is unchanged. Runtime explorer handshake documented as a Phase 2 item. - #2 validator metadata/NULL gates: schema_version must equal the exact contract version (999 was accepted); build_id must match the exact shape <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0). - #3 independent mask check: re-derive masks from membership + node_bits for EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the optional sample_facet_masks file. - #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is built, so --only sample_facet_index can't ship an uninterpretable mask file. - #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+ COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with a sentinel byte so NULL≠''; honest non-cryptographic caveat. - SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the dup check, broke pid joins, and was absent from the coverage hash). Tests: +5 adversarial (null mask, schema_version=999, malformed build_id, stale coverage id, null pid) — all gates bite. 34/34 pass. Refs isamplesorg#305 isamplesorg#306
… (Codex round 2) - r2 #1: index validation now FAILS CLOSED when sample_facet_membership or facet_node_bits is absent — those siblings are what the build_id recompute and mask re-derivation depend on, so a missing sibling must gate, not run a silent partial pass (Codex reproduced a corrupt index + restamped node_bits passing once the siblings were removed). They're always co-produced, so --dir/--tag satisfies it. - r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the membership-half match — 0 or >1 distinct ids previously skipped the match entirely (fail-open). Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly; coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected; node_bits force-emit has no double-build and masks/index can't ship without node_bits (even with --skip). Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…h index (Codex round 3) Codex r3 caught a regression my own fixes introduced: --only sample_facet_index emitted index + node_bits but NOT sample_facet_membership, so the new validator gate (which requires membership as its independent oracle) would reject that output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306 completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against an existing full build, by design. - force_dep(): whenever masks or index is built, force-emit BOTH sample_facet_membership and facet_node_bits (the bundle the index validator depends on), exactly once, recorded for the manifest. - test now proves the real workflow: full build -> --only sample_facet_index rebuild into the same dir -> re-validate PASSES (not just file-existence). Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every bypass) with no new skip path. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…s bundle (Codex round 4) Delete index+node_bits+membership after the full build and before the --only sample_facet_index rebuild, so a stale full-build file can't satisfy the existence checks. Retains the re-emission + validator-pass assertions. Codex r4 confirmed the implementation correct: index-only and masks-only emit membership+node_bits exactly once, no duplicate emissions in a full build, partial-rebuild validation passes, manifest records each forced dep once, and no invocation produces masks/index without node_bits+membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio — but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's facetIndexReady preflight matches the index's membership-half against the deployed node_bits.build_id. The trio format would never match, leaving multi-filter counts permanently "unavailable" against already-published artifacts. Revert membership_build_id to the bare order-independent XOR (grain is unique → no cancellation, as the original argued). Keep the richer trio ONLY for coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore "<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape. Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id, so the index can be published standalone (new filename, non-destructive) without republishing node_bits/masks/membership. 36/36 pass. Refs isamplesorg#305 isamplesorg#306
… global view via mask index The single-filter cube (isamplesorg#298) serves correct cross-filtered counts for exactly ONE active filter at global view. With 2+ filters, effectiveSingleFilter() returns null, the cube is skipped, and updateCrossFilteredCounts falls to the legacy early-return that reverts every tree dim to the UNFILTERED baseline (explorer.qmd) — the isamplesorg#304 bug (e.g. Material=anthropogenic + Object=artifact showed 261,086 artifacts instead of the cross-filtered 145,770). Fix: route the global / full-tree-mode / no-search / multi-filter case through the complete per-pid index (sample_facet_index, isamplesorg#305/isamplesorg#306) with the node_bits bitmask predicate: - directFilterSnapshot(): reads ALL selections from the controls WITHOUT zeroing tree dims at global view (that zeroing in describeCrossFilters is the root cause). - maskIndexWhere(): per-dim cross-filter predicate (OR within a dim via the bit mask, AND across dims, exclude-self); source-impossible zeroes tree targets but not source's own histogram (the plan's special case). - applyMaskIndexCounts(): one columnar bitmask query per dim, applied ATOMICALLY after a single facetCountsReqId stale check; COUNT(*) over the one-row-per-pid index == distinct pids. - HONESTY RULE: on query failure → 'unavailable' (a "(—)" dash, .count-unavailable), NEVER the baseline. If the index/bitmap isn't usable yet (unpublished) → 'fallthrough' to the legacy paths unchanged, so this is safe to ship first. - facetIndexReady preflight: index present + schema_version==1 + build_id membership-half == node_bits generation; else stays unavailable, not wrong. Single-filter cube kept as a pure optimization. Viewport + search still take the live membership path (Phase 3). Verified on real 202608 data: index covers 6,026,242 located pids vs masks' 5,996,325 — exactly the 29,917 isamplesorg#306 samples recovered; full validator passes; Eric's case returns 145,770 (cross-filtered) not 261,086 (baseline). Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…y, boot race, coverage, decoupling) - P1 fallthrough reintroduced isamplesorg#304: at global + full tree + active filters there is NO correct legacy path (it reverts to baseline), so 'unavailable' AND 'fallthrough' now both show the dash, never legacy baseline. + defensive stale check before markFacetCountsUnavailable. - P1 boot race: facetIndexReady now calls window.__onFacetIndexReady() when it settles true; the main script registers it to refreshFacetCounts() if filters are active, so a URL-restored multi-filter that computed before readiness reconciles to correct counts instead of staying on the dash. - P1/P2 runtime coverage handshake: facetIndexReady now also verifies the index row count == facets (located universe) — catches stale/partial/mixed-cache indexes (the isamplesorg#306 drift class a membership-only id is blind to). Full fingerprint stays a build-time gate (SERIALIZATIONS §4.12). - P2 decouple node_bits from masks: nodeBitsReady publishes __nodeBitsMap + __nodeBitsBuild as soon as node_bits is valid (count path needs only these); __nodeBits (masks-gated) stays for facetFilterSQL. A missing masks file no longer disables a valid count bundle. Count path now reads __nodeBitsMap. - P2 WASM concurrency: the 4 per-dim queries are now ONE UNION ALL statement (one connection slot) instead of 4 concurrent full-index scans. - P2 SQL injection: R2_BASE (user-controllable via ?data_base=/localStorage) is rejected if it contains a quote/backslash/control char or isn't http(s)/root- relative, so it can't break out of read_parquet('...') in any query. Re-verified on real 202608 data: single UNION ALL query returns Eric's case as 145,770 cross-filtered across all dims (vs 261,086 baseline). Codex confirmed sound: COUNT(*) per concept == distinct pids; exclude-self / OR-within / AND-across / source-impossible / multi-value-single-dim semantics; single post-Promise stale check. WASM cold/warm browser benchmark remains the Phase 4 release gate. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
… source-only refresh (Codex r2) - P1: nodeBitsReady split into TWO try blocks — a masks-query failure no longer throws into the shared catch that nulled __nodeBitsMap/__nodeBitsBuild. node_bits publishes the count-path values in step 1; masks preflight is step 2 and only withholds __nodeBits (the FILTER signal) on failure. Count path now truly survives a missing/mismatched masks file. - P1: __onFacetIndexReady refreshes UNCONDITIONALLY — hasFacetFilters() excludes source, so a source-only filter that painted dashes before readiness now also reconciles. Debounce + reqId guard make a redundant no-filter refresh harmless. - P2 staleness: coverage handshake upgraded from bare row-count to per-SOURCE histogram equality (index vs facets_v3) — catches equal-cardinality PID replacement + source drift, not just a smaller index. Verified mismatch=0 on real 202608 data (GEOME/OPENCONTEXT/SESAR/SMITHSONIAN all match). Still a proxy, not the full fingerprint (build-time gate); generation-id match guards mask corruption. Codex r2 confirmed otherwise sound: no global active-filter route reaches baseline; UNION ALL exclude-self + dim regrouping correct; unqualified cols bind to index `s`; all R2_BASE forms pass; 36 py + 13 node tests pass. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…laim (Codex r3) The per-source histogram handshake does NOT catch a same-source, same-cardinality PID swap (mismatch stays 0) — correct the overclaim. That residual is closed by (i) the generation-id match (the membership half hashes pid, so any swap of a pid WITH membership changes it — leaving only isamplesorg#306 no-membership pids) and (ii) the authoritative BUILD-TIME coverage fingerprint + pid-set equality gate. Runtime stays a cheap source/cardinality proxy by design (boot-query budget); a future co-published expected-id pointer would let it assert the full fingerprint. Codex r3 confirmed all P1s closed: masks failure never clears the count-path values; map/nbBuild correctly scoped; unconditional readiness refresh can't loop; coverage SQL valid + NULL-safe + cheap. Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
2268252 to
0832091
Compare
…erage check cheap (boot-perf) In-browser DuckDB-WASM verification against the live index showed the per-source coverage handshake cost ~8.6s at boot — it scanned the 60MB facets_v3. Compare the index's per-source histogram to facet_summaries' source rows instead: the builder computes those as GROUP BY source over the SAME located set (samp_geo), the file is ~2KB and already loaded at boot, and it's verified equivalent (mismatch=0 on real 202608 data). This removes a multi-second connection hold at boot — exactly the single-connection starvation the app guards against — with no loss of coverage signal. Index source filter mirrors the builder's NULLIF(TRIM(source),'') so empty-string sources can't cause a false mismatch. Verified end-to-end in WASM: preflight PASS, Eric's case = 145,770, cold 1121ms / warm 174ms. Refs isamplesorg#305 isamplesorg#304
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase 2 of #305 — fix #304: correct multi-filter facet counts at global view
🔗 Stacked on #307 (Phase 1). Draft until #307 merges; the net-new change here is
explorer.qmdonly (Phase 1's build/validator commits show until the base merges).Fixes #304: at the global/zoomed-out view, facet counts only cross-filter for a single active filter (served by the
facet_tree_cross_filtercube, #298). With 2+ active selections,effectiveSingleFilter()returns null, the cube is skipped, andupdateCrossFilteredCountsfalls to a legacy early-return that reverts every tree dim to the unfiltered baseline — misleading counts. (Map/point filtering was already correct; only the numbers misled.)The fix
A new path in
updateCrossFilteredCounts, between the single-filter cube and the baseline early-return, routes the global / full-tree-mode / no-search / multi-filter case through the complete per-pid index (sample_facet_index, #305/#306) using thefacet_node_bitsbitmask predicate:directFilterSnapshot()reads all selections from the controls without zeroing tree dims at global view (that zeroing indescribeCrossFilters()is the root cause).maskIndexWhere()builds the per-dim cross-filter predicate (OR within a dim via the bit mask, AND across dims, exclude-self); source-impossible zeroes tree targets but not source's own histogram.applyMaskIndexCounts()runs oneUNION ALLstatement (one DuckDB-WASM connection slot, not 4 concurrent scans), then applies all dims atomically after a singlefacetCountsReqIdstale check.COUNT(*)over the one-row-per-pid index == distinct pids.Honesty rule (never baseline under active filters)
On any failure, or while the index is unpublished/inconsistent, multi-filter counts show an explicit "(—)" dash (
.count-unavailable), never the misleading baseline. AfacetIndexReadypreflight gates on: index present +schema_version+build_idmembership-half ==node_bitsgeneration + per-source coverage == the located universe. When readiness settles after boot,__onFacetIndexReadyreconciles any active filter.Verified on real 202608 data
sample_facet_index(validator ALL-PASS); it covers 6,026,242 located pids vs masks' 5,996,325 — exactly the 29,917 Data: sample_facet_masks omits ~29,917 located samples with no tree membership (built from membership, not samp_geo) #306 samples recovered.UNION ALLreturns consistent counts across all dims.Review notes (4 Codex rounds to LGTM)
Addressed: honesty-rule fallthrough, boot-race reconciliation, runtime coverage handshake, node_bits/masks readiness decoupling, single-query concurrency, and
R2_BASESQL-injection hardening.Not in this PR
sample_facet_indexto R2 (built/validated locally) — must land with/just before this so deployed multi-filter counts compute (else they honestly show "—").Refs #305 #304 #306