Skip to content

#304/#305 Phase 2: correct multi-filter facet counts at global view (stacked on #307)#308

Draft
rdhyee wants to merge 12 commits into
isamplesorg:mainfrom
rdhyee:feat/304-multifilter-counts
Draft

#304/#305 Phase 2: correct multi-filter facet counts at global view (stacked on #307)#308
rdhyee wants to merge 12 commits into
isamplesorg:mainfrom
rdhyee:feat/304-multifilter-counts

Conversation

@rdhyee

@rdhyee rdhyee commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Phase 2 of #305fix #304: correct multi-filter facet counts at global view

🔗 Stacked on #307 (Phase 1). Draft until #307 merges; the net-new change here is explorer.qmd only (Phase 1's build/validator commits show until the base merges).

Fixes #304: at the global/zoomed-out view, facet counts only cross-filter for a single active filter (served by the facet_tree_cross_filter cube, #298). With 2+ active selections, effectiveSingleFilter() returns null, the cube is skipped, and updateCrossFilteredCounts falls to a legacy early-return that reverts every tree dim to the unfiltered baseline — misleading counts. (Map/point filtering was already correct; only the numbers misled.)

The fix

A new path in updateCrossFilteredCounts, between the single-filter cube and the baseline early-return, routes the global / full-tree-mode / no-search / multi-filter case through the complete per-pid index (sample_facet_index, #305/#306) using the facet_node_bits bitmask predicate:

  • directFilterSnapshot() reads all selections from the controls without zeroing tree dims at global view (that zeroing in describeCrossFilters() is the root cause).
  • maskIndexWhere() builds the per-dim cross-filter predicate (OR within a dim via the bit mask, AND across dims, exclude-self); source-impossible zeroes tree targets but not source's own histogram.
  • applyMaskIndexCounts() runs one UNION ALL statement (one DuckDB-WASM connection slot, not 4 concurrent scans), then applies all dims atomically after a single facetCountsReqId stale check. COUNT(*) over the one-row-per-pid index == distinct pids.

Honesty rule (never baseline under active filters)

On any failure, or while the index is unpublished/inconsistent, multi-filter counts show an explicit "(—)" dash (.count-unavailable), never the misleading baseline. A facetIndexReady preflight gates on: index present + schema_version + build_id membership-half == node_bits generation + per-source coverage == the located universe. When readiness settles after boot, __onFacetIndexReady reconciles any active filter.

Verified on real 202608 data

Review notes (4 Codex rounds to LGTM)

Addressed: honesty-rule fallthrough, boot-race reconciliation, runtime coverage handshake, node_bits/masks readiness decoupling, single-query concurrency, and R2_BASE SQL-injection hardening.

Not in this PR

  • Publishing sample_facet_index to R2 (built/validated locally) — must land with/just before this so deployed multi-filter counts compute (else they honestly show "—").
  • Cold/warm DuckDB-WASM benchmark — the release gate (Phase 4).
  • Viewport + search counts (Phase 3); Playwright + remove baseline fallback (Phase 4).

Refs #305 #304 #306

rdhyee and others added 11 commits June 19, 2026 21:51
Joint Claude+Codex plan for computing multi-filter facet counts from the
isamplesorg#299 bitmask index (avoiding the 39M-row membership scan). Covers the
masks histogram approach, native DuckDB benchmarks, source handling, the
isamplesorg#306 missing-pid prerequisite, the honesty rule (never baseline under
active filters), a 4-phase rollout, and risks.

Refs isamplesorg#305, isamplesorg#304, isamplesorg#306, isamplesorg#276.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_facet_index)

Build sample_facet_index: one row per LOCATED sample (samp_geo), not only
those with tree membership. sample_facet_masks is built FROM membership and
silently omits ~29,917 located samples with no tree concept (isamplesorg#306); the
multi-filter global-view count path (isamplesorg#304/isamplesorg#305) must count over the whole
located universe, so it scans this complete index instead.

- start from samp_geo, LEFT JOIN membership-derived masks, zero-mask the
  no-membership pids (correct: in no subtree, but still located + still
  counts toward source/total)
- add `source` VARCHAR (exclusive, not multi-valued -> not a mask)
- build_id = "<membership_id>:<coverage_id>": membership half matches
  facet_node_bits (mask-bit interpretation gate); coverage half fingerprints
  the samp_geo (pid, source) universe so isamplesorg#306-class drift can't go stale
  silently (today's membership-only id is blind to it)
- schema_version column for forward-compat

Validator: --index gate asserts schema, one-row-per-pid, pid==facets_v2
(located universe completeness, isamplesorg#306), source==facets_v2, structured single
build_id matching node_bits, masks bit-identical to sample_facet_masks for
shared pids, and zero-mask for every no-membership pid.

Tests: a located no-membership sample (root-only material) — present + zero
masked in the index, absent from masks; coverage-id changes on source flip;
all corruption gates bite. 29/29 pass.

Docs: SERIALIZATIONS.md §4.12 ratifies the isamplesorg#276 count contract (membership /
"anywhere in tree") served by this index.

Refs isamplesorg#305 isamplesorg#306 isamplesorg#304 isamplesorg#276
… (Codex round 1)

Address Codex adversarial review of the sample_facet_index Phase-1 diff:

- #1 staleness gate: validator now INDEPENDENTLY recomputes both build_id
  halves from the written siblings (coverage from sample_facets_v2,
  membership from sample_facet_membership) and asserts equality — a stale or
  hand-edited coverage id now FAILS at build time even though node_bits is
  unchanged. Runtime explorer handshake documented as a Phase 2 item.
- #2 validator metadata/NULL gates: schema_version must equal the exact
  contract version (999 was accepted); build_id must match the exact shape
  <xor>_<sum>_<cnt>:<xor>_<sum>_<cnt> ("a colon somewhere" accepted
  <m>:bogus:extra); NOT-NULL contract on pid/masks/build_id/version; zero-mask
  checks use IS DISTINCT FROM 0 (a NULL mask escaped the old <> 0).
- #3 independent mask check: re-derive masks from membership + node_bits for
  EVERY index pid (incl. zero rows) and symmetric-diff — no longer trusts the
  optional sample_facet_masks file.
- #4 orphan artifact: facet_node_bits is force-emitted whenever masks/index is
  built, so --only sample_facet_index can't ship an uninterpretable mask file.
- #5 fingerprint robustness: shared MEMBERSHIP/COVERAGE token exprs; XOR+SUM+
  COUNT trio (resists XOR cancellation / 2-row swaps); NULL source encoded with
  a sentinel byte so NULL≠''; honest non-cryptographic caveat.
- SQL gap: reject NULL pids in build_base_tables (a single NULL slipped the
  dup check, broke pid joins, and was absent from the coverage hash).

Tests: +5 adversarial (null mask, schema_version=999, malformed build_id,
stale coverage id, null pid) — all gates bite. 34/34 pass.

Refs isamplesorg#305 isamplesorg#306
… (Codex round 2)

- r2 #1: index validation now FAILS CLOSED when sample_facet_membership or
  facet_node_bits is absent — those siblings are what the build_id recompute
  and mask re-derivation depend on, so a missing sibling must gate, not run a
  silent partial pass (Codex reproduced a corrupt index + restamped node_bits
  passing once the siblings were removed). They're always co-produced, so
  --dir/--tag satisfies it.
- r2 #2: assert node_bits has EXACTLY ONE non-NULL build_id before the
  membership-half match — 0 or >1 distinct ids previously skipped the match
  entirely (fail-open).

Codex r2 confirmed: builder _fingerprint() == validator _fp() exactly;
coverage-from-facets_v2 == samp_geo; the three round-1 bypasses stay rejected;
node_bits force-emit has no double-build and masks/index can't ship without
node_bits (even with --skip).

Tests: +2 (missing-siblings fail-closed, multi-id node_bits). 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…h index (Codex round 3)

Codex r3 caught a regression my own fixes introduced: --only sample_facet_index
emitted index + node_bits but NOT sample_facet_membership, so the new validator
gate (which requires membership as its independent oracle) would reject that
output. Index validation legitimately also needs sample_facets_v2 (the isamplesorg#306
completeness oracle), so --only sample_facet_index is a PARTIAL rebuild against
an existing full build, by design.

- force_dep(): whenever masks or index is built, force-emit BOTH
  sample_facet_membership and facet_node_bits (the bundle the index validator
  depends on), exactly once, recorded for the manifest.
- test now proves the real workflow: full build -> --only sample_facet_index
  rebuild into the same dir -> re-validate PASSES (not just file-existence).

Codex r3 confirmed both round-2 fail-open gaps stay closed (rc=1 on every
bypass) with no new skip path. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…s bundle (Codex round 4)

Delete index+node_bits+membership after the full build and before the
--only sample_facet_index rebuild, so a stale full-build file can't satisfy
the existence checks. Retains the re-emission + validator-pass assertions.

Codex r4 confirmed the implementation correct: index-only and masks-only emit
membership+node_bits exactly once, no duplicate emissions in a full build,
partial-rebuild validation passes, manifest records each forced dep once, and
no invocation produces masks/index without node_bits+membership.

36/36 pass.

Refs isamplesorg#305 isamplesorg#306
…atible with deployed node_bits

The Codex-r1 hardening switched membership_build_id to the XOR+SUM+COUNT trio —
but that id is a DEPLOYED CONTRACT: the live 202608 facet_node_bits/sample_facet_masks
carry it as a bare bit_xor decimal (e.g. 11317573279759780618), and the explorer's
facetIndexReady preflight matches the index's membership-half against the deployed
node_bits.build_id. The trio format would never match, leaving multi-filter counts
permanently "unavailable" against already-published artifacts.

Revert membership_build_id to the bare order-independent XOR (grain is unique →
no cancellation, as the original argued). Keep the richer trio ONLY for
coverage_build_id (a NEW id, no compat constraint). Index build_id is therefore
"<m_xor>:<c_xor>_<c_sum>_<c_cnt>". Validator recomputes the membership half with
bare XOR (_xor_fp) and the well-formedness regex matches the asymmetric shape.

Result: a freshly-built index's membership-half == the DEPLOYED node_bits.build_id,
so the index can be published standalone (new filename, non-destructive) without
republishing node_bits/masks/membership. 36/36 pass.

Refs isamplesorg#305 isamplesorg#306
… global view via mask index

The single-filter cube (isamplesorg#298) serves correct cross-filtered counts for exactly
ONE active filter at global view. With 2+ filters, effectiveSingleFilter()
returns null, the cube is skipped, and updateCrossFilteredCounts falls to the
legacy early-return that reverts every tree dim to the UNFILTERED baseline
(explorer.qmd) — the isamplesorg#304 bug (e.g. Material=anthropogenic + Object=artifact
showed 261,086 artifacts instead of the cross-filtered 145,770).

Fix: route the global / full-tree-mode / no-search / multi-filter case through
the complete per-pid index (sample_facet_index, isamplesorg#305/isamplesorg#306) with the node_bits
bitmask predicate:
- directFilterSnapshot(): reads ALL selections from the controls WITHOUT zeroing
  tree dims at global view (that zeroing in describeCrossFilters is the root cause).
- maskIndexWhere(): per-dim cross-filter predicate (OR within a dim via the bit
  mask, AND across dims, exclude-self); source-impossible zeroes tree targets but
  not source's own histogram (the plan's special case).
- applyMaskIndexCounts(): one columnar bitmask query per dim, applied ATOMICALLY
  after a single facetCountsReqId stale check; COUNT(*) over the one-row-per-pid
  index == distinct pids.
- HONESTY RULE: on query failure → 'unavailable' (a "(—)" dash, .count-unavailable),
  NEVER the baseline. If the index/bitmap isn't usable yet (unpublished) →
  'fallthrough' to the legacy paths unchanged, so this is safe to ship first.
- facetIndexReady preflight: index present + schema_version==1 + build_id
  membership-half == node_bits generation; else stays unavailable, not wrong.

Single-filter cube kept as a pure optimization. Viewport + search still take the
live membership path (Phase 3).

Verified on real 202608 data: index covers 6,026,242 located pids vs masks'
5,996,325 — exactly the 29,917 isamplesorg#306 samples recovered; full validator passes;
Eric's case returns 145,770 (cross-filtered) not 261,086 (baseline).

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…y, boot race, coverage, decoupling)

- P1 fallthrough reintroduced isamplesorg#304: at global + full tree + active filters there
  is NO correct legacy path (it reverts to baseline), so 'unavailable' AND
  'fallthrough' now both show the dash, never legacy baseline. + defensive stale
  check before markFacetCountsUnavailable.
- P1 boot race: facetIndexReady now calls window.__onFacetIndexReady() when it
  settles true; the main script registers it to refreshFacetCounts() if filters
  are active, so a URL-restored multi-filter that computed before readiness
  reconciles to correct counts instead of staying on the dash.
- P1/P2 runtime coverage handshake: facetIndexReady now also verifies the index
  row count == facets (located universe) — catches stale/partial/mixed-cache
  indexes (the isamplesorg#306 drift class a membership-only id is blind to). Full
  fingerprint stays a build-time gate (SERIALIZATIONS §4.12).
- P2 decouple node_bits from masks: nodeBitsReady publishes __nodeBitsMap +
  __nodeBitsBuild as soon as node_bits is valid (count path needs only these);
  __nodeBits (masks-gated) stays for facetFilterSQL. A missing masks file no
  longer disables a valid count bundle. Count path now reads __nodeBitsMap.
- P2 WASM concurrency: the 4 per-dim queries are now ONE UNION ALL statement
  (one connection slot) instead of 4 concurrent full-index scans.
- P2 SQL injection: R2_BASE (user-controllable via ?data_base=/localStorage) is
  rejected if it contains a quote/backslash/control char or isn't http(s)/root-
  relative, so it can't break out of read_parquet('...') in any query.

Re-verified on real 202608 data: single UNION ALL query returns Eric's case as
145,770 cross-filtered across all dims (vs 261,086 baseline).

Codex confirmed sound: COUNT(*) per concept == distinct pids; exclude-self /
OR-within / AND-across / source-impossible / multi-value-single-dim semantics;
single post-Promise stale check. WASM cold/warm browser benchmark remains the
Phase 4 release gate.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
… source-only refresh (Codex r2)

- P1: nodeBitsReady split into TWO try blocks — a masks-query failure no longer
  throws into the shared catch that nulled __nodeBitsMap/__nodeBitsBuild. node_bits
  publishes the count-path values in step 1; masks preflight is step 2 and only
  withholds __nodeBits (the FILTER signal) on failure. Count path now truly
  survives a missing/mismatched masks file.
- P1: __onFacetIndexReady refreshes UNCONDITIONALLY — hasFacetFilters() excludes
  source, so a source-only filter that painted dashes before readiness now also
  reconciles. Debounce + reqId guard make a redundant no-filter refresh harmless.
- P2 staleness: coverage handshake upgraded from bare row-count to per-SOURCE
  histogram equality (index vs facets_v3) — catches equal-cardinality PID
  replacement + source drift, not just a smaller index. Verified mismatch=0 on real
  202608 data (GEOME/OPENCONTEXT/SESAR/SMITHSONIAN all match). Still a proxy, not
  the full fingerprint (build-time gate); generation-id match guards mask corruption.

Codex r2 confirmed otherwise sound: no global active-filter route reaches baseline;
UNION ALL exclude-self + dim regrouping correct; unqualified cols bind to index `s`;
all R2_BASE forms pass; 36 py + 13 node tests pass.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…laim (Codex r3)

The per-source histogram handshake does NOT catch a same-source, same-cardinality
PID swap (mismatch stays 0) — correct the overclaim. That residual is closed by
(i) the generation-id match (the membership half hashes pid, so any swap of a pid
WITH membership changes it — leaving only isamplesorg#306 no-membership pids) and (ii) the
authoritative BUILD-TIME coverage fingerprint + pid-set equality gate. Runtime
stays a cheap source/cardinality proxy by design (boot-query budget); a future
co-published expected-id pointer would let it assert the full fingerprint.

Codex r3 confirmed all P1s closed: masks failure never clears the count-path
values; map/nbBuild correctly scoped; unconditional readiness refresh can't loop;
coverage SQL valid + NULL-safe + cheap.

Refs isamplesorg#305 isamplesorg#304 isamplesorg#306
…erage check cheap (boot-perf)

In-browser DuckDB-WASM verification against the live index showed the per-source
coverage handshake cost ~8.6s at boot — it scanned the 60MB facets_v3. Compare
the index's per-source histogram to facet_summaries' source rows instead: the
builder computes those as GROUP BY source over the SAME located set (samp_geo),
the file is ~2KB and already loaded at boot, and it's verified equivalent
(mismatch=0 on real 202608 data). This removes a multi-second connection hold at
boot — exactly the single-connection starvation the app guards against — with no
loss of coverage signal. Index source filter mirrors the builder's
NULLIF(TRIM(source),'') so empty-string sources can't cause a false mismatch.

Verified end-to-end in WASM: preflight PASS, Eric's case = 145,770, cold 1121ms /
warm 174ms.

Refs isamplesorg#305 isamplesorg#304
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interactive Explorer: Facet counts wrong when filtering by multiple facet values

1 participant