87 changes: 45 additions & 42 deletions README.md
@@ -12,41 +12,41 @@ without it.

## Quick Start

Two equivalent interfaces build and maintain the per-partition summaries. Use whichever
fits your workflow.

### Function interface
Summaries are built and maintained through a custom index access method, so pruning
follows the normal index lifecycle (`pg_dump`/restore, `REINDEX`, `DROP INDEX`).

```sql
CREATE EXTENSION pg_table_range;

-- Register one or more columns of a partitioned (or plain) table and build summaries.
-- Pass the relation as an OID; cast a name with ::regclass::oid.
SELECT table_range_create('events'::regclass::oid, ARRAY['val', 'created_at']);
-- Summarize one or more columns of a partitioned (or plain) table.
CREATE INDEX events_tr ON events USING table_range (val, created_at);

-- Queries now prune partitions whose summarized range cannot match the predicate.
-- Queries now prune partitions whose summary cannot match the predicate.
-- Verify with EXPLAIN: non-matching partitions disappear from the plan.
EXPLAIN (COSTS OFF) SELECT * FROM events WHERE val >= 250;

-- Recompute after bulk loads (also clears staleness); or drop registration entirely.
SELECT table_range_refresh('events'::regclass::oid);
SELECT table_range_drop('events'::regclass::oid);
```

### Index interface

```sql
-- Builds the same summaries via a custom index access method.
CREATE INDEX events_tr ON events USING table_range (val, created_at);

-- Pruning works immediately; REINDEX rebuilds summaries after heavy churn.
-- Recompute after heavy churn; or drop the summaries entirely.
REINDEX INDEX events_tr;
DROP INDEX events_tr; -- removes the summaries it built
```

The index is never used for scans — it exists only to build and own the summaries — so
it adds no scan-time overhead and is never chosen by the planner for data access.

### Supported column types (no setup, including PostGIS)

`CREATE INDEX … USING table_range` works on any **btree-comparable** type and any
**range** type out of the box. The required operator classes are provisioned
automatically by mirroring the types that already have a btree/range operator class — and
that mirror re-runs whenever an extension is installed, so **PostGIS geometry works the
moment you `CREATE EXTENSION postgis`, with no extra step**:

```sql
CREATE EXTENSION postgis; -- geometry opclass auto-registers
CREATE INDEX places_tr ON places USING table_range (geom);
EXPLAIN (COSTS OFF) SELECT * FROM places WHERE geom && ST_MakeEnvelope(0,0,10,10);
```

## How it works

- **Summaries.** For each leaf partition and indexed column, one row in
@@ -69,23 +69,30 @@ it adds no scan-time overhead and is never chosen by the planner for data access
is pruned by testing the constant against the partition's stored extent with
PostgreSQL's own `&&` operator — so a partition is eliminated when its extent cannot
overlap the query.
- **Automatic correctness.** Data changes mark the affected partition's summaries
*stale*, and stale summaries are never used for pruning — so a change can never cause a
missing row. The function interface installs row-level triggers
(`INSERT`/`UPDATE`/`DELETE`/`TRUNCATE`); the index interface marks stale from
`aminsert`. `table_range_refresh` (or `REINDEX`) recomputes and re-enables pruning.
A `sql_drop` event trigger removes summaries for any dropped relation or index.
- **Automatic correctness.** An insert that extends a partition marks its summary
*stale* (via the index's `aminsert`), and stale summaries are never used for pruning —
so a change can never cause a missing row. Deletes only shrink a partition's true
range, so the summary stays conservatively wide and remains safe. `REINDEX` recomputes
and re-enables pruning after churn, and a `sql_drop` event trigger removes a dropped
index's (or table's) summaries.
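
The staleness rule in the last bullet can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the extension's actual code: `MinMaxSummary`, `can_prune`, and `on_insert` are hypothetical names, the column is assumed to be a `bigint`, and the predicate is assumed to be a closed range `lo <= col AND col <= hi`.

```rust
/// Hypothetical in-memory form of one `minmax` summary row.
/// The real catalog row has more columns (see the Catalog section).
#[derive(Debug, Clone)]
struct MinMaxSummary {
    min: i64,
    max: i64,
    stale: bool,
}

/// Returns true when the partition can be skipped for `lo <= col AND col <= hi`.
fn can_prune(summary: &MinMaxSummary, lo: i64, hi: i64) -> bool {
    if summary.stale {
        // A stale summary is never trusted, so the partition is kept.
        return false;
    }
    // Prune only when the query range lies entirely outside [min, max].
    hi < summary.min || lo > summary.max
}

/// Sketch of the insert path: a value outside the summarized range
/// marks the summary stale instead of updating it in place.
fn on_insert(summary: &mut MinMaxSummary, value: i64) {
    if value < summary.min || value > summary.max {
        summary.stale = true;
    }
}
```

Deletes need no hook in this sketch: removing rows can only narrow the true range, so the stored `[min, max]` remains a safe over-approximation.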

## Performance

The win is at **planning time** on wide trees, where the planner would otherwise build
paths for every partition. On a 1000-partition table queried by a non-key column
(`bench/planning_benchmark.sql`, PostgreSQL 18):
The benefit is at **execution**: a selective predicate on a non-key column scans only the
matching partition instead of every partition. On 100 partitions × 30k rows = 3M rows
(`bench/planning_benchmark.sql`, PostgreSQL 18, warm):

| | Total query time (plan + exec) |
|---|---|
| pruning off (scans all 100 partitions) | ~125 ms |
| pruning on (scans 1 partition) | ~18 ms |

| | Planning time | Result |
|---|---|---|
| pruning off | ~125 ms | 50 rows |
| pruning on | ~17 ms | 50 rows |
Pruning is **not** a free planning-time win: it adds a small per-plan overhead (loading
summaries once, then evaluating each partition — single-digit to low-tens of ms on
hundreds of partitions). It pays off when the partitions it eliminates are large enough
that avoiding their scan outweighs that overhead — so it helps most on **large
partitions with a selective non-key predicate**, and can be a slight net cost on tiny
partitions. Use `table_range.enable_pruning` to measure both ways on your workload.

Summaries are loaded **once per plan** (not per partition); the
`e2e_per_plan_cache_loads_once_regardless_of_partitions` test asserts exactly one
@@ -114,20 +121,18 @@ Everything not listed is conservatively **kept** (never mispruned):

## Catalog

- `table_range_summary` — one summary row per (owner, leaf partition, column):
- `table_range_summary` — one summary row per (index, leaf partition, column):
`index_oid`, `relid`, `attnum`, `kind` (`minmax` or `overlap`), `type_name`,
`min_summary`, `max_summary`, `has_nulls`, `all_nulls`, `stale`, `tuple_version`.
- `table_range_registered` — parents registered through the function interface and their
columns.

## Project layout

| File | Responsibility |
|------|----------------|
| `src/lib.rs` | GUCs, `_PG_init`, catalog/bootstrap SQL, test wiring |
| `src/summary_build.rs` | SPI summary build/refresh/drop, registration, staleness triggers |
| `src/summary_build.rs` | SPI summary build (scalar min/max + range/geometry extent) |
| `src/prune_hook.rs` | planner + pathlist hooks, per-plan cache, typed in-memory evaluation |
| `src/index_am.rs` | `table_range` index access method and operator classes |
| `src/index_am.rs` | `table_range` index access method + automatic operator-class provisioning |
| `src/e2e_tests.rs`, `src/index_am_tests.rs` | end-to-end tests |

## Building and testing
@@ -149,7 +154,5 @@ range-type tests, which exercise the same code path.

- `NOT IN` / `<> ALL`, `NOT (...)`, expression predicates, and parameterized
prepared-statement plans are kept rather than pruned.
- Summaries are exact at build/refresh time; between changes and a refresh the affected
partitions are simply not pruned (always correct, just less selective).
- The index interface marks a partition stale on insert; recompute with `REINDEX` (or
`table_range_refresh` for the function interface) to re-enable pruning after churn.
- Summaries are exact at build time; an insert that extends a partition marks it stale
(not pruned, but still correct) until the next `REINDEX`.
58 changes: 30 additions & 28 deletions bench/planning_benchmark.sql
@@ -1,49 +1,51 @@
-- Planning-time benchmark for table_range pruning.
-- Benchmark for table_range pruning.
--
-- Builds a wide partition tree where the queried column is NOT the partition key, so
-- native PostgreSQL pruning cannot help, and compares planning time + plan size with
-- table_range pruning off vs. on. Run with:
-- Measures end-to-end query time (planning + execution, warm) for a selective predicate
-- on a NON-partition-key column, with table_range pruning on vs. off. Native PostgreSQL
-- cannot prune on a non-key column, so without pruning the query scans every partition.
--
-- cargo pgrx run pg18
-- \i bench/planning_benchmark.sql
--
-- Look at the "Planning Time" line and the number of child plans in each EXPLAIN.
-- Pruning trades a small per-plan overhead (loading summaries + evaluating each
-- partition) for skipping the scan of non-matching partitions, so it wins when the
-- partitions it eliminates are large enough to outweigh that overhead.

\set part_count 1000
\set part_count 100
\set rows_per_part 30000

DROP TABLE IF EXISTS bench_events CASCADE;
CREATE TABLE bench_events (region int, val bigint) PARTITION BY LIST (region);
CREATE TABLE bench_events (region int, val bigint, pad text) PARTITION BY LIST (region);

-- One partition per region; each holds a disjoint 1000-wide band of `val`.
SELECT format(
'CREATE TABLE bench_events_%s PARTITION OF bench_events FOR VALUES IN (%s);',
g, g
)
'CREATE TABLE bench_events_%s PARTITION OF bench_events FOR VALUES IN (%s);', g, g)
FROM generate_series(1, :part_count) g \gexec

-- region is the partition key; val is a disjoint band per partition (the queried,
-- non-key column).
INSERT INTO bench_events
SELECT g, (g * 1000) + s
FROM generate_series(1, :part_count) g,
generate_series(0, 49) s;
SELECT g, g * 1000000 + s, repeat('x', 50)
FROM generate_series(1, :part_count) g, generate_series(0, :rows_per_part - 1) s;

ANALYZE bench_events;
VACUUM ANALYZE bench_events;
CREATE INDEX bench_events_tr ON bench_events USING table_range (val);

SELECT table_range_create('bench_events'::regclass::oid, ARRAY['val']);
\timing on

\echo '==================== pruning OFF ===================='
SET table_range.enable_pruning = off;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY ON)
SELECT * FROM bench_events WHERE val BETWEEN 500000 AND 500049;

\echo '==================== pruning ON ===================='
-- Warm the relation cache first so the numbers reflect steady state, not first-touch
-- partition-metadata loading (which both modes pay equally).
SET table_range.enable_pruning = on;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY ON)
SELECT * FROM bench_events WHERE val BETWEEN 500000 AND 500049;
SELECT count(*) FROM bench_events WHERE val = 50000000;

\echo '==================== pruning ON (warm) ===================='
SELECT count(*) FROM bench_events WHERE val = 50000000;
SELECT count(*) FROM bench_events WHERE val = 50000000;

\echo '==================== correctness check (must match) ===================='
SET table_range.enable_pruning = off;
SELECT count(*) AS off_count FROM bench_events WHERE val BETWEEN 500000 AND 500049;
SET table_range.enable_pruning = on;
SELECT count(*) AS on_count FROM bench_events WHERE val BETWEEN 500000 AND 500049;
SELECT count(*) FROM bench_events WHERE val = 50000000;

\echo '==================== pruning OFF (warm) ===================='
SELECT count(*) FROM bench_events WHERE val = 50000000;
SELECT count(*) FROM bench_events WHERE val = 50000000;

DROP TABLE bench_events CASCADE;