MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support (main)#5222
arcivanov wants to merge 2 commits into
Code Review
This pull request implements support for BLOB, TEXT, JSON, and GEOMETRY columns in the HEAP (MEMORY) storage engine using variable-length continuation runs. The review identified several critical issues, including potential integer overflows in hp_blob.c during the calculation of total_records_needed and the accumulation of total_copy_size, which could lead to memory corruption. Additionally, a buffer overflow vulnerability was found in hp_create.c due to an under-allocated record buffer when using reclength instead of visible_offset. Other issues include a memory leak of pending blob frees in hp_close.c when delete_on_close is false, and unresolved merge conflict markers in the type_enum.result and type_set.result test files.
Five code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.
**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin returns NULL from `make_new_field()` (cursor values cannot be materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE` without a NULL check. Added `if (result)` guard.
**xmltype identity loss in `Item_type_holder::create_tmp_field_ex()`**: `dynamic_cast<Type_handler_blob_common*>` matched `Type_handler_xmltype` (inherits `Type_handler_long_blob`), replacing it with `Field_blob_key`. Replaced with virtual dispatch via `type_handler_for_tmp_table()` -- xmltype's override preserves its identity, blob_common's returns `blob_key_type_handler()` when `part_of_unique_key`. Pack length recovery uses `blob_type_handler(max_length, NULL)->length_bytes()` since the original type_handler can be varchar (promoted via `too_big_for_varchar()`), not just blob -- a `static_cast<Type_handler_blob_common*>` on varchar would crash (INTERSECT ALL with `varchar(1024)` in utf8mb3).
**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the original `choose_engine()` (both 10.11 and upstream/main) never had a reclength check. MDEV-38975 introduced it when replacing the `blob_fields` condition. HEAP has no internal reclength limit -- `hp_create.c` stores `uint reclength` and allocates blocks of that size; `max_supported_record_length()` is only checked in `unireg.cc` during user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have reclength ~880KB (13 bare `Varchar()` columns). The check forced them to Aria where `fill_slave_status()` returned 0 rows. Removed the check and the unused `reclength` parameter from `pick_engine()`.
**Multi-update `tmp_memory_table_size` override**: the 10.11 feature overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while `tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit `tmp_memory_table_size=0` directive. Since main removed `big_tables` entirely (MDEV-19713), the override is not needed. Removed.
Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for MDEV-34016/34060 Aria-specific test sections (blob I_S tables now stay in MEMORY)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.
**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin returns NULL from `make_new_field()` (cursor values cannot be materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE` without a NULL check. Added `if (result)` guard.
**xmltype identity loss and recursive CTE reclength mismatch in `Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now requires both: (1) `type_handler_for_tmp_table()` returns `blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>` confirms the original type is a native blob. Condition 1 excludes xmltype (its override returns itself). Condition 2 excludes VARCHAR types promoted via `varstring_type_handler()` -> `too_big_for_varchar()` -> `blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs (e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as `Field_varchar` in the incremental table (`part_of_unique_key=false`), causing a `reclength` mismatch assertion in `select_union_recursive::send_data()` (`main.json_equals` crash).
**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the original `choose_engine()` (both 10.11 and upstream/main) never had a reclength check. MDEV-38975 introduced it when replacing the `blob_fields` condition. HEAP has no internal reclength limit -- `hp_create.c` stores `uint reclength` and allocates blocks of that size; `max_supported_record_length()` is only checked in `unireg.cc` during user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have reclength ~880KB (13 bare `Varchar()` columns). The check forced them to Aria where `fill_slave_status()` returned 0 rows. Removed the check and the unused `reclength` parameter from `pick_engine()`.
**Multi-update `tmp_memory_table_size` override**: the 10.11 feature overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while `tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit `tmp_memory_table_size=0` directive. Since main removed `big_tables` entirely (MDEV-19713), the override is not needed. Removed.
**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth guard rejecting `key_len == 0` to prevent useless zero-length keys from being created by `add_tmp_key()`.
Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for MDEV-34016/34060 Aria-specific test sections (blob I_S tables now stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'` with targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
gkodinov left a comment:
Thank you for your contribution! This is a preliminary review.
Some small issues found. I'd also consider squashing the two commits into one.
…e blob columns
Remove the HA_NO_BLOBS restriction from the HEAP engine, allowing
the optimizer to keep temporary tables with BLOB/TEXT columns in
memory when they fit within max_heap_table_size / tmp_memory_table_size
limits. Additionally, advertise HA_CAN_GEOMETRY so explicit
CREATE TABLE ... ENGINE=MEMORY with GEOMETRY columns works.
Unlike other HEAP blob implementations (e.g. Percona), this patch
provides full HASH index support on blob columns, enabling efficient
lookups, GROUP BY, and DISTINCT operations directly in HEAP without
falling back to disk.
Architecture
------------
BLOB data is stored using continuation records -- additional fixed-size
records allocated from the same HP_BLOCK that holds regular rows. This
reuses existing allocation, free list, and size accounting with minimal
structural change, and avoids per-blob my_malloc() calls.
The existing single-byte visibility flag is extended into a flags byte
with bits for HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY,
HP_ROW_SINGLE_REC, and HP_ROW_MULTIPLE_REC. Continuation records are
grouped into variable-length runs -- contiguous sequences within a leaf
block. Only the first record of each run carries a 10-byte header
(next_cont pointer + run_rec_count); inner records are pure payload.
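
The flags byte and the 10-byte run header described above could look roughly like the following. This is a minimal sketch, not the patch's code: only the flag names and the header fields come from the commit message. The bit values, the `HP_ROW_ACTIVE` bit, the helper names, and the native-byte-order 8-byte pointer + 2-byte count layout are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit assignments; only the names appear in the commit message. */
enum
{
  HP_ROW_ACTIVE=        1 << 0,   /* hypothetical: original visibility bit */
  HP_ROW_HAS_CONT=      1 << 1,   /* primary row owns continuation data */
  HP_ROW_IS_CONT=       1 << 2,   /* record is a continuation record */
  HP_ROW_CONT_ZEROCOPY= 1 << 3,
  HP_ROW_SINGLE_REC=    1 << 4,
  HP_ROW_MULTIPLE_REC=  1 << 5
};

/* Assumed layout: 8-byte next_cont pointer followed by 2-byte run_rec_count. */
#define RUN_HEADER_SIZE 10

/* Write the header into the first record of a run. */
static void run_header_store(unsigned char *rec, const void *next_cont,
                             uint16_t run_rec_count)
{
  uint64_t ptr= (uint64_t) (uintptr_t) next_cont;
  assert(rec != NULL);
  memcpy(rec, &ptr, 8);
  memcpy(rec + 8, &run_rec_count, 2);
}

/* Read back the pointer to the next run (NULL terminates the chain). */
static const void *run_header_next(const unsigned char *rec)
{
  uint64_t ptr;
  memcpy(&ptr, rec, 8);
  return (const void *) (uintptr_t) ptr;
}

/* Read back the number of records in this run. */
static uint16_t run_header_count(const unsigned char *rec)
{
  uint16_t count;
  memcpy(&count, rec + 8, 2);
  return count;
}
```

Using `memcpy` rather than pointer casts keeps the sketch free of alignment assumptions about where a record starts.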
Three storage formats, detected by flag bits via inline predicates:
Case A (HP_ROW_SINGLE_REC): single record, no header, data at
offset 0. Zero-copy read.
Case B (HP_ROW_CONT_ZEROCOPY): single run, multiple records.
Header in rec 0, data contiguous in rec 1..N-1. Zero-copy read
via chain + recbuffer.
Case C (HP_ROW_MULTIPLE_REC): one or more runs linked via
next_cont. Reassembly into blob_buff required.
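
As an illustration of how inline predicates might select among the three cases, here is a hedged sketch. The flag names are from the commit message. Their bit values, the `hp_blob_format` enum, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed bit values; only the flag names come from the commit message. */
#define HP_ROW_CONT_ZEROCOPY (1u << 3)
#define HP_ROW_SINGLE_REC    (1u << 4)
#define HP_ROW_MULTIPLE_REC  (1u << 5)

enum hp_blob_format
{
  HP_BLOB_NONE,     /* no continuation format bit set */
  HP_BLOB_CASE_A,   /* single record, no header, zero-copy */
  HP_BLOB_CASE_B,   /* single run, header in rec 0, zero-copy */
  HP_BLOB_CASE_C    /* linked runs, must be reassembled */
};

/* Map the flags byte of a continuation chain to its storage format. */
static inline enum hp_blob_format hp_blob_format_of(unsigned char flags)
{
  if (flags & HP_ROW_SINGLE_REC)
    return HP_BLOB_CASE_A;
  if (flags & HP_ROW_CONT_ZEROCOPY)
    return HP_BLOB_CASE_B;
  if (flags & HP_ROW_MULTIPLE_REC)
    return HP_BLOB_CASE_C;
  return HP_BLOB_NONE;
}

/* Only Case C requires copying into the reassembly buffer. */
static inline int hp_blob_needs_copy(unsigned char flags)
{
  return hp_blob_format_of(flags) == HP_BLOB_CASE_C;
}
```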
Run allocation uses a two-phase strategy: (1) peek-then-unlink walk
of the free list detecting contiguous groups, (2) tail allocation
from HP_BLOCK for remaining data. A Step 3 scavenge fallback
walks the entire free list when tail allocation fails.
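
The peek-then-unlink idea in phase (1) can be sketched on a simplified free list. Everything here is an assumption made for illustration: the `free_rec` type, the function names, and in particular the premise that adjacent records appear on the free list in descending address order. The actual patch may detect contiguity differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node: a free record stores the link in its body. */
typedef struct free_rec
{
  struct free_rec *next;
} free_rec;

/*
  Peek: count how many records at the head of the free list form one
  contiguous group, without modifying the list. Two records are treated
  as contiguous when the next one sits exactly recbuffer bytes below
  the current one (assumed ordering).
*/
static size_t peek_contiguous(const free_rec *head, size_t recbuffer,
                              size_t max_records)
{
  size_t n= 0;
  const free_rec *cur= head;

  while (cur && n < max_records)
  {
    const free_rec *nxt= cur->next;
    n++;
    if (!nxt ||
        (const unsigned char *) nxt + recbuffer != (const unsigned char *) cur)
      break;
    cur= nxt;
  }
  return n;
}

/*
  Unlink: remove the first n records from the free list and return the
  lowest address among them, which is where the run starts.
*/
static unsigned char *unlink_run(free_rec **head, size_t n)
{
  free_rec *cur= *head, *last= NULL;

  assert(n > 0);
  while (n--)
  {
    assert(cur != NULL);
    last= cur;
    cur= cur->next;
  }
  *head= cur;
  return (unsigned char *) last;
}
```

Separating the read-only peek from the unlink means a group that turns out to be too small can be left on the list untouched.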
HP_SHARE::total_records tracks all physical records (primary +
continuation), while HP_SHARE::records remains the logical count
used by hash bucket mapping.
Reassembly buffer (HP_INFO::blob_buff) follows the same pattern as
InnoDB's blob_heap -- allocated once, grown via my_realloc, freed
on heap_reset()/close. Zero-copy cases (A/B) return pointers
directly into HP_BLOCK with no copy.
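
The allocate-once, grow-on-demand pattern described for the reassembly buffer can be sketched as follows. This uses plain `realloc()`/`free()` in place of `my_realloc`, and the `blob_buff_t` type and function names are hypothetical, so it should be read as an illustration of the pattern rather than the patch's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-handler reassembly buffer. */
typedef struct
{
  unsigned char *buf;
  size_t cap;
} blob_buff_t;

/* Grow only when the request exceeds the current capacity. */
static unsigned char *blob_buff_reserve(blob_buff_t *b, size_t need)
{
  if (need > b->cap)
  {
    unsigned char *p= realloc(b->buf, need);
    if (!p)
      return NULL;                    /* old buffer is still valid */
    b->buf= p;
    b->cap= need;
  }
  return b->buf;
}

/* Copy the payload of each run into the buffer, in chain order. */
static unsigned char *blob_reassemble(blob_buff_t *b,
                                      const unsigned char *const *chunks,
                                      const size_t *lens, size_t count)
{
  size_t total= 0, off= 0, i;
  unsigned char *dst;

  for (i= 0; i < count; i++)
    total+= lens[i];
  if (!(dst= blob_buff_reserve(b, total)))
    return NULL;
  for (i= 0; i < count; i++)
  {
    memcpy(dst + off, chunks[i], lens[i]);
    off+= lens[i];
  }
  return dst;
}

/* Release on reset/close. */
static void blob_buff_free(blob_buff_t *b)
{
  free(b->buf);
  b->buf= NULL;
  b->cap= 0;
}
```

Because the capacity never shrinks between reads, repeated Case C reads of similar-sized blobs reuse the same allocation.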
Full HASH index key handling for BLOB columns: hp_rec_hashnr(),
hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr() are
extended for HA_BLOB_PART segments. Hash pre-check optimization
skips expensive blob materialization when hashes differ. PAD SPACE
collation semantics are preserved for blob key comparisons.
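
The hash pre-check can be illustrated with a small comparison routine that rejects on a cached hash before touching the blob bytes. This is a sketch under stated assumptions: the `hashed_blob` type, the counter, and the use of FNV-1a are stand-ins chosen for the example. The patch's hash functions and collation-aware comparison are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a, used here only as a stand-in hash function. */
static uint32_t fnv1a_32(const unsigned char *p, size_t n)
{
  uint32_t h= 2166136261u;
  size_t i;
  for (i= 0; i < n; i++)
  {
    h^= p[i];
    h*= 16777619u;
  }
  return h;
}

/* Hypothetical key holding a cached hash next to the blob payload. */
typedef struct
{
  uint32_t hash;
  const unsigned char *data;
  size_t len;
} hashed_blob;

/* Counts how often the expensive full comparison actually ran. */
static unsigned long full_compares;

static hashed_blob make_hashed_blob(const unsigned char *data, size_t len)
{
  hashed_blob b;
  b.hash= fnv1a_32(data, len);
  b.data= data;
  b.len= len;
  return b;
}

/* Differing hashes prove inequality, so the byte compare is skipped. */
static int hashed_blob_equal(const hashed_blob *a, const hashed_blob *b)
{
  if (a->hash != b->hash)
    return 0;
  full_compares++;
  return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

Equal hashes still require the full comparison, since distinct values can collide.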
Field_blob_key (Monty) produces HEAP-native key format (4-byte length
+ 8-byte data pointer) directly, eliminating key buffer translation
between the SQL layer and HEAP engine.
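
A key segment made of a 4-byte length followed by an 8-byte data pointer could be packed and unpacked along these lines. The helper names and the native byte order are assumptions for illustration. `Field_blob_key` itself is a C++ class in the SQL layer and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 4-byte length + 8-byte data pointer, as described in the commit message. */
#define BLOB_KEY_SEG_SIZE 12

/* Store length and pointer; the blob bytes themselves are not copied. */
static void blob_key_store(unsigned char *key, const unsigned char *data,
                           uint32_t len)
{
  uint64_t ptr= (uint64_t) (uintptr_t) data;
  assert(key != NULL);
  memcpy(key, &len, 4);
  memcpy(key + 4, &ptr, 8);
}

/* Read the length part of the key segment. */
static uint32_t blob_key_length(const unsigned char *key)
{
  uint32_t len;
  memcpy(&len, key, 4);
  return len;
}

/* Read the data pointer part of the key segment. */
static const unsigned char *blob_key_data(const unsigned char *key)
{
  uint64_t ptr;
  memcpy(&ptr, key + 4, 8);
  return (const unsigned char *) (uintptr_t) ptr;
}
```

Since the key holds a pointer rather than a copy, the key is only valid while the referenced blob data stays alive.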
SQL layer changes
-----------------
choose_engine(): removed blob_fields check, added reclength >
HA_MAX_REC_LENGTH.
finalize(): HEAP+blob uses fixed-width rows; GROUP BY key setup sets
key_part_flag from field, uses item max_length for blob key sizing.
store_length initialized for all GROUP BY key parts. DISTINCT key
setup skips null-bits helper for HEAP.
remove_duplicates(): blob check moved before HEAP check to fall
through to remove_dup_with_compare().
Aggregator_distinct::add(): overflow-to-disk conversion via
create_internal_tmp_table_from_heap() for non-dup write errors.
Expression cache disabled for HEAP+blob (key format incompatibility).
FULLTEXT early detection in mysql_derived_prepare(): forces disk
engine via TMP_TABLE_FORCE_MYISAM when outer query uses MATCH.
Deferred blob chain free (MDEV-39732): heap_delete() saves chain
pointers to pending_blob_chains, flushed on next mutation or
heap_reset()/close. Prevents dangling zero-copy pointers during
binlog_log_row().
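
The deferred-free idea can be sketched with a small pending list: a delete only records the chain pointer, and the memory is released at the next flush point. The type and function names below are hypothetical, and `malloc()`/`free()` stand in for returning continuation records to the HEAP free list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list of blob chains waiting to be released. */
typedef struct
{
  void **chains;
  size_t count;
  size_t cap;
} pending_chains;

/* Called from the delete path: remember the chain, do not free it yet. */
static int pending_add(pending_chains *p, void *chain)
{
  if (p->count == p->cap)
  {
    size_t new_cap= p->cap ? p->cap * 2 : 4;
    void **grown= realloc(p->chains, new_cap * sizeof(*grown));
    if (!grown)
      return -1;
    p->chains= grown;
    p->cap= new_cap;
  }
  p->chains[p->count++]= chain;
  return 0;
}

/* Called on the next mutation or on reset/close: release everything. */
static size_t pending_flush(pending_chains *p)
{
  size_t i, released= p->count;
  for (i= 0; i < p->count; i++)
    free(p->chains[i]);
  p->count= 0;
  return released;
}

/* Flush and drop the list storage itself. */
static void pending_destroy(pending_chains *p)
{
  pending_flush(p);
  free(p->chains);
  p->chains= NULL;
  p->cap= 0;
}
```

Between `pending_add()` and `pending_flush()`, any zero-copy pointer into the chain stays readable. That is the property the commit message relies on for `binlog_log_row()`.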
REPLACE safety (MDEV-39825): HP_SHARE::write_can_replace flag
forces copy mode in hp_read_blobs(), preventing blob data corruption
from freed-then-reused continuation records during REPLACE.
Geometry GROUP_CONCAT fix (MDEV-39761): downgrade Field_geom to
Field_blob for GROUP_CONCAT temp tables in both expression creation
paths. Type_handler_geometry::type_handler_for_tmp_table() added.
Geometry GROUP BY key fix (MDEV-39871): detect when new_key_field()
produced non-blob Field_varstring for a blob column, replace with
Field_blob_key.
Performance
-----------
Non-blob tables: zero regression. Every blob-specific code path is
guarded by if (share->blob_count). No new allocations, no format
changes, no hash function changes for non-blob keys.
Blob tables: eliminates file creation/deletion overhead and page cache
management. For single-run blobs (common case), the read path is
entirely zero-copy.
Limitations
-----------
1. No BTREE indexes on blob columns (HASH only)
2. No partial-key prefix indexing for blobs
3. 2x memory for Case C reads only (A/B are 1x)
4. No blob compression
5. 65,535 records per run (uint16 cap, auto-split)
6. max_heap_table_size applies to continuation records
7. Expression cache disabled for HEAP+blob
8. FULLTEXT forces disk engine
Linked bugs fixed:
- MDEV-39703: mroonga fulltext test ordering
- MDEV-39723: ER_DUP_ENTRY on GROUP BY with blob column
- MDEV-39724: crash in hp_is_single_rec with GROUP BY
- MDEV-39732: slave crash in hp_free_run_chain on blob replication
- MDEV-39761: Field_geom::store() assertion in GROUP_CONCAT
- MDEV-39782: RBR ER_KEY_NOT_FOUND on HEAP blob UPDATE
- MDEV-39825: blob data corruption on REPLACE into HEAP table
- MDEV-39871: crash in my_hash_sort_bin on GROUP BY with geometry
Reviewed by: Michael Widenius <monty@mariadb.org>
Monty reviewed the entire patch. Areas where he suggested changes
or contributed code:
- Field_blob_key class (HEAP-native blob key format, 4-byte length +
data pointer)
- Duplicate key fix on HEAP-to-Aria conversion
- hp_blob_key_length() uint32 fix
- hp_rec_hashnr_stored removal
- type_handler_for_tmp_table() param cleanup
- Type_handler_geometry::type_handler_for_tmp_table() virtual
- blob pointer bzero()
- find_unique_row() double-materialization fix
- Tail reclaim review
- Batch tail allocation review
- hp_update.c cleanup
- Field_blob_compressed temp table fix
- row_pack_length() dedup
- pack_length_no_ptr() removal
- Race condition fix in HEAP
- MDEV-39703 mroonga test fix
- MDEV-39825 write_can_replace optimization
- Documentation (Docs/internal-temporary-tables.txt)
Contribution by: Alexander Barkov <bar@mariadb.com>
Type_handler::make_and_init_table_field_ex() -- refactored temp table
field creation from inline code in sql_select.cc into type handler
virtual methods (sql_type.cc, sql_type_geom.cc), enabling clean
per-type-handler field creation for HEAP blob promotion.
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.
**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin returns NULL from `make_new_field()` (cursor values cannot be materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE` without a NULL check. Added `if (result)` guard.
**xmltype identity loss and recursive CTE reclength mismatch in `Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now requires both: (1) `type_handler_for_tmp_table()` returns `blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>` confirms the original type is a native blob. Condition 1 excludes xmltype (its override returns itself). Condition 2 excludes VARCHAR types promoted via `varstring_type_handler()` -> `too_big_for_varchar()` -> `blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs (e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as `Field_varchar` in the incremental table (`part_of_unique_key=false`), causing a `reclength` mismatch assertion in `select_union_recursive::send_data()` (`main.json_equals` crash).
**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the original `choose_engine()` (both 10.11 and upstream/main) never had a reclength check. MDEV-38975 introduced it when replacing the `blob_fields` condition. HEAP has no internal reclength limit -- `hp_create.c` stores `uint reclength` and allocates blocks of that size; `max_supported_record_length()` is only checked in `unireg.cc` during user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have reclength ~880KB (13 bare `Varchar()` columns). The check forced them to Aria where `fill_slave_status()` returned 0 rows. Removed the check and the unused `reclength` parameter from `pick_engine()`.
**Multi-update `tmp_memory_table_size` override**: the 10.11 feature overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while `tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit `tmp_memory_table_size=0` directive. Since main removed `big_tables` entirely (MDEV-19713), the override is not needed. Removed.
**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth guard rejecting `key_len == 0` to prevent useless zero-length keys from being created by `add_tmp_key()`.
**Non-deterministic `column_compression` test**: HEAP blob support allows compressed VARCHAR/TEXT temp tables to stay in HEAP instead of falling to Aria, changing row iteration order. Added `--sorted_result` to the two MDEV-24726 subqueries that lack `ORDER BY`.
Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for MDEV-34016/34060 Aria-specific test sections (blob I_S tables now stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'` with targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
- `column_compression`: added `--sorted_result` for MDEV-24726 queries
MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support
JIRA: MDEV-38975
10.11 base PR: #4735
10.11 Phase 1 (VARCHAR->BLOB promotion) PR: #4812
Scope: 182 files changed, 12,434 insertions, 801 deletions (single squashed commit)
Motivation
The growing adoption of JSON columns, Dynamic Columns, and GEOMETRY types means that an increasing number of SQL operations -- GROUP BY, DISTINCT, UNION, subqueries, CTEs -- materialize intermediate results into temporary tables containing BLOB/TEXT fields. Today, the HEAP engine unconditionally rejects BLOB/TEXT columns (`HA_NO_BLOBS`), forcing the optimizer to use Aria temporary tables even when the actual data is small. This feature removes that restriction, allowing the optimizer to keep temporary tables with BLOB/TEXT columns in memory when they fit within `max_heap_table_size` / `tmp_memory_table_size` limits. Additionally, `HA_CAN_GEOMETRY` is advertised so explicit `CREATE TABLE ... ENGINE=MEMORY` with GEOMETRY columns works.
Architecture
BLOB data is stored using continuation records -- additional fixed-size records allocated from the same `HP_BLOCK` that holds regular rows. This reuses existing allocation, free list, and size accounting with minimal structural change, and avoids per-blob `my_malloc()` calls.
The existing single-byte visibility flag is extended into a flags byte. Continuation records are grouped into variable-length runs -- contiguous sequences within a leaf block. Only the first record of each run carries a 10-byte header (`next_cont` pointer + `run_rec_count`); inner records are pure payload.
Three storage formats, detected by flag bits via inline predicates:
- Case A (`HP_ROW_SINGLE_REC`): single record, no header, data at offset 0. Zero-copy read.
- Case B (`HP_ROW_CONT_ZEROCOPY`): single run, multiple records. Header in rec 0, data contiguous in rec 1..N-1. Zero-copy read via `chain + recbuffer`.
- Case C (`HP_ROW_MULTIPLE_REC`): one or more runs linked via `next_cont`. Reassembly into `blob_buff` required.
Run allocation uses a two-phase strategy: (1) peek-then-unlink walk of the free list detecting contiguous groups, (2) tail allocation from `HP_BLOCK` for remaining data. A Step 3 scavenge fallback walks the entire free list when tail allocation fails.
`HP_SHARE::total_records` tracks all physical records (primary + continuation), while `HP_SHARE::records` remains the logical count used by hash bucket mapping.
Reassembly buffer (`HP_INFO::blob_buff`) follows the same pattern as InnoDB's `blob_heap` -- allocated once, grown via `my_realloc`, freed on `heap_reset()`/close. Zero-copy cases (A/B) return pointers directly into `HP_BLOCK` with no copy.
Full HASH index key handling for BLOB columns: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`, `hp_make_key()`, `hp_hashnr()` are extended for `HA_BLOB_PART` segments. Hash pre-check optimization skips expensive blob materialization when hashes differ. PAD SPACE collation semantics are preserved.
`Field_blob_key` (Monty) produces HEAP-native key format (4-byte length + 8-byte data pointer) directly, eliminating key buffer translation between the SQL layer and HEAP engine.
SQL layer changes
- `choose_engine()`: Removed `blob_fields` check, added `reclength > HA_MAX_REC_LENGTH`
- `finalize()`: HEAP+blob uses fixed-width rows; GROUP BY key setup sets `key_part_flag` from field, uses item `max_length` for blob key sizing; `store_length` initialized for all GROUP BY key parts; DISTINCT key setup skips null-bits helper for HEAP
- `remove_duplicates()`: Blob check moved before HEAP check to fall through to `remove_dup_with_compare()`
- `Aggregator_distinct::add()`: Overflow-to-disk conversion via `create_internal_tmp_table_from_heap()` for non-dup write errors
- `mysql_derived_prepare()`: Forces disk engine via `TMP_TABLE_FORCE_MYISAM` when outer query uses `MATCH`
- `heap_delete()` saves chain pointers to `pending_blob_chains`, flushed on next mutation or `heap_reset()`/close
- `HP_SHARE::write_can_replace` flag forces copy mode in `hp_read_blobs()`
- `Field_geom` to `Field_blob` for GROUP_CONCAT temp tables; `Type_handler_geometry::type_handler_for_tmp_table()` added
- `new_key_field()` produced non-blob `Field_varstring` for a blob column, replace with `Field_blob_key`

VARCHAR->BLOB promotion (Phase 1)

VARCHAR fields whose `octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB` (32 bytes) are automatically promoted to BLOB when the temporary table uses the HEAP engine. This eliminates HEAP's fixed-width row waste for wide VARCHAR columns common in `INFORMATION_SCHEMA` views, JSON-heavy schemas, and many user tables.

- `pick_engine()` extracted from `choose_engine()` and called early in `start()` to set `m_heap_expected`
- `blob_type_handler()` (`sql/field.cc`) and `varstring_type_handler()` (`sql/sql_type.cc`), gated by `Tmp_field_param::is_heap_engine()`
- `derived_with_keys` ref access for BLOB columns: new `heap_store_key_blob_ref` store_key subclass bypasses SQL-layer key buffer, writes directly into `record[0]`'s `Field_blob`
- `create_ref_for_key()` memory leak fix for promoted `Field_blob`'s via explicit `tmp.cleanup()`

Performance

Non-blob tables: Zero regression. Every blob-specific code path is guarded by `if (share->blob_count)`. No new allocations, no format changes, no hash function changes for non-blob keys.

Blob tables: Eliminates file creation/deletion overhead and page cache management. For single-run blobs (common case), the read path is entirely zero-copy.

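The difference between the zero-copy read path for single-run blobs and the reassembly path for multi-run blobs can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the code in this PR: `toy_run`, `toy_reader`, `toy_records_for_run()` and `toy_read_blob()` are hypothetical names, the flags byte and the real in-block header layout are ignored, and plain `realloc()` stands in for `my_realloc`. Only the 10-byte run header size and the "allocate once, grow on demand" buffer pattern are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Header size from the description: 8-byte pointer + 2-byte record count. */
#define TOY_RUN_HEADER 10u

/* Hypothetical stand-in for one contiguous run of continuation records. */
typedef struct toy_run {
  struct toy_run *next_cont;   /* next run of the same blob, NULL at the end */
  uint16_t run_rec_count;      /* number of records in this run */
  size_t data_len;             /* payload bytes held by this run */
  const unsigned char *data;   /* first payload byte */
} toy_run;

/* Hypothetical stand-in for the per-handle reassembly buffer. */
typedef struct toy_reader {
  unsigned char *blob_buff;    /* grown on demand, reused across reads */
  size_t blob_buff_len;
} toy_reader;

/*
  Records needed to store `len` payload bytes as a single run.
  Assumes len is small enough that the addition cannot overflow.
*/
static size_t toy_records_for_run(size_t len, size_t recsize)
{
  assert(recsize > 0);
  return (len + TOY_RUN_HEADER + recsize - 1) / recsize;
}

/*
  Read a blob. A single run is returned in place (*copied = 0); a chain of
  several runs is copied into r->blob_buff (*copied = 1).
  Returns 0 on success, -1 on size overflow or allocation failure.
*/
static int toy_read_blob(toy_reader *r, const toy_run *first,
                         const unsigned char **out, size_t *out_len,
                         int *copied)
{
  const toy_run *run;
  size_t total = 0, off = 0;

  assert(r != NULL && first != NULL);
  if (first->next_cont == NULL)
  {
    /* Single run: hand back a pointer into the run, no copy. */
    *out = first->data;
    *out_len = first->data_len;
    *copied = 0;
    return 0;
  }
  /* Several runs: sum the lengths, refusing to wrap around. */
  for (run = first; run != NULL; run = run->next_cont)
  {
    if (run->data_len > SIZE_MAX - total)
      return -1;
    total += run->data_len;
  }
  /* Grow the reassembly buffer only when it is too small. */
  if (total > r->blob_buff_len)
  {
    unsigned char *grown = realloc(r->blob_buff, total);
    if (grown == NULL)
      return -1;
    r->blob_buff = grown;
    r->blob_buff_len = total;
  }
  /* Concatenate the payload of every run in chain order. */
  for (run = first; run != NULL; run = run->next_cont)
  {
    if (run->data_len != 0)
      memcpy(r->blob_buff + off, run->data, run->data_len);
    off += run->data_len;
  }
  *out = r->blob_buff;
  *out_len = total;
  *copied = 1;
  return 0;
}
```

With 64-byte records in this model, a 54-byte value plus the header fills exactly one record, while a 55-byte value needs two. A caller would keep one `toy_reader` per open handle and `free()` its `blob_buff` on close, mirroring the lifetime described for `HP_INFO::blob_buff`.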
Limitations
- `uint16` cap, auto-split)
- `max_heap_table_size` applies to continuation records

Linked bugs fixed

- `ER_DUP_ENTRY` on GROUP BY with blob column
- `hp_is_single_rec` with GROUP BY
- `hp_free_run_chain` on blob replication
- `Field_geom::store()` assertion in GROUP_CONCAT
- `ER_KEY_NOT_FOUND` on HEAP blob UPDATE
- `my_hash_sort_bin` on GROUP BY with geometry

Forward-port adaptations (main-only)

This feature was developed on 10.11 and squashed onto `main`. Beyond conflict resolution across 28 files, the following adaptations were required:

API:

- `my_ci_hash_sort()` converted to `(&hasher, cs, data, len)` API in `hp_hash.c` blob paths
- `big_tables` removed (MDEV-19713); replaced with `tmp_memory_table_size` save/restore in `sql_update.cc`
- `TABLE_SHARE::uniques` replaced with `have_unique_constraint()`
- `type_handler_for_tmp_table()` and `make_new_field()` 4th `Tmp_field_param` parameter added to `type_xmltype` and `type_cursor` plugins
- `KEY` struct `memset` replaced with `= {}` value-initialization; `TYPELIB`/`Field_enum` updated for `Type_typelib_attributes`

Behavioral fixes (main-only):

- `RESULT_TMP_TABLE` added to `free_tmp_table()` assertion (`sql/sql_select.cc:24160`)
- `sql_delete.cc:1331` multi-delete ORDER changed from stack-local to `alloc_root` (matching `sql_update.cc:2206`) -- eliminates dangling pointer through feature's GROUP BY field-free loop
- `free_tmp_table()` group field-free loop runs unconditionally (`sql/sql_select.cc:24182`)

Test re-recordings:

- `main/blob_sj_test.result`: optimizer selectivity estimate changed
- `main/derived_view.result`: cardinality estimate changed
- `heap/blob.result`: new `max_sort_length` warning on ORDER BY of long blobs

Contributors

- `Field_blob_key` class, 24 review/code commits across the feature (duplicate key fix, `hp_blob_key_length()`, `hp_rec_hashnr_stored` removal, `type_handler_for_tmp_table()` cleanup, `Field_blob_compressed` fix, race condition fix, MDEV-39825 optimization, `Docs/internal-temporary-tables.txt`, and more)
- `Type_handler::make_and_init_table_field_ex()` refactoring (temp table field creation into type handler virtual methods)

Test plan

- `heap` test suite (25 tests)
- `main.select`, `main.distinct`, `main.group_by`, `main.derived_view`, `main.information_schema`, `main.blob_sj_test`, `main.func_group`, `main.type_blob`, `main.subselect`, `main.subselect2`, `main.derived`, `main.union`, `main.delete`, `main.multi_update`