MDEV-38975: Promote wide VARCHAR to BLOB for HEAP internal temp tables - Phase 1 (main) #5225
Code Review
This pull request introduces support for BLOB, TEXT, JSON, and GEOMETRY columns in the HEAP (MEMORY) storage engine, implementing continuation record chains and new key types (HA_KEYTYPE_VARTEXT4 and HA_KEYTYPE_VARBINARY4) for internal temporary tables. The review feedback highlights several critical issues: in mysys/my_compare.c, pointer copying for the new key types must be offset by 4 bytes to prevent memory corruption; in sql/sql_select.h, a dummy field allocation in heap_store_key_blob_ref should be avoided to prevent a memory leak; and in sql/sql_show.cc, a type-safe memmove should replace the error-prone bmove with manual pointer arithmetic.
e418e5b to 6d3f3c7
gkodinov left a comment
Thank you for your contribution! This is a preliminary review.
LGTM (on formal criteria like tests passing etc).
Please stand by for the final review.
6d3f3c7 to 751c18c
…e blob columns
Remove the HA_NO_BLOBS restriction from the HEAP engine, allowing
the optimizer to keep temporary tables with BLOB/TEXT columns in
memory when they fit within max_heap_table_size / tmp_memory_table_size
limits. Additionally, advertise HA_CAN_GEOMETRY so explicit
CREATE TABLE ... ENGINE=MEMORY with GEOMETRY columns works.
Unlike other HEAP blob implementations (e.g. Percona), this patch
provides full HASH index support on blob columns, enabling efficient
lookups, GROUP BY, and DISTINCT operations directly in HEAP without
falling back to disk.
Architecture
------------
BLOB data is stored using continuation records -- additional fixed-size
records allocated from the same HP_BLOCK that holds regular rows. This
reuses existing allocation, free list, and size accounting with minimal
structural change, and avoids per-blob my_malloc() calls.
The existing single-byte visibility flag is extended into a flags byte
with bits for HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY,
HP_ROW_SINGLE_REC, and HP_ROW_MULTIPLE_REC. Continuation records are
grouped into variable-length runs -- contiguous sequences within a leaf
block. Only the first record of each run carries a 10-byte header
(next_cont pointer + run_rec_count); inner records are pure payload.
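The 10-byte run header described above can be pictured with a small sketch. It assumes an 8-byte `next_cont` pointer followed by a 2-byte `run_rec_count` in host byte order; the field order, the byte order, and all helper names here are illustrative assumptions, not the patch's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 8-byte next_cont pointer + 2-byte run_rec_count. */
#define RUN_HDR_PTR_SIZE   8
#define RUN_HDR_COUNT_SIZE 2
#define RUN_HDR_SIZE       (RUN_HDR_PTR_SIZE + RUN_HDR_COUNT_SIZE)

/* Write the header into the first record of a run. */
static void run_hdr_store(unsigned char *rec, const void *next_cont,
                          uint16_t run_rec_count)
{
  uint64_t p = (uint64_t) (uintptr_t) next_cont;
  memcpy(rec, &p, RUN_HDR_PTR_SIZE);
  memcpy(rec + RUN_HDR_PTR_SIZE, &run_rec_count, RUN_HDR_COUNT_SIZE);
}

/* Read back the pointer to the next run (NULL terminates the chain). */
static const void *run_hdr_next(const unsigned char *rec)
{
  uint64_t p;
  memcpy(&p, rec, RUN_HDR_PTR_SIZE);
  return (const void *) (uintptr_t) p;
}

/* Read back the number of records in this run. */
static uint16_t run_hdr_count(const unsigned char *rec)
{
  uint16_t n;
  memcpy(&n, rec + RUN_HDR_PTR_SIZE, RUN_HDR_COUNT_SIZE);
  return n;
}
```

Using `memcpy` rather than a cast avoids unaligned access, since a header may start at any offset inside a record buffer.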
Three storage formats, detected by flag bits via inline predicates:
Case A (HP_ROW_SINGLE_REC): single record, no header, data at
offset 0. Zero-copy read.
Case B (HP_ROW_CONT_ZEROCOPY): single run, multiple records.
Header in rec 0, data contiguous in rec 1..N-1. Zero-copy read
via chain + recbuffer.
Case C (HP_ROW_MULTIPLE_REC): one or more runs linked via
next_cont. Reassembly into blob_buff required.
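A minimal sketch of how the three formats could be told apart from the flags byte. Only the flag names come from the text above; the bit positions, the `blob_format` enum, and the `sketch_*` helpers are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are hypothetical; only the flag names come from the text. */
enum {
  HP_ROW_HAS_CONT      = 1 << 1,
  HP_ROW_IS_CONT       = 1 << 2,
  HP_ROW_CONT_ZEROCOPY = 1 << 3,
  HP_ROW_SINGLE_REC    = 1 << 4,
  HP_ROW_MULTIPLE_REC  = 1 << 5
};

typedef enum { FMT_NONE, FMT_CASE_A, FMT_CASE_B, FMT_CASE_C } blob_format;

/* Inline predicates, one per storage format. */
static inline int sketch_is_single_rec(uint8_t flags)
{ return (flags & HP_ROW_SINGLE_REC) != 0; }

static inline int sketch_is_zerocopy_run(uint8_t flags)
{ return (flags & HP_ROW_CONT_ZEROCOPY) != 0; }

static inline int sketch_is_multi_run(uint8_t flags)
{ return (flags & HP_ROW_MULTIPLE_REC) != 0; }

/* Map a flags byte to Case A, B, or C. */
static blob_format sketch_classify(uint8_t flags)
{
  if (sketch_is_single_rec(flags))   return FMT_CASE_A;
  if (sketch_is_zerocopy_run(flags)) return FMT_CASE_B;
  if (sketch_is_multi_run(flags))    return FMT_CASE_C;
  return FMT_NONE;
}

/* Only Case C needs reassembly into a separate buffer. */
static int sketch_needs_copy(blob_format f)
{ return f == FMT_CASE_C; }
```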
Run allocation uses a two-phase strategy: (1) a peek-then-unlink walk
of the free list that detects contiguous groups, (2) tail allocation
from HP_BLOCK for the remaining data. If tail allocation fails, a
third, fallback step scavenges the entire free list.
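The peek-then-unlink idea can be sketched as follows: first count how many free-list entries at the head are adjacent in memory without touching the list, then unlink exactly that many. The `free_rec` type, both function names, and the assumption that adjacent entries appear in ascending address order are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node stored at the start of each deleted record. */
typedef struct free_rec {
  struct free_rec *next;
} free_rec;

/* Peek: count how many records at the head of the free list form one
   contiguous group (each entry exactly recbuffer bytes after the previous
   one), up to `want` records. The list is not modified. */
static size_t peek_contiguous(const free_rec *head, size_t recbuffer,
                              size_t want)
{
  size_t n = 0;
  const unsigned char *expect = (const unsigned char *) head;
  while (head && n < want && (const unsigned char *) head == expect)
  {
    n++;
    expect += recbuffer;
    head = head->next;
  }
  return n;
}

/* Unlink: remove the first n records from the list and return the start
   of the run they form. */
static void *unlink_run(free_rec **head, size_t n)
{
  void *start = *head;
  while (n-- && *head)
    *head = (*head)->next;
  return start;
}
```

Separating the read-only peek from the unlink means a group that turns out to be too short can be skipped without having to restore the list.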
HP_SHARE::total_records tracks all physical records (primary +
continuation), while HP_SHARE::records remains the logical count
used by hash bucket mapping.
Reassembly buffer (HP_INFO::blob_buff) follows the same pattern as
InnoDB's blob_heap -- allocated once, grown via my_realloc, freed
on heap_reset()/close. Zero-copy cases (A/B) return pointers
directly into HP_BLOCK with no copy.
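As a rough illustration of the grow-only buffer pattern for Case C, the sketch below reassembles a chain of runs into a buffer that is enlarged with `realloc()` when needed and never shrunk. The `blob_buff` and `run` structs are simplified stand-ins for `HP_INFO::blob_buff` and the continuation chain, not the real structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for HP_INFO::blob_buff. */
typedef struct {
  unsigned char *buf;
  size_t cap;
} blob_buff;

/* Simplified stand-in for one run in a continuation chain. */
typedef struct run {
  const struct run *next;      /* plays the role of next_cont */
  const unsigned char *data;   /* payload of this run */
  size_t len;
} run;

/* Grow the buffer if it is too small; never shrink it. */
static int blob_buff_reserve(blob_buff *b, size_t need)
{
  unsigned char *p;
  if (need <= b->cap)
    return 0;
  p = realloc(b->buf, need);
  if (!p)
    return -1;
  b->buf = p;
  b->cap = need;
  return 0;
}

/* Copy `total` bytes from the chain into the buffer. Returns NULL on
   allocation failure or if the chain is shorter than `total`. */
static const unsigned char *reassemble(blob_buff *b, const run *first,
                                       size_t total)
{
  size_t off = 0;
  const run *r;
  if (blob_buff_reserve(b, total))
    return NULL;
  for (r = first; r && off < total; r = r->next)
  {
    size_t n = r->len < total - off ? r->len : total - off;
    memcpy(b->buf + off, r->data, n);
    off += n;
  }
  return off == total ? b->buf : NULL;
}

/* Release the buffer, as on reset or close. */
static void blob_buff_free(blob_buff *b)
{
  free(b->buf);
  b->buf = NULL;
  b->cap = 0;
}
```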
Full HASH index key handling for BLOB columns: hp_rec_hashnr(),
hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr() are
extended for HA_BLOB_PART segments. Hash pre-check optimization
skips expensive blob materialization when hashes differ. PAD SPACE
collation semantics are preserved for blob key comparisons.
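The hash pre-check can be sketched as below: the cheap hash comparison runs first, and the blob is only materialized when the hashes match. This is a simplified binary comparison; the `entry` struct, the FNV-1a hash, and the call counter are assumptions for illustration, and a PAD SPACE collation would need a collation-aware comparison instead of the length check and `memcmp()`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical index entry: a stored hash plus the blob it refers to. */
typedef struct {
  uint32_t hash;
  const unsigned char *data;
  size_t len;
} entry;

/* Counts how often the (expensive) materialization step runs. */
static unsigned materialize_calls;

/* Stand-in hash function (FNV-1a); not the hash HEAP uses. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
  uint32_t h = 2166136261u;
  while (n--)
  {
    h ^= *p++;
    h *= 16777619u;
  }
  return h;
}

/* Stand-in for following the continuation chain and reading the blob. */
static const unsigned char *materialize(const entry *e)
{
  materialize_calls++;
  return e->data;
}

/* Compare a lookup key against an entry, skipping materialization when
   the hashes already differ. */
static int key_matches(const entry *e, uint32_t key_hash,
                       const unsigned char *key, size_t key_len)
{
  if (e->hash != key_hash)
    return 0;                 /* cheap rejection, blob never touched */
  if (e->len != key_len)
    return 0;                 /* binary semantics only */
  return memcmp(materialize(e), key, key_len) == 0;
}
```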
Field_blob_key (Monty) produces HEAP-native key format (4-byte length
+ 8-byte data pointer) directly, eliminating key buffer translation
between the SQL layer and HEAP engine.
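To make the 12-byte key format concrete, here is a small sketch that packs a 4-byte length and an 8-byte data pointer into a key buffer and compares two keys through the pointer rather than by raw key bytes. Host byte order, binary comparison, and all helper names are assumptions; the real code may encode the fields differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOB_KEY_LEN_SIZE 4
#define BLOB_KEY_PTR_SIZE 8
#define BLOB_KEY_SIZE     (BLOB_KEY_LEN_SIZE + BLOB_KEY_PTR_SIZE)

/* Pack [4-byte length][8-byte pointer] into key. */
static void blob_key_store(unsigned char *key, const unsigned char *data,
                           uint32_t len)
{
  uint64_t p = (uint64_t) (uintptr_t) data;
  memcpy(key, &len, BLOB_KEY_LEN_SIZE);
  memcpy(key + BLOB_KEY_LEN_SIZE, &p, BLOB_KEY_PTR_SIZE);
}

static uint32_t blob_key_length(const unsigned char *key)
{
  uint32_t len;
  memcpy(&len, key, BLOB_KEY_LEN_SIZE);
  return len;
}

static const unsigned char *blob_key_data(const unsigned char *key)
{
  uint64_t p;
  memcpy(&p, key + BLOB_KEY_LEN_SIZE, BLOB_KEY_PTR_SIZE);
  return (const unsigned char *) (uintptr_t) p;
}

/* Compare the pointed-to data, not the key bytes themselves. */
static int blob_key_cmp(const unsigned char *a, const unsigned char *b)
{
  uint32_t la = blob_key_length(a), lb = blob_key_length(b);
  uint32_t n = la < lb ? la : lb;
  int c = n ? memcmp(blob_key_data(a), blob_key_data(b), n) : 0;
  if (c)
    return c;
  return (la > lb) - (la < lb);
}
```

Two keys that reference equal data at different addresses differ in their raw bytes, which is why a plain `memcmp()` over the key buffer is not a valid comparison for this format.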
SQL layer changes
-----------------
choose_engine(): removed blob_fields check, added reclength >
HA_MAX_REC_LENGTH.
finalize(): HEAP+blob uses fixed-width rows; GROUP BY key setup sets
key_part_flag from field, uses item max_length for blob key sizing.
store_length initialized for all GROUP BY key parts. DISTINCT key
setup skips null-bits helper for HEAP.
remove_duplicates(): blob check moved before HEAP check to fall
through to remove_dup_with_compare().
Aggregator_distinct::add(): overflow-to-disk conversion via
create_internal_tmp_table_from_heap() for non-dup write errors.
Expression cache disabled for HEAP+blob (key format incompatibility).
FULLTEXT early detection in mysql_derived_prepare(): forces disk
engine via TMP_TABLE_FORCE_MYISAM when outer query uses MATCH.
Deferred blob chain free (MDEV-39732): heap_delete() saves chain
pointers to pending_blob_chains, flushed on next mutation or
heap_reset()/close. Prevents dangling zero-copy pointers during
binlog_log_row().
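The deferred-free pattern can be sketched roughly like this: a delete parks the chain instead of freeing it, so pointers handed out earlier stay valid until the next mutation or reset. The `table_info` struct, the fixed-size pending array, and the `sketch_*` functions are hypothetical simplifications of `pending_blob_chains`, with `free()` standing in for returning records to the free list.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PENDING 16   /* hypothetical per-row capacity */

typedef struct {
  void *pending[MAX_PENDING];  /* chains waiting to be freed */
  size_t n_pending;
  size_t freed_total;          /* instrumentation for the sketch only */
} table_info;

/* Free every parked chain. */
static void flush_pending(table_info *t)
{
  size_t i;
  for (i = 0; i < t->n_pending; i++)
  {
    free(t->pending[i]);
    t->freed_total++;
  }
  t->n_pending = 0;
}

/* Delete: flush chains left over from earlier deletes, then park this
   row's chain so readers of the deleted row still see valid data. */
static int sketch_delete(table_info *t, void *chain)
{
  flush_pending(t);
  if (t->n_pending >= MAX_PENDING)
    return -1;
  t->pending[t->n_pending++] = chain;
  return 0;
}

/* Any later mutation, reset, or close releases the parked chains. */
static void sketch_write(table_info *t) { flush_pending(t); }
static void sketch_reset(table_info *t) { flush_pending(t); }
```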
REPLACE safety (MDEV-39825): HP_SHARE::write_can_replace flag
forces copy mode in hp_read_blobs(), preventing blob data corruption
from freed-then-reused continuation records during REPLACE.
Geometry GROUP_CONCAT fix (MDEV-39761): downgrade Field_geom to
Field_blob for GROUP_CONCAT temp tables in both expression creation
paths. Type_handler_geometry::type_handler_for_tmp_table() added.
Geometry GROUP BY key fix (MDEV-39871): detect when new_key_field()
produced non-blob Field_varstring for a blob column, replace with
Field_blob_key.
Performance
-----------
Non-blob tables: zero regression. Every blob-specific code path is
guarded by if (share->blob_count). No new allocations, no format
changes, no hash function changes for non-blob keys.
Blob tables: eliminates file creation/deletion overhead and page cache
management. For single-run blobs (common case), the read path is
entirely zero-copy.
Limitations
-----------
1. No BTREE indexes on blob columns (HASH only)
2. No partial-key prefix indexing for blobs
3. 2x memory for Case C reads only (A/B are 1x)
4. No blob compression
5. 65,535 records per run (uint16 cap, auto-split)
6. max_heap_table_size applies to continuation records
7. Expression cache disabled for HEAP+blob
8. FULLTEXT forces disk engine
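Limitation 5 implies some simple arithmetic, sketched here under the assumption that runs are filled greedily up to the uint16 cap. The real allocator also splits runs wherever free space is not contiguous, so this only gives the minimum number of runs; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RUN_RECS 65535u   /* largest value a uint16 run_rec_count holds */

/* Minimum number of runs needed to hold `records` continuation records. */
static uint64_t runs_needed(uint64_t records)
{
  return (records + MAX_RUN_RECS - 1) / MAX_RUN_RECS;
}

/* Record count of run `run_index` (0-based) under greedy filling;
   0 if the index is past the last run. */
static uint16_t run_size(uint64_t records, uint64_t run_index)
{
  uint64_t start = run_index * MAX_RUN_RECS;
  uint64_t left;
  if (start >= records)
    return 0;
  left = records - start;
  return (uint16_t) (left < MAX_RUN_RECS ? left : MAX_RUN_RECS);
}
```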
Linked bugs fixed:
- MDEV-39703: mroonga fulltext test ordering
- MDEV-39723: ER_DUP_ENTRY on GROUP BY with blob column
- MDEV-39724: crash in hp_is_single_rec with GROUP BY
- MDEV-39732: slave crash in hp_free_run_chain on blob replication
- MDEV-39761: Field_geom::store() assertion in GROUP_CONCAT
- MDEV-39782: RBR ER_KEY_NOT_FOUND on HEAP blob UPDATE
- MDEV-39825: blob data corruption on REPLACE into HEAP table
- MDEV-39871: crash in my_hash_sort_bin on GROUP BY with geometry
Reviewed by: Michael Widenius <monty@mariadb.org>
Monty reviewed the entire patch. Areas where he suggested changes
or contributed code:
- Field_blob_key class (HEAP-native blob key format, 4-byte length +
data pointer)
- Duplicate key fix on HEAP-to-Aria conversion
- hp_blob_key_length() uint32 fix
- hp_rec_hashnr_stored removal
- type_handler_for_tmp_table() param cleanup
- Type_handler_geometry::type_handler_for_tmp_table() virtual
- blob pointer bzero()
- find_unique_row() double-materialization fix
- Tail reclaim review
- Batch tail allocation review
- hp_update.c cleanup
- Field_blob_compressed temp table fix
- row_pack_length() dedup
- pack_length_no_ptr() removal
- Race condition fix in HEAP
- MDEV-39703 mroonga test fix
- MDEV-39825 write_can_replace optimization
- Documentation (Docs/internal-temporary-tables.txt)
Contribution by: Alexander Barkov <bar@mariadb.com>
Type_handler::make_and_init_table_field_ex() -- refactored temp table
field creation from inline code in sql_select.cc into type handler
virtual methods (sql_type.cc, sql_type_geom.cc), enabling clean
per-type-handler field creation for HEAP blob promotion.
751c18c to 738c2af
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.

**NULL dereference in `create_tmp_field()`**: the `SYS_REFCURSOR` plugin returns NULL from `make_new_field()` (cursor values cannot be materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE` without a NULL check. Added an `if (result)` guard.

**xmltype identity loss and recursive CTE reclength mismatch in `Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now requires both: (1) `type_handler_for_tmp_table()` returns `blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>` confirms the original type is a native blob. Condition 1 excludes xmltype (its override returns itself). Condition 2 excludes VARCHAR types promoted via `varstring_type_handler()` -> `too_big_for_varchar()` -> `blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs (e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as `Field_varchar` in the incremental table (`part_of_unique_key=false`), causing a `reclength` mismatch assertion in `select_union_recursive::send_data()` (`main.json_equals` crash).

**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the original `choose_engine()` (both 10.11 and upstream/main) never had a reclength check. MDEV-38975 introduced it when replacing the `blob_fields` condition. HEAP has no internal reclength limit: `hp_create.c` stores `uint reclength` and allocates blocks of that size; `max_supported_record_length()` is only checked in `unireg.cc` during user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have reclength ~880KB (13 bare `Varchar()` columns). The check forced them to Aria, where `fill_slave_status()` returned 0 rows. Removed the check and the unused `reclength` parameter from `pick_engine()`.

**Multi-update `tmp_memory_table_size` override**: the 10.11 feature overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while `tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit `tmp_memory_table_size=0` directive. Since main removed `big_tables` entirely (MDEV-19713), the override is not needed. Removed.

**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth guard rejecting `key_len == 0` to prevent useless zero-length keys from being created by `add_tmp_key()`.

**Non-deterministic `column_compression` test**: HEAP blob support allows compressed VARCHAR/TEXT temp tables to stay in HEAP instead of falling to Aria, changing row iteration order. Added `--sorted_result` to the two MDEV-24726 subqueries that lack `ORDER BY`.

Test changes:
- `spatial_utility_function_collect`: added ORDER BY to a window function that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for MDEV-34016/34060 Aria-specific test sections (blob I_S tables now stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'` with a targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
- `column_compression`: added `--sorted_result` for MDEV-24726 queries
Use `Field_blob_key` as the single unified mechanism for ALL blob columns in HEAP temp tables, replacing Phase 1's dual approach of `Field_blob_key` for GROUP BY/DISTINCT + `rebuild_blob_key_from_segments` for derived table ref access.

Key changes:
- `Tmp_field_param::is_heap_engine()` gates `Field_blob_key` creation for all blob fields in HEAP temp tables (not just `part_of_unique_key`)
- `varstring_type_handler()` promotes VARCHAR > `HEAP_CONVERT_IF_BIGGER_TO_BLOB` to blob via `blob_type_handler()` → `type_handler_blob_key`
- `Type_handler_blob_common::type_handler_for_tmp_table()` returns `blob_key_type_handler()` when `is_heap_engine()`
- `Item_field::create_tmp_field_from_item_field()` redirects HEAP blob fields through the type handler system
- `Item_type_holder::create_tmp_field_ex()` extended for UNION/CTE blobs

Fix three latent bugs in `Field_blob_key` exposed by ref access:
1. `Field_blob_key::key_cmp()` treated key bytes at offset 4 as inline data, but the key format is `[4B length][8B pointer_to_data]`, so the comparison was against raw pointer bytes instead of actual data.
2. `cmp_buffer_with_ref()` eq_ref cache compared raw key buffer bytes. When the `Field_blob_key::value` buffer is reused across lookups, the `[4B length][8B pointer]` bytes don't change even when the pointed-to data differs, causing stale result reuse. Disable the cache for all HEAP blob key parts (remove the `length == 0` guard).
3. `Field_blob_key::new_key_field()` returns `Field_blob_key` (unlike `Field_blob::new_key_field()`, which returns `Field_varstring`). The `store_key` mechanism stores into the `to_field->value` String, which leaks because `store_key` is `Sql_alloc` with no destructor. Add `store_key::cleanup()`, called from `JOIN_TAB::cleanup()` and `subselect_uniquesubquery_engine::cleanup()`.

`Field_blob_key::key_part_length_bytes()` changed from 4 to 0 so `store_length = key_length (12) + null_byte + 0`, matching HEAP's `seg->length (12) + null` for correct multi-part key alignment.

`hp_key_cmp()` blob packlength changed from hardcoded 4 to `seg->bit_start` (actual field packlength) for TEXT (packlength=2).

Re-record GROUP_CONCAT-related results: the `Tmp_field_param` threading through `tmp_table_field_from_field_type()` (base branch) closed a plumbing gap where `Item_sum` and literal items dropped the param, so they now reach the HEAP promotion gates like all other expression items. `GROUP_CONCAT` results in HEAP temp tables become `Field_blob_key` (longtext metadata, 12-byte blob key parts), consistent with the `Item_func` path that was already recorded (e.g. `substring()` sj-materialization keys).
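The stale-cache problem from bug 2 can be reproduced in miniature. In the sketch below, a key of the form `[4B length][8B pointer]` is built twice over the same reused value buffer; the raw key bytes are identical even though the data changed, so a byte-wise cache check would wrongly report "same key". The helper names and host byte order are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { KEY_SIZE = 12 };   /* 4-byte length + 8-byte pointer */

/* Build a [length][pointer] key that refers to value_buf. */
static void key_store(unsigned char *key, const char *value_buf,
                      uint32_t len)
{
  uint64_t p = (uint64_t) (uintptr_t) value_buf;
  memcpy(key, &len, 4);
  memcpy(key + 4, &p, 8);
}

/* Naive cache check: compares raw key bytes only. */
static int raw_key_unchanged(const unsigned char *prev,
                             const unsigned char *cur)
{
  return memcmp(prev, cur, KEY_SIZE) == 0;
}

/* Demonstration: returns 1 if the naive check reports "unchanged"
   although the referenced data was overwritten between lookups. */
static int stale_hit_demo(void)
{
  char value[4] = { 'a', 'a', 'a', 'a' };
  unsigned char prev[KEY_SIZE], cur[KEY_SIZE];
  key_store(prev, value, 4);       /* first lookup */
  memcpy(value, "bbbb", 4);        /* buffer reused for the next value */
  key_store(cur, value, 4);        /* second lookup */
  return raw_key_unchanged(prev, cur) && value[0] == 'b';
}
```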
738c2af to c6f7626
Summary
Phase 1 of MDEV-38975 VARCHAR-to-BLOB promotion for HEAP internal temp tables, forward-ported to main (13.1). Depends on PR #5222 (base HEAP blob support for main).
- Promote VARCHAR wider than `HEAP_CONVERT_IF_BIGGER_TO_BLOB` (32 chars) to `Field_blob_key` in HEAP temp tables, reducing fixed-width row waste
- `Tmp_field_param::is_heap_engine()` gates `Field_blob_key` creation for all blob fields in HEAP temp tables
- `varstring_type_handler()` promotes wide VARCHAR to blob via `blob_type_handler()` for HEAP
- `Type_handler_blob_common::type_handler_for_tmp_table()` returns `blob_key_type_handler()` when `is_heap_engine()`
Main-specific adaptations
- `ha_index_init` assertion (handler.cc): main added a `DBUG_ASSERT` rejecting reads on `HA_UNIQUE_HASH` keys. Relaxed for HEAP since its hash index natively supports blob key lookups (unlike Aria/MyISAM unique hash constraints)
- `add_schema_fields`: blob promotion path used `system_charset_info` (`utf8mb3_general1400_as_ci` on main) instead of `system_charset_info_for_i_s` (`utf8mb3_general_ci`), causing collation mismatches on I_S joins
- `pick_engine()` signature: main removed the `reclength` parameter
- `KEY` non-trivial type: `memset` replaced by value-initialization for Phase 1 unit tests
- `Item_type_holder::create_tmp_field_ex()`: main uses virtual dispatch + `dynamic_cast` dual guard for xmltype/varchar-promotion safety; Phase 1's `is_heap_engine()` feeds through `type_handler_for_tmp_table()` into the existing guard
Test plan
- `heap.heap_blob_derived_keys` -- ref access on promoted BLOB derived table keys
- `main.sj_mat_blob` -- semi-join materialization with promoted BLOB keys
- `main.heap_blob_default` -- default value handling for promoted columns
- `main.subselect_mat`, `main.subselect_sj_mat` -- no regressions in sj-materialize
- `main.information_schema` -- no collation mismatch on I_S joins
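The promotion rule from the summary, and the row-width saving it targets, can be sketched as follows. The threshold value (32 chars) is taken from the summary; the function names, the 12-byte in-row size for a promoted column, and the 1-or-2-byte VARCHAR length prefix are assumptions for illustration. The promoted column's data still occupies continuation records, so this only shows the fixed-width part of the row.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CONVERT_IF_BIGGER_TO_BLOB 32   /* chars, per the summary */
#define PROMOTED_ROW_BYTES 12               /* assumed: 4B length + 8B pointer */

/* Promote only for HEAP temp tables and only above the threshold. */
static int promote_to_blob_key(unsigned varchar_chars, int is_heap_engine)
{
  return is_heap_engine && varchar_chars > HEAP_CONVERT_IF_BIGGER_TO_BLOB;
}

/* Fixed-width bytes the column occupies in each row. */
static size_t row_bytes(unsigned varchar_chars, unsigned mbmaxlen,
                        int is_heap_engine)
{
  size_t data_bytes = (size_t) varchar_chars * mbmaxlen;
  if (promote_to_blob_key(varchar_chars, is_heap_engine))
    return PROMOTED_ROW_BYTES;
  /* Unpromoted VARCHAR: full-width data plus a 1- or 2-byte length. */
  return data_bytes + (data_bytes < 256 ? 1 : 2);
}
```

Under these assumptions a `VARCHAR(1000)` column with a 4-byte-per-character charset shrinks from 4002 fixed bytes per row to 12.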