MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support (main)#5222

Open
arcivanov wants to merge 2 commits into MariaDB:main from arcivanov:MDEV-38975-main

@arcivanov arcivanov commented Jun 11, 2026

Contributor

MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support

JIRA: MDEV-38975
10.11 base PR: #4735
10.11 Phase 1 (VARCHAR->BLOB promotion) PR: #4812
Scope: 182 files changed, 12,434 insertions, 801 deletions (single squashed commit)


Motivation

The growing adoption of JSON columns, Dynamic Columns, and GEOMETRY types means that an increasing number of SQL operations -- GROUP BY, DISTINCT, UNION, subqueries, CTEs -- materialize intermediate results into temporary tables containing BLOB/TEXT fields. Today, the HEAP engine unconditionally rejects BLOB/TEXT columns (HA_NO_BLOBS), forcing the optimizer to use Aria temporary tables even when the actual data is small. This feature removes that restriction, allowing the optimizer to keep temporary tables with BLOB/TEXT columns in memory when they fit within max_heap_table_size / tmp_memory_table_size limits. Additionally, HA_CAN_GEOMETRY is advertised so explicit CREATE TABLE ... ENGINE=MEMORY with GEOMETRY columns works.

Architecture

BLOB data is stored using continuation records -- additional fixed-size records allocated from the same HP_BLOCK that holds regular rows. This reuses existing allocation, free list, and size accounting with minimal structural change, and avoids per-blob my_malloc() calls.

The existing single-byte visibility flag is extended into a flags byte. Continuation records are grouped into variable-length runs -- contiguous sequences within a leaf block. Only the first record of each run carries a 10-byte header (next_cont pointer + run_rec_count); inner records are pure payload.
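
To make the 10-byte run header concrete, here is a minimal sketch of one way it could be encoded. It assumes the header is an 8-byte `next_cont` pointer followed by a 2-byte `run_rec_count` (which would match the uint16 cap listed under Limitations). The field order, the native byte order, and the helper names `run_header_write` / `run_header_read` are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 8-byte next_cont pointer + 2-byte run_rec_count. */
#define RUN_HEADER_SIZE 10
_Static_assert(sizeof(uint64_t) + sizeof(uint16_t) == RUN_HEADER_SIZE,
               "header must be 10 bytes");

/* Hypothetical helper: write the header into the first record of a run. */
static void run_header_write(unsigned char *rec,
                             const unsigned char *next_cont,
                             uint16_t run_rec_count)
{
  uint64_t ptr = (uint64_t)(uintptr_t)next_cont;
  assert(run_rec_count > 0);               /* a run holds at least one record */
  memcpy(rec, &ptr, sizeof ptr);
  memcpy(rec + sizeof ptr, &run_rec_count, sizeof run_rec_count);
}

/* Hypothetical helper: read the header back from the first record of a run. */
static void run_header_read(const unsigned char *rec,
                            const unsigned char **next_cont,
                            uint16_t *run_rec_count)
{
  uint64_t ptr;
  memcpy(&ptr, rec, sizeof ptr);
  memcpy(run_rec_count, rec + sizeof ptr, sizeof *run_rec_count);
  *next_cont = (const unsigned char *)(uintptr_t)ptr;
}
```

Using `memcpy` rather than a packed struct sidesteps alignment concerns, since a record buffer gives no alignment guarantee for a pointer-sized field.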

Three storage formats, detected by flag bits via inline predicates:

| Case | Flag | Layout | Read |
|------|------|--------|------|
| A | HP_ROW_SINGLE_REC | Single record, no header, data at offset 0 | Zero-copy |
| B | HP_ROW_CONT_ZEROCOPY | Single run, multiple records. Header in rec 0, data in rec 1..N-1 | Zero-copy via chain + recbuffer |
| C | HP_ROW_MULTIPLE_REC | One or more runs linked via next_cont | Reassembly into blob_buff |
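
The three cases can be told apart with small inline predicates over the flags byte. The sketch below uses the flag names from this PR, but the bit values, the `blob_case` enum, and the function names are assumptions made for illustration; the real definitions live in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Flag names come from the PR; the bit positions here are assumed. */
enum {
  HP_ROW_HAS_CONT      = 1 << 1,
  HP_ROW_IS_CONT       = 1 << 2,
  HP_ROW_CONT_ZEROCOPY = 1 << 3,
  HP_ROW_SINGLE_REC    = 1 << 4,
  HP_ROW_MULTIPLE_REC  = 1 << 5
};

typedef enum { BLOB_CASE_A, BLOB_CASE_B, BLOB_CASE_C, BLOB_CASE_INVALID } blob_case;

static inline int is_single_rec(uint8_t f)   { return (f & HP_ROW_SINGLE_REC) != 0; }
static inline int is_zerocopy(uint8_t f)     { return (f & HP_ROW_CONT_ZEROCOPY) != 0; }
static inline int is_multiple_rec(uint8_t f) { return (f & HP_ROW_MULTIPLE_REC) != 0; }

/* Classify the first continuation record of a blob by its flags byte. */
static blob_case classify_blob(uint8_t flags)
{
  if (!(flags & HP_ROW_IS_CONT))
    return BLOB_CASE_INVALID;       /* not a continuation record at all */
  if (is_single_rec(flags))
    return BLOB_CASE_A;             /* data at offset 0, zero-copy */
  if (is_zerocopy(flags))
    return BLOB_CASE_B;             /* one run, zero-copy */
  if (is_multiple_rec(flags))
    return BLOB_CASE_C;             /* needs reassembly */
  return BLOB_CASE_INVALID;
}
```

A reader would then branch on the result: cases A and B hand out a pointer into the block, and only case C copies.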

Run allocation proceeds in up to three steps: (1) a peek-then-unlink walk of the free list that detects contiguous groups, (2) tail allocation from HP_BLOCK for the remaining data, and (3) a scavenge fallback that walks the entire free list when tail allocation fails.
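
The peek-then-unlink idea can be sketched on a toy free list. This is an illustrative model only: it assumes free slots are identified by an index and that two slots are contiguous when the next index is exactly one higher. The `free_slot` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-list node: `index` stands in for a record's position in the block. */
typedef struct free_slot {
  size_t index;
  struct free_slot *next;
} free_slot;

/* Peek: count how many slots at the head are contiguous, up to `wanted`,
   without modifying the list. */
static size_t peek_contiguous(const free_slot *head, size_t wanted)
{
  size_t n = 0;
  const free_slot *cur = head;
  while (cur && n < wanted)
  {
    n++;
    if (!cur->next || cur->next->index != cur->index + 1)
      break;                              /* contiguous group ends here */
    cur = cur->next;
  }
  return n;
}

/* Unlink: detach the first n slots and return them as a separate list. */
static free_slot *unlink_run(free_slot **head, size_t n)
{
  free_slot *run = *head, *last = NULL, *cur = *head;
  assert(head != NULL);
  for (size_t i = 0; i < n && cur; i++)
  {
    last = cur;
    cur = cur->next;
  }
  if (!last)
    return NULL;                          /* nothing requested or list empty */
  last->next = NULL;
  *head = cur;
  return run;
}
```

Separating the read-only peek from the unlink means the list is left untouched if the group at the head turns out to be too short to be worth taking.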

HP_SHARE::total_records tracks all physical records (primary + continuation), while HP_SHARE::records remains the logical count used by hash bucket mapping.

Reassembly buffer (HP_INFO::blob_buff) follows the same pattern as InnoDB's blob_heap -- allocated once, grown via my_realloc, freed on heap_reset()/close. Zero-copy cases (A/B) return pointers directly into HP_BLOCK with no copy.
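
A minimal sketch of the "allocate once, grow on demand, free at reset" pattern for the Case C path follows. It uses plain `realloc`/`free` in place of `my_realloc`, and the `blob_buff` struct, `run_chunk` type, and function names are hypothetical stand-ins, not the patch's definitions. Length overflow checks are omitted for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for HP_INFO::blob_buff and its capacity. */
typedef struct { unsigned char *buf; size_t cap; } blob_buff;

/* One run's payload as seen by the reader. */
typedef struct { const unsigned char *data; size_t len; } run_chunk;

/* Grow the buffer only when the request exceeds the current capacity. */
static int blob_buff_reserve(blob_buff *b, size_t need)
{
  unsigned char *p;
  if (need <= b->cap)
    return 0;
  p = realloc(b->buf, need);
  if (!p)
    return -1;                     /* the old buffer stays valid on failure */
  b->buf = p;
  b->cap = need;
  return 0;
}

/* Case C: copy every run's payload into one contiguous buffer. */
static int reassemble(blob_buff *b, const run_chunk *runs, size_t nruns,
                      size_t *total)
{
  size_t sum = 0, off = 0;
  for (size_t i = 0; i < nruns; i++)
    sum += runs[i].len;
  if (blob_buff_reserve(b, sum))
    return -1;
  for (size_t i = 0; i < nruns; i++)
  {
    if (runs[i].len)
      memcpy(b->buf + off, runs[i].data, runs[i].len);
    off += runs[i].len;
  }
  *total = sum;
  return 0;
}

/* Called on reset/close in this sketch. */
static void blob_buff_free(blob_buff *b)
{
  free(b->buf);
  b->buf = NULL;
  b->cap = 0;
}
```

Because the buffer never shrinks between reads, a scan over many multi-run blobs pays for allocation only when it meets a blob larger than any seen so far.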

Full HASH index key handling for BLOB columns: hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr() are extended for HA_BLOB_PART segments. Hash pre-check optimization skips expensive blob materialization when hashes differ. PAD SPACE collation semantics are preserved.
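
The hash pre-check and the PAD SPACE rule can be illustrated together. The sketch below is not the patch's code: it substitutes a simple FNV-1a hash over the space-trimmed bytes for the collation's own hash function, treats PAD SPACE as "ignore trailing spaces" (a single-byte simplification), and uses a hypothetical `blob_key` struct. A counter records how often the data comparison actually runs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Length with trailing spaces ignored (simplified PAD SPACE rule). */
static size_t trimmed_len(const char *s, size_t len)
{
  while (len && s[len - 1] == ' ')
    len--;
  return len;
}

/* FNV-1a over the trimmed bytes, so 'abc' and 'abc  ' hash identically. */
static uint32_t pad_space_hash(const char *s, size_t len)
{
  uint32_t h = 2166136261u;
  len = trimmed_len(s, len);
  for (size_t i = 0; i < len; i++)
  {
    h ^= (unsigned char)s[i];
    h *= 16777619u;
  }
  return h;
}

/* Hypothetical key: a stored hash plus a reference to the blob bytes. */
typedef struct { uint32_t hash; const char *data; size_t len; } blob_key;

static int blob_key_equal(const blob_key *a, const blob_key *b,
                          unsigned *data_compares)
{
  size_t la, lb;
  if (a->hash != b->hash)
    return 0;                         /* pre-check: the blob is never touched */
  (*data_compares)++;                 /* only now pay for the byte comparison */
  la = trimmed_len(a->data, a->len);
  lb = trimmed_len(b->data, b->len);
  return la == lb && memcmp(a->data, b->data, la) == 0;
}
```

The hash must apply the same padding rule as the comparison. Otherwise two values that compare equal could be rejected by the pre-check.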

Field_blob_key (Monty) produces HEAP-native key format (4-byte length + 8-byte data pointer) directly, eliminating key buffer translation between the SQL layer and HEAP engine.
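
As an illustration of the key format described above (4-byte length followed by an 8-byte data pointer), here is a sketch of packing and unpacking such a key part. The native byte order, the `BLOB_KEY_PART_SIZE` constant, and the helper names are assumptions; this is not the `Field_blob_key` implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOB_KEY_PART_SIZE 12   /* 4-byte length + 8-byte data pointer */

/* Hypothetical helper: the key references the blob bytes, it does not copy them. */
static void blob_key_store(unsigned char *key, const unsigned char *data,
                           uint32_t length)
{
  uint64_t ptr = (uint64_t)(uintptr_t)data;
  assert(key != NULL);
  memcpy(key, &length, sizeof length);
  memcpy(key + sizeof length, &ptr, sizeof ptr);
}

/* Hypothetical helper: returns the length and yields the data pointer. */
static uint32_t blob_key_load(const unsigned char *key,
                              const unsigned char **data)
{
  uint32_t length;
  uint64_t ptr;
  memcpy(&length, key, sizeof length);
  memcpy(&ptr, key + sizeof length, sizeof ptr);
  *data = (const unsigned char *)(uintptr_t)ptr;
  return length;
}
```

The key part stays 12 bytes regardless of blob size, and the referenced bytes must remain valid for as long as the key is in use.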

SQL layer changes

  • choose_engine(): Removed blob_fields check, added reclength > HA_MAX_REC_LENGTH
  • finalize(): HEAP+blob uses fixed-width rows; GROUP BY key setup sets key_part_flag from field, uses item max_length for blob key sizing; store_length initialized for all GROUP BY key parts; DISTINCT key setup skips null-bits helper for HEAP
  • remove_duplicates(): Blob check moved before HEAP check to fall through to remove_dup_with_compare()
  • Aggregator_distinct::add(): Overflow-to-disk conversion via create_internal_tmp_table_from_heap() for non-dup write errors
  • Expression cache: Disabled for HEAP+blob (key format incompatibility)
  • FULLTEXT early detection in mysql_derived_prepare(): Forces disk engine via TMP_TABLE_FORCE_MYISAM when outer query uses MATCH
  • Deferred blob chain free (MDEV-39732): heap_delete() saves chain pointers to pending_blob_chains, flushed on next mutation or heap_reset()/close
  • REPLACE safety (MDEV-39825): HP_SHARE::write_can_replace flag forces copy mode in hp_read_blobs()
  • Geometry GROUP_CONCAT fix (MDEV-39761): Downgrade Field_geom to Field_blob for GROUP_CONCAT temp tables; Type_handler_geometry::type_handler_for_tmp_table() added
  • Geometry GROUP BY key fix (MDEV-39871): Detect when new_key_field() produced non-blob Field_varstring for a blob column, replace with Field_blob_key
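
The deferred blob chain free can be modeled in a few lines. In this sketch, deleting a row parks its blob chains on a pending list instead of freeing them, and the next mutation (or close) releases them. The `heap_table_sketch` struct and the `sketch_*` functions are hypothetical, and `malloc`/`free` stand in for returning records to the HEAP free list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-table state; `pending` models pending_blob_chains. */
typedef struct {
  void **pending;
  size_t count, cap;
  size_t freed_total;     /* instrumentation for the example only */
} heap_table_sketch;

/* Release every chain parked by an earlier delete. */
static void flush_pending(heap_table_sketch *t)
{
  for (size_t i = 0; i < t->count; i++)
  {
    free(t->pending[i]);
    t->freed_total++;
  }
  t->count = 0;
}

/* Delete: flush older chains, then park this row's chains instead of
   freeing them, so pointers already handed to the caller stay valid. */
static int sketch_delete(heap_table_sketch *t, void *const *chains, size_t n)
{
  flush_pending(t);
  if (n > t->cap)
  {
    void **p = realloc(t->pending, n * sizeof *p);
    if (!p)
      return -1;
    t->pending = p;
    t->cap = n;
  }
  for (size_t i = 0; i < n; i++)
    t->pending[i] = chains[i];
  t->count = n;
  return 0;
}

/* Any later mutation starts by flushing; the insert itself is omitted. */
static void sketch_write(heap_table_sketch *t)
{
  flush_pending(t);
}

static void sketch_close(heap_table_sketch *t)
{
  flush_pending(t);
  free(t->pending);
  t->pending = NULL;
  t->cap = 0;
}
```

Between the delete and the next mutation, code that still holds zero-copy pointers into the deleted row's chains can read them safely.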

VARCHAR->BLOB promotion (Phase 1)

VARCHAR fields whose octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB (32 bytes) are automatically promoted to BLOB when the temporary table uses the HEAP engine. This eliminates HEAP's fixed-width row waste for wide VARCHAR columns common in INFORMATION_SCHEMA views, JSON-heavy schemas, and many user tables.

  • pick_engine() extracted from choose_engine() and called early in start() to set m_heap_expected
  • Promotion logic in blob_type_handler() (sql/field.cc) and varstring_type_handler() (sql/sql_type.cc), gated by Tmp_field_param::is_heap_engine()
  • derived_with_keys ref access for BLOB columns: new heap_store_key_blob_ref store_key subclass bypasses SQL-layer key buffer, writes directly into record[0]'s Field_blob
  • create_ref_for_key() memory leak fix for promoted Field_blob fields via explicit tmp.cleanup()
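
The promotion rule itself reduces to a threshold test. The sketch below assumes that octet_length is the character length multiplied by the character set's maximum bytes per character and that it excludes the VARCHAR length prefix. The function name and parameters are hypothetical; only the constant name and its value of 32 come from this description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HEAP_CONVERT_IF_BIGGER_TO_BLOB 32   /* bytes, per the PR description */

/* Hypothetical predicate: should this VARCHAR become a BLOB in the tmp table? */
static bool promote_varchar_to_blob(uint32_t char_length, uint32_t mbmaxlen,
                                    bool heap_expected)
{
  uint64_t octet_length;
  assert(mbmaxlen > 0);
  octet_length = (uint64_t)char_length * mbmaxlen;  /* 64-bit: cannot overflow */
  return heap_expected && octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB;
}
```

Under these assumptions a `VARCHAR(10)` in a 4-byte character set (40 bytes) would be promoted, while a `VARCHAR(32)` in a single-byte character set would not.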

Performance

Non-blob tables: Zero regression. Every blob-specific code path is guarded by if (share->blob_count). No new allocations, no format changes, no hash function changes for non-blob keys.

Blob tables: Eliminates file creation/deletion overhead and page cache management. For single-run blobs (common case), the read path is entirely zero-copy.

Limitations

  1. No BTREE indexes on blob columns (HASH only)
  2. No partial-key prefix indexing for blobs
  3. 2x memory for Case C reads only (A/B are 1x)
  4. No blob compression
  5. 65,535 records per run (uint16 cap, auto-split)
  6. max_heap_table_size applies to continuation records
  7. Expression cache disabled for HEAP+blob
  8. FULLTEXT forces disk engine
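
To illustrate limitation 5, the following sketch splits a required record count into runs of at most 65,535 records each. It models only the uint16 cap; in the patch, runs are also shaped by where contiguous free space is found. The function name, the output array, and the `SIZE_MAX` error convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECS_PER_RUN 65535u   /* uint16 cap on run_rec_count */

/* Fill `runs` with per-run record counts; returns the number of runs,
   or SIZE_MAX if `max_runs` entries are not enough. */
static size_t split_into_runs(size_t records, uint16_t *runs, size_t max_runs)
{
  size_t n = 0;
  while (records > 0)
  {
    size_t take = records < MAX_RECS_PER_RUN ? records : MAX_RECS_PER_RUN;
    if (n == max_runs)
      return SIZE_MAX;
    runs[n++] = (uint16_t)take;
    records -= take;
  }
  return n;
}
```

A blob needing 65,536 records would therefore occupy two runs, of 65,535 records and 1 record, linked through `next_cont`.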

Linked bugs fixed

  • MDEV-39703: mroonga fulltext test ordering
  • MDEV-39723: ER_DUP_ENTRY on GROUP BY with blob column
  • MDEV-39724: crash in hp_is_single_rec with GROUP BY
  • MDEV-39732: slave crash in hp_free_run_chain on blob replication
  • MDEV-39761: Field_geom::store() assertion in GROUP_CONCAT
  • MDEV-39782: RBR ER_KEY_NOT_FOUND on HEAP blob UPDATE
  • MDEV-39825: blob data corruption on REPLACE into HEAP table
  • MDEV-39871: crash in my_hash_sort_bin on GROUP BY with geometry

Forward-port adaptations (main-only)

This feature was developed on 10.11 and squashed onto main. Beyond conflict resolution across 28 files, the following adaptations were required:

API:

  • my_ci_hash_sort() converted to (&hasher, cs, data, len) API in hp_hash.c blob paths
  • big_tables removed (MDEV-19713); replaced with tmp_memory_table_size save/restore in sql_update.cc
  • TABLE_SHARE::uniques replaced with have_unique_constraint()
  • type_handler_for_tmp_table() and make_new_field() 4th Tmp_field_param parameter added to type_xmltype and type_cursor plugins
  • KEY struct memset replaced with = {} value-initialization; TYPELIB/Field_enum updated for Type_typelib_attributes

Behavioral fixes (main-only):

  • RESULT_TMP_TABLE added to free_tmp_table() assertion (sql/sql_select.cc:24160)
  • sql_delete.cc:1331 multi-delete ORDER changed from stack-local to alloc_root (matching sql_update.cc:2206) -- eliminates dangling pointer through feature's GROUP BY field-free loop
  • GROUP BY key field memory leak on HEAP-to-disk conversion fixed -- free_tmp_table() group field-free loop runs unconditionally (sql/sql_select.cc:24182)

Test re-recordings:

  • main/blob_sj_test.result: optimizer selectivity estimate changed
  • main/derived_view.result: cardinality estimate changed
  • heap/blob.result: new max_sort_length warning on ORDER BY of long blobs

Contributors

  • Reviewed by: Michael Widenius -- Field_blob_key class, 24 review/code commits across the feature (duplicate key fix, hp_blob_key_length(), hp_rec_hashnr_stored removal, type_handler_for_tmp_table() cleanup, Field_blob_compressed fix, race condition fix, MDEV-39825 optimization, Docs/internal-temporary-tables.txt, and more)
  • Contribution by: Alexander Barkov -- Type_handler::make_and_init_table_field_ex() refactoring (temp table field creation into type handler virtual methods)

Test plan

  • Debug build clean (zero compiler errors/warnings)
  • Full heap test suite (25 tests)
  • main.select, main.distinct, main.group_by, main.derived_view, main.information_schema, main.blob_sj_test, main.func_group, main.type_blob, main.subselect, main.subselect2, main.derived, main.union, main.delete, main.multi_update
  • No memory leaks (Debug build internal memory accounting clean on shutdown)
  • CI full test suite

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements support for BLOB, TEXT, JSON, and GEOMETRY columns in the HEAP (MEMORY) storage engine using variable-length continuation runs. The review identified several critical issues, including potential integer overflows in hp_blob.c during the calculation of total_records_needed and the accumulation of total_copy_size, which could lead to memory corruption. Additionally, a buffer overflow vulnerability was found in hp_create.c due to an under-allocated record buffer when using reclength instead of visible_offset. Other issues include a memory leak of pending blob frees in hp_close.c when delete_on_close is false, and unresolved merge conflict markers in the type_enum.result and type_set.result test files.
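
For readers unfamiliar with this class of bug, the sketch below shows one common way to make size arithmetic of the kind named in the review overflow-safe. It is not the fix applied in the PR: the function names are hypothetical, and it assumes sizes are carried in `size_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checked addition, e.g. for accumulating a total copy size. */
static int add_size_checked(size_t a, size_t b, size_t *out)
{
  if (a > SIZE_MAX - b)
    return -1;                        /* the sum would wrap around */
  *out = a + b;
  return 0;
}

/* Records needed to hold blob_len bytes at payload_per_rec bytes each.
   Ceiling division written so that no intermediate value can overflow. */
static int records_for_payload(size_t blob_len, size_t payload_per_rec,
                               size_t *out)
{
  if (payload_per_rec == 0)
    return -1;
  *out = blob_len / payload_per_rec + (blob_len % payload_per_rec != 0);
  return 0;
}
```

The familiar `(len + per_rec - 1) / per_rec` form wraps when `len` is close to `SIZE_MAX`; the quotient-plus-remainder form does not.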

Comment thread storage/heap/hp_blob.c
Comment thread storage/heap/hp_create.c
Comment thread storage/heap/hp_blob.c
Comment thread storage/heap/hp_close.c
Comment thread mysql-test/main/type_enum.result Outdated
Comment thread mysql-test/main/type_set.result Outdated
arcivanov added a commit to arcivanov/mariadb-server that referenced this pull request Jun 12, 2026
Five code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.

**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin
returns NULL from `make_new_field()` (cursor values cannot be
materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE`
without a NULL check. Added `if (result)` guard.

**xmltype identity loss in `Item_type_holder::create_tmp_field_ex()`**:
`dynamic_cast<Type_handler_blob_common*>` matched `Type_handler_xmltype`
(inherits `Type_handler_long_blob`), replacing it with `Field_blob_key`.
Replaced with virtual dispatch via `type_handler_for_tmp_table()` --
xmltype's override preserves its identity, blob_common's returns
`blob_key_type_handler()` when `part_of_unique_key`. Pack length recovery
uses `blob_type_handler(max_length, NULL)->length_bytes()` since the
original type_handler can be varchar (promoted via `too_big_for_varchar()`),
not just blob -- a `static_cast<Type_handler_blob_common*>` on varchar
would crash (INTERSECT ALL with `varchar(1024)` in utf8mb3).

**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the
original `choose_engine()` (both 10.11 and upstream/main) never had a
reclength check. MDEV-38975 introduced it when replacing the
`blob_fields` condition. HEAP has no internal reclength limit --
`hp_create.c` stores `uint reclength` and allocates blocks of that size;
`max_supported_record_length()` is only checked in `unireg.cc` during
user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have
reclength ~880KB (13 bare `Varchar()` columns). The check forced them to
Aria where `fill_slave_status()` returned 0 rows. Removed the check and
the unused `reclength` parameter from `pick_engine()`.

**Multi-update `tmp_memory_table_size` override**: the 10.11 feature
overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port
translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable
was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while
`tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit
`tmp_memory_table_size=0` directive. Since main removed `big_tables`
entirely (MDEV-19713), the override is not needed. Removed.

Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function
  that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for
  MDEV-34016/34060 Aria-specific test sections (blob I_S tables now
  stay in MEMORY)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
arcivanov added a commit to arcivanov/mariadb-server that referenced this pull request Jun 12, 2026
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.

**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin
returns NULL from `make_new_field()` (cursor values cannot be
materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE`
without a NULL check. Added `if (result)` guard.

**xmltype identity loss and recursive CTE reclength mismatch in
`Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now
requires both: (1) `type_handler_for_tmp_table()` returns
`blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>`
confirms the original type is a native blob. Condition 1 excludes xmltype
(its override returns itself). Condition 2 excludes VARCHAR types promoted
via `varstring_type_handler()` -> `too_big_for_varchar()` ->
`blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs
(e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in
the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as
`Field_varchar` in the incremental table (`part_of_unique_key=false`),
causing a `reclength` mismatch assertion in
`select_union_recursive::send_data()` (`main.json_equals` crash).

**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the
original `choose_engine()` (both 10.11 and upstream/main) never had a
reclength check. MDEV-38975 introduced it when replacing the
`blob_fields` condition. HEAP has no internal reclength limit --
`hp_create.c` stores `uint reclength` and allocates blocks of that size;
`max_supported_record_length()` is only checked in `unireg.cc` during
user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have
reclength ~880KB (13 bare `Varchar()` columns). The check forced them to
Aria where `fill_slave_status()` returned 0 rows. Removed the check and
the unused `reclength` parameter from `pick_engine()`.

**Multi-update `tmp_memory_table_size` override**: the 10.11 feature
overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port
translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable
was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while
`tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit
`tmp_memory_table_size=0` directive. Since main removed `big_tables`
entirely (MDEV-19713), the override is not needed. Removed.

**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth
guard rejecting `key_len == 0` to prevent useless zero-length keys from
being created by `add_tmp_key()`.

Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function
  that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for
  MDEV-34016/34060 Aria-specific test sections (blob I_S tables now
  stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'`
  with targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
@arcivanov arcivanov changed the title from "MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support" to "MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support (main)" Jun 12, 2026
@gkodinov gkodinov added the "External Contribution" label (All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.) Jun 12, 2026
@gkodinov gkodinov requested a review from montywi June 12, 2026 11:33

@gkodinov gkodinov left a comment

Member

Thank you for your contribution! This is a preliminary review.

Some small issues found. I'd also consider squashing the two commits into one.

Comment thread include/my_compare.h
Comment thread .gitignore Outdated
@arcivanov arcivanov requested a review from gkodinov June 12, 2026 14:57
…e blob columns

Remove the HA_NO_BLOBS restriction from the HEAP engine, allowing
the optimizer to keep temporary tables with BLOB/TEXT columns in
memory when they fit within max_heap_table_size / tmp_memory_table_size
limits.  Additionally, advertise HA_CAN_GEOMETRY so explicit
CREATE TABLE ... ENGINE=MEMORY with GEOMETRY columns works.

Unlike other HEAP blob implementations (e.g. Percona), this patch
provides full HASH index support on blob columns, enabling efficient
lookups, GROUP BY, and DISTINCT operations directly in HEAP without
falling back to disk.

Architecture
------------

BLOB data is stored using continuation records -- additional fixed-size
records allocated from the same HP_BLOCK that holds regular rows.  This
reuses existing allocation, free list, and size accounting with minimal
structural change, and avoids per-blob my_malloc() calls.

The existing single-byte visibility flag is extended into a flags byte
with bits for HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY,
HP_ROW_SINGLE_REC, and HP_ROW_MULTIPLE_REC.  Continuation records are
grouped into variable-length runs -- contiguous sequences within a leaf
block.  Only the first record of each run carries a 10-byte header
(next_cont pointer + run_rec_count); inner records are pure payload.

Three storage formats, detected by flag bits via inline predicates:

  Case A (HP_ROW_SINGLE_REC): single record, no header, data at
  offset 0.  Zero-copy read.

  Case B (HP_ROW_CONT_ZEROCOPY): single run, multiple records.
  Header in rec 0, data contiguous in rec 1..N-1.  Zero-copy read
  via chain + recbuffer.

  Case C (HP_ROW_MULTIPLE_REC): one or more runs linked via
  next_cont.  Reassembly into blob_buff required.

Run allocation uses a two-phase strategy: (1) peek-then-unlink walk
of the free list detecting contiguous groups, (2) tail allocation
from HP_BLOCK for remaining data.  A Step 3 scavenge fallback
walks the entire free list when tail allocation fails.

HP_SHARE::total_records tracks all physical records (primary +
continuation), while HP_SHARE::records remains the logical count
used by hash bucket mapping.

Reassembly buffer (HP_INFO::blob_buff) follows the same pattern as
InnoDB's blob_heap -- allocated once, grown via my_realloc, freed
on heap_reset()/close.  Zero-copy cases (A/B) return pointers
directly into HP_BLOCK with no copy.

Full HASH index key handling for BLOB columns: hp_rec_hashnr(),
hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr() are
extended for HA_BLOB_PART segments.  Hash pre-check optimization
skips expensive blob materialization when hashes differ.  PAD SPACE
collation semantics are preserved for blob key comparisons.

Field_blob_key (Monty) produces HEAP-native key format (4-byte length
+ 8-byte data pointer) directly, eliminating key buffer translation
between the SQL layer and HEAP engine.

SQL layer changes
-----------------

choose_engine(): removed blob_fields check, added reclength >
HA_MAX_REC_LENGTH.

finalize(): HEAP+blob uses fixed-width rows; GROUP BY key setup sets
key_part_flag from field, uses item max_length for blob key sizing.
store_length initialized for all GROUP BY key parts.  DISTINCT key
setup skips null-bits helper for HEAP.

remove_duplicates(): blob check moved before HEAP check to fall
through to remove_dup_with_compare().

Aggregator_distinct::add(): overflow-to-disk conversion via
create_internal_tmp_table_from_heap() for non-dup write errors.

Expression cache disabled for HEAP+blob (key format incompatibility).

FULLTEXT early detection in mysql_derived_prepare(): forces disk
engine via TMP_TABLE_FORCE_MYISAM when outer query uses MATCH.

Deferred blob chain free (MDEV-39732): heap_delete() saves chain
pointers to pending_blob_chains, flushed on next mutation or
heap_reset()/close.  Prevents dangling zero-copy pointers during
binlog_log_row().

REPLACE safety (MDEV-39825): HP_SHARE::write_can_replace flag
forces copy mode in hp_read_blobs(), preventing blob data corruption
from freed-then-reused continuation records during REPLACE.

Geometry GROUP_CONCAT fix (MDEV-39761): downgrade Field_geom to
Field_blob for GROUP_CONCAT temp tables in both expression creation
paths.  Type_handler_geometry::type_handler_for_tmp_table() added.

Geometry GROUP BY key fix (MDEV-39871): detect when new_key_field()
produced non-blob Field_varstring for a blob column, replace with
Field_blob_key.

Performance
-----------

Non-blob tables: zero regression.  Every blob-specific code path is
guarded by if (share->blob_count).  No new allocations, no format
changes, no hash function changes for non-blob keys.

Blob tables: eliminates file creation/deletion overhead and page cache
management.  For single-run blobs (common case), the read path is
entirely zero-copy.

Limitations
-----------

1. No BTREE indexes on blob columns (HASH only)
2. No partial-key prefix indexing for blobs
3. 2x memory for Case C reads only (A/B are 1x)
4. No blob compression
5. 65,535 records per run (uint16 cap, auto-split)
6. max_heap_table_size applies to continuation records
7. Expression cache disabled for HEAP+blob
8. FULLTEXT forces disk engine

Linked bugs fixed:

- MDEV-39703: mroonga fulltext test ordering
- MDEV-39723: ER_DUP_ENTRY on GROUP BY with blob column
- MDEV-39724: crash in hp_is_single_rec with GROUP BY
- MDEV-39732: slave crash in hp_free_run_chain on blob replication
- MDEV-39761: Field_geom::store() assertion in GROUP_CONCAT
- MDEV-39782: RBR ER_KEY_NOT_FOUND on HEAP blob UPDATE
- MDEV-39825: blob data corruption on REPLACE into HEAP table
- MDEV-39871: crash in my_hash_sort_bin on GROUP BY with geometry

Reviewed by: Michael Widenius <monty@mariadb.org>
  Monty reviewed the entire patch. Areas where he suggested changes
  or contributed code:
  - Field_blob_key class (HEAP-native blob key format, 4-byte length +
    data pointer)
  - Duplicate key fix on HEAP-to-Aria conversion
  - hp_blob_key_length() uint32 fix
  - hp_rec_hashnr_stored removal
  - type_handler_for_tmp_table() param cleanup
  - Type_handler_geometry::type_handler_for_tmp_table() virtual
  - blob pointer bzero()
  - find_unique_row() double-materialization fix
  - Tail reclaim review
  - Batch tail allocation review
  - hp_update.c cleanup
  - Field_blob_compressed temp table fix
  - row_pack_length() dedup
  - pack_length_no_ptr() removal
  - Race condition fix in HEAP
  - MDEV-39703 mroonga test fix
  - MDEV-39825 write_can_replace optimization
  - Documentation (Docs/internal-temporary-tables.txt)
Contribution by: Alexander Barkov <bar@mariadb.com>
  Type_handler::make_and_init_table_field_ex() -- refactored temp table
  field creation from inline code in sql_select.cc into type handler
  virtual methods (sql_type.cc, sql_type_geom.cc), enabling clean
  per-type-handler field creation for HEAP blob promotion.
arcivanov added a commit to arcivanov/mariadb-server that referenced this pull request Jun 12, 2026
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.

**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin
returns NULL from `make_new_field()` (cursor values cannot be
materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE`
without a NULL check. Added `if (result)` guard.

**xmltype identity loss and recursive CTE reclength mismatch in
`Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now
requires both: (1) `type_handler_for_tmp_table()` returns
`blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>`
confirms the original type is a native blob. Condition 1 excludes xmltype
(its override returns itself). Condition 2 excludes VARCHAR types promoted
via `varstring_type_handler()` -> `too_big_for_varchar()` ->
`blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs
(e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in
the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as
`Field_varchar` in the incremental table (`part_of_unique_key=false`),
causing a `reclength` mismatch assertion in
`select_union_recursive::send_data()` (`main.json_equals` crash).
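
The two-condition dispatch can be sketched with a simplified model. Every class below is a hypothetical stand-in, not the server's type-handler hierarchy: `HandlerSketch::for_tmp_table()` plays the role of `type_handler_for_tmp_table()`, `BlobCommonSketch` the role of `Type_handler_blob_common`, and the way `WideVarcharSketch` reaches the blob-key handler is collapsed into a single override purely for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified model of a type handler.
struct HandlerSketch {
  virtual ~HandlerSketch() = default;
  virtual const HandlerSketch *for_tmp_table() const { return this; }
};

struct BlobKeyHandlerSketch : HandlerSketch {};
static const BlobKeyHandlerSketch blob_key_handler{};

// Native BLOB/TEXT types map to the blob-key handler.
struct BlobCommonSketch : HandlerSketch {
  const HandlerSketch *for_tmp_table() const override { return &blob_key_handler; }
};

// A type that keeps its own identity in temp tables (models xmltype).
struct XmlTypeSketch : BlobCommonSketch {
  const HandlerSketch *for_tmp_table() const override { return this; }
};

// A wide VARCHAR that ends up at the blob-key handler via promotion.
struct WideVarcharSketch : HandlerSketch {
  const HandlerSketch *for_tmp_table() const override { return &blob_key_handler; }
};

// Use the blob-key field only when both conditions hold.
bool use_blob_key(const HandlerSketch *h) {
  assert(h != nullptr);
  bool maps_to_blob_key = h->for_tmp_table() == &blob_key_handler;         // condition 1
  bool is_native_blob = dynamic_cast<const BlobCommonSketch *>(h) != nullptr;  // condition 2
  return maps_to_blob_key && is_native_blob;
}
```

In this model the xmltype-like handler fails condition 1 and the promoted VARCHAR fails condition 2, so only a native blob handler takes the blob-key path.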

**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the
original `choose_engine()` (both 10.11 and upstream/main) never had a
reclength check. MDEV-38975 introduced it when replacing the
`blob_fields` condition. HEAP has no internal reclength limit --
`hp_create.c` stores `uint reclength` and allocates blocks of that size;
`max_supported_record_length()` is only checked in `unireg.cc` during
user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have
reclength ~880KB (13 bare `Varchar()` columns). The check forced them to
Aria where `fill_slave_status()` returned 0 rows. Removed the check and
the unused `reclength` parameter from `pick_engine()`.

**Multi-update `tmp_memory_table_size` override**: the 10.11 feature
overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port
translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable
was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while
`tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit
`tmp_memory_table_size=0` directive. Since main removed `big_tables`
entirely (MDEV-19713), the override is not needed. Removed.

**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth
guard rejecting `key_len == 0` to prevent useless zero-length keys from
being created by `add_tmp_key()`.

**Non-deterministic `column_compression` test**: HEAP blob support allows
compressed VARCHAR/TEXT temp tables to stay in HEAP instead of falling to
Aria, changing row iteration order. Added `--sorted_result` to the two
MDEV-24726 subqueries that lack `ORDER BY`.

Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function
  that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for
  MDEV-34016/34060 Aria-specific test sections (blob I_S tables now
  stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'`
  with targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
- `column_compression`: added `--sorted_result` for MDEV-24726 queries