
MDEV-38975: Promote wide VARCHAR to BLOB for HEAP internal temp tables - Phase 1 (main)#5225

Open: arcivanov wants to merge 3 commits into MariaDB:main from arcivanov:MDEV-38975-Phase1-varchar-blob-promotion-main

Conversation

@arcivanov

Contributor

Summary

Phase 1 of MDEV-38975 VARCHAR-to-BLOB promotion for HEAP internal temp tables, forward-ported to main (13.1). Depends on PR #5222 (base HEAP blob support for main).

  • Promotes VARCHAR columns wider than HEAP_CONVERT_IF_BIGGER_TO_BLOB (32 chars) to Field_blob_key in HEAP temp tables, reducing fixed-width row waste
  • Tmp_field_param::is_heap_engine() gates Field_blob_key creation for all blob fields in HEAP temp tables
  • varstring_type_handler() promotes wide VARCHAR to blob via blob_type_handler() for HEAP
  • Type_handler_blob_common::type_handler_for_tmp_table() returns blob_key_type_handler() when is_heap_engine()
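
The promotion gate described in the bullets above can be sketched as follows. This is a minimal illustration, not the MariaDB implementation: `TmpFieldKind` and `pick_tmp_field_kind()` are hypothetical names, only the 32-character threshold comes from the summary, and the sketch models the HEAP-specific gate only (any pre-existing conversion rules for other engines are ignored).

```cpp
#include <cassert>
#include <cstddef>

// Threshold named in the summary; the value 32 (characters) is from the PR text.
constexpr std::size_t HEAP_CONVERT_IF_BIGGER_TO_BLOB = 32;

// Hypothetical stand-ins for the field classes a temp table column can get.
enum class TmpFieldKind { Varchar, Blob, BlobKey };

// Hypothetical helper: choose the temp-table field kind for one column.
TmpFieldKind pick_tmp_field_kind(bool is_blob, std::size_t char_length,
                                 bool is_heap_engine)
{
  if (is_blob)  // native blobs become blob-key fields only in HEAP
    return is_heap_engine ? TmpFieldKind::BlobKey : TmpFieldKind::Blob;
  if (is_heap_engine && char_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB)
    return TmpFieldKind::BlobKey;  // wide VARCHAR is promoted
  return TmpFieldKind::Varchar;    // narrow VARCHAR stays fixed-width
}
```

Under these assumptions, a VARCHAR(32) column stays a VARCHAR while a VARCHAR(33) column in a HEAP temp table is promoted.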

Main-specific adaptations

  • ha_index_init assertion (handler.cc): main added a DBUG_ASSERT rejecting reads on HA_UNIQUE_HASH keys. Relaxed for HEAP since its hash index natively supports blob key lookups (unlike Aria/MyISAM unique hash constraints)
  • I_S collation fix (add_schema_fields): blob promotion path used system_charset_info (utf8mb3_general1400_as_ci on main) instead of system_charset_info_for_i_s (utf8mb3_general_ci), causing collation mismatches on I_S joins
  • pick_engine() signature: main removed the reclength parameter
  • KEY non-trivial type: on main KEY is a non-trivial type, so the Phase 1 unit tests switch from memset to value-initialization
  • Item_type_holder::create_tmp_field_ex(): main uses virtual dispatch + dynamic_cast dual guard for xmltype/varchar-promotion safety; Phase 1's is_heap_engine() feeds through type_handler_for_tmp_table() into the existing guard

Test plan

  • heap.heap_blob_derived_keys -- ref access on promoted BLOB derived table keys
  • main.sj_mat_blob -- semi-join materialization with promoted BLOB keys
  • main.heap_blob_default -- default value handling for promoted columns
  • main.subselect_mat, main.subselect_sj_mat -- no regressions in sj-materialize
  • main.information_schema -- no collation mismatch on I_S joins
  • Full CI suite

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for BLOB, TEXT, JSON, and GEOMETRY columns in the HEAP (MEMORY) storage engine, implementing continuation record chains and new key types (HA_KEYTYPE_VARTEXT4 and HA_KEYTYPE_VARBINARY4) for internal temporary tables. The review feedback highlights several critical issues: in mysys/my_compare.c, pointer copying for the new key types must be offset by 4 bytes to prevent memory corruption; in sql/sql_select.h, a dummy field allocation in heap_store_key_blob_ref should be avoided to prevent a memory leak; and in sql/sql_show.cc, a type-safe memmove should replace the error-prone bmove with manual pointer arithmetic.


Comment thread mysys/my_compare.c
Comment thread mysys/my_compare.c
Comment thread sql/sql_select.h
Comment thread sql/sql_show.cc
@arcivanov arcivanov force-pushed the MDEV-38975-Phase1-varchar-blob-promotion-main branch 3 times, most recently from e418e5b to 6d3f3c7 Compare June 12, 2026 07:35
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 12, 2026
@gkodinov gkodinov requested a review from montywi June 12, 2026 11:07

@gkodinov gkodinov left a comment

Member


Thank you for your contribution! This is a preliminary review.

LGTM (on formal criteria like tests passing etc).

Please stand by for the final review.

@arcivanov arcivanov force-pushed the MDEV-38975-Phase1-varchar-blob-promotion-main branch from 6d3f3c7 to 751c18c Compare June 12, 2026 14:55
…e blob columns

Remove the HA_NO_BLOBS restriction from the HEAP engine, allowing
the optimizer to keep temporary tables with BLOB/TEXT columns in
memory when they fit within max_heap_table_size / tmp_memory_table_size
limits.  Additionally, advertise HA_CAN_GEOMETRY so explicit
CREATE TABLE ... ENGINE=MEMORY with GEOMETRY columns works.

Unlike other HEAP blob implementations (e.g. Percona), this patch
provides full HASH index support on blob columns, enabling efficient
lookups, GROUP BY, and DISTINCT operations directly in HEAP without
falling back to disk.

Architecture
------------

BLOB data is stored using continuation records -- additional fixed-size
records allocated from the same HP_BLOCK that holds regular rows.  This
reuses existing allocation, free list, and size accounting with minimal
structural change, and avoids per-blob my_malloc() calls.

The existing single-byte visibility flag is extended into a flags byte
with bits for HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY,
HP_ROW_SINGLE_REC, and HP_ROW_MULTIPLE_REC.  Continuation records are
grouped into variable-length runs -- contiguous sequences within a leaf
block.  Only the first record of each run carries a 10-byte header
(next_cont pointer + run_rec_count); inner records are pure payload.

Three storage formats, detected by flag bits via inline predicates:

  Case A (HP_ROW_SINGLE_REC): single record, no header, data at
  offset 0.  Zero-copy read.

  Case B (HP_ROW_CONT_ZEROCOPY): single run, multiple records.
  Header in rec 0, data contiguous in rec 1..N-1.  Zero-copy read
  via chain + recbuffer.

  Case C (HP_ROW_MULTIPLE_REC): one or more runs linked via
  next_cont.  Reassembly into blob_buff required.
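
A minimal sketch of how the three formats could be told apart from the flags byte. The flag names are taken from the text above; the bit positions, the `BlobFormat` enum and both helper functions are assumptions made for illustration and may not match the patch.

```cpp
#include <cassert>
#include <cstdint>

// Flag names from the commit message; bit values are ASSUMED for this sketch.
constexpr std::uint8_t HP_ROW_HAS_CONT      = 0x02;
constexpr std::uint8_t HP_ROW_IS_CONT       = 0x04;
constexpr std::uint8_t HP_ROW_CONT_ZEROCOPY = 0x08;
constexpr std::uint8_t HP_ROW_SINGLE_REC    = 0x10;
constexpr std::uint8_t HP_ROW_MULTIPLE_REC  = 0x20;

enum class BlobFormat { CaseA_SingleRec, CaseB_ZeroCopyRun, CaseC_MultiRun };

// Hypothetical predicate: classify a continuation chain by its flag bits.
inline BlobFormat blob_format_from_flags(std::uint8_t flags)
{
  if (flags & HP_ROW_SINGLE_REC)
    return BlobFormat::CaseA_SingleRec;    // data at offset 0, no header
  if (flags & HP_ROW_CONT_ZEROCOPY)
    return BlobFormat::CaseB_ZeroCopyRun;  // one run, contiguous payload
  assert(flags & HP_ROW_MULTIPLE_REC);     // only remaining format
  return BlobFormat::CaseC_MultiRun;       // runs linked via next_cont
}

// Only Case C has to be copied into a reassembly buffer.
inline bool needs_reassembly(std::uint8_t flags)
{
  return blob_format_from_flags(flags) == BlobFormat::CaseC_MultiRun;
}
```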

Run allocation uses a two-phase strategy: (1) peek-then-unlink walk
of the free list detecting contiguous groups, (2) tail allocation
from HP_BLOCK for remaining data.  If tail allocation fails, a
third-step scavenge fallback walks the entire free list.

HP_SHARE::total_records tracks all physical records (primary +
continuation), while HP_SHARE::records remains the logical count
used by hash bucket mapping.

Reassembly buffer (HP_INFO::blob_buff) follows the same pattern as
InnoDB's blob_heap -- allocated once, grown via my_realloc, freed
on heap_reset()/close.  Zero-copy cases (A/B) return pointers
directly into HP_BLOCK with no copy.
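
The allocate-once, grow-on-demand pattern for the reassembly buffer can be sketched like this. The struct and its members are hypothetical, and plain `std::realloc`/`std::free` stand in for the server's `my_realloc`/`my_free`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the reassembly buffer kept per open handle.
struct ReassemblyBuffer
{
  unsigned char *data = nullptr;
  std::size_t capacity = 0;

  // Return a buffer of at least `needed` bytes; never shrinks.
  // Returns nullptr on allocation failure (the old buffer stays valid).
  unsigned char *reserve(std::size_t needed)
  {
    if (needed <= capacity)
      return data;                        // reuse the existing allocation
    void *grown = std::realloc(data, needed);
    if (!grown)
      return nullptr;
    data = static_cast<unsigned char *>(grown);
    capacity = needed;
    return data;
  }

  // Counterpart of the free on reset/close.
  void reset()
  {
    std::free(data);
    data = nullptr;
    capacity = 0;
  }
};
```

Only Case C reads would call `reserve()`; the zero-copy cases never touch the buffer.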

Full HASH index key handling for BLOB columns: hp_rec_hashnr(),
hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr() are
extended for HA_BLOB_PART segments.  Hash pre-check optimization
skips expensive blob materialization when hashes differ.  PAD SPACE
collation semantics are preserved for blob key comparisons.
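
The hash pre-check can be illustrated with a small sketch. All names here are hypothetical, a counter stands in for the cost of materializing a blob, and a plain byte-wise comparison replaces the collation-aware comparison the engine performs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical row: a stored hash plus the blob payload it was computed from.
struct StoredRow
{
  std::uint64_t hash;
  std::string blob_payload;
};

struct LookupStats
{
  int materializations = 0;  // counts expensive blob reads
};

// Compare hashes first; only materialize the blob when they are equal.
bool row_matches_key(const StoredRow &row, std::uint64_t key_hash,
                     const std::string &key_value, LookupStats &stats)
{
  if (row.hash != key_hash)
    return false;            // cheap reject, no blob access
  ++stats.materializations;  // stands in for reassembling the blob
  return row.blob_payload == key_value;
}
```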

Field_blob_key (Monty) produces HEAP-native key format (4-byte length
+ 8-byte data pointer) directly, eliminating key buffer translation
between the SQL layer and HEAP engine.
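
A sketch of the 12-byte key layout (4-byte length followed by an 8-byte data pointer). The helper names are hypothetical, and storing the length in host byte order is an assumption; the patch may use a different encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t BLOB_KEY_LENGTH_BYTES  = 4;
constexpr std::size_t BLOB_KEY_POINTER_BYTES = 8;
constexpr std::size_t BLOB_KEY_SIZE =
    BLOB_KEY_LENGTH_BYTES + BLOB_KEY_POINTER_BYTES;  // 12 bytes

// Hypothetical helper: write [4B length][8B pointer] into `key`.
void pack_blob_key(unsigned char *key, const unsigned char *data,
                   std::uint32_t length)
{
  std::uint64_t ptr_bits = reinterpret_cast<std::uintptr_t>(data);
  std::memcpy(key, &length, BLOB_KEY_LENGTH_BYTES);
  std::memcpy(key + BLOB_KEY_LENGTH_BYTES, &ptr_bits, BLOB_KEY_POINTER_BYTES);
}

// Hypothetical helper: read the length and data pointer back out.
void unpack_blob_key(const unsigned char *key, const unsigned char **data,
                     std::uint32_t *length)
{
  std::uint64_t ptr_bits;
  std::memcpy(length, key, BLOB_KEY_LENGTH_BYTES);
  std::memcpy(&ptr_bits, key + BLOB_KEY_LENGTH_BYTES, BLOB_KEY_POINTER_BYTES);
  *data = reinterpret_cast<const unsigned char *>(
      static_cast<std::uintptr_t>(ptr_bits));
}
```

The key carries no blob bytes at all, so producing it costs the same regardless of blob size.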

SQL layer changes
-----------------

choose_engine(): removed blob_fields check, added reclength >
HA_MAX_REC_LENGTH.

finalize(): HEAP+blob uses fixed-width rows; GROUP BY key setup sets
key_part_flag from field, uses item max_length for blob key sizing.
store_length initialized for all GROUP BY key parts.  DISTINCT key
setup skips null-bits helper for HEAP.

remove_duplicates(): blob check moved before HEAP check to fall
through to remove_dup_with_compare().

Aggregator_distinct::add(): overflow-to-disk conversion via
create_internal_tmp_table_from_heap() for non-dup write errors.

Expression cache disabled for HEAP+blob (key format incompatibility).

FULLTEXT early detection in mysql_derived_prepare(): forces disk
engine via TMP_TABLE_FORCE_MYISAM when outer query uses MATCH.

Deferred blob chain free (MDEV-39732): heap_delete() saves chain
pointers to pending_blob_chains, flushed on next mutation or
heap_reset()/close.  Prevents dangling zero-copy pointers during
binlog_log_row().
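
The deferred-free idea can be sketched as below. This is a toy model under stated assumptions: chains are integer ids, `HeapTableSketch` and its methods are hypothetical, and only the name `pending_blob_chains` comes from the text.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model: a chain id is "readable" until it is really freed.
struct HeapTableSketch
{
  std::set<int> live_chains;
  std::vector<int> pending_blob_chains;  // deleted, but not yet freed

  // Really free everything that an earlier delete deferred.
  void flush_pending()
  {
    for (int id : pending_blob_chains)
      live_chains.erase(id);
    pending_blob_chains.clear();
  }

  void write_row(int chain_id)
  {
    flush_pending();                     // any mutation flushes first
    live_chains.insert(chain_id);
  }

  void delete_row(int chain_id)
  {
    flush_pending();
    pending_blob_chains.push_back(chain_id);  // defer, do not free yet
  }

  void reset() { flush_pending(); }

  bool chain_is_readable(int chain_id) const
  {
    return live_chains.count(chain_id) != 0;
  }
};
```

In this model a zero-copy pointer taken before the delete stays valid until the next write, delete or reset, which is the window a row-logging step would need.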

REPLACE safety (MDEV-39825): HP_SHARE::write_can_replace flag
forces copy mode in hp_read_blobs(), preventing blob data corruption
from freed-then-reused continuation records during REPLACE.

Geometry GROUP_CONCAT fix (MDEV-39761): downgrade Field_geom to
Field_blob for GROUP_CONCAT temp tables in both expression creation
paths.  Type_handler_geometry::type_handler_for_tmp_table() added.

Geometry GROUP BY key fix (MDEV-39871): detect when new_key_field()
produced non-blob Field_varstring for a blob column, replace with
Field_blob_key.

Performance
-----------

Non-blob tables: zero regression.  Every blob-specific code path is
guarded by if (share->blob_count).  No new allocations, no format
changes, no hash function changes for non-blob keys.

Blob tables: eliminates file creation/deletion overhead and page cache
management.  For single-run blobs (common case), the read path is
entirely zero-copy.

Limitations
-----------

1. No BTREE indexes on blob columns (HASH only)
2. No partial-key prefix indexing for blobs
3. 2x memory for Case C reads only (A/B are 1x)
4. No blob compression
5. 65,535 records per run (uint16 cap, auto-split)
6. max_heap_table_size applies to continuation records
7. Expression cache disabled for HEAP+blob
8. FULLTEXT forces disk engine
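
Limitation 5 implies that a long chain is split into several runs. A minimal sketch of that split, assuming the cap is the only constraint (in the real allocator, free-list contiguity also decides run boundaries); `split_into_runs()` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cap from the limitations list: run_rec_count is a uint16.
constexpr std::size_t MAX_RUN_RECORDS = 65535;

// Hypothetical helper: split a record count into run lengths that fit uint16.
std::vector<std::uint16_t> split_into_runs(std::size_t total_records)
{
  std::vector<std::uint16_t> runs;
  while (total_records > 0)
  {
    std::size_t n =
        total_records < MAX_RUN_RECORDS ? total_records : MAX_RUN_RECORDS;
    runs.push_back(static_cast<std::uint16_t>(n));
    total_records -= n;
  }
  return runs;
}
```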

Linked bugs fixed:

- MDEV-39703: mroonga fulltext test ordering
- MDEV-39723: ER_DUP_ENTRY on GROUP BY with blob column
- MDEV-39724: crash in hp_is_single_rec with GROUP BY
- MDEV-39732: slave crash in hp_free_run_chain on blob replication
- MDEV-39761: Field_geom::store() assertion in GROUP_CONCAT
- MDEV-39782: RBR ER_KEY_NOT_FOUND on HEAP blob UPDATE
- MDEV-39825: blob data corruption on REPLACE into HEAP table
- MDEV-39871: crash in my_hash_sort_bin on GROUP BY with geometry

Reviewed by: Michael Widenius <monty@mariadb.org>
  Monty reviewed the entire patch. Areas where he suggested changes
  or contributed code:
  - Field_blob_key class (HEAP-native blob key format, 4-byte length +
    data pointer)
  - Duplicate key fix on HEAP-to-Aria conversion
  - hp_blob_key_length() uint32 fix
  - hp_rec_hashnr_stored removal
  - type_handler_for_tmp_table() param cleanup
  - Type_handler_geometry::type_handler_for_tmp_table() virtual
  - blob pointer bzero()
  - find_unique_row() double-materialization fix
  - Tail reclaim review
  - Batch tail allocation review
  - hp_update.c cleanup
  - Field_blob_compressed temp table fix
  - row_pack_length() dedup
  - pack_length_no_ptr() removal
  - Race condition fix in HEAP
  - MDEV-39703 mroonga test fix
  - MDEV-39825 write_can_replace optimization
  - Documentation (Docs/internal-temporary-tables.txt)
Contribution by: Alexander Barkov <bar@mariadb.com>
  Type_handler::make_and_init_table_field_ex() -- refactored temp table
  field creation from inline code in sql_select.cc into type handler
  virtual methods (sql_type.cc, sql_type_geom.cc), enabling clean
  per-type-handler field creation for HEAP blob promotion.
@arcivanov arcivanov force-pushed the MDEV-38975-Phase1-varchar-blob-promotion-main branch from 751c18c to 738c2af Compare June 12, 2026 15:27
Six code fixes and test re-recordings for issues found by CI on PR MariaDB#5222.

**NULL dereference in `create_tmp_field()`**: `SYS_REFCURSOR` plugin
returns NULL from `make_new_field()` (cursor values cannot be
materialized). The feature added `result->flags |= FIELD_PART_OF_TMP_UNIQUE`
without a NULL check. Added `if (result)` guard.

**xmltype identity loss and recursive CTE reclength mismatch in
`Item_type_holder::create_tmp_field_ex()`**: the blob_key dispatch now
requires both: (1) `type_handler_for_tmp_table()` returns
`blob_key_type_handler()`, AND (2) `dynamic_cast<Type_handler_blob_common*>`
confirms the original type is a native blob. Condition 1 excludes xmltype
(its override returns itself). Condition 2 excludes VARCHAR types promoted
via `varstring_type_handler()` -> `too_big_for_varchar()` ->
`blob_type_handler()`. Without condition 2, wide VARCHAR in recursive CTEs
(e.g. `cast('...' as varchar(1000))`) was promoted to `Field_blob_key` in
the main UNION DISTINCT table (`part_of_unique_key=true`) but stayed as
`Field_varchar` in the incremental table (`part_of_unique_key=false`),
causing a `reclength` mismatch assertion in
`select_union_recursive::send_data()` (`main.json_equals` crash).

**Spurious `reclength > HA_MAX_REC_LENGTH` in `pick_engine()`**: the
original `choose_engine()` (both 10.11 and upstream/main) never had a
reclength check. MDEV-38975 introduced it when replacing the
`blob_fields` condition. HEAP has no internal reclength limit --
`hp_create.c` stores `uint reclength` and allocates blocks of that size;
`max_supported_record_length()` is only checked in `unireg.cc` during
user-facing CREATE TABLE. I_S tables like SLAVE_STATUS routinely have
reclength ~880KB (13 bare `Varchar()` columns). The check forced them to
Aria where `fill_slave_status()` returned 0 rows. Removed the check and
the unused `reclength` parameter from `pick_engine()`.

**Multi-update `tmp_memory_table_size` override**: the 10.11 feature
overrode `big_tables=FALSE` for multi-update dedup tables. The forward-port
translated this as `tmp_memory_table_size=SIZE_T_MAX` when the variable
was 0. But `big_tables=FALSE` was a soft "don't force disk" hint, while
`tmp_memory_table_size=SIZE_T_MAX` overrides the user's explicit
`tmp_memory_table_size=0` directive. Since main removed `big_tables`
entirely (MDEV-19713), the override is not needed. Removed.

**Zero-length key rejection in `check_tmp_key()`**: defense-in-depth
guard rejecting `key_len == 0` to prevent useless zero-length keys from
being created by `add_tmp_key()`.

**Non-deterministic `column_compression` test**: HEAP blob support allows
compressed VARCHAR/TEXT temp tables to stay in HEAP instead of falling to
Aria, changing row iteration order. Added `--sorted_result` to the two
MDEV-24726 subqueries that lack `ORDER BY`.

Test changes:
- `spatial_utility_function_collect`: added ORDER BY to window function
  that lacked it (results were engine-row-order-dependent)
- `tmp_space_usage`: removed multi-update override; forced disk for
  MDEV-34016/34060 Aria-specific test sections (blob I_S tables now
  stay in MEMORY)
- `blob_update_overflow`: replaced `SHOW STATUS LIKE 'Created_tmp_%'`
  with targeted I_S query (Created_tmp_files varies on sanitizer builds)
- Re-recorded 8 tests for expected "temp table stays in MEMORY" changes
- `column_compression`: added `--sorted_result` for MDEV-24726 queries
Use `Field_blob_key` as the single unified mechanism for ALL blob
columns in HEAP temp tables, replacing Phase 1's dual approach of
`Field_blob_key` for GROUP BY/DISTINCT + `rebuild_blob_key_from_segments`
for derived table ref access.

Key changes:
- `Tmp_field_param::is_heap_engine()` gates `Field_blob_key` creation
  for all blob fields in HEAP temp tables (not just `part_of_unique_key`)
- `varstring_type_handler()` promotes VARCHAR > `HEAP_CONVERT_IF_BIGGER_TO_BLOB`
  to blob via `blob_type_handler()` → `type_handler_blob_key`
- `Type_handler_blob_common::type_handler_for_tmp_table()` returns
  `blob_key_type_handler()` when `is_heap_engine()`
- `Item_field::create_tmp_field_from_item_field()` redirects HEAP blob
  fields through the type handler system
- `Item_type_holder::create_tmp_field_ex()` extended for UNION/CTE blobs

Fix three latent bugs in `Field_blob_key` exposed by ref access:

1. `Field_blob_key::key_cmp()` treated key bytes at offset 4 as inline
   data, but the key format is `[4B length][8B pointer_to_data]` —
   comparison was against raw pointer bytes instead of actual data.

2. `cmp_buffer_with_ref()` eq_ref cache compared raw key buffer bytes.
   When `Field_blob_key::value` buffer is reused across lookups, the
   `[4B length][8B pointer]` bytes don't change even when the pointed-to
   data differs, causing stale result reuse. Disable the cache for all
   HEAP blob key parts (remove the `length == 0` guard).

3. `Field_blob_key::new_key_field()` returns `Field_blob_key` (unlike
   `Field_blob::new_key_field()` which returns `Field_varstring`). The
   `store_key` mechanism stores into `to_field->value` String, which
   leaks because `store_key` is `Sql_alloc` with no destructor. Add
   `store_key::cleanup()` called from `JOIN_TAB::cleanup()` and
   `subselect_uniquesubquery_engine::cleanup()`.
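
Bugs 1 and 2 both come from treating the 12 key bytes as if they were the data. A minimal sketch, with hypothetical helper names, host byte order assumed for the length, and a byte-wise comparison in place of the collation-aware one:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t BLOB_KEY_SIZE = 12;  // [4B length][8B pointer]

// Hypothetical helper: build a key that points at `data`.
void pack_key(unsigned char *key, const unsigned char *data,
              std::uint32_t length)
{
  std::uint64_t ptr_bits = reinterpret_cast<std::uintptr_t>(data);
  std::memcpy(key, &length, 4);
  std::memcpy(key + 4, &ptr_bits, 8);
}

// Wrong for this format: compares the length and the pointer bytes.
bool keys_equal_raw(const unsigned char *a, const unsigned char *b)
{
  return std::memcmp(a, b, BLOB_KEY_SIZE) == 0;
}

// Follows the pointers and compares the bytes they refer to.
bool keys_equal_deref(const unsigned char *a, const unsigned char *b)
{
  std::uint32_t la, lb;
  std::uint64_t pa, pb;
  std::memcpy(&la, a, 4);
  std::memcpy(&lb, b, 4);
  if (la != lb)
    return false;
  std::memcpy(&pa, a + 4, 8);
  std::memcpy(&pb, b + 4, 8);
  const void *da = reinterpret_cast<const void *>(static_cast<std::uintptr_t>(pa));
  const void *db = reinterpret_cast<const void *>(static_cast<std::uintptr_t>(pb));
  return std::memcmp(da, db, la) == 0;
}
```

Two keys pointing at equal data in different buffers differ under the raw comparison (bug 1). A key rebuilt over a reused buffer is byte-identical to the saved one even after the data changed (bug 2), which is why a cache keyed on the raw bytes cannot be trusted here.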

`Field_blob_key::key_part_length_bytes()` changed from 4 to 0 so
`store_length = key_length (12) + null_byte + 0`, matching HEAP's
`seg->length (12) + null` for correct multi-part key alignment.
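
The arithmetic behind this change, as a small sketch. Both functions are hypothetical and only restate the formula above; the single null byte for a nullable key part is taken from the text.

```cpp
#include <cassert>

// SQL-layer view: key bytes + optional null byte + length-prefix bytes.
constexpr unsigned store_length(unsigned key_length, bool nullable,
                                unsigned key_part_length_bytes)
{
  return key_length + (nullable ? 1u : 0u) + key_part_length_bytes;
}

// HEAP view: segment length + optional null byte.
constexpr unsigned heap_segment_length(unsigned seg_length, bool nullable)
{
  return seg_length + (nullable ? 1u : 0u);
}

// With 0 length bytes both layers agree on 13 bytes per nullable blob part.
static_assert(store_length(12, true, 0) == heap_segment_length(12, true),
              "layers agree when key_part_length_bytes is 0");
// With the old value of 4 the SQL layer would be 4 bytes ahead per part.
static_assert(store_length(12, true, 4) != heap_segment_length(12, true),
              "layers disagree when key_part_length_bytes is 4");
```

In a multi-part key the 4-byte surplus would accumulate per blob part, shifting every following part.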

`hp_key_cmp()` blob packlength changed from hardcoded 4 to
`seg->bit_start` (actual field packlength) for TEXT (packlength=2).

Re-record GROUP_CONCAT-related results: the `Tmp_field_param` threading
through `tmp_table_field_from_field_type()` (base branch) closed a
plumbing gap where `Item_sum` and literal items dropped the param, so
they now reach the HEAP promotion gates like all other expression
items. `GROUP_CONCAT` results in HEAP temp tables become
`Field_blob_key` (longtext metadata, 12-byte blob key parts),
consistent with the `Item_func` path that was already recorded
(e.g. `substring()` sj-materialization keys).
@arcivanov arcivanov force-pushed the MDEV-38975-Phase1-varchar-blob-promotion-main branch from 738c2af to c6f7626 Compare June 12, 2026 16:07


3 participants