perf: phased codecpipeline#3885

Open
d-v-b wants to merge 67 commits into
zarr-developers:mainfrom
d-v-b:perf/prepared-write-v2

@d-v-b d-v-b commented Apr 8, 2026

Contributor

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

  1. preparatory IO, async
    fetch exactly what we need to fetch from storage, given the codecs we have. So if there's a sharding codec in the first array->bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
  2. pure compute. sync. Apply filters and compressors. safe to parallelize over chunks.
  3. (if writing): final IO, async. reconcile the in-memory compressed chunks against our model of the stored chunk. Write out bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
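
A minimal sketch of the three phases described above, assuming a plain dict as the store and zlib as the only codec. Every name here (`STORE`, `fetch`, `decode`, `read_chunks`, `write_chunks`) is hypothetical and is not the API in this PR.

```python
from __future__ import annotations

import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a store: key -> compressed chunk bytes.
STORE: dict[str, bytes] = {f"c/{i}": zlib.compress(bytes([i]) * 8) for i in range(4)}


async def fetch(key: str) -> bytes | None:
    """Phase 1: preparatory IO (async)."""
    await asyncio.sleep(0)  # stands in for real storage latency
    return STORE.get(key)


def decode(raw: bytes | None) -> bytes | None:
    """Phase 2: pure compute (sync), safe to run per chunk in a thread pool."""
    return None if raw is None else zlib.decompress(raw)


async def read_chunks(keys: list[str]) -> list[bytes | None]:
    raw = await asyncio.gather(*(fetch(k) for k in keys))
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(decode, raw))


async def write_chunks(chunks: dict[str, bytes]) -> None:
    # Phase 2 (encode) runs first, then phase 3: final IO (async).
    with ThreadPoolExecutor(max_workers=2) as pool:
        encoded = list(pool.map(zlib.compress, chunks.values()))
    for key, blob in zip(chunks, encoded):
        await asyncio.sleep(0)
        STORE[key] = blob


decoded = asyncio.run(read_chunks(["c/0", "c/3", "c/missing"]))
```

A missing key comes back as `None`, which the compute phase passes through untouched. The sketch blocks the event loop while the pool runs, which a real pipeline would likely want to avoid.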

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than the BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by Claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

Another edit: the big pitch of this PR -- separating IO from compute -- didn't end up being valuable, because we ran into a large amount of overhead due to indexing / Python object creation. It turns out the vast majority of the speed benefits can be had simply by avoiding async for storage backends that don't need it. See #3885 (comment) for a detailed summary of the current state of things here.

d-v-b added 4 commits April 7, 2026 10:38
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking`
is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
@github-actions github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 8, 2026
codecov Bot commented Apr 8, 2026


Codecov Report

❌ Patch coverage is 88.54296% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.21%. Comparing base (96a62b5) to head (bda7c9f).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/codec_pipeline.py | 83.79% | 59 Missing ⚠️ |
| src/zarr/codecs/sharding.py | 95.37% | 16 Missing ⚠️ |
| src/zarr/abc/store.py | 75.67% | 9 Missing ⚠️ |
| src/zarr/codecs/numcodecs/_codecs.py | 66.66% | 8 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.53%   93.21%   -0.33%     
==========================================
  Files          88       88              
  Lines       11917    12561     +644     
==========================================
+ Hits        11147    11709     +562     
- Misses        770      852      +82     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/codecs/_v2.py | 94.11% <100.00%> (+0.50%) ⬆️ |
| src/zarr/core/array.py | 97.89% <100.00%> (+0.01%) ⬆️ |
| src/zarr/core/config.py | 100.00% <ø> (ø) |
| src/zarr/storage/_fsspec.py | 91.32% <ø> (ø) |
| src/zarr/testing/buffer.py | 93.18% <100.00%> (-6.82%) ⬇️ |
| src/zarr/codecs/numcodecs/_codecs.py | 93.18% <66.66%> (-3.21%) ⬇️ |
| src/zarr/abc/store.py | 92.68% <75.67%> (-3.75%) ⬇️ |
| src/zarr/codecs/sharding.py | 93.54% <95.37%> (+2.02%) ⬆️ |
| src/zarr/core/codec_pipeline.py | 86.91% <83.79%> (-5.26%) ⬇️ |

... and 5 files with indirect coverage changes

d-v-b commented Apr 9, 2026

Contributor Author

@TomAugspurger how would this design work with CUDA codecs?

@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 Compare April 15, 2026 09:51
@github-actions github-actions Bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch 2 times, most recently from a84a15a to 68a7cdc Compare April 17, 2026 10:41
Comment thread src/zarr/core/codec_pipeline.py Outdated
Comment on lines +943 to +962
# Phase 1: fetch all chunks (IO, sequential)
raw_buffers: list[Buffer | None] = [
    bg.get_sync(prototype=cs.prototype)  # type: ignore[attr-defined]
    for bg, cs, *_ in batch
]

# Phase 2: decode (compute, optionally threaded)
def _decode_one(raw: Buffer | None, chunk_spec: ArraySpec) -> NDBuffer | None:
    if raw is None:
        return None
    return transform.decode_chunk(raw, chunk_spec)

specs = [cs for _, cs, *_ in batch]
if n_workers > 0 and len(batch) > 1:
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        decoded_list = list(pool.map(_decode_one, raw_buffers, specs))
else:
    decoded_list = [
        _decode_one(raw, spec) for raw, spec in zip(raw_buffers, specs, strict=True)
    ]

@ilan-gold ilan-gold Apr 17, 2026

Contributor

Why isn't this all multi-threaded i.e., the I/O as well?

Contributor Author

I should benchmark this, but my expectation was that IO against memory storage and local storage is not compute-limited, so threads wouldn't remove a real bottleneck. For memory storage I'm sure this is true; I'm not sure about local storage, though.

d-v-b and others added 6 commits April 17, 2026 22:51
Adds a SupportsSetRange protocol to zarr.abc.store for stores that
allow overwriting a byte range within an existing value. Implementations
are added for LocalStore (using file-handle seek+write) and MemoryStore
(in-memory bytearray slice assignment).

This is the prerequisite for the partial-shard write fast path in
ShardingCodec, which can patch individual inner-chunk slots without
rewriting the entire shard blob when the inner codec chain is fixed-size.
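
A minimal sketch of the byte-range-write idea in this commit message, assuming a protocol with a single `set_range_sync(key, value, start)` method. The real `SupportsSetRange` signature may differ, and `SupportsSetRangeSketch`, `MemorySketch`, and `LocalSketch` are hypothetical stand-ins, not the zarr store classes.

```python
from __future__ import annotations

import os
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRangeSketch(Protocol):
    """Hypothetical shape; the protocol in the PR may look different."""

    def set_range_sync(self, key: str, value: bytes, start: int) -> None: ...


class MemorySketch:
    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def set_sync(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        buf = self._data[key]
        if start < 0 or start + len(value) > len(buf):
            raise ValueError("range must lie inside the existing value")
        buf[start : start + len(value)] = value  # in-place slice assignment


class LocalSketch:
    def __init__(self, root: str) -> None:
        self._root = root

    def set_sync(self, key: str, value: bytes) -> None:
        with open(os.path.join(self._root, key), "wb") as f:
            f.write(value)

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        # "r+b" opens for update without truncating the existing file.
        with open(os.path.join(self._root, key), "r+b") as f:
            f.seek(start)
            f.write(value)


mem = MemorySketch()
mem.set_sync("shard", b"aaaabbbbcccc")
mem.set_range_sync("shard", b"XXXX", start=4)
```

Only the four bytes in the middle change; the rest of the value is left as it was.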

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2Codec, BytesCodec, BloscCodec, etc. previously only implemented the
async _decode_single / _encode_single methods. Add their sync
counterparts (_decode_sync / _encode_sync) so that the upcoming
SyncCodecPipeline can dispatch through them without spinning up an
event loop.

For codecs that wrap external compressors (numcodecs.Zstd, numcodecs.Blosc,
the V2 fallback chain), the sync versions just call the underlying
compressor's blocking API directly instead of routing through
asyncio.to_thread.
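
A minimal sketch of the sync/async pairing described here, using zlib from the standard library in place of numcodecs. `ZlibCodecSketch` is hypothetical; only the four method names are taken from the commit message.

```python
import asyncio
import zlib


class ZlibCodecSketch:
    """Hypothetical codec exposing both entry points; not a zarr class."""

    level = 1

    def _encode_sync(self, data: bytes) -> bytes:
        # Blocking call straight into the compressor, no event loop involved.
        return zlib.compress(data, self.level)

    def _decode_sync(self, data: bytes) -> bytes:
        return zlib.decompress(data)

    async def _encode_single(self, data: bytes) -> bytes:
        # Async entry point: same work, routed through a worker thread.
        return await asyncio.to_thread(self._encode_sync, data)

    async def _decode_single(self, data: bytes) -> bytes:
        return await asyncio.to_thread(self._decode_sync, data)


codec = ZlibCodecSketch()
payload = b"zarr" * 256
sync_roundtrip = codec._decode_sync(codec._encode_sync(payload))
async_encoded = asyncio.run(codec._encode_single(payload))
async_roundtrip = asyncio.run(codec._decode_single(async_encoded))
```

Both paths produce the same bytes; the sync path simply skips the coroutine and thread hand-off.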

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arallelism

Adds SyncCodecPipeline alongside BatchedCodecPipeline. The new pipeline
runs codecs through their sync entry points (_decode_sync / _encode_sync)
and dispatches per-chunk work to a module-level thread pool sized by
the codec_pipeline.max_workers config (default = os.cpu_count()).

Each chunk's full lifecycle (fetch + decode + scatter for reads;
get-existing + merge + encode + set/delete for writes) runs as one
pool task — overlapping IO of one chunk with compute of another.
Scatter into the shared output buffer is thread-safe because chunks
have non-overlapping output selections.

The async wrappers (read/write) detect SupportsGetSync/SupportsSetSync
stores and dispatch to the sync fast path, passing the configured
max_workers. Other stores fall through to the async path, which still
uses concurrent_map (asyncio-based), bounded by async.concurrency.
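
A minimal sketch of that dispatch, assuming the capability check is a `runtime_checkable` protocol test on the store. `SupportsGetSyncSketch`, `DictStore`, `AsyncOnlyStore`, and `read` are hypothetical names for illustration.

```python
from __future__ import annotations

import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsGetSyncSketch(Protocol):
    def get_sync(self, key: str) -> bytes | None: ...


class DictStore:
    """Store with both async and sync getters."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = dict(data)

    async def get(self, key: str) -> bytes | None:
        return self._data.get(key)

    def get_sync(self, key: str) -> bytes | None:
        return self._data.get(key)


class AsyncOnlyStore:
    """Store that can only be reached through the event loop."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = dict(data)

    async def get(self, key: str) -> bytes | None:
        return self._data.get(key)


async def read(store, keys: list[str]) -> tuple[str, list[bytes | None]]:
    if isinstance(store, SupportsGetSyncSketch):
        # Sync fast path: no per-chunk coroutine scheduling.
        return "sync", [store.get_sync(k) for k in keys]
    # Fallback: one coroutine per key.
    return "async", list(await asyncio.gather(*(store.get(k) for k in keys)))
```

The async wrapper stays the public entry point in both cases, so callers do not need to know which path was taken.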

Notes on perf:
- Default (None → cpu_count) is tuned for chunks ≥ ~512 KB.
- Small chunks (≤ 64 KB) regress 1.5-3x because pool dispatch overhead
  (~30-50 µs/task) dominates per-chunk work. Workaround:
  zarr.config.set({"codec_pipeline.max_workers": 1}).
- For large chunks on local/memory stores, IO+compute parallelism
  yields 1.7-2.5x over BatchedCodecPipeline on direct-API reads and
  ~2.5x on roundtrip.

ChunkTransform encapsulates the sync codec chain. It caches resolved
ArraySpecs across calls with the same chunk_spec — combined with the
constant-ArraySpec optimization in indexing, hot-path overhead is
minimized.

Includes test scaffolding for the new pipeline (test_sync_codec_pipeline)
and config plumbing for the max_workers key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds _encode_partial_sync and _decode_partial_sync to ShardingCodec.
For fixed-size inner codec chains and stores that implement
SupportsSetRange, partial writes patch individual inner-chunk slots
in-place instead of rewriting the whole shard:

  - Reads existing shard index (one byte-range get).
  - For each affected inner chunk: decodes the slot, merges the new
    region, re-encodes.
  - Writes each modified slot at its deterministic byte offset, then
    rewrites just the index.

For variable-size inner codecs (e.g. with compression) or stores that
don't support byte-range writes, falls through to a full-shard rewrite
matching BatchedCodecPipeline semantics.
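
A minimal sketch of why a fixed-size inner codec chain makes the slot offset deterministic. It assumes inner chunks are stored back to back in index order with the data starting at byte 0, which is an assumption for illustration and not necessarily the layout the codec writes; `SLOT`, `slot_offset`, and `patch_slot` are hypothetical.

```python
import struct

# Hypothetical fixed-size inner chunk: four little-endian int32 values.
SLOT = struct.Struct("<4i")


def slot_offset(chunk_index: int, slot_nbytes: int, data_start: int = 0) -> int:
    """Byte offset of an inner chunk when every encoded chunk has the same size."""
    return data_start + chunk_index * slot_nbytes


def patch_slot(shard: bytearray, chunk_index: int, position: int, value: int) -> None:
    off = slot_offset(chunk_index, SLOT.size)
    values = list(SLOT.unpack_from(shard, off))  # decode the slot
    values[position] = value  # merge the new region
    SLOT.pack_into(shard, off, *values)  # re-encode in place


shard = bytearray(SLOT.size * 3)  # three zero-filled inner chunks
patch_slot(shard, chunk_index=1, position=2, value=7)
```

With a compressor in the chain the encoded size varies per chunk, so no such offset formula exists and the whole shard has to be rewritten.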

The partial-decode path computes a ReadPlan from the shard index and
issues one byte-range get per overlapping chunk, decoding only what
the read selection touches.

Both paths are dispatched from SyncCodecPipeline via the existing
supports_partial_decode / supports_partial_encode protocol checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new test files:

  test_codec_invariants — asserts contract-level properties that every
  codec / shard / buffer combination must satisfy: round-trip exactness,
  prototype propagation, fill-value handling, all-empty shard handling.

  test_pipeline_parity — exhaustive matrix asserting that
  SyncCodecPipeline and BatchedCodecPipeline produce semantically
  identical results across codec configs, layouts (including
  nested sharding), write sequences, and write_empty_chunks settings.
  Three checks per cell:
    1. Same array contents on read.
    2. Same set of store keys after writes.
    3. Each pipeline reads the other's output identically (catches
       layout-divergence bugs).
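
A minimal sketch of the three parity checks, assuming two toy pipelines that write the same zlib format through different code paths. `PipelineA`, `PipelineB`, and `check_parity` are hypothetical and much simpler than the real test matrix.

```python
import zlib


class PipelineA:
    def write(self, store: dict, key: str, data: bytes) -> None:
        store[key] = zlib.compress(data, 1)

    def read(self, store: dict, key: str) -> bytes:
        return zlib.decompress(store[key])


class PipelineB:
    """Same on-disk format, different code path (streaming zlib objects)."""

    def write(self, store: dict, key: str, data: bytes) -> None:
        comp = zlib.compressobj(1)
        store[key] = comp.compress(data) + comp.flush()

    def read(self, store: dict, key: str) -> bytes:
        decomp = zlib.decompressobj()
        return decomp.decompress(store[key]) + decomp.flush()


def check_parity(a, b, writes: list) -> bool:
    store_a: dict = {}
    store_b: dict = {}
    for key, data in writes:
        a.write(store_a, key, data)
        b.write(store_b, key, data)
    # 1. Same contents on read.
    for key in store_a:
        assert a.read(store_a, key) == b.read(store_b, key)
    # 2. Same set of store keys after writes.
    assert store_a.keys() == store_b.keys()
    # 3. Each pipeline reads the other's output identically.
    for key in store_a:
        assert a.read(store_b, key) == b.read(store_a, key)
    return True
```

The third check is the one that would catch two pipelines that each round-trip their own output but disagree on layout.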

These tests pinned the design throughout the SyncCodecPipeline +
partial-shard development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/
so local IDE/agent planning artifacts don't get committed by accident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from aa111a2 to 1be5563 Compare April 17, 2026 21:04
Comment thread src/zarr/core/codec_pipeline.py Outdated
selected = decoded[chunk_selection]
if drop_axes:
    selected = selected.squeeze(axis=drop_axes)
out[out_selection] = selected

@ilan-gold ilan-gold Apr 18, 2026

Contributor

It might be worth experimenting with moving this setting operation `out[out_selection] = selected` outside the thread pool execution since, IIRC, it holds the GIL and is probably non-trivial time-wise.

Contributor

The memory usage will probably go up a bit though....
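
A minimal sketch of the two variants under discussion, using a `bytearray` as the output buffer and zlib for decoding rather than the real NDBuffer machinery. All names are hypothetical, and this says nothing about which variant is faster.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4
raw = [zlib.compress(bytes([i]) * CHUNK) for i in range(3)]
out_a = bytearray(CHUNK * len(raw))
out_b = bytearray(CHUNK * len(raw))


def decode_and_scatter(i: int) -> None:
    # Variant in the diff: each worker writes its own non-overlapping region.
    out_a[i * CHUNK : (i + 1) * CHUNK] = zlib.decompress(raw[i])


with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(decode_and_scatter, range(len(raw))))

# Suggested variant: workers only decode; the caller scatters afterwards.
with ThreadPoolExecutor(max_workers=2) as pool:
    decoded = list(pool.map(zlib.decompress, raw))  # every decoded chunk is held at once
for i, chunk in enumerate(decoded):
    out_b[i * CHUNK : (i + 1) * CHUNK] = chunk
```

The second variant keeps all decoded chunks alive until the scatter loop runs, which is the memory cost mentioned above.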

d-v-b and others added 8 commits June 5, 2026 18:43
…ross-file dup

The file named test_sync_codec_pipeline.py tested no pipeline -- it is the unit
test suite for ChunkTransform (the per-chunk synchronous codec chain that
FusedCodecPipeline uses internally). "sync codec pipeline" was an earlier name
for the Fused pipeline; the filename had outlived it. Renamed to
test_chunk_transform.py (git mv preserves history) and added a module docstring
naming what it actually covers.

Also removed test_sync_transform_encode_decode_roundtrip from
test_fused_pipeline.py: it was a weaker cross-file duplicate of this file's
test_encode_decode_roundtrip (which covers the same encode->decode->compare over
five codec chains rather than just bytes-only). Its one extra assertion -- that
evolve_from_array_spec populates _sync_transform -- is already covered by
test_evolve_from_array_spec in the Fused file.

test_codec_pipeline.py left as-is: all three tests are correctly placed and
cover things the Scenario suite can't (the low-level pipeline.read GetResult
API, a plain dict store, and the zarr-developers#3937 cast_value dtype-threading regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The byte-range-write machinery works, but the right store interface for it is
still undecided, so it is removed from this PR and will return once that lands.

Removed:
- SupportsSetRange protocol (abc/store.py) and its __all__ export.
- MemoryStore.set_range / set_range_sync / _set_range_impl and the
  SupportsSetRange base (storage/_memory.py).
- LocalStore.set_range / set_range_sync, the _put_range helper, and the
  SupportsSetRange base (storage/_local.py).
- The sharding codec's byte-range-write fast path in _encode_partial_sync;
  partial shard writes now always take the full-shard-rewrite path (identical
  to BatchedCodecPipeline, verified by the pipeline-parity suite). Also dropped
  the now-dead _chunk_byte_offset helper it relied on.
- changes/3907.feature.md (the byte-range-writes changelog note). The
  byte-range-READ changelog (3004) is unrelated and kept.

Byte-range READS (ByteRequest, get(byte_range=), get_ranges coalescing,
the read-side bulk shard decode) are untouched -- this only removes writes.

The known-good tests that exercise byte-range writes are commented out (not
deleted) in test_store/test_memory.py, test_store/test_local.py, and
test_fused_pipeline.py, to restore once the store design is settled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added module-level helper in array.py with zero callers — an ArraySpec-reuse
optimization that was never wired up. Plain function, no protocol role, safe to
drop. Verified: no references anywhere in src/ or tests/, and the full
array/sharding/pipeline suites stay green.

Note: ShardingCodec._encode_sync, though never *called*, is NOT dead — it is a
required member of the runtime_checkable SupportsSyncCodec protocol. Removing it
drops ShardingCodec from SupportsSyncCodec and breaks the sync read-fallback
routing (16 test failures), so it stays.
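
A minimal sketch of why an uncalled method can still be load-bearing: `isinstance` against a `runtime_checkable` protocol checks that every protocol member exists on the object. `SupportsSyncCodecSketch`, `FullCodec`, `DecodeOnlyCodec`, and `route` are hypothetical stand-ins for the real classes.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSyncCodecSketch(Protocol):
    def _decode_sync(self, data: bytes) -> bytes: ...

    def _encode_sync(self, data: bytes) -> bytes: ...


class FullCodec:
    def _decode_sync(self, data: bytes) -> bytes:
        return data

    def _encode_sync(self, data: bytes) -> bytes:
        return data


class DecodeOnlyCodec:
    """Same as FullCodec with the 'unused' method removed."""

    def _decode_sync(self, data: bytes) -> bytes:
        return data


def route(codec: object) -> str:
    # Routing depends on protocol membership, not on which methods get called.
    return "sync" if isinstance(codec, SupportsSyncCodecSketch) else "async-fallback"
```

Dropping `_encode_sync` silently changes the routing even for reads, which never call it.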

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The docstring claimed _encode_sync "iterates inner chunks in Morton order —
that's the canonical layout the shard index expects", which is wrong and a
latent footgun: it implies the method imposes a morton physical layout. It does
not. The morton iteration only populates an intermediate dict whose key order is
immaterial; the on-disk layout is decided downstream by the subchunk_write_order
loop in _encode_shard_dict_sync (same as the async _encode_single sibling).

Also clarified that this method IS reached — via nested sharding, where an inner
ShardingCodec is encoded through the outer codec's ChunkTransform. (It is not
called for top-level sharded writes, which route through _encode_partial_sync.)

Verified empirically: routing through nested _encode_sync, all three
subchunk_write_order values roundtrip correctly AND morton vs lexicographic
produce physically different bytes — i.e. the order is honored, not ignored.
Behavior unchanged; docstring only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added thin wrapper (`_load_shard_index_maybe(...) or _ShardIndex.create_empty(...)`)
with zero invocations anywhere in src/ or tests/. Unlike _encode_sync, this is
genuinely removable: confirmed it is NOT a member of any runtime_checkable
protocol or ABC (no reference in src/zarr/abc/, not a base-class override) and is
reached by no dynamic dispatch (no getattr / string reference). main has no
_load_shard_index* methods at all, so it was introduced and left unused by this
PR. The _maybe and _maybe_sync variants it wrapped remain and are used.

Verified: full sharding + nested-sharding + parity + pipeline suites stay green,
ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ring

The FusedCodecPipeline class docstring still described sharded writes as using
"byte-range writes via set_range_sync" — but byte-range-write support was removed
from this PR (set_range_sync / SupportsSetRange are gone). Sharded writes now take
the codec's synchronous full-shard-rewrite path. Docstring only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This branch's docstrings/comments had introduced RST-style ``double-backtick``
inline literals, which this project does not use (plain single backticks only —
no RST roles or double-backticks). Converted the 25 occurrences across the
sharding codec, codec_pipeline, and fsspec store docstrings/comments to single
backticks. Style only; no behavior change.

Also confirmed (via git blame, this-branch lines only) there are no remaining
references to removed/outdated designs: the byte-range-write (set_range) mentions
and the "separating IO from compute" framing were already corrected earlier in
this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… opt-in

Pairs with the FusedCodecPipeline default: keep the new pipeline, but do NOT
enable threading by default. `max_workers=None` (auto -> cpu_count) spawned a
thread pool on every read/write, which is a behavior change with real downstream
risk — it runs custom stores/codecs concurrently (thread-safety) and can
oversubscribe many-core nodes whose workloads already parallelize at the
dask/MPI layer. The default is now 1 (fully sequential: the pool is never
created when max_workers <= 1). Parallelism is opt-in via
`codec_pipeline.max_workers` (positive int, or None for auto).
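
A minimal sketch of the opt-in behaviour described here, assuming a resolver that maps `None` to the CPU count and a dispatch helper that never builds a pool for a single worker. `resolve_max_workers` and `map_chunks` are hypothetical; the PR's `_resolve_max_workers` may behave differently in detail.

```python
from __future__ import annotations

import os
import threading
from concurrent.futures import ThreadPoolExecutor


def resolve_max_workers(value: int | None) -> int:
    if value is None:
        return os.cpu_count() or 1  # "auto"
    if value < 1:
        raise ValueError("max_workers must be a positive int or None")
    return value


def map_chunks(fn, items: list, max_workers: int | None = 1) -> list:
    n = resolve_max_workers(max_workers)
    if n <= 1 or len(items) <= 1:
        # Fully sequential: no pool is created at all.
        return [fn(item) for item in items]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(fn, items))


def which_thread(_item: object) -> str:
    return threading.current_thread().name


sequential_threads = set(map_chunks(which_thread, [1, 2, 3]))
pooled_threads = set(map_chunks(which_thread, [1, 2, 3], max_workers=4))
```

With the default of 1, all work runs on the calling thread, so custom stores and codecs are never invoked concurrently unless the user asks for it.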

Updates _resolve_max_workers docstring and the config-defaults test accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-v-b d-v-b added the run-downstream Run the tests of downstream libraries (e.g., xarray) against zarr label Jun 6, 2026
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Jun 6, 2026
codspeed-hq Bot commented Jun 6, 2026


Merging this PR will improve performance by ×3.1

⚡ 31 improved benchmarks
✅ 11 untouched benchmarks
🆕 120 new benchmarks
⏩ 30 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | test_sharded_morton_indexing_large[(32, 32, 32)-memory] | 9,186.1 ms | 753.8 ms | ×12 |
| WallTime | test_sharded_morton_indexing_large[(30, 30, 30)-memory] | 7,531.3 ms | 620.9 ms | ×12 |
| WallTime | test_sharded_morton_indexing_large[(33, 33, 33)-memory] | 10,022.6 ms | 827.1 ms | ×12 |
| WallTime | test_sharded_morton_indexing[(32, 32, 32)-memory] | 1,148.4 ms | 95.5 ms | ×12 |
| WallTime | test_sharded_morton_indexing[(16, 16, 16)-memory] | 144.1 ms | 12.9 ms | ×11 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 409.8 ms | 58.8 ms | ×7 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory] | 209.5 ms | 40.2 ms | ×5.2 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory] | 405.2 ms | 80 ms | ×5.1 |
| WallTime | test_slice_indexing[None-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory] | 191.4 ms | 46.6 ms | ×4.1 |
| WallTime | test_slice_indexing[None-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory] | 350.7 ms | 85.4 ms | ×4.1 |
| WallTime | test_slice_indexing[None-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 354.5 ms | 88.7 ms | ×4 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory] | 6.6 ms | 2.4 ms | ×2.8 |
| WallTime | test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory] | 3.3 ms | 1.3 ms | ×2.6 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency] | 208.1 ms | 85.2 ms | ×2.4 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency] | 406 ms | 170.1 ms | ×2.4 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency] | 410.8 ms | 172.8 ms | ×2.4 |
| WallTime | test_slice_indexing[None-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency] | 404 ms | 179.7 ms | ×2.2 |
| WallTime | test_slice_indexing[None-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency] | 408.7 ms | 182.2 ms | ×2.2 |
| WallTime | test_slice_indexing[None-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency] | 220 ms | 98.1 ms | ×2.2 |
| WallTime | test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory] | 1,984.5 µs | 929.1 µs | ×2.1 |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Comparing d-v-b:perf/prepared-write-v2 (298295f) with main (6ce787d) [2]

Footnotes

  1. 30 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (b9d3964) during the generation of this report, so 6ce787d was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

…egression)

CodSpeed flagged test_sharded_morton_write_single_chunk regressing ~38-39%
(writing one 1x1x1 chunk into a 32^3 = 32768-chunk shard). Both main and this
branch do a full shard rewrite for a partial write, so the rewrite itself is not
the regression — and it is NOT the removed byte-range fast path (that path was
gated out here anyway: write_empty_chunks defaults to False -> skip_empty=True).

The cause: the sync _encode_partial_sync rebuilt the in-memory shard_dict with a
per-coordinate __getitem__ loop over all 32768 chunks (O(n_chunks) Python
overhead + try/except per chunk), whereas main's async _encode_partial_single
builds the same dict with a single vectorized index lookup via
_ShardReader.to_dict_vectorized. Switched the sync path to to_dict_vectorized
(a plain, non-async method; _shard_reader_from_bytes_sync already returns a
_ShardReader), matching the async path.

The dict's key order is immaterial (the physical layout is decided downstream by
the subchunk_write_order loop in _encode_shard_dict_sync), so the merge loop —
which looks up by coordinate, not order — is unaffected.

Local micro-benchmark (32^3 shard, single 1x1x1 chunk write): 59.4 -> 40.0
ms/write (~1.5x), matching the CodSpeed delta. Correctness: full sharding +
pipeline-parity suites pass (581), so Fused still matches Batched byte-for-byte.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-v-b d-v-b added benchmark Code will be benchmarked in a CI job. and removed benchmark Code will be benchmarked in a CI job. labels Jun 6, 2026
d-v-b and others added 2 commits June 6, 2026 23:05
The deps=optional CI job (where cast_value_rs is installed) failed
test_codec_pipeline_threads_dtype_through_evolve and several test_cast_value
tests with "Invalid endianness: None" / "endian needs to be specified for
multi-byte data types".

Root cause: FusedCodecPipeline.evolve_from_array_spec evolved EVERY codec against
the same original array_spec:

    evolved = tuple(c.evolve_from_array_spec(array_spec=array_spec) for c in self.codecs)

When an array->array codec widens the dtype (e.g. cast_value int8 -> int16), the
BytesCodec serializer was still evolved against the single-byte SOURCE dtype, so
it stripped its `endian` to None (bytes.py treats single-byte dtypes as having no
endianness) and then failed at decode time on the multi-byte data.
BatchedCodecPipeline.evolve_from_array_spec already threads the spec forward
(spec = evolved_codec.resolve_metadata(spec)); the Fused version did not. Fixed by
mirroring the Batched threading.
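
A minimal sketch of the bug and the fix, with toy codecs in place of the real ones. `Spec`, `Widen`, `BytesSketch`, and the two `evolve_*` functions are hypothetical; only the "evolve each codec, then resolve the spec forward" pattern is taken from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Spec:
    itemsize: int  # bytes per element


@dataclass(frozen=True)
class Widen:
    """Stand-in for an array->array codec that doubles the item size."""

    def evolve(self, spec: Spec) -> Widen:
        return self

    def resolve(self, spec: Spec) -> Spec:
        return Spec(itemsize=spec.itemsize * 2)


@dataclass(frozen=True)
class BytesSketch:
    """Stand-in serializer that drops `endian` for single-byte dtypes."""

    endian: str | None = "little"

    def evolve(self, spec: Spec) -> BytesSketch:
        return replace(self, endian=None) if spec.itemsize == 1 else self

    def resolve(self, spec: Spec) -> Spec:
        return spec


def evolve_unthreaded(codecs: tuple, spec: Spec) -> tuple:
    # Buggy shape: every codec sees the original spec.
    return tuple(c.evolve(spec) for c in codecs)


def evolve_threaded(codecs: tuple, spec: Spec) -> tuple:
    # Fixed shape: each codec sees the spec produced by the codecs before it.
    evolved = []
    for codec in codecs:
        new_codec = codec.evolve(spec)
        evolved.append(new_codec)
        spec = new_codec.resolve(spec)
    return tuple(evolved)


chain = (Widen(), BytesSketch())
buggy = evolve_unthreaded(chain, Spec(itemsize=1))
fixed = evolve_threaded(chain, Spec(itemsize=1))
```

In the unthreaded version the serializer sees the single-byte source dtype and loses its endianness; in the threaded version it sees the widened dtype and keeps it.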

Also added test_evolve_threads_spec_preserving_serializer_endian: a
dependency-free regression test (uses a minimal dtype-widening AA codec stub, no
cast_value_rs) that runs on BOTH pipelines via the pipeline_class fixture. It
fails on [sync] without this fix and passes with it — closing the gap where the
only coverage required an optional dep and thus ran in no default env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The endian bug fixed in the previous commit existed because the two codec
pipelines duplicated the same conceptual logic and one copy drifted: Batched's
evolve_from_array_spec threaded the spec forward correctly, Fused's did not.
Duplicated logic that can silently diverge is a standing bug source, so extract
the drift-prone pieces into single sources of truth that both pipelines delegate
to (mirroring the existing resolve_aa_specs precedent):

- evolve_codecs(codecs, array_spec): the construction-time spec-threading loop.
  Both BatchedCodecPipeline and FusedCodecPipeline.evolve_from_array_spec now
  call it. There is now exactly one place this logic lives, so it cannot drift.
- pipeline_supports_partial_decode / _encode(ab, *, aa, bb, require_no_aa_bb):
  the partial-decode/encode predicate. Both pipelines delegate. The two pass
  DIFFERENT require_no_aa_bb values (Batched True, Fused False) — that divergence
  is pre-existing and deliberately preserved here (not silently unified); it is
  now explicit at the call sites and documented in one function instead of being
  buried in two slightly-different inline isinstance checks.

Behavior-preserving: each pipeline computes exactly what it did before. Left the
trivial fan-out loops (validate, compute_encoded_size) as-is — deduping those
would require changing the CodecPipeline ABC contract (abstract -> concrete) for
near-zero drift benefit. Full pipeline/sharding/parity suites + the dtype-evolve
regression test pass; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d-v-b commented Jun 7, 2026

Contributor Author

(claude wrote this)

Current state of this PR (reconciled against the original framing)

This branch has evolved substantially. Posting a current-state summary so the description can be updated.

⚠️ The original "hard IO/compute boundary" framing no longer describes this PR

This PR was opened as a phased pipeline built on separating IO from compute (preparatory IO → pure compute → final IO), with the suggestion that the spec be rewritten around that framing. That is not what shipped. The class was renamed PhasedCodecPipeline → FusedCodecPipeline, and its docstring now explicitly disclaims the original thesis:

The win is NOT "separating IO from compute" — the codecs (notably ShardingCodec) still perform their own storage IO.

The actual win is synchronous, fused per-chunk execution: eliminating BatchedCodecPipeline's ~one-coroutine-per-chunk async scheduling, which was the real dominant overhead — not IO/compute mixing. IO ownership stays with the codecs (the zarrs model). The PreparedWrite/phase abstractions from the first commit are not in the live code path. The "fetch exactly what we need" idea survives only as a read-path optimization inside the sharding codec (get_ranges_sync coalescing, vectorized whole-shard bulk decode), not as a pipeline-level phase.

So the PR description should be rewritten around "synchronous execution" rather than "separating IO from compute."

What it actually is now

  • FusedCodecPipeline runs the codec chain synchronously when all codecs are sync-capable (SupportsSyncCodec); falls back to the async path (≡ BatchedCodecPipeline) for non-sync stores (ZipStore, fsspec).
  • Intended to be behavior-identical to Batched; a pipeline-parity test battery asserts Fused ≡ Batched (same results, same store keys, byte-identical shards).

User-facing changes

  • Default pipeline flipped to FusedCodecPipeline (the one genuinely user-facing change). One-line revert: zarr.config.set({"codec_pipeline.path": "zarr.core.codec_pipeline.BatchedCodecPipeline"}).
  • Threading off by default (codec_pipeline.max_workers=1, sequential). Threading-by-default is deferred (downstream thread-safety; oversubscription on many-core nodes). Opt in with max_workers > 1.
  • No public API additions; import surface unchanged.
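
For reference, the two config knobs above can be set like this. This is a configuration fragment only: the `codec_pipeline.max_workers` key is taken from this PR's description and is assumed to exist only on this branch, so treat it as illustrative.

```python
import zarr

# One-line revert to the previous default pipeline.
zarr.config.set(
    {"codec_pipeline.path": "zarr.core.codec_pipeline.BatchedCodecPipeline"}
)

# Opt in to threading for a limited scope (key assumed from this PR).
with zarr.config.set({"codec_pipeline.max_workers": 4}):
    ...  # reads and writes issued here would use the configured worker count
```

Using `set` as a context manager keeps the override local, which is handy when comparing the two pipelines in the same session.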

Scope changes since opening

Performance (latest CodSpeed)

  • ×3.1 overall, 0 regressed, 31 improved, 11 untouched. Sharded morton reads ~×11–12.
  • A mid-development write regression (test_sharded_morton_write_single_chunk, −38–39%) was fixed (vectorized the partial-shard dict build), which also lifted the overall figure ×2.7 → ×3.1.

Hardening highlights

  • Fixed a spec-threading bug where FusedCodecPipeline.evolve_from_array_spec evolved codecs against the original (not forward-threaded) spec, stripping BytesCodec.endian for dtype-widening cast_value chains. Root cause was duplicated logic drifting between the two pipelines; a follow-up refactor extracted the shared logic (evolve_codecs, pipeline_supports_partial_decode/encode) into freestanding functions so the class of bug can't recur. Added a dependency-free regression test on both pipelines.
  • Shared CodecPipelineTests suite now runs every pipeline-agnostic behavior on both pipelines × sync/async stores.

Open items for review

  1. Update this description (IO/compute framing → synchronous-execution framing).
  2. Default flip: opt-in for a release first, or explicit risk sign-off (mitigated by parity tests + one-line revert).
  3. Downstream: numcodecs zarr3 passes; xarray backend suite needs a clean run, we are waiting on Revert "fix: make xarray downstream tests work" #4047 for that.
  4. supports_partial_decode/encode divergence between the pipelines (deferred; behavior question).
  5. Pre-existing Batched transpose+sharding bug (on main too; deferred to a separate PR).

d-v-b commented Jun 7, 2026

Contributor Author

@ilan-gold this is ready for you to dig into

@floriankrb

@d-v-b This looks promising. Is this branch ready to be tested regarding the regression we saw here ecmwf/anemoi-datasets#486?

d-v-b commented Jun 9, 2026

Contributor Author

@floriankrb yes if you have time to test it, that would be great

d-v-b and others added 8 commits June 10, 2026 15:15
The pool dispatch in read_sync/write_sync (codec_pipeline.max_workers > 1) had
zero functional test coverage — only config-default assertions existed — even
though threading is the opt-in we point users at. Adds:

- an end-to-end multi-chunk read/write roundtrip with max_workers=4 (verified
  the pool dispatch actually fires, not the sequential branch);
- worker-exception propagation tests for both write_sync (list-consumed
  pool.map) and read_sync (tuple-consumed pool.map): a store error raised in a
  pool worker must surface to the caller;
- a concurrent-decode test: transpose filter (so ChunkTransform._resolve_specs
  cache traffic actually occurs — with no AA codecs the cache is bypassed),
  pool workers decoding concurrently, plus an outer thread pool issuing
  overlapping reads. Pins correctness under concurrency around the shared
  transform's mutable cache.
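
The worker-exception case above relies on a standard-library property: an exception raised inside a pool.map worker is re-raised in the caller when the result iterator is consumed. A minimal sketch, assuming hypothetical names (StoreError, fetch_chunk, read_all) rather than the pipeline's real read_sync:

```python
from concurrent.futures import ThreadPoolExecutor


class StoreError(Exception):
    """Hypothetical stand-in for an error raised by a store."""


def fetch_chunk(key: str) -> bytes:
    # Simulate one failing chunk read inside a pool worker.
    if key == "c/1":
        raise StoreError(f"simulated store failure for {key}")
    return key.encode()


def read_all(keys, max_workers: int = 4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the lazy map iterator, which is the point where
        # a worker's exception is re-raised in the calling thread.
        return list(pool.map(fetch_chunk, keys))


try:
    read_all(["c/0", "c/1", "c/2"])
    surfaced = False
except StoreError:
    surfaced = True

ok = read_all(["c/0", "c/2"])
```

If the iterator returned by pool.map were never consumed, the error would be silently dropped, which is why the tests cover both the list-consumed and the tuple-consumed call sites.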

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
zarr_format=2 appeared in none of the pipeline test files: v2 arrays were only
exercised implicitly through whichever pipeline is the global default. v2 goes
through the V2Codec wrapper (numcodecs filters + compressor) — a different
codec path than the v3 AA/AB/BB chain, with its own _encode_sync/_decode_sync
under FusedCodecPipeline — so it deserves explicit coverage on BOTH pipelines
and BOTH store kinds (sync fast path + async fallback).

Adds v2-roundtrip (uncompressed) and v2-gzip-roundtrip (numcodecs.GZip — the v2
compressor spelling; v3 codec configs are rejected for v2 arrays) to SCENARIOS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both _encode_shard_dict_sync and the async _encode_shard_dict encoded the shard
index TWICE when index_location=start: encode to learn the length, shift the
present chunks' offsets by it, then re-encode with corrected offsets. The index
size is knowable without encoding — _shard_index_size() is already the byte-
exact contract every index read path relies on (reads slice exactly that many
bytes) — so the layout can use absolute offsets from the start and the index is
encoded once. Saves a full index encode (including its crc32c over the offsets
array) per shard write with index_location=start.

The layout loop was also duplicated between the sync and async versions — the
same drift surface that produced the evolve_from_array_spec endian bug. Both now
delegate to a shared pure _build_shard_layout (offset math lives once) and
_assemble_shard. A runtime guard verifies the encoded index length matches
_shard_index_size rather than silently corrupting offsets if someone ever
configures variable-size index codecs.

Verified: 606 tests pass including the pipeline-parity byte-identical shard
assertions across index_location=start/end and every subchunk_write_order, and
the sharded reopen tests — the on-disk layout is unchanged.
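
A minimal sketch of the single-pass layout described above. It assumes the default index codecs (bytes + crc32c), under which an index for n chunks occupies 16 * n + 4 bytes: two uint64 values per chunk plus a 4-byte checksum. The names shard_index_size and build_shard_layout are illustrative and are not the PR's _shard_index_size / _build_shard_layout.

```python
from typing import Optional

# The sharding format marks an absent chunk with 2**64 - 1 for both fields.
MISSING = 2**64 - 1


def shard_index_size(n_chunks: int) -> int:
    # Assumption: fixed-size index codecs (bytes + crc32c).
    return 16 * n_chunks + 4


def build_shard_layout(chunk_sizes: list[Optional[int]], index_at_start: bool):
    """Return (offset, nbytes) per chunk, using absolute offsets from the start.

    `chunk_sizes[i]` is the encoded size of chunk i, or None if it is absent.
    """
    # With the index at the start, chunk data begins right after it, so the
    # final offsets are known before the index is ever encoded.
    offset = shard_index_size(len(chunk_sizes)) if index_at_start else 0
    entries = []
    for size in chunk_sizes:
        if size is None:
            entries.append((MISSING, MISSING))
        else:
            entries.append((offset, size))
            offset += size
    return entries


start_layout = build_shard_layout([10, None, 5], index_at_start=True)
end_layout = build_shard_layout([10, None, 5], index_at_start=False)
```

Because the offsets are final on the first pass, the index only needs to be encoded once; the runtime guard mentioned above covers the case where this fixed-size assumption does not hold.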

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
read_missing_chunks exists to help consumers distinguish a transport error from
a truly missing chunk. That distinction is a STORE-KEY-level concept: a missing
shard key raises ChunkNotFoundError. It does not cleanly apply to inner
subchunks of a shard that was fetched successfully — there is no transport
ambiguity there; the shard index simply records the subchunk as absent — so
those fill with the fill value rather than raising.

Both pipelines already implement exactly this (verified empirically), but
nothing pinned it, so the asymmetry vs unsharded arrays read as a bug in review.
This adds a shared-suite test (both pipelines x sync/async stores) asserting
both sides: missing shard key raises; missing inner subchunk of an existing
shard fills.
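
The asymmetry can be sketched with a toy model in which a store maps shard keys to a dict of present subchunks. The exception class and read_subchunk are hypothetical stand-ins written for this illustration; the sketch assumes strict missing-chunk handling is switched on.

```python
class ChunkNotFoundError(KeyError):
    """Local stand-in for the error raised on a missing store key."""


FILL = b"\x00"


def read_subchunk(store: dict, shard_key: str, subchunk_id: int) -> bytes:
    # Store-key level: a missing shard is ambiguous (absent, or a transport
    # failure?), so strict mode raises instead of guessing.
    if shard_key not in store:
        raise ChunkNotFoundError(shard_key)
    # Subchunk level: the shard was fetched and its index is authoritative,
    # so an absent subchunk simply reads as the fill value.
    shard_index = store[shard_key]
    return shard_index.get(subchunk_id, FILL)


store = {"c/0": {0: b"\x07"}}
present = read_subchunk(store, "c/0", 0)
filled = read_subchunk(store, "c/0", 1)
try:
    read_subchunk(store, "c/1", 0)
    raised = False
except ChunkNotFoundError:
    raised = True
```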

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore: add parallelism TODOs

* perf: non-sharded reads

* chore: code duplication

* fix: arguments don't spread themselves!

* fix: bring in suggested guard

* perf: don't block on pool ops

* chore: docstring + materialize early

---------

Co-authored-by: ilan-gold <ilanbassgold@gmail.com>
…line

Addresses the open question on this PR about sync/async byte getters,
benchmark-guided as discussed.

_ShardingByteGetter/_ShardingByteSetter are in-memory dict wrappers but
presented only an async API, so the nested codec_pipeline.read over inner
chunks fell to the async fallback: one concurrent_map coroutine per inner chunk
for a dict lookup (~2.1 us/chunk pure asyncio overhead, ~8.8 ms per 4096-chunk
shard). On top of that, the nested pipeline (ShardingCodec.codec_pipeline) is
built by from_codecs and never evolved, so its sync transform was always None —
inner chunks also paid per-chunk AsyncChunkTransform coroutines, and the
decode_sync/encode_sync fallback improvements in this PR could not reach them.

Changes:
- SyncByteGetter / SyncByteSetter runtime protocols in zarr.abc.store
  (resurrecting the design from the original perf/prepared-write experiments,
  same names and shape). StorePath matches structurally but is still gated on
  its STORE's sync support; the protocols gate non-StorePath byte getters.
- _ShardingByteGetter/Setter implement get_sync/set_sync/delete_sync; the async
  methods delegate to the sync ones (single implementation).
- ShardingCodec._get_inner_pipeline(shard_spec): the nested pipeline evolved
  against the inner chunk spec (threads specs through the inner chain AND
  builds the sync transform). The four nested read/write call sites use it.
- FusedCodecPipeline.read/write gates accept non-StorePath SyncByteGetter/
  SyncByteSetter, so nested inner-chunk IO takes read_sync/write_sync.
- _decode_shard_index/_encode_shard_index delegate to their sync twins (pure
  compute; kills a per-shard AsyncChunkTransform round-trip and a sync/async
  duplication).

Benchmark (4096 inner chunks per shard, LatencyStore@0 i.e. the async-fallback
path = sharded data on remote stores), vs this PR's head:

  uncompressed read: 44.2 -> 28.9 ms (1.53x)
  gzip         read: 182.0 -> 50.9 ms (3.6x)
  writes: unchanged (~34 / ~70 ms) — already optimized by this PR's
  encode_sync fallback (whole-shard sync encode, no byte setters involved).

Adds a regression test asserting sharded fallback reads route inner chunks
through the sync fast path (read_sync engaged, zero AsyncChunkTransform calls);
verified it fails if the gate is removed. Full sharding + parity + pipeline
suites pass (619).
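
A minimal sketch of the gating idea, assuming a simplified get_sync signature; the real SyncByteGetter protocol in zarr.abc.store may take different arguments. DictByteGetter and AsyncOnlyGetter are hypothetical classes written for this illustration.

```python
import asyncio
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class SyncByteGetter(Protocol):
    """Simplified, assumed shape of a sync-capable byte getter."""

    def get_sync(self) -> Optional[bytes]: ...


class DictByteGetter:
    """In-memory getter over a dict of chunks, like a shard's inner chunks."""

    def __init__(self, chunks: dict, key: str) -> None:
        self._chunks = chunks
        self._key = key

    def get_sync(self) -> Optional[bytes]:
        return self._chunks.get(self._key)

    async def get(self) -> Optional[bytes]:
        # Single implementation: the async method delegates to the sync one.
        return self.get_sync()


class AsyncOnlyGetter:
    async def get(self) -> Optional[bytes]:
        return None


def can_read_sync(getter: object) -> bool:
    # runtime_checkable protocols only check that the method exists,
    # not that its signature matches.
    return isinstance(getter, SyncByteGetter)


getter = DictByteGetter({"0.0": b"abc"}, "0.0")
via_async = asyncio.run(getter.get())
```

With a gate like can_read_sync, a dict lookup no longer has to be scheduled as a coroutine, which is where the per-chunk asyncio overhead quoted above comes from.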

Assisted-by: ClaudeCode:claude-fable-5
… cache inner pipeline

The pipeline test suites have failed intermittently all along on an arbitrary
test that passes in isolation. Root cause (allocation site verified with
PYTHONTRACEMALLOC): pytest-asyncio implicitly creates an event loop in
_get_event_loop_no_warn during fixture setup/teardown and never closes it. When
GC reclaims that loop — or its self-pipe socketpair — mid-test, pytest's
unraisable hook converts the ResourceWarning into a failure of whichever
unrelated test happens to be running. The sync-bytegetter change increased
per-shard-op allocation churn enough to make this near-deterministic, which is
how it was finally traced.

Two changes:
- pyproject filterwarnings: narrowly ignore the two unraisable shapes
  (BaseEventLoop.__del__, AF_UNIX socketpair), mirroring the existing
  s3fs/aiobotocore entry. Not zarr's loops.
- ShardingCodec._get_inner_pipeline is now memoized per (pipeline class,
  shard_spec) — evolving built a ChunkTransform on every shard operation. The
  pipeline class is part of the key so codec_pipeline.path config changes are
  still honored.

Battery that previously failed ~every run now passes 3x consecutively (635).
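
The memoization can be sketched as a dict keyed on (pipeline class, shard_spec). InnerPipelineCache and the build callback are hypothetical, and shard_spec is assumed to be hashable (a tuple stands in for it here).

```python
class InnerPipelineCache:
    """Cache evolved inner pipelines per (pipeline class, shard spec)."""

    def __init__(self, build) -> None:
        self._build = build
        self._cache: dict = {}

    def get(self, pipeline_cls: type, shard_spec):
        # The pipeline class is part of the key so that switching the
        # configured pipeline does not return a stale instance.
        key = (pipeline_cls, shard_spec)
        if key not in self._cache:
            self._cache[key] = self._build(pipeline_cls, shard_spec)
        return self._cache[key]


class PipelineA: ...
class PipelineB: ...


builds = []


def build(pipeline_cls, shard_spec):
    # Stand-in for the expensive evolve + transform construction.
    builds.append((pipeline_cls, shard_spec))
    return pipeline_cls()


cache = InnerPipelineCache(build)
spec = ((32, 32), "uint16")  # hypothetical hashable shard spec
first = cache.get(PipelineA, spec)
second = cache.get(PipelineA, spec)   # cache hit, no rebuild
other = cache.get(PipelineB, spec)    # different class, separate entry
```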

Assisted-by: ClaudeCode:claude-fable-5
…hain evolve

Continues the anti-skew work: where the sync and async sharding paths
implemented the same logic twice, extract a single source of truth so the
copies cannot drift (the mechanism behind the pipeline-level endian bug).

- _get_inner_chunk_transform / _get_index_chunk_transform now evolve their
  codec chains via evolve_codecs (spec THREADED forward). Both previously
  evolved every codec against the same unthreaded spec — the exact bug shape
  that stripped BytesCodec.endian at the pipeline level, latent here for any
  spec-changing inner codec. Both are also now actually memoized (the inner
  transform's docstring claimed a cache that did not exist; transforms were
  rebuilt per call).
- New regression test (dependency-free dtype-widening stub codec) asserting
  the inner serializer keeps its endian; verified it fails on the unthreaded
  version.
- _shard_index_byte_range(): the index-location byte-range arithmetic existed
  verbatim in both _load_shard_index_maybe and its _sync twin; now one helper.
- _pair_chunks_with_byte_ranges(): the chunk-coord/byte-range pairing loop
  existed verbatim in both _load_partial_shard_maybe and its _sync twin; now
  one helper.

Deliberately NOT unified: the small hand-rolled loops remaining in
_decode_sync/_encode_sync vs their async twins. Post sync-bytegetter work the
async versions are thin delegations to the (shared, evolved) nested pipeline,
so the heavy machinery — evolve, transforms, layout, index codecs — is already
single-sourced; force-merging the residual loops would couple different
missing-chunk/concurrency semantics for little drift-surface gain. The
pipeline-parity suite guards their behavioral equivalence.

Full battery passes twice (636); ruff + mypy clean.
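
The single-source-of-truth pattern for the index byte range can be sketched as one pure helper shared by a sync and an async loader. The names and the explicit shard_size argument are assumptions for illustration; the real helper may express the end-of-shard case as a suffix request instead.

```python
import asyncio


def shard_index_byte_range(index_location: str, index_size: int, shard_size: int):
    """Return the (start, end) byte range holding the shard index."""
    if index_location == "start":
        return (0, index_size)
    if index_location == "end":
        return (shard_size - index_size, shard_size)
    raise ValueError(f"unknown index_location: {index_location!r}")


def load_index_sync(shard: bytes, index_location: str, index_size: int) -> bytes:
    start, end = shard_index_byte_range(index_location, index_size, len(shard))
    return shard[start:end]


async def load_index_async(shard: bytes, index_location: str, index_size: int) -> bytes:
    # Same arithmetic as the sync path; only the IO step differs.
    start, end = shard_index_byte_range(index_location, index_size, len(shard))
    await asyncio.sleep(0)  # stand-in for an awaited store read
    return shard[start:end]


head = load_index_sync(b"IIIIdata", "start", 4)
tail = load_index_sync(b"dataIIII", "end", 4)
tail_async = asyncio.run(load_index_async(b"dataIIII", "end", 4))
```

Keeping the arithmetic in one function means the sync and async readers cannot disagree about where the index lives.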

Assisted-by: ClaudeCode:claude-fable-5
Labels

benchmark: Code will be benchmarked in a CI job.
performance: Potential issues with Zarr performance (I/O, memory, etc.)
run-downstream: Run the tests of downstream libraries (e.g., xarray) against zarr

4 participants