perf: phased codecpipeline by d-v-b · Pull Request #3885 · zarr-developers/zarr-python

d-v-b · 2026-04-08T13:34:35Z

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

1. preparatory IO, async: fetch exactly what we need to fetch from storage, given the codecs we have. So if there's a sharding codec in the first array->bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
2. pure compute, sync: apply filters and compressors. Safe to parallelize over chunks.
3. (if writing) final IO, async: reconcile the in-memory compressed chunks against our model of the stored chunk. Write out bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
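The phase split described above can be sketched roughly as follows. This is a toy read path, not the PR's code: the `STORE` dict, `fetch`, `decode`, and `read_chunks` are all hypothetical names, and `zlib` stands in for whatever compressor the codec chain uses.

```python
from __future__ import annotations

import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "store": key -> compressed chunk bytes.
STORE: dict[str, bytes] = {f"c/{i}": zlib.compress(bytes([i]) * 64) for i in range(4)}


async def fetch(key: str) -> bytes | None:
    """Phase 1 (async IO): fetch the raw bytes for one chunk."""
    await asyncio.sleep(0)  # stand-in for real storage latency
    return STORE.get(key)


def decode(raw: bytes | None) -> bytes | None:
    """Phase 2 (sync compute): a pure function of its input, no IO."""
    return None if raw is None else zlib.decompress(raw)


async def read_chunks(keys: list[str]) -> list[bytes | None]:
    # Phase 1: issue all the IO up front.
    raw = await asyncio.gather(*(fetch(k) for k in keys))
    # Phase 2: compute only, so it can fan out over threads.
    # (This blocks the event loop while decoding; acceptable for a sketch.)
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(decode, raw))


chunks = asyncio.run(read_chunks(["c/0", "c/1", "c/missing"]))
```

A write path would add a third step after the compute phase that stores the encoded bytes; it is omitted here to keep the sketch short.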

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

Another edit: the big pitch of this PR -- separating IO from compute -- didn't end up being valuable, because we ran into a large amount of overhead due to indexing / Python object creation. It turns out the vast majority of the speed benefits can be had simply by avoiding async for storage backends that don't need it. See #3885 (comment) for a detailed summary of the current state of things here.

`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking` is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.

…into perf/prepared-write-v2

codecov · 2026-04-08T16:52:11Z

Codecov Report

❌ Patch coverage is 88.54296% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.21%. Comparing base (96a62b5) to head (bda7c9f).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
src/zarr/core/codec_pipeline.py	83.79%	59 Missing ⚠️
src/zarr/codecs/sharding.py	95.37%	16 Missing ⚠️
src/zarr/abc/store.py	75.67%	9 Missing ⚠️
src/zarr/codecs/numcodecs/_codecs.py	66.66%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.53%   93.21%   -0.33%     
==========================================
  Files          88       88              
  Lines       11917    12561     +644     
==========================================
+ Hits        11147    11709     +562     
- Misses        770      852      +82

Files with missing lines	Coverage Δ
src/zarr/codecs/_v2.py	`94.11% <100.00%> (+0.50%)`	⬆️
src/zarr/core/array.py	`97.89% <100.00%> (+0.01%)`	⬆️
src/zarr/core/config.py	`100.00% <ø> (ø)`
src/zarr/storage/_fsspec.py	`91.32% <ø> (ø)`
src/zarr/testing/buffer.py	`93.18% <100.00%> (-6.82%)`	⬇️
src/zarr/codecs/numcodecs/_codecs.py	`93.18% <66.66%> (-3.21%)`	⬇️
src/zarr/abc/store.py	`92.68% <75.67%> (-3.75%)`	⬇️
src/zarr/codecs/sharding.py	`93.54% <95.37%> (+2.02%)`	⬆️
src/zarr/core/codec_pipeline.py	`86.91% <83.79%> (-5.26%)`	⬇️

... and 5 files with indirect coverage changes

…into perf/prepared-write-v2

d-v-b · 2026-04-09T08:36:25Z

@TomAugspurger how would this design work with CUDA codecs?

ilan-gold · 2026-04-17T13:08:47Z

+        # Phase 1: fetch all chunks (IO, sequential)
+        raw_buffers: list[Buffer | None] = [
+            bg.get_sync(prototype=cs.prototype)  # type: ignore[attr-defined]
+            for bg, cs, *_ in batch
+        ]
+
+        # Phase 2: decode (compute, optionally threaded)
+        def _decode_one(raw: Buffer | None, chunk_spec: ArraySpec) -> NDBuffer | None:
+            if raw is None:
+                return None
+            return transform.decode_chunk(raw, chunk_spec)
+
+        specs = [cs for _, cs, *_ in batch]
+        if n_workers > 0 and len(batch) > 1:
+            with ThreadPoolExecutor(max_workers=n_workers) as pool:
+                decoded_list = list(pool.map(_decode_one, raw_buffers, specs))
+        else:
+            decoded_list = [
+                _decode_one(raw, spec) for raw, spec in zip(raw_buffers, specs, strict=True)
+            ]


Why isn't this all multi-threaded i.e., the I/O as well?

I should benchmark this, but my expectation was that IO against memory storage and local storage is not compute-limited, so threads wouldn't remove a real bottleneck. For memory storage I'm sure this is true; I'm not sure about local storage, though.

Adds a SupportsSetRange protocol to zarr.abc.store for stores that allow overwriting a byte range within an existing value. Implementations are added for LocalStore (using file-handle seek+write) and MemoryStore (in-memory bytearray slice assignment). This is the prerequisite for the partial-shard write fast path in ShardingCodec, which can patch individual inner-chunk slots without rewriting the entire shard blob when the inner codec chain is fixed-size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
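The two techniques named in this commit message (file-handle seek+write and bytearray slice assignment) look roughly like this in isolation. The helper names `patch_file_range` and `patch_buffer_range` are hypothetical and are not the store methods added by the commit; their signatures are assumptions made for illustration.

```python
import os
import tempfile


def patch_file_range(path: str, start: int, value: bytes) -> None:
    """Overwrite len(value) bytes at `start` in an existing file (seek + write)."""
    with open(path, "r+b") as f:  # r+b updates in place without truncating
        f.seek(start)
        f.write(value)


def patch_buffer_range(buf: bytearray, start: int, value: bytes) -> None:
    """Overwrite a byte range in memory via slice assignment."""
    buf[start : start + len(value)] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard")
    with open(path, "wb") as f:
        f.write(b"aaaabbbbcccc")
    patch_file_range(path, 4, b"XXXX")
    with open(path, "rb") as f:
        file_result = f.read()

buf = bytearray(b"aaaabbbbcccc")
patch_buffer_range(buf, 4, b"XXXX")
```

Both helpers leave the bytes outside the patched range untouched, which is what lets a caller rewrite one fixed-size slot without touching the rest of the value.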

V2Codec, BytesCodec, BloscCodec, etc. previously only implemented the async _decode_single / _encode_single methods. Add their sync counterparts (_decode_sync / _encode_sync) so that the upcoming SyncCodecPipeline can dispatch through them without spinning up an event loop. For codecs that wrap external compressors (numcodecs.Zstd, numcodecs.Blosc, the V2 fallback chain), the sync versions just call the underlying compressor's blocking API directly instead of routing through asyncio.to_thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
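As a minimal sketch of the sync/async pairing described above: `ToyZlibCodec` is a toy class, not a zarr codec, and only borrows the method names from the commit message. The sync method calls the compressor's blocking API directly, while the async method routes through `asyncio.to_thread`.

```python
import asyncio
import zlib


class ToyZlibCodec:
    """Toy codec (not a zarr class) exposing both a sync and an async decode."""

    def _decode_sync(self, data: bytes) -> bytes:
        # Blocking call straight into the compressor: no event loop involved.
        return zlib.decompress(data)

    async def _decode_single(self, data: bytes) -> bytes:
        # Async entry point: hop to a worker thread so the loop is not blocked.
        return await asyncio.to_thread(self._decode_sync, data)


codec = ToyZlibCodec()
payload = zlib.compress(b"zarr" * 8)
sync_result = codec._decode_sync(payload)
async_result = asyncio.run(codec._decode_single(payload))
```

A caller that is already synchronous can use `_decode_sync` without starting an event loop at all, which is the cost the commit is trying to avoid.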

…arallelism
Adds SyncCodecPipeline alongside BatchedCodecPipeline. The new pipeline runs codecs through their sync entry points (_decode_sync / _encode_sync) and dispatches per-chunk work to a module-level thread pool sized by the codec_pipeline.max_workers config (default = os.cpu_count()). Each chunk's full lifecycle (fetch + decode + scatter for reads; get-existing + merge + encode + set/delete for writes) runs as one pool task, overlapping IO of one chunk with compute of another. Scatter into the shared output buffer is thread-safe because chunks have non-overlapping output selections.
The async wrappers (read/write) detect SupportsGetSync/SupportsSetSync stores and dispatch to the sync fast path, passing the configured max_workers. Other stores fall through to the async path, which still uses asyncio.concurrent_map at async.concurrency.
Notes on perf:
- Default (None → cpu_count) is tuned for chunks ≥ ~512 KB.
- Small chunks (≤ 64 KB) regress 1.5-3x because pool dispatch overhead (~30-50 µs/task) dominates per-chunk work. Workaround: zarr.config.set({"codec_pipeline.max_workers": 1}).
- For large chunks on local/memory stores, IO+compute parallelism yields 1.7-2.5x over BatchedCodecPipeline on direct-API reads and ~2.5x on roundtrip.
ChunkTransform encapsulates the sync codec chain. It caches resolved ArraySpecs across calls with the same chunk_spec; combined with the constant-ArraySpec optimization in indexing, hot-path overhead is minimized.
Includes test scaffolding for the new pipeline (test_sync_codec_pipeline) and config plumbing for the max_workers key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
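The "one pool task per chunk, scatter into disjoint regions" idea can be illustrated with the standard library alone. Everything here is a stand-in: the dict plays the store, `zlib` plays the codec chain, and a `bytearray` plays the shared output buffer; `read_one` is a hypothetical name.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8  # decoded bytes per chunk (toy value)
store = {i: zlib.compress(bytes([65 + i]) * CHUNK) for i in range(4)}
out = bytearray(CHUNK * len(store))  # shared output buffer


def read_one(i: int) -> None:
    """Full lifecycle of one chunk: fetch, decode, scatter."""
    raw = store[i]                   # fetch (IO)
    decoded = zlib.decompress(raw)   # decode (compute)
    # Each chunk owns a disjoint slice of `out`, so no lock is taken here.
    out[i * CHUNK : (i + 1) * CHUNK] = decoded


with ThreadPoolExecutor(max_workers=4) as pool:
    # Consuming the iterator re-raises any exception from a worker.
    list(pool.map(read_one, store))
```

With chunks this small the pool dispatch cost would dominate, which matches the small-chunk regression noted in the commit message.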

Adds _encode_partial_sync and _decode_partial_sync to ShardingCodec. For fixed-size inner codec chains and stores that implement SupportsSetRange, partial writes patch individual inner-chunk slots in-place instead of rewriting the whole shard:
- Reads existing shard index (one byte-range get).
- For each affected inner chunk: decodes the slot, merges the new region, re-encodes.
- Writes each modified slot at its deterministic byte offset, then rewrites just the index.
For variable-size inner codecs (e.g. with compression) or stores that don't support byte-range writes, falls through to a full-shard rewrite matching BatchedCodecPipeline semantics.
The partial-decode path computes a ReadPlan from the shard index and issues one byte-range get per overlapping chunk, decoding only what the read selection touches. Both paths are dispatched from SyncCodecPipeline via the existing supports_partial_decode / supports_partial_encode protocol checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
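To make the partial-decode idea concrete, here is a rough sketch of turning a shard index into a list of byte ranges to fetch. It assumes the index is a flat array of little-endian `(offset, nbytes)` uint64 pairs with `2**64 - 1` marking an absent inner chunk, and it ignores any checksum appended to the index. `parse_index` and `read_plan` are hypothetical helpers, not the PR's `ReadPlan`.

```python
import struct

EMPTY = 2**64 - 1  # sentinel for an absent inner chunk


def parse_index(index_bytes: bytes, n_chunks: int) -> list[tuple[int, int]]:
    """Decode n_chunks little-endian (offset, nbytes) uint64 pairs."""
    flat = struct.unpack(f"<{2 * n_chunks}Q", index_bytes[: 16 * n_chunks])
    return list(zip(flat[0::2], flat[1::2]))


def read_plan(index: list[tuple[int, int]], wanted: list[int]) -> dict[int, tuple[int, int]]:
    """Map each wanted, present inner chunk to the (start, stop) range to fetch."""
    plan = {}
    for i in wanted:
        offset, nbytes = index[i]
        if offset == EMPTY and nbytes == EMPTY:
            continue  # absent chunk: nothing to fetch, caller uses the fill value
        plan[i] = (offset, offset + nbytes)
    return plan


# Three inner chunks: chunk 0 at [0, 10), chunk 1 absent, chunk 2 at [10, 15).
index_bytes = struct.pack("<6Q", 0, 10, EMPTY, EMPTY, 10, 5)
plan = read_plan(parse_index(index_bytes, 3), wanted=[1, 2])
```

Only the ranges in `plan` would be requested from the store, so a read that touches one inner chunk never pulls the whole shard.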

Two new test files:
- test_codec_invariants: asserts contract-level properties that every codec / shard / buffer combination must satisfy: round-trip exactness, prototype propagation, fill-value handling, all-empty shard handling.
- test_pipeline_parity: exhaustive matrix asserting that SyncCodecPipeline and BatchedCodecPipeline produce semantically identical results across codec configs, layouts (including nested sharding), write sequences, and write_empty_chunks settings. Three checks per cell:
  1. Same array contents on read.
  2. Same set of store keys after writes.
  3. Each pipeline reads the other's output identically (catches layout-divergence bugs).
These tests pinned the design throughout the SyncCodecPipeline + partial-shard development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/ so local IDE/agent planning artifacts don't get committed by accident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ilan-gold · 2026-04-18T14:35:38Z

+            selected = decoded[chunk_selection]
+            if drop_axes:
+                selected = selected.squeeze(axis=drop_axes)
+            out[out_selection] = selected


It might be worth experimenting with moving this setting operation out[out_selection] = selected outside the threadpool execution since, IIRC, it holds the GIL and is probably non-trivial time-wise.

The memory usage will probably go up a bit though....
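A rough sketch of that suggestion, using stand-ins only (a dict for the store, `zlib` for the codecs, a `bytearray` for `out`): workers return decoded chunks instead of writing them, and the calling thread does the scatter. The name `fetch_and_decode` is hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8
store = {i: zlib.compress(bytes([97 + i]) * CHUNK) for i in range(4)}
out = bytearray(CHUNK * len(store))


def fetch_and_decode(i: int) -> tuple[int, bytes]:
    """Worker: fetch + decode only; returns the decoded chunk instead of writing it."""
    return i, zlib.decompress(store[i])


with ThreadPoolExecutor(max_workers=4) as pool:
    # Decoded chunks stay alive until the caller consumes them, which is
    # where the extra memory would come from.
    for i, decoded in pool.map(fetch_and_decode, store):
        out[i * CHUNK : (i + 1) * CHUNK] = decoded  # scatter in the calling thread
```

Whether this is faster than scattering inside the workers would need to be measured; the sketch only shows the shape of the change.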

…ross-file dup The file named test_sync_codec_pipeline.py tested no pipeline -- it is the unit test suite for ChunkTransform (the per-chunk synchronous codec chain that FusedCodecPipeline uses internally). "sync codec pipeline" was an earlier name for the Fused pipeline; the filename had outlived it. Renamed to test_chunk_transform.py (git mv preserves history) and added a module docstring naming what it actually covers. Also removed test_sync_transform_encode_decode_roundtrip from test_fused_pipeline.py: it was a weaker cross-file duplicate of this file's test_encode_decode_roundtrip (which covers the same encode->decode->compare over five codec chains rather than just bytes-only). Its one extra assertion -- that evolve_from_array_spec populates _sync_transform -- is already covered by test_evolve_from_array_spec in the Fused file. test_codec_pipeline.py left as-is: all three tests are correctly placed and cover things the Scenario suite can't (the low-level pipeline.read GetResult API, a plain dict store, and the zarr-developers#3937 cast_value dtype-threading regression). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The byte-range-write machinery works, but the right store interface for it is still undecided, so it is removed from this PR and will return once that lands.
Removed:
- SupportsSetRange protocol (abc/store.py) and its __all__ export.
- MemoryStore.set_range / set_range_sync / _set_range_impl and the SupportsSetRange base (storage/_memory.py).
- LocalStore.set_range / set_range_sync, the _put_range helper, and the SupportsSetRange base (storage/_local.py).
- The sharding codec's byte-range-write fast path in _encode_partial_sync; partial shard writes now always take the full-shard-rewrite path (identical to BatchedCodecPipeline, verified by the pipeline-parity suite). Also dropped the now-dead _chunk_byte_offset helper it relied on.
- changes/3907.feature.md (the byte-range-writes changelog note). The byte-range-READ changelog (3004) is unrelated and kept.
Byte-range READS (ByteRequest, get(byte_range=), get_ranges coalescing, the read-side bulk shard decode) are untouched -- this only removes writes. The known-good tests that exercise byte-range writes are commented out (not deleted) in test_store/test_memory.py, test_store/test_local.py, and test_fused_pipeline.py, to restore once the store design is settled.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PR-added module-level helper in array.py with zero callers — an ArraySpec-reuse optimization that was never wired up. Plain function, no protocol role, safe to drop. Verified: no references anywhere in src/ or tests/, and the full array/sharding/pipeline suites stay green. Note: ShardingCodec._encode_sync, though never *called*, is NOT dead — it is a required member of the runtime_checkable SupportsSyncCodec protocol. Removing it drops ShardingCodec from SupportsSyncCodec and breaks the sync read-fallback routing (16 test failures), so it stays. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
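The point about `runtime_checkable` protocols can be reproduced with a toy protocol. `ToySupportsSync`, `Full`, and `MissingEncode` are hypothetical classes; only the member names are borrowed from the text. An `isinstance` check against a runtime-checkable protocol requires every protocol member to be present, called or not.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ToySupportsSync(Protocol):
    """Stand-in for a sync-codec protocol; member names mirror the text."""

    def _decode_sync(self, data: bytes) -> bytes: ...

    def _encode_sync(self, data: bytes) -> bytes: ...


class Full:
    def _decode_sync(self, data: bytes) -> bytes:
        return data

    def _encode_sync(self, data: bytes) -> bytes:
        return data  # never called in this sketch, but still required


class MissingEncode:
    def _decode_sync(self, data: bytes) -> bytes:
        return data


full_ok = isinstance(Full(), ToySupportsSync)
missing_ok = isinstance(MissingEncode(), ToySupportsSync)
```

Any routing code that branches on such an `isinstance` check silently takes the other branch once a member is deleted, which is how an apparently unused method can still be load-bearing.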

The docstring claimed _encode_sync "iterates inner chunks in Morton order — that's the canonical layout the shard index expects", which is wrong and a latent footgun: it implies the method imposes a morton physical layout. It does not. The morton iteration only populates an intermediate dict whose key order is immaterial; the on-disk layout is decided downstream by the subchunk_write_order loop in _encode_shard_dict_sync (same as the async _encode_single sibling). Also clarified that this method IS reached — via nested sharding, where an inner ShardingCodec is encoded through the outer codec's ChunkTransform. (It is not called for top-level sharded writes, which route through _encode_partial_sync.) Verified empirically: routing through nested _encode_sync, all three subchunk_write_order values roundtrip correctly AND morton vs lexicographic produce physically different bytes — i.e. the order is honored, not ignored. Behavior unchanged; docstring only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PR-added thin wrapper (`_load_shard_index_maybe(...) or _ShardIndex.create_empty(...)`) with zero invocations anywhere in src/ or tests/. Unlike _encode_sync, this is genuinely removable: confirmed it is NOT a member of any runtime_checkable protocol or ABC (no reference in src/zarr/abc/, not a base-class override) and is reached by no dynamic dispatch (no getattr / string reference). main has no _load_shard_index* methods at all, so it was introduced and left unused by this PR. The _maybe and _maybe_sync variants it wrapped remain and are used. Verified: full sharding + nested-sharding + parity + pipeline suites stay green, ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ring The FusedCodecPipeline class docstring still described sharded writes as using "byte-range writes via set_range_sync" — but byte-range-write support was removed from this PR (set_range_sync / SupportsSetRange are gone). Sharded writes now take the codec's synchronous full-shard-rewrite path. Docstring only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This branch's docstrings/comments had introduced RST-style ``double-backtick`` inline literals, which this project does not use (plain single backticks only — no RST roles or double-backticks). Converted the 25 occurrences across the sharding codec, codec_pipeline, and fsspec store docstrings/comments to single backticks. Style only; no behavior change. Also confirmed (via git blame, this-branch lines only) there are no remaining references to removed/outdated designs: the byte-range-write (set_range) mentions and the "separating IO from compute" framing were already corrected earlier in this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… opt-in Pairs with the FusedCodecPipeline default: keep the new pipeline, but do NOT enable threading by default. `max_workers=None` (auto -> cpu_count) spawned a thread pool on every read/write, which is a behavior change with real downstream risk — it runs custom stores/codecs concurrently (thread-safety) and can oversubscribe many-core nodes whose workloads already parallelize at the dask/MPI layer. The default is now 1 (fully sequential: the pool is never created when max_workers <= 1). Parallelism is opt-in via `codec_pipeline.max_workers` (positive int, or None for auto). Updates _resolve_max_workers docstring and the config-defaults test accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
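A minimal sketch of the opt-in behavior described above, assuming a resolver that maps `None` to the CPU count and passes positive ints through. `resolve_max_workers` and `run` are hypothetical names and do not reproduce the PR's `_resolve_max_workers`.

```python
from __future__ import annotations

import os
from concurrent.futures import ThreadPoolExecutor


def resolve_max_workers(setting: int | None) -> int:
    """Hypothetical resolver: None means auto (cpu_count), ints pass through."""
    if setting is None:
        return os.cpu_count() or 1
    return setting


def run(tasks, setting: int | None = 1):
    n = resolve_max_workers(setting)
    if n <= 1:
        return [t() for t in tasks]  # sequential: no pool is ever created
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda t: t(), tasks))


tasks = [lambda i=i: i * i for i in range(4)]
sequential = run(tasks)   # default: max_workers=1
threaded = run(tasks, 4)  # opt in to threading
```

Both branches return results in task order, so switching the setting changes scheduling but not the output.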

codspeed-hq · 2026-06-06T20:12:08Z

Merging this PR will improve performance by ×3.1

⚡ 31 improved benchmarks
✅ 11 untouched benchmarks
🆕 120 new benchmarks
⏩ 30 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_sharded_morton_indexing_large[(32, 32, 32)-memory]`	9,186.1 ms	753.8 ms	×12
⚡	WallTime	`test_sharded_morton_indexing_large[(30, 30, 30)-memory]`	7,531.3 ms	620.9 ms	×12
⚡	WallTime	`test_sharded_morton_indexing_large[(33, 33, 33)-memory]`	10,022.6 ms	827.1 ms	×12
⚡	WallTime	`test_sharded_morton_indexing[(32, 32, 32)-memory]`	1,148.4 ms	95.5 ms	×12
⚡	WallTime	`test_sharded_morton_indexing[(16, 16, 16)-memory]`	144.1 ms	12.9 ms	×11
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory]`	409.8 ms	58.8 ms	×7
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory]`	209.5 ms	40.2 ms	×5.2
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory]`	405.2 ms	80 ms	×5.1
⚡	WallTime	`test_slice_indexing[None-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory]`	191.4 ms	46.6 ms	×4.1
⚡	WallTime	`test_slice_indexing[None-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory]`	350.7 ms	85.4 ms	×4.1
⚡	WallTime	`test_slice_indexing[None-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory]`	354.5 ms	88.7 ms	×4
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory]`	6.6 ms	2.4 ms	×2.8
⚡	WallTime	`test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory]`	3.3 ms	1.3 ms	×2.6
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency]`	208.1 ms	85.2 ms	×2.4
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency]`	406 ms	170.1 ms	×2.4
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency]`	410.8 ms	172.8 ms	×2.4
⚡	WallTime	`test_slice_indexing[None-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency]`	404 ms	179.7 ms	×2.2
⚡	WallTime	`test_slice_indexing[None-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency]`	408.7 ms	182.2 ms	×2.2
⚡	WallTime	`test_slice_indexing[None-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency]`	220 ms	98.1 ms	×2.2
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory]`	1,984.5 µs	929.1 µs	×2.1
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Comparing d-v-b:perf/prepared-write-v2 (298295f) with main (6ce787d)²

¹ 30 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.
² No successful run was found on main (b9d3964) during the generation of this report, so 6ce787d was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

…egression) CodSpeed flagged test_sharded_morton_write_single_chunk regressing ~38-39% (writing one 1x1x1 chunk into a 32^3 = 32768-chunk shard). Both main and this branch do a full shard rewrite for a partial write, so the rewrite itself is not the regression — and it is NOT the removed byte-range fast path (that path was gated out here anyway: write_empty_chunks defaults to False -> skip_empty=True). The cause: the sync _encode_partial_sync rebuilt the in-memory shard_dict with a per-coordinate __getitem__ loop over all 32768 chunks (O(n_chunks) Python overhead + try/except per chunk), whereas main's async _encode_partial_single builds the same dict with a single vectorized index lookup via _ShardReader.to_dict_vectorized. Switched the sync path to to_dict_vectorized (a plain, non-async method; _shard_reader_from_bytes_sync already returns a _ShardReader), matching the async path. The dict's key order is immaterial (the physical layout is decided downstream by the subchunk_write_order loop in _encode_shard_dict_sync), so the merge loop — which looks up by coordinate, not order — is unaffected. Local micro-benchmark (32^3 shard, single 1x1x1 chunk write): 59.4 -> 40.0 ms/write (~1.5x), matching the CodSpeed delta. Correctness: full sharding + pipeline-parity suites pass (581), so Fused still matches Batched byte-for-byte. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The deps=optional CI job (where cast_value_rs is installed) failed test_codec_pipeline_threads_dtype_through_evolve and several test_cast_value tests with "Invalid endianness: None" / "endian needs to be specified for multi-byte data types". Root cause: FusedCodecPipeline.evolve_from_array_spec evolved EVERY codec against the same original array_spec: evolved = tuple(c.evolve_from_array_spec(array_spec=array_spec) for c in self.codecs) When an array->array codec widens the dtype (e.g. cast_value int8 -> int16), the BytesCodec serializer was still evolved against the single-byte SOURCE dtype, so it stripped its `endian` to None (bytes.py treats single-byte dtypes as having no endianness) and then failed at decode time on the multi-byte data. BatchedCodecPipeline.evolve_from_array_spec already threads the spec forward (spec = evolved_codec.resolve_metadata(spec)); the Fused version did not. Fixed by mirroring the Batched threading. Also added test_evolve_threads_spec_preserving_serializer_endian: a dependency-free regression test (uses a minimal dtype-widening AA codec stub, no cast_value_rs) that runs on BOTH pipelines via the pipeline_class fixture. It fails on [sync] without this fix and passes with it — closing the gap where the only coverage required an optional dep and thus ran in no default env. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
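The bug and its fix can be reproduced with toy codecs. Nothing below is zarr code: `Spec`, `Widen`, and `ToyBytes` are hypothetical stand-ins for an array spec, a dtype-widening array->array codec, and a serializer that drops its endianness for single-byte elements.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Spec:
    itemsize: int  # bytes per element; stands in for the dtype


@dataclass(frozen=True)
class Widen:
    """Toy array->array codec: doubles the element size (e.g. int8 -> int16)."""

    def evolve(self, spec: Spec) -> Widen:
        return self

    def resolve(self, spec: Spec) -> Spec:
        return replace(spec, itemsize=spec.itemsize * 2)


@dataclass(frozen=True)
class ToyBytes:
    """Toy serializer: endianness is dropped for single-byte elements."""

    endian: str | None = "little"

    def evolve(self, spec: Spec) -> ToyBytes:
        return replace(self, endian=None) if spec.itemsize == 1 else self

    def resolve(self, spec: Spec) -> Spec:
        return spec


def evolve_same_spec(codecs, spec):
    # Buggy: every codec sees the original spec.
    return tuple(c.evolve(spec) for c in codecs)


def evolve_threaded(codecs, spec):
    # Fixed: each codec sees the spec produced by the codecs before it.
    out = []
    for c in codecs:
        evolved = c.evolve(spec)
        out.append(evolved)
        spec = evolved.resolve(spec)
    return tuple(out)


chain = (Widen(), ToyBytes())
buggy = evolve_same_spec(chain, Spec(itemsize=1))
fixed = evolve_threaded(chain, Spec(itemsize=1))
```

In the buggy variant the serializer is evolved against the one-byte source spec and loses its endianness, even though it will actually receive two-byte data.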

The endian bug fixed in the previous commit existed because the two codec pipelines duplicated the same conceptual logic and one copy drifted: Batched's evolve_from_array_spec threaded the spec forward correctly, Fused's did not. Duplicated logic that can silently diverge is a standing bug source, so extract the drift-prone pieces into single sources of truth that both pipelines delegate to (mirroring the existing resolve_aa_specs precedent):
- evolve_codecs(codecs, array_spec): the construction-time spec-threading loop. Both BatchedCodecPipeline and FusedCodecPipeline.evolve_from_array_spec now call it. There is now exactly one place this logic lives, so it cannot drift.
- pipeline_supports_partial_decode / _encode(ab, *, aa, bb, require_no_aa_bb): the partial-decode/encode predicate. Both pipelines delegate. The two pass DIFFERENT require_no_aa_bb values (Batched True, Fused False). That divergence is pre-existing and deliberately preserved here (not silently unified); it is now explicit at the call sites and documented in one function instead of being buried in two slightly-different inline isinstance checks.
Behavior-preserving: each pipeline computes exactly what it did before. Left the trivial fan-out loops (validate, compute_encoded_size) as-is, since deduping those would require changing the CodecPipeline ABC contract (abstract -> concrete) for near-zero drift benefit. Full pipeline/sharding/parity suites + the dtype-evolve regression test pass; ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

d-v-b · 2026-06-07T09:21:57Z

(claude wrote this)

Current state of this PR (reconciled against the original framing)

This branch has evolved substantially. Posting a current-state summary so the description can be updated.

⚠️ The original "hard IO/compute boundary" framing no longer describes this PR

This PR was opened as a phased pipeline built on separating IO from compute (preparatory IO → pure compute → final IO), with the suggestion that the spec be rewritten around that framing. That is not what shipped. The class was renamed PhasedCodecPipeline → FusedCodecPipeline, and its docstring now explicitly disclaims the original thesis:

The win is NOT "separating IO from compute" — the codecs (notably ShardingCodec) still perform their own storage IO.

The actual win is synchronous, fused per-chunk execution: eliminating BatchedCodecPipeline's ~one-coroutine-per-chunk async scheduling, which was the real dominant overhead — not IO/compute mixing. IO ownership stays with the codecs (the zarrs model). The PreparedWrite/phase abstractions from the first commit are not in the live code path. The "fetch exactly what we need" idea survives only as a read-path optimization inside the sharding codec (get_ranges_sync coalescing, vectorized whole-shard bulk decode), not as a pipeline-level phase.

So the PR description should be rewritten around "synchronous execution" rather than "separating IO from compute."

What it actually is now

- FusedCodecPipeline runs the codec chain synchronously when all codecs are sync-capable (SupportsSyncCodec); falls back to the async path (≡ BatchedCodecPipeline) for non-sync stores (ZipStore, fsspec).
- Intended to be behavior-identical to Batched; a pipeline-parity test battery asserts Fused ≡ Batched (same results, same store keys, byte-identical shards).

User-facing changes

- Default pipeline flipped to FusedCodecPipeline (the one genuinely user-facing change). One-line revert: zarr.config.set({"codec_pipeline.path": "zarr.core.codec_pipeline.BatchedCodecPipeline"}).
- Threading off by default (codec_pipeline.max_workers=1, sequential). Threading-by-default is deferred (downstream thread-safety; oversubscription on many-core nodes). Opt in with max_workers > 1.
- No public API additions; import surface unchanged.

Scope changes since opening

- Byte-range writes were removed from this PR (pending a store-interface decision); partial shard writes take the full-rewrite path. ⇒ the "depends on #3907" note is stale: #3907 (feat: add SupportsSetRange protocol and store implementations) is still open, but this PR no longer depends on it.
- #3908 (perf: cache default ArraySpec for regular chunk grids) is merged.

Performance (latest CodSpeed)

- ×3.1 overall, 0 regressed, 31 improved, 11 untouched. Sharded morton reads ~×11–12.
- A mid-development write regression (test_sharded_morton_write_single_chunk, −38–39%) was fixed (vectorized the partial-shard dict build), which also lifted the overall figure ×2.7 → ×3.1.

Hardening highlights

- Fixed a spec-threading bug where FusedCodecPipeline.evolve_from_array_spec evolved codecs against the original (not forward-threaded) spec, stripping BytesCodec.endian for dtype-widening cast_value chains. Root cause was duplicated logic drifting between the two pipelines; a follow-up refactor extracted the shared logic (evolve_codecs, pipeline_supports_partial_decode/encode) into freestanding functions so the class of bug can't recur. Added a dependency-free regression test on both pipelines.
- Shared CodecPipelineTests suite now runs every pipeline-agnostic behavior on both pipelines × sync/async stores.

Open items for review

- Update this description (IO/compute framing → synchronous-execution framing).
- Default flip: opt-in for a release first, or explicit risk sign-off (mitigated by parity tests + one-line revert).
- Downstream: numcodecs zarr3 passes; the xarray backend suite needs a clean run, and we are waiting on #4047 (Revert "fix: make xarray downstream tests work") for that.
- supports_partial_decode/encode divergence between the pipelines (deferred; behavior question).
- Pre-existing Batched transpose+sharding bug (on main too; deferred to a separate PR).

d-v-b · 2026-06-07T09:26:48Z

@ilan-gold this is ready for you to dig into

floriankrb · 2026-06-09T08:45:15Z

@d-v-b This looks promising. Is this branch ready to be tested regarding the regression we saw here ecmwf/anemoi-datasets#486?

d-v-b · 2026-06-09T08:59:23Z

@floriankrb yes if you have time to test it, that would be great

The pool dispatch in read_sync/write_sync (codec_pipeline.max_workers > 1) had zero functional test coverage (only config-default assertions existed), even though threading is the opt-in we point users at.
Adds:
- an end-to-end multi-chunk read/write roundtrip with max_workers=4 (verified the pool dispatch actually fires, not the sequential branch);
- worker-exception propagation tests for both write_sync (list-consumed pool.map) and read_sync (tuple-consumed pool.map): a store error raised in a pool worker must surface to the caller;
- a concurrent-decode test: transpose filter (so ChunkTransform._resolve_specs cache traffic actually occurs; with no AA codecs the cache is bypassed), pool workers decoding concurrently, plus an outer thread pool issuing overlapping reads. Pins correctness under concurrency around the shared transform's mutable cache.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

zarr_format=2 appeared in none of the pipeline test files: v2 arrays were only exercised implicitly through whichever pipeline is the global default. v2 goes through the V2Codec wrapper (numcodecs filters + compressor) — a different codec path than the v3 AA/AB/BB chain, with its own _encode_sync/_decode_sync under FusedCodecPipeline — so it deserves explicit coverage on BOTH pipelines and BOTH store kinds (sync fast path + async fallback). Adds v2-roundtrip (uncompressed) and v2-gzip-roundtrip (numcodecs.GZip — the v2 compressor spelling; v3 codec configs are rejected for v2 arrays) to SCENARIOS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Both _encode_shard_dict_sync and the async _encode_shard_dict encoded the shard index TWICE when index_location=start: encode to learn the length, shift the present chunks' offsets by it, then re-encode with corrected offsets. The index size is knowable without encoding: _shard_index_size() is already the byte-exact contract every index read path relies on (reads slice exactly that many bytes), so the layout can use absolute offsets from the start and the index is encoded once. Saves a full index encode (including its crc32c over the offsets array) per shard write with index_location=start.
The layout loop was also duplicated between the sync and async versions, the same drift surface that produced the evolve_from_array_spec endian bug. Both now delegate to a shared pure _build_shard_layout (offset math lives once) and _assemble_shard. A runtime guard verifies the encoded index length matches _shard_index_size rather than silently corrupting offsets if someone ever configures variable-size index codecs.
Verified: 606 tests pass including the pipeline-parity byte-identical shard assertions across index_location=start/end and every subchunk_write_order, and the sharded reopen tests; the on-disk layout is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
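The single-pass layout can be sketched as a pure function. This assumes a fixed-size index of 16 bytes per inner chunk plus a 4-byte checksum; that size formula is an assumption for illustration, and `index_size` and `build_layout` are hypothetical names rather than the PR's `_shard_index_size` and `_build_shard_layout`.

```python
from __future__ import annotations


def index_size(n_chunks: int, checksum_bytes: int = 4) -> int:
    """Assumed index size: 16 bytes (offset + nbytes) per chunk, plus a checksum."""
    return 16 * n_chunks + checksum_bytes


def build_layout(
    chunk_lengths: list[int | None], index_at_start: bool
) -> list[tuple[int, int] | None]:
    """Return absolute (offset, nbytes) per chunk; None marks an absent chunk."""
    # Knowing the index size up front means offsets are final on the first pass.
    cursor = index_size(len(chunk_lengths)) if index_at_start else 0
    layout: list[tuple[int, int] | None] = []
    for n in chunk_lengths:
        if n is None:
            layout.append(None)
            continue
        layout.append((cursor, n))
        cursor += n
    return layout


start_layout = build_layout([10, None, 5], index_at_start=True)
end_layout = build_layout([10, None, 5], index_at_start=False)
```

With three inner chunks the assumed index occupies 52 bytes, so the first present chunk starts at offset 52 when the index is at the start and at offset 0 when it is at the end.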

read_missing_chunks exists to help consumers distinguish a transport error from a truly missing chunk. That distinction is a STORE-KEY-level concept: a missing shard key raises ChunkNotFoundError. It does not cleanly apply to inner subchunks of a shard that was fetched successfully. There is no transport ambiguity there; the shard index simply records the subchunk as absent, so those fill with the fill value rather than raising.

Both pipelines already implement exactly this (verified empirically), but nothing pinned it, so the asymmetry vs unsharded arrays read as a bug in review. This adds a shared-suite test (both pipelines x sync/async stores) asserting both sides: missing shard key raises; missing inner subchunk of an existing shard fills.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
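The two-sided behavior pinned by the test above can be modelled with plain dictionaries. This is a toy sketch, not zarr's implementation: the store maps shard keys to a dict of present inner chunks, and `ChunkNotFoundError`, `read_inner_chunk`, and `FILL_BYTE` are local, hypothetical stand-ins.

```python
class ChunkNotFoundError(KeyError):
    """Local stand-in for the error raised when a store key is missing."""


FILL_BYTE = 0  # hypothetical fill value, one byte per element


def read_inner_chunk(
    store: dict[str, dict[int, bytes]],
    shard_key: str,
    inner_index: int,
    chunk_len: int = 4,
) -> bytes:
    # Store-key level: a missing shard is ambiguous to the caller, so raise.
    if shard_key not in store:
        raise ChunkNotFoundError(shard_key)
    shard = store[shard_key]
    # Inside a fetched shard the index is authoritative: absent means fill.
    if inner_index not in shard:
        return bytes([FILL_BYTE]) * chunk_len
    return shard[inner_index]
```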

* chore: add parallelism TODOs
* perf: non-sharded reads
* chore: code duplication
* fix: arguments dont spread themselves!
* fix: bring in suggested guard
* perf: don't block on pool ops
* chore: docstring + materialize early

---------

Co-authored-by: ilan-gold <ilanbassgold@gmail.com>

…line

Addresses the open question on this PR about sync/async byte getters, benchmark-guided as discussed.

_ShardingByteGetter/_ShardingByteSetter are in-memory dict wrappers but presented only an async API, so the nested codec_pipeline.read over inner chunks fell to the async fallback: one concurrent_map coroutine per inner chunk for a dict lookup (~2.1 us/chunk pure asyncio overhead, ~8.8 ms per 4096-chunk shard). On top of that, the nested pipeline (ShardingCodec.codec_pipeline) is built by from_codecs and never evolved, so its sync transform was always None: inner chunks also paid per-chunk AsyncChunkTransform coroutines, and the decode_sync/encode_sync fallback improvements in this PR could not reach them.

Changes:
- SyncByteGetter / SyncByteSetter runtime protocols in zarr.abc.store (resurrecting the design from the original perf/prepared-write experiments, same names and shape). StorePath matches structurally but is still gated on its STORE's sync support; the protocols gate non-StorePath byte getters.
- _ShardingByteGetter/Setter implement get_sync/set_sync/delete_sync; the async methods delegate to the sync ones (single implementation).
- ShardingCodec._get_inner_pipeline(shard_spec): the nested pipeline evolved against the inner chunk spec (threads specs through the inner chain AND builds the sync transform). The four nested read/write call sites use it.
- FusedCodecPipeline.read/write gates accept non-StorePath SyncByteGetter/SyncByteSetter, so nested inner-chunk IO takes read_sync/write_sync.
- _decode_shard_index/_encode_shard_index delegate to their sync twins (pure compute; kills a per-shard AsyncChunkTransform round-trip and a sync/async duplication).

Benchmark (4096 inner chunks per shard, LatencyStore@0, i.e. the async-fallback path = sharded data on remote stores), vs this PR's head:
- uncompressed read: 44.2 -> 28.9 ms (1.53x)
- gzip read: 182.0 -> 50.9 ms (3.6x)
- writes: unchanged (~34 / ~70 ms), already optimized by this PR's encode_sync fallback (whole-shard sync encode, no byte setters involved).

Adds a regression test asserting sharded fallback reads route inner chunks through the sync fast path (read_sync engaged, zero AsyncChunkTransform calls); verified it fails if the gate is removed. Full sharding + parity + pipeline suites pass (619).

Assisted-by: ClaudeCode:claude-fable-5
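The gating idea in the commit above (take the sync path whenever a byte getter structurally offers one) can be sketched with a runtime-checkable protocol. This is a simplified illustration: the protocol here has a single argument-free `get_sync`, which is an assumption and not the signature used in zarr.abc.store, and `DictByteGetter`, `AsyncOnlyGetter`, and `read_chunk` are hypothetical names.

```python
from __future__ import annotations

import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SyncByteGetter(Protocol):
    # Simplified, assumed shape: the real protocol may take more arguments.
    def get_sync(self) -> bytes | None: ...


class DictByteGetter:
    """In-memory wrapper: one sync implementation, async delegates to it."""

    def __init__(self, shard: dict[tuple[int, ...], bytes], coords: tuple[int, ...]) -> None:
        self._shard = shard
        self._coords = coords

    def get_sync(self) -> bytes | None:
        return self._shard.get(self._coords)

    async def get(self) -> bytes | None:
        return self.get_sync()


class AsyncOnlyGetter:
    """A getter with no sync method, forcing the async fallback."""

    def __init__(self, value: bytes) -> None:
        self._value = value

    async def get(self) -> bytes | None:
        return self._value


def read_chunk(getter: object) -> bytes | None:
    # Structural check: no coroutine is created for a plain dict lookup.
    if isinstance(getter, SyncByteGetter):
        return getter.get_sync()
    return asyncio.run(getter.get())  # fallback for async-only getters
```

A `runtime_checkable` protocol only checks that the method exists, not its signature, which is consistent with the commit's note that StorePath matches structurally and needs an additional gate on its store's sync support.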

… cache inner pipeline

The pipeline test suites have failed intermittently all along on an arbitrary test that passes in isolation. Root cause (allocation site verified with PYTHONTRACEMALLOC): pytest-asyncio implicitly creates an event loop in _get_event_loop_no_warn during fixture setup/teardown and never closes it. When GC reclaims that loop (or its self-pipe socketpair) mid-test, pytest's unraisable hook converts the ResourceWarning into a failure of whichever unrelated test happens to be running. The sync-bytegetter change increased per-shard-op allocation churn enough to make this near-deterministic, which is how it was finally traced.

Two changes:
- pyproject filterwarnings: narrowly ignore the two unraisable shapes (BaseEventLoop.__del__, AF_UNIX socketpair), mirroring the existing s3fs/aiobotocore entry. These are not zarr's loops.
- ShardingCodec._get_inner_pipeline is now memoized per (pipeline class, shard_spec); previously, evolving built a ChunkTransform on every shard operation. The pipeline class is part of the key so codec_pipeline.path config changes are still honored.

The battery that previously failed ~every run now passes 3x consecutively (635).

Assisted-by: ClaudeCode:claude-fable-5
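The memoization keyed on (pipeline class, shard_spec) mentioned above can be sketched with `functools.lru_cache`. This is an assumption about the general technique, not the PR's code: `ShardSpec`, `Pipeline`, `FusedPipeline`, `BatchedPipeline`, and `get_inner_pipeline` are hypothetical, and the spec is modelled as a frozen dataclass so that it is hashable.

```python
from __future__ import annotations

from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ShardSpec:
    # Frozen dataclasses are hashable, so equal specs share a cache entry.
    chunk_shape: tuple[int, ...]
    dtype: str


class Pipeline:
    def __init__(self, spec: ShardSpec) -> None:
        self.spec = spec  # stands in for the expensive evolve/transform build


class FusedPipeline(Pipeline):
    pass


class BatchedPipeline(Pipeline):
    pass


@lru_cache(maxsize=None)
def get_inner_pipeline(pipeline_cls: type[Pipeline], shard_spec: ShardSpec) -> Pipeline:
    # The class is part of the key, so switching the configured pipeline
    # class yields a fresh pipeline instead of a stale cached one.
    return pipeline_cls(shard_spec)
```

Repeated shard operations with an equal spec then reuse one pipeline object, while a change of pipeline class misses the cache by construction.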

…hain evolve

Continues the anti-skew work: where the sync and async sharding paths implemented the same logic twice, extract a single source of truth so the copies cannot drift (the mechanism behind the pipeline-level endian bug).

- _get_inner_chunk_transform / _get_index_chunk_transform now evolve their codec chains via evolve_codecs (spec THREADED forward). Both previously evolved every codec against the same unthreaded spec, the exact bug shape that stripped BytesCodec.endian at the pipeline level, latent here for any spec-changing inner codec. Both are also now actually memoized (the inner transform's docstring claimed a cache that did not exist; transforms were rebuilt per call).
- New regression test (dependency-free dtype-widening stub codec) asserting the inner serializer keeps its endian; verified it fails on the unthreaded version.
- _shard_index_byte_range(): the index-location byte-range arithmetic existed verbatim in both _load_shard_index_maybe and its _sync twin; now one helper.
- _pair_chunks_with_byte_ranges(): the chunk-coord/byte-range pairing loop existed verbatim in both _load_partial_shard_maybe and its _sync twin; now one helper.

Deliberately NOT unified: the small hand-rolled loops remaining in _decode_sync/_encode_sync vs their async twins. Post sync-bytegetter work the async versions are thin delegations to the (shared, evolved) nested pipeline, so the heavy machinery (evolve, transforms, layout, index codecs) is already single-sourced; force-merging the residual loops would couple different missing-chunk/concurrency semantics for little drift-surface gain. The pipeline-parity suite guards their behavioral equivalence.

Full battery passes twice (636); ruff + mypy clean.

Assisted-by: ClaudeCode:claude-fable-5
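The difference between threaded and unthreaded evolution described above can be shown with two stub codecs. This is a minimal sketch with hypothetical, simplified method names (`evolve`, `resolve`) and stub classes; it does not reproduce zarr's codec API, only the bug shape: a serializer that drops its endian for single-byte dtypes, placed after a codec that widens the dtype.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArraySpec:
    dtype: str


@dataclass(frozen=True)
class WidenCodec:
    """Stub array-to-array codec that widens int8 to int16."""

    def evolve(self, spec: ArraySpec) -> WidenCodec:
        return self

    def resolve(self, spec: ArraySpec) -> ArraySpec:
        return replace(spec, dtype="int16")


@dataclass(frozen=True)
class StubBytesCodec:
    """Stub serializer: endian is meaningless for single-byte dtypes."""

    endian: str | None = "little"

    def evolve(self, spec: ArraySpec) -> StubBytesCodec:
        return replace(self, endian=None) if spec.dtype == "int8" else self

    def resolve(self, spec: ArraySpec) -> ArraySpec:
        return spec


def evolve_threaded(codecs: tuple, spec: ArraySpec) -> tuple:
    evolved = []
    for codec in codecs:
        codec = codec.evolve(spec)
        evolved.append(codec)
        spec = codec.resolve(spec)  # the next codec sees this codec's output spec
    return tuple(evolved)


def evolve_unthreaded(codecs: tuple, spec: ArraySpec) -> tuple:
    # Bug shape: every codec is evolved against the original input spec.
    return tuple(codec.evolve(spec) for codec in codecs)
```

With an int8 input, the threaded version lets the serializer see int16 and keep its endian, while the unthreaded version evolves it against int8 and strips it.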

d-v-b added 4 commits April 7, 2026 10:38

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into perf/prepared-write-v2

a072c31

feat: new codec pipeline that uses sync path

47a407f

feat: complete second codecpipeline

3c27e49

github-actions Bot added the "needs release notes" label (Automatically applied to PRs which haven't added release notes) Apr 8, 2026

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into perf/prepared-write-v2

9b834a4

d-v-b added 2 commits April 8, 2026 19:51

fix: handle rectilinear chunks

c731cf2

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into perf/prepared-write-v2

9e25150

This was referenced Apr 8, 2026

perf/prepared write #3727 (Closed)

perf/sharding chunk transform #3729 (Closed)

perf/chunkrequest #3730 (Closed)

sketch out sync codecs + threadpool #3715 (Closed)

fixup

ae0580c

d-v-b mentioned this pull request Apr 9, 2026

[do not merge] benchmarks + tests for phased codecpipeline #3891 (Open)

d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 Compare April 15, 2026 09:51

github-actions Bot removed the "needs release notes" label (Automatically applied to PRs which haven't added release notes) Apr 15, 2026

d-v-b force-pushed the perf/prepared-write-v2 branch 2 times, most recently from a84a15a to 68a7cdc Compare April 17, 2026 10:41

ilan-gold reviewed Apr 17, 2026


d-v-b and others added 6 commits April 17, 2026 22:51

chore: gitignore local agent/planning notes

1be5563

Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/ so local IDE/agent planning artifacts don't get committed by accident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

d-v-b force-pushed the perf/prepared-write-v2 branch from aa111a2 to 1be5563 Compare April 17, 2026 21:04

ilan-gold reviewed Apr 18, 2026


Merge branch 'main' into perf/prepared-write-v2

985716b

d-v-b and others added 8 commits June 5, 2026 18:43

d-v-b added the "run-downstream" label (Run the tests of downstream libraries (e.g., xarray) against zarr) Jun 6, 2026

Merge branch 'main' into perf/prepared-write-v2

558365f

d-v-b added the "benchmark" label (Code will be benchmarked in a CI job.) Jun 6, 2026

d-v-b added and removed the "benchmark" label (Code will be benchmarked in a CI job.) Jun 6, 2026

d-v-b and others added 2 commits June 6, 2026 23:05

Merge branch 'main' into perf/prepared-write-v2

9ca6932

d-v-b and others added 8 commits June 10, 2026 15:15


perf: phased codecpipeline #3885

d-v-b wants to merge 67 commits into zarr-developers:main from d-v-b:perf/prepared-write-v2


4 participants

Conversation

d-v-b commented Apr 8, 2026 (edited)

codecov Bot commented Apr 8, 2026 (edited)

Codecov Report

d-v-b commented Apr 9, 2026

ilan-gold Apr 17, 2026 (edited)

d-v-b Apr 17, 2026

ilan-gold Apr 18, 2026 (edited)

ilan-gold Apr 18, 2026

codspeed-hq Bot commented Jun 6, 2026 (edited)

Merging this PR will improve performance by ×3.1

Performance Changes

Footnotes

d-v-b commented Jun 7, 2026 (edited)

Current state of this PR (reconciled against the original framing)

⚠️ The original "hard IO/compute boundary" framing no longer describes this PR

What it actually is now

User-facing changes

Scope changes since opening

Performance (latest CodSpeed)

Hardening highlights

Open items for review

d-v-b commented Jun 7, 2026

floriankrb commented Jun 9, 2026

d-v-b commented Jun 9, 2026
