xarray-contrib · brendancol · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
@@ -33,7 +33,7 @@ normalize,2026-03-31T18:00:00Z,SAFE,compute-bound,0,1124,Boolean indexing replac
 pathfinding,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. CuPy .get() is required -- A* has no GPU kernel. Per-pixel .compute() is only 2 calls for start/goal validation. seg.values in multi_stop_search collects already-computed results for stitching.
 perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,,
 polygon_clip,2026-06-10,SAFE,graph-bound,0,3191,"crop=True picked tiny leading edge chunk as rasterize mask size -> 13169-task graph; fixed to max(rc),max(cc) -> 1045 tasks. crop=False/numpy/cupy clean. Cat1-5 clean. GPU+dask+cupy run-validated."
-polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification."
+polygonize,2026-06-12,RISKY,compute-bound,0,3303,"Pass 3 (2026-06-12): re-audit after #2817/#2913/#3041. 0 HIGH. 1 MEDIUM fixed (#3303): _compute_region_value_ranges ran a pure-Python per-pixel loop (95% of float chunk time; 0.283s of 0.299s on 1024x1024, float chunks ~30x int) and re-ran _calculate_regions on an already-labelled block; moved to jitted _region_ranges_scan + _polygonize_numpy_regions label reuse (0.299s -> 0.015s/chunk). Side fix: w_match/s_match flags were always-truthy (_is_close numba overload generator called from pure Python returns impl function); output-neutral by chunk geometry, now computed correctly in jit. Cat1/2 clean (dask.compute batching is the documented #2673 design). Cat3 validated on GPU: cupy int/float + dask+cupy run end-to-end, single documented transfer, no round-trip. Cat4/5 LOW unchanged: _calculate_regions_cupy per-unique-value labeling (low impact); per-polygon Python classify loop in _polygonize_chunk dominates only on pathological many-polygon chunks (788K polys -> 7.8s). Cat6 RISKY unchanged: driver accumulates O(total polygons); 32-chunk batches bound transient peak. 527 polygonize tests + 40 new pass."
 proximity,2026-06-09,RISKY,graph-bound,0,3103,"Pass 2 (2026-06-09): re-audit after 16 fix commits since 2026-03-31. 0 HIGH, 2 MEDIUM found and fixed: (1) #3103/PR #3126 line-sweep @ngjit closure inside _process recompiled per call (~0.42s constant overhead; 10x10 warm call 0.44s->1ms after module-level hoist with explicit args, 1000x1000 0.49s->35ms); (2) #3132/PR #3137 dask xs/ys coordinate grids built via da.tile/da.repeat+rechunk cost ~185 tasks/chunk with the ys term scaling O(raster height) (~4.3 tasks/row, 44K tasks at 10240 rows); chunk-aligned da.broadcast_to gives identical values, bounded graph 18535->5554 tasks (3.3x) on 2560^2/256 chunks; regression test bounds tasks/chunk<80 (old 100.4, new 58.7) + ragged-chunk parity. LOW not fixed: zeros+fill(-1) row buffers in line-sweep; numpy backend materializes full float64 xs/ys grids (guarded since #1111); unbounded KDTree streaming count pass computes chunks on driver by design (gh-879). GPU validated on CUDA host: cupy 1024^2 proximity 6ms device-resident with exact numpy parity, dask+cupy bounded parity exact, _proximity_cuda_kernel 56 regs/thread (no register pressure). _halo_depth python loop measured 58ms at 100K coords - not a finding. Verdict RISKY (was WILL OOM): unbounded paths either guarded (MemoryError at 80% mem) or stream via kdtree; bounded map_overlap peak scales with chunk size."
 rasterize,2026-06-09,SAFE,compute-bound,0,3107,"Pass 4 (2026-06-09): re-audit found 2 MEDIUM Cat-4 allocation findings, 0 HIGH. (a) all four backends return via astype(dtype) which copies the float64 work buffer even when dtype is already float64 (the default) -- _run_numpy L1237, _run_cupy L2211, _rasterize_tile_numpy L2460, _rasterize_tile_cupy L2688; fix astype(dtype, copy=False). (b) CPU paths allocate order as full-raster int64 (8 B/px) for every merge mode but only first/last predicates read it; for _should_write_any merges (max/min/sum/count, user callables) an int8 buffer suffices (numba wraps the dead int64 store) -- _run_numpy L1188, _rasterize_tile_numpy L2420. tracemalloc 4000x4000 numpy merge=sum: peak 25 B/px -> 10 B/px expected (out 8 + written 1 + order 1); merge=last 25 -> 17 B/px. Filed #3107, fixed via deep-sweep rockout. GPU validated on host (CUDA available): cupy 512x512 last/sum/max returns cupy.ndarray with CPU parity, dask+cupy sum parity True, no host round trip. Dask graph probe: 2560x2560 chunks=256 -> 400 tasks / 100 chunks (4.0 tasks/chunk, unchanged). LOW (not fixed, documented): _extract_polygon_boundary_segments int variant L702 is dead code (only the _float variant is called). SAFE/compute-bound: per-tile buffers scale with chunk size; scanline/burn JIT kernels dominate runtime. | Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
 reproject,2026-06-12,SAFE,compute-bound,0,3267 3268,"Pass 7 (2026-06-12, deep-sweep): 0 HIGH, 2 MEDIUM found and fixed. #3267: in-memory numpy path held ~7 output-sized float64 temporaries and dask promotion keyed on input size only -- measured 8.4 GB peak RSS for a 1.15 GB output (7.3x); fix computes the grid first, promotes on output size too (mirrors merge #3048 pattern), re-applies the pixel guard for materializing paths, and fixes the 3-D chunks tuple in the dask wrap (post-fix peak 0.5 GB, lazy). #3268: multi-band cupy CPU-fallback transform path re-uploaded local_row/local_col per band via cp.asarray inside _resample_cupy_native (measured 12 H2D uploads for 6 bands, 26.1 MB vs 4.4 MB needed); fix hoists the device conversion before the band loop. LOW (documented, not fixed): _resample_cupy_native redundant copy+scan when the caller pre-converted nodata (+149% on the resample step, hits _reproject_dask_cupy fast path with non-NaN nodata); geoid_height_raster allocates full HxW meshgrid x2 plus output from dims (no strips, no dask path). Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, single blockwise layer, lazy); merge 2 inputs -> 16 tasks. GPU validated on host (CUDA available): cupy 2048^2 4326->3857 in 23 ms on-device; dask+cupy eager fast path matches in-memory exactly. SAFE/compute-bound holds. | Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."

diff --git a/benchmarks/benchmarks/polygonize.py b/benchmarks/benchmarks/polygonize.py
@@ -70,6 +70,27 @@ def time_polygonize(self, nx):
         polygonize(self.raster, mask=self.mask)
 
 
+class PolygonizeDaskFloat:
+    """Benchmark the float dask path: per-chunk jitted ranges scan plus
+    cross-chunk merge (#3303)."""
+    params = ([100, 300, 1000],)
+    param_names = ("nx",)
+
+    def setup(self, nx):
+        try:
+            import dask.array as da
+        except ImportError:
+            raise NotImplementedError("dask not available")
+        ny = nx // 2
+        rng = np.random.default_rng(9461713)
+        raster = rng.integers(low=0, high=4, size=(ny, nx)).astype(np.float64)
+        chunks = (max(ny // 4, 1), max(nx // 4, 1))
+        self.raster = xr.DataArray(da.from_array(raster, chunks=chunks))
+
+    def time_polygonize(self, nx):
+        polygonize(self.raster)
+
+
 class PolygonizeCCL:
     """Benchmark connected-component labeling phase."""
     params = ([100, 300, 1000], ["numpy", "cupy"])