diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv index 80c03143c..21dff2d1e 100644 --- a/.claude/sweep-performance-state.csv +++ b/.claude/sweep-performance-state.csv @@ -33,7 +33,7 @@ normalize,2026-03-31T18:00:00Z,SAFE,compute-bound,0,1124,Boolean indexing replac pathfinding,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. CuPy .get() is required -- A* has no GPU kernel. Per-pixel .compute() is only 2 calls for start/goal validation. seg.values in multi_stop_search collects already-computed results for stitching. perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,, polygon_clip,2026-06-10,SAFE,graph-bound,0,3191,"crop=True picked tiny leading edge chunk as rasterize mask size -> 13169-task graph; fixed to max(rc),max(cc) -> 1045 tasks. crop=False/numpy/cupy clean. Cat1-5 clean. GPU+dask+cupy run-validated." -polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification." +polygonize,2026-06-12,RISKY,compute-bound,0,3303,"Pass 3 (2026-06-12): re-audit after #2817/#2913/#3041. 0 HIGH. 1 MEDIUM fixed (#3303): _compute_region_value_ranges ran a pure-Python per-pixel loop (95% of float chunk time; 0.283s of 0.299s on 1024x1024, float chunks ~30x int) and re-ran _calculate_regions on an already-labelled block; moved to jitted _region_ranges_scan + _polygonize_numpy_regions label reuse (0.299s -> 0.015s/chunk). Side fix: w_match/s_match flags were always-truthy (_is_close numba overload generator called from pure Python returns impl function); output-neutral by chunk geometry, now computed correctly in jit. Cat1/2 clean (dask.compute batching is the documented #2673 design). Cat3 validated on GPU: cupy int/float + dask+cupy run end-to-end, single documented transfer, no round-trip. Cat4/5 LOW unchanged: _calculate_regions_cupy per-unique-value labeling (low impact); per-polygon Python classify loop in _polygonize_chunk dominates only on pathological many-polygon chunks (788K polys -> 7.8s). Cat6 RISKY unchanged: driver accumulates O(total polygons); 32-chunk batches bound transient peak. 527 polygonize tests + 40 new pass." proximity,2026-06-09,RISKY,graph-bound,0,3103,"Pass 2 (2026-06-09): re-audit after 16 fix commits since 2026-03-31. 0 HIGH, 2 MEDIUM found and fixed: (1) #3103/PR #3126 line-sweep @ngjit closure inside _process recompiled per call (~0.42s constant overhead; 10x10 warm call 0.44s->1ms after module-level hoist with explicit args, 1000x1000 0.49s->35ms); (2) #3132/PR #3137 dask xs/ys coordinate grids built via da.tile/da.repeat+rechunk cost ~185 tasks/chunk with the ys term scaling O(raster height) (~4.3 tasks/row, 44K tasks at 10240 rows); chunk-aligned da.broadcast_to gives identical values, bounded graph 18535->5554 tasks (3.3x) on 2560^2/256 chunks; regression test bounds tasks/chunk<80 (old 100.4, new 58.7) + ragged-chunk parity. LOW not fixed: zeros+fill(-1) row buffers in line-sweep; numpy backend materializes full float64 xs/ys grids (guarded since #1111); unbounded KDTree streaming count pass computes chunks on driver by design (gh-879). GPU validated on CUDA host: cupy 1024^2 proximity 6ms device-resident with exact numpy parity, dask+cupy bounded parity exact, _proximity_cuda_kernel 56 regs/thread (no register pressure). _halo_depth python loop measured 58ms at 100K coords - not a finding. Verdict RISKY (was WILL OOM): unbounded paths either guarded (MemoryError at 80% mem) or stream via kdtree; bounded map_overlap peak scales with chunk size." rasterize,2026-06-09,SAFE,compute-bound,0,3107,"Pass 4 (2026-06-09): re-audit found 2 MEDIUM Cat-4 allocation findings, 0 HIGH. (a) all four backends return via astype(dtype) which copies the float64 work buffer even when dtype is already float64 (the default) -- _run_numpy L1237, _run_cupy L2211, _rasterize_tile_numpy L2460, _rasterize_tile_cupy L2688; fix astype(dtype, copy=False). (b) CPU paths allocate order as full-raster int64 (8 B/px) for every merge mode but only first/last predicates read it; for _should_write_any merges (max/min/sum/count, user callables) an int8 buffer suffices (numba wraps the dead int64 store) -- _run_numpy L1188, _rasterize_tile_numpy L2420. tracemalloc 4000x4000 numpy merge=sum: peak 25 B/px -> 10 B/px expected (out 8 + written 1 + order 1); merge=last 25 -> 17 B/px. Filed #3107, fixed via deep-sweep rockout. GPU validated on host (CUDA available): cupy 512x512 last/sum/max returns cupy.ndarray with CPU parity, dask+cupy sum parity True, no host round trip. Dask graph probe: 2560x2560 chunks=256 -> 400 tasks / 100 chunks (4.0 tasks/chunk, unchanged). LOW (not fixed, documented): _extract_polygon_boundary_segments int variant L702 is dead code (only the _float variant is called). SAFE/compute-bound: per-tile buffers scale with chunk size; scanline/burn JIT kernels dominate runtime. | Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset." reproject,2026-06-12,SAFE,compute-bound,0,3267 3268,"Pass 7 (2026-06-12, deep-sweep): 0 HIGH, 2 MEDIUM found and fixed. #3267: in-memory numpy path held ~7 output-sized float64 temporaries and dask promotion keyed on input size only -- measured 8.4 GB peak RSS for a 1.15 GB output (7.3x); fix computes the grid first, promotes on output size too (mirrors merge #3048 pattern), re-applies the pixel guard for materializing paths, and fixes the 3-D chunks tuple in the dask wrap (post-fix peak 0.5 GB, lazy). #3268: multi-band cupy CPU-fallback transform path re-uploaded local_row/local_col per band via cp.asarray inside _resample_cupy_native (measured 12 H2D uploads for 6 bands, 26.1 MB vs 4.4 MB needed); fix hoists the device conversion before the band loop. LOW (documented, not fixed): _resample_cupy_native redundant copy+scan when the caller pre-converted nodata (+149% on the resample step, hits _reproject_dask_cupy fast path with non-NaN nodata); geoid_height_raster allocates full HxW meshgrid x2 plus output from dims (no strips, no dask path). Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, single blockwise layer, lazy); merge 2 inputs -> 16 tasks. GPU validated on host (CUDA available): cupy 2048^2 4326->3857 in 23 ms on-device; dask+cupy eager fast path matches in-memory exactly. SAFE/compute-bound holds. | Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep." diff --git a/benchmarks/benchmarks/polygonize.py b/benchmarks/benchmarks/polygonize.py index 3ed53bd00..05fbadbf9 100644 --- a/benchmarks/benchmarks/polygonize.py +++ b/benchmarks/benchmarks/polygonize.py @@ -70,6 +70,27 @@ def time_polygonize(self, nx): polygonize(self.raster, mask=self.mask) +class PolygonizeDaskFloat: + """Benchmark the float dask path: per-chunk jitted ranges scan plus + cross-chunk merge (#3303).""" + params = ([100, 300, 1000],) + param_names = ("nx",) + + def setup(self, nx): + try: + import dask.array as da + except ImportError: + raise NotImplementedError("dask not available") + ny = nx // 2 + rng = np.random.default_rng(9461713) + raster = rng.integers(low=0, high=4, size=(ny, nx)).astype(np.float64) + chunks = (max(ny // 4, 1), max(nx // 4, 1)) + self.raster = xr.DataArray(da.from_array(raster, chunks=chunks)) + + def time_polygonize(self, nx): + polygonize(self.raster) + + class PolygonizeCCL: """Benchmark connected-component labeling phase.""" params = ([100, 300, 1000], ["numpy", "cupy"]) diff --git a/xrspatial/polygonize.py b/xrspatial/polygonize.py index 05b01f179..93b00caa8 100644 --- a/xrspatial/polygonize.py +++ b/xrspatial/polygonize.py @@ -1126,14 +1126,17 @@ def _to_geojson( return {"type": "FeatureCollection", "features": features} -def _polygonize_numpy( +def _polygonize_numpy_regions( values: np.ndarray, mask: Optional[np.ndarray], connectivity_8: bool, transform: Optional[np.ndarray], atol: float = _DEFAULT_ATOL, rtol: float = _DEFAULT_RTOL, -) -> Tuple[List[Union[int, float]], List[List[np.ndarray]]]: +) -> Tuple[List[Union[int, float]], List[List[np.ndarray]], np.ndarray]: + """Like :func:`_polygonize_numpy` but also returns the flat region + labels so callers (the dask chunk path) can reuse them instead of + re-running ``_calculate_regions`` on the same block (#3303).""" ny, nx = values.shape @@ -1165,6 +1168,19 @@ def _polygonize_numpy( column, polygon_points = _scan( regions, values_flat, mask_flat, connectivity_8, transform, nx, ny) + return column, polygon_points, regions + + +def _polygonize_numpy( + values: np.ndarray, + mask: Optional[np.ndarray], + connectivity_8: bool, + transform: Optional[np.ndarray], + atol: float = _DEFAULT_ATOL, + rtol: float = _DEFAULT_RTOL, +) -> Tuple[List[Union[int, float]], List[List[np.ndarray]]]: + column, polygon_points, _ = _polygonize_numpy_regions( + values, mask, connectivity_8, transform, atol=atol, rtol=rtol) return column, polygon_points @@ -1316,7 +1332,7 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset, if mask_block is not None: mask_block = _to_numpy(mask_block) ny, nx = block.shape - column, polygon_points = _polygonize_numpy( + column, polygon_points, regions = _polygonize_numpy_regions( block, mask_block, connectivity_8, transform=None, atol=atol, rtol=rtol) @@ -1331,7 +1347,7 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset, val_ranges, boundary_cells = _compute_region_value_ranges( block, mask_block, connectivity_8, atol, rtol, column, row_offset=row_offset, col_offset=col_offset, - ny_total=ny_total, nx_total=nx_total) + ny_total=ny_total, nx_total=nx_total, regions=regions) else: val_ranges = [(c, c) for c in column] boundary_cells = [{} for _ in column] @@ -1372,16 +1388,113 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset, return interior, boundary +# Per-pixel companion kernel for _compute_region_value_ranges (#3303). +# Aggregates per-region (min, max) values and collects the pixels lying on +# internal chunk edges, recording numpy CCL's direct W / S close test for +# each. Runs jitted because the previous pure-Python version of this loop +# dominated float dask chunk time (~95% of _polygonize_chunk). Calling +# _is_close here (in nopython mode) also gives the real boolean result; +# from pure Python the numba overload generator returns its implementation +# function instead of a bool, so the old loop recorded always-truthy +# w_match / s_match flags. +@ngjit +def _region_ranges_scan( + regions: np.ndarray, # _regions_dtype, shape (nx_eff*ny,) + values_flat: np.ndarray, # shape (nx_eff*ny,) + mask_flat: Optional[np.ndarray], # shape (nx_eff*ny,) + n_regions: int, + nx: int, # original (un-padded) row length + ny: int, + nx_eff: int, # row length after nx==1 padding + left_internal: bool, + right_internal: bool, + bottom_internal: bool, + top_internal: bool, + atol: float, + rtol: float, +): + val_min = np.empty(n_regions, dtype=values_flat.dtype) + val_max = np.empty(n_regions, dtype=values_flat.dtype) + has_value = np.zeros(n_regions, dtype=np.bool_) + + # Internal-edge pixels live on at most four one-pixel-wide strips. + max_cells = 2 * (nx + ny) + cell_ij = np.empty(max_cells, dtype=np.int64) + cell_w = np.empty(max_cells, dtype=np.bool_) + cell_s = np.empty(max_cells, dtype=np.bool_) + n_cells = 0 + + for ij in range(len(regions)): + r = regions[ij] + if r == 0: + continue + if mask_flat is not None and not mask_flat[ij]: + continue + idx = r - 1 + if idx >= n_regions: + continue + v = values_flat[ij] + if not has_value[idx]: + has_value[idx] = True + val_min[idx] = v + val_max[idx] = v + else: + if v < val_min[idx]: + val_min[idx] = v + if v > val_max[idx]: + val_max[idx] = v + + local_col = ij % nx_eff + local_row = ij // nx_eff + # nx==1 padded a dummy column; skip it (only local_col 0 is real). + if nx == 1 and local_col != 0: + continue + on_internal = ( + (left_internal and local_col == 0) or + (right_internal and local_col == nx - 1) or + (bottom_internal and local_row == 0) or + (top_internal and local_row == ny - 1)) + if on_internal: + # Record whether numpy CCL's per-pixel close test matched + # this pixel against its in-chunk W / S orthogonal + # neighbour. This mirrors the ``matches_W`` / ``matches_S`` + # predicates in :func:`_calculate_regions` exactly -- it is + # the direct ``_is_close(cur, neighbour)`` test, NOT region + # co-membership (two pixels can share a region via a third + # path without being directly close). The 8-conn diagonal + # guard (#2677) consults the diagonal only when this + # orthogonal test failed, matching numpy's + # ``if not matches_W`` / ``if not matches_S`` shortcut. W + # is (local_row, local_col-1); S is (local_row-1, local_col). + w_match = ( + local_col > 0 and + (mask_flat is None or mask_flat[ij - 1]) and + _is_close(v, values_flat[ij - 1], atol, rtol)) + s_match = ( + local_row > 0 and + (mask_flat is None or mask_flat[ij - nx_eff]) and + _is_close(v, values_flat[ij - nx_eff], atol, rtol)) + cell_ij[n_cells] = ij + cell_w[n_cells] = w_match + cell_s[n_cells] = s_match + n_cells += 1 + + return val_min, val_max, has_value, cell_ij, cell_w, cell_s, n_cells + + def _compute_region_value_ranges(block, mask_block, connectivity_8, atol, rtol, column, row_offset=0, col_offset=0, - ny_total=None, nx_total=None): + ny_total=None, nx_total=None, + regions=None): """Return per-region value ranges and internal-boundary cell values. ``column`` is the polygon-order value list from ``_polygonize_numpy``; - the returned lists are parallel to it. Re-runs the same - ``_calculate_regions`` pass used inside ``_polygonize_numpy`` (so - region IDs match exactly), then groups pixel values by region. + the returned lists are parallel to it. ``regions`` takes the flat + region labels already computed by ``_polygonize_numpy_regions`` for + this block; when ``None`` the same ``_calculate_regions`` pass is + re-run here (so region IDs match exactly). Pixel values are then + grouped by region in the jitted :func:`_region_ranges_scan` (#3303). Returns ``(val_ranges, boundary_cells)`` where: @@ -1432,8 +1545,13 @@ def _compute_region_value_ranges(block, mask_block, connectivity_8, nx_eff = nx mask_flat = full_mask.ravel() if full_mask is not None else None - regions = _calculate_regions( - values_flat, mask_flat, connectivity_8, nx_eff, ny, atol, rtol) + if regions is None: + regions = _calculate_regions( + values_flat, mask_flat, connectivity_8, nx_eff, ny, atol, rtol) + elif len(regions) != nx_eff * ny: + raise ValueError( + f"regions length {len(regions)} does not match the block's " + f"padded flat size {nx_eff * ny}") # Which chunk edges are internal (shared with a neighbour chunk)? track_cells = ny_total is not None and nx_total is not None @@ -1443,69 +1561,35 @@ def _compute_region_value_ranges(block, mask_block, connectivity_8, top_internal = track_cells and row_offset + ny < ny_total # Aggregate min/max value per region in raster-scan order so the - # output is parallel to the polygon column from _scan. + # output is parallel to the polygon column from _scan. The per-pixel + # work runs in the jitted _region_ranges_scan; only the per-region + # assembly (tuples and dicts) stays in Python (#3303). n_regions = len(column) if n_regions == 0: return [], [] - val_min = [None] * n_regions - val_max = [None] * n_regions + val_min, val_max, has_value, cell_ij, cell_w, cell_s, n_cells = \ + _region_ranges_scan( + regions, values_flat, mask_flat, n_regions, nx, ny, nx_eff, + bool(left_internal), bool(right_internal), + bool(bottom_internal), bool(top_internal), atol, rtol) + boundary_cells = [None] * n_regions - for ij in range(len(regions)): - r = regions[ij] - if r == 0: - continue - if mask_flat is not None and not mask_flat[ij]: - continue - idx = r - 1 - if idx >= n_regions: - continue - v = values_flat[ij] - if val_min[idx] is None or v < val_min[idx]: - val_min[idx] = v - if val_max[idx] is None or v > val_max[idx]: - val_max[idx] = v + for k in range(n_cells): + ij = int(cell_ij[k]) + idx = int(regions[ij]) - 1 + cells = boundary_cells[idx] + if cells is None: + cells = {} + boundary_cells[idx] = cells + local_col = ij % nx_eff + local_row = ij // nx_eff + cells[(local_col + col_offset, local_row + row_offset)] = ( + values_flat[ij], bool(cell_w[k]), bool(cell_s[k])) - if track_cells: - local_col = ij % nx_eff - local_row = ij // nx_eff - # nx==1 padded a dummy column; skip it (only local_col 0 is real). - if nx == 1 and local_col != 0: - continue - on_internal = ( - (left_internal and local_col == 0) or - (right_internal and local_col == nx - 1) or - (bottom_internal and local_row == 0) or - (top_internal and local_row == ny - 1)) - if on_internal: - cells = boundary_cells[idx] - if cells is None: - cells = {} - boundary_cells[idx] = cells - # Record whether numpy CCL's per-pixel close test matched - # this pixel against its in-chunk W / S orthogonal - # neighbour. This mirrors the ``matches_W`` / ``matches_S`` - # predicates in :func:`_calculate_regions` exactly -- it is - # the direct ``_is_close(cur, neighbour)`` test, NOT region - # co-membership (two pixels can share a region via a third - # path without being directly close). The 8-conn diagonal - # guard (#2677) consults the diagonal only when this - # orthogonal test failed, matching numpy's - # ``if not matches_W`` / ``if not matches_S`` shortcut. W - # is (local_row, local_col-1); S is (local_row-1, local_col). - w_match = ( - local_col > 0 and - (mask_flat is None or mask_flat[ij - 1]) and - _is_close(v, values_flat[ij - 1], atol, rtol)) - s_match = ( - local_row > 0 and - (mask_flat is None or mask_flat[ij - nx_eff]) and - _is_close(v, values_flat[ij - nx_eff], atol, rtol)) - cells[(local_col + col_offset, local_row + row_offset)] = ( - v, bool(w_match), bool(s_match)) out = [] cells_out = [] for i in range(n_regions): - if val_min[i] is None: + if not has_value[i]: # Fall back to the polygon's representative value if no # pixel matched (defensive; shouldn't happen). out.append((column[i], column[i])) diff --git a/xrspatial/tests/test_polygonize_issue_3303.py b/xrspatial/tests/test_polygonize_issue_3303.py new file mode 100644 index 000000000..59f0f8999 --- /dev/null +++ b/xrspatial/tests/test_polygonize_issue_3303.py @@ -0,0 +1,241 @@ +"""Tests for issue #3303: jitted _region_ranges_scan in the float dask +chunk path. + +The fix replaced the pure-Python per-pixel loop in +``_compute_region_value_ranges`` with the jitted ``_region_ranges_scan``, +reused the region labels from ``_polygonize_numpy_regions`` instead of +re-running ``_calculate_regions``, and corrected the recorded ``w_match`` +/ ``s_match`` flags (the old loop called the numba overload generator +``_is_close`` from pure Python, which returns the implementation function +-- always truthy -- instead of a bool). + +These tests pin three things: + +1. ``_compute_region_value_ranges`` gives identical results whether the + caller passes precomputed ``regions`` or lets it recompute them. +2. The recorded boundary-cell flags equal a naive ``_values_close`` + reference (this fails on the pre-#3303 code). +3. Public dask float output still matches numpy across chunkings and + connectivities. +""" + +import numpy as np +import pytest +import xarray as xr + +from xrspatial.polygonize import ( + _DEFAULT_ATOL, + _DEFAULT_RTOL, + _compute_region_value_ranges, + _polygonize_numpy, + _polygonize_numpy_regions, + _values_close, + polygonize, +) + +dask = pytest.importorskip("dask") +import dask.array as da # noqa: E402 + + +def _reference_ranges(block, mask_block, connectivity_8, atol, rtol, column, + row_offset, col_offset, ny_total, nx_total): + """Naive Python port of the per-pixel aggregation, using + ``_values_close`` for the W / S flags (the documented semantics).""" + _, _, regions = _polygonize_numpy_regions( + block, mask_block, connectivity_8, None, atol=atol, rtol=rtol) + + ny, nx = block.shape + if np.issubdtype(block.dtype, np.floating): + nan_mask = ~np.isnan(block) + if mask_block is not None: + full_mask = mask_block.astype(bool) & nan_mask + else: + full_mask = nan_mask + else: + full_mask = mask_block + if nx == 1: + nx_eff = 2 + values_flat = np.hstack((block, block)).ravel() + if full_mask is not None: + full_mask = np.hstack((full_mask, np.zeros_like(full_mask))) + else: + full_mask = np.zeros((ny, 2), dtype=bool) + full_mask[:, 0] = True + else: + nx_eff = nx + values_flat = block.ravel() + mask_flat = full_mask.ravel() if full_mask is not None else None + + track = ny_total is not None and nx_total is not None + left = track and col_offset > 0 + right = track and col_offset + nx < nx_total + bottom = track and row_offset > 0 + top = track and row_offset + ny < ny_total + + n_regions = len(column) + val_min = [None] * n_regions + val_max = [None] * n_regions + cells_out = [{} for _ in range(n_regions)] + for ij in range(len(regions)): + r = regions[ij] + if r == 0: + continue + if mask_flat is not None and not mask_flat[ij]: + continue + idx = r - 1 + if idx >= n_regions: + continue + v = values_flat[ij] + if val_min[idx] is None or v < val_min[idx]: + val_min[idx] = v + if val_max[idx] is None or v > val_max[idx]: + val_max[idx] = v + lc = ij % nx_eff + lr = ij // nx_eff + if nx == 1 and lc != 0: + continue + on_internal = ((left and lc == 0) or (right and lc == nx - 1) or + (bottom and lr == 0) or (top and lr == ny - 1)) + if on_internal: + w = (lc > 0 and + (mask_flat is None or mask_flat[ij - 1]) and + bool(_values_close(v, values_flat[ij - 1], atol, rtol))) + s = (lr > 0 and + (mask_flat is None or mask_flat[ij - nx_eff]) and + bool(_values_close(v, values_flat[ij - nx_eff], atol, rtol))) + cells_out[idx][(lc + col_offset, lr + row_offset)] = ( + v, bool(w), bool(s)) + ranges = [] + for i in range(n_regions): + if val_min[i] is None: + ranges.append((column[i], column[i])) + else: + ranges.append((val_min[i], val_max[i])) + return ranges, cells_out + + +def _cases(): + rng = np.random.default_rng(3303) + a = rng.random((13, 17)) * 1e-4 + 1.0 + a[3:5, 4:9] = np.nan + m = rng.random((13, 17)) > 0.2 + b = rng.integers(0, 3, (9, 1)).astype(np.float64) + c = rng.integers(0, 4, (12, 12)).astype(np.float32) + return [ + ("float-nan", a, None), + ("float-mask", a, m), + ("nx1", b, None), + ("float32", c, None), + ] + + +@pytest.mark.parametrize("name,block,mask", _cases()) +@pytest.mark.parametrize("connectivity_8", [False, True]) +@pytest.mark.parametrize("offsets", [ + (0, 0, None, None), # single chunk: no internal edges + (4, 6, 40, 40), # all four edges internal + (0, 6, 40, 40), # bottom edge on the raster boundary +]) +def test_precomputed_regions_match_recompute(name, block, mask, + connectivity_8, offsets): + ro, co, nyt, nxt = offsets + column, _, regions = _polygonize_numpy_regions( + block, mask, connectivity_8, None) + with_regions = _compute_region_value_ranges( + block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column, + row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt, + regions=regions) + without_regions = _compute_region_value_ranges( + block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column, + row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt) + assert with_regions == without_regions + + +@pytest.mark.parametrize("name,block,mask", _cases()) +@pytest.mark.parametrize("connectivity_8", [False, True]) +def test_ranges_and_flags_match_reference(name, block, mask, connectivity_8): + ro, co, nyt, nxt = 4, 6, 40, 40 + column, _, regions = _polygonize_numpy_regions( + block, mask, connectivity_8, None) + got_ranges, got_cells = _compute_region_value_ranges( + block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column, + row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt, + regions=regions) + ref_ranges, ref_cells = _reference_ranges( + block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column, + ro, co, nyt, nxt) + assert len(got_ranges) == len(ref_ranges) + for g, r in zip(got_ranges, ref_ranges): + np.testing.assert_allclose(g, r, rtol=0, atol=0) + assert got_cells == ref_cells + + +def test_w_s_flags_record_real_closeness(): + # Pre-#3303 the flags were truthy whenever the position / mask + # conditions held, because _is_close called from pure Python returns + # numba's implementation function instead of a bool. + block = np.array([[1.0, 5.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0]]) + column, _, regions = _polygonize_numpy_regions(block, None, False, None) + _, cells = _compute_region_value_ranges( + block, None, False, 1e-8, 1e-5, column, + row_offset=0, col_offset=0, ny_total=8, nx_total=4, regions=regions) + merged = {} + for d in cells: + merged.update(d) + # Only the top edge (local_row == ny-1 == 1) is internal here + # (col_offset == row_offset == 0 and the block spans the full width). + # Pixel (1, 1) has value 1.0; W neighbour (0, 1) is 1.0 (close) and + # S neighbour (1, 0) is 5.0 (not close). The pre-#3303 code records + # s_match=True here because the position/mask conditions hold. + v, w_match, s_match = merged[(1, 1)] + assert v == 1.0 + assert w_match is True + assert s_match is False + # Pixel (0, 1) has value 1.0; no W neighbour, S neighbour (0, 0) + # is 1.0 (close). + v, w_match, s_match = merged[(0, 1)] + assert v == 1.0 + assert w_match is False + assert s_match is True + + +@pytest.mark.parametrize("connectivity", [4, 8]) +@pytest.mark.parametrize("chunks", [(5, 7), (4, 4), (13, 5)]) +def test_dask_float_matches_numpy(connectivity, chunks): + rng = np.random.default_rng(42) + data = rng.integers(0, 4, (13, 17)).astype(np.float64) + data += rng.random((13, 17)) * 1e-7 # tolerance-level noise + data[2:4, 5:9] = np.nan + raster = xr.DataArray(data, dims=["y", "x"]) + dask_raster = xr.DataArray( + da.from_array(data, chunks=chunks), dims=["y", "x"]) + + col_np, pp_np = polygonize(raster, connectivity=connectivity) + col_da, pp_da = polygonize(dask_raster, connectivity=connectivity) + + assert col_np == col_da + assert len(pp_np) == len(pp_da) + # Boundary-merged rings can start at a different vertex than numpy's + # trace (pre-existing behaviour, also on the code before #3303), so + # compare geometric equality rather than raw vertex arrays. + shapely_geom = pytest.importorskip("shapely.geometry") + for rings_np, rings_da in zip(pp_np, pp_da): + poly_np = shapely_geom.Polygon(rings_np[0], rings_np[1:]) + poly_da = shapely_geom.Polygon(rings_da[0], rings_da[1:]) + assert poly_np.equals(poly_da) + + +def test_polygonize_numpy_unchanged_by_regions_split(): + # _polygonize_numpy must stay a thin wrapper with identical output. + rng = np.random.default_rng(7) + block = rng.integers(0, 3, (8, 9)).astype(np.float64) + col_a, pp_a = _polygonize_numpy(block, None, False, None) + col_b, pp_b, regions = _polygonize_numpy_regions(block, None, False, None) + assert col_a == col_b + assert len(pp_a) == len(pp_b) + for ra, rb in zip(pp_a, pp_b): + for x, y in zip(ra, rb): + np.testing.assert_array_equal(x, y) + # Region labels cover exactly the polygons _scan emitted. + assert regions.max() == len(col_b)