diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
index 80c03143c..21dff2d1e 100644
--- a/.claude/sweep-performance-state.csv
+++ b/.claude/sweep-performance-state.csv
@@ -33,7 +33,7 @@ normalize,2026-03-31T18:00:00Z,SAFE,compute-bound,0,1124,Boolean indexing replac
 pathfinding,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. CuPy .get() is required -- A* has no GPU kernel. Per-pixel .compute() is only 2 calls for start/goal validation. seg.values in multi_stop_search collects already-computed results for stitching.
 perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,,
 polygon_clip,2026-06-10,SAFE,graph-bound,0,3191,"crop=True picked tiny leading edge chunk as rasterize mask size -> 13169-task graph; fixed to max(rc),max(cc) -> 1045 tasks. crop=False/numpy/cupy clean. Cat1-5 clean. GPU+dask+cupy run-validated."
-polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification."
+polygonize,2026-06-12,RISKY,compute-bound,0,3303,"Pass 3 (2026-06-12): re-audit after #2817/#2913/#3041. 0 HIGH. 1 MEDIUM fixed (#3303): _compute_region_value_ranges ran a pure-Python per-pixel loop (95% of float chunk time; 0.283s of 0.299s on 1024x1024, float chunks ~30x int) and re-ran _calculate_regions on an already-labelled block; moved to jitted _region_ranges_scan + _polygonize_numpy_regions label reuse (0.299s -> 0.015s/chunk). Side fix: w_match/s_match flags were always-truthy (_is_close numba overload generator called from pure Python returns impl function); output-neutral by chunk geometry, now computed correctly in jit. Cat1/2 clean (dask.compute batching is the documented #2673 design). Cat3 validated on GPU: cupy int/float + dask+cupy run end-to-end, single documented transfer, no round-trip. Cat4/5 LOW unchanged: _calculate_regions_cupy per-unique-value labeling (low impact); per-polygon Python classify loop in _polygonize_chunk dominates only on pathological many-polygon chunks (788K polys -> 7.8s). Cat6 RISKY unchanged: driver accumulates O(total polygons); 32-chunk batches bound transient peak. 527 polygonize tests + 40 new pass."
 proximity,2026-06-09,RISKY,graph-bound,0,3103,"Pass 2 (2026-06-09): re-audit after 16 fix commits since 2026-03-31. 0 HIGH, 2 MEDIUM found and fixed: (1) #3103/PR #3126 line-sweep @ngjit closure inside _process recompiled per call (~0.42s constant overhead; 10x10 warm call 0.44s->1ms after module-level hoist with explicit args, 1000x1000 0.49s->35ms); (2) #3132/PR #3137 dask xs/ys coordinate grids built via da.tile/da.repeat+rechunk cost ~185 tasks/chunk with the ys term scaling O(raster height) (~4.3 tasks/row, 44K tasks at 10240 rows); chunk-aligned da.broadcast_to gives identical values, bounded graph 18535->5554 tasks (3.3x) on 2560^2/256 chunks; regression test bounds tasks/chunk<80 (old 100.4, new 58.7) + ragged-chunk parity. LOW not fixed: zeros+fill(-1) row buffers in line-sweep; numpy backend materializes full float64 xs/ys grids (guarded since #1111); unbounded KDTree streaming count pass computes chunks on driver by design (gh-879). GPU validated on CUDA host: cupy 1024^2 proximity 6ms device-resident with exact numpy parity, dask+cupy bounded parity exact, _proximity_cuda_kernel 56 regs/thread (no register pressure). _halo_depth python loop measured 58ms at 100K coords - not a finding. Verdict RISKY (was WILL OOM): unbounded paths either guarded (MemoryError at 80% mem) or stream via kdtree; bounded map_overlap peak scales with chunk size."
 rasterize,2026-06-09,SAFE,compute-bound,0,3107,"Pass 4 (2026-06-09): re-audit found 2 MEDIUM Cat-4 allocation findings, 0 HIGH. (a) all four backends return via astype(dtype) which copies the float64 work buffer even when dtype is already float64 (the default) -- _run_numpy L1237, _run_cupy L2211, _rasterize_tile_numpy L2460, _rasterize_tile_cupy L2688; fix astype(dtype, copy=False). (b) CPU paths allocate order as full-raster int64 (8 B/px) for every merge mode but only first/last predicates read it; for _should_write_any merges (max/min/sum/count, user callables) an int8 buffer suffices (numba wraps the dead int64 store) -- _run_numpy L1188, _rasterize_tile_numpy L2420. tracemalloc 4000x4000 numpy merge=sum: peak 25 B/px -> 10 B/px expected (out 8 + written 1 + order 1); merge=last 25 -> 17 B/px. Filed #3107, fixed via deep-sweep rockout. GPU validated on host (CUDA available): cupy 512x512 last/sum/max returns cupy.ndarray with CPU parity, dask+cupy sum parity True, no host round trip. Dask graph probe: 2560x2560 chunks=256 -> 400 tasks / 100 chunks (4.0 tasks/chunk, unchanged). LOW (not fixed, documented): _extract_polygon_boundary_segments int variant L702 is dead code (only the _float variant is called). SAFE/compute-bound: per-tile buffers scale with chunk size; scanline/burn JIT kernels dominate runtime. | Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
 reproject,2026-06-12,SAFE,compute-bound,0,3267 3268,"Pass 7 (2026-06-12, deep-sweep): 0 HIGH, 2 MEDIUM found and fixed. #3267: in-memory numpy path held ~7 output-sized float64 temporaries and dask promotion keyed on input size only -- measured 8.4 GB peak RSS for a 1.15 GB output (7.3x); fix computes the grid first, promotes on output size too (mirrors merge #3048 pattern), re-applies the pixel guard for materializing paths, and fixes the 3-D chunks tuple in the dask wrap (post-fix peak 0.5 GB, lazy). #3268: multi-band cupy CPU-fallback transform path re-uploaded local_row/local_col per band via cp.asarray inside _resample_cupy_native (measured 12 H2D uploads for 6 bands, 26.1 MB vs 4.4 MB needed); fix hoists the device conversion before the band loop. LOW (documented, not fixed): _resample_cupy_native redundant copy+scan when the caller pre-converted nodata (+149% on the resample step, hits _reproject_dask_cupy fast path with non-NaN nodata); geoid_height_raster allocates full HxW meshgrid x2 plus output from dims (no strips, no dask path). Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, single blockwise layer, lazy); merge 2 inputs -> 16 tasks. GPU validated on host (CUDA available): cupy 2048^2 4326->3857 in 23 ms on-device; dask+cupy eager fast path matches in-memory exactly. SAFE/compute-bound holds. | Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
diff --git a/benchmarks/benchmarks/polygonize.py b/benchmarks/benchmarks/polygonize.py
index 3ed53bd00..05fbadbf9 100644
--- a/benchmarks/benchmarks/polygonize.py
+++ b/benchmarks/benchmarks/polygonize.py
@@ -70,6 +70,27 @@ def time_polygonize(self, nx):
         polygonize(self.raster, mask=self.mask)
 
 
+class PolygonizeDaskFloat:
+    """Benchmark the float dask path: per-chunk jitted ranges scan plus
+    cross-chunk merge (#3303)."""
+    params = ([100, 300, 1000],)
+    param_names = ("nx",)
+
+    def setup(self, nx):
+        try:
+            import dask.array as da
+        except ImportError:
+            raise NotImplementedError("dask not available")
+        ny = nx // 2
+        rng = np.random.default_rng(9461713)
+        raster = rng.integers(low=0, high=4, size=(ny, nx)).astype(np.float64)
+        chunks = (max(ny // 4, 1), max(nx // 4, 1))
+        self.raster = xr.DataArray(da.from_array(raster, chunks=chunks))
+
+    def time_polygonize(self, nx):
+        polygonize(self.raster)
+
+
 class PolygonizeCCL:
     """Benchmark connected-component labeling phase."""
     params = ([100, 300, 1000], ["numpy", "cupy"])
diff --git a/xrspatial/polygonize.py b/xrspatial/polygonize.py
index 05b01f179..93b00caa8 100644
--- a/xrspatial/polygonize.py
+++ b/xrspatial/polygonize.py
@@ -1126,14 +1126,17 @@ def _to_geojson(
     return {"type": "FeatureCollection", "features": features}
 
 
-def _polygonize_numpy(
+def _polygonize_numpy_regions(
     values: np.ndarray,
     mask: Optional[np.ndarray],
     connectivity_8: bool,
     transform: Optional[np.ndarray],
     atol: float = _DEFAULT_ATOL,
     rtol: float = _DEFAULT_RTOL,
-) -> Tuple[List[Union[int, float]], List[List[np.ndarray]]]:
+) -> Tuple[List[Union[int, float]], List[List[np.ndarray]], np.ndarray]:
+    """Like :func:`_polygonize_numpy` but also returns the flat region
+    labels so callers (the dask chunk path) can reuse them instead of
+    re-running ``_calculate_regions`` on the same block (#3303)."""
 
     ny, nx = values.shape
 
@@ -1165,6 +1168,19 @@ def _polygonize_numpy(
     column, polygon_points = _scan(
         regions, values_flat, mask_flat, connectivity_8, transform, nx, ny)
 
+    return column, polygon_points, regions
+
+
+def _polygonize_numpy(
+    values: np.ndarray,
+    mask: Optional[np.ndarray],
+    connectivity_8: bool,
+    transform: Optional[np.ndarray],
+    atol: float = _DEFAULT_ATOL,
+    rtol: float = _DEFAULT_RTOL,
+) -> Tuple[List[Union[int, float]], List[List[np.ndarray]]]:
+    column, polygon_points, _ = _polygonize_numpy_regions(
+        values, mask, connectivity_8, transform, atol=atol, rtol=rtol)
     return column, polygon_points
 
 
@@ -1316,7 +1332,7 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset,
     if mask_block is not None:
         mask_block = _to_numpy(mask_block)
     ny, nx = block.shape
-    column, polygon_points = _polygonize_numpy(
+    column, polygon_points, regions = _polygonize_numpy_regions(
         block, mask_block, connectivity_8, transform=None,
         atol=atol, rtol=rtol)
 
@@ -1331,7 +1347,7 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset,
         val_ranges, boundary_cells = _compute_region_value_ranges(
             block, mask_block, connectivity_8, atol, rtol, column,
             row_offset=row_offset, col_offset=col_offset,
-            ny_total=ny_total, nx_total=nx_total)
+            ny_total=ny_total, nx_total=nx_total, regions=regions)
     else:
         val_ranges = [(c, c) for c in column]
         boundary_cells = [{} for _ in column]
@@ -1372,16 +1388,113 @@ def _polygonize_chunk(block, mask_block, connectivity_8, row_offset, col_offset,
     return interior, boundary
 
 
+# Per-pixel companion kernel for _compute_region_value_ranges (#3303).
+# Aggregates per-region (min, max) values and collects the pixels lying on
+# internal chunk edges, recording numpy CCL's direct W / S close test for
+# each.  Runs jitted because the previous pure-Python version of this loop
+# dominated float dask chunk time (~95% of _polygonize_chunk).  Calling
+# _is_close here (in nopython mode) also gives the real boolean result;
+# from pure Python the numba overload generator returns its implementation
+# function instead of a bool, so the old loop recorded always-truthy
+# w_match / s_match flags.
+@ngjit
+def _region_ranges_scan(
+    regions: np.ndarray,               # _regions_dtype, shape (nx_eff*ny,)
+    values_flat: np.ndarray,           # shape (nx_eff*ny,)
+    mask_flat: Optional[np.ndarray],   # shape (nx_eff*ny,)
+    n_regions: int,
+    nx: int,                           # original (un-padded) row length
+    ny: int,
+    nx_eff: int,                       # row length after nx==1 padding
+    left_internal: bool,
+    right_internal: bool,
+    bottom_internal: bool,
+    top_internal: bool,
+    atol: float,
+    rtol: float,
+):
+    val_min = np.empty(n_regions, dtype=values_flat.dtype)
+    val_max = np.empty(n_regions, dtype=values_flat.dtype)
+    has_value = np.zeros(n_regions, dtype=np.bool_)
+
+    # Internal-edge pixels live on at most four one-pixel-wide strips.
+    max_cells = 2 * (nx + ny)
+    cell_ij = np.empty(max_cells, dtype=np.int64)
+    cell_w = np.empty(max_cells, dtype=np.bool_)
+    cell_s = np.empty(max_cells, dtype=np.bool_)
+    n_cells = 0
+
+    for ij in range(len(regions)):
+        r = regions[ij]
+        if r == 0:
+            continue
+        if mask_flat is not None and not mask_flat[ij]:
+            continue
+        idx = r - 1
+        if idx >= n_regions:
+            continue
+        v = values_flat[ij]
+        if not has_value[idx]:
+            has_value[idx] = True
+            val_min[idx] = v
+            val_max[idx] = v
+        else:
+            if v < val_min[idx]:
+                val_min[idx] = v
+            if v > val_max[idx]:
+                val_max[idx] = v
+
+        local_col = ij % nx_eff
+        local_row = ij // nx_eff
+        # nx==1 padded a dummy column; skip it (only local_col 0 is real).
+        if nx == 1 and local_col != 0:
+            continue
+        on_internal = (
+            (left_internal and local_col == 0) or
+            (right_internal and local_col == nx - 1) or
+            (bottom_internal and local_row == 0) or
+            (top_internal and local_row == ny - 1))
+        if on_internal:
+            # Record whether numpy CCL's per-pixel close test matched
+            # this pixel against its in-chunk W / S orthogonal
+            # neighbour.  This mirrors the ``matches_W`` / ``matches_S``
+            # predicates in :func:`_calculate_regions` exactly -- it is
+            # the direct ``_is_close(cur, neighbour)`` test, NOT region
+            # co-membership (two pixels can share a region via a third
+            # path without being directly close).  The 8-conn diagonal
+            # guard (#2677) consults the diagonal only when this
+            # orthogonal test failed, matching numpy's
+            # ``if not matches_W`` / ``if not matches_S`` shortcut.  W
+            # is (local_row, local_col-1); S is (local_row-1, local_col).
+            w_match = (
+                local_col > 0 and
+                (mask_flat is None or mask_flat[ij - 1]) and
+                _is_close(v, values_flat[ij - 1], atol, rtol))
+            s_match = (
+                local_row > 0 and
+                (mask_flat is None or mask_flat[ij - nx_eff]) and
+                _is_close(v, values_flat[ij - nx_eff], atol, rtol))
+            cell_ij[n_cells] = ij
+            cell_w[n_cells] = w_match
+            cell_s[n_cells] = s_match
+            n_cells += 1
+
+    return val_min, val_max, has_value, cell_ij, cell_w, cell_s, n_cells
+
+
 def _compute_region_value_ranges(block, mask_block, connectivity_8,
                                  atol, rtol, column,
                                  row_offset=0, col_offset=0,
-                                 ny_total=None, nx_total=None):
+                                 ny_total=None, nx_total=None,
+                                 regions=None):
     """Return per-region value ranges and internal-boundary cell values.
 
     ``column`` is the polygon-order value list from ``_polygonize_numpy``;
-    the returned lists are parallel to it.  Re-runs the same
-    ``_calculate_regions`` pass used inside ``_polygonize_numpy`` (so
-    region IDs match exactly), then groups pixel values by region.
+    the returned lists are parallel to it.  ``regions`` takes the flat
+    region labels already computed by ``_polygonize_numpy_regions`` for
+    this block; when ``None`` the same ``_calculate_regions`` pass is
+    re-run here (so region IDs match exactly).  Pixel values are then
+    grouped by region in the jitted :func:`_region_ranges_scan` (#3303).
 
     Returns ``(val_ranges, boundary_cells)`` where:
 
@@ -1432,8 +1545,13 @@ def _compute_region_value_ranges(block, mask_block, connectivity_8,
         nx_eff = nx
 
     mask_flat = full_mask.ravel() if full_mask is not None else None
-    regions = _calculate_regions(
-        values_flat, mask_flat, connectivity_8, nx_eff, ny, atol, rtol)
+    if regions is None:
+        regions = _calculate_regions(
+            values_flat, mask_flat, connectivity_8, nx_eff, ny, atol, rtol)
+    elif len(regions) != nx_eff * ny:
+        raise ValueError(
+            f"regions length {len(regions)} does not match the block's "
+            f"padded flat size {nx_eff * ny}")
 
     # Which chunk edges are internal (shared with a neighbour chunk)?
     track_cells = ny_total is not None and nx_total is not None
@@ -1443,69 +1561,35 @@ def _compute_region_value_ranges(block, mask_block, connectivity_8,
     top_internal = track_cells and row_offset + ny < ny_total
 
     # Aggregate min/max value per region in raster-scan order so the
-    # output is parallel to the polygon column from _scan.
+    # output is parallel to the polygon column from _scan.  The per-pixel
+    # work runs in the jitted _region_ranges_scan; only the per-region
+    # assembly (tuples and dicts) stays in Python (#3303).
     n_regions = len(column)
     if n_regions == 0:
         return [], []
-    val_min = [None] * n_regions
-    val_max = [None] * n_regions
+    val_min, val_max, has_value, cell_ij, cell_w, cell_s, n_cells = \
+        _region_ranges_scan(
+            regions, values_flat, mask_flat, n_regions, nx, ny, nx_eff,
+            bool(left_internal), bool(right_internal),
+            bool(bottom_internal), bool(top_internal), atol, rtol)
+
     boundary_cells = [None] * n_regions
-    for ij in range(len(regions)):
-        r = regions[ij]
-        if r == 0:
-            continue
-        if mask_flat is not None and not mask_flat[ij]:
-            continue
-        idx = r - 1
-        if idx >= n_regions:
-            continue
-        v = values_flat[ij]
-        if val_min[idx] is None or v < val_min[idx]:
-            val_min[idx] = v
-        if val_max[idx] is None or v > val_max[idx]:
-            val_max[idx] = v
+    for k in range(n_cells):
+        ij = int(cell_ij[k])
+        idx = int(regions[ij]) - 1
+        cells = boundary_cells[idx]
+        if cells is None:
+            cells = {}
+            boundary_cells[idx] = cells
+        local_col = ij % nx_eff
+        local_row = ij // nx_eff
+        cells[(local_col + col_offset, local_row + row_offset)] = (
+            values_flat[ij], bool(cell_w[k]), bool(cell_s[k]))
 
-        if track_cells:
-            local_col = ij % nx_eff
-            local_row = ij // nx_eff
-            # nx==1 padded a dummy column; skip it (only local_col 0 is real).
-            if nx == 1 and local_col != 0:
-                continue
-            on_internal = (
-                (left_internal and local_col == 0) or
-                (right_internal and local_col == nx - 1) or
-                (bottom_internal and local_row == 0) or
-                (top_internal and local_row == ny - 1))
-            if on_internal:
-                cells = boundary_cells[idx]
-                if cells is None:
-                    cells = {}
-                    boundary_cells[idx] = cells
-                # Record whether numpy CCL's per-pixel close test matched
-                # this pixel against its in-chunk W / S orthogonal
-                # neighbour.  This mirrors the ``matches_W`` / ``matches_S``
-                # predicates in :func:`_calculate_regions` exactly -- it is
-                # the direct ``_is_close(cur, neighbour)`` test, NOT region
-                # co-membership (two pixels can share a region via a third
-                # path without being directly close).  The 8-conn diagonal
-                # guard (#2677) consults the diagonal only when this
-                # orthogonal test failed, matching numpy's
-                # ``if not matches_W`` / ``if not matches_S`` shortcut.  W
-                # is (local_row, local_col-1); S is (local_row-1, local_col).
-                w_match = (
-                    local_col > 0 and
-                    (mask_flat is None or mask_flat[ij - 1]) and
-                    _is_close(v, values_flat[ij - 1], atol, rtol))
-                s_match = (
-                    local_row > 0 and
-                    (mask_flat is None or mask_flat[ij - nx_eff]) and
-                    _is_close(v, values_flat[ij - nx_eff], atol, rtol))
-                cells[(local_col + col_offset, local_row + row_offset)] = (
-                    v, bool(w_match), bool(s_match))
     out = []
     cells_out = []
     for i in range(n_regions):
-        if val_min[i] is None:
+        if not has_value[i]:
             # Fall back to the polygon's representative value if no
             # pixel matched (defensive; shouldn't happen).
             out.append((column[i], column[i]))
diff --git a/xrspatial/tests/test_polygonize_issue_3303.py b/xrspatial/tests/test_polygonize_issue_3303.py
new file mode 100644
index 000000000..59f0f8999
--- /dev/null
+++ b/xrspatial/tests/test_polygonize_issue_3303.py
@@ -0,0 +1,241 @@
+"""Tests for issue #3303: jitted _region_ranges_scan in the float dask
+chunk path.
+
+The fix replaced the pure-Python per-pixel loop in
+``_compute_region_value_ranges`` with the jitted ``_region_ranges_scan``,
+reused the region labels from ``_polygonize_numpy_regions`` instead of
+re-running ``_calculate_regions``, and corrected the recorded ``w_match``
+/ ``s_match`` flags (the old loop called the numba overload generator
+``_is_close`` from pure Python, which returns the implementation function
+-- always truthy -- instead of a bool).
+
+These tests pin three things:
+
+1. ``_compute_region_value_ranges`` gives identical results whether the
+   caller passes precomputed ``regions`` or lets it recompute them.
+2. The recorded boundary-cell flags equal a naive ``_values_close``
+   reference (this fails on the pre-#3303 code).
+3. Public dask float output still matches numpy across chunkings and
+   connectivities.
+"""
+
+import numpy as np
+import pytest
+import xarray as xr
+
+from xrspatial.polygonize import (
+    _DEFAULT_ATOL,
+    _DEFAULT_RTOL,
+    _compute_region_value_ranges,
+    _polygonize_numpy,
+    _polygonize_numpy_regions,
+    _values_close,
+    polygonize,
+)
+
+dask = pytest.importorskip("dask")
+import dask.array as da  # noqa: E402
+
+
+def _reference_ranges(block, mask_block, connectivity_8, atol, rtol, column,
+                      row_offset, col_offset, ny_total, nx_total):
+    """Naive Python port of the per-pixel aggregation, using
+    ``_values_close`` for the W / S flags (the documented semantics)."""
+    _, _, regions = _polygonize_numpy_regions(
+        block, mask_block, connectivity_8, None, atol=atol, rtol=rtol)
+
+    ny, nx = block.shape
+    if np.issubdtype(block.dtype, np.floating):
+        nan_mask = ~np.isnan(block)
+        if mask_block is not None:
+            full_mask = mask_block.astype(bool) & nan_mask
+        else:
+            full_mask = nan_mask
+    else:
+        full_mask = mask_block
+    if nx == 1:
+        nx_eff = 2
+        values_flat = np.hstack((block, block)).ravel()
+        if full_mask is not None:
+            full_mask = np.hstack((full_mask, np.zeros_like(full_mask)))
+        else:
+            full_mask = np.zeros((ny, 2), dtype=bool)
+            full_mask[:, 0] = True
+    else:
+        nx_eff = nx
+        values_flat = block.ravel()
+    mask_flat = full_mask.ravel() if full_mask is not None else None
+
+    track = ny_total is not None and nx_total is not None
+    left = track and col_offset > 0
+    right = track and col_offset + nx < nx_total
+    bottom = track and row_offset > 0
+    top = track and row_offset + ny < ny_total
+
+    n_regions = len(column)
+    val_min = [None] * n_regions
+    val_max = [None] * n_regions
+    cells_out = [{} for _ in range(n_regions)]
+    for ij in range(len(regions)):
+        r = regions[ij]
+        if r == 0:
+            continue
+        if mask_flat is not None and not mask_flat[ij]:
+            continue
+        idx = r - 1
+        if idx >= n_regions:
+            continue
+        v = values_flat[ij]
+        if val_min[idx] is None or v < val_min[idx]:
+            val_min[idx] = v
+        if val_max[idx] is None or v > val_max[idx]:
+            val_max[idx] = v
+        lc = ij % nx_eff
+        lr = ij // nx_eff
+        if nx == 1 and lc != 0:
+            continue
+        on_internal = ((left and lc == 0) or (right and lc == nx - 1) or
+                       (bottom and lr == 0) or (top and lr == ny - 1))
+        if on_internal:
+            w = (lc > 0 and
+                 (mask_flat is None or mask_flat[ij - 1]) and
+                 bool(_values_close(v, values_flat[ij - 1], atol, rtol)))
+            s = (lr > 0 and
+                 (mask_flat is None or mask_flat[ij - nx_eff]) and
+                 bool(_values_close(v, values_flat[ij - nx_eff], atol, rtol)))
+            cells_out[idx][(lc + col_offset, lr + row_offset)] = (
+                v, bool(w), bool(s))
+    ranges = []
+    for i in range(n_regions):
+        if val_min[i] is None:
+            ranges.append((column[i], column[i]))
+        else:
+            ranges.append((val_min[i], val_max[i]))
+    return ranges, cells_out
+
+
+def _cases():
+    rng = np.random.default_rng(3303)
+    a = rng.random((13, 17)) * 1e-4 + 1.0
+    a[3:5, 4:9] = np.nan
+    m = rng.random((13, 17)) > 0.2
+    b = rng.integers(0, 3, (9, 1)).astype(np.float64)
+    c = rng.integers(0, 4, (12, 12)).astype(np.float32)
+    return [
+        ("float-nan", a, None),
+        ("float-mask", a, m),
+        ("nx1", b, None),
+        ("float32", c, None),
+    ]
+
+
+@pytest.mark.parametrize("name,block,mask", _cases())
+@pytest.mark.parametrize("connectivity_8", [False, True])
+@pytest.mark.parametrize("offsets", [
+    (0, 0, None, None),    # single chunk: no internal edges
+    (4, 6, 40, 40),        # all four edges internal
+    (0, 6, 40, 40),        # bottom edge on the raster boundary
+])
+def test_precomputed_regions_match_recompute(name, block, mask,
+                                             connectivity_8, offsets):
+    ro, co, nyt, nxt = offsets
+    column, _, regions = _polygonize_numpy_regions(
+        block, mask, connectivity_8, None)
+    with_regions = _compute_region_value_ranges(
+        block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column,
+        row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt,
+        regions=regions)
+    without_regions = _compute_region_value_ranges(
+        block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column,
+        row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt)
+    assert with_regions == without_regions
+
+
+@pytest.mark.parametrize("name,block,mask", _cases())
+@pytest.mark.parametrize("connectivity_8", [False, True])
+def test_ranges_and_flags_match_reference(name, block, mask, connectivity_8):
+    ro, co, nyt, nxt = 4, 6, 40, 40
+    column, _, regions = _polygonize_numpy_regions(
+        block, mask, connectivity_8, None)
+    got_ranges, got_cells = _compute_region_value_ranges(
+        block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column,
+        row_offset=ro, col_offset=co, ny_total=nyt, nx_total=nxt,
+        regions=regions)
+    ref_ranges, ref_cells = _reference_ranges(
+        block, mask, connectivity_8, _DEFAULT_ATOL, _DEFAULT_RTOL, column,
+        ro, co, nyt, nxt)
+    assert len(got_ranges) == len(ref_ranges)
+    for g, r in zip(got_ranges, ref_ranges):
+        np.testing.assert_allclose(g, r, rtol=0, atol=0)
+    assert got_cells == ref_cells
+
+
+def test_w_s_flags_record_real_closeness():
+    # Pre-#3303 the flags were truthy whenever the position / mask
+    # conditions held, because _is_close called from pure Python returns
+    # numba's implementation function instead of a bool.
+    block = np.array([[1.0, 5.0, 1.0, 1.0],
+                      [1.0, 1.0, 1.0, 1.0]])
+    column, _, regions = _polygonize_numpy_regions(block, None, False, None)
+    _, cells = _compute_region_value_ranges(
+        block, None, False, 1e-8, 1e-5, column,
+        row_offset=0, col_offset=0, ny_total=8, nx_total=4, regions=regions)
+    merged = {}
+    for d in cells:
+        merged.update(d)
+    # Only the top edge (local_row == ny-1 == 1) is internal here
+    # (col_offset == row_offset == 0 and the block spans the full width).
+    # Pixel (1, 1) has value 1.0; W neighbour (0, 1) is 1.0 (close) and
+    # S neighbour (1, 0) is 5.0 (not close).  The pre-#3303 code records
+    # s_match=True here because the position/mask conditions hold.
+    v, w_match, s_match = merged[(1, 1)]
+    assert v == 1.0
+    assert w_match is True
+    assert s_match is False
+    # Pixel (0, 1) has value 1.0; no W neighbour, S neighbour (0, 0)
+    # is 1.0 (close).
+    v, w_match, s_match = merged[(0, 1)]
+    assert v == 1.0
+    assert w_match is False
+    assert s_match is True
+
+
+@pytest.mark.parametrize("connectivity", [4, 8])
+@pytest.mark.parametrize("chunks", [(5, 7), (4, 4), (13, 5)])
+def test_dask_float_matches_numpy(connectivity, chunks):
+    rng = np.random.default_rng(42)
+    data = rng.integers(0, 4, (13, 17)).astype(np.float64)
+    data += rng.random((13, 17)) * 1e-7  # tolerance-level noise
+    data[2:4, 5:9] = np.nan
+    raster = xr.DataArray(data, dims=["y", "x"])
+    dask_raster = xr.DataArray(
+        da.from_array(data, chunks=chunks), dims=["y", "x"])
+
+    col_np, pp_np = polygonize(raster, connectivity=connectivity)
+    col_da, pp_da = polygonize(dask_raster, connectivity=connectivity)
+
+    assert col_np == col_da
+    assert len(pp_np) == len(pp_da)
+    # Boundary-merged rings can start at a different vertex than numpy's
+    # trace (pre-existing behaviour, also on the code before #3303), so
+    # compare geometric equality rather than raw vertex arrays.
+    shapely_geom = pytest.importorskip("shapely.geometry")
+    for rings_np, rings_da in zip(pp_np, pp_da):
+        poly_np = shapely_geom.Polygon(rings_np[0], rings_np[1:])
+        poly_da = shapely_geom.Polygon(rings_da[0], rings_da[1:])
+        assert poly_np.equals(poly_da)
+
+
+def test_polygonize_numpy_unchanged_by_regions_split():
+    # _polygonize_numpy must stay a thin wrapper with identical output.
+    rng = np.random.default_rng(7)
+    block = rng.integers(0, 3, (8, 9)).astype(np.float64)
+    col_a, pp_a = _polygonize_numpy(block, None, False, None)
+    col_b, pp_b, regions = _polygonize_numpy_regions(block, None, False, None)
+    assert col_a == col_b
+    assert len(pp_a) == len(pp_b)
+    for ra, rb in zip(pp_a, pp_b):
+        for x, y in zip(ra, rb):
+            np.testing.assert_array_equal(x, y)
+    # Region labels cover exactly the polygons _scan emitted.
+    assert regions.max() == len(col_b)