Skip to content

polygonize: jit the float-chunk region ranges scan (#3303)#3314

Merged
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-polygonize-2026-06-12
Jun 12, 2026
Merged

polygonize: jit the float-chunk region ranges scan (#3303)#3314
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-polygonize-2026-06-12

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3303.

_compute_region_value_ranges looped over every pixel of a float chunk in pure Python and re-ran _calculate_regions on a block _polygonize_numpy had already labelled. On a 1024x1024 float64 chunk with ~180 regions that loop was 0.283 s of the 0.299 s _polygonize_chunk total, making float dask chunks about 30x slower than integer ones.

  • Move the per-pixel min/max aggregation and internal-edge cell collection into the jitted _region_ranges_scan. Only the per-region tuple/dict assembly stays in Python. The same chunk now takes 0.015 s.
  • Add _polygonize_numpy_regions so the chunk path reuses the region labels instead of recomputing them. _polygonize_numpy is a thin wrapper with unchanged output. _compute_region_value_ranges keeps a regions=None fallback that recomputes, and the tests check both paths agree.
  • Side effect worth knowing about: the old loop called _is_close (a numba overload generator) from pure Python, which returns the implementation function instead of a bool, so the recorded w_match/s_match flags were truthy whenever the position/mask conditions held. That was output-neutral (when the W/S neighbour is absent from the global boundary-cell field, the gated diagonal is always a same-chunk pair the in-chunk CCL already resolved), but the flags were wrong. Inside the jitted kernel _is_close resolves correctly, and a new test pins the real values.

Backends: behaviour unchanged on numpy and cupy (they never hit this code path); dask+numpy and dask+cupy float rasters get the speedup. Verified on GPU: cupy int/float and dask+cupy runs produce the same polygons as numpy.

Test plan:

  • New test_polygonize_issue_3303.py: precomputed-regions vs recompute parity, ranges/flags vs a naive _values_close reference, flag polarity (fails on the old code), dask-vs-numpy float output across chunkings and connectivities, _polygonize_numpy wrapper parity.
  • Full polygonize battery: 527 passed, 16 skipped.
  • flake8 clean on changed files.
  • cupy / dask+cupy / dask+numpy smoke runs match numpy output.

No wall-clock assertions added (see #2889).

_compute_region_value_ranges looped over every pixel of a float chunk in
pure Python, which took ~95% of _polygonize_chunk time on the dask path,
and re-ran _calculate_regions on a block _polygonize_numpy had already
labelled. Move the per-pixel work into the jitted _region_ranges_scan,
reuse the region labels via the new _polygonize_numpy_regions, and keep
only the per-region tuple/dict assembly in Python. A 1024x1024 float64
chunk with ~180 regions drops from 0.299s to 0.015s.

This also fixes the recorded w_match/s_match boundary-cell flags: the
old loop called _is_close (a numba overload generator) from pure Python,
which returns the implementation function instead of a bool, so the
flags were truthy whenever the position/mask conditions held. Inside the
jitted kernel _is_close resolves properly. Output is unchanged (the
buggy flags were provably output-neutral); existing 527 polygonize tests
pass and the new test file pins ranges/flags parity and dask-vs-numpy
float output.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 12, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: polygonize: jit the float-chunk region ranges scan (#3303)

Blockers (must fix before merge)

None found.

Suggestions (should fix, not blocking)

  • benchmarks/benchmarks/polygonize.py has no benchmark for the float dask chunk path this PR optimizes. The existing Polygonize/PolygonizeCCL classes use int32 numpy/cupy input, so a regression in _region_ranges_scan or _compute_region_value_ranges would not show up in asv. Add a float64 dask case.

Nits (optional improvements)

  • _compute_region_value_ranges (xrspatial/polygonize.py:1485) trusts a caller-supplied regions to match the block's padded flat shape. A len(regions) == nx_eff * ny check would catch a mismatched caller early. There is only one internal caller today and the parity tests cover it, so this is cheap insurance rather than a real risk.

What looks good

  • The jitted kernel mirrors the original loop's skip conditions exactly: region 0, masked pixels, idx >= n_regions, and the nx==1 padded column. I checked the max_cells = 2*(nx+ny) bound; each strip pixel records at most once, so the arrays cannot overflow (2nx+2ny-4 distinct strip pixels for nx,ny >= 2, fewer with masking).
  • Region-label reuse goes through a new _polygonize_numpy_regions and keeps _polygonize_numpy as a wrapper with a parity test, plus a regions=None fallback that the tests compare against the precomputed path.
  • The w_match/s_match flag fix is real: _is_close is an overload generator and returns its impl function when called from pure Python, so the old flags were truthy whenever the position/mask conditions held. In jit it resolves correctly. test_w_s_flags_record_real_closeness pins both polarities and fails on the old code.
  • The dask-vs-numpy float test compares geometric equality and documents why raw vertex arrays differ (merged boundary rings can start at a different vertex; that divergence predates this PR).

Checklist

  • Algorithm matches reference (kernel is a line-for-line port of the Python loop, verified by a _values_close reference implementation in the tests)
  • All implemented backends produce consistent results (cupy, dask+numpy, dask+cupy smoke-checked against numpy)
  • NaN handling is correct (NaN-mask prep unchanged; float-nan test case included)
  • Edge cases are covered by tests (nx==1, masked, float32, conn 4/8, raster-edge vs internal-edge offsets)
  • Dask chunk boundaries handled correctly (boundary-cell keys/values byte-identical to old code; flags now correct)
  • No premature materialization or unnecessary copies
  • Benchmark exists for the optimized path (see suggestion)
  • README feature matrix: not applicable, no public API change
  • Docstrings present and accurate (internal functions documented, regions param described)

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review after 32cc304

Both findings from the first pass are addressed:

  • PolygonizeDaskFloat benchmark added (benchmarks/benchmarks/polygonize.py). It builds a float64 dask raster in 4x4 chunks and times the public polygonize, which exercises _region_ranges_scan and the cross-chunk merge. Smoke-ran locally.
  • _compute_region_value_ranges now rejects a caller-supplied regions whose length does not match the block's padded flat size (xrspatial/polygonize.py:1551).

Note: flake8 reports F401 for _simplify_polygons in benchmarks/benchmarks/polygonize.py; that import was already unused on main before this PR and is left alone here.

Full polygonize battery still green (527 passed, 16 skipped). No further findings.

@brendancol brendancol merged commit f14947a into main Jun 12, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

polygonize: float dask chunks spend ~95% of time in pure-Python _compute_region_value_ranges

1 participant