Skip to content

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046

Draft
valentijnscholten wants to merge 2 commits into
DefectDojo:devfrom
valentijnscholten:perf/dedupe-skip-unchanged-and-values-write
Draft

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046
valentijnscholten wants to merge 2 commits into
DefectDojo:devfrom
valentijnscholten:perf/dedupe-skip-unchanged-and-values-write

Conversation

@valentijnscholten

@valentijnscholten valentijnscholten commented Jun 19, 2026

Copy link
Copy Markdown
Member

What

Performance optimizations for the hash-recompute / dedupe write path in mass_model_updater and Finding.get_vulnerability_ids.

The dedupe recalculation command performance relies heavily on bulk_update. Turns out this is performing a very expensive part of code to calculate a bit CASE WHEN .... clause. Because this is a hot path and the dedupe only updates 1 (or 3) hash code fields, I feel it's OK to use raw SQL here as it saves huge amounts of time.

mass_model_updater (dojo/utils.py)

  • skip-unchanged (skip_unchanged=True default): rows whose tracked fields weren't changed by the update function are no longer written. Re-running over already-correct data now issues zero UPDATEs. The old value is read from the page query (via __dict__, so deferred fields don't trigger an extra query).
  • writer hook: optional writer(model_type, batch, fields) callable to replace bulk_update for persisting a batch; defaults to bulk_update.

VALUES fast-write (dojo/finding/deduplication.py)

  • hashcode_values_writer writes the text hash columns with a single UPDATE … FROM (VALUES …) join instead of bulk_update's per-row CASE/WHEN expression tree (which is O(rows×fields) Python nodes to build/resolve and dominates wide-batch writes). PostgreSQL only; falls back to bulk_update on other backends. Values are bound as parameters.

vulnerability_ids N+1 (dojo/models.py)

  • get_vulnerability_ids now reads finding.vulnerability_id_set.all() (prefetch-aware) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(). For parsers whose hash config includes vulnerability_ids this was a per-finding COUNT + SELECT N+1.

dedupe command (dojo/management/commands/dedupe.py)

  • prefetch vulnerability_id_set; use the VALUES writer for the hash pass.

Why

The dedupe/recompute write path scales poorly on large finding sets: bulk_update builds a per-row CASE/WHEN for the updated columns (slow to build/resolve for wide batches), it writes every row even when the value is unchanged, and get_vulnerability_ids issued a query per finding for affected parsers. These changes remove all three: unchanged rows aren't written, changed rows use a VALUES join, and the vuln-id lookup honors prefetch.

Concretely, on a ~15k-finding all-changed batch the VALUES join avoids ~14s of Python spent building and resolving bulk_update's CASE/WHEN expression tree (the resolve_expression cost in profiling). That figure is purely the write mechanism — independent of any caching. The skip-unchanged and prefetch wins on top are workload-dependent (driven by how many rows actually changed and whether the parser hashes on vulnerability_ids).

Tests

unittests/test_mass_model_updater.py — skip-unchanged, writer hook, VALUES writer (incl NULL handling), fields=None side-effect mode. All green on PostgreSQL.

Compatibility

Backward-compatible: skip_unchanged only skips no-op writes; writer defaults to bulk_update; the VALUES path is Postgres-gated with a bulk_update fallback.

…lnerability_ids

mass_model_updater optimizations for the hash-recompute / dedupe write path:

- skip_unchanged (default True): rows whose tracked fields were not changed by
  the update function are no longer written. Re-runs over already-correct data
  issue zero UPDATEs. Compared against values loaded by the page query (read from
  __dict__ so deferred fields don't trigger an extra query).
- writer hook: optional `writer(model_type, batch, fields)` callable to replace
  Django's bulk_update for a batch. Defaults to bulk_update.
- hashcode_values_writer (dojo/finding/deduplication.py): writes the text hash
  columns with a single `UPDATE ... FROM (VALUES ...)` join instead of
  bulk_update's per-row CASE/WHEN expression tree, which dominates wide-batch
  updates. PostgreSQL only; falls back to bulk_update otherwise. Values are bound
  as parameters.
- Finding.get_vulnerability_ids now reads the prefetch-aware reverse relation
  (finding.vulnerability_id_set.all()) instead of a fresh
  Vulnerability_Id.objects.filter(...) + .count(), so prefetch_related is honored
  (was a per-finding N+1 for parsers whose hash config includes vulnerability_ids).
- dedupe command: prefetch vulnerability_id_set and use the VALUES writer for the
  hash pass.

Adds unittests/test_mass_model_updater.py covering skip-unchanged, the writer
hook, the VALUES writer (incl NULL handling) and fields=None side-effect mode.
@valentijnscholten valentijnscholten modified the milestones: 3.0.100, 3.1.0 Jun 19, 2026
The new TestMassModelUpdater hardcoded fixtures = ["dojo_testdata.json"],
which fails under V3_FEATURE_LOCATIONS=True: loading the legacy Endpoint
rows trips the model's V3 deprecation guard during tagulous retag. Decorate
with @versioned_fixtures (the established repo pattern) so it loads
dojo_testdata_locations.json under V3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant