perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids by valentijnscholten · Pull Request #15046 · DefectDojo/django-DefectDojo

valentijnscholten · 2026-06-19T20:55:27Z

What

Performance optimizations for the hash-recompute / dedupe write path in mass_model_updater and Finding.get_vulnerability_ids.

The dedupe recalculation command's performance relies heavily on bulk_update. It turns out that bulk_update runs a very expensive piece of code to build a big CASE WHEN ... clause. Because this is a hot path and the dedupe only updates 1 (or 3) hash code fields, I feel it's OK to use raw SQL here, as it saves huge amounts of time.

`mass_model_updater` (`dojo/utils.py`)

skip-unchanged (skip_unchanged=True default): rows whose tracked fields weren't changed by the update function are no longer written. Re-running over already-correct data now issues zero UPDATEs. The old value is read from the page query (via __dict__, so deferred fields don't trigger an extra query).
writer hook: optional writer(model_type, batch, fields) callable to replace bulk_update for persisting a batch; defaults to bulk_update.
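A minimal sketch of the skip-unchanged check and the writer hook described above, using plain Python objects instead of Django models. The names `update_batch`, `_MISSING` and the return value are hypothetical and only illustrate the idea; this is not the code in `dojo/utils.py`.

```python
from types import SimpleNamespace

# Sentinel for "field not loaded on this instance" (hypothetical helper).
_MISSING = object()


def update_batch(batch, fields, function, writer, skip_unchanged=True):
    """Apply `function` to each object and persist only the rows that changed.

    Returns the number of objects handed to `writer`.
    """
    to_write = []
    for obj in batch:
        # Read old values straight from the instance __dict__. A field that
        # was never loaded is simply absent there, so nothing is lazy-loaded.
        before = {name: obj.__dict__.get(name, _MISSING) for name in fields}
        function(obj)
        changed = any(
            obj.__dict__.get(name, _MISSING) != before[name] for name in fields
        )
        if changed or not skip_unchanged:
            to_write.append(obj)
    if to_write:
        # Same call shape as the hook in the description:
        # writer(model_type, batch, fields)
        writer(type(to_write[0]), to_write, fields)
    return len(to_write)


def recompute(obj):
    # Toy update function: only the row with id 2 gets a new hash.
    obj.hash_code = "a" if obj.id == 1 else "new"


written = []


def recording_writer(model_type, batch, fields):
    # Stand-in for bulk_update: just record which ids would be written.
    written.append([obj.id for obj in batch])


rows = [SimpleNamespace(id=1, hash_code="a"), SimpleNamespace(id=2, hash_code="old")]
first_run = update_batch(rows, ["hash_code"], recompute, recording_writer)
second_run = update_batch(rows, ["hash_code"], recompute, recording_writer)
```

In this toy run the first pass hands only the changed row to the writer and the second pass hands over nothing, which mirrors the "zero UPDATEs on re-run" behavior.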

VALUES fast-write (`dojo/finding/deduplication.py`)

hashcode_values_writer writes the text hash columns with a single UPDATE … FROM (VALUES …) join instead of bulk_update's per-row CASE/WHEN expression tree (which is O(rows×fields) Python nodes to build/resolve and dominates wide-batch writes). PostgreSQL only; falls back to bulk_update on other backends. Values are bound as parameters.
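As an illustration of the statement shape, the sketch below builds one `UPDATE … FROM (VALUES …)` statement with bound parameters. The function name `build_values_update`, the table name `example_finding`, and the explicit `::bigint` / `::text` casts are assumptions made for this example; the actual `hashcode_values_writer` may build its SQL differently.

```python
def build_values_update(table, column, rows):
    """Return (sql, params) for a single UPDATE ... FROM (VALUES ...) join.

    rows: iterable of (primary_key, new_value) pairs; new_value may be None,
    which is bound as SQL NULL.
    `table` and `column` are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(%s, %s)" placeholder pair per row; the values travel as parameters.
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    sql = (
        f'UPDATE "{table}" AS t '
        f'SET "{column}" = v.new_value::text '
        f"FROM (VALUES {placeholders}) AS v(id, new_value) "
        "WHERE t.id = v.id::bigint"
    )
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_values_update(
    "example_finding", "hash_code", [(1, "abc"), (2, None)]
)
```

With a DB-API cursor on PostgreSQL this would be run as `cursor.execute(sql, params)`. The statement grows by one placeholder pair per row, rather than by one CASE/WHEN branch per row and field.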

vulnerability_ids N+1 (`dojo/models.py`)

get_vulnerability_ids now reads finding.vulnerability_id_set.all() (prefetch-aware) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(). For parsers whose hash config includes vulnerability_ids this was a per-finding COUNT + SELECT N+1.
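The N+1 effect can be shown with a toy model that counts queries. This is plain Python, not Django: `CountingStore` and both lookup functions are hypothetical stand-ins, and the per-finding path here costs one query per finding, whereas the old code issued a COUNT plus a SELECT.

```python
from collections import defaultdict


class CountingStore:
    """Toy stand-in for a database that counts the queries it answers."""

    def __init__(self, rows):
        self.rows = list(rows)  # (finding_id, vulnerability_id) pairs
        self.queries = 0

    def ids_for_one(self, finding_id):
        # Models a fresh filter(finding=...) query for a single finding.
        self.queries += 1
        return [vid for fid, vid in self.rows if fid == finding_id]

    def ids_for_many(self, finding_ids):
        # Models what a prefetch does: one query for the whole page of findings.
        self.queries += 1
        wanted = set(finding_ids)
        grouped = defaultdict(list)
        for fid, vid in self.rows:
            if fid in wanted:
                grouped[fid].append(vid)
        return grouped


def lookup_per_finding(store, finding_ids):
    """N+1 pattern: one query per finding."""
    return {fid: store.ids_for_one(fid) for fid in finding_ids}


def lookup_prefetched(store, finding_ids):
    """Prefetch pattern: one query, then in-memory lookups."""
    grouped = store.ids_for_many(finding_ids)
    return {fid: grouped.get(fid, []) for fid in finding_ids}


rows = [(1, "ID-A"), (1, "ID-B"), (2, "ID-C")]
finding_ids = [1, 2, 3]

slow_store = CountingStore(rows)
slow_result = lookup_per_finding(slow_store, finding_ids)

fast_store = CountingStore(rows)
fast_result = lookup_prefetched(fast_store, finding_ids)
```

Both paths return the same mapping; only the query count differs, growing with the number of findings in the first case and staying at one in the second.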

dedupe command (`dojo/management/commands/dedupe.py`)

prefetch vulnerability_id_set; use the VALUES writer for the hash pass.

Why

The dedupe/recompute write path scales poorly on large finding sets: bulk_update builds a per-row CASE/WHEN for the updated columns (slow to build/resolve for wide batches), it writes every row even when the value is unchanged, and get_vulnerability_ids issued a query per finding for affected parsers. These changes remove all three: unchanged rows aren't written, changed rows use a VALUES join, and the vuln-id lookup honors prefetch.

Concretely, on a ~15k-finding all-changed batch the VALUES join avoids ~14s of Python spent building and resolving bulk_update's CASE/WHEN expression tree (the resolve_expression cost in profiling). That figure is purely the write mechanism — independent of any caching. The skip-unchanged and prefetch wins on top are workload-dependent (driven by how many rows actually changed and whether the parser hashes on vulnerability_ids).

Tests

unittests/test_mass_model_updater.py — skip-unchanged, writer hook, VALUES writer (incl NULL handling), fields=None side-effect mode. All green on PostgreSQL.

Compatibility

Backward-compatible: skip_unchanged only skips no-op writes; writer defaults to bulk_update; the VALUES path is Postgres-gated with a bulk_update fallback.

…lnerability_ids

mass_model_updater optimizations for the hash-recompute / dedupe write path:

- skip_unchanged (default True): rows whose tracked fields were not changed by the update function are no longer written. Re-runs over already-correct data issue zero UPDATEs. Compared against values loaded by the page query (read from __dict__ so deferred fields don't trigger an extra query).
- writer hook: optional `writer(model_type, batch, fields)` callable to replace Django's bulk_update for a batch. Defaults to bulk_update.
- hashcode_values_writer (dojo/finding/deduplication.py): writes the text hash columns with a single `UPDATE ... FROM (VALUES ...)` join instead of bulk_update's per-row CASE/WHEN expression tree, which dominates wide-batch updates. PostgreSQL only; falls back to bulk_update otherwise. Values are bound as parameters.
- Finding.get_vulnerability_ids now reads the prefetch-aware reverse relation (finding.vulnerability_id_set.all()) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(), so prefetch_related is honored (was a per-finding N+1 for parsers whose hash config includes vulnerability_ids).
- dedupe command: prefetch vulnerability_id_set and use the VALUES writer for the hash pass.

Adds unittests/test_mass_model_updater.py covering skip-unchanged, the writer hook, the VALUES writer (incl NULL handling) and fields=None side-effect mode.

The new TestMassModelUpdater hardcoded fixtures = ["dojo_testdata.json"], which fails under V3_FEATURE_LOCATIONS=True: loading the legacy Endpoint rows trips the model's V3 deprecation guard during tagulous retag. Decorate with @versioned_fixtures (the established repo pattern) so it loads dojo_testdata_locations.json under V3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot added the unittests label Jun 19, 2026

valentijnscholten modified the milestones: 3.0.100, 3.1.0 Jun 19, 2026

valentijnscholten wants to merge 2 commits into DefectDojo:dev from valentijnscholten:perf/dedupe-skip-unchanged-and-values-write
