perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046
Draft
valentijnscholten wants to merge 2 commits into
Draft
Conversation
…lnerability_ids mass_model_updater optimizations for the hash-recompute / dedupe write path: - skip_unchanged (default True): rows whose tracked fields were not changed by the update function are no longer written. Re-runs over already-correct data issue zero UPDATEs. Compared against values loaded by the page query (read from __dict__ so deferred fields don't trigger an extra query). - writer hook: optional `writer(model_type, batch, fields)` callable to replace Django's bulk_update for a batch. Defaults to bulk_update. - hashcode_values_writer (dojo/finding/deduplication.py): writes the text hash columns with a single `UPDATE ... FROM (VALUES ...)` join instead of bulk_update's per-row CASE/WHEN expression tree, which dominates wide-batch updates. PostgreSQL only; falls back to bulk_update otherwise. Values are bound as parameters. - Finding.get_vulnerability_ids now reads the prefetch-aware reverse relation (finding.vulnerability_id_set.all()) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(), so prefetch_related is honored (was a per-finding N+1 for parsers whose hash config includes vulnerability_ids). - dedupe command: prefetch vulnerability_id_set and use the VALUES writer for the hash pass. Adds unittests/test_mass_model_updater.py covering skip-unchanged, the writer hook, the VALUES writer (incl NULL handling) and fields=None side-effect mode.
The new TestMassModelUpdater hardcoded fixtures = ["dojo_testdata.json"], which fails under V3_FEATURE_LOCATIONS=True: loading the legacy Endpoint rows trips the model's V3 deprecation guard during tagulous retag. Decorate with @versioned_fixtures (the established repo pattern) so it loads dojo_testdata_locations.json under V3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Performance optimizations for the hash-recompute / dedupe write path in
mass_model_updaterandFinding.get_vulnerability_ids.The dedupe recalculation command performance relies heavily on bulk_update. Turns out this is performing a very expensive part of code to calculate a bit CASE WHEN .... clause. Because this is a hot path and the dedupe only updates 1 (or 3) hash code fields, I feel it's OK to use raw SQL here as it saves huge amounts of time.
mass_model_updater(dojo/utils.py)skip_unchanged=Truedefault): rows whose trackedfieldsweren't changed by the update function are no longer written. Re-running over already-correct data now issues zero UPDATEs. The old value is read from the page query (via__dict__, so deferred fields don't trigger an extra query).writer(model_type, batch, fields)callable to replacebulk_updatefor persisting a batch; defaults tobulk_update.VALUES fast-write (
dojo/finding/deduplication.py)hashcode_values_writerwrites the text hash columns with a singleUPDATE … FROM (VALUES …)join instead ofbulk_update's per-rowCASE/WHENexpression tree (which isO(rows×fields)Python nodes to build/resolve and dominates wide-batch writes). PostgreSQL only; falls back tobulk_updateon other backends. Values are bound as parameters.vulnerability_ids N+1 (
dojo/models.py)get_vulnerability_idsnow readsfinding.vulnerability_id_set.all()(prefetch-aware) instead of a freshVulnerability_Id.objects.filter(...)+.count(). For parsers whose hash config includesvulnerability_idsthis was a per-findingCOUNT + SELECTN+1.dedupe command (
dojo/management/commands/dedupe.py)vulnerability_id_set; use the VALUES writer for the hash pass.Why
The dedupe/recompute write path scales poorly on large finding sets:
bulk_updatebuilds a per-rowCASE/WHENfor the updated columns (slow to build/resolve for wide batches), it writes every row even when the value is unchanged, andget_vulnerability_idsissued a query per finding for affected parsers. These changes remove all three: unchanged rows aren't written, changed rows use a VALUES join, and the vuln-id lookup honors prefetch.Concretely, on a ~15k-finding all-changed batch the VALUES join avoids ~14s of Python spent building and resolving
bulk_update'sCASE/WHENexpression tree (theresolve_expressioncost in profiling). That figure is purely the write mechanism — independent of any caching. The skip-unchanged and prefetch wins on top are workload-dependent (driven by how many rows actually changed and whether the parser hashes onvulnerability_ids).Tests
unittests/test_mass_model_updater.py— skip-unchanged, writer hook, VALUES writer (incl NULL handling),fields=Noneside-effect mode. All green on PostgreSQL.Compatibility
Backward-compatible:
skip_unchangedonly skips no-op writes;writerdefaults tobulk_update; the VALUES path is Postgres-gated with abulk_updatefallback.