Skip to content

feat: allow import/reimport to wait for deduplication to complete#15007

Open
valentijnscholten wants to merge 12 commits into
DefectDojo:devfrom
valentijnscholten:import-execution-mode
Open

feat: allow import/reimport to wait for deduplication to complete#15007
valentijnscholten wants to merge 12 commits into
DefectDojo:devfrom
valentijnscholten:import-execution-mode

Conversation

@valentijnscholten

@valentijnscholten valentijnscholten commented Jun 14, 2026

Copy link
Copy Markdown
Member

Summary

While thinking about what a v3 import API could look like, I realized we're missing one piece of the puzzel in the current (re)import.

By default deduplication runs as an async celery task. The results are not awaited, so the (re)import process can finish before deduplication is complete.
Consequences are:

  • The mail notification scan_added contains incorrect statistics
  • The API response contains incorrect delta statitistics.

With the recent optimizations on both import and deduplication the chances of these incorrect statistics has gone down. However the most confusing case happens when the deduplication is partly finished. In that case the statistics will contain some duplicates, but not all of them. Example report: #13777

Pro also suffers from this when not using the background-import. Also the pattern in this PR could be used for other Pro features like the current asynchronous prioritization.

The solution in this PR is to introduce a new deduplication_execution_mode enum that controls deduplication execution:

async (default): perform deduplication in the background, so not wait for the results.
sync (block_execution==True): perform deduplication in the foreground/in-line with the (re)import.
async_await: perform deduplication in the background and wait for the results.

The async_await is the new option and could be the new default in the future. The block_execution user profile variable is removed in this PR as it is obselete.
Data is migrated. Using block_execution==True could be used to achieve a similar result, but it would slow the import a lot more because deduplication would run in the foreground. It would also make everything synchronous inluding notifications, pushing to JIRA, etc causing further slowdowns.

The new deduplication_execution_mode can be set in the user profile, but also in the API requests.

In most scenario's the await setting will not lead to significantly longer running import, may only 1 or 2 seconds extra.
The exception is on very busy/overloaded instances where the sync deduplication jobs are queueing up.
For this reason the maximum wait time is 1 minute. After 1 minute the (re)import no longer waits and returns statistics as-is.
A response variable deduplication_complete will be False in this case.

Details

  • Add a new import/reimport deduplication execution mode, configurable on the user profile via deduplication_execution_mode (async / async_wait / sync) and overridable per request through a deduplication_execution_mode field on the import and reimport endpoints. The request value takes precedence over the profile, otherwise it falls back to async.
  • async_wait (new): deduplication/post-processing is still dispatched to the background, but the request waits for deduplication to finish before responding, so scan_added notifications and the returned statistics reflect the deduplicated state. JIRA push, product grading and other non-deduplication tasks remain asynchronous and are not awaited. async (default, return immediately) and sync (run inline) round out the modes.
  • The new field is independent of the existing block_execution flag, which is retained as the global "run all async tasks (notifications, JIRA, grading, ...) in the foreground" switch. A data migration seeds deduplication_execution_mode='sync' for users who had block_execution=True, and the resolver falls back to block_execution → sync, so import behavior is unchanged for them.
  • post_process_findings_batch now stores its result (ignore_result=False) so async_wait can join on the dispatched batch. The wait is bounded by the new DD_DEDUPLICATION_ASYNC_WAIT_TIMEOUT setting (default 60s) so a missing or stuck worker degrades to responding anyway rather than hanging.
  • Add a deduplication_complete boolean to the import/reimport response indicating whether deduplication had finished by the time the response was produced (true for sync and for async_wait when it completed within the timeout; false for async).
  • Make scan_added notifications deduplication-aware: once deduplication has completed, refresh the duplicate status of the affected findings from the database and split the new / reactivated / untouched lists into real and duplicate sub-lists, exclude duplicates from the headline count (so an all-duplicate import sends scan_added_empty), and surface the duplicate groups in the mail and webhook templates. In plain async mode the lists are left untouched, preserving historical behavior.
  • The import/reimport response statistics are accurate after the await without further change, since they query findings and annotate the duplicate count directly.
  • The browser import/reimport views resolve deduplication_execution_mode from the user's profile (the API resolves it in the serializer), so UI imports honor the profile setting instead of always defaulting to async. (No per-import selector is added to the UI form in this change.)
  • Surface deduplication_execution_mode in the user profile form and the user detail view, and add 2.60 upgrade notes.

Introduce a third import/reimport post-processing execution mode and make
it configurable per request and via the user profile.

- async (default): dispatch post-processing to the background, respond immediately
- async_wait (new): dispatch to the background but wait for deduplication to
  finish before responding, so scan_added notifications and the returned
  statistics reflect the deduplicated state
- sync: run post-processing inline in the web process

Replace the legacy UserContactInfo.block_execution boolean with an
import_execution_mode CharField; a data migration maps block_execution=True
to the 'sync' mode. wants_block_execution() now derives from the sync mode,
preserving global foreground-execution semantics for all async tasks.

Add a request field import_execution_mode (ImportScanSerializer/
ReImportScanSerializer) resolved against the profile (request override wins),
and a deduplication_complete boolean in the import/reimport response. The
post_process_findings_batch task now stores its result (ignore_result=False)
so async_wait can join on it via AsyncResult.get(); the join is bounded by
the new DD_IMPORT_ASYNC_WAIT_TIMEOUT setting.
notify_scan_added used the importer's in-memory finding lists, whose
duplicate flag is stale because deduplication runs on separately-fetched
instances. Once deduplication has completed (sync or async_wait, signalled by
deduplication_complete), refresh the duplicate flag from the database and split
each action list into "real" and duplicate findings:

- findings_new        -> net-new (excludes deduplicated)
- findings_new_duplicate / findings_reactivated_duplicate / findings_untouched_duplicate
- finding_count       -> recomputed to exclude new/reactivated duplicates
  (so an all-duplicate import sends scan_added_empty)

In plain async mode (dedup not awaited) the lists are left untouched, preserving
historical behavior. Mail and webhook scan_added templates render the new
duplicate sections.

Note: import/reimport response statistics are already dedup-accurate once
awaited — Test.statistics / Test_Import.statistics query Finding and annotate
the duplicate count directly; only the notification used stale in-memory data.
Move wait_for_post_processing() ahead of the test_added notification dispatch
(not just notify_scan_added) so both notifications sent during a fresh import
are emitted after deduplication has completed in async_wait mode. The reimporter
already orders the await before its single notification.
@github-actions github-actions Bot added New Migration Adding a new migration file. Take care when merging. settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR apiv2 unittests integration_tests ui labels Jun 14, 2026
The mode governs whether the request waits for deduplication, so name it for
that. Rename the UserContactInfo field, the request/serializer field, the
ImporterOptions attribute and validator, the resolver, and the
IMPORT_EXECUTION_MODE_* constants to DEDUPLICATION_EXECUTION_MODE_*. Add a
RenameField migration (0271) rather than mutating the add/remove migrations
introduced earlier on this branch.
…ion_mode for import dedup

block_execution gates every async task (notifications, jira, grading, deletes,
reindex, ...) via we_want_async, so it must remain. Restore it as the global
"run all async tasks in the foreground" flag and make deduplication_execution_mode
an independent field that only controls how import/reimport deduplication
post-processing is dispatched and awaited.

- Restore the block_execution field; wants_block_execution() reads it again.
- resolve_deduplication_execution_mode() falls back to block_execution -> sync.
- Replace the destructive remove migration (0270) with a seed-only data migration
  that maps block_execution=True -> deduplication_execution_mode 'sync'; keep both.
- Restore fixtures, the profile form field, and the user detail view (both fields).
- Revert the test sweep: tests that need global foreground use block_execution=True
  again; the dedicated deduplication_execution_mode tests are updated for the split.
@github-actions github-actions Bot added the docs label Jun 14, 2026
…ort/reimport

The API resolves deduplication_execution_mode in the serializer, but the UI
import/reimport views built their context straight from the form and never
resolved it, so UI imports silently defaulted to async regardless of the user's
profile. Resolve it from the profile in both UI process_form paths.
@valentijnscholten valentijnscholten added this to the 2.60.0 milestone Jun 14, 2026
@valentijnscholten valentijnscholten changed the title feat: configurable import/reimport execution mode (async_wait) with dedup-accurate notifications feat: allow import/reimport to wait for deduplication to complete Jun 14, 2026
Option B keeps block_execution as a checkbox in the profile form, so the
integration helper toggles id_block_execution again instead of the
deduplication_execution_mode select (which no longer existed under that id).
Collapse the add/rename pair into a single AddField that creates
deduplication_execution_mode with its final name, and keep the seed as a
separate data migration, per the convention of keeping schema and data
migrations apart. Drops the interim rename migration.
…E_LOCATIONS

The CI 'true' matrix variant enables V3_FEATURE_LOCATIONS, where loading
dojo_testdata.json (containing Endpoints) raises NotImplementedError. Decorate
the test classes with @versioned_fixtures so the locations fixture is used in
that mode, matching the other import/reimport test suites.
@valentijnscholten valentijnscholten added the affects_pro PRs that affect Pro and need a coordinated release/merge moment. label Jun 14, 2026
@valentijnscholten valentijnscholten marked this pull request as ready for review June 14, 2026 17:02
@valentijnscholten valentijnscholten marked this pull request as draft June 14, 2026 18:04
Covers the 'async_wait' execution mode: post-processing is dispatched to a
background worker and the request joins on it before responding. The dedup
queries run in the worker (separate connection), not the web request, so the
only web-side delta over the plain async path is the post-dedup notification
refresh SELECT (+1): 109->110 first import, 89->90 second.

Does not use CELERY_TASK_ALWAYS_EAGER (that would run the task inline on the
request connection and wrongly count worker queries). The dedup batch is
dispatched async and the join (AsyncResult.get) is mocked to return instantly,
simulating a finished worker. Adds an optional dedup_mode param to
_deduplication_performance.
@valentijnscholten valentijnscholten marked this pull request as ready for review June 14, 2026 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects_pro PRs that affect Pro and need a coordinated release/merge moment. apiv2 docs New Migration Adding a new migration file. Take care when merging. settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR ui unittests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant