Skip to content

Retry TestCustomAuthorizerApp deployment on transient IAM role propagation failure#2451

Draft
GarrettBeatty wants to merge 1 commit into
devfrom
gcbeatty/fix-custom-authorizer-deploy-retry
Draft

Retry TestCustomAuthorizerApp deployment on transient IAM role propagation failure#2451
GarrettBeatty wants to merge 1 commit into
devfrom
gcbeatty/fix-custom-authorizer-deploy-retry

Conversation

@GarrettBeatty

Copy link
Copy Markdown
Contributor

Problem

The durabletesting3 CI job (run 28258443522) failed in the TestCustomAuthorizerApp.IntegrationTests project — all 20 tests failed because the fixture's CloudFormation deployment rolled back:

SimpleHttpApiUserInfo   CREATE_FAILED
Resource handler returned message: "The role defined for the function cannot be assumed by Lambda.
(Service: Lambda, Status Code: 400 ...)" (HandlerErrorCode: InvalidRequest)

This is a transient IAM eventual-consistency race: the stack creates a per-function IAM role and immediately calls Lambda CreateFunction, but the role's trust policy hasn't propagated through IAM yet. Once one function fails, CloudFormation cancels the other in-flight resources and rolls the whole stack back (ROLLBACK_COMPLETE), failing every test in the project. It is unrelated to the PR's code changes and typically passes on re-run.

Fix

Wrap dotnet lambda deploy-serverless in DeploymentScript.ps1 with a retry loop (3 attempts):

  • Between attempts, delete the rolled-back stack — a ROLLBACK_COMPLETE stack cannot be updated or re-created — and wait stack-delete-complete before retrying.
  • Add a brief pause to give IAM additional time to settle.
  • Surface CloudFormation failed-resource events on each failure for easier debugging.

This makes the integration test resilient to the transient IAM propagation failure instead of failing the whole CI job.

Testing

  • PowerShell script parses cleanly ([Parser]::ParseFile).
  • The retry path only triggers on deployment failure; the happy path is unchanged (single deploy, then break).

…ation failure

The TestCustomAuthorizerApp integration test stack deploys many Lambda
functions that reference IAM roles created in the same stack. CloudFormation
occasionally calls Lambda CreateFunction before the role's trust policy has
propagated through IAM, producing "The role defined for the function cannot
be assumed by Lambda" and rolling the whole stack back, which fails all 20
tests in the project.

Wrap the deploy in a retry loop (3 attempts). Between attempts, delete the
rolled-back stack (a ROLLBACK_COMPLETE stack cannot be re-created) and pause
briefly to let IAM settle. Surface CloudFormation failed-resource events on
each failure for easier debugging.
@GarrettBeatty GarrettBeatty added the Release Not Needed Add this label if a PR does not need to be released. label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Release Not Needed Add this label if a PR does not need to be released.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant