Add WebhookCertificateRotationDeferredByUnhealthyMRs alert#27

Open
videlov wants to merge 1 commit into main from deferred-by-unhealthy-mrs-alert
@videlov videlov commented Jun 26, 2026

Summary

Adds one new alert to controlplane-remote.yaml, companion to a behavior change in webhook-injector that ships in SAP-cloud-infrastructure/webhook-injector#12.

What the alert tracks

webhook-injector now defers a non-force cert rotation (expired, san_mismatch) when at least one labeled ManagedResource has a ResourcesApplied or ResourcesHealthy condition with Status != True. The cert keeps being served from the existing valid leaf; rotation resumes automatically when the MR(s) recover. If the cert reaches NotAfter before that happens, an escape hatch lets the rotation proceed regardless, with a louder warning log.

The new counter webhook_injector_certificate_rotation_deferred_unhealthy_mrs_total increments per deferred reconcile pass and per past-NotAfter under-duress rotation. This alert fires on a sustained non-zero rate.
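For reviewers, the gate and counter behavior described above can be sketched roughly as follows. This is an illustration only, not the webhook-injector code: `decide`, `Decision`, `Outcome`, and the dict-based condition shape are hypothetical names, and it assumes the function is only called once a rotation is already wanted. It also assumes a missing condition does not block rotation, which the description does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

# Condition types the gate inspects, as named in the PR description.
GATED_CONDITIONS = ("ResourcesApplied", "ResourcesHealthy")


class Outcome(Enum):
    ROTATE = "rotate"                            # nothing blocks the rotation
    DEFER = "defer"                              # keep serving the existing leaf
    ROTATE_UNDER_DURESS = "rotate_under_duress"  # past NotAfter: escape hatch


@dataclass
class Decision:
    outcome: Outcome
    increments_counter: bool  # would the deferred_unhealthy_mrs counter tick?


def any_mr_unhealthy(managed_resources):
    """True if any labeled MR has a gated condition whose status is not "True".

    Each MR is modeled as a dict {condition_type: status}. A condition that is
    absent is treated as not blocking (an assumption of this sketch).
    """
    return any(
        cond in conditions and conditions[cond] != "True"
        for conditions in managed_resources
        for cond in GATED_CONDITIONS
    )


def decide(force, managed_resources, now, not_after):
    """Decide what one reconcile pass does with a wanted rotation."""
    # Forced rotations and healthy MRs are never held back.
    if force or not any_mr_unhealthy(managed_resources):
        return Decision(Outcome.ROTATE, False)
    # Escape hatch: the cert has reached NotAfter, rotate anyway and count it.
    if now >= not_after:
        return Decision(Outcome.ROTATE_UNDER_DURESS, True)
    # Otherwise defer this pass and count the deferral.
    return Decision(Outcome.DEFER, True)
```

Both the deferred pass and the under-duress rotation report `increments_counter=True`, which is why the alert would keep firing through the escape-hatch case.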

Alert definition

- alert: WebhookCertificateRotationDeferredByUnhealthyMRs
  expr: sum by (namespace) (rate(webhook_injector_certificate_rotation_deferred_unhealthy_mrs_total[5m])) > 0
  for: 15m
  severity: warning

Style matches the other webhook-injector propagation alerts in the same file: sum by (namespace), 5m rate window, dig-based overrides, {{ $labels.namespace }}-enriched summary/description, playbook URL placeholder with #TODO: add playbook.
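To make the timing of `rate(...[5m]) > 0` combined with `for: 15m` concrete, here is a small simulation. It is a deliberate simplification of Prometheus semantics (no extrapolation, no counter-reset handling, a closed window boundary), and `simple_rate` and `alert_states` are hypothetical helper names, so treat it as intuition rather than a reference implementation.

```python
def simple_rate(samples, t, window=300):
    """Per-second increase between the first and last sample in [t - window, t].

    samples is a time-sorted list of (timestamp_seconds, counter_value).
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(ts, v) for ts, v in samples if t - window <= ts <= t]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def alert_states(samples, eval_times, window=300, hold=900):
    """Alert state at each evaluation time: inactive, pending, or firing.

    hold mirrors `for: 15m`: the expression must stay true for that long,
    without interruption, before the alert moves from pending to firing.
    """
    states = []
    active_since = None
    for t in eval_times:
        r = simple_rate(samples, t, window)
        if r is not None and r > 0:
            if active_since is None:
                active_since = t
            states.append("firing" if t - active_since >= hold else "pending")
        else:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
    return states


# Counter ticking once per minute for 30 minutes (a sustained deferral).
sustained = [(60 * i, i) for i in range(31)]
# Counter ticking for 5 minutes, then flat (the MR recovered quickly).
brief = [(60 * i, min(i, 5)) for i in range(31)]
evals = list(range(0, 1801, 60))

print(alert_states(sustained, evals)[16])       # state at t=960s
print("firing" in alert_states(brief, evals))   # short blips never fire
```

Under these assumptions a short deferral that clears within a few minutes stays in pending and never pages, while a deferral lasting longer than the 15m hold does.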

Operator runbook (short version)

When this fires:

  1. Check the WebhookManagedResourceUnhealthy alert (also in this file) — should be firing for the same namespace, telling you which condition is bad.
  2. Fix the underlying GRM/MR issue.
  3. The deferred rotation will resume automatically. If the cert is approaching expiry, WebhookCertificateNearExpiry (and ultimately AboutToExpire) fire as the safety net — at which point the escape hatch in webhook-injector will rotate anyway.

Migration note

Why this is a separate PR from #26: the alert was originally added to #26 by accident after that PR had already been merged. The original push targeted the closed branch. This PR cleans that up by re-applying the change on a fresh branch off the post-merge main.

Verification

$ helm lint charts/controlplane-operations
1 chart(s) linted, 0 chart(s) failed

$ helm template ... | promtool check rules /dev/stdin
SUCCESS: 12 rules found

(was 11 on main, +1 from this PR)

Chart + plugindefinition version bumped 1.1.13 → 1.1.14.

webhook-injector now defers a non-force cert rotation (expired,
san_mismatch, etc.) when at least one labeled ManagedResource has
a ResourcesApplied or ResourcesHealthy condition with Status != True.
The cert keeps being served from the existing valid leaf; rotation
resumes automatically when the MR(s) recover.

The new counter increments while that deferral is active. A sustained
non-zero rate over 15m means a labeled MR has been unhealthy long
enough that a cert rotation is now being held back by it. Combined
with the existing WebhookManagedResourceUnhealthy alert, it tells
on-call exactly what to fix. The CertificateAboutToExpire alert
is the safety net if the underlying issue isn't fixed before the
cert reaches NotAfter; at that point the gate's escape hatch lets
the rotation proceed regardless and the counter keeps ticking, so
this alert stays firing throughout the under-duress rotation as
well.

Style follows the existing webhook-injector propagation alerts:
sum by (namespace), 5m rate window, 15m for, warning severity,
label-enriched summary/description naming the affected shoot.

Patch-bump chart + plugin to 1.1.14.

Verified:
- helm lint: clean
- promtool check rules: SUCCESS, 12 rules found (was 11; added one).
@videlov videlov requested a review from a team as a code owner June 26, 2026 12:48
Copilot AI review requested due to automatic review settings June 26, 2026 12:48

Copilot AI left a comment

Pull request overview

Adds a new Prometheus alert to the controlplane-operations Helm chart to detect when webhook-injector is repeatedly deferring certificate rotation due to unhealthy ManagedResources, and bumps the chart/plugin versions accordingly.

Changes:

  • Added WebhookCertificateRotationDeferredByUnhealthyMRs alert to controlplane-remote.yaml (rate-based warning with per-alert disable toggle and standard labels/annotations).
  • Bumped Helm chart version 1.1.13 → 1.1.14.
  • Bumped PluginDefinition and referenced chart version 1.1.13 → 1.1.14.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| charts/controlplane-operations/alerts/controlplane-remote.yaml | Adds the new webhook-injector certificate-rotation deferral alert using the established templating/style in this file. |
| charts/controlplane-operations/Chart.yaml | Chart version bump to publish the new alert change. |
| charts/controlplane-operations/plugindefinition.yaml | Plugin bundle version + referenced chart version bump to match the chart release. |
