Add WebhookCertificateRotationDeferredByUnhealthyMRs alert #27
Open
videlov wants to merge 1 commit into
webhook-injector now defers a non-force cert rotation (expired, san_mismatch, etc.) when at least one labeled ManagedResource has a ResourcesApplied or ResourcesHealthy condition with Status != True. The cert keeps being served from the existing valid leaf; rotation resumes automatically when the MR(s) recover.

The counter webhook_injector_certificate_rotation_deferred_unhealthy_mrs_total increments while that deferral is active. A sustained non-zero rate over 15m means a labeled MR has been unhealthy long enough that a cert rotation is now being held back by it. Combined with the existing WebhookManagedResourceUnhealthy alert, it tells on-call exactly what to fix. The CertificateAboutToExpire alert is the safety net if the underlying issue isn't fixed before the cert reaches NotAfter: at that point the gate's escape hatch lets the rotation proceed regardless and the counter keeps ticking, so this alert stays firing throughout the under-duress rotation as well.

Style follows the existing webhook-injector propagation alerts: sum by (namespace), 5m rate window, 15m for, warning severity, label-enriched summary/description naming the affected shoot.

Patch-bump chart + plugin to 1.1.14.

Verified:
- helm lint: clean
- promtool check rules: SUCCESS, 12 rules found (was 11; added one).
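To make the deferral condition concrete, here is a hedged sketch of a ManagedResource status that would hold back a non-force rotation, assuming the Gardener ManagedResource API (resources.gardener.cloud/v1alpha1). The object name, namespace, label key, reason, and message are hypothetical; the description does not say which label webhook-injector selects on.

```yaml
# Illustrative sketch only; names, label key, reason, and message are hypothetical.
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example-webhook-config            # hypothetical
  namespace: shoot--example--cluster      # hypothetical
  labels:
    example.com/watched-by-webhook-injector: "true"   # hypothetical label key
status:
  conditions:
    - type: ResourcesApplied
      status: "True"
    - type: ResourcesHealthy
      status: "False"                     # Status != True, so the rotation is deferred
      reason: ExampleUnhealthy            # hypothetical
      message: example health check failure   # hypothetical
```

Under the behavior described above, a single condition of either type with a status other than True is enough to defer the rotation until the MR recovers or the cert reaches NotAfter.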
Pull request overview
Adds a new Prometheus alert to the controlplane-operations Helm chart to detect when webhook-injector is repeatedly deferring certificate rotation due to unhealthy ManagedResources, and bumps the chart/plugin versions accordingly.
Changes:
- Added WebhookCertificateRotationDeferredByUnhealthyMRs alert to controlplane-remote.yaml (rate-based warning with per-alert disable toggle and standard labels/annotations).
- Bumped Helm chart version 1.1.13 → 1.1.14.
- Bumped PluginDefinition and referenced chart version 1.1.13 → 1.1.14.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| charts/controlplane-operations/alerts/controlplane-remote.yaml | Adds the new webhook-injector certificate-rotation deferral alert using the established templating/style in this file. |
| charts/controlplane-operations/Chart.yaml | Chart version bump to publish the new alert change. |
| charts/controlplane-operations/plugindefinition.yaml | Plugin bundle version + referenced chart version bump to match the chart release. |
Summary
Adds one new alert to controlplane-remote.yaml, companion to a behavior change in webhook-injector that ships in SAP-cloud-infrastructure/webhook-injector#12.

What the alert tracks
webhook-injector now defers a non-force cert rotation (expired, san_mismatch) when at least one labeled ManagedResource has a ResourcesApplied or ResourcesHealthy condition with Status != True. The cert keeps being served from the existing valid leaf; rotation resumes automatically when MR(s) recover. If the cert reaches NotAfter before that happens, an escape hatch lets the rotation proceed regardless with a louder warning log.
The new counter webhook_injector_certificate_rotation_deferred_unhealthy_mrs_total increments per deferred reconcile pass and per past-NotAfter under-duress rotation. This alert fires on a sustained non-zero rate.

Alert definition
Style matches the other webhook-injector propagation alerts in the same file: sum by (namespace), 5m rate window, dig-based overrides, {{ $labels.namespace }}-enriched summary/description, playbook URL placeholder with #TODO: add playbook.

Operator runbook (short version)
When this fires:
- The WebhookManagedResourceUnhealthy alert (also in this file) should be firing for the same namespace, telling you which condition is bad.
- WebhookCertificateNearExpiry (and ultimately AboutToExpire) fire as the safety net, at which point the escape hatch in webhook-injector will rotate anyway.

Migration note
Why this is a separate PR from #26: the alert was originally added to #26 by accident after that PR had already been merged. The original push targeted the closed branch. This PR cleans that up by re-applying the change on a fresh branch off the post-merge main.

Verification
- helm lint: clean
- promtool check rules: SUCCESS, 12 rules found (was 11 on main, +1 from this PR)
Chart + plugindefinition version bumped 1.1.13 → 1.1.14.
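For readers without the chart open, the rule shape described under Alert definition can be sketched as a plain Prometheus rule file. This is an illustration, not the chart's template: the group name and annotation wording are assumptions, the Helm templating (disable toggle, dig-based overrides, playbook placeholder) is omitted, and the 15m for clause and warning severity are taken from the commit message.

```yaml
# Illustrative sketch only; the real rule lives in controlplane-remote.yaml
# and is wrapped in Helm templating that is not reproduced here.
groups:
  - name: webhook-injector-sketch          # hypothetical group name
    rules:
      - alert: WebhookCertificateRotationDeferredByUnhealthyMRs
        # Non-zero when the deferral counter has been increasing in a namespace.
        expr: |
          sum by (namespace) (
            rate(webhook_injector_certificate_rotation_deferred_unhealthy_mrs_total[5m])
          ) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          # Annotation wording is assumed, not copied from the chart.
          summary: "Certificate rotation deferred in {{ $labels.namespace }}"
          description: "webhook-injector is holding back a certificate rotation in {{ $labels.namespace }} because a labeled ManagedResource is unhealthy."
```

Saved as a standalone file, a sketch like this can be validated with promtool check rules. Inside a Helm template the {{ $labels.namespace }} expressions would need escaping so that Helm does not try to render them itself.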