feat(helm): add readiness probes for plugin sidecars#384
feat(helm): add readiness probes for plugin sidecars#384WentingWu666666 wants to merge 3 commits into
Conversation
|
🤖 Auto-triaged by documentdb-triage-tool. Applied: Reasoningeffort from diff stats (70+0 LOC, 4 files); LLM: Additive opt-in Helm chart improvements (PodDisruptionBudget and plugin probes) as part of GA-readiness audit, touching manifests/templates with no schema or cross-component changes. If a label is wrong, remove it manually and ping |
9f5c3be to
a33d54b
Compare
PodDisruptionBudget (M9 from documentdb#381) ================================== Add an optional PodDisruptionBudget for the operator, gated by podDisruptionBudget.enabled (default: false). Disabled by default because the operator currently ships with replicaCount: 1 and a PDB on a single-replica deployment blocks node drains rather than helping availability. Users running multi-replica with leader election should enable it. Plugin probes (M12 from documentdb#381) ============================= The sidecar-injector and wal-replica deployments are gRPC servers on port 9090 that previously had no probes pods were marked Ready as soon as the container started, regardless of whether the gRPC endpoint was actually serving. Add tcpSocket readiness + liveness probes on port 9090, gated by pluginProbes.enabled (default: true) with tunable initialDelaySeconds, periodSeconds, and failureThreshold. TCP socket probe is used because the plugins do not expose an HTTP health endpoint. The probe verifies the gRPC server is bound and accepting connections. Verified locally on kind: - helm template renders PDB only when enabled; renders probes by default. - helm upgrade applies PDB; second upgrade with --set ... enabled=false removes it as expected. - sidecar-injector pod becomes Ready (TCP probe passes against the actually-running gRPC server). - helm lint clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Add helm-unittest test coverage for two new features: - PDB (11_pdb_test.yaml): 10 tests covering feature gate toggle, metadata/namespace, label selector, and minAvailable/maxUnavailable configurations including percentage support. - Plugin probes (02_sidecar_injector_test.yaml, 03_wal_replica_test.yaml): 6 tests (3 per template) verifying default TCP readiness/liveness probes, disabled probes, and custom probe settings. Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Align the chart with upstream CloudNative-PG practice following review discussion: - Remove the opt-in operator PodDisruptionBudget (templates/11_pdb.yaml, its tests, and values). The operator runs at replicaCount: 1, where a PDB is either a no-op or blocks node drains; CNPG ships no operator PDB. This should be reintroduced alongside multi-replica + leader election. - Drop the livenessProbe from the sidecar-injector and wal-replica plugin deployments, keeping only the TCP readiness probe. This matches the upstream CNPG plugin-barman-cloud, which is readiness-only; a TCP liveness check risks restart loops on a weak signal. - Align readiness initialDelaySeconds default to 10 (CNPG's value). helm lint and helm unittest (88 tests) pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
4a59d93 to
f09b5a9
Compare
There was a problem hiding this comment.
Pull request overview
Adds readiness gating for the CNPG plugin deployments in the Helm chart to avoid startup races where CNPG routes gRPC plugin traffic before the listener is accepting connections (GA audit item M12).
Changes:
- Introduces
pluginProbesvalues (enabled by default) to configure a TCP socket readiness probe on port9090. - Wires the readiness probe into the sidecar-injector and wal-replica plugin Deployments (readiness only; no liveness).
- Extends
helm unittestcoverage to validate default rendering, opt-out behavior, and overridden probe timings.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| operator/documentdb-helm-chart/values.yaml | Adds pluginProbes configuration block with defaults and rationale. |
| operator/documentdb-helm-chart/templates/02_documentdb_sidecar_injector.yaml | Adds conditional TCP readiness probe to the sidecar-injector plugin container. |
| operator/documentdb-helm-chart/templates/03_documentdb_wal_replica.yaml | Adds conditional TCP readiness probe to the wal-replica plugin container. |
| operator/documentdb-helm-chart/tests/02_sidecar_injector_test.yaml | Adds unit tests asserting readiness probe presence/absence and overridden settings. |
| operator/documentdb-helm-chart/tests/03_wal_replica_test.yaml | Adds unit tests asserting readiness probe presence/absence and overridden settings. |
What this PR does
Adds readiness probes for the CNPG plugin sidecars, aligning the chart with upstream CloudNative-PG practice.
Plugin readiness probes (M12)
The sidecar-injector and wal-replica deployments are gRPC servers on port 9090 with no probes today — pods are marked Ready as soon as the container starts, regardless of whether the gRPC endpoint is actually serving. Because CNPG reaches these plugins through a
Service(annotatedcnpg.io/pluginPort: "9090"), an un-gated pod can receive plugin calls before the gRPC listener is bound, causing connection-refused races during startup, upgrades, and reschedules.Adds a
tcpSocketreadiness probe on port 9090, gated bypluginProbes.enabled(default: true) with tunableinitialDelaySeconds/periodSeconds/failureThreshold.This mirrors the upstream CNPG reference plugin (
plugin-barman-cloud), which ships the exact sametcpSocketreadiness probe on 9090. A TCP socket probe is used because the plugins don't expose an HTTP health endpoint; it verifies the gRPC server has bound the port and is accepting connections — better than nothing, not as strong as a real gRPC health check (deferred until the plugins implementgrpc.health.v1).Readiness only, no liveness: like CNPG's plugin, this is intentionally readiness-only. A TCP liveness probe would restart the container on a weak "port open" signal, risking restart loops under load for little benefit.
Dropped from this PR
replicaCount: 1, where a PDB is either a no-op (maxUnavailable/minAvailable: 0) or permanently blocks node drains (minAvailable: 1). Upstream CNPG ships no operator PDB for the same single-replica reason. This should be reintroduced alongside multi-replica + leader election, not ahead of it. See Helm chart improvements for GA readiness #381 (M9) for the rationale.Local verification
helm lintclean.helm templaterenders the TCP readiness probe by default; omits it withpluginProbes.enabled=false.livenessProbeand noPodDisruptionBudgetare rendered.helm unittest— 88 tests pass.Tracking