Skip to content

feat(helm): add readiness probes for plugin sidecars#384

Open
WentingWu666666 wants to merge 3 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/helm-pdb-and-probes
Open

feat(helm): add readiness probes for plugin sidecars#384
WentingWu666666 wants to merge 3 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/helm-pdb-and-probes

Conversation

@WentingWu666666

@WentingWu666666 WentingWu666666 commented May 19, 2026

Copy link
Copy Markdown
Collaborator

Part of the GA-readiness chart audit (#381).

What this PR does

Adds readiness probes for the CNPG plugin sidecars, aligning the chart with upstream CloudNative-PG practice.

Plugin readiness probes (M12)

The sidecar-injector and wal-replica deployments are gRPC servers on port 9090 with no probes today — pods are marked Ready as soon as the container starts, regardless of whether the gRPC endpoint is actually serving. Because CNPG reaches these plugins through a Service (annotated cnpg.io/pluginPort: "9090"), an un-gated pod can receive plugin calls before the gRPC listener is bound, causing connection-refused races during startup, upgrades, and reschedules.

Adds a tcpSocket readiness probe on port 9090, gated by pluginProbes.enabled (default: true) with tunable initialDelaySeconds / periodSeconds / failureThreshold.

This mirrors the upstream CNPG reference plugin (plugin-barman-cloud), which ships the exact same tcpSocket readiness probe on 9090. A TCP socket probe is used because the plugins don't expose an HTTP health endpoint; it verifies the gRPC server has bound the port and is accepting connections — better than nothing, not as strong as a real gRPC health check (deferred until the plugins implement grpc.health.v1).

Readiness only, no liveness: like CNPG's plugin, this is intentionally readiness-only. A TCP liveness probe would restart the container on a weak "port open" signal, risking restart loops under load for little benefit.

Dropped from this PR

  • PodDisruptionBudget (M9) — removed. The operator runs at replicaCount: 1, where a PDB is either a no-op (maxUnavailable/minAvailable: 0) or permanently blocks node drains (minAvailable: 1). Upstream CNPG ships no operator PDB for the same single-replica reason. This should be reintroduced alongside multi-replica + leader election, not ahead of it. See Helm chart improvements for GA readiness #381 (M9) for the rationale.

Local verification

  • helm lint clean.
  • helm template renders the TCP readiness probe by default; omits it with pluginProbes.enabled=false.
  • No livenessProbe and no PodDisruptionBudget are rendered.
  • helm unittest — 88 tests pass.

Tracking

@documentdb-triage-tool documentdb-triage-tool Bot added the enhancement New feature or request label May 19, 2026
@documentdb-triage-tool

Copy link
Copy Markdown

🤖 Auto-triaged by documentdb-triage-tool.

Applied: enhancement
Project fields suggested: Component manifests · Priority P2 · Effort M · Status In Progress
Confidence: 0.85 (mixed)

Reasoning

effort from diff stats (70+0 LOC, 4 files); LLM: Additive opt-in Helm chart improvements (PodDisruptionBudget and plugin probes) as part of GA-readiness audit, touching manifests/templates with no schema or cross-component changes.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

wentingwu000 and others added 3 commits June 23, 2026 12:34
PodDisruptionBudget (M9 from documentdb#381)
==================================

Add an optional PodDisruptionBudget for the operator, gated by
podDisruptionBudget.enabled (default: false). Disabled by default
because the operator currently ships with replicaCount: 1 and a PDB on
a single-replica deployment blocks node drains rather than helping
availability. Users running multi-replica with leader election should
enable it.

Plugin probes (M12 from documentdb#381)
=============================

The sidecar-injector and wal-replica deployments are gRPC servers on
port 9090 that previously had no probes  pods were marked Ready as
soon as the container started, regardless of whether the gRPC endpoint
was actually serving. Add tcpSocket readiness + liveness probes on
port 9090, gated by pluginProbes.enabled (default: true) with tunable
initialDelaySeconds, periodSeconds, and failureThreshold.

TCP socket probe is used because the plugins do not expose an HTTP
health endpoint. The probe verifies the gRPC server is bound and
accepting connections.

Verified locally on kind:
  - helm template renders PDB only when enabled; renders probes by default.
  - helm upgrade applies PDB; second upgrade with --set ... enabled=false
    removes it as expected.
  - sidecar-injector pod becomes Ready (TCP probe passes against the
    actually-running gRPC server).
  - helm lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Add helm-unittest test coverage for two new features:

- PDB (11_pdb_test.yaml): 10 tests covering feature gate toggle,
  metadata/namespace, label selector, and minAvailable/maxUnavailable
  configurations including percentage support.

- Plugin probes (02_sidecar_injector_test.yaml, 03_wal_replica_test.yaml):
  6 tests (3 per template) verifying default TCP readiness/liveness
  probes, disabled probes, and custom probe settings.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Align the chart with upstream CloudNative-PG practice following review
discussion:

- Remove the opt-in operator PodDisruptionBudget (templates/11_pdb.yaml,
  its tests, and values). The operator runs at replicaCount: 1, where a
  PDB is either a no-op or blocks node drains; CNPG ships no operator PDB.
  This should be reintroduced alongside multi-replica + leader election.
- Drop the livenessProbe from the sidecar-injector and wal-replica plugin
  deployments, keeping only the TCP readiness probe. This matches the
  upstream CNPG plugin-barman-cloud, which is readiness-only; a TCP
  liveness check risks restart loops on a weak signal.
- Align readiness initialDelaySeconds default to 10 (CNPG's value).

helm lint and helm unittest (88 tests) pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/helm-pdb-and-probes branch from 4a59d93 to f09b5a9 Compare June 23, 2026 16:34
@WentingWu666666 WentingWu666666 changed the title feat(helm): opt-in PodDisruptionBudget and probes for CNPG plugin sidecars feat(helm): add readiness probes for CNPG plugin sidecars Jun 23, 2026
@WentingWu666666 WentingWu666666 changed the title feat(helm): add readiness probes for CNPG plugin sidecars feat(helm): add readiness probes for plugin sidecars Jun 23, 2026
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 23, 2026 16:38
Copilot AI review requested due to automatic review settings June 23, 2026 16:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds readiness gating for the CNPG plugin deployments in the Helm chart to avoid startup races where CNPG routes gRPC plugin traffic before the listener is accepting connections (GA audit item M12).

Changes:

  • Introduces pluginProbes values (enabled by default) to configure a TCP socket readiness probe on port 9090.
  • Wires the readiness probe into the sidecar-injector and wal-replica plugin Deployments (readiness only; no liveness).
  • Extends helm unittest coverage to validate default rendering, opt-out behavior, and overridden probe timings.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
operator/documentdb-helm-chart/values.yaml Adds pluginProbes configuration block with defaults and rationale.
operator/documentdb-helm-chart/templates/02_documentdb_sidecar_injector.yaml Adds conditional TCP readiness probe to the sidecar-injector plugin container.
operator/documentdb-helm-chart/templates/03_documentdb_wal_replica.yaml Adds conditional TCP readiness probe to the wal-replica plugin container.
operator/documentdb-helm-chart/tests/02_sidecar_injector_test.yaml Adds unit tests asserting readiness probe presence/absence and overridden settings.
operator/documentdb-helm-chart/tests/03_wal_replica_test.yaml Adds unit tests asserting readiness probe presence/absence and overridden settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants