fix(ibmcloud): preserve path component in COS backend URL#830
fix(ibmcloud): preserve path component in COS backend URL#830adrianriobo wants to merge 12 commits into
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
When a backed URL includes a path (e.g. s3://bucket/some/path), the previous code discarded the path and stored Pulumi state at the bucket root. Refactor extractBucket into extractBucketAndPath so the path is carried through to PULUMI_BACKEND_URL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/integrations/gitlab/snippet-linux.sh`:
- Around line 35-36: The /etc/resolv.conf fallback currently excludes only IPv4
loopback (127.*) so IPv6 loopback (::1) can still be captured into _dns_servers;
update the awk filter in the _dns_servers assignment (the line using awk
'/^nameserver/ && $2 !~ /^127\./ {print $2}') to also exclude the IPv6 loopback
(e.g. add a condition like $2 != "::1" or $2 !~ /^::1/), ensuring nameserver
entries equal to ::1 are not included.
- Around line 48-54: The script currently checks for "dns_servers" globally
which can match other sections; change the detection/update logic so it only
looks inside the [containers] section: when checking for existence use a
section-scoped search (e.g., search between the line matching "^\[containers\]"
and the next section header "^\[.*\]") to decide whether to replace dns_servers
or append it after the [containers] header; update the two sed/grep branches
that reference dns_servers and /etc/containers/containers.conf and ensure the
inserted value uses the same ${_toml_list} variable and formatting as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 466885cf-1897-4a3c-8cf6-9db3d3fcc8aa
📒 Files selected for processing (1)
pkg/integrations/gitlab/snippet-linux.sh
Detect the host's upstream DNS servers at runner setup time and write them to /etc/containers/containers.conf so Podman propagates them into every container it creates, including the nested inner containers spawned by `podman build` RUN steps. Without this, inner build containers inherit the systemd-resolved loopback stub (127.0.0.53) which is unreachable from inside a container, causing intermittent "Could not resolve host" failures for external domains. Detection tries resolvectl, nmcli, and /etc/resolv.conf in order, filtering out loopback addresses at each step. The existence check and in-place replacement are scoped to the [containers] section via awk to avoid false-positive matches in other sections (e.g. [network]). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pass --name "{{ .Name }}" to gitlab-runner register in all platform
snippets (Linux, Darwin, Windows). Without this flag the runner defaults
to the system hostname, which is not meaningful in ephemeral environments.
Set the Name field from ProjectName() rather than RunID() so the runner
is identified by the project it belongs to.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deekay2310
left a comment
There was a problem hiding this comment.
Apart from my one comment regarding context.go, we should also address CodeRabbit's comments and I think this PR will be ready to merge.
|
|
||
| func manageIntegration(c *Context, ca *ContextArgs) error { | ||
| if ca.GHRunnerArgs != nil { | ||
| ca.GHRunnerArgs.Name = c.RunID() |
There was a problem hiding this comment.
Should this also be changed to c.ProjectName() just like GLRunnerArgs.name?
There was a problem hiding this comment.
yeah I just found out using the RunID as name is not really helpful so make it with the project name clarifies it, but as this change is GH related can you change it on your PR?
There was a problem hiding this comment.
Sounds good! I'll look into it
When both GitLab runner and OTel are configured, Podman is set to use the journald log driver so CI job container output is captured by systemd journal. The otelcol config gains a journald/gitlab-jobs receiver that parses CONTAINER_NAME to extract job_id, project_id, and runner_token, and the existing filelog/gitlab-runner receiver gets regex_parser operators to promote job= and runner= fields from the runner daemon log body into attributes. Both streams share job_id, enabling cross-stream correlation in any OTel backend. The journald log driver change is gated on a new LogToJournald field in GitLabRunnerArgs, set only when hasOtel is true in the IBM Power and IBM Z providers. The otelcol receivers remain gated on the existing MonitorGitLabRunner flag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng containers.conf
Root cause (confirmed live on s390x runner):
podman build RUN steps run inside a nested container created by inner
Podman inside the docker executor container. Inner Podman (using
Netavark) creates a new bridge for these build-step containers, but the
privileged-but-bridge-networked outer container has no working iptables/
nftables NAT, so packets from the 10.88.0.0/16 bridge subnet can never
reach the DNS servers. The inner containers' /etc/resolv.conf had the
correct nameservers written but DNS queries timed out because they could
not be routed.
The previous chmod 644 and dns_servers fixes correctly configured the
outer executor container but did not propagate into the nested build-step
containers.
Fixes applied:
1. Add netns = "host" to /etc/containers/containers.conf.
When the file is bind-mounted into the executor container (see redhat-developer#3),
inner Podman reads this and creates all build-step containers sharing
the outer container's network namespace instead of a new bridge.
They inherit the working resolv.conf with no NAT required.
2. Guarantee /etc/containers/containers.conf exists before runner
registration even when DNS detection finds no upstream servers, so
the bind mount in redhat-developer#3 always has a real file to attach to.
3. Add --docker-volumes to the runner registration so that every
executor container gets the host's containers.conf bind-mounted at
/etc/containers/containers.conf:ro. This makes the nested inner
Podman pick up both dns_servers and netns = "host" automatically,
without any change to the CI job Dockerfile or podman build invocation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
74eb149 to
16330a5
Compare
…n build Two problems with inner podman build containers inside the CI executor: 1. DNS failed because the executor container's resolv.conf points to a loopback stub (127.0.0.53/systemd-resolved) that is unreachable from inside nested containers. Fixed by detecting real upstream DNS servers via resolvectl/nmcli/resolv.conf and writing them to containers.conf dns_servers so Podman passes working nameservers to inner containers. 2. netns="host" was added as a workaround to give inner containers a routable network stack, but it causes them to share the actual host network namespace — so concurrent CI jobs compete for the same ports (e.g. nginx: bind() to 0.0.0.0:8080 failed: Address already in use). Replaced with sysctl ip_forward=1 so Netavark's bridge NAT works correctly, giving each inner container an isolated namespace while still routing egress through the executor container's interface. The containers.conf is bind-mounted into every executor container via --docker-volumes so Podman inside the CI job inherits dns_servers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…config The mount-data-home.sh script looped forever waiting for a TYPE=mpath block device to appear, with no exit condition and TimeoutStartSec=0 in the systemd unit. On larger LPAR sizes (32+ cores / 192+ GB), IBM Cloud takes longer to establish Fibre Channel paths, so the device can take many minutes to appear. The infinite loop blocked all subsequent runcmd steps (restorecon, runner installation) indefinitely. Two changes: - Explicitly start multipathd before entering the loop so the daemon is guaranteed to be running when we call 'multipathd reconfigure'. - Cap the loop at 60 iterations (~10 minutes). If no multipath device appears in that window, proceed without mounting the data volume so the runner installation can still complete. The instance remains functional on its root filesystem; the data volume can be mounted manually once the FC paths are established. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…urces Adds TagsAsStringArray helper to convert map[string]string (key=value from --tags flag) to the key:value string format expected by IBM Cloud user tags. Threads tags through the network module and both action packages so every taggable resource gets the user-supplied tags: PiVolume, PiInstance (PiUserTags) on PowerVS; IsInstance, IsSshKey, IsVpc, IsSubnet, IsPublicGateway, IsSecurityGroup, IsFloatingIp, and ResourceGroup (Tags) on VPC/IBM Z paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-dns at registration Two improvements to make DNS reliable on all platforms: 1. Add /run/systemd/resolve/resolv.conf as a DNS detection fallback. On Ubuntu 24.04 with systemd-resolved, /etc/resolv.conf points to the 127.0.0.53 stub which is filtered out. The real upstream servers live in /run/systemd/resolve/resolv.conf and are needed for the fix to work. 2. Pass detected DNS servers as --docker-dns when registering the runner. containers.conf dns_servers fixes DNS for nested podman build containers, but Podman's Docker socket API does not always honour it when creating the executor container itself. Passing --docker-dns explicitly in config.toml ensures every job container gets working DNS regardless. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes for persistent DNS failures inside CI executor containers: 1. Public DNS fallback: if resolvectl/nmcli/resolv.conf detect nothing (e.g. on non-standard cloud images), fall back to 8.8.8.8/8.8.4.4. The machine has internet access (it talks to GitLab), so public DNS works. 2. Explicit iptables MASQUERADE for 10.88.0.0/16: on RHEL 9, 'podman system reset --force' disrupts Netavark's firewalld integration. Netavark normally adds the MASQUERADE rule when the first container is created, but the runner resolves DNS at job startup — before any container has ever run. Adding the rule explicitly ensures it is in place before the first CI job. 3. firewall-cmd --add-masquerade: belt-and-suspenders for RHEL/firewalld systems. Enables masquerade at the zone level in addition to the iptables rule, so forwarded container traffic is NATted regardless of how firewalld manages its nftables entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…permanent The sysctl ip_forward settings and firewall-cmd masquerade rule added in the previous fix were transient: sysctl -w is not preserved across reboots, and firewall-cmd without --permanent is lost on firewalld restart. Persist ip_forward via /etc/sysctl.d/99-podman-ip-forward.conf so Podman bridge NAT works correctly after any reboot. Add --permanent to firewall-cmd --add-masquerade and follow it with --reload so the rule is immediately active AND survives across reboots and firewalld restarts on RHEL/Fedora (ppc64le) machines. Non-firewalld systems (Ubuntu/s390x) are unaffected: firewall-cmd fails silently via || true and the iptables POSTROUTING rule covers them. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rk subnet
Root cause (confirmed live on s390x runner):
The host's Podman CNI bridge uses 10.88.0.0/16. When gitlab-runner
creates an executor container (on the host's 10.88.0.0/16 bridge,
e.g. at 10.88.0.22), and the CI job runs 'podman run' inside that
executor, inner Podman (Netavark) creates its own bridge — also at
10.88.0.0/16 — inside the executor's network namespace.
This produces two competing kernel routes inside the executor:
10.88.0.0/16 dev eth0 src 10.88.0.22 (executor → host)
10.88.0.0/16 dev cni-podman0 src 10.88.0.1 (Netavark bridge)
Nested containers land on Netavark's bridge (e.g. at 10.88.0.23).
The duplicate route to 10.88.0.0/16 breaks forwarding: DNS queries
from inner containers to IBM Cloud resolvers (10.130.98.x) have no
unambiguous egress path, causing socket.gaierror [Errno -2].
Note: the executor container's own DNS works fine because it reaches
IBM DNS directly through the host bridge, without going through
Netavark. Only nested 'podman run' containers are affected.
Fix:
Add [network] default_subnet = "192.168.100.0/24" to
/etc/containers/containers.conf, which is bind-mounted into every
executor container via --docker-volumes. Netavark inside the executor
reads this and creates its bridge on 192.168.100.0/24, leaving
10.88.0.0/16 exclusively for the executor's eth0 (host connection).
Forwarding and MASQUERADE then work correctly, giving nested
containers working DNS and internet access.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/provider/ibmcloud/ibmcloud.go (1)
227-231: ⚡ Quick winStabilize tag serialization order to avoid nondeterministic resource diffs.
At Line 229, iterating a Go map directly makes
pulumi.StringArrayorder nondeterministic across runs. Since this helper feeds many IBM Cloud resources, this can cause noisy previews/updates if ordering is treated as significant. Sort keys before append.Proposed fix
+import "sort" ... func TagsAsStringArray(tags map[string]string) pulumi.StringArray { result := make(pulumi.StringArray, 0, len(tags)) - for k, v := range tags { + keys := make([]string, 0, len(tags)) + for k := range tags { + keys = append(keys, k) + } + sort.Strings(keys) + for _, k := range keys { + v := tags[k] result = append(result, pulumi.String(fmt.Sprintf("%s:%s", k, v))) } return result }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/provider/ibmcloud/ibmcloud.go` around lines 227 - 231, The TagsAsStringArray function iterates directly over the tags map, which produces nondeterministic ordering in Go since map iteration order is randomized. To fix this, extract all the keys from the tags map into a slice, sort that slice of keys, and then iterate over the sorted keys in the for loop instead of directly iterating over the map. This ensures the resulting pulumi.StringArray has consistent ordering across multiple runs, eliminating noisy diffs in resource previews and updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/integrations/gitlab/snippet-linux.sh`:
- Around line 84-89: The grep check on line 84 searches for dns_options globally
in the file without scoping it to the [containers] section, and the subsequent
sed replacement on line 85 also operates globally. This can cause unintended
modifications to dns_options in other sections or leave [containers].dns_options
unset. Modify the grep pattern to specifically check for dns_options within the
[containers] section (for example, by using a pattern that matches dns_options
after the [containers] header), and update the sed command on line 85 to scope
the replacement to only the [containers] section to ensure consistent Podman DNS
configuration.
In `@pkg/provider/ibmcloud/services/power/power.go`:
- Around line 84-96: The loop that calls vc.Get(volumeId) currently retries on
all errors for the full duration, but should fail fast on permanent failures.
Modify the error handling block (the else clause where err is not nil) to
inspect the underlying power-go-client error for its HTTP status code. If the
error is a non-retriable 4xx response, return the error immediately to fail
fast. If the error is transient (5xx or 429), continue with the existing retry
logic that sleeps and retries. This same fix should be applied to the similar
error handling pattern in the section that also handles vc.Get errors (lines
97-104 and beyond).
---
Nitpick comments:
In `@pkg/provider/ibmcloud/ibmcloud.go`:
- Around line 227-231: The TagsAsStringArray function iterates directly over the
tags map, which produces nondeterministic ordering in Go since map iteration
order is randomized. To fix this, extract all the keys from the tags map into a
slice, sort that slice of keys, and then iterate over the sorted keys in the for
loop instead of directly iterating over the map. This ensures the resulting
pulumi.StringArray has consistent ordering across multiple runs, eliminating
noisy diffs in resource previews and updates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 996c0259-0ea3-4fe6-9953-7253897ca76e
📒 Files selected for processing (11)
pkg/integrations/gitlab/glrunner.gopkg/integrations/gitlab/snippet-linux.shpkg/integrations/gitlab/types.gopkg/integrations/integrations.gopkg/integrations/otelcol/snippet-linux.shpkg/provider/ibmcloud/action/ibm-power/cloud-configpkg/provider/ibmcloud/action/ibm-power/ibm-power.gopkg/provider/ibmcloud/action/ibm-z/ibm-z.gopkg/provider/ibmcloud/ibmcloud.gopkg/provider/ibmcloud/modules/network/network.gopkg/provider/ibmcloud/services/power/power.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/integrations/integrations.go
- pkg/integrations/gitlab/types.go
- pkg/integrations/gitlab/glrunner.go
- pkg/integrations/otelcol/snippet-linux.sh
| if grep -q '^dns_options' /etc/containers/containers.conf; then | ||
| sudo sed -i 's|^dns_options.*|dns_options = ["timeout:2", "attempts:5", "single-request"]|' \ | ||
| /etc/containers/containers.conf | ||
| else | ||
| sudo sed -i '/^\[containers\]/a dns_options = ["timeout:2", "attempts:5", "single-request"]' \ | ||
| /etc/containers/containers.conf |
There was a problem hiding this comment.
Scope dns_options edits to the [containers] section.
Line 84 checks dns_options globally, and Line 85 replaces it globally. If another section defines dns_options, [containers].dns_options may remain unset (or unrelated sections get modified), causing inconsistent Podman DNS behavior.
Proposed fix
- # Add or update dns_options within [containers]
- if grep -q '^dns_options' /etc/containers/containers.conf; then
- sudo sed -i 's|^dns_options.*|dns_options = ["timeout:2", "attempts:5", "single-request"]|' \
- /etc/containers/containers.conf
+ # Add or update dns_options within [containers]
+ if awk '/^\[containers\]/{f=1;next} /^\[/{f=0} f && /^[[:space:]]*dns_options[[:space:]]*=/{found=1} END{exit !found}' \
+ /etc/containers/containers.conf; then
+ sudo sed -i '/^\[containers\]/,/^\[/{s|^[[:space:]]*dns_options[[:space:]]*=.*|dns_options = ["timeout:2", "attempts:5", "single-request"]|}' \
+ /etc/containers/containers.conf
else
sudo sed -i '/^\[containers\]/a dns_options = ["timeout:2", "attempts:5", "single-request"]' \
/etc/containers/containers.conf
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if grep -q '^dns_options' /etc/containers/containers.conf; then | |
| sudo sed -i 's|^dns_options.*|dns_options = ["timeout:2", "attempts:5", "single-request"]|' \ | |
| /etc/containers/containers.conf | |
| else | |
| sudo sed -i '/^\[containers\]/a dns_options = ["timeout:2", "attempts:5", "single-request"]' \ | |
| /etc/containers/containers.conf | |
| # Add or update dns_options within [containers] | |
| if awk '/^\[containers\]/{f=1;next} /^\[/{f=0} f && /^[[:space:]]*dns_options[[:space:]]*=/{found=1} END{exit !found}' \ | |
| /etc/containers/containers.conf; then | |
| sudo sed -i '/^\[containers\]/,/^\[/{s|^[[:space:]]*dns_options[[:space:]]*=.*|dns_options = ["timeout:2", "attempts:5", "single-request"]|}' \ | |
| /etc/containers/containers.conf | |
| else | |
| sudo sed -i '/^\[containers\]/a dns_options = ["timeout:2", "attempts:5", "single-request"]' \ | |
| /etc/containers/containers.conf | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/integrations/gitlab/snippet-linux.sh` around lines 84 - 89, The grep
check on line 84 searches for dns_options globally in the file without scoping
it to the [containers] section, and the subsequent sed replacement on line 85
also operates globally. This can cause unintended modifications to dns_options
in other sections or leave [containers].dns_options unset. Modify the grep
pattern to specifically check for dns_options within the [containers] section
(for example, by using a pattern that matches dns_options after the [containers]
header), and update the sed command on line 85 to scope the replacement to only
the [containers] section to ensure consistent Podman DNS configuration.
| for i := 0; i < 60; i++ { // up to ~10 minutes | ||
| vol, err := vc.Get(volumeId) | ||
| if err == nil { | ||
| switch vol.State { | ||
| case "available", "in-use": | ||
| logging.Infof("volume %s is in state %q, proceeding with attachment", volumeId, vol.State) | ||
| return volumeId, nil | ||
| default: | ||
| logging.Infof("volume %s state: %q, retrying in 10s...", volumeId, vol.State) | ||
| } | ||
| } else { | ||
| logging.Infof("volume %s GET returned error (retrying in 10s): %v", volumeId, err) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual code in the file
cat -n pkg/provider/ibmcloud/services/power/power.go | sed -n '79,104p'Repository: redhat-developer/mapt
Length of output: 1114
🏁 Script executed:
# Let's also check the imports and understand what vc (volumeClient) is
head -50 pkg/provider/ibmcloud/services/power/power.goRepository: redhat-developer/mapt
Length of output: 1428
🏁 Script executed:
# Search for volumeClient definition and initialization
rg -n "volumeClient\|vc\.Get\|IBMPIVolumeClient" pkg/provider/ibmcloud/services/power/power.go -A 3 -B 1Repository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Check if there are any error type definitions or status codes being handled
rg -n "error\|Error\|status" pkg/provider/ibmcloud/services/power/power.go | head -30Repository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Let's find the dependency on power-go-client
rg "github.com/IBM-Cloud/power-go-client" go.modRepository: redhat-developer/mapt
Length of output: 111
🏁 Script executed:
# Find the volumeClient function definition
rg -n "func volumeClient" pkg/provider/ibmcloud/services/power/power.go -A 15Repository: redhat-developer/mapt
Length of output: 689
🏁 Script executed:
# Check for any error handling patterns in the file
rg -n "http\|status\|StatusCode\|Type\|errors\." pkg/provider/ibmcloud/services/power/power.go -iRepository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Search for similar retry patterns in other files for context
rg -n "for.*i.*<.*60\|time\.After.*10" pkg/provider/ibmcloud/ -lRepository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Look for error classification/handling in other power-related files
find pkg/provider/ibmcloud -name "*.go" -type f | head -10Repository: redhat-developer/mapt
Length of output: 520
🏁 Script executed:
# Check if there's any existing error classification in the codebase
rg -n "IsRetryable\|Retryable\|Transient\|Permanent" pkg/provider/ibmcloud/ -iRepository: redhat-developer/mapt
Length of output: 47
🌐 Web query:
power-go-client IBMPIVolumeClient error handling retriable
💡 Result:
In IBM-Cloud/power-go-client, errors returned by the lower-level API calls (including those from the IBMPIVolumeClient volume-related methods) are wrapped via ibmpisession.SDKFailWithAPIError when the SDK call returns a non-nil error (example shown in ibm-pi-instance.go: it returns SDKFailWithAPIError(err, …) on err!= nil). [1] What this means for “retriable” handling: 1) power-go-client itself does not expose (in the reviewed code) a per-method “IsRetriable” hook specifically for IBMPIVolumeClient errors; instead, retry behavior is governed by the shared IBM Go SDK HTTP/retry mechanisms (ibm-cloud-sdk-common / ibm/go-sdk-core) and its generic error handling. The project is built on IBM SDK core and supports “automatic retries” as described in IBM’s SDK common documentation. [2] 2) Therefore, whether a particular IBMPIVolumeClient error is retried depends on how the IBM SDK core classifies the underlying HTTP/API error as retryable (e.g., throttling/rate-limit, transient network/5xx patterns), not on any IBMPIVolumeClient-specific “retriable error” implementation. Practical guidance (how to handle retriable errors in your code): - Enable/adjust SDK automatic retries at the IBM SDK core/client level (documented as “Automatic retries” in ibm-cloud-sdk-common). [2] - If you need explicit application-level retry decisions for power-go-client errors, you will need to inspect the error type/content you get from SDKFailWithAPIError (e.g., underlying HTTP status / error payload) and implement your own retry policy in your caller. If you can share the exact IBMPIVolumeClient method name and the error you’re seeing (HTTP status code, message), I can pinpoint whether that specific class of failure is typically treated as retryable by the IBM SDK core layer and give a concrete snippet for retry decision logic.
Citations:
- 1: https://github.com/IBM-Cloud/power-go-client/blob/master/clients/instance/ibm-pi-instance.go
- 2: https://github.com/IBM/ibm-cloud-sdk-common/blob/master/README.md
🌐 Web query:
IBM Cloud Power API error codes status codes
💡 Result:
IBM Cloud Power (PowerVS/Power Cloud) API “error codes / status codes” are primarily the standard HTTP status codes returned by the API, plus a service-specific error payload (for Power Cloud endpoints you’ll commonly see values like pcloudPvminstancesActionPostBadRequest with a description/message). IBM’s API handbook states that HTTP responses use standard 2xx/4xx/5xx categories, and when a 4xx/5xx status code is returned the response body uses an error container model. [1][2][3] 1) HTTP status codes (what you should key off of) - 4xx client errors: these indicate the request is invalid or cannot be fulfilled due to client-side issues; IBM documents the general meaning of 4xx codes and notes common ones like 404/405/409/etc. [1] - 5xx server errors: these indicate server-side failure; IBM documents general meaning of 5xx codes like 500/504/505/etc. [1] - IBM specifically says that if the service returns a 400-series or 500-series status, the response body must be an error container model. [2][3] 2) Service-specific PowerVS/Power Cloud error codes you’ll see in practice (examples) In PowerVS API calls under the /pcloud/... namespace, common failures surface as a combination of HTTP status + an error “title”/reason string (e.g., pcloudPvminstancesActionPostBadRequest) and a description message. - 400 Bad Request (example: action not allowed due to current state) - Starting a Power Virtual Server instance that is already in ACTIVE state can return HTTP 400 with a message like “is already in an ACTIVE state and cannot be started.” [4] - 400 Bad Request (example: invalid/required parameter missing) - Creating a PVM instance can return HTTP 400 where the error description mentions that a calculated parameter failed (for example, “storageType is required when not using a storage affinity policy”). [5] - 400 Bad Request (example: operation not supported in environment) - Updating a PVM instance can return HTTP 400 with “cloudInitialisation is not supported in this environment.” [6] - 409 Conflict (example: cannot attach volume when instance health/status not OK) - Volume attachment can return HTTP 409 conflict, with a message indicating the volume can’t be attached due to the instance health not being “OK” (e.g., health=WARNING). [7] - 409 Conflict (example: attach fails due to instance/VM state constraints) - The Terraform-provider issue for attachment shows an HTTP 409 conflict response for volume attachment to a PVM instance. [7] - 400 Bad Request (example: clone cancel only allowed in certain states) - Cancelling a volume clone request that is already in “completed” state returns HTTP 400, with a message indicating it must be in “prepared or available” to cancel. [8] 3) How to interpret “status_code” vs “error code” - IBM’s API handbook distinguishes HTTP status codes (what the HTTP layer returns) from the structured error payload that accompanies 4xx/5xx responses. The error container model includes an errors field with objects that include a code and message, and it may include status_code and trace. [2][9] - For Power Cloud/PowerVS, the examples you’ll encounter in client output show both: a) The HTTP status (400/409, etc.) b) A service-generated error key/name and a human-readable Description/Message. [8][4][5][7] Practical takeaway - If you’re mapping “Power API error codes” to a decision system, start with the HTTP status code (400 vs 409 vs 5xx) as the primary discriminator, then inspect the Power Cloud error payload fields (code/message/description) for the specific root cause (e.g., “already in ACTIVE”, “cannot cancel because state is completed”, “cloudInitialisation not supported”, “cannot attach because health not OK”). [1][2][4][8][7]
Citations:
- 1: https://cloud.ibm.com/docs/api-handbook?topic=api-handbook-status-codes
- 2: https://cloud.ibm.com/docs/api-handbook?topic=api-handbook-errors
- 3: https://github.com/ibm-cloud-docs/api-handbook/blob/086d28c9357b39612ceb816130d0f6ad92f82859/design/errors.md
- 4: PowerVS start and possibly stop actions failing when resource already in the desired state IBM-Cloud/terraform-provider-ibm#4400
- 5: PowerVS terraform deployment failing IBM-Cloud/terraform-provider-ibm#2787
- 6: High Priority: ibm_pi_instance fails to create instance. IBM-Cloud/terraform-provider-ibm#4905
- 7: Volume attachment to PVS Instances not working IBM-Cloud/terraform-provider-ibm#2810
- 8: https://community.ibm.com/community/user/blogs/amritanshu-verma/2026/06/11/ibm-power-virtual-server-volume-clone-lifecycle
- 9: https://cloud.ibm.com/docs/vpc?topic=vpc-instance-metadata-error-codes
🏁 Script executed:
# Check if there are examples or tests showing how errors are handled
rg -n "Get\|error" pkg/provider/ibmcloud/action/ibm-power/ -A 2 -B 1 | head -40Repository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Look for any existing patterns of error classification in the codebase
rg -n "switch.*err\|type.*error" pkg/provider/ibmcloud/ -lRepository: redhat-developer/mapt
Length of output: 47
🏁 Script executed:
# Check go.mod for error handling related dependencies
cat go.mod | grep -i "error\|retry\|backoff"Repository: redhat-developer/mapt
Length of output: 321
Fail fast on non-retriable volume lookup failures.
Lines 84-96 currently retry all vc.Get(volumeId) errors for ~10 minutes. Permanent failures (e.g., invalid volume ID, permission errors, volume not found) should return immediately; otherwise this adds avoidable timeout delays and masks root causes during incident triage.
Classify errors by HTTP status code and retry only transient ones (5xx, 429), while failing fast on non-retriable 4xx responses. The underlying power-go-client error can be inspected for the HTTP status to make this determination.
Also applies to: 97-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/provider/ibmcloud/services/power/power.go` around lines 84 - 96, The loop
that calls vc.Get(volumeId) currently retries on all errors for the full
duration, but should fail fast on permanent failures. Modify the error handling
block (the else clause where err is not nil) to inspect the underlying
power-go-client error for its HTTP status code. If the error is a non-retriable
4xx response, return the error immediately to fail fast. If the error is
transient (5xx or 429), continue with the existing retry logic that sleeps and
retries. This same fix should be applied to the similar error handling pattern
in the section that also handles vc.Get errors (lines 97-104 and beyond).
When a backed URL includes a path (e.g. s3://bucket/some/path), the previous code discarded the path and stored Pulumi state at the bucket root. Refactor extractBucket into extractBucketAndPath so the path is carried through to PULUMI_BACKEND_URL.
Fixes #829
Fixes #832