Foundry private networking for Azure AI agents #8708

Draft
m5i-work wants to merge 12 commits into huimiu/foundry-azure-yaml from m5i/foundry-private-network

m5i-work commented Jun 18, 2026

Summary

Adds declarative private networking support for host: microsoft.foundry services in the Azure AI agents extension.

This PR teaches the in-memory Foundry synthesizer to provision a network-bound Foundry account/project from azure.yaml:

services:
  agent:
    host: microsoft.foundry
    network:
      mode: byo
      byo:
        vnet:
          id: ${AZURE_VNET_ID}
        agentSubnet:
          name: agent-subnet
          prefix: 192.168.10.0/24
        peSubnet:
          name: pe-subnet
          prefix: 192.168.11.0/24

It also adds schema/docs/tests and an E2E harness for validating the private-networking scenarios.

Changes

  • Add network: schema support for Foundry-hosted services.
  • Synthesize Bicep for BYO VNet mode:
    • delegated hosted-agent subnet
    • private endpoint subnet
    • Foundry account network isolation
    • Foundry private endpoint
    • private DNS zones for AI Services/OpenAI/Cognitive Services
  • Preserve ${VAR} placeholders during azd ai agent init --infra eject; resolve only at provision time.
  • Fix live-deployment-only template issue: avoid unresolved module reference() in networkInjections.subnetArmId by using a deterministic subnet ARM id string.
  • Fix ARM output casing restoration for AZURE_FOUNDRY_NETWORK_MODE.
  • Add E2E harness under cli/azd/extensions/azure.ai.agents/test/e2e/network/.
  • Add optional ABAC ACR image-build path for private-runtime validation.
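
The subnet-id fix in the list above comes down to composing the child resource id from the VNet id instead of reading it from a module output. The actual change lives in Bicep; the following is only a Python sketch of the same idea, and the helper name `subnet_arm_id` and the sample ids are hypothetical.

```python
def subnet_arm_id(vnet_id: str, subnet_name: str) -> str:
    """Compose a subnet ARM id from a concrete VNet ARM id (hypothetical helper)."""
    # A subnet is a child resource of its VNet, so its id is the parent id
    # plus "/subnets/<name>". No deployment-time reference() is involved.
    return f"{vnet_id.rstrip('/')}/subnets/{subnet_name}"


# Placeholder ids for illustration only.
vnet = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-demo/providers/Microsoft.Network/virtualNetworks/demo-vnet"
)
print(subnet_arm_id(vnet, "agent-subnet"))
```

Because the string is known before deployment starts, a resource-provider preflight sees a concrete value; ordering against the subnet still has to be expressed separately.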

Documentation

E2E validation performed

Validated the feature end-to-end against live Azure using a private-networked Foundry account and, for runtime validation, a private-only ABAC ACR.

Scenario 1: Provision private-networked Foundry from azure.yaml

Validated both BYO VNet mode and Microsoft-managed VNet mode.

For BYO VNet mode, validated that a network: block on a Foundry-hosted service provisions:

  • Foundry account with public access disabled
  • BYO VNet integration
  • delegated hosted-agent subnet
  • private endpoint subnet
  • Foundry private endpoint
  • required private DNS zones and VNet links
  • AZURE_FOUNDRY_NETWORK_MODE=byo

For managed VNet mode, validated:

  • networkInjections.useMicrosoftManagedNetwork=true
  • Foundry data plane remains publicly reachable for azd deploy / invoke because managed mode does not create a customer private endpoint
  • AZURE_FOUNDRY_NETWORK_MODE=managed
  • AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound

Also validated the matrix of create/reference subnet modes and own/reference DNS modes.

Scenario 2: Eject preserves private-networking configuration

Validated that azd ai agent init --infra ejects equivalent Bicep and preserves ${VAR} placeholders, so values such as ${AZURE_VNET_ID} are resolved at provision time rather than frozen during eject.

The ejected template was verified to be idempotent against the already-provisioned private-networked account.

Scenario 3: Deploy and invoke hosted agents

Validated real hosted-agent deploy and invoke for both network modes.

For BYO VNet mode, validated deploy/invoke with:

  • Foundry public access disabled
  • Foundry private endpoint only
  • ACR public access disabled
  • ABAC-enabled ACR
  • ACR private endpoint
  • private DNS for both Foundry and ACR
  • deploy/invoke client running from a peered private-network runner VM

For managed VNet mode, validated deploy/invoke with:

  • Microsoft-managed network injection
  • Foundry public data-plane access enabled
  • ABAC-enabled ACR
  • image built with az acr build --source-acr-auth-id [caller]
  • Foundry project MI granted Container Registry Repository Reader

In both cases, the hosted agent reached active status and azd ai agent invoke returned the expected echo response.

Validated VNet topology

Workload VNet: azdnetp5one1519-vnet 192.168.0.0/16
├── pe-subnet 192.168.11.0/24
│   ├── Foundry private endpoint
│   └── ACR private endpoint
├── agent-subnet 192.168.10.0/24
│   └── delegated to Microsoft.App/environments
├── ref-pe-subnet 192.168.20.0/24
└── ref-agent-subnet 192.168.21.0/24
    └── delegated to Microsoft.App/environments

Runner VNet: azdnetp5one1519-runner-eastus2 10.240.0.0/16
└── runner-subnet 10.240.1.0/24
    └── runner VM

Peering:
azdnetp5one1519-runner-eastus2 <-> azdnetp5one1519-vnet

Private DNS zones linked to both VNets:

privatelink.services.ai.azure.com
privatelink.openai.azure.com
privatelink.cognitiveservices.azure.com
privatelink.azurecr.io
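
The address plan above can be sanity-checked offline. This is a small sketch using Python's standard `ipaddress` module with the prefixes copied from the topology; the function name `check_plan` is hypothetical and is not part of the harness. It checks that every subnet sits inside its VNet, that workload subnets do not overlap, and that the two peered VNets have disjoint address spaces (which VNet peering requires).

```python
import ipaddress

workload_vnet = ipaddress.ip_network("192.168.0.0/16")
workload_subnets = {
    "pe-subnet": "192.168.11.0/24",
    "agent-subnet": "192.168.10.0/24",
    "ref-pe-subnet": "192.168.20.0/24",
    "ref-agent-subnet": "192.168.21.0/24",
}
runner_vnet = ipaddress.ip_network("10.240.0.0/16")
runner_subnet = ipaddress.ip_network("10.240.1.0/24")


def check_plan() -> list[str]:
    """Return a list of problems found in the address plan (empty if none)."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in workload_subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(workload_vnet):
            problems.append(f"{name} is outside the workload VNet")
    names = sorted(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} overlaps {b}")
    if not runner_subnet.subnet_of(runner_vnet):
        problems.append("runner-subnet is outside the runner VNet")
    if workload_vnet.overlaps(runner_vnet):
        problems.append("peered VNets have overlapping address spaces")
    return problems


print(check_plan() or "address plan is consistent")
```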

Notable findings from validation

  • BYO VNet deploy/invoke must run from a host with private endpoint reachability; public-internet clients correctly fail when Foundry public access is disabled.
  • Managed VNet mode does not create a customer private endpoint, so the Foundry data plane remains public while the hosted-agent runtime uses Microsoft-managed network injection.
  • Hosted-agent image pull uses the Foundry project MI, not the parent account MI.
  • Private-only ACR runtime pull requires Premium ACR, an ACR private endpoint, privatelink.azurecr.io, and Container Registry Repository Reader on the Foundry project MI.
  • Building into an ABAC-enabled ACR requires the caller to have Container Registry Repository Writer; the harness uses az acr build --source-acr-auth-id [caller].

Reviewer checklist

  • Review schema shape for network:.
  • Review synthesized Bicep and generated ARM JSON.
  • Review project MI vs account MI ACR pull grant in the E2E harness.
  • Review eject-time ${VAR} preservation behavior.
  • Review whether private ACR / VM runner support should remain as E2E harness-only documentation or become fully automated in a follow-up.

m5i-work added 8 commits June 18, 2026 13:28
Add a declarative network: block to the Foundry service in azure.yaml and
teach the bicep-less synthesizer to provision a VNet-bound (network-secured)
account from it. Additive: an absent block yields today's public account.

- schema: network: surface (mode byo|managed, byo vnet/subnets tri-state,
  managed isolationMode, dns create-or-reference) on microsoft.foundry.json
- synthesizer: decode network:, resolve ${VAR}, validate (mode coherence,
  vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params +
  NetworkMode for telemetry
- templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep;
  resources.bicep/main.bicep guard the network path on enableNetworkIsolation
  (publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections,
  account private endpoint + 3 AI DNS zones); main.arm.json regenerated
- provider: pass azd env for ${VAR} resolution, emit provision.network_mode,
  warn that network: is ignored when endpoint: (brownfield) is set
- docs/tests: synthesizer network tests, eject module assertions, extension
  README network section, telemetry-data.md provision.network_mode
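
To illustrate the kind of validation listed in this commit (mode coherence, VNet ARM id, subnet CIDR), here is a minimal Python sketch. It is not the extension's synthesizer code: the function name `validate_network`, the regular expression, the error strings, and the assumption that managed settings live under a `managed:` key are all hypothetical.

```python
import ipaddress
import re

# Assumed shape of a VNet ARM id; the extension's real check may differ.
VNET_ID_RE = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+"
    r"/providers/Microsoft\.Network/virtualNetworks/[^/]+$",
    re.IGNORECASE,
)


def validate_network(block: dict) -> list[str]:
    """Return validation errors for a decoded network: block (hypothetical)."""
    mode = block.get("mode")
    if mode not in ("byo", "managed"):
        return [f"unsupported mode: {mode!r}"]
    errors = []
    # Mode coherence: settings for the other mode must not be present.
    other = "managed" if mode == "byo" else "byo"
    if other in block:
        errors.append(f"'{other}' settings are not valid when mode is '{mode}'")
    if mode == "byo":
        byo = block.get("byo", {})
        if not VNET_ID_RE.match(byo.get("vnet", {}).get("id", "")):
            errors.append("byo.vnet.id is not a VNet ARM id")
        for key in ("agentSubnet", "peSubnet"):
            prefix = byo.get(key, {}).get("prefix")
            if prefix is None:
                continue  # assumed: no prefix when an existing subnet is referenced
            try:
                ipaddress.ip_network(prefix)
            except ValueError:
                errors.append(f"byo.{key}.prefix is not a valid CIDR: {prefix}")
    return errors
```

Collecting every error in one pass, rather than failing on the first, lets a user fix the whole block in a single edit.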

The existing on-disk provision flow resolves ${VAR} in main.parameters.json
from the azd environment at provision time. Eject must therefore keep ${VAR}
references verbatim instead of resolving them eagerly from the process env and
freezing a literal into the ejected file.

- synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and
  dns.subscription pass through verbatim and the format checks that cannot run
  on an unexpanded placeholder are skipped (concrete-but-malformed still fails)
- eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json
  stays environment-portable; the provision path still resolves and validates
- tests: synthesizer preserve-mode (pass-through + concrete validation) and an
  eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive
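
The preserve-versus-resolve behavior described in this commit can be sketched as follows. This is an illustrative Python model, not the Go implementation behind `PreserveVarRefs`; the function names, the placeholder pattern, and the choice to expand unknown variables to an empty string are assumptions.

```python
import re

# Assumed placeholder syntax: ${NAME}
VAR_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(value: str, env: dict[str, str], preserve_var_refs: bool) -> str:
    """Expand ${VAR} from env, or pass the value through verbatim when preserving."""
    if preserve_var_refs:
        return value  # eject path: keep placeholders so the file stays portable
    # Unknown variables expand to "" here; that is an assumption of this sketch.
    return VAR_REF.sub(lambda m: env.get(m.group(1), ""), value)


def should_format_check(value: str, preserve_var_refs: bool) -> bool:
    """Skip format checks only for unexpanded placeholders in preserve mode."""
    # A concrete value is always checked, so concrete-but-malformed still fails.
    return not (preserve_var_refs and VAR_REF.search(value) is not None)


env = {"AZURE_VNET_ID": "/subscriptions/demo/example-vnet-id"}
print(resolve("${AZURE_VNET_ID}", env, preserve_var_refs=True))   # eject
print(resolve("${AZURE_VNET_ID}", env, preserve_var_refs=False))  # provision
```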

Bash E2E harness validating host: microsoft.foundry private networking,
designed to minimize Azure resource-operation time:

- ONE real network account is provisioned (create+own matrix cell) with a BYO
  --image agent, then deploy + invoke prove the agent works under the VNet.
- Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference
  x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if),
  which creates nothing.
- Scenario 2 (eject) is verified against the live account: eject -> what-if
  reports no changes (idempotent), proving the on-disk template + provision-time
  ${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then
  surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end.
- A shared BYO VNet (+ reference subnets / external DNS zones) is created once
  and reused across cells.

Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live
az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS
zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation),
README.md (cost rationale, prerequisites, cleanup). Westus account region per
requirement; AcrPull granted to the project MI on the ABAC registry.
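
The cost-saving plan in this commit (one real provision, what-if for the rest) can be written down as a small enumeration. The harness itself is Bash; this Python sketch only models the matrix, and the names `plan_cells` and `REAL_CELL` are hypothetical.

```python
from itertools import product

SUBNET_MODES = ("create", "reference")
DNS_MODES = ("own", "reference")
REAL_CELL = ("create", "own")  # the single cell that is provisioned for real


def plan_cells() -> list[tuple[str, str, str]]:
    """Return (subnet_mode, dns_mode, action) for every matrix cell."""
    return [
        (subnet, dns, "provision" if (subnet, dns) == REAL_CELL else "what-if")
        for subnet, dns in product(SUBNET_MODES, DNS_MODES)
    ]


for cell in plan_cells():
    print(*cell)
```

Only one of the four cells creates Azure resources; the other three are checked with a preview that creates nothing.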

Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it
runs against the current branch today:

- Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture
  (foundry service + network: block + agent image:), created via 'azd env new'.
  image: yields includeAcr=false, matching BYO image, so no ACR at provision.
  Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true,
  includeAcr=false, ${VAR} resolved.
- Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image
  deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise
  deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet,
  what-if matrix, one real provision, eject idempotency) validate all the
  networking code without it.
- Fix the ABAC registry role: grant 'Container Registry Repository Reader'
  (ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase.
- Drop the --image preflight; README updated (scenario table, prerequisites,
  RUN_DEPLOY usage, role).

…sing

Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch
either; both require a real deployment):

1. networkInjections preflight failure. The account and the network module
   deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId
   compiled to an unresolved reference() at the CognitiveServices RP preflight,
   which then failed to convert networkInjections to its typed contract
   (InvalidResourceProperties). Build the subnet ARM id as a deterministic
   string from the concrete vnetId param instead, and add an explicit dependsOn
   so the subnet still exists before injection. Recompiled main.arm.json.

2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles
   output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical
   remapping the env key was stored mis-cased and azd env get-value
   AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a
   regression case to TestArmOutputsToProto_RepairsMangledKeyCase.

Validated end-to-end: real westus network-isolated Foundry account provisions
green with all topology assertions passing (publicNetworkAccess Disabled,
networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones,
network mode byo), across the full subnet create/reference x DNS own/reference
matrix, plus eject idempotency (what-if reports no changes).
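
The casing repair from item 2 can be modeled as a case-insensitive lookup against a list of canonical names. This is a Python sketch of the idea, not the provider's Go code; `restore_output_case` is a hypothetical name, and the mis-cased key below is an illustrative example following the pattern quoted in the commit message.

```python
# Canonical names the provider should restore (one entry shown for illustration).
CANONICAL_OUTPUT_NAMES = ["AZURE_FOUNDRY_NETWORK_MODE"]


def restore_output_case(outputs: dict[str, str]) -> dict[str, str]:
    """Map mis-cased output keys back to their canonical spelling."""
    canonical = {name.lower(): name for name in CANONICAL_OUTPUT_NAMES}
    # Keys that are not on the list pass through unchanged.
    return {canonical.get(key.lower(), key): value for key, value in outputs.items()}


mangled = {"azurE_FOUNDRY_NETWORK_MODE": "byo"}
print(restore_output_case(mangled))
```

A name missing from the list is stored under whatever casing came back, which is why a lookup by the canonical key returned empty before the fix.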

Fixes found while running the harness against live Azure (phases 0-4):

- Hand-authored project must include an agent.yaml (kind: hosted + image:)
  alongside azure.yaml; the foundry provider requires an agent definition file.
- setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template
  creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true
  (BYO-image deploy signal).
- Phase 0 refreshes the dev extension from current source
  (build -> pack -> publish -> install) so the run tests local code, registering
  the provisioning-provider capability + microsoft.foundry provider. Gated by
  SKIP_EXT_REFRESH.
- What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping
  a summary-only preview; this still validates reference-mode subnet/zone
  existence and delegation against the real VNet.
- Idempotent private-dns zone creation (reruns no longer fail on existing zones).
- Add MAX_PHASE to stop early while iterating.
- ACR grant uses the ABAC-aware Container Registry Repository Reader role.
- Fix set -u unbound-variable crash in the phase-4 assert message.
- .gitignore the transient per-run log directories.

Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision +
topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke)
stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image.
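
The two gates mentioned in this commit, `MAX_PHASE` and `RUN_DEPLOY`, combine as in the sketch below. It is a Python model of the selection logic only; the real orchestrator is the Bash script, and the function name, the defaults, and the assumption that phase 6 still runs when phase 5 is skipped are hypothetical.

```python
def phases_to_run(max_phase: int = 6, run_deploy: bool = False) -> list[int]:
    """Phases 0..max_phase, with phase 5 (deploy + invoke) only when run_deploy is set."""
    return [p for p in range(0, max_phase + 1) if p != 5 or run_deploy]


print(phases_to_run())                 # default: phase 5 skipped
print(phases_to_run(max_phase=4))      # stop early while iterating
print(phases_to_run(run_deploy=True))  # full run including deploy + invoke
```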

Update the Foundry private-network E2E harness so phase 5 can build the
~/agents/echo-dual image itself instead of requiring a prebuilt external image.

- Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG.
- Create the target ACR with --role-assignment-mode rbac-abac and reject reuse
  of non-ABAC registries.
- Grant the caller Container Registry Repository Writer before the ACR Task push.
  Resolve the caller object id from the ARM token oid claim to avoid Microsoft
  Graph / Conditional Access failures.
- Build with the required `az acr build --source-acr-auth-id [caller]` form.
- Keep the project MI grant on the ABAC-aware Container Registry Repository
  Reader role for image pull.
- Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the
  real Foundry env in a single RG.

Live validation: the harness created an ABAC ACR, granted caller writer, built
and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a
private-networked Foundry account, and granted the project MI Repository Reader.
The subsequent deploy failed from this public runner with the expected private
endpoint 403, which is documented.

Live phase-5 validation showed hosted-agent image pull uses the Foundry project
managed identity, not the parent account identity. Update the network E2E
harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the
ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling
back to the account MI only for older API shapes.

Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on
VM/managed-identity runners after deploy succeeds.
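
The identity selection described in this commit (project MI first, account MI only as a fallback) can be sketched as below. The dictionaries stand in for ARM resource JSON with the usual `identity.principalId` field; the function name and the sample values are hypothetical, and the harness itself does this in Bash.

```python
from typing import Optional


def pull_principal_id(project: dict, account: dict) -> Optional[str]:
    """Prefer the project's managed identity; fall back to the account's."""
    for resource in (project, account):
        principal = (resource.get("identity") or {}).get("principalId")
        if principal:
            return principal
    return None


# Illustrative resource fragments, not real ids.
project = {"identity": {"principalId": "project-mi-object-id"}}
account = {"identity": {"principalId": "account-mi-object-id"}}
print(pull_principal_id(project, account))
```

The returned object id is the one that would receive `Container Registry Repository Reader` on the registry.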
github-actions bot added the ext-agents (azure.ai.agents extension) and ext-foundry (azure.ai.{agents,connections,inspector,projects,routines,skills,toolboxes}, microsoft.foundry) labels Jun 18, 2026
m5i-work added 4 commits June 18, 2026 17:58
Add a concise README cheatsheet for initializing, provisioning, deploying, and
invoking a hosted Foundry agent with a BYO container image under VNet private
networking. Include ACR requirements for ABAC and private-only registries.

Keep the extension README concise by moving the detailed Foundry private
networking schema, requirements, and BYO-image VNet cheatsheet into
`docs/private-networking.md`, with a short README pointer.

Live managed-network provisioning showed that the resources module emitted
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template
never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE.

Forward the output from main.bicep, add it to the provider canonical output-name
restore list, and cover ARM casing repair with a regression test. Also document
the managed VNet provisioning scenario in the private-networking guide.

Live validation: provisioned network.mode=managed in westus and verified the
account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections
with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound.

Live managed-network deploy validation showed that managed mode configures the
hosted-agent runtime to use a Microsoft-managed network but does not create a
customer private endpoint for the Foundry data plane. Disabling public access in
that mode makes azd deploy/invoke fail with `403 Public access is disabled`.

Keep public data-plane access enabled for managed mode while preserving BYO
mode behavior (public access disabled + private endpoint). Update the private
networking guide with managed deploy/invoke guidance.

Live validation: provisioned managed mode, converted the test ACR to ABAC,
built the echo-dual image with `az acr build --source-acr-auth-id [caller]`,
granted the Foundry project MI `Container Registry Repository Reader`, deployed
successfully, and invoked the hosted agent successfully.
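
The resulting per-mode behavior can be summarized as a lookup. This Python sketch restates what the commit describes; the function name and the `customerPrivateEndpoint` key are hypothetical labels for illustration, not template parameters.

```python
def account_network_settings(mode: str) -> dict:
    """Expected account-level network settings for each network mode."""
    if mode == "byo":
        # BYO VNet: data plane reachable only through the private endpoint.
        return {"publicNetworkAccess": "Disabled", "customerPrivateEndpoint": True}
    if mode == "managed":
        # Managed VNet: no customer private endpoint, so public data-plane
        # access stays enabled for deploy and invoke.
        return {
            "publicNetworkAccess": "Enabled",
            "customerPrivateEndpoint": False,
            "useMicrosoftManagedNetwork": True,
        }
    raise ValueError(f"unsupported network mode: {mode!r}")


for mode in ("byo", "managed"):
    print(mode, account_network_settings(mode))
```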