atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box) + API performance fixes by ludamad · Pull Request #87 · AztecProtocol/staking-dashboard

ludamad · 2026-05-29T20:41:54Z

atp-indexer: single-instance stack + performance fixes (tested on staging)

Collapses the atp-indexer's per-env ECS + Aurora + ALB onto one Docker-Compose EC2 box
(same pattern as the sale migration, deployed and serving prod), plus the performance fixes
that pattern surfaced. Everything here is running on staging now:
api.staging.stake.aztec.network (box, CloudFront-front) + staging.stake.aztec.network
(dashboard website env, restored).

Infrastructure (`atp-indexer/single-instance/`, additive stack)

- One box: Caddy + atp-indexer (Ponder, yarn start = index + serve) + local Postgres on a durable /data EBS volume, daily DLM snapshots (retain 7).
- CloudFront-front for prod-like envs: branded api.<env>.stake.aztec.network endpoint (api.stake.aztec.network for prod), ACM + WAF + secret-header origin gate, SG locked to the CloudFront prefix list on :80 only (two ports exceed the SG rule quota).
- Seeded from a prod Aurora dump including the ponder_sync RPC cache: backfill completes in ~2 minutes at cache_rate=100%, so no chain reindexing, ever.
- staking-dashboard/terraform: staging gets a DNS record + alias like prod/testnet, derived from the same expression as the cert module.

Performance (measured on staging, prod-parity data: 15.9k positions / 77 providers)

GET /api/providers:

| | before | after |
|---|---|---|
| prod fleet (Aurora) | 21.4s | n/a |
| box, old code | 8.4s | n/a |
| box, this PR (uncached) | n/a | ~1.5s (handler ~0.9s) |
| box, this PR (cached) | n/a | ~0.25s (server-side ~1ms) |

- fetchFailedDeposits OR-explosion fix: the WHERE clause had one OR(AND(attester,withdrawer)) branch per stake pair (thousands), which defeated the planner and sequential-scanned the event tables per request; this is what drove the fleet's Aurora to 99% CPU. Now: one indexed IN (attesters) + exact pair-matching in memory. Also fixes the empty-pairs case, which previously produced no WHERE clause at all.
- Response cache on the read-only /api/* data routes (API_CACHE_TTL_MS, default 10s, 0 disables; health stays uncached): collapses each dashboard polling burst to one execution per query shape.
- Batched attester inserts (one insert per event instead of one per attester).
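
The fetchFailedDeposits change described above can be illustrated with a minimal sketch. This is not the PR's code: the type names, the `queryByAttesters` callback (a stand-in for the indexed `IN (attesters)` query), and the helper names are all hypothetical, and the real handler may normalize addresses or shape rows differently.

```typescript
// Hypothetical shapes for illustration; the indexer's real types may differ.
type StakePair = { attester: string; withdrawer: string };
type FailedDeposit = StakePair & { txHash: string };

// Stand-in for the indexed `WHERE attester IN (...)` query (assumption).
type QueryByAttesters = (attesters: string[]) => Promise<FailedDeposit[]>;

const pairKey = (p: StakePair): string => `${p.attester}:${p.withdrawer}`;

// Distinct attesters, so the IN list stays as small as possible.
function distinctAttesters(pairs: StakePair[]): string[] {
  return Array.from(new Set(pairs.map((p) => p.attester)));
}

// Exact (attester, withdrawer) matching on the rows the coarse filter returned.
function matchExactPairs(rows: FailedDeposit[], pairs: StakePair[]): FailedDeposit[] {
  const wanted = new Set(pairs.map(pairKey));
  return rows.filter((row) => wanted.has(pairKey(row)));
}

async function fetchFailedDepositsSketch(
  pairs: StakePair[],
  queryByAttesters: QueryByAttesters,
): Promise<FailedDeposit[]> {
  // Early return: an empty pair list must not turn into an unfiltered query.
  if (pairs.length === 0) return [];
  const rows = await queryByAttesters(distinctAttesters(pairs));
  return matchExactPairs(rows, pairs);
}
```

The database does one coarse, index-friendly filter, and the per-pair precision moves into a set lookup in memory, which is cheap because the filtered result is small.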

Operational notes (cutover playbook)

- Ponder fingerprints the schema to the app build → code deploys need a new schema (fleet history v2→v25 is the same pattern). Rebuild from ponder_sync takes ~2 min; run the new build as a parallel container, verify parity, flip. Old schema stays as instant rollback.
- The providers.json metadata is generated by yarn bootstrap before image build (CI does this; local builds must too).
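
One way to approach the "verify parity" step in the playbook above is to fetch the same endpoint from the old and new containers and diff the results by a stable key. The sketch below covers only the comparison; the function name, the `id` key, and the row shape are assumptions, not part of this PR.

```typescript
type Row = Record<string, unknown>;

// Compares two result sets by a key field and reports added, changed, and removed ids.
// Note: JSON.stringify is sensitive to property order, so this assumes both builds
// serialize fields in the same order.
function diffByKey(oldRows: Row[], newRows: Row[], key: string): string[] {
  const remaining = new Map<string, string>();
  for (const row of oldRows) remaining.set(String(row[key]), JSON.stringify(row));

  const diffs: string[] = [];
  for (const row of newRows) {
    const id = String(row[key]);
    const previous = remaining.get(id);
    if (previous === undefined) diffs.push(`added: ${id}`);
    else if (previous !== JSON.stringify(row)) diffs.push(`changed: ${id}`);
    remaining.delete(id);
  }
  // Whatever is left existed only in the old build's output.
  remaining.forEach((_value, id) => diffs.push(`removed: ${id}`));
  return diffs;
}
```

An empty result would suggest the two schemas serve the same data for that endpoint; any entry is a reason to hold the flip and investigate.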

koenmtb1 · 2026-06-02T05:53:44Z

Any reason for using Caddy? Cloudfront is basically free.
We'll also still need an A + B environment to allow syncs to happen on non-live backends.
And would this use direct EC2 instead of ECS? If so, why not go all the way with cost savings and use a different compute provider?

…ddy/CloudFront-front) Collapses the atp-indexer's ECS+Aurora+ALB onto one EC2 box running THIS repo's atp-indexer image (it indexes ATP_FACTORY_MATP/LATP etc. that ignition's does not -> not interchangeable) against a local Postgres, fronted by Caddy: Caddy-direct (Let's Encrypt) for testnet, or CloudFront-front (ACM + secret-header gate, http-only origin) for prod. Frontend stays on S3+CloudFront. Mirrors the ignition single-instance pattern (repos are decoupled; intentional duplication). Additive, own Terraform state; cutover documented in README. terraform validate passes; not applied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…output, cutover docs

Review workflow findings:
- BLOCKER: empty default cloudfront_secret_header_value + Caddy always enforcing the header in CF mode would lock out all requests. Add a precondition requiring it non-empty when si_front_with_cloudfront=true; clarify the variable description.
- Add mutual-exclusion precondition (si_create_dns_records XOR si_front_with_cloudfront).
- HIGH: add cf_domain_name output (alias) so the frontend's remote-state read keeps working at cutover without an output rename.
- Docs: README cutover now spells out the data.tf state-key swap + output name; env-var docs point at the full app.tf indexer_env_vars set (required vs tuning).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…refix list in CF mode Same defense-in-depth as ignition: CF mode locks 80/443 to the CloudFront origin-facing managed prefix list; Caddy-direct unchanged. SSM (egress) still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- BLOCKER: cf_domain_name output now defined in BOTH modes (was CF-only) -> testnet cutover no longer fails on a missing output; returns the indexer hostname in Caddy-direct mode.
- Remove superfluous API_PORT from the indexer container ('yarn start' serves the API on PORT; API_PORT is only for the separate 'yarn serve' which the box doesn't run).
- README cutover: clarify cf_domain_name in both modes + add the required frontend rebuild/redeploy step (indexer URL is baked at build time).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…AF from shared state Read the CloudFront secret (SSM) and CLOUDFRONT-scoped WAF from the shared ignition-infrastructure state — the same ones the existing atp-indexer CloudFront uses — so a prod CF-front deploy needs no manual secret/WAF inputs and matches the current posture. The cloudfront_secret_header_value / si_cf_web_acl_arn vars remain as overrides; env_parent added for the shared-state key. Precondition + Caddy gate now check the resolved secret. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… CloudFront mode Add a precondition: when si_front_with_cloudfront=true the resolved CLOUDFRONT-scoped WAF must be non-null (else a silent no-WAF CloudFront would defeat the security model). Points at the shared backend_waf_arn / si_cf_web_acl_arn override. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nt + daily /data snapshots

Two fixes learned from operating the sale box in production:
- The CloudFront origin-facing prefix list attached to two ports exceeds the per-security-group rule quota (AuthorizeSecurityGroupIngress fails); behind CloudFront Caddy serves plain HTTP, so :443 is now Caddy-direct-only.
- Add DLM daily snapshots of the /data volume (retain 7). Staking data is live, so the box's Postgres needs a recovery point independent of the fleet's Aurora.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Serve the indexer at api.<env>.stake.aztec.network (api.stake.aztec.network for prod) instead of raw *.cloudfront.net URLs, so the dashboard points at one stable hostname and infra behind it can change without touching the frontend. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…outes Ports the sale indexer's response-cache lesson: the dashboard polls the same query shapes from every open session and the data only changes ~per block, so a 10s cache (API_CACHE_TTL_MS, 0 disables) collapses each burst to one Postgres execution per query shape. Prod Aurora peaked at 99% CPU on exactly these bursts. Health routes stay uncached for probes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
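
A rough sketch of the caching idea in this commit, assuming a cache keyed by "query shape" (for example method plus URL with query string). The function name, the injectable clock, and the key format are hypothetical; only the TTL semantics (a fixed window, 0 disables) come from the commit message.

```typescript
type CacheEntry<T> = { value: Promise<T>; expiresAt: number };

// Returns a wrapper that runs `compute` at most once per key per TTL window.
// Storing the promise (not the resolved value) lets concurrent requests in one
// polling burst share a single in-flight execution.
function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, CacheEntry<T>>();

  return function cached(key: string, compute: () => Promise<T>): Promise<T> {
    if (ttlMs <= 0) return compute(); // a TTL of 0 disables caching

    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;

    const value = compute();
    entries.set(key, { value, expiresAt: now() + ttlMs });
    // Do not keep serving a failed result for the rest of the window.
    value.catch(() => {
      if (entries.get(key)?.value === value) entries.delete(key);
    });
    return value;
  };
}
```

Health routes would simply bypass the wrapper, so probes always observe live state.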

One insert with all values instead of a round-trip per attester, matching the set-based write patterns used elsewhere. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
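
To illustrate the difference, a batched write turns N single-row statements into one multi-row statement. The sketch below builds such a parameterized statement by hand; the table name, column names, and row shape are hypothetical, and the actual change presumably goes through the indexer's own database API rather than raw SQL.

```typescript
// Hypothetical row shape; the real schema may differ.
type AttesterRow = { eventId: string; attester: string };

// Builds one parameterized multi-row INSERT for all attesters of an event.
// Returns null for an empty batch so callers can skip the round-trip entirely.
function buildBatchedInsert(
  rows: AttesterRow[],
): { text: string; values: string[] } | null {
  if (rows.length === 0) return null;

  const values: string[] = [];
  const tuples = rows.map((row, i) => {
    values.push(row.eventId, row.attester);
    return `($${2 * i + 1}, $${2 * i + 2})`; // Postgres-style positional placeholders
  });

  return {
    text: `INSERT INTO attesters (event_id, attester) VALUES ${tuples.join(", ")}`,
    values,
  };
}
```

For an event with two attesters this yields a single statement with placeholders `($1, $2), ($3, $4)` and four bound values, instead of two separate round-trips.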

…etwork Give staging a DNS record + alias like prod/testnet, and derive the alias from the same expression as module.domain so the cert and alias always agree. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…-memory pair match fetchFailedDeposits built one WHERE clause with an OR branch per (attester, withdrawer) pair; provider/list passes every delegation and direct stake, so the planner sequential-scanned the event tables evaluating thousands of OR branches per row (~8s per /api/providers on a box, ~21s on the fleet). Filter by an indexed IN on the distinct attesters instead and match exact pairs in memory on the small result. Also early-return on no pairs (the empty OR previously meant no WHERE at all). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Schema-per-deploy gives the box the fleet's prod-g property (sync on a non-live backend) as a parallel container on the same Postgres, verified on staging; box-level A/B covers riskier changes via the si-origin record. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ludamad marked this pull request as ready for review May 29, 2026 22:11

ludamad requested a review from a team as a code owner May 29, 2026 22:11

Adam Domurad and others added 7 commits July 2, 2026 11:29

ludamad force-pushed the atp-indexer-single-instance branch from b40a502 to 1f1a0ce Compare July 2, 2026 15:32

Adam Domurad and others added 5 commits July 2, 2026 12:50

perf(atp-indexer): batch attester inserts per event

7ec7d8a


ludamad changed the title ~~atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box)~~ atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box) + API performance fixes Jul 2, 2026

ludamad wants to merge 13 commits into main from atp-indexer-single-instance

2 participants
