
atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box) + API performance fixes#87

Open: ludamad wants to merge 13 commits into main from atp-indexer-single-instance

Conversation

ludamad commented May 29, 2026

atp-indexer: single-instance stack + performance fixes (tested on staging)

Collapses the atp-indexer's per-env ECS + Aurora + ALB onto one Docker-Compose EC2 box
(same pattern as the sale migration, deployed and serving prod), plus the performance fixes
that pattern surfaced. Everything here is running on staging now:
api.staging.stake.aztec.network (box, CloudFront-front) + staging.stake.aztec.network
(dashboard website env, restored).

Infrastructure (atp-indexer/single-instance/, additive stack)

  • One box: Caddy + atp-indexer (Ponder, yarn start = index + serve) + local Postgres on a
    durable /data EBS volume, daily DLM snapshots (retain 7).
  • CloudFront-front for prod-like envs: branded api.<env>.stake.aztec.network endpoint
    (api.stake.aztec.network for prod), ACM + WAF + secret-header origin gate, SG locked to
    the CloudFront prefix list on :80 only (attaching the prefix list to two ports exceeds
    the per-security-group rule quota).
  • Seeded from a prod Aurora dump including the ponder_sync RPC cache: backfill completes in
    ~2 minutes at cache_rate=100% — no chain reindexing, ever.
  • staking-dashboard/terraform: staging gets a DNS record + alias like prod/testnet, derived
    from the same expression as the cert module.

Performance (measured on staging, prod-parity data: 15.9k positions / 77 providers)

GET /api/providers:

| | before | after |
| --- | --- | --- |
| prod fleet (Aurora) | 21.4s | |
| box, old code | 8.4s | |
| box, this PR (uncached) | | ~1.5s (handler ~0.9s) |
| box, this PR (cached) | | ~0.25s (server-side ~1ms) |

  • fetchFailedDeposits OR-explosion fix: the WHERE clause had one OR(AND(attester,withdrawer))
    branch per stake pair (thousands), which defeated the planner and sequential-scanned the event
    tables per request — this is what drove the fleet's Aurora to 99% CPU. Now: one indexed
    IN (attesters) + exact pair-matching in memory. Also fixes the empty-pairs case, which
    previously produced no WHERE clause at all.
  • Response cache on the read-only /api/* data routes (API_CACHE_TTL_MS, default 10s,
    0 disables; health stays uncached): collapses each dashboard polling burst to one execution
    per query shape.
  • Batched attester inserts (one insert per event instead of one per attester).
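
The fetchFailedDeposits change described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `StakePair` and `DepositEvent` shapes, the `pairKey` helper, and the lowercase normalisation of addresses are all assumptions made for the example. It shows the two halves of the approach: deriving the distinct attesters for a single `IN (...)` filter, then narrowing the returned rows to exact pairs in memory.

```typescript
// Hypothetical shapes; the real row and pair types live in the atp-indexer code.
interface StakePair { attester: string; withdrawer: string; }
interface DepositEvent { attester: string; withdrawer: string; txHash: string; }

// Composite key so an (attester, withdrawer) pair can be looked up in a Set.
// Lowercasing is an assumption (hex addresses compared case-insensitively).
const pairKey = (attester: string, withdrawer: string): string =>
  `${attester.toLowerCase()}:${withdrawer.toLowerCase()}`;

// Step 1: the distinct attesters become the values of one indexed IN (...) filter.
function distinctAttesters(pairs: StakePair[]): string[] {
  return [...new Set(pairs.map((p) => p.attester.toLowerCase()))];
}

// Step 2: rows returned by the IN filter are narrowed to exact pairs in memory.
function matchExactPairs(rows: DepositEvent[], pairs: StakePair[]): DepositEvent[] {
  // Early return on no pairs, so an empty input never means "no filter at all".
  if (pairs.length === 0) return [];
  const wanted = new Set(pairs.map((p) => pairKey(p.attester, p.withdrawer)));
  return rows.filter((r) => wanted.has(pairKey(r.attester, r.withdrawer)));
}
```

The Set lookup keeps the in-memory step linear in the number of returned rows, which is cheap as long as the IN filter has already reduced the result to a small set.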

Operational notes (cutover playbook)

  • Ponder fingerprints the schema to the app build → code deploys need a new schema
    (fleet history v2→v25 is the same pattern). Rebuild from ponder_sync takes ~2 min; run the
    new build as a parallel container, verify parity, flip. Old schema stays as instant rollback.
  • The providers.json metadata is generated by yarn bootstrap before image build (CI does
    this; local builds must too).

ludamad marked this pull request as ready for review May 29, 2026 22:11
ludamad requested a review from a team as a code owner May 29, 2026 22:11
koenmtb1 (Contributor) commented Jun 2, 2026

Any reason for using Caddy? CloudFront is basically free.
We'll also still need an A + B environment to allow syncs to happen on non-live backends.
And would this use direct EC2 instead of ECS? If so, why not go all the way with cost savings and use a different compute provider?

Adam Domurad and others added 7 commits July 2, 2026 11:29
…ddy/CloudFront-front)

Collapses the atp-indexer's ECS+Aurora+ALB onto one EC2 box running THIS repo's atp-indexer
image (it indexes ATP_FACTORY_MATP/LATP etc. that ignition's does not -> not interchangeable)
against a local Postgres, fronted by Caddy: Caddy-direct (Let's Encrypt) for testnet, or
CloudFront-front (ACM + secret-header gate, http-only origin) for prod. Frontend stays on
S3+CloudFront. Mirrors the ignition single-instance pattern (repos are decoupled; intentional
duplication). Additive, own Terraform state; cutover documented in README. terraform validate
passes; not applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…output, cutover docs

Review workflow findings:
- BLOCKER: empty default cloudfront_secret_header_value + Caddy always enforcing the header
  in CF mode would lock out all requests. Add a precondition requiring it non-empty when
  si_front_with_cloudfront=true; clarify the variable description.
- Add mutual-exclusion precondition (si_create_dns_records XOR si_front_with_cloudfront).
- HIGH: add cf_domain_name output (alias) so the frontend's remote-state read keeps working
  at cutover without an output rename.
- Docs: README cutover now spells out the data.tf state-key swap + output name; env-var
  docs point at the full app.tf indexer_env_vars set (required vs tuning).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…refix list in CF mode

Same defense-in-depth as ignition: CF mode locks 80/443 to the CloudFront origin-facing
managed prefix list; Caddy-direct unchanged. SSM (egress) still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- BLOCKER: cf_domain_name output now defined in BOTH modes (was CF-only) -> testnet cutover
  no longer fails on a missing output; returns the indexer hostname in Caddy-direct mode.
- Remove superfluous API_PORT from the indexer container ('yarn start' serves the API on
  PORT; API_PORT is only for the separate 'yarn serve' which the box doesn't run).
- README cutover: clarify cf_domain_name in both modes + add the required frontend
  rebuild/redeploy step (indexer URL is baked at build time).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…AF from shared state

Read the CloudFront secret (SSM) and CLOUDFRONT-scoped WAF from the shared
ignition-infrastructure state — the same ones the existing atp-indexer CloudFront uses — so a
prod CF-front deploy needs no manual secret/WAF inputs and matches the current posture. The
cloudfront_secret_header_value / si_cf_web_acl_arn vars remain as overrides; env_parent added
for the shared-state key. Precondition + Caddy gate now check the resolved secret.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… CloudFront mode

Add a precondition: when si_front_with_cloudfront=true the resolved CLOUDFRONT-scoped WAF
must be non-null (else a silent no-WAF CloudFront would defeat the security model). Points at
the shared backend_waf_arn / si_cf_web_acl_arn override.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt + daily /data snapshots

Two fixes learned from operating the sale box in production:
- The CloudFront origin-facing prefix list attached to two ports exceeds the per-security-group
  rule quota (AuthorizeSecurityGroupIngress fails); behind CloudFront Caddy serves plain HTTP,
  so :443 is now Caddy-direct-only.
- Add DLM daily snapshots of the /data volume (retain 7). Staking data is live, so the box's
  Postgres needs a recovery point independent of the fleet's Aurora.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ludamad force-pushed the atp-indexer-single-instance branch from b40a502 to 1f1a0ce July 2, 2026 15:32
Adam Domurad and others added 5 commits July 2, 2026 12:50
Serve the indexer at api.<env>.stake.aztec.network (api.stake.aztec.network for prod)
instead of raw *.cloudfront.net URLs, so the dashboard points at one stable hostname and
infra behind it can change without touching the frontend.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…outes

Ports the sale indexer's response-cache lesson: the dashboard polls the same query shapes
from every open session and the data only changes ~per block, so a 10s cache (API_CACHE_TTL_MS,
0 disables) collapses each burst to one Postgres execution per query shape. Prod Aurora peaked
at 99% CPU on exactly these bursts. Health routes stay uncached for probes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
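
As an illustration of the response-cache idea in the commit above, here is a minimal TTL cache keyed by query shape. It is a sketch under stated assumptions, not the indexer's implementation: the `ResponseCache` class, its `get` method, and the injectable clock are hypothetical names introduced for the example. Only the behaviour (a TTL, with 0 disabling the cache) comes from the commit message.

```typescript
// Hypothetical cache entry: the in-flight or settled result plus its expiry time.
interface CacheEntry<T> { value: Promise<T>; expiresAt: number; }

class ResponseCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private ttlMs: number;
  private now: () => number;

  // ttlMs <= 0 disables caching, mirroring "API_CACHE_TTL_MS=0 disables".
  // The clock is injectable only so the sketch is easy to exercise.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // key identifies the query shape, for example path plus query string.
  get(key: string, compute: () => Promise<T>): Promise<T> {
    if (this.ttlMs <= 0) return compute();
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    // Store the promise itself so concurrent requests share one execution.
    const value = compute();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Drop failed computations so an error is not served for the whole TTL.
    value.catch(() => {
      const current = this.entries.get(key);
      if (current && current.value === value) this.entries.delete(key);
    });
    return value;
  }
}
```

Caching the promise rather than the resolved value is one way to get the "one execution per query shape" effect even when a burst of requests arrives before the first query finishes.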
One insert with all values instead of a round-trip per attester, matching the set-based
write patterns used elsewhere.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
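
The batched-insert commit above replaces a round-trip per attester with a single statement. A rough sketch of that pattern, assuming a plain parameterised SQL insert: the `attesters` table, its `event_id` and `attester` columns, and the `buildBatchInsert` helper are hypothetical, and the PR itself may go through Ponder's own database API instead.

```typescript
// Hypothetical row shape for the example.
interface AttesterRow { eventId: string; attester: string; }

// Builds one parameterised multi-row INSERT instead of one statement per attester.
// Table and column names are assumptions made for illustration.
function buildBatchInsert(
  rows: AttesterRow[],
): { text: string; values: string[] } | null {
  if (rows.length === 0) return null; // nothing to insert, skip the round-trip
  const placeholders: string[] = [];
  const values: string[] = [];
  rows.forEach((row, i) => {
    // Each row consumes two positional parameters: $1,$2 then $3,$4 and so on.
    placeholders.push(`($${i * 2 + 1}, $${i * 2 + 2})`);
    values.push(row.eventId, row.attester);
  });
  return {
    text: `INSERT INTO attesters (event_id, attester) VALUES ${placeholders.join(", ")}`,
    values,
  };
}
```

Keeping the values as bound parameters rather than interpolating them into the SQL text preserves the usual protection against injection while still sending everything in one statement.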
…etwork

Give staging a DNS record + alias like prod/testnet, and derive the alias from the same
expression as module.domain so the cert and alias always agree.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-memory pair match

fetchFailedDeposits built one WHERE clause with an OR branch per (attester, withdrawer) pair;
provider/list passes every delegation and direct stake, so the planner sequential-scanned the
event tables evaluating thousands of OR branches per row (~8s per /api/providers on a box,
~21s on the fleet). Filter by an indexed IN on the distinct attesters instead and match exact
pairs in memory on the small result. Also early-return on no pairs (the empty OR previously
meant no WHERE at all).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ludamad changed the title from "atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box)" to "atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box) + API performance fixes" Jul 2, 2026
Schema-per-deploy gives the box the fleet's prod-g property (sync on a non-live backend)
as a parallel container on the same Postgres, verified on staging; box-level A/B covers
riskier changes via the si-origin record.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>