atp-indexer: single-instance stack (collapse ECS+Aurora+ALB onto one box) + API performance fixes #87
Open
ludamad wants to merge 13 commits into
Contributor
Any reason for using Caddy? CloudFront is basically free.
…ddy/CloudFront-front) Collapses the atp-indexer's ECS+Aurora+ALB onto one EC2 box running THIS repo's atp-indexer image (it indexes ATP_FACTORY_MATP/LATP etc. that ignition's does not -> not interchangeable) against a local Postgres, fronted by Caddy: Caddy-direct (Let's Encrypt) for testnet, or CloudFront-front (ACM + secret-header gate, http-only origin) for prod. Frontend stays on S3+CloudFront. Mirrors the ignition single-instance pattern (repos are decoupled; intentional duplication). Additive, own Terraform state; cutover documented in README. terraform validate passes; not applied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…output, cutover docs
Review workflow findings:
- BLOCKER: empty default cloudfront_secret_header_value + Caddy always enforcing the header in CF mode would lock out all requests. Add a precondition requiring it non-empty when si_front_with_cloudfront=true; clarify the variable description.
- Add mutual-exclusion precondition (si_create_dns_records XOR si_front_with_cloudfront).
- HIGH: add cf_domain_name output (alias) so the frontend's remote-state read keeps working at cutover without an output rename.
- Docs: README cutover now spells out the data.tf state-key swap + output name; env-var docs point at the full app.tf indexer_env_vars set (required vs tuning).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…refix list in CF mode Same defense-in-depth as ignition: CF mode locks 80/443 to the CloudFront origin-facing managed prefix list; Caddy-direct unchanged. SSM (egress) still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- BLOCKER: cf_domain_name output now defined in BOTH modes (was CF-only) -> testnet cutover
no longer fails on a missing output; returns the indexer hostname in Caddy-direct mode.
- Remove superfluous API_PORT from the indexer container ('yarn start' serves the API on
PORT; API_PORT is only for the separate 'yarn serve' which the box doesn't run).
- README cutover: clarify cf_domain_name in both modes + add the required frontend
rebuild/redeploy step (indexer URL is baked at build time).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…AF from shared state Read the CloudFront secret (SSM) and CLOUDFRONT-scoped WAF from the shared ignition-infrastructure state — the same ones the existing atp-indexer CloudFront uses — so a prod CF-front deploy needs no manual secret/WAF inputs and matches the current posture. The cloudfront_secret_header_value / si_cf_web_acl_arn vars remain as overrides; env_parent added for the shared-state key. Precondition + Caddy gate now check the resolved secret. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… CloudFront mode Add a precondition: when si_front_with_cloudfront=true the resolved CLOUDFRONT-scoped WAF must be non-null (else a silent no-WAF CloudFront would defeat the security model). Points at the shared backend_waf_arn / si_cf_web_acl_arn override. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
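The preconditions described in the commits above live in Terraform; the sketch below restates the same three rules in TypeScript to make their combined logic explicit. The `SingleInstanceConfig` type and `checkPreconditions` function are hypothetical, and treating si_create_dns_records as the Caddy-direct switch is an assumption based on the commit wording.

```typescript
// Hypothetical config shape. Field names mirror the Terraform variables named
// in the commit messages; the type itself is illustrative only.
interface SingleInstanceConfig {
  siCreateDnsRecords: boolean;    // si_create_dns_records (assumed: Caddy-direct mode)
  siFrontWithCloudfront: boolean; // si_front_with_cloudfront (CloudFront-front mode)
  resolvedSecretHeader: string;   // override or shared-state value, "" when unset
  resolvedWafArn: string | null;  // override or shared-state value, null when unset
}

// Returns human-readable violations; an empty array means the config passes.
function checkPreconditions(cfg: SingleInstanceConfig): string[] {
  const errors: string[] = [];

  // XOR: exactly one fronting mode may be selected.
  if (cfg.siCreateDnsRecords === cfg.siFrontWithCloudfront) {
    errors.push("set exactly one of si_create_dns_records / si_front_with_cloudfront");
  }

  if (cfg.siFrontWithCloudfront) {
    // Caddy enforces the header in CF mode, so an empty secret locks out all requests.
    if (cfg.resolvedSecretHeader.trim() === "") {
      errors.push("CloudFront mode requires a non-empty secret header value");
    }
    // A null WAF would silently deploy CloudFront without protection.
    if (cfg.resolvedWafArn === null) {
      errors.push("CloudFront mode requires a CLOUDFRONT-scoped WAF ARN");
    }
  }
  return errors;
}
```

Collecting all violations instead of failing on the first one mirrors how Terraform reports every failed precondition in a single plan.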
…nt + daily /data snapshots
Two fixes learned from operating the sale box in production:
- The CloudFront origin-facing prefix list attached to two ports exceeds the per-security-group rule quota (AuthorizeSecurityGroupIngress fails); behind CloudFront Caddy serves plain HTTP, so :443 is now Caddy-direct-only.
- Add DLM daily snapshots of the /data volume (retain 7). Staking data is live, so the box's Postgres needs a recovery point independent of the fleet's Aurora.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
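The quota failure above comes down to simple arithmetic. The sketch below assumes the figures in AWS's documentation at the time of writing: the CloudFront origin-facing managed prefix list counts as 55 rules per reference, and the default inbound quota is 60 rules per security group. Both constants are assumptions to verify against the target account, since quotas can be raised.

```typescript
// Assumed values from AWS documentation; verify for your account and region.
const CLOUDFRONT_PREFIX_LIST_WEIGHT = 55; // rules consumed per reference to the prefix list
const DEFAULT_SG_INBOUND_RULE_QUOTA = 60; // default inbound rules per security group

// Each port that references the prefix list consumes the list's full weight.
function ingressRulesUsed(portsReferencingPrefixList: number, otherRules = 0): number {
  return portsReferencingPrefixList * CLOUDFRONT_PREFIX_LIST_WEIGHT + otherRules;
}

function fitsQuota(
  portsReferencingPrefixList: number,
  quota = DEFAULT_SG_INBOUND_RULE_QUOTA,
): boolean {
  return ingressRulesUsed(portsReferencingPrefixList) <= quota;
}
```

Under these assumptions one port (:80) uses 55 of 60 rules and fits, while two ports (:80 and :443) would need 110 and fail.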
Force-pushed from b40a502 to 1f1a0ce.
Serve the indexer at api.<env>.stake.aztec.network (api.stake.aztec.network for prod) instead of raw *.cloudfront.net URLs, so the dashboard points at one stable hostname and infra behind it can change without touching the frontend. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…outes Ports the sale indexer's response-cache lesson: the dashboard polls the same query shapes from every open session and the data only changes ~per block, so a 10s cache (API_CACHE_TTL_MS, 0 disables) collapses each burst to one Postgres execution per query shape. Prod Aurora peaked at 99% CPU on exactly these bursts. Health routes stay uncached for probes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
One insert with all values instead of a round-trip per attester, matching the set-based write patterns used elsewhere. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
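As an illustration of the set-based write, the sketch below builds one parameterized multi-row INSERT instead of one statement per row. It uses raw SQL text with `$n` placeholders, which is an assumption; the indexer's actual write path may go through a query builder. `buildBatchInsert` and the table and column names in the usage note are hypothetical.

```typescript
interface BatchInsert {
  text: string;      // SQL with $1..$n placeholders
  values: unknown[]; // flattened parameters, in placeholder order
}

// Builds a single multi-row INSERT. Table and column names are interpolated
// directly, so they must come from trusted code, never from user input.
function buildBatchInsert(
  table: string,
  columns: string[],
  rows: unknown[][],
): BatchInsert | null {
  if (rows.length === 0) return null; // nothing to write, skip the round-trip

  const values: unknown[] = [];
  const tuples = rows.map((row) => {
    if (row.length !== columns.length) {
      throw new Error("row width does not match column list");
    }
    const placeholders = row.map((value) => {
      values.push(value);
      return `$${values.length}`;
    });
    return `(${placeholders.join(", ")})`;
  });

  return {
    text: `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${tuples.join(", ")}`,
    values,
  };
}
```

For example, two rows for a hypothetical `attesters (address, provider_id)` table produce `VALUES ($1, $2), ($3, $4)` with four parameters, so N attesters cost one round-trip instead of N.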
…etwork Give staging a DNS record + alias like prod/testnet, and derive the alias from the same expression as module.domain so the cert and alias always agree. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-memory pair match fetchFailedDeposits built one WHERE clause with an OR branch per (attester, withdrawer) pair; provider/list passes every delegation and direct stake, so the planner sequential-scanned the event tables evaluating thousands of OR branches per row (~8s per /api/providers on a box, ~21s on the fleet). Filter by an indexed IN on the distinct attesters instead and match exact pairs in memory on the small result. Also early-return on no pairs (the empty OR previously meant no WHERE at all). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
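The two-step approach described above can be sketched as follows. This is a minimal illustration, not the actual fetchFailedDeposits code: the `StakePair` type, the helper names, and the synchronous `fetchByAttesters` callback (standing in for the indexed `WHERE attester IN (...)` query, which would be async in practice) are all hypothetical, and case-insensitive address comparison is an assumption.

```typescript
interface StakePair {
  attester: string;
  withdrawer: string;
}

// Assumes addresses are hex strings that should compare case-insensitively.
const pairKey = (p: StakePair): string =>
  `${p.attester.toLowerCase()}:${p.withdrawer.toLowerCase()}`;

// Step 1 input: the distinct attesters to pass to a single indexed IN filter.
function distinctAttesters(pairs: StakePair[]): string[] {
  return Array.from(new Set(pairs.map((p) => p.attester.toLowerCase())));
}

// Step 2: keep only rows whose exact (attester, withdrawer) pair was requested.
function matchExactPairs<T extends StakePair>(rows: T[], pairs: StakePair[]): T[] {
  const wanted = new Set(pairs.map(pairKey));
  return rows.filter((row) => wanted.has(pairKey(row)));
}

function failedDepositsFor<T extends StakePair>(
  pairs: StakePair[],
  fetchByAttesters: (attesters: string[]) => T[],
): T[] {
  // Early return: with no pairs there must be no query at all, otherwise an
  // empty filter would select every row.
  if (pairs.length === 0) return [];
  const candidates = fetchByAttesters(distinctAttesters(pairs));
  return matchExactPairs(candidates, pairs);
}
```

The IN filter may return rows for an attester paired with a different withdrawer; the in-memory set lookup discards those in time linear in the candidate count.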
Schema-per-deploy gives the box the fleet's prod-g property (sync on a non-live backend) as a parallel container on the same Postgres, verified on staging; box-level A/B covers riskier changes via the si-origin record. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
atp-indexer: single-instance stack + performance fixes (tested on staging)
Collapses the atp-indexer's per-env ECS + Aurora + ALB onto one Docker-Compose EC2 box
(same pattern as the sale migration, deployed and serving prod), plus the performance fixes
that pattern surfaced. Everything here is running on staging now:
api.staging.stake.aztec.network (box, CloudFront-front) + staging.stake.aztec.network (dashboard website env, restored).
Infrastructure (atp-indexer/single-instance/, additive stack)
- … (yarn start = index + serve) + local Postgres on a durable /data EBS volume, daily DLM snapshots (retain 7).
- … api.<env>.stake.aztec.network endpoint (api.stake.aztec.network for prod), ACM + WAF + secret-header origin gate, SG locked to the CloudFront prefix list on :80 only (two ports exceed the SG rule quota).
- … ponder_sync RPC cache: backfill completes in ~2 minutes at cache_rate=100%, with no chain reindexing, ever.
- staking-dashboard/terraform: staging gets a DNS record + alias like prod/testnet, derived from the same expression as the cert module.
Performance (measured on staging, prod-parity data: 15.9k positions / 77 providers)
- GET /api/providers: fetchFailedDeposits OR-explosion fix: the WHERE clause had one OR(AND(attester, withdrawer)) branch per stake pair (thousands), which defeated the planner and sequential-scanned the event tables per request; this is what drove the fleet's Aurora to 99% CPU. Now: one indexed IN (attesters) + exact pair-matching in memory. Also fixes the empty-pairs case, which previously produced no WHERE clause at all.
- … /api/* data routes (API_CACHE_TTL_MS, default 10s, 0 disables; health stays uncached): collapses each dashboard polling burst to one execution per query shape.
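A short-TTL cache of the kind described for the /api/* data routes can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's implementation: `createTtlCache` is a hypothetical helper, the cache key is assumed to be a string identifying the query shape (for example path plus query string), and the injectable clock exists only to make the behavior testable.

```typescript
type Clock = () => number;

interface Entry<T> {
  expiresAt: number;
  value: Promise<T>;
}

// ttlMs plays the role of API_CACHE_TTL_MS: 0 (or less) disables caching.
function createTtlCache<T>(ttlMs: number, now: Clock = Date.now) {
  const entries = new Map<string, Entry<T>>();

  return function cached(key: string, load: () => Promise<T>): Promise<T> {
    if (ttlMs <= 0) return load();

    const t = now();
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > t) return hit.value;

    // Store the promise itself so concurrent requests for the same key share
    // one in-flight execution instead of each hitting Postgres.
    const value = load();
    entries.set(key, { expiresAt: t + ttlMs, value });

    // Do not keep a failed result cached for the rest of the TTL.
    value.catch(() => {
      if (entries.get(key)?.value === value) entries.delete(key);
    });
    return value;
  };
}
```

Health routes would simply bypass `cached`, so probes always reflect live state. Expired entries are only replaced when the same key is requested again, which is acceptable when the set of query shapes is small and fixed.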
Operational notes (cutover playbook)
- … (fleet history v2→v25 is the same pattern). Rebuild from ponder_sync takes ~2 min; run the new build as a parallel container, verify parity, flip. Old schema stays as instant rollback.
- providers.json metadata is generated by yarn bootstrap before image build (CI does this; local builds must too).