feat(benchmark): gateway benchmark harness (footprint, scaling, config, heap) #64
Merged
Pull request overview
Adds a new benchmark/ Python-based harness to measure the ROS2 Medkit gateway’s runtime cost (memory footprint, scaling behavior, config sweep impact, and heap/leak signals) by orchestrating Docker Compose runs and sampling /proc metrics, with accompanying report/chart generation and unit tests for the pure parsing/aggregation logic.
Changes:
- Introduces a `benchmark` CLI (`python -m benchmark.benchmark`) with lanes: footprint, scaling, sweep, heap, memcheck, attribute, and report aggregation.
- Adds a synthetic ROS 2 graph generator (rclpy) plus Docker Compose + Dockerfile tooling to run gateway + graph in a single container.
- Adds a substantial pure-Python library layer for sampling/parsing/metrics/reporting, covered by unit tests and fixtures.
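As a rough illustration of the `/proc` memory sampling the overview describes, the sketch below derives USS and PSS from the text of `/proc/<pid>/smaps_rollup`. It assumes USS is taken as `Private_Clean + Private_Dirty`, which is the common definition; the function name `parse_smaps_rollup` and the sample values are made up for this example and are not taken from `benchmark/lib/sampler.py`.

```python
def parse_smaps_rollup(text: str) -> dict[str, int]:
    """Return USS, PSS and RSS in kB from /proc/<pid>/smaps_rollup content."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        # Data lines look like "Pss:   1234 kB"; the first line is an
        # address-range header with more columns and is skipped.
        parts = line.split()
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return {
        # Assumption: USS = private pages only (clean + dirty).
        "uss_kb": fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0),
        "pss_kb": fields.get("Pss", 0),
        "rss_kb": fields.get("Rss", 0),
    }


# Made-up sample content, for illustration only.
SAMPLE = """\
55d0a0000000-7ffd3b1ff000 ---p 00000000 00:00 0    [rollup]
Rss:               20480 kB
Pss:               15360 kB
Shared_Clean:       6144 kB
Shared_Dirty:          0 kB
Private_Clean:      2048 kB
Private_Dirty:     12288 kB
"""

print(parse_smaps_rollup(SAMPLE))
```

Keeping the parser a pure function of the file text is what lets it be unit-tested against a fixture file without a running container.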
Reviewed changes
Copilot reviewed 48 out of 52 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| benchmark/benchmark.py | Main CLI orchestrator for all benchmark lanes; run directory management, aggregation, and reporting. |
| benchmark/turtlebot3.py | Demo wiring/config for the turtlebot3 integration benchmark target. |
| benchmark/README.md | Usage docs, prerequisites, lane descriptions, and quickstart commands. |
| benchmark/requirements.txt | Python dependencies for report/chart generation and tests. |
| benchmark/.dockerignore | Excludes results and caches from Docker build context. |
| benchmark/configs/overrides.yaml | Config override sets used for the sweep lane. |
| benchmark/lib/config_sweep.py | Pure helpers for merging and applying param overrides at the gateway namespace root. |
| benchmark/lib/docker_helpers.py | Docker/Compose wrappers for starting services, exec’ing commands, and reading /proc files. |
| benchmark/lib/gateway_client.py | JSON parsing helper for collection endpoints (items-count). |
| benchmark/lib/leak_parse.py | Pure parsers for heaptrack and valgrind memcheck summaries. |
| benchmark/lib/metrics.py | Pure numeric/stat helpers (median/IQR/linfit/slope CI/log-log exponent/steady window). |
| benchmark/lib/report.py | Repeat aggregation + lane verdict logic and markdown/chart renderers. |
| benchmark/lib/runner.py | Shared “cell runner” logic: start container, warmup, sample window, summarize. |
| benchmark/lib/runmeta.py | Captures run metadata (host, kernel, allocator, image digest, etc.). |
| benchmark/lib/sampler.py | /proc sampling and parsing (smaps_rollup, status, stat) + CSV writing. |
| benchmark/lib/warmup.py | Warmup predicates (entity-count stability + USS derivative threshold). |
| benchmark/lib/__init__.py | Package marker for benchmark.lib. |
| benchmark/scaler/spawn_nodes.py | Synthetic graph planning (node/topic/service/param specs). |
| benchmark/scaler/synthetic_graph.py | rclpy-based synthetic graph host that publishes and exposes services. |
| benchmark/scaler/__init__.py | Package marker for benchmark.scaler. |
| benchmark/profiles/synthetic.compose.yml | Docker Compose profile to run gateway + synthetic graph in one container. |
| benchmark/profiles/Dockerfile.benchmark | Benchmark image build (ROS Jazzy, tools, clone/build ros2_medkit). |
| benchmark/profiles/run_gateway_and_graph.sh | Container entrypoint to start synthetic graph and gateway (optionally under heaptrack/valgrind). |
| benchmark/profiles/fastdds.supp | Valgrind suppressions for FastDDS-related shutdown noise. |
| benchmark/tests/test_cli_wiring.py | CLI help/subcommand presence test. |
| benchmark/tests/test_config_sweep.py | Unit tests for deep-merge and override application behavior. |
| benchmark/tests/test_docker_helpers.py | Unit tests for PID parsing error cases. |
| benchmark/tests/test_gateway_client.py | Unit tests for items-count JSON parsing. |
| benchmark/tests/test_heap_report.py | Unit tests for heap report markdown rendering. |
| benchmark/tests/test_leak_parse.py | Unit tests for heaptrack/memcheck summary parsing. |
| benchmark/tests/test_memcheck_report.py | Unit tests for memcheck report markdown rendering. |
| benchmark/tests/test_metrics.py | Unit tests for numeric/stat helpers. |
| benchmark/tests/test_overrides_load.py | Unit tests for loading override sets YAML. |
| benchmark/tests/test_report_aggregate.py | Unit tests for repeat aggregation + verdict helpers. |
| benchmark/tests/test_report_render.py | Unit tests for footprint markdown rendering formatting/contents. |
| benchmark/tests/test_runner_summary.py | Unit tests for window summarization output keys and sanity. |
| benchmark/tests/test_runmeta.py | Unit tests for required run metadata fields. |
| benchmark/tests/test_sampler_loop.py | Unit tests for sampling loop helpers and CPU-cores derivation. |
| benchmark/tests/test_sampler_parse.py | Unit tests for /proc parsing routines with fixtures. |
| benchmark/tests/test_scaler_plan.py | Unit tests for synthetic graph planning (counts, uniqueness, cardinality). |
| benchmark/tests/test_scaling_rows.py | Unit tests for scaling row derivation (USS per entity). |
| benchmark/tests/test_validation.py | Unit tests for synthetic-vs-demo validation messaging. |
| benchmark/tests/test_warmup.py | Unit tests for warmup predicate helpers. |
| benchmark/tests/fixtures/smaps_rollup.txt | Fixture for smaps_rollup parsing tests. |
| benchmark/tests/fixtures/stat.txt | Fixture for /proc/<pid>/stat parsing tests. |
| benchmark/tests/fixtures/status.txt | Fixture for /proc/<pid>/status parsing tests. |
| benchmark/tests/fixtures/memcheck.txt | Fixture for valgrind memcheck parsing tests. |
| benchmark/tests/fixtures/heaptrack_print.txt | Fixture for heaptrack_print parsing tests. |
| benchmark/tests/fixtures/medkit_params.yaml | Fixture for params override application tests. |
| benchmark/tests/__init__.py | Package marker for benchmark.tests. |
| benchmark/__init__.py | Package marker for benchmark. |
| .gitignore | Ignores benchmark results output, baked params, and benchmark pyc/caches. |
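The table mentions `/proc/<pid>/stat` parsing and a CPU-cores derivation. A minimal sketch of that idea follows, assuming the usual approach: take `utime + stime` (fields 14 and 15, in clock ticks) from two samples and divide the difference by `CLK_TCK` and the elapsed time. The names `cpu_ticks` and `cpu_cores` and the sample lines are hypothetical, not the harness's actual API.

```python
def cpu_ticks(stat_text: str) -> int:
    """Return utime + stime (in clock ticks) from /proc/<pid>/stat content."""
    # Field 2 (comm) may contain spaces or parentheses, so split only the
    # part after the LAST ')'. rest[0] is then field 3 (state).
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # utime is field 14 and stime is field 15, i.e. rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_cores(stat_before: str, stat_after: str, dt_sec: float, clk_tck: int) -> float:
    """Average number of cores used between two samples taken dt_sec apart."""
    if dt_sec <= 0 or clk_tck <= 0:
        raise ValueError("dt_sec and clk_tck must be positive")
    return (cpu_ticks(stat_after) - cpu_ticks(stat_before)) / clk_tck / dt_sec


# Made-up, shortened stat lines for illustration (real ones have more fields).
BEFORE = "1234 (gateway node) S 1 1234 1234 0 -1 4194304 100 0 0 0 500 250 0 0 20 0 8 0 1000 0 0"
AFTER = "1234 (gateway node) S 1 1234 1234 0 -1 4194304 120 0 0 0 700 350 0 0 20 0 8 0 1000 0 0"

print(cpu_cores(BEFORE, AFTER, dt_sec=2.0, clk_tck=100))  # 1.5
```

Passing `clk_tck` in as a parameter, rather than reading it inside the function, keeps the helper testable and lets the caller use the value reported inside the container.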
/proc USS/PSS/CPU sampling, the Student-t + AR(1) statistics engine, run metadata, config-override loading and leak/memcheck log parsing, with unit tests.
…compare Fresh-container cell runner with the enforced warm-up gate, median/IQR aggregation and CI-gated report rendering, the transient burst sampler, and the baseline-diff engine.
…r profiles Synthetic ROS 2 graph + HTTP load generator, the fault_manager injector, and the single-container gateway/graph/fault images and entrypoints.
The orchestrator CLI wiring every lane plus all/report, and the harness README documenting the method, metrics and lanes.
Rebuilds the gateway with debug symbols and runs it under heaptrack attached to the real Nav2 graph; the tracked heap plateaus, so the gateway does not leak on Nav2.
Pin the gateway commit via ROS2_MEDKIT_REF through Compose, capture the SHA in the demo image, seed a host-keyed baseline, and add a dispatch+weekly CI job that compares a run against it and fails on regression.
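The baseline comparison these commits describe can be sketched roughly as follows. This is an illustration under assumptions, not the harness's `compare` implementation: the `Summary` fields and function name are hypothetical, and the thresholds (USS +10%, CPU +15%) are the ones quoted in the PR description. The scaling-exponent check is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical per-run summary; field names are illustrative."""
    host: str
    uss_kb: float
    cpu_cores: float
    high_load: bool = False


USS_LIMIT = 0.10  # fail if USS grows by more than 10% over baseline
CPU_LIMIT = 0.15  # fail if CPU grows by more than 15% over baseline


def compare(baseline: Summary, run: Summary) -> list[str]:
    """Return regression messages; an empty list means the run passes."""
    if run.host != baseline.host:
        raise ValueError("refusing cross-machine comparison")
    if run.high_load:
        raise ValueError("refusing comparison of a high-load run")
    problems = []
    uss_delta = run.uss_kb / baseline.uss_kb - 1.0
    cpu_delta = run.cpu_cores / baseline.cpu_cores - 1.0
    if uss_delta > USS_LIMIT:
        problems.append(f"USS regressed by {uss_delta:+.1%}")
    if cpu_delta > CPU_LIMIT:
        problems.append(f"CPU regressed by {cpu_delta:+.1%}")
    return problems


# Made-up numbers for illustration.
base = Summary(host="bench-01", uss_kb=40000, cpu_cores=0.20)
print(compare(base, Summary(host="bench-01", uss_kb=42000, cpu_cores=0.21)))
print(compare(base, Summary(host="bench-01", uss_kb=45000, cpu_cores=0.21)))
```

A caller would map a non-empty result to a non-zero exit code so that a CI job fails on regression.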
mfaferek93 reviewed Jun 20, 2026
Add a churn lane that gates gateway memory growth under ROS graph churn (static vs churning-graph USS slope, PASS/FAIL, exit 1 on leak), plus a synthetic-graph churn mode (BENCH_CHURN_SEC / BENCH_CHURN_COUNT).

Honesty and robustness fixes so lanes report real data instead of silent zeros:
- memcheck: run valgrind directly on gateway_node (not the ros2 launcher), capture stderr, gate on readiness, poll for the LEAK SUMMARY; fix the malformed fastdds.supp that made valgrind abort at startup
- heap: bash pipefail and fail loud when heaptrack produces no summary
- heap_on_nav2.sh: two-phase (clean USS without heaptrack as the leak verdict, short heaptrack pass for call-site attribution) with an OLS slope CI
- scaling regression gate: baseline-relative CI comparison instead of an absolute ci_lo>1 threshold; absolute floor only when baseline is absent
- compare: gate high host load across all lanes, not just the first
- docker_helpers: per-call subprocess timeouts and curl --max-time / --connect-timeout; accept any positive CLK_TCK; merge_stderr option
- sampler: tolerate transient /proc read errors, stop after the process is gone
- burst: take clk_tck as a parameter; require USS to leave the band before declaring recovery
- load_gen: include timed-out requests in tail latency, report error_rate, stop on SIGTERM
- fault lane: mark failed cells and exclude them from the table, chart and optimization signals instead of rendering fabricated zeros
- runner: warm the gateway under load for the load lane; treat an empty sample window as a failed cell
- report: leak verdict wording (leaked-at-exit, not "heap grew")
- cmd_report / _latest_run_dir: clear error on missing or empty results dir
- cmd_load: per-level thread census
- turtlebot3: override_root typed as list[str]

Tooling:
- --run-dir to write several lanes into one shared run dir for a single compare
- CI runs the harness unit tests on a GitHub-hosted runner
- portable test working directory; docs and unit tests for all of the above
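The churn lane's "static vs churning-graph USS slope" can be pictured with an ordinary least-squares fit of USS against time. The sketch below is a simplified assumption of that idea: it returns the slope and its standard error assuming independent residuals, and it does not reproduce the Student-t + AR(1) correction or the PASS/FAIL rule used by the harness. The function name and the sample series are made up.

```python
import math


def ols_slope(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (slope, standard error of slope) for a least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    stderr = math.sqrt(sse / (n - 2) / sxx)
    return slope, stderr


# Made-up series: time in seconds, USS in kB.
t = [0.0, 10.0, 20.0, 30.0]
static_uss = [1000.0, 1001.0, 999.0, 1000.0]    # flat apart from noise
churn_uss = [1000.0, 1050.0, 1100.0, 1150.0]    # grows 5 kB/s

print(ols_slope(t, static_uss))
print(ols_slope(t, churn_uss))
```

A confidence interval would multiply the standard error by a t critical value for `n - 2` degrees of freedom; that step is left out here to keep the example dependency-free.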
What this gives
A benchmark for the gateway's runtime cost that points at what to optimize AND tracks whether a change improved or regressed it. A Python orchestrator drives docker compose and samples the gateway process via `/proc` (USS/PSS, CPU-cores), with repeats and confidence intervals - not single readings.
Each lane writes a table, a chart, and a JSON summary with a verdict line. `benchmark/README.md` documents the method and metrics.

Lanes
- `scripts/heap_on_nav2.sh` runs a long heaptrack on the real Nav2 stack (debug-symbol gateway) - the tracked heap plateaus, so the gateway does not leak on Nav2.
- `footprint` via `--load`).

Regression tracking ("did we improve?")
- `ROS2_MEDKIT_REF`, pinnable through Compose), demo image digest, host CPU/RAM/allocator, and a high-load flag.
- `compare` diffs a run against a committed `baseline/<host>.json`, refuses cross-machine or high-load runs, and exits non-zero on regression (USS +10%, CPU +15%, scaling exponent CI crossing 1.0).
- `update-baseline` re-pins after a confirmed improvement.
- `workflow_dispatch` + weekly CI job (self-hosted runner) benchmarks a pinned medkit ref and fails on regression.

Built to not overclaim
`/proc` slope without heaptrack call-sites is inconclusive, not a leak; the heap lane discloses that `/proc` USS under heaptrack is inflated.

What it found (one host, illustrative)
USS ~ entities^0.46, CI [0.26, 0.65] - sub-linear confirmed.

Notes
Synthetic lanes run the gateway and graph (or fault_manager) in one container (the Docker bridge does not forward DDS multicast) and build a debug-symbol gateway image for heap/leak work. Runs on plain Docker (probing via `docker exec` also covers docker-out-of-docker). Unit tests: 158. The CI job needs a fixed self-hosted runner so the host-keyed baseline stays valid.

Related Issue
n/a
Checklist
benchmark/README.md)
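The scaling result quoted in the description (USS ~ entities^0.46) is a power-law exponent, which is conventionally estimated as the slope of a straight-line fit in log-log space. The sketch below shows that estimate only; it is an assumption about the general technique, it does not compute the confidence interval, and the function name and data are invented for the example.

```python
import math


def loglog_exponent(entities: list[float], uss_kb: list[float]) -> float:
    """Estimate k in uss ~ c * entities**k via least squares on log-log data."""
    if len(entities) != len(uss_kb) or len(entities) < 2:
        raise ValueError("need at least 2 paired points")
    lx = [math.log(n) for n in entities]
    ly = [math.log(u) for u in uss_kb]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    if sxx == 0:
        raise ValueError("entity counts must not all be equal")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly)) / sxx


# Synthetic data generated from uss = 2000 * n**0.5, so the fit should
# recover an exponent of 0.5 (sub-linear: below 1.0).
counts = [10.0, 50.0, 100.0, 500.0]
uss = [2000.0 * n ** 0.5 for n in counts]

k = loglog_exponent(counts, uss)
print(f"exponent = {k:.2f}, sub-linear = {k < 1.0}")
```

An exponent below 1.0 means memory grows more slowly than the entity count; whether that conclusion is "confirmed" depends on the whole confidence interval staying below 1.0, as in the interval quoted above.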