Documents how to measure rippled's OpenTelemetry overhead via the ripple/perf-iac comparison pipeline (telemetry on vs off, identical load, eBPF profile diff), the exact trigger command, the option to dispatch from rippled CI (and its cross-org token blocker), and the lessons learned (gitignored role tasks file, sequential matrix legs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmarking OpenTelemetry Overhead
How to empirically measure the runtime cost of rippled's OpenTelemetry
instrumentation, using the ripple/perf-iac performance pipeline.
Tracking: RIPD-7155 (under epic RIPD-5060).
What is measured
A perf-iac Performance Comparison run builds and deploys two rippled clusters on dedicated EC2, drives identical JMeter payment load at both, and profiles both:
| Side | rippled build | runtime cfg | collector | profiling |
|---|---|---|---|---|
| on-demand | telemetry compiled in (phase-10 default telemetry=True) | [telemetry] enabled=1, OTLP → 127.0.0.1:4318 | node-local sidecar (receive + discard) | on |
| baseline | telemetry compiled out (telemetry=False) | none | none | on |
Overhead = the delta between the two sides — the rippled-process eBPF profile difference (CPU spent in span creation / attribute extraction on the hot path) plus the JMeter TPS / latency delta. The OTel trace data itself is discarded; only the cost of producing it is measured.
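The throughput part of that delta can be expressed as a relative loss against the baseline. The helper below is only a sketch: `overhead_pct` is a hypothetical name, and the two figures passed to it are placeholders rather than measured results. In practice they would be read from the TPS/latency table of the comparison report.

```shell
#!/bin/sh
# Relative throughput loss of the telemetry-on side versus the baseline, in percent.
# Positive result: the telemetry-on side is slower. Negative: it is faster.
overhead_pct() {
  baseline_tps="$1"
  ondemand_tps="$2"
  awk -v b="$baseline_tps" -v o="$ondemand_tps" 'BEGIN {
    if (b <= 0) { print "baseline TPS must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", (b - o) / b * 100
  }'
}

# Placeholder inputs (baseline 1000 TPS, telemetry-on 950 TPS), not real measurements.
overhead_pct 1000 950   # prints 5.00
```

The same formula applies to latency percentiles with the sign flipped, since higher latency on the telemetry-on side is the cost.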
Why a local discard-collector
rippled's OTLP exporter runs on a background thread. If the endpoint is dead, that thread burns CPU on failed-export retries — and because the exporter is inside the rippled process, that retry CPU lands in the rippled profile and inflates the apparent overhead. A node-local collector that accepts and discards (nop exporter) lets the export succeed instantly, keeping the profile clean. It is CPU-capped (50%) and, being a separate process, is excluded from the rippled-process profile regardless.
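A receive-and-discard collector of this kind can be described with a very small OpenTelemetry Collector config. The sketch below writes one from a shell heredoc. It assumes a collector distribution that ships the `nop` exporter, and `write_discard_config` and the output path are hypothetical names; the actual otel_collector role may render its config differently. The 50% CPU cap is not shown because the source does not say how it is applied.

```shell
#!/bin/sh
# Write a minimal receive-and-discard collector config to the given path.
# Assumption: the installed collector build includes the "nop" exporter.
write_discard_config() {
  out="${1:-./otel-discard.yaml}"
  cat > "$out" <<'EOF'
receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4318
exporters:
  nop:
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [nop]
EOF
}
```

Calling `write_discard_config /tmp/otel-discard.yaml` produces a file that binds the OTLP/HTTP receiver to the same address rippled exports to, so exports succeed immediately and nothing is retained.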
Prerequisites (one-time)
Two branches carry the benchmark setup:
| Branch | Repo | Purpose |
|---|---|---|
| pratik/otel-phase11-telemetry-off | XRPLF/rippled | baseline binary: conanfile.py default_options.telemetry = False |
| pratik/otel-benchmarking-test | ripple/perf-iac | adds the per-side telemetry config key + otel_collector Ansible role |
Both must be pushed. The perf-iac branch is the one to run the workflow from (see below) so the telemetry plumbing is present.
Triggering a run (manual — recommended)
Run the comparison from the perf-iac branch via gh, or via the Actions UI
with Use workflow from = pratik/otel-benchmarking-test.
gh workflow run perf-internal.yml -R ripple/perf-iac \
--ref pratik/otel-benchmarking-test \
-f work-item=RIPD-7155 \
-f testname_base=otel_overhead_phase10 \
-f ondemand_performance_config='{"repo":"xrplf/rippled","ref":"pratik/otel-phase10-workload-validation","telemetry":"on","test_tpm":"60000","test_duration":"600","profiling":"true"}' \
-f baseline_performance_config='{"repo":"xrplf/rippled","ref":"pratik/otel-phase11-telemetry-off","telemetry":"off","profiling":"true"}'
| Field | Meaning |
|---|---|
| --ref | must be the perf-iac branch with the telemetry changes |
| work-item | real Jira key, ≤32 chars (names the dynamic env) |
| telemetry | per-side: on for on-demand, off for baseline; never merges |
| test_tpm | aggregate throughput per minute (60000 = 1000 TPS) |
| test_duration | seconds (600 = 10 min) |
| profiling | true on both sides; this is the measurement |
Shared keys (test_tpm, test_duration) inherit baseline ← on-demand, so set
them once. Omitting ssh-public-key auto-destroys the env after the run.
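Before dispatching, the inputs can be sanity-checked locally against the constraints in the table. This is an illustrative sketch only: `preflight` is a hypothetical helper, and `total_tx` is simply derived as TPM times duration, not a value the pipeline reports.

```shell
#!/bin/sh
# Check the work-item length limit and derive TPS / expected transaction count.
preflight() {
  work_item="$1"
  test_tpm="$2"
  test_duration="$3"
  if [ -z "$work_item" ] || [ "${#work_item}" -gt 32 ]; then
    echo "work-item must be 1-32 characters" >&2
    return 1
  fi
  for n in "$test_tpm" "$test_duration"; do
    case "$n" in
      ''|*[!0-9]*) echo "test_tpm and test_duration must be non-negative integers" >&2; return 1 ;;
    esac
  done
  # test_tpm is per minute, test_duration is in seconds.
  echo "tps=$(( test_tpm / 60 )) total_tx=$(( test_tpm * test_duration / 60 ))"
}

preflight RIPD-7155 60000 600   # prints tps=1000 total_tx=600000
```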
Reading results
- Report URL appears in the Performance Testing job log.
- Slack notice to #ripplex-performance-rippled-ci.
- Compare the two sides' rippled-process profiles + the TPS/latency table.
Triggering from rippled CI (optional — needs a cross-org token)
It is possible to add a workflow_dispatch job in XRPLF/rippled that shells
out to dispatch the perf-iac run, so the benchmark can be kicked off from the
rippled repo. This is not wired up yet because of a cross-org auth
requirement, documented here so DevOps can decide.
The blocker
- rippled lives in XRPLF; perf-iac lives in ripple (different orgs).
- A workflow's default GITHUB_TOKEN is scoped to its own repo and cannot dispatch a workflow in another org.
- A PAT or GitHub App token with actions: write on ripple/perf-iac must be stored as a secret (e.g. PERF_IAC_DISPATCH_TOKEN) in XRPLF/rippled. Provisioning that token is an org-admin decision.
Sketch (once the token exists)
A .github/workflows/otel-benchmark-trigger.yml in rippled, workflow_dispatch
with inputs for the two refs / TPM / duration, whose single step dispatches
perf-iac:
steps:
- dispatch perf-internal.yml on ripple/perf-iac
using gh CLI (or actions/github-script) authenticated with
secrets.PERF_IAC_DISPATCH_TOKEN, passing the same -f inputs as the
manual command above.
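As a rough illustration of what that step's script could look like, the sketch below wraps the manual command in a function. It assumes the secret is exposed to the step as the environment variable PERF_IAC_DISPATCH_TOKEN and relies on gh reading its credential from GH_TOKEN. `dispatch_perf_iac` and the DRY_RUN switch are hypothetical additions for local checking, not part of any existing workflow.

```shell
#!/bin/sh
# Dispatch the perf-iac comparison run with a cross-org token.
# $1 = on-demand config JSON, $2 = baseline config JSON (same shape as the manual command).
dispatch_perf_iac() {
  ondemand_cfg="$1"
  baseline_cfg="$2"
  if [ -z "${PERF_IAC_DISPATCH_TOKEN:-}" ]; then
    echo "PERF_IAC_DISPATCH_TOKEN is not set" >&2
    return 1
  fi
  # Rebuild the positional parameters as the gh command line.
  set -- gh workflow run perf-internal.yml -R ripple/perf-iac \
    --ref pratik/otel-benchmarking-test \
    -f work-item=RIPD-7155 \
    -f testname_base=otel_overhead_phase10 \
    -f "ondemand_performance_config=$ondemand_cfg" \
    -f "baseline_performance_config=$baseline_cfg"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Hypothetical switch: print the command instead of running it.
    printf '%s\n' "$*"
    return 0
  fi
  GH_TOKEN="$PERF_IAC_DISPATCH_TOKEN" "$@"
}
```

Running it with DRY_RUN=1 prints the command line without contacting GitHub, which is a cheap way to check quoting of the JSON inputs before the token exists.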
Notes / caveats:
- This only kicks off the perf-iac run; the actual provisioning, build, load, and profiling still execute in perf-iac under its own OIDC role and repo vars, so no rippled-side AWS access is needed, only the dispatch token.
- Results still surface in the perf-iac Actions run + Slack, not in rippled CI. If rippled-side visibility is wanted, the trigger job can poll the dispatched run and echo its conclusion/report URL into the rippled job summary.
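The job-summary part of that idea could look roughly like the following. It is a sketch under assumptions: `write_run_summary` is a hypothetical helper, the fallback file path exists only for local dry runs, and how the trigger job learns the dispatched run's ID is left open.

```shell
#!/bin/sh
# Append the dispatched run's outcome to the GitHub Actions job summary.
# GITHUB_STEP_SUMMARY is set by the Actions runner; the fallback is for local use only.
write_run_summary() {
  conclusion="$1"
  run_url="$2"
  summary_file="${GITHUB_STEP_SUMMARY:-./step-summary.md}"
  {
    echo "### perf-iac OTel overhead run"
    echo "- conclusion: ${conclusion}"
    echo "- run: ${run_url}"
  } >> "$summary_file"
}

# In the workflow the two values would come from gh, for example:
#   gh run watch "$run_id" -R ripple/perf-iac --exit-status
#   gh run view  "$run_id" -R ripple/perf-iac --json conclusion,url
# where run_id is the dispatched run's ID (obtaining it is not covered here).
```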
Lessons learned
- A parent-directory .gitignore (tasks/) silently excluded the Ansible role's tasks/main.yml. The role was committed without its tasks file and ran as a no-op: the collector was never installed, leaving the OTLP endpoint dead. Always verify what is tracked (git ls-files <role>/) after committing a new role, not just what exists locally; run ansible-playbook --syntax-check on the pushed tree.
- Matrix legs run with max-parallel: 1, so on-demand and baseline run sequentially on one dynamic env (good for comparability; doubles wall-clock).
- Validate the role mechanics locally (syntax-check, render templates); the full integration (real AMI apt install, 5-node provisioning, load) is only exercised in the pipeline, so a short test_duration smoke run is the cheapest way to shake out integration bugs before a long measurement run.
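The first lesson can be turned into a quick check to run after staging a new role. This is a minimal sketch: `assert_role_tasks_tracked` is a hypothetical helper, and the role path in the usage line is an assumed location, not necessarily where the role lives in perf-iac.

```shell
#!/bin/sh
# Fail if a role's tasks/main.yml is not in the git index, and show which
# ignore rule (if any) is hiding it. Run from inside the repository.
assert_role_tasks_tracked() {
  role_dir="$1"
  tasks_file="$role_dir/tasks/main.yml"
  if git ls-files --error-unmatch "$tasks_file" >/dev/null 2>&1; then
    echo "tracked: $tasks_file"
    return 0
  fi
  echo "NOT tracked: $tasks_file" >&2
  # Prints the .gitignore file, line, and pattern responsible, when one matches.
  git check-ignore -v "$tasks_file" >&2 || true
  return 1
}

# Example (assumed path): assert_role_tasks_tracked roles/otel_collector
```

If the file is ignored, `git add -f` on that path or a negation rule in the nearer .gitignore brings it back under version control.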