mirror of https://github.com/XRPLF/rippled.git synced 2026-07-24 07:30:30 +00:00

Pratik Mankawde 8dd64d4dcd fix(telemetry): add second-scale spanmetrics histogram buckets

P95 of second-scale spans was a meaningless interpolation. The spanmetrics
histogram topped out at [.. 1s, 5s], so consensus.round (~3.9s) and
consensus.establish (~1.9s) all fell into one 1s-5s bucket and
histogram_quantile interpolated linearly across that 4s-wide gap — the
"Build vs Close" / "Ledger Close Duration" panels' P95 read ~4800ms purely
as an artifact (verified: sum/count avg = 3824ms). ledger.acquire was worse:
~17% of samples exceeded the 5s ceiling, so its p95/p99 were unmeasurable.

Add 2s, 3s, 4s (resolve the 1-5s pile-up) and 10s, 30s (give the
ledger.acquire catch-up tail a measurable home). All ten existing boundaries
are preserved and the list stays strictly ascending (the connector
binary-searches buckets and silently misbuckets otherwise). Pin unit=ms so a
future collector default-unit flip can't rename the metric to _seconds.
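
The ~4800ms artifact can be reproduced by hand. Below is a minimal sketch of the linear interpolation that histogram_quantile performs, assuming for illustration that all 100 samples of a span land in a single bucket. The sample counts are hypothetical, not the mainnet data, and the function is a simplified model rather than Prometheus's own code.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from (upper_bound_ms, cumulative_count) pairs.

    Buckets must be in ascending order and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lies past the last finite boundary: the estimate is capped there.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical counts: 100 samples, all between 3 s and 4 s.
old = [(1000, 0), (5000, 100), (math.inf, 100)]
new = [(1000, 0), (2000, 0), (3000, 0), (4000, 100), (5000, 100), (math.inf, 100)]
# Hypothetical ledger.acquire shape: 17% of samples above the 5 s ceiling.
tail = [(1000, 40), (5000, 83), (math.inf, 100)]

print(histogram_quantile(0.95, old))   # 4800.0
print(histogram_quantile(0.95, new))   # 3950.0
print(histogram_quantile(0.95, tail))  # 5000
```

With the old boundaries the P95 is simply 95% of the way across the 1s-5s bucket, whatever the real distribution is; with the 3s and 4s boundaries the same samples give an estimate inside the 1s-wide bucket they actually occupy.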

Buckets chosen from the live mainnet distribution, not guessed. Native
beast::insight histograms (ms-scale RPC/IO timers in Telemetry.cpp) are 100%
under 5s, so they keep the original buckets — this is collector-only.

Applies on collector restart (cumulative series reset once, handled by
rate()). Runbook and regression-threshold bucket notes updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-07-22 14:27:06 +01:00

README.md

Telemetry Workload Tools

Synthetic workload generation and validation tools for xrpld's OpenTelemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.

Quick Start

# Build xrpld with telemetry enabled
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default

# Run full validation (starts everything, runs load, validates)
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld

# Cleanup when done
docker/telemetry/workload/run-full-validation.sh --cleanup

Architecture

The validation suite runs a multi-node xrpld cluster as local processes alongside a Docker Compose telemetry stack. The cluster exercises consensus, peer-to-peer spans (proposals, validations), and all metric pipelines.

run-full-validation.sh (shell orchestrator)
  |
  |-- docker-compose.workload.yaml
  |     |-- otel-collector (traces via OTLP + StatsD receiver)
  |     |-- tempo (trace backend + TraceQL search API)
  |     |-- prometheus (metrics scraping)
  |     |-- grafana (dashboards, provisioned automatically)
  |
  |-- generate-validator-keys.sh
  |     -> validator-keys.json, validators.txt
  |
  |-- Nx xrpld nodes (local processes, full telemetry)
  |     - Each node: [telemetry] enabled=1, trace_rpc/consensus/transactions
  |     - [signing_support] true (server-side signing for tx_submitter)
  |     - Peer discovery via [ips] (not [ips_fixed]) for active peer counts
  |
  |-- workload_orchestrator.py (phased load execution)
  |     |-- rpc_load_generator.py (WebSocket RPC traffic)
  |     |-- tx_submitter.py (transaction diversity)
  |     -> workload-report.json + per-phase reports
  |
  |-- validate_telemetry.py (pass/fail checks)
  |     -> validation-report.json
  |
  |-- benchmark.sh (baseline vs telemetry comparison)
        -> benchmark-report-*.md

Workload Profiles

The workload orchestrator (workload_orchestrator.py) reads named profiles from workload-profiles.json and executes sequential load phases. Within each phase, the RPC generator and TX submitter run concurrently.

Available Profiles

Profile	Phases	Duration	Purpose
`full-validation`	6	~5 min + 1 min propagation	Full 18-dashboard coverage with burst/idle/plateau patterns
`quick-smoke`	1	~30s + 30s propagation	Fast CI smoke test
`stress`	3	~3.5 min + 1 min propagation	Heavy sustained load for benchmarking

full-validation Phases

Phase	RPC Rate	TX TPS	Duration	Dashboard Coverage
warmup	5 RPS	—	30s	Node Health, Validator Health (baseline gauges)
steady-state	30 RPS	3 TPS	60s	All dashboards (plateau data)
rpc-burst	100 RPS	—	30s	Job Queue, RPC Performance (latency spikes)
tx-flood	5 RPS	20 TPS	30s	Fee Market & TxQ, Transaction Overview
mixed-peak	50 RPS	10 TPS	60s	Consensus Health, Ledger Operations
cooldown	5 RPS	—	30s	Recovery patterns, state transitions
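
As a rough sanity check on a run, the nominal load implied by the table above can be computed directly. This is a sketch only: it treats a dash as a rate of 0, and the real request counts depend on how closely the generators hold their target rates.

```python
# (name, rpc_rate_per_sec, tx_per_sec, duration_sec) copied from the table above.
# A dash in the table is treated as 0 here.
PHASES = [
    ("warmup", 5, 0, 30),
    ("steady-state", 30, 3, 60),
    ("rpc-burst", 100, 0, 30),
    ("tx-flood", 5, 20, 30),
    ("mixed-peak", 50, 10, 60),
    ("cooldown", 5, 0, 30),
]

def nominal_totals(phases):
    """Return (total_seconds, total_rpc_requests, total_transactions)."""
    seconds = sum(duration for _, _, _, duration in phases)
    rpc = sum(rate * duration for _, rate, _, duration in phases)
    tx = sum(tps * duration for _, _, tps, duration in phases)
    return seconds, rpc, tx

print(nominal_totals(PHASES))  # (240, 8250, 1380)
```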

Custom Profiles

Add profiles to workload-profiles.json:

{
  "profiles": {
    "my-custom": {
      "description": "Custom profile for specific testing",
      "phases": [
        {
          "name": "phase-name",
          "description": "What this phase exercises",
          "duration_sec": 60,
          "rpc": { "rate": 50, "weights": { "server_info": 80, "fee": 20 } },
          "tx": { "tps": 5, "weights": { "Payment": 100 } }
        }
      ],
      "propagation_wait_sec": 30
    }
  }
}

Set "rpc" or "tx" to null to skip that generator for a phase. Custom "weights" override the default command/transaction distribution.
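
Before running a custom profile it can help to check its shape and total run time. The helper below is a hypothetical sketch that relies only on the fields shown in the example above; the checks it performs are assumptions, not the orchestrator's own validation rules.

```python
import json

# One profile body, using only the fields from the example above.
PROFILE_JSON = """
{
  "description": "Custom profile for specific testing",
  "phases": [
    {"name": "phase-name", "duration_sec": 60,
     "rpc": {"rate": 50, "weights": {"server_info": 80, "fee": 20}},
     "tx": null}
  ],
  "propagation_wait_sec": 30
}
"""

def total_runtime_sec(profile):
    """Sum phase durations plus the propagation wait, rejecting obviously bad values."""
    total = profile.get("propagation_wait_sec", 0)
    for phase in profile["phases"]:
        if phase["duration_sec"] <= 0:
            raise ValueError(f"phase {phase['name']!r}: duration must be positive")
        for key, rate_field in (("rpc", "rate"), ("tx", "tps")):
            generator = phase.get(key)
            if generator is None:  # null skips this generator for the phase
                continue
            if generator[rate_field] <= 0:
                raise ValueError(f"phase {phase['name']!r}: {key}.{rate_field} must be positive")
            if any(w < 0 for w in generator.get("weights", {}).values()):
                raise ValueError(f"phase {phase['name']!r}: negative weight in {key}")
        total += phase["duration_sec"]
    return total

profile = json.loads(PROFILE_JSON)
print(total_runtime_sec(profile))  # 90
```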

Tools Reference

run-full-validation.sh

Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node xrpld cluster, generates load, and validates the results.

# Full validation with defaults (uses full-validation profile)
./run-full-validation.sh --xrpld /path/to/xrpld

# Quick smoke test
./run-full-validation.sh --xrpld /path/to/xrpld --profile quick-smoke

# Stress test with benchmarks
./run-full-validation.sh --xrpld /path/to/xrpld --profile stress --with-benchmark

# Skip Loki checks (if Phase 8 not deployed)
./run-full-validation.sh --xrpld /path/to/xrpld --skip-loki

workload_orchestrator.py

Reads a named profile from workload-profiles.json and executes sequential load phases. Within each phase, rpc_load_generator.py and tx_submitter.py run as concurrent subprocesses. Produces per-phase reports and a combined summary.

# Run with a specific profile
python3 workload_orchestrator.py --profile full-validation

# Multiple endpoints
python3 workload_orchestrator.py --profile full-validation \
    --endpoints ws://localhost:6006 ws://localhost:6007

# Save combined report
python3 workload_orchestrator.py --profile stress --report /tmp/report.json
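
The execution model described above (phases in sequence, generators in parallel within a phase) can be sketched with subprocess.Popen. This is an illustration of the pattern, not the orchestrator's code; the function names are hypothetical and the demo commands are placeholders standing in for the two generators.

```python
import subprocess
import sys

def run_phase(commands):
    """Start every command of one phase at once, then wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]

def run_profile(phases):
    """Run phases one after another; returns {phase_name: [exit codes]}."""
    return {name: run_phase(commands) for name, commands in phases}

# Placeholder commands; a real phase would launch rpc_load_generator.py
# and tx_submitter.py with the phase's rate and duration.
demo = [
    ("warmup", [[sys.executable, "-c", "pass"]]),
    ("steady-state", [[sys.executable, "-c", "pass"],
                      [sys.executable, "-c", "import sys; sys.exit(3)"]]),
]
print(run_profile(demo))  # {'warmup': [0], 'steady-state': [0, 3]}
```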

rpc_load_generator.py

Generates RPC traffic matching a realistic production distribution. Uses xrpld's native WebSocket command format ({"command": ...}) with flat parameters, the same format as tx_submitter.py. The default mix is:

40% health checks (server_info, fee)
30% wallet queries (account_info, account_lines, account_objects)
15% explorer queries (ledger, ledger_data)
10% transaction lookups (tx, account_tx)
5% DEX queries (book_offers, amm_info)

# Basic usage
python3 rpc_load_generator.py --endpoints ws://localhost:6006 --rate 50 --duration 120

# Multiple endpoints (round-robin)
python3 rpc_load_generator.py \
    --endpoints ws://localhost:6006 ws://localhost:6007 \
    --rate 100 --duration 300

# Custom weights
python3 rpc_load_generator.py --endpoints ws://localhost:6006 \
    --weights '{"server_info": 80, "account_info": 20}'
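
One way to realise a category mix like the default one is two-stage weighted sampling. The sketch below assumes commands within a category are equally likely, which the list above does not state; the generator's actual per-command weights may differ.

```python
import random

# Category weights and commands as listed in the default distribution.
CATEGORIES = {
    "health": (40, ["server_info", "fee"]),
    "wallet": (30, ["account_info", "account_lines", "account_objects"]),
    "explorer": (15, ["ledger", "ledger_data"]),
    "tx_lookup": (10, ["tx", "account_tx"]),
    "dex": (5, ["book_offers", "amm_info"]),
}

def pick_command(rng):
    """Pick a category by weight, then a command within it."""
    names = list(CATEGORIES)
    weights = [CATEGORIES[name][0] for name in names]
    category = rng.choices(names, weights=weights, k=1)[0]
    # Assumption: uniform choice inside a category.
    return rng.choice(CATEGORIES[category][1])

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [pick_command(rng) for _ in range(10_000)]
health_share = sum(cmd in ("server_info", "fee") for cmd in sample) / len(sample)
print(f"health share: {health_share:.2f}")
```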

tx_submitter.py

Submits diverse transaction types to exercise the full span and metric surface. Uses xrpld's native WebSocket command format ({"command": ...}) rather than JSON-RPC format. The response payload is inside the "result" key, with "status" at the top level.

Supported transaction types:

Payment (XRP transfers) — exercises tx.process, tx.receive, tx.apply
OfferCreate / OfferCancel (DEX activity)
TrustSet (trust line creation)
NFTokenMint / NFTokenCreateOffer (NFT activity)
EscrowCreate / EscrowFinish (escrow lifecycle)
AMMCreate / AMMDeposit (AMM pool operations)

Requires [signing_support] true in the node config for server-side signing.

# Basic usage
python3 tx_submitter.py --endpoint ws://localhost:6006 --tps 5 --duration 120

# Custom mix
python3 tx_submitter.py --endpoint ws://localhost:6006 \
    --weights '{"Payment": 60, "OfferCreate": 20, "TrustSet": 20}'
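
The native WebSocket format mentioned above puts parameters flat next to "command", and the reply carries "status" at the top level with the payload under "result". A minimal sketch of building a request and unwrapping a reply follows; the helper names, the account string, and the sample reply are hypothetical.

```python
import json

def ws_request(command, request_id, **params):
    """Build a native WebSocket request: flat parameters next to "command"."""
    return json.dumps({"id": request_id, "command": command, **params})

def unwrap(raw):
    """Return the "result" payload, raising if the top-level status is not success."""
    message = json.loads(raw)
    if message.get("status") != "success":
        raise RuntimeError(f"request failed: {message.get('error', 'unknown error')}")
    return message["result"]

# Hypothetical account and reply, for illustration only.
request = ws_request("account_info", 1, account="rHypotheticalAccount")
reply = '{"id": 1, "status": "success", "type": "response", "result": {"validated": true}}'
print(request)
print(unwrap(reply))  # {'validated': True}
```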

validate_telemetry.py

Automated validation that all expected telemetry data exists. Every metric and span is required: if one does not fire, validation fails.

Span validation: All span types from expected_spans.json with required attributes and parent-child hierarchies
Metric validation: All metrics from expected_metrics.json — SpanMetrics, StatsD gauges/counters/histograms, Phase 9 OTLP metrics. Every listed metric must have > 0 series. Uses the Prometheus /api/v1/series endpoint (not instant queries) to avoid false negatives from stale gauges.
Log-trace correlation: trace_id/span_id in Loki logs (requires Loki)
Dashboard validation: All 10 Grafana dashboards load with panels

# Run all validations
python3 validate_telemetry.py --report /tmp/report.json

# Skip Loki checks
python3 validate_telemetry.py --skip-loki --report /tmp/report.json
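
The series-based metric check can be sketched as follows. The URL builder and response parser use the Prometheus HTTP API shape (match[], start, end; a "data" list of label sets); the base URL, time range, and helper names are assumptions, and no request is sent here.

```python
from urllib.parse import urlencode

def series_url(base_url, metric, start, end):
    """Build a /api/v1/series query for one metric over a time range (Unix seconds)."""
    query = urlencode({"match[]": metric, "start": start, "end": end})
    return f"{base_url}/api/v1/series?{query}"

def series_count(payload):
    """Count the series in a decoded /api/v1/series response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error', 'unknown')}")
    return len(payload["data"])

# Hypothetical base URL and time range.
url = series_url("http://localhost:9090", "rpc_in_flight_requests", 1700000000, 1700000600)
print(url)

# Hypothetical response body: one series exists, so the check would pass.
payload = {"status": "success", "data": [{"__name__": "rpc_in_flight_requests"}]}
print(series_count(payload) > 0)  # True
```

Because the series endpoint reports any series with samples in the range, a gauge that went stale shortly before the query still counts as present.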

OTel Timings Regression Gate

capture_timings.py + compare_to_baseline.py implement a regression gate that compares OTel-derived per-span/per-RPC/per-job timings against a committed baseline. Unlike benchmark.sh (which measures the overhead of enabling telemetry on the current binary), this gate catches xrpld performance regressions over time by diffing against a stored baseline from a prior run.

How it runs inside the validation pipeline:

run-full-validation.sh executes the normal workload and validation suite.
After validation, capture_timings.py queries Prometheus for every metric in regression-metrics.json and writes reports/timings.json.
compare_to_baseline.py reads timings.json, baselines/baseline-timings.json, and regression-thresholds.json, then either:
- Prints the paste-me JSON block (when the baseline is a placeholder or empty) and exits 0.
- Prints a delta table, writes reports/regression-report.json, and exits non-zero if any metric breaches both the percentage AND the absolute bound.
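
The two-bound rule can be sketched as below. Requiring both bounds keeps small absolute changes on fast metrics from tripping the gate. The function name, the bounds, and the assumption that a higher timing is worse are illustrative, not taken from compare_to_baseline.py.

```python
import math

def is_regression(baseline_ms, current_ms, pct_bound, abs_bound_ms):
    """Flag a regression only when the slowdown exceeds both bounds."""
    delta = current_ms - baseline_ms
    if baseline_ms > 0:
        pct = 100.0 * delta / baseline_ms
    else:
        # No usable baseline value: treat any increase as an unbounded percentage.
        pct = math.inf if delta > 0 else 0.0
    return pct > pct_bound and delta > abs_bound_ms

# Hypothetical bounds: 20% and 5 ms.
print(is_regression(2.0, 3.0, 20, 5))      # False: +50% but only +1 ms
print(is_regression(100.0, 130.0, 20, 5))  # True: +30% and +30 ms
print(is_regression(100.0, 106.0, 20, 5))  # False: +6 ms but only +6%
```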

Bootstrapping a baseline:

Push the branch. The Telemetry Validation CI run prints the full timings JSON under "Paste into baselines/baseline-timings.json" in the workflow Step Summary.
Open a PR copying that JSON block verbatim into baselines/baseline-timings.json. Reviewer approval is the audit gate.
Subsequent runs compare against it; the gate fails on regression.

Per-run tuning:

--skip-regression disables the gate (local exploration only).
REGRESSION_WINDOW env var overrides the default Prometheus rate() window (3m). Keep it close to the workload duration.
Metric surface lives in regression-metrics.json; thresholds in regression-thresholds.json; both are reviewed changes.

See baselines/README.md for the baseline lifecycle and refresh process.

benchmark.sh

Compares baseline (no telemetry) vs telemetry-enabled performance:

./benchmark.sh --xrpld /path/to/xrpld --duration 300

Thresholds (configurable via environment):

Metric	Threshold	Env Variable
CPU overhead	< 3%	BENCH_CPU_OVERHEAD_PCT
Memory overhead	< 5MB	BENCH_MEM_OVERHEAD_MB
RPC p99 latency	< 2ms	BENCH_RPC_LATENCY_IMPACT_MS
Throughput impact	< 5%	BENCH_TPS_IMPACT_PCT
Consensus impact	< 1%	BENCH_CONSENSUS_IMPACT_PCT
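
The environment-override pattern behind the table above can be modelled in a few lines. This is a Python sketch of the idea only; benchmark.sh is a shell script and its actual parsing may differ. The variable names and default values are those in the table.

```python
import os

# Defaults taken from the threshold table above.
DEFAULTS = {
    "BENCH_CPU_OVERHEAD_PCT": 3.0,
    "BENCH_MEM_OVERHEAD_MB": 5.0,
    "BENCH_RPC_LATENCY_IMPACT_MS": 2.0,
    "BENCH_TPS_IMPACT_PCT": 5.0,
    "BENCH_CONSENSUS_IMPACT_PCT": 1.0,
}

def load_thresholds(env=None):
    """Return thresholds, letting environment variables override the defaults."""
    env = os.environ if env is None else env
    return {name: float(env.get(name, default)) for name, default in DEFAULTS.items()}

def breaches(measured, thresholds):
    """List thresholds that a measurement does not stay strictly under."""
    return [name for name, limit in thresholds.items()
            if measured.get(name, 0.0) >= limit]

# Hypothetical override and measurement.
thresholds = load_thresholds({"BENCH_CPU_OVERHEAD_PCT": "4.5"})
print(breaches({"BENCH_CPU_OVERHEAD_PCT": 4.0, "BENCH_MEM_OVERHEAD_MB": 6.2}, thresholds))
# ['BENCH_MEM_OVERHEAD_MB']
```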

Reading Validation Reports

The validation report (validation-report.json) is structured as:

{
  "summary": {
    "total": 45,
    "passed": 42,
    "failed": 3,
    "all_passed": false
  },
  "checks": [
    {
      "name": "span.rpc.ws_message",
      "category": "span",
      "passed": true,
      "message": "rpc.ws_message: 15 traces found",
      "details": { "trace_count": 15 }
    }
  ]
}

Categories:

span: Span type existence and attribute validation
metric: Prometheus metric existence
log: Log-trace correlation checks
dashboard: Grafana dashboard accessibility
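
When a run fails, grouping the failed checks by category is usually the quickest way in. A small sketch, assuming only the report fields shown above; the sample report and check names are hypothetical.

```python
from collections import defaultdict

def failures_by_category(report):
    """Return {category: [names of failed checks]} for a validation report."""
    grouped = defaultdict(list)
    for check in report["checks"]:
        if not check["passed"]:
            grouped[check["category"]].append(check["name"])
    return dict(grouped)

# Hypothetical report with the same structure as validation-report.json.
report = {
    "summary": {"total": 3, "passed": 1, "failed": 2, "all_passed": False},
    "checks": [
        {"name": "span.rpc.ws_message", "category": "span", "passed": True,
         "message": "rpc.ws_message: 15 traces found", "details": {"trace_count": 15}},
        {"name": "metric.example_a", "category": "metric", "passed": False,
         "message": "no series", "details": {}},
        {"name": "dashboard.example_b", "category": "dashboard", "passed": False,
         "message": "not found", "details": {}},
    ],
}

print(failures_by_category(report))
# {'metric': ['metric.example_a'], 'dashboard': ['dashboard.example_b']}
```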

CI Integration

The validation runs as a GitHub Actions workflow (.github/workflows/telemetry-validation.yml):

Triggered manually or on pushes to telemetry branches
Builds xrpld, starts the full stack, runs load, validates
Uploads reports as artifacts
Posts summary to PR

Configuration Files

File	Purpose
`workload-profiles.json`	Named load profiles with phase definitions
`expected_spans.json`	Span inventory (names, attributes, hierarchies, config flags)
`expected_metrics.json`	Metric inventory — every listed metric must be present
`test_accounts.json`	Test account roles (keys generated at runtime)
`regression-metrics.json`	Metric surface for the OTel regression gate
`regression-thresholds.json`	Per-metric regression bounds (pct AND abs)
`baselines/baseline-timings.json`	Committed baseline — populated from first CI run
`requirements.txt`	Python dependencies

expected_metrics.json Format

{
  "category_name": {
    "description": "Human-readable description.",
    "metrics": ["metric_1", "metric_2"]
  }
}

Every metric listed must produce > 0 Prometheus series during the validation run. If a metric doesn't fire, adjust the workload generators so they produce enough load to trigger it.
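
The "> 0 series" rule applied to this format can be sketched as below. The function name, metric names, and series counts are hypothetical; in practice the counts would come from Prometheus.

```python
def missing_metrics(expected, series_counts):
    """Return {category: [metrics with zero series]} for an expected_metrics document."""
    missing = {}
    for category, spec in expected.items():
        absent = [metric for metric in spec["metrics"]
                  if series_counts.get(metric, 0) == 0]
        if absent:
            missing[category] = absent
    return missing

# Hypothetical inventory and observed series counts.
expected = {
    "category_name": {
        "description": "Human-readable description.",
        "metrics": ["metric_1", "metric_2"],
    }
}
observed = {"metric_1": 4}

print(missing_metrics(expected, observed))  # {'category_name': ['metric_2']}
```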

expected_spans.json Format

Each span entry defines its name, category, parent (for hierarchy validation), required attributes, and the config_flag that must be enabled:

{
  "name": "rpc.command.*",
  "category": "rpc",
  "parent": "rpc.process",
  "required_attributes": ["command", "version", "rpc_role", "rpc_status"],
  "config_flag": "trace_rpc"
}
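
A sketch of how one entry might be applied to an observed span. It assumes the trailing * in "name" is a glob-style wildcard, which the format above suggests but does not state; the helper name and the span attributes are hypothetical.

```python
from fnmatch import fnmatchcase

ENTRY = {
    "name": "rpc.command.*",
    "category": "rpc",
    "parent": "rpc.process",
    "required_attributes": ["command", "version", "rpc_role", "rpc_status"],
    "config_flag": "trace_rpc",
}

def missing_attributes(entry, span_name, attributes):
    """Return missing required attributes, or None if the entry does not apply."""
    # Assumption: entry names are matched as case-sensitive glob patterns.
    if not fnmatchcase(span_name, entry["name"]):
        return None
    return [attr for attr in entry["required_attributes"] if attr not in attributes]

# Hypothetical span attributes.
span_attrs = {"command": "server_info", "version": 2, "rpc_role": "admin"}
print(missing_attributes(ENTRY, "rpc.command.server_info", span_attrs))  # ['rpc_status']
print(missing_attributes(ENTRY, "consensus.round", span_attrs))          # None
```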

Node Configuration Notes

The orchestrator (run-full-validation.sh) generates node configs with:

[telemetry] enabled=1 with all trace categories (trace_rpc, trace_consensus, trace_transactions)
[signing_support] true — required for tx_submitter.py to submit signed transactions via WebSocket
[ips] (not [ips_fixed]) — ensures peer connections are counted in Peer_Finder_Active_Inbound/Outbound_Peers metrics (fixed peers are excluded from these counters by design)

StatsD Gauge Behaviour

StatsD gauges in beast::insight emit only when their value changes from the previous sample. This can cause two problems in the validation environment:

Initial-zero gauges — if a gauge value is 0 from startup and never changes, the gauge would never emit. To address this, StatsDGaugeImpl initializes m_dirty = true, ensuring the first flush always emits the initial value.
Stale gauges — once a gauge stabilizes (e.g., peer count stays at 1), it stops emitting new data points. Prometheus marks it stale after ~5 minutes. The validation script uses the Prometheus /api/v1/series endpoint instead of instant queries to catch such gauges.
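
Both behaviours can be seen in a small Python model of an emit-on-change gauge. It is a simplified stand-in for the C++ StatsDGaugeImpl, not a port of it; the class, the metric name, and the initially_dirty switch are hypothetical.

```python
class Gauge:
    """Toy model of a gauge that emits only when marked dirty."""

    def __init__(self, name, initially_dirty=True):
        self.name = name
        self.value = 0
        # Starting dirty makes the first flush emit the initial value, even if it is 0.
        self.dirty = initially_dirty

    def set(self, value):
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self):
        """Return a StatsD gauge line if there is something to emit, else None."""
        if not self.dirty:
            return None
        self.dirty = False
        return f"{self.name}:{self.value}|g"

fixed = Gauge("peers")                           # models m_dirty = true at construction
unfixed = Gauge("peers", initially_dirty=False)  # models the initial-zero problem
print(fixed.flush())    # peers:0|g
print(unfixed.flush())  # None

fixed.set(1)
print(fixed.flush())    # peers:1|g
fixed.set(1)            # value unchanged, so nothing new is emitted (stale gauge)
print(fixed.flush())    # None
```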