Captures per-span / per-RPC / per-job timings from Prometheus after the workload run and diffs them against a committed baseline. Regression requires breaching both a percentage and an absolute bound, tolerating small-value noise. When the baseline is a placeholder, the comparator emits the captured JSON in the exact schema for one-time paste into baselines/baseline-timings.json, and the CI Step Summary surfaces that block for the reviewer. Scope: gate only — automated baseline persistence, benchmark.sh PromQL migration, and the historical trend dashboard remain follow-ups.
# Phase 10: Synthetic Workload Generation & Telemetry Validation — Task List

> **Status**: Future Enhancement
>
> **Goal**: Build tools that generate realistic XRPL traffic to validate the full Phases 1-9 telemetry stack end-to-end (all spans, attributes, metrics, dashboards, and log-trace correlation) under controlled load.
>
> **Scope**: Python/shell test harness + multi-node docker-compose environment + automated validation scripts + performance benchmarks.
>
> **Branch**: `pratik/otel-phase10-workload-validation` (from `pratik/otel-phase9-metric-gap-fill`)
>
> **Depends on**: Phase 9 (internal metric gap fill): validates the full metric surface

### Related Plan Documents

| Document                                                             | Relevance                                                       |
| -------------------------------------------------------------------- | --------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md)         | Phase 10 plan: motivation, architecture, exit criteria (§6.8.3) |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Defines the full inventory of spans/metrics to validate         |
| [Phase9_taskList.md](./Phase9_taskList.md)                           | Prerequisite: all internal metrics must be emitting             |

### Why This Phase Exists

Before Phases 1-9 can be considered production-ready, we need proof that:

1. All 16 spans fire with correct attributes under real transaction workloads
2. All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values
3. Log-trace correlation (Phase 8) produces clickable trace_id links in Loki
4. All 10 Grafana dashboards render meaningful data (no empty panels)
5. Performance overhead stays within bounds (< 3% CPU, < 5 MB memory)
6. The telemetry stack survives sustained load without data loss or queue backpressure

---

## Task 10.1: Multi-Node Test Harness

**Objective**: Create a docker-compose environment with 3-5 validator nodes that produces real consensus rounds.

**What to do**:

- Create `docker/telemetry/docker-compose.workload.yaml`:
  - 5 rippled validator nodes with UNL configured for each other
  - All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel`
  - Full OTel stack: Collector, Tempo, Prometheus, Loki, Grafana
  - Shared network with service discovery

- Each node should:
  - Generate validator keys at startup
  - Configure all 5 nodes in its UNL
  - Enable all trace categories including `trace_peer=1`
  - Write logs to a file tailed by the OTel Collector filelog receiver

- Include a `Makefile` target: `make telemetry-workload-up` / `make telemetry-workload-down`

**Key files**:

- New: `docker/telemetry/docker-compose.workload.yaml`
- New: `docker/telemetry/workload/generate-validator-keys.sh`
- New: `docker/telemetry/workload/xrpld-validator.cfg.template`

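
The per-node configuration step could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `xrpld-validator.cfg.template`: the template text, the `$validators` placeholder, the `render_validator_cfg` helper, the placement of `trace_peer=1` under `[telemetry]`, and the key strings are all assumptions made for the example. The point is only that every node receives the same five validator keys in its UNL alongside the telemetry settings named in this task.

```python
from string import Template

# Hypothetical template text; the real xrpld-validator.cfg.template may differ.
# Placing trace_peer under [telemetry] is an assumption of this sketch.
CFG_TEMPLATE = Template(
    "[telemetry]\n"
    "enabled=1\n"
    "trace_peer=1\n"
    "\n"
    "[insight]\n"
    "server=otel\n"
    "\n"
    "[validators]\n"
    "$validators\n"
)


def render_validator_cfg(all_public_keys: list[str]) -> str:
    """Render one node's config with every validator key listed in its UNL."""
    if len(set(all_public_keys)) != len(all_public_keys):
        raise ValueError("validator public keys must be unique")
    return CFG_TEMPLATE.substitute(validators="\n".join(all_public_keys))


# Placeholder strings, not real validator public keys.
keys = [f"PLACEHOLDER_VALIDATOR_KEY_{i}" for i in range(1, 6)]
cfg = render_validator_cfg(keys)
print(cfg)
```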

---

## Task 10.2: RPC Load Generator

**Objective**: Configurable tool that fires all traced RPC commands at controlled rates.

**What to do**:

- Create `docker/telemetry/workload/rpc_load_generator.py`:
  - Connects to one or more rippled WebSocket endpoints
  - Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc.
  - Configurable parameters: rate (RPS), duration, command distribution weights
  - Injects `traceparent` HTTP headers to test W3C context propagation
  - Logs progress and errors to stdout

- Command distribution should match realistic production ratios:
  - 40% `server_info` / `fee` (health checks)
  - 30% `account_info` / `account_lines` / `account_objects` (wallet queries)
  - 15% `ledger` / `ledger_data` (explorer queries)
  - 10% `tx` / `account_tx` (transaction lookups)
  - 5% `book_offers` / `amm_info` (DEX queries)

**Key files**:

- New: `docker/telemetry/workload/rpc_load_generator.py`
- New: `docker/telemetry/workload/requirements.txt`

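
As a rough sketch of two pieces of the load generator, the snippet below picks commands according to the distribution listed in this task and builds a `traceparent` value in the W3C Trace Context format (`version-traceid-parentid-flags`). The function names are hypothetical, and choosing uniformly between the commands within one group is an assumption, since the task only gives per-group percentages.

```python
import random
import secrets

# Weights per command group, taken from the distribution listed in this task.
COMMAND_GROUPS = {
    ("server_info", "fee"): 40,
    ("account_info", "account_lines", "account_objects"): 30,
    ("ledger", "ledger_data"): 15,
    ("tx", "account_tx"): 10,
    ("book_offers", "amm_info"): 5,
}


def pick_command(rng: random.Random) -> str:
    """Pick a group by weight, then a command uniformly within that group."""
    groups = list(COMMAND_GROUPS)
    weights = list(COMMAND_GROUPS.values())
    group = rng.choices(groups, weights=weights, k=1)[0]
    return rng.choice(group)


def make_traceparent() -> str:
    """Build a W3C traceparent value with random trace and parent IDs."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{parent_id}-01"  # version 00, sampled flag set


rng = random.Random(0)
sample = [pick_command(rng) for _ in range(2000)]
header = make_traceparent()
```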

---

## Task 10.3: Transaction Submitter

**Objective**: Generate diverse transaction types to exercise `tx.*` and `ledger.*` spans.

**What to do**:

- Create `docker/telemetry/workload/tx_submitter.py`:
  - Pre-funds test accounts from genesis account
  - Submits a mix of transaction types:
    - `Payment` (XRP and issued currencies): exercises `tx.process`, `tx.apply`
    - `OfferCreate` / `OfferCancel`: DEX activity
    - `TrustSet`: trust line creation for issued currencies
    - `NFTokenMint` / `NFTokenCreateOffer` / `NFTokenAcceptOffer`: NFT activity
    - `EscrowCreate` / `EscrowFinish`: escrow lifecycle
    - `AMMCreate` / `AMMDeposit` / `AMMWithdraw`: AMM pool operations (if amendment enabled)
  - Configurable: TPS target, transaction mix weights, duration
  - Monitors submission results and tracks success/failure rates

- The transaction mix ensures the telemetry captures the full range of ledger activity that third parties care about.

**Key files**:

- New: `docker/telemetry/workload/tx_submitter.py`
- New: `docker/telemetry/workload/test_accounts.json` (pre-generated keypairs)

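
One way to think about the configurable TPS target and mix weights is as a precomputed send schedule. The sketch below is illustrative only: the weights in `TX_MIX` are made-up values (the task leaves them configurable), `build_schedule` and `success_rate` are hypothetical helper names, and counting only `tesSUCCESS` as a success is a simplifying assumption.

```python
import random

# Hypothetical mix weights; the task leaves the actual ratios configurable.
TX_MIX = {
    "Payment": 50,
    "OfferCreate": 15,
    "TrustSet": 10,
    "NFTokenMint": 10,
    "EscrowCreate": 10,
    "AMMDeposit": 5,
}


def build_schedule(tps: float, duration_s: float, mix: dict[str, int], seed: int = 0):
    """Return (send_offset_seconds, tx_type) pairs evenly spaced at the TPS target."""
    rng = random.Random(seed)
    total = int(tps * duration_s)
    interval = 1.0 / tps
    types = rng.choices(list(mix), weights=list(mix.values()), k=total)
    return [(i * interval, tx_type) for i, tx_type in enumerate(types)]


def success_rate(engine_results: list[str]) -> float:
    """Fraction of submissions whose engine result was tesSUCCESS."""
    if not engine_results:
        return 0.0
    return sum(1 for r in engine_results if r == "tesSUCCESS") / len(engine_results)


schedule = build_schedule(tps=5, duration_s=10, mix=TX_MIX)
rate = success_rate(["tesSUCCESS", "tesSUCCESS", "tecPATH_DRY", "tesSUCCESS"])
```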

---

## Task 10.4: Telemetry Validation Suite

**Objective**: Automated scripts that verify all expected telemetry data exists after a workload run.

**What to do**:

- Create `docker/telemetry/workload/validate_telemetry.py`:

  **Span validation** (queries Tempo API):
  - Assert all 16 span names appear in traces
  - Assert each span has its required attributes (22 total attributes across spans)
  - Assert parent-child relationships are correct (`rpc.request` → `rpc.process` → `rpc.command.*`)
  - Assert span durations are reasonable (> 0, < 60s)

  **Metric validation** (queries Prometheus API):
  - Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket`
  - Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc.
  - Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*`
  - Assert metric label cardinality is within bounds

  **Log-trace correlation validation** (queries Loki API):
  - Assert logs contain `trace_id=` and `span_id=` fields
  - Pick a random trace_id from Tempo → query Loki for matching logs → assert results exist
  - Assert Grafana derived field links are functional

  **Dashboard validation**:
  - For each of the 10 Grafana dashboards, query the dashboard API and assert no panels show "No data"

- Output: JSON report with pass/fail per check, suitable for CI.

**Key files**:

- New: `docker/telemetry/workload/validate_telemetry.py`
- New: `docker/telemetry/workload/expected_spans.json` (span inventory for validation)
- New: `docker/telemetry/workload/expected_metrics.json` (metric inventory for validation)

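
The span checks and the JSON report might look roughly like the sketch below. It works on spans that have already been fetched and flattened into plain dicts, so it makes no assumptions about the Tempo API itself. The input shapes, the `validate_spans` name, the check labels, and the `command` attribute are hypothetical; only the span names `rpc.request` and `rpc.process` and the duration bounds (> 0, < 60s) come from this task.

```python
import json


def validate_spans(expected: dict[str, list[str]], observed: list[dict]) -> dict:
    """Compare observed spans against an expected inventory.

    expected maps span name -> required attribute names.
    observed is a list of {"name", "attributes", "duration_s"} dicts.
    """
    by_name: dict[str, list[dict]] = {}
    for span in observed:
        by_name.setdefault(span["name"], []).append(span)

    checks = []
    for name, required in expected.items():
        spans = by_name.get(name, [])
        checks.append({"check": f"span_present:{name}", "passed": bool(spans)})
        if not spans:
            continue  # attribute and duration checks need at least one span
        attrs_ok = all(all(a in s["attributes"] for a in required) for s in spans)
        checks.append({"check": f"span_attributes:{name}", "passed": attrs_ok})
        durations_ok = all(0 < s["duration_s"] < 60 for s in spans)
        checks.append({"check": f"span_duration:{name}", "passed": durations_ok})

    return {"passed": all(c["passed"] for c in checks), "checks": checks}


# Hypothetical inventory and observations, for illustration only.
expected = {"rpc.request": ["command"], "rpc.process": []}
observed = [
    {"name": "rpc.request", "attributes": {"command": "fee"}, "duration_s": 0.004},
]
report = validate_spans(expected, observed)
print(json.dumps(report, indent=2))
```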

---

## Task 10.5: Performance Benchmark Suite

**Objective**: Measure CPU/memory/latency overhead of the telemetry stack.

**What to do**:

- Create `docker/telemetry/workload/benchmark.sh`:
  - **Baseline run**: Start cluster with `[telemetry] enabled=0`, run transaction workload for 5 minutes, record metrics
  - **Telemetry run**: Start cluster with full telemetry enabled, run identical workload, record metrics
  - **Comparison**: Calculate deltas for:
    - CPU usage (per-node average)
    - Memory RSS (per-node peak)
    - RPC p99 latency
    - Transaction throughput (TPS)
    - Consensus round time p95
    - Ledger close time p95

- Output: Markdown table comparing baseline vs. telemetry, with pass/fail against targets:
  - CPU overhead < 3%
  - Memory overhead < 5 MB
  - RPC latency impact < 2ms p99
  - Throughput impact < 5%
  - Consensus impact < 1%

- Store results in `docker/telemetry/workload/benchmark-results/` for historical tracking.

**Key files**:

- New: `docker/telemetry/workload/benchmark.sh`
- New: `docker/telemetry/workload/collect_system_metrics.sh`

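
The comparison step can be illustrated with a short sketch that turns two sets of measurements into the pass/fail Markdown table. The limits are the targets listed in this task, but how each one is interpreted is an assumption of this example: CPU overhead is treated as a difference in percentage points, memory and RPC latency as absolute differences, and throughput and consensus impact as relative changes. The metric keys, function names, and sample numbers are hypothetical.

```python
# (label, metric key, comparison mode, limit, unit)
TARGETS = [
    ("CPU overhead", "cpu_pct", "abs", 3.0, "pp"),
    ("Memory overhead", "rss_mb", "abs", 5.0, "MB"),
    ("RPC p99 latency impact", "rpc_p99_ms", "abs", 2.0, "ms"),
    ("Throughput impact", "tps", "rel_drop", 5.0, "%"),
    ("Consensus round time impact", "consensus_p95_s", "rel_increase", 1.0, "%"),
]


def overhead(baseline: float, telemetry: float, mode: str) -> float:
    """Compute the telemetry overhead for one metric under the given mode."""
    if mode == "abs":
        return telemetry - baseline
    if mode == "rel_increase":
        return (telemetry - baseline) / baseline * 100.0
    if mode == "rel_drop":
        return (baseline - telemetry) / baseline * 100.0
    raise ValueError(f"unknown mode: {mode}")


def compare_runs(baseline: dict, telemetry: dict) -> list[dict]:
    """Evaluate every target and record whether it stayed under its limit."""
    rows = []
    for label, key, mode, limit, unit in TARGETS:
        delta = overhead(baseline[key], telemetry[key], mode)
        rows.append({
            "metric": label, "baseline": baseline[key], "telemetry": telemetry[key],
            "delta": delta, "unit": unit, "limit": limit, "passed": delta < limit,
        })
    return rows


def to_markdown(rows: list[dict]) -> str:
    """Render the comparison rows as a Markdown table."""
    lines = [
        "| Metric | Baseline | Telemetry | Delta | Target | Result |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for r in rows:
        result = "PASS" if r["passed"] else "FAIL"
        lines.append(
            f"| {r['metric']} | {r['baseline']} | {r['telemetry']} "
            f"| {r['delta']:.2f} {r['unit']} | < {r['limit']} {r['unit']} | {result} |"
        )
    return "\n".join(lines)


# Made-up sample measurements, for illustration only.
baseline = {"cpu_pct": 40.0, "rss_mb": 900.0, "rpc_p99_ms": 12.0, "tps": 100.0, "consensus_p95_s": 3.0}
telemetry = {"cpu_pct": 42.0, "rss_mb": 903.5, "rpc_p99_ms": 13.0, "tps": 93.0, "consensus_p95_s": 3.015}
rows = compare_runs(baseline, telemetry)
table = to_markdown(rows)
print(table)
```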

---

## Task 10.6: CI Integration

**Objective**: Wire the validation suite into CI for regression detection.

**What to do**:

- Create a CI workflow (GitHub Actions or equivalent) that:
  1. Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON`
  2. Starts the multi-node workload harness
  3. Runs the RPC load generator + transaction submitter for 2 minutes
  4. Runs the validation suite
  5. Runs the benchmark suite
  6. Fails the build if any validation check fails or benchmark exceeds thresholds
  7. Archives the validation report and benchmark results as artifacts

- This should be a separate workflow (not part of the main CI), triggered manually or on telemetry-related branch changes.

**Key files**:

- New: `.github/workflows/telemetry-validation.yml`
- New: `docker/telemetry/workload/run-full-validation.sh` (orchestrator script)

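
Step 6 above amounts to folding both result sets into a single process exit code. A minimal sketch of that decision, assuming the validation report and benchmark rows are available as plain dicts with a `passed` flag (these shapes and the `ci_exit_code` name are hypothetical, not the orchestrator's actual interface):

```python
def ci_exit_code(validation_report: dict, benchmark_rows: list[dict]) -> int:
    """Return 0 when every check and benchmark target passed, 1 otherwise."""
    failed_checks = [
        c["check"] for c in validation_report.get("checks", []) if not c["passed"]
    ]
    failed_targets = [r["metric"] for r in benchmark_rows if not r["passed"]]

    # Print each failure so the CI log shows what broke the build.
    for name in failed_checks:
        print(f"validation failed: {name}")
    for name in failed_targets:
        print(f"benchmark target exceeded: {name}")

    return 1 if failed_checks or failed_targets else 0


# Made-up inputs, for illustration only.
ok_report = {"checks": [{"check": "span_present:rpc.request", "passed": True}]}
bad_rows = [{"metric": "Throughput impact", "passed": False}]
code = ci_exit_code(ok_report, bad_rows)
```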

---

## Task 10.7: Documentation

**Objective**: Document the workload tools and validation process.

**What to do**:

- Create `docker/telemetry/workload/README.md`:
  - Quick start guide for running workload harness
  - Configuration options for load generator and tx submitter
  - How to read validation reports
  - How to run benchmarks and interpret results

- Update `docs/telemetry-runbook.md`:
  - Add "Validating Telemetry Stack" section
  - Add "Performance Benchmarking" section

- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Add "Validation" section with expected metric/span counts

---

## Exit Criteria — Delivered in PR #6519

- [x] Multi-node validator cluster starts and reaches consensus
- [x] RPC load generator fires all traced RPC commands at configurable rates
- [x] Transaction submitter generates 6+ transaction types at configurable TPS
- [x] Validation suite confirms all required spans, attributes, and metrics
- [x] Log-trace correlation validated end-to-end (Loki ↔ Tempo)
- [x] Grafana dashboards render data (no empty panels)
- [x] Overhead benchmark (`benchmark.sh`) measures telemetry-off vs telemetry-on deltas
- [x] CI workflow runs validation on telemetry branch changes
- [x] Validation report output is CI-parseable (JSON with exit codes)
- [x] OTel-driven regression gate captures per-span/per-RPC/per-job timings from
      Prometheus and compares against a committed baseline

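
The regression gate is described as flagging a timing only when it breaches both a percentage bound and an absolute bound, which keeps small-value noise from tripping it. A minimal sketch of that rule follows. The threshold values, the function names, and the flat `{name: milliseconds}` shape are assumptions for illustration and do not reflect the real `baselines/baseline-timings.json` schema.

```python
def is_regression(
    baseline_ms: float,
    current_ms: float,
    pct_limit: float = 20.0,    # hypothetical percentage bound
    abs_limit_ms: float = 5.0,  # hypothetical absolute bound
) -> bool:
    """Flag a regression only when both bounds are breached."""
    delta_ms = current_ms - baseline_ms
    if delta_ms <= 0:
        return False  # same or faster than baseline
    if baseline_ms <= 0:
        # Percentage change is undefined; fall back to the absolute bound.
        return delta_ms > abs_limit_ms
    pct = delta_ms / baseline_ms * 100.0
    return pct > pct_limit and delta_ms > abs_limit_ms


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return the names present in both maps whose timing regressed."""
    return sorted(
        name
        for name, base in baseline.items()
        if name in current and is_regression(base, current[name])
    )


# Made-up timings in milliseconds, for illustration only.
baseline = {"rpc.request": 1.0, "rpc.process": 100.0, "tx.process": 100.0, "tx.apply": 50.0}
current = {"rpc.request": 2.0, "rpc.process": 104.0, "tx.process": 130.0, "tx.apply": 45.0}
regressed = find_regressions(baseline, current)
print(regressed)
```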
## Follow-up Work (tracked in separate PRs)

- [ ] FU-2: Automate baseline persistence across CI runs (artifact uploaded
      on merge to `develop`, downloaded on PR runs). Current mechanism
      requires a manual baseline-refresh PR.
- [ ] FU-4: Replace the proxy measurements in `benchmark.sh` (wall-clock curl
      p99, ledger-cadence-as-TPS, ledger-cadence-as-consensus-p95) with
      PromQL quantile queries from the same pipeline the regression gate uses.
- [ ] FU-6: Grafana dashboard plotting historical baseline values keyed by
      commit SHA, for triaging noisy regressions.