# Phase 10: Synthetic Workload Generation & Telemetry Validation — Task List

**Status:** Future Enhancement

**Goal:** Build tools that generate realistic XRPL traffic to validate the full Phases 1-9 telemetry stack end-to-end — all spans, attributes, metrics, dashboards, and log-trace correlation — under controlled load.

**Scope:** Python/shell test harness + multi-node docker-compose environment + automated validation scripts + performance benchmarks.

**Branch:** `pratik/otel-phase10-workload-validation` (from `pratik/otel-phase9-metric-gap-fill`)

**Depends on:** Phase 9 (internal metric gap fill) — validates the full metric surface
## Related Plan Documents
| Document | Relevance |
|---|---|
| `06-implementation-phases.md` | Phase 10 plan: motivation, architecture, exit criteria (§6.8.3) |
| `09-data-collection-reference.md` | Defines the full inventory of spans/metrics to validate |
| `Phase9_taskList.md` | Prerequisite — all internal metrics must be emitting |
## Why This Phase Exists

Before Phases 1-9 can be considered production-ready, we need proof that:
- All 16 spans fire with correct attributes under real transaction workloads
- All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values
- Log-trace correlation (Phase 8) produces clickable trace_id links in Loki
- All 10 Grafana dashboards render meaningful data (no empty panels)
- Performance overhead stays within bounds (< 3% CPU, < 5MB memory)
- The telemetry stack survives sustained load without data loss or queue backpressure
## Task 10.1: Multi-Node Test Harness

**Objective:** Create a docker-compose environment with 3-5 validator nodes that produces real consensus rounds.
**What to do:**

- Create `docker/telemetry/docker-compose.workload.yaml`:
  - 5 rippled validator nodes with UNL configured for each other
  - All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel`
  - Full OTel stack: Collector, Jaeger, Tempo, Prometheus, Loki, Grafana
  - Shared network with service discovery
- Each node should:
  - Generate validator keys at startup
  - Configure all 5 nodes in its UNL
  - Enable all trace categories, including `trace_peer=1`
  - Write logs to a file tailed by the OTel Collector filelog receiver
- Include `Makefile` targets: `make telemetry-workload-up` / `make telemetry-workload-down`

**Key files:**

- New: `docker/telemetry/docker-compose.workload.yaml`
- New: `docker/telemetry/workload/generate-validator-keys.sh`
- New: `docker/telemetry/workload/xrpld-validator.cfg.template`
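Before the load generators start, the harness needs to know the cluster has actually converged. A minimal sketch of that decision logic, assuming the surrounding code has already fetched each node's `server_info` result (the `server_state`, `peers`, and `validated_ledger.seq` fields are standard rippled `server_info` output; the polling and transport code is omitted):

```python
# Sketch: decide whether the validator cluster has converged, given one
# server_info "info" object per node. Polling/transport code is omitted.

def cluster_converged(infos, min_peers=2):
    """True when every node is in a healthy state, sees enough peers,
    and all nodes agree (within one sequence) on the validated ledger."""
    if not infos:
        return False
    seqs = set()
    for info in infos:
        if info.get("server_state") not in ("full", "proposing", "validating"):
            return False
        if info.get("peers", 0) < min_peers:
            return False
        ledger = info.get("validated_ledger") or {}
        seqs.add(ledger.get("seq"))
    if None in seqs:  # some node has no validated ledger yet
        return False
    return max(seqs) - min(seqs) <= 1
```

The harness would poll this in a loop with a timeout, failing fast if the cluster never reaches consensus.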
## Task 10.2: RPC Load Generator

**Objective:** A configurable tool that fires all traced RPC commands at controlled rates.
**What to do:**

- Create `docker/telemetry/workload/rpc_load_generator.py`:
  - Connects to one or more rippled WebSocket endpoints
  - Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc.
  - Configurable parameters: rate (RPS), duration, command distribution weights
  - Injects `traceparent` HTTP headers to test W3C context propagation
  - Logs progress and errors to stdout
- Command distribution should match realistic production ratios:
  - 40% `server_info` / `fee` (health checks)
  - 30% `account_info` / `account_lines` / `account_objects` (wallet queries)
  - 15% `ledger` / `ledger_data` (explorer queries)
  - 10% `tx` / `account_tx` (transaction lookups)
  - 5% `book_offers` / `amm_info` (DEX queries)
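The weighted distribution above maps naturally onto `random.choices`. A minimal sketch of the selection logic the generator might use (the group structure and helper name are illustrative, not part of the plan):

```python
# Sketch: weighted RPC command selection matching the distribution above.
# Weights apply per group; commands inside a group are picked uniformly.
import random

COMMAND_GROUPS = [
    (40, ["server_info", "fee"]),                                # health checks
    (30, ["account_info", "account_lines", "account_objects"]),  # wallet queries
    (15, ["ledger", "ledger_data"]),                             # explorer queries
    (10, ["tx", "account_tx"]),                                  # transaction lookups
    (5,  ["book_offers", "amm_info"]),                           # DEX queries
]

def pick_command(rng=random):
    """Pick one RPC command according to the production-like weights."""
    weights = [w for w, _ in COMMAND_GROUPS]
    _, group = rng.choices(COMMAND_GROUPS, weights=weights, k=1)[0]
    return rng.choice(group)
```

Making the weights a constructor parameter (rather than a module constant) would let CI runs skew the mix toward whichever spans need exercising.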
**Key files:**

- New: `docker/telemetry/workload/rpc_load_generator.py`
- New: `docker/telemetry/workload/requirements.txt`
## Task 10.3: Transaction Submitter

**Objective:** Generate diverse transaction types to exercise `tx.*` and `ledger.*` spans.
**What to do:**

- Create `docker/telemetry/workload/tx_submitter.py`:
  - Pre-funds test accounts from the genesis account
  - Submits a mix of transaction types:
    - `Payment` (XRP and issued currencies) — exercises `tx.process`, `tx.apply`
    - `OfferCreate` / `OfferCancel` — DEX activity
    - `TrustSet` — trust line creation for issued currencies
    - `NFTokenMint` / `NFTokenCreateOffer` / `NFTokenAcceptOffer` — NFT activity
    - `EscrowCreate` / `EscrowFinish` — escrow lifecycle
    - `AMMCreate` / `AMMDeposit` / `AMMWithdraw` — AMM pool operations (if the amendment is enabled)
  - Configurable: TPS target, transaction mix weights, duration
  - Monitors submission results and tracks success/failure rates
- The transaction mix ensures the telemetry captures the full range of ledger activity that third parties care about.

**Key files:**

- New: `docker/telemetry/workload/tx_submitter.py`
- New: `docker/telemetry/workload/test_accounts.json` (pre-generated keypairs)
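To hold a TPS target, the submitter's main loop needs pacing that keeps a fixed schedule rather than sleeping a fixed interval after each submission (otherwise slow submissions permanently lower the achieved rate). A small sketch of one way to do this; the `Pacer` class is illustrative, not part of the plan:

```python
# Sketch: fixed-rate pacing for the submitter's main loop. next_at tracks
# an absolute schedule, so a slow submission is followed by a shorter sleep
# instead of silently dropping the achieved TPS.
class Pacer:
    def __init__(self, tps, now=0.0):
        if tps <= 0:
            raise ValueError("tps must be positive")
        self.interval = 1.0 / tps
        self.next_at = now

    def wait_time(self, now):
        """Seconds to sleep before the next submission (0 if already due)."""
        return max(0.0, self.next_at - now)

    def mark_sent(self):
        """Advance the schedule by one submission slot."""
        self.next_at += self.interval
```

The caller sleeps `wait_time(time.monotonic())`, submits one transaction, then calls `mark_sent()`; success/failure counting sits alongside this loop.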
## Task 10.4: Telemetry Validation Suite

**Objective:** Automated scripts that verify all expected telemetry data exists after a workload run.
**What to do:**

- Create `docker/telemetry/workload/validate_telemetry.py`:
  - **Span validation** (queries the Jaeger/Tempo API):
    - Assert all 16 span names appear in traces
    - Assert each span has its required attributes (22 attributes total across spans)
    - Assert parent-child relationships are correct (`rpc.request` → `rpc.process` → `rpc.command.*`)
    - Assert span durations are reasonable (> 0, < 60s)
  - **Metric validation** (queries the Prometheus API):
    - Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket`
    - Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc.
    - Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*`
    - Assert metric label cardinality is within bounds
  - **Log-trace correlation validation** (queries the Loki API):
    - Assert logs contain `trace_id=` and `span_id=` fields
    - Pick a random trace_id from Jaeger → query Loki for matching logs → assert results exist
    - Assert Grafana derived-field links are functional
  - **Dashboard validation:**
    - For each of the 10 Grafana dashboards, query the dashboard API and assert no panels show "No data"
- Output: a JSON report with pass/fail per check, suitable for CI.
**Key files:**

- New: `docker/telemetry/workload/validate_telemetry.py`
- New: `docker/telemetry/workload/expected_spans.json` (span inventory for validation)
- New: `docker/telemetry/workload/expected_metrics.json` (metric inventory for validation)
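The parent-child check is the subtlest of the span assertions. A sketch of how it might look once the spans of one trace have been fetched from Jaeger's HTTP query API (`GET /api/traces?service=...`); the `spanID`, `operationName`, and `references` fields follow Jaeger's JSON trace format, while the function names are illustrative:

```python
# Sketch: verify the expected rpc.request -> rpc.process -> rpc.command.*
# parent-child chain inside one trace already fetched from Jaeger.

def parent_of(span):
    """Return the parent spanID from a CHILD_OF reference, or None."""
    for ref in span.get("references", []):
        if ref.get("refType") == "CHILD_OF":
            return ref.get("spanID")
    return None

def check_rpc_chain(spans):
    """Return a list of violation strings (empty list means the trace is OK)."""
    by_id = {s["spanID"]: s for s in spans}
    errors = []
    for span in spans:
        name = span["operationName"]
        parent = by_id.get(parent_of(span))
        parent_name = parent["operationName"] if parent else None
        if name == "rpc.process" and parent_name != "rpc.request":
            errors.append(f"rpc.process parented by {parent_name!r}")
        elif name.startswith("rpc.command.") and parent_name != "rpc.process":
            errors.append(f"{name} parented by {parent_name!r}")
    return errors
```

Returning violation strings instead of raising keeps the suite running to the end, so the JSON report can list every failure from one workload run.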
## Task 10.5: Performance Benchmark Suite

**Objective:** Measure the CPU/memory/latency overhead of the telemetry stack.
**What to do:**

- Create `docker/telemetry/workload/benchmark.sh`:
  - Baseline run: start the cluster with `[telemetry] enabled=0`, run the transaction workload for 5 minutes, record metrics
  - Telemetry run: start the cluster with full telemetry enabled, run the identical workload, record metrics
  - Comparison: calculate deltas for:
    - CPU usage (per-node average)
    - Memory RSS (per-node peak)
    - RPC p99 latency
    - Transaction throughput (TPS)
    - Consensus round time p95
    - Ledger close time p95
- Output: a Markdown table comparing baseline vs. telemetry, with pass/fail against targets:
  - CPU overhead < 3%
  - Memory overhead < 5 MB
  - RPC latency impact < 2 ms p99
  - Throughput impact < 5%
  - Consensus impact < 1%
- Store results in `docker/telemetry/workload/benchmark-results/` for historical tracking.
**Key files:**

- New: `docker/telemetry/workload/benchmark.sh`
- New: `docker/telemetry/workload/collect_system_metrics.sh`
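The comparison step mixes absolute targets (CPU points, MB, ms) with relative ones (throughput and consensus impact in percent). A sketch of the delta logic, assuming the collection script has already reduced each run to one number per metric; the metric names and report shape are illustrative:

```python
# Sketch: compute baseline-vs-telemetry deltas and check each against the
# targets above. Inputs are {metric: value} dicts, one per run.

TARGETS = {
    "cpu_pct": 3.0,       # absolute CPU percentage points added
    "memory_mb": 5.0,     # absolute MB of RSS added
    "rpc_p99_ms": 2.0,    # absolute ms added to p99 RPC latency
    "tps": 5.0,           # relative % throughput loss
    "consensus_ms": 1.0,  # relative % consensus-time increase
}
RELATIVE = {"tps", "consensus_ms"}

def compare(baseline, telemetry):
    """Return {metric: (delta, passed)}; deltas are telemetry - baseline,
    except the RELATIVE metrics, which become percent overheads."""
    results = {}
    for metric, limit in TARGETS.items():
        delta = telemetry[metric] - baseline[metric]
        if metric == "tps":
            delta = -delta  # throughput going DOWN is the overhead
        if metric in RELATIVE:
            delta = 100.0 * delta / baseline[metric]
        results[metric] = (round(delta, 3), delta <= limit)
    return results
```

`benchmark.sh` would feed this from the two recorded runs and render the result as the Markdown pass/fail table.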
## Task 10.6: CI Integration

**Objective:** Wire the validation suite into CI for regression detection.
**What to do:**

- Create a CI workflow (GitHub Actions or equivalent) that:
  - Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON`
  - Starts the multi-node workload harness
  - Runs the RPC load generator + transaction submitter for 2 minutes
  - Runs the validation suite
  - Runs the benchmark suite
  - Fails the build if any validation check fails or a benchmark exceeds its threshold
  - Archives the validation report and benchmark results as artifacts
- This should be a separate workflow (not part of the main CI), triggered manually or on telemetry-related branch changes.
**Key files:**

- New: `.github/workflows/telemetry-validation.yml`
- New: `docker/telemetry/workload/run-full-validation.sh` (orchestrator script)
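For CI to fail the build, the orchestrator only needs to translate the JSON report into an exit code. A minimal sketch of that gate; the report shape (`{"checks": [{"name": ..., "passed": ...}]}`) is an assumption about what `validate_telemetry.py` will emit:

```python
# Sketch: CI gate at the end of run-full-validation.sh. Reads the JSON
# report from the validation suite and returns non-zero on any failure.
import json
import sys

def gate(report_path):
    """Print failed check names to stderr; return 1 if any failed, else 0."""
    with open(report_path) as f:
        report = json.load(f)
    failed = [c["name"] for c in report["checks"] if not c["passed"]]
    for name in failed:
        print(f"FAIL: {name}", file=sys.stderr)
    return 1 if failed else 0

# Entry point in the real script would be: sys.exit(gate(sys.argv[1]))
```

Keeping the gate this dumb means the pass/fail policy lives entirely in the validation suite, and the CI step stays a one-liner.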
## Task 10.7: Documentation

**Objective:** Document the workload tools and the validation process.
**What to do:**

- Create `docker/telemetry/workload/README.md`:
  - Quick-start guide for running the workload harness
  - Configuration options for the load generator and tx submitter
  - How to read validation reports
  - How to run benchmarks and interpret results
- Update `docs/telemetry-runbook.md`:
  - Add a "Validating Telemetry Stack" section
  - Add a "Performance Benchmarking" section
- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Add a "Validation" section with expected metric/span counts
## Effort Summary
| Task | Description | Effort | Risk |
|---|---|---|---|
| 10.1 | Multi-node test harness | 2d | Medium |
| 10.2 | RPC load generator | 1d | Low |
| 10.3 | Transaction submitter | 2d | Medium |
| 10.4 | Telemetry validation suite | 2d | Medium |
| 10.5 | Performance benchmark suite | 1.5d | Low |
| 10.6 | CI integration | 1d | Medium |
| 10.7 | Documentation | 0.5d | Low |
**Total Effort:** 10 days
## Exit Criteria
- 5-node validator cluster starts and reaches consensus in docker-compose
- RPC load generator fires all traced RPC commands at configurable rates
- Transaction submitter generates 6+ transaction types at a configurable TPS
- Validation suite confirms all 16 spans, 22 attributes, and 300+ metrics are present
- Log-trace correlation validated end-to-end (Loki ↔ Tempo)
- All 10 Grafana dashboards render data (no empty panels)
- Benchmark shows < 3% CPU overhead and < 5 MB memory overhead
- CI workflow runs validation on telemetry branch changes
- Validation report output is CI-parseable (JSON with exit codes)