# Phase 10: Synthetic Workload Generation & Telemetry Validation — Task List

**Status:** Future Enhancement

**Goal:** Build tools that generate realistic XRPL traffic to validate the full Phases 1-9 telemetry stack end-to-end — all spans, attributes, metrics, dashboards, and log-trace correlation — under controlled load.

**Scope:** Python/shell test harness + multi-node docker-compose environment + automated validation scripts + performance benchmarks.

**Branch:** `pratik/otel-phase10-workload-validation` (from `pratik/otel-phase9-metric-gap-fill`)

**Depends on:** Phase 9 (internal metric gap fill) — validates the full metric surface
## Related Plan Documents
| Document | Relevance |
|---|---|
| `06-implementation-phases.md` | Phase 10 plan: motivation, architecture, exit criteria (§6.8.3) |
| `09-data-collection-reference.md` | Defines the full inventory of spans/metrics to validate |
| `Phase9_taskList.md` | Prerequisite — all internal metrics must be emitting |
## Why This Phase Exists

Before Phases 1-9 can be considered production-ready, we need proof that:
- All 16 spans fire with correct attributes under real transaction workloads
- All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values
- Log-trace correlation (Phase 8) produces clickable trace_id links in Loki
- All 10 Grafana dashboards render meaningful data (no empty panels)
- Performance overhead stays within bounds (< 3% CPU, < 5MB memory)
- The telemetry stack survives sustained load without data loss or queue backpressure
## Task 10.1: Multi-Node Test Harness

**Objective:** Create a docker-compose environment with 3-5 validator nodes that produces real consensus rounds.
**What to do:**

- Create `docker/telemetry/docker-compose.workload.yaml`:
  - 5 rippled validator nodes with UNL configured for each other
  - All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel`
  - Full OTel stack: Collector, Jaeger, Tempo, Prometheus, Loki, Grafana
  - Shared network with service discovery
- Each node should:
  - Generate validator keys at startup
  - Configure all 5 nodes in its UNL
  - Enable all trace categories, including `trace_peer=1`
  - Write logs to a file tailed by the OTel Collector filelog receiver
- Include `Makefile` targets: `make telemetry-workload-up` / `make telemetry-workload-down`

**Key files:**

- New: `docker/telemetry/docker-compose.workload.yaml`
- New: `docker/telemetry/workload/generate-validator-keys.sh`
- New: `docker/telemetry/workload/xrpld-validator.cfg.template`
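Before the load generators start, the harness needs to know the cluster has actually converged. A minimal sketch of that decision logic, assuming the surrounding code has already fetched each node's `server_info` result (the `server_state`, `peers`, and `validated_ledger.seq` fields are standard rippled `server_info` output; the polling and transport code is omitted):

```python
# Sketch: decide whether the validator cluster has converged, given one
# server_info "info" object per node. Polling/transport code is omitted.

def cluster_converged(infos, min_peers=2):
    """True when every node is in a healthy state, sees enough peers,
    and all nodes agree (within one sequence) on the validated ledger."""
    if not infos:
        return False
    seqs = set()
    for info in infos:
        if info.get("server_state") not in ("full", "proposing", "validating"):
            return False
        if info.get("peers", 0) < min_peers:
            return False
        ledger = info.get("validated_ledger") or {}
        seqs.add(ledger.get("seq"))
    if None in seqs:  # some node has no validated ledger yet
        return False
    return max(seqs) - min(seqs) <= 1
```

The harness would poll this in a loop with a timeout, failing fast if the cluster never reaches consensus.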
## Task 10.2: RPC Load Generator

**Objective:** A configurable tool that fires all traced RPC commands at controlled rates.
**What to do:**

- Create `docker/telemetry/workload/rpc_load_generator.py`:
  - Connects to one or more rippled WebSocket endpoints
  - Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc.
  - Configurable parameters: rate (RPS), duration, command distribution weights
  - Injects `traceparent` HTTP headers to test W3C context propagation
  - Logs progress and errors to stdout
- Command distribution should match realistic production ratios:
  - 40% `server_info` / `fee` (health checks)
  - 30% `account_info` / `account_lines` / `account_objects` (wallet queries)
  - 15% `ledger` / `ledger_data` (explorer queries)
  - 10% `tx` / `account_tx` (transaction lookups)
  - 5% `book_offers` / `amm_info` (DEX queries)
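The weighted distribution above maps naturally onto `random.choices`. A minimal sketch of the selection logic the generator might use (the group structure and helper name are illustrative, not part of the plan):

```python
# Sketch: weighted RPC command selection matching the distribution above.
# Weights apply per group; commands inside a group are picked uniformly.
import random

COMMAND_GROUPS = [
    (40, ["server_info", "fee"]),                                # health checks
    (30, ["account_info", "account_lines", "account_objects"]),  # wallet queries
    (15, ["ledger", "ledger_data"]),                             # explorer queries
    (10, ["tx", "account_tx"]),                                  # transaction lookups
    (5,  ["book_offers", "amm_info"]),                           # DEX queries
]

def pick_command(rng=random):
    """Pick one RPC command according to the production-like weights."""
    weights = [w for w, _ in COMMAND_GROUPS]
    _, group = rng.choices(COMMAND_GROUPS, weights=weights, k=1)[0]
    return rng.choice(group)
```

Making the weights a constructor parameter (rather than a module constant) would let CI runs skew the mix toward whichever spans need exercising.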
**Key files:**

- New: `docker/telemetry/workload/rpc_load_generator.py`
- New: `docker/telemetry/workload/requirements.txt`
## Task 10.3: Transaction Submitter

**Objective:** Generate diverse transaction types to exercise `tx.*` and `ledger.*` spans.
**What to do:**

- Create `docker/telemetry/workload/tx_submitter.py`:
  - Pre-funds test accounts from the genesis account
  - Submits a mix of transaction types:
    - `Payment` (XRP and issued currencies) — exercises `tx.process`, `tx.apply`
    - `OfferCreate` / `OfferCancel` — DEX activity
    - `TrustSet` — trust line creation for issued currencies
    - `NFTokenMint` / `NFTokenCreateOffer` / `NFTokenAcceptOffer` — NFT activity
    - `EscrowCreate` / `EscrowFinish` — escrow lifecycle
    - `AMMCreate` / `AMMDeposit` / `AMMWithdraw` — AMM pool operations (if the amendment is enabled)
  - Configurable: TPS target, transaction mix weights, duration
  - Monitors submission results and tracks success/failure rates
- The transaction mix ensures the telemetry captures the full range of ledger activity that third parties care about.

**Key files:**

- New: `docker/telemetry/workload/tx_submitter.py`
- New: `docker/telemetry/workload/test_accounts.json` (pre-generated keypairs)
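To hold a TPS target, the submitter's main loop needs pacing that keeps a fixed schedule rather than sleeping a fixed interval after each submission (otherwise slow submissions permanently lower the achieved rate). A small sketch of one way to do this; the `Pacer` class is illustrative, not part of the plan:

```python
# Sketch: fixed-rate pacing for the submitter's main loop. next_at tracks
# an absolute schedule, so a slow submission is followed by a shorter sleep
# instead of silently dropping the achieved TPS.
class Pacer:
    def __init__(self, tps, now=0.0):
        if tps <= 0:
            raise ValueError("tps must be positive")
        self.interval = 1.0 / tps
        self.next_at = now

    def wait_time(self, now):
        """Seconds to sleep before the next submission (0 if already due)."""
        return max(0.0, self.next_at - now)

    def mark_sent(self):
        """Advance the schedule by one submission slot."""
        self.next_at += self.interval
```

The caller sleeps `wait_time(time.monotonic())`, submits one transaction, then calls `mark_sent()`; success/failure counting sits alongside this loop.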
## Task 10.4: Telemetry Validation Suite

**Objective:** Automated scripts that verify all expected telemetry data exists after a workload run.
**What to do:**

- Create `docker/telemetry/workload/validate_telemetry.py`:
  - **Span validation** (queries the Jaeger/Tempo API):
    - Assert all 16 span names appear in traces
    - Assert each span has its required attributes (22 attributes total across spans)
    - Assert parent-child relationships are correct (`rpc.request` → `rpc.process` → `rpc.command.*`)
    - Assert span durations are reasonable (> 0, < 60s)
  - **Metric validation** (queries the Prometheus API):
    - Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket`
    - Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc.
    - Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*`
    - Assert metric label cardinality is within bounds
  - **Log-trace correlation validation** (queries the Loki API):
    - Assert logs contain `trace_id=` and `span_id=` fields
    - Pick a random trace_id from Jaeger → query Loki for matching logs → assert results exist
    - Assert Grafana derived-field links are functional
  - **Dashboard validation:**
    - For each of the 10 Grafana dashboards, query the dashboard API and assert no panels show "No data"
- Output: a JSON report with pass/fail per check, suitable for CI.
**Key files:**

- New: `docker/telemetry/workload/validate_telemetry.py`
- New: `docker/telemetry/workload/expected_spans.json` (span inventory for validation)
- New: `docker/telemetry/workload/expected_metrics.json` (metric inventory for validation)
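The parent-child check is the subtlest of the span assertions. A sketch of how it might look once the spans of one trace have been fetched from Jaeger's HTTP query API (`GET /api/traces?service=...`); the `spanID`, `operationName`, and `references` fields follow Jaeger's JSON trace format, while the function names are illustrative:

```python
# Sketch: verify the expected rpc.request -> rpc.process -> rpc.command.*
# parent-child chain inside one trace already fetched from Jaeger.

def parent_of(span):
    """Return the parent spanID from a CHILD_OF reference, or None."""
    for ref in span.get("references", []):
        if ref.get("refType") == "CHILD_OF":
            return ref.get("spanID")
    return None

def check_rpc_chain(spans):
    """Return a list of violation strings (empty list means the trace is OK)."""
    by_id = {s["spanID"]: s for s in spans}
    errors = []
    for span in spans:
        name = span["operationName"]
        parent = by_id.get(parent_of(span))
        parent_name = parent["operationName"] if parent else None
        if name == "rpc.process" and parent_name != "rpc.request":
            errors.append(f"rpc.process parented by {parent_name!r}")
        elif name.startswith("rpc.command.") and parent_name != "rpc.process":
            errors.append(f"{name} parented by {parent_name!r}")
    return errors
```

Returning violation strings instead of raising keeps the suite running to the end, so the JSON report can list every failure from one workload run.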
## Task 10.5: Performance Benchmark Suite

**Objective:** Measure the CPU/memory/latency overhead of the telemetry stack.
**What to do:**

- Create `docker/telemetry/workload/benchmark.sh`:
  - Baseline run: start the cluster with `[telemetry] enabled=0`, run the transaction workload for 5 minutes, record metrics
  - Telemetry run: start the cluster with full telemetry enabled, run the identical workload, record metrics
  - Comparison: calculate deltas for:
    - CPU usage (per-node average)
    - Memory RSS (per-node peak)
    - RPC p99 latency
    - Transaction throughput (TPS)
    - Consensus round time p95
    - Ledger close time p95
- Output: a Markdown table comparing baseline vs. telemetry, with pass/fail against targets:
  - CPU overhead < 3%
  - Memory overhead < 5 MB
  - RPC latency impact < 2 ms p99
  - Throughput impact < 5%
  - Consensus impact < 1%
- Store results in `docker/telemetry/workload/benchmark-results/` for historical tracking.
**Key files:**

- New: `docker/telemetry/workload/benchmark.sh`
- New: `docker/telemetry/workload/collect_system_metrics.sh`
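The comparison step mixes absolute targets (CPU points, MB, ms) with relative ones (throughput and consensus impact in percent). A sketch of the delta logic, assuming the collection script has already reduced each run to one number per metric; the metric names and report shape are illustrative:

```python
# Sketch: compute baseline-vs-telemetry deltas and check each against the
# targets above. Inputs are {metric: value} dicts, one per run.

TARGETS = {
    "cpu_pct": 3.0,       # absolute CPU percentage points added
    "memory_mb": 5.0,     # absolute MB of RSS added
    "rpc_p99_ms": 2.0,    # absolute ms added to p99 RPC latency
    "tps": 5.0,           # relative % throughput loss
    "consensus_ms": 1.0,  # relative % consensus-time increase
}
RELATIVE = {"tps", "consensus_ms"}

def compare(baseline, telemetry):
    """Return {metric: (delta, passed)}; deltas are telemetry - baseline,
    except the RELATIVE metrics, which become percent overheads."""
    results = {}
    for metric, limit in TARGETS.items():
        delta = telemetry[metric] - baseline[metric]
        if metric == "tps":
            delta = -delta  # throughput going DOWN is the overhead
        if metric in RELATIVE:
            delta = 100.0 * delta / baseline[metric]
        results[metric] = (round(delta, 3), delta <= limit)
    return results
```

`benchmark.sh` would feed this from the two recorded runs and render the result as the Markdown pass/fail table.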
## Task 10.6: CI Integration

**Objective:** Wire the validation suite into CI for regression detection.
**What to do:**

- Create a CI workflow (GitHub Actions or equivalent) that:
  - Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON`
  - Starts the multi-node workload harness
  - Runs the RPC load generator + transaction submitter for 2 minutes
  - Runs the validation suite
  - Runs the benchmark suite
  - Fails the build if any validation check fails or a benchmark exceeds its threshold
  - Archives the validation report and benchmark results as artifacts
- This should be a separate workflow (not part of the main CI), triggered manually or on telemetry-related branch changes.
**Key files:**

- New: `.github/workflows/telemetry-validation.yml`
- New: `docker/telemetry/workload/run-full-validation.sh` (orchestrator script)
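For CI to fail the build, the orchestrator only needs to translate the JSON report into an exit code. A minimal sketch of that gate; the report shape (`{"checks": [{"name": ..., "passed": ...}]}`) is an assumption about what `validate_telemetry.py` will emit:

```python
# Sketch: CI gate at the end of run-full-validation.sh. Reads the JSON
# report from the validation suite and returns non-zero on any failure.
import json
import sys

def gate(report_path):
    """Print failed check names to stderr; return 1 if any failed, else 0."""
    with open(report_path) as f:
        report = json.load(f)
    failed = [c["name"] for c in report["checks"] if not c["passed"]]
    for name in failed:
        print(f"FAIL: {name}", file=sys.stderr)
    return 1 if failed else 0

# Entry point in the real script would be: sys.exit(gate(sys.argv[1]))
```

Keeping the gate this dumb means the pass/fail policy lives entirely in the validation suite, and the CI step stays a one-liner.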
## Task 10.7: Documentation

**Objective:** Document the workload tools and the validation process.
**What to do:**

- Create `docker/telemetry/workload/README.md`:
  - Quick-start guide for running the workload harness
  - Configuration options for the load generator and tx submitter
  - How to read validation reports
  - How to run benchmarks and interpret results
- Update `docs/telemetry-runbook.md`:
  - Add a "Validating Telemetry Stack" section
  - Add a "Performance Benchmarking" section
- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Add a "Validation" section with expected metric/span counts
## Effort Summary
| Task | Description | Effort | Risk |
|---|---|---|---|
| 10.1 | Multi-node test harness | 2d | Medium |
| 10.2 | RPC load generator | 1d | Low |
| 10.3 | Transaction submitter | 2d | Medium |
| 10.4 | Telemetry validation suite | 2d | Medium |
| 10.5 | Performance benchmark suite | 1.5d | Low |
| 10.6 | CI integration | 1d | Medium |
| 10.7 | Documentation | 0.5d | Low |
**Total Effort:** 10 days
## Exit Criteria
- 5-node validator cluster starts and reaches consensus in docker-compose
- RPC load generator fires all traced RPC commands at configurable rates
- Transaction submitter generates 6+ transaction types at a configurable TPS
- Validation suite confirms all 16 spans, 22 attributes, and 300+ metrics are present
- Log-trace correlation validated end-to-end (Loki ↔ Tempo)
- All 10 Grafana dashboards render data (no empty panels)
- Benchmark shows < 3% CPU overhead and < 5 MB memory overhead
- CI workflow runs validation on telemetry branch changes
- Validation report output is CI-parseable (JSON with exit codes)