Pratik Mankawde
787b496484
Phase 10: Synthetic workload generation and telemetry validation tools
Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:
Task 10.1 — Multi-node test harness:
- docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
Tempo, Prometheus, Loki, Grafana)
- generate-validator-keys.sh for automated key generation
- xrpld-validator.cfg.template for node configuration
Task 10.2 — RPC load generator:
- rpc_load_generator.py with WebSocket client, configurable rates,
realistic command distribution (40% health, 30% wallet, 15% explorer,
10% tx lookups, 5% DEX), W3C traceparent injection
Task 10.3 — Transaction submitter:
- tx_submitter.py with 10 transaction types (Payment, OfferCreate,
OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts
Task 10.4 — Telemetry validation suite:
- validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
log-trace correlation (Loki), dashboards (Grafana)
- expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
- expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)
Task 10.5 — Performance benchmark suite:
- benchmark.sh for baseline vs telemetry comparison
- collect_system_metrics.sh for CPU/memory/latency sampling
- Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus
Task 10.6 — CI integration:
- telemetry-validation.yml GitHub Actions workflow
- run-full-validation.sh orchestrator script
- Manual trigger + telemetry branch auto-trigger
Task 10.7 — Documentation:
- workload/README.md with quick start and tool reference
- Updated telemetry-runbook.md with validation and benchmark sections
- Updated 09-data-collection-reference.md with validation inventory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
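A minimal sketch of the two ideas named in Task 10.2 above: drawing commands according to the stated 40/30/15/10/5 distribution, and building a W3C `traceparent` value to attach to each request. The function names and category keys are hypothetical and are not taken from rpc_load_generator.py; the commit does not say how the header is carried over the WebSocket connection, so that part is left out.

```python
import random
import secrets

# Category weights taken from the commit message (percent of requests).
# The key names are illustrative, not the script's actual identifiers.
COMMAND_WEIGHTS = {
    "health": 40,
    "wallet": 30,
    "explorer": 15,
    "tx_lookup": 10,
    "dex": 5,
}


def pick_category(rng: random.Random) -> str:
    """Draw one command category according to COMMAND_WEIGHTS."""
    categories = list(COMMAND_WEIGHTS)
    weights = list(COMMAND_WEIGHTS.values())
    return rng.choices(categories, weights=weights, k=1)[0]


def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context traceparent value (version 00).

    Layout: version-traceid-parentid-flags, all lowercase hex.
    All-zero IDs are invalid per the spec; with random bytes the
    chance of generating one is negligible, so it is not checked here.
    """
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)  # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"
```

Passing a seeded `random.Random` instance to `pick_category` keeps a load run reproducible, which may help when comparing telemetry output between runs.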
Pratik Mankawde
9289cb671d
Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.
Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.
Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.
Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.
All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
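The split described above between synchronous instruments (updated inline on each RPC or job event) and ObservableGauge callbacks (polled for state such as cache hit rates) can be illustrated with a toy registry. This is a pure-Python stand-in written for illustration only: it does not use the OTel SDK, the class and method names are assumptions, and the real MetricsRegistry is C++ behind `#ifdef XRPL_ENABLE_TELEMETRY`. Histograms are omitted for brevity.

```python
from typing import Callable, Dict


class MetricsRegistry:
    """Toy registry: counters are pushed to, gauges are polled."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self._counters: Dict[str, int] = {}
        self._gauges: Dict[str, Callable[[], float]] = {}

    def add(self, name: str, value: int = 1) -> None:
        """Synchronous instrument: updated inline when an event happens."""
        if not self.enabled:
            return  # disabled path is a no-op
        self._counters[name] = self._counters.get(name, 0) + value

    def observe(self, name: str, callback: Callable[[], float]) -> None:
        """Observable gauge: register a callback read at collection time."""
        if not self.enabled:
            return  # disabled path is a no-op
        self._gauges[name] = callback

    def collect(self) -> Dict[str, float]:
        """Snapshot the counters and poll every gauge callback once."""
        snapshot: Dict[str, float] = dict(self._counters)
        for name, callback in self._gauges.items():
            snapshot[name] = callback()
        return snapshot
```

With this shape, a caller pays for a counter update on every event, while gauge state is only read when `collect()` runs; a registry constructed with `enabled=False` records nothing and returns an empty snapshot.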
Pratik Mankawde
b73592f934
Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines
- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
- MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
- Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
- Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, gantt, effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
2573e956f1
Phase 8: Update documentation for log-trace correlation
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
503d3f7d48
Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan
(motivation, architecture, Mermaid diagrams, tasks table, exit criteria)
- Add Phase8_taskList.md with per-task breakdown (8.1-8.6)
- Add §5a log-trace correlation section to 09-data-collection-reference.md
- Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases
- Add Phases 6-8 to Gantt chart in 06-implementation-phases.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
7d51436d26
Phase 7: Native OTel metrics migration (Tasks 7.1-7.8)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.
- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
reference docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
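Task 7.5 above says metric names are preserved by a dot-to-underscore conversion. A minimal sketch of that rule, assuming it is a plain character substitution; the function name is hypothetical, and any further sanitising the real code may do is not shown in the commit.

```python
def format_metric_name(name: str) -> str:
    """Replace every dot with an underscore (assumed rule from Task 7.5)."""
    return name.replace(".", "_")
```

Under this assumption a dotted name such as `a.b.c` (an illustrative name, not a real metric) becomes `a_b_c`, so dashboards written against the earlier StatsD-derived names would keep resolving.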
Pratik Mankawde
702cf63c62
Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content
- Move Phase 7 motivation (gains/losses/decision) and architecture (class
hierarchy, data flow diagram, config) from Phase7_taskList.md into
06-implementation-phases.md §6.8
- Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table)
- Remove Phase8_taskList.md — belongs on Phase 8 branch
- Remove §6.8.1 (Phase 8) from 06-implementation-phases.md
- Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md
- Remove Phase 8 row from OpenTelemetryPlan.md phase table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
85a2220312
Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).
Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
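The Phase 8 idea above, where the log formatter reads the active span context from thread-local storage and stamps it onto each line, might look roughly like this. It is a sketch under assumptions: the function names and the `trace_id=... span_id=...` suffix layout are invented for illustration, and the actual Logs::format() output format is not given in the commit.

```python
import threading
from typing import Optional, Tuple

# Thread-local slot holding the active (trace_id, span_id), if any.
_active = threading.local()


def set_span_context(ctx: Optional[Tuple[str, str]]) -> None:
    """Record (or clear, with None) the span context for this thread."""
    _active.ctx = ctx


def format_log(message: str) -> str:
    """Append trace_id/span_id when this thread has an active span."""
    ctx = getattr(_active, "ctx", None)
    if ctx is None:
        return message  # no active span: leave the line unchanged
    trace_id, span_id = ctx
    return f"{message} trace_id={trace_id} span_id={span_id}"
```

Because the context lives in thread-local storage, a line logged from another thread without an active span is left untouched, which keeps unrelated log output free of stale IDs.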
Pratik Mankawde
a8c2f94e8a
Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
f1025d4f71
Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
64d8369dbc
Add consensus.accept.apply span to data collection reference
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
4dcd65968f
document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2bea046dab
Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00