diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index bdedabf711..e0ec014ebf 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -51,6 +51,15 @@ gantt section Phase 8 Log-Trace Correlation :p8, after p7, 1w + + section Phase 9 (Future) + Internal Metric Gap Fill :p9, after p8, 2.5w + + section Phase 10 (Future) + Workload Validation :p10, after p9, 2w + + section Phase 11 (Future) + Third-Party Collection :p11, after p10, 3w ``` --- @@ -649,6 +658,272 @@ flowchart LR --- +## 6.8.2 Phase 9: Internal Metric Instrumentation Gap Fill (Weeks 14-15) — Future Enhancement + +> **Status**: Planned, not yet implemented. + +### Motivation + +Phases 1-8 establish trace spans, a StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, more than 50 metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most: NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts. + +### Architecture + +A hybrid approach uses two instrumentation strategies, chosen by proximity to existing code: + +```mermaid +flowchart TB + subgraph rippled["rippled process"] + subgraph existing["Existing beast::insight registrations"] + NS["NodeStore I/O
(Database.cpp)"] + end + subgraph newreg["New OTel MetricsRegistry"] + CR["Cache Hit Rates
(async gauge callbacks)"] + TQ["TxQ Metrics
(async gauge callbacks)"] + PL["PerfLog RPC/Job
(counters + histograms)"] + CO["CountedObjects
(async gauge callbacks)"] + LF["Load Factors
(async gauge callbacks)"] + end + end + + subgraph export["Export Pipelines"] + BI["beast::insight
OTelCollector (Phase 7)"] + OS["OTel Metrics SDK
PeriodicMetricReader"] + end + + NS --> BI + CR --> OS + TQ --> OS + PL --> OS + CO --> OS + LF --> OS + + BI --> OTLP["OTLP/HTTP :4318
/v1/metrics"] + OS --> OTLP + + style rippled fill:#1a2633,color:#ccc,stroke:#4a90d9 + style existing fill:#2a4a6b,color:#fff,stroke:#4a90d9 + style newreg fill:#2a4a6b,color:#fff,stroke:#4a90d9 + style export fill:#1a3320,color:#ccc,stroke:#5cb85c + style NS fill:#4a90d9,color:#fff,stroke:#2a6db5 + style CR fill:#5cb85c,color:#fff,stroke:#3d8b3d + style TQ fill:#5cb85c,color:#fff,stroke:#3d8b3d + style PL fill:#5cb85c,color:#fff,stroke:#3d8b3d + style CO fill:#5cb85c,color:#fff,stroke:#3d8b3d + style LF fill:#5cb85c,color:#fff,stroke:#3d8b3d + style BI fill:#449d44,color:#fff,stroke:#2d6e2d + style OS fill:#449d44,color:#fff,stroke:#2d6e2d + style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e +``` + +- **beast::insight extensions** (blue): NodeStore I/O metrics added near existing `Database.cpp` registrations — exported via Phase 7's `OTelCollector`. +- **OTel MetricsRegistry** (green): New centralized class using `ObservableGauge` async callbacks for cache, TxQ, PerfLog, CountedObjects, and load factors — polled at 10s intervals by `PeriodicMetricReader`. 
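The async-gauge pattern described in the two bullets above can be sketched in a few lines. This is a minimal, stdlib-only Python illustration of the idea, not the planned C++ `MetricsRegistry`: the method names (`register_gauge`, `deregister_gauge`, `collect`) and the `CacheStats` helper are hypothetical, and `collect()` only stands in for what a periodic reader would do on each polling tick.

```python
from typing import Callable, Dict


class MetricsRegistry:
    """Illustrative stand-in for a central registry of async gauge callbacks.

    All names here are hypothetical; the real class is planned in C++.
    """

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register_gauge(self, name: str, callback: Callable[[], float]) -> None:
        # Refuse duplicates so two subsystems cannot silently share a name.
        if name in self._gauges:
            raise ValueError(f"gauge already registered: {name}")
        self._gauges[name] = callback

    def deregister_gauge(self, name: str) -> None:
        # Removing an unknown name is a no-op, which keeps shutdown simple.
        self._gauges.pop(name, None)

    def collect(self) -> Dict[str, float]:
        # One polling tick: invoke every callback once and snapshot the values.
        return {name: callback() for name, callback in self._gauges.items()}


class CacheStats:
    """Hypothetical cache counters that already exist in the process."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


registry = MetricsRegistry()
sle_cache = CacheStats()
registry.register_gauge("rippled_cache_SLE_hit_rate", sle_cache.hit_rate)

# The instrumented code only updates its own counters; it never pushes metrics.
sle_cache.hits, sle_cache.misses = 90, 10
snapshot = registry.collect()
print(snapshot)  # {'rippled_cache_SLE_hit_rate': 0.9}
```

The point of the callback style is that the value is computed only when the reader asks for it, so the hot path pays nothing beyond the counters it already maintains.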
+ +### Third-Party Consumer Context + +| Consumer Category | Key Metrics They Need From Phase 9 | +| ---------------------- | --------------------------------------------------------------- | +| Exchanges | Fee escalation levels, TxQ depth, settlement latency | +| Payment Processors | Load factors, io_latency, transaction throughput | +| Analytics Providers | NodeStore I/O, cache hit rates, counted objects | +| Validators / Operators | Per-job execution times, PerfLog RPC counters, consensus timing | +| Academic Researchers | Consensus performance time-series, fee market dynamics | +| Institutional Custody | Server health scores, reserve calculations, node availability | + +### Tasks + +| Task | Description | Effort | Risk | +| ---- | ----------------------------------------- | ------ | ------ | +| 9.1 | NodeStore I/O metrics | 1d | Low | +| 9.2 | Cache hit rate metrics + MetricsRegistry | 2d | Medium | +| 9.3 | TxQ metrics | 1d | Low | +| 9.4 | PerfLog per-RPC metrics | 1.5d | Medium | +| 9.5 | PerfLog per-job metrics | 1d | Low | +| 9.6 | Counted object instance metrics | 0.5d | Low | +| 9.7 | Fee escalation & load factor metrics | 0.5d | Low | +| 9.8 | New Grafana dashboards (2 new, 2 updated) | 2d | Low | +| 9.9 | Update documentation | 1d | Low | +| 9.10 | Integration tests | 1.5d | Medium | + +**Total Effort**: 12 days + +See [Phase9_taskList.md](./Phase9_taskList.md) for detailed per-task breakdown. + +### Exit Criteria + +- [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline +- [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK +- [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue) +- [ ] No performance regression (< 0.5% CPU overhead from new callbacks) +- [ ] Documentation updated with full new metric inventory + +--- + +## 6.8.3 Phase 10: Synthetic Workload Generation & Telemetry Validation (Weeks 16-17) — Future Enhancement + +> **Status**: Planned, not yet implemented. 
+ +### Motivation + +Before the telemetry stack (Phases 1-9) can be considered production-ready, we need automated proof that all 16 spans, 22 attributes, 300+ metrics, 10 Grafana dashboards, and log-trace correlation work correctly under realistic load. This phase establishes a reusable CI-integrated validation suite and performance benchmark baseline. + +### Architecture + +```mermaid +flowchart LR + subgraph harness["Docker Compose Workload Harness"] + direction TB + V1["Validator 1"] ~~~ V2["Validator 2"] ~~~ V3["Validator 3"] + V4["Validator 4"] ~~~ V5["Validator 5"] + end + + subgraph generators["Workload Generators"] + RPC["RPC Load Generator
(configurable RPS,
command distribution)"] + TX["Transaction Submitter
(Payment, Offer, NFT,
Escrow, AMM mix)"] + end + + subgraph validation["Validation Suite"] + SV["Span Validator
(Jaeger/Tempo API)"] + MV["Metric Validator
(Prometheus API)"] + LV["Log-Trace Validator
(Loki API)"] + DV["Dashboard Validator
(Grafana API)"] + BM["Benchmark Suite
(CPU, memory, latency
ON vs OFF comparison)"] + end + + generators --> harness + harness --> validation + + style harness fill:#1a2633,color:#ccc,stroke:#4a90d9 + style generators fill:#1a3320,color:#ccc,stroke:#5cb85c + style validation fill:#332a1a,color:#ccc,stroke:#f0ad4e + style V1 fill:#4a90d9,color:#fff,stroke:#2a6db5 + style V2 fill:#4a90d9,color:#fff,stroke:#2a6db5 + style V3 fill:#4a90d9,color:#fff,stroke:#2a6db5 + style V4 fill:#4a90d9,color:#fff,stroke:#2a6db5 + style V5 fill:#4a90d9,color:#fff,stroke:#2a6db5 + style RPC fill:#5cb85c,color:#fff,stroke:#3d8b3d + style TX fill:#5cb85c,color:#fff,stroke:#3d8b3d + style SV fill:#f0ad4e,color:#000,stroke:#c78c2e + style MV fill:#f0ad4e,color:#000,stroke:#c78c2e + style LV fill:#f0ad4e,color:#000,stroke:#c78c2e + style DV fill:#f0ad4e,color:#000,stroke:#c78c2e + style BM fill:#f0ad4e,color:#000,stroke:#c78c2e +``` + +### Tasks + +| Task | Description | Effort | Risk | +| ---- | -------------------------------------- | ------ | ------ | +| 10.1 | Multi-node test harness (5 validators) | 2d | Medium | +| 10.2 | RPC load generator | 1d | Low | +| 10.3 | Transaction submitter (6+ tx types) | 2d | Medium | +| 10.4 | Telemetry validation suite | 2d | Medium | +| 10.5 | Performance benchmark suite | 1.5d | Low | +| 10.6 | CI integration | 1d | Medium | +| 10.7 | Documentation | 0.5d | Low | + +**Total Effort**: 10 days + +See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown. + +### Exit Criteria + +- [ ] 5-node validator cluster starts and reaches consensus in docker-compose +- [ ] Validation suite confirms all 16 spans, 22 attributes, 300+ metrics +- [ ] All 10 Grafana dashboards render data (no empty panels) +- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead +- [ ] CI workflow runs validation on telemetry branch changes + +--- + +## 6.8.4 Phase 11: Third-Party Data Collection Pipelines (Weeks 18-20) — Future Enhancement + +> **Status**: Planned, not yet implemented. 
+ +### Motivation + +rippled has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (`server_info`, `get_counts`, `fee`, `peers`, `validators`, `feature`). Every external consumer — exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms — must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver. + +### Architecture + +```mermaid +flowchart LR + subgraph receiver["Custom OTel Collector Receiver (Go)"] + direction TB + SI["server_info
collector"] + GC["get_counts
collector"] + FE["fee
collector"] + PE["peers
collector"] + VA["validators
collector"] + DX["DEX/AMM
collector
(optional)"] + end + + rippled["rippled
Admin RPC
:5005"] -->|"JSON-RPC
poll every 30s"| receiver + + receiver -->|"xrpl_* metrics"| PROM["Prometheus
:9090"] + receiver -->|"OTLP export"| OTLP["Any OTLP-
compatible
backend"] + + PROM --> GF["Grafana
4 new dashboards"] + PROM --> AL["Prometheus
Alerting Rules"] + + style receiver fill:#1a3320,color:#ccc,stroke:#5cb85c + style SI fill:#5cb85c,color:#fff,stroke:#3d8b3d + style GC fill:#5cb85c,color:#fff,stroke:#3d8b3d + style FE fill:#5cb85c,color:#fff,stroke:#3d8b3d + style PE fill:#5cb85c,color:#fff,stroke:#3d8b3d + style VA fill:#5cb85c,color:#fff,stroke:#3d8b3d + style DX fill:#449d44,color:#fff,stroke:#2d6e2d + style rippled fill:#4a90d9,color:#fff,stroke:#2a6db5 + style PROM fill:#f0ad4e,color:#000,stroke:#c78c2e + style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e + style GF fill:#5bc0de,color:#000,stroke:#3aa8c1 + style AL fill:#d9534f,color:#fff,stroke:#b52d2d +``` + +### Third-Party Consumer Gap Analysis + +| Consumer Category | Data Unlocked by Phase 11 | +| ---------------------- | ------------------------------------------------------------ | +| Exchanges | Real-time fee estimates, TxQ capacity, server health scores | +| Payment Processors | Settlement latency percentiles, corridor health | +| Analytics Providers | Validator metrics, network topology, amendment voting status | +| DeFi / AMM | AMM pool TVL, DEX order book depth, trade volumes | +| Validators / Operators | Per-peer latency, version distribution, UNL health, alerting | +| Compliance | Transaction volume trends, network growth metrics | +| Academic Researchers | Consensus performance time-series, decentralization metrics | +| CBDC / Tokenization | Token supply tracking, trust line adoption, freeze status | +| Institutional Custody | Multi-sig status, escrow tracking, reserve calculations | +| Wallet Providers | Server health for node selection, fee prediction data | + +### Tasks + +| Task | Description | Effort | Risk | +| ----- | ------------------------------------- | ------ | ------ | +| 11.1 | OTel Collector receiver scaffold (Go) | 1.5d | Medium | +| 11.2 | server_info / server_state collector | 2d | Low | +| 11.3 | get_counts collector | 1.5d | Low | +| 11.4 | Peer topology collector | 1.5d | Medium | +| 11.5 | Validator & 
amendment collector | 1d | Low | +| 11.6 | Fee & TxQ collector | 0.5d | Low | +| 11.7 | DEX & AMM collector (optional) | 1.5d | Medium | +| 11.8 | Prometheus alerting rules | 1d | Low | +| 11.9 | New Grafana dashboards (4) | 2d | Low | +| 11.10 | Integration with Phase 10 validation | 1d | Low | +| 11.11 | Documentation | 1d | Low | + +**Total Effort**: 15 days + +See [Phase11_taskList.md](./Phase11_taskList.md) for detailed per-task breakdown. + +### Exit Criteria + +- [ ] Custom OTel Collector receiver exports all `xrpl_*` metrics to Prometheus +- [ ] 4 new Grafana dashboards operational (Validator Health, Network Topology, Fee Market, DEX/AMM) +- [ ] Prometheus alerting rules fire correctly for simulated failures +- [ ] Receiver handles rippled restart/unavailability gracefully +- [ ] Go receiver has unit tests with >80% coverage + +--- + ## 6.9 Risk Assessment ```mermaid @@ -705,22 +980,35 @@ pie showData "Phase 3: Transaction Tracing" : 11 "Phase 4: Consensus Tracing" : 11 "Phase 5: Documentation" : 5 + "Phase 6: StatsD Bridge" : 5.6 + "Phase 7: Native OTel Metrics" : 8 + "Phase 8: Log-Trace Correlation" : 4.5 + "Phase 9: Metric Gap Fill" : 12 + "Phase 10: Workload Validation" : 10 + "Phase 11: Third-Party Collection" : 15 ``` -**Total Effort Distribution (47 developer-days)** +**Total Effort Distribution (102.1 developer-days)** ### Resource Requirements -| Phase | Developers | Duration | Total Effort | -| --------- | ---------- | ----------- | ------------ | -| 1 | 2 | 2 weeks | 10 days | -| 2 | 1-2 | 2 weeks | 10 days | -| 3 | 2 | 2 weeks | 11 days | -| 4 | 2 | 2 weeks | 11 days | -| 5 | 1 | 1 week | 5 days | -| **Total** | **2** | **9 weeks** | **47 days** | +| Phase | Developers | Duration | Total Effort | Status | +| ---------------- | ---------- | ------------ | -------------- | ------------------ | +| 1 | 2 | 2 weeks | 10 days | Active | +| 2 | 1-2 | 2 weeks | 10 days | Active | +| 3 | 2 | 2 weeks | 11 days | Active | +| 4 | 2 | 2 weeks | 11 days | 
Active | +| 5 | 1 | 1 week | 5 days | Active | +| 6 | 1 | 1 week | 5.6 days | Active | +| 7 | 1-2 | 2 weeks | 8 days | Active | +| 8 | 1 | 1 week | 4.5 days | Active | +| 9 | 1-2 | 2.5 weeks | 12 days | Future Enhancement | +| 10 | 1 | 2 weeks | 10 days | Future Enhancement | +| 11 | 1-2 | 3 weeks | 15 days | Future Enhancement | +| **Total (1-8)** | **2** | **13 weeks** | **65.1 days** | | +| **Total (1-11)** | **2** | **20 weeks** | **102.1 days** | | --- @@ -924,16 +1212,19 @@ Clear, measurable criteria for each phase. ### 6.13.6 Success Metrics Summary -| Phase | Primary Metric | Secondary Metric | Deadline | -| ------- | ---------------------------- | --------------------------- | -------------- | -| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | -| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | -| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | -| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | -| Phase 5 | Production deployment | Operators trained | End of Week 9 | -| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | -| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | -| Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 | +| Phase | Primary Metric | Secondary Metric | Deadline | Status | +| -------- | -------------------------------- | --------------------------- | -------------- | ------------------ | +| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | Active | +| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | Active | +| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | Active | +| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | Active | +| Phase 5 | Production deployment | Operators trained | End of Week 9 | Active | +| Phase 6 | StatsD metrics in 
Prometheus | 3 dashboards operational | End of Week 10 | Active | +| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | Active | +| Phase 8 | trace_id in logs + Loki | Tempo↔Loki correlation | End of Week 13 | Active | +| Phase 9 | 50+ new internal metrics in Prom | 2 new dashboards | End of Week 15 | Future Enhancement | +| Phase 10 | Full telemetry stack validated | < 3% CPU overhead proven | End of Week 17 | Future Enhancement | +| Phase 11 | Third-party metrics via receiver | 4 new dashboards + alerting | End of Week 20 | Future Enhancement | --- diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md index a74cc513b3..c8aa797347 100644 --- a/OpenTelemetryPlan/08-appendix.md +++ b/OpenTelemetryPlan/08-appendix.md @@ -37,6 +37,18 @@ | **PerfLog** | Existing performance logging system in rippled | | **Beast Insight** | Existing metrics framework in rippled | +### Phase 9–11 Terms + +| Term | Definition | +| --------------------------- | ------------------------------------------------------------------------- | +| **MetricsRegistry** | Centralized class for OTel async gauge registrations (Phase 9) | +| **ObservableGauge** | OTel Metrics SDK async instrument polled via callback at fixed intervals | +| **PeriodicMetricReader** | OTel SDK component that invokes gauge callbacks at configurable intervals | +| **CountedObject** | rippled template that tracks live instance counts via atomic counters | +| **TxQ** | Transaction queue managing fee escalation and ordering | +| **Load Factor** | Combined multiplier affecting transaction cost (local, cluster, network) | +| **OTel Collector Receiver** | Custom Go plugin that polls rippled RPC and emits OTel metrics (Phase 11) | + --- ## 8.2 Span Hierarchy Visualization @@ -107,10 +119,11 @@ flowchart TB ## 8.4 Version History -| Version | Date | Author | Changes | -| ------- | ---------- | ------ | --------------------------------- | -| 1.0 | 2026-02-12 | - | Initial 
implementation plan | -| 1.1 | 2026-02-13 | - | Refactored into modular documents | +| Version | Date | Author | Changes | +| ------- | ---------- | ------ | -------------------------------------------- | +| 1.0 | 2026-02-12 | - | Initial implementation plan | +| 1.1 | 2026-02-13 | - | Refactored into modular documents | +| 1.2 | 2026-03-09 | - | Added Phases 9–11 (future enhancement plans) | --- @@ -135,16 +148,83 @@ flowchart TB ### Task Lists -| Document | Description | -| -------------------------------------------------------------------------- | -------------------------------------- | -| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration | -| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation | -| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing | -| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing | -| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing | -| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests | -| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration | -| [Phase8_taskList.md](./Phase8_taskList.md) | Log-trace correlation | +| Document | Description | +| -------------------------------------------------------------------------- | --------------------------------------------------- | +| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration | +| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation | +| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing | +| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing | +| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing | +| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests | +| 
[Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration | +| [Phase8_taskList.md](./Phase8_taskList.md) | Log-trace correlation | +| [Phase9_taskList.md](./Phase9_taskList.md) | Internal metric instrumentation gap fill (future) | +| [Phase10_taskList.md](./Phase10_taskList.md) | Synthetic workload generation & validation (future) | +| [Phase11_taskList.md](./Phase11_taskList.md) | Third-party data collection pipelines (future) | + +> **Note**: Phases 1 and 6 do not have separate task list files. Phase 1 tasks are documented in [06-implementation-phases.md §6.2](./06-implementation-phases.md). Phase 6 tasks are documented in [06-implementation-phases.md §6.7](./06-implementation-phases.md). + +--- + +## 8.6 Phase 9–11 Cross-Reference Guide + +This guide maps Phase 9–11 content to its location across the documentation. + +### Phase 9: Internal Metric Instrumentation Gap Fill + +| Content | Location | +| ------------------------------- | ------------------------------------------------------------------------ | +| Plan & architecture | [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) | +| Task list (10 tasks, 12d) | [Phase9_taskList.md](./Phase9_taskList.md) | +| Future metric definitions (~50) | [09-data-collection-reference.md §5b](./09-data-collection-reference.md) | +| New class: `MetricsRegistry` | `src/xrpld/telemetry/MetricsRegistry.h/.cpp` (planned) | +| New dashboards | `rippled-fee-market`, `rippled-job-queue` (planned) | + +**Metric categories**: NodeStore I/O, Cache Hit Rates, TxQ, PerfLog Per-RPC, PerfLog Per-Job, Counted Objects, Fee Escalation & Load Factors. 
+ +### Phase 10: Synthetic Workload Generation & Telemetry Validation + +| Content | Location | +| ------------------------ | ------------------------------------------------------------------------ | +| Plan & architecture | [06-implementation-phases.md §6.8.3](./06-implementation-phases.md) | +| Task list (7 tasks, 10d) | [Phase10_taskList.md](./Phase10_taskList.md) | +| Validation inventory | [09-data-collection-reference.md §5c](./09-data-collection-reference.md) | +| Test harness | `docker/telemetry/docker-compose.workload.yaml` (planned) | +| CI workflow | `.github/workflows/telemetry-validation.yml` (planned) | + +**Validates**: 16 spans, 22 attributes, 300+ metrics, 10 dashboards, log-trace correlation. + +### Phase 11: Third-Party Data Collection Pipelines + +| Content | Location | +| --------------------------------- | ------------------------------------------------------------------------ | +| Plan & architecture | [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) | +| Task list (11 tasks, 15d) | [Phase11_taskList.md](./Phase11_taskList.md) | +| External metric definitions (~30) | [09-data-collection-reference.md §5d](./09-data-collection-reference.md) | +| Custom OTel Collector receiver | `docker/telemetry/otel-rippled-receiver/` (planned) | +| Prometheus alerting rules (11) | [09-data-collection-reference.md §5d](./09-data-collection-reference.md) | +| New dashboards (4) | Validator Health, Network Topology, Fee Market (External), DEX & AMM | + +**Consumer categories**: Exchanges, Payment Processors, DeFi/AMM, NFT Marketplaces, Analytics Providers, Wallets, Compliance, Academic Researchers, Institutional Custody, CBDC Bridge Operators. 
+ +--- + +## 8.7 Effort Summary (All Phases) + +| Phase | Description | Effort | Status | +| ----- | -------------------------------- | ---------- | ------------------ | +| 1 | Core SDK integration | 5d | Active | +| 2 | RPC tracing | 5d | Active | +| 3 | Peer & consensus tracing | 8d | Active | +| 4 | Transaction lifecycle | 7d | Active | +| 5 | Ledger & advanced | 7.1d | Active | +| 6 | StatsD → OTel bridge | 8d | Active | +| 7 | Native OTel metrics | 15d | Active | +| 8 | Log-trace correlation | 10d | Active | +| 9 | Internal metric gap fill | 12d | Future Enhancement | +| 10 | Workload generation & validation | 10d | Future Enhancement | +| 11 | Third-party data pipelines | 15d | Future Enhancement | +| | **Total** | **102.1d** | | --- diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md index aea6457501..587cafec73 100644 --- a/OpenTelemetryPlan/09-data-collection-reference.md +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -567,6 +567,217 @@ count_over_time({job="rippled"} |= "trace_id=" [5m]) --- +## 5b. Future: Internal Metric Gap Fill (Phase 9) + +> **Status**: Planned, not yet implemented. +> **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context +> **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details + +Phase 9 adds time-series export for more than 50 metrics that already exist inside rippled but currently have no export path. It uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, and OTel `ObservableGauge` async callbacks for the new categories.
+ +### New Metric Categories + +#### NodeStore I/O (via beast::insight) + +| Prometheus Metric | Type | Description | +| ------------------------------------ | ----- | ----------------------------------- | +| `rippled_nodestore_reads_total` | Gauge | Cumulative read operations | +| `rippled_nodestore_reads_hit` | Gauge | Cache-served reads | +| `rippled_nodestore_writes` | Gauge | Cumulative write operations | +| `rippled_nodestore_written_bytes` | Gauge | Cumulative bytes written | +| `rippled_nodestore_read_bytes` | Gauge | Cumulative bytes read | +| `rippled_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) | +| `rippled_nodestore_write_load` | Gauge | Current write load score | +| `rippled_nodestore_read_queue` | Gauge | Items in read queue | + +#### Cache Hit Rates (via OTel MetricsRegistry) + +| Prometheus Metric | Type | Description | +| ------------------------------- | ----- | ------------------------------------ | +| `rippled_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) | +| `rippled_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate | +| `rippled_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate | +| `rippled_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) | +| `rippled_cache_fullbelow_size` | Gauge | FullBelow cache size | + +#### Transaction Queue (via OTel MetricsRegistry) + +| Prometheus Metric | Type | Description | +| -------------------------------------- | ----- | -------------------------------- | +| `rippled_txq_count` | Gauge | Current transactions in queue | +| `rippled_txq_max_size` | Gauge | Maximum queue capacity | +| `rippled_txq_in_ledger` | Gauge | Transactions in open ledger | +| `rippled_txq_per_ledger` | Gauge | Expected transactions per ledger | +| `rippled_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level | +| `rippled_txq_med_fee_level` | Gauge | Median fee level in queue | +| `rippled_txq_reference_fee_level` | Gauge | Reference fee 
level | +| `rippled_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed | + +#### PerfLog Per-RPC Method (via OTel Metrics SDK) + +| Prometheus Metric | Type | Labels | Description | +| --------------------------------------- | --------- | ----------------- | --------------------------- | +| `rippled_rpc_method_started_total` | Counter | `method=""` | RPC calls started | +| `rippled_rpc_method_finished_total` | Counter | `method=""` | RPC calls completed | +| `rippled_rpc_method_errored_total` | Counter | `method=""` | RPC calls errored | +| `rippled_rpc_method_duration_us_bucket` | Histogram | `method=""` | Execution time distribution | + +#### PerfLog Per-Job Type (via OTel Metrics SDK) + +| Prometheus Metric | Type | Labels | Description | +| ---------------------------------------- | --------- | ------------------- | --------------- | +| `rippled_job_queued_total` | Counter | `job_type=""` | Jobs queued | +| `rippled_job_started_total` | Counter | `job_type=""` | Jobs started | +| `rippled_job_finished_total` | Counter | `job_type=""` | Jobs completed | +| `rippled_job_queued_duration_us_bucket` | Histogram | `job_type=""` | Queue wait time | +| `rippled_job_running_duration_us_bucket` | Histogram | `job_type=""` | Execution time | + +#### Counted Object Instances (via OTel MetricsRegistry) + +| Prometheus Metric | Type | Labels | Description | +| ---------------------- | ----- | --------------- | ------------------------------- | +| `rippled_object_count` | Gauge | `type=""` | Live instances of internal type | + +Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry` + +#### Fee Escalation & Load Factors (via OTel MetricsRegistry) + +| Prometheus Metric | Type | Description | +| ------------------------------------ | ----- | ------------------------------------ | +| `rippled_load_factor` | Gauge | Combined transaction cost multiplier | +| 
`rippled_load_factor_server` | Gauge | Server + cluster + network load | +| `rippled_load_factor_local` | Gauge | Local server load only | +| `rippled_load_factor_net` | Gauge | Network-wide load estimate | +| `rippled_load_factor_cluster` | Gauge | Cluster peer load | +| `rippled_load_factor_fee_escalation` | Gauge | Open ledger fee escalation | +| `rippled_load_factor_fee_queue` | Gauge | Queue entry fee level | + +### New Grafana Dashboards (Phase 9) + +| Dashboard | UID | Data Source | Key Panels | +| ------------------ | -------------------- | ----------- | ----------------------------------------------------------------- | +| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation | +| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth | + +--- + +## 5c. Future: Synthetic Workload Generation & Telemetry Validation (Phase 10) + +> **Status**: Planned, not yet implemented. +> **Plan details**: [06-implementation-phases.md §6.8.3](./06-implementation-phases.md) — motivation, architecture +> **Task breakdown**: [Phase10_taskList.md](./Phase10_taskList.md) — per-task implementation details + +Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify all spans, metrics, dashboards, and log-trace correlation work end-to-end. Includes a benchmark suite comparing telemetry-ON vs telemetry-OFF overhead. 
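The telemetry-ON vs telemetry-OFF comparison mentioned above comes down to simple arithmetic. The sketch below assumes the overhead budget is a relative increase over the telemetry-OFF mean, which is one reading of the "< 3% CPU overhead" exit criterion; the function names and the sample values are hypothetical and are not measurements.

```python
from statistics import mean
from typing import Sequence


def overhead_pct(baseline: Sequence[float], with_telemetry: Sequence[float]) -> float:
    """Relative overhead of the telemetry-ON run over the telemetry-OFF run, in percent."""
    base = mean(baseline)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (mean(with_telemetry) - base) / base * 100.0


def within_budget(
    baseline: Sequence[float],
    with_telemetry: Sequence[float],
    budget_pct: float = 3.0,
) -> bool:
    """True when the measured overhead stays strictly below the budget."""
    return overhead_pct(baseline, with_telemetry) < budget_pct


# Hypothetical CPU utilisation samples (percent), one per benchmark run.
cpu_off = [40.0, 41.0, 39.0]
cpu_on = [40.8, 41.6, 40.0]

print(f"overhead: {overhead_pct(cpu_off, cpu_on):.2f}%")
print("within 3% budget:", within_budget(cpu_off, cpu_on))
```

If the budget is instead meant as absolute percentage points of CPU, the division by the baseline would be dropped; the plan does not say which is intended.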
+ +### Validated Telemetry Inventory + +| Category | Expected Count | Validation Method | +| ------------------ | -------------- | -------------------------------- | +| Trace spans | 16 | Jaeger/Tempo API query | +| Span attributes | 22 | Per-span attribute assertion | +| StatsD metrics | 255+ | Prometheus query | +| Phase 9 metrics | 50+ | Prometheus query | +| SpanMetrics RED | 4 per span | Prometheus query | +| Grafana dashboards | 10 | Dashboard API "no data" check | +| Log-trace links | Present | Loki query + Tempo reverse check | + +--- + +## 5d. Future: Third-Party Data Collection Pipelines (Phase 11) + +> **Status**: Planned, not yet implemented. +> **Plan details**: [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) — motivation, architecture, consumer gap analysis +> **Task breakdown**: [Phase11_taskList.md](./Phase11_taskList.md) — per-task implementation details + +Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports `xrpl_*` metrics for external consumers. No rippled code changes. + +### Exported Metrics (via Custom OTel Collector Receiver) + +#### Node Health (from server_info) + +| Prometheus Metric | Type | Description | +| --------------------------------------- | ----- | ----------------------------------------------- | +| `xrpl_server_state` | Gauge | Operating mode (0=disconnected ... 
5=proposing) | +| `xrpl_server_state_duration_seconds` | Gauge | Seconds in current state | +| `xrpl_uptime_seconds` | Gauge | Consecutive seconds running | +| `xrpl_io_latency_ms` | Gauge | I/O subsystem latency | +| `xrpl_amendment_blocked` | Gauge | 1 if amendment-blocked, 0 otherwise | +| `xrpl_peers_count` | Gauge | Connected peers | +| `xrpl_validated_ledger_seq` | Gauge | Latest validated ledger sequence | +| `xrpl_validated_ledger_age_seconds` | Gauge | Seconds since last validated close | +| `xrpl_last_close_proposers` | Gauge | Proposers in last consensus round | +| `xrpl_last_close_converge_time_seconds` | Gauge | Last consensus round duration | +| `xrpl_load_factor` | Gauge | Transaction cost multiplier | +| `xrpl_state_duration_seconds` | Gauge | Per-state duration (`state` label) | +| `xrpl_state_transitions_total` | Gauge | Per-state transition count (`state` label) | + +#### Peer Topology (from peers) + +| Prometheus Metric | Type | Description | +| --------------------------- | ----- | ----------------------------------- | +| `xrpl_peers_inbound_count` | Gauge | Inbound peer connections | +| `xrpl_peers_outbound_count` | Gauge | Outbound peer connections | +| `xrpl_peer_latency_p50_ms` | Gauge | Median peer latency | +| `xrpl_peer_latency_p95_ms` | Gauge | p95 peer latency | +| `xrpl_peer_version_count` | Gauge | Peers per version (`version` label) | +| `xrpl_peer_diverged_count` | Gauge | Peers with diverged tracking status | + +#### Validator & Amendment (from validators, feature) + +| Prometheus Metric | Type | Description | +| ------------------------------------- | ----- | --------------------------------------- | +| `xrpl_trusted_validators_count` | Gauge | UNL validator count | +| `xrpl_amendment_enabled_count` | Gauge | Enabled amendments | +| `xrpl_amendment_majority_count` | Gauge | Amendments with majority | +| `xrpl_amendment_unsupported_majority` | Gauge | 1 if unsupported amendment has majority | +| `xrpl_validator_list_active` | 
Gauge | 1 if validator list is active | + +#### Fee Market (from fee) + +| Prometheus Metric | Type | Description | +| -------------------------------- | ----- | ------------------------------------- | +| `xrpl_fee_open_ledger_fee_drops` | Gauge | Minimum fee for open ledger inclusion | +| `xrpl_fee_median_fee_drops` | Gauge | Median fee level | +| `xrpl_fee_queue_size` | Gauge | Current transaction queue depth | +| `xrpl_fee_current_ledger_size` | Gauge | Transactions in current open ledger | + +#### DEX & AMM (optional, from book_offers, amm_info) + +| Prometheus Metric | Type | Labels | Description | +| -------------------------- | ----- | --------------------- | ---------------------- | +| `xrpl_amm_tvl_drops` | Gauge | `pool=""` | Total value locked | +| `xrpl_amm_trading_fee` | Gauge | `pool=""` | Pool trading fee (bps) | +| `xrpl_orderbook_bid_depth` | Gauge | `pair=""` | Total bid volume | +| `xrpl_orderbook_ask_depth` | Gauge | `pair=""` | Total ask volume | +| `xrpl_orderbook_spread` | Gauge | `pair=""` | Best bid-ask spread | + +### New Grafana Dashboards (Phase 11) + +| Dashboard | UID | Data Source | Key Panels | +| ------------------ | ----------------------------- | ----------- | ---------------------------------------------------------------------- | +| Validator Health | `rippled-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting | +| Network Topology | `rippled-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers | +| Fee Market (Ext) | `rippled-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline | +| DEX & AMM Overview | `rippled-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue | + +### Prometheus Alerting Rules (Phase 11) + +| Alert Name | Severity | Condition | For | +| ---------------------------------- | -------- | 
----------------------------------------------------------- | --- | +| `XRPLServerNotFull` | Critical | `xrpl_server_state < 4` | 15m | +| `XRPLAmendmentBlocked` | Critical | `xrpl_amendment_blocked == 1` | 1m | +| `XRPLNoPeers` | Critical | `xrpl_peers_count == 0` | 5m | +| `XRPLLedgerStale` | Critical | `xrpl_validated_ledger_age_seconds > 120` | 2m | +| `XRPLHighIOLatency` | Critical | `xrpl_io_latency_ms > 100` | 5m | +| `XRPLUnsupportedAmendmentMajority` | Critical | `xrpl_amendment_unsupported_majority == 1` | 1m | +| `XRPLLowPeerCount` | Warning | `xrpl_peers_count < 10` | 15m | +| `XRPLHighLoadFactor` | Warning | `xrpl_load_factor > 10` | 10m | +| `XRPLSlowConsensus` | Warning | `xrpl_last_close_converge_time_seconds > 6` | 5m | +| `XRPLValidatorListExpiring` | Warning | `(xrpl_validator_list_expiration_seconds - time()) < 86400` | 1h | +| `XRPLStateFlapping` | Warning | `rate(xrpl_state_transitions_total{state="full"}[1h]) > 2` | 30m | + +--- + ## 6. Known Issues | Issue | Impact | Status | diff --git a/OpenTelemetryPlan/Phase10_taskList.md b/OpenTelemetryPlan/Phase10_taskList.md new file mode 100644 index 0000000000..fa652da778 --- /dev/null +++ b/OpenTelemetryPlan/Phase10_taskList.md @@ -0,0 +1,256 @@ +# Phase 10: Synthetic Workload Generation & Telemetry Validation — Task List + +> **Status**: Future Enhancement +> +> **Goal**: Build tools that generate realistic XRPL traffic to validate the full Phases 1-9 telemetry stack end-to-end — all spans, attributes, metrics, dashboards, and log-trace correlation — under controlled load. +> +> **Scope**: Python/shell test harness + multi-node docker-compose environment + automated validation scripts + performance benchmarks. 
+> +> **Branch**: `pratik/otel-phase10-workload-validation` (from `pratik/otel-phase9-metric-gap-fill`) +> +> **Depends on**: Phase 9 (internal metric gap fill) — validates the full metric surface + +### Related Plan Documents + +| Document | Relevance | +| -------------------------------------------------------------------- | --------------------------------------------------------------- | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 10 plan: motivation, architecture, exit criteria (§6.8.3) | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Defines the full inventory of spans/metrics to validate | +| [Phase9_taskList.md](./Phase9_taskList.md) | Prerequisite — all internal metrics must be emitting | + +### Why This Phase Exists + +Before Phases 1-9 can be considered production-ready, we need proof that: + +1. All 16 spans fire with correct attributes under real transaction workloads +2. All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values +3. Log-trace correlation (Phase 8) produces clickable trace_id links in Loki +4. All 10 Grafana dashboards render meaningful data (no empty panels) +5. Performance overhead stays within bounds (< 3% CPU, < 5MB memory) +6. The telemetry stack survives sustained load without data loss or queue backpressure + +--- + +## Task 10.1: Multi-Node Test Harness + +**Objective**: Create a docker-compose environment with 3-5 validator nodes that produces real consensus rounds. 
+ +**What to do**: + +- Create `docker/telemetry/docker-compose.workload.yaml`: + - 5 rippled validator nodes with UNL configured for each other + - All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel` + - Full OTel stack: Collector, Jaeger, Tempo, Prometheus, Loki, Grafana + - Shared network with service discovery + +- Each node should: + - Generate validator keys at startup + - Configure all 5 nodes in its UNL + - Enable all trace categories including `trace_peer=1` + - Write logs to a file tailed by the OTel Collector filelog receiver + +- Include a `Makefile` target: `make telemetry-workload-up` / `make telemetry-workload-down` + +**Key files**: + +- New: `docker/telemetry/docker-compose.workload.yaml` +- New: `docker/telemetry/workload/generate-validator-keys.sh` +- New: `docker/telemetry/workload/xrpld-validator.cfg.template` + +--- + +## Task 10.2: RPC Load Generator + +**Objective**: Configurable tool that fires all traced RPC commands at controlled rates. + +**What to do**: + +- Create `docker/telemetry/workload/rpc_load_generator.py`: + - Connects to one or more rippled WebSocket endpoints + - Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc. 
+ - Configurable parameters: rate (RPS), duration, command distribution weights + - Injects `traceparent` HTTP headers to test W3C context propagation + - Logs progress and errors to stdout + +- Command distribution should match realistic production ratios: + - 40% `server_info` / `fee` (health checks) + - 30% `account_info` / `account_lines` / `account_objects` (wallet queries) + - 15% `ledger` / `ledger_data` (explorer queries) + - 10% `tx` / `account_tx` (transaction lookups) + - 5% `book_offers` / `amm_info` (DEX queries) + +**Key files**: + +- New: `docker/telemetry/workload/rpc_load_generator.py` +- New: `docker/telemetry/workload/requirements.txt` + +--- + +## Task 10.3: Transaction Submitter + +**Objective**: Generate diverse transaction types to exercise `tx.*` and `ledger.*` spans. + +**What to do**: + +- Create `docker/telemetry/workload/tx_submitter.py`: + - Pre-funds test accounts from genesis account + - Submits a mix of transaction types: + - `Payment` (XRP and issued currencies) — exercises `tx.process`, `tx.apply` + - `OfferCreate` / `OfferCancel` — DEX activity + - `TrustSet` — trust line creation for issued currencies + - `NFTokenMint` / `NFTokenCreateOffer` / `NFTokenAcceptOffer` — NFT activity + - `EscrowCreate` / `EscrowFinish` — escrow lifecycle + - `AMMCreate` / `AMMDeposit` / `AMMWithdraw` — AMM pool operations (if amendment enabled) + - Configurable: TPS target, transaction mix weights, duration + - Monitors submission results and tracks success/failure rates + +- The transaction mix ensures the telemetry captures the full range of ledger activity that third parties care about. + +**Key files**: + +- New: `docker/telemetry/workload/tx_submitter.py` +- New: `docker/telemetry/workload/test_accounts.json` (pre-generated keypairs) + +--- + +## Task 10.4: Telemetry Validation Suite + +**Objective**: Automated scripts that verify all expected telemetry data exists after a workload run. 
+ +**What to do**: + +- Create `docker/telemetry/workload/validate_telemetry.py`: + + **Span validation** (queries Jaeger/Tempo API): + - Assert all 16 span names appear in traces + - Assert each span has its required attributes (22 total attributes across spans) + - Assert parent-child relationships are correct (`rpc.request` → `rpc.process` → `rpc.command.*`) + - Assert span durations are reasonable (> 0, < 60s) + + **Metric validation** (queries Prometheus API): + - Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket` + - Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc. + - Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*` + - Assert metric label cardinality is within bounds + + **Log-trace correlation validation** (queries Loki API): + - Assert logs contain `trace_id=` and `span_id=` fields + - Pick a random trace_id from Jaeger → query Loki for matching logs → assert results exist + - Assert Grafana derived field links are functional + + **Dashboard validation**: + - For each of the 10 Grafana dashboards, query the dashboard API and assert no panels show "No data" + +- Output: JSON report with pass/fail per check, suitable for CI. + +**Key files**: + +- New: `docker/telemetry/workload/validate_telemetry.py` +- New: `docker/telemetry/workload/expected_spans.json` (span inventory for validation) +- New: `docker/telemetry/workload/expected_metrics.json` (metric inventory for validation) + +--- + +## Task 10.5: Performance Benchmark Suite + +**Objective**: Measure CPU/memory/latency overhead of the telemetry stack. 
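The comparison step of such a benchmark can be sketched as follows. This is an illustrative sketch under stated assumptions: it treats the CPU target as a relative increase over the baseline and the memory target as an absolute RSS difference in MB, which is one possible reading of the targets this task lists. The function names and the input dictionary keys (`cpu`, `rss_mb`) are hypothetical.

```python
def overhead_percent(baseline, telemetry):
    """Relative increase of the telemetry run over the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (telemetry - baseline) / baseline * 100.0


def evaluate(baseline, telemetry, max_cpu_pct=3.0, max_rss_mb=5.0):
    """Compare two runs and report pass/fail against the overhead targets.

    Both arguments are dicts with hypothetical keys:
      "cpu"    - average CPU usage for the run (any consistent unit)
      "rss_mb" - peak resident memory in MB
    """
    cpu_delta = overhead_percent(baseline["cpu"], telemetry["cpu"])
    rss_delta = telemetry["rss_mb"] - baseline["rss_mb"]
    return {
        "cpu_overhead_pct": cpu_delta,
        "rss_overhead_mb": rss_delta,
        "passed": cpu_delta < max_cpu_pct and rss_delta < max_rss_mb,
    }


if __name__ == "__main__":
    # Made-up sample numbers, for illustration only.
    result = evaluate(
        baseline={"cpu": 40.0, "rss_mb": 1000.0},
        telemetry={"cpu": 41.0, "rss_mb": 1003.0},
    )
    print(result)
```

If the CPU target is instead meant as percentage points of total CPU, `overhead_percent` would be replaced by a plain difference; the plan does not say which is intended.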
+ +**What to do**: + +- Create `docker/telemetry/workload/benchmark.sh`: + - **Baseline run**: Start cluster with `[telemetry] enabled=0`, run transaction workload for 5 minutes, record metrics + - **Telemetry run**: Start cluster with full telemetry enabled, run identical workload, record metrics + - **Comparison**: Calculate deltas for: + - CPU usage (per-node average) + - Memory RSS (per-node peak) + - RPC p99 latency + - Transaction throughput (TPS) + - Consensus round time p95 + - Ledger close time p95 + +- Output: Markdown table comparing baseline vs. telemetry, with pass/fail against targets: + - CPU overhead < 3% + - Memory overhead < 5MB + - RPC latency impact < 2ms p99 + - Throughput impact < 5% + - Consensus impact < 1% + +- Store results in `docker/telemetry/workload/benchmark-results/` for historical tracking. + +**Key files**: + +- New: `docker/telemetry/workload/benchmark.sh` +- New: `docker/telemetry/workload/collect_system_metrics.sh` + +--- + +## Task 10.6: CI Integration + +**Objective**: Wire the validation suite into CI for regression detection. + +**What to do**: + +- Create a CI workflow (GitHub Actions or equivalent) that: + 1. Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON` + 2. Starts the multi-node workload harness + 3. Runs the RPC load generator + transaction submitter for 2 minutes + 4. Runs the validation suite + 5. Runs the benchmark suite + 6. Fails the build if any validation check fails or benchmark exceeds thresholds + 7. Archives the validation report and benchmark results as artifacts + +- This should be a separate workflow (not part of the main CI), triggered manually or on telemetry-related branch changes. + +**Key files**: + +- New: `.github/workflows/telemetry-validation.yml` +- New: `docker/telemetry/workload/run-full-validation.sh` (orchestrator script) + +--- + +## Task 10.7: Documentation + +**Objective**: Document the workload tools and validation process. 
+ +**What to do**: + +- Create `docker/telemetry/workload/README.md`: + - Quick start guide for running workload harness + - Configuration options for load generator and tx submitter + - How to read validation reports + - How to run benchmarks and interpret results + +- Update `docs/telemetry-runbook.md`: + - Add "Validating Telemetry Stack" section + - Add "Performance Benchmarking" section + +- Update `OpenTelemetryPlan/09-data-collection-reference.md`: + - Add "Validation" section with expected metric/span counts + +--- + +## Effort Summary + +| Task | Description | Effort | Risk | +| ---- | --------------------------- | ------ | ------ | +| 10.1 | Multi-node test harness | 2d | Medium | +| 10.2 | RPC load generator | 1d | Low | +| 10.3 | Transaction submitter | 2d | Medium | +| 10.4 | Telemetry validation suite | 2d | Medium | +| 10.5 | Performance benchmark suite | 1.5d | Low | +| 10.6 | CI integration | 1d | Medium | +| 10.7 | Documentation | 0.5d | Low | + +**Total Effort**: 10 days + +## Exit Criteria + +- [ ] 5-node validator cluster starts and reaches consensus in docker-compose +- [ ] RPC load generator fires all traced RPC commands at configurable rates +- [ ] Transaction submitter generates 6+ transaction types at configurable TPS +- [ ] Validation suite confirms all 16 spans, 22 attributes, 300+ metrics are present +- [ ] Log-trace correlation validated end-to-end (Loki ↔ Tempo) +- [ ] All 10 Grafana dashboards render data (no empty panels) +- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead +- [ ] CI workflow runs validation on telemetry branch changes +- [ ] Validation report output is CI-parseable (JSON with exit codes) diff --git a/OpenTelemetryPlan/Phase11_taskList.md b/OpenTelemetryPlan/Phase11_taskList.md new file mode 100644 index 0000000000..1c7b8a0917 --- /dev/null +++ b/OpenTelemetryPlan/Phase11_taskList.md @@ -0,0 +1,471 @@ +# Phase 11: Third-Party Data Collection Pipelines — Task List + +> **Status**: Future Enhancement +> +> 
**Goal**: Build a custom OTel Collector receiver that periodically polls rippled's admin RPCs and exports structured metrics for external consumers — making all XRPL health, validator, peer, fee, and DEX data available as Prometheus/OTLP metrics without rippled code changes. +> +> **Scope**: Go-based OTel Collector receiver plugin + Grafana dashboards + Prometheus alerting rules. +> +> **Branch**: `pratik/otel-phase11-third-party-collection` (from `pratik/otel-phase10-workload-validation`) +> +> **Depends on**: Phase 10 (validation harness for testing the new receiver) + +### Related Plan Documents + +| Document | Relevance | +| -------------------------------------------------------------------- | --------------------------------------------------------------- | +| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 11 plan: motivation, architecture, exit criteria (§6.8.4) | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Defines full metric inventory including third-party metrics | +| [Phase10_taskList.md](./Phase10_taskList.md) | Prerequisite — validation harness for testing | + +### Third-Party Consumer Gap Analysis + +This phase addresses the cross-cutting gap identified during research: **rippled has no native Prometheus/OTLP metrics export for data accessible only via RPC**. Every consumer (exchanges, payment processors, analytics providers, validators, researchers, compliance firms, custodians) must build custom JSON-RPC polling and conversion. This receiver centralizes that work. 
+ +| Consumer Category | Data Unlocked by This Phase | +| -------------------------- | ------------------------------------------------------------------ | +| **Exchanges** | Real-time fee estimates, TxQ capacity, server health scores | +| **Payment Processors** | Settlement latency percentiles, corridor health, path availability | +| **Analytics Providers** | Validator metrics, network topology, amendment voting status | +| **DeFi / AMM** | AMM pool TVL, DEX order book depth, trade volumes | +| **Validators / Operators** | Per-peer latency, version distribution, UNL health, alerting | +| **Compliance** | Transaction volume trends, network growth metrics | +| **Academic Researchers** | Consensus performance time-series, decentralization metrics | +| **CBDC / Tokenization** | Token supply tracking, trust line adoption, freeze status | +| **Institutional Custody** | Multi-sig status, escrow tracking, reserve calculations | +| **Wallet Providers** | Server health for node selection, fee prediction data | + +--- + +## Task 11.1: OTel Collector Receiver Scaffold + +**Objective**: Create the Go project structure for a custom OTel Collector receiver that polls rippled JSON-RPC. 
+ +**What to do**: + +- Create `docker/telemetry/otel-rippled-receiver/`: + - `receiver.go` — implements `receiver.Metrics` interface + - `config.go` — configuration struct (endpoint, poll interval, enabled RPCs) + - `factory.go` — receiver factory registration + - `go.mod` / `go.sum` — Go module with OTel Collector SDK dependency + +- Configuration model: + + ```yaml + rippled_receiver: + endpoint: "http://localhost:5005" # rippled admin RPC + poll_interval: 30s # how often to poll + enabled_collectors: + - server_info + - get_counts + - fee + - peers + - validators + - feature + - server_state + amm_pools: [] # optional: AMM pool IDs to track + book_offers_pairs: [] # optional: currency pairs for DEX depth + ``` + +- Build a custom OTel Collector binary that includes this receiver alongside the standard receivers. + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/receiver.go` +- New: `docker/telemetry/otel-rippled-receiver/config.go` +- New: `docker/telemetry/otel-rippled-receiver/factory.go` +- New: `docker/telemetry/otel-rippled-receiver/go.mod` +- New: `docker/telemetry/otel-rippled-receiver/Dockerfile` + +--- + +## Task 11.2: server_info / server_state Collector + +**Objective**: Poll `server_info` and `server_state` and export all fields as OTel metrics. 
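To make the JSON-to-metric conversion concrete, here is a minimal sketch written in Python for brevity, although the receiver itself is planned in Go. It covers only a handful of fields, uses the state-to-integer mapping this plan defines for `xrpl_server_state`, and assumes the input is the `info` object of a `server_info` response. The function name `server_info_to_gauges` is hypothetical.

```python
# Mapping taken from this plan's definition of xrpl_server_state.
SERVER_STATE_VALUES = {
    "disconnected": 0,
    "connected": 1,
    "syncing": 2,
    "tracking": 3,
    "full": 4,
    "proposing": 5,
}

# (field in the info object, exported metric name)
SIMPLE_FIELDS = (
    ("uptime", "xrpl_uptime_seconds"),
    ("io_latency_ms", "xrpl_io_latency_ms"),
    ("peers", "xrpl_peers_count"),
    ("load_factor", "xrpl_load_factor"),
)


def server_info_to_gauges(info):
    """Flatten the `info` object of a server_info response into gauges.

    Absent fields are skipped rather than exported as zero, so a missing
    value is not mistaken for a real reading. The exception is
    amendment_blocked, which is treated as 0 when the field is absent.
    """
    gauges = {}
    state = info.get("server_state")
    if state in SERVER_STATE_VALUES:
        gauges["xrpl_server_state"] = SERVER_STATE_VALUES[state]
    for field, metric in SIMPLE_FIELDS:
        if field in info:
            gauges[metric] = info[field]
    gauges["xrpl_amendment_blocked"] = 1 if info.get("amendment_blocked") else 0
    validated = info.get("validated_ledger", {})
    if "seq" in validated:
        gauges["xrpl_validated_ledger_seq"] = validated["seq"]
    if "age" in validated:
        gauges["xrpl_validated_ledger_age_seconds"] = validated["age"]
    return gauges
```

Unknown state strings are deliberately dropped instead of being mapped to a default, so that an alert such as `xrpl_server_state < 4` does not fire on a value the mapping simply does not cover.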
+ +**What to do**: + +- Implement `serverInfoCollector` that calls `server_info` (admin) and extracts: + + **Node Health Gauges:** + - `xrpl_server_state` (enum → int: disconnected=0, connected=1, syncing=2, tracking=3, full=4, proposing=5) + - `xrpl_server_state_duration_seconds` + - `xrpl_uptime_seconds` + - `xrpl_io_latency_ms` + - `xrpl_amendment_blocked` (0 or 1) + - `xrpl_peers_count` + - `xrpl_peer_disconnects_total` + - `xrpl_peer_disconnects_resources_total` + - `xrpl_jq_trans_overflow_total` + + **Consensus Gauges:** + - `xrpl_last_close_proposers` + - `xrpl_last_close_converge_time_seconds` + - `xrpl_validation_quorum` + + **Ledger Gauges:** + - `xrpl_validated_ledger_seq` + - `xrpl_validated_ledger_age_seconds` + - `xrpl_validated_ledger_base_fee_drops` + - `xrpl_validated_ledger_reserve_base_drops` + - `xrpl_validated_ledger_reserve_inc_drops` + - `xrpl_close_time_offset_seconds` (0 when absent) + + **Load Factor Gauges:** + - `xrpl_load_factor` + - `xrpl_load_factor_server` + - `xrpl_load_factor_fee_escalation` + - `xrpl_load_factor_fee_queue` + - `xrpl_load_factor_local` + - `xrpl_load_factor_net` + - `xrpl_load_factor_cluster` + + **State Accounting Gauges** (per state: disconnected, connected, syncing, tracking, full): + - `xrpl_state_duration_seconds{state=""}` + - `xrpl_state_transitions_total{state=""}` + + **Validator Info** (when node is a validator): + - `xrpl_validator_list_count` + - `xrpl_validator_list_expiration_seconds` (epoch) + - `xrpl_validator_list_active` (0 or 1) + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/server_info.go` + +--- + +## Task 11.3: get_counts Collector + +**Objective**: Poll `get_counts` and export internal object counts and NodeStore stats. 
+ +**What to do**: + +- Implement `getCountsCollector`: + + **Database Gauges:** + - `xrpl_db_size_kb{db="total"}`, `xrpl_db_size_kb{db="ledger"}`, `xrpl_db_size_kb{db="transaction"}` + + **NodeStore Gauges:** + - `xrpl_nodestore_reads_total`, `xrpl_nodestore_reads_hit`, `xrpl_nodestore_writes_total` + - `xrpl_nodestore_read_bytes`, `xrpl_nodestore_written_bytes` + - `xrpl_nodestore_read_duration_us`, `xrpl_nodestore_write_load` + - `xrpl_nodestore_read_queue`, `xrpl_nodestore_read_threads_running` + + **Cache Gauges:** + - `xrpl_cache_hit_rate{cache="SLE"}`, `xrpl_cache_hit_rate{cache="ledger"}`, `xrpl_cache_hit_rate{cache="accepted_ledger"}` + - `xrpl_cache_size{cache="treenode"}`, `xrpl_cache_size{cache="fullbelow"}`, `xrpl_cache_size{cache="accepted_ledger"}` + + **Object Count Gauges:** + - `xrpl_object_count{type=""}` for each counted object type (Transaction, Ledger, NodeObject, STTx, STLedgerEntry, InboundLedger, Pathfinder, etc.) + + **Rates:** + - `xrpl_historical_fetch_per_minute` + - `xrpl_local_txs` + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/get_counts.go` + +--- + +## Task 11.4: Peer Topology Collector + +**Objective**: Poll `peers` and export per-peer and aggregate network metrics. 
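The aggregate side of this collector can be sketched as a pure function over the peer list. The sketch below is in Python for brevity (the receiver is planned in Go) and makes several assumptions that should be checked against a real `peers` response: each peer is a dict, `latency` is in milliseconds and may be absent, `inbound` is truthy only for inbound peers, and `track` carries `"diverged"` when applicable. The nearest-rank percentile method is a choice made here, not something the plan specifies.

```python
import math
from collections import Counter


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; None when it is empty."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate_peers(peers):
    """Reduce a list of peer dicts to low-cardinality aggregate gauges."""
    latencies = sorted(p["latency"] for p in peers if "latency" in p)
    inbound = sum(1 for p in peers if p.get("inbound"))
    return {
        "xrpl_peers_inbound_count": inbound,
        "xrpl_peers_outbound_count": len(peers) - inbound,
        "xrpl_peer_latency_p50_ms": nearest_rank(latencies, 50),
        "xrpl_peer_latency_p95_ms": nearest_rank(latencies, 95),
        "xrpl_peer_diverged_count": sum(
            1 for p in peers if p.get("track") == "diverged"
        ),
        # One entry per software version; becomes the `version` label.
        "xrpl_peer_version_count": dict(
            Counter(p.get("version", "unknown") for p in peers)
        ),
    }
```

Computing percentiles inside the receiver keeps the exported series count fixed regardless of how many peers are connected, which is the reason the aggregate gauges scale better than per-peer labels.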
+ +**What to do**: + +- Implement `peersCollector`: + + **Aggregate Gauges:** + - `xrpl_peers_inbound_count` + - `xrpl_peers_outbound_count` + - `xrpl_peers_cluster_count` + + **Per-Peer Gauges** (with labels `peer_key` truncated to 8 chars for cardinality control): + - `xrpl_peer_latency_ms{peer="", version="", inbound=""}` + - `xrpl_peer_uptime_seconds{peer=""}` + - `xrpl_peer_load{peer=""}` + + **Distribution Gauges** (aggregated across all peers): + - `xrpl_peer_latency_p50_ms`, `xrpl_peer_latency_p95_ms`, `xrpl_peer_latency_p99_ms` + - `xrpl_peer_version_count{version=""}` — count of peers per software version + + **Tracking Status:** + - `xrpl_peer_diverged_count` — peers with `track=diverged` + - `xrpl_peer_unknown_count` — peers with `track=unknown` + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/peers.go` + +**Cardinality note**: Per-peer metrics use truncated keys. For large peer sets (50+), the aggregate distribution gauges are preferred over per-peer labels. + +--- + +## Task 11.5: Validator & Amendment Collector + +**Objective**: Poll `validators` and `feature` to export validator health and amendment voting status. 
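The amendment side of this collector reduces to a few counters over the `feature` response. The following is a hedged sketch in Python (the receiver is planned in Go). It assumes the response exposes a map of amendment ID to an object with `enabled` and `supported` booleans and a `majority` field that is present only once majority has been gained; those field names should be verified against a live node before relying on them. `summarize_amendments` is a hypothetical name.

```python
def summarize_amendments(features):
    """Reduce a feature map to the aggregate amendment gauges.

    `features` maps amendment ID -> dict. Assumed keys per entry:
      "enabled"   - True once the amendment is active
      "supported" - True if this server build knows the amendment
      "majority"  - present only while the amendment holds a majority
    """
    enabled = 0
    majority = 0
    unsupported_majority = 0
    for entry in features.values():
        if entry.get("enabled"):
            enabled += 1
            continue  # enabled amendments no longer count as pending
        if "majority" in entry:
            majority += 1
            if not entry.get("supported", False):
                # An unsupported amendment with majority will eventually
                # leave this server amendment-blocked.
                unsupported_majority = 1
    return {
        "xrpl_amendment_enabled_count": enabled,
        "xrpl_amendment_majority_count": majority,
        "xrpl_amendment_unsupported_majority": unsupported_majority,
    }
```

Exporting `xrpl_amendment_unsupported_majority` as a 0/1 flag rather than a count keeps the matching alert expression a simple equality test.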
+ +**What to do**: + +- Implement `validatorCollector`: + + **From `validators` RPC:** + - `xrpl_trusted_validators_count` + - `xrpl_validator_signing` (0 or 1 — whether local validator is signing) + + **From `feature` RPC:** + - `xrpl_amendment_enabled_count` — total enabled amendments + - `xrpl_amendment_majority_count` — amendments with majority but not yet enabled + - `xrpl_amendment_vetoed_count` — locally vetoed amendments + - `xrpl_amendment_unsupported_majority` (0 or 1) — any unsupported amendment has majority (critical alert) + + **Per-amendment with majority** (limited cardinality — only amendments with `majority` set): + - `xrpl_amendment_majority_time{name=""}` — epoch time when majority was gained + - `xrpl_amendment_votes{name=""}` — current vote count + - `xrpl_amendment_threshold{name=""}` — votes needed + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/validators.go` + +--- + +## Task 11.6: Fee & TxQ Collector + +**Objective**: Poll `fee` RPC and export real-time fee market data. + +**What to do**: + +- Implement `feeCollector` that calls the public `fee` RPC: + + **Fee Level Gauges:** + - `xrpl_fee_current_ledger_size` — transactions in current open ledger + - `xrpl_fee_expected_ledger_size` — expected transactions at close + - `xrpl_fee_max_queue_size` — maximum transaction queue size + - `xrpl_fee_open_ledger_fee_drops` — minimum fee for open ledger inclusion + - `xrpl_fee_median_fee_drops` — median fee level + - `xrpl_fee_minimum_fee_drops` — base reference fee + - `xrpl_fee_queue_size` — current queue depth + +- This overlaps with Phase 9's internal TxQ metrics but provides an external-only collection path that doesn't require rippled code changes. + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/fee.go` + +--- + +## Task 11.7: DEX & AMM Collector (Optional) + +**Objective**: Periodically poll configured AMM pools and order book pairs for DeFi metrics. 
+ +**What to do**: + +- Implement `dexCollector` (enabled only when `amm_pools` or `book_offers_pairs` are configured): + + **AMM Pool Gauges** (per configured pool): + - `xrpl_amm_reserve{pool="", asset=""}` — pool reserve amount + - `xrpl_amm_lp_token_supply{pool=""}` — outstanding LP tokens + - `xrpl_amm_trading_fee{pool=""}` — pool trading fee (basis points) + - `xrpl_amm_tvl_drops{pool=""}` — total value locked (XRP-denominated) + + **Order Book Gauges** (per configured pair): + - `xrpl_orderbook_bid_depth{pair="/"}` — total bid volume + - `xrpl_orderbook_ask_depth{pair="/"}` — total ask volume + - `xrpl_orderbook_spread{pair="/"}` — best bid-ask spread + - `xrpl_orderbook_offer_count{pair="/", side="bid|ask"}` — number of offers + +**Key files**: + +- New: `docker/telemetry/otel-rippled-receiver/collectors/dex.go` + +**Note**: This is optional because it requires explicit configuration of which pools/pairs to track. Default configuration tracks no DEX data. + +--- + +## Task 11.8: Prometheus Alerting Rules + +**Objective**: Create production-ready alerting rules for the metrics exported by this receiver. 
+ +**What to do**: + +- Create `docker/telemetry/prometheus/rippled-alerts.yml`: + + **Tier 1 — Critical (page immediately):** + + ```yaml + - alert: XRPLServerNotFull + expr: xrpl_server_state < 4 + for: 15m + + - alert: XRPLAmendmentBlocked + expr: xrpl_amendment_blocked == 1 + for: 1m + + - alert: XRPLNoPeers + expr: xrpl_peers_count == 0 + for: 5m + + - alert: XRPLLedgerStale + expr: xrpl_validated_ledger_age_seconds > 120 + for: 2m + + - alert: XRPLHighIOLatency + expr: xrpl_io_latency_ms > 100 + for: 5m + + - alert: XRPLUnsupportedAmendmentMajority + expr: xrpl_amendment_unsupported_majority == 1 + for: 1m + ``` + + **Tier 2 — Warning (investigate within hours):** + + ```yaml + - alert: XRPLLowPeerCount + expr: xrpl_peers_count < 10 + for: 15m + + - alert: XRPLHighLoadFactor + expr: xrpl_load_factor > 10 + for: 10m + + - alert: XRPLSlowConsensus + expr: xrpl_last_close_converge_time_seconds > 6 + for: 5m + + - alert: XRPLValidatorListExpiring + expr: (xrpl_validator_list_expiration_seconds - time()) < 86400 + for: 1h + + - alert: XRPLClockDrift + expr: xrpl_close_time_offset_seconds > 0 + for: 5m + + - alert: XRPLStateFlapping + expr: rate(xrpl_state_transitions_total{state="full"}[1h]) > 2 + for: 30m + ``` + +**Key files**: + +- New: `docker/telemetry/prometheus/rippled-alerts.yml` +- Update: `docker/telemetry/prometheus/prometheus.yml` (add rule_files reference) + +--- + +## Task 11.9: New Grafana Dashboards + +**Objective**: Create 4 new dashboards for the data exported by the receiver. 
+ +**What to do**: + +- **Validator Health** (`rippled-validator-health`): + - Server state timeline, state duration breakdown + - Proposer count trend, converge time trend, validation quorum + - Validator list expiration countdown + - Amendment voting status (majority/enabled/vetoed) + +- **Network Topology** (`rippled-network-topology`): + - Peer count (inbound/outbound/cluster), peer version distribution + - Peer latency distribution (p50/p95/p99), diverged peer count + - Geographic distribution (if enriched with GeoIP) + - Peer uptime distribution + +- **Fee Market** (`rippled-fee-market-external`): + - Current fee levels (open ledger, median, minimum), fee escalation timeline + - Queue depth vs. capacity, transactions per ledger + - Load factor breakdown (server/network/cluster/escalation) + +- **DEX & AMM Overview** (`rippled-dex-amm`) (only populated when DEX collectors are configured): + - AMM pool TVL, reserve ratios, LP token supply + - Order book depth per pair, spread trends + - Trading fee revenue estimates + +**Key files**: + +- New: `docker/telemetry/grafana/dashboards/rippled-validator-health.json` +- New: `docker/telemetry/grafana/dashboards/rippled-network-topology.json` +- New: `docker/telemetry/grafana/dashboards/rippled-fee-market-external.json` +- New: `docker/telemetry/grafana/dashboards/rippled-dex-amm.json` + +--- + +## Task 11.10: Integration with Phase 10 Validation + +**Objective**: Extend the Phase 10 validation suite to verify this receiver's metrics. 
+ +**What to do**: + +- Update `docker/telemetry/workload/validate_telemetry.py`: + - Add assertions for all `xrpl_*` metrics produced by the receiver + - Verify metric labels have expected values + - Verify alerting rules fire correctly (inject a "bad" state and check alert) + +- Update `docker/telemetry/docker-compose.workload.yaml`: + - Add the custom OTel Collector build with the rippled receiver + - Configure the receiver to poll one of the test nodes + +**Key files**: + +- Update: `docker/telemetry/workload/validate_telemetry.py` +- Update: `docker/telemetry/docker-compose.workload.yaml` +- Update: `docker/telemetry/workload/expected_metrics.json` + +--- + +## Task 11.11: Documentation + +**Objective**: Document the receiver, its metrics, deployment, and alerting. + +**What to do**: + +- Create `docker/telemetry/otel-rippled-receiver/README.md`: + - Architecture overview (how the receiver fits into the OTel Collector) + - Configuration reference (all config options with defaults) + - Metric reference table (all exported metrics with types and labels) + - Deployment guide (building custom collector binary, docker-compose integration) + +- Update `OpenTelemetryPlan/09-data-collection-reference.md`: + - Add "Third-Party Metrics (OTel Collector Receiver)" section + - Add new Grafana dashboard reference (4 dashboards) + - Add alerting rules reference + +- Update `docs/telemetry-runbook.md`: + - Add "Third-Party Metrics Receiver" troubleshooting section + - Add alerting playbook (what to do for each Tier 1/Tier 2 alert) + +--- + +## Effort Summary + +| Task | Description | Effort | Risk | +| ----- | ------------------------------------ | ------ | ------ | +| 11.1 | OTel Collector receiver scaffold | 1.5d | Medium | +| 11.2 | server_info / server_state collector | 2d | Low | +| 11.3 | get_counts collector | 1.5d | Low | +| 11.4 | Peer topology collector | 1.5d | Medium | +| 11.5 | Validator & amendment collector | 1d | Low | +| 11.6 | Fee & TxQ collector | 0.5d | 
Low | +| 11.7 | DEX & AMM collector (optional) | 1.5d | Medium | +| 11.8 | Prometheus alerting rules | 1d | Low | +| 11.9 | New Grafana dashboards (4) | 2d | Low | +| 11.10 | Integration with Phase 10 validation | 1d | Low | +| 11.11 | Documentation | 1d | Low | + +**Total Effort**: 14.5 days + +## Exit Criteria + +- [ ] Custom OTel Collector receiver builds and starts without errors +- [ ] All `xrpl_*` metrics from server_info, get_counts, peers, validators, fee appear in Prometheus +- [ ] Metrics update at configured poll interval (default 30s) +- [ ] 4 new Grafana dashboards operational with data +- [ ] Prometheus alerting rules fire correctly for simulated failure conditions +- [ ] DEX/AMM collector works when configured (optional — not required for base exit criteria) +- [ ] Phase 10 validation suite passes with receiver metrics included +- [ ] Receiver handles rippled restart/unavailability gracefully (no crash, logs warning, retries) +- [ ] Documentation complete: receiver README, metric reference, alerting playbook +- [ ] Go receiver has unit tests with >80% coverage diff --git a/OpenTelemetryPlan/Phase9_taskList.md b/OpenTelemetryPlan/Phase9_taskList.md new file mode 100644 index 0000000000..e5986b4812 --- /dev/null +++ b/OpenTelemetryPlan/Phase9_taskList.md @@ -0,0 +1,329 @@ +# Phase 9: Internal Metric Instrumentation Gap Fill — Task List + +> **Status**: Future Enhancement +> +> **Goal**: Instrument rippled to emit ~50+ metrics that exist in `get_counts`/`server_info`/TxQ/PerfLog but currently lack time-series export via the OTel or beast::insight pipelines. +> +> **Scope**: Hybrid approach — extend `beast::insight` for metrics near existing registrations, use OTel Metrics SDK `ObservableGauge` callbacks for new categories (TxQ, PerfLog, CountedObjects). 
+>
+> **Branch**: `pratik/otel-phase9-metric-gap-fill` (from `pratik/otel-phase8-log-correlation`)
+>
+> **Depends on**: Phase 7 (native OTel metrics pipeline) and Phase 8 (log-trace correlation)
+
+### Related Plan Documents
+
+| Document                                                             | Relevance                                                      |
+| -------------------------------------------------------------------- | -------------------------------------------------------------- |
+| [06-implementation-phases.md](./06-implementation-phases.md)         | Phase 9 plan: motivation, architecture, exit criteria (§6.8.2) |
+| [09-data-collection-reference.md](./09-data-collection-reference.md) | Current metric inventory + future metrics section              |
+| [Phase7_taskList.md](./Phase7_taskList.md)                           | Prerequisite — OTel Metrics SDK and `OTelCollector` class      |
+| [Phase8_taskList.md](./Phase8_taskList.md)                           | Prerequisite — log-trace correlation                           |
+
+### Third-Party Consumer Context
+
+These metrics serve multiple external consumer categories identified during research:
+
+| Consumer Category         | Key Metrics They Need                                           |
+| ------------------------- | --------------------------------------------------------------- |
+| **Exchanges**             | Fee escalation levels, TxQ depth, settlement latency            |
+| **Payment Processors**    | Load factors, io_latency, transaction throughput                |
+| **Analytics Providers**   | NodeStore I/O, cache hit rates, counted objects                 |
+| **Validators/Operators**  | Per-job execution times, PerfLog RPC counters, consensus timing |
+| **Academic Researchers**  | Consensus performance time-series, fee market dynamics          |
+| **Institutional Custody** | Server health scores, reserve calculations, node availability   |
+
+---
+
+## Task 9.1: NodeStore I/O Metrics
+
+**Objective**: Export node store read/write performance as time-series metrics.
+
+**What to do**:
+
+- In `src/libxrpl/nodestore/Database.cpp`, extend existing `beast::insight` registrations to add:
+  - Gauge: `node_reads_total` (cumulative read operations)
+  - Gauge: `node_reads_hit` (cache-served reads)
+  - Gauge: `node_writes` (cumulative write operations)
+  - Gauge: `node_written_bytes` (cumulative bytes written)
+  - Gauge: `node_read_bytes` (cumulative bytes read)
+  - Gauge: `node_reads_duration_us` (cumulative read time in microseconds)
+  - Gauge: `write_load` (current write load score)
+  - Gauge: `read_queue` (items in read queue)
+
+- These values are already computed in `Database::getCountsJson()` (line ~236). Wire the same counters to `beast::insight` hooks.
+
+**Key modified files**:
+
+- `src/libxrpl/nodestore/Database.cpp`
+- `src/libxrpl/nodestore/Database.h` (add insight members)
+
+**Derived Prometheus metrics**: `rippled_nodestore_reads_total`, `rippled_nodestore_reads_hit`, `rippled_nodestore_write_load`, etc.
+
+**Grafana dashboard**: Add "NodeStore I/O" panel group to _Node Health_ dashboard.
+
+---
+
+## Task 9.2: Cache Hit Rate Metrics
+
+**Objective**: Export SHAMap and ledger cache performance as time-series gauges.
+
+**What to do**:
+
+- Register OTel `ObservableGauge` callbacks (via Phase 7's `OTelCollector`) for:
+  - `SLE_hit_rate` — SLE cache hit rate (0.0–1.0)
+  - `ledger_hit_rate` — Ledger object cache hit rate
+  - `AL_hit_rate` — AcceptedLedger cache hit rate
+  - `treenode_cache_size` — SHAMap TreeNode cache size (entries)
+  - `treenode_track_size` — Tracked tree nodes
+  - `fullbelow_size` — FullBelow cache size
+
+- The callbacks should read from the same sources as the `GetCounts.cpp` handler (line ~43).
+
+- Create a centralized `MetricsRegistry` class that holds all OTel async gauge registrations, polled at 10-second intervals by the `PeriodicMetricReader`.
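
The pull model described above (callbacks that only run when the reader polls) can be illustrated with a small language-neutral sketch. It is written in Python for brevity and is not the planned C++ class: `register_gauge`, `deregister_gauge`, `collect`, and `hit_rate` are hypothetical names, and the real `MetricsRegistry` would delegate to the OTel C++ SDK rather than keep its own dictionary.

```python
from typing import Callable, Dict


def hit_rate(hits: int, misses: int) -> float:
    """Return hits / (hits + misses) in [0.0, 1.0]; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


class MetricsRegistry:
    """Hypothetical pull-style registry: callbacks run only when polled."""

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register_gauge(self, name: str, callback: Callable[[], float]) -> None:
        # Reject duplicates so two subsystems cannot silently share a name.
        if name in self._gauges:
            raise ValueError(f"gauge already registered: {name}")
        self._gauges[name] = callback

    def deregister_gauge(self, name: str) -> None:
        # Removing an unknown name is a no-op, which keeps shutdown simple.
        self._gauges.pop(name, None)

    def collect(self) -> Dict[str, float]:
        """Run one poll cycle; a failing callback is skipped, not fatal."""
        snapshot: Dict[str, float] = {}
        for name, callback in self._gauges.items():
            try:
                snapshot[name] = float(callback())
            except Exception:
                continue
        return snapshot
```

In this sketch a periodic reader would call `collect()` once per interval (10 seconds in the plan), so the cost of reading cache statistics is paid only at poll time rather than on every cache access.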
+
+**Key modified files**:
+
+- New: `src/xrpld/telemetry/MetricsRegistry.h` / `.cpp`
+- `src/xrpld/rpc/handlers/GetCounts.cpp` (extract shared access methods)
+- `src/xrpld/app/main/Application.cpp` (register MetricsRegistry at startup)
+
+**Derived Prometheus metrics**: `rippled_cache_SLE_hit_rate`, `rippled_cache_ledger_hit_rate`, `rippled_cache_treenode_size`, etc.
+
+---
+
+## Task 9.3: Transaction Queue (TxQ) Metrics
+
+**Objective**: Export TxQ depth, capacity, and fee escalation levels as time-series.
+
+**What to do**:
+
+- Register OTel `ObservableGauge` callbacks for TxQ state (from `TxQ.h` line ~143):
+  - `txq_count` — Current transactions in queue
+  - `txq_max_size` — Maximum queue capacity
+  - `txq_in_ledger` — Transactions in current open ledger
+  - `txq_per_ledger` — Expected transactions per ledger
+  - `txq_reference_fee_level` — Reference fee level
+  - `txq_min_processing_fee_level` — Minimum fee to get processed
+  - `txq_med_fee_level` — Median fee level in queue
+  - `txq_open_ledger_fee_level` — Open ledger fee escalation level
+
+- Add to the `MetricsRegistry` (Task 9.2).
+
+**Key modified files**:
+
+- `src/xrpld/telemetry/MetricsRegistry.cpp` (add TxQ callbacks)
+- `src/xrpld/app/tx/detail/TxQ.h` (expose metrics accessor if needed)
+
+**Derived Prometheus metrics**: `rippled_txq_count`, `rippled_txq_max_size`, `rippled_txq_open_ledger_fee_level`, etc.
+
+**Grafana dashboard**: New _Fee Market & TxQ_ dashboard (`rippled-fee-market`).
+
+---
+
+## Task 9.4: PerfLog Per-RPC Method Metrics
+
+**Objective**: Export per-RPC-method call counts and latency as OTel metrics.
+
+**What to do**:
+
+- Register OTel instruments for PerfLog RPC counters (from `PerfLogImp.cpp` line ~63):
+  - Counter: `rpc_method_started_total{method=""}` — calls started
+  - Counter: `rpc_method_finished_total{method=""}` — calls completed
+  - Counter: `rpc_method_errored_total{method=""}` — calls errored
+  - Histogram: `rpc_method_duration_us{method=""}` — execution time distribution
+
+- Use OTel `Counter` and `Histogram` instruments with a `method` attribute label.
+
+- Hook into the existing PerfLog callback mechanism rather than adding new instrumentation points.
+
+**Key modified files**:
+
+- `src/xrpld/perflog/detail/PerfLogImp.cpp` (add OTel instrument updates alongside existing JSON counters)
+- `src/xrpld/telemetry/MetricsRegistry.cpp` (register instruments)
+
+**Derived Prometheus metrics**: `rippled_rpc_method_started_total{method="server_info"}`, `rippled_rpc_method_duration_us_bucket{method="ledger"}`, etc.
+
+**Grafana dashboard**: Add "Per-Method RPC Breakdown" panel group to _RPC Performance_ dashboard.
+
+---
+
+## Task 9.5: PerfLog Per-Job-Type Metrics
+
+**Objective**: Export per-job-type queue and execution metrics.
+
+**What to do**:
+
+- Register OTel instruments for PerfLog job counters:
+  - Counter: `job_queued_total{job_type=""}` — jobs queued
+  - Counter: `job_started_total{job_type=""}` — jobs started
+  - Counter: `job_finished_total{job_type=""}` — jobs completed
+  - Histogram: `job_queued_duration_us{job_type=""}` — time spent waiting in queue
+  - Histogram: `job_running_duration_us{job_type=""}` — execution time distribution
+
+- Hook into PerfLog's existing job tracking alongside Task 9.4.
+
+**Key modified files**:
+
+- `src/xrpld/perflog/detail/PerfLogImp.cpp`
+- `src/xrpld/telemetry/MetricsRegistry.cpp`
+
+**Derived Prometheus metrics**: `rippled_job_queued_total{job_type="ledgerData"}`, `rippled_job_running_duration_us_bucket{job_type="transaction"}`, etc.
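
The `_bucket` suffix on the derived histogram series in Tasks 9.4 and 9.5 reflects Prometheus's cumulative bucket convention: each bucket counts every observation less than or equal to its `le` bound, and a final `+Inf` bucket counts all observations. The sketch below illustrates that bookkeeping only. The bucket bounds and the `DurationHistogram` class are assumptions for illustration; the plan does not specify bounds, and the real instruments would come from the OTel SDK.

```python
from bisect import bisect_left
from typing import Dict, List

# Hypothetical bucket upper bounds in microseconds (not specified by the plan).
BOUNDS_US: List[int] = [100, 1_000, 10_000, 100_000]


class DurationHistogram:
    """Illustrative histogram for one label value, e.g. one job_type."""

    def __init__(self, bounds: List[int]) -> None:
        self._bounds = sorted(bounds)
        # One slot per bound plus a final overflow slot for values above all bounds.
        self._counts = [0] * (len(self._bounds) + 1)
        self.sum_us = 0.0

    def observe(self, duration_us: float) -> None:
        # bisect_left returns the index of the first bound >= value,
        # which matches the inclusive "le" (less-or-equal) semantics.
        self._counts[bisect_left(self._bounds, duration_us)] += 1
        self.sum_us += duration_us

    def cumulative_buckets(self) -> Dict[str, int]:
        """Return bucket counts keyed by upper bound, accumulated as Prometheus expects."""
        out: Dict[str, int] = {}
        running = 0
        for bound, count in zip(self._bounds, self._counts):
            running += count
            out[str(bound)] = running
        out["+Inf"] = running + self._counts[-1]
        return out
```

A per-label setup would keep one such histogram per `method` or `job_type` value; choosing bounds that bracket typical RPC and job durations is what makes quantile estimates from the `_bucket` series useful.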
+
+**Grafana dashboard**: New _Job Queue Analysis_ dashboard (`rippled-job-queue`).
+
+---
+
+## Task 9.6: Counted Object Instance Metrics
+
+**Objective**: Export live instance counts for key internal object types.
+
+**What to do**:
+
+- Register OTel `ObservableGauge` callbacks for `CountedObject` instance counts:
+  - `object_count{type="Transaction"}` — live Transaction objects
+  - `object_count{type="Ledger"}` — live Ledger objects
+  - `object_count{type="NodeObject"}` — live NodeObject instances
+  - `object_count{type="STTx"}` — serialized transaction objects
+  - `object_count{type="STLedgerEntry"}` — serialized ledger entries
+  - `object_count{type="InboundLedger"}` — ledgers being fetched
+  - `object_count{type="Pathfinder"}` — active pathfinding computations
+  - `object_count{type="PathRequest"}` — active path requests
+  - `object_count{type="HashRouterEntry"}` — hash router entries
+
+- The `CountedObject` template already tracks these via atomic counters. The callback just reads the current counts.
+
+**Key modified files**:
+
+- `src/xrpld/telemetry/MetricsRegistry.cpp` (add counted object callbacks)
+- `include/xrpl/basics/CountedObject.h` (may need static accessor for iteration)
+
+**Derived Prometheus metrics**: `rippled_object_count{type="Transaction"}`, `rippled_object_count{type="NodeObject"}`, etc.
+
+**Grafana dashboard**: Add "Object Instance Counts" panel to _Node Health_ dashboard.
+
+---
+
+## Task 9.7: Fee Escalation & Load Factor Metrics
+
+**Objective**: Export the full load factor breakdown as time-series.
+
+**What to do**:
+
+- Register OTel `ObservableGauge` callbacks for load factors (from `NetworkOPs.cpp` line ~2694):
+  - `load_factor` — combined transaction cost multiplier
+  - `load_factor_server` — server + cluster + network contribution
+  - `load_factor_local` — local server load only
+  - `load_factor_net` — network-wide load estimate
+  - `load_factor_cluster` — cluster peer load
+  - `load_factor_fee_escalation` — open ledger fee escalation
+  - `load_factor_fee_queue` — queue entry fee level
+
+- These overlap with some existing StatsD metrics but provide finer granularity (individual factor breakdown vs. combined value).
+
+**Key modified files**:
+
+- `src/xrpld/telemetry/MetricsRegistry.cpp`
+- `src/xrpld/app/misc/NetworkOPs.cpp` (expose load factor accessors if needed)
+
+**Derived Prometheus metrics**: `rippled_load_factor`, `rippled_load_factor_fee_escalation`, etc.
+
+**Grafana dashboard**: Add "Load Factor Breakdown" panel to _Fee Market & TxQ_ dashboard.
+
+---
+
+## Task 9.8: New Grafana Dashboards
+
+**Objective**: Create Grafana dashboards for the new metric categories.
+
+**What to do**:
+
+- Create 2 new dashboards:
+  1. **Fee Market & TxQ** (`rippled-fee-market`) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
+  2. **Job Queue Analysis** (`rippled-job-queue`) — Per-job-type rates, queue wait times, execution times, job queue depth
+
+- Update 2 existing dashboards:
+  1. **Node Health** (`rippled-statsd-node-health`) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
+  2. **RPC Performance** (`rippled-rpc-perf`) — Add per-method RPC breakdown panels
+
+**Key modified files**:
+
+- New: `docker/telemetry/grafana/dashboards/rippled-fee-market.json`
+- New: `docker/telemetry/grafana/dashboards/rippled-job-queue.json`
+- `docker/telemetry/grafana/dashboards/rippled-statsd-node-health.json`
+- `docker/telemetry/grafana/dashboards/rippled-rpc-perf.json`
+
+---
+
+## Task 9.9: Update Documentation
+
+**Objective**: Update telemetry reference docs with all new metrics.
+
+**What to do**:
+
+- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
+  - Add new section for OTel SDK-exported metrics (NodeStore, cache, TxQ, PerfLog, CountedObjects, load factors)
+  - Update Grafana dashboard reference table (add 2 new dashboards)
+  - Add Prometheus query examples for new metrics
+
+- Update `docs/telemetry-runbook.md`:
+  - Add alerting rules for new metrics (NodeStore write_load, TxQ capacity, cache hit rate degradation)
+  - Add troubleshooting entries for new metric categories
+
+**Key modified files**:
+
+- `OpenTelemetryPlan/09-data-collection-reference.md`
+- `docs/telemetry-runbook.md`
+
+---
+
+## Task 9.10: Integration Tests
+
+**Objective**: Verify all new metrics appear in Prometheus after a test workload.
+
+**What to do**:
+
+- Extend the existing telemetry integration test:
+  - Start rippled with `[telemetry] enabled=1` and `[insight] server=otel`
+  - Submit a batch of RPC calls and transactions
+  - Query Prometheus for each new metric family
+  - Assert non-zero values for: NodeStore reads, cache hit rates, TxQ count, PerfLog RPC counters, object counts, load factors
+
+- Add unit tests for the `MetricsRegistry` class:
+  - Verify callback registration and deregistration
+  - Verify metric values match `get_counts` JSON output
+  - Verify graceful behavior when telemetry is disabled
+
+**Key modified files**:
+
+- `src/test/telemetry/MetricsRegistry_test.cpp` (new)
+- Existing integration test script (extend assertions)
+
+---
+
+## Effort Summary
+
+| Task | Description                              | Effort | Risk   |
+| ---- | ---------------------------------------- | ------ | ------ |
+| 9.1  | NodeStore I/O metrics                    | 1d     | Low    |
+| 9.2  | Cache hit rate metrics + MetricsRegistry | 2d     | Medium |
+| 9.3  | TxQ metrics                              | 1d     | Low    |
+| 9.4  | PerfLog per-RPC metrics                  | 1.5d   | Medium |
+| 9.5  | PerfLog per-job metrics                  | 1d     | Low    |
+| 9.6  | Counted object instance metrics          | 0.5d   | Low    |
+| 9.7  | Fee escalation & load factor metrics     | 0.5d   | Low    |
+| 9.8  | New Grafana dashboards                   | 2d     | Low    |
+| 9.9  | Update documentation                     | 1d     | Low    |
+| 9.10 | Integration tests                        | 1.5d   | Medium |
+
+**Total Effort**: 12 days
+
+## Exit Criteria
+
+- [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline
+- [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK
+- [ ] Async gauge callbacks execute at 10s intervals without performance impact
+- [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue)
+- [ ] 2 existing dashboards updated with new panel groups
+- [ ] Integration test validates all new metric families are non-zero
+- [ ] No performance regression (< 0.5% CPU overhead from new callbacks)
+- [ ] Documentation updated with full new metric inventory
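
The exit criterion that every new metric family is non-zero (Task 9.10) could be checked along the following lines. This is a minimal sketch under stated assumptions: it works on text in the Prometheus exposition format that the caller has already fetched, the helper names `max_abs_by_metric` and `missing_or_zero` are hypothetical, and the actual integration test may instead query the Prometheus HTTP API.

```python
import re
from typing import Dict, Iterable, List

# Sample line: metric name, optional {labels}, whitespace, then the value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def max_abs_by_metric(exposition_text: str) -> Dict[str, float]:
    """Map each metric name to the largest absolute sample value seen."""
    seen: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            value = abs(float(match.group(3)))
        except ValueError:
            continue
        if value != value:  # NaN compares unequal to itself; treat it as zero
            value = 0.0
        name = match.group(1)
        seen[name] = max(seen.get(name, 0.0), value)
    return seen


def missing_or_zero(exposition_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected metric names that are absent or have only zero samples."""
    seen = max_abs_by_metric(exposition_text)
    return [name for name in expected if seen.get(name, 0.0) == 0.0]
```

A test would pass the scraped text and the list of expected families (for example `rippled_txq_count` and `rippled_object_count`) and assert that the returned list is empty. Some gauges can legitimately sit at zero on an idle node, so the expected list should be limited to families the test workload is known to exercise.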