Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines

- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
  - MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
  - Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
  - Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, gantt, effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pratik Mankawde
2026-03-09 18:06:46 +00:00
parent 5dcf366f8f
commit b73592f934
6 changed files with 1671 additions and 33 deletions


@@ -51,6 +51,15 @@ gantt
section Phase 8
Log-Trace Correlation :p8, after p7, 1w
section Phase 9 (Future)
Internal Metric Gap Fill :p9, after p8, 2.5w
section Phase 10 (Future)
Workload Validation :p10, after p9, 2w
section Phase 11 (Future)
Third-Party Collection :p11, after p10, 3w
```
---
@@ -649,6 +658,272 @@ flowchart LR
---
## 6.8.2 Phase 9: Internal Metric Instrumentation Gap Fill (Weeks 14-15) — Future Enhancement
> **Status**: Planned, not yet implemented.
### Motivation
Phases 1-8 establish trace spans, a StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, 50+ metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most: NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
### Architecture
Hybrid approach: two instrumentation strategies, chosen by proximity to existing code:
```mermaid
flowchart TB
subgraph rippled["rippled process"]
subgraph existing["Existing beast::insight registrations"]
NS["NodeStore I/O<br/>(Database.cpp)"]
end
subgraph newreg["New OTel MetricsRegistry"]
CR["Cache Hit Rates<br/>(async gauge callbacks)"]
TQ["TxQ Metrics<br/>(async gauge callbacks)"]
PL["PerfLog RPC/Job<br/>(counters + histograms)"]
CO["CountedObjects<br/>(async gauge callbacks)"]
LF["Load Factors<br/>(async gauge callbacks)"]
end
end
subgraph export["Export Pipelines"]
BI["beast::insight<br/>OTelCollector (Phase 7)"]
OS["OTel Metrics SDK<br/>PeriodicMetricReader"]
end
NS --> BI
CR --> OS
TQ --> OS
PL --> OS
CO --> OS
LF --> OS
BI --> OTLP["OTLP/HTTP :4318<br/>/v1/metrics"]
OS --> OTLP
style rippled fill:#1a2633,color:#ccc,stroke:#4a90d9
style existing fill:#2a4a6b,color:#fff,stroke:#4a90d9
style newreg fill:#2a4a6b,color:#fff,stroke:#4a90d9
style export fill:#1a3320,color:#ccc,stroke:#5cb85c
style NS fill:#4a90d9,color:#fff,stroke:#2a6db5
style CR fill:#5cb85c,color:#fff,stroke:#3d8b3d
style TQ fill:#5cb85c,color:#fff,stroke:#3d8b3d
style PL fill:#5cb85c,color:#fff,stroke:#3d8b3d
style CO fill:#5cb85c,color:#fff,stroke:#3d8b3d
style LF fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BI fill:#449d44,color:#fff,stroke:#2d6e2d
style OS fill:#449d44,color:#fff,stroke:#2d6e2d
style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
```
- **beast::insight extensions** (blue): NodeStore I/O metrics added near existing `Database.cpp` registrations, exported via Phase 7's `OTelCollector`.
- **OTel MetricsRegistry** (green): New centralized class using `ObservableGauge` async callbacks for cache, TxQ, PerfLog, CountedObjects, and load factors, polled at 10s intervals by `PeriodicMetricReader`.
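The callback-based registry pattern can be sketched in a few lines. This is an illustrative Python model only, not the planned C++ `MetricsRegistry` and not the OTel SDK API: the class `GaugeRegistry`, its method names, and the sample values are hypothetical. It shows gauges being registered once as callbacks and then sampled on each reader tick.

```python
from typing import Callable, Dict


class GaugeRegistry:
    """Hypothetical stand-in for the planned MetricsRegistry (illustration only)."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, Callable[[], float]] = {}

    def register(self, name: str, callback: Callable[[], float]) -> None:
        # Reject duplicates so two subsystems cannot silently share a name.
        if name in self._callbacks:
            raise ValueError(f"gauge already registered: {name}")
        self._callbacks[name] = callback

    def deregister(self, name: str) -> None:
        self._callbacks.pop(name, None)

    def collect(self) -> Dict[str, float]:
        """One reader tick: invoke every callback and return a snapshot."""
        return {name: cb() for name, cb in self._callbacks.items()}


# Made-up state; real values would come from TxQ internals.
txq_state = {"count": 12}
registry = GaugeRegistry()
registry.register("rippled_txq_count", lambda: float(txq_state["count"]))

first = registry.collect()
txq_state["count"] = 20  # state changes between two reader ticks
second = registry.collect()
print(first, second)
```

In this model a periodic reader would call `collect()` every 10 seconds; the instrumented subsystem never pushes values itself, which is what keeps the callbacks cheap when nothing is reading them.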
### Third-Party Consumer Context
| Consumer Category | Key Metrics They Need From Phase 9 |
| ---------------------- | --------------------------------------------------------------- |
| Exchanges | Fee escalation levels, TxQ depth, settlement latency |
| Payment Processors | Load factors, io_latency, transaction throughput |
| Analytics Providers | NodeStore I/O, cache hit rates, counted objects |
| Validators / Operators | Per-job execution times, PerfLog RPC counters, consensus timing |
| Academic Researchers | Consensus performance time-series, fee market dynamics |
| Institutional Custody | Server health scores, reserve calculations, node availability |
### Tasks
| Task | Description | Effort | Risk |
| ---- | ----------------------------------------- | ------ | ------ |
| 9.1 | NodeStore I/O metrics | 1d | Low |
| 9.2 | Cache hit rate metrics + MetricsRegistry | 2d | Medium |
| 9.3 | TxQ metrics | 1d | Low |
| 9.4 | PerfLog per-RPC metrics | 1.5d | Medium |
| 9.5 | PerfLog per-job metrics | 1d | Low |
| 9.6 | Counted object instance metrics | 0.5d | Low |
| 9.7 | Fee escalation & load factor metrics | 0.5d | Low |
| 9.8 | New Grafana dashboards (2 new, 2 updated) | 2d | Low |
| 9.9 | Update documentation | 1d | Low |
| 9.10 | Integration tests | 1.5d | Medium |
**Total Effort**: 12 days
See [Phase9_taskList.md](./Phase9_taskList.md) for detailed per-task breakdown.
### Exit Criteria
- [ ] All ~50 new metrics visible in Prometheus via OTLP pipeline
- [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK
- [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue)
- [ ] No performance regression (< 0.5% CPU overhead from new callbacks)
- [ ] Documentation updated with full new metric inventory
---
## 6.8.3 Phase 10: Synthetic Workload Generation & Telemetry Validation (Weeks 16-17) — Future Enhancement
> **Status**: Planned, not yet implemented.
### Motivation
Before the telemetry stack (Phases 1-9) can be considered production-ready, we need automated proof that all 16 spans, 22 attributes, 300+ metrics, 10 Grafana dashboards, and log-trace correlation work correctly under realistic load. This phase establishes a reusable CI-integrated validation suite and performance benchmark baseline.
### Architecture
```mermaid
flowchart LR
subgraph harness["Docker Compose Workload Harness"]
direction TB
V1["Validator 1"] ~~~ V2["Validator 2"] ~~~ V3["Validator 3"]
V4["Validator 4"] ~~~ V5["Validator 5"]
end
subgraph generators["Workload Generators"]
RPC["RPC Load Generator<br/>(configurable RPS,<br/>command distribution)"]
TX["Transaction Submitter<br/>(Payment, Offer, NFT,<br/>Escrow, AMM mix)"]
end
subgraph validation["Validation Suite"]
SV["Span Validator<br/>(Jaeger/Tempo API)"]
MV["Metric Validator<br/>(Prometheus API)"]
LV["Log-Trace Validator<br/>(Loki API)"]
DV["Dashboard Validator<br/>(Grafana API)"]
BM["Benchmark Suite<br/>(CPU, memory, latency<br/>ON vs OFF comparison)"]
end
generators --> harness
harness --> validation
style harness fill:#1a2633,color:#ccc,stroke:#4a90d9
style generators fill:#1a3320,color:#ccc,stroke:#5cb85c
style validation fill:#332a1a,color:#ccc,stroke:#f0ad4e
style V1 fill:#4a90d9,color:#fff,stroke:#2a6db5
style V2 fill:#4a90d9,color:#fff,stroke:#2a6db5
style V3 fill:#4a90d9,color:#fff,stroke:#2a6db5
style V4 fill:#4a90d9,color:#fff,stroke:#2a6db5
style V5 fill:#4a90d9,color:#fff,stroke:#2a6db5
style RPC fill:#5cb85c,color:#fff,stroke:#3d8b3d
style TX fill:#5cb85c,color:#fff,stroke:#3d8b3d
style SV fill:#f0ad4e,color:#000,stroke:#c78c2e
style MV fill:#f0ad4e,color:#000,stroke:#c78c2e
style LV fill:#f0ad4e,color:#000,stroke:#c78c2e
style DV fill:#f0ad4e,color:#000,stroke:#c78c2e
style BM fill:#f0ad4e,color:#000,stroke:#c78c2e
```
### Tasks
| Task | Description | Effort | Risk |
| ---- | -------------------------------------- | ------ | ------ |
| 10.1 | Multi-node test harness (5 validators) | 2d | Medium |
| 10.2 | RPC load generator | 1d | Low |
| 10.3 | Transaction submitter (6+ tx types) | 2d | Medium |
| 10.4 | Telemetry validation suite | 2d | Medium |
| 10.5 | Performance benchmark suite | 1.5d | Low |
| 10.6 | CI integration | 1d | Medium |
| 10.7 | Documentation | 0.5d | Low |
**Total Effort**: 10 days
See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown.
### Exit Criteria
- [ ] 5-node validator cluster starts and reaches consensus in docker-compose
- [ ] Validation suite confirms all 16 spans, 22 attributes, 300+ metrics
- [ ] All 10 Grafana dashboards render data (no empty panels)
- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- [ ] CI workflow runs validation on telemetry branch changes
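A minimal sketch of how the benchmark exit criterion could be evaluated, assuming the CPU budget is a relative increase over the telemetry-OFF baseline and the memory budget is an absolute difference in MB. Both interpretations, the function names, and the sample numbers are assumptions for illustration.

```python
def relative_overhead_pct(baseline: float, with_telemetry: float) -> float:
    """Relative overhead of the telemetry-ON run, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_telemetry - baseline) / baseline * 100.0


def within_budget(
    cpu_off: float,
    cpu_on: float,
    mem_off_mb: float,
    mem_on_mb: float,
    max_cpu_pct: float = 3.0,
    max_mem_mb: float = 5.0,
) -> bool:
    """True when both the CPU and the memory overhead stay under their limits."""
    cpu_ok = relative_overhead_pct(cpu_off, cpu_on) < max_cpu_pct
    mem_ok = (mem_on_mb - mem_off_mb) < max_mem_mb
    return cpu_ok and mem_ok


# Made-up measurements: 40.0 vs 41.0 CPU units, 1000 vs 1004 MB resident memory.
print(within_budget(40.0, 41.0, 1000.0, 1004.0))
```

Comparing against a baseline measured in the same harness run, rather than a stored constant, keeps the check meaningful when the CI hardware changes.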
---
## 6.8.4 Phase 11: Third-Party Data Collection Pipelines (Weeks 18-20) — Future Enhancement
> **Status**: Planned, not yet implemented.
### Motivation
rippled has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (`server_info`, `get_counts`, `fee`, `peers`, `validators`, `feature`). Every external consumer (exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms) must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver.
### Architecture
```mermaid
flowchart LR
subgraph receiver["Custom OTel Collector Receiver (Go)"]
direction TB
SI["server_info<br/>collector"]
GC["get_counts<br/>collector"]
FE["fee<br/>collector"]
PE["peers<br/>collector"]
VA["validators<br/>collector"]
DX["DEX/AMM<br/>collector<br/>(optional)"]
end
rippled["rippled<br/>Admin RPC<br/>:5005"] -->|"JSON-RPC<br/>poll every 30s"| receiver
receiver -->|"xrpl_* metrics"| PROM["Prometheus<br/>:9090"]
receiver -->|"OTLP export"| OTLP["Any OTLP-<br/>compatible<br/>backend"]
PROM --> GF["Grafana<br/>4 new dashboards"]
PROM --> AL["Prometheus<br/>Alerting Rules"]
style receiver fill:#1a3320,color:#ccc,stroke:#5cb85c
style SI fill:#5cb85c,color:#fff,stroke:#3d8b3d
style GC fill:#5cb85c,color:#fff,stroke:#3d8b3d
style FE fill:#5cb85c,color:#fff,stroke:#3d8b3d
style PE fill:#5cb85c,color:#fff,stroke:#3d8b3d
style VA fill:#5cb85c,color:#fff,stroke:#3d8b3d
style DX fill:#449d44,color:#fff,stroke:#2d6e2d
style rippled fill:#4a90d9,color:#fff,stroke:#2a6db5
style PROM fill:#f0ad4e,color:#000,stroke:#c78c2e
style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
style GF fill:#5bc0de,color:#000,stroke:#3aa8c1
style AL fill:#d9534f,color:#fff,stroke:#b52d2d
```
### Third-Party Consumer Gap Analysis
| Consumer Category | Data Unlocked by Phase 11 |
| ---------------------- | ------------------------------------------------------------ |
| Exchanges | Real-time fee estimates, TxQ capacity, server health scores |
| Payment Processors | Settlement latency percentiles, corridor health |
| Analytics Providers | Validator metrics, network topology, amendment voting status |
| DeFi / AMM | AMM pool TVL, DEX order book depth, trade volumes |
| Validators / Operators | Per-peer latency, version distribution, UNL health, alerting |
| Compliance | Transaction volume trends, network growth metrics |
| Academic Researchers | Consensus performance time-series, decentralization metrics |
| CBDC / Tokenization | Token supply tracking, trust line adoption, freeze status |
| Institutional Custody | Multi-sig status, escrow tracking, reserve calculations |
| Wallet Providers | Server health for node selection, fee prediction data |
### Tasks
| Task | Description | Effort | Risk |
| ----- | ------------------------------------- | ------ | ------ |
| 11.1 | OTel Collector receiver scaffold (Go) | 1.5d | Medium |
| 11.2 | server_info / server_state collector | 2d | Low |
| 11.3 | get_counts collector | 1.5d | Low |
| 11.4 | Peer topology collector | 1.5d | Medium |
| 11.5 | Validator & amendment collector | 1d | Low |
| 11.6 | Fee & TxQ collector | 0.5d | Low |
| 11.7 | DEX & AMM collector (optional) | 1.5d | Medium |
| 11.8 | Prometheus alerting rules | 1d | Low |
| 11.9 | New Grafana dashboards (4) | 2d | Low |
| 11.10 | Integration with Phase 10 validation | 1d | Low |
| 11.11 | Documentation | 1d | Low |
**Total Effort**: 15 days
See [Phase11_taskList.md](./Phase11_taskList.md) for detailed per-task breakdown.
### Exit Criteria
- [ ] Custom OTel Collector receiver exports all `xrpl_*` metrics to Prometheus
- [ ] 4 new Grafana dashboards operational (Validator Health, Network Topology, Fee Market, DEX/AMM)
- [ ] Prometheus alerting rules fire correctly for simulated failures
- [ ] Receiver handles rippled restart/unavailability gracefully
- [ ] Go receiver has unit tests with >80% coverage
---
## 6.9 Risk Assessment
```mermaid
@@ -705,22 +980,35 @@ pie showData
"Phase 3: Transaction Tracing" : 11
"Phase 4: Consensus Tracing" : 11
"Phase 5: Documentation" : 5
"Phase 6: StatsD Bridge" : 5.6
"Phase 7: Native OTel Metrics" : 8
"Phase 8: Log-Trace Correlation" : 4.5
"Phase 9: Metric Gap Fill" : 12
"Phase 10: Workload Validation" : 10
"Phase 11: Third-Party Collection" : 15
```
**Total Effort Distribution (47 developer-days)**
**Total Effort Distribution (102.1 developer-days)**
</div>
### Resource Requirements
| Phase | Developers | Duration | Total Effort |
| --------- | ---------- | ----------- | ------------ |
| 1 | 2 | 2 weeks | 10 days |
| 2 | 1-2 | 2 weeks | 10 days |
| 3 | 2 | 2 weeks | 11 days |
| 4 | 2 | 2 weeks | 11 days |
| 5 | 1 | 1 week | 5 days |
| **Total** | **2** | **9 weeks** | **47 days** |
| Phase | Developers | Duration | Total Effort | Status |
| ---------------- | ---------- | ------------ | -------------- | ------------------ |
| 1 | 2 | 2 weeks | 10 days | Active |
| 2 | 1-2 | 2 weeks | 10 days | Active |
| 3 | 2 | 2 weeks | 11 days | Active |
| 4 | 2 | 2 weeks | 11 days | Active |
| 5 | 1 | 1 week | 5 days | Active |
| 6 | 1 | 1 week | 5.6 days | Active |
| 7 | 1-2 | 2 weeks | 8 days | Active |
| 8 | 1 | 1 week | 4.5 days | Active |
| 9 | 1-2 | 2.5 weeks | 12 days | Future Enhancement |
| 10 | 1 | 2 weeks | 10 days | Future Enhancement |
| 11 | 1-2 | 3 weeks | 15 days | Future Enhancement |
| **Total (1-8)** | **2** | **13 weeks** | **65.1 days** | |
| **Total (1-11)** | **2** | **20 weeks** | **102.1 days** | |
---
@@ -924,16 +1212,19 @@ Clear, measurable criteria for each phase.
### 6.13.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------------- | --------------------------- | -------------- |
| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 |
| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 |
| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 |
| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 |
| Phase 5 | Production deployment | Operators trained | End of Week 9 |
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 |
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 |
| Phase 8 | trace_id in logs + Loki | Tempo-Loki correlation | End of Week 13 |
| Phase | Primary Metric | Secondary Metric | Deadline | Status |
| -------- | -------------------------------- | --------------------------- | -------------- | ------------------ |
| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 | Active |
| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 | Active |
| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 | Active |
| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 | Active |
| Phase 5 | Production deployment | Operators trained | End of Week 9 | Active |
| Phase 6 | StatsD metrics in Prometheus | 3 dashboards operational | End of Week 10 | Active |
| Phase 7 | All metrics via OTLP | No StatsD dependency | End of Week 12 | Active |
| Phase 8 | trace_id in logs + Loki | Tempo-Loki correlation | End of Week 13 | Active |
| Phase 9 | 50+ new internal metrics in Prom | 2 new dashboards | End of Week 15 | Future Enhancement |
| Phase 10 | Full telemetry stack validated | < 3% CPU overhead proven | End of Week 17 | Future Enhancement |
| Phase 11 | Third-party metrics via receiver | 4 new dashboards + alerting | End of Week 20 | Future Enhancement |
---


@@ -37,6 +37,18 @@
| **PerfLog** | Existing performance logging system in rippled |
| **Beast Insight** | Existing metrics framework in rippled |
### Phase 9-11 Terms
| Term | Definition |
| --------------------------- | ------------------------------------------------------------------------- |
| **MetricsRegistry** | Centralized class for OTel async gauge registrations (Phase 9) |
| **ObservableGauge** | OTel Metrics SDK async instrument polled via callback at fixed intervals |
| **PeriodicMetricReader** | OTel SDK component that invokes gauge callbacks at configurable intervals |
| **CountedObject** | rippled template that tracks live instance counts via atomic counters |
| **TxQ** | Transaction queue managing fee escalation and ordering |
| **Load Factor** | Combined multiplier affecting transaction cost (local, cluster, network) |
| **OTel Collector Receiver** | Custom Go plugin that polls rippled RPC and emits OTel metrics (Phase 11) |
---
## 8.2 Span Hierarchy Visualization
@@ -107,10 +119,11 @@ flowchart TB
## 8.4 Version History
| Version | Date | Author | Changes |
| ------- | ---------- | ------ | --------------------------------- |
| 1.0 | 2026-02-12 | - | Initial implementation plan |
| 1.1 | 2026-02-13 | - | Refactored into modular documents |
| Version | Date | Author | Changes |
| ------- | ---------- | ------ | -------------------------------------------- |
| 1.0 | 2026-02-12 | - | Initial implementation plan |
| 1.1 | 2026-02-13 | - | Refactored into modular documents |
| 1.2 | 2026-03-09 | - | Added Phases 9-11 (future enhancement plans) |
---
@@ -135,16 +148,83 @@ flowchart TB
### Task Lists
| Document | Description |
| -------------------------------------------------------------------------- | -------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration |
| [Phase8_taskList.md](./Phase8_taskList.md) | Log-trace correlation |
| Document | Description |
| -------------------------------------------------------------------------- | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [Phase7_taskList.md](./Phase7_taskList.md) | Native OTel metrics migration |
| [Phase8_taskList.md](./Phase8_taskList.md) | Log-trace correlation |
| [Phase9_taskList.md](./Phase9_taskList.md) | Internal metric instrumentation gap fill (future) |
| [Phase10_taskList.md](./Phase10_taskList.md) | Synthetic workload generation & validation (future) |
| [Phase11_taskList.md](./Phase11_taskList.md) | Third-party data collection pipelines (future) |
> **Note**: Phases 1 and 6 do not have separate task list files. Phase 1 tasks are documented in [06-implementation-phases.md §6.2](./06-implementation-phases.md). Phase 6 tasks are documented in [06-implementation-phases.md §6.7](./06-implementation-phases.md).
---
## 8.6 Phase 9-11 Cross-Reference Guide
This guide maps Phase 9-11 content to its location across the documentation.
### Phase 9: Internal Metric Instrumentation Gap Fill
| Content | Location |
| ------------------------------- | ------------------------------------------------------------------------ |
| Plan & architecture | [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) |
| Task list (10 tasks, 12d) | [Phase9_taskList.md](./Phase9_taskList.md) |
| Future metric definitions (~50) | [09-data-collection-reference.md §5b](./09-data-collection-reference.md) |
| New class: `MetricsRegistry` | `src/xrpld/telemetry/MetricsRegistry.h/.cpp` (planned) |
| New dashboards | `rippled-fee-market`, `rippled-job-queue` (planned) |
**Metric categories**: NodeStore I/O, Cache Hit Rates, TxQ, PerfLog Per-RPC, PerfLog Per-Job, Counted Objects, Fee Escalation & Load Factors.
### Phase 10: Synthetic Workload Generation & Telemetry Validation
| Content | Location |
| ------------------------ | ------------------------------------------------------------------------ |
| Plan & architecture | [06-implementation-phases.md §6.8.3](./06-implementation-phases.md) |
| Task list (7 tasks, 10d) | [Phase10_taskList.md](./Phase10_taskList.md) |
| Validation inventory | [09-data-collection-reference.md §5c](./09-data-collection-reference.md) |
| Test harness | `docker/telemetry/docker-compose.workload.yaml` (planned) |
| CI workflow | `.github/workflows/telemetry-validation.yml` (planned) |
**Validates**: 16 spans, 22 attributes, 300+ metrics, 10 dashboards, log-trace correlation.
### Phase 11: Third-Party Data Collection Pipelines
| Content | Location |
| --------------------------------- | ------------------------------------------------------------------------ |
| Plan & architecture | [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) |
| Task list (11 tasks, 15d) | [Phase11_taskList.md](./Phase11_taskList.md) |
| External metric definitions (~30) | [09-data-collection-reference.md §5d](./09-data-collection-reference.md) |
| Custom OTel Collector receiver | `docker/telemetry/otel-rippled-receiver/` (planned) |
| Prometheus alerting rules (11) | [09-data-collection-reference.md §5d](./09-data-collection-reference.md) |
| New dashboards (4) | Validator Health, Network Topology, Fee Market (External), DEX & AMM |
**Consumer categories**: Exchanges, Payment Processors, DeFi/AMM, NFT Marketplaces, Analytics Providers, Wallets, Compliance, Academic Researchers, Institutional Custody, CBDC Bridge Operators.
---
## 8.7 Effort Summary (All Phases)
| Phase | Description | Effort | Status |
| ----- | -------------------------------- | ---------- | ------------------ |
| 1 | Core SDK integration | 5d | Active |
| 2 | RPC tracing | 5d | Active |
| 3 | Peer & consensus tracing | 8d | Active |
| 4 | Transaction lifecycle | 7d | Active |
| 5 | Ledger & advanced | 7.1d | Active |
| 6 | StatsD → OTel bridge | 8d | Active |
| 7 | Native OTel metrics | 15d | Active |
| 8 | Log-trace correlation | 10d | Active |
| 9 | Internal metric gap fill | 12d | Future Enhancement |
| 10 | Workload generation & validation | 10d | Future Enhancement |
| 11 | Third-party data pipelines | 15d | Future Enhancement |
| | **Total** | **102.1d** | |
---


@@ -567,6 +567,217 @@ count_over_time({job="rippled"} |= "trace_id=" [5m])
---
## 5b. Future: Internal Metric Gap Fill (Phase 9)
> **Status**: Planned, not yet implemented.
> **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context
> **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details
Phase 9 adds export paths for 50+ metrics that exist inside rippled but currently lack time-series export. It uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, and OTel `ObservableGauge` async callbacks for the new categories.
### New Metric Categories
#### NodeStore I/O (via beast::insight)
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | ----------------------------------- |
| `rippled_nodestore_reads_total` | Gauge | Cumulative read operations |
| `rippled_nodestore_reads_hit` | Gauge | Cache-served reads |
| `rippled_nodestore_writes` | Gauge | Cumulative write operations |
| `rippled_nodestore_written_bytes` | Gauge | Cumulative bytes written |
| `rippled_nodestore_read_bytes` | Gauge | Cumulative bytes read |
| `rippled_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
| `rippled_nodestore_write_load` | Gauge | Current write load score |
| `rippled_nodestore_read_queue` | Gauge | Items in read queue |
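Because most of these gauges carry cumulative values, a consumer has to difference two samples to get a per-interval figure. The sketch below derives a mean read latency from two scrapes. The function name and sample numbers are hypothetical, and treating a backwards step as a restart is an assumption.

```python
from typing import Dict, Optional


def mean_read_latency_us(
    prev: Dict[str, float], curr: Dict[str, float]
) -> Optional[float]:
    """Mean NodeStore read latency (microseconds) between two samples."""
    reads = (
        curr["rippled_nodestore_reads_total"]
        - prev["rippled_nodestore_reads_total"]
    )
    duration = (
        curr["rippled_nodestore_read_duration_us"]
        - prev["rippled_nodestore_read_duration_us"]
    )
    if reads <= 0 or duration < 0:
        # No reads in the interval, or the values went backwards
        # (assumed here to indicate a process restart).
        return None
    return duration / reads


# Made-up samples taken one scrape interval apart.
prev_sample = {
    "rippled_nodestore_reads_total": 1000.0,
    "rippled_nodestore_read_duration_us": 50000.0,
}
curr_sample = {
    "rippled_nodestore_reads_total": 1500.0,
    "rippled_nodestore_read_duration_us": 80000.0,
}
latency = mean_read_latency_us(prev_sample, curr_sample)
print(latency)
```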
#### Cache Hit Rates (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ------------------------------- | ----- | ------------------------------------ |
| `rippled_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate |
| `rippled_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate |
| `rippled_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) |
| `rippled_cache_fullbelow_size` | Gauge | FullBelow cache size |
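For reference, a hit rate in the 0.0-1.0 range is conventionally the share of lookups served from the cache. The sketch below shows that calculation. How rippled computes the value internally is not specified here, so the formula and the zero-lookup convention are assumptions.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache, in the range 0.0-1.0."""
    if hits < 0 or misses < 0:
        raise ValueError("hits and misses must be non-negative")
    total = hits + misses
    # Assumed convention: report 0.0 when the cache has not been queried yet.
    return hits / total if total else 0.0


print(hit_rate(75, 25))
```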
#### Transaction Queue (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| -------------------------------------- | ----- | -------------------------------- |
| `rippled_txq_count` | Gauge | Current transactions in queue |
| `rippled_txq_max_size` | Gauge | Maximum queue capacity |
| `rippled_txq_in_ledger` | Gauge | Transactions in open ledger |
| `rippled_txq_per_ledger` | Gauge | Expected transactions per ledger |
| `rippled_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level |
| `rippled_txq_med_fee_level` | Gauge | Median fee level in queue |
| `rippled_txq_reference_fee_level` | Gauge | Reference fee level |
| `rippled_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed |
#### PerfLog Per-RPC Method (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------- | --------- | ----------------- | --------------------------- |
| `rippled_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `rippled_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed |
| `rippled_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls errored |
| `rippled_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
#### PerfLog Per-Job Type (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------- | --------- | ------------------- | --------------- |
| `rippled_job_queued_total` | Counter | `job_type="<name>"` | Jobs queued |
| `rippled_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `rippled_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `rippled_job_queued_duration_us_bucket` | Histogram | `job_type="<name>"` | Queue wait time |
| `rippled_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time |
#### Counted Object Instances (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ---------------------- | ----- | --------------- | ------------------------------- |
| `rippled_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry`
#### Fee Escalation & Load Factors (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | ------------------------------------ |
| `rippled_load_factor` | Gauge | Combined transaction cost multiplier |
| `rippled_load_factor_server` | Gauge | Server + cluster + network load |
| `rippled_load_factor_local` | Gauge | Local server load only |
| `rippled_load_factor_net` | Gauge | Network-wide load estimate |
| `rippled_load_factor_cluster` | Gauge | Cluster peer load |
| `rippled_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
| `rippled_load_factor_fee_queue` | Gauge | Queue entry fee level |
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | -------------------- | ----------- | ----------------------------------------------------------------- |
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth |
---
## 5c. Future: Synthetic Workload Generation & Telemetry Validation (Phase 10)
> **Status**: Planned, not yet implemented.
> **Plan details**: [06-implementation-phases.md §6.8.3](./06-implementation-phases.md) — motivation, architecture
> **Task breakdown**: [Phase10_taskList.md](./Phase10_taskList.md) — per-task implementation details
Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify all spans, metrics, dashboards, and log-trace correlation work end-to-end. Includes a benchmark suite comparing telemetry-ON vs telemetry-OFF overhead.
### Validated Telemetry Inventory
| Category | Expected Count | Validation Method |
| ------------------ | -------------- | -------------------------------- |
| Trace spans | 16 | Jaeger/Tempo API query |
| Span attributes | 22 | Per-span attribute assertion |
| StatsD metrics | 255+ | Prometheus query |
| Phase 9 metrics | 50+ | Prometheus query |
| SpanMetrics RED | 4 per span | Prometheus query |
| Grafana dashboards | 10 | Dashboard API "no data" check |
| Log-trace links | Present | Loki query + Tempo reverse check |
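The inventory above lends itself to a table-driven check. The following sketch compares observed counts against the expected values and reports any shortfall. The dictionary keys and the `find_shortfalls` helper are hypothetical, and treating every expected count as a minimum is an assumption; a real validator would obtain the observed counts from the Tempo, Prometheus, and Grafana APIs.

```python
from typing import Dict, Tuple

# Expected counts taken from the inventory table, treated as minimums.
EXPECTED_MINIMUMS: Dict[str, int] = {
    "trace_spans": 16,
    "span_attributes": 22,
    "statsd_metrics": 255,
    "phase9_metrics": 50,
    "grafana_dashboards": 10,
}


def find_shortfalls(observed: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {category: (observed, expected)} for every category below target."""
    shortfalls = {}
    for category, minimum in EXPECTED_MINIMUMS.items():
        seen = observed.get(category, 0)  # a missing category counts as zero
        if seen < minimum:
            shortfalls[category] = (seen, minimum)
    return shortfalls


# Made-up observation: one dashboard is missing.
observed_counts = {
    "trace_spans": 16,
    "span_attributes": 22,
    "statsd_metrics": 260,
    "phase9_metrics": 52,
    "grafana_dashboards": 9,
}
print(find_shortfalls(observed_counts))
```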
---
## 5d. Future: Third-Party Data Collection Pipelines (Phase 11)
> **Status**: Planned, not yet implemented.
> **Plan details**: [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) — motivation, architecture, consumer gap analysis
> **Task breakdown**: [Phase11_taskList.md](./Phase11_taskList.md) — per-task implementation details
Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports `xrpl_*` metrics for external consumers. It requires no rippled code changes.
### Exported Metrics (via Custom OTel Collector Receiver)
#### Node Health (from server_info)
| Prometheus Metric | Type | Description |
| --------------------------------------- | ----- | ----------------------------------------------- |
| `xrpl_server_state` | Gauge | Operating mode (0=disconnected ... 5=proposing) |
| `xrpl_server_state_duration_seconds` | Gauge | Seconds in current state |
| `xrpl_uptime_seconds` | Gauge | Consecutive seconds running |
| `xrpl_io_latency_ms` | Gauge | I/O subsystem latency |
| `xrpl_amendment_blocked` | Gauge | 1 if amendment-blocked, 0 otherwise |
| `xrpl_peers_count` | Gauge | Connected peers |
| `xrpl_validated_ledger_seq` | Gauge | Latest validated ledger sequence |
| `xrpl_validated_ledger_age_seconds` | Gauge | Seconds since last validated close |
| `xrpl_last_close_proposers` | Gauge | Proposers in last consensus round |
| `xrpl_last_close_converge_time_seconds` | Gauge | Last consensus round duration |
| `xrpl_load_factor` | Gauge | Transaction cost multiplier |
| `xrpl_state_duration_seconds` | Gauge | Per-state duration (`state` label) |
| `xrpl_state_transitions_total` | Gauge | Per-state transition count (`state` label) |
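The conversion step can be illustrated with a small mapping from a `server_info` response to some of the gauges above. The planned receiver is written in Go; this Python sketch only shows the shape of the mapping. The field-to-metric pairing, the helper name, and the sample payload are assumptions, and the state-related metrics are left out.

```python
from typing import Any, Dict


def server_info_to_metrics(response: Dict[str, Any]) -> Dict[str, float]:
    """Map a server_info JSON-RPC response to a flat metric dictionary."""
    info = response["result"]["info"]
    metrics: Dict[str, float] = {}

    def put(name: str, value: Any) -> None:
        # Skip fields the server did not report instead of emitting zeros.
        if value is not None:
            metrics[name] = float(value)

    put("xrpl_uptime_seconds", info.get("uptime"))
    put("xrpl_io_latency_ms", info.get("io_latency_ms"))
    put("xrpl_peers_count", info.get("peers"))
    put("xrpl_load_factor", info.get("load_factor"))

    validated = info.get("validated_ledger", {})
    put("xrpl_validated_ledger_seq", validated.get("seq"))
    put("xrpl_validated_ledger_age_seconds", validated.get("age"))

    last_close = info.get("last_close", {})
    put("xrpl_last_close_proposers", last_close.get("proposers"))
    put("xrpl_last_close_converge_time_seconds", last_close.get("converge_time_s"))

    # Assumption: the flag is only present when the server is blocked.
    metrics["xrpl_amendment_blocked"] = 1.0 if info.get("amendment_blocked") else 0.0
    return metrics


# Made-up payload with only the fields used above.
sample_response = {
    "result": {
        "info": {
            "uptime": 3600,
            "io_latency_ms": 1,
            "peers": 21,
            "load_factor": 1,
            "validated_ledger": {"seq": 100, "age": 3},
            "last_close": {"proposers": 35, "converge_time_s": 3.0},
        }
    }
}
node_metrics = server_info_to_metrics(sample_response)
print(node_metrics)
```

Omitting a metric when its source field is absent, rather than reporting zero, avoids misleading readings such as a validated ledger sequence of 0 while the node is still syncing.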
#### Peer Topology (from peers)
| Prometheus Metric | Type | Description |
| --------------------------- | ----- | ----------------------------------- |
| `xrpl_peers_inbound_count` | Gauge | Inbound peer connections |
| `xrpl_peers_outbound_count` | Gauge | Outbound peer connections |
| `xrpl_peer_latency_p50_ms` | Gauge | Median peer latency |
| `xrpl_peer_latency_p95_ms` | Gauge | p95 peer latency |
| `xrpl_peer_version_count` | Gauge | Peers per version (`version` label) |
| `xrpl_peer_diverged_count` | Gauge | Peers with diverged tracking status |
#### Validator & Amendment (from validators, feature)
| Prometheus Metric | Type | Description |
| ------------------------------------- | ----- | --------------------------------------- |
| `xrpl_trusted_validators_count` | Gauge | UNL validator count |
| `xrpl_amendment_enabled_count` | Gauge | Enabled amendments |
| `xrpl_amendment_majority_count` | Gauge | Amendments with majority |
| `xrpl_amendment_unsupported_majority` | Gauge | 1 if unsupported amendment has majority |
| `xrpl_validator_list_active` | Gauge | 1 if validator list is active |
#### Fee Market (from fee)
| Prometheus Metric | Type | Description |
| -------------------------------- | ----- | ------------------------------------- |
| `xrpl_fee_open_ledger_fee_drops` | Gauge | Minimum fee for open ledger inclusion |
| `xrpl_fee_median_fee_drops` | Gauge | Median fee level |
| `xrpl_fee_queue_size` | Gauge | Current transaction queue depth |
| `xrpl_fee_current_ledger_size` | Gauge | Transactions in current open ledger |
#### DEX & AMM (optional, from book_offers, amm_info)
| Prometheus Metric | Type | Labels | Description |
| -------------------------- | ----- | --------------------- | ---------------------- |
| `xrpl_amm_tvl_drops` | Gauge | `pool="<id>"` | Total value locked |
| `xrpl_amm_trading_fee` | Gauge | `pool="<id>"` | Pool trading fee (bps) |
| `xrpl_orderbook_bid_depth` | Gauge | `pair="<base/quote>"` | Total bid volume |
| `xrpl_orderbook_ask_depth` | Gauge | `pair="<base/quote>"` | Total ask volume |
| `xrpl_orderbook_spread` | Gauge | `pair="<base/quote>"` | Best bid-ask spread |
### New Grafana Dashboards (Phase 11)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | ----------------------------- | ----------- | ---------------------------------------------------------------------- |
| Validator Health | `rippled-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | `rippled-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | `rippled-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | `rippled-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |
### Prometheus Alerting Rules (Phase 11)
| Alert Name | Severity | Condition | For |
| ---------------------------------- | -------- | ----------------------------------------------------------- | --- |
| `XRPLServerNotFull`                | Critical | `xrpl_server_state < 4`                                     | 15m |
| `XRPLAmendmentBlocked` | Critical | `xrpl_amendment_blocked == 1` | 1m |
| `XRPLNoPeers` | Critical | `xrpl_peers_count == 0` | 5m |
| `XRPLLedgerStale` | Critical | `xrpl_validated_ledger_age_seconds > 120` | 2m |
| `XRPLHighIOLatency` | Critical | `xrpl_io_latency_ms > 100` | 5m |
| `XRPLUnsupportedAmendmentMajority` | Critical | `xrpl_amendment_unsupported_majority == 1` | 1m |
| `XRPLLowPeerCount` | Warning | `xrpl_peers_count < 10` | 15m |
| `XRPLHighLoadFactor` | Warning | `xrpl_load_factor > 10` | 10m |
| `XRPLSlowConsensus` | Warning | `xrpl_last_close_converge_time_seconds > 6` | 5m |
| `XRPLValidatorListExpiring` | Warning | `(xrpl_validator_list_expiration_seconds - time()) < 86400` | 1h |
| `XRPLStateFlapping`                | Warning  | `increase(xrpl_state_transitions_total{state="full"}[1h]) > 2` | 30m |
---
## 6. Known Issues
| Issue | Impact | Status |

@@ -0,0 +1,256 @@
# Phase 10: Synthetic Workload Generation & Telemetry Validation — Task List
> **Status**: Future Enhancement
>
> **Goal**: Build tools that generate realistic XRPL traffic to validate the full Phases 1-9 telemetry stack end-to-end — all spans, attributes, metrics, dashboards, and log-trace correlation — under controlled load.
>
> **Scope**: Python/shell test harness + multi-node docker-compose environment + automated validation scripts + performance benchmarks.
>
> **Branch**: `pratik/otel-phase10-workload-validation` (from `pratik/otel-phase9-metric-gap-fill`)
>
> **Depends on**: Phase 9 (internal metric gap fill) — validates the full metric surface
### Related Plan Documents
| Document | Relevance |
| -------------------------------------------------------------------- | --------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 10 plan: motivation, architecture, exit criteria (§6.8.3) |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Defines the full inventory of spans/metrics to validate |
| [Phase9_taskList.md](./Phase9_taskList.md) | Prerequisite — all internal metrics must be emitting |
### Why This Phase Exists
Before Phases 1-9 can be considered production-ready, we need proof that:
1. All 16 spans fire with correct attributes under real transaction workloads
2. All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values
3. Log-trace correlation (Phase 8) produces clickable trace_id links in Loki
4. All 10 Grafana dashboards render meaningful data (no empty panels)
5. Performance overhead stays within bounds (< 3% CPU, < 5MB memory)
6. The telemetry stack survives sustained load without data loss or queue backpressure
---
## Task 10.1: Multi-Node Test Harness
**Objective**: Create a docker-compose environment with 5 validator nodes that produces real consensus rounds.
**What to do**:
- Create `docker/telemetry/docker-compose.workload.yaml`:
  - 5 rippled validator nodes with UNL configured for each other
  - All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel`
  - Full OTel stack: Collector, Jaeger, Tempo, Prometheus, Loki, Grafana
  - Shared network with service discovery
- Each node should:
  - Generate validator keys at startup
  - Configure all 5 nodes in its UNL
  - Enable all trace categories including `trace_peer=1`
  - Write logs to a file tailed by the OTel Collector filelog receiver
- Include a `Makefile` target: `make telemetry-workload-up` / `make telemetry-workload-down`
**Key files**:
- New: `docker/telemetry/docker-compose.workload.yaml`
- New: `docker/telemetry/workload/generate-validator-keys.sh`
- New: `docker/telemetry/workload/xrpld-validator.cfg.template`
---
## Task 10.2: RPC Load Generator
**Objective**: Configurable tool that fires all traced RPC commands at controlled rates.
**What to do**:
- Create `docker/telemetry/workload/rpc_load_generator.py`:
  - Connects to one or more rippled WebSocket endpoints
  - Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc.
  - Configurable parameters: rate (RPS), duration, command distribution weights
  - Injects `traceparent` HTTP headers to test W3C context propagation
  - Logs progress and errors to stdout
- Command distribution should match realistic production ratios:
  - 40% `server_info` / `fee` (health checks)
  - 30% `account_info` / `account_lines` / `account_objects` (wallet queries)
  - 15% `ledger` / `ledger_data` (explorer queries)
  - 10% `tx` / `account_tx` (transaction lookups)
  - 5% `book_offers` / `amm_info` (DEX queries)
**Key files**:
- New: `docker/telemetry/workload/rpc_load_generator.py`
- New: `docker/telemetry/workload/requirements.txt`
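The weighted command selection and `traceparent` injection described in this task could look roughly like the sketch below. It is an illustration under assumptions, not the planned implementation: `COMMAND_GROUPS`, `pick_command`, and `make_traceparent` are hypothetical names, and commands inside a group are assumed to be equally likely. The header value follows the W3C Trace Context layout `version-traceid-spanid-flags`.

```python
import random
import secrets

# Hypothetical grouping that mirrors the distribution listed in this task.
COMMAND_GROUPS = [
    (40, ["server_info", "fee"]),
    (30, ["account_info", "account_lines", "account_objects"]),
    (15, ["ledger", "ledger_data"]),
    (10, ["tx", "account_tx"]),
    (5, ["book_offers", "amm_info"]),
]


def pick_command(rng: random.Random) -> str:
    """Pick a group by weight, then a command uniformly within that group."""
    weights = [weight for weight, _ in COMMAND_GROUPS]
    groups = [commands for _, commands in COMMAND_GROUPS]
    group = rng.choices(groups, weights=weights, k=1)[0]
    return rng.choice(group)


def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

Passing an explicit `random.Random` instance keeps a run reproducible when a seed is supplied, which helps when comparing two workload runs.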
---
## Task 10.3: Transaction Submitter
**Objective**: Generate diverse transaction types to exercise `tx.*` and `ledger.*` spans.
**What to do**:
- Create `docker/telemetry/workload/tx_submitter.py`:
  - Pre-funds test accounts from genesis account
  - Submits a mix of transaction types:
    - `Payment` (XRP and issued currencies): exercises `tx.process`, `tx.apply`
    - `OfferCreate` / `OfferCancel`: DEX activity
    - `TrustSet`: trust line creation for issued currencies
    - `NFTokenMint` / `NFTokenCreateOffer` / `NFTokenAcceptOffer`: NFT activity
    - `EscrowCreate` / `EscrowFinish`: escrow lifecycle
    - `AMMCreate` / `AMMDeposit` / `AMMWithdraw`: AMM pool operations (if amendment enabled)
  - Configurable: TPS target, transaction mix weights, duration
  - Monitors submission results and tracks success/failure rates
- The transaction mix ensures the telemetry captures the full range of ledger activity that third parties care about.
**Key files**:
- New: `docker/telemetry/workload/tx_submitter.py`
- New: `docker/telemetry/workload/test_accounts.json` (pre-generated keypairs)
---
## Task 10.4: Telemetry Validation Suite
**Objective**: Automated scripts that verify all expected telemetry data exists after a workload run.
**What to do**:
- Create `docker/telemetry/workload/validate_telemetry.py`:
  **Span validation** (queries Jaeger/Tempo API):
  - Assert all 16 span names appear in traces
  - Assert each span has its required attributes (22 total attributes across spans)
  - Assert parent-child relationships are correct (`rpc.request` → `rpc.process` → `rpc.command.*`)
  - Assert span durations are reasonable (> 0, < 60s)
  **Metric validation** (queries Prometheus API):
  - Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket`
  - Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc.
  - Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*`
  - Assert metric label cardinality is within bounds
  **Log-trace correlation validation** (queries Loki API):
  - Assert logs contain `trace_id=` and `span_id=` fields
  - Pick a random trace_id from Jaeger, query Loki for matching logs, and assert results exist
  - Assert Grafana derived field links are functional
  **Dashboard validation**:
  - For each of the 10 Grafana dashboards, query the dashboard API and assert no panels show "No data"
- Output: JSON report with pass/fail per check, suitable for CI.
**Key files**:
- New: `docker/telemetry/workload/validate_telemetry.py`
- New: `docker/telemetry/workload/expected_spans.json` (span inventory for validation)
- New: `docker/telemetry/workload/expected_metrics.json` (metric inventory for validation)
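The per-check JSON report could be assembled along these lines. This is a minimal sketch under assumptions: `build_report` and the report field names are hypothetical, and the span and metric lookups against Jaeger/Tempo and Prometheus are taken as already done, so only the pass/fail bookkeeping is shown.

```python
import json


def build_report(expected_spans, observed_spans, metric_values):
    """Return a CI-friendly report with one pass/fail entry per check.

    metric_values maps a metric name to its latest sample, or to None
    when the query returned no series at all.
    """
    checks = []
    observed = set(observed_spans)
    for name in expected_spans:
        checks.append({"check": f"span:{name}", "passed": name in observed})
    for name, value in metric_values.items():
        # A metric passes only if it exists and is non-zero.
        passed = value is not None and value != 0
        checks.append({"check": f"metric:{name}", "passed": passed})
    failed = [c["check"] for c in checks if not c["passed"]]
    return {"passed": not failed, "failed": failed, "checks": checks}


def render(report) -> str:
    """Serialize the report so CI can archive and parse it."""
    return json.dumps(report, indent=2, sort_keys=True)
```

A wrapper script could then exit non-zero whenever `report["passed"]` is false, which gives the CI job a single signal to act on.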
---
## Task 10.5: Performance Benchmark Suite
**Objective**: Measure CPU/memory/latency overhead of the telemetry stack.
**What to do**:
- Create `docker/telemetry/workload/benchmark.sh`:
  - **Baseline run**: Start cluster with `[telemetry] enabled=0`, run transaction workload for 5 minutes, record metrics
  - **Telemetry run**: Start cluster with full telemetry enabled, run identical workload, record metrics
  - **Comparison**: Calculate deltas for:
    - CPU usage (per-node average)
    - Memory RSS (per-node peak)
    - RPC p99 latency
    - Transaction throughput (TPS)
    - Consensus round time p95
    - Ledger close time p95
- Output: Markdown table comparing baseline vs. telemetry, with pass/fail against targets:
  - CPU overhead < 3%
  - Memory overhead < 5MB
  - RPC latency impact < 2ms p99
  - Throughput impact < 5%
  - Consensus impact < 1%
- Store results in `docker/telemetry/workload/benchmark-results/` for historical tracking.
**Key files**:
- New: `docker/telemetry/workload/benchmark.sh`
- New: `docker/telemetry/workload/collect_system_metrics.sh`
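The comparison step could be expressed as in the sketch below, shown in Python for readability even though the task plans a shell script. The dictionary keys, `compare`, and `to_markdown` are hypothetical. Two interpretations are assumptions here: CPU overhead is treated as a difference in percentage points, and throughput and consensus impact as relative change against the baseline.

```python
def compare(baseline: dict, telemetry: dict) -> list:
    """Return (name, delta, limit, passed) tuples for each benchmark target."""

    def pct_change(before, after):
        return 100.0 * (after - before) / before

    rows = [
        # Assumption: CPU values are average utilization in percent.
        ("CPU overhead (pp)", telemetry["cpu_pct"] - baseline["cpu_pct"], 3.0),
        ("Memory overhead (MB)", telemetry["rss_mb"] - baseline["rss_mb"], 5.0),
        ("RPC p99 impact (ms)", telemetry["rpc_p99_ms"] - baseline["rpc_p99_ms"], 2.0),
        # A throughput drop is the negated relative change in TPS.
        ("Throughput drop (%)", -pct_change(baseline["tps"], telemetry["tps"]), 5.0),
        ("Consensus p95 impact (%)",
         pct_change(baseline["consensus_p95_s"], telemetry["consensus_p95_s"]), 1.0),
    ]
    return [(name, delta, limit, delta < limit) for name, delta, limit in rows]


def to_markdown(results: list) -> str:
    """Render the comparison as a Markdown table with PASS/FAIL per row."""
    lines = ["| Metric | Delta | Target | Result |", "| --- | --- | --- | --- |"]
    for name, delta, limit, ok in results:
        verdict = "PASS" if ok else "FAIL"
        lines.append(f"| {name} | {delta:.2f} | < {limit:g} | {verdict} |")
    return "\n".join(lines)
```

Whichever interpretation of the targets is chosen, stating it in the report header avoids ambiguity when results are compared across runs.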
---
## Task 10.6: CI Integration
**Objective**: Wire the validation suite into CI for regression detection.
**What to do**:
- Create a CI workflow (GitHub Actions or equivalent) that:
  1. Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON`
  2. Starts the multi-node workload harness
  3. Runs the RPC load generator + transaction submitter for 2 minutes
  4. Runs the validation suite
  5. Runs the benchmark suite
  6. Fails the build if any validation check fails or benchmark exceeds thresholds
  7. Archives the validation report and benchmark results as artifacts
- This should be a separate workflow (not part of the main CI), triggered manually or on telemetry-related branch changes.
**Key files**:
- New: `.github/workflows/telemetry-validation.yml`
- New: `docker/telemetry/workload/run-full-validation.sh` (orchestrator script)
---
## Task 10.7: Documentation
**Objective**: Document the workload tools and validation process.
**What to do**:
- Create `docker/telemetry/workload/README.md`:
- Quick start guide for running workload harness
- Configuration options for load generator and tx submitter
- How to read validation reports
- How to run benchmarks and interpret results
- Update `docs/telemetry-runbook.md`:
- Add "Validating Telemetry Stack" section
- Add "Performance Benchmarking" section
- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
- Add "Validation" section with expected metric/span counts
---
## Effort Summary
| Task | Description | Effort | Risk |
| ---- | --------------------------- | ------ | ------ |
| 10.1 | Multi-node test harness | 2d | Medium |
| 10.2 | RPC load generator | 1d | Low |
| 10.3 | Transaction submitter | 2d | Medium |
| 10.4 | Telemetry validation suite | 2d | Medium |
| 10.5 | Performance benchmark suite | 1.5d | Low |
| 10.6 | CI integration | 1d | Medium |
| 10.7 | Documentation | 0.5d | Low |
**Total Effort**: 10 days
## Exit Criteria
- [ ] 5-node validator cluster starts and reaches consensus in docker-compose
- [ ] RPC load generator fires all traced RPC commands at configurable rates
- [ ] Transaction submitter generates 6+ transaction types at configurable TPS
- [ ] Validation suite confirms all 16 spans, 22 attributes, 300+ metrics are present
- [ ] Log-trace correlation validated end-to-end (Loki ↔ Tempo)
- [ ] All 10 Grafana dashboards render data (no empty panels)
- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- [ ] CI workflow runs validation on telemetry branch changes
- [ ] Validation report output is CI-parseable (JSON with exit codes)

@@ -0,0 +1,471 @@
# Phase 11: Third-Party Data Collection Pipelines — Task List
> **Status**: Future Enhancement
>
> **Goal**: Build a custom OTel Collector receiver that periodically polls rippled's admin RPCs and exports structured metrics for external consumers — making all XRPL health, validator, peer, fee, and DEX data available as Prometheus/OTLP metrics without rippled code changes.
>
> **Scope**: Go-based OTel Collector receiver plugin + Grafana dashboards + Prometheus alerting rules.
>
> **Branch**: `pratik/otel-phase11-third-party-collection` (from `pratik/otel-phase10-workload-validation`)
>
> **Depends on**: Phase 10 (validation harness for testing the new receiver)
### Related Plan Documents
| Document | Relevance |
| -------------------------------------------------------------------- | --------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 11 plan: motivation, architecture, exit criteria (§6.8.4) |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Defines full metric inventory including third-party metrics |
| [Phase10_taskList.md](./Phase10_taskList.md) | Prerequisite — validation harness for testing |
### Third-Party Consumer Gap Analysis
This phase addresses the cross-cutting gap identified during research: **rippled has no native Prometheus/OTLP metrics export for data accessible only via RPC**. Every consumer (exchanges, payment processors, analytics providers, validators, researchers, compliance firms, custodians) must build custom JSON-RPC polling and conversion. This receiver centralizes that work.
| Consumer Category | Data Unlocked by This Phase |
| -------------------------- | ------------------------------------------------------------------ |
| **Exchanges** | Real-time fee estimates, TxQ capacity, server health scores |
| **Payment Processors** | Settlement latency percentiles, corridor health, path availability |
| **Analytics Providers** | Validator metrics, network topology, amendment voting status |
| **DeFi / AMM** | AMM pool TVL, DEX order book depth, trade volumes |
| **Validators / Operators** | Per-peer latency, version distribution, UNL health, alerting |
| **Compliance** | Transaction volume trends, network growth metrics |
| **Academic Researchers** | Consensus performance time-series, decentralization metrics |
| **CBDC / Tokenization** | Token supply tracking, trust line adoption, freeze status |
| **Institutional Custody** | Multi-sig status, escrow tracking, reserve calculations |
| **Wallet Providers** | Server health for node selection, fee prediction data |
---
## Task 11.1: OTel Collector Receiver Scaffold
**Objective**: Create the Go project structure for a custom OTel Collector receiver that polls rippled JSON-RPC.
**What to do**:
- Create `docker/telemetry/otel-rippled-receiver/`:
- `receiver.go` — implements `receiver.Metrics` interface
- `config.go` — configuration struct (endpoint, poll interval, enabled RPCs)
- `factory.go` — receiver factory registration
- `go.mod` / `go.sum` — Go module with OTel Collector SDK dependency
- Configuration model:
```yaml
rippled_receiver:
  endpoint: "http://localhost:5005" # rippled admin RPC
  poll_interval: 30s # how often to poll
  enabled_collectors:
    - server_info
    - get_counts
    - fee
    - peers
    - validators
    - feature
    - server_state
  amm_pools: [] # optional: AMM pool IDs to track
  book_offers_pairs: [] # optional: currency pairs for DEX depth
```
- Build a custom OTel Collector binary that includes this receiver alongside the standard receivers.
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/receiver.go`
- New: `docker/telemetry/otel-rippled-receiver/config.go`
- New: `docker/telemetry/otel-rippled-receiver/factory.go`
- New: `docker/telemetry/otel-rippled-receiver/go.mod`
- New: `docker/telemetry/otel-rippled-receiver/Dockerfile`
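The poll cycle that the receiver would run can be sketched as follows. The receiver itself is planned in Go; Python is used here only to illustrate the logic. `build_request` and `poll_once` are hypothetical names, the request body follows rippled's JSON-RPC shape (`method` plus a one-element `params` array), and the injectable `opener` exists only so the sketch can be exercised without a running node.

```python
import json
import urllib.request
from typing import Optional


def build_request(endpoint: str, method: str,
                  params: Optional[dict] = None) -> urllib.request.Request:
    """Build a JSON-RPC POST request for one rippled command."""
    body = json.dumps({"method": method, "params": [params or {}]}).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def poll_once(endpoint, collectors, opener=urllib.request.urlopen, timeout=5.0):
    """Call each enabled collector's RPC once; None marks an unreachable node."""
    results = {}
    for name in collectors:
        try:
            with opener(build_request(endpoint, name), timeout=timeout) as resp:
                results[name] = json.load(resp).get("result", {})
        except OSError:
            # URLError and socket timeouts derive from OSError: record the
            # failure and keep polling instead of crashing the receiver.
            results[name] = None
    return results
```

Recording a failed poll as `None` instead of raising matches the exit criterion that the receiver should survive a rippled restart and retry on the next interval.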
---
## Task 11.2: server_info / server_state Collector
**Objective**: Poll `server_info` and `server_state` and export all fields as OTel metrics.
**What to do**:
- Implement `serverInfoCollector` that calls `server_info` (admin) and extracts:
**Node Health Gauges:**
- `xrpl_server_state` (enum → int: disconnected=0, connected=1, syncing=2, tracking=3, full=4, proposing=5)
- `xrpl_server_state_duration_seconds`
- `xrpl_uptime_seconds`
- `xrpl_io_latency_ms`
- `xrpl_amendment_blocked` (0 or 1)
- `xrpl_peers_count`
- `xrpl_peer_disconnects_total`
- `xrpl_peer_disconnects_resources_total`
- `xrpl_jq_trans_overflow_total`
**Consensus Gauges:**
- `xrpl_last_close_proposers`
- `xrpl_last_close_converge_time_seconds`
- `xrpl_validation_quorum`
**Ledger Gauges:**
- `xrpl_validated_ledger_seq`
- `xrpl_validated_ledger_age_seconds`
- `xrpl_validated_ledger_base_fee_drops`
- `xrpl_validated_ledger_reserve_base_drops`
- `xrpl_validated_ledger_reserve_inc_drops`
- `xrpl_close_time_offset_seconds` (0 when absent)
**Load Factor Gauges:**
- `xrpl_load_factor`
- `xrpl_load_factor_server`
- `xrpl_load_factor_fee_escalation`
- `xrpl_load_factor_fee_queue`
- `xrpl_load_factor_local`
- `xrpl_load_factor_net`
- `xrpl_load_factor_cluster`
**State Accounting Gauges** (per state: disconnected, connected, syncing, tracking, full):
- `xrpl_state_duration_seconds{state="<name>"}`
- `xrpl_state_transitions_total{state="<name>"}`
**Validator Info** (when node is a validator):
- `xrpl_validator_list_count`
- `xrpl_validator_list_expiration_seconds` (epoch)
- `xrpl_validator_list_active` (0 or 1)
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/server_info.go`
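A subset of the extraction above could be sketched like this (in Python for brevity; the collector is planned in Go). `server_info_metrics` is a hypothetical name, the input is assumed to be the `info` object of a `server_info` response, and returning `-1` for an unrecognized state is an assumption that the plan does not specify.

```python
# Enum-to-int mapping as listed in this task.
STATE_VALUES = {
    "disconnected": 0,
    "connected": 1,
    "syncing": 2,
    "tracking": 3,
    "full": 4,
    "proposing": 5,
}


def server_info_metrics(info: dict) -> dict:
    """Flatten the `info` object of a server_info response into metric samples."""
    validated = info.get("validated_ledger", {})
    last_close = info.get("last_close", {})
    metrics = {
        # Assumption: -1 flags a state string the mapping does not know.
        "xrpl_server_state": STATE_VALUES.get(info.get("server_state"), -1),
        "xrpl_uptime_seconds": info.get("uptime", 0),
        "xrpl_io_latency_ms": info.get("io_latency_ms", 0),
        "xrpl_amendment_blocked": 1 if info.get("amendment_blocked") else 0,
        "xrpl_peers_count": info.get("peers", 0),
        "xrpl_validated_ledger_seq": validated.get("seq", 0),
        "xrpl_validated_ledger_age_seconds": validated.get("age", 0),
        "xrpl_last_close_proposers": last_close.get("proposers", 0),
        "xrpl_last_close_converge_time_seconds": last_close.get("converge_time_s", 0),
        "xrpl_load_factor": info.get("load_factor", 1),
    }
    # State accounting values may arrive as strings, so convert explicitly.
    for state, acct in info.get("state_accounting", {}).items():
        labels = f'{{state="{state}"}}'
        metrics["xrpl_state_duration_seconds" + labels] = int(acct["duration_us"]) / 1_000_000
        metrics["xrpl_state_transitions_total" + labels] = int(acct["transitions"])
    return metrics
```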
---
## Task 11.3: get_counts Collector
**Objective**: Poll `get_counts` and export internal object counts and NodeStore stats.
**What to do**:
- Implement `getCountsCollector`:
**Database Gauges:**
- `xrpl_db_size_kb{db="total"}`, `xrpl_db_size_kb{db="ledger"}`, `xrpl_db_size_kb{db="transaction"}`
**NodeStore Gauges:**
- `xrpl_nodestore_reads_total`, `xrpl_nodestore_reads_hit`, `xrpl_nodestore_writes_total`
- `xrpl_nodestore_read_bytes`, `xrpl_nodestore_written_bytes`
- `xrpl_nodestore_read_duration_us`, `xrpl_nodestore_write_load`
- `xrpl_nodestore_read_queue`, `xrpl_nodestore_read_threads_running`
**Cache Gauges:**
- `xrpl_cache_hit_rate{cache="SLE"}`, `xrpl_cache_hit_rate{cache="ledger"}`, `xrpl_cache_hit_rate{cache="accepted_ledger"}`
- `xrpl_cache_size{cache="treenode"}`, `xrpl_cache_size{cache="fullbelow"}`, `xrpl_cache_size{cache="accepted_ledger"}`
**Object Count Gauges:**
- `xrpl_object_count{type="<name>"}` for each counted object type (Transaction, Ledger, NodeObject, STTx, STLedgerEntry, InboundLedger, Pathfinder, etc.)
**Rates:**
- `xrpl_historical_fetch_per_minute`
- `xrpl_local_txs`
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/get_counts.go`
---
## Task 11.4: Peer Topology Collector
**Objective**: Poll `peers` and export per-peer and aggregate network metrics.
**What to do**:
- Implement `peersCollector`:
**Aggregate Gauges:**
- `xrpl_peers_inbound_count`
- `xrpl_peers_outbound_count`
- `xrpl_peers_cluster_count`
**Per-Peer Gauges** (the `peer` label carries the peer's public key truncated to 8 chars for cardinality control):
- `xrpl_peer_latency_ms{peer="<key>", version="<ver>", inbound="<bool>"}`
- `xrpl_peer_uptime_seconds{peer="<key>"}`
- `xrpl_peer_load{peer="<key>"}`
**Distribution Gauges** (aggregated across all peers):
- `xrpl_peer_latency_p50_ms`, `xrpl_peer_latency_p95_ms`, `xrpl_peer_latency_p99_ms`
- `xrpl_peer_version_count{version="<semver>"}` — count of peers per software version
**Tracking Status:**
- `xrpl_peer_diverged_count` — peers with `track=diverged`
- `xrpl_peer_unknown_count` — peers with `track=unknown`
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/peers.go`
**Cardinality note**: Per-peer metrics use truncated keys. For large peer sets (50+), the aggregate distribution gauges are preferred over per-peer labels.
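The aggregate gauges could be derived from a `peers` response roughly as follows (Python sketch; the collector is planned in Go). `percentile`, `summarize_peers`, and `peer_label` are hypothetical names. Two things are assumed: each peer entry carries `latency`, `version`, and `public_key`, with `inbound` and `track` present only when applicable, and the percentiles use the nearest-rank method, which the plan does not prescribe.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty input."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def peer_label(public_key: str) -> str:
    """Truncate the peer key to 8 characters to bound label cardinality."""
    return public_key[:8]


def summarize_peers(peers: list) -> dict:
    """Aggregate a list of peer entries into the distribution gauges."""
    latencies = [p["latency"] for p in peers if "latency" in p]
    return {
        "xrpl_peers_inbound_count": sum(1 for p in peers if p.get("inbound")),
        "xrpl_peers_outbound_count": sum(1 for p in peers if not p.get("inbound")),
        "xrpl_peer_latency_p50_ms": percentile(latencies, 50),
        "xrpl_peer_latency_p95_ms": percentile(latencies, 95),
        "xrpl_peer_diverged_count": sum(1 for p in peers if p.get("track") == "diverged"),
        # One sample per version label value.
        "xrpl_peer_version_count": dict(Counter(p.get("version", "unknown") for p in peers)),
    }
```

With small peer sets the nearest-rank p95 and p99 collapse to the maximum latency, so the tail gauges only become informative once the node has a few dozen peers.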
---
## Task 11.5: Validator & Amendment Collector
**Objective**: Poll `validators` and `feature` to export validator health and amendment voting status.
**What to do**:
- Implement `validatorCollector`:
**From `validators` RPC:**
- `xrpl_trusted_validators_count`
- `xrpl_validator_signing` (0 or 1 — whether local validator is signing)
**From `feature` RPC:**
- `xrpl_amendment_enabled_count` — total enabled amendments
- `xrpl_amendment_majority_count` — amendments with majority but not yet enabled
- `xrpl_amendment_vetoed_count` — locally vetoed amendments
- `xrpl_amendment_unsupported_majority` (0 or 1) — any unsupported amendment has majority (critical alert)
**Per-amendment with majority** (limited cardinality — only amendments with `majority` set):
- `xrpl_amendment_majority_time{name="<amendment>"}` — epoch time when majority was gained
- `xrpl_amendment_votes{name="<amendment>"}` — current vote count
- `xrpl_amendment_threshold{name="<amendment>"}` — votes needed
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/validators.go`
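The amendment counters could be computed from the `features` map of a `feature` response as sketched below (Python for illustration; the collector is planned in Go). `amendment_metrics` is a hypothetical name. The sketch assumes each entry exposes `enabled`, `supported`, and `vetoed`, with a `majority` field present only once majority has been gained, and it counts an amendment as vetoed only when `vetoed` is exactly `true`.

```python
def amendment_metrics(features: dict) -> dict:
    """Aggregate the `features` map of a feature response into gauge values."""
    enabled = majority = vetoed = 0
    unsupported_majority = 0
    for entry in features.values():
        is_enabled = bool(entry.get("enabled"))
        if is_enabled:
            enabled += 1
        # Assumption: only a literal True counts as a local veto.
        if entry.get("vetoed") is True:
            vetoed += 1
        # Majority is only meaningful while the amendment is not yet enabled.
        if "majority" in entry and not is_enabled:
            majority += 1
            if not entry.get("supported", False):
                unsupported_majority = 1
    return {
        "xrpl_amendment_enabled_count": enabled,
        "xrpl_amendment_majority_count": majority,
        "xrpl_amendment_vetoed_count": vetoed,
        "xrpl_amendment_unsupported_majority": unsupported_majority,
    }
```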
---
## Task 11.6: Fee & TxQ Collector
**Objective**: Poll `fee` RPC and export real-time fee market data.
**What to do**:
- Implement `feeCollector` that calls the public `fee` RPC:
**Fee Level Gauges:**
- `xrpl_fee_current_ledger_size` — transactions in current open ledger
- `xrpl_fee_expected_ledger_size` — expected transactions at close
- `xrpl_fee_max_queue_size` — maximum transaction queue size
- `xrpl_fee_open_ledger_fee_drops` — minimum fee for open ledger inclusion
- `xrpl_fee_median_fee_drops` — median fee level
- `xrpl_fee_minimum_fee_drops` — base reference fee
- `xrpl_fee_queue_size` — current queue depth
- This overlaps with Phase 9's internal TxQ metrics but provides an external-only collection path that doesn't require rippled code changes.
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/fee.go`
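Mapping a `fee` response onto these gauges is mostly a renaming step, sketched below in Python (the collector is planned in Go). `fee_metrics` is a hypothetical name. The sketch assumes the response reports its numbers as strings, with the drop amounts nested under `drops`, so every value is converted to an integer before export.

```python
# Metric name -> path into the `fee` result object.
FEE_FIELDS = {
    "xrpl_fee_current_ledger_size": ("current_ledger_size",),
    "xrpl_fee_expected_ledger_size": ("expected_ledger_size",),
    "xrpl_fee_max_queue_size": ("max_queue_size",),
    "xrpl_fee_queue_size": ("current_queue_size",),
    "xrpl_fee_open_ledger_fee_drops": ("drops", "open_ledger_fee"),
    "xrpl_fee_median_fee_drops": ("drops", "median_fee"),
    "xrpl_fee_minimum_fee_drops": ("drops", "minimum_fee"),
}


def fee_metrics(result: dict) -> dict:
    """Convert a `fee` result into integer gauge samples, skipping absent fields."""
    metrics = {}
    for metric, path in FEE_FIELDS.items():
        value = result
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None:
            metrics[metric] = int(value)
    return metrics
```

Skipping absent fields rather than exporting zero keeps a partially populated response from looking like an empty queue.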
---
## Task 11.7: DEX & AMM Collector (Optional)
**Objective**: Periodically poll configured AMM pools and order book pairs for DeFi metrics.
**What to do**:
- Implement `dexCollector` (enabled only when `amm_pools` or `book_offers_pairs` are configured):
**AMM Pool Gauges** (per configured pool):
- `xrpl_amm_reserve{pool="<id>", asset="<currency>"}` — pool reserve amount
- `xrpl_amm_lp_token_supply{pool="<id>"}` — outstanding LP tokens
- `xrpl_amm_trading_fee{pool="<id>"}` — pool trading fee (basis points)
- `xrpl_amm_tvl_drops{pool="<id>"}` — total value locked (XRP-denominated)
**Order Book Gauges** (per configured pair):
- `xrpl_orderbook_bid_depth{pair="<base>/<quote>"}` — total bid volume
- `xrpl_orderbook_ask_depth{pair="<base>/<quote>"}` — total ask volume
- `xrpl_orderbook_spread{pair="<base>/<quote>"}` — best bid-ask spread
- `xrpl_orderbook_offer_count{pair="<base>/<quote>", side="bid|ask"}` — number of offers
**Key files**:
- New: `docker/telemetry/otel-rippled-receiver/collectors/dex.go`
**Note**: This is optional because it requires explicit configuration of which pools/pairs to track. Default configuration tracks no DEX data.
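The order book gauges reduce to simple aggregates once each side of the book has been normalized. The sketch below assumes that step has already happened: `bids` and `asks` are lists of `(price, amount)` pairs in a common quote currency, a representation that is hypothetical and not part of the plan. `Decimal` is used so price differences are not distorted by binary floating point.

```python
from decimal import Decimal


def orderbook_metrics(bids: list, asks: list) -> dict:
    """Compute depth, spread, and offer counts from normalized book sides.

    Each side is a list of (price, amount) Decimal pairs. Turning raw
    book_offers results into this shape is assumed and not shown here.
    """
    best_bid = max((price for price, _ in bids), default=None)
    best_ask = min((price for price, _ in asks), default=None)
    spread = None
    if best_bid is not None and best_ask is not None:
        spread = best_ask - best_bid
    return {
        "bid_depth": sum((amount for _, amount in bids), Decimal(0)),
        "ask_depth": sum((amount for _, amount in asks), Decimal(0)),
        "spread": spread,  # None when either side of the book is empty
        "bid_offers": len(bids),
        "ask_offers": len(asks),
    }
```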
---
## Task 11.8: Prometheus Alerting Rules
**Objective**: Create production-ready alerting rules for the metrics exported by this receiver.
**What to do**:
- Create `docker/telemetry/prometheus/rippled-alerts.yml`:
**Tier 1 — Critical (page immediately):**
```yaml
- alert: XRPLServerNotFull
  expr: xrpl_server_state < 4
  for: 15m
- alert: XRPLAmendmentBlocked
  expr: xrpl_amendment_blocked == 1
  for: 1m
- alert: XRPLNoPeers
  expr: xrpl_peers_count == 0
  for: 5m
- alert: XRPLLedgerStale
  expr: xrpl_validated_ledger_age_seconds > 120
  for: 2m
- alert: XRPLHighIOLatency
  expr: xrpl_io_latency_ms > 100
  for: 5m
- alert: XRPLUnsupportedAmendmentMajority
  expr: xrpl_amendment_unsupported_majority == 1
  for: 1m
```
**Tier 2 — Warning (investigate within hours):**
```yaml
- alert: XRPLLowPeerCount
  expr: xrpl_peers_count < 10
  for: 15m
- alert: XRPLHighLoadFactor
  expr: xrpl_load_factor > 10
  for: 10m
- alert: XRPLSlowConsensus
  expr: xrpl_last_close_converge_time_seconds > 6
  for: 5m
- alert: XRPLValidatorListExpiring
  expr: (xrpl_validator_list_expiration_seconds - time()) < 86400
  for: 1h
- alert: XRPLClockDrift
  expr: xrpl_close_time_offset_seconds > 0
  for: 5m
- alert: XRPLStateFlapping
  # increase() counts transitions over the window; rate() would yield a per-second value
  expr: increase(xrpl_state_transitions_total{state="full"}[1h]) > 2
  for: 30m
```
**Key files**:
- New: `docker/telemetry/prometheus/rippled-alerts.yml`
- Update: `docker/telemetry/prometheus/prometheus.yml` (add rule_files reference)
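When preparing test fixtures for these rules, it can help to mirror the Tier 1 thresholds as plain predicates and check which alerts a given metric snapshot would breach. The sketch below does only that: `TIER1_RULES` and `breached` are hypothetical names, and the `for:` durations are ignored, so it is a fixture sanity check and not a substitute for Prometheus rule evaluation.

```python
# Tier 1 thresholds restated as predicates over a metric snapshot.
TIER1_RULES = {
    "XRPLServerNotFull": lambda m: m["xrpl_server_state"] < 4,
    "XRPLAmendmentBlocked": lambda m: m["xrpl_amendment_blocked"] == 1,
    "XRPLNoPeers": lambda m: m["xrpl_peers_count"] == 0,
    "XRPLLedgerStale": lambda m: m["xrpl_validated_ledger_age_seconds"] > 120,
    "XRPLHighIOLatency": lambda m: m["xrpl_io_latency_ms"] > 100,
    "XRPLUnsupportedAmendmentMajority": lambda m: m["xrpl_amendment_unsupported_majority"] == 1,
}


def breached(snapshot: dict) -> list:
    """Return the sorted names of Tier 1 rules whose condition holds right now."""
    return sorted(name for name, rule in TIER1_RULES.items() if rule(snapshot))
```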
---
## Task 11.9: New Grafana Dashboards
**Objective**: Create 4 new dashboards for the data exported by the receiver.
**What to do**:
- **Validator Health** (`rippled-validator-health`):
- Server state timeline, state duration breakdown
- Proposer count trend, converge time trend, validation quorum
- Validator list expiration countdown
- Amendment voting status (majority/enabled/vetoed)
- **Network Topology** (`rippled-network-topology`):
- Peer count (inbound/outbound/cluster), peer version distribution
- Peer latency distribution (p50/p95/p99), diverged peer count
- Geographic distribution (if enriched with GeoIP)
- Peer uptime distribution
- **Fee Market** (`rippled-fee-market-external`):
- Current fee levels (open ledger, median, minimum), fee escalation timeline
- Queue depth vs. capacity, transactions per ledger
- Load factor breakdown (server/network/cluster/escalation)
- **DEX & AMM Overview** (`rippled-dex-amm`) (only populated when DEX collectors are configured):
- AMM pool TVL, reserve ratios, LP token supply
- Order book depth per pair, spread trends
- Trading fee revenue estimates
**Key files**:
- New: `docker/telemetry/grafana/dashboards/rippled-validator-health.json`
- New: `docker/telemetry/grafana/dashboards/rippled-network-topology.json`
- New: `docker/telemetry/grafana/dashboards/rippled-fee-market-external.json`
- New: `docker/telemetry/grafana/dashboards/rippled-dex-amm.json`
---
## Task 11.10: Integration with Phase 10 Validation
**Objective**: Extend the Phase 10 validation suite to verify this receiver's metrics.
**What to do**:
- Update `docker/telemetry/workload/validate_telemetry.py`:
- Add assertions for all `xrpl_*` metrics produced by the receiver
- Verify metric labels have expected values
- Verify alerting rules fire correctly (inject a "bad" state and check alert)
- Update `docker/telemetry/docker-compose.workload.yaml`:
- Add the custom OTel Collector build with the rippled receiver
- Configure the receiver to poll one of the test nodes
**Key files**:
- Update: `docker/telemetry/workload/validate_telemetry.py`
- Update: `docker/telemetry/docker-compose.workload.yaml`
- Update: `docker/telemetry/workload/expected_metrics.json`
---
## Task 11.11: Documentation
**Objective**: Document the receiver, its metrics, deployment, and alerting.
**What to do**:
- Create `docker/telemetry/otel-rippled-receiver/README.md`:
- Architecture overview (how the receiver fits into the OTel Collector)
- Configuration reference (all config options with defaults)
- Metric reference table (all exported metrics with types and labels)
- Deployment guide (building custom collector binary, docker-compose integration)
- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
- Add "Third-Party Metrics (OTel Collector Receiver)" section
- Add new Grafana dashboard reference (4 dashboards)
- Add alerting rules reference
- Update `docs/telemetry-runbook.md`:
- Add "Third-Party Metrics Receiver" troubleshooting section
- Add alerting playbook (what to do for each Tier 1/Tier 2 alert)
---
## Effort Summary
| Task | Description | Effort | Risk |
| ----- | ------------------------------------ | ------ | ------ |
| 11.1 | OTel Collector receiver scaffold | 1.5d | Medium |
| 11.2 | server_info / server_state collector | 2d | Low |
| 11.3 | get_counts collector | 1.5d | Low |
| 11.4 | Peer topology collector | 1.5d | Medium |
| 11.5 | Validator & amendment collector | 1d | Low |
| 11.6 | Fee & TxQ collector | 0.5d | Low |
| 11.7 | DEX & AMM collector (optional) | 1.5d | Medium |
| 11.8 | Prometheus alerting rules | 1d | Low |
| 11.9 | New Grafana dashboards (4) | 2d | Low |
| 11.10 | Integration with Phase 10 validation | 1d | Low |
| 11.11 | Documentation | 1d | Low |
**Total Effort**: 15 days
## Exit Criteria
- [ ] Custom OTel Collector receiver builds and starts without errors
- [ ] All `xrpl_*` metrics from server_info, get_counts, peers, validators, fee appear in Prometheus
- [ ] Metrics update at configured poll interval (default 30s)
- [ ] 4 new Grafana dashboards operational with data
- [ ] Prometheus alerting rules fire correctly for simulated failure conditions
- [ ] DEX/AMM collector works when configured (optional — not required for base exit criteria)
- [ ] Phase 10 validation suite passes with receiver metrics included
- [ ] Receiver handles rippled restart/unavailability gracefully (no crash, logs warning, retries)
- [ ] Documentation complete: receiver README, metric reference, alerting playbook
- [ ] Go receiver has unit tests with >80% coverage

@@ -0,0 +1,329 @@
# Phase 9: Internal Metric Instrumentation Gap Fill — Task List
> **Status**: Future Enhancement
>
> **Goal**: Instrument rippled to emit ~50+ metrics that exist in `get_counts`/`server_info`/TxQ/PerfLog but currently lack time-series export via the OTel or beast::insight pipelines.
>
> **Scope**: Hybrid approach — extend `beast::insight` for metrics near existing registrations, use OTel Metrics SDK `ObservableGauge` callbacks for new categories (TxQ, PerfLog, CountedObjects).
>
> **Branch**: `pratik/otel-phase9-metric-gap-fill` (from `pratik/otel-phase8-log-correlation`)
>
> **Depends on**: Phase 7 (native OTel metrics pipeline) and Phase 8 (log-trace correlation)
### Related Plan Documents
| Document | Relevance |
| -------------------------------------------------------------------- | -------------------------------------------------------------- |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 9 plan: motivation, architecture, exit criteria (§6.8.2) |
| [09-data-collection-reference.md](./09-data-collection-reference.md) | Current metric inventory + future metrics section |
| [Phase7_taskList.md](./Phase7_taskList.md) | Prerequisite — OTel Metrics SDK and `OTelCollector` class |
| [Phase8_taskList.md](./Phase8_taskList.md) | Prerequisite — log-trace correlation |
### Third-Party Consumer Context
These metrics serve multiple external consumer categories identified during research:
| Consumer Category | Key Metrics They Need |
| ------------------------- | --------------------------------------------------------------- |
| **Exchanges** | Fee escalation levels, TxQ depth, settlement latency |
| **Payment Processors** | Load factors, io_latency, transaction throughput |
| **Analytics Providers** | NodeStore I/O, cache hit rates, counted objects |
| **Validators/Operators** | Per-job execution times, PerfLog RPC counters, consensus timing |
| **Academic Researchers** | Consensus performance time-series, fee market dynamics |
| **Institutional Custody** | Server health scores, reserve calculations, node availability |
---
## Task 9.1: NodeStore I/O Metrics
**Objective**: Export node store read/write performance as time-series metrics.
**What to do**:
- In `src/libxrpl/nodestore/Database.cpp`, extend existing `beast::insight` registrations to add:
  - Gauge: `node_reads_total` (cumulative read operations)
  - Gauge: `node_reads_hit` (cache-served reads)
  - Gauge: `node_writes` (cumulative write operations)
  - Gauge: `node_written_bytes` (cumulative bytes written)
  - Gauge: `node_read_bytes` (cumulative bytes read)
  - Gauge: `node_reads_duration_us` (cumulative read time in microseconds)
  - Gauge: `write_load` (current write load score)
  - Gauge: `read_queue` (items in read queue)
- These values are already computed in `Database::getCountsJson()` (line ~236). Wire the same counters to `beast::insight` hooks.
**Key modified files**:
- `src/libxrpl/nodestore/Database.cpp`
- `src/libxrpl/nodestore/Database.h` (add insight members)
**Derived Prometheus metrics**: `rippled_nodestore_reads_total`, `rippled_nodestore_reads_hit`, `rippled_nodestore_write_load`, etc.
**Grafana dashboard**: Add "NodeStore I/O" panel group to _Node Health_ dashboard.
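Most of the NodeStore gauges above are cumulative, so a dashboard panel would usually show the difference between two samples rather than the raw value. The following is a minimal sketch of that arithmetic, assuming two scrapes are available as dictionaries keyed by the bare gauge names from this task. The function name `nodestore_rates` and the sample values are hypothetical and not part of rippled.

```python
def nodestore_rates(prev, curr, interval_s):
    """Derive per-interval NodeStore figures from two cumulative samples.

    Assumes the counters did not reset between samples (no restart handling).
    """
    reads = curr["node_reads_total"] - prev["node_reads_total"]
    hits = curr["node_reads_hit"] - prev["node_reads_hit"]
    read_us = curr["node_reads_duration_us"] - prev["node_reads_duration_us"]
    written = curr["node_written_bytes"] - prev["node_written_bytes"]
    return {
        "reads_per_s": reads / interval_s,
        # Guard against an idle interval with zero reads.
        "hit_ratio": hits / reads if reads else 0.0,
        "mean_read_us": read_us / reads if reads else 0.0,
        "write_bytes_per_s": written / interval_s,
    }


# Invented sample values, 10 seconds apart.
prev = {"node_reads_total": 1000, "node_reads_hit": 800,
        "node_reads_duration_us": 50_000, "node_written_bytes": 4096}
curr = {"node_reads_total": 1500, "node_reads_hit": 1200,
        "node_reads_duration_us": 80_000, "node_written_bytes": 12_288}
rates = nodestore_rates(prev, curr, 10)
```

In a Prometheus-backed panel the same differencing would normally be done in the query layer; the sketch only makes the intended derivation explicit.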
---
## Task 9.2: Cache Hit Rate Metrics
**Objective**: Export SHAMap and ledger cache performance as time-series gauges.
**What to do**:
- Register OTel `ObservableGauge` callbacks (via Phase 7's `OTelCollector`) for:
  - `SLE_hit_rate` — SLE cache hit rate (0.0 to 1.0)
  - `ledger_hit_rate` — Ledger object cache hit rate
  - `AL_hit_rate` — AcceptedLedger cache hit rate
  - `treenode_cache_size` — SHAMap TreeNode cache size (entries)
  - `treenode_track_size` — Tracked tree nodes
  - `fullbelow_size` — FullBelow cache size
- The callback should read from the same sources as the `GetCounts.cpp` handler (line ~43).
- Create a centralized `MetricsRegistry` class that holds all OTel async gauge registrations, polled at 10-second intervals by the `PeriodicExportingMetricReader`.
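The registry-plus-callback pattern described above can be illustrated independently of the OTel SDK. The sketch below is a language-neutral model written in Python, not the planned C++ class: the method names (`register_gauge`, `deregister_gauge`, `collect`), the `hit_rate` helper, and the cache numbers are all assumptions made for illustration. The point it shows is that an asynchronous gauge stores no value of its own; the callback is evaluated each time the periodic reader collects.

```python
from typing import Callable, Dict


class MetricsRegistry:
    """Illustrative stand-in: holds named gauge callbacks, sampled on demand."""

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register_gauge(self, name: str, callback: Callable[[], float]) -> None:
        if name in self._gauges:
            raise ValueError(f"gauge already registered: {name}")
        self._gauges[name] = callback

    def deregister_gauge(self, name: str) -> None:
        self._gauges.pop(name, None)

    def collect(self) -> Dict[str, float]:
        """Invoke every callback once; a periodic reader would call this per interval."""
        return {name: cb() for name, cb in self._gauges.items()}


def hit_rate(hits: int, misses: int) -> float:
    """Hit rate in the range 0.0 to 1.0; 0.0 when the cache has seen no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical cache state standing in for the sources read by get_counts.
sle_cache = {"hits": 90, "misses": 10}

registry = MetricsRegistry()
registry.register_gauge(
    "SLE_hit_rate", lambda: hit_rate(sle_cache["hits"], sle_cache["misses"]))
registry.register_gauge("treenode_cache_size", lambda: 2048)

first = registry.collect()
sle_cache["misses"] = 30  # state changes between two polls
second = registry.collect()
```

Because each callback reads live state, the second collection reflects the changed miss count without any explicit update call from the cache code.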
**Key modified files**:
- New: `src/xrpld/telemetry/MetricsRegistry.h` / `.cpp`
- `src/xrpld/rpc/handlers/GetCounts.cpp` (extract shared access methods)
- `src/xrpld/app/main/Application.cpp` (register MetricsRegistry at startup)
**Derived Prometheus metrics**: `rippled_cache_SLE_hit_rate`, `rippled_cache_ledger_hit_rate`, `rippled_cache_treenode_size`, etc.
---
## Task 9.3: Transaction Queue (TxQ) Metrics
**Objective**: Export TxQ depth, capacity, and fee escalation levels as time-series.
**What to do**:
- Register OTel `ObservableGauge` callbacks for TxQ state (from `TxQ.h` line ~143):
  - `txq_count` — Current transactions in queue
  - `txq_max_size` — Maximum queue capacity
  - `txq_in_ledger` — Transactions in current open ledger
  - `txq_per_ledger` — Expected transactions per ledger
  - `txq_reference_fee_level` — Reference fee level
  - `txq_min_processing_fee_level` — Minimum fee to get processed
  - `txq_med_fee_level` — Median fee level in queue
  - `txq_open_ledger_fee_level` — Open ledger fee escalation level
- Add to the `MetricsRegistry` (Task 9.2).
**Key modified files**:
- `src/xrpld/telemetry/MetricsRegistry.cpp` (add TxQ callbacks)
- `src/xrpld/app/tx/detail/TxQ.h` (expose metrics accessor if needed)
**Derived Prometheus metrics**: `rippled_txq_count`, `rippled_txq_max_size`, `rippled_txq_open_ledger_fee_level`, etc.
**Grafana dashboard**: New _Fee Market & TxQ_ dashboard (`rippled-fee-market`).
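As a rough illustration of how a _Fee Market & TxQ_ panel might combine these gauges, the sketch below derives a queue utilization ratio and expresses fee levels as multiples of the reference fee level. It assumes the gauges are available under the bare names listed in this task; the function name `txq_summary` and the sample values are invented for the example.

```python
def txq_summary(m):
    """Derive dashboard-friendly ratios from raw TxQ gauge values."""
    ref = m["txq_reference_fee_level"]
    max_size = m["txq_max_size"]
    return {
        # Fraction of queue capacity currently in use.
        "queue_utilization": m["txq_count"] / max_size if max_size else 0.0,
        # Fee levels expressed relative to the reference level.
        "open_ledger_multiplier": m["txq_open_ledger_fee_level"] / ref,
        "min_processing_multiplier": m["txq_min_processing_fee_level"] / ref,
    }


# Invented sample values for illustration only.
sample = {
    "txq_count": 50,
    "txq_max_size": 2000,
    "txq_reference_fee_level": 256,
    "txq_min_processing_fee_level": 256,
    "txq_open_ledger_fee_level": 512,
}
summary = txq_summary(sample)
```

Exporting the raw gauges and deriving ratios at query time keeps the instrumentation simple; the helper only documents which combinations the dashboard is expected to show.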
---
## Task 9.4: PerfLog Per-RPC Method Metrics
**Objective**: Export per-RPC-method call counts and latency as OTel metrics.
**What to do**:
- Register OTel instruments for PerfLog RPC counters (from `PerfLogImp.cpp` line ~63):
  - Counter: `rpc_method_started_total{method="<name>"}` — calls started
  - Counter: `rpc_method_finished_total{method="<name>"}` — calls completed
  - Counter: `rpc_method_errored_total{method="<name>"}` — calls errored
  - Histogram: `rpc_method_duration_us{method="<name>"}` — execution time distribution
- Use OTel `Counter<int64_t>` and `Histogram<double>` instruments with `method` attribute label.
- Hook into the existing PerfLog callback mechanism rather than adding new instrumentation points.
**Key modified files**:
- `src/xrpld/perflog/detail/PerfLogImp.cpp` (add OTel instrument updates alongside existing JSON counters)
- `src/xrpld/telemetry/MetricsRegistry.cpp` (register instruments)
**Derived Prometheus metrics**: `rippled_rpc_method_started_total{method="server_info"}`, `rippled_rpc_method_duration_us_bucket{method="ledger"}`, etc.
**Grafana dashboard**: Add "Per-Method RPC Breakdown" panel group to _RPC Performance_ dashboard.
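To make the relationship between the per-method counters, the histogram, and the `_bucket` series concrete, here is a small model in Python. It is a sketch under stated assumptions, not the PerfLog implementation: the class `RpcMethodMetrics`, its hook names, and the bucket boundaries in `BOUNDS_US` are hypothetical (the plan does not specify boundaries), and durations are recorded only for finished calls.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in microseconds (inclusive).
BOUNDS_US = [100, 1_000, 10_000, 100_000]


class RpcMethodMetrics:
    """Per-method counters plus a duration histogram, keyed by method name."""

    def __init__(self):
        self.started = defaultdict(int)
        self.finished = defaultdict(int)
        self.errored = defaultdict(int)
        # One extra slot holds observations above the largest bound (+Inf).
        self.buckets = defaultdict(lambda: [0] * (len(BOUNDS_US) + 1))

    def on_start(self, method):
        self.started[method] += 1

    def on_finish(self, method, duration_us):
        self.finished[method] += 1
        # bisect_left picks the first bound >= duration, so bounds are inclusive.
        self.buckets[method][bisect_left(BOUNDS_US, duration_us)] += 1

    def on_error(self, method):
        self.errored[method] += 1

    def cumulative_buckets(self, method):
        """Prometheus `_bucket` series are cumulative: each `le` includes lower buckets."""
        out, running = [], 0
        for count in self.buckets[method]:
            running += count
            out.append(running)
        return out


metrics = RpcMethodMetrics()
for duration in (80, 1_000, 250_000):
    metrics.on_start("server_info")
    metrics.on_finish("server_info", duration)
metrics.on_start("ledger")
metrics.on_error("ledger")

server_info_buckets = metrics.cumulative_buckets("server_info")
```

The last cumulative bucket equals the number of finished calls for that method, which is a convenient consistency check between the histogram and the `finished` counter.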
---
## Task 9.5: PerfLog Per-Job-Type Metrics
**Objective**: Export per-job-type queue and execution metrics.
**What to do**:
- Register OTel instruments for PerfLog job counters:
  - Counter: `job_queued_total{job_type="<name>"}` — jobs queued
  - Counter: `job_started_total{job_type="<name>"}` — jobs started
  - Counter: `job_finished_total{job_type="<name>"}` — jobs completed
  - Histogram: `job_queued_duration_us{job_type="<name>"}` — time spent waiting in queue
  - Histogram: `job_running_duration_us{job_type="<name>"}` — execution time distribution
- Hook into PerfLog's existing job tracking alongside Task 9.4.
**Key modified files**:
- `src/xrpld/perflog/detail/PerfLogImp.cpp`
- `src/xrpld/telemetry/MetricsRegistry.cpp`
**Derived Prometheus metrics**: `rippled_job_queued_total{job_type="ledgerData"}`, `rippled_job_running_duration_us_bucket{job_type="transaction"}`, etc.
**Grafana dashboard**: New _Job Queue Analysis_ dashboard (`rippled-job-queue`).
---
## Task 9.6: Counted Object Instance Metrics
**Objective**: Export live instance counts for key internal object types.
**What to do**:
- Register OTel `ObservableGauge` callbacks for `CountedObject<T>` instance counts:
  - `object_count{type="Transaction"}` — live Transaction objects
  - `object_count{type="Ledger"}` — live Ledger objects
  - `object_count{type="NodeObject"}` — live NodeObject instances
  - `object_count{type="STTx"}` — serialized transaction objects
  - `object_count{type="STLedgerEntry"}` — serialized ledger entries
  - `object_count{type="InboundLedger"}` — ledgers being fetched
  - `object_count{type="Pathfinder"}` — active pathfinding computations
  - `object_count{type="PathRequest"}` — active path requests
  - `object_count{type="HashRouterEntry"}` — hash router entries
- The `CountedObject` template already tracks these via atomic counters. The callback just reads the current counts.
**Key modified files**:
- `src/xrpld/telemetry/MetricsRegistry.cpp` (add counted object callbacks)
- `include/xrpl/basics/CountedObject.h` (may need static accessor for iteration)
**Derived Prometheus metrics**: `rippled_object_count{type="Transaction"}`, `rippled_object_count{type="NodeObject"}`, etc.
**Grafana dashboard**: Add "Object Instance Counts" panel to _Node Health_ dashboard.
---
## Task 9.7: Fee Escalation & Load Factor Metrics
**Objective**: Export the full load factor breakdown as time-series.
**What to do**:
- Register OTel `ObservableGauge` callbacks for load factors (from `NetworkOPs.cpp` line ~2694):
  - `load_factor` — combined transaction cost multiplier
  - `load_factor_server` — server + cluster + network contribution
  - `load_factor_local` — local server load only
  - `load_factor_net` — network-wide load estimate
  - `load_factor_cluster` — cluster peer load
  - `load_factor_fee_escalation` — open ledger fee escalation
  - `load_factor_fee_queue` — queue entry fee level
- These overlap with some existing StatsD metrics but provide finer granularity (individual factor breakdown vs. combined value).
**Key modified files**:
- `src/xrpld/telemetry/MetricsRegistry.cpp`
- `src/xrpld/app/misc/NetworkOPs.cpp` (expose load factor accessors if needed)
**Derived Prometheus metrics**: `rippled_load_factor`, `rippled_load_factor_fee_escalation`, etc.
**Grafana dashboard**: Add "Load Factor Breakdown" panel to _Fee Market & TxQ_ dashboard.
---
## Task 9.8: New Grafana Dashboards
**Objective**: Create Grafana dashboards for the new metric categories.
**What to do**:
- Create 2 new dashboards:
  1. **Fee Market & TxQ** (`rippled-fee-market`) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
  2. **Job Queue Analysis** (`rippled-job-queue`) — Per-job-type rates, queue wait times, execution times, job queue depth
- Update 2 existing dashboards:
  1. **Node Health** (`rippled-statsd-node-health`) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
  2. **RPC Performance** (`rippled-rpc-perf`) — Add per-method RPC breakdown panels
**Key modified files**:
- New: `docker/telemetry/grafana/dashboards/rippled-fee-market.json`
- New: `docker/telemetry/grafana/dashboards/rippled-job-queue.json`
- `docker/telemetry/grafana/dashboards/rippled-statsd-node-health.json`
- `docker/telemetry/grafana/dashboards/rippled-rpc-perf.json`
---
## Task 9.9: Update Documentation
**Objective**: Update telemetry reference docs with all new metrics.
**What to do**:
- Update `OpenTelemetryPlan/09-data-collection-reference.md`:
  - Add new section for OTel SDK-exported metrics (NodeStore, cache, TxQ, PerfLog, CountedObjects, load factors)
  - Update Grafana dashboard reference table (add 2 new dashboards)
  - Add Prometheus query examples for new metrics
- Update `docs/telemetry-runbook.md`:
  - Add alerting rules for new metrics (NodeStore write_load, TxQ capacity, cache hit rate degradation)
  - Add troubleshooting entries for new metric categories
**Key modified files**:
- `OpenTelemetryPlan/09-data-collection-reference.md`
- `docs/telemetry-runbook.md`
---
## Task 9.10: Integration Tests
**Objective**: Verify all new metrics appear in Prometheus after a test workload.
**What to do**:
- Extend the existing telemetry integration test:
  - Start rippled with `[telemetry] enabled=1` and `[insight] server=otel`
  - Submit a batch of RPC calls and transactions
  - Query Prometheus for each new metric family
  - Assert non-zero values for: NodeStore reads, cache hit rates, TxQ count, PerfLog RPC counters, object counts, load factors
- Add unit tests for the `MetricsRegistry` class:
  - Verify callback registration and deregistration
  - Verify metric values match `get_counts` JSON output
  - Verify graceful behavior when telemetry is disabled
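The "assert non-zero values" step could be sketched roughly as follows. This example only parses response bodies in the shape returned by a Prometheus instant query (`/api/v1/query`); how the existing integration script fetches them is not shown here, and the helper names `nonzero_series` and `missing_families` as well as the canned responses are hypothetical.

```python
import json


def nonzero_series(response_body):
    """Return True if an instant-query response holds at least one non-zero sample."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    results = payload.get("data", {}).get("result", [])
    # Each vector sample looks like:
    #   {"metric": {...labels...}, "value": [<unix_ts>, "<value as string>"]}
    return any(float(sample["value"][1]) != 0.0 for sample in results)


def missing_families(responses):
    """Names of metric families that returned no non-zero sample, sorted."""
    return sorted(name for name, body in responses.items()
                  if not nonzero_series(body))


def _vector(samples):
    """Build a canned instant-query response body for the example below."""
    return json.dumps({"status": "success",
                       "data": {"resultType": "vector", "result": samples}})


# Canned responses standing in for real Prometheus queries.
responses = {
    "rippled_txq_count": _vector(
        [{"metric": {"__name__": "rippled_txq_count"}, "value": [1700000000, "12"]}]),
    "rippled_object_count": _vector(
        [{"metric": {"type": "Ledger"}, "value": [1700000000, "0"]}]),
    "rippled_load_factor": _vector([]),
}
missing = missing_families(responses)
```

A test built this way can report every missing or all-zero family in one failure message instead of stopping at the first assertion.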
**Key modified files**:
- `src/test/telemetry/MetricsRegistry_test.cpp` (new)
- Existing integration test script (extend assertions)
---
## Effort Summary
| Task | Description | Effort | Risk |
| ---- | ---------------------------------------- | ------ | ------ |
| 9.1 | NodeStore I/O metrics | 1d | Low |
| 9.2 | Cache hit rate metrics + MetricsRegistry | 2d | Medium |
| 9.3 | TxQ metrics | 1d | Low |
| 9.4 | PerfLog per-RPC metrics | 1.5d | Medium |
| 9.5 | PerfLog per-job metrics | 1d | Low |
| 9.6 | Counted object instance metrics | 0.5d | Low |
| 9.7 | Fee escalation & load factor metrics | 0.5d | Low |
| 9.8 | New Grafana dashboards | 2d | Low |
| 9.9 | Update documentation | 1d | Low |
| 9.10 | Integration tests | 1.5d | Medium |

**Total Effort**: 12 days

## Exit Criteria

- [ ] All ~50+ new metrics visible in Prometheus via OTLP pipeline
- [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK
- [ ] Async gauge callbacks execute at 10s intervals without performance impact
- [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue)
- [ ] 2 existing dashboards updated with new panel groups
- [ ] Integration test validates all new metric families are non-zero
- [ ] No performance regression (< 0.5% CPU overhead from new callbacks)
- [ ] Documentation updated with full new metric inventory