diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index eadd18293f..643aa29392 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -288,7 +288,78 @@ See [Phase4_taskList.md § Phase 4b](./Phase4_taskList.md) for full design. --- -## 6.7 Risk Assessment +## 6.7 Phase 6: StatsD Metrics Integration (Week 10) + +**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana. + +### Background + +rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus. 
+ +### Metric Inventory + +| Category | Group | Type | Count | Key Metrics | +| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ | +| Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode | +| Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` | +| Ledger Fetch | — | Counter | 1 | `ledger_fetches` | +| Ledger History | `ledger.history` | Counter | 1 | `mismatch` | +| RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) | +| Job Queue | — | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` | +| Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` | +| Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` | +| Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category | +| Pathfinding | — | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) | +| I/O | — | Event | 1 | `ios_latency` (histogram) | +| Resource Mgr | — | Meter | 2 | `warn`, `drop` (rate counters) | +| Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` | + +**Total**: ~255+ unique metrics (plus dynamic job-type and cache metrics) + +### Tasks + +| Task | Description | +| ---- | --------------------------------------------------------------------------------------------------------------- | +| 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | +| 6.2 | Add `statsd` receiver to OTel Collector config | +| 6.3 | Expose UDP port 8125 in docker-compose.yml | +| 6.4 | Add `[insight]` config to integration test node configs | +| 6.5 | Create "Node Health" Grafana dashboard (8 panels) | +| 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) | +| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) | +| 6.8 | Update 
integration test to verify StatsD metrics in Prometheus | +| 6.9 | Update TESTING.md and telemetry-runbook.md | + +### Wire Format Fix (Task 6.1) — DEFERRED + +The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager). + +**Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied. + +### New Grafana Dashboards + +**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`): + +- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches + +**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`): + +- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories + +**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`): + +- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap + +### Exit Criteria + +- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`) +- [ ] All 3 new Grafana dashboards load without errors +- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests) +- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately) + +--- + +## 6.9 Risk Assessment ```mermaid quadrantChart @@ -319,7 +390,7 @@ 
quadrantChart --- -## 6.8 Success Metrics +## 6.10 Success Metrics | Metric | Target | Measurement | | ------------------------ | -------------------------------------------------------------- | --------------------- | @@ -485,13 +556,15 @@ quadrantChart --- -## 6.10 Definition of Done + +## 6.13 Definition of Done > **TxQ** = Transaction Queue | **HA** = High Availability Clear, measurable criteria for each phase. -### 6.10.1 Phase 1: Core Infrastructure +### 6.13.1 Phase 1: Core Infrastructure + | Criterion | Measurement | Target | | --------------- | ---------------------------------------------------------- | ---------------------------- | @@ -503,7 +576,9 @@ Clear, measurable criteria for each phase. **Definition of Done**: All criteria met, PR merged, no regressions in CI. -### 6.10.2 Phase 2: RPC Tracing + +### 6.13.2 Phase 2: RPC Tracing + | Criterion | Measurement | Target | | ------------------ | ---------------------------------- | -------------------------- | @@ -515,7 +590,9 @@ Clear, measurable criteria for each phase. **Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution. -### 6.10.3 Phase 3: Transaction Tracing + +### 6.13.3 Phase 3: Transaction Tracing + | Criterion | Measurement | Target | | ---------------- | ------------------------------- | ---------------------------------- | @@ -527,7 +604,9 @@ Clear, measurable criteria for each phase. **Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds. -### 6.10.4 Phase 4: Consensus Tracing + +### 6.13.4 Phase 4: Consensus Tracing + | Criterion | Measurement | Target | | -------------------- | ----------------------------- | ------------------------- | @@ -539,7 +618,9 @@ Clear, measurable criteria for each phase. **Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing. 
-### 6.10.5 Phase 5: Production Deployment + +### 6.13.5 Phase 5: Production Deployment + | Criterion | Measurement | Target | | ------------ | ---------------------------- | -------------------------- | @@ -552,7 +633,9 @@ Clear, measurable criteria for each phase. **Definition of Done**: Telemetry running in production, operators trained, alerts active. -### 6.10.6 Success Metrics Summary + +### 6.13.6 Success Metrics Summary + | Phase | Primary Metric | Secondary Metric | Deadline | | ------- | ---------------------- | --------------------------- | ------------- | @@ -564,7 +647,7 @@ Clear, measurable criteria for each phase. --- -## 6.12 Recommended Implementation Order +## 6.14 Recommended Implementation Order Based on ROI analysis, implement in this exact order: diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md index 742d8a9bf5..660c4f845d 100644 --- a/OpenTelemetryPlan/08-appendix.md +++ b/OpenTelemetryPlan/08-appendix.md @@ -170,19 +170,20 @@ flowchart TB ### Plan Documents -| Document | Description | -| ---------------------------------------------------------------- | -------------------------------------------- | -| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary | -| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Distributed tracing concepts and OTel primer | -| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points | -| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions | -| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis | -| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components | -| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs | -| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success 
metrics | -| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture | -| [08-appendix.md](./08-appendix.md) | Glossary, references, version history | -| [presentation.md](./presentation.md) | Slide deck for OTel plan overview | +| Document | Description | +| -------------------------------------------------------------------- | -------------------------------------------- | +| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary | +| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Distributed tracing concepts and OTel primer | +| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points | +| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions | +| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis | +| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components | +| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs | +| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics | +| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture | +| [08-appendix.md](./08-appendix.md) | Glossary, references, version history | +| [09-data-collection-reference.md](./09-data-collection-reference.md) | Span/metric/dashboard inventory | +| [presentation.md](./presentation.md) | Slide deck for OTel plan overview | ### Task Lists diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md new file mode 100644 index 0000000000..2298c22d08 --- /dev/null +++ b/OpenTelemetryPlan/09-data-collection-reference.md @@ -0,0 +1,553 @@ +# Observability Data Collection Reference + +> **Audience**: Developers and operators. 
This is the single source of truth for all telemetry data collected by rippled's observability stack.
+>
+> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
+
+## Data Flow Overview
+
+```mermaid
+graph LR
+    subgraph rippledNode["rippled Node"]
+        A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
+        B["beast::insight<br/>StatsD metrics<br/>(UDP sender)"]
+    end
+
+    subgraph collector["OTel Collector :4317 / :4318 / :8125"]
+        direction TB
+        R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
+        R2["StatsD Receiver<br/>:8125 UDP"]
+        BP["Batch Processor<br/>timeout 1s, batch 100"]
+        SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
+
+        R1 --> BP
+        BP --> SM
+    end
+
+    subgraph backends["Trace Backends (choose one or both)"]
+        D["Jaeger :16686<br/>Trace search &<br/>visualization"]
+        T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
+    end
+
+    subgraph metrics["Metrics Stack"]
+        E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + StatsD metrics"]
+    end
+
+    subgraph viz["Visualization"]
+        F["Grafana :3000<br/>10 dashboards"]
+    end
+
+    A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
+    B -->|"UDP :8125<br/>(gauges, counters, timers)"| R2
+
+    BP -->|"OTLP/gRPC :4317"| D
+    BP -->|"OTLP/gRPC"| T
+
+    SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
+    R2 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* summaries"| E
+
+    E -->|"Prometheus<br/>data source"| F
+    D -->|"Jaeger<br/>data source"| F
+    T -->|"Tempo<br/>data source"| F
+
+    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
+    style B fill:#d9534f,color:#fff,stroke:#b52d2d
+    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+    style R2 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
+    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
+    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
+    style T fill:#e8953a,color:#000,stroke:#b5732a
+    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
+    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
+    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
+    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
+    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
+    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
+    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
+```
+
+There are two independent telemetry pipelines entering a single **OTel Collector**:
+
+1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
+2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets to port :8125, ingested by the collector's **StatsD Receiver**, and exported alongside span-derived metrics to Prometheus.
+
+**Trace backends** — The collector exports traces via OTLP/gRPC to one or both:
+
+- **Jaeger** (development) — Provides trace search UI at `:16686`. Easy single-binary setup.
+- **Grafana Tempo** (production) — Preferred for production. Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.
+
+> **Further reading**: [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) for core OpenTelemetry concepts (traces, spans, context propagation, sampling). [07-observability-backends.md](./07-observability-backends.md) for production backend selection, collector placement, and sampling strategies.
+
+---
+
+## 1. OpenTelemetry Spans
+
+### 1.1 Complete Span Inventory (17 spans)
+
+> **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.
+
+#### RPC Spans
+
+Controlled by `trace_rpc=1` in `[telemetry]` config.
+
+| Span Name | Parent | Source File | Description |
+| -------------------- | ------------- | ----------------- | ------------------------------------------------------------------------ |
+| `rpc.request` | — | ServerHandler.cpp | Top-level HTTP RPC request entry point |
+| `rpc.process` | `rpc.request` | ServerHandler.cpp | RPC processing pipeline |
+| `rpc.ws_message` | — | ServerHandler.cpp | WebSocket message handling |
+| `rpc.command.*` | `rpc.process` | RPCHandler.cpp | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
+
+**Where to find**: Jaeger → Service: `rippled` → Operation: `rpc.request` or `rpc.command.*`
+
+**Grafana dashboard**: _RPC Performance_ (`rippled-rpc-perf`)
+
+#### Transaction Spans
+
+Controlled by `trace_transactions=1` in `[telemetry]` config.
+ +| Span Name | Parent | Source File | Description | +| ------------ | -------------- | --------------- | ----------------------------------------------------------------- | +| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) | +| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) | +| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus | + +**Where to find**: Jaeger → Operation: `tx.process` or `tx.receive` + +**Grafana dashboard**: _Transaction Overview_ (`rippled-transactions`) + +#### Consensus Spans + +Controlled by `trace_consensus=1` in `[telemetry]` config. + +| Span Name | Parent | Source File | Description | +| --------------------------- | ------ | ---------------- | --------------------------------------------- | +| `consensus.proposal.send` | — | RCLConsensus.cpp | Node broadcasts its transaction set proposal | +| `consensus.ledger_close` | — | RCLConsensus.cpp | Ledger close event triggered by consensus | +| `consensus.accept` | — | RCLConsensus.cpp | Consensus accepts a ledger (round complete) | +| `consensus.validation.send` | — | RCLConsensus.cpp | Validation message sent after ledger accepted | +| `consensus.accept.apply` | — | RCLConsensus.cpp | Ledger application with close time details | + +**Where to find**: Jaeger → Operation: `consensus.*` + +**Grafana dashboard**: _Consensus Health_ (`rippled-consensus`) + +#### Ledger Spans + +Controlled by `trace_ledger=1` in `[telemetry]` config. 
+
+| Span Name | Parent | Source File | Description |
+| ----------------- | ------ | ---------------- | ---------------------------------------------- |
+| `ledger.build` | — | BuildLedger.cpp | Build new ledger from accepted transaction set |
+| `ledger.validate` | — | LedgerMaster.cpp | Ledger promoted to validated status |
+| `ledger.store` | — | LedgerMaster.cpp | Ledger stored to database/history |
+
+**Where to find**: Jaeger → Operation: `ledger.*`
+
+**Grafana dashboard**: _Ledger Operations_ (`rippled-ledger-ops`)
+
+#### Peer Spans
+
+Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (high volume).
+
+| Span Name | Parent | Source File | Description |
+| ------------------------- | ------ | ----------- | ------------------------------------- |
+| `peer.proposal.receive` | — | PeerImp.cpp | Consensus proposal received from peer |
+| `peer.validation.receive` | — | PeerImp.cpp | Validation message received from peer |
+
+**Where to find**: Jaeger → Operation: `peer.*`
+
+**Grafana dashboard**: _Peer Network_ (`rippled-peer-net`)
+
+---
+
+### 1.2 Complete Attribute Inventory (28 attributes)
+
+> **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.
+
+Every span can carry key-value attributes that provide context for filtering and aggregation.
+ +#### RPC Attributes + +| Attribute | Type | Set On | Description | +| ------------------------ | ------ | --------------- | ------------------------------------------------ | +| `xrpl.rpc.command` | string | `rpc.command.*` | RPC command name (e.g., `server_info`, `ledger`) | +| `xrpl.rpc.version` | int64 | `rpc.command.*` | API version number | +| `xrpl.rpc.role` | string | `rpc.command.*` | Caller role: `"admin"` or `"user"` | +| `xrpl.rpc.status` | string | `rpc.command.*` | Result: `"success"` or `"error"` | +| `xrpl.rpc.duration_ms` | int64 | `rpc.command.*` | Command execution time in milliseconds | +| `xrpl.rpc.error_message` | string | `rpc.command.*` | Error details (only set on failure) | + +**Jaeger query**: Tag `xrpl.rpc.command=server_info` to find all `server_info` calls. + +**Prometheus label**: `xrpl_rpc_command` (dots converted to underscores by SpanMetrics). + +#### Transaction Attributes + +| Attribute | Type | Set On | Description | +| -------------------- | ------- | -------------------------- | ---------------------------------------------------- | +| `xrpl.tx.hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) | +| `xrpl.tx.local` | boolean | `tx.process` | `true` if locally submitted, `false` if peer-relayed | +| `xrpl.tx.path` | string | `tx.process` | Submission path: `"sync"` or `"async"` | +| `xrpl.tx.suppressed` | boolean | `tx.receive` | `true` if transaction was suppressed (duplicate) | +| `xrpl.tx.status` | string | `tx.receive` | Transaction status (e.g., `"known_bad"`) | + +**Jaeger query**: Tag `xrpl.tx.hash=` to trace a specific transaction across nodes. + +**Prometheus label**: `xrpl_tx_local` (used as SpanMetrics dimension). 
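
The attribute-to-label mapping noted above (dots converted to underscores) can be sketched in a few lines. This is an illustrative approximation only: the helper name `attribute_to_label` is hypothetical, and the exact sanitization rules applied by the Collector's Prometheus exporter are assumed rather than taken from this plan.

```python
import re


def attribute_to_label(attribute: str) -> str:
    """Map a span attribute key to a Prometheus-style label name.

    Assumption: any character outside [a-zA-Z0-9_] becomes an underscore,
    which covers the dot-to-underscore conversion this document describes.
    """
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attribute)
    # Prometheus label names may not begin with a digit, so guard that case.
    if label and label[0].isdigit():
        label = "_" + label
    return label


if __name__ == "__main__":
    for key in ("xrpl.rpc.command", "xrpl.tx.local", "xrpl.peer.proposal.trusted"):
        print(key, "->", attribute_to_label(key))
```

A helper like this is mainly useful when writing dashboard queries by hand: start from the attribute name in the tables above and derive the label to filter or group on.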
+ +#### Consensus Attributes + +| Attribute | Type | Set On | Description | +| ------------------------------------ | ------- | --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- | +| `xrpl.consensus.round` | int64 | `consensus.proposal.send` | Consensus round number | +| `xrpl.consensus.mode` | string | `consensus.proposal.send`, `consensus.ledger_close` | Node mode: `"syncing"`, `"tracking"`, `"full"`, `"proposing"` | +| `xrpl.consensus.proposers` | int64 | `consensus.proposal.send`, `consensus.accept` | Number of proposers in the round | +| `xrpl.consensus.proposing` | boolean | `consensus.validation.send` | Whether this node was a proposer | +| `xrpl.consensus.ledger.seq` | int64 | `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply` | Ledger sequence number | +| `xrpl.consensus.close_time` | int64 | `consensus.accept.apply` | Agreed-upon ledger close time (epoch seconds) | +| `xrpl.consensus.close_time_correct` | boolean | `consensus.accept.apply` | Whether validators reached agreement on close time | +| `xrpl.consensus.close_resolution_ms` | int64 | `consensus.accept.apply` | Close time rounding granularity in milliseconds | +| `xrpl.consensus.state` | string | `consensus.accept.apply` | Consensus outcome: `"finished"` or `"moved_on"` | +| `xrpl.consensus.round_time_ms` | int64 | `consensus.accept.apply` | Total consensus round duration in milliseconds | + +**Jaeger query**: Tag `xrpl.consensus.mode=proposing` to find rounds where node was proposing. + +**Prometheus label**: `xrpl_consensus_mode` (used as SpanMetrics dimension). 
+ +#### Ledger Attributes + +| Attribute | Type | Set On | Description | +| ------------------------- | ----- | ------------------------------------------------------------- | ---------------------------------------------- | +| `xrpl.ledger.seq` | int64 | `ledger.build`, `ledger.validate`, `ledger.store`, `tx.apply` | Ledger sequence number | +| `xrpl.ledger.validations` | int64 | `ledger.validate` | Number of validations received for this ledger | +| `xrpl.ledger.tx_count` | int64 | `ledger.build`, `tx.apply` | Transactions in the ledger | +| `xrpl.ledger.tx_failed` | int64 | `ledger.build`, `tx.apply` | Failed transactions in the ledger | + +**Jaeger query**: Tag `xrpl.ledger.seq=12345` to find all spans for a specific ledger. + +#### Peer Attributes + +| Attribute | Type | Set On | Description | +| ------------------------------ | ------- | ---------------------------------------------------------------- | ---------------------------------------------------- | +| `xrpl.peer.id` | int64 | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier | +| `xrpl.peer.proposal.trusted` | boolean | `peer.proposal.receive` | Whether the proposal came from a trusted validator | +| `xrpl.peer.validation.trusted` | boolean | `peer.validation.receive` | Whether the validation came from a trusted validator | + +**Prometheus labels**: `xrpl_peer_proposal_trusted`, `xrpl_peer_validation_trusted` (SpanMetrics dimensions). + +--- + +### 1.3 SpanMetrics — Derived Prometheus Metrics + +> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights. + +The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed. 
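
To make the RED derivation concrete, the following sketch computes a call counter and a cumulative latency histogram from a list of finished spans. It is a simplified model, not the connector's implementation: the function name `derive_red`, the tuple layout, and the status values `"OK"` / `"ERROR"` are assumptions for illustration, and the bucket bounds are the ones listed for `traces_span_metrics_duration_milliseconds_bucket` in this section.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Upper bounds in milliseconds, as listed in this section; a final +Inf slot is implied.
BUCKETS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]


def derive_red(spans):
    """Derive call counts and cumulative duration buckets from spans.

    `spans` is an iterable of (span_name, status_code, duration_ms) tuples
    (a hypothetical, simplified span shape).
    """
    calls = Counter()
    buckets = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))
    for name, status, duration_ms in spans:
        # Rate and errors: one counter series per (span_name, status_code).
        calls[(name, status)] += 1
        # Duration: find the first bucket whose upper bound is >= the duration,
        # then increment it and every larger bucket (Prometheus buckets are cumulative).
        first = bisect_left(BUCKETS_MS, duration_ms)
        for i in range(first, len(BUCKETS_MS) + 1):
            buckets[name][i] += 1
    return calls, dict(buckets)


if __name__ == "__main__":
    demo = [
        ("rpc.command.server_info", "OK", 3.0),
        ("rpc.command.server_info", "ERROR", 120.0),
    ]
    calls, buckets = derive_red(demo)
    print(calls)
    print(buckets)
```

The error rate then falls out as the ratio of the `ERROR` series to the sum over all status codes, which is the same shape of query the span-derived dashboards rely on.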
+ +| Prometheus Metric | Type | Description | +| -------------------------------------------------- | --------- | ------------------------------------------------------------------------------ | +| `traces_span_metrics_calls_total` | Counter | Total span invocations | +| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) | +| `traces_span_metrics_duration_milliseconds_count` | Histogram | Observation count | +| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency | + +**Standard labels on every metric**: `span_name`, `status_code`, `service_name`, `span_kind` + +**Additional dimension labels** (configured in `otel-collector-config.yaml`): + +| Span Attribute | Prometheus Label | Applies To | +| ------------------------------ | ------------------------------ | ------------------------- | +| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` | +| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` | +| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` | +| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` | +| `xrpl.peer.proposal.trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` | +| `xrpl.peer.validation.trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` | + +**Where to query**: Prometheus → `traces_span_metrics_calls_total{span_name="rpc.command.server_info"}` + +--- + +## 2. StatsD Metrics (beast::insight) + +> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory. + +These are system-level metrics emitted by rippled's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans. 
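
As a rough illustration of what travels over UDP, the sketch below formats StatsD lines and checks whether their type suffix is one the OTel StatsD receiver recognizes (`c`, `g`, `ms`, `h`, `s`, as noted in §2.2). The helper names are hypothetical, and the dotted `prefix.group.name` layout on the wire is an assumption; only the `name:value|type` shape is standard StatsD.

```python
import socket

# Type suffixes the OTel StatsD receiver recognizes, per the note in section 2.2.
RECOGNIZED_TYPES = {"c", "g", "ms", "h", "s"}


def format_statsd(name, value, metric_type, prefix="rippled"):
    """Build one StatsD line; the dotted prefix layout is an assumption."""
    return f"{prefix}.{name}:{value}|{metric_type}"


def statsd_type(line):
    """Return the type field of a StatsD line (the part after the first '|')."""
    parts = line.split("|")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"not a StatsD line: {line!r}")
    return parts[1]


def is_recognized(line):
    """True if the receiver would accept this line's metric type."""
    return statsd_type(line) in RECOGNIZED_TYPES


def send(line, host="127.0.0.1", port=8125):
    """Send one line as a UDP datagram (fire-and-forget, like StatsD itself)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    gauge = format_statsd("LedgerMaster.Validated_Ledger_Age", 3, "g")
    meter = format_statsd("warn", 1, "m")  # the non-standard meter type
    print(gauge, is_recognized(gauge))
    print(meter, is_recognized(meter))
```

Sending a hand-built gauge with `send(...)` to a local collector can be a quick way to confirm the receiver and the Prometheus export path are wired up before pointing a real node at it.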
+ +### Configuration + +```ini +[insight] +server=statsd +address=127.0.0.1:8125 +prefix=rippled +``` + +### 2.1 Gauges + +| Prometheus Metric | Source File | Description | Typical Range | +| --------------------------------------------------- | --------------------- | ---------------------------------------- | ------------------------------- | +| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) | +| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) | +| `rippled_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic | +| `rippled_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic | +| `rippled_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic | +| `rippled_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic | +| `rippled_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) | +| `rippled_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low | +| `rippled_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low | +| `rippled_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low | +| `rippled_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low | +| `rippled_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) | +| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 0–85 | +| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer 
connections | 10–21 | +| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth | +| `rippled_job_count` | JobQueue.cpp | Current job queue depth | 0–100 (healthy) | + +**Grafana dashboard**: _Node Health (StatsD)_ (`rippled-statsd-node-health`) + +### 2.2 Counters + +| Prometheus Metric | Source File | Description | +| --------------------------------- | ------------------ | --------------------------------------------- | +| `rippled_rpc_requests` | ServerHandler.cpp | Total RPC requests received | +| `rippled_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts | +| `rippled_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected | +| `rippled_warn` | Logic.h | Resource manager warnings issued | +| `rippled_drop` | Logic.h | Resource manager drops (connections rejected) | + +**Note**: `rippled_warn` and `rippled_drop` use non-standard StatsD meter type (`|m`). The OTel StatsD receiver only recognizes `|c`, `|g`, `|ms`, `|h`, `|s` — these metrics may be silently dropped. See Known Issues below. + +**Grafana dashboard**: _RPC & Pathfinding (StatsD)_ (`rippled-statsd-rpc`) + +### 2.3 Histograms (from StatsD timers) + +| Prometheus Metric | Source File | Unit | Description | +| ----------------------- | ----------------- | ----- | ------------------------------ | +| `rippled_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution | +| `rippled_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution | +| `rippled_ios_latency` | Application.cpp | ms | I/O service loop latency | +| `rippled_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration | +| `rippled_pathfind_full` | PathRequests.h | ms | Full pathfinding duration | + +Quantiles collected: 0th, 50th, 90th, 95th, 99th, 100th percentile. 
+ +**Grafana dashboards**: _Node Health_ (`ios_latency`), _RPC & Pathfinding_ (`rpc_time`, `rpc_size`, `pathfind_*`) + +### 2.4 Overlay Traffic Metrics + +For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), four gauges are emitted: + +- `rippled_{category}_Bytes_In` +- `rippled_{category}_Bytes_Out` +- `rippled_{category}_Messages_In` +- `rippled_{category}_Messages_Out` + +**Key categories**: + +| Category | Description | +| ----------------------------------------------------------------- | -------------------------- | +| `total` | All traffic aggregated | +| `overhead` / `overhead_overlay` | Protocol overhead | +| `transactions` / `transactions_duplicate` | Transaction relay | +| `proposals` / `proposals_untrusted` / `proposals_duplicate` | Consensus proposals | +| `validations` / `validations_untrusted` / `validations_duplicate` | Consensus validations | +| `ledger_data_get` / `ledger_data_share` | Ledger data exchange | +| `ledger_data_Transaction_Node_get/share` | Transaction node data | +| `ledger_data_Account_State_Node_get/share` | Account state node data | +| `ledger_data_Transaction_Set_candidate_get/share` | Transaction set candidates | +| `getObject` / `haveTxSet` / `ledgerData` | Object requests | +| `ping` / `status` | Keepalive and status | +| `set_get` | Set requests | + +**Grafana dashboards**: _Network Traffic_ (`rippled-statsd-network`), _Overlay Traffic Detail_ (`rippled-statsd-overlay-detail`), _Ledger Data & Sync_ (`rippled-statsd-ledger-sync`) + +--- + +## 3. Grafana Dashboard Reference + +> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples. 
+ +### 3.1 Span-Derived Dashboards (5) + +| Dashboard | UID | Data Source | Key Panels | +| -------------------- | ---------------------- | ------------------------ | ---------------------------------------------------------------------------------- | +| RPC Performance | `rippled-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands | +| Transaction Overview | `rippled-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap | +| Consensus Health | `rippled-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap | +| Ledger Operations | `rippled-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison | +| Peer Network | `rippled-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown | + +### 3.2 StatsD Dashboards (5) + +| Dashboard | UID | Data Source | Key Panels | +| ---------------------- | ------------------------------- | ------------------- | --------------------------------------------------------------------------------- | +| Node Health | `rippled-statsd-node-health` | Prometheus (StatsD) | Ledger age, operating mode, I/O latency, job queue, fetch rate | +| Network Traffic | `rippled-statsd-network` | Prometheus (StatsD) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category | +| RPC & Pathfinding | `rippled-statsd-rpc` | Prometheus (StatsD) | RPC rate, response time/size, pathfinding duration, resource warnings/drops | +| Overlay Traffic Detail | `rippled-statsd-overlay-detail` | Prometheus (StatsD) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths | +| Ledger Data & Sync | `rippled-statsd-ledger-sync` | Prometheus (StatsD) | Ledger data exchange, legacy ledger share/get, 
getobject by type, traffic heatmap | + +### 3.3 Accessing the Dashboards + +1. Open Grafana at **http://localhost:3000** +2. Navigate to the **Dashboards → rippled** folder +3. All 10 dashboards are auto-provisioned from `docker/telemetry/grafana/dashboards/` + +--- + +## 4. Jaeger Trace Search Guide + +> **See also**: [08-appendix.md](./08-appendix.md) §8.2 for span hierarchy visualizations. [05-configuration-reference.md](./05-configuration-reference.md) §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger. + +### Finding Traces by Type + +| What to Find | Jaeger Search Parameters | +| ------------------------ | ---------------------------------------------------------- | +| All RPC calls | Service: `rippled`, Operation: `rpc.request` | +| Specific RPC command | Operation: `rpc.command.server_info` (or any command name) | +| Slow RPC calls | Operation: `rpc.command.*`, Min Duration: `100ms` | +| Failed RPC calls | Tag: `xrpl.rpc.status=error` | +| Specific transaction | Tag: `xrpl.tx.hash=` | +| Local transactions only | Tag: `xrpl.tx.local=true` | +| Consensus rounds | Operation: `consensus.accept` | +| Rounds by mode | Tag: `xrpl.consensus.mode=proposing` | +| Specific ledger | Tag: `xrpl.ledger.seq=12345` | +| Peer proposals (trusted) | Tag: `xrpl.peer.proposal.trusted=true` | + +### Trace Structure + +A typical RPC trace shows this span hierarchy: + +``` +rpc.request (ServerHandler) + └── rpc.process (ServerHandler) + └── rpc.command.server_info (RPCHandler) +``` + +A consensus round produces independent top-level spans (not linked as parent-child); the only nesting is `tx.apply` under `ledger.build`: + +``` +consensus.ledger_close (close event) +consensus.proposal.send (broadcast proposal) +ledger.build (build new ledger) + └── tx.apply (apply transaction set) +consensus.accept (accept result) +consensus.validation.send (send validation) +ledger.validate (promote to validated) +ledger.store (persist to DB) +``` + +--- + +## 5. 
Prometheus Query Examples + +> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus StatsD metrics with trace-derived metrics. + +### Span-Derived Metrics + +```promql +# RPC request rate by command (last 5 minutes) +sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m])) + +# RPC p95 latency by command +histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m]))) + +# Consensus round duration p95 +histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m]))) + +# Transaction processing rate (local vs relay) +sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])) + +# Trusted vs untrusted proposal rate +sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m])) +``` + +### StatsD Metrics + +```promql +# Validated ledger age (should be < 10s) +rippled_LedgerMaster_Validated_Ledger_Age + +# Active peer count +rippled_Peer_Finder_Active_Inbound_Peers + rippled_Peer_Finder_Active_Outbound_Peers + +# RPC response time p95 (last 5 minutes) +histogram_quantile(0.95, sum by (le) (rate(rippled_rpc_time_bucket[5m]))) + +# Total network bytes in (rate) +rate(rippled_total_Bytes_In[5m]) + +# Time accumulated in the Full operating mode (should grow once the node is synced) +rippled_State_Accounting_Full_duration +``` + +--- + +## 6. 
Known Issues + +| Issue | Impact | Status | +| ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- | +| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m` → `\|c` change in StatsDCollector.cpp | +| `rippled_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires job queue activity | +| `rippled_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg | +| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet | + +--- + +## 7. Privacy and Data Collection + +The telemetry system is designed with privacy in mind: + +- **No private keys** are ever included in spans or metrics +- **No account balances** or other financial data are traced +- **Transaction hashes** are included (public on-ledger data) but not transaction contents +- **Peer IDs** are internal identifiers, not IP addresses +- **All telemetry is opt-in** — disabled by default at build time (`-Dtelemetry=OFF`) +- **Sampling** reduces data volume — `sampling_ratio=0.01` recommended for production +- **Data stays local** — the default stack sends data to `localhost` only + +--- + +## 8. Configuration Quick Reference + +> **Full reference**: [05-configuration-reference.md](./05-configuration-reference.md) §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production). 
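
Before pointing rippled at the collector, it can help to smoke-test the StatsD path used by the `[insight]` settings in this section. The sketch below is a rough illustration, not the `beast::insight` implementation. It assumes the collector listens on UDP `127.0.0.1:8125`, that metric names are joined to the prefix with a dot, and that the receiver accepts the core StatsD type suffixes `g`, `c`, and `ms`. The metric name `smoke_test` and both helper functions are hypothetical.

```python
import socket

# Core StatsD type suffixes: gauge, counter, timer.
STATSD_TYPES = {"g", "c", "ms"}


def statsd_line(prefix, name, value, kind):
    """Format one StatsD datagram as `<prefix>.<name>:<value>|<type>`."""
    if kind not in STATSD_TYPES:
        # Mirrors the Known Issues entry: a non-standard suffix such as `m`
        # is rejected here rather than being silently dropped downstream.
        raise ValueError(f"unsupported StatsD type: {kind!r}")
    return f"{prefix}.{name}:{value}|{kind}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; fire-and-forget, with no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_statsd(statsd_line("rippled", "smoke_test", 1, "g"))` sends a single gauge sample. If the pipeline is wired as described, a series such as `rippled_smoke_test` would be expected to show up in Prometheus shortly afterwards, though the exact exported name depends on the collector configuration.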
+ +### Minimal Setup (development) + +```ini +[telemetry] +enabled=1 + +[insight] +server=statsd +address=127.0.0.1:8125 +prefix=rippled +``` + +### Production Setup + +```ini +[telemetry] +enabled=1 +endpoint=http://otel-collector:4318/v1/traces +sampling_ratio=0.01 +trace_peer=0 +batch_size=1024 +max_queue_size=4096 + +[insight] +server=statsd +address=otel-collector:8125 +prefix=rippled +``` + +### Trace Category Toggle + +| Config Key | Default | Controls | +| -------------------- | ------- | ---------------------------- | +| `trace_rpc` | `1` | `rpc.*` spans | +| `trace_transactions` | `1` | `tx.*` spans | +| `trace_consensus` | `1` | `consensus.*` spans | +| `trace_ledger` | `1` | `ledger.*` spans | +| `trace_peer` | `0` | `peer.*` spans (high volume) | diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md index fb9f037c00..bd79489b79 100644 --- a/OpenTelemetryPlan/OpenTelemetryPlan.md +++ b/OpenTelemetryPlan/OpenTelemetryPlan.md @@ -55,6 +55,7 @@ flowchart TB backends["07-observability-backends.md"] appendix["08-appendix.md"] poc["POC_taskList.md"] + dataref["09-data-collection-reference.md"] end overview --> fundamentals @@ -71,6 +72,7 @@ flowchart TB phases --> backends backends --> appendix phases --> poc + appendix --> dataref style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px style fundamentals fill:#00695c,stroke:#004d40,color:#fff @@ -87,6 +89,7 @@ flowchart TB style backends fill:#4a148c,stroke:#2e0d57,color:#fff style appendix fill:#4a148c,stroke:#2e0d57,color:#fff style poc fill:#4a148c,stroke:#2e0d57,color:#fff + style dataref fill:#4a148c,stroke:#2e0d57,color:#fff ``` @@ -95,18 +98,19 @@ flowchart TB ## Table of Contents -| Section | Document | Description | -| ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- | -| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing 
concepts, span relationships, context propagation | -| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities | -| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation | -| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization | -| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules | -| **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations | -| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics | -| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture | -| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history | -| **POC** | [POC Task List](./POC_taskList.md) | Proof of concept tasks for RPC tracing end-to-end demo | +| Section | Document | Description | +| ------- | -------------------------------------------------------------- | ---------------------------------------------------------------------- | +| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing concepts, span relationships, context propagation | +| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities | +| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation | +| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization | +| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules | +| 
**5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations | +| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics | +| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture | +| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history | +| **9** | [Data Collection Reference](./09-data-collection-reference.md) | Complete inventory of spans, attributes, metrics, and dashboards | +| **POC** | [POC Task List](./POC_taskList.md) | Proof of concept tasks for RPC tracing end-to-end demo | --- @@ -219,6 +223,14 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re --- +## 9. Data Collection Reference + +A single-source-of-truth reference documenting every piece of telemetry data collected by rippled. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 10 Grafana dashboards. Includes Jaeger search guides and Prometheus query examples. + +➡️ **[View Data Collection Reference](./09-data-collection-reference.md)** + +--- + ## POC Task List A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana. 
diff --git a/docker/telemetry/TESTING.md b/docker/telemetry/TESTING.md index 24220c49af..9b88429f68 100644 --- a/docker/telemetry/TESTING.md +++ b/docker/telemetry/TESTING.md @@ -444,11 +444,69 @@ curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"r | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}' ``` +### StatsD Metrics (beast::insight) + +rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector +on port 8125. These appear in Prometheus alongside spanmetrics. + +Requires `[insight]` config in `xrpld.cfg`: + +```ini +[insight] +server=statsd +address=127.0.0.1:8125 +prefix=rippled +``` + +Verify StatsD metrics in Prometheus: + +```bash +# Ledger age gauge +curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result' + +# Peer counts +curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result' + +# RPC request counter +curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result' + +# State accounting +curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result' + +# Overlay traffic +curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result' +``` + +Key StatsD metrics (prefix `rippled_`): + +| Metric | Type | Source | +| ------------------------------------- | --------- | ----------------------------------------- | +| `LedgerMaster_Validated_Ledger_Age` | gauge | LedgerMaster.h:373 | +| `LedgerMaster_Published_Ledger_Age` | gauge | LedgerMaster.h:374 | +| `State_Accounting_{Mode}_duration` | gauge | NetworkOPs.cpp:774 | +| `State_Accounting_{Mode}_transitions` | gauge | NetworkOPs.cpp:780 | +| `Peer_Finder_Active_Inbound_Peers` | gauge | PeerfinderManager.cpp:214 | +| `Peer_Finder_Active_Outbound_Peers` | gauge | PeerfinderManager.cpp:215 | +| `Overlay_Peer_Disconnects` | gauge | OverlayImpl.h:557 | +| `job_count` | gauge | 
JobQueue.cpp:26 | +| `rpc_requests` | counter | ServerHandler.cpp:108 | +| `rpc_time` | histogram | ServerHandler.cpp:110 | +| `rpc_size` | histogram | ServerHandler.cpp:109 | +| `ios_latency` | histogram | Application.cpp:438 | +| `pathfind_fast` | histogram | PathRequests.h:23 | +| `pathfind_full` | histogram | PathRequests.h:24 | +| `ledger_fetches` | counter | InboundLedgers.cpp:44 | +| `ledger_history_mismatch` | counter | LedgerHistory.cpp:16 | +| `warn` | counter | Logic.h:33 | +| `drop` | counter | Logic.h:34 | +| `{category}_Bytes_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) | +| `{category}_Messages_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) | + ### Grafana Open http://localhost:3000 (anonymous admin access enabled). -Pre-configured dashboards: +Pre-configured dashboards (span-derived): - **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate - **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate @@ -456,9 +514,16 @@ Pre-configured dashboards: - **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics - **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`) +Pre-configured dashboards (StatsD): + +- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue +- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category +- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings + Pre-configured datasources: - **Jaeger**: Trace data at `http://jaeger:16686` +- **Tempo**: Trace data at `http://tempo:3200` (via Grafana Explore) - **Prometheus**: Metrics at `http://prometheus:9090` --- diff --git a/docker/telemetry/docker-compose.yml b/docker/telemetry/docker-compose.yml index 1178274b83..2dc6b8f9a3 100644 --- a/docker/telemetry/docker-compose.yml +++ 
b/docker/telemetry/docker-compose.yml @@ -27,7 +27,8 @@ services: ports: - "4317:4317" # OTLP gRPC - "4318:4318" # OTLP HTTP - - "8889:8889" # Prometheus metrics (spanmetrics) + - "8125:8125/udp" # StatsD UDP (beast::insight metrics) + - "8889:8889" # Prometheus metrics (spanmetrics + statsd) - "13133:13133" # Health check volumes: - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro diff --git a/docker/telemetry/grafana/dashboards/consensus-health.json b/docker/telemetry/grafana/dashboards/consensus-health.json index 331c2ab042..8b3719dd34 100644 --- a/docker/telemetry/grafana/dashboards/consensus-health.json +++ b/docker/telemetry/grafana/dashboards/consensus-health.json @@ -449,6 +449,6 @@ "from": "now-1h", "to": "now" }, - "title": "rippled Consensus Health", + "title": "Consensus Health", "uid": "rippled-consensus" } diff --git a/docker/telemetry/grafana/dashboards/ledger-operations.json b/docker/telemetry/grafana/dashboards/ledger-operations.json index fc19cb6898..67711e4fa8 100644 --- a/docker/telemetry/grafana/dashboards/ledger-operations.json +++ b/docker/telemetry/grafana/dashboards/ledger-operations.json @@ -325,7 +325,7 @@ { "name": "node", "label": "Node", - "description": "Filter by rippled node (service.instance.id — e.g. Node-1)", + "description": "Filter by rippled node (service.instance.id \u2014 e.g. 
Node-1)", "type": "query", "query": "label_values(traces_span_metrics_calls_total, exported_instance)", "datasource": { @@ -348,6 +348,6 @@ "from": "now-1h", "to": "now" }, - "title": "rippled Ledger Operations", + "title": "Ledger Operations", "uid": "rippled-ledger-ops" } diff --git a/docker/telemetry/grafana/dashboards/peer-network.json b/docker/telemetry/grafana/dashboards/peer-network.json index 3e3e0d97a5..9740b04366 100644 --- a/docker/telemetry/grafana/dashboards/peer-network.json +++ b/docker/telemetry/grafana/dashboards/peer-network.json @@ -159,7 +159,7 @@ { "name": "node", "label": "Node", - "description": "Filter by rippled node (service.instance.id — e.g. Node-1)", + "description": "Filter by rippled node (service.instance.id \u2014 e.g. Node-1)", "type": "query", "query": "label_values(traces_span_metrics_calls_total, exported_instance)", "datasource": { @@ -222,6 +222,6 @@ "from": "now-1h", "to": "now" }, - "title": "rippled Peer Network", + "title": "Peer Network", "uid": "rippled-peer-net" } diff --git a/docker/telemetry/grafana/dashboards/rpc-performance.json b/docker/telemetry/grafana/dashboards/rpc-performance.json index 73d73d6ea6..dec11c506d 100644 --- a/docker/telemetry/grafana/dashboards/rpc-performance.json +++ b/docker/telemetry/grafana/dashboards/rpc-performance.json @@ -371,6 +371,6 @@ "from": "now-1h", "to": "now" }, - "title": "rippled RPC Performance", + "title": "RPC Performance", "uid": "rippled-rpc-perf" } diff --git a/docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json b/docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json new file mode 100644 index 0000000000..502d78e7aa --- /dev/null +++ b/docker/telemetry/grafana/dashboards/statsd-ledger-data-sync.json @@ -0,0 +1,506 @@ +{ + "annotations": { + "list": [] + }, + "description": "Ledger data exchange and object fetch traffic from beast::insight StatsD. Covers ledger sync, node data retrieval, and transaction set exchange. 
Requires [insight] server=statsd in rippled config.", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "title": "Ledger Data Exchange (Bytes In)", + "description": "Inbound bytes for ledger data sub-categories. 'ledger_data' = aggregated ledger data, sub-types include Transaction_Set_candidate (proposed tx sets), Transaction_Node (tx tree nodes), and Account_State_Node (state tree nodes). High Account_State_Node traffic indicates state sync; high Transaction_Set_candidate indicates consensus catch-up. Sourced from TrafficCount.h ledger_data_* categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_get_Bytes_In", + "legendFormat": "Ledger Data Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_share_Bytes_In", + "legendFormat": "Ledger Data Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In", + "legendFormat": "TX Set Candidate Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_Transaction_Set_candidate_share_Bytes_In", + "legendFormat": "TX Set Candidate Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_Transaction_Node_get_Bytes_In", + "legendFormat": "TX Node Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_Transaction_Node_share_Bytes_In", + "legendFormat": "TX Node Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_data_Account_State_Node_get_Bytes_In", + "legendFormat": "Account State Node Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": 
"rippled_ledger_data_Account_State_Node_share_Bytes_In", + "legendFormat": "Account State Node Share" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes In", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Ledger Share/Get Traffic (Bytes)", + "description": "Legacy ledger share and get traffic by sub-type. These are the older ledger fetch protocol categories (as opposed to ledger_data_* which is the newer protocol). Sub-types: Transaction_Set_candidate, Transaction_node, Account_State_node, plus aggregate ledger_share and ledger_get. Sourced from TrafficCount.h ledger_* categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_share_Bytes_In", + "legendFormat": "Ledger Share In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_get_Bytes_In", + "legendFormat": "Ledger Get In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In", + "legendFormat": "TX Set Candidate Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_Transaction_Set_candidate_get_Bytes_In", + "legendFormat": "TX Set Candidate Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_Transaction_node_share_Bytes_In", + "legendFormat": "TX Node Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_Transaction_node_get_Bytes_In", + "legendFormat": "TX Node Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ledger_Account_State_node_share_Bytes_In", + "legendFormat": "Account State Share" + }, + { + "datasource": { + "type": "prometheus" + 
}, + "expr": "rippled_ledger_Account_State_node_get_Bytes_In", + "legendFormat": "Account State Get" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes In", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "GetObject Traffic by Type (Bytes In)", + "description": "Object fetch traffic by object type. GetObject is the protocol for fetching specific SHAMap nodes. Types: Ledger (full ledger headers), Transaction (individual txs), Transaction_node (tx tree nodes), Account_State_node (state tree nodes), CAS (Content Addressable Storage objects), Fetch_Pack (batch fetch during catch-up), Transactions (bulk tx fetch). High Fetch_Pack traffic indicates a node is catching up. Sourced from TrafficCount.h getobject_* categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Ledger_get_Bytes_In", + "legendFormat": "Ledger Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Ledger_share_Bytes_In", + "legendFormat": "Ledger Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_get_Bytes_In", + "legendFormat": "Transaction Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_share_Bytes_In", + "legendFormat": "Transaction Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_node_get_Bytes_In", + "legendFormat": "TX Node Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_node_share_Bytes_In", + "legendFormat": "TX Node Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": 
"rippled_getobject_Account_State_node_get_Bytes_In", + "legendFormat": "Account State Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Account_State_node_share_Bytes_In", + "legendFormat": "Account State Share" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes In", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "GetObject Aggregate & Special Types (Bytes In)", + "description": "Aggregate getobject traffic plus special categories: CAS (Content Addressable Storage) for SHAMap node fetch, Fetch_Pack for bulk batch downloads during catch-up, Transactions for bulk tx fetch, and the aggregate getobject_get/getobject_share totals. Sourced from TrafficCount.h getobject_* categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_CAS_get_Bytes_In", + "legendFormat": "CAS Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_CAS_share_Bytes_In", + "legendFormat": "CAS Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Fetch_Pack_share_Bytes_In", + "legendFormat": "Fetch Pack Share" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Fetch_Pack_get_Bytes_In", + "legendFormat": "Fetch Pack Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transactions_get_Bytes_In", + "legendFormat": "Transactions Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_get_Bytes_In", + "legendFormat": "Aggregate Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_share_Bytes_In", + "legendFormat": 
"Aggregate Share" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes In", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "GetObject Messages by Type", + "description": "Message counts for object fetch operations. Shows how many individual fetch requests and responses are exchanged per type. High message counts with low byte counts indicate small object fetches; the inverse indicates large batch transfers. Sourced from TrafficCount.h getobject_* categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Ledger_get_Messages_In", + "legendFormat": "Ledger Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_get_Messages_In", + "legendFormat": "Transaction Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transaction_node_get_Messages_In", + "legendFormat": "TX Node Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Account_State_node_get_Messages_In", + "legendFormat": "Account State Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_CAS_get_Messages_In", + "legendFormat": "CAS Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Fetch_Pack_get_Messages_In", + "legendFormat": "Fetch Pack Get" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_getobject_Transactions_get_Messages_In", + "legendFormat": "Transactions Get" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages In", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 
3 + } + }, + "overrides": [] + } + }, + { + "title": "Overlay Traffic Heatmap (All Categories, Bytes In)", + "description": "Bar gauge showing all overlay traffic categories ranked by inbound bytes. Provides a complete at-a-glance view of which protocol message types consume the most bandwidth across all 57+ traffic categories. Sourced from all TrafficCount.h categories via wildcard match.", + "type": "bargauge", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + }, + "displayMode": "gradient", + "orientation": "horizontal", + "reduceOptions": { + "calcs": ["lastNotNull"], + "fields": "", + "values": false + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "topk(20, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", + "legendFormat": "{{__name__}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 1048576 + }, + { + "color": "red", + "value": 104857600 + } + ] + } + }, + "overrides": [] + } + } + ], + "schemaVersion": 39, + "tags": ["rippled", "statsd", "ledger", "sync", "telemetry"], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "title": "Ledger Data & Sync (StatsD)", + "uid": "rippled-statsd-ledger-sync" +} diff --git a/docker/telemetry/grafana/dashboards/statsd-network-traffic.json b/docker/telemetry/grafana/dashboards/statsd-network-traffic.json new file mode 100644 index 0000000000..8dc072ba23 --- /dev/null +++ b/docker/telemetry/grafana/dashboards/statsd-network-traffic.json @@ -0,0 +1,671 @@ +{ + "annotations": { + "list": [] + }, + "description": "Network traffic and peer metrics from beast::insight StatsD. 
Requires [insight] server=statsd in rippled config.", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "title": "Active Peers", + "description": "Number of active inbound and outbound peer connections. Sourced from Peer_Finder.Active_Inbound_Peers and Peer_Finder.Active_Outbound_Peers gauges (PeerfinderManager.cpp:214-215). A healthy mainnet node typically has 10-21 outbound and 0-85 inbound peers depending on configuration.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_Peer_Finder_Active_Inbound_Peers", + "legendFormat": "Inbound Peers" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_Peer_Finder_Active_Outbound_Peers", + "legendFormat": "Outbound Peers" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Peers", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Peer Disconnects", + "description": "Cumulative count of peer disconnections. Sourced from the Overlay.Peer_Disconnects gauge (OverlayImpl.h:557). 
A rising trend indicates network instability, aggressive peer management, or resource exhaustion causing connection drops.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_Overlay_Peer_Disconnects", + "legendFormat": "Disconnects" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Disconnects", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Total Network Bytes", + "description": "Total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Provides a high-level view of network bandwidth consumption.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_total_Bytes_In", + "legendFormat": "Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_total_Bytes_Out", + "legendFormat": "Bytes Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Total Network Messages", + "description": "Total messages sent and received across all peer connections. Sourced from the total.Messages_In and total.Messages_Out traffic category gauges (OverlayImpl.h:535-548). 
Shows the overall message throughput of the overlay network.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_total_Messages_In", + "legendFormat": "Messages In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_total_Messages_Out", + "legendFormat": "Messages Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Transaction Traffic", + "description": "Message counts for transaction-related overlay traffic. Includes the transactions and transactions_duplicate traffic categories (OverlayImpl/TrafficCount.h). Spikes indicate high transaction volume on the network or transaction flooding.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_transactions_Messages_In", + "legendFormat": "TX Messages In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_transactions_Messages_Out", + "legendFormat": "TX Messages Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_transactions_duplicate_Messages_In", + "legendFormat": "TX Duplicate In" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Proposal Traffic", + "description": "Messages for consensus proposal overlay traffic. Includes proposals, proposals_untrusted, and proposals_duplicate categories (TrafficCount.h). 
High untrusted or duplicate counts may indicate UNL misconfiguration or network spam.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proposals_Messages_In", + "legendFormat": "Proposals In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proposals_Messages_Out", + "legendFormat": "Proposals Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proposals_untrusted_Messages_In", + "legendFormat": "Untrusted In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proposals_duplicate_Messages_In", + "legendFormat": "Duplicate In" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Validation Traffic", + "description": "Messages for validation overlay traffic. Includes validations, validations_untrusted, and validations_duplicate categories (TrafficCount.h). 
Monitoring trusted vs untrusted validation traffic helps detect UNL health issues.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validations_Messages_In", + "legendFormat": "Validations In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validations_Messages_Out", + "legendFormat": "Validations Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validations_untrusted_Messages_In", + "legendFormat": "Untrusted In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validations_duplicate_Messages_In", + "legendFormat": "Duplicate In" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Overlay Traffic by Category (Bytes In)", + "description": "Top traffic categories by inbound bytes. Includes all 57 overlay traffic categories from TrafficCount.h. Shows which protocol message types consume the most bandwidth. 
Categories include transactions, proposals, validations, ledger data, getobject, and overlay overhead.", + "type": "bargauge", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})", + "legendFormat": "{{__name__}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "rippled_transactions_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Transactions" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_proposals_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Proposals" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_validations_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Validations" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_overhead_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Overhead" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_overhead_overlay_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Overhead Overlay" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ping_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Ping" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_status_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Status" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_getObject_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Get Object" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_haveTxSet_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + 
"value": "Have Tx Set" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledgerData_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Ledger Data" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Ledger Share" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_get_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Ledger Data Get" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Ledger Data Share" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_Account_State_Node_get_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Account State Node Get" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_Account_State_Node_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Account State Node Share" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_Transaction_Node_get_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Transaction Node Get" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_Transaction_Node_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Transaction Node Share" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Tx Set Candidate Get" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_Account_State_node_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Account State Node Share (Legacy)" + } + ] + }, + { + "matcher": { + "id": 
"byName", + "options": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Tx Set Candidate Share" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_ledger_Transaction_node_share_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Transaction Node Share (Legacy)" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "rippled_set_get_Bytes_In" + }, + "properties": [ + { + "id": "displayName", + "value": "Set Get" + } + ] + } + ] + } + } + ], + "schemaVersion": 39, + "tags": ["rippled", "statsd", "network", "telemetry"], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "title": "Network Traffic (StatsD)", + "uid": "rippled-statsd-network" +} diff --git a/docker/telemetry/grafana/dashboards/statsd-node-health.json b/docker/telemetry/grafana/dashboards/statsd-node-health.json new file mode 100644 index 0000000000..215187f382 --- /dev/null +++ b/docker/telemetry/grafana/dashboards/statsd-node-health.json @@ -0,0 +1,415 @@ +{ + "annotations": { + "list": [] + }, + "description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "title": "Validated Ledger Age", + "description": "Age of the most recently validated ledger in seconds. Sourced from the LedgerMaster.Validated_Ledger_Age gauge (LedgerMaster.h:373) which is updated every collection interval via the insight hook. 
Values above 20s indicate the node is falling behind the network.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_LedgerMaster_Validated_Ledger_Age", + "legendFormat": "Validated Age" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s", + "thresholds": { + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 10 + }, + { + "color": "red", + "value": 20 + } + ] + } + }, + "overrides": [] + } + }, + { + "title": "Published Ledger Age", + "description": "Age of the most recently published ledger in seconds. Sourced from the LedgerMaster.Published_Ledger_Age gauge (LedgerMaster.h:374). Published ledger age should track close to validated ledger age. A growing gap indicates publish pipeline backlog.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_LedgerMaster_Published_Ledger_Age", + "legendFormat": "Published Age" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s", + "thresholds": { + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 10 + }, + { + "color": "red", + "value": 20 + } + ] + } + }, + "overrides": [] + } + }, + { + "title": "Operating Mode Duration", + "description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). 
A healthy node should spend the vast majority of time in Full mode.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Full_duration", + "legendFormat": "Full" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Tracking_duration", + "legendFormat": "Tracking" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Syncing_duration", + "legendFormat": "Syncing" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Connected_duration", + "legendFormat": "Connected" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Disconnected_duration", + "legendFormat": "Disconnected" + } + ], + "fieldConfig": { + "defaults": { + "unit": "s", + "custom": { + "axisLabel": "Duration (Sec)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Operating Mode Transitions", + "description": "Count of transitions into each operating mode. Sourced from State_Accounting.*_transitions gauges (NetworkOPs.cpp:780-786). Frequent transitions out of Full mode indicate instability. 
Transitions to Disconnected or Syncing warrant investigation.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Full_transitions", + "legendFormat": "Full" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Tracking_transitions", + "legendFormat": "Tracking" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Syncing_transitions", + "legendFormat": "Syncing" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Connected_transitions", + "legendFormat": "Connected" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_State_Accounting_Disconnected_transitions", + "legendFormat": "Disconnected" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Transitions", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "I/O Latency", + "description": "P95 and P50 of the I/O service loop latency in milliseconds. Sourced from the ios_latency event (Application.cpp:438) which measures how long it takes for the io_context to process a timer callback. Values above 10ms are logged; above 500ms trigger warnings. 
High values indicate thread pool saturation or blocking operations.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ios_latency{quantile=\"0.95\"}", + "legendFormat": "P95 I/O Latency" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_ios_latency{quantile=\"0.5\"}", + "legendFormat": "P50 I/O Latency" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ms", + "custom": { + "axisLabel": "Latency (ms)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Job Queue Depth", + "description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_job_count", + "legendFormat": "Job Queue Depth" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Jobs", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Ledger Fetch Rate", + "description": "Rate of ledger fetch requests initiated by the node. Sourced from the ledger_fetches counter (InboundLedgers.cpp:44) which increments each time the node requests a ledger from a peer. 
High rates indicate the node is catching up or missing ledgers.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rate(rippled_ledger_fetches_total[5m])", + "legendFormat": "Fetches / Sec" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ops" + }, + "overrides": [] + } + }, + { + "title": "Ledger History Mismatches", + "description": "Rate of ledger history hash mismatches. Sourced from the ledger.history.mismatch counter (LedgerHistory.cpp:16) which increments when a built ledger hash does not match the expected validated hash. Non-zero values indicate consensus divergence or database corruption.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rate(rippled_ledger_history_mismatch_total[5m])", + "legendFormat": "Mismatches / Sec" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ops", + "thresholds": { + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 0.01 + } + ] + } + }, + "overrides": [] + } + } + ], + "schemaVersion": 39, + "tags": ["rippled", "statsd", "node-health", "telemetry"], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "title": "Node Health (StatsD)", + "uid": "rippled-statsd-node-health" +} diff --git a/docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json b/docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json new file mode 100644 index 0000000000..a09a2b5d17 --- /dev/null +++ b/docker/telemetry/grafana/dashboards/statsd-overlay-traffic-detail.json @@ -0,0 +1,566 @@ +{ + "annotations": { + "list": [] + }, + "description": "Detailed overlay traffic breakdown for 
categories not covered by the main Network Traffic dashboard. Includes squelch, overhead, validator lists, object fetch, ledger sync, and protocol negotiation traffic. Requires [insight] server=statsd in rippled config.", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "title": "Squelch Traffic (Messages)", + "description": "Squelch-related overlay messages. Squelch is the peer traffic management protocol that suppresses redundant message forwarding. 'squelch' = squelch control messages, 'squelch_suppressed' = messages suppressed by squelch, 'squelch_ignored' = squelch directives that were ignored. High suppressed counts indicate effective bandwidth savings; high ignored counts may indicate misconfigured peers. Sourced from TrafficCount.h squelch categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_Messages_In", + "legendFormat": "Squelch In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_Messages_Out", + "legendFormat": "Squelch Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_suppressed_Messages_In", + "legendFormat": "Suppressed In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_suppressed_Messages_Out", + "legendFormat": "Suppressed Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_ignored_Messages_In", + "legendFormat": "Ignored In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_squelch_ignored_Messages_Out", + "legendFormat": "Ignored Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + 
"pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Overhead Traffic Breakdown (Bytes)", + "description": "Overlay protocol overhead by sub-category. 'overhead' = base protocol overhead (ping, status, etc.), 'overhead_cluster' = intra-cluster communication overhead, 'overhead_manifest' = validator manifest distribution overhead. High cluster overhead may indicate frequent cluster state syncs; high manifest overhead occurs during UNL changes. Sourced from TrafficCount.h overhead categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_Bytes_In", + "legendFormat": "Base Overhead In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_Bytes_Out", + "legendFormat": "Base Overhead Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_cluster_Bytes_In", + "legendFormat": "Cluster In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_cluster_Bytes_Out", + "legendFormat": "Cluster Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_manifest_Bytes_In", + "legendFormat": "Manifest In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_overhead_manifest_Bytes_Out", + "legendFormat": "Manifest Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Validator List Traffic", + "description": "Validator list (UNL) distribution traffic. Validator lists are exchanged when peers share their trusted validator configurations. Spikes occur during UNL updates or when new peers connect. 
Sourced from TrafficCount.h validator_lists category.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validator_lists_Bytes_In", + "legendFormat": "Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validator_lists_Bytes_Out", + "legendFormat": "Bytes Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validator_lists_Messages_In", + "legendFormat": "Messages In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_validator_lists_Messages_Out", + "legendFormat": "Messages Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Count", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/Bytes/" + }, + "properties": [ + { + "id": "custom.axisPlacement", + "value": "right" + }, + { + "id": "unit", + "value": "decbytes" + } + ] + } + ] + } + }, + { + "title": "Set Get/Share Traffic (Bytes)", + "description": "Transaction set get and share traffic. 'set_get' = requests to fetch transaction sets (sent during ledger close), 'set_share' = responses sharing transaction sets. High set_get traffic indicates peers frequently requesting missing transaction sets, which may signal sync delays. 
Sourced from TrafficCount.h set_get/set_share categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_set_get_Bytes_In", + "legendFormat": "Set Get In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_set_get_Bytes_Out", + "legendFormat": "Set Get Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_set_share_Bytes_In", + "legendFormat": "Set Share In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_set_share_Bytes_Out", + "legendFormat": "Set Share Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Have/Requested Transactions (Messages)", + "description": "Transaction availability protocol messages. 'have_transactions' = advertisements that a peer has specific transactions available, 'requested_transactions' = explicit requests for transaction data. A high ratio of requested to have may indicate peers are behind on transaction propagation. 
Sourced from TrafficCount.h have_transactions/requested_transactions categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_have_transactions_Messages_In", + "legendFormat": "Have TX In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_have_transactions_Messages_Out", + "legendFormat": "Have TX Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_requested_transactions_Messages_In", + "legendFormat": "Requested TX In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_requested_transactions_Messages_Out", + "legendFormat": "Requested TX Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Messages", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Unknown / Unclassified Traffic", + "description": "Traffic that does not match any known overlay message category. Non-zero values may indicate protocol version mismatches, corrupted messages, or new message types not yet classified. 
Sourced from TrafficCount.h unknown category.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_unknown_Bytes_In", + "legendFormat": "Unknown Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_unknown_Bytes_Out", + "legendFormat": "Unknown Bytes Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_unknown_Messages_In", + "legendFormat": "Unknown Messages In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_unknown_Messages_Out", + "legendFormat": "Unknown Messages Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "axisLabel": "Count", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/Bytes/" + }, + "properties": [ + { + "id": "custom.axisPlacement", + "value": "right" + }, + { + "id": "unit", + "value": "decbytes" + } + ] + } + ] + } + }, + { + "title": "Proof Path Traffic", + "description": "Proof path request/response traffic for ledger state proof exchange. Used by peers to verify specific ledger entries without downloading the full ledger. High request volume may indicate peers validating state during catch-up. 
Sourced from TrafficCount.h proof_path_request/proof_path_response categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proof_path_request_Bytes_In", + "legendFormat": "Request Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proof_path_request_Bytes_Out", + "legendFormat": "Request Bytes Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proof_path_response_Bytes_In", + "legendFormat": "Response Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_proof_path_response_Bytes_Out", + "legendFormat": "Response Bytes Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Replay Delta Traffic", + "description": "Replay delta request/response traffic for ledger replay protocol. Used during catch-up to efficiently replay ledger state changes. 
Sourced from TrafficCount.h replay_delta_request/replay_delta_response categories.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_replay_delta_request_Bytes_In", + "legendFormat": "Request Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_replay_delta_request_Bytes_Out", + "legendFormat": "Request Bytes Out" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_replay_delta_response_Bytes_In", + "legendFormat": "Response Bytes In" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_replay_delta_response_Bytes_Out", + "legendFormat": "Response Bytes Out" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Bytes", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + } + ], + "schemaVersion": 39, + "tags": ["rippled", "statsd", "overlay", "network", "telemetry"], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "title": "Overlay Traffic Detail (StatsD)", + "uid": "rippled-statsd-overlay-detail" +} diff --git a/docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json b/docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json new file mode 100644 index 0000000000..10bf1575e3 --- /dev/null +++ b/docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json @@ -0,0 +1,396 @@ +{ + "annotations": { + "list": [] + }, + "description": "RPC and pathfinding metrics from beast::insight StatsD. 
Requires [insight] server=statsd in rippled config.", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "title": "RPC Request Rate (StatsD)", + "description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rate(rippled_rpc_requests_total[5m])", + "legendFormat": "Requests / Sec" + } + ], + "fieldConfig": { + "defaults": { + "unit": "reqps" + }, + "overrides": [] + } + }, + { + "title": "RPC Response Time (StatsD)", + "description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. 
Compare with span-based rpc.request duration.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.95\"}", + "legendFormat": "P95 Response Time" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.5\"}", + "legendFormat": "P50 Response Time" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ms", + "custom": { + "axisLabel": "Latency (ms)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "RPC Response Size", + "description": "P95 and P50 of RPC response payload size in bytes. Sourced from the rpc.size event (ServerHandler.cpp:109) which records the byte length of each RPC JSON response. Large responses may indicate expensive queries (e.g. account_tx with many results) or API misuse.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_size{quantile=\"0.95\"}", + "legendFormat": "P95 Response Size" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_size{quantile=\"0.5\"}", + "legendFormat": "P50 Response Size" + } + ], + "fieldConfig": { + "defaults": { + "unit": "decbytes", + "custom": { + "axisLabel": "Size (Bytes)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "RPC Response Time Distribution", + "description": "Distribution of RPC response times from the beast::insight timer showing P50, P90, P95, and P99 quantiles. Sourced from the rpc.time event (ServerHandler.cpp:110). 
Useful for detecting bimodal latency or long-tail requests.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.5\"}", + "legendFormat": "P50" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.9\"}", + "legendFormat": "P90" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.95\"}", + "legendFormat": "P95" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_rpc_time{quantile=\"0.99\"}", + "legendFormat": "P99" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ms", + "custom": { + "axisLabel": "Latency (ms)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Pathfinding Fast Duration", + "description": "P95 and P50 of fast pathfinding execution time. Sourced from the pathfind_fast event (PathRequests.h:23) which records the duration of the fast pathfinding algorithm. 
Fast pathfinding uses a simplified search that trades accuracy for speed.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_pathfind_fast{quantile=\"0.95\"}", + "legendFormat": "P95 Fast Pathfind" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_pathfind_fast{quantile=\"0.5\"}", + "legendFormat": "P50 Fast Pathfind" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ms", + "custom": { + "axisLabel": "Duration (ms)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Pathfinding Full Duration", + "description": "P95 and P50 of full pathfinding execution time. Sourced from the pathfind_full event (PathRequests.h:24) which records the duration of the exhaustive pathfinding search. Full pathfinding is more expensive and can take significantly longer than fast mode.", + "type": "timeseries", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_pathfind_full{quantile=\"0.95\"}", + "legendFormat": "P95 Full Pathfind" + }, + { + "datasource": { + "type": "prometheus" + }, + "expr": "rippled_pathfind_full{quantile=\"0.5\"}", + "legendFormat": "P50 Full Pathfind" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ms", + "custom": { + "axisLabel": "Duration (ms)", + "spanNulls": true, + "insertNulls": false, + "showPoints": "auto", + "pointSize": 3 + } + }, + "overrides": [] + } + }, + { + "title": "Resource Warnings Rate", + "description": "Rate of resource warning events from the Resource Manager. 
Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rate(rippled_warn_total[5m])", + "legendFormat": "Warnings / Sec" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ops", + "thresholds": { + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 0.1 + }, + { + "color": "red", + "value": 1 + } + ] + } + }, + "overrides": [] + } + }, + { + "title": "Resource Drops Rate", + "description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. 
NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).", + "type": "stat", + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 24 + }, + "options": { + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus" + }, + "expr": "rate(rippled_drop_total[5m])", + "legendFormat": "Drops / Sec" + } + ], + "fieldConfig": { + "defaults": { + "unit": "ops", + "thresholds": { + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 0.01 + }, + { + "color": "red", + "value": 0.1 + } + ] + } + }, + "overrides": [] + } + } + ], + "schemaVersion": 39, + "tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "title": "RPC & Pathfinding (StatsD)", + "uid": "rippled-statsd-rpc" +} diff --git a/docker/telemetry/grafana/dashboards/transaction-overview.json b/docker/telemetry/grafana/dashboards/transaction-overview.json index 10ff5f4d3c..d233110ce0 100644 --- a/docker/telemetry/grafana/dashboards/transaction-overview.json +++ b/docker/telemetry/grafana/dashboards/transaction-overview.json @@ -379,6 +379,6 @@ "from": "now-1h", "to": "now" }, - "title": "rippled Transaction Overview", + "title": "Transaction Overview", "uid": "rippled-transactions" } diff --git a/docker/telemetry/integration-test.sh b/docker/telemetry/integration-test.sh index dfed27e21e..52d7706e40 100755 --- a/docker/telemetry/integration-test.sh +++ b/docker/telemetry/integration-test.sh @@ -311,6 +311,11 @@ trace_consensus=1 trace_peer=1 trace_ledger=1 +[insight] +server=statsd +address=127.0.0.1:8125 +prefix=rippled + [rpc_startup] { "command": "log_level", "severity": "warning" } @@ -533,6 +538,44 @@ else fail "Grafana: not reachable at localhost:3000" fi +# --------------------------------------------------------------------------- +# Step 10b: Verify StatsD 
metrics in Prometheus +# --------------------------------------------------------------------------- +log "" +log "--- Phase 6: StatsD Metrics (beast::insight) ---" +log "Waiting 20s for StatsD aggregation + Prometheus scrape..." +sleep 20 + +check_statsd_metric() { + local metric_name="$1" + local result + result=$(curl -sf "$PROM/api/v1/query?query=$metric_name" \ + | jq '.data.result | length' 2>/dev/null || echo 0) + if [ "${result:-0}" -gt 0 ]; then + ok "StatsD: $metric_name ($result series)" + else + fail "StatsD: $metric_name (0 series)" + fi +} + +# Node health gauges +check_statsd_metric "rippled_LedgerMaster_Validated_Ledger_Age" +check_statsd_metric "rippled_LedgerMaster_Published_Ledger_Age" +check_statsd_metric "rippled_job_count" + +# State accounting +check_statsd_metric "rippled_State_Accounting_Full_duration" + +# Peer finder +check_statsd_metric "rippled_Peer_Finder_Active_Inbound_Peers" +check_statsd_metric "rippled_Peer_Finder_Active_Outbound_Peers" + +# RPC counter, exported with the _total suffix used by the dashboards (requires RPC traffic from Steps 5-8) +check_statsd_metric "rippled_rpc_requests_total" + +# Overlay traffic +check_statsd_metric "rippled_total_Bytes_In" + # --------------------------------------------------------------------------- # Step 11: Summary # --------------------------------------------------------------------------- diff --git a/docker/telemetry/otel-collector-config.yaml b/docker/telemetry/otel-collector-config.yaml index ff49546dea..ff7734a234 100644 --- a/docker/telemetry/otel-collector-config.yaml +++ b/docker/telemetry/otel-collector-config.yaml @@ -2,11 +2,22 @@ # # Pipelines: # traces: OTLP receiver -> batch processor -> debug + Jaeger + Tempo + spanmetrics -# metrics: spanmetrics connector -> Prometheus exporter +# metrics: spanmetrics connector + StatsD receiver -> Prometheus exporter # # rippled sends traces via OTLP/HTTP to port 4318. 
The collector batches # them, forwards to both Jaeger and Tempo, and derives RED metrics via the # spanmetrics connector, which Prometheus scrapes on port 8889. +# +# rippled also sends beast::insight metrics via StatsD/UDP to port 8125. +# These are ingested by the statsd receiver and merged into the same +# Prometheus endpoint alongside span-derived metrics. +# +# TODO: The Resource Manager's "warn" and "drop" metrics use the non-standard +# "|m" (meter) StatsD type in StatsDCollector.cpp:706. The OTel StatsD +# receiver silently drops "|m" metrics since it only recognizes standard +# types (|c, |g, |ms, |h, |s). To capture these two metrics, change "|m" +# to "|c" in StatsDCollector.cpp — this is a breaking change for any +# backend that relied on the custom "|m" type. Tracked as Phase 6 Task 6.1. receivers: otlp: @@ -15,6 +26,20 @@ receivers: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 + statsd: + endpoint: "0.0.0.0:8125" + aggregation_interval: 15s + enable_metric_type: true + is_monotonic_counter: true + timer_histogram_mapping: + - statsd_type: "timing" + observer_type: "summary" + summary: + percentiles: [0, 50, 90, 95, 99, 100] + - statsd_type: "histogram" + observer_type: "summary" + summary: + percentiles: [0, 50, 90, 95, 99, 100] processors: batch: @@ -59,5 +84,5 @@ service: processors: [batch] exporters: [debug, otlp/jaeger, otlp/tempo, spanmetrics] metrics: - receivers: [spanmetrics] + receivers: [spanmetrics, statsd] exporters: [prometheus] diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index 799f8216e5..d1f3b892e9 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -62,20 +62,20 @@ All spans instrumented in rippled, grouped by subsystem: ### RPC Spans (Phase 2) -| Span Name | Source File | Attributes | Description | -| -------------------- | --------------------- | ------------------------------------------------------- | -------------------------------------------------- | -| `rpc.request` | 
ServerHandler.cpp:271 | — | Top-level HTTP RPC request | -| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) | -| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message | -| `rpc.command.` | RPCHandler.cpp:161 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Per-command span (e.g., `rpc.command.server_info`) | +| Span Name | Source File | Attributes | Description | +| -------------------- | --------------------- | ---------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | +| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request | +| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) | +| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message | +| `rpc.command.` | RPCHandler.cpp:161 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`, `xrpl.rpc.duration_ms`, `xrpl.rpc.error_message` | Per-command span (e.g., `rpc.command.server_info`) | ### Transaction Spans (Phase 3) -| Span Name | Source File | Attributes | Description | -| ------------ | ------------------- | ----------------------------------------------- | ------------------------------------- | -| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction submission and processing | -| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id` | Transaction received from peer relay | -| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Transaction set applied per ledger | +| Span Name | Source File | Attributes | Description | +| ------------ | ------------------- | ---------------------------------------------------------------------- | ------------------------------------- | +| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction 
submission and processing | +| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id`, `xrpl.tx.hash`, `xrpl.tx.suppressed`, `xrpl.tx.status` | Transaction received from peer relay | +| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Transaction set applied per ledger | ### Consensus Spans (Phase 4) @@ -105,11 +105,11 @@ All spans instrumented in rippled, grouped by subsystem: ### Ledger Spans (Phase 5) -| Span Name | Source File | Attributes | Description | -| ----------------- | -------------------- | -------------------------------------------- | ----------------------------- | -| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq` | Ledger build during consensus | -| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger promoted to validated | -| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history | +| Span Name | Source File | Attributes | Description | +| ----------------- | -------------------- | ------------------------------------------------------------------ | ----------------------------- | +| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger build during consensus | +| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger promoted to validated | +| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history | ### Peer Spans (Phase 5) @@ -161,9 +161,63 @@ Configured in `otel-collector-config.yaml`: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s ``` +## StatsD Metrics (beast::insight) + +rippled has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans. 
+ +### Configuration + +Add to `xrpld.cfg`: + +```ini +[insight] +server=statsd +address=127.0.0.1:8125 +prefix=rippled +``` + +The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics. + +### Metric Reference + +#### Gauges + +| Prometheus Metric | Source | Description | +| --------------------------------------------- | ------------------------- | -------------------------------------------------------------------------- | +| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h:373 | Age of validated ledger (seconds) | +| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h:374 | Age of published ledger (seconds) | +| `rippled_State_Accounting_{Mode}_duration` | NetworkOPs.cpp:774 | Time in each operating mode (Disconnected/Connected/Syncing/Tracking/Full) | +| `rippled_State_Accounting_{Mode}_transitions` | NetworkOPs.cpp:780 | Transition count per mode | +| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp:214 | Active inbound peer connections | +| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp:215 | Active outbound peer connections | +| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.h:557 | Peer disconnect count | +| `rippled_job_count` | JobQueue.cpp:26 | Current job queue depth | +| `rippled_{category}_Bytes_In/Out` | OverlayImpl.h:535 | Overlay traffic bytes per category (57 categories) | +| `rippled_{category}_Messages_In/Out` | OverlayImpl.h:535 | Overlay traffic messages per category | + +#### Counters + +| Prometheus Metric | Source | Description | +| --------------------------------- | --------------------- | ------------------------------ | +| `rippled_rpc_requests` | ServerHandler.cpp:108 | Total RPC request count | +| `rippled_ledger_fetches` | InboundLedgers.cpp:44 | Ledger fetch request count | +| `rippled_ledger_history_mismatch` | LedgerHistory.cpp:16 | Ledger hash mismatch count | +| `rippled_warn` | Logic.h:33 | 
Resource manager warning count | +| `rippled_drop` | Logic.h:34 | Resource manager drop count | + +#### Summaries (from StatsD timers) + +| Prometheus Metric | Source | Description | +| ----------------------- | --------------------- | ------------------------------ | +| `rippled_rpc_time` | ServerHandler.cpp:110 | RPC response time (ms) | +| `rippled_rpc_size` | ServerHandler.cpp:109 | RPC response size (bytes) | +| `rippled_ios_latency` | Application.cpp:438 | I/O service loop latency (ms) | +| `rippled_pathfind_fast` | PathRequests.h:23 | Fast pathfinding duration (ms) | +| `rippled_pathfind_full` | PathRequests.h:24 | Full pathfinding duration (ms) | + ## Grafana Dashboards -Five dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`: +Eight dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`: ### RPC Performance (`rippled-rpc-perf`) @@ -230,6 +284,45 @@ Requires `trace_peer=1` in the `[telemetry]` config section. | Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` | | Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` | +### Node Health — StatsD (`rippled-statsd-node-health`) + +| Panel | Type | PromQL | Labels Used | +| -------------------------- | ---------- | ------------------------------------------------------ | ----------- | +| Validated Ledger Age | stat | `rippled_LedgerMaster_Validated_Ledger_Age` | — | +| Published Ledger Age | stat | `rippled_LedgerMaster_Published_Ledger_Age` | — | +| Operating Mode Duration | timeseries | `rippled_State_Accounting_*_duration` | — | +| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions` | — | +| I/O Latency | timeseries | `rippled_ios_latency{quantile="0.95"}` | `quantile` | +| Job Queue Depth | timeseries | `rippled_job_count` | — | +| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches_total[5m])` | — | +| Ledger History 
Mismatches | stat | `rate(rippled_ledger_history_mismatch_total[5m])` | — | + +### Network Traffic — StatsD (`rippled-statsd-network`) + +| Panel | Type | PromQL | Labels Used | +| ---------------------- | ---------- | -------------------------------------- | ----------- | +| Active Peers | timeseries | `rippled_Peer_Finder_Active_*_Peers` | — | +| Peer Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects` | — | +| Total Network Bytes | timeseries | `rippled_total_Bytes_In/Out` | — | +| Total Network Messages | timeseries | `rippled_total_Messages_In/Out` | — | +| Transaction Traffic | timeseries | `rippled_transactions_Messages_In/Out` | — | +| Proposal Traffic | timeseries | `rippled_proposals_Messages_In/Out` | — | +| Validation Traffic | timeseries | `rippled_validations_Messages_In/Out` | — | +| Traffic by Category | bargauge | `topk(10, rippled_*_Bytes_In)` | — | + +### RPC & Pathfinding — StatsD (`rippled-statsd-rpc`) + +| Panel | Type | PromQL | Labels Used | +| ------------------------------ | ---------- | -------------------------------------------------------- | ----------- | +| RPC Request Rate | stat | `rate(rippled_rpc_requests_total[5m])` | — | +| RPC Response Time | timeseries | `rippled_rpc_time{quantile="0.95"}`, `rippled_rpc_time{quantile="0.5"}` | `quantile` | +| RPC Response Size | timeseries | `rippled_rpc_size{quantile="0.95"}`, `rippled_rpc_size{quantile="0.5"}` | `quantile` | +| RPC Response Time Distribution | timeseries | `rippled_rpc_time` at quantiles 0.5, 0.9, 0.95, 0.99 | `quantile` | +| Pathfinding Fast Duration | timeseries | `rippled_pathfind_fast{quantile="0.95"}`, `rippled_pathfind_fast{quantile="0.5"}` | `quantile` | +| Pathfinding Full Duration | timeseries | `rippled_pathfind_full{quantile="0.95"}`, `rippled_pathfind_full{quantile="0.5"}` | `quantile` | +| Resource Warnings Rate | stat | `rate(rippled_warn_total[5m])` | — | +| Resource Drops Rate | stat | `rate(rippled_drop_total[5m])` | — | + ### Span → Metric → Dashboard Summary | Span Name | Prometheus Metric Filter | Grafana Dashboard |