mirror of
https://github.com/XRPLF/rippled.git
synced 2026-03-02 10:42:33 +00:00
Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -182,7 +182,80 @@ gantt

---

## 6.7 Risk Assessment
## 6.7 Phase 6: StatsD Metrics Integration (Week 10)

**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.

### Background

rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.

### Metric Inventory

| Category        | Group              | Type          | Count      | Key Metrics                                            |
| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
| Node State      | `State_Accounting` | Gauge         | 10         | `*_duration`, `*_transitions` per operating mode       |
| Ledger          | `LedgerMaster`     | Gauge         | 2          | `Validated_Ledger_Age`, `Published_Ledger_Age`         |
| Ledger Fetch    | —                  | Counter       | 1          | `ledger_fetches`                                       |
| Ledger History  | `ledger.history`   | Counter       | 1          | `mismatch`                                             |
| RPC             | `rpc`              | Counter+Event | 3          | `requests`, `time` (histogram), `size` (histogram)     |
| Job Queue       | —                  | Gauge+Event   | 1 + 2×N    | `job_count`, per-job `{name}` and `{name}_q`           |
| Peer Finder     | `Peer_Finder`      | Gauge         | 2          | `Active_Inbound_Peers`, `Active_Outbound_Peers`        |
| Overlay         | `Overlay`          | Gauge         | 1          | `Peer_Disconnects`                                     |
| Overlay Traffic | per-category       | Gauge         | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
| Pathfinding     | —                  | Event         | 2          | `pathfind_fast`, `pathfind_full` (histograms)          |
| I/O             | —                  | Event         | 1          | `ios_latency` (histogram)                              |
| Resource Mgr    | —                  | Meter         | 2          | `warn`, `drop` (rate counters)                         |
| Caches          | per-cache          | Gauge         | 2×N        | `{cache}.size`, `{cache}.hit_rate`                     |

**Total**: ~255+ unique metrics (plus dynamic job-type and cache metrics)
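
As a quick cross-check of the total, the static counts in the inventory table can be summed directly. This is only a sketch: the dictionary keys are shorthand labels chosen for illustration, and the per-job and per-cache metrics are left out because their counts depend on N.

```python
# Static metric counts copied from the Metric Inventory table.
# Dynamic entries (2 per job type, 2 per cache) are excluded because N is not fixed.
STATIC_COUNTS = {
    "State_Accounting": 10,
    "LedgerMaster": 2,
    "ledger_fetches": 1,
    "ledger.history": 1,
    "rpc": 3,
    "job_count": 1,
    "Peer_Finder": 2,
    "Overlay": 1,
    "overlay_traffic": 4 * 57,  # 4 gauges per traffic category
    "pathfinding": 2,
    "ios_latency": 1,
    "resource_manager": 2,
}


def static_total(counts: dict[str, int]) -> int:
    """Sum the fixed (non-dynamic) metric counts."""
    return sum(counts.values())


if __name__ == "__main__":
    print(static_total(STATIC_COUNTS))
```

The static rows add up to 254, which is where the rounded "~255" figure comes from; job-type and cache metrics come on top of that.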

### Tasks

| Task | Description                                                                                                     | Effort | Risk |
| ---- | --------------------------------------------------------------------------------------------------------------- | ------ | ---- |
| 6.1  | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | 0.5d   | Low  |
| 6.2  | Add `statsd` receiver to OTel Collector config                                                                  | 0.5d   | Low  |
| 6.3  | Expose UDP port 8125 in docker-compose.yml                                                                      | 0.1d   | Low  |
| 6.4  | Add `[insight]` config to integration test node configs                                                         | 0.5d   | Low  |
| 6.5  | Create "Node Health" Grafana dashboard (8 panels)                                                               | 1d     | Low  |
| 6.6  | Create "Network Traffic" Grafana dashboard (8 panels)                                                           | 1d     | Low  |
| 6.7  | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels)                                                | 1d     | Low  |
| 6.8  | Update integration test to verify StatsD metrics in Prometheus                                                  | 0.5d   | Low  |
| 6.9  | Update TESTING.md and telemetry-runbook.md                                                                      | 0.5d   | Low  |

**Total Effort**: 5.6 days

### Wire Format Fix (Task 6.1) — DEFERRED

The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).

**Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
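
To make the wire-format issue concrete, here is a minimal sketch of a StatsD line parser that accepts only commonly used type suffixes. It is not the OTel receiver's code: `parse_statsd_line`, `Sample`, and `KNOWN_TYPES` are hypothetical names, and the accepted set (`c`, `g`, `ms`, `h`, `s`) is an assumption about what a typical receiver understands.

```python
from typing import NamedTuple, Optional

# Type suffixes commonly understood by StatsD servers (assumed set for this sketch):
# c = counter, g = gauge, ms = timer, h = histogram, s = set.
KNOWN_TYPES = {"c", "g", "ms", "h", "s"}


class Sample(NamedTuple):
    name: str
    value: float
    type: str


def parse_statsd_line(line: str) -> Optional[Sample]:
    """Parse 'name:value|type'; return None for malformed lines or unknown types."""
    try:
        name_value, type_part = line.strip().split("|", 1)
        name, raw_value = name_value.rsplit(":", 1)
        metric_type = type_part.split("|", 1)[0]  # ignore optional '|@rate' suffix
        value = float(raw_value)
    except ValueError:
        return None
    if not name or metric_type not in KNOWN_TYPES:
        return None
    return Sample(name, value, metric_type)


if __name__ == "__main__":
    print(parse_statsd_line("rippled.warn:1|m"))  # custom meter type, rejected
    print(parse_statsd_line("rippled.warn:1|c"))  # counter, accepted
```

A parser of this shape returns nothing for the `|m` line and a counter sample for the `|c` line, which is the behavioral difference the deferred fix is meant to resolve.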

### New Grafana Dashboards

**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):

- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches

**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):

- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories

**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):

- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap

### Exit Criteria

- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately)

---

## 6.9 Risk Assessment

```mermaid
quadrantChart
@@ -213,7 +286,7 @@ quadrantChart

---

## 6.8 Success Metrics
## 6.10 Success Metrics

| Metric                   | Target                         | Measurement           |
| ------------------------ | ------------------------------ | --------------------- |
@@ -226,7 +299,7 @@ quadrantChart

---

## 6.9 Effort Summary
## 6.11 Effort Summary

<div align="center">

@@ -257,11 +330,11 @@ pie showData

---

## 6.10 Quick Wins and Crawl-Walk-Run Strategy
## 6.12 Quick Wins and Crawl-Walk-Run Strategy

This section outlines a prioritized approach to maximize ROI with minimal initial investment.

### 6.10.1 Crawl-Walk-Run Overview
### 6.12.1 Crawl-Walk-Run Overview

<div align="center">

@@ -300,7 +373,7 @@ flowchart TB

</div>

### 6.10.2 Quick Wins (Immediate Value)
### 6.12.2 Quick Wins (Immediate Value)

| Quick Win                      | Effort   | Value  | When to Deploy |
| ------------------------------ | -------- | ------ | -------------- |
@@ -310,7 +383,7 @@ flowchart TB
| **Transaction Submit Tracing** | 1 day    | High   | Week 3         |
| **Consensus Round Duration**   | 1 day    | Medium | Week 6         |

### 6.10.3 CRAWL Phase (Weeks 1-2)
### 6.12.3 CRAWL Phase (Weeks 1-2)

**Goal**: Get basic tracing working with minimal code changes.

@@ -330,7 +403,7 @@ flowchart TB
- No cross-node complexity
- Single file modification to existing code

### 6.10.4 WALK Phase (Weeks 3-5)
### 6.12.4 WALK Phase (Weeks 3-5)

**Goal**: Add transaction lifecycle tracing across nodes.

@@ -349,7 +422,7 @@ flowchart TB
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues

### 6.10.5 RUN Phase (Weeks 6-9)
### 6.12.5 RUN Phase (Weeks 6-9)

**Goal**: Full observability including consensus.

@@ -368,7 +441,7 @@ flowchart TB
- Requires thorough testing
- Lower relative value (consensus issues are rarer)

### 6.10.6 ROI Prioritization Matrix
### 6.12.6 ROI Prioritization Matrix

```mermaid
quadrantChart
@@ -390,11 +463,11 @@ quadrantChart

---

## 6.11 Definition of Done
## 6.13 Definition of Done

Clear, measurable criteria for each phase.

### 6.11.1 Phase 1: Core Infrastructure
### 6.13.1 Phase 1: Core Infrastructure

| Criterion       | Measurement                                                | Target                       |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -406,7 +479,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: All criteria met, PR merged, no regressions in CI.

### 6.11.2 Phase 2: RPC Tracing
### 6.13.2 Phase 2: RPC Tracing

| Criterion          | Measurement                        | Target                     |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -418,7 +491,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.

### 6.11.3 Phase 3: Transaction Tracing
### 6.13.3 Phase 3: Transaction Tracing

| Criterion        | Measurement                     | Target                             |
| ---------------- | ------------------------------- | ---------------------------------- |
@@ -430,7 +503,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.

### 6.11.4 Phase 4: Consensus Tracing
### 6.13.4 Phase 4: Consensus Tracing

| Criterion            | Measurement                   | Target                    |
| -------------------- | ----------------------------- | ------------------------- |
@@ -442,7 +515,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

### 6.11.5 Phase 5: Production Deployment
### 6.13.5 Phase 5: Production Deployment

| Criterion    | Measurement                  | Target                     |
| ------------ | ---------------------------- | -------------------------- |
@@ -455,7 +528,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Telemetry running in production, operators trained, alerts active.

### 6.11.6 Success Metrics Summary
### 6.13.6 Success Metrics Summary

| Phase   | Primary Metric         | Secondary Metric            | Deadline      |
| ------- | ---------------------- | --------------------------- | ------------- |
@@ -467,7 +540,7 @@ Clear, measurable criteria for each phase.

---

## 6.12 Recommended Implementation Order
## 6.14 Recommended Implementation Order

Based on ROI analysis, implement in this exact order:

@@ -370,7 +370,7 @@ See the "Verification Queries" section below.

## Expected Span Catalog

All 12 production span names instrumented across Phases 2-4:
All 16 production span names instrumented across Phases 2-5:

| Span Name                   | Source File           | Phase | Key Attributes                                             | How to Trigger            |
| --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
@@ -380,10 +380,16 @@ All 12 production span names instrumented across Phases 2-4:
| `rpc.command.<name>`        | RPCHandler.cpp:161    | 2     | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`    | Any RPC command           |
| `tx.process`                | NetworkOPs.cpp:1227   | 3     | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path`            | Submit transaction        |
| `tx.receive`                | PeerImp.cpp:1273      | 3     | `xrpl.peer.id`                                             | Peer relays transaction   |
| `tx.apply`                  | BuildLedger.cpp:88    | 5     | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed`            | Ledger close (tx set)     |
| `consensus.proposal.send`   | RCLConsensus.cpp:177  | 4     | `xrpl.consensus.round`                                     | Consensus proposing phase |
| `consensus.ledger_close`    | RCLConsensus.cpp:282  | 4     | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode`         | Ledger close event        |
| `consensus.accept`          | RCLConsensus.cpp:395  | 4     | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted           |
| `consensus.validation.send` | RCLConsensus.cpp:753  | 4     | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing`    | Validation sent           |
| `ledger.build`              | BuildLedger.cpp:31    | 5     | `xrpl.ledger.seq`                                          | Ledger build              |
| `ledger.validate`           | LedgerMaster.cpp:915  | 5     | `xrpl.ledger.seq`, `xrpl.ledger.validations`               | Ledger validated          |
| `ledger.store`              | LedgerMaster.cpp:409  | 5     | `xrpl.ledger.seq`                                          | Ledger stored             |
| `peer.proposal.receive`     | PeerImp.cpp:1667      | 5     | `xrpl.peer.id`, `xrpl.peer.proposal.trusted`               | Peer sends proposal       |
| `peer.validation.receive`   | PeerImp.cpp:2264      | 5     | `xrpl.peer.id`, `xrpl.peer.validation.trusted`             | Peer sends validation     |

---

@@ -405,9 +411,11 @@ curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
# Query traces by operation
for op in "rpc.request" "rpc.process" \
"rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
"tx.process" "tx.receive" \
"tx.process" "tx.receive" "tx.apply" \
"consensus.proposal.send" "consensus.ledger_close" \
"consensus.accept" "consensus.validation.send"; do
"consensus.accept" "consensus.validation.send" \
"ledger.build" "ledger.validate" "ledger.store" \
"peer.proposal.receive" "peer.validation.receive"; do
count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
| jq '.data | length')
printf "%-35s %s traces\n" "$op" "$count"
@@ -434,15 +442,81 @@ curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"r
| jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```

### StatsD Metrics (beast::insight)

rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector
on port 8125. These appear in Prometheus alongside spanmetrics.

Requires `[insight]` config in `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```
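
Before starting a node, it can help to confirm that something is reachable on the StatsD port. The sketch below hand-crafts one gauge datagram and sends it over UDP. The metric name `smoke_test` and both helper names are made up for illustration, the dot between prefix and name is an assumption about the wire naming, and the address simply mirrors the `[insight]` example.

```python
import socket


def format_gauge(prefix: str, name: str, value: int) -> bytes:
    """Build one StatsD gauge line: '<prefix>.<name>:<value>|g'."""
    return f"{prefix}.{name}:{value}|g".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> int:
    """Send a single datagram; UDP is fire-and-forget, so this only confirms the send."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # 'smoke_test' is a hypothetical metric name, not one rippled emits.
    sent = send_statsd(format_gauge("rippled", "smoke_test", 1))
    print(f"sent {sent} bytes")
```

Because UDP gives no delivery acknowledgement, the real check is whether the test gauge later shows up in Prometheus.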

Verify StatsD metrics in Prometheus:

```bash
# Ledger age gauge
curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result'

# Peer counts
curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result'

# RPC request counter
curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result'

# State accounting
curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result'

# Overlay traffic
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'
```
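
The same checks can be scripted for an integration test. The following is a rough sketch using only the Python standard library: the helper names are hypothetical, and it assumes the standard Prometheus instant-query response shape (`status`, `data.result[].value`).

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base: str, promql: str) -> str:
    """Return the instant-query URL with the PromQL expression URL-encoded."""
    return f"{base.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(payload: dict) -> list[float]:
    """Pull sample values out of an instant-query (vector) response."""
    if payload.get("status") != "success":
        return []
    return [float(item["value"][1]) for item in payload["data"]["result"]]


def metric_present(base: str, metric: str) -> bool:
    """True if Prometheus returns at least one series for the metric."""
    with urllib.request.urlopen(build_query_url(base, metric), timeout=5) as resp:
        return len(extract_values(json.load(resp))) > 0
```

A test could then call `metric_present("http://localhost:9090", "rippled_rpc_requests")` for each core metric and fail if any returns `False`.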

Key StatsD metrics (prefix `rippled_`):

| Metric                                | Type      | Source                                    |
| ------------------------------------- | --------- | ----------------------------------------- |
| `LedgerMaster_Validated_Ledger_Age`   | gauge     | LedgerMaster.h:373                        |
| `LedgerMaster_Published_Ledger_Age`   | gauge     | LedgerMaster.h:374                        |
| `State_Accounting_{Mode}_duration`    | gauge     | NetworkOPs.cpp:774                        |
| `State_Accounting_{Mode}_transitions` | gauge     | NetworkOPs.cpp:780                        |
| `Peer_Finder_Active_Inbound_Peers`    | gauge     | PeerfinderManager.cpp:214                 |
| `Peer_Finder_Active_Outbound_Peers`   | gauge     | PeerfinderManager.cpp:215                 |
| `Overlay_Peer_Disconnects`            | gauge     | OverlayImpl.h:557                         |
| `job_count`                           | gauge     | JobQueue.cpp:26                           |
| `rpc_requests`                        | counter   | ServerHandler.cpp:108                     |
| `rpc_time`                            | histogram | ServerHandler.cpp:110                     |
| `rpc_size`                            | histogram | ServerHandler.cpp:109                     |
| `ios_latency`                         | histogram | Application.cpp:438                       |
| `pathfind_fast`                       | histogram | PathRequests.h:23                         |
| `pathfind_full`                       | histogram | PathRequests.h:24                         |
| `ledger_fetches`                      | counter   | InboundLedgers.cpp:44                     |
| `ledger_history_mismatch`             | counter   | LedgerHistory.cpp:16                      |
| `warn`                                | counter   | Logic.h:33                                |
| `drop`                                | counter   | Logic.h:34                                |
| `{category}_Bytes_In/Out`             | gauge     | OverlayImpl.h:535 (57 traffic categories) |
| `{category}_Messages_In/Out`          | gauge     | OverlayImpl.h:535 (57 traffic categories) |
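
The names in this table use underscores, whereas StatsD names on the wire are typically dot-separated. A plausible mapping, assuming the exporter simply replaces characters that are not valid in Prometheus metric names with underscores, looks like the sketch below. The function name is hypothetical and the collector's actual normalization may differ.

```python
import re

# Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*.
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def to_prometheus_name(statsd_name: str) -> str:
    """Replace characters Prometheus does not allow (such as '.') with '_'."""
    name = _INVALID.sub("_", statsd_name)
    if name and name[0].isdigit():
        name = "_" + name  # names may not start with a digit
    return name


if __name__ == "__main__":
    print(to_prometheus_name("rippled.LedgerMaster.Validated_Ledger_Age"))
```

Under that assumption, a wire name such as `rippled.ledger.history.mismatch` would surface as `rippled_ledger_history_mismatch`, which lines up with the table entry once the `rippled_` prefix is added.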

### Grafana

Open http://localhost:3000 (anonymous admin access enabled).

Pre-configured dashboards:
Pre-configured dashboards (span-derived):

- **RPC Performance**: Request rates, latency percentiles by command
- **Transaction Overview**: Transaction processing rates and paths
- **Consensus Health**: Consensus round duration and proposer counts
- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

Pre-configured dashboards (StatsD):

- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue
- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category
- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

@@ -23,7 +23,8 @@ services:
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8889:8889" # Prometheus metrics (spanmetrics)
- "8125:8125/udp" # StatsD UDP (beast::insight metrics)
- "8889:8889" # Prometheus metrics (spanmetrics + statsd)
- "13133:13133" # Health check
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro

@@ -8,72 +8,109 @@
"panels": [
{
"title": "Consensus Round Duration",
"description": "p95 and p50 duration of consensus accept rounds. The consensus.accept span (RCLConsensus.cpp:395) measures the time to process an accepted ledger including transaction application and state finalization. The span carries xrpl.consensus.proposers and xrpl.consensus.round_time_ms attributes. Normal range is 3-6 seconds on mainnet.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "p95 round duration"
"legendFormat": "P95 Round Duration"
},
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "p50 round duration"
"legendFormat": "P50 Round Duration"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Proposals Sent Rate",
"description": "Rate at which this node sends consensus proposals to the network. Sourced from the consensus.proposal.send span (RCLConsensus.cpp:177) which fires each time the node proposes a transaction set. The span carries xrpl.consensus.round identifying the consensus round number. A healthy proposing node should show steady proposal output.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.proposal.send\"}[5m]))",
"legendFormat": "proposals/sec"
"legendFormat": "Proposals / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
"unit": "ops",
"custom": {
"axisLabel": "Proposals / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Close Duration",
"description": "p95 duration of the ledger close event. The consensus.ledger_close span (RCLConsensus.cpp:282) measures the time from when consensus triggers a ledger close to completion. Carries xrpl.consensus.ledger.seq and xrpl.consensus.mode attributes. Compare with Consensus Round Duration to understand how close timing relates to overall round time.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.ledger_close\"}[5m])))",
"legendFormat": "p95 close duration"
"legendFormat": "P95 Close Duration"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation Send Rate",
"description": "Rate at which this node sends ledger validations to the network. Sourced from the consensus.validation.send span (RCLConsensus.cpp:753). Each validation confirms the node has fully validated a ledger. The span carries xrpl.consensus.ledger.seq and xrpl.consensus.proposing. Should closely track the ledger close rate when the node is healthy.",
"type": "stat",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "validations/sec"
"legendFormat": "Validations / Sec"
}
],
"fieldConfig": {
@@ -82,6 +119,121 @@
},
"overrides": []
}
},
{
"title": "Consensus Mode Over Time",
"description": "Breakdown of consensus ledger close events by the node's consensus mode (proposing, observing, wrongLedger, switchedLedger). Grouped by the xrpl.consensus.mode span attribute from consensus.ledger_close. A healthy validator should be predominantly in 'proposing' mode. Frequent 'wrongLedger' or 'switchedLedger' indicates sync issues.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum by (xrpl_consensus_mode) (rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "{{xrpl_consensus_mode}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Accept vs Close Rate",
"description": "Compares the rate of consensus.accept (ledger accepted after consensus) vs consensus.ledger_close (ledger close initiated). These should track closely in a healthy network. A divergence means some close events are not completing the accept phase, potentially indicating consensus failures or timeouts.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.accept\"}[5m]))",
"legendFormat": "Accepts / Sec"
},
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation vs Close Rate",
"description": "Compares the rate of consensus.validation.send vs consensus.ledger_close. Each validated ledger should produce one validation message. If validations lag behind closes, the node may be falling behind on validation or experiencing issues with the validation pipeline.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "Validations / Sec"
},
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Accept Duration Heatmap",
"description": "Heatmap showing the distribution of consensus.accept span durations across histogram buckets over time. Each cell represents how many accept events fell into that duration bucket in a 5m window. Useful for detecting outlier consensus rounds that take abnormally long.",
"type": "heatmap",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"yAxis": { "axisLabel": "Duration (ms)" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
}
],
"schemaVersion": 39,
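
Several panels in this dashboard rely on `histogram_quantile` over `le` buckets. As a reading aid, the sketch below shows the kind of linear interpolation involved when estimating a quantile from cumulative bucket counts. It is a simplified illustration rather than Prometheus's implementation, and the bucket values used in the example are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank;
    the last bucket is expected to have an upper bound of +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound  # fall back to the highest known finite bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented cumulative counts for buckets with upper bounds in milliseconds.
    sample = [(100.0, 10.0), (500.0, 90.0), (1000.0, 100.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, sample), histogram_quantile(0.50, sample))
```

The estimate can only be as fine-grained as the bucket boundaries, which is one reason the heatmap panel is a useful companion to the p95/p50 lines.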
231
docker/telemetry/grafana/dashboards/ledger-operations.json
Normal file
@@ -0,0 +1,231 @@
{
"annotations": { "list": [] },
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Ledger Build Rate",
"description": "Rate at which new ledgers are being built. The ledger.build span (BuildLedger.cpp:31) wraps the entire buildLedgerImpl() function which creates a new ledger from a parent, applies transactions, flushes SHAMap nodes, and sets the accepted state. Should match the consensus close rate (~0.25/sec on mainnet with ~4s rounds).",
"type": "stat",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"ledger.build\"}[5m]))",
"legendFormat": "Builds / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger Build Duration",
"description": "p95 and p50 duration of ledger builds. Measures the full buildLedgerImpl() call including transaction application, SHAMap flushing, and ledger acceptance. The span records xrpl.ledger.seq as an attribute. Long build times indicate expensive transaction sets or I/O pressure from SHAMap flushes.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"ledger.build\"}[5m])))",
"legendFormat": "P95 Build Duration"
},
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"ledger.build\"}[5m])))",
"legendFormat": "P50 Build Duration"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Validation Rate",
"description": "Rate at which ledgers pass the validation threshold and are accepted as fully validated. The ledger.validate span (LedgerMaster.cpp:915) fires in checkAccept() only after the ledger receives sufficient trusted validations (>= quorum). Records xrpl.ledger.seq and xrpl.ledger.validations (the number of validations received).",
"type": "stat",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"ledger.validate\"}[5m]))",
"legendFormat": "Validations / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger Build Duration Heatmap",
"description": "Heatmap showing the distribution of ledger.build durations across histogram buckets over time. Each cell represents the count of ledger builds that fell into that duration bucket in a 5m window. Useful for spotting occasional slow ledger builds that may not appear in percentile charts.",
"type": "heatmap",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"yAxis": { "axisLabel": "Duration (ms)" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=\"ledger.build\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
},
{
"title": "Transaction Apply Duration",
"description": "p95 and p50 duration of applying the consensus transaction set during ledger building. The tx.apply span (BuildLedger.cpp:88) wraps applyTransactions() which iterates through the CanonicalTXSet with multiple retry passes. Records xrpl.ledger.tx_count (successful) and xrpl.ledger.tx_failed (failed) as attributes.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P95 tx.apply"
},
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.apply\"}[5m])))",
"legendFormat": "P50 tx.apply"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Apply Rate",
"description": "Rate of tx.apply span invocations, reflecting how frequently the transaction application phase runs during ledger building. Each ledger build triggers one tx.apply call. Should closely match the ledger build rate.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.apply\"}[5m]))",
|
||||
"legendFormat": "tx.apply / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Operations / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Ledger Store Rate",
|
||||
"description": "Rate at which ledgers are stored into the ledger history. The ledger.store span (LedgerMaster.cpp:409) wraps storeLedger() which inserts the ledger into the LedgerHistory cache. Records xrpl.ledger.seq. Should match the ledger build rate under normal operation.",
|
||||
"type": "stat",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"ledger.store\"}[5m]))",
|
||||
"legendFormat": "Stores / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Build vs Close Duration",
|
||||
"description": "Compares p95 durations of ledger.build (the actual ledger construction in BuildLedger.cpp) vs consensus.ledger_close (the consensus close event in RCLConsensus.cpp). Build time is a subset of close time. A large gap between them indicates overhead in the consensus pipeline outside of ledger construction itself.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"ledger.build\"}[5m])))",
|
||||
"legendFormat": "P95 ledger.build"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.ledger_close\"}[5m])))",
|
||||
"legendFormat": "P95 consensus.ledger_close"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "ledger", "telemetry"],
|
||||
"templating": { "list": [] },
|
||||
"time": { "from": "now-1h", "to": "now" },
|
||||
"title": "rippled Ledger Operations",
|
||||
"uid": "rippled-ledger-ops"
|
||||
}
|
||||
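The duration panels in this dashboard rely on `histogram_quantile` over cumulative `le` buckets. As a rough illustration of what that function estimates, the sketch below interpolates linearly inside the bucket that contains the requested rank. The bucket bounds and counts are made-up sample data, not values from rippled, and the helper name is hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    ending with a math.inf bucket (the Prometheus "+Inf" bucket).
    """
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank lands in the overflow bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the selected bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical ledger.build durations in milliseconds (sample data only).
SAMPLE_BUCKETS = [(10.0, 50.0), (50.0, 90.0), (100.0, 100.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, SAMPLE_BUCKETS)
p50 = histogram_quantile(0.50, SAMPLE_BUCKETS)
```

Because the estimate is interpolated, its precision is limited by the bucket boundaries configured for the span metrics histogram; the heatmap panel shows the raw bucket counts that feed this calculation.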
107
docker/telemetry/grafana/dashboards/peer-network.json
Normal file
@@ -0,0 +1,107 @@
|
||||
{
|
||||
"annotations": { "list": [] },
|
||||
"description": "Requires trace_peer=1 in the [telemetry] config section.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Peer Proposal Receive Rate",
|
||||
"description": "Rate of consensus proposals received from network peers. The peer.proposal.receive span (PeerImp.cpp:1667) fires in onMessage(TMProposeSet) for each incoming proposal. Records xrpl.peer.id (sending peer) and xrpl.peer.proposal.trusted (whether the proposer is in our UNL). Requires trace_peer=1 in the telemetry config.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"peer.proposal.receive\"}[5m]))",
|
||||
"legendFormat": "Proposals Received / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Proposals / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Peer Validation Receive Rate",
|
||||
"description": "Rate of ledger validations received from network peers. The peer.validation.receive span (PeerImp.cpp:2264) fires in onMessage(TMValidation) for each incoming validation message. Records xrpl.peer.id (sending peer) and xrpl.peer.validation.trusted (whether the validator is trusted). Requires trace_peer=1 in the telemetry config.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"peer.validation.receive\"}[5m]))",
|
||||
"legendFormat": "Validations Received / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Validations / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Proposals Trusted vs Untrusted",
|
||||
"description": "Pie chart showing the ratio of proposals received from trusted validators (in our UNL) vs untrusted validators. Grouped by the xrpl.peer.proposal.trusted span attribute (true/false). A healthy node connected to a well-configured UNL should see a significant portion of trusted proposals. Note: proposals that fail early validation may not have the trusted attribute set.",
|
||||
"type": "piechart",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name=\"peer.proposal.receive\"}[5m]))",
|
||||
"legendFormat": "Trusted = {{xrpl_peer_proposal_trusted}}"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Validations Trusted vs Untrusted",
|
||||
"description": "Pie chart showing the ratio of validations received from trusted validators (in our UNL) vs untrusted validators. Grouped by the xrpl.peer.validation.trusted span attribute (true/false). Monitoring this helps detect if the node is receiving validations from the expected set of trusted validators. Note: validations that fail early checks may not have the trusted attribute set.",
|
||||
"type": "piechart",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum by (xrpl_peer_validation_trusted) (rate(traces_span_metrics_calls_total{span_name=\"peer.validation.receive\"}[5m]))",
|
||||
"legendFormat": "Trusted = {{xrpl_peer_validation_trusted}}"
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "peer", "telemetry"],
|
||||
"templating": { "list": [] },
|
||||
"time": { "from": "now-1h", "to": "now" },
|
||||
"title": "rippled Peer Network",
|
||||
"uid": "rippled-peer-net"
|
||||
}
|
||||
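The four panels above are placed on Grafana's 24-column grid through their `gridPos` values. A minimal sketch of a layout check, assuming the dashboard JSON has already been loaded into Python dictionaries, could look like the following. The function name `layout_problems` is hypothetical and not part of any Grafana tooling.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid


def layout_problems(panels):
    """Return human-readable layout problems for a list of panel dicts."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] < 0 or pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: exceeds the {GRID_COLUMNS}-column grid")
    for a, b in combinations(panels, 2):
        pa, pb = a["gridPos"], b["gridPos"]
        overlap_x = pa["x"] < pb["x"] + pb["w"] and pb["x"] < pa["x"] + pa["w"]
        overlap_y = pa["y"] < pb["y"] + pb["h"] and pb["y"] < pa["y"] + pa["h"]
        if overlap_x and overlap_y:
            problems.append(f"{a['title']} overlaps {b['title']}")
    return problems


# Titles and positions copied from the peer-network dashboard above.
PEER_PANELS = [
    {"title": "Peer Proposal Receive Rate", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"title": "Peer Validation Receive Rate", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
    {"title": "Proposals Trusted vs Untrusted", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}},
    {"title": "Validations Trusted vs Untrusted", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}},
]
print(layout_problems(PEER_PANELS))  # prints []
```

A check like this can run in CI against every file under the dashboards directory so that a copied panel with a stale `gridPos` is caught before provisioning.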
@@ -8,8 +8,12 @@
|
||||
"panels": [
|
||||
{
|
||||
"title": "RPC Request Rate by Command",
|
||||
"description": "Per-second rate of RPC command executions, broken down by command name (e.g. server_info, submit). Calculated as rate(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}) over a 5m window, grouped by the xrpl.rpc.command span attribute.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
@@ -19,33 +23,55 @@
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "reqps"
|
||||
"unit": "reqps",
|
||||
"custom": {
|
||||
"axisLabel": "Requests / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Latency p95 by Command",
|
||||
"description": "95th percentile response time for each RPC command. Computed from the spanmetrics duration histogram using histogram_quantile(0.95) over rpc.command.* spans, grouped by xrpl.rpc.command. High values indicate slow commands that may need optimization.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~\"rpc.command.*\"}[5m])))",
|
||||
"legendFormat": "p95 {{xrpl_rpc_command}}"
|
||||
"legendFormat": "P95 {{xrpl_rpc_command}}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms"
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Error Rate",
|
||||
"description": "Percentage of RPC commands that completed with an error status, per command. Calculated as (error calls / total calls) * 100, where errors have status_code=STATUS_CODE_ERROR. Thresholds: green < 1%, yellow 1-5%, red > 5%.",
|
||||
"type": "bargauge",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
@@ -69,8 +95,13 @@
|
||||
},
|
||||
{
|
||||
"title": "RPC Latency Heatmap",
|
||||
"description": "Distribution of RPC command response times across histogram buckets. Shows the density of requests at each latency level over time. Each cell represents the count of requests that fell into that duration bucket in a 5m window. Useful for spotting bimodal latency patterns.",
|
||||
"type": "heatmap",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" },
|
||||
"yAxis": { "axisLabel": "Duration (ms)" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
@@ -79,6 +110,118 @@
|
||||
"format": "heatmap"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Overall RPC Throughput",
|
||||
"description": "Aggregate RPC throughput showing two layers of the request pipeline. rpc.request is the outer HTTP handler (ServerHandler.cpp:271) that accepts incoming connections. rpc.process is the inner processing layer (ServerHandler.cpp:573) that parses and dispatches. A gap between the two indicates requests being queued or rejected before processing.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"rpc.request\"}[5m]))",
|
||||
"legendFormat": "rpc.request / Sec"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"rpc.process\"}[5m]))",
|
||||
"legendFormat": "rpc.process / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "reqps",
|
||||
"custom": {
|
||||
"axisLabel": "Requests / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Success vs Error",
|
||||
"description": "Aggregate rate of successful vs failed RPC commands across all command types. Success = status_code STATUS_CODE_UNSET (the OpenTelemetry default when a span does not set an explicit status). Error = status_code STATUS_CODE_ERROR. A sustained error rate warrants investigation via the per-command breakdown above.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\", status_code=\"STATUS_CODE_UNSET\"}[5m]))",
|
||||
"legendFormat": "Success"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\", status_code=\"STATUS_CODE_ERROR\"}[5m]))",
|
||||
"legendFormat": "Error"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Commands / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Top Commands by Volume",
|
||||
"description": "Top 10 most frequently called RPC commands by total invocation count over the last 5 minutes. Uses topk(10, increase(calls_total)) to rank commands. Helps identify the hottest API endpoints driving load on the node.",
|
||||
"type": "bargauge",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "topk(10, sum by (xrpl_rpc_command) (increase(traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}[5m])))",
|
||||
"legendFormat": "{{xrpl_rpc_command}}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "WebSocket Message Rate",
|
||||
"description": "Rate of incoming WebSocket RPC messages processed by the server. Sourced from the rpc.ws_message span (ServerHandler.cpp:384). Only active when clients connect via WebSocket instead of HTTP. Zero is normal if only HTTP RPC is in use.",
|
||||
"type": "stat",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"rpc.ws_message\"}[5m]))",
|
||||
"legendFormat": "WS Messages / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
|
||||
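The "RPC Error Rate" panel describes its value as (error calls / total calls) * 100 with green, yellow, and red bands. The sketch below mirrors that arithmetic in plain Python. The function names are hypothetical, the treatment of exact boundary values follows the wording of the panel description rather than Grafana's threshold engine, and returning 0.0 when there are no calls is an assumption made for this example.

```python
def error_rate_percent(error_calls, total_calls):
    """Percentage of calls that ended with an error status."""
    if total_calls <= 0:
        # Assumption for this sketch: no traffic is reported as 0% errors.
        return 0.0
    return 100.0 * error_calls / total_calls


def threshold_color(percent):
    """Map an error percentage to the bands in the panel description:
    green < 1%, yellow 1-5%, red > 5%."""
    if percent > 5.0:
        return "red"
    if percent >= 1.0:
        return "yellow"
    return "green"


# Hypothetical per-command call rates (calls per second), sample data only.
rate = error_rate_percent(error_calls=0.3, total_calls=10.0)
color = threshold_color(rate)
```

With the sample values above, 0.3 errors per second out of 10 calls per second falls in the yellow band.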
470
docker/telemetry/grafana/dashboards/statsd-network-traffic.json
Normal file
@@ -0,0 +1,470 @@
|
||||
{
|
||||
"annotations": { "list": [] },
|
||||
"description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Active Peers",
|
||||
"description": "Number of active inbound and outbound peer connections. Sourced from Peer_Finder.Active_Inbound_Peers and Peer_Finder.Active_Outbound_Peers gauges (PeerfinderManager.cpp:214-215). A healthy mainnet node typically has 10-21 outbound and 0-85 inbound peers depending on configuration.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Peer_Finder_Active_Inbound_Peers",
|
||||
"legendFormat": "Inbound Peers"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Peer_Finder_Active_Outbound_Peers",
|
||||
"legendFormat": "Outbound Peers"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Peers",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Peer Disconnects",
|
||||
"description": "Cumulative count of peer disconnections. Sourced from the Overlay.Peer_Disconnects gauge (OverlayImpl.h:557). A rising trend indicates network instability, aggressive peer management, or resource exhaustion causing connection drops.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Overlay_Peer_Disconnects",
|
||||
"legendFormat": "Disconnects"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Disconnects",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Total Network Bytes",
|
||||
"description": "Total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Provides a high-level view of network bandwidth consumption.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Bytes_In",
|
||||
"legendFormat": "Bytes In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Bytes_Out",
|
||||
"legendFormat": "Bytes Out"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes",
|
||||
"custom": {
|
||||
"axisLabel": "Bytes",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Total Network Messages",
|
||||
"description": "Total messages sent and received across all peer connections. Sourced from the total.Messages_In and total.Messages_Out traffic category gauges (OverlayImpl.h:535-548). Shows the overall message throughput of the overlay network.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Messages_In",
|
||||
"legendFormat": "Messages In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Messages_Out",
|
||||
"legendFormat": "Messages Out"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Traffic",
|
||||
"description": "Message counts for transaction-related overlay traffic. Covers the transactions and transactions_duplicate traffic categories (OverlayImpl/TrafficCount.h). Spikes indicate high transaction volume on the network or transaction flooding.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_Messages_In",
|
||||
"legendFormat": "TX Messages In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_Messages_Out",
|
||||
"legendFormat": "TX Messages Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_duplicate_Messages_In",
|
||||
"legendFormat": "TX Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Proposal Traffic",
|
||||
"description": "Messages for consensus proposal overlay traffic. Includes proposals, proposals_untrusted, and proposals_duplicate categories (TrafficCount.h). High untrusted or duplicate counts may indicate UNL misconfiguration or network spam.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_Messages_In",
|
||||
"legendFormat": "Proposals In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_Messages_Out",
|
||||
"legendFormat": "Proposals Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_untrusted_Messages_In",
|
||||
"legendFormat": "Untrusted In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_duplicate_Messages_In",
|
||||
"legendFormat": "Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Validation Traffic",
|
||||
"description": "Messages for validation overlay traffic. Includes validations, validations_untrusted, and validations_duplicate categories (TrafficCount.h). Monitoring trusted vs untrusted validation traffic helps detect UNL health issues.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_Messages_In",
|
||||
"legendFormat": "Validations In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_Messages_Out",
|
||||
"legendFormat": "Validations Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_untrusted_Messages_In",
|
||||
"legendFormat": "Untrusted In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_duplicate_Messages_In",
|
||||
"legendFormat": "Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Overlay Traffic by Category (Bytes In)",
|
||||
"description": "Top 10 overlay traffic categories by inbound bytes, selected from the 57 categories defined in TrafficCount.h (the total aggregate is excluded). Shows which protocol message types consume the most bandwidth. Categories include transactions, proposals, validations, ledger data, getobject, and overlay overhead.",
|
||||
"type": "bargauge",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
|
||||
"legendFormat": "{{__name__}}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes"
|
||||
},
|
||||
"overrides": [
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_transactions_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Transactions" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_proposals_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Proposals" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_validations_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Validations" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_overhead_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Overhead" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_overhead_overlay_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Overhead Overlay" }]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "rippled_ping_Bytes_In" },
|
||||
"properties": [{ "id": "displayName", "value": "Ping" }]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "rippled_status_Bytes_In" },
|
||||
"properties": [{ "id": "displayName", "value": "Status" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_getObject_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Get Object" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_haveTxSet_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Have Tx Set" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledgerData_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Data" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_share_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Share" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_get_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Data Get" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Ledger Data Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Account_State_Node_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Account State Node Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Account_State_Node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Account State Node Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Node_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Transaction Node Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Transaction Node Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Tx Set Candidate Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Account_State_node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{
|
||||
"id": "displayName",
|
||||
"value": "Account State Node Share (Legacy)"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Tx Set Candidate Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Transaction_node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{
|
||||
"id": "displayName",
|
||||
"value": "Transaction Node Share (Legacy)"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_set_get_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Set Get" }]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "network", "telemetry"],
|
||||
"templating": { "list": [] },
|
||||
"time": { "from": "now-1h", "to": "now" },
|
||||
"title": "rippled Network Traffic (StatsD)",
|
||||
"uid": "rippled-statsd-network"
|
||||
}
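
Reviewer note: the field overrides above key on full Prometheus metric names. As a minimal sketch of how those names appear to decompose, the helper below strips the `rippled_` prefix (taken from `prefix=rippled` in the `[insight]` config) and one of the four per-category measures (`Bytes_In`, `Bytes_Out`, `Messages_In`, `Messages_Out`). The function name `traffic_category` is hypothetical and not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): extract the traffic category
# from an overlay-traffic metric name such as rippled_set_get_Bytes_In.
traffic_category() {
    local name="${1#rippled_}"   # drop the assumed StatsD prefix
    local suffix
    for suffix in _Bytes_In _Bytes_Out _Messages_In _Messages_Out; do
        case "$name" in
            *"$suffix")
                # Remove the measure suffix; what remains is the category.
                printf '%s\n' "${name%"$suffix"}"
                return 0
                ;;
        esac
    done
    return 1                     # not an overlay-traffic metric
}

traffic_category "rippled_set_get_Bytes_In"   # prints: set_get
```

A non-zero exit status signals a name that does not follow the per-category pattern, which can help spot overrides that will never match a series.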
|
||||
415
docker/telemetry/grafana/dashboards/statsd-node-health.json
Normal file
@@ -0,0 +1,415 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Validated Ledger Age",
|
||||
"description": "Age of the most recently validated ledger in seconds. Sourced from the LedgerMaster.Validated_Ledger_Age gauge (LedgerMaster.h:373) which is updated every collection interval via the insight hook. Values above 20s indicate the node is falling behind the network.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_LedgerMaster_Validated_Ledger_Age",
|
||||
"legendFormat": "Validated Age"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 10
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 20
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Published Ledger Age",
|
||||
"description": "Age of the most recently published ledger in seconds. Sourced from the LedgerMaster.Published_Ledger_Age gauge (LedgerMaster.h:374). Published ledger age should track close to validated ledger age. A growing gap indicates publish pipeline backlog.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_LedgerMaster_Published_Ledger_Age",
|
||||
"legendFormat": "Published Age"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 10
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 20
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Operating Mode Duration",
|
||||
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). A healthy node should spend the vast majority of time in Full mode.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Full_duration",
|
||||
"legendFormat": "Full"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Tracking_duration",
|
||||
"legendFormat": "Tracking"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Syncing_duration",
|
||||
"legendFormat": "Syncing"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Connected_duration",
|
||||
"legendFormat": "Connected"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Disconnected_duration",
|
||||
"legendFormat": "Disconnected"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (Sec)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Operating Mode Transitions",
|
||||
"description": "Count of transitions into each operating mode. Sourced from State_Accounting.*_transitions gauges (NetworkOPs.cpp:780-786). Frequent transitions out of Full mode indicate instability. Transitions to Disconnected or Syncing warrant investigation.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Full_transitions",
|
||||
"legendFormat": "Full"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Tracking_transitions",
|
||||
"legendFormat": "Tracking"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Syncing_transitions",
|
||||
"legendFormat": "Syncing"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Connected_transitions",
|
||||
"legendFormat": "Connected"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Disconnected_transitions",
|
||||
"legendFormat": "Disconnected"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Transitions",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "I/O Latency",
|
||||
"description": "P95 and P50 of the I/O service loop latency in milliseconds. Sourced from the ios_latency event (Application.cpp:438) which measures how long it takes for the io_context to process a timer callback. Values above 10ms are logged; above 500ms trigger warnings. High values indicate thread pool saturation or blocking operations.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_ios_latency{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 I/O Latency"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_ios_latency{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 I/O Latency"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Job Queue Depth",
|
||||
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_job_count",
|
||||
"legendFormat": "Job Queue Depth"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Jobs",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Ledger Fetch Rate",
|
||||
"description": "Rate of ledger fetch requests initiated by the node. Sourced from the ledger_fetches counter (InboundLedgers.cpp:44) which increments each time the node requests a ledger from a peer. High rates indicate the node is catching up or missing ledgers.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_ledger_fetches_total[5m])",
|
||||
"legendFormat": "Fetches / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Ledger History Mismatches",
|
||||
"description": "Rate of ledger history hash mismatches. Sourced from the ledger.history.mismatch counter (LedgerHistory.cpp:16) which increments when a built ledger hash does not match the expected validated hash. Non-zero values indicate consensus divergence or database corruption.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_ledger_history_mismatch_total[5m])",
|
||||
"legendFormat": "Mismatches / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 0.01
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "node-health", "telemetry"],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"title": "rippled Node Health (StatsD)",
|
||||
"uid": "rippled-statsd-node-health"
|
||||
}
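
Reviewer note: the two ledger-age panels above share the same threshold steps (green base, yellow at 10, red at 20). The sketch below mirrors that mapping, assuming Grafana's rule that a step applies from its value upward. The function name `ledger_age_status` is hypothetical, and it expects whole seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): map a ledger age in whole
# seconds to the colour the stat panel would show.
ledger_age_status() {
    local age="$1"
    if [ "$age" -ge 20 ]; then
        echo "red"       # node is falling behind the network
    elif [ "$age" -ge 10 ]; then
        echo "yellow"    # worth watching
    else
        echo "green"     # healthy
    fi
}

ledger_age_status 12   # prints: yellow
```

The same cut-offs could back an alert rule, so the dashboard colour and the alert would agree on what counts as "behind".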
|
||||
396
docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
Normal file
@@ -0,0 +1,396 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "RPC Request Rate (StatsD)",
|
||||
"description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_rpc_requests_total[5m])",
|
||||
"legendFormat": "Requests / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "reqps"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Time (StatsD)",
|
||||
"description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. Compare with span-based rpc.request duration.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Response Time"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Response Time"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Size",
|
||||
"description": "P95 and P50 of RPC response payload size in bytes. Sourced from the rpc.size event (ServerHandler.cpp:109) which records the byte length of each RPC JSON response. Large responses may indicate expensive queries (e.g. account_tx with many results) or API misuse.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_size{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Response Size"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_size{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Response Size"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes",
|
||||
"custom": {
|
||||
"axisLabel": "Size (Bytes)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Time Distribution",
|
||||
"description": "Distribution of RPC response times from the beast::insight timer showing P50, P90, P95, and P99 quantiles. Sourced from the rpc.time event (ServerHandler.cpp:110). Useful for detecting bimodal latency or long-tail requests.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.9\"}",
|
||||
"legendFormat": "P90"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.99\"}",
|
||||
"legendFormat": "P99"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Pathfinding Fast Duration",
|
||||
"description": "P95 and P50 of fast pathfinding execution time. Sourced from the pathfind_fast event (PathRequests.h:23) which records the duration of the fast pathfinding algorithm. Fast pathfinding uses a simplified search that trades accuracy for speed.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Fast Pathfind"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Fast Pathfind"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Pathfinding Full Duration",
|
||||
"description": "P95 and P50 of full pathfinding execution time. Sourced from the pathfind_full event (PathRequests.h:24) which records the duration of the exhaustive pathfinding search. Full pathfinding is more expensive and can take significantly longer than fast mode.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Full Pathfind"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Full Pathfind"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Resource Warnings Rate",
|
||||
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_warn_total[5m])",
|
||||
"legendFormat": "Warnings / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 0.1
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 1
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Resource Drops Rate",
|
||||
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_drop_total[5m])",
|
||||
"legendFormat": "Drops / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 0.01
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 0.1
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"title": "rippled RPC & Pathfinding (StatsD)",
|
||||
"uid": "rippled-statsd-rpc"
|
||||
}
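
Reviewer note: the Resource Warnings and Resource Drops panels above say they stay empty until the `|m` type is changed to `|c`. The sketch below classifies a StatsD line by its type field, assuming the common wire format `name:value|type[|@rate]` and treating only `c`, `g`, `ms`, and `h` as recognised types. The function name `statsd_type` and the sample metric names are illustrative, not taken from the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): report the metric type
# carried by a single StatsD line.
statsd_type() {
    local line="$1"
    local rest="${line#*|}"    # everything after the first '|'
    local type="${rest%%|*}"   # drop an optional sample-rate field
    case "$type" in
        c)  echo "counter" ;;
        g)  echo "gauge" ;;
        ms) echo "timer" ;;
        h)  echo "histogram" ;;
        *)  echo "unsupported" ;;   # e.g. the non-standard 'm' meter type
    esac
}

statsd_type "rippled.warn:1|m"   # prints: unsupported
statsd_type "rippled.warn:1|c"   # prints: counter
```

A check like this, run against captured UDP payloads, would show quickly whether any remaining metrics still use a type the receiver is likely to ignore.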
|
||||
@@ -8,76 +8,223 @@
|
||||
"panels": [
|
||||
{
|
||||
"title": "Transaction Processing Rate",
|
||||
"description": "Rate of transactions entering the processing pipeline. tx.process (NetworkOPs.cpp:1227) fires when a transaction is submitted locally or received from a peer and enters processTransaction(). tx.receive (PeerImp.cpp:1273) fires when a raw transaction message arrives from a peer before deduplication.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.process\"}[5m]))",
|
||||
"legendFormat": "tx.process/sec"
|
||||
"legendFormat": "tx.process / Sec"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.receive\"}[5m]))",
|
||||
"legendFormat": "tx.receive/sec"
|
||||
"legendFormat": "tx.receive / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Transactions / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Processing Latency",
|
||||
"description": "P95 and P50 latency of transaction processing (tx.process span). Measures the time from when a transaction enters processTransaction() to completion. Computed via histogram_quantile() over the spanmetrics duration histogram with a 5m rate window.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.process\"}[5m])))",
|
||||
"legendFormat": "p95"
|
||||
"legendFormat": "P95"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.process\"}[5m])))",
|
||||
"legendFormat": "p50"
|
||||
"legendFormat": "P50"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms"
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Path Distribution",
|
||||
"description": "Breakdown of transactions by origin path. The xrpl.tx.local attribute indicates whether the transaction was submitted locally (true) or received from a peer (false). Helps understand the ratio of locally-originated vs relayed transactions.",
|
||||
"type": "piechart",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name=\"tx.process\"}[5m]))",
|
||||
"legendFormat": "local={{xrpl_tx_local}}"
|
||||
"legendFormat": "Local = {{xrpl_tx_local}}"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Transaction Receive vs Suppressed",
|
||||
"description": "Total rate of raw transaction messages received from peers (tx.receive span from PeerImp.cpp:1273). This fires before deduplication via the HashRouter, so the difference between tx.receive and tx.process reflects suppressed duplicate transactions.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.receive\"}[5m]))",
|
||||
"legendFormat": "total received"
|
||||
"legendFormat": "Total Received"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Transactions / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Processing Duration Heatmap",
|
||||
"description": "Heatmap showing the distribution of tx.process span durations across histogram buckets over time. Each cell represents the count of transactions that completed within that latency bucket in a 5m window. Reveals whether processing times are consistent or exhibit multi-modal patterns.",
|
||||
"type": "heatmap",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" },
|
||||
"yAxis": { "axisLabel": "Duration (ms)" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.process\"}[5m])) by (le)",
|
||||
"legendFormat": "{{le}}",
|
||||
"format": "heatmap"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"title": "Transaction Apply Duration per Ledger",
|
||||
"description": "P95 and P50 latency of applying the consensus transaction set to a new ledger. The tx.apply span (BuildLedger.cpp:88) wraps the applyTransactions() function that iterates through the CanonicalTXSet and applies each transaction to the OpenView. Long durations indicate heavy transaction sets or expensive transaction processing.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.apply\"}[5m])))",
|
||||
"legendFormat": "P95 tx.apply"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"tx.apply\"}[5m])))",
|
||||
"legendFormat": "P50 tx.apply"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Peer Transaction Receive Rate",
|
||||
"description": "Rate of transaction messages received from network peers. Sourced from the tx.receive span (PeerImp.cpp:1273) which fires in the onMessage(TMTransaction) handler. High rates may indicate network-wide transaction volume spikes or peer flooding.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.receive\"}[5m]))",
|
||||
"legendFormat": "tx.receive / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"custom": {
|
||||
"axisLabel": "Transactions / Sec",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Apply Failed Rate",
|
||||
"description": "Rate of tx.apply spans completing with error status, indicating transaction application failures during ledger building. The span records xrpl.ledger.tx_failed as an attribute. Thresholds: green < 0.1/sec, yellow 0.1-1/sec, red > 1/sec. Some failures are normal (e.g. conflicting offers) but sustained high rates may indicate issues.",
|
||||
"type": "stat",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"tx.apply\", status_code=\"STATUS_CODE_ERROR\"}[5m]))",
|
||||
"legendFormat": "Failed / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "yellow", "value": 0.1 },
|
||||
{ "color": "red", "value": 1 }
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
|
||||
@@ -307,9 +307,14 @@ max_queue_size=2048
 trace_rpc=1
 trace_transactions=1
 trace_consensus=1
-trace_peer=0
+trace_peer=1
 trace_ledger=1
 
+[insight]
+server=statsd
+address=127.0.0.1:8125
+prefix=rippled
+
 [rpc_startup]
 { "command": "log_level", "severity": "warning" }
 
@@ -481,6 +486,7 @@ log ""
log "--- Phase 3: Transaction Spans ---"
check_span "tx.process"
check_span "tx.receive"
check_span "tx.apply"

log ""
log "--- Phase 4: Consensus Spans ---"
@@ -489,6 +495,17 @@ check_span "consensus.ledger_close"
check_span "consensus.accept"
check_span "consensus.validation.send"

log ""
log "--- Phase 5: Ledger Spans ---"
check_span "ledger.build"
check_span "ledger.validate"
check_span "ledger.store"

log ""
log "--- Phase 5: Peer Spans (trace_peer=1) ---"
check_span "peer.proposal.receive"
check_span "peer.validation.receive"

# ---------------------------------------------------------------------------
# Step 10: Verify Prometheus spanmetrics
# ---------------------------------------------------------------------------
@@ -520,6 +537,44 @@ else
    fail "Grafana: not reachable at localhost:3000"
fi

# ---------------------------------------------------------------------------
# Step 10b: Verify StatsD metrics in Prometheus
# ---------------------------------------------------------------------------
log ""
log "--- Phase 6: StatsD Metrics (beast::insight) ---"
log "Waiting 20s for StatsD aggregation + Prometheus scrape..."
sleep 20

check_statsd_metric() {
    local metric_name="$1"
    local result
    result=$(curl -sf "$PROM/api/v1/query?query=$metric_name" \
        | jq '.data.result | length' 2>/dev/null || echo 0)
    if [ "$result" -gt 0 ]; then
        ok "StatsD: $metric_name ($result series)"
    else
        fail "StatsD: $metric_name (0 series)"
    fi
}

# Node health gauges
check_statsd_metric "rippled_LedgerMaster_Validated_Ledger_Age"
check_statsd_metric "rippled_LedgerMaster_Published_Ledger_Age"
check_statsd_metric "rippled_job_count"

# State accounting
check_statsd_metric "rippled_State_Accounting_Full_duration"

# Peer finder
check_statsd_metric "rippled_Peer_Finder_Active_Inbound_Peers"
check_statsd_metric "rippled_Peer_Finder_Active_Outbound_Peers"

# RPC counters (only if RPC was exercised — should be true from Steps 5-8)
check_statsd_metric "rippled_rpc_requests"

# Overlay traffic
check_statsd_metric "rippled_total_Bytes_In"

# ---------------------------------------------------------------------------
# Step 11: Summary
# ---------------------------------------------------------------------------

@@ -2,11 +2,22 @@
#
# Pipelines:
# traces: OTLP receiver -> batch processor -> debug + Jaeger + spanmetrics
-# metrics: spanmetrics connector -> Prometheus exporter
+# metrics: spanmetrics connector + StatsD receiver -> Prometheus exporter
#
# rippled sends traces via OTLP/HTTP to port 4318. The collector batches
# them, forwards to Jaeger, and derives RED metrics via the spanmetrics
# connector, which Prometheus scrapes on port 8889.
#
# rippled also sends beast::insight metrics via StatsD/UDP to port 8125.
# These are ingested by the statsd receiver and merged into the same
# Prometheus endpoint alongside span-derived metrics.
#
# TODO: The Resource Manager's "warn" and "drop" metrics use the non-standard
# "|m" (meter) StatsD type in StatsDCollector.cpp:706. The OTel StatsD
# receiver silently drops "|m" metrics since it only recognizes standard
# types (|c, |g, |ms, |h, |s). To capture these two metrics, change "|m"
# to "|c" in StatsDCollector.cpp — this is a breaking change for any
# backend that relied on the custom "|m" type. Tracked as Phase 6 Task 6.1.

receivers:
  otlp:
@@ -15,6 +26,20 @@ receivers:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 15s
    enable_metric_type: true
    is_monotonic_counter: true
    timer_histogram_mapping:
      - statsd_type: "timing"
        observer_type: "summary"
        summary:
          percentiles: [0, 50, 90, 95, 99, 100]
      - statsd_type: "histogram"
        observer_type: "summary"
        summary:
          percentiles: [0, 50, 90, 95, 99, 100]

processors:
  batch:
@@ -31,6 +56,8 @@ connectors:
      - name: xrpl.rpc.status
      - name: xrpl.consensus.mode
      - name: xrpl.tx.local
      - name: xrpl.peer.proposal.trusted
      - name: xrpl.peer.validation.trusted

exporters:
  debug:
@@ -49,5 +76,5 @@ service:
      processors: [batch]
      exporters: [debug, otlp/jaeger, spanmetrics]
    metrics:
-      receivers: [spanmetrics]
+      receivers: [spanmetrics, statsd]
      exporters: [prometheus]

@@ -75,6 +75,7 @@ All spans instrumented in rippled, grouped by subsystem:
| ------------ | ------------------- | ----------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Transaction set applied per ledger |

### Consensus Spans (Phase 4)

@@ -85,6 +86,21 @@ All spans instrumented in rippled, grouped by subsystem:
| `consensus.accept` | RCLConsensus.cpp:395 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted by consensus |
| `consensus.validation.send` | RCLConsensus.cpp:753 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent after accept |

### Ledger Spans (Phase 5)

| Span Name | Source File | Attributes | Description |
| ----------------- | -------------------- | -------------------------------------------- | ----------------------------- |
| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq` | Ledger build during consensus |
| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger promoted to validated |
| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history |

### Peer Spans (Phase 5)

| Span Name | Source File | Attributes | Description |
| ------------------------- | ---------------- | ---------------------------------------------- | ----------------------------- |
| `peer.proposal.receive` | PeerImp.cpp:1667 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Proposal received from peer |
| `peer.validation.receive` | PeerImp.cpp:2264 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Validation received from peer |

## Prometheus Metrics (Spanmetrics)

The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in rippled.
@@ -111,12 +127,14 @@ Every metric carries these standard labels:

Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):

-| Span Attribute | Metric Label | Applies To |
-| --------------------- | --------------------- | ------------------------------ |
-| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` spans |
-| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` spans |
-| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
-| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` spans |
+| Span Attribute | Metric Label | Applies To |
+| ------------------------------ | ------------------------------ | ------------------------------- |
+| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` spans |
+| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` spans |
+| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
+| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` spans |
+| `xrpl.peer.proposal.trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` spans |
+| `xrpl.peer.validation.trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` spans |
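
The attribute-to-label renaming in the table above can be approximated in a few lines. This is only an illustrative sketch, not the collector's code: the function name `to_prometheus_label` is hypothetical, and the rule it applies (replace every character outside `[A-Za-z0-9_]` with `_`) is an assumption chosen because it reproduces the rows listed above.

```python
import re


def to_prometheus_label(attribute: str) -> str:
    """Approximate how a span attribute name becomes a Prometheus label name."""
    # Assumption: any character outside [A-Za-z0-9_] is replaced by "_".
    return re.sub(r"[^A-Za-z0-9_]", "_", attribute)


# Print the mapping for two attributes from the table.
for attr in ("xrpl.rpc.command", "xrpl.peer.validation.trusted"):
    print(attr, "->", to_prometheus_label(attr))
```

A helper like this can be handy when writing PromQL by hand: start from the attribute name used in the C++ instrumentation and derive the label to filter or group on.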

### Histogram Buckets

@@ -126,9 +144,63 @@ Configured in `otel-collector-config.yaml`:
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```
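
As a worked illustration of how a percentile is read off these buckets, the sketch below approximates the linear interpolation that PromQL's `histogram_quantile` applies to cumulative bucket counts. The function name `estimate_quantile` and the sample counts are hypothetical, and edge cases that Prometheus handles (such as `q` outside `(0, 1]`) are deliberately left out.

```python
import math

# Bucket upper bounds in milliseconds, taken from the list above, plus +Inf.
BUCKET_BOUNDS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, math.inf]


def estimate_quantile(q: float, cumulative_counts: list[float]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in zip(BUCKET_BOUNDS_MS, cumulative_counts):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical sample: 100 spans, 90 finished within 25 ms, all within 50 ms.
sample_counts = [0, 10, 50, 90, 100, 100, 100, 100, 100, 100, 100]
print(f"p95 ~ {estimate_quantile(0.95, sample_counts):.1f} ms")
```

Because the estimate is interpolated inside a single bucket, its resolution is limited by the bucket width: a p95 that lands between 1s and 5s can only be located on a straight line between those two bounds.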

## StatsD Metrics (beast::insight)

rippled has a built-in metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.

### Configuration

Add to `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```

The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and exports them to Prometheus alongside spanmetrics.
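
To make the wire format concrete, here is a minimal sketch of splitting one StatsD line (`name:value|type`) into its parts and classifying the type suffix. Everything named here is hypothetical: `parse_statsd_line`, `StatsdSample`, and the dotted metric names in the examples are assumptions for illustration, and the set of recognized suffixes is copied from the TODO note in the collector config in this change rather than from the receiver's source.

```python
from typing import NamedTuple

# Type suffixes that the collector config comment lists as recognized.
RECOGNIZED_TYPES = {
    "c": "counter",
    "g": "gauge",
    "ms": "timing",
    "h": "histogram",
    "s": "set",
}


class StatsdSample(NamedTuple):
    name: str
    value: str
    kind: str  # "unrecognized" when the suffix is not in RECOGNIZED_TYPES


def parse_statsd_line(line: str) -> StatsdSample:
    """Split one 'name:value|type' StatsD line into its parts."""
    name, _, rest = line.partition(":")
    value, _, suffix = rest.partition("|")
    # Drop an optional trailing section such as a sample rate ("|@0.5").
    suffix = suffix.split("|", 1)[0]
    return StatsdSample(name, value, RECOGNIZED_TYPES.get(suffix, "unrecognized"))


# Assumed example lines; the real names emitted by rippled may differ.
for example in ("rippled.rpc.requests:1|c", "rippled.warn:1|m"):
    print(example, "->", parse_statsd_line(example).kind)
```

The second example shows why the non-standard `|m` suffix matters: a parser that only knows the listed types has nothing to map it to, so such samples never reach Prometheus.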

### Metric Reference

#### Gauges

| Prometheus Metric | Source | Description |
| --------------------------------------------- | ------------------------- | -------------------------------------------------------------------------- |
| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h:373 | Age of validated ledger (seconds) |
| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h:374 | Age of published ledger (seconds) |
| `rippled_State_Accounting_{Mode}_duration` | NetworkOPs.cpp:774 | Time in each operating mode (Disconnected/Connected/Syncing/Tracking/Full) |
| `rippled_State_Accounting_{Mode}_transitions` | NetworkOPs.cpp:780 | Transition count per mode |
| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp:214 | Active inbound peer connections |
| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp:215 | Active outbound peer connections |
| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.h:557 | Peer disconnect count |
| `rippled_job_count` | JobQueue.cpp:26 | Current job queue depth |
| `rippled_{category}_Bytes_In/Out` | OverlayImpl.h:535 | Overlay traffic bytes per category (57 categories) |
| `rippled_{category}_Messages_In/Out` | OverlayImpl.h:535 | Overlay traffic messages per category |

#### Counters

| Prometheus Metric | Source | Description |
| --------------------------------- | --------------------- | ------------------------------ |
| `rippled_rpc_requests` | ServerHandler.cpp:108 | Total RPC request count |
| `rippled_ledger_fetches` | InboundLedgers.cpp:44 | Ledger fetch request count |
| `rippled_ledger_history_mismatch` | LedgerHistory.cpp:16 | Ledger hash mismatch count |
| `rippled_warn` | Logic.h:33 | Resource manager warning count |
| `rippled_drop` | Logic.h:34 | Resource manager drop count |

#### Histograms (from StatsD timers)

| Prometheus Metric | Source | Description |
| ----------------------- | --------------------- | ------------------------------ |
| `rippled_rpc_time` | ServerHandler.cpp:110 | RPC response time (ms) |
| `rippled_rpc_size` | ServerHandler.cpp:109 | RPC response size (bytes) |
| `rippled_ios_latency` | Application.cpp:438 | I/O service loop latency (ms) |
| `rippled_pathfind_fast` | PathRequests.h:23 | Fast pathfinding duration (ms) |
| `rippled_pathfind_full` | PathRequests.h:24 | Full pathfinding duration (ms) |

## Grafana Dashboards

-Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
+Eight dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:

### RPC Performance (`rippled-rpc-perf`)

@@ -138,6 +210,10 @@ Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |
| Overall RPC Throughput | timeseries | `rpc.request` + `rpc.process` rate | — |
| RPC Success vs Error | timeseries | by `status_code` (UNSET vs ERROR) | `status_code` |
| Top Commands by Volume | bargauge | `topk(10, ...)` by `xrpl_rpc_command` | `xrpl_rpc_command` |
| WebSocket Message Rate | stat | `rpc.ws_message` rate | — |

### Transaction Overview (`rippled-transactions`)

@@ -147,29 +223,107 @@ Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="tx.process"})` | — |
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | — |
| TX Processing Duration Heatmap | heatmap | `tx.process` histogram buckets | `le` |
| TX Apply Duration per Ledger | timeseries | p95/p50 of `tx.apply` | — |
| Peer TX Receive Rate | timeseries | `tx.receive` rate | — |
| TX Apply Failed Rate | stat | `tx.apply` with `STATUS_CODE_ERROR` | `status_code` |

### Consensus Health (`rippled-consensus`)

-| Panel | Type | PromQL | Labels Used |
-| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | ----------- |
-| Consensus Round Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})` | — |
-| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | — |
-| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | — |
-| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | — |
+| Panel | Type | PromQL | Labels Used |
+| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | --------------------- |
+| Consensus Round Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})` | — |
+| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | — |
+| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | — |
+| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | — |
+| Consensus Mode Over Time | timeseries | `consensus.ledger_close` by `xrpl_consensus_mode` | `xrpl_consensus_mode` |
+| Accept vs Close Rate | timeseries | `consensus.accept` vs `consensus.ledger_close` rate | — |
+| Validation vs Close Rate | timeseries | `consensus.validation.send` vs `consensus.ledger_close` | — |
+| Accept Duration Heatmap | heatmap | `consensus.accept` histogram buckets | `le` |

### Ledger Operations (`rippled-ledger-ops`)

| Panel | Type | PromQL | Labels Used |
| ----------------------- | ---------- | ---------------------------------------------- | ----------- |
| Ledger Build Rate | stat | `ledger.build` call rate | — |
| Ledger Build Duration | timeseries | p95/p50 of `ledger.build` | — |
| Ledger Validation Rate | stat | `ledger.validate` call rate | — |
| Build Duration Heatmap | heatmap | `ledger.build` histogram buckets | `le` |
| TX Apply Duration | timeseries | p95/p50 of `tx.apply` | — |
| TX Apply Rate | timeseries | `tx.apply` call rate | — |
| Ledger Store Rate | stat | `ledger.store` call rate | — |
| Build vs Close Duration | timeseries | p95 `ledger.build` vs `consensus.ledger_close` | — |

### Peer Network (`rippled-peer-net`)

Requires `trace_peer=1` in the `[telemetry]` config section.

| Panel | Type | PromQL | Labels Used |
| -------------------------------- | ---------- | --------------------------------- | ------------------------------ |
| Proposal Receive Rate | timeseries | `peer.proposal.receive` rate | — |
| Validation Receive Rate | timeseries | `peer.validation.receive` rate | — |
| Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` |
| Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |

### Node Health — StatsD (`rippled-statsd-node-health`)

| Panel | Type | PromQL | Labels Used |
| -------------------------- | ---------- | ------------------------------------------------------ | ----------- |
| Validated Ledger Age | stat | `rippled_LedgerMaster_Validated_Ledger_Age` | — |
| Published Ledger Age | stat | `rippled_LedgerMaster_Published_Ledger_Age` | — |
| Operating Mode Duration | timeseries | `rippled_State_Accounting_*_duration` | — |
| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions` | — |
| I/O Latency | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)` | — |
| Job Queue Depth | timeseries | `rippled_job_count` | — |
| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — |
| Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — |

### Network Traffic — StatsD (`rippled-statsd-network`)

| Panel | Type | PromQL | Labels Used |
| ---------------------- | ---------- | -------------------------------------- | ----------- |
| Active Peers | timeseries | `rippled_Peer_Finder_Active_*_Peers` | — |
| Peer Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects` | — |
| Total Network Bytes | timeseries | `rippled_total_Bytes_In/Out` | — |
| Total Network Messages | timeseries | `rippled_total_Messages_In/Out` | — |
| Transaction Traffic | timeseries | `rippled_transactions_Messages_In/Out` | — |
| Proposal Traffic | timeseries | `rippled_proposals_Messages_In/Out` | — |
| Validation Traffic | timeseries | `rippled_validations_Messages_In/Out` | — |
| Traffic by Category | bargauge | `topk(10, rippled_*_Bytes_In)` | — |

### RPC & Pathfinding — StatsD (`rippled-statsd-rpc`)

| Panel | Type | PromQL | Labels Used |
| ------------------------- | ---------- | -------------------------------------------------------- | ----------- |
| RPC Request Rate | stat | `rate(rippled_rpc_requests[5m])` | — |
| RPC Response Time | timeseries | `histogram_quantile(0.95, rippled_rpc_time_bucket)` | — |
| RPC Response Size | timeseries | `histogram_quantile(0.95, rippled_rpc_size_bucket)` | — |
| RPC Response Time Heatmap | heatmap | `rippled_rpc_time_bucket` | — |
| Pathfinding Fast Duration | timeseries | `histogram_quantile(0.95, rippled_pathfind_fast_bucket)` | — |
| Pathfinding Full Duration | timeseries | `histogram_quantile(0.95, rippled_pathfind_full_bucket)` | — |
| Resource Warnings Rate | stat | `rate(rippled_warn[5m])` | — |
| Resource Drops Rate | stat | `rate(rippled_drop[5m])` | — |

### Span → Metric → Dashboard Summary

-| Span Name | Prometheus Metric Filter | Grafana Dashboard |
-| --------------------------- | ----------------------------------------- | ---------------------------------- |
-| `rpc.request` | `{span_name="rpc.request"}` | — (available but not paneled) |
-| `rpc.process` | `{span_name="rpc.process"}` | — (available but not paneled) |
-| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (all 4 panels) |
-| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (3 panels) |
-| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (2 panels) |
-| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Round Duration) |
-| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
-| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close Duration) |
-| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
+| Span Name | Prometheus Metric Filter | Grafana Dashboard |
+| --------------------------- | ----------------------------------------- | --------------------------------------------- |
+| `rpc.request` | `{span_name="rpc.request"}` | RPC Performance (Overall Throughput) |
+| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
+| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
+| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
+| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
+| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |
+| `tx.apply` | `{span_name="tx.apply"}` | Transaction Overview + Ledger Ops (Apply) |
+| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Duration, Rate, Heatmap) |
+| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
+| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close, Mode) |
+| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
+| `ledger.build` | `{span_name="ledger.build"}` | Ledger Ops (Build Rate, Duration, Heatmap) |
+| `ledger.validate` | `{span_name="ledger.validate"}` | Ledger Ops (Validation Rate) |
+| `ledger.store` | `{span_name="ledger.store"}` | Ledger Ops (Store Rate) |
+| `peer.proposal.receive` | `{span_name="peer.proposal.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
+| `peer.validation.receive` | `{span_name="peer.validation.receive"}` | Peer Network (Rate, Trusted/Untrusted) |

## Troubleshooting

@@ -4,6 +4,7 @@
#include <xrpld/app/ledger/OpenLedger.h>
#include <xrpld/app/main/Application.h>
#include <xrpld/app/misc/CanonicalTXSet.h>
#include <xrpld/telemetry/TracingInstrumentation.h>

#include <xrpl/protocol/Feature.h>
#include <xrpl/tx/apply.h>
@@ -27,6 +28,8 @@ buildLedgerImpl(
    beast::Journal j,
    ApplyTxs&& applyTxs)
{
    XRPL_TRACE_LEDGER(app.getTelemetry(), "ledger.build");

    auto built = std::make_shared<Ledger>(*parent, closeTime);

    if (built->isFlagLedger())
@@ -60,6 +63,7 @@ buildLedgerImpl(
        built->header().seq < XRP_LEDGER_EARLIEST_FEES || built->read(keylet::fees()),
        "xrpl::buildLedgerImpl : valid ledger fees");
    built->setAccepted(closeTime, closeResolution, closeTimeCorrect);
    XRPL_TRACE_SET_ATTR("xrpl.ledger.seq", static_cast<int64_t>(built->header().seq));

    return built;
}
@@ -83,6 +87,8 @@ applyTransactions(
    OpenView& view,
    beast::Journal j)
{
    XRPL_TRACE_TX(app.getTelemetry(), "tx.apply");

    bool certainRetry = true;
    std::size_t count = 0;

@@ -149,6 +155,8 @@ applyTransactions(
    // If there are any transactions left, we must have
    // tried them in at least one final pass
    XRPL_ASSERT(txns.empty() || !certainRetry, "xrpl::applyTransactions : retry transactions");
    XRPL_TRACE_SET_ATTR("xrpl.ledger.tx_count", static_cast<int64_t>(count));
    XRPL_TRACE_SET_ATTR("xrpl.ledger.tx_failed", static_cast<int64_t>(failed.size()));
    return count;
}

@@ -13,6 +13,7 @@
#include <xrpld/core/TimeKeeper.h>
#include <xrpld/overlay/Overlay.h>
#include <xrpld/overlay/Peer.h>
#include <xrpld/telemetry/TracingInstrumentation.h>

#include <xrpl/basics/Log.h>
#include <xrpl/basics/MathUtilities.h>
@@ -404,6 +405,9 @@ LedgerMaster::fixIndex(LedgerIndex ledgerIndex, LedgerHash const& ledgerHash)
bool
LedgerMaster::storeLedger(std::shared_ptr<Ledger const> ledger)
{
    XRPL_TRACE_LEDGER(app_.getTelemetry(), "ledger.store");
    XRPL_TRACE_SET_ATTR("xrpl.ledger.seq", static_cast<int64_t>(ledger->header().seq));

    bool validated = ledger->header().validated;
    // Returns true if we already had the ledger
    return mLedgerHistory.insert(std::move(ledger), validated);
@@ -907,6 +911,10 @@ LedgerMaster::checkAccept(std::shared_ptr<Ledger const> const& ledger)
        return;
    }

    XRPL_TRACE_LEDGER(app_.getTelemetry(), "ledger.validate");
    XRPL_TRACE_SET_ATTR("xrpl.ledger.seq", static_cast<int64_t>(ledger->header().seq));
    XRPL_TRACE_SET_ATTR("xrpl.ledger.validations", static_cast<int64_t>(tvc));

    JLOG(m_journal.info()) << "Advancing accepted ledger to " << ledger->header().seq
                           << " with >= " << minVal << " validations";

@@ -1664,6 +1664,9 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMLedgerData> const& m)
void
PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
{
    XRPL_TRACE_PEER(app_.getTelemetry(), "peer.proposal.receive");
    XRPL_TRACE_SET_ATTR("xrpl.peer.id", static_cast<int64_t>(id_));

    protocol::TMProposeSet& set = *m;

    auto const sig = makeSlice(set.signature());
@@ -1690,6 +1693,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMProposeSet> const& m)
    // every time a spam packet is received
    PublicKey const publicKey{makeSlice(set.nodepubkey())};
    auto const isTrusted = app_.validators().trusted(publicKey);
    XRPL_TRACE_SET_ATTR("xrpl.peer.proposal.trusted", isTrusted);

    // If the operator has specified that untrusted proposals be dropped then
    // this happens here I.e. before further wasting CPU verifying the signature
@@ -2257,6 +2261,9 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidatorListCollection> const& m
void
PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
{
    XRPL_TRACE_PEER(app_.getTelemetry(), "peer.validation.receive");
    XRPL_TRACE_SET_ATTR("xrpl.peer.id", static_cast<int64_t>(id_));

    if (m->validation().size() < 50)
    {
        JLOG(p_journal_.warn()) << "Validation: Too small";
@@ -2295,6 +2302,7 @@ PeerImp::onMessage(std::shared_ptr<protocol::TMValidation> const& m)
        // suppression for 30 seconds to avoid doing a relatively expensive
        // lookup every time a spam packet is received
        auto const isTrusted = app_.validators().trusted(val->getSignerPublic());
        XRPL_TRACE_SET_ATTR("xrpl.peer.validation.trusted", isTrusted);

        // If the operator has specified that untrusted validations be
        // dropped then this happens here I.e. before further wasting CPU