mirror of
https://github.com/XRPLF/rippled.git
synced 2026-04-29 15:37:57 +00:00
Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -525,6 +525,87 @@ Pre-configured datasources:
|
||||
- **Jaeger**: Trace data at `http://jaeger:16686`
|
||||
- **Tempo**: Trace data at `http://tempo:3200` (via Grafana Explore)
|
||||
- **Prometheus**: Metrics at `http://prometheus:9090`
|
||||
- **Loki**: Log data at `http://loki:3100` (via Grafana Explore)
|
||||
|
||||
---
|
||||
|
||||
## Test 3: Log-Trace Correlation (Phase 8)
|
||||
|
||||
Phase 8 injects `trace_id` and `span_id` into rippled's log output when
|
||||
a log line is emitted within an active OTel span. This test verifies the
|
||||
end-to-end log-trace correlation pipeline.
|
||||
|
||||
### Step 1: Verify trace_id in log output
|
||||
|
||||
After running Test 1 or Test 2 (which generate RPC spans), check the
|
||||
rippled debug.log for trace context:
|
||||
|
||||
```bash
|
||||
grep 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' /path/to/debug.log
|
||||
```
|
||||
|
||||
Expected: log lines with `trace_id=<32hex> span_id=<16hex>` between the
|
||||
severity code and the message. Example:
|
||||
|
||||
```
|
||||
2024-01-15T10:30:45.123Z RPCHandler:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Calling server_info
|
||||
```
|
||||
|
||||
Lines emitted outside of an active span (background tasks, startup) will
|
||||
NOT have trace context — this is expected.
|
||||
|
||||
### Step 2: Cross-check trace_id in Jaeger
|
||||
|
||||
Extract a `trace_id` from the log and verify it exists in Jaeger:
|
||||
|
||||
```bash
|
||||
TRACE_ID=$(grep -o 'trace_id=[a-f0-9]\{32\}' /path/to/debug.log | head -1 | cut -d= -f2)
|
||||
echo "Checking trace: $TRACE_ID"
|
||||
curl -s "http://localhost:16686/api/traces/$TRACE_ID" | jq '.data | length'
|
||||
```
|
||||
|
||||
Expected result: `1` (the trace exists in Jaeger).
|
||||
|
||||
### Step 3: Verify Loki log ingestion
|
||||
|
||||
The OTel Collector's filelog receiver tails rippled's debug.log and
|
||||
exports parsed entries to Loki. Verify Loki has received entries:
|
||||
|
||||
```bash
|
||||
# Query Loki for any rippled logs
|
||||
curl -sG "http://localhost:3100/loki/api/v1/query" \
|
||||
--data-urlencode 'query={job="rippled"}' \
|
||||
--data-urlencode 'limit=5' | jq '.data.result | length'
|
||||
```
|
||||
|
||||
Expected: > 0 results.
|
||||
|
||||
### Step 4: Verify Grafana Tempo-to-Loki correlation
|
||||
|
||||
1. Open Grafana at http://localhost:3000
|
||||
2. Navigate to **Explore** -> select **Tempo** datasource
|
||||
3. Search for a trace (e.g., operation `rpc.command.server_info`)
|
||||
4. Click **"Logs for this trace"** in the trace detail view
|
||||
5. Verify that Loki log lines appear, filtered by the trace's `trace_id`
|
||||
|
||||
### Step 5: Verify Grafana Loki-to-Tempo correlation
|
||||
|
||||
1. In Grafana **Explore**, select **Loki** datasource
|
||||
2. Query: `{job="rippled"} |= "trace_id="`
|
||||
3. In the log results, click the **TraceID** derived field link
|
||||
4. Verify it navigates to the full trace in Tempo
|
||||
|
||||
### Expected results
|
||||
|
||||
| Check | Expected |
|
||||
| ------------------------------ | ---------------------------------------- |
|
||||
| `trace_id=` in debug.log | Present in log lines within active spans |
|
||||
| `span_id=` in debug.log | Present alongside trace_id |
|
||||
| Logs without active span | No trace_id/span_id fields |
|
||||
| trace_id in Jaeger | Matches a valid trace |
|
||||
| Loki log ingestion | Logs visible via LogQL |
|
||||
| Tempo -> Loki "Logs for trace" | Shows correlated log lines |
|
||||
| Loki -> Tempo TraceID link | Navigates to correct trace |
|
||||
|
||||
---
|
||||
|
||||
@@ -567,6 +648,44 @@ Pre-configured datasources:
|
||||
2. Check submit response for error codes
|
||||
3. In standalone mode, remember to call `ledger_accept` after submitting
|
||||
|
||||
### No trace_id in log output (Phase 8)
|
||||
|
||||
1. Verify rippled was built with `telemetry=ON` (`-Dtelemetry=ON` in CMake)
|
||||
2. Verify `enabled=1` in the `[telemetry]` config section
|
||||
3. Log lines only contain trace context when emitted inside an active span.
|
||||
Background logs (startup, periodic tasks outside spans) will not have
|
||||
`trace_id`/`span_id`.
|
||||
4. Ensure the trace category is enabled (e.g., `trace_rpc=1` for RPC logs)
|
||||
|
||||
### No logs in Loki (Phase 8)
|
||||
|
||||
1. Verify the log file mount in docker-compose.yml:
|
||||
```yaml
|
||||
volumes:
|
||||
- /tmp/xrpld-integration:/var/log/rippled:ro
|
||||
```
|
||||
2. Check OTel Collector logs for filelog receiver errors:
|
||||
```bash
|
||||
docker compose -f docker/telemetry/docker-compose.yml logs otel-collector | grep -i "filelog\|loki\|error"
|
||||
```
|
||||
3. Verify Loki is running:
|
||||
```bash
|
||||
curl -s http://localhost:3100/ready
|
||||
```
|
||||
4. Verify the filelog receiver glob pattern matches your log files:
|
||||
The default pattern is `/var/log/rippled/*/debug.log`
|
||||
|
||||
### Grafana trace-log links not working (Phase 8)
|
||||
|
||||
1. Verify `tracesToLogs` is configured in the Tempo datasource provisioning
|
||||
(`docker/telemetry/grafana/provisioning/datasources/tempo.yaml`)
|
||||
2. Verify `derivedFields` is configured in the Loki datasource provisioning
|
||||
(`docker/telemetry/grafana/provisioning/datasources/loki.yaml`)
|
||||
3. Restart Grafana after changing provisioning files:
|
||||
```bash
|
||||
docker compose -f docker/telemetry/docker-compose.yml restart grafana
|
||||
```
|
||||
|
||||
### Spanmetrics not appearing in Prometheus
|
||||
|
||||
1. Verify otel-collector config has `spanmetrics` connector
|
||||
|
||||
@@ -2,13 +2,16 @@
|
||||
#
|
||||
# Provides services for local development:
|
||||
# - otel-collector: receives OTLP traces from rippled, batches and
|
||||
# forwards them to Jaeger and Tempo. Listens on ports 4317 (gRPC)
|
||||
# and 4318 (HTTP).
|
||||
# forwards them to Jaeger and Tempo. Also tails rippled log files
|
||||
# via filelog receiver and exports to Loki. Listens on ports
|
||||
# 4317 (gRPC), 4318 (HTTP), and 8125 (StatsD UDP).
|
||||
# - jaeger: all-in-one tracing backend with UI on port 16686.
|
||||
# - tempo: Grafana Tempo tracing backend, queryable via Grafana Explore
|
||||
# on port 3000. Recommended for production (S3/GCS storage, TraceQL).
|
||||
# - loki: Grafana Loki log aggregation backend for centralized log
|
||||
# ingestion and log-trace correlation (Phase 8).
|
||||
# - grafana: dashboards on port 3000, pre-configured with Jaeger, Tempo,
|
||||
# and Prometheus datasources.
|
||||
# Prometheus, and Loki datasources.
|
||||
#
|
||||
# Usage:
|
||||
# docker compose -f docker/telemetry/docker-compose.yml up -d
|
||||
@@ -34,9 +37,14 @@ services:
|
||||
# - "8125:8125/udp"
|
||||
volumes:
|
||||
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
|
||||
# Phase 8: Mount rippled log directory for filelog receiver.
|
||||
# The integration test writes logs to /tmp/xrpld-integration/;
|
||||
# mount it read-only so the collector can tail debug.log files.
|
||||
- /tmp/xrpld-integration:/var/log/rippled:ro
|
||||
depends_on:
|
||||
- jaeger
|
||||
- tempo
|
||||
- loki
|
||||
networks:
|
||||
- rippled-telemetry
|
||||
|
||||
@@ -61,6 +69,18 @@ services:
|
||||
networks:
|
||||
- rippled-telemetry
|
||||
|
||||
# Phase 8: Grafana Loki for centralized log ingestion and log-trace
|
||||
# correlation. Loki 3.x supports native OTLP ingestion, so the OTel
|
||||
# Collector exports via otlphttp to Loki's /otlp endpoint.
|
||||
# Query logs via Grafana Explore -> Loki at http://localhost:3000.
|
||||
loki:
|
||||
image: grafana/loki:3.4.2
|
||||
ports:
|
||||
- "3100:3100"
|
||||
command: -config.file=/etc/loki/local-config.yaml
|
||||
networks:
|
||||
- rippled-telemetry
|
||||
|
||||
prometheus:
|
||||
image: prom/prometheus:latest
|
||||
ports:
|
||||
@@ -86,6 +106,7 @@ services:
|
||||
- jaeger
|
||||
- tempo
|
||||
- prometheus
|
||||
- loki
|
||||
networks:
|
||||
- rippled-telemetry
|
||||
|
||||
|
||||
470
docker/telemetry/grafana/dashboards/statsd-network-traffic.json
Normal file
470
docker/telemetry/grafana/dashboards/statsd-network-traffic.json
Normal file
@@ -0,0 +1,470 @@
|
||||
{
|
||||
"annotations": { "list": [] },
|
||||
"description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Active Peers",
|
||||
"description": "Number of active inbound and outbound peer connections. Sourced from Peer_Finder.Active_Inbound_Peers and Peer_Finder.Active_Outbound_Peers gauges (PeerfinderManager.cpp:214-215). A healthy mainnet node typically has 10-21 outbound and 0-85 inbound peers depending on configuration.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Peer_Finder_Active_Inbound_Peers",
|
||||
"legendFormat": "Inbound Peers"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Peer_Finder_Active_Outbound_Peers",
|
||||
"legendFormat": "Outbound Peers"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Peers",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Peer Disconnects",
|
||||
"description": "Cumulative count of peer disconnections. Sourced from the Overlay.Peer_Disconnects gauge (OverlayImpl.h:557). A rising trend indicates network instability, aggressive peer management, or resource exhaustion causing connection drops.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_Overlay_Peer_Disconnects",
|
||||
"legendFormat": "Disconnects"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Disconnects",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Total Network Bytes",
|
||||
"description": "Total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Provides a high-level view of network bandwidth consumption.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Bytes_In",
|
||||
"legendFormat": "Bytes In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Bytes_Out",
|
||||
"legendFormat": "Bytes Out"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes",
|
||||
"custom": {
|
||||
"axisLabel": "Bytes",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Total Network Messages",
|
||||
"description": "Total messages sent and received across all peer connections. Sourced from the total.Messages_In and total.Messages_Out traffic category gauges (OverlayImpl.h:535-548). Shows the overall message throughput of the overlay network.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Messages_In",
|
||||
"legendFormat": "Messages In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_total_Messages_Out",
|
||||
"legendFormat": "Messages Out"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Transaction Traffic",
|
||||
"description": "Bytes and messages for transaction-related overlay traffic. Includes the transactions traffic category (OverlayImpl/TrafficCount.h). Spikes indicate high transaction volume on the network or transaction flooding.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_Messages_In",
|
||||
"legendFormat": "TX Messages In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_Messages_Out",
|
||||
"legendFormat": "TX Messages Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_transactions_duplicate_Messages_In",
|
||||
"legendFormat": "TX Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Proposal Traffic",
|
||||
"description": "Messages for consensus proposal overlay traffic. Includes proposals, proposals_untrusted, and proposals_duplicate categories (TrafficCount.h). High untrusted or duplicate counts may indicate UNL misconfiguration or network spam.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_Messages_In",
|
||||
"legendFormat": "Proposals In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_Messages_Out",
|
||||
"legendFormat": "Proposals Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_untrusted_Messages_In",
|
||||
"legendFormat": "Untrusted In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_proposals_duplicate_Messages_In",
|
||||
"legendFormat": "Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Validation Traffic",
|
||||
"description": "Messages for validation overlay traffic. Includes validations, validations_untrusted, and validations_duplicate categories (TrafficCount.h). Monitoring trusted vs untrusted validation traffic helps detect UNL health issues.",
|
||||
"type": "timeseries",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_Messages_In",
|
||||
"legendFormat": "Validations In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_Messages_Out",
|
||||
"legendFormat": "Validations Out"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_untrusted_Messages_In",
|
||||
"legendFormat": "Untrusted In"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "rippled_validations_duplicate_Messages_In",
|
||||
"legendFormat": "Duplicate In"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Messages",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Overlay Traffic by Category (Bytes In)",
|
||||
"description": "Top traffic categories by inbound bytes. Includes all 57 overlay traffic categories from TrafficCount.h. Shows which protocol message types consume the most bandwidth. Categories include transactions, proposals, validations, ledger data, getobject, and overlay overhead.",
|
||||
"type": "bargauge",
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
|
||||
"options": {
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus" },
|
||||
"expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
|
||||
"legendFormat": "{{__name__}}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes"
|
||||
},
|
||||
"overrides": [
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_transactions_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Transactions" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_proposals_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Proposals" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_validations_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Validations" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_overhead_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Overhead" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_overhead_overlay_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Overhead Overlay" }]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "rippled_ping_Bytes_In" },
|
||||
"properties": [{ "id": "displayName", "value": "Ping" }]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "rippled_status_Bytes_In" },
|
||||
"properties": [{ "id": "displayName", "value": "Status" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_getObject_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Get Object" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_haveTxSet_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Have Tx Set" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledgerData_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Data" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_share_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Share" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_get_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Ledger Data Get" }]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Ledger Data Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Account_State_Node_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Account State Node Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Account_State_Node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Account State Node Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Node_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Transaction Node Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Transaction Node Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Tx Set Candidate Get" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Account_State_node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{
|
||||
"id": "displayName",
|
||||
"value": "Account State Node Share (Legacy)"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Tx Set Candidate Share" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_ledger_Transaction_node_share_Bytes_In"
|
||||
},
|
||||
"properties": [
|
||||
{
|
||||
"id": "displayName",
|
||||
"value": "Transaction Node Share (Legacy)"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": {
|
||||
"id": "byName",
|
||||
"options": "rippled_set_get_Bytes_In"
|
||||
},
|
||||
"properties": [{ "id": "displayName", "value": "Set Get" }]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "network", "telemetry"],
|
||||
"templating": { "list": [] },
|
||||
"time": { "from": "now-1h", "to": "now" },
|
||||
"title": "rippled Network Traffic (StatsD)",
|
||||
"uid": "rippled-statsd-network"
|
||||
}
|
||||
415
docker/telemetry/grafana/dashboards/statsd-node-health.json
Normal file
415
docker/telemetry/grafana/dashboards/statsd-node-health.json
Normal file
@@ -0,0 +1,415 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Validated Ledger Age",
|
||||
"description": "Age of the most recently validated ledger in seconds. Sourced from the LedgerMaster.Validated_Ledger_Age gauge (LedgerMaster.h:373) which is updated every collection interval via the insight hook. Values above 20s indicate the node is falling behind the network.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_LedgerMaster_Validated_Ledger_Age",
|
||||
"legendFormat": "Validated Age"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 10
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 20
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Published Ledger Age",
|
||||
"description": "Age of the most recently published ledger in seconds. Sourced from the LedgerMaster.Published_Ledger_Age gauge (LedgerMaster.h:374). Published ledger age should track close to validated ledger age. A growing gap indicates publish pipeline backlog.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_LedgerMaster_Published_Ledger_Age",
|
||||
"legendFormat": "Published Age"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 10
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 20
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Operating Mode Duration",
|
||||
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). A healthy node should spend the vast majority of time in Full mode.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Full_duration",
|
||||
"legendFormat": "Full"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Tracking_duration",
|
||||
"legendFormat": "Tracking"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Syncing_duration",
|
||||
"legendFormat": "Syncing"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Connected_duration",
|
||||
"legendFormat": "Connected"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Disconnected_duration",
|
||||
"legendFormat": "Disconnected"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "s",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (Sec)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Operating Mode Transitions",
|
||||
"description": "Count of transitions into each operating mode. Sourced from State_Accounting.*_transitions gauges (NetworkOPs.cpp:780-786). Frequent transitions out of Full mode indicate instability. Transitions to Disconnected or Syncing warrant investigation.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Full_transitions",
|
||||
"legendFormat": "Full"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Tracking_transitions",
|
||||
"legendFormat": "Tracking"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Syncing_transitions",
|
||||
"legendFormat": "Syncing"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Connected_transitions",
|
||||
"legendFormat": "Connected"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_State_Accounting_Disconnected_transitions",
|
||||
"legendFormat": "Disconnected"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Transitions",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "I/O Latency",
|
||||
"description": "P95 and P50 of the I/O service loop latency in milliseconds. Sourced from the ios_latency event (Application.cpp:438) which measures how long it takes for the io_context to process a timer callback. Values above 10ms are logged; above 500ms trigger warnings. High values indicate thread pool saturation or blocking operations.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_ios_latency{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 I/O Latency"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_ios_latency{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 I/O Latency"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Job Queue Depth",
|
||||
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_job_count",
|
||||
"legendFormat": "Job Queue Depth"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"axisLabel": "Jobs",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Ledger Fetch Rate",
|
||||
"description": "Rate of ledger fetch requests initiated by the node. Sourced from the ledger_fetches counter (InboundLedgers.cpp:44) which increments each time the node requests a ledger from a peer. High rates indicate the node is catching up or missing ledgers.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_ledger_fetches_total[5m])",
|
||||
"legendFormat": "Fetches / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Ledger History Mismatches",
|
||||
"description": "Rate of ledger history hash mismatches. Sourced from the ledger.history.mismatch counter (LedgerHistory.cpp:16) which increments when a built ledger hash does not match the expected validated hash. Non-zero values indicate consensus divergence or database corruption.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_ledger_history_mismatch_total[5m])",
|
||||
"legendFormat": "Mismatches / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 0.01
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "node-health", "telemetry"],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"title": "rippled Node Health (StatsD)",
|
||||
"uid": "rippled-statsd-node-health"
|
||||
}
|
||||
396
docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
Normal file
396
docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
Normal file
@@ -0,0 +1,396 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": []
|
||||
},
|
||||
"description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"title": "RPC Request Rate (StatsD)",
|
||||
"description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_rpc_requests_total[5m])",
|
||||
"legendFormat": "Requests / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "reqps"
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Time (StatsD)",
|
||||
"description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. Compare with span-based rpc.request duration.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 0
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Response Time"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Response Time"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Size",
|
||||
"description": "P95 and P50 of RPC response payload size in bytes. Sourced from the rpc.size event (ServerHandler.cpp:109) which records the byte length of each RPC JSON response. Large responses may indicate expensive queries (e.g. account_tx with many results) or API misuse.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_size{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Response Size"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_size{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Response Size"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "decbytes",
|
||||
"custom": {
|
||||
"axisLabel": "Size (Bytes)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "RPC Response Time Distribution",
|
||||
"description": "Distribution of RPC response times from the beast::insight timer showing P50, P90, P95, and P99 quantiles. Sourced from the rpc.time event (ServerHandler.cpp:110). Useful for detecting bimodal latency or long-tail requests.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 8
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.9\"}",
|
||||
"legendFormat": "P90"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_rpc_time{quantile=\"0.99\"}",
|
||||
"legendFormat": "P99"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Latency (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Pathfinding Fast Duration",
|
||||
"description": "P95 and P50 of fast pathfinding execution time. Sourced from the pathfind_fast event (PathRequests.h:23) which records the duration of the fast pathfinding algorithm. Fast pathfinding uses a simplified search that trades accuracy for speed.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Fast Pathfind"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_fast{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Fast Pathfind"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Pathfinding Full Duration",
|
||||
"description": "P95 and P50 of full pathfinding execution time. Sourced from the pathfind_full event (PathRequests.h:24) which records the duration of the exhaustive pathfinding search. Full pathfinding is more expensive and can take significantly longer than fast mode.",
|
||||
"type": "timeseries",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 16
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.95\"}",
|
||||
"legendFormat": "P95 Full Pathfind"
|
||||
},
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rippled_pathfind_full{quantile=\"0.5\"}",
|
||||
"legendFormat": "P50 Full Pathfind"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"custom": {
|
||||
"axisLabel": "Duration (ms)",
|
||||
"spanNulls": true,
|
||||
"insertNulls": false,
|
||||
"showPoints": "auto",
|
||||
"pointSize": 3
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Resource Warnings Rate",
|
||||
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_warn_total[5m])",
|
||||
"legendFormat": "Warnings / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 0.1
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 1
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "Resource Drops Rate",
|
||||
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
|
||||
"type": "stat",
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 24
|
||||
},
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "multi",
|
||||
"sort": "desc"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {
|
||||
"type": "prometheus"
|
||||
},
|
||||
"expr": "rate(rippled_drop_total[5m])",
|
||||
"legendFormat": "Drops / Sec"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ops",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
},
|
||||
{
|
||||
"color": "yellow",
|
||||
"value": 0.01
|
||||
},
|
||||
{
|
||||
"color": "red",
|
||||
"value": 0.1
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 39,
|
||||
"tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-1h",
|
||||
"to": "now"
|
||||
},
|
||||
"title": "rippled RPC & Pathfinding (StatsD)",
|
||||
"uid": "rippled-statsd-rpc"
|
||||
}
|
||||
24
docker/telemetry/grafana/provisioning/datasources/loki.yaml
Normal file
24
docker/telemetry/grafana/provisioning/datasources/loki.yaml
Normal file
@@ -0,0 +1,24 @@
|
||||
# Grafana Loki data source provisioning for rippled log-trace correlation.
|
||||
#
|
||||
# Phase 8: Log-Trace Correlation and Centralized Log Ingestion
|
||||
#
|
||||
# Loki ingests rippled logs via OTel Collector's filelog receiver.
|
||||
# The derivedFields config links trace_id values in log lines back to
|
||||
# Tempo traces, enabling one-click log-to-trace navigation in Grafana.
|
||||
#
|
||||
# See: OpenTelemetryPlan/Phase8_taskList.md (Tasks 8.2, 8.4)
|
||||
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: Loki
|
||||
type: loki
|
||||
access: proxy
|
||||
url: http://loki:3100
|
||||
uid: loki
|
||||
jsonData:
|
||||
derivedFields:
|
||||
- datasourceUid: tempo
|
||||
matcherRegex: "trace_id=(\\w+)"
|
||||
name: TraceID
|
||||
url: "$${__value.raw}"
|
||||
@@ -23,6 +23,14 @@ datasources:
|
||||
enabled: true
|
||||
serviceMap:
|
||||
datasourceUid: prometheus
|
||||
# Phase 8: Trace-to-log correlation — enables one-click navigation
|
||||
# from a Tempo trace to the corresponding Loki log lines. Filters
|
||||
# by trace_id so only logs from the same trace are shown.
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
filterByTraceID: true
|
||||
filterBySpanID: false
|
||||
tags: ["partition", "severity"]
|
||||
tracesToMetrics:
|
||||
datasourceUid: prometheus
|
||||
spanStartTimeShift: "-1h"
|
||||
|
||||
@@ -62,6 +62,49 @@ check_span() {
|
||||
fi
|
||||
}
|
||||
|
||||
# Phase 8: Verify trace_id injection in rippled log output.
|
||||
# Greps all node debug.log files for the "trace_id=<hex> span_id=<hex>"
|
||||
# pattern that Logs::format() injects when an active OTel span exists.
|
||||
# Also cross-checks that a trace_id found in logs matches a trace in Jaeger.
|
||||
check_log_correlation() {
|
||||
log "Checking log-trace correlation..."
|
||||
|
||||
local total_matches=0
|
||||
local sample_trace_id=""
|
||||
|
||||
for i in $(seq 1 "$NUM_NODES"); do
|
||||
local logfile="$WORKDIR/node$i/debug.log"
|
||||
if [ ! -f "$logfile" ]; then
|
||||
continue
|
||||
fi
|
||||
local matches
|
||||
matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile" 2>/dev/null || echo 0)
|
||||
total_matches=$((total_matches + matches))
|
||||
# Capture the first trace_id we find for cross-referencing with Jaeger
|
||||
if [ -z "$sample_trace_id" ] && [ "$matches" -gt 0 ]; then
|
||||
sample_trace_id=$(grep -o 'trace_id=[a-f0-9]\{32\}' "$logfile" | head -1 | cut -d= -f2)
|
||||
fi
|
||||
done
|
||||
|
||||
if [ "$total_matches" -gt 0 ]; then
|
||||
ok "Log correlation: found $total_matches log lines with trace_id"
|
||||
else
|
||||
fail "Log correlation: no trace_id found in any node debug.log"
|
||||
fi
|
||||
|
||||
# Cross-check: verify the sample trace_id exists in Jaeger
|
||||
if [ -n "$sample_trace_id" ]; then
|
||||
local trace_found
|
||||
trace_found=$(curl -sf "$JAEGER/api/traces/$sample_trace_id" \
|
||||
| jq '.data | length' 2>/dev/null || echo 0)
|
||||
if [ "$trace_found" -gt 0 ]; then
|
||||
ok "Log-Jaeger cross-check: trace_id=$sample_trace_id found in Jaeger"
|
||||
else
|
||||
fail "Log-Jaeger cross-check: trace_id=$sample_trace_id NOT found in Jaeger"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
cleanup() {
|
||||
log "Cleaning up..."
|
||||
# Kill xrpld nodes
|
||||
@@ -507,6 +550,13 @@ log "--- Phase 5: Peer Spans (trace_peer=1) ---"
|
||||
check_span "peer.proposal.receive"
|
||||
check_span "peer.validation.receive"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 9b: Verify log-trace correlation (Phase 8)
|
||||
# ---------------------------------------------------------------------------
|
||||
log ""
|
||||
log "--- Phase 8: Log-Trace Correlation ---"
|
||||
check_log_correlation
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 10: Verify Prometheus spanmetrics
|
||||
# ---------------------------------------------------------------------------
|
||||
@@ -602,6 +652,7 @@ echo ""
|
||||
echo " Jaeger UI: http://localhost:16686"
|
||||
echo " Grafana: http://localhost:3000"
|
||||
echo " Prometheus: http://localhost:9090"
|
||||
echo " Loki: http://localhost:3100"
|
||||
echo ""
|
||||
echo " xrpld nodes (6) are running:"
|
||||
for i in $(seq 1 "$NUM_NODES"); do
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
# Pipelines:
|
||||
# traces: OTLP receiver -> batch processor -> debug + Jaeger + Tempo + spanmetrics
|
||||
# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter
|
||||
# logs: filelog receiver -> batch processor -> otlphttp/Loki (Phase 8)
|
||||
#
|
||||
# rippled sends traces via OTLP/HTTP to port 4318. The collector batches
|
||||
# them, forwards to both Jaeger and Tempo, and derives RED metrics via the
|
||||
@@ -17,6 +18,12 @@
|
||||
# but commented out. If you need StatsD fallback (server=statsd in
|
||||
# [insight]), uncomment the statsd receiver and add it to the metrics
|
||||
# pipeline receivers list.
|
||||
#
|
||||
# Phase 8: The filelog receiver tails rippled's debug.log files under
|
||||
# /var/log/rippled/ (mounted from the host). A regex_parser operator
|
||||
# extracts timestamp, partition, severity, and optional trace_id/span_id
|
||||
# fields injected by Logs::format(). Parsed logs are exported to Grafana
|
||||
# Loki for log-trace correlation.
|
||||
|
||||
receivers:
|
||||
otlp:
|
||||
@@ -41,6 +48,18 @@ receivers:
|
||||
# observer_type: "summary"
|
||||
# summary:
|
||||
# percentiles: [0, 50, 90, 95, 99, 100]
|
||||
# Phase 8: Filelog receiver tails rippled debug.log files for log-trace
|
||||
# correlation. Extracts structured fields (timestamp, partition, severity,
|
||||
# trace_id, span_id, message) via regex. The trace_id and span_id are
|
||||
# optional — only present when the log was emitted within an active span.
|
||||
filelog:
|
||||
include: [/var/log/rippled/*/debug.log]
|
||||
operators:
|
||||
- type: regex_parser
|
||||
regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
|
||||
timestamp:
|
||||
parse_from: attributes.timestamp
|
||||
layout: "%Y-%b-%d %H:%M:%S.%f"
|
||||
|
||||
processors:
|
||||
batch:
|
||||
@@ -75,6 +94,11 @@ exporters:
|
||||
endpoint: tempo:4317
|
||||
tls:
|
||||
insecure: true
|
||||
# Phase 8: Export logs to Grafana Loki via OTLP/HTTP. Loki 3.x supports
|
||||
# native OTLP ingestion on its /otlp endpoint, replacing the removed
|
||||
# loki exporter (dropped in otel-collector-contrib v0.147.0).
|
||||
otlphttp/loki:
|
||||
endpoint: http://loki:3100/otlp
|
||||
prometheus:
|
||||
endpoint: 0.0.0.0:8889
|
||||
|
||||
@@ -88,3 +112,9 @@ service:
|
||||
receivers: [otlp, spanmetrics]
|
||||
processors: [batch]
|
||||
exporters: [prometheus]
|
||||
# Phase 8: Log pipeline ingests rippled debug.log via filelog receiver,
|
||||
# batches entries, and exports to Loki for log-trace correlation.
|
||||
logs:
|
||||
receivers: [filelog]
|
||||
processors: [batch]
|
||||
exporters: [otlphttp/loki]
|
||||
|
||||
Reference in New Issue
Block a user