Phase 8: Log-trace correlation with Loki and filelog receiver

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-03-20 17:22:57 +00:00
parent 3c6d7f1c48
commit 767f423ec4
17 changed files with 2069 additions and 5 deletions

View File

@@ -525,6 +525,87 @@ Pre-configured datasources:
- **Jaeger**: Trace data at `http://jaeger:16686`
- **Tempo**: Trace data at `http://tempo:3200` (via Grafana Explore)
- **Prometheus**: Metrics at `http://prometheus:9090`
- **Loki**: Log data at `http://loki:3100` (via Grafana Explore)
---
## Test 3: Log-Trace Correlation (Phase 8)
Phase 8 injects `trace_id` and `span_id` into rippled's log output when
a log line is emitted within an active OTel span. This test verifies the
end-to-end log-trace correlation pipeline.
### Step 1: Verify trace_id in log output
After running Test 1 or Test 2 (which generate RPC spans), check the
rippled debug.log for trace context:
```bash
grep 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' /path/to/debug.log
```
Expected: log lines with `trace_id=<32hex> span_id=<16hex>` between the
severity code and the message. Example:
```
2024-01-15T10:30:45.123Z RPCHandler:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Calling server_info
```
Lines emitted outside of an active span (background tasks, startup) will
NOT have trace context — this is expected.
### Step 2: Cross-check trace_id in Jaeger
Extract a `trace_id` from the log and verify it exists in Jaeger:
```bash
TRACE_ID=$(grep -o 'trace_id=[a-f0-9]\{32\}' /path/to/debug.log | head -1 | cut -d= -f2)
echo "Checking trace: $TRACE_ID"
curl -s "http://localhost:16686/api/traces/$TRACE_ID" | jq '.data | length'
```
Expected result: `1` (the trace exists in Jaeger).
### Step 3: Verify Loki log ingestion
The OTel Collector's filelog receiver tails rippled's debug.log and
exports parsed entries to Loki. Verify Loki has received entries:
```bash
# Query Loki for any rippled logs
curl -sG "http://localhost:3100/loki/api/v1/query" \
--data-urlencode 'query={job="rippled"}' \
--data-urlencode 'limit=5' | jq '.data.result | length'
```
Expected: > 0 results.
### Step 4: Verify Grafana Tempo-to-Loki correlation
1. Open Grafana at http://localhost:3000
2. Navigate to **Explore** -> select **Tempo** datasource
3. Search for a trace (e.g., operation `rpc.command.server_info`)
4. Click **"Logs for this trace"** in the trace detail view
5. Verify that Loki log lines appear, filtered by the trace's `trace_id`
### Step 5: Verify Grafana Loki-to-Tempo correlation
1. In Grafana **Explore**, select **Loki** datasource
2. Query: `{job="rippled"} |= "trace_id="`
3. In the log results, click the **TraceID** derived field link
4. Verify it navigates to the full trace in Tempo
### Expected results
| Check | Expected |
| ------------------------------ | ---------------------------------------- |
| `trace_id=` in debug.log | Present in log lines within active spans |
| `span_id=` in debug.log | Present alongside trace_id |
| Logs without active span | No trace_id/span_id fields |
| trace_id in Jaeger | Matches a valid trace |
| Loki log ingestion | Logs visible via LogQL |
| Tempo -> Loki "Logs for trace" | Shows correlated log lines |
| Loki -> Tempo TraceID link | Navigates to correct trace |
---
@@ -567,6 +648,44 @@ Pre-configured datasources:
2. Check submit response for error codes
3. In standalone mode, remember to call `ledger_accept` after submitting
### No trace_id in log output (Phase 8)
1. Verify rippled was built with `telemetry=ON` (`-Dtelemetry=ON` in CMake)
2. Verify `enabled=1` in the `[telemetry]` config section
3. Log lines only contain trace context when emitted inside an active span.
Background logs (startup, periodic tasks outside spans) will not have
`trace_id`/`span_id`.
4. Ensure the trace category is enabled (e.g., `trace_rpc=1` for RPC logs)
### No logs in Loki (Phase 8)
1. Verify the log file mount in docker-compose.yml:
```yaml
volumes:
- /tmp/xrpld-integration:/var/log/rippled:ro
```
2. Check OTel Collector logs for filelog receiver errors:
```bash
docker compose -f docker/telemetry/docker-compose.yml logs otel-collector | grep -i "filelog\|loki\|error"
```
3. Verify Loki is running:
```bash
curl -s http://localhost:3100/ready
```
4. Verify the filelog receiver glob pattern matches your log files:
The default pattern is `/var/log/rippled/*/debug.log`
### Grafana trace-log links not working (Phase 8)
1. Verify `tracesToLogs` is configured in the Tempo datasource provisioning
(`docker/telemetry/grafana/provisioning/datasources/tempo.yaml`)
2. Verify `derivedFields` is configured in the Loki datasource provisioning
(`docker/telemetry/grafana/provisioning/datasources/loki.yaml`)
3. Restart Grafana after changing provisioning files:
```bash
docker compose -f docker/telemetry/docker-compose.yml restart grafana
```
### Spanmetrics not appearing in Prometheus
1. Verify otel-collector config has `spanmetrics` connector

View File

@@ -2,13 +2,16 @@
#
# Provides services for local development:
# - otel-collector: receives OTLP traces from rippled, batches and
# forwards them to Jaeger and Tempo. Listens on ports 4317 (gRPC)
# and 4318 (HTTP).
# forwards them to Jaeger and Tempo. Also tails rippled log files
# via filelog receiver and exports to Loki. Listens on ports
# 4317 (gRPC), 4318 (HTTP), and 8125 (StatsD UDP).
# - jaeger: all-in-one tracing backend with UI on port 16686.
# - tempo: Grafana Tempo tracing backend, queryable via Grafana Explore
# on port 3000. Recommended for production (S3/GCS storage, TraceQL).
# - loki: Grafana Loki log aggregation backend for centralized log
# ingestion and log-trace correlation (Phase 8).
# - grafana: dashboards on port 3000, pre-configured with Jaeger, Tempo,
# and Prometheus datasources.
# Prometheus, and Loki datasources.
#
# Usage:
# docker compose -f docker/telemetry/docker-compose.yml up -d
@@ -34,9 +37,14 @@ services:
# - "8125:8125/udp"
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
# Phase 8: Mount rippled log directory for filelog receiver.
# The integration test writes logs to /tmp/xrpld-integration/;
# mount it read-only so the collector can tail debug.log files.
- /tmp/xrpld-integration:/var/log/rippled:ro
depends_on:
- jaeger
- tempo
- loki
networks:
- rippled-telemetry
@@ -61,6 +69,18 @@ services:
networks:
- rippled-telemetry
# Phase 8: Grafana Loki for centralized log ingestion and log-trace
# correlation. Loki 3.x supports native OTLP ingestion, so the OTel
# Collector exports via otlphttp to Loki's /otlp endpoint.
# Query logs via Grafana Explore -> Loki at http://localhost:3000.
loki:
image: grafana/loki:3.4.2
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
networks:
- rippled-telemetry
prometheus:
image: prom/prometheus:latest
ports:
@@ -86,6 +106,7 @@ services:
- jaeger
- tempo
- prometheus
- loki
networks:
- rippled-telemetry

View File

@@ -0,0 +1,470 @@
{
"annotations": { "list": [] },
"description": "Network traffic and peer metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Active Peers",
"description": "Number of active inbound and outbound peer connections. Sourced from Peer_Finder.Active_Inbound_Peers and Peer_Finder.Active_Outbound_Peers gauges (PeerfinderManager.cpp:214-215). A healthy mainnet node typically has 10-21 outbound and 0-85 inbound peers depending on configuration.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_Peer_Finder_Active_Inbound_Peers",
"legendFormat": "Inbound Peers"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_Peer_Finder_Active_Outbound_Peers",
"legendFormat": "Outbound Peers"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Peers",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Peer Disconnects",
"description": "Cumulative count of peer disconnections. Sourced from the Overlay.Peer_Disconnects gauge (OverlayImpl.h:557). A rising trend indicates network instability, aggressive peer management, or resource exhaustion causing connection drops.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_Overlay_Peer_Disconnects",
"legendFormat": "Disconnects"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Disconnects",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Total Network Bytes",
"description": "Total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Provides a high-level view of network bandwidth consumption.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_total_Bytes_In",
"legendFormat": "Bytes In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_total_Bytes_Out",
"legendFormat": "Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Bytes",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Total Network Messages",
"description": "Total messages sent and received across all peer connections. Sourced from the total.Messages_In and total.Messages_Out traffic category gauges (OverlayImpl.h:535-548). Shows the overall message throughput of the overlay network.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_total_Messages_In",
"legendFormat": "Messages In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_total_Messages_Out",
"legendFormat": "Messages Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Transaction Traffic",
"description": "Bytes and messages for transaction-related overlay traffic. Includes the transactions traffic category (OverlayImpl/TrafficCount.h). Spikes indicate high transaction volume on the network or transaction flooding.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_transactions_Messages_In",
"legendFormat": "TX Messages In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_transactions_Messages_Out",
"legendFormat": "TX Messages Out"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_transactions_duplicate_Messages_In",
"legendFormat": "TX Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Proposal Traffic",
"description": "Messages for consensus proposal overlay traffic. Includes proposals, proposals_untrusted, and proposals_duplicate categories (TrafficCount.h). High untrusted or duplicate counts may indicate UNL misconfiguration or network spam.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_proposals_Messages_In",
"legendFormat": "Proposals In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_proposals_Messages_Out",
"legendFormat": "Proposals Out"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_proposals_untrusted_Messages_In",
"legendFormat": "Untrusted In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_proposals_duplicate_Messages_In",
"legendFormat": "Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation Traffic",
"description": "Messages for validation overlay traffic. Includes validations, validations_untrusted, and validations_duplicate categories (TrafficCount.h). Monitoring trusted vs untrusted validation traffic helps detect UNL health issues.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "rippled_validations_Messages_In",
"legendFormat": "Validations In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_validations_Messages_Out",
"legendFormat": "Validations Out"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_validations_untrusted_Messages_In",
"legendFormat": "Untrusted In"
},
{
"datasource": { "type": "prometheus" },
"expr": "rippled_validations_duplicate_Messages_In",
"legendFormat": "Duplicate In"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Messages",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Overlay Traffic by Category (Bytes In)",
"description": "Top traffic categories by inbound bytes. Includes all 57 overlay traffic categories from TrafficCount.h. Shows which protocol message types consume the most bandwidth. Categories include transactions, proposals, validations, ledger data, getobject, and overlay overhead.",
"type": "bargauge",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "topk(10, {__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"})",
"legendFormat": "{{__name__}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "rippled_transactions_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Transactions" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_proposals_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Proposals" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_validations_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Validations" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_overhead_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Overhead" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_overhead_overlay_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Overhead Overlay" }]
},
{
"matcher": { "id": "byName", "options": "rippled_ping_Bytes_In" },
"properties": [{ "id": "displayName", "value": "Ping" }]
},
{
"matcher": { "id": "byName", "options": "rippled_status_Bytes_In" },
"properties": [{ "id": "displayName", "value": "Status" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_getObject_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Get Object" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_haveTxSet_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Have Tx Set" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledgerData_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Ledger Data" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_share_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Ledger Share" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_get_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Ledger Data Get" }]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_share_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Ledger Data Share" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Account_State_Node_get_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Account State Node Get" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Account_State_Node_share_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Account State Node Share" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Node_get_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Transaction Node Get" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Node_share_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Transaction Node Share" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_data_Transaction_Set_candidate_get_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Tx Set Candidate Get" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Account_State_node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Account State Node Share (Legacy)"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Transaction_Set_candidate_share_Bytes_In"
},
"properties": [
{ "id": "displayName", "value": "Tx Set Candidate Share" }
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_ledger_Transaction_node_share_Bytes_In"
},
"properties": [
{
"id": "displayName",
"value": "Transaction Node Share (Legacy)"
}
]
},
{
"matcher": {
"id": "byName",
"options": "rippled_set_get_Bytes_In"
},
"properties": [{ "id": "displayName", "value": "Set Get" }]
}
]
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "network", "telemetry"],
"templating": { "list": [] },
"time": { "from": "now-1h", "to": "now" },
"title": "rippled Network Traffic (StatsD)",
"uid": "rippled-statsd-network"
}

View File

@@ -0,0 +1,415 @@
{
"annotations": {
"list": []
},
"description": "Node health metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "Validated Ledger Age",
"description": "Age of the most recently validated ledger in seconds. Sourced from the LedgerMaster.Validated_Ledger_Age gauge (LedgerMaster.h:373) which is updated every collection interval via the insight hook. Values above 20s indicate the node is falling behind the network.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Validated_Ledger_Age",
"legendFormat": "Validated Age"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 10
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
}
},
{
"title": "Published Ledger Age",
"description": "Age of the most recently published ledger in seconds. Sourced from the LedgerMaster.Published_Ledger_Age gauge (LedgerMaster.h:374). Published ledger age should track close to validated ledger age. A growing gap indicates publish pipeline backlog.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_LedgerMaster_Published_Ledger_Age",
"legendFormat": "Published Age"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 10
},
{
"color": "red",
"value": 20
}
]
}
},
"overrides": []
}
},
{
"title": "Operating Mode Duration",
"description": "Cumulative time spent in each operating mode (Disconnected, Connected, Syncing, Tracking, Full). Sourced from State_Accounting.*_duration gauges (NetworkOPs.cpp:774-778). A healthy node should spend the vast majority of time in Full mode.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_duration",
"legendFormat": "Full"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_duration",
"legendFormat": "Tracking"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_duration",
"legendFormat": "Syncing"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_duration",
"legendFormat": "Connected"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_duration",
"legendFormat": "Disconnected"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"axisLabel": "Duration (Sec)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Operating Mode Transitions",
"description": "Count of transitions into each operating mode. Sourced from State_Accounting.*_transitions gauges (NetworkOPs.cpp:780-786). Frequent transitions out of Full mode indicate instability. Transitions to Disconnected or Syncing warrant investigation.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Full_transitions",
"legendFormat": "Full"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Tracking_transitions",
"legendFormat": "Tracking"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Syncing_transitions",
"legendFormat": "Syncing"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Connected_transitions",
"legendFormat": "Connected"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_State_Accounting_Disconnected_transitions",
"legendFormat": "Disconnected"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Transitions",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "I/O Latency",
"description": "P95 and P50 of the I/O service loop latency in milliseconds. Sourced from the ios_latency event (Application.cpp:438) which measures how long it takes for the io_context to process a timer callback. Values above 10ms are logged; above 500ms trigger warnings. High values indicate thread pool saturation or blocking operations.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{quantile=\"0.95\"}",
"legendFormat": "P95 I/O Latency"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_ios_latency{quantile=\"0.5\"}",
"legendFormat": "P50 I/O Latency"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Job Queue Depth",
"description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_job_count",
"legendFormat": "Job Queue Depth"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"axisLabel": "Jobs",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Fetch Rate",
"description": "Rate of ledger fetch requests initiated by the node. Sourced from the ledger_fetches counter (InboundLedgers.cpp:44) which increments each time the node requests a ledger from a peer. High rates indicate the node is catching up or missing ledgers.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_ledger_fetches_total[5m])",
"legendFormat": "Fetches / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
},
"overrides": []
}
},
{
"title": "Ledger History Mismatches",
"description": "Rate of ledger history hash mismatches. Sourced from the ledger.history.mismatch counter (LedgerHistory.cpp:16) which increments when a built ledger hash does not match the expected validated hash. Non-zero values indicate consensus divergence or database corruption.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_ledger_history_mismatch_total[5m])",
"legendFormat": "Mismatches / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 0.01
}
]
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "node-health", "telemetry"],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "rippled Node Health (StatsD)",
"uid": "rippled-statsd-node-health"
}

View File

@@ -0,0 +1,396 @@
{
"annotations": {
"list": []
},
"description": "RPC and pathfinding metrics from beast::insight StatsD. Requires [insight] server=statsd in rippled config.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"title": "RPC Request Rate (StatsD)",
"description": "Rate of RPC requests as counted by the beast::insight counter. Sourced from rpc.requests (ServerHandler.cpp:108) which increments on every HTTP and WebSocket RPC request. Compare with the span-based rpc.request rate in the RPC Performance dashboard for cross-validation.",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_rpc_requests_total[5m])",
"legendFormat": "Requests / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
},
"overrides": []
}
},
{
"title": "RPC Response Time (StatsD)",
"description": "P95 and P50 of RPC response time from the beast::insight timer. Sourced from the rpc.time event (ServerHandler.cpp:110) which records elapsed milliseconds for each RPC response. This measures the full HTTP handler time, not just command execution. Compare with span-based rpc.request duration.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
"legendFormat": "P95 Response Time"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
"legendFormat": "P50 Response Time"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Response Size",
"description": "P95 and P50 of RPC response payload size in bytes. Sourced from the rpc.size event (ServerHandler.cpp:109) which records the byte length of each RPC JSON response. Large responses may indicate expensive queries (e.g. account_tx with many results) or API misuse.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{quantile=\"0.95\"}",
"legendFormat": "P95 Response Size"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_size{quantile=\"0.5\"}",
"legendFormat": "P50 Response Size"
}
],
"fieldConfig": {
"defaults": {
"unit": "decbytes",
"custom": {
"axisLabel": "Size (Bytes)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "RPC Response Time Distribution",
"description": "Distribution of RPC response times from the beast::insight timer showing P50, P90, P95, and P99 quantiles. Sourced from the rpc.time event (ServerHandler.cpp:110). Useful for detecting bimodal latency or long-tail requests.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.5\"}",
"legendFormat": "P50"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.9\"}",
"legendFormat": "P90"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.95\"}",
"legendFormat": "P95"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_rpc_time{quantile=\"0.99\"}",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Latency (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Pathfinding Fast Duration",
"description": "P95 and P50 of fast pathfinding execution time. Sourced from the pathfind_fast event (PathRequests.h:23) which records the duration of the fast pathfinding algorithm. Fast pathfinding uses a simplified search that trades accuracy for speed.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_fast{quantile=\"0.95\"}",
"legendFormat": "P95 Fast Pathfind"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_fast{quantile=\"0.5\"}",
"legendFormat": "P50 Fast Pathfind"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Pathfinding Full Duration",
"description": "P95 and P50 of full pathfinding execution time. Sourced from the pathfind_full event (PathRequests.h:24) which records the duration of the exhaustive pathfinding search. Full pathfinding is more expensive and can take significantly longer than fast mode.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_full{quantile=\"0.95\"}",
"legendFormat": "P95 Full Pathfind"
},
{
"datasource": {
"type": "prometheus"
},
"expr": "rippled_pathfind_full{quantile=\"0.5\"}",
"legendFormat": "P50 Full Pathfind"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Resource Warnings Rate",
"description": "Rate of resource warning events from the Resource Manager. Sourced from the warn meter (Logic.h:33) which increments when a consumer (peer or RPC client) exceeds the warning threshold for resource usage. A rising rate indicates aggressive clients that may need throttling. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_warn_total[5m])",
"legendFormat": "Warnings / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.1
},
{
"color": "red",
"value": 1
}
]
}
},
"overrides": []
}
},
{
"title": "Resource Drops Rate",
"description": "Rate of resource drop events from the Resource Manager. Sourced from the drop meter (Logic.h:34) which increments when a consumer is disconnected or blocked due to excessive resource usage. Non-zero values mean the node is actively rejecting abusive connections. NOTE: This panel will show no data until the |m -> |c fix is applied in StatsDCollector.cpp:706 (Phase 6 Task 6.1).",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 24
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "rate(rippled_drop_total[5m])",
"legendFormat": "Drops / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.01
},
{
"color": "red",
"value": 0.1
}
]
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
"tags": ["rippled", "statsd", "rpc", "pathfinding", "telemetry"],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"title": "rippled RPC & Pathfinding (StatsD)",
"uid": "rippled-statsd-rpc"
}

View File

@@ -0,0 +1,24 @@
# Grafana Loki data source provisioning for rippled log-trace correlation.
#
# Phase 8: Log-Trace Correlation and Centralized Log Ingestion
#
# Loki ingests rippled logs via OTel Collector's filelog receiver.
# The derivedFields config links trace_id values in log lines back to
# Tempo traces, enabling one-click log-to-trace navigation in Grafana.
#
# See: OpenTelemetryPlan/Phase8_taskList.md (Tasks 8.2, 8.4)
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
uid: loki
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"

View File

@@ -23,6 +23,14 @@ datasources:
enabled: true
serviceMap:
datasourceUid: prometheus
# Phase 8: Trace-to-log correlation — enables one-click navigation
# from a Tempo trace to the corresponding Loki log lines. Filters
# by trace_id so only logs from the same trace are shown.
tracesToLogs:
datasourceUid: loki
filterByTraceID: true
filterBySpanID: false
tags: ["partition", "severity"]
tracesToMetrics:
datasourceUid: prometheus
spanStartTimeShift: "-1h"

View File

@@ -62,6 +62,49 @@ check_span() {
fi
}
# Phase 8: Verify trace_id injection in rippled log output.
# Greps all node debug.log files for the "trace_id=<hex> span_id=<hex>"
# pattern that Logs::format() injects when an active OTel span exists.
# Also cross-checks that a trace_id found in logs matches a trace in Jaeger.
check_log_correlation() {
log "Checking log-trace correlation..."
local total_matches=0
local sample_trace_id=""
for i in $(seq 1 "$NUM_NODES"); do
local logfile="$WORKDIR/node$i/debug.log"
if [ ! -f "$logfile" ]; then
continue
fi
local matches
matches=$(grep -c 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' "$logfile" 2>/dev/null || echo 0)
total_matches=$((total_matches + matches))
# Capture the first trace_id we find for cross-referencing with Jaeger
if [ -z "$sample_trace_id" ] && [ "$matches" -gt 0 ]; then
sample_trace_id=$(grep -o 'trace_id=[a-f0-9]\{32\}' "$logfile" | head -1 | cut -d= -f2)
fi
done
if [ "$total_matches" -gt 0 ]; then
ok "Log correlation: found $total_matches log lines with trace_id"
else
fail "Log correlation: no trace_id found in any node debug.log"
fi
# Cross-check: verify the sample trace_id exists in Jaeger
if [ -n "$sample_trace_id" ]; then
local trace_found
trace_found=$(curl -sf "$JAEGER/api/traces/$sample_trace_id" \
| jq '.data | length' 2>/dev/null || echo 0)
if [ "$trace_found" -gt 0 ]; then
ok "Log-Jaeger cross-check: trace_id=$sample_trace_id found in Jaeger"
else
fail "Log-Jaeger cross-check: trace_id=$sample_trace_id NOT found in Jaeger"
fi
fi
}
cleanup() {
log "Cleaning up..."
# Kill xrpld nodes
@@ -507,6 +550,13 @@ log "--- Phase 5: Peer Spans (trace_peer=1) ---"
check_span "peer.proposal.receive"
check_span "peer.validation.receive"
# ---------------------------------------------------------------------------
# Step 9b: Verify log-trace correlation (Phase 8)
# ---------------------------------------------------------------------------
log ""
log "--- Phase 8: Log-Trace Correlation ---"
check_log_correlation
# ---------------------------------------------------------------------------
# Step 10: Verify Prometheus spanmetrics
# ---------------------------------------------------------------------------
@@ -602,6 +652,7 @@ echo ""
echo " Jaeger UI: http://localhost:16686"
echo " Grafana: http://localhost:3000"
echo " Prometheus: http://localhost:9090"
echo " Loki: http://localhost:3100"
echo ""
echo " xrpld nodes (6) are running:"
for i in $(seq 1 "$NUM_NODES"); do

View File

@@ -3,6 +3,7 @@
# Pipelines:
# traces: OTLP receiver -> batch processor -> debug + Jaeger + Tempo + spanmetrics
# metrics: OTLP receiver + spanmetrics connector -> Prometheus exporter
# logs: filelog receiver -> batch processor -> otlphttp/Loki (Phase 8)
#
# rippled sends traces via OTLP/HTTP to port 4318. The collector batches
# them, forwards to both Jaeger and Tempo, and derives RED metrics via the
@@ -17,6 +18,12 @@
# but commented out. If you need StatsD fallback (server=statsd in
# [insight]), uncomment the statsd receiver and add it to the metrics
# pipeline receivers list.
#
# Phase 8: The filelog receiver tails rippled's debug.log files under
# /var/log/rippled/ (mounted from the host). A regex_parser operator
# extracts timestamp, partition, severity, and optional trace_id/span_id
# fields injected by Logs::format(). Parsed logs are exported to Grafana
# Loki for log-trace correlation.
receivers:
otlp:
@@ -41,6 +48,18 @@ receivers:
# observer_type: "summary"
# summary:
# percentiles: [0, 50, 90, 95, 99, 100]
# Phase 8: Filelog receiver tails rippled debug.log files for log-trace
# correlation. Extracts structured fields (timestamp, partition, severity,
# trace_id, span_id, message) via regex. The trace_id and span_id are
# optional — only present when the log was emitted within an active span.
filelog:
include: [/var/log/rippled/*/debug.log]
operators:
- type: regex_parser
regex: '^(?P<timestamp>\S+)\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
timestamp:
parse_from: attributes.timestamp
layout: "%Y-%b-%d %H:%M:%S.%f"
processors:
batch:
@@ -75,6 +94,11 @@ exporters:
endpoint: tempo:4317
tls:
insecure: true
# Phase 8: Export logs to Grafana Loki via OTLP/HTTP. Loki 3.x supports
# native OTLP ingestion on its /otlp endpoint, replacing the removed
# loki exporter (dropped in otel-collector-contrib v0.147.0).
otlphttp/loki:
endpoint: http://loki:3100/otlp
prometheus:
endpoint: 0.0.0.0:8889
@@ -88,3 +112,9 @@ service:
receivers: [otlp, spanmetrics]
processors: [batch]
exporters: [prometheus]
# Phase 8: Log pipeline ingests rippled debug.log via filelog receiver,
# batches entries, and exports to Loki for log-trace correlation.
logs:
receivers: [filelog]
processors: [batch]
exporters: [otlphttp/loki]