Files
rippled/docs/telemetry-runbook.md
2026-03-31 16:39:39 +01:00

294 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# rippled Telemetry Operator Runbook
## Overview
rippled supports OpenTelemetry distributed tracing to provide visibility into RPC requests, transaction processing, and consensus rounds.
## Quick Start
### 1. Start the observability stack
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
This starts:
- **OTel Collector** on ports 4317 (gRPC) and 4318 (HTTP)
- **Jaeger** UI on http://localhost:16686
- **Prometheus** on http://localhost:9090
- **Grafana** on http://localhost:3000
### 2. Enable telemetry in rippled
Add to your `xrpld.cfg`:
```ini
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```
### 3. Build with telemetry support
```bash
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default
```
## Configuration Reference
| Option | Default | Description |
| -------------------- | --------------------------------- | ----------------------------------------- |
| `enabled` | `0` | Master switch for telemetry |
| `endpoint` | `http://localhost:4318/v1/traces` | OTLP/HTTP endpoint |
| `exporter` | `otlp_http` | Exporter type |
| `sampling_ratio` | `1.0` | Head-based sampling ratio (0.01.0) |
| `trace_rpc` | `1` | Enable RPC request tracing |
| `trace_transactions` | `1` | Enable transaction tracing |
| `trace_consensus` | `1` | Enable consensus tracing |
| `trace_peer` | `0` | Enable peer message tracing (high volume) |
| `trace_ledger` | `1` | Enable ledger tracing |
| `batch_size` | `512` | Max spans per batch export |
| `batch_delay_ms` | `5000` | Delay between batch exports |
| `max_queue_size` | `2048` | Max spans queued before dropping |
| `use_tls` | `0` | Use TLS for exporter connection |
| `tls_ca_cert` | (empty) | Path to CA certificate bundle |
## Span Reference
All spans instrumented in rippled, grouped by subsystem:
### RPC Spans (Phase 2)
| Span Name | Source File | Attributes | Description |
| -------------------- | --------------------- | ------------------------------------------------------- | -------------------------------------------------- |
| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request |
| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) |
| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message |
| `rpc.command.<name>` | RPCHandler.cpp:161 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Per-command span (e.g., `rpc.command.server_info`) |
### Transaction Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------ | ------------------- | ----------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Transaction set applied per ledger |
### Consensus Spans (Phase 4)
| Span Name | Source File | Attributes | Description |
| --------------------------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | `xrpl.consensus.round` | Consensus proposal broadcast |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted by consensus |
| `consensus.validation.send` | RCLConsensus.cpp:753 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent after accept |
| `consensus.accept.apply` | RCLConsensus.cpp:453 | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq` | Ledger application with close time details |
#### Close Time Queries (Tempo TraceQL)
```
# Find rounds where validators disagreed on close time
{name="consensus.accept.apply"} | xrpl.consensus.close_time_correct = false
# Find consensus failures (moved_on)
{name="consensus.accept.apply"} | xrpl.consensus.state = "moved_on"
# Find slow ledger applications (>5s)
{name="consensus.accept.apply"} | duration > 5s
# Find specific ledger's consensus details
{name="consensus.accept.apply"} | xrpl.consensus.ledger.seq = 92345678
```
### Ledger Spans (Phase 5)
| Span Name | Source File | Attributes | Description |
| ----------------- | -------------------- | -------------------------------------------- | ----------------------------- |
| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq` | Ledger build during consensus |
| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger promoted to validated |
| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history |
### Peer Spans (Phase 5)
| Span Name | Source File | Attributes | Description |
| ------------------------- | ---------------- | ---------------------------------------------- | ----------------------------- |
| `peer.proposal.receive` | PeerImp.cpp:1667 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Proposal received from peer |
| `peer.validation.receive` | PeerImp.cpp:2264 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Validation received from peer |
## Prometheus Metrics (Spanmetrics)
The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in rippled.
### Generated Metric Names
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ---------------------------- |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution buckets |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Latency observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
### Metric Labels
Every metric carries these standard labels:
| Label | Source | Example |
| -------------- | ------------------ | ---------------------------------------- |
| `span_name` | Span name | `rpc.command.server_info` |
| `status_code` | Span status | `STATUS_CODE_UNSET`, `STATUS_CODE_ERROR` |
| `service_name` | Resource attribute | `rippled` |
| `span_kind` | Span kind | `SPAN_KIND_INTERNAL` |
Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):
| Span Attribute | Metric Label | Applies To |
| ------------------------------ | ------------------------------ | ------------------------------- |
| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` spans |
| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` spans |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` spans |
| `xrpl.peer.proposal.trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` spans |
| `xrpl.peer.validation.trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` spans |
### Histogram Buckets
Configured in `otel-collector-config.yaml`:
```
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```
## Grafana Dashboards
Five dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
### RPC Performance (`rippled-rpc-perf`)
| Panel | Type | PromQL | Labels Used |
| --------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| RPC Request Rate by Command | timeseries | `sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))` | `xrpl_rpc_command` |
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |
| Overall RPC Throughput | timeseries | `rpc.request` + `rpc.process` rate | — |
| RPC Success vs Error | timeseries | by `status_code` (UNSET vs ERROR) | `status_code` |
| Top Commands by Volume | bargauge | `topk(10, ...)` by `xrpl_rpc_command` | `xrpl_rpc_command` |
| WebSocket Message Rate | stat | `rpc.ws_message` rate | — |
### Transaction Overview (`rippled-transactions`)
| Panel | Type | PromQL | Labels Used |
| --------------------------------- | ---------- | -------------------------------------------------------------------------------------------- | --------------- |
| Transaction Processing Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])` and `tx.receive` | `span_name` |
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="tx.process"})` | — |
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | — |
| TX Processing Duration Heatmap | heatmap | `tx.process` histogram buckets | `le` |
| TX Apply Duration per Ledger | timeseries | p95/p50 of `tx.apply` | — |
| Peer TX Receive Rate | timeseries | `tx.receive` rate | — |
| TX Apply Failed Rate | stat | `tx.apply` with `STATUS_CODE_ERROR` | `status_code` |
### Consensus Health (`rippled-consensus`)
| Panel | Type | PromQL | Labels Used |
| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | --------------------- |
| Consensus Round Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})` | — |
| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | — |
| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | — |
| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | — |
| Ledger Apply Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept.apply"})` | — |
| Close Time Agreement | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.accept.apply"}[5m])` | — |
| Consensus Mode Over Time | timeseries | `consensus.ledger_close` by `xrpl_consensus_mode` | `xrpl_consensus_mode` |
| Accept vs Close Rate | timeseries | `consensus.accept` vs `consensus.ledger_close` rate | — |
| Validation vs Close Rate | timeseries | `consensus.validation.send` vs `consensus.ledger_close` | — |
| Accept Duration Heatmap | heatmap | `consensus.accept` histogram buckets | `le` |
### Ledger Operations (`rippled-ledger-ops`)
| Panel | Type | PromQL | Labels Used |
| ----------------------- | ---------- | ---------------------------------------------- | ----------- |
| Ledger Build Rate | stat | `ledger.build` call rate | — |
| Ledger Build Duration | timeseries | p95/p50 of `ledger.build` | — |
| Ledger Validation Rate | stat | `ledger.validate` call rate | — |
| Build Duration Heatmap | heatmap | `ledger.build` histogram buckets | `le` |
| TX Apply Duration | timeseries | p95/p50 of `tx.apply` | — |
| TX Apply Rate | timeseries | `tx.apply` call rate | — |
| Ledger Store Rate | stat | `ledger.store` call rate | — |
| Build vs Close Duration | timeseries | p95 `ledger.build` vs `consensus.ledger_close` | — |
### Peer Network (`rippled-peer-net`)
Requires `trace_peer=1` in the `[telemetry]` config section.
| Panel | Type | PromQL | Labels Used |
| -------------------------------- | ---------- | --------------------------------- | ------------------------------ |
| Proposal Receive Rate | timeseries | `peer.proposal.receive` rate | — |
| Validation Receive Rate | timeseries | `peer.validation.receive` rate | — |
| Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` |
| Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |
### Span → Metric → Dashboard Summary
| Span Name | Prometheus Metric Filter | Grafana Dashboard |
| --------------------------- | ----------------------------------------- | --------------------------------------------- |
| `rpc.request` | `{span_name="rpc.request"}` | RPC Performance (Overall Throughput) |
| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |
| `tx.apply` | `{span_name="tx.apply"}` | Transaction Overview + Ledger Ops (Apply) |
| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Duration, Rate, Heatmap) |
| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close, Mode) |
| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
| `consensus.accept.apply` | `{span_name="consensus.accept.apply"}` | Consensus Health (Apply Duration, Close Time) |
| `ledger.build` | `{span_name="ledger.build"}` | Ledger Ops (Build Rate, Duration, Heatmap) |
| `ledger.validate` | `{span_name="ledger.validate"}` | Ledger Ops (Validation Rate) |
| `ledger.store` | `{span_name="ledger.store"}` | Ledger Ops (Store Rate) |
| `peer.proposal.receive` | `{span_name="peer.proposal.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
| `peer.validation.receive` | `{span_name="peer.validation.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
## Troubleshooting
### No traces appearing in Jaeger
1. Check rippled logs for `Telemetry starting` message
2. Verify `enabled=1` in the `[telemetry]` config section
3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces`
4. Check collector logs: `docker compose logs otel-collector`
### High memory usage
- Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)
- Reduce `max_queue_size` and `batch_size`
- Disable high-volume trace categories: `trace_peer=0`
### Collector connection failures
- Verify endpoint URL matches collector address
- Check firewall rules for ports 4317/4318
- If using TLS, verify certificate path with `tls_ca_cert`
## Performance Tuning
| Scenario | Recommendation |
| ------------------------ | ------------------------------------------------- |
| Production mainnet | `sampling_ratio=0.01`, `trace_peer=0` |
| Testnet/devnet | `sampling_ratio=1.0` (full tracing) |
| Debugging specific issue | `sampling_ratio=1.0` temporarily |
| High-throughput node | Increase `batch_size=1024`, `max_queue_size=4096` |
## Disabling Telemetry
Set `enabled=0` in config (runtime disable) or build without the flag:
```bash
cmake --preset default -Dtelemetry=OFF
```
When telemetry is compiled out, all trace macros expand to no-ops with zero overhead.