# OpenTelemetry Observability for xrpld
> Status: Phases 1-9 shipped. Traces, metrics, logs all live via OTel.
---
## Slide 1: Introduction
> **CNCF** = Cloud Native Computing Foundation | **OTel** = OpenTelemetry
### What is OpenTelemetry?
CNCF-backed, vendor-neutral framework for **traces, metrics, and logs** with a single SDK and wire protocol (OTLP).
### Why OTel for xrpld?
- **End-to-end TX visibility** — submission → consensus → ledger inclusion
- **Cross-node correlation** — shared `trace_id` stitches hops without a central coordinator
- **Consensus round analysis** — phase timing across validators
- **Incident debugging** — correlated traces, metrics, and logs in one query
```mermaid
flowchart LR
A["Node A<br/>tx.receive<br/>trace_id: abc123"] --> B["Node B<br/>tx.relay<br/>trace_id: abc123"] --> C["Node C<br/>tx.validate<br/>trace_id: abc123"] --> D["Node D<br/>ledger.apply<br/>trace_id: abc123"]
style A fill:#1565c0,stroke:#0d47a1,color:#fff
style B fill:#2e7d32,stroke:#1b5e20,color:#fff
style C fill:#2e7d32,stroke:#1b5e20,color:#fff
style D fill:#e65100,stroke:#bf360c,color:#fff
```
> One trace, four nodes, full lifecycle.
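The stitching in the diagram can be sketched in a few lines of Python. The span records and field names (`node`, `name`, `trace_id`, `start_ms`) are illustrative assumptions, not xrpld's export schema, and a real backend such as Tempo orders spans by parent links rather than by wall-clock time.

```python
from collections import defaultdict

# Hypothetical span records as they might arrive from four nodes.
spans = [
    {"node": "A", "name": "tx.receive", "trace_id": "abc123", "start_ms": 0},
    {"node": "C", "name": "tx.validate", "trace_id": "abc123", "start_ms": 42},
    {"node": "B", "name": "tx.relay", "trace_id": "abc123", "start_ms": 15},
    {"node": "D", "name": "ledger.apply", "trace_id": "abc123", "start_ms": 3900},
    {"node": "B", "name": "tx.receive", "trace_id": "def456", "start_ms": 7},
]


def stitch(spans):
    """Group spans by trace_id and order each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return {
        trace_id: sorted(group, key=lambda s: s["start_ms"])
        for trace_id, group in traces.items()
    }


traces = stitch(spans)
path = [f'{s["node"]}:{s["name"]}' for s in traces["abc123"]]
print(" -> ".join(path))
# prints: A:tx.receive -> B:tx.relay -> C:tx.validate -> D:ledger.apply
```

No node needs to know about the others; the shared key alone is enough to reassemble the journey after the fact.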
---
## Slide 2: Old Stack vs New OTel Stack
### Side-by-Side
| Aspect | Before (StatsD + Debug Logs) | After (OTel: Traces + Metrics + Logs) |
| ------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Metrics** | Beast Insight → StatsD UDP → Graphite | `MetricsRegistry` → OTLP/HTTP → Prometheus |
| **Metric inventory** | **~250 metric series** at runtime (28 registrations × overlay traffic categories) | **23 native instruments** × dimensions + RED via spanmetrics |
| **Logs** | `beast::Journal` → `debug.log` (grep / tail) | Journal → filelog tail → Loki (structured, queryable) |
| **Traces** | None | Telemetry SDK → OTLP → Tempo (cross-node) |
| **Correlation** | Timestamp + grep across files | Shared `trace_id` across all 3 signals |
| **Format** | Counter/gauge names; free-form log lines | OTLP protobuf; structured records |
| **Backend choice** | Locked to StatsD daemon + log files | Vendor-neutral via Collector exporters |
| **Cross-node view** | ❌ Not possible | ✅ Native via trace context propagation |
| **Histogram p50/p95/p99** | ❌ Counters/gauges only | ✅ Native histograms + spanmetrics |
### Legacy StatsD Metric Series (~250 total)
| Category | Series | Notes |
| --------------------------- | -------- | ----------------------------------------------------------------------------------- |
| **Overlay traffic gauges** | ~224 | 56 `TrafficCount::category` enum × 4 gauges (`Bytes_{In,Out}`, `Messages_{In,Out}`) |
| **Peer Finder** | 2 | `Active_{In,Out}bound_Peers` |
| **State Accounting** | 10 | `{Disconnected,Connected,Syncing,Tracking,Full}_{duration,transitions}` |
| **Ledger** | 4 | `Validated/Published_Ledger_Age`, `mismatch`, `ledger_fetches` |
| **RPC / Pathfinding** | 5 | `requests`, `size`, `time`, `pathfind_{fast,full}` |
| **JobQueue / IO / Disconn** | 3 | `job_count`, `ios_latency`, `Peer_Disconnects` |
| **Total** | **~248** | 28 `make_*` call sites; series count balloons via overlay-category fan-out |
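The fan-out arithmetic behind the total can be checked directly. This sketch only restates the counts from the table above; the variable names are illustrative.

```python
# Series counts copied from the table above.
TRAFFIC_CATEGORIES = 56   # TrafficCount::category enum values
GAUGES_PER_CATEGORY = 4   # Bytes_In, Bytes_Out, Messages_In, Messages_Out

other_series = {
    "Peer Finder": 2,
    "State Accounting": 10,
    "Ledger": 4,
    "RPC / Pathfinding": 5,
    "JobQueue / IO / Disconn": 3,
}

overlay_series = TRAFFIC_CATEGORIES * GAUGES_PER_CATEGORY
total_series = overlay_series + sum(other_series.values())
overlay_share = overlay_series / total_series

print(overlay_series, total_series, f"{overlay_share:.0%}")
# prints: 224 248 90%
```

Roughly nine in ten legacy series come from the overlay traffic fan-out alone.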
### Use Case Matrix
| Scenario | StatsD | Debug Logs | OTel Traces | OTel Metrics | OTel Logs |
| ---------------------------------- | ------ | ---------- | ----------- | ------------ | --------- |
| "TXs per second?" | ✅ | ❌ | ❌ | ✅ | ❌ |
| "Why was this specific TX slow?" | ❌ | ⚠️ | ✅ | ❌ | ⚠️ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ | ❌ | ❌ |
| "TX journey across 5 nodes" | ❌ | ❌ | ✅ | ❌ | ❌ |
| "Validator error at 14:02" | ❌ | ✅ | ⚠️ | ❌ | ✅ |
| "Reproduce rare assertion / crash" | ❌ | ✅ | ❌ | ❌ | ✅ |
| "p99 RPC latency by method" | ⚠️ | ❌ | ⚠️ | ✅ | ❌ |
> Old stack: 2 signals, no correlation, single node. New stack: 3 signals, `trace_id` everywhere, cross-node native.
---
## Slide 3: OTel vs Open-Source Alternatives
| Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus |
| ------------------- | --------------- | ------------- | --------------- | ---------- | ---------- | ---------- |
| **Tracing** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| **Metrics** | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
| **Logs** | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
| **C++ SDK** | ✅ Official | ⚠️ Deprecated | ⚠️ Unmaintained | ❌ | ❌ | ✅ |
| **Vendor neutral** | ✅ Primary goal | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Instrumentation** | Manual + Auto | Manual | Manual | Auto-first | Auto-first | Manual |
| **Backend** | Any (exporters) | Self | Self | Self | Self | Self |
| **CNCF Status**     | Incubating      | Graduated     | —               | — (Apache) | —          | Graduated  |
> Only actively maintained, full-signal C++ option. Backend-agnostic — Tempo/Prometheus/Loki/Elastic/commercial all work without code change.
---
## Slide 4: Architecture (Current)
> **OTLP** = OpenTelemetry Protocol over HTTP/gRPC
```mermaid
flowchart TB
subgraph xrpld["xrpld Node"]
direction TB
Surfaces["RPC · TX · Consensus · Peer · Ledger · Job"]
SDK["Telemetry SDK + MetricsRegistry"]
Journal["beast::Journal → debug.log<br/>(trace_id/span_id injected)"]
Surfaces --> SDK
Surfaces --> Journal
end
SDK -->|"OTLP/HTTP :4318<br/>traces + metrics"| Collector["OTel Collector"]
Journal -->|"filelog tail"| Collector
Collector --> Tempo["Tempo<br/>(traces)"]
Collector --> Prom["Prometheus<br/>(metrics)"]
Collector --> Loki["Loki<br/>(logs)"]
Tempo --> Grafana["Grafana<br/>(15 dashboards)"]
Prom --> Grafana
Loki --> Grafana
style xrpld fill:#424242,stroke:#212121,color:#fff
style SDK fill:#2e7d32,stroke:#1b5e20,color:#fff
style Journal fill:#1565c0,stroke:#0d47a1,color:#fff
style Collector fill:#e65100,stroke:#bf360c,color:#fff
style Grafana fill:#4a148c,stroke:#2e0d57,color:#fff
```
| Component | Role |
| ---------------------- | --------------------------------------------------- |
| Telemetry SDK | Span creation, trace context, OTLP traces export |
| MetricsRegistry | RPC/job/peer/consensus counters, gauges, histograms |
| beast::Journal filelog | `debug.log` tailed by Collector, parsed → Loki |
| OTel Collector | Receive OTLP + filelog; route to Tempo/Prom/Loki |
| Spanmetrics connector | Derives RED metrics from spans (Prometheus) |
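To illustrate what "RED metrics from spans" means, here is a minimal Python sketch that derives rate, error ratio, and duration percentiles from finished spans. It is not the connector's implementation: the real spanmetrics connector emits counters and histogram buckets and leaves percentile computation to query time, and the span fields used here are assumptions.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans, window_s):
    """Rate, Errors, Duration per span name over a window of window_s seconds."""
    by_name = defaultdict(list)
    for span in spans:
        by_name[span["name"]].append(span)
    out = {}
    for name, group in by_name.items():
        durations = [s["duration_ms"] for s in group]
        out[name] = {
            "rate_per_s": len(group) / window_s,
            "error_ratio": sum(s["error"] for s in group) / len(group),
            "p50_ms": percentile(durations, 50),
            "p95_ms": percentile(durations, 95),
        }
    return out


# Hypothetical spans observed in a 2-second window.
spans = [
    {"name": "rpc.request", "duration_ms": 10.0, "error": False},
    {"name": "rpc.request", "duration_ms": 20.0, "error": False},
    {"name": "rpc.request", "duration_ms": 30.0, "error": False},
    {"name": "rpc.request", "duration_ms": 400.0, "error": True},
    {"name": "tx.apply", "duration_ms": 5.0, "error": False},
    {"name": "tx.apply", "duration_ms": 7.0, "error": False},
]

red = red_metrics(spans, window_s=2.0)
print(red["rpc.request"])
```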
---
## Slide 5: Signal Coverage
| Surface | Traces (Spans) | Metrics (OTLP) | Logs (Journal Partition) |
| ------------------ | --------------------------------------------------------------- | ---------------------------------------------- | ------------------------------ |
| **RPC** | `rpc.request` + handler spans | request count, latency p50/p95/p99, error rate | `RPC*` |
| **Transactions** | `tx.receive`, `tx.validate`, `tx.relay`, `tx.apply` | TX/sec by result, fee escalation gauges | `TxQ`, `LedgerMaster` |
| **Consensus** | `consensus.round`, `proposal.send/recv`, `validation.send/recv` | round duration, phase histograms, mode gauge | `Consensus`, `LedgerConsensus` |
| **Peer / Overlay** | `peer.send`, `peer.receive` per message type | peer count, bytes/sec by msg type, suppression | `Overlay`, `PeerImp` |
| **Ledger** | `ledger.close`, `ledger.apply` | close time, TX count, ledger index gauge | `LedgerMaster` |
| **Job Queue** | (sampled per type) | queue depth, queue/run duration histograms | `JobQueue` |
> ~30 distinct span names, ~80 metric series, structured logs from 50+ partitions.
---
## Slide 6: Context Propagation
```mermaid
sequenceDiagram
participant Client
participant NodeA as Node A
participant NodeB as Node B
Client->>NodeA: Submit TX (no context)
Note over NodeA: Create trace_id: abc123<br/>span: tx.receive
NodeA->>NodeB: Relay TX (TraceContext field, ~29B)
Note over NodeB: Link trace_id: abc123<br/>span: tx.relay (parent: A)
```
| Carrier | Mechanism |
| --------------------- | ------------------------------------------ |
| HTTP / WebSocket RPC | W3C `traceparent` header |
| P2P protobuf | `TraceContext` extension field per message |
| Internal job dispatch | Thread-local context + `SpanGuard` |
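For the HTTP / WebSocket carrier, the W3C `traceparent` header has the form `version-trace_id-parent_id-flags`, all lowercase hex. The sketch below formats and parses version `00` headers; the function names are illustrative, and rejecting every other layout is a simplification of the spec's forward-compatibility rules.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id: bytes, span_id: bytes, sampled: bool) -> str:
    """Build a version-00 traceparent value from raw IDs."""
    if len(trace_id) != 16 or len(span_id) != 8:
        raise ValueError("trace_id must be 16 bytes and span_id 8 bytes")
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id.hex()}-{span_id.hex()}-{flags:02x}"


def parse_traceparent(header: str):
    """Return the parsed context, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(ctx["trace_id"], ctx["sampled"])
```

A receiver that gets `None` back would start a fresh trace instead of linking to the caller.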
| Field | Size | Description |
| ------------- | --------- | ------------------------------------- |
| `trace_id` | 16 bytes | Trace correlation key |
| `span_id` | 8 bytes | Parent span on receiver |
| `trace_flags` | 1 byte | Sampling decision |
| `trace_state` | 0-4 bytes | Optional vendor data |
| **Total** | **~29 B** | Per traced P2P message (~1-6% of msg) |
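The byte budget in the table can be reproduced by packing the four fields back to back. This is a raw concatenation for sizing purposes only, not the protobuf encoding xrpld uses on the wire, and the 500 B and 2900 B message sizes are assumed examples chosen to bracket the quoted 1-6% range.

```python
import struct


def pack_context(trace_id: bytes, span_id: bytes, flags: int,
                 trace_state: bytes = b"") -> bytes:
    """Concatenate the trace context fields in table order."""
    if len(trace_id) != 16 or len(span_id) != 8:
        raise ValueError("trace_id must be 16 bytes and span_id 8 bytes")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    if len(trace_state) > 4:
        raise ValueError("trace_state is limited to 4 bytes in this sketch")
    # "!" disables alignment padding, so the fixed part is exactly 25 bytes.
    return struct.pack("!16s8sB", trace_id, span_id, flags) + trace_state


def overhead_pct(context_len: int, message_len: int) -> float:
    """Context size as a percentage of the carrying message size."""
    return 100.0 * context_len / message_len


minimal = pack_context(bytes(range(16)), bytes(range(8)), 0x01)
maximal = pack_context(bytes(range(16)), bytes(range(8)), 0x01, b"\x00" * 4)

print(len(minimal), len(maximal))  # prints: 25 29
print(round(overhead_pct(len(maximal), 500), 1))   # prints: 5.8
print(round(overhead_pct(len(maximal), 2900), 1))  # prints: 1.0
```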
---
## Slide 7: Performance Overhead
| Metric | Overhead | Driver |
| ----------------- | ---------- | --------------------------------------------------- |
| **CPU** | 1-3% | ~4 μs/TX span work (~2% at 25 TPS baseline) |
| **Memory** | ~10 MB | SDK statics + worker stack + 2048-span export queue |
| **Network** | 10-50 KB/s | OTLP export + 29 B P2P context per traced msg |
| **Latency (p99)** | <2% | TX path dominates; RPC and consensus negligible |
### Kill Switches
1. `enabled=0` in `xrpld.cfg` → instant disable, no restart
2. Build with `XRPL_ENABLE_TELEMETRY=OFF` → zero overhead (no-op stubs)
3. Reduce `sampling_ratio` → linear export reduction
> Derivations and per-component cost tables: see [03-implementation-strategy.md §3.5.4](./03-implementation-strategy.md#354-performance-data-sources).
---
## Slide 8: Sampling — Head vs Tail
| | Head Sampling | Tail Sampling |
| ------------------------ | --------------------------------- | -------------------------------------- |
| **Where** | Inside xrpld (SDK) | OTel Collector (external) |
| **Decision time** | Trace start (random coin flip) | Trace end (after all spans buffered) |
| **Knows trace content?** | No | Yes — error, latency, span kind |
| **xrpld overhead** | Lowest (drop = no-op) | Higher (export 100%) |
| **Captures all errors?** | No | **Yes** (status_code policy) |
| **Captures slow ops?** | No | **Yes** (latency policy) |
| **Config** | `xrpld.cfg`: `sampling_ratio=0.1` | `tail_sampling` processor in collector |
| **Best for** | Steady-state high volume | Anomaly + error retention |
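A head-sampling decision can be made deterministic by deriving it from the trace ID, so every node that sees the same `trace_id` reaches the same verdict. The sketch below follows the general idea of trace-ID-ratio samplers; whether xrpld's SDK configuration uses exactly this scheme is an assumption.

```python
import random

MASK_64 = (1 << 64) - 1


def head_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound


# Simulate 10,000 random 128-bit trace IDs at sampling_ratio=0.1.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(head_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

The decision happens before any span content exists, which is why head sampling cannot favor errors or slow operations.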
### Recommended Layered Strategy
```mermaid
flowchart LR
xrpld["xrpld<br/>sampling_ratio=1.0<br/>(export all)"] -->|"100%"| col["Collector<br/>tail_sampling:<br/>errors + slow + 10% random"]
col -->|"~15-20% kept"| tempo["Tempo storage"]
style xrpld fill:#424242,stroke:#212121,color:#fff
style col fill:#1565c0,stroke:#0d47a1,color:#fff
style tempo fill:#2e7d32,stroke:#1b5e20,color:#fff
```
> If the Collector comes under resource pressure, drop `sampling_ratio` to 0.5. That still leaves enough trace volume for tail decisions.
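The "errors + slow + 10% random" policy can be expressed as a small decision function evaluated once a trace is complete. This is a sketch of the logic only, not the Collector's `tail_sampling` processor; the span fields and the 1000 ms slow threshold are assumptions.

```python
import random


def tail_decision(spans, slow_ms=1000.0, random_ratio=0.10, rand=random.random):
    """Return the reason a finished trace is kept, or None to drop it."""
    # Policy 1: keep every trace that contains an error span.
    if any(s["status"] == "ERROR" for s in spans):
        return "error"
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms >= slow_ms:
        return "slow"
    # Policy 3: keep a random fraction of the healthy remainder.
    if rand() < random_ratio:
        return "random"
    return None


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 40}]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 3900}]
failed = [
    {"status": "OK", "start_ms": 0, "end_ms": 10},
    {"status": "ERROR", "start_ms": 10, "end_ms": 12},
]

print(tail_decision(failed), tail_decision(slow_ok))  # prints: error slow
```

Because the function needs every span of a trace, xrpld has to export all of them first, which is the overhead trade-off noted in the table.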
---
## Slide 9: Data Collection & Privacy
### Collected (operational metadata)
| Category | Attributes |
| ----------- | -------------------------------------------------------------------- |
| Transaction | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` |
| Consensus | `round`, `phase`, `mode`, `proposers`, `duration_ms` |
| RPC | `command`, `version`, `status`, `duration_ms` |
| Peer | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` |
| Ledger | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` |
| Job | `job.type`, `queue_ms`, `worker` |
### NOT Collected (hard exclusions)
> ❌ Private keys · ❌ Account balances · ❌ Transaction amounts · ❌ Raw payloads · ❌ Personal data · ⚙️ IP addresses (configurable)
### Privacy Mechanisms
| Mechanism | Description |
| ---------------------- | --------------------------------------------------------- |
| Account hashing | `xrpl.tx.account` hashed at Collector before storage |
| Configurable redaction | Sensitive attributes excluded via Collector config |
| Sampling | 10% default reduces exposure |
| Local control | Operator owns Collector → backend pipeline |
| No raw payloads | Span attributes are metadata only, never message contents |
> Principle: telemetry records **operational metadata** — never financial or personal content.
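The hashing and redaction mechanisms amount to a per-span attribute rewrite. The following Python sketch shows the shape of that transformation; in the deployment described here it is done by Collector configuration, and the salt, the `peer.ip` attribute name, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib

HASHED_KEYS = {"xrpl.tx.account"}  # replaced by a digest before storage
DROPPED_KEYS = {"peer.ip"}         # hypothetical name for a redacted attribute


def scrub(attributes: dict, salt: bytes) -> dict:
    """Return a copy of span attributes with sensitive values hashed or removed."""
    cleaned = {}
    for key, value in attributes.items():
        if key in DROPPED_KEYS:
            continue
        if key in HASHED_KEYS:
            digest = hashlib.sha256(salt + str(value).encode("utf-8"))
            cleaned[key] = digest.hexdigest()
        else:
            cleaned[key] = value
    return cleaned


raw = {
    "tx.type": "Payment",
    "xrpl.tx.account": "rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh",
    "peer.ip": "203.0.113.7",
}
safe = scrub(raw, salt=b"operator-chosen-salt")
print(sorted(safe))  # prints: ['tx.type', 'xrpl.tx.account']
```

Hashing keeps the attribute usable as a grouping key while removing the readable address from stored telemetry.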
---
## Slide 10: Implementation Timeline
```mermaid
gantt
title OpenTelemetry Rollout
dateFormat YYYY-MM-DD
axisFormat Week %W
section Done
Phase 1 Core Infra :done, p1, 2024-01-01, 2w
Phase 2 RPC Tracing :done, p2, after p1, 2w
Phase 3 TX Tracing :done, p3, after p2, 2w
Phase 4 Consensus :done, p4, after p3, 2w
Phase 5 Docs/Deploy :done, p5, after p4, 1w
Phase 6 StatsD Bridge :done, p6, after p5, 1w
Phase 7 Native OTel Metrics :done, p7, after p6, 2w
Phase 8 Log-Trace Correlation :done, p8, after p7, 1w
Phase 9 Metric Gap Fill :done, p9, after p8, 2w
section Future
Phase 10 Workload Validation :p10, after p9, 2w
Phase 11 3rd-Party Pipelines :p11, after p10, 3w
```
| Phase | Focus | Status |
| ----- | ------------------------------------------- | ------- |
| 1 | SDK integration, Telemetry, Config | ✅ Done |
| 2 | RPC handler spans, HTTP context | ✅ Done |
| 3 | TX spans, P2P protobuf context | ✅ Done |
| 4 | Consensus rounds, proposal/validation | ✅ Done |
| 5 | Runbook, dashboards, deployment | ✅ Done |
| 6 | StatsD bridge (interim) | ✅ Done |
| 7 | Native OTel metrics (replace Beast Insight) | ✅ Done |
| 8 | Log-trace correlation (Loki) | ✅ Done |
| 9 | Internal metric gap fill | ✅ Done |
---
## Slide 11: Current State — What Shipped
### By Signal
| Signal | Backend | Status | Notes |
| ----------- | ---------- | ------ | -------------------------------------------------------- |
| **Traces** | Tempo | ✅ | All 6 surfaces instrumented; cross-node propagation live |
| **Metrics** | Prometheus | ✅ | Native OTLP; Beast Insight retired |
| **Logs** | Loki | ✅ | filelog tailing `debug.log`; `trace_id` injected |
### By Surface
| Surface | Spans Live | Metrics Live | Notes |
| -------------- | ---------- | ------------ | --------------------------------------------------- |
| RPC | ✅ | ✅ | Handler + pathfinding + TxQ |
| Transactions | ✅ | ✅ | Receive, validate, relay, apply |
| Consensus | ✅ | ✅ | Round + proposal/validation send+receive (Phase 4a) |
| Peer / Overlay | ✅ | ✅ | Per-msg-type send/receive |
| Ledger | ✅ | ✅ | Close + apply |
| Job Queue | ✅ | ✅ | Queue depth + duration histograms |
### Stack Live
| Component | Version |
| -------------------------- | ------- |
| OTel Collector (contrib) | 0.121.0 |
| Grafana Tempo | 2.7.2 |
| Grafana Loki | 3.4.2 |
| Prometheus | latest |
| Grafana | 11.5.2 |
| **Dashboards provisioned** | **15** |
---
## Slide 12: Future Phases
### Phase 10 — Synthetic Workload Validation
| Aspect | Detail |
| ----------- | ------------------------------------------------------------------ |
| Goal | Drive instrumented surfaces under reproducible load |
| Why | Validate dashboards, catch regressions, measure overhead at scale |
| Deliverable | Workload generator + assertion suite (RPC/TX/peer churn scenarios) |
| Effort | ~2 weeks |
### Phase 11 — Admin-RPC Receiver (`xrpl_*` metrics)
| Aspect | Detail |
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| Goal | Custom Go OTel Collector receiver polls xrpld admin RPC, emits `xrpl_*` Prometheus metrics |
| Why | Admin-RPC-only data has no native export — every consumer reinvents JSON-RPC polling |
| Scope | `validators` (UNL, listed keys), `feature` (amendments), `peers` (per-peer detail), `amm_info`, `book_offers`, `fee` (detail tiers) |
| Excluded | `server_info` / `get_counts` basics — Phase 9 (#6513) already ships `xrpld_server_info` + 14 gauges/histograms natively from in-process state |
| Deliverable | Go receiver plugin + custom Collector binary + 4 Grafana dashboards (UNL, amendments, AMM, DEX) + Prometheus alerts |
| Effort | ~3 weeks |
```mermaid
flowchart LR
rpc["xrpld admin RPC<br/>(validators, feature, peers,<br/>amm_info, book_offers, fee)"] -->|JSON-RPC poll| recv["Custom Go receiver<br/>(in Collector)"]
recv -->|xrpl_* metrics| prom["Prometheus"]
prom --> graf["Grafana dashboards"]
style rpc fill:#2e7d32,stroke:#1b5e20,color:#fff
style recv fill:#1565c0,stroke:#0d47a1,color:#fff
style prom fill:#e65100,stroke:#bf360c,color:#fff
style graf fill:#6a1b9a,stroke:#4a148c,color:#fff
```
> Phase 11 fills the gap above Phase 9 — data only reachable via admin RPC, not via in-process metric callbacks.
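The core of such a receiver is a mapping from an admin RPC result to named gauges. The planned receiver is written in Go; the Python sketch below only illustrates the mapping step, with no network calls. The trimmed `fee` result and every `xrpl_fee_*` metric name are hypothetical.

```python
# Hypothetical, trimmed `fee` result; real responses carry more fields and
# the exact shape used here is an assumption.
fee_result = {
    "current_ledger_size": "14",
    "current_queue_size": "0",
    "drops": {"base_fee": "10", "open_ledger_fee": "10"},
}


def fee_to_metrics(result: dict) -> dict:
    """Flatten a `fee` result into {metric_name: float}."""
    drops = result["drops"]
    return {
        "xrpl_fee_base_drops": float(drops["base_fee"]),
        "xrpl_fee_open_ledger_drops": float(drops["open_ledger_fee"]),
        "xrpl_fee_current_ledger_size": float(result["current_ledger_size"]),
        "xrpl_fee_current_queue_size": float(result["current_queue_size"]),
    }


def render_exposition(metrics: dict) -> str:
    """Render metrics as simple `name value` lines, sorted by name."""
    lines = [f"{name} {value:g}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


metrics = fee_to_metrics(fee_result)
exposition = render_exposition(metrics)
print(exposition, end="")
```

Centralizing this mapping in one receiver is what spares each consumer from writing its own JSON-RPC polling loop.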
---
## Slide 13: External Dashboard Parity (Phase 7+)
### Bridging Community Monitoring into Native OTel
The community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard) provides 86 metrics for validator operators. We integrated the 29 missing metrics natively into the OTel pipeline.
### New Metric Categories
```mermaid
graph LR
subgraph "New Observable Gauges"
VH["Validator Health<br/>amendment_blocked, UNL expiry,<br/>quorum"]
PQ["Peer Quality<br/>P90 latency, insane peers,<br/>version awareness"]
LE["Ledger Economy<br/>fees, reserves, tx rate,<br/>ledger age"]
ST["State Tracking<br/>state value 0-6,<br/>time in state"]
VA["Validation Agreement<br/>1h/24h agreement %,<br/>agreements, misses"]
end
subgraph "Counters"
C1["ledgers_closed_total"]
C2["validations_sent_total"]
C3["state_changes_total"]
end
style VH fill:#1565c0,color:#fff
style PQ fill:#2e7d32,color:#fff
style LE fill:#e65100,color:#fff
style ST fill:#6a1b9a,color:#fff
style VA fill:#c62828,color:#fff
style C1 fill:#37474f,color:#fff
style C2 fill:#37474f,color:#fff
style C3 fill:#37474f,color:#fff
```
### ValidationTracker — Agreement Computation
```mermaid
sequenceDiagram
participant C as RCLConsensus
participant VT as ValidationTracker
participant MR as MetricsRegistry
participant P as Prometheus
C->>VT: recordOurValidation(hash, seq)
Note over VT: Stores pending event
C->>VT: recordNetworkValidation(hash, seq)
Note over VT: Marks network validated
MR->>VT: reconcile() [every 10s]
Note over VT: After 8s grace period:<br/>both validated → agreed<br/>only one → missed<br/>5min late repair window
MR->>P: Export agreement_pct_1h/24h
```
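The reconcile rules in the diagram can be sketched as a small state machine. This Python version is an illustration, not the C++ `ValidationTracker`: the method names are snake_case stand-ins, time is passed in explicitly for testability, and treating "both validated" as "both sides recorded the same hash for the sequence" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

GRACE_S = 8.0            # wait this long before judging an event
REPAIR_WINDOW_S = 300.0  # a miss may still be repaired within 5 minutes


@dataclass
class Event:
    first_seen: float
    ours: Optional[str] = None      # hash this node validated
    network: Optional[str] = None   # hash the network validated
    outcome: Optional[str] = None   # None (pending), "agreed", or "missed"


class ValidationTracker:
    def __init__(self):
        self.events = {}

    def _event(self, seq, now):
        return self.events.setdefault(seq, Event(first_seen=now))

    def record_our_validation(self, ledger_hash, seq, now):
        self._event(seq, now).ours = ledger_hash

    def record_network_validation(self, ledger_hash, seq, now):
        self._event(seq, now).network = ledger_hash

    def reconcile(self, now):
        for ev in self.events.values():
            age = now - ev.first_seen
            if age < GRACE_S:
                continue  # still inside the grace period
            matched = ev.ours is not None and ev.ours == ev.network
            if ev.outcome is None:
                ev.outcome = "agreed" if matched else "missed"
            elif ev.outcome == "missed" and matched and age <= REPAIR_WINDOW_S:
                ev.outcome = "agreed"  # late repair

    def agreement_pct(self, now, window_s):
        decided = [e for e in self.events.values()
                   if e.outcome is not None and now - e.first_seen <= window_s]
        if not decided:
            return None
        return 100.0 * sum(e.outcome == "agreed" for e in decided) / len(decided)


vt = ValidationTracker()
vt.record_our_validation("H1", 1, now=0.0)
vt.record_network_validation("H1", 1, now=1.0)
vt.record_our_validation("H2", 2, now=4.0)
vt.reconcile(now=20.0)
before_repair = vt.agreement_pct(now=20.0, window_s=3600.0)
vt.record_network_validation("H2", 2, now=60.0)
vt.reconcile(now=70.0)
after_repair = vt.agreement_pct(now=70.0, window_s=3600.0)
print(before_repair, after_repair)  # prints: 50.0 100.0
```

Calling `agreement_pct` with 3600 and 86400 second windows would yield the 1h and 24h figures exported in the diagram.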
### New Grafana Dashboards
| Dashboard | Key Panels |
| ---------------- | --------------------------------------------------- |
| Validator Health | Agreement %, amendment blocked, quorum, state value |
| Peer Quality | P90 latency, version awareness, upgrade recommended |
| Ledger Economy | Base fee, reserves, ledger age, transaction rate |
---
_End of Presentation_