Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-04-29 20:24:52 +01:00
1526 changed files with 75760 additions and 25422 deletions

View File

@@ -406,7 +406,7 @@ flowchart TB
## Distributed Traces Across Nodes
In distributed systems like rippled, traces span **multiple independent nodes**. The trace context must be propagated in network messages:
In distributed systems like xrpld, traces span **multiple independent nodes**. The trace context must be propagated in network messages:
```mermaid
sequenceDiagram
@@ -492,7 +492,7 @@ traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
└── Version
```
### Protocol Buffers (rippled P2P messages)
### Protocol Buffers (xrpld P2P messages)
```protobuf
message TMTransaction {
@@ -534,7 +534,7 @@ Trace completes → Collector evaluates:
---
## Key Benefits for rippled
## Key Benefits for xrpld
| Challenge | How Tracing Helps |
| ---------------------------------- | ---------------------------------------- |

View File

@@ -5,15 +5,15 @@
---
## 1.1 Current rippled Architecture Overview
## 1.1 Current xrpld Architecture Overview
> **WS** = WebSocket | **UNL** = Unique Node List | **TxQ** = Transaction Queue | **StatsD** = Statistics Daemon
The rippled node software consists of several interconnected components that need instrumentation for distributed tracing:
The xrpld node software consists of several interconnected components that need instrumentation for distributed tracing:
```mermaid
flowchart TB
subgraph rippled["rippled Node"]
subgraph xrpld["xrpld Node"]
subgraph services["Core Services"]
RPC["RPC Server<br/>(HTTP/WS/gRPC)"]
Overlay["Overlay<br/>(P2P Network)"]
@@ -47,7 +47,7 @@ flowchart TB
JobQueue --> appservices
end
style rippled fill:#424242,stroke:#212121,color:#ffffff
style xrpld fill:#424242,stroke:#212121,color:#ffffff
style services fill:#1565c0,stroke:#0d47a1,color:#ffffff
style processing fill:#2e7d32,stroke:#1b5e20,color:#ffffff
style appservices fill:#6a1b9a,stroke:#4a148c,color:#ffffff
@@ -56,7 +56,7 @@ flowchart TB
**Reading the diagram:**
- **Core Services (blue)**: The entry points into rippled -- RPC Server handles client requests, Overlay manages peer-to-peer networking, Consensus drives agreement, and ValidatorList manages trusted validators.
- **Core Services (blue)**: The entry points into xrpld -- RPC Server handles client requests, Overlay manages peer-to-peer networking, Consensus drives agreement, and ValidatorList manages trusted validators.
- **JobQueue (center)**: The asynchronous thread pool that decouples Core Services from the Processing and Application layers. All work flows through it.
- **Processing Layer (green)**: Core business logic -- NetworkOPs processes transactions, LedgerMaster manages ledger state, NodeStore handles persistence, and InboundLedgers synchronizes missing data.
- **Application Services (purple)**: Higher-level features -- PathFinding computes payment routes, TxQ manages fee-based queuing, and LoadManager tracks server load.
@@ -71,7 +71,7 @@ flowchart TB
| Who (Plain English) | Technical Term |
| ----------------------------------------- | -------------------------- |
| Network node running XRPL software | rippled node |
| Network node running XRPL software | xrpld node |
| External client submitting requests | RPC Client |
| Network neighbor sharing data | Peer (PeerImp) |
| Request handler for client queries | RPC Server (ServerHandler) |
@@ -354,17 +354,17 @@ After implementing OpenTelemetry, operators and developers will gain visibility
### 1.8.1 What You Will See: Traces
| Trace Type | Description | Example Query in Grafana/Tempo |
| -------------------------- | ------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
| **Transaction Lifecycle** | Full journey from RPC submission through validation, relay, consensus, and ledger inclusion | `{service.name="rippled" && xrpl.tx.hash="ABC123..."}` |
| **Cross-Node Propagation** | Transaction path across multiple rippled nodes with timing | `{xrpl.tx.relay_count > 0}` |
| **Consensus Rounds** | Complete round with all phases (open, establish, accept) | `{span.name=~"consensus.round.*"}` |
| **RPC Request Processing** | Individual command execution with timing breakdown | `{xrpl.rpc.command="account_info"}` |
| **Ledger Acquisition** | Peer-to-peer ledger data requests and responses | `{span.name="ledger.acquire"}` |
| **PathFinding Latency** | Path computation time and cache effectiveness for payment RPCs | `{span.name="pathfind.compute"}` |
| **TxQ Behavior** | Queue depth, eviction patterns, fee escalation during congestion | `{span.name=~"txq.*"}` |
| **Ledger Sync** | Full acquisition timeline including delta and transaction fetches | `{span.name=~"ledger.acquire.*"}` |
| **Validator Health** | UNL fetch success, manifest updates, stale list detection | `{span.name=~"validator.*"}` |
| Trace Type | Description | Example Query in Grafana/Tempo |
| -------------------------- | ------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| **Transaction Lifecycle** | Full journey from RPC submission through validation, relay, consensus, and ledger inclusion | `{service.name="xrpld" && xrpl.tx.hash="ABC123..."}` |
| **Cross-Node Propagation** | Transaction path across multiple xrpld nodes with timing | `{xrpl.tx.relay_count > 0}` |
| **Consensus Rounds** | Complete round with all phases (open, establish, accept) | `{span.name=~"consensus.round.*"}` |
| **RPC Request Processing** | Individual command execution with timing breakdown | `{xrpl.rpc.command="account_info"}` |
| **Ledger Acquisition** | Peer-to-peer ledger data requests and responses | `{span.name="ledger.acquire"}` |
| **PathFinding Latency** | Path computation time and cache effectiveness for payment RPCs | `{span.name="pathfind.compute"}` |
| **TxQ Behavior** | Queue depth, eviction patterns, fee escalation during congestion | `{span.name=~"txq.*"}` |
| **Ledger Sync** | Full acquisition timeline including delta and transaction fetches | `{span.name=~"ledger.acquire.*"}` |
| **Validator Health** | UNL fetch success, manifest updates, stale list detection | `{span.name=~"validator.*"}` |
### 1.8.2 What You Will See: Metrics (Derived from Traces)

View File

@@ -25,10 +25,10 @@
**Manual Instrumentation** (recommended):
| Approach | Pros | Cons |
| ---------- | ----------------------------------------------------------------- | ------------------------------------------------------- |
| **Manual** | Precise control, optimized placement, rippled-specific attributes | More development effort |
| **Auto** | Less code, automatic coverage | Less control, potential overhead, limited customization |
| Approach | Pros | Cons |
| ---------- | --------------------------------------------------------------- | ------------------------------------------------------- |
| **Manual** | Precise control, optimized placement, xrpld-specific attributes | More development effort |
| **Auto** | Less code, automatic coverage | Less control, potential overhead, limited customization |
---
@@ -38,10 +38,10 @@
```mermaid
flowchart TB
subgraph nodes["rippled Nodes"]
node1["rippled<br/>Node 1"]
node2["rippled<br/>Node 2"]
node3["rippled<br/>Node 3"]
subgraph nodes["xrpld Nodes"]
node1["xrpld<br/>Node 1"]
node2["xrpld<br/>Node 2"]
node3["xrpld<br/>Node 3"]
end
collector["OpenTelemetry<br/>Collector<br/>(sidecar or standalone)"]
@@ -65,7 +65,7 @@ flowchart TB
**Reading the diagram:**
- **rippled Nodes (blue)**: The source of telemetry data. Each rippled node exports spans via OTLP/gRPC on port 4317.
- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/gRPC on port 4317.
- **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing.
- **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends.
- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends.
@@ -203,11 +203,11 @@ job:
```cpp
// Standard OpenTelemetry semantic conventions
resource::SemanticConventions::SERVICE_NAME = "rippled"
resource::SemanticConventions::SERVICE_NAME = "xrpld"
resource::SemanticConventions::SERVICE_VERSION = BuildInfo::getVersionString()
resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
// Custom rippled attributes
// Custom xrpld attributes
"xrpl.network.id" = <network_id> // e.g., 0 for mainnet
"xrpl.network.type" = "mainnet" | "testnet" | "devnet" | "standalone"
"xrpl.node.type" = "validator" | "stock" | "reporting"
@@ -251,8 +251,8 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
"xrpl.consensus.proposers_total" = int64 // Total peer positions
"xrpl.consensus.agree_count" = int64 // Peers that agree (haveConsensus)
"xrpl.consensus.disagree_count" = int64 // Peers that disagree
"xrpl.consensus.threshold_percent" = int64 // Current threshold (50/65/70/95)
"xrpl.consensus.result" = string // "yes", "no", "moved_on"
"xrpl.consensus.threshold_percent" = int64 // Close-time consensus threshold (avCT_CONSENSUS_PCT = 75%)
"xrpl.consensus.result" = string // "yes", "no", "moved_on", "expired"
"xrpl.consensus.mode.old" = string // Previous consensus mode
"xrpl.consensus.mode.new" = string // New consensus mode
```
@@ -406,7 +406,7 @@ processors:
#### Configuration Options for Privacy
In `rippled.cfg`, operators can control data collection granularity:
In `xrpld.cfg`, operators can control data collection granularity:
```ini
[telemetry]
@@ -423,7 +423,7 @@ redact_account=1 # Hash account addresses before export
redact_peer_address=1 # Remove peer IP addresses
```
> **Note**: The `redact_account` configuration in `rippled.cfg` controls SDK-level redaction before export, while collector-level filtering (see [Collector-Level Data Protection](#collector-level-data-protection) above) provides an additional defense-in-depth layer. Both can operate independently.
> **Note**: The `redact_account` configuration in `xrpld.cfg` controls SDK-level redaction before export, while collector-level filtering (see [Collector-Level Data Protection](#collector-level-data-protection) above) provides an additional defense-in-depth layer. Both can operate independently.
> **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts, raw payloads).
@@ -433,12 +433,91 @@ redact_peer_address=1 # Remove peer IP addresses
> **WS** = WebSocket
### 2.5.0 Deterministic Trace ID Strategy
Both transaction and consensus tracing use **deterministic trace IDs** derived from
a globally known hash, so all nodes handling the same workflow independently produce
spans under the same `trace_id`. This is combined with protobuf `span_id` propagation
for parent-child relay ordering when available.
#### Transactions — `trace_id = txHash[0:16]`
Every node that handles a transaction knows its `txID` (the `uint256` transaction
hash). The first 16 bytes of this hash are used as the OTel `trace_id`:
```
uint256 txHash: A1B2C3D4 E5F6A7B8 C9D0E1F2 A3B4C5D6 E7F8A9B0 C1D2E3F4 A5B6C7D8 E9F0A1B2
|---------- trace_id (16 bytes) ---------| (remaining 16 bytes unused)
```
Each node generates a **random 8-byte `span_id`** so its span is unique within the
shared trace. When protobuf `TraceContext` is present in the incoming `TMTransaction`,
the sender's `span_id` is extracted and used as the parent — preserving the relay
chain as a parent-child tree. When absent (older peers, first hop from client), the
span appears as a root in the same trace — correlation is preserved, only the tree
structure degrades.
```
Node A (submitter) Node B (relay) Node C (relay)
trace_id: A1B2... trace_id: A1B2... trace_id: A1B2...
span_id: 1234 (random) span_id: 5678 (random) span_id: 9ABC (random)
parent: (none) parent: 1234 (proto) parent: 5678 (proto)
↑ ↑
protobuf propagation protobuf propagation
```
If protobuf propagation fails at Node B (old peer):
```
Node A Node B (old peer) Node C
trace_id: A1B2... trace_id: A1B2... trace_id: A1B2...
span_id: 1234 span_id: 5678 span_id: 9ABC
parent: (none) parent: (none) parent: 5678 (proto)
↑ no parent, but same trace_id — still grouped
```
#### Consensus — `trace_id = prevLedgerHash[0:16]`
All validators in the same consensus round share the same `previousLedger.id()`.
The first 16 bytes are used as trace_id. See [Phase 4a implementation status](./06-implementation-phases.md)
and `createDeterministicContext()` in `RCLConsensus.cpp` for the implementation.
Switchable via `consensus_trace_strategy` config:
`"deterministic"` (default) or `"attribute"` (random trace_id, correlation via attribute queries).
#### Why Not Random IDs with Propagation Only?
Random trace IDs require **unbroken context propagation** across every hop. In a
mixed-version network (common during upgrades), older peers silently drop the
`trace_context` protobuf field. The trace splits and downstream spans become
impossible to find. Deterministic IDs make correlation **propagation-resilient** — the trace
backend groups all spans for the same transaction/round regardless of whether
propagation succeeded.
#### Why Keep Protobuf Propagation?
Deterministic trace IDs alone provide correlation (all spans grouped) but not
**causality** (which node relayed to which). Protobuf `span_id` propagation adds
parent-child ordering that shows the exact relay path. The two mechanisms complement
each other:
| Mechanism | Provides | Fails when |
| ---------------------------- | --------------------------- | -------------------------------------- |
| Deterministic trace_id | Cross-node correlation | Never (hash is always known) |
| Protobuf span_id propagation | Parent-child relay ordering | Older peer drops `trace_context` field |
#### Implementation Reference
The utility function `createDeterministicTxContext(uint256 const& txHash)` follows
the same pattern as `createDeterministicContext(uint256 const& ledgerId)` in
`RCLConsensus.cpp`. See [Phase 3 Task 3.9](./Phase3_taskList.md) for the full spec.
### 2.5.1 Propagation Boundaries
```mermaid
flowchart TB
subgraph http["HTTP/WebSocket (RPC)"]
w3c["W3C Trace Context Headers:<br/>traceparent:<br/>00-trace_id-span_id-flags<br/>tracestate: rippled=..."]
w3c["W3C Trace Context Headers:<br/>traceparent:<br/>00-trace_id-span_id-flags<br/>tracestate: xrpld=..."]
end
subgraph protobuf["Protocol Buffers (P2P)"]
@@ -457,7 +536,7 @@ flowchart TB
**Reading the diagram:**
- **HTTP/WebSocket - RPC (blue)**: For client-facing RPC requests, trace context is propagated using the W3C `traceparent` header. This is the standard approach and works with any OTel-compatible client.
- **Protocol Buffers - P2P (green)**: For peer-to-peer messages between rippled nodes, trace context is embedded as a protobuf `TraceContext` message carrying trace_id, span_id, flags, and optional trace_state.
- **Protocol Buffers - P2P (green)**: For peer-to-peer messages between xrpld nodes, trace context is embedded as a protobuf `TraceContext` message carrying trace_id, span_id, flags, and optional trace_state.
- **JobQueue - Internal Async (red)**: For asynchronous work within a single node, the OTel context is captured when a job is created and restored when the job executes on a worker thread. This bridges the async gap so spans remain linked.
---
@@ -468,7 +547,7 @@ flowchart TB
### 2.6.1 Existing Frameworks Comparison
rippled already has two observability mechanisms. OpenTelemetry complements (not replaces) them:
xrpld already has two observability mechanisms. OpenTelemetry complements (not replaces) them:
| Aspect | PerfLog | Beast Insight (StatsD) | OpenTelemetry |
| --------------------- | ----------------------------- | ---------------------------- | ------------------------- |
@@ -517,7 +596,7 @@ rippled already has two observability mechanisms. OpenTelemetry complements (not
- Single-node perspective
```cpp
// Example StatsD usage in rippled
// Example StatsD usage in xrpld
insight.increment("rpc.submit.count");
insight.gauge("ledger.age", age);
insight.timing("consensus.round", duration);
@@ -556,11 +635,9 @@ span->SetAttribute("peer.id", peerId);
### 2.6.4 Coexistence Strategy
> **Note**: Phase 7 replaces the StatsD bridge with native OTel Metrics SDK export. The diagram below shows the Phase 6 intermediate state. See [Phase7_taskList.md](./Phase7_taskList.md) for the migration design where Beast Insight emits via OTLP instead of StatsD.
```mermaid
flowchart TB
subgraph rippled["rippled Process"]
subgraph xrpld["xrpld Process"]
perflog["PerfLog<br/>(JSON to file)"]
insight["Beast Insight<br/>(StatsD)"]
otel["OpenTelemetry<br/>(Tracing)"]
@@ -574,20 +651,18 @@ flowchart TB
statsd --> grafana
collector --> grafana
style rippled fill:#212121,stroke:#0a0a0a,color:#ffffff
style xrpld fill:#212121,stroke:#0a0a0a,color:#ffffff
style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff
```
**Reading the diagram:**
- **rippled Process (dark gray)**: The single rippled node running all three observability frameworks side by side. Each framework operates independently with no interference.
- **xrpld Process (dark gray)**: The single xrpld node running all three observability frameworks side by side. Each framework operates independently with no interference.
- **PerfLog to perf.log**: PerfLog writes JSON-formatted event logs to a local file. Grafana can ingest these via Loki or a file-based datasource.
- **Beast Insight to StatsD Server**: Insight sends aggregated metrics (counters, gauges) over UDP to a StatsD server. Grafana reads from StatsD-compatible backends like Graphite or Prometheus (via StatsD exporter).
- **OpenTelemetry to OTLP Collector**: OTel exports spans over OTLP/gRPC to a Collector, which then forwards to a trace backend (Tempo).
- **Grafana (red, unified UI)**: All three data streams converge in Grafana, enabling operators to correlate logs, metrics, and traces in a single dashboard.
**Phase 7 target state**: Beast Insight routes to `OTelCollector` (new `Collector` implementation) which exports via OTLP/HTTP to the same collector endpoint as traces. StatsD UDP path becomes a deprecated fallback (`[insight] server=statsd`). See [06-implementation-phases.md §6.8](./06-implementation-phases.md) and [Phase7_taskList.md](./Phase7_taskList.md) for details.
### 2.6.5 Correlation with PerfLog
Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging:

View File

@@ -7,28 +7,24 @@
## 3.1 Directory Structure
The telemetry implementation follows rippled's existing code organization pattern:
The telemetry implementation follows xrpld's existing code organization pattern:
```
include/xrpl/
├── telemetry/
│ ├── Telemetry.h # Main telemetry interface
│ ├── Telemetry.h # Main telemetry interface (global singleton)
│ ├── TelemetryConfig.h # Configuration structures
│ ├── TraceContext.h # Context propagation utilities
│ ├── SpanGuard.h # RAII span management
│ ├── SpanGuard.h # RAII span management with factory methods + discard()
│ ├── DiscardFlag.h # Thread-local discard flag
│ └── SpanAttributes.h # Attribute helper functions
src/libxrpl/
├── telemetry/
│ ├── Telemetry.cpp # Implementation
│ ├── Telemetry.cpp # Implementation + FilteringSpanProcessor
│ ├── TelemetryConfig.cpp # Config parsing
│ ├── TraceContext.cpp # Context serialization
│ └── NullTelemetry.cpp # No-op implementation
src/xrpld/
├── telemetry/
│ ├── TracingInstrumentation.h # Instrumentation macros
│ └── TracingInstrumentation.cpp
```
---
@@ -314,20 +310,20 @@ flowchart TD
### 3.7.3 Conditional Instrumentation
```cpp
// Compile-time feature flag
#ifndef XRPL_ENABLE_TELEMETRY
// Zero-cost when disabled
#define XRPL_TRACE_SPAN(t, n) ((void)0)
#endif
SpanGuard's static factory methods handle both compile-time and runtime
checks internally. When `XRPL_ENABLE_TELEMETRY` is not defined, the
entire SpanGuard class compiles to a no-op stub with empty method bodies.
When it is defined, the factory methods check the global Telemetry
instance and the relevant component filter before creating a span:
// Runtime component filtering
if (telemetry.shouldTracePeer())
{
XRPL_TRACE_SPAN(telemetry, "peer.message.receive");
// ... instrumentation
}
// No overhead when component tracing disabled
```cpp
// SpanGuard factory methods handle all conditional logic internally.
// When XRPL_ENABLE_TELEMETRY is not defined, these are no-ops.
// When defined, they check Telemetry::getInstance() and the
// component filter (e.g. shouldTracePeer()) at runtime.
auto span = telemetry::SpanGuard::peerSpan("peer.message.receive");
span.setAttribute("xrpl.peer.id", peerId);
// No overhead when telemetry is disabled at compile time or runtime
```
---
@@ -344,13 +340,13 @@ if (telemetry.shouldTracePeer())
> **TxQ** = Transaction Queue
This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing rippled codebase.
This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing xrpld codebase.
### 3.9.1 Files Modified Summary
| Component | Files Modified | Lines Added | Lines Changed | Architectural Impact |
| --------------------- | -------------- | ----------- | ------------- | -------------------- |
| **Core Telemetry** | 5 new files | ~800 | 0 | None (new module) |
| **Core Telemetry** | 7 new files | ~800 | 0 | None (new module) |
| **Application Init** | 2 files | ~30 | ~5 | Minimal |
| **RPC Layer** | 3 files | ~80 | ~20 | Minimal |
| **Transaction Relay** | 4 files | ~120 | ~40 | Low |
@@ -360,7 +356,7 @@ This section provides a detailed assessment of how intrusive the OpenTelemetry i
| **PathFinding** | 2 | ~80 | ~5 | Minimal |
| **TxQ/Fee** | 2 | ~60 | ~5 | Minimal |
| **Validator/Amend** | 3 | ~40 | ~5 | Minimal |
| **Total** | **~28 files** | **~1,490** | **~120** | **Low** |
| **Total** | **~27 files** | **~1,490** | **~120** | **Low** |
### 3.9.2 Detailed File Impact
@@ -380,22 +376,22 @@ pie title Code Changes by Component
#### New Files (No Impact on Existing Code)
| File | Lines | Purpose |
| ---------------------------------------------- | ----- | -------------------- |
| `include/xrpl/telemetry/Telemetry.h` | ~160 | Main interface |
| `include/xrpl/telemetry/SpanGuard.h` | ~120 | RAII wrapper |
| `include/xrpl/telemetry/TraceContext.h` | ~80 | Context propagation |
| `src/xrpld/telemetry/TracingInstrumentation.h` | ~60 | Macros |
| `src/libxrpl/telemetry/Telemetry.cpp` | ~200 | Implementation |
| `src/libxrpl/telemetry/TelemetryConfig.cpp` | ~60 | Config parsing |
| `src/libxrpl/telemetry/NullTelemetry.cpp` | ~40 | No-op implementation |
| File | Lines | Purpose |
| ------------------------------------------- | ----- | ----------------------------------------------------- |
| `include/xrpl/telemetry/Telemetry.h` | ~160 | Main interface (global singleton) |
| `include/xrpl/telemetry/SpanGuard.h` | ~250 | RAII wrapper + factory methods + discard + no-op stub |
| `include/xrpl/telemetry/DiscardFlag.h` | ~28 | Thread-local discard flag |
| `include/xrpl/telemetry/TraceContext.h` | ~80 | Context propagation |
| `src/libxrpl/telemetry/Telemetry.cpp` | ~400 | Implementation + FilteringSpanProcessor |
| `src/libxrpl/telemetry/TelemetryConfig.cpp` | ~60 | Config parsing |
| `src/libxrpl/telemetry/NullTelemetry.cpp` | ~40 | No-op implementation |
#### Modified Files (Existing Rippled Code)
#### Modified Files (Existing Xrpld Code)
| File | Lines Added | Lines Changed | Risk Level |
| ------------------------------------------------- | ----------- | ------------- | ---------- |
| `src/xrpld/app/main/Application.cpp` | ~15 | ~3 | Low |
| `include/xrpl/app/main/Application.h` | ~5 | ~2 | Low |
| `include/xrpl/core/ServiceRegistry.h` | ~5 | ~2 | Low |
| `src/xrpld/rpc/detail/ServerHandler.cpp` | ~40 | ~10 | Low |
| `src/xrpld/rpc/handlers/*.cpp` | ~30 | ~8 | Low |
| `src/xrpld/overlay/detail/PeerImp.cpp` | ~60 | ~15 | Medium |
@@ -491,18 +487,24 @@ void ServerHandler::onRequest(...) {
send(result);
}
// After (only ~10 lines added)
// After (only ~4 lines added)
void ServerHandler::onRequest(...) {
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request"); // +1 line
XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command); // +1 line
auto span = telemetry::SpanGuard::rpcSpan("rpc.request"); // +1 line
span.setAttribute("xrpl.rpc.command", command); // +1 line
auto result = processRequest(req);
XRPL_TRACE_SET_ATTR("xrpl.rpc.status", status); // +1 line
span.setAttribute("xrpl.rpc.status", status); // +1 line
send(result);
}
```
SpanGuard factory methods (`rpcSpan`, `txSpan`, `consensusSpan`, etc.)
access the global `Telemetry` instance internally and check the relevant
component filter (`shouldTraceRpc()`, etc.) before creating a span. The
public SpanGuard header has zero `opentelemetry/` includes -- all OTel
types are hidden behind the pimpl idiom.
**Consensus Instrumentation (Medium Intrusiveness):**
```cpp
@@ -513,11 +515,11 @@ void RCLConsensusAdaptor::startRound(...) {
// After (context storage required)
void RCLConsensusAdaptor::startRound(...) {
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.round");
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.seq", seq);
auto span = telemetry::SpanGuard::consensusSpan("consensus.round");
span.setAttribute("xrpl.consensus.ledger.seq", seq);
// Store context for child spans in phase transitions
currentRoundContext_ = _xrpl_guard_->context(); // New member variable
currentRoundContext_ = span.context(); // New member variable
// ... existing logic unchanged
}

View File

@@ -30,7 +30,7 @@ namespace telemetry {
/**
* Main telemetry interface for OpenTelemetry integration.
*
* This class provides the primary API for distributed tracing in rippled.
* This class provides the primary API for distributed tracing in xrpld.
* It manages the OpenTelemetry SDK lifecycle and provides convenience
* methods for creating spans and propagating context.
*/
@@ -43,7 +43,7 @@ public:
struct Setup
{
bool enabled = false;
std::string serviceName = "rippled";
std::string serviceName = "xrpld";
std::string serviceVersion;
std::string serviceInstanceId; // Node public key
@@ -98,7 +98,7 @@ public:
/** Get the tracer for creating spans */
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>
getTracer(std::string_view name = "rippled") = 0;
getTracer(std::string_view name = "xrpld") = 0;
// ═══════════════════════════════════════════════════════════════════════
// SPAN CREATION (Convenience Methods)
@@ -181,271 +181,227 @@ setup_Telemetry(
---
## 4.2 RAII Span Guard
## 4.2 RAII Span Guard with Factory Methods
SpanGuard is a self-contained RAII wrapper that creates, activates, and
ends trace spans. It uses the pimpl idiom to hide all OpenTelemetry
types -- the public header has **zero `opentelemetry/` includes**.
Callers never interact with OTel SDK types directly.
SpanGuard provides static factory methods (`rpcSpan()`, `txSpan()`,
`consensusSpan()`, etc.) that access the global `Telemetry` singleton
internally. Each factory checks both the runtime enable flag and the
relevant component filter before creating a span.
When `XRPL_ENABLE_TELEMETRY` is **not** defined, the entire SpanGuard
class compiles to a no-op stub with empty inline method bodies, giving
zero compile-time and runtime cost.
```cpp
// include/xrpl/telemetry/SpanGuard.h
//
// Public API -- no opentelemetry/ includes.
// OTel types are hidden behind the pimpl (Impl struct, defined in the
// #ifdef XRPL_ENABLE_TELEMETRY section at the bottom of the header).
#pragma once
#include <opentelemetry/trace/span.h>
#include <opentelemetry/trace/scope.h>
#include <opentelemetry/trace/status.h>
#include <memory>
#include <string_view>
#include <exception>
namespace xrpl {
namespace telemetry {
/**
* RAII guard for OpenTelemetry spans.
*
* Automatically ends the span on destruction and makes it the current
* span in the thread-local context.
*/
#ifdef XRPL_ENABLE_TELEMETRY
class SpanGuard
{
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span_;
opentelemetry::trace::Scope scope_;
struct Impl; // pimpl -- defined in .cpp or
std::unique_ptr<Impl> impl_; // in the guarded section below
public:
/**
* Construct guard with span.
* The span becomes the current span in thread-local context.
*
* @note If span is nullptr (e.g., telemetry disabled), the guard
* becomes a no-op. All methods safely check for null before access.
*/
explicit SpanGuard(
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span)
: span_(span ? std::move(span) : nullptr)
, scope_(span_ ? opentelemetry::trace::Scope(span_)
: opentelemetry::trace::Scope(
opentelemetry::nostd::shared_ptr<
opentelemetry::trace::Span>(nullptr)))
{
}
// ═══════════════════════════════════════════════════════════════════
// FACTORY METHODS (access global Telemetry internally)
// ═══════════════════════════════════════════════════════════════════
// Non-copyable, non-movable
/** Create a span for RPC request handling.
* Returns a no-op guard if telemetry is disabled or
* shouldTraceRpc() is false.
*/
static SpanGuard rpcSpan(std::string_view name);
/** Create a span for transaction processing. */
static SpanGuard txSpan(std::string_view name);
/** Create a span for consensus rounds. */
static SpanGuard consensusSpan(std::string_view name);
/** Create a span for peer-to-peer messages. */
static SpanGuard peerSpan(std::string_view name);
/** Create a span for ledger operations. */
static SpanGuard ledgerSpan(std::string_view name);
/** Create an uncategorized span (always created when enabled). */
static SpanGuard span(std::string_view name);
// ═══════════════════════════════════════════════════════════════════
// INSTANCE METHODS
// ═══════════════════════════════════════════════════════════════════
SpanGuard(); // constructs a no-op guard
~SpanGuard();
SpanGuard(SpanGuard&& other) noexcept;
SpanGuard& operator=(SpanGuard&&) = delete;
SpanGuard(SpanGuard const&) = delete;
SpanGuard& operator=(SpanGuard const&) = delete;
SpanGuard(SpanGuard&&) = delete;
SpanGuard& operator=(SpanGuard&&) = delete;
~SpanGuard()
{
if (span_)
span_->End();
}
/** Mark the span status as OK. */
void setOk();
/** Access the underlying span */
opentelemetry::trace::Span& span() { return *span_; }
opentelemetry::trace::Span const& span() const { return *span_; }
/** Set an explicit status code. */
void setStatus(int code, std::string_view description = "");
/** Set span status to OK */
void setOk()
{
span_->SetStatus(opentelemetry::trace::StatusCode::kOk);
}
/** Set span status with code and description */
void setStatus(
opentelemetry::trace::StatusCode code,
std::string_view description = "")
{
span_->SetStatus(code, std::string(description));
}
/** Set an attribute on the span */
/** Set a key-value attribute on the span. */
template<typename T>
void setAttribute(std::string_view key, T&& value)
{
span_->SetAttribute(
opentelemetry::nostd::string_view(key.data(), key.size()),
std::forward<T>(value));
}
void setAttribute(std::string_view key, T&& value);
/** Add an event to the span */
void addEvent(std::string_view name)
{
span_->AddEvent(std::string(name));
}
/** Add an event to the span timeline. */
void addEvent(std::string_view name);
/** Record an exception on the span */
void recordException(std::exception const& e)
{
span_->RecordException(e);
span_->SetStatus(
opentelemetry::trace::StatusCode::kError,
e.what());
}
/** Record an exception and set error status. */
void recordException(std::exception const& e);
/** Get the current trace context */
opentelemetry::context::Context context() const
{
return opentelemetry::context::RuntimeContext::GetCurrent();
}
/** Get the current trace context (for cross-thread propagation). */
// Returns an opaque context handle.
auto context() const;
/** Discard this span -- dropped before export. */
void discard();
};
/**
* No-op span guard for when tracing is disabled.
* Provides the same interface but does nothing.
*/
class NullSpanGuard
#else // XRPL_ENABLE_TELEMETRY not defined -- zero-cost stub
class SpanGuard
{
public:
NullSpanGuard() = default;
// Factory methods -- all return no-op guards
static SpanGuard rpcSpan(std::string_view) { return {}; }
static SpanGuard txSpan(std::string_view) { return {}; }
static SpanGuard consensusSpan(std::string_view) { return {}; }
static SpanGuard peerSpan(std::string_view) { return {}; }
static SpanGuard ledgerSpan(std::string_view) { return {}; }
static SpanGuard span(std::string_view) { return {}; }
// Instance methods -- all no-ops
void setOk() {}
void setStatus(opentelemetry::trace::StatusCode, std::string_view = "") {}
void setStatus(int, std::string_view = "") {}
template<typename T>
void setAttribute(std::string_view, T&&) {}
void addEvent(std::string_view) {}
void recordException(std::exception const&) {}
/** Return a default empty context (matches SpanGuard interface) */
opentelemetry::context::Context context() const
{
return opentelemetry::context::Context{};
}
void discard() {}
};
#endif // XRPL_ENABLE_TELEMETRY
} // namespace telemetry
} // namespace xrpl
```
---
## 4.3 Instrumentation Macros
## 4.3 SpanGuard API Reference
The previous macro-based approach (`TracingInstrumentation.h` with
`XRPL_TRACE_*` macros) has been replaced by SpanGuard's static factory
methods. This eliminates preprocessor macros from instrumentation call
sites and provides a cleaner, type-safe API.
### 4.3.1 Factory Methods
Each factory method accesses the global `Telemetry::getInstance()`
singleton internally and checks the corresponding component filter.
If telemetry is disabled (compile-time or runtime) or the component
filter is off, the factory returns a no-op guard whose methods are
all empty inlines.
| Factory Method | Component Filter | Typical Span Names |
| -------------------------------- | --------------------------- | ------------------------------------ |
| `SpanGuard::rpcSpan(name)` | `shouldTraceRpc()` | `rpc.request`, `rpc.command.submit` |
| `SpanGuard::txSpan(name)` | `shouldTraceTransactions()` | `tx.receive`, `tx.validate` |
| `SpanGuard::consensusSpan(name)` | `shouldTraceConsensus()` | `consensus.round`, `consensus.phase` |
| `SpanGuard::peerSpan(name)` | `shouldTracePeer()` | `peer.message.receive` |
| `SpanGuard::ledgerSpan(name)` | `shouldTraceLedger()` | `ledger.close`, `ledger.accept` |
| `SpanGuard::span(name)` | (always, if enabled) | `job.execute`, custom spans |
### 4.3.2 Usage Pattern
```cpp
// src/xrpld/telemetry/TracingInstrumentation.h
#pragma once
#include <xrpl/telemetry/Telemetry.h>
#include <xrpl/telemetry/SpanGuard.h>
namespace xrpl {
namespace telemetry {
void ServerHandler::onRequest(...)
{
// Factory creates a span if RPC tracing is enabled, no-op otherwise.
// No Telemetry& reference needed -- accessed via global singleton.
auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
span.setAttribute("xrpl.rpc.command", command);
// ═══════════════════════════════════════════════════════════════════════════
// INSTRUMENTATION MACROS
// ═══════════════════════════════════════════════════════════════════════════
auto result = processRequest(req);
#ifdef XRPL_ENABLE_TELEMETRY
span.setAttribute("xrpl.rpc.status", result.status());
span.setOk();
// span ended automatically when it goes out of scope
}
```
// Start a span that is automatically ended when guard goes out of scope
#define XRPL_TRACE_SPAN(telemetry, name) \
auto _xrpl_span_ = (telemetry).startSpan(name); \
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
### 4.3.3 Compile-Time Disabled Behavior
// Start a span with specific kind
#define XRPL_TRACE_SPAN_KIND(telemetry, name, kind) \
auto _xrpl_span_ = (telemetry).startSpan(name, kind); \
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
When `XRPL_ENABLE_TELEMETRY` is **not** defined, SpanGuard compiles to
a zero-cost no-op stub. All factory methods return a default-constructed
guard, and all instance methods have empty bodies:
// Conditional span based on component
#define XRPL_TRACE_TX(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceTransactions()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
```cpp
// When XRPL_ENABLE_TELEMETRY is not defined:
class SpanGuard
{
public:
static SpanGuard rpcSpan(std::string_view) { return {}; }
static SpanGuard txSpan(std::string_view) { return {}; }
static SpanGuard consensusSpan(std::string_view) { return {}; }
static SpanGuard peerSpan(std::string_view) { return {}; }
static SpanGuard ledgerSpan(std::string_view) { return {}; }
static SpanGuard span(std::string_view) { return {}; }
#define XRPL_TRACE_CONSENSUS(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceConsensus()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
void setOk() {}
void setStatus(int, std::string_view = "") {}
template<typename T>
void setAttribute(std::string_view, T&&) {}
void addEvent(std::string_view) {}
void recordException(std::exception const&) {}
void discard() {}
};
```
#define XRPL_TRACE_RPC(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceRpc()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
The compiler optimizes away all calls to these empty methods, producing
the same binary as if no instrumentation code were present.
#define XRPL_TRACE_PEER(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTracePeer()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
### 4.3.4 Discard Support
#define XRPL_TRACE_LEDGER(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceLedger()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
SpanGuard supports discarding a span before it is exported. This is
useful for filtering out uninteresting spans (e.g. successful
preflight checks) after the span has been started:
#define XRPL_TRACE_PATHFIND(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTracePathfind()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
#define XRPL_TRACE_TXQ(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceTxQ()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
#define XRPL_TRACE_VALIDATOR(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceValidator()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
#define XRPL_TRACE_AMENDMENT(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceAmendment()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
// Set attribute on current span (if exists).
// Works with both std::optional<SpanGuard> (from conditional macros)
// and bare SpanGuard (from XRPL_TRACE_SPAN). Uses 'if constexpr'-like
// dispatch via a helper that checks for .has_value().
#define XRPL_TRACE_SET_ATTR(key, value) \
do { \
if constexpr (requires { _xrpl_guard_.has_value(); }) { \
if (_xrpl_guard_.has_value()) \
_xrpl_guard_->setAttribute(key, value); \
} else { \
_xrpl_guard_.setAttribute(key, value); \
} \
} while(0)
// Record exception on current span
#define XRPL_TRACE_EXCEPTION(e) \
do { \
if constexpr (requires { _xrpl_guard_.has_value(); }) { \
if (_xrpl_guard_.has_value()) \
_xrpl_guard_->recordException(e); \
} else { \
_xrpl_guard_.recordException(e); \
} \
} while(0)
#else // XRPL_ENABLE_TELEMETRY not defined
#define XRPL_TRACE_SPAN(telemetry, name) ((void)0)
#define XRPL_TRACE_SPAN_KIND(telemetry, name, kind) ((void)0)
#define XRPL_TRACE_TX(telemetry, name) ((void)0)
#define XRPL_TRACE_CONSENSUS(telemetry, name) ((void)0)
#define XRPL_TRACE_RPC(telemetry, name) ((void)0)
#define XRPL_TRACE_PEER(telemetry, name) ((void)0)
#define XRPL_TRACE_LEDGER(telemetry, name) ((void)0)
#define XRPL_TRACE_PATHFIND(telemetry, name) ((void)0)
#define XRPL_TRACE_TXQ(telemetry, name) ((void)0)
#define XRPL_TRACE_VALIDATOR(telemetry, name) ((void)0)
#define XRPL_TRACE_AMENDMENT(telemetry, name) ((void)0)
#define XRPL_TRACE_SET_ATTR(key, value) ((void)0)
#define XRPL_TRACE_EXCEPTION(e) ((void)0)
#endif // XRPL_ENABLE_TELEMETRY
} // namespace telemetry
} // namespace xrpl
```cpp
auto span = telemetry::SpanGuard::txSpan("tx.process");
auto result = preflight(tx);
if (result != tesSUCCESS)
{
// Span is dropped before entering the batch export queue.
span.discard();
return result;
}
```
---
@@ -457,7 +413,7 @@ namespace telemetry {
Add to `src/xrpld/overlay/detail/ripple.proto`:
```protobuf
// Note: rippled uses proto2 syntax. The 'optional' keyword below is valid
// Note: xrpld uses proto2 syntax. The 'optional' keyword below is valid
// in proto2 (it is the default field rule) and is included for clarity.
// Trace context for distributed tracing across nodes
@@ -644,7 +600,7 @@ TraceContextPropagator::inject(
```cpp
// src/xrpld/overlay/detail/PeerImp.cpp (modified)
#include <xrpl/telemetry/TracingInstrumentation.h>
#include <xrpl/telemetry/SpanGuard.h>
void
PeerImp::handleTransaction(
@@ -724,7 +680,7 @@ PeerImp::handleTransaction(
relayGuard.context(), protoCtx);
// Relay to other peers
app_.overlay().relay(
app_.getOverlay().relay(
stx->getTransactionID(),
*m,
protoCtx, // Pass trace context
@@ -749,7 +705,7 @@ PeerImp::handleTransaction(
```cpp
// src/xrpld/app/consensus/RCLConsensus.cpp (modified)
#include <xrpl/telemetry/TracingInstrumentation.h>
#include <xrpl/telemetry/SpanGuard.h>
void
RCLConsensusAdaptor::startRound(
@@ -759,20 +715,18 @@ RCLConsensusAdaptor::startRound(
hash_set<NodeID> const& peers,
bool proposing)
{
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.round");
auto span = telemetry::SpanGuard::consensusSpan("consensus.round");
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.prev", to_string(prevLedgerHash));
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.seq",
span.setAttribute("xrpl.consensus.ledger.prev", to_string(prevLedgerHash));
span.setAttribute("xrpl.consensus.ledger.seq",
static_cast<int64_t>(prevLedger.seq() + 1));
XRPL_TRACE_SET_ATTR("xrpl.consensus.proposers",
span.setAttribute("xrpl.consensus.proposers",
static_cast<int64_t>(peers.size()));
XRPL_TRACE_SET_ATTR("xrpl.consensus.mode",
span.setAttribute("xrpl.consensus.mode",
proposing ? "proposing" : "observing");
// Store trace context for use in phase transitions
currentRoundContext_ = _xrpl_guard_.has_value()
? _xrpl_guard_->context()
: opentelemetry::context::Context{};
currentRoundContext_ = span.context();
// ... existing implementation ...
}
@@ -844,34 +798,22 @@ RCLConsensusAdaptor::peerProposal(
```cpp
// src/xrpld/rpc/detail/ServerHandler.cpp (modified)
#include <xrpl/telemetry/TracingInstrumentation.h>
#include <xrpl/telemetry/SpanGuard.h>
void
ServerHandler::onRequest(
http_request_type&& req,
std::function<void(http_response_type&&)>&& send)
{
// Extract trace context from HTTP headers (W3C Trace Context)
auto parentCtx = telemetry::TraceContextPropagator::extractFromHeaders(
[&req](std::string_view name) -> std::optional<std::string> {
// Beast's find() accepts a string_view for custom header lookup
auto it = req.find(name);
if (it != req.end())
return std::string(it->value());
return std::nullopt;
});
// Start request span
auto span = app_.getTelemetry().startSpan(
"rpc.request",
parentCtx,
opentelemetry::trace::SpanKind::kServer);
telemetry::SpanGuard guard(span);
// SpanGuard::rpcSpan() accesses the global Telemetry instance
// and checks shouldTraceRpc() internally. Returns a no-op guard
// if tracing is disabled.
auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
// Add HTTP attributes
guard.setAttribute("http.method", std::string(req.method_string()));
guard.setAttribute("http.target", std::string(req.target()));
guard.setAttribute("http.user_agent",
span.setAttribute("http.method", std::string(req.method_string()));
span.setAttribute("http.target", std::string(req.target()));
span.setAttribute("http.user_agent",
std::string(req[boost::beast::http::field::user_agent]));
auto const startTime = std::chrono::steady_clock::now();
@@ -885,8 +827,8 @@ ServerHandler::onRequest(
if (!reader.parse(body, jv))
{
guard.setStatus(
opentelemetry::trace::StatusCode::kError,
span.setStatus(
/* kError */ 2,
"Invalid JSON");
sendError(send, "Invalid JSON");
return;
@@ -899,13 +841,12 @@ ServerHandler::onRequest(
? jv["method"].asString()
: "unknown";
guard.setAttribute("xrpl.rpc.command", command);
span.setAttribute("xrpl.rpc.command", command);
// Create child span for command execution
auto cmdSpan = app_.getTelemetry().startSpan(
"rpc.command." + command);
{
telemetry::SpanGuard cmdGuard(cmdSpan);
auto cmdSpan = telemetry::SpanGuard::rpcSpan(
"rpc.command." + command);
// Execute RPC command
auto result = processRequest(jv);
@@ -913,42 +854,42 @@ ServerHandler::onRequest(
// Record result attributes
if (result.isMember("status"))
{
cmdGuard.setAttribute("xrpl.rpc.status",
cmdSpan.setAttribute("xrpl.rpc.status",
result["status"].asString());
}
if (result["status"].asString() == "error")
{
cmdGuard.setStatus(
opentelemetry::trace::StatusCode::kError,
cmdSpan.setStatus(
/* kError */ 2,
result.isMember("error_message")
? result["error_message"].asString()
: "RPC error");
}
else
{
cmdGuard.setOk();
cmdSpan.setOk();
}
}
auto const duration = std::chrono::steady_clock::now() - startTime;
guard.setAttribute("http.duration_ms",
span.setAttribute("http.duration_ms",
std::chrono::duration<double, std::milli>(duration).count());
// Inject trace context into response headers
http_response_type resp;
telemetry::TraceContextPropagator::injectToHeaders(
guard.context(),
span.context(),
[&resp](std::string_view name, std::string_view value) {
resp.set(std::string(name), std::string(value));
});
guard.setOk();
span.setOk();
send(std::move(resp));
}
catch (std::exception const& e)
{
guard.recordException(e);
span.recordException(e);
JLOG(journal_.error()) << "RPC request failed: " << e.what();
sendError(send, e.what());
}
@@ -957,62 +898,43 @@ ServerHandler::onRequest(
### 4.5.4 JobQueue Context Propagation
> **Architecture note**: `JobQueue` and its inner `Workers` class do not
> hold an `Application&` or `ServiceRegistry&`. They receive a
> `perf::PerfLog*` at construction. Because SpanGuard's factory methods
> access the global `Telemetry` instance directly, no `Telemetry&`
> reference needs to be threaded into `JobQueue`.
>
> The approach below captures trace context at job-creation time and
> restores it when the job executes, so that any spans created _inside_
> the job body automatically become children of the original caller's
> trace.
```cpp
// src/xrpld/core/JobQueue.h (modified)
// src/libxrpl/core/detail/JobQueue.cpp (modified -- processTask)
#include <opentelemetry/context/context.h>
class Job
{
// ... existing members ...
// Captured trace context at job creation
opentelemetry::context::Context traceContext_;
public:
// Constructor captures current trace context
Job(JobType type, std::function<void()> func, ...)
: type_(type)
, func_(std::move(func))
, traceContext_(opentelemetry::context::RuntimeContext::GetCurrent())
// ... other initializations ...
{
}
// Get trace context for restoration during execution
opentelemetry::context::Context const&
traceContext() const { return traceContext_; }
};
// src/xrpld/core/JobQueue.cpp (modified)
#include <xrpl/telemetry/SpanGuard.h>
void
Worker::run()
JobQueue::processTask(int instance)
{
while (auto job = getJob())
// ... existing job dequeue logic ...
// SpanGuard::span() uses the global Telemetry instance --
// no Telemetry& member needed on JobQueue.
auto span = telemetry::SpanGuard::span("job.execute");
span.setAttribute("xrpl.job.type", to_string(job.type()));
span.setAttribute("xrpl.job.worker",
static_cast<int64_t>(instance));
try
{
// Restore trace context from job creation
auto token = opentelemetry::context::RuntimeContext::Attach(
job->traceContext());
// Start execution span
auto span = app_.getTelemetry().startSpan("job.execute");
telemetry::SpanGuard guard(span);
guard.setAttribute("xrpl.job.type", to_string(job->type()));
guard.setAttribute("xrpl.job.queue_ms", job->queueTimeMs());
guard.setAttribute("xrpl.job.worker", workerId_);
try
{
job->execute();
guard.setOk();
}
catch (std::exception const& e)
{
guard.recordException(e);
JLOG(journal_.error()) << "Job execution failed: " << e.what();
}
job.execute();
span.setOk();
}
catch (std::exception const& e)
{
span.recordException(e);
JLOG(journal_.error()) << "Job execution failed: " << e.what();
}
}
```
@@ -1029,7 +951,7 @@ flowchart TB
submit["Submit TX"]
end
subgraph NodeA["rippled Node A"]
subgraph NodeA["xrpld Node A"]
rpcA["rpc.request"]
cmdA["rpc.command.submit"]
txRecvA["tx.receive"]
@@ -1037,13 +959,13 @@ flowchart TB
txRelayA["tx.relay"]
end
subgraph NodeB["rippled Node B"]
subgraph NodeB["xrpld Node B"]
txRecvB["tx.receive"]
txValB["tx.validate"]
txRelayB["tx.relay"]
end
subgraph NodeC["rippled Node C"]
subgraph NodeC["xrpld Node C"]
txRecvC["tx.receive"]
consensusC["consensus.round"]
phaseC["consensus.phase.establish"]

View File

@@ -5,7 +5,7 @@
---
## 5.1 rippled Configuration
## 5.1 xrpld Configuration
> **OTLP** = OpenTelemetry Protocol | **TxQ** = Transaction Queue
@@ -61,8 +61,16 @@ Add to `cfg/xrpld-example.cfg`:
# trace_validator=0 # Validator list and manifest updates (low volume)
# trace_amendment=0 # Amendment voting (very low volume)
#
# # Trace ID strategies for cross-node correlation
# # "deterministic" (default) derives trace_id from a workflow hash
# # (txHash for transactions, prevLedgerHash for consensus) so all nodes
# # produce spans under the same trace_id for the same workflow.
# # "attribute" uses random trace_id; correlation via attribute queries.
# tx_trace_strategy=deterministic
# consensus_trace_strategy=deterministic
#
# # Service identification (automatically detected if not specified)
# # service_name=rippled
# # service_name=xrpld
# # service_instance_id=<node_public_key>
[telemetry]
@@ -71,28 +79,30 @@ enabled=0
### 5.1.2 Configuration Options Summary
| Option | Type | Default | Description |
| --------------------- | ------ | ---------------- | ----------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `trace_pathfind` | bool | `true` | Enable path computation tracing |
| `trace_txq` | bool | `true` | Enable transaction queue tracing |
| `trace_validator` | bool | `false` | Enable validator list/manifest tracing |
| `trace_amendment` | bool | `false` | Enable amendment voting tracing |
| `service_name` | string | `"rippled"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
| Option | Type | Default | Description |
| -------------------------- | ------ | ----------------- | ---------------------------------------------------------------------------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `trace_pathfind` | bool | `true` | Enable path computation tracing |
| `trace_txq` | bool | `true` | Enable transaction queue tracing |
| `trace_validator` | bool | `false` | Enable validator list/manifest tracing |
| `trace_amendment` | bool | `false` | Enable amendment voting tracing |
| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
| `service_name` | string | `"xrpld"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
---
@@ -119,7 +129,7 @@ setup_Telemetry(
// Basic settings
setup.enabled = section.value_or("enabled", false);
setup.serviceName = section.value_or("service_name", "rippled");
setup.serviceName = section.value_or("service_name", "xrpld");
setup.serviceVersion = version;
setup.serviceInstanceId = section.value_or(
"service_instance_id", nodePublicKey);
@@ -173,87 +183,97 @@ setup_Telemetry(
### 5.3.1 ApplicationImp Changes
> **Deferred identity**: The node public key (`nodeIdentity_`) is not
> available during `ApplicationImp`'s member initializer list — it is
> resolved later in `setup()`. The `Telemetry` object is therefore
> constructed with an empty `serviceInstanceId` and patched via
> `setServiceInstanceId()` once `setup()` has called `getNodeIdentity()`.
```cpp
// src/xrpld/app/main/Application.cpp (modified)
#include <xrpl/telemetry/Telemetry.h>
class ApplicationImp : public Application
class ApplicationImp : public Application, public BasicApp
{
// ... existing members ...
// ... existing members (perfLog_, etc.) ...
// Telemetry (must be constructed early, destroyed late)
// Telemetry constructed in the member initializer list with
// an empty serviceInstanceId, patched in setup().
std::unique_ptr<telemetry::Telemetry> telemetry_;
public:
ApplicationImp(...)
// Member initializer list (excerpt):
// ...
// , telemetry_(
// telemetry::make_Telemetry(
// telemetry::setup_Telemetry(
// config_->section("telemetry"),
// "", // Updated later via setServiceInstanceId()
// BuildInfo::getVersionString()),
// logs_->journal("Telemetry")))
// ...
bool setup(...) override
{
// Initialize telemetry early (before other components)
auto telemetrySection = config_->section("telemetry");
auto telemetrySetup = telemetry::setup_Telemetry(
telemetrySection,
toBase58(TokenType::NodePublic, nodeIdentity_.publicKey()),
BuildInfo::getVersionString());
// ... existing setup code ...
// Set network attributes
telemetrySetup.networkId = config_->NETWORK_ID;
telemetrySetup.networkType = [&]() {
if (config_->NETWORK_ID == 0) return "mainnet";
if (config_->NETWORK_ID == 1) return "testnet";
if (config_->NETWORK_ID == 2) return "devnet";
return "custom";
}();
nodeIdentity_ = getNodeIdentity(*this, cmdline);
telemetry_ = telemetry::make_Telemetry(
telemetrySetup,
logs_->journal("Telemetry"));
// Inject node identity into telemetry resource attributes,
// unless the user already set a custom service_instance_id.
if (!config_->section("telemetry").exists("service_instance_id"))
telemetry_->setServiceInstanceId(
toBase58(TokenType::NodePublic, nodeIdentity_->first));
// ... rest of initialization ...
// ... rest of setup ...
}
void start() override
void start(bool withTimers) override
{
// Start telemetry first
if (telemetry_)
telemetry_->start();
// ... existing start code ...
telemetry_->start();
}
void stop() override
void run() override
{
// ... existing stop code ...
// Stop telemetry last (to capture shutdown spans)
if (telemetry_)
telemetry_->stop();
// ... existing run/shutdown code ...
telemetry_->stop();
}
telemetry::Telemetry& getTelemetry() override
telemetry::Telemetry&
getTelemetry() override
{
assert(telemetry_);
return *telemetry_;
}
};
```
### 5.3.2 Application Interface Addition
### 5.3.2 ServiceRegistry Interface Addition
```cpp
// include/xrpl/app/main/Application.h (modified)
// include/xrpl/core/ServiceRegistry.h (modified)
namespace telemetry { class Telemetry; }
namespace telemetry {
class Telemetry;
} // namespace telemetry
class Application
class ServiceRegistry
{
public:
// ... existing virtual methods ...
/** Get the telemetry system for distributed tracing */
virtual telemetry::Telemetry& getTelemetry() = 0;
/** Get the telemetry system for distributed tracing. */
virtual telemetry::Telemetry&
getTelemetry() = 0;
};
```
> **Note:** `Application` extends `ServiceRegistry`, so `getTelemetry()` is
> available on both. Components that hold a `ServiceRegistry&` (e.g.
> `NetworkOPsImp`) call `registry_.get().getTelemetry()`. Components that
> still hold an `Application&` (e.g. `ServerHandler`, `PeerImp`,
> `RCLConsensusAdaptor`) call `app_.getTelemetry()` directly.
---
## 5.4 CMake Integration
@@ -403,13 +423,7 @@ exporters:
sampling_initial: 5
sampling_thereafter: 200
# Tempo for trace visualization
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
# Grafana Tempo for trace storage
# Tempo for trace storage
otlp/tempo:
endpoint: tempo:4317
tls:
@@ -420,7 +434,7 @@ service:
traces:
receivers: [otlp]
processors: [batch]
exporters: [logging, jaeger, otlp/tempo]
exporters: [logging, otlp/tempo]
```
### 5.5.2 Production Configuration
@@ -554,7 +568,7 @@ services:
depends_on:
- tempo
# Tempo for trace visualization
# Tempo for trace storage
tempo:
image: grafana/tempo:2.6.1
container_name: tempo
@@ -562,17 +576,6 @@ services:
- "3200:3200" # Tempo HTTP API
- "4317" # OTLP gRPC (internal)
# Grafana Tempo for trace storage (recommended for production)
tempo:
image: grafana/tempo:2.7.2
container_name: tempo
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml:ro
- tempo-data:/var/tempo
ports:
- "3200:3200" # HTTP API
# Grafana for dashboards
grafana:
image: grafana/grafana:10.2.3
@@ -586,7 +589,6 @@ services:
ports:
- "3000:3000"
depends_on:
- jaeger
- tempo
# Prometheus for metrics (optional, for correlation)
@@ -600,7 +602,7 @@ services:
networks:
default:
name: rippled-telemetry
name: xrpld-telemetry
```
---
@@ -653,7 +655,7 @@ flowchart TB
- **Configuration Sources**: `xrpld.cfg` provides runtime settings (endpoint, sampling) while the CMake flag controls whether telemetry is compiled in at all.
- **Initialization**: `setup_Telemetry()` parses config values, then `make_Telemetry()` constructs the provider, processor, and exporter objects.
- **Runtime Components**: The `TracerProvider` creates spans, the `BatchProcessor` buffers them, and the `OTLP Exporter` serializes and sends them over the wire.
- **OTLP arrow to Collector**: Trace data leaves the rippled process via OTLP (gRPC or HTTP) and enters the external Collector pipeline.
- **OTLP arrow to Collector**: Trace data leaves the xrpld process via OTLP (gRPC or HTTP) and enters the external Collector pipeline.
- **Collector Pipeline**: `Receivers` ingest OTLP data, `Processors` apply sampling/filtering/enrichment, and `Exporters` forward traces to storage backends (Tempo, etc.).
---
@@ -662,7 +664,7 @@ flowchart TB
> **APM** = Application Performance Monitoring
Step-by-step instructions for integrating rippled traces with Grafana.
Step-by-step instructions for integrating xrpld traces with Grafana.
### 5.8.1 Data Source Configuration
@@ -721,10 +723,10 @@ datasources:
apiVersion: 1
providers:
- name: "rippled-dashboards"
- name: "xrpld-dashboards"
orgId: 1
folder: "rippled"
folderUid: "rippled"
folder: "xrpld"
folderUid: "xrpld"
type: file
disableDeletion: false
updateIntervalSeconds: 30
@@ -736,8 +738,8 @@ providers:
```json
{
"title": "rippled RPC Performance",
"uid": "rippled-rpc-performance",
"title": "xrpld RPC Performance",
"uid": "xrpld-rpc-performance",
"panels": [
{
"title": "RPC Latency by Command",
@@ -746,7 +748,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && span.xrpl.rpc.command != \"\"} | histogram_over_time(duration) by (span.xrpl.rpc.command)"
"query": "{resource.service.name=\"xrpld\" && span.xrpl.rpc.command != \"\"} | histogram_over_time(duration) by (span.xrpl.rpc.command)"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
@@ -758,7 +760,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && status.code=error} | rate() by (span.xrpl.rpc.command)"
"query": "{resource.service.name=\"xrpld\" && status.code=error} | rate() by (span.xrpl.rpc.command)"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
@@ -770,7 +772,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && span.xrpl.rpc.command != \"\"} | avg(duration) by (span.xrpl.rpc.command) | topk(10)"
"query": "{resource.service.name=\"xrpld\" && span.xrpl.rpc.command != \"\"} | avg(duration) by (span.xrpl.rpc.command) | topk(10)"
}
],
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 }
@@ -782,7 +784,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\"}"
"query": "{resource.service.name=\"xrpld\"}"
}
],
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 }
@@ -795,8 +797,8 @@ providers:
```json
{
"title": "rippled Transaction Tracing",
"uid": "rippled-tx-tracing",
"title": "xrpld Transaction Tracing",
"uid": "xrpld-tx-tracing",
"panels": [
{
"title": "Transaction Throughput",
@@ -805,7 +807,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | rate()"
"query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | rate()"
}
],
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 }
@@ -817,7 +819,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"tx.relay\"} | avg(span.xrpl.tx.relay_count)"
"query": "{resource.service.name=\"xrpld\" && name=\"tx.relay\"} | avg(span.xrpl.tx.relay_count)"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 }
@@ -829,7 +831,7 @@ providers:
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"tx.validate\" && status.code=error}"
"query": "{resource.service.name=\"xrpld\" && name=\"tx.validate\" && status.code=error}"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 }
@@ -840,26 +842,26 @@ providers:
### 5.8.5 TraceQL Query Examples
Common queries for rippled traces:
Common queries for xrpld traces:
```
# Find all traces for a specific transaction hash
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}
{resource.service.name="xrpld" && span.xrpl.tx.hash="ABC123..."}
# Find slow RPC commands (>100ms)
{resource.service.name="rippled" && name=~"rpc.command.*"} | duration > 100ms
{resource.service.name="xrpld" && name=~"rpc.command.*"} | duration > 100ms
# Find consensus rounds taking >5 seconds
{resource.service.name="rippled" && name="consensus.round"} | duration > 5s
{resource.service.name="xrpld" && name="consensus.round"} | duration > 5s
# Find failed transactions with error details
{resource.service.name="rippled" && name="tx.validate" && status.code=error}
{resource.service.name="xrpld" && name="tx.validate" && status.code=error}
# Find transactions relayed to many peers
{resource.service.name="rippled" && name="tx.relay"} | span.xrpl.tx.relay_count > 10
{resource.service.name="xrpld" && name="tx.relay"} | span.xrpl.tx.relay_count > 10
# Compare latency across nodes
{resource.service.name="rippled" && name="rpc.command.account_info"} | avg(duration) by (resource.service.instance.id)
{resource.service.name="xrpld" && name="rpc.command.account_info"} | avg(duration) by (resource.service.instance.id)
```
### 5.8.6 Correlation with PerfLog
@@ -871,12 +873,12 @@ To correlate OpenTelemetry traces with existing PerfLog data:
```yaml
# promtail-config.yaml
scrape_configs:
- job_name: rippled-perflog
- job_name: xrpld-perflog
static_configs:
- targets:
- localhost
labels:
job: rippled
job: xrpld
__path__: /var/log/rippled/perf*.log
pipeline_stages:
- json:
@@ -921,22 +923,18 @@ jsonData:
filterBySpanID: false
```
### 5.8.7 Correlation with Insight/OTel System Metrics
### 5.8.7 Correlation with Insight/StatsD Metrics
To correlate traces with Beast Insight system metrics:
To correlate traces with existing Beast Insight metrics:
**Step 1: Export Insight metrics to Prometheus**
Beast Insight metrics are exported natively via OTLP to the OTel Collector,
which exposes them on the Prometheus endpoint alongside spanmetrics. No
separate StatsD exporter is needed when using `server=otel`.
```ini
# xrpld.cfg — native OTel metrics (recommended)
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```yaml
# prometheus.yaml
scrape_configs:
- job_name: "xrpld-statsd"
static_configs:
- targets: ["statsd-exporter:9102"]
```
**Step 2: Add exemplars to metrics**
@@ -962,7 +960,7 @@ jsonData:
"datasource": "Prometheus",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(rippled_rpc_duration_seconds_bucket[5m]))",
"expr": "histogram_quantile(0.99, rate(xrpld_rpc_duration_seconds_bucket[5m]))",
"exemplar": true
}
]

View File

@@ -46,10 +46,8 @@ gantt
Consensus Tracing :p4, after p3, 2w
Consensus Round Spans :p4a, after p3, 3d
Proposal Handling :p4b, after p4a, 3d
Validator List & Manifest Tracing :p4f, after p4b, 2d
Amendment Voting Tracing :p4g, after p4f, 2d
SHAMap Sync Tracing :p4h, after p4g, 2d
Validation Tests :p4c, after p4h, 4d
Establish Phase (4a) :p4f, after p4b, 3d
Validation Tests :p4c, after p4f, 4d
Buffer & Review :p4e, after p4c, 4d
section Phase 5
@@ -136,21 +134,31 @@ gantt
## 6.4 Phase 3: Transaction Tracing (Weeks 5-6)
**Objective**: Trace transaction lifecycle across network
**Objective**: Trace transaction lifecycle across network with deterministic cross-node correlation
### Tasks
| Task | Description |
| ---- | ---------------------------------------------------- |
| 3.1 | Define `TraceContext` Protocol Buffer message |
| 3.2 | Implement protobuf context serialization |
| 3.3 | Instrument `PeerImp::handleTransaction()` |
| 3.4 | Instrument `NetworkOPs::submitTransaction()` |
| 3.5 | Instrument HashRouter integration |
| 3.6 | Fee escalation instrumentation (`fee.escalate` span) |
| 3.7 | Implement relay context propagation |
| 3.8 | Integration tests (multi-node) |
| 3.9 | Performance benchmarks |
| Task | Description |
| ---- | -------------------------------------------------------------- |
| 3.1 | Define `TraceContext` Protocol Buffer message |
| 3.2 | Implement protobuf context serialization |
| 3.3 | Instrument `PeerImp::handleTransaction()` |
| 3.4 | Instrument `NetworkOPs::submitTransaction()` |
| 3.5 | Instrument HashRouter integration |
| 3.6 | Fee escalation instrumentation (`fee.escalate` span) |
| 3.7 | Implement relay context propagation |
| 3.8 | Integration tests (multi-node) |
| 3.9 | Deterministic transaction trace ID (`trace_id = txHash[0:16]`) |
| 3.10 | Performance benchmarks |
### Deterministic Trace ID (Task 3.9)
Transaction spans use **deterministic trace IDs** derived from the transaction hash:
`trace_id = txHash[0:16]`. All nodes handling the same transaction independently
produce spans under the same trace_id. Protobuf `span_id` propagation (Task 3.7)
additionally provides parent-child relay ordering when available. See
[02-design-decisions.md §2.5.0](./02-design-decisions.md) for the design rationale
and [Phase3_taskList.md Task 3.9](./Phase3_taskList.md) for the full implementation spec.
### Exit Criteria
@@ -159,6 +167,8 @@ gantt
- [ ] HashRouter deduplication visible in traces
- [ ] Multi-node integration tests passing
- [ ] <5% overhead on transaction throughput
- [ ] Deterministic trace_id: all nodes produce same trace_id for same transaction
- [ ] Protobuf span_id propagation preserves parent-child ordering when available
---
@@ -168,38 +178,44 @@ gantt
### Tasks
| Task | Description |
| ---- | ---------------------------------------------- |
| 4.1 | Instrument `RCLConsensusAdaptor::startRound()` |
| 4.2 | Instrument phase transitions |
| 4.3 | Instrument proposal handling |
| 4.4 | Instrument validation handling |
| 4.5 | Add consensus-specific attributes |
| 4.6 | Correlate with transaction traces |
| 4.7 | Validator list and manifest tracing |
| 4.8 | Amendment voting tracing |
| 4.9 | SHAMap sync tracing |
| 4.10 | Multi-validator integration tests |
| 4.11 | Performance validation |
| Task | Description | Status |
| ---- | ---------------------------------------------- | ------------------ |
| 4.1 | Instrument `RCLConsensusAdaptor::startRound()` | Done (via 4a.2) |
| 4.2 | Instrument phase transitions | Done |
| 4.3 | Instrument proposal handling | Done |
| 4.4 | Instrument validation handling | Done |
| 4.5 | Add consensus-specific attributes | Done |
| 4.6 | Correlate with transaction traces | Done |
| 4.7 | Build verification and testing | Done |
| 4.8 | Validation span enrichment (ext. dashboard) | Not done |
**Note**: The original plan doc listed tasks 4.7-4.11 as "Validator list tracing",
"Amendment voting tracing", "SHAMap sync tracing", "Multi-validator integration tests",
and "Performance validation". These were descoped and replaced by the tasklist's 4.7
(build verification) and 4.8 (validation span enrichment). Validator, amendment, and
SHAMap tracing are not implemented.
### Spans Produced
| Span Name | Location | Attributes |
| --------------------------- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `consensus.proposal.send` | `RCLConsensus.cpp:177` | `xrpl.consensus.round` |
| `consensus.ledger_close` | `RCLConsensus.cpp:282` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| `consensus.accept` | `RCLConsensus.cpp:395` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` |
| `consensus.accept.apply` | `RCLConsensus.cpp:521` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` |
| `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` |
| `consensus.phase.open` | `Consensus.h:707` | _(none)_ |
| `consensus.proposal.send` | `RCLConsensus.cpp:232` | `xrpl.consensus.round` |
| `consensus.ledger_close` | `RCLConsensus.cpp:341` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| `consensus.accept` | `RCLConsensus.cpp:492` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms`, `xrpl.consensus.quorum` |
| `consensus.accept.apply` | `RCLConsensus.cpp:541` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` |
| `consensus.validation.send` | `RCLConsensus.cpp:900` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` |
### Exit Criteria
- [x] Complete consensus round traces
- [x] Phase transitions visible
- [x] Proposals and validations traced
- [x] Phase transitions visible (open, establish, close, accept)
- [x] Proposals and validations traced send and receive; relay deferred to Phase 4b
- [x] Close time agreement tracked (per `avCT_CONSENSUS_PCT`)
- [x] No impact on consensus timing
- [ ] Multi-validator test network validated
- [x] Transaction-consensus correlation (Task 4.6) `tx.included` events in doAccept
- [ ] Validation span enrichment (Task 4.8) not implemented
### Implementation Status — Phase 4a Complete
@@ -230,44 +246,47 @@ See [Phase4_taskList.md](./Phase4_taskList.md) for the full spec and implementat
**Objective**: Fill tracing gaps in the establish phase and establish cross-node
correlation using deterministic trace IDs derived from `previousLedger.id()`.
**Approach**: Direct instrumentation in `Consensus.h`. Long-lived spans use
direct SpanGuard members; short-lived scoped spans use `XRPL_TRACE_*` macros.
**Approach**: Direct instrumentation in `Consensus.h` and `RCLConsensus.cpp`.
All spans use `SpanGuard` factory methods (`span()`, `hashSpan()`, `linkedSpan()`)
with `TraceCategory::Consensus` gating. No macros used all tracing via direct
`SpanGuard` API calls.
### Tasks
| Task | Description | Effort | Risk |
| ---- | ------------------------------------------------ | ------ | ------ |
| 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | 1d | Medium |
| 4a.1 | Adaptor `getTelemetry()` method | 0.5d | Low |
| 4a.2 | Switchable round span with deterministic traceID | 2d | High |
| 4a.3 | Span members in `Consensus.h` | 0.5d | Medium |
| 4a.4 | Instrument `phaseEstablish()` | 1d | Medium |
| 4a.5 | Instrument `updateOurPositions()` | 1d | Medium |
| 4a.6 | Instrument `haveConsensus()` (thresholds) | 1d | Medium |
| 4a.7 | Instrument mode changes | 0.5d | Low |
| 4a.8 | Reparent existing spans under round | 0.5d | Low |
| 4a.9 | Build verification and testing | 1d | Low |
| Task | Description | Effort | Risk | Status |
| ---- | ------------------------------------------------ | ------ | ------ | ------------------------ |
| 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | 1d | Medium | Done (no macros) |
| 4a.1 | Adaptor `getTelemetry()` method | 0.5d | Low | Skipped (not needed) |
| 4a.2 | Switchable round span with deterministic traceID | 2d | High | Done |
| 4a.3 | Span members in `Consensus.h` | 0.5d | Medium | Done (with deviation) |
| 4a.4 | Instrument `phaseEstablish()` | 1d | Medium | Done |
| 4a.5 | Instrument `updateOurPositions()` | 1d | Medium | Done |
| 4a.6 | Instrument `haveConsensus()` (thresholds) | 1d | Medium | Done |
| 4a.7 | Instrument mode changes | 0.5d | Low | Done |
| 4a.8 | Reparent existing spans under round | 0.5d | Low | Done |
| 4a.9 | Build verification and testing | 1d | Low | Done |
**Total Effort**: 9 days
### Spans Produced
| Span Name | Location | Key Attributes |
| ---------------------------- | ------------------ | ---------------------------------------------------------------- |
| `consensus.round` | `RCLConsensus.cpp` | `round_id`, `ledger_id`, `ledger.seq`, `mode`; link prev round |
| `consensus.establish` | `Consensus.h` | `converge_percent`, `establish_count`, `proposers` |
| `consensus.update_positions` | `Consensus.h` | `disputes_count`, `converge_percent`, `proposers_agreed/total` |
| `consensus.check` | `Consensus.h` | `agree/disagree_count`, `threshold_percent`, `result` |
| `consensus.mode_change` | `RCLConsensus.cpp` | `mode.old`, `mode.new` |
| Span Name | Location | Key Attributes (actually set) |
| ---------------------------- | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| `consensus.round` | `RCLConsensus.cpp` | `round_id`, `ledger_id`, `ledger.seq`, `mode`, `trace_strategy` |
| `consensus.establish` | `Consensus.h` | `converge_percent`, `establish_count`, `proposers` |
| `consensus.update_positions` | `Consensus.h` | `converge_percent`, `proposers`, `have_close_time_consensus`, `close_time_threshold`, `disputes_count`, `avalanche_threshold` |
| `consensus.check` | `Consensus.h` | `agree/disagree_count`, `converge_percent`, `have_close_time_consensus`, `threshold_percent`, `result` |
| `consensus.mode_change` | `RCLConsensus.cpp` | `mode.old`, `mode.new` |
### Exit Criteria
- [ ] Establish phase internals fully traced (disputes, convergence, thresholds)
- [ ] Cross-node correlation works via deterministic trace_id
- [ ] Strategy switchable via config (`deterministic` / `attribute`)
- [ ] Consecutive rounds linked via follows-from spans
- [ ] Build passes with telemetry ON and OFF
- [ ] No impact on consensus timing
- [x] Establish phase internals traced (establish, update_positions, check spans)
- [x] Establish phase fully traced `disputes_count`, `avalanche_threshold`, dispute `yays`/`nays` all implemented
- [x] Cross-node correlation works via deterministic trace_id
- [x] Strategy switchable via config (`deterministic` / `attribute`)
- [x] Consecutive rounds linked via follows-from spans
- [x] Build passes with telemetry ON and OFF
- [x] No impact on consensus timing
See [Phase4_taskList.md](./Phase4_taskList.md) for full task details.
@@ -279,7 +298,7 @@ See [Phase4_taskList.md](./Phase4_taskList.md) for full task details.
validations) to enable true distributed tracing between nodes.
**Status**: Design documented, NOT implemented. Protobuf fields (field 1001)
and `TraceContextPropagator` class exist. Wiring deferred until Phase 4a is
and `TraceContextPropagator` free functions exist. Wiring deferred until Phase 4a is
validated in a multi-node environment.
**Prerequisites**: Phase 4a complete and validated.
@@ -294,25 +313,25 @@ See [Phase4_taskList.md § Phase 4b](./Phase4_taskList.md) for full design.
### Tasks
| Task | Description |
| ---- | ----------------------------- |
| 5.1 | Operator runbook |
| 5.2 | Grafana dashboards |
| 5.3 | Alert definitions |
| 5.4 | Collector deployment examples |
| 5.5 | Developer documentation |
| 5.6 | Training materials |
| 5.7 | Final integration testing |
| Task | Description | Status |
| ---- | ----------------------------- | ------------------- |
| 5.1 | Operator runbook | Complete |
| 5.2 | Grafana dashboards | Complete |
| 5.3 | Alert definitions | Deferred post-MVP |
| 5.4 | Collector deployment examples | Complete |
| 5.5 | Developer documentation | Complete |
| 5.6 | Training materials | Deferred post-MVP |
| 5.7 | Final integration testing | Complete |
---
## 6.7 Phase 6: StatsD Metrics Integration (Week 10)
**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.
**Objective**: Bridge xrpld's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.
### Background
rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
xrpld has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
### Metric Inventory
@@ -342,8 +361,8 @@ rippled has a mature metrics framework (`beast::insight`) that emits StatsD-form
| 6.2 | Add `statsd` receiver to OTel Collector config |
| 6.3 | Expose UDP port 8125 in docker-compose.yml |
| 6.4 | Add `[insight]` config to integration test node configs |
| 6.5 | Create "Node Health" Grafana dashboard (8 panels) |
| 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) |
| 6.5 | Create "Node Health" Grafana dashboard (16 panels) |
| 6.6 | Create "Network Traffic" Grafana dashboard (10 panels) |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus |
| 6.9 | Update TESTING.md and telemetry-runbook.md |
@@ -356,204 +375,24 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi
### New Grafana Dashboards
**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):
**Node Health** (`statsd-node-health.json`, uid: `xrpld-statsd-node-health`):
- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches
- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches, Key Jobs Execution/Dequeue Time, FullBelowCache Size/Hit Rate, Ledger Publish Gap, State Duration Rate, All Jobs Detail
**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):
**Network Traffic** (`statsd-network-traffic.json`, uid: `xrpld-statsd-network`):
- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories
- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories, Duplicate Traffic, All Traffic Categories Detail
**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):
**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `xrpld-statsd-rpc`):
- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap
### Exit Criteria
- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=xrpld_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping)
---
## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12)
**Objective**: Replace `StatsDCollector` with a native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline.
### Motivation: Why Migrate from StatsD to Native OTel Metrics
The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves.
#### What We Gain
1. **Unified telemetry pipeline** Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. Eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."
2. **Eliminates StatsD UDP limitations** StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, 1472-byte MTU packet fragmentation, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.
3. **Fixes the `|m` wire format issue** The `StatsDMeterImpl` uses non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely (Phase 6 Task 6.1 DEFERRED becomes resolved).
4. **Richer metric semantics** OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.
5. **Removes infrastructure dependency** No more StatsD receiver needed in the OTel Collector. One less receiver to configure, monitor, and debug. Simplifies the collector YAML.
6. **Metric-to-trace correlation** OTel metrics and traces share the same resource attributes (service.name, service.instance.id). Grafana can link from a metric spike directly to the traces that caused it impossible with StatsD-sourced metrics.
7. **Production-grade export** OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown all built into the SDK rather than hand-rolled in `StatsDCollectorImp`.
#### What We Lose
1. **StatsD ecosystem compatibility** Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraph) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback.
2. **Simplicity of UDP** StatsD's UDP fire-and-forget model is dead simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it's more moving parts.
3. **Slightly higher memory** OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.
4. **Dependency on OTel C++ Metrics SDK stability** The Metrics SDK is GA since 1.0 and on version 1.18.0, but it's less battle-tested than the tracing SDK in the C++ ecosystem.
#### Decision
The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period.
### Architecture
#### Class Hierarchy (after Phase 7)
```
beast::insight::Collector (abstract interface — unchanged)
|
+-- StatsDCollector (existing — retained as fallback, deprecated)
| +-- StatsDCounterImpl -> StatsD |c over UDP
| +-- StatsDGaugeImpl -> StatsD |g over UDP
| +-- StatsDMeterImpl -> StatsD |m over UDP (non-standard)
| +-- StatsDEventImpl -> StatsD |ms over UDP
| +-- StatsDHookImpl -> 1s periodic callback
|
+-- NullCollector (existing — unchanged, used when disabled)
| +-- NullCounterImpl -> no-op
| +-- NullGaugeImpl -> no-op
| +-- NullMeterImpl -> no-op
| +-- NullEventImpl -> no-op
| +-- NullHookImpl -> no-op
|
+-- OTelCollector (NEW — Phase 7)
+-- OTelCounterImpl -> otel::Counter<int64_t>
+-- OTelGaugeImpl -> otel::ObservableGauge<uint64_t>
+-- OTelMeterImpl -> otel::Counter<uint64_t>
+-- OTelEventImpl -> otel::Histogram<double>
+-- OTelHookImpl -> 1s periodic callback (same pattern)
```
#### Data Flow (after Phase 7)
```mermaid
graph LR
subgraph rippledNode["rippled Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN"]
B["beast::insight<br/>OTelCollector"]
end
subgraph collector["OTel Collector :4317 / :4318"]
direction TB
R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
BP["Batch Processor"]
SM["SpanMetrics Connector"]
R1 --> BP
BP --> SM
end
subgraph backends["Trace Backends"]
D["Jaeger / Tempo"]
end
subgraph metrics["Metrics Stack"]
E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
end
subgraph viz["Visualization"]
F["Grafana :3000"]
end
A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1
BP -->|"OTLP/gRPC"| D
SM -->|"RED metrics"| E
R1 -->|"rippled_* metrics<br/>(native OTLP)"| E
E --> F
D --> F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#d9534f,color:#fff,stroke:#b52d2d
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```
**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.
#### Configuration
```ini
# [insight] section — new "otel" server option
[insight]
server=otel # NEW: uses OTel OTLP metrics exporter
prefix=rippled # metric name prefix (preserved)
# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```
The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed.
**Backward compatibility**: `server=statsd` continues to work exactly as before.
See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown.
### Instrument Type Mapping
| beast::insight | OTel Metrics SDK | Rationale |
| ---------------------- | -------------------------------- | ---------------------------------------------------------------- |
| Counter (int64, `\|c`) | `Counter<int64_t>` | Direct 1:1 mapping |
| Gauge (uint64, `\|g`) | `ObservableGauge<uint64_t>` | Async callback matches existing Hook polling pattern |
| Meter (uint64, `\|m`) | `Counter<uint64_t>` | Fixes non-standard wire format; meters are semantically counters |
| Event (ms, `\|ms`) | `Histogram<double>` | Duration distributions with explicit bucket boundaries |
| Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval |
### Tasks
| Task | Description |
| ---- | ------------------------------------------------------------------------- |
| 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) |
| 7.2 | Implement `OTelCollector` class (~400-500 lines) |
| 7.3 | Update `CollectorManager` add `server=otel` |
| 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) |
| 7.5 | Preserve metric names in Prometheus (naming strategy) |
| 7.6 | Update Grafana dashboards (if names change) |
| 7.7 | Update integration tests |
| 7.8 | Update documentation (runbook, reference docs) |
### Exit Criteria
- [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver)
- [ ] `server=otel` is the default in development docker-compose
- [ ] `server=statsd` still works as a fallback
- [ ] Existing Grafana dashboards display data correctly
- [ ] Integration test passes with OTLP-only metrics pipeline
- [ ] No performance regression vs StatsD baseline (< 1% CPU overhead)
- [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ DEFERRED (breaking change, tracked separately)
---
@@ -561,14 +400,14 @@ See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown.
### Motivation
rippled's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.
xrpld's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.
#### Gains
1. **One-click trace-to-log navigation** Click a trace in Tempo/Jaeger and immediately see the corresponding log lines in Loki, filtered by `trace_id`.
2. **Reverse lookup (log-to-trace)** Loki derived fields make `trace_id` values clickable links back to Tempo.
3. **Unified observability** All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance.
4. **Zero new dependencies in rippled** Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1.
4. **Zero new dependencies in xrpld** Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1.
5. **Negligible overhead** `GetSpan()` + `GetContext()` are thread-local reads (<10ns/call). At ~1000 JLOG calls/min, this adds <10us/min.
#### Losses / Risks
@@ -586,13 +425,13 @@ The correlation value far outweighs the risks. The log format change is backward
Phase 8 has two independent sub-phases that can be developed in parallel:
- **Phase 8a (code change)**: Modify `Logs::format()` in `src/libxrpl/basics/Log.cpp` to append `trace_id=<hex32> span_id=<hex16>` when the current thread has an active OTel span. Guarded by `#ifdef XRPL_ENABLE_TELEMETRY`.
- **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail rippled's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana TempoLoki bidirectional linking.
- **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail xrpld's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana TempoLoki bidirectional linking.
#### Trace ID Injection Flow
```mermaid
flowchart LR
subgraph rippled["rippled process"]
subgraph xrpld["xrpld process"]
JLOG["JLOG(j.info())"]
Format["Logs::format()"]
OTelCtx["OTel Context<br/>(thread-local)"]
@@ -606,7 +445,7 @@ flowchart LR
Format --> LogLine
style rippled fill:#1a237e,stroke:#0d1642,color:#fff
style xrpld fill:#1a237e,stroke:#0d1642,color:#fff
style output fill:#1b5e20,stroke:#0d3d14,color:#fff
style JLOG fill:#283593,stroke:#1a237e,color:#fff
style Format fill:#283593,stroke:#1a237e,color:#fff
@@ -626,7 +465,7 @@ flowchart LR
FR --> RP --> BP --> LE
end
LogFile["rippled<br/>debug.log"] --> FR
LogFile["xrpld<br/>debug.log"] --> FR
LE --> Loki["Grafana Loki<br/>:3100"]
Loki <-->|"derivedFields ↔<br/>tracesToLogs"| Tempo["Grafana Tempo"]
@@ -657,7 +496,7 @@ flowchart LR
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
- [ ] Log lines outside spans have no trace context (no empty fields)
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
- [ ] Loki ingests xrpld logs via OTel Collector filelog receiver
- [ ] Grafana Tempo Loki one-click correlation works
- [ ] Grafana Loki Tempo reverse lookup works via derived field
- [ ] Integration test verifies trace_id presence in logs
@@ -671,7 +510,7 @@ flowchart LR
### Motivation
Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~68 metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~68 metrics that exist inside xrpld's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts.
### Architecture
@@ -679,7 +518,7 @@ Hybrid approach — two instrumentation strategies based on proximity to existin
```mermaid
flowchart TB
subgraph rippled["rippled process"]
subgraph xrpld["xrpld process"]
subgraph existing["Existing beast::insight registrations"]
NS["NodeStore I/O<br/>(Database.cpp)"]
end
@@ -707,7 +546,7 @@ flowchart TB
BI --> OTLP["OTLP/HTTP :4318<br/>/v1/metrics"]
OS --> OTLP
style rippled fill:#1a2633,color:#ccc,stroke:#4a90d9
style xrpld fill:#1a2633,color:#ccc,stroke:#4a90d9
style existing fill:#2a4a6b,color:#fff,stroke:#4a90d9
style newreg fill:#2a4a6b,color:#fff,stroke:#4a90d9
style export fill:#1a3320,color:#ccc,stroke:#5cb85c
@@ -845,7 +684,7 @@ See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown
### Motivation
rippled has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (`server_info`, `get_counts`, `fee`, `peers`, `validators`, `feature`). Every external consumer exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver.
xrpld has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (`server_info`, `get_counts`, `fee`, `peers`, `validators`, `feature`). Every external consumer exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver.
### Architecture
@@ -861,7 +700,7 @@ flowchart LR
DX["DEX/AMM<br/>collector<br/>(optional)"]
end
rippled["rippled<br/>Admin RPC<br/>:5005"] -->|"JSON-RPC<br/>poll every 30s"| receiver
xrpld["xrpld<br/>Admin RPC<br/>:5005"] -->|"JSON-RPC<br/>poll every 30s"| receiver
receiver -->|"xrpl_* metrics"| PROM["Prometheus<br/>:9090"]
receiver -->|"OTLP export"| OTLP["Any OTLP-<br/>compatible<br/>backend"]
@@ -876,7 +715,7 @@ flowchart LR
style PE fill:#5cb85c,color:#fff,stroke:#3d8b3d
style VA fill:#5cb85c,color:#fff,stroke:#3d8b3d
style DX fill:#449d44,color:#fff,stroke:#2d6e2d
style rippled fill:#4a90d9,color:#fff,stroke:#2a6db5
style xrpld fill:#4a90d9,color:#fff,stroke:#2a6db5
style PROM fill:#f0ad4e,color:#000,stroke:#c78c2e
style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
style GF fill:#5bc0de,color:#000,stroke:#3aa8c1
@@ -921,7 +760,7 @@ See [Phase11_taskList.md](./Phase11_taskList.md) for detailed per-task breakdown
- [ ] Custom OTel Collector receiver exports all `xrpl_*` metrics to Prometheus
- [ ] 4 new Grafana dashboards operational (Validator Health, Network Topology, Fee Market, DEX/AMM)
- [ ] Prometheus alerting rules fire correctly for simulated failures
- [ ] Receiver handles rippled restart/unavailability gracefully
- [ ] Receiver handles xrpld restart/unavailability gracefully
- [ ] Go receiver has unit tests with >80% coverage
---
@@ -994,7 +833,7 @@ flowchart TB
subgraph run["🏃 RUN (Week 6-9)"]
direction LR
r1[Consensus Tracing] ~~~ r2[Validator, Amendment,<br/>SHAMap Tracing] ~~~ r3[Full Correlation] ~~~ r4[Production Deploy]
r1[Consensus Tracing] ~~~ r2[Establish Phase<br/>& Cross-Node Correlation] ~~~ r3[StatsD Integration] ~~~ r4[Production Deploy]
end
crawl --> walk --> run
@@ -1022,7 +861,7 @@ flowchart TB
- **CRAWL (Weeks 1-2)**: Minimal investment -- set up the SDK, instrument RPC and PathFinding/TxQ handlers, and verify on a single node. Delivers immediate latency visibility.
- **WALK (Weeks 3-5)**: Expand to transaction lifecycle tracing, fee escalation, cross-node context propagation, and basic Grafana dashboards. This is where distributed tracing starts working.
- **RUN (Weeks 6-9)**: Full consensus instrumentation, validator/amendment/SHAMap tracing, end-to-end correlation, and production deployment with sampling and alerting.
- **RUN (Weeks 6-9)**: Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
### 6.9.2 Quick Wins (Immediate Value)
@@ -1087,17 +926,17 @@ flowchart TB
- Complete consensus round visibility
- Phase transition timing
- Validator proposal tracking
- Validator list and manifest tracing
- Amendment voting tracing
- SHAMap sync tracing
- Full end-to-end traces (client → RPC → TX → consensus → ledger)
- ~~Validator list and manifest tracing~~ — descoped
- ~~Amendment voting tracing~~ — descoped
- ~~SHAMap sync tracing~~ — descoped
- Full end-to-end traces (client → RPC → TX → consensus → ledger) — partial (tx-consensus correlation not yet done)
**Code Changes**: ~100 lines across 3 consensus files, plus validator/amendment/SHAMap modules
**Code Changes**: ~100 lines across 3 consensus files
**Why Do This Last**:
- Highest complexity (consensus is critical path)
- Validator, amendment, and SHAMap components are lower priority
- Validator, amendment, and SHAMap components were descoped (lower priority)
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
@@ -1123,13 +962,13 @@ quadrantChart
---
## 6.12 Definition of Done
## 6.13 Definition of Done
> **TxQ** = Transaction Queue | **HA** = High Availability
Clear, measurable criteria for each phase.
### 6.12.1 Phase 1: Core Infrastructure
### 6.13.1 Phase 1: Core Infrastructure
| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -1141,7 +980,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: All criteria met, PR merged, no regressions in CI.
### 6.12.2 Phase 2: RPC Tracing
### 6.13.2 Phase 2: RPC Tracing
| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -1153,19 +992,22 @@ Clear, measurable criteria for each phase.
**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
### 6.12.3 Phase 3: Transaction Tracing
### 6.13.3 Phase 3: Transaction Tracing
| Criterion | Measurement | Target |
| ---------------- | ------------------------------- | ---------------------------------- |
| Local Trace | Submit validate TxQ traced | Single-node test passes |
| Cross-Node | Context propagates via protobuf | Multi-node test passes |
| Relay Visibility | relay_count attribute correct | Spot check 100 txs |
| HashRouter | Deduplication visible in trace | Duplicate txs show suppressed=true |
| Performance | TX throughput overhead | <5% degradation |
| Criterion | Measurement | Target |
| --------------------- | ------------------------------------------------- | -------------------------------------------------------- |
| Local Trace | Submit validate TxQ traced | Single-node test passes |
| Cross-Node | Context propagates via protobuf | Multi-node test passes |
| Deterministic TraceID | Same trace_id on all nodes for same tx | Multi-node test: query by txHash[0:16] returns all spans |
| Relay Ordering | Protobuf span_id propagation creates parent-child | Tempo trace tree shows relay chain |
| Graceful Degradation | Old peer drops trace_context | Spans still grouped by deterministic trace_id |
| Relay Visibility | relay_count attribute correct | Spot check 100 txs |
| HashRouter | Deduplication visible in trace | Duplicate txs show suppressed=true |
| Performance | TX throughput overhead | <5% degradation |
**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.
**Definition of Done**: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.
### 6.12.4 Phase 4: Consensus Tracing
### 6.13.4 Phase 4: Consensus Tracing
| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |
@@ -1177,7 +1019,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
### 6.12.5 Phase 5: Production Deployment
### 6.13.5 Phase 5: Production Deployment
| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |
@@ -1190,7 +1032,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Telemetry running in production, operators trained, alerts active.
### 6.12.6 Success Metrics Summary
### 6.13.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline | Status |
| -------- | -------------------------------- | --------------------------- | -------------- | ------------------ |

View File

@@ -92,13 +92,13 @@ flowchart TD
```mermaid
flowchart TB
subgraph validators["Validator Nodes"]
v1[rippled<br/>Validator 1]
v2[rippled<br/>Validator 2]
v1[xrpld<br/>Validator 1]
v2[xrpld<br/>Validator 2]
end
subgraph stock["Stock Nodes"]
s1[rippled<br/>Stock 1]
s2[rippled<br/>Stock 2]
s1[xrpld<br/>Stock 1]
s2[xrpld<br/>Stock 2]
end
subgraph collector["OTel Collector Cluster"]
@@ -140,7 +140,7 @@ flowchart TB
**Reading the diagram:**
- **Validator / Stock Nodes**: All rippled nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
- **Validator / Stock Nodes**: All xrpld nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
- **Collector Cluster (DC1, DC2)**: Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends.
- **Storage Backends**: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis.
- **Grafana Dashboards**: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces.
@@ -160,7 +160,7 @@ flowchart TB
| **DaemonSet** | Collector per host | Shared resources | Complexity |
| **Gateway** | Central collector(s) | Centralized processing | Single point of failure |
**Recommendation**: Use **Gateway** pattern with regional collectors for rippled networks:
**Recommendation**: Use **Gateway** pattern with regional collectors for xrpld networks:
- One collector cluster per datacenter/region
- Tail-based sampling at collector level
@@ -197,7 +197,7 @@ flowchart LR
**Reading the diagram:**
- **Head Sampling (Node)**: The first filter -- each rippled node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
- **Head Sampling (Node)**: The first filter -- each xrpld node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
- **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
- **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
- **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
@@ -226,15 +226,15 @@ flowchart LR
## 7.6 Grafana Dashboard Examples
Pre-built dashboards for rippled observability.
Pre-built dashboards for xrpld observability.
### 7.6.1 Consensus Health Dashboard
```json
{
"title": "rippled Consensus Health",
"uid": "rippled-consensus-health",
"tags": ["rippled", "consensus", "tracing"],
"title": "xrpld Consensus Health",
"uid": "xrpld-consensus-health",
"tags": ["xrpld", "consensus", "tracing"],
"panels": [
{
"title": "Consensus Round Duration",
@@ -243,7 +243,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
"query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
}
],
"fieldConfig": {
@@ -267,7 +267,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
"query": "{resource.service.name=\"xrpld\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
@@ -279,7 +279,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
"query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
}
],
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
@@ -291,7 +291,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | duration > 5s"
"query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | duration > 5s"
}
],
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 }
@@ -304,8 +304,8 @@ Pre-built dashboards for rippled observability.
```json
{
"title": "rippled Node Overview",
"uid": "rippled-node-overview",
"title": "xrpld Node Overview",
"uid": "xrpld-node-overview",
"panels": [
{
"title": "Active Nodes",
@@ -314,7 +314,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\"} | count_over_time() by (resource.service.instance.id) | count()"
"query": "{resource.service.name=\"xrpld\"} | count_over_time() by (resource.service.instance.id) | count()"
}
],
"gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }
@@ -326,7 +326,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | count()"
"query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | count()"
}
],
"gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }
@@ -338,7 +338,7 @@ Pre-built dashboards for rippled observability.
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && status.code=error} | rate() / {resource.service.name=\"rippled\"} | rate() * 100"
"query": "{resource.service.name=\"xrpld\" && status.code=error} | rate() / {resource.service.name=\"xrpld\"} | rate() * 100"
}
],
"fieldConfig": {
@@ -373,8 +373,8 @@ Pre-built dashboards for rippled observability.
apiVersion: 1
groups:
- name: rippled-tracing-alerts
folder: rippled
- name: xrpld-tracing-alerts
folder: xrpld
interval: 1m
rules:
- uid: consensus-slow
@@ -385,7 +385,7 @@ groups:
datasourceUid: tempo
model:
queryType: traceql
query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s'
query: '{resource.service.name="xrpld" && name="consensus.round"} | avg(duration) > 5s'
# Note: Verify TraceQL aggregate queries are supported by your
# Tempo version. Aggregate alerting (e.g., avg(duration)) requires
# Tempo 2.3+ with TraceQL metrics enabled.
@@ -404,7 +404,7 @@ groups:
datasourceUid: tempo
model:
queryType: traceql
query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05'
query: '{resource.service.name="xrpld" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05'
# Note: Verify TraceQL aggregate queries are supported by your
# Tempo version. Aggregate alerting (e.g., rate()) requires
# Tempo 2.3+ with TraceQL metrics enabled.
@@ -422,7 +422,7 @@ groups:
datasourceUid: tempo
model:
queryType: traceql
query: '{resource.service.name="rippled" && name="tx.receive"} | rate() < 10'
query: '{resource.service.name="xrpld" && name="tx.receive"} | rate() < 10'
for: 10m
annotations:
summary: Transaction throughput below threshold
@@ -436,13 +436,13 @@ groups:
> **OTLP** = OpenTelemetry Protocol
How to correlate OpenTelemetry traces with existing rippled observability.
How to correlate OpenTelemetry traces with existing xrpld observability.
### 7.7.1 Correlation Architecture
```mermaid
flowchart TB
subgraph rippled["rippled Node"]
subgraph xrpld["xrpld Node"]
otel[OpenTelemetry<br/>Spans]
perflog[PerfLog<br/>JSON Logs]
insight[Beast Insight<br/>StatsD Metrics]
@@ -479,7 +479,7 @@ flowchart TB
logs --> corr
metrics --> corr
style rippled fill:#0d47a1,stroke:#082f6a,color:#fff
style xrpld fill:#0d47a1,stroke:#082f6a,color:#fff
style collectors fill:#bf360c,stroke:#8c2809,color:#fff
style storage fill:#1b5e20,stroke:#0d3d14,color:#fff
style grafana fill:#4a148c,stroke:#2e0d57,color:#fff
@@ -500,7 +500,7 @@ flowchart TB
**Reading the diagram:**
- **rippled Node (three sources)**: A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
- **xrpld Node (three sources)**: A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
- **Data Collection layer**: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently.
- **Storage layer (Tempo, Loki, Prometheus)**: Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation).
- **Grafana Correlation Panel**: The key integration point -- Grafana queries all three stores and links them via shared fields (`trace_id`, `xrpl.tx.hash`, `ledger_seq`), enabling a single-pane debugging experience.
@@ -522,7 +522,7 @@ flowchart TB
```
# In Grafana Explore with Tempo
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}
{resource.service.name="xrpld" && span.xrpl.tx.hash="ABC123..."}
```
**Step 2: Get the trace_id from the trace view**
@@ -535,14 +535,14 @@ Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
```
# In Grafana Explore with Loki
{job="rippled"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
{job="xrpld"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
```
**Step 4: Check Insight metrics for the time window**
```
# In Grafana with Prometheus
rate(rippled_tx_applied_total[1m])
rate(xrpld_tx_applied_total[1m])
@ timestamp_from_trace
```
@@ -550,8 +550,8 @@ rate(rippled_tx_applied_total[1m])
```json
{
"title": "rippled Unified Observability",
"uid": "rippled-unified",
"title": "xrpld Unified Observability",
"uid": "xrpld-unified",
"panels": [
{
"title": "Transaction Latency (Traces)",
@@ -560,7 +560,7 @@ rate(rippled_tx_applied_total[1m])
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | histogram_over_time(duration)"
"query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | histogram_over_time(duration)"
}
],
"gridPos": { "h": 6, "w": 8, "x": 0, "y": 0 }
@@ -571,7 +571,7 @@ rate(rippled_tx_applied_total[1m])
"datasource": "Prometheus",
"targets": [
{
"expr": "rate(rippled_tx_received_total[5m])",
"expr": "rate(xrpld_tx_received_total[5m])",
"legendFormat": "{{ instance }}"
}
],
@@ -580,7 +580,7 @@ rate(rippled_tx_applied_total[1m])
"links": [
{
"title": "View traces",
"url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"rippled\\\" && name=\\\"tx.receive\\\"}\"}"
"url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"xrpld\\\" && name=\\\"tx.receive\\\"}\"}"
}
]
}
@@ -593,7 +593,7 @@ rate(rippled_tx_applied_total[1m])
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"rippled\"} | json"
"expr": "{job=\"xrpld\"} | json"
}
],
"gridPos": { "h": 6, "w": 8, "x": 16, "y": 0 }
@@ -605,7 +605,7 @@ rate(rippled_tx_applied_total[1m])
"targets": [
{
"queryType": "traceql",
"query": "{resource.service.name=\"rippled\"}"
"query": "{resource.service.name=\"xrpld\"}"
}
],
"fieldConfig": {
@@ -622,7 +622,7 @@ rate(rippled_tx_applied_total[1m])
},
{
"title": "View logs",
"url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"rippled\\\"} |= \\\"${__value.raw}\\\"\"}"
"url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"xrpld\\\"} |= \\\"${__value.raw}\\\"\"}"
}
]
}

View File

@@ -26,7 +26,7 @@
| **Resource** | Entity producing telemetry (service, host, etc.) |
| **Instrumentation** | Code that creates telemetry data |
### rippled-Specific Terms
### xrpld-Specific Terms
| Term | Definition |
| ----------------- | ------------------------------------------------------------- |
@@ -36,8 +36,8 @@
| **Validation** | Validator's signature on a closed ledger |
| **HashRouter** | Component for transaction deduplication |
| **JobQueue** | Thread pool for asynchronous task execution |
| **PerfLog** | Existing performance logging system in rippled |
| **Beast Insight** | Existing metrics framework in rippled |
| **PerfLog** | Existing performance logging system in xrpld |
| **Beast Insight** | Existing metrics framework in xrpld |
| **PathFinding** | Payment path computation engine for cross-currency payments |
| **TxQ** | Transaction queue managing fee-based prioritization |
| **LoadManager** | Dynamic fee escalation based on network load |
@@ -50,10 +50,10 @@
| **MetricsRegistry** | Centralized class for OTel async gauge registrations (Phase 9) |
| **ObservableGauge** | OTel Metrics SDK async instrument polled via callback at fixed intervals |
| **PeriodicMetricReader** | OTel SDK component that invokes gauge callbacks at configurable intervals |
| **CountedObject** | rippled template that tracks live instance counts via atomic counters |
| **CountedObject** | xrpld template that tracks live instance counts via atomic counters |
| **TxQ** | Transaction queue managing fee escalation and ordering |
| **Load Factor** | Combined multiplier affecting transaction cost (local, cluster, network) |
| **OTel Collector Receiver** | Custom Go plugin that polls rippled RPC and emits OTel metrics (Phase 11) |
| **OTel Collector Receiver** | Custom Go plugin that polls xrpld RPC and emits OTel metrics (Phase 11) |
---
@@ -158,13 +158,13 @@ flowchart TB
6. [W3C Baggage](https://www.w3.org/TR/baggage/)
7. [Protocol Buffers](https://protobuf.dev/)
### rippled Resources
### xrpld Resources
8. [rippled Source Code](https://github.com/XRPLF/rippled)
8. [xrpld Source Code](https://github.com/XRPLF/rippled)
9. [XRP Ledger Documentation](https://xrpl.org/docs/)
10. [rippled Overlay README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/overlay/README.md)
11. [rippled RPC README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/rpc/README.md)
12. [rippled Consensus README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/app/consensus/README.md)
10. [xrpld Overlay README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/overlay/README.md)
11. [xrpld RPC README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/rpc/README.md)
12. [xrpld Consensus README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/app/consensus/README.md)
---
@@ -187,11 +187,11 @@ flowchart TB
| -------------------------------------------------------------------- | -------------------------------------------- |
| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Distributed tracing concepts and OTel primer |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | xrpld architecture and trace points |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config, CMake, Collector configs |
| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |
@@ -231,7 +231,7 @@ This guide maps Phase 911 content to its location across the documentation.
| Task list (10 tasks) | [Phase9_taskList.md](./Phase9_taskList.md) |
| Future metric definitions (~50) | [09-data-collection-reference.md §5b](./09-data-collection-reference.md) |
| New class: `MetricsRegistry` | `src/xrpld/telemetry/MetricsRegistry.h/.cpp` (planned) |
| New dashboards | `rippled-fee-market`, `rippled-job-queue` (planned) |
| New dashboards | `xrpld-fee-market`, `xrpld-job-queue` (planned) |
**Metric categories**: NodeStore I/O, Cache Hit Rates, TxQ, PerfLog Per-RPC, PerfLog Per-Job, Counted Objects, Fee Escalation & Load Factors.

View File

@@ -1,6 +1,6 @@
# Observability Data Collection Reference
> **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.
> **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by xrpld's observability stack.
>
> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
@@ -8,7 +8,7 @@
```mermaid
graph LR
subgraph rippledNode["rippled Node"]
subgraph xrpldNode["xrpld Node"]
A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
C["MetricsRegistry<br/>OTel SDK metrics<br/>(OTLP/HTTP exporter)"]
@@ -43,7 +43,7 @@ graph LR
BP -->|"OTLP/gRPC :4317"| D
SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
R1 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* histograms"| E
R1 -->|"xrpld_* gauges<br/>xrpld_* counters<br/>xrpld_* histograms"| E
E -->|"Prometheus<br/>data source"| F
D -->|"Tempo<br/>data source"| F
@@ -56,7 +56,7 @@ graph LR
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
@@ -93,9 +93,9 @@ Controlled by `trace_rpc=1` in `[telemetry]` config.
| `rpc.ws_message` | — | ServerHandler.cpp | WebSocket message handling |
| `rpc.command.<name>` | `rpc.process` | RPCHandler.cpp | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
**Where to find**: Tempo → TraceQL: `{resource.service.name="rippled" && name=~"rpc.request|rpc.command.*"}`
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"rpc.request|rpc.command.*"}`
**Grafana dashboard**: _RPC Performance_ (`rippled-rpc-perf`)
**Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)
#### Transaction Spans
@@ -107,9 +107,9 @@ Controlled by `trace_transactions=1` in `[telemetry]` config.
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
**Where to find**: Tempo → TraceQL: `{resource.service.name="rippled" && name=~"tx.process|tx.receive"}`
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"tx.process|tx.receive"}`
**Grafana dashboard**: _Transaction Overview_ (`rippled-transactions`)
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
#### Consensus Spans
@@ -123,9 +123,9 @@ Controlled by `trace_consensus=1` in `[telemetry]` config.
| `consensus.validation.send` | — | RCLConsensus.cpp | Validation message sent after ledger accepted |
| `consensus.accept.apply` | — | RCLConsensus.cpp | Ledger application with close time details |
**Where to find**: Tempo → TraceQL: `{resource.service.name="rippled" && name=~"consensus.*"}`
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"consensus.*"}`
**Grafana dashboard**: _Consensus Health_ (`rippled-consensus`)
**Grafana dashboard**: _Consensus Health_ (`xrpld-consensus`)
#### Ledger Spans
@@ -137,9 +137,9 @@ Controlled by `trace_ledger=1` in `[telemetry]` config.
| `ledger.validate` | — | LedgerMaster.cpp | Ledger promoted to validated status |
| `ledger.store` | — | LedgerMaster.cpp | Ledger stored to database/history |
**Where to find**: Tempo → TraceQL: `{resource.service.name="rippled" && name=~"ledger.*"}`
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"ledger.*"}`
**Grafana dashboard**: _Ledger Operations_ (`rippled-ledger-ops`)
**Grafana dashboard**: _Ledger Operations_ (`xrpld-ledger-ops`)
#### Peer Spans
@@ -150,9 +150,9 @@ Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (h
| `peer.proposal.receive` | — | PeerImp.cpp | Consensus proposal received from peer |
| `peer.validation.receive` | — | PeerImp.cpp | Validation message received from peer |
**Where to find**: Tempo → TraceQL: `{resource.service.name="rippled" && name=~"peer.*"}`
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"peer.*"}`
**Grafana dashboard**: _Peer Network_ (`rippled-peer-net`)
**Grafana dashboard**: _Peer Network_ (`xrpld-peer-net`)
---
@@ -237,7 +237,7 @@ Every span can carry key-value attributes that provide context for filtering and
> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights.
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in xrpld is needed.
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ------------------------------------------------------------------------------ |
@@ -269,7 +269,7 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er
>
> **Migration complete**: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changed. `[insight] server=statsd` remains as a fallback.
These are system-level metrics emitted by rippled's `beast::insight` framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.
These are system-level metrics emitted by xrpld's `beast::insight` framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.
### Configuration
@@ -278,7 +278,7 @@ These are system-level metrics emitted by rippled's `beast::insight` framework v
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
prefix=xrpld
```
Fallback (StatsD):
@@ -287,56 +287,56 @@ Fallback (StatsD):
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
prefix=xrpld
```
### 2.1 Gauges
| Prometheus Metric | Source File | Description | Typical Range |
| --------------------------------------------------- | --------------------- | ----------------------------------------- | ------------------------------- |
| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 010 (healthy), >30 (stale) |
| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 010 (healthy) |
| `rippled_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| `rippled_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| `rippled_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| `rippled_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| `rippled_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| `rippled_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| `rippled_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low |
| `rippled_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| `rippled_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| `rippled_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 085 |
| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 1021 |
| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `rippled_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) |
| `rippled_job_count` | JobQueue.cpp | Current job queue depth | 0100 (healthy) |
| Prometheus Metric | Source File | Description | Typical Range |
| ------------------------------------------------- | --------------------- | ----------------------------------------- | ------------------------------- |
| `xrpld_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 010 (healthy), >30 (stale) |
| `xrpld_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 010 (healthy) |
| `xrpld_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| `xrpld_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| `xrpld_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| `xrpld_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| `xrpld_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| `xrpld_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| `xrpld_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low |
| `xrpld_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| `xrpld_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| `xrpld_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| `xrpld_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 085 |
| `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 1021 |
| `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `xrpld_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) |
| `xrpld_job_count` | JobQueue.cpp | Current job queue depth | 0100 (healthy) |
**Grafana dashboard**: _Node Health (System Metrics)_ (`rippled-system-node-health`)
**Grafana dashboard**: _Node Health (System Metrics)_ (`xrpld-system-node-health`)
### 2.2 Counters
| Prometheus Metric | Source File | Description |
| --------------------------------- | ------------------ | --------------------------------------------- |
| `rippled_rpc_requests` | ServerHandler.cpp | Total RPC requests received |
| `rippled_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts |
| `rippled_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected |
| `rippled_warn` | Logic.h | Resource manager warnings issued |
| `rippled_drop` | Logic.h | Resource manager drops (connections rejected) |
| Prometheus Metric | Source File | Description |
| ------------------------------- | ------------------ | --------------------------------------------- |
| `xrpld_rpc_requests` | ServerHandler.cpp | Total RPC requests received |
| `xrpld_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts |
| `xrpld_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected |
| `xrpld_warn` | Logic.h | Resource manager warnings issued |
| `xrpld_drop` | Logic.h | Resource manager drops (connections rejected) |
**Note**: With `server=otel`, `rippled_warn` and `rippled_drop` are properly exported as OTel Counter instruments. The previous StatsD `|m` type limitation no longer applies.
**Note**: With `server=otel`, `xrpld_warn` and `xrpld_drop` are properly exported as OTel Counter instruments. The previous StatsD `|m` type limitation no longer applies.
**Grafana dashboard**: _RPC & Pathfinding (System Metrics)_ (`rippled-system-rpc`)
**Grafana dashboard**: _RPC & Pathfinding (System Metrics)_ (`xrpld-system-rpc`)
### 2.3 Histograms (Event timers)
| Prometheus Metric | Source File | Unit | Description |
| ----------------------- | ----------------- | ----- | ------------------------------ |
| `rippled_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution |
| `rippled_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution |
| `rippled_ios_latency` | Application.cpp | ms | I/O service loop latency |
| `rippled_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration |
| `rippled_pathfind_full` | PathRequests.h | ms | Full pathfinding duration |
| Prometheus Metric | Source File | Unit | Description |
| --------------------- | ----------------- | ----- | ------------------------------ |
| `xrpld_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution |
| `xrpld_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution |
| `xrpld_ios_latency` | Application.cpp | ms | I/O service loop latency |
| `xrpld_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration |
| `xrpld_pathfind_full` | PathRequests.h | ms | Full pathfinding duration |
Quantiles collected: 0th, 50th, 90th, 95th, 99th, 100th percentile.
@@ -346,10 +346,10 @@ Quantiles collected: 0th, 50th, 90th, 95th, 99th, 100th percentile.
For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), four gauges are emitted:
- `rippled_{category}_Bytes_In`
- `rippled_{category}_Bytes_Out`
- `rippled_{category}_Messages_In`
- `rippled_{category}_Messages_Out`
- `xrpld_{category}_Bytes_In`
- `xrpld_{category}_Bytes_Out`
- `xrpld_{category}_Messages_In`
- `xrpld_{category}_Messages_Out`
**Key categories**:
@@ -368,7 +368,7 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
| `ping` / `status` | Keepalive and status |
| `set_get` | Set requests |
**Grafana dashboards**: _Network Traffic_ (`rippled-system-network`), _Overlay Traffic Detail_ (`rippled-system-overlay-detail`), _Ledger Data & Sync_ (`rippled-system-ledger-sync`)
**Grafana dashboards**: _Network Traffic_ (`xrpld-system-network`), _Overlay Traffic Detail_ (`xrpld-system-overlay-detail`), _Ledger Data & Sync_ (`xrpld-system-ledger-sync`)
---
@@ -378,28 +378,28 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
### 3.1 Span-Derived Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| -------------------- | ---------------------- | ------------------------ | ---------------------------------------------------------------------------------- |
| RPC Performance | `rippled-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | `rippled-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | `rippled-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | `rippled-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `rippled-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |
| Dashboard | UID | Data Source | Key Panels |
| -------------------- | -------------------- | ------------------------ | ---------------------------------------------------------------------------------- |
| RPC Performance | `xrpld-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | `xrpld-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | `xrpld-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | `xrpld-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `xrpld-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |
### 3.2 System Metrics Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ------------------------------- | ----------------- | --------------------------------------------------------------------------------- |
| Node Health | `rippled-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | `rippled-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | `rippled-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `rippled-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `rippled-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ----------------------------- | ----------------- | --------------------------------------------------------------------------------- |
| Node Health | `xrpld-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | `xrpld-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | `xrpld-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `xrpld-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `xrpld-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
### 3.3 Accessing the Dashboards
1. Open Grafana at **http://localhost:3000**
2. Navigate to **Dashboards → rippled** folder
2. Navigate to **Dashboards → xrpld** folder
3. All 10 dashboards are auto-provisioned from `docker/telemetry/grafana/dashboards/`
---
@@ -410,18 +410,18 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
### Finding Traces by Type
| What to Find | Tempo TraceQL Query |
| ------------------------ | -------------------------------------------------------------------------------- |
| All RPC calls | `{resource.service.name="rippled" && name="rpc.request"}` |
| Specific RPC command | `{resource.service.name="rippled" && name="rpc.command.server_info"}` |
| Slow RPC calls | `{resource.service.name="rippled" && name=~"rpc.command.*"} \| duration > 100ms` |
| Failed RPC calls | `{span.xrpl.rpc.status="error"}` |
| Specific transaction | `{span.xrpl.tx.hash="<hex_hash>"}` |
| Local transactions only | `{span.xrpl.tx.local=true}` |
| Consensus rounds | `{resource.service.name="rippled" && name="consensus.accept"}` |
| Rounds by mode | `{span.xrpl.consensus.mode="proposing"}` |
| Specific ledger | `{span.xrpl.ledger.seq=12345}` |
| Peer proposals (trusted) | `{span.xrpl.peer.proposal.trusted=true}` |
| What to Find | Tempo TraceQL Query |
| ------------------------ | ------------------------------------------------------------------------------ |
| All RPC calls | `{resource.service.name="xrpld" && name="rpc.request"}` |
| Specific RPC command | `{resource.service.name="xrpld" && name="rpc.command.server_info"}` |
| Slow RPC calls | `{resource.service.name="xrpld" && name=~"rpc.command.*"} \| duration > 100ms` |
| Failed RPC calls | `{span.xrpl.rpc.status="error"}` |
| Specific transaction | `{span.xrpl.tx.hash="<hex_hash>"}` |
| Local transactions only | `{span.xrpl.tx.local=true}` |
| Consensus rounds | `{resource.service.name="xrpld" && name="consensus.accept"}` |
| Rounds by mode | `{span.xrpl.consensus.mode="proposing"}` |
| Specific ledger | `{span.xrpl.ledger.seq=12345}` |
| Peer proposals (trusted) | `{span.xrpl.peer.proposal.trusted=true}` |
### Trace Structure
@@ -475,19 +475,19 @@ sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_n
```promql
# Validated ledger age (should be < 10s)
rippled_LedgerMaster_Validated_Ledger_Age
xrpld_LedgerMaster_Validated_Ledger_Age
# Active peer count
rippled_Peer_Finder_Active_Inbound_Peers + rippled_Peer_Finder_Active_Outbound_Peers
xrpld_Peer_Finder_Active_Inbound_Peers + xrpld_Peer_Finder_Active_Outbound_Peers
# RPC response time p95
histogram_quantile(0.95, rippled_rpc_time_bucket)
histogram_quantile(0.95, xrpld_rpc_time_bucket)
# Total network bytes in (rate)
rate(rippled_total_Bytes_In[5m])
rate(xrpld_total_Bytes_In[5m])
# Operating mode (should be "Full" after startup)
rippled_State_Accounting_Full_duration
xrpld_State_Accounting_Full_duration
```
---
@@ -497,7 +497,7 @@ rippled_State_Accounting_Full_duration
> **Plan details**: [06-implementation-phases.md §6.8.1](./06-implementation-phases.md) — motivation, architecture, Mermaid diagrams
> **Task breakdown**: [Phase8_taskList.md](./Phase8_taskList.md) — per-task implementation details
Phase 8 injects OTel trace context into rippled's `Logs::format()` output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:
Phase 8 injects OTel trace context into xrpld's `Logs::format()` output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:
### Log Format
@@ -522,7 +522,7 @@ The trace context injection is implemented in `Logs::format()` (`src/libxrpl/bas
### Log Ingestion Pipeline
```
rippled debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki
xrpld debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki
```
The OTel Collector's `filelog` receiver tails `debug.log` files and uses a `regex_parser` operator to extract structured fields:
@@ -551,16 +551,16 @@ Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries
```logql
# Find all logs for a specific trace
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"
{job="xrpld"} |= "trace_id=abc123def456789012345678abcdef01"
# Error logs with trace context
{job="rippled"} |= "ERR" |= "trace_id="
{job="xrpld"} |= "ERR" |= "trace_id="
# Logs from a specific partition with trace context
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
{job="xrpld"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
# Count traced log lines over time
count_over_time({job="rippled"} |= "trace_id=" [5m])
count_over_time({job="xrpld"} |= "trace_id=" [5m])
```
---
@@ -571,141 +571,141 @@ count_over_time({job="rippled"} |= "trace_id=" [5m])
> **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context
> **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details
Phase 9 fills ~50+ metrics that exist inside rippled but currently lack time-series export. Uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, OTel `ObservableGauge` async callbacks for new categories.
Phase 9 fills ~50+ metrics that exist inside xrpld but currently lack time-series export. Uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, OTel `ObservableGauge` async callbacks for new categories.
### New Metric Categories
#### NodeStore I/O (via beast::insight)
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | ----------------------------------- |
| `rippled_nodestore_reads_total` | Gauge | Cumulative read operations |
| `rippled_nodestore_reads_hit` | Gauge | Cache-served reads |
| `rippled_nodestore_writes` | Gauge | Cumulative write operations |
| `rippled_nodestore_written_bytes` | Gauge | Cumulative bytes written |
| `rippled_nodestore_read_bytes` | Gauge | Cumulative bytes read |
| `rippled_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
| `rippled_nodestore_write_load` | Gauge | Current write load score |
| `rippled_nodestore_read_queue` | Gauge | Items in read queue |
| Prometheus Metric | Type | Description |
| ---------------------------------- | ----- | ----------------------------------- |
| `xrpld_nodestore_reads_total` | Gauge | Cumulative read operations |
| `xrpld_nodestore_reads_hit` | Gauge | Cache-served reads |
| `xrpld_nodestore_writes` | Gauge | Cumulative write operations |
| `xrpld_nodestore_written_bytes` | Gauge | Cumulative bytes written |
| `xrpld_nodestore_read_bytes` | Gauge | Cumulative bytes read |
| `xrpld_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
| `xrpld_nodestore_write_load` | Gauge | Current write load score |
| `xrpld_nodestore_read_queue` | Gauge | Items in read queue |
#### Cache Hit Rates (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ------------------------------- | ----- | ------------------------------------ |
| `rippled_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate |
| `rippled_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate |
| `rippled_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) |
| `rippled_cache_fullbelow_size` | Gauge | FullBelow cache size |
| Prometheus Metric | Type | Description |
| ----------------------------- | ----- | ------------------------------------ |
| `xrpld_cache_SLE_hit_rate` | Gauge | SLE cache hit rate (0.0-1.0) |
| `xrpld_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate |
| `xrpld_cache_AL_hit_rate` | Gauge | AcceptedLedger cache hit rate |
| `xrpld_cache_treenode_size` | Gauge | SHAMap TreeNode cache size (entries) |
| `xrpld_cache_fullbelow_size` | Gauge | FullBelow cache size |
#### Transaction Queue (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| -------------------------------------- | ----- | -------------------------------- |
| `rippled_txq_count` | Gauge | Current transactions in queue |
| `rippled_txq_max_size` | Gauge | Maximum queue capacity |
| `rippled_txq_in_ledger` | Gauge | Transactions in open ledger |
| `rippled_txq_per_ledger` | Gauge | Expected transactions per ledger |
| `rippled_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level |
| `rippled_txq_med_fee_level` | Gauge | Median fee level in queue |
| `rippled_txq_reference_fee_level` | Gauge | Reference fee level |
| `rippled_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed |
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | -------------------------------- |
| `xrpld_txq_count` | Gauge | Current transactions in queue |
| `xrpld_txq_max_size` | Gauge | Maximum queue capacity |
| `xrpld_txq_in_ledger` | Gauge | Transactions in open ledger |
| `xrpld_txq_per_ledger` | Gauge | Expected transactions per ledger |
| `xrpld_txq_open_ledger_fee_level` | Gauge | Open ledger fee escalation level |
| `xrpld_txq_med_fee_level` | Gauge | Median fee level in queue |
| `xrpld_txq_reference_fee_level` | Gauge | Reference fee level |
| `xrpld_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed |
#### PerfLog Per-RPC Method (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------- | --------- | ----------------- | --------------------------- |
| `rippled_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `rippled_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed |
| `rippled_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls errored |
| `rippled_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------- | --------- | ----------------- | --------------------------- |
| `xrpld_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `xrpld_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed |
| `xrpld_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls errored |
| `xrpld_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
#### PerfLog Per-Job Type (via OTel Metrics SDK)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------- | --------- | ------------------- | --------------- |
| `rippled_job_queued_total` | Counter | `job_type="<name>"` | Jobs queued |
| `rippled_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `rippled_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `rippled_job_queued_duration_us_bucket` | Histogram | `job_type="<name>"` | Queue wait time |
| `rippled_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time |
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------- | --------- | ------------------- | --------------- |
| `xrpld_job_queued_total` | Counter | `job_type="<name>"` | Jobs queued |
| `xrpld_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `xrpld_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `xrpld_job_queued_duration_us_bucket` | Histogram | `job_type="<name>"` | Queue wait time |
| `xrpld_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time |
#### Counted Object Instances (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ---------------------- | ----- | --------------- | ------------------------------- |
| `rippled_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
| Prometheus Metric | Type | Labels | Description |
| -------------------- | ----- | --------------- | ------------------------------- |
| `xrpld_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry`
#### Fee Escalation & Load Factors (via OTel MetricsRegistry)
| Prometheus Metric | Type | Description |
| ------------------------------------ | ----- | ------------------------------------ |
| `rippled_load_factor` | Gauge | Combined transaction cost multiplier |
| `rippled_load_factor_server` | Gauge | Server + cluster + network load |
| `rippled_load_factor_local` | Gauge | Local server load only |
| `rippled_load_factor_net` | Gauge | Network-wide load estimate |
| `rippled_load_factor_cluster` | Gauge | Cluster peer load |
| `rippled_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
| `rippled_load_factor_fee_queue` | Gauge | Queue entry fee level |
| Prometheus Metric | Type | Description |
| ---------------------------------- | ----- | ------------------------------------ |
| `xrpld_load_factor` | Gauge | Combined transaction cost multiplier |
| `xrpld_load_factor_server` | Gauge | Server + cluster + network load |
| `xrpld_load_factor_local` | Gauge | Local server load only |
| `xrpld_load_factor_net` | Gauge | Network-wide load estimate |
| `xrpld_load_factor_cluster` | Gauge | Cluster peer load |
| `xrpld_load_factor_fee_escalation` | Gauge | Open ledger fee escalation |
| `xrpld_load_factor_fee_queue` | Gauge | Queue entry fee level |
#### Server Info (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------------------------- | ----- | -------- | -------------------------------------------- |
| `rippled_server_info{metric="server_state"}` | Gauge | `metric` | Operating mode (0=DISCONNECTED .. 4=FULL) |
| `rippled_server_info{metric="uptime"}` | Gauge | `metric` | Seconds since server start |
| `rippled_server_info{metric="peers"}` | Gauge | `metric` | Total connected peers |
| `rippled_server_info{metric="validated_ledger_seq"}` | Gauge | `metric` | Validated ledger sequence number |
| `rippled_server_info{metric="ledger_current_index"}` | Gauge | `metric` | Current open ledger sequence |
| `rippled_server_info{metric="peer_disconnects_resources"}` | Gauge | `metric` | Cumulative resource-related peer disconnects |
| `rippled_server_info{metric="last_close_proposers"}` | Gauge | `metric` | Proposers in last closed round |
| `rippled_server_info{metric="last_close_converge_time_ms"}` | Gauge | `metric` | Last close convergence time (milliseconds) |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------------- | ----- | -------- | -------------------------------------------- |
| `xrpld_server_info{metric="server_state"}` | Gauge | `metric` | Operating mode (0=DISCONNECTED .. 4=FULL) |
| `xrpld_server_info{metric="uptime"}` | Gauge | `metric` | Seconds since server start |
| `xrpld_server_info{metric="peers"}` | Gauge | `metric` | Total connected peers |
| `xrpld_server_info{metric="validated_ledger_seq"}` | Gauge | `metric` | Validated ledger sequence number |
| `xrpld_server_info{metric="ledger_current_index"}` | Gauge | `metric` | Current open ledger sequence |
| `xrpld_server_info{metric="peer_disconnects_resources"}` | Gauge | `metric` | Cumulative resource-related peer disconnects |
| `xrpld_server_info{metric="last_close_proposers"}` | Gauge | `metric` | Proposers in last closed round |
| `xrpld_server_info{metric="last_close_converge_time_ms"}` | Gauge | `metric` | Last close convergence time (milliseconds) |
#### Build Info (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------- | ----- | --------- | --------------------------------- |
| `rippled_build_info{version="<ver>"}` | Gauge | `version` | Info-style metric, always value 1 |
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------- | ----- | --------- | --------------------------------- |
| `xrpld_build_info{version="<ver>"}` | Gauge | `version` | Info-style metric, always value 1 |
#### Complete Ledger Ranges (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------------------- | ----- | --------------- | --------------------------- |
| `rippled_complete_ledgers{bound="start",index="<N>"}` | Gauge | `bound`,`index` | Start of contiguous range N |
| `rippled_complete_ledgers{bound="end",index="<N>"}` | Gauge | `bound`,`index` | End of contiguous range N |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ----- | --------------- | --------------------------- |
| `xrpld_complete_ledgers{bound="start",index="<N>"}` | Gauge | `bound`,`index` | Start of contiguous range N |
| `xrpld_complete_ledgers{bound="end",index="<N>"}` | Gauge | `bound`,`index` | End of contiguous range N |
#### Database Metrics (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ----- | -------- | --------------------------------- |
| `rippled_db_metrics{metric="db_kb_total"}` | Gauge | `metric` | Total database size (KB) |
| `rippled_db_metrics{metric="db_kb_ledger"}` | Gauge | `metric` | Ledger database size (KB) |
| `rippled_db_metrics{metric="db_kb_transaction"}` | Gauge | `metric` | Transaction database size (KB) |
| `rippled_db_metrics{metric="historical_perminute"}` | Gauge | `metric` | Historical ledger fetches per min |
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------- | ----- | -------- | --------------------------------- |
| `xrpld_db_metrics{metric="db_kb_total"}` | Gauge | `metric` | Total database size (KB) |
| `xrpld_db_metrics{metric="db_kb_ledger"}` | Gauge | `metric` | Ledger database size (KB) |
| `xrpld_db_metrics{metric="db_kb_transaction"}` | Gauge | `metric` | Transaction database size (KB) |
| `xrpld_db_metrics{metric="historical_perminute"}` | Gauge | `metric` | Historical ledger fetches per min |
#### Extended Cache Metrics (additions to existing rippled_cache_metrics)
#### Extended Cache Metrics (additions to existing xrpld_cache_metrics)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------- | ----- | -------- | ------------------------- |
| `rippled_cache_metrics{metric="AL_size"}` | Gauge | `metric` | AcceptedLedger cache size |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------- | ----- | -------- | ------------------------- |
| `xrpld_cache_metrics{metric="AL_size"}` | Gauge | `metric` | AcceptedLedger cache size |
#### Extended NodeStore Metrics (additions to existing rippled_nodestore_state)
#### Extended NodeStore Metrics (additions to existing xrpld_nodestore_state)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------- | ----- | -------- | ----------------------------------- |
| `rippled_nodestore_state{metric="node_reads_duration_us"}` | Gauge | `metric` | Cumulative read time (microseconds) |
| `rippled_nodestore_state{metric="read_request_bundle"}` | Gauge | `metric` | Read request bundle count |
| `rippled_nodestore_state{metric="read_threads_running"}` | Gauge | `metric` | Active read threads |
| `rippled_nodestore_state{metric="read_threads_total"}` | Gauge | `metric` | Total read threads configured |
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------- | ----- | -------- | ----------------------------------- |
| `xrpld_nodestore_state{metric="node_reads_duration_us"}` | Gauge | `metric` | Cumulative read time (microseconds) |
| `xrpld_nodestore_state{metric="read_request_bundle"}` | Gauge | `metric` | Read request bundle count |
| `xrpld_nodestore_state{metric="read_threads_running"}` | Gauge | `metric` | Active read threads |
| `xrpld_nodestore_state{metric="read_threads_total"}` | Gauge | `metric` | Total read threads configured |
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | -------------------- | ----------- | ----------------------------------------------------------------- |
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth |
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | ------------------ | ----------- | ----------------------------------------------------------------- |
| Fee Market & TxQ | `xrpld-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | `xrpld-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth |
---
@@ -737,7 +737,7 @@ Phase 10 builds a 5-node validator docker-compose harness with RPC load generato
> **Plan details**: [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) — motivation, architecture, consumer gap analysis
> **Task breakdown**: [Phase11_taskList.md](./Phase11_taskList.md) — per-task implementation details
Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports `xrpl_*` metrics for external consumers. No rippled code changes.
Phase 11 builds a custom OTel Collector receiver (Go) that polls xrpld's admin RPCs and exports `xrpl_*` metrics for external consumers. No xrpld code changes.
### Exported Metrics (via Custom OTel Collector Receiver)
@@ -807,102 +807,102 @@ via OTLP/HTTP to the OTel Collector and scraped by Prometheus.
#### NodeStore I/O (Observable Gauge — `nodestore_state`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------ | ----- | -------- | ------------------------------------ |
| `rippled_nodestore_state{metric="node_reads_total"}` | Gauge | `metric` | Cumulative NodeStore read operations |
| `rippled_nodestore_state{metric="node_reads_hit"}` | Gauge | `metric` | Reads served from cache |
| `rippled_nodestore_state{metric="node_writes"}` | Gauge | `metric` | Cumulative write operations |
| `rippled_nodestore_state{metric="node_written_bytes"}` | Gauge | `metric` | Cumulative bytes written |
| `rippled_nodestore_state{metric="node_read_bytes"}` | Gauge | `metric` | Cumulative bytes read |
| `rippled_nodestore_state{metric="write_load"}` | Gauge | `metric` | Current write load score |
| `rippled_nodestore_state{metric="read_queue"}` | Gauge | `metric` | Items in read prefetch queue |
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------- | ----- | -------- | ------------------------------------ |
| `xrpld_nodestore_state{metric="node_reads_total"}` | Gauge | `metric` | Cumulative NodeStore read operations |
| `xrpld_nodestore_state{metric="node_reads_hit"}` | Gauge | `metric` | Reads served from cache |
| `xrpld_nodestore_state{metric="node_writes"}` | Gauge | `metric` | Cumulative write operations |
| `xrpld_nodestore_state{metric="node_written_bytes"}` | Gauge | `metric` | Cumulative bytes written |
| `xrpld_nodestore_state{metric="node_read_bytes"}` | Gauge | `metric` | Cumulative bytes read |
| `xrpld_nodestore_state{metric="write_load"}` | Gauge | `metric` | Current write load score |
| `xrpld_nodestore_state{metric="read_queue"}` | Gauge | `metric` | Items in read prefetch queue |
#### Cache Hit Rates & Sizes (Observable Gauge — `cache_metrics`)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------------------- | ----- | -------- | ----------------------------- |
| `rippled_cache_metrics{metric="SLE_hit_rate"}` | Gauge | `metric` | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_metrics{metric="ledger_hit_rate"}` | Gauge | `metric` | Ledger cache hit rate |
| `rippled_cache_metrics{metric="AL_hit_rate"}` | Gauge | `metric` | AcceptedLedger cache hit rate |
| `rippled_cache_metrics{metric="treenode_cache_size"}` | Gauge | `metric` | SHAMap TreeNode cache entries |
| `rippled_cache_metrics{metric="treenode_track_size"}` | Gauge | `metric` | Tracked tree nodes |
| `rippled_cache_metrics{metric="fullbelow_size"}` | Gauge | `metric` | FullBelow cache entries |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ----- | -------- | ----------------------------- |
| `xrpld_cache_metrics{metric="SLE_hit_rate"}` | Gauge | `metric` | SLE cache hit rate (0.0-1.0) |
| `xrpld_cache_metrics{metric="ledger_hit_rate"}` | Gauge | `metric` | Ledger cache hit rate |
| `xrpld_cache_metrics{metric="AL_hit_rate"}` | Gauge | `metric` | AcceptedLedger cache hit rate |
| `xrpld_cache_metrics{metric="treenode_cache_size"}` | Gauge | `metric` | SHAMap TreeNode cache entries |
| `xrpld_cache_metrics{metric="treenode_track_size"}` | Gauge | `metric` | Tracked tree nodes |
| `xrpld_cache_metrics{metric="fullbelow_size"}` | Gauge | `metric` | FullBelow cache entries |
#### Transaction Queue (Observable Gauge — `txq_metrics`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------------ | ----- | -------- | -------------------------------- |
| `rippled_txq_metrics{metric="txq_count"}` | Gauge | `metric` | Transactions currently in queue |
| `rippled_txq_metrics{metric="txq_max_size"}` | Gauge | `metric` | Maximum queue capacity |
| `rippled_txq_metrics{metric="txq_in_ledger"}` | Gauge | `metric` | Transactions in open ledger |
| `rippled_txq_metrics{metric="txq_per_ledger"}` | Gauge | `metric` | Expected transactions per ledger |
| `rippled_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | `metric` | Reference fee level |
| `rippled_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | `metric` | Minimum fee to get processed |
| `rippled_txq_metrics{metric="txq_med_fee_level"}` | Gauge | `metric` | Median fee level in queue |
| `rippled_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | `metric` | Open ledger fee escalation level |
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------- | ----- | -------- | -------------------------------- |
| `xrpld_txq_metrics{metric="txq_count"}` | Gauge | `metric` | Transactions currently in queue |
| `xrpld_txq_metrics{metric="txq_max_size"}` | Gauge | `metric` | Maximum queue capacity |
| `xrpld_txq_metrics{metric="txq_in_ledger"}` | Gauge | `metric` | Transactions in open ledger |
| `xrpld_txq_metrics{metric="txq_per_ledger"}` | Gauge | `metric` | Expected transactions per ledger |
| `xrpld_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | `metric` | Reference fee level |
| `xrpld_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | `metric` | Minimum fee to get processed |
| `xrpld_txq_metrics{metric="txq_med_fee_level"}` | Gauge | `metric` | Median fee level in queue |
| `xrpld_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | `metric` | Open ledger fee escalation level |
#### Per-RPC Method Metrics (Synchronous Counters/Histogram)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------- | --------- | ----------------- | -------------------------------- |
| `rippled_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `rippled_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed successfully |
| `rippled_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls that errored |
| `rippled_rpc_method_duration_us` | Histogram | `method="<name>"` | Execution time distribution (us) |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------- | --------- | ----------------- | -------------------------------- |
| `xrpld_rpc_method_started_total` | Counter | `method="<name>"` | RPC calls started |
| `xrpld_rpc_method_finished_total` | Counter | `method="<name>"` | RPC calls completed successfully |
| `xrpld_rpc_method_errored_total` | Counter | `method="<name>"` | RPC calls that errored |
| `xrpld_rpc_method_duration_us` | Histogram | `method="<name>"` | Execution time distribution (us) |
#### Per-Job-Type Metrics (Synchronous Counters/Histogram)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------- | --------- | ------------------- | --------------------------------- |
| `rippled_job_queued_total` | Counter | `job_type="<name>"` | Jobs enqueued |
| `rippled_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `rippled_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `rippled_job_queued_duration_us` | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| `rippled_job_running_duration_us` | Histogram | `job_type="<name>"` | Execution time distribution (us) |
| Prometheus Metric | Type | Labels | Description |
| ------------------------------- | --------- | ------------------- | --------------------------------- |
| `xrpld_job_queued_total` | Counter | `job_type="<name>"` | Jobs enqueued |
| `xrpld_job_started_total` | Counter | `job_type="<name>"` | Jobs started |
| `xrpld_job_finished_total` | Counter | `job_type="<name>"` | Jobs completed |
| `xrpld_job_queued_duration_us` | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| `xrpld_job_running_duration_us` | Histogram | `job_type="<name>"` | Execution time distribution (us) |
#### Counted Object Instances (Observable Gauge — `object_count`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------- | ----- | --------------- | ------------------------------ |
| `rippled_object_count{type="Transaction"}` | Gauge | `type="<name>"` | Live Transaction objects |
| `rippled_object_count{type="Ledger"}` | Gauge | `type="<name>"` | Live Ledger objects |
| `rippled_object_count{type="NodeObject"}` | Gauge | `type="<name>"` | Live NodeObject instances |
| `rippled_object_count{type="STTx"}` | Gauge | `type="<name>"` | Serialized transaction objects |
| `rippled_object_count{type="STLedgerEntry"}` | Gauge | `type="<name>"` | Serialized ledger entries |
| `rippled_object_count{type="InboundLedger"}` | Gauge | `type="<name>"` | Ledgers being fetched |
| `rippled_object_count{type="Pathfinder"}` | Gauge | `type="<name>"` | Active pathfinding operations |
| `rippled_object_count{type="PathRequest"}` | Gauge | `type="<name>"` | Active path requests |
| `rippled_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries |
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------- | ----- | --------------- | ------------------------------ |
| `xrpld_object_count{type="Transaction"}` | Gauge | `type="<name>"` | Live Transaction objects |
| `xrpld_object_count{type="Ledger"}` | Gauge | `type="<name>"` | Live Ledger objects |
| `xrpld_object_count{type="NodeObject"}` | Gauge | `type="<name>"` | Live NodeObject instances |
| `xrpld_object_count{type="STTx"}` | Gauge | `type="<name>"` | Serialized transaction objects |
| `xrpld_object_count{type="STLedgerEntry"}` | Gauge | `type="<name>"` | Serialized ledger entries |
| `xrpld_object_count{type="InboundLedger"}` | Gauge | `type="<name>"` | Ledgers being fetched |
| `xrpld_object_count{type="Pathfinder"}` | Gauge | `type="<name>"` | Active pathfinding operations |
| `xrpld_object_count{type="PathRequest"}` | Gauge | `type="<name>"` | Active path requests |
| `xrpld_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries |
#### Load Factor Breakdown (Observable Gauge — `load_factor_metrics`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------------------ | ----- | -------- | --------------------------------------- |
| `rippled_load_factor_metrics{metric="load_factor"}` | Gauge | `metric` | Combined transaction cost multiplier |
| `rippled_load_factor_metrics{metric="load_factor_server"}` | Gauge | `metric` | Server + cluster + network contribution |
| `rippled_load_factor_metrics{metric="load_factor_local"}` | Gauge | `metric` | Local server load only |
| `rippled_load_factor_metrics{metric="load_factor_net"}` | Gauge | `metric` | Network-wide load estimate |
| `rippled_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | `metric` | Cluster peer load |
| `rippled_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | `metric` | Open ledger fee escalation |
| `rippled_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | `metric` | Queue entry fee level |
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------------- | ----- | -------- | --------------------------------------- |
| `xrpld_load_factor_metrics{metric="load_factor"}` | Gauge | `metric` | Combined transaction cost multiplier |
| `xrpld_load_factor_metrics{metric="load_factor_server"}` | Gauge | `metric` | Server + cluster + network contribution |
| `xrpld_load_factor_metrics{metric="load_factor_local"}` | Gauge | `metric` | Local server load only |
| `xrpld_load_factor_metrics{metric="load_factor_net"}` | Gauge | `metric` | Network-wide load estimate |
| `xrpld_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | `metric` | Cluster peer load |
| `xrpld_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | `metric` | Open ledger fee escalation |
| `xrpld_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | `metric` | Queue entry fee level |
#### Prometheus Query Examples (Phase 9)
```promql
# NodeStore cache hit ratio
rippled_nodestore_state{metric="node_reads_hit"} / rippled_nodestore_state{metric="node_reads_total"}
xrpld_nodestore_state{metric="node_reads_hit"} / xrpld_nodestore_state{metric="node_reads_total"}
# RPC error rate for server_info
rate(rippled_rpc_method_errored_total{method="server_info"}[5m])
rate(xrpld_rpc_method_errored_total{method="server_info"}[5m])
# Job queue wait time p95
histogram_quantile(0.95, sum by (le) (rate(rippled_job_queued_duration_us_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(xrpld_job_queued_duration_us_bucket[5m])))
# TxQ utilization percentage
rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"}
xrpld_txq_metrics{metric="txq_count"} / xrpld_txq_metrics{metric="txq_max_size"}
# High load factor alert candidate
rippled_load_factor_metrics{metric="load_factor"} > 5
xrpld_load_factor_metrics{metric="load_factor"} > 5
```
### Phase 7+: External Dashboard Parity Metrics
@@ -911,75 +911,75 @@ rippled_load_factor_metrics{metric="load_factor"} > 5
>
> **Task breakdown**: Phase 7 Tasks 7.9-7.16 (implementation), Phase 9 Tasks 9.11-9.13 (dashboards)
These metrics fill gaps identified by comparing rippled's internal observability with the community external dashboard's 86-metric coverage. All are exported via the OTel Metrics SDK (same `PeriodicMetricReader` as Phase 9 metrics).
These metrics fill gaps identified by comparing xrpld's internal observability with the community external dashboard's 86-metric coverage. All are exported via the OTel Metrics SDK (same `PeriodicMetricReader` as Phase 9 metrics).
#### Validation Agreement (Observable Gauge — `validation_agreement`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------- | ------ | -------- | --------------------------------------- |
| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Double | `metric` | Rolling 1h agreement percentage (0-100) |
| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Double | `metric` | Rolling 24h agreement percentage |
| `rippled_validation_agreement{metric="agreements_1h"}` | Int64 | `metric` | Agreed validations in 1h window |
| `rippled_validation_agreement{metric="missed_1h"}` | Int64 | `metric` | Missed validations in 1h window |
| `rippled_validation_agreement{metric="agreements_24h"}` | Int64 | `metric` | Agreed validations in 24h window |
| `rippled_validation_agreement{metric="missed_24h"}` | Int64 | `metric` | Missed validations in 24h window |
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------- | ------ | -------- | --------------------------------------- |
| `xrpld_validation_agreement{metric="agreement_pct_1h"}` | Double | `metric` | Rolling 1h agreement percentage (0-100) |
| `xrpld_validation_agreement{metric="agreement_pct_24h"}` | Double | `metric` | Rolling 24h agreement percentage |
| `xrpld_validation_agreement{metric="agreements_1h"}` | Int64 | `metric` | Agreed validations in 1h window |
| `xrpld_validation_agreement{metric="missed_1h"}` | Int64 | `metric` | Missed validations in 1h window |
| `xrpld_validation_agreement{metric="agreements_24h"}` | Int64 | `metric` | Agreed validations in 24h window |
| `xrpld_validation_agreement{metric="missed_24h"}` | Int64 | `metric` | Missed validations in 24h window |
Data source: `ValidationTracker` class with 8s grace period and 5m late repair window.
#### Validator Health (Observable Gauge — `validator_health`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------ | ------ | -------- | ------------------------------ |
| `rippled_validator_health{metric="amendment_blocked"}` | Int64 | `metric` | 1 if amendment-blocked, else 0 |
| `rippled_validator_health{metric="unl_blocked"}` | Int64 | `metric` | 1 if UNL-blocked, else 0 |
| `rippled_validator_health{metric="unl_expiry_days"}` | Double | `metric` | Days until UNL list expires |
| `rippled_validator_health{metric="validation_quorum"}` | Int64 | `metric` | Validation quorum threshold |
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------- | ------ | -------- | ------------------------------ |
| `xrpld_validator_health{metric="amendment_blocked"}` | Int64 | `metric` | 1 if amendment-blocked, else 0 |
| `xrpld_validator_health{metric="unl_blocked"}` | Int64 | `metric` | 1 if UNL-blocked, else 0 |
| `xrpld_validator_health{metric="unl_expiry_days"}` | Double | `metric` | Days until UNL list expires |
| `xrpld_validator_health{metric="validation_quorum"}` | Int64 | `metric` | Validation quorum threshold |
#### Peer Quality (Observable Gauge — `peer_quality`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------------- | ------ | -------- | ------------------------------------ |
| `rippled_peer_quality{metric="peer_latency_p90_ms"}` | Double | `metric` | P90 peer latency in milliseconds |
| `rippled_peer_quality{metric="peers_insane_count"}` | Int64 | `metric` | Peers with diverged tracking status |
| `rippled_peer_quality{metric="peers_higher_version_pct"}` | Double | `metric` | % of peers on newer rippled version |
| `rippled_peer_quality{metric="upgrade_recommended"}` | Int64 | `metric` | 1 if >60% of peers are newer version |
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------- | ------ | -------- | ------------------------------------ |
| `xrpld_peer_quality{metric="peer_latency_p90_ms"}` | Double | `metric` | P90 peer latency in milliseconds |
| `xrpld_peer_quality{metric="peers_insane_count"}` | Int64 | `metric` | Peers with diverged tracking status |
| `xrpld_peer_quality{metric="peers_higher_version_pct"}` | Double | `metric` | % of peers on newer xrpld version |
| `xrpld_peer_quality{metric="upgrade_recommended"}` | Int64 | `metric` | 1 if >60% of peers are newer version |
#### Ledger Economy (Observable Gauge — `ledger_economy`)
| Prometheus Metric | Type | Labels | Description |
| ----------------------------------------------------- | ------ | -------- | ---------------------------------- |
| `rippled_ledger_economy{metric="base_fee_xrp"}` | Double | `metric` | Base transaction fee in drops |
| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Double | `metric` | Account reserve in drops |
| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Double | `metric` | Owner reserve increment in drops |
| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Double | `metric` | Seconds since last validated close |
| `rippled_ledger_economy{metric="transaction_rate"}` | Double | `metric` | Smoothed transaction rate (tx/s) |
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ------ | -------- | ---------------------------------- |
| `xrpld_ledger_economy{metric="base_fee_xrp"}` | Double | `metric` | Base transaction fee in drops |
| `xrpld_ledger_economy{metric="reserve_base_xrp"}` | Double | `metric` | Account reserve in drops |
| `xrpld_ledger_economy{metric="reserve_inc_xrp"}` | Double | `metric` | Owner reserve increment in drops |
| `xrpld_ledger_economy{metric="ledger_age_seconds"}` | Double | `metric` | Seconds since last validated close |
| `xrpld_ledger_economy{metric="transaction_rate"}` | Double | `metric` | Smoothed transaction rate (tx/s) |
#### State Tracking (Observable Gauge — `state_tracking`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------------- | ------ | -------- | -------------------------------------- |
| `rippled_state_tracking{metric="state_value"}` | Int64 | `metric` | Numeric state 0-6 (see encoding below) |
| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Double | `metric` | Duration in current state |
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------------- | ------ | -------- | -------------------------------------- |
| `xrpld_state_tracking{metric="state_value"}` | Int64 | `metric` | Numeric state 0-6 (see encoding below) |
| `xrpld_state_tracking{metric="time_in_current_state_seconds"}` | Double | `metric` | Duration in current state |
State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (FULL + validating), 6=proposing (FULL + proposing).
#### Storage Detail (Observable Gauge — `storage_detail`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------- | ----- | -------- | ---------------------- |
| `rippled_storage_detail{metric="nudb_bytes"}` | Int64 | `metric` | NuDB backend file size |
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------- | ----- | -------- | ---------------------- |
| `xrpld_storage_detail{metric="nudb_bytes"}` | Int64 | `metric` | NuDB backend file size |
#### Synchronous Counters (Phase 7+)
| Prometheus Metric | Type | Description | Increment Site |
| ------------------------------------- | ------- | -------------------------------- | --------------------- |
| `rippled_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `rippled_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `rippled_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `rippled_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp |
| `rippled_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp |
| `rippled_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `rippled_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
| Prometheus Metric | Type | Description | Increment Site |
| ----------------------------------- | ------- | -------------------------------- | --------------------- |
| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `xrpld_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp |
| `xrpld_validation_missed_total` | Counter | Cumulative validation misses | ValidationTracker.cpp |
| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
#### Span Attribute Enrichments (Phases 2-4)
@@ -997,29 +997,29 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | -------------------------- | ----------- | --------------------------------------------------------- |
| Fee Market & TxQ | `rippled-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | `rippled-job-queue` | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | `rippled-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions |
| Validator Health | `rippled-validator-health` | Prometheus | Agreement %, validation rate, amendment/UNL, state |
| Peer Quality | `rippled-peer-quality` | Prometheus | P90 latency, insane peers, version awareness, disconnects |
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ------------------------ | ----------- | --------------------------------------------------------- |
| Fee Market & TxQ | `xrpld-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | `xrpld-job-queue` | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | `xrpld-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions |
| Validator Health | `xrpld-validator-health` | Prometheus | Agreement %, validation rate, amendment/UNL, state |
| Peer Quality | `xrpld-peer-quality` | Prometheus | P90 latency, insane peers, version awareness, disconnects |
### Updated Grafana Dashboards (Phase 9)
| Dashboard | UID | New Panels Added |
| -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| Node Health (StatsD) | `rippled-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts |
| System Node Health | `rippled-system-node-health` | Ledger economy row: base fee, reserves, ledger age, transaction rate |
| Dashboard | UID | New Panels Added |
| -------------------- | -------------------------- | -------------------------------------------------------------------- |
| Node Health (StatsD) | `xrpld-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts |
| System Node Health | `xrpld-system-node-health` | Ledger economy row: base fee, reserves, ledger age, transaction rate |
### New Grafana Dashboards (Phase 11)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | ----------------------------- | ----------- | ---------------------------------------------------------------------- |
| Validator Health | `rippled-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | `rippled-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | `rippled-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | `rippled-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | --------------------------- | ----------- | ---------------------------------------------------------------------- |
| Validator Health | `xrpld-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | `xrpld-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | `xrpld-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | `xrpld-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |
### Prometheus Alerting Rules (Phase 11)
@@ -1044,8 +1044,8 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full
| Issue | Impact | Status |
| ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- |
| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m``\|c` change in StatsDCollector.cpp |
| `rippled_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `rippled_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg |
| `xrpld_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `xrpld_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg |
| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet |
---
@@ -1077,7 +1077,7 @@ enabled=1
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
prefix=xrpld
```
### Production Setup
@@ -1094,7 +1094,7 @@ max_queue_size=4096
[insight]
server=statsd
address=otel-collector:8125
prefix=rippled
prefix=xrpld
```
### Trace Category Toggle

View File

@@ -1,10 +1,10 @@
# [OpenTelemetry](00-tracing-fundamentals.md) Distributed Tracing Implementation Plan for rippled (xrpld)
# [OpenTelemetry](00-tracing-fundamentals.md) Distributed Tracing Implementation Plan for xrpld (xrpld)
## Executive Summary
> **OTLP** = OpenTelemetry Protocol
This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.
This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the xrpld XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.
### Key Benefits
@@ -101,11 +101,11 @@ flowchart TB
| Section | Document | Description |
| ------- | -------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing concepts, span relationships, context propagation |
| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities |
| **1** | [Architecture Analysis](./01-architecture-analysis.md) | xrpld component analysis, trace points, instrumentation priorities |
| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules |
| **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations |
| **5** | [Configuration Reference](./05-configuration-reference.md) | xrpld config, CMake integration, Collector configurations |
| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |
@@ -116,7 +116,7 @@ flowchart TB
## 0. Tracing Fundamentals
This document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to rippled-specific scenarios like transaction relay and consensus.
This document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to xrpld-specific scenarios like transaction relay and consensus.
➡️ **[Read Tracing Fundamentals](./00-tracing-fundamentals.md)**
@@ -126,7 +126,7 @@ This document introduces distributed tracing concepts for readers unfamiliar wit
> **WS** = WebSocket | **TxQ** = Transaction Queue
The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
The xrpld node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, ledger building, path computation, transaction queue behavior, fee escalation, and validator health. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.
@@ -150,7 +150,7 @@ Span naming follows a hierarchical `<component>.<operation>` convention (e.g., `
## 3. Implementation Strategy
The telemetry code is organized under `include/xrpl/telemetry/` for headers and `src/libxrpl/telemetry/` for implementation. Key principles include RAII-based span management via `SpanGuard`, conditional compilation with `XRPL_ENABLE_TELEMETRY`, and minimal runtime overhead through batch processing and efficient sampling.
The telemetry code is organized under `include/xrpl/telemetry/` for headers and `src/libxrpl/telemetry/` for implementation. Key principles include RAII-based span management via `SpanGuard` (with `discard()` for dropping unwanted spans), a `FilteringSpanProcessor` that intercepts `OnEnd()` to prevent discarded spans from entering the export pipeline, conditional compilation with `XRPL_ENABLE_TELEMETRY`, and minimal runtime overhead through batch processing and efficient sampling.
Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
@@ -163,8 +163,9 @@ Performance optimization strategies include probabilistic head sampling (10% def
C++ implementation examples are provided for the core telemetry infrastructure and key modules:
- `Telemetry.h` - Core interface for tracer access and span creation
- `SpanGuard.h` - RAII wrapper for automatic span lifecycle management
- `TracingInstrumentation.h` - Macros for conditional instrumentation
- `SpanGuard.h` - RAII wrapper for automatic span lifecycle management with `discard()` support
- `DiscardFlag.h` - Thread-local flag for span discard signaling between SpanGuard and FilteringSpanProcessor
- `SpanGuard.cpp` - Pimpl implementation confining all OTel SDK types
- Protocol Buffer extensions for trace context propagation
- Module-specific instrumentation (RPC, Consensus, P2P, JobQueue)
- Remaining modules (PathFinding, TxQ, Validator, etc.) follow the same patterns
@@ -220,7 +221,7 @@ The recommended production architecture uses a gateway collector pattern with re
## 8. Appendix
The appendix contains a glossary of OpenTelemetry and rippled-specific terms, references to external documentation and specifications, version history for this implementation plan, and a complete document index.
The appendix contains a glossary of OpenTelemetry and xrpld-specific terms, references to external documentation and specifications, version history for this implementation plan, and a complete document index.
➡️ **[View Appendix](./08-appendix.md)**
@@ -228,7 +229,7 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re
## 9. Data Collection Reference
A single-source-of-truth reference documenting every piece of telemetry data collected by rippled. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 8 Grafana dashboards. Includes Jaeger search guides and Prometheus query examples.
A single-source-of-truth reference documenting every piece of telemetry data collected by xrpld. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 10 Grafana dashboards. Includes Jaeger search guides and Prometheus query examples.
➡️ **[View Data Collection Reference](./09-data-collection-reference.md)**
@@ -236,10 +237,10 @@ A single-source-of-truth reference documenting every piece of telemetry data col
## POC Task List
A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana.
A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. The POC scope is limited to RPC tracing — showing request traces flowing from xrpld through an OpenTelemetry Collector into Tempo, viewable in Grafana.
➡️ **[View POC Task List](./POC_taskList.md)**
---
_This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._
_This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the xrpld XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._

View File

@@ -1,21 +1,21 @@
# OpenTelemetry POC Task List
> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Tempo, viewable in Grafana.
> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. A successful POC will show RPC request traces flowing from xrpld through an OTel Collector into Tempo, viewable in Grafana.
>
> **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.
### Related Plan Documents
| Document | Relevance to POC |
| ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Core concepts: traces, spans, context propagation, sampling |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard (§4.2), macros (§4.3), RPC instrumentation (§4.5.3) |
| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| [07-observability-backends.md](./07-observability-backends.md) | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |
| Document | Relevance to POC |
| ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Core concepts: traces, spans, context propagation, sampling |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard factory methods (§4.2-4.3), RPC instrumentation (§4.5.3) |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| [07-observability-backends.md](./07-observability-backends.md) | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |
---
@@ -137,7 +137,7 @@
- `virtual void start() = 0;`
- `virtual void stop() = 0;`
- `virtual bool isEnabled() const = 0;`
- `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "rippled") = 0;`
- `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "xrpld") = 0;`
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;`
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;`
- `virtual bool shouldTraceRpc() const = 0;`
@@ -147,9 +147,11 @@
- Config parser: `Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);`
- Create `include/xrpl/telemetry/SpanGuard.h`:
- RAII guard that takes an `nostd::shared_ptr<Span>`, creates a `Scope`, and calls `span->End()` in destructor.
- Convenience: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`
- See [04-code-samples.md](./04-code-samples.md) §4.2 for the full implementation.
- RAII guard with static factory methods (`rpcSpan()`, `txSpan()`, `consensusSpan()`, etc.) that access the global `Telemetry::getInstance()` singleton internally.
- Uses pimpl idiom to hide all OTel types -- the public header has zero `opentelemetry/` includes.
- Convenience instance methods: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`, `discard()`
- When `XRPL_ENABLE_TELEMETRY` is not defined, the entire class compiles to a no-op stub.
- See [04-code-samples.md](./04-code-samples.md) §4.2-4.3 for the full API reference.
- Create `src/libxrpl/telemetry/NullTelemetry.cpp`:
- Implements `Telemetry` with all no-ops.
@@ -167,7 +169,7 @@
**Reference**:
- [04-code-samples.md §4.1](./04-code-samples.md) — Full `Telemetry` interface with `Setup` struct, lifecycle, tracer access, span creation, and component filtering methods
- [04-code-samples.md §4.2](./04-code-samples.md) — Full `SpanGuard` RAII implementation and `NullSpanGuard` no-op class
- [04-code-samples.md §4.2-4.3](./04-code-samples.md) — SpanGuard with factory methods, pimpl design, no-op stub, and discard support
- [03-implementation-strategy.md §3.1](./03-implementation-strategy.md) — Directory structure: `include/xrpl/telemetry/` for headers, `src/libxrpl/telemetry/` for implementation
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation and zero-cost compile-time disabled pattern
@@ -225,43 +227,59 @@
## Task 4: Integrate Telemetry into Application Lifecycle
**Objective**: Wire the `Telemetry` object into `Application` so all components can access it.
**Objective**: Wire the `Telemetry` object into the `ServiceRegistry` / `Application` so all components can access it.
**What to do**:
- Edit `src/xrpld/app/main/Application.h`:
- Forward-declare `namespace xrpl::telemetry { class Telemetry; }`
- Edit `include/xrpl/core/ServiceRegistry.h`:
- Forward-declare `namespace telemetry { class Telemetry; }` inside `namespace xrpl`
- Add pure virtual method: `virtual telemetry::Telemetry& getTelemetry() = 0;`
- (`Application` extends `ServiceRegistry`, so this is automatically available on `Application` too)
- Edit `src/xrpld/app/main/Application.cpp` (the `ApplicationImp` class):
- Add member: `std::unique_ptr<telemetry::Telemetry> telemetry_;`
- In the constructor, after config is loaded and node identity is known:
- In the member initializer list, construct telemetry with an empty
`serviceInstanceId` (node identity is not yet known):
```cpp
auto const telemetrySection = config_->section("telemetry");
auto telemetrySetup = telemetry::setup_Telemetry(
telemetrySection,
toBase58(TokenType::NodePublic, nodeIdentity_.publicKey()),
BuildInfo::getVersionString());
telemetry_ = telemetry::make_Telemetry(telemetrySetup, logs_->journal("Telemetry"));
, telemetry_(
telemetry::make_Telemetry(
telemetry::setup_Telemetry(
config_->section("telemetry"),
"", // Updated later via setServiceInstanceId()
BuildInfo::getVersionString()),
logs_->journal("Telemetry")))
```
- In `start()`: call `telemetry_->start()` early
- In `stop()` or destructor: call `telemetry_->stop()` late (to flush pending spans)
- In `setup()`, after `nodeIdentity_` is resolved, inject the node
public key as the service instance ID:
```cpp
if (!config_->section("telemetry").exists("service_instance_id"))
telemetry_->setServiceInstanceId(
toBase58(TokenType::NodePublic, nodeIdentity_->first));
```
- In `start()`: call `telemetry_->start()`
- In `run()` (shutdown path): call `telemetry_->stop()` (to flush pending spans)
- Implement `getTelemetry()` override: return `*telemetry_`
- Add `[telemetry]` section to the example config `cfg/rippled-example.cfg`:
- Add `[telemetry]` section to the example config `cfg/xrpld-example.cfg`:
```ini
# [telemetry]
# enabled=1
# endpoint=localhost:4317
# endpoint=http://localhost:4318/v1/traces
# sampling_ratio=1.0
# trace_rpc=1
```
> **Access patterns**: Components holding `ServiceRegistry&` (e.g.
> `NetworkOPsImp`) call `registry_.get().getTelemetry()`. Components
> holding `Application&` (e.g. `ServerHandler`, `PeerImp`,
> `RCLConsensusAdaptor`) call `app_.getTelemetry()` directly. Both
> resolve to the same `Telemetry` instance.
**Key modified files**:
- `src/xrpld/app/main/Application.h`
- `include/xrpl/core/ServiceRegistry.h`
- `src/xrpld/app/main/Application.cpp`
- `cfg/rippled-example.cfg` (or equivalent example config)
- `cfg/xrpld-example.cfg` (example config)
**Reference**:
@@ -271,47 +289,37 @@
---
## Task 5: Create Instrumentation Macros
## Task 5: Add SpanGuard Factory Methods
**Objective**: Define convenience macros that make instrumenting code one-liners, and that compile to zero-cost no-ops when telemetry is disabled.
**Objective**: Add static factory methods to SpanGuard that provide type-safe, one-liner instrumentation and compile to zero-cost no-ops when telemetry is disabled. This replaces the earlier macro-based approach (`TracingInstrumentation.h` has been removed).
**What to do**:
- Create `src/xrpld/telemetry/TracingInstrumentation.h`:
- When `XRPL_ENABLE_TELEMETRY` is defined:
- Update `include/xrpl/telemetry/SpanGuard.h`:
- Add static factory methods that access the global `Telemetry::getInstance()` singleton and check the relevant component filter before creating a span:
```cpp
#define XRPL_TRACE_SPAN(telemetry, name) \
auto _xrpl_span_ = (telemetry).startSpan(name); \
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
#define XRPL_TRACE_RPC(telemetry, name) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((telemetry).shouldTraceRpc()) { \
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
}
#define XRPL_TRACE_SET_ATTR(key, value) \
if (_xrpl_guard_.has_value()) { \
_xrpl_guard_->setAttribute(key, value); \
}
#define XRPL_TRACE_EXCEPTION(e) \
if (_xrpl_guard_.has_value()) { \
_xrpl_guard_->recordException(e); \
}
// Each factory checks the global Telemetry instance internally.
// No Telemetry& reference needed at the call site.
auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
span.setAttribute("xrpl.rpc.command", command);
span.setAttribute("xrpl.rpc.status", status);
```
- When `XRPL_ENABLE_TELEMETRY` is NOT defined, all macros expand to `((void)0)`
- Factory methods: `rpcSpan()`, `txSpan()`, `consensusSpan()`, `peerSpan()`, `ledgerSpan()`, `span()`
- Use the pimpl idiom to hide all OTel types from the public header (zero `opentelemetry/` includes)
- When `XRPL_ENABLE_TELEMETRY` is NOT defined, the entire class compiles to a no-op stub with empty inline method bodies
**Key new file**:
- No separate `TracingInstrumentation.h` file is needed. All instrumentation call sites use `#include <xrpl/telemetry/SpanGuard.h>` directly.
- `src/xrpld/telemetry/TracingInstrumentation.h`
**Key modified file**:
- `include/xrpl/telemetry/SpanGuard.h`
**Reference**:
- [04-code-samples.md §4.3](./04-code-samples.md) — Full macro definitions for `XRPL_TRACE_SPAN`, `XRPL_TRACE_RPC`, `XRPL_TRACE_CONSENSUS`, `XRPL_TRACE_SET_ATTR`, `XRPL_TRACE_EXCEPTION` with both enabled and disabled branches
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: compile-time `#ifndef` and runtime `shouldTrace*()` checks
- [04-code-samples.md §4.3](./04-code-samples.md) — SpanGuard API reference: factory methods, usage patterns, compile-time disabled behavior, and discard support
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: factory methods handle compile-time and runtime checks internally
- [03-implementation-strategy.md §3.9.7](./03-implementation-strategy.md) — Before/after code examples showing minimal intrusiveness (~1-3 lines per instrumentation point)
---
@@ -325,17 +333,17 @@
**What to do**:
- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
- `#include` the `TracingInstrumentation.h` header
- `#include <xrpl/telemetry/SpanGuard.h>`
- In `ServerHandler::onRequest(Session& session)`:
- At the top of the method, add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request");`
- After the RPC command name is extracted, set attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command);`
- After the response status is known, set: `XRPL_TRACE_SET_ATTR("http.status_code", static_cast<int64_t>(statusCode));`
- Wrap error paths with: `XRPL_TRACE_EXCEPTION(e);`
- At the top of the method, add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.request");`
- After the RPC command name is extracted, set attribute: `span.setAttribute("xrpl.rpc.command", command);`
- After the response status is known, set: `span.setAttribute("http.status_code", static_cast<int64_t>(statusCode));`
- Wrap error paths with: `span.recordException(e);`
- In `ServerHandler::processRequest(...)`:
- Add a child span: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.process");`
- Set method attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.method", request_method);`
- Add a child span: `auto span = telemetry::SpanGuard::rpcSpan("rpc.process");`
- Set method attribute: `span.setAttribute("xrpl.rpc.method", request_method);`
- In `ServerHandler::onWSMessage(...)` (WebSocket path):
- Add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.ws.message");`
- Add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.ws.message");`
- The goal is to see spans like:
```
@@ -350,7 +358,7 @@
**Reference**:
- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample with W3C header extraction, span creation, attribute setting, and error handling
- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample using SpanGuard factory methods
- [01-architecture-analysis.md §1.5](./01-architecture-analysis.md) — RPC request flow diagram: HTTP request -> attributes -> jobqueue.enqueue -> rpc.command -> response
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.request` in `ServerHandler.cpp::onRequest()` (Priority: High)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming convention: `rpc.request`, `rpc.command.*`
@@ -366,15 +374,15 @@
**What to do**:
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
- `#include` the `TracingInstrumentation.h` header
- `#include <xrpl/telemetry/SpanGuard.h>`
- In `doCommand(RPC::JsonContext& context, Json::Value& result)`:
- At the top: `XRPL_TRACE_RPC(context.app.getTelemetry(), "rpc.command." + context.method);`
- At the top: `auto span = telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);`
- Set attributes:
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", context.method);`
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));`
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
- On success: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");`
- On error: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");` and set the error message
- `span.setAttribute("xrpl.rpc.command", context.method);`
- `span.setAttribute("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));`
- `span.setAttribute("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
- On success: `span.setAttribute("xrpl.rpc.status", "success");`
- On error: `span.setAttribute("xrpl.rpc.status", "error");` and set the error message
- After this, traces in Tempo/Grafana should look like:
```
@@ -402,7 +410,7 @@
> **OTLP** = OpenTelemetry Protocol
**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.
**Objective**: Prove the full pipeline works: xrpld emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.
**What to do**:
@@ -414,7 +422,7 @@
Verify Collector health: `curl http://localhost:13133`
2. **Build rippled with telemetry**:
2. **Build xrpld with telemetry**:
```bash
# Adjust for your actual build workflow
@@ -423,8 +431,8 @@
cmake --build --preset default
```
3. **Configure rippled**:
Add to `rippled.cfg` (or your local test config):
3. **Configure xrpld**:
Add to `xrpld.cfg` (or your local test config):
```ini
[telemetry]
@@ -434,10 +442,10 @@
trace_rpc=1
```
4. **Start rippled** in standalone mode:
4. **Start xrpld** in standalone mode:
```bash
./rippled --conf rippled.cfg -a --start
./rippled --conf xrpld.cfg -a --start
```
5. **Generate RPC traffic**:
@@ -462,21 +470,21 @@
6. **Verify in Grafana (Tempo)**:
- Open `http://localhost:3000`
- Navigate to Explore → select Tempo datasource
- Search for service `rippled`
- Search for service `xrpld`
- Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
- Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`
7. **Verify zero-overhead when disabled**:
- Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`, or set `enabled=0` in config
- Run the same RPC calls
- Confirm no new traces appear and no errors in rippled logs
- Confirm no new traces appear and no errors in xrpld logs
**Verification Checklist**:
- [ ] Docker stack starts without errors
- [ ] rippled builds with `-DXRPL_ENABLE_TELEMETRY=ON`
- [ ] rippled starts and connects to OTel Collector (check rippled logs for telemetry messages)
- [ ] Traces appear in Grafana/Tempo under service "rippled"
- [ ] xrpld builds with `-DXRPL_ENABLE_TELEMETRY=ON`
- [ ] xrpld starts and connects to OTel Collector (check xrpld logs for telemetry messages)
- [ ] Traces appear in Grafana/Tempo under service "xrpld"
- [ ] Span hierarchy is correct (parent-child relationships)
- [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
- [ ] Error spans show error status and message
@@ -502,13 +510,13 @@
**What to do**:
- Take screenshots of Grafana/Tempo showing:
- The service list with "rippled"
- The service list with "xrpld"
- A trace with the full span tree
- Span detail view showing attributes
- Document any issues encountered (build issues, SDK quirks, missing attributes)
- Note performance observations (build time impact, any noticeable runtime overhead)
- Write a short summary of what the POC proves and what it doesn't cover yet:
- **Proves**: OTel SDK integrates with rippled, OTLP export works, RPC traces visible
- **Proves**: OTel SDK integrates with xrpld, OTLP export works, RPC traces visible
- **Doesn't cover**: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
- Outline next steps (mapping to the full plan phases):
- [Phase 2](./06-implementation-phases.md) completion: [W3C header extraction](./02-design-decisions.md) (§2.5), WebSocket tracing, all [RPC handlers](./01-architecture-analysis.md) (§1.6)
@@ -537,7 +545,7 @@
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
| 5 | Instrumentation macros | 1 | 0 | 2 |
| 5 | SpanGuard factory methods | 0 | 1 | 2 |
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
| 8 | End-to-end verification | 0 | 0 | 0-7 |
@@ -615,6 +623,6 @@ Issues encountered during POC implementation that inform future work:
| Conan package only builds OTLP HTTP exporter, not gRPC | Switched from gRPC to HTTP exporter (`localhost:4318/v1/traces`) | HTTP exporter is the default; gRPC requires custom Conan profile |
| CMake target `opentelemetry-cpp::api` etc. don't exist in Conan package | Use umbrella target `opentelemetry-cpp::opentelemetry-cpp` | Conan targets differ from upstream CMake targets |
| OTel Collector `logging` exporter deprecated | Renamed to `debug` exporter | Use `debug` in all collector configs going forward |
| Macro parameter `telemetry` collided with `::xrpl::telemetry::` namespace | Renamed macro params to `_tel_obj_`, `_span_name_` | Avoid common words as macro parameter names |
| Macro parameter `telemetry` collided with `::xrpl::telemetry::` namespace | Replaced macros with SpanGuard factory methods (no macros needed) | Factory methods avoid macro hygiene issues entirely |
| `opentelemetry::trace::Scope` creates new context on move | Store scope as member, create once in constructor | SpanGuard move semantics need care with Scope lifecycle |
| `TracerProviderFactory::Create` returns `unique_ptr<sdk::TracerProvider>`, not `nostd::shared_ptr` | Use `std::shared_ptr` member, wrap in `nostd::shared_ptr` for global provider | OTel SDK factory return types don't match API provider types |

View File

@@ -38,7 +38,7 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**What to do**:
- Create `docker/telemetry/docker-compose.workload.yaml`:
- 5 rippled validator nodes with UNL configured for each other
- 5 xrpld validator nodes with UNL configured for each other
- All telemetry enabled: `[telemetry] enabled=1`, `[insight] server=otel`
- Full OTel stack: Collector, Tempo, Prometheus, Loki, Grafana
- Shared network with service discovery
@@ -66,7 +66,7 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**What to do**:
- Create `docker/telemetry/workload/rpc_load_generator.py`:
- Connects to one or more rippled WebSocket endpoints
- Connects to one or more xrpld WebSocket endpoints
- Fires all RPC commands that have trace spans: `server_info`, `ledger`, `tx`, `account_info`, `account_lines`, `fee`, `submit`, etc.
- Configurable parameters: rate (RPS), duration, command distribution weights
- Injects `traceparent` HTTP headers to test W3C context propagation
@@ -129,8 +129,8 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**Metric validation** (queries Prometheus API):
- Assert all SpanMetrics-derived metrics are non-zero: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_milliseconds_bucket`
- Assert all StatsD metrics are non-zero: `rippled_LedgerMaster_Validated_Ledger_Age`, `rippled_Peer_Finder_Active_*`, etc.
- Assert all Phase 9 metrics are non-zero: `rippled_nodestore_*`, `rippled_cache_*`, `rippled_txq_*`, `rippled_rpc_method_*`, `rippled_object_count`, `rippled_load_factor*`
- Assert all StatsD metrics are non-zero: `xrpld_LedgerMaster_Validated_Ledger_Age`, `xrpld_Peer_Finder_Active_*`, etc.
- Assert all Phase 9 metrics are non-zero: `xrpld_nodestore_*`, `xrpld_cache_*`, `xrpld_txq_*`, `xrpld_rpc_method_*`, `xrpld_object_count`, `xrpld_load_factor*`
- Assert metric label cardinality is within bounds
**Log-trace correlation validation** (queries Loki API):
@@ -191,7 +191,7 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**What to do**:
- Create a CI workflow (GitHub Actions or equivalent) that:
1. Builds rippled with `-DXRPL_ENABLE_TELEMETRY=ON`
1. Builds xrpld with `-DXRPL_ENABLE_TELEMETRY=ON`
2. Starts the multi-node workload harness
3. Runs the RPC load generator + transaction submitter for 2 minutes
4. Runs the validation suite

View File

@@ -2,7 +2,7 @@
> **Status**: Future Enhancement
>
> **Goal**: Build a custom OTel Collector receiver that periodically polls rippled's admin RPCs and exports structured metrics for external consumers — making all XRPL health, validator, peer, fee, and DEX data available as Prometheus/OTLP metrics without rippled code changes.
> **Goal**: Build a custom OTel Collector receiver that periodically polls xrpld's admin RPCs and exports structured metrics for external consumers — making all XRPL health, validator, peer, fee, and DEX data available as Prometheus/OTLP metrics without xrpld code changes.
>
> **Scope**: Go-based OTel Collector receiver plugin + Grafana dashboards + Prometheus alerting rules.
>
@@ -20,7 +20,7 @@
### Third-Party Consumer Gap Analysis
This phase addresses the cross-cutting gap identified during research: **rippled has no native Prometheus/OTLP metrics export for data accessible only via RPC**. Every consumer (exchanges, payment processors, analytics providers, validators, researchers, compliance firms, custodians) must build custom JSON-RPC polling and conversion. This receiver centralizes that work.
This phase addresses the cross-cutting gap identified during research: **xrpld has no native Prometheus/OTLP metrics export for data accessible only via RPC**. Every consumer (exchanges, payment processors, analytics providers, validators, researchers, compliance firms, custodians) must build custom JSON-RPC polling and conversion. This receiver centralizes that work.
| Consumer Category | Data Unlocked by This Phase |
| -------------------------- | ------------------------------------------------------------------ |
@@ -39,7 +39,7 @@ This phase addresses the cross-cutting gap identified during research: **rippled
## Task 11.1: OTel Collector Receiver Scaffold
**Objective**: Create the Go project structure for a custom OTel Collector receiver that polls rippled JSON-RPC.
**Objective**: Create the Go project structure for a custom OTel Collector receiver that polls xrpld JSON-RPC.
**What to do**:
@@ -52,8 +52,8 @@ This phase addresses the cross-cutting gap identified during research: **rippled
- Configuration model:
```yaml
rippled_receiver:
endpoint: "http://localhost:5005" # rippled admin RPC
xrpld_receiver:
endpoint: "http://localhost:5005" # xrpld admin RPC
poll_interval: 30s # how often to poll
enabled_collectors:
- server_info
@@ -249,7 +249,7 @@ This phase addresses the cross-cutting gap identified during research: **rippled
- `xrpl_fee_minimum_fee_drops` — base reference fee
- `xrpl_fee_queue_size` — current queue depth
- This overlaps with Phase 9's internal TxQ metrics but provides an external-only collection path that doesn't require rippled code changes.
- This overlaps with Phase 9's internal TxQ metrics but provides an external-only collection path that doesn't require xrpld code changes.
**Key files**:
@@ -362,24 +362,24 @@ This phase addresses the cross-cutting gap identified during research: **rippled
**What to do**:
- **Validator Health** (`rippled-validator-health`):
- **Validator Health** (`xrpld-validator-health`):
- Server state timeline, state duration breakdown
- Proposer count trend, converge time trend, validation quorum
- Validator list expiration countdown
- Amendment voting status (majority/enabled/vetoed)
- **Network Topology** (`rippled-network-topology`):
- **Network Topology** (`xrpld-network-topology`):
- Peer count (inbound/outbound/cluster), peer version distribution
- Peer latency distribution (p50/p95/p99), diverged peer count
- Geographic distribution (if enriched with GeoIP)
- Peer uptime distribution
- **Fee Market** (`rippled-fee-market-external`):
- **Fee Market** (`xrpld-fee-market-external`):
- Current fee levels (open ledger, median, minimum), fee escalation timeline
- Queue depth vs. capacity, transactions per ledger
- Load factor breakdown (server/network/cluster/escalation)
- **DEX & AMM Overview** (`rippled-dex-amm`) (only populated when DEX collectors are configured):
- **DEX & AMM Overview** (`xrpld-dex-amm`) (only populated when DEX collectors are configured):
- AMM pool TVL, reserve ratios, LP token supply
- Order book depth per pair, spread trends
- Trading fee revenue estimates
@@ -405,7 +405,7 @@ This phase addresses the cross-cutting gap identified during research: **rippled
- Verify alerting rules fire correctly (inject a "bad" state and check alert)
- Update `docker/telemetry/docker-compose.workload.yaml`:
- Add the custom OTel Collector build with the rippled receiver
- Add the custom OTel Collector build with the xrpld receiver
- Configure the receiver to poll one of the test nodes
**Key files**:
@@ -448,6 +448,6 @@ This phase addresses the cross-cutting gap identified during research: **rippled
- [ ] Prometheus alerting rules fire correctly for simulated failure conditions
- [ ] DEX/AMM collector works when configured (optional — not required for base exit criteria)
- [ ] Phase 10 validation suite passes with receiver metrics included
- [ ] Receiver handles rippled restart/unavailability gracefully (no crash, logs warning, retries)
- [ ] Receiver handles xrpld restart/unavailability gracefully (no crash, logs warning, retries)
- [ ] Documentation complete: receiver README, metric reference, alerting playbook
- [ ] Go receiver has unit tests with >80% coverage

View File

@@ -1,10 +1,10 @@
# Phase 2: RPC Tracing Completion Task List
> **Goal**: Complete full RPC tracing coverage with W3C Trace Context propagation, unit tests, and performance validation. Build on the POC foundation to achieve production-quality RPC observability.
> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, node health attributes, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability.
>
> **Scope**: W3C header extraction, TraceContext propagation utilities, unit tests for core telemetry, integration tests for RPC tracing, and performance benchmarks.
> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, node health span attributes, config validation (`std::clamp`).
>
> **Branch**: `pratik/otel-phase2-rpc-tracing` (from `pratik/OpenTelemetry_and_DistributedTracing_planning`)
> **Branch**: `pratik/otel-phase2-rpc-tracing` (from `pratik/otel-phase1c-rpc-integration`)
### Related Plan Documents
@@ -16,59 +16,23 @@
---
## Task 2.1: Implement W3C Trace Context HTTP Header Extraction
## Task 2.1: W3C Trace Context HTTP Header Extraction
**Objective**: Extract `traceparent` and `tracestate` headers from incoming HTTP RPC requests so external callers can propagate their trace context into rippled.
**Status**: DEFERRED → Phase 3
**What to do**:
**Reason**: W3C context propagation (`traceparent`/`tracestate` headers) requires a consumer — in Phase 2, RPC spans are entirely local to the node. Phase 3 introduces cross-node transaction tracing via protobuf context propagation, which is the first use case for extracted trace context. Implementing it here without a consumer would be dead code.
- Create `include/xrpl/telemetry/TraceContextPropagator.h`:
- `extractFromHeaders(headerGetter)` - extract W3C traceparent/tracestate from HTTP headers
- `injectToHeaders(ctx, headerSetter)` - inject trace context into response headers
- Use OTel's `TextMapPropagator` with `W3CTraceContextPropagator` for standards compliance
- Only compiled when `XRPL_ENABLE_TELEMETRY` is defined
- Create `src/libxrpl/telemetry/TraceContextPropagator.cpp`:
- Implement a simple `TextMapCarrier` adapter for HTTP headers
- Use `opentelemetry::context::propagation::GlobalTextMapPropagator` for extraction/injection
- Register the W3C propagator in `TelemetryImpl::start()`
- Modify `src/xrpld/rpc/detail/ServerHandler.cpp`:
- In the HTTP request handler, extract parent context from headers before creating span
- Pass extracted context to `startSpan()` as parent
- Inject trace context into response headers
**Key new files**:
- `include/xrpl/telemetry/TraceContextPropagator.h`
- `src/libxrpl/telemetry/TraceContextPropagator.cpp`
**Key modified files**:
- `src/xrpld/rpc/detail/ServerHandler.cpp`
- `src/libxrpl/telemetry/Telemetry.cpp` (register W3C propagator)
**Reference**:
- [04-code-samples.md §4.4.2](./04-code-samples.md) — TraceContextPropagator with extractFromHeaders/injectToHeaders
- [02-design-decisions.md §2.5](./02-design-decisions.md) — W3C Trace Context propagation design
**Implemented in**: `pratik/otel-phase3-tx-tracing``TraceContextPropagator.h/.cpp`
---
## Task 2.2: Add XRPL_TRACE_PEER Macro
## Task 2.2: Per-Category Span Creation
**Objective**: Add the missing peer-tracing macro for future Phase 3 use and ensure macro completeness.
**Status**: COMPLETE (superseded by Phase 1c design)
**What to do**:
**Original plan**: Add `XRPL_TRACE_PEER` and `XRPL_TRACE_LEDGER` macros.
- Edit `src/xrpld/telemetry/TracingInstrumentation.h`:
- Add `XRPL_TRACE_PEER(_tel_obj_, _span_name_)` macro that checks `shouldTracePeer()`
- Add `XRPL_TRACE_LEDGER(_tel_obj_, _span_name_)` macro (for future ledger tracing)
- Ensure disabled variants expand to `((void)0)`
**Key modified file**:
- `src/xrpld/telemetry/TracingInstrumentation.h`
**Actual implementation**: Phase 1c replaced all tracing macros with the `SpanGuard::span(TraceCategory, prefix, name)` factory pattern. The `TraceCategory` enum (`Rpc`, `Transactions`, `Consensus`, `Peer`, `Ledger`) serves the same conditional-creation purpose without macros. No separate task needed — the factory already supports all categories.
---
@@ -95,59 +59,41 @@
## Task 2.4: Unit Tests for Core Telemetry Infrastructure
**Status**: COMPLETE
**Objective**: Add unit tests for the core telemetry abstractions to validate correctness and catch regressions.
**What to do**:
**Implemented**:
- Create `src/test/telemetry/Telemetry_test.cpp`:
- Test NullTelemetry: verify all methods return expected no-op values
- Test Setup defaults: verify all Setup fields have correct defaults
- Test setup_Telemetry config parser: verify parsing of [telemetry] section
- Test enabled/disabled factory paths
- Test shouldTrace\* methods respect config flags
- `src/tests/libxrpl/telemetry/TelemetryConfig.cpp`:
- Test Setup defaults (all fields have correct initial values)
- Test `setup_Telemetry` config parser (empty section, full section, edge cases)
- Test `samplingRatio` clamping (values outside 0.0-1.0)
- Create `src/test/telemetry/SpanGuard_test.cpp`:
- Test SpanGuard RAII lifecycle (span ends on destruction)
- Test move constructor works correctly
- Test setAttribute, setOk, setStatus, addEvent, recordException
- Test context() returns valid context
- `src/tests/libxrpl/telemetry/SpanGuardFactory.cpp`:
- Test null guard methods are safe (setAttribute, setOk, setError, addEvent on null)
- Test category span returns null when telemetry disabled
- Test child/linked span null when no parent context
- Test move construction transfers ownership
- Test recordException safe on null guard
- Test discard() safe on null guard
- Add test files to CMake build
**Key new files**:
- `src/test/telemetry/Telemetry_test.cpp`
- `src/test/telemetry/SpanGuard_test.cpp`
**Reference**:
- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 exit criteria (unit tests passing)
- `src/tests/libxrpl/telemetry/main.cpp` — GTest runner
- `src/tests/libxrpl/CMakeLists.txt` — test target with optional OTel linking
---
## Task 2.5: Enhance RPC Span Attributes
**Objective**: Add additional attributes to RPC spans per the semantic conventions defined in the plan.
**Status**: DEFERRED (low priority)
**What to do**:
**Reason**: The high-value attributes (`command`, `version`, `role`, `status`) are already set by Phase 1c. The remaining HTTP transport-level attributes (`http.method`, `net.peer.ip`, `http.status_code`) provide limited additional insight since:
- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
- Add `http.method` attribute for HTTP requests
- Add `http.status_code` attribute for responses
- Add `net.peer.ip` attribute for client IP (if available)
- `http.method` is always POST for JSON-RPC
- `net.peer.ip` is debug-level info available in logs
- `xrpl.rpc.duration_ms` is redundant with span duration (OTel captures start/end time natively)
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
- Add `xrpl.rpc.duration_ms` attribute on completion
- Add error message attribute on failure: `xrpl.rpc.error_message`
**Key modified files**:
- `src/xrpld/rpc/detail/ServerHandler.cpp`
- `src/xrpld/rpc/detail/RPCHandler.cpp`
**Reference**:
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC attribute schema
These can be added later if dashboard queries specifically need them. The node health attributes (Task 2.8) provide far more operational value and were prioritized instead.
---
@@ -215,19 +161,49 @@ This surfaces all RPCs served during a blocked period — critical for post-inci
---
## Task 2.9: PathFind RPC Instrumentation
**Status**: COMPLETE
**Objective**: Trace the path_find and ripple_path_find RPC handlers to capture request latency and computation cost.
**Spans added**:
- `pathfind.request` — wraps `doPathFind()` and `doRipplePathFind()` RPC handlers
- `pathfind.compute` — wraps `PathRequest::doUpdate()` (fast/normal attr)
- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (ledger_index attr)
- `pathfind.discover` — wraps `Pathfinder::findPaths()` graph exploration (search_level attr)
- `pathfind.rank` — wraps `Pathfinder::computePathRanks()` liquidity validation (num_paths attr)
**New file**: `src/xrpld/rpc/detail/PathFindSpanNames.h`
**Modified files**:
- `src/xrpld/rpc/handlers/orderbook/PathFind.cpp`
- `src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp`
- `src/xrpld/rpc/detail/PathRequest.cpp`
- `src/xrpld/rpc/detail/PathRequestManager.cpp`
- `src/xrpld/rpc/detail/Pathfinder.cpp`
---
## Summary
| Task | Description | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------- | --------- | -------------- | ---------- |
| 2.1 | W3C Trace Context header extraction | 2 | 2 | POC |
| 2.2 | Add XRPL_TRACE_PEER/LEDGER macros | 0 | 1 | POC |
| 2.3 | Add shouldTraceLedger() interface method | 0 | 3 | POC |
| 2.4 | Unit tests for core telemetry | 2 | 1 | POC |
| 2.5 | Enhanced RPC span attributes | 0 | 2 | POC |
| 2.6 | Build verification and performance baseline | 0 | 0 | 2.1-2.5 |
| 2.8 | RPC span attribute enrichment (node health) | 0 | 1 | 2.5 |
| Task | Description | Status | Notes |
| ---- | ------------------------------------------- | ------------------- | ------------------------------------------------ |
| 2.1 | W3C Trace Context header extraction | Deferred → Phase 3 | No consumer in Phase 2; needs cross-node tracing |
| 2.2 | Per-category span creation | Complete (Phase 1c) | Superseded by TraceCategory enum + SpanGuard |
| 2.3 | Add shouldTraceLedger() interface method | Complete (Phase 1c) | Delivered in Phase 1c base branch |
| 2.4 | Unit tests for core telemetry | Complete | TelemetryConfig + SpanGuardFactory tests |
| 2.5 | Enhanced RPC span attributes (HTTP-level) | Deferred | Low value; span duration covers timing natively |
| 2.6 | Build verification and performance baseline | Complete | Verified in CI on Phase 1c |
| 2.7 | Grafana Tempo search filters | Complete | rpc-command, rpc-status, rpc-role filters |
| 2.8 | RPC span attribute enrichment (node health) | Complete | amendment_blocked + server_state |
| 2.9 | PathFind RPC instrumentation (5 spans) | Complete | request, compute, update_all, discover, rank |
**Parallel work**: Tasks 2.1, 2.2, 2.3 can run in parallel. Task 2.4 depends on 2.3. Task 2.5 can run in parallel with 2.4. Task 2.6 depends on all others. Task 2.8 depends on 2.5 (existing span creation must be in place).
**Delivered in this branch**: Tasks 2.4, 2.7, 2.8, 2.9.
**Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority).
**Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this).
---

View File

@@ -23,7 +23,7 @@
**What to do**:
- Edit `include/xrpl/proto/xrpl.proto` (or `src/ripple/proto/ripple.proto`, wherever the proto is):
- Edit `include/xrpl/proto/xrpl.proto` (or `src/xrpld/proto/ripple.proto`, wherever the proto is):
- Add `TraceContext` message definition:
```protobuf
message TraceContext {
@@ -97,7 +97,8 @@
- Inject current trace context into outgoing `TMTransaction::trace_context`
- Set `xrpl.tx.relay_count` attribute
- Include `TracingInstrumentation.h` and use `XRPL_TRACE_TX` macro
- Use `SpanGuard::span(TraceCategory::Transactions, "tx", "receive")` factory
(Phase 1c replaced macros with the SpanGuard factory pattern)
**Key modified files**:
@@ -165,27 +166,54 @@
## Task 3.6: Context Propagation in Transaction Relay
**Status**: COMPLETE
**Objective**: Ensure trace context flows correctly when transactions are relayed between peers, creating linked spans across nodes.
**What to do**:
**What was done**:
- Verify the relay path injects trace context:
- When `PeerImp` relays a transaction, the `TMTransaction` message should carry `trace_context`
- When a remote peer receives it, the context is extracted and used as parent
- **TX send side**: `NetworkOPs::apply()` now injects the tx.process span's trace
context into the outgoing `TMTransaction` protobuf before relay, using
`telemetry::injectSpanContext()`. The receiving node's `txReceiveSpan()` (already
wired in PeerImp) extracts the parent span_id and creates the tx.receive span
as a child of the sender's tx.process span.
- Test context propagation:
- Manually verify with 2+ node setup that trace IDs match across nodes
- Confirm parent-child span relationships are correct in Tempo
- **Proposal send/receive**: `RCLConsensus::Adaptor::propose()` injects the
current thread's active span context into the `TMProposeSet` protobuf via
`telemetry::injectToProtobuf()`. PeerImp creates a
`consensus.proposal.receive` span that extracts the sender's trace context
as parent (via `ConsensusReceiveTracing.h`).
- Handle edge cases:
- Missing trace context (older peers): create new root span
- Corrupted trace context: log warning, create new root span
- Sampled-out traces: respect trace flags
- **Validation send/receive**: `RCLConsensus::Adaptor::validate()` injects
the current thread's active span context into the `TMValidation` protobuf.
PeerImp creates a `consensus.validation.receive` span that extracts the
sender's trace context as parent.
- **Edge cases**: Missing trace context (older peers) degrades gracefully to
standalone spans. Invalid/corrupted context is treated as absent. Trace
flags are propagated and respected.
**New infrastructure**:
- `SpanGuard::getTraceBytes()` — extracts raw trace_id/span_id/trace_flags
from a span without exposing OTel types. Safe to call from any thread.
- `PropagationHelpers.h` — `injectSpanContext(SpanGuard&, proto)` bridge
between SpanGuard and protobuf TraceContext.
- `TraceContextPropagator.h` — `injectToProtobuf(ctx, proto)` for
same-thread injection via OTel RuntimeContext (used in propose/validate).
- `ConsensusReceiveTracing.h` — `proposalReceiveSpan()` and
`validationReceiveSpan()` helper functions that create receive spans with
optional parent context extraction from incoming protobuf messages.
**Key modified files**:
- `src/xrpld/overlay/detail/PeerImp.cpp`
- `src/xrpld/overlay/detail/OverlayImpl.cpp` (if relay method needs context param)
- `src/xrpld/app/misc/NetworkOPs.cpp` — tx relay injection
- `src/xrpld/app/consensus/RCLConsensus.cpp` — proposal/validation send injection
- `src/xrpld/overlay/detail/PeerImp.cpp` — proposal/validation receive spans
- `include/xrpl/telemetry/SpanGuard.h` — `TraceBytes` struct, `getTraceBytes()`
- `src/libxrpl/telemetry/SpanGuard.cpp` — `getTraceBytes()` implementation
- `src/xrpld/telemetry/PropagationHelpers.h` — inject helpers (new file)
- `src/xrpld/telemetry/ConsensusReceiveTracing.h` — receive span helpers (new file)
**Reference**:
@@ -223,7 +251,7 @@
> **Upstream**: Phase 2 (RPC span infrastructure must exist).
> **Downstream**: Phase 10 (validation checks for this attribute).
**Objective**: Add the relaying peer's rippled version to `tx.receive` spans so operators can correlate transaction issues with peer version mismatches during network upgrades.
**Objective**: Add the relaying peer's xrpld version to `tx.receive` spans so operators can correlate transaction issues with peer version mismatches during network upgrades.
**What to do**:
@@ -234,9 +262,9 @@
**New span attribute**:
| Attribute | Type | Source | Example |
| ------------------- | ------ | -------------------- | ----------------- |
| `xrpl.peer.version` | string | `peer->getVersion()` | `"rippled-2.4.0"` |
| Attribute | Type | Source | Example |
| ------------------- | ------ | -------------------- | --------------- |
| `xrpl.peer.version` | string | `peer->getVersion()` | `"xrpld-2.4.0"` |
**Rationale**: Transaction relay is where version mismatches cause subtle serialization or validation bugs. Tracing "this tx came from a v2.3.0 peer" helps diagnose compatibility issues. The community dashboard tracks peer versions externally; this brings version awareness into the trace itself.
@@ -252,6 +280,192 @@
---
## Task 3.9: Deterministic Transaction Trace ID
> **Upstream**: Task 3.2 (protobuf serialization), Task 3.3 (PeerImp span exists).
> **Downstream**: Phase 10 (workload validation can query by tx hash directly).
> **Pattern**: Mirrors the consensus deterministic trace ID in Phase 4a
> (`createDeterministicContext` in `RCLConsensus.cpp`), adapted for transactions.
**Objective**: Derive the trace_id for transaction spans deterministically from the
transaction hash so that all nodes handling the same transaction independently produce
spans under the same trace_id — regardless of whether protobuf context propagation
succeeds.
**Why**: The current approach creates spans with random trace_ids and relies entirely
on protobuf `TraceContext` propagation to link them. If any hop in the relay chain
drops the context (older peers, message corruption, mixed-version networks), the trace
splits and downstream spans become impossible to find. With deterministic trace_ids,
correlation is guaranteed because every node derives the same trace_id from the same
`txID`.
**Approach — deterministic trace_id + protobuf span_id propagation**:
1. Derive `trace_id = txHash[0:16]` (first 16 bytes of the 32-byte transaction hash).
2. Generate a random 8-byte `span_id` per node (each node's span is unique within
the shared trace).
3. Create the span under this deterministic context as parent.
4. **Additionally**, if protobuf `TraceContext` is present in the incoming
`TMTransaction` message, extract the sender's `span_id` and use it as the span's
parent — this preserves parent-child ordering in the trace tree.
5. If protobuf context is absent (older peer, first hop), the span still has the
correct deterministic `trace_id` — it appears as a sibling root in the same trace
rather than being lost.
This gives the best of both worlds: guaranteed cross-node correlation via deterministic
`trace_id`, plus parent-child relay ordering via protobuf `span_id` when available.
**What to do**:
- Create `createDeterministicTxContext(uint256 const& txHash)` utility function:
- Location: shared header or file-local in `PeerImp.cpp` and `NetworkOPs.cpp`
(or a shared telemetry utility if both need it).
- Pattern: identical to `createDeterministicContext(uint256 const& ledgerId)` in
`RCLConsensus.cpp` — take `txHash[0:16]` as trace_id, random span_id via
`default_prng()`, sampled flag set, `remote=false`.
- Guard behind `#ifdef XRPL_ENABLE_TELEMETRY`.
```cpp
opentelemetry::context::Context
createDeterministicTxContext(uint256 const& txHash)
{
namespace trace = opentelemetry::trace;
// First 16 bytes of the 32-byte tx hash as trace ID.
trace::TraceId traceId(
opentelemetry::nostd::span<uint8_t const, 16>(txHash.data(), 16));
// Random span_id so each node's span is unique within the trace.
uint8_t spanIdBytes[8];
auto const rval = default_prng()();
std::memcpy(spanIdBytes, &rval, sizeof(spanIdBytes));
trace::SpanId spanId(
opentelemetry::nostd::span<uint8_t const, 8>(spanIdBytes, 8));
trace::SpanContext syntheticCtx(
traceId, spanId, trace::TraceFlags(1), /* remote = */ false);
return opentelemetry::context::Context{}.SetValue(
trace::kSpanKey,
opentelemetry::nostd::shared_ptr<trace::Span>(
new trace::DefaultSpan(syntheticCtx)));
}
```
- Edit `src/xrpld/overlay/detail/PeerImp.cpp` — restructure `handleTransaction()`:
- **Move span creation after deserialization** (txID must be known first):
1. Deserialize `STTx` and get `txID` (existing code at line ~1382).
2. Create deterministic parent context: `auto detCtx = createDeterministicTxContext(txID)`.
3. If `m->has_trace_context()`: extract protobuf context via `extractFromProtobuf()`,
**combine** with deterministic trace_id — use the protobuf span_id as parent
to preserve relay ordering, but override trace_id with the deterministic one.
4. If no protobuf context: create span under `detCtx` directly.
5. Set all existing attributes (`hash`, `peerId`, `peerVersion`, `suppressed`, etc.).
- **Combining deterministic trace_id with protobuf parent span_id**:
When both are available, construct a synthetic `SpanContext` with:
- `trace_id` = `txHash[0:16]` (deterministic)
- `span_id` = extracted from protobuf (sender's span_id → becomes parent)
- `trace_flags` = from protobuf
- `remote` = true (came from another node)
```cpp
// Pseudo-code for the combined context:
auto detTraceId = trace::TraceId(txHash.data(), 16);
auto remoteSpanId = /* from extractFromProtobuf */;
auto remoteFlags = /* from extractFromProtobuf */;
trace::SpanContext combinedCtx(
detTraceId, remoteSpanId, remoteFlags, /* remote = */ true);
// Use as parent context for the new span.
```
- Edit `src/xrpld/app/misc/NetworkOPs.cpp` — update `processTransaction()`:
- `transaction->getID()` is already available at the top of the function.
- Create deterministic parent context from `txID`.
- Create `tx.process` span under this context.
- No protobuf context to extract here (NetworkOPs is intra-node), so
deterministic context alone is sufficient.
- Add `tx_trace_strategy` attribute to spans:
- Add `inline constexpr auto traceStrategy = join(xrplTx, makeStr("trace_strategy"));`
to `TxSpanNames.h`.
- Set on each tx span: `span.setAttribute(tx_span::attr::traceStrategy, "deterministic")`.
**Key new/modified files**:
- `src/xrpld/overlay/detail/PeerImp.cpp` — restructured span creation
- `src/xrpld/app/misc/NetworkOPs.cpp` — deterministic context for tx.process
- `src/xrpld/app/misc/TxSpanNames.h` — new `traceStrategy` attribute constant
- New or shared utility for `createDeterministicTxContext()` (location TBD: could be
a shared header like `include/xrpl/telemetry/DeterministicContext.h`, or file-local
if only used in two places)
**Interaction with existing tasks**:
- **Task 3.3 (PeerImp instrumentation)**: The span creation in `handleTransaction()`
must be restructured — the span currently starts before `txID` is known. This task
moves it after deserialization.
- **Task 3.6 (Relay context propagation)**: Protobuf injection at the relay site
remains the same — `injectToProtobuf()` serializes the current span's `span_id`.
The receiver extracts it and combines with the deterministic `trace_id`.
- **Phase 4a (Consensus deterministic trace ID)**: This task follows the same pattern.
Consider extracting a shared utility (e.g., `createDeterministicContext(uint256)`)
that both consensus and transaction tracing use.
**Exit Criteria**:
- [ ] `tx.receive` and `tx.process` spans have deterministic trace_id = `txHash[0:16]`
- [ ] All nodes handling the same transaction produce spans under the same trace_id
- [x] Protobuf `span_id` propagation still works when available (parent-child ordering)
- [ ] Missing protobuf context (old peer) degrades gracefully to sibling spans, not lost traces
- [ ] `xrpl.tx.trace_strategy` attribute set to `"deterministic"` on all tx spans
- [ ] Trace queryable by tx hash (truncate hash → trace_id → direct lookup in Tempo)
**Deliverables implemented (not in original plan)**:
- **`SpanGuard::txSpan()` factory method** (`include/xrpl/telemetry/SpanGuard.h`):
Two overloads for creating transaction spans with deterministic trace IDs:
- `txSpan(category, group, name, txHash)` — standalone span (deterministic
trace_id from `txHash[0:16]`, no parent span_id).
- `txSpan(category, group, name, txHash, parentCtx)` — child span (deterministic
trace_id combined with protobuf-extracted parent span_id for relay ordering).
- **`TxTracing.h` helper functions** (`src/xrpld/overlay/detail/TxTracing.h`):
File-local helpers that wrap `SpanGuard::txSpan()` for the two main PeerImp call
sites:
- `txReceiveSpan(txHash, parentCtx)` — creates `tx.receive` span with
deterministic trace_id and optional protobuf parent context.
- `txProcessSpan(txHash)` — creates `tx.process` span with deterministic
trace_id only (no protobuf parent, used intra-node).
- **Note**: `TxTracing.h` includes `xrpl.pb.h` unconditionally (outside
`#ifdef XRPL_ENABLE_TELEMETRY`) because `protocol::TMTransaction` appears in
the function signatures regardless of telemetry build mode.
---
## Task 3.10: TxQ Instrumentation
**Status**: COMPLETE
**Objective**: Trace the transaction queue lifecycle — enqueue decisions, direct apply, batch clear, ledger-close accept loop, per-tx apply, and cleanup.
**Spans added**:
- `txq.enqueue` — wraps `TxQ::apply()` with tx_hash attribute
- `txq.apply_direct` — wraps `TxQ::tryDirectApply()` fast-path
- `txq.batch_clear` — wraps `TxQ::tryClearAccountQueueUpThruTx()`
- `txq.accept` — wraps `TxQ::accept()` ledger-close dequeue with queue_size attr
- `txq.accept_tx` — per-tx span inside accept loop with tx_hash, ter_code,
retries_remaining attributes
- `txq.cleanup` — wraps `TxQ::processClosedLedger()` with ledger_seq attribute
**New file**: `src/xrpld/app/misc/detail/TxQSpanNames.h`
**Modified file**: `src/xrpld/app/misc/detail/TxQ.cpp`
---
## Summary
| Task | Description | New Files | Modified Files | Depends On |
@@ -264,34 +478,24 @@
| 3.6 | Relay context propagation | 0 | 1-2 | 3.3, 3.5 |
| 3.7 | Build verification and testing | 0 | 0 | 3.1-3.6 |
| 3.8 | TX span peer version attribute | 0 | 1 | 3.3 |
| 3.9 | Deterministic transaction trace ID | 0-1 | 3 | 3.2, 3.3 |
| 3.10 | TxQ instrumentation (6 spans) | 1 | 1 | 3.4 |
**Parallel work**: Tasks 3.1 and 3.4 can start in parallel. Task 3.2 depends on 3.1. Tasks 3.3 and 3.5 depend on 3.2. Task 3.6 depends on 3.3 and 3.5. Task 3.8 depends on 3.3 (span must exist).
**Parallel work**: Tasks 3.1 and 3.4 can start in parallel. Task 3.2 depends on 3.1. Tasks 3.3 and 3.5 depend on 3.2. Task 3.6 depends on 3.3 and 3.5. Task 3.8 depends on 3.3 (span must exist). Task 3.9 depends on 3.2 and 3.3. Task 3.10 depends on 3.4 (tx.process span must exist).
**Exit Criteria** (from [06-implementation-phases.md §6.11.3](./06-implementation-phases.md)):
- [ ] Transaction traces span across nodes
- [ ] Trace context in Protocol Buffer messages
- [x] Transaction traces span across nodes
- [x] Trace context in Protocol Buffer messages
- [ ] HashRouter deduplication visible in traces
- [ ] <5% overhead on transaction throughput
- [x] Deterministic trace_id: same trace_id for same tx across all nodes
- [x] Protobuf span_id propagation preserves parent-child ordering when available
---
## Known Issues / Future Work
### Propagation utilities not yet wired into P2P flow
`extractFromProtobuf()` and `injectToProtobuf()` in `TraceContextPropagator.h`
are implemented and tested but not called from production code. To enable
cross-node distributed traces:
- Call `injectToProtobuf()` in `PeerImp` when sending `TMTransaction` /
`TMProposeSet` messages
- Call `extractFromProtobuf()` in the corresponding message handlers to
reconstruct the parent span context, then pass it to `startSpan()` as the
parent
This was deferred to validate single-node tracing performance first.
### Unused trace_state proto field
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for

View File

@@ -17,30 +17,25 @@
---
## Task 4.1: Instrument Consensus Round Start
## Task 4.1: Instrument Consensus Round Start
**Objective**: Create a root span for each consensus round that captures the round's key parameters.
**What to do**:
**Status**: DONE (implemented via Task 4a.2 `startRoundTracing()` helper).
- Edit `src/xrpld/app/consensus/RCLConsensus.cpp`:
- In `RCLConsensus::startRound()` (or the Adaptor's startRound):
- Create `consensus.round` span using `XRPL_TRACE_CONSENSUS` macro
- Set attributes:
- `xrpl.consensus.ledger.prev` — previous ledger hash
- `xrpl.consensus.ledger.seq` — target ledger sequence
- `xrpl.consensus.proposers` — number of trusted proposers
- `xrpl.consensus.mode` — "proposing" or "observing"
- Store the span context for use by child spans in phase transitions
**What was done**:
- Add a member to hold current round trace context:
- `opentelemetry::context::Context currentRoundContext_` (guarded by `#ifdef`)
- Updated at round start, used by phase transition spans
- `RCLConsensus::Adaptor::startRoundTracing()` creates `consensus.round` span
via `SpanGuard::hashSpan()` (deterministic) or `SpanGuard::span()` (attribute strategy)
- Attributes set: `xrpl.consensus.ledger_id`, `xrpl.consensus.ledger.seq`,
`xrpl.consensus.mode`, `xrpl.consensus.trace_strategy`, `xrpl.consensus.round_id`
- Round span stored as `roundSpan_` member in `RCLConsensus::Adaptor`
- `roundSpanContext_` snapshot captured for cross-thread span linking
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `src/xrpld/app/consensus/RCLConsensus.h` (add context member)
- `src/xrpld/app/consensus/RCLConsensus.h` (span and context members)
**Reference**:
@@ -49,30 +44,27 @@
---
## Task 4.2: Instrument Phase Transitions
## Task 4.2: Instrument Phase Transitions
**Objective**: Create child spans for each consensus phase (open, establish, accept) to show timing breakdown.
**What to do**:
**Status**: DONE. All consensus phases are now instrumented:
- Edit `src/xrpld/app/consensus/RCLConsensus.cpp`:
- Identify where phase transitions occur (the `Consensus<Adaptor>` template drives this)
- For each phase entry:
- Create span as child of `currentRoundContext_`: `consensus.phase.open`, `consensus.phase.establish`, `consensus.phase.accept`
- Set `xrpl.consensus.phase` attribute
- Add `phase.enter` event at start, `phase.exit` event at end
- Record phase duration in milliseconds
- `consensus.establish` — created in `Consensus.h::startEstablishTracing()`
- `consensus.ledger_close` — created in `RCLConsensus.cpp::onClose()`
- `consensus.accept` / `consensus.accept.apply` — created in `onAccept()` / `doAccept()`
- `consensus.phase.open` `openSpan_` member in `Consensus.h`, created in `startRoundInternal()`, ended in `closeLedger()`
- In the `onClose` adaptor method:
- Create `consensus.ledger_close` span
- Set attributes: close_time, mode, transaction count in initial position
**Design notes**:
- Note: The Consensus template class in `src/xrpld/consensus/Consensus.h` drives phase transitions — Phase 4a instruments directly in the template
- `xrpl.consensus.phase` attribute — phases are distinguished by span names instead
- `phase.enter` / `phase.exit` events — not added (span start/end serves this purpose)
- `xrpl.consensus.phase_duration_ms` attribute — not set (span duration captures this)
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- Possibly `include/xrpl/consensus/Consensus.h` (for template-level phase tracking)
- `src/xrpld/consensus/Consensus.h` (template-level establish phase tracking)
**Reference**:
@@ -80,25 +72,26 @@
---
## Task 4.3: Instrument Proposal Handling
## Task 4.3: Instrument Proposal Handling
**Objective**: Trace proposal send and receive to show validator coordination.
**What to do**:
**Status**: DONE. Both send and receive paths are instrumented.
- Edit `src/xrpld/app/consensus/RCLConsensus.cpp`:
- In `Adaptor::propose()`:
- Create `consensus.proposal.send` span
- Set attributes: `xrpl.consensus.round` (proposal sequence), proposal hash
- Inject trace context into outgoing `TMProposeSet::trace_context` (from Phase 3 protobuf)
**What was done**:
- In `Adaptor::peerProposal()` (or wherever peer proposals are received):
- Extract trace context from incoming `TMProposeSet::trace_context`
- Create `consensus.proposal.receive` span as child of extracted context
- Set attributes: `xrpl.consensus.proposer` (node ID), `xrpl.consensus.round`
- In `Adaptor::propose()`:
- Creates `consensus.proposal.send` span via `SpanGuard::span()`
- Sets `xrpl.consensus.round` attribute
- In `Adaptor::share(RCLCxPeerPos)`:
- Create `consensus.proposal.relay` span for relaying peer proposals
- In `PeerImp::onMessage(TMProposeSet)`:
- Creates `consensus.proposal.receive` span
- Sets `xrpl.consensus.proposal.trusted` attribute (bool)
**Not implemented** (deferred to Phase 4b — cross-node propagation):
- `consensus.proposal.relay` span in `share(RCLCxPeerPos)` — requires trace context injection
- Trace context injection/extraction for `TMProposeSet::trace_context`
**Key modified files**:
@@ -111,73 +104,84 @@
---
## Task 4.4: Instrument Validation Handling
## Task 4.4: Instrument Validation Handling
**Objective**: Trace validation send and receive to show ledger validation flow.
**What to do**:
**Status**: DONE. Both send and receive paths are instrumented.
- Edit `src/xrpld/app/consensus/RCLConsensus.cpp` (or the validation handler):
- When sending our validation:
- Create `consensus.validation.send` span
- Set attributes: validated ledger hash, sequence, signing time
**What was done**:
- When receiving a peer validation:
- Extract trace context from `TMValidation::trace_context` (if present)
- Create `consensus.validation.receive` span
- Set attributes: `xrpl.consensus.validator` (node ID), ledger hash
- In `Adaptor::validate()` (called from `doAccept()`):
- Creates `consensus.validation.send` span via `Adaptor::createValidationSpan()`
- Uses `SpanGuard::linkedSpan()` to create a follows-from link to the round span
- Thread-safe: uses `roundSpanContext_` snapshot (captured on consensus thread,
read on jtACCEPT thread)
- Sets `xrpl.consensus.ledger.seq` and `xrpl.consensus.proposing` attributes
- In `PeerImp::onMessage(TMValidation)`:
- Creates `consensus.validation.receive` span
- Sets `xrpl.consensus.validation.trusted` attribute (bool)
- Sets `xrpl.consensus.validation.ledger_seq` attribute
**Not implemented** (deferred to Phase 4b — cross-node propagation):
- Validated ledger hash, signing time attributes on send span (see Task 4.8)
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `src/xrpld/app/misc/NetworkOPs.cpp` (if validation handling is here)
---
## Task 4.5: Add Consensus-Specific Attributes
## Task 4.5: Add Consensus-Specific Attributes
**Objective**: Enrich consensus spans with detailed attributes for debugging and analysis.
**What to do**:
**Status**: DONE. All core attributes are set across various spans, including the previously missing `tx_count` and `disputes_count`.
- Review all consensus spans and ensure they include:
- `xrpl.consensus.ledger.seq` — target ledger sequence number
- `xrpl.consensus.round`consensus round number
- `xrpl.consensus.mode` — proposing/observing/wrongLedger
- `xrpl.consensus.phase`current phase name
- `xrpl.consensus.phase_duration_ms` — time spent in phase
- `xrpl.consensus.proposers` — number of trusted proposers
- `xrpl.consensus.tx_count`transactions in proposed set
- `xrpl.consensus.disputes` — number of disputed transactions
- `xrpl.consensus.converge_percent` — convergence percentage
**Implemented attributes** (across various spans):
- `xrpl.consensus.ledger.seq` — on `consensus.round`, `consensus.accept.apply`
- `xrpl.consensus.round` — on `consensus.proposal.send`
- `xrpl.consensus.mode`on `consensus.round`, `consensus.ledger_close`
- `xrpl.consensus.proposers` — on `consensus.accept`, `consensus.establish`, `consensus.update_positions`
- `xrpl.consensus.converge_percent` — on `consensus.establish`, `consensus.update_positions`, `consensus.check`
- `xrpl.consensus.tx_count`on `consensus.accept.apply` span (in `doAccept()`)
- `xrpl.consensus.disputes_count` — on `consensus.update_positions` span (in `updateOurPositions()`)
**Design notes**:
- `xrpl.consensus.phase` — phases distinguished by span names instead
- `xrpl.consensus.phase_duration_ms` — span duration captures this
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `src/xrpld/consensus/Consensus.h`
---
## Task 4.6: Correlate Transaction and Consensus Traces
## Task 4.6: Correlate Transaction and Consensus Traces
**Objective**: Link transaction traces from Phase 3 with consensus traces so you can follow a transaction from submission through consensus into the ledger.
**What to do**:
**Status**: DONE. Transaction-consensus correlation implemented via `tx.included` events in `doAccept()`.
- In `onClose()` or `onAccept()`:
- When building the consensus position, link the round span to individual transaction spans using span links (if OTel SDK supports it) or events
- At minimum, record the transaction hashes included in the consensus set as span events: `tx.included` with `xrpl.tx.hash` attribute
**What was done**:
- In `processTransactionSet()` (NetworkOPs):
- If the consensus round span context is available, create child spans for each transaction applied to the ledger
- In `doAccept()` (RCLConsensus.cpp):
- Records `tx.included` events on the `consensus.accept.apply` span for each transaction in the accepted set
- Each event includes `xrpl.tx.id` attribute with the transaction hash
- This links consensus traces to individual transactions
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `src/xrpld/app/misc/NetworkOPs.cpp`
---
## Task 4.7: Build Verification and Testing
## Task 4.7: Build Verification and Testing
**Objective**: Verify all Phase 4 changes compile and don't affect consensus timing.
@@ -186,20 +190,20 @@
1. Build with `telemetry=ON` — verify no compilation errors
2. Build with `telemetry=OFF` — verify no regressions (critical for consensus code)
3. Run existing consensus-related unit tests
4. Verify that all macros expand to no-ops when disabled
4. Verify that `SpanGuard` factory methods compile to no-ops when disabled
5. Check that no consensus-critical code paths are affected by instrumentation overhead
**Verification Checklist**:
- [ ] Build succeeds with telemetry ON
- [ ] Build succeeds with telemetry OFF
- [ ] Existing consensus tests pass
- [ ] No new includes in consensus headers when telemetry is OFF
- [ ] Phase timing instrumentation doesn't use blocking operations
- [x] Build succeeds with telemetry ON
- [x] Build succeeds with telemetry OFF
- [x] Existing consensus tests pass
- [x] `SpanGuard` no-op implementation prevents overhead when telemetry is OFF
- [x] Phase timing instrumentation doesn't use blocking operations
---
## Task 4.8: Consensus Validation Span Enrichment — External Dashboard Parity
## Task 4.8: Consensus Validation Span Enrichment — NOT DONE
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — adds validation agreement context inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
>
@@ -208,6 +212,8 @@
**Objective**: Add ledger hash, validation type, and quorum data to consensus validation spans on both send and receive paths. This enables trace-level validation agreement analysis — filter by ledger hash to see which validators agreed for a given ledger.
**Status**: Not implemented. None of the enrichment attributes are set. The `consensus.validation.send` span only has `ledger.seq` and `proposing`. The `consensus.accept` span has `quorum` set to `result.proposers` (not the actual validator quorum from `app_.validators().quorum()`). No `PeerImp.cpp` changes were made.
**What to do**:
- Edit `src/xrpld/app/consensus/RCLConsensus.cpp`:
@@ -242,7 +248,7 @@
Phase 7's `ValidationTracker` builds metric-level aggregation (1h/24h agreement %) on top of this data.
**Key modified files**:
**Key modified files (not yet modified)**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `src/xrpld/overlay/detail/PeerImp.cpp`
@@ -259,16 +265,16 @@ Phase 7's `ValidationTracker` builds metric-level aggregation (1h/24h agreement
## Summary
| Task | Description | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------- | --------- | -------------- | ------------- |
| 4.1 | Consensus round start instrumentation | 0 | 2 | Phase 3 |
| 4.2 | Phase transition instrumentation | 0 | 1-2 | 4.1 |
| 4.3 | Proposal handling instrumentation | 0 | 1 | 4.1 |
| 4.4 | Validation handling instrumentation | 0 | 1-2 | 4.1 |
| 4.5 | Consensus-specific attributes | 0 | 1 | 4.2, 4.3, 4.4 |
| 4.6 | Transaction-consensus correlation | 0 | 2 | 4.2, Phase 3 |
| 4.7 | Build verification and testing | 0 | 0 | 4.1-4.6 |
| 4.8 | Validation span enrichment (ext. dashboard) | 0 | 2 | 4.4 |
| Task | Description | Status | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------- | ----------- | --------- | -------------- | ------------- |
| 4.1 | Consensus round start instrumentation | ✅ Done | 0 | 2 | Phase 3 |
| 4.2 | Phase transition instrumentation | ✅ Done | 0 | 1-2 | 4.1 |
| 4.3 | Proposal handling instrumentation | ✅ Done | 0 | 2 | 4.1 |
| 4.4 | Validation handling instrumentation | ✅ Done | 0 | 2 | 4.1 |
| 4.5 | Consensus-specific attributes | ✅ Done | 0 | 2 | 4.2, 4.3, 4.4 |
| 4.6 | Transaction-consensus correlation | ✅ Done | 0 | 1 | 4.2, Phase 3 |
| 4.7 | Build verification and testing | ✅ Done | 0 | 0 | 4.1-4.6 |
| 4.8 | Validation span enrichment (ext. dashboard) | ❌ Not done | 0 | 2 | 4.4 |
**Parallel work**: Tasks 4.2, 4.3, and 4.4 can run in parallel after 4.1 is complete. Task 4.5 depends on all three. Task 4.6 depends on 4.2 and Phase 3. Task 4.8 depends on 4.4 (validation spans must exist).
@@ -301,10 +307,12 @@ driven by `avCT_CONSENSUS_PCT` (75% validator agreement threshold):
**Exit Criteria** (from [06-implementation-phases.md §6.11.4](./06-implementation-phases.md)):
- [x] Complete consensus round traces
- [x] Phase transitions visible
- [x] Proposals and validations traced
- [x] Phase transitions visible (open, establish, close, accept)
- [x] Proposals and validations traced — send and receive; relay deferred to Phase 4b
- [x] Close time agreement tracked (per `avCT_CONSENSUS_PCT`)
- [x] No impact on consensus timing
- [x] Transaction-consensus correlation (Task 4.6) — `tx.included` events in doAccept
- [ ] Validation span enrichment (Task 4.8) — not implemented
---
@@ -314,14 +322,13 @@ driven by `avCT_CONSENSUS_PCT` (75% validator agreement threshold):
> threshold escalation, mode changes) and establish cross-node correlation using a
> deterministic shared trace ID derived from `previousLedger.id()`.
>
> **Approach**: Direct instrumentation in `Consensus.h` — the generic consensus
> template has full access to internal state (`convergePercent_`, `result_->disputes`,
> `mode_`, threshold logic). Telemetry access comes via a single new adaptor
> method `getTelemetry()`. Long-lived spans (round, establish) are stored as
> class members using `SpanGuard` directly — NOT the `XRPL_TRACE_*` convenience
> macros (which create local variables named `_xrpl_guard_`). Short-lived
> scoped spans (update_positions, check) can use the macros. All code compiles
> to no-ops when `XRPL_ENABLE_TELEMETRY` is not defined.
> **Approach**: Direct instrumentation in `Consensus.h` and `RCLConsensus.cpp`.
> All spans use `SpanGuard` factory methods (`span()`, `hashSpan()`, `linkedSpan()`)
> with `TraceCategory::Consensus` gating. Long-lived spans (round, establish) are
> stored as `std::optional<SpanGuard>` class members. Short-lived scoped spans
> (update_positions, check) are local variables. No macros are used — all tracing
> is via direct `SpanGuard` API calls. `SpanGuard` compiles to no-ops when
> telemetry is disabled.
>
> **Branch**: `pratik/otel-phase4-consensus-tracing`
@@ -356,6 +363,9 @@ on every consensus span. Correlation happens at query time via Tempo/Grafana
consensus_trace_strategy=deterministic
```
The C++ API to query this at runtime is `Telemetry::getConsensusTraceStrategy()`,
which returns a `std::string const&` (`"deterministic"` or `"attribute"`).
### Implementation
In `RCLConsensus::Adaptor::startRound()`:
@@ -409,267 +419,252 @@ consensus.round (root — created in RCLConsensus::startRound, closed at accept
---
## Task 4a.0: Prerequisites — Extend SpanGuard and Telemetry APIs
## Task 4a.0: Prerequisites — Extend SpanGuard and Telemetry APIs
**Objective**: Add missing API surface needed by later tasks.
**What to do**:
**Status**: Done, but implemented differently than originally planned. The macro-based
approach (`XRPL_TRACE_CONSENSUS`, `XRPL_TRACE_ADD_EVENT`, `XRPL_TRACE_SET_ATTR`) was
**not used**. Instead, all consensus tracing uses `SpanGuard` factory methods and
direct method calls, which is cleaner and avoids macro control-flow issues.
1. **Add `SpanGuard::addEvent()` with attributes** (needed by Task 4a.5):
The current `addEvent(string_view name)` only accepts a name. Add an
overload that accepts key-value attributes:
**What was done**:
1. **`SpanGuard::addEvent()` with attributes** — implemented as planned:
```cpp
using EventAttribute = std::pair<std::string_view, std::string_view>;
void addEvent(std::string_view name,
std::initializer_list<
std::pair<opentelemetry::nostd::string_view,
opentelemetry::common::AttributeValue>> attributes)
{
span_->AddEvent(std::string(name), attributes);
}
std::initializer_list<EventAttribute> attrs);
```
2. **Add a `Telemetry::startSpan()` overload that accepts span links** (needed by Tasks 4a.2, 4a.8):
The current `startSpan()` has no span link support. Add an overload that
accepts a vector of `SpanContext` links for follows-from relationships:
Callers pass plain `string_view` pairs; the implementation converts internally.
```cpp
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
startSpan(
std::string_view name,
opentelemetry::context::Context const& parentContext,
std::vector<opentelemetry::trace::SpanContext> const& links,
opentelemetry::trace::SpanKind kind = opentelemetry::trace::SpanKind::kInternal) = 0;
// Actual usage in Consensus.h::updateOurPositions():
span.addEvent(
"dispute.resolve",
{{cons_span::attr::txId, to_string(txId)},
{cons_span::attr::disputeOurVote, dispute.getOurVote() ? "yes" : "no"}});
```
3. **Add `XRPL_TRACE_ADD_EVENT` macro** (needed by Task 4a.5):
Add to `TracingInstrumentation.h` to expose `addEvent(name, attrs)` through
the macro interface (consistent with `XRPL_TRACE_SET_ATTR` pattern):
2. **Span link support** — implemented via `SpanGuard::linkedSpan()` static factory
instead of a `Telemetry::startSpan()` overload:
```cpp
#ifdef XRPL_ENABLE_TELEMETRY
#define XRPL_TRACE_ADD_EVENT(name, ...) \
if (_xrpl_guard_.has_value()) \
{ \
_xrpl_guard_->addEvent(name, __VA_ARGS__); \
}
#else
#define XRPL_TRACE_ADD_EVENT(name, ...) ((void)0)
#endif
static SpanGuard linkedSpan(
std::string_view name, SpanContext const& linkTarget);
```
3. **No macros added** — `TracingInstrumentation.h` was not created. The `XRPL_TRACE_CONSENSUS`,
`XRPL_TRACE_ADD_EVENT`, and `XRPL_TRACE_SET_ATTR` macros from the original plan were
not implemented. All consensus tracing uses direct `SpanGuard` API:
- `SpanGuard::span()` — create scoped spans
- `SpanGuard::hashSpan()` — create spans with deterministic trace IDs
- `SpanGuard::linkedSpan()` — create spans with follows-from links
- `span.setAttribute()` — set attributes directly
- `span.addEvent()` — add events directly
**Key modified files**:
- `include/xrpl/telemetry/SpanGuard.h` — add `addEvent()` overload
- `include/xrpl/telemetry/Telemetry.h` — add `startSpan()` with links
- `src/xrpld/telemetry/Telemetry.cpp` — implement new overload
- `src/xrpld/telemetry/NullTelemetry.cpp` — no-op implementation
- `src/xrpld/telemetry/TracingInstrumentation.h` — add `XRPL_TRACE_ADD_EVENT` macro
- `include/xrpl/telemetry/SpanGuard.h` — `addEvent()` overload, `EventAttribute` type alias
- `src/libxrpl/telemetry/SpanGuard.cpp` — `addEvent()` implementation
---
## Task 4a.1: Adaptor `getTelemetry()` Method
## Task 4a.1: Adaptor `getTelemetry()` Method — NOT DONE (Not Needed)
**Objective**: Give `Consensus.h` access to the telemetry subsystem without
coupling the generic template to OTel headers.
**What to do**:
**Status**: Not implemented as specified. The `getTelemetry()` adaptor method was
not needed because `SpanGuard::span()` is a static factory method that internally
checks telemetry state via the global `Telemetry` singleton. `Consensus.h` creates
spans by calling `SpanGuard::span(TraceCategory::Consensus, ...)` directly, without
needing adaptor access. Only `RCLConsensus::Adaptor` uses `app_.getTelemetry()`
directly (for `getConsensusTraceStrategy()` in `startRoundTracing()`).
- Add `getTelemetry()` method to the Adaptor concept (returns
`xrpl::telemetry::Telemetry&`). The return type is already forward-declared
behind `#ifdef XRPL_ENABLE_TELEMETRY`.
- Implement in `RCLConsensus::Adaptor` — delegates to `app_.getTelemetry()`.
- In `Consensus.h`, the `XRPL_TRACE_*` macros call
`adaptor_.getTelemetry()` — when telemetry is disabled, the macros expand to
`((void)0)` and the method is never called.
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.h` — declare `getTelemetry()`
- `src/xrpld/app/consensus/RCLConsensus.cpp` — implement `getTelemetry()`
**Key insight**: The `XRPL_TRACE_*` macro approach would have required
`adaptor_.getTelemetry()`. Since macros were not used, this task became unnecessary.
---
## Task 4a.2: Switchable Round Span with Deterministic Trace ID
## Task 4a.2: Switchable Round Span with Deterministic Trace ID
**Objective**: Create a `consensus.round` root span in `startRound()` that uses
the switchable correlation strategy. Store span context as a member for child
spans in `Consensus.h`.
**What to do**:
**Status**: Done. Implemented in `Adaptor::startRoundTracing()`.
- In `RCLConsensus::Adaptor::startRound()` (or a new helper):
- Read `consensus_trace_strategy` from config.
- **Deterministic**: compute `trace_id = SHA256(prevLedgerID)[0:16]`.
Construct a `SpanContext` with this trace_id, then start
`consensus.round` span as child of that context.
- **Attribute**: start normal `consensus.round` span.
- Set attributes on both: `xrpl.consensus.round_id`,
`xrpl.consensus.ledger_id`, `xrpl.consensus.ledger.seq`,
`xrpl.consensus.mode`.
- Store the round span in `Consensus` as a member (see Task 4a.3).
- If a previous round's span context is available, add a **span link**
(follows-from) to establish the round chain.
**What was done**:
- Add `createDeterministicTraceId(hash)` utility to
`include/xrpl/telemetry/Telemetry.h` (returns 16-byte trace ID from a
256-bit hash by truncation).
- `RCLConsensus::Adaptor::startRoundTracing()` helper:
- Reads `consensus_trace_strategy` via `app_.getTelemetry().getConsensusTraceStrategy()`
- **Deterministic**: uses `SpanGuard::hashSpan()` with `prevLgr.id()` data
- **Attribute**: uses `SpanGuard::span(TraceCategory::Consensus, seg::consensus, "round")`
- Sets attributes: `ledger_id`, `ledger.seq`, `mode`, `trace_strategy`, `round_id`
- Captures `roundSpanContext_` snapshot for cross-thread span linking
- Saves `prevRoundContext_` from previous round for follows-from links
- **`SpanGuard::hashSpan()` factory**: encapsulates deterministic trace ID logic:
- Add `consensus_trace_strategy` to `Telemetry::Setup` and
`TelemetryConfig.cpp` parser:
```cpp
/** Cross-node correlation strategy: "deterministic" or "attribute". */
std::string consensusTraceStrategy = "deterministic";
static SpanGuard hashSpan(
TraceCategory cat, std::string_view name,
std::uint8_t const* hashData, std::size_t hashSize);
```
Derives `trace_id = hashData[0:16]` so all nodes in the same round share
the same trace_id. Compiles to no-op when telemetry is disabled.
- `consensus_trace_strategy` config parsed in `TelemetryConfig.cpp`,
stored in `Telemetry::Setup`, accessible via `Telemetry::getConsensusTraceStrategy()`
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp`
- `include/xrpl/telemetry/Telemetry.h` — `createDeterministicTraceId()`
- `src/xrpld/telemetry/TelemetryConfig.cpp` — parse new config option
- `src/xrpld/app/consensus/RCLConsensus.cpp` — `startRoundTracing()` implementation
- `src/xrpld/app/consensus/ConsensusSpanNames.h` — **(new)** compile-time span name and attribute key constants
- `include/xrpl/telemetry/Telemetry.h` — `consensusTraceStrategy` in Setup, `getConsensusTraceStrategy()`
- `src/libxrpl/telemetry/TelemetryConfig.cpp` — parse new config option
---
## Task 4a.3: Span Members in `Consensus.h`
## Task 4a.3: Span Members in `Consensus.h`
**Objective**: Add span storage to the `Consensus` class so that spans created
in `startRound()` (adaptor) are accessible from `phaseEstablish()`,
`updateOurPositions()`, and `haveConsensus()` (template methods).
**What to do**:
**Status**: Done with documented plan deviation.
**What was done**:
- `establishSpan_` added to `Consensus` private members (as planned):
- Add to `Consensus` private members (guarded by `#ifdef XRPL_ENABLE_TELEMETRY`):
```cpp
#ifdef XRPL_ENABLE_TELEMETRY
std::optional<xrpl::telemetry::SpanGuard> roundSpan_;
std::optional<xrpl::telemetry::SpanGuard> establishSpan_;
opentelemetry::context::Context prevRoundContext_;
#endif
```
- `roundSpan_` is created in `startRound()` via the adaptor and stored.
Its `SpanGuard::Scope` member keeps the span active on the thread context
for the entire round lifetime.
- `establishSpan_` is created when entering phaseEstablish and cleared on accept.
It becomes a child of `roundSpan_` via OTel's thread-local context propagation.
- `prevRoundContext_` stores the previous round's context for follows-from links.
**Threading assumption**: `startRound()`, `phaseEstablish()`, `updateOurPositions()`,
and `haveConsensus()` all run on the same thread (the consensus job queue thread).
This is required for the `SpanGuard::Scope`-based parent-child hierarchy to work.
The `Consensus` class documentation confirms it is NOT thread-safe and calls are
serialized by the application.
- **Plan deviation**: `roundSpan_`, `prevRoundContext_`, and `roundSpanContext_`
are stored in `RCLConsensus::Adaptor` (not `Consensus.h`) because the adaptor
has access to telemetry config for the deterministic trace ID strategy.
- Add conditional include at top of `Consensus.h`:
- **No `#ifdef XRPL_ENABLE_TELEMETRY` guards**: Members use `std::optional<SpanGuard>`
and `SpanContext` which have no-op implementations when telemetry is disabled,
so `#ifdef` guards are unnecessary. The members are always present in the class
layout but incur negligible overhead.
- Includes added unconditionally to `Consensus.h`:
```cpp
#ifdef XRPL_ENABLE_TELEMETRY
#include <xrpl/telemetry/SpanGuard.h>
#include <xrpld/telemetry/TracingInstrumentation.h>
#endif
#include <xrpld/app/consensus/ConsensusSpanNames.h>
```
No `TracingInstrumentation.h` include (file doesn't exist; macros not used).
**Key modified files**:
- `src/xrpld/consensus/Consensus.h`
- `src/xrpld/app/consensus/RCLConsensus.h` (round span and context members)
---
## Task 4a.4: Instrument `phaseEstablish()`
## Task 4a.4: Instrument `phaseEstablish()`
**Objective**: Create `consensus.establish` span wrapping the establish phase,
with attributes for convergence progress.
**What to do**:
**Status**: Done. Implemented via three private helpers in `Consensus.h`.
- At the start of `phaseEstablish()` (line 1298), if `establishSpan_` is not
yet created, create it as child of `roundSpan_` using the **direct API**
(NOT the `XRPL_TRACE_CONSENSUS` macro, which creates a local variable):
**What was done**:
```cpp
#ifdef XRPL_ENABLE_TELEMETRY
if (!establishSpan_ && adaptor_.getTelemetry().shouldTraceConsensus())
{
establishSpan_.emplace(
adaptor_.getTelemetry().startSpan("consensus.establish"));
}
#endif
```
- `startEstablishTracing()` — creates `consensus.establish` span via
`SpanGuard::span(TraceCategory::Consensus, seg::consensus, "establish")`.
Called once at start of establish phase. No `#ifdef` guards needed —
`SpanGuard::span()` returns a no-op guard when telemetry is disabled.
- Set attributes on each call:
- `updateEstablishTracing()` — sets attributes on each `phaseEstablish()` call:
- `xrpl.consensus.converge_percent` — `convergePercent_`
- `xrpl.consensus.establish_count` — `establishCounter_`
- `xrpl.consensus.proposers` — `currPeerPositions_.size()`
- On phase exit (transition to accept), close the establish span and record
final duration.
- `endEstablishTracing()` — calls `establishSpan_.reset()` on phase exit.
**Key modified files**:
- `src/xrpld/consensus/Consensus.h` — `phaseEstablish()` method
- `src/xrpld/consensus/Consensus.h` — `phaseEstablish()` method + 3 helper methods
---
## Task 4a.5: Instrument `updateOurPositions()`
## Task 4a.5: Instrument `updateOurPositions()`
**Objective**: Trace each position update cycle including dispute resolution
details.
**What to do**:
**Status**: DONE. Span, dispute events with yays/nays, and disputes_count attribute are all implemented.
- At the start of `updateOurPositions()` (line 1418), create a scoped child
span. This method is called and returns within a single `phaseEstablish()`
call, so the `XRPL_TRACE_CONSENSUS` macro works here (scoped local):
**What was done**:
- Creates `consensus.update_positions` scoped span via
`SpanGuard::span(TraceCategory::Consensus, seg::consensus, "update_positions")`:
```cpp
XRPL_TRACE_CONSENSUS(adaptor_.getTelemetry(), "consensus.update_positions");
auto span = SpanGuard::span(TraceCategory::Consensus, seg::consensus, "update_positions");
```
- Set attributes:
- `xrpl.consensus.disputes_count` — `result_->disputes.size()`
- Attributes set:
- `xrpl.consensus.converge_percent` — current convergence
- `xrpl.consensus.proposers_agreed` — count of peers with same position
- `xrpl.consensus.proposers_total` — total peer positions
- `xrpl.consensus.proposers` — `currPeerPositions_.size()`
- `xrpl.consensus.have_close_time_consensus` — close time consensus state
- `xrpl.consensus.close_time_threshold` — `avCT_CONSENSUS_PCT`
- `xrpl.consensus.disputes_count` — number of active disputes
- Inside the dispute resolution loop, for each dispute that changes our vote,
add an **event** with attributes using `XRPL_TRACE_ADD_EVENT` (from Task 4a.0):
- Dispute events recorded via direct `span.addEvent()` call with yays/nays:
```cpp
XRPL_TRACE_ADD_EVENT("dispute.resolve", {
{"xrpl.tx.id", std::string(tx_id)},
{"xrpl.dispute.our_vote", our_vote},
{"xrpl.dispute.yays", static_cast<int64_t>(yays)},
{"xrpl.dispute.nays", static_cast<int64_t>(nays)}
});
span.addEvent(
"dispute.resolve",
{{cons_span::attr::txId, to_string(txId)},
{cons_span::attr::disputeOurVote, dispute.getOurVote() ? "yes" : "no"},
{cons_span::attr::disputeYays, std::to_string(dispute.getYays())},
{cons_span::attr::disputeNays, std::to_string(dispute.getNays())}});
```
**Not implemented**:
- `xrpl.consensus.proposers_agreed` / `xrpl.consensus.proposers_total` attributes — not set
**Key modified files**:
- `src/xrpld/consensus/Consensus.h` — `updateOurPositions()` method
- `src/xrpld/consensus/DisputedTx.h` — added `getYays()` / `getNays()` (currently unused)
---
## Task 4a.6: Instrument `haveConsensus()` (Threshold & Convergence)
## Task 4a.6: Instrument `haveConsensus()` (Threshold & Convergence)
**Objective**: Trace consensus checking including threshold escalation
(`ConsensusParms::AvalancheState::{init, mid, late, stuck}`).
**Objective**: Trace consensus checking including threshold escalation.
**What to do**:
**Status**: DONE. The `consensus.check` span is created with all planned attributes
including the avalanche threshold.
- At the start of `haveConsensus()` (line 1598), create a scoped child span:
**What was done**:
- Creates `consensus.check` scoped span via
`SpanGuard::span(TraceCategory::Consensus, seg::consensus, "check")`:
```cpp
XRPL_TRACE_CONSENSUS(adaptor_.getTelemetry(), "consensus.check");
auto span = SpanGuard::span(TraceCategory::Consensus, seg::consensus, "check");
```
- Set attributes:
- Attributes set:
- `xrpl.consensus.agree_count` — peers that agree with our position
- `xrpl.consensus.disagree_count` — peers that disagree
- `xrpl.consensus.converge_percent` — convergence percentage
- `xrpl.consensus.result` — ConsensusState result (Yes/No/MovedOn)
- The free function `checkConsensus()` in `Consensus.cpp` (line 151) determines
thresholds based on `currentAgreeTime`. Threshold values come from
`ConsensusParms::avalancheCutoffs` (defined in `ConsensusParms.h`).
The escalation states are `ConsensusParms::AvalancheState::{init, mid, late, stuck}`.
Record the effective threshold as an attribute on the span:
- `xrpl.consensus.threshold_percent` — current threshold from `avalancheCutoffs`
- `xrpl.consensus.have_close_time_consensus` — close time consensus state
- `xrpl.consensus.threshold_percent` — set to `avCT_CONSENSUS_PCT` (75%)
- `xrpl.consensus.result` — "yes", "no", or "moved_on"
- `xrpl.consensus.avalanche_threshold` — the escalated weight from `getNeededWeight()` on the `consensus.update_positions` span
**Key modified files**:
@@ -677,29 +672,26 @@ details.
---
## Task 4a.7: Instrument Mode Changes
## Task 4a.7: Instrument Mode Changes
**Objective**: Trace consensus mode transitions (proposing ↔ observing,
wrongLedger, switchedLedger).
**What to do**:
**Status**: Done.
Mode changes are rare (typically 0-1 per round), so a **standalone short-lived
span** is appropriate (not an event). This captures timing of the mode change
itself.
**What was done**:
- In `RCLConsensus::Adaptor::onModeChange()`, create a scoped span:
- In `RCLConsensus::Adaptor::onModeChange()`, creates a scoped span via direct
`SpanGuard::span()` call:
```cpp
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.mode_change");
XRPL_TRACE_SET_ATTR("xrpl.consensus.mode.old", to_string(before).c_str());
XRPL_TRACE_SET_ATTR("xrpl.consensus.mode.new", to_string(after).c_str());
auto span = telemetry::SpanGuard::span(
telemetry::TraceCategory::Consensus, telemetry::seg::consensus, "mode_change");
span.setAttribute(cons_span::attr::modeOld, to_string(before).c_str());
span.setAttribute(cons_span::attr::modeNew, to_string(after).c_str());
```
- Note: `MonitoredMode::set()` (line 304 in `Consensus.h`) calls
`adaptor_.onModeChange(before, after)` — so the span is created in the
adaptor, which already has telemetry access. No instrumentation needed
in `Consensus.h` for this task.
- `MonitoredMode::set()` in `Consensus.h` calls `adaptor_.onModeChange(before, after)`.
**Key modified files**:
@@ -707,31 +699,36 @@ itself.
---
## Task 4a.8: Reparent Existing Spans Under Round
## Task 4a.8: Reparent Existing Spans Under Round
**Objective**: Make existing consensus spans (`consensus.accept`,
`consensus.accept.apply`, `consensus.validation.send`) children of the
`consensus.round` root span instead of being standalone.
**What to do**:
**Status**: DONE. All three spans are now parented under the round span.
- The existing spans in `onAccept()`, `doAccept()`, and `validate()` use
`XRPL_TRACE_CONSENSUS(app_.getTelemetry(), ...)` which creates standalone
spans on the current thread's context.
- After Task 4a.2 creates the round span and stores it, these methods run on
the same thread within the round span's scope, so they automatically become
children. Verify this works correctly.
- For `consensus.validation.send`: add a **span link** (follows-from) to the
round span context, since the validation may be processed after the round
completes.
**What was done**:
- `consensus.validation.send` uses `SpanGuard::linkedSpan()` to create a
follows-from link to `roundSpanContext_`. This is thread-safe because
`roundSpanContext_` is a lightweight `SpanContext` snapshot captured on the
consensus thread and read on the jtACCEPT worker thread.
- `consensus.accept` and `consensus.accept.apply` now use
`SpanGuard::childSpan(name, roundSpanContext_)` instead of `SpanGuard::span()`
to explicitly parent under the round span context. This solves the cross-thread
parenting problem:
- `doAccept()` runs on the jtACCEPT worker thread (not the consensus thread)
- `childSpan()` explicitly passes the parent context, bypassing OTel's
thread-local context propagation
**Key modified files**:
- `src/xrpld/app/consensus/RCLConsensus.cpp` — verify parent-child hierarchy
- `src/xrpld/app/consensus/RCLConsensus.cpp`
---
## Task 4a.9: Build Verification and Testing
## Task 4a.9: Build Verification and Testing
**Objective**: Verify all Phase 4a changes compile cleanly with telemetry ON
and OFF, and don't affect consensus timing.
@@ -739,11 +736,9 @@ and OFF, and don't affect consensus timing.
**What to do**:
1. Build with `telemetry=ON` — verify no compilation errors
2. Build with `telemetry=OFF` — verify macros expand to no-ops, no new includes
leak into `Consensus.h` when disabled
2. Build with `telemetry=OFF` — verify `SpanGuard` compiles to no-ops
3. Run existing consensus unit tests
4. Verify `#ifdef XRPL_ENABLE_TELEMETRY` guards on all new members in
`Consensus.h`
4. Verify `SpanGuard` / `SpanContext` members have negligible overhead when disabled
5. Run `pccl` pre-commit checks
**Verification Checklist**:
@@ -751,7 +746,7 @@ and OFF, and don't affect consensus timing.
- [x] Build succeeds with telemetry ON
- [x] Build succeeds with telemetry OFF
- [x] Existing consensus tests pass
- [x] `Consensus.h` has zero OTel includes when telemetry is OFF
- [x] `SpanGuard` no-op path verified (no `#ifdef` needed — disabled at runtime)
- [x] No new virtual calls in hot consensus paths
- [x] `pccl` passes
@@ -759,74 +754,91 @@ and OFF, and don't affect consensus timing.
## Phase 4a Summary
| Task | Description | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------------ | --------- | -------------- | ---------- |
| 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | 0 | 4 | Phase 4 |
| 4a.1 | Adaptor `getTelemetry()` method | 0 | 2 | Phase 4 |
| 4a.2 | Switchable round span with deterministic traceID | 0 | 3 | 4a.0, 4a.1 |
| 4a.3 | Span members in `Consensus.h` | 0 | 1 | 4a.1 |
| 4a.4 | Instrument `phaseEstablish()` | 0 | 1 | 4a.3 |
| 4a.5 | Instrument `updateOurPositions()` | 0 | 1 | 4a.0, 4a.3 |
| 4a.6 | Instrument `haveConsensus()` (thresholds) | 0 | 1 | 4a.3 |
| 4a.7 | Instrument mode changes | 0 | 1 | 4a.1 |
| 4a.8 | Reparent existing spans under round | 0 | 1 | 4a.0, 4a.2 |
| 4a.9 | Build verification and testing | 0 | 0 | 4a.0-4a.8 |
| Task | Description | Status | New Files | Modified Files | Depends On |
| ---- | ------------------------------------------------ | ------------------------ | --------- | -------------- | ---------- |
| 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | ✅ Done (no macros) | 0 | 2 | Phase 4 |
| 4a.1 | Adaptor `getTelemetry()` method | ⏭️ Skipped (not needed) | 0 | 0 | Phase 4 |
| 4a.2 | Switchable round span with deterministic traceID | ✅ Done | 1 | 3 | 4a.0 |
| 4a.3 | Span members in `Consensus.h` | ✅ Done (with deviation) | 0 | 2 | |
| 4a.4 | Instrument `phaseEstablish()` | ✅ Done | 0 | 1 | 4a.3 |
| 4a.5 | Instrument `updateOurPositions()` | ✅ Done | 0 | 2 | 4a.0, 4a.3 |
| 4a.6 | Instrument `haveConsensus()` (thresholds) | ✅ Done | 0 | 1 | 4a.3 |
| 4a.7 | Instrument mode changes | ✅ Done | 0 | 1 | |
| 4a.8 | Reparent existing spans under round | ✅ Done | 0 | 1 | 4a.0, 4a.2 |
| 4a.9 | Build verification and testing | ✅ Done | 0 | 0 | 4a.0-4a.8 |
**Parallel work**: Tasks 4a.0 and 4a.1 can run in parallel. Tasks 4a.4, 4a.5, 4a.6, and 4a.7 can run in parallel after 4a.3 (and 4a.0 for 4a.5).
### New Spans (Phase 4a)
| Span Name | Location | Key Attributes |
| ---------------------------- | ------------------ | ---------------------------------------------------------------------------------- |
| `consensus.round` | `RCLConsensus.cpp` | `round_id`, `ledger_id`, `ledger.seq`, `mode`; link → prev round |
| `consensus.establish` | `Consensus.h` | `converge_percent`, `establish_count`, `proposers` |
| `consensus.update_positions` | `Consensus.h` | `disputes_count`, `converge_percent`, `proposers_agreed`, `proposers_total` |
| `consensus.check` | `Consensus.h` | `agree_count`, `disagree_count`, `converge_percent`, `result`, `threshold_percent` |
| `consensus.mode_change` | `RCLConsensus.cpp` | `mode.old`, `mode.new` |
| Span Name | Location | Key Attributes (actually set) |
| ---------------------------- | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| `consensus.round` | `RCLConsensus.cpp` | `round_id`, `ledger_id`, `ledger.seq`, `mode`, `trace_strategy` |
| `consensus.establish` | `Consensus.h` | `converge_percent`, `establish_count`, `proposers` |
| `consensus.update_positions` | `Consensus.h` | `converge_percent`, `proposers`, `have_close_time_consensus`, `close_time_threshold`, `disputes_count`, `avalanche_threshold` |
| `consensus.check` | `Consensus.h` | `agree_count`, `disagree_count`, `converge_percent`, `have_close_time_consensus`, `threshold_percent`, `result` |
| `consensus.mode_change` | `RCLConsensus.cpp` | `mode.old`, `mode.new` |
### New Events (Phase 4a)
| Event Name | Parent Span | Attributes |
| Event Name | Parent Span | Attributes (actually set) |
| ----------------- | ---------------------------- | ----------------------------------- |
| `dispute.resolve` | `consensus.update_positions` | `tx_id`, `our_vote`, `yays`, `nays` |
| `tx.included` | `consensus.accept.apply` | `tx_id` |
### New Attributes (Phase 4a)
```cpp
// Round-level (on consensus.round)
// Round-level (on consensus.round) — ALL IMPLEMENTED
"xrpl.consensus.round_id" = int64 // Consensus round number
"xrpl.consensus.ledger_id" = string // previousLedger.id() hash
"xrpl.consensus.trace_strategy" = string // "deterministic" or "attribute"
// Establish-level
// Establish-level — IMPLEMENTED
"xrpl.consensus.converge_percent" = int64 // Convergence % (0-100+)
"xrpl.consensus.establish_count" = int64 // Number of establish iterations
"xrpl.consensus.disputes_count" = int64 // Active disputes
"xrpl.consensus.proposers_agreed" = int64 // Peers agreeing with us
"xrpl.consensus.proposers_total" = int64 // Total peer positions
"xrpl.consensus.agree_count" = int64 // Peers that agree (haveConsensus)
"xrpl.consensus.disagree_count" = int64 // Peers that disagree
"xrpl.consensus.threshold_percent" = int64 // Current threshold (50/65/70/95)
"xrpl.consensus.threshold_percent" = int64 // Current threshold (avCT_CONSENSUS_PCT = 75%)
"xrpl.consensus.result" = string // "yes", "no", "moved_on"
"xrpl.consensus.have_close_time_consensus" = bool // Close time consensus reached
"xrpl.consensus.close_time_threshold" = int64 // Close time voting threshold
// Mode change
// Establish-level — IMPLEMENTED
"xrpl.consensus.disputes_count" = int64 // Active disputes (on update_positions)
"xrpl.consensus.avalanche_threshold" = int64 // Escalated weight (on update_positions)
// Establish-level — NOT IMPLEMENTED
// "xrpl.consensus.proposers_agreed" = int64 // Peers agreeing with us — not set
// "xrpl.consensus.proposers_total" = int64 // Total peer positions — not set (not defined)
// Mode change — ALL IMPLEMENTED
"xrpl.consensus.mode.old" = string // Previous mode
"xrpl.consensus.mode.new" = string // New mode
```
### Implementation Notes
- **No macros**: The planned `XRPL_TRACE_CONSENSUS`, `XRPL_TRACE_ADD_EVENT`, and
`XRPL_TRACE_SET_ATTR` macros were not implemented. All consensus tracing uses
`SpanGuard` factory methods (`span()`, `hashSpan()`, `linkedSpan()`) and direct
method calls (`setAttribute()`, `addEvent()`). This avoids macro control-flow
issues and is cleaner than the planned approach.
- **Separation of concerns**: All non-trivial telemetry code extracted to private
helpers (`startRoundTracing`, `createValidationSpan`, `startEstablishTracing`,
`updateEstablishTracing`, `endEstablishTracing`). Business logic methods contain
only single-line `#ifdef` blocks calling these helpers.
single-line calls to these helpers.
- **Thread safety**: `createValidationSpan()` runs on the jtACCEPT worker thread.
Instead of accessing `roundSpan_` across threads, a `roundSpanContext_` snapshot
(lightweight `SpanContext` value type) is captured on the consensus thread in
`startRoundTracing()` and read by `createValidationSpan()`. The job queue
provides the happens-before guarantee.
- **Macro safety**: `XRPL_TRACE_ADD_EVENT` uses `do { } while (0)` to prevent
dangling-else issues.
- **No `#ifdef` guards**: Span members use `std::optional<SpanGuard>` and `SpanContext`
which have no-op implementations when telemetry is disabled. No `#ifdef XRPL_ENABLE_TELEMETRY`
guards needed around members or includes.
- **No `getTelemetry()` adaptor method**: `SpanGuard::span()` is a static factory that
internally checks telemetry state, so `Consensus.h` doesn't need adaptor access
for span creation. Only `RCLConsensus::Adaptor` accesses `app_.getTelemetry()` directly.
- **Config validation**: `consensus_trace_strategy` is validated to be either
`"deterministic"` or `"attribute"`, falling back to `"deterministic"` for
unrecognised values.
@@ -891,6 +903,6 @@ share the same trace_id. P2P propagation adds **span-level** linking:
## Prerequisites
- Phase 4a (this task list) — establish phase tracing must be in place
- `TraceContextPropagator` class (already exists in
- `TraceContextPropagator` free functions (already exist in
`include/xrpl/telemetry/TraceContextPropagator.h`)
- Protobuf `TraceContext` message (already exists, field 1001)

View File

@@ -178,7 +178,7 @@ Tempo/Prometheus.
**Verification**:
- [ ] Script completes with all checks passing
- [ ] Tempo UI shows rippled service with all expected span names
- [ ] Tempo UI shows xrpld service with all expected span names
- [ ] Grafana dashboards load and show data
---

View File

@@ -175,7 +175,7 @@
**What to do**:
- Create `docs/telemetry-runbook.md`:
- **Setup**: How to enable telemetry in rippled
- **Setup**: How to enable telemetry in xrpld
- **Configuration**: All config options with descriptions
- **Collector Deployment**: Docker Compose vs. Kubernetes vs. bare metal
- **Troubleshooting**: Common issues and resolutions
@@ -199,7 +199,7 @@
**What to do**:
1. Start full Docker stack (Collector, Tempo, Grafana, Prometheus)
2. Build rippled with `telemetry=ON`
2. Build xrpld with `telemetry=ON`
3. Run in standalone mode with telemetry enabled
4. Generate RPC traffic and verify traces in Tempo
5. Verify dashboards populate in Grafana

View File

@@ -130,7 +130,7 @@
- Edit `docker/telemetry/docker-compose.yml`:
- Remove UDP :8125 port mapping from otel-collector service
- Update rippled service config: change `[insight] server=statsd` to `server=otel`
- Update xrpld service config: change `[insight] server=statsd` to `server=otel`
**Key modified files**:
@@ -148,14 +148,14 @@
**What to do**:
- In `OTelCollector.cpp`, construct OTel instrument names to match existing Prometheus metric names:
- beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `rippled_LedgerMaster_Validated_Ledger_Age`
- beast::insight `make_gauge("LedgerMaster", "Validated_Ledger_Age")` → OTel instrument name: `xrpld_LedgerMaster_Validated_Ledger_Age`
- The prefix + group + name concatenation must produce the same string as `StatsDCollector`'s format
- Use underscores as separators (matching StatsD convention)
- Verify in integration test that key Prometheus queries still return data:
- `rippled_LedgerMaster_Validated_Ledger_Age`
- `rippled_Peer_Finder_Active_Inbound_Peers`
- `rippled_rpc_requests`
- `xrpld_LedgerMaster_Validated_Ledger_Age`
- `xrpld_Peer_Finder_Active_Inbound_Peers`
- `xrpld_rpc_requests`
**Key consideration**: OTel Prometheus exporter may normalize metric names differently than StatsD receiver. Test this early (Task 7.2) and adjust naming strategy if needed. The OTel SDK's Prometheus exporter adds `_total` suffix to counters and converts dots to underscores — match existing conventions.
@@ -321,7 +321,7 @@ struct WindowEvent {
```cpp
validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
"rippled_validator_health", "Validator health indicators");
"xrpld_validator_health", "Validator health indicators");
```
**Gauge label values**:
@@ -333,12 +333,63 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
| `unl_expiry_days` | double | `app_.validators().expires()` → days until expiry |
| `validation_quorum` | int64 | `app_.validators().quorum()` |
### Sub-task 7.10a: Per-Validator Validation Count (Flag Ledger Window)
**Objective**: Track how many ledgers each UNL validator has validated over
the last 256 consecutive ledgers (one flag ledger window). This is the key
UNL participation metric — validators consistently below threshold may be
candidates for removal from the UNL.
**What to do**:
- Add a new observable gauge:
```cpp
validatorParticipationGauge_ = meter_->CreateInt64ObservableGauge(
"xrpld_validator_participation",
"Per-validator validation count over the last 256 ledgers");
```
- The callback queries `app_.getValidations()` to get the trusted
validation set for each of the last 256 ledger hashes (from
`LedgerMaster::getValidatedLedger()` walking backwards). For each
validator public key in the UNL, count how many of those 256 ledgers
have a matching validation.
- **Label dimensions**:
- `validator` — base58-encoded validator master public key
- `exported_instance` — this node's identity (standard)
- **Emission**: every flag ledger (256 ledgers, ~15 minutes) or on a
10-second async gauge callback with cached results (recompute only
at flag ledger boundaries).
- **Data source**: `RCLValidations::getTrustedForLedger(hash, seq)` returns
`std::vector<std::shared_ptr<STValidation>>` with `getSignerPublic()`
for each. The UNL list is from `app_.getValidators().getTrustedMasterKeys()`.
- **Dashboard panel**: Add a table panel to the Validator Health dashboard
showing `xrpld_validator_participation` grouped by `validator` label,
with a threshold color (green >= 240, yellow >= 200, red < 200).
**Key modified files**: `src/xrpld/telemetry/MetricsRegistry.h/.cpp`
**Exit Criteria**:
- [ ] All 4 label values emitted every 10s
- [ ] Gauge emits one time series per UNL validator
- [ ] Values range 0-256 and update at flag ledger boundaries
- [ ] Grafana table panel shows per-validator participation
- [ ] Validators below 75% participation are highlighted in red
---
**Key modified files**: `src/xrpld/telemetry/MetricsRegistry.h/.cpp`
**Exit Criteria**:
- [ ] All 4 base label values emitted every 10s
- [ ] `unl_expiry_days` is negative when expired, positive when active
- [ ] Per-validator participation gauge emits at flag ledger boundaries
- [ ] Values visible in Prometheus
---
@@ -362,7 +413,7 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
| -------------------------- | ------ | ------------------------------------- |
| `peer_latency_p90_ms` | double | P90 from sorted peer latencies |
| `peers_insane_count` | int64 | Peers with diverged tracking status |
| `peers_higher_version_pct` | double | % of peers on newer rippled version |
| `peers_higher_version_pct` | double | % of peers on newer xrpld version |
| `upgrade_recommended` | int64 | 1 if `peers_higher_version_pct > 60%` |
**Implementation note**: The callback runs every 10s on the metrics reader thread. Iterating ~50-200 peers is acceptable overhead.
@@ -373,7 +424,7 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
- [ ] P90 latency computed correctly
- [ ] Insane count matches `peers` RPC output
- [ ] Version comparison handles format variations (e.g., "rippled-2.4.0-rc1")
- [ ] Version comparison handles format variations (e.g., "xrpld-2.4.0-rc1")
---
@@ -435,10 +486,10 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
**Gauge label values**:
| Gauge Name | Label `metric=` | Type | Source |
| ------------------------ | ------------------------------- | ------ | ----------------------------- |
| `rippled_storage_detail` | `nudb_bytes` | int64 | NuDB backend file size |
| `rippled_sync_info` | `initial_sync_duration_seconds` | double | Time from start to first FULL |
| Gauge Name | Label `metric=` | Type | Source |
| ---------------------- | ------------------------------- | ------ | ----------------------------- |
| `xrpld_storage_detail` | `nudb_bytes` | int64 | NuDB backend file size |
| `xrpld_sync_info` | `initial_sync_duration_seconds` | double | Time from start to first FULL |
**Key modified files**: `src/xrpld/telemetry/MetricsRegistry.h/.cpp`
@@ -455,15 +506,15 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
**Objective**: Add 7 new event counters incremented at their respective instrumentation sites.
| Counter Name | Increment Site | Source File |
| ------------------------------------- | -------------------------------- | --------------------- |
| `rippled_ledgers_closed_total` | `onAccept()` in consensus | RCLConsensus.cpp |
| `rippled_validations_sent_total` | `validate()` in consensus | RCLConsensus.cpp |
| `rippled_validations_checked_total` | Network validation received | LedgerMaster.cpp |
| `rippled_validation_agreements_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `rippled_validation_missed_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `rippled_state_changes_total` | `setMode()` in NetworkOPs | NetworkOPs.cpp |
| `rippled_jq_trans_overflow_total` | Job queue overflow path | JobQueue.cpp |
| Counter Name | Increment Site | Source File |
| ----------------------------------- | -------------------------------- | --------------------- |
| `xrpld_ledgers_closed_total` | `onAccept()` in consensus | RCLConsensus.cpp |
| `xrpld_validations_sent_total` | `validate()` in consensus | RCLConsensus.cpp |
| `xrpld_validations_checked_total` | Network validation received | LedgerMaster.cpp |
| `xrpld_validation_agreements_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `xrpld_validation_missed_total` | ValidationTracker reconciliation | ValidationTracker.cpp |
| `xrpld_state_changes_total` | `setMode()` in NetworkOPs | NetworkOPs.cpp |
| `xrpld_jq_trans_overflow_total` | Job queue overflow path | JobQueue.cpp |
**Key modified files**: `src/xrpld/telemetry/MetricsRegistry.h/.cpp` (declarations), plus recording sites in RCLConsensus.cpp, LedgerMaster.cpp, NetworkOPs.cpp, JobQueue.cpp
@@ -482,14 +533,14 @@ validatorHealthGauge_ = meter_->CreateDoubleObservableGauge(
**Gauge label values**:
| Gauge Name | Label `metric=` | Type | Source |
| ------------------------------ | ------------------- | ------ | --------------------------- |
| `rippled_validation_agreement` | `agreement_pct_1h` | double | `tracker.agreementPct1h()` |
| | `agreements_1h` | int64 | `tracker.agreements1h()` |
| | `missed_1h` | int64 | `tracker.missed1h()` |
| | `agreement_pct_24h` | double | `tracker.agreementPct24h()` |
| | `agreements_24h` | int64 | `tracker.agreements24h()` |
| | `missed_24h` | int64 | `tracker.missed24h()` |
| Gauge Name | Label `metric=` | Type | Source |
| ---------------------------- | ------------------- | ------ | --------------------------- |
| `xrpld_validation_agreement` | `agreement_pct_1h` | double | `tracker.agreementPct1h()` |
| | `agreements_1h` | int64 | `tracker.agreements1h()` |
| | `missed_1h` | int64 | `tracker.missed1h()` |
| | `agreement_pct_24h` | double | `tracker.agreementPct24h()` |
| | `agreements_24h` | int64 | `tracker.agreements24h()` |
| | `missed_24h` | int64 | `tracker.missed24h()` |
**Key modified files**: `src/xrpld/telemetry/MetricsRegistry.cpp`

View File

@@ -1,6 +1,6 @@
# Phase 8: Log-Trace Correlation and Centralized Log Ingestion — Task List
> **Goal**: Inject trace context (trace_id, span_id) into rippled's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
> **Goal**: Inject trace context (trace_id, span_id) into xrpld's Journal log output for log-trace correlation, and add OTel Collector filelog receiver to ingest logs into Grafana Loki for unified observability.
>
> **Scope**: Two independent sub-phases — 8a (code change: trace_id in logs) and 8b (infra only: filelog receiver to Loki). No changes to the `beast::Journal` public API.
>
@@ -89,7 +89,7 @@
## Task 8.3: Add Filelog Receiver to OTel Collector
**Objective**: Configure the OTel Collector to tail rippled's log file and export to Loki.
**Objective**: Configure the OTel Collector to tail xrpld's log file and export to Loki.
**What to do**:
@@ -124,7 +124,7 @@
insecure: true
```
- Mount rippled's log directory into the collector container via docker-compose volume
- Mount xrpld's log directory into the collector container via docker-compose volume
**Key modified files**:
@@ -172,7 +172,7 @@
**What to do**:
- Edit `docker/telemetry/integration-test.sh`:
- After sending RPC requests (which create spans), grep rippled's log output for `trace_id=`
- After sending RPC requests (which create spans), grep xrpld's log output for `trace_id=`
- Verify trace_id matches a trace visible in Tempo
- Optionally: query Loki via API to confirm log ingestion
@@ -225,7 +225,7 @@
- [ ] Log lines within active spans contain `trace_id=<hex> span_id=<hex>`
- [ ] Log lines outside spans have no trace context (no empty fields)
- [ ] Loki ingests rippled logs via OTel Collector filelog receiver
- [ ] Loki ingests xrpld logs via OTel Collector filelog receiver
- [ ] Grafana Tempo -> Loki one-click correlation works
- [ ] Grafana Loki -> Tempo reverse lookup works via derived field
- [ ] Integration test verifies trace_id presence in logs

View File

@@ -2,7 +2,7 @@
> **Status**: Future Enhancement
>
> **Goal**: Instrument rippled to emit ~50+ metrics that exist in `get_counts`/`server_info`/TxQ/PerfLog but currently lack time-series export via the OTel or beast::insight pipelines.
> **Goal**: Instrument xrpld to emit ~50+ metrics that exist in `get_counts`/`server_info`/TxQ/PerfLog but currently lack time-series export via the OTel or beast::insight pipelines.
>
> **Scope**: Hybrid approach — extend `beast::insight` for metrics near existing registrations, use OTel Metrics SDK `ObservableGauge` callbacks for new categories (TxQ, PerfLog, CountedObjects).
>
@@ -57,7 +57,7 @@ These metrics serve multiple external consumer categories identified during rese
- `src/libxrpl/nodestore/Database.cpp`
- `src/libxrpl/nodestore/Database.h` (add insight members)
**Derived Prometheus metrics**: `rippled_nodestore_reads_total`, `rippled_nodestore_reads_hit`, `rippled_nodestore_write_load`, etc.
**Derived Prometheus metrics**: `xrpld_nodestore_reads_total`, `xrpld_nodestore_reads_hit`, `xrpld_nodestore_write_load`, etc.
**Grafana dashboard**: Add "NodeStore I/O" panel group to _Node Health_ dashboard.
@@ -87,7 +87,7 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/rpc/handlers/GetCounts.cpp` (extract shared access methods)
- `src/xrpld/app/main/Application.cpp` (register MetricsRegistry at startup)
**Derived Prometheus metrics**: `rippled_cache_SLE_hit_rate`, `rippled_cache_ledger_hit_rate`, `rippled_cache_treenode_size`, etc.
**Derived Prometheus metrics**: `xrpld_cache_SLE_hit_rate`, `xrpld_cache_ledger_hit_rate`, `xrpld_cache_treenode_size`, etc.
---
@@ -114,9 +114,9 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/telemetry/MetricsRegistry.cpp` (add TxQ callbacks)
- `src/xrpld/app/tx/detail/TxQ.h` (expose metrics accessor if needed)
**Derived Prometheus metrics**: `rippled_txq_count`, `rippled_txq_max_size`, `rippled_txq_open_ledger_fee_level`, etc.
**Derived Prometheus metrics**: `xrpld_txq_count`, `xrpld_txq_max_size`, `xrpld_txq_open_ledger_fee_level`, etc.
**Grafana dashboard**: New _Fee Market & TxQ_ dashboard (`rippled-fee-market`).
**Grafana dashboard**: New _Fee Market & TxQ_ dashboard (`xrpld-fee-market`).
---
@@ -141,7 +141,7 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/perflog/detail/PerfLogImp.cpp` (add OTel instrument updates alongside existing JSON counters)
- `src/xrpld/telemetry/MetricsRegistry.cpp` (register instruments)
**Derived Prometheus metrics**: `rippled_rpc_method_started_total{method="server_info"}`, `rippled_rpc_method_duration_us_bucket{method="ledger"}`, etc.
**Derived Prometheus metrics**: `xrpld_rpc_method_started_total{method="server_info"}`, `xrpld_rpc_method_duration_us_bucket{method="ledger"}`, etc.
**Grafana dashboard**: Add "Per-Method RPC Breakdown" panel group to _RPC Performance_ dashboard.
@@ -167,9 +167,9 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/perflog/detail/PerfLogImp.cpp`
- `src/xrpld/telemetry/MetricsRegistry.cpp`
**Derived Prometheus metrics**: `rippled_job_queued_total{job_type="ledgerData"}`, `rippled_job_running_duration_us_bucket{job_type="transaction"}`, etc.
**Derived Prometheus metrics**: `xrpld_job_queued_total{job_type="ledgerData"}`, `xrpld_job_running_duration_us_bucket{job_type="transaction"}`, etc.
**Grafana dashboard**: New _Job Queue Analysis_ dashboard (`rippled-job-queue`).
**Grafana dashboard**: New _Job Queue Analysis_ dashboard (`xrpld-job-queue`).
---
@@ -197,7 +197,7 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/telemetry/MetricsRegistry.cpp` (add counted object callbacks)
- `include/xrpl/basics/CountedObject.h` (may need static accessor for iteration)
**Derived Prometheus metrics**: `rippled_object_count{type="Transaction"}`, `rippled_object_count{type="NodeObject"}`, etc.
**Derived Prometheus metrics**: `xrpld_object_count{type="Transaction"}`, `xrpld_object_count{type="NodeObject"}`, etc.
**Grafana dashboard**: Add "Object Instance Counts" panel to _Node Health_ dashboard.
@@ -225,7 +225,7 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/telemetry/MetricsRegistry.cpp`
- `src/xrpld/app/misc/NetworkOPs.cpp` (expose load factor accessors if needed)
**Derived Prometheus metrics**: `rippled_load_factor`, `rippled_load_factor_fee_escalation`, etc.
**Derived Prometheus metrics**: `xrpld_load_factor`, `xrpld_load_factor_fee_escalation`, etc.
**Grafana dashboard**: Add "Load Factor Breakdown" panel to _Fee Market & TxQ_ dashboard.
@@ -243,7 +243,7 @@ These metrics serve multiple external consumer categories identified during rese
- `read_request_bundle` (native JSON int)
- `read_threads_running` (native JSON int)
- `read_threads_total` (native JSON int)
- Added new `rippled_server_info` Int64ObservableGauge with 8 metrics:
- Added new `xrpld_server_info` Int64ObservableGauge with 8 metrics:
- `server_state` — operating mode as int (0=DISCONNECTED .. 4=FULL)
- `uptime` — seconds since server start
- `peers` — total peer count
@@ -252,9 +252,9 @@ These metrics serve multiple external consumer categories identified during rese
- `peer_disconnects_resources` — cumulative resource-related disconnects
- `last_close_proposers` — from `getConsensusInfo()["previous_proposers"]`
- `last_close_converge_time_ms` — from `getConsensusInfo()["previous_mseconds"]`
- Added new `rippled_build_info` Int64ObservableGauge (info-style, value=1 with `version` label)
- Added new `rippled_complete_ledgers` Int64ObservableGauge parsing comma-separated ranges into `{bound, index}` pairs
- Added new `rippled_db_metrics` Int64ObservableGauge with 4 metrics:
- Added new `xrpld_build_info` Int64ObservableGauge (info-style, value=1 with `version` label)
- Added new `xrpld_complete_ledgers` Int64ObservableGauge parsing comma-separated ranges into `{bound, index}` pairs
- Added new `xrpld_db_metrics` Int64ObservableGauge with 4 metrics:
- `db_kb_total`, `db_kb_ledger`, `db_kb_transaction` (SQLite stat queries)
- `historical_perminute` (historical ledger fetch rate)
@@ -263,11 +263,11 @@ These metrics serve multiple external consumer categories identified during rese
- `src/xrpld/telemetry/MetricsRegistry.h` (4 new gauge members, updated ASCII diagram)
- `src/xrpld/telemetry/MetricsRegistry.cpp` (4 new callback registrations, 2 callback extensions)
**Not implementable inside rippled**:
**Not implementable inside xrpld**:
- `connection_count_51233/51234` — OS-level port connection counts from external shell script (`get_connection.sh`)
**Derived Prometheus metrics**: `rippled_server_info{metric="server_state"}`, `rippled_build_info{version="2.4.0"}`, `rippled_complete_ledgers{bound="start",index="0"}`, `rippled_db_metrics{metric="db_kb_total"}`, etc.
**Derived Prometheus metrics**: `xrpld_server_info{metric="server_state"}`, `xrpld_build_info{version="2.4.0"}`, `xrpld_complete_ledgers{bound="start",index="0"}`, `xrpld_db_metrics{metric="db_kb_total"}`, etc.
**Grafana dashboard**: New panels added to _Node Health_ dashboard (`system-node-health.json`).
@@ -280,12 +280,12 @@ These metrics serve multiple external consumer categories identified during rese
**What to do**:
- Create 2 new dashboards:
1. **Fee Market & TxQ** (`rippled-fee-market`) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
2. **Job Queue Analysis** (`rippled-job-queue`) — Per-job-type rates, queue wait times, execution times, job queue depth
1. **Fee Market & TxQ** (`xrpld-fee-market`) — TxQ depth/capacity, fee levels, load factor breakdown, fee escalation timeline
2. **Job Queue Analysis** (`xrpld-job-queue`) — Per-job-type rates, queue wait times, execution times, job queue depth
- Update 2 existing dashboards:
1. **Node Health** (`rippled-statsd-node-health`) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
2. **RPC Performance** (`rippled-rpc-perf`) — Add per-method RPC breakdown panels
1. **Node Health** (`xrpld-statsd-node-health`) — Add NodeStore I/O panels, cache hit rate panels, object instance counts
2. **RPC Performance** (`xrpld-rpc-perf`) — Add per-method RPC breakdown panels
**Key modified files**:
@@ -325,7 +325,7 @@ These metrics serve multiple external consumer categories identified during rese
**What to do**:
- Extend the existing telemetry integration test:
- Start rippled with `[telemetry] enabled=1` and `[insight] server=otel`
- Start xrpld with `[telemetry] enabled=1` and `[insight] server=otel`
- Submit a batch of RPC calls and transactions
- Query Prometheus for each new metric family
- Assert non-zero values for: NodeStore reads, cache hit rates, TxQ count, PerfLog RPC counters, object counts, load factors
@@ -351,23 +351,23 @@ These metrics serve multiple external consumer categories identified during rese
**Objective**: Create a Grafana dashboard for validation agreement, amendment/UNL health, and state tracking.
**Dashboard**: `rippled-validator-health.json`
**Dashboard**: `xrpld-validator-health.json`
| Panel | Type | PromQL |
| -------------------------- | ---------- | ---------------------------------------------------------------- |
| Agreement % (1h) | stat | `rippled_validation_agreement{metric="agreement_pct_1h"}` |
| Agreement % (24h) | stat | `rippled_validation_agreement{metric="agreement_pct_24h"}` |
| Agreements vs Missed (1h) | bargauge | `agreements_1h` and `missed_1h` side by side |
| Agreements vs Missed (24h) | bargauge | `agreements_24h` and `missed_24h` side by side |
| Validation Rate | stat | `rate(rippled_validations_sent_total[5m]) * 60` |
| Validations Checked Rate | stat | `rate(rippled_validations_checked_total[5m]) * 60` |
| Amendment Blocked | stat | `rippled_validator_health{metric="amendment_blocked"}` |
| UNL Expiry (days) | stat | `rippled_validator_health{metric="unl_expiry_days"}` |
| Validation Quorum | stat | `rippled_validator_health{metric="validation_quorum"}` |
| State Value Timeline | timeseries | `rippled_state_tracking{metric="state_value"}` |
| Time in Current State | stat | `rippled_state_tracking{metric="time_in_current_state_seconds"}` |
| State Changes Rate | stat | `rate(rippled_state_changes_total[1h])` |
| Ledgers Closed Rate | stat | `rate(rippled_ledgers_closed_total[5m]) * 60` |
| Panel | Type | PromQL |
| -------------------------- | ---------- | -------------------------------------------------------------- |
| Agreement % (1h) | stat | `xrpld_validation_agreement{metric="agreement_pct_1h"}` |
| Agreement % (24h) | stat | `xrpld_validation_agreement{metric="agreement_pct_24h"}` |
| Agreements vs Missed (1h) | bargauge | `agreements_1h` and `missed_1h` side by side |
| Agreements vs Missed (24h) | bargauge | `agreements_24h` and `missed_24h` side by side |
| Validation Rate | stat | `rate(xrpld_validations_sent_total[5m]) * 60` |
| Validations Checked Rate | stat | `rate(xrpld_validations_checked_total[5m]) * 60` |
| Amendment Blocked | stat | `xrpld_validator_health{metric="amendment_blocked"}` |
| UNL Expiry (days) | stat | `xrpld_validator_health{metric="unl_expiry_days"}` |
| Validation Quorum | stat | `xrpld_validator_health{metric="validation_quorum"}` |
| State Value Timeline | timeseries | `xrpld_state_tracking{metric="state_value"}` |
| Time in Current State | stat | `xrpld_state_tracking{metric="time_in_current_state_seconds"}` |
| State Changes Rate | stat | `rate(xrpld_state_changes_total[1h])` |
| Ledgers Closed Rate | stat | `rate(xrpld_ledgers_closed_total[5m]) * 60` |
**Dashboard conventions**: `$node` template variable for `exported_instance` filtering, dark theme, matching existing panel sizes and color schemes.
@@ -387,16 +387,16 @@ These metrics serve multiple external consumer categories identified during rese
**Objective**: Create a Grafana dashboard for peer health aggregates.
**Dashboard**: `rippled-peer-quality.json`
**Dashboard**: `xrpld-peer-quality.json`
| Panel | Type | PromQL |
| ---------------------- | ---------- | ---------------------------------------------------------------- |
| P90 Peer Latency | timeseries | `rippled_peer_quality{metric="peer_latency_p90_ms"}` |
| Insane/Diverged Peers | stat | `rippled_peer_quality{metric="peers_insane_count"}` |
| Higher Version Peers % | stat | `rippled_peer_quality{metric="peers_higher_version_pct"}` |
| Upgrade Recommended | stat | `rippled_peer_quality{metric="upgrade_recommended"}` |
| Resource Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects_Charges` |
| Inbound vs Outbound | bargauge | `rippled_Peer_Finder_Active_Inbound_Peers`, `..._Outbound_Peers` |
| Panel | Type | PromQL |
| ---------------------- | ---------- | -------------------------------------------------------------- |
| P90 Peer Latency | timeseries | `xrpld_peer_quality{metric="peer_latency_p90_ms"}` |
| Insane/Diverged Peers | stat | `xrpld_peer_quality{metric="peers_insane_count"}` |
| Higher Version Peers % | stat | `xrpld_peer_quality{metric="peers_higher_version_pct"}` |
| Upgrade Recommended | stat | `xrpld_peer_quality{metric="upgrade_recommended"}` |
| Resource Disconnects | timeseries | `xrpld_Overlay_Peer_Disconnects_Charges` |
| Inbound vs Outbound | bargauge | `xrpld_Peer_Finder_Active_Inbound_Peers`, `..._Outbound_Peers` |
**Key new files**: `docker/telemetry/grafana/dashboards/rippled-peer-quality.json`
@@ -414,13 +414,13 @@ These metrics serve multiple external consumer categories identified during rese
**Objective**: Add "Ledger Economy" row to the existing `system-node-health.json` dashboard.
| Panel | Type | PromQL |
| -------------------- | ---------- | ----------------------------------------------------- |
| Base Fee (drops) | stat | `rippled_ledger_economy{metric="base_fee_xrp"}` |
| Reserve Base (drops) | stat | `rippled_ledger_economy{metric="reserve_base_xrp"}` |
| Reserve Inc (drops) | stat | `rippled_ledger_economy{metric="reserve_inc_xrp"}` |
| Ledger Age | stat | `rippled_ledger_economy{metric="ledger_age_seconds"}` |
| Transaction Rate | timeseries | `rippled_ledger_economy{metric="transaction_rate"}` |
| Panel | Type | PromQL |
| -------------------- | ---------- | --------------------------------------------------- |
| Base Fee (drops) | stat | `xrpld_ledger_economy{metric="base_fee_xrp"}` |
| Reserve Base (drops) | stat | `xrpld_ledger_economy{metric="reserve_base_xrp"}` |
| Reserve Inc (drops) | stat | `xrpld_ledger_economy{metric="reserve_inc_xrp"}` |
| Ledger Age | stat | `xrpld_ledger_economy{metric="ledger_age_seconds"}` |
| Transaction Rate | timeseries | `xrpld_ledger_economy{metric="transaction_rate"}` |
**Key modified files**: `docker/telemetry/grafana/dashboards/system-node-health.json`

View File

@@ -1,4 +1,4 @@
# OpenTelemetry Distributed Tracing for rippled
# OpenTelemetry Distributed Tracing for xrpld
---
@@ -10,7 +10,7 @@
OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs.
### Why OpenTelemetry for rippled?
### Why OpenTelemetry for xrpld?
- **End-to-End Transaction Visibility**: Track transactions from submission → consensus → ledger inclusion
- **Cross-Node Correlation**: Follow requests across multiple independent nodes using a unique `trace_id`
@@ -59,13 +59,13 @@ flowchart LR
## Slide 3: Adoption Scope — Traces Only (Current Plan)
OpenTelemetry supports three signal types: **Traces**, **Metrics**, and **Logs**. rippled already captures metrics (StatsD via Beast Insight) and logs (Journal/PerfLog). The question is: how much of OTel do we adopt?
OpenTelemetry supports three signal types: **Traces**, **Metrics**, and **Logs**. xrpld already captures metrics (StatsD via Beast Insight) and logs (Journal/PerfLog). The question is: how much of OTel do we adopt?
> **Scenario A**: Add distributed tracing. Keep StatsD for metrics and Journal for logs.
```mermaid
flowchart LR
subgraph rippled["rippled Process"]
subgraph xrpld["xrpld Process"]
direction TB
OTel["OTel SDK<br/>(Traces)"]
Insight["Beast Insight<br/>(StatsD Metrics)"]
@@ -80,7 +80,7 @@ flowchart LR
StatsD --> Graphite["Graphite / Grafana"]
LogFile --> Loki["Loki (optional)"]
style rippled fill:#424242,stroke:#212121,color:#fff
style xrpld fill:#424242,stroke:#212121,color:#fff
style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
style Insight fill:#1565c0,stroke:#0d47a1,color:#fff
style Journal fill:#e65100,stroke:#bf360c,color:#fff
@@ -106,7 +106,7 @@ flowchart LR
```mermaid
flowchart LR
subgraph rippled["rippled Process"]
subgraph xrpld["xrpld Process"]
direction TB
OTel["OTel SDK<br/>(Traces + Metrics)"]
Journal["Journal + PerfLog<br/>(Logging)"]
@@ -119,7 +119,7 @@ flowchart LR
Collector --> Prom["Prometheus<br/>(Metrics)"]
LogFile --> Loki["Loki (optional)"]
style rippled fill:#424242,stroke:#212121,color:#fff
style xrpld fill:#424242,stroke:#212121,color:#fff
style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
style Journal fill:#e65100,stroke:#bf360c,color:#fff
style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
@@ -136,7 +136,7 @@ flowchart LR
```mermaid
flowchart LR
subgraph rippled["rippled Process"]
subgraph xrpld["xrpld Process"]
OTel["OTel SDK<br/>(Traces + Metrics + Logs)"]
end
@@ -146,7 +146,7 @@ flowchart LR
Collector --> Prom["Prometheus<br/>(Metrics)"]
Collector --> Loki["Loki / Elastic<br/>(Logs)"]
style rippled fill:#424242,stroke:#212121,color:#fff
style xrpld fill:#424242,stroke:#212121,color:#fff
style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
```
@@ -177,7 +177,7 @@ flowchart LR
---
## Slide 5: Comparison with rippled's Existing Solutions
## Slide 5: Comparison with xrpld's Existing Solutions
### Current Observability Stack
@@ -211,7 +211,7 @@ flowchart LR
```mermaid
flowchart TB
subgraph rippled["rippled Node"]
subgraph xrpld["xrpld Node"]
subgraph services["Core Services"]
direction LR
RPC["RPC Server<br/>(HTTP/WS)"] ~~~ Overlay["Overlay<br/>(P2P Network)"] ~~~ Consensus["Consensus<br/>(RCLConsensus)"]
@@ -227,7 +227,7 @@ flowchart TB
Collector --> Tempo["Grafana Tempo"]
Collector --> Elastic["Elastic APM"]
style rippled fill:#424242,stroke:#212121,color:#fff
style xrpld fill:#424242,stroke:#212121,color:#fff
style services fill:#1565c0,stroke:#0d47a1,color:#fff
style Telemetry fill:#2e7d32,stroke:#1b5e20,color:#fff
style Collector fill:#e65100,stroke:#bf360c,color:#fff
@@ -236,9 +236,9 @@ flowchart TB
**Reading the diagram:**
- **Core Services (blue, top)**: RPC Server, Overlay, and Consensus are the three primary components that generate trace data — they represent the entry points for client requests, peer messages, and consensus rounds respectively.
- **Telemetry Module (green, middle)**: The OpenTelemetry SDK sits below the core services and receives span data from all three; it acts as a single collection point within the rippled process.
- **OTel Collector (orange, center)**: An external process that receives spans over OTLP/gRPC from the Telemetry Module; it decouples rippled from backend choices and handles batching, sampling, and routing.
- **Backends (bottom row)**: Tempo and Elastic APM are interchangeable — the Collector fans out to any combination, so operators can switch backends without modifying rippled code.
- **Telemetry Module (green, middle)**: The OpenTelemetry SDK sits below the core services and receives span data from all three; it acts as a single collection point within the xrpld process.
- **OTel Collector (orange, center)**: An external process that receives spans over OTLP/gRPC from the Telemetry Module; it decouples xrpld from backend choices and handles batching, sampling, and routing.
- **Backends (bottom row)**: Tempo and Elastic APM are interchangeable — the Collector fans out to any combination, so operators can switch backends without modifying xrpld code.
- **Top-to-bottom flow**: Data flows from instrumented code down through the SDK, out over the network to the Collector, and finally into storage/visualization backends.
### Context Propagation
@@ -496,7 +496,7 @@ flowchart LR
| Aspect | Details |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Where it runs** | Inside rippled (SDK-level). Configured via `sampling_ratio` in `rippled.cfg`. |
| **Where it runs** | Inside xrpld (SDK-level). Configured via `sampling_ratio` in `xrpld.cfg`. |
| **When the decision happens** | At trace creation time before the first span is even populated. |
| **How it works** | `sampling_ratio=0.1` means each trace has a 10% probability of being recorded. Dropped traces incur near-zero overhead (no spans created, no attributes set, no export). |
| **Propagation** | Once a trace is sampled, the `trace_flags` field (1 byte in the context header) tells downstream nodes to also sample it. Unsampled traces propagate `trace_flags=0`, so downstream nodes skip them too. |
@@ -504,7 +504,7 @@ flowchart LR
| **Cons** | **Blind** it doesn't know if the trace will be interesting. A rare error or slow consensus round has only a 10% chance of being captured. |
| **Best for** | High-volume, steady-state traffic where most traces look similar (e.g., routine RPC requests). |
**rippled configuration**:
**xrpld configuration**:
```ini
[telemetry]
@@ -538,16 +538,16 @@ flowchart TB
style E fill:#4a148c,stroke:#2e0d57,color:#fff
```
| Aspect | Details |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Where it runs** | In the **OTel Collector** (external process), not inside rippled. rippled exports 100% of traces; the Collector decides what to keep. |
| **When the decision happens** | After the Collector has received all spans for a trace (waits `decision_wait=10s` for stragglers). |
| **How it works** | Policy rules evaluate the completed trace: keep all errors, keep slow operations above a threshold, keep all consensus rounds, then probabilistically sample the rest at 10%. |
| **Pros** | **Never misses important traces**. Errors, slow requests, and consensus anomalies are always captured regardless of probability. |
| **Cons** | Higher resource usage rippled must export 100% of spans to the Collector, which buffers them in memory before deciding. The Collector needs more RAM (configured via `num_traces` and `decision_wait`). |
| **Best for** | Production troubleshooting where you can't afford to miss errors or anomalies. |
| Aspect | Details |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Where it runs** | In the **OTel Collector** (external process), not inside xrpld. xrpld exports 100% of traces; the Collector decides what to keep. |
| **When the decision happens** | After the Collector has received all spans for a trace (waits `decision_wait=10s` for stragglers). |
| **How it works** | Policy rules evaluate the completed trace: keep all errors, keep slow operations above a threshold, keep all consensus rounds, then probabilistically sample the rest at 10%. |
| **Pros** | **Never misses important traces**. Errors, slow requests, and consensus anomalies are always captured regardless of probability. |
| **Cons** | Higher resource usage xrpld must export 100% of spans to the Collector, which buffers them in memory before deciding. The Collector needs more RAM (configured via `num_traces` and `decision_wait`). |
| **Best for** | Production troubleshooting where you can't afford to miss errors or anomalies. |
**Collector configuration** (tail sampling rules for rippled):
**Collector configuration** (tail sampling rules for xrpld):
```yaml
processors:
@@ -576,22 +576,22 @@ processors:
| | Head Sampling | Tail Sampling |
| ----------------------------- | ---------------------------------------- | ------------------------------------------------ |
| **Decision point** | Trace start (inside rippled) | Trace end (in OTel Collector) |
| **Decision point** | Trace start (inside xrpld) | Trace end (in OTel Collector) |
| **Knows trace content?** | No (random coin flip) | Yes (evaluates completed trace) |
| **Overhead on rippled** | Lowest (dropped traces = no-op) | Higher (must export 100% to Collector) |
| **Overhead on xrpld** | Lowest (dropped traces = no-op) | Higher (must export 100% to Collector) |
| **Collector resource usage** | Low (receives only sampled traces) | Higher (buffers all traces before deciding) |
| **Captures all errors?** | No (only if trace was randomly selected) | **Yes** (error policy catches them) |
| **Captures slow operations?** | No (random) | **Yes** (latency policy catches them) |
| **Configuration** | `rippled.cfg`: `sampling_ratio=0.1` | `otel-collector.yaml`: `tail_sampling` processor |
| **Configuration** | `xrpld.cfg`: `sampling_ratio=0.1` | `otel-collector.yaml`: `tail_sampling` processor |
| **Best for** | High-throughput steady-state | Troubleshooting & anomaly detection |
### Recommended Strategy for rippled
### Recommended Strategy for xrpld
Use **both** in a layered approach:
```mermaid
flowchart LR
subgraph rippled["rippled (Head Sampling)"]
subgraph xrpld["xrpld (Head Sampling)"]
HS["sampling_ratio=1.0<br/>(export everything)"]
end
@@ -603,14 +603,14 @@ flowchart LR
ST["Only interesting traces<br/>stored long-term"]
end
rippled -->|"100% of spans"| collector -->|"~15-20% kept"| storage
xrpld -->|"100% of spans"| collector -->|"~15-20% kept"| storage
style rippled fill:#424242,stroke:#212121,color:#fff
style xrpld fill:#424242,stroke:#212121,color:#fff
style collector fill:#1565c0,stroke:#0d47a1,color:#fff
style storage fill:#2e7d32,stroke:#1b5e20,color:#fff
```
> **Why this works**: rippled exports everything (no blind drops), the Collector applies intelligent filtering (keep errors/slow/anomalies, sample the rest), and only ~15-20% of traces reach storage. If Collector resource usage becomes a concern, add head sampling at `sampling_ratio=0.5` to halve the export volume while still giving the Collector enough data for good tail-sampling decisions.
> **Why this works**: xrpld exports everything (no blind drops), the Collector applies intelligent filtering (keep errors/slow/anomalies, sample the rest), and only ~15-20% of traces reach storage. If Collector resource usage becomes a concern, add head sampling at `sampling_ratio=0.5` to halve the export volume while still giving the Collector enough data for good tail-sampling decisions.
---