diff --git a/OpenTelemetryPlan/00-tracing-fundamentals.md b/OpenTelemetryPlan/00-tracing-fundamentals.md index 1e61ed9584..0dfac46e72 100644 --- a/OpenTelemetryPlan/00-tracing-fundamentals.md +++ b/OpenTelemetryPlan/00-tracing-fundamentals.md @@ -15,6 +15,33 @@ Distributed tracing is a method for tracking data objects as they flow through d --- +## Actors and Actions at a Glance + +### Actors + +| Who (Plain English) | Technical Term | +| ---------------------------------------------- | --------------- | +| A single unit of work being tracked | Span | +| The complete journey of a request | Trace | +| Data that links spans across services | Trace Context | +| Code that creates spans and propagates context | Instrumentation | +| Service that receives and processes traces | Collector | +| Storage and visualization system | Backend (Tempo) | +| Decision logic for which traces to keep | Sampler | + +### Actions + +| What Happens (Plain English) | Technical Term | +| --------------------------------------- | ----------------------- | +| Start tracking a new operation | Create a Span | +| Connect a child operation to its parent | Set `parent_span_id` | +| Group all related operations together | Share a `trace_id` | +| Pass tracking data between services | Context Propagation | +| Decide whether to record a trace | Sampling (Head or Tail) | +| Send completed traces to storage | Export (OTLP) | + +--- + ## Core Concepts ### 1. Trace @@ -33,16 +60,16 @@ Trace ID: abc123 A **span** represents a single unit of work within a trace. 
Each span has:

-| Attribute | Description | Example |
-| ---------------- | --------------------- | -------------------------- |
-| `trace_id` | Links to parent trace | `abc123` |
-| `span_id` | Unique identifier | `span456` |
-| `parent_span_id` | Parent span (if any) | `p_span123` |
-| `name` | Operation name | `rpc.submit` |
-| `start_time` | When work began | `2024-01-15T10:30:00Z` |
-| `end_time` | When work completed | `2024-01-15T10:30:00.050Z` |
-| `attributes` | Key-value metadata | `tx.hash=ABC...` |
-| `status` | OK, ERROR MSG | `OK` |
+| Attribute | Description | Example |
+| ---------------- | -------------------------------- | -------------------------- |
+| `trace_id` | Identifies the trace | `event123` |
+| `span_id` | Unique identifier | `span456` |
+| `parent_span_id` | Parent span (if any) | `p_span123` |
+| `name` | Operation name | `rpc.submit` |
+| `start_time` | When work began (node's local clock, in UTC) | `2024-01-15T10:30:00Z` |
+| `end_time` | When work completed (node's local clock, in UTC) | `2024-01-15T10:30:00.050Z` |
+| `attributes` | Key-value metadata | `tx.hash=ABC...` |
+| `status` | OK, ERROR MSG | `OK` |

 ### 3. Trace Context

@@ -74,6 +101,13 @@ flowchart TB
     style E fill:#bf360c,stroke:#8c2809,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **tx.submit (blue, root)**: The top-level span representing the entire transaction submission; all other spans are its descendants.
+- **tx.validate, tx.relay, tx.apply (green)**: Direct children of tx.submit, representing the three main stages -- validation, relay to peers, and application to the ledger.
+- **ledger.update (red)**: A grandchild span nested under tx.apply, representing the actual ledger state mutation triggered by applying the transaction.
+- **Arrows (parent to child)**: Each arrow indicates a parent-child span relationship where the parent's completion depends on the child finishing.
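The span hierarchy described in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Span` dataclass, `child_of`, and `depth` are hypothetical names invented for this example and are not part of rippled or of the OpenTelemetry SDK. The sketch shows how `trace_id` is shared while `parent_span_id` encodes the tree.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical, not the OpenTelemetry SDK type)."""
    name: str
    trace_id: str                          # 32 hex chars, shared by the whole trace
    parent_span_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))


def child_of(parent: Span, name: str) -> Span:
    """Create a child: same trace_id, parent_span_id points at the parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_span_id=parent.span_id)


def depth(span: Span, by_id: Dict[str, Span]) -> int:
    """Count how many parent hops separate a span from the root."""
    levels = 0
    while span.parent_span_id is not None:
        span = by_id[span.parent_span_id]
        levels += 1
    return levels


# Rebuild the tree from the diagram: tx.submit -> three stages -> ledger.update.
root = Span(name="tx.submit", trace_id=secrets.token_hex(16))
validate = child_of(root, "tx.validate")
relay = child_of(root, "tx.relay")
apply_span = child_of(root, "tx.apply")
ledger_update = child_of(apply_span, "ledger.update")

spans: List[Span] = [root, validate, relay, apply_span, ledger_update]
by_id = {s.span_id: s for s in spans}
for s in spans:
    print("  " * depth(s, by_id) + s.name)
```

A backend can reconstruct the tree from nothing more than these three identifiers, which is why every span carries them.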
+ The same trace visualized as a **timeline (Gantt chart)**: ``` @@ -92,6 +126,284 @@ ledger │ │▓▓▓▓▓▓▓▓▓▓▓▓▓ --- +## Span Relationships + +Spans don't always form simple parent-child trees. Distributed tracing defines several relationship types to capture different causal patterns: + +### 1. Parent-Child (ChildOf) + +The default relationship. The parent span **depends on** or **contains** the child span. The child runs within the scope of the parent. + +``` +tx.submit (parent) +├── tx.validate (child) ← parent waits for this +├── tx.relay (child) ← parent waits for this +└── tx.apply (child) ← parent waits for this +``` + +**When to use:** Synchronous calls, nested operations, any case where the parent's completion depends on the child. + +### 2. Follows-From + +A causal relationship where the first span **triggers** the second, but does **not wait** for it. The originator fires and moves on. + +``` +Time → + +tx.receive [=======] + ↓ triggers (follows-from) + tx.relay [===========] ← runs independently +``` + +**When to use:** Asynchronous jobs, queued work, fire-and-forget patterns. For example, a node receives a transaction and queues it for relay — the relay span _follows from_ the receive span but the receiver doesn't wait for relaying to complete. + +> **OpenTracing** defined `FollowsFrom` as a first-class reference type alongside `ChildOf`. +> **OpenTelemetry** represents this using **Span Links** with descriptive attributes instead (see below). + +### 3. Span Links (Cross-Trace and Non-Hierarchical) + +Links connect spans that are **causally related but not in a parent-child hierarchy**. Unlike parent-child, links can cross trace boundaries. 
+
+```
+Trace A                          Trace B
+──────                           ──────
+batch.schedule                   batch.execute
+├─ item.enqueue (span X)    ┌──► process.item
+├─ item.enqueue (span Y) ───┤    (links to X, Y, Z)
+├─ item.enqueue (span Z)    └──►
+```
+
+**Use cases:**
+
+| Pattern | Description |
+| -------------------- | --------------------------------------------------------------------------- |
+| **Batch processing** | A batch span links back to all individual spans that contributed to it |
+| **Fan-in** | An aggregation span links to the multiple producer spans it merges |
+| **Fan-out** | Multiple downstream spans link back to the single span that triggered them |
+| **Async handoff** | A deferred job links back to the request that queued it (follows-from) |
+| **Cross-trace** | Correlating spans across independent traces (e.g., retries, related events) |
+
+**Link structure:** Each link carries the target span's context plus optional attributes:
+
+```
+Link {
+  trace_id:
+  span_id:
+  attributes: { "link.description": "triggered by batch scheduler" }
+}
+```
+
+### Relationship Summary
+
+```mermaid
+flowchart LR
+    subgraph parent_child["Parent-Child"]
+        direction TB
+        P["Parent"] --> C["Child"]
+    end
+
+    subgraph follows_from["Follows-From"]
+        direction TB
+        A["Span A"] -.->|triggers| B["Span B"]
+    end
+
+    subgraph links["Span Links"]
+        direction TB
+        X["Span X\n(Trace 1)"] -.-|link| Y["Span Y\n(Trace 2)"]
+    end
+
+    parent_child ~~~ follows_from ~~~ links
+
+    style P fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style C fill:#1b5e20,stroke:#0d3d14,color:#ffffff
+    style A fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style B fill:#bf360c,stroke:#8c2809,color:#ffffff
+    style X fill:#4a148c,stroke:#38006b,color:#ffffff
+    style Y fill:#4a148c,stroke:#38006b,color:#ffffff
+```
+
+| Relationship | Same Trace? | Dependency? | OTel Mechanism |
+| ---------------- | ----------- | -------------------------- | ----------------- |
+| **Parent-Child** | Yes | Parent depends on child | `parent_span_id` |
+| **Follows-From** | Usually | Causal but no dependency | Link + attributes |
+| **Span Link** | Either | Correlation, no dependency | Link + attributes |
+
+---
+
+## Trace ID Generation
+
+A `trace_id` is a 128-bit (16-byte) identifier that groups all spans belonging to one logical operation. How it's generated determines how easily you can find and correlate traces later.
+
+### General Approaches
+
+#### 1. Random (W3C Default)
+
+Generate a random 128-bit ID when a trace starts. Standard approach for most services.
+
+```
+trace_id = random_128_bits()
+```
+
+| Pros | Cons |
+| --------------------------- | --------------------------------------------- |
+| Simple, standard | No natural correlation to domain events |
+| Unique per trace with overwhelming probability | If propagation is lost, trace is broken |
+| Works with all OTel tooling | "Find trace for TX abc" requires index lookup |
+
+#### 2. Deterministic (Derived from Domain Data)
+
+Compute the trace_id from a hash of a natural identifier. Every node independently derives the **same** trace_id for the same event.
+
+```
+trace_id = SHA-256(domain_identifier)[0:16] // truncate to 128 bits
+```
+
+| Pros | Cons |
+| --------------------------------------------------- | ---------------------------------------------------------- |
+| Propagation-resilient — same ID computed everywhere | Same event processed twice (retry) shares trace_id |
+| Natural search — domain ID maps directly to trace | Non-standard (tooling assumes random) |
+| No coordination needed between nodes | 256→128 bit truncation (collisions become likely only near ~2⁶⁴ IDs) |
+
+#### 3. Hybrid (Deterministic Prefix + Random Suffix)
+
+First 8 bytes derived from domain data, last 8 bytes random.
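The three generation approaches named in this subsection can be sketched with the Python standard library. This is a minimal sketch, not rippled code: the function names and the placeholder `tx_hash` value are assumptions made for illustration, and only `hashlib.sha256` and `secrets.token_bytes` are real library calls.

```python
import hashlib
import secrets


def random_trace_id() -> bytes:
    """Approach 1: 16 random bytes from a cryptographic RNG."""
    return secrets.token_bytes(16)


def deterministic_trace_id(domain_identifier: bytes) -> bytes:
    """Approach 2: first 16 bytes (128 bits) of SHA-256 over the domain identifier."""
    return hashlib.sha256(domain_identifier).digest()[:16]


def hybrid_trace_id(domain_identifier: bytes) -> bytes:
    """Approach 3: 8 deterministic prefix bytes followed by 8 random bytes."""
    prefix = hashlib.sha256(domain_identifier).digest()[:8]
    return prefix + secrets.token_bytes(8)


# Placeholder 256-bit value standing in for a real transaction hash (assumption).
tx_hash = bytes.fromhex("ab" * 32)

print("random:       ", random_trace_id().hex())
print("deterministic:", deterministic_trace_id(tx_hash).hex())
print("hybrid:       ", hybrid_trace_id(tx_hash).hex())
```

Calling `deterministic_trace_id` twice with the same input yields the same ID, while two `hybrid_trace_id` calls agree only on the first 8 bytes. That difference is the trade-off the Pros/Cons tables describe.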
+
+```
+trace_id = SHA-256(domain_identifier)[0:8] || random_64_bits()
+```
+
+| Pros | Cons |
+| ------------------------------------------- | ---------------------------------------- |
+| Prefix search: "find all traces for TX abc" | Must propagate to maintain full trace_id |
+| Unique per processing instance | More complex generation logic |
+| Retries get distinct trace_ids | Partial correlation only (prefix match) |
+
+### XRPL Workflow Analysis
+
+XRPL has a unique advantage: its core workflows produce **globally unique 256-bit hashes** that are known on every node. This makes deterministic trace_id generation practical in ways most systems can't achieve.
+
+#### Natural Identifiers by Workflow
+
+| Workflow | Natural Identifier | Size | Known at Start? | Same on All Nodes? |
+| ------------------- | --------------------------------- | ---------- | ----------------------------- | -------------------------------- |
+| **Transaction** | Transaction hash (`tid_`) | 256-bit | Yes — computed from signed tx | Yes — hash of canonical tx data |
+| **Consensus round** | Previous ledger hash + ledger seq | 256+32 bit | Yes — known when round opens | Yes — all validators agree |
+| **Validation** | Ledger hash being validated | 256-bit | Yes — from consensus result | Yes — same closed ledger |
+| **Ledger catch-up** | Target ledger hash | 256-bit | Yes — we know what to fetch | Yes — identifies ledger globally |
+
+#### Where These Identifiers Live in Code
+
+```
+Transaction:  STTx::getTransactionID() → uint256 tid_
+              TMTransaction::rawTransaction → recompute hash from bytes
+
+Consensus:    ConsensusProposal::prevLedger_ → uint256 (previous ledger hash)
+              ConsensusProposal::position_ → uint256 (TxSet hash)
+              LedgerHeader::seq → uint32_t (ledger sequence)
+
+Validation:   STValidation::getLedgerHash() → uint256
+              STValidation::getNodeID() → NodeID (160-bit)
+
+Ledger fetch: InboundLedger constructor → uint256 hash, uint32_t seq
+              TMGetLedger::ledgerHash → bytes (uint256)
+``` + +### Recommended Strategy: Workflow-Scoped Deterministic + +Each workflow type derives its trace_id from its natural domain identifier: + +``` +Transaction trace: trace_id = SHA-256("tx" || tx_hash)[0:16] +Consensus trace: trace_id = SHA-256("cons" || prev_ledger_hash || ledger_seq)[0:16] +Ledger catch-up: trace_id = SHA-256("fetch" || target_ledger_hash)[0:16] +``` + +The string prefix (`"tx"`, `"cons"`, `"fetch"`) prevents collisions between workflows that might share underlying hashes. + +**Why this works for XRPL:** + +1. **Propagation-resilient** — Even if a P2P message drops trace context, every node independently computes the same trace_id from the same tx_hash or ledger_hash. Spans still correlate. + +2. **Zero-cost search** — "Show me the trace for transaction ABC" becomes a direct lookup: compute `SHA-256("tx" || ABC)[0:16]` and query. No secondary index needed. + +3. **Cross-workflow linking via Span Links** — A consensus trace links to individual transaction traces. A validation span links to the consensus trace. This connects the full picture without forcing everything into one giant trace. + +### Cross-Workflow Correlation + +Each workflow gets its own trace. 
Span Links tie them together: + +```mermaid +flowchart TB + subgraph tx_trace["Transaction Trace"] + direction LR + Tn["trace_id = f(tx_hash)"]:::note --> T1["tx.receive"] --> T2["tx.validate"] --> T3["tx.relay"] + end + + subgraph cons_trace["Consensus Trace"] + direction LR + Cn["trace_id = f(prev_ledger, seq)"]:::note --> C1["cons.open"] --> C2["cons.propose"] --> C3["cons.accept"] + end + + subgraph val_trace["Validation"] + direction LR + Vn["spans within consensus trace"]:::note --> V1["val.create"] --> V2["val.broadcast"] + end + + subgraph fetch_trace["Catch-Up Trace"] + direction LR + Fn["trace_id = f(ledger_hash)"]:::note --> F1["fetch.request"] --> F2["fetch.receive"] --> F3["fetch.apply"] + end + + C1 -.-|"span link\n(tx traces)"| T3 + C3 --> V1 + F1 -.-|"span link\n(target ledger)"| C3 + + classDef note fill:none,stroke:#888,stroke-dasharray:5 5,color:#333,font-style:italic + style T1 fill:#0d47a1,stroke:#082f6a,color:#ffffff + style T2 fill:#0d47a1,stroke:#082f6a,color:#ffffff + style T3 fill:#0d47a1,stroke:#082f6a,color:#ffffff + style C1 fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style C2 fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style C3 fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style V1 fill:#bf360c,stroke:#8c2809,color:#ffffff + style V2 fill:#bf360c,stroke:#8c2809,color:#ffffff + style F1 fill:#4a148c,stroke:#38006b,color:#ffffff + style F2 fill:#4a148c,stroke:#38006b,color:#ffffff + style F3 fill:#4a148c,stroke:#38006b,color:#ffffff +``` + +**Reading the diagram:** + +- **Transaction Trace (blue)**: An independent trace whose `trace_id` is deterministically derived from the transaction hash. Contains receive, validate, and relay spans. +- **Consensus Trace (green)**: An independent trace whose `trace_id` is derived from the previous ledger hash and sequence number. Covers the open, propose, and accept phases. +- **Validation (red)**: Validation spans live within the consensus trace (not a separate trace). 
They are created after the accept phase completes. +- **Catch-Up Trace (purple)**: An independent trace for ledger acquisition, derived from the target ledger hash. Used when a node is behind and fetching missing ledgers. +- **Dotted arrows (span links)**: Cross-trace correlations. Consensus links to transaction traces it included; catch-up links to the consensus trace that produced the target ledger. +- **Solid arrow (C3 to V1)**: A parent-child relationship -- validation spans are direct children of the consensus accept span within the same trace. + +**How a query flows:** + +``` +"Why was TX abc slow?" + 1. Compute trace_id = SHA-256("tx" || abc)[0:16] + 2. Find transaction trace → see it was included in consensus round N + 3. Follow span link → consensus trace for round N + 4. See which phase was slow (propose? accept?) + 5. If a node was catching up, follow link → catch-up trace +``` + +### Trade-offs to Consider + +| Concern | Mitigation | +| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| **Retries get same trace_id** | Add `attempt` attribute to root span; spans have unique span_ids and timestamps | +| **256→128 bit truncation** | Birthday-bound collision at ~2⁶⁴ operations — negligible for XRPL's throughput | +| **Non-standard generation** | OTel spec allows any 16-byte non-zero value; tooling works on the hex string | +| **Hash computation cost** | SHA-256 is ~0.3μs per call; XRPL already computes these hashes for other purposes | +| **Late-binding identifiers** | Ledger hash isn't known until after consensus — validation spans use ledger_seq as fallback, then link to the consensus trace | + +--- + ## Distributed Traces Across Nodes In distributed systems like rippled, traces span **multiple independent nodes**. 
The trace context must be propagated in network messages: @@ -118,20 +430,27 @@ sequenceDiagram Note over NodeA,NodeC: All spans share trace_id: abc123
enabling correlation across nodes ``` +**Reading the diagram:** + +- **Client**: The external entity that submits a transaction. It does not carry trace context -- the trace originates at the first node. +- **Node A**: The entry point that creates a new trace (trace_id: abc123) and the root span `tx.receive`. It relays the transaction to peers with trace context attached. +- **Node B and Node C**: Peer nodes that receive the relayed transaction along with the propagated trace context. Each creates a child span under Node A's span, preserving the same `trace_id`. +- **Arrows with trace context**: The relay messages carry `trace_id` and `parent_span_id`, allowing each downstream node to link its spans back to the originating span on Node A. + --- ## Context Propagation For traces to work across nodes, **trace context must be propagated** in messages. -### What's in the Context (32 bytes) +### What's in the Context (~26 bytes) -| Field | Size | Description | -| ------------- | ---------- | ------------------------------------------------------- | -| `trace_id` | 16 bytes | Identifies the entire trace (constant across all nodes) | -| `span_id` | 8 bytes | The sender's current span (becomes parent on receiver) | -| `trace_flags` | 4 bytes | Sampling decision flags | -| `trace_state` | ~0-4 bytes | Optional vendor-specific data | +| Field | Size | Description | +| ------------- | -------- | ------------------------------------------------------- | +| `trace_id` | 16 bytes | Identifies the entire trace (constant across all nodes) | +| `span_id` | 8 bytes | The sender's current span (becomes parent on receiver) | +| `trace_flags` | 1 byte | Sampling decision (bit 0 = sampled; bits 1-7 reserved) | +| `trace_state` | variable | Optional vendor-specific data (typically omitted) | ### How span_id Changes at Each Hop @@ -165,11 +484,11 @@ There are two patterns: ### HTTP/RPC Headers (W3C Trace Context) ``` -traceparent: 00-abc123def456-span789-01 - │ │ │ │ - │ │ │ └── Flags 
(sampled) - │ │ └── Parent span ID - │ └── Trace ID +traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 + │ │ │ │ + │ │ │ └── Flags (sampled) + │ │ └── Parent span ID (16 hex) + │ └── Trace ID (32 hex) └── Version ``` @@ -228,16 +547,20 @@ Trace completes → Collector evaluates: ## Glossary -| Term | Definition | -| ------------------- | --------------------------------------------------------------- | -| **Trace** | Complete journey of a request, identified by `trace_id` | -| **Span** | Single operation within a trace | -| **Context** | Data propagated between services (`trace_id`, `span_id`, flags) | -| **Instrumentation** | Code that creates spans and propagates context | -| **Collector** | Service that receives, processes, and exports traces | -| **Backend** | Storage/visualization system (Jaeger, Tempo, etc.) | -| **Head Sampling** | Sampling decision at trace start | -| **Tail Sampling** | Sampling decision after trace completes | +| Term | Definition | +| -------------------- | ------------------------------------------------------------------- | +| **Trace** | Complete journey of a request, identified by `trace_id` | +| **Span** | Single operation within a trace | +| **Parent-Child** | Span relationship where the parent depends on the child | +| **Follows-From** | Causal relationship where originator doesn't wait for the result | +| **Span Link** | Non-hierarchical connection between spans, possibly across traces | +| **Deterministic ID** | Trace ID derived from domain data (e.g., tx_hash) instead of random | +| **Context** | Data propagated between services (`trace_id`, `span_id`, flags) | +| **Instrumentation** | Code that creates spans and propagates context | +| **Collector** | Service that receives, processes, and exports traces | +| **Backend** | Storage/visualization system (Tempo) | +| **Head Sampling** | Sampling decision at trace start | +| **Tail Sampling** | Sampling decision after trace completes | --- diff --git 
a/OpenTelemetryPlan/01-architecture-analysis.md b/OpenTelemetryPlan/01-architecture-analysis.md index 9eb448d78c..4424744e09 100644 --- a/OpenTelemetryPlan/01-architecture-analysis.md +++ b/OpenTelemetryPlan/01-architecture-analysis.md @@ -7,6 +7,8 @@ ## 1.1 Current rippled Architecture Overview +> **WS** = WebSocket | **UNL** = Unique Node List | **TxQ** = Transaction Queue | **StatsD** = Statistics Daemon + The rippled node software consists of several interconnected components that need instrumentation for distributed tracing: ```mermaid @@ -16,6 +18,7 @@ flowchart TB RPC["RPC Server
(HTTP/WS/gRPC)"] Overlay["Overlay
(P2P Network)"] Consensus["Consensus
(RCLConsensus)"] + ValidatorList["ValidatorList
(UNL Mgmt)"] end JobQueue["JobQueue
(Thread Pool)"] @@ -24,6 +27,13 @@ flowchart TB NetworkOPs["NetworkOPs
(Tx Processing)"] LedgerMaster["LedgerMaster
(Ledger Mgmt)"] NodeStore["NodeStore
(Database)"] + InboundLedgers["InboundLedgers
(Ledger Sync)"] + end + + subgraph appservices["Application Services"] + PathFind["PathFinding
(Payment Paths)"] + TxQ["TxQ
(Fee Escalation)"] + LoadMgr["LoadManager
(Fee/Load)"] end subgraph observability["Existing Observability"] @@ -34,27 +44,92 @@ flowchart TB services --> JobQueue JobQueue --> processing + JobQueue --> appservices end style rippled fill:#424242,stroke:#212121,color:#ffffff style services fill:#1565c0,stroke:#0d47a1,color:#ffffff style processing fill:#2e7d32,stroke:#1b5e20,color:#ffffff + style appservices fill:#6a1b9a,stroke:#4a148c,color:#ffffff style observability fill:#e65100,stroke:#bf360c,color:#ffffff ``` +**Reading the diagram:** + +- **Core Services (blue)**: The entry points into rippled -- RPC Server handles client requests, Overlay manages peer-to-peer networking, Consensus drives agreement, and ValidatorList manages trusted validators. +- **JobQueue (center)**: The asynchronous thread pool that decouples Core Services from the Processing and Application layers. All work flows through it. +- **Processing Layer (green)**: Core business logic -- NetworkOPs processes transactions, LedgerMaster manages ledger state, NodeStore handles persistence, and InboundLedgers synchronizes missing data. +- **Application Services (purple)**: Higher-level features -- PathFinding computes payment routes, TxQ manages fee-based queuing, and LoadManager tracks server load. +- **Existing Observability (orange)**: The current monitoring stack (PerfLog, Insight, Journal logging) that OpenTelemetry will complement, not replace. +- **Arrows (Services to JobQueue to layers)**: Work originates at Core Services, is enqueued onto the JobQueue, and dispatched to Processing or Application layers for execution. 
+ +--- + +## 1.1.1 Actors and Actions + +### Actors + +| Who (Plain English) | Technical Term | +| ----------------------------------------- | -------------------------- | +| Network node running XRPL software | rippled node | +| External client submitting requests | RPC Client | +| Network neighbor sharing data | Peer (PeerImp) | +| Request handler for client queries | RPC Server (ServerHandler) | +| Command executor for specific RPC methods | RPCHandler | +| Agreement process between nodes | Consensus (RCLConsensus) | +| Transaction processing coordinator | NetworkOPs | +| Background task scheduler | JobQueue | +| Ledger state manager | LedgerMaster | +| Payment route calculator | PathFinding (Pathfinder) | +| Transaction waiting room | TxQ (Transaction Queue) | +| Fee adjustment system | LoadManager | +| Trusted validator list manager | ValidatorList | +| Protocol upgrade tracker | AmendmentTable | +| Ledger state hash tree | SHAMap | +| Persistent key-value storage | NodeStore | + +### Actions + +| What Happens (Plain English) | Technical Term | +| ---------------------------------------------- | ---------------------- | +| Client sends a request to a node | `rpc.request` | +| Node executes a specific RPC command | `rpc.command.*` | +| Node receives a transaction from a peer | `tx.receive` | +| Node checks if a transaction is valid | `tx.validate` | +| Node forwards a transaction to neighbors | `tx.relay` | +| Nodes agree on which transactions to include | `consensus.round` | +| Consensus progresses through phases | `consensus.phase.*` | +| Node builds a new confirmed ledger | `ledger.build` | +| Node fetches missing ledger data from peers | `ledger.acquire` | +| Node computes payment routes | `pathfind.compute` | +| Node queues a transaction for later processing | `txq.enqueue` | +| Node increases fees due to high load | `fee.escalate` | +| Node fetches the latest trusted validator list | `validator.list.fetch` | +| Node votes on a protocol amendment | 
`amendment.vote` | +| Node synchronizes state tree data | `shamap.sync` | + --- ## 1.2 Key Components for Instrumentation -| Component | Location | Purpose | Trace Value | -| ----------------- | ------------------------------------------ | ------------------------ | ---------------------------- | -| **Overlay** | `src/xrpld/overlay/` | P2P communication | Message propagation timing | -| **PeerImp** | `src/xrpld/overlay/detail/PeerImp.cpp` | Individual peer handling | Per-peer latency | -| **RCLConsensus** | `src/xrpld/app/consensus/RCLConsensus.cpp` | Consensus algorithm | Round timing, phase analysis | -| **NetworkOPs** | `src/xrpld/app/misc/NetworkOPs.cpp` | Transaction processing | Tx lifecycle tracking | -| **ServerHandler** | `src/xrpld/rpc/detail/ServerHandler.cpp` | RPC entry point | Request latency | -| **RPCHandler** | `src/xrpld/rpc/detail/RPCHandler.cpp` | Command execution | Per-command timing | -| **JobQueue** | `src/xrpl/core/JobQueue.h` | Async task execution | Queue wait times | +> **TxQ** = Transaction Queue | **UNL** = Unique Node List + +| Component | Location | Purpose | Trace Value | +| ------------------ | ------------------------------------------ | ------------------------ | -------------------------------- | +| **Overlay** | `src/xrpld/overlay/` | P2P communication | Message propagation timing | +| **PeerImp** | `src/xrpld/overlay/detail/PeerImp.cpp` | Individual peer handling | Per-peer latency | +| **RCLConsensus** | `src/xrpld/app/consensus/RCLConsensus.cpp` | Consensus algorithm | Round timing, phase analysis | +| **NetworkOPs** | `src/xrpld/app/misc/NetworkOPs.cpp` | Transaction processing | Tx lifecycle tracking | +| **ServerHandler** | `src/xrpld/rpc/detail/ServerHandler.cpp` | RPC entry point | Request latency | +| **RPCHandler** | `src/xrpld/rpc/detail/RPCHandler.cpp` | Command execution | Per-command timing | +| **JobQueue** | `src/xrpl/core/JobQueue.h` | Async task execution | Queue wait times | +| **PathFinding** | 
`src/xrpld/app/paths/` | Payment path computation | Path latency, cache hits | +| **TxQ** | `src/xrpld/app/misc/TxQ.cpp` | Transaction queue/fees | Queue depth, eviction rates | +| **LoadManager** | `src/xrpld/app/main/LoadManager.cpp` | Fee escalation/load | Fee levels, load factors | +| **InboundLedgers** | `src/xrpld/app/ledger/InboundLedgers.cpp` | Ledger acquisition | Sync time, peer reliability | +| **ValidatorList** | `src/xrpld/app/misc/ValidatorList.cpp` | UNL management | List freshness, fetch failures | +| **AmendmentTable** | `src/xrpld/app/misc/AmendmentTable.cpp` | Protocol amendments | Voting status, activation events | +| **SHAMap** | `src/xrpld/shamap/` | State hash tree | Sync speed, missing nodes | --- @@ -93,6 +168,15 @@ sequenceDiagram Note over Client,PeerC: DISTRIBUTED TRACE (same trace_id: abc123) ``` +**Reading the diagram:** + +- **Client**: The external entity that submits a transaction to Peer A. It has no trace context -- the trace starts at the first node. +- **Peer A (Receive)**: The entry node that creates the root span `tx.receive`, runs HashRouter deduplication to avoid processing duplicates, and creates a child `tx.validate` span. +- **Peer A to Peer B arrow**: The relay message carries trace context (trace_id + parent span_id), enabling Peer B to create a linked span under the same trace. +- **Peer B (Relay)**: Receives the transaction and trace context, creates a `tx.receive` span linked to Peer A's trace, then relays onward. +- **Peer C (Validate)**: Final hop in this example. Creates a linked `tx.receive` span and runs `tx.process` to fully process the transaction. +- **Blue rectangles**: Highlight the span boundaries on each node, showing where instrumentation creates and closes spans. 
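The hop-by-hop behaviour described in the bullets above can be sketched as follows. This is a hedged illustration only: `TraceContext`, `Span`, `start_span`, and `outgoing_context` are hypothetical names for this example and do not correspond to rippled classes or to the OpenTelemetry API. It shows the one rule that matters: the sender's `span_id` becomes the receiver's `parent_span_id`, and `trace_id` never changes.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    """What travels inside the relay message (hypothetical wire shape)."""
    trace_id: str
    span_id: str        # the sender's current span
    sampled: bool = True


@dataclass
class Span:
    name: str
    node: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]


def start_span(name: str, node: str, incoming: Optional[TraceContext]) -> Span:
    """Start a root span when no context arrived, otherwise a child of the sender."""
    if incoming is None:
        trace_id, parent = secrets.token_hex(16), None
    else:
        trace_id, parent = incoming.trace_id, incoming.span_id
    return Span(name, node, trace_id, secrets.token_hex(8), parent)


def outgoing_context(span: Span) -> TraceContext:
    """Context to attach when this node relays the transaction onward."""
    return TraceContext(trace_id=span.trace_id, span_id=span.span_id)


# The client sends no context, so Peer A starts the trace.
a = start_span("tx.receive", "Peer A", incoming=None)
b = start_span("tx.receive", "Peer B", incoming=outgoing_context(a))
c = start_span("tx.receive", "Peer C", incoming=outgoing_context(b))

for span in (a, b, c):
    print(span.node, span.trace_id, span.span_id, span.parent_span_id)
```

If a relay message arrives without context, this sketch would start a fresh trace on the receiving node, and the chain would be broken at that hop.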
+ ### Trace Structure ``` @@ -142,16 +226,26 @@ flowchart TB style accept fill:#c2185b,stroke:#880e4f,color:#ffffff ``` +**Reading the diagram:** + +- **consensus.round (orange, root span)**: The top-level span encompassing the entire consensus round, with attributes like ledger sequence, mode, and proposer count. +- **consensus.phase.open (blue)**: The first phase where the node waits (~3s) to collect incoming transactions before proposing. +- **consensus.phase.establish (green)**: The negotiation phase where validators exchange proposals, resolve disputes, and converge on a transaction set. Child spans track each proposal received/sent and each dispute resolved. +- **consensus.phase.accept (pink)**: The final phase where the agreed transaction set is applied, a new ledger is built, and the ledger is validated. Child spans cover `ledger.build` and `ledger.validate`. +- **Arrows (open to establish to accept)**: The sequential flow through the three consensus phases. Each phase must complete before the next begins. + --- ## 1.5 RPC Request Flow +> **WS** = WebSocket + RPC requests support W3C Trace Context headers for distributed tracing across services: ```mermaid flowchart TB subgraph request["rpc.request (root span)"] - http["HTTP Request
POST /
traceparent: 00-abc123...-def456...-01"] + http["HTTP Request — POST /
traceparent:
00-abc123...-def456...-01"] attrs["Attributes:
http.method = POST
net.peer.ip = 192.168.1.100
xrpl.rpc.command = submit"] @@ -177,32 +271,56 @@ flowchart TB style command fill:#e65100,stroke:#bf360c,color:#ffffff ``` +**Reading the diagram:** + +- **rpc.request (green, root span)**: The outermost span representing the full RPC request lifecycle, from HTTP receipt to response. Carries the W3C `traceparent` header for distributed tracing. +- **HTTP Request node**: Shows the incoming POST request with its `traceparent` header and extracted attributes (method, peer IP, command name). +- **jobqueue.enqueue (blue)**: The span covering the asynchronous handoff from the RPC thread to the JobQueue worker thread. The trace context is preserved across this async boundary. +- **rpc.command.submit (orange)**: The span for the actual command execution, with child spans for deserialization, local validation, and network submission. +- **Response node**: The final output with HTTP status and total duration, marking the end of the root span. +- **Arrows (top to bottom)**: The sequential processing pipeline -- receive request, extract attributes, enqueue job, execute command, return response. 
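As a rough illustration of the first step in this pipeline, the sketch below parses a W3C `traceparent` header into the pieces a root span would need. It is an assumption-laden example rather than the rippled implementation: `ParentContext` and `parse_traceparent` are hypothetical names, and the parser handles only the four-field layout of version `00`.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class ParentContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # bit 0 is the sampled flag
    return ParentContext(version, trace_id, span_id, sampled)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
# The abbreviated header from the diagram is not a valid value on the wire.
print(parse_traceparent("00-abc123...-def456...-01"))
```

When parsing fails, a server would typically ignore the header and start a new trace instead of rejecting the request.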
+ --- ## 1.6 Key Trace Points +> **TxQ** = Transaction Queue + The following table identifies priority instrumentation points across the codebase: -| Category | Span Name | File | Method | Priority | -| --------------- | ---------------------- | -------------------- | ---------------------- | -------- | -| **Transaction** | `tx.receive` | `PeerImp.cpp` | `handleTransaction()` | High | -| **Transaction** | `tx.validate` | `NetworkOPs.cpp` | `processTransaction()` | High | -| **Transaction** | `tx.process` | `NetworkOPs.cpp` | `doTransactionSync()` | High | -| **Transaction** | `tx.relay` | `OverlayImpl.cpp` | `relay()` | Medium | -| **Consensus** | `consensus.round` | `RCLConsensus.cpp` | `startRound()` | High | -| **Consensus** | `consensus.phase.*` | `Consensus.h` | `timerEntry()` | High | -| **Consensus** | `consensus.proposal.*` | `RCLConsensus.cpp` | `peerProposal()` | Medium | -| **RPC** | `rpc.request` | `ServerHandler.cpp` | `onRequest()` | High | -| **RPC** | `rpc.command.*` | `RPCHandler.cpp` | `doCommand()` | High | -| **Peer** | `peer.connect` | `OverlayImpl.cpp` | `onHandoff()` | Low | -| **Peer** | `peer.message.*` | `PeerImp.cpp` | `onMessage()` | Low | -| **Ledger** | `ledger.acquire` | `InboundLedgers.cpp` | `acquire()` | Medium | -| **Ledger** | `ledger.build` | `RCLConsensus.cpp` | `buildLCL()` | High | +| Category | Span Name | File | Method | Priority | +| --------------- | ---------------------- | ---------------------- | ----------------------- | -------- | +| **Transaction** | `tx.receive` | `PeerImp.cpp` | `handleTransaction()` | High | +| **Transaction** | `tx.validate` | `NetworkOPs.cpp` | `processTransaction()` | High | +| **Transaction** | `tx.process` | `NetworkOPs.cpp` | `doTransactionSync()` | High | +| **Transaction** | `tx.relay` | `OverlayImpl.cpp` | `relay()` | Medium | +| **Consensus** | `consensus.round` | `RCLConsensus.cpp` | `startRound()` | High | +| **Consensus** | `consensus.phase.*` | `Consensus.h` | `timerEntry()` | High 
| +| **Consensus** | `consensus.proposal.*` | `RCLConsensus.cpp` | `peerProposal()` | Medium | +| **RPC** | `rpc.request` | `ServerHandler.cpp` | `onRequest()` | High | +| **RPC** | `rpc.command.*` | `RPCHandler.cpp` | `doCommand()` | High | +| **Peer** | `peer.connect` | `OverlayImpl.cpp` | `onHandoff()` | Low | +| **Peer** | `peer.message.*` | `PeerImp.cpp` | `onMessage()` | Low | +| **Ledger** | `ledger.acquire` | `InboundLedgers.cpp` | `acquire()` | Medium | +| **Ledger** | `ledger.build` | `RCLConsensus.cpp` | `buildLCL()` | High | +| **PathFinding** | `pathfind.request` | `PathRequest.cpp` | `doUpdate()` | High | +| **PathFinding** | `pathfind.compute` | `Pathfinder.cpp` | `findPaths()` | High | +| **TxQ** | `txq.enqueue` | `TxQ.cpp` | `apply()` | High | +| **TxQ** | `txq.apply` | `TxQ.cpp` | `processClosedLedger()` | High | +| **Fee** | `fee.escalate` | `LoadManager.cpp` | `raiseLocalFee()` | Medium | +| **Ledger** | `ledger.replay` | `LedgerReplayer.h` | `replay()` | Medium | +| **Ledger** | `ledger.delta` | `LedgerDeltaAcquire.h` | `processData()` | Medium | +| **Validator** | `validator.list.fetch` | `ValidatorList.cpp` | `verify()` | Medium | +| **Validator** | `validator.manifest` | `Manifest.cpp` | `applyManifest()` | Low | +| **Amendment** | `amendment.vote` | `AmendmentTable.cpp` | `doVoting()` | Low | +| **SHAMap** | `shamap.sync` | `SHAMap.cpp` | `fetchRoot()` | Medium | --- ## 1.7 Instrumentation Priority +> **TxQ** = Transaction Queue + ```mermaid quadrantChart title Instrumentation Priority Matrix @@ -213,18 +331,25 @@ quadrantChart quadrant-3 Quick Wins quadrant-4 Consider Later - RPC Tracing: [0.3, 0.85] - Transaction Tracing: [0.65, 0.92] - Consensus Tracing: [0.75, 0.87] - Peer Message Tracing: [0.4, 0.3] - Ledger Acquisition: [0.5, 0.6] - JobQueue Tracing: [0.35, 0.5] + RPC Tracing: [0.2, 0.92] + Transaction Tracing: [0.55, 0.88] + Consensus Tracing: [0.78, 0.82] + PathFinding: [0.38, 0.75] + TxQ and Fees: [0.25, 0.65] + Ledger Sync: [0.62, 
0.58] + Peer Message Tracing: [0.35, 0.25] + JobQueue Tracing: [0.2, 0.48] + Validator Mgmt: [0.48, 0.42] + Amendment Tracking: [0.15, 0.32] + SHAMap Operations: [0.72, 0.45] ``` --- ## 1.8 Observable Outcomes +> **TxQ** = Transaction Queue | **UNL** = Unique Node List + After implementing OpenTelemetry, operators and developers will gain visibility into the following: ### 1.8.1 What You Will See: Traces @@ -236,20 +361,28 @@ After implementing OpenTelemetry, operators and developers will gain visibility | **Consensus Rounds** | Complete round with all phases (open, establish, accept) | `{span.name=~"consensus.round.*"}` | | **RPC Request Processing** | Individual command execution with timing breakdown | `{xrpl.rpc.command="account_info"}` | | **Ledger Acquisition** | Peer-to-peer ledger data requests and responses | `{span.name="ledger.acquire"}` | +| **PathFinding Latency** | Path computation time and cache effectiveness for payment RPCs | `{span.name="pathfind.compute"}` | +| **TxQ Behavior** | Queue depth, eviction patterns, fee escalation during congestion | `{span.name=~"txq.*"}` | +| **Ledger Sync** | Full acquisition timeline including delta and transaction fetches | `{span.name=~"ledger.acquire.*"}` | +| **Validator Health** | UNL fetch success, manifest updates, stale list detection | `{span.name=~"validator.*"}` | ### 1.8.2 What You Will See: Metrics (Derived from Traces) -| Metric | Description | Dashboard Panel | -| ----------------------------- | -------------------------------------- | --------------------------- | -| **RPC Latency (p50/p95/p99)** | Response time distribution per command | Heatmap by command | -| **Transaction Throughput** | Transactions processed per second | Time series graph | -| **Consensus Round Duration** | Time to complete consensus phases | Histogram | -| **Cross-Node Latency** | Time for transaction to reach N nodes | Line chart with percentiles | -| **Error Rate** | Failed transactions/RPC calls by type | Stacked bar chart 
| +| Metric | Description | Dashboard Panel | +| ----------------------------- | --------------------------------------- | --------------------------- | +| **RPC Latency (p50/p95/p99)** | Response time distribution per command | Heatmap by command | +| **Transaction Throughput** | Transactions processed per second | Time series graph | +| **Consensus Round Duration** | Time to complete consensus phases | Histogram | +| **Cross-Node Latency** | Time for transaction to reach N nodes | Line chart with percentiles | +| **Error Rate** | Failed transactions/RPC calls by type | Stacked bar chart | +| **PathFinding Latency** | Path computation time per currency pair | Heatmap by currency | +| **TxQ Depth** | Queued transactions over time | Time series with thresholds | +| **Fee Escalation Level** | Current fee multiplier | Gauge with alert thresholds | +| **Ledger Sync Duration** | Time to acquire missing ledgers | Histogram | ### 1.8.3 Concrete Dashboard Examples -**Transaction Trace View (Jaeger/Tempo):** +**Transaction Trace View (Tempo):** ``` ┌────────────────────────────────────────────────────────────────────────────────┐ @@ -304,18 +437,22 @@ xychart-beta title "Consensus Round Duration (Last 24 Hours)" x-axis "Time of Day (Hours)" [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24] y-axis "Duration (seconds)" 1 --> 5 - line [2.1, 2.3, 2.5, 2.4, 2.8, 1.6, 3.2, 3.0, 3.5, 1.3, 3.8, 3.6, 4.0, 3.2, 4.3, 4.1, 4.5, 4.3, 4.2, 2.4, 4.8, 4.6, 4.9, 4.7, 5.0, 4.9, 4.8, 2.6, 4.7, 4.5, 4.2, 4.0, 2.5, 3.7, 3.2, 3.4, 2.9, 3.1, 2.6, 2.8, 2.3, 1.5, 2.7, 2.4, 2.5, 2.3, 2.2, 2.1, 2.0] + line [2.1, 2.4, 2.8, 3.2, 3.8, 4.3, 4.5, 5.0, 4.7, 4.0, 3.2, 2.6, 2.0] ``` ### 1.8.4 Operator Actionable Insights -| Scenario | What You'll See | Action | -| --------------------- | ---------------------------------------------------------------------------- | -------------------------------- | -| **Slow RPC** | Span showing which phase is slow (parsing, execution, serialization) | Optimize specific code 
path | -| **Transaction Stuck** | Trace stops at validation; error attribute shows reason | Fix transaction parameters | -| **Consensus Delay** | Phase.establish taking too long; proposer attribute shows missing validators | Investigate network connectivity | -| **Memory Spike** | Large batch of spans correlating with memory increase | Tune batch_size or sampling | -| **Network Partition** | Traces missing cross-node links for specific peer | Check peer connectivity | +| Scenario | What You'll See | Action | +| ------------------------- | ---------------------------------------------------------------------------- | ------------------------------------------------ | +| **Slow RPC** | Span showing which phase is slow (parsing, execution, serialization) | Optimize specific code path | +| **Transaction Stuck** | Trace stops at validation; error attribute shows reason | Fix transaction parameters | +| **Consensus Delay** | Phase.establish taking too long; proposer attribute shows missing validators | Investigate network connectivity | +| **Memory Spike** | Large batch of spans correlating with memory increase | Tune batch_size or sampling | +| **Network Partition** | Traces missing cross-node links for specific peer | Check peer connectivity | +| **Path Computation Slow** | pathfind.compute span shows high latency; cache miss rate in attributes | Warm the RippleLineCache, check order book depth | +| **TxQ Full** | txq.enqueue spans show evictions; fee.escalate spans increasing | Monitor fee levels, alert operators | +| **Ledger Sync Stalled** | ledger.acquire spans timing out; peer reliability attributes show issues | Check peer connectivity, add trusted peers | +| **UNL Stale** | validator.list.fetch spans failing; last_update attribute aging | Verify validator site URLs, check DNS | ### 1.8.5 Developer Debugging Workflow diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index 793dd6b5ac..8ff6eaa983 100644 --- 
a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -7,6 +7,8 @@ ## 2.1 OpenTelemetry Components +> **OTLP** = OpenTelemetry Protocol + ### 2.1.1 SDK Selection **Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`) @@ -32,6 +34,8 @@ ## 2.2 Exporter Configuration +> **OTLP** = OpenTelemetry Protocol + ```mermaid flowchart TB subgraph nodes["rippled Nodes"] @@ -43,8 +47,7 @@ flowchart TB collector["OpenTelemetry
Collector
(sidecar or standalone)"] subgraph backends["Observability Backends"] - jaeger["Jaeger
(Dev)"] - tempo["Tempo
(Prod)"] + tempo["Tempo"] elastic["Elastic
APM"] end @@ -52,7 +55,6 @@ flowchart TB node2 -->|"OTLP/gRPC
:4317"| collector node3 -->|"OTLP/gRPC
:4317"| collector - collector --> jaeger collector --> tempo collector --> elastic @@ -61,6 +63,13 @@ flowchart TB style collector fill:#bf360c,stroke:#8c2809,color:#ffffff ``` +**Reading the diagram:** + +- **rippled Nodes (blue)**: The source of telemetry data. Each rippled node exports spans via OTLP/gRPC on port 4317. +- **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing. +- **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends. +- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends. + ### 2.2.1 OTLP/gRPC (Recommended) ```cpp @@ -69,8 +78,8 @@ namespace otlp = opentelemetry::exporter::otlp; otlp::OtlpGrpcExporterOptions opts; opts.endpoint = "localhost:4317"; -opts.use_ssl_credentials = true; -opts.ssl_credentials_cacert_path = "/path/to/ca.crt"; +opts.useTls = true; +opts.sslCaCertPath = "/path/to/ca.crt"; ``` ### 2.2.2 OTLP/HTTP (Alternative) @@ -88,6 +97,8 @@ opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary ## 2.3 Span Naming Conventions +> **TxQ** = Transaction Queue | **UNL** = Unique Node List | **WS** = WebSocket + ### 2.3.1 Naming Schema ``` @@ -145,6 +156,36 @@ ledger: build: "Build new ledger" validate: "Ledger validation" close: "Close ledger" + replay: "Ledger replay executed" + delta: "Delta-based ledger acquired" + +# PathFinding Spans +pathfind: + request: "Path request initiated" + compute: "Path computation executed" + +# TxQ Spans +txq: + enqueue: "Transaction queued" + apply: "Queued transaction applied" + +# Fee/Load Spans +fee: + escalate: "Fee escalation triggered" + +# Validator 
Spans +validator: + list: + fetch: "UNL list fetched" + manifest: "Manifest update processed" + +# Amendment Spans +amendment: + vote: "Amendment voting executed" + +# SHAMap Spans +shamap: + sync: "State tree synchronization" # Job Spans job: @@ -156,6 +197,8 @@ job: ## 2.4 Attribute Schema +> **TxQ** = Transaction Queue | **UNL** = Unique Node List | **OTLP** = OpenTelemetry Protocol + ### 2.4.1 Resource Attributes (Set Once at Startup) ```cpp @@ -231,21 +274,75 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = "xrpl.job.worker" = int64 // Worker thread ID ``` +#### PathFinding Attributes + +```cpp +"xrpl.pathfind.source_currency" = string // Source currency code +"xrpl.pathfind.dest_currency" = string // Destination currency code +"xrpl.pathfind.path_count" = int64 // Number of paths found +"xrpl.pathfind.cache_hit" = bool // RippleLineCache hit +``` + +#### TxQ Attributes + +```cpp +"xrpl.txq.queue_depth" = int64 // Current queue depth +"xrpl.txq.fee_level" = int64 // Fee level of transaction +"xrpl.txq.eviction_reason" = string // Why transaction was evicted +``` + +#### Fee Attributes + +```cpp +"xrpl.fee.load_factor" = int64 // Current load factor +"xrpl.fee.escalation_level" = int64 // Fee escalation multiplier +``` + +#### Validator Attributes + +```cpp +"xrpl.validator.list_size" = int64 // UNL size +"xrpl.validator.list_age_sec" = int64 // Seconds since last update +``` + +#### Amendment Attributes + +```cpp +"xrpl.amendment.name" = string // Amendment name +"xrpl.amendment.status" = string // "enabled", "vetoed", "supported" +``` + +#### SHAMap Attributes + +```cpp +"xrpl.shamap.type" = string // "transaction", "state", "account_state" +"xrpl.shamap.missing_nodes" = int64 // Number of missing nodes during sync +"xrpl.shamap.duration_ms" = float64 // Sync duration +``` + ### 2.4.3 Data Collection Summary The following table summarizes what data is collected by category: -| Category | Attributes Collected | Purpose | -| --------------- | 
-------------------------------------------------------------------- | --------------------------- | -| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | -| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing | -| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | -| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | -| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | -| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | +| Category | Attributes Collected | Purpose | +| --------------- | ---------------------------------------------------------------------- | ---------------------------- | +| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | +| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing | +| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | +| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | +| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | +| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | +| **PathFinding** | `pathfind.source_currency`, `dest_currency`, `path_count`, `cache_hit` | Payment path analysis | +| **TxQ** | `txq.queue_depth`, `fee_level`, `eviction_reason` | Queue depth and fee tracking | +| **Fee** | `fee.load_factor`, `escalation_level` | Fee escalation monitoring | +| **Validator** | `validator.list_size`, `list_age_sec` | UNL health monitoring | +| **Amendment** | `amendment.name`, `status` | Protocol upgrade tracking | +| **SHAMap** | `shamap.type`, `missing_nodes`, `duration_ms` | 
State tree sync performance | ### 2.4.4 Privacy & Sensitive Data Policy +> **PII** = Personally Identifiable Information + OpenTelemetry instrumentation is designed to collect **operational metadata only**, never sensitive content. #### Data NOT Collected @@ -310,18 +407,22 @@ redact_account=1 # Hash account addresses before export redact_peer_address=1 # Remove peer IP addresses ``` +> **Note**: The `redact_account` configuration in `rippled.cfg` controls SDK-level redaction before export, while collector-level filtering (see [Collector-Level Data Protection](#collector-level-data-protection) above) provides an additional defense-in-depth layer. Both can operate independently. + > **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts, raw payloads). --- ## 2.5 Context Propagation Design +> **WS** = WebSocket + ### 2.5.1 Propagation Boundaries ```mermaid flowchart TB subgraph http["HTTP/WebSocket (RPC)"] - w3c["W3C Trace Context Headers:
traceparent: 00-{trace_id}-{span_id}-{flags}
tracestate: rippled="] + w3c["W3C Trace Context Headers:
traceparent:
00-trace_id-span_id-flags
tracestate: rippled=..."] end subgraph protobuf["Protocol Buffers (P2P)"] @@ -329,7 +430,7 @@ flowchart TB end subgraph jobqueue["JobQueue (Internal Async)"] - job["Context captured at job creation,
restored at execution

class Job {
opentelemetry::context::Context traceContext_;
};"] + job["Context captured at job creation,
restored at execution

class Job {
otel::context::Context
traceContext_;
};"] end style http fill:#0d47a1,stroke:#082f6a,color:#ffffff @@ -337,10 +438,18 @@ flowchart TB style jobqueue fill:#bf360c,stroke:#8c2809,color:#ffffff ``` +**Reading the diagram:** + +- **HTTP/WebSocket - RPC (blue)**: For client-facing RPC requests, trace context is propagated using the W3C `traceparent` header. This is the standard approach and works with any OTel-compatible client. +- **Protocol Buffers - P2P (green)**: For peer-to-peer messages between rippled nodes, trace context is embedded as a protobuf `TraceContext` message carrying trace_id, span_id, flags, and optional trace_state. +- **JobQueue - Internal Async (red)**: For asynchronous work within a single node, the OTel context is captured when a job is created and restored when the job executes on a worker thread. This bridges the async gap so spans remain linked. + --- ## 2.6 Integration with Existing Observability +> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket + ### 2.6.1 Existing Frameworks Comparison rippled already has two observability mechanisms. OpenTelemetry complements (not replaces) them: @@ -422,7 +531,7 @@ span->SetAttribute("peer.id", peerId); | Scenario | PerfLog | StatsD | OpenTelemetry | | --------------------------------------- | ---------- | ------ | ------------- | -| "How many TXs per second?" | ❌ | ✅ | ❌ | +| "How many TXs per second?" | ❌ | ✅ | ✅ | | "What's the p99 RPC latency?" | ❌ | ✅ | ✅ | | "Why was this specific TX slow?" | ⚠️ partial | ❌ | ✅ | | "Which node delayed consensus?" | ❌ | ❌ | ✅ | @@ -451,6 +560,14 @@ flowchart TB style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff ``` +**Reading the diagram:** + +- **rippled Process (dark gray)**: The single rippled node running all three observability frameworks side by side. Each framework operates independently with no interference. +- **PerfLog to perf.log**: PerfLog writes JSON-formatted event logs to a local file. Grafana can ingest these via Loki or a file-based datasource. 
+- **Beast Insight to StatsD Server**: Insight sends aggregated metrics (counters, gauges) over UDP to a StatsD server. Grafana reads from StatsD-compatible backends like Graphite or Prometheus (via StatsD exporter). +- **OpenTelemetry to OTLP Collector**: OTel exports spans over OTLP/gRPC to a Collector, which then forwards to a trace backend (Tempo). +- **Grafana (red, unified UI)**: All three data streams converge in Grafana, enabling operators to correlate logs, metrics, and traces in a single dashboard. + ### 2.6.5 Correlation with PerfLog Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging: diff --git a/OpenTelemetryPlan/03-implementation-strategy.md b/OpenTelemetryPlan/03-implementation-strategy.md index 723fe4978a..a20e329bcf 100644 --- a/OpenTelemetryPlan/03-implementation-strategy.md +++ b/OpenTelemetryPlan/03-implementation-strategy.md @@ -81,12 +81,14 @@ flowchart TB ## 3.3 Performance Overhead Summary -| Metric | Overhead | Notes | -| ------------- | ---------- | ----------------------------------- | -| CPU | 1-3% | Span creation and attribute setting | -| Memory | 2-5 MB | Batch buffer for pending spans | -| Network | 10-50 KB/s | Compressed OTLP export to collector | -| Latency (p99) | <2% | With proper sampling configuration | +> **OTLP** = OpenTelemetry Protocol + +| Metric | Overhead | Notes | +| ------------- | ---------- | ------------------------------------------------ | +| CPU | 1-3% | Of per-transaction CPU cost (~200μs baseline) | +| Memory | ~10 MB | SDK statics + batch buffer + worker thread stack | +| Network | 10-50 KB/s | Compressed OTLP export to collector | +| Latency (p99) | <2% | With proper sampling configuration | --- @@ -94,17 +96,26 @@ flowchart TB ### 3.4.1 Per-Operation Costs +> **Note on hardware assumptions**: The costs below are based on the official OTel C++ SDK CI benchmarks +> (969 runs on GitHub Actions 2-core shared runners). 
On production server hardware (3+ GHz Xeon), +> expect costs at the **lower end** of each range (~30-50% improvement over CI hardware). + | Operation | Time (ns) | Frequency | Impact | | --------------------- | --------- | ---------------------- | ---------- | -| Span creation | 200-500 | Every traced operation | Low | +| Span creation | 500-1000 | Every traced operation | Low | | Span end | 100-200 | Every traced operation | Low | | SetAttribute (string) | 80-120 | 3-5 per span | Low | | SetAttribute (int) | 40-60 | 2-3 per span | Negligible | -| AddEvent | 50-80 | 0-2 per span | Negligible | +| AddEvent | 100-200 | 0-2 per span | Low | | Context injection | 150-250 | Per outgoing message | Low | | Context extraction | 100-180 | Per incoming message | Low | | GetCurrent context | 10-20 | Thread-local access | Negligible | +**Source**: Span creation based on OTel C++ SDK `BM_SpanCreation` benchmark (AlwaysOnSampler + +SimpleSpanProcessor + InMemoryExporter), median ~1,000 ns on CI hardware. AddEvent includes +timestamp read + string copy + vector push + mutex acquisition. Context injection/extraction +confirmed by `BM_SpanCreationWithScope` benchmark delta (~160 ns). + ### 3.4.2 Transaction Processing Overhead
@@ -112,67 +123,91 @@ flowchart TB ```mermaid %%{init: {'pie': {'textPosition': 0.75}}}%% pie showData - "tx.receive (800ns)" : 800 - "tx.validate (500ns)" : 500 - "tx.relay (500ns)" : 500 - "Context inject (600ns)" : 600 + "tx.receive (1400ns)" : 1400 + "tx.validate (1200ns)" : 1200 + "tx.relay (1200ns)" : 1200 + "Context inject (200ns)" : 200 ``` -**Transaction Tracing Overhead (~2.4μs total)** +**Transaction Tracing Overhead (~4.0μs total)**
-**Overhead percentage**: 2.4 μs / 200 μs (avg tx processing) = **~1.2%** +**Overhead percentage**: 4.0 μs / 200 μs (avg tx processing) = **~2.0%** + +> **Breakdown**: Each span (tx.receive, tx.validate, tx.relay) costs ~1,000 ns for creation plus +> ~200-400 ns for 3-5 attribute sets. Context injection is ~200 ns (confirmed by benchmarks). +> On production hardware, expect ~2.6 μs total (~1.3% overhead) due to faster span creation (~500-600 ns). ### 3.4.3 Consensus Round Overhead | Operation | Count | Cost (ns) | Total | | ---------------------- | ----- | --------- | ---------- | -| consensus.round span | 1 | ~1000 | ~1 μs | -| consensus.phase spans | 3 | ~700 | ~2.1 μs | -| proposal.receive spans | ~20 | ~600 | ~12 μs | -| proposal.send spans | ~3 | ~600 | ~1.8 μs | +| consensus.round span | 1 | ~1200 | ~1.2 μs | +| consensus.phase spans | 3 | ~1100 | ~3.3 μs | +| proposal.receive spans | ~20 | ~1100 | ~22 μs | +| proposal.send spans | ~3 | ~1100 | ~3.3 μs | | Context operations | ~30 | ~200 | ~6 μs | -| **TOTAL** | | | **~23 μs** | +| **TOTAL** | | | **~36 μs** | -**Overhead percentage**: 23 μs / 3s (typical round) = **~0.0008%** (negligible) +> **Why higher**: Each span costs ~1,000 ns creation + ~100-200 ns for 1-2 attributes, totaling ~1,100-1,200 ns. +> Context operations remain ~200 ns (confirmed by benchmarks). On production hardware, expect ~24 μs total. 
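
The consensus table total can be re-derived mechanically, which is handy when a per-span estimate changes. The sketch below is illustrative only: `CostRow`, `rowTotalNs`, `consensusOverheadNs`, and `overheadPercent` are hypothetical helper names (not part of rippled or the OTel SDK), and the counts and costs are the CI-hardware estimates copied from the table.

```cpp
#include <cstdint>

// Hypothetical helper type: one row of the consensus overhead table.
struct CostRow {
    std::int64_t count;    // operations per consensus round
    std::int64_t cost_ns;  // approximate cost of one operation, in ns
};

constexpr std::int64_t rowTotalNs(CostRow row) {
    return row.count * row.cost_ns;
}

// Estimates copied from the table (CI hardware, not production).
constexpr CostRow kRoundSpan{1, 1200};
constexpr CostRow kPhaseSpans{3, 1100};
constexpr CostRow kProposalReceive{20, 1100};
constexpr CostRow kProposalSend{3, 1100};
constexpr CostRow kContextOps{30, 200};

// Sum of all rows: total tracing overhead per consensus round, in ns.
constexpr std::int64_t consensusOverheadNs() {
    return rowTotalNs(kRoundSpan) + rowTotalNs(kPhaseSpans) +
           rowTotalNs(kProposalReceive) + rowTotalNs(kProposalSend) +
           rowTotalNs(kContextOps);
}

// Overhead expressed as a percentage of the round duration.
constexpr double overheadPercent(std::int64_t overhead_ns,
                                 std::int64_t round_ns) {
    return 100.0 * static_cast<double>(overhead_ns) /
           static_cast<double>(round_ns);
}
```

With these inputs `consensusOverheadNs()` evaluates to 35,800 ns, which the table rounds to ~36 μs; passing that and a 3 s round to `overheadPercent` gives roughly 0.001%.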
+ +**Overhead percentage**: 36 μs / 3s (typical round) = **~0.001%** (negligible) ### 3.4.4 RPC Request Overhead | Operation | Cost (ns) | | ---------------- | ------------ | -| rpc.request span | ~700 | -| rpc.command span | ~600 | +| rpc.request span | ~1200 | +| rpc.command span | ~1100 | | Context extract | ~250 | | Context inject | ~200 | -| **TOTAL** | **~1.75 μs** | +| **TOTAL** | **~2.75 μs** | -- Fast RPC (1ms): 1.75 μs / 1ms = **~0.175%** -- Slow RPC (100ms): 1.75 μs / 100ms = **~0.002%** +> **Why higher**: Each span costs ~1,000 ns creation + ~100-200 ns for attributes (command name, +> version, role). Context extract/inject costs are confirmed by OTel C++ benchmarks. + +- Fast RPC (1ms): 2.75 μs / 1ms = **~0.275%** +- Slow RPC (100ms): 2.75 μs / 100ms = **~0.003%** --- ## 3.5 Memory Overhead Analysis +> **OTLP** = OpenTelemetry Protocol + ### 3.5.1 Static Memory -| Component | Size | Allocated | -| ------------------------ | ----------- | ---------- | -| TracerProvider singleton | ~64 KB | At startup | -| BatchSpanProcessor | ~128 KB | At startup | -| OTLP exporter | ~256 KB | At startup | -| Propagator registry | ~8 KB | At startup | -| **Total static** | **~456 KB** | | +| Component | Size | Allocated | +| ------------------------------------ | ----------- | ---------- | +| TracerProvider singleton | ~64 KB | At startup | +| BatchSpanProcessor (circular buffer) | ~16 KB | At startup | +| BatchSpanProcessor (worker thread) | ~8 MB | At startup | +| OTLP exporter (gRPC channel init) | ~256 KB | At startup | +| Propagator registry | ~8 KB | At startup | +| **Total static** | **~8.3 MB** | | + +> **Why higher than earlier estimate**: The BatchSpanProcessor's circular buffer itself is only ~16 KB +> (2049 x 8-byte `AtomicUniquePtr` entries), but it spawns a dedicated worker thread whose default +> stack size on Linux is ~8 MB. The OTLP gRPC exporter allocates memory for channel stubs and TLS +> initialization. 
The worker thread stack dominates the static footprint. ### 3.5.2 Dynamic Memory -| Component | Size per unit | Max units | Peak | -| -------------------- | ------------- | ---------- | ----------- | -| Active span | ~200 bytes | 1000 | ~200 KB | -| Queued span (export) | ~500 bytes | 2048 | ~1 MB | -| Attribute storage | ~50 bytes | 5 per span | Included | -| Context storage | ~64 bytes | Per thread | ~6.4 KB | -| **Total dynamic** | | | **~1.2 MB** | +| Component | Size per unit | Max units | Peak | +| -------------------- | -------------- | ---------- | --------------- | +| Active span | ~500-800 bytes | 1000 | ~500-800 KB | +| Queued span (export) | ~500 bytes | 2048 | ~1 MB | +| Attribute storage | ~80 bytes | 5 per span | Included | +| Context storage | ~64 bytes | Per thread | ~6.4 KB | +| **Total dynamic** | | | **~1.5-1.8 MB** | + +> **Why active spans are larger**: An active `Span` object includes the wrapper (~88 bytes: shared_ptr, +> mutex, unique_ptr to Recordable) plus `SpanData` (~250 bytes: SpanContext, timestamps, name, status, +> empty containers) plus attribute storage (~200-500 bytes for 3-5 string attributes in a `std::map`). +> Source: `sdk/src/trace/span.h` and `sdk/include/opentelemetry/sdk/trace/span_data.h`. +> Queued spans release the wrapper, keeping only `SpanData` + attributes (~500 bytes). 
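
The per-span figures in the note above add up directly, so the dynamic memory rows can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, and the byte counts are the approximate estimates quoted in the note, not values measured here.

```cpp
#include <cstddef>

// Byte estimates quoted in the note above (approximate, not measured here).
constexpr std::size_t kWrapperBytes = 88;    // Span wrapper
constexpr std::size_t kSpanDataBytes = 250;  // SpanData with empty containers

// An active span carries the wrapper, SpanData, and its attributes.
constexpr std::size_t activeSpanBytes(std::size_t attribute_bytes) {
    return kWrapperBytes + kSpanDataBytes + attribute_bytes;
}

// A queued span has released the wrapper and keeps SpanData + attributes.
constexpr std::size_t queuedSpanBytes(std::size_t attribute_bytes) {
    return kSpanDataBytes + attribute_bytes;
}

// Peak memory for a bounded population of spans.
constexpr std::size_t peakBytes(std::size_t bytes_per_span,
                                std::size_t max_spans) {
    return bytes_per_span * max_spans;
}
```

For 200-500 bytes of attributes, `activeSpanBytes` yields 538-838 bytes, so 1000 active spans peak at about 0.5-0.8 MB; `peakBytes(500, 2048)` gives 1,024,000 bytes for a full export queue.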
### 3.5.3 Memory Growth Characteristics @@ -184,18 +219,34 @@ config: height: 400 --- xychart-beta - title "Memory Usage vs Span Rate" + title "Memory Usage vs Span Rate (bounded by queue limit)" x-axis "Spans/second" [0, 200, 400, 600, 800, 1000] - y-axis "Memory (MB)" 0 --> 6 - line [1, 1.8, 2.6, 3.4, 4.2, 5] + y-axis "Memory (MB)" 0 --> 12 + line [8.5, 9.2, 9.6, 9.9, 10.0, 10.0] ``` **Notes**: -- Memory increases linearly with span rate +- Memory increases with span rate but **plateaus at queue capacity** (default 2048 spans) - Batch export prevents unbounded growth -- Queue size is configurable (default 2048 spans) - At queue limit, oldest spans are dropped (not blocked) +- Maximum memory is bounded: ~8.3 MB static (dominated by worker thread stack) + 2048 queued spans x ~500 bytes (~1 MB) + active spans (~0.8 MB) ≈ **~10 MB ceiling** +- The worker thread stack (~8 MB) is virtual memory; actual RSS depends on stack usage (typically much less) + +### 3.5.4 Performance Data Sources + +The overhead estimates in Sections 3.3-3.5 are derived from the following sources: + +| Source | What it covers | URL | +| ------------------------------------------------ | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| OTel C++ SDK CI benchmarks (969 runs) | Span creation, context activation, sampler overhead | [Benchmark Dashboard](https://open-telemetry.github.io/opentelemetry-cpp/benchmarks/) | +| `api/test/trace/span_benchmark.cc` | API-level span creation (~22 ns no-op) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/api/test/trace/span_benchmark.cc) | +| `sdk/test/trace/sampler_benchmark.cc` | SDK span creation with samplers (~1,000 ns AlwaysOn) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/test/trace/sampler_benchmark.cc) | +| `sdk/include/.../span_data.h` | SpanData 
memory layout (~250 bytes base) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/trace/span_data.h) | +| `sdk/src/trace/span.h` | Span wrapper memory layout (~88 bytes) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/trace/span.h) | +| `sdk/include/.../batch_span_processor_options.h` | Default queue size (2048), batch size (512) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/trace/batch_span_processor_options.h) | +| `sdk/include/.../circular_buffer.h` | CircularBuffer implementation (AtomicUniquePtr array) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/common/circular_buffer.h) | +| OTLP proto definition | Serialized span size estimation | [Proto](https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto) | --- @@ -203,6 +254,11 @@ xychart-beta ### 3.6.1 Export Bandwidth +> **Bytes per span**: Estimates use ~500 bytes/span (conservative upper bound). OTLP protobuf analysis +> shows a typical span with 3-5 string attributes serializes to ~200-300 bytes raw; gzip +> compression (~60-70% of raw) and batching (amortized headers) keep the on-wire cost per span below that raw size. +> The table uses the conservative ~500-byte estimate for capacity planning. 
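
Export bandwidth in this section is the product of exported spans per second and bytes per span. The sketch below makes that explicit; `exportBytesPerSec` and `kBytesPerSpan` are hypothetical names introduced for illustration, and the inputs (500 spans/sec before sampling, 500 bytes/span) are this section's own planning estimates.

```cpp
#include <cstdint>

// Conservative per-span size used for capacity planning in this section.
constexpr std::int64_t kBytesPerSpan = 500;

// Estimated export bandwidth in bytes per second.
//   spans_per_sec    : spans produced per second before sampling
//   sampling_percent : share of spans exported, 0-100
//   bytes_per_span   : assumed on-wire size of one span
constexpr std::int64_t exportBytesPerSec(
    std::int64_t spans_per_sec,
    std::int64_t sampling_percent,
    std::int64_t bytes_per_span = kBytesPerSpan) {
    return spans_per_sec * sampling_percent * bytes_per_span / 100;
}
```

`exportBytesPerSec(500, 100)` evaluates to 250,000 bytes/s, matching the ~250 KB/s listed for 100% sampling; lower sampling rates scale the result linearly.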
+ | Sampling Rate | Spans/sec | Bandwidth | Notes | | ------------- | --------- | --------- | ---------------- | | 100% | ~500 | ~250 KB/s | Development only | @@ -214,10 +270,10 @@ xychart-beta | Message Type | Context Size | Messages/sec | Overhead | | ---------------------- | ------------ | ------------ | ----------- | -| TMTransaction | 32 bytes | ~100 | ~3.2 KB/s | -| TMProposeSet | 32 bytes | ~10 | ~320 B/s | -| TMValidation | 32 bytes | ~50 | ~1.6 KB/s | -| **Total P2P overhead** | | | **~5 KB/s** | +| TMTransaction | 25 bytes | ~100 | ~2.5 KB/s | +| TMProposeSet | 25 bytes | ~10 | ~250 B/s | +| TMValidation | 25 bytes | ~50 | ~1.25 KB/s | +| **Total P2P overhead** | | | **~4 KB/s** | --- @@ -225,6 +281,8 @@ xychart-beta ### 3.7.1 Sampling Strategies +#### Tail Sampling + ```mermaid flowchart TD trace["New Trace"] @@ -284,6 +342,8 @@ if (telemetry.shouldTracePeer()) ## 3.9 Code Intrusiveness Assessment +> **TxQ** = Transaction Queue + This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing rippled codebase. 
### 3.9.1 Files Modified Summary @@ -297,7 +357,10 @@ This section provides a detailed assessment of how intrusive the OpenTelemetry i | **Consensus** | 3 files | ~100 | ~30 | Low-Medium | | **Protocol Buffers** | 1 file | ~25 | 0 | Low | | **CMake/Build** | 3 files | ~50 | ~10 | Minimal | -| **Total** | **~21 files** | **~1,205** | **~105** | **Low** | +| **PathFinding** | 2 files | ~80 | ~5 | Minimal | +| **TxQ/Fee** | 2 files | ~60 | ~5 | Minimal | +| **Validator/Amend** | 3 files | ~40 | ~5 | Minimal | +| **Total** | **~28 files** | **~1,490** | **~120** | **Low** | ### 3.9.2 Detailed File Impact @@ -307,6 +370,9 @@ pie title Code Changes by Component "Transaction Relay" : 160 "Consensus" : 130 "RPC Layer" : 100 + "PathFinding" : 80 + "TxQ/Fee" : 60 + "Validator/Amendment" : 40 "Application Init" : 35 "Protocol Buffers" : 25 "Build System" : 60 @@ -337,6 +403,14 @@ pie title Code Changes by Component | `src/xrpld/app/consensus/RCLConsensus.cpp` | ~50 | ~15 | Medium | | `src/xrpld/app/consensus/RCLConsensusAdaptor.cpp` | ~40 | ~12 | Medium | | `src/xrpld/core/JobQueue.cpp` | ~20 | ~5 | Low | +| `src/xrpld/app/paths/PathRequest.cpp` | ~40 | ~3 | Low | +| `src/xrpld/app/paths/Pathfinder.cpp` | ~40 | ~2 | Low | +| `src/xrpld/app/misc/TxQ.cpp` | ~40 | ~3 | Low | +| `src/xrpld/app/main/LoadManager.cpp` | ~20 | ~2 | Low | +| `src/xrpld/app/misc/ValidatorList.cpp` | ~20 | ~2 | Low | +| `src/xrpld/app/misc/AmendmentTable.cpp` | ~10 | ~2 | Low | +| `src/xrpld/app/misc/Manifest.cpp` | ~10 | ~1 | Low | +| `src/xrpld/shamap/SHAMap.cpp` | ~20 | ~3 | Low | | `src/xrpld/overlay/detail/ripple.proto` | ~25 | 0 | Low | | `CMakeLists.txt` | ~40 | ~8 | Low | | `cmake/FindOpenTelemetry.cmake` | ~50 | 0 | None (new) | @@ -353,12 +427,15 @@ quadrantChart x-axis Low Risk --> High Risk y-axis Low Value --> High Value - RPC Tracing: [0.2, 0.8] - Transaction Relay: [0.5, 0.9] - Consensus Tracing: [0.7, 0.95] - Peer Message Tracing: [0.8, 0.4] - JobQueue Context: [0.4, 0.5] - Ledger Acquisition: [0.5, 0.6] 
+ RPC Tracing: [0.2, 0.55] + Transaction Relay: [0.55, 0.85] + Consensus Tracing: [0.75, 0.92] + Peer Message Tracing: [0.85, 0.35] + JobQueue Context: [0.3, 0.42] + Ledger Acquisition: [0.48, 0.65] + PathFinding: [0.38, 0.72] + TxQ and Fees: [0.25, 0.62] + Validator Mgmt: [0.15, 0.35] ``` **Optional** ↙ ↘ **Avoid** @@ -375,15 +452,15 @@ quadrantChart ### 3.9.4 Architectural Impact Assessment -| Aspect | Impact | Justification | -| -------------------- | ------- | --------------------------------------------------------------------- | -| **Data Flow** | None | Tracing is purely observational; no business logic changes | -| **Threading Model** | Minimal | Context propagation uses thread-local storage (standard OTel pattern) | -| **Memory Model** | Low | Bounded queues prevent unbounded growth; RAII ensures cleanup | -| **Network Protocol** | Low | Optional fields in protobuf (high field numbers); backward compatible | -| **Configuration** | None | New config section; existing configs unaffected | -| **Build System** | Low | Optional CMake flag; builds work without OpenTelemetry | -| **Dependencies** | Low | OpenTelemetry SDK is optional; null implementation when disabled | +| Aspect | Impact | Justification | +| -------------------- | ------- | -------------------------------------------------------------------------------- | +| **Data Flow** | Minimal | Read-only instrumentation; no modification to consensus or transaction data flow | +| **Threading Model** | Minimal | Context propagation uses thread-local storage (standard OTel pattern) | +| **Memory Model** | Low | Bounded queues prevent unbounded growth; RAII ensures cleanup | +| **Network Protocol** | Low | Optional fields in protobuf (high field numbers); backward compatible | +| **Configuration** | None | New config section; existing configs unaffected | +| **Build System** | Low | Optional CMake flag; builds work without OpenTelemetry | +| **Dependencies** | Low | OpenTelemetry SDK is optional; null 
implementation when disabled | ### 3.9.5 Backward Compatibility diff --git a/OpenTelemetryPlan/04-code-samples.md b/OpenTelemetryPlan/04-code-samples.md index 3daf6adfbf..bf54e6d913 100644 --- a/OpenTelemetryPlan/04-code-samples.md +++ b/OpenTelemetryPlan/04-code-samples.md @@ -7,6 +7,8 @@ ## 4.1 Core Interfaces +> **OTLP** = OpenTelemetry Protocol + ### 4.1.1 Main Telemetry Interface ```cpp @@ -69,6 +71,10 @@ public: bool traceRpc = true; bool tracePeer = false; // High volume, disabled by default bool traceLedger = true; + bool tracePathfind = true; + bool traceTxQ = true; + bool traceValidator = false; // Low volume, disabled by default + bool traceAmendment = false; // Very low volume, disabled by default }; virtual ~Telemetry() = default; @@ -140,6 +146,21 @@ public: /** Check if peer message tracing is enabled */ virtual bool shouldTracePeer() const = 0; + + /** Check if ledger tracing is enabled */ + virtual bool shouldTraceLedger() const = 0; + + /** Check if path finding tracing is enabled */ + virtual bool shouldTracePathfind() const = 0; + + /** Check if transaction queue tracing is enabled */ + virtual bool shouldTraceTxQ() const = 0; + + /** Check if validator list/manifest tracing is enabled */ + virtual bool shouldTraceValidator() const = 0; + + /** Check if amendment voting tracing is enabled */ + virtual bool shouldTraceAmendment() const = 0; }; // Factory functions @@ -191,11 +212,17 @@ public: /** * Construct guard with span. * The span becomes the current span in thread-local context. + * + * @note If span is nullptr (e.g., telemetry disabled), the guard + * becomes a no-op. All methods safely check for null before access. */ explicit SpanGuard( opentelemetry::nostd::shared_ptr span) - : span_(std::move(span)) - , scope_(span_) + : span_(span ? std::move(span) : nullptr) + , scope_(span_ ? 
opentelemetry::trace::Scope(span_) + : opentelemetry::trace::Scope( + opentelemetry::nostd::shared_ptr< + opentelemetry::trace::Span>(nullptr))) { } @@ -277,6 +304,12 @@ public: void addEvent(std::string_view) {} void recordException(std::exception const&) {} + + /** Return a default empty context (matches SpanGuard interface) */ + opentelemetry::context::Context context() const + { + return opentelemetry::context::Context{}; + } }; } // namespace telemetry @@ -332,17 +365,66 @@ namespace telemetry { _xrpl_guard_.emplace((telemetry).startSpan(name)); \ } -// Set attribute on current span (if exists) -#define XRPL_TRACE_SET_ATTR(key, value) \ - if (_xrpl_guard_.has_value()) { \ - _xrpl_guard_->setAttribute(key, value); \ +#define XRPL_TRACE_PEER(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTracePeer()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ } +#define XRPL_TRACE_LEDGER(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTraceLedger()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ + } + +#define XRPL_TRACE_PATHFIND(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTracePathfind()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ + } + +#define XRPL_TRACE_TXQ(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTraceTxQ()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ + } + +#define XRPL_TRACE_VALIDATOR(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTraceValidator()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ + } + +#define XRPL_TRACE_AMENDMENT(telemetry, name) \ + std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \ + if ((telemetry).shouldTraceAmendment()) { \ + _xrpl_guard_.emplace((telemetry).startSpan(name)); \ + 
} + +// Set attribute on current span (if exists). +// Works with both std::optional<SpanGuard> (from conditional macros) +// and bare SpanGuard (from XRPL_TRACE_SPAN). Dispatches with 'if constexpr' +// on a requires-expression that checks for .has_value(). +#define XRPL_TRACE_SET_ATTR(key, value) \ + do { \ + if constexpr (requires { _xrpl_guard_.has_value(); }) { \ + if (_xrpl_guard_.has_value()) \ + _xrpl_guard_->setAttribute(key, value); \ + } else { \ + _xrpl_guard_.setAttribute(key, value); \ + } \ + } while(0) + // Record exception on current span #define XRPL_TRACE_EXCEPTION(e) \ - if (_xrpl_guard_.has_value()) { \ - _xrpl_guard_->recordException(e); \ - } + do { \ + if constexpr (requires { _xrpl_guard_.has_value(); }) { \ + if (_xrpl_guard_.has_value()) \ + _xrpl_guard_->recordException(e); \ + } else { \ + _xrpl_guard_.recordException(e); \ + } \ + } while(0) #else // XRPL_ENABLE_TELEMETRY not defined @@ -351,6 +433,12 @@ namespace telemetry { #define XRPL_TRACE_TX(telemetry, name) ((void)0) #define XRPL_TRACE_CONSENSUS(telemetry, name) ((void)0) #define XRPL_TRACE_RPC(telemetry, name) ((void)0) +#define XRPL_TRACE_PEER(telemetry, name) ((void)0) +#define XRPL_TRACE_LEDGER(telemetry, name) ((void)0) +#define XRPL_TRACE_PATHFIND(telemetry, name) ((void)0) +#define XRPL_TRACE_TXQ(telemetry, name) ((void)0) +#define XRPL_TRACE_VALIDATOR(telemetry, name) ((void)0) +#define XRPL_TRACE_AMENDMENT(telemetry, name) ((void)0) #define XRPL_TRACE_SET_ATTR(key, value) ((void)0) #define XRPL_TRACE_EXCEPTION(e) ((void)0) @@ -369,6 +457,9 @@ namespace telemetry { Add to `src/xrpld/overlay/detail/ripple.proto`: ```protobuf +// Note: rippled uses proto2 syntax. proto2 requires an explicit label on +// each field, and 'optional' is the label used for the fields below. 
+ // Trace context for distributed tracing across nodes // Uses W3C Trace Context format internally message TraceContext { @@ -423,6 +514,8 @@ message TMLedgerData { #pragma once #include +#include +#include #include #include // Generated protobuf @@ -480,7 +573,14 @@ TraceContextPropagator::extract(protocol::TraceContext const& proto) using namespace opentelemetry::trace; if (proto.trace_id().size() != 16 || proto.span_id().size() != 8) - return opentelemetry::context::Context{}; // Invalid, return empty + { + // Log malformed trace context for debugging. Silent failures in + // context extraction make distributed tracing issues hard to diagnose. + JLOG(j_.warn()) << "Malformed trace context: trace_id size=" + << proto.trace_id().size() + << " span_id size=" << proto.span_id().size(); + return opentelemetry::context::Context{}; + } // Construct TraceId and SpanId from bytes TraceId traceId(reinterpret_cast(proto.trace_id().data())); @@ -490,11 +590,15 @@ TraceContextPropagator::extract(protocol::TraceContext const& proto) // Create SpanContext from extracted data SpanContext spanContext(traceId, spanId, flags, /* remote = */ true); - // Create context with extracted span as parent - return opentelemetry::context::Context{}.SetValue( - opentelemetry::trace::kSpanKey, + // DefaultSpan wraps SpanContext for use as a non-recording parent. + // This is the standard OTel C++ pattern for remote context propagation. + // DefaultSpan carries the remote SpanContext without recording any data. 
+ auto parentCtx = opentelemetry::trace::SetSpan( + opentelemetry::context::Context{}, opentelemetry::nostd::shared_ptr( new DefaultSpan(spanContext))); + + return parentCtx; } inline void @@ -750,8 +854,8 @@ ServerHandler::onRequest( // Extract trace context from HTTP headers (W3C Trace Context) auto parentCtx = telemetry::TraceContextPropagator::extractFromHeaders( [&req](std::string_view name) -> std::optional { - auto it = req.find(boost::beast::http::field{ - std::string(name)}); + // Beast's find() accepts a string_view for custom header lookup + auto it = req.find(name); if (it != req.end()) return std::string(it->value()); return std::nullopt; @@ -977,6 +1081,14 @@ flowchart TB +**Reading the diagram:** + +- **Client / Submit TX**: An external client submits a transaction, creating the root span that initiates the trace. +- **Node A (RPC layer)**: The receiving node processes the submission through `rpc.request` and `rpc.command.submit`, then hands off to the transaction pipeline (`tx.receive` → `tx.validate` → `tx.relay`). +- **Dashed arrows (TraceContext)**: Cross-node boundaries where trace context is propagated via the protobuf protocol extension, linking spans across independent processes. +- **Node B (relay hop)**: A peer node that receives, validates, and relays the transaction further, demonstrating multi-hop propagation. +- **Node C (consensus)**: The final node where the transaction enters consensus (`consensus.round` → `consensus.phase.establish`), showing how a single client action produces an end-to-end distributed trace. 
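The HTTP entry point described above depends on the W3C `traceparent` header, which carries the same fixed-size identifiers that `extract()` checks (a 16-byte trace ID and an 8-byte span ID). The following is a minimal, standard-library-only sketch of parsing a version-00 header. `ParsedTraceParent`, `hexValue`, `decodeHex`, `allZero`, and `parseTraceParent` are hypothetical names introduced for illustration; they are not part of rippled or the OpenTelemetry SDK, and a real implementation would likely delegate to the SDK's own propagator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical holder for the decoded fields (illustration only).
struct ParsedTraceParent
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
    std::uint8_t flags = 0;
};

// Convert one lowercase hex digit to its value, or -1 if it is not valid.
inline int
hexValue(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

// Decode exactly 2*N hex characters into N bytes; false on any bad digit.
template <std::size_t N>
bool
decodeHex(std::string_view hex, std::array<std::uint8_t, N>& out)
{
    if (hex.size() != 2 * N)
        return false;
    for (std::size_t i = 0; i < N; ++i)
    {
        int const hi = hexValue(hex[2 * i]);
        int const lo = hexValue(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return false;
        out[i] = static_cast<std::uint8_t>((hi << 4) | lo);
    }
    return true;
}

// True when every byte is zero (all-zero IDs are invalid in W3C Trace Context).
template <std::size_t N>
bool
allZero(std::array<std::uint8_t, N> const& bytes)
{
    for (auto b : bytes)
        if (b != 0)
            return false;
    return true;
}

// Parse "00-<32 hex>-<16 hex>-<2 hex>" (55 characters for version 00).
inline std::optional<ParsedTraceParent>
parseTraceParent(std::string_view header)
{
    if (header.size() != 55 || header.substr(0, 3) != "00-" ||
        header[35] != '-' || header[52] != '-')
        return std::nullopt;

    ParsedTraceParent result;
    std::array<std::uint8_t, 1> flags{};
    if (!decodeHex(header.substr(3, 32), result.traceId) ||
        !decodeHex(header.substr(36, 16), result.spanId) ||
        !decodeHex(header.substr(53, 2), flags))
        return std::nullopt;

    if (allZero(result.traceId) || allZero(result.spanId))
        return std::nullopt;

    result.flags = flags[0];
    return result;
}
```

Returning `std::nullopt` for malformed input mirrors the design choice in `extract()`: a bad header yields an empty context and a new root trace rather than a failed request.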
+ --- _Previous: [Implementation Strategy](./03-implementation-strategy.md)_ | _Next: [Configuration Reference](./05-configuration-reference.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_ diff --git a/OpenTelemetryPlan/05-configuration-reference.md b/OpenTelemetryPlan/05-configuration-reference.md index b13cc839ab..11aceb7883 100644 --- a/OpenTelemetryPlan/05-configuration-reference.md +++ b/OpenTelemetryPlan/05-configuration-reference.md @@ -7,6 +7,8 @@ ## 5.1 rippled Configuration +> **OTLP** = OpenTelemetry Protocol | **TxQ** = Transaction Queue + ### 5.1.1 Configuration File Section Add to `cfg/xrpld-example.cfg`: @@ -38,6 +40,9 @@ Add to `cfg/xrpld-example.cfg`: # # # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling) # # Use lower values in production to reduce overhead +# # Default: 1.0 (all traces). For production deployments with high +# # throughput, 0.1 (10%) is recommended to reduce overhead. +# # See Section 7.4.2 for sampling strategy details. # sampling_ratio=0.1 # # # Batch processor settings @@ -51,6 +56,10 @@ Add to `cfg/xrpld-example.cfg`: # trace_rpc=1 # RPC request handling # trace_peer=0 # Peer messages (high volume, disabled by default) # trace_ledger=1 # Ledger acquisition and building +# trace_pathfind=1 # Path computation (can be expensive) +# trace_txq=1 # Transaction queue and fee escalation +# trace_validator=0 # Validator list and manifest updates (low volume) +# trace_amendment=0 # Amendment voting (very low volume) # # # Service identification (automatically detected if not specified) # # service_name=rippled @@ -78,6 +87,10 @@ enabled=0 | `trace_rpc` | bool | `true` | Enable RPC tracing | | `trace_peer` | bool | `false` | Enable peer message tracing (high volume) | | `trace_ledger` | bool | `true` | Enable ledger tracing | +| `trace_pathfind` | bool | `true` | Enable path computation tracing | +| `trace_txq` | bool | `true` | Enable transaction queue tracing | +| `trace_validator` | bool | `false` | Enable validator 
list/manifest tracing | +| `trace_amendment` | bool | `false` | Enable amendment voting tracing | | `service_name` | string | `"rippled"` | Service name for traces | | `service_instance_id` | string | `` | Instance identifier | @@ -85,6 +98,8 @@ enabled=0 ## 5.2 Configuration Parser +> **TxQ** = Transaction Queue + ```cpp // src/libxrpl/telemetry/TelemetryConfig.cpp @@ -140,6 +155,10 @@ setup_Telemetry( setup.traceRpc = section.value_or("trace_rpc", true); setup.tracePeer = section.value_or("trace_peer", false); setup.traceLedger = section.value_or("trace_ledger", true); + setup.tracePathfind = section.value_or("trace_pathfind", true); + setup.traceTxQ = section.value_or("trace_txq", true); + setup.traceValidator = section.value_or("trace_validator", false); + setup.traceAmendment = section.value_or("trace_amendment", false); return setup; } @@ -239,6 +258,8 @@ public: ## 5.4 CMake Integration +> **OTLP** = OpenTelemetry Protocol + ### 5.4.1 Find OpenTelemetry Module ```cmake @@ -354,6 +375,8 @@ endif() ## 5.5 OpenTelemetry Collector Configuration +> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring + ### 5.5.1 Development Configuration ```yaml @@ -380,9 +403,9 @@ exporters: sampling_initial: 5 sampling_thereafter: 200 - # Jaeger for trace visualization - jaeger: - endpoint: jaeger:14250 + # Tempo for trace visualization + otlp/tempo: + endpoint: tempo:4317 tls: insecure: true @@ -391,7 +414,7 @@ service: traces: receivers: [otlp] processors: [batch] - exporters: [logging, jaeger] + exporters: [logging, otlp/tempo] ``` ### 5.5.2 Production Configuration @@ -504,6 +527,8 @@ service: ## 5.6 Docker Compose Development Environment +> **OTLP** = OpenTelemetry Protocol + ```yaml # docker-compose-telemetry.yaml version: "3.8" @@ -521,17 +546,15 @@ services: - "4318:4318" # OTLP HTTP - "13133:13133" # Health check depends_on: - - jaeger + - tempo - # Jaeger for trace visualization - jaeger: - image: jaegertracing/all-in-one:1.53 - 
container_name: jaeger - environment: - - COLLECTOR_OTLP_ENABLED=true + # Tempo for trace visualization + tempo: + image: grafana/tempo:2.6.1 + container_name: tempo ports: - - "16686:16686" # UI - - "14250:14250" # gRPC + - "3200:3200" # Tempo HTTP API + - "4317" # OTLP gRPC (internal) # Grafana for dashboards grafana: @@ -546,7 +569,7 @@ services: ports: - "3000:3000" depends_on: - - jaeger + - tempo # Prometheus for metrics (optional, for correlation) prometheus: @@ -566,6 +589,8 @@ networks: ## 5.7 Configuration Architecture +> **OTLP** = OpenTelemetry Protocol + ```mermaid flowchart TB subgraph config["Configuration Sources"] @@ -605,10 +630,20 @@ flowchart TB style collector fill:#fff3e0,stroke:#ff9800 ``` +**Reading the diagram:** + +- **Configuration Sources**: `xrpld.cfg` provides runtime settings (endpoint, sampling) while the CMake flag controls whether telemetry is compiled in at all. +- **Initialization**: `setup_Telemetry()` parses config values, then `make_Telemetry()` constructs the provider, processor, and exporter objects. +- **Runtime Components**: The `TracerProvider` creates spans, the `BatchProcessor` buffers them, and the `OTLP Exporter` serializes and sends them over the wire. +- **OTLP arrow to Collector**: Trace data leaves the rippled process via OTLP (gRPC or HTTP) and enters the external Collector pipeline. +- **Collector Pipeline**: `Receivers` ingest OTLP data, `Processors` apply sampling/filtering/enrichment, and `Exporters` forward traces to storage backends (Tempo, etc.). + --- ## 5.8 Grafana Integration +> **APM** = Application Performance Monitoring + Step-by-step instructions for integrating rippled traces with Grafana. 
### 5.8.1 Data Source Configuration @@ -642,23 +677,6 @@ datasources: datasourceUid: loki ``` -#### Jaeger - -```yaml -# grafana/provisioning/datasources/jaeger.yaml -apiVersion: 1 - -datasources: - - name: Jaeger - type: jaeger - access: proxy - url: http://jaeger:16686 - jsonData: - tracesToLogs: - datasourceUid: loki - tags: ["service.name"] -``` - #### Elastic APM ```yaml diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 5fb9978f32..ccf1fd54d4 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -7,6 +7,8 @@ ## 6.1 Phase Overview +> **TxQ** = Transaction Queue + ```mermaid gantt title OpenTelemetry Implementation Timeline @@ -19,26 +21,36 @@ gantt Telemetry Interface :p1b, after p1a, 3d Configuration & CMake :p1c, after p1b, 3d Unit Tests :p1d, after p1c, 2d + Buffer & Integration :p1e, after p1d, 2d section Phase 2 RPC Tracing :p2, after p1, 2w HTTP Context Extraction :p2a, after p1, 2d RPC Handler Instrumentation :p2b, after p2a, 4d - WebSocket Support :p2c, after p2b, 2d + PathFinding Instrumentation :p2f, after p2b, 2d + TxQ Instrumentation :p2g, after p2f, 2d + WebSocket Support :p2c, after p2g, 2d Integration Tests :p2d, after p2c, 2d + Buffer & Review :p2e, after p2d, 4d section Phase 3 Transaction Tracing :p3, after p2, 2w Protocol Buffer Extension :p3a, after p2, 2d PeerImp Instrumentation :p3b, after p3a, 3d - Relay Context Propagation :p3c, after p3b, 3d + Fee Escalation Instrumentation :p3f, after p3b, 2d + Relay Context Propagation :p3c, after p3f, 3d Multi-node Tests :p3d, after p3c, 2d + Buffer & Review :p3e, after p3d, 4d section Phase 4 Consensus Tracing :p4, after p3, 2w Consensus Round Spans :p4a, after p3, 3d Proposal Handling :p4b, after p4a, 3d - Validation Tests :p4c, after p4b, 4d + Validator List & Manifest Tracing :p4f, after p4b, 2d + Amendment Voting Tracing :p4g, after p4f, 2d + SHAMap Sync Tracing :p4h, after 
p4g, 2d + Validation Tests :p4c, after p4h, 4d + Buffer & Review :p4e, after p4c, 4d section Phase 5 Documentation & Deploy :p5, after p4, 1w @@ -75,20 +87,24 @@ gantt ## 6.3 Phase 2: RPC Tracing (Weeks 3-4) +> **TxQ** = Transaction Queue + **Objective**: Complete tracing for all RPC operations ### Tasks -| Task | Description | -| ---- | -------------------------------------------------- | -| 2.1 | Implement W3C Trace Context HTTP header extraction | -| 2.2 | Instrument `ServerHandler::onRequest()` | -| 2.3 | Instrument `RPCHandler::doCommand()` | -| 2.4 | Add RPC-specific attributes | -| 2.5 | Instrument WebSocket handler | -| 2.6 | Integration tests for RPC tracing | -| 2.7 | Performance benchmarks | -| 2.8 | Documentation | +| Task | Description | +| ---- | -------------------------------------------------------------------------- | +| 2.1 | Implement W3C Trace Context HTTP header extraction | +| 2.2 | Instrument `ServerHandler::onRequest()` | +| 2.3 | Instrument `RPCHandler::doCommand()` | +| 2.4 | Add RPC-specific attributes | +| 2.5 | Instrument WebSocket handler | +| 2.6 | PathFinding instrumentation (`pathfind.request`, `pathfind.compute` spans) | +| 2.7 | TxQ instrumentation (`txq.enqueue`, `txq.apply` spans) | +| 2.8 | Integration tests for RPC tracing | +| 2.9 | Performance benchmarks | +| 2.10 | Documentation | ### Exit Criteria @@ -106,16 +122,17 @@ gantt ### Tasks -| Task | Description | -| ---- | --------------------------------------------- | -| 3.1 | Define `TraceContext` Protocol Buffer message | -| 3.2 | Implement protobuf context serialization | -| 3.3 | Instrument `PeerImp::handleTransaction()` | -| 3.4 | Instrument `NetworkOPs::submitTransaction()` | -| 3.5 | Instrument HashRouter integration | -| 3.6 | Implement relay context propagation | -| 3.7 | Integration tests (multi-node) | -| 3.8 | Performance benchmarks | +| Task | Description | +| ---- | ---------------------------------------------------- | +| 3.1 | Define `TraceContext` Protocol 
Buffer message | +| 3.2 | Implement protobuf context serialization | +| 3.3 | Instrument `PeerImp::handleTransaction()` | +| 3.4 | Instrument `NetworkOPs::submitTransaction()` | +| 3.5 | Instrument HashRouter integration | +| 3.6 | Fee escalation instrumentation (`fee.escalate` span) | +| 3.7 | Implement relay context propagation | +| 3.8 | Integration tests (multi-node) | +| 3.9 | Performance benchmarks | ### Exit Criteria @@ -141,8 +158,11 @@ gantt | 4.4 | Instrument validation handling | | 4.5 | Add consensus-specific attributes | | 4.6 | Correlate with transaction traces | -| 4.7 | Multi-validator integration tests | -| 4.8 | Performance validation | +| 4.7 | Validator list and manifest tracing | +| 4.8 | Amendment voting tracing | +| 4.9 | SHAMap sync tracing | +| 4.10 | Multi-validator integration tests | +| 4.11 | Performance validation | ### Exit Criteria @@ -159,6 +179,9 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds: - **Deterministic trace ID** derived from `previousLedger.id()` so all validators in the same round share the same `trace_id` (switchable via `consensus_trace_strategy` config: `"deterministic"` or `"attribute"`). + See [Configuration Reference](./05-configuration-reference.md) for full + configuration options. The `consensus_trace_strategy` option will be + documented in the configuration reference as part of Phase 4a implementation. - **Round lifecycle spans**: `consensus.round` with round-to-round span links. - **Establish phase**: `consensus.establish`, `consensus.update_positions` (with `dispute.resolve` events), `consensus.check` (with threshold tracking). 
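The deterministic strategy above could be sketched as follows. The plan only states that the trace ID is derived from `previousLedger.id()`; it does not say how. This sketch assumes the simplest possible derivation, truncating the 256-bit ledger hash to its first 16 bytes. `LedgerHash`, `TraceIdBytes`, and `deriveConsensusTraceId` are hypothetical names, and the all-zero fallback is likewise an assumption made only because an all-zero trace ID is invalid in W3C Trace Context.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Stand-in for the 256-bit ledger identifier (assumption, not rippled's type).
using LedgerHash = std::array<std::uint8_t, 32>;
// A trace ID is 16 bytes, matching the size check in extract().
using TraceIdBytes = std::array<std::uint8_t, 16>;

// Derive a trace ID from the previous ledger's hash so that every validator
// working on the same round computes the same value without coordination.
inline TraceIdBytes
deriveConsensusTraceId(LedgerHash const& previousLedgerId)
{
    TraceIdBytes id{};
    // Assumed derivation: keep the first 16 bytes of the hash.
    std::copy_n(previousLedgerId.begin(), id.size(), id.begin());

    // Assumed fallback: an all-zero trace ID is invalid, so make it non-zero.
    bool const isZero = std::all_of(
        id.begin(), id.end(), [](std::uint8_t b) { return b == 0; });
    if (isZero)
        id.back() = 1;
    return id;
}
```

Because the function is pure, two validators that agree on the previous ledger produce identical trace IDs, which is the property the cross-node correlation depends on.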
@@ -198,16 +221,16 @@ quadrantChart title Risk Assessment Matrix x-axis Low Impact --> High Impact y-axis Low Likelihood --> High Likelihood - quadrant-1 Monitor Closely - quadrant-2 Mitigate Immediately + quadrant-1 Mitigate Immediately + quadrant-2 Plan Mitigation quadrant-3 Accept Risk - quadrant-4 Plan Mitigation + quadrant-4 Monitor Closely - SDK Compatibility: [0.25, 0.2] - Protocol Changes: [0.75, 0.65] - Performance Overhead: [0.65, 0.45] - Context Propagation: [0.5, 0.5] - Memory Leaks: [0.8, 0.2] + SDK Compat: [0.2, 0.18] + Protocol Chg: [0.75, 0.72] + Perf Overhead: [0.58, 0.42] + Context Prop: [0.4, 0.55] + Memory Leaks: [0.85, 0.25] ``` ### Risk Details @@ -224,19 +247,21 @@ quadrantChart ## 6.8 Success Metrics -| Metric | Target | Measurement | -| ------------------------ | ------------------------------ | --------------------- | -| Trace coverage | >95% of transactions | Sampling verification | -| CPU overhead | <3% | Benchmark tests | -| Memory overhead | <5 MB | Memory profiling | -| Latency impact (p99) | <2% | Performance tests | -| Trace completeness | >99% spans with required attrs | Validation script | -| Cross-node trace linkage | >90% of multi-hop transactions | Integration tests | +| Metric | Target | Measurement | +| ------------------------ | -------------------------------------------------------------- | --------------------- | +| Trace coverage | >95% of transaction code paths (independent of sampling ratio) | Sampling verification | +| CPU overhead | <3% | Benchmark tests | +| Memory overhead | <10 MB | Memory profiling | +| Latency impact (p99) | <2% | Performance tests | +| Trace completeness | >99% spans with required attrs | Validation script | +| Cross-node trace linkage | >90% of multi-hop transactions | Integration tests | --- ## 6.9 Quick Wins and Crawl-Walk-Run Strategy +> **TxQ** = Transaction Queue + This section outlines a prioritized approach to maximize ROI with minimal initial investment. 
### 6.9.1 Crawl-Walk-Run Overview @@ -247,17 +272,17 @@ This section outlines a prioritized approach to maximize ROI with minimal initia flowchart TB subgraph crawl["🐢 CRAWL (Week 1-2)"] direction LR - c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[Single Node] + c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[PathFinding + TxQ Tracing] ~~~ c4[Single Node] end subgraph walk["🚶 WALK (Week 3-5)"] direction LR - w1[Transaction Tracing] ~~~ w2[Cross-Node Context] ~~~ w3[Basic Dashboards] + w1[Transaction Tracing] ~~~ w2[Fee Escalation Tracing] ~~~ w3[Cross-Node Context] ~~~ w4[Basic Dashboards] end subgraph run["🏃 RUN (Week 6-9)"] direction LR - r1[Consensus Tracing] ~~~ r2[Full Correlation] ~~~ r3[Production Deploy] + r1[Consensus Tracing] ~~~ r2[Validator, Amendment,
SHAMap Tracing] ~~~ r3[Full Correlation] ~~~ r4[Production Deploy] end crawl --> walk --> run @@ -268,16 +293,26 @@ flowchart TB style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff + style c4 fill:#1b5e20,stroke:#0d3d14,color:#fff style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style w4 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b style r1 fill:#0d47a1,stroke:#082f6a,color:#fff style r2 fill:#0d47a1,stroke:#082f6a,color:#fff style r3 fill:#0d47a1,stroke:#082f6a,color:#fff + style r4 fill:#0d47a1,stroke:#082f6a,color:#fff ``` +**Reading the diagram:** + +- **CRAWL (Weeks 1-2)**: Minimal investment -- set up the SDK, instrument RPC and PathFinding/TxQ handlers, and verify on a single node. Delivers immediate latency visibility. +- **WALK (Weeks 3-5)**: Expand to transaction lifecycle tracing, fee escalation, cross-node context propagation, and basic Grafana dashboards. This is where distributed tracing starts working. +- **RUN (Weeks 6-9)**: Full consensus instrumentation, validator/amendment/SHAMap tracing, end-to-end correlation, and production deployment with sampling and alerting. +- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier. 
+ ### 6.9.2 Quick Wins (Immediate Value) | Quick Win | Value | When to Deploy | @@ -296,6 +331,7 @@ flowchart TB - RPC request/response traces for all commands - Latency breakdown per RPC command +- PathFinding and TxQ tracing (directly impacts RPC latency) - Error visibility with stack traces - Basic Grafana dashboard @@ -304,6 +340,7 @@ flowchart TB **Why Start Here**: - RPC is the lowest-risk, highest-visibility component +- PathFinding and TxQ are RPC-adjacent and directly affect latency - Immediate value for debugging client issues - No cross-node complexity - Single file modification to existing code @@ -315,6 +352,7 @@ flowchart TB **What You Get**: - End-to-end transaction traces from submit to relay +- Fee escalation tracing within the transaction pipeline - Cross-node correlation (see transaction path) - HashRouter deduplication visibility - Relay latency metrics @@ -324,6 +362,7 @@ flowchart TB **Why Do This Second**: - Builds on RPC tracing (transactions submitted via RPC) +- Fee escalation is integral to the transaction processing pipeline - Moderate complexity (requires context propagation) - High value for debugging transaction issues @@ -336,13 +375,17 @@ flowchart TB - Complete consensus round visibility - Phase transition timing - Validator proposal tracking +- Validator list and manifest tracing +- Amendment voting tracing +- SHAMap sync tracing - Full end-to-end traces (client → RPC → TX → consensus → ledger) -**Code Changes**: ~100 lines across 3 consensus files +**Code Changes**: ~100 lines across 3 consensus files, plus validator/amendment/SHAMap modules **Why Do This Last**: - Highest complexity (consensus is critical path) +- Validator, amendment, and SHAMap components are lower priority - Requires thorough testing - Lower relative value (consensus issues are rarer) @@ -358,33 +401,35 @@ quadrantChart quadrant-3 Nice to Have - Optional quadrant-4 Time Sinks - Avoid - RPC Tracing: [0.15, 0.9] - TX Submit Trace: [0.25, 0.85] - TX Relay Trace: 
[0.5, 0.8] - Consensus Trace: [0.7, 0.75] - Peer Message Trace: [0.85, 0.3] - Ledger Acquire: [0.55, 0.5] + RPC Tracing: [0.15, 0.92] + TX Submit Trace: [0.3, 0.78] + TX Relay Trace: [0.5, 0.88] + Consensus Trace: [0.72, 0.72] + Peer Msg Trace: [0.85, 0.3] + Ledger Acquire: [0.55, 0.52] ``` --- -## 6.11 Definition of Done +## 6.10 Definition of Done + +> **TxQ** = Transaction Queue | **HA** = High Availability Clear, measurable criteria for each phase. -### 6.11.1 Phase 1: Core Infrastructure +### 6.10.1 Phase 1: Core Infrastructure | Criterion | Measurement | Target | | --------------- | ---------------------------------------------------------- | ---------------------------- | | SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles | | Runtime Toggle | `enabled=0` produces zero overhead | <0.1% CPU difference | -| Span Creation | Unit test creates and exports span | Span appears in Jaeger | +| Span Creation | Unit test creates and exports span | Span appears in Tempo | | Configuration | All config options parsed correctly | Config validation tests pass | | Documentation | Developer guide exists | PR approved | **Definition of Done**: All criteria met, PR merged, no regressions in CI. -### 6.11.2 Phase 2: RPC Tracing +### 6.10.2 Phase 2: RPC Tracing | Criterion | Measurement | Target | | ------------------ | ---------------------------------- | -------------------------- | @@ -394,9 +439,9 @@ Clear, measurable criteria for each phase. | Performance | RPC latency overhead | <1ms p99 | | Dashboard | Grafana dashboard deployed | Screenshot in docs | -**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution. +**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution. 
-### 6.11.3 Phase 3: Transaction Tracing +### 6.10.3 Phase 3: Transaction Tracing | Criterion | Measurement | Target | | ---------------- | ------------------------------- | ---------------------------------- | @@ -408,7 +453,7 @@ Clear, measurable criteria for each phase. **Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds. -### 6.11.4 Phase 4: Consensus Tracing +### 6.10.4 Phase 4: Consensus Tracing | Criterion | Measurement | Target | | -------------------- | ----------------------------- | ------------------------- | @@ -420,7 +465,7 @@ Clear, measurable criteria for each phase. **Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing. -### 6.11.5 Phase 5: Production Deployment +### 6.10.5 Phase 5: Production Deployment | Criterion | Measurement | Target | | ------------ | ---------------------------- | -------------------------- | @@ -433,7 +478,7 @@ Clear, measurable criteria for each phase. **Definition of Done**: Telemetry running in production, operators trained, alerts active. -### 6.11.6 Success Metrics Summary +### 6.10.6 Success Metrics Summary | Phase | Primary Metric | Secondary Metric | Deadline | | ------- | ---------------------- | --------------------------- | ------------- | @@ -458,7 +503,7 @@ flowchart TB subgraph week2["Week 2"] t3[3. RPC ServerHandler
instrumentation] - t4[4. Basic Jaeger setup
for testing] + t4[4. Basic Tempo setup
for testing] end subgraph week3["Week 3"] @@ -516,6 +561,15 @@ flowchart TB style t14 fill:#4a148c,stroke:#2e0d57,color:#fff ``` +**Reading the diagram:** + +- **Week 1 (tasks 1-2)**: Foundation work -- integrate the OpenTelemetry SDK via Conan/CMake and build the `Telemetry` interface with `SpanGuard` and config parsing. +- **Week 2 (tasks 3-4)**: First observable output -- instrument `ServerHandler` for RPC tracing and stand up Tempo so developers can see traces immediately. +- **Weeks 3-5 (tasks 5-10)**: Transaction lifecycle -- add submit tracing, build the first Grafana dashboard, extend protobuf for cross-node context, instrument `PeerImp` relay, then validate with multi-node integration tests and performance benchmarks. +- **Weeks 6-8 (tasks 11-12)**: Consensus deep-dive -- instrument consensus rounds and phases, then run full integration testing across all instrumented paths. +- **Week 9 (tasks 13-14)**: Go-live -- deploy to production with sampling/alerting configured, and deliver documentation and operator training. +- **Arrow chain (t1 → ... → t14)**: Strict sequential dependency; each task's output is a prerequisite for the next. 
+ --- _Previous: [Configuration Reference](./05-configuration-reference.md)_ | _Next: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_ diff --git a/OpenTelemetryPlan/07-observability-backends.md b/OpenTelemetryPlan/07-observability-backends.md index a90f41ae43..2877333a41 100644 --- a/OpenTelemetryPlan/07-observability-backends.md +++ b/OpenTelemetryPlan/07-observability-backends.md @@ -7,33 +7,36 @@ ## 7.1 Development/Testing Backends -| Backend | Pros | Cons | Use Case | -| ---------- | ------------------- | ----------------- | ----------------- | -| **Jaeger** | Easy setup, good UI | Limited retention | Local dev, CI | -| **Zipkin** | Simple, lightweight | Basic features | Quick prototyping | +> **OTLP** = OpenTelemetry Protocol -### Quick Start with Jaeger +| Backend | Pros | Cons | Use Case | +| ---------- | ----------------------------------- | ---------------------- | ------------------- | +| **Tempo** | Cost-effective, Grafana integration | Requires Grafana stack | Local dev, CI, Prod | +| **Zipkin** | Simple, lightweight | Basic features | Quick prototyping | + +### Quick Start with Tempo ```bash -# Start Jaeger with OTLP support -docker run -d --name jaeger \ - -e COLLECTOR_OTLP_ENABLED=true \ - -p 16686:16686 \ +# Start Tempo with OTLP support +docker run -d --name tempo \ + -p 3200:3200 \ -p 4317:4317 \ -p 4318:4318 \ - jaegertracing/all-in-one:latest + grafana/tempo:2.6.1 ``` --- ## 7.2 Production Backends -| Backend | Pros | Cons | Use Case | -| ----------------- | ----------------------------------------- | ------------------ | --------------------------- | -| **Grafana Tempo** | Cost-effective, Grafana integration | Newer project | Most production deployments | -| **Elastic APM** | Full observability stack, log correlation | Resource intensive | Existing Elastic users | -| **Honeycomb** | Excellent query, high cardinality | SaaS cost | Deep debugging needs | -| **Datadog APM** | Full platform, 
easy setup | SaaS cost | Enterprise with budget | +> **APM** = Application Performance Monitoring + +| Backend | Pros | Cons | Use Case | +| ----------------- | ----------------------------------------- | ---------------------- | --------------------------- | +| **Grafana Tempo** | Cost-effective, Grafana integration | Requires Grafana stack | Most production deployments | +| **Elastic APM** | Full observability stack, log correlation | Resource intensive | Existing Elastic users | +| **Honeycomb** | Excellent query, high cardinality | SaaS cost | Deep debugging needs | +| **Datadog APM** | Full platform, easy setup | SaaS cost | Enterprise with budget | ### Backend Selection Flowchart @@ -73,10 +76,19 @@ flowchart TD style datadog fill:#4a148c,stroke:#2e0d57,color:#fff ``` +**Reading the diagram:** + +- **Budget Constraints? (Yes)**: Leads to open-source options. If you already run Grafana or Elastic, pick the matching backend; otherwise default to Grafana Tempo. +- **Budget Constraints? (No) → Prefer SaaS?**: If you want a managed service, choose between Datadog (enterprise support) and Honeycomb (developer-focused). If not, fall back to open-source. +- **Terminal nodes (Tempo / Elastic / Honeycomb / Datadog)**: Each represents a concrete backend choice, all of which feed into the same final step. +- **Configure Collector**: Regardless of backend, you always finish by configuring the OTel Collector to export to your chosen destination. 
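
The selection flow above can also be read as a small decision function. The following is a minimal sketch only: the `Requirements` type, its field names, and `choose_backend` are hypothetical names invented for illustration and are not part of rippled or any backend's API. It assumes that "no budget constraint and no SaaS preference" falls back to the open-source branch, as the bullets describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirements:
    """Hypothetical inputs mirroring the questions in the flowchart."""
    budget_constrained: bool
    existing_stack: Optional[str] = None   # "grafana", "elastic", or None
    prefer_saas: bool = False
    needs_enterprise_support: bool = False


def choose_open_source(existing_stack: Optional[str]) -> str:
    # Match an existing Elastic deployment; otherwise default to Tempo.
    if existing_stack == "elastic":
        return "Elastic APM"
    return "Grafana Tempo"


def choose_backend(req: Requirements) -> str:
    # Budget constraints, or no wish for a managed service, lead to open source.
    if req.budget_constrained or not req.prefer_saas:
        return choose_open_source(req.existing_stack)
    # Managed service: enterprise support vs. developer-focused tooling.
    return "Datadog APM" if req.needs_enterprise_support else "Honeycomb"
```

Whichever name is returned, the final step is unchanged: configure the OTel Collector to export to that destination.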
+ --- ## 7.3 Recommended Production Architecture +> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring | **HA** = High Availability + ```mermaid flowchart TB subgraph validators["Validator Nodes"] @@ -117,6 +129,8 @@ flowchart TB tempo --> grafana elastic --> grafana + %% Note: simplified single-collector-per-DC topology shown for clarity + style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff style collector fill:#bf360c,stroke:#8c2809,color:#ffffff @@ -124,6 +138,16 @@ flowchart TB style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff ``` +**Reading the diagram:** + +- **Validator / Stock Nodes**: All rippled nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones. +- **Collector Cluster (DC1, DC2)**: Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends. +- **Storage Backends**: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis. +- **Grafana Dashboards**: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces. +- **Data flow direction**: Nodes → Collectors → Storage → Grafana. Each arrow represents a network hop; minimizing collector-to-backend hops reduces latency. + +> **Note**: Production deployments should use multiple collector instances behind a load balancer for high availability. The diagram shows a simplified single-collector topology for clarity. + --- ## 7.4 Architecture Considerations @@ -147,7 +171,7 @@ flowchart TB ```mermaid flowchart LR subgraph head["Head Sampling (Node)"] - hs[10% probabilistic] + hs[Node-level head sampling
configurable, default: 100%
recommended production: 10%] end subgraph tail["Tail Sampling (Collector)"] @@ -171,6 +195,13 @@ flowchart LR style final fill:#bf360c,stroke:#8c2809,color:#fff ``` +**Reading the diagram:** + +- **Head Sampling (Node)**: The first filter -- each rippled node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node. +- **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder. +- **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data. +- **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability. + ### 7.4.3 Data Retention | Environment | Hot Storage | Warm Storage | Cold Archive | @@ -355,6 +386,9 @@ groups: model: queryType: traceql query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s' + # Note: Verify TraceQL aggregate queries are supported by your + # Tempo version. Aggregate alerting (e.g., avg(duration)) requires + # Tempo 2.3+ with TraceQL metrics enabled. for: 5m annotations: summary: Consensus rounds taking >5 seconds @@ -371,6 +405,9 @@ groups: model: queryType: traceql query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05' + # Note: Verify TraceQL aggregate queries are supported by your + # Tempo version. Aggregate alerting (e.g., rate()) requires + # Tempo 2.3+ with TraceQL metrics enabled. for: 2m annotations: summary: RPC error rate >5% @@ -397,6 +434,8 @@ groups: ## 7.7 PerfLog and Insight Correlation +> **OTLP** = OpenTelemetry Protocol + How to correlate OpenTelemetry traces with existing rippled observability. 
### 7.7.1 Correlation Architecture @@ -459,6 +498,13 @@ flowchart TB style corr fill:#4a148c,stroke:#2e0d57,color:#fff ``` +**Reading the diagram:** + +- **rippled Node (three sources)**: A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics. +- **Data Collection layer**: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently. +- **Storage layer (Tempo, Loki, Prometheus)**: Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation). +- **Grafana Correlation Panel**: The key integration point -- Grafana queries all three stores and links them via shared fields (`trace_id`, `xrpl.tx.hash`, `ledger_seq`), enabling a single-pane debugging experience. + ### 7.7.2 Correlation Fields | Source | Field | Link To | Purpose | diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md index 6e0001d2b4..2e3d2f5d72 100644 --- a/OpenTelemetryPlan/08-appendix.md +++ b/OpenTelemetryPlan/08-appendix.md @@ -7,6 +7,8 @@ ## 8.1 Glossary +> **OTLP** = OpenTelemetry Protocol | **TxQ** = Transaction Queue + | Term | Definition | | --------------------- | ---------------------------------------------------------- | | **Span** | A unit of work with start/end time, name, and attributes | @@ -26,25 +28,31 @@ ### rippled-Specific Terms -| Term | Definition | -| ----------------- | -------------------------------------------------- | -| **Overlay** | P2P network layer managing peer connections | -| **Consensus** | XRP Ledger consensus algorithm (RCL) | -| **Proposal** | Validator's suggested transaction set for a ledger | -| **Validation** | Validator's signature on a closed ledger | -| **HashRouter** | Component for transaction deduplication | -| **JobQueue** | Thread pool for asynchronous task execution | -| **PerfLog** | Existing 
performance logging system in rippled | -| **Beast Insight** | Existing metrics framework in rippled | +| Term | Definition | +| ----------------- | ------------------------------------------------------------- | +| **Overlay** | P2P network layer managing peer connections | +| **Consensus** | XRP Ledger consensus algorithm (RCL) | +| **Proposal** | Validator's suggested transaction set for a ledger | +| **Validation** | Validator's signature on a closed ledger | +| **HashRouter** | Component for transaction deduplication | +| **JobQueue** | Thread pool for asynchronous task execution | +| **PerfLog** | Existing performance logging system in rippled | +| **Beast Insight** | Existing metrics framework in rippled | +| **PathFinding** | Payment path computation engine for cross-currency payments | +| **TxQ** | Transaction queue managing fee-based prioritization | +| **LoadManager** | Dynamic fee escalation based on network load | +| **SHAMap** | SHA-256 hash-based map (Merkle trie variant) for ledger state | --- ## 8.2 Span Hierarchy Visualization +> **TxQ** = Transaction Queue + ```mermaid flowchart TB subgraph trace["Trace: Transaction Lifecycle"] - rpc["rpc.submit
(entry point)"] + rpc["rpc.request
(entry point)"] validate["tx.validate"] relay["tx.relay
(parent span)"] @@ -54,20 +62,45 @@ flowchart TB p3["peer.send
Peer C"] end + subgraph pathfinding["PathFinding Spans"] + pathfind["pathfind.request"] + pathcomp["pathfind.compute"] + end + consensus["consensus.round"] apply["tx.apply"] + + subgraph txqueue["TxQ Spans"] + txq["txq.enqueue"] + txqApply["txq.apply"] + end + + feeCalc["fee.escalate"] + end + + subgraph validators["Validator Spans"] + valFetch["validator.list.fetch"] + valManifest["validator.manifest"] end rpc --> validate + rpc --> pathfind + pathfind --> pathcomp validate --> relay relay --> p1 relay --> p2 relay --> p3 p1 -.->|"context propagation"| consensus consensus --> apply + apply --> txq + txq --> txqApply + txq --> feeCalc style trace fill:#0f172a,stroke:#020617,color:#fff style peers fill:#1e3a8a,stroke:#172554,color:#fff + style pathfinding fill:#134e4a,stroke:#0f766e,color:#fff + style txqueue fill:#064e3b,stroke:#047857,color:#fff + style validators fill:#4c1d95,stroke:#6d28d9,color:#fff style rpc fill:#1d4ed8,stroke:#1e40af,color:#fff style validate fill:#047857,stroke:#064e3b,color:#fff style relay fill:#047857,stroke:#064e3b,color:#fff @@ -76,12 +109,30 @@ flowchart TB style p3 fill:#0e7490,stroke:#155e75,color:#fff style consensus fill:#fef3c7,stroke:#fde68a,color:#1e293b style apply fill:#047857,stroke:#064e3b,color:#fff + style pathfind fill:#0e7490,stroke:#155e75,color:#fff + style pathcomp fill:#0e7490,stroke:#155e75,color:#fff + style txq fill:#047857,stroke:#064e3b,color:#fff + style txqApply fill:#047857,stroke:#064e3b,color:#fff + style feeCalc fill:#047857,stroke:#064e3b,color:#fff + style valFetch fill:#6d28d9,stroke:#4c1d95,color:#fff + style valManifest fill:#6d28d9,stroke:#4c1d95,color:#fff ``` +**Reading the diagram:** + +- **rpc.request (blue, top)**: The entry point — every traced transaction starts as an RPC call; this root span is the parent of all downstream work. 
+- **tx.validate and pathfind.request (green/teal, first fork)**: The RPC request fans out into transaction validation and, for cross-currency payments, a PathFinding branch (`pathfind.request` -> `pathfind.compute`). +- **tx.relay -> Peer Spans (teal, middle)**: After validation, the transaction is relayed to peers A, B, and C in parallel; each `peer.send` is a sibling child span showing fan-out across the network. +- **context propagation (dashed arrow)**: The dotted line from `peer.send Peer A` to `consensus.round` represents the trace context crossing a node boundary — the receiving validator picks up the same `trace_id` and continues the trace. +- **consensus.round -> tx.apply -> TxQ Spans (green, lower)**: Once consensus accepts the transaction, it is applied to the ledger; the TxQ spans (`txq.enqueue`, `txq.apply`, `fee.escalate`) capture queue depth and fee escalation behavior. +- **Validator Spans (purple, detached)**: `validator.list.fetch` and `validator.manifest` are independent workflows for UNL management — they run on their own traces and are linked to consensus via Span Links, not parent-child relationships. + --- ## 8.3 References +> **OTLP** = OpenTelemetry Protocol + ### OpenTelemetry Resources 1. 
[OpenTelemetry C++ SDK](https://github.com/open-telemetry/opentelemetry-cpp) @@ -107,10 +158,11 @@ flowchart TB ## 8.4 Version History -| Version | Date | Author | Changes | -| ------- | ---------- | ------ | --------------------------------- | -| 1.0 | 2026-02-12 | - | Initial implementation plan | -| 1.1 | 2026-02-13 | - | Refactored into modular documents | +| Version | Date | Author | Changes | +| ------- | ---------- | ------ | -------------------------------------------------------------- | +| 1.0 | 2026-02-12 | - | Initial implementation plan | +| 1.1 | 2026-02-13 | - | Refactored into modular documents | +| 1.2 | 2026-03-24 | - | Review fixes: accuracy corrections, cross-document consistency | --- @@ -133,9 +185,10 @@ flowchart TB ### Task Lists -| Document | Description | -| ------------------------------------ | -------------------------------------- | -| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration | +| Document | Description | +| ------------------------------------ | --------------------------------------------------- | +| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration | +| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview | --- diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md index 96a1b697de..fb9f037c00 100644 --- a/OpenTelemetryPlan/OpenTelemetryPlan.md +++ b/OpenTelemetryPlan/OpenTelemetryPlan.md @@ -2,6 +2,8 @@ ## Executive Summary +> **OTLP** = OpenTelemetry Protocol + This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes. 
### Key Benefits @@ -33,6 +35,10 @@ This implementation plan is organized into modular documents for easier navigati flowchart TB overview["📋 OpenTelemetryPlan.md
(This Document)"] + subgraph fundamentals["Fundamentals"] + fund["00-tracing-fundamentals.md"] + end + subgraph analysis["Analysis & Design"] arch["01-architecture-analysis.md"] design["02-design-decisions.md"] @@ -48,12 +54,15 @@ flowchart TB phases["06-implementation-phases.md"] backends["07-observability-backends.md"] appendix["08-appendix.md"] + poc["POC_taskList.md"] end + overview --> fundamentals overview --> analysis overview --> impl overview --> deploy + fund --> arch arch --> design design --> strategy strategy --> code @@ -61,8 +70,11 @@ flowchart TB config --> phases phases --> backends backends --> appendix + phases --> poc style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px + style fundamentals fill:#00695c,stroke:#004d40,color:#fff + style fund fill:#00695c,stroke:#004d40,color:#fff style analysis fill:#0d47a1,stroke:#082f6a,color:#fff style impl fill:#bf360c,stroke:#8c2809,color:#fff style deploy fill:#4a148c,stroke:#2e0d57,color:#fff @@ -74,6 +86,7 @@ flowchart TB style phases fill:#4a148c,stroke:#2e0d57,color:#fff style backends fill:#4a148c,stroke:#2e0d57,color:#fff style appendix fill:#4a148c,stroke:#2e0d57,color:#fff + style poc fill:#4a148c,stroke:#2e0d57,color:#fff ``` @@ -84,22 +97,34 @@ flowchart TB | Section | Document | Description | | ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- | +| **0** | [Tracing Fundamentals](./00-tracing-fundamentals.md) | Distributed tracing concepts, span relationships, context propagation | | **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities | | **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation | | **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization | -| **4** | 
[Code Samples](./04-code-samples.md) | Complete C++ implementation examples for all components | +| **4** | [Code Samples](./04-code-samples.md) | C++ implementation examples for core infrastructure and key modules | | **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations | | **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics | | **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture | | **8** | [Appendix](./08-appendix.md) | Glossary, references, version history | +| **POC** | [POC Task List](./POC_taskList.md) | Proof of concept tasks for RPC tracing end-to-end demo | + +--- + +## 0. Tracing Fundamentals + +This document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to rippled-specific scenarios like transaction relay and consensus. + +➡️ **[Read Tracing Fundamentals](./00-tracing-fundamentals.md)** --- ## 1. Architecture Analysis -The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging). +> **WS** = WebSocket | **TxQ** = Transaction Queue -Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, and ledger building. 
The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts. +The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging). + +Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, ledger building, path computation, transaction queue behavior, fee escalation, and validator health. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts. ➡️ **[Read full Architecture Analysis](./01-architecture-analysis.md)** @@ -107,11 +132,13 @@ Key trace points span across transaction submission via RPC, peer-to-peer messag ## 2. Design Decisions +> **OTLP** = OpenTelemetry Protocol | **CNCF** = Cloud Native Computing Foundation + The OpenTelemetry C++ SDK is selected for its CNCF backing, active development, and native performance characteristics. Traces are exported via OTLP/gRPC (primary) or OTLP/HTTP (fallback) to an OpenTelemetry Collector, which provides flexible routing and sampling. Span naming follows a hierarchical `.` convention (e.g., `rpc.submit`, `tx.relay`, `consensus.round`). Context propagation uses W3C Trace Context headers for HTTP and embedded Protocol Buffer fields for P2P messages. The implementation coexists with existing PerfLog and Insight observability systems through correlation IDs. 
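
To make the HTTP side of context propagation concrete, here is a minimal sketch of building and parsing a W3C Trace Context `traceparent` header. It is illustrative Python rather than the planned C++ implementation, the names `SpanContext`, `format_traceparent`, `parse_traceparent`, and `child_context` are hypothetical, and it accepts only the version `00` layout.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str   # shared by every span in the trace
    span_id: str    # identifies the calling (parent) span
    sampled: bool   # bit 0x01 of the trace-flags field


def format_traceparent(ctx: SpanContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are invalid per the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: SpanContext) -> SpanContext:
    # A child keeps the trace_id and sampling decision, with a fresh span_id.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A receiver that gets `None` from the parser would typically start a new trace instead of rejecting the request.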
-**Data Collection & Privacy**: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control(not penned down in this document yet) over what data is exported. +**Data Collection & Privacy**: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control over telemetry configuration. ➡️ **[Read full Design Decisions](./02-design-decisions.md)** @@ -129,13 +156,14 @@ Performance optimization strategies include probabilistic head sampling (10% def ## 4. Code Samples -Complete C++ implementation examples are provided for all telemetry components: +C++ implementation examples are provided for the core telemetry infrastructure and key modules: - `Telemetry.h` - Core interface for tracer access and span creation - `SpanGuard.h` - RAII wrapper for automatic span lifecycle management - `TracingInstrumentation.h` - Macros for conditional instrumentation - Protocol Buffer extensions for trace context propagation - Module-specific instrumentation (RPC, Consensus, P2P, JobQueue) +- Remaining modules (PathFinding, TxQ, Validator, etc.) follow the same patterns ➡️ **[View all Code Samples](./04-code-samples.md)** @@ -143,9 +171,11 @@ Complete C++ implementation examples are provided for all telemetry components: ## 5. Configuration Reference +> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring + Configuration is handled through the `[telemetry]` section in `xrpld.cfg` with options for enabling/disabling, exporter selection, endpoint configuration, sampling ratios, and component-level filtering. 
CMake integration includes a `XRPL_ENABLE_TELEMETRY` option for compile-time control. -OpenTelemetry Collector configurations are provided for development (with Jaeger) and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup. +OpenTelemetry Collector configurations are provided for development and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup. ➡️ **[View full Configuration Reference](./05-configuration-reference.md)** @@ -163,7 +193,7 @@ The implementation spans 9 weeks across 5 phases: | 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing | | 5 | Week 9 | Documentation | Runbook, Dashboards, Training | -**Total Effort**: 47 developer-days with 2 developers +**Total Effort**: 47 person-days (2 developers working in parallel) ➡️ **[View full Implementation Phases](./06-implementation-phases.md)** @@ -171,7 +201,9 @@ The implementation spans 9 weeks across 5 phases: ## 7. Observability Backends -For development and testing, Jaeger provides easy setup with a good UI. For production deployments, Grafana Tempo is recommended for its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure. +> **APM** = Application Performance Monitoring | **GCS** = Google Cloud Storage + +Grafana Tempo is recommended for all environments due to its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure. The recommended production architecture uses a gateway collector pattern with regional collectors performing tail-based sampling, routing traces to multiple backends (Tempo for primary storage, Elastic for log correlation, S3/GCS for long-term archive). 
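
The two sampling stages referenced in this plan (probabilistic head sampling on the node, then tail-based sampling in the collector that keeps all errors, anything slower than 5 seconds, and 10% of the remainder) can be sketched as follows. This is an illustration under stated assumptions, not the OpenTelemetry SDK's or Collector's actual algorithm: `CompletedTrace`, `head_sample`, and `tail_sample` are hypothetical names, and the trace-ID bucketing is a simplification chosen so that every node reaches the same head decision for a given trace.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompletedTrace:
    trace_id: str       # 32 lowercase hex characters
    duration_s: float   # end-to-end duration in seconds
    has_error: bool     # True if any span ended with an error status


def head_sample(trace_id: str, ratio: float = 0.10) -> bool:
    # Deterministic per trace: map the low 64 bits of the ID onto [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    return bucket < int(ratio * 2**64)


def tail_sample(
    trace: CompletedTrace,
    rand: Callable[[], float] = random.random,
    slow_threshold_s: float = 5.0,
    baseline_ratio: float = 0.10,
) -> Optional[str]:
    """Return the reason the trace is kept, or None if it is dropped."""
    if trace.has_error:
        return "error"
    if trace.duration_s > slow_threshold_s:
        return "slow"
    # Independent random draw, so the baseline is not correlated with head sampling.
    if rand() < baseline_ratio:
        return "baseline"
    return None
```

The tail stage uses its own random draw rather than reusing the trace-ID bucket; otherwise every trace that passed a 10% head filter would also pass the 10% baseline rule.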
@@ -187,4 +219,12 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re --- +## POC Task List + +A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana. + +➡️ **[View POC Task List](./POC_taskList.md)** + +--- + _This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._ diff --git a/OpenTelemetryPlan/POC_taskList.md b/OpenTelemetryPlan/POC_taskList.md index 8d3a24279e..e2a7958094 100644 --- a/OpenTelemetryPlan/POC_taskList.md +++ b/OpenTelemetryPlan/POC_taskList.md @@ -1,6 +1,6 @@ # OpenTelemetry POC Task List -> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Jaeger, viewable in a browser UI. +> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Tempo, viewable in Grafana. > > **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC. 
@@ -15,28 +15,29 @@ | [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard (§4.2), macros (§4.3), RPC instrumentation (§4.5.3) | | [05-configuration-reference.md](./05-configuration-reference.md) | rippled config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) | | [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) | -| [07-observability-backends.md](./07-observability-backends.md) | Jaeger dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) | +| [07-observability-backends.md](./07-observability-backends.md) | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) | --- ## Task 0: Docker Observability Stack Setup +> **OTLP** = OpenTelemetry Protocol + **Objective**: Stand up the backend infrastructure to receive, store, and display traces. **What to do**: - Create `docker/telemetry/docker-compose.yml` in the repo with three services: - 1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:latest`) + 1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:0.92.0`) - Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP) - Expose port `13133` (health check) - Mount a config file `docker/telemetry/otel-collector-config.yaml` - 2. **Jaeger** (`jaegertracing/all-in-one:latest`) - - Expose port `16686` (UI) and `14250` (gRPC collector) - - Set env `COLLECTOR_OTLP_ENABLED=true` + 2. **Tempo** (`grafana/tempo:2.6.1`) + - Expose port `3200` (HTTP API) and `4317` (OTLP gRPC, internal) 3. 
**Grafana** (`grafana/grafana:latest`) — optional but useful - Expose port `3000` - Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`) - - Provision Jaeger as a data source via `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml` + - Provision Tempo as a data source via `docker/telemetry/grafana/provisioning/datasources/tempo.yaml` - Create `docker/telemetry/otel-collector-config.yaml`: @@ -57,8 +58,8 @@ exporters: logging: verbosity: detailed - otlp/jaeger: - endpoint: jaeger:4317 + otlp/tempo: + endpoint: tempo:4317 tls: insecure: true @@ -67,30 +68,29 @@ traces: receivers: [otlp] processors: [batch] - exporters: [logging, otlp/jaeger] + exporters: [logging, otlp/tempo] ``` -- Create Grafana Jaeger datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`: +- Create Grafana Tempo datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`: ```yaml apiVersion: 1 datasources: - - name: Jaeger - type: jaeger + - name: Tempo + type: tempo access: proxy - url: http://jaeger:16686 + url: http://tempo:3200 ``` **Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then: - `curl http://localhost:13133` returns healthy (Collector) -- `http://localhost:16686` opens Jaeger UI (no traces yet) -- `http://localhost:3000` opens Grafana (optional) +- `http://localhost:3000` opens Grafana (Tempo datasource available, no traces yet) **Reference**: -- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Jaeger exporter) +- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Tempo exporter) - [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment -- [07-observability-backends.md §7.1](./07-observability-backends.md) — Jaeger quick start and backend 
selection +- [07-observability-backends.md §7.1](./07-observability-backends.md) — Tempo quick start and backend selection - [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards --- @@ -175,6 +175,8 @@ ## Task 3: Implement OTel-Backed Telemetry +> **OTLP** = OpenTelemetry Protocol + **Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans. **What to do**: @@ -183,7 +185,7 @@ - `class TelemetryImpl : public Telemetry` that: - In `start()`: creates a `TracerProvider` with: - Resource attributes: `service.name`, `service.version`, `service.instance.id` - - An `OtlpGrpcExporter` pointed at `setup.exporterEndpoint` (default `localhost:4317`) + - An `OtlpHttpExporter` pointed at `setup.exporterEndpoint` (default `localhost:4318`) - A `BatchSpanProcessor` with configurable batch size and delay - A `TraceIdRatioBasedSampler` using `setup.samplingRatio` - Sets the global `TracerProvider` @@ -316,6 +318,8 @@ ## Task 6: Instrument RPC ServerHandler +> **WS** = WebSocket + **Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span. **What to do**: @@ -338,7 +342,7 @@ rpc.request └── rpc.process ``` - in Jaeger for every HTTP RPC call. + in Tempo/Grafana for every HTTP RPC call. **Key modified file**: @@ -372,7 +376,7 @@ - On success: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");` - On error: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");` and set the error message -- After this, traces in Jaeger should look like: +- After this, traces in Tempo/Grafana should look like: ``` rpc.request (xrpl.rpc.command=account_info) └── rpc.process @@ -396,7 +400,9 @@ ## Task 8: Build, Run, and Verify End-to-End -**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Jaeger displays them. 
+> **OTLP** = OpenTelemetry Protocol + +**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization. **What to do**: @@ -453,10 +459,10 @@ -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}' ``` -6. **Verify in Jaeger**: - - Open `http://localhost:16686` - - Select service `rippled` from the dropdown - - Click "Find Traces" +6. **Verify in Grafana (Tempo)**: + - Open `http://localhost:3000` + - Navigate to Explore → select Tempo datasource + - Search for service `rippled` - Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info` - Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version` @@ -470,7 +476,7 @@ - [ ] Docker stack starts without errors - [ ] rippled builds with `-DXRPL_ENABLE_TELEMETRY=ON` - [ ] rippled starts and connects to OTel Collector (check rippled logs for telemetry messages) -- [ ] Traces appear in Jaeger UI under service "rippled" +- [ ] Traces appear in Grafana/Tempo under service "rippled" - [ ] Span hierarchy is correct (parent-child relationships) - [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.) 
- [ ] Error spans show error status and message @@ -479,8 +485,8 @@ **Reference**: -- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Jaeger, config validation passes -- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed +- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Tempo, config validation passes +- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md#6112-phase-2-rpc-tracing) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed - [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2% - [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary - [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect @@ -489,11 +495,13 @@ ## Task 9: Document POC Results and Next Steps +> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket + **Objective**: Capture findings, screenshots, and remaining work for the team. **What to do**: -- Take screenshots of Jaeger showing: +- Take screenshots of Grafana/Tempo showing: - The service list with "rippled" - A trace with the full span tree - Span detail view showing attributes @@ -541,9 +549,11 @@ ## Next Steps (Post-POC) +> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket + ### Metrics Pipeline for Grafana Dashboards -The current POC exports **traces only**. 
Grafana's Explore view can query Jaeger for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this: +The current POC exports **traces only**. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this: 1. **Add a `spanmetrics` connector** to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically: @@ -566,7 +576,7 @@ The current POC exports **traces only**. Grafana's Explore view can query Jaeger traces: receivers: [otlp] processors: [batch] - exporters: [debug, otlp/jaeger, spanmetrics] + exporters: [debug, otlp/tempo, spanmetrics] metrics: receivers: [spanmetrics] exporters: [prometheus] diff --git a/OpenTelemetryPlan/presentation.md b/OpenTelemetryPlan/presentation.md index 7a443a635c..7d8a3fa40a 100644 --- a/OpenTelemetryPlan/presentation.md +++ b/OpenTelemetryPlan/presentation.md @@ -4,6 +4,8 @@ ## Slide 1: Introduction +> **CNCF** = Cloud Native Computing Foundation + ### What is OpenTelemetry? OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs. @@ -25,12 +27,21 @@ flowchart LR style D fill:#e65100,stroke:#bf360c,color:#fff ``` +**Reading the diagram:** + +- **Node A (blue, leftmost)**: The originating node that first receives the transaction and assigns a new `trace_id: abc123`; this ID becomes the correlation key for the entire distributed trace. +- **Node B and Node C (green, middle)**: Relay and validation nodes — each creates its own span but carries the same `trace_id`, so their work is linked to the original submission without any central coordinator. +- **Node D (orange, rightmost)**: The final node that applies the transaction to the ledger; the trace now spans the full lifecycle from submission to ledger inclusion. 
+- **Left-to-right flow**: The horizontal progression shows the real-world message path — a transaction hops from node to node, and the shared `trace_id` stitches all hops into a single queryable trace. + > **Trace ID: abc123** — All nodes share the same trace, enabling cross-node correlation. --- ## Slide 2: OpenTelemetry vs Open Source Alternatives +> **CNCF** = Cloud Native Computing Foundation + | Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus | | ------------------- | ---------------- | ---------------- | ------------------ | ---------- | ---------- | ---------- | | **Tracing** | YES | YES | YES | YES | YES | NO | @@ -42,11 +53,131 @@ flowchart LR | **Backend** | Any (exporters) | Self | Self | Self | Self | Self | | **CNCF Status** | Incubating | Graduated | NO | Incubating | NO | Graduated | -> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Jaeger, Prometheus, Grafana, or any commercial backend without changing instrumentation. +> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Tempo, Prometheus, Grafana, or any commercial backend without changing instrumentation. --- -## Slide 3: Comparison with rippled's Existing Solutions +## Slide 3: Adoption Scope — Traces Only (Current Plan) + +OpenTelemetry supports three signal types: **Traces**, **Metrics**, and **Logs**. rippled already captures metrics (StatsD via Beast Insight) and logs (Journal/PerfLog). The question is: how much of OTel do we adopt? + +> **Scenario A**: Add distributed tracing. Keep StatsD for metrics and Journal for logs. + +```mermaid +flowchart LR + subgraph rippled["rippled Process"] + direction TB + OTel["OTel SDK
(Traces)"] + Insight["Beast Insight
(StatsD Metrics)"] + Journal["Journal + PerfLog
(Logging)"] + end + + OTel -->|"OTLP"| Collector["OTel Collector"] + Insight -->|"UDP"| StatsD["StatsD Server"] + Journal -->|"File I/O"| LogFile["perf.log / debug.log"] + + Collector --> Tempo["Tempo / Jaeger"] + StatsD --> Graphite["Graphite / Grafana"] + LogFile --> Loki["Loki (optional)"] + + style rippled fill:#424242,stroke:#212121,color:#fff + style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff + style Insight fill:#1565c0,stroke:#0d47a1,color:#fff + style Journal fill:#e65100,stroke:#bf360c,color:#fff + style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff +``` + +| Aspect | Details | +| ------------------------------ | --------------------------------------------------------------------------------------------------------------- | +| **What changes for operators** | Deploy OTel Collector + trace backend. Existing StatsD and log pipelines stay as-is. | +| **Codebase impact** | New `Telemetry` module (~1500 LOC). Beast Insight and Journal untouched. | +| **New capabilities** | Cross-node trace correlation, span-based debugging, request lifecycle visibility. | +| **What we still can't do** | Correlate metrics with specific traces natively. StatsD metrics remain fire-and-forget with no trace exemplars. | +| **Maintenance burden** | Three separate observability systems to maintain (OTel + StatsD + Journal). | +| **Risk** | Lowest — additive change, no existing systems disturbed. | + +--- + +## Slide 4: Future Adoption — Metrics & Logs via OTel + +### Scenario B: + OTel Metrics (Replace StatsD) + +> Migrate StatsD to OTel Metrics API, exposing Prometheus-compatible metrics. Remove Beast Insight. + +```mermaid +flowchart LR + subgraph rippled["rippled Process"] + direction TB + OTel["OTel SDK
(Traces + Metrics)"] + Journal["Journal + PerfLog
(Logging)"] + end + + OTel -->|"OTLP"| Collector["OTel Collector"] + Journal -->|"File I/O"| LogFile["perf.log / debug.log"] + + Collector --> Tempo["Tempo
(Traces)"] + Collector --> Prom["Prometheus
(Metrics)"] + LogFile --> Loki["Loki (optional)"] + + style rippled fill:#424242,stroke:#212121,color:#fff + style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff + style Journal fill:#e65100,stroke:#bf360c,color:#fff + style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff +``` + +- **Better metrics?** Yes — Prometheus gives native histograms (p50/p95/p99), multi-dimensional labels, and exemplars linking metric spikes to traces. +- **Codebase**: Remove `Beast::Insight` + `StatsDCollector` (~2000 LOC). Single SDK for traces and metrics. +- **Operator effort**: Rewrite dashboards from StatsD/Graphite queries to PromQL. Run both in parallel during transition. +- **Risk**: Medium — operators must migrate monitoring infrastructure. + +### Scenario C: + OTel Logs (Full Stack) + +> Also replace Journal logging with OTel Logs API. Single SDK for everything. + +```mermaid +flowchart LR + subgraph rippled["rippled Process"] + OTel["OTel SDK
(Traces + Metrics + Logs)"] + end + + OTel -->|"OTLP"| Collector["OTel Collector"] + + Collector --> Tempo["Tempo
(Traces)"] + Collector --> Prom["Prometheus
(Metrics)"] + Collector --> Loki["Loki / Elastic
(Logs)"] + + style rippled fill:#424242,stroke:#212121,color:#fff + style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff + style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff +``` + +- **Structured logging**: OTel Logs API outputs structured records with `trace_id`, `span_id`, severity, and attributes by design. +- **Full correlation**: Every log line carries `trace_id`. Click trace → see logs. Click metric spike → see trace → see logs. +- **Codebase**: Remove Beast Insight (~2000 LOC) + simplify Journal/PerfLog (~3000 LOC). One dependency instead of three. +- **Risk**: Highest — `beast::Journal` is deeply embedded in every component. Large refactor. OTel C++ Logs API is newer (stable since v1.11, less battle-tested). + +### Recommendation + +```mermaid +flowchart LR + A["Phase 1
Traces Only
(Current Plan)"] --> B["Phase 2
+ Metrics
(Replace StatsD)"] --> C["Phase 3
+ Logs
(Full OTel)"] + + style A fill:#2e7d32,stroke:#1b5e20,color:#fff + style B fill:#1565c0,stroke:#0d47a1,color:#fff + style C fill:#e65100,stroke:#bf360c,color:#fff +``` + +| Phase | Signal | Strategy | Risk | +| -------------------- | --------- | -------------------------------------------------------------- | ------ | +| **Phase 1** (now) | Traces | Add OTel traces. Keep StatsD and Journal. Prove value. | Low | +| **Phase 2** (future) | + Metrics | Migrate StatsD → Prometheus via OTel. Remove Beast Insight. | Medium | +| **Phase 3** (future) | + Logs | Adopt OTel Logs API. Align with structured logging initiative. | High | + +> **Key Takeaway**: Start with traces (unique value, lowest risk), then incrementally adopt metrics and logs as the OTel infrastructure proves itself. + +--- + +## Slide 5: Comparison with rippled's Existing Solutions ### Current Observability Stack @@ -68,11 +199,13 @@ flowchart LR | "Which node delayed consensus?" | ❌ | ❌ | ✅ | | "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ | -> **Key Insight**: OpenTelemetry **complements** (not replaces) existing systems. +> **Key Insight**: In the **traces-only** approach (Phase 1), OpenTelemetry **complements** existing systems. In future phases, OTel metrics and logs could **replace** StatsD and Journal respectively — see Slides 3-4 for the full adoption roadmap. 
--- -## Slide 4: Architecture +## Slide 6: Architecture + +> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket ### High-Level Integration Architecture @@ -92,7 +225,6 @@ flowchart TB Telemetry -->|OTLP/gRPC| Collector["OTel Collector"] Collector --> Tempo["Grafana Tempo"] - Collector --> Jaeger["Jaeger"] Collector --> Elastic["Elastic APM"] style rippled fill:#424242,stroke:#212121,color:#fff @@ -101,6 +233,14 @@ flowchart TB style Collector fill:#e65100,stroke:#bf360c,color:#fff ``` +**Reading the diagram:** + +- **Core Services (blue, top)**: RPC Server, Overlay, and Consensus are the three primary components that generate trace data — they represent the entry points for client requests, peer messages, and consensus rounds respectively. +- **Telemetry Module (green, middle)**: The OpenTelemetry SDK sits below the core services and receives span data from all three; it acts as a single collection point within the rippled process. +- **OTel Collector (orange, center)**: An external process that receives spans over OTLP/gRPC from the Telemetry Module; it decouples rippled from backend choices and handles batching, sampling, and routing. +- **Backends (bottom row)**: Tempo and Elastic APM are interchangeable — the Collector fans out to any combination, so operators can switch backends without modifying rippled code. +- **Top-to-bottom flow**: Data flows from instrumented code down through the SDK, out over the network to the Collector, and finally into storage/visualization backends. + ### Context Propagation ```mermaid @@ -120,10 +260,12 @@ sequenceDiagram --- -## Slide 5: Implementation Plan +## Slide 7: Implementation Plan ### 5-Phase Rollout (9 Weeks) +> **Note**: Dates shown are relative to project start, not calendar dates. 
+ ```mermaid gantt title Implementation Timeline @@ -158,18 +300,114 @@ gantt **Total Effort**: ~47 developer-days (2 developers) +> **Future Phases** (not in current scope): After traces are stable, OTel metrics can replace StatsD (~3 weeks), and OTel logs can replace Journal (~4 weeks, aligned with structured logging initiative). See Slides 3-4 for the full adoption roadmap. + --- -## Slide 6: Performance Overhead +## Slide 8: Performance Overhead + +> **OTLP** = OpenTelemetry Protocol ### Estimated System Impact -| Metric | Overhead | Notes | -| ----------------- | ---------- | ----------------------------------- | -| **CPU** | 1-3% | Span creation and attribute setting | -| **Memory** | 2-5 MB | Batch buffer for pending spans | -| **Network** | 10-50 KB/s | Compressed OTLP export to collector | -| **Latency (p99)** | <2% | With proper sampling configuration | +| Metric | Overhead | Notes | +| ----------------- | ---------- | ------------------------------------------------ | +| **CPU** | 1-3% | Span creation and attribute setting | +| **Memory** | ~10 MB | SDK statics + batch buffer + worker thread stack | +| **Network** | 10-50 KB/s | Compressed OTLP export to collector | +| **Latency (p99)** | <2% | With proper sampling configuration | + +#### How We Arrived at These Numbers + +**Assumptions (XRPL mainnet baseline)**: + +| Parameter | Value | Source | +| ------------------------- | ---------------------- | --------------------------------------------------------------------------------------------------- | +| Transaction throughput | ~25 TPS (peaks to ~50) | Mainnet average | +| Default peers per node | 21 | `peerfinder/detail/Tuning.h` (`defaultMaxPeers`) | +| Consensus round frequency | ~1 round / 3-4 seconds | `ConsensusParms.h` (`ledgerMIN_CONSENSUS=1950ms`) | +| Proposers per round | ~20-35 | Mainnet UNL size | +| P2P message rate | ~160 msgs/sec | See message breakdown below | +| Avg TX processing time | ~200 μs | Profiled baseline | +| Single span 
creation cost | 500-1000 ns | OTel C++ SDK benchmarks (see [3.5.4](./03-implementation-strategy.md#354-performance-data-sources)) | + +**P2P message breakdown** (per node, mainnet): + +| Message Type | Rate | Derivation | +| ------------- | ------------ | --------------------------------------------------------------------- | +| TMTransaction | ~100/sec | ~25 TPS × ~4 relay hops per TX, deduplicated by HashRouter | +| TMValidation | ~50/sec | ~35 validators × ~1 validation/3s round ≈ ~12/sec, plus relay fan-out | +| TMProposeSet | ~10/sec | ~35 proposers / 3s round ≈ ~12/sec, clustered in establish phase | +| **Total** | **~160/sec** | **Only traced message types counted** | + +**CPU (1-3%) — Calculation**: + +Per-transaction tracing cost breakdown: + +| Operation | Cost | Notes | +| ----------------------------------------------- | ----------- | ------------------------------------------ | +| `tx.receive` span (create + end + 4 attributes) | ~1400 ns | ~1000ns create + ~200ns end + 4×50ns attrs | +| `tx.validate` span | ~1200 ns | ~1000ns create + ~200ns for 2 attributes | +| `tx.relay` span | ~1200 ns | ~1000ns create + ~200ns for 2 attributes | +| Context injection into P2P message | ~200 ns | Serialize trace_id + span_id into protobuf | +| **Total per TX** | **~4.0 μs** | | + +> **CPU overhead**: 4.0 μs / 200 μs baseline = **~2.0% per transaction**. Under high load with consensus + RPC spans overlapping, reaches ~3%. Consensus itself adds only ~36 μs per 3-second round (~0.001%), so the TX path dominates. On production server hardware (3+ GHz Xeon), span creation drops to ~500-600 ns, bringing per-TX cost to ~2.6 μs (~1.3%). See [Section 3.5.4](./03-implementation-strategy.md#354-performance-data-sources) for benchmark sources. 
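The per-transaction arithmetic above can be re-derived with a few lines of Python. This is only a sanity-check sketch: the costs are copied from the breakdown table, and the names `SPAN_COSTS_NS`, `TX_BASELINE_US`, and `per_tx_overhead` are illustrative, not part of rippled or the OTel SDK.

```python
# Estimated tracing costs per transaction, in nanoseconds.
# Values are copied from the breakdown table; they are estimates, not measurements.
SPAN_COSTS_NS = {
    "tx.receive": 1000 + 200 + 4 * 50,  # create + end + 4 attributes
    "tx.validate": 1200,
    "tx.relay": 1200,
    "context_injection": 200,           # trace_id + span_id into the P2P message
}

# Assumed average transaction processing time, in microseconds.
TX_BASELINE_US = 200.0


def per_tx_overhead(costs_ns, baseline_us):
    """Return (total tracing cost in microseconds, overhead as a percentage)."""
    total_us = sum(costs_ns.values()) / 1000.0
    return total_us, 100.0 * total_us / baseline_us


total_us, overhead_pct = per_tx_overhead(SPAN_COSTS_NS, TX_BASELINE_US)
print(f"per-TX tracing cost: {total_us:.1f} us ({overhead_pct:.1f}% of baseline)")
```

Swapping in the lower production-hardware span costs, or a different baseline, shows how sensitive the percentage is to the assumed 200 μs processing time.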
+ +**Memory (~10 MB) — Calculation**: + +| Component | Size | Notes | +| --------------------------------------------- | ------------------ | ------------------------------------- | +| TracerProvider + Exporter (gRPC channel init) | ~320 KB | Allocated once at startup | +| BatchSpanProcessor (circular buffer) | ~16 KB | 2049 × 8-byte AtomicUniquePtr entries | +| BatchSpanProcessor (worker thread stack) | ~8 MB | Default Linux thread stack size | +| Active spans (in-flight, max ~1000) | ~500-800 KB | ~500-800 bytes/span × 1000 concurrent | +| Export queue (batch buffer, max 2048 spans) | ~1 MB | ~500 bytes/span × 2048 queue depth | +| Thread-local context storage (~100 threads) | ~6.4 KB | ~64 bytes/thread | +| **Total** | **~10 MB ceiling** | | + +> Memory plateaus once the export queue fills — the `max_queue_size=2048` config bounds growth. +> The worker thread stack (~8 MB) dominates the static footprint but is virtual memory; actual RSS +> depends on stack usage (typically much less). Active spans are larger than originally estimated +> (~500-800 bytes) because the OTel SDK `Span` object includes a mutex (~40 bytes), `SpanData` +> recordable (~250 bytes base), and `std::map`-based attribute storage (~200-500 bytes for 3-5 +> string attributes). See [Section 3.5.4](./03-implementation-strategy.md#354-performance-data-sources) for source references. + +**Network (10-50 KB/s) — Calculation**: + +Two sources of network overhead: + +**(A) OTLP span export to Collector:** + +| Sampling Rate | Effective Spans/sec | Avg Span Size (compressed) | Bandwidth | +| -------------------------- | ------------------- | -------------------------- | ------------ | +| 100% (dev only) | ~500 | ~500 bytes | ~250 KB/s | +| **10% (recommended prod)** | **~50** | **~500 bytes** | **~25 KB/s** | +| 1% (minimal) | ~5 | ~500 bytes | ~2.5 KB/s | + +> The ~500 spans/sec at 100% comes from: ~100 TX spans + ~160 P2P context spans + ~23 consensus spans/round + ~50 RPC spans = ~500/sec. 
OTLP protobuf with gzip compression yields ~500 bytes/span average. + +**(B) P2P trace context overhead** (added to existing messages, always-on regardless of sampling): + +| Message Type | Rate | Context Size | Bandwidth | +| ------------- | -------- | ------------ | ------------- | +| TMTransaction | ~100/sec | 29 bytes | ~2.9 KB/s | +| TMValidation | ~50/sec | 29 bytes | ~1.5 KB/s | +| TMProposeSet | ~10/sec | 29 bytes | ~0.3 KB/s | +| **Total P2P** | | | **~4.7 KB/s** | + +> **Combined**: 25 KB/s (OTLP export at 10%) + 5 KB/s (P2P context) ≈ **~30 KB/s typical**. The 10-50 KB/s range covers 10-20% sampling under normal to peak mainnet load. + +**Latency (<2%) — Calculation**: + +| Path | Tracing Cost | Baseline | Overhead | +| ------------------------------ | ------------ | -------- | -------- | +| Fast RPC (e.g., `server_info`) | 2.75 μs | ~1 ms | 0.275% | +| Slow RPC (e.g., `path_find`) | 2.75 μs | ~100 ms | 0.003% | +| Transaction processing | 4.0 μs | ~200 μs | 2.0% | +| Consensus round | 36 μs | ~3 sec | 0.001% | + +> At p99, even the worst case (TX processing at 2.0%) is within the 1-3% range. RPC and consensus overhead are negligible. On production hardware, TX overhead drops to ~1.3%. ### Per-Message Overhead (Context Propagation) @@ -179,20 +417,20 @@ Each P2P message carries trace context with the following overhead: | ------------- | ------------- | ----------------------------------------- | | `trace_id` | 16 bytes | Unique identifier for the entire trace | | `span_id` | 8 bytes | Current span (becomes parent on receiver) | -| `trace_flags` | 4 bytes | Sampling decision flags | +| `trace_flags` | 1 byte | Sampling decision flags | | `trace_state` | 0-4 bytes | Optional vendor-specific data | -| **Total** | **~32 bytes** | **Added per traced P2P message** | +| **Total** | **~29 bytes** | **Added per traced P2P message** | ```mermaid flowchart LR subgraph msg["P2P Message with Trace Context"] - A["Original Message
(variable size)"] --> B["+ TraceContext
(~32 bytes)"] + A["Original Message
(variable size)"] --> B["+ TraceContext
(~29 bytes)"] end subgraph breakdown["Context Breakdown"] C["trace_id
16 bytes"] D["span_id
8 bytes"] - E["flags
4 bytes"] + E["flags
1 byte"] F["state
0-4 bytes"] end @@ -206,7 +444,14 @@ flowchart LR style F fill:#4a148c,stroke:#2e0d57,color:#fff ``` -> **Note**: 32 bytes is negligible compared to typical transaction messages (hundreds to thousands of bytes) +**Reading the diagram:** + +- **Original Message (gray, left)**: The existing P2P message payload of variable size — this is unchanged; trace context is appended, never modifying the original data. +- **+ TraceContext (green, right of message)**: The additional 29-byte context block attached to each traced message; the arrow from the original message shows it is a pure addition. +- **Context Breakdown (right subgraph)**: The four fields — `trace_id` (16 bytes), `span_id` (8 bytes), `flags` (1 byte), and `state` (0-4 bytes) — show exactly what is added and their individual sizes. +- **Color coding**: Blue fields (`trace_id`, `span_id`) are the core identifiers required for trace correlation; orange (`flags`) controls sampling decisions; purple (`state`) is optional vendor data typically omitted. + +> **Note**: 29 bytes represents ~1-6% overhead depending on message size (500B simple TX to 5KB proposal), which is acceptable for the observability benefits provided. ### Mitigation Strategies @@ -220,6 +465,8 @@ flowchart LR style D fill:#4a148c,stroke:#2e0d57,color:#fff ``` +> For a detailed explanation of head vs. tail sampling, see Slide 9. + ### Kill Switches (Rollback Options) 1. **Config Disable**: Set `enabled=0` in config → instant disable, no restart needed for sampling @@ -228,18 +475,157 @@ flowchart LR --- -## Slide 7: Data Collection & Privacy +## Slide 9: Sampling Strategies — Head vs. Tail + +> Sampling controls **which traces are recorded and exported**. Without sampling, every operation generates a trace — at 500+ spans/sec, this overwhelms storage and network. Sampling lets you keep the signal, discard the noise. + +### Head Sampling (Decision at Start) + +The sampling decision is made **when a trace begins**, before any work is done. 
A random number is generated; if it falls within the configured ratio, the entire trace is recorded. Otherwise, the trace is silently dropped. + +```mermaid +flowchart LR + A["New Request
Arrives"] --> B{"Random < 10%?"} + B -->|"Yes (1 in 10)"| C["Record Entire Trace
(all spans)"] + B -->|"No (9 in 10)"| D["Drop Entire Trace
(zero overhead)"] + + style C fill:#2e7d32,stroke:#1b5e20,color:#fff + style D fill:#c62828,stroke:#8c2809,color:#fff + style B fill:#1565c0,stroke:#0d47a1,color:#fff +``` + +| Aspect | Details | +| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Where it runs** | Inside rippled (SDK-level). Configured via `sampling_ratio` in `rippled.cfg`. | +| **When the decision happens** | At trace creation time — before the first span is even populated. | +| **How it works** | `sampling_ratio=0.1` means each trace has a 10% probability of being recorded. Dropped traces incur near-zero overhead (no spans created, no attributes set, no export). | +| **Propagation** | Once a trace is sampled, the `trace_flags` field (1 byte in the context header) tells downstream nodes to also sample it. Unsampled traces propagate `trace_flags=0`, so downstream nodes skip them too. | +| **Pros** | Lowest overhead. Simple to configure. Predictable resource usage. | +| **Cons** | **Blind** — it doesn't know if the trace will be interesting. A rare error or slow consensus round has only a 10% chance of being captured. | +| **Best for** | High-volume, steady-state traffic where most traces look similar (e.g., routine RPC requests). | + +**rippled configuration**: + +```ini +[telemetry] +# Record 10% of traces (recommended for production) +sampling_ratio=0.1 +``` + +### Tail Sampling (Decision at End) + +The sampling decision is made **after the trace completes**, based on its actual content — was it slow? Did it error? Was it a consensus round? This requires buffering complete traces before deciding. + +```mermaid +flowchart TB + A["All Traces
Buffered (100%)"] --> B["OTel Collector
Evaluates Rules"] + + B --> C{"Error?"} + C -->|Yes| K["KEEP"] + + C -->|No| D{"Slow?
(>5s consensus,
>1s RPC)"} + D -->|Yes| K + + D -->|No| E{"Random < 10%?"} + E -->|Yes| K + E -->|No| F["DROP"] + + style K fill:#2e7d32,stroke:#1b5e20,color:#fff + style F fill:#c62828,stroke:#8c2809,color:#fff + style B fill:#1565c0,stroke:#0d47a1,color:#fff + style C fill:#e65100,stroke:#bf360c,color:#fff + style D fill:#e65100,stroke:#bf360c,color:#fff + style E fill:#4a148c,stroke:#2e0d57,color:#fff +``` + +| Aspect | Details | +| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Where it runs** | In the **OTel Collector** (external process), not inside rippled. rippled exports 100% of traces; the Collector decides what to keep. | +| **When the decision happens** | After the Collector has received all spans for a trace (waits `decision_wait=10s` for stragglers). | +| **How it works** | Policy rules evaluate the completed trace: keep all errors, keep slow operations above a threshold, keep all consensus rounds, then probabilistically sample the rest at 10%. | +| **Pros** | **Never misses important traces**. Errors, slow requests, and consensus anomalies are always captured regardless of probability. | +| **Cons** | Higher resource usage — rippled must export 100% of spans to the Collector, which buffers them in memory before deciding. The Collector needs more RAM (configured via `num_traces` and `decision_wait`). | +| **Best for** | Production troubleshooting where you can't afford to miss errors or anomalies. 
| + +**Collector configuration** (tail sampling rules for rippled): + +```yaml +processors: + tail_sampling: + decision_wait: 10s # Wait for all spans in a trace + num_traces: 100000 # Buffer up to 100K concurrent traces + policies: + - name: errors # Always keep error traces + type: status_code + status_code: { status_codes: [ERROR] } + + - name: slow-consensus # Keep consensus rounds >5s + type: latency + latency: { threshold_ms: 5000 } + + - name: slow-rpc # Keep slow RPC requests >1s + type: latency + latency: { threshold_ms: 1000 } + + - name: probabilistic # Sample 10% of everything else + type: probabilistic + probabilistic: { sampling_percentage: 10 } +``` + +### Head vs. Tail — Side-by-Side + +| | Head Sampling | Tail Sampling | +| ----------------------------- | ---------------------------------------- | ------------------------------------------------ | +| **Decision point** | Trace start (inside rippled) | Trace end (in OTel Collector) | +| **Knows trace content?** | No (random coin flip) | Yes (evaluates completed trace) | +| **Overhead on rippled** | Lowest (dropped traces = no-op) | Higher (must export 100% to Collector) | +| **Collector resource usage** | Low (receives only sampled traces) | Higher (buffers all traces before deciding) | +| **Captures all errors?** | No (only if trace was randomly selected) | **Yes** (error policy catches them) | +| **Captures slow operations?** | No (random) | **Yes** (latency policy catches them) | +| **Configuration** | `rippled.cfg`: `sampling_ratio=0.1` | `otel-collector.yaml`: `tail_sampling` processor | +| **Best for** | High-throughput steady-state | Troubleshooting & anomaly detection | + +### Recommended Strategy for rippled + +Use **both** in a layered approach: + +```mermaid +flowchart LR + subgraph rippled["rippled (Head Sampling)"] + HS["sampling_ratio=1.0
(export everything)"] + end + + subgraph collector["OTel Collector (Tail Sampling)"] + TS["Keep: errors + slow + 10% random
Drop: routine traces"] + end + + subgraph storage["Backend Storage"] + ST["Only interesting traces
stored long-term"] + end + + rippled -->|"100% of spans"| collector -->|"~15-20% kept"| storage + + style rippled fill:#424242,stroke:#212121,color:#fff + style collector fill:#1565c0,stroke:#0d47a1,color:#fff + style storage fill:#2e7d32,stroke:#1b5e20,color:#fff +``` + +> **Why this works**: rippled exports everything (no blind drops), the Collector applies intelligent filtering (keep errors/slow/anomalies, sample the rest), and only ~15-20% of traces reach storage. If Collector resource usage becomes a concern, add head sampling at `sampling_ratio=0.5` to halve the export volume while still giving the Collector enough data for good tail-sampling decisions. + +--- + +## Slide 10: Data Collection & Privacy ### What Data is Collected -| Category | Attributes Collected | Purpose | -| --------------- | ---------------------------------------------------------------------------------- | --------------------------- | -| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | -| **Consensus** | `round`, `phase`, `mode`, `proposers`(public key or public node id), `duration_ms` | Analyze consensus timing | -| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | -| **Peer** | `peer.id`(public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | -| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | -| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | +| Category | Attributes Collected | Purpose | +| --------------- | ------------------------------------------------------------------------------------ | --------------------------- | +| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | +| **Consensus** | `round`, `phase`, `mode`, `proposers` (count of proposing validators), `duration_ms` | Analyze consensus timing | +| **RPC** | `command`, 
`version`, `status`, `duration_ms` | Monitor RPC performance | +| **Peer** | `peer.id`(public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | +| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | +| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | ### What is NOT Collected (Privacy Guarantees) @@ -263,6 +649,13 @@ flowchart LR style F fill:#c62828,stroke:#8c2809,color:#fff ``` +**Reading the diagram:** + +- **NOT Collected (top row, red)**: Private Keys, Account Balances, and Transaction Amounts are explicitly excluded — these are financial/security-sensitive fields that telemetry never touches. +- **Also Excluded (bottom row, red)**: IP Addresses (configurable per deployment), Personal Data, and Raw TX Payloads are also excluded — these protect operator and user privacy. +- **All-red styling**: Every box is styled in red to visually reinforce that these are hard exclusions, not optional — the telemetry system has no code path to collect any of these fields. +- **Two-row layout**: The split between "NOT Collected" and "Also Excluded" distinguishes between financial data (top) and operational/personal data (bottom), making the privacy boundaries clear to auditors. + ### Privacy Protection Mechanisms | Mechanism | Description | diff --git a/cspell.config.yaml b/cspell.config.yaml index 5d510798b0..f43d6a634e 100644 --- a/cspell.config.yaml +++ b/cspell.config.yaml @@ -276,6 +276,7 @@ words: - txjson - txn - txns + - txqueue - txs - UBSAN - ubsan