docs: correct OTel overhead estimates against SDK benchmarks

Verified CPU, memory, and network overhead calculations against official OTel C++ SDK benchmarks (969 CI runs) and source code analysis.

Key corrections:
- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper + SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources. Updated presentation.md with matching corrections. High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements, additional component tracing docs (PathFinding, TxQ, Validator, etc.), context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-30 10:30:22 +00:00 · 2026-03-24 19:11:12 +00:00
parent accea17e9d
commit 913a4b794c
13 changed files with 1749 additions and 368 deletions
--- a/OpenTelemetryPlan/00-tracing-fundamentals.md
+++ b/OpenTelemetryPlan/00-tracing-fundamentals.md
@@ -15,6 +15,33 @@ Distributed tracing is a method for tracking data objects as they flow through d

 ---

+## Actors and Actions at a Glance
+
+### Actors
+
+| Who (Plain English)                            | Technical Term  |
+| ---------------------------------------------- | --------------- |
+| A single unit of work being tracked            | Span            |
+| The complete journey of a request              | Trace           |
+| Data that links spans across services          | Trace Context   |
+| Code that creates spans and propagates context | Instrumentation |
+| Service that receives and processes traces     | Collector       |
+| Storage and visualization system               | Backend (Tempo) |
+| Decision logic for which traces to keep        | Sampler         |
+
+### Actions
+
+| What Happens (Plain English)            | Technical Term          |
+| --------------------------------------- | ----------------------- |
+| Start tracking a new operation          | Create a Span           |
+| Connect a child operation to its parent | Set `parent_span_id`    |
+| Group all related operations together   | Share a `trace_id`      |
+| Pass tracking data between services     | Context Propagation     |
+| Decide whether to record a trace        | Sampling (Head or Tail) |
+| Send completed traces to storage        | Export (OTLP)           |
+
+---
+
 ## Core Concepts

 ### 1. Trace
@@ -33,16 +60,16 @@ Trace ID: abc123

 A **span** represents a single unit of work within a trace. Each span has:

-| Attribute        | Description           | Example                    |
-| ---------------- | --------------------- | -------------------------- |
-| `trace_id`       | Links to parent trace | `abc123`                   |
-| `span_id`        | Unique identifier     | `span456`                  |
-| `parent_span_id` | Parent span (if any)  | `p_span123`                |
-| `name`           | Operation name        | `rpc.submit`               |
-| `start_time`     | When work began       | `2024-01-15T10:30:00Z`     |
-| `end_time`       | When work completed   | `2024-01-15T10:30:00.050Z` |
-| `attributes`     | Key-value metadata    | `tx.hash=ABC...`           |
-| `status`         | OK, ERROR MSG         | `OK`                       |
+| Attribute        | Description                            | Example                    |
+| ---------------- | -------------------------------------- | -------------------------- |
+| `trace_id`       | Identifies the trace                   | `event123`                 |
+| `span_id`        | Unique identifier                      | `span456`                  |
+| `parent_span_id` | Parent span (if any)                   | `p_span123`                |
+| `name`           | Operation name                         | `rpc.submit`               |
+| `start_time`     | When work began (node-local clock)     | `2024-01-15T10:30:00Z`     |
+| `end_time`       | When work completed (node-local clock) | `2024-01-15T10:30:00.050Z` |
+| `attributes`     | Key-value metadata                     | `tx.hash=ABC...`           |
+| `status`         | OK, or ERROR with a message            | `OK`                       |

 ### 3. Trace Context

@@ -74,6 +101,13 @@ flowchart TB
    style E fill:#bf360c,stroke:#8c2809,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **tx.submit (blue, root)**: The top-level span representing the entire transaction submission; all other spans are its descendants.
+- **tx.validate, tx.relay, tx.apply (green)**: Direct children of tx.submit, representing the three main stages -- validation, relay to peers, and application to the ledger.
+- **ledger.update (red)**: A grandchild span nested under tx.apply, representing the actual ledger state mutation triggered by applying the transaction.
+- **Arrows (parent to child)**: Each arrow indicates a parent-child span relationship where the parent's completion depends on the child finishing.
+
 The same trace visualized as a **timeline (Gantt chart)**:

 ```
@@ -92,6 +126,284 @@ ledger   │                         │▓▓▓▓▓▓▓▓▓▓▓▓▓

 ---

+## Span Relationships
+
+Spans don't always form simple parent-child trees. Distributed tracing defines several relationship types to capture different causal patterns:
+
+### 1. Parent-Child (ChildOf)
+
+The default relationship. The parent span **depends on** or **contains** the child span. The child runs within the scope of the parent.
+
+```
+tx.submit (parent)
+├── tx.validate (child)     ← parent waits for this
+├── tx.relay (child)        ← parent waits for this
+└── tx.apply (child)        ← parent waits for this
+```
+
+**When to use:** Synchronous calls, nested operations, any case where the parent's completion depends on the child.
+
+### 2. Follows-From
+
+A causal relationship where the first span **triggers** the second, but does **not wait** for it. The originator fires and moves on.
+
+```
+Time →
+
+tx.receive [=======]
+                     ↓ triggers (follows-from)
+              tx.relay   [===========]   ← runs independently
+```
+
+**When to use:** Asynchronous jobs, queued work, fire-and-forget patterns. For example, a node receives a transaction and queues it for relay — the relay span _follows from_ the receive span but the receiver doesn't wait for relaying to complete.
+
+> **OpenTracing** defined `FollowsFrom` as a first-class reference type alongside `ChildOf`.
+> **OpenTelemetry** represents this using **Span Links** with descriptive attributes instead (see below).
+
+### 3. Span Links (Cross-Trace and Non-Hierarchical)
+
+Links connect spans that are **causally related but not in a parent-child hierarchy**. Unlike parent-child, links can cross trace boundaries.
+
+```
+Trace A                          Trace B
+──────                           ──────
+batch.schedule                   batch.execute
+├─ item.enqueue (span X)    ┌──► process.item
+├─ item.enqueue (span Y) ───┤    (links to X, Y, Z)
+├─ item.enqueue (span Z)    └──►
+```
+
+**Use cases:**
+
+| Pattern              | Description                                                                 |
+| -------------------- | --------------------------------------------------------------------------- |
+| **Batch processing** | A batch span links back to all individual spans that contributed to it      |
+| **Fan-in**           | An aggregation span links to the multiple producer spans it merges          |
+| **Fan-out**          | Multiple downstream spans link back to the single span that triggered them  |
+| **Async handoff**    | A deferred job links back to the request that queued it (follows-from)      |
+| **Cross-trace**      | Correlating spans across independent traces (e.g., retries, related events) |
+
+**Link structure:** Each link carries the target span's context plus optional attributes:
+
+```
+Link {
+    trace_id:   <target trace>
+    span_id:    <target span>
+    attributes: { "link.description": "triggered by batch scheduler" }
+}
+```
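As an illustration of the link structure above, the following Python sketch models the batch example from the earlier diagram. The `Link` and `Span` classes and all identifier strings are hypothetical stand-ins for this document, not OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical link record: target span context plus optional attributes."""
    trace_id: str
    span_id: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    """Hypothetical span record with one optional parent and any number of links."""
    trace_id: str
    span_id: str
    name: str
    parent_span_id: Optional[str] = None
    links: List[Link] = field(default_factory=list)


# Trace A: three enqueue spans under a scheduler span.
enqueued = [
    Span("trace-a", sid, "item.enqueue", parent_span_id="sched")
    for sid in ("x", "y", "z")
]

# Trace B: one processing span that links back to every contributing span.
process_item = Span(
    "trace-b",
    "p1",
    "process.item",
    parent_span_id="exec",
    links=[
        Link(s.trace_id, s.span_id, {"link.description": "batched item"})
        for s in enqueued
    ],
)

for link in process_item.links:
    print(f"{process_item.name} -> {link.trace_id}/{link.span_id}")
```

A span has at most one parent but can carry many links, which is why the batch span can reference X, Y, and Z even though they live in a different trace.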
+
+### Relationship Summary
+
+```mermaid
+flowchart LR
+    subgraph parent_child["Parent-Child"]
+        direction TB
+        P["Parent"] --> C["Child"]
+    end
+
+    subgraph follows_from["Follows-From"]
+        direction TB
+        A["Span A"] -.->|triggers| B["Span B"]
+    end
+
+    subgraph links["Span Links"]
+        direction TB
+        X["Span X\n(Trace 1)"] -.-|link| Y["Span Y\n(Trace 2)"]
+    end
+
+    parent_child ~~~ follows_from ~~~ links
+
+    style P fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style C fill:#1b5e20,stroke:#0d3d14,color:#ffffff
+    style A fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style B fill:#bf360c,stroke:#8c2809,color:#ffffff
+    style X fill:#4a148c,stroke:#38006b,color:#ffffff
+    style Y fill:#4a148c,stroke:#38006b,color:#ffffff
+```
+
+| Relationship     | Same Trace? | Dependency?                | OTel Mechanism    |
+| ---------------- | ----------- | -------------------------- | ----------------- |
+| **Parent-Child** | Yes         | Parent depends on child    | `parent_span_id`  |
+| **Follows-From** | Usually     | Causal but no dependency   | Link + attributes |
+| **Span Link**    | Either      | Correlation, no dependency | Link + attributes |
+
+---
+
+## Trace ID Generation
+
+A `trace_id` is a 128-bit (16-byte) identifier that groups all spans belonging to one logical operation. How it's generated determines how easily you can find and correlate traces later.
+
+### General Approaches
+
+#### 1. Random (W3C Default)
+
+Generate a random 128-bit ID when a trace starts. Standard approach for most services.
+
+```
+trace_id = random_128_bits()
+```
+
+| Pros                         | Cons                                          |
+| ---------------------------- | --------------------------------------------- |
+| Simple, standard             | No natural correlation to domain events       |
+| Effectively unique per trace | If propagation is lost, trace is broken       |
+| Works with all OTel tooling  | "Find trace for TX abc" requires index lookup |
+
+#### 2. Deterministic (Derived from Domain Data)
+
+Compute the trace_id from a hash of a natural identifier. Every node independently derives the **same** trace_id for the same event.
+
+```
+trace_id = SHA-256(domain_identifier)[0:16]   // truncate to 128 bits
+```
+
+| Pros                                                | Cons                                                       |
+| --------------------------------------------------- | ---------------------------------------------------------- |
+| Propagation-resilient — same ID computed everywhere | Same event processed twice (retry) shares trace_id         |
+| Natural search — domain ID maps directly to trace   | Non-standard (tooling assumes random)                      |
+| No coordination needed between nodes                | 256→128 bit truncation (collision risk negligible at ~2⁶⁴) |
+
+#### 3. Hybrid (Deterministic Prefix + Random Suffix)
+
+First 8 bytes derived from domain data, last 8 bytes random.
+
+```
+trace_id = SHA-256(domain_identifier)[0:8] || random_64_bits()
+```
+
+| Pros                                        | Cons                                     |
+| ------------------------------------------- | ---------------------------------------- |
+| Prefix search: "find all traces for TX abc" | Must propagate to maintain full trace_id |
+| Unique per processing instance              | More complex generation logic            |
+| Retries get distinct trace_ids              | Partial correlation only (prefix match)  |
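The three approaches can be compared side by side in a minimal Python sketch. The function names and the placeholder identifier are assumptions made for illustration; a real integration would plug equivalent logic into the tracing SDK's ID generator.

```python
import hashlib
import secrets


def random_trace_id() -> bytes:
    """Approach 1: 128 random bits (an all-zero ID is invalid, so retry)."""
    while True:
        trace_id = secrets.token_bytes(16)
        if any(trace_id):
            return trace_id


def deterministic_trace_id(domain_identifier: bytes) -> bytes:
    """Approach 2: SHA-256 of the domain identifier, truncated to 16 bytes."""
    return hashlib.sha256(domain_identifier).digest()[:16]


def hybrid_trace_id(domain_identifier: bytes) -> bytes:
    """Approach 3: 8-byte deterministic prefix followed by 8 random bytes."""
    prefix = hashlib.sha256(domain_identifier).digest()[:8]
    return prefix + secrets.token_bytes(8)


# Placeholder 256-bit identifier, not a real transaction hash.
example_id = bytes.fromhex("ab" * 32)

print("random:       ", random_trace_id().hex())
print("deterministic:", deterministic_trace_id(example_id).hex())
print("hybrid:       ", hybrid_trace_id(example_id).hex())
```

Running it twice shows the trade-off directly: the deterministic ID repeats, the random ID never does, and the hybrid ID repeats only in its first 16 hex characters.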
+
+### XRPL Workflow Analysis
+
+XRPL has a unique advantage: its core workflows produce **globally unique 256-bit hashes** that are known on every node. This makes deterministic trace_id generation practical in ways most systems can't achieve.
+
+#### Natural Identifiers by Workflow
+
+| Workflow            | Natural Identifier                | Size       | Known at Start?               | Same on All Nodes?               |
+| ------------------- | --------------------------------- | ---------- | ----------------------------- | -------------------------------- |
+| **Transaction**     | Transaction hash (`tid_`)         | 256-bit    | Yes (once the tx is signed)   | Yes (hash of canonical tx data)  |
+| **Consensus round** | Previous ledger hash + ledger seq | 256+32 bit | Yes — known when round opens  | Yes — all validators agree       |
+| **Validation**      | Ledger hash being validated       | 256-bit    | Yes — from consensus result   | Yes — same closed ledger         |
+| **Ledger catch-up** | Target ledger hash                | 256-bit    | Yes — we know what to fetch   | Yes — identifies ledger globally |
+
+#### Where These Identifiers Live in Code
+
+```
+Transaction:     STTx::getTransactionID()     → uint256 tid_
+                 TMTransaction::rawTransaction → recompute hash from bytes
+
+Consensus:       ConsensusProposal::prevLedger_ → uint256 (previous ledger hash)
+                 ConsensusProposal::position_   → uint256 (TxSet hash)
+                 LedgerHeader::seq              → uint32_t (ledger sequence)
+
+Validation:      STValidation::getLedgerHash()  → uint256
+                 STValidation::getNodeID()      → NodeID (160-bit)
+
+Ledger fetch:    InboundLedger constructor      → uint256 hash, uint32_t seq
+                 TMGetLedger::ledgerHash        → bytes (uint256)
+```
+
+### Recommended Strategy: Workflow-Scoped Deterministic
+
+Each workflow type derives its trace_id from its natural domain identifier:
+
+```
+Transaction trace:   trace_id = SHA-256("tx"    || tx_hash)[0:16]
+Consensus trace:     trace_id = SHA-256("cons"  || prev_ledger_hash || ledger_seq)[0:16]
+Ledger catch-up:     trace_id = SHA-256("fetch" || target_ledger_hash)[0:16]
+```
+
+The string prefix (`"tx"`, `"cons"`, `"fetch"`) prevents collisions between workflows that might share underlying hashes.
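A minimal Python sketch of the workflow-scoped derivation follows. The helper names are hypothetical, and the 4-byte big-endian encoding of `ledger_seq` is an assumption; the plan does not specify how the sequence number is serialized before hashing.

```python
import hashlib


def _derive(prefix: bytes, *parts: bytes) -> bytes:
    """SHA-256 over prefix || parts, truncated to the first 16 bytes."""
    hasher = hashlib.sha256(prefix)
    for part in parts:
        hasher.update(part)
    return hasher.digest()[:16]


def tx_trace_id(tx_hash: bytes) -> bytes:
    return _derive(b"tx", tx_hash)


def consensus_trace_id(prev_ledger_hash: bytes, ledger_seq: int) -> bytes:
    # Assumption: ledger_seq is encoded as 4 bytes, big-endian.
    return _derive(b"cons", prev_ledger_hash, ledger_seq.to_bytes(4, "big"))


def fetch_trace_id(target_ledger_hash: bytes) -> bytes:
    return _derive(b"fetch", target_ledger_hash)


# Placeholder 256-bit hash, reused on purpose to show domain separation.
shared_hash = bytes.fromhex("11" * 32)

print("tx:   ", tx_trace_id(shared_hash).hex())
print("fetch:", fetch_trace_id(shared_hash).hex())
print("cons: ", consensus_trace_id(shared_hash, 1000).hex())
```

Feeding the same 32-byte value to all three functions yields three unrelated trace IDs, which is the collision-avoidance property the prefix is meant to provide.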
+
+**Why this works for XRPL:**
+
+1. **Propagation-resilient** — Even if a P2P message drops trace context, every node independently computes the same trace_id from the same tx_hash or ledger_hash. Spans still correlate.
+
+2. **Zero-cost search** — "Show me the trace for transaction ABC" becomes a direct lookup: compute `SHA-256("tx" || ABC)[0:16]` and query. No secondary index needed.
+
+3. **Cross-workflow linking via Span Links** — A consensus trace links to individual transaction traces. A validation span links to the consensus trace. This connects the full picture without forcing everything into one giant trace.
+
+### Cross-Workflow Correlation
+
+Each workflow gets its own trace. Span Links tie them together:
+
+```mermaid
+flowchart TB
+    subgraph tx_trace["Transaction Trace"]
+        direction LR
+        Tn["trace_id = f(tx_hash)"]:::note --> T1["tx.receive"] --> T2["tx.validate"] --> T3["tx.relay"]
+    end
+
+    subgraph cons_trace["Consensus Trace"]
+        direction LR
+        Cn["trace_id = f(prev_ledger, seq)"]:::note --> C1["cons.open"] --> C2["cons.propose"] --> C3["cons.accept"]
+    end
+
+    subgraph val_trace["Validation"]
+        direction LR
+        Vn["spans within consensus trace"]:::note --> V1["val.create"] --> V2["val.broadcast"]
+    end
+
+    subgraph fetch_trace["Catch-Up Trace"]
+        direction LR
+        Fn["trace_id = f(ledger_hash)"]:::note --> F1["fetch.request"] --> F2["fetch.receive"] --> F3["fetch.apply"]
+    end
+
+    C1 -.-|"span link\n(tx traces)"| T3
+    C3 --> V1
+    F1 -.-|"span link\n(target ledger)"| C3
+
+    classDef note fill:none,stroke:#888,stroke-dasharray:5 5,color:#333,font-style:italic
+    style T1 fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style T2 fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style T3 fill:#0d47a1,stroke:#082f6a,color:#ffffff
+    style C1 fill:#1b5e20,stroke:#0d3d14,color:#ffffff
+    style C2 fill:#1b5e20,stroke:#0d3d14,color:#ffffff
+    style C3 fill:#1b5e20,stroke:#0d3d14,color:#ffffff
+    style V1 fill:#bf360c,stroke:#8c2809,color:#ffffff
+    style V2 fill:#bf360c,stroke:#8c2809,color:#ffffff
+    style F1 fill:#4a148c,stroke:#38006b,color:#ffffff
+    style F2 fill:#4a148c,stroke:#38006b,color:#ffffff
+    style F3 fill:#4a148c,stroke:#38006b,color:#ffffff
+```
+
+**Reading the diagram:**
+
+- **Transaction Trace (blue)**: An independent trace whose `trace_id` is deterministically derived from the transaction hash. Contains receive, validate, and relay spans.
+- **Consensus Trace (green)**: An independent trace whose `trace_id` is derived from the previous ledger hash and sequence number. Covers the open, propose, and accept phases.
+- **Validation (red)**: Validation spans live within the consensus trace (not a separate trace). They are created after the accept phase completes.
+- **Catch-Up Trace (purple)**: An independent trace for ledger acquisition, derived from the target ledger hash. Used when a node is behind and fetching missing ledgers.
+- **Dotted arrows (span links)**: Cross-trace correlations. Consensus links to transaction traces it included; catch-up links to the consensus trace that produced the target ledger.
+- **Solid arrow (C3 to V1)**: A parent-child relationship -- validation spans are direct children of the consensus accept span within the same trace.
+
+**How a query flows:**
+
+```
+"Why was TX abc slow?"
+  1. Compute trace_id = SHA-256("tx" || abc)[0:16]
+  2. Find transaction trace → see it was included in consensus round N
+  3. Follow span link → consensus trace for round N
+  4. See which phase was slow (propose? accept?)
+  5. If a node was catching up, follow link → catch-up trace
+```
+
+### Trade-offs to Consider
+
+| Concern                       | Mitigation                                                                                                                    |
+| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
+| **Retries get same trace_id** | Add `attempt` attribute to root span; spans have unique span_ids and timestamps                                               |
+| **256→128 bit truncation**    | Birthday-bound collision at ~2⁶⁴ operations — negligible for XRPL's throughput                                                |
+| **Non-standard generation**   | OTel spec allows any 16-byte non-zero value; tooling works on the hex string                                                  |
+| **Hash computation cost**     | SHA-256 is ~0.3μs per call; XRPL already computes these hashes for other purposes                                             |
+| **Late-binding identifiers**  | Ledger hash isn't known until after consensus — validation spans use ledger_seq as fallback, then link to the consensus trace |
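The ~2⁶⁴ figure in the truncation row comes from the standard birthday approximation, worked through below. It assumes the truncated SHA-256 output behaves like a uniformly random 128-bit value; the count $n = 10^{12}$ is an illustrative number, not a measured XRPL volume.

$$
p(n) \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{128}}\right) \;\approx\; \frac{n^{2}}{2^{129}} \quad \text{for } n \ll 2^{64}
$$

$$
n = 2^{64}: \quad p \approx 1 - e^{-1/2} \approx 0.39
\qquad\qquad
n = 10^{12}: \quad p \approx \frac{10^{24}}{2^{129}} \approx 1.5 \times 10^{-15}
$$

So 2⁶⁴ is the scale at which a collision becomes roughly as likely as not, and the probability falls off quadratically for smaller trace counts.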
+
+---
+
 ## Distributed Traces Across Nodes

 In distributed systems like rippled, traces span **multiple independent nodes**. The trace context must be propagated in network messages:
@@ -118,20 +430,27 @@ sequenceDiagram
    Note over NodeA,NodeC: All spans share trace_id: abc123<br/>enabling correlation across nodes
 ```

+**Reading the diagram:**
+
+- **Client**: The external entity that submits a transaction. It does not carry trace context -- the trace originates at the first node.
+- **Node A**: The entry point that creates a new trace (trace_id: abc123) and the root span `tx.receive`. It relays the transaction to peers with trace context attached.
+- **Node B and Node C**: Peer nodes that receive the relayed transaction along with the propagated trace context. Each creates a child span under Node A's span, preserving the same `trace_id`.
+- **Arrows with trace context**: The relay messages carry `trace_id` and `parent_span_id`, allowing each downstream node to link its spans back to the originating span on Node A.
+
 ---

 ## Context Propagation

 For traces to work across nodes, **trace context must be propagated** in messages.

-### What's in the Context (32 bytes)
+### What's in the Context (~25 bytes)

-| Field         | Size       | Description                                             |
-| ------------- | ---------- | ------------------------------------------------------- |
-| `trace_id`    | 16 bytes   | Identifies the entire trace (constant across all nodes) |
-| `span_id`     | 8 bytes    | The sender's current span (becomes parent on receiver)  |
-| `trace_flags` | 4 bytes    | Sampling decision flags                                 |
-| `trace_state` | ~0-4 bytes | Optional vendor-specific data                           |
+| Field         | Size     | Description                                             |
+| ------------- | -------- | ------------------------------------------------------- |
+| `trace_id`    | 16 bytes | Identifies the entire trace (constant across all nodes) |
+| `span_id`     | 8 bytes  | The sender's current span (becomes parent on receiver)  |
+| `trace_flags` | 1 byte   | Sampling decision (bit 0 = sampled; bits 1-7 reserved)  |
+| `trace_state` | variable | Optional vendor-specific data (typically omitted)       |
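The three fixed-size fields in the table add up to 16 + 8 + 1 = 25 bytes. The Python sketch below packs and unpacks them with a hypothetical fixed binary layout; the actual wire format (for example, fields inside a protocol message) is not specified here, and `trace_state` is left out because it is variable and typically omitted.

```python
import struct

# Hypothetical layout: trace_id (16 bytes), span_id (8 bytes), trace_flags (1 byte).
# ">" selects big-endian with no padding, so the size is exactly 25 bytes.
_CONTEXT_FORMAT = ">16s8sB"
CONTEXT_SIZE = struct.calcsize(_CONTEXT_FORMAT)

SAMPLED_FLAG = 0x01  # bit 0 of trace_flags


def pack_context(trace_id: bytes, span_id: bytes, sampled: bool) -> bytes:
    if len(trace_id) != 16 or len(span_id) != 8:
        raise ValueError("trace_id must be 16 bytes and span_id 8 bytes")
    flags = SAMPLED_FLAG if sampled else 0x00
    return struct.pack(_CONTEXT_FORMAT, trace_id, span_id, flags)


def unpack_context(buffer: bytes):
    if len(buffer) != CONTEXT_SIZE:
        raise ValueError(f"expected {CONTEXT_SIZE} bytes, got {len(buffer)}")
    trace_id, span_id, flags = struct.unpack(_CONTEXT_FORMAT, buffer)
    return trace_id, span_id, bool(flags & SAMPLED_FLAG)


wire = pack_context(
    bytes.fromhex("4bf92f3577b34da6a3ce929d0e0e4736"),
    bytes.fromhex("00f067aa0ba902b7"),
    sampled=True,
)
print(len(wire), unpack_context(wire))
```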

 ### How span_id Changes at Each Hop

@@ -165,11 +484,11 @@ There are two patterns:
 ### HTTP/RPC Headers (W3C Trace Context)

 ```
-traceparent: 00-abc123def456-span789-01
-             │  │             │      │
-             │  │             │      └── Flags (sampled)
-             │  │             └── Parent span ID
-             │  └── Trace ID
+traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
+             │  │                                │                │
+             │  │                                │                └── Flags (sampled)
+             │  │                                └── Parent span ID (16 hex)
+             │  └── Trace ID (32 hex)
             └── Version
 ```
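For reference, a small Python sketch that parses the header format shown above. `parse_traceparent` is a hypothetical helper that handles only version `00` and rejects the all-zero IDs that W3C Trace Context treats as invalid; a production service would normally rely on its tracing SDK's propagator instead.

```python
import re
from typing import Dict, Optional, Union

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Dict[str, Union[str, bool]]]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
print(parse_traceparent("00-abc123def456-span789-01"))
```

The second call returns `None`: the shortened placeholder IDs used in the earlier version of this example do not satisfy the fixed 32 and 16 hex digit lengths.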

@@ -228,16 +547,20 @@ Trace completes → Collector evaluates:

 ## Glossary

-| Term                | Definition                                                      |
-| ------------------- | --------------------------------------------------------------- |
-| **Trace**           | Complete journey of a request, identified by `trace_id`         |
-| **Span**            | Single operation within a trace                                 |
-| **Context**         | Data propagated between services (`trace_id`, `span_id`, flags) |
-| **Instrumentation** | Code that creates spans and propagates context                  |
-| **Collector**       | Service that receives, processes, and exports traces            |
-| **Backend**         | Storage/visualization system (Jaeger, Tempo, etc.)              |
-| **Head Sampling**   | Sampling decision at trace start                                |
-| **Tail Sampling**   | Sampling decision after trace completes                         |
+| Term                 | Definition                                                          |
+| -------------------- | ------------------------------------------------------------------- |
+| **Trace**            | Complete journey of a request, identified by `trace_id`             |
+| **Span**             | Single operation within a trace                                     |
+| **Parent-Child**     | Span relationship where the parent depends on the child             |
+| **Follows-From**     | Causal relationship where originator doesn't wait for the result    |
+| **Span Link**        | Non-hierarchical connection between spans, possibly across traces   |
+| **Deterministic ID** | Trace ID derived from domain data (e.g., tx_hash) instead of random |
+| **Context**          | Data propagated between services (`trace_id`, `span_id`, flags)     |
+| **Instrumentation**  | Code that creates spans and propagates context                      |
+| **Collector**        | Service that receives, processes, and exports traces                |
+| **Backend**          | Storage/visualization system (Tempo)                                |
+| **Head Sampling**    | Sampling decision at trace start                                    |
+| **Tail Sampling**    | Sampling decision after trace completes                             |

 ---

--- a/OpenTelemetryPlan/01-architecture-analysis.md
+++ b/OpenTelemetryPlan/01-architecture-analysis.md
@@ -7,6 +7,8 @@

 ## 1.1 Current rippled Architecture Overview

+> **WS** = WebSocket | **UNL** = Unique Node List | **TxQ** = Transaction Queue | **StatsD** = Statistics Daemon
+
 The rippled node software consists of several interconnected components that need instrumentation for distributed tracing:

 ```mermaid
@@ -16,6 +18,7 @@ flowchart TB
            RPC["RPC Server<br/>(HTTP/WS/gRPC)"]
            Overlay["Overlay<br/>(P2P Network)"]
            Consensus["Consensus<br/>(RCLConsensus)"]
+            ValidatorList["ValidatorList<br/>(UNL Mgmt)"]
        end

        JobQueue["JobQueue<br/>(Thread Pool)"]
@@ -24,6 +27,13 @@ flowchart TB
            NetworkOPs["NetworkOPs<br/>(Tx Processing)"]
            LedgerMaster["LedgerMaster<br/>(Ledger Mgmt)"]
            NodeStore["NodeStore<br/>(Database)"]
+            InboundLedgers["InboundLedgers<br/>(Ledger Sync)"]
+        end
+
+        subgraph appservices["Application Services"]
+            PathFind["PathFinding<br/>(Payment Paths)"]
+            TxQ["TxQ<br/>(Fee Escalation)"]
+            LoadMgr["LoadManager<br/>(Fee/Load)"]
        end

        subgraph observability["Existing Observability"]
@@ -34,27 +44,92 @@ flowchart TB

        services --> JobQueue
        JobQueue --> processing
+        JobQueue --> appservices
    end

    style rippled fill:#424242,stroke:#212121,color:#ffffff
    style services fill:#1565c0,stroke:#0d47a1,color:#ffffff
    style processing fill:#2e7d32,stroke:#1b5e20,color:#ffffff
+    style appservices fill:#6a1b9a,stroke:#4a148c,color:#ffffff
    style observability fill:#e65100,stroke:#bf360c,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **Core Services (blue)**: The entry points into rippled -- RPC Server handles client requests, Overlay manages peer-to-peer networking, Consensus drives agreement, and ValidatorList manages trusted validators.
+- **JobQueue (center)**: The asynchronous thread pool that decouples Core Services from the Processing and Application layers. All work flows through it.
+- **Processing Layer (green)**: Core business logic -- NetworkOPs processes transactions, LedgerMaster manages ledger state, NodeStore handles persistence, and InboundLedgers synchronizes missing data.
+- **Application Services (purple)**: Higher-level features -- PathFinding computes payment routes, TxQ manages fee-based queuing, and LoadManager tracks server load.
+- **Existing Observability (orange)**: The current monitoring stack (PerfLog, Insight, Journal logging) that OpenTelemetry will complement, not replace.
+- **Arrows (Services to JobQueue to layers)**: Work originates at Core Services, is enqueued onto the JobQueue, and dispatched to Processing or Application layers for execution.
+
+---
+
+## 1.1.1 Actors and Actions
+
+### Actors
+
+| Who (Plain English)                       | Technical Term             |
+| ----------------------------------------- | -------------------------- |
+| Network node running XRPL software        | rippled node               |
+| External client submitting requests       | RPC Client                 |
+| Network neighbor sharing data             | Peer (PeerImp)             |
+| Request handler for client queries        | RPC Server (ServerHandler) |
+| Command executor for specific RPC methods | RPCHandler                 |
+| Agreement process between nodes           | Consensus (RCLConsensus)   |
+| Transaction processing coordinator        | NetworkOPs                 |
+| Background task scheduler                 | JobQueue                   |
+| Ledger state manager                      | LedgerMaster               |
+| Payment route calculator                  | PathFinding (Pathfinder)   |
+| Transaction waiting room                  | TxQ (Transaction Queue)    |
+| Fee adjustment system                     | LoadManager                |
+| Trusted validator list manager            | ValidatorList              |
+| Protocol upgrade tracker                  | AmendmentTable             |
+| Ledger state hash tree                    | SHAMap                     |
+| Persistent key-value storage              | NodeStore                  |
+
+### Actions
+
+| What Happens (Plain English)                   | Technical Term         |
+| ---------------------------------------------- | ---------------------- |
+| Client sends a request to a node               | `rpc.request`          |
+| Node executes a specific RPC command           | `rpc.command.*`        |
+| Node receives a transaction from a peer        | `tx.receive`           |
+| Node checks if a transaction is valid          | `tx.validate`          |
+| Node forwards a transaction to neighbors       | `tx.relay`             |
+| Nodes agree on which transactions to include   | `consensus.round`      |
+| Consensus progresses through phases            | `consensus.phase.*`    |
+| Node builds a new confirmed ledger             | `ledger.build`         |
+| Node fetches missing ledger data from peers    | `ledger.acquire`       |
+| Node computes payment routes                   | `pathfind.compute`     |
+| Node queues a transaction for later processing | `txq.enqueue`          |
+| Node increases fees due to high load           | `fee.escalate`         |
+| Node fetches the latest trusted validator list | `validator.list.fetch` |
+| Node votes on a protocol amendment             | `amendment.vote`       |
+| Node synchronizes state tree data              | `shamap.sync`          |
+
 ---

 ## 1.2 Key Components for Instrumentation

-| Component         | Location                                   | Purpose                  | Trace Value                  |
-| ----------------- | ------------------------------------------ | ------------------------ | ---------------------------- |
-| **Overlay**       | `src/xrpld/overlay/`                       | P2P communication        | Message propagation timing   |
-| **PeerImp**       | `src/xrpld/overlay/detail/PeerImp.cpp`     | Individual peer handling | Per-peer latency             |
-| **RCLConsensus**  | `src/xrpld/app/consensus/RCLConsensus.cpp` | Consensus algorithm      | Round timing, phase analysis |
-| **NetworkOPs**    | `src/xrpld/app/misc/NetworkOPs.cpp`        | Transaction processing   | Tx lifecycle tracking        |
-| **ServerHandler** | `src/xrpld/rpc/detail/ServerHandler.cpp`   | RPC entry point          | Request latency              |
-| **RPCHandler**    | `src/xrpld/rpc/detail/RPCHandler.cpp`      | Command execution        | Per-command timing           |
-| **JobQueue**      | `src/xrpl/core/JobQueue.h`                 | Async task execution     | Queue wait times             |
+> **TxQ** = Transaction Queue | **UNL** = Unique Node List
+
+| Component          | Location                                   | Purpose                  | Trace Value                      |
+| ------------------ | ------------------------------------------ | ------------------------ | -------------------------------- |
+| **Overlay**        | `src/xrpld/overlay/`                       | P2P communication        | Message propagation timing       |
+| **PeerImp**        | `src/xrpld/overlay/detail/PeerImp.cpp`     | Individual peer handling | Per-peer latency                 |
+| **RCLConsensus**   | `src/xrpld/app/consensus/RCLConsensus.cpp` | Consensus algorithm      | Round timing, phase analysis     |
+| **NetworkOPs**     | `src/xrpld/app/misc/NetworkOPs.cpp`        | Transaction processing   | Tx lifecycle tracking            |
+| **ServerHandler**  | `src/xrpld/rpc/detail/ServerHandler.cpp`   | RPC entry point          | Request latency                  |
+| **RPCHandler**     | `src/xrpld/rpc/detail/RPCHandler.cpp`      | Command execution        | Per-command timing               |
+| **JobQueue**       | `src/xrpl/core/JobQueue.h`                 | Async task execution     | Queue wait times                 |
+| **PathFinding**    | `src/xrpld/app/paths/`                     | Payment path computation | Path latency, cache hits         |
+| **TxQ**            | `src/xrpld/app/misc/TxQ.cpp`               | Transaction queue/fees   | Queue depth, eviction rates      |
+| **LoadManager**    | `src/xrpld/app/main/LoadManager.cpp`       | Fee escalation/load      | Fee levels, load factors         |
+| **InboundLedgers** | `src/xrpld/app/ledger/InboundLedgers.cpp`  | Ledger acquisition       | Sync time, peer reliability      |
+| **ValidatorList**  | `src/xrpld/app/misc/ValidatorList.cpp`     | UNL management           | List freshness, fetch failures   |
+| **AmendmentTable** | `src/xrpld/app/misc/AmendmentTable.cpp`    | Protocol amendments      | Voting status, activation events |
+| **SHAMap**         | `src/xrpld/shamap/`                        | State hash tree          | Sync speed, missing nodes        |

 ---

@@ -93,6 +168,15 @@ sequenceDiagram
    Note over Client,PeerC: DISTRIBUTED TRACE (same trace_id: abc123)
 ```

+**Reading the diagram:**
+
+- **Client**: The external entity that submits a transaction to Peer A. It has no trace context -- the trace starts at the first node.
+- **Peer A (Receive)**: The entry node that creates the root span `tx.receive`, runs HashRouter deduplication to avoid processing duplicates, and creates a child `tx.validate` span.
+- **Peer A to Peer B arrow**: The relay message carries trace context (trace_id + parent span_id), enabling Peer B to create a linked span under the same trace.
+- **Peer B (Relay)**: Receives the transaction and trace context, creates a `tx.receive` span linked to Peer A's trace, then relays onward.
+- **Peer C (Validate)**: Final hop in this example. Creates a linked `tx.receive` span and runs `tx.process` to fully process the transaction.
+- **Blue rectangles**: Highlight the span boundaries on each node, showing where instrumentation creates and closes spans.
+
 ### Trace Structure

 ```
@@ -142,16 +226,26 @@ flowchart TB
    style accept fill:#c2185b,stroke:#880e4f,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **consensus.round (orange, root span)**: The top-level span encompassing the entire consensus round, with attributes like ledger sequence, mode, and proposer count.
+- **consensus.phase.open (blue)**: The first phase where the node waits (~3s) to collect incoming transactions before proposing.
+- **consensus.phase.establish (green)**: The negotiation phase where validators exchange proposals, resolve disputes, and converge on a transaction set. Child spans track each proposal received/sent and each dispute resolved.
+- **consensus.phase.accept (pink)**: The final phase where the agreed transaction set is applied, a new ledger is built, and the ledger is validated. Child spans cover `ledger.build` and `ledger.validate`.
+- **Arrows (open to establish to accept)**: The sequential flow through the three consensus phases. Each phase must complete before the next begins.
+
 ---

 ## 1.5 RPC Request Flow

+> **WS** = WebSocket
+
 RPC requests support W3C Trace Context headers for distributed tracing across services:

 ```mermaid
 flowchart TB
    subgraph request["rpc.request (root span)"]
-        http["HTTP Request<br/>POST /<br/>traceparent: 00-abc123...-def456...-01"]
+        http["HTTP Request — POST /<br/>traceparent:<br/>00-abc123...-def456...-01"]

        attrs["Attributes:<br/>http.method = POST<br/>net.peer.ip = 192.168.1.100<br/>xrpl.rpc.command = submit"]

@@ -177,32 +271,56 @@ flowchart TB
    style command fill:#e65100,stroke:#bf360c,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **rpc.request (green, root span)**: The outermost span representing the full RPC request lifecycle, from HTTP receipt to response. Carries the W3C `traceparent` header for distributed tracing.
+- **HTTP Request node**: Shows the incoming POST request with its `traceparent` header and extracted attributes (method, peer IP, command name).
+- **jobqueue.enqueue (blue)**: The span covering the asynchronous handoff from the RPC thread to the JobQueue worker thread. The trace context is preserved across this async boundary.
+- **rpc.command.submit (orange)**: The span for the actual command execution, with child spans for deserialization, local validation, and network submission.
+- **Response node**: The final output with HTTP status and total duration, marking the end of the root span.
+- **Arrows (top to bottom)**: The sequential processing pipeline -- receive request, extract attributes, enqueue job, execute command, return response.
+
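The `traceparent` header in the diagram above follows the W3C Trace Context layout `version-trace_id-parent_id-flags`. A minimal sketch of parsing it is shown here, assuming only version `00` needs to be handled. `TraceParent`, `parseTraceParent`, and `isSampled` are hypothetical names for illustration, not rippled or OTel SDK APIs; a real integration would more likely rely on the SDK's own propagator. The abbreviated `00-abc123...-def456...-01` value from the diagram would be rejected by this parser because the real fields are fixed-width.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical holder for the four traceparent fields; not an OTel SDK type.
struct TraceParent
{
    std::string version;  // 2 hex chars, "00" in this sketch
    std::string traceId;  // 32 lowercase hex chars
    std::string spanId;   // 16 lowercase hex chars (the caller's span, i.e. our parent)
    std::string flags;    // 2 hex chars, bit 0 = sampled
};

// True when every character is a lowercase hex digit.
inline bool
isLowerHex(std::string_view s)
{
    if (s.empty())
        return false;
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
            return false;
    return true;
}

// True when the field consists only of '0' characters (invalid for IDs).
inline bool
allZero(std::string_view s)
{
    return s.find_first_not_of('0') == std::string_view::npos;
}

// Parses a version-00 traceparent value; returns nullopt on any malformed input.
inline std::optional<TraceParent>
parseTraceParent(std::string_view header)
{
    // Version 00 layout: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters.
    if (header.size() != 55 || header[2] != '-' || header[35] != '-' ||
        header[52] != '-')
        return std::nullopt;

    auto const version = header.substr(0, 2);
    auto const traceId = header.substr(3, 32);
    auto const spanId = header.substr(36, 16);
    auto const flags = header.substr(53, 2);

    if (version != "00")  // assumption: this sketch ignores future versions
        return std::nullopt;
    if (!isLowerHex(traceId) || !isLowerHex(spanId) || !isLowerHex(flags))
        return std::nullopt;
    if (allZero(traceId) || allZero(spanId))
        return std::nullopt;

    return TraceParent{
        std::string(version),
        std::string(traceId),
        std::string(spanId),
        std::string(flags)};
}

// Reads the sampled bit (lowest bit of the flags byte).
inline bool
isSampled(TraceParent const& tp)
{
    assert(tp.flags.size() == 2);
    return (std::stoi(tp.flags, nullptr, 16) & 0x01) != 0;
}
```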
 ---

 ## 1.6 Key Trace Points

+> **TxQ** = Transaction Queue
+
 The following table identifies priority instrumentation points across the codebase:

-| Category        | Span Name              | File                 | Method                 | Priority |
-| --------------- | ---------------------- | -------------------- | ---------------------- | -------- |
-| **Transaction** | `tx.receive`           | `PeerImp.cpp`        | `handleTransaction()`  | High     |
-| **Transaction** | `tx.validate`          | `NetworkOPs.cpp`     | `processTransaction()` | High     |
-| **Transaction** | `tx.process`           | `NetworkOPs.cpp`     | `doTransactionSync()`  | High     |
-| **Transaction** | `tx.relay`             | `OverlayImpl.cpp`    | `relay()`              | Medium   |
-| **Consensus**   | `consensus.round`      | `RCLConsensus.cpp`   | `startRound()`         | High     |
-| **Consensus**   | `consensus.phase.*`    | `Consensus.h`        | `timerEntry()`         | High     |
-| **Consensus**   | `consensus.proposal.*` | `RCLConsensus.cpp`   | `peerProposal()`       | Medium   |
-| **RPC**         | `rpc.request`          | `ServerHandler.cpp`  | `onRequest()`          | High     |
-| **RPC**         | `rpc.command.*`        | `RPCHandler.cpp`     | `doCommand()`          | High     |
-| **Peer**        | `peer.connect`         | `OverlayImpl.cpp`    | `onHandoff()`          | Low      |
-| **Peer**        | `peer.message.*`       | `PeerImp.cpp`        | `onMessage()`          | Low      |
-| **Ledger**      | `ledger.acquire`       | `InboundLedgers.cpp` | `acquire()`            | Medium   |
-| **Ledger**      | `ledger.build`         | `RCLConsensus.cpp`   | `buildLCL()`           | High     |
+| Category        | Span Name              | File                   | Method                  | Priority |
+| --------------- | ---------------------- | ---------------------- | ----------------------- | -------- |
+| **Transaction** | `tx.receive`           | `PeerImp.cpp`          | `handleTransaction()`   | High     |
+| **Transaction** | `tx.validate`          | `NetworkOPs.cpp`       | `processTransaction()`  | High     |
+| **Transaction** | `tx.process`           | `NetworkOPs.cpp`       | `doTransactionSync()`   | High     |
+| **Transaction** | `tx.relay`             | `OverlayImpl.cpp`      | `relay()`               | Medium   |
+| **Consensus**   | `consensus.round`      | `RCLConsensus.cpp`     | `startRound()`          | High     |
+| **Consensus**   | `consensus.phase.*`    | `Consensus.h`          | `timerEntry()`          | High     |
+| **Consensus**   | `consensus.proposal.*` | `RCLConsensus.cpp`     | `peerProposal()`        | Medium   |
+| **RPC**         | `rpc.request`          | `ServerHandler.cpp`    | `onRequest()`           | High     |
+| **RPC**         | `rpc.command.*`        | `RPCHandler.cpp`       | `doCommand()`           | High     |
+| **Peer**        | `peer.connect`         | `OverlayImpl.cpp`      | `onHandoff()`           | Low      |
+| **Peer**        | `peer.message.*`       | `PeerImp.cpp`          | `onMessage()`           | Low      |
+| **Ledger**      | `ledger.acquire`       | `InboundLedgers.cpp`   | `acquire()`             | Medium   |
+| **Ledger**      | `ledger.build`         | `RCLConsensus.cpp`     | `buildLCL()`            | High     |
+| **PathFinding** | `pathfind.request`     | `PathRequest.cpp`      | `doUpdate()`            | High     |
+| **PathFinding** | `pathfind.compute`     | `Pathfinder.cpp`       | `findPaths()`           | High     |
+| **TxQ**         | `txq.enqueue`          | `TxQ.cpp`              | `apply()`               | High     |
+| **TxQ**         | `txq.apply`            | `TxQ.cpp`              | `processClosedLedger()` | High     |
+| **Fee**         | `fee.escalate`         | `LoadManager.cpp`      | `raiseLocalFee()`       | Medium   |
+| **Ledger**      | `ledger.replay`        | `LedgerReplayer.h`     | `replay()`              | Medium   |
+| **Ledger**      | `ledger.delta`         | `LedgerDeltaAcquire.h` | `processData()`         | Medium   |
+| **Validator**   | `validator.list.fetch` | `ValidatorList.cpp`    | `verify()`              | Medium   |
+| **Validator**   | `validator.manifest`   | `Manifest.cpp`         | `applyManifest()`       | Low      |
+| **Amendment**   | `amendment.vote`       | `AmendmentTable.cpp`   | `doVoting()`            | Low      |
+| **SHAMap**      | `shamap.sync`          | `SHAMap.cpp`           | `fetchRoot()`           | Medium   |

 ---

 ## 1.7 Instrumentation Priority

+> **TxQ** = Transaction Queue
+
 ```mermaid
 quadrantChart
    title Instrumentation Priority Matrix
@@ -213,18 +331,25 @@ quadrantChart
    quadrant-3 Quick Wins
    quadrant-4 Consider Later

-    RPC Tracing: [0.3, 0.85]
-    Transaction Tracing: [0.65, 0.92]
-    Consensus Tracing: [0.75, 0.87]
-    Peer Message Tracing: [0.4, 0.3]
-    Ledger Acquisition: [0.5, 0.6]
-    JobQueue Tracing: [0.35, 0.5]
+    RPC Tracing: [0.2, 0.92]
+    Transaction Tracing: [0.55, 0.88]
+    Consensus Tracing: [0.78, 0.82]
+    PathFinding: [0.38, 0.75]
+    TxQ and Fees: [0.25, 0.65]
+    Ledger Sync: [0.62, 0.58]
+    Peer Message Tracing: [0.35, 0.25]
+    JobQueue Tracing: [0.2, 0.48]
+    Validator Mgmt: [0.48, 0.42]
+    Amendment Tracking: [0.15, 0.32]
+    SHAMap Operations: [0.72, 0.45]
 ```

 ---

 ## 1.8 Observable Outcomes

+> **TxQ** = Transaction Queue | **UNL** = Unique Node List
+
 After implementing OpenTelemetry, operators and developers will gain visibility into the following:

 ### 1.8.1 What You Will See: Traces
@@ -236,20 +361,28 @@ After implementing OpenTelemetry, operators and developers will gain visibility
 | **Consensus Rounds**       | Complete round with all phases (open, establish, accept)                                    | `{span.name=~"consensus.round.*"}`                     |
 | **RPC Request Processing** | Individual command execution with timing breakdown                                          | `{xrpl.rpc.command="account_info"}`                    |
 | **Ledger Acquisition**     | Peer-to-peer ledger data requests and responses                                             | `{span.name="ledger.acquire"}`                         |
+| **PathFinding Latency**    | Path computation time and cache effectiveness for payment RPCs                              | `{span.name="pathfind.compute"}`                       |
+| **TxQ Behavior**           | Queue depth, eviction patterns, fee escalation during congestion                            | `{span.name=~"txq.*"}`                                 |
+| **Ledger Sync**            | Full acquisition timeline including delta and transaction fetches                           | `{span.name=~"ledger.acquire.*"}`                      |
+| **Validator Health**       | UNL fetch success, manifest updates, stale list detection                                   | `{span.name=~"validator.*"}`                           |

 ### 1.8.2 What You Will See: Metrics (Derived from Traces)

-| Metric                        | Description                            | Dashboard Panel             |
-| ----------------------------- | -------------------------------------- | --------------------------- |
-| **RPC Latency (p50/p95/p99)** | Response time distribution per command | Heatmap by command          |
-| **Transaction Throughput**    | Transactions processed per second      | Time series graph           |
-| **Consensus Round Duration**  | Time to complete consensus phases      | Histogram                   |
-| **Cross-Node Latency**        | Time for transaction to reach N nodes  | Line chart with percentiles |
-| **Error Rate**                | Failed transactions/RPC calls by type  | Stacked bar chart           |
+| Metric                        | Description                             | Dashboard Panel             |
+| ----------------------------- | --------------------------------------- | --------------------------- |
+| **RPC Latency (p50/p95/p99)** | Response time distribution per command  | Heatmap by command          |
+| **Transaction Throughput**    | Transactions processed per second       | Time series graph           |
+| **Consensus Round Duration**  | Time to complete consensus phases       | Histogram                   |
+| **Cross-Node Latency**        | Time for transaction to reach N nodes   | Line chart with percentiles |
+| **Error Rate**                | Failed transactions/RPC calls by type   | Stacked bar chart           |
+| **PathFinding Latency**       | Path computation time per currency pair | Heatmap by currency         |
+| **TxQ Depth**                 | Queued transactions over time           | Time series with thresholds |
+| **Fee Escalation Level**      | Current fee multiplier                  | Gauge with alert thresholds |
+| **Ledger Sync Duration**      | Time to acquire missing ledgers         | Histogram                   |

 ### 1.8.3 Concrete Dashboard Examples

-**Transaction Trace View (Jaeger/Tempo):**
+**Transaction Trace View (Tempo):**

 ```
 ┌────────────────────────────────────────────────────────────────────────────────┐
@@ -304,18 +437,22 @@ xychart-beta
    title "Consensus Round Duration (Last 24 Hours)"
    x-axis "Time of Day (Hours)" [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
    y-axis "Duration (seconds)" 1 --> 5
-    line [2.1, 2.3, 2.5, 2.4, 2.8, 1.6, 3.2, 3.0, 3.5, 1.3, 3.8, 3.6, 4.0, 3.2, 4.3, 4.1, 4.5, 4.3, 4.2, 2.4, 4.8, 4.6, 4.9, 4.7, 5.0, 4.9, 4.8, 2.6, 4.7, 4.5, 4.2, 4.0, 2.5, 3.7, 3.2, 3.4, 2.9, 3.1, 2.6, 2.8, 2.3, 1.5, 2.7, 2.4, 2.5, 2.3, 2.2, 2.1, 2.0]
+    line [2.1, 2.4, 2.8, 3.2, 3.8, 4.3, 4.5, 5.0, 4.7, 4.0, 3.2, 2.6, 2.0]
 ```

 ### 1.8.4 Operator Actionable Insights

-| Scenario              | What You'll See                                                              | Action                           |
-| --------------------- | ---------------------------------------------------------------------------- | -------------------------------- |
-| **Slow RPC**          | Span showing which phase is slow (parsing, execution, serialization)         | Optimize specific code path      |
-| **Transaction Stuck** | Trace stops at validation; error attribute shows reason                      | Fix transaction parameters       |
-| **Consensus Delay**   | Phase.establish taking too long; proposer attribute shows missing validators | Investigate network connectivity |
-| **Memory Spike**      | Large batch of spans correlating with memory increase                        | Tune batch_size or sampling      |
-| **Network Partition** | Traces missing cross-node links for specific peer                            | Check peer connectivity          |
+| Scenario                  | What You'll See                                                              | Action                                           |
+| ------------------------- | ---------------------------------------------------------------------------- | ------------------------------------------------ |
+| **Slow RPC**              | Span showing which phase is slow (parsing, execution, serialization)         | Optimize specific code path                      |
+| **Transaction Stuck**     | Trace stops at validation; error attribute shows reason                      | Fix transaction parameters                       |
+| **Consensus Delay**       | Phase.establish taking too long; proposer attribute shows missing validators | Investigate network connectivity                 |
+| **Memory Spike**          | Large batch of spans correlating with memory increase                        | Tune batch_size or sampling                      |
+| **Network Partition**     | Traces missing cross-node links for specific peer                            | Check peer connectivity                          |
+| **Path Computation Slow** | pathfind.compute span shows high latency; cache miss rate in attributes      | Warm the RippleLineCache, check order book depth |
+| **TxQ Full**              | txq.enqueue spans show evictions; fee.escalate spans increasing              | Monitor fee levels, alert operators              |
+| **Ledger Sync Stalled**   | ledger.acquire spans timing out; peer reliability attributes show issues     | Check peer connectivity, add trusted peers       |
+| **UNL Stale**             | validator.list.fetch spans failing; last_update attribute aging              | Verify validator site URLs, check DNS            |

 ### 1.8.5 Developer Debugging Workflow

--- a/OpenTelemetryPlan/02-design-decisions.md
+++ b/OpenTelemetryPlan/02-design-decisions.md
@@ -7,6 +7,8 @@

 ## 2.1 OpenTelemetry Components

+> **OTLP** = OpenTelemetry Protocol
+
 ### 2.1.1 SDK Selection

 **Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`)
@@ -32,6 +34,8 @@

 ## 2.2 Exporter Configuration

+> **OTLP** = OpenTelemetry Protocol
+
 ```mermaid
 flowchart TB
    subgraph nodes["rippled Nodes"]
@@ -43,8 +47,7 @@ flowchart TB
    collector["OpenTelemetry<br/>Collector<br/>(sidecar or standalone)"]

    subgraph backends["Observability Backends"]
-        jaeger["Jaeger<br/>(Dev)"]
-        tempo["Tempo<br/>(Prod)"]
+        tempo["Tempo"]
        elastic["Elastic<br/>APM"]
    end

@@ -52,7 +55,6 @@ flowchart TB
    node2 -->|"OTLP/gRPC<br/>:4317"| collector
    node3 -->|"OTLP/gRPC<br/>:4317"| collector

-    collector --> jaeger
    collector --> tempo
    collector --> elastic

@@ -61,6 +63,13 @@ flowchart TB
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **rippled Nodes (blue)**: The source of telemetry data. Each rippled node exports spans via OTLP/gRPC on port 4317.
+- **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing.
+- **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends.
+- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends.
+
 ### 2.2.1 OTLP/gRPC (Recommended)

 ```cpp
@@ -69,8 +78,8 @@ namespace otlp = opentelemetry::exporter::otlp;

 otlp::OtlpGrpcExporterOptions opts;
 opts.endpoint = "localhost:4317";
-opts.use_ssl_credentials = true;
-opts.ssl_credentials_cacert_path = "/path/to/ca.crt";
+opts.useTls = true;
+opts.sslCaCertPath = "/path/to/ca.crt";
 ```

 ### 2.2.2 OTLP/HTTP (Alternative)
@@ -88,6 +97,8 @@ opts.content_type = otlp::HttpRequestContentType::kJson;  // or kBinary

 ## 2.3 Span Naming Conventions

+> **TxQ** = Transaction Queue | **UNL** = Unique Node List | **WS** = WebSocket
+
 ### 2.3.1 Naming Schema

 ```
@@ -145,6 +156,36 @@ ledger:
  build: "Build new ledger"
  validate: "Ledger validation"
  close: "Close ledger"
+  replay: "Ledger replay executed"
+  delta: "Delta-based ledger acquired"
+
+# PathFinding Spans
+pathfind:
+  request: "Path request initiated"
+  compute: "Path computation executed"
+
+# TxQ Spans
+txq:
+  enqueue: "Transaction queued"
+  apply: "Queued transaction applied"
+
+# Fee/Load Spans
+fee:
+  escalate: "Fee escalation triggered"
+
+# Validator Spans
+validator:
+  list:
+    fetch: "UNL list fetched"
+  manifest: "Manifest update processed"
+
+# Amendment Spans
+amendment:
+  vote: "Amendment voting executed"
+
+# SHAMap Spans
+shamap:
+  sync: "State tree synchronization"

 # Job Spans
 job:
@@ -156,6 +197,8 @@ job:

 ## 2.4 Attribute Schema

+> **TxQ** = Transaction Queue | **UNL** = Unique Node List | **OTLP** = OpenTelemetry Protocol
+
 ### 2.4.1 Resource Attributes (Set Once at Startup)

 ```cpp
@@ -231,21 +274,75 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
 "xrpl.job.worker"        = int64    // Worker thread ID
 ```

+#### PathFinding Attributes
+
+```cpp
+"xrpl.pathfind.source_currency"  = string   // Source currency code
+"xrpl.pathfind.dest_currency"    = string   // Destination currency code
+"xrpl.pathfind.path_count"       = int64    // Number of paths found
+"xrpl.pathfind.cache_hit"        = bool     // RippleLineCache hit
+```
+
+#### TxQ Attributes
+
+```cpp
+"xrpl.txq.queue_depth"      = int64    // Current queue depth
+"xrpl.txq.fee_level"        = int64    // Fee level of transaction
+"xrpl.txq.eviction_reason"  = string   // Why transaction was evicted
+```
+
+#### Fee Attributes
+
+```cpp
+"xrpl.fee.load_factor"      = int64    // Current load factor
+"xrpl.fee.escalation_level" = int64    // Fee escalation multiplier
+```
+
+#### Validator Attributes
+
+```cpp
+"xrpl.validator.list_size"    = int64    // UNL size
+"xrpl.validator.list_age_sec" = int64    // Seconds since last update
+```
+
+#### Amendment Attributes
+
+```cpp
+"xrpl.amendment.name"         = string   // Amendment name
+"xrpl.amendment.status"       = string   // "enabled", "vetoed", "supported"
+```
+
+#### SHAMap Attributes
+
+```cpp
+"xrpl.shamap.type"            = string   // "transaction", "state", "account_state"
+"xrpl.shamap.missing_nodes"   = int64    // Number of missing nodes during sync
+"xrpl.shamap.duration_ms"     = float64  // Sync duration in milliseconds
+```
+
 ### 2.4.3 Data Collection Summary

 The following table summarizes what data is collected by category:

-| Category        | Attributes Collected                                                 | Purpose                     |
-| --------------- | -------------------------------------------------------------------- | --------------------------- |
-| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index`          | Trace transaction lifecycle |
-| **Consensus**   | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms`   | Analyze consensus timing    |
-| **RPC**         | `command`, `version`, `status`, `duration_ms`                        | Monitor RPC performance     |
-| **Peer**        | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis   |
-| **Ledger**      | `ledger.hash`, `ledger.index`, `close_time`, `tx_count`              | Ledger progression tracking |
-| **Job**         | `job.type`, `queue_ms`, `worker`                                     | JobQueue performance        |
+| Category        | Attributes Collected                                                   | Purpose                      |
+| --------------- | ---------------------------------------------------------------------- | ---------------------------- |
+| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index`            | Trace transaction lifecycle  |
+| **Consensus**   | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms`     | Analyze consensus timing     |
+| **RPC**         | `command`, `version`, `status`, `duration_ms`                          | Monitor RPC performance      |
+| **Peer**        | `peer.id` (public key), `latency_ms`, `message.type`, `message.size`   | Network topology analysis    |
+| **Ledger**      | `ledger.hash`, `ledger.index`, `close_time`, `tx_count`                | Ledger progression tracking  |
+| **Job**         | `job.type`, `queue_ms`, `worker`                                       | JobQueue performance         |
+| **PathFinding** | `pathfind.source_currency`, `dest_currency`, `path_count`, `cache_hit` | Payment path analysis        |
+| **TxQ**         | `txq.queue_depth`, `fee_level`, `eviction_reason`                      | Queue depth and fee tracking |
+| **Fee**         | `fee.load_factor`, `escalation_level`                                  | Fee escalation monitoring    |
+| **Validator**   | `validator.list_size`, `list_age_sec`                                  | UNL health monitoring        |
+| **Amendment**   | `amendment.name`, `status`                                             | Protocol upgrade tracking    |
+| **SHAMap**      | `shamap.type`, `missing_nodes`, `duration_ms`                          | State tree sync performance  |

 ### 2.4.4 Privacy & Sensitive Data Policy

+> **PII** = Personally Identifiable Information
+
 OpenTelemetry instrumentation is designed to collect **operational metadata only**, never sensitive content.

 #### Data NOT Collected
@@ -310,18 +407,22 @@ redact_account=1      # Hash account addresses before export
 redact_peer_address=1 # Remove peer IP addresses
 ```

+> **Note**: The `redact_account` configuration in `rippled.cfg` controls SDK-level redaction before export, while collector-level filtering (see [Collector-Level Data Protection](#collector-level-data-protection) above) provides an additional defense-in-depth layer. Both can operate independently.
+
 > **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts, raw payloads).

 ---

 ## 2.5 Context Propagation Design

+> **WS** = WebSocket
+
 ### 2.5.1 Propagation Boundaries

 ```mermaid
 flowchart TB
    subgraph http["HTTP/WebSocket (RPC)"]
-        w3c["W3C Trace Context Headers:<br/>traceparent: 00-{trace_id}-{span_id}-{flags}<br/>tracestate: rippled=<state>"]
+        w3c["W3C Trace Context Headers:<br/>traceparent:<br/>00-trace_id-span_id-flags<br/>tracestate: rippled=..."]
    end

    subgraph protobuf["Protocol Buffers (P2P)"]
@@ -329,7 +430,7 @@ flowchart TB
    end

    subgraph jobqueue["JobQueue (Internal Async)"]
-        job["Context captured at job creation,<br/>restored at execution<br/><br/>class Job {<br/>  opentelemetry::context::Context traceContext_;<br/>};"]
+        job["Context captured at job creation,<br/>restored at execution<br/><br/>class Job {<br/>  otel::context::Context<br/>    traceContext_;<br/>};"]
    end

    style http fill:#0d47a1,stroke:#082f6a,color:#ffffff
@@ -337,10 +438,18 @@ flowchart TB
    style jobqueue fill:#bf360c,stroke:#8c2809,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **HTTP/WebSocket - RPC (blue)**: For client-facing RPC requests, trace context is propagated using the W3C `traceparent` header. This is the standard approach and works with any OTel-compatible client.
+- **Protocol Buffers - P2P (green)**: For peer-to-peer messages between rippled nodes, trace context is embedded as a protobuf `TraceContext` message carrying trace_id, span_id, flags, and optional trace_state.
+- **JobQueue - Internal Async (red)**: For asynchronous work within a single node, the OTel context is captured when a job is created and restored when the job executes on a worker thread. This bridges the async gap so spans remain linked.
+
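The JobQueue boundary described above (capture at job creation, restore at execution) can be sketched without the OTel SDK. This is a minimal illustration under stated assumptions: `TraceContext` is a plain stand-in for `opentelemetry::context::Context`, and `currentContext`, `ContextScope`, and `TracedJob` are hypothetical names, not rippled's actual `Job` class. The thread-local slot plays the role of the SDK's per-thread runtime context, and the RAII guard puts the worker's previous context back when the job finishes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Stand-in for the active OTel context; a real build would hold
// opentelemetry::context::Context instead of plain strings.
struct TraceContext
{
    std::string traceId;
    std::string spanId;
};

// Per-thread "current" context, analogous to a thread-local runtime context.
inline TraceContext&
currentContext()
{
    thread_local TraceContext ctx;
    return ctx;
}

// RAII guard: installs a context on this thread and restores the previous
// one when the guard goes out of scope.
class ContextScope
{
public:
    explicit ContextScope(TraceContext ctx)
        : previous_(std::exchange(currentContext(), std::move(ctx)))
    {
    }

    ~ContextScope()
    {
        currentContext() = std::move(previous_);
    }

    ContextScope(ContextScope const&) = delete;
    ContextScope&
    operator=(ContextScope const&) = delete;

private:
    TraceContext previous_;
};

// Hypothetical job wrapper: captures the submitter's context at construction
// and restores it around execution on whichever thread runs the job.
class TracedJob
{
public:
    explicit TracedJob(std::function<void()> work)
        : traceContext_(currentContext()), work_(std::move(work))
    {
    }

    void
    operator()() const
    {
        assert(work_);  // a job without work is a programming error
        ContextScope scope(traceContext_);
        work_();
    }

private:
    TraceContext traceContext_;
    std::function<void()> work_;
};
```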
 ---

 ## 2.6 Integration with Existing Observability

+> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket
+
 ### 2.6.1 Existing Frameworks Comparison

 rippled already has two observability mechanisms. OpenTelemetry complements (not replaces) them:
@@ -422,7 +531,7 @@ span->SetAttribute("peer.id", peerId);

 | Scenario                                | PerfLog    | StatsD | OpenTelemetry |
 | --------------------------------------- | ---------- | ------ | ------------- |
-| "How many TXs per second?"              | ❌         | ✅     | ❌            |
+| "How many TXs per second?"              | ❌         | ✅     | ✅            |
 | "What's the p99 RPC latency?"           | ❌         | ✅     | ✅            |
 | "Why was this specific TX slow?"        | ⚠️ partial | ❌     | ✅            |
 | "Which node delayed consensus?"         | ❌         | ❌     | ✅            |
@@ -451,6 +560,14 @@ flowchart TB
    style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **rippled Process (dark gray)**: The single rippled node running all three observability frameworks side by side. Each framework operates independently with no interference.
+- **PerfLog to perf.log**: PerfLog writes JSON-formatted event logs to a local file. Grafana can ingest these via Loki or a file-based datasource.
+- **Beast Insight to StatsD Server**: Insight sends aggregated metrics (counters, gauges) over UDP to a StatsD server. Grafana reads from StatsD-compatible backends like Graphite or Prometheus (via StatsD exporter).
+- **OpenTelemetry to OTLP Collector**: OTel exports spans over OTLP/gRPC to a Collector, which then forwards to a trace backend (Tempo).
+- **Grafana (red, unified UI)**: All three data streams converge in Grafana, enabling operators to correlate logs, metrics, and traces in a single dashboard.
+
 ### 2.6.5 Correlation with PerfLog

 Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging:
--- a/OpenTelemetryPlan/03-implementation-strategy.md
+++ b/OpenTelemetryPlan/03-implementation-strategy.md
@@ -81,12 +81,14 @@ flowchart TB

 ## 3.3 Performance Overhead Summary

-| Metric        | Overhead   | Notes                               |
-| ------------- | ---------- | ----------------------------------- |
-| CPU           | 1-3%       | Span creation and attribute setting |
-| Memory        | 2-5 MB     | Batch buffer for pending spans      |
-| Network       | 10-50 KB/s | Compressed OTLP export to collector |
-| Latency (p99) | <2%        | With proper sampling configuration  |
+> **OTLP** = OpenTelemetry Protocol
+
+| Metric        | Overhead   | Notes                                            |
+| ------------- | ---------- | ------------------------------------------------ |
+| CPU           | 1-3%       | Of per-transaction CPU cost (~200μs baseline)    |
+| Memory        | ~10 MB     | SDK statics + batch buffer + worker thread stack |
+| Network       | 10-50 KB/s | Compressed OTLP export to collector              |
+| Latency (p99) | <2%        | With proper sampling configuration               |

 ---

@@ -94,17 +96,26 @@ flowchart TB

 ### 3.4.1 Per-Operation Costs

+> **Note on hardware assumptions**: The costs below are based on the official OTel C++ SDK CI benchmarks
+> (969 runs on GitHub Actions 2-core shared runners). On production server hardware (3+ GHz Xeon),
+> expect costs at the **lower end** of each range (~30-50% improvement over CI hardware).
+
 | Operation             | Time (ns) | Frequency              | Impact     |
 | --------------------- | --------- | ---------------------- | ---------- |
-| Span creation         | 200-500   | Every traced operation | Low        |
+| Span creation         | 500-1000  | Every traced operation | Low        |
 | Span end              | 100-200   | Every traced operation | Low        |
 | SetAttribute (string) | 80-120    | 3-5 per span           | Low        |
 | SetAttribute (int)    | 40-60     | 2-3 per span           | Negligible |
-| AddEvent              | 50-80     | 0-2 per span           | Negligible |
+| AddEvent              | 100-200   | 0-2 per span           | Low        |
 | Context injection     | 150-250   | Per outgoing message   | Low        |
 | Context extraction    | 100-180   | Per incoming message   | Low        |
 | GetCurrent context    | 10-20     | Thread-local access    | Negligible |

+**Source**: Span creation based on OTel C++ SDK `BM_SpanCreation` benchmark (AlwaysOnSampler +
+SimpleSpanProcessor + InMemoryExporter), median ~1,000 ns on CI hardware. AddEvent includes
+timestamp read + string copy + vector push + mutex acquisition. Context injection/extraction
+confirmed by `BM_SpanCreationWithScope` benchmark delta (~160 ns).
+
 ### 3.4.2 Transaction Processing Overhead

 <div align="center">
@@ -112,67 +123,91 @@ flowchart TB
 ```mermaid
 %%{init: {'pie': {'textPosition': 0.75}}}%%
 pie showData
-    "tx.receive (800ns)" : 800
-    "tx.validate (500ns)" : 500
-    "tx.relay (500ns)" : 500
-    "Context inject (600ns)" : 600
+    "tx.receive (1400ns)" : 1400
+    "tx.validate (1200ns)" : 1200
+    "tx.relay (1200ns)" : 1200
+    "Context inject (200ns)" : 200
 ```

-**Transaction Tracing Overhead (~2.4μs total)**
+**Transaction Tracing Overhead (~4.0μs total)**

 </div>

-**Overhead percentage**: 2.4 μs / 200 μs (avg tx processing) = **~1.2%**
+**Overhead percentage**: 4.0 μs / 200 μs (avg tx processing) = **~2.0%**
+
+> **Breakdown**: Each span (tx.receive, tx.validate, tx.relay) costs ~1,000 ns for creation plus
+> ~200-400 ns for 3-5 attribute sets. Context injection is ~200 ns (confirmed by benchmarks).
+> On production hardware, expect ~2.6 μs total (~1.3% overhead) due to faster span creation (~500-600 ns).
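
The breakdown above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from rippled or the OTel SDK: `SpanCost`, `tracingCostNs`, `overheadPercent`, and `kCiSpans` are hypothetical names, and the per-span attribute costs (400/200/200 ns) are assumed values chosen to reproduce the pie-chart slices rather than measurements.

```cpp
#include <array>
#include <cassert>

// Hypothetical cost model for one traced transaction (illustrative only).
struct SpanCost
{
    int creationNs;   // span creation, ~1,000 ns on CI hardware
    int attributeNs;  // assumed total cost of the span's SetAttribute calls
};

// Sum the per-span costs and add the one-off context injection cost.
constexpr int
tracingCostNs(std::array<SpanCost, 3> const& spans, int contextInjectNs)
{
    int total = contextInjectNs;
    for (auto const& s : spans)
        total += s.creationNs + s.attributeNs;
    return total;
}

// Tracing cost as a percentage of the baseline processing time.
constexpr double
overheadPercent(int tracingNs, int baselineNs)
{
    assert(baselineNs > 0);
    return 100.0 * tracingNs / baselineNs;
}

// tx.receive, tx.validate, tx.relay, matching the pie-chart slices.
constexpr std::array<SpanCost, 3> kCiSpans{{{1000, 400}, {1000, 200}, {1000, 200}}};

// 1400 + 1200 + 1200 + 200 ns of context injection = 4000 ns.
static_assert(tracingCostNs(kCiSpans, 200) == 4000);
```

Calling `overheadPercent(4000, 200000)` against the ~200 μs baseline gives the 2.0% figure quoted above; substituting lower creation costs models the production-hardware case.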

 ### 3.4.3 Consensus Round Overhead

 | Operation              | Count | Cost (ns) | Total      |
 | ---------------------- | ----- | --------- | ---------- |
-| consensus.round span   | 1     | ~1000     | ~1 μs      |
-| consensus.phase spans  | 3     | ~700      | ~2.1 μs    |
-| proposal.receive spans | ~20   | ~600      | ~12 μs     |
-| proposal.send spans    | ~3    | ~600      | ~1.8 μs    |
+| consensus.round span   | 1     | ~1200     | ~1.2 μs    |
+| consensus.phase spans  | 3     | ~1100     | ~3.3 μs    |
+| proposal.receive spans | ~20   | ~1100     | ~22 μs     |
+| proposal.send spans    | ~3    | ~1100     | ~3.3 μs    |
 | Context operations     | ~30   | ~200      | ~6 μs      |
-| **TOTAL**              |       |           | **~23 μs** |
+| **TOTAL**              |       |           | **~36 μs** |

-**Overhead percentage**: 23 μs / 3s (typical round) = **~0.0008%** (negligible)
+> **Why higher**: Each span costs ~1,000 ns creation + ~100-200 ns for 1-2 attributes, totaling ~1,100-1,200 ns.
+> Context operations remain ~200 ns (confirmed by benchmarks). On production hardware, expect ~24 μs total.
+
+**Overhead percentage**: 36 μs / 3s (typical round) = **~0.001%** (negligible)

 ### 3.4.4 RPC Request Overhead

 | Operation        | Cost (ns)    |
 | ---------------- | ------------ |
-| rpc.request span | ~700         |
-| rpc.command span | ~600         |
+| rpc.request span | ~1200        |
+| rpc.command span | ~1100        |
 | Context extract  | ~250         |
 | Context inject   | ~200         |
-| **TOTAL**        | **~1.75 μs** |
+| **TOTAL**        | **~2.75 μs** |

- Fast RPC (1ms): 1.75 μs / 1ms = **~0.175%**
- Slow RPC (100ms): 1.75 μs / 100ms = **~0.002%**
+> **Why higher**: Each span costs ~1,000 ns creation + ~100-200 ns for attributes (command name,
+> version, role). Context extract/inject costs are confirmed by OTel C++ benchmarks.
+
+- Fast RPC (1ms): 2.75 μs / 1ms = **~0.275%**
+- Slow RPC (100ms): 2.75 μs / 100ms = **~0.003%**

 ---

 ## 3.5 Memory Overhead Analysis

+> **OTLP** = OpenTelemetry Protocol
+
 ### 3.5.1 Static Memory

-| Component                | Size        | Allocated  |
-| ------------------------ | ----------- | ---------- |
-| TracerProvider singleton | ~64 KB      | At startup |
-| BatchSpanProcessor       | ~128 KB     | At startup |
-| OTLP exporter            | ~256 KB     | At startup |
-| Propagator registry      | ~8 KB       | At startup |
-| **Total static**         | **~456 KB** |            |
+| Component                            | Size        | Allocated  |
+| ------------------------------------ | ----------- | ---------- |
+| TracerProvider singleton             | ~64 KB      | At startup |
+| BatchSpanProcessor (circular buffer) | ~16 KB      | At startup |
+| BatchSpanProcessor (worker thread)   | ~8 MB       | At startup |
+| OTLP exporter (gRPC channel init)    | ~256 KB     | At startup |
+| Propagator registry                  | ~8 KB       | At startup |
+| **Total static**                     | **~8.3 MB** |            |
+
+> **Why higher than earlier estimate**: The BatchSpanProcessor's circular buffer itself is only ~16 KB
+> (2049 x 8-byte `AtomicUniquePtr` entries), but it spawns a dedicated worker thread whose default
+> stack size on Linux is ~8 MB. The OTLP gRPC exporter allocates memory for channel stubs and TLS
+> initialization. The worker thread stack dominates the static footprint.
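
The ~8 MB stack figure depends on the platform's default thread attributes, so it is worth confirming on the target host. The sketch below is an assumption-laden illustration: `defaultThreadStackBytes` is a hypothetical helper, it only uses standard POSIX `pthread_attr_*` calls, and it presumes the worker thread is created with default attributes. The reported value varies with the C library and the `ulimit -s` setting.

```cpp
#include <cstddef>
#include <pthread.h>

// Returns the stack size reported by a default-initialized pthread
// attribute object, or 0 if the attributes could not be queried.
// Hypothetical helper; not part of rippled or the OTel SDK.
std::size_t
defaultThreadStackBytes()
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0)
        return 0;

    std::size_t bytes = 0;
    if (pthread_attr_getstacksize(&attr, &bytes) != 0)
        bytes = 0;

    pthread_attr_destroy(&attr);
    return bytes;
}
```

Logging this value once at startup would show whether the static footprint on a given deployment is closer to 8 MB or to a smaller libc default.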

 ### 3.5.2 Dynamic Memory

-| Component            | Size per unit | Max units  | Peak        |
-| -------------------- | ------------- | ---------- | ----------- |
-| Active span          | ~200 bytes    | 1000       | ~200 KB     |
-| Queued span (export) | ~500 bytes    | 2048       | ~1 MB       |
-| Attribute storage    | ~50 bytes     | 5 per span | Included    |
-| Context storage      | ~64 bytes     | Per thread | ~6.4 KB     |
-| **Total dynamic**    |               |            | **~1.2 MB** |
+| Component            | Size per unit  | Max units  | Peak            |
+| -------------------- | -------------- | ---------- | --------------- |
+| Active span          | ~500-800 bytes | 1000       | ~500-800 KB     |
+| Queued span (export) | ~500 bytes     | 2048       | ~1 MB           |
+| Attribute storage    | ~80 bytes      | 5 per span | Included        |
+| Context storage      | ~64 bytes      | Per thread | ~6.4 KB         |
+| **Total dynamic**    |                |            | **~1.5-1.8 MB** |
+
+> **Why active spans are larger**: An active `Span` object includes the wrapper (~88 bytes: shared_ptr,
+> mutex, unique_ptr to Recordable) plus `SpanData` (~250 bytes: SpanContext, timestamps, name, status,
+> empty containers) plus attribute storage (~200-500 bytes for 3-5 string attributes in a `std::map`).
+> Source: `sdk/src/trace/span.h` and `sdk/include/opentelemetry/sdk/trace/span_data.h`.
+> Queued spans release the wrapper, keeping only `SpanData` + attributes (~500 bytes).

 ### 3.5.3 Memory Growth Characteristics

@@ -184,18 +219,34 @@ config:
        height: 400
 ---
 xychart-beta
-    title "Memory Usage vs Span Rate"
+    title "Memory Usage vs Span Rate (bounded by queue limit)"
    x-axis "Spans/second" [0, 200, 400, 600, 800, 1000]
-    y-axis "Memory (MB)" 0 --> 6
-    line [1, 1.8, 2.6, 3.4, 4.2, 5]
+    y-axis "Memory (MB)" 0 --> 12
+    line [8.5, 9.2, 9.6, 9.9, 10.0, 10.0]
 ```

 **Notes**:

- Memory increases linearly with span rate
+- Memory increases with span rate but **plateaus at queue capacity** (default 2048 spans)
 - Batch export prevents unbounded growth
- Queue size is configurable (default 2048 spans)
 - At queue limit, oldest spans are dropped (not blocked)
+- Maximum memory is bounded: ~8.3 MB static (dominated by the worker thread stack) + 2048 queued spans x ~500 bytes (~1 MB) + active spans (~0.8 MB) ≈ **10 MB ceiling**
+- The worker thread stack (~8 MB) is reserved virtual memory; resident memory (RSS) grows only as stack pages are touched and is typically far smaller
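
The ceiling can be reproduced from the table values in Sections 3.5.1 and 3.5.2. This is a rough sketch under stated assumptions: all names are hypothetical, the worker stack is taken as exactly 8 x 1024 KiB, and active spans use the upper estimate of 800 bytes each.

```cpp
#include <cstdint>

// Hypothetical constants copied from the static and dynamic memory tables.
constexpr std::int64_t kKiB = 1024;
constexpr std::int64_t kStaticBytes =
    (64 + 16 + 8 * 1024 + 256 + 8) * kKiB;  // provider, buffer, stack, exporter, registry
constexpr std::int64_t kQueuedSpanBytes = 500;  // SpanData + attributes
constexpr std::int64_t kActiveSpanBytes = 800;  // upper estimate per active span

// Upper bound on tracing memory for a given queue depth and active-span count.
constexpr std::int64_t
memoryCeilingBytes(std::int64_t queuedSpans, std::int64_t activeSpans)
{
    return kStaticBytes + queuedSpans * kQueuedSpanBytes +
        activeSpans * kActiveSpanBytes;
}

// Convert a byte count to MiB for reporting.
constexpr double
toMiB(std::int64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Default queue (2048 spans) plus 1000 active spans.
static_assert(memoryCeilingBytes(2048, 1000) == 10'564'864);
```

With the default queue size the result is just over 10 MiB, and because the queue depth is capped, raising the span rate further does not raise the bound.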
+
+### 3.5.4 Performance Data Sources
+
+The overhead estimates in Sections 3.3-3.6 are derived from the following sources:
+
+| Source                                           | What it covers                                        | URL                                                                                                                                        |
+| ------------------------------------------------ | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| OTel C++ SDK CI benchmarks (969 runs)            | Span creation, context activation, sampler overhead   | [Benchmark Dashboard](https://open-telemetry.github.io/opentelemetry-cpp/benchmarks/)                                                      |
+| `api/test/trace/span_benchmark.cc`               | API-level span creation (~22 ns no-op)                | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/api/test/trace/span_benchmark.cc)                                   |
+| `sdk/test/trace/sampler_benchmark.cc`            | SDK span creation with samplers (~1,000 ns AlwaysOn)  | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/test/trace/sampler_benchmark.cc)                                |
+| `sdk/include/.../span_data.h`                    | SpanData memory layout (~250 bytes base)              | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/trace/span_data.h)                    |
+| `sdk/src/trace/span.h`                           | Span wrapper memory layout (~88 bytes)                | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/trace/span.h)                                               |
+| `sdk/include/.../batch_span_processor_options.h` | Default queue size (2048), batch size (512)           | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/trace/batch_span_processor_options.h) |
+| `sdk/include/.../circular_buffer.h`              | CircularBuffer implementation (AtomicUniquePtr array) | [Source](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/include/opentelemetry/sdk/common/circular_buffer.h)             |
+| OTLP proto definition                            | Serialized span size estimation                       | [Proto](https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto)                          |

 ---

@@ -203,6 +254,11 @@ xychart-beta

 ### 3.6.1 Export Bandwidth

+> **Bytes per span**: The table uses ~500 bytes/span as a conservative upper bound for capacity
+> planning. OTLP protobuf analysis shows a typical span with 3-5 string attributes serializes to
+> ~200-300 bytes raw; gzip shrinks that to ~60-70% of raw (~120-210 bytes), and batching amortizes
+> per-request headers, so the realistic on-wire cost sits well below the 500-byte planning figure.
+
 | Sampling Rate | Spans/sec | Bandwidth | Notes            |
 | ------------- | --------- | --------- | ---------------- |
 | 100%          | ~500      | ~250 KB/s | Development only |
@@ -214,10 +270,10 @@ xychart-beta

 | Message Type           | Context Size | Messages/sec | Overhead    |
 | ---------------------- | ------------ | ------------ | ----------- |
-| TMTransaction          | 32 bytes     | ~100         | ~3.2 KB/s   |
-| TMProposeSet           | 32 bytes     | ~10          | ~320 B/s    |
-| TMValidation           | 32 bytes     | ~50          | ~1.6 KB/s   |
-| **Total P2P overhead** |              |              | **~5 KB/s** |
+| TMTransaction          | 25 bytes     | ~100         | ~2.5 KB/s   |
+| TMProposeSet           | 25 bytes     | ~10          | ~250 B/s    |
+| TMValidation           | 25 bytes     | ~50          | ~1.25 KB/s  |
+| **Total P2P overhead** |              |              | **~4 KB/s** |
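
The 25-byte context size is consistent with a 16-byte trace ID, an 8-byte span ID, and a 1-byte flags field. That decomposition is an assumption made for this sketch (the table states only the total), the names below are hypothetical, and protobuf field tags and length prefixes are not counted.

```cpp
#include <cstddef>

// Assumed payload widths of the propagated trace context.
constexpr std::size_t kTraceIdBytes = 16;    // 128-bit trace ID
constexpr std::size_t kSpanIdBytes = 8;      // 64-bit span ID
constexpr std::size_t kTraceFlagsBytes = 1;  // 8-bit trace flags
constexpr std::size_t kContextBytes =
    kTraceIdBytes + kSpanIdBytes + kTraceFlagsBytes;

// Extra bytes per second added to peer traffic at a given message rate.
constexpr std::size_t
p2pOverheadBytesPerSec(std::size_t messagesPerSec)
{
    return kContextBytes * messagesPerSec;
}

static_assert(kContextBytes == 25);
// TMTransaction (~100/s) + TMProposeSet (~10/s) + TMValidation (~50/s).
static_assert(p2pOverheadBytesPerSec(100 + 10 + 50) == 4000);
```

The combined rate of ~160 messages per second yields the ~4 KB/s total in the table.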

 ---

@@ -225,6 +281,8 @@ xychart-beta

 ### 3.7.1 Sampling Strategies

+#### Tail Sampling
+
 ```mermaid
 flowchart TD
    trace["New Trace"]
@@ -284,6 +342,8 @@ if (telemetry.shouldTracePeer())

 ## 3.9 Code Intrusiveness Assessment

+> **TxQ** = Transaction Queue
+
 This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing rippled codebase.

 ### 3.9.1 Files Modified Summary
@@ -297,7 +357,10 @@ This section provides a detailed assessment of how intrusive the OpenTelemetry i
 | **Consensus**         | 3 files        | ~100        | ~30           | Low-Medium           |
 | **Protocol Buffers**  | 1 file         | ~25         | 0             | Low                  |
 | **CMake/Build**       | 3 files        | ~50         | ~10           | Minimal              |
-| **Total**             | **~21 files**  | **~1,205**  | **~105**      | **Low**              |
+| **PathFinding**       | 2 files        | ~80         | ~5            | Minimal              |
+| **TxQ/Fee**           | 2 files        | ~60         | ~5            | Minimal              |
+| **Validator/Amend**   | 3 files        | ~40         | ~5            | Minimal              |
+| **Total**             | **~28 files**  | **~1,385**  | **~120**      | **Low**              |

 ### 3.9.2 Detailed File Impact

@@ -307,6 +370,9 @@ pie title Code Changes by Component
    "Transaction Relay" : 160
    "Consensus" : 130
    "RPC Layer" : 100
+    "PathFinding" : 80
+    "TxQ/Fee" : 60
+    "Validator/Amendment" : 40
    "Application Init" : 35
    "Protocol Buffers" : 25
    "Build System" : 60
@@ -337,6 +403,14 @@ pie title Code Changes by Component
 | `src/xrpld/app/consensus/RCLConsensus.cpp`        | ~50         | ~15           | Medium     |
 | `src/xrpld/app/consensus/RCLConsensusAdaptor.cpp` | ~40         | ~12           | Medium     |
 | `src/xrpld/core/JobQueue.cpp`                     | ~20         | ~5            | Low        |
+| `src/xrpld/app/paths/PathRequest.cpp`             | ~40         | ~3            | Low        |
+| `src/xrpld/app/paths/Pathfinder.cpp`              | ~40         | ~2            | Low        |
+| `src/xrpld/app/misc/TxQ.cpp`                      | ~40         | ~3            | Low        |
+| `src/xrpld/app/main/LoadManager.cpp`              | ~20         | ~2            | Low        |
+| `src/xrpld/app/misc/ValidatorList.cpp`            | ~20         | ~2            | Low        |
+| `src/xrpld/app/misc/AmendmentTable.cpp`           | ~10         | ~2            | Low        |
+| `src/xrpld/app/misc/Manifest.cpp`                 | ~10         | ~1            | Low        |
+| `src/xrpld/shamap/SHAMap.cpp`                     | ~20         | ~3            | Low        |
 | `src/xrpld/overlay/detail/ripple.proto`           | ~25         | 0             | Low        |
 | `CMakeLists.txt`                                  | ~40         | ~8            | Low        |
 | `cmake/FindOpenTelemetry.cmake`                   | ~50         | 0             | None (new) |
@@ -353,12 +427,15 @@ quadrantChart
    x-axis Low Risk --> High Risk
    y-axis Low Value --> High Value

-    RPC Tracing: [0.2, 0.8]
-    Transaction Relay: [0.5, 0.9]
-    Consensus Tracing: [0.7, 0.95]
-    Peer Message Tracing: [0.8, 0.4]
-    JobQueue Context: [0.4, 0.5]
-    Ledger Acquisition: [0.5, 0.6]
+    RPC Tracing: [0.2, 0.55]
+    Transaction Relay: [0.55, 0.85]
+    Consensus Tracing: [0.75, 0.92]
+    Peer Message Tracing: [0.85, 0.35]
+    JobQueue Context: [0.3, 0.42]
+    Ledger Acquisition: [0.48, 0.65]
+    PathFinding: [0.38, 0.72]
+    TxQ and Fees: [0.25, 0.62]
+    Validator Mgmt: [0.15, 0.35]
 ```

 **Optional** ↙ ↘ **Avoid**
@@ -375,15 +452,15 @@ quadrantChart

 ### 3.9.4 Architectural Impact Assessment

-| Aspect               | Impact  | Justification                                                         |
-| -------------------- | ------- | --------------------------------------------------------------------- |
-| **Data Flow**        | None    | Tracing is purely observational; no business logic changes            |
-| **Threading Model**  | Minimal | Context propagation uses thread-local storage (standard OTel pattern) |
-| **Memory Model**     | Low     | Bounded queues prevent unbounded growth; RAII ensures cleanup         |
-| **Network Protocol** | Low     | Optional fields in protobuf (high field numbers); backward compatible |
-| **Configuration**    | None    | New config section; existing configs unaffected                       |
-| **Build System**     | Low     | Optional CMake flag; builds work without OpenTelemetry                |
-| **Dependencies**     | Low     | OpenTelemetry SDK is optional; null implementation when disabled      |
+| Aspect               | Impact  | Justification                                                                    |
+| -------------------- | ------- | -------------------------------------------------------------------------------- |
+| **Data Flow**        | Minimal | Read-only instrumentation; no modification to consensus or transaction data flow |
+| **Threading Model**  | Minimal | Context propagation uses thread-local storage (standard OTel pattern)            |
+| **Memory Model**     | Low     | Bounded queues prevent unbounded growth; RAII ensures cleanup                    |
+| **Network Protocol** | Low     | Optional fields in protobuf (high field numbers); backward compatible            |
+| **Configuration**    | None    | New config section; existing configs unaffected                                  |
+| **Build System**     | Low     | Optional CMake flag; builds work without OpenTelemetry                           |
+| **Dependencies**     | Low     | OpenTelemetry SDK is optional; null implementation when disabled                 |

 ### 3.9.5 Backward Compatibility

--- a/OpenTelemetryPlan/04-code-samples.md
+++ b/OpenTelemetryPlan/04-code-samples.md
@@ -7,6 +7,8 @@

 ## 4.1 Core Interfaces

+> **OTLP** = OpenTelemetry Protocol
+
 ### 4.1.1 Main Telemetry Interface

 ```cpp
@@ -69,6 +71,10 @@ public:
        bool traceRpc = true;
        bool tracePeer = false;  // High volume, disabled by default
        bool traceLedger = true;
+        bool tracePathfind = true;
+        bool traceTxQ = true;
+        bool traceValidator = false;  // Low volume, disabled by default
+        bool traceAmendment = false;  // Very low volume, disabled by default
    };

    virtual ~Telemetry() = default;
@@ -140,6 +146,21 @@ public:

    /** Check if peer message tracing is enabled */
    virtual bool shouldTracePeer() const = 0;
+
+    /** Check if ledger tracing is enabled */
+    virtual bool shouldTraceLedger() const = 0;
+
+    /** Check if path finding tracing is enabled */
+    virtual bool shouldTracePathfind() const = 0;
+
+    /** Check if transaction queue tracing is enabled */
+    virtual bool shouldTraceTxQ() const = 0;
+
+    /** Check if validator list/manifest tracing is enabled */
+    virtual bool shouldTraceValidator() const = 0;
+
+    /** Check if amendment voting tracing is enabled */
+    virtual bool shouldTraceAmendment() const = 0;
 };

 // Factory functions
@@ -191,11 +212,17 @@ public:
    /**
     * Construct guard with span.
     * The span becomes the current span in thread-local context.
+     *
+     * @note If span is nullptr (e.g., telemetry disabled), the guard
+     * becomes a no-op. All methods safely check for null before access.
     */
    explicit SpanGuard(
        opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span)
-        : span_(std::move(span))
-        , scope_(span_)
+        : span_(std::move(span))  // moving a null span leaves span_ null
+        , scope_(span_ ? opentelemetry::trace::Scope(span_)
+                       : opentelemetry::trace::Scope(
+                           opentelemetry::nostd::shared_ptr<
+                               opentelemetry::trace::Span>(nullptr)))
    {
    }

@@ -277,6 +304,12 @@ public:

    void addEvent(std::string_view) {}
    void recordException(std::exception const&) {}
+
+    /** Return a default empty context (matches SpanGuard interface) */
+    opentelemetry::context::Context context() const
+    {
+        return opentelemetry::context::Context{};
+    }
 };

 } // namespace telemetry
@@ -332,17 +365,66 @@ namespace telemetry {
        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
    }

-// Set attribute on current span (if exists)
-#define XRPL_TRACE_SET_ATTR(key, value) \
-    if (_xrpl_guard_.has_value()) { \
-        _xrpl_guard_->setAttribute(key, value); \
+#define XRPL_TRACE_PEER(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTracePeer()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
    }

+#define XRPL_TRACE_LEDGER(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTraceLedger()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
+    }
+
+#define XRPL_TRACE_PATHFIND(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTracePathfind()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
+    }
+
+#define XRPL_TRACE_TXQ(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTraceTxQ()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
+    }
+
+#define XRPL_TRACE_VALIDATOR(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTraceValidator()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
+    }
+
+#define XRPL_TRACE_AMENDMENT(telemetry, name) \
+    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
+    if ((telemetry).shouldTraceAmendment()) { \
+        _xrpl_guard_.emplace((telemetry).startSpan(name)); \
+    }
+
+// Set attribute on current span (if exists).
+// Works with both std::optional<SpanGuard> (from conditional macros)
+// and bare SpanGuard (from XRPL_TRACE_SPAN). The dispatch runs inside a
+// generic lambda so that 'if constexpr' discards the branch that does not
+// apply; outside a template both branches would have to compile.
+#define XRPL_TRACE_SET_ATTR(key, value) \
+    do { \
+        [&](auto& xrplGuard_) { \
+            if constexpr (requires { xrplGuard_.has_value(); }) { \
+                if (xrplGuard_.has_value()) \
+                    xrplGuard_->setAttribute(key, value); \
+            } else { \
+                xrplGuard_.setAttribute(key, value); \
+            } \
+        }(_xrpl_guard_); \
+    } while(0)
+
 // Record exception on current span
 #define XRPL_TRACE_EXCEPTION(e) \
-    if (_xrpl_guard_.has_value()) { \
-        _xrpl_guard_->recordException(e); \
-    }
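+    do { \
+        /* Generic lambda: makes the guard type dependent for 'if constexpr' */ \
+        [&](auto& xrplGuard_) { \
+            if constexpr (requires { xrplGuard_.has_value(); }) { \
+                if (xrplGuard_.has_value()) \
+                    xrplGuard_->recordException(e); \
+            } else { \
+                xrplGuard_.recordException(e); \
+            } \
+        }(_xrpl_guard_); \
+    } while(0)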

 #else  // XRPL_ENABLE_TELEMETRY not defined

@@ -351,6 +433,12 @@ namespace telemetry {
 #define XRPL_TRACE_TX(telemetry, name) ((void)0)
 #define XRPL_TRACE_CONSENSUS(telemetry, name) ((void)0)
 #define XRPL_TRACE_RPC(telemetry, name) ((void)0)
+#define XRPL_TRACE_PEER(telemetry, name) ((void)0)
+#define XRPL_TRACE_LEDGER(telemetry, name) ((void)0)
+#define XRPL_TRACE_PATHFIND(telemetry, name) ((void)0)
+#define XRPL_TRACE_TXQ(telemetry, name) ((void)0)
+#define XRPL_TRACE_VALIDATOR(telemetry, name) ((void)0)
+#define XRPL_TRACE_AMENDMENT(telemetry, name) ((void)0)
 #define XRPL_TRACE_SET_ATTR(key, value) ((void)0)
 #define XRPL_TRACE_EXCEPTION(e) ((void)0)

@@ -369,6 +457,9 @@ namespace telemetry {
 Add to `src/xrpld/overlay/detail/ripple.proto`:

 ```protobuf
+// Note: rippled uses proto2 syntax, where every field must declare a rule
+// (required, optional, or repeated); the fields below use 'optional'.
+
 // Trace context for distributed tracing across nodes
 // Uses W3C Trace Context format internally
 message TraceContext {
@@ -423,6 +514,8 @@ message TMLedgerData {
 #pragma once

 #include <opentelemetry/context/context.h>
+#include <opentelemetry/trace/context.h>
+#include <opentelemetry/trace/default_span.h>
 #include <opentelemetry/trace/span_context.h>
 #include <protocol/messages.h>  // Generated protobuf

@@ -480,7 +573,14 @@ TraceContextPropagator::extract(protocol::TraceContext const& proto)
    using namespace opentelemetry::trace;

    if (proto.trace_id().size() != 16 || proto.span_id().size() != 8)
-        return opentelemetry::context::Context{};  // Invalid, return empty
+    {
+        // Log malformed trace context for debugging. Silent failures in
+        // context extraction make distributed tracing issues hard to diagnose.
+        JLOG(j_.warn()) << "Malformed trace context: trace_id size="
+                        << proto.trace_id().size()
+                        << " span_id size=" << proto.span_id().size();
+        return opentelemetry::context::Context{};
+    }

    // Construct TraceId and SpanId from bytes
    TraceId traceId(reinterpret_cast<uint8_t const*>(proto.trace_id().data()));
@@ -490,11 +590,15 @@ TraceContextPropagator::extract(protocol::TraceContext const& proto)
    // Create SpanContext from extracted data
    SpanContext spanContext(traceId, spanId, flags, /* remote = */ true);

-    // Create context with extracted span as parent
-    return opentelemetry::context::Context{}.SetValue(
-        opentelemetry::trace::kSpanKey,
+    // DefaultSpan carries the remote SpanContext as a non-recording parent;
+    // this is the standard OTel C++ pattern for remote context propagation.
+    // A named root context is used so the call compiles whether SetSpan
+    // takes its context by const or non-const reference.
+    opentelemetry::context::Context rootCtx;
+    auto parentCtx = opentelemetry::trace::SetSpan(
+        rootCtx,
        opentelemetry::nostd::shared_ptr<Span>(
            new DefaultSpan(spanContext)));
+
+    return parentCtx;
 }

 inline void
@@ -750,8 +854,8 @@ ServerHandler::onRequest(
    // Extract trace context from HTTP headers (W3C Trace Context)
    auto parentCtx = telemetry::TraceContextPropagator::extractFromHeaders(
        [&req](std::string_view name) -> std::optional<std::string> {
-            auto it = req.find(boost::beast::http::field{
-                std::string(name)});
+            // Beast's find() accepts a string_view for custom header lookup
+            auto it = req.find(name);
            if (it != req.end())
                return std::string(it->value());
            return std::nullopt;
@@ -977,6 +1081,14 @@ flowchart TB

 </div>

+**Reading the diagram:**
+
+- **Client / Submit TX**: An external client submits a transaction, creating the root span that initiates the trace.
+- **Node A (RPC layer)**: The receiving node processes the submission through `rpc.request` and `rpc.command.submit`, then hands off to the transaction pipeline (`tx.receive` → `tx.validate` → `tx.relay`).
+- **Dashed arrows (TraceContext)**: Cross-node boundaries where trace context is propagated via the protobuf protocol extension, linking spans across independent processes.
+- **Node B (relay hop)**: A peer node that receives, validates, and relays the transaction further, demonstrating multi-hop propagation.
+- **Node C (consensus)**: The final node where the transaction enters consensus (`consensus.round` → `consensus.phase.establish`), showing how a single client action produces an end-to-end distributed trace.
+
 ---

 _Previous: [Implementation Strategy](./03-implementation-strategy.md)_ | _Next: [Configuration Reference](./05-configuration-reference.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
--- a/OpenTelemetryPlan/05-configuration-reference.md
+++ b/OpenTelemetryPlan/05-configuration-reference.md
@@ -7,6 +7,8 @@

 ## 5.1 rippled Configuration

+> **OTLP** = OpenTelemetry Protocol | **TxQ** = Transaction Queue
+
 ### 5.1.1 Configuration File Section

 Add to `cfg/xrpld-example.cfg`:
@@ -38,6 +40,9 @@ Add to `cfg/xrpld-example.cfg`:
 #
 # # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling)
 # # Use lower values in production to reduce overhead
+# # For high-throughput production deployments, 0.1 (10%) is recommended.
+# # See Section 7.4.2 for sampling strategy details.
 # sampling_ratio=0.1
 #
 # # Batch processor settings
@@ -51,6 +56,10 @@ Add to `cfg/xrpld-example.cfg`:
 # trace_rpc=1              # RPC request handling
 # trace_peer=0             # Peer messages (high volume, disabled by default)
 # trace_ledger=1           # Ledger acquisition and building
+# trace_pathfind=1         # Path computation (can be expensive)
+# trace_txq=1              # Transaction queue and fee escalation
+# trace_validator=0        # Validator list and manifest updates (low volume)
+# trace_amendment=0        # Amendment voting (very low volume)
 #
 # # Service identification (automatically detected if not specified)
 # # service_name=rippled
@@ -78,6 +87,10 @@ enabled=0
 | `trace_rpc`           | bool   | `true`           | Enable RPC tracing                        |
 | `trace_peer`          | bool   | `false`          | Enable peer message tracing (high volume) |
 | `trace_ledger`        | bool   | `true`           | Enable ledger tracing                     |
+| `trace_pathfind`      | bool   | `true`           | Enable path computation tracing           |
+| `trace_txq`           | bool   | `true`           | Enable transaction queue tracing          |
+| `trace_validator`     | bool   | `false`          | Enable validator list/manifest tracing    |
+| `trace_amendment`     | bool   | `false`          | Enable amendment voting tracing           |
 | `service_name`        | string | `"rippled"`      | Service name for traces                   |
 | `service_instance_id` | string | `<node_pubkey>`  | Instance identifier                       |

@@ -85,6 +98,8 @@ enabled=0

 ## 5.2 Configuration Parser

+> **TxQ** = Transaction Queue
+
 ```cpp
 // src/libxrpl/telemetry/TelemetryConfig.cpp

@@ -140,6 +155,10 @@ setup_Telemetry(
    setup.traceRpc = section.value_or("trace_rpc", true);
    setup.tracePeer = section.value_or("trace_peer", false);
    setup.traceLedger = section.value_or("trace_ledger", true);
+    setup.tracePathfind = section.value_or("trace_pathfind", true);
+    setup.traceTxQ = section.value_or("trace_txq", true);
+    setup.traceValidator = section.value_or("trace_validator", false);
+    setup.traceAmendment = section.value_or("trace_amendment", false);

    return setup;
 }
@@ -239,6 +258,8 @@ public:

 ## 5.4 CMake Integration

+> **OTLP** = OpenTelemetry Protocol
+
 ### 5.4.1 Find OpenTelemetry Module

 ```cmake
@@ -354,6 +375,8 @@ endif()

 ## 5.5 OpenTelemetry Collector Configuration

+> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring
+
 ### 5.5.1 Development Configuration

 ```yaml
@@ -380,9 +403,9 @@ exporters:
    sampling_initial: 5
    sampling_thereafter: 200

-  # Jaeger for trace visualization
-  jaeger:
-    endpoint: jaeger:14250
+  # Tempo for trace storage and querying (visualized in Grafana)
+  otlp/tempo:
+    endpoint: tempo:4317
    tls:
      insecure: true

@@ -391,7 +414,7 @@ service:
    traces:
      receivers: [otlp]
      processors: [batch]
-      exporters: [logging, jaeger]
+      exporters: [logging, otlp/tempo]
 ```

 ### 5.5.2 Production Configuration
@@ -504,6 +527,8 @@ service:

 ## 5.6 Docker Compose Development Environment

+> **OTLP** = OpenTelemetry Protocol
+
 ```yaml
 # docker-compose-telemetry.yaml
 version: "3.8"
@@ -521,17 +546,15 @@ services:
      - "4318:4318" # OTLP HTTP
      - "13133:13133" # Health check
    depends_on:
-      - jaeger
+      - tempo

-  # Jaeger for trace visualization
-  jaeger:
-    image: jaegertracing/all-in-one:1.53
-    container_name: jaeger
-    environment:
-      - COLLECTOR_OTLP_ENABLED=true
+  # Tempo for trace storage and querying (visualized in Grafana)
+  tempo:
+    image: grafana/tempo:2.6.1
+    container_name: tempo
    ports:
-      - "16686:16686" # UI
-      - "14250:14250" # gRPC
+      - "3200:3200" # Tempo HTTP API
+      - "4317" # OTLP gRPC (ephemeral host port; the collector reaches tempo:4317 on the compose network)

  # Grafana for dashboards
  grafana:
@@ -546,7 +569,7 @@ services:
    ports:
      - "3000:3000"
    depends_on:
-      - jaeger
+      - tempo

  # Prometheus for metrics (optional, for correlation)
  prometheus:
@@ -566,6 +589,8 @@ networks:

 ## 5.7 Configuration Architecture

+> **OTLP** = OpenTelemetry Protocol
+
 ```mermaid
 flowchart TB
    subgraph config["Configuration Sources"]
@@ -605,10 +630,20 @@ flowchart TB
    style collector fill:#fff3e0,stroke:#ff9800
 ```

+**Reading the diagram:**
+
+- **Configuration Sources**: `xrpld.cfg` provides runtime settings (endpoint, sampling) while the CMake flag controls whether telemetry is compiled in at all.
+- **Initialization**: `setup_Telemetry()` parses config values, then `make_Telemetry()` constructs the provider, processor, and exporter objects.
+- **Runtime Components**: The `TracerProvider` creates spans, the `BatchProcessor` buffers them, and the `OTLP Exporter` serializes and sends them over the wire.
+- **OTLP arrow to Collector**: Trace data leaves the rippled process via OTLP (gRPC or HTTP) and enters the external Collector pipeline.
+- **Collector Pipeline**: `Receivers` ingest OTLP data, `Processors` apply sampling/filtering/enrichment, and `Exporters` forward traces to storage backends (Tempo, etc.).
+
 ---

 ## 5.8 Grafana Integration

+> **APM** = Application Performance Monitoring
+
 Step-by-step instructions for integrating rippled traces with Grafana.

 ### 5.8.1 Data Source Configuration
@@ -642,23 +677,6 @@ datasources:
        datasourceUid: loki
 ```

-#### Jaeger
-
-```yaml
-# grafana/provisioning/datasources/jaeger.yaml
-apiVersion: 1
-
-datasources:
-  - name: Jaeger
-    type: jaeger
-    access: proxy
-    url: http://jaeger:16686
-    jsonData:
-      tracesToLogs:
-        datasourceUid: loki
-        tags: ["service.name"]
-```
-
 #### Elastic APM

 ```yaml
--- a/OpenTelemetryPlan/06-implementation-phases.md
+++ b/OpenTelemetryPlan/06-implementation-phases.md
@@ -7,6 +7,8 @@

 ## 6.1 Phase Overview

+> **TxQ** = Transaction Queue
+
 ```mermaid
 gantt
    title OpenTelemetry Implementation Timeline
@@ -19,26 +21,36 @@ gantt
    Telemetry Interface       :p1b, after p1a, 3d
    Configuration & CMake     :p1c, after p1b, 3d
    Unit Tests                :p1d, after p1c, 2d
+    Buffer & Integration      :p1e, after p1d, 2d

    section Phase 2
    RPC Tracing               :p2, after p1, 2w
    HTTP Context Extraction   :p2a, after p1, 2d
    RPC Handler Instrumentation :p2b, after p2a, 4d
-    WebSocket Support         :p2c, after p2b, 2d
+    PathFinding Instrumentation :p2f, after p2b, 2d
+    TxQ Instrumentation       :p2g, after p2f, 2d
+    WebSocket Support         :p2c, after p2g, 2d
    Integration Tests         :p2d, after p2c, 2d
+    Buffer & Review           :p2e, after p2d, 4d

    section Phase 3
    Transaction Tracing       :p3, after p2, 2w
    Protocol Buffer Extension :p3a, after p2, 2d
    PeerImp Instrumentation   :p3b, after p3a, 3d
-    Relay Context Propagation :p3c, after p3b, 3d
+    Fee Escalation Instrumentation :p3f, after p3b, 2d
+    Relay Context Propagation :p3c, after p3f, 3d
    Multi-node Tests          :p3d, after p3c, 2d
+    Buffer & Review           :p3e, after p3d, 4d

    section Phase 4
    Consensus Tracing         :p4, after p3, 2w
    Consensus Round Spans     :p4a, after p3, 3d
    Proposal Handling         :p4b, after p4a, 3d
-    Validation Tests          :p4c, after p4b, 4d
+    Validator List & Manifest Tracing :p4f, after p4b, 2d
+    Amendment Voting Tracing  :p4g, after p4f, 2d
+    SHAMap Sync Tracing       :p4h, after p4g, 2d
+    Validation Tests          :p4c, after p4h, 4d
+    Buffer & Review           :p4e, after p4c, 4d

    section Phase 5
    Documentation & Deploy    :p5, after p4, 1w
@@ -75,20 +87,24 @@ gantt

 ## 6.3 Phase 2: RPC Tracing (Weeks 3-4)

+> **TxQ** = Transaction Queue
+
 **Objective**: Complete tracing for all RPC operations

 ### Tasks

-| Task | Description                                        |
-| ---- | -------------------------------------------------- |
-| 2.1  | Implement W3C Trace Context HTTP header extraction |
-| 2.2  | Instrument `ServerHandler::onRequest()`            |
-| 2.3  | Instrument `RPCHandler::doCommand()`               |
-| 2.4  | Add RPC-specific attributes                        |
-| 2.5  | Instrument WebSocket handler                       |
-| 2.6  | Integration tests for RPC tracing                  |
-| 2.7  | Performance benchmarks                             |
-| 2.8  | Documentation                                      |
+| Task | Description                                                                |
+| ---- | -------------------------------------------------------------------------- |
+| 2.1  | Implement W3C Trace Context HTTP header extraction                         |
+| 2.2  | Instrument `ServerHandler::onRequest()`                                    |
+| 2.3  | Instrument `RPCHandler::doCommand()`                                       |
+| 2.4  | Add RPC-specific attributes                                                |
+| 2.5  | Instrument WebSocket handler                                               |
+| 2.6  | PathFinding instrumentation (`pathfind.request`, `pathfind.compute` spans) |
+| 2.7  | TxQ instrumentation (`txq.enqueue`, `txq.apply` spans)                     |
+| 2.8  | Integration tests for RPC tracing                                          |
+| 2.9  | Performance benchmarks                                                     |
+| 2.10 | Documentation                                                              |

 ### Exit Criteria

@@ -106,16 +122,17 @@ gantt

 ### Tasks

-| Task | Description                                   |
-| ---- | --------------------------------------------- |
-| 3.1  | Define `TraceContext` Protocol Buffer message |
-| 3.2  | Implement protobuf context serialization      |
-| 3.3  | Instrument `PeerImp::handleTransaction()`     |
-| 3.4  | Instrument `NetworkOPs::submitTransaction()`  |
-| 3.5  | Instrument HashRouter integration             |
-| 3.6  | Implement relay context propagation           |
-| 3.7  | Integration tests (multi-node)                |
-| 3.8  | Performance benchmarks                        |
+| Task | Description                                          |
+| ---- | ---------------------------------------------------- |
+| 3.1  | Define `TraceContext` Protocol Buffer message        |
+| 3.2  | Implement protobuf context serialization             |
+| 3.3  | Instrument `PeerImp::handleTransaction()`            |
+| 3.4  | Instrument `NetworkOPs::submitTransaction()`         |
+| 3.5  | Instrument HashRouter integration                    |
+| 3.6  | Fee escalation instrumentation (`fee.escalate` span) |
+| 3.7  | Implement relay context propagation                  |
+| 3.8  | Integration tests (multi-node)                       |
+| 3.9  | Performance benchmarks                               |

 ### Exit Criteria

@@ -141,8 +158,11 @@ gantt
 | 4.4  | Instrument validation handling                 |
 | 4.5  | Add consensus-specific attributes              |
 | 4.6  | Correlate with transaction traces              |
-| 4.7  | Multi-validator integration tests              |
-| 4.8  | Performance validation                         |
+| 4.7  | Validator list and manifest tracing            |
+| 4.8  | Amendment voting tracing                       |
+| 4.9  | SHAMap sync tracing                            |
+| 4.10 | Multi-validator integration tests              |
+| 4.11 | Performance validation                         |

 ### Exit Criteria

@@ -159,6 +179,9 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds:
 - **Deterministic trace ID** derived from `previousLedger.id()` so all validators
  in the same round share the same `trace_id` (switchable via
  `consensus_trace_strategy` config: `"deterministic"` or `"attribute"`).
+  See [Configuration Reference](./05-configuration-reference.md) for the
+  other configuration options; `consensus_trace_strategy` itself will be
+  added to that reference as part of the Phase 4a implementation.
 - **Round lifecycle spans**: `consensus.round` with round-to-round span links.
 - **Establish phase**: `consensus.establish`, `consensus.update_positions` (with
  `dispute.resolve` events), `consensus.check` (with threshold tracking).
@@ -198,16 +221,16 @@ quadrantChart
    title Risk Assessment Matrix
    x-axis Low Impact --> High Impact
    y-axis Low Likelihood --> High Likelihood
-    quadrant-1 Monitor Closely
-    quadrant-2 Mitigate Immediately
+    quadrant-1 Mitigate Immediately
+    quadrant-2 Plan Mitigation
    quadrant-3 Accept Risk
-    quadrant-4 Plan Mitigation
+    quadrant-4 Monitor Closely

-    SDK Compatibility: [0.25, 0.2]
-    Protocol Changes: [0.75, 0.65]
-    Performance Overhead: [0.65, 0.45]
-    Context Propagation: [0.5, 0.5]
-    Memory Leaks: [0.8, 0.2]
+    SDK Compat: [0.2, 0.18]
+    Protocol Chg: [0.75, 0.72]
+    Perf Overhead: [0.58, 0.42]
+    Context Prop: [0.4, 0.55]
+    Memory Leaks: [0.85, 0.25]
 ```

 ### Risk Details
@@ -224,19 +247,21 @@ quadrantChart

 ## 6.8 Success Metrics

-| Metric                   | Target                         | Measurement           |
-| ------------------------ | ------------------------------ | --------------------- |
-| Trace coverage           | >95% of transactions           | Sampling verification |
-| CPU overhead             | <3%                            | Benchmark tests       |
-| Memory overhead          | <5 MB                          | Memory profiling      |
-| Latency impact (p99)     | <2%                            | Performance tests     |
-| Trace completeness       | >99% spans with required attrs | Validation script     |
-| Cross-node trace linkage | >90% of multi-hop transactions | Integration tests     |
+| Metric                   | Target                                                         | Measurement           |
+| ------------------------ | -------------------------------------------------------------- | --------------------- |
+| Trace coverage           | >95% of transaction code paths (independent of sampling ratio) | Sampling verification |
+| CPU overhead             | <3%                                                            | Benchmark tests       |
+| Memory overhead          | <10 MB                                                         | Memory profiling      |
+| Latency impact (p99)     | <2%                                                            | Performance tests     |
+| Trace completeness       | >99% spans with required attrs                                 | Validation script     |
+| Cross-node trace linkage | >90% of multi-hop transactions                                 | Integration tests     |

 ---

 ## 6.9 Quick Wins and Crawl-Walk-Run Strategy

+> **TxQ** = Transaction Queue
+
 This section outlines a prioritized approach to maximize ROI with minimal initial investment.

 ### 6.9.1 Crawl-Walk-Run Overview
@@ -247,17 +272,17 @@ This section outlines a prioritized approach to maximize ROI with minimal initia
 flowchart TB
    subgraph crawl["🐢 CRAWL (Week 1-2)"]
        direction LR
-        c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[Single Node]
+        c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[PathFinding + TxQ Tracing] ~~~ c4[Single Node]
    end

    subgraph walk["🚶 WALK (Week 3-5)"]
        direction LR
-        w1[Transaction Tracing] ~~~ w2[Cross-Node Context] ~~~ w3[Basic Dashboards]
+        w1[Transaction Tracing] ~~~ w2[Fee Escalation Tracing] ~~~ w3[Cross-Node Context] ~~~ w4[Basic Dashboards]
    end

    subgraph run["🏃 RUN (Week 6-9)"]
        direction LR
-        r1[Consensus Tracing] ~~~ r2[Full Correlation] ~~~ r3[Production Deploy]
+        r1[Consensus Tracing] ~~~ r2[Validator, Amendment,<br/>SHAMap Tracing] ~~~ r3[Full Correlation] ~~~ r4[Production Deploy]
    end

    crawl --> walk --> run
@@ -268,16 +293,26 @@ flowchart TB
    style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff
+    style c4 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
+    style w4 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style r1 fill:#0d47a1,stroke:#082f6a,color:#fff
    style r2 fill:#0d47a1,stroke:#082f6a,color:#fff
    style r3 fill:#0d47a1,stroke:#082f6a,color:#fff
+    style r4 fill:#0d47a1,stroke:#082f6a,color:#fff
 ```

 </div>

+**Reading the diagram:**
+
+- **CRAWL (Weeks 1-2)**: Minimal investment -- set up the SDK, instrument RPC and PathFinding/TxQ handlers, and verify on a single node. Delivers immediate latency visibility.
+- **WALK (Weeks 3-5)**: Expand to transaction lifecycle tracing, fee escalation, cross-node context propagation, and basic Grafana dashboards. This is where distributed tracing starts working.
+- **RUN (Weeks 6-9)**: Full consensus instrumentation, validator/amendment/SHAMap tracing, end-to-end correlation, and production deployment with sampling and alerting.
+- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
+
 ### 6.9.2 Quick Wins (Immediate Value)

 | Quick Win                      | Value  | When to Deploy |
@@ -296,6 +331,7 @@ flowchart TB

 - RPC request/response traces for all commands
 - Latency breakdown per RPC command
+- PathFinding and TxQ tracing (directly impacts RPC latency)
 - Error visibility with stack traces
 - Basic Grafana dashboard

@@ -304,6 +340,7 @@ flowchart TB
 **Why Start Here**:

 - RPC is the lowest-risk, highest-visibility component
+- PathFinding and TxQ are RPC-adjacent and directly affect latency
 - Immediate value for debugging client issues
 - No cross-node complexity
 - Single file modification to existing code
@@ -315,6 +352,7 @@ flowchart TB
 **What You Get**:

 - End-to-end transaction traces from submit to relay
+- Fee escalation tracing within the transaction pipeline
 - Cross-node correlation (see transaction path)
 - HashRouter deduplication visibility
 - Relay latency metrics
@@ -324,6 +362,7 @@ flowchart TB
 **Why Do This Second**:

 - Builds on RPC tracing (transactions submitted via RPC)
+- Fee escalation is integral to the transaction processing pipeline
 - Moderate complexity (requires context propagation)
 - High value for debugging transaction issues

@@ -336,13 +375,17 @@ flowchart TB
 - Complete consensus round visibility
 - Phase transition timing
 - Validator proposal tracking
+- Validator list and manifest tracing
+- Amendment voting tracing
+- SHAMap sync tracing
 - Full end-to-end traces (client → RPC → TX → consensus → ledger)

-**Code Changes**: ~100 lines across 3 consensus files
+**Code Changes**: ~100 lines across 3 consensus files, plus validator/amendment/SHAMap modules

 **Why Do This Last**:

 - Highest complexity (consensus is critical path)
+- Validator, amendment, and SHAMap components are lower priority
 - Requires thorough testing
 - Lower relative value (consensus issues are rarer)

@@ -358,33 +401,35 @@ quadrantChart
    quadrant-3 Nice to Have - Optional
    quadrant-4 Time Sinks - Avoid

-    RPC Tracing: [0.15, 0.9]
-    TX Submit Trace: [0.25, 0.85]
-    TX Relay Trace: [0.5, 0.8]
-    Consensus Trace: [0.7, 0.75]
-    Peer Message Trace: [0.85, 0.3]
-    Ledger Acquire: [0.55, 0.5]
+    RPC Tracing: [0.15, 0.92]
+    TX Submit Trace: [0.3, 0.78]
+    TX Relay Trace: [0.5, 0.88]
+    Consensus Trace: [0.72, 0.72]
+    Peer Msg Trace: [0.85, 0.3]
+    Ledger Acquire: [0.55, 0.52]
 ```

 ---

-## 6.11 Definition of Done
+## 6.10 Definition of Done
+
+> **TxQ** = Transaction Queue | **HA** = High Availability

 Clear, measurable criteria for each phase.

-### 6.11.1 Phase 1: Core Infrastructure
+### 6.10.1 Phase 1: Core Infrastructure

 | Criterion       | Measurement                                                | Target                       |
 | --------------- | ---------------------------------------------------------- | ---------------------------- |
 | SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles                  |
 | Runtime Toggle  | `enabled=0` produces zero overhead                         | <0.1% CPU difference         |
-| Span Creation   | Unit test creates and exports span                         | Span appears in Jaeger       |
+| Span Creation   | Unit test creates and exports span                         | Span appears in Tempo        |
 | Configuration   | All config options parsed correctly                        | Config validation tests pass |
 | Documentation   | Developer guide exists                                     | PR approved                  |

 **Definition of Done**: All criteria met, PR merged, no regressions in CI.

-### 6.11.2 Phase 2: RPC Tracing
+### 6.10.2 Phase 2: RPC Tracing

 | Criterion          | Measurement                        | Target                     |
 | ------------------ | ---------------------------------- | -------------------------- |
@@ -394,9 +439,9 @@ Clear, measurable criteria for each phase.
 | Performance        | RPC latency overhead               | <1ms p99                   |
 | Dashboard          | Grafana dashboard deployed         | Screenshot in docs         |

-**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.
+**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.

-### 6.11.3 Phase 3: Transaction Tracing
+### 6.10.3 Phase 3: Transaction Tracing

 | Criterion        | Measurement                     | Target                             |
 | ---------------- | ------------------------------- | ---------------------------------- |
@@ -408,7 +453,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.

-### 6.11.4 Phase 4: Consensus Tracing
+### 6.10.4 Phase 4: Consensus Tracing

 | Criterion            | Measurement                   | Target                    |
 | -------------------- | ----------------------------- | ------------------------- |
@@ -420,7 +465,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

-### 6.11.5 Phase 5: Production Deployment
+### 6.10.5 Phase 5: Production Deployment

 | Criterion    | Measurement                  | Target                     |
 | ------------ | ---------------------------- | -------------------------- |
@@ -433,7 +478,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Telemetry running in production, operators trained, alerts active.

-### 6.11.6 Success Metrics Summary
+### 6.10.6 Success Metrics Summary

 | Phase   | Primary Metric         | Secondary Metric            | Deadline      |
 | ------- | ---------------------- | --------------------------- | ------------- |
@@ -458,7 +503,7 @@ flowchart TB

    subgraph week2["Week 2"]
        t3[3. RPC ServerHandler<br/>instrumentation]
-        t4[4. Basic Jaeger setup<br/>for testing]
+        t4[4. Basic Tempo setup<br/>for testing]
    end

    subgraph week3["Week 3"]
@@ -516,6 +561,15 @@ flowchart TB
    style t14 fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

+**Reading the diagram:**
+
+- **Week 1 (tasks 1-2)**: Foundation work -- integrate the OpenTelemetry SDK via Conan/CMake and build the `Telemetry` interface with `SpanGuard` and config parsing.
+- **Week 2 (tasks 3-4)**: First observable output -- instrument `ServerHandler` for RPC tracing and stand up Tempo so developers can see traces immediately.
+- **Weeks 3-5 (tasks 5-10)**: Transaction lifecycle -- add submit tracing, build the first Grafana dashboard, extend protobuf for cross-node context, instrument `PeerImp` relay, then validate with multi-node integration tests and performance benchmarks.
+- **Weeks 6-8 (tasks 11-12)**: Consensus deep-dive -- instrument consensus rounds and phases, then run full integration testing across all instrumented paths.
+- **Week 9 (tasks 13-14)**: Go-live -- deploy to production with sampling/alerting configured, and deliver documentation and operator training.
+- **Arrow chain (t1 → ... → t14)**: Strict sequential dependency; each task's output is a prerequisite for the next.
+
 ---

 _Previous: [Configuration Reference](./05-configuration-reference.md)_ | _Next: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
--- a/OpenTelemetryPlan/07-observability-backends.md
+++ b/OpenTelemetryPlan/07-observability-backends.md
@@ -7,33 +7,36 @@

 ## 7.1 Development/Testing Backends

-| Backend    | Pros                | Cons              | Use Case          |
-| ---------- | ------------------- | ----------------- | ----------------- |
-| **Jaeger** | Easy setup, good UI | Limited retention | Local dev, CI     |
-| **Zipkin** | Simple, lightweight | Basic features    | Quick prototyping |
+> **OTLP** = OpenTelemetry Protocol

-### Quick Start with Jaeger
+| Backend    | Pros                                | Cons                   | Use Case            |
+| ---------- | ----------------------------------- | ---------------------- | ------------------- |
+| **Tempo**  | Cost-effective, Grafana integration | Requires Grafana stack | Local dev, CI, Prod |
+| **Zipkin** | Simple, lightweight                 | Basic features         | Quick prototyping   |
+
+### Quick Start with Tempo

 ```bash
-# Start Jaeger with OTLP support
-docker run -d --name jaeger \
-  -e COLLECTOR_OTLP_ENABLED=true \
-  -p 16686:16686 \
+# Start Tempo with OTLP support
+docker run -d --name tempo \
+  -p 3200:3200 \
  -p 4317:4317 \
  -p 4318:4318 \
-  jaegertracing/all-in-one:latest
+  grafana/tempo:2.6.1
 ```

 ---

 ## 7.2 Production Backends

-| Backend           | Pros                                      | Cons               | Use Case                    |
-| ----------------- | ----------------------------------------- | ------------------ | --------------------------- |
-| **Grafana Tempo** | Cost-effective, Grafana integration       | Newer project      | Most production deployments |
-| **Elastic APM**   | Full observability stack, log correlation | Resource intensive | Existing Elastic users      |
-| **Honeycomb**     | Excellent query, high cardinality         | SaaS cost          | Deep debugging needs        |
-| **Datadog APM**   | Full platform, easy setup                 | SaaS cost          | Enterprise with budget      |
+> **APM** = Application Performance Monitoring
+
+| Backend           | Pros                                      | Cons                   | Use Case                    |
+| ----------------- | ----------------------------------------- | ---------------------- | --------------------------- |
+| **Grafana Tempo** | Cost-effective, Grafana integration       | Requires Grafana stack | Most production deployments |
+| **Elastic APM**   | Full observability stack, log correlation | Resource intensive     | Existing Elastic users      |
+| **Honeycomb**     | Excellent query, high cardinality         | SaaS cost              | Deep debugging needs        |
+| **Datadog APM**   | Full platform, easy setup                 | SaaS cost              | Enterprise with budget      |

 ### Backend Selection Flowchart

@@ -73,10 +76,19 @@ flowchart TD
    style datadog fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

+**Reading the diagram:**
+
+- **Budget Constraints? (Yes)**: Leads to open-source options. If you already run Grafana or Elastic, pick the matching backend; otherwise default to Grafana Tempo.
+- **Budget Constraints? (No) → Prefer SaaS?**: If you want a managed service, choose between Datadog (enterprise support) and Honeycomb (developer-focused). If not, fall back to open-source.
+- **Terminal nodes (Tempo / Elastic / Honeycomb / Datadog)**: Each represents a concrete backend choice, all of which feed into the same final step.
+- **Configure Collector**: Regardless of backend, you always finish by configuring the OTel Collector to export to your chosen destination.
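
The decision path in the bullets above can be written as a small function. This is only an illustrative sketch: the function name and its parameters (`budget_constrained`, `existing_stack`, `prefer_saas`, `need_enterprise_support`) are hypothetical and do not correspond to anything in rippled or the Collector.

```python
def choose_backend(budget_constrained, existing_stack=None,
                   prefer_saas=False, need_enterprise_support=False):
    """Return a backend name following the selection flowchart (hypothetical helper)."""
    # Budget constraints, or no preference for SaaS, both lead to open source.
    if budget_constrained or not prefer_saas:
        # Match an existing Elastic deployment; otherwise default to Tempo.
        return "Elastic APM" if existing_stack == "elastic" else "Grafana Tempo"
    # Managed service: Datadog for enterprise support, Honeycomb otherwise.
    return "Datadog APM" if need_enterprise_support else "Honeycomb"


print(choose_backend(True, existing_stack="elastic"))
print(choose_backend(False, prefer_saas=True))
```

Whatever the function returns, the last step is the same as in the flowchart: the OTel Collector is configured to export to that backend.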
+
 ---

 ## 7.3 Recommended Production Architecture

+> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring | **HA** = High Availability
+
 ```mermaid
 flowchart TB
    subgraph validators["Validator Nodes"]
@@ -117,6 +129,8 @@ flowchart TB
    tempo --> grafana
    elastic --> grafana

+    %% Note: simplified single-collector-per-DC topology shown for clarity
+
    style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff
    style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
@@ -124,6 +138,16 @@ flowchart TB
    style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff
 ```

+**Reading the diagram:**
+
+- **Validator / Stock Nodes**: All rippled nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
+- **Collector Cluster (DC1, DC2)**: Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends.
+- **Storage Backends**: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis.
+- **Grafana Dashboards**: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces.
+- **Data flow direction**: Nodes → Collectors → Storage → Grafana. Each arrow represents a network hop; minimizing collector-to-backend hops reduces latency.
+
+> **Note**: Production deployments should use multiple collector instances behind a load balancer for high availability. The diagram shows a simplified topology with a single collector per datacenter for clarity.
+
 ---

 ## 7.4 Architecture Considerations
@@ -147,7 +171,7 @@ flowchart TB
 ```mermaid
 flowchart LR
    subgraph head["Head Sampling (Node)"]
-        hs[10% probabilistic]
+        hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
    end

    subgraph tail["Tail Sampling (Collector)"]
@@ -171,6 +195,13 @@ flowchart LR
    style final fill:#bf360c,stroke:#8c2809,color:#fff
 ```

+**Reading the diagram:**
+
+- **Head Sampling (Node)**: The first filter -- each rippled node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
+- **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
+- **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
+- **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
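
The two sampling stages described above can be simulated in a few lines. This is a rough sketch under stated assumptions: the trace is a plain dict with hypothetical `error` and `duration_s` keys, and head sampling is modeled as a random draw, whereas a real SDK sampler typically decides from the trace ID.

```python
import random

SLOW_THRESHOLD_S = 5.0   # tail rule: keep anything slower than 5 seconds
TAIL_KEEP_RATIO = 0.10   # tail rule: keep 10% of the remainder


def head_sample(ratio, rng):
    """Node-side decision, made when the trace is created."""
    return rng.random() < ratio


def tail_sample(trace, rng):
    """Collector-side decision, made once the trace is complete."""
    if trace["error"]:
        return True
    if trace["duration_s"] > SLOW_THRESHOLD_S:
        return True
    return rng.random() < TAIL_KEEP_RATIO


def keep(trace, head_ratio, rng):
    """A trace is stored only if it passes both stages."""
    return head_sample(head_ratio, rng) and tail_sample(trace, rng)


rng = random.Random(42)
fast_ok = {"error": False, "duration_s": 0.2}
stored = sum(keep(fast_ok, 0.10, rng) for _ in range(100_000))
print(f"stored {stored} of 100000 fast, successful traces")
```

With a 10% head ratio and a 10% tail ratio, roughly 1% of fast, successful traces are expected to reach storage. Errors and slow traces are kept whenever the head stage has already sampled them.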
+
 ### 7.4.3 Data Retention

 | Environment | Hot Storage | Warm Storage | Cold Archive |
@@ -355,6 +386,9 @@ groups:
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s'
+              # Note: Verify TraceQL aggregate queries are supported by your
+              # Tempo version. Aggregate alerting (e.g., avg(duration)) requires
+              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 5m
        annotations:
          summary: Consensus rounds taking >5 seconds
@@ -371,6 +405,9 @@ groups:
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05'
+              # Note: Verify TraceQL aggregate queries are supported by your
+              # Tempo version. Aggregate alerting (e.g., rate()) requires
+              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 2m
        annotations:
          summary: RPC error rate >5%
@@ -397,6 +434,8 @@ groups:

 ## 7.7 PerfLog and Insight Correlation

+> **OTLP** = OpenTelemetry Protocol
+
 How to correlate OpenTelemetry traces with existing rippled observability.

 ### 7.7.1 Correlation Architecture
@@ -459,6 +498,13 @@ flowchart TB
    style corr fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

+**Reading the diagram:**
+
+- **rippled Node (three sources)**: A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
+- **Data Collection layer**: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently.
+- **Storage layer (Tempo, Loki, Prometheus)**: Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation).
+- **Grafana Correlation Panel**: The key integration point -- Grafana queries all three stores and links them via shared fields (`trace_id`, `xrpl.tx.hash`, `ledger_seq`), enabling a single-pane debugging experience.
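
The linking step in the last bullet amounts to a join on a shared field. The sketch below illustrates that idea only: Grafana performs the correlation through data source links rather than code like this, and the record shapes and the `correlate` helper are hypothetical.

```python
from collections import defaultdict


def correlate(spans, logs, key="trace_id"):
    """Group span and log records that share the same correlation key."""
    grouped = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        if key in span:
            grouped[span[key]]["spans"].append(span)
    for log in logs:
        if key in log:
            grouped[log[key]]["logs"].append(log)
    return dict(grouped)


# Hypothetical records; real spans and PerfLog entries carry more fields.
spans = [{"trace_id": "abc", "name": "rpc.request"},
         {"trace_id": "def", "name": "tx.relay"}]
logs = [{"trace_id": "abc", "msg": "submit accepted"},
        {"msg": "entry without a trace_id"}]

for trace_id, group in correlate(spans, logs).items():
    print(trace_id, len(group["spans"]), "span(s)", len(group["logs"]), "log(s)")
```

Records that lack the key are left out, which is why the correlation only works if every source emits the shared field.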
+
 ### 7.7.2 Correlation Fields

 | Source      | Field                       | Link To       | Purpose                    |
--- a/OpenTelemetryPlan/08-appendix.md
+++ b/OpenTelemetryPlan/08-appendix.md
@@ -7,6 +7,8 @@

 ## 8.1 Glossary

+> **OTLP** = OpenTelemetry Protocol | **TxQ** = Transaction Queue
+
 | Term                  | Definition                                                 |
 | --------------------- | ---------------------------------------------------------- |
 | **Span**              | A unit of work with start/end time, name, and attributes   |
@@ -26,25 +28,31 @@

 ### rippled-Specific Terms

-| Term              | Definition                                         |
-| ----------------- | -------------------------------------------------- |
-| **Overlay**       | P2P network layer managing peer connections        |
-| **Consensus**     | XRP Ledger consensus algorithm (RCL)               |
-| **Proposal**      | Validator's suggested transaction set for a ledger |
-| **Validation**    | Validator's signature on a closed ledger           |
-| **HashRouter**    | Component for transaction deduplication            |
-| **JobQueue**      | Thread pool for asynchronous task execution        |
-| **PerfLog**       | Existing performance logging system in rippled     |
-| **Beast Insight** | Existing metrics framework in rippled              |
+| Term              | Definition                                                    |
+| ----------------- | ------------------------------------------------------------- |
+| **Overlay**       | P2P network layer managing peer connections                   |
+| **Consensus**     | XRP Ledger consensus algorithm (RCL)                          |
+| **Proposal**      | Validator's suggested transaction set for a ledger            |
+| **Validation**    | Validator's signature on a closed ledger                      |
+| **HashRouter**    | Component for transaction deduplication                       |
+| **JobQueue**      | Thread pool for asynchronous task execution                   |
+| **PerfLog**       | Existing performance logging system in rippled                |
+| **Beast Insight** | Existing metrics framework in rippled                         |
+| **PathFinding**   | Payment path computation engine for cross-currency payments   |
+| **TxQ**           | Transaction queue managing fee-based prioritization           |
+| **LoadManager**   | Dynamic fee escalation based on network load                  |
+| **SHAMap**        | SHA-512Half hash-based map (Merkle trie) for ledger state     |

 ---

 ## 8.2 Span Hierarchy Visualization

+> **TxQ** = Transaction Queue
+
 ```mermaid
 flowchart TB
    subgraph trace["Trace: Transaction Lifecycle"]
-        rpc["rpc.submit<br/>(entry point)"]
+        rpc["rpc.request<br/>(entry point)"]
        validate["tx.validate"]
        relay["tx.relay<br/>(parent span)"]

@@ -54,20 +62,45 @@ flowchart TB
            p3["peer.send<br/>Peer C"]
        end

+        subgraph pathfinding["PathFinding Spans"]
+            pathfind["pathfind.request"]
+            pathcomp["pathfind.compute"]
+        end
+
        consensus["consensus.round"]
        apply["tx.apply"]
+
+        subgraph txqueue["TxQ Spans"]
+            txq["txq.enqueue"]
+            txqApply["txq.apply"]
+        end
+
+        feeCalc["fee.escalate"]
+    end
+
+    subgraph validators["Validator Spans"]
+        valFetch["validator.list.fetch"]
+        valManifest["validator.manifest"]
    end

    rpc --> validate
+    rpc --> pathfind
+    pathfind --> pathcomp
    validate --> relay
    relay --> p1
    relay --> p2
    relay --> p3
    p1 -.->|"context propagation"| consensus
    consensus --> apply
+    apply --> txq
+    txq --> txqApply
+    txq --> feeCalc

    style trace fill:#0f172a,stroke:#020617,color:#fff
    style peers fill:#1e3a8a,stroke:#172554,color:#fff
+    style pathfinding fill:#134e4a,stroke:#0f766e,color:#fff
+    style txqueue fill:#064e3b,stroke:#047857,color:#fff
+    style validators fill:#4c1d95,stroke:#6d28d9,color:#fff
    style rpc fill:#1d4ed8,stroke:#1e40af,color:#fff
    style validate fill:#047857,stroke:#064e3b,color:#fff
    style relay fill:#047857,stroke:#064e3b,color:#fff
@@ -76,12 +109,30 @@ flowchart TB
    style p3 fill:#0e7490,stroke:#155e75,color:#fff
    style consensus fill:#fef3c7,stroke:#fde68a,color:#1e293b
    style apply fill:#047857,stroke:#064e3b,color:#fff
+    style pathfind fill:#0e7490,stroke:#155e75,color:#fff
+    style pathcomp fill:#0e7490,stroke:#155e75,color:#fff
+    style txq fill:#047857,stroke:#064e3b,color:#fff
+    style txqApply fill:#047857,stroke:#064e3b,color:#fff
+    style feeCalc fill:#047857,stroke:#064e3b,color:#fff
+    style valFetch fill:#6d28d9,stroke:#4c1d95,color:#fff
+    style valManifest fill:#6d28d9,stroke:#4c1d95,color:#fff
 ```

+**Reading the diagram:**
+
+- **rpc.request (blue, top)**: The entry point — every traced transaction starts as an RPC call; this root span is the parent of all downstream work.
+- **tx.validate and pathfind.request (green/teal, first fork)**: The RPC request fans out into transaction validation and, for cross-currency payments, a PathFinding branch (`pathfind.request` -> `pathfind.compute`).
+- **tx.relay -> Peer Spans (teal, middle)**: After validation, the transaction is relayed to peers A, B, and C in parallel; each `peer.send` is a sibling child span showing fan-out across the network.
+- **context propagation (dashed arrow)**: The dashed line from `peer.send Peer A` to `consensus.round` represents the trace context crossing a node boundary: the receiving validator picks up the same `trace_id` and continues the trace.
+- **consensus.round -> tx.apply -> TxQ Spans (green, lower)**: Once consensus accepts the transaction, it is applied to the ledger; the TxQ spans (`txq.enqueue`, `txq.apply`) and the adjacent `fee.escalate` span capture queue depth and fee escalation behavior.
+- **Validator Spans (purple, detached)**: `validator.list.fetch` and `validator.manifest` are independent workflows for UNL management — they run on their own traces and are linked to consensus via Span Links, not parent-child relationships.
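
The context-propagation bullet above can be illustrated with a minimal C++ sketch. It uses the W3C `traceparent` text format (version `00`) as the carrier purely for illustration; the plan itself describes Protocol Buffer fields for P2P messages, and `TraceContext`, `parseTraceparent`, and `continueTrace` are hypothetical names, not part of rippled or the OTel SDK.

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Illustrative only: the three fields carried in a W3C `traceparent` value.
struct TraceContext
{
    std::string traceId;  // 32 lowercase hex chars, shared by every span in the trace
    std::string spanId;   // 16 lowercase hex chars, identifies the sending span
    std::string flags;    // 2 hex chars; "01" means sampled
};

// True when `s` has exactly `n` characters, all lowercase hex digits.
inline bool isLowerHex(std::string const& s, std::size_t n)
{
    if (s.size() != n)
        return false;
    for (unsigned char c : s)
        if (!(std::isdigit(c) || (c >= 'a' && c <= 'f')))
            return false;
    return true;
}

// Parse "00-<trace-id>-<parent-id>-<flags>"; returns nullopt on malformed input.
// Only version "00" is accepted in this sketch.
inline std::optional<TraceContext> parseTraceparent(std::string const& header)
{
    // Fixed layout: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters.
    if (header.size() != 55 || header[2] != '-' || header[35] != '-' || header[52] != '-')
        return std::nullopt;
    if (header.substr(0, 2) != "00")
        return std::nullopt;
    TraceContext ctx{header.substr(3, 32), header.substr(36, 16), header.substr(53, 2)};
    if (!isLowerHex(ctx.traceId, 32) || !isLowerHex(ctx.spanId, 16) || !isLowerHex(ctx.flags, 2))
        return std::nullopt;
    // All-zero trace or span IDs are invalid.
    if (ctx.traceId == std::string(32, '0') || ctx.spanId == std::string(16, '0'))
        return std::nullopt;
    return ctx;
}

// Continue the trace on the receiving node: keep trace_id, swap in the local span id.
inline std::string continueTrace(TraceContext const& incoming, std::string const& localSpanId)
{
    assert(isLowerHex(localSpanId, 16));
    return "00-" + incoming.traceId + "-" + localSpanId + "-" + incoming.flags;
}
```

A receiving node would call `parseTraceparent` on the incoming value and, if it parses, pass the result to `continueTrace` with its own span id; the `trace_id` is never regenerated, which is what stitches the hops into one trace.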
+
 ---

 ## 8.3 References

+> **OTLP** = OpenTelemetry Protocol
+
 ### OpenTelemetry Resources

 1. [OpenTelemetry C++ SDK](https://github.com/open-telemetry/opentelemetry-cpp)
@@ -107,10 +158,11 @@ flowchart TB

 ## 8.4 Version History

-| Version | Date       | Author | Changes                           |
-| ------- | ---------- | ------ | --------------------------------- |
-| 1.0     | 2026-02-12 | -      | Initial implementation plan       |
-| 1.1     | 2026-02-13 | -      | Refactored into modular documents |
+| Version | Date       | Author | Changes                                                        |
+| ------- | ---------- | ------ | -------------------------------------------------------------- |
+| 1.0     | 2026-02-12 | -      | Initial implementation plan                                    |
+| 1.1     | 2026-02-13 | -      | Refactored into modular documents                              |
+| 1.2     | 2026-03-24 | -      | Review fixes: accuracy corrections, cross-document consistency |

 ---

@@ -133,9 +185,10 @@ flowchart TB

 ### Task Lists

-| Document                             | Description                            |
-| ------------------------------------ | -------------------------------------- |
-| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
+| Document                             | Description                                         |
+| ------------------------------------ | --------------------------------------------------- |
+| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration              |
+| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |

 ---

--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -2,6 +2,8 @@

 ## Executive Summary

+> **OTLP** = OpenTelemetry Protocol
+
 This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.

 ### Key Benefits
@@ -33,6 +35,10 @@ This implementation plan is organized into modular documents for easier navigati
 flowchart TB
    overview["📋 OpenTelemetryPlan.md<br/>(This Document)"]

+    subgraph fundamentals["Fundamentals"]
+        fund["00-tracing-fundamentals.md"]
+    end
+
    subgraph analysis["Analysis & Design"]
        arch["01-architecture-analysis.md"]
        design["02-design-decisions.md"]
@@ -48,12 +54,15 @@ flowchart TB
        phases["06-implementation-phases.md"]
        backends["07-observability-backends.md"]
        appendix["08-appendix.md"]
+        poc["POC_taskList.md"]
    end

+    overview --> fundamentals
    overview --> analysis
    overview --> impl
    overview --> deploy

+    fund --> arch
    arch --> design
    design --> strategy
    strategy --> code
@@ -61,8 +70,11 @@ flowchart TB
    config --> phases
    phases --> backends
    backends --> appendix
+    phases --> poc

    style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
+    style fundamentals fill:#00695c,stroke:#004d40,color:#fff
+    style fund fill:#00695c,stroke:#004d40,color:#fff
    style analysis fill:#0d47a1,stroke:#082f6a,color:#fff
    style impl fill:#bf360c,stroke:#8c2809,color:#fff
    style deploy fill:#4a148c,stroke:#2e0d57,color:#fff
@@ -74,6 +86,7 @@ flowchart TB
    style phases fill:#4a148c,stroke:#2e0d57,color:#fff
    style backends fill:#4a148c,stroke:#2e0d57,color:#fff
    style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
+    style poc fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

 </div>
@@ -84,22 +97,34 @@ flowchart TB

 | Section | Document                                                   | Description                                                            |
 | ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- |
+| **0**   | [Tracing Fundamentals](./00-tracing-fundamentals.md)       | Distributed tracing concepts, span relationships, context propagation  |
 | **1**   | [Architecture Analysis](./01-architecture-analysis.md)     | rippled component analysis, trace points, instrumentation priorities   |
 | **2**   | [Design Decisions](./02-design-decisions.md)               | SDK selection, exporters, span naming, attributes, context propagation |
 | **3**   | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization          |
-| **4**   | [Code Samples](./04-code-samples.md)                       | Complete C++ implementation examples for all components                |
+| **4**   | [Code Samples](./04-code-samples.md)                       | C++ implementation examples for core infrastructure and key modules    |
 | **5**   | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations            |
 | **6**   | [Implementation Phases](./06-implementation-phases.md)     | 5-phase timeline, tasks, risks, success metrics                        |
 | **7**   | [Observability Backends](./07-observability-backends.md)   | Backend selection guide and production architecture                    |
 | **8**   | [Appendix](./08-appendix.md)                               | Glossary, references, version history                                  |
+| **POC** | [POC Task List](./POC_taskList.md)                         | Proof of concept tasks for RPC tracing end-to-end demo                 |
+
+---
+
+## 0. Tracing Fundamentals
+
+This document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to rippled-specific scenarios like transaction relay and consensus.
+
+➡️ **[Read Tracing Fundamentals](./00-tracing-fundamentals.md)**

 ---

 ## 1. Architecture Analysis

-The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
+> **WS** = WebSocket | **TxQ** = Transaction Queue

-Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, and ledger building. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.
+The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
+
+Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, ledger building, path computation, transaction queue behavior, fee escalation, and validator health. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.

 ➡️ **[Read full Architecture Analysis](./01-architecture-analysis.md)**

@@ -107,11 +132,13 @@ Key trace points span across transaction submission via RPC, peer-to-peer messag

 ## 2. Design Decisions

+> **OTLP** = OpenTelemetry Protocol | **CNCF** = Cloud Native Computing Foundation
+
 The OpenTelemetry C++ SDK is selected for its CNCF backing, active development, and native performance characteristics. Traces are exported via OTLP/gRPC (primary) or OTLP/HTTP (fallback) to an OpenTelemetry Collector, which provides flexible routing and sampling.

 Span naming follows a hierarchical `<component>.<operation>` convention (e.g., `rpc.submit`, `tx.relay`, `consensus.round`). Context propagation uses W3C Trace Context headers for HTTP and embedded Protocol Buffer fields for P2P messages. The implementation coexists with existing PerfLog and Insight observability systems through correlation IDs.
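
As a rough illustration of the naming convention, the check below accepts names such as `tx.relay` or `validator.list.fetch` and rejects malformed ones. The exact character set is an assumption (lowercase letters, digits, underscores, two or more dot-separated segments); the plan does not spell it out, and `isValidSpanName` is a hypothetical helper.

```cpp
#include <cassert>
#include <string_view>

// Assumed rule (not stated in the plan): two or more dot-separated segments,
// each non-empty and made of lowercase letters, digits, or underscores.
inline bool isValidSpanName(std::string_view name)
{
    std::size_t segments = 0;
    std::size_t segmentLength = 0;
    for (char c : name)
    {
        if (c == '.')
        {
            if (segmentLength == 0)
                return false;  // empty segment, e.g. "rpc..submit" or ".relay"
            ++segments;
            segmentLength = 0;
        }
        else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_')
        {
            ++segmentLength;
        }
        else
        {
            return false;  // uppercase, whitespace, or other punctuation
        }
    }
    if (segmentLength == 0)
        return false;  // empty name or trailing dot
    ++segments;        // count the final segment
    return segments >= 2;
}
```

A check like this could run in a unit test over the list of span names, so a typo such as `Rpc.Submit` is caught before it reaches a dashboard query.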

-**Data Collection & Privacy**: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control(not penned down in this document yet) over what data is exported.
+**Data Collection & Privacy**: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control over telemetry configuration.

 ➡️ **[Read full Design Decisions](./02-design-decisions.md)**

@@ -129,13 +156,14 @@ Performance optimization strategies include probabilistic head sampling (10% def

 ## 4. Code Samples

-Complete C++ implementation examples are provided for all telemetry components:
+C++ implementation examples are provided for the core telemetry infrastructure and key modules:

 - `Telemetry.h` - Core interface for tracer access and span creation
 - `SpanGuard.h` - RAII wrapper for automatic span lifecycle management
 - `TracingInstrumentation.h` - Macros for conditional instrumentation
 - Protocol Buffer extensions for trace context propagation
 - Module-specific instrumentation (RPC, Consensus, P2P, JobQueue)
+- Remaining modules (PathFinding, TxQ, Validator, etc.) follow the same patterns

 ➡️ **[View all Code Samples](./04-code-samples.md)**

@@ -143,9 +171,11 @@ Complete C++ implementation examples are provided for all telemetry components:

 ## 5. Configuration Reference

+> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring
+
 Configuration is handled through the `[telemetry]` section in `xrpld.cfg` with options for enabling/disabling, exporter selection, endpoint configuration, sampling ratios, and component-level filtering. CMake integration includes a `XRPL_ENABLE_TELEMETRY` option for compile-time control.

-OpenTelemetry Collector configurations are provided for development (with Jaeger) and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup.
+OpenTelemetry Collector configurations are provided for development and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup.

 ➡️ **[View full Configuration Reference](./05-configuration-reference.md)**

@@ -163,7 +193,7 @@ The implementation spans 9 weeks across 5 phases:
 | 4     | Weeks 7-8 | Consensus Tracing   | Round spans, Proposal/validation tracing            |
 | 5     | Week 9    | Documentation       | Runbook, Dashboards, Training                       |

-**Total Effort**: 47 developer-days with 2 developers
+**Total Effort**: 47 person-days (2 developers working in parallel)

 ➡️ **[View full Implementation Phases](./06-implementation-phases.md)**

@@ -171,7 +201,9 @@ The implementation spans 9 weeks across 5 phases:

 ## 7. Observability Backends

-For development and testing, Jaeger provides easy setup with a good UI. For production deployments, Grafana Tempo is recommended for its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure.
+> **APM** = Application Performance Monitoring | **GCS** = Google Cloud Storage
+
+Grafana Tempo is recommended for all environments due to its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure.

 The recommended production architecture uses a gateway collector pattern with regional collectors performing tail-based sampling, routing traces to multiple backends (Tempo for primary storage, Elastic for log correlation, S3/GCS for long-term archive).

@@ -187,4 +219,12 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re

 ---

+## POC Task List
+
+A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana.
+
+➡️ **[View POC Task List](./POC_taskList.md)**
+
+---
+
 _This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._
--- a/OpenTelemetryPlan/POC_taskList.md
+++ b/OpenTelemetryPlan/POC_taskList.md
@@ -1,6 +1,6 @@
 # OpenTelemetry POC Task List

-> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Jaeger, viewable in a browser UI.
+> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Tempo, viewable in Grafana.
 >
 > **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.

@@ -15,28 +15,29 @@
 | [04-code-samples.md](./04-code-samples.md)                       | Telemetry interface (§4.1), SpanGuard (§4.2), macros (§4.3), RPC instrumentation (§4.5.3)                                                                 |
 | [05-configuration-reference.md](./05-configuration-reference.md) | rippled config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
 | [06-implementation-phases.md](./06-implementation-phases.md)     | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11)                                                       |
-| [07-observability-backends.md](./07-observability-backends.md)   | Jaeger dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3)                                                                                  |
+| [07-observability-backends.md](./07-observability-backends.md)   | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3)                                                                                   |

 ---

 ## Task 0: Docker Observability Stack Setup

+> **OTLP** = OpenTelemetry Protocol
+
 **Objective**: Stand up the backend infrastructure to receive, store, and display traces.

 **What to do**:

 - Create `docker/telemetry/docker-compose.yml` in the repo with three services:
-  1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:latest`)
+  1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:0.92.0`)
     - Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP)
     - Expose port `13133` (health check)
     - Mount a config file `docker/telemetry/otel-collector-config.yaml`
-  2. **Jaeger** (`jaegertracing/all-in-one:latest`)
-     - Expose port `16686` (UI) and `14250` (gRPC collector)
-     - Set env `COLLECTOR_OTLP_ENABLED=true`
+  2. **Tempo** (`grafana/tempo:2.6.1`)
+     - Expose ports `3200` (HTTP API) and `4317` (OTLP gRPC, internal)
  3. **Grafana** (`grafana/grafana:latest`) — optional but useful
     - Expose port `3000`
     - Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`)
-     - Provision Jaeger as a data source via `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`
+     - Provision Tempo as a data source via `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`

 - Create `docker/telemetry/otel-collector-config.yaml`:

@@ -57,8 +58,8 @@
  exporters:
    logging:
      verbosity: detailed
-    otlp/jaeger:
-      endpoint: jaeger:4317
+    otlp/tempo:
+      endpoint: tempo:4317
      tls:
        insecure: true

@@ -67,30 +68,29 @@
      traces:
        receivers: [otlp]
        processors: [batch]
-        exporters: [logging, otlp/jaeger]
+        exporters: [logging, otlp/tempo]
  ```

-- Create Grafana Jaeger datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`:
+- Create Grafana Tempo datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`:
  ```yaml
  apiVersion: 1
  datasources:
-    - name: Jaeger
-      type: jaeger
+    - name: Tempo
+      type: tempo
      access: proxy
-      url: http://jaeger:16686
+      url: http://tempo:3200
  ```

 **Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then:

 - `curl http://localhost:13133` returns healthy (Collector)
-- `http://localhost:16686` opens Jaeger UI (no traces yet)
-- `http://localhost:3000` opens Grafana (optional)
+- `http://localhost:3000` opens Grafana (Tempo datasource available, no traces yet)

 **Reference**:

-- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Jaeger exporter)
+- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Tempo exporter)
 - [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment
-- [07-observability-backends.md §7.1](./07-observability-backends.md) — Jaeger quick start and backend selection
+- [07-observability-backends.md §7.1](./07-observability-backends.md) — Tempo quick start and backend selection
 - [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards

 ---
@@ -175,6 +175,8 @@

 ## Task 3: Implement OTel-Backed Telemetry

+> **OTLP** = OpenTelemetry Protocol
+
 **Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.

 **What to do**:
@@ -183,7 +185,7 @@
  - `class TelemetryImpl : public Telemetry` that:
    - In `start()`: creates a `TracerProvider` with:
      - Resource attributes: `service.name`, `service.version`, `service.instance.id`
-      - An `OtlpGrpcExporter` pointed at `setup.exporterEndpoint` (default `localhost:4317`)
+      - An `OtlpHttpExporter` pointed at `setup.exporterEndpoint` (default `localhost:4318`)
      - A `BatchSpanProcessor` with configurable batch size and delay
      - A `TraceIdRatioBasedSampler` using `setup.samplingRatio`
    - Sets the global `TracerProvider`
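
To make the ratio-based sampling step concrete, here is a minimal sketch of how a head sampler can derive its decision from the trace id alone. It is not the SDK's `TraceIdRatioBasedSampler` source; `thresholdForRatio` and `shouldSample` are hypothetical names, and reading the first 8 bytes as a big-endian integer is an assumption made here so that the result does not depend on host byte order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: map a ratio in [0, 1] to an inclusive 64-bit threshold.
inline std::uint64_t thresholdForRatio(double ratio)
{
    if (ratio <= 0.0)
        return 0;
    if (ratio >= 1.0)
        return UINT64_MAX;
    return static_cast<std::uint64_t>(ratio * static_cast<double>(UINT64_MAX));
}

// Decide from the trace id alone, so the same trace id always yields the
// same verdict for a given ratio.
inline bool shouldSample(std::array<std::uint8_t, 16> const& traceId, double ratio)
{
    if (ratio <= 0.0)
        return false;
    if (ratio >= 1.0)
        return true;
    // Interpret the first 8 bytes as a big-endian unsigned integer.
    std::uint64_t prefix = 0;
    for (std::size_t i = 0; i < 8; ++i)
        prefix = (prefix << 8) | traceId[i];
    return prefix <= thresholdForRatio(ratio);
}
```

With `setup.samplingRatio` at `0.1`, roughly one in ten uniformly random trace ids would fall under the threshold; because the input is the trace id rather than a fresh random draw, repeated evaluations for the same trace agree.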
@@ -316,6 +318,8 @@

 ## Task 6: Instrument RPC ServerHandler

+> **WS** = WebSocket
+
 **Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.

 **What to do**:
@@ -338,7 +342,7 @@
  rpc.request
    └── rpc.process
  ```
-  in Jaeger for every HTTP RPC call.
+  in Tempo/Grafana for every HTTP RPC call.

 **Key modified file**:

@@ -372,7 +376,7 @@
    - On success: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");`
    - On error: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");` and set the error message

-- After this, traces in Jaeger should look like:
+- After this, traces in Tempo/Grafana should look like:
  ```
  rpc.request  (xrpl.rpc.command=account_info)
    └── rpc.process
@@ -396,7 +400,9 @@

 ## Task 8: Build, Run, and Verify End-to-End

-**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Jaeger displays them.
+> **OTLP** = OpenTelemetry Protocol
+
+**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.

 **What to do**:

@@ -453,10 +459,10 @@
     -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
   ```

-6. **Verify in Jaeger**:
-   - Open `http://localhost:16686`
-   - Select service `rippled` from the dropdown
-   - Click "Find Traces"
+6. **Verify in Grafana (Tempo)**:
+   - Open `http://localhost:3000`
+   - Navigate to Explore → select Tempo datasource
+   - Search for service `rippled`
   - Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
   - Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`

@@ -470,7 +476,7 @@
 - [ ] Docker stack starts without errors
 - [ ] rippled builds with `-DXRPL_ENABLE_TELEMETRY=ON`
 - [ ] rippled starts and connects to OTel Collector (check rippled logs for telemetry messages)
-- [ ] Traces appear in Jaeger UI under service "rippled"
+- [ ] Traces appear in Grafana/Tempo under service "rippled"
 - [ ] Span hierarchy is correct (parent-child relationships)
 - [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
 - [ ] Error spans show error status and message
@@ -479,8 +485,8 @@

 **Reference**:

-- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Jaeger, config validation passes
-- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
+- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Tempo, config validation passes
+- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md#6112-phase-2-rpc-tracing) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
 - [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2%
 - [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary
 - [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect
@@ -489,11 +495,13 @@

 ## Task 9: Document POC Results and Next Steps

+> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket
+
 **Objective**: Capture findings, screenshots, and remaining work for the team.

 **What to do**:

-- Take screenshots of Jaeger showing:
+- Take screenshots of Grafana/Tempo showing:
  - The service list with "rippled"
  - A trace with the full span tree
  - Span detail view showing attributes
@@ -541,9 +549,11 @@

 ## Next Steps (Post-POC)

+> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket
+
 ### Metrics Pipeline for Grafana Dashboards

-The current POC exports **traces only**. Grafana's Explore view can query Jaeger for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this:
+The current POC exports **traces only**. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this:

 1. **Add a `spanmetrics` connector** to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:

@@ -566,7 +576,7 @@ The current POC exports **traces only**. Grafana's Explore view can query Jaeger
       traces:
         receivers: [otlp]
         processors: [batch]
-         exporters: [debug, otlp/jaeger, spanmetrics]
+         exporters: [debug, otlp/tempo, spanmetrics]
       metrics:
         receivers: [spanmetrics]
         exporters: [prometheus]
--- a/OpenTelemetryPlan/presentation.md
+++ b/OpenTelemetryPlan/presentation.md
@@ -4,6 +4,8 @@

 ## Slide 1: Introduction

+> **CNCF** = Cloud Native Computing Foundation
+
 ### What is OpenTelemetry?

 OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs.
@@ -25,12 +27,21 @@ flowchart LR
    style D fill:#e65100,stroke:#bf360c,color:#fff
 ```

+**Reading the diagram:**
+
+- **Node A (blue, leftmost)**: The originating node that first receives the transaction and assigns a new `trace_id: abc123`; this ID becomes the correlation key for the entire distributed trace.
+- **Node B and Node C (green, middle)**: Relay and validation nodes — each creates its own span but carries the same `trace_id`, so their work is linked to the original submission without any central coordinator.
+- **Node D (orange, rightmost)**: The final node that applies the transaction to the ledger; the trace now spans the full lifecycle from submission to ledger inclusion.
+- **Left-to-right flow**: The horizontal progression shows the real-world message path — a transaction hops from node to node, and the shared `trace_id` stitches all hops into a single queryable trace.
+
 > **Trace ID: abc123** — All nodes share the same trace, enabling cross-node correlation.

 ---

 ## Slide 2: OpenTelemetry vs Open Source Alternatives

+> **CNCF** = Cloud Native Computing Foundation
+
 | Feature             | OpenTelemetry    | Jaeger           | Zipkin             | SkyWalking | Pinpoint   | Prometheus |
 | ------------------- | ---------------- | ---------------- | ------------------ | ---------- | ---------- | ---------- |
 | **Tracing**         | YES              | YES              | YES                | YES        | YES        | NO         |
@@ -42,11 +53,131 @@ flowchart LR
 | **Backend**         | Any (exporters)  | Self             | Self               | Self       | Self       | Self       |
 | **CNCF Status**     | Incubating       | Graduated        | NO                 | Incubating | NO         | Graduated  |

-> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Jaeger, Prometheus, Grafana, or any commercial backend without changing instrumentation.
+> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Tempo, Prometheus, Grafana, or any commercial backend without changing instrumentation.

 ---

-## Slide 3: Comparison with rippled's Existing Solutions
+## Slide 3: Adoption Scope — Traces Only (Current Plan)
+
+OpenTelemetry supports three signal types: **Traces**, **Metrics**, and **Logs**. rippled already captures metrics (StatsD via Beast Insight) and logs (Journal/PerfLog). The question is: how much of OTel do we adopt?
+
+> **Scenario A**: Add distributed tracing. Keep StatsD for metrics and Journal for logs.
+
+```mermaid
+flowchart LR
+    subgraph rippled["rippled Process"]
+        direction TB
+        OTel["OTel SDK<br/>(Traces)"]
+        Insight["Beast Insight<br/>(StatsD Metrics)"]
+        Journal["Journal + PerfLog<br/>(Logging)"]
+    end
+
+    OTel -->|"OTLP"| Collector["OTel Collector"]
+    Insight -->|"UDP"| StatsD["StatsD Server"]
+    Journal -->|"File I/O"| LogFile["perf.log / debug.log"]
+
+    Collector --> Tempo["Tempo / Jaeger"]
+    StatsD --> Graphite["Graphite / Grafana"]
+    LogFile --> Loki["Loki (optional)"]
+
+    style rippled fill:#424242,stroke:#212121,color:#fff
+    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style Insight fill:#1565c0,stroke:#0d47a1,color:#fff
+    style Journal fill:#e65100,stroke:#bf360c,color:#fff
+    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
+```
+
+| Aspect                         | Details                                                                                                         |
+| ------------------------------ | --------------------------------------------------------------------------------------------------------------- |
+| **What changes for operators** | Deploy OTel Collector + trace backend. Existing StatsD and log pipelines stay as-is.                            |
+| **Codebase impact**            | New `Telemetry` module (~1500 LOC). Beast Insight and Journal untouched.                                        |
+| **New capabilities**           | Cross-node trace correlation, span-based debugging, request lifecycle visibility.                               |
+| **What we still can't do**     | Correlate metrics with specific traces natively. StatsD metrics remain fire-and-forget with no trace exemplars. |
+| **Maintenance burden**         | Three separate observability systems to maintain (OTel + StatsD + Journal).                                     |
+| **Risk**                       | Lowest — additive change, no existing systems disturbed.                                                        |
+
+---
+
+## Slide 4: Future Adoption — Metrics & Logs via OTel
+
+### Scenario B: + OTel Metrics (Replace StatsD)
+
+> Migrate StatsD to OTel Metrics API, exposing Prometheus-compatible metrics. Remove Beast Insight.
+
+```mermaid
+flowchart LR
+    subgraph rippled["rippled Process"]
+        direction TB
+        OTel["OTel SDK<br/>(Traces + Metrics)"]
+        Journal["Journal + PerfLog<br/>(Logging)"]
+    end
+
+    OTel -->|"OTLP"| Collector["OTel Collector"]
+    Journal -->|"File I/O"| LogFile["perf.log / debug.log"]
+
+    Collector --> Tempo["Tempo<br/>(Traces)"]
+    Collector --> Prom["Prometheus<br/>(Metrics)"]
+    LogFile --> Loki["Loki (optional)"]
+
+    style rippled fill:#424242,stroke:#212121,color:#fff
+    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style Journal fill:#e65100,stroke:#bf360c,color:#fff
+    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
+```
+
+- **Better metrics?** Yes — Prometheus gives native histograms (p50/p95/p99), multi-dimensional labels, and exemplars linking metric spikes to traces.
+- **Codebase**: Remove `Beast::Insight` + `StatsDCollector` (~2000 LOC). Single SDK for traces and metrics.
+- **Operator effort**: Rewrite dashboards from StatsD/Graphite queries to PromQL. Run both in parallel during transition.
+- **Risk**: Medium — operators must migrate monitoring infrastructure.
+
+### Scenario C: + OTel Logs (Full Stack)
+
+> Also replace Journal logging with OTel Logs API. Single SDK for everything.
+
+```mermaid
+flowchart LR
+    subgraph rippled["rippled Process"]
+        OTel["OTel SDK<br/>(Traces + Metrics + Logs)"]
+    end
+
+    OTel -->|"OTLP"| Collector["OTel Collector"]
+
+    Collector --> Tempo["Tempo<br/>(Traces)"]
+    Collector --> Prom["Prometheus<br/>(Metrics)"]
+    Collector --> Loki["Loki / Elastic<br/>(Logs)"]
+
+    style rippled fill:#424242,stroke:#212121,color:#fff
+    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
+```
+
+- **Structured logging**: OTel Logs API outputs structured records with `trace_id`, `span_id`, severity, and attributes by design.
+- **Full correlation**: Every log line carries `trace_id`. Click trace → see logs. Click metric spike → see trace → see logs.
+- **Codebase**: Remove Beast Insight (~2000 LOC) + simplify Journal/PerfLog (~3000 LOC). One dependency instead of three.
+- **Risk**: Highest — `beast::Journal` is deeply embedded in every component. Large refactor. OTel C++ Logs API is newer (stable since v1.11, less battle-tested).
+
+### Recommendation
+
+```mermaid
+flowchart LR
+    A["Phase 1<br/><b>Traces Only</b><br/>(Current Plan)"] --> B["Phase 2<br/><b>+ Metrics</b><br/>(Replace StatsD)"] --> C["Phase 3<br/><b>+ Logs</b><br/>(Full OTel)"]
+
+    style A fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style B fill:#1565c0,stroke:#0d47a1,color:#fff
+    style C fill:#e65100,stroke:#bf360c,color:#fff
+```
+
+| Phase                | Signal    | Strategy                                                       | Risk   |
+| -------------------- | --------- | -------------------------------------------------------------- | ------ |
+| **Phase 1** (now)    | Traces    | Add OTel traces. Keep StatsD and Journal. Prove value.         | Low    |
+| **Phase 2** (future) | + Metrics | Migrate StatsD → Prometheus via OTel. Remove Beast Insight.    | Medium |
+| **Phase 3** (future) | + Logs    | Adopt OTel Logs API. Align with structured logging initiative. | High   |
+
+> **Key Takeaway**: Start with traces (unique value, lowest risk), then incrementally adopt metrics and logs as the OTel infrastructure proves itself.
+
+---
+
+## Slide 5: Comparison with rippled's Existing Solutions

 ### Current Observability Stack

@@ -68,11 +199,13 @@ flowchart LR
 | "Which node delayed consensus?"  | ❌      | ❌     | ✅            |
 | "Show TX journey across 5 nodes" | ❌      | ❌     | ✅            |

-> **Key Insight**: OpenTelemetry **complements** (not replaces) existing systems.
+> **Key Insight**: In the **traces-only** approach (Phase 1), OpenTelemetry **complements** existing systems. In future phases, OTel metrics and logs could **replace** StatsD and Journal respectively — see Slides 3-4 for the full adoption roadmap.

 ---

-## Slide 4: Architecture
+## Slide 6: Architecture
+
+> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket

 ### High-Level Integration Architecture

@@ -92,7 +225,6 @@ flowchart TB
    Telemetry -->|OTLP/gRPC| Collector["OTel Collector"]

    Collector --> Tempo["Grafana Tempo"]
-    Collector --> Jaeger["Jaeger"]
    Collector --> Elastic["Elastic APM"]

    style rippled fill:#424242,stroke:#212121,color:#fff
@@ -101,6 +233,14 @@ flowchart TB
    style Collector fill:#e65100,stroke:#bf360c,color:#fff
 ```

+**Reading the diagram:**
+
+- **Core Services (blue, top)**: RPC Server, Overlay, and Consensus are the three primary components that generate trace data — they represent the entry points for client requests, peer messages, and consensus rounds respectively.
+- **Telemetry Module (green, middle)**: The OpenTelemetry SDK sits below the core services and receives span data from all three; it acts as a single collection point within the rippled process.
+- **OTel Collector (orange, center)**: An external process that receives spans over OTLP/gRPC from the Telemetry Module; it decouples rippled from backend choices and handles batching, sampling, and routing.
+- **Backends (bottom row)**: Tempo and Elastic APM are interchangeable — the Collector fans out to any combination, so operators can switch backends without modifying rippled code.
+- **Top-to-bottom flow**: Data flows from instrumented code down through the SDK, out over the network to the Collector, and finally into storage/visualization backends.
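+
+A minimal Collector configuration sketch for the flow described above, assuming the Collector listens for OTLP/gRPC on port 4317 and forwards to a Tempo instance at the hypothetical address `tempo:4317`. Hostnames, ports, and the choice of the `batch` processor are illustrative assumptions, not the project's deployed configuration:
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317 # rippled's Telemetry module exports here (assumed port)
+
+processors:
+  batch: {} # group spans before export; defaults used
+
+exporters:
+  otlp/tempo:
+    endpoint: tempo:4317 # hypothetical Tempo address
+    tls:
+      insecure: true # plaintext, for a local test setup only
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors: [batch]
+      exporters: [otlp/tempo]
+```
+
+Switching or adding a backend then means editing the `exporters` entries in this file; rippled's instrumentation is not touched.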
+
 ### Context Propagation

 ```mermaid
@@ -120,10 +260,12 @@ sequenceDiagram

 ---

-## Slide 5: Implementation Plan
+## Slide 7: Implementation Plan

 ### 5-Phase Rollout (9 Weeks)

+> **Note**: Dates shown are relative to project start, not calendar dates.
+
 ```mermaid
 gantt
    title Implementation Timeline
@@ -158,18 +300,114 @@ gantt

 **Total Effort**: ~47 developer-days (2 developers)

+> **Future Phases** (not in current scope): After traces are stable, OTel metrics can replace StatsD (~3 weeks), and OTel logs can replace Journal (~4 weeks, aligned with structured logging initiative). See Slides 3-4 for the full adoption roadmap.
+
 ---

-## Slide 6: Performance Overhead
+## Slide 8: Performance Overhead
+
+> **OTLP** = OpenTelemetry Protocol

 ### Estimated System Impact

-| Metric            | Overhead   | Notes                               |
-| ----------------- | ---------- | ----------------------------------- |
-| **CPU**           | 1-3%       | Span creation and attribute setting |
-| **Memory**        | 2-5 MB     | Batch buffer for pending spans      |
-| **Network**       | 10-50 KB/s | Compressed OTLP export to collector |
-| **Latency (p99)** | <2%        | With proper sampling configuration  |
+| Metric            | Overhead   | Notes                                            |
+| ----------------- | ---------- | ------------------------------------------------ |
+| **CPU**           | 1-3%       | Span creation and attribute setting              |
+| **Memory**        | ~10 MB     | SDK statics + batch buffer + worker thread stack |
+| **Network**       | 10-50 KB/s | Compressed OTLP export to collector              |
+| **Latency (p99)** | <2%        | With proper sampling configuration               |
+
+#### How We Arrived at These Numbers
+
+**Assumptions (XRPL mainnet baseline)**:
+
+| Parameter                 | Value                  | Source                                                                                              |
+| ------------------------- | ---------------------- | --------------------------------------------------------------------------------------------------- |
+| Transaction throughput    | ~25 TPS (peaks to ~50) | Mainnet average                                                                                     |
+| Default peers per node    | 21                     | `peerfinder/detail/Tuning.h` (`defaultMaxPeers`)                                                    |
+| Consensus round frequency | ~1 round / 3-4 seconds | `ConsensusParms.h` (`ledgerMIN_CONSENSUS=1950ms`)                                                   |
+| Proposers per round       | ~20-35                 | Mainnet UNL size                                                                                    |
+| P2P message rate          | ~160 msgs/sec          | See message breakdown below                                                                         |
+| Avg TX processing time    | ~200 μs                | Profiled baseline                                                                                   |
+| Single span creation cost | 500-1000 ns            | OTel C++ SDK benchmarks (see [3.5.4](./03-implementation-strategy.md#354-performance-data-sources)) |
+
+**P2P message breakdown** (per node, mainnet):
+
+| Message Type  | Rate         | Derivation                                                            |
+| ------------- | ------------ | --------------------------------------------------------------------- |
+| TMTransaction | ~100/sec     | ~25 TPS × ~4 relay hops per TX, deduplicated by HashRouter            |
+| TMValidation  | ~50/sec      | ~35 validators × ~1 validation/3s round ≈ ~12/sec, plus relay fan-out |
+| TMProposeSet  | ~10/sec      | ~35 proposers / 3s round ≈ ~12/sec, clustered in establish phase      |
+| **Total**     | **~160/sec** | **Only traced message types counted**                                 |
+
+**CPU (1-3%) — Calculation**:
+
+Per-transaction tracing cost breakdown:
+
+| Operation                                       | Cost        | Notes                                      |
+| ----------------------------------------------- | ----------- | ------------------------------------------ |
+| `tx.receive` span (create + end + 4 attributes) | ~1400 ns    | ~1000ns create + ~200ns end + 4×50ns attrs |
+| `tx.validate` span                              | ~1200 ns    | ~1000ns create + ~200ns for 2 attributes   |
+| `tx.relay` span                                 | ~1200 ns    | ~1000ns create + ~200ns for 2 attributes   |
+| Context injection into P2P message              | ~200 ns     | Serialize trace_id + span_id into protobuf |
+| **Total per TX**                                | **~4.0 μs** |                                            |
+
+> **CPU overhead**: 4.0 μs / 200 μs baseline = **~2.0% per transaction**. Under high load with consensus + RPC spans overlapping, reaches ~3%. Consensus itself adds only ~36 μs per 3-second round (~0.001%), so the TX path dominates. On production server hardware (3+ GHz Xeon), span creation drops to ~500-600 ns, bringing per-TX cost to ~2.6 μs (~1.3%). See [Section 3.5.4](./03-implementation-strategy.md#354-performance-data-sources) for benchmark sources.
+
+**Memory (~10 MB) — Calculation**:
+
+| Component                                     | Size               | Notes                                 |
+| --------------------------------------------- | ------------------ | ------------------------------------- |
+| TracerProvider + Exporter (gRPC channel init) | ~320 KB            | Allocated once at startup             |
+| BatchSpanProcessor (circular buffer)          | ~16 KB             | 2049 × 8-byte AtomicUniquePtr entries |
+| BatchSpanProcessor (worker thread stack)      | ~8 MB              | Default Linux thread stack size       |
+| Active spans (in-flight, max ~1000)           | ~500-800 KB        | ~500-800 bytes/span × 1000 concurrent |
+| Export queue (batch buffer, max 2048 spans)   | ~1 MB              | ~500 bytes/span × 2048 queue depth    |
+| Thread-local context storage (~100 threads)   | ~6.4 KB            | ~64 bytes/thread                      |
+| **Total**                                     | **~10 MB ceiling** |                                       |
+
+> Memory plateaus once the export queue fills — the `max_queue_size=2048` config bounds growth.
+> The worker thread stack (~8 MB) dominates the static footprint but is virtual memory; actual RSS
+> depends on stack usage (typically much less). Active spans are larger than originally estimated
+> (~500-800 bytes) because the OTel SDK `Span` object includes a mutex (~40 bytes), `SpanData`
+> recordable (~250 bytes base), and `std::map`-based attribute storage (~200-500 bytes for 3-5
+> string attributes). See [Section 3.5.4](./03-implementation-strategy.md#354-performance-data-sources) for source references.
+
+**Network (10-50 KB/s) — Calculation**:
+
+Two sources of network overhead:
+
+**(A) OTLP span export to Collector:**
+
+| Sampling Rate              | Effective Spans/sec | Avg Span Size (compressed) | Bandwidth    |
+| -------------------------- | ------------------- | -------------------------- | ------------ |
+| 100% (dev only)            | ~500                | ~500 bytes                 | ~250 KB/s    |
+| **10% (recommended prod)** | **~50**             | **~500 bytes**             | **~25 KB/s** |
+| 1% (minimal)               | ~5                  | ~500 bytes                 | ~2.5 KB/s    |
+
+> The ~500 spans/sec at 100% is an upper estimate: the itemized sources (~100 TX spans + ~160 P2P context spans + ~23 consensus spans/round + ~50 RPC spans) sum to roughly 330/sec. OTLP protobuf with gzip compression yields ~500 bytes/span average.
+
+**(B) P2P trace context overhead** (added to existing messages, always-on regardless of sampling):
+
+| Message Type  | Rate     | Context Size | Bandwidth     |
+| ------------- | -------- | ------------ | ------------- |
+| TMTransaction | ~100/sec | 29 bytes     | ~2.9 KB/s     |
+| TMValidation  | ~50/sec  | 29 bytes     | ~1.5 KB/s     |
+| TMProposeSet  | ~10/sec  | 29 bytes     | ~0.3 KB/s     |
+| **Total P2P** |          |              | **~4.7 KB/s** |
+
+> **Combined**: 25 KB/s (OTLP export at 10%) + 5 KB/s (P2P context) ≈ **~30 KB/s typical**. The 10-50 KB/s range covers 10-20% sampling under normal to peak mainnet load.
+
+**Latency (<2%) — Calculation**:
+
+| Path                           | Tracing Cost | Baseline | Overhead |
+| ------------------------------ | ------------ | -------- | -------- |
+| Fast RPC (e.g., `server_info`) | 2.75 μs      | ~1 ms    | 0.275%   |
+| Slow RPC (e.g., `path_find`)   | 2.75 μs      | ~100 ms  | 0.003%   |
+| Transaction processing         | 4.0 μs       | ~200 μs  | 2.0%     |
+| Consensus round                | 36 μs        | ~3 sec   | 0.001%   |
+
+> Even the worst case (TX processing at ~2.0%) sits at the edge of the <2% latency target and inside the 1-3% CPU range. RPC and consensus overhead are negligible. On production hardware, TX overhead drops to ~1.3%.

 ### Per-Message Overhead (Context Propagation)

@@ -179,20 +417,20 @@ Each P2P message carries trace context with the following overhead:
 | ------------- | ------------- | ----------------------------------------- |
 | `trace_id`    | 16 bytes      | Unique identifier for the entire trace    |
 | `span_id`     | 8 bytes       | Current span (becomes parent on receiver) |
-| `trace_flags` | 4 bytes       | Sampling decision flags                   |
+| `trace_flags` | 1 byte        | Sampling decision flags                   |
 | `trace_state` | 0-4 bytes     | Optional vendor-specific data             |
-| **Total**     | **~32 bytes** | **Added per traced P2P message**          |
+| **Total**     | **~29 bytes** | **Added per traced P2P message**          |

 ```mermaid
 flowchart LR
    subgraph msg["P2P Message with Trace Context"]
-        A["Original Message<br/>(variable size)"] --> B["+ TraceContext<br/>(~32 bytes)"]
+        A["Original Message<br/>(variable size)"] --> B["+ TraceContext<br/>(~29 bytes)"]
    end

    subgraph breakdown["Context Breakdown"]
        C["trace_id<br/>16 bytes"]
        D["span_id<br/>8 bytes"]
-        E["flags<br/>4 bytes"]
+        E["flags<br/>1 byte"]
        F["state<br/>0-4 bytes"]
    end

@@ -206,7 +444,14 @@ flowchart LR
    style F fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

-> **Note**: 32 bytes is negligible compared to typical transaction messages (hundreds to thousands of bytes)
+**Reading the diagram:**
+
+- **Original Message (gray, left)**: The existing P2P message payload of variable size — this is unchanged; trace context is appended, never modifying the original data.
+- **+ TraceContext (green, right of message)**: The additional 29-byte context block attached to each traced message; the arrow from the original message shows it is a pure addition.
+- **Context Breakdown (right subgraph)**: The four fields — `trace_id` (16 bytes), `span_id` (8 bytes), `flags` (1 byte), and `state` (0-4 bytes) — show exactly what is added and their individual sizes.
+- **Color coding**: Blue fields (`trace_id`, `span_id`) are the core identifiers required for trace correlation; orange (`flags`) controls sampling decisions; purple (`state`) is optional vendor data typically omitted.
+
+> **Note**: 29 bytes represents ~0.6-6% overhead depending on message size (29/5000 ≈ 0.6% for a 5KB proposal, 29/500 ≈ 5.8% for a 500B simple TX), which is acceptable for the observability benefits provided.

 ### Mitigation Strategies

@@ -220,6 +465,8 @@ flowchart LR
    style D fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

+> For a detailed explanation of head vs. tail sampling, see Slide 9.
+
 ### Kill Switches (Rollback Options)

 1. **Config Disable**: Set `enabled=0` in config → instant disable, no restart needed for sampling
@@ -228,18 +475,157 @@ flowchart LR

 ---

-## Slide 7: Data Collection & Privacy
+## Slide 9: Sampling Strategies — Head vs. Tail
+
+> Sampling controls **which traces are recorded and exported**. Without sampling, every operation generates a trace — at 500+ spans/sec, this overwhelms storage and network. Sampling lets you keep the signal, discard the noise.
+
+### Head Sampling (Decision at Start)
+
+The sampling decision is made **when a trace begins**, before any work is done. A random number is generated; if it falls within the configured ratio, the entire trace is recorded. Otherwise, the trace is silently dropped.
+
+```mermaid
+flowchart LR
+    A["New Request<br/>Arrives"] --> B{"Random < 10%?"}
+    B -->|"Yes (1 in 10)"| C["Record Entire Trace<br/>(all spans)"]
+    B -->|"No (9 in 10)"| D["Drop Entire Trace<br/>(near-zero overhead)"]
+
+    style C fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style D fill:#c62828,stroke:#8c2809,color:#fff
+    style B fill:#1565c0,stroke:#0d47a1,color:#fff
+```
+
+| Aspect                        | Details                                                                                                                                                                                                  |
+| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Where it runs**             | Inside rippled (SDK-level). Configured via `sampling_ratio` in `rippled.cfg`.                                                                                                                            |
+| **When the decision happens** | At trace creation time — before the first span is even populated.                                                                                                                                        |
+| **How it works**              | `sampling_ratio=0.1` means each trace has a 10% probability of being recorded. Dropped traces incur near-zero overhead (no spans created, no attributes set, no export).                                 |
+| **Propagation**               | Once a trace is sampled, the `trace_flags` field (1 byte in the context header) tells downstream nodes to also sample it. Unsampled traces propagate `trace_flags=0`, so downstream nodes skip them too. |
+| **Pros**                      | Lowest overhead. Simple to configure. Predictable resource usage.                                                                                                                                        |
+| **Cons**                      | **Blind** — it doesn't know if the trace will be interesting. A rare error or slow consensus round has only a 10% chance of being captured.                                                              |
+| **Best for**                  | High-volume, steady-state traffic where most traces look similar (e.g., routine RPC requests).                                                                                                           |
+
+**rippled configuration**:
+
+```ini
+[telemetry]
+# Record 10% of traces (recommended for production)
+sampling_ratio=0.1
+```
+
+### Tail Sampling (Decision at End)
+
+The sampling decision is made **after the trace completes**, based on its actual content — was it slow? Did it error? Was it a consensus round? This requires buffering complete traces before deciding.
+
+```mermaid
+flowchart TB
+    A["All Traces<br/>Buffered (100%)"] --> B["OTel Collector<br/>Evaluates Rules"]
+
+    B --> C{"Error?"}
+    C -->|Yes| K["KEEP"]
+
+    C -->|No| D{"Slow?<br/>(>5s consensus,<br/>>1s RPC)"}
+    D -->|Yes| K
+
+    D -->|No| E{"Random < 10%?"}
+    E -->|Yes| K
+    E -->|No| F["DROP"]
+
+    style K fill:#2e7d32,stroke:#1b5e20,color:#fff
+    style F fill:#c62828,stroke:#8c2809,color:#fff
+    style B fill:#1565c0,stroke:#0d47a1,color:#fff
+    style C fill:#e65100,stroke:#bf360c,color:#fff
+    style D fill:#e65100,stroke:#bf360c,color:#fff
+    style E fill:#4a148c,stroke:#2e0d57,color:#fff
+```
+
+| Aspect                        | Details                                                                                                                                                                                                   |
+| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Where it runs**             | In the **OTel Collector** (external process), not inside rippled. rippled exports 100% of traces; the Collector decides what to keep.                                                                     |
+| **When the decision happens** | After the Collector has received all spans for a trace (waits `decision_wait=10s` for stragglers).                                                                                                        |
+| **How it works**              | Policy rules evaluate the completed trace: keep all errors, keep slow operations above a threshold (>5s consensus, >1s RPC), then probabilistically sample the rest at 10%.                               |
+| **Pros**                      | **Never misses important traces**. Errors, slow requests, and consensus anomalies are always captured regardless of probability.                                                                          |
+| **Cons**                      | Higher resource usage — rippled must export 100% of spans to the Collector, which buffers them in memory before deciding. The Collector needs more RAM (configured via `num_traces` and `decision_wait`). |
+| **Best for**                  | Production troubleshooting where you can't afford to miss errors or anomalies.                                                                                                                            |
+
+**Collector configuration** (tail sampling rules for rippled):
+
+```yaml
+processors:
+  tail_sampling:
+    decision_wait: 10s # Wait for all spans in a trace
+    num_traces: 100000 # Buffer up to 100K concurrent traces
+    policies:
+      - name: errors # Always keep error traces
+        type: status_code
+        status_code: { status_codes: [ERROR] }
+
+      - name: slow-consensus # Keep consensus rounds >5s
+        type: latency
+        latency: { threshold_ms: 5000 }
+
+      - name: slow-rpc # Keep slow RPC requests >1s
+        type: latency
+        latency: { threshold_ms: 1000 }
+
+      - name: probabilistic # Sample 10% of everything else
+        type: probabilistic
+        probabilistic: { sampling_percentage: 10 }
+```
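+
+As written, the two latency policies are not tied to a span type: the processor keeps a trace when any policy matches, so the 1s threshold also applies to consensus traces. One way to scope a threshold is an `and` policy that pairs the latency check with an attribute match. The sketch below assumes consensus spans carry a hypothetical `xrpl.component` attribute with the value `consensus`; that attribute name is illustrative and not taken from the plan:
+
+```yaml
+processors:
+  tail_sampling:
+    policies:
+      - name: slow-consensus # consensus traces only, >5s
+        type: and
+        and:
+          and_sub_policy:
+            - name: is-consensus
+              type: string_attribute
+              string_attribute:
+                key: xrpl.component # hypothetical attribute name
+                values: [consensus]
+            - name: over-5s
+              type: latency
+              latency: { threshold_ms: 5000 }
+```
+
+The same pattern, with a different attribute value and a 1000 ms threshold, would scope the RPC rule.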
+
+### Head vs. Tail — Side-by-Side
+
+|                               | Head Sampling                            | Tail Sampling                                    |
+| ----------------------------- | ---------------------------------------- | ------------------------------------------------ |
+| **Decision point**            | Trace start (inside rippled)             | Trace end (in OTel Collector)                    |
+| **Knows trace content?**      | No (random coin flip)                    | Yes (evaluates completed trace)                  |
+| **Overhead on rippled**       | Lowest (dropped traces = no-op)          | Higher (must export 100% to Collector)           |
+| **Collector resource usage**  | Low (receives only sampled traces)       | Higher (buffers all traces before deciding)      |
+| **Captures all errors?**      | No (only if trace was randomly selected) | **Yes** (error policy catches them)              |
+| **Captures slow operations?** | No (random)                              | **Yes** (latency policy catches them)            |
+| **Configuration**             | `rippled.cfg`: `sampling_ratio=0.1`      | `otel-collector.yaml`: `tail_sampling` processor |
+| **Best for**                  | High-throughput steady-state             | Troubleshooting & anomaly detection              |
+
+### Recommended Strategy for rippled
+
+Use **both** in a layered approach:
+
+```mermaid
+flowchart LR
+    subgraph rippled["rippled (Head Sampling)"]
+        HS["sampling_ratio=1.0<br/>(export everything)"]
+    end
+
+    subgraph collector["OTel Collector (Tail Sampling)"]
+        TS["Keep: errors + slow + 10% random<br/>Drop: routine traces"]
+    end
+
+    subgraph storage["Backend Storage"]
+        ST["Only interesting traces<br/>stored long-term"]
+    end
+
+    rippled -->|"100% of spans"| collector -->|"~15-20% kept"| storage
+
+    style rippled fill:#424242,stroke:#212121,color:#fff
+    style collector fill:#1565c0,stroke:#0d47a1,color:#fff
+    style storage fill:#2e7d32,stroke:#1b5e20,color:#fff
+```
+
+> **Why this works**: rippled exports everything (no blind drops), the Collector applies intelligent filtering (keep errors/slow/anomalies, sample the rest), and only ~15-20% of traces reach storage. If Collector resource usage becomes a concern, add head sampling at `sampling_ratio=0.5` to halve the export volume while still giving the Collector enough data for good tail-sampling decisions.
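+
+Because the Collector buffers every trace until `decision_wait` elapses, a `memory_limiter` processor is commonly placed ahead of `tail_sampling`. A sketch follows; the limits are placeholder assumptions to be sized for the actual host, and the `otlp` receiver and exporter are assumed to be defined elsewhere in the same file:
+
+```yaml
+processors:
+  memory_limiter:
+    check_interval: 1s
+    limit_mib: 2048 # assumed budget; size to the Collector host
+    spike_limit_mib: 512 # assumed headroom for bursts
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors: [memory_limiter, tail_sampling, batch]
+      exporters: [otlp]
+```
+
+Placing `memory_limiter` first lets the Collector refuse new data under memory pressure before the tail-sampling buffer grows further.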
+
+---
+
+## Slide 10: Data Collection & Privacy

 ### What Data is Collected

-| Category        | Attributes Collected                                                               | Purpose                     |
-| --------------- | ---------------------------------------------------------------------------------- | --------------------------- |
-| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index`                        | Trace transaction lifecycle |
-| **Consensus**   | `round`, `phase`, `mode`, `proposers`(public key or public node id), `duration_ms` | Analyze consensus timing    |
-| **RPC**         | `command`, `version`, `status`, `duration_ms`                                      | Monitor RPC performance     |
-| **Peer**        | `peer.id`(public key), `latency_ms`, `message.type`, `message.size`                | Network topology analysis   |
-| **Ledger**      | `ledger.hash`, `ledger.index`, `close_time`, `tx_count`                            | Ledger progression tracking |
-| **Job**         | `job.type`, `queue_ms`, `worker`                                                   | JobQueue performance        |
+| Category        | Attributes Collected                                                                 | Purpose                     |
+| --------------- | ------------------------------------------------------------------------------------ | --------------------------- |
+| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index`                          | Trace transaction lifecycle |
+| **Consensus**   | `round`, `phase`, `mode`, `proposers` (count of proposing validators), `duration_ms` | Analyze consensus timing    |
+| **RPC**         | `command`, `version`, `status`, `duration_ms`                                        | Monitor RPC performance     |
+| **Peer**        | `peer.id` (public key), `latency_ms`, `message.type`, `message.size`                 | Network topology analysis   |
+| **Ledger**      | `ledger.hash`, `ledger.index`, `close_time`, `tx_count`                              | Ledger progression tracking |
+| **Job**         | `job.type`, `queue_ms`, `worker`                                                     | JobQueue performance        |

 ### What is NOT Collected (Privacy Guarantees)

@@ -263,6 +649,13 @@ flowchart LR
    style F fill:#c62828,stroke:#8c2809,color:#fff
 ```

+**Reading the diagram:**
+
+- **NOT Collected (top row, red)**: Private Keys, Account Balances, and Transaction Amounts are explicitly excluded — these are financial/security-sensitive fields that telemetry never touches.
+- **Also Excluded (bottom row, red)**: IP Addresses (configurable per deployment), Personal Data, and Raw TX Payloads are also excluded — these protect operator and user privacy.
+- **All-red styling**: Every box is styled in red to signal that these fields are excluded by default; apart from IP addresses, whose handling is configurable per deployment, the telemetry system has no code path to collect them.
+- **Two-row layout**: The split between "NOT Collected" and "Also Excluded" distinguishes between financial data (top) and operational/personal data (bottom), making the privacy boundaries clear to auditors.
+
 ### Privacy Protection Mechanisms

 | Mechanism                  | Description                                                   |
--- a/cspell.config.yaml
+++ b/cspell.config.yaml
@@ -276,6 +276,7 @@ words:
  - txjson
  - txn
  - txns
+  - txqueue
  - txs
  - UBSAN
  - ubsan