diff --git a/OpenTelemetryPlan/00-tracing-fundamentals.md b/OpenTelemetryPlan/00-tracing-fundamentals.md new file mode 100644 index 0000000000..e623ea351c --- /dev/null +++ b/OpenTelemetryPlan/00-tracing-fundamentals.md @@ -0,0 +1,239 @@ +# Distributed Tracing Fundamentals + +> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) +> **Next**: [Architecture Analysis](./01-architecture-analysis.md) + +--- + +## What is Distributed Tracing? + +Distributed tracing is a method for tracking data objects as they flow through distributed systems. In a network like XRP Ledger, a single transaction touches multiple independent nodes—each with no shared memory or logging. Distributed tracing connects these dots. + +**Without tracing:** You see isolated logs on each node with no way to correlate them. + +**With tracing:** You see the complete journey of a transaction or an event across all nodes it touched. + +--- + +## Core Concepts + +### 1. Trace + +A **trace** represents the entire journey of a request through the system. It has a unique `trace_id` that stays constant across all nodes. + +``` +Trace ID: abc123 +├── Node A: received transaction +├── Node B: relayed transaction +├── Node C: included in consensus +└── Node D: applied to ledger +``` + +### 2. Span + +A **span** represents a single unit of work within a trace. Each span has: + +| Attribute | Description | Example | +| ---------------- | --------------------- | -------------------------- | +| `trace_id` | Links to parent trace | `abc123` | +| `span_id` | Unique identifier | `span456` | +| `parent_span_id` | Parent span (if any) | `p_span123` | +| `name` | Operation name | `rpc.submit` | +| `start_time` | When work began | `2024-01-15T10:30:00Z` | +| `end_time` | When work completed | `2024-01-15T10:30:00.050Z` | +| `attributes` | Key-value metadata | `tx.hash=ABC...` | +| `status` | OK, ERROR MSG | `OK` | + +### 3. 
Trace Context + +**Trace context** is the data that propagates between services to link spans together. It contains: + +- `trace_id` - The trace this span belongs to +- `span_id` - The current span (becomes parent for child spans) +- `trace_flags` - Sampling decisions + +--- + +## How Spans Form a Trace + +Spans have parent-child relationships forming a tree structure: + +```mermaid +flowchart TB + subgraph trace["Trace: abc123"] + A["tx.submit
span_id: 001
50ms"] --> B["tx.validate
span_id: 002
5ms"] + A --> C["tx.relay
span_id: 003
10ms"] + A --> D["tx.apply
span_id: 004
30ms"] + D --> E["ledger.update
span_id: 005
20ms"] + end + + style A fill:#0d47a1,stroke:#082f6a,color:#ffffff + style B fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style C fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style D fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style E fill:#bf360c,stroke:#8c2809,color:#ffffff +``` + +The same trace visualized as a **timeline (Gantt chart)**: + +``` +Time → 0ms 10ms 20ms 30ms 40ms 50ms + ├───────────────────────────────────────────┤ +tx.submit│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│ + ├─────┤ +tx.valid │▓▓▓▓▓│ + │ ├──────────┤ +tx.relay │ │▓▓▓▓▓▓▓▓▓▓│ + │ ├────────────────────────────┤ +tx.apply │ │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│ + │ ├──────────────────┤ +ledger │ │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│ +``` + +--- + +## Distributed Traces Across Nodes + +In distributed systems like rippled, traces span **multiple independent nodes**. The trace context must be propagated in network messages: + +```mermaid +sequenceDiagram + participant Client + participant NodeA as Node A + participant NodeB as Node B + participant NodeC as Node C + + Client->>NodeA: Submit TX
(no trace context) + + Note over NodeA: Creates new trace
trace_id: abc123
span: tx.receive + + NodeA->>NodeB: Relay TX
(trace_id: abc123, parent: 001) + + Note over NodeB: Creates child span
span: tx.relay
parent_span_id: 001 + + NodeA->>NodeC: Relay TX
(trace_id: abc123, parent: 001) + + Note over NodeC: Creates child span
span: tx.relay
parent_span_id: 001 + + Note over NodeA,NodeC: All spans share trace_id: abc123
enabling correlation across nodes +``` + +--- + +## Context Propagation + +For traces to work across nodes, **trace context must be propagated** in messages. + +### What's in the Context (25 bytes plus optional `trace_state`) + +| Field | Size | Description | +| ------------- | ---------- | ------------------------------------------------------- | +| `trace_id` | 16 bytes | Identifies the entire trace (constant across all nodes) | +| `span_id` | 8 bytes | The sender's current span (becomes parent on receiver) | +| `trace_flags` | 1 byte | Sampling decision flags | +| `trace_state` | variable | Optional vendor-specific data | + +### How span_id Changes at Each Hop + +Only **one** `span_id` travels in the context - the sender's current span. Each node: +1. Extracts the received `span_id` and uses it as the `parent_span_id` +2. Creates a **new** `span_id` for its own span +3. Sends its own `span_id` as the parent when forwarding + +``` +Node A Node B Node C +────── ────── ────── + +Span AAA Span BBB Span CCC + │ │ │ + ▼ ▼ ▼ +Context out: Context out: Context out: +├─ trace_id: abc123 ├─ trace_id: abc123 ├─ trace_id: abc123 +├─ span_id: AAA ──────────► ├─ span_id: BBB ──────────► ├─ span_id: CCC ──────► +└─ flags: 01 └─ flags: 01 └─ flags: 01 + │ │ + parent = AAA parent = BBB +``` + +The `trace_id` stays constant, but `span_id` **changes at every hop** to maintain the parent-child chain. + +### Propagation Formats + +There are two patterns: + +### HTTP/RPC Headers (W3C Trace Context) + +``` +traceparent: 00-abc123def456-span789-01 + │ │ │ │ + │ │ │ └── Flags (sampled) + │ │ └── Parent span ID + │ └── Trace ID + └── Version +``` + +### Protocol Buffers (rippled P2P messages) + +```protobuf +message TMTransaction { + bytes rawTransaction = 1; + // ... existing fields ... + + // Trace context extension + bytes trace_parent = 100; // W3C traceparent + bytes trace_state = 101; // W3C tracestate +} +``` + +--- + +## Sampling + +Not every trace needs to be recorded. 
**Sampling** reduces overhead: + +### Head Sampling (at trace start) +``` +Request arrives → Random 10% chance → Record or skip entire trace +``` +- ✅ Low overhead +- ❌ May miss interesting traces + +### Tail Sampling (after trace completes) +``` +Trace completes → Collector evaluates: + - Error? → KEEP + - Slow? → KEEP + - Normal? → Sample 10% +``` +- ✅ Never loses important traces +- ❌ Higher memory usage at collector + +--- + +## Key Benefits for rippled + +| Challenge | How Tracing Helps | +| ---------------------------------- | ---------------------------------------- | +| "Where is my transaction?" | Follow trace across all nodes it touched | +| "Why was consensus slow?" | See timing breakdown of each phase | +| "Which node is the bottleneck?" | Compare span durations across nodes | +| "What happened during the outage?" | Correlate errors across the network | + +--- + +## Glossary + +| Term | Definition | +| ------------------- | --------------------------------------------------------------- | +| **Trace** | Complete journey of a request, identified by `trace_id` | +| **Span** | Single operation within a trace | +| **Context** | Data propagated between services (`trace_id`, `span_id`, flags) | +| **Instrumentation** | Code that creates spans and propagates context | +| **Collector** | Service that receives, processes, and exports traces | +| **Backend** | Storage/visualization system (Jaeger, Tempo, etc.) 
| +| **Head Sampling** | Sampling decision at trace start | +| **Tail Sampling** | Sampling decision after trace completes | + +--- + +*Next: [Architecture Analysis](./01-architecture-analysis.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)* diff --git a/OpenTelemetryPlan/01-architecture-analysis.md b/OpenTelemetryPlan/01-architecture-analysis.md index 3d910331b8..d29ebf21b3 100644 --- a/OpenTelemetryPlan/01-architecture-analysis.md +++ b/OpenTelemetryPlan/01-architecture-analysis.md @@ -36,10 +36,10 @@ flowchart TB JobQueue --> processing end - style rippled fill:#f5f5f5,stroke:#333 - style services fill:#e3f2fd,stroke:#1976d2 - style processing fill:#e8f5e9,stroke:#388e3c - style observability fill:#fff3e0,stroke:#f57c00 + style rippled fill:#424242,stroke:#212121,color:#ffffff + style services fill:#1565c0,stroke:#0d47a1,color:#ffffff + style processing fill:#2e7d32,stroke:#1b5e20,color:#ffffff + style observability fill:#e65100,stroke:#bf360c,color:#ffffff ``` --- @@ -136,10 +136,10 @@ flowchart TB establish --> accept end - style round fill:#fff8e1,stroke:#ffc107 - style open fill:#e3f2fd,stroke:#1976d2 - style establish fill:#e8f5e9,stroke:#388e3c - style accept fill:#fce4ec,stroke:#e91e63 + style round fill:#f57f17,stroke:#e65100,color:#ffffff + style open fill:#1565c0,stroke:#0d47a1,color:#ffffff + style establish fill:#2e7d32,stroke:#1b5e20,color:#ffffff + style accept fill:#c2185b,stroke:#880e4f,color:#ffffff ``` --- @@ -172,9 +172,9 @@ flowchart TB command --> response end - style request fill:#e8f5e9,stroke:#388e3c - style enqueue fill:#e3f2fd,stroke:#1976d2 - style command fill:#fff3e0,stroke:#ff9800 + style request fill:#2e7d32,stroke:#1b5e20,color:#ffffff + style enqueue fill:#1565c0,stroke:#0d47a1,color:#ffffff + style command fill:#e65100,stroke:#bf360c,color:#ffffff ``` --- @@ -214,8 +214,8 @@ quadrantChart quadrant-4 Consider Later RPC Tracing: [0.3, 0.85] - Transaction Tracing: [0.6, 0.95] - Consensus Tracing: [0.75, 0.9] + Transaction 
Tracing: [0.65, 0.92] + Consensus Tracing: [0.75, 0.87] Peer Message Tracing: [0.4, 0.3] Ledger Acquisition: [0.5, 0.6] JobQueue Tracing: [0.35, 0.5] @@ -251,9 +251,9 @@ After implementing OpenTelemetry, operators and developers will gain visibility **Transaction Trace View (Jaeger/Tempo):** ``` -┌─────────────────────────────────────────────────────────────────────────────────┐ +┌────────────────────────────────────────────────────────────────────────────────┐ │ Trace: abc123... (Transaction Submission) Duration: 847ms │ -├─────────────────────────────────────────────────────────────────────────────────┤ +├────────────────────────────────────────────────────────────────────────────────┤ │ ├── rpc.request [ServerHandler] ████░░░░░░ 45ms │ │ │ └── rpc.command.submit [RPCHandler] ████░░░░░░ 42ms │ │ │ └── tx.receive [NetworkOPs] ███░░░░░░░ 35ms │ @@ -266,7 +266,7 @@ After implementing OpenTelemetry, operators and developers will gain visibility │ ├── consensus.phase.open ██░░░░░░░░ 180ms │ │ ├── consensus.phase.establish █████░░░░░ 480ms │ │ └── consensus.phase.accept █░░░░░░░░░ 60ms │ -└─────────────────────────────────────────────────────────────────────────────────┘ +└────────────────────────────────────────────────────────────────────────────────┘ ``` **RPC Performance Dashboard Panel:** @@ -285,23 +285,24 @@ After implementing OpenTelemetry, operators and developers will gain visibility ``` **Consensus Health Dashboard Panel:** -``` -┌─────────────────────────────────────────────────────────────┐ -│ Consensus Round Duration (Last 24 Hours) │ -├─────────────────────────────────────────────────────────────┤ -│ │ -│ 5s ┤ * │ -│ │ * * * │ -│ 4s ┤ * ** * * │ -│ │ * * * * ** * │ -│ 3s ┤ * * * * * * * * │ -│ │ * * * * * * * * * │ -│ 2s ┤* ** * * ** * * * * │ -│ │ ** ** ** │ -│ 1s ┤────────────────────────────────────────────────── │ -│ └──────────────────────────────────────────────────── │ -│ 00:00 04:00 08:00 12:00 16:00 20:00 24:00 │ 
-└─────────────────────────────────────────────────────────────┘ + +```mermaid +--- +config: + xyChart: + width: 1200 + height: 400 + plotReservedSpacePercent: 50 + chartOrientation: vertical + themeVariables: + xyChart: + plotColorPalette: "#3498db" +--- +xychart-beta + title "Consensus Round Duration (Last 24 Hours)" + x-axis "Time of Day (Hours)" [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24] + y-axis "Duration (seconds)" 1 --> 5 + line [2.1, 2.3, 2.5, 2.4, 2.8, 1.6, 3.2, 3.0, 3.5, 1.3, 3.8, 3.6, 4.0, 3.2, 4.3, 4.1, 4.5, 4.3, 4.2, 2.4, 4.8, 4.6, 4.9, 4.7, 5.0, 4.9, 4.8, 2.6, 4.7, 4.5, 4.2, 4.0, 2.5, 3.7, 3.2, 3.4, 2.9, 3.1, 2.6, 2.8, 2.3, 1.5, 2.7, 2.4, 2.5, 2.3, 2.2, 2.1, 2.0] ``` ### 1.8.4 Operator Actionable Insights diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index d337442add..a6fa6346eb 100644 --- a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -56,9 +56,9 @@ flowchart TB collector --> tempo collector --> elastic - style nodes fill:#e3f2fd,stroke:#1976d2 - style backends fill:#e8f5e9,stroke:#388e3c - style collector fill:#fff3e0,stroke:#ff9800 + style nodes fill:#0d47a1,stroke:#082f6a,color:#ffffff + style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style collector fill:#bf360c,stroke:#8c2809,color:#ffffff ``` ### 2.2.1 OTLP/gRPC (Recommended) @@ -245,16 +245,101 @@ flowchart TB job["Context captured at job creation,
restored at execution

class Job {
opentelemetry::context::Context traceContext_;
};"] end - style http fill:#e3f2fd,stroke:#1976d2 - style protobuf fill:#e8f5e9,stroke:#388e3c - style jobqueue fill:#fff3e0,stroke:#ff9800 + style http fill:#0d47a1,stroke:#082f6a,color:#ffffff + style protobuf fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style jobqueue fill:#bf360c,stroke:#8c2809,color:#ffffff ``` --- ## 2.6 Integration with Existing Observability -### 2.6.1 Coexistence Strategy +### 2.6.1 Existing Frameworks Comparison + +rippled already has two observability mechanisms. OpenTelemetry complements (not replaces) them: + +| Aspect | PerfLog | Beast Insight (StatsD) | OpenTelemetry | +| --------------------- | ----------------------------- | ---------------------------- | ------------------------- | +| **Type** | Logging | Metrics | Distributed Tracing | +| **Data** | JSON log entries | Counters, gauges, histograms | Spans with context | +| **Scope** | Single node | Single node | **Cross-node** | +| **Output** | `perf.log` file | StatsD server | OTLP Collector | +| **Question answered** | "What happened on this node?" | "How many? How fast?" | "What was the journey?" 
| +| **Correlation** | By timestamp | By metric name | By `trace_id` | +| **Overhead** | Low (file I/O) | Low (UDP packets) | Low-Medium (configurable) | + +### 2.6.2 What Each Framework Does Best + +#### PerfLog +- **Purpose**: Detailed local event logging for RPC and job execution +- **Strengths**: + - Rich JSON output with timing data + - Already integrated in RPC handlers + - File-based, no external dependencies +- **Limitations**: + - Single-node only (no cross-node correlation) + - No parent-child relationships between events + - Manual log parsing required + +```json +// Example PerfLog entry +{ + "time": "2024-01-15T10:30:00.123Z", + "method": "submit", + "duration_us": 1523, + "result": "tesSUCCESS" +} +``` + +#### Beast Insight (StatsD) +- **Purpose**: Real-time metrics for monitoring dashboards +- **Strengths**: + - Aggregated metrics (counters, gauges, histograms) + - Low overhead (UDP, fire-and-forget) + - Good for alerting thresholds +- **Limitations**: + - No request-level detail + - No causal relationships + - Single-node perspective + +```cpp +// Example StatsD usage in rippled +insight.increment("rpc.submit.count"); +insight.gauge("ledger.age", age); +insight.timing("consensus.round", duration); +``` + +#### OpenTelemetry (NEW) +- **Purpose**: Distributed request tracing across nodes +- **Strengths**: + - **Cross-node correlation** via `trace_id` + - Parent-child span relationships + - Rich attributes per span + - Industry standard (CNCF) +- **Limitations**: + - Requires collector infrastructure + - Higher complexity than logging + +```cpp +// Example OpenTelemetry span +auto span = telemetry.startSpan("tx.relay"); +span->SetAttribute("tx.hash", hash); +span->SetAttribute("peer.id", peerId); +// Span automatically linked to parent via context +``` + +### 2.6.3 When to Use Each + +| Scenario | PerfLog | StatsD | OpenTelemetry | +| --------------------------------------- | --------- | ------ | ------------- | +| "How many TXs per second?" 
| ❌ | ✅ | ❌ | +| "What's the p99 RPC latency?" | ❌ | ✅ | ✅ | +| "Why was this specific TX slow?" | ⚠️ partial | ❌ | ✅ | +| "Which node delayed consensus?" | ❌ | ❌ | ✅ | +| "What happened on node X at time T?" | ✅ | ❌ | ✅ | +| "Show me the TX journey across 5 nodes" | ❌ | ❌ | ✅ | + +### 2.6.4 Coexistence Strategy ```mermaid flowchart TB @@ -272,11 +357,11 @@ flowchart TB statsd --> grafana collector --> grafana - style rippled fill:#f5f5f5,stroke:#333 - style grafana fill:#ff9800,stroke:#e65100 + style rippled fill:#212121,stroke:#0a0a0a,color:#ffffff + style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff ``` -### 2.6.2 Correlation with PerfLog +### 2.6.5 Correlation with PerfLog Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging: diff --git a/OpenTelemetryPlan/03-implementation-strategy.md b/OpenTelemetryPlan/03-implementation-strategy.md index 41438e87db..05be0fce32 100644 --- a/OpenTelemetryPlan/03-implementation-strategy.md +++ b/OpenTelemetryPlan/03-implementation-strategy.md @@ -35,37 +35,41 @@ src/xrpld/ ## 3.2 Implementation Approach +
+ ```mermaid -flowchart LR +%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 30}}}%% +flowchart TB subgraph phase1["Phase 1: Core"] - sdk["SDK Integration"] - interface["Telemetry Interface"] - config["Configuration"] + direction LR + sdk["SDK Integration"] ~~~ interface["Telemetry Interface"] ~~~ config["Configuration"] end subgraph phase2["Phase 2: RPC"] - http["HTTP Context"] - rpc["RPC Handlers"] + direction LR + http["HTTP Context"] ~~~ rpc["RPC Handlers"] end subgraph phase3["Phase 3: P2P"] - proto["Protobuf Context"] - tx["Transaction Relay"] + direction LR + proto["Protobuf Context"] ~~~ tx["Transaction Relay"] end subgraph phase4["Phase 4: Consensus"] - consensus["Consensus Rounds"] - proposals["Proposals"] + direction LR + consensus["Consensus Rounds"] ~~~ proposals["Proposals"] end phase1 --> phase2 --> phase3 --> phase4 - style phase1 fill:#e3f2fd,stroke:#1976d2 - style phase2 fill:#e8f5e9,stroke:#388e3c - style phase3 fill:#fff3e0,stroke:#ff9800 - style phase4 fill:#fce4ec,stroke:#e91e63 + style phase1 fill:#1565c0,stroke:#0d47a1,color:#ffffff + style phase2 fill:#2e7d32,stroke:#1b5e20,color:#ffffff + style phase3 fill:#e65100,stroke:#bf360c,color:#ffffff + style phase4 fill:#c2185b,stroke:#880e4f,color:#ffffff ``` +
+ ### Key Principles 1. **Minimal Intrusion**: Instrumentation should not alter existing control flow @@ -103,14 +107,21 @@ flowchart LR ### 3.4.2 Transaction Processing Overhead +
+ ```mermaid -pie title Transaction Tracing Overhead (~2.4μs total) - "tx.receive span" : 800 - "tx.validate span" : 500 - "tx.relay span" : 500 - "Context injection (×3)" : 600 +%%{init: {'pie': {'textPosition': 0.75}}}%% +pie showData + "tx.receive (800ns)" : 800 + "tx.validate (500ns)" : 500 + "tx.relay (500ns)" : 500 + "Context inject (600ns)" : 600 ``` +**Transaction Tracing Overhead (~2.4μs total)** + +
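The figures in the chart above can be sanity-checked with a few lines of C++. This is a minimal sketch rather than rippled code: `SpanCost`, `kTxTracingCosts`, `totalOverheadNs`, and `overheadPercent` are hypothetical names introduced for illustration, the per-operation costs are the estimates shown in the chart, and the 200 μs baseline is the average transaction processing time this section assumes.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration only; it is not part of
// rippled or the OpenTelemetry SDK.
struct SpanCost
{
    char const* name;
    std::uint64_t ns;  // estimated cost in nanoseconds
};

// Per-operation estimates copied from the pie chart above.
constexpr SpanCost kTxTracingCosts[] = {
    {"tx.receive", 800},
    {"tx.validate", 500},
    {"tx.relay", 500},
    {"context inject", 600},
};

// Sum the estimated tracing cost for one transaction, in nanoseconds.
constexpr std::uint64_t
totalOverheadNs()
{
    std::uint64_t total = 0;
    for (auto const& cost : kTxTracingCosts)
        total += cost.ns;
    return total;
}

// Express an overhead as a percentage of a baseline duration.
// Both arguments are in nanoseconds.
constexpr double
overheadPercent(std::uint64_t overheadNs, std::uint64_t baselineNs)
{
    return 100.0 * static_cast<double>(overheadNs) /
        static_cast<double>(baselineNs);
}

// Compile-time checks: the slices add up to 2400 ns (~2.4 us), and
// against an assumed 200 us baseline that is roughly 1.2%.
static_assert(totalOverheadNs() == 2400, "slices should sum to 2400 ns");
static_assert(
    overheadPercent(totalOverheadNs(), 200000) > 1.19 &&
        overheadPercent(totalOverheadNs(), 200000) < 1.21,
    "overhead should be about 1.2% of a 200 us baseline");
```

Keeping the estimates in a single table means that adding or re-measuring a span only changes one row, and the compile-time checks flag any drift between the chart and the quoted percentage.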
+ **Overhead percentage**: 2.4 μs / 200 μs (avg tx processing) = **~1.2%** ### 3.4.3 Consensus Round Overhead @@ -166,6 +177,12 @@ pie title Transaction Tracing Overhead (~2.4μs total) ### 3.5.3 Memory Growth Characteristics ```mermaid +--- +config: + xyChart: + width: 700 + height: 400 +--- xychart-beta title "Memory Usage vs Span Rate" x-axis "Spans/second" [0, 200, 400, 600, 800, 1000] @@ -325,15 +342,15 @@ pie title Code Changes by Component ### 3.9.3 Risk Assessment by Component +
+ +**Do First** ↖ ↗ **Plan Carefully** + ```mermaid quadrantChart title Code Intrusiveness Risk Matrix x-axis Low Risk --> High Risk y-axis Low Value --> High Value - quadrant-1 High Value, Low Risk - Do First - quadrant-2 High Value, High Risk - Plan Carefully - quadrant-3 Low Value, Low Risk - Optional - quadrant-4 Low Value, High Risk - Avoid RPC Tracing: [0.2, 0.8] Transaction Relay: [0.5, 0.9] @@ -343,6 +360,10 @@ quadrantChart Ledger Acquisition: [0.5, 0.6] ``` +**Optional** ↙ ↘ **Avoid** + +
+ #### Risk Level Definitions | Risk Level | Definition | Mitigation | diff --git a/OpenTelemetryPlan/04-code-samples.md b/OpenTelemetryPlan/04-code-samples.md index c0beb0ebc3..df192a33ac 100644 --- a/OpenTelemetryPlan/04-code-samples.md +++ b/OpenTelemetryPlan/04-code-samples.md @@ -917,6 +917,8 @@ Worker::run() ## 4.6 Span Flow Visualization +
+ ```mermaid flowchart TB subgraph Client["External Client"] @@ -955,13 +957,26 @@ flowchart TB txRecvC --> consensusC consensusC --> phaseC - style rpcA fill:#e3f2fd,stroke:#1976d2 - style txRecvA fill:#e8f5e9,stroke:#388e3c - style txRecvB fill:#e8f5e9,stroke:#388e3c - style txRecvC fill:#e8f5e9,stroke:#388e3c - style consensusC fill:#fff3e0,stroke:#ff9800 + style Client fill:#334155,stroke:#1e293b,color:#fff + style NodeA fill:#1e3a8a,stroke:#172554,color:#fff + style NodeB fill:#064e3b,stroke:#022c22,color:#fff + style NodeC fill:#78350f,stroke:#451a03,color:#fff + style submit fill:#e2e8f0,stroke:#cbd5e1,color:#1e293b + style rpcA fill:#1d4ed8,stroke:#1e40af,color:#fff + style cmdA fill:#1d4ed8,stroke:#1e40af,color:#fff + style txRecvA fill:#047857,stroke:#064e3b,color:#fff + style txValA fill:#047857,stroke:#064e3b,color:#fff + style txRelayA fill:#047857,stroke:#064e3b,color:#fff + style txRecvB fill:#047857,stroke:#064e3b,color:#fff + style txValB fill:#047857,stroke:#064e3b,color:#fff + style txRelayB fill:#047857,stroke:#064e3b,color:#fff + style txRecvC fill:#047857,stroke:#064e3b,color:#fff + style consensusC fill:#fef3c7,stroke:#fde68a,color:#1e293b + style phaseC fill:#fef3c7,stroke:#fde68a,color:#1e293b ``` +
+ --- *Previous: [Implementation Strategy](./03-implementation-strategy.md)* | *Next: [Configuration Reference](./05-configuration-reference.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)* diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index e8364517d1..71a73aabbd 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -228,8 +228,11 @@ quadrantChart ## 6.9 Effort Summary +
+ ```mermaid -pie title Total Effort Distribution (47 developer-days) +%%{init: {'pie': {'textPosition': 0.75}}}%% +pie showData "Phase 1: Core Infrastructure" : 10 "Phase 2: RPC Tracing" : 10 "Phase 3: Transaction Tracing" : 11 @@ -237,6 +240,10 @@ pie title Total Effort Distribution (47 developer-days) "Phase 5: Documentation" : 5 ``` +**Total Effort Distribution (47 developer-days)** + +
+ ### Resource Requirements | Phase | Developers | Duration | Total Effort | @@ -256,33 +263,43 @@ This section outlines a prioritized approach to maximize ROI with minimal initia ### 6.10.1 Crawl-Walk-Run Overview +
+ ```mermaid -flowchart LR +flowchart TB subgraph crawl["🐢 CRAWL (Week 1-2)"] - c1[Core SDK Setup] - c2[RPC Tracing Only] - c3[Single Node] + direction LR + c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[Single Node] end subgraph walk["🚶 WALK (Week 3-5)"] - w1[Transaction Tracing] - w2[Cross-Node Context] - w3[Basic Dashboards] + direction LR + w1[Transaction Tracing] ~~~ w2[Cross-Node Context] ~~~ w3[Basic Dashboards] end subgraph run["🏃 RUN (Week 6-9)"] - r1[Consensus Tracing] - r2[Full Correlation] - r3[Production Deploy] + direction LR + r1[Consensus Tracing] ~~~ r2[Full Correlation] ~~~ r3[Production Deploy] end crawl --> walk --> run - style crawl fill:#e8f5e9,stroke:#388e3c - style walk fill:#fff3e0,stroke:#ff9800 - style run fill:#e3f2fd,stroke:#1976d2 + style crawl fill:#1b5e20,stroke:#0d3d14,color:#fff + style walk fill:#bf360c,stroke:#8c2809,color:#fff + style run fill:#0d47a1,stroke:#082f6a,color:#fff + style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff + style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff + style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff + style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style r1 fill:#0d47a1,stroke:#082f6a,color:#fff + style r2 fill:#0d47a1,stroke:#082f6a,color:#fff + style r3 fill:#0d47a1,stroke:#082f6a,color:#fff ``` +
+ ### 6.10.2 Quick Wins (Immediate Value) | Quick Win | Effort | Value | When to Deploy | @@ -492,13 +509,27 @@ flowchart TB t10 --> t11 --> t12 t12 --> t13 --> t14 - style week1 fill:#e8f5e9,stroke:#388e3c - style week2 fill:#e8f5e9,stroke:#388e3c - style week3 fill:#fff3e0,stroke:#ff9800 - style week4 fill:#fff3e0,stroke:#ff9800 - style week5 fill:#fff3e0,stroke:#ff9800 - style week6_8 fill:#e3f2fd,stroke:#1976d2 - style week9 fill:#f3e5f5,stroke:#7b1fa2 + style week1 fill:#1b5e20,stroke:#0d3d14,color:#fff + style week2 fill:#1b5e20,stroke:#0d3d14,color:#fff + style week3 fill:#bf360c,stroke:#8c2809,color:#fff + style week4 fill:#bf360c,stroke:#8c2809,color:#fff + style week5 fill:#bf360c,stroke:#8c2809,color:#fff + style week6_8 fill:#0d47a1,stroke:#082f6a,color:#fff + style week9 fill:#4a148c,stroke:#2e0d57,color:#fff + style t1 fill:#1b5e20,stroke:#0d3d14,color:#fff + style t2 fill:#1b5e20,stroke:#0d3d14,color:#fff + style t3 fill:#1b5e20,stroke:#0d3d14,color:#fff + style t4 fill:#1b5e20,stroke:#0d3d14,color:#fff + style t5 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t6 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t7 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t8 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t9 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t10 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b + style t11 fill:#0d47a1,stroke:#082f6a,color:#fff + style t12 fill:#0d47a1,stroke:#082f6a,color:#fff + style t13 fill:#4a148c,stroke:#2e0d57,color:#fff + style t14 fill:#4a148c,stroke:#2e0d57,color:#fff ``` --- diff --git a/OpenTelemetryPlan/07-observability-backends.md b/OpenTelemetryPlan/07-observability-backends.md index cc6b09ff77..73ca5dafd7 100644 --- a/OpenTelemetryPlan/07-observability-backends.md +++ b/OpenTelemetryPlan/07-observability-backends.md @@ -60,10 +60,17 @@ flowchart TD honeycomb --> final datadog --> final - style tempo fill:#e8f5e9,stroke:#388e3c - style elastic fill:#fff3e0,stroke:#ff9800 - style honeycomb 
fill:#e3f2fd,stroke:#1976d2 - style datadog fill:#f3e5f5,stroke:#7b1fa2 + style start fill:#0f172a,stroke:#020617,color:#fff + style budget fill:#334155,stroke:#1e293b,color:#fff + style oss fill:#1e293b,stroke:#0f172a,color:#fff + style existing fill:#334155,stroke:#1e293b,color:#fff + style saas fill:#334155,stroke:#1e293b,color:#fff + style enterprise fill:#334155,stroke:#1e293b,color:#fff + style final fill:#0f172a,stroke:#020617,color:#fff + style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff + style elastic fill:#bf360c,stroke:#8c2809,color:#fff + style honeycomb fill:#0d47a1,stroke:#082f6a,color:#fff + style datadog fill:#4a148c,stroke:#2e0d57,color:#fff ``` --- @@ -110,11 +117,11 @@ flowchart TB tempo --> grafana elastic --> grafana - style validators fill:#ffebee,stroke:#c62828 - style stock fill:#e3f2fd,stroke:#1976d2 - style collector fill:#fff3e0,stroke:#ff9800 - style backends fill:#e8f5e9,stroke:#388e3c - style ui fill:#f3e5f5,stroke:#7b1fa2 + style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff + style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff + style collector fill:#bf360c,stroke:#8c2809,color:#ffffff + style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff + style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff ``` --- @@ -153,6 +160,14 @@ flowchart LR ts1 --> final[Final Traces] ts2 --> final ts3 --> final + + style head fill:#0d47a1,stroke:#082f6a,color:#fff + style tail fill:#1b5e20,stroke:#0d3d14,color:#fff + style hs fill:#0d47a1,stroke:#082f6a,color:#fff + style ts1 fill:#1b5e20,stroke:#0d3d14,color:#fff + style ts2 fill:#1b5e20,stroke:#0d3d14,color:#fff + style ts3 fill:#1b5e20,stroke:#0d3d14,color:#fff + style final fill:#bf360c,stroke:#8c2809,color:#fff ``` ### 7.4.3 Data Retention @@ -161,7 +176,7 @@ flowchart LR | ----------- | ----------- | ------------ | ------------ | | Development | 24 hours | N/A | N/A | | Staging | 7 days | N/A | N/A | -| Production | 7 days | 30 days | 1 year | +| Production | 7 days | 30 days | many 
years | --- @@ -424,7 +439,23 @@ flowchart TB logs --> corr metrics --> corr - style corr fill:#f3e5f5,stroke:#7b1fa2 + style rippled fill:#0d47a1,stroke:#082f6a,color:#fff + style collectors fill:#bf360c,stroke:#8c2809,color:#fff + style storage fill:#1b5e20,stroke:#0d3d14,color:#fff + style grafana fill:#4a148c,stroke:#2e0d57,color:#fff + style otel fill:#0d47a1,stroke:#082f6a,color:#fff + style perflog fill:#0d47a1,stroke:#082f6a,color:#fff + style insight fill:#0d47a1,stroke:#082f6a,color:#fff + style otelc fill:#bf360c,stroke:#8c2809,color:#fff + style promtail fill:#bf360c,stroke:#8c2809,color:#fff + style statsd fill:#bf360c,stroke:#8c2809,color:#fff + style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff + style loki fill:#1b5e20,stroke:#0d3d14,color:#fff + style prom fill:#1b5e20,stroke:#0d3d14,color:#fff + style traces fill:#4a148c,stroke:#2e0d57,color:#fff + style logs fill:#4a148c,stroke:#2e0d57,color:#fff + style metrics fill:#4a148c,stroke:#2e0d57,color:#fff + style corr fill:#4a148c,stroke:#2e0d57,color:#fff ``` ### 7.7.2 Correlation Fields diff --git a/OpenTelemetryPlan/08-appendix.md b/OpenTelemetryPlan/08-appendix.md index 9681863d05..30b2b68cb9 100644 --- a/OpenTelemetryPlan/08-appendix.md +++ b/OpenTelemetryPlan/08-appendix.md @@ -7,35 +7,35 @@ ## 8.1 Glossary -| Term | Definition | -| -------------- | ---------------------------------------------------------- | -| **Span** | A unit of work with start/end time, name, and attributes | -| **Trace** | A collection of spans representing a complete request flow | -| **Trace ID** | 128-bit unique identifier for a trace | -| **Span ID** | 64-bit unique identifier for a span within a trace | -| **Context** | Carrier for trace/span IDs across boundaries | -| **Propagator** | Component that injects/extracts context | -| **Sampler** | Decides which traces to record | -| **Exporter** | Sends spans to backend | -| **Collector** | Receives, processes, and forwards telemetry | -| **OTLP** | OpenTelemetry Protocol 
(wire format) | -| **W3C Trace Context** | Standard HTTP headers for trace propagation | -| **Baggage** | Key-value pairs propagated across service boundaries | -| **Resource** | Entity producing telemetry (service, host, etc.) | -| **Instrumentation** | Code that creates telemetry data | +| Term | Definition | +| --------------------- | ---------------------------------------------------------- | +| **Span** | A unit of work with start/end time, name, and attributes | +| **Trace** | A collection of spans representing a complete request flow | +| **Trace ID** | 128-bit unique identifier for a trace | +| **Span ID** | 64-bit unique identifier for a span within a trace | +| **Context** | Carrier for trace/span IDs across boundaries | +| **Propagator** | Component that injects/extracts context | +| **Sampler** | Decides which traces to record | +| **Exporter** | Sends spans to backend | +| **Collector** | Receives, processes, and forwards telemetry | +| **OTLP** | OpenTelemetry Protocol (wire format) | +| **W3C Trace Context** | Standard HTTP headers for trace propagation | +| **Baggage** | Key-value pairs propagated across service boundaries | +| **Resource** | Entity producing telemetry (service, host, etc.) 
| +| **Instrumentation** | Code that creates telemetry data | ### rippled-Specific Terms -| Term | Definition | -| -------------- | ---------------------------------------------------------- | -| **Overlay** | P2P network layer managing peer connections | -| **Consensus** | XRP Ledger consensus algorithm (RCL) | -| **Proposal** | Validator's suggested transaction set for a ledger | -| **Validation** | Validator's signature on a closed ledger | -| **HashRouter** | Component for transaction deduplication | -| **JobQueue** | Thread pool for asynchronous task execution | -| **PerfLog** | Existing performance logging system in rippled | -| **Beast Insight** | Existing metrics framework in rippled | +| Term | Definition | +| ----------------- | -------------------------------------------------- | +| **Overlay** | P2P network layer managing peer connections | +| **Consensus** | XRP Ledger consensus algorithm (RCL) | +| **Proposal** | Validator's suggested transaction set for a ledger | +| **Validation** | Validator's signature on a closed ledger | +| **HashRouter** | Component for transaction deduplication | +| **JobQueue** | Thread pool for asynchronous task execution | +| **PerfLog** | Existing performance logging system in rippled | +| **Beast Insight** | Existing metrics framework in rippled | --- @@ -47,17 +47,17 @@ flowchart TB rpc["rpc.submit
(entry point)"]
         validate["tx.validate"]
         relay["tx.relay
(parent span)"]
-
+
         subgraph peers["Peer Spans"]
             p1["peer.send
Peer A"]
             p2["peer.send
Peer B"]
             p3["peer.send
Peer C"]
         end
-
+
         consensus["consensus.round"]
         apply["tx.apply"]
     end
-
+
     rpc --> validate
     validate --> relay
     relay --> p1
@@ -65,9 +65,17 @@ flowchart TB
     relay --> p3
     p1 -.->|"context propagation"| consensus
     consensus --> apply
-
-    style trace fill:#f5f5f5,stroke:#333
-    style peers fill:#e3f2fd,stroke:#1976d2
+
+    style trace fill:#0f172a,stroke:#020617,color:#fff
+    style peers fill:#1e3a8a,stroke:#172554,color:#fff
+    style rpc fill:#1d4ed8,stroke:#1e40af,color:#fff
+    style validate fill:#047857,stroke:#064e3b,color:#fff
+    style relay fill:#047857,stroke:#064e3b,color:#fff
+    style p1 fill:#0e7490,stroke:#155e75,color:#fff
+    style p2 fill:#0e7490,stroke:#155e75,color:#fff
+    style p3 fill:#0e7490,stroke:#155e75,color:#fff
+    style consensus fill:#fef3c7,stroke:#fde68a,color:#1e293b
+    style apply fill:#047857,stroke:#064e3b,color:#fff
 ```

 ---
@@ -99,28 +107,27 @@ flowchart TB

 ## 8.4 Version History

-| Version | Date | Author | Changes |
-| ------- | ---------- | ------ | --------------------------- |
-| 1.0 | 2026-02-12 | - | Initial implementation plan |
+| Version | Date | Author | Changes |
+| ------- | ---------- | ------ | --------------------------------- |
+| 1.0 | 2026-02-12 | - | Initial implementation plan |
 | 1.1 | 2026-02-13 | - | Refactored into modular documents |

 ---

 ## 8.5 Document Index

-| Document | Description |
-| -------- | ----------- |
-| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
-| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points |
-| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
-| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
-| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
-| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs |
-| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
-| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
-| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |
+| Document | Description |
+| ---------------------------------------------------------------- | ------------------------------------------ |
+| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
+| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points |
+| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
+| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
+| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
+| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs |
+| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
+| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
+| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |

 ---

 *Previous: [Observability Backends](./07-observability-backends.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
-
diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md
index 89f9c79c43..afb06417f4 100644
--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -1,4 +1,4 @@
-# OpenTelemetry Distributed Tracing Implementation Plan for rippled (xrpld)
+# [OpenTelemetry](00-tracing-fundamentals.md) Distributed Tracing Implementation Plan for rippled (xrpld)

 ## Executive Summary

@@ -27,31 +27,33 @@ This document provides a comprehensive implementation plan for integrating OpenT

 This implementation plan is organized into modular documents for easier navigation:

+
+
 ```mermaid
 flowchart TB
     overview["📋 OpenTelemetryPlan.md
(This Document)"]
-
+
     subgraph analysis["Analysis & Design"]
         arch["01-architecture-analysis.md"]
         design["02-design-decisions.md"]
     end
-
+
     subgraph impl["Implementation"]
         strategy["03-implementation-strategy.md"]
         code["04-code-samples.md"]
         config["05-configuration-reference.md"]
     end
-
+
     subgraph deploy["Deployment & Planning"]
         phases["06-implementation-phases.md"]
         backends["07-observability-backends.md"]
         appendix["08-appendix.md"]
     end
-
+
     overview --> analysis
     overview --> impl
     overview --> deploy
-
+
     arch --> design
     design --> strategy
     strategy --> code
@@ -59,27 +61,37 @@ flowchart TB
     config --> phases
     phases --> backends
     backends --> appendix
-
-    style overview fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
-    style analysis fill:#e3f2fd,stroke:#1976d2
-    style impl fill:#fff3e0,stroke:#ff9800
-    style deploy fill:#f3e5f5,stroke:#7b1fa2
+
+    style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
+    style analysis fill:#0d47a1,stroke:#082f6a,color:#fff
+    style impl fill:#bf360c,stroke:#8c2809,color:#fff
+    style deploy fill:#4a148c,stroke:#2e0d57,color:#fff
+    style arch fill:#0d47a1,stroke:#082f6a,color:#fff
+    style design fill:#0d47a1,stroke:#082f6a,color:#fff
+    style strategy fill:#bf360c,stroke:#8c2809,color:#fff
+    style code fill:#bf360c,stroke:#8c2809,color:#fff
+    style config fill:#bf360c,stroke:#8c2809,color:#fff
+    style phases fill:#4a148c,stroke:#2e0d57,color:#fff
+    style backends fill:#4a148c,stroke:#2e0d57,color:#fff
+    style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
 ```
+
+

 ---

 ## Table of Contents

-| Section | Document | Description |
-| ------- | -------- | ----------- |
-| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities |
-| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
-| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
-| **4** | [Code Samples](./04-code-samples.md) | Complete C++ implementation examples for all components |
-| **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations |
-| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
-| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
-| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |
+| Section | Document | Description |
+| ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- |
+| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities |
+| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
+| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
+| **4** | [Code Samples](./04-code-samples.md) | Complete C++ implementation examples for all components |
+| **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations |
+| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
+| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
+| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |

 ---

@@ -140,13 +152,13 @@ OpenTelemetry Collector configurations are provided for development (with Jaeger

 The implementation spans 9 weeks across 5 phases:

-| Phase | Duration | Focus | Key Deliverables |
-| ----- | -------- | ----- | ---------------- |
-| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
-| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
-| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
-| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
-| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
+| Phase | Duration | Focus | Key Deliverables |
+| ----- | --------- | ------------------- | --------------------------------------------------- |
+| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
+| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
+| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
+| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
+| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |

 **Total Effort**: 47 developer-days with 2 developers

@@ -173,4 +185,3 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re
 ---

 *This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents.*
-