mirror of
https://github.com/XRPLF/rippled.git
synced 2026-03-03 03:02:45 +00:00
Compare commits
27 Commits
pratik/ote
...
pratik/std
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
ddaa264dcb | ||
|
|
674d10a82e | ||
|
|
773a69f4d5 | ||
|
|
0e0813c766 | ||
|
|
679bf3ea83 | ||
|
|
666cafca63 | ||
|
|
e3a4e2489e | ||
|
|
c38bdc3d9a | ||
|
|
d4ec539563 | ||
|
|
fd00d5279b | ||
|
|
363870eb34 | ||
|
|
926d8128af | ||
|
|
a0a72a6934 | ||
|
|
3ed8a51d48 | ||
|
|
81eb10885b | ||
|
|
8f517e41f6 | ||
|
|
4613377a41 | ||
|
|
51ae47cdc6 | ||
|
|
185921ea94 | ||
|
|
403abe6408 | ||
|
|
eb83e111af | ||
|
|
464c09efc7 | ||
|
|
897c75bc6b | ||
|
|
ca15c0efd7 | ||
|
|
bb4bc1d167 | ||
|
|
b9d14fb9e1 | ||
|
|
af30b71043 |
1724
BoostToStdCoroutineSwitchPlan.md
Normal file
1724
BoostToStdCoroutineSwitchPlan.md
Normal file
File diff suppressed because it is too large
Load Diff
@@ -1,244 +0,0 @@
|
||||
# Distributed Tracing Fundamentals
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Next**: [Architecture Analysis](./01-architecture-analysis.md)
|
||||
|
||||
---
|
||||
|
||||
## What is Distributed Tracing?
|
||||
|
||||
Distributed tracing is a method for tracking data objects as they flow through distributed systems. In a network like XRP Ledger, a single transaction touches multiple independent nodes—each with no shared memory or logging. Distributed tracing connects these dots.
|
||||
|
||||
**Without tracing:** You see isolated logs on each node with no way to correlate them.
|
||||
|
||||
**With tracing:** You see the complete journey of a transaction or an event across all nodes it touched.
|
||||
|
||||
---
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### 1. Trace
|
||||
|
||||
A **trace** represents the entire journey of a request through the system. It has a unique `trace_id` that stays constant across all nodes.
|
||||
|
||||
```
|
||||
Trace ID: abc123
|
||||
├── Node A: received transaction
|
||||
├── Node B: relayed transaction
|
||||
├── Node C: included in consensus
|
||||
└── Node D: applied to ledger
|
||||
```
|
||||
|
||||
### 2. Span
|
||||
|
||||
A **span** represents a single unit of work within a trace. Each span has:
|
||||
|
||||
| Attribute | Description | Example |
|
||||
| ---------------- | --------------------- | -------------------------- |
|
||||
| `trace_id` | Links to parent trace | `abc123` |
|
||||
| `span_id` | Unique identifier | `span456` |
|
||||
| `parent_span_id` | Parent span (if any) | `p_span123` |
|
||||
| `name` | Operation name | `rpc.submit` |
|
||||
| `start_time` | When work began | `2024-01-15T10:30:00Z` |
|
||||
| `end_time` | When work completed | `2024-01-15T10:30:00.050Z` |
|
||||
| `attributes` | Key-value metadata | `tx.hash=ABC...` |
|
||||
| `status` | OK, ERROR MSG | `OK` |
|
||||
|
||||
### 3. Trace Context
|
||||
|
||||
**Trace context** is the data that propagates between services to link spans together. It contains:
|
||||
|
||||
- `trace_id` - The trace this span belongs to
|
||||
- `span_id` - The current span (becomes parent for child spans)
|
||||
- `trace_flags` - Sampling decisions
|
||||
|
||||
---
|
||||
|
||||
## How Spans Form a Trace
|
||||
|
||||
Spans have parent-child relationships forming a tree structure:
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph trace["Trace: abc123"]
|
||||
A["tx.submit<br/>span_id: 001<br/>50ms"] --> B["tx.validate<br/>span_id: 002<br/>5ms"]
|
||||
A --> C["tx.relay<br/>span_id: 003<br/>10ms"]
|
||||
A --> D["tx.apply<br/>span_id: 004<br/>30ms"]
|
||||
D --> E["ledger.update<br/>span_id: 005<br/>20ms"]
|
||||
end
|
||||
|
||||
style A fill:#0d47a1,stroke:#082f6a,color:#ffffff
|
||||
style B fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style C fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style D fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style E fill:#bf360c,stroke:#8c2809,color:#ffffff
|
||||
```
|
||||
|
||||
The same trace visualized as a **timeline (Gantt chart)**:
|
||||
|
||||
```
|
||||
Time → 0ms 10ms 20ms 30ms 40ms 50ms
|
||||
├───────────────────────────────────────────┤
|
||||
tx.submit│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
|
||||
├─────┤
|
||||
tx.valid │▓▓▓▓▓│
|
||||
│ ├──────────┤
|
||||
tx.relay │ │▓▓▓▓▓▓▓▓▓▓│
|
||||
│ ├────────────────────────────┤
|
||||
tx.apply │ │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
|
||||
│ ├──────────────────┤
|
||||
ledger │ │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Distributed Traces Across Nodes
|
||||
|
||||
In distributed systems like rippled, traces span **multiple independent nodes**. The trace context must be propagated in network messages:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant NodeA as Node A
|
||||
participant NodeB as Node B
|
||||
participant NodeC as Node C
|
||||
|
||||
Client->>NodeA: Submit TX<br/>(no trace context)
|
||||
|
||||
Note over NodeA: Creates new trace<br/>trace_id: abc123<br/>span: tx.receive
|
||||
|
||||
NodeA->>NodeB: Relay TX<br/>(trace_id: abc123, parent: 001)
|
||||
|
||||
Note over NodeB: Creates child span<br/>span: tx.relay<br/>parent_span_id: 001
|
||||
|
||||
NodeA->>NodeC: Relay TX<br/>(trace_id: abc123, parent: 001)
|
||||
|
||||
Note over NodeC: Creates child span<br/>span: tx.relay<br/>parent_span_id: 001
|
||||
|
||||
Note over NodeA,NodeC: All spans share trace_id: abc123<br/>enabling correlation across nodes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Context Propagation
|
||||
|
||||
For traces to work across nodes, **trace context must be propagated** in messages.
|
||||
|
||||
### What's in the Context (32 bytes)
|
||||
|
||||
| Field | Size | Description |
|
||||
| ------------- | ---------- | ------------------------------------------------------- |
|
||||
| `trace_id` | 16 bytes | Identifies the entire trace (constant across all nodes) |
|
||||
| `span_id` | 8 bytes | The sender's current span (becomes parent on receiver) |
|
||||
| `trace_flags` | 4 bytes | Sampling decision flags |
|
||||
| `trace_state` | ~0-4 bytes | Optional vendor-specific data |
|
||||
|
||||
### How span_id Changes at Each Hop
|
||||
|
||||
Only **one** `span_id` travels in the context - the sender's current span. Each node:
|
||||
|
||||
1. Extracts the received `span_id` and uses it as the `parent_span_id`
|
||||
2. Creates a **new** `span_id` for its own span
|
||||
3. Sends its own `span_id` as the parent when forwarding
|
||||
|
||||
```
|
||||
Node A Node B Node C
|
||||
────── ────── ──────
|
||||
|
||||
Span AAA Span BBB Span CCC
|
||||
│ │ │
|
||||
▼ ▼ ▼
|
||||
Context out: Context out: Context out:
|
||||
├─ trace_id: abc123 ├─ trace_id: abc123 ├─ trace_id: abc123
|
||||
├─ span_id: AAA ──────────► ├─ span_id: BBB ──────────► ├─ span_id: CCC ──────►
|
||||
└─ flags: 01 └─ flags: 01 └─ flags: 01
|
||||
│ │
|
||||
parent = AAA parent = BBB
|
||||
```
|
||||
|
||||
The `trace_id` stays constant, but `span_id` **changes at every hop** to maintain the parent-child chain.
|
||||
|
||||
### Propagation Formats
|
||||
|
||||
There are two patterns:
|
||||
|
||||
### HTTP/RPC Headers (W3C Trace Context)
|
||||
|
||||
```
|
||||
traceparent: 00-abc123def456-span789-01
|
||||
│ │ │ │
|
||||
│ │ │ └── Flags (sampled)
|
||||
│ │ └── Parent span ID
|
||||
│ └── Trace ID
|
||||
└── Version
|
||||
```
|
||||
|
||||
### Protocol Buffers (rippled P2P messages)
|
||||
|
||||
```protobuf
|
||||
message TMTransaction {
|
||||
bytes rawTransaction = 1;
|
||||
// ... existing fields ...
|
||||
|
||||
// Trace context extension
|
||||
bytes trace_parent = 100; // W3C traceparent
|
||||
bytes trace_state = 101; // W3C tracestate
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling
|
||||
|
||||
Not every trace needs to be recorded. **Sampling** reduces overhead:
|
||||
|
||||
### Head Sampling (at trace start)
|
||||
|
||||
```
|
||||
Request arrives → Random 10% chance → Record or skip entire trace
|
||||
```
|
||||
|
||||
- ✅ Low overhead
|
||||
- ❌ May miss interesting traces
|
||||
|
||||
### Tail Sampling (after trace completes)
|
||||
|
||||
```
|
||||
Trace completes → Collector evaluates:
|
||||
- Error? → KEEP
|
||||
- Slow? → KEEP
|
||||
- Normal? → Sample 10%
|
||||
```
|
||||
|
||||
- ✅ Never loses important traces
|
||||
- ❌ Higher memory usage at collector
|
||||
|
||||
---
|
||||
|
||||
## Key Benefits for rippled
|
||||
|
||||
| Challenge | How Tracing Helps |
|
||||
| ---------------------------------- | ---------------------------------------- |
|
||||
| "Where is my transaction?" | Follow trace across all nodes it touched |
|
||||
| "Why was consensus slow?" | See timing breakdown of each phase |
|
||||
| "Which node is the bottleneck?" | Compare span durations across nodes |
|
||||
| "What happened during the outage?" | Correlate errors across the network |
|
||||
|
||||
---
|
||||
|
||||
## Glossary
|
||||
|
||||
| Term | Definition |
|
||||
| ------------------- | --------------------------------------------------------------- |
|
||||
| **Trace** | Complete journey of a request, identified by `trace_id` |
|
||||
| **Span** | Single operation within a trace |
|
||||
| **Context** | Data propagated between services (`trace_id`, `span_id`, flags) |
|
||||
| **Instrumentation** | Code that creates spans and propagates context |
|
||||
| **Collector** | Service that receives, processes, and exports traces |
|
||||
| **Backend** | Storage/visualization system (Jaeger, Tempo, etc.) |
|
||||
| **Head Sampling** | Sampling decision at trace start |
|
||||
| **Tail Sampling** | Sampling decision after trace completes |
|
||||
|
||||
---
|
||||
|
||||
_Next: [Architecture Analysis](./01-architecture-analysis.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,330 +0,0 @@
|
||||
# Architecture Analysis
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Design Decisions](./02-design-decisions.md) | [Implementation Strategy](./03-implementation-strategy.md)
|
||||
|
||||
---
|
||||
|
||||
## 1.1 Current rippled Architecture Overview
|
||||
|
||||
The rippled node software consists of several interconnected components that need instrumentation for distributed tracing:
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph rippled["rippled Node"]
|
||||
subgraph services["Core Services"]
|
||||
RPC["RPC Server<br/>(HTTP/WS/gRPC)"]
|
||||
Overlay["Overlay<br/>(P2P Network)"]
|
||||
Consensus["Consensus<br/>(RCLConsensus)"]
|
||||
end
|
||||
|
||||
JobQueue["JobQueue<br/>(Thread Pool)"]
|
||||
|
||||
subgraph processing["Processing Layer"]
|
||||
NetworkOPs["NetworkOPs<br/>(Tx Processing)"]
|
||||
LedgerMaster["LedgerMaster<br/>(Ledger Mgmt)"]
|
||||
NodeStore["NodeStore<br/>(Database)"]
|
||||
end
|
||||
|
||||
subgraph observability["Existing Observability"]
|
||||
PerfLog["PerfLog<br/>(JSON)"]
|
||||
Insight["Insight<br/>(StatsD)"]
|
||||
Logging["Logging<br/>(Journal)"]
|
||||
end
|
||||
|
||||
services --> JobQueue
|
||||
JobQueue --> processing
|
||||
end
|
||||
|
||||
style rippled fill:#424242,stroke:#212121,color:#ffffff
|
||||
style services fill:#1565c0,stroke:#0d47a1,color:#ffffff
|
||||
style processing fill:#2e7d32,stroke:#1b5e20,color:#ffffff
|
||||
style observability fill:#e65100,stroke:#bf360c,color:#ffffff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1.2 Key Components for Instrumentation
|
||||
|
||||
| Component | Location | Purpose | Trace Value |
|
||||
| ----------------- | ------------------------------------------ | ------------------------ | ---------------------------- |
|
||||
| **Overlay** | `src/xrpld/overlay/` | P2P communication | Message propagation timing |
|
||||
| **PeerImp** | `src/xrpld/overlay/detail/PeerImp.cpp` | Individual peer handling | Per-peer latency |
|
||||
| **RCLConsensus** | `src/xrpld/app/consensus/RCLConsensus.cpp` | Consensus algorithm | Round timing, phase analysis |
|
||||
| **NetworkOPs** | `src/xrpld/app/misc/NetworkOPs.cpp` | Transaction processing | Tx lifecycle tracking |
|
||||
| **ServerHandler** | `src/xrpld/rpc/detail/ServerHandler.cpp` | RPC entry point | Request latency |
|
||||
| **RPCHandler** | `src/xrpld/rpc/detail/RPCHandler.cpp` | Command execution | Per-command timing |
|
||||
| **JobQueue** | `src/xrpl/core/JobQueue.h` | Async task execution | Queue wait times |
|
||||
|
||||
---
|
||||
|
||||
## 1.3 Transaction Flow Diagram
|
||||
|
||||
Transaction flow spans multiple nodes in the network. Each node creates linked spans to form a distributed trace:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant PeerA as Peer A (Receive)
|
||||
participant PeerB as Peer B (Relay)
|
||||
participant PeerC as Peer C (Validate)
|
||||
|
||||
Client->>PeerA: 1. Submit TX
|
||||
|
||||
rect rgb(230, 245, 255)
|
||||
Note over PeerA: tx.receive SPAN START
|
||||
PeerA->>PeerA: HashRouter Deduplication
|
||||
PeerA->>PeerA: tx.validate (child span)
|
||||
end
|
||||
|
||||
PeerA->>PeerB: 2. Relay TX (with trace ctx)
|
||||
|
||||
rect rgb(230, 245, 255)
|
||||
Note over PeerB: tx.receive (linked span)
|
||||
end
|
||||
|
||||
PeerB->>PeerC: 3. Relay TX
|
||||
|
||||
rect rgb(230, 245, 255)
|
||||
Note over PeerC: tx.receive (linked span)
|
||||
PeerC->>PeerC: tx.process
|
||||
end
|
||||
|
||||
Note over Client,PeerC: DISTRIBUTED TRACE (same trace_id: abc123)
|
||||
```
|
||||
|
||||
### Trace Structure
|
||||
|
||||
```
|
||||
trace_id: abc123
|
||||
├── span: tx.receive (Peer A)
|
||||
│ ├── span: tx.validate
|
||||
│ └── span: tx.relay
|
||||
├── span: tx.receive (Peer B) [parent: Peer A]
|
||||
│ └── span: tx.relay
|
||||
└── span: tx.receive (Peer C) [parent: Peer B]
|
||||
└── span: tx.process
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1.4 Consensus Round Flow
|
||||
|
||||
Consensus rounds are multi-phase operations that benefit significantly from tracing:
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph round["consensus.round (root span)"]
|
||||
attrs["Attributes:<br/>xrpl.consensus.ledger.seq = 12345678<br/>xrpl.consensus.mode = proposing<br/>xrpl.consensus.proposers = 35"]
|
||||
|
||||
subgraph open["consensus.phase.open"]
|
||||
open_desc["Duration: ~3s<br/>Waiting for transactions"]
|
||||
end
|
||||
|
||||
subgraph establish["consensus.phase.establish"]
|
||||
est_attrs["proposals_received = 28<br/>disputes_resolved = 3"]
|
||||
est_children["├── consensus.proposal.receive (×28)<br/>├── consensus.proposal.send (×1)<br/>└── consensus.dispute.resolve (×3)"]
|
||||
end
|
||||
|
||||
subgraph accept["consensus.phase.accept"]
|
||||
acc_attrs["transactions_applied = 150<br/>ledger.hash = DEF456..."]
|
||||
acc_children["├── ledger.build<br/>└── ledger.validate"]
|
||||
end
|
||||
|
||||
attrs --> open
|
||||
open --> establish
|
||||
establish --> accept
|
||||
end
|
||||
|
||||
style round fill:#f57f17,stroke:#e65100,color:#ffffff
|
||||
style open fill:#1565c0,stroke:#0d47a1,color:#ffffff
|
||||
style establish fill:#2e7d32,stroke:#1b5e20,color:#ffffff
|
||||
style accept fill:#c2185b,stroke:#880e4f,color:#ffffff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1.5 RPC Request Flow
|
||||
|
||||
RPC requests support W3C Trace Context headers for distributed tracing across services:
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph request["rpc.request (root span)"]
|
||||
http["HTTP Request<br/>POST /<br/>traceparent: 00-abc123...-def456...-01"]
|
||||
|
||||
attrs["Attributes:<br/>http.method = POST<br/>net.peer.ip = 192.168.1.100<br/>xrpl.rpc.command = submit"]
|
||||
|
||||
subgraph enqueue["jobqueue.enqueue"]
|
||||
job_attr["xrpl.job.type = jtCLIENT_RPC"]
|
||||
end
|
||||
|
||||
subgraph command["rpc.command.submit"]
|
||||
cmd_attrs["xrpl.rpc.version = 2<br/>xrpl.rpc.role = user"]
|
||||
cmd_children["├── tx.deserialize<br/>├── tx.validate_local<br/>└── tx.submit_to_network"]
|
||||
end
|
||||
|
||||
response["Response: 200 OK<br/>Duration: 45ms"]
|
||||
|
||||
http --> attrs
|
||||
attrs --> enqueue
|
||||
enqueue --> command
|
||||
command --> response
|
||||
end
|
||||
|
||||
style request fill:#2e7d32,stroke:#1b5e20,color:#ffffff
|
||||
style enqueue fill:#1565c0,stroke:#0d47a1,color:#ffffff
|
||||
style command fill:#e65100,stroke:#bf360c,color:#ffffff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1.6 Key Trace Points
|
||||
|
||||
The following table identifies priority instrumentation points across the codebase:
|
||||
|
||||
| Category | Span Name | File | Method | Priority |
|
||||
| --------------- | ---------------------- | -------------------- | ---------------------- | -------- |
|
||||
| **Transaction** | `tx.receive` | `PeerImp.cpp` | `handleTransaction()` | High |
|
||||
| **Transaction** | `tx.validate` | `NetworkOPs.cpp` | `processTransaction()` | High |
|
||||
| **Transaction** | `tx.process` | `NetworkOPs.cpp` | `doTransactionSync()` | High |
|
||||
| **Transaction** | `tx.relay` | `OverlayImpl.cpp` | `relay()` | Medium |
|
||||
| **Consensus** | `consensus.round` | `RCLConsensus.cpp` | `startRound()` | High |
|
||||
| **Consensus** | `consensus.phase.*` | `Consensus.h` | `timerEntry()` | High |
|
||||
| **Consensus** | `consensus.proposal.*` | `RCLConsensus.cpp` | `peerProposal()` | Medium |
|
||||
| **RPC** | `rpc.request` | `ServerHandler.cpp` | `onRequest()` | High |
|
||||
| **RPC** | `rpc.command.*` | `RPCHandler.cpp` | `doCommand()` | High |
|
||||
| **Peer** | `peer.connect` | `OverlayImpl.cpp` | `onHandoff()` | Low |
|
||||
| **Peer** | `peer.message.*` | `PeerImp.cpp` | `onMessage()` | Low |
|
||||
| **Ledger** | `ledger.acquire` | `InboundLedgers.cpp` | `acquire()` | Medium |
|
||||
| **Ledger** | `ledger.build` | `RCLConsensus.cpp` | `buildLCL()` | High |
|
||||
|
||||
---
|
||||
|
||||
## 1.7 Instrumentation Priority
|
||||
|
||||
```mermaid
|
||||
quadrantChart
|
||||
title Instrumentation Priority Matrix
|
||||
x-axis Low Complexity --> High Complexity
|
||||
y-axis Low Value --> High Value
|
||||
quadrant-1 Implement First
|
||||
quadrant-2 Plan Carefully
|
||||
quadrant-3 Quick Wins
|
||||
quadrant-4 Consider Later
|
||||
|
||||
RPC Tracing: [0.3, 0.85]
|
||||
Transaction Tracing: [0.65, 0.92]
|
||||
Consensus Tracing: [0.75, 0.87]
|
||||
Peer Message Tracing: [0.4, 0.3]
|
||||
Ledger Acquisition: [0.5, 0.6]
|
||||
JobQueue Tracing: [0.35, 0.5]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1.8 Observable Outcomes
|
||||
|
||||
After implementing OpenTelemetry, operators and developers will gain visibility into the following:
|
||||
|
||||
### 1.8.1 What You Will See: Traces
|
||||
|
||||
| Trace Type | Description | Example Query in Grafana/Tempo |
|
||||
| -------------------------- | ------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
|
||||
| **Transaction Lifecycle** | Full journey from RPC submission through validation, relay, consensus, and ledger inclusion | `{service.name="rippled" && xrpl.tx.hash="ABC123..."}` |
|
||||
| **Cross-Node Propagation** | Transaction path across multiple rippled nodes with timing | `{xrpl.tx.relay_count > 0}` |
|
||||
| **Consensus Rounds** | Complete round with all phases (open, establish, accept) | `{span.name=~"consensus.round.*"}` |
|
||||
| **RPC Request Processing** | Individual command execution with timing breakdown | `{xrpl.rpc.command="account_info"}` |
|
||||
| **Ledger Acquisition** | Peer-to-peer ledger data requests and responses | `{span.name="ledger.acquire"}` |
|
||||
|
||||
### 1.8.2 What You Will See: Metrics (Derived from Traces)
|
||||
|
||||
| Metric | Description | Dashboard Panel |
|
||||
| ----------------------------- | -------------------------------------- | --------------------------- |
|
||||
| **RPC Latency (p50/p95/p99)** | Response time distribution per command | Heatmap by command |
|
||||
| **Transaction Throughput** | Transactions processed per second | Time series graph |
|
||||
| **Consensus Round Duration** | Time to complete consensus phases | Histogram |
|
||||
| **Cross-Node Latency** | Time for transaction to reach N nodes | Line chart with percentiles |
|
||||
| **Error Rate** | Failed transactions/RPC calls by type | Stacked bar chart |
|
||||
|
||||
### 1.8.3 Concrete Dashboard Examples
|
||||
|
||||
**Transaction Trace View (Jaeger/Tempo):**
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────────────────────────────────┐
|
||||
│ Trace: abc123... (Transaction Submission) Duration: 847ms │
|
||||
├────────────────────────────────────────────────────────────────────────────────┤
|
||||
│ ├── rpc.request [ServerHandler] ████░░░░░░ 45ms │
|
||||
│ │ └── rpc.command.submit [RPCHandler] ████░░░░░░ 42ms │
|
||||
│ │ └── tx.receive [NetworkOPs] ███░░░░░░░ 35ms │
|
||||
│ │ ├── tx.validate [TxQ] █░░░░░░░░░ 8ms │
|
||||
│ │ └── tx.relay [Overlay] ██░░░░░░░░ 15ms │
|
||||
│ │ ├── tx.receive [Node-B] █████░░░░░ 52ms │
|
||||
│ │ │ └── tx.relay [Node-B] ██░░░░░░░░ 18ms │
|
||||
│ │ └── tx.receive [Node-C] ██████░░░░ 65ms │
|
||||
│ └── consensus.round [RCLConsensus] ████████░░ 720ms │
|
||||
│ ├── consensus.phase.open ██░░░░░░░░ 180ms │
|
||||
│ ├── consensus.phase.establish █████░░░░░ 480ms │
|
||||
│ └── consensus.phase.accept █░░░░░░░░░ 60ms │
|
||||
└────────────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**RPC Performance Dashboard Panel:**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ RPC Command Latency (Last 1 Hour) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Command │ p50 │ p95 │ p99 │ Errors │ Rate │
|
||||
│──────────────────┼────────┼────────┼────────┼────────┼──────│
|
||||
│ account_info │ 12ms │ 45ms │ 89ms │ 0.1% │ 150/s│
|
||||
│ submit │ 35ms │ 120ms │ 250ms │ 2.3% │ 45/s│
|
||||
│ ledger │ 8ms │ 25ms │ 55ms │ 0.0% │ 80/s│
|
||||
│ tx │ 15ms │ 50ms │ 100ms │ 0.5% │ 60/s│
|
||||
│ server_info │ 5ms │ 12ms │ 20ms │ 0.0% │ 200/s│
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Consensus Health Dashboard Panel:**
|
||||
|
||||
```mermaid
|
||||
---
|
||||
config:
|
||||
xyChart:
|
||||
width: 1200
|
||||
height: 400
|
||||
plotReservedSpacePercent: 50
|
||||
chartOrientation: vertical
|
||||
themeVariables:
|
||||
xyChart:
|
||||
plotColorPalette: "#3498db"
|
||||
---
|
||||
xychart-beta
|
||||
title "Consensus Round Duration (Last 24 Hours)"
|
||||
x-axis "Time of Day (Hours)" [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
|
||||
y-axis "Duration (seconds)" 1 --> 5
|
||||
line [2.1, 2.3, 2.5, 2.4, 2.8, 1.6, 3.2, 3.0, 3.5, 1.3, 3.8, 3.6, 4.0, 3.2, 4.3, 4.1, 4.5, 4.3, 4.2, 2.4, 4.8, 4.6, 4.9, 4.7, 5.0, 4.9, 4.8, 2.6, 4.7, 4.5, 4.2, 4.0, 2.5, 3.7, 3.2, 3.4, 2.9, 3.1, 2.6, 2.8, 2.3, 1.5, 2.7, 2.4, 2.5, 2.3, 2.2, 2.1, 2.0]
|
||||
```
|
||||
|
||||
### 1.8.4 Operator Actionable Insights
|
||||
|
||||
| Scenario | What You'll See | Action |
|
||||
| --------------------- | ---------------------------------------------------------------------------- | -------------------------------- |
|
||||
| **Slow RPC** | Span showing which phase is slow (parsing, execution, serialization) | Optimize specific code path |
|
||||
| **Transaction Stuck** | Trace stops at validation; error attribute shows reason | Fix transaction parameters |
|
||||
| **Consensus Delay** | Phase.establish taking too long; proposer attribute shows missing validators | Investigate network connectivity |
|
||||
| **Memory Spike** | Large batch of spans correlating with memory increase | Tune batch_size or sampling |
|
||||
| **Network Partition** | Traces missing cross-node links for specific peer | Check peer connectivity |
|
||||
|
||||
### 1.8.5 Developer Debugging Workflow
|
||||
|
||||
1. **Find Transaction**: Query by `xrpl.tx.hash` to get full trace
|
||||
2. **Identify Bottleneck**: Look at span durations to find slowest component
|
||||
3. **Check Attributes**: Review `xrpl.tx.validity`, `xrpl.rpc.status` for errors
|
||||
4. **Correlate Logs**: Use `trace_id` to find related PerfLog entries
|
||||
5. **Compare Nodes**: Filter by `service.instance.id` to compare behavior across nodes
|
||||
|
||||
---
|
||||
|
||||
_Next: [Design Decisions](./02-design-decisions.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,494 +0,0 @@
|
||||
# Design Decisions
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Architecture Analysis](./01-architecture-analysis.md) | [Code Samples](./04-code-samples.md)
|
||||
|
||||
---
|
||||
|
||||
## 2.1 OpenTelemetry Components
|
||||
|
||||
### 2.1.1 SDK Selection
|
||||
|
||||
**Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`)
|
||||
|
||||
| Component | Purpose | Required |
|
||||
| --------------------------------------- | ---------------------- | ----------- |
|
||||
| `opentelemetry-cpp::api` | Tracing API headers | Yes |
|
||||
| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
|
||||
| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
|
||||
| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Recommended |
|
||||
| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Alternative |
|
||||
|
||||
### 2.1.2 Instrumentation Strategy
|
||||
|
||||
**Manual Instrumentation** (recommended):
|
||||
|
||||
| Approach | Pros | Cons |
|
||||
| ---------- | ----------------------------------------------------------------- | ------------------------------------------------------- |
|
||||
| **Manual** | Precise control, optimized placement, rippled-specific attributes | More development effort |
|
||||
| **Auto** | Less code, automatic coverage | Less control, potential overhead, limited customization |
|
||||
|
||||
---
|
||||
|
||||
## 2.2 Exporter Configuration
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph nodes["rippled Nodes"]
|
||||
node1["rippled<br/>Node 1"]
|
||||
node2["rippled<br/>Node 2"]
|
||||
node3["rippled<br/>Node 3"]
|
||||
end
|
||||
|
||||
collector["OpenTelemetry<br/>Collector<br/>(sidecar or standalone)"]
|
||||
|
||||
subgraph backends["Observability Backends"]
|
||||
jaeger["Jaeger<br/>(Dev)"]
|
||||
tempo["Tempo<br/>(Prod)"]
|
||||
elastic["Elastic<br/>APM"]
|
||||
end
|
||||
|
||||
node1 -->|"OTLP/gRPC<br/>:4317"| collector
|
||||
node2 -->|"OTLP/gRPC<br/>:4317"| collector
|
||||
node3 -->|"OTLP/gRPC<br/>:4317"| collector
|
||||
|
||||
collector --> jaeger
|
||||
collector --> tempo
|
||||
collector --> elastic
|
||||
|
||||
style nodes fill:#0d47a1,stroke:#082f6a,color:#ffffff
|
||||
style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
|
||||
```
|
||||
|
||||
### 2.2.1 OTLP/gRPC (Recommended)
|
||||
|
||||
```cpp
|
||||
// Configuration for OTLP over gRPC
|
||||
namespace otlp = opentelemetry::exporter::otlp;
|
||||
|
||||
otlp::OtlpGrpcExporterOptions opts;
|
||||
opts.endpoint = "localhost:4317";
|
||||
opts.use_ssl_credentials = true;
|
||||
opts.ssl_credentials_cacert_path = "/path/to/ca.crt";
|
||||
```
|
||||
|
||||
### 2.2.2 OTLP/HTTP (Alternative)
|
||||
|
||||
```cpp
|
||||
// Configuration for OTLP over HTTP
|
||||
namespace otlp = opentelemetry::exporter::otlp;
|
||||
|
||||
otlp::OtlpHttpExporterOptions opts;
|
||||
opts.url = "http://localhost:4318/v1/traces";
|
||||
opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2.3 Span Naming Conventions
|
||||
|
||||
### 2.3.1 Naming Schema
|
||||
|
||||
```
|
||||
<component>.<operation>[.<sub-operation>]
|
||||
```
|
||||
|
||||
**Examples**:
|
||||
|
||||
- `tx.receive` - Transaction received from peer
|
||||
- `consensus.phase.establish` - Consensus establish phase
|
||||
- `rpc.command.server_info` - server_info RPC command
|
||||
|
||||
### 2.3.2 Complete Span Catalog
|
||||
|
||||
```yaml
|
||||
# Transaction Spans
|
||||
tx:
|
||||
receive: "Transaction received from network"
|
||||
validate: "Transaction signature/format validation"
|
||||
process: "Full transaction processing"
|
||||
relay: "Transaction relay to peers"
|
||||
apply: "Apply transaction to ledger"
|
||||
|
||||
# Consensus Spans
|
||||
consensus:
|
||||
round: "Complete consensus round"
|
||||
phase:
|
||||
open: "Open phase - collecting transactions"
|
||||
establish: "Establish phase - reaching agreement"
|
||||
accept: "Accept phase - applying consensus"
|
||||
proposal:
|
||||
receive: "Receive peer proposal"
|
||||
send: "Send our proposal"
|
||||
validation:
|
||||
receive: "Receive peer validation"
|
||||
send: "Send our validation"
|
||||
|
||||
# RPC Spans
|
||||
rpc:
|
||||
request: "HTTP/WebSocket request handling"
|
||||
command:
|
||||
"*": "Specific RPC command (dynamic)"
|
||||
|
||||
# Peer Spans
|
||||
peer:
|
||||
connect: "Peer connection establishment"
|
||||
disconnect: "Peer disconnection"
|
||||
message:
|
||||
send: "Send protocol message"
|
||||
receive: "Receive protocol message"
|
||||
|
||||
# Ledger Spans
|
||||
ledger:
|
||||
acquire: "Ledger acquisition from network"
|
||||
build: "Build new ledger"
|
||||
validate: "Ledger validation"
|
||||
close: "Close ledger"
|
||||
|
||||
# Job Spans
|
||||
job:
|
||||
enqueue: "Job added to queue"
|
||||
execute: "Job execution"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2.4 Attribute Schema
|
||||
|
||||
### 2.4.1 Resource Attributes (Set Once at Startup)
|
||||
|
||||
```cpp
|
||||
// Standard OpenTelemetry semantic conventions
|
||||
resource::SemanticConventions::SERVICE_NAME = "rippled"
|
||||
resource::SemanticConventions::SERVICE_VERSION = BuildInfo::getVersionString()
|
||||
resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
|
||||
|
||||
// Custom rippled attributes
|
||||
"xrpl.network.id" = <network_id> // e.g., 0 for mainnet
|
||||
"xrpl.network.type" = "mainnet" | "testnet" | "devnet" | "standalone"
|
||||
"xrpl.node.type" = "validator" | "stock" | "reporting"
|
||||
"xrpl.node.cluster" = <cluster_name> // If clustered
|
||||
```
|
||||
|
||||
### 2.4.2 Span Attributes by Category
|
||||
|
||||
#### Transaction Attributes
|
||||
|
||||
```cpp
|
||||
"xrpl.tx.hash" = string // Transaction hash (hex)
|
||||
"xrpl.tx.type" = string // "Payment", "OfferCreate", etc.
|
||||
"xrpl.tx.account" = string // Source account (redacted in prod)
|
||||
"xrpl.tx.sequence" = int64 // Account sequence number
|
||||
"xrpl.tx.fee" = int64 // Fee in drops
|
||||
"xrpl.tx.result" = string // "tesSUCCESS", "tecPATH_DRY", etc.
|
||||
"xrpl.tx.ledger_index" = int64 // Ledger containing transaction
|
||||
```
|
||||
|
||||
#### Consensus Attributes
|
||||
|
||||
```cpp
|
||||
"xrpl.consensus.round" = int64 // Round number
|
||||
"xrpl.consensus.phase" = string // "open", "establish", "accept"
|
||||
"xrpl.consensus.mode" = string // "proposing", "observing", etc.
|
||||
"xrpl.consensus.proposers" = int64 // Number of proposers
|
||||
"xrpl.consensus.ledger.prev" = string // Previous ledger hash
|
||||
"xrpl.consensus.ledger.seq" = int64 // Ledger sequence
|
||||
"xrpl.consensus.tx_count" = int64 // Transactions in consensus set
|
||||
"xrpl.consensus.duration_ms" = float64 // Round duration
|
||||
```
|
||||
|
||||
#### RPC Attributes
|
||||
|
||||
```cpp
|
||||
"xrpl.rpc.command" = string // Command name
|
||||
"xrpl.rpc.version" = int64 // API version
|
||||
"xrpl.rpc.role" = string // "admin" or "user"
|
||||
"xrpl.rpc.params" = string // Sanitized parameters (optional)
|
||||
```
|
||||
|
||||
#### Peer & Message Attributes
|
||||
|
||||
```cpp
|
||||
"xrpl.peer.id" = string // Peer public key (base58)
|
||||
"xrpl.peer.address" = string // IP:port
|
||||
"xrpl.peer.latency_ms" = float64 // Measured latency
|
||||
"xrpl.peer.cluster" = string // Cluster name if clustered
|
||||
"xrpl.message.type" = string // Protocol message type name
|
||||
"xrpl.message.size_bytes" = int64 // Message size
|
||||
"xrpl.message.compressed" = bool // Whether compressed
|
||||
```
|
||||
|
||||
#### Ledger & Job Attributes
|
||||
|
||||
```cpp
|
||||
"xrpl.ledger.hash" = string // Ledger hash
|
||||
"xrpl.ledger.index" = int64 // Ledger sequence/index
|
||||
"xrpl.ledger.close_time" = int64 // Close time (epoch)
|
||||
"xrpl.ledger.tx_count" = int64 // Transaction count
|
||||
"xrpl.job.type" = string // Job type name
|
||||
"xrpl.job.queue_ms" = float64 // Time spent in queue
|
||||
"xrpl.job.worker" = int64 // Worker thread ID
|
||||
```
|
||||
|
||||
### 2.4.3 Data Collection Summary
|
||||
|
||||
The following table summarizes what data is collected by category:
|
||||
|
||||
| Category | Attributes Collected | Purpose |
|
||||
| --------------- | -------------------------------------------------------------------- | --------------------------- |
|
||||
| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle |
|
||||
| **Consensus** | `round`, `phase`, `mode`, `proposers` (public keys), `duration_ms` | Analyze consensus timing |
|
||||
| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance |
|
||||
| **Peer** | `peer.id` (public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis |
|
||||
| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking |
|
||||
| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance |
|
||||
|
||||
### 2.4.4 Privacy & Sensitive Data Policy
|
||||
|
||||
OpenTelemetry instrumentation is designed to collect **operational metadata only**, never sensitive content.
|
||||
|
||||
#### Data NOT Collected
|
||||
|
||||
The following data is explicitly **excluded** from telemetry collection:
|
||||
|
||||
| Excluded Data | Reason |
|
||||
| ----------------------- | ----------------------------------------- |
|
||||
| **Private Keys** | Never exposed; not relevant to tracing |
|
||||
| **Account Balances** | Financial data; privacy sensitive |
|
||||
| **Transaction Amounts** | Financial data; privacy sensitive |
|
||||
| **Raw TX Payloads** | May contain sensitive memo/data fields |
|
||||
| **Personal Data** | No PII collected |
|
||||
| **IP Addresses** | Configurable; excluded by default in prod |
|
||||
|
||||
#### Privacy Protection Mechanisms
|
||||
|
||||
| Mechanism | Description |
|
||||
| ----------------------------- | ------------------------------------------------------------------------- |
|
||||
| **Account Hashing** | `xrpl.tx.account` is hashed at collector level before storage |
|
||||
| **Configurable Redaction** | Sensitive fields can be excluded via `[telemetry]` config section |
|
||||
| **Sampling** | Only 10% of traces recorded by default, reducing data exposure |
|
||||
| **Local Control** | Node operators have full control over what gets exported |
|
||||
| **No Raw Payloads** | Transaction content is never recorded, only metadata (hash, type, result) |
|
||||
| **Collector-Level Filtering** | Additional redaction/hashing can be configured at OTel Collector |
|
||||
|
||||
#### Collector-Level Data Protection
|
||||
|
||||
The OpenTelemetry Collector can be configured to hash or redact sensitive attributes before export:
|
||||
|
||||
```yaml
|
||||
processors:
|
||||
attributes:
|
||||
actions:
|
||||
# Hash account addresses before storage
|
||||
- key: xrpl.tx.account
|
||||
action: hash
|
||||
# Remove IP addresses entirely
|
||||
- key: xrpl.peer.address
|
||||
action: delete
|
||||
# Redact specific fields
|
||||
- key: xrpl.rpc.params
|
||||
action: delete
|
||||
```
|
||||
|
||||
#### Configuration Options for Privacy
|
||||
|
||||
In `rippled.cfg`, operators can control data collection granularity:
|
||||
|
||||
```ini
|
||||
[telemetry]
|
||||
enabled=1
|
||||
|
||||
# Disable collection of specific components
|
||||
trace_transactions=1
|
||||
trace_consensus=1
|
||||
trace_rpc=1
|
||||
trace_peer=0 # Disable peer tracing (high volume, includes addresses)
|
||||
|
||||
# Redact specific attributes
|
||||
redact_account=1 # Hash account addresses before export
|
||||
redact_peer_address=1 # Remove peer IP addresses
|
||||
```
|
||||
|
||||
> **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts, raw payloads).
|
||||
|
||||
---
|
||||
|
||||
## 2.5 Context Propagation Design
|
||||
|
||||
### 2.5.1 Propagation Boundaries
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph http["HTTP/WebSocket (RPC)"]
|
||||
w3c["W3C Trace Context Headers:<br/>traceparent: 00-{trace_id}-{span_id}-{flags}<br/>tracestate: rippled=<state>"]
|
||||
end
|
||||
|
||||
subgraph protobuf["Protocol Buffers (P2P)"]
|
||||
proto["message TraceContext {<br/> bytes trace_id = 1; // 16 bytes<br/> bytes span_id = 2; // 8 bytes<br/> uint32 trace_flags = 3;<br/> string trace_state = 4;<br/>}"]
|
||||
end
|
||||
|
||||
subgraph jobqueue["JobQueue (Internal Async)"]
|
||||
job["Context captured at job creation,<br/>restored at execution<br/><br/>class Job {<br/> opentelemetry::context::Context traceContext_;<br/>};"]
|
||||
end
|
||||
|
||||
style http fill:#0d47a1,stroke:#082f6a,color:#ffffff
|
||||
style protobuf fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style jobqueue fill:#bf360c,stroke:#8c2809,color:#ffffff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2.6 Integration with Existing Observability
|
||||
|
||||
### 2.6.1 Existing Frameworks Comparison
|
||||
|
||||
rippled already has two observability mechanisms. OpenTelemetry complements (not replaces) them:
|
||||
|
||||
| Aspect | PerfLog | Beast Insight (StatsD) | OpenTelemetry |
|
||||
| --------------------- | ----------------------------- | ---------------------------- | ------------------------- |
|
||||
| **Type** | Logging | Metrics | Distributed Tracing |
|
||||
| **Data** | JSON log entries | Counters, gauges, histograms | Spans with context |
|
||||
| **Scope** | Single node | Single node | **Cross-node** |
|
||||
| **Output** | `perf.log` file | StatsD server | OTLP Collector |
|
||||
| **Question answered** | "What happened on this node?" | "How many? How fast?" | "What was the journey?" |
|
||||
| **Correlation** | By timestamp | By metric name | By `trace_id` |
|
||||
| **Overhead** | Low (file I/O) | Low (UDP packets) | Low-Medium (configurable) |
|
||||
|
||||
### 2.6.2 What Each Framework Does Best
|
||||
|
||||
#### PerfLog
|
||||
|
||||
- **Purpose**: Detailed local event logging for RPC and job execution
|
||||
- **Strengths**:
|
||||
- Rich JSON output with timing data
|
||||
- Already integrated in RPC handlers
|
||||
- File-based, no external dependencies
|
||||
- **Limitations**:
|
||||
- Single-node only (no cross-node correlation)
|
||||
- No parent-child relationships between events
|
||||
- Manual log parsing required
|
||||
|
||||
```json
|
||||
// Example PerfLog entry
|
||||
{
|
||||
"time": "2024-01-15T10:30:00.123Z",
|
||||
"method": "submit",
|
||||
"duration_us": 1523,
|
||||
"result": "tesSUCCESS"
|
||||
}
|
||||
```
|
||||
|
||||
#### Beast Insight (StatsD)
|
||||
|
||||
- **Purpose**: Real-time metrics for monitoring dashboards
|
||||
- **Strengths**:
|
||||
- Aggregated metrics (counters, gauges, histograms)
|
||||
- Low overhead (UDP, fire-and-forget)
|
||||
- Good for alerting thresholds
|
||||
- **Limitations**:
|
||||
- No request-level detail
|
||||
- No causal relationships
|
||||
- Single-node perspective
|
||||
|
||||
```cpp
|
||||
// Example StatsD usage in rippled
|
||||
insight.increment("rpc.submit.count");
|
||||
insight.gauge("ledger.age", age);
|
||||
insight.timing("consensus.round", duration);
|
||||
```
|
||||
|
||||
#### OpenTelemetry (NEW)
|
||||
|
||||
- **Purpose**: Distributed request tracing across nodes
|
||||
- **Strengths**:
|
||||
- **Cross-node correlation** via `trace_id`
|
||||
- Parent-child span relationships
|
||||
- Rich attributes per span
|
||||
- Industry standard (CNCF)
|
||||
- **Limitations**:
|
||||
- Requires collector infrastructure
|
||||
- Higher complexity than logging
|
||||
|
||||
```cpp
|
||||
// Example OpenTelemetry span
|
||||
auto span = telemetry.startSpan("tx.relay");
|
||||
span->SetAttribute("tx.hash", hash);
|
||||
span->SetAttribute("peer.id", peerId);
|
||||
// Span automatically linked to parent via context
|
||||
```
|
||||
|
||||
### 2.6.3 When to Use Each
|
||||
|
||||
| Scenario | PerfLog | StatsD | OpenTelemetry |
|
||||
| --------------------------------------- | ---------- | ------ | ------------- |
|
||||
| "How many TXs per second?" | ❌ | ✅ | ❌ |
|
||||
| "What's the p99 RPC latency?" | ❌ | ✅ | ✅ |
|
||||
| "Why was this specific TX slow?" | ⚠️ partial | ❌ | ✅ |
|
||||
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
|
||||
| "What happened on node X at time T?" | ✅ | ❌ | ✅ |
|
||||
| "Show me the TX journey across 5 nodes" | ❌ | ❌ | ✅ |
|
||||
|
||||
### 2.6.4 Coexistence Strategy
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph rippled["rippled Process"]
|
||||
perflog["PerfLog<br/>(JSON to file)"]
|
||||
insight["Beast Insight<br/>(StatsD)"]
|
||||
otel["OpenTelemetry<br/>(Tracing)"]
|
||||
end
|
||||
|
||||
perflog --> perffile["perf.log"]
|
||||
insight --> statsd["StatsD Server"]
|
||||
otel --> collector["OTLP Collector"]
|
||||
|
||||
perffile --> grafana["Grafana<br/>(Unified UI)"]
|
||||
statsd --> grafana
|
||||
collector --> grafana
|
||||
|
||||
style rippled fill:#212121,stroke:#0a0a0a,color:#ffffff
|
||||
style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff
|
||||
```
|
||||
|
||||
### 2.6.5 Correlation with PerfLog
|
||||
|
||||
Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging:
|
||||
|
||||
```cpp
|
||||
// In RPCHandler.cpp - correlate trace with PerfLog
|
||||
Status doCommand(RPC::JsonContext& context, Json::Value& result)
|
||||
{
|
||||
// Start OpenTelemetry span
|
||||
auto span = context.app.getTelemetry().startSpan(
|
||||
"rpc.command." + context.method);
|
||||
|
||||
// Get trace ID for correlation
|
||||
auto traceId = span->GetContext().trace_id().IsValid()
|
||||
? toHex(span->GetContext().trace_id())
|
||||
: "";
|
||||
|
||||
// Use existing PerfLog with trace correlation
|
||||
auto const curId = context.app.getPerfLog().currentId();
|
||||
context.app.getPerfLog().rpcStart(context.method, curId);
|
||||
|
||||
// Future: Add trace ID to PerfLog entry
|
||||
// context.app.getPerfLog().setTraceId(curId, traceId);
|
||||
|
||||
try {
|
||||
auto ret = handler(context, result);
|
||||
context.app.getPerfLog().rpcFinish(context.method, curId);
|
||||
span->SetStatus(opentelemetry::trace::StatusCode::kOk);
|
||||
return ret;
|
||||
} catch (std::exception const& e) {
|
||||
context.app.getPerfLog().rpcError(context.method, curId);
|
||||
span->RecordException(e);
|
||||
span->SetStatus(opentelemetry::trace::StatusCode::kError, e.what());
|
||||
throw;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Architecture Analysis](./01-architecture-analysis.md)_ | _Next: [Implementation Strategy](./03-implementation-strategy.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,451 +0,0 @@
|
||||
# Implementation Strategy
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Code Samples](./04-code-samples.md) | [Configuration Reference](./05-configuration-reference.md)
|
||||
|
||||
---
|
||||
|
||||
## 3.1 Directory Structure
|
||||
|
||||
The telemetry implementation follows rippled's existing code organization pattern:
|
||||
|
||||
```
|
||||
include/xrpl/
|
||||
├── telemetry/
|
||||
│ ├── Telemetry.h # Main telemetry interface
|
||||
│ ├── TelemetryConfig.h # Configuration structures
|
||||
│ ├── TraceContext.h # Context propagation utilities
|
||||
│ ├── SpanGuard.h # RAII span management
|
||||
│ └── SpanAttributes.h # Attribute helper functions
|
||||
|
||||
src/libxrpl/
|
||||
├── telemetry/
|
||||
│ ├── Telemetry.cpp # Implementation
|
||||
│ ├── TelemetryConfig.cpp # Config parsing
|
||||
│ ├── TraceContext.cpp # Context serialization
|
||||
│ └── NullTelemetry.cpp # No-op implementation
|
||||
|
||||
src/xrpld/
|
||||
├── telemetry/
|
||||
│ ├── TracingInstrumentation.h # Instrumentation macros
|
||||
│ └── TracingInstrumentation.cpp
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3.2 Implementation Approach
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 30}}}%%
|
||||
flowchart TB
|
||||
subgraph phase1["Phase 1: Core"]
|
||||
direction LR
|
||||
sdk["SDK Integration"] ~~~ interface["Telemetry Interface"] ~~~ config["Configuration"]
|
||||
end
|
||||
|
||||
subgraph phase2["Phase 2: RPC"]
|
||||
direction LR
|
||||
http["HTTP Context"] ~~~ rpc["RPC Handlers"]
|
||||
end
|
||||
|
||||
subgraph phase3["Phase 3: P2P"]
|
||||
direction LR
|
||||
proto["Protobuf Context"] ~~~ tx["Transaction Relay"]
|
||||
end
|
||||
|
||||
subgraph phase4["Phase 4: Consensus"]
|
||||
direction LR
|
||||
consensus["Consensus Rounds"] ~~~ proposals["Proposals"]
|
||||
end
|
||||
|
||||
phase1 --> phase2 --> phase3 --> phase4
|
||||
|
||||
style phase1 fill:#1565c0,stroke:#0d47a1,color:#ffffff
|
||||
style phase2 fill:#2e7d32,stroke:#1b5e20,color:#ffffff
|
||||
style phase3 fill:#e65100,stroke:#bf360c,color:#ffffff
|
||||
style phase4 fill:#c2185b,stroke:#880e4f,color:#ffffff
|
||||
```
|
||||
|
||||
</div>
|
||||
|
||||
### Key Principles
|
||||
|
||||
1. **Minimal Intrusion**: Instrumentation should not alter existing control flow
|
||||
2. **Zero-Cost When Disabled**: Use compile-time flags and no-op implementations
|
||||
3. **Backward Compatibility**: Protocol Buffer extensions use high field numbers
|
||||
4. **Graceful Degradation**: Tracing failures must not affect node operation
|
||||
|
||||
---
|
||||
|
||||
## 3.3 Performance Overhead Summary
|
||||
|
||||
| Metric | Overhead | Notes |
|
||||
| ------------- | ---------- | ----------------------------------- |
|
||||
| CPU | 1-3% | Span creation and attribute setting |
|
||||
| Memory | 2-5 MB | Batch buffer for pending spans |
|
||||
| Network | 10-50 KB/s | Compressed OTLP export to collector |
|
||||
| Latency (p99) | <2% | With proper sampling configuration |
|
||||
|
||||
---
|
||||
|
||||
## 3.4 Detailed CPU Overhead Analysis
|
||||
|
||||
### 3.4.1 Per-Operation Costs
|
||||
|
||||
| Operation | Time (ns) | Frequency | Impact |
|
||||
| --------------------- | --------- | ---------------------- | ---------- |
|
||||
| Span creation | 200-500 | Every traced operation | Low |
|
||||
| Span end | 100-200 | Every traced operation | Low |
|
||||
| SetAttribute (string) | 80-120 | 3-5 per span | Low |
|
||||
| SetAttribute (int) | 40-60 | 2-3 per span | Negligible |
|
||||
| AddEvent | 50-80 | 0-2 per span | Negligible |
|
||||
| Context injection | 150-250 | Per outgoing message | Low |
|
||||
| Context extraction | 100-180 | Per incoming message | Low |
|
||||
| GetCurrent context | 10-20 | Thread-local access | Negligible |
|
||||
|
||||
### 3.4.2 Transaction Processing Overhead
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
%%{init: {'pie': {'textPosition': 0.75}}}%%
|
||||
pie showData
|
||||
"tx.receive (800ns)" : 800
|
||||
"tx.validate (500ns)" : 500
|
||||
"tx.relay (500ns)" : 500
|
||||
"Context inject (600ns)" : 600
|
||||
```
|
||||
|
||||
**Transaction Tracing Overhead (~2.4μs total)**
|
||||
|
||||
</div>
|
||||
|
||||
**Overhead percentage**: 2.4 μs / 200 μs (avg tx processing) = **~1.2%**
|
||||
|
||||
### 3.4.3 Consensus Round Overhead
|
||||
|
||||
| Operation | Count | Cost (ns) | Total |
|
||||
| ---------------------- | ----- | --------- | ---------- |
|
||||
| consensus.round span | 1 | ~1000 | ~1 μs |
|
||||
| consensus.phase spans | 3 | ~700 | ~2.1 μs |
|
||||
| proposal.receive spans | ~20 | ~600 | ~12 μs |
|
||||
| proposal.send spans | ~3 | ~600 | ~1.8 μs |
|
||||
| Context operations | ~30 | ~200 | ~6 μs |
|
||||
| **TOTAL** | | | **~23 μs** |
|
||||
|
||||
**Overhead percentage**: 23 μs / 3s (typical round) = **~0.0008%** (negligible)
|
||||
|
||||
### 3.4.4 RPC Request Overhead
|
||||
|
||||
| Operation | Cost (ns) |
|
||||
| ---------------- | ------------ |
|
||||
| rpc.request span | ~700 |
|
||||
| rpc.command span | ~600 |
|
||||
| Context extract | ~250 |
|
||||
| Context inject | ~200 |
|
||||
| **TOTAL** | **~1.75 μs** |
|
||||
|
||||
- Fast RPC (1ms): 1.75 μs / 1ms = **~0.175%**
|
||||
- Slow RPC (100ms): 1.75 μs / 100ms = **~0.002%**
|
||||
|
||||
---
|
||||
|
||||
## 3.5 Memory Overhead Analysis
|
||||
|
||||
### 3.5.1 Static Memory
|
||||
|
||||
| Component | Size | Allocated |
|
||||
| ------------------------ | ----------- | ---------- |
|
||||
| TracerProvider singleton | ~64 KB | At startup |
|
||||
| BatchSpanProcessor | ~128 KB | At startup |
|
||||
| OTLP exporter | ~256 KB | At startup |
|
||||
| Propagator registry | ~8 KB | At startup |
|
||||
| **Total static** | **~456 KB** | |
|
||||
|
||||
### 3.5.2 Dynamic Memory
|
||||
|
||||
| Component | Size per unit | Max units | Peak |
|
||||
| -------------------- | ------------- | ---------- | ----------- |
|
||||
| Active span | ~200 bytes | 1000 | ~200 KB |
|
||||
| Queued span (export) | ~500 bytes | 2048 | ~1 MB |
|
||||
| Attribute storage | ~50 bytes | 5 per span | Included |
|
||||
| Context storage | ~64 bytes | Per thread | ~6.4 KB |
|
||||
| **Total dynamic** | | | **~1.2 MB** |
|
||||
|
||||
### 3.5.3 Memory Growth Characteristics
|
||||
|
||||
```mermaid
|
||||
---
|
||||
config:
|
||||
xyChart:
|
||||
width: 700
|
||||
height: 400
|
||||
---
|
||||
xychart-beta
|
||||
title "Memory Usage vs Span Rate"
|
||||
x-axis "Spans/second" [0, 200, 400, 600, 800, 1000]
|
||||
y-axis "Memory (MB)" 0 --> 6
|
||||
line [1, 1.8, 2.6, 3.4, 4.2, 5]
|
||||
```
|
||||
|
||||
**Notes**:
|
||||
|
||||
- Memory increases linearly with span rate
|
||||
- Batch export prevents unbounded growth
|
||||
- Queue size is configurable (default 2048 spans)
|
||||
- At queue limit, oldest spans are dropped (not blocked)
|
||||
|
||||
---
|
||||
|
||||
## 3.6 Network Overhead Analysis
|
||||
|
||||
### 3.6.1 Export Bandwidth
|
||||
|
||||
| Sampling Rate | Spans/sec | Bandwidth | Notes |
|
||||
| ------------- | --------- | --------- | ---------------- |
|
||||
| 100% | ~500 | ~250 KB/s | Development only |
|
||||
| 10% | ~50 | ~25 KB/s | Staging |
|
||||
| 1% | ~5 | ~2.5 KB/s | Production |
|
||||
| Error-only | ~1 | ~0.5 KB/s | Minimal overhead |
|
||||
|
||||
### 3.6.2 Trace Context Propagation
|
||||
|
||||
| Message Type | Context Size | Messages/sec | Overhead |
|
||||
| ---------------------- | ------------ | ------------ | ----------- |
|
||||
| TMTransaction | 32 bytes | ~100 | ~3.2 KB/s |
|
||||
| TMProposeSet | 32 bytes | ~10 | ~320 B/s |
|
||||
| TMValidation | 32 bytes | ~50 | ~1.6 KB/s |
|
||||
| **Total P2P overhead** | | | **~5 KB/s** |
|
||||
|
||||
---
|
||||
|
||||
## 3.7 Optimization Strategies
|
||||
|
||||
### 3.7.1 Sampling Strategies
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
trace["New Trace"]
|
||||
|
||||
trace --> errors{"Is Error?"}
|
||||
errors -->|Yes| sample["SAMPLE"]
|
||||
errors -->|No| consensus{"Is Consensus?"}
|
||||
|
||||
consensus -->|Yes| sample
|
||||
consensus -->|No| slow{"Is Slow?"}
|
||||
|
||||
slow -->|Yes| sample
|
||||
slow -->|No| prob{"Random < 10%?"}
|
||||
|
||||
prob -->|Yes| sample
|
||||
prob -->|No| drop["DROP"]
|
||||
|
||||
style sample fill:#4caf50,stroke:#388e3c,color:#fff
|
||||
style drop fill:#f44336,stroke:#c62828,color:#fff
|
||||
```
|
||||
|
||||
### 3.7.2 Batch Tuning Recommendations
|
||||
|
||||
| Environment | Batch Size | Batch Delay | Max Queue |
|
||||
| ------------------ | ---------- | ----------- | --------- |
|
||||
| Low-latency | 128 | 1000ms | 512 |
|
||||
| High-throughput | 1024 | 10000ms | 8192 |
|
||||
| Memory-constrained | 256 | 2000ms | 512 |
|
||||
|
||||
### 3.7.3 Conditional Instrumentation
|
||||
|
||||
```cpp
|
||||
// Compile-time feature flag
|
||||
#ifndef XRPL_ENABLE_TELEMETRY
|
||||
// Zero-cost when disabled
|
||||
#define XRPL_TRACE_SPAN(t, n) ((void)0)
|
||||
#endif
|
||||
|
||||
// Runtime component filtering
|
||||
if (telemetry.shouldTracePeer())
|
||||
{
|
||||
XRPL_TRACE_SPAN(telemetry, "peer.message.receive");
|
||||
// ... instrumentation
|
||||
}
|
||||
// No overhead when component tracing disabled
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3.8 Links to Detailed Documentation
|
||||
|
||||
- **[Code Samples](./04-code-samples.md)**: Complete implementation code for all components
|
||||
- **[Configuration Reference](./05-configuration-reference.md)**: Configuration options and collector setup
|
||||
- **[Implementation Phases](./06-implementation-phases.md)**: Detailed timeline and milestones
|
||||
|
||||
---
|
||||
|
||||
## 3.9 Code Intrusiveness Assessment
|
||||
|
||||
This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing rippled codebase.
|
||||
|
||||
### 3.9.1 Files Modified Summary
|
||||
|
||||
| Component | Files Modified | Lines Added | Lines Changed | Architectural Impact |
|
||||
| --------------------- | -------------- | ----------- | ------------- | -------------------- |
|
||||
| **Core Telemetry** | 5 new files | ~800 | 0 | None (new module) |
|
||||
| **Application Init** | 2 files | ~30 | ~5 | Minimal |
|
||||
| **RPC Layer** | 3 files | ~80 | ~20 | Minimal |
|
||||
| **Transaction Relay** | 4 files | ~120 | ~40 | Low |
|
||||
| **Consensus** | 3 files | ~100 | ~30 | Low-Medium |
|
||||
| **Protocol Buffers** | 1 file | ~25 | 0 | Low |
|
||||
| **CMake/Build** | 3 files | ~50 | ~10 | Minimal |
|
||||
| **Total** | **~21 files** | **~1,205** | **~105** | **Low** |
|
||||
|
||||
### 3.9.2 Detailed File Impact
|
||||
|
||||
```mermaid
|
||||
pie title Code Changes by Component
|
||||
"New Telemetry Module" : 800
|
||||
"Transaction Relay" : 160
|
||||
"Consensus" : 130
|
||||
"RPC Layer" : 100
|
||||
"Application Init" : 35
|
||||
"Protocol Buffers" : 25
|
||||
"Build System" : 60
|
||||
```
|
||||
|
||||
#### New Files (No Impact on Existing Code)
|
||||
|
||||
| File | Lines | Purpose |
|
||||
| ---------------------------------------------- | ----- | -------------------- |
|
||||
| `include/xrpl/telemetry/Telemetry.h` | ~160 | Main interface |
|
||||
| `include/xrpl/telemetry/SpanGuard.h` | ~120 | RAII wrapper |
|
||||
| `include/xrpl/telemetry/TraceContext.h` | ~80 | Context propagation |
|
||||
| `src/xrpld/telemetry/TracingInstrumentation.h` | ~60 | Macros |
|
||||
| `src/libxrpl/telemetry/Telemetry.cpp` | ~200 | Implementation |
|
||||
| `src/libxrpl/telemetry/TelemetryConfig.cpp` | ~60 | Config parsing |
|
||||
| `src/libxrpl/telemetry/NullTelemetry.cpp` | ~40 | No-op implementation |
|
||||
|
||||
#### Modified Files (Existing Rippled Code)
|
||||
|
||||
| File | Lines Added | Lines Changed | Risk Level |
|
||||
| ------------------------------------------------- | ----------- | ------------- | ---------- |
|
||||
| `src/xrpld/app/main/Application.cpp` | ~15 | ~3 | Low |
|
||||
| `include/xrpl/app/main/Application.h` | ~5 | ~2 | Low |
|
||||
| `src/xrpld/rpc/detail/ServerHandler.cpp` | ~40 | ~10 | Low |
|
||||
| `src/xrpld/rpc/handlers/*.cpp` | ~30 | ~8 | Low |
|
||||
| `src/xrpld/overlay/detail/PeerImp.cpp` | ~60 | ~15 | Medium |
|
||||
| `src/xrpld/overlay/detail/OverlayImpl.cpp` | ~30 | ~10 | Medium |
|
||||
| `src/xrpld/app/consensus/RCLConsensus.cpp` | ~50 | ~15 | Medium |
|
||||
| `src/xrpld/app/consensus/RCLConsensusAdaptor.cpp` | ~40 | ~12 | Medium |
|
||||
| `src/xrpld/core/JobQueue.cpp` | ~20 | ~5 | Low |
|
||||
| `src/xrpld/overlay/detail/ripple.proto` | ~25 | 0 | Low |
|
||||
| `CMakeLists.txt` | ~40 | ~8 | Low |
|
||||
| `cmake/FindOpenTelemetry.cmake` | ~50 | 0 | None (new) |
|
||||
|
||||
### 3.9.3 Risk Assessment by Component
|
||||
|
||||
<div align="center">
|
||||
|
||||
**Do First** ↖ ↗ **Plan Carefully**
|
||||
|
||||
```mermaid
|
||||
quadrantChart
|
||||
title Code Intrusiveness Risk Matrix
|
||||
x-axis Low Risk --> High Risk
|
||||
y-axis Low Value --> High Value
|
||||
|
||||
RPC Tracing: [0.2, 0.8]
|
||||
Transaction Relay: [0.5, 0.9]
|
||||
Consensus Tracing: [0.7, 0.95]
|
||||
Peer Message Tracing: [0.8, 0.4]
|
||||
JobQueue Context: [0.4, 0.5]
|
||||
Ledger Acquisition: [0.5, 0.6]
|
||||
```
|
||||
|
||||
**Optional** ↙ ↘ **Avoid**
|
||||
|
||||
</div>
|
||||
|
||||
#### Risk Level Definitions
|
||||
|
||||
| Risk Level | Definition | Mitigation |
|
||||
| ---------- | ---------------------------------------------------------------- | ---------------------------------- |
|
||||
| **Low** | Additive changes only; no modification to existing logic | Standard code review |
|
||||
| **Medium** | Minor modifications to existing functions; clear boundaries | Comprehensive unit tests |
|
||||
| **High** | Changes to core logic or data structures; potential side effects | Integration tests + staged rollout |
|
||||
|
||||
### 3.9.4 Architectural Impact Assessment
|
||||
|
||||
| Aspect | Impact | Justification |
|
||||
| -------------------- | ------- | --------------------------------------------------------------------- |
|
||||
| **Data Flow** | None | Tracing is purely observational; no business logic changes |
|
||||
| **Threading Model** | Minimal | Context propagation uses thread-local storage (standard OTel pattern) |
|
||||
| **Memory Model** | Low | Bounded queues prevent unbounded growth; RAII ensures cleanup |
|
||||
| **Network Protocol** | Low | Optional fields in protobuf (high field numbers); backward compatible |
|
||||
| **Configuration** | None | New config section; existing configs unaffected |
|
||||
| **Build System** | Low | Optional CMake flag; builds work without OpenTelemetry |
|
||||
| **Dependencies** | Low | OpenTelemetry SDK is optional; null implementation when disabled |
|
||||
|
||||
### 3.9.5 Backward Compatibility
|
||||
|
||||
| Compatibility | Status | Notes |
|
||||
| --------------- | ------- | ----------------------------------------------------- |
|
||||
| **Config File** | ✅ Full | New `[telemetry]` section is optional |
|
||||
| **Protocol** | ✅ Full | Optional protobuf fields with high field numbers |
|
||||
| **Build** | ✅ Full | `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary |
|
||||
| **Runtime** | ✅ Full | `enabled=0` produces zero overhead |
|
||||
| **API** | ✅ Full | No changes to public RPC or P2P APIs |
|
||||
|
||||
### 3.9.6 Rollback Strategy
|
||||
|
||||
If issues are discovered after deployment:
|
||||
|
||||
1. **Immediate**: Set `enabled=0` in config and restart (zero code change)
|
||||
2. **Quick**: Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`
|
||||
3. **Complete**: Revert telemetry commits (clean separation makes this easy)
|
||||
|
||||
### 3.9.7 Code Change Examples
|
||||
|
||||
**Minimal RPC Instrumentation (Low Intrusiveness):**
|
||||
|
||||
```cpp
|
||||
// Before
|
||||
void ServerHandler::onRequest(...) {
|
||||
auto result = processRequest(req);
|
||||
send(result);
|
||||
}
|
||||
|
||||
// After (only ~10 lines added)
|
||||
void ServerHandler::onRequest(...) {
|
||||
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request"); // +1 line
|
||||
XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command); // +1 line
|
||||
|
||||
auto result = processRequest(req);
|
||||
|
||||
XRPL_TRACE_SET_ATTR("xrpl.rpc.status", status); // +1 line
|
||||
send(result);
|
||||
}
|
||||
```
|
||||
|
||||
**Consensus Instrumentation (Medium Intrusiveness):**
|
||||
|
||||
```cpp
|
||||
// Before
|
||||
void RCLConsensusAdaptor::startRound(...) {
|
||||
// ... existing logic
|
||||
}
|
||||
|
||||
// After (context storage required)
|
||||
void RCLConsensusAdaptor::startRound(...) {
|
||||
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.round");
|
||||
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.seq", seq);
|
||||
|
||||
// Store context for child spans in phase transitions
|
||||
currentRoundContext_ = _xrpl_guard_->context(); // New member variable
|
||||
|
||||
// ... existing logic unchanged
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Design Decisions](./02-design-decisions.md)_ | _Next: [Code Samples](./04-code-samples.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,982 +0,0 @@
|
||||
# Code Samples
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Implementation Strategy](./03-implementation-strategy.md) | [Configuration Reference](./05-configuration-reference.md)
|
||||
|
||||
---
|
||||
|
||||
## 4.1 Core Interfaces
|
||||
|
||||
### 4.1.1 Main Telemetry Interface
|
||||
|
||||
```cpp
|
||||
// include/xrpl/telemetry/Telemetry.h
|
||||
#pragma once
|
||||
|
||||
#include <xrpl/telemetry/TelemetryConfig.h>
|
||||
#include <opentelemetry/trace/tracer.h>
|
||||
#include <opentelemetry/trace/span.h>
|
||||
#include <opentelemetry/context/context.h>
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <string_view>
|
||||
|
||||
namespace xrpl {
|
||||
namespace telemetry {
|
||||
|
||||
/**
|
||||
* Main telemetry interface for OpenTelemetry integration.
|
||||
*
|
||||
* This class provides the primary API for distributed tracing in rippled.
|
||||
* It manages the OpenTelemetry SDK lifecycle and provides convenience
|
||||
* methods for creating spans and propagating context.
|
||||
*/
|
||||
class Telemetry
|
||||
{
|
||||
public:
|
||||
/**
|
||||
* Configuration for the telemetry system.
|
||||
*/
|
||||
struct Setup
|
||||
{
|
||||
bool enabled = false;
|
||||
std::string serviceName = "rippled";
|
||||
std::string serviceVersion;
|
||||
std::string serviceInstanceId; // Node public key
|
||||
|
||||
// Exporter configuration
|
||||
std::string exporterType = "otlp_grpc"; // "otlp_grpc", "otlp_http", "none"
|
||||
std::string exporterEndpoint = "localhost:4317";
|
||||
bool useTls = false;
|
||||
std::string tlsCertPath;
|
||||
|
||||
// Sampling configuration
|
||||
double samplingRatio = 1.0; // 1.0 = 100% sampling
|
||||
|
||||
// Batch processor settings
|
||||
std::uint32_t batchSize = 512;
|
||||
std::chrono::milliseconds batchDelay{5000};
|
||||
std::uint32_t maxQueueSize = 2048;
|
||||
|
||||
// Network attributes
|
||||
std::uint32_t networkId = 0;
|
||||
std::string networkType = "mainnet";
|
||||
|
||||
// Component filtering
|
||||
bool traceTransactions = true;
|
||||
bool traceConsensus = true;
|
||||
bool traceRpc = true;
|
||||
bool tracePeer = false; // High volume, disabled by default
|
||||
bool traceLedger = true;
|
||||
};
|
||||
|
||||
virtual ~Telemetry() = default;
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
// LIFECYCLE
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
|
||||
/** Start the telemetry system (call after configuration) */
|
||||
virtual void start() = 0;
|
||||
|
||||
/** Stop the telemetry system (flushes pending spans) */
|
||||
virtual void stop() = 0;
|
||||
|
||||
/** Check if telemetry is enabled */
|
||||
virtual bool isEnabled() const = 0;
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
// TRACER ACCESS
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
|
||||
/** Get the tracer for creating spans */
|
||||
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>
|
||||
getTracer(std::string_view name = "rippled") = 0;
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
// SPAN CREATION (Convenience Methods)
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
|
||||
/** Start a new span with default options */
|
||||
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
|
||||
startSpan(
|
||||
std::string_view name,
|
||||
opentelemetry::trace::SpanKind kind =
|
||||
opentelemetry::trace::SpanKind::kInternal) = 0;
|
||||
|
||||
/** Start a span as child of given context */
|
||||
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
|
||||
startSpan(
|
||||
std::string_view name,
|
||||
opentelemetry::context::Context const& parentContext,
|
||||
opentelemetry::trace::SpanKind kind =
|
||||
opentelemetry::trace::SpanKind::kInternal) = 0;
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
// CONTEXT PROPAGATION
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
|
||||
/** Serialize context for network transmission */
|
||||
virtual std::string serializeContext(
|
||||
opentelemetry::context::Context const& ctx) = 0;
|
||||
|
||||
/** Deserialize context from network data */
|
||||
virtual opentelemetry::context::Context deserializeContext(
|
||||
std::string const& serialized) = 0;
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
// COMPONENT FILTERING
|
||||
// ═══════════════════════════════════════════════════════════════════════
|
||||
|
||||
/** Check if transaction tracing is enabled */
|
||||
virtual bool shouldTraceTransactions() const = 0;
|
||||
|
||||
/** Check if consensus tracing is enabled */
|
||||
virtual bool shouldTraceConsensus() const = 0;
|
||||
|
||||
/** Check if RPC tracing is enabled */
|
||||
virtual bool shouldTraceRpc() const = 0;
|
||||
|
||||
/** Check if peer message tracing is enabled */
|
||||
virtual bool shouldTracePeer() const = 0;
|
||||
};
|
||||
|
||||
// Factory functions
|
||||
std::unique_ptr<Telemetry>
|
||||
make_Telemetry(
|
||||
Telemetry::Setup const& setup,
|
||||
beast::Journal journal);
|
||||
|
||||
Telemetry::Setup
|
||||
setup_Telemetry(
|
||||
Section const& section,
|
||||
std::string const& nodePublicKey,
|
||||
std::string const& version);
|
||||
|
||||
} // namespace telemetry
|
||||
} // namespace xrpl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4.2 RAII Span Guard
|
||||
|
||||
```cpp
|
||||
// include/xrpl/telemetry/SpanGuard.h
|
||||
#pragma once
|
||||
|
||||
#include <opentelemetry/trace/span.h>
|
||||
#include <opentelemetry/trace/scope.h>
|
||||
#include <opentelemetry/trace/status.h>
|
||||
|
||||
#include <string_view>
|
||||
#include <exception>
|
||||
|
||||
namespace xrpl {
|
||||
namespace telemetry {
|
||||
|
||||
/**
|
||||
* RAII guard for OpenTelemetry spans.
|
||||
*
|
||||
* Automatically ends the span on destruction and makes it the current
|
||||
* span in the thread-local context.
|
||||
*/
|
||||
class SpanGuard
|
||||
{
|
||||
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span_;
|
||||
opentelemetry::trace::Scope scope_;
|
||||
|
||||
public:
|
||||
/**
|
||||
* Construct guard with span.
|
||||
* The span becomes the current span in thread-local context.
|
||||
*/
|
||||
explicit SpanGuard(
|
||||
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span)
|
||||
: span_(std::move(span))
|
||||
, scope_(span_)
|
||||
{
|
||||
}
|
||||
|
||||
// Non-copyable, non-movable
|
||||
SpanGuard(SpanGuard const&) = delete;
|
||||
SpanGuard& operator=(SpanGuard const&) = delete;
|
||||
SpanGuard(SpanGuard&&) = delete;
|
||||
SpanGuard& operator=(SpanGuard&&) = delete;
|
||||
|
||||
~SpanGuard()
|
||||
{
|
||||
if (span_)
|
||||
span_->End();
|
||||
}
|
||||
|
||||
/** Access the underlying span */
|
||||
opentelemetry::trace::Span& span() { return *span_; }
|
||||
opentelemetry::trace::Span const& span() const { return *span_; }
|
||||
|
||||
/** Set span status to OK */
|
||||
void setOk()
|
||||
{
|
||||
span_->SetStatus(opentelemetry::trace::StatusCode::kOk);
|
||||
}
|
||||
|
||||
/** Set span status with code and description */
|
||||
void setStatus(
|
||||
opentelemetry::trace::StatusCode code,
|
||||
std::string_view description = "")
|
||||
{
|
||||
span_->SetStatus(code, std::string(description));
|
||||
}
|
||||
|
||||
/** Set an attribute on the span */
|
||||
template<typename T>
|
||||
void setAttribute(std::string_view key, T&& value)
|
||||
{
|
||||
span_->SetAttribute(
|
||||
opentelemetry::nostd::string_view(key.data(), key.size()),
|
||||
std::forward<T>(value));
|
||||
}
|
||||
|
||||
/** Add an event to the span */
|
||||
void addEvent(std::string_view name)
|
||||
{
|
||||
span_->AddEvent(std::string(name));
|
||||
}
|
||||
|
||||
/** Record an exception on the span */
|
||||
void recordException(std::exception const& e)
|
||||
{
|
||||
span_->RecordException(e);
|
||||
span_->SetStatus(
|
||||
opentelemetry::trace::StatusCode::kError,
|
||||
e.what());
|
||||
}
|
||||
|
||||
/** Get the current trace context */
|
||||
opentelemetry::context::Context context() const
|
||||
{
|
||||
return opentelemetry::context::RuntimeContext::GetCurrent();
|
||||
}
|
||||
};
|
||||
|
||||
/**
|
||||
* No-op span guard for when tracing is disabled.
|
||||
* Provides the same interface but does nothing.
|
||||
*/
|
||||
class NullSpanGuard
|
||||
{
|
||||
public:
|
||||
NullSpanGuard() = default;
|
||||
|
||||
void setOk() {}
|
||||
void setStatus(opentelemetry::trace::StatusCode, std::string_view = "") {}
|
||||
|
||||
template<typename T>
|
||||
void setAttribute(std::string_view, T&&) {}
|
||||
|
||||
void addEvent(std::string_view) {}
|
||||
void recordException(std::exception const&) {}
|
||||
};
|
||||
|
||||
} // namespace telemetry
|
||||
} // namespace xrpl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4.3 Instrumentation Macros
|
||||
|
||||
```cpp
|
||||
// src/xrpld/telemetry/TracingInstrumentation.h
|
||||
#pragma once
|
||||
|
||||
#include <xrpl/telemetry/Telemetry.h>
|
||||
#include <xrpl/telemetry/SpanGuard.h>
|
||||
|
||||
namespace xrpl {
|
||||
namespace telemetry {
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════════
|
||||
// INSTRUMENTATION MACROS
|
||||
// ═══════════════════════════════════════════════════════════════════════════
|
||||
|
||||
#ifdef XRPL_ENABLE_TELEMETRY
|
||||
|
||||
// Start a span that is automatically ended when guard goes out of scope
|
||||
#define XRPL_TRACE_SPAN(telemetry, name) \
|
||||
auto _xrpl_span_ = (telemetry).startSpan(name); \
|
||||
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
|
||||
|
||||
// Start a span with specific kind
|
||||
#define XRPL_TRACE_SPAN_KIND(telemetry, name, kind) \
|
||||
auto _xrpl_span_ = (telemetry).startSpan(name, kind); \
|
||||
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
|
||||
|
||||
// Conditional span based on component
|
||||
#define XRPL_TRACE_TX(telemetry, name) \
|
||||
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
|
||||
if ((telemetry).shouldTraceTransactions()) { \
|
||||
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
|
||||
}
|
||||
|
||||
#define XRPL_TRACE_CONSENSUS(telemetry, name) \
|
||||
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
|
||||
if ((telemetry).shouldTraceConsensus()) { \
|
||||
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
|
||||
}
|
||||
|
||||
#define XRPL_TRACE_RPC(telemetry, name) \
|
||||
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
|
||||
if ((telemetry).shouldTraceRpc()) { \
|
||||
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
|
||||
}
|
||||
|
||||
// Set attribute on current span (if exists)
|
||||
#define XRPL_TRACE_SET_ATTR(key, value) \
|
||||
if (_xrpl_guard_.has_value()) { \
|
||||
_xrpl_guard_->setAttribute(key, value); \
|
||||
}
|
||||
|
||||
// Record exception on current span
|
||||
#define XRPL_TRACE_EXCEPTION(e) \
|
||||
if (_xrpl_guard_.has_value()) { \
|
||||
_xrpl_guard_->recordException(e); \
|
||||
}
|
||||
|
||||
#else // XRPL_ENABLE_TELEMETRY not defined
|
||||
|
||||
#define XRPL_TRACE_SPAN(telemetry, name) ((void)0)
|
||||
#define XRPL_TRACE_SPAN_KIND(telemetry, name, kind) ((void)0)
|
||||
#define XRPL_TRACE_TX(telemetry, name) ((void)0)
|
||||
#define XRPL_TRACE_CONSENSUS(telemetry, name) ((void)0)
|
||||
#define XRPL_TRACE_RPC(telemetry, name) ((void)0)
|
||||
#define XRPL_TRACE_SET_ATTR(key, value) ((void)0)
|
||||
#define XRPL_TRACE_EXCEPTION(e) ((void)0)
|
||||
|
||||
#endif // XRPL_ENABLE_TELEMETRY
|
||||
|
||||
} // namespace telemetry
|
||||
} // namespace xrpl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4.4 Protocol Buffer Extensions
|
||||
|
||||
### 4.4.1 TraceContext Message Definition
|
||||
|
||||
Add to `src/xrpld/overlay/detail/ripple.proto`:
|
||||
|
||||
```protobuf
|
||||
// Trace context for distributed tracing across nodes
|
||||
// Uses W3C Trace Context format internally
|
||||
message TraceContext {
|
||||
// 16-byte trace identifier (required for valid context)
|
||||
bytes trace_id = 1;
|
||||
|
||||
// 8-byte span identifier of parent span
|
||||
bytes span_id = 2;
|
||||
|
||||
// Trace flags (bit 0 = sampled)
|
||||
uint32 trace_flags = 3;
|
||||
|
||||
// W3C tracestate header value for vendor-specific data
|
||||
string trace_state = 4;
|
||||
}
|
||||
|
||||
// Extend existing messages with optional trace context
|
||||
// High field numbers (1000+) to avoid conflicts
|
||||
|
||||
message TMTransaction {
|
||||
// ... existing fields ...
|
||||
|
||||
// Optional trace context for distributed tracing
|
||||
optional TraceContext trace_context = 1001;
|
||||
}
|
||||
|
||||
message TMProposeSet {
|
||||
// ... existing fields ...
|
||||
optional TraceContext trace_context = 1001;
|
||||
}
|
||||
|
||||
message TMValidation {
|
||||
// ... existing fields ...
|
||||
optional TraceContext trace_context = 1001;
|
||||
}
|
||||
|
||||
message TMGetLedger {
|
||||
// ... existing fields ...
|
||||
optional TraceContext trace_context = 1001;
|
||||
}
|
||||
|
||||
message TMLedgerData {
|
||||
// ... existing fields ...
|
||||
optional TraceContext trace_context = 1001;
|
||||
}
|
||||
```
|
||||
|
||||
### 4.4.2 Context Serialization/Deserialization
|
||||
|
||||
```cpp
|
||||
// include/xrpl/telemetry/TraceContext.h
|
||||
#pragma once
|
||||
|
||||
#include <opentelemetry/context/context.h>
|
||||
#include <opentelemetry/trace/span_context.h>
|
||||
#include <protocol/messages.h> // Generated protobuf
|
||||
|
||||
#include <optional>
|
||||
#include <string>
|
||||
|
||||
namespace xrpl {
|
||||
namespace telemetry {
|
||||
|
||||
/**
|
||||
* Utilities for trace context serialization and propagation.
|
||||
*/
|
||||
class TraceContextPropagator
|
||||
{
|
||||
public:
|
||||
/**
|
||||
* Extract trace context from Protocol Buffer message.
|
||||
* Returns empty context if no trace info present.
|
||||
*/
|
||||
static opentelemetry::context::Context
|
||||
extract(protocol::TraceContext const& proto);
|
||||
|
||||
/**
|
||||
* Inject current trace context into Protocol Buffer message.
|
||||
*/
|
||||
static void
|
||||
inject(
|
||||
opentelemetry::context::Context const& ctx,
|
||||
protocol::TraceContext& proto);
|
||||
|
||||
/**
|
||||
* Extract trace context from HTTP headers (for RPC).
|
||||
* Supports W3C Trace Context (traceparent, tracestate).
|
||||
*/
|
||||
static opentelemetry::context::Context
|
||||
extractFromHeaders(
|
||||
std::function<std::optional<std::string>(std::string_view)> headerGetter);
|
||||
|
||||
/**
|
||||
* Inject trace context into HTTP headers (for RPC responses).
|
||||
*/
|
||||
static void
|
||||
injectToHeaders(
|
||||
opentelemetry::context::Context const& ctx,
|
||||
std::function<void(std::string_view, std::string_view)> headerSetter);
|
||||
};
|
||||
|
||||
// ═══════════════════════════════════════════════════════════════════════════
|
||||
// IMPLEMENTATION
|
||||
// ═══════════════════════════════════════════════════════════════════════════
|
||||
|
||||
inline opentelemetry::context::Context
|
||||
TraceContextPropagator::extract(protocol::TraceContext const& proto)
|
||||
{
|
||||
using namespace opentelemetry::trace;
|
||||
|
||||
if (proto.trace_id().size() != 16 || proto.span_id().size() != 8)
|
||||
return opentelemetry::context::Context{}; // Invalid, return empty
|
||||
|
||||
// Construct TraceId and SpanId from bytes
|
||||
TraceId traceId(reinterpret_cast<uint8_t const*>(proto.trace_id().data()));
|
||||
SpanId spanId(reinterpret_cast<uint8_t const*>(proto.span_id().data()));
|
||||
TraceFlags flags(static_cast<uint8_t>(proto.trace_flags()));
|
||||
|
||||
// Create SpanContext from extracted data
|
||||
SpanContext spanContext(traceId, spanId, flags, /* remote = */ true);
|
||||
|
||||
// Create context with extracted span as parent
|
||||
return opentelemetry::context::Context{}.SetValue(
|
||||
opentelemetry::trace::kSpanKey,
|
||||
opentelemetry::nostd::shared_ptr<Span>(
|
||||
new DefaultSpan(spanContext)));
|
||||
}
|
||||
|
||||
inline void
|
||||
TraceContextPropagator::inject(
|
||||
opentelemetry::context::Context const& ctx,
|
||||
protocol::TraceContext& proto)
|
||||
{
|
||||
using namespace opentelemetry::trace;
|
||||
|
||||
// Get current span from context
|
||||
auto span = GetSpan(ctx);
|
||||
if (!span)
|
||||
return;
|
||||
|
||||
auto const& spanCtx = span->GetContext();
|
||||
if (!spanCtx.IsValid())
|
||||
return;
|
||||
|
||||
// Serialize trace_id (16 bytes)
|
||||
auto const& traceId = spanCtx.trace_id();
|
||||
proto.set_trace_id(traceId.Id().data(), TraceId::kSize);
|
||||
|
||||
// Serialize span_id (8 bytes)
|
||||
auto const& spanId = spanCtx.span_id();
|
||||
proto.set_span_id(spanId.Id().data(), SpanId::kSize);
|
||||
|
||||
// Serialize flags
|
||||
proto.set_trace_flags(spanCtx.trace_flags().flags());
|
||||
|
||||
// Note: tracestate not implemented yet
|
||||
}
|
||||
|
||||
} // namespace telemetry
|
||||
} // namespace xrpl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4.5 Module-Specific Instrumentation
|
||||
|
||||
### 4.5.1 Transaction Relay Instrumentation
|
||||
|
||||
```cpp
|
||||
// src/xrpld/overlay/detail/PeerImp.cpp (modified)
|
||||
|
||||
#include <xrpl/telemetry/TracingInstrumentation.h>
|
||||
|
||||
void
|
||||
PeerImp::handleTransaction(
|
||||
std::shared_ptr<protocol::TMTransaction> const& m)
|
||||
{
|
||||
// Extract trace context from incoming message
|
||||
opentelemetry::context::Context parentCtx;
|
||||
if (m->has_trace_context())
|
||||
{
|
||||
parentCtx = telemetry::TraceContextPropagator::extract(
|
||||
m->trace_context());
|
||||
}
|
||||
|
||||
// Start span as child of remote span (cross-node link)
|
||||
auto span = app_.getTelemetry().startSpan(
|
||||
"tx.receive",
|
||||
parentCtx,
|
||||
opentelemetry::trace::SpanKind::kServer);
|
||||
telemetry::SpanGuard guard(span);
|
||||
|
||||
try
|
||||
{
|
||||
// Parse and validate transaction
|
||||
SerialIter sit(makeSlice(m->rawtransaction()));
|
||||
auto stx = std::make_shared<STTx const>(sit);
|
||||
|
||||
// Add transaction attributes
|
||||
guard.setAttribute("xrpl.tx.hash", to_string(stx->getTransactionID()));
|
||||
guard.setAttribute("xrpl.tx.type", stx->getTxnType());
|
||||
guard.setAttribute("xrpl.peer.id", remote_address_.to_string());
|
||||
|
||||
// Check if we've seen this transaction (HashRouter)
|
||||
auto const [flags, suppressed] =
|
||||
app_.getHashRouter().addSuppressionPeer(
|
||||
stx->getTransactionID(),
|
||||
id_);
|
||||
|
||||
if (suppressed)
|
||||
{
|
||||
guard.setAttribute("xrpl.tx.suppressed", true);
|
||||
guard.addEvent("tx.duplicate");
|
||||
return; // Already processing this transaction
|
||||
}
|
||||
|
||||
// Create child span for validation
|
||||
{
|
||||
auto validateSpan = app_.getTelemetry().startSpan("tx.validate");
|
||||
telemetry::SpanGuard validateGuard(validateSpan);
|
||||
|
||||
auto [validity, reason] = checkTransaction(stx);
|
||||
validateGuard.setAttribute("xrpl.tx.validity",
|
||||
validity == Validity::Valid ? "valid" : "invalid");
|
||||
|
||||
if (validity != Validity::Valid)
|
||||
{
|
||||
validateGuard.setStatus(
|
||||
opentelemetry::trace::StatusCode::kError,
|
||||
reason);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
// Relay to other peers (capture context for propagation)
|
||||
auto ctx = guard.context();
|
||||
|
||||
// Create child span for relay
|
||||
auto relaySpan = app_.getTelemetry().startSpan(
|
||||
"tx.relay",
|
||||
ctx,
|
||||
opentelemetry::trace::SpanKind::kClient);
|
||||
{
|
||||
telemetry::SpanGuard relayGuard(relaySpan);
|
||||
|
||||
// Inject context into outgoing message
|
||||
protocol::TraceContext protoCtx;
|
||||
telemetry::TraceContextPropagator::inject(
|
||||
relayGuard.context(), protoCtx);
|
||||
|
||||
// Relay to other peers
|
||||
app_.overlay().relay(
|
||||
stx->getTransactionID(),
|
||||
*m,
|
||||
protoCtx, // Pass trace context
|
||||
exclusions);
|
||||
|
||||
relayGuard.setAttribute("xrpl.tx.relay_count",
|
||||
static_cast<int64_t>(relayCount));
|
||||
}
|
||||
|
||||
guard.setOk();
|
||||
}
|
||||
catch (std::exception const& e)
|
||||
{
|
||||
guard.recordException(e);
|
||||
JLOG(journal_.warn()) << "Transaction handling failed: " << e.what();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 4.5.2 Consensus Instrumentation
|
||||
|
||||
```cpp
|
||||
// src/xrpld/app/consensus/RCLConsensus.cpp (modified)
|
||||
|
||||
#include <xrpl/telemetry/TracingInstrumentation.h>
|
||||
|
||||
void
|
||||
RCLConsensusAdaptor::startRound(
|
||||
NetClock::time_point const& now,
|
||||
RCLCxLedger::ID const& prevLedgerHash,
|
||||
RCLCxLedger const& prevLedger,
|
||||
hash_set<NodeID> const& peers,
|
||||
bool proposing)
|
||||
{
|
||||
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.round");
|
||||
|
||||
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.prev", to_string(prevLedgerHash));
|
||||
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.seq",
|
||||
static_cast<int64_t>(prevLedger.seq() + 1));
|
||||
XRPL_TRACE_SET_ATTR("xrpl.consensus.proposers",
|
||||
static_cast<int64_t>(peers.size()));
|
||||
XRPL_TRACE_SET_ATTR("xrpl.consensus.mode",
|
||||
proposing ? "proposing" : "observing");
|
||||
|
||||
// Store trace context for use in phase transitions
|
||||
currentRoundContext_ = _xrpl_guard_.has_value()
|
||||
? _xrpl_guard_->context()
|
||||
: opentelemetry::context::Context{};
|
||||
|
||||
// ... existing implementation ...
|
||||
}
|
||||
|
||||
ConsensusPhase
|
||||
RCLConsensusAdaptor::phaseTransition(ConsensusPhase newPhase)
|
||||
{
|
||||
// Create span for phase transition
|
||||
auto span = app_.getTelemetry().startSpan(
|
||||
"consensus.phase." + to_string(newPhase),
|
||||
currentRoundContext_);
|
||||
telemetry::SpanGuard guard(span);
|
||||
|
||||
guard.setAttribute("xrpl.consensus.phase", to_string(newPhase));
|
||||
guard.addEvent("phase.enter");
|
||||
|
||||
auto const startTime = std::chrono::steady_clock::now();
|
||||
|
||||
try
|
||||
{
|
||||
auto result = doPhaseTransition(newPhase);
|
||||
|
||||
auto const duration = std::chrono::steady_clock::now() - startTime;
|
||||
guard.setAttribute("xrpl.consensus.phase_duration_ms",
|
||||
std::chrono::duration<double, std::milli>(duration).count());
|
||||
|
||||
guard.setOk();
|
||||
return result;
|
||||
}
|
||||
catch (std::exception const& e)
|
||||
{
|
||||
guard.recordException(e);
|
||||
throw;
|
||||
}
|
||||
}
|
||||
|
||||
void
|
||||
RCLConsensusAdaptor::peerProposal(
|
||||
NetClock::time_point const& now,
|
||||
RCLCxPeerPos const& proposal)
|
||||
{
|
||||
// Extract trace context from proposal message
|
||||
opentelemetry::context::Context parentCtx;
|
||||
if (proposal.hasTraceContext())
|
||||
{
|
||||
parentCtx = telemetry::TraceContextPropagator::extract(
|
||||
proposal.traceContext());
|
||||
}
|
||||
|
||||
auto span = app_.getTelemetry().startSpan(
|
||||
"consensus.proposal.receive",
|
||||
parentCtx,
|
||||
opentelemetry::trace::SpanKind::kServer);
|
||||
telemetry::SpanGuard guard(span);
|
||||
|
||||
guard.setAttribute("xrpl.consensus.proposer",
|
||||
toBase58(TokenType::NodePublic, proposal.nodeId()));
|
||||
guard.setAttribute("xrpl.consensus.round",
|
||||
static_cast<int64_t>(proposal.proposal().proposeSeq()));
|
||||
|
||||
// ... existing implementation ...
|
||||
|
||||
guard.setOk();
|
||||
}
|
||||
```
|
||||
|
||||
### 4.5.3 RPC Handler Instrumentation
|
||||
|
||||
```cpp
|
||||
// src/xrpld/rpc/detail/ServerHandler.cpp (modified)
|
||||
|
||||
#include <xrpl/telemetry/TracingInstrumentation.h>
|
||||
|
||||
void
|
||||
ServerHandler::onRequest(
|
||||
http_request_type&& req,
|
||||
std::function<void(http_response_type&&)>&& send)
|
||||
{
|
||||
// Extract trace context from HTTP headers (W3C Trace Context)
|
||||
auto parentCtx = telemetry::TraceContextPropagator::extractFromHeaders(
|
||||
[&req](std::string_view name) -> std::optional<std::string> {
|
||||
auto it = req.find(boost::beast::http::field{
|
||||
std::string(name)});
|
||||
if (it != req.end())
|
||||
return std::string(it->value());
|
||||
return std::nullopt;
|
||||
});
|
||||
|
||||
// Start request span
|
||||
auto span = app_.getTelemetry().startSpan(
|
||||
"rpc.request",
|
||||
parentCtx,
|
||||
opentelemetry::trace::SpanKind::kServer);
|
||||
telemetry::SpanGuard guard(span);
|
||||
|
||||
// Add HTTP attributes
|
||||
guard.setAttribute("http.method", std::string(req.method_string()));
|
||||
guard.setAttribute("http.target", std::string(req.target()));
|
||||
guard.setAttribute("http.user_agent",
|
||||
std::string(req[boost::beast::http::field::user_agent]));
|
||||
|
||||
auto const startTime = std::chrono::steady_clock::now();
|
||||
|
||||
try
|
||||
{
|
||||
// Parse and process request
|
||||
auto const& body = req.body();
|
||||
Json::Value jv;
|
||||
Json::Reader reader;
|
||||
|
||||
if (!reader.parse(body, jv))
|
||||
{
|
||||
guard.setStatus(
|
||||
opentelemetry::trace::StatusCode::kError,
|
||||
"Invalid JSON");
|
||||
sendError(send, "Invalid JSON");
|
||||
return;
|
||||
}
|
||||
|
||||
// Extract command name
|
||||
std::string command = jv.isMember("command")
|
||||
? jv["command"].asString()
|
||||
: jv.isMember("method")
|
||||
? jv["method"].asString()
|
||||
: "unknown";
|
||||
|
||||
guard.setAttribute("xrpl.rpc.command", command);
|
||||
|
||||
// Create child span for command execution
|
||||
auto cmdSpan = app_.getTelemetry().startSpan(
|
||||
"rpc.command." + command);
|
||||
{
|
||||
telemetry::SpanGuard cmdGuard(cmdSpan);
|
||||
|
||||
// Execute RPC command
|
||||
auto result = processRequest(jv);
|
||||
|
||||
// Record result attributes
|
||||
if (result.isMember("status"))
|
||||
{
|
||||
cmdGuard.setAttribute("xrpl.rpc.status",
|
||||
result["status"].asString());
|
||||
}
|
||||
|
||||
if (result["status"].asString() == "error")
|
||||
{
|
||||
cmdGuard.setStatus(
|
||||
opentelemetry::trace::StatusCode::kError,
|
||||
result.isMember("error_message")
|
||||
? result["error_message"].asString()
|
||||
: "RPC error");
|
||||
}
|
||||
else
|
||||
{
|
||||
cmdGuard.setOk();
|
||||
}
|
||||
}
|
||||
|
||||
auto const duration = std::chrono::steady_clock::now() - startTime;
|
||||
guard.setAttribute("http.duration_ms",
|
||||
std::chrono::duration<double, std::milli>(duration).count());
|
||||
|
||||
// Inject trace context into response headers
|
||||
http_response_type resp;
|
||||
telemetry::TraceContextPropagator::injectToHeaders(
|
||||
guard.context(),
|
||||
[&resp](std::string_view name, std::string_view value) {
|
||||
resp.set(std::string(name), std::string(value));
|
||||
});
|
||||
|
||||
guard.setOk();
|
||||
send(std::move(resp));
|
||||
}
|
||||
catch (std::exception const& e)
|
||||
{
|
||||
guard.recordException(e);
|
||||
JLOG(journal_.error()) << "RPC request failed: " << e.what();
|
||||
sendError(send, e.what());
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 4.5.4 JobQueue Context Propagation
|
||||
|
||||
```cpp
|
||||
// src/xrpld/core/JobQueue.h (modified)
|
||||
|
||||
#include <opentelemetry/context/context.h>
|
||||
|
||||
class Job
|
||||
{
|
||||
// ... existing members ...
|
||||
|
||||
// Captured trace context at job creation
|
||||
opentelemetry::context::Context traceContext_;
|
||||
|
||||
public:
|
||||
// Constructor captures current trace context
|
||||
Job(JobType type, std::function<void()> func, ...)
|
||||
: type_(type)
|
||||
, func_(std::move(func))
|
||||
, traceContext_(opentelemetry::context::RuntimeContext::GetCurrent())
|
||||
// ... other initializations ...
|
||||
{
|
||||
}
|
||||
|
||||
// Get trace context for restoration during execution
|
||||
opentelemetry::context::Context const&
|
||||
traceContext() const { return traceContext_; }
|
||||
};
|
||||
|
||||
// src/xrpld/core/JobQueue.cpp (modified)
|
||||
|
||||
void
|
||||
Worker::run()
|
||||
{
|
||||
while (auto job = getJob())
|
||||
{
|
||||
// Restore trace context from job creation
|
||||
auto token = opentelemetry::context::RuntimeContext::Attach(
|
||||
job->traceContext());
|
||||
|
||||
// Start execution span
|
||||
auto span = app_.getTelemetry().startSpan("job.execute");
|
||||
telemetry::SpanGuard guard(span);
|
||||
|
||||
guard.setAttribute("xrpl.job.type", to_string(job->type()));
|
||||
guard.setAttribute("xrpl.job.queue_ms", job->queueTimeMs());
|
||||
guard.setAttribute("xrpl.job.worker", workerId_);
|
||||
|
||||
try
|
||||
{
|
||||
job->execute();
|
||||
guard.setOk();
|
||||
}
|
||||
catch (std::exception const& e)
|
||||
{
|
||||
guard.recordException(e);
|
||||
JLOG(journal_.error()) << "Job execution failed: " << e.what();
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4.6 Span Flow Visualization
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph Client["External Client"]
|
||||
submit["Submit TX"]
|
||||
end
|
||||
|
||||
subgraph NodeA["rippled Node A"]
|
||||
rpcA["rpc.request"]
|
||||
cmdA["rpc.command.submit"]
|
||||
txRecvA["tx.receive"]
|
||||
txValA["tx.validate"]
|
||||
txRelayA["tx.relay"]
|
||||
end
|
||||
|
||||
subgraph NodeB["rippled Node B"]
|
||||
txRecvB["tx.receive"]
|
||||
txValB["tx.validate"]
|
||||
txRelayB["tx.relay"]
|
||||
end
|
||||
|
||||
subgraph NodeC["rippled Node C"]
|
||||
txRecvC["tx.receive"]
|
||||
consensusC["consensus.round"]
|
||||
phaseC["consensus.phase.establish"]
|
||||
end
|
||||
|
||||
submit --> rpcA
|
||||
rpcA --> cmdA
|
||||
cmdA --> txRecvA
|
||||
txRecvA --> txValA
|
||||
txValA --> txRelayA
|
||||
txRelayA -.->|"TraceContext"| txRecvB
|
||||
txRecvB --> txValB
|
||||
txValB --> txRelayB
|
||||
txRelayB -.->|"TraceContext"| txRecvC
|
||||
txRecvC --> consensusC
|
||||
consensusC --> phaseC
|
||||
|
||||
style Client fill:#334155,stroke:#1e293b,color:#fff
|
||||
style NodeA fill:#1e3a8a,stroke:#172554,color:#fff
|
||||
style NodeB fill:#064e3b,stroke:#022c22,color:#fff
|
||||
style NodeC fill:#78350f,stroke:#451a03,color:#fff
|
||||
style submit fill:#e2e8f0,stroke:#cbd5e1,color:#1e293b
|
||||
style rpcA fill:#1d4ed8,stroke:#1e40af,color:#fff
|
||||
style cmdA fill:#1d4ed8,stroke:#1e40af,color:#fff
|
||||
style txRecvA fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txValA fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txRelayA fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txRecvB fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txValB fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txRelayB fill:#047857,stroke:#064e3b,color:#fff
|
||||
style txRecvC fill:#047857,stroke:#064e3b,color:#fff
|
||||
style consensusC fill:#fef3c7,stroke:#fde68a,color:#1e293b
|
||||
style phaseC fill:#fef3c7,stroke:#fde68a,color:#1e293b
|
||||
```
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Implementation Strategy](./03-implementation-strategy.md)_ | _Next: [Configuration Reference](./05-configuration-reference.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,936 +0,0 @@
|
||||
# Configuration Reference
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Code Samples](./04-code-samples.md) | [Implementation Phases](./06-implementation-phases.md)
|
||||
|
||||
---
|
||||
|
||||
## 5.1 rippled Configuration
|
||||
|
||||
### 5.1.1 Configuration File Section
|
||||
|
||||
Add to `cfg/xrpld-example.cfg`:
|
||||
|
||||
```ini
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# TELEMETRY (OpenTelemetry Distributed Tracing)
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
#
|
||||
# Enables distributed tracing for transaction flow, consensus, and RPC calls.
|
||||
# Traces are exported to an OpenTelemetry Collector using OTLP protocol.
|
||||
#
|
||||
# [telemetry]
|
||||
#
|
||||
# # Enable/disable telemetry (default: 0 = disabled)
|
||||
# enabled=1
|
||||
#
|
||||
# # Exporter type: "otlp_grpc" (default), "otlp_http", or "none"
|
||||
# exporter=otlp_grpc
|
||||
#
|
||||
# # OTLP endpoint (default: localhost:4317 for gRPC, localhost:4318 for HTTP)
|
||||
# endpoint=localhost:4317
|
||||
#
|
||||
# # Use TLS for exporter connection (default: 0)
|
||||
# use_tls=0
|
||||
#
|
||||
# # Path to CA certificate for TLS (optional)
|
||||
# # tls_ca_cert=/path/to/ca.crt
|
||||
#
|
||||
# # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling)
|
||||
# # Use lower values in production to reduce overhead
|
||||
# sampling_ratio=0.1
|
||||
#
|
||||
# # Batch processor settings
|
||||
# batch_size=512 # Spans per batch (default: 512)
|
||||
# batch_delay_ms=5000 # Max delay before sending batch (default: 5000)
|
||||
# max_queue_size=2048 # Max queued spans (default: 2048)
|
||||
#
|
||||
# # Component-specific tracing (default: all enabled except peer)
|
||||
# trace_transactions=1 # Transaction relay and processing
|
||||
# trace_consensus=1 # Consensus rounds and proposals
|
||||
# trace_rpc=1 # RPC request handling
|
||||
# trace_peer=0 # Peer messages (high volume, disabled by default)
|
||||
# trace_ledger=1 # Ledger acquisition and building
|
||||
#
|
||||
# # Service identification (automatically detected if not specified)
|
||||
# # service_name=rippled
|
||||
# # service_instance_id=<node_public_key>
|
||||
|
||||
[telemetry]
|
||||
enabled=0
|
||||
```
|
||||
|
||||
### 5.1.2 Configuration Options Summary
|
||||
|
||||
| Option | Type | Default | Description |
|
||||
| --------------------- | ------ | ---------------- | ----------------------------------------- |
|
||||
| `enabled` | bool | `false` | Enable/disable telemetry |
|
||||
| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
|
||||
| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
|
||||
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
|
||||
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
|
||||
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
|
||||
| `batch_size` | uint | `512` | Spans per export batch |
|
||||
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
|
||||
| `max_queue_size` | uint | `2048` | Maximum queued spans |
|
||||
| `trace_transactions` | bool | `true` | Enable transaction tracing |
|
||||
| `trace_consensus` | bool | `true` | Enable consensus tracing |
|
||||
| `trace_rpc` | bool | `true` | Enable RPC tracing |
|
||||
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
|
||||
| `trace_ledger` | bool | `true` | Enable ledger tracing |
|
||||
| `service_name` | string | `"rippled"` | Service name for traces |
|
||||
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
|
||||
|
||||
---
|
||||
|
||||
## 5.2 Configuration Parser
|
||||
|
||||
```cpp
|
||||
// src/libxrpl/telemetry/TelemetryConfig.cpp
|
||||
|
||||
#include <xrpl/telemetry/Telemetry.h>
|
||||
#include <xrpl/basics/Log.h>
|
||||
|
||||
namespace xrpl {
|
||||
namespace telemetry {
|
||||
|
||||
Telemetry::Setup
|
||||
setup_Telemetry(
|
||||
Section const& section,
|
||||
std::string const& nodePublicKey,
|
||||
std::string const& version)
|
||||
{
|
||||
Telemetry::Setup setup;
|
||||
|
||||
// Basic settings
|
||||
setup.enabled = section.value_or("enabled", false);
|
||||
setup.serviceName = section.value_or("service_name", "rippled");
|
||||
setup.serviceVersion = version;
|
||||
setup.serviceInstanceId = section.value_or(
|
||||
"service_instance_id", nodePublicKey);
|
||||
|
||||
// Exporter settings
|
||||
setup.exporterType = section.value_or("exporter", "otlp_grpc");
|
||||
|
||||
if (setup.exporterType == "otlp_grpc")
|
||||
setup.exporterEndpoint = section.value_or("endpoint", "localhost:4317");
|
||||
else if (setup.exporterType == "otlp_http")
|
||||
setup.exporterEndpoint = section.value_or("endpoint", "localhost:4318");
|
||||
|
||||
setup.useTls = section.value_or("use_tls", false);
|
||||
setup.tlsCertPath = section.value_or("tls_ca_cert", "");
|
||||
|
||||
// Sampling
|
||||
setup.samplingRatio = section.value_or("sampling_ratio", 1.0);
|
||||
if (setup.samplingRatio < 0.0 || setup.samplingRatio > 1.0)
|
||||
{
|
||||
Throw<std::runtime_error>(
|
||||
"telemetry.sampling_ratio must be between 0.0 and 1.0");
|
||||
}
|
||||
|
||||
// Batch processor
|
||||
setup.batchSize = section.value_or("batch_size", 512u);
|
||||
setup.batchDelay = std::chrono::milliseconds{
|
||||
section.value_or("batch_delay_ms", 5000u)};
|
||||
setup.maxQueueSize = section.value_or("max_queue_size", 2048u);
|
||||
|
||||
// Component filtering
|
||||
setup.traceTransactions = section.value_or("trace_transactions", true);
|
||||
setup.traceConsensus = section.value_or("trace_consensus", true);
|
||||
setup.traceRpc = section.value_or("trace_rpc", true);
|
||||
setup.tracePeer = section.value_or("trace_peer", false);
|
||||
setup.traceLedger = section.value_or("trace_ledger", true);
|
||||
|
||||
return setup;
|
||||
}
|
||||
|
||||
} // namespace telemetry
|
||||
} // namespace xrpl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.3 Application Integration
|
||||
|
||||
### 5.3.1 ApplicationImp Changes
|
||||
|
||||
```cpp
|
||||
// src/xrpld/app/main/Application.cpp (modified)
|
||||
|
||||
#include <xrpl/telemetry/Telemetry.h>
|
||||
|
||||
class ApplicationImp : public Application
|
||||
{
|
||||
// ... existing members ...
|
||||
|
||||
// Telemetry (must be constructed early, destroyed late)
|
||||
std::unique_ptr<telemetry::Telemetry> telemetry_;
|
||||
|
||||
public:
|
||||
ApplicationImp(...)
|
||||
{
|
||||
// Initialize telemetry early (before other components)
|
||||
auto telemetrySection = config_->section("telemetry");
|
||||
auto telemetrySetup = telemetry::setup_Telemetry(
|
||||
telemetrySection,
|
||||
toBase58(TokenType::NodePublic, nodeIdentity_.publicKey()),
|
||||
BuildInfo::getVersionString());
|
||||
|
||||
// Set network attributes
|
||||
telemetrySetup.networkId = config_->NETWORK_ID;
|
||||
telemetrySetup.networkType = [&]() {
|
||||
if (config_->NETWORK_ID == 0) return "mainnet";
|
||||
if (config_->NETWORK_ID == 1) return "testnet";
|
||||
if (config_->NETWORK_ID == 2) return "devnet";
|
||||
return "custom";
|
||||
}();
|
||||
|
||||
telemetry_ = telemetry::make_Telemetry(
|
||||
telemetrySetup,
|
||||
logs_->journal("Telemetry"));
|
||||
|
||||
// ... rest of initialization ...
|
||||
}
|
||||
|
||||
void start() override
|
||||
{
|
||||
// Start telemetry first
|
||||
if (telemetry_)
|
||||
telemetry_->start();
|
||||
|
||||
// ... existing start code ...
|
||||
}
|
||||
|
||||
void stop() override
|
||||
{
|
||||
// ... existing stop code ...
|
||||
|
||||
// Stop telemetry last (to capture shutdown spans)
|
||||
if (telemetry_)
|
||||
telemetry_->stop();
|
||||
}
|
||||
|
||||
telemetry::Telemetry& getTelemetry() override
|
||||
{
|
||||
assert(telemetry_);
|
||||
return *telemetry_;
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
### 5.3.2 Application Interface Addition
|
||||
|
||||
```cpp
|
||||
// include/xrpl/app/main/Application.h (modified)
|
||||
|
||||
namespace telemetry { class Telemetry; }
|
||||
|
||||
class Application
|
||||
{
|
||||
public:
|
||||
// ... existing virtual methods ...
|
||||
|
||||
/** Get the telemetry system for distributed tracing */
|
||||
virtual telemetry::Telemetry& getTelemetry() = 0;
|
||||
};
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.4 CMake Integration
|
||||
|
||||
### 5.4.1 Find OpenTelemetry Module
|
||||
|
||||
```cmake
|
||||
# cmake/FindOpenTelemetry.cmake
|
||||
|
||||
# Find OpenTelemetry C++ SDK
|
||||
#
|
||||
# This module defines:
|
||||
# OpenTelemetry_FOUND - System has OpenTelemetry
|
||||
# OpenTelemetry::api - API library target
|
||||
# OpenTelemetry::sdk - SDK library target
|
||||
# OpenTelemetry::otlp_grpc_exporter - OTLP gRPC exporter target
|
||||
# OpenTelemetry::otlp_http_exporter - OTLP HTTP exporter target
|
||||
|
||||
find_package(opentelemetry-cpp CONFIG QUIET)
|
||||
|
||||
if(opentelemetry-cpp_FOUND)
|
||||
set(OpenTelemetry_FOUND TRUE)
|
||||
|
||||
# Create imported targets if not already created by config
|
||||
if(NOT TARGET OpenTelemetry::api)
|
||||
add_library(OpenTelemetry::api ALIAS opentelemetry-cpp::api)
|
||||
endif()
|
||||
if(NOT TARGET OpenTelemetry::sdk)
|
||||
add_library(OpenTelemetry::sdk ALIAS opentelemetry-cpp::sdk)
|
||||
endif()
|
||||
if(NOT TARGET OpenTelemetry::otlp_grpc_exporter)
|
||||
add_library(OpenTelemetry::otlp_grpc_exporter ALIAS
|
||||
opentelemetry-cpp::otlp_grpc_exporter)
|
||||
endif()
|
||||
else()
|
||||
# Try pkg-config fallback
|
||||
find_package(PkgConfig QUIET)
|
||||
if(PKG_CONFIG_FOUND)
|
||||
pkg_check_modules(OTEL opentelemetry-cpp QUIET)
|
||||
if(OTEL_FOUND)
|
||||
set(OpenTelemetry_FOUND TRUE)
|
||||
# Create imported targets from pkg-config
|
||||
add_library(OpenTelemetry::api INTERFACE IMPORTED)
|
||||
target_include_directories(OpenTelemetry::api INTERFACE
|
||||
${OTEL_INCLUDE_DIRS})
|
||||
endif()
|
||||
endif()
|
||||
endif()
|
||||
|
||||
include(FindPackageHandleStandardArgs)
|
||||
find_package_handle_standard_args(OpenTelemetry
|
||||
REQUIRED_VARS OpenTelemetry_FOUND)
|
||||
```
|
||||
|
||||
### 5.4.2 CMakeLists.txt Changes
|
||||
|
||||
```cmake
|
||||
# CMakeLists.txt (additions)
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# TELEMETRY OPTIONS
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
option(XRPL_ENABLE_TELEMETRY
|
||||
"Enable OpenTelemetry distributed tracing support" OFF)
|
||||
|
||||
if(XRPL_ENABLE_TELEMETRY)
|
||||
find_package(OpenTelemetry REQUIRED)
|
||||
|
||||
# Define compile-time flag
|
||||
add_compile_definitions(XRPL_ENABLE_TELEMETRY)
|
||||
|
||||
message(STATUS "OpenTelemetry tracing: ENABLED")
|
||||
else()
|
||||
message(STATUS "OpenTelemetry tracing: DISABLED")
|
||||
endif()
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# TELEMETRY LIBRARY
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
if(XRPL_ENABLE_TELEMETRY)
|
||||
add_library(xrpl_telemetry
|
||||
src/libxrpl/telemetry/Telemetry.cpp
|
||||
src/libxrpl/telemetry/TelemetryConfig.cpp
|
||||
src/libxrpl/telemetry/TraceContext.cpp
|
||||
)
|
||||
|
||||
target_include_directories(xrpl_telemetry
|
||||
PUBLIC
|
||||
${CMAKE_CURRENT_SOURCE_DIR}/include
|
||||
)
|
||||
|
||||
target_link_libraries(xrpl_telemetry
|
||||
PUBLIC
|
||||
OpenTelemetry::api
|
||||
OpenTelemetry::sdk
|
||||
OpenTelemetry::otlp_grpc_exporter
|
||||
PRIVATE
|
||||
xrpl_basics
|
||||
)
|
||||
|
||||
# Add to main library dependencies
|
||||
target_link_libraries(xrpld PRIVATE xrpl_telemetry)
|
||||
else()
|
||||
# Create null implementation library
|
||||
add_library(xrpl_telemetry
|
||||
src/libxrpl/telemetry/NullTelemetry.cpp
|
||||
)
|
||||
target_include_directories(xrpl_telemetry
|
||||
PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include
|
||||
)
|
||||
endif()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.5 OpenTelemetry Collector Configuration
|
||||
|
||||
### 5.5.1 Development Configuration
|
||||
|
||||
```yaml
|
||||
# otel-collector-dev.yaml
|
||||
# Minimal configuration for local development
|
||||
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
|
||||
processors:
|
||||
batch:
|
||||
timeout: 1s
|
||||
send_batch_size: 100
|
||||
|
||||
exporters:
|
||||
# Console output for debugging
|
||||
logging:
|
||||
verbosity: detailed
|
||||
sampling_initial: 5
|
||||
sampling_thereafter: 200
|
||||
|
||||
# Jaeger for trace visualization
|
||||
jaeger:
|
||||
endpoint: jaeger:14250
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
service:
|
||||
pipelines:
|
||||
traces:
|
||||
receivers: [otlp]
|
||||
processors: [batch]
|
||||
exporters: [logging, jaeger]
|
||||
```
|
||||
|
||||
### 5.5.2 Production Configuration
|
||||
|
||||
```yaml
|
||||
# otel-collector-prod.yaml
|
||||
# Production configuration with filtering, sampling, and multiple backends
|
||||
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
tls:
|
||||
cert_file: /etc/otel/server.crt
|
||||
key_file: /etc/otel/server.key
|
||||
ca_file: /etc/otel/ca.crt
|
||||
|
||||
processors:
|
||||
# Memory limiter to prevent OOM
|
||||
memory_limiter:
|
||||
check_interval: 1s
|
||||
limit_mib: 1000
|
||||
spike_limit_mib: 200
|
||||
|
||||
# Batch processing for efficiency
|
||||
batch:
|
||||
timeout: 5s
|
||||
send_batch_size: 512
|
||||
send_batch_max_size: 1024
|
||||
|
||||
# Tail-based sampling (keep errors and slow traces)
|
||||
tail_sampling:
|
||||
decision_wait: 10s
|
||||
num_traces: 100000
|
||||
expected_new_traces_per_sec: 1000
|
||||
policies:
|
||||
# Always keep error traces
|
||||
- name: errors
|
||||
type: status_code
|
||||
status_code:
|
||||
status_codes: [ERROR]
|
||||
# Keep slow consensus rounds (>5s)
|
||||
- name: slow-consensus
|
||||
type: latency
|
||||
latency:
|
||||
threshold_ms: 5000
|
||||
# Keep slow RPC requests (>1s)
|
||||
- name: slow-rpc
|
||||
type: and
|
||||
and:
|
||||
and_sub_policy:
|
||||
- name: rpc-spans
|
||||
type: string_attribute
|
||||
string_attribute:
|
||||
key: xrpl.rpc.command
|
||||
values: [".*"]
|
||||
enabled_regex_matching: true
|
||||
- name: latency
|
||||
type: latency
|
||||
latency:
|
||||
threshold_ms: 1000
|
||||
# Probabilistic sampling for the rest
|
||||
- name: probabilistic
|
||||
type: probabilistic
|
||||
probabilistic:
|
||||
sampling_percentage: 10
|
||||
|
||||
# Attribute processing
|
||||
attributes:
|
||||
actions:
|
||||
# Hash sensitive data
|
||||
- key: xrpl.tx.account
|
||||
action: hash
|
||||
# Add deployment info
|
||||
- key: deployment.environment
|
||||
value: production
|
||||
action: upsert
|
||||
|
||||
exporters:
|
||||
# Grafana Tempo for long-term storage
|
||||
otlp/tempo:
|
||||
endpoint: tempo.monitoring:4317
|
||||
tls:
|
||||
insecure: false
|
||||
ca_file: /etc/otel/tempo-ca.crt
|
||||
|
||||
# Elastic APM for correlation with logs
|
||||
otlp/elastic:
|
||||
endpoint: apm.elastic:8200
|
||||
headers:
|
||||
Authorization: "Bearer ${ELASTIC_APM_TOKEN}"
|
||||
|
||||
extensions:
|
||||
health_check:
|
||||
endpoint: 0.0.0.0:13133
|
||||
zpages:
|
||||
endpoint: 0.0.0.0:55679
|
||||
|
||||
service:
|
||||
extensions: [health_check, zpages]
|
||||
pipelines:
|
||||
traces:
|
||||
receivers: [otlp]
|
||||
processors: [memory_limiter, tail_sampling, attributes, batch]
|
||||
exporters: [otlp/tempo, otlp/elastic]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.6 Docker Compose Development Environment
|
||||
|
||||
```yaml
|
||||
# docker-compose-telemetry.yaml
|
||||
version: "3.8"
|
||||
|
||||
services:
|
||||
# OpenTelemetry Collector
|
||||
otel-collector:
|
||||
image: otel/opentelemetry-collector-contrib:0.92.0
|
||||
container_name: otel-collector
|
||||
command: ["--config=/etc/otel-collector-config.yaml"]
|
||||
volumes:
|
||||
- ./otel-collector-dev.yaml:/etc/otel-collector-config.yaml:ro
|
||||
ports:
|
||||
- "4317:4317" # OTLP gRPC
|
||||
- "4318:4318" # OTLP HTTP
|
||||
- "13133:13133" # Health check
|
||||
depends_on:
|
||||
- jaeger
|
||||
|
||||
# Jaeger for trace visualization
|
||||
jaeger:
|
||||
image: jaegertracing/all-in-one:1.53
|
||||
container_name: jaeger
|
||||
environment:
|
||||
- COLLECTOR_OTLP_ENABLED=true
|
||||
ports:
|
||||
- "16686:16686" # UI
|
||||
- "14250:14250" # gRPC
|
||||
|
||||
# Grafana for dashboards
|
||||
grafana:
|
||||
image: grafana/grafana:10.2.3
|
||||
container_name: grafana
|
||||
environment:
|
||||
- GF_AUTH_ANONYMOUS_ENABLED=true
|
||||
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
|
||||
volumes:
|
||||
- ./grafana/provisioning:/etc/grafana/provisioning:ro
|
||||
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
|
||||
ports:
|
||||
- "3000:3000"
|
||||
depends_on:
|
||||
- jaeger
|
||||
|
||||
# Prometheus for metrics (optional, for correlation)
|
||||
prometheus:
|
||||
image: prom/prometheus:v2.48.1
|
||||
container_name: prometheus
|
||||
volumes:
|
||||
- ./prometheus.yaml:/etc/prometheus/prometheus.yml:ro
|
||||
ports:
|
||||
- "9090:9090"
|
||||
|
||||
networks:
|
||||
default:
|
||||
name: rippled-telemetry
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.7 Configuration Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph config["Configuration Sources"]
|
||||
cfgFile["xrpld.cfg<br/>[telemetry] section"]
|
||||
cmake["CMake<br/>XRPL_ENABLE_TELEMETRY"]
|
||||
end
|
||||
|
||||
subgraph init["Initialization"]
|
||||
parse["setup_Telemetry()"]
|
||||
factory["make_Telemetry()"]
|
||||
end
|
||||
|
||||
subgraph runtime["Runtime Components"]
|
||||
tracer["TracerProvider"]
|
||||
exporter["OTLP Exporter"]
|
||||
processor["BatchProcessor"]
|
||||
end
|
||||
|
||||
subgraph collector["Collector Pipeline"]
|
||||
recv["Receivers"]
|
||||
proc["Processors"]
|
||||
exp["Exporters"]
|
||||
end
|
||||
|
||||
cfgFile --> parse
|
||||
cmake -->|"compile flag"| parse
|
||||
parse --> factory
|
||||
factory --> tracer
|
||||
tracer --> processor
|
||||
processor --> exporter
|
||||
exporter -->|"OTLP"| recv
|
||||
recv --> proc
|
||||
proc --> exp
|
||||
|
||||
style config fill:#e3f2fd,stroke:#1976d2
|
||||
style runtime fill:#e8f5e9,stroke:#388e3c
|
||||
style collector fill:#fff3e0,stroke:#ff9800
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5.8 Grafana Integration
|
||||
|
||||
Step-by-step instructions for integrating rippled traces with Grafana.
|
||||
|
||||
### 5.8.1 Data Source Configuration
|
||||
|
||||
#### Tempo (Recommended)
|
||||
|
||||
```yaml
|
||||
# grafana/provisioning/datasources/tempo.yaml
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: Tempo
|
||||
type: tempo
|
||||
access: proxy
|
||||
url: http://tempo:3200
|
||||
jsonData:
|
||||
httpMethod: GET
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
tags: ["service.name", "xrpl.tx.hash"]
|
||||
mappedTags: [{ key: "trace_id", value: "traceID" }]
|
||||
mapTagNamesEnabled: true
|
||||
filterByTraceID: true
|
||||
serviceMap:
|
||||
datasourceUid: prometheus
|
||||
nodeGraph:
|
||||
enabled: true
|
||||
search:
|
||||
hide: false
|
||||
lokiSearch:
|
||||
datasourceUid: loki
|
||||
```
|
||||
|
||||
#### Jaeger
|
||||
|
||||
```yaml
|
||||
# grafana/provisioning/datasources/jaeger.yaml
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: Jaeger
|
||||
type: jaeger
|
||||
access: proxy
|
||||
url: http://jaeger:16686
|
||||
jsonData:
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
tags: ["service.name"]
|
||||
```
|
||||
|
||||
#### Elastic APM
|
||||
|
||||
```yaml
|
||||
# grafana/provisioning/datasources/elastic-apm.yaml
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: Elasticsearch-APM
|
||||
type: elasticsearch
|
||||
access: proxy
|
||||
url: http://elasticsearch:9200
|
||||
database: "apm-*"
|
||||
jsonData:
|
||||
esVersion: "8.0.0"
|
||||
timeField: "@timestamp"
|
||||
logMessageField: message
|
||||
logLevelField: log.level
|
||||
```
|
||||
|
||||
### 5.8.2 Dashboard Provisioning
|
||||
|
||||
```yaml
|
||||
# grafana/provisioning/dashboards/dashboards.yaml
|
||||
apiVersion: 1
|
||||
|
||||
providers:
|
||||
- name: "rippled-dashboards"
|
||||
orgId: 1
|
||||
folder: "rippled"
|
||||
folderUid: "rippled"
|
||||
type: file
|
||||
disableDeletion: false
|
||||
updateIntervalSeconds: 30
|
||||
options:
|
||||
path: /var/lib/grafana/dashboards/rippled
|
||||
```
|
||||
|
||||
### 5.8.3 Example Dashboard: RPC Performance
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "rippled RPC Performance",
|
||||
"uid": "rippled-rpc-performance",
|
||||
"panels": [
|
||||
{
|
||||
"title": "RPC Latency by Command",
|
||||
"type": "heatmap",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && span.xrpl.rpc.command != \"\"} | histogram_over_time(duration) by (span.xrpl.rpc.command)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "RPC Error Rate",
|
||||
"type": "timeseries",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && status.code=error} | rate() by (span.xrpl.rpc.command)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Top 10 Slowest RPC Commands",
|
||||
"type": "table",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && span.xrpl.rpc.command != \"\"} | avg(duration) by (span.xrpl.rpc.command) | topk(10)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 }
|
||||
},
|
||||
{
|
||||
"title": "Recent Traces",
|
||||
"type": "table",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\"}"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 }
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### 5.8.4 Example Dashboard: Transaction Tracing
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "rippled Transaction Tracing",
|
||||
"uid": "rippled-tx-tracing",
|
||||
"panels": [
|
||||
{
|
||||
"title": "Transaction Throughput",
|
||||
"type": "stat",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | rate()"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Cross-Node Relay Count",
|
||||
"type": "timeseries",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"tx.relay\"} | avg(span.xrpl.tx.relay_count)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 }
|
||||
},
|
||||
{
|
||||
"title": "Transaction Validation Errors",
|
||||
"type": "table",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"tx.validate\" && status.code=error}"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 }
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### 5.8.5 TraceQL Query Examples
|
||||
|
||||
Common queries for rippled traces:
|
||||
|
||||
```
|
||||
# Find all traces for a specific transaction hash
|
||||
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}
|
||||
|
||||
# Find slow RPC commands (>100ms)
|
||||
{resource.service.name="rippled" && name=~"rpc.command.*"} | duration > 100ms
|
||||
|
||||
# Find consensus rounds taking >5 seconds
|
||||
{resource.service.name="rippled" && name="consensus.round"} | duration > 5s
|
||||
|
||||
# Find failed transactions with error details
|
||||
{resource.service.name="rippled" && name="tx.validate" && status.code=error}
|
||||
|
||||
# Find transactions relayed to many peers
|
||||
{resource.service.name="rippled" && name="tx.relay"} | span.xrpl.tx.relay_count > 10
|
||||
|
||||
# Compare latency across nodes
|
||||
{resource.service.name="rippled" && name="rpc.command.account_info"} | avg(duration) by (resource.service.instance.id)
|
||||
```
|
||||
|
||||
### 5.8.6 Correlation with PerfLog
|
||||
|
||||
To correlate OpenTelemetry traces with existing PerfLog data:
|
||||
|
||||
**Step 1: Configure Loki to ingest PerfLog**
|
||||
|
||||
```yaml
|
||||
# promtail-config.yaml
|
||||
scrape_configs:
|
||||
- job_name: rippled-perflog
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: rippled
|
||||
__path__: /var/log/rippled/perf*.log
|
||||
pipeline_stages:
|
||||
- json:
|
||||
expressions:
|
||||
trace_id: trace_id
|
||||
ledger_seq: ledger_seq
|
||||
tx_hash: tx_hash
|
||||
- labels:
|
||||
trace_id:
|
||||
ledger_seq:
|
||||
tx_hash:
|
||||
```
|
||||
|
||||
**Step 2: Add trace_id to PerfLog entries**
|
||||
|
||||
Modify PerfLog to include trace_id when available:
|
||||
|
||||
```cpp
|
||||
// In PerfLog output, add trace_id from current span context
|
||||
void logPerf(Json::Value& entry) {
|
||||
auto span = opentelemetry::trace::GetSpan(
|
||||
opentelemetry::context::RuntimeContext::GetCurrent());
|
||||
if (span && span->GetContext().IsValid()) {
|
||||
char traceIdHex[33];
|
||||
span->GetContext().trace_id().ToLowerBase16(traceIdHex);
|
||||
entry["trace_id"] = std::string(traceIdHex, 32);
|
||||
}
|
||||
// ... existing logging
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Configure Grafana trace-to-logs link**
|
||||
|
||||
In Tempo data source configuration, set up the derived field:
|
||||
|
||||
```yaml
|
||||
jsonData:
|
||||
tracesToLogs:
|
||||
datasourceUid: loki
|
||||
tags: ["trace_id", "xrpl.tx.hash"]
|
||||
filterByTraceID: true
|
||||
filterBySpanID: false
|
||||
```
|
||||
|
||||
### 5.8.7 Correlation with Insight/StatsD Metrics
|
||||
|
||||
To correlate traces with existing Beast Insight metrics:
|
||||
|
||||
**Step 1: Export Insight metrics to Prometheus**
|
||||
|
||||
```yaml
|
||||
# prometheus.yaml
|
||||
scrape_configs:
|
||||
- job_name: "rippled-statsd"
|
||||
static_configs:
|
||||
- targets: ["statsd-exporter:9102"]
|
||||
```
|
||||
|
||||
**Step 2: Add exemplars to metrics**
|
||||
|
||||
OpenTelemetry SDK automatically adds exemplars (trace IDs) to metrics when using the Prometheus exporter. This links metrics spikes to specific traces.
|
||||
|
||||
**Step 3: Configure Grafana metric-to-trace link**
|
||||
|
||||
```yaml
|
||||
# In Prometheus data source
|
||||
jsonData:
|
||||
exemplarTraceIdDestinations:
|
||||
- name: trace_id
|
||||
datasourceUid: tempo
|
||||
```
|
||||
|
||||
**Step 4: Dashboard panel with exemplars**
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "RPC Latency with Trace Links",
|
||||
"type": "timeseries",
|
||||
"datasource": "Prometheus",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "histogram_quantile(0.99, rate(rippled_rpc_duration_seconds_bucket[5m]))",
|
||||
"exemplar": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This allows clicking on metric data points to jump directly to the related trace.
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Code Samples](./04-code-samples.md)_ | _Next: [Implementation Phases](./06-implementation-phases.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,543 +0,0 @@
|
||||
# Implementation Phases
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Configuration Reference](./05-configuration-reference.md) | [Observability Backends](./07-observability-backends.md)
|
||||
|
||||
---
|
||||
|
||||
## 6.1 Phase Overview
|
||||
|
||||
```mermaid
|
||||
gantt
|
||||
title OpenTelemetry Implementation Timeline
|
||||
dateFormat YYYY-MM-DD
|
||||
axisFormat Week %W
|
||||
|
||||
section Phase 1
|
||||
Core Infrastructure :p1, 2024-01-01, 2w
|
||||
SDK Integration :p1a, 2024-01-01, 4d
|
||||
Telemetry Interface :p1b, after p1a, 3d
|
||||
Configuration & CMake :p1c, after p1b, 3d
|
||||
Unit Tests :p1d, after p1c, 2d
|
||||
|
||||
section Phase 2
|
||||
RPC Tracing :p2, after p1, 2w
|
||||
HTTP Context Extraction :p2a, after p1, 2d
|
||||
RPC Handler Instrumentation :p2b, after p2a, 4d
|
||||
WebSocket Support :p2c, after p2b, 2d
|
||||
Integration Tests :p2d, after p2c, 2d
|
||||
|
||||
section Phase 3
|
||||
Transaction Tracing :p3, after p2, 2w
|
||||
Protocol Buffer Extension :p3a, after p2, 2d
|
||||
PeerImp Instrumentation :p3b, after p3a, 3d
|
||||
Relay Context Propagation :p3c, after p3b, 3d
|
||||
Multi-node Tests :p3d, after p3c, 2d
|
||||
|
||||
section Phase 4
|
||||
Consensus Tracing :p4, after p3, 2w
|
||||
Consensus Round Spans :p4a, after p3, 3d
|
||||
Proposal Handling :p4b, after p4a, 3d
|
||||
Validation Tests :p4c, after p4b, 4d
|
||||
|
||||
section Phase 5
|
||||
Documentation & Deploy :p5, after p4, 1w
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6.2 Phase 1: Core Infrastructure (Weeks 1-2)
|
||||
|
||||
**Objective**: Establish foundational telemetry infrastructure
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | ----------------------------------------------------- | ------ | ------ |
|
||||
| 1.1 | Add OpenTelemetry C++ SDK to Conan/CMake | 2d | Low |
|
||||
| 1.2 | Implement `Telemetry` interface and factory | 2d | Low |
|
||||
| 1.3 | Implement `SpanGuard` RAII wrapper | 1d | Low |
|
||||
| 1.4 | Implement configuration parser | 1d | Low |
|
||||
| 1.5 | Integrate into `ApplicationImp` | 1d | Medium |
|
||||
| 1.6 | Add conditional compilation (`XRPL_ENABLE_TELEMETRY`) | 1d | Low |
|
||||
| 1.7 | Create `NullTelemetry` no-op implementation | 0.5d | Low |
|
||||
| 1.8 | Unit tests for core infrastructure | 1.5d | Low |
|
||||
|
||||
**Total Effort**: 10 days (2 developers)
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] OpenTelemetry SDK compiles and links
|
||||
- [ ] Telemetry can be enabled/disabled via config
|
||||
- [ ] Basic span creation works
|
||||
- [ ] No performance regression when disabled
|
||||
- [ ] Unit tests passing
|
||||
|
||||
---
|
||||
|
||||
## 6.3 Phase 2: RPC Tracing (Weeks 3-4)
|
||||
|
||||
**Objective**: Complete tracing for all RPC operations
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | -------------------------------------------------- | ------ | ------ |
|
||||
| 2.1 | Implement W3C Trace Context HTTP header extraction | 1d | Low |
|
||||
| 2.2 | Instrument `ServerHandler::onRequest()` | 1d | Low |
|
||||
| 2.3 | Instrument `RPCHandler::doCommand()` | 2d | Medium |
|
||||
| 2.4 | Add RPC-specific attributes | 1d | Low |
|
||||
| 2.5 | Instrument WebSocket handler | 1d | Medium |
|
||||
| 2.6 | Integration tests for RPC tracing | 2d | Low |
|
||||
| 2.7 | Performance benchmarks | 1d | Low |
|
||||
| 2.8 | Documentation | 1d | Low |
|
||||
|
||||
**Total Effort**: 10 days
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] All RPC commands traced
|
||||
- [ ] Trace context propagates from HTTP headers
|
||||
- [ ] WebSocket and HTTP both instrumented
|
||||
- [ ] <1ms overhead per RPC call
|
||||
- [ ] Integration tests passing
|
||||
|
||||
---
|
||||
|
||||
## 6.4 Phase 3: Transaction Tracing (Weeks 5-6)
|
||||
|
||||
**Objective**: Trace transaction lifecycle across network
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | --------------------------------------------- | ------ | ------ |
|
||||
| 3.1 | Define `TraceContext` Protocol Buffer message | 1d | Low |
|
||||
| 3.2 | Implement protobuf context serialization | 1d | Low |
|
||||
| 3.3 | Instrument `PeerImp::handleTransaction()` | 2d | Medium |
|
||||
| 3.4 | Instrument `NetworkOPs::submitTransaction()` | 1d | Medium |
|
||||
| 3.5 | Instrument HashRouter integration | 1d | Medium |
|
||||
| 3.6 | Implement relay context propagation | 2d | High |
|
||||
| 3.7 | Integration tests (multi-node) | 2d | Medium |
|
||||
| 3.8 | Performance benchmarks | 1d | Low |
|
||||
|
||||
**Total Effort**: 11 days
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] Transaction traces span across nodes
|
||||
- [ ] Trace context in Protocol Buffer messages
|
||||
- [ ] HashRouter deduplication visible in traces
|
||||
- [ ] Multi-node integration tests passing
|
||||
- [ ] <5% overhead on transaction throughput
|
||||
|
||||
---
|
||||
|
||||
## 6.5 Phase 4: Consensus Tracing (Weeks 7-8)
|
||||
|
||||
**Objective**: Full observability into consensus rounds
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | ---------------------------------------------- | ------ | ------ |
|
||||
| 4.1 | Instrument `RCLConsensusAdaptor::startRound()` | 1d | Medium |
|
||||
| 4.2 | Instrument phase transitions | 2d | Medium |
|
||||
| 4.3 | Instrument proposal handling | 2d | High |
|
||||
| 4.4 | Instrument validation handling | 1d | Medium |
|
||||
| 4.5 | Add consensus-specific attributes | 1d | Low |
|
||||
| 4.6 | Correlate with transaction traces | 1d | Medium |
|
||||
| 4.7 | Multi-validator integration tests | 2d | High |
|
||||
| 4.8 | Performance validation | 1d | Medium |
|
||||
|
||||
**Total Effort**: 11 days
|
||||
|
||||
### Exit Criteria
|
||||
|
||||
- [ ] Complete consensus round traces
|
||||
- [ ] Phase transitions visible
|
||||
- [ ] Proposals and validations traced
|
||||
- [ ] No impact on consensus timing
|
||||
- [ ] Multi-validator test network validated
|
||||
|
||||
---
|
||||
|
||||
## 6.6 Phase 5: Documentation & Deployment (Week 9)
|
||||
|
||||
**Objective**: Production readiness
|
||||
|
||||
### Tasks
|
||||
|
||||
| Task | Description | Effort | Risk |
|
||||
| ---- | ----------------------------- | ------ | ---- |
|
||||
| 5.1 | Operator runbook | 1d | Low |
|
||||
| 5.2 | Grafana dashboards | 1d | Low |
|
||||
| 5.3 | Alert definitions | 0.5d | Low |
|
||||
| 5.4 | Collector deployment examples | 0.5d | Low |
|
||||
| 5.5 | Developer documentation | 1d | Low |
|
||||
| 5.6 | Training materials | 0.5d | Low |
|
||||
| 5.7 | Final integration testing | 0.5d | Low |
|
||||
|
||||
**Total Effort**: 5 days
|
||||
|
||||
---
|
||||
|
||||
## 6.7 Risk Assessment
|
||||
|
||||
```mermaid
|
||||
quadrantChart
|
||||
title Risk Assessment Matrix
|
||||
x-axis Low Impact --> High Impact
|
||||
y-axis Low Likelihood --> High Likelihood
|
||||
quadrant-1 Monitor Closely
|
||||
quadrant-2 Mitigate Immediately
|
||||
quadrant-3 Accept Risk
|
||||
quadrant-4 Plan Mitigation
|
||||
|
||||
SDK Compatibility: [0.25, 0.2]
|
||||
Protocol Changes: [0.75, 0.65]
|
||||
Performance Overhead: [0.65, 0.45]
|
||||
Context Propagation: [0.5, 0.5]
|
||||
Memory Leaks: [0.8, 0.2]
|
||||
```
|
||||
|
||||
### Risk Details
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|
||||
| ------------------------------------ | ---------- | ------ | --------------------------------------- |
|
||||
| Protocol changes break compatibility | Medium | High | Use high field numbers, optional fields |
|
||||
| Performance overhead unacceptable | Medium | Medium | Sampling, conditional compilation |
|
||||
| Context propagation complexity | Medium | Medium | Phased rollout, extensive testing |
|
||||
| SDK compatibility issues | Low | Medium | Pin SDK version, fallback to no-op |
|
||||
| Memory leaks in long-running nodes | Low | High | Memory profiling, bounded queues |
|
||||
|
||||
---
|
||||
|
||||
## 6.8 Success Metrics
|
||||
|
||||
| Metric | Target | Measurement |
|
||||
| ------------------------ | ------------------------------ | --------------------- |
|
||||
| Trace coverage | >95% of transactions | Sampling verification |
|
||||
| CPU overhead | <3% | Benchmark tests |
|
||||
| Memory overhead | <5 MB | Memory profiling |
|
||||
| Latency impact (p99) | <2% | Performance tests |
|
||||
| Trace completeness | >99% spans with required attrs | Validation script |
|
||||
| Cross-node trace linkage | >90% of multi-hop transactions | Integration tests |
|
||||
|
||||
---
|
||||
|
||||
## 6.9 Effort Summary
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
%%{init: {'pie': {'textPosition': 0.75}}}%%
|
||||
pie showData
|
||||
"Phase 1: Core Infrastructure" : 10
|
||||
"Phase 2: RPC Tracing" : 10
|
||||
"Phase 3: Transaction Tracing" : 11
|
||||
"Phase 4: Consensus Tracing" : 11
|
||||
"Phase 5: Documentation" : 5
|
||||
```
|
||||
|
||||
**Total Effort Distribution (47 developer-days)**
|
||||
|
||||
</div>
|
||||
|
||||
### Resource Requirements
|
||||
|
||||
| Phase | Developers | Duration | Total Effort |
|
||||
| --------- | ---------- | ----------- | ------------ |
|
||||
| 1 | 2 | 2 weeks | 10 days |
|
||||
| 2 | 1-2 | 2 weeks | 10 days |
|
||||
| 3 | 2 | 2 weeks | 11 days |
|
||||
| 4 | 2 | 2 weeks | 11 days |
|
||||
| 5 | 1 | 1 week | 5 days |
|
||||
| **Total** | **2** | **9 weeks** | **47 days** |
|
||||
|
||||
---
|
||||
|
||||
## 6.10 Quick Wins and Crawl-Walk-Run Strategy
|
||||
|
||||
This section outlines a prioritized approach to maximize ROI with minimal initial investment.
|
||||
|
||||
### 6.10.1 Crawl-Walk-Run Overview
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph crawl["🐢 CRAWL (Week 1-2)"]
|
||||
direction LR
|
||||
c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[Single Node]
|
||||
end
|
||||
|
||||
subgraph walk["🚶 WALK (Week 3-5)"]
|
||||
direction LR
|
||||
w1[Transaction Tracing] ~~~ w2[Cross-Node Context] ~~~ w3[Basic Dashboards]
|
||||
end
|
||||
|
||||
subgraph run["🏃 RUN (Week 6-9)"]
|
||||
direction LR
|
||||
r1[Consensus Tracing] ~~~ r2[Full Correlation] ~~~ r3[Production Deploy]
|
||||
end
|
||||
|
||||
crawl --> walk --> run
|
||||
|
||||
style crawl fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style walk fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style run fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style r1 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style r2 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style r3 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
```
|
||||
|
||||
</div>
|
||||
|
||||
### 6.10.2 Quick Wins (Immediate Value)
|
||||
|
||||
| Quick Win | Effort | Value | When to Deploy |
|
||||
| ------------------------------ | -------- | ------ | -------------- |
|
||||
| **RPC Command Tracing** | 2 days | High | Week 2 |
|
||||
| **RPC Latency Histograms** | 0.5 days | High | Week 2 |
|
||||
| **Error Rate Dashboard** | 0.5 days | Medium | Week 2 |
|
||||
| **Transaction Submit Tracing** | 1 day | High | Week 3 |
|
||||
| **Consensus Round Duration** | 1 day | Medium | Week 6 |
|
||||
|
||||
### 6.10.3 CRAWL Phase (Weeks 1-2)
|
||||
|
||||
**Goal**: Get basic tracing working with minimal code changes.
|
||||
|
||||
**What You Get**:
|
||||
|
||||
- RPC request/response traces for all commands
|
||||
- Latency breakdown per RPC command
|
||||
- Error visibility with stack traces
|
||||
- Basic Grafana dashboard
|
||||
|
||||
**Code Changes**: ~15 lines in `ServerHandler.cpp`, ~40 lines in new telemetry module
|
||||
|
||||
**Why Start Here**:
|
||||
|
||||
- RPC is the lowest-risk, highest-visibility component
|
||||
- Immediate value for debugging client issues
|
||||
- No cross-node complexity
|
||||
- Single file modification to existing code
|
||||
|
||||
### 6.10.4 WALK Phase (Weeks 3-5)
|
||||
|
||||
**Goal**: Add transaction lifecycle tracing across nodes.
|
||||
|
||||
**What You Get**:
|
||||
|
||||
- End-to-end transaction traces from submit to relay
|
||||
- Cross-node correlation (see transaction path)
|
||||
- HashRouter deduplication visibility
|
||||
- Relay latency metrics
|
||||
|
||||
**Code Changes**: ~120 lines across 4 files, plus protobuf extension
|
||||
|
||||
**Why Do This Second**:
|
||||
|
||||
- Builds on RPC tracing (transactions submitted via RPC)
|
||||
- Moderate complexity (requires context propagation)
|
||||
- High value for debugging transaction issues
|
||||
|
||||
### 6.10.5 RUN Phase (Weeks 6-9)
|
||||
|
||||
**Goal**: Full observability including consensus.
|
||||
|
||||
**What You Get**:
|
||||
|
||||
- Complete consensus round visibility
|
||||
- Phase transition timing
|
||||
- Validator proposal tracking
|
||||
- Full end-to-end traces (client → RPC → TX → consensus → ledger)
|
||||
|
||||
**Code Changes**: ~100 lines across 3 consensus files
|
||||
|
||||
**Why Do This Last**:
|
||||
|
||||
- Highest complexity (consensus is critical path)
|
||||
- Requires thorough testing
|
||||
- Lower relative value (consensus issues are rarer)
|
||||
|
||||
### 6.10.6 ROI Prioritization Matrix
|
||||
|
||||
```mermaid
|
||||
quadrantChart
|
||||
title Implementation ROI Matrix
|
||||
x-axis Low Effort --> High Effort
|
||||
y-axis Low Value --> High Value
|
||||
quadrant-1 Quick Wins - Do First
|
||||
quadrant-2 Major Projects - Plan Carefully
|
||||
quadrant-3 Nice to Have - Optional
|
||||
quadrant-4 Time Sinks - Avoid
|
||||
|
||||
RPC Tracing: [0.15, 0.9]
|
||||
TX Submit Trace: [0.25, 0.85]
|
||||
TX Relay Trace: [0.5, 0.8]
|
||||
Consensus Trace: [0.7, 0.75]
|
||||
Peer Message Trace: [0.85, 0.3]
|
||||
Ledger Acquire: [0.55, 0.5]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6.11 Definition of Done
|
||||
|
||||
Clear, measurable criteria for each phase.
|
||||
|
||||
### 6.11.1 Phase 1: Core Infrastructure
|
||||
|
||||
| Criterion | Measurement | Target |
|
||||
| --------------- | ---------------------------------------------------------- | ---------------------------- |
|
||||
| SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles |
|
||||
| Runtime Toggle | `enabled=0` produces zero overhead | <0.1% CPU difference |
|
||||
| Span Creation | Unit test creates and exports span | Span appears in Jaeger |
|
||||
| Configuration | All config options parsed correctly | Config validation tests pass |
|
||||
| Documentation | Developer guide exists | PR approved |
|
||||
|
||||
**Definition of Done**: All criteria met, PR merged, no regressions in CI.
|
||||
|
||||
### 6.11.2 Phase 2: RPC Tracing
|
||||
|
||||
| Criterion | Measurement | Target |
|
||||
| ------------------ | ---------------------------------- | -------------------------- |
|
||||
| Coverage | All RPC commands instrumented | 100% of commands |
|
||||
| Context Extraction | traceparent header propagates | Integration test passes |
|
||||
| Attributes | Command, status, duration recorded | Validation script confirms |
|
||||
| Performance | RPC latency overhead | <1ms p99 |
|
||||
| Dashboard | Grafana dashboard deployed | Screenshot in docs |
|
||||
|
||||
**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.
|
||||
|
||||
### 6.11.3 Phase 3: Transaction Tracing
|
||||
|
||||
| Criterion | Measurement | Target |
|
||||
| ---------------- | ------------------------------- | ---------------------------------- |
|
||||
| Local Trace | Submit → validate → TxQ traced | Single-node test passes |
|
||||
| Cross-Node | Context propagates via protobuf | Multi-node test passes |
|
||||
| Relay Visibility | relay_count attribute correct | Spot check 100 txs |
|
||||
| HashRouter | Deduplication visible in trace | Duplicate txs show suppressed=true |
|
||||
| Performance | TX throughput overhead | <5% degradation |
|
||||
|
||||
**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.
|
||||
|
||||
### 6.11.4 Phase 4: Consensus Tracing
|
||||
|
||||
| Criterion | Measurement | Target |
|
||||
| -------------------- | ----------------------------- | ------------------------- |
|
||||
| Round Tracing | startRound creates root span | Unit test passes |
|
||||
| Phase Visibility | All phases have child spans | Integration test confirms |
|
||||
| Proposer Attribution | Proposer ID in attributes | Spot check 50 rounds |
|
||||
| Timing Accuracy | Phase durations match PerfLog | <5% variance |
|
||||
| No Consensus Impact | Round timing unchanged | Performance test passes |
|
||||
|
||||
**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
|
||||
|
||||
### 6.11.5 Phase 5: Production Deployment
|
||||
|
||||
| Criterion | Measurement | Target |
|
||||
| ------------ | ---------------------------- | -------------------------- |
|
||||
| Collector HA | Multiple collectors deployed | No single point of failure |
|
||||
| Sampling | Tail sampling configured | 10% base + errors + slow |
|
||||
| Retention | Data retained per policy | 7 days hot, 30 days warm |
|
||||
| Alerting | Alerts configured | Error spike, high latency |
|
||||
| Runbook | Operator documentation | Approved by ops team |
|
||||
| Training | Team trained | Session completed |
|
||||
|
||||
**Definition of Done**: Telemetry running in production, operators trained, alerts active.
|
||||
|
||||
### 6.11.6 Success Metrics Summary
|
||||
|
||||
| Phase | Primary Metric | Secondary Metric | Deadline |
|
||||
| ------- | ---------------------- | --------------------------- | ------------- |
|
||||
| Phase 1 | SDK compiles and runs | Zero overhead when disabled | End of Week 2 |
|
||||
| Phase 2 | 100% RPC coverage | <1ms latency overhead | End of Week 4 |
|
||||
| Phase 3 | Cross-node traces work | <5% throughput impact | End of Week 6 |
|
||||
| Phase 4 | Consensus fully traced | No consensus timing impact | End of Week 8 |
|
||||
| Phase 5 | Production deployment | Operators trained | End of Week 9 |
|
||||
|
||||
---
|
||||
|
||||
## 6.12 Recommended Implementation Order
|
||||
|
||||
Based on ROI analysis, implement in this exact order:
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph week1["Week 1"]
|
||||
t1[1. OpenTelemetry SDK<br/>Conan/CMake integration]
|
||||
t2[2. Telemetry interface<br/>SpanGuard, config]
|
||||
end
|
||||
|
||||
subgraph week2["Week 2"]
|
||||
t3[3. RPC ServerHandler<br/>instrumentation]
|
||||
t4[4. Basic Jaeger setup<br/>for testing]
|
||||
end
|
||||
|
||||
subgraph week3["Week 3"]
|
||||
t5[5. Transaction submit<br/>tracing]
|
||||
t6[6. Grafana dashboard<br/>v1]
|
||||
end
|
||||
|
||||
subgraph week4["Week 4"]
|
||||
t7[7. Protobuf context<br/>extension]
|
||||
t8[8. PeerImp tx.relay<br/>instrumentation]
|
||||
end
|
||||
|
||||
subgraph week5["Week 5"]
|
||||
t9[9. Multi-node<br/>integration tests]
|
||||
t10[10. Performance<br/>benchmarks]
|
||||
end
|
||||
|
||||
subgraph week6_8["Weeks 6-8"]
|
||||
t11[11. Consensus<br/>instrumentation]
|
||||
t12[12. Full integration<br/>testing]
|
||||
end
|
||||
|
||||
subgraph week9["Week 9"]
|
||||
t13[13. Production<br/>deployment]
|
||||
t14[14. Documentation<br/>& training]
|
||||
end
|
||||
|
||||
t1 --> t2 --> t3 --> t4
|
||||
t4 --> t5 --> t6
|
||||
t6 --> t7 --> t8
|
||||
t8 --> t9 --> t10
|
||||
t10 --> t11 --> t12
|
||||
t12 --> t13 --> t14
|
||||
|
||||
style week1 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style week2 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style week3 fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style week4 fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style week5 fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style week6_8 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style week9 fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style t1 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style t2 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style t3 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style t4 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style t5 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t6 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t7 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t8 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t9 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t10 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
|
||||
style t11 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style t12 fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style t13 fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style t14 fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Configuration Reference](./05-configuration-reference.md)_ | _Next: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,595 +0,0 @@
|
||||
# Observability Backend Recommendations
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Implementation Phases](./06-implementation-phases.md) | [Appendix](./08-appendix.md)
|
||||
|
||||
---
|
||||
|
||||
## 7.1 Development/Testing Backends
|
||||
|
||||
| Backend | Pros | Cons | Use Case |
|
||||
| ---------- | ------------------- | ----------------- | ----------------- |
|
||||
| **Jaeger** | Easy setup, good UI | Limited retention | Local dev, CI |
|
||||
| **Zipkin** | Simple, lightweight | Basic features | Quick prototyping |
|
||||
|
||||
### Quick Start with Jaeger
|
||||
|
||||
```bash
|
||||
# Start Jaeger with OTLP support
|
||||
docker run -d --name jaeger \
|
||||
-e COLLECTOR_OTLP_ENABLED=true \
|
||||
-p 16686:16686 \
|
||||
-p 4317:4317 \
|
||||
-p 4318:4318 \
|
||||
jaegertracing/all-in-one:latest
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7.2 Production Backends
|
||||
|
||||
| Backend | Pros | Cons | Use Case |
|
||||
| ----------------- | ----------------------------------------- | ------------------ | --------------------------- |
|
||||
| **Grafana Tempo** | Cost-effective, Grafana integration | Newer project | Most production deployments |
|
||||
| **Elastic APM** | Full observability stack, log correlation | Resource intensive | Existing Elastic users |
|
||||
| **Honeycomb** | Excellent query, high cardinality | SaaS cost | Deep debugging needs |
|
||||
| **Datadog APM** | Full platform, easy setup | SaaS cost | Enterprise with budget |
|
||||
|
||||
### Backend Selection Flowchart
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
start[Select Backend] --> budget{Budget<br/>Constraints?}
|
||||
|
||||
budget -->|Yes| oss[Open Source]
|
||||
budget -->|No| saas{Prefer<br/>SaaS?}
|
||||
|
||||
oss --> existing{Existing<br/>Stack?}
|
||||
existing -->|Grafana| tempo[Grafana Tempo]
|
||||
existing -->|Elastic| elastic[Elastic APM]
|
||||
existing -->|None| tempo
|
||||
|
||||
saas -->|Yes| enterprise{Enterprise<br/>Support?}
|
||||
saas -->|No| oss
|
||||
|
||||
enterprise -->|Yes| datadog[Datadog APM]
|
||||
enterprise -->|No| honeycomb[Honeycomb]
|
||||
|
||||
tempo --> final[Configure Collector]
|
||||
elastic --> final
|
||||
honeycomb --> final
|
||||
datadog --> final
|
||||
|
||||
style start fill:#0f172a,stroke:#020617,color:#fff
|
||||
style budget fill:#334155,stroke:#1e293b,color:#fff
|
||||
style oss fill:#1e293b,stroke:#0f172a,color:#fff
|
||||
style existing fill:#334155,stroke:#1e293b,color:#fff
|
||||
style saas fill:#334155,stroke:#1e293b,color:#fff
|
||||
style enterprise fill:#334155,stroke:#1e293b,color:#fff
|
||||
style final fill:#0f172a,stroke:#020617,color:#fff
|
||||
style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style elastic fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style honeycomb fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style datadog fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7.3 Recommended Production Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph validators["Validator Nodes"]
|
||||
v1[rippled<br/>Validator 1]
|
||||
v2[rippled<br/>Validator 2]
|
||||
end
|
||||
|
||||
subgraph stock["Stock Nodes"]
|
||||
s1[rippled<br/>Stock 1]
|
||||
s2[rippled<br/>Stock 2]
|
||||
end
|
||||
|
||||
subgraph collector["OTel Collector Cluster"]
|
||||
c1[Collector<br/>DC1]
|
||||
c2[Collector<br/>DC2]
|
||||
end
|
||||
|
||||
subgraph backends["Storage Backends"]
|
||||
tempo[(Grafana<br/>Tempo)]
|
||||
elastic[(Elastic<br/>APM)]
|
||||
archive[(S3/GCS<br/>Archive)]
|
||||
end
|
||||
|
||||
subgraph ui["Visualization"]
|
||||
grafana[Grafana<br/>Dashboards]
|
||||
end
|
||||
|
||||
v1 -->|OTLP| c1
|
||||
v2 -->|OTLP| c1
|
||||
s1 -->|OTLP| c2
|
||||
s2 -->|OTLP| c2
|
||||
|
||||
c1 --> tempo
|
||||
c1 --> elastic
|
||||
c2 --> tempo
|
||||
c2 --> archive
|
||||
|
||||
tempo --> grafana
|
||||
elastic --> grafana
|
||||
|
||||
style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff
|
||||
style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff
|
||||
style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
|
||||
style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
|
||||
style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7.4 Architecture Considerations
|
||||
|
||||
### 7.4.1 Collector Placement
|
||||
|
||||
| Strategy | Description | Pros | Cons |
|
||||
| ------------- | -------------------- | ------------------------ | ----------------------- |
|
||||
| **Sidecar** | Collector per node | Isolation, simple config | Resource overhead |
|
||||
| **DaemonSet** | Collector per host | Shared resources | Complexity |
|
||||
| **Gateway** | Central collector(s) | Centralized processing | Single point of failure |
|
||||
|
||||
**Recommendation**: Use **Gateway** pattern with regional collectors for rippled networks:
|
||||
|
||||
- One collector cluster per datacenter/region
|
||||
- Tail-based sampling at collector level
|
||||
- Multiple export destinations for redundancy
|
||||
|
||||
### 7.4.2 Sampling Strategy
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph head["Head Sampling (Node)"]
|
||||
hs[10% probabilistic]
|
||||
end
|
||||
|
||||
subgraph tail["Tail Sampling (Collector)"]
|
||||
ts1[Keep all errors]
|
||||
ts2[Keep slow >5s]
|
||||
ts3[Keep 10% rest]
|
||||
end
|
||||
|
||||
head --> tail
|
||||
|
||||
ts1 --> final[Final Traces]
|
||||
ts2 --> final
|
||||
ts3 --> final
|
||||
|
||||
style head fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style tail fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style hs fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style ts1 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style ts2 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style ts3 fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style final fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
```
|
||||
|
||||
### 7.4.3 Data Retention
|
||||
|
||||
| Environment | Hot Storage | Warm Storage | Cold Archive |
|
||||
| ----------- | ----------- | ------------ | ------------ |
|
||||
| Development | 24 hours | N/A | N/A |
|
||||
| Staging | 7 days | N/A | N/A |
|
||||
| Production | 7 days | 30 days | many years |
|
||||
|
||||
---
|
||||
|
||||
## 7.5 Integration Checklist
|
||||
|
||||
- [ ] Choose primary backend (Tempo recommended for cost/features)
|
||||
- [ ] Deploy collector cluster with high availability
|
||||
- [ ] Configure tail-based sampling for error/latency traces
|
||||
- [ ] Set up Grafana dashboards for trace visualization
|
||||
- [ ] Configure alerts for trace anomalies
|
||||
- [ ] Establish data retention policies
|
||||
- [ ] Test trace correlation with logs and metrics
|
||||
|
||||
---
|
||||
|
||||
## 7.6 Grafana Dashboard Examples
|
||||
|
||||
Pre-built dashboards for rippled observability.
|
||||
|
||||
### 7.6.1 Consensus Health Dashboard
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "rippled Consensus Health",
|
||||
"uid": "rippled-consensus-health",
|
||||
"tags": ["rippled", "consensus", "tracing"],
|
||||
"panels": [
|
||||
{
|
||||
"title": "Consensus Round Duration",
|
||||
"type": "timeseries",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "yellow", "value": 4000 },
|
||||
{ "color": "red", "value": 5000 }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Phase Duration Breakdown",
|
||||
"type": "barchart",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Proposers per Round",
|
||||
"type": "stat",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
|
||||
},
|
||||
{
|
||||
"title": "Recent Slow Rounds (>5s)",
|
||||
"type": "table",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | duration > 5s"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 }
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### 7.6.2 Node Overview Dashboard
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "rippled Node Overview",
|
||||
"uid": "rippled-node-overview",
|
||||
"panels": [
|
||||
{
|
||||
"title": "Active Nodes",
|
||||
"type": "stat",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\"} | count_over_time() by (resource.service.instance.id) | count()"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Total Transactions (1h)",
|
||||
"type": "stat",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | count()"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Error Rate",
|
||||
"type": "gauge",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && status.code=error} | rate() / {resource.service.name=\"rippled\"} | rate() * 100"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "percent",
|
||||
"max": 10,
|
||||
"thresholds": {
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "yellow", "value": 1 },
|
||||
{ "color": "red", "value": 5 }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Service Map",
|
||||
"type": "nodeGraph",
|
||||
"datasource": "Tempo",
|
||||
"gridPos": { "h": 12, "w": 12, "x": 12, "y": 0 }
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### 7.6.3 Alert Rules
|
||||
|
||||
```yaml
|
||||
# grafana/provisioning/alerting/rippled-alerts.yaml
|
||||
apiVersion: 1
|
||||
|
||||
groups:
|
||||
- name: rippled-tracing-alerts
|
||||
folder: rippled
|
||||
interval: 1m
|
||||
rules:
|
||||
- uid: consensus-slow
|
||||
title: Consensus Round Slow
|
||||
condition: A
|
||||
data:
|
||||
- refId: A
|
||||
datasourceUid: tempo
|
||||
model:
|
||||
queryType: traceql
|
||||
query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s'
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: Consensus rounds taking >5 seconds
|
||||
description: "Consensus duration: {{ $value }}ms"
|
||||
labels:
|
||||
severity: warning
|
||||
|
||||
- uid: rpc-error-spike
|
||||
title: RPC Error Rate Spike
|
||||
condition: B
|
||||
data:
|
||||
- refId: B
|
||||
datasourceUid: tempo
|
||||
model:
|
||||
queryType: traceql
|
||||
query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05'
|
||||
for: 2m
|
||||
annotations:
|
||||
summary: RPC error rate >5%
|
||||
labels:
|
||||
severity: critical
|
||||
|
||||
- uid: tx-throughput-drop
|
||||
title: Transaction Throughput Drop
|
||||
condition: C
|
||||
data:
|
||||
- refId: C
|
||||
datasourceUid: tempo
|
||||
model:
|
||||
queryType: traceql
|
||||
query: '{resource.service.name="rippled" && name="tx.receive"} | rate() < 10'
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: Transaction throughput below threshold
|
||||
labels:
|
||||
severity: warning
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7.7 PerfLog and Insight Correlation
|
||||
|
||||
How to correlate OpenTelemetry traces with existing rippled observability.
|
||||
|
||||
### 7.7.1 Correlation Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph rippled["rippled Node"]
|
||||
otel[OpenTelemetry<br/>Spans]
|
||||
perflog[PerfLog<br/>JSON Logs]
|
||||
insight[Beast Insight<br/>StatsD Metrics]
|
||||
end
|
||||
|
||||
subgraph collectors["Data Collection"]
|
||||
otelc[OTel Collector]
|
||||
promtail[Promtail/Fluentd]
|
||||
statsd[StatsD Exporter]
|
||||
end
|
||||
|
||||
subgraph storage["Storage"]
|
||||
tempo[(Tempo)]
|
||||
loki[(Loki)]
|
||||
prom[(Prometheus)]
|
||||
end
|
||||
|
||||
subgraph grafana["Grafana"]
|
||||
traces[Trace View]
|
||||
logs[Log View]
|
||||
metrics[Metrics View]
|
||||
corr[Correlation<br/>Panel]
|
||||
end
|
||||
|
||||
otel -->|OTLP| otelc --> tempo
|
||||
perflog -->|JSON| promtail --> loki
|
||||
insight -->|StatsD| statsd --> prom
|
||||
|
||||
tempo --> traces
|
||||
loki --> logs
|
||||
prom --> metrics
|
||||
|
||||
traces --> corr
|
||||
logs --> corr
|
||||
metrics --> corr
|
||||
|
||||
style rippled fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style collectors fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style storage fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style grafana fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style otel fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style perflog fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style insight fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style otelc fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style promtail fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style statsd fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style loki fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style prom fill:#1b5e20,stroke:#0d3d14,color:#fff
|
||||
style traces fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style logs fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style metrics fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style corr fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
### 7.7.2 Correlation Fields
|
||||
|
||||
| Source | Field | Link To | Purpose |
|
||||
| ----------- | --------------------------- | ------------- | -------------------------- |
|
||||
| **Trace** | `trace_id` | Logs | Find log entries for trace |
|
||||
| **Trace** | `xrpl.tx.hash` | Logs, Metrics | Find TX-related data |
|
||||
| **Trace** | `xrpl.consensus.ledger.seq` | Logs | Find ledger-related logs |
|
||||
| **PerfLog** | `trace_id` (new) | Traces | Jump to trace from log |
|
||||
| **PerfLog** | `ledger_seq` | Traces | Find consensus trace |
|
||||
| **Insight** | `exemplar.trace_id` | Traces | Jump from metric spike |
|
||||
|
||||
### 7.7.3 Example: Debugging a Slow Transaction
|
||||
|
||||
**Step 1: Find the trace**
|
||||
|
||||
```
|
||||
# In Grafana Explore with Tempo
|
||||
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}
|
||||
```
|
||||
|
||||
**Step 2: Get the trace_id from the trace view**
|
||||
|
||||
```
|
||||
Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
|
||||
```
|
||||
|
||||
**Step 3: Find related PerfLog entries**
|
||||
|
||||
```
|
||||
# In Grafana Explore with Loki
|
||||
{job="rippled"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
|
||||
```
|
||||
|
||||
**Step 4: Check Insight metrics for the time window**
|
||||
|
||||
```
|
||||
# In Grafana with Prometheus
|
||||
rate(rippled_tx_applied_total[1m])
|
||||
@ timestamp_from_trace
|
||||
```
|
||||
|
||||
### 7.7.4 Unified Dashboard Example
|
||||
|
||||
```json
|
||||
{
|
||||
"title": "rippled Unified Observability",
|
||||
"uid": "rippled-unified",
|
||||
"panels": [
|
||||
{
|
||||
"title": "Transaction Latency (Traces)",
|
||||
"type": "timeseries",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | histogram_over_time(duration)"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 6, "w": 8, "x": 0, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Transaction Rate (Metrics)",
|
||||
"type": "timeseries",
|
||||
"datasource": "Prometheus",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "rate(rippled_tx_received_total[5m])",
|
||||
"legendFormat": "{{ instance }}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"links": [
|
||||
{
|
||||
"title": "View traces",
|
||||
"url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"rippled\\\" && name=\\\"tx.receive\\\"}\"}"
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"gridPos": { "h": 6, "w": 8, "x": 8, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Recent Logs",
|
||||
"type": "logs",
|
||||
"datasource": "Loki",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "{job=\"rippled\"} | json"
|
||||
}
|
||||
],
|
||||
"gridPos": { "h": 6, "w": 8, "x": 16, "y": 0 }
|
||||
},
|
||||
{
|
||||
"title": "Trace Search",
|
||||
"type": "table",
|
||||
"datasource": "Tempo",
|
||||
"targets": [
|
||||
{
|
||||
"queryType": "traceql",
|
||||
"query": "{resource.service.name=\"rippled\"}"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"overrides": [
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "traceID" },
|
||||
"properties": [
|
||||
{
|
||||
"id": "links",
|
||||
"value": [
|
||||
{
|
||||
"title": "View trace",
|
||||
"url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"${__value.raw}\"}"
|
||||
},
|
||||
{
|
||||
"title": "View logs",
|
||||
"url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"rippled\\\"} |= \\\"${__value.raw}\\\"\"}"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
},
|
||||
"gridPos": { "h": 12, "w": 24, "x": 0, "y": 6 }
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Implementation Phases](./06-implementation-phases.md)_ | _Next: [Appendix](./08-appendix.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,133 +0,0 @@
|
||||
# Appendix
|
||||
|
||||
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
|
||||
> **Related**: [Observability Backends](./07-observability-backends.md)
|
||||
|
||||
---
|
||||
|
||||
## 8.1 Glossary
|
||||
|
||||
| Term | Definition |
|
||||
| --------------------- | ---------------------------------------------------------- |
|
||||
| **Span** | A unit of work with start/end time, name, and attributes |
|
||||
| **Trace** | A collection of spans representing a complete request flow |
|
||||
| **Trace ID** | 128-bit unique identifier for a trace |
|
||||
| **Span ID** | 64-bit unique identifier for a span within a trace |
|
||||
| **Context** | Carrier for trace/span IDs across boundaries |
|
||||
| **Propagator** | Component that injects/extracts context |
|
||||
| **Sampler** | Decides which traces to record |
|
||||
| **Exporter** | Sends spans to backend |
|
||||
| **Collector** | Receives, processes, and forwards telemetry |
|
||||
| **OTLP** | OpenTelemetry Protocol (wire format) |
|
||||
| **W3C Trace Context** | Standard HTTP headers for trace propagation |
|
||||
| **Baggage** | Key-value pairs propagated across service boundaries |
|
||||
| **Resource** | Entity producing telemetry (service, host, etc.) |
|
||||
| **Instrumentation** | Code that creates telemetry data |
|
||||
|
||||
### rippled-Specific Terms
|
||||
|
||||
| Term | Definition |
|
||||
| ----------------- | -------------------------------------------------- |
|
||||
| **Overlay** | P2P network layer managing peer connections |
|
||||
| **Consensus** | XRP Ledger consensus algorithm (RCL) |
|
||||
| **Proposal** | Validator's suggested transaction set for a ledger |
|
||||
| **Validation** | Validator's signature on a closed ledger |
|
||||
| **HashRouter** | Component for transaction deduplication |
|
||||
| **JobQueue** | Thread pool for asynchronous task execution |
|
||||
| **PerfLog** | Existing performance logging system in rippled |
|
||||
| **Beast Insight** | Existing metrics framework in rippled |
|
||||
|
||||
---
|
||||
|
||||
## 8.2 Span Hierarchy Visualization
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph trace["Trace: Transaction Lifecycle"]
|
||||
rpc["rpc.submit<br/>(entry point)"]
|
||||
validate["tx.validate"]
|
||||
relay["tx.relay<br/>(parent span)"]
|
||||
|
||||
subgraph peers["Peer Spans"]
|
||||
p1["peer.send<br/>Peer A"]
|
||||
p2["peer.send<br/>Peer B"]
|
||||
p3["peer.send<br/>Peer C"]
|
||||
end
|
||||
|
||||
consensus["consensus.round"]
|
||||
apply["tx.apply"]
|
||||
end
|
||||
|
||||
rpc --> validate
|
||||
validate --> relay
|
||||
relay --> p1
|
||||
relay --> p2
|
||||
relay --> p3
|
||||
p1 -.->|"context propagation"| consensus
|
||||
consensus --> apply
|
||||
|
||||
style trace fill:#0f172a,stroke:#020617,color:#fff
|
||||
style peers fill:#1e3a8a,stroke:#172554,color:#fff
|
||||
style rpc fill:#1d4ed8,stroke:#1e40af,color:#fff
|
||||
style validate fill:#047857,stroke:#064e3b,color:#fff
|
||||
style relay fill:#047857,stroke:#064e3b,color:#fff
|
||||
style p1 fill:#0e7490,stroke:#155e75,color:#fff
|
||||
style p2 fill:#0e7490,stroke:#155e75,color:#fff
|
||||
style p3 fill:#0e7490,stroke:#155e75,color:#fff
|
||||
style consensus fill:#fef3c7,stroke:#fde68a,color:#1e293b
|
||||
style apply fill:#047857,stroke:#064e3b,color:#fff
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8.3 References
|
||||
|
||||
### OpenTelemetry Resources
|
||||
|
||||
1. [OpenTelemetry C++ SDK](https://github.com/open-telemetry/opentelemetry-cpp)
|
||||
2. [OpenTelemetry Specification](https://opentelemetry.io/docs/specs/otel/)
|
||||
3. [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
|
||||
4. [OTLP Protocol Specification](https://opentelemetry.io/docs/specs/otlp/)
|
||||
|
||||
### Standards
|
||||
|
||||
5. [W3C Trace Context](https://www.w3.org/TR/trace-context/)
|
||||
6. [W3C Baggage](https://www.w3.org/TR/baggage/)
|
||||
7. [Protocol Buffers](https://protobuf.dev/)
|
||||
|
||||
### rippled Resources
|
||||
|
||||
8. [rippled Source Code](https://github.com/XRPLF/rippled)
|
||||
9. [XRP Ledger Documentation](https://xrpl.org/docs/)
|
||||
10. [rippled Overlay README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/overlay/README.md)
|
||||
11. [rippled RPC README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/rpc/README.md)
|
||||
12. [rippled Consensus README](https://github.com/XRPLF/rippled/blob/develop/src/xrpld/app/consensus/README.md)
|
||||
|
||||
---
|
||||
|
||||
## 8.4 Version History
|
||||
|
||||
| Version | Date | Author | Changes |
|
||||
| ------- | ---------- | ------ | --------------------------------- |
|
||||
| 1.0 | 2026-02-12 | - | Initial implementation plan |
|
||||
| 1.1 | 2026-02-13 | - | Refactored into modular documents |
|
||||
|
||||
---
|
||||
|
||||
## 8.5 Document Index
|
||||
|
||||
| Document | Description |
|
||||
| ---------------------------------------------------------------- | ------------------------------------------ |
|
||||
| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md) | Master overview and executive summary |
|
||||
| [01-architecture-analysis.md](./01-architecture-analysis.md) | rippled architecture and trace points |
|
||||
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection, exporters, span conventions |
|
||||
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis |
|
||||
| [04-code-samples.md](./04-code-samples.md) | C++ code examples for all components |
|
||||
| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config, CMake, Collector configs |
|
||||
| [06-implementation-phases.md](./06-implementation-phases.md) | Timeline, tasks, risks, success metrics |
|
||||
| [07-observability-backends.md](./07-observability-backends.md) | Backend selection and architecture |
|
||||
| [08-appendix.md](./08-appendix.md) | Glossary, references, version history |
|
||||
|
||||
---
|
||||
|
||||
_Previous: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_
|
||||
@@ -1,190 +0,0 @@
|
||||
# [OpenTelemetry](00-tracing-fundamentals.md) Distributed Tracing Implementation Plan for rippled (xrpld)
|
||||
|
||||
## Executive Summary
|
||||
|
||||
This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.
|
||||
|
||||
### Key Benefits
|
||||
|
||||
- **End-to-end transaction visibility**: Track transactions from submission through consensus to ledger inclusion
|
||||
- **Consensus round analysis**: Understand timing and behavior of consensus phases across validators
|
||||
- **RPC performance insights**: Identify slow handlers and optimize response times
|
||||
- **Network topology understanding**: Visualize message propagation patterns between peers
|
||||
- **Incident debugging**: Correlate events across distributed nodes during issues
|
||||
|
||||
### Estimated Performance Overhead
|
||||
|
||||
| Metric | Overhead | Notes |
|
||||
| ------------- | ---------- | ----------------------------------- |
|
||||
| CPU | 1-3% | Span creation and attribute setting |
|
||||
| Memory | 2-5 MB | Batch buffer for pending spans |
|
||||
| Network | 10-50 KB/s | Compressed OTLP export to collector |
|
||||
| Latency (p99) | <2% | With proper sampling configuration |
|
||||
|
||||
---
|
||||
|
||||
## Document Structure
|
||||
|
||||
This implementation plan is organized into modular documents for easier navigation:
|
||||
|
||||
<div align="center">
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
overview["📋 OpenTelemetryPlan.md<br/>(This Document)"]
|
||||
|
||||
subgraph analysis["Analysis & Design"]
|
||||
arch["01-architecture-analysis.md"]
|
||||
design["02-design-decisions.md"]
|
||||
end
|
||||
|
||||
subgraph impl["Implementation"]
|
||||
strategy["03-implementation-strategy.md"]
|
||||
code["04-code-samples.md"]
|
||||
config["05-configuration-reference.md"]
|
||||
end
|
||||
|
||||
subgraph deploy["Deployment & Planning"]
|
||||
phases["06-implementation-phases.md"]
|
||||
backends["07-observability-backends.md"]
|
||||
appendix["08-appendix.md"]
|
||||
end
|
||||
|
||||
overview --> analysis
|
||||
overview --> impl
|
||||
overview --> deploy
|
||||
|
||||
arch --> design
|
||||
design --> strategy
|
||||
strategy --> code
|
||||
code --> config
|
||||
config --> phases
|
||||
phases --> backends
|
||||
backends --> appendix
|
||||
|
||||
style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
|
||||
style analysis fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style impl fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style deploy fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style arch fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style design fill:#0d47a1,stroke:#082f6a,color:#fff
|
||||
style strategy fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style code fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style config fill:#bf360c,stroke:#8c2809,color:#fff
|
||||
style phases fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style backends fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
| Section | Document | Description |
|
||||
| ------- | ---------------------------------------------------------- | ---------------------------------------------------------------------- |
|
||||
| **1** | [Architecture Analysis](./01-architecture-analysis.md) | rippled component analysis, trace points, instrumentation priorities |
|
||||
| **2** | [Design Decisions](./02-design-decisions.md) | SDK selection, exporters, span naming, attributes, context propagation |
|
||||
| **3** | [Implementation Strategy](./03-implementation-strategy.md) | Directory structure, key principles, performance optimization |
|
||||
| **4** | [Code Samples](./04-code-samples.md) | Complete C++ implementation examples for all components |
|
||||
| **5** | [Configuration Reference](./05-configuration-reference.md) | rippled config, CMake integration, Collector configurations |
|
||||
| **6** | [Implementation Phases](./06-implementation-phases.md) | 5-phase timeline, tasks, risks, success metrics |
|
||||
| **7** | [Observability Backends](./07-observability-backends.md) | Backend selection guide and production architecture |
|
||||
| **8** | [Appendix](./08-appendix.md) | Glossary, references, version history |
|
||||
|
||||
---
|
||||
|
||||
## 1. Architecture Analysis
|
||||
|
||||
The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
|
||||
|
||||
Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, and ledger building. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.
|
||||
|
||||
➡️ **[Read full Architecture Analysis](./01-architecture-analysis.md)**
|
||||
|
||||
---
|
||||
|
||||
## 2. Design Decisions
|
||||
|
||||
The OpenTelemetry C++ SDK is selected for its CNCF backing, active development, and native performance characteristics. Traces are exported via OTLP/gRPC (primary) or OTLP/HTTP (fallback) to an OpenTelemetry Collector, which provides flexible routing and sampling.
|
||||
|
||||
Span naming follows a hierarchical `<component>.<operation>` convention (e.g., `rpc.submit`, `tx.relay`, `consensus.round`). Context propagation uses W3C Trace Context headers for HTTP and embedded Protocol Buffer fields for P2P messages. The implementation coexists with existing PerfLog and Insight observability systems through correlation IDs.
|
||||
|
||||
**Data Collection & Privacy**: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control(not penned down in this document yet) over what data is exported.
|
||||
|
||||
➡️ **[Read full Design Decisions](./02-design-decisions.md)**
|
||||
|
||||
---
|
||||
|
||||
## 3. Implementation Strategy
|
||||
|
||||
The telemetry code is organized under `include/xrpl/telemetry/` for headers and `src/libxrpl/telemetry/` for implementation. Key principles include RAII-based span management via `SpanGuard`, conditional compilation with `XRPL_ENABLE_TELEMETRY`, and minimal runtime overhead through batch processing and efficient sampling.
|
||||
|
||||
Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
|
||||
|
||||
➡️ **[Read full Implementation Strategy](./03-implementation-strategy.md)**
|
||||
|
||||
---
|
||||
|
||||
## 4. Code Samples
|
||||
|
||||
Complete C++ implementation examples are provided for all telemetry components:
|
||||
|
||||
- `Telemetry.h` - Core interface for tracer access and span creation
|
||||
- `SpanGuard.h` - RAII wrapper for automatic span lifecycle management
|
||||
- `TracingInstrumentation.h` - Macros for conditional instrumentation
|
||||
- Protocol Buffer extensions for trace context propagation
|
||||
- Module-specific instrumentation (RPC, Consensus, P2P, JobQueue)
|
||||
|
||||
➡️ **[View all Code Samples](./04-code-samples.md)**
|
||||
|
||||
---
|
||||
|
||||
## 5. Configuration Reference
|
||||
|
||||
Configuration is handled through the `[telemetry]` section in `xrpld.cfg` with options for enabling/disabling, exporter selection, endpoint configuration, sampling ratios, and component-level filtering. CMake integration includes a `XRPL_ENABLE_TELEMETRY` option for compile-time control.
|
||||
|
||||
OpenTelemetry Collector configurations are provided for development (with Jaeger) and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup.
|
||||
|
||||
➡️ **[View full Configuration Reference](./05-configuration-reference.md)**
|
||||
|
||||
---
|
||||
|
||||
## 6. Implementation Phases
|
||||
|
||||
The implementation spans 9 weeks across 5 phases:
|
||||
|
||||
| Phase | Duration | Focus | Key Deliverables |
|
||||
| ----- | --------- | ------------------- | --------------------------------------------------- |
|
||||
| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
|
||||
| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
|
||||
| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
|
||||
| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
|
||||
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
|
||||
|
||||
**Total Effort**: 47 developer-days with 2 developers
|
||||
|
||||
➡️ **[View full Implementation Phases](./06-implementation-phases.md)**
|
||||
|
||||
---
|
||||
|
||||
## 7. Observability Backends
|
||||
|
||||
For development and testing, Jaeger provides easy setup with a good UI. For production deployments, Grafana Tempo is recommended for its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure.
|
||||
|
||||
The recommended production architecture uses a gateway collector pattern with regional collectors performing tail-based sampling, routing traces to multiple backends (Tempo for primary storage, Elastic for log correlation, S3/GCS for long-term archive).
|
||||
|
||||
➡️ **[View Observability Backend Recommendations](./07-observability-backends.md)**
|
||||
|
||||
---
|
||||
|
||||
## 8. Appendix
|
||||
|
||||
The appendix contains a glossary of OpenTelemetry and rippled-specific terms, references to external documentation and specifications, version history for this implementation plan, and a complete document index.
|
||||
|
||||
➡️ **[View Appendix](./08-appendix.md)**
|
||||
|
||||
---
|
||||
|
||||
_This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._
|
||||
@@ -1,610 +0,0 @@
|
||||
# OpenTelemetry POC Task List
|
||||
|
||||
> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Jaeger, viewable in a browser UI.
|
||||
>
|
||||
> **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.
|
||||
|
||||
### Related Plan Documents
|
||||
|
||||
| Document | Relevance to POC |
|
||||
| ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Core concepts: traces, spans, context propagation, sampling |
|
||||
| [01-architecture-analysis.md](./01-architecture-analysis.md) | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
|
||||
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
|
||||
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
|
||||
| [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard (§4.2), macros (§4.3), RPC instrumentation (§4.5.3) |
|
||||
| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
|
||||
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
|
||||
| [07-observability-backends.md](./07-observability-backends.md) | Jaeger dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |
|
||||
|
||||
---
|
||||
|
||||
## Task 0: Docker Observability Stack Setup
|
||||
|
||||
**Objective**: Stand up the backend infrastructure to receive, store, and display traces.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Create `docker/telemetry/docker-compose.yml` in the repo with three services:
|
||||
1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:latest`)
|
||||
- Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP)
|
||||
- Expose port `13133` (health check)
|
||||
- Mount a config file `docker/telemetry/otel-collector-config.yaml`
|
||||
2. **Jaeger** (`jaegertracing/all-in-one:latest`)
|
||||
- Expose port `16686` (UI) and `14250` (gRPC collector)
|
||||
- Set env `COLLECTOR_OTLP_ENABLED=true`
|
||||
3. **Grafana** (`grafana/grafana:latest`) — optional but useful
|
||||
- Expose port `3000`
|
||||
- Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`)
|
||||
- Provision Jaeger as a data source via `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`
|
||||
|
||||
- Create `docker/telemetry/otel-collector-config.yaml`:
|
||||
|
||||
```yaml
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
|
||||
processors:
|
||||
batch:
|
||||
timeout: 1s
|
||||
send_batch_size: 100
|
||||
|
||||
exporters:
|
||||
logging:
|
||||
verbosity: detailed
|
||||
otlp/jaeger:
|
||||
endpoint: jaeger:4317
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
service:
|
||||
pipelines:
|
||||
traces:
|
||||
receivers: [otlp]
|
||||
processors: [batch]
|
||||
exporters: [logging, otlp/jaeger]
|
||||
```
|
||||
|
||||
- Create Grafana Jaeger datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`:
|
||||
```yaml
|
||||
apiVersion: 1
|
||||
datasources:
|
||||
- name: Jaeger
|
||||
type: jaeger
|
||||
access: proxy
|
||||
url: http://jaeger:16686
|
||||
```
|
||||
|
||||
**Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then:
|
||||
|
||||
- `curl http://localhost:13133` returns healthy (Collector)
|
||||
- `http://localhost:16686` opens Jaeger UI (no traces yet)
|
||||
- `http://localhost:3000` opens Grafana (optional)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Jaeger exporter)
|
||||
- [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment
|
||||
- [07-observability-backends.md §7.1](./07-observability-backends.md) — Jaeger quick start and backend selection
|
||||
- [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Add OpenTelemetry C++ SDK Dependency
|
||||
|
||||
**Objective**: Make `opentelemetry-cpp` available to the build system.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `conanfile.py` to add `opentelemetry-cpp` as an **optional** dependency. The gRPC otel plugin flag (`"grpc/*:otel_plugin": False`) in the existing conanfile may need to remain false — we pull the OTel SDK separately.
|
||||
- Add a Conan option: `with_telemetry = [True, False]` defaulting to `False`
|
||||
- When `with_telemetry` is `True`, add `opentelemetry-cpp` to `self.requires()`
|
||||
- Required OTel Conan components: `opentelemetry-cpp` (which bundles api, sdk, and exporters). If the package isn't in Conan Center, consider using `FetchContent` in CMake or building from source as a fallback.
|
||||
- Edit `CMakeLists.txt`:
|
||||
- Add option: `option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)`
|
||||
- When ON, `find_package(opentelemetry-cpp CONFIG REQUIRED)` and add compile definition `XRPL_ENABLE_TELEMETRY`
|
||||
- When OFF, do nothing (zero build impact)
|
||||
- Verify the build succeeds with `-DXRPL_ENABLE_TELEMETRY=OFF` (no regressions) and with `-DXRPL_ENABLE_TELEMETRY=ON` (SDK links successfully).
|
||||
|
||||
**Key files**:
|
||||
|
||||
- `conanfile.py`
|
||||
- `CMakeLists.txt`
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [05-configuration-reference.md §5.4](./05-configuration-reference.md) — CMake integration, `FindOpenTelemetry.cmake`, `XRPL_ENABLE_TELEMETRY` option
|
||||
- [03-implementation-strategy.md §3.2](./03-implementation-strategy.md) — Key principle: zero-cost when disabled via compile-time flags
|
||||
- [02-design-decisions.md §2.1](./02-design-decisions.md) — SDK selection rationale and required OTel components
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Create Core Telemetry Interface and NullTelemetry
|
||||
|
||||
**Objective**: Define the `Telemetry` abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Create `include/xrpl/telemetry/Telemetry.h`:
|
||||
- Define `namespace xrpl::telemetry`
|
||||
- Define `struct Telemetry::Setup` holding: `enabled`, `exporterEndpoint`, `samplingRatio`, `serviceName`, `serviceVersion`, `serviceInstanceId`, `traceRpc`, `traceTransactions`, `traceConsensus`, `tracePeer`
|
||||
- Define abstract `class Telemetry` with:
|
||||
- `virtual void start() = 0;`
|
||||
- `virtual void stop() = 0;`
|
||||
- `virtual bool isEnabled() const = 0;`
|
||||
- `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "rippled") = 0;`
|
||||
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;`
|
||||
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;`
|
||||
- `virtual bool shouldTraceRpc() const = 0;`
|
||||
- `virtual bool shouldTraceTransactions() const = 0;`
|
||||
- `virtual bool shouldTraceConsensus() const = 0;`
|
||||
- Factory: `std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);`
|
||||
- Config parser: `Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);`
|
||||
|
||||
- Create `include/xrpl/telemetry/SpanGuard.h`:
|
||||
- RAII guard that takes an `nostd::shared_ptr<Span>`, creates a `Scope`, and calls `span->End()` in destructor.
|
||||
- Convenience: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`
|
||||
- See [04-code-samples.md](./04-code-samples.md) §4.2 for the full implementation.
|
||||
|
||||
- Create `src/libxrpl/telemetry/NullTelemetry.cpp`:
|
||||
- Implements `Telemetry` with all no-ops.
|
||||
- `isEnabled()` returns `false`, `startSpan()` returns a noop span.
|
||||
- This is used when `XRPL_ENABLE_TELEMETRY` is OFF or `enabled=0` in config.
|
||||
|
||||
- Guard all OTel SDK headers behind `#ifdef XRPL_ENABLE_TELEMETRY`. The `NullTelemetry` implementation should compile without the OTel SDK present.
|
||||
|
||||
**Key new files**:
|
||||
|
||||
- `include/xrpl/telemetry/Telemetry.h`
|
||||
- `include/xrpl/telemetry/SpanGuard.h`
|
||||
- `src/libxrpl/telemetry/NullTelemetry.cpp`
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [04-code-samples.md §4.1](./04-code-samples.md) — Full `Telemetry` interface with `Setup` struct, lifecycle, tracer access, span creation, and component filtering methods
|
||||
- [04-code-samples.md §4.2](./04-code-samples.md) — Full `SpanGuard` RAII implementation and `NullSpanGuard` no-op class
|
||||
- [03-implementation-strategy.md §3.1](./03-implementation-strategy.md) — Directory structure: `include/xrpl/telemetry/` for headers, `src/libxrpl/telemetry/` for implementation
|
||||
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation and zero-cost compile-time disabled pattern
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Implement OTel-Backed Telemetry
|
||||
|
||||
**Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Create `src/libxrpl/telemetry/Telemetry.cpp` (compiled only when `XRPL_ENABLE_TELEMETRY=ON`):
|
||||
- `class TelemetryImpl : public Telemetry` that:
|
||||
- In `start()`: creates a `TracerProvider` with:
|
||||
- Resource attributes: `service.name`, `service.version`, `service.instance.id`
|
||||
- An `OtlpGrpcExporter` pointed at `setup.exporterEndpoint` (default `localhost:4317`)
|
||||
- A `BatchSpanProcessor` with configurable batch size and delay
|
||||
- A `TraceIdRatioBasedSampler` using `setup.samplingRatio`
|
||||
- Sets the global `TracerProvider`
|
||||
- In `stop()`: calls `ForceFlush()` then shuts down the provider
|
||||
- In `startSpan()`: delegates to `getTracer()->StartSpan(name, ...)`
|
||||
- `shouldTraceRpc()` etc. read from `Setup` fields
|
||||
|
||||
- Create `src/libxrpl/telemetry/TelemetryConfig.cpp`:
|
||||
- `setup_Telemetry()` parses the `[telemetry]` config section from `xrpld.cfg`
|
||||
- Maps config keys: `enabled`, `exporter`, `endpoint`, `sampling_ratio`, `trace_rpc`, `trace_transactions`, `trace_consensus`, `trace_peer`
|
||||
|
||||
- Wire `make_Telemetry()` factory:
|
||||
- If `setup.enabled` is true AND `XRPL_ENABLE_TELEMETRY` is defined: return `TelemetryImpl`
|
||||
- Otherwise: return `NullTelemetry`
|
||||
|
||||
- Add telemetry source files to CMake. When `XRPL_ENABLE_TELEMETRY=ON`, compile `Telemetry.cpp` and `TelemetryConfig.cpp` and link against `opentelemetry-cpp::api`, `opentelemetry-cpp::sdk`, `opentelemetry-cpp::otlp_grpc_exporter`. When OFF, compile only `NullTelemetry.cpp`.
|
||||
|
||||
**Key new files**:
|
||||
|
||||
- `src/libxrpl/telemetry/Telemetry.cpp`
|
||||
- `src/libxrpl/telemetry/TelemetryConfig.cpp`
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `CMakeLists.txt` (add telemetry library target)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [04-code-samples.md §4.1](./04-code-samples.md) — `Telemetry` interface that `TelemetryImpl` must implement
|
||||
- [05-configuration-reference.md §5.2](./05-configuration-reference.md) — `setup_Telemetry()` config parser implementation
|
||||
- [02-design-decisions.md §2.2](./02-design-decisions.md) — OTLP/gRPC exporter config (endpoint, TLS options)
|
||||
- [02-design-decisions.md §2.4.1](./02-design-decisions.md) — Resource attributes: `service.name`, `service.version`, `service.instance.id`, `xrpl.network.id`
|
||||
- [03-implementation-strategy.md §3.4](./03-implementation-strategy.md) — Per-operation CPU costs and overhead budget for span creation
|
||||
- [03-implementation-strategy.md §3.5](./03-implementation-strategy.md) — Memory overhead: static (~456 KB) and dynamic (~1.2 MB) budgets
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Integrate Telemetry into Application Lifecycle
|
||||
|
||||
**Objective**: Wire the `Telemetry` object into `Application` so all components can access it.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `src/xrpld/app/main/Application.h`:
|
||||
- Forward-declare `namespace xrpl::telemetry { class Telemetry; }`
|
||||
- Add pure virtual method: `virtual telemetry::Telemetry& getTelemetry() = 0;`
|
||||
|
||||
- Edit `src/xrpld/app/main/Application.cpp` (the `ApplicationImp` class):
|
||||
- Add member: `std::unique_ptr<telemetry::Telemetry> telemetry_;`
|
||||
- In the constructor, after config is loaded and node identity is known:
|
||||
```cpp
|
||||
auto const telemetrySection = config_->section("telemetry");
|
||||
auto telemetrySetup = telemetry::setup_Telemetry(
|
||||
telemetrySection,
|
||||
toBase58(TokenType::NodePublic, nodeIdentity_.publicKey()),
|
||||
BuildInfo::getVersionString());
|
||||
telemetry_ = telemetry::make_Telemetry(telemetrySetup, logs_->journal("Telemetry"));
|
||||
```
|
||||
- In `start()`: call `telemetry_->start()` early
|
||||
- In `stop()` or destructor: call `telemetry_->stop()` late (to flush pending spans)
|
||||
- Implement `getTelemetry()` override: return `*telemetry_`
|
||||
|
||||
- Add `[telemetry]` section to the example config `cfg/rippled-example.cfg`:
|
||||
```ini
|
||||
# [telemetry]
|
||||
# enabled=1
|
||||
# endpoint=localhost:4317
|
||||
# sampling_ratio=1.0
|
||||
# trace_rpc=1
|
||||
```
|
||||
|
||||
**Key modified files**:
|
||||
|
||||
- `src/xrpld/app/main/Application.h`
|
||||
- `src/xrpld/app/main/Application.cpp`
|
||||
- `cfg/rippled-example.cfg` (or equivalent example config)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [05-configuration-reference.md §5.3](./05-configuration-reference.md) — `ApplicationImp` changes: member declaration, constructor init, `start()`/`stop()` wiring, `getTelemetry()` override
|
||||
- [05-configuration-reference.md §5.1](./05-configuration-reference.md) — `[telemetry]` config section format and all option defaults
|
||||
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact assessment: `Application.cpp` ~15 lines added, ~3 changed (Low risk)
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Create Instrumentation Macros
|
||||
|
||||
**Objective**: Define convenience macros that make instrumenting code one-liners, and that compile to zero-cost no-ops when telemetry is disabled.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Create `src/xrpld/telemetry/TracingInstrumentation.h`:
|
||||
- When `XRPL_ENABLE_TELEMETRY` is defined:
|
||||
|
||||
```cpp
|
||||
#define XRPL_TRACE_SPAN(telemetry, name) \
|
||||
auto _xrpl_span_ = (telemetry).startSpan(name); \
|
||||
::xrpl::telemetry::SpanGuard _xrpl_guard_(_xrpl_span_)
|
||||
|
||||
#define XRPL_TRACE_RPC(telemetry, name) \
|
||||
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
|
||||
if ((telemetry).shouldTraceRpc()) { \
|
||||
_xrpl_guard_.emplace((telemetry).startSpan(name)); \
|
||||
}
|
||||
|
||||
#define XRPL_TRACE_SET_ATTR(key, value) \
|
||||
if (_xrpl_guard_.has_value()) { \
|
||||
_xrpl_guard_->setAttribute(key, value); \
|
||||
}
|
||||
|
||||
#define XRPL_TRACE_EXCEPTION(e) \
|
||||
if (_xrpl_guard_.has_value()) { \
|
||||
_xrpl_guard_->recordException(e); \
|
||||
}
|
||||
```
|
||||
|
||||
- When `XRPL_ENABLE_TELEMETRY` is NOT defined, all macros expand to `((void)0)`
|
||||
|
||||
**Key new file**:
|
||||
|
||||
- `src/xrpld/telemetry/TracingInstrumentation.h`
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [04-code-samples.md §4.3](./04-code-samples.md) — Full macro definitions for `XRPL_TRACE_SPAN`, `XRPL_TRACE_RPC`, `XRPL_TRACE_CONSENSUS`, `XRPL_TRACE_SET_ATTR`, `XRPL_TRACE_EXCEPTION` with both enabled and disabled branches
|
||||
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: compile-time `#ifndef` and runtime `shouldTrace*()` checks
|
||||
- [03-implementation-strategy.md §3.9.7](./03-implementation-strategy.md) — Before/after code examples showing minimal intrusiveness (~1-3 lines per instrumentation point)
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Instrument RPC ServerHandler
|
||||
|
||||
**Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
|
||||
- `#include` the `TracingInstrumentation.h` header
|
||||
- In `ServerHandler::onRequest(Session& session)`:
|
||||
- At the top of the method, add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request");`
|
||||
- After the RPC command name is extracted, set attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command);`
|
||||
- After the response status is known, set: `XRPL_TRACE_SET_ATTR("http.status_code", static_cast<int64_t>(statusCode));`
|
||||
- Wrap error paths with: `XRPL_TRACE_EXCEPTION(e);`
|
||||
- In `ServerHandler::processRequest(...)`:
|
||||
- Add a child span: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.process");`
|
||||
- Set method attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.method", request_method);`
|
||||
- In `ServerHandler::onWSMessage(...)` (WebSocket path):
|
||||
- Add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.ws.message");`
|
||||
|
||||
- The goal is to see spans like:
|
||||
```
|
||||
rpc.request
|
||||
└── rpc.process
|
||||
```
|
||||
in Jaeger for every HTTP RPC call.
|
||||
|
||||
**Key modified file**:
|
||||
|
||||
- `src/xrpld/rpc/detail/ServerHandler.cpp` (~15-25 lines added)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample with W3C header extraction, span creation, attribute setting, and error handling
|
||||
- [01-architecture-analysis.md §1.5](./01-architecture-analysis.md) — RPC request flow diagram: HTTP request -> attributes -> jobqueue.enqueue -> rpc.command -> response
|
||||
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.request` in `ServerHandler.cpp::onRequest()` (Priority: High)
|
||||
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming convention: `rpc.request`, `rpc.command.*`
|
||||
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC span attributes: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.params`
|
||||
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact: `ServerHandler.cpp` ~40 lines added, ~10 changed (Low risk)
|
||||
|
||||
---
|
||||
|
||||
## Task 7: Instrument RPC Command Execution
|
||||
|
||||
**Objective**: Add per-command tracing inside the RPC handler so each command (e.g., `submit`, `account_info`, `server_info`) gets its own child span.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
|
||||
- `#include` the `TracingInstrumentation.h` header
|
||||
- In `doCommand(RPC::JsonContext& context, Json::Value& result)`:
|
||||
- At the top: `XRPL_TRACE_RPC(context.app.getTelemetry(), "rpc.command." + context.method);`
|
||||
- Set attributes:
|
||||
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", context.method);`
|
||||
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));`
|
||||
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
|
||||
- On success: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");`
|
||||
- On error: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");` and set the error message
|
||||
|
||||
- After this, traces in Jaeger should look like:
|
||||
```
|
||||
rpc.request (xrpl.rpc.command=account_info)
|
||||
└── rpc.process
|
||||
└── rpc.command.account_info (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
|
||||
```
|
||||
|
||||
**Key modified file**:
|
||||
|
||||
- `src/xrpld/rpc/detail/RPCHandler.cpp` (~15-20 lines added)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [04-code-samples.md §4.5.3](./04-code-samples.md) — `ServerHandler::onRequest()` code sample (includes child span pattern for `rpc.command.*`)
|
||||
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming: `rpc.command.*` pattern with dynamic command name (e.g., `rpc.command.server_info`)
|
||||
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC attribute schema: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`
|
||||
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.command.*` in `RPCHandler.cpp::doCommand()` (Priority: High)
|
||||
- [02-design-decisions.md §2.6.5](./02-design-decisions.md) — Correlation with PerfLog: how `doCommand()` can link trace_id with existing PerfLog entries
|
||||
- [03-implementation-strategy.md §3.4.4](./03-implementation-strategy.md) — RPC request overhead budget: ~1.75 μs total per request
|
||||
|
||||
---
|
||||
|
||||
## Task 8: Build, Run, and Verify End-to-End
|
||||
|
||||
**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Jaeger displays them.
|
||||
|
||||
**What to do**:
|
||||
|
||||
1. **Start the Docker stack**:
|
||||
|
||||
```bash
|
||||
docker compose -f docker/telemetry/docker-compose.yml up -d
|
||||
```
|
||||
|
||||
Verify Collector health: `curl http://localhost:13133`
|
||||
|
||||
2. **Build rippled with telemetry**:
|
||||
|
||||
```bash
|
||||
# Adjust for your actual build workflow
|
||||
conan install . --build=missing -o with_telemetry=True
|
||||
cmake --preset default -DXRPL_ENABLE_TELEMETRY=ON
|
||||
cmake --build --preset default
|
||||
```
|
||||
|
||||
3. **Configure rippled**:
|
||||
Add to `rippled.cfg` (or your local test config):
|
||||
|
||||
```ini
|
||||
[telemetry]
|
||||
enabled=1
|
||||
endpoint=localhost:4317
|
||||
sampling_ratio=1.0
|
||||
trace_rpc=1
|
||||
```
|
||||
|
||||
4. **Start rippled** in standalone mode:
|
||||
|
||||
```bash
|
||||
./rippled --conf rippled.cfg -a --start
|
||||
```
|
||||
|
||||
5. **Generate RPC traffic**:
|
||||
|
||||
```bash
|
||||
# server_info
|
||||
curl -s -X POST http://localhost:5005 \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"method":"server_info","params":[{}]}'
|
||||
|
||||
# ledger
|
||||
curl -s -X POST http://localhost:5005 \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"method":"ledger","params":[{"ledger_index":"current"}]}'
|
||||
|
||||
# account_info (will error in standalone, that's fine — we trace errors too)
|
||||
curl -s -X POST http://localhost:5005 \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
|
||||
```
|
||||
|
||||
6. **Verify in Jaeger**:
|
||||
- Open `http://localhost:16686`
|
||||
- Select service `rippled` from the dropdown
|
||||
- Click "Find Traces"
|
||||
- Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
|
||||
- Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`
|
||||
|
||||
7. **Verify zero-overhead when disabled**:
|
||||
- Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`, or set `enabled=0` in config
|
||||
- Run the same RPC calls
|
||||
- Confirm no new traces appear and no errors in rippled logs
|
||||
|
||||
**Verification Checklist**:
|
||||
|
||||
- [ ] Docker stack starts without errors
|
||||
- [ ] rippled builds with `-DXRPL_ENABLE_TELEMETRY=ON`
|
||||
- [ ] rippled starts and connects to OTel Collector (check rippled logs for telemetry messages)
|
||||
- [ ] Traces appear in Jaeger UI under service "rippled"
|
||||
- [ ] Span hierarchy is correct (parent-child relationships)
|
||||
- [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
|
||||
- [ ] Error spans show error status and message
|
||||
- [ ] Building with `XRPL_ENABLE_TELEMETRY=OFF` produces no regressions
|
||||
- [ ] Setting `enabled=0` at runtime produces no traces and no errors
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Jaeger, config validation passes
|
||||
- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
|
||||
- [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2%
|
||||
- [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary
|
||||
- [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect
|
||||
|
||||
---
|
||||
|
||||
## Task 9: Document POC Results and Next Steps
|
||||
|
||||
**Objective**: Capture findings, screenshots, and remaining work for the team.
|
||||
|
||||
**What to do**:
|
||||
|
||||
- Take screenshots of Jaeger showing:
|
||||
- The service list with "rippled"
|
||||
- A trace with the full span tree
|
||||
- Span detail view showing attributes
|
||||
- Document any issues encountered (build issues, SDK quirks, missing attributes)
|
||||
- Note performance observations (build time impact, any noticeable runtime overhead)
|
||||
- Write a short summary of what the POC proves and what it doesn't cover yet:
|
||||
- **Proves**: OTel SDK integrates with rippled, OTLP export works, RPC traces visible
|
||||
- **Doesn't cover**: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
|
||||
- Outline next steps (mapping to the full plan phases):
|
||||
- [Phase 2](./06-implementation-phases.md) completion: [W3C header extraction](./02-design-decisions.md) (§2.5), WebSocket tracing, all [RPC handlers](./01-architecture-analysis.md) (§1.6)
|
||||
- [Phase 3](./06-implementation-phases.md): [Protobuf `TraceContext` message](./04-code-samples.md) (§4.4), [transaction relay tracing](./04-code-samples.md) (§4.5.1) across nodes
|
||||
- [Phase 4](./06-implementation-phases.md): [Consensus round and phase tracing](./04-code-samples.md) (§4.5.2)
|
||||
- [Phase 5](./06-implementation-phases.md): [Production collector config](./05-configuration-reference.md) (§5.5.2), [Grafana dashboards](./07-observability-backends.md) (§7.6), [alerting](./07-observability-backends.md) (§7.6.3)
|
||||
|
||||
**Reference**:
|
||||
|
||||
- [06-implementation-phases.md §6.1](./06-implementation-phases.md) — Full 5-phase timeline overview and Gantt chart
|
||||
- [06-implementation-phases.md §6.10](./06-implementation-phases.md) — Crawl-Walk-Run strategy: POC is the CRAWL phase, next steps are WALK and RUN
|
||||
- [06-implementation-phases.md §6.12](./06-implementation-phases.md) — Recommended implementation order (14 steps across 9 weeks)
|
||||
- [03-implementation-strategy.md §3.9](./03-implementation-strategy.md) — Code intrusiveness assessment and risk matrix for each remaining component
|
||||
- [07-observability-backends.md §7.2](./07-observability-backends.md) — Production backend selection (Tempo, Elastic APM, Honeycomb, Datadog)
|
||||
- [02-design-decisions.md §2.5](./02-design-decisions.md) — Context propagation design: W3C HTTP headers, protobuf P2P, JobQueue internal
|
||||
- [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) — Reference for team onboarding on distributed tracing concepts
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
| Task | Description | New Files | Modified Files | Depends On |
|
||||
| ---- | ------------------------------------ | --------- | -------------- | ---------- |
|
||||
| 0 | Docker observability stack | 4 | 0 | — |
|
||||
| 1 | OTel C++ SDK dependency | 0 | 2 | — |
|
||||
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
|
||||
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
|
||||
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
|
||||
| 5 | Instrumentation macros | 1 | 0 | 2 |
|
||||
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
|
||||
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
|
||||
| 8 | End-to-end verification | 0 | 0 | 0-7 |
|
||||
| 9 | Document results and next steps | 1 | 0 | 8 |
|
||||
|
||||
**Parallel work**: Tasks 0 and 1 can run in parallel. Tasks 2 and 5 have no dependency on each other. Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.
|
||||
|
||||
---
|
||||
|
||||
## Next Steps (Post-POC)
|
||||
|
||||
### Metrics Pipeline for Grafana Dashboards
|
||||
|
||||
The current POC exports **traces only**. Grafana's Explore view can query Jaeger for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this:
|
||||
|
||||
1. **Add a `spanmetrics` connector** to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:
|
||||
|
||||
```yaml
|
||||
connectors:
|
||||
spanmetrics:
|
||||
histogram:
|
||||
explicit:
|
||||
buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
|
||||
dimensions:
|
||||
- name: xrpl.rpc.command
|
||||
- name: xrpl.rpc.status
|
||||
|
||||
exporters:
|
||||
prometheus:
|
||||
endpoint: 0.0.0.0:8889
|
||||
|
||||
service:
|
||||
pipelines:
|
||||
traces:
|
||||
receivers: [otlp]
|
||||
processors: [batch]
|
||||
exporters: [debug, otlp/jaeger, spanmetrics]
|
||||
metrics:
|
||||
receivers: [spanmetrics]
|
||||
exporters: [prometheus]
|
||||
```
|
||||
|
||||
2. **Add Prometheus** to the Docker Compose stack to scrape the collector's metrics endpoint.
|
||||
|
||||
3. **Add Prometheus as a Grafana datasource** and build dashboards for:
|
||||
- RPC request latency (p50/p95/p99) by command
|
||||
- RPC throughput (requests/sec) by command
|
||||
- Error rate by command
|
||||
- Span duration distribution
|
||||
|
||||
### Additional Instrumentation
|
||||
|
||||
- **W3C `traceparent` header extraction** in `ServerHandler` to support cross-service context propagation from external callers
|
||||
- **WebSocket RPC tracing** in `ServerHandler::onWSMessage()`
|
||||
- **Transaction relay tracing** across nodes using protobuf `TraceContext` messages
|
||||
- **Consensus round and phase tracing** for validator coordination visibility
|
||||
- **Ledger close tracing** to measure close-to-validated latency
|
||||
|
||||
### Production Hardening
|
||||
|
||||
- **Tail-based sampling** in the OTel Collector to reduce volume while retaining error/slow traces
|
||||
- **TLS configuration** for the OTLP exporter in production deployments
|
||||
- **Resource limits** on the batch processor queue to prevent unbounded memory growth
|
||||
- **Health monitoring** for the telemetry pipeline itself (collector lag, export failures)
|
||||
|
||||
### POC Lessons Learned
|
||||
|
||||
Issues encountered during POC implementation that inform future work:
|
||||
|
||||
| Issue | Resolution | Impact on Future Work |
|
||||
| -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ---------------------------------------------------------------- |
|
||||
| Conan lockfile rejected `opentelemetry-cpp/1.18.0` | Used `--lockfile=""` to bypass | Lockfile must be regenerated when adding new dependencies |
|
||||
| Conan package only builds OTLP HTTP exporter, not gRPC | Switched from gRPC to HTTP exporter (`localhost:4318/v1/traces`) | HTTP exporter is the default; gRPC requires custom Conan profile |
|
||||
| CMake target `opentelemetry-cpp::api` etc. don't exist in Conan package | Use umbrella target `opentelemetry-cpp::opentelemetry-cpp` | Conan targets differ from upstream CMake targets |
|
||||
| OTel Collector `logging` exporter deprecated | Renamed to `debug` exporter | Use `debug` in all collector configs going forward |
|
||||
| Macro parameter `telemetry` collided with `::xrpl::telemetry::` namespace | Renamed macro params to `_tel_obj_`, `_span_name_` | Avoid common words as macro parameter names |
|
||||
| `opentelemetry::trace::Scope` creates new context on move | Store scope as member, create once in constructor | SpanGuard move semantics need care with Scope lifecycle |
|
||||
| `TracerProviderFactory::Create` returns `unique_ptr<sdk::TracerProvider>`, not `nostd::shared_ptr` | Use `std::shared_ptr` member, wrap in `nostd::shared_ptr` for global provider | OTel SDK factory return types don't match API provider types |
|
||||
@@ -71,12 +71,14 @@ words:
|
||||
- coldwallet
|
||||
- compr
|
||||
- conanfile
|
||||
- cppcoro
|
||||
- conanrun
|
||||
- confs
|
||||
- connectability
|
||||
- coro
|
||||
- coros
|
||||
- cowid
|
||||
- cppcoro
|
||||
- cryptocondition
|
||||
- cryptoconditional
|
||||
- cryptoconditions
|
||||
@@ -99,11 +101,14 @@ words:
|
||||
- endmacro
|
||||
- exceptioned
|
||||
- Falco
|
||||
- fcontext
|
||||
- finalizers
|
||||
- firewalled
|
||||
- fcontext
|
||||
- fmtdur
|
||||
- fsanitize
|
||||
- funclets
|
||||
- gantt
|
||||
- gcov
|
||||
- gcovr
|
||||
- ghead
|
||||
@@ -178,7 +183,6 @@ words:
|
||||
- nixpkgs
|
||||
- nonxrp
|
||||
- noripple
|
||||
- nostd
|
||||
- nudb
|
||||
- nullptr
|
||||
- nunl
|
||||
@@ -186,6 +190,7 @@ words:
|
||||
- ostr
|
||||
- pargs
|
||||
- partitioner
|
||||
- pratik
|
||||
- paychan
|
||||
- paychans
|
||||
- permdex
|
||||
@@ -193,6 +198,7 @@ words:
|
||||
- permissioned
|
||||
- pointee
|
||||
- populator
|
||||
- pratik
|
||||
- preauth
|
||||
- preauthorization
|
||||
- preauthorize
|
||||
@@ -207,6 +213,7 @@ words:
|
||||
- queuable
|
||||
- Raphson
|
||||
- replayer
|
||||
- repost
|
||||
- rerere
|
||||
- retriable
|
||||
- RIPD
|
||||
@@ -237,6 +244,7 @@ words:
|
||||
- soci
|
||||
- socidb
|
||||
- sslws
|
||||
- stackful
|
||||
- statsd
|
||||
- STATSDCOLLECTOR
|
||||
- stissue
|
||||
@@ -308,9 +316,3 @@ words:
|
||||
- xrplf
|
||||
- xxhash
|
||||
- xxhasher
|
||||
- xychart
|
||||
- otelc
|
||||
- zpages
|
||||
- traceql
|
||||
- Gantt
|
||||
- gantt
|
||||
|
||||
687
include/xrpl/core/CoroTask.h
Normal file
687
include/xrpl/core/CoroTask.h
Normal file
@@ -0,0 +1,687 @@
|
||||
#pragma once
|
||||
|
||||
#include <coroutine>
|
||||
#include <exception>
|
||||
#include <utility>
|
||||
#include <variant>
|
||||
|
||||
namespace xrpl {
|
||||
|
||||
template <typename T = void>
|
||||
class CoroTask;
|
||||
|
||||
/**
|
||||
* CoroTask<void> -- coroutine return type for void-returning coroutines.
|
||||
*
|
||||
* Class / Dependency Diagram
|
||||
* ==========================
|
||||
*
|
||||
* CoroTask<void>
|
||||
* +-----------------------------------------------+
|
||||
* | - handle_ : Handle (coroutine_handle<promise>) |
|
||||
* +-----------------------------------------------+
|
||||
* | + handle(), done() |
|
||||
* | + await_ready/suspend/resume (Awaiter iface) |
|
||||
* +-----------------------------------------------+
|
||||
* | owns
|
||||
* v
|
||||
* promise_type
|
||||
* +-----------------------------------------------+
|
||||
* | - exception_ : std::exception_ptr |
|
||||
* | - continuation_ : std::coroutine_handle<> |
|
||||
* +-----------------------------------------------+
|
||||
* | + get_return_object() -> CoroTask |
|
||||
* | + initial_suspend() -> suspend_always (lazy) |
|
||||
* | + final_suspend() -> FinalAwaiter |
|
||||
* | + return_void() |
|
||||
* | + unhandled_exception() |
|
||||
* +-----------------------------------------------+
|
||||
* | returns at final_suspend
|
||||
* v
|
||||
* FinalAwaiter
|
||||
* +-----------------------------------------------+
|
||||
* | await_suspend(h): |
|
||||
* | if continuation_ set -> symmetric transfer |
|
||||
* | else -> noop_coroutine |
|
||||
* +-----------------------------------------------+
|
||||
*
|
||||
* Design Notes
|
||||
* ------------
|
||||
* - Lazy start: initial_suspend returns suspend_always, so the coroutine
|
||||
* body does not execute until the handle is explicitly resumed.
|
||||
* - Symmetric transfer: await_suspend returns a coroutine_handle instead
|
||||
* of void/bool, allowing the scheduler to jump directly to the next
|
||||
* coroutine without growing the call stack.
|
||||
* - Continuation chaining: when one CoroTask is co_await-ed inside
|
||||
* another, the caller's handle is stored as continuation_ so
|
||||
* FinalAwaiter can resume it when this task finishes.
|
||||
* - Move-only: the handle is exclusively owned; copy is deleted.
|
||||
*
|
||||
* Usage Examples
|
||||
* ==============
|
||||
*
|
||||
* 1. Basic void coroutine (the most common case in rippled):
|
||||
*
|
||||
* CoroTask<void> doWork(std::shared_ptr<CoroTaskRunner> runner) {
|
||||
* // do something
|
||||
* co_await runner->suspend(); // yield control
|
||||
* // resumed later via runner->post() or runner->resume()
|
||||
* co_return;
|
||||
* }
|
||||
*
|
||||
* 2. co_await-ing one CoroTask<void> from another (chaining):
|
||||
*
|
||||
* CoroTask<void> inner() {
|
||||
* // ...
|
||||
* co_return;
|
||||
* }
|
||||
* CoroTask<void> outer() {
|
||||
* co_await inner(); // continuation_ links outer -> inner
|
||||
* co_return; // FinalAwaiter resumes outer
|
||||
* }
|
||||
*
|
||||
* 3. Exceptions propagate through co_await:
|
||||
*
|
||||
* CoroTask<void> failing() {
|
||||
* throw std::runtime_error("oops");
|
||||
* co_return;
|
||||
* }
|
||||
* CoroTask<void> caller() {
|
||||
* try { co_await failing(); }
|
||||
* catch (std::runtime_error const&) { // caught here }
|
||||
* }
|
||||
*
|
||||
* Caveats / Pitfalls
|
||||
* ==================
|
||||
*
|
||||
* BUG-RISK: Dangling references in coroutine parameters.
|
||||
* Coroutine parameters are copied into the frame, but references
|
||||
* are NOT -- they are stored as-is. If the referent goes out of scope
|
||||
* before the coroutine finishes, you get use-after-free.
|
||||
*
|
||||
* // BROKEN -- local dies before coroutine runs:
|
||||
* CoroTask<void> bad(int& ref) { co_return; }
|
||||
* void launch() {
|
||||
* int local = 42;
|
||||
* auto task = bad(local); // frame stores &local
|
||||
* } // local destroyed; frame holds dangling ref
|
||||
*
|
||||
* // FIX -- pass by value, or ensure lifetime via shared_ptr.
|
||||
*
|
||||
* BUG-RISK: GCC 14 corrupts reference captures in coroutine lambdas.
|
||||
* When a lambda that returns CoroTask captures by reference ([&]),
|
||||
* GCC 14 may generate a corrupted coroutine frame. Always capture
|
||||
* by explicit pointer-to-value instead:
|
||||
*
|
||||
* // BROKEN on GCC 14:
|
||||
* jq.postCoroTask(t, n, [&](auto) -> CoroTask<void> { ... });
|
||||
*
|
||||
* // FIX -- capture pointers explicitly:
|
||||
* jq.postCoroTask(t, n, [ptr = &val](auto) -> CoroTask<void> { ... });
|
||||
*
|
||||
* BUG-RISK: Resuming a destroyed or completed CoroTask.
|
||||
* Calling handle().resume() after the coroutine has already run to
|
||||
* completion (done() == true) is undefined behavior. The CoroTaskRunner
|
||||
* guards against this with an XRPL_ASSERT, but standalone usage of
|
||||
* CoroTask must check done() before resuming.
|
||||
*
|
||||
* BUG-RISK: Moving a CoroTask that is being awaited.
|
||||
* If task A is co_await-ed by task B (so A.continuation_ == B), moving
|
||||
* or destroying A will invalidate the continuation link. Never move
|
||||
* or reassign a CoroTask while it is mid-execution or being awaited.
|
||||
*
|
||||
* LIMITATION: CoroTask is fire-and-forget for the top-level owner.
|
||||
* There is no built-in notification when the coroutine finishes.
|
||||
* The caller must use external synchronization (e.g. CoroTaskRunner::join
|
||||
* or a gate/condition_variable) to know when it is done.
|
||||
*
|
||||
* LIMITATION: No cancellation token.
|
||||
* There is no way to cancel a suspended CoroTask from outside. The
|
||||
* coroutine body must cooperatively check a flag (e.g. jq_.isStopping())
|
||||
* after each co_await and co_return early if needed.
|
||||
*
|
||||
* LIMITATION: Stackless -- cannot suspend from nested non-coroutine calls.
|
||||
* If a coroutine calls a regular function that wants to "yield", it
|
||||
* cannot. Only the immediate coroutine body can use co_await.
|
||||
* This is acceptable for rippled because all yield() sites are shallow.
|
||||
*/
|
||||
template <>
|
||||
class CoroTask<void>
|
||||
{
|
||||
public:
|
||||
struct promise_type;
|
||||
using Handle = std::coroutine_handle<promise_type>;
|
||||
|
||||
/**
|
||||
* Coroutine promise. Compiler uses this to manage coroutine state.
|
||||
* Stores the exception (if any) and the continuation handle for
|
||||
* symmetric transfer back to the awaiting coroutine.
|
||||
*/
|
||||
struct promise_type
|
||||
{
|
||||
// Captured exception from the coroutine body, rethrown in
|
||||
// await_resume() when this task is co_await-ed by a caller.
|
||||
std::exception_ptr exception_;
|
||||
|
||||
// Handle to the coroutine that is co_await-ing this task.
|
||||
// Set by await_suspend(). FinalAwaiter uses it for symmetric
|
||||
// transfer back to the caller. Null if this is a top-level task.
|
||||
std::coroutine_handle<> continuation_;
|
||||
|
||||
/**
|
||||
* Create the CoroTask return object.
|
||||
* Called by the compiler at coroutine creation.
|
||||
*/
|
||||
CoroTask
|
||||
get_return_object()
|
||||
{
|
||||
return CoroTask{Handle::from_promise(*this)};
|
||||
}
|
||||
|
||||
/**
|
||||
* Lazy start. The coroutine body does not execute until the
|
||||
* handle is explicitly resumed (e.g. by CoroTaskRunner::resume).
|
||||
*/
|
||||
std::suspend_always
|
||||
initial_suspend() noexcept
|
||||
{
|
||||
return {};
|
||||
}
|
||||
|
||||
/**
|
||||
* Awaiter returned by final_suspend(). Uses symmetric transfer:
|
||||
* if a continuation exists, transfers control directly to it
|
||||
* (tail-call, no stack growth). Otherwise returns noop_coroutine
|
||||
* so the coroutine frame stays alive for the owner to destroy.
|
||||
*/
|
||||
struct FinalAwaiter
|
||||
{
|
||||
/**
|
||||
* Always false. We need await_suspend to run for
|
||||
* symmetric transfer.
|
||||
*/
|
||||
bool
|
||||
await_ready() noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Symmetric transfer: returns the continuation handle so
|
||||
* the compiler emits a tail-call instead of a nested resume.
|
||||
* If no continuation is set, returns noop_coroutine to
|
||||
* suspend at final_suspend without destroying the frame.
|
||||
*
|
||||
* @param h Handle to this completing coroutine
|
||||
*
|
||||
* @return Continuation handle, or noop_coroutine
|
||||
*/
|
||||
std::coroutine_handle<>
|
||||
await_suspend(Handle h) noexcept
|
||||
{
|
||||
if (auto cont = h.promise().continuation_)
|
||||
return cont;
|
||||
return std::noop_coroutine();
|
||||
}
|
||||
|
||||
void
|
||||
await_resume() noexcept
|
||||
{
|
||||
}
|
||||
};
|
||||
|
||||
/**
|
||||
* Returns FinalAwaiter for symmetric transfer at coroutine end.
|
||||
*/
|
||||
FinalAwaiter
|
||||
final_suspend() noexcept
|
||||
{
|
||||
return {};
|
||||
}
|
||||
|
||||
/**
|
||||
* Called by the compiler for `co_return;` (void coroutine).
|
||||
*/
|
||||
void
|
||||
return_void()
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Called by the compiler when an exception escapes the coroutine
|
||||
* body. Captures it for later rethrowing in await_resume().
|
||||
*/
|
||||
void
|
||||
unhandled_exception()
|
||||
{
|
||||
exception_ = std::current_exception();
|
||||
}
|
||||
};
|
||||
|
||||
/**
|
||||
* Default constructor. Creates an empty (null handle) task.
|
||||
*/
|
||||
CoroTask() = default;
|
||||
|
||||
/**
|
||||
* Takes ownership of a compiler-generated coroutine handle.
|
||||
*
|
||||
* @param h Coroutine handle to own
|
||||
*/
|
||||
explicit CoroTask(Handle h) : handle_(h)
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Destroys the coroutine frame if this task owns one.
|
||||
*/
|
||||
~CoroTask()
|
||||
{
|
||||
if (handle_)
|
||||
handle_.destroy();
|
||||
}
|
||||
|
||||
/**
|
||||
* Move constructor. Transfers handle ownership, leaves other empty.
|
||||
*/
|
||||
CoroTask(CoroTask&& other) noexcept : handle_(std::exchange(other.handle_, {}))
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Move assignment. Destroys current frame (if any), takes other's.
|
||||
*/
|
||||
CoroTask&
|
||||
operator=(CoroTask&& other) noexcept
|
||||
{
|
||||
if (this != &other)
|
||||
{
|
||||
if (handle_)
|
||||
handle_.destroy();
|
||||
handle_ = std::exchange(other.handle_, {});
|
||||
}
|
||||
return *this;
|
||||
}
|
||||
|
||||
CoroTask(CoroTask const&) = delete;
|
||||
CoroTask&
|
||||
operator=(CoroTask const&) = delete;
|
||||
|
||||
/**
|
||||
* @return The underlying coroutine_handle
|
||||
*/
|
||||
Handle
|
||||
handle() const
|
||||
{
|
||||
return handle_;
|
||||
}
|
||||
|
||||
/**
|
||||
* @return true if the coroutine has run to completion (or thrown)
|
||||
*/
|
||||
bool
|
||||
done() const
|
||||
{
|
||||
return handle_ && handle_.done();
|
||||
}
|
||||
|
||||
// -- Awaiter interface: allows `co_await someCoroTask;` --
|
||||
|
||||
/**
|
||||
* Always false. This task is lazy, so co_await always suspends
|
||||
* the caller to set up the continuation link.
|
||||
*/
|
||||
bool
|
||||
await_ready() const noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Stores the caller's handle as our continuation, then returns
|
||||
* our handle for symmetric transfer (caller suspends, we resume).
|
||||
*
|
||||
* @param caller Handle of the coroutine doing co_await on us
|
||||
*
|
||||
* @return Our handle for symmetric transfer
|
||||
*/
|
||||
std::coroutine_handle<>
|
||||
await_suspend(std::coroutine_handle<> caller) noexcept
|
||||
{
|
||||
handle_.promise().continuation_ = caller;
|
||||
return handle_; // Symmetric transfer
|
||||
}
|
||||
|
||||
/**
|
||||
* Called when the caller resumes after co_await. Rethrows any
|
||||
* exception captured by unhandled_exception().
|
||||
*/
|
||||
void
|
||||
await_resume()
|
||||
{
|
||||
if (auto& ep = handle_.promise().exception_)
|
||||
std::rethrow_exception(ep);
|
||||
}
|
||||
|
||||
private:
|
||||
// Exclusively-owned coroutine handle. Null after move or default
|
||||
// construction. Destroyed in the destructor.
|
||||
Handle handle_;
|
||||
};
|
||||
|
||||
/**
|
||||
* CoroTask<T> -- coroutine return type for value-returning coroutines.
|
||||
*
|
||||
* Class / Dependency Diagram
|
||||
* ==========================
|
||||
*
|
||||
* CoroTask<T>
|
||||
* +-----------------------------------------------+
|
||||
* | - handle_ : Handle (coroutine_handle<promise>) |
|
||||
* +-----------------------------------------------+
|
||||
* | + handle(), done() |
|
||||
* | + await_ready/suspend/resume (Awaiter iface) |
|
||||
* +-----------------------------------------------+
|
||||
* | owns
|
||||
* v
|
||||
* promise_type
|
||||
* +-----------------------------------------------+
|
||||
* | - result_ : variant<monostate, T, |
|
||||
* | exception_ptr> |
|
||||
* | - continuation_ : std::coroutine_handle<> |
|
||||
* +-----------------------------------------------+
|
||||
* | + get_return_object() -> CoroTask |
|
||||
* | + initial_suspend() -> suspend_always (lazy) |
|
||||
* | + final_suspend() -> FinalAwaiter |
|
||||
* | + return_value(T) -> stores in result_[1] |
|
||||
* | + unhandled_exception -> stores in result_[2] |
|
||||
* +-----------------------------------------------+
|
||||
* | returns at final_suspend
|
||||
* v
|
||||
* FinalAwaiter (same symmetric-transfer pattern as CoroTask<void>)
|
||||
*
|
||||
* Value Extraction
|
||||
* ----------------
|
||||
* await_resume() inspects the variant:
|
||||
* - index 2 (exception_ptr) -> rethrow
|
||||
* - index 1 (T) -> return value via move
|
||||
*
|
||||
* Usage Examples
|
||||
* ==============
|
||||
*
|
||||
* 1. Simple value return:
|
||||
*
|
||||
* CoroTask<int> computeAnswer() { co_return 42; }
|
||||
*
|
||||
* CoroTask<void> caller() {
|
||||
* int v = co_await computeAnswer(); // v == 42
|
||||
* }
|
||||
*
|
||||
* 2. Chaining value-returning coroutines:
|
||||
*
|
||||
* CoroTask<int> add(int a, int b) { co_return a + b; }
|
||||
* CoroTask<int> doubleSum(int a, int b) {
|
||||
* int s = co_await add(a, b);
|
||||
* co_return s * 2;
|
||||
* }
|
||||
*
|
||||
* 3. Exception propagation from inner to outer:
|
||||
*
|
||||
* CoroTask<int> failing() {
|
||||
* throw std::runtime_error("bad");
|
||||
* co_return 0; // never reached
|
||||
* }
|
||||
* CoroTask<void> caller() {
|
||||
* try {
|
||||
* int v = co_await failing(); // throws here
|
||||
* } catch (std::runtime_error const& e) {
|
||||
* // e.what() == "bad"
|
||||
* }
|
||||
* }
|
||||
*
|
||||
* Caveats / Pitfalls (in addition to CoroTask<void> caveats above)
|
||||
* ================================================================
|
||||
*
|
||||
* BUG-RISK: await_resume() moves the value out of the variant.
|
||||
* Calling co_await on the same CoroTask<T> instance twice is undefined
|
||||
* behavior -- the second call will see a moved-from T. CoroTask is
|
||||
* single-shot: one co_return, one co_await.
|
||||
*
|
||||
* BUG-RISK: T must be move-constructible.
|
||||
* return_value(T) takes by value and moves into the variant.
|
||||
* Types that are not movable cannot be used as T.
|
||||
*
|
||||
* LIMITATION: No co_yield support.
|
||||
* CoroTask<T> only supports a single co_return. It does not implement
|
||||
* yield_value(), so using co_yield inside a CoroTask<T> coroutine is a
|
||||
* compile error. For streaming values, a different return type
|
||||
* (e.g. Generator<T>) would be needed.
|
||||
*
|
||||
* LIMITATION: Result is only accessible via co_await.
|
||||
* There is no .get() or .result() method. The value can only be
|
||||
* extracted by co_await-ing the CoroTask<T> from inside another
|
||||
* coroutine. For extracting results in non-coroutine code, pass a
|
||||
* pointer to the caller and write through it (as the tests do).
|
||||
*/
|
||||
template <typename T>
|
||||
class CoroTask
|
||||
{
|
||||
public:
|
||||
struct promise_type;
|
||||
using Handle = std::coroutine_handle<promise_type>;
|
||||
|
||||
/**
|
||||
* Coroutine promise for value-returning coroutines.
|
||||
* Stores the result as a variant: monostate (not yet set),
|
||||
* T (co_return value), or exception_ptr (unhandled exception).
|
||||
*/
|
||||
struct promise_type
|
||||
{
|
||||
// Tri-state result:
|
||||
// index 0 (monostate) -- coroutine has not yet completed
|
||||
// index 1 (T) -- co_return value stored here
|
||||
// index 2 (exception) -- unhandled exception captured here
|
||||
std::variant<std::monostate, T, std::exception_ptr> result_;
|
||||
|
||||
// Handle to the coroutine co_await-ing this task. Used by
|
||||
// FinalAwaiter for symmetric transfer. Null for top-level tasks.
|
||||
std::coroutine_handle<> continuation_;
|
||||
|
||||
/**
|
||||
* Create the CoroTask return object.
|
||||
* Called by the compiler at coroutine creation.
|
||||
*/
|
||||
CoroTask
|
||||
get_return_object()
|
||||
{
|
||||
return CoroTask{Handle::from_promise(*this)};
|
||||
}
|
||||
|
||||
/**
|
||||
* Lazy start. Coroutine body does not run until explicitly resumed.
|
||||
*/
|
||||
std::suspend_always
|
||||
initial_suspend() noexcept
|
||||
{
|
||||
return {};
|
||||
}
|
||||
|
||||
/**
|
||||
* Symmetric-transfer awaiter at coroutine completion.
|
||||
* Same pattern as CoroTask<void>::FinalAwaiter.
|
||||
*/
|
||||
struct FinalAwaiter
|
||||
{
|
||||
bool
|
||||
await_ready() noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns continuation for symmetric transfer, or
|
||||
* noop_coroutine if this is a top-level task.
|
||||
*
|
||||
* @param h Handle to this completing coroutine
|
||||
*
|
||||
* @return Continuation handle, or noop_coroutine
|
||||
*/
|
||||
std::coroutine_handle<>
|
||||
await_suspend(Handle h) noexcept
|
||||
{
|
||||
if (auto cont = h.promise().continuation_)
|
||||
return cont;
|
||||
return std::noop_coroutine();
|
||||
}
|
||||
|
||||
void
|
||||
await_resume() noexcept
|
||||
{
|
||||
}
|
||||
};
|
||||
|
||||
FinalAwaiter
|
||||
final_suspend() noexcept
|
||||
{
|
||||
return {};
|
||||
}
|
||||
|
||||
/**
|
||||
* Called by the compiler for `co_return value;`.
|
||||
* Moves the value into result_ at index 1.
|
||||
*
|
||||
* @param value The value to store
|
||||
*/
|
||||
void
|
||||
return_value(T value)
|
||||
{
|
||||
result_.template emplace<1>(std::move(value));
|
||||
}
|
||||
|
||||
/**
|
||||
* Captures unhandled exceptions at index 2 of result_.
|
||||
* Rethrown later in await_resume().
|
||||
*/
|
||||
void
|
||||
unhandled_exception()
|
||||
{
|
||||
result_.template emplace<2>(std::current_exception());
|
||||
}
|
||||
};
|
||||
|
||||
/**
|
||||
* Default constructor. Creates an empty (null handle) task.
|
||||
*/
|
||||
CoroTask() = default;
|
||||
|
||||
/**
|
||||
* Takes ownership of a compiler-generated coroutine handle.
|
||||
*
|
||||
* @param h Coroutine handle to own
|
||||
*/
|
||||
explicit CoroTask(Handle h) : handle_(h)
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Destroys the coroutine frame if this task owns one.
|
||||
*/
|
||||
~CoroTask()
|
||||
{
|
||||
if (handle_)
|
||||
handle_.destroy();
|
||||
}
|
||||
|
||||
/**
|
||||
* Move constructor. Transfers handle ownership, leaves other empty.
|
||||
*/
|
||||
CoroTask(CoroTask&& other) noexcept : handle_(std::exchange(other.handle_, {}))
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Move assignment. Destroys current frame (if any), takes other's.
|
||||
*/
|
||||
CoroTask&
|
||||
operator=(CoroTask&& other) noexcept
|
||||
{
|
||||
if (this != &other)
|
||||
{
|
||||
if (handle_)
|
||||
handle_.destroy();
|
||||
handle_ = std::exchange(other.handle_, {});
|
||||
}
|
||||
return *this;
|
||||
}
|
||||
|
||||
CoroTask(CoroTask const&) = delete;
|
||||
CoroTask&
|
||||
operator=(CoroTask const&) = delete;
|
||||
|
||||
/**
|
||||
* @return The underlying coroutine_handle
|
||||
*/
|
||||
Handle
|
||||
handle() const
|
||||
{
|
||||
return handle_;
|
||||
}
|
||||
|
||||
/**
|
||||
* @return true if the coroutine has run to completion (or thrown)
|
||||
*/
|
||||
bool
|
||||
done() const
|
||||
{
|
||||
return handle_ && handle_.done();
|
||||
}
|
||||
|
||||
// -- Awaiter interface: allows `T val = co_await someCoroTask;` --
|
||||
|
||||
/**
|
||||
* Always false. co_await always suspends to set up continuation.
|
||||
*/
|
||||
bool
|
||||
await_ready() const noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Stores caller as continuation, returns our handle for
|
||||
* symmetric transfer.
|
||||
*
|
||||
* @param caller Handle of the coroutine doing co_await on us
|
||||
*
|
||||
* @return Our handle for symmetric transfer
|
||||
*/
|
||||
std::coroutine_handle<>
|
||||
await_suspend(std::coroutine_handle<> caller) noexcept
|
||||
{
|
||||
handle_.promise().continuation_ = caller;
|
||||
return handle_;
|
||||
}
|
||||
|
||||
/**
|
||||
* Extracts the result: rethrows if exception, otherwise moves
|
||||
* the T value out of the variant. Single-shot: calling twice
|
||||
* on the same task is undefined (moved-from T).
|
||||
*
|
||||
* @return The co_return-ed value
|
||||
*/
|
||||
T
|
||||
await_resume()
|
||||
{
|
||||
auto& result = handle_.promise().result_;
|
||||
if (auto* ep = std::get_if<2>(&result))
|
||||
std::rethrow_exception(*ep);
|
||||
return std::get<1>(std::move(result));
|
||||
}
|
||||
|
||||
private:
|
||||
// Exclusively-owned coroutine handle. Null after move or default
|
||||
// construction. Destroyed in the destructor.
|
||||
Handle handle_;
|
||||
};
|
||||
|
||||
} // namespace xrpl
|
||||
329
include/xrpl/core/CoroTaskRunner.ipp
Normal file
329
include/xrpl/core/CoroTaskRunner.ipp
Normal file
@@ -0,0 +1,329 @@
|
||||
#pragma once
|
||||
|
||||
/**
|
||||
* @file CoroTaskRunner.ipp
|
||||
*
|
||||
* CoroTaskRunner inline implementation.
|
||||
*
|
||||
* This file contains the business logic for managing C++20 coroutines
|
||||
* on the JobQueue. It is included at the bottom of JobQueue.h.
|
||||
*
|
||||
* Data Flow: suspend / post / resume cycle
|
||||
* =========================================
|
||||
*
|
||||
* coroutine body CoroTaskRunner JobQueue
|
||||
* -------------- -------------- --------
|
||||
* |
|
||||
* co_await runner->suspend()
|
||||
* |
|
||||
* +--- await_suspend ------> onSuspend()
|
||||
* | ++nSuspend_ ------------> nSuspend_
|
||||
* | [coroutine is now suspended]
|
||||
* |
|
||||
* . (externally or by JobQueueAwaiter)
|
||||
* .
|
||||
* +--- (caller calls) -----> post()
|
||||
* | ++runCount_
|
||||
* | addJob(resume) ----------> job enqueued
|
||||
* | |
|
||||
* | [worker picks up]
|
||||
* | |
|
||||
* +--- <----- resume() <-----------------------------------+
|
||||
* | --nSuspend_ ------> nSuspend_
|
||||
* | swap in LocalValues (lvs_)
|
||||
* | task_.handle().resume()
|
||||
* | |
|
||||
* | [coroutine body continues here]
|
||||
* | |
|
||||
* | swap out LocalValues
|
||||
* | --runCount_
|
||||
* | cv_.notify_all()
|
||||
* v
|
||||
*
|
||||
* Thread Safety
|
||||
* =============
|
||||
* - mutex_ : guards task_.handle().resume() so that post()-before-suspend
|
||||
* races cannot resume the coroutine while it is still running.
|
||||
* (See the race condition discussion in JobQueue.h)
|
||||
* - mutex_run_ : guards runCount_ counter; used by join() to wait until
|
||||
* all in-flight resume operations complete.
|
||||
* - jq_.m_mutex: guards nSuspend_ increments/decrements.
|
||||
*
|
||||
* Common Mistakes When Modifying This File
|
||||
* =========================================
|
||||
*
|
||||
* 1. Changing lock ordering.
|
||||
* resume() acquires locks in this order: jq_.m_mutex -> mutex_ -> mutex_run_.
|
||||
* post() acquires only mutex_run_. Any new code path that touches these
|
||||
* mutexes must follow the same order to avoid deadlocks.
|
||||
*
|
||||
* 2. Removing the shared_from_this() capture in post().
|
||||
* The lambda passed to addJob captures [this, sp = shared_from_this()].
|
||||
* If you remove sp, 'this' can be destroyed before the job runs,
|
||||
* causing use-after-free. The sp capture is load-bearing.
|
||||
*
|
||||
* 3. Forgetting to decrement nSuspend_ on a new code path.
|
||||
* Every ++nSuspend_ must have a matching --nSuspend_. If you add a new
|
||||
* suspension path (e.g. a new awaiter) and forget to decrement on resume
|
||||
* or on failure, JobQueue::stop() will hang.
|
||||
*
|
||||
* 4. Calling task_.handle().resume() without holding mutex_.
|
||||
* This allows a race where the coroutine runs on two threads
|
||||
* simultaneously. Always hold mutex_ around resume().
|
||||
*
|
||||
* 5. Swapping LocalValues outside of the mutex_ critical section.
|
||||
* The swap-in and swap-out of LocalValues must bracket the resume()
|
||||
* call. If you move the swap-out before the lock_guard(mutex_) is
|
||||
* released, you break LocalValue isolation for any code that runs
|
||||
* after the coroutine suspends but before the lock is dropped.
|
||||
*/
|
||||
//
|
||||
|
||||
namespace xrpl {
|
||||
|
||||
/**
|
||||
* Construct a CoroTaskRunner. Sets runCount_ to 0; does not
|
||||
* create the coroutine. Call init() afterwards.
|
||||
*
|
||||
* @param jq The JobQueue this coroutine will run on
|
||||
* @param type Job type for scheduling priority
|
||||
* @param name Human-readable name for logging
|
||||
*/
|
||||
inline JobQueue::CoroTaskRunner::CoroTaskRunner(
|
||||
create_t,
|
||||
JobQueue& jq,
|
||||
JobType type,
|
||||
std::string const& name)
|
||||
: jq_(jq), type_(type), name_(name), runCount_(0)
|
||||
{
|
||||
}
|
||||
|
||||
/**
|
||||
* Initialize with a coroutine-returning callable.
|
||||
* Stores the callable on the heap (FuncStore) so it outlives the
|
||||
* coroutine frame. Coroutine frames store a reference to the
|
||||
* callable's implicit object parameter (the lambda). If the callable
|
||||
* is a temporary, that reference dangles after the caller returns.
|
||||
* Keeping the callable alive here ensures the coroutine's captures
|
||||
* remain valid.
|
||||
*
|
||||
* @param f Callable: CoroTask<void>(shared_ptr<CoroTaskRunner>)
|
||||
*/
|
||||
template <class F>
|
||||
void
|
||||
JobQueue::CoroTaskRunner::init(F&& f)
|
||||
{
|
||||
using Fn = std::decay_t<F>;
|
||||
auto store = std::make_unique<FuncStore<Fn>>(std::forward<F>(f));
|
||||
task_ = store->func(shared_from_this());
|
||||
storedFunc_ = std::move(store);
|
||||
}
|
||||
|
||||
/**
|
||||
* Destructor. Waits for any in-flight resume() to complete, then
|
||||
* asserts (debug) that the coroutine has finished or
|
||||
* expectEarlyExit() was called.
|
||||
*
|
||||
* The join() call is necessary because with async dispatch the
|
||||
* coroutine runs on a worker thread. The gate signal (which wakes
|
||||
* the test thread) can arrive before resume() has set finished_.
|
||||
* join() synchronizes via mutex_run_, establishing a happens-before
|
||||
* edge: finished_ = true → unlock(mutex_run_) in resume() →
|
||||
* lock(mutex_run_) in join() → read finished_.
|
||||
*/
|
||||
inline JobQueue::CoroTaskRunner::~CoroTaskRunner()
|
||||
{
|
||||
#ifndef NDEBUG
|
||||
join();
|
||||
XRPL_ASSERT(finished_, "xrpl::JobQueue::CoroTaskRunner::~CoroTaskRunner : is finished");
|
||||
#endif
|
||||
}
|
||||
|
||||
/**
|
||||
* Increment the JobQueue's suspended-coroutine count (nSuspend_).
|
||||
*/
|
||||
inline void
|
||||
JobQueue::CoroTaskRunner::onSuspend()
|
||||
{
|
||||
std::lock_guard lock(jq_.m_mutex);
|
||||
++jq_.nSuspend_;
|
||||
}
|
||||
|
||||
/**
|
||||
* Decrement nSuspend_ without resuming.
|
||||
*/
|
||||
inline void
|
||||
JobQueue::CoroTaskRunner::onUndoSuspend()
|
||||
{
|
||||
std::lock_guard lock(jq_.m_mutex);
|
||||
--jq_.nSuspend_;
|
||||
}
|
||||
|
||||
/**
|
||||
* Return a SuspendAwaiter whose await_suspend() increments nSuspend_
|
||||
* before the coroutine actually suspends. The caller must later call
|
||||
* post() or resume() to continue execution.
|
||||
*
|
||||
* @return Awaiter for use with `co_await runner->suspend()`
|
||||
*/
|
||||
inline auto
|
||||
JobQueue::CoroTaskRunner::suspend()
|
||||
{
|
||||
/**
|
||||
* Custom awaiter for suspend(). Always suspends (await_ready
|
||||
* returns false) and increments nSuspend_ in await_suspend().
|
||||
*/
|
||||
struct SuspendAwaiter
|
||||
{
|
||||
CoroTaskRunner& runner_; // The runner that owns this coroutine.
|
||||
|
||||
/**
|
||||
* Always returns false so the coroutine suspends.
|
||||
*/
|
||||
bool
|
||||
await_ready() const noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Called when the coroutine suspends. Increments nSuspend_
|
||||
* so the JobQueue knows a coroutine is waiting.
|
||||
*/
|
||||
void
|
||||
await_suspend(std::coroutine_handle<>) const
|
||||
{
|
||||
runner_.onSuspend();
|
||||
}
|
||||
|
||||
void
|
||||
await_resume() const noexcept
|
||||
{
|
||||
}
|
||||
};
|
||||
return SuspendAwaiter{*this};
|
||||
}
|
||||
|
||||
/**
|
||||
* Schedule coroutine resumption as a job on the JobQueue.
|
||||
* A shared_ptr capture (sp) prevents this CoroTaskRunner from being
|
||||
* destroyed while the job is queued but not yet executed.
|
||||
*
|
||||
* @return false if the JobQueue rejected the job (shutting down)
|
||||
*/
|
||||
inline bool
|
||||
JobQueue::CoroTaskRunner::post()
|
||||
{
|
||||
{
|
||||
std::lock_guard lk(mutex_run_);
|
||||
++runCount_;
|
||||
}
|
||||
|
||||
// sp prevents 'this' from being destroyed while the job is pending
|
||||
if (jq_.addJob(type_, name_, [this, sp = shared_from_this()]() { resume(); }))
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// The coroutine will not run. Undo the runCount_ increment.
|
||||
std::lock_guard lk(mutex_run_);
|
||||
--runCount_;
|
||||
cv_.notify_all();
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Resume the coroutine on the current thread.
|
||||
*
|
||||
* Steps:
|
||||
* 1. Decrement nSuspend_ (under jq_.m_mutex)
|
||||
* 2. Swap in this coroutine's LocalValues for thread-local isolation
|
||||
* 3. Resume the coroutine handle (under mutex_)
|
||||
* 4. Swap out LocalValues, restoring the thread's previous state
|
||||
* 5. Decrement runCount_ and notify join() waiters
|
||||
*
|
||||
* Note: runCount_ is NOT incremented here — post() already did that.
|
||||
* This ensures join() stays blocked for the entire post→resume lifetime.
|
||||
*/
|
||||
inline void
|
||||
JobQueue::CoroTaskRunner::resume()
|
||||
{
|
||||
{
|
||||
std::lock_guard lock(jq_.m_mutex);
|
||||
--jq_.nSuspend_;
|
||||
}
|
||||
auto saved = detail::getLocalValues().release();
|
||||
detail::getLocalValues().reset(&lvs_);
|
||||
std::lock_guard lock(mutex_);
|
||||
XRPL_ASSERT(!task_.done(), "xrpl::JobQueue::CoroTaskRunner::resume : task is not done");
|
||||
task_.handle().resume();
|
||||
detail::getLocalValues().release();
|
||||
detail::getLocalValues().reset(saved);
|
||||
if (task_.done())
|
||||
{
|
||||
#ifndef NDEBUG
|
||||
finished_ = true;
|
||||
#endif
|
||||
// Break the shared_ptr cycle: frame -> shared_ptr<runner> -> this.
|
||||
// Use std::move (not task_ = {}) so task_.handle_ is null BEFORE the
|
||||
// frame is destroyed. operator= would destroy the frame while handle_
|
||||
// still holds the old value -- a re-entrancy hazard on GCC-12 if
|
||||
// frame destruction triggers runner cleanup.
|
||||
[[maybe_unused]] auto completed = std::move(task_);
|
||||
}
|
||||
std::lock_guard lk(mutex_run_);
|
||||
--runCount_;
|
||||
cv_.notify_all();
|
||||
}
|
||||
|
||||
/**
|
||||
* @return true if the coroutine has not yet run to completion
|
||||
*/
|
||||
inline bool
|
||||
JobQueue::CoroTaskRunner::runnable() const
|
||||
{
|
||||
// After normal completion, task_ is reset to break the shared_ptr cycle
|
||||
// (handle_ becomes null). A null handle means the coroutine is done.
|
||||
return task_.handle() && !task_.done();
|
||||
}
|
||||
|
||||
/**
|
||||
* Handle early termination when the coroutine never ran (e.g. JobQueue
|
||||
* is stopping). Decrements nSuspend_ and destroys the coroutine frame
|
||||
* to break the shared_ptr cycle: frame -> lambda -> runner -> frame.
|
||||
*/
|
||||
inline void
|
||||
JobQueue::CoroTaskRunner::expectEarlyExit()
|
||||
{
|
||||
#ifndef NDEBUG
|
||||
if (!finished_)
|
||||
#endif
|
||||
{
|
||||
std::lock_guard lock(jq_.m_mutex);
|
||||
--jq_.nSuspend_;
|
||||
#ifndef NDEBUG
|
||||
finished_ = true;
|
||||
#endif
|
||||
}
|
||||
// Break the shared_ptr cycle: frame -> shared_ptr<runner> -> this.
|
||||
// The coroutine is at initial_suspend and never ran user code, so
|
||||
// destroying it is safe. Use std::move (not task_ = {}) so
|
||||
// task_.handle_ is null before the frame is destroyed.
|
||||
{
|
||||
[[maybe_unused]] auto completed = std::move(task_);
|
||||
}
|
||||
storedFunc_.reset();
|
||||
}
|
||||
|
||||
/**
|
||||
* Block until all pending/active resume operations complete.
|
||||
* Uses cv_ + mutex_run_ to wait until runCount_ reaches 0.
|
||||
*/
|
||||
inline void
|
||||
JobQueue::CoroTaskRunner::join()
|
||||
{
|
||||
std::unique_lock<std::mutex> lk(mutex_run_);
|
||||
cv_.wait(lk, [this]() { return runCount_ == 0; });
|
||||
}
|
||||
|
||||
} // namespace xrpl
|
||||
@@ -2,6 +2,7 @@
|
||||
|
||||
#include <xrpl/basics/LocalValue.h>
|
||||
#include <xrpl/core/ClosureCounter.h>
|
||||
#include <xrpl/core/CoroTask.h>
|
||||
#include <xrpl/core/JobTypeData.h>
|
||||
#include <xrpl/core/JobTypes.h>
|
||||
#include <xrpl/core/detail/Workers.h>
|
||||
@@ -9,6 +10,7 @@
|
||||
|
||||
#include <boost/coroutine/all.hpp>
|
||||
|
||||
#include <coroutine>
|
||||
#include <set>
|
||||
|
||||
namespace xrpl {
|
||||
@@ -119,6 +121,395 @@ public:
|
||||
join();
|
||||
};
|
||||
|
||||
/** C++20 coroutine lifecycle manager. Replaces Coro for new code.
|
||||
*
|
||||
* Class / Inheritance / Dependency Diagram
|
||||
* =========================================
|
||||
*
|
||||
* std::enable_shared_from_this<CoroTaskRunner>
|
||||
* ^
|
||||
* | (public inheritance)
|
||||
* |
|
||||
* CoroTaskRunner
|
||||
* +---------------------------------------------------+
|
||||
* | - lvs_ : detail::LocalValues |
|
||||
* | - jq_ : JobQueue& |
|
||||
* | - type_ : JobType |
|
||||
* | - name_ : std::string |
|
||||
* | - runCount_ : int (in-flight resumes) |
|
||||
* | - mutex_ : std::mutex (coroutine guard) |
|
||||
* | - mutex_run_ : std::mutex (join guard) |
|
||||
* | - cv_ : condition_variable |
|
||||
* | - task_ : CoroTask<void> |
|
||||
* | - storedFunc_ : unique_ptr<FuncBase> (type-erased)|
|
||||
* +---------------------------------------------------+
|
||||
* | + init(F&&) : set up coroutine callable |
|
||||
* | + onSuspend() : ++jq_.nSuspend_ |
|
||||
* | + onUndoSuspend() : --jq_.nSuspend_ |
|
||||
* | + suspend() : returns SuspendAwaiter |
|
||||
* | + post() : schedule resume on JobQueue |
|
||||
* | + resume() : resume coroutine on caller |
|
||||
* | + runnable() : !task_.done() |
|
||||
* | + expectEarlyExit() : teardown for failed post |
|
||||
* | + join() : block until not running |
|
||||
* +---------------------------------------------------+
|
||||
* | |
|
||||
* | owns | references
|
||||
* v v
|
||||
* CoroTask<void> JobQueue
|
||||
* (coroutine frame) (thread pool + nSuspend_)
|
||||
*
|
||||
* FuncBase / FuncStore<F> (type-erased heap storage
|
||||
* for the coroutine lambda)
|
||||
*
|
||||
* Coroutine Lifecycle (Control Flow)
|
||||
* ===================================
|
||||
*
|
||||
* Caller thread JobQueue worker thread
|
||||
* ------------- ----------------------
|
||||
* postCoroTask(f)
|
||||
* |
|
||||
* +-- check stopping_ (reject if JQ shutting down)
|
||||
* +-- ++nSuspend_ (lazy start counts as suspended)
|
||||
* +-- make_shared<CoroTaskRunner>
|
||||
* +-- init(f)
|
||||
* | +-- store lambda on heap (FuncStore)
|
||||
* | +-- task_ = f(shared_from_this())
|
||||
* | [coroutine created, suspended at initial_suspend]
|
||||
* +-- post()
|
||||
* | +-- ++runCount_
|
||||
* | +-- addJob(type_, [resume]{})
|
||||
* | resume()
|
||||
* | |
|
||||
* | +-- --nSuspend_
|
||||
* | +-- swap in LocalValues
|
||||
* | +-- task_.handle().resume()
|
||||
* | | [coroutine body runs]
|
||||
* | | ...
|
||||
* | | co_await suspend()
|
||||
* | | +-- ++nSuspend_
|
||||
* | | [coroutine suspends]
|
||||
* | +-- swap out LocalValues
|
||||
* | +-- --runCount_
|
||||
* | +-- cv_.notify_all()
|
||||
* |
|
||||
* post() <-- called externally or by JobQueueAwaiter
|
||||
* +-- ++runCount_
|
||||
* +-- addJob(type_, [resume]{})
|
||||
* resume()
|
||||
* |
|
||||
* +-- [coroutine body continues]
|
||||
* +-- co_return
|
||||
* +-- --runCount_
|
||||
* +-- cv_.notify_all()
|
||||
* join()
|
||||
* +-- cv_.wait([]{runCount_ == 0})
|
||||
* +-- [done]
|
||||
*
|
||||
* Usage Examples
|
||||
* ==============
|
||||
*
|
||||
* 1. Fire-and-forget coroutine (most common pattern):
|
||||
*
|
||||
* jq.postCoroTask(jtCLIENT, "MyWork",
|
||||
* [](auto runner) -> CoroTask<void> {
|
||||
* doSomeWork();
|
||||
* co_await runner->suspend(); // yield to other jobs
|
||||
* doMoreWork();
|
||||
* co_return;
|
||||
* });
|
||||
*
|
||||
* 2. Manually controlling suspend / resume (external trigger):
|
||||
*
|
||||
* auto runner = jq.postCoroTask(jtCLIENT, "ExtTrigger",
|
||||
* [&result](auto runner) -> CoroTask<void> {
|
||||
* startAsyncOperation(callback);
|
||||
* co_await runner->suspend();
|
||||
* // callback called runner->post() to get here
|
||||
* result = collectResult();
|
||||
* co_return;
|
||||
* });
|
||||
* // ... later, from the callback:
|
||||
* runner->post(); // reschedule the coroutine on the JobQueue
|
||||
*
|
||||
* 3. Using JobQueueAwaiter for automatic suspend + repost:
|
||||
*
|
||||
* jq.postCoroTask(jtCLIENT, "AutoRepost",
|
||||
* [](auto runner) -> CoroTask<void> {
|
||||
* step1();
|
||||
* co_await JobQueueAwaiter{runner}; // yield + auto-repost
|
||||
* step2();
|
||||
* co_await JobQueueAwaiter{runner};
|
||||
* step3();
|
||||
* co_return;
|
||||
* });
|
||||
*
|
||||
* 4. Checking shutdown after co_await (cooperative cancellation):
|
||||
*
|
||||
* jq.postCoroTask(jtCLIENT, "Cancellable",
|
||||
* [&jq](auto runner) -> CoroTask<void> {
|
||||
* while (moreWork()) {
|
||||
* co_await JobQueueAwaiter{runner};
|
||||
* if (jq.isStopping())
|
||||
* co_return; // bail out cleanly
|
||||
* processNextItem();
|
||||
* }
|
||||
* co_return;
|
||||
* });
|
||||
*
|
||||
* Caveats / Pitfalls
|
||||
* ==================
|
||||
*
|
||||
* BUG-RISK: Calling suspend() without a matching post()/resume().
|
||||
* After co_await runner->suspend(), the coroutine is parked and
|
||||
* nSuspend_ is incremented. If nothing ever calls post() or
|
||||
* resume(), the coroutine is leaked and JobQueue::stop() will
|
||||
* hang forever waiting for nSuspend_ to reach zero.
|
||||
*
|
||||
* BUG-RISK: Calling post() on an already-running coroutine.
|
||||
* post() schedules a resume() job. If the coroutine has not
|
||||
* actually suspended yet (no co_await executed), the resume job
|
||||
* will try to call handle().resume() while the coroutine is still
|
||||
* running on another thread. This is UB. The mutex_ prevents
|
||||
* data corruption but the logic is wrong — always co_await
|
||||
* suspend() before calling post(). (The test testIncorrectOrder
|
||||
* shows this works only because mutex_ serializes the calls.)
|
||||
*
|
||||
* BUG-RISK: Dropping the shared_ptr<CoroTaskRunner> before join().
|
||||
* The CoroTaskRunner destructor asserts (!finished_ is false).
|
||||
* If you let the last shared_ptr die while the coroutine is still
|
||||
* running or suspended, you get an assertion failure in debug and
|
||||
* UB in release. Always call join() or expectEarlyExit() first.
|
||||
*
|
||||
* BUG-RISK: Lambda captures outliving the coroutine frame.
|
||||
* The lambda passed to postCoroTask is heap-allocated (FuncStore)
|
||||
* to prevent dangling. But objects captured by pointer still need
|
||||
* their own lifetime management. If you capture a raw pointer to
|
||||
* a stack variable, and the stack frame exits before the coroutine
|
||||
* finishes, the pointer dangles. Use shared_ptr or ensure the
|
||||
* pointed-to object outlives the coroutine.
|
||||
*
|
||||
* BUG-RISK: Forgetting co_return in a void coroutine.
|
||||
* If the coroutine body falls off the end without co_return,
|
||||
* the compiler may silently treat it as co_return (per standard),
|
||||
* but some compilers warn. Always write explicit co_return.
|
||||
*
|
||||
* LIMITATION: CoroTaskRunner only supports CoroTask<void>.
|
||||
* The task_ member is CoroTask<void>. To return values from
|
||||
* the top-level coroutine, write through a captured pointer
|
||||
* (as the tests demonstrate), or co_await inner CoroTask<T>
|
||||
* coroutines that return values.
|
||||
*
|
||||
* LIMITATION: One coroutine per CoroTaskRunner.
|
||||
* init() must be called exactly once. You cannot reuse a
|
||||
* CoroTaskRunner to run a second coroutine. Create a new one
|
||||
* via postCoroTask() instead.
|
||||
*
|
||||
* LIMITATION: No timeout on join().
|
||||
* join() blocks indefinitely. If the coroutine is suspended
|
||||
* and never posted, join() will deadlock. Use timed waits
|
||||
* on the gate pattern (condition_variable + wait_for) in tests.
|
||||
*/
|
||||
class CoroTaskRunner : public std::enable_shared_from_this<CoroTaskRunner>
|
||||
{
|
||||
private:
|
||||
// Per-coroutine thread-local storage. Swapped in before resume()
|
||||
// and swapped out after, so each coroutine sees its own LocalValue
|
||||
// state regardless of which worker thread executes it.
|
||||
detail::LocalValues lvs_;
|
||||
|
||||
// Back-reference to the owning JobQueue. Used to post jobs,
|
||||
// increment/decrement nSuspend_, and acquire jq_.m_mutex.
|
||||
JobQueue& jq_;
|
||||
|
||||
// Job type passed to addJob() when posting this coroutine.
|
||||
JobType type_;
|
||||
|
||||
// Human-readable name for this coroutine job (for logging).
|
||||
std::string name_;
|
||||
|
||||
// Number of in-flight resume operations (pending + active).
|
||||
// Incremented by post(), decremented when resume() finishes.
|
||||
// Guarded by mutex_run_. join() blocks until this reaches 0.
|
||||
//
|
||||
// A counter (not a bool) is needed because post() can be called
|
||||
// from within the coroutine body (e.g. via JobQueueAwaiter),
|
||||
// enqueuing a second resume while the first is still running.
|
||||
// A bool would be clobbered: R2.post() sets true, then R1's
|
||||
// cleanup sets false — losing the fact that R2 is still pending.
|
||||
int runCount_;
|
||||
|
||||
// Guards task_.handle().resume() to prevent the coroutine from
|
||||
// running on two threads simultaneously. Handles the race where
|
||||
// post() enqueues a resume before the coroutine has actually
|
||||
// suspended (post-before-suspend pattern).
|
||||
std::mutex mutex_;
|
||||
|
||||
// Guards runCount_. Used with cv_ for join() to wait
|
||||
// until all pending/active resume operations complete.
|
||||
std::mutex mutex_run_;
|
||||
|
||||
// Notified when runCount_ reaches zero, allowing
|
||||
// join() waiters to wake up.
|
||||
std::condition_variable cv_;
|
||||
|
||||
// The coroutine handle wrapper. Owns the coroutine frame.
|
||||
// Set by init(), reset to empty by expectEarlyExit() on
|
||||
// early termination.
|
||||
CoroTask<void> task_;
|
||||
|
||||
/**
|
||||
* Type-erased base for heap-stored callables.
|
||||
* Prevents the coroutine lambda from being destroyed before
|
||||
* the coroutine frame is done with it.
|
||||
*
|
||||
* @see FuncStore
|
||||
*/
|
||||
struct FuncBase
|
||||
{
|
||||
virtual ~FuncBase() = default;
|
||||
};
|
||||
|
||||
/**
|
||||
* Concrete type-erased storage for a callable of type F.
|
||||
* The coroutine frame stores a reference to the lambda's implicit
|
||||
* object parameter. If the lambda is a temporary, that reference
|
||||
* dangles after the call returns. FuncStore keeps it alive on
|
||||
* the heap for the lifetime of the CoroTaskRunner.
|
||||
*/
|
||||
template <class F>
|
||||
struct FuncStore : FuncBase
|
||||
{
|
||||
F func; // The stored callable (coroutine lambda).
|
||||
explicit FuncStore(F&& f) : func(std::move(f))
|
||||
{
|
||||
}
|
||||
};
|
||||
|
||||
// Heap-allocated callable storage. Set by init(), ensures the
|
||||
// lambda outlives the coroutine frame that references it.
|
||||
std::unique_ptr<FuncBase> storedFunc_;
|
||||
|
||||
#ifndef NDEBUG
|
||||
// Debug-only flag. True once the coroutine has completed or
|
||||
// expectEarlyExit() was called. Asserted in the destructor
|
||||
// to catch leaked runners.
|
||||
bool finished_ = false;
|
||||
#endif
|
||||
|
||||
public:
|
||||
/**
|
||||
* Tag type for private construction. Prevents external code
|
||||
* from constructing CoroTaskRunner directly. Use postCoroTask().
|
||||
*/
|
||||
struct create_t
|
||||
{
|
||||
explicit create_t() = default;
|
||||
};
|
||||
|
||||
/**
|
||||
* Construct a CoroTaskRunner. Private by convention (create_t tag).
|
||||
*
|
||||
* @param jq The JobQueue this coroutine will run on
|
||||
* @param type Job type for scheduling priority
|
||||
* @param name Human-readable name for logging
|
||||
*/
|
||||
CoroTaskRunner(create_t, JobQueue&, JobType, std::string const&);
|
||||
|
||||
CoroTaskRunner(CoroTaskRunner const&) = delete;
|
||||
CoroTaskRunner&
|
||||
operator=(CoroTaskRunner const&) = delete;
|
||||
|
||||
/**
|
||||
* Destructor. Asserts (debug) that the coroutine has finished
|
||||
* or expectEarlyExit() was called.
|
||||
*/
|
||||
~CoroTaskRunner();
|
||||
|
||||
/**
|
||||
* Initialize with a coroutine-returning callable.
|
||||
* Must be called exactly once, after the object is managed by
|
||||
* shared_ptr (because init uses shared_from_this internally).
|
||||
* This is handled automatically by postCoroTask().
|
||||
*
|
||||
* @param f Callable: CoroTask<void>(shared_ptr<CoroTaskRunner>)
|
||||
*/
|
||||
template <class F>
|
||||
void
|
||||
init(F&& f);
|
||||
|
||||
/**
|
||||
* Increment the JobQueue's suspended-coroutine count (nSuspend_).
|
||||
* Called when the coroutine is about to suspend. Every call
|
||||
* must be balanced by a corresponding decrement (via resume()
|
||||
* or onUndoSuspend()), or JobQueue::stop() will hang.
|
||||
*/
|
||||
void
|
||||
onSuspend();
|
||||
|
||||
/**
|
||||
* Decrement nSuspend_ without resuming.
|
||||
* Used to undo onSuspend() when a scheduled post() fails
|
||||
* (e.g. JobQueue is stopping).
|
||||
*/
|
||||
void
|
||||
onUndoSuspend();
|
||||
|
||||
/**
|
||||
* Suspend the coroutine.
|
||||
* The awaiter's await_suspend() increments nSuspend_ before the
|
||||
* coroutine actually suspends. The caller must later call post()
|
||||
* or resume() to continue execution.
|
||||
*
|
||||
* @return An awaiter for use with `co_await runner->suspend()`
|
||||
*/
|
||||
auto
|
||||
suspend();
|
||||
|
||||
/**
|
||||
* Schedule coroutine resumption as a job on the JobQueue.
|
||||
* Captures shared_from_this() to prevent this runner from being
|
||||
* destroyed while the job is queued.
|
||||
*
|
||||
* @return true if the job was accepted; false if the JobQueue
|
||||
* is stopping (caller must handle cleanup)
|
||||
*/
|
||||
bool
|
||||
post();
|
||||
|
||||
/**
|
||||
* Resume the coroutine on the current thread.
|
||||
* Decrements nSuspend_, swaps in LocalValues, resumes the
|
||||
* coroutine handle, swaps out LocalValues, and notifies join()
|
||||
* waiters. Lock ordering: mutex_run_ -> jq_.m_mutex -> mutex_.
|
||||
*/
|
||||
void
|
||||
resume();
|
||||
|
||||
/**
|
||||
* @return true if the coroutine has not yet run to completion
|
||||
*/
|
||||
bool
|
||||
runnable() const;
|
||||
|
||||
/**
|
||||
* Handle early termination when the coroutine never ran.
|
||||
* Decrements nSuspend_ and destroys the coroutine frame to
|
||||
* break the shared_ptr cycle (frame -> lambda -> runner -> frame).
|
||||
* Called by postCoroTask() when post() fails.
|
||||
*/
|
||||
void
|
||||
expectEarlyExit();
|
||||
|
||||
/**
|
||||
* Block until all pending/active resume operations complete.
|
||||
* Uses cv_ + mutex_run_ to wait until runCount_ reaches 0.
|
||||
* Warning: deadlocks if the coroutine is suspended and never posted.
|
||||
*/
|
||||
void
|
||||
join();
|
||||
};
|
||||
|
||||
using JobFunction = std::function<void()>;
|
||||
|
||||
JobQueue(
|
||||
@@ -165,6 +556,19 @@ public:
|
||||
std::shared_ptr<Coro>
|
||||
postCoro(JobType t, std::string const& name, F&& f);
|
||||
|
||||
/** Creates a C++20 coroutine and adds a job to the queue to run it.
|
||||
|
||||
@param t The type of job.
|
||||
@param name Name of the job.
|
||||
@param f Callable with signature
|
||||
CoroTask<void>(std::shared_ptr<CoroTaskRunner>).
|
||||
|
||||
@return shared_ptr to posted CoroTaskRunner. nullptr if not successful.
|
||||
*/
|
||||
template <class F>
|
||||
std::shared_ptr<CoroTaskRunner>
|
||||
postCoroTask(JobType t, std::string const& name, F&& f);
|
||||
|
||||
/** Jobs waiting at this priority.
|
||||
*/
|
||||
int
|
||||
@@ -379,6 +783,7 @@ private:
|
||||
} // namespace xrpl
|
||||
|
||||
#include <xrpl/core/Coro.ipp>
|
||||
#include <xrpl/core/CoroTaskRunner.ipp>
|
||||
|
||||
namespace xrpl {
|
||||
|
||||
@@ -401,4 +806,69 @@ JobQueue::postCoro(JobType t, std::string const& name, F&& f)
|
||||
return coro;
|
||||
}
|
||||
|
||||
// postCoroTask — entry point for launching a C++20 coroutine on the JobQueue.
|
||||
//
|
||||
// Control Flow
|
||||
// ============
|
||||
//
|
||||
// postCoroTask(t, name, f)
|
||||
// |
|
||||
// +-- 1. Check stopping_ — reject if JQ shutting down
|
||||
// |
|
||||
// +-- 2. ++nSuspend_ (mirrors Boost Coro ctor's implicit yield)
|
||||
// | The coroutine is "suspended" from the JobQueue's perspective
|
||||
// | even though it hasn't run yet — this keeps the JQ shutdown
|
||||
// | logic correct (it waits for nSuspend_ to reach 0).
|
||||
// |
|
||||
// +-- 3. Create CoroTaskRunner (shared_ptr, ref-counted)
|
||||
// |
|
||||
// +-- 4. runner->init(f)
|
||||
// | +-- Heap-allocate the lambda (FuncStore) to prevent
|
||||
// | | dangling captures in the coroutine frame
|
||||
// | +-- task_ = f(shared_from_this())
|
||||
// | [coroutine created but NOT started — lazy initial_suspend]
|
||||
// |
|
||||
// +-- 5. runner->post()
|
||||
// | +-- addJob(type_, [resume]{}) → resume on worker thread
|
||||
// | +-- failure (JQ stopping):
|
||||
// | +-- runner->expectEarlyExit()
|
||||
// | | --nSuspend_, destroy coroutine frame
|
||||
// | +-- return nullptr
|
||||
//
|
||||
// Why async post() instead of synchronous resume()?
|
||||
// ==================================================
|
||||
// The initial dispatch MUST use async post() so the coroutine body runs on
|
||||
// a JobQueue worker thread, not the caller's thread. resume() swaps the
|
||||
// caller's thread-local LocalValues with the coroutine's private copy.
|
||||
// If the coroutine mutates LocalValues (e.g. thread_specific_storage test),
|
||||
// those mutations bleed back into the caller's thread-local state after the
|
||||
// swap-out, corrupting subsequent tests that share the same thread pool.
|
||||
// Async post() avoids this by running the coroutine on a worker thread whose
|
||||
// LocalValues are managed by the thread pool, not by the caller.
|
||||
//
|
||||
template <class F>
|
||||
std::shared_ptr<JobQueue::CoroTaskRunner>
|
||||
JobQueue::postCoroTask(JobType t, std::string const& name, F&& f)
|
||||
{
|
||||
// Reject if the JQ is shutting down — matches addJob()'s stopping_ check.
|
||||
// Must check before incrementing nSuspend_ to avoid leaving an orphan
|
||||
// count that would cause stop() to hang.
|
||||
if (stopping_)
|
||||
return nullptr;
|
||||
|
||||
{
|
||||
std::lock_guard lock(m_mutex);
|
||||
++nSuspend_;
|
||||
}
|
||||
|
||||
auto runner = std::make_shared<CoroTaskRunner>(CoroTaskRunner::create_t{}, *this, t, name);
|
||||
runner->init(std::forward<F>(f));
|
||||
if (!runner->post())
|
||||
{
|
||||
runner->expectEarlyExit();
|
||||
runner.reset();
|
||||
}
|
||||
return runner;
|
||||
}
|
||||
|
||||
} // namespace xrpl
|
||||
|
||||
174
include/xrpl/core/JobQueueAwaiter.h
Normal file
174
include/xrpl/core/JobQueueAwaiter.h
Normal file
@@ -0,0 +1,174 @@
|
||||
#pragma once
|
||||
|
||||
#include <xrpl/core/JobQueue.h>
|
||||
|
||||
#include <coroutine>
|
||||
#include <memory>
|
||||
|
||||
namespace xrpl {
|
||||
|
||||
/**
|
||||
* Awaiter that suspends and immediately reschedules on the JobQueue.
|
||||
* Equivalent to calling yield() followed by post() in the old Coro API.
|
||||
*
|
||||
* Usage:
|
||||
* co_await JobQueueAwaiter{runner};
|
||||
*
|
||||
* What it waits for: The coroutine is re-queued as a job and resumes
|
||||
* when a worker thread picks it up.
|
||||
*
|
||||
* Which thread resumes: A JobQueue worker thread.
|
||||
*
|
||||
* What await_resume() returns: void.
|
||||
*
|
||||
* Dependency Diagram
|
||||
* ==================
|
||||
*
|
||||
* JobQueueAwaiter
|
||||
* +----------------------------------------------+
|
||||
* | + runner : shared_ptr<CoroTaskRunner> |
|
||||
* +----------------------------------------------+
|
||||
* | + await_ready() -> false (always suspend) |
|
||||
* | + await_suspend() -> bool (suspend or cancel) |
|
||||
* | + await_resume() -> void |
|
||||
* +----------------------------------------------+
|
||||
* | |
|
||||
* | uses | uses
|
||||
* v v
|
||||
* CoroTaskRunner JobQueue
|
||||
* .onSuspend() (via runner->post() -> addJob)
|
||||
* .onUndoSuspend()
|
||||
* .post()
|
||||
*
|
||||
* Control Flow (await_suspend)
|
||||
* ============================
|
||||
*
|
||||
* co_await JobQueueAwaiter{runner}
|
||||
* |
|
||||
* +-- await_ready() -> false
|
||||
* +-- await_suspend(handle)
|
||||
* |
|
||||
* +-- runner->onSuspend() // ++nSuspend_
|
||||
* +-- runner->post() // addJob to JobQueue
|
||||
* | |
|
||||
* | +-- success? return true // coroutine stays suspended
|
||||
* | | // worker thread will call resume()
|
||||
* | +-- failure? (JQ stopping)
|
||||
* | +-- runner->onUndoSuspend() // --nSuspend_
|
||||
* | +-- return false // coroutine continues immediately
|
||||
* | // so it can clean up and co_return
|
||||
*
|
||||
* Usage Examples
|
||||
* ==============
|
||||
*
|
||||
* 1. Yield and auto-repost (most common -- replaces yield() + post()):
|
||||
*
|
||||
* CoroTask<void> handler(auto runner) {
|
||||
* doPartA();
|
||||
* co_await JobQueueAwaiter{runner}; // yield + repost
|
||||
* doPartB(); // runs on a worker thread
|
||||
* co_return;
|
||||
* }
|
||||
*
|
||||
* 2. Multiple yield points in a loop:
|
||||
*
|
||||
* CoroTask<void> batchProcessor(auto runner) {
|
||||
* for (auto& item : items) {
|
||||
* process(item);
|
||||
* co_await JobQueueAwaiter{runner}; // let other jobs run
|
||||
* }
|
||||
* co_return;
|
||||
* }
|
||||
*
|
||||
* 3. Graceful shutdown -- checking after resume:
|
||||
*
|
||||
* CoroTask<void> longTask(auto runner, JobQueue& jq) {
|
||||
* while (hasWork()) {
|
||||
* co_await JobQueueAwaiter{runner};
|
||||
* // If JQ is stopping, await_suspend returns false and
|
||||
* // the coroutine continues immediately without re-queuing.
|
||||
* // Always check isStopping() to decide whether to proceed:
|
||||
* if (jq.isStopping())
|
||||
* co_return;
|
||||
* doNextChunk();
|
||||
* }
|
||||
* co_return;
|
||||
* }
|
||||
*
|
||||
* Caveats / Pitfalls
|
||||
* ==================
|
||||
*
|
||||
* BUG-RISK: Using a stale or null runner.
|
||||
* The runner shared_ptr must be valid and point to the CoroTaskRunner
|
||||
* that owns the coroutine currently executing. Passing a runner from
|
||||
* a different coroutine, or a default-constructed shared_ptr, is UB.
|
||||
*
|
||||
* BUG-RISK: Assuming resume happens on the same thread.
|
||||
* After co_await JobQueueAwaiter, the coroutine resumes on whatever
|
||||
* worker thread picks up the job. Do not rely on thread-local state
|
||||
* unless it is managed through LocalValue (which CoroTaskRunner
|
||||
* automatically swaps in/out).
|
||||
*
|
||||
* BUG-RISK: Ignoring the shutdown path.
|
||||
* When the JobQueue is stopping, post() fails and await_suspend()
|
||||
* returns false (coroutine does NOT actually suspend). The coroutine
|
||||
* body continues immediately on the same thread. If your code after
|
||||
* co_await assumes it was re-queued and is running on a worker thread,
|
||||
* that assumption breaks during shutdown. Always handle the "JQ is
|
||||
* stopping" case, either by checking jq.isStopping() or by letting
|
||||
* the coroutine fall through to co_return naturally.
|
||||
*
|
||||
* DIFFERENCE from runner->suspend() + runner->post():
|
||||
* JobQueueAwaiter combines both in one atomic operation. With the
|
||||
* manual suspend()/post() pattern, there is a window between the
|
||||
* two calls where an external event could race. JobQueueAwaiter
|
||||
* removes that window -- onSuspend() and post() happen within the
|
||||
* same await_suspend() call while the coroutine is guaranteed to
|
||||
* be suspended. Prefer JobQueueAwaiter unless you need an external
|
||||
* party to decide *when* to call post().
|
||||
*/
|
||||
struct JobQueueAwaiter
|
||||
{
|
||||
// The CoroTaskRunner that owns the currently executing coroutine.
|
||||
std::shared_ptr<JobQueue::CoroTaskRunner> runner;
|
||||
|
||||
/**
|
||||
* Always returns false so the coroutine suspends.
|
||||
*/
|
||||
bool
|
||||
await_ready() const noexcept
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Increment nSuspend (equivalent to yield()) and schedule resume
|
||||
* on the JobQueue (equivalent to post()). If the JobQueue is
|
||||
* stopping, undoes the suspend count and returns false so the
|
||||
* coroutine continues immediately and can clean up.
|
||||
*
|
||||
* @return true if coroutine should stay suspended (job posted);
|
||||
* false if coroutine should continue (JQ stopping)
|
||||
*/
|
||||
bool
|
||||
await_suspend(std::coroutine_handle<>)
|
||||
{
|
||||
runner->onSuspend();
|
||||
if (!runner->post())
|
||||
{
|
||||
// JobQueue is stopping. Undo the suspend count and
|
||||
// don't actually suspend — the coroutine continues
|
||||
// immediately so it can clean up and co_return.
|
||||
runner->onUndoSuspend();
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
void
|
||||
await_resume() const noexcept
|
||||
{
|
||||
}
|
||||
};
|
||||
|
||||
} // namespace xrpl
|
||||
280
presentation.md
280
presentation.md
@@ -1,280 +0,0 @@
|
||||
# OpenTelemetry Distributed Tracing for rippled
|
||||
|
||||
---
|
||||
|
||||
## Slide 1: Introduction
|
||||
|
||||
### What is OpenTelemetry?
|
||||
|
||||
OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs.
|
||||
|
||||
### Why OpenTelemetry for rippled?
|
||||
|
||||
- **End-to-End Transaction Visibility**: Track transactions from submission → consensus → ledger inclusion
|
||||
- **Cross-Node Correlation**: Follow requests across multiple independent nodes using a unique `trace_id`
|
||||
- **Consensus Round Analysis**: Understand timing and behavior across validators
|
||||
- **Incident Debugging**: Correlate events across distributed nodes during issues
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
A["Node A<br/>tx.receive<br/>trace_id: abc123"] --> B["Node B<br/>tx.relay<br/>trace_id: abc123"] --> C["Node C<br/>tx.validate<br/>trace_id: abc123"] --> D["Node D<br/>ledger.apply<br/>trace_id: abc123"]
|
||||
|
||||
style A fill:#1565c0,stroke:#0d47a1,color:#fff
|
||||
style B fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
style C fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
style D fill:#e65100,stroke:#bf360c,color:#fff
|
||||
```
|
||||
|
||||
> **Trace ID: abc123** — All nodes share the same trace, enabling cross-node correlation.
|
||||
|
||||
---
|
||||
|
||||
## Slide 2: OpenTelemetry vs Open Source Alternatives
|
||||
|
||||
| Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus |
|
||||
| ------------------- | ---------------- | ---------------- | ------------------ | ---------- | ---------- | ---------- |
|
||||
| **Tracing** | YES | YES | YES | YES | YES | NO |
|
||||
| **Metrics** | YES | NO | NO | YES | YES | YES |
|
||||
| **Logs** | YES | NO | NO | YES | NO | NO |
|
||||
| **C++ SDK** | YES Official | YES (Deprecated) | YES (Unmaintained) | NO | NO | YES |
|
||||
| **Vendor Neutral** | YES Primary goal | NO | NO | NO | NO | NO |
|
||||
| **Instrumentation** | Manual + Auto | Manual | Manual | Auto-first | Auto-first | Manual |
|
||||
| **Backend** | Any (exporters) | Self | Self | Self | Self | Self |
|
||||
| **CNCF Status** | Incubating | Graduated | NO | Incubating | NO | Graduated |
|
||||
|
||||
> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Jaeger, Prometheus, Grafana, or any commercial backend without changing instrumentation.
|
||||
|
||||
---
|
||||
|
||||
## Slide 3: Comparison with rippled's Existing Solutions
|
||||
|
||||
### Current Observability Stack
|
||||
|
||||
| Aspect | PerfLog (JSON) | StatsD (Metrics) | OpenTelemetry (NEW) |
|
||||
| --------------------- | --------------------- | --------------------- | --------------------------- |
|
||||
| **Type** | Logging | Metrics | Distributed Tracing |
|
||||
| **Scope** | Single node | Single node | **Cross-node** |
|
||||
| **Data** | JSON log entries | Counters, gauges | Spans with context |
|
||||
| **Correlation** | By timestamp | By metric name | By `trace_id` |
|
||||
| **Overhead** | Low (file I/O) | Low (UDP) | Low-Medium (configurable) |
|
||||
| **Question Answered** | "What happened here?" | "How many? How fast?" | **"What was the journey?"** |
|
||||
|
||||
### Use Case Matrix
|
||||
|
||||
| Scenario | PerfLog | StatsD | OpenTelemetry |
|
||||
| -------------------------------- | ------- | ------ | ------------- |
|
||||
| "How many TXs per second?" | ❌ | ✅ | ❌ |
|
||||
| "Why was this specific TX slow?" | ⚠️ | ❌ | ✅ |
|
||||
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
|
||||
| "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ |
|
||||
|
||||
> **Key Insight**: OpenTelemetry **complements** (not replaces) existing systems.
|
||||
|
||||
---
|
||||
|
||||
## Slide 4: Architecture
|
||||
|
||||
### High-Level Integration Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph rippled["rippled Node"]
|
||||
subgraph services["Core Services"]
|
||||
direction LR
|
||||
RPC["RPC Server<br/>(HTTP/WS)"] ~~~ Overlay["Overlay<br/>(P2P Network)"] ~~~ Consensus["Consensus<br/>(RCLConsensus)"]
|
||||
end
|
||||
|
||||
Telemetry["Telemetry Module<br/>(OpenTelemetry SDK)"]
|
||||
|
||||
services --> Telemetry
|
||||
end
|
||||
|
||||
Telemetry -->|OTLP/gRPC| Collector["OTel Collector"]
|
||||
|
||||
Collector --> Tempo["Grafana Tempo"]
|
||||
Collector --> Jaeger["Jaeger"]
|
||||
Collector --> Elastic["Elastic APM"]
|
||||
|
||||
style rippled fill:#424242,stroke:#212121,color:#fff
|
||||
style services fill:#1565c0,stroke:#0d47a1,color:#fff
|
||||
style Telemetry fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
style Collector fill:#e65100,stroke:#bf360c,color:#fff
|
||||
```
|
||||
|
||||
### Context Propagation
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant NodeA as Node A
|
||||
participant NodeB as Node B
|
||||
|
||||
Client->>NodeA: Submit TX (no context)
|
||||
Note over NodeA: Creates trace_id: abc123<br/>span: tx.receive
|
||||
NodeA->>NodeB: Relay TX<br/>(traceparent: abc123)
|
||||
Note over NodeB: Links to trace_id: abc123<br/>span: tx.relay
|
||||
```
|
||||
|
||||
- **HTTP/RPC**: W3C Trace Context headers (`traceparent`)
|
||||
- **P2P Messages**: Protocol Buffer extension fields
|
||||
|
||||
---
|
||||
|
||||
## Slide 5: Implementation Plan
|
||||
|
||||
### 5-Phase Rollout (9 Weeks)
|
||||
|
||||
```mermaid
|
||||
gantt
|
||||
title Implementation Timeline
|
||||
dateFormat YYYY-MM-DD
|
||||
axisFormat Week %W
|
||||
|
||||
section Phase 1
|
||||
Core Infrastructure :p1, 2024-01-01, 2w
|
||||
|
||||
section Phase 2
|
||||
RPC Tracing :p2, after p1, 2w
|
||||
|
||||
section Phase 3
|
||||
Transaction Tracing :p3, after p2, 2w
|
||||
|
||||
section Phase 4
|
||||
Consensus Tracing :p4, after p3, 2w
|
||||
|
||||
section Phase 5
|
||||
Documentation :p5, after p4, 1w
|
||||
```
|
||||
|
||||
### Phase Details
|
||||
|
||||
| Phase | Focus | Key Deliverables | Effort |
|
||||
| ----- | ------------------- | -------------------------------------------- | ------- |
|
||||
| 1 | Core Infrastructure | SDK integration, Telemetry interface, Config | 10 days |
|
||||
| 2 | RPC Tracing | HTTP context extraction, Handler spans | 10 days |
|
||||
| 3 | Transaction Tracing | Protobuf context, P2P relay propagation | 10 days |
|
||||
| 4 | Consensus Tracing | Round spans, Proposal/validation tracing | 10 days |
|
||||
| 5 | Documentation | Runbook, Dashboards, Training | 7 days |
|
||||
|
||||
**Total Effort**: ~47 developer-days (2 developers)
|
||||
|
||||
---
|
||||
|
||||
## Slide 6: Performance Overhead
|
||||
|
||||
### Estimated System Impact
|
||||
|
||||
| Metric | Overhead | Notes |
|
||||
| ----------------- | ---------- | ----------------------------------- |
|
||||
| **CPU** | 1-3% | Span creation and attribute setting |
|
||||
| **Memory** | 2-5 MB | Batch buffer for pending spans |
|
||||
| **Network** | 10-50 KB/s | Compressed OTLP export to collector |
|
||||
| **Latency (p99)** | <2% | With proper sampling configuration |
|
||||
|
||||
### Per-Message Overhead (Context Propagation)
|
||||
|
||||
Each P2P message carries trace context with the following overhead:
|
||||
|
||||
| Field | Size | Description |
|
||||
| ------------- | ------------- | ----------------------------------------- |
|
||||
| `trace_id` | 16 bytes | Unique identifier for the entire trace |
|
||||
| `span_id` | 8 bytes | Current span (becomes parent on receiver) |
|
||||
| `trace_flags` | 4 bytes | Sampling decision flags |
|
||||
| `trace_state` | 0-4 bytes | Optional vendor-specific data |
|
||||
| **Total** | **~32 bytes** | **Added per traced P2P message** |
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph msg["P2P Message with Trace Context"]
|
||||
A["Original Message<br/>(variable size)"] --> B["+ TraceContext<br/>(~32 bytes)"]
|
||||
end
|
||||
|
||||
subgraph breakdown["Context Breakdown"]
|
||||
C["trace_id<br/>16 bytes"]
|
||||
D["span_id<br/>8 bytes"]
|
||||
E["flags<br/>4 bytes"]
|
||||
F["state<br/>0-4 bytes"]
|
||||
end
|
||||
|
||||
B --> breakdown
|
||||
|
||||
style A fill:#424242,stroke:#212121,color:#fff
|
||||
style B fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
style C fill:#1565c0,stroke:#0d47a1,color:#fff
|
||||
style D fill:#1565c0,stroke:#0d47a1,color:#fff
|
||||
style E fill:#e65100,stroke:#bf360c,color:#fff
|
||||
style F fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
> **Note**: 32 bytes is negligible compared to typical transaction messages (hundreds to thousands of bytes)
|
||||
|
||||
### Mitigation Strategies
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
A["Head Sampling<br/>10% default"] --> B["Tail Sampling<br/>Keep errors/slow"] --> C["Batch Export<br/>Reduce I/O"] --> D["Conditional Compile<br/>XRPL_ENABLE_TELEMETRY"]
|
||||
|
||||
style A fill:#1565c0,stroke:#0d47a1,color:#fff
|
||||
style B fill:#2e7d32,stroke:#1b5e20,color:#fff
|
||||
style C fill:#e65100,stroke:#bf360c,color:#fff
|
||||
style D fill:#4a148c,stroke:#2e0d57,color:#fff
|
||||
```
|
||||
|
||||
### Kill Switches (Rollback Options)
|
||||
|
||||
1. **Config Disable**: Set `enabled=0` in config → instant disable, no restart needed for sampling
|
||||
2. **Rebuild**: Compile with `XRPL_ENABLE_TELEMETRY=OFF` → zero overhead (no-op)
|
||||
3. **Full Revert**: Clean separation allows easy commit reversion
|
||||
|
||||
---
|
||||
|
||||
## Slide 7: Data Collection & Privacy
|
||||
|
||||
### What Data is Collected
|
||||
|
||||
| Category | Attributes Collected | Purpose |
|
||||
| --------------- | ---------------------------------------------------------------------------------- | --------------------------- |
|
||||
| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle |
|
||||
| **Consensus** | `round`, `phase`, `mode`, `proposers`(public key or public node id), `duration_ms` | Analyze consensus timing |
|
||||
| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance |
|
||||
| **Peer** | `peer.id`(public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis |
|
||||
| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking |
|
||||
| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance |
|
||||
|
||||
### What is NOT Collected (Privacy Guarantees)
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph notCollected["❌ NOT Collected"]
|
||||
direction LR
|
||||
A["Private Keys"] ~~~ B["Account Balances"] ~~~ C["Transaction Amounts"]
|
||||
end
|
||||
|
||||
subgraph alsoNot["❌ Also Excluded"]
|
||||
direction LR
|
||||
D["IP Addresses<br/>(configurable)"] ~~~ E["Personal Data"] ~~~ F["Raw TX Payloads"]
|
||||
end
|
||||
|
||||
style A fill:#c62828,stroke:#8c2809,color:#fff
|
||||
style B fill:#c62828,stroke:#8c2809,color:#fff
|
||||
style C fill:#c62828,stroke:#8c2809,color:#fff
|
||||
style D fill:#c62828,stroke:#8c2809,color:#fff
|
||||
style E fill:#c62828,stroke:#8c2809,color:#fff
|
||||
style F fill:#c62828,stroke:#8c2809,color:#fff
|
||||
```
|
||||
|
||||
### Privacy Protection Mechanisms
|
||||
|
||||
| Mechanism | Description |
|
||||
| -------------------------- | ------------------------------------------------------------- |
|
||||
| **Account Hashing** | `xrpl.tx.account` is hashed at collector level before storage |
|
||||
| **Configurable Redaction** | Sensitive fields can be excluded via config |
|
||||
| **Sampling** | Only 10% of traces recorded by default (reduces exposure) |
|
||||
| **Local Control** | Node operators control what gets exported |
|
||||
| **No Raw Payloads** | Transaction content is never recorded, only metadata |
|
||||
|
||||
> **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts).
|
||||
|
||||
---
|
||||
|
||||
_End of Presentation_
|
||||
@@ -8,6 +8,7 @@
|
||||
#include <xrpld/rpc/detail/Tuning.h>
|
||||
|
||||
#include <xrpl/beast/unit_test.h>
|
||||
#include <xrpl/core/CoroTask.h>
|
||||
#include <xrpl/core/JobQueue.h>
|
||||
#include <xrpl/json/json_reader.h>
|
||||
#include <xrpl/protocol/ApiVersion.h>
|
||||
@@ -131,7 +132,6 @@ public:
|
||||
c,
|
||||
Role::USER,
|
||||
{},
|
||||
{},
|
||||
RPC::apiVersionIfUnspecified},
|
||||
{},
|
||||
{}};
|
||||
@@ -155,11 +155,11 @@ public:
|
||||
|
||||
Json::Value result;
|
||||
gate g;
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = std::move(params);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
|
||||
using namespace std::chrono_literals;
|
||||
@@ -240,28 +240,27 @@ public:
|
||||
c,
|
||||
Role::USER,
|
||||
{},
|
||||
{},
|
||||
RPC::apiVersionIfUnspecified},
|
||||
{},
|
||||
{}};
|
||||
Json::Value result;
|
||||
gate g;
|
||||
// Test RPC::Tuning::max_src_cur source currencies.
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = rpf(Account("alice"), Account("bob"), RPC::Tuning::max_src_cur);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(!result.isMember(jss::error));
|
||||
|
||||
// Test more than RPC::Tuning::max_src_cur source currencies.
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = rpf(Account("alice"), Account("bob"), RPC::Tuning::max_src_cur + 1);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(result.isMember(jss::error));
|
||||
@@ -269,22 +268,22 @@ public:
|
||||
// Test RPC::Tuning::max_auto_src_cur source currencies.
|
||||
for (auto i = 0; i < (RPC::Tuning::max_auto_src_cur - 1); ++i)
|
||||
env.trust(Account("alice")[std::to_string(i + 100)](100), "bob");
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = rpf(Account("alice"), Account("bob"), 0);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(!result.isMember(jss::error));
|
||||
|
||||
// Test more than RPC::Tuning::max_auto_src_cur source currencies.
|
||||
env.trust(Account("alice")["AUD"](100), "bob");
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = rpf(Account("alice"), Account("bob"), 0);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(result.isMember(jss::error));
|
||||
|
||||
537
src/test/core/CoroTask_test.cpp
Normal file
537
src/test/core/CoroTask_test.cpp
Normal file
@@ -0,0 +1,537 @@
|
||||
#include <test/jtx.h>
|
||||
|
||||
#include <xrpl/core/JobQueue.h>
|
||||
#include <xrpl/core/JobQueueAwaiter.h>
|
||||
|
||||
#include <chrono>
|
||||
#include <mutex>
|
||||
|
||||
namespace xrpl {
|
||||
namespace test {
|
||||
|
||||
/**
|
||||
* Test suite for the C++20 coroutine primitives: CoroTask, CoroTaskRunner,
|
||||
* and JobQueueAwaiter.
|
||||
*
|
||||
* Dependency Diagram
|
||||
* ==================
|
||||
*
|
||||
* CoroTask_test
|
||||
* +-------------------------------------------------+
|
||||
* | + gate (inner class) : condition_variable helper |
|
||||
* +-------------------------------------------------+
|
||||
* | uses
|
||||
* v
|
||||
* jtx::Env --> JobQueue::postCoroTask()
|
||||
* |
|
||||
* +-- CoroTaskRunner (suspend / post / resume)
|
||||
* +-- CoroTask<void> / CoroTask<T>
|
||||
* +-- JobQueueAwaiter
|
||||
*
|
||||
* Test Coverage Matrix
|
||||
* ====================
|
||||
*
|
||||
* Test | Primitives exercised
|
||||
* --------------------------+----------------------------------------------
|
||||
* testVoidCompletion | CoroTask<void> basic lifecycle
|
||||
* testCorrectOrder | suspend() -> join() -> post() -> complete
|
||||
* testIncorrectOrder | post() before suspend() (race-safe path)
|
||||
* testJobQueueAwaiter | JobQueueAwaiter suspend + auto-repost
|
||||
* testThreadSpecificStorage | LocalValue isolation across coroutines
|
||||
* testExceptionPropagation | unhandled_exception() in promise_type
|
||||
* testMultipleYields | N sequential suspend/resume cycles
|
||||
* testValueReturn | CoroTask<T> co_return value
|
||||
* testValueException | CoroTask<T> exception via co_await
|
||||
* testValueChaining | nested CoroTask<T> -> CoroTask<T>
|
||||
* testShutdownRejection | postCoroTask returns nullptr when stopping
|
||||
*/
|
||||
class CoroTask_test : public beast::unit_test::suite
|
||||
{
|
||||
public:
|
||||
/**
|
||||
* Simple one-shot gate for synchronizing between test thread
|
||||
* and coroutine worker threads. signal() sets the flag;
|
||||
* wait_for() blocks until signaled or timeout.
|
||||
*/
|
||||
class gate
|
||||
{
|
||||
private:
|
||||
std::condition_variable cv_;
|
||||
std::mutex mutex_;
|
||||
bool signaled_ = false;
|
||||
|
||||
public:
|
||||
/**
|
||||
* Block until signaled or timeout expires.
|
||||
*
|
||||
* @param rel_time Maximum duration to wait
|
||||
*
|
||||
* @return true if signaled before timeout
|
||||
*/
|
||||
template <class Rep, class Period>
|
||||
bool
|
||||
wait_for(std::chrono::duration<Rep, Period> const& rel_time)
|
||||
{
|
||||
std::unique_lock<std::mutex> lk(mutex_);
|
||||
auto b = cv_.wait_for(lk, rel_time, [this] { return signaled_; });
|
||||
signaled_ = false;
|
||||
return b;
|
||||
}
|
||||
|
||||
/**
|
||||
* Signal the gate, waking any waiting thread.
|
||||
*/
|
||||
void
|
||||
signal()
|
||||
{
|
||||
std::lock_guard lk(mutex_);
|
||||
signaled_ = true;
|
||||
cv_.notify_all();
|
||||
}
|
||||
};
|
||||
|
||||
// NOTE: All coroutine lambdas passed to postCoroTask use explicit
|
||||
// pointer-by-value captures instead of [&] to work around a GCC 14
|
||||
// bug where reference captures in coroutine lambdas are corrupted
|
||||
// in the coroutine frame.
|
||||
|
||||
/**
|
||||
* CoroTask<void> runs to completion and runner becomes non-runnable.
|
||||
*/
|
||||
void
|
||||
testVoidCompletion()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("void completion");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [gp = &g](auto) -> CoroTask<void> {
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
runner->join();
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* Correct order: suspend, join, post, complete.
|
||||
* Mirrors existing Coroutine_test::correct_order.
|
||||
*/
|
||||
void
|
||||
testCorrectOrder()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("correct order");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g1, g2;
|
||||
std::shared_ptr<JobQueue::CoroTaskRunner> r;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT,
|
||||
"CoroTaskTest",
|
||||
[rp = &r, g1p = &g1, g2p = &g2](auto runner) -> CoroTask<void> {
|
||||
*rp = runner;
|
||||
g1p->signal();
|
||||
co_await runner->suspend();
|
||||
g2p->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g1.wait_for(5s));
|
||||
runner->join();
|
||||
runner->post();
|
||||
BEAST_EXPECT(g2.wait_for(5s));
|
||||
runner->join();
|
||||
}
|
||||
|
||||
/**
|
||||
* Incorrect order: post() before suspend(). Verifies the
|
||||
* race-safe path. Mirrors Coroutine_test::incorrect_order.
|
||||
*/
|
||||
void
|
||||
testIncorrectOrder()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("incorrect order");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [gp = &g](auto runner) -> CoroTask<void> {
|
||||
runner->post();
|
||||
co_await runner->suspend();
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
}
|
||||
|
||||
/**
|
||||
* JobQueueAwaiter suspend + auto-repost across multiple yield points.
|
||||
*/
|
||||
void
|
||||
testJobQueueAwaiter()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("JobQueueAwaiter");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
int step = 0;
|
||||
env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [sp = &step, gp = &g](auto runner) -> CoroTask<void> {
|
||||
*sp = 1;
|
||||
co_await JobQueueAwaiter{runner};
|
||||
*sp = 2;
|
||||
co_await JobQueueAwaiter{runner};
|
||||
*sp = 3;
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(step == 3);
|
||||
}
|
||||
|
||||
/**
|
||||
* Per-coroutine LocalValue isolation. Each coroutine sees its own
|
||||
* copy of thread-local state. Mirrors Coroutine_test::thread_specific_storage.
|
||||
*/
|
||||
void
|
||||
testThreadSpecificStorage()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("thread specific storage");
|
||||
Env env(*this);
|
||||
|
||||
auto& jq = env.app().getJobQueue();
|
||||
|
||||
static int const N = 4;
|
||||
std::array<std::shared_ptr<JobQueue::CoroTaskRunner>, N> a;
|
||||
|
||||
LocalValue<int> lv(-1);
|
||||
BEAST_EXPECT(*lv == -1);
|
||||
|
||||
gate g;
|
||||
jq.addJob(jtCLIENT, "LocalValTest", [&]() {
|
||||
this->BEAST_EXPECT(*lv == -1);
|
||||
*lv = -2;
|
||||
this->BEAST_EXPECT(*lv == -2);
|
||||
g.signal();
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(*lv == -1);
|
||||
|
||||
for (int i = 0; i < N; ++i)
|
||||
{
|
||||
jq.postCoroTask(
|
||||
jtCLIENT,
|
||||
"CoroTaskTest",
|
||||
[this, ap = &a, gp = &g, lvp = &lv, id = i](auto runner) -> CoroTask<void> {
|
||||
(*ap)[id] = runner;
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
|
||||
this->BEAST_EXPECT(**lvp == -1);
|
||||
**lvp = id;
|
||||
this->BEAST_EXPECT(**lvp == id);
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
|
||||
this->BEAST_EXPECT(**lvp == id);
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
a[i]->join();
|
||||
}
|
||||
for (auto const& r : a)
|
||||
{
|
||||
r->post();
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
r->join();
|
||||
}
|
||||
for (auto const& r : a)
|
||||
{
|
||||
r->post();
|
||||
r->join();
|
||||
}
|
||||
|
||||
jq.addJob(jtCLIENT, "LocalValTest", [&]() {
|
||||
this->BEAST_EXPECT(*lv == -2);
|
||||
g.signal();
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(*lv == -1);
|
||||
}
|
||||
|
||||
/**
|
||||
* Exception thrown in coroutine body is caught by
|
||||
* promise_type::unhandled_exception(). Coroutine completes.
|
||||
*/
|
||||
void
|
||||
testExceptionPropagation()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("exception propagation");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [gp = &g](auto) -> CoroTask<void> {
|
||||
gp->signal();
|
||||
throw std::runtime_error("test exception");
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
runner->join();
|
||||
// The exception is caught by promise_type::unhandled_exception()
|
||||
// and the coroutine is considered done
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* Multiple sequential suspend/resume cycles via co_await.
|
||||
*/
|
||||
void
|
||||
testMultipleYields()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("multiple yields");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
int counter = 0;
|
||||
std::shared_ptr<JobQueue::CoroTaskRunner> r;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT,
|
||||
"CoroTaskTest",
|
||||
[rp = &r, cp = &counter, gp = &g](auto runner) -> CoroTask<void> {
|
||||
*rp = runner;
|
||||
++(*cp);
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
++(*cp);
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
++(*cp);
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(counter == 1);
|
||||
runner->join();
|
||||
|
||||
runner->post();
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(counter == 2);
|
||||
runner->join();
|
||||
|
||||
runner->post();
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
BEAST_EXPECT(counter == 3);
|
||||
runner->join();
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* CoroTask<T> returns a value via co_return. Outer coroutine
|
||||
* extracts it with co_await.
|
||||
*/
|
||||
void
|
||||
testValueReturn()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("value return");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
int result = 0;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [rp = &result, gp = &g](auto) -> CoroTask<void> {
|
||||
auto inner = []() -> CoroTask<int> { co_return 42; };
|
||||
*rp = co_await inner();
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
runner->join();
|
||||
BEAST_EXPECT(result == 42);
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* CoroTask<T> propagates exceptions from inner coroutines.
|
||||
* Outer coroutine catches via try/catch around co_await.
|
||||
*/
|
||||
void
|
||||
testValueException()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("value exception");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
bool caught = false;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [cp = &caught, gp = &g](auto) -> CoroTask<void> {
|
||||
auto inner = []() -> CoroTask<int> {
|
||||
throw std::runtime_error("inner error");
|
||||
co_return 0;
|
||||
};
|
||||
try
|
||||
{
|
||||
co_await inner();
|
||||
}
|
||||
catch (std::runtime_error const& e)
|
||||
{
|
||||
*cp = true;
|
||||
}
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
runner->join();
|
||||
BEAST_EXPECT(caught);
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* CoroTask<T> chaining. Nested value-returning coroutines
|
||||
* compose via co_await.
|
||||
*/
|
||||
void
|
||||
testValueChaining()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("value chaining");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
gate g;
|
||||
int result = 0;
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [rp = &result, gp = &g](auto) -> CoroTask<void> {
|
||||
auto add = [](int a, int b) -> CoroTask<int> { co_return a + b; };
|
||||
auto mul = [add](int a, int b) -> CoroTask<int> {
|
||||
int sum = co_await add(a, b);
|
||||
co_return sum * 2;
|
||||
};
|
||||
*rp = co_await mul(3, 4);
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(runner);
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
runner->join();
|
||||
BEAST_EXPECT(result == 14); // (3 + 4) * 2
|
||||
BEAST_EXPECT(!runner->runnable());
|
||||
}
|
||||
|
||||
/**
|
||||
* postCoroTask returns nullptr when JobQueue is stopping.
|
||||
*/
|
||||
void
|
||||
testShutdownRejection()
|
||||
{
|
||||
using namespace std::chrono_literals;
|
||||
using namespace jtx;
|
||||
|
||||
testcase("shutdown rejection");
|
||||
|
||||
Env env(*this, envconfig([](std::unique_ptr<Config> cfg) {
|
||||
cfg->FORCE_MULTI_THREAD = true;
|
||||
return cfg;
|
||||
}));
|
||||
|
||||
// Stop the JobQueue
|
||||
env.app().getJobQueue().stop();
|
||||
|
||||
auto runner = env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTaskTest", [](auto) -> CoroTask<void> { co_return; });
|
||||
BEAST_EXPECT(!runner);
|
||||
}
|
||||
|
||||
void
|
||||
run() override
|
||||
{
|
||||
testVoidCompletion();
|
||||
testCorrectOrder();
|
||||
testIncorrectOrder();
|
||||
testJobQueueAwaiter();
|
||||
testThreadSpecificStorage();
|
||||
testExceptionPropagation();
|
||||
testMultipleYields();
|
||||
testValueReturn();
|
||||
testValueException();
|
||||
testValueChaining();
|
||||
testShutdownRejection();
|
||||
}
|
||||
};
|
||||
|
||||
BEAST_DEFINE_TESTSUITE(CoroTask, core, xrpl);
|
||||
|
||||
} // namespace test
|
||||
} // namespace xrpl
|
||||
@@ -40,6 +40,11 @@ public:
|
||||
}
|
||||
};
|
||||
|
||||
// NOTE: All coroutine lambdas passed to postCoroTask use explicit
|
||||
// pointer-by-value captures instead of [&] to work around a GCC 14
|
||||
// bug where reference captures in coroutine lambdas are corrupted
|
||||
// in the coroutine frame.
|
||||
|
||||
void
|
||||
correct_order()
|
||||
{
|
||||
@@ -54,13 +59,15 @@ public:
|
||||
}));
|
||||
|
||||
gate g1, g2;
|
||||
std::shared_ptr<JobQueue::Coro> c;
|
||||
env.app().getJobQueue().postCoro(jtCLIENT, "CoroTest", [&](auto const& cr) {
|
||||
c = cr;
|
||||
g1.signal();
|
||||
c->yield();
|
||||
g2.signal();
|
||||
});
|
||||
std::shared_ptr<JobQueue::CoroTaskRunner> c;
|
||||
env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTest", [cp = &c, g1p = &g1, g2p = &g2](auto runner) -> CoroTask<void> {
|
||||
*cp = runner;
|
||||
g1p->signal();
|
||||
co_await runner->suspend();
|
||||
g2p->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g1.wait_for(5s));
|
||||
c->join();
|
||||
c->post();
|
||||
@@ -81,11 +88,17 @@ public:
|
||||
}));
|
||||
|
||||
gate g;
|
||||
env.app().getJobQueue().postCoro(jtCLIENT, "CoroTest", [&](auto const& c) {
|
||||
c->post();
|
||||
c->yield();
|
||||
g.signal();
|
||||
});
|
||||
env.app().getJobQueue().postCoroTask(
|
||||
jtCLIENT, "CoroTest", [gp = &g](auto runner) -> CoroTask<void> {
|
||||
// Schedule a resume before suspending. The posted job
|
||||
// cannot actually call resume() until the current resume()
|
||||
// releases CoroTaskRunner::mutex_, which only happens after
|
||||
// the coroutine suspends at co_await.
|
||||
runner->post();
|
||||
co_await runner->suspend();
|
||||
gp->signal();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
}
|
||||
|
||||
@@ -101,7 +114,7 @@ public:
|
||||
auto& jq = env.app().getJobQueue();
|
||||
|
||||
static int const N = 4;
|
||||
std::array<std::shared_ptr<JobQueue::Coro>, N> a;
|
||||
std::array<std::shared_ptr<JobQueue::CoroTaskRunner>, N> a;
|
||||
|
||||
LocalValue<int> lv(-1);
|
||||
BEAST_EXPECT(*lv == -1);
|
||||
@@ -118,19 +131,23 @@ public:
|
||||
|
||||
for (int i = 0; i < N; ++i)
|
||||
{
|
||||
jq.postCoro(jtCLIENT, "CoroTest", [&, id = i](auto const& c) {
|
||||
a[id] = c;
|
||||
g.signal();
|
||||
c->yield();
|
||||
jq.postCoroTask(
|
||||
jtCLIENT,
|
||||
"CoroTest",
|
||||
[this, ap = &a, gp = &g, lvp = &lv, id = i](auto runner) -> CoroTask<void> {
|
||||
(*ap)[id] = runner;
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
|
||||
this->BEAST_EXPECT(*lv == -1);
|
||||
*lv = id;
|
||||
this->BEAST_EXPECT(*lv == id);
|
||||
g.signal();
|
||||
c->yield();
|
||||
this->BEAST_EXPECT(**lvp == -1);
|
||||
**lvp = id;
|
||||
this->BEAST_EXPECT(**lvp == id);
|
||||
gp->signal();
|
||||
co_await runner->suspend();
|
||||
|
||||
this->BEAST_EXPECT(*lv == id);
|
||||
});
|
||||
this->BEAST_EXPECT(**lvp == id);
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(g.wait_for(5s));
|
||||
a[i]->join();
|
||||
}
|
||||
|
||||
@@ -43,87 +43,91 @@ class JobQueue_test : public beast::unit_test::suite
|
||||
}
|
||||
}
|
||||
|
||||
// NOTE: All coroutine lambdas passed to postCoroTask use explicit
|
||||
// pointer-by-value captures instead of [&] to work around a GCC 14
|
||||
// bug where reference captures in coroutine lambdas are corrupted
|
||||
// in the coroutine frame.
|
||||
|
||||
void
|
||||
testPostCoro()
|
||||
testPostCoroTask()
|
||||
{
|
||||
jtx::Env env{*this};
|
||||
|
||||
JobQueue& jQueue = env.app().getJobQueue();
|
||||
{
|
||||
// Test repeated post()s until the Coro completes.
|
||||
// Test repeated post()s until the coroutine completes.
|
||||
std::atomic<int> yieldCount{0};
|
||||
auto const coro = jQueue.postCoro(
|
||||
jtCLIENT,
|
||||
"PostCoroTest1",
|
||||
[&yieldCount](std::shared_ptr<JobQueue::Coro> const& coroCopy) {
|
||||
while (++yieldCount < 4)
|
||||
coroCopy->yield();
|
||||
auto const runner = jQueue.postCoroTask(
|
||||
jtCLIENT, "PostCoroTest1", [ycp = &yieldCount](auto runner) -> CoroTask<void> {
|
||||
while (++(*ycp) < 4)
|
||||
co_await runner->suspend();
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(coro != nullptr);
|
||||
BEAST_EXPECT(runner != nullptr);
|
||||
|
||||
// Wait for the Job to run and yield.
|
||||
while (yieldCount == 0)
|
||||
;
|
||||
|
||||
// Now re-post until the Coro says it is done.
|
||||
// Now re-post until the CoroTaskRunner says it is done.
|
||||
int old = yieldCount;
|
||||
while (coro->runnable())
|
||||
while (runner->runnable())
|
||||
{
|
||||
BEAST_EXPECT(coro->post());
|
||||
BEAST_EXPECT(runner->post());
|
||||
while (old == yieldCount)
|
||||
{
|
||||
}
|
||||
coro->join();
|
||||
runner->join();
|
||||
BEAST_EXPECT(++old == yieldCount);
|
||||
}
|
||||
BEAST_EXPECT(yieldCount == 4);
|
||||
}
|
||||
{
|
||||
// Test repeated resume()s until the Coro completes.
|
||||
// Test repeated resume()s until the coroutine completes.
|
||||
int yieldCount{0};
|
||||
auto const coro = jQueue.postCoro(
|
||||
jtCLIENT,
|
||||
"PostCoroTest2",
|
||||
[&yieldCount](std::shared_ptr<JobQueue::Coro> const& coroCopy) {
|
||||
while (++yieldCount < 4)
|
||||
coroCopy->yield();
|
||||
auto const runner = jQueue.postCoroTask(
|
||||
jtCLIENT, "PostCoroTest2", [ycp = &yieldCount](auto runner) -> CoroTask<void> {
|
||||
while (++(*ycp) < 4)
|
||||
co_await runner->suspend();
|
||||
co_return;
|
||||
});
|
||||
if (!coro)
|
||||
if (!runner)
|
||||
{
|
||||
// There's no good reason we should not get a Coro, but we
|
||||
// There's no good reason we should not get a runner, but we
|
||||
// can't continue without one.
|
||||
BEAST_EXPECT(false);
|
||||
return;
|
||||
}
|
||||
|
||||
// Wait for the Job to run and yield.
|
||||
coro->join();
|
||||
runner->join();
|
||||
|
||||
// Now resume until the Coro says it is done.
|
||||
// Now resume until the CoroTaskRunner says it is done.
|
||||
int old = yieldCount;
|
||||
while (coro->runnable())
|
||||
while (runner->runnable())
|
||||
{
|
||||
coro->resume(); // Resume runs synchronously on this thread.
|
||||
runner->resume(); // Resume runs synchronously on this thread.
|
||||
BEAST_EXPECT(++old == yieldCount);
|
||||
}
|
||||
BEAST_EXPECT(yieldCount == 4);
|
||||
}
|
||||
{
|
||||
// If the JobQueue is stopped, we should no
|
||||
// longer be able to add a Coro (and calling postCoro() should
|
||||
// return false).
|
||||
// longer be able to post a coroutine (and calling postCoroTask()
|
||||
// should return nullptr).
|
||||
using namespace std::chrono_literals;
|
||||
jQueue.stop();
|
||||
|
||||
// The Coro should never run, so having the Coro access this
|
||||
// The coroutine should never run, so having it access this
|
||||
// unprotected variable on the stack should be completely safe.
|
||||
// Not recommended for the faint of heart...
|
||||
bool unprotected;
|
||||
auto const coro = jQueue.postCoro(
|
||||
jtCLIENT, "PostCoroTest3", [&unprotected](std::shared_ptr<JobQueue::Coro> const&) {
|
||||
unprotected = false;
|
||||
auto const runner = jQueue.postCoroTask(
|
||||
jtCLIENT, "PostCoroTest3", [up = &unprotected](auto) -> CoroTask<void> {
|
||||
*up = false;
|
||||
co_return;
|
||||
});
|
||||
BEAST_EXPECT(coro == nullptr);
|
||||
BEAST_EXPECT(runner == nullptr);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -132,7 +136,7 @@ public:
|
||||
run() override
|
||||
{
|
||||
testAddJob();
|
||||
testPostCoro();
|
||||
testPostCoroTask();
|
||||
}
|
||||
};
|
||||
|
||||
|
||||
@@ -6,6 +6,7 @@
|
||||
|
||||
#include <xrpld/rpc/RPCHandler.h>
|
||||
|
||||
#include <xrpl/core/CoroTask.h>
|
||||
#include <xrpl/protocol/ApiVersion.h>
|
||||
#include <xrpl/protocol/STParsedJSON.h>
|
||||
#include <xrpl/resource/Fees.h>
|
||||
@@ -193,7 +194,6 @@ AMMTest::find_paths_request(
|
||||
c,
|
||||
Role::USER,
|
||||
{},
|
||||
{},
|
||||
RPC::apiVersionIfUnspecified},
|
||||
{},
|
||||
{}};
|
||||
@@ -215,11 +215,11 @@ AMMTest::find_paths_request(
|
||||
|
||||
Json::Value result;
|
||||
gate g;
|
||||
app.getJobQueue().postCoro(jtCLIENT, "RPC-Client", [&](auto const& coro) {
|
||||
app.getJobQueue().postCoroTask(jtCLIENT, "RPC-Client", [&](auto) -> CoroTask<void> {
|
||||
context.params = std::move(params);
|
||||
context.coro = coro;
|
||||
RPC::doCommand(context, result);
|
||||
g.signal();
|
||||
co_return;
|
||||
});
|
||||
|
||||
using namespace std::chrono_literals;
|
||||
|
||||
@@ -1425,7 +1425,6 @@ ApplicationImp::setup(boost::program_options::variables_map const& cmdline)
|
||||
c,
|
||||
Role::ADMIN,
|
||||
{},
|
||||
{},
|
||||
RPC::apiMaximumSupportedVersion},
|
||||
jvCommand};
|
||||
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
|
||||
#include <xrpl/beast/core/CurrentThreadName.h>
|
||||
#include <xrpl/beast/net/IPAddressConversion.h>
|
||||
#include <xrpl/core/CoroTask.h>
|
||||
#include <xrpl/resource/Fees.h>
|
||||
|
||||
namespace xrpl {
|
||||
@@ -99,13 +100,14 @@ GRPCServerImpl::CallData<Request, Response>::process()
|
||||
// ensures that finished is always true when this CallData object
|
||||
// is returned as a tag in handleRpcs(), after sending the response
|
||||
finished_ = true;
|
||||
auto coro = app_.getJobQueue().postCoro(
|
||||
JobType::jtRPC, "gRPC-Client", [thisShared](std::shared_ptr<JobQueue::Coro> coro) {
|
||||
thisShared->process(coro);
|
||||
auto runner = app_.getJobQueue().postCoroTask(
|
||||
JobType::jtRPC, "gRPC-Client", [thisShared](auto) -> CoroTask<void> {
|
||||
thisShared->processRequest();
|
||||
co_return;
|
||||
});
|
||||
|
||||
// If coro is null, then the JobQueue has already been shutdown
|
||||
if (!coro)
|
||||
// If runner is null, then the JobQueue has already been shutdown
|
||||
if (!runner)
|
||||
{
|
||||
grpc::Status status{grpc::StatusCode::INTERNAL, "Job Queue is already stopped"};
|
||||
responder_.FinishWithError(status, this);
|
||||
@@ -114,7 +116,7 @@ GRPCServerImpl::CallData<Request, Response>::process()
|
||||
|
||||
template <class Request, class Response>
|
||||
void
|
||||
GRPCServerImpl::CallData<Request, Response>::process(std::shared_ptr<JobQueue::Coro> coro)
|
||||
GRPCServerImpl::CallData<Request, Response>::processRequest()
|
||||
{
|
||||
try
|
||||
{
|
||||
@@ -156,7 +158,6 @@ GRPCServerImpl::CallData<Request, Response>::process(std::shared_ptr<JobQueue::C
|
||||
app_.getLedgerMaster(),
|
||||
usage,
|
||||
role,
|
||||
coro,
|
||||
InfoSub::pointer(),
|
||||
apiVersion},
|
||||
request_};
|
||||
|
||||
@@ -206,9 +206,12 @@ private:
|
||||
clone() override;
|
||||
|
||||
private:
|
||||
// process the request. Called inside the coroutine passed to JobQueue
|
||||
/**
|
||||
* Process the gRPC request. Called inside the CoroTask lambda
|
||||
* posted to the JobQueue by process().
|
||||
*/
|
||||
void
|
||||
process(std::shared_ptr<JobQueue::Coro> coro);
|
||||
processRequest();
|
||||
|
||||
// return load type of this RPC
|
||||
Resource::Charge
|
||||
|
||||
@@ -3,7 +3,6 @@
|
||||
#include <xrpld/rpc/Role.h>
|
||||
|
||||
#include <xrpl/beast/utility/Journal.h>
|
||||
#include <xrpl/core/JobQueue.h>
|
||||
#include <xrpl/server/InfoSub.h>
|
||||
|
||||
namespace xrpl {
|
||||
@@ -24,7 +23,6 @@ struct Context
|
||||
LedgerMaster& ledgerMaster;
|
||||
Resource::Consumer& consumer;
|
||||
Role role;
|
||||
std::shared_ptr<JobQueue::Coro> coro{};
|
||||
InfoSub::pointer infoSub{};
|
||||
unsigned int apiVersion;
|
||||
};
|
||||
|
||||
@@ -169,13 +169,10 @@ public:
|
||||
|
||||
private:
|
||||
Json::Value
|
||||
processSession(
|
||||
std::shared_ptr<WSSession> const& session,
|
||||
std::shared_ptr<JobQueue::Coro> const& coro,
|
||||
Json::Value const& jv);
|
||||
processSession(std::shared_ptr<WSSession> const& session, Json::Value const& jv);
|
||||
|
||||
void
|
||||
processSession(std::shared_ptr<Session> const&, std::shared_ptr<JobQueue::Coro> coro);
|
||||
processSession(std::shared_ptr<Session> const&);
|
||||
|
||||
void
|
||||
processRequest(
|
||||
@@ -183,7 +180,6 @@ private:
|
||||
std::string const& request,
|
||||
beast::IP::Endpoint const& remoteIPAddress,
|
||||
Output&&,
|
||||
std::shared_ptr<JobQueue::Coro> coro,
|
||||
std::string_view forwardedFor,
|
||||
std::string_view user);
|
||||
|
||||
|
||||
@@ -14,6 +14,7 @@
|
||||
#include <xrpl/basics/make_SSLContext.h>
|
||||
#include <xrpl/beast/net/IPAddressConversion.h>
|
||||
#include <xrpl/beast/rfc2616.h>
|
||||
#include <xrpl/core/CoroTask.h>
|
||||
#include <xrpl/core/JobQueue.h>
|
||||
#include <xrpl/json/json_reader.h>
|
||||
#include <xrpl/json/to_string.h>
|
||||
@@ -284,9 +285,10 @@ ServerHandler::onRequest(Session& session)
|
||||
}
|
||||
|
||||
std::shared_ptr<Session> detachedSession = session.detach();
|
||||
auto const postResult = m_jobQueue.postCoro(
|
||||
jtCLIENT_RPC, "RPC-Client", [this, detachedSession](std::shared_ptr<JobQueue::Coro> coro) {
|
||||
processSession(detachedSession, coro);
|
||||
auto const postResult = m_jobQueue.postCoroTask(
|
||||
jtCLIENT_RPC, "RPC-Client", [this, detachedSession](auto) -> CoroTask<void> {
|
||||
processSession(detachedSession);
|
||||
co_return;
|
||||
});
|
||||
if (postResult == nullptr)
|
||||
{
|
||||
@@ -322,17 +324,18 @@ ServerHandler::onWSMessage(
|
||||
|
||||
JLOG(m_journal.trace()) << "Websocket received '" << jv << "'";
|
||||
|
||||
auto const postResult = m_jobQueue.postCoro(
|
||||
auto const postResult = m_jobQueue.postCoroTask(
|
||||
jtCLIENT_WEBSOCKET,
|
||||
"WS-Client",
|
||||
[this, session, jv = std::move(jv)](std::shared_ptr<JobQueue::Coro> const& coro) {
|
||||
auto const jr = this->processSession(session, coro, jv);
|
||||
[this, session, jv = std::move(jv)](auto) -> CoroTask<void> {
|
||||
auto const jr = this->processSession(session, jv);
|
||||
auto const s = to_string(jr);
|
||||
auto const n = s.length();
|
||||
boost::beast::multi_buffer sb(n);
|
||||
sb.commit(boost::asio::buffer_copy(sb.prepare(n), boost::asio::buffer(s.c_str(), n)));
|
||||
session->send(std::make_shared<StreambufWSMsg<decltype(sb)>>(std::move(sb)));
|
||||
session->complete();
|
||||
co_return;
|
||||
});
|
||||
if (postResult == nullptr)
|
||||
{
|
||||
@@ -373,10 +376,7 @@ logDuration(Json::Value const& request, T const& duration, beast::Journal& journ
|
||||
}
|
||||
|
||||
Json::Value
|
||||
ServerHandler::processSession(
|
||||
std::shared_ptr<WSSession> const& session,
|
||||
std::shared_ptr<JobQueue::Coro> const& coro,
|
||||
Json::Value const& jv)
|
||||
ServerHandler::processSession(std::shared_ptr<WSSession> const& session, Json::Value const& jv)
|
||||
{
|
||||
auto is = std::static_pointer_cast<WSInfoSub>(session->appDefined);
|
||||
if (is->getConsumer().disconnect(m_journal))
|
||||
@@ -443,7 +443,6 @@ ServerHandler::processSession(
|
||||
app_.getLedgerMaster(),
|
||||
is->getConsumer(),
|
||||
role,
|
||||
coro,
|
||||
is,
|
||||
apiVersion},
|
||||
jv,
|
||||
@@ -514,18 +513,14 @@ ServerHandler::processSession(
|
||||
return jr;
|
||||
}
|
||||
|
||||
// Run as a coroutine.
|
||||
void
|
||||
ServerHandler::processSession(
|
||||
std::shared_ptr<Session> const& session,
|
||||
std::shared_ptr<JobQueue::Coro> coro)
|
||||
ServerHandler::processSession(std::shared_ptr<Session> const& session)
|
||||
{
|
||||
processRequest(
|
||||
session->port(),
|
||||
buffers_to_string(session->request().body().data()),
|
||||
session->remoteAddress().at_port(0),
|
||||
makeOutput(*session),
|
||||
coro,
|
||||
forwardedFor(session->request()),
|
||||
[&] {
|
||||
auto const iter = session->request().find("X-User");
|
||||
@@ -562,7 +557,6 @@ ServerHandler::processRequest(
|
||||
std::string const& request,
|
||||
beast::IP::Endpoint const& remoteIPAddress,
|
||||
Output&& output,
|
||||
std::shared_ptr<JobQueue::Coro> coro,
|
||||
std::string_view forwardedFor,
|
||||
std::string_view user)
|
||||
{
|
||||
@@ -819,7 +813,6 @@ ServerHandler::processRequest(
|
||||
app_.getLedgerMaster(),
|
||||
usage,
|
||||
role,
|
||||
coro,
|
||||
InfoSub::pointer(),
|
||||
apiVersion},
|
||||
params,
|
||||
|
||||
@@ -7,6 +7,9 @@
|
||||
#include <xrpl/protocol/RPCErr.h>
|
||||
#include <xrpl/resource/Fees.h>
|
||||
|
||||
#include <condition_variable>
|
||||
#include <mutex>
|
||||
|
||||
namespace xrpl {
|
||||
|
||||
// This interface is deprecated.
|
||||
@@ -37,98 +40,31 @@ doRipplePathFind(RPC::JsonContext& context)
|
||||
PathRequest::pointer request;
|
||||
lpLedger = context.ledgerMaster.getClosedLedger();
|
||||
|
||||
// It doesn't look like there's much odd happening here, but you should
|
||||
// be aware this code runs in a JobQueue::Coro, which is a coroutine.
|
||||
// And we may be flipping around between threads. Here's an overview:
|
||||
//
|
||||
// 1. We're running doRipplePathFind() due to a call to
|
||||
// ripple_path_find. doRipplePathFind() is currently running
|
||||
// inside of a JobQueue::Coro using a JobQueue thread.
|
||||
//
|
||||
// 2. doRipplePathFind's call to makeLegacyPathRequest() enqueues the
|
||||
// path-finding request. That request will (probably) run at some
|
||||
// indeterminate future time on a (probably different) JobQueue
|
||||
// thread.
|
||||
//
|
||||
// 3. As a continuation from that path-finding JobQueue thread, the
|
||||
// coroutine we're currently running in (!) is posted to the
|
||||
// JobQueue. Because it is a continuation, that post won't
|
||||
// happen until the path-finding request completes.
|
||||
//
|
||||
// 4. Once the continuation is enqueued, and we have reason to think
|
||||
// the path-finding job is likely to run, then the coroutine we're
|
||||
// running in yield()s. That means it surrenders its thread in
|
||||
// the JobQueue. The coroutine is suspended, but ready to run,
|
||||
// because it is kept resident by a shared_ptr in the
|
||||
// path-finding continuation.
|
||||
//
|
||||
// 5. If all goes well then path-finding runs on a JobQueue thread
|
||||
// and executes its continuation. The continuation posts this
|
||||
// same coroutine (!) to the JobQueue.
|
||||
//
|
||||
// 6. When the JobQueue calls this coroutine, this coroutine resumes
|
||||
// from the line below the coro->yield() and returns the
|
||||
// path-finding result.
|
||||
//
|
||||
// With so many moving parts, what could go wrong?
|
||||
//
|
||||
// Just in terms of the JobQueue refusing to add jobs at shutdown
|
||||
// there are two specific things that can go wrong.
|
||||
//
|
||||
// 1. The path-finding Job queued by makeLegacyPathRequest() might be
|
||||
// rejected (because we're shutting down).
|
||||
//
|
||||
// Fortunately this problem can be addressed by looking at the
|
||||
// return value of makeLegacyPathRequest(). If
|
||||
// makeLegacyPathRequest() cannot get a thread to run the path-find
|
||||
// on, then it returns an empty request.
|
||||
//
|
||||
// 2. The path-finding job might run, but the Coro::post() might be
|
||||
// rejected by the JobQueue (because we're shutting down).
|
||||
//
|
||||
// We handle this case by resuming (not posting) the Coro.
|
||||
// By resuming the Coro, we allow the Coro to run to completion
|
||||
// on the current thread instead of requiring that it run on a
|
||||
// new thread from the JobQueue.
|
||||
//
|
||||
// Both of these failure modes are hard to recreate in a unit test
|
||||
// because they are so dependent on inter-thread timing. However
|
||||
// the failure modes can be observed by synchronously (inside the
|
||||
// rippled source code) shutting down the application. The code to
|
||||
// do so looks like this:
|
||||
//
|
||||
// context.app.signalStop();
|
||||
// while (! context.app.getJobQueue().jobCounter().joined()) { }
|
||||
//
|
||||
// The first line starts the process of shutting down the app.
|
||||
// The second line waits until no more jobs can be added to the
|
||||
// JobQueue before letting the thread continue.
|
||||
//
|
||||
// May 2017
|
||||
// makeLegacyPathRequest enqueues a path-finding job that runs
|
||||
// asynchronously. We block this thread with a condition_variable
|
||||
// until the path-finding continuation signals completion.
|
||||
// If makeLegacyPathRequest cannot schedule the job (e.g. during
|
||||
// shutdown), it returns an empty request and we skip the wait.
|
||||
std::mutex mtx;
|
||||
std::condition_variable cv;
|
||||
bool pathDone = false;
|
||||
|
||||
jvResult = context.app.getPathRequests().makeLegacyPathRequest(
|
||||
request,
|
||||
[&context]() {
|
||||
// Copying the shared_ptr keeps the coroutine alive up
|
||||
// through the return. Otherwise the storage under the
|
||||
// captured reference could evaporate when we return from
|
||||
// coroCopy->resume(). This is not strictly necessary, but
|
||||
// will make maintenance easier.
|
||||
std::shared_ptr<JobQueue::Coro> coroCopy{context.coro};
|
||||
if (!coroCopy->post())
|
||||
[&]() {
|
||||
{
|
||||
// The post() failed, so we won't get a thread to let
|
||||
// the Coro finish. We'll call Coro::resume() so the
|
||||
// Coro can finish on our thread. Otherwise the
|
||||
// application will hang on shutdown.
|
||||
coroCopy->resume();
|
||||
std::lock_guard lk(mtx);
|
||||
pathDone = true;
|
||||
}
|
||||
cv.notify_one();
|
||||
},
|
||||
context.consumer,
|
||||
lpLedger,
|
||||
context.params);
|
||||
if (request)
|
||||
{
|
||||
context.coro->yield();
|
||||
std::unique_lock lk(mtx);
|
||||
cv.wait(lk, [&] { return pathDone; });
|
||||
jvResult = request->doStatus(context.params);
|
||||
}
|
||||
|
||||
|
||||
Reference in New Issue
Block a user