Files
rippled/OpenTelemetryPlan/03-implementation-strategy.md
Pratik Mankawde 34243e0cc2 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 19:03:09 +00:00

452 lines
17 KiB
Markdown

# Implementation Strategy
> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
> **Related**: [Code Samples](./04-code-samples.md) | [Configuration Reference](./05-configuration-reference.md)
---
## 3.1 Directory Structure
The telemetry implementation follows rippled's existing code organization pattern:
```
include/xrpl/
├── telemetry/
│ ├── Telemetry.h # Main telemetry interface
│ ├── TelemetryConfig.h # Configuration structures
│ ├── TraceContext.h # Context propagation utilities
│ ├── SpanGuard.h # RAII span management
│ └── SpanAttributes.h # Attribute helper functions
src/libxrpl/
├── telemetry/
│ ├── Telemetry.cpp # Implementation
│ ├── TelemetryConfig.cpp # Config parsing
│ ├── TraceContext.cpp # Context serialization
│ └── NullTelemetry.cpp # No-op implementation
src/xrpld/
├── telemetry/
│ ├── TracingInstrumentation.h # Instrumentation macros
│ └── TracingInstrumentation.cpp
```
---
## 3.2 Implementation Approach
<div align="center">
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 30}}}%%
flowchart TB
subgraph phase1["Phase 1: Core"]
direction LR
sdk["SDK Integration"] ~~~ interface["Telemetry Interface"] ~~~ config["Configuration"]
end
subgraph phase2["Phase 2: RPC"]
direction LR
http["HTTP Context"] ~~~ rpc["RPC Handlers"]
end
subgraph phase3["Phase 3: P2P"]
direction LR
proto["Protobuf Context"] ~~~ tx["Transaction Relay"]
end
subgraph phase4["Phase 4: Consensus"]
direction LR
consensus["Consensus Rounds"] ~~~ proposals["Proposals"]
end
phase1 --> phase2 --> phase3 --> phase4
style phase1 fill:#1565c0,stroke:#0d47a1,color:#ffffff
style phase2 fill:#2e7d32,stroke:#1b5e20,color:#ffffff
style phase3 fill:#e65100,stroke:#bf360c,color:#ffffff
style phase4 fill:#c2185b,stroke:#880e4f,color:#ffffff
```
</div>
### Key Principles
1. **Minimal Intrusion**: Instrumentation should not alter existing control flow
2. **Zero-Cost When Disabled**: Use compile-time flags and no-op implementations
3. **Backward Compatibility**: Protocol Buffer extensions use high field numbers
4. **Graceful Degradation**: Tracing failures must not affect node operation
---
## 3.3 Performance Overhead Summary
| Metric | Overhead | Notes |
| ------------- | ---------- | ----------------------------------- |
| CPU | 1-3% | Span creation and attribute setting |
| Memory | 2-5 MB | Batch buffer for pending spans |
| Network | 10-50 KB/s | Compressed OTLP export to collector |
| Latency (p99) | <2% | With proper sampling configuration |
---
## 3.4 Detailed CPU Overhead Analysis
### 3.4.1 Per-Operation Costs
| Operation | Time (ns) | Frequency | Impact |
| --------------------- | --------- | ---------------------- | ---------- |
| Span creation | 200-500 | Every traced operation | Low |
| Span end | 100-200 | Every traced operation | Low |
| SetAttribute (string) | 80-120 | 3-5 per span | Low |
| SetAttribute (int) | 40-60 | 2-3 per span | Negligible |
| AddEvent | 50-80 | 0-2 per span | Negligible |
| Context injection | 150-250 | Per outgoing message | Low |
| Context extraction | 100-180 | Per incoming message | Low |
| GetCurrent context | 10-20 | Thread-local access | Negligible |
### 3.4.2 Transaction Processing Overhead
<div align="center">
```mermaid
%%{init: {'pie': {'textPosition': 0.75}}}%%
pie showData
"tx.receive (800ns)" : 800
"tx.validate (500ns)" : 500
"tx.relay (500ns)" : 500
"Context inject (600ns)" : 600
```
**Transaction Tracing Overhead (~2.4μs total)**
</div>
**Overhead percentage**: 2.4 μs / 200 μs (avg tx processing) = **~1.2%**
### 3.4.3 Consensus Round Overhead
| Operation | Count | Cost (ns) | Total |
| ---------------------- | ----- | --------- | ---------- |
| consensus.round span | 1 | ~1000 | ~1 μs |
| consensus.phase spans | 3 | ~700 | ~2.1 μs |
| proposal.receive spans | ~20 | ~600 | ~12 μs |
| proposal.send spans | ~3 | ~600 | ~1.8 μs |
| Context operations | ~30 | ~200 | ~6 μs |
| **TOTAL** | | | **~23 μs** |
**Overhead percentage**: 23 μs / 3s (typical round) = **~0.0008%** (negligible)
### 3.4.4 RPC Request Overhead
| Operation | Cost (ns) |
| ---------------- | ------------ |
| rpc.request span | ~700 |
| rpc.command span | ~600 |
| Context extract | ~250 |
| Context inject | ~200 |
| **TOTAL** | **~1.75 μs** |
- Fast RPC (1ms): 1.75 μs / 1ms = **~0.175%**
- Slow RPC (100ms): 1.75 μs / 100ms = **~0.002%**
---
## 3.5 Memory Overhead Analysis
### 3.5.1 Static Memory
| Component | Size | Allocated |
| ------------------------ | ----------- | ---------- |
| TracerProvider singleton | ~64 KB | At startup |
| BatchSpanProcessor | ~128 KB | At startup |
| OTLP exporter | ~256 KB | At startup |
| Propagator registry | ~8 KB | At startup |
| **Total static** | **~456 KB** | |
### 3.5.2 Dynamic Memory
| Component | Size per unit | Max units | Peak |
| -------------------- | ------------- | ---------- | ----------- |
| Active span | ~200 bytes | 1000 | ~200 KB |
| Queued span (export) | ~500 bytes | 2048 | ~1 MB |
| Attribute storage | ~50 bytes | 5 per span | Included |
| Context storage | ~64 bytes | Per thread | ~6.4 KB |
| **Total dynamic** | | | **~1.2 MB** |
### 3.5.3 Memory Growth Characteristics
```mermaid
---
config:
xyChart:
width: 700
height: 400
---
xychart-beta
title "Memory Usage vs Span Rate"
x-axis "Spans/second" [0, 200, 400, 600, 800, 1000]
y-axis "Memory (MB)" 0 --> 6
line [1, 1.8, 2.6, 3.4, 4.2, 5]
```
**Notes**:
- Memory increases linearly with span rate
- Batch export prevents unbounded growth
- Queue size is configurable (default 2048 spans)
- At queue limit, oldest spans are dropped (not blocked)
---
## 3.6 Network Overhead Analysis
### 3.6.1 Export Bandwidth
| Sampling Rate | Spans/sec | Bandwidth | Notes |
| ------------- | --------- | --------- | ---------------- |
| 100% | ~500 | ~250 KB/s | Development only |
| 10% | ~50 | ~25 KB/s | Staging |
| 1% | ~5 | ~2.5 KB/s | Production |
| Error-only | ~1 | ~0.5 KB/s | Minimal overhead |
### 3.6.2 Trace Context Propagation
| Message Type | Context Size | Messages/sec | Overhead |
| ---------------------- | ------------ | ------------ | ----------- |
| TMTransaction | 32 bytes | ~100 | ~3.2 KB/s |
| TMProposeSet | 32 bytes | ~10 | ~320 B/s |
| TMValidation | 32 bytes | ~50 | ~1.6 KB/s |
| **Total P2P overhead** | | | **~5 KB/s** |
---
## 3.7 Optimization Strategies
### 3.7.1 Sampling Strategies
```mermaid
flowchart TD
trace["New Trace"]
trace --> errors{"Is Error?"}
errors -->|Yes| sample["SAMPLE"]
errors -->|No| consensus{"Is Consensus?"}
consensus -->|Yes| sample
consensus -->|No| slow{"Is Slow?"}
slow -->|Yes| sample
slow -->|No| prob{"Random < 10%?"}
prob -->|Yes| sample
prob -->|No| drop["DROP"]
style sample fill:#4caf50,stroke:#388e3c,color:#fff
style drop fill:#f44336,stroke:#c62828,color:#fff
```
### 3.7.2 Batch Tuning Recommendations
| Environment | Batch Size | Batch Delay | Max Queue |
| ------------------ | ---------- | ----------- | --------- |
| Low-latency | 128 | 1000ms | 512 |
| High-throughput | 1024 | 10000ms | 8192 |
| Memory-constrained | 256 | 2000ms | 512 |
### 3.7.3 Conditional Instrumentation
```cpp
// Compile-time feature flag
#ifndef XRPL_ENABLE_TELEMETRY
// Zero-cost when disabled
#define XRPL_TRACE_SPAN(t, n) ((void)0)
#endif
// Runtime component filtering
if (telemetry.shouldTracePeer())
{
XRPL_TRACE_SPAN(telemetry, "peer.message.receive");
// ... instrumentation
}
// No overhead when component tracing disabled
```
---
## 3.8 Links to Detailed Documentation
- **[Code Samples](./04-code-samples.md)**: Complete implementation code for all components
- **[Configuration Reference](./05-configuration-reference.md)**: Configuration options and collector setup
- **[Implementation Phases](./06-implementation-phases.md)**: Detailed timeline and milestones
---
## 3.9 Code Intrusiveness Assessment
This section provides a detailed assessment of how intrusive the OpenTelemetry integration is to the existing rippled codebase.
### 3.9.1 Files Modified Summary
| Component | Files Modified | Lines Added | Lines Changed | Architectural Impact |
| --------------------- | -------------- | ----------- | ------------- | -------------------- |
| **Core Telemetry** | 5 new files | ~800 | 0 | None (new module) |
| **Application Init** | 2 files | ~30 | ~5 | Minimal |
| **RPC Layer** | 3 files | ~80 | ~20 | Minimal |
| **Transaction Relay** | 4 files | ~120 | ~40 | Low |
| **Consensus** | 3 files | ~100 | ~30 | Low-Medium |
| **Protocol Buffers** | 1 file | ~25 | 0 | Low |
| **CMake/Build** | 3 files | ~50 | ~10 | Minimal |
| **Total** | **~21 files** | **~1,205** | **~105** | **Low** |
### 3.9.2 Detailed File Impact
```mermaid
pie title Code Changes by Component
"New Telemetry Module" : 800
"Transaction Relay" : 160
"Consensus" : 130
"RPC Layer" : 100
"Application Init" : 35
"Protocol Buffers" : 25
"Build System" : 60
```
#### New Files (No Impact on Existing Code)
| File | Lines | Purpose |
| ---------------------------------------------- | ----- | -------------------- |
| `include/xrpl/telemetry/Telemetry.h` | ~160 | Main interface |
| `include/xrpl/telemetry/SpanGuard.h` | ~120 | RAII wrapper |
| `include/xrpl/telemetry/TraceContext.h` | ~80 | Context propagation |
| `src/xrpld/telemetry/TracingInstrumentation.h` | ~60 | Macros |
| `src/libxrpl/telemetry/Telemetry.cpp` | ~200 | Implementation |
| `src/libxrpl/telemetry/TelemetryConfig.cpp` | ~60 | Config parsing |
| `src/libxrpl/telemetry/NullTelemetry.cpp` | ~40 | No-op implementation |
#### Modified Files (Existing Rippled Code)
| File | Lines Added | Lines Changed | Risk Level |
| ------------------------------------------------- | ----------- | ------------- | ---------- |
| `src/xrpld/app/main/Application.cpp` | ~15 | ~3 | Low |
| `include/xrpl/app/main/Application.h` | ~5 | ~2 | Low |
| `src/xrpld/rpc/detail/ServerHandler.cpp` | ~40 | ~10 | Low |
| `src/xrpld/rpc/handlers/*.cpp` | ~30 | ~8 | Low |
| `src/xrpld/overlay/detail/PeerImp.cpp` | ~60 | ~15 | Medium |
| `src/xrpld/overlay/detail/OverlayImpl.cpp` | ~30 | ~10 | Medium |
| `src/xrpld/app/consensus/RCLConsensus.cpp` | ~50 | ~15 | Medium |
| `src/xrpld/app/consensus/RCLConsensusAdaptor.cpp` | ~40 | ~12 | Medium |
| `src/xrpld/core/JobQueue.cpp` | ~20 | ~5 | Low |
| `src/xrpld/overlay/detail/ripple.proto` | ~25 | 0 | Low |
| `CMakeLists.txt` | ~40 | ~8 | Low |
| `cmake/FindOpenTelemetry.cmake` | ~50 | 0 | None (new) |
### 3.9.3 Risk Assessment by Component
<div align="center">
**Do First** ↖ ↗ **Plan Carefully**
```mermaid
quadrantChart
title Code Intrusiveness Risk Matrix
x-axis Low Risk --> High Risk
y-axis Low Value --> High Value
RPC Tracing: [0.2, 0.8]
Transaction Relay: [0.5, 0.9]
Consensus Tracing: [0.7, 0.95]
Peer Message Tracing: [0.8, 0.4]
JobQueue Context: [0.4, 0.5]
Ledger Acquisition: [0.5, 0.6]
```
**Optional** ↙ ↘ **Avoid**
</div>
#### Risk Level Definitions
| Risk Level | Definition | Mitigation |
| ---------- | ---------------------------------------------------------------- | ---------------------------------- |
| **Low** | Additive changes only; no modification to existing logic | Standard code review |
| **Medium** | Minor modifications to existing functions; clear boundaries | Comprehensive unit tests |
| **High** | Changes to core logic or data structures; potential side effects | Integration tests + staged rollout |
### 3.9.4 Architectural Impact Assessment
| Aspect | Impact | Justification |
| -------------------- | ------- | --------------------------------------------------------------------- |
| **Data Flow** | None | Tracing is purely observational; no business logic changes |
| **Threading Model** | Minimal | Context propagation uses thread-local storage (standard OTel pattern) |
| **Memory Model** | Low | Bounded queues prevent unbounded growth; RAII ensures cleanup |
| **Network Protocol** | Low | Optional fields in protobuf (high field numbers); backward compatible |
| **Configuration** | None | New config section; existing configs unaffected |
| **Build System** | Low | Optional CMake flag; builds work without OpenTelemetry |
| **Dependencies** | Low | OpenTelemetry SDK is optional; null implementation when disabled |
### 3.9.5 Backward Compatibility
| Compatibility | Status | Notes |
| --------------- | ------- | ----------------------------------------------------- |
| **Config File** | ✅ Full | New `[telemetry]` section is optional |
| **Protocol** | ✅ Full | Optional protobuf fields with high field numbers |
| **Build** | ✅ Full | `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary |
| **Runtime** | ✅ Full | `enabled=0` produces zero overhead |
| **API** | ✅ Full | No changes to public RPC or P2P APIs |
### 3.9.6 Rollback Strategy
If issues are discovered after deployment:
1. **Immediate**: Set `enabled=0` in config and restart (zero code change)
2. **Quick**: Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`
3. **Complete**: Revert telemetry commits (clean separation makes this easy)
### 3.9.7 Code Change Examples
**Minimal RPC Instrumentation (Low Intrusiveness):**
```cpp
// Before
void ServerHandler::onRequest(...) {
auto result = processRequest(req);
send(result);
}
// After (only ~10 lines added)
void ServerHandler::onRequest(...) {
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request"); // +1 line
XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command); // +1 line
auto result = processRequest(req);
XRPL_TRACE_SET_ATTR("xrpl.rpc.status", status); // +1 line
send(result);
}
```
**Consensus Instrumentation (Medium Intrusiveness):**
```cpp
// Before
void RCLConsensusAdaptor::startRound(...) {
// ... existing logic
}
// After (context storage required)
void RCLConsensusAdaptor::startRound(...) {
XRPL_TRACE_CONSENSUS(app_.getTelemetry(), "consensus.round");
XRPL_TRACE_SET_ATTR("xrpl.consensus.ledger.seq", seq);
// Store context for child spans in phase transitions
currentRoundContext_ = _xrpl_guard_->context(); // New member variable
// ... existing logic unchanged
}
```
---
_Previous: [Design Decisions](./02-design-decisions.md)_ | _Next: [Code Samples](./04-code-samples.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_