Verified CPU, memory, and network overhead calculations against official OTel C++ SDK benchmarks (969 CI runs) and source code analysis. Key corrections: - Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median ~1000ns; original estimate matched API no-op, not SDK path) - Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%) - Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper + SpanData + std::map attribute storage) - Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread stack ~8MB was omitted) - Total memory ceiling: ~2.3MB → ~10MB - Memory success metric target: <5MB → <10MB - AddEvent: 50-80ns → 100-200ns Added Section 3.5.4 with links to all benchmark sources. Updated presentation.md with matching corrections. High-level conclusions unchanged (1-3% CPU, negligible consensus). Also includes: review fixes, cross-document consistency improvements, additional component tracing docs (PathFinding, TxQ, Validator, etc.), context size corrections (32 → 25 bytes). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 KiB
OpenTelemetry Distributed Tracing Implementation Plan for rippled (xrpld)
Executive Summary
OTLP = OpenTelemetry Protocol
This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.
Key Benefits
- End-to-end transaction visibility: Track transactions from submission through consensus to ledger inclusion
- Consensus round analysis: Understand timing and behavior of consensus phases across validators
- RPC performance insights: Identify slow handlers and optimize response times
- Network topology understanding: Visualize message propagation patterns between peers
- Incident debugging: Correlate events across distributed nodes during issues
Estimated Performance Overhead
| Metric | Overhead | Notes |
|---|---|---|
| CPU | 1-3% | Span creation and attribute setting |
| Memory | 2-5 MB | Batch buffer for pending spans |
| Network | 10-50 KB/s | Compressed OTLP export to collector |
| Latency (p99) | <2% | With proper sampling configuration |
Document Structure
This implementation plan is organized into modular documents for easier navigation:
flowchart TB
overview["📋 OpenTelemetryPlan.md<br/>(This Document)"]
subgraph fundamentals["Fundamentals"]
fund["00-tracing-fundamentals.md"]
end
subgraph analysis["Analysis & Design"]
arch["01-architecture-analysis.md"]
design["02-design-decisions.md"]
end
subgraph impl["Implementation"]
strategy["03-implementation-strategy.md"]
code["04-code-samples.md"]
config["05-configuration-reference.md"]
end
subgraph deploy["Deployment & Planning"]
phases["06-implementation-phases.md"]
backends["07-observability-backends.md"]
appendix["08-appendix.md"]
poc["POC_taskList.md"]
end
overview --> fundamentals
overview --> analysis
overview --> impl
overview --> deploy
fund --> arch
arch --> design
design --> strategy
strategy --> code
code --> config
config --> phases
phases --> backends
backends --> appendix
phases --> poc
style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
style fundamentals fill:#00695c,stroke:#004d40,color:#fff
style fund fill:#00695c,stroke:#004d40,color:#fff
style analysis fill:#0d47a1,stroke:#082f6a,color:#fff
style impl fill:#bf360c,stroke:#8c2809,color:#fff
style deploy fill:#4a148c,stroke:#2e0d57,color:#fff
style arch fill:#0d47a1,stroke:#082f6a,color:#fff
style design fill:#0d47a1,stroke:#082f6a,color:#fff
style strategy fill:#bf360c,stroke:#8c2809,color:#fff
style code fill:#bf360c,stroke:#8c2809,color:#fff
style config fill:#bf360c,stroke:#8c2809,color:#fff
style phases fill:#4a148c,stroke:#2e0d57,color:#fff
style backends fill:#4a148c,stroke:#2e0d57,color:#fff
style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
style poc fill:#4a148c,stroke:#2e0d57,color:#fff
Table of Contents
| Section | Document | Description |
|---|---|---|
| 0 | Tracing Fundamentals | Distributed tracing concepts, span relationships, context propagation |
| 1 | Architecture Analysis | rippled component analysis, trace points, instrumentation priorities |
| 2 | Design Decisions | SDK selection, exporters, span naming, attributes, context propagation |
| 3 | Implementation Strategy | Directory structure, key principles, performance optimization |
| 4 | Code Samples | C++ implementation examples for core infrastructure and key modules |
| 5 | Configuration Reference | rippled config, CMake integration, Collector configurations |
| 6 | Implementation Phases | 5-phase timeline, tasks, risks, success metrics |
| 7 | Observability Backends | Backend selection guide and production architecture |
| 8 | Appendix | Glossary, references, version history |
| POC | POC Task List | Proof of concept tasks for RPC tracing end-to-end demo |
0. Tracing Fundamentals
This document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to rippled-specific scenarios like transaction relay and consensus.
1. Architecture Analysis
WS = WebSocket | TxQ = Transaction Queue
The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).
Key trace points span across transaction submission via RPC, peer-to-peer message propagation, consensus round execution, ledger building, path computation, transaction queue behavior, fee escalation, and validator health. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.
➡️ Read full Architecture Analysis
2. Design Decisions
OTLP = OpenTelemetry Protocol | CNCF = Cloud Native Computing Foundation
The OpenTelemetry C++ SDK is selected for its CNCF backing, active development, and native performance characteristics. Traces are exported via OTLP/gRPC (primary) or OTLP/HTTP (fallback) to an OpenTelemetry Collector, which provides flexible routing and sampling.
Span naming follows a hierarchical <component>.<operation> convention (e.g., rpc.submit, tx.relay, consensus.round). Context propagation uses W3C Trace Context headers for HTTP and embedded Protocol Buffer fields for P2P messages. The implementation coexists with existing PerfLog and Insight observability systems through correlation IDs.
Data Collection & Privacy: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control over telemetry configuration.
3. Implementation Strategy
The telemetry code is organized under include/xrpl/telemetry/ for headers and src/libxrpl/telemetry/ for implementation. Key principles include RAII-based span management via SpanGuard, conditional compilation with XRPL_ENABLE_TELEMETRY, and minimal runtime overhead through batch processing and efficient sampling.
Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
➡️ Read full Implementation Strategy
4. Code Samples
C++ implementation examples are provided for the core telemetry infrastructure and key modules:
Telemetry.h- Core interface for tracer access and span creationSpanGuard.h- RAII wrapper for automatic span lifecycle managementTracingInstrumentation.h- Macros for conditional instrumentation- Protocol Buffer extensions for trace context propagation
- Module-specific instrumentation (RPC, Consensus, P2P, JobQueue)
- Remaining modules (PathFinding, TxQ, Validator, etc.) follow the same patterns
5. Configuration Reference
OTLP = OpenTelemetry Protocol | APM = Application Performance Monitoring
Configuration is handled through the [telemetry] section in xrpld.cfg with options for enabling/disabling, exporter selection, endpoint configuration, sampling ratios, and component-level filtering. CMake integration includes a XRPL_ENABLE_TELEMETRY option for compile-time control.
OpenTelemetry Collector configurations are provided for development and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup.
➡️ View full Configuration Reference
6. Implementation Phases
The implementation spans 9 weeks across 5 phases:
| Phase | Duration | Focus | Key Deliverables |
|---|---|---|---|
| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |
Total Effort: 47 person-days (2 developers working in parallel)
➡️ View full Implementation Phases
7. Observability Backends
APM = Application Performance Monitoring | GCS = Google Cloud Storage
Grafana Tempo is recommended for all environments due to its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure.
The recommended production architecture uses a gateway collector pattern with regional collectors performing tail-based sampling, routing traces to multiple backends (Tempo for primary storage, Elastic for log correlation, S3/GCS for long-term archive).
➡️ View Observability Backend Recommendations
8. Appendix
The appendix contains a glossary of OpenTelemetry and rippled-specific terms, references to external documentation and specifications, version history for this implementation plan, and a complete document index.
POC Task List
A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana.
This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents.