Pratik Mankawde f135842071 docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00


OpenTelemetry Distributed Tracing Implementation Plan for rippled (xrpld)

Executive Summary

OTLP = OpenTelemetry Protocol

This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. The plan addresses the unique challenges of a decentralized peer-to-peer system where trace context must propagate across network boundaries between independent nodes.

Key Benefits

  • End-to-end transaction visibility: Track transactions from submission through consensus to ledger inclusion
  • Consensus round analysis: Understand timing and behavior of consensus phases across validators
  • RPC performance insights: Identify slow handlers and optimize response times
  • Network topology understanding: Visualize message propagation patterns between peers
  • Incident debugging: Correlate events across distributed nodes during issues

Estimated Performance Overhead

| Metric | Overhead | Notes |
|---|---|---|
| CPU | 1-3% | Span creation and attribute setting |
| Memory | 2-5 MB | Batch buffer for pending spans |
| Network | 10-50 KB/s | Compressed OTLP export to collector |
| Latency (p99) | <2% | With proper sampling configuration |

Document Structure

This implementation plan is organized into modular documents for easier navigation:

```mermaid
flowchart TB
    overview["📋 OpenTelemetryPlan.md<br/>(This Document)"]

    subgraph fundamentals["Fundamentals"]
        fund["00-tracing-fundamentals.md"]
    end

    subgraph analysis["Analysis & Design"]
        arch["01-architecture-analysis.md"]
        design["02-design-decisions.md"]
    end

    subgraph impl["Implementation"]
        strategy["03-implementation-strategy.md"]
        code["04-code-samples.md"]
        config["05-configuration-reference.md"]
    end

    subgraph deploy["Deployment & Planning"]
        phases["06-implementation-phases.md"]
        backends["07-observability-backends.md"]
        appendix["08-appendix.md"]
        poc["POC_taskList.md"]
    end

    overview --> fundamentals
    overview --> analysis
    overview --> impl
    overview --> deploy

    fund --> arch
    arch --> design
    design --> strategy
    strategy --> code
    code --> config
    config --> phases
    phases --> backends
    backends --> appendix
    phases --> poc

    style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
    style fundamentals fill:#00695c,stroke:#004d40,color:#fff
    style fund fill:#00695c,stroke:#004d40,color:#fff
    style analysis fill:#0d47a1,stroke:#082f6a,color:#fff
    style impl fill:#bf360c,stroke:#8c2809,color:#fff
    style deploy fill:#4a148c,stroke:#2e0d57,color:#fff
    style arch fill:#0d47a1,stroke:#082f6a,color:#fff
    style design fill:#0d47a1,stroke:#082f6a,color:#fff
    style strategy fill:#bf360c,stroke:#8c2809,color:#fff
    style code fill:#bf360c,stroke:#8c2809,color:#fff
    style config fill:#bf360c,stroke:#8c2809,color:#fff
    style phases fill:#4a148c,stroke:#2e0d57,color:#fff
    style backends fill:#4a148c,stroke:#2e0d57,color:#fff
    style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
    style poc fill:#4a148c,stroke:#2e0d57,color:#fff
```

Table of Contents

| Section | Document | Description |
|---|---|---|
| 0 | Tracing Fundamentals | Distributed tracing concepts, span relationships, context propagation |
| 1 | Architecture Analysis | rippled component analysis, trace points, instrumentation priorities |
| 2 | Design Decisions | SDK selection, exporters, span naming, attributes, context propagation |
| 3 | Implementation Strategy | Directory structure, key principles, performance optimization |
| 4 | Code Samples | C++ implementation examples for core infrastructure and key modules |
| 5 | Configuration Reference | rippled config, CMake integration, Collector configurations |
| 6 | Implementation Phases | 5-phase timeline, tasks, risks, success metrics |
| 7 | Observability Backends | Backend selection guide and production architecture |
| 8 | Appendix | Glossary, references, version history |
| POC | POC Task List | Proof-of-concept tasks for RPC tracing end-to-end demo |

0. Tracing Fundamentals

This sub-document introduces distributed tracing concepts for readers unfamiliar with the domain. It covers what traces and spans are, how parent-child and follows-from relationships model causality, how context propagates across service boundaries, and how sampling controls data volume. It also maps these concepts to rippled-specific scenarios such as transaction relay and consensus.

➡️ Read Tracing Fundamentals


1. Architecture Analysis

WS = WebSocket | TxQ = Transaction Queue

The rippled node consists of several key components that require instrumentation for comprehensive distributed tracing. The main areas include the RPC server (HTTP/WebSocket), Overlay P2P network, Consensus mechanism (RCLConsensus), JobQueue for async task execution, PathFinding, Transaction Queue (TxQ), fee escalation (LoadManager), ledger acquisition, validator management, and existing observability infrastructure (PerfLog, Insight/StatsD, Journal logging).

Key trace points include transaction submission via RPC, peer-to-peer message propagation, consensus round execution, ledger building, path computation, transaction queue behavior, fee escalation, and validator health. The implementation prioritizes high-value, low-risk components first: RPC handlers provide immediate value with minimal risk, while consensus tracing requires careful implementation to avoid timing impacts.

➡️ Read full Architecture Analysis


2. Design Decisions

OTLP = OpenTelemetry Protocol | CNCF = Cloud Native Computing Foundation

The OpenTelemetry C++ SDK is selected for its CNCF backing, active development, and native performance characteristics. Traces are exported via OTLP/gRPC (primary) or OTLP/HTTP (fallback) to an OpenTelemetry Collector, which provides flexible routing and sampling.

Span naming follows a hierarchical <component>.<operation> convention (e.g., rpc.submit, tx.relay, consensus.round). Context propagation uses W3C Trace Context headers for HTTP and embedded Protocol Buffer fields for P2P messages. The implementation coexists with existing PerfLog and Insight observability systems through correlation IDs.

Data Collection & Privacy: Telemetry collects only operational metadata (timing, counts, hashes) — never sensitive content (private keys, balances, amounts, raw payloads). Privacy protection includes account hashing, configurable redaction, sampling, and collector-level filtering. Node operators retain full control over telemetry configuration.
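To make the W3C propagation choice concrete, the sketch below builds a `traceparent` header value in the format defined by the W3C Trace Context specification (`version-traceid-parentid-flags`). The helper name is illustrative, not from the rippled codebase; in practice the OTel SDK's propagator does this when the RPC server injects or extracts context.

```cpp
#include <cassert>
#include <string>

// Build a W3C Trace Context `traceparent` header value:
//   version "00", 32-hex-char trace-id, 16-hex-char span-id,
//   and a 2-hex-char flags byte, joined by '-'.
// Helper name is illustrative; the SDK's TextMapPropagator normally
// performs this injection.
std::string
makeTraceparent(
    std::string const& traceIdHex,  // 32 lowercase hex chars
    std::string const& spanIdHex,   // 16 lowercase hex chars
    bool sampled)
{
    return "00-" + traceIdHex + "-" + spanIdHex + (sampled ? "-01" : "-00");
}
```

The trace-id and span-id values used below are the canonical examples from the W3C specification.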

➡️ Read full Design Decisions


3. Implementation Strategy

The telemetry code is organized under include/xrpl/telemetry/ for headers and src/libxrpl/telemetry/ for implementation. Key principles include RAII-based span management via SpanGuard, conditional compilation with XRPL_ENABLE_TELEMETRY, and minimal runtime overhead through batch processing and efficient sampling.

Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
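The RAII idea behind SpanGuard can be sketched as follows. The `Span` struct here is a stand-in to keep the example self-contained; the real guard would wrap an `opentelemetry::trace::Span` from the SDK, and the details of the actual class live in 04-code-samples.md.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for the SDK span type; the real guard would hold an
// opentelemetry::trace::Span. Used here only for self-containment.
struct Span
{
    std::string name;
    bool ended = false;
    void End() { ended = true; }
};

// RAII guard: End() is called when the guard leaves scope, so every
// early return and exception path still closes the span.
class SpanGuard
{
public:
    explicit SpanGuard(Span* span) : span_(span) {}

    SpanGuard(SpanGuard const&) = delete;
    SpanGuard& operator=(SpanGuard const&) = delete;

    // Movable so a guard can be returned from a factory function.
    SpanGuard(SpanGuard&& other) noexcept
        : span_(std::exchange(other.span_, nullptr))
    {
    }

    ~SpanGuard()
    {
        if (span_)
            span_->End();
    }

private:
    Span* span_;
};
```

Because the span is closed in the destructor, instrumented handlers need no explicit cleanup code on any exit path.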

➡️ Read full Implementation Strategy


4. Code Samples

C++ implementation examples are provided for the core telemetry infrastructure and key modules:

  • Telemetry.h - Core interface for tracer access and span creation
  • SpanGuard.h - RAII wrapper for automatic span lifecycle management
  • TracingInstrumentation.h - Macros for conditional instrumentation
  • Protocol Buffer extensions for trace context propagation
  • Module-specific instrumentation (RPC, Consensus, P2P, JobQueue)
  • Remaining modules (PathFinding, TxQ, Validator, etc.) follow the same patterns
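The conditional-instrumentation idea can be sketched as below. The macro name follows the plan's `XRPL_ENABLE_TELEMETRY` convention, but the exact macro and `startSpan` helper are assumptions for illustration; the counter only exists to make the no-op property observable.

```cpp
#include <cassert>
#include <string>

// Counter stands in for real tracer work, so the disabled build's
// no-op behavior can be observed in this sketch.
static int g_spansStarted = 0;

void
startSpan(std::string const& /*name*/)
{
    ++g_spansStarted;
}

// When XRPL_ENABLE_TELEMETRY is not defined at build time, the macro
// expands to a no-op, so instrumented call sites cost nothing.
#ifdef XRPL_ENABLE_TELEMETRY
#define XRPL_TRACE_SPAN(name) startSpan(name)
#else
#define XRPL_TRACE_SPAN(name) ((void)0)
#endif

void
submitHandler()
{
    XRPL_TRACE_SPAN("rpc.submit");  // no-op in a telemetry-disabled build
}
```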

➡️ View all Code Samples


5. Configuration Reference

OTLP = OpenTelemetry Protocol | APM = Application Performance Monitoring

Configuration is handled through the [telemetry] section in xrpld.cfg with options for enabling/disabling, exporter selection, endpoint configuration, sampling ratios, and component-level filtering. CMake integration includes a XRPL_ENABLE_TELEMETRY option for compile-time control.
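As a sketch of what that section might look like (the option names and values below are illustrative assumptions; the authoritative list is in 05-configuration-reference.md):

```ini
[telemetry]
# Illustrative option names, not the final schema
enabled = 1
exporter = otlp_grpc          # or otlp_http as fallback
endpoint = localhost:4317
sampling_ratio = 0.10         # 10% probabilistic head sampling
```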

OpenTelemetry Collector configurations are provided for development and production (with tail-based sampling, Tempo, and Elastic APM). Docker Compose examples enable quick local development environment setup.

➡️ View full Configuration Reference


6. Implementation Phases

The implementation spans 9 weeks across 5 phases:

| Phase | Duration | Focus | Key Deliverables |
|---|---|---|---|
| 1 | Weeks 1-2 | Core Infrastructure | SDK integration, Telemetry interface, Configuration |
| 2 | Weeks 3-4 | RPC Tracing | HTTP context extraction, Handler instrumentation |
| 3 | Weeks 5-6 | Transaction Tracing | Protocol Buffer context, Relay propagation |
| 4 | Weeks 7-8 | Consensus Tracing | Round spans, Proposal/validation tracing |
| 5 | Week 9 | Documentation | Runbook, Dashboards, Training |

Total Effort: 47 person-days (2 developers working in parallel)

➡️ View full Implementation Phases


7. Observability Backends

APM = Application Performance Monitoring | GCS = Google Cloud Storage

Grafana Tempo is recommended for all environments due to its cost-effectiveness and Grafana integration, while Elastic APM is ideal for organizations with existing Elastic infrastructure.

The recommended production architecture uses a gateway collector pattern with regional collectors performing tail-based sampling, routing traces to multiple backends (Tempo for primary storage, Elastic for log correlation, S3/GCS for long-term archive).
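A minimal sketch of the tail-based sampling piece of that architecture, using the Collector's `tail_sampling` processor (thresholds and policy names here are illustrative, not tuned recommendations):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500     # illustrative slow-trace cutoff
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Errors and slow traces are always kept, while healthy traffic is sampled down to a baseline rate before being routed to the backends.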

➡️ View Observability Backend Recommendations


8. Appendix

The appendix contains a glossary of OpenTelemetry and rippled-specific terms, references to external documentation and specifications, version history for this implementation plan, and a complete document index.

➡️ View Appendix


POC Task List

A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. The POC scope is limited to RPC tracing — showing request traces flowing from rippled through an OpenTelemetry Collector into Tempo, viewable in Grafana.

➡️ View POC Task List


For detailed information on any section, follow the links to the corresponding sub-documents.