mirror of https://github.com/XRPLF/rippled.git synced 2026-03-15 09:12:25 +00:00


Pratik Mankawde 34243e0cc2 Phase 1a: OpenTelemetry plan documentation

Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-09 19:03:09 +00:00


Design Decisions

Parent Document: OpenTelemetryPlan.md
Related: Architecture Analysis | Code Samples

2.1 OpenTelemetry Components

2.1.1 SDK Selection

Primary Choice: OpenTelemetry C++ SDK (opentelemetry-cpp)

Component	Purpose	Required
`opentelemetry-cpp::api`	Tracing API headers	Yes
`opentelemetry-cpp::sdk`	SDK implementation	Yes
`opentelemetry-cpp::ext`	Extensions (exporters)	Yes
`opentelemetry-cpp::otlp_grpc_exporter`	OTLP/gRPC export	Recommended
`opentelemetry-cpp::otlp_http_exporter`	OTLP/HTTP export	Alternative

2.1.2 Instrumentation Strategy

Manual Instrumentation (recommended):

Approach	Pros	Cons
Manual	Precise control, optimized placement, rippled-specific attributes	More development effort
Auto	Less code, automatic coverage	Less control, potential overhead, limited customization

2.2 Exporter Configuration

flowchart TB
    subgraph nodes["rippled Nodes"]
        node1["rippled<br/>Node 1"]
        node2["rippled<br/>Node 2"]
        node3["rippled<br/>Node 3"]
    end

    collector["OpenTelemetry<br/>Collector<br/>(sidecar or standalone)"]

    subgraph backends["Observability Backends"]
        jaeger["Jaeger<br/>(Dev)"]
        tempo["Tempo<br/>(Prod)"]
        elastic["Elastic<br/>APM"]
    end

    node1 -->|"OTLP/gRPC<br/>:4317"| collector
    node2 -->|"OTLP/gRPC<br/>:4317"| collector
    node3 -->|"OTLP/gRPC<br/>:4317"| collector

    collector --> jaeger
    collector --> tempo
    collector --> elastic

    style nodes fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff

2.2.1 OTLP/gRPC (Recommended)

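// Configuration for OTLP over gRPC
#include <opentelemetry/exporters/otlp/otlp_grpc_exporter_options.h>

namespace otlp = opentelemetry::exporter::otlp;

otlp::OtlpGrpcExporterOptions opts;
opts.endpoint = "localhost:4317";                      // Collector gRPC endpoint
opts.use_ssl_credentials = true;                       // Use TLS for the connection
opts.ssl_credentials_cacert_path = "/path/to/ca.crt";  // CA certificate used to verify the collector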

2.2.2 OTLP/HTTP (Alternative)

// Configuration for OTLP over HTTP
#include <opentelemetry/exporters/otlp/otlp_http_exporter_options.h>

namespace otlp = opentelemetry::exporter::otlp;

otlp::OtlpHttpExporterOptions opts;
opts.url = "http://localhost:4318/v1/traces";             // Collector HTTP traces endpoint
opts.content_type = otlp::HttpRequestContentType::kJson;  // or kBinary (protobuf)

2.3 Span Naming Conventions

2.3.1 Naming Schema

<component>.<operation>[.<sub-operation>]

Examples:

tx.receive - Transaction received from peer
consensus.phase.establish - Consensus establish phase
rpc.command.server_info - server_info RPC command
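
The naming schema can be enforced at the call site with a small helper. The following is a minimal sketch, not part of the plan itself: `makeSpanName` is a hypothetical function name, and treating an empty third argument as "no sub-operation" is an assumption made for illustration.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: builds "<component>.<operation>[.<sub-operation>]".
// An empty sub-operation is omitted, matching the optional part of the schema.
inline std::string
makeSpanName(
    std::string_view component,
    std::string_view operation,
    std::string_view subOperation = {})
{
    // Component and operation are mandatory in the schema.
    assert(!component.empty() && !operation.empty());

    std::string name;
    name.reserve(component.size() + operation.size() + subOperation.size() + 2);
    name.append(component).append(1, '.').append(operation);
    if (!subOperation.empty())
        name.append(1, '.').append(subOperation);
    return name;
}
```

Building names in one place keeps dynamic spans such as `rpc.command.<name>` consistent with the static entries in the catalog.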

2.3.2 Complete Span Catalog

# Transaction Spans
tx:
  receive: "Transaction received from network"
  validate: "Transaction signature/format validation"
  process: "Full transaction processing"
  relay: "Transaction relay to peers"
  apply: "Apply transaction to ledger"

# Consensus Spans
consensus:
  round: "Complete consensus round"
  phase:
    open: "Open phase - collecting transactions"
    establish: "Establish phase - reaching agreement"
    accept: "Accept phase - applying consensus"
  proposal:
    receive: "Receive peer proposal"
    send: "Send our proposal"
  validation:
    receive: "Receive peer validation"
    send: "Send our validation"

# RPC Spans
rpc:
  request: "HTTP/WebSocket request handling"
  command:
    "*": "Specific RPC command (dynamic)"

# Peer Spans
peer:
  connect: "Peer connection establishment"
  disconnect: "Peer disconnection"
  message:
    send: "Send protocol message"
    receive: "Receive protocol message"

# Ledger Spans
ledger:
  acquire: "Ledger acquisition from network"
  build: "Build new ledger"
  validate: "Ledger validation"
  close: "Close ledger"

# Job Spans
job:
  enqueue: "Job added to queue"
  execute: "Job execution"

2.4 Attribute Schema

2.4.1 Resource Attributes (Set Once at Startup)

// Standard OpenTelemetry semantic conventions
resource::SemanticConventions::SERVICE_NAME        = "rippled"
resource::SemanticConventions::SERVICE_VERSION     = BuildInfo::getVersionString()
resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>

// Custom rippled attributes
"xrpl.network.id"      = <network_id>           // e.g., 0 for mainnet
"xrpl.network.type"    = "mainnet" | "testnet" | "devnet" | "standalone"
"xrpl.node.type"       = "validator" | "stock" | "reporting"
"xrpl.node.cluster"    = <cluster_name>         // If clustered

2.4.2 Span Attributes by Category

Transaction Attributes

"xrpl.tx.hash"         = string   // Transaction hash (hex)
"xrpl.tx.type"         = string   // "Payment", "OfferCreate", etc.
"xrpl.tx.account"      = string   // Source account (redacted in prod)
"xrpl.tx.sequence"     = int64    // Account sequence number
"xrpl.tx.fee"          = int64    // Fee in drops
"xrpl.tx.result"       = string   // "tesSUCCESS", "tecPATH_DRY", etc.
"xrpl.tx.ledger_index" = int64    // Ledger containing transaction

Consensus Attributes

"xrpl.consensus.round"          = int64    // Round number
"xrpl.consensus.phase"          = string   // "open", "establish", "accept"
"xrpl.consensus.mode"           = string   // "proposing", "observing", etc.
"xrpl.consensus.proposers"      = int64    // Number of proposers
"xrpl.consensus.ledger.prev"    = string   // Previous ledger hash
"xrpl.consensus.ledger.seq"     = int64    // Ledger sequence
"xrpl.consensus.tx_count"       = int64    // Transactions in consensus set
"xrpl.consensus.duration_ms"    = float64  // Round duration

RPC Attributes

"xrpl.rpc.command"     = string   // Command name
"xrpl.rpc.version"     = int64    // API version
"xrpl.rpc.role"        = string   // "admin" or "user"
"xrpl.rpc.params"      = string   // Sanitized parameters (optional)

Peer & Message Attributes

"xrpl.peer.id"            = string   // Peer public key (base58)
"xrpl.peer.address"       = string   // IP:port
"xrpl.peer.latency_ms"    = float64  // Measured latency
"xrpl.peer.cluster"       = string   // Cluster name if clustered
"xrpl.message.type"       = string   // Protocol message type name
"xrpl.message.size_bytes" = int64    // Message size
"xrpl.message.compressed" = bool     // Whether compressed

Ledger & Job Attributes

"xrpl.ledger.hash"       = string   // Ledger hash
"xrpl.ledger.index"      = int64    // Ledger sequence/index
"xrpl.ledger.close_time" = int64    // Close time (epoch)
"xrpl.ledger.tx_count"   = int64    // Transaction count
"xrpl.job.type"          = string   // Job type name
"xrpl.job.queue_ms"      = float64  // Time spent in queue
"xrpl.job.worker"        = int64    // Worker thread ID
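
The attribute schema in 2.4.2 uses four value types (string, int64, float64, bool). A stand-alone sketch of how such a typed attribute set might be modelled is shown below. `AttrValue`, `AttrMap`, and `makeTxAttributes` are hypothetical names for illustration only; an actual integration would pass values to the OpenTelemetry span API instead of a local map.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>

// The four value types used by the attribute schema.
using AttrValue = std::variant<std::string, std::int64_t, double, bool>;
using AttrMap = std::map<std::string, AttrValue>;

// Hypothetical helper: fills a subset of the transaction attributes.
inline AttrMap
makeTxAttributes(
    std::string const& hashHex,
    std::string const& txType,
    std::int64_t sequence,
    std::int64_t feeDrops)
{
    AttrMap attrs;
    attrs["xrpl.tx.hash"] = hashHex;       // string: transaction hash (hex)
    attrs["xrpl.tx.type"] = txType;        // string: e.g. "Payment"
    attrs["xrpl.tx.sequence"] = sequence;  // int64: account sequence number
    attrs["xrpl.tx.fee"] = feeDrops;       // int64: fee in drops
    return attrs;
}
```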

2.4.3 Data Collection Summary

The following table summarizes what data is collected by category:

Category	Attributes Collected	Purpose
Transaction	`tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `tx.ledger_index`	Trace transaction lifecycle
Consensus	`round`, `phase`, `mode`, `proposers` (count), `duration_ms`	Analyze consensus timing
RPC	`command`, `version`, `status`, `duration_ms`	Monitor RPC performance
Peer	`peer.id` (public key), `latency_ms`, `message.type`, `message.size_bytes`	Network topology analysis
Ledger	`ledger.hash`, `ledger.index`, `close_time`, `tx_count`	Ledger progression tracking
Job	`job.type`, `queue_ms`, `worker`	JobQueue performance

2.4.4 Privacy & Sensitive Data Policy

OpenTelemetry instrumentation is designed to collect operational metadata only, never sensitive content.

Data NOT Collected

The following data is explicitly excluded from telemetry collection:

Excluded Data	Reason
Private Keys	Never exposed; not relevant to tracing
Account Balances	Financial data; privacy sensitive
Transaction Amounts	Financial data; privacy sensitive
Raw TX Payloads	May contain sensitive memo/data fields
Personal Data	No PII collected
IP Addresses	Configurable; excluded by default in prod

Privacy Protection Mechanisms

Mechanism	Description
Account Hashing	`xrpl.tx.account` is hashed at collector level before storage
Configurable Redaction	Sensitive fields can be excluded via `[telemetry]` config section
Sampling	Only 10% of traces recorded by default, reducing data exposure
Local Control	Node operators have full control over what gets exported
No Raw Payloads	Transaction content is never recorded, only metadata (hash, type, result)
Collector-Level Filtering	Additional redaction/hashing can be configured at OTel Collector
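
The sampling mechanism in the table can be illustrated with a ratio-based decision keyed on the trace ID. This is a sketch under stated assumptions, not the OpenTelemetry SDK's sampler: `shouldSample` and `TraceId` are hypothetical names, and using the last 8 bytes of the ID is a choice made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using TraceId = std::array<std::uint8_t, 16>;

// Sketch of a ratio-based sampling decision. The result depends only on the
// trace ID, so every node that sees the same trace reaches the same answer
// without coordination.
inline bool
shouldSample(TraceId const& id, double ratio)
{
    assert(ratio >= 0.0 && ratio <= 1.0);
    if (ratio >= 1.0)
        return true;
    if (ratio <= 0.0)
        return false;

    // Interpret the last 8 bytes of the ID as a big-endian integer.
    std::uint64_t value = 0;
    for (std::size_t i = 8; i < 16; ++i)
        value = (value << 8) | id[i];

    // Keep the trace if the value falls in the lowest `ratio` fraction
    // of the 64-bit range (threshold = ratio * 2^64).
    auto const threshold =
        static_cast<std::uint64_t>(ratio * 18446744073709551616.0);
    return value < threshold;
}
```

With `ratio = 0.1`, roughly one in ten uniformly distributed trace IDs is kept, which corresponds to the 10% default mentioned above.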

Collector-Level Data Protection

The OpenTelemetry Collector can be configured to hash or redact sensitive attributes before export:

processors:
  attributes:
    actions:
      # Hash account addresses before storage
      - key: xrpl.tx.account
        action: hash
      # Remove IP addresses entirely
      - key: xrpl.peer.address
        action: delete
      # Redact specific fields
      - key: xrpl.rpc.params
        action: delete
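
To make the effect of the `hash` and `delete` actions concrete, the sketch below applies the same two operations to an in-memory attribute map. It is illustrative only: `Attributes`, `hashAttribute`, and `deleteAttribute` are hypothetical names, and `std::hash` is a non-cryptographic placeholder rather than the digest the Collector uses.

```cpp
#include <functional>
#include <map>
#include <string>

using Attributes = std::map<std::string, std::string>;

// "delete" action: drop the attribute if it is present.
inline void
deleteAttribute(Attributes& attrs, std::string const& key)
{
    attrs.erase(key);
}

// "hash" action: replace the value with a digest of it.
// ASSUMPTION: std::hash stands in for a real cryptographic hash here; it is
// neither collision-resistant nor stable across standard library builds.
inline void
hashAttribute(Attributes& attrs, std::string const& key)
{
    auto const it = attrs.find(key);
    if (it != attrs.end())
        it->second = std::to_string(std::hash<std::string>{}(it->second));
}
```

Attributes that are not named in an action, such as `xrpl.tx.type`, pass through unchanged.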

Configuration Options for Privacy

In rippled.cfg, operators can control data collection granularity:

[telemetry]
enabled=1

# Enable (1) or disable (0) tracing per component
trace_transactions=1
trace_consensus=1
trace_rpc=1
trace_peer=0          # Disable peer tracing (high volume, includes addresses)

# Redact specific attributes
redact_account=1      # Hash account addresses before export
redact_peer_address=1 # Remove peer IP addresses
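
A minimal sketch of reading these flags is given below. It does not use rippled's own config classes: `parseTelemetrySection` and `trim` are hypothetical helpers, and the handling of inline `#` comments is an assumption based on the example above.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Removes leading and trailing spaces, tabs, and carriage returns.
inline std::string
trim(std::string const& s)
{
    auto const first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos)
        return {};
    auto const last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Hypothetical parser: returns the key=0|1 flags found in [telemetry].
inline std::map<std::string, bool>
parseTelemetrySection(std::istream& in)
{
    std::map<std::string, bool> flags;
    bool inSection = false;
    std::string line;
    while (std::getline(in, line))
    {
        // Drop anything after '#', then surrounding whitespace.
        line = trim(line.substr(0, line.find('#')));
        if (line.empty())
            continue;
        if (line.front() == '[')
        {
            inSection = (line == "[telemetry]");
            continue;
        }
        auto const eq = line.find('=');
        if (!inSection || eq == std::string::npos)
            continue;
        flags[trim(line.substr(0, eq))] = (trim(line.substr(eq + 1)) == "1");
    }
    return flags;
}
```

Keys outside the `[telemetry]` section are ignored, so the same stream can hold a complete configuration file.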

Key Principle: Telemetry collects operational metadata (timing, counts, hashes) — never sensitive content (keys, balances, amounts, raw payloads).

2.5 Context Propagation Design

2.5.1 Propagation Boundaries

flowchart TB
    subgraph http["HTTP/WebSocket (RPC)"]
        w3c["W3C Trace Context Headers:<br/>traceparent: 00-{trace_id}-{span_id}-{flags}<br/>tracestate: rippled=<state>"]
    end

    subgraph protobuf["Protocol Buffers (P2P)"]
        proto["message TraceContext {<br/>  bytes trace_id = 1;  // 16 bytes<br/>  bytes span_id = 2;   // 8 bytes<br/>  uint32 trace_flags = 3;<br/>  string trace_state = 4;<br/>}"]
    end

    subgraph jobqueue["JobQueue (Internal Async)"]
        job["Context captured at job creation,<br/>restored at execution<br/><br/>class Job {<br/>  opentelemetry::context::Context traceContext_;<br/>};"]
    end

    style http fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style protobuf fill:#1b5e20,stroke:#0d3d14,color:#ffffff
    style jobqueue fill:#bf360c,stroke:#8c2809,color:#ffffff
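
For the HTTP/WebSocket boundary, the `traceparent` header layout shown above can be parsed with a few fixed offsets. The following is a stand-alone sketch assuming version `00` of W3C Trace Context; `TraceParent` and `parseTraceParent` are hypothetical names, and a real integration would more likely rely on the propagator shipped with the OpenTelemetry SDK.

```cpp
#include <optional>
#include <string>
#include <string_view>

struct TraceParent
{
    std::string traceId;   // 32 lowercase hex characters
    std::string spanId;    // 16 lowercase hex characters
    bool sampled = false;  // bit 0 of the flags byte
};

inline bool
isLowerHex(std::string_view s)
{
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
            return false;
    return !s.empty();
}

// Parses "00-<32 hex trace_id>-<16 hex span_id>-<2 hex flags>".
inline std::optional<TraceParent>
parseTraceParent(std::string_view h)
{
    // Fixed layout: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters.
    if (h.size() != 55 || h[2] != '-' || h[35] != '-' || h[52] != '-')
        return std::nullopt;

    auto const version = h.substr(0, 2);
    auto const traceId = h.substr(3, 32);
    auto const spanId = h.substr(36, 16);
    auto const flags = h.substr(53, 2);

    if (version != "00" || !isLowerHex(traceId) || !isLowerHex(spanId) ||
        !isLowerHex(flags))
        return std::nullopt;

    // All-zero trace or span IDs are invalid.
    if (traceId.find_first_not_of('0') == std::string_view::npos ||
        spanId.find_first_not_of('0') == std::string_view::npos)
        return std::nullopt;

    int const flagByte = std::stoi(std::string(flags), nullptr, 16);
    return TraceParent{
        std::string(traceId), std::string(spanId), (flagByte & 0x01) != 0};
}
```

The same three fields (trace ID, span ID, flags) are what the proposed `TraceContext` protobuf message carries across the P2P boundary, in binary rather than hex form.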

2.6 Integration with Existing Observability

2.6.1 Existing Frameworks Comparison

rippled already has two observability mechanisms, PerfLog and Beast Insight. OpenTelemetry complements them rather than replacing them:

Aspect	PerfLog	Beast Insight (StatsD)	OpenTelemetry
Type	Logging	Metrics	Distributed Tracing
Data	JSON log entries	Counters, gauges, histograms	Spans with context
Scope	Single node	Single node	Cross-node
Output	`perf.log` file	StatsD server	OTLP Collector
Question answered	"What happened on this node?"	"How many? How fast?"	"What was the journey?"
Correlation	By timestamp	By metric name	By `trace_id`
Overhead	Low (file I/O)	Low (UDP packets)	Low-Medium (configurable)

2.6.2 What Each Framework Does Best

PerfLog

Purpose: Detailed local event logging for RPC and job execution
Strengths:
- Rich JSON output with timing data
- Already integrated in RPC handlers
- File-based, no external dependencies
Limitations:
- Single-node only (no cross-node correlation)
- No parent-child relationships between events
- Manual log parsing required

// Example PerfLog entry
{
  "time": "2024-01-15T10:30:00.123Z",
  "method": "submit",
  "duration_us": 1523,
  "result": "tesSUCCESS"
}

Beast Insight (StatsD)

Purpose: Real-time metrics for monitoring dashboards
Strengths:
- Aggregated metrics (counters, gauges, histograms)
- Low overhead (UDP, fire-and-forget)
- Good for alerting thresholds
Limitations:
- No request-level detail
- No causal relationships
- Single-node perspective

// Example StatsD usage in rippled
insight.increment("rpc.submit.count");
insight.gauge("ledger.age", age);
insight.timing("consensus.round", duration);

OpenTelemetry (NEW)

Purpose: Distributed request tracing across nodes
Strengths:
- Cross-node correlation via trace_id
- Parent-child span relationships
- Rich attributes per span
- Industry standard (CNCF)
Limitations:
- Requires collector infrastructure
- Higher complexity than logging

// Example OpenTelemetry span
auto span = telemetry.startSpan("tx.relay");
span->SetAttribute("xrpl.tx.hash", hash);
span->SetAttribute("xrpl.peer.id", peerId);
// Span automatically linked to parent via context

2.6.3 When to Use Each

Scenario	PerfLog	StatsD	OpenTelemetry
"How many TXs per second?"	❌	✅	❌
"What's the p99 RPC latency?"	❌	✅	✅
"Why was this specific TX slow?"	⚠️ partial	❌	✅
"Which node delayed consensus?"	❌	❌	✅
"What happened on node X at time T?"	✅	❌	✅
"Show me the TX journey across 5 nodes"	❌	❌	✅

2.6.4 Coexistence Strategy

flowchart TB
    subgraph rippled["rippled Process"]
        perflog["PerfLog<br/>(JSON to file)"]
        insight["Beast Insight<br/>(StatsD)"]
        otel["OpenTelemetry<br/>(Tracing)"]
    end

    perflog --> perffile["perf.log"]
    insight --> statsd["StatsD Server"]
    otel --> collector["OTLP Collector"]

    perffile --> grafana["Grafana<br/>(Unified UI)"]
    statsd --> grafana
    collector --> grafana

    style rippled fill:#212121,stroke:#0a0a0a,color:#ffffff
    style grafana fill:#bf360c,stroke:#8c2809,color:#ffffff

2.6.5 Correlation with PerfLog

Trace IDs can be correlated with existing PerfLog entries for comprehensive debugging:

// In RPCHandler.cpp - correlate trace with PerfLog
Status doCommand(RPC::JsonContext& context, Json::Value& result)
{
    // Start OpenTelemetry span
    auto span = context.app.getTelemetry().startSpan(
        "rpc.command." + context.method);

    // Get trace ID (32 lowercase hex characters) for correlation
    std::string traceId;
    if (auto const id = span->GetContext().trace_id(); id.IsValid())
    {
        char buf[2 * opentelemetry::trace::TraceId::kSize];
        id.ToLowerBase16(buf);
        traceId.assign(buf, sizeof(buf));
    }

    // Use existing PerfLog with trace correlation
    auto const curId = context.app.getPerfLog().currentId();
    context.app.getPerfLog().rpcStart(context.method, curId);

    // Future: Add trace ID to PerfLog entry
    // context.app.getPerfLog().setTraceId(curId, traceId);

    try {
        auto ret = handler(context, result);
        context.app.getPerfLog().rpcFinish(context.method, curId);
        span->SetStatus(opentelemetry::trace::StatusCode::kOk);
        return ret;
    } catch (std::exception const& e) {
        context.app.getPerfLog().rpcError(context.method, curId);
        // The C++ Span API has no RecordException(); record an "exception" event instead
        span->AddEvent("exception", {{"exception.message", e.what()}});
        span->SetStatus(opentelemetry::trace::StatusCode::kError, e.what());
        throw;
    }
}
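
For the correlation step, the trace ID has to be rendered as text before it can be written to a PerfLog entry. A stand-alone sketch of that conversion follows. `traceIdToHex` is a hypothetical helper operating on a raw 16-byte array, shown only to illustrate the 32-character lowercase hex form that trace backends display.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hypothetical helper: renders a 16-byte trace ID as 32 lowercase hex chars.
inline std::string
traceIdToHex(std::array<std::uint8_t, 16> const& id)
{
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(32);
    for (std::uint8_t byte : id)
    {
        out.push_back(digits[byte >> 4]);    // high nibble
        out.push_back(digits[byte & 0x0F]);  // low nibble
    }
    return out;
}
```

Writing the ID in this form lets an operator paste a value from `perf.log` directly into a trace search.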

Previous: Architecture Analysis | Next: Implementation Strategy | Back to: Overview
