OpenTelemetry POC Task List

Goal: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. A successful POC will show RPC request traces flowing from xrpld through an OTel Collector into Tempo, viewable in Grafana.

Scope: RPC tracing only (highest value, lowest risk, per the CRAWL phase in 06-implementation-phases.md). No cross-node P2P context propagation or consensus tracing in the POC.

Document Relevance to POC
| Document | Relevance to POC |
| --- | --- |
| 00-tracing-fundamentals.md | Core concepts: traces, spans, context propagation, sampling |
| 01-architecture-analysis.md | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| 02-design-decisions.md | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| 03-implementation-strategy.md | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| 04-code-samples.md | Telemetry interface (§4.1), SpanGuard factory methods (§4.2-4.3), RPC instrumentation (§4.5.3) |
| 05-configuration-reference.md | xrpld config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| 06-implementation-phases.md | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| 07-observability-backends.md | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |

Task 0: Docker Observability Stack Setup

OTLP = OpenTelemetry Protocol

Objective: Stand up the backend infrastructure to receive, store, and display traces.

What to do:

  • Create docker/telemetry/docker-compose.yml in the repo with three services:

    1. OpenTelemetry Collector (otel/opentelemetry-collector-contrib:0.92.0)
      • Expose ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP)
      • Expose port 13133 (health check)
      • Mount a config file docker/telemetry/otel-collector-config.yaml
    2. Tempo (grafana/tempo:2.6.1)
      • Expose port 3200 (HTTP API) and 4317 (OTLP gRPC, internal)
    3. Grafana (grafana/grafana:latest) — optional but useful
      • Expose port 3000
      • Enable anonymous admin access for local dev (GF_AUTH_ANONYMOUS_ENABLED=true, GF_AUTH_ANONYMOUS_ORG_ROLE=Admin)
      • Provision Tempo as a data source via docker/telemetry/grafana/provisioning/datasources/tempo.yaml
  • Create docker/telemetry/otel-collector-config.yaml:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    
    processors:
      batch:
        timeout: 1s
        send_batch_size: 100
    
    exporters:
      debug:  # the old 'logging' exporter is deprecated; use 'debug' (see POC Lessons Learned)
        verbosity: detailed
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
    
    extensions:
      health_check:  # serves the liveness endpoint on :13133, exposed by the compose file
    
    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlp/tempo]
    
  • Create Grafana Tempo datasource provisioning file at docker/telemetry/grafana/provisioning/datasources/tempo.yaml:

    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        access: proxy
        url: http://tempo:3200
    

Verification: Run docker compose -f docker/telemetry/docker-compose.yml up -d, then:

  • curl http://localhost:13133 returns healthy (Collector)
  • http://localhost:3000 opens Grafana (Tempo datasource available, no traces yet)

Task 1: Add OpenTelemetry C++ SDK Dependency

Objective: Make opentelemetry-cpp available to the build system.

What to do:

  • Edit conanfile.py to add opentelemetry-cpp as an optional dependency. The gRPC otel-plugin flag ("grpc/*:otel_plugin": False) in the existing conanfile may need to remain False, since the OTel SDK is pulled in separately.
    • Add a Conan option: with_telemetry = [True, False] defaulting to False
    • When with_telemetry is True, add opentelemetry-cpp to self.requires()
    • Required OTel Conan package: opentelemetry-cpp (which bundles the api, sdk, and exporter components). The package is available in Conan Center, though the POC hit a lockfile issue adding it (see POC Lessons Learned); FetchContent in CMake or a source build remain fallbacks if Conan resolution fails.
  • Edit CMakeLists.txt:
    • Add option: option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)
    • When ON, find_package(opentelemetry-cpp CONFIG REQUIRED) and add compile definition XRPL_ENABLE_TELEMETRY
    • When OFF, do nothing (zero build impact)
  • Verify the build succeeds with -DXRPL_ENABLE_TELEMETRY=OFF (no regressions) and with -DXRPL_ENABLE_TELEMETRY=ON (SDK links successfully).

Key files:

  • conanfile.py
  • CMakeLists.txt

Task 2: Create Core Telemetry Interface and NullTelemetry

Objective: Define the Telemetry abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.

What to do:

  • Create include/xrpl/telemetry/Telemetry.h:

    • Define namespace xrpl::telemetry
    • Define struct Telemetry::Setup holding: enabled, exporterEndpoint, samplingRatio, serviceName, serviceVersion, serviceInstanceId, traceRpc, traceTransactions, traceConsensus, tracePeer
    • Define abstract class Telemetry with:
      • virtual void start() = 0;
      • virtual void stop() = 0;
      • virtual bool isEnabled() const = 0;
      • virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "xrpld") = 0;
      • virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;
      • virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;
      • virtual bool shouldTraceRpc() const = 0;
      • virtual bool shouldTraceTransactions() const = 0;
      • virtual bool shouldTraceConsensus() const = 0;
    • Factory: std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);
    • Config parser: Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);
  • Create include/xrpl/telemetry/SpanGuard.h:

    • RAII guard with static factory methods (rpcSpan(), txSpan(), consensusSpan(), etc.) that access the global Telemetry::getInstance() singleton internally.
    • Uses the pimpl idiom to hide all OTel types; the public header has zero opentelemetry/ includes.
    • Convenience instance methods: setAttribute(), setOk(), setStatus(), addEvent(), recordException(), context(), discard()
    • When XRPL_ENABLE_TELEMETRY is not defined, the entire class compiles to a no-op stub.
    • See 04-code-samples.md §4.2-4.3 for the full API reference.
  • Create src/libxrpl/telemetry/NullTelemetry.cpp:

    • Implements Telemetry with all no-ops.
    • isEnabled() returns false, startSpan() returns a noop span.
    • This is used when XRPL_ENABLE_TELEMETRY is OFF or enabled=0 in config.
  • Guard all OTel SDK headers behind #ifdef XRPL_ENABLE_TELEMETRY. The NullTelemetry implementation should compile without the OTel SDK present.
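
A minimal sketch of how the public header described above could look. This is illustrative only: the Setup field types and default values are assumptions (the plan documents pin the exact types), the beast::Journal include path is assumed, and the OTel-facing methods (getTracer(), startSpan()) are omitted because their signatures involve types that stay hidden behind XRPL_ENABLE_TELEMETRY:

    // include/xrpl/telemetry/Telemetry.h -- illustrative sketch only.
    #include <xrpl/beast/utility/Journal.h>  // assumed path for beast::Journal
    
    #include <memory>
    #include <string>
    
    namespace xrpl::telemetry {
    
    class Telemetry
    {
    public:
        struct Setup
        {
            bool enabled = false;
            std::string exporterEndpoint = "http://localhost:4318/v1/traces";
            double samplingRatio = 1.0;
            std::string serviceName = "xrpld";
            std::string serviceVersion;
            std::string serviceInstanceId;
            bool traceRpc = true;
            bool traceTransactions = false;
            bool traceConsensus = false;
            bool tracePeer = false;
        };
    
        virtual ~Telemetry() = default;
    
        virtual void start() = 0;
        virtual void stop() = 0;
        virtual bool isEnabled() const = 0;
        virtual bool shouldTraceRpc() const = 0;
        virtual bool shouldTraceTransactions() const = 0;
        virtual bool shouldTraceConsensus() const = 0;
    
        // getTracer()/startSpan() overloads go here, guarded so that no
        // opentelemetry/ header leaks into builds without the SDK.
    };
    
    std::unique_ptr<Telemetry>
    make_Telemetry(Telemetry::Setup const& setup, beast::Journal journal);
    
    }  // namespace xrpl::telemetry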

Key new files:

  • include/xrpl/telemetry/Telemetry.h
  • include/xrpl/telemetry/SpanGuard.h
  • src/libxrpl/telemetry/NullTelemetry.cpp

Task 3: Implement OTel-Backed Telemetry

Objective: Implement the real Telemetry class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.

What to do:

  • Create src/libxrpl/telemetry/Telemetry.cpp (compiled only when XRPL_ENABLE_TELEMETRY=ON):

    • class TelemetryImpl : public Telemetry that:
      • In start(): creates a TracerProvider with:
        • Resource attributes: service.name, service.version, service.instance.id
        • An OtlpHttpExporter pointed at setup.exporterEndpoint (default http://localhost:4318/v1/traces; the Conan package ships only the HTTP exporter, see POC Lessons Learned)
        • A BatchSpanProcessor with configurable batch size and delay
        • A TraceIdRatioBasedSampler using setup.samplingRatio
      • Sets the global TracerProvider
      • In stop(): calls ForceFlush() then shuts down the provider
      • In startSpan(): delegates to getTracer()->StartSpan(name, ...)
      • shouldTraceRpc() etc. read from Setup fields
  • Create src/libxrpl/telemetry/TelemetryConfig.cpp:

    • setup_Telemetry() parses the [telemetry] config section from xrpld.cfg
    • Maps config keys: enabled, exporter, endpoint, sampling_ratio, trace_rpc, trace_transactions, trace_consensus, trace_peer
  • Wire make_Telemetry() factory:

    • If setup.enabled is true AND XRPL_ENABLE_TELEMETRY is defined: return TelemetryImpl
    • Otherwise: return NullTelemetry
  • Add telemetry source files to CMake. When XRPL_ENABLE_TELEMETRY=ON, compile Telemetry.cpp and TelemetryConfig.cpp and link against the umbrella target opentelemetry-cpp::opentelemetry-cpp (the per-component targets such as opentelemetry-cpp::api do not exist in the Conan package; see POC Lessons Learned). When OFF, compile only NullTelemetry.cpp. A sketch of the start() wiring follows.
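
Putting the start() sequence together, a sketch under the constraints above (opentelemetry-cpp 1.x factory APIs, HTTP exporter, and the nostd::shared_ptr wrapping from the Lessons Learned table). provider_ and setup_ are assumed members of TelemetryImpl; error handling and stop() are elided:

    // src/libxrpl/telemetry/Telemetry.cpp -- sketch of TelemetryImpl::start().
    #include <opentelemetry/exporters/otlp/otlp_http_exporter_factory.h>
    #include <opentelemetry/sdk/resource/resource.h>
    #include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
    #include <opentelemetry/sdk/trace/samplers/trace_id_ratio_factory.h>
    #include <opentelemetry/sdk/trace/tracer_provider_factory.h>
    #include <opentelemetry/trace/provider.h>
    
    #include <chrono>
    #include <memory>
    
    namespace otlp = opentelemetry::exporter::otlp;
    namespace sdktrace = opentelemetry::sdk::trace;
    
    void
    TelemetryImpl::start()
    {
        otlp::OtlpHttpExporterOptions exporterOpts;
        exporterOpts.url = setup_.exporterEndpoint;  // http://localhost:4318/v1/traces
        auto exporter = otlp::OtlpHttpExporterFactory::Create(exporterOpts);
    
        sdktrace::BatchSpanProcessorOptions batchOpts;
        batchOpts.max_export_batch_size = 100;  // placeholder tuning values
        batchOpts.schedule_delay_millis = std::chrono::milliseconds(1000);
        auto processor = sdktrace::BatchSpanProcessorFactory::Create(
            std::move(exporter), batchOpts);
    
        auto resource = opentelemetry::sdk::resource::Resource::Create(
            {{"service.name", setup_.serviceName},
             {"service.version", setup_.serviceVersion},
             {"service.instance.id", setup_.serviceInstanceId}});
    
        auto sampler =
            sdktrace::TraceIdRatioBasedSamplerFactory::Create(setup_.samplingRatio);
    
        // The factory returns unique_ptr<sdk::TracerProvider>; keep a
        // std::shared_ptr member and hand a nostd::shared_ptr to the global
        // API (see the Lessons Learned table at the end of this document).
        provider_ = std::shared_ptr<sdktrace::TracerProvider>(
            sdktrace::TracerProviderFactory::Create(
                std::move(processor), resource, std::move(sampler)));
        opentelemetry::trace::Provider::SetTracerProvider(
            opentelemetry::nostd::shared_ptr<opentelemetry::trace::TracerProvider>(
                provider_));
    }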

Key new files:

  • src/libxrpl/telemetry/Telemetry.cpp
  • src/libxrpl/telemetry/TelemetryConfig.cpp

Key modified files:

  • CMakeLists.txt (add telemetry library target)

Task 4: Integrate Telemetry into Application Lifecycle

Objective: Wire the Telemetry object into the ServiceRegistry / Application so all components can access it.

What to do:

  • Edit include/xrpl/core/ServiceRegistry.h:

    • Forward-declare namespace telemetry { class Telemetry; } inside namespace xrpl
    • Add pure virtual method: virtual telemetry::Telemetry& getTelemetry() = 0;
    • (Application extends ServiceRegistry, so this is automatically available on Application too)
  • Edit src/xrpld/app/main/Application.cpp (the ApplicationImp class):

    • Add member: std::unique_ptr<telemetry::Telemetry> telemetry_;
    • In the member initializer list, construct telemetry with an empty serviceInstanceId (node identity is not yet known):
      , telemetry_(
            telemetry::make_Telemetry(
                telemetry::setup_Telemetry(
                    config_->section("telemetry"),
                    "",  // Updated later via setServiceInstanceId()
                    BuildInfo::getVersionString()),
                logs_->journal("Telemetry")))
      
    • In setup(), after nodeIdentity_ is resolved, inject the node public key as the service instance ID:
      if (!config_->section("telemetry").exists("service_instance_id"))
          telemetry_->setServiceInstanceId(
              toBase58(TokenType::NodePublic, nodeIdentity_->first));
      
    • In start(): call telemetry_->start()
    • In run() (shutdown path): call telemetry_->stop() (to flush pending spans)
    • Implement getTelemetry() override: return *telemetry_
  • Add [telemetry] section to the example config cfg/xrpld-example.cfg:

    # [telemetry]
    # enabled=1
    # endpoint=http://localhost:4318/v1/traces
    # sampling_ratio=1.0
    # trace_rpc=1
    

Access patterns: Components holding ServiceRegistry& (e.g. NetworkOPsImp) call registry_.get().getTelemetry(). Components holding Application& (e.g. ServerHandler, PeerImp, RCLConsensusAdaptor) call app_.getTelemetry() directly. Both resolve to the same Telemetry instance.
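
In code, the two patterns look like this (hypothetical call sites; registry_ is assumed to be a std::reference_wrapper<ServiceRegistry>, as the .get() call implies):

    // Inside a component holding ServiceRegistry& (e.g. NetworkOPsImp):
    telemetry::Telemetry& tel = registry_.get().getTelemetry();
    
    // Inside a component holding Application& (e.g. ServerHandler):
    telemetry::Telemetry& tel2 = app_.getTelemetry();
    
    // Both references resolve to the single instance owned by ApplicationImp.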

Key modified files:

  • include/xrpl/core/ServiceRegistry.h
  • src/xrpld/app/main/Application.cpp
  • cfg/xrpld-example.cfg (example config)

Task 5: Add SpanGuard Factory Methods

Objective: Add static factory methods to SpanGuard that provide type-safe, one-liner instrumentation and compile to zero-cost no-ops when telemetry is disabled. This replaces the earlier macro-based approach (TracingInstrumentation.h has been removed).

What to do:

  • Update include/xrpl/telemetry/SpanGuard.h:

    • Add static factory methods that access the global Telemetry::getInstance() singleton and check the relevant component filter before creating a span:

      // Each factory checks the global Telemetry instance internally.
      // No Telemetry& reference needed at the call site.
      auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
      span.setAttribute("xrpl.rpc.command", command);
      span.setAttribute("xrpl.rpc.status", status);
      
    • Factory methods: rpcSpan(), txSpan(), consensusSpan(), peerSpan(), ledgerSpan(), span()

    • Use the pimpl idiom to hide all OTel types from the public header (zero opentelemetry/ includes)

    • When XRPL_ENABLE_TELEMETRY is NOT defined, the entire class compiles to a no-op stub with empty inline method bodies

  • No separate TracingInstrumentation.h file is needed. All instrumentation call sites use #include <xrpl/telemetry/SpanGuard.h> directly.
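
A sketch of the dual build described above, assuming the pimpl layout from Task 2 (the method set is abbreviated; exact signatures live in 04-code-samples.md §4.2-4.3):

    // include/xrpl/telemetry/SpanGuard.h -- illustrative shape only.
    #include <memory>
    #include <string_view>
    
    namespace xrpl::telemetry {
    
    #ifdef XRPL_ENABLE_TELEMETRY
    
    class SpanGuard
    {
    public:
        static SpanGuard rpcSpan(std::string_view name);  // checks trace_rpc
        static SpanGuard txSpan(std::string_view name);   // checks trace_transactions
        // ... consensusSpan(), peerSpan(), ledgerSpan(), span()
    
        void setAttribute(std::string_view key, std::string_view value);
        void setOk();
        ~SpanGuard();  // ends the span
    
    private:
        struct Impl;                  // holds the OTel span/scope; defined in the .cpp
        std::unique_ptr<Impl> impl_;  // null when the component filter declined
    };
    
    #else  // XRPL_ENABLE_TELEMETRY not defined: zero-cost stub
    
    class SpanGuard
    {
    public:
        static SpanGuard rpcSpan(std::string_view) { return {}; }
        static SpanGuard txSpan(std::string_view) { return {}; }
        void setAttribute(std::string_view, std::string_view) {}
        void setOk() {}
        // All other methods likewise have empty inline bodies, so the
        // optimizer removes every call site.
    };
    
    #endif
    
    }  // namespace xrpl::telemetry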

Key modified file:

  • include/xrpl/telemetry/SpanGuard.h

Task 6: Instrument RPC ServerHandler

WS = WebSocket

Objective: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.

What to do:

  • Edit src/xrpld/rpc/detail/ServerHandler.cpp:

    • #include <xrpl/telemetry/SpanGuard.h>
    • In ServerHandler::onRequest(Session& session):
      • At the top of the method, add: auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
      • After the RPC command name is extracted, set attribute: span.setAttribute("xrpl.rpc.command", command);
      • After the response status is known, set: span.setAttribute("http.status_code", static_cast<int64_t>(statusCode));
      • Wrap error paths with: span.recordException(e);
    • In ServerHandler::processRequest(...):
      • Add a child span: auto span = telemetry::SpanGuard::rpcSpan("rpc.process");
      • Set method attribute: span.setAttribute("xrpl.rpc.method", request_method);
    • In ServerHandler::onWSMessage(...) (WebSocket path):
      • Add: auto span = telemetry::SpanGuard::rpcSpan("rpc.ws.message");
  • The goal is to see spans like:

    rpc.request
      └── rpc.process
    

    in Tempo/Grafana for every HTTP RPC call.
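
The instrumented entry point might take the following shape. This is a sketch: extractCommand() and the statusCode handling are placeholders for logic the handler already has, not real helpers in the codebase:

    // src/xrpld/rpc/detail/ServerHandler.cpp -- sketch of onRequest().
    #include <xrpl/telemetry/SpanGuard.h>
    
    void
    ServerHandler::onRequest(Session& session)
    {
        auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
        try
        {
            std::string const command = extractCommand(session);  // placeholder
            span.setAttribute("xrpl.rpc.command", command);
    
            // ... existing request handling runs here and eventually calls
            // processRequest(), whose own "rpc.process" span becomes a child ...
    
            int const statusCode = 200;  // placeholder for the real status
            span.setAttribute(
                "http.status_code", static_cast<std::int64_t>(statusCode));
            span.setOk();
        }
        catch (std::exception const& e)
        {
            span.recordException(e);
            throw;  // existing error handling still applies
        }
    }  // span ends here via RAII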

Key modified file:

  • src/xrpld/rpc/detail/ServerHandler.cpp (~15-25 lines added)

Task 7: Instrument RPC Command Execution

Objective: Add per-command tracing inside the RPC handler so each command (e.g., submit, account_info, server_info) gets its own child span.

What to do:

  • Edit src/xrpld/rpc/detail/RPCHandler.cpp:

    • #include <xrpl/telemetry/SpanGuard.h>
    • In doCommand(RPC::JsonContext& context, Json::Value& result):
      • At the top: auto span = telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);
      • Set attributes:
        • span.setAttribute("xrpl.rpc.command", context.method);
        • span.setAttribute("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));
        • span.setAttribute("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");
      • On success: span.setAttribute("xrpl.rpc.status", "success");
      • On error: span.setAttribute("xrpl.rpc.status", "error"); and set the error message
  • After this, traces in Tempo/Grafana should look like:

    rpc.request  (xrpl.rpc.command=account_info)
      └── rpc.process
            └── rpc.command.account_info  (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
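
The doCommand() additions in context, as a sketch: invokeHandler() is a hypothetical stand-in for the existing handler-table dispatch, and the assumption that RPC::Status is truthy on error follows its use elsewhere in the RPC code:

    // src/xrpld/rpc/detail/RPCHandler.cpp -- sketch of the doCommand() additions.
    Status
    doCommand(RPC::JsonContext& context, Json::Value& result)
    {
        auto span =
            telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);
        span.setAttribute("xrpl.rpc.command", context.method);
        span.setAttribute(
            "xrpl.rpc.version", static_cast<std::int64_t>(context.apiVersion));
        span.setAttribute(
            "xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");
    
        Status const status = invokeHandler(context, result);  // existing dispatch
    
        if (status)  // assumed truthy on error
        {
            span.setAttribute("xrpl.rpc.status", "error");
            if (result.isMember("error_message"))
                span.setAttribute(
                    "xrpl.rpc.error", result["error_message"].asString());
        }
        else
        {
            span.setAttribute("xrpl.rpc.status", "success");
        }
        return status;
    }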
    

Key modified file:

  • src/xrpld/rpc/detail/RPCHandler.cpp (~15-20 lines added)

Task 8: Build, Run, and Verify End-to-End

Objective: Prove the full pipeline works: xrpld emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.

What to do:

  1. Start the Docker stack:

    docker compose -f docker/telemetry/docker-compose.yml up -d
    

    Verify Collector health: curl http://localhost:13133

  2. Build xrpld with telemetry:

    # Adjust for your actual build workflow
    conan install . --build=missing -o with_telemetry=True
    cmake --preset default -DXRPL_ENABLE_TELEMETRY=ON
    cmake --build --preset default
    
  3. Configure xrpld: Add to xrpld.cfg (or your local test config):

    [telemetry]
    enabled=1
    endpoint=http://localhost:4318/v1/traces
    sampling_ratio=1.0
    trace_rpc=1
    
  4. Start xrpld in standalone mode:

    ./rippled --conf xrpld.cfg -a --start
    
  5. Generate RPC traffic:

    # server_info
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"server_info","params":[{}]}'
    
    # ledger
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'
    
    # account_info (will error in standalone, that's fine — we trace errors too)
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
    
  6. Verify in Grafana (Tempo):

    • Open http://localhost:3000
    • Navigate to Explore → select Tempo datasource
    • Search for service xrpld
    • Confirm you see traces with spans: rpc.request -> rpc.process -> rpc.command.server_info
    • Click into a trace and verify attributes: xrpl.rpc.command, xrpl.rpc.status, xrpl.rpc.version
  7. Verify zero-overhead when disabled:

    • Rebuild with XRPL_ENABLE_TELEMETRY=OFF, or set enabled=0 in config
    • Run the same RPC calls
    • Confirm no new traces appear and no errors in xrpld logs

Verification Checklist:

  • Docker stack starts without errors
  • xrpld builds with -DXRPL_ENABLE_TELEMETRY=ON
  • xrpld starts and connects to OTel Collector (check xrpld logs for telemetry messages)
  • Traces appear in Grafana/Tempo under service "xrpld"
  • Span hierarchy is correct (parent-child relationships)
  • Span attributes are populated (xrpl.rpc.command, xrpl.rpc.status, etc.)
  • Error spans show error status and message
  • Building with XRPL_ENABLE_TELEMETRY=OFF produces no regressions
  • Setting enabled=0 at runtime produces no traces and no errors

Task 9: Document POC Results and Next Steps

Objective: Capture findings, screenshots, and remaining work for the team.

What to do:

  • Take screenshots of Grafana/Tempo showing:
    • The service list with "xrpld"
    • A trace with the full span tree
    • Span detail view showing attributes
  • Document any issues encountered (build issues, SDK quirks, missing attributes)
  • Note performance observations (build time impact, any noticeable runtime overhead)
  • Write a short summary of what the POC proves and what it doesn't cover yet:
    • Proves: OTel SDK integrates with xrpld, OTLP export works, RPC traces visible
    • Doesn't cover: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
  • Outline next steps mapping to the full plan phases; see the Next Steps (Post-POC) section below.

Summary

| Task | Description | New Files | Modified Files | Depends On |
| --- | --- | --- | --- | --- |
| 0 | Docker observability stack | 4 | 0 | none |
| 1 | OTel C++ SDK dependency | 0 | 2 | none |
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
| 5 | SpanGuard factory methods | 0 | 1 | 2 |
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
| 8 | End-to-end verification | 0 | 0 | 0-7 |
| 9 | Document results and next steps | 1 | 0 | 8 |

Parallel work: Tasks 0 and 1 can run in parallel. Tasks 3 and 5 are independent of each other once Task 2 lands. Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.


Next Steps (Post-POC)

Metrics Pipeline for Grafana Dashboards

The current POC exports traces only. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a metrics pipeline. To enable this:

  1. Add a spanmetrics connector to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:

    connectors:
      spanmetrics:
        histogram:
          explicit:
            buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
        dimensions:
          - name: xrpl.rpc.command
          - name: xrpl.rpc.status
    
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlp/tempo, spanmetrics]
        metrics:
          receivers: [spanmetrics]
          exporters: [prometheus]
    
  2. Add Prometheus to the Docker Compose stack to scrape the collector's metrics endpoint.

  3. Add Prometheus as a Grafana datasource and build dashboards for:

    • RPC request latency (p50/p95/p99) by command
    • RPC throughput (requests/sec) by command
    • Error rate by command
    • Span duration distribution

Additional Instrumentation

  • W3C traceparent header extraction in ServerHandler to support cross-service context propagation from external callers
  • WebSocket RPC tracing in ServerHandler::onWSMessage()
  • Transaction relay tracing across nodes using protobuf TraceContext messages
  • Consensus round and phase tracing for validator coordination visibility
  • Ledger close tracing to measure close-to-validated latency
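
A sketch of the first item above, W3C traceparent extraction, using the OTel propagation API. HttpHeaderCarrier is a hypothetical adapter over the session's already-parsed HTTP headers, and the global propagator is assumed to be set to opentelemetry::trace::propagation::HttpTraceContext during telemetry startup; the extracted context would feed the startSpan(name, parentContext, kind) overload from Task 2:

    // Hypothetical carrier over already-parsed request headers.
    #include <opentelemetry/context/propagation/global_propagator.h>
    #include <opentelemetry/context/propagation/text_map_propagator.h>
    #include <opentelemetry/context/runtime_context.h>
    
    #include <map>
    #include <string>
    
    class HttpHeaderCarrier
        : public opentelemetry::context::propagation::TextMapCarrier
    {
    public:
        explicit HttpHeaderCarrier(std::map<std::string, std::string> const& h)
            : headers_(h)
        {
        }
    
        opentelemetry::nostd::string_view
        Get(opentelemetry::nostd::string_view key) const noexcept override
        {
            auto const it = headers_.find(std::string(key));
            return it == headers_.end() ? "" : it->second;
        }
    
        void
        Set(opentelemetry::nostd::string_view,
            opentelemetry::nostd::string_view) noexcept override
        {
            // Extraction only; a server has nothing to inject here.
        }
    
    private:
        std::map<std::string, std::string> const& headers_;
    };
    
    // In onRequest(): recover the caller's context so the RPC span can be
    // parented on it.
    // HttpHeaderCarrier carrier(headers);
    // auto current = opentelemetry::context::RuntimeContext::GetCurrent();
    // auto parent = opentelemetry::context::propagation::
    //     GlobalTextMapPropagator::GetGlobalPropagator()
    //         ->Extract(carrier, current);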

Production Hardening

  • Tail-based sampling in the OTel Collector to reduce volume while retaining error/slow traces
  • TLS configuration for the OTLP exporter in production deployments
  • Resource limits on the batch processor queue to prevent unbounded memory growth
  • Health monitoring for the telemetry pipeline itself (collector lag, export failures)

POC Lessons Learned

Issues encountered during POC implementation that inform future work:

| Issue | Resolution | Impact on Future Work |
| --- | --- | --- |
| Conan lockfile rejected opentelemetry-cpp/1.18.0 | Used --lockfile="" to bypass | Lockfile must be regenerated when adding new dependencies |
| Conan package only builds the OTLP HTTP exporter, not gRPC | Switched from gRPC to the HTTP exporter (localhost:4318/v1/traces) | HTTP exporter is the default; gRPC requires a custom Conan profile |
| CMake targets opentelemetry-cpp::api etc. don't exist in the Conan package | Use the umbrella target opentelemetry-cpp::opentelemetry-cpp | Conan targets differ from upstream CMake targets |
| OTel Collector logging exporter deprecated | Renamed to the debug exporter | Use debug in all collector configs going forward |
| Macro parameter telemetry collided with the ::xrpl::telemetry:: namespace | Replaced macros with SpanGuard factory methods (no macros needed) | Factory methods avoid macro hygiene issues entirely |
| opentelemetry::trace::Scope creates a new context on move | Store the Scope as a member, created once in the constructor | SpanGuard move semantics need care with Scope lifecycle |
| TracerProviderFactory::Create returns unique_ptr<sdk::TracerProvider>, not nostd::shared_ptr | Use a std::shared_ptr member and wrap it in nostd::shared_ptr for the global provider | OTel SDK factory return types don't match API provider types |
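
For the Scope row above, a sketch of the fix: with the pimpl layout from Task 5, the Scope lives inside Impl, which never moves even when the SpanGuard itself is moved (only the unique_ptr changes hands), so the context token is created exactly once:

    // SpanGuard.cpp -- sketch of the Impl that pins the Scope in place.
    #include <opentelemetry/trace/scope.h>
    #include <opentelemetry/trace/span.h>
    
    struct SpanGuard::Impl
    {
        opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span;
        opentelemetry::trace::Scope scope;  // constructed exactly once
    
        explicit Impl(
            opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> s)
            : span(std::move(s))
            , scope(span)  // activates the span on this thread
        {
        }
    };
    // Moving a SpanGuard moves only the std::unique_ptr<Impl>; the Scope is
    // never moved or re-created, avoiding the pitfall noted in the table.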