OpenTelemetry POC Task List

Goal: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. A successful POC will show RPC request traces flowing from xrpld through an OTel Collector into Tempo, viewable in Grafana.

Scope: RPC tracing only (highest value, lowest risk, per the CRAWL phase in 06-implementation-phases.md). No cross-node P2P context propagation or consensus tracing in the POC.

Document Relevance to POC
| Document | Relevance to POC |
| --- | --- |
| 00-tracing-fundamentals.md | Core concepts: traces, spans, context propagation, sampling |
| 01-architecture-analysis.md | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| 02-design-decisions.md | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| 03-implementation-strategy.md | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| 04-code-samples.md | Telemetry interface (§4.1), SpanGuard factory methods (§4.2-4.3), RPC instrumentation (§4.5.3) |
| 05-configuration-reference.md | xrpld config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| 06-implementation-phases.md | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| 07-observability-backends.md | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |

Task 0: Docker Observability Stack Setup

OTLP = OpenTelemetry Protocol

Objective: Stand up the backend infrastructure to receive, store, and display traces.

What to do:

  • Create docker/telemetry/docker-compose.yml in the repo with three services:

    1. OpenTelemetry Collector (otel/opentelemetry-collector-contrib:0.92.0)
      • Expose ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP)
      • Expose port 13133 (health check)
      • Mount a config file docker/telemetry/otel-collector-config.yaml
    2. Tempo (grafana/tempo:2.6.1)
      • Expose port 3200 (HTTP API) and 4317 (OTLP gRPC, internal)
    3. Grafana (grafana/grafana:latest) — optional but useful
      • Expose port 3000
      • Enable anonymous admin access for local dev (GF_AUTH_ANONYMOUS_ENABLED=true, GF_AUTH_ANONYMOUS_ORG_ROLE=Admin)
      • Provision Tempo as a data source via docker/telemetry/grafana/provisioning/datasources/tempo.yaml
  • Create docker/telemetry/otel-collector-config.yaml:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    
    processors:
      batch:
        timeout: 1s
        send_batch_size: 100
    
    exporters:
      debug:  # the old 'logging' exporter is deprecated; use 'debug' (see POC Lessons Learned)
        verbosity: detailed
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
    
    extensions:
      health_check:  # serves the liveness endpoint on :13133, exposed by the compose file
    
    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlp/tempo]
    
  • Create Grafana Tempo datasource provisioning file at docker/telemetry/grafana/provisioning/datasources/tempo.yaml:

    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        access: proxy
        url: http://tempo:3200
    

Verification: Run docker compose -f docker/telemetry/docker-compose.yml up -d, then:

  • curl http://localhost:13133 returns healthy (Collector)
  • http://localhost:3000 opens Grafana (Tempo datasource available, no traces yet)

Task 1: Add OpenTelemetry C++ SDK Dependency

Objective: Make opentelemetry-cpp available to the build system.

What to do:

  • Edit conanfile.py to add opentelemetry-cpp as an optional dependency. The gRPC otel-plugin flag ("grpc/*:otel_plugin": False) in the existing conanfile may need to remain False, since the OTel SDK is pulled in separately.
    • Add a Conan option: with_telemetry = [True, False] defaulting to False
    • When with_telemetry is True, add opentelemetry-cpp to self.requires()
    • Required OTel Conan package: opentelemetry-cpp (which bundles the api, sdk, and exporter components). The package is available in Conan Center, though the POC hit a lockfile issue adding it (see POC Lessons Learned); FetchContent in CMake or a source build remain fallbacks if Conan resolution fails.
  • Edit CMakeLists.txt:
    • Add option: option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)
    • When ON, find_package(opentelemetry-cpp CONFIG REQUIRED) and add compile definition XRPL_ENABLE_TELEMETRY
    • When OFF, do nothing (zero build impact)
  • Verify the build succeeds with -DXRPL_ENABLE_TELEMETRY=OFF (no regressions) and with -DXRPL_ENABLE_TELEMETRY=ON (SDK links successfully).

Key files:

  • conanfile.py
  • CMakeLists.txt

Task 2: Create Core Telemetry Interface and NullTelemetry

Objective: Define the Telemetry abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.

What to do:

  • Create include/xrpl/telemetry/Telemetry.h:

    • Define namespace xrpl::telemetry
    • Define struct Telemetry::Setup holding: enabled, exporterEndpoint, samplingRatio, serviceName, serviceVersion, serviceInstanceId, traceRpc, traceTransactions, traceConsensus, tracePeer
    • Define abstract class Telemetry with:
      • virtual void start() = 0;
      • virtual void stop() = 0;
      • virtual bool isEnabled() const = 0;
      • virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "xrpld") = 0;
      • virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;
      • virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;
      • virtual bool shouldTraceRpc() const = 0;
      • virtual bool shouldTraceTransactions() const = 0;
      • virtual bool shouldTraceConsensus() const = 0;
    • Factory: std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);
    • Config parser: Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);
  • Create include/xrpl/telemetry/SpanGuard.h:

    • RAII guard with static factory methods (rpcSpan(), txSpan(), consensusSpan(), etc.) that access the global Telemetry::getInstance() singleton internally.
    • Uses the pimpl idiom to hide all OTel types; the public header has zero opentelemetry/ includes.
    • Convenience instance methods: setAttribute(), setOk(), setStatus(), addEvent(), recordException(), context(), discard()
    • When XRPL_ENABLE_TELEMETRY is not defined, the entire class compiles to a no-op stub.
    • See 04-code-samples.md §4.2-4.3 for the full API reference.
  • Create src/libxrpl/telemetry/NullTelemetry.cpp:

    • Implements Telemetry with all no-ops.
    • isEnabled() returns false, startSpan() returns a noop span.
    • This is used when XRPL_ENABLE_TELEMETRY is OFF or enabled=0 in config.
  • Guard all OTel SDK headers behind #ifdef XRPL_ENABLE_TELEMETRY. The NullTelemetry implementation should compile without the OTel SDK present.
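
A minimal sketch of how the public header described above could look. This is illustrative only: the Setup field types and default values are assumptions (the plan documents pin the exact types), the beast::Journal include path is assumed, and the OTel-facing methods (getTracer(), startSpan()) are omitted because their signatures involve types that stay hidden behind XRPL_ENABLE_TELEMETRY:

    // include/xrpl/telemetry/Telemetry.h -- illustrative sketch only.
    #include <xrpl/beast/utility/Journal.h>  // assumed path for beast::Journal
    
    #include <memory>
    #include <string>
    
    namespace xrpl::telemetry {
    
    class Telemetry
    {
    public:
        struct Setup
        {
            bool enabled = false;
            std::string exporterEndpoint = "http://localhost:4318/v1/traces";
            double samplingRatio = 1.0;
            std::string serviceName = "xrpld";
            std::string serviceVersion;
            std::string serviceInstanceId;
            bool traceRpc = true;
            bool traceTransactions = false;
            bool traceConsensus = false;
            bool tracePeer = false;
        };
    
        virtual ~Telemetry() = default;
    
        virtual void start() = 0;
        virtual void stop() = 0;
        virtual bool isEnabled() const = 0;
        virtual bool shouldTraceRpc() const = 0;
        virtual bool shouldTraceTransactions() const = 0;
        virtual bool shouldTraceConsensus() const = 0;
    
        // getTracer()/startSpan() overloads go here, guarded so that no
        // opentelemetry/ header leaks into builds without the SDK.
    };
    
    std::unique_ptr<Telemetry>
    make_Telemetry(Telemetry::Setup const& setup, beast::Journal journal);
    
    }  // namespace xrpl::telemetry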

Key new files:

  • include/xrpl/telemetry/Telemetry.h
  • include/xrpl/telemetry/SpanGuard.h
  • src/libxrpl/telemetry/NullTelemetry.cpp

Task 3: Implement OTel-Backed Telemetry

Objective: Implement the real Telemetry class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.

What to do:

  • Create src/libxrpl/telemetry/Telemetry.cpp (compiled only when XRPL_ENABLE_TELEMETRY=ON):

    • class TelemetryImpl : public Telemetry that:
      • In start(): creates a TracerProvider with:
        • Resource attributes: service.name, service.version, service.instance.id
        • An OtlpHttpExporter pointed at setup.exporterEndpoint (default http://localhost:4318/v1/traces; the Conan package ships only the HTTP exporter, see POC Lessons Learned)
        • A BatchSpanProcessor with configurable batch size and delay
        • A TraceIdRatioBasedSampler using setup.samplingRatio
      • Sets the global TracerProvider
      • In stop(): calls ForceFlush() then shuts down the provider
      • In startSpan(): delegates to getTracer()->StartSpan(name, ...)
      • shouldTraceRpc() etc. read from Setup fields
  • Create src/libxrpl/telemetry/TelemetryConfig.cpp:

    • setup_Telemetry() parses the [telemetry] config section from xrpld.cfg
    • Maps config keys: enabled, exporter, endpoint, sampling_ratio, trace_rpc, trace_transactions, trace_consensus, trace_peer
  • Wire make_Telemetry() factory:

    • If setup.enabled is true AND XRPL_ENABLE_TELEMETRY is defined: return TelemetryImpl
    • Otherwise: return NullTelemetry
  • Add telemetry source files to CMake. When XRPL_ENABLE_TELEMETRY=ON, compile Telemetry.cpp and TelemetryConfig.cpp and link against the umbrella target opentelemetry-cpp::opentelemetry-cpp (the per-component targets such as opentelemetry-cpp::api do not exist in the Conan package; see POC Lessons Learned). When OFF, compile only NullTelemetry.cpp. A sketch of the start() wiring follows.
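
Putting the start() sequence together, a sketch under the constraints above (opentelemetry-cpp 1.x factory APIs, HTTP exporter, and the nostd::shared_ptr wrapping from the Lessons Learned table). provider_ and setup_ are assumed members of TelemetryImpl; error handling and stop() are elided:

    // src/libxrpl/telemetry/Telemetry.cpp -- sketch of TelemetryImpl::start().
    #include <opentelemetry/exporters/otlp/otlp_http_exporter_factory.h>
    #include <opentelemetry/sdk/resource/resource.h>
    #include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
    #include <opentelemetry/sdk/trace/samplers/trace_id_ratio_factory.h>
    #include <opentelemetry/sdk/trace/tracer_provider_factory.h>
    #include <opentelemetry/trace/provider.h>
    
    #include <chrono>
    #include <memory>
    
    namespace otlp = opentelemetry::exporter::otlp;
    namespace sdktrace = opentelemetry::sdk::trace;
    
    void
    TelemetryImpl::start()
    {
        otlp::OtlpHttpExporterOptions exporterOpts;
        exporterOpts.url = setup_.exporterEndpoint;  // http://localhost:4318/v1/traces
        auto exporter = otlp::OtlpHttpExporterFactory::Create(exporterOpts);
    
        sdktrace::BatchSpanProcessorOptions batchOpts;
        batchOpts.max_export_batch_size = 100;  // placeholder tuning values
        batchOpts.schedule_delay_millis = std::chrono::milliseconds(1000);
        auto processor = sdktrace::BatchSpanProcessorFactory::Create(
            std::move(exporter), batchOpts);
    
        auto resource = opentelemetry::sdk::resource::Resource::Create(
            {{"service.name", setup_.serviceName},
             {"service.version", setup_.serviceVersion},
             {"service.instance.id", setup_.serviceInstanceId}});
    
        auto sampler =
            sdktrace::TraceIdRatioBasedSamplerFactory::Create(setup_.samplingRatio);
    
        // The factory returns unique_ptr<sdk::TracerProvider>; keep a
        // std::shared_ptr member and hand a nostd::shared_ptr to the global
        // API (see the Lessons Learned table at the end of this document).
        provider_ = std::shared_ptr<sdktrace::TracerProvider>(
            sdktrace::TracerProviderFactory::Create(
                std::move(processor), resource, std::move(sampler)));
        opentelemetry::trace::Provider::SetTracerProvider(
            opentelemetry::nostd::shared_ptr<opentelemetry::trace::TracerProvider>(
                provider_));
    }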

Key new files:

  • src/libxrpl/telemetry/Telemetry.cpp
  • src/libxrpl/telemetry/TelemetryConfig.cpp

Key modified files:

  • CMakeLists.txt (add telemetry library target)

Task 4: Integrate Telemetry into Application Lifecycle

Objective: Wire the Telemetry object into the ServiceRegistry / Application so all components can access it.

What to do:

  • Edit include/xrpl/core/ServiceRegistry.h:

    • Forward-declare namespace telemetry { class Telemetry; } inside namespace xrpl
    • Add pure virtual method: virtual telemetry::Telemetry& getTelemetry() = 0;
    • (Application extends ServiceRegistry, so this is automatically available on Application too)
  • Edit src/xrpld/app/main/Application.cpp (the ApplicationImp class):

    • Add member: std::unique_ptr<telemetry::Telemetry> telemetry_;
    • In the member initializer list, construct telemetry with an empty serviceInstanceId (node identity is not yet known):
      , telemetry_(
            telemetry::make_Telemetry(
                telemetry::setup_Telemetry(
                    config_->section("telemetry"),
                    "",  // Updated later via setServiceInstanceId()
                    BuildInfo::getVersionString()),
                logs_->journal("Telemetry")))
      
    • In setup(), after nodeIdentity_ is resolved, inject the node public key as the service instance ID:
      if (!config_->section("telemetry").exists("service_instance_id"))
          telemetry_->setServiceInstanceId(
              toBase58(TokenType::NodePublic, nodeIdentity_->first));
      
    • In start(): call telemetry_->start()
    • In run() (shutdown path): call telemetry_->stop() (to flush pending spans)
    • Implement getTelemetry() override: return *telemetry_
  • Add [telemetry] section to the example config cfg/xrpld-example.cfg:

    # [telemetry]
    # enabled=1
    # endpoint=http://localhost:4318/v1/traces
    # sampling_ratio=1.0
    # trace_rpc=1
    

Access patterns: Components holding ServiceRegistry& (e.g. NetworkOPsImp) call registry_.get().getTelemetry(). Components holding Application& (e.g. ServerHandler, PeerImp, RCLConsensusAdaptor) call app_.getTelemetry() directly. Both resolve to the same Telemetry instance.
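
In code, the two patterns look like this (hypothetical call sites; registry_ is assumed to be a std::reference_wrapper<ServiceRegistry>, as the .get() call implies):

    // Inside a component holding ServiceRegistry& (e.g. NetworkOPsImp):
    telemetry::Telemetry& tel = registry_.get().getTelemetry();
    
    // Inside a component holding Application& (e.g. ServerHandler):
    telemetry::Telemetry& tel2 = app_.getTelemetry();
    
    // Both references resolve to the single instance owned by ApplicationImp.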

Key modified files:

  • include/xrpl/core/ServiceRegistry.h
  • src/xrpld/app/main/Application.cpp
  • cfg/xrpld-example.cfg (example config)

Task 5: Add SpanGuard Factory Methods

Objective: Add static factory methods to SpanGuard that provide type-safe, one-liner instrumentation and compile to zero-cost no-ops when telemetry is disabled. This replaces the earlier macro-based approach (TracingInstrumentation.h has been removed).

What to do:

  • Update include/xrpl/telemetry/SpanGuard.h:

    • Add static factory methods that access the global Telemetry::getInstance() singleton and check the relevant component filter before creating a span:

      // Each factory checks the global Telemetry instance internally.
      // No Telemetry& reference needed at the call site.
      auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
      span.setAttribute("xrpl.rpc.command", command);
      span.setAttribute("xrpl.rpc.status", status);
      
    • Factory methods: rpcSpan(), txSpan(), consensusSpan(), peerSpan(), ledgerSpan(), span()

    • Use the pimpl idiom to hide all OTel types from the public header (zero opentelemetry/ includes)

    • When XRPL_ENABLE_TELEMETRY is NOT defined, the entire class compiles to a no-op stub with empty inline method bodies

  • No separate TracingInstrumentation.h file is needed. All instrumentation call sites use #include <xrpl/telemetry/SpanGuard.h> directly.
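
A sketch of the dual build described above, assuming the pimpl layout from Task 2 (the method set is abbreviated; exact signatures live in 04-code-samples.md §4.2-4.3):

    // include/xrpl/telemetry/SpanGuard.h -- illustrative shape only.
    #include <memory>
    #include <string_view>
    
    namespace xrpl::telemetry {
    
    #ifdef XRPL_ENABLE_TELEMETRY
    
    class SpanGuard
    {
    public:
        static SpanGuard rpcSpan(std::string_view name);  // checks trace_rpc
        static SpanGuard txSpan(std::string_view name);   // checks trace_transactions
        // ... consensusSpan(), peerSpan(), ledgerSpan(), span()
    
        void setAttribute(std::string_view key, std::string_view value);
        void setOk();
        ~SpanGuard();  // ends the span
    
    private:
        struct Impl;                  // holds the OTel span/scope; defined in the .cpp
        std::unique_ptr<Impl> impl_;  // null when the component filter declined
    };
    
    #else  // XRPL_ENABLE_TELEMETRY not defined: zero-cost stub
    
    class SpanGuard
    {
    public:
        static SpanGuard rpcSpan(std::string_view) { return {}; }
        static SpanGuard txSpan(std::string_view) { return {}; }
        void setAttribute(std::string_view, std::string_view) {}
        void setOk() {}
        // All other methods likewise have empty inline bodies, so the
        // optimizer removes every call site.
    };
    
    #endif
    
    }  // namespace xrpl::telemetry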

Key modified file:

  • include/xrpl/telemetry/SpanGuard.h

Task 6: Instrument RPC ServerHandler

WS = WebSocket

Objective: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.

What to do:

  • Edit src/xrpld/rpc/detail/ServerHandler.cpp:

    • #include <xrpl/telemetry/SpanGuard.h>
    • In ServerHandler::onRequest(Session& session):
      • At the top of the method, add: auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
      • After the RPC command name is extracted, set attribute: span.setAttribute("xrpl.rpc.command", command);
      • After the response status is known, set: span.setAttribute("http.status_code", static_cast<int64_t>(statusCode));
      • Wrap error paths with: span.recordException(e);
    • In ServerHandler::processRequest(...):
      • Add a child span: auto span = telemetry::SpanGuard::rpcSpan("rpc.process");
      • Set method attribute: span.setAttribute("xrpl.rpc.method", request_method);
    • In ServerHandler::onWSMessage(...) (WebSocket path):
      • Add: auto span = telemetry::SpanGuard::rpcSpan("rpc.ws.message");
  • The goal is to see spans like:

    rpc.request
      └── rpc.process
    

    in Tempo/Grafana for every HTTP RPC call.
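
The instrumented entry point might take the following shape. This is a sketch: extractCommand() and the statusCode handling are placeholders for logic the handler already has, not real helpers in the codebase:

    // src/xrpld/rpc/detail/ServerHandler.cpp -- sketch of onRequest().
    #include <xrpl/telemetry/SpanGuard.h>
    
    void
    ServerHandler::onRequest(Session& session)
    {
        auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
        try
        {
            std::string const command = extractCommand(session);  // placeholder
            span.setAttribute("xrpl.rpc.command", command);
    
            // ... existing request handling runs here and eventually calls
            // processRequest(), whose own "rpc.process" span becomes a child ...
    
            int const statusCode = 200;  // placeholder for the real status
            span.setAttribute(
                "http.status_code", static_cast<std::int64_t>(statusCode));
            span.setOk();
        }
        catch (std::exception const& e)
        {
            span.recordException(e);
            throw;  // existing error handling still applies
        }
    }  // span ends here via RAII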

Key modified file:

  • src/xrpld/rpc/detail/ServerHandler.cpp (~15-25 lines added)

Task 7: Instrument RPC Command Execution

Objective: Add per-command tracing inside the RPC handler so each command (e.g., submit, account_info, server_info) gets its own child span.

What to do:

  • Edit src/xrpld/rpc/detail/RPCHandler.cpp:

    • #include <xrpl/telemetry/SpanGuard.h>
    • In doCommand(RPC::JsonContext& context, Json::Value& result):
      • At the top: auto span = telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);
      • Set attributes:
        • span.setAttribute("xrpl.rpc.command", context.method);
        • span.setAttribute("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));
        • span.setAttribute("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");
      • On success: span.setAttribute("xrpl.rpc.status", "success");
      • On error: span.setAttribute("xrpl.rpc.status", "error"); and set the error message
  • After this, traces in Tempo/Grafana should look like:

    rpc.request  (xrpl.rpc.command=account_info)
      └── rpc.process
            └── rpc.command.account_info  (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
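
The doCommand() additions in context, as a sketch: invokeHandler() is a hypothetical stand-in for the existing handler-table dispatch, and the assumption that RPC::Status is truthy on error follows its use elsewhere in the RPC code:

    // src/xrpld/rpc/detail/RPCHandler.cpp -- sketch of the doCommand() additions.
    Status
    doCommand(RPC::JsonContext& context, Json::Value& result)
    {
        auto span =
            telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);
        span.setAttribute("xrpl.rpc.command", context.method);
        span.setAttribute(
            "xrpl.rpc.version", static_cast<std::int64_t>(context.apiVersion));
        span.setAttribute(
            "xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");
    
        Status const status = invokeHandler(context, result);  // existing dispatch
    
        if (status)  // assumed truthy on error
        {
            span.setAttribute("xrpl.rpc.status", "error");
            if (result.isMember("error_message"))
                span.setAttribute(
                    "xrpl.rpc.error", result["error_message"].asString());
        }
        else
        {
            span.setAttribute("xrpl.rpc.status", "success");
        }
        return status;
    }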
    

Key modified file:

  • src/xrpld/rpc/detail/RPCHandler.cpp (~15-20 lines added)

Task 8: Build, Run, and Verify End-to-End

Objective: Prove the full pipeline works: xrpld emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.

What to do:

  1. Start the Docker stack:

    docker compose -f docker/telemetry/docker-compose.yml up -d
    

    Verify Collector health: curl http://localhost:13133

  2. Build xrpld with telemetry:

    # Adjust for your actual build workflow
    conan install . --build=missing -o with_telemetry=True
    cmake --preset default -DXRPL_ENABLE_TELEMETRY=ON
    cmake --build --preset default
    
  3. Configure xrpld: Add to xrpld.cfg (or your local test config):

    [telemetry]
    enabled=1
    endpoint=http://localhost:4318/v1/traces
    sampling_ratio=1.0
    trace_rpc=1
    
  4. Start xrpld in standalone mode:

    ./rippled --conf xrpld.cfg -a --start
    
  5. Generate RPC traffic:

    # server_info
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"server_info","params":[{}]}'
    
    # ledger
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'
    
    # account_info (will error in standalone, that's fine — we trace errors too)
    curl -s -X POST http://localhost:5005 \
      -H "Content-Type: application/json" \
      -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
    
  6. Verify in Grafana (Tempo):

    • Open http://localhost:3000
    • Navigate to Explore → select Tempo datasource
    • Search for service xrpld
    • Confirm you see traces with spans: rpc.request -> rpc.process -> rpc.command.server_info
    • Click into a trace and verify attributes: xrpl.rpc.command, xrpl.rpc.status, xrpl.rpc.version
  7. Verify zero-overhead when disabled:

    • Rebuild with XRPL_ENABLE_TELEMETRY=OFF, or set enabled=0 in config
    • Run the same RPC calls
    • Confirm no new traces appear and no errors in xrpld logs

Verification Checklist:

  • Docker stack starts without errors
  • xrpld builds with -DXRPL_ENABLE_TELEMETRY=ON
  • xrpld starts and connects to OTel Collector (check xrpld logs for telemetry messages)
  • Traces appear in Grafana/Tempo under service "xrpld"
  • Span hierarchy is correct (parent-child relationships)
  • Span attributes are populated (xrpl.rpc.command, xrpl.rpc.status, etc.)
  • Error spans show error status and message
  • Building with XRPL_ENABLE_TELEMETRY=OFF produces no regressions
  • Setting enabled=0 at runtime produces no traces and no errors

Task 9: Document POC Results and Next Steps

Objective: Capture findings, screenshots, and remaining work for the team.

What to do:

  • Take screenshots of Grafana/Tempo showing:
    • The service list with "xrpld"
    • A trace with the full span tree
    • Span detail view showing attributes
  • Document any issues encountered (build issues, SDK quirks, missing attributes)
  • Note performance observations (build time impact, any noticeable runtime overhead)
  • Write a short summary of what the POC proves and what it doesn't cover yet:
    • Proves: OTel SDK integrates with xrpld, OTLP export works, RPC traces visible
    • Doesn't cover: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
  • Outline next steps mapping to the full plan phases; see the Next Steps (Post-POC) section below.

Summary

| Task | Description | New Files | Modified Files | Depends On |
| --- | --- | --- | --- | --- |
| 0 | Docker observability stack | 4 | 0 | none |
| 1 | OTel C++ SDK dependency | 0 | 2 | none |
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
| 5 | SpanGuard factory methods | 0 | 1 | 2 |
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
| 8 | End-to-end verification | 0 | 0 | 0-7 |
| 9 | Document results and next steps | 1 | 0 | 8 |

Parallel work: Tasks 0 and 1 can run in parallel. Tasks 3 and 5 are independent of each other once Task 2 lands. Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.


Next Steps (Post-POC)

Metrics Pipeline for Grafana Dashboards

The current POC exports traces only. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a metrics pipeline. To enable this:

  1. Add a spanmetrics connector to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:

    connectors:
      spanmetrics:
        histogram:
          explicit:
            buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
        dimensions:
          - name: xrpl.rpc.command
          - name: xrpl.rpc.status
    
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug, otlp/tempo, spanmetrics]
        metrics:
          receivers: [spanmetrics]
          exporters: [prometheus]
    
  2. Add Prometheus to the Docker Compose stack to scrape the collector's metrics endpoint.

  3. Add Prometheus as a Grafana datasource and build dashboards for:

    • RPC request latency (p50/p95/p99) by command
    • RPC throughput (requests/sec) by command
    • Error rate by command
    • Span duration distribution

Additional Instrumentation

  • W3C traceparent header extraction in ServerHandler to support cross-service context propagation from external callers
  • WebSocket RPC tracing in ServerHandler::onWSMessage()
  • Transaction relay tracing across nodes using protobuf TraceContext messages
  • Consensus round and phase tracing for validator coordination visibility
  • Ledger close tracing to measure close-to-validated latency
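
A sketch of the first item above, W3C traceparent extraction, using the OTel propagation API. HttpHeaderCarrier is a hypothetical adapter over the session's already-parsed HTTP headers, and the global propagator is assumed to be set to opentelemetry::trace::propagation::HttpTraceContext during telemetry startup; the extracted context would feed the startSpan(name, parentContext, kind) overload from Task 2:

    // Hypothetical carrier over already-parsed request headers.
    #include <opentelemetry/context/propagation/global_propagator.h>
    #include <opentelemetry/context/propagation/text_map_propagator.h>
    #include <opentelemetry/context/runtime_context.h>
    
    #include <map>
    #include <string>
    
    class HttpHeaderCarrier
        : public opentelemetry::context::propagation::TextMapCarrier
    {
    public:
        explicit HttpHeaderCarrier(std::map<std::string, std::string> const& h)
            : headers_(h)
        {
        }
    
        opentelemetry::nostd::string_view
        Get(opentelemetry::nostd::string_view key) const noexcept override
        {
            auto const it = headers_.find(std::string(key));
            return it == headers_.end() ? "" : it->second;
        }
    
        void
        Set(opentelemetry::nostd::string_view,
            opentelemetry::nostd::string_view) noexcept override
        {
            // Extraction only; a server has nothing to inject here.
        }
    
    private:
        std::map<std::string, std::string> const& headers_;
    };
    
    // In onRequest(): recover the caller's context so the RPC span can be
    // parented on it.
    // HttpHeaderCarrier carrier(headers);
    // auto current = opentelemetry::context::RuntimeContext::GetCurrent();
    // auto parent = opentelemetry::context::propagation::
    //     GlobalTextMapPropagator::GetGlobalPropagator()
    //         ->Extract(carrier, current);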

Production Hardening

  • Tail-based sampling in the OTel Collector to reduce volume while retaining error/slow traces
  • TLS configuration for the OTLP exporter in production deployments
  • Resource limits on the batch processor queue to prevent unbounded memory growth
  • Health monitoring for the telemetry pipeline itself (collector lag, export failures)

POC Lessons Learned

Issues encountered during POC implementation that inform future work:

| Issue | Resolution | Impact on Future Work |
| --- | --- | --- |
| Conan lockfile rejected opentelemetry-cpp/1.18.0 | Used --lockfile="" to bypass | Lockfile must be regenerated when adding new dependencies |
| Conan package only builds the OTLP HTTP exporter, not gRPC | Switched from gRPC to the HTTP exporter (localhost:4318/v1/traces) | HTTP exporter is the default; gRPC requires a custom Conan profile |
| CMake targets opentelemetry-cpp::api etc. don't exist in the Conan package | Use the umbrella target opentelemetry-cpp::opentelemetry-cpp | Conan targets differ from upstream CMake targets |
| OTel Collector logging exporter deprecated | Renamed to the debug exporter | Use debug in all collector configs going forward |
| Macro parameter telemetry collided with the ::xrpl::telemetry:: namespace | Replaced macros with SpanGuard factory methods (no macros needed) | Factory methods avoid macro hygiene issues entirely |
| opentelemetry::trace::Scope creates a new context on move | Store the Scope as a member, created once in the constructor | SpanGuard move semantics need care with Scope lifecycle |
| TracerProviderFactory::Create returns unique_ptr<sdk::TracerProvider>, not nostd::shared_ptr | Use a std::shared_ptr member and wrap it in nostd::shared_ptr for the global provider | OTel SDK factory return types don't match API provider types |
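
For the Scope row above, a sketch of the fix: with the pimpl layout from Task 5, the Scope lives inside Impl, which never moves even when the SpanGuard itself is moved (only the unique_ptr changes hands), so the context token is created exactly once:

    // SpanGuard.cpp -- sketch of the Impl that pins the Scope in place.
    #include <opentelemetry/trace/scope.h>
    #include <opentelemetry/trace/span.h>
    
    struct SpanGuard::Impl
    {
        opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span;
        opentelemetry::trace::Scope scope;  // constructed exactly once
    
        explicit Impl(
            opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> s)
            : span(std::move(s))
            , scope(span)  // activates the span on this thread
        {
        }
    };
    // Moving a SpanGuard moves only the std::unique_ptr<Impl>; the Scope is
    // never moved or re-created, avoiding the pitfall noted in the table.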