# OpenTelemetry POC Task List

> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. A successful POC will show RPC request traces flowing from xrpld through an OTel Collector into Tempo, viewable in Grafana.
>
> **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.

### Related Plan Documents

| Document | Relevance to POC |
| --- | --- |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Core concepts: traces, spans, context propagation, sampling |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard factory methods (§4.2-4.3), RPC instrumentation (§4.5.3) |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| [07-observability-backends.md](./07-observability-backends.md) | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |

---

## Task 0: Docker Observability Stack Setup

> **OTLP** = OpenTelemetry Protocol

**Objective**: Stand up the backend infrastructure to receive, store, and display traces.

**What to do**:

- Create `docker/telemetry/docker-compose.yml` in the repo with three services:
  1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:0.92.0`)
     - Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP)
     - Expose port `13133` (health check)
     - Mount a config file `docker/telemetry/otel-collector-config.yaml`
  2. **Tempo** (`grafana/tempo:2.6.1`)
     - Expose port `3200` (HTTP API) and `4317` (OTLP gRPC, internal)
  3. **Grafana** (`grafana/grafana:latest`) — optional but useful
     - Expose port `3000`
     - Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`)
     - Provision Tempo as a data source via `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`

- Create `docker/telemetry/otel-collector-config.yaml`. The `debug` exporter replaces the deprecated `logging` exporter (see POC Lessons Learned):

  ```yaml
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 1s
      send_batch_size: 100

  exporters:
    debug:
      verbosity: detailed
    otlp/tempo:
      endpoint: tempo:4317
      tls:
        insecure: true

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [debug, otlp/tempo]
  ```

- Create Grafana Tempo datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`:

  ```yaml
  apiVersion: 1

  datasources:
    - name: Tempo
      type: tempo
      access: proxy
      url: http://tempo:3200
  ```

**Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then:

- `curl http://localhost:13133` returns healthy (Collector)
- `http://localhost:3000` opens Grafana (Tempo datasource available, no traces yet)

**Reference**:

- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Tempo exporter)
- [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment
- [07-observability-backends.md §7.1](./07-observability-backends.md) — Tempo quick start and backend selection
- [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards

---

## Task 1: Add OpenTelemetry C++ SDK Dependency

**Objective**: Make `opentelemetry-cpp` available to the build system.

**What to do**:

- Edit `conanfile.py` to add `opentelemetry-cpp` as an **optional** dependency. The gRPC otel plugin flag (`"grpc/*:otel_plugin": False`) in the existing conanfile may need to remain false — we pull the OTel SDK separately.
  - Add a Conan option: `with_telemetry = [True, False]` defaulting to `False`
  - When `with_telemetry` is `True`, add `opentelemetry-cpp` to `self.requires()`
  - Required OTel Conan components: `opentelemetry-cpp` (which bundles api, sdk, and exporters). If the package isn't in Conan Center, consider using `FetchContent` in CMake or building from source as a fallback.
- Edit `CMakeLists.txt`:
  - Add option: `option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)`
  - When ON, `find_package(opentelemetry-cpp CONFIG REQUIRED)` and add compile definition `XRPL_ENABLE_TELEMETRY`
  - When OFF, do nothing (zero build impact)
- Verify the build succeeds with `-DXRPL_ENABLE_TELEMETRY=OFF` (no regressions) and with `-DXRPL_ENABLE_TELEMETRY=ON` (SDK links successfully).

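
One way to confirm that the compile definition actually reaches the compiler is a tiny translation unit that behaves differently in each configuration. The sketch below is not taken from the plan documents: the function name `telemetryBuildInfo` is hypothetical, and it assumes the `XRPL_ENABLE_TELEMETRY` definition described above plus the `OPENTELEMETRY_VERSION` macro from `opentelemetry/version.h`. It checks header visibility only, not that the SDK links.

```cpp
// Hypothetical smoke test for the XRPL_ENABLE_TELEMETRY compile definition.
#include <string>

#ifdef XRPL_ENABLE_TELEMETRY
#include <opentelemetry/version.h>  // only reachable when the SDK is installed
#endif

// Reports which configuration this translation unit was compiled in.
inline std::string
telemetryBuildInfo()
{
#ifdef XRPL_ENABLE_TELEMETRY
    return std::string("telemetry enabled, opentelemetry-cpp ") +
        OPENTELEMETRY_VERSION;
#else
    return "telemetry disabled at compile time";
#endif
}
```

Printing the returned string from a throwaway test binary in both build configurations gives a quick signal before any real instrumentation exists.
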
**Key files**:

- `conanfile.py`
- `CMakeLists.txt`

**Reference**:

- [05-configuration-reference.md §5.4](./05-configuration-reference.md) — CMake integration, `FindOpenTelemetry.cmake`, `XRPL_ENABLE_TELEMETRY` option
- [03-implementation-strategy.md §3.2](./03-implementation-strategy.md) — Key principle: zero-cost when disabled via compile-time flags
- [02-design-decisions.md §2.1](./02-design-decisions.md) — SDK selection rationale and required OTel components

---

## Task 2: Create Core Telemetry Interface and NullTelemetry

**Objective**: Define the `Telemetry` abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.

**What to do**:

- Create `include/xrpl/telemetry/Telemetry.h`:
  - Define `namespace xrpl::telemetry`
  - Define `struct Telemetry::Setup` holding: `enabled`, `exporterEndpoint`, `samplingRatio`, `serviceName`, `serviceVersion`, `serviceInstanceId`, `traceRpc`, `traceTransactions`, `traceConsensus`, `tracePeer`
  - Define abstract `class Telemetry` with:
    - `virtual void start() = 0;`
    - `virtual void stop() = 0;`
    - `virtual bool isEnabled() const = 0;`
    - `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "xrpld") = 0;`
    - `virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;`
    - `virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;`
    - `virtual bool shouldTraceRpc() const = 0;`
    - `virtual bool shouldTraceTransactions() const = 0;`
    - `virtual bool shouldTraceConsensus() const = 0;`
  - Factory: `std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);`
  - Config parser: `Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);`
- Create `include/xrpl/telemetry/SpanGuard.h`:
  - RAII guard with static factory methods (`rpcSpan()`, `txSpan()`, `consensusSpan()`, etc.) that access the global `Telemetry::getInstance()` singleton internally.
  - Uses pimpl idiom to hide all OTel types -- the public header has zero `opentelemetry/` includes.
  - Convenience instance methods: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`, `discard()`
  - When `XRPL_ENABLE_TELEMETRY` is not defined, the entire class compiles to a no-op stub.
  - See [04-code-samples.md](./04-code-samples.md) §4.2-4.3 for the full API reference.
- Create `src/libxrpl/telemetry/NullTelemetry.cpp`:
  - Implements `Telemetry` with all no-ops.
  - `isEnabled()` returns `false`, `startSpan()` returns a noop span.
  - This is used when `XRPL_ENABLE_TELEMETRY` is OFF or `enabled=0` in config.
- Guard all OTel SDK headers behind `#ifdef XRPL_ENABLE_TELEMETRY`. The `NullTelemetry` implementation should compile without the OTel SDK present.

**Key new files**:

- `include/xrpl/telemetry/Telemetry.h`
- `include/xrpl/telemetry/SpanGuard.h`
- `src/libxrpl/telemetry/NullTelemetry.cpp`

**Reference**:

- [04-code-samples.md §4.1](./04-code-samples.md) — Full `Telemetry` interface with `Setup` struct, lifecycle, tracer access, span creation, and component filtering methods
- [04-code-samples.md §4.2-4.3](./04-code-samples.md) — SpanGuard with factory methods, pimpl design, no-op stub, and discard support
- [03-implementation-strategy.md §3.1](./03-implementation-strategy.md) — Directory structure: `include/xrpl/telemetry/` for headers, `src/libxrpl/telemetry/` for implementation
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation and zero-cost compile-time disabled pattern

---

## Task 3: Implement OTel-Backed Telemetry

> **OTLP** = OpenTelemetry Protocol

**Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.

**What to do**:

- Create `src/libxrpl/telemetry/Telemetry.cpp` (compiled only when `XRPL_ENABLE_TELEMETRY=ON`):
  - `class TelemetryImpl : public Telemetry` that:
    - In `start()`: creates a `TracerProvider` with:
      - Resource attributes: `service.name`, `service.version`, `service.instance.id`
      - An `OtlpHttpExporter` pointed at `setup.exporterEndpoint` (default `localhost:4318`)
      - A `BatchSpanProcessor` with configurable batch size and delay
      - A `TraceIdRatioBasedSampler` using `setup.samplingRatio`
    - Sets the global `TracerProvider`
    - In `stop()`: calls `ForceFlush()` then shuts down the provider
    - In `startSpan()`: delegates to `getTracer()->StartSpan(name, ...)`
    - `shouldTraceRpc()` etc. read from `Setup` fields
- Create `src/libxrpl/telemetry/TelemetryConfig.cpp`:
  - `setup_Telemetry()` parses the `[telemetry]` config section from `xrpld.cfg`
  - Maps config keys: `enabled`, `exporter`, `endpoint`, `sampling_ratio`, `trace_rpc`, `trace_transactions`, `trace_consensus`, `trace_peer`
- Wire `make_Telemetry()` factory:
  - If `setup.enabled` is true AND `XRPL_ENABLE_TELEMETRY` is defined: return `TelemetryImpl`
  - Otherwise: return `NullTelemetry`
- Add telemetry source files to CMake. When `XRPL_ENABLE_TELEMETRY=ON`, compile `Telemetry.cpp` and `TelemetryConfig.cpp` alongside `NullTelemetry.cpp` (still needed for the `enabled=0` fallback in `make_Telemetry()`) and link against the OTel API, SDK, and OTLP HTTP exporter. With the Conan package this means the umbrella target `opentelemetry-cpp::opentelemetry-cpp`, because the per-component targets (`opentelemetry-cpp::api`, `opentelemetry-cpp::sdk`, and the exporter targets) are not defined there (see POC Lessons Learned). When OFF, compile only `NullTelemetry.cpp`.

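
The config-parsing and factory rules above can be sketched without the OTel SDK. This is a simplified illustration rather than the plan's implementation. `Section` is modeled as a `std::map`, the interface is reduced to two methods, `TelemetryImpl` is a placeholder, and the defaults for `sampling_ratio` and `trace_rpc` are assumptions. All names live in a hypothetical `sketch` namespace.

```cpp
// Sketch only: simplified stand-ins for setup_Telemetry() and make_Telemetry().
#include <algorithm>
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

struct Setup
{
    bool enabled = false;
    std::string exporterEndpoint = "localhost:4318";
    double samplingRatio = 1.0;  // assumed default
    bool traceRpc = true;        // assumed default
};

using Section = std::map<std::string, std::string>;  // hypothetical stand-in

// Map [telemetry] keys onto Setup; missing keys keep their defaults.
// std::stod throws on malformed input, which a real parser should handle.
inline Setup
setupTelemetry(Section const& section)
{
    Setup setup;
    if (auto it = section.find("enabled"); it != section.end())
        setup.enabled = (it->second == "1");
    if (auto it = section.find("endpoint"); it != section.end())
        setup.exporterEndpoint = it->second;
    if (auto it = section.find("sampling_ratio"); it != section.end())
        setup.samplingRatio = std::clamp(std::stod(it->second), 0.0, 1.0);
    if (auto it = section.find("trace_rpc"); it != section.end())
        setup.traceRpc = (it->second == "1");
    assert(setup.samplingRatio >= 0.0 && setup.samplingRatio <= 1.0);
    return setup;
}

class Telemetry
{
public:
    virtual ~Telemetry() = default;
    virtual bool isEnabled() const = 0;
    virtual bool shouldTraceRpc() const = 0;
};

class NullTelemetry final : public Telemetry
{
public:
    bool isEnabled() const override { return false; }
    bool shouldTraceRpc() const override { return false; }
};

#ifdef XRPL_ENABLE_TELEMETRY
// Placeholder: the real class would own the TracerProvider and exporter.
class TelemetryImpl final : public Telemetry
{
    Setup setup_;

public:
    explicit TelemetryImpl(Setup setup) : setup_(std::move(setup)) {}
    bool isEnabled() const override { return true; }
    bool shouldTraceRpc() const override { return setup_.traceRpc; }
};
#endif

// Real implementation only when compiled in AND enabled at runtime.
inline std::unique_ptr<Telemetry>
makeTelemetry(Setup const& setup)
{
#ifdef XRPL_ENABLE_TELEMETRY
    if (setup.enabled)
        return std::make_unique<TelemetryImpl>(setup);
#endif
    (void)setup;  // unused when telemetry is compiled out
    return std::make_unique<NullTelemetry>();
}

}  // namespace sketch
```

Keeping the `#ifdef` inside the factory means callers never branch on the build flag themselves; they always receive some `Telemetry` object.
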
**Key new files**:

- `src/libxrpl/telemetry/Telemetry.cpp`
- `src/libxrpl/telemetry/TelemetryConfig.cpp`

**Key modified files**:

- `CMakeLists.txt` (add telemetry library target)

**Reference**:

- [04-code-samples.md §4.1](./04-code-samples.md) — `Telemetry` interface that `TelemetryImpl` must implement
- [05-configuration-reference.md §5.2](./05-configuration-reference.md) — `setup_Telemetry()` config parser implementation
- [02-design-decisions.md §2.2](./02-design-decisions.md) — OTLP/gRPC exporter config (endpoint, TLS options)
- [02-design-decisions.md §2.4.1](./02-design-decisions.md) — Resource attributes: `service.name`, `service.version`, `service.instance.id`, `xrpl.network.id`
- [03-implementation-strategy.md §3.4](./03-implementation-strategy.md) — Per-operation CPU costs and overhead budget for span creation
- [03-implementation-strategy.md §3.5](./03-implementation-strategy.md) — Memory overhead: static (~456 KB) and dynamic (~1.2 MB) budgets

---

## Task 4: Integrate Telemetry into Application Lifecycle

**Objective**: Wire the `Telemetry` object into the `ServiceRegistry` / `Application` so all components can access it.

**What to do**:

- Edit `include/xrpl/core/ServiceRegistry.h`:
  - Forward-declare `namespace telemetry { class Telemetry; }` inside `namespace xrpl`
  - Add pure virtual method: `virtual telemetry::Telemetry& getTelemetry() = 0;`
  - (`Application` extends `ServiceRegistry`, so this is automatically available on `Application` too)
- Edit `src/xrpld/app/main/Application.cpp` (the `ApplicationImp` class):
  - Add member: `std::unique_ptr<telemetry::Telemetry> telemetry_;`
  - In the member initializer list, construct telemetry with an empty `serviceInstanceId` (node identity is not yet known):

    ```cpp
    , telemetry_(
          telemetry::make_Telemetry(
              telemetry::setup_Telemetry(
                  config_->section("telemetry"),
                  "",  // Updated later via setServiceInstanceId()
                  BuildInfo::getVersionString()),
              logs_->journal("Telemetry")))
    ```

  - In `setup()`, after `nodeIdentity_` is resolved, inject the node public key as the service instance ID:

    ```cpp
    if (!config_->section("telemetry").exists("service_instance_id"))
        telemetry_->setServiceInstanceId(
            toBase58(TokenType::NodePublic, nodeIdentity_->first));
    ```

  - In `start()`: call `telemetry_->start()`
  - In `run()` (shutdown path): call `telemetry_->stop()` (to flush pending spans)
  - Implement `getTelemetry()` override: return `*telemetry_`
- Add `[telemetry]` section to the example config `cfg/xrpld-example.cfg`:

  ```ini
  # [telemetry]
  # enabled=1
  # endpoint=http://localhost:4318/v1/traces
  # sampling_ratio=1.0
  # trace_rpc=1
  ```

> **Access patterns**: Components holding `ServiceRegistry&` (e.g.
> `NetworkOPsImp`) call `registry_.get().getTelemetry()`. Components
> holding `Application&` (e.g. `ServerHandler`, `PeerImp`,
> `RCLConsensusAdaptor`) call `app_.getTelemetry()` directly. Both
> resolve to the same `Telemetry` instance.

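
The ownership and ordering described above (construct with an empty instance ID, fill it in during `setup()`, then `start()` and later `stop()`) can be illustrated with toy classes. This is a sketch under simplifying assumptions. The classes in the hypothetical `sketch` namespace only mirror the names used in this task, `Telemetry` is concrete instead of abstract, and the rule that the ID must be set before `start()` is an inference from `start()` building the resource attributes in Task 3.

```cpp
// Sketch only: toy versions of the lifecycle wiring, not xrpld code.
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

class Telemetry
{
    std::string instanceId_;
    bool running_ = false;

public:
    explicit Telemetry(std::string instanceId)
        : instanceId_(std::move(instanceId))
    {
    }

    void setServiceInstanceId(std::string id)
    {
        // Assumed constraint: resource attributes are fixed once start() runs.
        assert(!running_);
        instanceId_ = std::move(id);
    }

    void start() { running_ = true; }
    void stop() { running_ = false; }  // real code would flush pending spans
    bool running() const { return running_; }
    std::string const& serviceInstanceId() const { return instanceId_; }
};

class ServiceRegistry
{
public:
    virtual ~ServiceRegistry() = default;
    virtual Telemetry& getTelemetry() = 0;
};

class Application : public ServiceRegistry
{
};

class ApplicationImp final : public Application
{
    std::unique_ptr<Telemetry> telemetry_;

public:
    // Node identity is unknown at construction, so the ID starts empty.
    ApplicationImp() : telemetry_(std::make_unique<Telemetry>("")) {}

    void setup(std::string const& nodePublicKey, bool configHasInstanceId)
    {
        if (!configHasInstanceId)
            telemetry_->setServiceInstanceId(nodePublicKey);
    }

    void start() { telemetry_->start(); }
    void shutdown() { telemetry_->stop(); }

    Telemetry& getTelemetry() override { return *telemetry_; }
};

}  // namespace sketch
```

A component holding only `ServiceRegistry&` and one holding the concrete application object reach the same `Telemetry` instance, which is the property the access-pattern note relies on.
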
**Key modified files**:

- `include/xrpl/core/ServiceRegistry.h`
- `src/xrpld/app/main/Application.cpp`
- `cfg/xrpld-example.cfg` (example config)

**Reference**:

- [05-configuration-reference.md §5.3](./05-configuration-reference.md) — `ApplicationImp` changes: member declaration, constructor init, `start()`/`stop()` wiring, `getTelemetry()` override
- [05-configuration-reference.md §5.1](./05-configuration-reference.md) — `[telemetry]` config section format and all option defaults
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact assessment: `Application.cpp` ~15 lines added, ~3 changed (Low risk)

---

## Task 5: Add SpanGuard Factory Methods

**Objective**: Add static factory methods to SpanGuard that provide type-safe, one-liner instrumentation and compile to zero-cost no-ops when telemetry is disabled. This replaces the earlier macro-based approach (`TracingInstrumentation.h` has been removed).

**What to do**:

- Update `include/xrpl/telemetry/SpanGuard.h`:
  - Add static factory methods that access the global `Telemetry::getInstance()` singleton and check the relevant component filter before creating a span:

    ```cpp
    // Each factory checks the global Telemetry instance internally.
    // No Telemetry& reference needed at the call site.
    auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
    span.setAttribute("xrpl.rpc.command", command);
    span.setAttribute("xrpl.rpc.status", status);
    ```

  - Factory methods: `rpcSpan()`, `txSpan()`, `consensusSpan()`, `peerSpan()`, `ledgerSpan()`, `span()`
  - Use the pimpl idiom to hide all OTel types from the public header (zero `opentelemetry/` includes)
  - When `XRPL_ENABLE_TELEMETRY` is NOT defined, the entire class compiles to a no-op stub with empty inline method bodies
- No separate `TracingInstrumentation.h` file is needed. All instrumentation call sites use `#include <xrpl/telemetry/SpanGuard.h>` directly.

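
To make the factory-plus-filter idea concrete, here is a minimal sketch of a guard that becomes a no-op when the RPC component filter is off. It is an illustration under stated assumptions, not the API from 04-code-samples.md. The `Telemetry` struct is a hypothetical stand-in for the global singleton, `Impl` holds only a name instead of OTel objects, and the `finished` vector exists purely so the behavior can be observed.

```cpp
// Sketch only: a toy SpanGuard with a filtered static factory.
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the global Telemetry::getInstance() singleton.
struct Telemetry
{
    bool enabled = false;
    bool traceRpc = false;
    std::vector<std::string> finished;  // names of ended spans, for inspection

    static Telemetry&
    getInstance()
    {
        static Telemetry instance;
        return instance;
    }
};

class SpanGuard
{
    // In a real header this would be forward-declared only (pimpl), keeping
    // OTel types out of the public interface.
    struct Impl
    {
        std::string name;
    };

    std::unique_ptr<Impl> impl_;  // null means "no-op guard"

    explicit SpanGuard(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}

public:
    SpanGuard(SpanGuard&&) noexcept = default;  // moved-from guard is a no-op
    SpanGuard(SpanGuard const&) = delete;
    SpanGuard& operator=(SpanGuard const&) = delete;
    SpanGuard& operator=(SpanGuard&&) = delete;

    // Ends the span exactly once, and only if this guard still owns it.
    ~SpanGuard()
    {
        if (impl_)
            Telemetry::getInstance().finished.push_back(impl_->name);
    }

    // Checks the global switch and the RPC filter before creating anything.
    static SpanGuard
    rpcSpan(std::string name)
    {
        assert(!name.empty());
        auto& telemetry = Telemetry::getInstance();
        if (!telemetry.enabled || !telemetry.traceRpc)
            return SpanGuard(std::unique_ptr<Impl>{});
        return SpanGuard(std::make_unique<Impl>(Impl{std::move(name)}));
    }

    bool active() const { return impl_ != nullptr; }

    // Drops the span so nothing is recorded when the guard is destroyed.
    void discard() { impl_.reset(); }
};

}  // namespace sketch
```

Because the null state is handled inside the guard, call sites stay identical whether tracing is on, filtered out, or discarded.
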
**Key modified file**:

- `include/xrpl/telemetry/SpanGuard.h`

**Reference**:

- [04-code-samples.md §4.3](./04-code-samples.md) — SpanGuard API reference: factory methods, usage patterns, compile-time disabled behavior, and discard support
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: factory methods handle compile-time and runtime checks internally
- [03-implementation-strategy.md §3.9.7](./03-implementation-strategy.md) — Before/after code examples showing minimal intrusiveness (~1-3 lines per instrumentation point)

---

## Task 6: Instrument RPC ServerHandler

> **WS** = WebSocket

**Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.

**What to do**:

- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
  - `#include <xrpl/telemetry/SpanGuard.h>`
  - In `ServerHandler::onRequest(Session& session)`:
    - At the top of the method, add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.request");`
    - After the RPC command name is extracted, set attribute: `span.setAttribute("xrpl.rpc.command", command);`
    - After the response status is known, set: `span.setAttribute("http.status_code", static_cast(statusCode));`
    - Wrap error paths with: `span.recordException(e);`
  - In `ServerHandler::processRequest(...)`:
    - Add a child span: `auto span = telemetry::SpanGuard::rpcSpan("rpc.process");`
    - Set method attribute: `span.setAttribute("xrpl.rpc.method", request_method);`
  - In `ServerHandler::onWSMessage(...)` (WebSocket path):
    - Add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.ws.message");`
- The goal is to see spans like:

  ```
  rpc.request
    └── rpc.process
  ```

  in Tempo/Grafana for every HTTP RPC call.

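
The parent-child relationship in that tree comes from each new span picking up whatever span is currently active. The sketch below imitates that mechanism with a thread-local pointer. It is a toy model, assuming the real `SpanGuard` relies on the OTel active-context mechanism in a comparable way. The `Span` class, `finishedSpans()`, and the two free functions are hypothetical names that only echo the methods listed above.

```cpp
// Sketch only: implicit parent-child nesting via a thread-local active span.
#include <cassert>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

struct Record
{
    std::string name;
    std::string parent;  // empty for a root span
};

// Finished spans in the order they ended (children end before parents).
inline std::vector<Record>&
finishedSpans()
{
    static std::vector<Record> records;
    return records;
}

class Span;

// The span currently active on this thread, or nullptr.
inline Span*&
currentSpan()
{
    thread_local Span* current = nullptr;
    return current;
}

class Span
{
    Span* parent_;
    std::string name_;

public:
    explicit Span(std::string name)
        : parent_(currentSpan()), name_(std::move(name))
    {
        currentSpan() = this;  // become the active span
    }

    ~Span()
    {
        assert(currentSpan() == this);  // spans end in reverse creation order
        finishedSpans().push_back(
            Record{name_, parent_ ? parent_->name_ : std::string{}});
        currentSpan() = parent_;  // restore the previous active span
    }

    Span(Span const&) = delete;
    Span& operator=(Span const&) = delete;
};

inline void
processRequest()
{
    Span span("rpc.process");  // picks up rpc.request as its parent
}

inline void
onRequest()
{
    Span span("rpc.request");  // root span for the HTTP request
    processRequest();
}

}  // namespace sketch
```

This only works while both functions run on the same thread. If `processRequest` is dispatched through the job queue onto another thread, the parent would likely need to be passed explicitly, which is what the `startSpan` overload taking a `parentContext` in Task 2 appears to be for.
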
**Key modified file**:

- `src/xrpld/rpc/detail/ServerHandler.cpp` (~15-25 lines added)

**Reference**:

- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample using SpanGuard factory methods
- [01-architecture-analysis.md §1.5](./01-architecture-analysis.md) — RPC request flow diagram: HTTP request -> attributes -> jobqueue.enqueue -> rpc.command -> response
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.request` in `ServerHandler.cpp::onRequest()` (Priority: High)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming convention: `rpc.request`, `rpc.command.*`
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC span attributes: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.params`
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact: `ServerHandler.cpp` ~40 lines added, ~10 changed (Low risk)

---

## Task 7: Instrument RPC Command Execution

**Objective**: Add per-command tracing inside the RPC handler so each command (e.g., `submit`, `account_info`, `server_info`) gets its own child span.

**What to do**:

- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
  - `#include <xrpl/telemetry/SpanGuard.h>`
  - In `doCommand(RPC::JsonContext& context, Json::Value& result)`:
    - At the top: `auto span = telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);`
    - Set attributes:
      - `span.setAttribute("xrpl.rpc.command", context.method);`
      - `span.setAttribute("xrpl.rpc.version", static_cast(context.apiVersion));`
      - `span.setAttribute("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
    - On success: `span.setAttribute("xrpl.rpc.status", "success");`
    - On error: `span.setAttribute("xrpl.rpc.status", "error");` and set the error message
- After this, traces in Tempo/Grafana should look like:

  ```
  rpc.request (xrpl.rpc.command=account_info)
    └── rpc.process
          └── rpc.command.account_info (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
  ```

**Key modified file**:

- `src/xrpld/rpc/detail/RPCHandler.cpp` (~15-20 lines added)

**Reference**:

- [04-code-samples.md §4.5.3](./04-code-samples.md) — `ServerHandler::onRequest()` code sample (includes child span pattern for `rpc.command.*`)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming: `rpc.command.*` pattern with dynamic command name (e.g., `rpc.command.server_info`)
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC attribute schema: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.command.*` in `RPCHandler.cpp::doCommand()` (Priority: High)
- [02-design-decisions.md §2.6.5](./02-design-decisions.md) — Correlation with PerfLog: how `doCommand()` can link trace_id with existing PerfLog entries
- [03-implementation-strategy.md §3.4.4](./03-implementation-strategy.md) — RPC request overhead budget: ~1.75 μs total per request

---

## Task 8: Build, Run, and Verify End-to-End

> **OTLP** = OpenTelemetry Protocol

**Objective**: Prove the full pipeline works: xrpld emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.

**What to do**:

1. **Start the Docker stack**:

   ```bash
   docker compose -f docker/telemetry/docker-compose.yml up -d
   ```

   Verify Collector health: `curl http://localhost:13133`

2. **Build xrpld with telemetry**:

   ```bash
   # Adjust for your actual build workflow
   conan install . --build=missing -o with_telemetry=True
   cmake --preset default -DXRPL_ENABLE_TELEMETRY=ON
   cmake --build --preset default
   ```

3. **Configure xrpld**: Add to `xrpld.cfg` (or your local test config). The endpoint targets the Collector's OTLP HTTP port, matching the `OtlpHttpExporter` from Task 3:

   ```ini
   [telemetry]
   enabled=1
   endpoint=http://localhost:4318/v1/traces
   sampling_ratio=1.0
   trace_rpc=1
   ```

4. **Start xrpld** in standalone mode:

   ```bash
   ./xrpld --conf xrpld.cfg -a --start
   ```

5. **Generate RPC traffic**:

   ```bash
   # server_info
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"server_info","params":[{}]}'

   # ledger
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'

   # account_info (will error in standalone, that's fine — we trace errors too)
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
   ```

6. **Verify in Grafana (Tempo)**:
   - Open `http://localhost:3000`
   - Navigate to Explore → select Tempo datasource
   - Search for service `xrpld`
   - Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
   - Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`

7. **Verify zero-overhead when disabled**:
   - Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`, or set `enabled=0` in config
   - Run the same RPC calls
   - Confirm no new traces appear and no errors in xrpld logs

**Verification Checklist**:

- [ ] Docker stack starts without errors
- [ ] xrpld builds with `-DXRPL_ENABLE_TELEMETRY=ON`
- [ ] xrpld starts and connects to OTel Collector (check xrpld logs for telemetry messages)
- [ ] Traces appear in Grafana/Tempo under service "xrpld"
- [ ] Span hierarchy is correct (parent-child relationships)
- [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
- [ ] Error spans show error status and message
- [ ] Building with `XRPL_ENABLE_TELEMETRY=OFF` produces no regressions
- [ ] Setting `enabled=0` at runtime produces no traces and no errors

**Reference**:

- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Tempo, config validation passes
- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md#6112-phase-2-rpc-tracing) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
- [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2%
- [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary
- [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect

---

## Task 9: Document POC Results and Next Steps

> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket

**Objective**: Capture findings, screenshots, and remaining work for the team.

**What to do**:

- Take screenshots of Grafana/Tempo showing:
  - The service list with "xrpld"
  - A trace with the full span tree
  - Span detail view showing attributes
- Document any issues encountered (build issues, SDK quirks, missing attributes)
- Note performance observations (build time impact, any noticeable runtime overhead)
- Write a short summary of what the POC proves and what it doesn't cover yet:
  - **Proves**: OTel SDK integrates with xrpld, OTLP export works, RPC traces visible
  - **Doesn't cover**: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
- Outline next steps (mapping to the full plan phases):
  - [Phase 2](./06-implementation-phases.md) completion: [W3C header extraction](./02-design-decisions.md) (§2.5), WebSocket tracing, all [RPC handlers](./01-architecture-analysis.md) (§1.6)
  - [Phase 3](./06-implementation-phases.md): [Protobuf `TraceContext` message](./04-code-samples.md) (§4.4), [transaction relay tracing](./04-code-samples.md) (§4.5.1) across nodes
  - [Phase 4](./06-implementation-phases.md): [Consensus round and phase tracing](./04-code-samples.md) (§4.5.2)
  - [Phase 5](./06-implementation-phases.md): [Production collector config](./05-configuration-reference.md) (§5.5.2), [Grafana dashboards](./07-observability-backends.md) (§7.6), [alerting](./07-observability-backends.md) (§7.6.3)

**Reference**:

- [06-implementation-phases.md §6.1](./06-implementation-phases.md) — Full 5-phase timeline overview and Gantt chart
- [06-implementation-phases.md §6.10](./06-implementation-phases.md) — Crawl-Walk-Run strategy: POC is the CRAWL phase, next steps are WALK and RUN
- [06-implementation-phases.md §6.12](./06-implementation-phases.md) — Recommended implementation order (14 steps across 9 weeks)
- [03-implementation-strategy.md §3.9](./03-implementation-strategy.md) — Code intrusiveness assessment and risk matrix for each remaining component
- [07-observability-backends.md §7.2](./07-observability-backends.md) — Production backend selection (Tempo, Elastic APM, Honeycomb, Datadog)
- [02-design-decisions.md §2.5](./02-design-decisions.md) — Context propagation design: W3C HTTP headers, protobuf P2P, JobQueue internal
- [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) — Reference for team onboarding on distributed tracing concepts

---

## Summary

| Task | Description | New Files | Modified Files | Depends On |
| ---- | ------------------------------------ | --------- | -------------- | ---------- |
| 0 | Docker observability stack | 4 | 0 | — |
| 1 | OTel C++ SDK dependency | 0 | 2 | — |
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
| 5 | SpanGuard factory methods | 0 | 1 | 2 |
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
| 8 | End-to-end verification | 0 | 0 | 0-7 |
| 9 | Document results and next steps | 1 | 0 | 8 |

**Parallel work**: Tasks 0 and 1 can run in parallel. Task 5 depends only on Task 2, so it can proceed alongside Tasks 3 and 4. Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.

---

## Next Steps (Post-POC)

> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket

### Metrics Pipeline for Grafana Dashboards

The current POC exports **traces only**. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this:

1. **Add a `spanmetrics` connector** to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:

   ```yaml
   connectors:
     spanmetrics:
       histogram:
         explicit:
           buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
       dimensions:
         - name: xrpl.rpc.command
         - name: xrpl.rpc.status

   exporters:
     prometheus:
       endpoint: 0.0.0.0:8889

   service:
     pipelines:
       traces:
         receivers: [otlp]
         processors: [batch]
         exporters: [debug, otlp/tempo, spanmetrics]
       metrics:
         receivers: [spanmetrics]
         exporters: [prometheus]
   ```

2. **Add Prometheus** to the Docker Compose stack to scrape the collector's metrics endpoint.

3. **Add Prometheus as a Grafana datasource** and build dashboards for:
   - RPC request latency (p50/p95/p99) by command
   - RPC throughput (requests/sec) by command
   - Error rate by command
   - Span duration distribution

### Additional Instrumentation

- **W3C `traceparent` header extraction** in `ServerHandler` to support cross-service context propagation from external callers
- **WebSocket RPC tracing** in `ServerHandler::onWSMessage()`
- **Transaction relay tracing** across nodes using protobuf `TraceContext` messages
- **Consensus round and phase tracing** for validator coordination visibility
- **Ledger close tracing** to measure close-to-validated latency

### Production Hardening

- **Tail-based sampling** in the OTel Collector to reduce volume while retaining error/slow traces
- **TLS configuration** for the OTLP exporter in production deployments
- **Resource limits** on the batch processor queue to prevent unbounded memory growth
- **Health monitoring** for the telemetry pipeline itself (collector lag, export failures)

### POC Lessons Learned

Issues encountered during POC implementation that inform future work:

| Issue | Resolution | Impact on Future Work |
| --- | --- | --- |
| Conan lockfile rejected `opentelemetry-cpp/1.18.0` | Used `--lockfile=""` to bypass | Lockfile must be regenerated when adding new dependencies |
| Conan package only builds OTLP HTTP exporter, not gRPC | Switched from gRPC to HTTP exporter (`localhost:4318/v1/traces`) | HTTP exporter is the default; gRPC requires custom Conan profile |
| CMake target `opentelemetry-cpp::api` etc. don't exist in Conan package | Use umbrella target `opentelemetry-cpp::opentelemetry-cpp` | Conan targets differ from upstream CMake targets |
| OTel Collector `logging` exporter deprecated | Renamed to `debug` exporter | Use `debug` in all collector configs going forward |
| Macro parameter `telemetry` collided with `::xrpl::telemetry::` namespace | Replaced macros with SpanGuard factory methods (no macros needed) | Factory methods avoid macro hygiene issues entirely |
| `opentelemetry::trace::Scope` creates new context on move | Store scope as member, create once in constructor | SpanGuard move semantics need care with Scope lifecycle |
| `TracerProviderFactory::Create` returns `unique_ptr`, not `nostd::shared_ptr` | Use `std::shared_ptr` member, wrap in `nostd::shared_ptr` for global provider | OTel SDK factory return types don't match API provider types |