
# OpenTelemetry POC Task List

> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. A successful POC will show RPC request traces flowing from xrpld through an OTel Collector into Tempo, viewable in Grafana.
>
> **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.

## Related Plan Documents

| Document                                                         | Relevance to POC                                                                                                                                          |
| ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md)       | Core concepts: traces, spans, context propagation, sampling                                                                                               |
| [01-architecture-analysis.md](./01-architecture-analysis.md)     | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7)                                                                         |
| [02-design-decisions.md](./02-design-decisions.md)               | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6)                        |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9)           |
| [04-code-samples.md](./04-code-samples.md)                       | Telemetry interface (§4.1), SpanGuard factory methods (§4.2-4.3), RPC instrumentation (§4.5.3)                                                            |
| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8)  |
| [06-implementation-phases.md](./06-implementation-phases.md)     | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11)                                                       |
| [07-observability-backends.md](./07-observability-backends.md)   | Tempo dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3)                                                                                   |

---

## Task 0: Docker Observability Stack Setup

> **OTLP** = OpenTelemetry Protocol

**Objective**: Stand up the backend infrastructure to receive, store, and display traces.

**What to do**:

- Create `docker/telemetry/docker-compose.yml` in the repo with three services:
  1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:0.92.0`)
     - Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP)
     - Expose port `13133` (health check)
     - Mount a config file `docker/telemetry/otel-collector-config.yaml`
  2. **Tempo** (`grafana/tempo:2.6.1`)
     - Expose port `3200` (HTTP API) to the host; keep `4317` (OTLP gRPC) internal to the Compose network, since the Collector already publishes `4317` on the host
  3. **Grafana** (`grafana/grafana:latest`) — optional but useful
     - Expose port `3000`
     - Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`)
     - Provision Tempo as a data source via `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`

- Create `docker/telemetry/otel-collector-config.yaml`:

  ```yaml
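  extensions:
    # Serves the health endpoint on port 13133 used in the verification step
    health_check:
      endpoint: 0.0.0.0:13133

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 1s
      send_batch_size: 100

  exporters:
    # `debug` replaces the deprecated `logging` exporter
    debug:
      verbosity: detailed
    otlp/tempo:
      endpoint: tempo:4317
      tls:
        insecure: true

  service:
    extensions: [health_check]
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [debug, otlp/tempo]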
  ```

- Create Grafana Tempo datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/tempo.yaml`:
  ```yaml
  apiVersion: 1
  datasources:
    - name: Tempo
      type: tempo
      access: proxy
      url: http://tempo:3200
  ```

**Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then:

- `curl http://localhost:13133` returns healthy (Collector)
- `http://localhost:3000` opens Grafana (Tempo datasource available, no traces yet)

**Reference**:

- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Tempo exporter)
- [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment
- [07-observability-backends.md §7.1](./07-observability-backends.md) — Tempo quick start and backend selection
- [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards

---

## Task 1: Add OpenTelemetry C++ SDK Dependency

**Objective**: Make `opentelemetry-cpp` available to the build system.

**What to do**:

- Edit `conanfile.py` to add `opentelemetry-cpp` as an **optional** dependency. The gRPC otel plugin flag (`"grpc/*:otel_plugin": False`) in the existing conanfile may need to remain false — we pull the OTel SDK separately.
  - Add a Conan option: `with_telemetry = [True, False]` defaulting to `False`
  - When `with_telemetry` is `True`, add `opentelemetry-cpp` to `self.requires()`
  - Required OTel Conan components: `opentelemetry-cpp` (which bundles api, sdk, and exporters). If the package isn't in Conan Center, consider using `FetchContent` in CMake or building from source as a fallback.
- Edit `CMakeLists.txt`:
  - Add option: `option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)`
  - When ON, `find_package(opentelemetry-cpp CONFIG REQUIRED)` and add compile definition `XRPL_ENABLE_TELEMETRY`
  - When OFF, do nothing (zero build impact)
- Verify the build succeeds with `-DXRPL_ENABLE_TELEMETRY=OFF` (no regressions) and with `-DXRPL_ENABLE_TELEMETRY=ON` (SDK links successfully).

**Key files**:

- `conanfile.py`
- `CMakeLists.txt`

**Reference**:

- [05-configuration-reference.md §5.4](./05-configuration-reference.md) — CMake integration, `FindOpenTelemetry.cmake`, `XRPL_ENABLE_TELEMETRY` option
- [03-implementation-strategy.md §3.2](./03-implementation-strategy.md) — Key principle: zero-cost when disabled via compile-time flags
- [02-design-decisions.md §2.1](./02-design-decisions.md) — SDK selection rationale and required OTel components

---

## Task 2: Create Core Telemetry Interface and NullTelemetry

**Objective**: Define the `Telemetry` abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.

**What to do**:

- Create `include/xrpl/telemetry/Telemetry.h`:
  - Define `namespace xrpl::telemetry`
  - Define `struct Telemetry::Setup` holding: `enabled`, `exporterEndpoint`, `samplingRatio`, `serviceName`, `serviceVersion`, `serviceInstanceId`, `traceRpc`, `traceTransactions`, `traceConsensus`, `tracePeer`
  - Define abstract `class Telemetry` with:
    - `virtual void start() = 0;`
    - `virtual void stop() = 0;`
    - `virtual bool isEnabled() const = 0;`
    - `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "xrpld") = 0;`
    - `virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;`
    - `virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;`
    - `virtual bool shouldTraceRpc() const = 0;`
    - `virtual bool shouldTraceTransactions() const = 0;`
    - `virtual bool shouldTraceConsensus() const = 0;`
  - Factory: `std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);`
  - Config parser: `Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);`

- Create `include/xrpl/telemetry/SpanGuard.h`:
  - RAII guard with static factory methods (`rpcSpan()`, `txSpan()`, `consensusSpan()`, etc.) that access the global `Telemetry::getInstance()` singleton internally.
  - Uses the pimpl idiom to hide all OTel types, so the public header has zero `opentelemetry/` includes.
  - Convenience instance methods: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`, `discard()`
  - When `XRPL_ENABLE_TELEMETRY` is not defined, the entire class compiles to a no-op stub.
  - See [04-code-samples.md](./04-code-samples.md) §4.2-4.3 for the full API reference.

- Create `src/libxrpl/telemetry/NullTelemetry.cpp`:
  - Implements `Telemetry` with all no-ops.
  - `isEnabled()` returns `false`, `startSpan()` returns a noop span.
  - This is used when `XRPL_ENABLE_TELEMETRY` is OFF or `enabled=0` in config.

- Guard all OTel SDK headers behind `#ifdef XRPL_ENABLE_TELEMETRY`. The `NullTelemetry` implementation should compile without the OTel SDK present.
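
The interface/no-op split described above can be sketched as follows. This is a simplified illustration, not the planned header: it omits the span-returning methods (which need OTel API types) and the `beast::Journal` parameter, keeps only a subset of the `Setup` fields with illustrative defaults, and `makeNullTelemetry()` is a hypothetical helper introduced just for this example.

```cpp
#include <memory>
#include <string>

namespace xrpl::telemetry {

// Abstract interface: call sites depend only on this, never on the OTel SDK.
class Telemetry
{
public:
    // Subset of the Setup fields listed above; defaults are illustrative.
    struct Setup
    {
        bool enabled = false;
        std::string exporterEndpoint;
        double samplingRatio = 1.0;
        bool traceRpc = false;
    };

    virtual ~Telemetry() = default;
    virtual void start() = 0;
    virtual void stop() = 0;
    virtual bool isEnabled() const = 0;
    virtual bool shouldTraceRpc() const = 0;
};

// No-op implementation: compiles without any OTel headers.
class NullTelemetry final : public Telemetry
{
public:
    void start() override {}
    void stop() override {}
    bool isEnabled() const override { return false; }
    bool shouldTraceRpc() const override { return false; }
};

// Hypothetical helper for this sketch (the plan's factory is make_Telemetry()).
inline std::unique_ptr<Telemetry> makeNullTelemetry()
{
    return std::make_unique<NullTelemetry>();
}

}  // namespace xrpl::telemetry
```

Because every method on `NullTelemetry` is an empty body or returns a constant, callers can hold a `Telemetry&` unconditionally and never need to null-check it.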

**Key new files**:

- `include/xrpl/telemetry/Telemetry.h`
- `include/xrpl/telemetry/SpanGuard.h`
- `src/libxrpl/telemetry/NullTelemetry.cpp`

**Reference**:

- [04-code-samples.md §4.1](./04-code-samples.md) — Full `Telemetry` interface with `Setup` struct, lifecycle, tracer access, span creation, and component filtering methods
- [04-code-samples.md §4.2-4.3](./04-code-samples.md) — SpanGuard with factory methods, pimpl design, no-op stub, and discard support
- [03-implementation-strategy.md §3.1](./03-implementation-strategy.md) — Directory structure: `include/xrpl/telemetry/` for headers, `src/libxrpl/telemetry/` for implementation
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation and zero-cost compile-time disabled pattern

---

## Task 3: Implement OTel-Backed Telemetry

> **OTLP** = OpenTelemetry Protocol

**Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.

**What to do**:

- Create `src/libxrpl/telemetry/Telemetry.cpp` (compiled only when `XRPL_ENABLE_TELEMETRY=ON`):
  - `class TelemetryImpl : public Telemetry` that:
    - In `start()`: creates a `TracerProvider` with:
      - Resource attributes: `service.name`, `service.version`, `service.instance.id`
      - An `OtlpHttpExporter` pointed at `setup.exporterEndpoint` (default `localhost:4318`)
      - A `BatchSpanProcessor` with configurable batch size and delay
      - A `TraceIdRatioBasedSampler` using `setup.samplingRatio`
    - Sets the global `TracerProvider`
    - In `stop()`: calls `ForceFlush()` then shuts down the provider
    - In `startSpan()`: delegates to `getTracer()->StartSpan(name, ...)`
    - `shouldTraceRpc()` etc. read from `Setup` fields

- Create `src/libxrpl/telemetry/TelemetryConfig.cpp`:
  - `setup_Telemetry()` parses the `[telemetry]` config section from `xrpld.cfg`
  - Maps config keys: `enabled`, `exporter`, `endpoint`, `sampling_ratio`, `trace_rpc`, `trace_transactions`, `trace_consensus`, `trace_peer`

- Wire `make_Telemetry()` factory:
  - If `setup.enabled` is true AND `XRPL_ENABLE_TELEMETRY` is defined: return `TelemetryImpl`
  - Otherwise: return `NullTelemetry`

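- Add telemetry source files to CMake. When `XRPL_ENABLE_TELEMETRY=ON`, compile `Telemetry.cpp` and `TelemetryConfig.cpp` and link against the OTel SDK and its OTLP HTTP exporter, matching the `OtlpHttpExporter` used above. With the Conan package this is the umbrella target `opentelemetry-cpp::opentelemetry-cpp`; per-component targets such as `opentelemetry-cpp::api` and `opentelemetry-cpp::sdk` are not provided by that package. When OFF, compile only `NullTelemetry.cpp`.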
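
As a rough sketch of what `setup_Telemetry()` does, the snippet below maps a few of the config keys onto a setup struct. It is not the planned implementation: a `std::map` stands in for the real `Section` type, `TelemetrySetup` and `parseTelemetrySection()` are hypothetical names, the default endpoint is assumed from the example config, and clamping `sampling_ratio` to `[0, 1]` is a design choice of this example rather than something the plan specifies.

```cpp
#include <algorithm>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Telemetry::Setup (subset of fields).
struct TelemetrySetup
{
    bool enabled = false;
    std::string exporterEndpoint = "http://localhost:4318/v1/traces";  // assumed default
    double samplingRatio = 1.0;
    bool traceRpc = true;  // assumed default
};

// Stand-in for the parsed [telemetry] section: key -> raw string value.
using KeyValues = std::map<std::string, std::string>;

// Reads a 0/1 style flag, falling back to the given default when absent.
inline bool parseFlag(KeyValues const& kv, std::string const& key, bool fallback)
{
    auto it = kv.find(key);
    if (it == kv.end())
        return fallback;
    return it->second == "1" || it->second == "true";
}

inline TelemetrySetup parseTelemetrySection(KeyValues const& kv)
{
    TelemetrySetup s;
    s.enabled = parseFlag(kv, "enabled", s.enabled);
    s.traceRpc = parseFlag(kv, "trace_rpc", s.traceRpc);

    if (auto it = kv.find("endpoint"); it != kv.end())
        s.exporterEndpoint = it->second;

    if (auto it = kv.find("sampling_ratio"); it != kv.end())
    {
        try
        {
            // Out-of-range ratios are clamped instead of rejected.
            s.samplingRatio = std::clamp(std::stod(it->second), 0.0, 1.0);
        }
        catch (std::exception const&)
        {
            // Unparseable value: keep the default.
        }
    }
    return s;
}
```

A missing `[telemetry]` section then yields a setup with `enabled == false`, which is the case where `make_Telemetry()` returns `NullTelemetry`.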

**Key new files**:

- `src/libxrpl/telemetry/Telemetry.cpp`
- `src/libxrpl/telemetry/TelemetryConfig.cpp`

**Key modified files**:

- `CMakeLists.txt` (add telemetry library target)

**Reference**:

- [04-code-samples.md §4.1](./04-code-samples.md) — `Telemetry` interface that `TelemetryImpl` must implement
- [05-configuration-reference.md §5.2](./05-configuration-reference.md) — `setup_Telemetry()` config parser implementation
- [02-design-decisions.md §2.2](./02-design-decisions.md) — OTLP/gRPC exporter config (endpoint, TLS options)
- [02-design-decisions.md §2.4.1](./02-design-decisions.md) — Resource attributes: `service.name`, `service.version`, `service.instance.id`, `xrpl.network.id`
- [03-implementation-strategy.md §3.4](./03-implementation-strategy.md) — Per-operation CPU costs and overhead budget for span creation
- [03-implementation-strategy.md §3.5](./03-implementation-strategy.md) — Memory overhead: static (~456 KB) and dynamic (~1.2 MB) budgets

---

## Task 4: Integrate Telemetry into Application Lifecycle

**Objective**: Wire the `Telemetry` object into the `ServiceRegistry` / `Application` so all components can access it.

**What to do**:

- Edit `include/xrpl/core/ServiceRegistry.h`:
  - Forward-declare `namespace telemetry { class Telemetry; }` inside `namespace xrpl`
  - Add pure virtual method: `virtual telemetry::Telemetry& getTelemetry() = 0;`
  - (`Application` extends `ServiceRegistry`, so this is automatically available on `Application` too)

- Edit `src/xrpld/app/main/Application.cpp` (the `ApplicationImp` class):
  - Add member: `std::unique_ptr<telemetry::Telemetry> telemetry_;`
  - In the member initializer list, construct telemetry with an empty
    `serviceInstanceId` (node identity is not yet known):
    ```cpp
    , telemetry_(
          telemetry::make_Telemetry(
              telemetry::setup_Telemetry(
                  config_->section("telemetry"),
                  "",  // Updated later via setServiceInstanceId()
                  BuildInfo::getVersionString()),
              logs_->journal("Telemetry")))
    ```
  - In `setup()`, after `nodeIdentity_` is resolved, inject the node
    public key as the service instance ID:
    ```cpp
    if (!config_->section("telemetry").exists("service_instance_id"))
        telemetry_->setServiceInstanceId(
            toBase58(TokenType::NodePublic, nodeIdentity_->first));
    ```
  - In `start()`: call `telemetry_->start()`
  - In `run()` (shutdown path): call `telemetry_->stop()` (to flush pending spans)
  - Implement `getTelemetry()` override: return `*telemetry_`

- Add `[telemetry]` section to the example config `cfg/xrpld-example.cfg`:
  ```ini
  # [telemetry]
  # enabled=1
  # endpoint=http://localhost:4318/v1/traces
  # sampling_ratio=1.0
  # trace_rpc=1
  ```

> **Access patterns**: Components holding `ServiceRegistry&` (e.g.
> `NetworkOPsImp`) call `registry_.get().getTelemetry()`. Components
> holding `Application&` (e.g. `ServerHandler`, `PeerImp`,
> `RCLConsensusAdaptor`) call `app_.getTelemetry()` directly. Both
> resolve to the same `Telemetry` instance.
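
The ordering constraint in this task (construct early, inject the node key once it is known, then start) can be illustrated with a small stand-in class. This is only a sketch: `TelemetryLifecycle` and `bootstrap()` are hypothetical, the node key string in the usage note is a placeholder, and the rule that the instance ID cannot change after `start()` is an assumption based on resource attributes being built when the provider is created.

```cpp
#include <string>
#include <utility>

namespace sketch {

// Stand-in for a Telemetry implementation; it tracks only lifecycle state.
class TelemetryLifecycle
{
public:
    explicit TelemetryLifecycle(std::string instanceId)
        : instanceId_(std::move(instanceId))
    {
    }

    // Assumed to be effective only before start(), since resource
    // attributes are fixed once the tracer provider exists.
    bool setServiceInstanceId(std::string id)
    {
        if (started_)
            return false;
        instanceId_ = std::move(id);
        return true;
    }

    void start() { started_ = true; }
    void stop() { started_ = false; }  // real code would flush pending spans

    bool started() const { return started_; }
    std::string const& serviceInstanceId() const { return instanceId_; }

private:
    std::string instanceId_;
    bool started_ = false;
};

// Mirrors the ApplicationImp ordering: constructor -> setup() -> start().
inline TelemetryLifecycle
bootstrap(std::string const& configuredId, std::string const& nodePublicKey)
{
    TelemetryLifecycle t(configuredId);  // node identity not yet known
    if (configuredId.empty())
        t.setServiceInstanceId(nodePublicKey);  // identity resolved in setup()
    t.start();
    return t;
}

}  // namespace sketch
```

Calling `sketch::bootstrap("", "n9ExampleNodeKey")` yields an instance ID equal to the node key, while an explicitly configured ID is left untouched.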

**Key modified files**:

- `include/xrpl/core/ServiceRegistry.h`
- `src/xrpld/app/main/Application.cpp`
- `cfg/xrpld-example.cfg` (example config)

**Reference**:

- [05-configuration-reference.md §5.3](./05-configuration-reference.md) — `ApplicationImp` changes: member declaration, constructor init, `start()`/`stop()` wiring, `getTelemetry()` override
- [05-configuration-reference.md §5.1](./05-configuration-reference.md) — `[telemetry]` config section format and all option defaults
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact assessment: `Application.cpp` ~15 lines added, ~3 changed (Low risk)

---

## Task 5: Add SpanGuard Factory Methods

**Objective**: Add static factory methods to SpanGuard that provide type-safe, one-liner instrumentation and compile to zero-cost no-ops when telemetry is disabled. This replaces the earlier macro-based approach (`TracingInstrumentation.h` has been removed).

**What to do**:

- Update `include/xrpl/telemetry/SpanGuard.h`:
  - Add static factory methods that access the global `Telemetry::getInstance()` singleton and check the relevant component filter before creating a span:

    ```cpp
    // Each factory checks the global Telemetry instance internally.
    // No Telemetry& reference needed at the call site.
    auto span = telemetry::SpanGuard::rpcSpan("rpc.request");
    span.setAttribute("xrpl.rpc.command", command);
    span.setAttribute("xrpl.rpc.status", status);
    ```

  - Factory methods: `rpcSpan()`, `txSpan()`, `consensusSpan()`, `peerSpan()`, `ledgerSpan()`, `span()`
  - Use the pimpl idiom to hide all OTel types from the public header (zero `opentelemetry/` includes)
  - When `XRPL_ENABLE_TELEMETRY` is NOT defined, the entire class compiles to a no-op stub with empty inline method bodies

- No separate `TracingInstrumentation.h` file is needed. All instrumentation call sites use `#include <xrpl/telemetry/SpanGuard.h>` directly.
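
The compile-time switch described above might look roughly like the following. This is a toy illustration under stated assumptions, not the planned `SpanGuard`: there is no pimpl and no OTel dependency, the "enabled" branch only records span names into a vector, and `endedSpans()` and `endedSpanCount()` are hypothetical helpers that exist purely so the behavior can be observed.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace sketch {

#ifdef XRPL_ENABLE_TELEMETRY

// Toy "exporter": names of spans that have ended.
inline std::vector<std::string>& endedSpans()
{
    static std::vector<std::string> spans;
    return spans;
}

class SpanGuard
{
public:
    static SpanGuard rpcSpan(std::string_view name) { return SpanGuard(name); }

    SpanGuard(SpanGuard&& other) noexcept
        : name_(std::move(other.name_)), active_(other.active_)
    {
        other.active_ = false;  // the moved-from guard must not end the span
    }
    SpanGuard(SpanGuard const&) = delete;
    SpanGuard& operator=(SpanGuard const&) = delete;
    SpanGuard& operator=(SpanGuard&&) = delete;

    // RAII: the span ends when the guard leaves scope.
    ~SpanGuard()
    {
        if (active_)
            endedSpans().push_back(name_);
    }

    void setAttribute(std::string_view, std::string_view) {}
    void discard() { active_ = false; }

private:
    explicit SpanGuard(std::string_view name) : name_(name), active_(true) {}

    std::string name_;
    bool active_;
};

inline std::size_t endedSpanCount() { return endedSpans().size(); }

#else

// Telemetry compiled out: same call-site API, empty inline bodies.
class SpanGuard
{
public:
    static SpanGuard rpcSpan(std::string_view) { return {}; }
    void setAttribute(std::string_view, std::string_view) {}
    void discard() {}
};

inline std::size_t endedSpanCount() { return 0; }

#endif

}  // namespace sketch
```

Call sites are identical in both builds, which is what lets instrumentation lines stay in the code without `#ifdef` blocks around each one.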

**Key modified file**:

- `include/xrpl/telemetry/SpanGuard.h`

**Reference**:

- [04-code-samples.md §4.3](./04-code-samples.md) — SpanGuard API reference: factory methods, usage patterns, compile-time disabled behavior, and discard support
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: factory methods handle compile-time and runtime checks internally
- [03-implementation-strategy.md §3.9.7](./03-implementation-strategy.md) — Before/after code examples showing minimal intrusiveness (~1-3 lines per instrumentation point)

---

## Task 6: Instrument RPC ServerHandler

> **WS** = WebSocket

**Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.

**What to do**:

- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
  - `#include <xrpl/telemetry/SpanGuard.h>`
  - In `ServerHandler::onRequest(Session& session)`:
    - At the top of the method, add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.request");`
    - After the RPC command name is extracted, set attribute: `span.setAttribute("xrpl.rpc.command", command);`
    - After the response status is known, set: `span.setAttribute("http.status_code", static_cast<int64_t>(statusCode));`
    - In error paths (for example, inside `catch` blocks), call: `span.recordException(e);`
  - In `ServerHandler::processRequest(...)`:
    - Add a child span: `auto span = telemetry::SpanGuard::rpcSpan("rpc.process");`
    - Set method attribute: `span.setAttribute("xrpl.rpc.method", request_method);`
  - In `ServerHandler::onWSMessage(...)` (WebSocket path):
    - Add: `auto span = telemetry::SpanGuard::rpcSpan("rpc.ws.message");`

- The goal is to see spans like:
  ```
  rpc.request
    └── rpc.process
  ```
  in Tempo/Grafana for every HTTP RPC call.

**Key modified file**:

- `src/xrpld/rpc/detail/ServerHandler.cpp` (~15-25 lines added)

**Reference**:

- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample using SpanGuard factory methods
- [01-architecture-analysis.md §1.5](./01-architecture-analysis.md) — RPC request flow diagram: HTTP request -> attributes -> jobqueue.enqueue -> rpc.command -> response
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.request` in `ServerHandler.cpp::onRequest()` (Priority: High)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming convention: `rpc.request`, `rpc.command.*`
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC span attributes: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.params`
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact: `ServerHandler.cpp` ~40 lines added, ~10 changed (Low risk)

---

## Task 7: Instrument RPC Command Execution

**Objective**: Add per-command tracing inside the RPC handler so each command (e.g., `submit`, `account_info`, `server_info`) gets its own child span.

**What to do**:

- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
  - `#include <xrpl/telemetry/SpanGuard.h>`
  - In `doCommand(RPC::JsonContext& context, Json::Value& result)`:
    - At the top: `auto span = telemetry::SpanGuard::rpcSpan("rpc.command." + context.method);`
    - Set attributes:
      - `span.setAttribute("xrpl.rpc.command", context.method);`
      - `span.setAttribute("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));`
      - `span.setAttribute("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
    - On success: `span.setAttribute("xrpl.rpc.status", "success");`
    - On error: `span.setAttribute("xrpl.rpc.status", "error");` and set the error message

- After this, traces in Tempo/Grafana should look like:
  ```
  rpc.request  (xrpl.rpc.command=account_info)
    └── rpc.process
          └── rpc.command.account_info  (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
  ```
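
The parent-child hierarchy above falls out of nested RAII scopes. The sketch below shows the idea with a thread-local "current span" index; it is not how the OTel SDK stores context, and `ScopedSpan`, `records()`, and the three stand-in functions are hypothetical. It also assumes all three functions run on the same thread. If a request hops threads (for example via the job queue), the parent context would have to be passed explicitly, which is what `SpanGuard::context()` is listed for.

```cpp
#include <string>
#include <utility>
#include <vector>

namespace sketch {

struct SpanRecord
{
    std::string name;
    int parent;  // index into records(), or -1 for a root span
};

inline std::vector<SpanRecord>& records()
{
    static std::vector<SpanRecord> r;
    return r;
}

// Index of the innermost open span on this thread, or -1 if none.
inline int& currentSpan()
{
    thread_local int current = -1;
    return current;
}

// On construction the span becomes "current" with the previous current span
// as its parent; on destruction the previous span is restored.
class ScopedSpan
{
public:
    explicit ScopedSpan(std::string name) : previous_(currentSpan())
    {
        records().push_back({std::move(name), previous_});
        currentSpan() = static_cast<int>(records().size()) - 1;
    }
    ~ScopedSpan() { currentSpan() = previous_; }

    ScopedSpan(ScopedSpan const&) = delete;
    ScopedSpan& operator=(ScopedSpan const&) = delete;

private:
    int previous_;
};

// Stand-ins for the three instrumented functions.
inline void doCommand(std::string const& method)
{
    ScopedSpan span("rpc.command." + method);
}

inline void processRequest(std::string const& method)
{
    ScopedSpan span("rpc.process");
    doCommand(method);
}

inline void onRequest(std::string const& method)
{
    ScopedSpan span("rpc.request");
    processRequest(method);
}

}  // namespace sketch
```

After `sketch::onRequest("account_info")`, `records()` holds three entries whose `parent` indices form the chain `rpc.request` -> `rpc.process` -> `rpc.command.account_info`.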

**Key modified file**:

- `src/xrpld/rpc/detail/RPCHandler.cpp` (~15-20 lines added)

**Reference**:

- [04-code-samples.md §4.5.3](./04-code-samples.md) — `ServerHandler::onRequest()` code sample (includes child span pattern for `rpc.command.*`)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming: `rpc.command.*` pattern with dynamic command name (e.g., `rpc.command.server_info`)
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC attribute schema: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.command.*` in `RPCHandler.cpp::doCommand()` (Priority: High)
- [02-design-decisions.md §2.6.5](./02-design-decisions.md) — Correlation with PerfLog: how `doCommand()` can link trace_id with existing PerfLog entries
- [03-implementation-strategy.md §3.4.4](./03-implementation-strategy.md) — RPC request overhead budget: ~1.75 μs total per request

---

## Task 8: Build, Run, and Verify End-to-End

> **OTLP** = OpenTelemetry Protocol

**Objective**: Prove the full pipeline works: xrpld emits traces -> OTel Collector receives them -> Tempo stores them for Grafana visualization.

**What to do**:

1. **Start the Docker stack**:

   ```bash
   docker compose -f docker/telemetry/docker-compose.yml up -d
   ```

   Verify Collector health: `curl http://localhost:13133`

2. **Build xrpld with telemetry**:

   ```bash
   # Adjust for your actual build workflow
   conan install . --build=missing -o with_telemetry=True
   cmake --preset default -DXRPL_ENABLE_TELEMETRY=ON
   cmake --build --preset default
   ```

3. **Configure xrpld**:
   Add to `xrpld.cfg` (or your local test config):

   ```ini
   [telemetry]
   enabled=1
   endpoint=http://localhost:4318/v1/traces
   sampling_ratio=1.0
   trace_rpc=1
   ```

4. **Start xrpld** in standalone mode:

   ```bash
   ./rippled --conf xrpld.cfg -a --start
   ```

5. **Generate RPC traffic**:

   ```bash
   # server_info
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"server_info","params":[{}]}'

   # ledger
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'

   # account_info (will error in standalone, that's fine — we trace errors too)
   curl -s -X POST http://localhost:5005 \
     -H "Content-Type: application/json" \
     -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
   ```

6. **Verify in Grafana (Tempo)**:
   - Open `http://localhost:3000`
   - Navigate to Explore → select Tempo datasource
   - Search for service `xrpld`
   - Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
   - Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`

7. **Verify zero-overhead when disabled**:
   - Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`, or set `enabled=0` in config
   - Run the same RPC calls
   - Confirm no new traces appear and no errors in xrpld logs

**Verification Checklist**:

- [ ] Docker stack starts without errors
- [ ] xrpld builds with `-DXRPL_ENABLE_TELEMETRY=ON`
- [ ] xrpld starts and connects to OTel Collector (check xrpld logs for telemetry messages)
- [ ] Traces appear in Grafana/Tempo under service "xrpld"
- [ ] Span hierarchy is correct (parent-child relationships)
- [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
- [ ] Error spans show error status and message
- [ ] Building with `XRPL_ENABLE_TELEMETRY=OFF` produces no regressions
- [ ] Setting `enabled=0` at runtime produces no traces and no errors

**Reference**:

- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Tempo, config validation passes
- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md#6112-phase-2-rpc-tracing) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
- [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2%
- [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary
- [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect

---

## Task 9: Document POC Results and Next Steps

> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket

**Objective**: Capture findings, screenshots, and remaining work for the team.

**What to do**:

- Take screenshots of Grafana/Tempo showing:
  - The service list with "xrpld"
  - A trace with the full span tree
  - Span detail view showing attributes
- Document any issues encountered (build issues, SDK quirks, missing attributes)
- Note performance observations (build time impact, any noticeable runtime overhead)
- Write a short summary of what the POC proves and what it doesn't cover yet:
  - **Proves**: OTel SDK integrates with xrpld, OTLP export works, RPC traces visible
  - **Doesn't cover**: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
- Outline next steps (mapping to the full plan phases):
  - [Phase 2](./06-implementation-phases.md) completion: [W3C header extraction](./02-design-decisions.md) (§2.5), WebSocket tracing, all [RPC handlers](./01-architecture-analysis.md) (§1.6)
  - [Phase 3](./06-implementation-phases.md): [Protobuf `TraceContext` message](./04-code-samples.md) (§4.4), [transaction relay tracing](./04-code-samples.md) (§4.5.1) across nodes
  - [Phase 4](./06-implementation-phases.md): [Consensus round and phase tracing](./04-code-samples.md) (§4.5.2)
  - [Phase 5](./06-implementation-phases.md): [Production collector config](./05-configuration-reference.md) (§5.5.2), [Grafana dashboards](./07-observability-backends.md) (§7.6), [alerting](./07-observability-backends.md) (§7.6.3)

**Reference**:

- [06-implementation-phases.md §6.1](./06-implementation-phases.md) — Full 5-phase timeline overview and Gantt chart
- [06-implementation-phases.md §6.10](./06-implementation-phases.md) — Crawl-Walk-Run strategy: POC is the CRAWL phase, next steps are WALK and RUN
- [06-implementation-phases.md §6.12](./06-implementation-phases.md) — Recommended implementation order (14 steps across 9 weeks)
- [03-implementation-strategy.md §3.9](./03-implementation-strategy.md) — Code intrusiveness assessment and risk matrix for each remaining component
- [07-observability-backends.md §7.2](./07-observability-backends.md) — Production backend selection (Tempo, Elastic APM, Honeycomb, Datadog)
- [02-design-decisions.md §2.5](./02-design-decisions.md) — Context propagation design: W3C HTTP headers, protobuf P2P, JobQueue internal
- [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) — Reference for team onboarding on distributed tracing concepts

---

## Summary

| Task | Description                          | New Files | Modified Files | Depends On |
| ---- | ------------------------------------ | --------- | -------------- | ---------- |
| 0    | Docker observability stack           | 4         | 0              | —          |
| 1    | OTel C++ SDK dependency              | 0         | 2              | —          |
| 2    | Core Telemetry interface + NullImpl  | 3         | 0              | 1          |
| 3    | OTel-backed Telemetry implementation | 2         | 1              | 1, 2       |
| 4    | Application lifecycle integration    | 0         | 3              | 2, 3       |
| 5    | SpanGuard factory methods            | 0         | 1              | 2          |
| 6    | Instrument RPC ServerHandler         | 0         | 1              | 4, 5       |
| 7    | Instrument RPC command execution     | 0         | 1              | 4, 5       |
| 8    | End-to-end verification              | 0         | 0              | 0-7        |
| 9    | Document results and next steps      | 1         | 0              | 8          |

**Parallel work**: Tasks 0 and 1 can run in parallel. Task 5 depends only on Task 2, so it can proceed alongside Tasks 3 and 4. Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.

---

## Next Steps (Post-POC)

> **OTLP** = OpenTelemetry Protocol | **WS** = WebSocket

### Metrics Pipeline for Grafana Dashboards

The current POC exports **traces only**. Grafana's Explore view can query Tempo for individual traces, but time-series charts (latency histograms, request throughput, error rates) require a **metrics pipeline**. To enable this:

1. **Add a `spanmetrics` connector** to the OTel Collector config that derives RED metrics (Rate, Errors, Duration) from trace spans automatically:

   ```yaml
   connectors:
     spanmetrics:
       histogram:
         explicit:
           buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
       dimensions:
         - name: xrpl.rpc.command
         - name: xrpl.rpc.status

   exporters:
     prometheus:
       endpoint: 0.0.0.0:8889

   service:
     pipelines:
       traces:
         receivers: [otlp]
         processors: [batch]
         exporters: [debug, otlp/tempo, spanmetrics]
       metrics:
         receivers: [spanmetrics]
         exporters: [prometheus]
   ```

2. **Add Prometheus** to the Docker Compose stack to scrape the collector's metrics endpoint.

3. **Add Prometheus as a Grafana datasource** and build dashboards for:
   - RPC request latency (p50/p95/p99) by command
   - RPC throughput (requests/sec) by command
   - Error rate by command
   - Span duration distribution
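
To make the RED derivation concrete, the sketch below shows how a span's duration and status could be folded into request, error, and histogram counters using the bucket bounds from the connector config above. It illustrates the idea only and is not the connector's implementation; `RedMetrics`, `bucketIndex()`, and `record()` are hypothetical names, and the inclusive upper bounds follow the usual explicit-bucket histogram convention.

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>

namespace sketch {

using std::chrono::milliseconds;

// Upper bounds copied from the `buckets` list in the connector config.
inline constexpr std::array<milliseconds, 10> kBounds{
    milliseconds{1},   milliseconds{5},   milliseconds{10},   milliseconds{25},
    milliseconds{50},  milliseconds{100}, milliseconds{250},  milliseconds{500},
    milliseconds{1000}, milliseconds{5000}};

// Counters for one (command, status) combination.
struct RedMetrics
{
    std::size_t requests = 0;  // Rate: divide by the scrape interval
    std::size_t errors = 0;    // Errors
    // Duration histogram; the last slot is the overflow bucket (> 5s).
    std::array<std::size_t, kBounds.size() + 1> histogram{};
};

// Returns the first bucket whose upper bound is >= d (bounds are inclusive).
inline std::size_t bucketIndex(std::chrono::microseconds d)
{
    auto it = std::lower_bound(kBounds.begin(), kBounds.end(), d);
    return static_cast<std::size_t>(it - kBounds.begin());
}

// Folds one finished span into the counters.
inline void record(RedMetrics& m, std::chrono::microseconds d, bool isError)
{
    ++m.requests;
    if (isError)
        ++m.errors;
    ++m.histogram[bucketIndex(d)];
}

}  // namespace sketch
```

Percentiles such as p95 are then estimated on the Prometheus side from the cumulative bucket counts, which is why the bucket bounds should bracket the latencies you expect to see.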

### Additional Instrumentation

- **W3C `traceparent` header extraction** in `ServerHandler` to support cross-service context propagation from external callers
- **WebSocket RPC tracing** in `ServerHandler::onWSMessage()`
- **Transaction relay tracing** across nodes using protobuf `TraceContext` messages
- **Consensus round and phase tracing** for validator coordination visibility
- **Ledger close tracing** to measure close-to-validated latency
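
For the `traceparent` item, a minimal parser might look like the sketch below. It assumes the W3C Trace Context version `00` layout (`00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`) and deliberately rejects anything else; `TraceParent` and `parseTraceParent()` are hypothetical names. In the real integration the OTel SDK's own propagator would more likely do this work.

```cpp
#include <optional>
#include <string>
#include <string_view>

namespace sketch {

struct TraceParent
{
    std::string traceId;   // 32 lowercase hex characters
    std::string parentId;  // 16 lowercase hex characters
    bool sampled = false;  // bit 0 of the flags byte
};

inline bool isLowerHex(std::string_view s)
{
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
            return false;
    return true;
}

inline bool allZero(std::string_view s)
{
    return s.find_first_not_of('0') == std::string_view::npos;
}

// Precondition: c is a lowercase hex digit (checked by isLowerHex).
inline int hexValue(char c)
{
    return (c <= '9') ? c - '0' : c - 'a' + 10;
}

inline std::optional<TraceParent> parseTraceParent(std::string_view h)
{
    // 2 (version) + 1 + 32 (trace-id) + 1 + 16 (parent-id) + 1 + 2 (flags)
    if (h.size() != 55 || h[2] != '-' || h[35] != '-' || h[52] != '-')
        return std::nullopt;

    auto version = h.substr(0, 2);
    auto traceId = h.substr(3, 32);
    auto parentId = h.substr(36, 16);
    auto flags = h.substr(53, 2);

    // Simplification: only version 00 is accepted in this sketch.
    if (version != "00" || !isLowerHex(traceId) || !isLowerHex(parentId) ||
        !isLowerHex(flags))
        return std::nullopt;

    // All-zero IDs are invalid per the spec.
    if (allZero(traceId) || allZero(parentId))
        return std::nullopt;

    int flagBits = hexValue(flags[0]) * 16 + hexValue(flags[1]);
    return TraceParent{
        std::string(traceId), std::string(parentId), (flagBits & 0x01) != 0};
}

}  // namespace sketch
```

Returning `std::nullopt` on any malformed header lets `ServerHandler` fall back to starting a fresh root span instead of failing the request.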

### Production Hardening

- **Tail-based sampling** in the OTel Collector to reduce volume while retaining error/slow traces
- **TLS configuration** for the OTLP exporter in production deployments
- **Resource limits** on the batch processor queue to prevent unbounded memory growth
- **Health monitoring** for the telemetry pipeline itself (collector lag, export failures)

### POC Lessons Learned

Issues encountered during POC implementation that inform future work:

| Issue                                                                                              | Resolution                                                                    | Impact on Future Work                                            |
| -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| Conan lockfile rejected `opentelemetry-cpp/1.18.0`                                                 | Used `--lockfile=""` to bypass                                                | Lockfile must be regenerated when adding new dependencies        |
| Conan package only builds OTLP HTTP exporter, not gRPC                                             | Switched from gRPC to HTTP exporter (`localhost:4318/v1/traces`)              | HTTP exporter is the default; gRPC requires custom Conan profile |
| CMake targets such as `opentelemetry-cpp::api` don't exist in the Conan package                    | Use umbrella target `opentelemetry-cpp::opentelemetry-cpp`                    | Conan targets differ from upstream CMake targets                 |
| OTel Collector `logging` exporter deprecated                                                       | Renamed to `debug` exporter                                                   | Use `debug` in all collector configs going forward               |
| Macro parameter `telemetry` collided with `::xrpl::telemetry::` namespace                          | Replaced macros with SpanGuard factory methods (no macros needed)             | Factory methods avoid macro hygiene issues entirely              |
| `opentelemetry::trace::Scope` creates new context on move                                          | Store scope as member, create once in constructor                             | SpanGuard move semantics need care with Scope lifecycle          |
| `TracerProviderFactory::Create` returns `unique_ptr<sdk::TracerProvider>`, not `nostd::shared_ptr` | Use `std::shared_ptr` member, wrap in `nostd::shared_ptr` for global provider | OTel SDK factory return types don't match API provider types     |