# Implementation Phases

> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
> **Related**: [Configuration Reference](./05-configuration-reference.md) | [Observability Backends](./07-observability-backends.md)

---

## 6.1 Phase Overview

> **TxQ** = Transaction Queue

```mermaid
gantt
    title OpenTelemetry Implementation Timeline
    dateFormat YYYY-MM-DD
    axisFormat Week %W

    section Phase 1
    Core Infrastructure               :p1, 2024-01-01, 2w
    SDK Integration                   :p1a, 2024-01-01, 4d
    Telemetry Interface               :p1b, after p1a, 3d
    Configuration & CMake             :p1c, after p1b, 3d
    Unit Tests                        :p1d, after p1c, 2d
    Buffer & Integration              :p1e, after p1d, 2d

    section Phase 2
    RPC Tracing                       :p2, after p1, 2w
    HTTP Context Extraction           :p2a, after p1, 2d
    RPC Handler Instrumentation       :p2b, after p2a, 4d
    PathFinding Instrumentation       :p2f, after p2b, 2d
    TxQ Instrumentation               :p2g, after p2f, 2d
    WebSocket Support                 :p2c, after p2g, 2d
    Integration Tests                 :p2d, after p2c, 2d
    Buffer & Review                   :p2e, after p2d, 4d

    section Phase 3
    Transaction Tracing               :p3, after p2, 2w
    Protocol Buffer Extension         :p3a, after p2, 2d
    PeerImp Instrumentation           :p3b, after p3a, 3d
    Fee Escalation Instrumentation    :p3f, after p3b, 2d
    Relay Context Propagation         :p3c, after p3f, 3d
    Multi-node Tests                  :p3d, after p3c, 2d
    Buffer & Review                   :p3e, after p3d, 4d

    section Phase 4
    Consensus Tracing                 :p4, after p3, 2w
    Consensus Round Spans             :p4a, after p3, 3d
    Proposal Handling                 :p4b, after p4a, 3d
    Validator List & Manifest Tracing :p4f, after p4b, 2d
    Amendment Voting Tracing          :p4g, after p4f, 2d
    SHAMap Sync Tracing               :p4h, after p4g, 2d
    Validation Tests                  :p4c, after p4h, 4d
    Buffer & Review                   :p4e, after p4c, 4d

    section Phase 5
    Documentation & Deploy            :p5, after p4, 1w

    section Phase 6
    StatsD Metrics Bridge             :p6, after p5, 1w

    section Phase 7
    Native OTel Metrics               :p7, after p6, 2w

    section Phase 8
    Log-Trace Correlation             :p8, after p7, 1w

    section Phase 9 (Future)
    Internal Metric Gap Fill          :p9, after p8, 2.5w

    section Phase 10 (Future)
    Workload Validation               :p10, after p9, 2w

    section Phase 11 (Future)
    Third-Party Collection            :p11, after p10, 3w
```

---

## 6.2 Phase 1: Core Infrastructure (Weeks 1-2)

**Objective**: Establish foundational telemetry infrastructure

### Tasks

| Task | Description                                           |
| ---- | ----------------------------------------------------- |
| 1.1  | Add OpenTelemetry C++ SDK to Conan/CMake              |
| 1.2  | Implement `Telemetry` interface and factory           |
| 1.3  | Implement `SpanGuard` RAII wrapper                    |
| 1.4  | Implement configuration parser                        |
| 1.5  | Integrate into `ApplicationImp`                       |
| 1.6  | Add conditional compilation (`XRPL_ENABLE_TELEMETRY`) |
| 1.7  | Create `NullTelemetry` no-op implementation           |
| 1.8  | Unit tests for core infrastructure                    |

### Exit Criteria

- [ ] OpenTelemetry SDK compiles and links
- [ ] Telemetry can be enabled/disabled via config
- [ ] Basic span creation works
- [ ] No performance regression when disabled
- [ ] Unit tests passing

---

## 6.3 Phase 2: RPC Tracing (Weeks 3-4)

> **TxQ** = Transaction Queue

**Objective**: Complete tracing for all RPC operations

### Tasks

| Task | Description                                                                |
| ---- | -------------------------------------------------------------------------- |
| 2.1  | Implement W3C Trace Context HTTP header extraction                         |
| 2.2  | Instrument `ServerHandler::onRequest()`                                    |
| 2.3  | Instrument `RPCHandler::doCommand()`                                       |
| 2.4  | Add RPC-specific attributes                                                |
| 2.5  | Instrument WebSocket handler                                               |
| 2.6  | PathFinding instrumentation (`pathfind.request`, `pathfind.compute` spans) |
| 2.7  | TxQ instrumentation (`txq.enqueue`, `txq.apply` spans)                     |
| 2.8  | Integration tests for RPC tracing                                          |
| 2.9  | Performance benchmarks                                                     |
| 2.10 | Documentation                                                              |

### Exit Criteria

- [ ] All RPC commands traced
- [ ] Trace context propagates from HTTP headers
- [ ] WebSocket and HTTP both instrumented
- [ ] <1ms overhead per RPC call
- [ ] Integration tests passing

---

## 6.4 Phase 3: Transaction Tracing (Weeks 5-6)

**Objective**: Trace transaction lifecycle across network

### Tasks

| Task | Description |
| ---- | 
---------------------------------------------------- | | 3.1 | Define `TraceContext` Protocol Buffer message | | 3.2 | Implement protobuf context serialization | | 3.3 | Instrument `PeerImp::handleTransaction()` | | 3.4 | Instrument `NetworkOPs::submitTransaction()` | | 3.5 | Instrument HashRouter integration | | 3.6 | Fee escalation instrumentation (`fee.escalate` span) | | 3.7 | Implement relay context propagation | | 3.8 | Integration tests (multi-node) | | 3.9 | Performance benchmarks | ### Exit Criteria - [ ] Transaction traces span across nodes - [ ] Trace context in Protocol Buffer messages - [ ] HashRouter deduplication visible in traces - [ ] Multi-node integration tests passing - [ ] <5% overhead on transaction throughput --- ## 6.5 Phase 4: Consensus Tracing (Weeks 7-8) **Objective**: Full observability into consensus rounds ### Tasks | Task | Description | | ---- | ---------------------------------------------- | | 4.1 | Instrument `RCLConsensusAdaptor::startRound()` | | 4.2 | Instrument phase transitions | | 4.3 | Instrument proposal handling | | 4.4 | Instrument validation handling | | 4.5 | Add consensus-specific attributes | | 4.6 | Correlate with transaction traces | | 4.7 | Validator list and manifest tracing | | 4.8 | Amendment voting tracing | | 4.9 | SHAMap sync tracing | | 4.10 | Multi-validator integration tests | | 4.11 | Performance validation | ### Spans Produced | Span Name | Location | Attributes | | --------------------------- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `consensus.proposal.send` | `RCLConsensus.cpp:177` | `xrpl.consensus.round` | | `consensus.ledger_close` | `RCLConsensus.cpp:282` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | | `consensus.accept` | `RCLConsensus.cpp:395` | `xrpl.consensus.proposers`, 
`xrpl.consensus.round_time_ms` | | `consensus.accept.apply` | `RCLConsensus.cpp:521` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` | | `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` | ### Exit Criteria - [x] Complete consensus round traces - [x] Phase transitions visible - [x] Proposals and validations traced - [x] Close time agreement tracked (per `avCT_CONSENSUS_PCT`) - [x] No impact on consensus timing - [ ] Multi-validator test network validated ### Implementation Status — Phase 4a Complete Phase 4a (establish-phase gap fill & cross-node correlation) adds: - **Deterministic trace ID** derived from `previousLedger.id()` so all validators in the same round share the same `trace_id` (switchable via `consensus_trace_strategy` config: `"deterministic"` or `"attribute"`). See [Configuration Reference](./05-configuration-reference.md) for full configuration options. The `consensus_trace_strategy` option will be documented in the configuration reference as part of Phase 4a implementation. - **Round lifecycle spans**: `consensus.round` with round-to-round span links. - **Establish phase**: `consensus.establish`, `consensus.update_positions` (with `dispute.resolve` events), `consensus.check` (with threshold tracking). - **Mode changes**: `consensus.mode_change` spans. - **Validation**: `consensus.validation.send` with span link to round span (thread-safe cross-thread access via `roundSpanContext_` snapshot). - **Separation of concerns**: telemetry extracted to private helpers (`startRoundTracing`, `createValidationSpan`, `startEstablishTracing`, `updateEstablishTracing`, `endEstablishTracing`). See [Phase4_taskList.md](./Phase4_taskList.md) for the full spec and implementation notes. 
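The deterministic strategy described above can be illustrated with a minimal sketch. It assumes the trace ID is formed by truncating the 32-byte ledger hash to the 16 bytes an OpenTelemetry trace ID holds; the actual derivation in the implementation may differ. `LedgerHash`, `TraceId`, `deriveTraceId`, and `toHex` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-ins: a 32-byte ledger hash and a 16-byte trace ID.
using LedgerHash = std::array<std::uint8_t, 32>;
using TraceId = std::array<std::uint8_t, 16>;

// Derive a trace ID by copying the first 16 bytes of the ledger hash.
// Truncation is an assumption for this sketch, not the confirmed method.
inline TraceId
deriveTraceId(LedgerHash const& prevLedger)
{
    TraceId id{};
    std::copy_n(prevLedger.begin(), id.size(), id.begin());
    // W3C Trace Context treats an all-zero trace ID as invalid,
    // so nudge that single case to a non-zero value.
    if (std::all_of(id.begin(), id.end(), [](std::uint8_t b) { return b == 0; }))
        id.back() = 1;
    return id;
}

// Render the ID as 32 lowercase hex characters.
inline std::string
toHex(TraceId const& id)
{
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(id.size() * 2);
    for (std::uint8_t b : id)
    {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}
```

Because the function depends only on the previous ledger's hash, every validator that starts a round from the same ledger computes the same `trace_id` without exchanging any context.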
--- ## 6.5a Phase 4a: Establish-Phase Gap Fill & Cross-Node Correlation **Objective**: Fill tracing gaps in the establish phase and establish cross-node correlation using deterministic trace IDs derived from `previousLedger.id()`. **Approach**: Direct instrumentation in `Consensus.h`. Long-lived spans use direct SpanGuard members; short-lived scoped spans use `XRPL_TRACE_*` macros. ### Tasks | Task | Description | Effort | Risk | | ---- | ------------------------------------------------ | ------ | ------ | | 4a.0 | Prerequisites: extend SpanGuard & Telemetry APIs | 1d | Medium | | 4a.1 | Adaptor `getTelemetry()` method | 0.5d | Low | | 4a.2 | Switchable round span with deterministic traceID | 2d | High | | 4a.3 | Span members in `Consensus.h` | 0.5d | Medium | | 4a.4 | Instrument `phaseEstablish()` | 1d | Medium | | 4a.5 | Instrument `updateOurPositions()` | 1d | Medium | | 4a.6 | Instrument `haveConsensus()` (thresholds) | 1d | Medium | | 4a.7 | Instrument mode changes | 0.5d | Low | | 4a.8 | Reparent existing spans under round | 0.5d | Low | | 4a.9 | Build verification and testing | 1d | Low | **Total Effort**: 9 days ### Spans Produced | Span Name | Location | Key Attributes | | ---------------------------- | ------------------ | ---------------------------------------------------------------- | | `consensus.round` | `RCLConsensus.cpp` | `round_id`, `ledger_id`, `ledger.seq`, `mode`; link → prev round | | `consensus.establish` | `Consensus.h` | `converge_percent`, `establish_count`, `proposers` | | `consensus.update_positions` | `Consensus.h` | `disputes_count`, `converge_percent`, `proposers_agreed/total` | | `consensus.check` | `Consensus.h` | `agree/disagree_count`, `threshold_percent`, `result` | | `consensus.mode_change` | `RCLConsensus.cpp` | `mode.old`, `mode.new` | ### Exit Criteria - [ ] Establish phase internals fully traced (disputes, convergence, thresholds) - [ ] Cross-node correlation works via deterministic trace_id - [ ] Strategy switchable via 
config (`deterministic` / `attribute`)
- [ ] Consecutive rounds linked via follows-from spans
- [ ] Build passes with telemetry ON and OFF
- [ ] No impact on consensus timing

See [Phase4_taskList.md](./Phase4_taskList.md) for full task details.

---

## 6.5b Phase 4b: Cross-Node Propagation (Future)

**Objective**: Wire `TraceContextPropagator` for P2P messages (proposals, validations) to enable true distributed tracing between nodes.

**Status**: Design documented, NOT implemented. Protobuf fields (field 1001) and `TraceContextPropagator` class exist. Wiring deferred until Phase 4a is validated in a multi-node environment.

**Prerequisites**: Phase 4a complete and validated.

See [Phase4_taskList.md § Phase 4b](./Phase4_taskList.md) for full design.

---

## 6.6 Phase 5: Documentation & Deployment (Week 9)

**Objective**: Production readiness

### Tasks

| Task | Description                   |
| ---- | ----------------------------- |
| 5.1  | Operator runbook              |
| 5.2  | Grafana dashboards            |
| 5.3  | Alert definitions             |
| 5.4  | Collector deployment examples |
| 5.5  | Developer documentation       |
| 5.6  | Training materials            |
| 5.7  | Final integration testing     |

---

## 6.7 Phase 6: StatsD Metrics Integration (Week 10)

**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.

### Background

rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic. This data **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
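For readers unfamiliar with the StatsD wire format mentioned above, the following sketch builds one datagram line of the standard form `<name>:<value>|<type>`. It is an illustration only: `formatStatsD` and `StatsDType` are hypothetical names, and the dotted `prefix.group.metric` naming is an assumption rather than a statement of what `beast::insight` emits.

```cpp
#include <cstdint>
#include <string>

// Standard StatsD metric types covered by this sketch.
enum class StatsDType { Counter, Gauge, Timing };

// Build one StatsD line: "<prefix>.<name>:<value>|<type>".
// The dotted naming scheme is assumed for illustration.
inline std::string
formatStatsD(
    std::string const& prefix,
    std::string const& name,
    std::int64_t value,
    StatsDType type)
{
    char const* suffix = "c";  // counter
    switch (type)
    {
        case StatsDType::Counter:
            suffix = "c";
            break;
        case StatsDType::Gauge:
            suffix = "g";
            break;
        case StatsDType::Timing:
            suffix = "ms";
            break;
    }
    return prefix + "." + name + ":" + std::to_string(value) + "|" + suffix;
}
```

A StatsD receiver parses each such line, so a type suffix it does not recognize causes the sample to be discarded rather than stored.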
### Metric Inventory | Category | Group | Type | Count | Key Metrics | | --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ | | Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode | | Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` | | Ledger Fetch | — | Counter | 1 | `ledger_fetches` | | Ledger History | `ledger.history` | Counter | 1 | `mismatch` | | RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) | | Job Queue | — | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` | | Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` | | Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` | | Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category | | Pathfinding | — | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) | | I/O | — | Event | 1 | `ios_latency` (histogram) | | Resource Mgr | — | Meter | 2 | `warn`, `drop` (rate counters) | | Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` | **Total**: ~255+ unique metrics (plus dynamic job-type and cache metrics) ### Tasks | Task | Description | | ---- | --------------------------------------------------------------------------------------------------------------- | | 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | | 6.2 | Add `statsd` receiver to OTel Collector config | | 6.3 | Expose UDP port 8125 in docker-compose.yml | | 6.4 | Add `[insight]` config to integration test node configs | | 6.5 | Create "Node Health" Grafana dashboard (8 panels) | | 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) | | 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) | | 6.8 | Update integration test to verify StatsD 
metrics in Prometheus | | 6.9 | Update TESTING.md and telemetry-runbook.md | ### Wire Format Fix (Task 6.1) — DEFERRED The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager). **Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied. ### New Grafana Dashboards **Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`): - Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches **Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`): - Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories **RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`): - RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap ### Exit Criteria - [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`) - [ ] All 3 new Grafana dashboards load without errors - [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests) - [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately; resolved by Phase 7's OTel Counter mapping) --- ## 6.8 Phase 7: Native OTel Metrics Migration (Weeks 11-12) **Objective**: Replace `StatsDCollector` with a 
native OpenTelemetry Metrics SDK implementation behind the existing `beast::insight::Collector` interface, eliminating the StatsD UDP dependency and unifying traces and metrics into a single OTLP pipeline.

### Motivation: Why Migrate from StatsD to Native OTel Metrics

The Phase 6 StatsD bridge was a pragmatic first step, but it retains inherent limitations that native OTel export resolves.

#### What We Gain

1. **Unified telemetry pipeline**: Traces and metrics export via the same OTLP/HTTP endpoint to the same OTel Collector. One protocol, one endpoint, one config. This eliminates the split-brain architecture of "OTLP for traces, StatsD UDP for metrics."
2. **Eliminates StatsD UDP limitations**: StatsD is fire-and-forget over UDP with no delivery guarantees, no backpressure, IP fragmentation for payloads above about 1472 bytes on a standard 1500-byte MTU, and text-based encoding overhead. OTLP uses HTTP/gRPC with retries, binary protobuf encoding, and connection-level flow control.
3. **Fixes the `|m` wire format issue**: The `StatsDMeterImpl` uses a non-standard `|m` StatsD type that the OTel StatsD receiver silently drops. Native OTel counters eliminate this problem entirely, which resolves the deferred Phase 6 Task 6.1.
4. **Richer metric semantics**: The OTel Metrics SDK supports explicit histogram bucket boundaries, exemplars (linking metrics to traces), resource attributes, and metric views. StatsD has no concept of these.
5. **Removes infrastructure dependency**: No StatsD receiver is needed in the OTel Collector. That is one less receiver to configure, monitor, and debug, and it simplifies the collector YAML.
6. **Metric-to-trace correlation**: OTel metrics and traces share the same resource attributes (`service.name`, `service.instance.id`). Grafana can link from a metric spike directly to the traces that caused it, which is not possible with StatsD-sourced metrics.
7. **Production-grade export**: OTel's `PeriodicMetricReader` provides configurable export intervals, batch sizes, timeout handling, and graceful shutdown, all built into the SDK rather than hand-rolled in `StatsDCollectorImp`.

#### What We Lose

1. **StatsD ecosystem compatibility**: Operators using external StatsD-compatible backends (Datadog Agent, Graphite, Telegraf) will need to switch to OTLP-compatible backends or keep `server=statsd` as a fallback.
2. **Simplicity of UDP**: StatsD's UDP fire-and-forget model is very simple and has zero connection management. OTLP/HTTP requires a TCP connection, TLS negotiation (in production), and retry logic. The OTel SDK handles this, but it adds moving parts.
3. **Slightly higher memory**: The OTel SDK maintains internal aggregation state for metrics before export. StatsD just formats and sends strings. Expected overhead: ~1-2 MB additional for metric state.
4. **Dependency on OTel C++ Metrics SDK stability**: The Metrics SDK has been GA since 1.0 and is at version 1.18.0, but it is less battle-tested than the tracing SDK in the C++ ecosystem.

#### Decision

The gains (unified pipeline, delivery guarantees, metric-trace correlation, simpler collector config) significantly outweigh the losses. `StatsDCollector` is retained as a fallback via `server=statsd` for operators who need StatsD ecosystem compatibility during the transition period.
### Architecture

#### Class Hierarchy (after Phase 7)

```
beast::insight::Collector (abstract interface, unchanged)
|
+-- StatsDCollector (existing, retained as fallback, deprecated)
|   +-- StatsDCounterImpl -> StatsD |c over UDP
|   +-- StatsDGaugeImpl   -> StatsD |g over UDP
|   +-- StatsDMeterImpl   -> StatsD |m over UDP (non-standard)
|   +-- StatsDEventImpl   -> StatsD |ms over UDP
|   +-- StatsDHookImpl    -> 1s periodic callback
|
+-- NullCollector (existing, unchanged, used when disabled)
|   +-- NullCounterImpl -> no-op
|   +-- NullGaugeImpl   -> no-op
|   +-- NullMeterImpl   -> no-op
|   +-- NullEventImpl   -> no-op
|   +-- NullHookImpl    -> no-op
|
+-- OTelCollector (NEW, Phase 7)
    +-- OTelCounterImpl -> otel::Counter
    +-- OTelGaugeImpl   -> otel::ObservableGauge
    +-- OTelMeterImpl   -> otel::Counter
    +-- OTelEventImpl   -> otel::Histogram
    +-- OTelHookImpl    -> 1s periodic callback (same pattern)
```

#### Data Flow (after Phase 7)

```mermaid
graph LR
    subgraph rippledNode["rippled Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN"]
        B["beast::insight<br/>OTelCollector"]
    end

    subgraph collector["OTel Collector :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
        BP["Batch Processor"]
        SM["SpanMetrics Connector"]
        R1 --> BP
        BP --> SM
    end

    subgraph backends["Trace Backends"]
        D["Jaeger / Tempo"]
    end

    subgraph metrics["Metrics Stack"]
        E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + native OTel metrics"]
    end

    subgraph viz["Visualization"]
        F["Grafana :3000"]
    end

    A -->|"OTLP/HTTP :4318<br/>(traces)"| R1
    B -->|"OTLP/HTTP :4318<br/>(metrics)"| R1
    BP -->|"OTLP/gRPC"| D
    SM -->|"RED metrics"| E
    R1 -->|"rippled_* metrics<br/>(native OTLP)"| E
    E --> F
    D --> F

    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
    style B fill:#d9534f,color:#fff,stroke:#b52d2d
    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```

**Key change**: StatsD receiver removed from collector. Both traces and metrics enter via OTLP receiver on the same port.

#### Configuration

```ini
# [insight] section: new "otel" server option
[insight]
server=otel              # NEW: uses OTel OTLP metrics exporter
prefix=rippled           # metric name prefix (preserved)

# Endpoint and auth inherited from [telemetry] section:
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```

The `OTelCollector` reads the OTLP endpoint from `[telemetry]` config (replacing `/v1/traces` with `/v1/metrics` for the metrics exporter). No additional config keys needed.

**Backward compatibility**: `server=statsd` continues to work exactly as before.

See [Phase7_taskList.md](./Phase7_taskList.md) for detailed per-task breakdown.
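The endpoint rewrite described in the configuration notes could look roughly like the sketch below. `metricsEndpointFrom` is a hypothetical helper, and the fallback of appending `/v1/metrics` when the configured URL does not end in `/v1/traces` is an assumption about the desired behavior, not something the plan specifies.

```cpp
#include <string>

// Map the [telemetry] traces endpoint to the OTLP/HTTP metrics endpoint.
// Hypothetical helper: the real OTelCollector may handle edge cases differently.
inline std::string
metricsEndpointFrom(std::string endpoint)
{
    static std::string const tracesPath = "/v1/traces";
    static std::string const metricsPath = "/v1/metrics";

    // Ignore trailing slashes so ".../v1/traces/" is handled too.
    while (!endpoint.empty() && endpoint.back() == '/')
        endpoint.pop_back();

    // Strip a trailing "/v1/traces" if present.
    if (endpoint.size() >= tracesPath.size() &&
        endpoint.compare(
            endpoint.size() - tracesPath.size(),
            tracesPath.size(),
            tracesPath) == 0)
    {
        endpoint.erase(endpoint.size() - tracesPath.size());
    }

    // Assumed fallback: a bare base URL simply gains the metrics path.
    return endpoint + metricsPath;
}
```

With the sample configuration, `http://localhost:4318/v1/traces` maps to `http://localhost:4318/v1/metrics`, so operators only maintain one endpoint setting.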
### Instrument Type Mapping | beast::insight | OTel Metrics SDK | Rationale | | ---------------------- | -------------------------------- | ---------------------------------------------------------------- | | Counter (int64, `\|c`) | `Counter` | Direct 1:1 mapping | | Gauge (uint64, `\|g`) | `ObservableGauge` | Async callback matches existing Hook polling pattern | | Meter (uint64, `\|m`) | `Counter` | Fixes non-standard wire format; meters are semantically counters | | Event (ms, `\|ms`) | `Histogram` | Duration distributions with explicit bucket boundaries | | Hook (1s callback) | `PeriodicMetricReader` alignment | Same 1s collection interval | ### Tasks | Task | Description | | ---- | ------------------------------------------------------------------------- | | 7.1 | Add OTel Metrics SDK to build deps (conan/cmake) | | 7.2 | Implement `OTelCollector` class (~400-500 lines) | | 7.3 | Update `CollectorManager` — add `server=otel` | | 7.4 | Update OTel Collector YAML (add metrics pipeline, remove StatsD receiver) | | 7.5 | Preserve metric names in Prometheus (naming strategy) | | 7.6 | Update Grafana dashboards (if names change) | | 7.7 | Update integration tests | | 7.8 | Update documentation (runbook, reference docs) | ### Exit Criteria - [ ] All 255+ metrics visible in Prometheus via OTLP pipeline (no StatsD receiver) - [ ] `server=otel` is the default in development docker-compose - [ ] `server=statsd` still works as a fallback - [ ] Existing Grafana dashboards display data correctly - [ ] Integration test passes with OTLP-only metrics pipeline - [ ] No performance regression vs StatsD baseline (< 1% CPU overhead) - [ ] Deferred Task 6.1 (`|m` wire format) no longer relevant --- ## 6.8.1 Phase 8: Log-Trace Correlation and Centralized Log Ingestion (Week 13) ### Motivation rippled's `beast::Journal` logs and OpenTelemetry traces are currently two disjoint observability signals. 
When investigating an issue, operators must manually correlate timestamps between log files and Jaeger/Tempo traces. Phase 8 bridges this gap by injecting trace context (`trace_id`, `span_id`) into every log line emitted within an active span, and ingesting those logs into Grafana Loki via the OTel Collector's filelog receiver.

#### Gains

1. **One-click trace-to-log navigation**: Click a trace in Tempo/Jaeger and immediately see the corresponding log lines in Loki, filtered by `trace_id`.
2. **Reverse lookup (log-to-trace)**: Loki derived fields make `trace_id` values clickable links back to Tempo.
3. **Unified observability**: All three pillars (traces, metrics, logs) flow through the same OTel Collector pipeline and are visible in a single Grafana instance.
4. **Zero new dependencies in rippled**: Uses existing OTel SDK headers (`GetSpan`, `GetContext`) already linked in Phase 1.
5. **Negligible overhead**: `GetSpan()` + `GetContext()` are thread-local reads (<10 ns per call). At ~1000 JLOG calls/min, this adds <10 µs/min.

#### Losses / Risks

1. **Log format change**: Existing log parsers that rely on a fixed format will need updating to handle the optional `trace_id=... span_id=...` fields.
2. **Loki resource usage**: Log ingestion adds storage and memory overhead to the observability stack (mitigated by retention policies).
3. **Filelog receiver complexity**: The regex parser must be kept in sync with the log format; a format change in `Logs::format()` could break parsing.

#### Decision

The correlation value far outweighs the risks. The log format change is backward-compatible (fields are appended only when a span is active), and the filelog receiver regex is straightforward to maintain.

### Architecture

Phase 8 has two independent sub-phases that can be developed in parallel:

- **Phase 8a (code change)**: Modify `Logs::format()` in `src/libxrpl/basics/Log.cpp` to append `trace_id=... span_id=...` when the current thread has an active OTel span.
Guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. - **Phase 8b (infra only)**: Add Loki to the Docker Compose stack, configure the OTel Collector's `filelog` receiver to tail rippled's log file, parse out structured fields (timestamp, partition, severity, trace_id, span_id, message), and export to Loki via OTLP. Configure Grafana Tempo↔Loki bidirectional linking. #### Trace ID Injection Flow ```mermaid flowchart LR subgraph rippled["rippled process"] JLOG["JLOG(j.info())"] Format["Logs::format()"] OTelCtx["OTel Context
(thread-local)"] JLOG --> Format OTelCtx -.->|"GetSpan()→GetContext()"| Format end subgraph output["Log Output"] LogLine["2024-01-15T10:30:45.123Z
LedgerMaster:NFO
trace_id=abc123...
span_id=def456...
Validated ledger 42"] end Format --> LogLine style rippled fill:#1a237e,stroke:#0d1642,color:#fff style output fill:#1b5e20,stroke:#0d3d14,color:#fff style JLOG fill:#283593,stroke:#1a237e,color:#fff style Format fill:#283593,stroke:#1a237e,color:#fff style OTelCtx fill:#283593,stroke:#1a237e,color:#fff style LogLine fill:#2e7d32,stroke:#1b5e20,color:#fff ``` #### Loki Ingestion Pipeline ```mermaid flowchart LR subgraph collector["OTel Collector"] FR["filelog receiver
tails debug.log"] RP["regex_parser
extracts trace_id,
span_id, severity"] BP["batch processor"] LE["otlp/loki exporter"] FR --> RP --> BP --> LE end LogFile["rippled
debug.log"] --> FR LE --> Loki["Grafana Loki
:3100"] Loki <-->|"derivedFields ↔
tracesToLogs"| Tempo["Grafana Tempo"] style collector fill:#e65100,stroke:#bf360c,color:#fff style FR fill:#f57c00,stroke:#e65100,color:#fff style RP fill:#f57c00,stroke:#e65100,color:#fff style BP fill:#f57c00,stroke:#e65100,color:#fff style LE fill:#f57c00,stroke:#e65100,color:#fff style LogFile fill:#1a237e,stroke:#0d1642,color:#fff style Loki fill:#4a148c,stroke:#2e0d57,color:#fff style Tempo fill:#4a148c,stroke:#2e0d57,color:#fff ``` ### Tasks | Task | Description | | ---- | ---------------------------------------------- | | 8.1 | Inject trace_id into Logs::format() | | 8.2 | Add Loki to Docker Compose stack | | 8.3 | Add filelog receiver to OTel Collector | | 8.4 | Configure Grafana trace-to-log correlation | | 8.5 | Update integration tests | | 8.6 | Update documentation (runbook, reference docs) | **Parallel work**: Task 8.2 (Loki infra) can run in parallel with Task 8.1 (code change). Tasks 8.3–8.6 are sequential. ### Exit Criteria - [ ] Log lines within active spans contain `trace_id= span_id=` - [ ] Log lines outside spans have no trace context (no empty fields) - [ ] Loki ingests rippled logs via OTel Collector filelog receiver - [ ] Grafana Tempo → Loki one-click correlation works - [ ] Grafana Loki → Tempo reverse lookup works via derived field - [ ] Integration test verifies trace_id presence in logs - [ ] No performance regression from trace_id injection (< 0.1% overhead) --- ## 6.8.2 Phase 9: Internal Metric Instrumentation Gap Fill (Weeks 14-15) — Future Enhancement > **Status**: Planned, not yet implemented. ### Motivation Phases 1-8 establish trace spans, StatsD metrics bridge, native OTel metrics, and log-trace correlation. However, ~68 metrics that exist inside rippled's `get_counts`, `server_info`, TxQ, PerfLog, and `CountedObject` systems have **no time-series export path**. 
These are the metrics that exchanges, payment processors, analytics providers, validators, and researchers need most — NodeStore I/O performance, cache hit rates, per-RPC-method counters, transaction queue depth, fee escalation levels, and live object instance counts. ### Architecture Hybrid approach — two instrumentation strategies based on proximity to existing code: ```mermaid flowchart TB subgraph rippled["rippled process"] subgraph existing["Existing beast::insight registrations"] NS["NodeStore I/O
(Database.cpp)"] end subgraph newreg["New OTel MetricsRegistry"] CR["Cache Hit Rates
(async gauge callbacks)"] TQ["TxQ Metrics
(async gauge callbacks)"] PL["PerfLog RPC/Job
(counters + histograms)"] CO["CountedObjects
(async gauge callbacks)"] LF["Load Factors
(async gauge callbacks)"] end end subgraph export["Export Pipelines"] BI["beast::insight
OTelCollector (Phase 7)"] OS["OTel Metrics SDK
PeriodicMetricReader"] end NS --> BI CR --> OS TQ --> OS PL --> OS CO --> OS LF --> OS BI --> OTLP["OTLP/HTTP :4318
/v1/metrics"] OS --> OTLP style rippled fill:#1a2633,color:#ccc,stroke:#4a90d9 style existing fill:#2a4a6b,color:#fff,stroke:#4a90d9 style newreg fill:#2a4a6b,color:#fff,stroke:#4a90d9 style export fill:#1a3320,color:#ccc,stroke:#5cb85c style NS fill:#4a90d9,color:#fff,stroke:#2a6db5 style CR fill:#5cb85c,color:#fff,stroke:#3d8b3d style TQ fill:#5cb85c,color:#fff,stroke:#3d8b3d style PL fill:#5cb85c,color:#fff,stroke:#3d8b3d style CO fill:#5cb85c,color:#fff,stroke:#3d8b3d style LF fill:#5cb85c,color:#fff,stroke:#3d8b3d style BI fill:#449d44,color:#fff,stroke:#2d6e2d style OS fill:#449d44,color:#fff,stroke:#2d6e2d style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e ``` - **beast::insight extensions** (blue): NodeStore I/O metrics added near existing `Database.cpp` registrations — exported via Phase 7's `OTelCollector`. - **OTel MetricsRegistry** (green): New centralized class using `ObservableGauge` async callbacks for cache, TxQ, PerfLog, CountedObjects, and load factors — polled at 10s intervals by `PeriodicMetricReader`. 
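The async-gauge pattern planned for the `MetricsRegistry` can be sketched without the OTel SDK as a table of callbacks that a periodic reader invokes on each export tick. Everything here is a hypothetical stand-in: `GaugeRegistry`, `hitRate`, and the metric name used in the usage note are illustrative and do not reflect the planned class's actual interface.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the planned MetricsRegistry; no OTel SDK involved.
class GaugeRegistry
{
public:
    using Callback = std::function<double()>;

    // Register (or replace) an async gauge under a metric name.
    void
    registerGauge(std::string name, Callback cb)
    {
        gauges_[std::move(name)] = std::move(cb);
    }

    // Remove a gauge; returns true if it was registered.
    bool
    deregisterGauge(std::string const& name)
    {
        return gauges_.erase(name) > 0;
    }

    // Invoke every callback once, as a periodic reader would per tick.
    std::map<std::string, double>
    collect() const
    {
        std::map<std::string, double> out;
        for (auto const& entry : gauges_)
            out[entry.first] = entry.second();
        return out;
    }

private:
    std::map<std::string, Callback> gauges_;
};

// Cache hit rate as a fraction in [0, 1]; 0 when there were no lookups.
inline double
hitRate(std::uint64_t hits, std::uint64_t misses)
{
    std::uint64_t const total = hits + misses;
    return total == 0
        ? 0.0
        : static_cast<double>(hits) / static_cast<double>(total);
}
```

A cache would register a lambda such as `[&] { return hitRate(hits, misses); }` under a name like `cache.hit_rate`; the value is then read only when the reader polls, so the hot path pays nothing beyond maintaining its own counters.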
### Third-Party Consumer Context | Consumer Category | Key Metrics They Need From Phase 9 | | ---------------------- | --------------------------------------------------------------- | | Exchanges | Fee escalation levels, TxQ depth, settlement latency | | Payment Processors | Load factors, io_latency, transaction throughput | | Analytics Providers | NodeStore I/O, cache hit rates, counted objects | | Validators / Operators | Per-job execution times, PerfLog RPC counters, consensus timing | | Academic Researchers | Consensus performance time-series, fee market dynamics | | Institutional Custody | Server health scores, reserve calculations, node availability | ### Tasks | Task | Description | | ---- | ----------------------------------------- | | 9.1 | NodeStore I/O metrics | | 9.2 | Cache hit rate metrics + MetricsRegistry | | 9.3 | TxQ metrics | | 9.4 | PerfLog per-RPC metrics | | 9.5 | PerfLog per-job metrics | | 9.6 | Counted object instance metrics | | 9.7 | Fee escalation & load factor metrics | | 9.7a | push_metrics.py parity gauges | | 9.8 | New Grafana dashboards (2 new, 2 updated) | | 9.9 | Update documentation | | 9.10 | Integration tests | See [Phase9_taskList.md](./Phase9_taskList.md) for detailed per-task breakdown. ### Exit Criteria - [ ] All ~68 new metrics visible in Prometheus via OTLP pipeline - [ ] `MetricsRegistry` class registers/deregisters cleanly with OTel SDK - [ ] 2 new Grafana dashboards operational (Fee Market, Job Queue) - [ ] No performance regression (< 0.5% CPU overhead from new callbacks) - [ ] Documentation updated with full new metric inventory --- ## 6.8.3 Phase 10: Synthetic Workload Generation & Telemetry Validation (Weeks 16-17) — Future Enhancement > **Status**: Planned, not yet implemented. 
### Motivation

Before the telemetry stack (Phases 1-9) can be considered production-ready, we need automated proof that all 16 spans, 22 attributes, 300+ metrics, 10 Grafana dashboards, and log-trace correlation work correctly under realistic load. This phase establishes a reusable CI-integrated validation suite and performance benchmark baseline.

### Architecture

```mermaid
flowchart LR
    subgraph harness["Docker Compose Workload Harness"]
        direction TB
        V1["Validator 1"] ~~~ V2["Validator 2"] ~~~ V3["Validator 3"]
        V4["Validator 4"] ~~~ V5["Validator 5"]
    end

    subgraph generators["Workload Generators"]
        RPC["RPC Load Generator<br/>(configurable RPS,<br/>command distribution)"]
        TX["Transaction Submitter<br/>(Payment, Offer, NFT,<br/>Escrow, AMM mix)"]
    end

    subgraph validation["Validation Suite"]
        SV["Span Validator<br/>(Jaeger/Tempo API)"]
        MV["Metric Validator<br/>(Prometheus API)"]
        LV["Log-Trace Validator<br/>(Loki API)"]
        DV["Dashboard Validator<br/>(Grafana API)"]
        BM["Benchmark Suite<br/>(CPU, memory, latency<br/>ON vs OFF comparison)"]
    end

    generators --> harness
    harness --> validation

    style harness fill:#1a2633,color:#ccc,stroke:#4a90d9
    style generators fill:#1a3320,color:#ccc,stroke:#5cb85c
    style validation fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style V1 fill:#4a90d9,color:#fff,stroke:#2a6db5
    style V2 fill:#4a90d9,color:#fff,stroke:#2a6db5
    style V3 fill:#4a90d9,color:#fff,stroke:#2a6db5
    style V4 fill:#4a90d9,color:#fff,stroke:#2a6db5
    style V5 fill:#4a90d9,color:#fff,stroke:#2a6db5
    style RPC fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style TX fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style SV fill:#f0ad4e,color:#000,stroke:#c78c2e
    style MV fill:#f0ad4e,color:#000,stroke:#c78c2e
    style LV fill:#f0ad4e,color:#000,stroke:#c78c2e
    style DV fill:#f0ad4e,color:#000,stroke:#c78c2e
    style BM fill:#f0ad4e,color:#000,stroke:#c78c2e
```

### Tasks

| Task | Description                            |
| ---- | -------------------------------------- |
| 10.1 | Multi-node test harness (5 validators) |
| 10.2 | RPC load generator                     |
| 10.3 | Transaction submitter (6+ tx types)    |
| 10.4 | Telemetry validation suite             |
| 10.5 | Performance benchmark suite            |
| 10.6 | CI integration                         |
| 10.7 | Documentation                          |

See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown.

### Exit Criteria

- [ ] 5-node validator cluster starts and reaches consensus in docker-compose
- [ ] Validation suite confirms all 16 spans, 22 attributes, 300+ metrics
- [ ] All 10 Grafana dashboards render data (no empty panels)
- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- [ ] CI workflow runs validation on telemetry branch changes

---

## 6.8.4 Phase 11: Third-Party Data Collection Pipelines (Weeks 18-20) — Future Enhancement

> **Status**: Planned, not yet implemented.

### Motivation

rippled has no native Prometheus/OTLP metrics export for data accessible only via JSON-RPC (`server_info`, `get_counts`, `fee`, `peers`, `validators`, `feature`).
Every external consumer (exchanges, payment processors, analytics providers, validators, compliance firms, DeFi protocols, researchers, custodians, and CBDC platforms) must build custom JSON-RPC polling and conversion pipelines. This phase centralizes that work into a reusable custom OTel Collector receiver.

### Architecture

```mermaid
flowchart LR
    subgraph receiver["Custom OTel Collector Receiver (Go)"]
        direction TB
        SI["server_info<br/>collector"]
        GC["get_counts<br/>collector"]
        FE["fee<br/>collector"]
        PE["peers<br/>collector"]
        VA["validators<br/>collector"]
        DX["DEX/AMM<br/>collector<br/>(optional)"]
    end

    rippled["rippled<br/>Admin RPC<br/>:5005"] -->|"JSON-RPC<br/>poll every 30s"| receiver
    receiver -->|"xrpl_* metrics"| PROM["Prometheus<br/>:9090"]
    receiver -->|"OTLP export"| OTLP["Any OTLP-<br/>compatible<br/>backend"]
    PROM --> GF["Grafana<br/>4 new dashboards"]
    PROM --> AL["Prometheus<br/>Alerting Rules"]

    style receiver fill:#1a3320,color:#ccc,stroke:#5cb85c
    style SI fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style GC fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style FE fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style PE fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style VA fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style DX fill:#449d44,color:#fff,stroke:#2d6e2d
    style rippled fill:#4a90d9,color:#fff,stroke:#2a6db5
    style PROM fill:#f0ad4e,color:#000,stroke:#c78c2e
    style OTLP fill:#f0ad4e,color:#000,stroke:#c78c2e
    style GF fill:#5bc0de,color:#000,stroke:#3aa8c1
    style AL fill:#d9534f,color:#fff,stroke:#b52d2d
```

### Third-Party Consumer Gap Analysis

| Consumer Category      | Data Unlocked by Phase 11                                    |
| ---------------------- | ------------------------------------------------------------ |
| Exchanges              | Real-time fee estimates, TxQ capacity, server health scores  |
| Payment Processors     | Settlement latency percentiles, corridor health              |
| Analytics Providers    | Validator metrics, network topology, amendment voting status |
| DeFi / AMM             | AMM pool TVL, DEX order book depth, trade volumes            |
| Validators / Operators | Per-peer latency, version distribution, UNL health, alerting |
| Compliance             | Transaction volume trends, network growth metrics            |
| Academic Researchers   | Consensus performance time-series, decentralization metrics  |
| CBDC / Tokenization    | Token supply tracking, trust line adoption, freeze status    |
| Institutional Custody  | Multi-sig status, escrow tracking, reserve calculations      |
| Wallet Providers       | Server health for node selection, fee prediction data        |

### Tasks

| Task  | Description                           |
| ----- | ------------------------------------- |
| 11.1  | OTel Collector receiver scaffold (Go) |
| 11.2  | server_info / server_state collector  |
| 11.3  | get_counts collector                  |
| 11.4  | Peer topology collector               |
| 11.5  | Validator & amendment collector       |
| 11.6  | Fee & TxQ collector                   |
| 11.7  | DEX & AMM collector (optional)        |
| 11.8  | Prometheus alerting rules             |
| 11.9  | New Grafana dashboards (4)            |
| 11.10 | Integration with Phase 10 validation  |
| 11.11 | Documentation                         |

See [Phase11_taskList.md](./Phase11_taskList.md) for detailed per-task breakdown.

### Exit Criteria

- [ ] Custom OTel Collector receiver exports all `xrpl_*` metrics to Prometheus
- [ ] 4 new Grafana dashboards operational (Validator Health, Network Topology, Fee Market, DEX/AMM)
- [ ] Prometheus alerting rules fire correctly for simulated failures
- [ ] Receiver handles rippled restart/unavailability gracefully
- [ ] Go receiver has unit tests with >80% coverage

---

## 6.9 Risk Assessment

```mermaid
quadrantChart
    title Risk Assessment Matrix
    x-axis Low Impact --> High Impact
    y-axis Low Likelihood --> High Likelihood
    quadrant-1 Mitigate Immediately
    quadrant-2 Plan Mitigation
    quadrant-3 Accept Risk
    quadrant-4 Monitor Closely
    SDK Compat: [0.2, 0.18]
    Protocol Chg: [0.75, 0.72]
    Perf Overhead: [0.58, 0.42]
    Context Prop: [0.4, 0.55]
    Memory Leaks: [0.85, 0.25]
```

### Risk Details

| Risk                                 | Likelihood | Impact | Mitigation                              |
| ------------------------------------ | ---------- | ------ | --------------------------------------- |
| Protocol changes break compatibility | Medium     | High   | Use high field numbers, optional fields |
| Performance overhead unacceptable    | Medium     | Medium | Sampling, conditional compilation       |
| Context propagation complexity       | Medium     | Medium | Phased rollout, extensive testing       |
| SDK compatibility issues             | Low        | Medium | Pin SDK version, fallback to no-op      |
| Memory leaks in long-running nodes   | Low        | High   | Memory profiling, bounded queues        |

---

## 6.10 Success Metrics

| Metric                   | Target                                                         | Measurement           |
| ------------------------ | -------------------------------------------------------------- | --------------------- |
| Trace coverage           | >95% of transaction code paths (independent of sampling ratio) | Sampling verification |
| CPU overhead             | <3%                                                            | Benchmark tests       |
| Memory overhead          | <10 MB                                                         | Memory profiling      |
| Latency impact (p99)     | <2%                                                            | Performance tests     |
| Trace completeness       | >99% spans with required attrs                                 | Validation script     |
| Cross-node trace linkage | >90% of multi-hop transactions                                 | Integration tests     |

---

## 6.11 Quick Wins and Crawl-Walk-Run Strategy

> **TxQ** = Transaction Queue

This section outlines a prioritized approach to maximize ROI with minimal initial investment.

### 6.11.1 Crawl-Walk-Run Overview

```mermaid
flowchart TB
    subgraph crawl["🐢 CRAWL (Week 1-2)"]
        direction LR
        c1[Core SDK Setup] ~~~ c2[RPC Tracing Only] ~~~ c3[PathFinding + TxQ Tracing] ~~~ c4[Single Node]
    end

    subgraph walk["🚶 WALK (Week 3-5)"]
        direction LR
        w1[Transaction Tracing] ~~~ w2[Fee Escalation Tracing] ~~~ w3[Cross-Node Context] ~~~ w4[Basic Dashboards]
    end

    subgraph run["🏃 RUN (Week 6-9)"]
        direction LR
        r1[Consensus Tracing] ~~~ r2[Validator, Amendment,<br/>SHAMap Tracing] ~~~ r3[Full Correlation] ~~~ r4[Production Deploy]
    end

    crawl --> walk --> run

    style crawl fill:#1b5e20,stroke:#0d3d14,color:#fff
    style walk fill:#bf360c,stroke:#8c2809,color:#fff
    style run fill:#0d47a1,stroke:#082f6a,color:#fff
    style c1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style c2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style c3 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style c4 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style w1 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style w2 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style w3 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style w4 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style r1 fill:#0d47a1,stroke:#082f6a,color:#fff
    style r2 fill:#0d47a1,stroke:#082f6a,color:#fff
    style r3 fill:#0d47a1,stroke:#082f6a,color:#fff
    style r4 fill:#0d47a1,stroke:#082f6a,color:#fff
```

**Reading the diagram:**

- **CRAWL (Weeks 1-2)**: Minimal investment -- set up the SDK, instrument RPC and PathFinding/TxQ handlers, and verify on a single node. Delivers immediate latency visibility.
- **WALK (Weeks 3-5)**: Expand to transaction lifecycle tracing, fee escalation, cross-node context propagation, and basic Grafana dashboards. This is where distributed tracing starts working.
- **RUN (Weeks 6-9)**: Full consensus instrumentation, validator/amendment/SHAMap tracing, end-to-end correlation, and production deployment with sampling and alerting.
- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.

### 6.11.2 Quick Wins (Immediate Value)

| Quick Win                      | Value  | When to Deploy |
| ------------------------------ | ------ | -------------- |
| **RPC Command Tracing**        | High   | Week 2         |
| **RPC Latency Histograms**     | High   | Week 2         |
| **Error Rate Dashboard**       | Medium | Week 2         |
| **Transaction Submit Tracing** | High   | Week 3         |
| **Consensus Round Duration**   | Medium | Week 6         |

### 6.11.3 CRAWL Phase (Weeks 1-2)

**Goal**: Get basic tracing working with minimal code changes.

**What You Get**:

- RPC request/response traces for all commands
- Latency breakdown per RPC command
- PathFinding and TxQ tracing (directly impacts RPC latency)
- Error visibility with stack traces
- Basic Grafana dashboard

**Code Changes**: ~15 lines in `ServerHandler.cpp`, ~40 lines in new telemetry module

**Why Start Here**:

- RPC is the lowest-risk, highest-visibility component
- PathFinding and TxQ are RPC-adjacent and directly affect latency
- Immediate value for debugging client issues
- No cross-node complexity
- Single file modification to existing code

### 6.11.4 WALK Phase (Weeks 3-5)

**Goal**: Add transaction lifecycle tracing across nodes.

**What You Get**:

- End-to-end transaction traces from submit to relay
- Fee escalation tracing within the transaction pipeline
- Cross-node correlation (see transaction path)
- HashRouter deduplication visibility
- Relay latency metrics

**Code Changes**: ~120 lines across 4 files, plus protobuf extension

**Why Do This Second**:

- Builds on RPC tracing (transactions submitted via RPC)
- Fee escalation is integral to the transaction processing pipeline
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues

### 6.11.5 RUN Phase (Weeks 6-9)

**Goal**: Full observability including consensus.

**What You Get**:

- Complete consensus round visibility
- Phase transition timing
- Validator proposal tracking
- Validator list and manifest tracing
- Amendment voting tracing
- SHAMap sync tracing
- Full end-to-end traces (client → RPC → TX → consensus → ledger)

**Code Changes**: ~100 lines across 3 consensus files, plus validator/amendment/SHAMap modules

**Why Do This Last**:

- Highest complexity (consensus is critical path)
- Validator, amendment, and SHAMap components are lower priority
- Requires thorough testing
- Lower relative value (consensus issues are rarer)

### 6.11.6 ROI Prioritization Matrix

```mermaid
quadrantChart
    title Implementation ROI Matrix
    x-axis Low Effort --> High Effort
    y-axis Low Value --> High Value
    quadrant-1 Major Projects - Plan Carefully
    quadrant-2 Quick Wins - Do First
    quadrant-3 Nice to Have - Optional
    quadrant-4 Time Sinks - Avoid
    RPC Tracing: [0.15, 0.92]
    TX Submit Trace: [0.3, 0.78]
    TX Relay Trace: [0.5, 0.88]
    Consensus Trace: [0.72, 0.72]
    Peer Msg Trace: [0.85, 0.3]
    Ledger Acquire: [0.55, 0.52]
```

---

## 6.12 Definition of Done

> **TxQ** = Transaction Queue | **HA** = High Availability

Clear, measurable criteria for each phase.
### 6.12.1 Phase 1: Core Infrastructure

| Criterion       | Measurement                                                | Target                       |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
| SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles                  |
| Runtime Toggle  | `enabled=0` produces zero overhead                         | <0.1% CPU difference         |
| Span Creation   | Unit test creates and exports span                         | Span appears in Tempo        |
| Configuration   | All config options parsed correctly                        | Config validation tests pass |
| Documentation   | Developer guide exists                                     | PR approved                  |

**Definition of Done**: All criteria met, PR merged, no regressions in CI.

### 6.12.2 Phase 2: RPC Tracing

| Criterion          | Measurement                        | Target                     |
| ------------------ | ---------------------------------- | -------------------------- |
| Coverage           | All RPC commands instrumented      | 100% of commands           |
| Context Extraction | traceparent header propagates      | Integration test passes    |
| Attributes         | Command, status, duration recorded | Validation script confirms |
| Performance        | RPC latency overhead               | <1ms p99                   |
| Dashboard          | Grafana dashboard deployed         | Screenshot in docs         |

**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.

### 6.12.3 Phase 3: Transaction Tracing

| Criterion        | Measurement                     | Target                             |
| ---------------- | ------------------------------- | ---------------------------------- |
| Local Trace      | Submit → validate → TxQ traced  | Single-node test passes            |
| Cross-Node       | Context propagates via protobuf | Multi-node test passes             |
| Relay Visibility | relay_count attribute correct   | Spot check 100 txs                 |
| HashRouter       | Deduplication visible in trace  | Duplicate txs show suppressed=true |
| Performance      | TX throughput overhead          | <5% degradation                    |

**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.
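Both the Phase 2 `traceparent` criterion and the Phase 3 cross-node criterion depend on carrying a trace ID and parent span ID between processes. As a reference point for an integration test, here is a minimal sketch of parsing and re-emitting a W3C Trace Context `traceparent` value. It accepts only the four-field version `00` layout, and the names `TraceContext`, `parse_traceparent`, and `format_traceparent` are hypothetical helpers, not part of rippled or any SDK.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex), lowercase.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a node would send with its own span as the parent."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A propagation test can then assert that the trace ID observed downstream equals the one sent by the client, while the parent span ID changes at each hop.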
### 6.12.4 Phase 4: Consensus Tracing

| Criterion            | Measurement                   | Target                    |
| -------------------- | ----------------------------- | ------------------------- |
| Round Tracing        | startRound creates root span  | Unit test passes          |
| Phase Visibility     | All phases have child spans   | Integration test confirms |
| Proposer Attribution | Proposer ID in attributes     | Spot check 50 rounds      |
| Timing Accuracy      | Phase durations match PerfLog | <5% variance              |
| No Consensus Impact  | Round timing unchanged        | Performance test passes   |

**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

### 6.12.5 Phase 5: Production Deployment

| Criterion    | Measurement                  | Target                     |
| ------------ | ---------------------------- | -------------------------- |
| Collector HA | Multiple collectors deployed | No single point of failure |
| Sampling     | Tail sampling configured     | 10% base + errors + slow   |
| Retention    | Data retained per policy     | 7 days hot, 30 days warm   |
| Alerting     | Alerts configured            | Error spike, high latency  |
| Runbook      | Operator documentation       | Approved by ops team       |
| Training     | Team trained                 | Session completed          |

**Definition of Done**: Telemetry running in production, operators trained, alerts active.
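The sampling target in the Phase 5 table ("10% base + errors + slow") is a tail-sampling rule: the keep/drop decision is made after a trace completes. In a deployment this would normally be configured in the OTel Collector rather than hand-coded, so the sketch below only illustrates the decision logic. The `TraceSummary` type, the `keep_trace` function, and the 1000 ms slow threshold are assumptions; the plan does not define what counts as "slow".

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str       # hex trace ID
    has_error: bool     # True if any span in the trace ended in error
    duration_ms: float  # end-to-end duration of the root span


def keep_trace(
    trace: TraceSummary,
    base_ratio: float = 0.10,           # "10% base" from the Phase 5 target
    slow_threshold_ms: float = 1000.0,  # assumed value, not from the plan
) -> bool:
    """Decide whether a completed trace should be retained."""
    # Errors and slow traces are always kept.
    if trace.has_error or trace.duration_ms >= slow_threshold_ms:
        return True
    # Otherwise keep a deterministic fraction, derived from the trace ID so
    # every collector instance reaches the same decision for the same trace.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(base_ratio * 2**64)
```

Hashing the trace ID instead of drawing a random number matters for the "Collector HA" criterion: with multiple collectors deployed, a random draw could keep some spans of a trace on one instance and drop the rest on another.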
### 6.12.6 Success Metrics Summary

| Phase    | Primary Metric                   | Secondary Metric            | Deadline       | Status             |
| -------- | -------------------------------- | --------------------------- | -------------- | ------------------ |
| Phase 1  | SDK compiles and runs            | Zero overhead when disabled | End of Week 2  | Active             |
| Phase 2  | 100% RPC coverage                | <1ms latency overhead       | End of Week 4  | Active             |
| Phase 3  | Cross-node traces work           | <5% throughput impact       | End of Week 6  | Active             |
| Phase 4  | Consensus fully traced           | No consensus timing impact  | End of Week 8  | Active             |
| Phase 5  | Production deployment            | Operators trained           | End of Week 9  | Active             |
| Phase 6  | StatsD metrics in Prometheus     | 3 dashboards operational    | End of Week 10 | Active             |
| Phase 7  | All metrics via OTLP             | No StatsD dependency        | End of Week 12 | Active             |
| Phase 8  | trace_id in logs + Loki          | Tempo↔Loki correlation      | End of Week 13 | Active             |
| Phase 9  | 68+ new internal metrics in Prom | 2 new dashboards            | End of Week 15 | Future Enhancement |
| Phase 10 | Full telemetry stack validated   | < 3% CPU overhead proven    | End of Week 17 | Future Enhancement |
| Phase 11 | Third-party metrics via receiver | 4 new dashboards + alerting | End of Week 20 | Future Enhancement |

---

## 6.13 Recommended Implementation Order

Based on ROI analysis, implement in this exact order:

```mermaid
flowchart TB
    subgraph week1["Week 1"]
        t1[1. OpenTelemetry SDK<br/>Conan/CMake integration]
        t2[2. Telemetry interface<br/>SpanGuard, config]
    end

    subgraph week2["Week 2"]
        t3[3. RPC ServerHandler<br/>instrumentation]
        t4[4. Basic Tempo setup<br/>for testing]
    end

    subgraph week3["Week 3"]
        t5[5. Transaction submit<br/>tracing]
        t6[6. Grafana dashboard<br/>v1]
    end

    subgraph week4["Week 4"]
        t7[7. Protobuf context<br/>extension]
        t8[8. PeerImp tx.relay<br/>instrumentation]
    end

    subgraph week5["Week 5"]
        t9[9. Multi-node<br/>integration tests]
        t10[10. Performance<br/>benchmarks]
    end

    subgraph week6_8["Weeks 6-8"]
        t11[11. Consensus<br/>instrumentation]
        t12[12. Full integration<br/>testing]
    end

    subgraph week9["Week 9"]
        t13[13. Production<br/>deployment]
        t14[14. Documentation<br/>& training]
    end

    t1 --> t2 --> t3 --> t4
    t4 --> t5 --> t6
    t6 --> t7 --> t8
    t8 --> t9 --> t10
    t10 --> t11 --> t12
    t12 --> t13 --> t14

    style week1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style week2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style week3 fill:#bf360c,stroke:#8c2809,color:#fff
    style week4 fill:#bf360c,stroke:#8c2809,color:#fff
    style week5 fill:#bf360c,stroke:#8c2809,color:#fff
    style week6_8 fill:#0d47a1,stroke:#082f6a,color:#fff
    style week9 fill:#4a148c,stroke:#2e0d57,color:#fff
    style t1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style t2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style t3 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style t4 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style t5 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t6 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t7 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t8 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t9 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t10 fill:#ffe0b2,stroke:#ffcc80,color:#1e293b
    style t11 fill:#0d47a1,stroke:#082f6a,color:#fff
    style t12 fill:#0d47a1,stroke:#082f6a,color:#fff
    style t13 fill:#4a148c,stroke:#2e0d57,color:#fff
    style t14 fill:#4a148c,stroke:#2e0d57,color:#fff
```

**Reading the diagram:**

- **Week 1 (tasks 1-2)**: Foundation work -- integrate the OpenTelemetry SDK via Conan/CMake and build the `Telemetry` interface with `SpanGuard` and config parsing.
- **Week 2 (tasks 3-4)**: First observable output -- instrument `ServerHandler` for RPC tracing and stand up Tempo so developers can see traces immediately.
- **Weeks 3-5 (tasks 5-10)**: Transaction lifecycle -- add submit tracing, build the first Grafana dashboard, extend protobuf for cross-node context, instrument `PeerImp` relay, then validate with multi-node integration tests and performance benchmarks.
- **Weeks 6-8 (tasks 11-12)**: Consensus deep-dive -- instrument consensus rounds and phases, then run full integration testing across all instrumented paths.
- **Week 9 (tasks 13-14)**: Go-live -- deploy to production with sampling/alerting configured, and deliver documentation and operator training.
- **Arrow chain (t1 → ... → t14)**: Strict sequential dependency; each task's output is a prerequisite for the next.

---

_Previous: [Configuration Reference](./05-configuration-reference.md)_ | _Next: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_