From f3a095ab653fdbbf1bf9f1fc9a94837d0552c535 Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Thu, 14 May 2026 16:09:48 +0100 Subject: [PATCH 1/2] docs(telemetry): align Phase 1a plan docs with Phase 1b implementation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete" status with exit-criteria checkboxes marked done. Every downstream branch through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory, never parses the advertised keys, and the Phase 4 work is not yet delivered. Fixes: - 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned Upgrade)" with a concrete checklist (Conan dep, config parsing, factory branch, runbook/dashboard updates) for landing the gRPC transport later. - 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key and the :4317 default from the sample config block and the options-summary table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment into a new "Planned (not yet implemented)" table citing the phase that will add each one. Keep the example config minimal so copy-paste does not produce a silently-ignored stanza. - 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a Complete" to "Phase 4a Plan" and describe the work as future. Replace the broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR) with a sentence pointing readers to where that spec will land. Renumber the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11 ("Effort Summary") was intentionally removed in earlier edits. --- OpenTelemetryPlan/02-design-decisions.md | 74 ++++++++++++------- .../05-configuration-reference.md | 70 ++++++++++-------- OpenTelemetryPlan/06-implementation-phases.md | 20 ++--- 3 files changed, 97 insertions(+), 67 deletions(-) diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md index fe87fc78db..184b40bca2 100644 --- a/OpenTelemetryPlan/02-design-decisions.md +++ b/OpenTelemetryPlan/02-design-decisions.md @@ -13,13 +13,13 @@ **Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`) -| Component | Purpose | Required | -| --------------------------------------- | ---------------------- | ----------- | -| `opentelemetry-cpp::api` | Tracing API headers | Yes | -| `opentelemetry-cpp::sdk` | SDK implementation | Yes | -| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes | -| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Recommended | -| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Alternative | +| Component | Purpose | Required | +| --------------------------------------- | ---------------------- | ------------------------- | +| `opentelemetry-cpp::api` | Tracing API headers | Yes | +| `opentelemetry-cpp::sdk` | SDK implementation | Yes | +| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes | +| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Yes (shipped in Phase 1b) | +| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Future (not yet wired up) | ### 2.1.2 Instrumentation Strategy @@ -51,9 +51,9 @@ flowchart TB elastic["Elastic
APM"] end - node1 -->|"OTLP/gRPC
:4317"| collector - node2 -->|"OTLP/gRPC
:4317"| collector - node3 -->|"OTLP/gRPC
:4317"| collector + node1 -->|"OTLP/HTTP
:4318"| collector + node2 -->|"OTLP/HTTP
:4318"| collector + node3 -->|"OTLP/HTTP
:4318"| collector collector --> tempo collector --> elastic @@ -65,27 +65,15 @@ flowchart TB **Reading the diagram:** -- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/gRPC on port 4317. +- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/HTTP on port 4318 (the only exporter shipped in Phase 1b). - **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing. - **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends. -- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends. +- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over HTTP, then the Collector fans out to the configured backends. -### 2.2.1 OTLP/gRPC (Recommended) +### 2.2.1 OTLP/HTTP (Shipped in Phase 1b) ```cpp -// Configuration for OTLP over gRPC -namespace otlp = opentelemetry::exporter::otlp; - -otlp::OtlpGrpcExporterOptions opts; -opts.endpoint = "localhost:4317"; -opts.useTls = true; -opts.sslCaCertPath = "/path/to/ca.crt"; -``` - -### 2.2.2 OTLP/HTTP (Alternative) - -```cpp -// Configuration for OTLP over HTTP +// Configuration for OTLP over HTTP (the only exporter currently wired up). namespace otlp = opentelemetry::exporter::otlp; otlp::OtlpHttpExporterOptions opts; @@ -93,6 +81,40 @@ opts.url = "http://localhost:4318/v1/traces"; opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary ``` +### 2.2.2 OTLP/gRPC (Future Work — Planned Upgrade) + +OTLP/gRPC is planned as a future upgrade from the HTTP exporter. The gRPC +transport offers lower per-span overhead and tighter back-pressure semantics +than HTTP/JSON, making it attractive for production deployments once the HTTP +path is validated in earlier phases. + +Required to land this upgrade: + +1. Add `opentelemetry-cpp::otlp_grpc_exporter` to the Conan recipe (the + dependency already exists but is not linked in Phase 1b builds). +2. Extend `TelemetryConfig.cpp` to parse an `exporter` key (`otlp_http` + default, `otlp_grpc` opt-in) and a gRPC endpoint override. +3. In `Telemetry::start()` branch on the parsed exporter type and construct + either `OtlpHttpExporterFactory::Create(httpOpts)` or + `OtlpGrpcExporterFactory::Create(grpcOpts)` accordingly. +4. Update the runbook and dashboards to document the alternate port and TLS + settings. + +Example Phase 1b+ gRPC configuration (when wired up): + +```cpp +// Configuration for OTLP over gRPC (future work). +namespace otlp = opentelemetry::exporter::otlp; + +otlp::OtlpGrpcExporterOptions opts; +opts.endpoint = ":4317"; +opts.use_ssl_credentials = true; +opts.ssl_credentials_cacert_path = "/path/to/ca.crt"; +``` + +Until that work lands, `OtlpGrpcExporterOptions` is **not** used by any code +path in Phase 1b through Phase 5. + --- ## 2.3 Span Naming Conventions diff --git a/OpenTelemetryPlan/05-configuration-reference.md b/OpenTelemetryPlan/05-configuration-reference.md index 56627c3b6c..6c7161be7b 100644 --- a/OpenTelemetryPlan/05-configuration-reference.md +++ b/OpenTelemetryPlan/05-configuration-reference.md @@ -26,11 +26,10 @@ Add to `cfg/xrpld-example.cfg`: # # Enable/disable telemetry (default: 0 = disabled) # enabled=1 # -# # Exporter type: "otlp_grpc" (default), "otlp_http", or "none" -# exporter=otlp_grpc -# -# # OTLP endpoint (default: localhost:4317 for gRPC, localhost:4318 for HTTP) -# endpoint=localhost:4317 +# # OTLP endpoint (default: http://localhost:4318/v1/traces - OTLP/HTTP) +# # Note: only OTLP/HTTP is shipped in Phase 1b. OTLP/gRPC support is +# # planned as future work and is not yet parsed by TelemetryConfig.cpp. +# endpoint=http://localhost:4318/v1/traces # # # Use TLS for exporter connection (default: 0) # use_tls=0 @@ -56,10 +55,12 @@ Add to `cfg/xrpld-example.cfg`: # trace_rpc=1 # RPC request handling # trace_peer=0 # Peer messages (high volume, disabled by default) # trace_ledger=1 # Ledger acquisition and building -# trace_pathfind=1 # Path computation (can be expensive) -# trace_txq=1 # Transaction queue and fee escalation -# trace_validator=0 # Validator list and manifest updates (low volume) -# trace_amendment=0 # Amendment voting (very low volume) +# +# # Planned (not yet parsed by TelemetryConfig.cpp): +# # trace_pathfind=1 # Path computation (Phase 2) +# # trace_txq=1 # Transaction queue (Phase 3) +# # trace_validator=0 # Validator list / manifest (future) +# # trace_amendment=0 # Amendment voting (future) # # # Service identification (automatically detected if not specified) # # service_name=xrpld @@ -71,28 +72,35 @@ enabled=0 ### 5.1.2 Configuration Options Summary -| Option | Type | Default | Description | -| --------------------- | ------ | ---------------- | ----------------------------------------- | -| `enabled` | bool | `false` | Enable/disable telemetry | -| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none | -| `endpoint` | string | `localhost:4317` | OTLP collector endpoint | -| `use_tls` | bool | `false` | Enable TLS for exporter connection | -| `tls_ca_cert` | string | `""` | Path to CA certificate file | -| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) | -| `batch_size` | uint | `512` | Spans per export batch | -| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) | -| `max_queue_size` | uint | `2048` | Maximum queued spans | -| `trace_transactions` | bool | `true` | Enable transaction tracing | -| `trace_consensus` | bool | `true` | Enable consensus tracing | -| `trace_rpc` | bool | `true` | Enable RPC tracing | -| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) | -| `trace_ledger` | bool | `true` | Enable ledger tracing | -| `trace_pathfind` | bool | `true` | Enable path computation tracing | -| `trace_txq` | bool | `true` | Enable transaction queue tracing | -| `trace_validator` | bool | `false` | Enable validator list/manifest tracing | -| `trace_amendment` | bool | `false` | Enable amendment voting tracing | -| `service_name` | string | `"xrpld"` | Service name for traces | -| `service_instance_id` | string | `` | Instance identifier | +| Option | Type | Default | Description | +| --------------------- | ------ | --------------------------------- | ----------------------------------------- | +| `enabled` | bool | `false` | Enable/disable telemetry | +| `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint | +| `use_tls` | bool | `false` | Enable TLS for exporter connection | +| `tls_ca_cert` | string | `""` | Path to CA certificate file | +| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) | +| `batch_size` | uint | `512` | Spans per export batch | +| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) | +| `max_queue_size` | uint | `2048` | Maximum queued spans | +| `trace_transactions` | bool | `true` | Enable transaction tracing | +| `trace_consensus` | bool | `true` | Enable consensus tracing | +| `trace_rpc` | bool | `true` | Enable RPC tracing | +| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) | +| `trace_ledger` | bool | `true` | Enable ledger tracing | +| `service_name` | string | `"xrpld"` | Service name for traces | +| `service_instance_id` | string | `` | Instance identifier | + +**Planned (not yet implemented)**: the following options appear in the design +documents but are not parsed by `TelemetryConfig.cpp` in Phase 1b and later +phases. They will be added as the corresponding subsystems are instrumented: + +| Option | Planned Phase | Purpose | +| ----------------- | ------------- | ---------------------------------------- | +| `exporter` | Future | Select between OTLP/HTTP and OTLP/gRPC | +| `trace_pathfind` | Phase 2 | Path computation tracing toggle | +| `trace_txq` | Phase 3 | Transaction queue tracing toggle | +| `trace_validator` | Future | Validator list / manifest update tracing | +| `trace_amendment` | Future | Amendment voting tracing | --- diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index ccf1fd54d4..12eea9c67b 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -166,22 +166,21 @@ gantt ### Exit Criteria -- [x] Complete consensus round traces -- [x] Phase transitions visible -- [x] Proposals and validations traced -- [x] No impact on consensus timing +- [ ] Complete consensus round traces +- [ ] Phase transitions visible +- [ ] Proposals and validations traced +- [ ] No impact on consensus timing - [ ] Multi-validator test network validated -### Implementation Status — Phase 4a Complete +### Implementation Status — Phase 4a Plan -Phase 4a (establish-phase gap fill & cross-node correlation) adds: +Phase 4a (establish-phase gap fill & cross-node correlation) will add: - **Deterministic trace ID** derived from `previousLedger.id()` so all validators in the same round share the same `trace_id` (switchable via `consensus_trace_strategy` config: `"deterministic"` or `"attribute"`). See [Configuration Reference](./05-configuration-reference.md) for full - configuration options. The `consensus_trace_strategy` option will be - documented in the configuration reference as part of Phase 4a implementation. + configuration options. - **Round lifecycle spans**: `consensus.round` with round-to-round span links. - **Establish phase**: `consensus.establish`, `consensus.update_positions` (with `dispute.resolve` events), `consensus.check` (with threshold tracking). @@ -192,7 +191,8 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds: (`startRoundTracing`, `createValidationSpan`, `startEstablishTracing`, `updateEstablishTracing`, `endEstablishTracing`). -See [Phase4_taskList.md](./Phase4_taskList.md) for the full spec and implementation notes. +The `Phase4_taskList.md` spec document is introduced in the Phase 2 PR (#6424) +and will contain the full task breakdown and implementation notes. --- @@ -490,7 +490,7 @@ Clear, measurable criteria for each phase. --- -## 6.12 Recommended Implementation Order +## 6.11 Recommended Implementation Order Based on ROI analysis, implement in this exact order: From dec8b0a9a1120cfa93a3e38f86193dfc17ffaa7c Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Thu, 14 May 2026 16:34:58 +0100 Subject: [PATCH 2/2] docs(telemetry): fix stale RPC span names + drop volatile line numbers in runbook - RPC Spans table: `rpc.request` was documented but the code actually emits `rpc.http_request`. Listed the actual emitted names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`) and their parent/child relationship. - Drop `:` suffixes from Source File columns in both RPC and Transaction span tables. Line numbers drift with every refactor; the filename is enough for operators to grep. - Summary table: replace the never-emitted `rpc.request` row with the real entry points so `span_name=` filters in PromQL / TraceQL match. --- docs/telemetry-runbook.md | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md index b700b85073..52febc148a 100644 --- a/docs/telemetry-runbook.md +++ b/docs/telemetry-runbook.md @@ -64,19 +64,20 @@ All spans instrumented in xrpld, grouped by subsystem: ### RPC Spans (Phase 2) -| Span Name | Source File | Attributes | Description | -| -------------------- | --------------------- | -------------------------------- | -------------------------------------------------- | -| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request | -| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) | -| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message | -| `rpc.command.` | RPCHandler.cpp:161 | `command`, `version`, `rpc_role` | Per-command span (e.g., `rpc.command.server_info`) | +| Span Name | Source File | Attributes | Description | +| -------------------- | ----------------- | -------------------------------- | ----------------------------------------------------- | +| `rpc.http_request` | ServerHandler.cpp | — | Top-level HTTP RPC request | +| `rpc.ws_upgrade` | ServerHandler.cpp | — | WebSocket upgrade handshake | +| `rpc.ws_message` | ServerHandler.cpp | — | WebSocket RPC message | +| `rpc.process` | ServerHandler.cpp | — | RPC processing (child of rpc.http_request/ws_message) | +| `rpc.command.` | RPCHandler.cpp | `command`, `version`, `rpc_role` | Per-command span (e.g., `rpc.command.server_info`) | ### Transaction Spans (Phase 3) -| Span Name | Source File | Attributes | Description | -| ------------ | ------------------- | ------------------------------------------------------------------------- | ------------------------------------- | -| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing | -| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay | +| Span Name | Source File | Attributes | Description | +| ------------ | -------------- | ------------------------------------------------------------------------- | ------------------------------------- | +| `tx.process` | NetworkOPs.cpp | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing | +| `tx.receive` | PeerImp.cpp | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay | ### Transaction Queue Spans (Phase 3) @@ -295,7 +296,9 @@ Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`: | Span Name | Prometheus Metric Filter | Grafana Dashboard | | ------------------------------ | -------------------------------------------- | --------------------------------------------- | -| `rpc.request` | `{span_name="rpc.request"}` | -- (available but not paneled) | +| `rpc.http_request` | `{span_name="rpc.http_request"}` | -- (available but not paneled) | +| `rpc.ws_upgrade` | `{span_name="rpc.ws_upgrade"}` | -- (available but not paneled) | +| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | -- (available but not paneled) | | `rpc.process` | `{span_name="rpc.process"}` | -- (available but not paneled) | | `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (all 4 panels) | | `tx.process` | `{span_name="tx.process"}` | Transaction Overview (3 panels) |