diff --git a/OpenTelemetryPlan/02-design-decisions.md b/OpenTelemetryPlan/02-design-decisions.md
index 681381ace5..5d738b8172 100644
--- a/OpenTelemetryPlan/02-design-decisions.md
+++ b/OpenTelemetryPlan/02-design-decisions.md
@@ -13,13 +13,13 @@

 **Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`)

-| Component | Purpose | Required |
-| --------------------------------------- | ---------------------- | ----------- |
-| `opentelemetry-cpp::api` | Tracing API headers | Yes |
-| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
-| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
-| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Recommended |
-| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Alternative |
+| Component | Purpose | Required |
+| --------------------------------------- | ---------------------- | ------------------------- |
+| `opentelemetry-cpp::api` | Tracing API headers | Yes |
+| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
+| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
+| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Yes (shipped in Phase 1b) |
+| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Future (not yet wired up) |

 ### 2.1.2 Instrumentation Strategy

@@ -51,9 +51,9 @@ flowchart TB
         elastic["Elastic<br/>APM"]
     end

-    node1 -->|"OTLP/gRPC<br/>:4317"| collector
-    node2 -->|"OTLP/gRPC<br/>:4317"| collector
-    node3 -->|"OTLP/gRPC<br/>:4317"| collector
+    node1 -->|"OTLP/HTTP<br/>:4318"| collector
+    node2 -->|"OTLP/HTTP<br/>:4318"| collector
+    node3 -->|"OTLP/HTTP<br/>:4318"| collector

     collector --> tempo
     collector --> elastic
@@ -65,27 +65,15 @@ flowchart TB

 **Reading the diagram:**

-- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/gRPC on port 4317.
+- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/HTTP on port 4318 (the only exporter shipped in Phase 1b).
 - **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing.
 - **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends.
-- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends.
+- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over HTTP, then the Collector fans out to the configured backends.

-### 2.2.1 OTLP/gRPC (Recommended)
+### 2.2.1 OTLP/HTTP (Shipped in Phase 1b)

 ```cpp
-// Configuration for OTLP over gRPC
-namespace otlp = opentelemetry::exporter::otlp;
-
-otlp::OtlpGrpcExporterOptions opts;
-opts.endpoint = "localhost:4317";
-opts.useTls = true;
-opts.sslCaCertPath = "/path/to/ca.crt";
-```
-
-### 2.2.2 OTLP/HTTP (Alternative)
-
-```cpp
-// Configuration for OTLP over HTTP
+// Configuration for OTLP over HTTP (the only exporter currently wired up).
 namespace otlp = opentelemetry::exporter::otlp;

 otlp::OtlpHttpExporterOptions opts;
@@ -93,6 +81,40 @@ opts.url = "http://localhost:4318/v1/traces";
 opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
 ```

+### 2.2.2 OTLP/gRPC (Future Work — Planned Upgrade)
+
+OTLP/gRPC is planned as a future upgrade from the HTTP exporter. The gRPC
+transport offers lower per-span overhead and tighter back-pressure semantics
+than HTTP/JSON, making it attractive for production deployments once the HTTP
+path is validated in earlier phases.
+
+Required to land this upgrade:
+
+1. Add `opentelemetry-cpp::otlp_grpc_exporter` to the Conan recipe (the
+   dependency already exists but is not linked in Phase 1b builds).
+2. Extend `TelemetryConfig.cpp` to parse an `exporter` key (`otlp_http`
+   default, `otlp_grpc` opt-in) and a gRPC endpoint override.
+3. In `Telemetry::start()` branch on the parsed exporter type and construct
+   either `OtlpHttpExporterFactory::Create(httpOpts)` or
+   `OtlpGrpcExporterFactory::Create(grpcOpts)` accordingly.
+4. Update the runbook and dashboards to document the alternate port and TLS
+   settings.
+
+Example Phase 1b+ gRPC configuration (when wired up):
+
+```cpp
+// Configuration for OTLP over gRPC (future work).
+namespace otlp = opentelemetry::exporter::otlp;
+
+otlp::OtlpGrpcExporterOptions opts;
+opts.endpoint = ":4317";
+opts.use_ssl_credentials = true;
+opts.ssl_credentials_cacert_path = "/path/to/ca.crt";
+```
+
+Until that work lands, `OtlpGrpcExporterOptions` is **not** used by any code
+path in Phase 1b through Phase 5.
+
 ---

 ## 2.3 Span Naming Conventions
diff --git a/OpenTelemetryPlan/05-configuration-reference.md b/OpenTelemetryPlan/05-configuration-reference.md
index 70df0f5b95..0c7ec5d6f4 100644
--- a/OpenTelemetryPlan/05-configuration-reference.md
+++ b/OpenTelemetryPlan/05-configuration-reference.md
@@ -26,11 +26,10 @@ Add to `cfg/xrpld-example.cfg`:
 # # Enable/disable telemetry (default: 0 = disabled)
 # enabled=1
 #
-# # Exporter type: "otlp_grpc" (default), "otlp_http", or "none"
-# exporter=otlp_grpc
-#
-# # OTLP endpoint (default: localhost:4317 for gRPC, localhost:4318 for HTTP)
-# endpoint=localhost:4317
+# # OTLP endpoint (default: http://localhost:4318/v1/traces - OTLP/HTTP)
+# # Note: only OTLP/HTTP is shipped in Phase 1b. OTLP/gRPC support is
+# # planned as future work and is not yet parsed by TelemetryConfig.cpp.
+# endpoint=http://localhost:4318/v1/traces
 #
 # # Use TLS for exporter connection (default: 0)
 # use_tls=0
@@ -56,10 +55,12 @@ Add to `cfg/xrpld-example.cfg`:
 # trace_rpc=1 # RPC request handling
 # trace_peer=0 # Peer messages (high volume, disabled by default)
 # trace_ledger=1 # Ledger acquisition and building
-# trace_pathfind=1 # Path computation (can be expensive)
-# trace_txq=1 # Transaction queue and fee escalation
-# trace_validator=0 # Validator list and manifest updates (low volume)
-# trace_amendment=0 # Amendment voting (very low volume)
+#
+# # Planned (not yet parsed by TelemetryConfig.cpp):
+# # trace_pathfind=1 # Path computation (Phase 2)
+# # trace_txq=1 # Transaction queue (Phase 3)
+# # trace_validator=0 # Validator list / manifest (future)
+# # trace_amendment=0 # Amendment voting (future)
 #
 # # Trace ID strategies for cross-node correlation
 # # "deterministic" (default) derives trace_id from a workflow hash
@@ -79,30 +80,37 @@ enabled=0

 ### 5.1.2 Configuration Options Summary

-| Option | Type | Default | Description |
-| -------------------------- | ------ | ----------------- | ---------------------------------------------------------------------------------------------------------- |
-| `enabled` | bool | `false` | Enable/disable telemetry |
-| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
-| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
-| `use_tls` | bool | `false` | Enable TLS for exporter connection |
-| `tls_ca_cert` | string | `""` | Path to CA certificate file |
-| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
-| `batch_size` | uint | `512` | Spans per export batch |
-| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
-| `max_queue_size` | uint | `2048` | Maximum queued spans |
-| `trace_transactions` | bool | `true` | Enable transaction tracing |
-| `trace_consensus` | bool | `true` | Enable consensus tracing |
-| `trace_rpc` | bool | `true` | Enable RPC tracing |
-| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
-| `trace_ledger` | bool | `true` | Enable ledger tracing |
-| `trace_pathfind` | bool | `true` | Enable path computation tracing |
-| `trace_txq` | bool | `true` | Enable transaction queue tracing |
-| `trace_validator` | bool | `false` | Enable validator list/manifest tracing |
-| `trace_amendment` | bool | `false` | Enable amendment voting tracing |
-| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
-| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
-| `service_name` | string | `"xrpld"` | Service name for traces |
-| `service_instance_id` | string | `` | Instance identifier |
+| Option | Type | Default | Description |
+| -------------------------- | ------ | --------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| `enabled` | bool | `false` | Enable/disable telemetry |
+| `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint |
+| `use_tls` | bool | `false` | Enable TLS for exporter connection |
+| `tls_ca_cert` | string | `""` | Path to CA certificate file |
+| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
+| `batch_size` | uint | `512` | Spans per export batch |
+| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
+| `max_queue_size` | uint | `2048` | Maximum queued spans |
+| `trace_transactions` | bool | `true` | Enable transaction tracing |
+| `trace_consensus` | bool | `true` | Enable consensus tracing |
+| `trace_rpc` | bool | `true` | Enable RPC tracing |
+| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
+| `trace_ledger` | bool | `true` | Enable ledger tracing |
+| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
+| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
+| `service_name` | string | `"xrpld"` | Service name for traces |
+| `service_instance_id` | string | `` | Instance identifier |
+
+**Planned (not yet implemented)**: the following options appear in the design
+documents but are not parsed by `TelemetryConfig.cpp` in Phase 1b and later
+phases. They will be added as the corresponding subsystems are instrumented:
+
+| Option | Planned Phase | Purpose |
+| ----------------- | ------------- | ---------------------------------------- |
+| `exporter` | Future | Select between OTLP/HTTP and OTLP/gRPC |
+| `trace_pathfind` | Phase 2 | Path computation tracing toggle |
+| `trace_txq` | Phase 3 | Transaction queue tracing toggle |
+| `trace_validator` | Future | Validator list / manifest update tracing |
+| `trace_amendment` | Future | Amendment voting tracing |

 ---

diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md
index b71dc1084e..208de9346f 100644
--- a/OpenTelemetryPlan/06-implementation-phases.md
+++ b/OpenTelemetryPlan/06-implementation-phases.md
@@ -207,8 +207,7 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds:
   in the same round share the same `trace_id` (switchable via
   `consensus_trace_strategy` config: `"deterministic"` or `"attribute"`).
   See [Configuration Reference](./05-configuration-reference.md) for full
-  configuration options. The `consensus_trace_strategy` option will be
-  documented in the configuration reference as part of Phase 4a implementation.
+  configuration options.
 - **Round lifecycle spans**: `consensus.round` with round-to-round span links.
 - **Establish phase**: `consensus.establish`, `consensus.update_positions`
   (with `dispute.resolve` events), `consensus.check` (with threshold tracking).
@@ -378,7 +377,7 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi

 ---

-## 6.9 Risk Assessment
+## 6.8 Risk Assessment

 ```mermaid
 quadrantChart
@@ -409,7 +408,7 @@ quadrantChart

 ---

-## 6.10 Success Metrics
+## 6.9 Success Metrics

 | Metric | Target | Measurement |
 | ------------------------ | -------------------------------------------------------------- | --------------------- |
@@ -422,13 +421,13 @@ quadrantChart

 ---

-## 6.9 Quick Wins and Crawl-Walk-Run Strategy
+## 6.10 Quick Wins and Crawl-Walk-Run Strategy

 > **TxQ** = Transaction Queue

 This section outlines a prioritized approach to maximize ROI with minimal initial investment.

-### 6.9.1 Crawl-Walk-Run Overview
+### 6.10.1 Crawl-Walk-Run Overview
@@ -477,7 +476,7 @@ flowchart TB
 - **RUN (Weeks 6-9)**: Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
 - **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.

-### 6.9.2 Quick Wins (Immediate Value)
+### 6.10.2 Quick Wins (Immediate Value)

 | Quick Win | Value | When to Deploy |
 | ------------------------------ | ------ | -------------- |
@@ -487,7 +486,7 @@ flowchart TB
 | **Transaction Submit Tracing** | High | Week 3 |
 | **Consensus Round Duration** | Medium | Week 6 |

-### 6.9.3 CRAWL Phase (Weeks 1-2)
+### 6.10.3 CRAWL Phase (Weeks 1-2)

 **Goal**: Get basic tracing working with minimal code changes.

@@ -509,7 +508,7 @@ flowchart TB
 - No cross-node complexity
 - Single file modification to existing code

-### 6.9.4 WALK Phase (Weeks 3-5)
+### 6.10.4 WALK Phase (Weeks 3-5)

 **Goal**: Add transaction lifecycle tracing across nodes.

@@ -530,7 +529,7 @@ flowchart TB
 - Moderate complexity (requires context propagation)
 - High value for debugging transaction issues

-### 6.9.5 RUN Phase (Weeks 6-9)
+### 6.10.5 RUN Phase (Weeks 6-9)

 **Goal**: Full observability including consensus.

@@ -553,7 +552,7 @@ flowchart TB
 - Requires thorough testing
 - Lower relative value (consensus issues are rarer)

-### 6.9.6 ROI Prioritization Matrix
+### 6.10.6 ROI Prioritization Matrix

 ```mermaid
 quadrantChart
@@ -575,13 +574,13 @@ quadrantChart

 ---

-## 6.13 Definition of Done
+## 6.11 Definition of Done

 > **TxQ** = Transaction Queue | **HA** = High Availability

 Clear, measurable criteria for each phase.

-### 6.13.1 Phase 1: Core Infrastructure
+### 6.11.1 Phase 1: Core Infrastructure

 | Criterion | Measurement | Target |
 | --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -593,7 +592,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: All criteria met, PR merged, no regressions in CI.

-### 6.13.2 Phase 2: RPC Tracing
+### 6.11.2 Phase 2: RPC Tracing

 | Criterion | Measurement | Target |
 | ------------------ | ---------------------------------- | -------------------------- |
@@ -605,7 +604,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.

-### 6.13.3 Phase 3: Transaction Tracing
+### 6.11.3 Phase 3: Transaction Tracing

 | Criterion | Measurement | Target |
 | --------------------- | ------------------------------------------------- | -------------------------------------------------------- |
@@ -620,7 +619,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.

-### 6.13.4 Phase 4: Consensus Tracing
+### 6.11.4 Phase 4: Consensus Tracing

 | Criterion | Measurement | Target |
 | -------------------- | ----------------------------- | ------------------------- |
@@ -632,7 +631,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

-### 6.13.5 Phase 5: Production Deployment
+### 6.11.5 Phase 5: Production Deployment

 | Criterion | Measurement | Target |
 | ------------ | ---------------------------- | -------------------------- |
@@ -645,7 +644,7 @@ Clear, measurable criteria for each phase.

 **Definition of Done**: Telemetry running in production, operators trained, alerts active.

-### 6.13.6 Success Metrics Summary
+### 6.11.6 Success Metrics Summary

 | Phase | Primary Metric | Secondary Metric | Deadline |
 | ------- | ---------------------- | --------------------------- | ------------- |
@@ -657,7 +656,7 @@ Clear, measurable criteria for each phase.

 ---

-## 6.14 Recommended Implementation Order
+## 6.12 Recommended Implementation Order

 Based on ROI analysis, implement in this exact order:

diff --git a/docs/telemetry-runbook.md b/docs/telemetry-runbook.md
index 36588695c7..e567a07ec9 100644
--- a/docs/telemetry-runbook.md
+++ b/docs/telemetry-runbook.md
@@ -64,20 +64,21 @@ All spans instrumented in xrpld, grouped by subsystem:

 ### RPC Spans (Phase 2)

-| Span Name | Source File | Attributes | Description |
-| -------------------- | --------------------- | ------------------------------------------------------------------------------ | -------------------------------------------------- |
-| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request |
-| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) |
-| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message |
-| `rpc.command.` | RPCHandler.cpp:161 | `command`, `version`, `rpc_role`, `rpc_status`, `duration_ms`, `error_message` | Per-command span (e.g., `rpc.command.server_info`) |
+| Span Name | Source File | Attributes | Description |
+| -------------------- | ----------------- | ------------------------------------------------------------------------------ | ----------------------------------------------------- |
+| `rpc.http_request` | ServerHandler.cpp | — | Top-level HTTP RPC request |
+| `rpc.ws_upgrade` | ServerHandler.cpp | — | WebSocket upgrade handshake |
+| `rpc.ws_message` | ServerHandler.cpp | — | WebSocket RPC message |
+| `rpc.process` | ServerHandler.cpp | — | RPC processing (child of rpc.http_request/ws_message) |
+| `rpc.command.` | RPCHandler.cpp | `command`, `version`, `rpc_role`, `rpc_status`, `duration_ms`, `error_message` | Per-command span (e.g., `rpc.command.server_info`) |

 ### Transaction Spans (Phase 3)

-| Span Name | Source File | Attributes | Description |
-| ------------ | ------------------- | ------------------------------------------------------------------------- | ------------------------------------- |
-| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing |
-| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
-| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
+| Span Name | Source File | Attributes | Description |
+| ------------ | --------------- | ------------------------------------------------------------------------- | ------------------------------------- |
+| `tx.process` | NetworkOPs.cpp | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing |
+| `tx.receive` | PeerImp.cpp | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
+| `tx.apply` | BuildLedger.cpp | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |

 ### Transaction Queue Spans (Phase 3)

@@ -452,9 +453,10 @@ Requires `trace_peer=1` in the `[telemetry]` config section.

 | Span Name | Prometheus Metric Filter | Grafana Dashboard |
 | ------------------------------ | -------------------------------------------- | --------------------------------------------- |
-| `rpc.request` | `{span_name="rpc.request"}` | RPC Performance (Overall Throughput) |
-| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
+| `rpc.http_request` | `{span_name="rpc.http_request"}` | RPC Performance (Overall Throughput) |
+| `rpc.ws_upgrade` | `{span_name="rpc.ws_upgrade"}` | -- (available but not paneled) |
 | `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
+| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
 | `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
 | `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
 | `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |