merge: phase-5 (runbook span-name + line-number fixes) into phase-6

# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	docs/telemetry-runbook.md
This commit is contained in:
Pratik Mankawde
2026-05-14 16:42:13 +01:00
4 changed files with 122 additions and 91 deletions

View File

@@ -13,13 +13,13 @@
**Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`)
| Component | Purpose | Required |
| --------------------------------------- | ---------------------- | ----------- |
| `opentelemetry-cpp::api` | Tracing API headers | Yes |
| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Recommended |
| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Alternative |
| Component | Purpose | Required |
| --------------------------------------- | ---------------------- | ------------------------- |
| `opentelemetry-cpp::api` | Tracing API headers | Yes |
| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Yes (shipped in Phase 1b) |
| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Future (not yet wired up) |
### 2.1.2 Instrumentation Strategy
@@ -51,9 +51,9 @@ flowchart TB
elastic["Elastic<br/>APM"]
end
node1 -->|"OTLP/gRPC<br/>:4317"| collector
node2 -->|"OTLP/gRPC<br/>:4317"| collector
node3 -->|"OTLP/gRPC<br/>:4317"| collector
node1 -->|"OTLP/HTTP<br/>:4318"| collector
node2 -->|"OTLP/HTTP<br/>:4318"| collector
node3 -->|"OTLP/HTTP<br/>:4318"| collector
collector --> tempo
collector --> elastic
@@ -65,27 +65,15 @@ flowchart TB
**Reading the diagram:**
- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/gRPC on port 4317.
- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/HTTP on port 4318 (the only exporter shipped in Phase 1b).
- **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing.
- **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends.
- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends.
- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over HTTP, then the Collector fans out to the configured backends.
### 2.2.1 OTLP/gRPC (Recommended)
### 2.2.1 OTLP/HTTP (Shipped in Phase 1b)
```cpp
// Configuration for OTLP over gRPC
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpGrpcExporterOptions opts;
opts.endpoint = "localhost:4317";
opts.useTls = true;
opts.sslCaCertPath = "/path/to/ca.crt";
```
### 2.2.2 OTLP/HTTP (Alternative)
```cpp
// Configuration for OTLP over HTTP
// Configuration for OTLP over HTTP (the only exporter currently wired up).
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpHttpExporterOptions opts;
@@ -93,6 +81,40 @@ opts.url = "http://localhost:4318/v1/traces";
opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
```
### 2.2.2 OTLP/gRPC (Future Work — Planned Upgrade)
OTLP/gRPC is planned as a future upgrade from the HTTP exporter. The gRPC
transport offers lower per-span overhead and tighter back-pressure semantics
than HTTP/JSON, making it attractive for production deployments once the HTTP
path is validated in earlier phases.
Required to land this upgrade:
1. Add `opentelemetry-cpp::otlp_grpc_exporter` to the Conan recipe (the
dependency already exists but is not linked in Phase 1b builds).
2. Extend `TelemetryConfig.cpp` to parse an `exporter` key (`otlp_http`
default, `otlp_grpc` opt-in) and a gRPC endpoint override.
3. In `Telemetry::start()` branch on the parsed exporter type and construct
either `OtlpHttpExporterFactory::Create(httpOpts)` or
`OtlpGrpcExporterFactory::Create(grpcOpts)` accordingly.
4. Update the runbook and dashboards to document the alternate port and TLS
settings.
Example Phase 1b+ gRPC configuration (when wired up):
```cpp
// Configuration for OTLP over gRPC (future work).
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpGrpcExporterOptions opts;
opts.endpoint = "<otel-collector-host>:4317";
opts.use_ssl_credentials = true;
opts.ssl_credentials_cacert_path = "/path/to/ca.crt";
```
Until that work lands, `OtlpGrpcExporterOptions` is **not** used by any code
path in Phase 1b through Phase 5.
---
## 2.3 Span Naming Conventions

View File

@@ -26,11 +26,10 @@ Add to `cfg/xrpld-example.cfg`:
# # Enable/disable telemetry (default: 0 = disabled)
# enabled=1
#
# # Exporter type: "otlp_grpc" (default), "otlp_http", or "none"
# exporter=otlp_grpc
#
# # OTLP endpoint (default: localhost:4317 for gRPC, localhost:4318 for HTTP)
# endpoint=localhost:4317
# # OTLP endpoint (default: http://localhost:4318/v1/traces - OTLP/HTTP)
# # Note: only OTLP/HTTP is shipped in Phase 1b. OTLP/gRPC support is
# # planned as future work and is not yet parsed by TelemetryConfig.cpp.
# endpoint=http://localhost:4318/v1/traces
#
# # Use TLS for exporter connection (default: 0)
# use_tls=0
@@ -56,10 +55,12 @@ Add to `cfg/xrpld-example.cfg`:
# trace_rpc=1 # RPC request handling
# trace_peer=0 # Peer messages (high volume, disabled by default)
# trace_ledger=1 # Ledger acquisition and building
# trace_pathfind=1 # Path computation (can be expensive)
# trace_txq=1 # Transaction queue and fee escalation
# trace_validator=0 # Validator list and manifest updates (low volume)
# trace_amendment=0 # Amendment voting (very low volume)
#
# # Planned (not yet parsed by TelemetryConfig.cpp):
# # trace_pathfind=1 # Path computation (Phase 2)
# # trace_txq=1 # Transaction queue (Phase 3)
# # trace_validator=0 # Validator list / manifest (future)
# # trace_amendment=0 # Amendment voting (future)
#
# # Trace ID strategies for cross-node correlation
# # "deterministic" (default) derives trace_id from a workflow hash
@@ -79,30 +80,37 @@ enabled=0
### 5.1.2 Configuration Options Summary
| Option | Type | Default | Description |
| -------------------------- | ------ | ----------------- | ---------------------------------------------------------------------------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `trace_pathfind` | bool | `true` | Enable path computation tracing |
| `trace_txq` | bool | `true` | Enable transaction queue tracing |
| `trace_validator` | bool | `false` | Enable validator list/manifest tracing |
| `trace_amendment` | bool | `false` | Enable amendment voting tracing |
| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
| `service_name` | string | `"xrpld"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
| Option | Type | Default | Description |
| -------------------------- | ------ | --------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
| `service_name` | string | `"xrpld"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
**Planned (not yet implemented)**: the following options appear in the design
documents but are not parsed by `TelemetryConfig.cpp` in Phase 1b and later
phases. They will be added as the corresponding subsystems are instrumented:
| Option | Planned Phase | Purpose |
| ----------------- | ------------- | ---------------------------------------- |
| `exporter` | Future | Select between OTLP/HTTP and OTLP/gRPC |
| `trace_pathfind` | Phase 2 | Path computation tracing toggle |
| `trace_txq` | Phase 3 | Transaction queue tracing toggle |
| `trace_validator` | Future | Validator list / manifest update tracing |
| `trace_amendment` | Future | Amendment voting tracing |
---

View File

@@ -207,8 +207,7 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds:
in the same round share the same `trace_id` (switchable via
`consensus_trace_strategy` config: `"deterministic"` or `"attribute"`).
See [Configuration Reference](./05-configuration-reference.md) for full
configuration options. The `consensus_trace_strategy` option will be
documented in the configuration reference as part of Phase 4a implementation.
configuration options.
- **Round lifecycle spans**: `consensus.round` with round-to-round span links.
- **Establish phase**: `consensus.establish`, `consensus.update_positions` (with
`dispute.resolve` events), `consensus.check` (with threshold tracking).
@@ -378,7 +377,7 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi
---
## 6.9 Risk Assessment
## 6.8 Risk Assessment
```mermaid
quadrantChart
@@ -409,7 +408,7 @@ quadrantChart
---
## 6.10 Success Metrics
## 6.9 Success Metrics
| Metric | Target | Measurement |
| ------------------------ | -------------------------------------------------------------- | --------------------- |
@@ -422,13 +421,13 @@ quadrantChart
---
## 6.9 Quick Wins and Crawl-Walk-Run Strategy
## 6.10 Quick Wins and Crawl-Walk-Run Strategy
> **TxQ** = Transaction Queue
This section outlines a prioritized approach to maximize ROI with minimal initial investment.
### 6.9.1 Crawl-Walk-Run Overview
### 6.10.1 Crawl-Walk-Run Overview
<div align="center">
@@ -477,7 +476,7 @@ flowchart TB
- **RUN (Weeks 6-9)**: Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
### 6.9.2 Quick Wins (Immediate Value)
### 6.10.2 Quick Wins (Immediate Value)
| Quick Win | Value | When to Deploy |
| ------------------------------ | ------ | -------------- |
@@ -487,7 +486,7 @@ flowchart TB
| **Transaction Submit Tracing** | High | Week 3 |
| **Consensus Round Duration** | Medium | Week 6 |
### 6.9.3 CRAWL Phase (Weeks 1-2)
### 6.10.3 CRAWL Phase (Weeks 1-2)
**Goal**: Get basic tracing working with minimal code changes.
@@ -509,7 +508,7 @@ flowchart TB
- No cross-node complexity
- Single file modification to existing code
### 6.9.4 WALK Phase (Weeks 3-5)
### 6.10.4 WALK Phase (Weeks 3-5)
**Goal**: Add transaction lifecycle tracing across nodes.
@@ -530,7 +529,7 @@ flowchart TB
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues
### 6.9.5 RUN Phase (Weeks 6-9)
### 6.10.5 RUN Phase (Weeks 6-9)
**Goal**: Full observability including consensus.
@@ -553,7 +552,7 @@ flowchart TB
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
### 6.9.6 ROI Prioritization Matrix
### 6.10.6 ROI Prioritization Matrix
```mermaid
quadrantChart
@@ -575,13 +574,13 @@ quadrantChart
---
## 6.13 Definition of Done
## 6.11 Definition of Done
> **TxQ** = Transaction Queue | **HA** = High Availability
Clear, measurable criteria for each phase.
### 6.13.1 Phase 1: Core Infrastructure
### 6.11.1 Phase 1: Core Infrastructure
| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -593,7 +592,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: All criteria met, PR merged, no regressions in CI.
### 6.13.2 Phase 2: RPC Tracing
### 6.11.2 Phase 2: RPC Tracing
| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -605,7 +604,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
### 6.13.3 Phase 3: Transaction Tracing
### 6.11.3 Phase 3: Transaction Tracing
| Criterion | Measurement | Target |
| --------------------- | ------------------------------------------------- | -------------------------------------------------------- |
@@ -620,7 +619,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.
### 6.13.4 Phase 4: Consensus Tracing
### 6.11.4 Phase 4: Consensus Tracing
| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |
@@ -632,7 +631,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
### 6.13.5 Phase 5: Production Deployment
### 6.11.5 Phase 5: Production Deployment
| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |
@@ -645,7 +644,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Telemetry running in production, operators trained, alerts active.
### 6.13.6 Success Metrics Summary
### 6.11.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------- | --------------------------- | ------------- |
@@ -657,7 +656,7 @@ Clear, measurable criteria for each phase.
---
## 6.14 Recommended Implementation Order
## 6.12 Recommended Implementation Order
Based on ROI analysis, implement in this exact order:

View File

@@ -64,20 +64,21 @@ All spans instrumented in xrpld, grouped by subsystem:
### RPC Spans (Phase 2)
| Span Name | Source File | Attributes | Description |
| -------------------- | --------------------- | ------------------------------------------------------------------------------ | -------------------------------------------------- |
| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request |
| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) |
| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message |
| `rpc.command.<name>` | RPCHandler.cpp:161 | `command`, `version`, `rpc_role`, `rpc_status`, `duration_ms`, `error_message` | Per-command span (e.g., `rpc.command.server_info`) |
| Span Name | Source File | Attributes | Description |
| -------------------- | ----------------- | ------------------------------------------------------------------------------ | ----------------------------------------------------- |
| `rpc.http_request` | ServerHandler.cpp | — | Top-level HTTP RPC request |
| `rpc.ws_upgrade` | ServerHandler.cpp | — | WebSocket upgrade handshake |
| `rpc.ws_message` | ServerHandler.cpp | — | WebSocket RPC message |
| `rpc.process` | ServerHandler.cpp | — | RPC processing (child of rpc.http_request/ws_message) |
| `rpc.command.<name>` | RPCHandler.cpp | `command`, `version`, `rpc_role`, `rpc_status`, `duration_ms`, `error_message` | Per-command span (e.g., `rpc.command.server_info`) |
### Transaction Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------ | ------------------- | ------------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
| Span Name | Source File | Attributes | Description |
| ------------ | --------------- | ------------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp | `xrpl.ledger.seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
### Transaction Queue Spans (Phase 3)
@@ -452,9 +453,10 @@ Requires `trace_peer=1` in the `[telemetry]` config section.
| Span Name | Prometheus Metric Filter | Grafana Dashboard |
| ------------------------------ | -------------------------------------------- | --------------------------------------------- |
| `rpc.request` | `{span_name="rpc.request"}` | RPC Performance (Overall Throughput) |
| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
| `rpc.http_request` | `{span_name="rpc.http_request"}` | RPC Performance (Overall Throughput) |
| `rpc.ws_upgrade` | `{span_name="rpc.ws_upgrade"}` | -- (available but not paneled) |
| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |