merge: phase-9 (dashboard UID + line-number cleanup, detach callbacks) into phase-10

# Conflicts:
#	docker/telemetry/TESTING.md
This commit is contained in:
Pratik Mankawde
2026-05-14 17:23:55 +01:00
23 changed files with 302 additions and 182 deletions

View File

@@ -13,13 +13,13 @@
**Primary Choice**: OpenTelemetry C++ SDK (`opentelemetry-cpp`)
| Component | Purpose | Required |
| --------------------------------------- | ---------------------- | ----------- |
| `opentelemetry-cpp::api` | Tracing API headers | Yes |
| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Recommended |
| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Alternative |
| Component | Purpose | Required |
| --------------------------------------- | ---------------------- | ------------------------- |
| `opentelemetry-cpp::api` | Tracing API headers | Yes |
| `opentelemetry-cpp::sdk` | SDK implementation | Yes |
| `opentelemetry-cpp::ext` | Extensions (exporters) | Yes |
| `opentelemetry-cpp::otlp_http_exporter` | OTLP/HTTP export | Yes (shipped in Phase 1b) |
| `opentelemetry-cpp::otlp_grpc_exporter` | OTLP/gRPC export | Future (not yet wired up) |
### 2.1.2 Instrumentation Strategy
@@ -51,9 +51,9 @@ flowchart TB
elastic["Elastic<br/>APM"]
end
node1 -->|"OTLP/gRPC<br/>:4317"| collector
node2 -->|"OTLP/gRPC<br/>:4317"| collector
node3 -->|"OTLP/gRPC<br/>:4317"| collector
node1 -->|"OTLP/HTTP<br/>:4318"| collector
node2 -->|"OTLP/HTTP<br/>:4318"| collector
node3 -->|"OTLP/HTTP<br/>:4318"| collector
collector --> tempo
collector --> elastic
@@ -65,27 +65,15 @@ flowchart TB
**Reading the diagram:**
- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/gRPC on port 4317.
- **xrpld Nodes (blue)**: The source of telemetry data. Each xrpld node exports spans via OTLP/HTTP on port 4318 (the only exporter shipped in Phase 1b).
- **OpenTelemetry Collector (red)**: The central aggregation point that receives spans from all nodes. Can run as a sidecar (per-node) or standalone (shared). Handles batching, filtering, and routing.
- **Observability Backends (green)**: The storage and visualization destinations. Tempo is the recommended backend for both development and production, and Elastic APM is an alternative. The Collector routes to one or more backends.
- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over gRPC, then the Collector fans out to the configured backends.
- **Arrows (nodes to collector to backends)**: The data pipeline -- spans flow from nodes to the Collector over HTTP, then the Collector fans out to the configured backends.
### 2.2.1 OTLP/gRPC (Recommended)
### 2.2.1 OTLP/HTTP (Shipped in Phase 1b)
```cpp
// Configuration for OTLP over gRPC
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpGrpcExporterOptions opts;
opts.endpoint = "localhost:4317";
opts.useTls = true;
opts.sslCaCertPath = "/path/to/ca.crt";
```
### 2.2.2 OTLP/HTTP (Alternative)
```cpp
// Configuration for OTLP over HTTP
// Configuration for OTLP over HTTP (the only exporter currently wired up).
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpHttpExporterOptions opts;
@@ -93,6 +81,40 @@ opts.url = "http://localhost:4318/v1/traces";
opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
```
### 2.2.2 OTLP/gRPC (Future Work — Planned Upgrade)
OTLP/gRPC is planned as a future upgrade from the HTTP exporter. The gRPC
transport offers lower per-span overhead and tighter back-pressure semantics
than HTTP/JSON, making it attractive for production deployments once the HTTP
path is validated in earlier phases.
Required to land this upgrade:
1. Add `opentelemetry-cpp::otlp_grpc_exporter` to the Conan recipe (the
dependency already exists but is not linked in Phase 1b builds).
2. Extend `TelemetryConfig.cpp` to parse an `exporter` key (`otlp_http`
default, `otlp_grpc` opt-in) and a gRPC endpoint override.
3. In `Telemetry::start()` branch on the parsed exporter type and construct
either `OtlpHttpExporterFactory::Create(httpOpts)` or
`OtlpGrpcExporterFactory::Create(grpcOpts)` accordingly.
4. Update the runbook and dashboards to document the alternate port and TLS
settings.
Example Phase 1b+ gRPC configuration (when wired up):
```cpp
// Configuration for OTLP over gRPC (future work).
namespace otlp = opentelemetry::exporter::otlp;
otlp::OtlpGrpcExporterOptions opts;
opts.endpoint = "<otel-collector-host>:4317";
opts.use_ssl_credentials = true;
opts.ssl_credentials_cacert_path = "/path/to/ca.crt";
```
Until that work lands, `OtlpGrpcExporterOptions` is **not** used by any code
path in Phase 1b through Phase 5.
---
## 2.3 Span Naming Conventions

View File

@@ -26,11 +26,10 @@ Add to `cfg/xrpld-example.cfg`:
# # Enable/disable telemetry (default: 0 = disabled)
# enabled=1
#
# # Exporter type: "otlp_grpc" (default), "otlp_http", or "none"
# exporter=otlp_grpc
#
# # OTLP endpoint (default: localhost:4317 for gRPC, localhost:4318 for HTTP)
# endpoint=localhost:4317
# # OTLP endpoint (default: http://localhost:4318/v1/traces - OTLP/HTTP)
# # Note: only OTLP/HTTP is shipped in Phase 1b. OTLP/gRPC support is
# # planned as future work and is not yet parsed by TelemetryConfig.cpp.
# endpoint=http://localhost:4318/v1/traces
#
# # Use TLS for exporter connection (default: 0)
# use_tls=0
@@ -56,10 +55,12 @@ Add to `cfg/xrpld-example.cfg`:
# trace_rpc=1 # RPC request handling
# trace_peer=0 # Peer messages (high volume, disabled by default)
# trace_ledger=1 # Ledger acquisition and building
# trace_pathfind=1 # Path computation (can be expensive)
# trace_txq=1 # Transaction queue and fee escalation
# trace_validator=0 # Validator list and manifest updates (low volume)
# trace_amendment=0 # Amendment voting (very low volume)
#
# # Planned (not yet parsed by TelemetryConfig.cpp):
# # trace_pathfind=1 # Path computation (Phase 2)
# # trace_txq=1 # Transaction queue (Phase 3)
# # trace_validator=0 # Validator list / manifest (future)
# # trace_amendment=0 # Amendment voting (future)
#
# # Trace ID strategies for cross-node correlation
# # "deterministic" (default) derives trace_id from a workflow hash
@@ -79,30 +80,37 @@ enabled=0
### 5.1.2 Configuration Options Summary
| Option | Type | Default | Description |
| -------------------------- | ------ | ----------------- | ---------------------------------------------------------------------------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `exporter` | string | `"otlp_grpc"` | Exporter type: otlp_grpc, otlp_http, none |
| `endpoint` | string | `localhost:4317` | OTLP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `trace_pathfind` | bool | `true` | Enable path computation tracing |
| `trace_txq` | bool | `true` | Enable transaction queue tracing |
| `trace_validator` | bool | `false` | Enable validator list/manifest tracing |
| `trace_amendment` | bool | `false` | Enable amendment voting tracing |
| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
| `service_name` | string | `"xrpld"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
| Option | Type | Default | Description |
| -------------------------- | ------ | --------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `enabled` | bool | `false` | Enable/disable telemetry |
| `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint |
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
| `batch_size` | uint | `512` | Spans per export batch |
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
| `max_queue_size` | uint | `2048` | Maximum queued spans |
| `trace_transactions` | bool | `true` | Enable transaction tracing |
| `trace_consensus` | bool | `true` | Enable consensus tracing |
| `trace_rpc` | bool | `true` | Enable RPC tracing |
| `trace_peer` | bool | `false` | Enable peer message tracing (high volume) |
| `trace_ledger` | bool | `true` | Enable ledger tracing |
| `tx_trace_strategy` | string | `"deterministic"` | TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random) |
| `consensus_trace_strategy` | string | `"deterministic"` | Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random) |
| `service_name` | string | `"xrpld"` | Service name for traces |
| `service_instance_id` | string | `<node_pubkey>` | Instance identifier |
**Planned (not yet implemented)**: the following options appear in the design
documents but are not parsed by `TelemetryConfig.cpp` in Phase 1b and later
phases. They will be added as the corresponding subsystems are instrumented:
| Option | Planned Phase | Purpose |
| ----------------- | ------------- | ---------------------------------------- |
| `exporter` | Future | Select between OTLP/HTTP and OTLP/gRPC |
| `trace_pathfind` | Phase 2 | Path computation tracing toggle |
| `trace_txq` | Phase 3 | Transaction queue tracing toggle |
| `trace_validator` | Future | Validator list / manifest update tracing |
| `trace_amendment` | Future | Amendment voting tracing |
---

View File

@@ -197,14 +197,14 @@ SHAMap tracing are not implemented.
### Spans Produced
| Span Name | Location | Attributes |
| --------------------------- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `consensus.phase.open` | `Consensus.h:707` | _(none)_ |
| `consensus.proposal.send` | `RCLConsensus.cpp:232` | `xrpl.consensus.round` |
| `consensus.ledger_close` | `RCLConsensus.cpp:341` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| `consensus.accept` | `RCLConsensus.cpp:492` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms`, `xrpl.consensus.quorum` |
| `consensus.accept.apply` | `RCLConsensus.cpp:541` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` |
| `consensus.validation.send` | `RCLConsensus.cpp:900` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` |
| Span Name | Location | Attributes |
| --------------------------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `consensus.phase.open` | `Consensus.h` | _(none)_ |
| `consensus.proposal.send` | `RCLConsensus.cpp` | `xrpl.consensus.round` |
| `consensus.ledger_close` | `RCLConsensus.cpp` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| `consensus.accept` | `RCLConsensus.cpp` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms`, `xrpl.consensus.quorum` |
| `consensus.accept.apply` | `RCLConsensus.cpp` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` |
| `consensus.validation.send` | `RCLConsensus.cpp` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` |
### Exit Criteria
@@ -225,8 +225,7 @@ Phase 4a (establish-phase gap fill & cross-node correlation) adds:
in the same round share the same `trace_id` (switchable via
`consensus_trace_strategy` config: `"deterministic"` or `"attribute"`).
See [Configuration Reference](./05-configuration-reference.md) for full
configuration options. The `consensus_trace_strategy` option will be
documented in the configuration reference as part of Phase 4a implementation.
configuration options.
- **Round lifecycle spans**: `consensus.round` with round-to-round span links.
- **Establish phase**: `consensus.establish`, `consensus.update_positions` (with
`dispute.resolve` events), `consensus.check` (with threshold tracking).
@@ -369,7 +368,7 @@ xrpld has a mature metrics framework (`beast::insight`) that emits StatsD-format
### Wire Format Fix (Task 6.1) — DEFERRED
The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).
The `StatsDMeterImpl` in `StatsDCollector.cpp` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).
**Status**: Deferred as a separate change this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
@@ -1221,13 +1220,13 @@ quadrantChart
---
## 6.9 Quick Wins and Crawl-Walk-Run Strategy
## 6.11 Quick Wins and Crawl-Walk-Run Strategy
> **TxQ** = Transaction Queue
This section outlines a prioritized approach to maximize ROI with minimal initial investment.
### 6.9.1 Crawl-Walk-Run Overview
### 6.11.1 Crawl-Walk-Run Overview
<div align="center">
@@ -1276,7 +1275,7 @@ flowchart TB
- **RUN (Weeks 6-9)**: Full consensus instrumentation, establish-phase gap fill, cross-node correlation, StatsD integration, and production deployment with sampling and alerting.
- **Arrows (crawl → walk → run)**: Each phase builds on the prior one; you cannot skip ahead because later phases depend on infrastructure established earlier.
### 6.9.2 Quick Wins (Immediate Value)
### 6.11.2 Quick Wins (Immediate Value)
| Quick Win | Value | When to Deploy |
| ------------------------------ | ------ | -------------- |
@@ -1286,7 +1285,7 @@ flowchart TB
| **Transaction Submit Tracing** | High | Week 3 |
| **Consensus Round Duration** | Medium | Week 6 |
### 6.9.3 CRAWL Phase (Weeks 1-2)
### 6.11.3 CRAWL Phase (Weeks 1-2)
**Goal**: Get basic tracing working with minimal code changes.
@@ -1308,7 +1307,7 @@ flowchart TB
- No cross-node complexity
- Single file modification to existing code
### 6.9.4 WALK Phase (Weeks 3-5)
### 6.11.4 WALK Phase (Weeks 3-5)
**Goal**: Add transaction lifecycle tracing across nodes.
@@ -1329,7 +1328,7 @@ flowchart TB
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues
### 6.9.5 RUN Phase (Weeks 6-9)
### 6.11.5 RUN Phase (Weeks 6-9)
**Goal**: Full observability including consensus.
@@ -1352,7 +1351,7 @@ flowchart TB
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
### 6.9.6 ROI Prioritization Matrix
### 6.11.6 ROI Prioritization Matrix
```mermaid
quadrantChart
@@ -1374,13 +1373,13 @@ quadrantChart
---
## 6.13 Definition of Done
## 6.12 Definition of Done
> **TxQ** = Transaction Queue | **HA** = High Availability
Clear, measurable criteria for each phase.
### 6.13.1 Phase 1: Core Infrastructure
### 6.12.1 Phase 1: Core Infrastructure
| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -1392,7 +1391,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: All criteria met, PR merged, no regressions in CI.
### 6.13.2 Phase 2: RPC Tracing
### 6.12.2 Phase 2: RPC Tracing
| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -1404,7 +1403,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: RPC traces visible in Tempo for all commands, dashboard shows latency distribution.
### 6.13.3 Phase 3: Transaction Tracing
### 6.12.3 Phase 3: Transaction Tracing
| Criterion | Measurement | Target |
| --------------------- | ------------------------------------------------- | -------------------------------------------------------- |
@@ -1419,7 +1418,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Transaction traces span 3+ nodes in test network with deterministic trace_id correlation, parent-child ordering via protobuf propagation, and performance within bounds.
### 6.13.4 Phase 4: Consensus Tracing
### 6.12.4 Phase 4: Consensus Tracing
| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |
@@ -1431,7 +1430,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
### 6.13.5 Phase 5: Production Deployment
### 6.12.5 Phase 5: Production Deployment
| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |
@@ -1444,7 +1443,7 @@ Clear, measurable criteria for each phase.
**Definition of Done**: Telemetry running in production, operators trained, alerts active.
### 6.13.6 Success Metrics Summary
### 6.12.6 Success Metrics Summary
| Phase | Primary Metric | Secondary Metric | Deadline | Status |
| -------- | ------------------------------------------------------------------ | --------------------------- | -------------- | ------------------ |