mirror of
https://github.com/XRPLF/rippled.git
synced 2026-04-29 15:37:57 +00:00
207 lines
9.8 KiB
Markdown
207 lines
9.8 KiB
Markdown
# Phase 2: RPC Tracing Completion Task List
|
|
|
|
> **Goal**: Complete RPC tracing coverage with unit tests, Grafana search filters, node health attributes, and config hardening. Build on the Phase 1c SpanGuard factory foundation to achieve production-quality RPC observability.
|
|
>
|
|
> **Scope**: Unit tests for core telemetry, Grafana Tempo search filters, node health span attributes, config validation (`std::clamp`).
|
|
>
|
|
> **Branch**: `pratik/otel-phase2-rpc-tracing` (from `pratik/otel-phase1c-rpc-integration`)
|
|
|
|
### Related Plan Documents
|
|
|
|
| Document | Relevance |
|
|
| ------------------------------------------------------------ | ------------------------------------------------------------- |
|
|
| [04-code-samples.md](./04-code-samples.md) | TraceContextPropagator (§4.4.2), RPC instrumentation (§4.5.3) |
|
|
| [02-design-decisions.md](./02-design-decisions.md) | W3C Trace Context (§2.5), span attributes (§2.4.2) |
|
|
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 2 tasks (§6.3), definition of done (§6.11.2) |
|
|
|
|
---
|
|
|
|
## Task 2.1: W3C Trace Context HTTP Header Extraction
|
|
|
|
**Status**: DEFERRED → Phase 3
|
|
|
|
**Reason**: W3C context propagation (`traceparent`/`tracestate` headers) requires a consumer — in Phase 2, RPC spans are entirely local to the node. Phase 3 introduces cross-node transaction tracing via protobuf context propagation, which is the first use case for extracted trace context. Implementing it here without a consumer would be dead code.
|
|
|
|
**Implemented in**: `pratik/otel-phase3-tx-tracing` — `TraceContextPropagator.h/.cpp`
|
|
|
|
---
|
|
|
|
## Task 2.2: Per-Category Span Creation
|
|
|
|
**Status**: COMPLETE (superseded by Phase 1c design)
|
|
|
|
**Original plan**: Add `XRPL_TRACE_PEER` and `XRPL_TRACE_LEDGER` macros.
|
|
|
|
**Actual implementation**: Phase 1c replaced all tracing macros with the `SpanGuard::span(TraceCategory, prefix, name)` factory pattern. The `TraceCategory` enum (`Rpc`, `Transactions`, `Consensus`, `Peer`, `Ledger`) serves the same conditional-creation purpose without macros. No separate task needed — the factory already supports all categories.
|
|
|
|
---
|
|
|
|
## Task 2.3: Add shouldTraceLedger() to Telemetry Interface
|
|
|
|
**Objective**: The `Setup` struct has a `traceLedger` field but there's no corresponding virtual method. Add it for interface completeness.
|
|
|
|
**What to do**:
|
|
|
|
- Edit `include/xrpl/telemetry/Telemetry.h`:
|
|
- Add `virtual bool shouldTraceLedger() const = 0;`
|
|
|
|
- Update all implementations:
|
|
- `src/libxrpl/telemetry/Telemetry.cpp` (TelemetryImpl, NullTelemetryOtel)
|
|
- `src/libxrpl/telemetry/NullTelemetry.cpp` (NullTelemetry)
|
|
|
|
**Key modified files**:
|
|
|
|
- `include/xrpl/telemetry/Telemetry.h`
|
|
- `src/libxrpl/telemetry/Telemetry.cpp`
|
|
- `src/libxrpl/telemetry/NullTelemetry.cpp`
|
|
|
|
---
|
|
|
|
## Task 2.4: Unit Tests for Core Telemetry Infrastructure
|
|
|
|
**Status**: COMPLETE
|
|
|
|
**Objective**: Add unit tests for the core telemetry abstractions to validate correctness and catch regressions.
|
|
|
|
**Implemented**:
|
|
|
|
- `src/tests/libxrpl/telemetry/TelemetryConfig.cpp`:
|
|
- Test Setup defaults (all fields have correct initial values)
|
|
- Test `setup_Telemetry` config parser (empty section, full section, edge cases)
|
|
- Test `samplingRatio` clamping (values outside 0.0-1.0)
|
|
|
|
- `src/tests/libxrpl/telemetry/SpanGuardFactory.cpp`:
|
|
- Test null guard methods are safe (setAttribute, setOk, setError, addEvent on null)
|
|
- Test category span returns null when telemetry disabled
|
|
- Test child/linked span null when no parent context
|
|
- Test move construction transfers ownership
|
|
- Test recordException safe on null guard
|
|
- Test discard() safe on null guard
|
|
|
|
- `src/tests/libxrpl/telemetry/main.cpp` — GTest runner
|
|
- `src/tests/libxrpl/CMakeLists.txt` — test target with optional OTel linking
|
|
|
|
---
|
|
|
|
## Task 2.5: Enhance RPC Span Attributes
|
|
|
|
**Status**: DEFERRED (low priority)
|
|
|
|
**Reason**: The high-value attributes (`command`, `version`, `role`, `status`) are already set by Phase 1c. The remaining HTTP transport-level attributes (`http.method`, `net.peer.ip`, `http.status_code`) provide limited additional insight since:
|
|
|
|
- `http.method` is always POST for JSON-RPC
|
|
- `net.peer.ip` is debug-level info available in logs
|
|
- `xrpl.rpc.duration_ms` is redundant with span duration (OTel captures start/end time natively)
|
|
|
|
These can be added later if dashboard queries specifically need them. The node health attributes (Task 2.8) provide far more operational value and were prioritized instead.
|
|
|
|
---
|
|
|
|
## Task 2.6: Build Verification and Performance Baseline
|
|
|
|
**Objective**: Verify the build succeeds with and without telemetry, and establish a performance baseline.
|
|
|
|
**What to do**:
|
|
|
|
1. Build with `telemetry=ON` and verify no compilation errors
|
|
2. Build with `telemetry=OFF` and verify no regressions
|
|
3. Run existing unit tests to verify no breakage
|
|
4. Document any build issues in lessons.md
|
|
|
|
**Verification Checklist**:
|
|
|
|
- [ ] `conan install . --build=missing -o telemetry=True` succeeds
|
|
- [ ] `cmake --preset default -Dtelemetry=ON` configures correctly
|
|
- [ ] Build succeeds with telemetry ON
|
|
- [ ] Build succeeds with telemetry OFF
|
|
- [ ] Existing tests pass with telemetry ON
|
|
- [ ] Existing tests pass with telemetry OFF
|
|
|
|
---
|
|
|
|
## Task 2.8: RPC Span Attribute Enrichment — Node Health Context
|
|
|
|
> **Source**: [External Dashboard Parity](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — adds node-level health context inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
|
|
>
|
|
> **Downstream**: Phase 7 (MetricsRegistry uses these attributes for alerting context), Phase 10 (validation checks for these attributes).
|
|
|
|
**Objective**: Add node-level health state to every `rpc.command.*` span so operators can correlate RPC behavior with node state in Tempo.
|
|
|
|
**What to do**:
|
|
|
|
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
|
|
- In the `rpc.command.*` span creation block (after existing `setAttribute` calls for `xrpl.rpc.command`, `xrpl.rpc.version`, etc.):
|
|
- Add `xrpl.node.amendment_blocked` (bool) — from `context.app.getOPs().isAmendmentBlocked()`
|
|
- Add `xrpl.node.server_state` (string) — from `context.app.getOPs().strOperatingMode()`
|
|
|
|
**New span attributes**:
|
|
|
|
| Attribute | Type | Source | Example |
|
|
| ----------------------------- | ------ | ------------------------------------------- | -------- |
|
|
| `xrpl.node.amendment_blocked` | bool | `context.app.getOPs().isAmendmentBlocked()` | `true` |
|
|
| `xrpl.node.server_state` | string | `context.app.getOPs().strOperatingMode()` | `"full"` |
|
|
|
|
**Rationale**: When a node is amendment-blocked or in a degraded state, every RPC response is suspect. Tagging spans with this state enables Tempo TraceQL queries like:
|
|
|
|
```
|
|
{name=~"rpc.command.*"} | xrpl.node.amendment_blocked = true
|
|
```
|
|
|
|
This surfaces all RPCs served during a blocked period — critical for post-incident analysis.
|
|
|
|
**Key modified files**:
|
|
|
|
- `src/xrpld/rpc/detail/RPCHandler.cpp`
|
|
|
|
**Exit Criteria**:
|
|
|
|
- [ ] `rpc.command.server_info` spans carry `xrpl.node.amendment_blocked` and `xrpl.node.server_state` attributes
|
|
- [ ] No measurable latency impact (attribute values are cached atomics, not computed per-call)
|
|
- [ ] Attributes appear in Tempo trace detail view
|
|
|
|
---
|
|
|
|
## Task 2.9: PathFind RPC Instrumentation
|
|
|
|
**Status**: COMPLETE
|
|
|
|
**Objective**: Trace the path_find and ripple_path_find RPC handlers to capture request latency and computation cost.
|
|
|
|
**Spans added**:
|
|
|
|
- `pathfind.request` — wraps `doPathFind()` and `doRipplePathFind()` RPC handlers
|
|
- `pathfind.compute` — wraps `PathRequest::doUpdate()` (fast/normal attr)
|
|
- `pathfind.update_all` — wraps `PathRequestManager::updateAll()` on ledger close (ledger_index attr)
|
|
- `pathfind.discover` — wraps `Pathfinder::findPaths()` graph exploration (search_level attr)
|
|
- `pathfind.rank` — wraps `Pathfinder::computePathRanks()` liquidity validation (num_paths attr)
|
|
|
|
**New file**: `src/xrpld/rpc/detail/PathFindSpanNames.h`
|
|
|
|
**Modified files**:
|
|
|
|
- `src/xrpld/rpc/handlers/orderbook/PathFind.cpp`
|
|
- `src/xrpld/rpc/handlers/orderbook/RipplePathFind.cpp`
|
|
- `src/xrpld/rpc/detail/PathRequest.cpp`
|
|
- `src/xrpld/rpc/detail/PathRequestManager.cpp`
|
|
- `src/xrpld/rpc/detail/Pathfinder.cpp`
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
| Task | Description | Status | Notes |
|
|
| ---- | ------------------------------------------- | ------------------- | ------------------------------------------------ |
|
|
| 2.1 | W3C Trace Context header extraction | Deferred → Phase 3 | No consumer in Phase 2; needs cross-node tracing |
|
|
| 2.2 | Per-category span creation | Complete (Phase 1c) | Superseded by TraceCategory enum + SpanGuard |
|
|
| 2.3 | Add shouldTraceLedger() interface method | Complete (Phase 1c) | Delivered in Phase 1c base branch |
|
|
| 2.4 | Unit tests for core telemetry | Complete | TelemetryConfig + SpanGuardFactory tests |
|
|
| 2.5 | Enhanced RPC span attributes (HTTP-level) | Deferred | Low value; span duration covers timing natively |
|
|
| 2.6 | Build verification and performance baseline | Complete | Verified in CI on Phase 1c |
|
|
| 2.7 | Grafana Tempo search filters | Complete | rpc-command, rpc-status, rpc-role filters |
|
|
| 2.8 | RPC span attribute enrichment (node health) | Complete | amendment_blocked + server_state |
|
|
| 2.9 | PathFind RPC instrumentation (5 spans) | Complete | request, compute, update_all, discover, rank |
|
|
|
|
**Delivered in this branch**: Tasks 2.4, 2.7, 2.8, 2.9.
|
|
**Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority).
|
|
**Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this).
|