feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests

Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.

- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-04-24 21:43:17 +01:00
parent ccd3ceac84
commit 360ecbdb44
25 changed files with 2387 additions and 24 deletions

View File

@@ -186,14 +186,15 @@ flowchart TB
### Task Lists
| Document | Description |
| ------------------------------------------ | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
| Document | Description |
| -------------------------------------------------------------------------- | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
---

View File

@@ -204,3 +204,36 @@ This surfaces all RPCs served during a blocked period — critical for post-inci
**Delivered in this branch**: Tasks 2.4, 2.7, 2.8, 2.9.
**Deferred with rationale**: Tasks 2.1 (→Phase 3), 2.5 (low priority).
**Superseded**: Task 2.2 (Phase 1c SpanGuard factory covers this).
---
## Known Issues / Future Work
### Thread safety of TelemetryImpl::stop() vs startSpan()
`TelemetryImpl::stop()` resets `sdkProvider_` (a `std::shared_ptr`) without
synchronization. `getTracer()` reads the same member from RPC handler threads.
This is a data race if any thread calls `startSpan()` concurrently with `stop()`.
**Current mitigation**: `Application::stop()` shuts down `serverHandler_`,
`overlay_`, and `jobQueue_` before calling `telemetry_->stop()`, so no callers
remain. See comments in `Telemetry.cpp:stop()` and `Application.cpp`.
**TODO**: Add an `std::atomic<bool> stopped_` flag checked in `getTracer()` to
make this robust against future shutdown order changes.
### Macro incompatibility: XRPL_TRACE_SPAN vs XRPL_TRACE_SET_ATTR
`XRPL_TRACE_SPAN` and `XRPL_TRACE_SPAN_KIND` declare `_xrpl_guard_` as a bare
`SpanGuard`, but `XRPL_TRACE_SET_ATTR` and `XRPL_TRACE_EXCEPTION` call
`_xrpl_guard_.has_value()` which requires `std::optional<SpanGuard>`. Using
`XRPL_TRACE_SPAN` followed by `XRPL_TRACE_SET_ATTR` in the same scope would
fail to compile.
**Current mitigation**: No call site currently uses `XRPL_TRACE_SPAN` — all
production code uses the conditional macros (`XRPL_TRACE_RPC`, `XRPL_TRACE_TX`,
etc.) which correctly wrap the guard in `std::optional`.
**TODO**: Either make `XRPL_TRACE_SPAN`/`XRPL_TRACE_SPAN_KIND` also wrap in
`std::optional`, or document that `XRPL_TRACE_SET_ATTR` is only compatible with
the conditional macros.

View File

@@ -443,3 +443,28 @@ This gives the best of both worlds: guaranteed cross-node correlation via determ
- [ ] <5% overhead on transaction throughput
- [ ] Deterministic trace_id: same trace_id for same tx across all nodes
- [ ] Protobuf span_id propagation preserves parent-child ordering when available
---
## Known Issues / Future Work
### Propagation utilities not yet wired into P2P flow
`extractFromProtobuf()` and `injectToProtobuf()` in `TraceContextPropagator.h`
are implemented and tested but not called from production code. To enable
cross-node distributed traces:
- Call `injectToProtobuf()` in `PeerImp` when sending `TMTransaction` /
`TMProposeSet` messages
- Call `extractFromProtobuf()` in the corresponding message handlers to
reconstruct the parent span context, then pass it to `startSpan()` as the
parent
This was deferred to validate single-node tracing performance first.
### Unused trace_state proto field
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for
W3C `tracestate` vendor-specific key-value pairs but is not read or written by
`TraceContextPropagator`. Wire it when cross-vendor trace propagation is needed.
No wire cost since proto `optional` fields are zero-cost when absent.

View File

@@ -0,0 +1,221 @@
# Phase 5: Integration Test Task List
> **Goal**: End-to-end verification of the complete telemetry pipeline using a
> 6-node consensus network. Proves that RPC, transaction, and consensus spans
> flow through the observability stack (otel-collector, Tempo, Prometheus,
> Grafana) under realistic conditions.
>
> **Scope**: Integration test script, manual testing plan, 6-node local network
> setup, Tempo/Prometheus/Grafana verification.
>
> **Branch**: `pratik/otel-phase5-docs-deployment`
### Related Plan Documents
| Document | Relevance |
| ---------------------------------------------------------------- | ------------------------------------------ |
| [07-observability-backends.md](./07-observability-backends.md) | Tempo, Grafana, Prometheus setup |
| [05-configuration-reference.md](./05-configuration-reference.md) | Collector config, Docker Compose |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 5 tasks, definition of done |
| [Phase5_taskList.md](./Phase5_taskList.md) | Phase 5 main task list (5.6 = integration) |
---
## Task IT.1: Create Integration Test Script
**Objective**: Automated bash script that stands up a 6-node xrpld network
with telemetry, exercises all span categories, and verifies data in
Tempo/Prometheus.
**What to do**:
- Create `docker/telemetry/integration-test.sh`:
- Prerequisites check (docker, xrpld binary, curl, jq)
- Start observability stack via `docker compose`
- Generate 6 validator key pairs via temp standalone xrpld
- Generate 6 node configs + shared `validators.txt`
- Start 6 xrpld nodes in consensus mode (`--start`, no `-a`)
- Wait for all nodes to reach `"proposing"` state (120s timeout)
**Key new file**: `docker/telemetry/integration-test.sh`
**Verification**:
- [ ] Script starts without errors
- [ ] All 6 nodes reach "proposing" state
- [ ] Observability stack is healthy (otel-collector, Tempo, Prometheus, Grafana)
---
## Task IT.2: RPC Span Verification (Phase 2)
**Objective**: Verify RPC spans flow through the telemetry pipeline.
**What to do**:
- Send `server_info`, `server_state`, `ledger` RPCs to node1 (port 5005)
- Wait for batch export (5s)
- Query Tempo API for:
- `rpc.request` spans (ServerHandler::onRequest)
- `rpc.process` spans (ServerHandler::processRequest)
- `rpc.command.server_info` spans (callMethod)
- `rpc.command.server_state` spans (callMethod)
- `rpc.command.ledger` spans (callMethod)
- Verify `xrpl.rpc.command` attribute present on `rpc.command.*` spans
**Verification**:
- [ ] Tempo shows `rpc.request` traces
- [ ] Tempo shows `rpc.process` traces
- [ ] Tempo shows `rpc.command.*` traces with correct attributes
---
## Task IT.3: Transaction Span Verification (Phase 3)
**Objective**: Verify transaction spans flow through the telemetry pipeline.
**What to do**:
- Get genesis account sequence via `account_info` RPC
- Submit Payment transaction using genesis seed (`snoPBrXtMeMyMHUVTgbuqAfg1SUTb`)
- Wait for consensus inclusion (10s)
- Query Tempo API for:
- `tx.process` spans (NetworkOPsImp::processTransaction) on submitting node
- `tx.receive` spans (PeerImp::handleTransaction) on peer nodes
- Verify `xrpl.tx.hash` attribute on `tx.process` spans
- Verify `xrpl.peer.id` attribute on `tx.receive` spans
**Verification**:
- [ ] Tempo shows `tx.process` traces with `xrpl.tx.hash`
- [ ] Tempo shows `tx.receive` traces with `xrpl.peer.id`
---
## Task IT.4: Consensus Span Verification (Phase 4)
**Objective**: Verify consensus spans flow through the telemetry pipeline.
**What to do**:
- Consensus runs automatically in 6-node network
- Query Tempo API for:
- `consensus.proposal.send` (Adaptor::propose)
- `consensus.ledger_close` (Adaptor::onClose)
- `consensus.accept` (Adaptor::onAccept)
- `consensus.validation.send` (Adaptor::validate)
- Verify attributes:
- `xrpl.consensus.mode` on `consensus.ledger_close`
- `xrpl.consensus.proposers` on `consensus.accept`
- `xrpl.consensus.ledger.seq` on `consensus.validation.send`
**Verification**:
- [ ] Tempo shows `consensus.ledger_close` traces with `xrpl.consensus.mode`
- [ ] Tempo shows `consensus.accept` traces with `xrpl.consensus.proposers`
- [ ] Tempo shows `consensus.proposal.send` traces
- [ ] Tempo shows `consensus.validation.send` traces
---
## Task IT.5: Spanmetrics Verification (Phase 5)
**Objective**: Verify spanmetrics connector derives RED metrics from spans.
**What to do**:
- Query Prometheus for `traces_span_metrics_calls_total`
- Query Prometheus for `traces_span_metrics_duration_milliseconds_count`
- Verify Grafana loads at `http://localhost:3000`
**Verification**:
- [ ] Prometheus returns non-empty results for `traces_span_metrics_calls_total`
- [ ] Prometheus returns non-empty results for duration histogram
- [ ] Grafana UI accessible with dashboards visible
---
## Task IT.6: Manual Testing Plan
**Objective**: Document how to run tests manually for future reference.
**What to do**:
- Create `docker/telemetry/TESTING.md` with:
- Prerequisites section
- Single-node standalone test (quick verification)
- 6-node consensus test (full verification)
- Expected span catalog (all 12 span names with attributes)
- Verification queries (Tempo API, Prometheus API)
- Troubleshooting guide
**Key new file**: `docker/telemetry/TESTING.md`
**Verification**:
- [ ] Document covers both single-node and multi-node testing
- [ ] All 12 span names documented with source file and attributes
- [ ] Troubleshooting section covers common failure modes
---
## Task IT.7: Run and Verify
**Objective**: Execute the integration test and validate results.
**What to do**:
- Run `docker/telemetry/integration-test.sh` locally
- Debug any failures
- Leave stack running for manual verification
- Share URLs:
- Tempo: `http://localhost:3200`
- Grafana: `http://localhost:3000`
- Prometheus: `http://localhost:9090`
**Verification**:
- [ ] Script completes with all checks passing
- [ ] Tempo UI shows rippled service with all expected span names
- [ ] Grafana dashboards load and show data
---
## Task IT.8: Commit
**Objective**: Commit all new files to Phase 5 branch.
**What to do**:
- Run `pcc` (pre-commit checks)
- Commit 3 new files to `pratik/otel-phase5-docs-deployment`
**Verification**:
- [ ] `pcc` passes
- [ ] Commit created on Phase 5 branch
---
## Summary
| Task | Description | New Files | Depends On |
| ---- | ----------------------------- | --------- | ---------- |
| IT.1 | Integration test script | 1 | Phase 5 |
| IT.2 | RPC span verification | 0 | IT.1 |
| IT.3 | Transaction span verification | 0 | IT.1 |
| IT.4 | Consensus span verification | 0 | IT.1 |
| IT.5 | Spanmetrics verification | 0 | IT.1 |
| IT.6 | Manual testing plan | 1 | -- |
| IT.7 | Run and verify | 0 | IT.1-IT.6 |
| IT.8 | Commit | 0 | IT.7 |
**Exit Criteria**:
- [ ] All 6 xrpld nodes reach "proposing" state
- [ ] All 11 expected span names visible in Tempo
- [ ] Spanmetrics available in Prometheus
- [ ] Grafana dashboards show data
- [ ] Manual testing plan document complete