Phase 5: Documentation, deployment configs, integration test infrastructure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-03-20 17:22:29 +00:00
parent 41e4f0a240
commit 0fc3f6b5c2
27 changed files with 2392 additions and 27 deletions

View File

@@ -184,6 +184,16 @@ gantt
| `consensus.accept.apply` | `RCLConsensus.cpp:521` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction` |
| `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` |
### Spans Produced
| Span Name | Location | Attributes |
| --------------------------- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `consensus.proposal.send` | `RCLConsensus.cpp:177` | `xrpl.consensus.round` |
| `consensus.ledger_close` | `RCLConsensus.cpp:282` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| `consensus.accept` | `RCLConsensus.cpp:395` | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` |
| `consensus.accept.apply` | `RCLConsensus.cpp:453` | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq` |
| `consensus.validation.send` | `RCLConsensus.cpp:753` | `xrpl.consensus.proposing` |
### Exit Criteria
- [x] Complete consensus round traces

View File

@@ -186,14 +186,15 @@ flowchart TB
### Task Lists
| Document | Description |
| ------------------------------------------ | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
| Document | Description |
| -------------------------------------------------------------------------- | --------------------------------------------------- |
| [POC_taskList.md](./POC_taskList.md) | Proof-of-concept telemetry integration |
| [Phase2_taskList.md](./Phase2_taskList.md) | RPC layer trace instrumentation |
| [Phase3_taskList.md](./Phase3_taskList.md) | Peer overlay & consensus tracing |
| [Phase4_taskList.md](./Phase4_taskList.md) | Transaction lifecycle tracing |
| [Phase5_taskList.md](./Phase5_taskList.md) | Ledger processing & advanced tracing |
| [Phase5_IntegrationTest_taskList.md](./Phase5_IntegrationTest_taskList.md) | Observability stack integration tests |
| [presentation.md](./presentation.md) | Presentation slides for OpenTelemetry plan overview |
---

View File

@@ -185,3 +185,36 @@
| 2.6 | Build verification and performance baseline | 0 | 0 | 2.1-2.5 |
**Parallel work**: Tasks 2.1, 2.2, 2.3 can run in parallel. Task 2.4 depends on 2.3. Task 2.5 can run in parallel with 2.4. Task 2.6 depends on all others.
---
## Known Issues / Future Work
### Thread safety of TelemetryImpl::stop() vs startSpan()
`TelemetryImpl::stop()` resets `sdkProvider_` (a `std::shared_ptr`) without
synchronization. `getTracer()` reads the same member from RPC handler threads.
This is a data race if any thread calls `startSpan()` concurrently with `stop()`.
**Current mitigation**: `Application::stop()` shuts down `serverHandler_`,
`overlay_`, and `jobQueue_` before calling `telemetry_->stop()`, so no callers
remain. See comments in `Telemetry.cpp:stop()` and `Application.cpp`.
**TODO**: Add an `std::atomic<bool> stopped_` flag checked in `getTracer()` to
make this robust against future shutdown order changes.
### Macro incompatibility: XRPL_TRACE_SPAN vs XRPL_TRACE_SET_ATTR
`XRPL_TRACE_SPAN` and `XRPL_TRACE_SPAN_KIND` declare `_xrpl_guard_` as a bare
`SpanGuard`, but `XRPL_TRACE_SET_ATTR` and `XRPL_TRACE_EXCEPTION` call
`_xrpl_guard_.has_value()` which requires `std::optional<SpanGuard>`. Using
`XRPL_TRACE_SPAN` followed by `XRPL_TRACE_SET_ATTR` in the same scope would
fail to compile.
**Current mitigation**: No call site currently uses `XRPL_TRACE_SPAN` — all
production code uses the conditional macros (`XRPL_TRACE_RPC`, `XRPL_TRACE_TX`,
etc.) which correctly wrap the guard in `std::optional`.
**TODO**: Either make `XRPL_TRACE_SPAN`/`XRPL_TRACE_SPAN_KIND` also wrap in
`std::optional`, or document that `XRPL_TRACE_SET_ATTR` is only compatible with
the conditional macros.

View File

@@ -236,3 +236,28 @@
- [ ] Trace context in Protocol Buffer messages
- [ ] HashRouter deduplication visible in traces
- [ ] <5% overhead on transaction throughput
---
## Known Issues / Future Work
### Propagation utilities not yet wired into P2P flow
`extractFromProtobuf()` and `injectToProtobuf()` in `TraceContextPropagator.h`
are implemented and tested but not called from production code. To enable
cross-node distributed traces:
- Call `injectToProtobuf()` in `PeerImp` when sending `TMTransaction` /
`TMProposeSet` messages
- Call `extractFromProtobuf()` in the corresponding message handlers to
reconstruct the parent span context, then pass it to `startSpan()` as the
parent
This was deferred to validate single-node tracing performance first.
### Unused trace_state proto field
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for
W3C `tracestate` vendor-specific key-value pairs but is not read or written by
`TraceContextPropagator`. Wire it when cross-vendor trace propagation is needed.
No wire cost since proto `optional` fields are zero-cost when absent.

View File

@@ -0,0 +1,221 @@
# Phase 5: Integration Test Task List
> **Goal**: End-to-end verification of the complete telemetry pipeline using a
> 6-node consensus network. Proves that RPC, transaction, and consensus spans
> flow through the observability stack (otel-collector, Jaeger, Prometheus,
> Grafana) under realistic conditions.
>
> **Scope**: Integration test script, manual testing plan, 6-node local network
> setup, Jaeger/Prometheus/Grafana verification.
>
> **Branch**: `pratik/otel-phase5-docs-deployment`
### Related Plan Documents
| Document | Relevance |
| ---------------------------------------------------------------- | ------------------------------------------ |
| [07-observability-backends.md](./07-observability-backends.md) | Jaeger, Grafana, Prometheus setup |
| [05-configuration-reference.md](./05-configuration-reference.md) | Collector config, Docker Compose |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 5 tasks, definition of done |
| [Phase5_taskList.md](./Phase5_taskList.md) | Phase 5 main task list (5.6 = integration) |
---
## Task IT.1: Create Integration Test Script
**Objective**: Automated bash script that stands up a 6-node xrpld network
with telemetry, exercises all span categories, and verifies data in
Jaeger/Prometheus.
**What to do**:
- Create `docker/telemetry/integration-test.sh`:
- Prerequisites check (docker, xrpld binary, curl, jq)
- Start observability stack via `docker compose`
- Generate 6 validator key pairs via temp standalone xrpld
- Generate 6 node configs + shared `validators.txt`
- Start 6 xrpld nodes in consensus mode (`--start`, no `-a`)
- Wait for all nodes to reach `"proposing"` state (120s timeout)
**Key new file**: `docker/telemetry/integration-test.sh`
**Verification**:
- [ ] Script starts without errors
- [ ] All 6 nodes reach "proposing" state
- [ ] Observability stack is healthy (otel-collector, Jaeger, Prometheus, Grafana)
---
## Task IT.2: RPC Span Verification (Phase 2)
**Objective**: Verify RPC spans flow through the telemetry pipeline.
**What to do**:
- Send `server_info`, `server_state`, `ledger` RPCs to node1 (port 5005)
- Wait for batch export (5s)
- Query Jaeger API for:
- `rpc.request` spans (ServerHandler::onRequest)
- `rpc.process` spans (ServerHandler::processRequest)
- `rpc.command.server_info` spans (callMethod)
- `rpc.command.server_state` spans (callMethod)
- `rpc.command.ledger` spans (callMethod)
- Verify `xrpl.rpc.command` attribute present on `rpc.command.*` spans
**Verification**:
- [ ] Jaeger shows `rpc.request` traces
- [ ] Jaeger shows `rpc.process` traces
- [ ] Jaeger shows `rpc.command.*` traces with correct attributes
---
## Task IT.3: Transaction Span Verification (Phase 3)
**Objective**: Verify transaction spans flow through the telemetry pipeline.
**What to do**:
- Get genesis account sequence via `account_info` RPC
- Submit Payment transaction using genesis seed (`snoPBrXtMeMyMHUVTgbuqAfg1SUTb`)
- Wait for consensus inclusion (10s)
- Query Jaeger API for:
- `tx.process` spans (NetworkOPsImp::processTransaction) on submitting node
- `tx.receive` spans (PeerImp::handleTransaction) on peer nodes
- Verify `xrpl.tx.hash` attribute on `tx.process` spans
- Verify `xrpl.peer.id` attribute on `tx.receive` spans
**Verification**:
- [ ] Jaeger shows `tx.process` traces with `xrpl.tx.hash`
- [ ] Jaeger shows `tx.receive` traces with `xrpl.peer.id`
---
## Task IT.4: Consensus Span Verification (Phase 4)
**Objective**: Verify consensus spans flow through the telemetry pipeline.
**What to do**:
- Consensus runs automatically in 6-node network
- Query Jaeger API for:
- `consensus.proposal.send` (Adaptor::propose)
- `consensus.ledger_close` (Adaptor::onClose)
- `consensus.accept` (Adaptor::onAccept)
- `consensus.validation.send` (Adaptor::validate)
- Verify attributes:
- `xrpl.consensus.mode` on `consensus.ledger_close`
- `xrpl.consensus.proposers` on `consensus.accept`
- `xrpl.consensus.ledger.seq` on `consensus.validation.send`
**Verification**:
- [ ] Jaeger shows `consensus.ledger_close` traces with `xrpl.consensus.mode`
- [ ] Jaeger shows `consensus.accept` traces with `xrpl.consensus.proposers`
- [ ] Jaeger shows `consensus.proposal.send` traces
- [ ] Jaeger shows `consensus.validation.send` traces
---
## Task IT.5: Spanmetrics Verification (Phase 5)
**Objective**: Verify spanmetrics connector derives RED metrics from spans.
**What to do**:
- Query Prometheus for `traces_span_metrics_calls_total`
- Query Prometheus for `traces_span_metrics_duration_milliseconds_count`
- Verify Grafana loads at `http://localhost:3000`
**Verification**:
- [ ] Prometheus returns non-empty results for `traces_span_metrics_calls_total`
- [ ] Prometheus returns non-empty results for duration histogram
- [ ] Grafana UI accessible with dashboards visible
---
## Task IT.6: Manual Testing Plan
**Objective**: Document how to run tests manually for future reference.
**What to do**:
- Create `docker/telemetry/TESTING.md` with:
- Prerequisites section
- Single-node standalone test (quick verification)
- 6-node consensus test (full verification)
- Expected span catalog (all 12 span names with attributes)
- Verification queries (Jaeger API, Prometheus API)
- Troubleshooting guide
**Key new file**: `docker/telemetry/TESTING.md`
**Verification**:
- [ ] Document covers both single-node and multi-node testing
- [ ] All 12 span names documented with source file and attributes
- [ ] Troubleshooting section covers common failure modes
---
## Task IT.7: Run and Verify
**Objective**: Execute the integration test and validate results.
**What to do**:
- Run `docker/telemetry/integration-test.sh` locally
- Debug any failures
- Leave stack running for manual verification
- Share URLs:
- Jaeger: `http://localhost:16686`
- Grafana: `http://localhost:3000`
- Prometheus: `http://localhost:9090`
**Verification**:
- [ ] Script completes with all checks passing
- [ ] Jaeger UI shows rippled service with all expected span names
- [ ] Grafana dashboards load and show data
---
## Task IT.8: Commit
**Objective**: Commit all new files to Phase 5 branch.
**What to do**:
- Run `pcc` (pre-commit checks)
- Commit 3 new files to `pratik/otel-phase5-docs-deployment`
**Verification**:
- [ ] `pcc` passes
- [ ] Commit created on Phase 5 branch
---
## Summary
| Task | Description | New Files | Depends On |
| ---- | ----------------------------- | --------- | ---------- |
| IT.1 | Integration test script | 1 | Phase 5 |
| IT.2 | RPC span verification | 0 | IT.1 |
| IT.3 | Transaction span verification | 0 | IT.1 |
| IT.4 | Consensus span verification | 0 | IT.1 |
| IT.5 | Spanmetrics verification | 0 | IT.1 |
| IT.6 | Manual testing plan | 1 | -- |
| IT.7 | Run and verify | 0 | IT.1-IT.6 |
| IT.8 | Commit | 0 | IT.7 |
**Exit Criteria**:
- [ ] All 6 xrpld nodes reach "proposing" state
- [ ] All 11 expected span names visible in Jaeger
- [ ] Spanmetrics available in Prometheus
- [ ] Grafana dashboards show data
- [ ] Manual testing plan document complete