Document complete 71-check enumeration and working/not-working status

Add "What All 71 Checks Means" section to Phase10_taskList.md with
every span, metric, and dashboard listed by name and check number.
Add "Current Status" section enumerating what works (11 items) and
what is not working/not implemented (7 items). Update
06-implementation-phases.md with validation inventory summary and
status. Fix stale "16 spans" reference to "17 spans".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pratik Mankawde
2026-03-16 13:38:11 +00:00
parent 84cf05d230
commit 4bdec8481e
2 changed files with 209 additions and 15 deletions


@@ -792,7 +792,7 @@ flowchart LR
subgraph validation["Validation Suite"]
SV["Span Validator<br/>(Jaeger API)"]
MV["Metric Validator<br/>(Prometheus API,<br/>required + optional tiers)"]
MV["Metric Validator<br/>(Prometheus API,<br/>all 26 metrics required)"]
DV["Dashboard Validator<br/>(Grafana API)"]
BM["Benchmark Suite<br/>(CPU, memory, latency<br/>ON vs OFF comparison)"]
end
@@ -821,9 +821,12 @@ flowchart LR
### Key Implementation Details
- **Transaction submitter** uses rippled's native WebSocket command format (`{"command": "submit", ...}`) not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
- **Transaction submitter and RPC load generator** both use rippled's native WebSocket command format (`{"command": ...}`) not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level.
- **Node config** requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics.
- **Metric validation** requires every metric in `expected_metrics.json` to have > 0 Prometheus series. The workload generators must produce enough load to trigger all metrics, including `ios_latency` (I/O thread latency >= 10ms threshold).
- **Metric validation** uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series.
- **StatsD gauge fix**: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus.
- **I/O latency fix**: `io_latency_sampler` emits unconditionally on first sample, then applies the 10 ms threshold. This ensures `ios_latency` is registered in Prometheus even in low-load CI environments.
- **tx.receive span**: Sets default attributes (`xrpl.tx.suppressed = false`, `xrpl.tx.status = "new"`) on span creation so they are always present. The suppressed/bad code paths override these when applicable.
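The native WebSocket command format described above can be sketched in Python. This is a minimal illustration, not the actual submitter script; the helper names and the example `tx_blob` are hypothetical, while the `command`/`result`/`status` layout follows the description above:

```python
import json

def build_submit_command(tx_blob: str, request_id: int = 1) -> str:
    """Build a rippled native WebSocket command (not JSON-RPC).

    The native format puts "command" at the top level; a JSON-RPC
    {"method": ..., "params": [...]} wrapper is not what rippled's
    WebSocket endpoint expects.
    """
    return json.dumps({"id": request_id, "command": "submit", "tx_blob": tx_blob})

def parse_submit_response(raw: str) -> tuple[str, dict]:
    """Split a native-format response into (status, result).

    "status" lives at the top level; the payload lives under "result".
    """
    msg = json.loads(raw)
    return msg.get("status", "error"), msg.get("result", {})
```

The same parsing applies to the RPC load generator, since both tools share the native command format.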
### Tasks
@@ -841,11 +844,40 @@ flowchart LR
See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown.
### Validation Check Inventory (71 Checks)
The validation suite (`validate_telemetry.py`) runs exactly 71 checks, broken down as:
- **1 service registration** — `rippled` exists in Jaeger
- **17 span existence** — `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive`
- **14 span attribute** — required attributes on the 14 spans that define them (22 unique attributes total)
- **2 span hierarchies** — `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread)
- **1 span duration bounds** — all spans > 0 and < 60 s
- **26 metric existence** — 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`)
- **10 dashboard loads** — `rippled-rpc-perf`, `rippled-transactions`, `rippled-consensus`, `rippled-ledger-ops`, `rippled-peer-net`, `rippled-system-node-health`, `rippled-system-network`, `rippled-system-rpc`, `rippled-system-overlay-detail`, `rippled-system-ledger-sync`
See [Phase10_taskList.md](./Phase10_taskList.md) for the full numbered check-by-check enumeration.
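The arithmetic behind the 71-check total can be double-checked with a small sketch; the category labels below are illustrative names, and the counts come from the list above:

```python
# Per-category check counts, as enumerated in the list above.
CHECKS = {
    "service_registration": 1,
    "span_existence": 17,
    "span_attributes": 14,
    "span_hierarchies": 2,
    "span_durations": 1,
    "metric_existence": 26,
    "dashboard_loads": 10,
}

assert sum(CHECKS.values()) == 71
```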
### Current Status
**Working** (71/71 checks pass in CI):
All 17 spans, 26 metrics, 10 dashboards, 14 attribute checks, 2 hierarchies, and duration bounds validated.
**Not implemented or not available in CI**:
1. Performance benchmark suite (Task 10.5) not started
2. `rpc.request` -> `rpc.process` parent-child hierarchy — skipped (cross-thread context propagation)
3. Log-trace correlation validation (Loki) — not included in checks
4. Full 255+ StatsD metric coverage — only 26 representative metrics validated
5. Sustained load / backpressure testing — not implemented
6. `docs/telemetry-runbook.md` updates — not done
7. `09-data-collection-reference.md` "Validation" section — not done
### Exit Criteria
- [x] 2-node validator cluster starts and reaches consensus
- [ ] Validation suite confirms all required spans, attributes, and metrics
- [ ] All 10 Grafana dashboards render data
- [x] Validation suite confirms all required spans, attributes, and metrics (71/71 checks)
- [x] All 10 Grafana dashboards render data
- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- [x] CI workflow runs validation on telemetry branch changes


@@ -22,7 +22,7 @@
Before Phases 1-9 can be considered production-ready, we need proof that:
1. All 16 spans fire with correct attributes under real transaction workloads
1. All 17 spans fire with correct attributes under real transaction workloads
2. All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values
3. Log-trace correlation (Phase 8) produces clickable trace_id links in Loki
4. All 10 Grafana dashboards render meaningful data (no empty panels)
@@ -108,19 +108,32 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**Implementation notes**:
- `validate_telemetry.py` runs all checks and produces a JSON report.
- `validate_telemetry.py` runs **71 checks** and produces a JSON report.
The 71 checks break down as:
| Category | Count | Source |
| -------------------- | ----- | ----------------------------------------- |
| Service registration | 1 | Jaeger: `rippled` service exists |
| Span existence | 17 | 17 span names in `expected_spans.json` |
| Span attributes | 14 | Spans with `required_attributes` |
| Span hierarchies | 2 | Parent-child relationships (1 skipped) |
| Span durations | 1 | All spans > 0 and < 60 s |
| Metric existence | 26 | 26 metric names in `expected_metrics.json` |
| Dashboard loads | 10 | 10 Grafana UIDs in `expected_metrics.json` |
**Span validation** (queries Jaeger API):
- Lists all registered operations as diagnostics
- Asserts span names from `expected_spans.json` appear in traces
- Validates required attributes per span type
- Validates parent-child span hierarchies
- Asserts all span durations are within bounds (> 0)
- Asserts 17 span names from `expected_spans.json` appear in traces
- Validates required attributes on the 14 spans that define them
- Validates 2 active parent-child span hierarchies (1 skipped cross-thread)
- Asserts all span durations are within bounds (> 0, < 60 s)
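The span-existence step above can be sketched against Jaeger's query API. This is a hedged sketch, not the actual validator code: the endpoint shape (`/api/services/{service}/operations` returning `{"data": [...]}`) is Jaeger's query API, while the base URL and helper names are assumptions:

```python
import json
from urllib.request import urlopen

JAEGER = "http://localhost:16686"  # assumed CI address

def fetch_operations(service: str = "rippled") -> set[str]:
    """List registered operation names from the Jaeger query API."""
    with urlopen(f"{JAEGER}/api/services/{service}/operations") as resp:
        return set(json.load(resp).get("data") or [])

def missing_spans(expected: list[str], operations: set[str]) -> list[str]:
    """Expected span names with no match; 'rpc.command.*' matches by prefix."""
    missing = []
    for name in expected:
        if name.endswith(".*"):
            prefix = name[:-1]  # keep the trailing dot: "rpc.command."
            if not any(op.startswith(prefix) for op in operations):
                missing.append(name)
        elif name not in operations:
            missing.append(name)
    return missing
```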
**Metric validation** (queries Prometheus API):
- Lists all metric names as diagnostics (helps debug naming issues)
- Every metric in `expected_metrics.json` must have > 0 Prometheus series — absence is a FAIL
- Validates: SpanMetrics, StatsD gauges/counters/histograms, overlay traffic, Phase 9 OTLP metrics (nodestore, cache, txq, rpc_method, object_count, load_factor)
- All 26 metrics in `expected_metrics.json` must have > 0 Prometheus series — absence is a FAIL
- Uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale gauges — beast::insight StatsD gauges only emit on value changes, so a gauge that stabilizes goes stale in Prometheus after ~5 minutes
- Validates: 4 SpanMetrics, 6 StatsD gauges, 2 StatsD counters, 3 StatsD histograms, 4 overlay traffic, 7 Phase 9 OTLP metrics (nodestore, cache, txq, rpc_method, object_count, load_factor)
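The series-based check can be sketched as follows. The `/api/v1/series` endpoint and its `match[]` parameter are standard Prometheus HTTP API; the base URL and function names are assumptions for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://localhost:9090"  # assumed CI address

def series_count(metric: str) -> int:
    """Count series via /api/v1/series.

    Series membership survives stale gauges; an instant query would come
    back empty once a sample is older than Prometheus's staleness window
    (~5 min), producing false negatives for gauges that stop changing.
    """
    url = f"{PROM}/api/v1/series?" + urlencode({"match[]": metric})
    with urlopen(url) as resp:
        return len(json.load(resp).get("data", []))

def check_metrics(expected: list[str], count=series_count) -> list[str]:
    """Return the metrics that FAIL (zero series)."""
    return [m for m in expected if count(m) == 0]
```

Injecting `count` makes the pass/fail logic testable without a live Prometheus.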
**Dashboard validation**:
- Queries Grafana API for each of the 10 dashboard UIDs
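A dashboard-load check can be sketched against Grafana's HTTP API (`/api/dashboards/uid/{uid}` is the real endpoint; the base URL, token handling, and helper names are assumptions):

```python
import json
from urllib.request import Request, urlopen

GRAFANA = "http://localhost:3000"  # assumed CI address

def panel_count(dashboard_json: dict) -> int:
    """Number of panels in a dashboard API response (0 => FAIL)."""
    return len(dashboard_json.get("dashboard", {}).get("panels", []))

def fetch_dashboard(uid: str, token: str) -> dict:
    """Fetch one dashboard by UID from the Grafana API."""
    req = Request(
        f"{GRAFANA}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(req) as resp:
        return json.load(resp)
```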
@@ -228,13 +241,162 @@ Before Phases 1-9 can be considered production-ready, we need proof that:
**Total Effort**: 10 days
## What "All 71 Checks" Means — Complete Enumeration
The validation suite (`validate_telemetry.py`) runs exactly **71 checks** grouped into 7 categories. Every item below is validated by name in CI. Nothing is optional: failure of any single check fails the entire suite.
### 1. Service Registration (1 check)
| # | Check | Backend |
| --- | ---------------------------------- | ---------------------- |
| 1 | `rippled` service exists in Jaeger | Jaeger `/api/services` |
### 2. Span Existence (17 checks)
Each span name must appear at least once in Jaeger traces for the `rippled` service.
| # | Span Name | Category | Config Flag |
| --- | --------------------------- | ----------- | ---------------------- |
| 2 | `rpc.request` | RPC | `trace_rpc=1` |
| 3 | `rpc.process` | RPC | `trace_rpc=1` |
| 4 | `rpc.ws_message` | RPC | `trace_rpc=1` |
| 5 | `rpc.command.*` | RPC | `trace_rpc=1` |
| 6 | `tx.process` | Transaction | `trace_transactions=1` |
| 7 | `tx.receive` | Transaction | `trace_transactions=1` |
| 8 | `tx.apply` | Transaction | `trace_transactions=1` |
| 9 | `consensus.proposal.send` | Consensus | `trace_consensus=1` |
| 10 | `consensus.ledger_close` | Consensus | `trace_consensus=1` |
| 11 | `consensus.accept` | Consensus | `trace_consensus=1` |
| 12 | `consensus.validation.send` | Consensus | `trace_consensus=1` |
| 13 | `consensus.accept.apply` | Consensus | `trace_consensus=1` |
| 14 | `ledger.build` | Ledger | `trace_ledger=1` |
| 15 | `ledger.validate` | Ledger | `trace_ledger=1` |
| 16 | `ledger.store` | Ledger | `trace_ledger=1` |
| 17 | `peer.proposal.receive` | Peer | `trace_peer=1` |
| 18 | `peer.validation.receive` | Peer | `trace_peer=1` |
### 3. Span Attribute Validation (14 checks)
14 of the 17 spans define `required_attributes`. Each check asserts all listed attributes are present on at least one instance of that span.
| # | Span Name | Required Attributes |
| --- | --------------------------- | -------------------------------------------------------------------------------------------------- |
| 19 | `rpc.command.*` | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`, `xrpl.rpc.duration_ms` |
| 20 | `tx.process` | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` |
| 21 | `tx.receive` | `xrpl.peer.id`, `xrpl.tx.hash`, `xrpl.tx.suppressed`, `xrpl.tx.status` |
| 22 | `tx.apply` | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` |
| 23 | `consensus.proposal.send` | `xrpl.consensus.round` |
| 24 | `consensus.ledger_close` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` |
| 25 | `consensus.accept` | `xrpl.consensus.proposers` |
| 26 | `consensus.validation.send` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` |
| 27 | `consensus.accept.apply` | `xrpl.consensus.close_time`, `xrpl.consensus.ledger.seq` |
| 28 | `ledger.build` | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` |
| 29 | `ledger.validate` | `xrpl.ledger.seq`, `xrpl.ledger.validations` |
| 30 | `ledger.store` | `xrpl.ledger.seq` |
| 31 | `peer.proposal.receive` | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` |
| 32 | `peer.validation.receive` | `xrpl.peer.id`, `xrpl.peer.validation.trusted` |
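An attribute check per span type reduces to set membership over the span's tags. Jaeger's JSON API returns tags as a list of `{"key": ..., "value": ...}` objects; the helper names below are illustrative, not the validator's actual code:

```python
def tags_to_dict(tags: list[dict]) -> dict:
    """Flatten Jaeger's [{'key': ..., 'value': ...}, ...] tag list."""
    return {t["key"]: t.get("value") for t in tags}

def missing_attributes(span_tags: dict, required: list[str]) -> list[str]:
    """Required attribute keys absent from one span instance's tags."""
    return [k for k in required if k not in span_tags]
```

A span type passes if at least one of its instances has an empty `missing_attributes` result.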
### 4. Span Parent-Child Hierarchies (2 checks)
| # | Parent | Child | Status | Notes |
| --- | -------------- | --------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------- |
| 33 | `rpc.process`  | `rpc.command.*` | Active  | Same thread; always valid |
| 34 | `ledger.build` | `tx.apply`      | Active  | Same thread; always valid |
| -- | `rpc.request` | `rpc.process` | Skipped | Cross-thread: `onRequest` posts to JobQueue coroutine. Span context not propagated across thread boundary. Requires C++ fix. |
### 5. Span Duration Bounds (1 check)
| # | Check | Criteria |
| --- | ------------------------------ | ---------------------------------------- |
| 35 | All spans have valid durations | Every span duration > 0 and < 60 seconds |
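Since Jaeger reports span durations in microseconds, the bounds check is a one-liner (sketch; the function name is illustrative):

```python
def duration_in_bounds(duration_us: int, max_seconds: int = 60) -> bool:
    """Jaeger durations are in microseconds; require > 0 and < 60 s."""
    return 0 < duration_us < max_seconds * 1_000_000
```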
### 6. Metric Existence (26 checks)
Each metric name must have > 0 series in Prometheus (queried via `/api/v1/series` to avoid stale-gauge false negatives).
| # | Metric Name | Category | Source |
| --- | -------------------------------------------------- | ---------------- | ------------------------------------ |
| 36 | `traces_span_metrics_calls_total` | SpanMetrics | OTel Collector spanmetrics connector |
| 37 | `traces_span_metrics_duration_milliseconds_bucket` | SpanMetrics | OTel Collector spanmetrics connector |
| 38 | `traces_span_metrics_duration_milliseconds_count` | SpanMetrics | OTel Collector spanmetrics connector |
| 39 | `traces_span_metrics_duration_milliseconds_sum` | SpanMetrics | OTel Collector spanmetrics connector |
| 40 | `rippled_LedgerMaster_Validated_Ledger_Age` | StatsD Gauge | beast::insight via StatsD UDP |
| 41 | `rippled_LedgerMaster_Published_Ledger_Age` | StatsD Gauge | beast::insight via StatsD UDP |
| 42 | `rippled_State_Accounting_Full_duration` | StatsD Gauge | beast::insight via StatsD UDP |
| 43 | `rippled_Peer_Finder_Active_Inbound_Peers` | StatsD Gauge | beast::insight via StatsD UDP |
| 44 | `rippled_Peer_Finder_Active_Outbound_Peers` | StatsD Gauge | beast::insight via StatsD UDP |
| 45 | `rippled_jobq_job_count` | StatsD Gauge | beast::insight via StatsD UDP |
| 46 | `rippled_rpc_requests_total` | StatsD Counter | beast::insight via StatsD UDP |
| 47 | `rippled_ledger_fetches_total` | StatsD Counter | beast::insight via StatsD UDP |
| 48 | `rippled_rpc_time` | StatsD Histogram | beast::insight via StatsD UDP |
| 49 | `rippled_rpc_size` | StatsD Histogram | beast::insight via StatsD UDP |
| 50 | `rippled_ios_latency` | StatsD Histogram | beast::insight via StatsD UDP |
| 51 | `rippled_total_Bytes_In` | Overlay Traffic | beast::insight via StatsD UDP |
| 52 | `rippled_total_Bytes_Out` | Overlay Traffic | beast::insight via StatsD UDP |
| 53 | `rippled_total_Messages_In` | Overlay Traffic | beast::insight via StatsD UDP |
| 54 | `rippled_total_Messages_Out` | Overlay Traffic | beast::insight via StatsD UDP |
| 55 | `nodestore_state` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 56 | `cache_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 57 | `txq_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 58 | `rpc_method_started_total` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 59 | `rpc_method_finished_total` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 60 | `object_count` | Phase 9 OTLP | MetricsRegistry via OTLP |
| 61 | `load_factor_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP |
### 7. Dashboard Loads (10 checks)
Each Grafana dashboard must load successfully and contain at least one panel.
| # | Dashboard UID | Dashboard Name |
| --- | ------------------------------- | ---------------------- |
| 62 | `rippled-rpc-perf` | RPC Performance |
| 63 | `rippled-transactions` | Transactions |
| 64 | `rippled-consensus` | Consensus |
| 65 | `rippled-ledger-ops` | Ledger Operations |
| 66 | `rippled-peer-net` | Peer Network |
| 67 | `rippled-system-node-health` | System: Node Health |
| 68 | `rippled-system-network` | System: Network |
| 69 | `rippled-system-rpc` | System: RPC |
| 70 | `rippled-system-overlay-detail` | System: Overlay Detail |
| 71 | `rippled-system-ledger-sync` | System: Ledger Sync |
---
## Current Status: What Is Working vs. What Is Not
### Working (validated in CI run 23144741908 — 71/71 PASS)
1. **All 17 spans fire** with correct attributes under real workload (RPC + transaction + consensus)
2. **All 26 metrics exist** in Prometheus with non-zero series counts
3. **All 10 Grafana dashboards** load and render panels
4. **All 14 span attribute checks** pass, including `tx.receive` (fixed: default attributes on span creation)
5. **Both parent-child hierarchies** validate (`rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply`)
6. **All span durations** are within bounds (> 0, < 60 s)
7. **RPC load generator** fires 11 command types with < 50% error rate (native WS format)
8. **Transaction submitter** generates 10 transaction types at configurable TPS
9. **2-node validator cluster** starts and reaches consensus in CI
10. **CI workflow** (`telemetry-validation.yml`) runs on push to `pratik/otel-phase10-*` branches and on `workflow_dispatch`
11. **Validation report** is JSON with exit codes, suitable for CI gating
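The CI-gating contract in item 11 can be sketched as a pure function: a map of check name to pass/fail becomes a JSON-serializable report plus a process exit code (names and report shape are illustrative, not the suite's actual schema):

```python
def make_report(results: dict[str, bool]) -> tuple[dict, int]:
    """Build a CI-parseable report; nonzero exit code if any check failed."""
    failed = sorted(name for name, ok in results.items() if not ok)
    report = {
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed": failed,
    }
    return report, (1 if failed else 0)
```

CI only needs `json.dumps(report)` on stdout and the exit code to gate the branch.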
### Not Working / Not Available in CI / Not Implemented Yet
1. **Performance benchmark suite** (`benchmark.sh`, `collect_system_metrics.sh`) **not implemented**. Task 10.5 is not started. The exit criterion "Benchmark shows < 3% CPU overhead, < 5MB memory overhead" is **not met**.
2. **`rpc.request` -> `rpc.process` parent-child hierarchy** — **skipped** (not validated). Cross-thread span context propagation is broken: `onRequest` posts a coroutine to the JobQueue for `processRequest`, but the span context is not forwarded through the `std::function` lambda. Requires a C++ fix to capture and inject the parent span into the coroutine.
3. **Log-trace correlation validation** (Phase 8 Loki `trace_id` links) — **not included** in the 71 checks. The validation suite does not query Loki. This was listed in "Why This Phase Exists" item 3 but is not covered by the current validation.
4. **Full StatsD metric coverage** — the validation checks 26 representative metrics, not the full 255+ beast::insight StatsD metrics. Covering all 255+ would require a complete metric enumeration and significantly longer workload runs to trigger every code path.
5. **Sustained load / backpressure testing** — listed in "Why This Phase Exists" item 6 ("telemetry stack survives sustained load without data loss") but **not implemented**. The current workload runs for ~2 minutes, not long enough to test queue saturation.
6. **`docs/telemetry-runbook.md` updates** — Task 10.7 mentions adding "Validating Telemetry Stack" and "Performance Benchmarking" sections. The runbook has **not been updated**.
7. **`09-data-collection-reference.md` updates** — Task 10.7 mentions adding a "Validation" section with expected metric/span counts. This has **not been updated**.
---
## Exit Criteria
- [x] 2-node validator cluster starts and reaches consensus
- [x] RPC load generator fires all traced RPC commands at configurable rates
- [x] Transaction submitter generates 10 transaction types at configurable TPS
- [ ] Validation suite confirms all spans, attributes, and metrics pass
- [ ] All 10 Grafana dashboards render data
- [x] Validation suite confirms all spans, attributes, and metrics pass (71/71 checks)
- [x] All 10 Grafana dashboards render data
- [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead
- [x] CI workflow runs validation on telemetry branch changes
- [x] Validation report output is CI-parseable (JSON with exit codes)