From 4bdec8481eecbb728bc2a4c39870ff78bf38e307 Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Mon, 16 Mar 2026 13:38:11 +0000 Subject: [PATCH] Document complete 71-check enumeration and working/not-working status Add "What All 71 Checks Means" section to Phase10_taskList.md with every span, metric, and dashboard listed by name and check number. Add "Current Status" section enumerating what works (11 items) and what is not working/not implemented (7 items). Update 06-implementation-phases.md with validation inventory summary and status. Fix stale "16 spans" reference to "17 spans". Co-Authored-By: Claude Opus 4.6 --- OpenTelemetryPlan/06-implementation-phases.md | 42 +++- OpenTelemetryPlan/Phase10_taskList.md | 182 +++++++++++++++++- 2 files changed, 209 insertions(+), 15 deletions(-) diff --git a/OpenTelemetryPlan/06-implementation-phases.md b/OpenTelemetryPlan/06-implementation-phases.md index 9aadb086f1..ee2ef935b1 100644 --- a/OpenTelemetryPlan/06-implementation-phases.md +++ b/OpenTelemetryPlan/06-implementation-phases.md @@ -792,7 +792,7 @@ flowchart LR subgraph validation["Validation Suite"] SV["Span Validator
(Jaeger API)"] - MV["Metric Validator<br/>(Prometheus API,<br/>required + optional tiers)"] + MV["Metric Validator<br/>(Prometheus API,<br/>all 26 metrics required)"] DV["Dashboard Validator<br/>(Grafana API)"] BM["Benchmark Suite<br/>(CPU, memory, latency
ON vs OFF comparison)"] end @@ -821,9 +821,12 @@ flowchart LR ### Key Implementation Details -- **Transaction submitter** uses rippled's native WebSocket command format (`{"command": "submit", ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level. +- **Transaction submitter and RPC load generator** both use rippled's native WebSocket command format (`{"command": ...}`) — not JSON-RPC format. Response data lives inside `"result"` with `"status"` at the top level. - **Node config** requires `[signing_support] true` for server-side signing, and `[ips]` (not `[ips_fixed]`) to ensure peer connections count in `Peer_Finder_Active_*` metrics. -- **Metric validation** requires every metric in `expected_metrics.json` to have > 0 Prometheus series. The workload generators must produce enough load to trigger all metrics, including `ios_latency` (I/O thread latency >= 10ms threshold). +- **Metric validation** uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale StatsD gauges. Every metric in `expected_metrics.json` must have > 0 series. +- **StatsD gauge fix**: `StatsDGaugeImpl` initializes `m_dirty = true` so all gauges emit their initial value on first flush. Without this, gauges starting at 0 that never change (e.g. `jobq_job_count`) would be invisible in Prometheus. +- **I/O latency fix**: `io_latency_sampler` emits unconditionally on first sample, then applies the 10 ms threshold. This ensures `ios_latency` is registered in Prometheus even in low-load CI environments. +- **tx.receive span**: Sets default attributes (`xrpl.tx.suppressed = false`, `xrpl.tx.status = "new"`) on span creation so they are always present. The suppressed/bad code paths override these when applicable. ### Tasks @@ -841,11 +844,40 @@ flowchart LR See [Phase10_taskList.md](./Phase10_taskList.md) for detailed per-task breakdown. 
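The `/api/v1/series` metric check described above can be sketched in Python. This is a hypothetical illustration, not the actual `validate_telemetry.py` code: the helper names and response handling are assumptions, and only the endpoint choice and the "> 0 series" rule come from this document.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical sketch of the series-endpoint check. /api/v1/series lists
# matching series regardless of staleness, so a StatsD gauge that stopped
# changing still counts as present.
def series_url(base: str, metric: str) -> str:
    """Build the /api/v1/series URL for one metric name."""
    return f"{base}/api/v1/series?" + urlencode({"match[]": metric})

def metric_exists(response: dict) -> bool:
    """A metric passes when Prometheus reports > 0 series for it."""
    return response.get("status") == "success" and len(response.get("data", [])) > 0

def check_metric(base: str, metric: str) -> bool:
    """Query Prometheus and apply the > 0 series rule (network call)."""
    with urlopen(series_url(base, metric)) as resp:
        return metric_exists(json.load(resp))
```

An instant query (`/api/v1/query`) would instead return no sample for a gauge that went stale, which is exactly the false negative the series endpoint avoids.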
+### Validation Check Inventory (71 Checks) + +The validation suite (`validate_telemetry.py`) runs exactly 71 checks, broken down as: + +- **1 service registration** — `rippled` exists in Jaeger +- **17 span existence** — `rpc.request`, `rpc.process`, `rpc.ws_message`, `rpc.command.*`, `tx.process`, `tx.receive`, `tx.apply`, `consensus.proposal.send`, `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply`, `ledger.build`, `ledger.validate`, `ledger.store`, `peer.proposal.receive`, `peer.validation.receive` +- **14 span attribute** — required attributes on the 14 spans that define them (22 unique attributes total) +- **2 span hierarchies** — `rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply` (1 skipped: `rpc.request` -> `rpc.process`, cross-thread) +- **1 span duration bounds** — all spans > 0 and < 60 s +- **26 metric existence** — 4 SpanMetrics (`traces_span_metrics_calls_total`, `..._duration_milliseconds_{bucket,count,sum}`), 6 StatsD gauges (`LedgerMaster_Validated_Ledger_Age`, `Published_Ledger_Age`, `State_Accounting_Full_duration`, `Peer_Finder_Active_{Inbound,Outbound}_Peers`, `jobq_job_count`), 2 StatsD counters (`rpc_requests_total`, `ledger_fetches_total`), 3 StatsD histograms (`rpc_time`, `rpc_size`, `ios_latency`), 4 overlay traffic (`total_Bytes_{In,Out}`, `total_Messages_{In,Out}`), 7 Phase 9 OTLP (`nodestore_state`, `cache_metrics`, `txq_metrics`, `rpc_method_{started,finished}_total`, `object_count`, `load_factor_metrics`) +- **10 dashboard loads** — `rippled-rpc-perf`, `rippled-transactions`, `rippled-consensus`, `rippled-ledger-ops`, `rippled-peer-net`, `rippled-system-node-health`, `rippled-system-network`, `rippled-system-rpc`, `rippled-system-overlay-detail`, `rippled-system-ledger-sync` + +See [Phase10_taskList.md](./Phase10_taskList.md) for the full numbered check-by-check enumeration. 
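The span-existence category above reduces to a name-matching pass over the operations Jaeger reports for the `rippled` service. A minimal sketch follows; the `missing_spans` helper is hypothetical, and matching `rpc.command.*` via shell-style wildcards is an assumption about how the suite treats that entry.

```python
from fnmatch import fnmatch

# Hypothetical sketch: each expected span name (some with wildcards,
# e.g. "rpc.command.*") must match at least one operation name returned
# by Jaeger's /api/services/{service}/operations endpoint.
def missing_spans(expected: list, jaeger_operations: list) -> list:
    """Return the expected span names with no matching Jaeger operation."""
    return [
        name for name in expected
        if not any(fnmatch(op, name) for op in jaeger_operations)
    ]
```

An empty return value means every span-existence check passes; any surviving name is a FAIL for that check number.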
+ +### Current Status + +**Working** (71/71 checks pass in CI): +All 17 spans, 26 metrics, 10 dashboards, 14 attribute checks, 2 hierarchies, and duration bounds validated. + +**Not implemented or not available in CI**: + +1. Performance benchmark suite (Task 10.5) — not started +2. `rpc.request` -> `rpc.process` parent-child hierarchy — skipped (cross-thread context propagation) +3. Log-trace correlation validation (Loki) — not included in checks +4. Full 255+ StatsD metric coverage — only 26 representative metrics validated +5. Sustained load / backpressure testing — not implemented +6. `docs/telemetry-runbook.md` updates — not done +7. `09-data-collection-reference.md` "Validation" section — not done + ### Exit Criteria - [x] 2-node validator cluster starts and reaches consensus -- [ ] Validation suite confirms all required spans, attributes, and metrics -- [ ] All 10 Grafana dashboards render data +- [x] Validation suite confirms all required spans, attributes, and metrics (71/71 checks) +- [x] All 10 Grafana dashboards render data - [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead - [x] CI workflow runs validation on telemetry branch changes diff --git a/OpenTelemetryPlan/Phase10_taskList.md b/OpenTelemetryPlan/Phase10_taskList.md index 190870d592..f214093998 100644 --- a/OpenTelemetryPlan/Phase10_taskList.md +++ b/OpenTelemetryPlan/Phase10_taskList.md @@ -22,7 +22,7 @@ Before Phases 1-9 can be considered production-ready, we need proof that: -1. All 16 spans fire with correct attributes under real transaction workloads +1. All 17 spans fire with correct attributes under real transaction workloads 2. All 255+ StatsD metrics + ~50 Phase 9 metrics appear in Prometheus with non-zero values 3. Log-trace correlation (Phase 8) produces clickable trace_id links in Loki 4. 
All 10 Grafana dashboards render meaningful data (no empty panels) @@ -108,19 +108,32 @@ Before Phases 1-9 can be considered production-ready, we need proof that: **Implementation notes**: -- `validate_telemetry.py` runs all checks and produces a JSON report. +- `validate_telemetry.py` runs **71 checks** and produces a JSON report. + + The 71 checks break down as: + + | Category | Count | Source | + | -------------------- | ----- | ----------------------------------------- | + | Service registration | 1 | Jaeger: `rippled` service exists | + | Span existence | 17 | `expected_spans.json` — 17 span types | + | Span attributes | 14 | Spans with `required_attributes` | + | Span hierarchies | 2 | Parent-child relationships (1 skipped) | + | Span durations | 1 | All spans > 0 and < 60 s | + | Metric existence | 26 | `expected_metrics.json` — 26 metric names | + | Dashboard loads | 10 | `expected_metrics.json` — 10 Grafana UIDs | **Span validation** (queries Jaeger API): - Lists all registered operations as diagnostics - - Asserts span names from `expected_spans.json` appear in traces - - Validates required attributes per span type - - Validates parent-child span hierarchies - - Asserts all span durations are within bounds (> 0) + - Asserts 17 span names from `expected_spans.json` appear in traces + - Validates required attributes on the 14 spans that define them + - Validates 2 active parent-child span hierarchies (1 skipped — cross-thread) + - Asserts all span durations are within bounds (> 0, < 60 s) **Metric validation** (queries Prometheus API): - Lists all metric names as diagnostics (helps debug naming issues) - - Every metric in `expected_metrics.json` must have > 0 Prometheus series — absence is a FAIL - - Validates: SpanMetrics, StatsD gauges/counters/histograms, overlay traffic, Phase 9 OTLP metrics (nodestore, cache, txq, rpc_method, object_count, load_factor) + - All 26 metrics in `expected_metrics.json` must have > 0 Prometheus series — absence is a FAIL + - 
Uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale gauges — beast::insight StatsD gauges only emit on value changes, so a gauge that stabilizes goes stale in Prometheus after ~5 minutes + - Validates: 4 SpanMetrics, 6 StatsD gauges, 2 StatsD counters, 3 StatsD histograms, 4 overlay traffic, 7 Phase 9 OTLP metrics (nodestore, cache, txq, rpc_method, object_count, load_factor) **Dashboard validation**: - Queries Grafana API for each of the 10 dashboard UIDs @@ -228,13 +241,162 @@ Before Phases 1-9 can be considered production-ready, we need proof that: **Total Effort**: 10 days +## What "All 71 Checks" Means — Complete Enumeration + +The validation suite (`validate_telemetry.py`) runs exactly **71 checks** grouped into 7 categories. Every item below is validated by name in CI. Nothing is optional — failure of any single check fails the entire suite. + +### 1. Service Registration (1 check) + +| # | Check | Backend | +| --- | ---------------------------------- | ---------------------- | +| 1 | `rippled` service exists in Jaeger | Jaeger `/api/services` | + +### 2. Span Existence (17 checks) + +Each span name must appear at least once in Jaeger traces for the `rippled` service. 
+ +| # | Span Name | Category | Config Flag | +| --- | --------------------------- | ----------- | ---------------------- | +| 2 | `rpc.request` | RPC | `trace_rpc=1` | +| 3 | `rpc.process` | RPC | `trace_rpc=1` | +| 4 | `rpc.ws_message` | RPC | `trace_rpc=1` | +| 5 | `rpc.command.*` | RPC | `trace_rpc=1` | +| 6 | `tx.process` | Transaction | `trace_transactions=1` | +| 7 | `tx.receive` | Transaction | `trace_transactions=1` | +| 8 | `tx.apply` | Transaction | `trace_transactions=1` | +| 9 | `consensus.proposal.send` | Consensus | `trace_consensus=1` | +| 10 | `consensus.ledger_close` | Consensus | `trace_consensus=1` | +| 11 | `consensus.accept` | Consensus | `trace_consensus=1` | +| 12 | `consensus.validation.send` | Consensus | `trace_consensus=1` | +| 13 | `consensus.accept.apply` | Consensus | `trace_consensus=1` | +| 14 | `ledger.build` | Ledger | `trace_ledger=1` | +| 15 | `ledger.validate` | Ledger | `trace_ledger=1` | +| 16 | `ledger.store` | Ledger | `trace_ledger=1` | +| 17 | `peer.proposal.receive` | Peer | `trace_peer=1` | +| 18 | `peer.validation.receive` | Peer | `trace_peer=1` | + +### 3. Span Attribute Validation (14 checks) + +14 of the 17 spans define `required_attributes`. Each check asserts all listed attributes are present on at least one instance of that span. 
+ +| # | Span Name | Required Attributes | +| --- | --------------------------- | -------------------------------------------------------------------------------------------------- | +| 19 | `rpc.command.*` | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`, `xrpl.rpc.duration_ms` | +| 20 | `tx.process` | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | +| 21 | `tx.receive` | `xrpl.peer.id`, `xrpl.tx.hash`, `xrpl.tx.suppressed`, `xrpl.tx.status` | +| 22 | `tx.apply` | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | +| 23 | `consensus.proposal.send` | `xrpl.consensus.round` | +| 24 | `consensus.ledger_close` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | +| 25 | `consensus.accept` | `xrpl.consensus.proposers` | +| 26 | `consensus.validation.send` | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | +| 27 | `consensus.accept.apply` | `xrpl.consensus.close_time`, `xrpl.consensus.ledger.seq` | +| 28 | `ledger.build` | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | +| 29 | `ledger.validate` | `xrpl.ledger.seq`, `xrpl.ledger.validations` | +| 30 | `ledger.store` | `xrpl.ledger.seq` | +| 31 | `peer.proposal.receive` | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | +| 32 | `peer.validation.receive` | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | + +### 4. Span Parent-Child Hierarchies (2 checks) + +| # | Parent | Child | Status | Notes | +| --- | -------------- | --------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------- | +| 33 | `rpc.process` | `rpc.command.*` | Active | Same thread — always valid | +| 34 | `ledger.build` | `tx.apply` | Active | Same thread — always valid | +| -- | `rpc.request` | `rpc.process` | Skipped | Cross-thread: `onRequest` posts to JobQueue coroutine. Span context not propagated across thread boundary. Requires C++ fix. | + +### 5. 
Span Duration Bounds (1 check) + +| # | Check | Criteria | +| --- | ------------------------------ | ---------------------------------------- | +| 35 | All spans have valid durations | Every span duration > 0 and < 60 seconds | + +### 6. Metric Existence (26 checks) + +Each metric name must have > 0 series in Prometheus (queried via `/api/v1/series` to avoid stale-gauge false negatives). + +| # | Metric Name | Category | Source | +| --- | -------------------------------------------------- | ---------------- | ------------------------------------ | +| 36 | `traces_span_metrics_calls_total` | SpanMetrics | OTel Collector spanmetrics connector | +| 37 | `traces_span_metrics_duration_milliseconds_bucket` | SpanMetrics | OTel Collector spanmetrics connector | +| 38 | `traces_span_metrics_duration_milliseconds_count` | SpanMetrics | OTel Collector spanmetrics connector | +| 39 | `traces_span_metrics_duration_milliseconds_sum` | SpanMetrics | OTel Collector spanmetrics connector | +| 40 | `rippled_LedgerMaster_Validated_Ledger_Age` | StatsD Gauge | beast::insight via StatsD UDP | +| 41 | `rippled_LedgerMaster_Published_Ledger_Age` | StatsD Gauge | beast::insight via StatsD UDP | +| 42 | `rippled_State_Accounting_Full_duration` | StatsD Gauge | beast::insight via StatsD UDP | +| 43 | `rippled_Peer_Finder_Active_Inbound_Peers` | StatsD Gauge | beast::insight via StatsD UDP | +| 44 | `rippled_Peer_Finder_Active_Outbound_Peers` | StatsD Gauge | beast::insight via StatsD UDP | +| 45 | `rippled_jobq_job_count` | StatsD Gauge | beast::insight via StatsD UDP | +| 46 | `rippled_rpc_requests_total` | StatsD Counter | beast::insight via StatsD UDP | +| 47 | `rippled_ledger_fetches_total` | StatsD Counter | beast::insight via StatsD UDP | +| 48 | `rippled_rpc_time` | StatsD Histogram | beast::insight via StatsD UDP | +| 49 | `rippled_rpc_size` | StatsD Histogram | beast::insight via StatsD UDP | +| 50 | `rippled_ios_latency` | StatsD Histogram | beast::insight via StatsD UDP | +| 51 
| `rippled_total_Bytes_In` | Overlay Traffic | beast::insight via StatsD UDP | +| 52 | `rippled_total_Bytes_Out` | Overlay Traffic | beast::insight via StatsD UDP | +| 53 | `rippled_total_Messages_In` | Overlay Traffic | beast::insight via StatsD UDP | +| 54 | `rippled_total_Messages_Out` | Overlay Traffic | beast::insight via StatsD UDP | +| 55 | `nodestore_state` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 56 | `cache_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 57 | `txq_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 58 | `rpc_method_started_total` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 59 | `rpc_method_finished_total` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 60 | `object_count` | Phase 9 OTLP | MetricsRegistry via OTLP | +| 61 | `load_factor_metrics` | Phase 9 OTLP | MetricsRegistry via OTLP | + +### 7. Dashboard Loads (10 checks) + +Each Grafana dashboard must load successfully and contain at least one panel. + +| # | Dashboard UID | Dashboard Name | +| --- | ------------------------------- | ---------------------- | +| 62 | `rippled-rpc-perf` | RPC Performance | +| 63 | `rippled-transactions` | Transactions | +| 64 | `rippled-consensus` | Consensus | +| 65 | `rippled-ledger-ops` | Ledger Operations | +| 66 | `rippled-peer-net` | Peer Network | +| 67 | `rippled-system-node-health` | System: Node Health | +| 68 | `rippled-system-network` | System: Network | +| 69 | `rippled-system-rpc` | System: RPC | +| 70 | `rippled-system-overlay-detail` | System: Overlay Detail | +| 71 | `rippled-system-ledger-sync` | System: Ledger Sync | + +--- + +## Current Status: What Is Working vs. What Is Not + +### Working (validated in CI run 23144741908 — 71/71 PASS) + +1. **All 17 spans fire** with correct attributes under real workload (RPC + transaction + consensus) +2. **All 26 metrics exist** in Prometheus with non-zero series counts +3. **All 10 Grafana dashboards** load and render panels +4. 
**All 14 span attribute checks** pass, including `tx.receive` (fixed: default attributes on span creation) +5. **Both parent-child hierarchies** validate (`rpc.process` -> `rpc.command.*`, `ledger.build` -> `tx.apply`) +6. **All span durations** are within bounds (> 0, < 60 s) +7. **RPC load generator** fires 11 command types with < 50% error rate (native WS format) +8. **Transaction submitter** generates 10 transaction types at configurable TPS +9. **2-node validator cluster** starts and reaches consensus in CI +10. **CI workflow** (`telemetry-validation.yml`) runs on push to `pratik/otel-phase10-*` branches and on `workflow_dispatch` +11. **Validation report** is JSON with exit codes, suitable for CI gating + +### Not Working / Not Available in CI / Not Implemented Yet + +1. **Performance benchmark suite** (`benchmark.sh`, `collect_system_metrics.sh`) — **not implemented**. Task 10.5 is not started. The exit criterion "Benchmark shows < 3% CPU overhead, < 5MB memory overhead" is **not met**. +2. **`rpc.request` -> `rpc.process` parent-child hierarchy** — **skipped** (not validated). Cross-thread span context propagation is broken: `onRequest` posts a coroutine to the JobQueue for `processRequest`, but the span context is not forwarded through the `std::function` lambda. Requires a C++ fix to capture and inject the parent span into the coroutine. +3. **Log-trace correlation validation** (Phase 8 Loki `trace_id` links) — **not included** in the 71 checks. The validation suite does not query Loki. This was listed in "Why This Phase Exists" item 3 but is not covered by the current validation. +4. **Full StatsD metric coverage** — the validation checks 26 representative metrics, not the full 255+ beast::insight StatsD metrics. Covering all 255+ would require a complete metric enumeration and significantly longer workload runs to trigger every code path. +5. 
**Sustained load / backpressure testing** — listed in "Why This Phase Exists" item 6 ("telemetry stack survives sustained load without data loss") but **not implemented**. The current workload runs for ~2 minutes, not long enough to test queue saturation. +6. **`docs/telemetry-runbook.md` updates** — Task 10.7 mentions adding "Validating Telemetry Stack" and "Performance Benchmarking" sections. The runbook has **not been updated**. +7. **`09-data-collection-reference.md` updates** — Task 10.7 mentions adding a "Validation" section with expected metric/span counts. This has **not been updated**. + +--- + ## Exit Criteria - [x] 2-node validator cluster starts and reaches consensus - [x] RPC load generator fires all traced RPC commands at configurable rates - [x] Transaction submitter generates 10 transaction types at configurable TPS -- [ ] Validation suite confirms all spans, attributes, and metrics pass -- [ ] All 10 Grafana dashboards render data +- [x] Validation suite confirms all spans, attributes, and metrics pass (71/71 checks) +- [x] All 10 Grafana dashboards render data - [ ] Benchmark shows < 3% CPU overhead, < 5MB memory overhead - [x] CI workflow runs validation on telemetry branch changes - [x] Validation report output is CI-parseable (JSON with exit codes)
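The CI-parseable JSON report in the last exit criterion could take roughly the following shape. This is a hypothetical sketch: the field names and the pass/fail exit-code convention are assumptions, since the document only specifies "JSON with exit codes".

```python
import json

# Hypothetical sketch of CI gating: fold per-check results into a
# JSON-ready summary whose exit_code field drives the process exit status.
def make_report(results: dict) -> dict:
    """Summarize {check_name: passed} results into a JSON-ready report."""
    failed = sorted(name for name, ok in results.items() if not ok)
    return {
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed": failed,
        "exit_code": 1 if failed else 0,
    }

# Illustrative input only; the real suite would collect 71 entries.
print(json.dumps(make_report({"service.rippled": True, "span.tx.apply": True})))
```

A CI job can then gate on the process exit code while archiving the printed JSON as the run artifact.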