rippled/tasks/fix-validation-checks.md
Pratik Mankawde b659d43395 fix: address CI rename checks (rippled -> xrpld) in phase-10 docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:40:44 +01:00

Fix Telemetry Validation Checks

Context

The CI pipeline infrastructure is fully operational (build + deploy + run). However, the validate_telemetry.py validation suite fails 35 checks due to mismatches between what the validation expects and what the telemetry stack actually produces. These fall into 4 categories.

CI run: https://github.com/XRPLF/rippled/actions/runs/23026466191


Category 1: StatsD Metrics — 0 Series (25 failures)

Symptoms:

[FAIL] metric.statsd_gauges.xrpld_LedgerMaster_Validated_Ledger_Age: 0 series
[FAIL] metric.statsd_counters.xrpld_rpc_requests: 0 series
[FAIL] metric.statsd_histograms.xrpld_rpc_time: 0 series
[FAIL] metric.overlay_traffic.xrpld_total_Bytes_In: 0 series
[FAIL] metric.phase9_nodestore.xrpld_nodestore_reads_total: 0 series
... (25 total)

Root Cause: Two issues compounding:

  1. StatsD receiver is commented out in otel-collector-config.yaml (lines 39-54). The collector config was updated to expect native OTLP metrics from beast::insight (comment: "StatsD UDP port removed — beast::insight now uses native OTLP"), but the validation harness still configures xrpld nodes with server=statsd, so the StatsD packets the nodes send are never ingested by the collector.

  2. Metric name mismatch: The expected_metrics.json expects StatsD-style metric names (e.g., xrpld_LedgerMaster_Validated_Ledger_Age). When using server=otel, beast::insight emits OTLP metrics which may have different names/structure.

Fix Options (pick one):

  • Option A (recommended): Change the node config in run-full-validation.sh from server=statsd to server=otel (line 255), remove the address=127.0.0.1:8125 line, then update expected_metrics.json with the actual OTLP metric names. This aligns with the collector config's OTLP-first design and avoids re-enabling the StatsD receiver.

  • Option B: Uncomment the StatsD receiver in otel-collector-config.yaml, add statsd to the metrics pipeline receivers list, and keep node config as server=statsd. Simpler but goes against the migration to native OTLP.
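The Option A config change can be sketched as a small text transform. This is only an illustration: the helper name switch_insight_to_otel is hypothetical, the real change is made in run-full-validation.sh, and any extra keys that server=otel may require are not covered here.

```python
def switch_insight_to_otel(cfg_text: str) -> str:
    """Rewrite the [insight] section: server=... becomes server=otel, address=... is dropped.

    Lines outside [insight] are passed through unchanged.
    """
    out = []
    in_insight = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        # A section header switches the "inside [insight]" flag on or off.
        if stripped.startswith("[") and stripped.endswith("]"):
            in_insight = stripped == "[insight]"
        if in_insight and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key == "server":
                out.append("server=otel")
                continue
            if key == "address":
                # The StatsD UDP address is not used with server=otel.
                continue
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "[insight]\nserver=statsd\naddress=127.0.0.1:8125\n"
    print(switch_insight_to_otel(sample), end="")
```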

Investigation needed for Option A:

  • Run xrpld locally with server=otel, query Prometheus, and capture the actual OTLP metric names to update expected_metrics.json.
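One way to capture the actual names is to ask Prometheus for every value of the __name__ label and diff the result against the expected list. The sketch below assumes Prometheus is reachable on localhost:9090 and that the OTLP names still carry the xrpld_ prefix; both are assumptions, and the helper names are hypothetical. Pass prefix="" to see every metric if the prefix turns out to differ.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default Prometheus port


def fetch_metric_names(base_url: str = PROMETHEUS_URL) -> list[str]:
    """Return every metric name Prometheus currently knows about."""
    url = f"{base_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]


def diff_metric_names(expected, actual, prefix="xrpld_"):
    """Compare expected names with the prefixed subset of actual names."""
    actual_set = {name for name in actual if name.startswith(prefix)}
    expected_set = set(expected)
    return {
        "missing": sorted(expected_set - actual_set),      # expected but not emitted
        "unexpected": sorted(actual_set - expected_set),   # emitted but not expected
    }


if __name__ == "__main__":
    expected = ["xrpld_rpc_requests", "xrpld_rpc_time"]  # taken from the failing checks
    print(diff_metric_names(expected, fetch_metric_names()))
```

The "unexpected" list is the candidate set of new names for expected_metrics.json; the "missing" list shows which StatsD-style names no longer exist.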

Files to modify:

  • docker/telemetry/workload/run-full-validation.sh — change [insight] section
  • docker/telemetry/workload/expected_metrics.json — update metric names for OTLP
  • docker/telemetry/workload/validate_telemetry.py — may need metric query adjustments

Category 2: Missing Spans — tx.process, tx.receive (2 failures)

Symptoms:

[FAIL] span.tx.process: tx.process: 0 traces (expected > 0)
[FAIL] span.tx.receive: tx.receive: 0 traces (expected > 0)

Root Cause: The span names exist in the code:

  • src/xrpld/app/misc/NetworkOPs.cpp:1228: XRPL_TRACE_TX("tx.process")
  • src/xrpld/overlay/detail/PeerImp.cpp:1273: XRPL_TRACE_TX("tx.receive")

Likely causes (investigate in order):

  1. Batch delay: The 2-second batch delay (batch_delay_ms=2000) plus 30s propagation wait may not be enough if these spans are created late in the workload.
  2. Code path not triggered: tx.process fires in NetworkOPs::processTransaction(). The tx_submitter submits via the RPC submit command, which calls this path. But if the transactions fail validation before reaching processTransaction(), no span is emitted.
  3. Span naming mismatch: The validation queries Tempo for exact operation name tx.process. Verify Tempo stores the span with this exact name.

Investigation:

  • Check the tx_submitter output in CI logs — are transactions actually succeeding?
  • Query Tempo API locally for all span names to see what's actually emitted.
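A minimal sketch of the local Tempo query, using the TraceQL search endpoint to count traces that contain a given span name. It assumes Tempo listens on localhost:3200; the helper names are hypothetical and this is not the query validate_telemetry.py uses.

```python
import json
import urllib.parse
import urllib.request

TEMPO_URL = "http://localhost:3200"  # assumption: default Tempo HTTP port


def build_search_url(span_name: str, base_url: str = TEMPO_URL, limit: int = 20) -> str:
    """Build a TraceQL search URL matching spans with the exact given name."""
    query = urllib.parse.urlencode({"q": f'{{name="{span_name}"}}', "limit": limit})
    return f"{base_url}/api/search?{query}"


def count_traces(payload: dict) -> int:
    """Count the traces in a Tempo search response body."""
    return len(payload.get("traces", []))


def traces_for_span(span_name: str, base_url: str = TEMPO_URL) -> int:
    """Query Tempo and return how many traces contain the span name."""
    with urllib.request.urlopen(build_search_url(span_name, base_url), timeout=10) as resp:
        return count_traces(json.load(resp))


if __name__ == "__main__":
    for name in ("tx.process", "tx.receive", "rpc.request"):
        print(name, traces_for_span(name))
```

If rpc.request returns traces but tx.process and tx.receive return none, the naming-mismatch cause becomes less likely and the code-path cause more likely.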

Files to modify:

  • Possibly docker/telemetry/workload/validate_telemetry.py — adjust timing/queries
  • Possibly docker/telemetry/workload/run-full-validation.sh — increase propagation wait
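As an alternative to lengthening the fixed propagation wait, the check could poll until spans appear or a deadline passes. This is a swapped-in technique, not what the scripts currently do; wait_for_traces and its parameters are hypothetical, and count_fn stands for whatever function returns the current trace count.

```python
import time


def wait_for_traces(count_fn, timeout_s=60.0, interval_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll count_fn until it reports at least one trace or the timeout expires.

    Returns True as soon as count_fn() > 0, otherwise False after timeout_s.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if count_fn() > 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated count source: spans show up on the third poll.
    counts = iter([0, 0, 3])
    print(wait_for_traces(lambda: next(counts), sleep=lambda s: None))
```

Polling keeps the fast case fast and only spends the full timeout when the spans are genuinely missing.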

Category 3: Span Hierarchy — rpc.request -> rpc.process (1 failure)

Symptoms:

[FAIL] span.hierarchy.rpc.request->rpc.process: rpc.process not found in rpc.request traces

Root Cause: The validator fetches traces containing rpc.request from Tempo and checks if any child span is named rpc.process. Both spans are emitted (they pass individual checks), but the parent-child relationship isn't established.

Investigation:

  • Check src/xrpld/rpc/detail/ServerHandler.cpp: rpc.request (line 271) and rpc.process (line 573) are both created in this file. Verify that rpc.process is created as a child of rpc.request (i.e., its parent context is set).
  • The issue may be that rpc.process creates a new root span instead of linking to the rpc.request span context.
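The parent-child check can be illustrated on a simplified trace. The sketch below assumes each span is a dict with spanId, parentSpanId, and name keys, which is a simplification rather than Tempo's actual response layout, and it treats "child" as a direct child; both helper names are hypothetical. Note that a span started as a new root gets its own trace ID, so it would not appear in the rpc.request trace at all.

```python
def has_child_under(spans, parent_name, child_name):
    """Return True if some span named child_name has a direct parent named parent_name."""
    by_id = {s["spanId"]: s for s in spans}
    for span in spans:
        if span["name"] != child_name:
            continue
        parent = by_id.get(span.get("parentSpanId"))
        if parent is not None and parent["name"] == parent_name:
            return True
    return False


def root_spans(spans):
    """Return the names of spans whose parent is absent from this trace."""
    ids = {s["spanId"] for s in spans}
    return [s["name"] for s in spans if s.get("parentSpanId") not in ids]


if __name__ == "__main__":
    linked = [
        {"spanId": "a", "parentSpanId": None, "name": "rpc.request"},
        {"spanId": "b", "parentSpanId": "a", "name": "rpc.process"},
    ]
    only_request = [{"spanId": "a", "parentSpanId": None, "name": "rpc.request"}]
    print(has_child_under(linked, "rpc.request", "rpc.process"))
    print(has_child_under(only_request, "rpc.request", "rpc.process"))
```

Listing the root spans of traces that contain rpc.process is a quick way to tell whether it is being started without a parent context.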

Files to modify:

  • Possibly src/xrpld/rpc/detail/ServerHandler.cpp — fix span parenting
  • OR docker/telemetry/workload/validate_telemetry.py — if hierarchy check logic is wrong

Category 4: Dashboard 404s (5 failures)

Symptoms:

[FAIL] dashboard.xrpld-statsd-node-health: HTTP 404
[FAIL] dashboard.xrpld-statsd-network: HTTP 404
[FAIL] dashboard.xrpld-statsd-rpc: HTTP 404
[FAIL] dashboard.xrpld-statsd-overlay-detail: HTTP 404
[FAIL] dashboard.xrpld-statsd-ledger-sync: HTTP 404

Root Cause: Dashboard UIDs were renamed from xrpld-statsd-* to xrpld-system-* but expected_metrics.json still references the old names.

Actual UIDs in docker/telemetry/grafana/dashboards/:

| Expected (in expected_metrics.json) | Actual (in dashboard JSON) |
| --- | --- |
| xrpld-statsd-node-health | xrpld-system-node-health |
| xrpld-statsd-network | xrpld-system-network |
| xrpld-statsd-rpc | xrpld-system-rpc |
| xrpld-statsd-overlay-detail | xrpld-system-overlay-detail |
| xrpld-statsd-ledger-sync | xrpld-system-ledger-sync |

Fix: Update the 5 UIDs in expected_metrics.json under grafana_dashboards.uids[].
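The rename is mechanical, so it can be sketched as a prefix swap. This assumes grafana_dashboards.uids is a list of plain strings, which the file layout suggests but which should be confirmed; rename_dashboard_uids is a hypothetical helper.

```python
OLD_PREFIX = "xrpld-statsd-"
NEW_PREFIX = "xrpld-system-"


def rename_dashboard_uids(expected: dict) -> dict:
    """Replace the old UID prefix with the new one, leaving other UIDs untouched."""
    uids = expected["grafana_dashboards"]["uids"]
    expected["grafana_dashboards"]["uids"] = [
        NEW_PREFIX + uid[len(OLD_PREFIX):] if uid.startswith(OLD_PREFIX) else uid
        for uid in uids
    ]
    return expected


if __name__ == "__main__":
    data = {"grafana_dashboards": {"uids": ["xrpld-statsd-rpc", "xrpld-statsd-network"]}}
    print(rename_dashboard_uids(data)["grafana_dashboards"]["uids"])
```

In practice the dict would come from json.load on expected_metrics.json and be written back with json.dump.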

Files to modify:

  • docker/telemetry/workload/expected_metrics.json — update dashboard UIDs

Execution Order

  1. Category 4 (Dashboard UIDs) — trivial rename, no investigation needed
  2. Category 1 (StatsD/OTLP metrics) — requires investigation to choose Option A vs B and capture actual metric names
  3. Category 2 (Missing tx spans) — requires investigation into transaction code paths
  4. Category 3 (Span hierarchy) — requires investigation into span context propagation

Branch

All changes go on: pratik/otel-phase10-workload-validation
Worktree: /tmp/otel-phase10-iter