fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI

The Phase 10 validation harness had drifted from the code's recording surface
and the telemetry-validation CI job was failing before it could build.

CI fix (telemetry-validation.yml):
- Replace nonexistent local action ./.github/actions/print-env with the remote
  XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this).
- Sync prepare-runner and upload-artifact action SHAs to the canonical workflow.

Recording-surface reconciliation (docker/telemetry/workload/):
- Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore
  form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id,
  ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...).
  Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared
  constant), while consensus.validation.send uses bare ledger_hash.
- Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq);
  ledger.build carries ledger_seq/close_* (not tx_count/tx_failed).
- Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop
  the never-emitted duration_ms; rebuild the parent-child map accordingly.
- Add the new spans the code emits: apply-pipeline stage spans
  (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq.*,
  consensus sub-spans (round/establish/update_positions/check/phase.open),
  ledger.acquire, grpc.*, pathfind.*. Conditional spans are marked optional so
  they are skipped (not failed) when the workload does not exercise them.
- validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix
  PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span
  attrs); add optional-span handling that skips missing optional spans while still
  validating attributes when present.
- expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics,
  xrpld_job_count, the 15 on-disk xrpld-* dashboard UIDs, and the real bare
  spanmetrics dimension labels.
- regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message.

Metrics pipeline fix:
- Switch node [insight] config from server=statsd/prefix=rippled to server=otel +
  /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh,
  xrpld-validator.cfg.template, benchmark.sh and the workload compose. The
  collector has no StatsD receiver, so system metrics only reach Prometheus over
  OTLP.

Synthetic load for new spans:
- Add ripple_path_find to the RPC load generator (drives pathfind.* spans).
- Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.*).

All facts verified against the *SpanNames.h headers and a live xrpld node +
collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type],
279 xrpld_ Prometheus metrics and zero rippled_).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-06-05 17:08:58 +01:00
parent a1c79b7aab
commit cb9fce6890
14 changed files with 503 additions and 178 deletions

View File

@@ -88,12 +88,12 @@ jobs:
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Prepare runner
uses: XRPLF/actions/prepare-runner@2cbf481018d930656e9276fcc20dc0e3a0be5b6d
uses: XRPLF/actions/prepare-runner@90f11ee655d1687824fb8793db770477d52afbab
with:
enable_ccache: ${{ github.repository_owner == 'XRPLF' }}
- name: Print build environment
uses: ./.github/actions/print-env
uses: XRPLF/actions/print-build-env@59dec886e4afb05a1724443af08baccbc045b574
- name: Get number of processors
uses: XRPLF/actions/get-nproc@cf0433aa74563aead044a1e395610c96d65a37cf
@@ -136,7 +136,7 @@ jobs:
run: ccache --show-stats -vv
- name: Upload xrpld binary
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: xrpld-telemetry
path: ${{ env.BUILD_DIR }}/xrpld
@@ -193,7 +193,7 @@ jobs:
- name: Upload validation reports
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: telemetry-validation-reports
path: /tmp/xrpld-validation/reports/
@@ -201,7 +201,7 @@ jobs:
- name: Upload node logs
if: failure()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: xrpld-node-logs
path: /tmp/xrpld-validation/node*/debug.log