Files
rippled/docker/telemetry/workload/workload-profiles.json
Pratik Mankawde cb9fce6890 fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI
The Phase 10 validation harness had drifted from the code's recording surface
and the telemetry-validation CI job was failing before it could build.

CI fix (telemetry-validation.yml):
- Replace nonexistent local action ./.github/actions/print-env with the remote
  XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this).
- Sync prepare-runner and upload-artifact action SHAs to the canonical workflow.

Recording-surface reconciliation (docker/telemetry/workload/):
- Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore
  form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id,
  ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...).
  Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared
  constant), while consensus.validation.send uses bare ledger_hash.
- Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq);
  ledger.build carries ledger_seq/close_* (not tx_count/tx_failed).
- Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop
  the never-emitted duration_ms; rebuild the parent-child map accordingly.
- Add the new spans the code emits: apply-pipeline stage spans
  (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq.*,
  consensus sub-spans (round/establish/update_positions/check/phase.open),
  ledger.acquire, grpc.*, pathfind.*. Conditional spans are marked optional so
  they are skipped (not failed) when the workload does not exercise them.
- validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix
  PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span
  attrs); add optional-span handling that skips missing optional spans while still
  validating attributes when present.
- expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics,
  xrpld_job_count, the 15 on-disk xrpld-* dashboard UIDs, and the real bare
  spanmetrics dimension labels.
- regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message.

Metrics pipeline fix:
- Switch node [insight] config from server=statsd/prefix=rippled to server=otel +
  /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh,
  xrpld-validator.cfg.template, benchmark.sh and the workload compose. The
  collector has no StatsD receiver, so system metrics only reach Prometheus over
  OTLP.

Synthetic load for new spans:
- Add ripple_path_find to the RPC load generator (drives pathfind.* spans).
- Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.*).

All facts verified against the *SpanNames.h headers and a live xrpld node +
collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type],
279 xrpld_ Prometheus metrics and zero rippled_).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 17:08:58 +01:00

109 lines
3.3 KiB
JSON

{
"profiles": {
"full-validation": {
"description": "Full 18-dashboard coverage with burst/idle/plateau patterns",
"phases": [
{
"name": "warmup",
"description": "Low load to populate baseline gauges and node health metrics",
"duration_sec": 30,
"rpc": {
"rate": 5,
"weights": { "server_info": 50, "fee": 30, "ledger": 20 }
},
"tx": null
},
{
"name": "steady-state",
"description": "Medium sustained load — plateau data for all dashboards",
"duration_sec": 60,
"rpc": { "rate": 30 },
"tx": { "tps": 3 }
},
{
"name": "rpc-burst",
"description": "Heavy RPC to saturate job queue and spike latency",
"duration_sec": 30,
"rpc": { "rate": 100 },
"tx": null
},
{
"name": "tx-flood",
"description": "High TX rate for fee escalation and TxQ pressure",
"duration_sec": 30,
"rpc": { "rate": 5, "weights": { "server_info": 50, "fee": 50 } },
"tx": {
"tps": 20,
"weights": { "Payment": 70, "OfferCreate": 20, "TrustSet": 10 }
}
},
{
"name": "txq-burst",
"description": "Single-type Payment burst at high TPS to force open-ledger fee escalation and TxQ queueing, exercising the txq.* spans (txq.enqueue / txq.accept / txq.accept.tx / txq.cleanup)",
"duration_sec": 30,
"rpc": { "rate": 5, "weights": { "fee": 100 } },
"tx": {
"tps": 60,
"weights": { "Payment": 100 }
}
},
{
"name": "mixed-peak",
"description": "Realistic peak load — consensus and ledger ops under stress",
"duration_sec": 60,
"rpc": { "rate": 50 },
"tx": { "tps": 10 }
},
{
"name": "cooldown",
"description": "Low load for recovery metrics and state transition data",
"duration_sec": 30,
"rpc": { "rate": 5, "weights": { "server_info": 80, "fee": 20 } },
"tx": null
}
],
"propagation_wait_sec": 60
},
"quick-smoke": {
"description": "Fast smoke test — minimal data for CI quick checks",
"phases": [
{
"name": "smoke",
"description": "Single phase covering all generator types",
"duration_sec": 30,
"rpc": { "rate": 20 },
"tx": { "tps": 3 }
}
],
"propagation_wait_sec": 30
},
"stress": {
"description": "Heavy sustained load for performance benchmarking",
"phases": [
{
"name": "ramp-up",
"description": "Gradually increasing load",
"duration_sec": 30,
"rpc": { "rate": 20 },
"tx": { "tps": 5 }
},
{
"name": "peak",
"description": "Maximum sustained load",
"duration_sec": 120,
"rpc": { "rate": 150 },
"tx": { "tps": 25 }
},
{
"name": "sustain",
"description": "Continued high load for stability check",
"duration_sec": 60,
"rpc": { "rate": 100 },
"tx": { "tps": 15 }
}
],
"propagation_wait_sec": 60
}
}
}