Commit Graph

14326 Commits

Author SHA1 Message Date
Pratik Mankawde
9fbdfa1fbe Merge phase9 clang-tidy fixes into phase10 2026-05-13 18:13:40 +01:00
Pratik Mankawde
3c13d788fd fix(telemetry): address clang-tidy CI failures on phase9
- MetricsRegistry.cpp: concatenate nested namespaces, add missing
  direct includes (Journal.h, string, string_view, cstdint), suppress
  readability-convert-member-functions-to-static in #else stubs by
  referencing enabled_ member, void unused instanceId parameter.
- MetricsRegistry test: add missing direct includes (Log.h, Journal.h,
  uint256.h, io_context.hpp, optional, stdexcept, string), make
  throwUnimplemented() static, add [[nodiscard]] to getOpenLedger/
  isStopping/getTrapTxID overrides, make const-eligible registry const.
- PerfLogImp.cpp: add braces around if/else body per
  readability-braces-around-statements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 18:13:19 +01:00
Pratik Mankawde
c10b0fd9d1 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:53:56 +01:00
Pratik Mankawde
3131d99029 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:53:54 +01:00
Pratik Mankawde
829094df5a Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:53:52 +01:00
Pratik Mankawde
9554e3889b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:53:50 +01:00
Pratik Mankawde
fe7cb33b65 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:53:47 +01:00
Pratik Mankawde
f5cf4155c2 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:53:45 +01:00
Pratik Mankawde
ea30adeb47 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:53:43 +01:00
Pratik Mankawde
9bc2e4abb3 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:53:32 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
a04459f1f8 fix(telemetry): update collector config + tempo datasource + design doc for simplified attr names
- otel-collector-config.yaml: spanmetrics dimensions use new bare names.
- tempo.yaml: TraceQL filter tags use new bare names.
- 02-design-decisions.md: strip xrpl.txq.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:36 +01:00
Pratik Mankawde
6b5e6a49ec Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:45:23 +01:00
Pratik Mankawde
b4e4b57504 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:45:14 +01:00
Pratik Mankawde
6dd43765b5 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:45:03 +01:00
Pratik Mankawde
cbf389943f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:44:49 +01:00
Pratik Mankawde
b05e650b6f docs(telemetry): update 09-data-collection-reference + Phase5 integration test list for simplified attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:42:30 +01:00
Pratik Mankawde
57175ab12c Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:37:37 +01:00
Pratik Mankawde
d44a0aa3ff docs(telemetry): update Phase5 task list for simplified attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:37:27 +01:00
Pratik Mankawde
522fe562ff Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:36:34 +01:00
Pratik Mankawde
745102360b docs(telemetry): update Phase4 task list for simplified consensus attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:36:22 +01:00
Pratik Mankawde
19d9c44cf5 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:31:35 +01:00
Pratik Mankawde
5c14b57462 docs(telemetry): update Phase3 task list for simplified tx/txq attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:31:22 +01:00
Pratik Mankawde
c875944e05 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:29:32 +01:00
Pratik Mankawde
2430032e3a docs(telemetry): update Phase2 task list + design docs for attr rename
- Phase2_taskList: update attr refs to bare names, note node-health
  attrs moved to resource level.
- 02-design-decisions: strip xrpl.pathfind.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:29:20 +01:00
Pratik Mankawde
0f63d14999 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 16:28:07 +01:00
Pratik Mankawde
faaec003f4 docs(telemetry): update plan docs for simplified RPC/gRPC attr naming
Update OpenTelemetryPlan docs and Telemetry.h doc example to reflect
the renamed per-span attributes: xrpl.rpc.command -> command,
xrpl.rpc.status -> rpc_status, xrpl.grpc.method -> method, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:27:55 +01:00
Pratik Mankawde
815e2b1f5d refactor(telemetry): fix remaining old attr refs in tests, docs, workload
- Update Telemetry.h doc example: xrpl.rpc.command -> command.
- Update SpanGuardFactory.cpp test: use new bare attr names.
- Update TESTING.md: rename attr refs in span table + PromQL example.
- Update expected_spans.json: all attrs match simplified naming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:21:18 +01:00
Pratik Mankawde
ec8e3e2950 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:17:49 +01:00
Pratik Mankawde
495d5bd8a0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:17:12 +01:00
Pratik Mankawde
6cd910f06f Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:17:05 +01:00
Pratik Mankawde
5cd71ed107 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:16:50 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
e60efd4d2f Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:10:46 +01:00
Pratik Mankawde
c48f5ed6e7 docs(telemetry): update runbook attr names for simplified naming convention
Update 31 attribute references in telemetry-runbook.md to match the
simplified naming: drop xrpl.<domain>. prefix on per-span attrs, use
domain-qualified names for collisions (rpc_status, consensus_state,
etc.), and unify cross-domain refs (xrpl.ledger.seq, xrpl.tx.hash).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:08:48 +01:00
Pratik Mankawde
c9fe4b1a14 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:04:27 +01:00
Pratik Mankawde
46d1012ad4 refactor(telemetry): simplify consensus attr naming on phase-4 — drop xrpl.consensus. prefix
- Add canonical shared bare attrs to SpanNames.h: closeTime,
  closeTimeCorrect, closeResolutionMs (reused by ledger domain).
- Keep qualified (rule 5): ledgerId, mode, round, roundId.
- Domain-qualify collisions: state -> consensus_state,
  result -> consensus_result.
- Reuse canonical ledgerSeq from phase-3.
- Drop xrpl.consensus.* prefix from 20+ attrs (proposers, round_time_ms,
  converge_percent, avalanche_threshold, etc.).
- Dispute attrs: bare names (dispute_our_vote, dispute_yays, etc.).
- Update call sites in RCLConsensus.cpp, Consensus.h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:04:16 +01:00
Pratik Mankawde
7eeddd3ad9 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:01:13 +01:00
Pratik Mankawde
e339ba1f6b refactor(telemetry): simplify tx/txq attr naming on phase-3 — drop xrpl.<domain>. prefix
- Add canonical shared attrs to SpanNames.h: txHash (xrpl.tx.hash),
  peerId (xrpl.peer.id), ledgerSeq (xrpl.ledger.seq).
- Drop xrpl.tx.* prefix: local, path, suppressed, peer_version.
- Domain-qualify: status -> tx_status, txq status -> txq_status.
- TxQ: tx_hash -> reuse canonical txHash, ledger_seq -> reuse canonical
  ledgerSeq; bare names for fee_level_paid, required_fee_level, etc.
- Update call sites in PeerImp.cpp, NetworkOPs.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:01:00 +01:00
Pratik Mankawde
ac1b01b4c7 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 15:57:45 +01:00
Pratik Mankawde
497dd007d9 refactor(telemetry): simplify attr naming on phase-2 — drop xrpl.pathfind. prefix
- Drop xrpl.pathfind.* prefix from per-span attrs (source_account,
  dest_account, fast, search_level, num_complete_paths, num_paths,
  num_requests).
- Keep xrpl.pathfind.ledger_index qualified (rule 5: distinct from
  xrpl.ledger.seq).
- Remove per-span nodeAmendmentBlocked/nodeServerState calls from
  RPCHandler — promoted to resource-level attrs.
- Mark node-health attrs in SpanNames.h as RESOURCE-ONLY with doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:57:36 +01:00
Pratik Mankawde
0d845149ec Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing 2026-05-13 15:55:39 +01:00
Pratik Mankawde
7a854ccad2 refactor(telemetry): simplify attr naming on phase-1c — drop xrpl.<domain>. prefix
- Drop xrpl.rpc.* prefix from per-span attrs (command, version).
- Qualify collision-prone fields: role -> rpc_role/grpc_role,
  status -> rpc_status/grpc_status.
- Rename payload_size -> request_payload_size for cross-domain clarity.
- Simplify link.type -> link_type (bare name, no join).
- Update convention doc in SpanNames.h to reflect new naming rules.
- Update telemetry.md doc with renamed attr keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:13 +01:00
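Note on 7a854ccad2: the naming rule in this commit (drop the xrpl.<domain>. prefix, but qualify fields that collide across domains) can be sketched as a small function. This is an illustration only, not the code in SpanNames.h. The function name and the collision set are assumptions built from the two examples the message gives (role and status). It does not cover the payload_size and link.type renames or the attributes kept qualified under rule 5.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: maps (domain, field) to the simplified attribute key.
// Fields listed as collision-prone get a "<domain>_" qualifier; all other
// fields become bare names.
inline std::string simplifiedAttrName(
    std::string const& domain,
    std::string const& field)
{
    // Assumed collision set, taken from the examples in the commit message.
    static std::set<std::string> const collisionProne{"role", "status"};
    if (collisionProne.count(field) != 0)
        return domain + "_" + field;
    return field;
}

// Usage: the renames named in the commit message.
inline void usageExample()
{
    assert(simplifiedAttrName("rpc", "command") == "command");
    assert(simplifiedAttrName("rpc", "status") == "rpc_status");
    assert(simplifiedAttrName("grpc", "role") == "grpc_role");
}
```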
Pratik Mankawde
592e546f82 fix(telemetry): align Phase 10 workload configs with xrpld_ metric prefix
Phase 10's workload validation configs (expected_metrics.json,
regression-metrics.json, validate_telemetry.py) queried the
MetricsRegistry metrics under the rippled_ prefix, but MetricsRegistry
emits them as xrpld_ (see MetricsRegistry.cpp). On a live run the
workload validator reported every MetricsRegistry metric as missing,
masking genuine regressions.

Rename the following to xrpld_ across the workload validator,
expected-metrics manifest, and regression-metrics template:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking,
  storage_detail
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement, validation_agreements_total,
  validation_missed_total

Mirrors the phase-9 fix in commit 5601615952.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 15:01:13 +01:00
Pratik Mankawde
201da0e00d Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 14:59:45 +01:00
Pratik Mankawde
5601615952 fix(telemetry): align Phase 9 dashboards and integration-test with xrpld_ metric prefix
MetricsRegistry emits OTel SDK metrics with the xrpld_ prefix
(MetricsRegistry.cpp defines "xrpld_nodestore_state",
"xrpld_cache_metrics", etc.), but the Phase 9 dashboards and the
Step 10c integration-test assertions introduced in 892fee638a
queried the rippled_ prefix. Every Phase 9 panel and assertion
therefore rendered "No data" or failed on a live run, even though
the underlying series were being exported correctly.

Rename the rippled_ prefix to xrpld_ for every MetricsRegistry
metric in dashboards and the integration test:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us_bucket
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement (ValidationTracker 1h/24h/7d windows)

Also add ValidationTracker window-gauge assertions to Step 10c of
integration-test.sh so the 1h/24h/7d agreement and miss counts are
checked alongside the other Phase 9 gauges.

The rippled_ prefix is preserved for beast::insight metrics
(rippled_LedgerMaster_*, rippled_Peer_Finder_*, rippled_total_*,
rippled_Overlay_*, rippled_State_Accounting_*, rippled_transactions_*,
rippled_proposals_*, rippled_validations_Messages_*) because those
flow through the StatsD-style OTelCollector configured with
`[insight] prefix=rippled` and remain on that prefix by design.

Verified against a live 6-node consensus network: all 22 Phase 9 +
ValidationTracker assertions now report 6+ series per metric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:59:00 +01:00
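Note on 5601615952: the split described above (MetricsRegistry metrics move to xrpld_, beast::insight metrics stay on rippled_) can be sketched as a name-correction function. The function name and the idea of passing the MetricsRegistry base names as a set are assumptions for illustration; the commit itself edits dashboards and the integration test by hand.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <string_view>

// Hypothetical helper: rewrites a queried metric name from the rippled_
// prefix to xrpld_, but only when its base name belongs to MetricsRegistry.
// Anything else (for example beast::insight metrics) is returned unchanged.
inline std::string correctedMetricName(
    std::string_view queried,
    std::set<std::string> const& registryBaseNames)
{
    constexpr std::string_view oldPrefix = "rippled_";
    constexpr std::string_view newPrefix = "xrpld_";

    if (queried.substr(0, oldPrefix.size()) != oldPrefix)
        return std::string(queried);

    std::string const base(queried.substr(oldPrefix.size()));
    if (registryBaseNames.count(base) == 0)
        return std::string(queried);

    return std::string(newPrefix) + base;
}

// Usage: a MetricsRegistry metric is renamed, a StatsD-path metric is not.
inline void usageExample()
{
    std::set<std::string> const registry{"nodestore_state", "cache_metrics"};
    assert(
        correctedMetricName("rippled_nodestore_state", registry) ==
        "xrpld_nodestore_state");
    assert(
        correctedMetricName("rippled_Overlay_example", registry) ==
        "rippled_Overlay_example");
}
```

The metric name rippled_Overlay_example is a placeholder; only the rippled_Overlay_* pattern comes from the commit message.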
Pratik Mankawde
580ee5ede7 fix(telemetry): StatsD gauge and io_latency first-sample emit
Two fixes so gauges register in Prometheus (via StatsD) even when their
initial/steady-state value is 0:

1. StatsDGaugeImpl m_dirty: default-init to true so the initial value
   (0) is emitted on the first flush. Previously, gauges whose value
   never changed from 0 were never flushed and never appeared
   downstream.

2. io_latency_sampler firstSample_: new atomic<bool>, init true.
   m_event.notify now fires when either firstSample_ is true (exchanged
   to false) or lastSample >= 10 ms. This guarantees the io_latency
   metric is registered on startup; subsequent sub-10 ms samples are
   still suppressed to avoid flooding.
2026-05-13 14:40:58 +01:00
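Note on 580ee5ede7: both fixes rely on a flag that starts out true, so the first emit always happens. A minimal sketch, assuming simplified stand-in classes (GaugeSketch and LatencyGateSketch are hypothetical and are not the real StatsDGaugeImpl or io_latency_sampler; only the flag behaviour and the 10 ms threshold come from the commit message):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a StatsD gauge.
class GaugeSketch
{
public:
    void set(std::uint64_t v)
    {
        if (v != value_)
        {
            value_ = v;
            dirty_ = true;
        }
    }

    // Returns the value to emit on this flush, or nullopt if nothing is pending.
    std::optional<std::uint64_t> flush()
    {
        if (!dirty_)
            return std::nullopt;
        dirty_ = false;
        return value_;
    }

private:
    std::uint64_t value_{0};
    bool dirty_{true};  // start dirty so the initial 0 is flushed once
};

// Hypothetical stand-in for the io_latency sampler's notify decision.
class LatencyGateSketch
{
public:
    bool shouldNotify(std::chrono::milliseconds sample)
    {
        constexpr std::chrono::milliseconds threshold{10};
        // exchange() returns the previous value, so only the first call
        // observes true; every later call observes false.
        bool const first = firstSample_.exchange(false);
        return first || sample >= threshold;
    }

private:
    std::atomic<bool> firstSample_{true};
};

// Usage: a gauge that stays at 0 is still emitted once, and the first
// latency sample is reported even though it is below the threshold.
inline void usageExample()
{
    GaugeSketch gauge;
    auto const first = gauge.flush();
    assert(first.has_value() && *first == 0);
    assert(!gauge.flush().has_value());

    LatencyGateSketch gate;
    assert(gate.shouldNotify(std::chrono::milliseconds{1}));
    assert(!gate.shouldNotify(std::chrono::milliseconds{1}));
}
```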
Pratik Mankawde
937d11d7c3 fix(telemetry): default tx span attrs on receive path
Set defaults for tx_span::attr::suppressed (false) and
tx_span::attr::status ("new") immediately after creating the txReceive
span. Previously these attributes were set only in the
HashRouter-suppressed branch, so spans on every other path lacked them
entirely, producing incomplete span data in downstream stores.

The suppressed branch still overrides these when the transaction has
already been seen via HashRouter.
2026-05-13 14:40:57 +01:00
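Note on 937d11d7c3: the pattern is "set defaults at span creation, override in the special branch". A minimal sketch, assuming a span is just a map of string attributes (SpanSketch, receiveTx, the key strings, and the "duplicate" status value are hypothetical; only the defaults false and "new" come from the commit message):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical minimal span: a bag of string attributes.
struct SpanSketch
{
    std::map<std::string, std::string> attrs;

    void setAttribute(std::string const& key, std::string const& value)
    {
        attrs[key] = value;
    }
};

// Hypothetical receive path. Defaults are written immediately after the
// span is created, so every span carries both attributes.
inline SpanSketch receiveTx(bool alreadySeen)
{
    SpanSketch span;
    span.setAttribute("suppressed", "false");
    span.setAttribute("status", "new");

    if (alreadySeen)
    {
        // The suppressed branch overrides the defaults.
        span.setAttribute("suppressed", "true");
        span.setAttribute("status", "duplicate");  // placeholder value
    }
    return span;
}

// Usage: both paths produce a span with both attributes present.
inline void usageExample()
{
    auto const fresh = receiveTx(false);
    assert(fresh.attrs.at("suppressed") == "false");
    assert(fresh.attrs.at("status") == "new");

    auto const seen = receiveTx(true);
    assert(seen.attrs.at("suppressed") == "true");
    assert(seen.attrs.count("status") == 1);
}
```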
Pratik Mankawde
689395a705 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 14:11:58 +01:00
Pratik Mankawde
4cbb1be5b4 fix(telemetry): CI Werror — registry .get() and unused fields
Two build failures surfaced by CI on the Phase 9 branch:

1. NetworkOPsImp stores the ServiceRegistry as
   std::reference_wrapper<ServiceRegistry> registry_, so calls must go
   through registry_.get().<method>(). The MetricsRegistry hooks added
   in setMode() and recvValidation() dereferenced the wrapper directly,
   which compiles against a pre-existing accessor on the wrapper type
   on some toolchains but fails on clang 16/17/20 and gcc 13/15 with
   "no member named 'getMetricsRegistry' in
   std::reference_wrapper<xrpl::ServiceRegistry>".

2. MetricsRegistry::app_ and MetricsRegistry::journal_ are only used
   inside XRPL_ENABLE_TELEMETRY-guarded code paths (gauge callbacks
   and JLOG). When telemetry is disabled, clang's
   -Wunused-private-field warning fails the -Werror build. Move the two
   fields under
   the same #ifdef and guard the constructor initialisers with
   [[maybe_unused]] so the no-op build continues to compile cleanly.
2026-05-13 14:11:16 +01:00
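Note on 4cbb1be5b4: both fixes can be shown in a few lines. A minimal sketch, assuming simplified stand-in types (the struct and class bodies below are hypothetical; only std::reference_wrapper, the getMetricsRegistry name, the XRPL_ENABLE_TELEMETRY macro, and [[maybe_unused]] come from the commit message). Placing [[maybe_unused]] on the constructor parameter is one way to keep the no-op build warning-free and may differ from what the commit does.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-ins for the real registry types.
struct MetricsRegistrySketch
{
    std::string name() const { return "metrics"; }
};

struct ServiceRegistrySketch
{
    MetricsRegistrySketch& getMetricsRegistry() { return metrics_; }

private:
    MetricsRegistrySketch metrics_;
};

// Fix 1: members of the wrapped object are reached through get().
class NetworkOPsSketch
{
public:
    explicit NetworkOPsSketch(ServiceRegistrySketch& r) : registry_(r) {}

    std::string metricsName()
    {
        // registry_.getMetricsRegistry() would not compile:
        // std::reference_wrapper has no such member.
        return registry_.get().getMetricsRegistry().name();
    }

private:
    std::reference_wrapper<ServiceRegistrySketch> registry_;
};

// Fix 2: a field used only in telemetry builds lives under the same guard
// as its uses, so the disabled build has no unused private field.
class GuardedFieldSketch
{
public:
    explicit GuardedFieldSketch([[maybe_unused]] int appId)
#ifdef XRPL_ENABLE_TELEMETRY
        : appId_(appId)
#endif
    {
    }

    int appIdOrZero() const
    {
#ifdef XRPL_ENABLE_TELEMETRY
        return appId_;
#else
        return 0;
#endif
    }

private:
#ifdef XRPL_ENABLE_TELEMETRY
    int appId_;
#endif
};

// Usage: the wrapper is unwrapped explicitly before the member call.
inline void usageExample()
{
    ServiceRegistrySketch registry;
    NetworkOPsSketch ops(registry);
    assert(ops.metricsName() == "metrics");
}
```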