Commit Graph

183 Commits

Author SHA1 Message Date
Pratik Mankawde
dc5bb4b35c feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters
Wire the two previously-registered-but-never-incremented validation
counters to ValidationTracker's gross lifetime tallies, exported as
monotonic ObservableCounters. New gross atomics count each ledger once at
first classification and are never adjusted on late repair, keeping the
_total counters monotonic and additive (agreements_total + missed_total ==
ledgers reconciled); the repair-aware windowed view stays on the existing
xrpld_validation_agreement gauge. The validator-health dashboard panels
that already query these names now render data instead of "No data".

Also de-stale 09-data-collection-reference.md: §5b documented flat metric
names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits
labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the
stale flat-name tables with a pointer to the canonical labeled section,
reconcile the contradictory headline counts, and correct xrpld_job_count
to its real exported name xrpld_jobq_job_count.

Adds two GTests asserting gross tallies stay frozen on repair while net
totals move, plus the additive invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:29:29 +01:00
Pratik Mankawde
f37a4a1022 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-06-05 12:49:38 +01:00
Pratik Mankawde
8f3974c094 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 12:48:40 +01:00
Pratik Mankawde
283fbaa54f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/09-data-collection-reference.md
2026-06-05 12:48:31 +01:00
Pratik Mankawde
3167a49f41 feat(telemetry): derive per-stage tx metrics from apply-pipeline spans
Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally — only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:42:53 +01:00
Pratik Mankawde
759d3506b2 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-05 11:58:59 +01:00
Pratik Mankawde
021300538a Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-05 11:58:49 +01:00
Pratik Mankawde
a71d6635e6 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-05 11:58:43 +01:00
Pratik Mankawde
3df7e9cba6 code review changes and wire unused attributes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:42:33 +01:00
Pratik Mankawde
9c69aab326 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Resolve test conflict: keep xrpl.pb.h include (phase 9) and
std::uint8_t qualifiers (phase 8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:23:39 +01:00
Pratik Mankawde
3eeb8b3730 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-03 16:22:40 +01:00
Pratik Mankawde
93c27997b4 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-03 16:22:35 +01:00
Pratik Mankawde
ac79a5123e Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Resolve runbook conflict: keep both phase 6 ledger/peer span tables
AND new insights/sample queries section from the enrichment work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:22:20 +01:00
Pratik Mankawde
b0e9e1a24d Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 16:16:53 +01:00
Pratik Mankawde
bf0b843ce1 docs(telemetry): document Task 4.9 consensus span attribute gap fill
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:16:43 +01:00
Pratik Mankawde
fce770e4f4 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-03 16:15:43 +01:00
Pratik Mankawde
8dd5ac55e8 docs(telemetry): document Task 3.11 TX/TxQ span attribute gap fill
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:15:33 +01:00
Pratik Mankawde
507828edde Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-06-03 16:14:57 +01:00
Pratik Mankawde
aca6623f14 docs(telemetry): document Task 2.10 RPC/PathFind span attribute gap fill
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:14:49 +01:00
Pratik Mankawde
4d6ddb5f1f Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 14:56:09 +01:00
Pratik Mankawde
cd6264c02f Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 14:51:39 +01:00
Pratik Mankawde
7aebc62223 clang-tidy fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 14:50:54 +01:00
Pratik Mankawde
6554f04252 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-01 14:49:13 +01:00
Pratik Mankawde
ce6a3153a1 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-01 11:49:43 +01:00
Pratik Mankawde
3115313551 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-01 11:49:30 +01:00
Pratik Mankawde
2e61a1c412 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-01 11:49:02 +01:00
Pratik Mankawde
046e2e2b85 minor doc update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 11:48:47 +01:00
Pratik Mankawde
9f81e770eb Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 11:36:19 +01:00
Pratik Mankawde
e321f294e5 clang issues
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 19:22:07 +01:00
Pratik Mankawde
ba7e1f98e4 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:24:43 +01:00
Pratik Mankawde
e7dea147cd Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:18:36 +01:00
Pratik Mankawde
8d730b8b9a Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
e5fae351d6 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-29 17:53:29 +01:00
Pratik Mankawde
a44d91ec27 leftover clang-tidy fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 17:52:45 +01:00
Pratik Mankawde
2f96c6547c Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:51:31 +01:00
Pratik Mankawde
c187a62353 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:47:15 +01:00
Pratik Mankawde
c848e51e13 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:44:07 +01:00
Pratik Mankawde
8f9057729c Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:14:21 +01:00
Pratik Mankawde
f031befc6e compilation fixes and levelization fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:04:19 +01:00
Pratik Mankawde
4e0b6f5b9e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-28 18:32:44 +01:00
Pratik Mankawde
53e8c4d54e fix(docs): apply rename scripts to secure-OTel doc references
Two stray "rippled" tokens introduced by 43258e8d ("docs(telemetry):
add secure-OTel pipeline analysis…") were caught by check-rename in
CI. Re-run docs.sh to convert them to xrpld so the rename check
passes on PR #6425 (and downstream PR #6426 once merged up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 18:27:58 +01:00
Pratik Mankawde
5700eeed1b renaming and namespace updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 17:52:35 +01:00
Pratik Mankawde
7ac5343119 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:09:41 +01:00
Pratik Mankawde
954223958f renames
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:07:34 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
43258e8dc0 docs(telemetry): add secure-OTel pipeline analysis and link into plan
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.

Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Pratik Mankawde
5c92ebefb2 updated presentation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-27 14:51:36 +01:00
Pratik Mankawde
a9f52458b3 merge: pratik/otel-phase8-log-correlation (dashboard UID + line-number cleanup) into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/ledger-operations.json
#	docker/telemetry/grafana/dashboards/peer-network.json
#	docker/telemetry/grafana/dashboards/rpc-performance.json
#	docker/telemetry/grafana/dashboards/system-ledger-data-sync.json
#	docker/telemetry/grafana/dashboards/system-network-traffic.json
#	docker/telemetry/grafana/dashboards/system-node-health.json
#	docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json
#	docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-05-14 17:10:12 +01:00