rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-06-06 18:26:51 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	758a3fec29	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 19:42:53 +01:00
Pratik Mankawde	a23d83f393	docs(telemetry): add ledger.acquire to 09-doc + fix peer-quality dashboard metric prefix Phase 9 introduces the ledger.acquire span (InboundLedger fetch) that phases 7-8 do not have, so the forward-merged 09-data-collection-reference inventory is extended here: - §1.1: add ledger.acquire to the Ledger span table. - §1.2: add its attributes (acquire_reason, timeouts, peer_count, outcome) and note it also sets ledger_seq; bump the span count. Also fix two stale StatsD metric references in the Peer Quality dashboard (xrpld-peer-quality.json): rippled_Peer_Finder_Active_{Inbound,Outbound}_Peers -> xrpld_Peer_Finder_* to match the xrpld_ metric prefix the rest of the stack uses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:36:34 +01:00
Pratik Mankawde	22b533ac51	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-05 19:33:13 +01:00
Pratik Mankawde	8046a30e9b	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 19:28:58 +01:00
Pratik Mankawde	4a8aa9e514	docs(telemetry): reconcile 09-data-collection-reference span/attribute inventory The §1 span and attribute inventory had regressed to an older 16-span snapshot that uses the pre-2026-05-13 dotted attribute keys, while phase-7's code emits ~36 spans with bare/underscore attribute keys. The §Data Flow Overview and §2 System Metrics sections (native OTLP transport — phase-7's migration) were already correct and are left unchanged. - §1.1: expand the span inventory to the full surface — add gRPC (grpc.<MethodName>), TxQ (txq.), PathFind (pathfind.), and the full consensus set (round/phase.open/establish/update_positions/check/mode_change/proposal.receive/validation.receive). Fix the phantom rpc.request -> rpc.http_request, add rpc.ws_upgrade. No grpc.request, no pathfind.rank, no ledger.acquire (the latter is added in phase-9, not yet present here). - §1.2: convert every span-attribute key from dotted xrpl.<domain>.<field> to the bare/underscore form. The sole span-attr dotted exception is xrpl.ledger.hash on peer.validation.receive (shared constant); consensus.validation.send uses bare ledger_hash. Resource attrs xrpl.network.id/type stay dotted. Fix tx_count/tx_failed placement (on tx.apply, not ledger.build). Add attribute tables for the new families. - §1.3: list the full set of spanmetrics dimension labels (bare keys, from the collector config) instead of the stale xrpl_rpc_command-style names. - §4/§5: convert Tempo TraceQL and PromQL examples to the bare attribute/label forms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:28:45 +01:00
Pratik Mankawde	dc5bb4b35c	feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters Wire the two previously-registered-but-never-incremented validation counters to ValidationTracker's gross lifetime tallies, exported as monotonic ObservableCounters. New gross atomics count each ledger once at first classification and are never adjusted on late repair, keeping the _total counters monotonic and additive (agreements_total + missed_total == ledgers reconciled); the repair-aware windowed view stays on the existing xrpld_validation_agreement gauge. The validator-health dashboard panels that already query these names now render data instead of "No data". Also de-stale 09-data-collection-reference.md: §5b documented flat metric names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the stale flat-name tables with a pointer to the canonical labeled section, reconcile the contradictory headline counts, and correct xrpld_job_count to its real exported name xrpld_jobq_job_count. Adds two GTests asserting gross tallies stay frozen on repair while net totals move, plus the additive invariant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 18:29:29 +01:00
Pratik Mankawde	db5b93e2c4	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 12:50:09 +01:00
Pratik Mankawde	f37a4a1022	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill # Conflicts: # src/xrpld/app/misc/detail/TxQ.cpp	2026-06-05 12:49:38 +01:00
Pratik Mankawde	8f3974c094	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 12:48:40 +01:00
Pratik Mankawde	283fbaa54f	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 12:48:31 +01:00
Pratik Mankawde	3167a49f41	feat(telemetry): derive per-stage tx metrics from apply-pipeline spans Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim, tx.transactor) added on phase-3 through the observability stack so the spanmetrics connector produces per-stage RED metrics without any native instruments. - collector: add the `stage` dimension to the spanmetrics connector so the three stages split into separate metric series (3 bounded values). - dashboard: add a "Tx Apply Pipeline" section to transaction-overview with rate, p95 latency, and failure-rate panels grouped by stage, plus a `stage` template variable. Panels follow the existing config (node filter, exported_instance legends, Title Case, axis labels). - The failure panel filters ter_result != tesSUCCESS rather than span status, because a failing ter code completes the span normally — only thrown exceptions set an error status. This matches the existing "Transaction Results by Type" panel convention. - docs: document the spans, attributes, and stage dimension in the data collection reference and runbook, including the sampling caveat that span-derived metrics inherit tracer head-sampling and undercount at sampling_ratio < 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:42:53 +01:00
Pratik Mankawde	6b5e6a49ec	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-05-13 16:45:23 +01:00
Pratik Mankawde	b4e4b57504	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-13 16:45:14 +01:00
Pratik Mankawde	6dd43765b5	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-13 16:45:03 +01:00
Pratik Mankawde	cbf389943f	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-05-13 16:44:49 +01:00
Pratik Mankawde	b05e650b6f	docs(telemetry): update 09-data-collection-reference + Phase5 integration test list for simplified attr naming Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:42:30 +01:00
Pratik Mankawde	782d98d249	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 11:40:15 +01:00
Pratik Mankawde	c096eeb239	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-13 11:30:22 +01:00
Pratik Mankawde	a8549a7ab2	fix(telemetry): address code review findings for Phase 8 log-trace correlation - Replace GetSpan() with direct context value check in Logs::format() to avoid heap allocation (new DefaultSpan) on the no-span path - Restore Phase 7 documentation accidentally deleted during merge - Fix undefined $JAEGER variable → use $TEMPO in integration test - Remove useless LCOV_EXCL markers around #ifdef block - Fix indentation inconsistencies in Log.cpp injection block - Remove incorrect url field from loki.yaml derivedFields - Update stale code sample in Phase8_taskList.md to match implementation - Correct "<10ns" performance claims to accurate ~15-20ns (no-span) and ~50ns (active-span) measurements across all docs - Replace Jaeger references with Tempo in TESTING.md (port 16686→3200) - Improve error handling in check_log_correlation(): track files_scanned, detect missing log files, fix silent grep error masking Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:32:46 +01:00
Pratik Mankawde	beaf01ae4d	fix(telemetry): fix CI failures in phase-6 build, clang-tidy, and rename checks Build fixes in PeerImp.cpp: - Rename duplicate `span` variable to `consSpan` in proposal and validation handlers to avoid redefinition error - Fix `->` on non-pointer SpanGuard (now correctly on shared_ptr) - Fix move-only type copy in lambda capture Clang-tidy fixes: - Concatenate nested namespaces in LedgerSpanNames.h and PeerSpanNames.h - Add missing SpanNames.h includes in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp for direct seg:: symbol usage - Add missing <chrono> and <cstdint> includes in BuildLedger.cpp - Remove unused Feature.h include from BuildLedger.cpp Rename check fix: - Run docs.sh to rename rippled_ metric prefixes to xrpld_ in 09-data-collection-reference.md and telemetry-runbook.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-30 17:09:17 +01:00
Pratik Mankawde	70d86d7ebf	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # OpenTelemetryPlan/09-data-collection-reference.md # OpenTelemetryPlan/OpenTelemetryPlan.md # docker/telemetry/docker-compose.yml # docker/telemetry/grafana/dashboards/statsd-network-traffic.json # docker/telemetry/otel-collector-config.yaml # src/xrpld/overlay/detail/PeerImp.cpp	2026-04-29 20:38:00 +01:00
Pratik Mankawde	9e12e660fe	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:25:13 +01:00
Pratik Mankawde	7ab6f4d34b	fix: address CI rename checks (rippled -> xrpld) in phase-8 docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:09:43 +01:00
Pratik Mankawde	81b47afde7	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # OpenTelemetryPlan/08-appendix.md # OpenTelemetryPlan/OpenTelemetryPlan.md # docker/telemetry/grafana/dashboards/statsd-network-traffic.json # docker/telemetry/grafana/dashboards/statsd-node-health.json # docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json	2026-04-29 20:07:43 +01:00
Pratik Mankawde	b65f91117f	fix: address CI checks (prettier, docs.sh rename, levelization) - Prettier formatting for markdown docs and OTelCollector header - docs.sh rippled→xrpld renames in OTelCollector.cpp comments/strings - Updated levelization ordering with new dependency edges Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:03:22 +01:00
Pratik Mankawde	b933e8ae00	feat(telemetry): add missing StatsD dashboard panels from production dashboard Compared shared production Grafana dashboard against Phase 6 StatsD dashboards and added 10 missing panels covering job execution/dequeue timers, cache metrics, ledger publish gap, state duration rate, duplicate traffic, and detailed traffic breakdown. Node Health dashboard: 8 → 16 panels, plus quantile template variable. Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate(). Updated runbook, data collection reference, and implementation phases docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 14:02:27 +01:00
Pratik Mankawde	2aa8dbc2cb	fix(telemetry): restore StatsD receiver, fix metric prefix and doc errors The StatsD receiver config was lost during a branch rebase (--ours conflict resolution dropped it). Re-add the statsd receiver to the OTel Collector config and wire it into the metrics pipeline so beast::insight UDP metrics flow to Prometheus. Also fixes: - Metric prefix mismatch: docs used xrpld_ but dashboards/tests use rippled_ — align all documentation to match the runnable stack - Remove phantom Peer_Disconnects_Charges from docs (plain atomic, not a beast::insight gauge) - Remove premature .codecov.yml exclusions for Phase 7 OTelCollector files that don't exist on this branch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 12:57:50 +01:00
Pratik Mankawde	1a96f75954	fix(telemetry): apply rename script to phase 6 documentation Replace remaining rippled/Ripple references with xrpld/XRPL in data collection reference, implementation phases, and runbook docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 11:30:50 +01:00
Pratik Mankawde	cb7ee2358d	docs(telemetry): update data collection reference with complete span/attribute inventory Update 09-data-collection-reference.md to reflect the full implementation across all phases: - Expand span inventory from 16 to 35 spans across 8 categories (RPC, PathFind, TX, TxQ, Consensus, Ledger, Peer, gRPC) - Add complete attribute inventory (81 attributes) - Add TxQ spans (6), PathFind spans (5), and all 10 consensus spans - Document LedgerSpanNames.h and PeerSpanNames.h in file inventory - Add close time analysis dashboard panels to dashboard reference - Add $close_time_correct and $resolution_direction template variables - Document toDisplayString(ConsensusMode) utility - Fix section numbering (duplicate section 8) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:00:57 +01:00
Pratik Mankawde	cbbd6ebee2	feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards Integrate the existing StatsD metrics pipeline (beast::insight) into the OpenTelemetry observability stack and add new trace spans for ledger build/store/validate and peer proposal/validation receive. Phase 5b — Ledger, peer, and transaction spans: - Add ledger.build span with close time attributes in BuildLedger.cpp - Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp - Add ledger.store and ledger.validate spans in LedgerMaster.cpp - Add peer.proposal.receive span with trusted attribute in PeerImp.cpp - Add peer.validation.receive span with ledger_hash, full, trusted attributes in PeerImp.cpp - Add ledger-operations and peer-network Grafana dashboards Phase 6 — StatsD metrics integration: - Add StatsD UDP receiver (port 8125) to OTel Collector - Add 5 StatsD Grafana dashboards: node health, network traffic, overlay traffic detail, ledger data sync, RPC pathfinding - Add 09-data-collection-reference.md cataloging all metrics/spans - Update existing dashboards with new span panels - Expand telemetry runbook and integration test script - Add codecov exclusions for telemetry modules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:00:57 +01:00
Pratik Mankawde	7e149f7773	refactor(telemetry): remove residual Jaeger references across chain Fix remaining Jaeger references that accumulated across intermediate branches in the stacked PR chain. These were in files modified by multiple phases where the per-branch fixes didn't cover all additions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:35:04 +01:00
Pratik Mankawde	e1f30c1a22	docs: update data-collection-reference and presentation for external dashboard parity - Fix validations_checked_total recording site (NetworkOPs, not LedgerMaster) - Add Slide 11 to presentation: External Dashboard Parity overview with Mermaid diagrams for new metric categories, ValidationTracker sequence, and new dashboard summary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	5de8c520d1	Phase 10: Workload validation - synthetic load generation and telemetry checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:32:02 +01:00
Pratik Mankawde	81298ceb9f	docs: add external dashboard parity tasks and metric reference for Phase 9 Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards), new metric tables in data-collection-reference, and monitoring sections in runbook covering validation agreement, validator health, peer quality, and state tracking. Source: external dashboard parity design spec (2026-03-30). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	936c73982d	docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges - Add Task 9.7a to Phase9_taskList.md documenting new gauges - Add metric tables to 09-data-collection-reference.md (server_info, build_info, complete_ledgers, db_metrics, extended cache/nodestore) - Update metric counts from ~50 to ~68 in 06-implementation-phases.md - Add OTel MetricsRegistry gauge reference to telemetry-runbook.md - Add 11 new panels to system-node-health.json Grafana dashboard (server state, uptime, peers, validated seq, last close info, build version, complete ledgers, db sizes, historical fetch rate, peer disconnects) - Fix leftover merge conflict marker in 08-appendix.md - Add ripplex/mseconds to cspell dictionary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	892fee638a	Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:49 +01:00
Pratik Mankawde	fdec3ce5c4	Phase 8: Log-trace correlation with Loki and filelog receiver Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:37 +01:00
Pratik Mankawde	2f7064ace6	Phase 7: Native OTel metrics migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:24 +01:00
Pratik Mankawde	1ef234de9d	docs(telemetry): replace Jaeger with Tempo in data collection reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00
Pratik Mankawde	a37cf74868	docs: add peerDisconnectsCharges metric to data collection reference Bridge the existing beast::insight gauge for resource-limit peer disconnects (peerDisconnectsCharges_) into the StatsD metric inventory. Part of the external dashboard parity initiative. See docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00
Pratik Mankawde	21192e9b3f	Phase 6: StatsD metrics integration into telemetry pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 22:31:07 +01:00

41 Commits