rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-06-06 18:26:51 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	8046a30e9b	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 19:28:58 +01:00
Pratik Mankawde	4a8aa9e514	docs(telemetry): reconcile 09-data-collection-reference span/attribute inventory The §1 span and attribute inventory had regressed to an older 16-span snapshot that uses the pre-2026-05-13 dotted attribute keys, while phase-7's code emits ~36 spans with bare/underscore attribute keys. The §Data Flow Overview and §2 System Metrics sections (native OTLP transport — phase-7's migration) were already correct and are left unchanged. - §1.1: expand the span inventory to the full surface — add gRPC (grpc.<MethodName>), TxQ (txq.), PathFind (pathfind.), and the full consensus set (round/phase.open/establish/update_positions/check/mode_change/proposal.receive/validation.receive). Fix the phantom rpc.request -> rpc.http_request, add rpc.ws_upgrade. No grpc.request, no pathfind.rank, no ledger.acquire (the latter is added in phase-9, not yet present here). - §1.2: convert every span-attribute key from dotted xrpl.<domain>.<field> to the bare/underscore form. The sole span-attr dotted exception is xrpl.ledger.hash on peer.validation.receive (shared constant); consensus.validation.send uses bare ledger_hash. Resource attrs xrpl.network.id/type stay dotted. Fix tx_count/tx_failed placement (on tx.apply, not ledger.build). Add attribute tables for the new families. - §1.3: list the full set of spanmetrics dimension labels (bare keys, from the collector config) instead of the stale xrpl_rpc_command-style names. - §4/§5: convert Tempo TraceQL and PromQL examples to the bare attribute/label forms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:28:45 +01:00
Pratik Mankawde	b8dd848899	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 13:40:18 +01:00
Pratik Mankawde	b321792a14	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-05 13:40:18 +01:00
Pratik Mankawde	72642b5dc6	feat(telemetry): add tx apply latency panel by type and stage The existing apply-pipeline panels show latency by stage (all types combined) or by type (single span). Neither answers "for a given transaction type, which stage dominates its latency". Add a p95 panel grouped by both tx_type and stage, filterable via the $tx_type and $stage variables. Both dimensions already exist in spanmetrics, so no collector change is needed. Reflow the section so the full-width failure panel sits below the new full-width panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:39:59 +01:00
Pratik Mankawde	8f3974c094	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 12:48:40 +01:00
Pratik Mankawde	283fbaa54f	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 12:48:31 +01:00
Pratik Mankawde	3167a49f41	feat(telemetry): derive per-stage tx metrics from apply-pipeline spans Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim, tx.transactor) added on phase-3 through the observability stack so the spanmetrics connector produces per-stage RED metrics without any native instruments. - collector: add the `stage` dimension to the spanmetrics connector so the three stages split into separate metric series (3 bounded values). - dashboard: add a "Tx Apply Pipeline" section to transaction-overview with rate, p95 latency, and failure-rate panels grouped by stage, plus a `stage` template variable. Panels follow the existing config (node filter, exported_instance legends, Title Case, axis labels). - The failure panel filters ter_result != tesSUCCESS rather than span status, because a failing ter code completes the span normally — only thrown exceptions set an error status. This matches the existing "Transaction Results by Type" panel convention. - docs: document the spans, attributes, and stage dimension in the data collection reference and runbook, including the sampling caveat that span-derived metrics inherit tracer head-sampling and undercount at sampling_ratio < 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:42:53 +01:00
Pratik Mankawde	759d3506b2	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-05 11:58:59 +01:00
Pratik Mankawde	021300538a	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-05 11:58:49 +01:00
Pratik Mankawde	a71d6635e6	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-05 11:58:43 +01:00
Pratik Mankawde	3df7e9cba6	code review changes and wire unused attributes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-05 11:42:33 +01:00
Pratik Mankawde	6a16dfa823	clang-tidy and formatting changes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-05 11:25:29 +01:00
Pratik Mankawde	6428c9f13c	feat(telemetry): add preflight/preclaim stage spans and stage attribute The tx.transactor span covered only the apply stage; preflight and preclaim had no telemetry, so a transaction that hard-failed those stages produced no apply-pipeline span and per-stage latency/failure was invisible. Add tx.preflight and tx.preclaim spans in applySteps.cpp via a makeStageSpan() helper using SpanGuard::hashSpan, so all three stages share a deterministic trace_id derived from txID[0:16] even though they run sequentially and often cross-thread. Each span carries stage, tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before the public wrappers map them. The type lookup is guarded behind the span-active check so it costs nothing when tracing is off. Add a stage="apply" attribute to the tx.transactor span and move its three hardcoded attribute strings to a new library-safe header include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side TxSpanNames.h strings so the collector spanmetrics connector aggregates both span sets under one dimension set. A constants-contract test pins the span-name, attribute-key, and stage-value strings; span content stays covered by the docker integration test, as the rest of the telemetry suite is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:11:55 +01:00
Pratik Mankawde	d6b314e8d5	fix(telemetry): trim Tempo search filters to 7 cross-cutting entry points Reduced from 30 to 7 filters: service.instance.id, name, status, command, tx_hash, tx_type, ledger_hash. Full attribute inventory is in OpenTelemetryPlan/09-data-collection-reference.md §4; TraceQL autocomplete covers the rest. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-06-04 17:43:26 +01:00
Pratik Mankawde	0a800069bf	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 16:43:25 +01:00
Pratik Mankawde	ca3a78abce	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 16:43:25 +01:00
Pratik Mankawde	eef11a65fa	fix(telemetry): code-review dashboard cleanups (legends + stale descriptions) From the code-review pass: - transaction-overview.json: the tx.process and tx.transactor latency-by-type panels used lowercase legends (p95/p50) without the per-node dimension. Use Title Case (P95/P50), add exported_instance to the by() clause, and include [{{exported_instance}}] in the legend, per the dashboard legend convention. - consensus-health.json: panel descriptions still referenced the old dotted attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename; update them to the bare emitted names (consensus_mode, ledger_seq). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 16:43:12 +01:00
Pratik Mankawde	342b9f55a1	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 15:40:17 +01:00
Pratik Mankawde	000ad1d1f5	feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard) The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp) are emitted but had no dashboard coverage. The existing RPC & Pathfinding dashboard only plotted StatsD timers. Add span-derived rows: - gRPC Request Rate by Method (grpc.* by method) - gRPC Latency P95 by Method - gRPC Error Rate by Status (by grpc_status) - Pathfinding Compute Duration (pathfind.compute p95/p50) - Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover) otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics dimensions (bounded value sets). Add a $grpc_method template variable so the gRPC panels can be filtered by method, consistent with the dashboard filter conventions. Note: these spans populate only when the node serves gRPC / pathfinding traffic; they are correct but not exercised by the current health-check workload (they will be covered by the Phase 10 workload generator). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:40:07 +01:00
Pratik Mankawde	17ffe8b049	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 15:37:55 +01:00
Pratik Mankawde	63c6f3b8df	feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards The consensus state-machine and TxQ lifecycle spans are emitted by the code and present in Prometheus, but no panel visualised them. Add panels keyed on those span_names (verified live) plus the low-cardinality dimensions needed to break them down. Consensus Health (consensus-health.json) — new rows: - Consensus Round Duration (full round, p95/p50, mode-filterable) - Consensus Phase Duration (open vs establish breakdown) - Position Update Duration (update_positions p95/p50) - Consensus Stall Rate (consensus.check by consensus_stalled) - Consensus Mode-Change Rate by Target Mode (mode_change by mode_new) Transaction Overview (transaction-overview.json) — new rows: - TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type) - Queue Bypass Ratio (txq.apply_direct vs txq.enqueue) - Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50) - Queue Cleanup Rate (txq.cleanup expired entries) otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result (all bounded value sets, safe as Prometheus labels). All new panels follow the existing dashboard template: $node filter, exported_instance in every legend, Title Case, axis labels, row layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:37:29 +01:00
Pratik Mankawde	4174aef07b	fix(telemetry): align consensus_mode spanmetrics label with emitted attribute The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code emits the span attribute under the bare key `consensus_mode` (matching every other dimension after the Phase 6 rename). The mismatch left the `xrpl_consensus_mode` Prometheus label empty, so the Consensus Health "Consensus Mode Over Time" panel and the `$consensus_mode` template variable (which filters every panel) matched no live series. - otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode` - consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode` (the `$consensus_mode` Grafana variable name is unchanged) - telemetry-runbook.md: refresh the stale spanmetrics label table to the bare names actually emitted (command/rpc_status/consensus_mode/local/proposal_trusted/validation_trusted), fix dotted->bare attribute names in span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id, consensus_ledger_id, consensus_round, tx_id event attr), correct the consensus_round_id query to int (not quoted string), and fix the load_type value query ("exception_rpc" -> "exceptioned RPC"). Verified against the live stack: Tempo span tags confirm bare attribute keys (consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode series in Prometheus is stale retained data from an older build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:29:45 +01:00
Pratik Mankawde	a5f80514a9	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:26:16 +01:00
Pratik Mankawde	45ab508ed8	fix(telemetry): use short unit for large count/message panels Count and message-volume panels (operating-mode transitions, job queue depth, network/overlay message totals, getobject message counts) used unit "none", rendering large values as raw unscaled numbers. Switch to "short" so Grafana abbreviates (e.g. 1.5 Mil) for readability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:26:03 +01:00
Pratik Mankawde	6c71aa8c2a	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 14:05:25 +01:00
Pratik Mankawde	9b46a343fc	fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics The system-* dashboards queried the legacy StatsD rippled_ prefix, but the node now emits beast::insight metrics via native OTLP under the xrpld_ prefix (config: [insight] server=otel, prefix=xrpld). All queries returned no data. Migration (names derived from C++ beast::insight registrations, not live Prometheus, since a syncing node does not emit every metric yet): - rippled_ -> xrpld_ prefix across all panel queries and template variables (including the $node variable query, which broke the whole dashboard filter) - Histogram Event instruments export with unit ms, so bare _bucket becomes _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full - Job-type metrics were StatsD summaries (label quantile="$quantile"); on the OTLP path they are histograms. Converted those queries to histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m])) and added the previously-undefined $quantile template variable - Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or job types the syncing node has not run) show no data until the path executes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 14:01:13 +01:00
Pratik Mankawde	15d3e3a375	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 11:28:04 +01:00
Pratik Mankawde	0fe09cda9b	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 11:28:04 +01:00
Pratik Mankawde	194f5b8af8	fix(telemetry): set ms unit on duration heatmap y-axes The three duration heatmaps (transaction, consensus accept, RPC latency) had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick values rendered unscaled. Set unit=ms on both the yAxis options and panel defaults so buckets display as proper millisecond values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 11:27:46 +01:00
Pratik Mankawde	8f9fa52f93	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:55:35 +01:00
Pratik Mankawde	fb7c3bc38d	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-06-04 10:55:27 +01:00
Pratik Mankawde	8e606bbaf4	feat(telemetry): add tx_type/ter_result/txq_status dashboard filters Adds template variables $tx_type, $ter_result, $txq_status to the Transaction Overview dashboard. All relevant panels now respect these filters, enabling operators to drill into specific transaction types or result codes. Changes: - Panel 2 renamed to "Transaction Processing Latency by Type" (now shows p95/p50 per tx_type instead of aggregate) - Panels 1,3,4,5,7,9,12 filter by $tx_type - Panel 10 filters by $tx_type and $ter_result - Panel 11 filters by $txq_status - Removed redundant "TX Processing Latency by Type (p95)" panel Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:55:11 +01:00
Pratik Mankawde	811b934004	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:53:55 +01:00
Pratik Mankawde	c80038fd42	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 10:53:55 +01:00
Pratik Mankawde	7397bbcdd2	feat(telemetry): add tx_type/ter_result/txq_status dashboard filters Adds template variables $tx_type, $ter_result, $txq_status to the Transaction Overview dashboard. All relevant panels now respect these filters, enabling operators to drill into specific transaction types or result codes. Changes: - Panel 2 renamed to "Transaction Processing Latency by Type" (now shows p95/p50 per tx_type instead of aggregate) - Panels 1,3,4,5,7,9,12 filter by $tx_type - Panel 10 filters by $tx_type and $ter_result - Panel 11 filters by $txq_status - Removed redundant "TX Processing Latency by Type (p95)" panel Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:53:45 +01:00
Pratik Mankawde	9947a52e79	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-04 10:47:47 +01:00
Pratik Mankawde	ee2f1b4fbf	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-04 10:47:47 +01:00
Pratik Mankawde	2627ea7f65	feat(telemetry): add TX Processing Latency by Type panel to dashboard Shows p95 latency of tx.process span broken down by tx_type. Works for both received and locally-processed transactions, unlike the tx.transactor panel which requires the node to be synced and applying. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 10:47:33 +01:00
Pratik Mankawde	4e422a0354	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 17:25:22 +01:00
Pratik Mankawde	36cae13352	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-03 17:25:22 +01:00
Pratik Mankawde	013252f210	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 17:25:22 +01:00
Pratik Mankawde	970914d2ce	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 17:25:22 +01:00
Pratik Mankawde	289b049b70	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-03 17:25:22 +01:00
Pratik Mankawde	dfd67b8124	fix(telemetry): eliminate duplicate suppressed attribute on tx.receive span The OTel C++ SDK's SetAttribute appends rather than overwrites on in-flight spans. Setting suppressed=false as a default then overriding to true resulted in both values appearing in the exported span. Fix: remove the default-false set, place suppressed=false once after the HashRouter check passes (non-suppressed path), and suppressed=true remains only in the suppressed path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 17:23:59 +01:00
Pratik Mankawde	f60c995fe1	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-03 16:52:00 +01:00
Pratik Mankawde	fff8598a33	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-03 16:52:00 +01:00
Pratik Mankawde	ac1805f0a4	feat(telemetry): add spanmetrics dimensions and dashboard panels for enriched attrs Collector config: add tx_type, ter_result, txq_status, consensus_state, load_type, is_batch as spanmetrics dimensions so they appear as Prometheus labels for dashboard queries. New dashboard panels: - Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie), Transactor Duration p95 by Type - Consensus Health: Outcome Distribution (pie), Failures Over Time - RPC Performance: Resource Cost by Command, Batch vs Single Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 16:51:51 +01:00
Pratik Mankawde	365907ab22	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-03 16:40:22 +01:00
Pratik Mankawde	8b5ded4324	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-03 16:40:22 +01:00

14579 Commits