Commit Graph

14579 Commits

Author SHA1 Message Date
Pratik Mankawde
8046a30e9b Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 19:28:58 +01:00
Pratik Mankawde
4a8aa9e514 docs(telemetry): reconcile 09-data-collection-reference span/attribute inventory
The §1 span and attribute inventory had regressed to an older 16-span snapshot
that uses the pre-2026-05-13 dotted attribute keys, while phase-7's code emits
~36 spans with bare/underscore attribute keys. The §Data Flow Overview and §2
System Metrics sections (native OTLP transport — phase-7's migration) were already
correct and are left unchanged.

- §1.1: expand the span inventory to the full surface — add gRPC (grpc.<MethodName>),
  TxQ (txq.*), PathFind (pathfind.*), and the full consensus set (round/phase.open/
  establish/update_positions/check/mode_change/proposal.receive/validation.receive).
  Fix the phantom rpc.request -> rpc.http_request, add rpc.ws_upgrade. No grpc.request,
  no pathfind.rank, no ledger.acquire (the latter is added in phase-9, not yet present here).
- §1.2: convert every span-attribute key from dotted xrpl.<domain>.<field> to the
  bare/underscore form. The sole span-attr dotted exception is xrpl.ledger.hash on
  peer.validation.receive (shared constant); consensus.validation.send uses bare
  ledger_hash. Resource attrs xrpl.network.id/type stay dotted. Fix tx_count/tx_failed
  placement (on tx.apply, not ledger.build). Add attribute tables for the new families.
- §1.3: list the full set of spanmetrics dimension labels (bare keys, from the collector
  config) instead of the stale xrpl_rpc_command-style names.
- §4/§5: convert Tempo TraceQL and PromQL examples to the bare attribute/label forms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:28:45 +01:00
Pratik Mankawde
b8dd848899 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 13:40:18 +01:00
Pratik Mankawde
b321792a14 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-05 13:40:18 +01:00
Pratik Mankawde
72642b5dc6 feat(telemetry): add tx apply latency panel by type and stage
The existing apply-pipeline panels show latency by stage (all types
combined) or by type (single span). Neither answers "for a given
transaction type, which stage dominates its latency". Add a p95 panel
grouped by both tx_type and stage, filterable via the $tx_type and
$stage variables. Both dimensions already exist in spanmetrics, so no
collector change is needed. Reflow the section so the full-width
failure panel sits below the new full-width panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:39:59 +01:00
Pratik Mankawde
8f3974c094 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 12:48:40 +01:00
Pratik Mankawde
283fbaa54f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/09-data-collection-reference.md
2026-06-05 12:48:31 +01:00
Pratik Mankawde
3167a49f41 feat(telemetry): derive per-stage tx metrics from apply-pipeline spans
Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally — only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:42:53 +01:00
Pratik Mankawde
759d3506b2 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-05 11:58:59 +01:00
Pratik Mankawde
021300538a Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-05 11:58:49 +01:00
Pratik Mankawde
a71d6635e6 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-05 11:58:43 +01:00
Pratik Mankawde
3df7e9cba6 code review changes and wire unused attributes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:42:33 +01:00
Pratik Mankawde
6a16dfa823 clang-tidy and formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:25:29 +01:00
Pratik Mankawde
6428c9f13c feat(telemetry): add preflight/preclaim stage spans and stage attribute
The tx.transactor span covered only the apply stage; preflight and
preclaim had no telemetry, so a transaction that hard-failed those
stages produced no apply-pipeline span and per-stage latency/failure
was invisible.

Add tx.preflight and tx.preclaim spans in applySteps.cpp via a
makeStageSpan() helper using SpanGuard::hashSpan, so all three stages
share a deterministic trace_id derived from txID[0:16] even though they
run sequentially and often cross-thread. Each span carries stage,
tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before
the public wrappers map them. The type lookup is guarded behind the
span-active check so it costs nothing when tracing is off.

Add a stage="apply" attribute to the tx.transactor span and move its
three hardcoded attribute strings to a new library-safe header
include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side
TxSpanNames.h strings so the collector spanmetrics connector aggregates
both span sets under one dimension set.

A constants-contract test pins the span-name, attribute-key, and
stage-value strings; span content stays covered by the docker
integration test, as the rest of the telemetry suite is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 11:11:55 +01:00
Pratik Mankawde
d6b314e8d5 fix(telemetry): trim Tempo search filters to 7 cross-cutting entry points
Reduced from 30 to 7 filters: service.instance.id, name, status, command,
tx_hash, tx_type, ledger_hash. Full attribute inventory is in
OpenTelemetryPlan/09-data-collection-reference.md §4; TraceQL autocomplete
covers the rest.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-06-04 17:43:26 +01:00
Pratik Mankawde
0a800069bf Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 16:43:25 +01:00
Pratik Mankawde
ca3a78abce Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 16:43:25 +01:00
Pratik Mankawde
eef11a65fa fix(telemetry): code-review dashboard cleanups (legends + stale descriptions)
From the code-review pass:
- transaction-overview.json: the tx.process and tx.transactor latency-by-type
  panels used lowercase legends (p95/p50) without the per-node dimension. Use
  Title Case (P95/P50), add exported_instance to the by() clause, and include
  [{{exported_instance}}] in the legend, per the dashboard legend convention.
- consensus-health.json: panel descriptions still referenced the old dotted
  attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename;
  update them to the bare emitted names (consensus_mode, ledger_seq).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:43:12 +01:00
Pratik Mankawde
342b9f55a1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 15:40:17 +01:00
Pratik Mankawde
000ad1d1f5 feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard)
The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp)
are emitted but had no dashboard coverage. The existing RPC & Pathfinding
dashboard only plotted StatsD timers. Add span-derived rows:

- gRPC Request Rate by Method (grpc.* by method)
- gRPC Latency P95 by Method
- gRPC Error Rate by Status (by grpc_status)
- Pathfinding Compute Duration (pathfind.compute p95/p50)
- Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover)

otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics
dimensions (bounded value sets). Add a $grpc_method template variable so the
gRPC panels can be filtered by method, consistent with the dashboard filter
conventions.

Note: these spans populate only when the node serves gRPC / pathfinding
traffic; they are correct but not exercised by the current health-check
workload (they will be covered by the Phase 10 workload generator).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:40:07 +01:00
Pratik Mankawde
17ffe8b049 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 15:37:55 +01:00
Pratik Mankawde
63c6f3b8df feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards
The consensus state-machine and TxQ lifecycle spans are emitted by the code
and present in Prometheus, but no panel visualised them. Add panels keyed on
those span_names (verified live) plus the low-cardinality dimensions needed to
break them down.

Consensus Health (consensus-health.json) — new rows:
- Consensus Round Duration (full round, p95/p50, mode-filterable)
- Consensus Phase Duration (open vs establish breakdown)
- Position Update Duration (update_positions p95/p50)
- Consensus Stall Rate (consensus.check by consensus_stalled)
- Consensus Mode-Change Rate by Target Mode (mode_change by mode_new)

Transaction Overview (transaction-overview.json) — new rows:
- TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type)
- Queue Bypass Ratio (txq.apply_direct vs txq.enqueue)
- Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50)
- Queue Cleanup Rate (txq.cleanup expired entries)

otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle
breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result
(all bounded value sets, safe as Prometheus labels).

All new panels follow the existing dashboard template: $node filter,
exported_instance in every legend, Title Case, axis labels, row layout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:37:29 +01:00
Pratik Mankawde
4174aef07b fix(telemetry): align consensus_mode spanmetrics label with emitted attribute
The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code
emits the span attribute under the bare key `consensus_mode` (matching every
other dimension after the Phase 6 rename). The mismatch left the
`xrpl_consensus_mode` Prometheus label empty, so the Consensus Health
"Consensus Mode Over Time" panel and the `$consensus_mode` template variable
(which filters every panel) matched no live series.

- otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode`
- consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode`
  (the `$consensus_mode` Grafana variable name is unchanged)
- telemetry-runbook.md: refresh the stale spanmetrics label table to the bare
  names actually emitted (command/rpc_status/consensus_mode/local/
  proposal_trusted/validation_trusted), fix dotted->bare attribute names in
  span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id,
  consensus_ledger_id, consensus_round, tx_id event attr), correct the
  consensus_round_id query to int (not quoted string), and fix the
  load_type value query ("exception_rpc" -> "exceptioned RPC").

Verified against the live stack: Tempo span tags confirm bare attribute keys
(consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode
series in Prometheus is stale retained data from an older build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:29:45 +01:00
Pratik Mankawde
a5f80514a9 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 14:26:16 +01:00
Pratik Mankawde
45ab508ed8 fix(telemetry): use short unit for large count/message panels
Count and message-volume panels (operating-mode transitions, job queue
depth, network/overlay message totals, getobject message counts) used
unit "none", rendering large values as raw unscaled numbers. Switch to
"short" so Grafana abbreviates (e.g. 1.5 Mil) for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:26:03 +01:00
Pratik Mankawde
6c71aa8c2a Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 14:05:25 +01:00
Pratik Mankawde
9b46a343fc fix(telemetry): migrate system dashboards from dead rippled_ to xrpld_ metrics
The system-* dashboards queried the legacy StatsD rippled_ prefix, but the
node now emits beast::insight metrics via native OTLP under the xrpld_
prefix (config: [insight] server=otel, prefix=xrpld). All queries returned
no data.

Migration (names derived from C++ beast::insight registrations, not live
Prometheus, since a syncing node does not emit every metric yet):
- rippled_ -> xrpld_ prefix across all panel queries and template variables
  (including the $node variable query, which broke the whole dashboard filter)
- Histogram Event instruments export with unit ms, so bare _bucket becomes
  _milliseconds_bucket: ios_latency, rpc_time, rpc_size, pathfind_fast/full
- Job-type metrics were StatsD summaries (label quantile="$quantile"); on the
  OTLP path they are histograms. Converted those queries to
  histogram_quantile($quantile, rate(xrpld_<job>_milliseconds_bucket[5m]))
  and added the previously-undefined $quantile template variable
- Per-job-type detail panels: __name__ regex now matches _milliseconds_bucket

No panels removed. Panels for metrics not yet emitted (e.g. warn/drop, or
job types the syncing node has not run) show no data until the path executes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:01:13 +01:00
Pratik Mankawde
15d3e3a375 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 11:28:04 +01:00
Pratik Mankawde
0fe09cda9b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 11:28:04 +01:00
Pratik Mankawde
194f5b8af8 fix(telemetry): set ms unit on duration heatmap y-axes
The three duration heatmaps (transaction, consensus accept, RPC latency)
had an axisLabel of "Duration (ms)" but no unit code, so y-axis tick
values rendered unscaled. Set unit=ms on both the yAxis options and
panel defaults so buckets display as proper millisecond values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 11:27:46 +01:00
Pratik Mankawde
8f9fa52f93 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:55:35 +01:00
Pratik Mankawde
fb7c3bc38d Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-06-04 10:55:27 +01:00
Pratik Mankawde
8e606bbaf4 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:55:11 +01:00
Pratik Mankawde
811b934004 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:53:55 +01:00
Pratik Mankawde
c80038fd42 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 10:53:55 +01:00
Pratik Mankawde
7397bbcdd2 feat(telemetry): add tx_type/ter_result/txq_status dashboard filters
Adds template variables $tx_type, $ter_result, $txq_status to the
Transaction Overview dashboard. All relevant panels now respect these
filters, enabling operators to drill into specific transaction types
or result codes.

Changes:
- Panel 2 renamed to "Transaction Processing Latency by Type" (now
  shows p95/p50 per tx_type instead of aggregate)
- Panels 1,3,4,5,7,9,12 filter by $tx_type
- Panel 10 filters by $tx_type and $ter_result
- Panel 11 filters by $txq_status
- Removed redundant "TX Processing Latency by Type (p95)" panel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:53:45 +01:00
Pratik Mankawde
9947a52e79 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 10:47:47 +01:00
Pratik Mankawde
ee2f1b4fbf Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 10:47:47 +01:00
Pratik Mankawde
2627ea7f65 feat(telemetry): add TX Processing Latency by Type panel to dashboard
Shows p95 latency of tx.process span broken down by tx_type. Works for
both received and locally-processed transactions, unlike the tx.transactor
panel which requires the node to be synced and applying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 10:47:33 +01:00
Pratik Mankawde
4e422a0354 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 17:25:22 +01:00
Pratik Mankawde
36cae13352 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-03 17:25:22 +01:00
Pratik Mankawde
013252f210 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-03 17:25:22 +01:00
Pratik Mankawde
970914d2ce Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-03 17:25:22 +01:00
Pratik Mankawde
289b049b70 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-03 17:25:22 +01:00
Pratik Mankawde
dfd67b8124 fix(telemetry): eliminate duplicate suppressed attribute on tx.receive span
The OTel C++ SDK's SetAttribute appends rather than overwrites on
in-flight spans. Setting suppressed=false as a default then overriding
to true resulted in both values appearing in the exported span.

Fix: remove the default-false set, place suppressed=false once after
the HashRouter check passes (non-suppressed path), and suppressed=true
remains only in the suppressed path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 17:23:59 +01:00
Pratik Mankawde
f60c995fe1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-03 16:52:00 +01:00
Pratik Mankawde
fff8598a33 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-03 16:52:00 +01:00
Pratik Mankawde
ac1805f0a4 feat(telemetry): add spanmetrics dimensions and dashboard panels for enriched attrs
Collector config: add tx_type, ter_result, txq_status, consensus_state,
load_type, is_batch as spanmetrics dimensions so they appear as
Prometheus labels for dashboard queries.

New dashboard panels:
- Transaction Overview: Rate by Type, Results by Type, TxQ Status (pie),
  Transactor Duration p95 by Type
- Consensus Health: Outcome Distribution (pie), Failures Over Time
- RPC Performance: Resource Cost by Command, Batch vs Single

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 16:51:51 +01:00
Pratik Mankawde
365907ab22 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-03 16:40:22 +01:00
Pratik Mankawde
8b5ded4324 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-03 16:40:22 +01:00