Commit Graph

14680 Commits

Author SHA1 Message Date
Pratik Mankawde
f7df1742fb fix(telemetry): drop bool close_time_correct filter from close-time panels
The five Close Time panels still rendered "No Data" after the metrics
rewrite. Root cause: each query carried
`span.close_time_correct=~"$close_time_correct"`, but close_time_correct
is a boolean span attribute and TraceQL's regex match (=~) against a
bool matches nothing in a metrics query, so every panel returned an
empty series set (HTTP 200, {"series":[]}).

Remove that filter clause. The panels do not break down by
close_time_correct, so dropping it restores data without losing any
dimension. The $node filter (a string attribute) is unaffected and
stays.

Verified via the Grafana datasource proxy that all six targets now
return series.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:40:58 +01:00
Pratik Mankawde
dc5bb4b35c feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters
Wire the two previously-registered-but-never-incremented validation
counters to ValidationTracker's gross lifetime tallies, exported as
monotonic ObservableCounters. New gross atomics count each ledger once at
first classification and are never adjusted on late repair, keeping the
_total counters monotonic and additive (agreements_total + missed_total ==
ledgers reconciled); the repair-aware windowed view stays on the existing
xrpld_validation_agreement gauge. The validator-health dashboard panels
that already query these names now render data instead of "No data".

Also de-stale 09-data-collection-reference.md: §5b documented flat metric
names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits
labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the
stale flat-name tables with a pointer to the canonical labeled section,
reconcile the contradictory headline counts, and correct xrpld_job_count
to its real exported name xrpld_jobq_job_count.

Adds two GTests asserting gross tallies stay frozen on repair while net
totals move, plus the additive invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:29:29 +01:00
Pratik Mankawde
0d1d1aa0e1 fix(telemetry): wire consensus close-time panels via TraceQL metrics
The five Close Time panels (Raw Proposals, Effective/Quantized, Vote
Bins & Resolution, Resolution Direction, Bin Distribution) rendered
empty: they used TraceQL `| select(attr)`, which returns a trace list
that a timeseries/barchart panel cannot plot.

Enable TraceQL metrics in Tempo and rewrite the panels to use it:

- tempo.yaml: add the local-blocks processor to the metrics generator so
  recent blocks are queryable via /api/metrics/query_range. Set
  filter_server_spans=false because the consensus spans are
  SPAN_KIND_INTERNAL (the default keeps only server spans, so attribute
  aggregations over internal spans returned nothing), and
  flush_to_storage=true with a traces_storage path so query_range can
  read the flushed blocks.
- consensus-health.json: replace each panel's select() with a metrics
  query — quantile_over_time on the integer close-time attributes,
  avg_over_time for vote bins / resolution, and count_over_time by the
  resolution_direction and vote-bin dimensions. Set the raw/effective
  panels' unit to seconds (the values are Ripple-epoch seconds, which
  dateTimeFromNow rendered with the wrong epoch).

Verified the query forms compile and return series against live internal
spans; the close-time series populate once the node reaches full sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:58:29 +01:00
Pratik Mankawde
1fa39fdef6 fix(telemetry): move job-queue gauge to top and add stable panel ids
The Current Job Latency gauge sat at the bottom of the Job Queue
Analysis dashboard; per the dashboard guideline gauges belong at the
top. Move it to the first row and reflow the remaining panels below it.
Also assign explicit sequential panel ids so deep links stay stable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:25:40 +01:00
Pratik Mankawde
18121a8cf4 fix(telemetry): widen only timeseries with right-side table legends
Correct the width rule from the previous layout commit. Full width
(w=24) is now applied ONLY to timeseries panels whose legend is a
right-side table, since those legends need the horizontal room. Panels
with default/bottom legends, pie charts, and the heatmap return to half
width. This narrows "Transaction Receive vs Suppressed" and "TxQ
Enqueue Rate by Transaction Type", which were wrongly widened.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:24:06 +01:00
Pratik Mankawde
93c31573c5 refactor(telemetry): stable panel ids and topic-grouped layout
Make transaction-overview deep-links stable and improve readability:

- Assign explicit sequential panel ids (1..20) so viewPanel=panel-N
  URLs stay pinned to the same chart across edits. Previously ids were
  unset and Grafana auto-assigned them by array position, so any
  reorder silently repointed bookmarks.
- Move the single-value stat panel (Transaction Apply Failed Rate) to
  the top row.
- Lay out in three topic sections (Processing, Apply Pipeline, Queue).
  Within each, timeseries with a breakdown dimension (tx_type, stage,
  ter_result, suppressed) take full width so their right-side table
  legends are readable; single-series panels, pie charts, and the
  heatmap stay half-width and pair up.

All six template variables already default to All (includeAll + multi);
no change needed there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:02:47 +01:00
Pratik Mankawde
ceea9e49dd fix(telemetry): standardize transaction-overview legends and cap tooltips
Apply the dashboard legend convention across all panels now that the
P50 series have been removed (P95-only):

- Drop the redundant "P95 " / "P50 " prefix; the panel title already
  states the percentile.
- Put every filter/dimension value inside [] comma-separated, ending
  with exported_instance, e.g. "AMMDeposit [Preclaim, xrpld-mainnet]".
- Add exported_instance to the by() clause and legend of the three
  panels that filtered on $node but omitted it (Transaction Rate by
  Type, Transaction Results by Type, TxQ Accept Status), so per-node
  series are produced.
- Title-case the stage value for display via label_replace in the four
  apply-pipeline panels; the span attribute stays lowercase
  (preflight/preclaim/apply) since legendFormat cannot change case.
- Cap tooltip maxHeight at 500 on every panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 15:06:52 +01:00
Pratik Mankawde
db4d70bbc2 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-05 13:40:19 +01:00
Pratik Mankawde
b8dd848899 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 13:40:18 +01:00
Pratik Mankawde
b321792a14 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-05 13:40:18 +01:00
Pratik Mankawde
72642b5dc6 feat(telemetry): add tx apply latency panel by type and stage
The existing apply-pipeline panels show latency by stage (all types
combined) or by type (single span). Neither answers "for a given
transaction type, which stage dominates its latency". Add a p95 panel
grouped by both tx_type and stage, filterable via the $tx_type and
$stage variables. Both dimensions already exist in spanmetrics, so no
collector change is needed. Reflow the section so the full-width
failure panel sits below the new full-width panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:39:59 +01:00
Pratik Mankawde
f37a4a1022 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-06-05 12:49:38 +01:00
Pratik Mankawde
8f3974c094 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 12:48:40 +01:00
Pratik Mankawde
283fbaa54f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/09-data-collection-reference.md
2026-06-05 12:48:31 +01:00
Pratik Mankawde
3167a49f41 feat(telemetry): derive per-stage tx metrics from apply-pipeline spans
Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally — only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:42:53 +01:00
Pratik Mankawde
759d3506b2 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-05 11:58:59 +01:00
Pratik Mankawde
021300538a Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-05 11:58:49 +01:00
Pratik Mankawde
a71d6635e6 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-05 11:58:43 +01:00
Pratik Mankawde
3df7e9cba6 code review changes and wire unused attributes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:42:33 +01:00
Pratik Mankawde
6a16dfa823 clang-tidy and formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:25:29 +01:00
Pratik Mankawde
6428c9f13c feat(telemetry): add preflight/preclaim stage spans and stage attribute
The tx.transactor span covered only the apply stage; preflight and
preclaim had no telemetry, so a transaction that hard-failed those
stages produced no apply-pipeline span and per-stage latency/failure
was invisible.

Add tx.preflight and tx.preclaim spans in applySteps.cpp via a
makeStageSpan() helper using SpanGuard::hashSpan, so all three stages
share a deterministic trace_id derived from txID[0:16] even though they
run sequentially and often cross-thread. Each span carries stage,
tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before
the public wrappers map them. The type lookup is guarded behind the
span-active check so it costs nothing when tracing is off.

Add a stage="apply" attribute to the tx.transactor span and move its
three hardcoded attribute strings to a new library-safe header
include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side
TxSpanNames.h strings so the collector spanmetrics connector aggregates
both span sets under one dimension set.

A constants-contract test pins the span-name, attribute-key, and
stage-value strings; span content stays covered by the docker
integration test, as the rest of the telemetry suite is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 11:11:55 +01:00
Pratik Mankawde
d7e847a53b removed p50 renders from all dashboards
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 18:11:23 +01:00
Pratik Mankawde
c3bdcb4291 clang-tidy include
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 18:02:47 +01:00
Pratik Mankawde
478b58395b loop levelization
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 17:54:52 +01:00
Pratik Mankawde
cf075888ff docs(telemetry): fix TraceQL/LogQL query syntax in runbook
- Replace all `{name="..."} | attr = val` pipeline queries with the
  correct `{name="..." && span.attr = val}` inline filter syntax
- Add `span.` prefix to all span attribute references; `duration`,
  `status`, `name`, and `resource.*` keep no prefix
- Fix Loki stream selector: `{job="xrpld"}` → `{service_name="xrpld"}`
  in all LogQL examples and the verification step
- Fix cross-node queries: `rootServiceName` → `resource.service.name`,
  `{name=~"tx\\..*"} | attr` → `{name =~ "tx.*" && span.attr}`
- Add DEX section (OfferCreate variants by ter_result, OfferCancel,
  peer relay)
- Add syntax cheat-sheet block at top of Insights section
- Expand tx workflow: per-AMM-type queries, Payment tecPATH.*, TrustSet,
  OracleSet, NFTokenMint cross-span
- Expand consensus: slow rounds, validation send+receive comparison
- Expand cross-subsystem: AMM cross-span, tx.receive no-error
- Expand TxQ: retried status, NFToken enqueue type
- Update Where-to-Look table: add AMM/DEX/NFT/close-time rows, fix
  attribute references to use span. prefix, fix stale consensus_stalled
  entry (now consensus_result on consensus.check)
- All 57 queries verified against live stack — zero parse errors

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-06-04 17:51:37 +01:00
Pratik Mankawde
d6b314e8d5 fix(telemetry): trim Tempo search filters to 7 cross-cutting entry points
Reduced from 30 to 7 filters: service.instance.id, name, status, command,
tx_hash, tx_type, ledger_hash. Full attribute inventory is in
OpenTelemetryPlan/09-data-collection-reference.md §4; TraceQL autocomplete
covers the rest.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-06-04 17:43:26 +01:00
Pratik Mankawde
0a800069bf Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 16:43:25 +01:00
Pratik Mankawde
938a4d17ce Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-04 16:43:25 +01:00
Pratik Mankawde
ca3a78abce Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 16:43:25 +01:00
Pratik Mankawde
eef11a65fa fix(telemetry): code-review dashboard cleanups (legends + stale descriptions)
From the code-review pass:
- transaction-overview.json: the tx.process and tx.transactor latency-by-type
  panels used lowercase legends (p95/p50) without the per-node dimension. Use
  Title Case (P95/P50), add exported_instance to the by() clause, and include
  [{{exported_instance}}] in the legend, per the dashboard legend convention.
- consensus-health.json: panel descriptions still referenced the old dotted
  attribute names (xrpl.consensus.mode, xrpl.ledger.seq) after the A1 rename;
  update them to the bare emitted names (consensus_mode, ledger_seq).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:43:12 +01:00
Pratik Mankawde
9c40039fda fix(telemetry): avoid double-counting consensus_txset ledger mismatches
handleMismatch() recorded the "consensus_txset" reason but then fell through
to the transaction-level comparison, which also recorded a reason
("same_txset_diff_result" / "different_txset"). A single mismatch with
disagreeing consensus tx-set hashes therefore incremented
xrpld_ledger_history_mismatch_total twice across two reason labels, so the
sum over reason exceeded the real mismatch count.

The consensus tx-set hash disagreement is the root cause; return after
recording it so each mismatch contributes exactly one reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:37:59 +01:00
Pratik Mankawde
c20d10fd36 fix(telemetry): restore consensus_mode label fix lost in phase-8->9 merge
The A1 fix (xrpl_consensus_mode -> consensus_mode) was applied on phase-6, but
the phase-8->phase-9 merge conflict resolution for consensus-health.json took
phase-9's pre-fix panel base, silently reintroducing all 11 stale
xrpl_consensus_mode label references (the spanmetrics label that is never
populated — see the original A1 commit).

Re-apply the label fix on phase-9: xrpl_consensus_mode -> consensus_mode in
every panel expr, legendFormat, and the $consensus_mode template variable's
label_values() query. The Grafana variable name $consensus_mode is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:20:42 +01:00
Pratik Mankawde
7d8e908879 feat(telemetry): add dashboard panels for new T3 metrics
Visualise the metrics added in this series:
- consensus-health: "Ledger History Mismatch Rate by Reason"
  (xrpld_ledger_history_mismatch_total by reason — fork diagnostics)
- fee-market: "Queue Abandonment Rate (Expired)" and "Queue Admission
  Rejections (Dropped)" (xrpld_txq_expired_total / dropped_total)
- peer-network: "Reduce-Relay Peer Selection" and "Reduce-Relay Missing-Tx
  Frequency" (xrpld_reduce_relay_metrics)
- system-node-health: "Ledger Acquire Duration" and "Ledger Acquire Rate by
  Outcome" (ledger.acquire span)

otel-collector-config.yaml: add outcome and acquire_reason spanmetrics
dimensions so the ledger.acquire outcome breakdown populates.

All panels follow the existing template: $node filter, exported_instance in
legends, Title Case, axis labels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:16:55 +01:00
Pratik Mankawde
6205199dc7 docs(telemetry): list new instruments in MetricsRegistry class diagram
Add the new synchronous counters (ledger_history_mismatch_total{reason},
txq_expired_total, txq_dropped_total{reason}) and the reduce-relay observable
gauge to the ASCII ownership diagram in the MetricsRegistry header so the
documented instrument inventory matches the code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:15:20 +01:00
Pratik Mankawde
9376aa7c88 feat(telemetry): add reduce-relay efficiency gauge
The transaction reduce-relay subsystem (selected vs suppressed peers,
feature-disabled peers, missing-tx frequency) was computed in OverlayImpl's
TxMetrics but only surfaced via the get_counts JSON RPC — invisible to
Prometheus/Grafana, despite being the central efficiency KPI for the feature.

Add an observable gauge xrpld_reduce_relay_metrics{metric} that reads
Overlay::txMetrics() and parses its rolling-average fields:
- selected_peers     (txr_selected_cnt)
- suppressed_peers   (txr_suppressed_cnt)
- not_enabled_peers  (txr_not_enabled_cnt)
- missing_tx_freq    (txr_missing_tx_freq)

The JSON values are decimal strings (std::to_string), parsed via std::stoll —
the same JSON-reading pattern as registerNodeStoreGauge. No new Overlay
accessor or core-interface change required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:14:33 +01:00
Pratik Mankawde
864ac729de feat(telemetry): add ledger.acquire span for inbound ledger fetch
InboundLedger drives ledger back-fill and fork recovery with timeout/retry
logic (kLedgerTimeoutRetriesMax = 6), but emitted only a global ledger_fetches
counter — sync/recovery cost was a telemetry blind spot.

Add a ledger.acquire span that wraps the acquisition lifecycle:
- Started in InboundLedger::init() with ledger_seq and acquire_reason
  (history / consensus / generic, mirroring InboundLedger::Reason).
- Finalized in InboundLedger::done() with outcome (complete / failed),
  timeouts, and peer_count, then reset so the span duration is exported.

Held as a std::optional<SpanGuard> member (same pattern as RCLConsensus
roundSpan_). New op/attr/val constants added to LedgerSpanNames.h. Compiles to
a no-op when telemetry is disabled via the SpanGuard fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:11:57 +01:00
Pratik Mankawde
793d2ecfce feat(telemetry): add txq expired/dropped counters for queue backpressure
The transaction queue had no metric for demand that leaves or never enters the
queue, so fee-underpayment abandonment and admission-control rejection were
invisible (distinct from jq_trans_overflow, which is the job queue).

Add two synchronous counters via MetricsRegistry:
- xrpld_txq_expired_total — incremented in TxQ::processClosedLedger() for each
  queued transaction removed because its LastLedgerSequence passed (submitters
  who under-bid the escalating fee and were never included).
- xrpld_txq_dropped_total{reason} — incremented in TxQ::apply() at the
  queue-full admission-control returns (reason="queue_full").

Both reach MetricsRegistry via the Application& parameter already passed to
these methods; calls are null-guarded so they no-op when telemetry is disabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:06:33 +01:00
Pratik Mankawde
7a509a01eb feat(telemetry): add xrpld_ledger_history_mismatch_total{reason} counter
LedgerHistory::handleMismatch() already classifies a built-vs-validated ledger
mismatch (prior ledger, close time, consensus tx set, same/different tx set),
but only bumped a single untyped beast::insight counter — the reason was
dropped. Fork diagnosis was therefore a log-grep exercise.

Add a labeled OTel counter so the mismatch reason is a queryable time series:
- MetricsRegistry: new ledgerHistoryMismatchCounter_ + incrementLedgerHistoryMismatch(reason)
- LedgerHistory: record one reason per classification branch (unknown,
  prior_ledger, close_time, consensus_txset, same_txset_diff_result,
  different_txset). Reaches MetricsRegistry via the existing app_ reference.

The existing beast::insight mismatchCounter_ is left intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 16:02:26 +01:00
Pratik Mankawde
d3955d3639 fix(telemetry): emit real diverged-peer count for peers_insane_count
The xrpld_peer_quality{metric="peers_insane_count"} gauge was hardcoded to 0.0
with a TODO, leaving the "Insane/Diverged Peers" panel permanently empty.

PeerImp::json() already exposes the peer's tracking state via the "track"
field (set to "diverged" when tracking_ == Tracking::Diverged). The peer-quality
callback already iterates peer->json() for latency and version, so count peers
whose "track" field equals "diverged" in the same loop — no change to the
abstract Peer interface required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:53:53 +01:00
Pratik Mankawde
d7baf262f8 fix(telemetry): remove duplicate consensus outcome/failures panels
A phase-8->phase-9 merge (a675897aaf) duplicated the "Consensus Outcome
Distribution" and "Consensus Failures Over Time" panels: both appeared twice
with byte-identical queries (verified ignoring gridPos). The pair existed once
on phase-6/7/8 and became two on phase-9 only, so the duplication originated
in phase-9's own merge history.

Remove the second (lower) copy of each and re-stack panel y-positions with no
gaps. The single retained copy keeps the original y=64 row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:51:52 +01:00
Pratik Mankawde
b286335ccf feat(telemetry): add load-factor attribution and 7-day agreement panels
Both metrics are already emitted and live in Prometheus but were not fully
visualised.

- Fee Market (xrpld-fee-market.json): "Load Factor Attribution (Stacked
  Components)" — stacks load_factor_fee_escalation / fee_queue / local / net /
  cluster so an operator can see which component drives the effective fee. The
  existing panels showed the aggregate only.
- Validator Health (xrpld-validator-health.json): "Agreement % (7d)" and
  "Agreements vs Missed (7d)" — the xrpld_validation_agreement gauge already
  observes agreement_pct_7d / agreements_7d / missed_7d, but the dashboard only
  plotted 1h and 24h windows.

Panels follow the existing template: $node filter, exported_instance in legends,
Title Case, axis labels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:44:07 +01:00
Pratik Mankawde
5c2997d95e Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
2026-06-04 15:41:20 +01:00
Pratik Mankawde
342b9f55a1 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-04 15:40:17 +01:00
Pratik Mankawde
000ad1d1f5 feat(telemetry): add gRPC and pathfinding span panels (RPC dashboard)
The grpc.{Method} spans (GRPCServer.cpp) and pathfind.* spans (PathRequest.cpp)
are emitted but had no dashboard coverage. The existing RPC & Pathfinding
dashboard only plotted StatsD timers. Add span-derived rows:

- gRPC Request Rate by Method (grpc.* by method)
- gRPC Latency P95 by Method
- gRPC Error Rate by Status (by grpc_status)
- Pathfinding Compute Duration (pathfind.compute p95/p50)
- Pathfinding Request & Discovery Rate (pathfind.request / pathfind.discover)

otel-collector-config.yaml: add method, grpc_role, grpc_status spanmetrics
dimensions (bounded value sets). Add a $grpc_method template variable so the
gRPC panels can be filtered by method, consistent with the dashboard filter
conventions.

Note: these spans populate only when the node serves gRPC / pathfinding
traffic; they are correct but not exercised by the current health-check
workload (they will be covered by the Phase 10 workload generator).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:40:07 +01:00
Pratik Mankawde
17ffe8b049 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-04 15:37:55 +01:00
Pratik Mankawde
63c6f3b8df feat(telemetry): surface consensus + TxQ lifecycle spans in dashboards
The consensus state-machine and TxQ lifecycle spans are emitted by the code
and present in Prometheus, but no panel visualised them. Add panels keyed on
those span_names (verified live) plus the low-cardinality dimensions needed to
break them down.

Consensus Health (consensus-health.json) — new rows:
- Consensus Round Duration (full round, p95/p50, mode-filterable)
- Consensus Phase Duration (open vs establish breakdown)
- Position Update Duration (update_positions p95/p50)
- Consensus Stall Rate (consensus.check by consensus_stalled)
- Consensus Mode-Change Rate by Target Mode (mode_change by mode_new)

Transaction Overview (transaction-overview.json) — new rows:
- TxQ Enqueue Rate by Transaction Type (txq.enqueue by tx_type)
- Queue Bypass Ratio (txq.apply_direct vs txq.enqueue)
- Queue Accept (Drain) Duration per Ledger (txq.accept p95/p50)
- Queue Cleanup Rate (txq.cleanup expired entries)

otel-collector-config.yaml — add spanmetrics dimensions for the lifecycle
breakdowns: mode_new, consensus_stalled, consensus_phase, consensus_result
(all bounded value sets, safe as Prometheus labels).

All new panels follow the existing dashboard template: $node filter,
exported_instance in every legend, Title Case, axis labels, row layout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:37:29 +01:00
Pratik Mankawde
4174aef07b fix(telemetry): align consensus_mode spanmetrics label with emitted attribute
The spanmetrics connector dimension was `xrpl.consensus.mode`, but the code
emits the span attribute under the bare key `consensus_mode` (matching every
other dimension after the Phase 6 rename). The mismatch left the
`xrpl_consensus_mode` Prometheus label empty, so the Consensus Health
"Consensus Mode Over Time" panel and the `$consensus_mode` template variable
(which filters every panel) matched no live series.

- otel-collector-config.yaml: dimension `xrpl.consensus.mode` -> `consensus_mode`
- consensus-health.json: 11 label refs `xrpl_consensus_mode` -> `consensus_mode`
  (the `$consensus_mode` Grafana variable name is unchanged)
- telemetry-runbook.md: refresh the stale spanmetrics label table to the bare
  names actually emitted (command/rpc_status/consensus_mode/local/
  proposal_trusted/validation_trusted), fix dotted->bare attribute names in
  span tables and TraceQL examples (tx_hash, ledger_seq, consensus_round_id,
  consensus_ledger_id, consensus_round, tx_id event attr), correct the
  consensus_round_id query to int (not quoted string), and fix the
  load_type value query ("exception_rpc" -> "exceptioned RPC").

Verified against the live stack: Tempo span tags confirm bare attribute keys
(consensus_mode, ledger_seq, tx_hash, ...); the populated xrpl_consensus_mode
series in Prometheus is stale retained data from an older build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:29:45 +01:00
Pratik Mankawde
e6643a4389 updated tags
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 14:46:57 +01:00
Pratik Mankawde
80800ee130 use image-renderer in graphana
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 14:40:35 +01:00
Pratik Mankawde
ebc5c5ed9d fix(telemetry): set service_instance_id in [insight] so dashboards filter
beast::insight metrics exported via OTLP carried no exported_instance
label because [insight] omitted service_instance_id (only [telemetry]
set it). Every system-* dashboard filters insight metrics with
exported_instance=~"$node", and the $node template variable is sourced
from label_values(..., exported_instance) — so with the label absent,
$node was empty and all insight-backed panels showed no data.

Add service_instance_id to [insight] in both telemetry configs, matching
the [telemetry] value (xrpld-mainnet / xrpld-devnet). CollectorManager
already reads this key and passes it to OTelCollector, which sets the
service.instance.id resource attribute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 14:36:04 +01:00