Commit Graph

14771 Commits

Author SHA1 Message Date
Pratik Mankawde
3bb4ea84a4 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 21:37:42 +01:00
Pratik Mankawde
3c1189d6f8 fix(telemetry): clarify confusing consensus close-time panels
Several close-time panels showed raw codes/booleans and unreadable
time-bar-charts. Relabel and re-visualize them so each reads clearly:

- Close-Time Proposal Spread (was "Close Time Bin Distribution"):
  corrected a wrong description (it claimed per-node proposals; the
  metric is the number of DISTINCT close-time positions peers proposed
  per round, rawCloseTimes.peers.size()). Converted the time-barchart
  (unreadable timestamp axis) to a horizontal bar gauge summing rounds
  per distinct-position count.
- Consensus Outcome Distribution: renameByRegex maps the raw
  consensus_state codes to human labels (yes->Agreed,
  moved_on->Moved On (partial), expired->Expired (timeout), no->No
  Consensus); value mappings alone do not relabel pie legends.
- Close-Time Agreement Rate (was "Close Time Agreement"): legend
  relabelled from "close_time_correct=true/false" to Agreed/Disagreed.
- Close-Time Resolution Change (was "Close Time Resolution Direction"):
  converted to a bar gauge; increased/decreased/unchanged relabelled to
  Coarser (more disagreement) / Finer (better agreement) / Steady.

All four verified by rendering the panels to PNG via the Grafana
image renderer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:37:11 +01:00
Pratik Mankawde
b79991b190 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 21:37:06 +01:00
Pratik Mankawde
4c2125c07e clang-tidy issue fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 21:35:19 +01:00
Pratik Mankawde
4ea4b99f15 refactor(telemetry): remove non-useful absolute close-time value panels
"Close Time: Raw Proposals" and "Close Time: Effective / Quantized"
plotted the absolute close time (Ripple-epoch seconds), which is an
ever-rising line with no analytical value. There is no clean way to make
them useful: the values are Ripple-epoch seconds (not Unix ms, so date
units misrender), TraceQL metrics cannot do the epoch offset or an
inter-ledger-gap subtraction in-query, and Grafana calculateField
transforms break on these grouped Tempo metric frames (verified by
render: "No data").

Remove both. The useful consensus-timing signals are already covered:
time-to-consensus (round_time_ms), rounds per ledger (establish_count),
previous round time, and the close-time vote-bins / resolution-direction
/ bin-distribution panels remain. Re-tile and re-id (22 panels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:24:36 +01:00
Pratik Mankawde
d83cb0bdb3 fix(telemetry): refresh regression baseline + widen bucket-noise thresholds
With validation now passing 133/133, the only remaining job failure was the
regression gate flagging 4 timing "regressions". Two compounding causes:

1. Stale baseline: the committed baseline was captured (2026-04-24) under the
   old, lighter workload — before the new txq-burst phase (60 TPS) existed. The
   heavier per-ledger work genuinely raises ledger.build / tx.apply /
   ledger.validate / acceptLedger timings, so every run regressed against it.
   Refreshed the baseline from the latest CI-measured timings (same workload).
2. Histogram quantization: SpanMetrics latency buckets are
   [1,5,10,25,...]ms, so a sub-millisecond quantile near a low-end boundary can
   jump a full bucket (1ms->5ms) between runs with no real change. The old
   absolute bounds (2-5ms) were narrower than one bucket width, so that jitter
   tripped the gate. Widened the default span bounds to 10-15ms (~2 low-end
   buckets) and pct to 50%, and the job_queue running bound to 20ms, to tolerate
   quantization noise while still catching genuine multi-bucket regressions. The
   consensus.* overrides (tight pct, large abs) are unchanged.

The refreshed baseline also picks up real rpc.ws_message timings (previously null
under the phantom rpc.request key).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:58:07 +01:00
Pratik Mankawde
758a3fec29 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation
# Conflicts:
#	OpenTelemetryPlan/09-data-collection-reference.md
2026-06-05 19:42:53 +01:00
Pratik Mankawde
2ee4d2ff2d fix(telemetry): show consensus rounds as integer distribution, widen table panels
Number of rounds is an integer, but avg_over_time(establish_count)
produced fractional values (2.1). Switch the Rounds panel to
count_over_time() by (span.establish_count): one integer series per
round count (1/2/3...), showing how many ledgers needed that many
establish rounds — the meaningful distribution, inherently integer
(decimals=0).

Apply dashboard rule 9: panels with a right-side table legend take full
width. Widen "Consensus Rounds per Ledger" and "Consensus Outcome
Distribution" to w=24 and re-tile the dashboard.

Verified via the Grafana proxy: rounds=2 dominates (~11-12 ledgers),
rounds=3 occasional.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:40:41 +01:00
Pratik Mankawde
a23d83f393 docs(telemetry): add ledger.acquire to 09-doc + fix peer-quality dashboard metric prefix
Phase 9 introduces the ledger.acquire span (InboundLedger fetch) that phases 7-8
do not have, so the forward-merged 09-data-collection-reference inventory is
extended here:
- §1.1: add ledger.acquire to the Ledger span table.
- §1.2: add its attributes (acquire_reason, timeouts, peer_count, outcome) and
  note it also sets ledger_seq; bump the span count.

Also fix two stale StatsD metric references in the Peer Quality dashboard
(xrpld-peer-quality.json): rippled_Peer_Finder_Active_{Inbound,Outbound}_Peers
-> xrpld_Peer_Finder_* to match the xrpld_ metric prefix the rest of the stack uses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:36:34 +01:00
Pratik Mankawde
24f60ab1d4 feat(telemetry): make consensus panels show real consensus timing and rounds
The consensus duration panels plotted span wall-clock
(traces_span_metrics_duration_milliseconds), which is ~3-8 ms of
instrumentation overhead, not the real consensus time (~3000 ms). And
the close-time value panels plotted an ever-rising absolute epoch line.
Rework them to answer the actual operational questions, all from
attributes that already exist on the consensus spans:

- Time to Reach Consensus (p50/p95) and Average Time to Reach Consensus:
  round_time_ms on consensus.accept — the wall-clock to agree a ledger.
- Consensus Rounds per Ledger (Establish Count): avg and max of
  establish_count on consensus.establish — how many proposal rounds it
  took to converge (1 = first proposal).
- Previous Round Time per Ledger: previous_round_time_ms on
  consensus.round.

Reorder the dashboard into an investigation flow: health/throughput ->
time-to-consensus and rounds -> ledger close/apply timing -> close-time
detail -> failures/mode/mismatch. Assign stable sequential panel ids.

Verified each query returns data via the Grafana datasource proxy
(p95 ~4096 ms, avg ~2825 ms, rounds ~2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:35:34 +01:00
Pratik Mankawde
22b533ac51 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-05 19:33:13 +01:00
Pratik Mankawde
8046a30e9b Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 19:28:58 +01:00
Pratik Mankawde
4a8aa9e514 docs(telemetry): reconcile 09-data-collection-reference span/attribute inventory
The §1 span and attribute inventory had regressed to an older 16-span snapshot
that uses the pre-2026-05-13 dotted attribute keys, while phase-7's code emits
~36 spans with bare/underscore attribute keys. The §Data Flow Overview and §2
System Metrics sections (native OTLP transport — phase-7's migration) were already
correct and are left unchanged.

- §1.1: expand the span inventory to the full surface — add gRPC (grpc.<MethodName>),
  TxQ (txq.*), PathFind (pathfind.*), and the full consensus set (round/phase.open/
  establish/update_positions/check/mode_change/proposal.receive/validation.receive).
  Fix the phantom rpc.request -> rpc.http_request, add rpc.ws_upgrade. No grpc.request,
  no pathfind.rank, no ledger.acquire (the latter is added in phase-9, not yet present here).
- §1.2: convert every span-attribute key from dotted xrpl.<domain>.<field> to the
  bare/underscore form. The sole span-attr dotted exception is xrpl.ledger.hash on
  peer.validation.receive (shared constant); consensus.validation.send uses bare
  ledger_hash. Resource attrs xrpl.network.id/type stay dotted. Fix tx_count/tx_failed
  placement (on tx.apply, not ledger.build). Add attribute tables for the new families.
- §1.3: list the full set of spanmetrics dimension labels (bare keys, from the collector
  config) instead of the stale xrpl_rpc_command-style names.
- §4/§5: convert Tempo TraceQL and PromQL examples to the bare attribute/label forms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:28:45 +01:00
Pratik Mankawde
fd1c8c6060 fix(telemetry): resolve Phase 10 validation failures surfaced by first full CI run
The print-env CI fix let the Telemetry Stack Validation job build and run the
workload harness end-to-end for the first time. It reported 129/136 checks
passing; this commit fixes the 7 real failures plus a latent regression-gate bug.

Validation-suite fixes (verified against the CI run's actual emission + live node):
- expected_metrics.json: the beast::insight job-depth gauge is `xrpld_jobq_job_count`,
  not `xrpld_job_count` (the latter is a Phase 9 OTel counter). Reverted the prior
  rename. Removed the statsd_histograms block (`xrpld_rpc_time`/`xrpld_rpc_size`):
  these RPC timers do not emit under the WS workload (0 series in CI).
- expected_spans.json: `tx_status` is only set on suppressed/known-bad receives, so
  it is no longer a required attribute of every `tx.receive`. Marked `pathfind.compute`
  and `pathfind.discover` optional and the `pathfind.request -> pathfind.compute`
  hierarchy as skip — the self-to-self XRP probe returns before computing paths in a
  fresh cluster with no liquidity, so only `pathfind.request` fires.

Regression-gate bug (telemetry-validation.yml "Print regression summary"):
- `jq -e` exits non-zero when its filter result is boolean false — the normal case
  for a populated (non-placeholder) baseline — which was misreported as
  "Failed to parse baseline JSON" and failed the job. Dropped `-e` (kept `-r`) so a
  non-zero exit genuinely means malformed JSON.

The optional-span handling and regression comparison both worked correctly in the
CI run (txq.* / pathfind.update_all skipped-when-absent, 0 regressions detected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:26:30 +01:00
Pratik Mankawde
5c275ac476 fix(telemetry): set units, axis labels, and readable legends on close-time panels
Apply the dashboard guidelines to the five close-time panels:

- Axis labels (Title Case) on every panel: "Close Time (Ripple
  Seconds)" for the value panels, "Count / Milliseconds" for vote
  bins/resolution, "Rounds in Window" for the count panels.
- Human-readable legends with the dimension in brackets per the legend
  convention: "Raw Close Time [{{resource.service.instance.id}}]",
  "Effective Close Time [...]", "Resolution Direction
  [{{span.resolution_direction}}]", "{{span.close_time_vote_bins}} Vote
  Bins" — replacing the bare label tokens.
- Unit "none" (plain number): the close-time values are Ripple-epoch
  seconds and TraceQL metrics cannot offset them to a wall-clock unit,
  and the others are counts/ms on a shared axis.

Verified rendered values against raw spans: close times ~833,998,8xx,
resolution 10000 ms, vote bins 1/2/3 — all correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:21:59 +01:00
Pratik Mankawde
8dcd6f9b4a fix(telemetry): raise TraceQL metrics max_duration to 168h
TraceQL metrics queries default to a 3h max range
(query_frontend.metrics.max_duration), so a dashboard set to a longer
window failed with "range ... exceeds 3h0m0s". Add a query_frontend
block raising it to 168h, matching the search max_duration, so the
consensus close-time panels work at 6h/12h/24h ranges.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 19:19:55 +01:00
Pratik Mankawde
cfb2d87cab fix(telemetry): correct close-time legend label key to resource.service.instance.id
The Raw Proposals and Effective/Quantized panels rendered nameless
series: their legendFormat used {{service.instance.id}}, but the
TraceQL metrics query groups by resource.service.instance.id and Tempo
returns that full key as the series label. The legend token did not
match any label, so each series showed blank. Use the matching
{{resource.service.instance.id}} token.

Verified via the Grafana datasource proxy that all six close-time panels
now return correctly-labelled series.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:59:57 +01:00
Pratik Mankawde
283218896b fix(telemetry): use avg not quantile for close-time value panels
The Raw Proposals and Effective/Quantized panels showed wrong values
(e.g. 759M, 852M, even 0) against a true value of ~834M. Cause:
quantile_over_time bucketizes into an exponential histogram tuned for
duration distributions, so it cannot represent large absolute integers
(Ripple-epoch seconds) accurately.

Switch both panels to avg_over_time, which returns the correct value
(verified ~833,996,7xx matching the raw span attribute). Average is also
the semantically right aggregation here: close time is a single agreed
value per consensus round, not a latency distribution, so a median was
never meaningful.

Set the unit to none rather than seconds: the value is Ripple-epoch
seconds (Unix = value + 946684800) and TraceQL metrics cannot do the
offset arithmetic in-query, so a duration unit would misrender it.
Clarify in the description that the absolute level tracks wall-clock and
the useful signal is per-node spread / raw-vs-effective gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:48:31 +01:00
Pratik Mankawde
f7df1742fb fix(telemetry): drop bool close_time_correct filter from close-time panels
The five Close Time panels still rendered "No Data" after the metrics
rewrite. Root cause: each query carried
`span.close_time_correct=~"$close_time_correct"`, but close_time_correct
is a boolean span attribute and TraceQL's regex match (=~) against a
bool matches nothing in a metrics query, so every panel returned an
empty series set (HTTP 200, {"series":[]}).

Remove that filter clause. The panels do not break down by
close_time_correct, so dropping it restores data without losing any
dimension. The $node filter (a string attribute) is unaffected and
stays.

Verified via the Grafana datasource proxy that all six targets now
return series.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 18:40:58 +01:00
Pratik Mankawde
dc5bb4b35c feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters
Wire the two previously-registered-but-never-incremented validation
counters to ValidationTracker's gross lifetime tallies, exported as
monotonic ObservableCounters. New gross atomics count each ledger once at
first classification and are never adjusted on late repair, keeping the
_total counters monotonic and additive (agreements_total + missed_total ==
ledgers reconciled); the repair-aware windowed view stays on the existing
xrpld_validation_agreement gauge. The validator-health dashboard panels
that already query these names now render data instead of "No data".

Also de-stale 09-data-collection-reference.md: §5b documented flat metric
names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits
labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the
stale flat-name tables with a pointer to the canonical labeled section,
reconcile the contradictory headline counts, and correct xrpld_job_count
to its real exported name xrpld_jobq_job_count.

Adds two GTests asserting gross tallies stay frozen on repair while net
totals move, plus the additive invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:29:29 +01:00
Pratik Mankawde
cb9fce6890 fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI
The Phase 10 validation harness had drifted from the code's recording surface
and the telemetry-validation CI job was failing before it could build.

CI fix (telemetry-validation.yml):
- Replace nonexistent local action ./.github/actions/print-env with the remote
  XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this).
- Sync prepare-runner and upload-artifact action SHAs to the canonical workflow.

Recording-surface reconciliation (docker/telemetry/workload/):
- Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore
  form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id,
  ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...).
  Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared
  constant), while consensus.validation.send uses bare ledger_hash.
- Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq);
  ledger.build carries ledger_seq/close_* (not tx_count/tx_failed).
- Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop
  the never-emitted duration_ms; rebuild the parent-child map accordingly.
- Add the new spans the code emits: apply-pipeline stage spans
  (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq.*,
  consensus sub-spans (round/establish/update_positions/check/phase.open),
  ledger.acquire, grpc.*, pathfind.*. Conditional spans are marked optional so
  they are skipped (not failed) when the workload does not exercise them.
- validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix
  PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span
  attrs); add optional-span handling that skips missing optional spans while still
  validating attributes when present.
- expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics,
  xrpld_job_count, the 15 on-disk xrpld-* dashboard UIDs, and the real bare
  spanmetrics dimension labels.
- regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message.

Metrics pipeline fix:
- Switch node [insight] config from server=statsd/prefix=rippled to server=otel +
  /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh,
  xrpld-validator.cfg.template, benchmark.sh and the workload compose. The
  collector has no StatsD receiver, so system metrics only reach Prometheus over
  OTLP.

Synthetic load for new spans:
- Add ripple_path_find to the RPC load generator (drives pathfind.* spans).
- Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.*).

All facts verified against the *SpanNames.h headers and a live xrpld node +
collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type],
279 xrpld_ Prometheus metrics and zero rippled_).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 17:08:58 +01:00
Pratik Mankawde
a1c79b7aab Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 16:58:44 +01:00
Pratik Mankawde
0d1d1aa0e1 fix(telemetry): wire consensus close-time panels via TraceQL metrics
The five Close Time panels (Raw Proposals, Effective/Quantized, Vote
Bins & Resolution, Resolution Direction, Bin Distribution) rendered
empty: they used TraceQL `| select(attr)`, which returns a trace list
that a timeseries/barchart panel cannot plot.

Enable TraceQL metrics in Tempo and rewrite the panels to use it:

- tempo.yaml: add the local-blocks processor to the metrics generator so
  recent blocks are queryable via /api/metrics/query_range. Set
  filter_server_spans=false because the consensus spans are
  SPAN_KIND_INTERNAL (the default keeps only server spans, so attribute
  aggregations over internal spans returned nothing), and
  flush_to_storage=true with a traces_storage path so query_range can
  read the flushed blocks.
- consensus-health.json: replace each panel's select() with a metrics
  query — quantile_over_time on the integer close-time attributes,
  avg_over_time for vote bins / resolution, and count_over_time by the
  resolution_direction and vote-bin dimensions. Set the raw/effective
  panels' unit to seconds (the values are Ripple-epoch seconds, which
  dateTimeFromNow rendered with the wrong epoch).

Verified the query forms compile and return series against live internal
spans; the close-time series populate once the node reaches full sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:58:29 +01:00
Pratik Mankawde
c97f29c0dd Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 16:25:41 +01:00
Pratik Mankawde
1fa39fdef6 fix(telemetry): move job-queue gauge to top and add stable panel ids
The Current Job Latency gauge sat at the bottom of the Job Queue
Analysis dashboard; per the dashboard guideline gauges belong at the
top. Move it to the first row and reflow the remaining panels below it.
Also assign explicit sequential panel ids so deep links stay stable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:25:40 +01:00
Pratik Mankawde
fb9e6e5452 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 16:24:23 +01:00
Pratik Mankawde
18121a8cf4 fix(telemetry): widen only timeseries with right-side table legends
Correct the width rule from the previous layout commit. Full width
(w=24) is now applied ONLY to timeseries panels whose legend is a
right-side table, since those legends need the horizontal room. Panels
with default/bottom legends, pie charts, and the heatmap return to half
width. This narrows "Transaction Receive vs Suppressed" and "TxQ
Enqueue Rate by Transaction Type", which were wrongly widened.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:24:06 +01:00
Pratik Mankawde
ea56a3a0d4 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 16:02:53 +01:00
Pratik Mankawde
93c31573c5 refactor(telemetry): stable panel ids and topic-grouped layout
Make transaction-overview deep-links stable and improve readability:

- Assign explicit sequential panel ids (1..20) so viewPanel=panel-N
  URLs stay pinned to the same chart across edits. Previously ids were
  unset and Grafana auto-assigned them by array position, so any
  reorder silently repointed bookmarks.
- Move the single-value stat panel (Transaction Apply Failed Rate) to
  the top row.
- Lay out in three topic sections (Processing, Apply Pipeline, Queue).
  Within each, timeseries with a breakdown dimension (tx_type, stage,
  ter_result, suppressed) take full width so their right-side table
  legends are readable; single-series panels, pie charts, and the
  heatmap stay half-width and pair up.

All six template variables already default to All (includeAll + multi);
no change needed there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:02:47 +01:00
Pratik Mankawde
750e4dc5c6 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 15:07:08 +01:00
Pratik Mankawde
ceea9e49dd fix(telemetry): standardize transaction-overview legends and cap tooltips
Apply the dashboard legend convention across all panels now that the
P50 series have been removed (P95-only):

- Drop the redundant "P95 " / "P50 " prefix; the panel title already
  states the percentile.
- Put every filter/dimension value inside [] comma-separated, ending
  with exported_instance, e.g. "AMMDeposit [Preclaim, xrpld-mainnet]".
- Add exported_instance to the by() clause and legend of the three
  panels that filtered on $node but omitted it (Transaction Rate by
  Type, Transaction Results by Type, TxQ Accept Status), so per-node
  series are produced.
- Title-case the stage value for display via label_replace in the four
  apply-pipeline panels; the span attribute stays lowercase
  (preflight/preclaim/apply) since legendFormat cannot change case.
- Cap tooltip maxHeight at 500 on every panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 15:06:52 +01:00
Pratik Mankawde
bcc5ab66c4 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 13:40:59 +01:00
Pratik Mankawde
db4d70bbc2 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-06-05 13:40:19 +01:00
Pratik Mankawde
b8dd848899 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 13:40:18 +01:00
Pratik Mankawde
b321792a14 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-06-05 13:40:18 +01:00
Pratik Mankawde
72642b5dc6 feat(telemetry): add tx apply latency panel by type and stage
The existing apply-pipeline panels show latency by stage (all types
combined) or by type (single span). Neither answers "for a given
transaction type, which stage dominates its latency". Add a p95 panel
grouped by both tx_type and stage, filterable via the $tx_type and
$stage variables. Both dimensions already exist in spanmetrics, so no
collector change is needed. Reflow the section so the full-width
failure panel sits below the new full-width panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:39:59 +01:00
Pratik Mankawde
db5b93e2c4 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-06-05 12:50:09 +01:00
Pratik Mankawde
f37a4a1022 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-06-05 12:49:38 +01:00
Pratik Mankawde
8f3974c094 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-06-05 12:48:40 +01:00
Pratik Mankawde
283fbaa54f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/09-data-collection-reference.md
2026-06-05 12:48:31 +01:00
Pratik Mankawde
3167a49f41 feat(telemetry): derive per-stage tx metrics from apply-pipeline spans
Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally — only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:42:53 +01:00
Pratik Mankawde
759d3506b2 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-06-05 11:58:59 +01:00
Pratik Mankawde
021300538a Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-06-05 11:58:49 +01:00
Pratik Mankawde
a71d6635e6 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-06-05 11:58:43 +01:00
Pratik Mankawde
3df7e9cba6 code review changes and wire unused attributes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:42:33 +01:00
Pratik Mankawde
6a16dfa823 clang-tidy and formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-05 11:25:29 +01:00
Pratik Mankawde
6428c9f13c feat(telemetry): add preflight/preclaim stage spans and stage attribute
The tx.transactor span covered only the apply stage; preflight and
preclaim had no telemetry, so a transaction that hard-failed those
stages produced no apply-pipeline span and per-stage latency/failure
was invisible.

Add tx.preflight and tx.preclaim spans in applySteps.cpp via a
makeStageSpan() helper using SpanGuard::hashSpan, so all three stages
share a deterministic trace_id derived from txID[0:16] even though they
run sequentially and often cross-thread. Each span carries stage,
tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before
the public wrappers map them. The type lookup is guarded behind the
span-active check so it costs nothing when tracing is off.

Add a stage="apply" attribute to the tx.transactor span and move its
three hardcoded attribute strings to a new library-safe header
include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side
TxSpanNames.h strings so the collector spanmetrics connector aggregates
both span sets under one dimension set.

A constants-contract test pins the span-name, attribute-key, and
stage-value strings; span content stays covered by the docker
integration test, as the rest of the telemetry suite is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 11:11:55 +01:00
Pratik Mankawde
d7e847a53b removed p50 renders from all dashboards
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 18:11:23 +01:00
Pratik Mankawde
c3bdcb4291 clang-tidy include
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 18:02:47 +01:00
Pratik Mankawde
478b58395b loop levelization
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-04 17:54:52 +01:00