rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-06-06 18:26:51 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	3bb4ea84a4	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 21:37:42 +01:00
Pratik Mankawde	3c1189d6f8	fix(telemetry): clarify confusing consensus close-time panels Several close-time panels showed raw codes/booleans and unreadable time-bar-charts. Relabel and re-visualize them so each reads clearly: - Close-Time Proposal Spread (was "Close Time Bin Distribution"): corrected a wrong description (it claimed per-node proposals; the metric is the number of DISTINCT close-time positions peers proposed per round, rawCloseTimes.peers.size()). Converted the time-barchart (unreadable timestamp axis) to a horizontal bar gauge summing rounds per distinct-position count. - Consensus Outcome Distribution: renameByRegex maps the raw consensus_state codes to human labels (yes->Agreed, moved_on->Moved On (partial), expired->Expired (timeout), no->No Consensus); value mappings alone do not relabel pie legends. - Close-Time Agreement Rate (was "Close Time Agreement"): legend relabelled from "close_time_correct=true/false" to Agreed/Disagreed. - Close-Time Resolution Change (was "Close Time Resolution Direction"): converted to a bar gauge; increased/decreased/unchanged relabelled to Coarser (more disagreement) / Finer (better agreement) / Steady. All four verified by rendering the panels to PNG via the Grafana image renderer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:37:11 +01:00
Pratik Mankawde	b79991b190	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 21:37:06 +01:00
Pratik Mankawde	4c2125c07e	clang-tidy issue fixes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-05 21:35:19 +01:00
Pratik Mankawde	4ea4b99f15	refactor(telemetry): remove non-useful absolute close-time value panels "Close Time: Raw Proposals" and "Close Time: Effective / Quantized" plotted the absolute close time (Ripple-epoch seconds), which is an ever-rising line with no analytical value. There is no clean way to make them useful: the values are Ripple-epoch seconds (not Unix ms, so date units misrender), TraceQL metrics cannot do the epoch offset or an inter-ledger-gap subtraction in-query, and Grafana calculateField transforms break on these grouped Tempo metric frames (verified by render: "No data"). Remove both. The useful consensus-timing signals are already covered: time-to-consensus (round_time_ms), rounds per ledger (establish_count), previous round time, and the close-time vote-bins / resolution-direction / bin-distribution panels remain. Re-tile and re-id (22 panels). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:24:36 +01:00
Pratik Mankawde	d83cb0bdb3	fix(telemetry): refresh regression baseline + widen bucket-noise thresholds With validation now passing 133/133, the only remaining job failure was the regression gate flagging 4 timing "regressions". Two compounding causes: 1. Stale baseline: the committed baseline was captured (2026-04-24) under the old, lighter workload — before the new txq-burst phase (60 TPS) existed. The heavier per-ledger work genuinely raises ledger.build / tx.apply / ledger.validate / acceptLedger timings, so every run regressed against it. Refreshed the baseline from the latest CI-measured timings (same workload). 2. Histogram quantization: SpanMetrics latency buckets are [1,5,10,25,...]ms, so a sub-millisecond quantile near a low-end boundary can jump a full bucket (1ms->5ms) between runs with no real change. The old absolute bounds (2-5ms) were narrower than one bucket width, so that jitter tripped the gate. Widened the default span bounds to 10-15ms (~2 low-end buckets) and pct to 50%, and the job_queue running bound to 20ms, to tolerate quantization noise while still catching genuine multi-bucket regressions. The consensus.* overrides (tight pct, large abs) are unchanged. The refreshed baseline also picks up real rpc.ws_message timings (previously null under the phantom rpc.request key). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:58:07 +01:00
Pratik Mankawde	758a3fec29	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 19:42:53 +01:00
Pratik Mankawde	2ee4d2ff2d	fix(telemetry): show consensus rounds as integer distribution, widen table panels Number of rounds is an integer, but avg_over_time(establish_count) produced fractional values (2.1). Switch the Rounds panel to count_over_time() by (span.establish_count): one integer series per round count (1/2/3...), showing how many ledgers needed that many establish rounds — the meaningful distribution, inherently integer (decimals=0). Apply dashboard rule 9: panels with a right-side table legend take full width. Widen "Consensus Rounds per Ledger" and "Consensus Outcome Distribution" to w=24 and re-tile the dashboard. Verified via the Grafana proxy: rounds=2 dominates (~11-12 ledgers), rounds=3 occasional. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:40:41 +01:00
Pratik Mankawde	a23d83f393	docs(telemetry): add ledger.acquire to 09-doc + fix peer-quality dashboard metric prefix Phase 9 introduces the ledger.acquire span (InboundLedger fetch) that phases 7-8 do not have, so the forward-merged 09-data-collection-reference inventory is extended here: - §1.1: add ledger.acquire to the Ledger span table. - §1.2: add its attributes (acquire_reason, timeouts, peer_count, outcome) and note it also sets ledger_seq; bump the span count. Also fix two stale StatsD metric references in the Peer Quality dashboard (xrpld-peer-quality.json): rippled_Peer_Finder_Active_{Inbound,Outbound}_Peers -> xrpld_Peer_Finder_* to match the xrpld_ metric prefix the rest of the stack uses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:36:34 +01:00
Pratik Mankawde	24f60ab1d4	feat(telemetry): make consensus panels show real consensus timing and rounds The consensus duration panels plotted span wall-clock (traces_span_metrics_duration_milliseconds), which is ~3-8 ms of instrumentation overhead, not the real consensus time (~3000 ms). And the close-time value panels plotted an ever-rising absolute epoch line. Rework them to answer the actual operational questions, all from attributes that already exist on the consensus spans: - Time to Reach Consensus (p50/p95) and Average Time to Reach Consensus: round_time_ms on consensus.accept — the wall-clock to agree a ledger. - Consensus Rounds per Ledger (Establish Count): avg and max of establish_count on consensus.establish — how many proposal rounds it took to converge (1 = first proposal). - Previous Round Time per Ledger: previous_round_time_ms on consensus.round. Reorder the dashboard into an investigation flow: health/throughput -> time-to-consensus and rounds -> ledger close/apply timing -> close-time detail -> failures/mode/mismatch. Assign stable sequential panel ids. Verified each query returns data via the Grafana datasource proxy (p95 ~4096 ms, avg ~2825 ms, rounds ~2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:35:34 +01:00
Pratik Mankawde	22b533ac51	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-05 19:33:13 +01:00
Pratik Mankawde	8046a30e9b	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 19:28:58 +01:00
Pratik Mankawde	4a8aa9e514	docs(telemetry): reconcile 09-data-collection-reference span/attribute inventory The §1 span and attribute inventory had regressed to an older 16-span snapshot that uses the pre-2026-05-13 dotted attribute keys, while phase-7's code emits ~36 spans with bare/underscore attribute keys. The §Data Flow Overview and §2 System Metrics sections (native OTLP transport — phase-7's migration) were already correct and are left unchanged. - §1.1: expand the span inventory to the full surface — add gRPC (grpc.<MethodName>), TxQ (txq.*), PathFind (pathfind.*), and the full consensus set (round/phase.open/establish/update_positions/check/mode_change/proposal.receive/validation.receive). Fix the phantom rpc.request -> rpc.http_request, add rpc.ws_upgrade. No grpc.request, no pathfind.rank, no ledger.acquire (the latter is added in phase-9, not yet present here). - §1.2: convert every span-attribute key from dotted xrpl.<domain>.<field> to the bare/underscore form. The sole span-attr dotted exception is xrpl.ledger.hash on peer.validation.receive (shared constant); consensus.validation.send uses bare ledger_hash. Resource attrs xrpl.network.id/type stay dotted. Fix tx_count/tx_failed placement (on tx.apply, not ledger.build). Add attribute tables for the new families. - §1.3: list the full set of spanmetrics dimension labels (bare keys, from the collector config) instead of the stale xrpl_rpc_command-style names. - §4/§5: convert Tempo TraceQL and PromQL examples to the bare attribute/label forms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:28:45 +01:00
Pratik Mankawde	fd1c8c6060	fix(telemetry): resolve Phase 10 validation failures surfaced by first full CI run The print-env CI fix let the Telemetry Stack Validation job build and run the workload harness end-to-end for the first time. It reported 129/136 checks passing; this commit fixes the 7 real failures plus a latent regression-gate bug. Validation-suite fixes (verified against the CI run's actual emission + live node): - expected_metrics.json: the beast::insight job-depth gauge is `xrpld_jobq_job_count`, not `xrpld_job_count` (the latter is a Phase 9 OTel counter). Reverted the prior rename. Removed the statsd_histograms block (`xrpld_rpc_time`/`xrpld_rpc_size`): these RPC timers do not emit under the WS workload (0 series in CI). - expected_spans.json: `tx_status` is only set on suppressed/known-bad receives, so it is no longer a required attribute of every `tx.receive`. Marked `pathfind.compute` and `pathfind.discover` optional and the `pathfind.request -> pathfind.compute` hierarchy as skip — the self-to-self XRP probe returns before computing paths in a fresh cluster with no liquidity, so only `pathfind.request` fires. Regression-gate bug (telemetry-validation.yml "Print regression summary"): - `jq -e` exits non-zero when its filter result is boolean false — the normal case for a populated (non-placeholder) baseline — which was misreported as "Failed to parse baseline JSON" and failed the job. Dropped `-e` (kept `-r`) so a non-zero exit genuinely means malformed JSON. The optional-span handling and regression comparison both worked correctly in the CI run (txq.* / pathfind.update_all skipped-when-absent, 0 regressions detected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 19:26:30 +01:00
Pratik Mankawde	5c275ac476	fix(telemetry): set units, axis labels, and readable legends on close-time panels Apply the dashboard guidelines to the five close-time panels: - Axis labels (Title Case) on every panel: "Close Time (Ripple Seconds)" for the value panels, "Count / Milliseconds" for vote bins/resolution, "Rounds in Window" for the count panels. - Human-readable legends with the dimension in brackets per the legend convention: "Raw Close Time [{{resource.service.instance.id}}]", "Effective Close Time [...]", "Resolution Direction [{{span.resolution_direction}}]", "{{span.close_time_vote_bins}} Vote Bins" — replacing the bare label tokens. - Unit "none" (plain number): the close-time values are Ripple-epoch seconds and TraceQL metrics cannot offset them to a wall-clock unit, and the others are counts/ms on a shared axis. Verified rendered values against raw spans: close times ~833,998,8xx, resolution 10000 ms, vote bins 1/2/3 — all correct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:21:59 +01:00
Pratik Mankawde	8dcd6f9b4a	fix(telemetry): raise TraceQL metrics max_duration to 168h TraceQL metrics queries default to a 3h max range (query_frontend.metrics.max_duration), so a dashboard set to a longer window failed with "range ... exceeds 3h0m0s". Add a query_frontend block raising it to 168h, matching the search max_duration, so the consensus close-time panels work at 6h/12h/24h ranges. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:19:55 +01:00
Pratik Mankawde	cfb2d87cab	fix(telemetry): correct close-time legend label key to resource.service.instance.id The Raw Proposals and Effective/Quantized panels rendered nameless series: their legendFormat used {{service.instance.id}}, but the TraceQL metrics query groups by resource.service.instance.id and Tempo returns that full key as the series label. The legend token did not match any label, so each series showed blank. Use the matching {{resource.service.instance.id}} token. Verified via the Grafana datasource proxy that all six close-time panels now return correctly-labelled series. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:59:57 +01:00
Pratik Mankawde	283218896b	fix(telemetry): use avg not quantile for close-time value panels The Raw Proposals and Effective/Quantized panels showed wrong values (e.g. 759M, 852M, even 0) against a true value of ~834M. Cause: quantile_over_time bucketizes into an exponential histogram tuned for duration distributions, so it cannot represent large absolute integers (Ripple-epoch seconds) accurately. Switch both panels to avg_over_time, which returns the correct value (verified ~833,996,7xx matching the raw span attribute). Average is also the semantically right aggregation here: close time is a single agreed value per consensus round, not a latency distribution, so a median was never meaningful. Set the unit to none rather than seconds: the value is Ripple-epoch seconds (Unix = value + 946684800) and TraceQL metrics cannot do the offset arithmetic in-query, so a duration unit would misrender it. Clarify in the description that the absolute level tracks wall-clock and the useful signal is per-node spread / raw-vs-effective gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:48:31 +01:00
Pratik Mankawde	f7df1742fb	fix(telemetry): drop bool close_time_correct filter from close-time panels The five Close Time panels still rendered "No Data" after the metrics rewrite. Root cause: each query carried `span.close_time_correct=~"$close_time_correct"`, but close_time_correct is a boolean span attribute and TraceQL's regex match (=~) against a bool matches nothing in a metrics query, so every panel returned an empty series set (HTTP 200, {"series":[]}). Remove that filter clause. The panels do not break down by close_time_correct, so dropping it restores data without losing any dimension. The $node filter (a string attribute) is unaffected and stays. Verified via the Grafana datasource proxy that all six targets now return series. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:40:58 +01:00
Pratik Mankawde	dc5bb4b35c	feat(telemetry): emit xrpld_validation_{agreements,missed}_total counters Wire the two previously-registered-but-never-incremented validation counters to ValidationTracker's gross lifetime tallies, exported as monotonic ObservableCounters. New gross atomics count each ledger once at first classification and are never adjusted on late repair, keeping the _total counters monotonic and additive (agreements_total + missed_total == ledgers reconciled); the repair-aware windowed view stays on the existing xrpld_validation_agreement gauge. The validator-health dashboard panels that already query these names now render data instead of "No data". Also de-stale 09-data-collection-reference.md: §5b documented flat metric names (xrpld_cache_SLE_hit_rate, ...) that the code never emits — it emits labeled gauges (xrpld_cache_metrics{metric="SLE_hit_rate"}). Replace the stale flat-name tables with a pointer to the canonical labeled section, reconcile the contradictory headline counts, and correct xrpld_job_count to its real exported name xrpld_jobq_job_count. Adds two GTests asserting gross tallies stay frozen on repair while net totals move, plus the additive invariant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 18:29:29 +01:00
Pratik Mankawde	cb9fce6890	fix(telemetry): align Phase 10 workload harness with current OTel recording surface + fix CI The Phase 10 validation harness had drifted from the code's recording surface and the telemetry-validation CI job was failing before it could build. CI fix (telemetry-validation.yml): - Replace nonexistent local action ./.github/actions/print-env with the remote XRPLF/actions/print-build-env (the build-xrpld job failed in 56s on this). - Sync prepare-runner and upload-artifact action SHAs to the canonical workflow. Recording-surface reconciliation (docker/telemetry/workload/): - Migrate span attributes from dotted xrpl.<domain>.<field> to the bare/underscore form introduced by the 2026-05-13 span-attr naming redesign (tx_hash, peer_id, ledger_seq, consensus_mode, consensus_round, full_validation, quorum, ...). Dotted xrpl.ledger.hash is retained only on peer.validation.receive (shared constant), while consensus.validation.send uses bare ledger_hash. - Fix attribute placement: tx.apply carries tx_count/tx_failed (not ledger_seq); ledger.build carries ledger_seq/close_* (not tx_count/tx_failed). - Replace the phantom rpc.request span with the real WS root rpc.ws_message; drop the never-emitted duration_ms; rebuild the parent-child map accordingly. - Add the new spans the code emits: apply-pipeline stage spans (tx.preflight/preclaim/transactor with stage/tx_type/ter_result), txq.*, consensus sub-spans (round/establish/update_positions/check/phase.open), ledger.acquire, grpc.*, pathfind.*. Conditional spans are marked optional so they are skipped (not failed) when the workload does not exercise them. - validate_telemetry.py: service.name and Loki job label rippled -> xrpld; fix PARITY_SPAN_ATTRS (rename the 4 real attrs, drop the 3 that are metrics not span attrs); add optional-span handling that skips missing optional spans while still validating attributes when present. - expected_metrics.json: rippled_ -> xrpld_ on all beast::insight/overlay metrics, xrpld_job_count, the 15 on-disk xrpld- dashboard UIDs, and the real bare spanmetrics dimension labels. - regression-metrics.json + baseline-timings.json: rpc.request -> rpc.ws_message. Metrics pipeline fix: - Switch node [insight] config from server=statsd/prefix=rippled to server=otel + /v1/metrics endpoint + prefix=xrpld across run-full-validation.sh, xrpld-validator.cfg.template, benchmark.sh and the workload compose. The collector has no StatsD receiver, so system metrics only reach Prometheus over OTLP. Synthetic load for new spans: - Add ripple_path_find to the RPC load generator (drives pathfind.* spans). - Add a high-TPS txq-burst workload phase to force fee escalation (drives txq.*). All facts verified against the SpanNames.h headers and a live xrpld node + collector (Tempo service.name=xrpld, tx.preflight attrs [stage,ter_result,tx_type], 279 xrpld_ Prometheus metrics and zero rippled_). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 17:08:58 +01:00
Pratik Mankawde	a1c79b7aab	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 16:58:44 +01:00
Pratik Mankawde	0d1d1aa0e1	fix(telemetry): wire consensus close-time panels via TraceQL metrics The five Close Time panels (Raw Proposals, Effective/Quantized, Vote Bins & Resolution, Resolution Direction, Bin Distribution) rendered empty: they used TraceQL `\| select(attr)`, which returns a trace list that a timeseries/barchart panel cannot plot. Enable TraceQL metrics in Tempo and rewrite the panels to use it: - tempo.yaml: add the local-blocks processor to the metrics generator so recent blocks are queryable via /api/metrics/query_range. Set filter_server_spans=false because the consensus spans are SPAN_KIND_INTERNAL (the default keeps only server spans, so attribute aggregations over internal spans returned nothing), and flush_to_storage=true with a traces_storage path so query_range can read the flushed blocks. - consensus-health.json: replace each panel's select() with a metrics query — quantile_over_time on the integer close-time attributes, avg_over_time for vote bins / resolution, and count_over_time by the resolution_direction and vote-bin dimensions. Set the raw/effective panels' unit to seconds (the values are Ripple-epoch seconds, which dateTimeFromNow rendered with the wrong epoch). Verified the query forms compile and return series against live internal spans; the close-time series populate once the node reaches full sync. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:58:29 +01:00
Pratik Mankawde	c97f29c0dd	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 16:25:41 +01:00
Pratik Mankawde	1fa39fdef6	fix(telemetry): move job-queue gauge to top and add stable panel ids The Current Job Latency gauge sat at the bottom of the Job Queue Analysis dashboard; per the dashboard guideline gauges belong at the top. Move it to the first row and reflow the remaining panels below it. Also assign explicit sequential panel ids so deep links stay stable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:25:40 +01:00
Pratik Mankawde	fb9e6e5452	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 16:24:23 +01:00
Pratik Mankawde	18121a8cf4	fix(telemetry): widen only timeseries with right-side table legends Correct the width rule from the previous layout commit. Full width (w=24) is now applied ONLY to timeseries panels whose legend is a right-side table, since those legends need the horizontal room. Panels with default/bottom legends, pie charts, and the heatmap return to half width. This narrows "Transaction Receive vs Suppressed" and "TxQ Enqueue Rate by Transaction Type", which were wrongly widened. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:24:06 +01:00
Pratik Mankawde	ea56a3a0d4	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 16:02:53 +01:00
Pratik Mankawde	93c31573c5	refactor(telemetry): stable panel ids and topic-grouped layout Make transaction-overview deep-links stable and improve readability: - Assign explicit sequential panel ids (1..20) so viewPanel=panel-N URLs stay pinned to the same chart across edits. Previously ids were unset and Grafana auto-assigned them by array position, so any reorder silently repointed bookmarks. - Move the single-value stat panel (Transaction Apply Failed Rate) to the top row. - Lay out in three topic sections (Processing, Apply Pipeline, Queue). Within each, timeseries with a breakdown dimension (tx_type, stage, ter_result, suppressed) take full width so their right-side table legends are readable; single-series panels, pie charts, and the heatmap stay half-width and pair up. All six template variables already default to All (includeAll + multi); no change needed there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:02:47 +01:00
Pratik Mankawde	750e4dc5c6	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 15:07:08 +01:00
Pratik Mankawde	ceea9e49dd	fix(telemetry): standardize transaction-overview legends and cap tooltips Apply the dashboard legend convention across all panels now that the P50 series have been removed (P95-only): - Drop the redundant "P95 " / "P50 " prefix; the panel title already states the percentile. - Put every filter/dimension value inside [] comma-separated, ending with exported_instance, e.g. "AMMDeposit [Preclaim, xrpld-mainnet]". - Add exported_instance to the by() clause and legend of the three panels that filtered on $node but omitted it (Transaction Rate by Type, Transaction Results by Type, TxQ Accept Status), so per-node series are produced. - Title-case the stage value for display via label_replace in the four apply-pipeline panels; the span attribute stays lowercase (preflight/preclaim/apply) since legendFormat cannot change case. - Cap tooltip maxHeight at 500 on every panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 15:06:52 +01:00
Pratik Mankawde	bcc5ab66c4	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 13:40:59 +01:00
Pratik Mankawde	db4d70bbc2	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-06-05 13:40:19 +01:00
Pratik Mankawde	b8dd848899	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 13:40:18 +01:00
Pratik Mankawde	b321792a14	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-06-05 13:40:18 +01:00
Pratik Mankawde	72642b5dc6	feat(telemetry): add tx apply latency panel by type and stage The existing apply-pipeline panels show latency by stage (all types combined) or by type (single span). Neither answers "for a given transaction type, which stage dominates its latency". Add a p95 panel grouped by both tx_type and stage, filterable via the $tx_type and $stage variables. Both dimensions already exist in spanmetrics, so no collector change is needed. Reflow the section so the full-width failure panel sits below the new full-width panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:39:59 +01:00
Pratik Mankawde	db5b93e2c4	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-05 12:50:09 +01:00
Pratik Mankawde	f37a4a1022	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill # Conflicts: # src/xrpld/app/misc/detail/TxQ.cpp	2026-06-05 12:49:38 +01:00
Pratik Mankawde	8f3974c094	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-06-05 12:48:40 +01:00
Pratik Mankawde	283fbaa54f	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/09-data-collection-reference.md	2026-06-05 12:48:31 +01:00
Pratik Mankawde	3167a49f41	feat(telemetry): derive per-stage tx metrics from apply-pipeline spans Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim, tx.transactor) added on phase-3 through the observability stack so the spanmetrics connector produces per-stage RED metrics without any native instruments. - collector: add the `stage` dimension to the spanmetrics connector so the three stages split into separate metric series (3 bounded values). - dashboard: add a "Tx Apply Pipeline" section to transaction-overview with rate, p95 latency, and failure-rate panels grouped by stage, plus a `stage` template variable. Panels follow the existing config (node filter, exported_instance legends, Title Case, axis labels). - The failure panel filters ter_result != tesSUCCESS rather than span status, because a failing ter code completes the span normally — only thrown exceptions set an error status. This matches the existing "Transaction Results by Type" panel convention. - docs: document the spans, attributes, and stage dimension in the data collection reference and runbook, including the sampling caveat that span-derived metrics inherit tracer head-sampling and undercount at sampling_ratio < 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:42:53 +01:00
Pratik Mankawde	759d3506b2	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-06-05 11:58:59 +01:00
Pratik Mankawde	021300538a	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-06-05 11:58:49 +01:00
Pratik Mankawde	a71d6635e6	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-06-05 11:58:43 +01:00
Pratik Mankawde	3df7e9cba6	code review changes and wire unused attributes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-05 11:42:33 +01:00
Pratik Mankawde	6a16dfa823	clang-tidy and formatting changes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-05 11:25:29 +01:00
Pratik Mankawde	6428c9f13c	feat(telemetry): add preflight/preclaim stage spans and stage attribute The tx.transactor span covered only the apply stage; preflight and preclaim had no telemetry, so a transaction that hard-failed those stages produced no apply-pipeline span and per-stage latency/failure was invisible. Add tx.preflight and tx.preclaim spans in applySteps.cpp via a makeStageSpan() helper using SpanGuard::hashSpan, so all three stages share a deterministic trace_id derived from txID[0:16] even though they run sequentially and often cross-thread. Each span carries stage, tx_type, and ter_result; exceptions are recorded as tefEXCEPTION before the public wrappers map them. The type lookup is guarded behind the span-active check so it costs nothing when tracing is off. Add a stage="apply" attribute to the tx.transactor span and move its three hardcoded attribute strings to a new library-safe header include/xrpl/tx/detail/TxApplySpanNames.h, which mirrors the daemon-side TxSpanNames.h strings so the collector spanmetrics connector aggregates both span sets under one dimension set. A constants-contract test pins the span-name, attribute-key, and stage-value strings; span content stays covered by the docker integration test, as the rest of the telemetry suite is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:11:55 +01:00
Pratik Mankawde	d7e847a53b	removed p50 renders from all dashboards Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 18:11:23 +01:00
Pratik Mankawde	c3bdcb4291	clang-tidy include Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 18:02:47 +01:00
Pratik Mankawde	478b58395b	loop levelization Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-04 17:54:52 +01:00
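
Note on commit 283218896b: the message states that close-time values are Ripple-epoch seconds and that Unix = value + 946684800, an offset the TraceQL metrics query cannot apply. A minimal sketch of doing that conversion outside the query follows. The function name and the sample value are hypothetical; only the offset comes from the commit message.

```python
from datetime import datetime, timezone

# Offset quoted in the commit message: Unix seconds = Ripple seconds + 946684800.
RIPPLE_EPOCH_OFFSET = 946_684_800


def ripple_to_utc(ripple_seconds: int) -> datetime:
    """Convert Ripple-epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ripple_seconds + RIPPLE_EPOCH_OFFSET, tz=timezone.utc)


# Illustrative value in the ~834M range the commit messages mention (not a real span).
sample_close_time = 833_996_700
print(ripple_to_utc(sample_close_time).isoformat())
```

A value of 0 maps to 2000-01-01T00:00:00 UTC, which is why a panel that treats the raw number as Unix time renders the wrong date.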

14771 Commits
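
Note on commit d83cb0bdb3: the message describes how latency buckets of [1,5,10,25,...]ms let a sub-millisecond quantile jump a full bucket (1ms->5ms) between runs with no real change. The sketch below illustrates that effect under simplifying assumptions. It reports only the bucket upper bound (a real histogram quantile interpolates inside the bucket), the bucket list is truncated to the four bounds the message names, and the helper names and sample latencies are hypothetical.

```python
import math
from bisect import bisect_left

# First four bucket bounds named in the commit message; the rest are elided there.
BUCKET_BOUNDS_MS = [1, 5, 10, 25]


def bucket_upper_bound(latency_ms: float) -> float:
    """Return the upper bound of the bucket a latency falls into (le semantics)."""
    idx = bisect_left(BUCKET_BOUNDS_MS, latency_ms)
    return BUCKET_BOUNDS_MS[idx] if idx < len(BUCKET_BOUNDS_MS) else math.inf


def trips_gate(baseline_ms: float, current_ms: float, abs_bound_ms: float) -> bool:
    """True if the bucketized difference exceeds the absolute bound."""
    return bucket_upper_bound(current_ms) - bucket_upper_bound(baseline_ms) > abs_bound_ms


# A 0.2 ms real change that crosses the 1 ms boundary looks like a 4 ms jump.
baseline, current = 0.9, 1.1
print(trips_gate(baseline, current, abs_bound_ms=2))   # narrow bound from the old 2-5ms range
print(trips_gate(baseline, current, abs_bound_ms=10))  # widened bound
```

With the narrow bound the boundary crossing is flagged even though the underlying latency barely moved; with the widened bound it is tolerated, while a jump across several buckets would still exceed it.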