rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-29 18:10:34 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	e804ec83aa	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-06-01 17:03:52 +01:00
Pratik Mankawde	615d339f84	fix(docs): apply rename scripts — prefix=rippled to prefix=xrpld The check-rename CI job requires all rename scripts to have been run. The telemetry config files had 'prefix=rippled' which should be 'xrpld'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-01 17:03:27 +01:00
Pratik Mankawde	98fc939851	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-01 15:01:19 +01:00
Pratik Mankawde	4d6ddb5f1f	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-06-01 14:56:09 +01:00
Pratik Mankawde	ba7e1f98e4	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-29 18:24:43 +01:00
Pratik Mankawde	088848e7ab	formatting updates Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-29 18:20:08 +01:00
Pratik Mankawde	e7dea147cd	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-29 18:18:36 +01:00
Pratik Mankawde	8d730b8b9a	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-29 18:16:35 +01:00
Pratik Mankawde	7ac5343119	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-28 16:09:41 +01:00
Pratik Mankawde	c6c019ed8b	addressed code review comments Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-28 15:55:25 +01:00
Pratik Mankawde	4bd1176df5	Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-28 11:38:05 +01:00
Pratik Mankawde	9498b2865f	fix(telemetry): address PR #6424 review comments - Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry surface (constants in SpanNames.h, two filters in tempo.yaml). Operators read the same data via server_info / server_state RPC; OTel SDK 1.18.0 cannot refresh resource attrs at runtime so resource-level emission was not viable either. - Namespace all pathfind span attributes under pathfind_* (underscore form per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention xrpl.pathfind.ledger_index -> pathfind_ledger_index. - Wire pathfind_source_account / pathfind_dest_account on pathfind.request in doPathFind / doRipplePathFind handlers (only when present + string). - Collapse per-asset pathfind.discover / pathfind.rank spans into one pathfind.discover hoisted around the per-source-asset loop in PathRequest::findPaths. Span count goes from 2N to 1 per RPC call; per-asset breakdown traded for bounded storage and cardinality. Trade-off documented inline. - Fix pathfind_num_paths semantics: now sums getBestPaths().size() across the loop (paths actually returned) instead of the maxPaths input cap. - PathRequestManager::updateAll: move span creation after the locked requests_ snapshot, early-return when no active subscriptions exist (avoids empty span on every ledger close), set pathfind_num_requests = requests.size(). - Update Phase2_taskList.md and 02-design-decisions.md to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 11:27:29 +01:00
Pratik Mankawde	ce04dac32e	consensus total per round time panel added Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-27 14:54:36 +01:00
Pratik Mankawde	0330d037ef	connection to mainnet added Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-27 14:53:29 +01:00
Pratik Mankawde	28befc672c	minor corrections Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-15 20:11:54 +01:00
Pratik Mankawde	fb8f792973	merge: pratik/otel-phase9-metric-gap-fill (Loki + filelog fixes) into pratik/otel-phase10-workload-validation	2026-05-14 17:30:32 +01:00
Pratik Mankawde	318400327c	merge: pratik/otel-phase8-log-correlation (Loki + filelog fixes) into pratik/otel-phase9-metric-gap-fill	2026-05-14 17:30:29 +01:00
Pratik Mankawde	0e25103fdb	fix(telemetry): make Loki ingestion and filelog parsing work end-to-end Three interrelated fixes in otel-collector-config.yaml; without them the Phase 8 log-trace correlation pipeline is silently broken. 1. `resource/logs` processor now upserts `job: xrpld` alongside `service.name: xrpld`. Loki 3.x OTLP ingestion renames `service.name` to the label `service_name`, so the runbook / integration-test queries (`{job="xrpld"} \|= "trace_id="`) returned empty. Upserting the `job` resource attribute at the collector lets the canonical Loki label flow through unchanged. 2. `filelog` regex makes the `partition:` capture non-capturing-optional. `Logs::format()` omits the `partition:` prefix when partition is empty (common for framework-level log lines); the old regex required it and silently dropped those records. 3. Timestamp parser now matches the real log format. `Logs::format()` writes microsecond-precision timestamps like `2026-04-15 10:30:45.123456 UTC`. The layout was `%Y-%b-%d %H:%M:%S` — missing fractional seconds and timezone — which failed strptime and dropped timestamps. New layout is `%Y-%b-%d %H:%M:%S.%f` with `location: UTC`. Also adds a block-comment documenting the real log format so the next person to touch this doesn't re-introduce the same gaps.	2026-05-14 17:29:49 +01:00
Pratik Mankawde	ac57a91b77	merge: phase-9 (dashboard UID + line-number cleanup, detach callbacks) into phase-10 # Conflicts: # docker/telemetry/TESTING.md	2026-05-14 17:23:55 +01:00
Pratik Mankawde	145b1469d6	fix(telemetry): rename phase-9 dashboard JSON files rippled-* -> xrpld-* File renames to match the post-docs.sh project-wide rename + the UID rename applied in the previous commit. Five phase-9 dashboards are affected: - rippled-fee-market.json -> xrpld-fee-market.json - rippled-job-queue.json -> xrpld-job-queue.json - rippled-peer-quality.json -> xrpld-peer-quality.json - rippled-rpc-perf.json -> xrpld-rpc-perf-otel.json - rippled-validator-health.json-> xrpld-validator-health.json `rippled-rpc-perf.json` is renamed to `xrpld-rpc-perf-otel.json` (rather than `xrpld-rpc-perf.json`) to avoid colliding with the phase-6 `rpc-performance.json` dashboard which also uses the `xrpld-rpc-perf` UID. The new filename matches its now-unique `xrpld-rpc-perf-otel` UID that was set in the merge commit.	2026-05-14 17:11:25 +01:00
Pratik Mankawde	a9f52458b3	merge: pratik/otel-phase8-log-correlation (dashboard UID + line-number cleanup) into pratik/otel-phase9-metric-gap-fill # Conflicts: # docker/telemetry/grafana/dashboards/consensus-health.json # docker/telemetry/grafana/dashboards/ledger-operations.json # docker/telemetry/grafana/dashboards/peer-network.json # docker/telemetry/grafana/dashboards/rpc-performance.json # docker/telemetry/grafana/dashboards/system-ledger-data-sync.json # docker/telemetry/grafana/dashboards/system-network-traffic.json # docker/telemetry/grafana/dashboards/system-node-health.json # docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json # docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-05-14 17:10:12 +01:00
Pratik Mankawde	0e5e802e5e	merge: pratik/otel-phase7-native-metrics (dashboard UID + line-number cleanup) into pratik/otel-phase8-log-correlation	2026-05-14 17:07:34 +01:00
Pratik Mankawde	6985e1948b	merge: pratik/otel-phase6-statsd (line-number + docs cleanup) into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # docker/telemetry/grafana/dashboards/system-ledger-data-sync.json # docker/telemetry/grafana/dashboards/system-network-traffic.json # docker/telemetry/grafana/dashboards/system-node-health.json # docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json # docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json	2026-05-14 17:07:15 +01:00
Pratik Mankawde	1a36ef4b0f	fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter Follow-up to the phase-6 dashboard cleanup. The three dashboards introduced by commit `f6105ece98` (consensus-health, rpc-performance, transaction-overview) were missed in the initial UID rename and still carried `rippled-*` UIDs plus line-number refs in panel descriptions. - UIDs: `rippled-consensus` -> `xrpld-consensus`, `rippled-rpc-perf` -> `xrpld-rpc-perf`, `rippled-transactions` -> `xrpld-transactions`, matching the post-`docs.sh`-rename runbook and the other dashboards in this PR. - Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`, `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers drift on every refactor; the filename is enough to grep. - Fix the Overall RPC Throughput panel: two targets filtered on `span_name="rpc.request"` (never emitted) instead of `span_name="rpc.http_request"` (the real emitted name). The panel would have shown zero data until this fix.	2026-05-14 16:58:47 +01:00
Pratik Mankawde	a789f6ccf5	docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md Follow-up to the dashboard cleanup on this branch. Caught additional sites in TESTING.md that still reference the never-emitted `rpc.request` span: - TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on `name="rpc.http_request"` (the real emitted name). - Expected-spans table replaces `rpc.request` with `rpc.http_request`. - Query loop under the Prometheus verification section now iterates over the full set of emitted RPC entry-point names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`). Also drop `exporter=otlp_http` from the sample telemetry config block. `TelemetryConfig.cpp` does not parse an `exporter` key in any phase through Phase 8; only OTLP/HTTP is wired up, so the line is either a silently ignored no-op or misleading documentation.	2026-05-14 16:53:40 +01:00
Pratik Mankawde	44cdc8133e	fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers Phase-6 introduces ledger-operations, peer-network, and the five StatsD dashboards. Align them with the rest of the chain: - Rename dashboard UIDs from `rippled-` to `xrpld-` so the provisioned UIDs match the post-rename-script documentation (`docs.sh` rewrites .md but not .json, so the two drifted). Runbook references `xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches. - Add the `$node` template variable + `exported_instance=~"$node"` filter to every target in the five `statsd-*` dashboards. Mirrors the pattern already used by consensus-health, ledger-operations, and peer-network per the project rule that every dashboard must support per-node filtering. - Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references in every dashboard panel description and in docker/telemetry/TESTING.md. Line numbers drift on every refactor; the filename alone is enough to grep. - Replace stale `rpc.request` entries with the real emitted span names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`) in TESTING.md so operators can copy-paste the filters and hit real traces. - Also drop the `:706` line ref from the `StatsDCollector.cpp` callout in `06-implementation-phases.md`.	2026-05-14 16:51:14 +01:00
Pratik Mankawde	34bf61ff77	merge: pratik/otel-phase9-metric-gap-fill fix(SpanKind) into pratik/otel-phase10-workload-validation # Conflicts: # docker/telemetry/otel-collector-config.yaml # docker/telemetry/xrpld-telemetry.cfg	2026-05-14 15:59:39 +01:00
Pratik Mankawde	53e1ff82d8	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-14 14:01:46 +01:00
Pratik Mankawde	8df3ea1bbe	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 14:01:41 +01:00
Pratik Mankawde	5a6882f119	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/otel-collector-config.yaml	2026-05-14 14:01:36 +01:00
Pratik Mankawde	b449db0434	fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare "command". Tempo tags for phase6-added consensus/tx/peer filters used qualified names but C++ uses bare names. Dashboard panel referenced xrpl_tx_suppressed (never populated) instead of suppressed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 14:01:12 +01:00
Pratik Mankawde	9babfff3c8	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-05-14 13:59:19 +01:00
Pratik Mankawde	61ab5c6fe3	fix(telemetry): align Tempo consensus search tags with C++ attribute names Consensus span attributes use bare names (close_time_correct, consensus_state, close_resolution_ms) and shared canonical attrs (xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and xrpl.consensus.round are correct (domain-qualified to avoid collision). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:59:08 +01:00
Pratik Mankawde	837f7e7b50	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-05-14 13:58:38 +01:00
Pratik Mankawde	b392035544	fix(telemetry): align Tempo TX search tags with C++ attribute names Transaction span attributes use bare names (local, tx_status) per SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash is correct (shared canonical attr defined in SpanNames.h). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:31 +01:00
Pratik Mankawde	450004ebd8	Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing	2026-05-14 13:58:19 +01:00
Pratik Mankawde	6f403fdd1b	fix(telemetry): align Tempo search tags with C++ span attribute names RPC span attributes use bare names (command, rpc_status, rpc_role) per the naming convention in SpanNames.h, not xrpl.rpc.* qualified names. Node health attributes (amendment_blocked, server_state) are resource attributes set at Tracer init, not span attributes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:13 +01:00
Pratik Mankawde	5dc4ae8fcc	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-14 13:49:59 +01:00
Pratik Mankawde	690841e934	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 13:49:51 +01:00
Pratik Mankawde	7d61a4a0ef	feat(telemetry): add missing Phase 9 metric panels to dashboards 13 metrics from 09-data-collection-reference.md were not displayed on any Grafana dashboard. Adds panels for all of them: system-node-health.json (+7 panels): - NodeStore Bytes Read/Written (node_written_bytes, node_read_bytes) - NodeStore Read Threads & Duration (node_reads_duration_us, read_request_bundle, read_threads_running, read_threads_total) - AL_size added to Cache Sizes panel - Current Ledger Index (ledger_current_index) - NuDB Storage Size (storage_detail{metric="nudb_bytes"}) rippled-validator-health.json (+2 panels): - UNL Blocked (validator_health{metric="unl_blocked"}) - Agreement/Missed Counters Rate (validation_agreements_total, validation_missed_total) rippled-job-queue.json (+1 panel): - Transaction Overflow Rate (jq_trans_overflow_total) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:32:55 +01:00
Pratik Mankawde	93caaba5ca	fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename Panels 8-15 from statsd-node-health.json and panels 8-9 from statsd-network-traffic.json were lost when Phase 7 renamed these files to system-*. The merge (`5cd71ed107`) took Phase 7's smaller version without the extra panels added by commit `b933e8ae00` on Phase 6. Recovered panels (system-node-health.json): - Key Jobs Execution Time (11 job types) - Key Jobs Dequeue Wait Time (11 job types) - FullBelowCache Size - FullBelowCache Hit Rate - Ledger Publish Gap (validated - published age delta) - State Duration Rate (Full vs Tracking) - All Jobs Execution Time Detail (34 job types) - All Jobs Dequeue Wait Detail (34 job types) Recovered panels (system-network-traffic.json): - Duplicate Traffic (Wasted Bandwidth) - All Traffic Categories Detail (topk 15 by byte rate) All recovered panels updated to include exported_instance=~"$node" filter per project dashboard guidelines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 12:33:18 +01:00
Pratik Mankawde	02fe838257	auto refresh at 5seconds Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 19:00:36 +01:00
Pratik Mankawde	20477e5494	validator path changes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 18:49:21 +01:00
Pratik Mankawde	f0c6227c06	added config for devnet test run Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 18:42:57 +01:00
Pratik Mankawde	a04459f1f8	fix(telemetry): update collector config + tempo datasource + design doc for simplified attr names - otel-collector-config.yaml: spanmetrics dimensions use new bare names. - tempo.yaml: TraceQL filter tags use new bare names. - 02-design-decisions.md: strip xrpl.txq.* prefix from planned attrs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:47:36 +01:00
Pratik Mankawde	815e2b1f5d	refactor(telemetry): fix remaining old attr refs in tests, docs, workload - Update Telemetry.h doc example: xrpl.rpc.command -> command. - Update SpanGuardFactory.cpp test: use new bare attr names. - Update TESTING.md: rename attr refs in span table + PromQL example. - Update expected_spans.json: all attrs match simplified naming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:21:18 +01:00
Pratik Mankawde	ec8e3e2950	Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation	2026-05-13 16:17:49 +01:00
Pratik Mankawde	495d5bd8a0	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-13 16:17:12 +01:00
Pratik Mankawde	6cd910f06f	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-13 16:17:05 +01:00
Pratik Mankawde	5cd71ed107	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-05-13 16:16:50 +01:00

135 Commits