rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-23 23:20:33 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	3f8aa47224	fix(telemetry): drop duplicate Beast MetricsRegistry test + remove author-local symlink - `src/test/telemetry/MetricsRegistry_test.cpp` (Beast `unit_test::suite` format under `src/test/`) duplicates the GTest version already maintained at `src/tests/libxrpl/telemetry/MetricsRegistry.cpp`. Project rule (`tasks/lessons.md` §Test Format): all new tests use GTest under `src/tests/libxrpl/`. The GTest version exercises the same four cases (disabled construction, start/stop lifecycle, recording no-op, destructor-calls-stop). Deleting the Beast duplicate eliminates drift and keeps the test authoritative in one place. - Drop the matching `test.telemetry > xrpl.basics/xrpl.core/xrpld.telemetry` entries from `.github/scripts/levelization/results/ordering.txt` because `xrpl.test.telemetry` (the GTest binary) retains its own entries; the removed ones belonged to the deleted Beast suite. - `.claude/instructions.md` was committed as a symlink to an author-local absolute path (`/home/pratik/sourceCode/personal/Rippled/ instructions.md`) that does not exist for any other contributor or in CI. Remove the symlink from git tracking and add `.claude/` to `.gitignore` so future agent commits do not re-add per-developer settings.	2026-05-14 17:27:28 +01:00
Pratik Mankawde	ac57a91b77	merge: phase-9 (dashboard UID + line-number cleanup, detach callbacks) into phase-10 # Conflicts: # docker/telemetry/TESTING.md	2026-05-14 17:23:55 +01:00
Pratik Mankawde	2735e4ac78	fix(telemetry): detach metrics gauge callbacks before Application services stop MetricsRegistry observable-gauge callbacks run on the OTel reader thread and read live state from nodeStore_, overlay_, networkOPs_, ledgerMaster, inboundLedgers, loadManager, and others. The old shutdown sequence called metricsRegistry_->stop() AFTER all those services were already stopped, which left a race window between each service's stop() and the final provider_->ForceFlush() during which a callback could dereference already-stopped service state. The try/catch guards in each callback mitigated crashes but not reads from freed members. - Add MetricsRegistry::detachCallbacks() that sets an atomic<bool> callbacksDetached_ with release ordering. Idempotent. - Guard every ObservableGauge callback entry with an acquire-load of the same flag and return early if it is set. Covers all 15 registered callbacks (cacheHitRate, txq, objectCount, loadFactor, nodeStore, serverInfo, buildInfo, completeLedgers, dbMetrics, validatorHealth, peerQuality, ledgerEconomy, stateTracking, storageDetail, validationAgreement). - Application::run() shutdown sequence now calls metricsRegistry_->detachCallbacks() right after m_loadManager->stop() and BEFORE m_shaMapStore, m_jobQueue, overlay_, grpcServer_, m_networkOPs, serverHandler_, m_ledgerReplayer, m_inboundTransactions, m_inboundLedgers, ledgerCleaner_, m_nodeStore, perfLog_ are stopped. The acquire/release pair guarantees subsequent reader-thread ticks see the detach before they dereference stopped services. - metricsRegistry_->stop() keeps setting the flag as a belt-and-suspenders defense in case a future caller forgets to detach first. - Drop the misleading "No explicit RemoveCallback is needed" comment from stop(); provider destruction alone does not beat the reader thread to already-freed state. The objectCountGauge callback previously discarded its state pointer via `void* /* state */`; restore the state argument so it can access self->callbacksDetached_ too.	2026-05-14 17:20:52 +01:00
Pratik Mankawde	145b1469d6	fix(telemetry): rename phase-9 dashboard JSON files rippled-* -> xrpld-* File renames to match the post-docs.sh project-wide rename + the UID rename applied in the previous commit. Five phase-9 dashboards are affected: - rippled-fee-market.json -> xrpld-fee-market.json - rippled-job-queue.json -> xrpld-job-queue.json - rippled-peer-quality.json -> xrpld-peer-quality.json - rippled-rpc-perf.json -> xrpld-rpc-perf-otel.json - rippled-validator-health.json-> xrpld-validator-health.json `rippled-rpc-perf.json` is renamed to `xrpld-rpc-perf-otel.json` (rather than `xrpld-rpc-perf.json`) to avoid colliding with the phase-6 `rpc-performance.json` dashboard which also uses the `xrpld-rpc-perf` UID. The new filename matches its now-unique `xrpld-rpc-perf-otel` UID that was set in the merge commit.	2026-05-14 17:11:25 +01:00
Pratik Mankawde	a9f52458b3	merge: pratik/otel-phase8-log-correlation (dashboard UID + line-number cleanup) into pratik/otel-phase9-metric-gap-fill # Conflicts: # docker/telemetry/grafana/dashboards/consensus-health.json # docker/telemetry/grafana/dashboards/ledger-operations.json # docker/telemetry/grafana/dashboards/peer-network.json # docker/telemetry/grafana/dashboards/rpc-performance.json # docker/telemetry/grafana/dashboards/system-ledger-data-sync.json # docker/telemetry/grafana/dashboards/system-network-traffic.json # docker/telemetry/grafana/dashboards/system-node-health.json # docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json # docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-05-14 17:10:12 +01:00
Pratik Mankawde	0e5e802e5e	merge: pratik/otel-phase7-native-metrics (dashboard UID + line-number cleanup) into pratik/otel-phase8-log-correlation	2026-05-14 17:07:34 +01:00
Pratik Mankawde	6985e1948b	merge: pratik/otel-phase6-statsd (line-number + docs cleanup) into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # docker/telemetry/grafana/dashboards/system-ledger-data-sync.json # docker/telemetry/grafana/dashboards/system-network-traffic.json # docker/telemetry/grafana/dashboards/system-node-health.json # docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json # docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json	2026-05-14 17:07:15 +01:00
Pratik Mankawde	a844c14e49	merge: pratik/otel-phase5-docs-deployment (line-number + docs cleanup) into pratik/otel-phase6-statsd	2026-05-14 17:00:05 +01:00
Pratik Mankawde	c3c980e858	merge: pratik/otel-phase4-consensus-tracing (line-number + docs cleanup) into pratik/otel-phase5-docs-deployment	2026-05-14 17:00:02 +01:00
Pratik Mankawde	92bc0b24b8	docs(telemetry): drop volatile line numbers from Phase 4 span-catalog table Phase 4 added a span catalog in `06-implementation-phases.md` listing the source location for each consensus span. Line numbers `Consensus.h:707`, `RCLConsensus.cpp:232/341/492/541/900` drift on every refactor and would become stale PR after PR. Filename alone is enough for operators to grep — the RCLConsensus.cpp spans are already unambiguous from the span name itself.	2026-05-14 16:59:43 +01:00
Pratik Mankawde	1a36ef4b0f	fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter Follow-up to the phase-6 dashboard cleanup. The three dashboards introduced by commit `f6105ece98` (consensus-health, rpc-performance, transaction-overview) were missed in the initial UID rename and still carried `rippled-*` UIDs plus line-number refs in panel descriptions. - UIDs: `rippled-consensus` -> `xrpld-consensus`, `rippled-rpc-perf` -> `xrpld-rpc-perf`, `rippled-transactions` -> `xrpld-transactions`, matching the post-`docs.sh`-rename runbook and the other dashboards in this PR. - Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`, `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers drift on every refactor; the filename is enough to grep. - Fix the Overall RPC Throughput panel: two targets filtered on `span_name="rpc.request"` (never emitted) instead of `span_name="rpc.http_request"` (the real emitted name). The panel would have shown zero data until this fix.	2026-05-14 16:58:47 +01:00
Pratik Mankawde	a789f6ccf5	docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md Follow-up to the dashboard cleanup on this branch. Caught additional sites in TESTING.md that still reference the never-emitted `rpc.request` span: - TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on `name="rpc.http_request"` (the real emitted name). - Expected-spans table replaces `rpc.request` with `rpc.http_request`. - Query loop under the Prometheus verification section now iterates over the full set of emitted RPC entry-point names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`). Also drop `exporter=otlp_http` from the sample telemetry config block. `TelemetryConfig.cpp` does not parse an `exporter` key in any phase through Phase 8; only OTLP/HTTP is wired up, so the line is either a silently ignored no-op or misleading documentation.	2026-05-14 16:53:40 +01:00
Pratik Mankawde	44cdc8133e	fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers Phase-6 introduces ledger-operations, peer-network, and the five StatsD dashboards. Align them with the rest of the chain: - Rename dashboard UIDs from `rippled-` to `xrpld-` so the provisioned UIDs match the post-rename-script documentation (`docs.sh` rewrites .md but not .json, so the two drifted). Runbook references `xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches. - Add the `$node` template variable + `exported_instance=~"$node"` filter to every target in the five `statsd-*` dashboards. Mirrors the pattern already used by consensus-health, ledger-operations, and peer-network per the project rule that every dashboard must support per-node filtering. - Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references in every dashboard panel description and in docker/telemetry/TESTING.md. Line numbers drift on every refactor; the filename alone is enough to grep. - Replace stale `rpc.request` entries with the real emitted span names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`) in TESTING.md so operators can copy-paste the filters and hit real traces. - Also drop the `:706` line ref from the `StatsDCollector.cpp` callout in `06-implementation-phases.md`.	2026-05-14 16:51:14 +01:00
Pratik Mankawde	dfe91e071f	merge: phase-5 (runbook span-name + line-number fixes) into phase-6 # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # docs/telemetry-runbook.md	2026-05-14 16:42:13 +01:00
Pratik Mankawde	dec8b0a9a1	docs(telemetry): fix stale RPC span names + drop volatile line numbers in runbook - RPC Spans table: `rpc.request` was documented but the code actually emits `rpc.http_request`. Listed the actual emitted names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`) and their parent/child relationship. - Drop `:<line>` suffixes from Source File columns in both RPC and Transaction span tables. Line numbers drift with every refactor; the filename is enough for operators to grep. - Summary table: replace the never-emitted `rpc.request` row with the real entry points so `span_name=` filters in PromQL / TraceQL match.	2026-05-14 16:34:58 +01:00
Pratik Mankawde	df1d8aed44	merge: phase-4 (phase-1a docs fixes) into phase-5	2026-05-14 16:24:36 +01:00
Pratik Mankawde	41d72cb51b	merge: phase-3 (phase-1a docs fixes) into phase-4 # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md	2026-05-14 16:24:27 +01:00
Pratik Mankawde	45e1c15d24	merge: pratik/otel-phase2-rpc-tracing (phase-1a docs fixes) into pratik/otel-phase3-tx-tracing # Conflicts: # OpenTelemetryPlan/05-configuration-reference.md	2026-05-14 16:13:35 +01:00
Pratik Mankawde	865ab65a07	merge: pratik/otel-phase1c-rpc-integration (phase-1a docs fixes) into pratik/otel-phase2-rpc-tracing	2026-05-14 16:11:04 +01:00
Pratik Mankawde	009c63e7db	merge: pratik/otel-phase1b-telemetry-infra (phase-1a docs fixes) into pratik/otel-phase1c-rpc-integration	2026-05-14 16:11:01 +01:00
Pratik Mankawde	5d70a5fffd	merge: pratik/otel-phase1a-plan-docs (phase-1a docs fixes) into pratik/otel-phase1b-telemetry-infra	2026-05-14 16:10:59 +01:00
Pratik Mankawde	f3a095ab65	docs(telemetry): align Phase 1a plan docs with Phase 1b implementation Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete" status with exit-criteria checkboxes marked done. Every downstream branch through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory, never parses the advertised keys, and the Phase 4 work is not yet delivered. Fixes: - 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as "OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned Upgrade)" with a concrete checklist (Conan dep, config parsing, factory branch, runbook/dashboard updates) for landing the gRPC transport later. - 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key and the :4317 default from the sample config block and the options-summary table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment into a new "Planned (not yet implemented)" table citing the phase that will add each one. Keep the example config minimal so copy-paste does not produce a silently-ignored stanza. - 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from [x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a Complete" to "Phase 4a Plan" and describe the work as future. Replace the broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR) with a sentence pointing readers to where that spec will land. Renumber the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11 ("Effort Summary") was intentionally removed in earlier edits.	2026-05-14 16:09:48 +01:00
Pratik Mankawde	34bf61ff77	merge: pratik/otel-phase9-metric-gap-fill fix(SpanKind) into pratik/otel-phase10-workload-validation # Conflicts: # docker/telemetry/otel-collector-config.yaml # docker/telemetry/xrpld-telemetry.cfg	2026-05-14 15:59:39 +01:00
Pratik Mankawde	9d99ce6ae8	merge: pratik/otel-phase8-log-correlation fix(SpanKind) into pratik/otel-phase9-metric-gap-fill	2026-05-14 15:55:09 +01:00
Pratik Mankawde	577cb9b5f0	merge: pratik/otel-phase7-native-metrics fix(SpanKind) into pratik/otel-phase8-log-correlation	2026-05-14 15:55:07 +01:00
Pratik Mankawde	7d202127bb	merge: pratik/otel-phase6-statsd fix(SpanKind) into pratik/otel-phase7-native-metrics	2026-05-14 15:55:05 +01:00
Pratik Mankawde	56090b0ead	merge: pratik/otel-phase5-docs-deployment fix(SpanKind) into pratik/otel-phase6-statsd	2026-05-14 15:55:03 +01:00
Pratik Mankawde	6c6d6f953f	merge: pratik/otel-phase4-consensus-tracing fix(SpanKind) into pratik/otel-phase5-docs-deployment	2026-05-14 15:55:01 +01:00
Pratik Mankawde	0b4b3c7bf2	merge: pratik/otel-phase3-tx-tracing fix(SpanKind) into pratik/otel-phase4-consensus-tracing	2026-05-14 15:54:59 +01:00
Pratik Mankawde	3e894f8e93	merge: pratik/otel-phase2-rpc-tracing fix(SpanKind) into pratik/otel-phase3-tx-tracing	2026-05-14 15:54:57 +01:00
Pratik Mankawde	cb7dc5c52e	merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing	2026-05-14 15:54:55 +01:00
Pratik Mankawde	9cfb43d8d0	merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration	2026-05-14 15:54:53 +01:00
Pratik Mankawde	7ada57e2a8	fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span() SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's service-graph and spanmetrics RED calculations rely on kServer / kConsumer / kClient / kProducer to classify inbound vs outbound vs internal operations. With kInternal everywhere, the service graph collapses to a single self-loop and RED metrics attribute all latency to internal work. Add categoryToSpanKind() mapping: - Rpc -> kServer (inbound synchronous request) - Peer -> kConsumer (inbound async peer message) - Transactions -> kInternal - Consensus -> kInternal - Ledger -> kInternal Only the single-argument overload is affected; childSpan / linkedSpan continue to default to kInternal because they represent in-process continuations of an already-kinded parent.	2026-05-14 15:53:59 +01:00
Pratik Mankawde	53e1ff82d8	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-14 14:01:46 +01:00
Pratik Mankawde	8df3ea1bbe	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 14:01:41 +01:00
Pratik Mankawde	5a6882f119	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/otel-collector-config.yaml	2026-05-14 14:01:36 +01:00
Pratik Mankawde	b449db0434	fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare "command". Tempo tags for phase6-added consensus/tx/peer filters used qualified names but C++ uses bare names. Dashboard panel referenced xrpl_tx_suppressed (never populated) instead of suppressed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 14:01:12 +01:00
Pratik Mankawde	9babfff3c8	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-05-14 13:59:19 +01:00
Pratik Mankawde	68b32ed0f0	Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment	2026-05-14 13:59:14 +01:00
Pratik Mankawde	61ab5c6fe3	fix(telemetry): align Tempo consensus search tags with C++ attribute names Consensus span attributes use bare names (close_time_correct, consensus_state, close_resolution_ms) and shared canonical attrs (xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and xrpl.consensus.round are correct (domain-qualified to avoid collision). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:59:08 +01:00
Pratik Mankawde	837f7e7b50	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-05-14 13:58:38 +01:00
Pratik Mankawde	b392035544	fix(telemetry): align Tempo TX search tags with C++ attribute names Transaction span attributes use bare names (local, tx_status) per SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash is correct (shared canonical attr defined in SpanNames.h). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:31 +01:00
Pratik Mankawde	450004ebd8	Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing	2026-05-14 13:58:19 +01:00
Pratik Mankawde	6f403fdd1b	fix(telemetry): align Tempo search tags with C++ span attribute names RPC span attributes use bare names (command, rpc_status, rpc_role) per the naming convention in SpanNames.h, not xrpl.rpc.* qualified names. Node health attributes (amendment_blocked, server_state) are resource attributes set at Tracer init, not span attributes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:13 +01:00
Pratik Mankawde	5dc4ae8fcc	Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill	2026-05-14 13:49:59 +01:00
Pratik Mankawde	690841e934	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 13:49:51 +01:00
Pratik Mankawde	7d61a4a0ef	feat(telemetry): add missing Phase 9 metric panels to dashboards 13 metrics from 09-data-collection-reference.md were not displayed on any Grafana dashboard. Adds panels for all of them: system-node-health.json (+7 panels): - NodeStore Bytes Read/Written (node_written_bytes, node_read_bytes) - NodeStore Read Threads & Duration (node_reads_duration_us, read_request_bundle, read_threads_running, read_threads_total) - AL_size added to Cache Sizes panel - Current Ledger Index (ledger_current_index) - NuDB Storage Size (storage_detail{metric="nudb_bytes"}) rippled-validator-health.json (+2 panels): - UNL Blocked (validator_health{metric="unl_blocked"}) - Agreement/Missed Counters Rate (validation_agreements_total, validation_missed_total) rippled-job-queue.json (+1 panel): - Transaction Overflow Rate (jq_trans_overflow_total) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:32:55 +01:00
Pratik Mankawde	93caaba5ca	fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename Panels 8-15 from statsd-node-health.json and panels 8-9 from statsd-network-traffic.json were lost when Phase 7 renamed these files to system-*. The merge (`5cd71ed107`) took Phase 7's smaller version without the extra panels added by commit `b933e8ae00` on Phase 6. Recovered panels (system-node-health.json): - Key Jobs Execution Time (11 job types) - Key Jobs Dequeue Wait Time (11 job types) - FullBelowCache Size - FullBelowCache Hit Rate - Ledger Publish Gap (validated - published age delta) - State Duration Rate (Full vs Tracking) - All Jobs Execution Time Detail (34 job types) - All Jobs Dequeue Wait Detail (34 job types) Recovered panels (system-network-traffic.json): - Duplicate Traffic (Wasted Bandwidth) - All Traffic Categories Detail (topk 15 by byte rate) All recovered panels updated to include exported_instance=~"$node" filter per project dashboard guidelines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 12:33:18 +01:00
Pratik Mankawde	02fe838257	auto refresh at 5seconds Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 19:00:36 +01:00
Pratik Mankawde	20477e5494	validator path changes Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-13 18:49:21 +01:00
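
The detach-flag pattern described in commit 2735e4ac78 above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: `FakeService`, `MetricsRegistrySketch`, and the simplified `gaugeCallback(int&, void*)` signature are hypothetical stand-ins, and only the `callbacksDetached_` flag, the release store, and the acquire load come from the commit message.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a service whose live state a gauge callback reads.
struct FakeService
{
    int value = 42;
};

// Simplified sketch of a registry with a detach flag (names are assumptions).
class MetricsRegistrySketch
{
public:
    explicit MetricsRegistrySketch(FakeService& svc) : svc_(svc)
    {
    }

    // Idempotent: calling it again simply stores true a second time.
    void
    detachCallbacks() noexcept
    {
        callbacksDetached_.store(true, std::memory_order_release);
    }

    // stop() sets the flag as well, in case the caller never detached.
    void
    stop() noexcept
    {
        detachCallbacks();
    }

    // Models an observable-gauge callback that receives the registry through
    // an opaque state pointer. Returns false (and leaves `out` untouched)
    // when the callbacks have been detached.
    static bool
    gaugeCallback(int& out, void* state)
    {
        auto* self = static_cast<MetricsRegistrySketch*>(state);
        if (self->callbacksDetached_.load(std::memory_order_acquire))
            return false;  // early return: do not touch service state
        out = self->svc_.value;
        return true;
    }

private:
    FakeService& svc_;
    std::atomic<bool> callbacksDetached_{false};
};
```

In this sketch a shutdown path would call `detachCallbacks()` before stopping the services the callback reads from. Note that the flag only affects callbacks that begin after the store becomes visible; a callback already past the check is not interrupted by it.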

14380 Commits