rippled

mirror of https://github.com/XRPLF/rippled.git synced 2026-07-23 15:10:34 +00:00

Author	SHA1	Message	Date
Pratik Mankawde	0e5e802e5e	merge: pratik/otel-phase7-native-metrics (dashboard UID + line-number cleanup) into pratik/otel-phase8-log-correlation	2026-05-14 17:07:34 +01:00
Pratik Mankawde	6985e1948b	merge: pratik/otel-phase6-statsd (line-number + docs cleanup) into pratik/otel-phase7-native-metrics # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # docker/telemetry/grafana/dashboards/system-ledger-data-sync.json # docker/telemetry/grafana/dashboards/system-network-traffic.json # docker/telemetry/grafana/dashboards/system-node-health.json # docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json # docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json	2026-05-14 17:07:15 +01:00
Pratik Mankawde	1a36ef4b0f	fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter Follow-up to the phase-6 dashboard cleanup. The three dashboards introduced by commit `f6105ece98` (consensus-health, rpc-performance, transaction-overview) were missed in the initial UID rename and still carried `rippled-*` UIDs plus line-number refs in panel descriptions. - UIDs: `rippled-consensus` -> `xrpld-consensus`, `rippled-rpc-perf` -> `xrpld-rpc-perf`, `rippled-transactions` -> `xrpld-transactions`, matching the post-`docs.sh`-rename runbook and the other dashboards in this PR. - Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`, `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers drift on every refactor; the filename is enough to grep. - Fix the Overall RPC Throughput panel: two targets filtered on `span_name="rpc.request"` (never emitted) instead of `span_name="rpc.http_request"` (the real emitted name). The panel would have shown zero data until this fix.	2026-05-14 16:58:47 +01:00
Pratik Mankawde	a789f6ccf5	docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md Follow-up to the dashboard cleanup on this branch. Caught additional sites in TESTING.md that still reference the never-emitted `rpc.request` span: - TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on `name="rpc.http_request"` (the real emitted name). - Expected-spans table replaces `rpc.request` with `rpc.http_request`. - Query loop under the Prometheus verification section now iterates over the full set of emitted RPC entry-point names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`). Also drop `exporter=otlp_http` from the sample telemetry config block. `TelemetryConfig.cpp` does not parse an `exporter` key in any phase through Phase 8; only OTLP/HTTP is wired up, so the line is either a silently ignored no-op or misleading documentation.	2026-05-14 16:53:40 +01:00
Pratik Mankawde	44cdc8133e	fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers Phase-6 introduces ledger-operations, peer-network, and the five StatsD dashboards. Align them with the rest of the chain: - Rename dashboard UIDs from `rippled-` to `xrpld-` so the provisioned UIDs match the post-rename-script documentation (`docs.sh` rewrites .md but not .json, so the two drifted). Runbook references `xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches. - Add the `$node` template variable + `exported_instance=~"$node"` filter to every target in the five `statsd-*` dashboards. Mirrors the pattern already used by consensus-health, ledger-operations, and peer-network per the project rule that every dashboard must support per-node filtering. - Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references in every dashboard panel description and in docker/telemetry/TESTING.md. Line numbers drift on every refactor; the filename alone is enough to grep. - Replace stale `rpc.request` entries with the real emitted span names (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`) in TESTING.md so operators can copy-paste the filters and hit real traces. - Also drop the `:706` line ref from the `StatsDCollector.cpp` callout in `06-implementation-phases.md`.	2026-05-14 16:51:14 +01:00
Pratik Mankawde	8df3ea1bbe	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 14:01:41 +01:00
Pratik Mankawde	5a6882f119	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # docker/telemetry/otel-collector-config.yaml	2026-05-14 14:01:36 +01:00
Pratik Mankawde	b449db0434	fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare "command". Tempo tags for phase6-added consensus/tx/peer filters used qualified names but C++ uses bare names. Dashboard panel referenced xrpl_tx_suppressed (never populated) instead of suppressed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 14:01:12 +01:00
Pratik Mankawde	9babfff3c8	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-05-14 13:59:19 +01:00
Pratik Mankawde	61ab5c6fe3	fix(telemetry): align Tempo consensus search tags with C++ attribute names Consensus span attributes use bare names (close_time_correct, consensus_state, close_resolution_ms) and shared canonical attrs (xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and xrpl.consensus.round are correct (domain-qualified to avoid collision). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:59:08 +01:00
Pratik Mankawde	837f7e7b50	Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing	2026-05-14 13:58:38 +01:00
Pratik Mankawde	b392035544	fix(telemetry): align Tempo TX search tags with C++ attribute names Transaction span attributes use bare names (local, tx_status) per SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash is correct (shared canonical attr defined in SpanNames.h). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:31 +01:00
Pratik Mankawde	450004ebd8	Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing	2026-05-14 13:58:19 +01:00
Pratik Mankawde	6f403fdd1b	fix(telemetry): align Tempo search tags with C++ span attribute names RPC span attributes use bare names (command, rpc_status, rpc_role) per the naming convention in SpanNames.h, not xrpl.rpc.* qualified names. Node health attributes (amendment_blocked, server_state) are resource attributes set at Tracer init, not span attributes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 13:58:13 +01:00
Pratik Mankawde	690841e934	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-14 13:49:51 +01:00
Pratik Mankawde	93caaba5ca	fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename Panels 8-15 from statsd-node-health.json and panels 8-9 from statsd-network-traffic.json were lost when Phase 7 renamed these files to system-*. The merge (`5cd71ed107`) took Phase 7's smaller version without the extra panels added by commit `b933e8ae00` on Phase 6. Recovered panels (system-node-health.json): - Key Jobs Execution Time (11 job types) - Key Jobs Dequeue Wait Time (11 job types) - FullBelowCache Size - FullBelowCache Hit Rate - Ledger Publish Gap (validated - published age delta) - State Duration Rate (Full vs Tracking) - All Jobs Execution Time Detail (34 job types) - All Jobs Dequeue Wait Detail (34 job types) Recovered panels (system-network-traffic.json): - Duplicate Traffic (Wasted Bandwidth) - All Traffic Categories Detail (topk 15 by byte rate) All recovered panels updated to include exported_instance=~"$node" filter per project dashboard guidelines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 12:33:18 +01:00
Pratik Mankawde	6cd910f06f	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-13 16:17:05 +01:00
Pratik Mankawde	5cd71ed107	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics	2026-05-13 16:16:50 +01:00
Pratik Mankawde	9e27120a15	refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards - Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h. - LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime, closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for tx_count, tx_failed, validations. - PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare names for proposal_trusted, validation_full, validation_trusted. - Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp. - Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:16:30 +01:00
Pratik Mankawde	fac3287912	fix(telemetry): use .batches for Tempo trace lookup in integration test Tempo /api/traces/{id} returns OTLP-shaped JSON with a top-level "batches" key, not "data". The cross-check in check_log_correlation was querying jq '.data | length' which always returned null, causing the Log-Tempo cross-check to fail even when the trace existed.	2026-05-13 12:16:41 +01:00
Pratik Mankawde	e49c5997b7	added loki config. Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>	2026-05-06 17:37:43 +01:00
Pratik Mankawde	85330920ac	feat(telemetry): add Loki service and filelog receiver for Phase 8 log ingestion Cherry-pick Loki infrastructure from phase-10 back to where it belongs (Phase 8, Tasks 8.2/8.3): - Add Loki 3.4.2 service to docker-compose.yml (port 3100) - Add filelog receiver to OTel Collector config (tails debug.log, regex_parser extracts trace_id/span_id/partition/severity) - Add otlphttp/loki exporter (uses Loki 3.x native OTLP ingestion) - Add logs pipeline: filelog -> batch -> otlphttp/loki - Add health_check extension - Mount xrpld log directory into collector container - Add prometheus-data and loki-data persistent volumes StatsD receiver intentionally excluded — Phase 7 migrated to native OTLP metrics, making the StatsD receiver unnecessary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:55:45 +01:00
Pratik Mankawde	fac6c3ac1d	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation	2026-05-06 14:34:17 +01:00
Pratik Mankawde	a8549a7ab2	fix(telemetry): address code review findings for Phase 8 log-trace correlation - Replace GetSpan() with direct context value check in Logs::format() to avoid heap allocation (new DefaultSpan) on the no-span path - Restore Phase 7 documentation accidentally deleted during merge - Fix undefined $JAEGER variable → use $TEMPO in integration test - Remove useless LCOV_EXCL markers around #ifdef block - Fix indentation inconsistencies in Log.cpp injection block - Remove incorrect url field from loki.yaml derivedFields - Update stale code sample in Phase8_taskList.md to match implementation - Correct "<10ns" performance claims to accurate ~15-20ns (no-span) and ~50ns (active-span) measurements across all docs - Replace Jaeger references with Tempo in TESTING.md (port 16686→3200) - Improve error handling in check_log_correlation(): track files_scanned, detect missing log files, fix silent grep error masking Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:32:46 +01:00
Pratik Mankawde	761688383d	fix(telemetry): address code review issues in OTelCollector - Fix use-after-free: extract gauge callback to static function and call RemoveCallback in ~OTelGaugeImpl() before unregistering from collector - Use memory_order_acq_rel on callHooks() debounce CAS for proper happens-before relationship between hook invocations - Add explicit 2s timeout to ForceFlush() in destructor to prevent blocking indefinitely when OTLP endpoint is unreachable at shutdown - Add OTLP receiver to metrics pipeline so native OTel metrics from xrpld are actually received by the collector - Remove stale health check port from docker-compose (extension was removed from collector config) - Clarify fallback docs: StatsD path requires re-enabling receiver/port - Fix comments: Counter uses uint64_t not int64_t, gauge clamps to [0, INT64_MAX] not [0, UINT64_MAX] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:24:52 +01:00
Pratik Mankawde	8e7a2d6c53	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # OpenTelemetryPlan/08-appendix.md # OpenTelemetryPlan/OpenTelemetryPlan.md	2026-04-29 21:07:32 +01:00
Pratik Mankawde	9adcc49171	fix: re-apply phase-7 doc/config changes lost during merge Re-applies phase-7 unique modifications to documentation and configuration files that were overwritten when taking phase-6's versions during the merge conflict resolution. Changes: - docker-compose.yml: comment out StatsD port 8125, add OTLP notes - otel-collector-config.yaml: remove StatsD receiver, update pipeline - integration-test.sh: server=otel, check_otel_metric, StatsD port check - telemetry-runbook.md: System Metrics section, server=otel config, troubleshooting for missing OTel metrics - 02-design-decisions.md: Phase 7 coexistence strategy notes - 05-configuration-reference.md: OTel System Metrics correlation - 06-implementation-phases.md: add Phase 7 section (~180 lines) - OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days) - 08-appendix.md: add Phase7_taskList.md to document index - Delete 5 statsd-*.json dashboards (replaced by system-*.json) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 21:05:48 +01:00
Pratik Mankawde	7ab6f4d34b	fix: address CI rename checks (rippled -> xrpld) in phase-8 docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 20:09:43 +01:00
Pratik Mankawde	81b47afde7	Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation # Conflicts: # OpenTelemetryPlan/06-implementation-phases.md # OpenTelemetryPlan/08-appendix.md # OpenTelemetryPlan/OpenTelemetryPlan.md # docker/telemetry/grafana/dashboards/statsd-network-traffic.json # docker/telemetry/grafana/dashboards/statsd-node-health.json # docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json	2026-04-29 20:07:43 +01:00
Pratik Mankawde	769668579a	Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics # Conflicts: # .codecov.yml # .github/scripts/levelization/results/ordering.txt # .github/workflows/reusable-clang-tidy-files.yml # CMakeLists.txt # OpenTelemetryPlan/00-tracing-fundamentals.md # OpenTelemetryPlan/01-architecture-analysis.md # OpenTelemetryPlan/02-design-decisions.md # OpenTelemetryPlan/03-implementation-strategy.md # OpenTelemetryPlan/04-code-samples.md # OpenTelemetryPlan/05-configuration-reference.md # OpenTelemetryPlan/06-implementation-phases.md # OpenTelemetryPlan/07-observability-backends.md # OpenTelemetryPlan/08-appendix.md # OpenTelemetryPlan/09-data-collection-reference.md # OpenTelemetryPlan/OpenTelemetryPlan.md # OpenTelemetryPlan/POC_taskList.md # OpenTelemetryPlan/Phase2_taskList.md # OpenTelemetryPlan/Phase3_taskList.md # OpenTelemetryPlan/Phase4_taskList.md # OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md # OpenTelemetryPlan/Phase5_taskList.md # OpenTelemetryPlan/presentation.md # cfg/xrpld-example.cfg # conan.lock # conanfile.py # cspell.config.yaml # docker/telemetry/TESTING.md # docker/telemetry/docker-compose.yml # docker/telemetry/grafana/dashboards/consensus-health.json # docker/telemetry/grafana/dashboards/transaction-overview.json # docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml # docker/telemetry/grafana/provisioning/datasources/tempo.yaml # docker/telemetry/integration-test.sh # docker/telemetry/otel-collector-config.yaml # docker/telemetry/tempo.yaml # docker/telemetry/xrpld-telemetry.cfg # docs/build/telemetry.md # docs/telemetry-runbook.md # include/xrpl/core/ServiceRegistry.h # include/xrpl/protocol/detail/features.macro # include/xrpl/telemetry/SpanGuard.h # include/xrpl/telemetry/Telemetry.h # include/xrpl/telemetry/TraceContextPropagator.h # src/libxrpl/basics/MallocTrim.cpp # src/libxrpl/nodestore/backend/MemoryFactory.cpp # src/libxrpl/nodestore/backend/NuDBFactory.cpp # src/libxrpl/nodestore/backend/RocksDBFactory.cpp # src/libxrpl/telemetry/NullTelemetry.cpp # src/libxrpl/telemetry/Telemetry.cpp # src/libxrpl/telemetry/TelemetryConfig.cpp # src/tests/libxrpl/basics/MallocTrim.cpp # src/tests/libxrpl/telemetry/TelemetryConfig.cpp # src/xrpld/app/consensus/RCLConsensus.cpp # src/xrpld/app/consensus/RCLConsensus.h # src/xrpld/app/ledger/detail/BuildLedger.cpp # src/xrpld/app/ledger/detail/LedgerMaster.cpp # src/xrpld/app/main/Application.cpp # src/xrpld/app/misc/NetworkOPs.cpp # src/xrpld/consensus/Consensus.h # src/xrpld/overlay/detail/PeerImp.cpp # src/xrpld/rpc/detail/RPCHandler.cpp # src/xrpld/rpc/detail/ServerHandler.cpp	2026-04-29 19:50:32 +01:00
Pratik Mankawde	8fb33b0818	feat(telemetry): add Phase 4 consensus tracing with SpanGuard API Instrument the consensus subsystem with OpenTelemetry spans covering the full round lifecycle: round start, establish phase, proposal send, ledger close, position updates, consensus check, accept, validation send, and mode changes. Key design choices adapted from the original Phase 4 implementation to the new SpanGuard factory pattern introduced in Phase 3: - Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs (consensus round spans share trace_id across validators via ledger hash) - Add SpanGuard::addEvent() overload with key-value attribute pairs (used for dispute.resolve events during position updates) - Add ConsensusSpanNames.h with compile-time span name constants following the colocated *SpanNames.h pattern from Phase 3 - Add consensusTraceStrategy config option ("deterministic"/"attribute") for cross-node trace correlation strategy selection - Use SpanGuard::linkedSpan() for follows-from relationships between consecutive rounds and cross-thread validation spans - Use SpanGuard::captureContext() for thread-safe context propagation from consensus thread to jtACCEPT worker thread Spans produced: consensus.round, consensus.proposal.send, consensus.ledger_close, consensus.establish, consensus.update_positions, consensus.check, consensus.accept, consensus.accept.apply, consensus.validation.send, consensus.mode_change Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 17:32:56 +01:00
Pratik Mankawde	3508917f17	feat(telemetry): Phase 3 transaction tracing with protobuf context propagation - TraceContext protobuf message for cross-node trace propagation (added to TMTransaction, TMProposeSet, TMValidation at field 1001) - TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf - PeerImp::handleTransaction: tx.receive span with peer.id, peer.version, tx.hash, tx.suppressed, tx.status attributes - NetworkOPsImp::processTransaction: tx.process span with tx.hash, tx.local, tx.path attributes - Tempo search filters for tx.hash, tx.local, tx.status - Unit tests for TraceContextPropagator (round-trip, edge cases) - Levelization: xrpld.app/overlay > xrpld.telemetry dependencies Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory pattern introduced in Phase 1c. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-29 17:32:49 +01:00
Pratik Mankawde	b933e8ae00	feat(telemetry): add missing StatsD dashboard panels from production dashboard Compared shared production Grafana dashboard against Phase 6 StatsD dashboards and added 10 missing panels covering job execution/dequeue timers, cache metrics, ledger publish gap, state duration rate, duplicate traffic, and detailed traffic breakdown. Node Health dashboard: 8 → 16 panels, plus quantile template variable. Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate(). Updated runbook, data collection reference, and implementation phases docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 14:02:27 +01:00
Pratik Mankawde	a1cb752745	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-04-29 13:01:38 +01:00
Pratik Mankawde	0dec657c61	fix(telemetry): rename dashboard provider to xrpld, replace Jaeger with Tempo troubleshooting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 13:00:40 +01:00
Pratik Mankawde	2aa8dbc2cb	fix(telemetry): restore StatsD receiver, fix metric prefix and doc errors The StatsD receiver config was lost during a branch rebase (--ours conflict resolution dropped it). Re-add the statsd receiver to the OTel Collector config and wire it into the metrics pipeline so beast::insight UDP metrics flow to Prometheus. Also fixes: - Metric prefix mismatch: docs used xrpld_ but dashboards/tests use rippled_ — align all documentation to match the runnable stack - Remove phantom Peer_Disconnects_Charges from docs (plain atomic, not a beast::insight gauge) - Remove premature .codecov.yml exclusions for Phase 7 OTelCollector files that don't exist on this branch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 12:57:50 +01:00
Pratik Mankawde	8daf09b3ce	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd # Conflicts: # docker/telemetry/grafana/dashboards/consensus-health.json # docker/telemetry/grafana/dashboards/transaction-overview.json	2026-04-29 12:37:06 +01:00
Pratik Mankawde	a3044bcef9	fix(telemetry): address review findings for docs/dashboards - Add missing xrpl.consensus.quorum attribute to consensus.accept in runbook - Fix dashboard legend formats: add exported_instance, use Title Case Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 12:36:24 +01:00
Pratik Mankawde	3433c9583d	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd # Conflicts: # docker/telemetry/grafana/dashboards/consensus-health.json # docker/telemetry/grafana/dashboards/transaction-overview.json # docker/telemetry/otel-collector-config.yaml # docs/telemetry-runbook.md	2026-04-29 12:34:27 +01:00
Pratik Mankawde	21dad9a17d	docs(telemetry): sync runbook, dashboards, and configs with code - Add 14 missing spans to runbook (6 TxQ + 8 consensus) - Fix tx.receive attributes and config table in runbook - Document dispute.resolve and tx.included span events - Add spanmetrics dimensions for close_time_correct and tx.suppressed - Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL - Wire $consensus_mode template variable to all consensus panels - Add 10 Tempo search filters for operational attributes - Apply rename script artifacts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 12:29:53 +01:00
Pratik Mankawde	88e25119f0	Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd	2026-04-29 11:29:14 +01:00
Pratik Mankawde	c5a59645d9	fix(telemetry): resolve merge conflicts, bashate, and rename for phase 5 Resolve merge conflicts taking phase 4 consensus span improvements, fix bashate indentation in integration test script, and apply rename script to Phase5 integration test docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 11:28:54 +01:00
Pratik Mankawde	b54b17708f	feat(telemetry): add close time analysis panels to consensus-health dashboard Add 5 new panels to the consensus-health Grafana dashboard using Tempo TraceQL queries against consensus.accept.apply span attributes: - Close Time: Raw Proposals (Per Node) — each node's unrounded wall-clock close_time_self, reveals clock drift across validators - Close Time: Effective / Quantized — the consensus-agreed close_time after rounding to resolution bins, written to ledger header - Close Time Vote Bins & Resolution — number of distinct vote bins (close_time_vote_bins) and bin size (close_resolution_ms) on dual axes - Close Time Resolution Direction — whether resolution increased (coarser), decreased (finer), or stayed unchanged - Close Time Bin Distribution — bar chart showing how raw proposals distribute across quantized bins per round Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:00:57 +01:00
Pratik Mankawde	cbbd6ebee2	feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards Integrate the existing StatsD metrics pipeline (beast::insight) into the OpenTelemetry observability stack and add new trace spans for ledger build/store/validate and peer proposal/validation receive. Phase 5b — Ledger, peer, and transaction spans: - Add ledger.build span with close time attributes in BuildLedger.cpp - Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp - Add ledger.store and ledger.validate spans in LedgerMaster.cpp - Add peer.proposal.receive span with trusted attribute in PeerImp.cpp - Add peer.validation.receive span with ledger_hash, full, trusted attributes in PeerImp.cpp - Add ledger-operations and peer-network Grafana dashboards Phase 6 — StatsD metrics integration: - Add StatsD UDP receiver (port 8125) to OTel Collector - Add 5 StatsD Grafana dashboards: node health, network traffic, overlay traffic detail, ledger data sync, RPC pathfinding - Add 09-data-collection-reference.md cataloging all metrics/spans - Update existing dashboards with new span panels - Expand telemetry runbook and integration test script - Add codecov exclusions for telemetry modules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:00:57 +01:00
Pratik Mankawde	de7194011d	fix(docs): apply rename scripts to telemetry deployment docs Run .github/scripts/rename/docs.sh to replace rippled → xrpld references in TESTING.md, xrpld-telemetry.cfg, and telemetry-runbook.md, fixing the check-rename CI failure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 15:00:40 +01:00
Pratik Mankawde	f6105ece98	feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests Add the observability stack deployment infrastructure and integration test framework for verifying end-to-end trace export. - Add Grafana dashboards: RPC performance, transaction overview, consensus health (pre-provisioned via dashboards.yaml) - Add Prometheus config for spanmetrics collection from OTel Collector - Update OTel Collector config with spanmetrics connector and prometheus exporter for RED metrics - Add docker-compose services: prometheus, dashboard provisioning - Add integration-test.sh with Tempo API-based span verification (replaces previous Jaeger-based approach) - Add TESTING.md with step-by-step deployment and verification guide - Add telemetry-runbook.md for production operations reference - Add xrpld-telemetry.cfg sample configuration - Add toDisplayString() for ConsensusMode (human-readable span values) - Update Phase 2/3 task lists with known issues sections - Add Phase 5 integration test task list - Add TraceContext protobuf fields for future relay propagation - Wire telemetry lifecycle (setServiceInstanceId/start/stop) in Application.cpp Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 15:00:40 +01:00
Pratik Mankawde	34ee231d62	feat(telemetry): add Phase 4 consensus tracing with SpanGuard API Instrument the consensus subsystem with OpenTelemetry spans covering the full round lifecycle: round start, establish phase, proposal send, ledger close, position updates, consensus check, accept, validation send, and mode changes. Key design choices adapted from the original Phase 4 implementation to the new SpanGuard factory pattern introduced in Phase 3: - Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs (consensus round spans share trace_id across validators via ledger hash) - Add SpanGuard::addEvent() overload with key-value attribute pairs (used for dispute.resolve events during position updates) - Add ConsensusSpanNames.h with compile-time span name constants following the colocated *SpanNames.h pattern from Phase 3 - Add consensusTraceStrategy config option ("deterministic"/"attribute") for cross-node trace correlation strategy selection - Use SpanGuard::linkedSpan() for follows-from relationships between consecutive rounds and cross-thread validation spans - Use SpanGuard::captureContext() for thread-safe context propagation from consensus thread to jtACCEPT worker thread Spans produced: consensus.round, consensus.proposal.send, consensus.ledger_close, consensus.establish, consensus.update_positions, consensus.check, consensus.accept, consensus.accept.apply, consensus.validation.send, consensus.mode_change Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 14:34:39 +01:00
Pratik Mankawde	19eead6955	feat(telemetry): Phase 3 transaction tracing with protobuf context propagation - TraceContext protobuf message for cross-node trace propagation (added to TMTransaction, TMProposeSet, TMValidation at field 1001) - TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf - PeerImp::handleTransaction: tx.receive span with peer.id, peer.version, tx.hash, tx.suppressed, tx.status attributes - NetworkOPsImp::processTransaction: tx.process span with tx.hash, tx.local, tx.path attributes - Tempo search filters for tx.hash, tx.local, tx.status - Unit tests for TraceContextPropagator (round-trip, edge cases) - Levelization: xrpld.app/overlay > xrpld.telemetry dependencies Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory pattern introduced in Phase 1c. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 14:28:31 +01:00
Pratik Mankawde	eb51457e69	fix(telemetry): address Phase 2 code review findings - Move node health attribute strings to compile-time constants in SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState) - Add Tempo search filters for node health attributes - Remove unnecessary .c_str() on strOperatingMode() return - Add samplingRatio clamping test (values > 1.0 and < 0.0) - Fix Task 2.3 status: delivered in Phase 1c, not Phase 2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 14:28:07 +01:00
Pratik Mankawde	832648c351	feat(telemetry): add RPC trace filters and SpanGuard unit tests - Grafana Tempo datasource: add rpc-command, rpc-status, rpc-role search filters for the Explore UI - Unit tests: TelemetryConfig (config parsing defaults and sections), SpanGuardFactory (null guard safety, move semantics, discard, all factory methods) - Test CMake registration with optional OTel linking Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 14:28:07 +01:00
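
Commit `fac3287912` above describes a trace lookup that failed because it read a top-level "data" key where Tempo returns "batches". A minimal sketch of that check, assuming the response body is already in hand as a JSON string; the function names `count_trace_batches` and `trace_exists` are hypothetical and are not taken from `integration-test.sh`:

```python
import json


def count_trace_batches(body: str) -> int:
    """Count entries under the top-level "batches" key of a trace response.

    Returns 0 when the key is missing or is not a list, so a response
    shaped like {"data": [...]} is reported as "no trace found".
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return 0
    batches = payload.get("batches")
    return len(batches) if isinstance(batches, list) else 0


def trace_exists(body: str) -> bool:
    """A trace is treated as present when at least one batch came back."""
    return count_trace_batches(body) > 0


if __name__ == "__main__":
    # Illustrative payloads only; real responses carry full span data.
    otlp_shaped = '{"batches": [{"scopeSpans": []}]}'
    wrong_key = '{"data": [{"spans": []}]}'
    print(trace_exists(otlp_shaped))
    print(trace_exists(wrong_key))
```

Reading the wrong key does not raise an error; it simply yields an empty result, which is why the cross-check could fail even when the trace existed.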

68 Commits