Commit Graph

67 Commits

Author SHA1 Message Date
Pratik Mankawde
088848e7ab formatting updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:20:08 +01:00
Pratik Mankawde
e7dea147cd Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:18:36 +01:00
Pratik Mankawde
8d730b8b9a Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
2f96c6547c Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:51:31 +01:00
Pratik Mankawde
c187a62353 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:47:15 +01:00
Pratik Mankawde
c848e51e13 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 16:44:07 +01:00
Pratik Mankawde
3a1f22583f Merge branch 'pratik/otel-phase1a-plan-docs' into pratik/otel-phase1b-telemetry-infra
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 15:34:22 +01:00
Pratik Mankawde
7ac5343119 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 16:09:41 +01:00
Pratik Mankawde
c6c019ed8b addressed code review comments
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 15:55:25 +01:00
Pratik Mankawde
4bd1176df5 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-28 11:38:05 +01:00
Pratik Mankawde
9498b2865f fix(telemetry): address PR #6424 review comments
- Drop xrpl.node.amendment_blocked / xrpl.node.server_state from telemetry
  surface (constants in SpanNames.h, two filters in tempo.yaml). Operators
  read the same data via server_info / server_state RPC; OTel SDK 1.18.0
  cannot refresh resource attrs at runtime so resource-level emission was
  not viable either.

- Namespace all pathfind span attributes under pathfind_* (underscore form
  per Phase 1c rule 5). Renames in PathFindSpanNames.h and call sites in
  PathRequest.cpp, PathRequestManager.cpp, plus the rule-5 retention
  xrpl.pathfind.ledger_index -> pathfind_ledger_index.

- Wire pathfind_source_account / pathfind_dest_account on pathfind.request
  in doPathFind / doRipplePathFind handlers (only when present + string).

- Collapse per-asset pathfind.discover / pathfind.rank spans into one
  pathfind.discover hoisted around the per-source-asset loop in
  PathRequest::findPaths. Span count goes from 2N to 1 per RPC call;
  per-asset breakdown traded for bounded storage and cardinality. Trade-off
  documented inline.

- Fix pathfind_num_paths semantics: now sums getBestPaths().size() across
  the loop (paths actually returned) instead of the maxPaths input cap.

- PathRequestManager::updateAll: move span creation after the locked
  requests_ snapshot, early-return when no active subscriptions exist
  (avoids empty span on every ledger close), set pathfind_num_requests
  = requests.size().

- Update Phase2_taskList.md and 02-design-decisions.md to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:27:29 +01:00
Ayaz Salikhov
23d0812827 style: Use shfmt instead of bashate (#7326) 2026-05-26 18:28:23 +00:00
Ayaz Salikhov
49cb3f45a4 ci: Add clang to nix images (#7308)
Co-authored-by: semgrep-companion-app[bot] <218312740+semgrep-companion-app[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-26 15:45:33 +00:00
Pratik Mankawde
6985e1948b merge: pratik/otel-phase6-statsd (line-number + docs cleanup) into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	docker/telemetry/grafana/dashboards/system-ledger-data-sync.json
#	docker/telemetry/grafana/dashboards/system-network-traffic.json
#	docker/telemetry/grafana/dashboards/system-node-health.json
#	docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json
#	docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json
2026-05-14 17:07:15 +01:00
Pratik Mankawde
1a36ef4b0f fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter
Follow-up to the phase-6 dashboard cleanup. The three dashboards
introduced by commit f6105ece98 (consensus-health, rpc-performance,
transaction-overview) were missed in the initial UID rename and still
carried `rippled-*` UIDs plus line-number refs in panel descriptions.

- UIDs: `rippled-consensus` -> `xrpld-consensus`,
  `rippled-rpc-perf` -> `xrpld-rpc-perf`,
  `rippled-transactions` -> `xrpld-transactions`, matching the
  post-`docs.sh`-rename runbook and the other dashboards in this PR.
- Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`,
  `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers
  drift on every refactor; the filename is enough to grep.
- Fix the Overall RPC Throughput panel: two targets filtered on
  `span_name="rpc.request"` (never emitted) instead of
  `span_name="rpc.http_request"` (the real emitted name). The panel
  would have shown zero data until this fix.
2026-05-14 16:58:47 +01:00
Pratik Mankawde
a789f6ccf5 docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md
Follow-up to the dashboard cleanup on this branch. Caught additional sites
in TESTING.md that still reference the never-emitted `rpc.request` span:

- TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on
  `name="rpc.http_request"` (the real emitted name).
- Expected-spans table replaces `rpc.request` with `rpc.http_request`.
- Query loop under the Prometheus verification section now iterates over
  the full set of emitted RPC entry-point names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`).

Also drop `exporter=otlp_http` from the sample telemetry config block.
`TelemetryConfig.cpp` does not parse an `exporter` key in any phase through
Phase 8; only OTLP/HTTP is wired up, so the line is either a silently
ignored no-op or misleading documentation.
2026-05-14 16:53:40 +01:00
Pratik Mankawde
44cdc8133e fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers
Phase-6 introduces ledger-operations, peer-network, and the five StatsD
dashboards. Align them with the rest of the chain:

- Rename dashboard UIDs from `rippled-*` to `xrpld-*` so the provisioned
  UIDs match the post-rename-script documentation (`docs.sh` rewrites
  .md but not .json, so the two drifted). Runbook references
  `xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches.
- Add the `$node` template variable + `exported_instance=~"$node"` filter
  to every target in the five `statsd-*` dashboards. Mirrors the pattern
  already used by consensus-health, ledger-operations, and peer-network
  per the project rule that every dashboard must support per-node
  filtering.
- Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references
  in every dashboard panel description and in docker/telemetry/TESTING.md.
  Line numbers drift on every refactor; the filename alone is enough to
  grep.
- Replace stale `rpc.request` entries with the real emitted span names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
  in TESTING.md so operators can copy-paste the filters and hit real
  traces.
- Also drop the `:706` line ref from the `StatsDCollector.cpp` callout
  in `06-implementation-phases.md`.
2026-05-14 16:51:14 +01:00
Pratik Mankawde
5a6882f119 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/otel-collector-config.yaml
2026-05-14 14:01:36 +01:00
Pratik Mankawde
b449db0434 fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names
Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 14:01:12 +01:00
Pratik Mankawde
9babfff3c8 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-14 13:59:19 +01:00
Pratik Mankawde
61ab5c6fe3 fix(telemetry): align Tempo consensus search tags with C++ attribute names
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:59:08 +01:00
Pratik Mankawde
837f7e7b50 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-14 13:58:38 +01:00
Pratik Mankawde
b392035544 fix(telemetry): align Tempo TX search tags with C++ attribute names
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:31 +01:00
Pratik Mankawde
450004ebd8 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-14 13:58:19 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
93caaba5ca fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename
Panels 8-15 from statsd-node-health.json and panels 8-9 from
statsd-network-traffic.json were lost when Phase 7 renamed these files
to system-*. The merge (5cd71ed107) took Phase 7's smaller version
without the extra panels added by commit b933e8ae00 on Phase 6.

Recovered panels (system-node-health.json):
- Key Jobs Execution Time (11 job types)
- Key Jobs Dequeue Wait Time (11 job types)
- FullBelowCache Size
- FullBelowCache Hit Rate
- Ledger Publish Gap (validated - published age delta)
- State Duration Rate (Full vs Tracking)
- All Jobs Execution Time Detail (34 job types)
- All Jobs Dequeue Wait Detail (34 job types)

Recovered panels (system-network-traffic.json):
- Duplicate Traffic (Wasted Bandwidth)
- All Traffic Categories Detail (topk 15 by byte rate)

All recovered panels updated to include exported_instance=~"$node"
filter per project dashboard guidelines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 12:33:18 +01:00
Pratik Mankawde
5cd71ed107 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:16:50 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Ayaz Salikhov
648ec747f2 feat: Implement nix-based Dockerfile for CI (#7083) 2026-05-13 15:10:53 +00:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
9adcc49171 fix: re-apply phase-7 doc/config changes lost during merge
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.

Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00
Pratik Mankawde
769668579a Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	.codecov.yml
#	.github/scripts/levelization/results/ordering.txt
#	.github/workflows/reusable-clang-tidy-files.yml
#	CMakeLists.txt
#	OpenTelemetryPlan/00-tracing-fundamentals.md
#	OpenTelemetryPlan/01-architecture-analysis.md
#	OpenTelemetryPlan/02-design-decisions.md
#	OpenTelemetryPlan/03-implementation-strategy.md
#	OpenTelemetryPlan/04-code-samples.md
#	OpenTelemetryPlan/05-configuration-reference.md
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/07-observability-backends.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	OpenTelemetryPlan/POC_taskList.md
#	OpenTelemetryPlan/Phase2_taskList.md
#	OpenTelemetryPlan/Phase3_taskList.md
#	OpenTelemetryPlan/Phase4_taskList.md
#	OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
#	OpenTelemetryPlan/Phase5_taskList.md
#	OpenTelemetryPlan/presentation.md
#	cfg/xrpld-example.cfg
#	conan.lock
#	conanfile.py
#	cspell.config.yaml
#	docker/telemetry/TESTING.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docker/telemetry/integration-test.sh
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/tempo.yaml
#	docker/telemetry/xrpld-telemetry.cfg
#	docs/build/telemetry.md
#	docs/telemetry-runbook.md
#	include/xrpl/core/ServiceRegistry.h
#	include/xrpl/protocol/detail/features.macro
#	include/xrpl/telemetry/SpanGuard.h
#	include/xrpl/telemetry/Telemetry.h
#	include/xrpl/telemetry/TraceContextPropagator.h
#	src/libxrpl/basics/MallocTrim.cpp
#	src/libxrpl/nodestore/backend/MemoryFactory.cpp
#	src/libxrpl/nodestore/backend/NuDBFactory.cpp
#	src/libxrpl/nodestore/backend/RocksDBFactory.cpp
#	src/libxrpl/telemetry/NullTelemetry.cpp
#	src/libxrpl/telemetry/Telemetry.cpp
#	src/libxrpl/telemetry/TelemetryConfig.cpp
#	src/tests/libxrpl/basics/MallocTrim.cpp
#	src/tests/libxrpl/telemetry/TelemetryConfig.cpp
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/consensus/RCLConsensus.h
#	src/xrpld/app/ledger/detail/BuildLedger.cpp
#	src/xrpld/app/ledger/detail/LedgerMaster.cpp
#	src/xrpld/app/main/Application.cpp
#	src/xrpld/app/misc/NetworkOPs.cpp
#	src/xrpld/consensus/Consensus.h
#	src/xrpld/overlay/detail/PeerImp.cpp
#	src/xrpld/rpc/detail/RPCHandler.cpp
#	src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
8fb33b0818 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
3508917f17 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:49 +01:00
Pratik Mankawde
b933e8ae00 feat(telemetry): add missing StatsD dashboard panels from production dashboard
Compared shared production Grafana dashboard against Phase 6 StatsD
dashboards and added 10 missing panels covering job execution/dequeue
timers, cache metrics, ledger publish gap, state duration rate, duplicate
traffic, and detailed traffic breakdown.

Node Health dashboard: 8 → 16 panels, plus quantile template variable.
Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate().
Updated runbook, data collection reference, and implementation phases docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:02:27 +01:00
Pratik Mankawde
a1cb752745 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 13:01:38 +01:00
Pratik Mankawde
0dec657c61 fix(telemetry): rename dashboard provider to xrpld, replace Jaeger with Tempo troubleshooting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:00:40 +01:00
Pratik Mankawde
2aa8dbc2cb fix(telemetry): restore StatsD receiver, fix metric prefix and doc errors
The StatsD receiver config was lost during a branch rebase (--ours
conflict resolution dropped it). Re-add the statsd receiver to the
OTel Collector config and wire it into the metrics pipeline so
beast::insight UDP metrics flow to Prometheus.

Also fixes:
- Metric prefix mismatch: docs used xrpld_ but dashboards/tests use
  rippled_ — align all documentation to match the runnable stack
- Remove phantom Peer_Disconnects_Charges from docs (plain atomic,
  not a beast::insight gauge)
- Remove premature .codecov.yml exclusions for Phase 7 OTelCollector
  files that don't exist on this branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:57:50 +01:00
Pratik Mankawde
8daf09b3ce Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-04-29 12:37:06 +01:00
Pratik Mankawde
a3044bcef9 fix(telemetry): address review findings for docs/dashboards
- Add missing xrpl.consensus.quorum attribute to consensus.accept in runbook
- Fix dashboard legend formats: add exported_instance, use Title Case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:36:24 +01:00
Pratik Mankawde
3433c9583d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/otel-collector-config.yaml
#	docs/telemetry-runbook.md
2026-04-29 12:34:27 +01:00
Pratik Mankawde
21dad9a17d docs(telemetry): sync runbook, dashboards, and configs with code
- Add 14 missing spans to runbook (6 TxQ + 8 consensus)
- Fix tx.receive attributes and config table in runbook
- Document dispute.resolve and tx.included span events
- Add spanmetrics dimensions for close_time_correct and tx.suppressed
- Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL
- Wire $consensus_mode template variable to all consensus panels
- Add 10 Tempo search filters for operational attributes
- Apply rename script artifacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:29:53 +01:00
Pratik Mankawde
88e25119f0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 11:29:14 +01:00
Pratik Mankawde
c5a59645d9 fix(telemetry): resolve merge conflicts, bashate, and rename for phase 5
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:28:54 +01:00
Pratik Mankawde
b54b17708f feat(telemetry): add close time analysis panels to consensus-health dashboard
Add 5 new panels to the consensus-health Grafana dashboard using Tempo
TraceQL queries against consensus.accept.apply span attributes:

- Close Time: Raw Proposals (Per Node) — each node's unrounded
  wall-clock close_time_self, reveals clock drift across validators
- Close Time: Effective / Quantized — the consensus-agreed close_time
  after rounding to resolution bins, written to ledger header
- Close Time Vote Bins & Resolution — number of distinct vote bins
  (close_time_vote_bins) and bin size (close_resolution_ms) on dual axes
- Close Time Resolution Direction — whether resolution increased
  (coarser), decreased (finer), or stayed unchanged
- Close Time Bin Distribution — bar chart showing how raw proposals
  distribute across quantized bins per round

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
cbbd6ebee2 feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.

Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
  attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards

Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
  overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
de7194011d fix(docs): apply rename scripts to telemetry deployment docs
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in TESTING.md, xrpld-telemetry.cfg, and
telemetry-runbook.md, fixing the check-rename CI failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
f6105ece98 feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.

- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
34ee231d62 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
19eead6955 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00