Commit Graph

67 Commits

Author SHA1 Message Date
Pratik Mankawde
7d61a4a0ef feat(telemetry): add missing Phase 9 metric panels to dashboards
13 metrics from 09-data-collection-reference.md were not displayed on
any Grafana dashboard. Adds panels for all of them:

system-node-health.json (+7 panels):
- NodeStore Bytes Read/Written (node_written_bytes, node_read_bytes)
- NodeStore Read Threads & Duration (node_reads_duration_us,
  read_request_bundle, read_threads_running, read_threads_total)
- AL_size added to Cache Sizes panel
- Current Ledger Index (ledger_current_index)
- NuDB Storage Size (storage_detail{metric="nudb_bytes"})

rippled-validator-health.json (+2 panels):
- UNL Blocked (validator_health{metric="unl_blocked"})
- Agreement/Missed Counters Rate (validation_agreements_total,
  validation_missed_total)

rippled-job-queue.json (+1 panel):
- Transaction Overflow Rate (jq_trans_overflow_total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:32:55 +01:00
Pratik Mankawde
02fe838257 auto refresh at 5seconds
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 19:00:36 +01:00
Pratik Mankawde
20477e5494 validator path changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:49:21 +01:00
Pratik Mankawde
f0c6227c06 added config for devnet test run
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:42:57 +01:00
Pratik Mankawde
495d5bd8a0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:17:12 +01:00
Pratik Mankawde
6cd910f06f Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:17:05 +01:00
Pratik Mankawde
5cd71ed107 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:16:50 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
5601615952 fix(telemetry): align Phase 9 dashboards and integration-test with xrpld_ metric prefix
MetricsRegistry emits OTel SDK metrics with the xrpld_ prefix
(MetricsRegistry.cpp defines "xrpld_nodestore_state",
"xrpld_cache_metrics", etc.), but the Phase 9 dashboards and the
Step 10c integration-test assertions introduced in 892fee638a
queried the rippled_ prefix. Every Phase 9 panel and assertion
therefore rendered "No data" or failed on a live run, even though
the underlying series were being exported correctly.

Rename the rippled_ prefix to xrpld_ for every MetricsRegistry
metric in dashboards and the integration test:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us_bucket
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement (ValidationTracker 1h/24h/7d windows)

Also add ValidationTracker window-gauge assertions to Step 10c of
integration-test.sh so the 1h/24h/7d agreement and miss counts are
checked alongside the other Phase 9 gauges.

The rippled_ prefix is preserved for beast::insight metrics
(rippled_LedgerMaster_*, rippled_Peer_Finder_*, rippled_total_*,
rippled_Overlay_*, rippled_State_Accounting_*, rippled_transactions_*,
rippled_proposals_*, rippled_validations_Messages_*) because those
flow through the StatsD-style OTelCollector configured with
`[insight] prefix=rippled` and remain on that prefix by design.

Verified against a live 6-node consensus network: all 22 Phase 9 +
ValidationTracker assertions now report 6+ series per metric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:59:00 +01:00
Pratik Mankawde
db04120f74 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 12:24:00 +01:00
Pratik Mankawde
fac3287912 fix(telemetry): use .batches for Tempo trace lookup in integration test
Tempo /api/traces/{id} returns OTLP-shaped JSON with a top-level
"batches" key, not "data". The cross-check in check_log_correlation
was querying jq '.data | length' which always returned null, causing
the Log-Tempo cross-check to fail even when the trace existed.
2026-05-13 12:16:41 +01:00
Pratik Mankawde
c096eeb239 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 11:30:22 +01:00
Pratik Mankawde
e49c5997b7 added loki config.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-06 17:37:43 +01:00
Pratik Mankawde
85330920ac feat(telemetry): add Loki service and filelog receiver for Phase 8 log ingestion
Cherry-pick Loki infrastructure from phase-10 back to where it belongs
(Phase 8, Tasks 8.2/8.3):

- Add Loki 3.4.2 service to docker-compose.yml (port 3100)
- Add filelog receiver to OTel Collector config (tails debug.log,
  regex_parser extracts trace_id/span_id/partition/severity)
- Add otlphttp/loki exporter (uses Loki 3.x native OTLP ingestion)
- Add logs pipeline: filelog -> batch -> otlphttp/loki
- Add health_check extension
- Mount xrpld log directory into collector container
- Add prometheus-data and loki-data persistent volumes

StatsD receiver intentionally excluded — Phase 7 migrated to native
OTLP metrics, making the StatsD receiver unnecessary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:55:45 +01:00
Pratik Mankawde
fac6c3ac1d Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-06 14:34:17 +01:00
Pratik Mankawde
a8549a7ab2 fix(telemetry): address code review findings for Phase 8 log-trace correlation
- Replace GetSpan() with direct context value check in Logs::format()
  to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
  and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
  detect missing log files, fix silent grep error masking

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:32:46 +01:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
1658d3dc40 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-04-29 21:09:47 +01:00
Pratik Mankawde
8e7a2d6c53 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
2026-04-29 21:07:32 +01:00
Pratik Mankawde
9adcc49171 fix: re-apply phase-7 doc/config changes lost during merge
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.

Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00
Pratik Mankawde
9e12e660fe Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:25:13 +01:00
Pratik Mankawde
7ab6f4d34b fix: address CI rename checks (rippled -> xrpld) in phase-8 docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:09:43 +01:00
Pratik Mankawde
81b47afde7 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	docker/telemetry/grafana/dashboards/statsd-network-traffic.json
#	docker/telemetry/grafana/dashboards/statsd-node-health.json
#	docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
2026-04-29 20:07:43 +01:00
Pratik Mankawde
769668579a Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	.codecov.yml
#	.github/scripts/levelization/results/ordering.txt
#	.github/workflows/reusable-clang-tidy-files.yml
#	CMakeLists.txt
#	OpenTelemetryPlan/00-tracing-fundamentals.md
#	OpenTelemetryPlan/01-architecture-analysis.md
#	OpenTelemetryPlan/02-design-decisions.md
#	OpenTelemetryPlan/03-implementation-strategy.md
#	OpenTelemetryPlan/04-code-samples.md
#	OpenTelemetryPlan/05-configuration-reference.md
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/07-observability-backends.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	OpenTelemetryPlan/POC_taskList.md
#	OpenTelemetryPlan/Phase2_taskList.md
#	OpenTelemetryPlan/Phase3_taskList.md
#	OpenTelemetryPlan/Phase4_taskList.md
#	OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
#	OpenTelemetryPlan/Phase5_taskList.md
#	OpenTelemetryPlan/presentation.md
#	cfg/xrpld-example.cfg
#	conan.lock
#	conanfile.py
#	cspell.config.yaml
#	docker/telemetry/TESTING.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docker/telemetry/integration-test.sh
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/tempo.yaml
#	docker/telemetry/xrpld-telemetry.cfg
#	docs/build/telemetry.md
#	docs/telemetry-runbook.md
#	include/xrpl/core/ServiceRegistry.h
#	include/xrpl/protocol/detail/features.macro
#	include/xrpl/telemetry/SpanGuard.h
#	include/xrpl/telemetry/Telemetry.h
#	include/xrpl/telemetry/TraceContextPropagator.h
#	src/libxrpl/basics/MallocTrim.cpp
#	src/libxrpl/nodestore/backend/MemoryFactory.cpp
#	src/libxrpl/nodestore/backend/NuDBFactory.cpp
#	src/libxrpl/nodestore/backend/RocksDBFactory.cpp
#	src/libxrpl/telemetry/NullTelemetry.cpp
#	src/libxrpl/telemetry/Telemetry.cpp
#	src/libxrpl/telemetry/TelemetryConfig.cpp
#	src/tests/libxrpl/basics/MallocTrim.cpp
#	src/tests/libxrpl/telemetry/TelemetryConfig.cpp
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/consensus/RCLConsensus.h
#	src/xrpld/app/ledger/detail/BuildLedger.cpp
#	src/xrpld/app/ledger/detail/LedgerMaster.cpp
#	src/xrpld/app/main/Application.cpp
#	src/xrpld/app/misc/NetworkOPs.cpp
#	src/xrpld/consensus/Consensus.h
#	src/xrpld/overlay/detail/PeerImp.cpp
#	src/xrpld/rpc/detail/RPCHandler.cpp
#	src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
b933e8ae00 feat(telemetry): add missing StatsD dashboard panels from production dashboard
Compared shared production Grafana dashboard against Phase 6 StatsD
dashboards and added 10 missing panels covering job execution/dequeue
timers, cache metrics, ledger publish gap, state duration rate, duplicate
traffic, and detailed traffic breakdown.

Node Health dashboard: 8 → 16 panels, plus quantile template variable.
Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate().
Updated runbook, data collection reference, and implementation phases docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:02:27 +01:00
Pratik Mankawde
a1cb752745 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 13:01:38 +01:00
Pratik Mankawde
0dec657c61 fix(telemetry): rename dashboard provider to xrpld, replace Jaeger with Tempo troubleshooting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 13:00:40 +01:00
Pratik Mankawde
2aa8dbc2cb fix(telemetry): restore StatsD receiver, fix metric prefix and doc errors
The StatsD receiver config was lost during a branch rebase (--ours
conflict resolution dropped it). Re-add the statsd receiver to the
OTel Collector config and wire it into the metrics pipeline so
beast::insight UDP metrics flow to Prometheus.

Also fixes:
- Metric prefix mismatch: docs used xrpld_ but dashboards/tests use
  rippled_ — align all documentation to match the runnable stack
- Remove phantom Peer_Disconnects_Charges from docs (plain atomic,
  not a beast::insight gauge)
- Remove premature .codecov.yml exclusions for Phase 7 OTelCollector
  files that don't exist on this branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:57:50 +01:00
Pratik Mankawde
8daf09b3ce Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-04-29 12:37:06 +01:00
Pratik Mankawde
a3044bcef9 fix(telemetry): address review findings for docs/dashboards
- Add missing xrpl.consensus.quorum attribute to consensus.accept in runbook
- Fix dashboard legend formats: add exported_instance, use Title Case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:36:24 +01:00
Pratik Mankawde
3433c9583d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/otel-collector-config.yaml
#	docs/telemetry-runbook.md
2026-04-29 12:34:27 +01:00
Pratik Mankawde
21dad9a17d docs(telemetry): sync runbook, dashboards, and configs with code
- Add 14 missing spans to runbook (6 TxQ + 8 consensus)
- Fix tx.receive attributes and config table in runbook
- Document dispute.resolve and tx.included span events
- Add spanmetrics dimensions for close_time_correct and tx.suppressed
- Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL
- Wire $consensus_mode template variable to all consensus panels
- Add 10 Tempo search filters for operational attributes
- Apply rename script artifacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 12:29:53 +01:00
Pratik Mankawde
88e25119f0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 11:29:14 +01:00
Pratik Mankawde
c5a59645d9 fix(telemetry): resolve merge conflicts, bashate, and rename for phase 5
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:28:54 +01:00
Pratik Mankawde
b54b17708f feat(telemetry): add close time analysis panels to consensus-health dashboard
Add 5 new panels to the consensus-health Grafana dashboard using Tempo
TraceQL queries against consensus.accept.apply span attributes:

- Close Time: Raw Proposals (Per Node) — each node's unrounded
  wall-clock close_time_self, reveals clock drift across validators
- Close Time: Effective / Quantized — the consensus-agreed close_time
  after rounding to resolution bins, written to ledger header
- Close Time Vote Bins & Resolution — number of distinct vote bins
  (close_time_vote_bins) and bin size (close_resolution_ms) on dual axes
- Close Time Resolution Direction — whether resolution increased
  (coarser), decreased (finer), or stayed unchanged
- Close Time Bin Distribution — bar chart showing how raw proposals
  distribute across quantized bins per round

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
cbbd6ebee2 feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.

Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
  attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards

Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
  overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
de7194011d fix(docs): apply rename scripts to telemetry deployment docs
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in TESTING.md, xrpld-telemetry.cfg, and
telemetry-runbook.md, fixing the check-rename CI failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
f6105ece98 feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.

- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
34ee231d62 feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.

Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:

- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
  (consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
  (used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
  following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
  for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
  consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
  from consensus thread to jtACCEPT worker thread

Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
19eead6955 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:31 +01:00
Pratik Mankawde
eb51457e69 fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
832648c351 feat(telemetry): add RPC trace filters and SpanGuard unit tests
- Grafana Tempo datasource: add rpc-command, rpc-status, rpc-role
  search filters for the Explore UI
- Unit tests: TelemetryConfig (config parsing defaults and sections),
  SpanGuardFactory (null guard safety, move semantics, discard, all
  factory methods)
- Test CMake registration with optional OTel linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
e9c5c3520e fix(telemetry): address Phase 1b code review findings
Redesign SpanGuard with pimpl idiom to hide all OpenTelemetry types
from public headers. Add global Telemetry accessor so SpanGuard factory
methods work without explicit Telemetry references. Add child/linked
span creation and cross-thread context propagation. Update plan docs
to reflect macro removal in favor of SpanGuard factory pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
3852b5ae4b fix(telemetry): address review findings and PR #6437 comments
Critical fixes:
- Restore accidentally removed mallocTrim call and MallocTrim.h include
- Add missing shouldTraceLedger() to interface and all implementations
- Derive networkId/networkType from config_->NETWORK_ID (0=mainnet,
  1=testnet, 2=devnet) instead of leaving defaults unpopulated
- Clamp sampling_ratio to [0.0, 1.0] in config parser

PR comment fixes:
- Rename rippled -> xrpld in service name defaults, getTracer() calls,
  Docker network, comments, and docs/build/telemetry.md
- Remove exporter config option (only otlp_http supported)
- Add trace_ledger and service_name to example config
- Clarify head-based sampling semantics in config comments
- Add filter descriptions for span intrinsic filters in Grafana datasource
- Add inline comments to Docker Compose services

Docker/config improvements:
- Remove deprecated version: "3.8" from docker-compose.yml
- Pin images: collector 0.121.0, grafana 11.5.2
- Add health_check extension to otel-collector-config.yaml
- Comment out Tempo metrics_generator remote_write (no Prometheus service)
- Add Prometheus datasource caveat in Grafana datasource config

Other:
- Revert unrelated formatting changes in ServiceRegistry.h
- Change Conan telemetry default to False (matches CMake OFF)
- Add CLAUDE.md-required docs (ASCII diagrams, usage examples,
  @note thread-safety) to Telemetry.h and SpanGuard.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
ca2d616277 refactor(telemetry): remove Jaeger service, exporter, and datasource
Tempo is now the sole trace backend. Remove Jaeger all-in-one service
from docker-compose, otlp/jaeger exporter from OTel Collector config,
and Jaeger Grafana datasource provisioning file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
88686af850 Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
45ffe8e2ec fix(telemetry): add missing counters, fix dashboard metric name, clean dead code
- Add rippled_validation_agreements_total and rippled_validation_missed_total
  counter declarations and creation (wiring to ValidationTracker pending rebase)
- Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"}
  instead of non-existent rippled_Overlay_Peer_Disconnects_Charges
- Remove dead getCountsJson() call in storageDetail callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b92354715d feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
936c73982d docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
892fee638a Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00