Commit Graph

13833 Commits

Author SHA1 Message Date
Pratik Mankawde
85a2220312 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a8c2f94e8a Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
f1025d4f71 Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a2082f29c4 Add StatsD dashboards for overlay traffic detail and ledger data sync
Two new Grafana dashboards covering previously uncovered beast::insight
StatsD metrics:

Overlay Traffic Detail (8 panels):
- Squelch traffic (squelch, suppressed, ignored)
- Overhead breakdown (base, cluster, manifest)
- Validator list distribution traffic
- Set get/share (transaction set exchange)
- Have/requested transactions protocol
- Unknown/unclassified traffic
- Proof path request/response
- Replay delta request/response

Ledger Data & Sync (6 panels):
- Ledger data exchange by sub-type (TX set, TX node, account state)
- Legacy ledger share/get traffic
- GetObject by type (ledger, transaction, account state, CAS, fetch pack)
- GetObject aggregate and special types
- GetObject message counts
- All-categories ranked bar gauge (top 20)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
fcb630921f formatting
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2a5dd29278 Appendix: add 09-data-collection-reference.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
64d8369dbc Add consensus.accept.apply span to data collection reference
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
4dcd65968f document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2bea046dab Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
350a546b9b Update Consensus Mode Over Time panel description to Title Case
Match the toDisplayString() values introduced in Phase 4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:27 +00:00
Pratik Mankawde
4be2809276 Fix codecov: add LCOV_EXCL_LINE to opening lines of multi-line trace macros
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
3b9962069e Fix codecov/patch: exclude telemetry no-op lines from coverage
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.

Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
76d373d9b4 formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
be59e4940f Phase 5b: Add ledger/peer/tx spans + expand Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
5499cbcde4 Update consensus mode filter description to Title Case
Match the toDisplayString() Title Case values (Proposing, Observing,
Wrong Ledger, Switched Ledger) in the $consensus_mode template
variable description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:17 +00:00
Pratik Mankawde
3c07a42026 Fix Grafana datasource default and add human-friendly standalone node name
- Set Prometheus as default datasource (isDefault: true) so dashboard
  panels using type: prometheus without explicit UID resolve correctly.
  Previously Jaeger was implicitly the default, causing all PromQL
  panels to return empty results.
- Explicitly set isDefault: false on Jaeger datasource to prevent
  implicit default behavior.
- Add service_instance_id=rippled-standalone to xrpld-telemetry.cfg
  so Grafana $node dropdown shows a readable name instead of the
  base58 node public key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
8e2693bb35 Appendix: add Phase5_IntegrationTest_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
2b641e6947 Use human-friendly node names in integration test telemetry config
Set service_instance_id=rippled-node-N in each test node's [telemetry]
section so Grafana/Tempo filters show readable names instead of base58
public keys. Update dashboard descriptions and Tempo datasource comments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
7518fb5bb0 Fix Grafana node filter: use exported_instance instead of service_instance_id
Prometheus renames the OTel resource attribute service.instance.id to
'instance', which then conflicts with the built-in scrape 'instance'
label. Prometheus resolves this by prefixing it as 'exported_instance'.
Update all dashboard PromQL queries and template variable queries to
use exported_instance so the Node dropdown correctly populates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
95f1b10cfd Set explicit uid on Prometheus datasource for dashboard compatibility
Dashboard template variables reference datasource uid "prometheus" but
the provisioning config had no uid set, causing Grafana to auto-assign
a random one and break dashboard panel queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
400624a822 Add Grafana dashboard template variables for node and metric filtering
- Add resource_metrics_key_attributes to spanmetrics connector so
  service.instance.id becomes a Prometheus label for per-node filtering
- Add 'node' dropdown (service_instance_id) to all 3 dashboards
- Add 'command' dropdown (xrpl_rpc_command) to RPC Performance
- Add 'tx_origin' dropdown (xrpl_tx_local) to Transaction Overview
- Add 'consensus_mode' dropdown (xrpl_consensus_mode) to Consensus Health
- Update all panel PromQL queries to include $node filter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
3b22d73bef Add consensus close time panels and docs for accept.apply span
- Add Ledger Apply Duration and Close Time Agreement panels to
  consensus-health dashboard
- Add consensus.accept.apply to telemetry runbook with TraceQL
  queries for close time disagreements and consensus failures
- Add span to TESTING.md expected span catalog and verification loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
70396debcb Phase 5: Observability stack — spanmetrics, dashboards, runbook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
9bb8f2e12a Use Title Case for consensus mode names in telemetry attributes
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.

Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:43:55 +00:00
Pratik Mankawde
31d0ed178a Phase 4a: Establish-phase gap fill & cross-node correlation
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:

- Deterministic trace_id from previousLedger.id() for cross-node
  correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
  events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
  roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
7675af41ec Add consensus.accept.apply span with ledger close time attributes
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
  (per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing

Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
655b78a7d8 Add consensus trace filters to Grafana Tempo datasource
Add xrpl.consensus.mode (static), xrpl.consensus.round (dynamic),
and xrpl.consensus.ledger.seq (static) search filters for Phase 4
consensus span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
0bbffaebc4 Phase 4: Consensus tracing — round lifecycle, proposals, validations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
e33e592128 Add transaction trace filters to Grafana Tempo datasource
Add xrpl.tx.hash (static), xrpl.tx.local and xrpl.tx.status
(dynamic) search filters for Phase 3 transaction span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
e6b05d700c Phase 3: Transaction tracing — protobuf context, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f114ec5462 Fix gersemi cmake formatting in test CMakeLists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
1d05075585 Appendix: add Phase 2-5 task lists to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
e77f2f0fa5 Add RPC trace filters to Grafana Tempo datasource
Add xrpl.rpc.command (static), xrpl.rpc.status and xrpl.rpc.role
(dynamic) search filters for Phase 2 RPC span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
ea00fb9bc8 Phase 2: Complete RPC tracing — interface, macros, attributes, tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
a707abd6f2 Phase 1c: RPC layer telemetry integration
Add tracing instrumentation to the RPC request handling layer:

TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
  create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined

RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility

ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
  ("rpc.ws_message"), and processRequest ("rpc.process")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
99f4f16eb9 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f333ae4c11 Fix gersemi cmake formatting (if/endif spacing, target_link_libraries wrapping)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
9ecea839bb Appendix: add presentation.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
ba3bf72326 Move presentation.md into OpenTelemetryPlan directory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
b3609fb345 Respect custom service_instance_id when set in [telemetry] config
Only override serviceInstanceId with the node public key when the user
hasn't explicitly set service_instance_id in the [telemetry] section.
This allows operators to assign human-friendly names (e.g. "validator-1")
while still defaulting to the node's base58 public key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
99a0b4094e Fix empty service.instance.id by deferring node identity injection
The Telemetry object is constructed in ApplicationImp's member initializer
list where nodeIdentity_ is not yet available, resulting in an empty
service.instance.id resource attribute. Add setServiceInstanceId() virtual
method that Application::setup() calls after nodeIdentity_ is known but
before telemetry_->start() creates the OTel resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
48a3cf56b8 Add node identification filters to Grafana Tempo datasource
Add resource-level search filters for node identity:
- service.instance.id (node public key) — unique node identifier
- service.version (rippled build version)
- xrpl.network.id (numeric network ID)
- xrpl.network.type (mainnet/testnet/devnet/standalone)

These enable filtering traces by specific nodes in multi-node
deployments and by network in mixed environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
8e43103310 Add Tempo search filters and metrics generator for trace exploration
Configure Grafana Tempo datasource with pre-built search filters
(service.name, span name, status, duration) for the Explore UI.
Enable Tempo metrics_generator with service-graphs and span-metrics
processors to power Grafana's service map visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
d412791a01 Add Grafana Tempo as second trace backend alongside Jaeger
- Add Tempo 2.7.2 service to docker-compose with local storage
- Add otlp/tempo exporter to OTel Collector traces pipeline
- Add Tempo Grafana datasource provisioning with node graph
- Update 05-configuration-reference.md examples with Tempo
- OTel Collector fans traces to both Jaeger and Tempo simultaneously

Jaeger provides a standalone UI at :16686 for quick lookups.
Tempo is queryable via Grafana Explore using TraceQL and is the
recommended backend for production (supports S3/GCS storage).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
d8d7a40fca Phase 1b: Telemetry core infrastructure
Add the OpenTelemetry telemetry library and supporting infrastructure:

Build system:
- Conan opentelemetry-cpp dependency with OTLP/gRPC exporter
- CMake integration for xrpl_telemetry library target
- Levelization ordering updates

Core library (libxrpl):
- Telemetry class: provider lifecycle, span creation, sampling config
- SpanGuard: RAII span management with attribute/exception helpers
- TelemetryConfig: parse [telemetry] config section
- NullTelemetry: no-op implementation when telemetry is disabled

Application integration:
- Telemetry member in ApplicationImp with start/stop lifecycle
- getTelemetry() interface on Application
- ServiceRegistry telemetry accessor

Docker observability stack:
- OTel Collector, Jaeger, Grafana docker-compose setup
- Collector config with OTLP gRPC receiver and Jaeger exporter

Config and docs:
- Example telemetry config section in xrpld-example.cfg
- Build documentation for telemetry setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
fe6cd31762 Add Phase 4a implementation status to plan docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:08:13 +00:00
Pratik Mankawde
fd18cf9e01 Merge remote-tracking branch 'origin/develop' into pratik/otel-phase1a-plan-docs 2026-03-11 14:58:44 +00:00
Alex Kremer
7b3724b7a3 fix: Add missed clang-tidy bugprone-inc-dec-conditions check (#6526) 2026-03-11 14:04:26 +00:00
Ayaz Salikhov
bee2d112c6 ci: Fix how clang-tidy is run when .clang-tidy is changed (#6521) 2026-03-11 14:18:18 +01:00
Bart
01c977bbfe ci: Fix rules used to determine when to upload Conan recipes (#6524)
The refs as previously used pointed to the source branch, not the target branch. However, determining the target branch is different depending on the GitHub event. The pull request logic was incorrect and needed to be fixed, and the logic inside the workflow could be simplified. Both modifications have been made in this commit.
2026-03-11 13:43:58 +01:00