Commit Graph

9808 Commits

Author SHA1 Message Date
Pratik Mankawde
010ac78fc3 Fix MetricsRegistry: add rippled_ prefix and Resource attributes
The MetricsRegistry's OTel MeterProvider was missing Resource attributes
(service.name, service.instance.id) causing all 6 nodes' metrics to merge
into a single "unknown_service" in Prometheus with no exported_instance
label for per-node filtering.

Additionally, instrument names lacked the rippled_ prefix that dashboards
and integration tests expect (e.g. "job_queued_total" should be
"rippled_job_queued_total" to match the beast::insight naming convention).

Changes:
- Add Resource with service.name and service.instance.id to MeterProvider
- Prefix all instrument names with rippled_ (counters, histograms, gauges)
- Update start() signature to accept instanceId parameter
- Pass service_instance_id from [telemetry] config in Application::start()
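
The renaming described above can be sketched as a small helper. This is only an illustration: `withRippledPrefix` is a hypothetical name, not a function from the commit, and the real code may simply hard-code the prefixed names.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption, not part of MetricsRegistry):
// prepend "rippled_" unless the instrument name already carries it.
inline std::string
withRippledPrefix(std::string_view name)
{
    constexpr std::string_view prefix = "rippled_";
    if (name.substr(0, prefix.size()) == prefix)
        return std::string(name);
    std::string out(prefix);
    out.append(name);
    return out;
}

// Usage: the example name is the one quoted in the commit message.
inline void
prefixExample()
{
    assert(withRippledPrefix("job_queued_total") == "rippled_job_queued_total");
}
```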

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:58:56 +00:00
Pratik Mankawde
fc2d86dbb2 Fix Windows build: pre-include boost/asio socket types before OTel headers
OTel's spin_lock_mutex.h defines _WINSOCKAPI_ and includes <windows.h>,
which poisons the include state for boost/asio/detail/socket_types.hpp.
Pre-include the boost/asio socket types header on MSVC to get winsock2.h
in before the OTel headers interfere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
55e47c3047 Fix clang-tidy: remove unused alias, annotate empty catches in MetricsRegistry
- Remove unused `metric_api` namespace alias (misc-unused-alias-decls)
- Add NOLINT(bugprone-empty-catch) to 5 observable gauge callback catches
  that intentionally swallow exceptions when services aren't ready yet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
0773fc802f Fix CI: guard MetricsRegistry GTest for telemetry=OFF only
When telemetry=ON, XRPL_ENABLE_TELEMETRY is globally defined, causing
MetricsRegistry.cpp to compile its full OTel path which references
xrpld symbols (LedgerMaster, TxQ, OpenLedger) that cannot be linked
into the standalone GTest binary. Guard the test with #ifndef and only
add MetricsRegistry.cpp as a source when telemetry is OFF.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
d712363c0d Convert MetricsRegistry test from Beast to GTest format
Move test from src/test/telemetry/ (Beast unit_test::suite) to
src/tests/libxrpl/telemetry/ (GTest TEST_F). The test exercises the
no-op/disabled path only, which compiles without XRPL_ENABLE_TELEMETRY
and has no xrpld link dependencies beyond MetricsRegistry.cpp itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
badaf96267 Fix CI: use xrpl namespace in MetricsRegistry test suite definition
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
022720db09 Fix MetricsRegistry: add missing OpenLedger.h and Histogram::Record context arg
- Added missing #include <xrpld/app/ledger/OpenLedger.h> for
  app.openLedger().current() calls in observable gauge callbacks.
- Added opentelemetry::context::Context{} as third argument to
  Histogram::Record() calls — the initializer_list overload requires
  an explicit Context parameter in the installed OTel C++ SDK version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
9289cb671d Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
5dcf366f8f Fix codecov: exclude telemetry ifdef block in Log.cpp from coverage
The trace context injection block is compiled out when
XRPL_ENABLE_TELEMETRY is not defined (coverage builds). codecov still
counts preprocessor-excluded lines as uncovered in the source diff.
Wrap with LCOV_EXCL_START/STOP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
7192a374a8 Fix Log.cpp: ToLowerBase16 requires nostd::span, not raw char arrays
The OTel SDK's TraceId::ToLowerBase16 and SpanId::ToLowerBase16 expect
opentelemetry::nostd::span<char, N> rather than raw char arrays. Also
corrected array sizes from 33/17 to 32/16 (no null terminator needed
since we use output.append(buf, N)).
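
A self-contained sketch of why the buffers need exactly 32 and 16 characters. It does not call the OTel SDK: `toLowerBase16` below is a stand-in with an assumed shape, and `formatTraceId` is a hypothetical name. The point is that a fixed-size buffer appended with an explicit length needs no null terminator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the SDK's hex encoder (assumption): writes exactly 2*N
// lowercase hex characters and never writes a terminating '\0'.
template <std::size_t N>
void
toLowerBase16(std::array<std::uint8_t, N> const& id, std::array<char, 2 * N>& out)
{
    constexpr char digits[] = "0123456789abcdef";
    for (std::size_t i = 0; i < N; ++i)
    {
        out[2 * i] = digits[id[i] >> 4];
        out[2 * i + 1] = digits[id[i] & 0x0f];
    }
}

// Hypothetical formatter: a 16-byte trace ID becomes 32 hex characters.
inline std::string
formatTraceId(std::array<std::uint8_t, 16> const& traceId)
{
    std::array<char, 32> buf;  // 32, not 33: no terminator is stored
    toLowerBase16(traceId, buf);
    std::string output = "trace_id=";
    output.append(buf.data(), buf.size());  // length is passed explicitly
    return output;
}
```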

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
124a510a53 Phase 8: Fix CI — add missing OTel trace/context.h include and cspell logql
Add opentelemetry/trace/context.h to Log.cpp so that
opentelemetry::trace::GetSpan() resolves correctly.
Add 'logql' to cspell dictionary to silence unknown-word warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
4e24569892 Phase 8: Implement log-trace correlation and Loki log ingestion
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.

Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.

Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.

Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.

Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
8af6af6b2a Fix std::bad_weak_ptr crash in OTelGaugeImpl constructor
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code — never used — since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.
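
The failure mode can be reproduced in isolation. The sketch below uses a hypothetical `Gauge` type rather than `OTelGaugeImpl`, and assumes C++17 or later, where `shared_from_this()` is specified to throw `std::bad_weak_ptr` when no `shared_ptr` owns the object yet.

```cpp
#include <memory>

// Hypothetical type for illustration only.
struct Gauge : std::enable_shared_from_this<Gauge>
{
    bool threwInCtor = false;

    Gauge()
    {
        try
        {
            // During construction no shared_ptr owns *this yet, so the
            // internal weak_ptr is empty and this call throws.
            auto self = shared_from_this();
            (void)self;
        }
        catch (std::bad_weak_ptr const&)
        {
            threwInCtor = true;
        }
    }
};
```

Once `std::make_shared<Gauge>()` has returned, calling `shared_from_this()` on the object works as expected.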

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:41 +00:00
Pratik Mankawde
eb7e102015 Fix codecov: exclude OTel collector path in CollectorManager from coverage
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
3541deb972 Fix OTelCollector: Journal::info is a method, not a bool member
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.
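
A minimal sketch of the calling pattern, using hypothetical `Stream` and `Journal` stand-ins rather than the real beast types. The accessor is a method, so it has to be called before its result can be tested or written to.

```cpp
#include <sstream>
#include <string>

// Stand-in stream (assumption): converts to true only when a sink exists.
class Stream
{
    std::ostringstream* sink_;  // null when the level is disabled

public:
    explicit Stream(std::ostringstream* sink) : sink_(sink) {}

    explicit operator bool() const { return sink_ != nullptr; }

    template <typename T>
    Stream const&
    operator<<(T const& value) const
    {
        if (sink_)
            *sink_ << value;
        return *this;
    }
};

// Stand-in journal (assumption): info() is a method returning a Stream.
class Journal
{
    std::ostringstream* infoSink_;

public:
    explicit Journal(std::ostringstream* infoSink) : infoSink_(infoSink) {}

    Stream info() const { return Stream(infoSink_); }
};

inline void
logStarted(Journal const& j)
{
    if (j.info())  // call the method, then test the returned Stream
        j.info() << "collector started";
}
```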

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a431cf97b3 Fix OTelCollector.cpp compilation errors against OTel C++ SDK
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  + AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
35eeffa068 Phase 7: Add node filters to system dashboards + OTelCollector instanceId
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7d51436d26 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs
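
The dot-to-underscore formatting from Task 7.5 could look roughly like this. `toPrometheusName` and the example metric name are hypothetical. The commit does not show the actual function, so treat this as a sketch of the convention only.

```cpp
#include <algorithm>
#include <string>
#include <string_view>

// Hypothetical helper: replace every '.' in a StatsD-style name with '_'
// so the exported name matches what Prometheus-based dashboards expect.
inline std::string
toPrometheusName(std::string_view statsdName)
{
    std::string out(statsdName);
    std::replace(out.begin(), out.end(), '.', '_');
    return out;
}
```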

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
4be2809276 Fix codecov: add LCOV_EXCL_LINE to opening lines of multi-line trace macros
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
3b9962069e Fix codecov/patch: exclude telemetry no-op lines from coverage
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.
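
A sketch of the pattern, assuming a demo macro and flag (`DEMO_TRACE_SET_ATTR`, `DEMO_ENABLE_TELEMETRY`) in place of the real `XRPL_TRACE_SET_ATTR` and `XRPL_ENABLE_TELEMETRY`. With the flag undefined the call site compiles to `((void)0)`, and the trailing comment tells lcov to skip the line.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink, used only when the demo flag is defined.
inline std::vector<std::pair<std::string, int>>&
recordedAttrs()
{
    static std::vector<std::pair<std::string, int>> attrs;
    return attrs;
}

// The flag is deliberately left undefined here, mirroring a coverage build.
#ifdef DEMO_ENABLE_TELEMETRY
#define DEMO_TRACE_SET_ATTR(key, value) recordedAttrs().emplace_back((key), (value))
#else
#define DEMO_TRACE_SET_ATTR(key, value) ((void)0)
#endif

inline int
handleRequest(int apiVersion)
{
    DEMO_TRACE_SET_ATTR("rpc.api_version", apiVersion);  // LCOV_EXCL_LINE
    return apiVersion;
}
```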

Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
be59e4940f Phase 5b: Add ledger/peer/tx spans + expand Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
70396debcb Phase 5: Observability stack — spanmetrics, dashboards, runbook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
9bb8f2e12a Use Title Case for consensus mode names in telemetry attributes
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.

Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:43:55 +00:00
Pratik Mankawde
31d0ed178a Phase 4a: Establish-phase gap fill & cross-node correlation
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:

- Deterministic trace_id from previousLedger.id() for cross-node
  correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
  events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
  roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub
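
One way to picture the deterministic trace ID: every node that builds on the same previous ledger derives the same 128-bit ID from that ledger's 256-bit hash. The commit does not say how the bytes are chosen. Taking the first 16 bytes is an assumption made for this sketch, and the type and function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

using LedgerHash = std::array<std::uint8_t, 32>;  // 256-bit ledger ID
using TraceId = std::array<std::uint8_t, 16>;     // 128-bit trace ID

// Assumed derivation: truncate the ledger hash to its first 16 bytes.
// No local randomness is involved, so the result is identical on every
// node that sees the same previous ledger.
inline TraceId
traceIdFromLedger(LedgerHash const& prevLedgerId)
{
    TraceId id{};
    std::copy_n(prevLedgerId.begin(), id.size(), id.begin());
    return id;
}
```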

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
7675af41ec Add consensus.accept.apply span with ledger close time attributes
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
  (per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing

Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
0bbffaebc4 Phase 4: Consensus tracing — round lifecycle, proposals, validations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
e6b05d700c Phase 3: Transaction tracing — protobuf context, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f114ec5462 Fix gersemi cmake formatting in test CMakeLists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
ea00fb9bc8 Phase 2: Complete RPC tracing — interface, macros, attributes, tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
a707abd6f2 Phase 1c: RPC layer telemetry integration
Add tracing instrumentation to the RPC request handling layer:

TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
  create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined

RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility

ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
  ("rpc.ws_message"), and processRequest ("rpc.process")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
b3609fb345 Respect custom service_instance_id when set in [telemetry] config
Only override serviceInstanceId with the node public key when the user
hasn't explicitly set service_instance_id in the [telemetry] section.
This allows operators to assign human-friendly names (e.g. "validator-1")
while still defaulting to the node's base58 public key.
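
The fallback rule can be sketched as follows. `resolveInstanceId` is a hypothetical name, and treating an empty string the same as an unset value is an assumption. The real code lives in the Telemetry setup path.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: prefer the operator-supplied service_instance_id,
// otherwise fall back to the node's base58 public key.
inline std::string
resolveInstanceId(
    std::optional<std::string> const& configured,
    std::string const& nodePublicKey)
{
    if (configured && !configured->empty())
        return *configured;
    return nodePublicKey;
}
```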

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
99a0b4094e Fix empty service.instance.id by deferring node identity injection
The Telemetry object is constructed in ApplicationImp's member initializer
list where nodeIdentity_ is not yet available, resulting in an empty
service.instance.id resource attribute. Add setServiceInstanceId() virtual
method that Application::setup() calls after nodeIdentity_ is known but
before telemetry_->start() creates the OTel resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
d8d7a40fca Phase 1b: Telemetry core infrastructure
Add the OpenTelemetry telemetry library and supporting infrastructure:

Build system:
- Conan opentelemetry-cpp dependency with OTLP/gRPC exporter
- CMake integration for xrpl_telemetry library target
- Levelization ordering updates

Core library (libxrpl):
- Telemetry class: provider lifecycle, span creation, sampling config
- SpanGuard: RAII span management with attribute/exception helpers
- TelemetryConfig: parse [telemetry] config section
- NullTelemetry: no-op implementation when telemetry is disabled

Application integration:
- Telemetry member in ApplicationImp with start/stop lifecycle
- getTelemetry() interface on Application
- ServiceRegistry telemetry accessor

Docker observability stack:
- OTel Collector, Jaeger, Grafana docker-compose setup
- Collector config with OTLP gRPC receiver and Jaeger exporter

Config and docs:
- Example telemetry config section in xrpld-example.cfg
- Build documentation for telemetry setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Alex Kremer
7b3724b7a3 fix: Add missed clang-tidy bugprone-inc-dec-conditions check (#6526)
2026-03-11 14:04:26 +00:00
Alex Kremer
f27d8f3890 chore: Enable clang-tidy bugprone-inc-dec-in-conditions check (#6455)
2026-03-10 20:12:15 +00:00
Alex Kremer
8345cd77df chore: Enable clang-tidy bugprone-unused-raii check (#6505)
2026-03-10 19:48:56 +00:00
Alex Kremer
c38aabdaee chore: Enable clang-tidy bugprone-unhandled-self-assignment check (#6504)
2026-03-10 17:42:49 +00:00
Alex Kremer
a896ed3987 chore: Enable clang-tidy bugprone-optional-value-conversion check (#6470)
2026-03-10 15:56:24 +01:00
Alex Kremer
1a7d67c4db chore: Enable clang-tidy bugprone-reserved-identifier check (#6456)
2026-03-10 10:29:08 +01:00
Alex Kremer
92983d8040 chore: Enable clang-tidy bugprone-too-small-loop-variable check (#6473)
2026-03-10 08:56:44 +00:00
Alex Kremer
320a65f77c chore: Enable clang-tidy bugprone-suspicious-stringview-data-usage check (#6467)
2026-03-10 08:34:27 +00:00
Alex Kremer
e284969ae4 chore: Enable clang-tidy bugprone-pointer-arithmetic-on-polymorphic-object check (#6469)
2026-03-09 19:36:56 +01:00
Alex Kremer
0335076359 chore: Fix additional clang-tidy issues for unused-local-non-trivial-variable check (#6509)
2026-03-09 17:16:04 +00:00
Sergey Kuznetsov
e2290b1a0a feat: Add mutex wrapper from clio (#6447)
This change adds a mutex wrapper copied from clio. The wrapper attaches a mutex to the data it protects, which improves safety and readability.
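
A minimal sketch of the idea, not the wrapper added by the commit: `Guarded` and its `Lock` type are hypothetical names. The data can only be reached through a lock object, so an unlocked access does not compile.

```cpp
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class Guarded
{
    T value_;
    std::mutex mutex_;

public:
    // Holds the mutex for its whole lifetime and exposes the data.
    class Lock
    {
        std::unique_lock<std::mutex> lock_;
        T& value_;

    public:
        Lock(std::mutex& m, T& v) : lock_(m), value_(v) {}
        T* operator->() { return &value_; }
        T& operator*() { return value_; }
    };

    explicit Guarded(T value = T{}) : value_(std::move(value)) {}

    // The only way to reach value_ is through a Lock.
    Lock lock() { return Lock(mutex_, value_); }
};
```

A call such as `queue.lock()->push_back(1)` holds the mutex only for that expression, while `auto l = queue.lock();` keeps it until `l` goes out of scope.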
2026-03-09 16:33:20 +00:00
Alex Kremer
1ee0567b14 chore: Enable clang-tidy bugprone-suspicious-missing-comma check (#6468)
2026-03-09 15:48:38 +00:00
Alex Kremer
6b301efc8c chore: Enable clang-tidy bugprone-unused-local-non-trivial-variable check (#6458)
2026-03-09 15:25:52 +00:00
Vito Tumas
5865bd017f refactor: Update transaction folder structure (#6483)
This change reorganizes the `tx/transactors` directory for consistency and discoverability. It is a pure refactor with no behavioral changes. Multi-word directory names are separated with underscores, the more common convention in C++ projects.

Specific changes:
- Rename all subdirectories to lowercase/snake_case (`AMM` → `amm`, `Check` → `check`, `NFT` → `nft`, `PermissionedDomain` → `permissioned_domain`, etc.)
- Merge `AMM/` and `Offer/` into `dex/`, including `PermissionedDEXHelpers`
- Rename `MPT/` → `token/`, absorbing `SetTrust` and `Clawback`
- Move top-level transactors into named groups: `account/`, `bridge/`, `credentials/`, `did/`, `escrow/`, `oracle/`, `payment/`, `payment_channel/`, `system/`
- Update all include paths across the codebase and `transactions.macro`
2026-03-06 08:25:31 +00:00
Ayaz Salikhov
af0ec7defd chore: Apply gersemi changes (#6486)
2026-03-05 19:54:44 +00:00
Alex Kremer
dde450784d Add Formats and Flags to server_definitions (#6321)
This change implements https://github.com/XRPLF/XRPL-Standards/discussions/418: "System XLS: Add Formats and Flags to server_definitions".
2026-03-05 16:11:27 +00:00
Ayaz Salikhov
c69091bded chore: Add Git information compile-time info to only one file (#6464)
The existing code added the Git commit info (`GIT_COMMIT_HASH` and `GIT_BRANCH`) to every file, which made it hard for `ccache` to reuse cached build objects. This change adds a separate C++ file from which these compile-time variables are propagated to wherever they are needed. A new CMake file sets the commit info if the `git` binary is available.
2026-03-04 19:45:28 +00:00