Commit Graph

9797 Commits

Author SHA1 Message Date
Pratik Mankawde
4e24569892 Phase 8: Implement log-trace correlation and Loki log ingestion
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.

Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.

Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.

Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.

Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
8af6af6b2a Fix std::bad_weak_ptr crash in OTelGaugeImpl constructor
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code — never used — since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:41 +00:00
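Note on 8af6af6b2a: the failure mode can be reproduced in isolation. The sketch below is a minimal illustration, assuming C++17 or later; `BadGauge`, `GoodGauge`, and `constructionThrows` are hypothetical names and not the OTelGaugeImpl code.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a class that calls shared_from_this() too early.
struct BadGauge : std::enable_shared_from_this<BadGauge>
{
    BadGauge()
    {
        // No shared_ptr owns *this yet, so since C++17 this call throws
        // std::bad_weak_ptr.
        auto self = shared_from_this();
        (void)self;
    }
};

// Same shape, but the constructor leaves shared_from_this() alone.
struct GoodGauge : std::enable_shared_from_this<GoodGauge>
{
    GoodGauge() = default;
};

// Returns true when constructing T through make_shared throws bad_weak_ptr.
template <class T>
bool
constructionThrows()
{
    try
    {
        auto p = std::make_shared<T>();
        (void)p;
        return false;
    }
    catch (std::bad_weak_ptr const&)
    {
        return true;
    }
}
```

Once `std::make_shared` has returned, `shared_from_this()` is safe to call, which is why moving or removing the call from the constructor resolves the crash.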
Pratik Mankawde
eb7e102015 Fix codecov: exclude OTel collector path in CollectorManager from coverage
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
3541deb972 Fix OTelCollector: Journal::info is a method, not a bool member
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
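Note on 3541deb972: a rough sketch of the call pattern the message describes. `Stream` and `Journal` here are simplified, hypothetical stand-ins and not the real beast::Journal types; only the "call the accessor, then test the result" shape is taken from the commit.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a per-level log stream.
class Stream
{
public:
    Stream(bool active, std::ostringstream& sink) : active_(active), sink_(sink)
    {
    }

    // True when this log level is enabled.
    explicit operator bool() const
    {
        return active_;
    }

    void
    write(std::string const& msg) const
    {
        if (active_)
            sink_ << msg << '\n';
    }

private:
    bool active_;
    std::ostringstream& sink_;
};

// Hypothetical stand-in for a journal with one level accessor.
class Journal
{
public:
    Journal(bool infoEnabled, std::ostringstream& sink)
        : infoEnabled_(infoEnabled), sink_(sink)
    {
    }

    // A method returning a Stream, so callers must write info().
    Stream
    info() const
    {
        return Stream(infoEnabled_, sink_);
    }

private:
    bool infoEnabled_;
    std::ostringstream& sink_;
};

// Logs only when the info level is active. Writing `j.info` without the
// parentheses would name a member function and would not compile as a test.
inline void
logIfInfo(Journal const& j, std::string const& msg)
{
    if (auto stream = j.info())
        stream.write(msg);
}
```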
Pratik Mankawde
a431cf97b3 Fix OTelCollector.cpp compilation errors against OTel C++ SDK
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  + AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
35eeffa068 Phase 7: Add node filters to system dashboards + OTelCollector instanceId
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7d51436d26 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
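Note on 7d51436d26, Task 7.5: the dot-to-underscore formatting can be sketched as a one-line transformation. The function name `toPrometheusName` and the sample metric name in the usage note are assumptions for illustration; the commit does not show the actual helper.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Replaces every '.' in a metric name with '_', mirroring the naming that a
// StatsD-to-Prometheus pipeline would have produced.
inline std::string
toPrometheusName(std::string name)
{
    std::replace(name.begin(), name.end(), '.', '_');
    return name;
}
```

For example, `toPrometheusName("rippled.jobq.job_count")` yields `rippled_jobq_job_count`, so existing dashboard queries keyed on the underscore form would keep matching.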
Pratik Mankawde
4be2809276 Fix codecov: add LCOV_EXCL_LINE to opening lines of multi-line trace macros
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
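Note on 4be2809276: a sketch of what the marker placement looks like once clang-format has split a call. The macro's two-argument shape, the attribute key, and the function `nextApiVersion` are assumptions; only the `((void)0)` expansion and the marker on both lines come from the commit messages.

```cpp
#include <cassert>

// Expansion used when XRPL_ENABLE_TELEMETRY is not defined. The argument
// list shown here is an assumption.
#define XRPL_TRACE_SET_ATTR(key, value) ((void)0)

inline int
nextApiVersion(int apiVersion)
{
    // Both physical lines carry the marker, because coverage is recorded
    // per line and the call spans two of them.
    XRPL_TRACE_SET_ATTR(                  // LCOV_EXCL_LINE
        "xrpl.rpc.version", apiVersion);  // LCOV_EXCL_LINE
    return apiVersion + 1;
}
```

Comments are removed before macro expansion, so a trailing `//` comment on the opening line of the call does not affect compilation.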
Pratik Mankawde
3b9962069e Fix codecov/patch: exclude telemetry no-op lines from coverage
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.

Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
be59e4940f Phase 5b: Add ledger/peer/tx spans + expand Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
70396debcb Phase 5: Observability stack — spanmetrics, dashboards, runbook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
9bb8f2e12a Use Title Case for consensus mode names in telemetry attributes
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.

Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:43:55 +00:00
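Note on 9bb8f2e12a: a minimal sketch of such a helper. The enumerator names are assumptions, since the commit lists only the four display strings; the "Unknown" fallback is also hypothetical.

```cpp
#include <cassert>
#include <string>

// Assumed enumerators; the commit names only the Title Case outputs.
enum class ConsensusMode { proposing, observing, wrongLedger, switchedLedger };

// Returns a Title Case name intended for span attributes and dashboards,
// leaving any existing to_string() untouched.
inline std::string
toDisplayString(ConsensusMode mode)
{
    switch (mode)
    {
        case ConsensusMode::proposing:
            return "Proposing";
        case ConsensusMode::observing:
            return "Observing";
        case ConsensusMode::wrongLedger:
            return "Wrong Ledger";
        case ConsensusMode::switchedLedger:
            return "Switched Ledger";
    }
    return "Unknown";  // hypothetical fallback for out-of-range values
}
```

Keeping the display helper separate from `to_string()` means log parsers that depend on the old strings are unaffected.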
Pratik Mankawde
31d0ed178a Phase 4a: Establish-phase gap fill & cross-node correlation
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:

- Deterministic trace_id from previousLedger.id() for cross-node
  correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
  events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
  roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub
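
The do-while(0) wrapper mentioned in the XRPL_TRACE_ADD_EVENT bullet can be illustrated with a small sketch. `EventLog`, `TRACE_ADD_EVENT`, and `classify` are hypothetical names, and the macro body is an assumption; the point is only that the wrapper makes a multi-statement macro behave as a single statement.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event sink standing in for a span.
struct EventLog
{
    std::vector<std::string> events;
};

// The do-while(0) wrapper turns the inner `if` into one complete statement,
// so a following `else` cannot attach to it by accident.
#define TRACE_ADD_EVENT(log, name)             \
    do                                         \
    {                                          \
        if ((log) != nullptr)                  \
            (log)->events.push_back(name);     \
    } while (0)

// Records an event when converged; otherwise reports -1.
inline int
classify(EventLog* log, bool converged)
{
    if (converged)
        TRACE_ADD_EVENT(log, "converged");
    else
        return -1;
    return 0;
}
```

Without the wrapper, the `else` in `classify` would bind to the macro's inner `if`, and a null `log` with `converged == true` would return -1 instead of 0.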

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
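Note on 31d0ed178a: one way a deterministic trace ID could be derived from the previous ledger's ID. This is a sketch under a stated assumption, namely that the leading 16 bytes of the 32-byte ledger hash are used; the commit does not say how the bytes are chosen. `LedgerHash`, `TraceId`, and `traceIdFromLedger` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using LedgerHash = std::array<std::uint8_t, 32>;  // 256-bit ledger ID
using TraceId = std::array<std::uint8_t, 16>;     // OTel trace IDs are 16 bytes

// Assumption: truncate the ledger hash to its first 16 bytes. Every node
// that agrees on the previous ledger then computes the same trace ID.
inline TraceId
traceIdFromLedger(LedgerHash const& previousLedger)
{
    TraceId id{};
    std::copy_n(previousLedger.begin(), id.size(), id.begin());
    return id;
}
```

Because the input is shared consensus state, no extra network message is needed for nodes to land on the same trace.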
Pratik Mankawde
7675af41ec Add consensus.accept.apply span with ledger close time attributes
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
  (per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing

Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
0bbffaebc4 Phase 4: Consensus tracing — round lifecycle, proposals, validations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
e6b05d700c Phase 3: Transaction tracing — protobuf context, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f114ec5462 Fix gersemi cmake formatting in test CMakeLists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
ea00fb9bc8 Phase 2: Complete RPC tracing — interface, macros, attributes, tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
a707abd6f2 Phase 1c: RPC layer telemetry integration
Add tracing instrumentation to the RPC request handling layer:

TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
  create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined

RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility

ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
  ("rpc.ws_message"), and processRequest ("rpc.process")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
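Note on a707abd6f2: the RAII idea behind SpanGuard can be sketched without the OTel SDK. `Recorder` is a hypothetical stand-in for a tracer, this `SpanGuard` is a simplified illustration and not the class added in the commit, and "ping" is just an example method for the "rpc.command.<method>" span name.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder standing in for a tracer backend.
struct Recorder
{
    std::vector<std::string> log;
};

// Starts a span on construction and ends it on destruction, so every exit
// path from the enclosing scope closes the span.
class SpanGuard
{
public:
    SpanGuard(Recorder& recorder, std::string name)
        : recorder_(recorder), name_(std::move(name))
    {
        recorder_.log.push_back("start " + name_);
    }

    ~SpanGuard()
    {
        recorder_.log.push_back("end " + name_);
    }

    SpanGuard(SpanGuard const&) = delete;
    SpanGuard&
    operator=(SpanGuard const&) = delete;

private:
    Recorder& recorder_;
    std::string name_;
};

// Example handler: the early return still ends the span.
inline int
handlePing(Recorder& recorder, bool fail)
{
    SpanGuard span(recorder, "rpc.command.ping");
    if (fail)
        return -1;
    return 0;
}
```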
Pratik Mankawde
b3609fb345 Respect custom service_instance_id when set in [telemetry] config
Only override serviceInstanceId with the node public key when the user
hasn't explicitly set service_instance_id in the [telemetry] section.
This allows operators to assign human-friendly names (e.g. "validator-1")
while still defaulting to the node's base58 public key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
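Note on b3609fb345: the precedence rule reduces to a small selection function. The sketch below assumes an unset config value is represented by an empty string; `resolveServiceInstanceId` is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Prefers the operator-supplied service_instance_id and falls back to the
// node public key only when nothing was configured.
inline std::string
resolveServiceInstanceId(
    std::string const& configured,
    std::string const& nodePublicKey)
{
    return configured.empty() ? nodePublicKey : configured;
}
```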
Pratik Mankawde
99a0b4094e Fix empty service.instance.id by deferring node identity injection
The Telemetry object is constructed in ApplicationImp's member initializer
list where nodeIdentity_ is not yet available, resulting in an empty
service.instance.id resource attribute. Add setServiceInstanceId() virtual
method that Application::setup() calls after nodeIdentity_ is known but
before telemetry_->start() creates the OTel resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
d8d7a40fca Phase 1b: Telemetry core infrastructure
Add the OpenTelemetry telemetry library and supporting infrastructure:

Build system:
- Conan opentelemetry-cpp dependency with OTLP/gRPC exporter
- CMake integration for xrpl_telemetry library target
- Levelization ordering updates

Core library (libxrpl):
- Telemetry class: provider lifecycle, span creation, sampling config
- SpanGuard: RAII span management with attribute/exception helpers
- TelemetryConfig: parse [telemetry] config section
- NullTelemetry: no-op implementation when telemetry is disabled

Application integration:
- Telemetry member in ApplicationImp with start/stop lifecycle
- getTelemetry() interface on Application
- ServiceRegistry telemetry accessor

Docker observability stack:
- OTel Collector, Jaeger, Grafana docker-compose setup
- Collector config with OTLP gRPC receiver and Jaeger exporter

Config and docs:
- Example telemetry config section in xrpld-example.cfg
- Build documentation for telemetry setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Alex Kremer
7b3724b7a3 fix: Add missed clang-tidy bugprone-inc-dec-conditions check (#6526) 2026-03-11 14:04:26 +00:00
Alex Kremer
f27d8f3890 chore: Enable clang-tidy bugprone-inc-dec-in-conditions check (#6455) 2026-03-10 20:12:15 +00:00
Alex Kremer
8345cd77df chore: Enable clang-tidy bugprone-unused-raii check (#6505) 2026-03-10 19:48:56 +00:00
Alex Kremer
c38aabdaee chore: Enable clang-tidy bugprone-unhandled-self-assignment check (#6504) 2026-03-10 17:42:49 +00:00
Alex Kremer
a896ed3987 chore: Enable clang-tidy bugprone-optional-value-conversion check (#6470) 2026-03-10 15:56:24 +01:00
Alex Kremer
1a7d67c4db chore: Enable clang-tidy bugprone-reserved-identifier check (#6456) 2026-03-10 10:29:08 +01:00
Alex Kremer
92983d8040 chore: Enable clang-tidy bugprone-too-small-loop-variable check (#6473) 2026-03-10 08:56:44 +00:00
Alex Kremer
320a65f77c chore: Enable clang-tidy bugprone-suspicious-stringview-data-usage check (#6467) 2026-03-10 08:34:27 +00:00
Alex Kremer
e284969ae4 chore: Enable clang-tidy bugprone-pointer-arithmetic-on-polymorphic-object check (#6469) 2026-03-09 19:36:56 +01:00
Alex Kremer
0335076359 chore: Fix additional clang-tidy issues for unused-local-non-trivial-variable check (#6509) 2026-03-09 17:16:04 +00:00
Sergey Kuznetsov
e2290b1a0a feat: Add mutex wrapper from clio (#6447)
This change adds a mutex wrapper copied from clio. The wrapper attaches a mutex to the data it protects, which improves safety and readability.
2026-03-09 16:33:20 +00:00
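Note on e2290b1a0a: a minimal sketch of the general idea of attaching a mutex to the data it protects. `Guarded` and its nested `Lock` are hypothetical names; this is not the clio wrapper or its API.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Owns both the data and its mutex; the data is reachable only through a
// Lock object, which holds the mutex for its whole lifetime.
template <class T>
class Guarded
{
public:
    class Lock
    {
    public:
        Lock(std::mutex& mutex, T& data) : lock_(mutex), data_(data)
        {
        }

        T&
        operator*()
        {
            return data_;
        }

        T*
        operator->()
        {
            return &data_;
        }

    private:
        std::unique_lock<std::mutex> lock_;
        T& data_;
    };

    explicit Guarded(T initial) : data_(std::move(initial))
    {
    }

    // Acquires the mutex and returns an accessor for the protected data.
    Lock
    lock()
    {
        return Lock(mutex_, data_);
    }

private:
    std::mutex mutex_;
    T data_;
};
```

With this shape, `*counter.lock() += 1;` locks, updates, and unlocks in one expression, and there is no way to touch the value while forgetting the lock. Holding two `Lock` objects from the same `Guarded` on one thread would deadlock, since `std::mutex` is not recursive.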
Alex Kremer
1ee0567b14 chore: Enable clang-tidy bugprone-suspicious-missing-comma check (#6468) 2026-03-09 15:48:38 +00:00
Alex Kremer
6b301efc8c chore: Enable clang-tidy bugprone-unused-local-non-trivial-variable check (#6458) 2026-03-09 15:25:52 +00:00
Vito Tumas
5865bd017f refactor: Update transaction folder structure (#6483)
This change reorganizes the `tx/transactors` directory for consistency and discoverability. There are no behavioral changes; this is a pure refactor. Underscores were chosen to separate multi-word names because that is the more popular convention in C++ projects.

Specific changes:
- Rename all subdirectories to lowercase/snake_case (`AMM` → `amm`, `Check` → `check`, `NFT` → `nft`, `PermissionedDomain` → `permissioned_domain`, etc.)
- Merge `AMM/` and `Offer/` into `dex/`, including `PermissionedDEXHelpers`
- Rename `MPT/` → `token/`, absorbing `SetTrust` and `Clawback`
- Move top-level transactors into named groups: `account/`, `bridge/`, `credentials/`, `did/`, `escrow/`, `oracle/`, `payment/`, `payment_channel/`, `system/`
- Update all include paths across the codebase and `transactions.macro`
2026-03-06 08:25:31 +00:00
Ayaz Salikhov
af0ec7defd chore: Apply gersemi changes (#6486) 2026-03-05 19:54:44 +00:00
Alex Kremer
dde450784d Add Formats and Flags to server_definitions (#6321)
This change implements https://github.com/XRPLF/XRPL-Standards/discussions/418: "System XLS: Add Formats and Flags to server_definitions".
2026-03-05 16:11:27 +00:00
Ayaz Salikhov
c69091bded chore: Add Git information compile-time info to only one file (#6464)
The existing code added the git commit info (`GIT_COMMIT_HASH` and `GIT_BRANCH`) to every file, which was a problem for leveraging `ccache` to cache build objects. This change adds a separate C++ file from where these compile-time variables are propagated to wherever they are needed. A new CMake file is added to set the commit info if the `git` binary is available.
2026-03-04 19:45:28 +00:00
Alex Kremer
b451d5e412 chore: Enable clang-tidy bugprone-return-const-ref-from-parameter check (#6459) 2026-03-04 18:10:10 +00:00
Alex Kremer
af97df5a63 chore: Enable clang-tidy bugprone-move-forwarding-reference check (#6457) 2026-03-04 17:03:27 +00:00
Peter Chen
e39954d128 fix: Gateway balance with MPT (#6143)
When `gateway_balances` is called on an account involved in an `EscrowCreate` transaction that escrows an MPT, the method returns an internal error. This change fixes that case by excluding the MPT type when totaling the escrow amount.
2026-03-04 15:50:51 +00:00
tequ
3cd1e3d94e refactor: Update PermissionedDomainDelete to use keylet for sle access (#6063) 2026-03-04 04:11:58 +01:00
Ayaz Salikhov
fcec31ed20 chore: Update pre-commit hooks (#6460) 2026-03-03 20:23:22 +00:00
Sergey Kuznetsov
5300e65686 tests: Improve stability of Subscribe tests (#6420)
The `Subscribe` tests were flaky because each test performs some operations (e.g. sends transactions) and then waits, with a 100ms timeout, for messages to appear in the subscription. If the tests run slowly (e.g. when compiled in debug mode or run on a slow machine), some of them could fail. This change synchronizes the background Env's thread with the test's thread by ensuring that all scheduled operations have started before the test's thread begins waiting for a websocket message. This is done by limiting the I/O threads of the app inside Env to 1 and adding a synchronization barrier after closing the ledger.
2026-03-03 08:46:55 -05:00
Alex Kremer
afc660a1b5 refactor: Fix clang-tidy bugprone-empty-catch check (#6419)
This change fixes or suppresses instances detected by the `bugprone-empty-catch` clang-tidy check.
2026-03-02 17:08:56 +00:00
Vito Tumas
1a7f824b89 refactor: Splits invariant checks into multiple classes (#6440)
The invariant check system had grown into a single monolithic file pair containing 24 invariant checker classes. The large `InvariantCheck.cpp` file was a frequent source of merge conflicts and difficult to navigate. This refactoring improves maintainability and readability with zero behavioral changes.

In particular, this change:
- Splits `InvariantCheck.h` and `InvariantCheck.cpp` into 10 focused header/source pairs organized by domain under a new `invariants/` subdirectory.
- Extracts the shared `Privilege` enum and `hasPrivilege()` function into a dedicated `InvariantCheckPrivilege.h` header, so domain-specific files can reference them independently.
2026-02-27 21:02:39 +00:00
Mayukha Vadari
404f35d556 test: Grep for failures in CI (#6339)
This change adjusts the CI tests to make it easier to spot errors, without needing to sift through the thousands of lines of output.
2026-02-27 03:01:38 +00:00
Bart
3a8a18c2ca refactor: Use uint256 directly as key instead of void pointer (#6313)
This change replaces `void const*` by `uint256 const&` for database fetches.

Object hashes are expressed using the `uint256` data type and are converted to `void *` when calling the `fetch` or `fetchBatch` functions. Inside these functions, however, they are converted back to `uint256`, which makes the conversion unnecessary. In the few cases where the underlying pointer is needed, it can easily be obtained via `[hash variable].data()`.
2026-02-25 18:23:34 -05:00
Valentin Balaschenko
bdd106d992 Explicitly trim the heap after cache sweeps (#6022)
Limited to Linux/glibc builds.
2026-02-24 21:33:13 +00:00
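Note on bdd106d992: a sketch of a glibc-only heap trim. `trimHeapAfterSweep` and `kHeapTrimSupported` are hypothetical names; `malloc_trim` is the glibc function declared in `<malloc.h>`. Where the real change invokes the trim is not shown in the commit message.

```cpp
#include <cassert>

#if defined(__GLIBC__)
#include <malloc.h>
constexpr bool kHeapTrimSupported = true;
#else
constexpr bool kHeapTrimSupported = false;
#endif

// Asks the allocator to return free heap memory to the OS. Returns true
// when a trim was attempted, false on platforms without glibc.
inline bool
trimHeapAfterSweep()
{
#if defined(__GLIBC__)
    ::malloc_trim(0);  // pad = 0: keep no extra free space at the heap top
    return true;
#else
    return false;
#endif
}
```

Calling this right after a cache sweep targets the moment when a large amount of memory has just been freed but may still be held by the allocator.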