Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.
Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.
Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.
Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.
Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code — never used — since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
+ AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
using boundaries_ member instead of aggregate initialization
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add $node template variable to all 5 system-* Grafana dashboards
with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
service.instance.id resource attribute on metrics (matches trace
exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.
- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
reference docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.
Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.
Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:
- Deterministic trace_id from previousLedger.id() for cross-node
correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
(per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing
Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add tracing instrumentation to the RPC request handling layer:
TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined
RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility
ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
("rpc.ws_message"), and processRequest ("rpc.process")
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only override serviceInstanceId with the node public key when the user
hasn't explicitly set service_instance_id in the [telemetry] section.
This allows operators to assign human-friendly names (e.g. "validator-1")
while still defaulting to the node's base58 public key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Telemetry object is constructed in ApplicationImp's member initializer
list where nodeIdentity_ is not yet available, resulting in an empty
service.instance.id resource attribute. Add setServiceInstanceId() virtual
method that Application::setup() calls after nodeIdentity_ is known but
before telemetry_->start() creates the OTel resource.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This change reorganizes the `tx/transactors` directory for consistency and discoverability. There are no behavioral changes, this is a pure refactor. Underscores were chosen as the way to separate multi-words as this is the more popular option in C++ projects.
Specific changes:
- Rename all subdirectories to lowercase/snake_case (`AMM` → `amm`, `Check` → `check`, `NFT` → `nft`, `PermissionedDomain` → `permissioned_domain`, etc.)
- Merge `AMM/` and `Offer/` into `dex/`, including `PermissionedDEXHelpers`
- Rename `MPT/` → `token/`, absorbing `SetTrust` and `Clawback`
- Move top-level transactors into named groups: `account/`, `bridge/`, `credentials/`, `did/`, `escrow/`, `oracle/`, `payment/`, `payment_channel/`, `system/`
- Update all include paths across the codebase and `transactions.macro`
The existing code added the git commit info (`GIT_COMMIT_HASH` and `GIT_BRANCH`) to every file, which was a problem for leveraging `ccache` to cache build objects. This change adds a separate C++ file from where these compile-time variables are propagated to wherever they are needed. A new CMake file is added to set the commit info if the `git` binary is available.
When `gateway_balances` gets called on an account that is involved in the `EscrowCreate` transaction (with MPT being escrowed), the method returns internal error. This change fixes this case by excluding the MPT type when totaling escrow amount.
The `Subscribe` tests were flaky, because each test performs some operations (e.g. sends transactions) and waits for messages to appear in subscription with a 100ms timeout. If tests are slow (e.g. compiled in debug mode or a slow machine) then some of them could fail. This change adds an attempt to synchronize the background Env's thread and the test's thread by ensuring that all the scheduled operations are started before the test's thread starts to wait for a websocket message. This is done by limiting I/O threads of the app inside Env to 1 and adding a synchronization barrier after closing the ledger.
The invariant check system had grown into a single monolithic file pair containing 24 invariant checker classes. The large `InvariantCheck.cpp` file was a frequent source of merge conflicts and difficult to navigate. This refactoring improves maintainability and readability with zero behavioral changes.
In particular, this change:
- Splits `InvariantCheck.h` and `InvariantCheck.cpp` into 10 focused header/source pairs organized by domain under a new `invariants/` subdirectory.
- Extracts the shared `Privilege` enum and `hasPrivilege()` function into a dedicated `InvariantCheckPrivilege.h` header, so domain-specific files can reference them independently.
This change replaces `void const*` by `uint256 const&` for database fetches.
Object hashes are expressed using the `uint256` data type, and are converted to `void *` when calling the `fetch` or `fetchBatch` functions. However, in these fetch functions they are converted back to `uint256`, making the conversion process unnecessary. In a few cases the underlying pointer is needed, but that can then be easy obtained via `[hash variable].data()`.