Include opentelemetry/trace/context.h in Log.cpp so that
opentelemetry::trace::GetSpan() resolves correctly.
Add 'logql' to cspell dictionary to silence unknown-word warnings.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upgrade Loki from 2.9.0 to 3.4.2 which supports native OTLP ingestion.
Replace removed `loki` exporter with `otlphttp/loki` pointed at Loki's
/otlp endpoint. The `loki` exporter was dropped in otel-collector-contrib
v0.147.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.
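
A minimal sketch of the injection idea in Task 8.1, written without the OTel headers so it stands alone. `SpanIds`, `toHex`, and `formatLine` are hypothetical names, and the `trace_id=... span_id=...` layout is an assumption about the log format rather than the exact output of Logs::format():

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the identifiers of the active span.
struct SpanIds
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
};

// Lowercase hex encoding of a fixed-size byte array.
template <std::size_t N>
std::string toHex(std::array<std::uint8_t, N> const& bytes)
{
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(N * 2);
    for (auto b : bytes)
    {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}

// Appends trace context after the prefix only when a span is active
// (active == nullptr models "no active span").
std::string formatLine(
    std::string const& prefix,
    std::string const& message,
    SpanIds const* active)
{
    std::string line = prefix;
    if (active != nullptr)
    {
        line += " trace_id=" + toHex(active->traceId);
        line += " span_id=" + toHex(active->spanId);
    }
    line += " " + message;
    return line;
}
```

Lines written without an active span keep their original shape, so existing log consumers would be unaffected under this assumed layout.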
Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.
Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.
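
To illustrate the field extraction in Task 8.3, here is a rough C++ equivalent of the parsing step. The line layout (`<date> <time> UTC <partition>:<severity> [trace_id=... span_id=...] <message>`) and the names `ParsedLine` and `parseLine` are assumptions for this sketch; the collector's actual regex_parser expression is not reproduced here:

```cpp
#include <regex>
#include <string>

// Fields pulled out of one log line (hypothetical structure).
struct ParsedLine
{
    std::string timestamp;
    std::string partition;
    std::string severity;
    std::string traceId;  // empty when the line carries no trace context
    std::string spanId;   // empty when the line carries no trace context
    std::string message;
};

// Returns true and fills `out` when the line matches the assumed layout.
bool parseLine(std::string const& line, ParsedLine& out)
{
    // Timestamp is three space-separated tokens; the trace block is optional.
    static std::regex const pattern(
        R"re(^(\S+ \S+ \S+) (\w+):(\w+) (?:trace_id=([0-9a-f]{32}) span_id=([0-9a-f]{16}) )?(.*)$)re");

    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return false;

    out.timestamp = m[1].str();
    out.partition = m[2].str();
    out.severity = m[3].str();
    out.traceId = m[4].str();  // unmatched groups yield an empty string
    out.spanId = m[5].str();
    out.message = m[6].str();
    return true;
}
```

Making the trace block optional lets one expression handle both correlated and uncorrelated lines.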
Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.
Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Grafana Loki data source with derivedFields config linking
trace_id values in log lines to Tempo traces. This enables one-click
log-to-trace navigation in Grafana Explore.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code (never used), since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.
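
For reference, the failure mode can be reproduced in isolation. `BadGauge` and `GoodGauge` are hypothetical types, not the actual OTelGaugeImpl; the throwing behavior shown is what the standard guarantees since C++17:

```cpp
#include <memory>

// Calls shared_from_this() while the object is still being constructed.
struct BadGauge : std::enable_shared_from_this<BadGauge>
{
    BadGauge()
    {
        // No shared_ptr owns *this yet, so this throws std::bad_weak_ptr.
        auto self = shared_from_this();
        (void)self;
    }
};

// Defers shared_from_this() until after construction has finished.
struct GoodGauge : std::enable_shared_from_this<GoodGauge>
{
    GoodGauge() = default;

    // Safe once std::make_shared has returned an owning shared_ptr.
    std::shared_ptr<GoodGauge> self()
    {
        return shared_from_this();
    }
};

// Returns true if constructing BadGauge threw std::bad_weak_ptr.
bool constructorThrows()
{
    try
    {
        auto g = std::make_shared<BadGauge>();
        (void)g;
    }
    catch (std::bad_weak_ptr const&)
    {
        return true;
    }
    return false;
}
```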
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.
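
A simplified sketch of the pattern, using hypothetical stand-ins for beast::Journal and its Stream (the real classes have more members and a different severity model):

```cpp
// Stand-in for the Stream object returned by a severity accessor.
class Stream
{
public:
    explicit Stream(bool active) : active_(active) {}

    // True when messages at this severity would be written.
    explicit operator bool() const { return active_; }

private:
    bool active_;
};

// Stand-in for the journal; the threshold is the lowest active severity.
class Journal
{
public:
    static constexpr int kInfo = 2;
    static constexpr int kWarn = 3;

    explicit Journal(int threshold) : threshold_(threshold) {}

    // Accessors are methods: each call returns a Stream for that severity.
    Stream info() const { return Stream(kInfo >= threshold_); }
    Stream warn() const { return Stream(kWarn >= threshold_); }

private:
    int threshold_;
};

// Tests the level by calling the accessor and converting the Stream to bool.
bool shouldLogInfo(Journal const& j)
{
    if (j.info())  // note the (): `j.info` alone does not yield a Stream
        return true;
    return false;
}
```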
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The label_values() query function (a Grafana variable query for
Prometheus data sources, not PromQL itself) needs a metric name as the
first argument here. Without it, the query returns raw label hashes
instead of readable node names like "validator-0:6006".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  plus AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add $node template variable to all 5 system-* Grafana dashboards
with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
service.instance.id resource attribute on metrics (matches trace
exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.
- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
reference docs
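
The name mapping in Task 7.5 amounts to a character substitution. A minimal sketch, where `toPrometheusName` is a hypothetical helper name and the sample metric name is illustrative only:

```cpp
#include <algorithm>
#include <string>

// Rewrites a dotted metric name into the underscore form Prometheus expects.
std::string toPrometheusName(std::string name)
{
    std::replace(name.begin(), name.end(), '.', '_');
    return name;
}
```

Keeping the same mapping the StatsD path produced means existing dashboard queries can continue to reference the same series names.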
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.
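
A sketch of the macro shape being described, assuming telemetry is disabled (XRPL_ENABLE_TELEMETRY undefined). `SKETCH_TRACE_SET_ATTR`, `recordAttr`, and `nextValue` are hypothetical names used only for this illustration:

```cpp
// Hypothetical sink used only when telemetry is compiled in.
inline void recordAttr(char const*, int) {}

#ifdef XRPL_ENABLE_TELEMETRY
#define SKETCH_TRACE_SET_ATTR(key, value) recordAttr((key), (value))
#else
// Disabled build: the call site becomes a no-op expression statement.
#define SKETCH_TRACE_SET_ATTR(key, value) ((void)0)
#endif

// Counts how many times the attribute value expression was evaluated.
int evaluations = 0;

int nextValue()
{
    return ++evaluations;
}

// With telemetry disabled the macro arguments are discarded, so
// nextValue() is never called and the counter stays at zero.
int runOnce()
{
    SKETCH_TRACE_SET_ATTR("xrpl.example", nextValue());  // LCOV_EXCL_LINE
    return evaluations;
}
```

The statement still occupies a source line that gcov can instrument, which is why the exclusion marker is needed even though nothing executes.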
Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the toDisplayString() Title Case values (Proposing, Observing,
Wrong Ledger, Switched Ledger) in the $consensus_mode template
variable description.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set Prometheus as default datasource (isDefault: true) so dashboard
panels using type: prometheus without explicit UID resolve correctly.
Previously Jaeger was implicitly the default, causing all PromQL
panels to return empty results.
- Explicitly set isDefault: false on Jaeger datasource to prevent
implicit default behavior.
- Add service_instance_id=rippled-standalone to xrpld-telemetry.cfg
so Grafana $node dropdown shows a readable name instead of the
base58 node public key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set service_instance_id=rippled-node-N in each test node's [telemetry]
section so Grafana/Tempo filters show readable names instead of base58
public keys. Update dashboard descriptions and Tempo datasource comments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OTel resource attribute service.instance.id is exported to Prometheus
as an 'instance' label, which then conflicts with the built-in scrape
'instance' label. Prometheus resolves this by renaming the exported label
to 'exported_instance'.
Update all dashboard PromQL queries and template variable queries to
use exported_instance so the Node dropdown correctly populates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard template variables reference datasource uid "prometheus" but
the provisioning config had no uid set, causing Grafana to auto-assign
a random one and break dashboard panel queries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resource_metrics_key_attributes to spanmetrics connector so
service.instance.id becomes a Prometheus label for per-node filtering
- Add 'node' dropdown (service_instance_id) to all 3 dashboards
- Add 'command' dropdown (xrpl_rpc_command) to RPC Performance
- Add 'tx_origin' dropdown (xrpl_tx_local) to Transaction Overview
- Add 'consensus_mode' dropdown (xrpl_consensus_mode) to Consensus Health
- Update all panel PromQL queries to include $node filter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Ledger Apply Duration and Close Time Agreement panels to
consensus-health dashboard
- Add consensus.accept.apply to telemetry runbook with TraceQL
queries for close time disagreements and consensus failures
- Add the consensus.accept.apply span to the TESTING.md expected span
catalog and verification loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.
Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value
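
A minimal sketch of the helper's shape. The enum below is a local stand-in for ConsensusMode, and the "Unknown" fallback is an assumption, not something stated in this commit:

```cpp
#include <string>

// Local stand-in for the four consensus modes named above.
enum class ConsensusMode { proposing, observing, wrongLedger, switchedLedger };

// Returns the Title Case display name for span attributes and dashboards.
std::string toDisplayString(ConsensusMode m)
{
    switch (m)
    {
        case ConsensusMode::proposing:
            return "Proposing";
        case ConsensusMode::observing:
            return "Observing";
        case ConsensusMode::wrongLedger:
            return "Wrong Ledger";
        case ConsensusMode::switchedLedger:
            return "Switched Ledger";
    }
    return "Unknown";  // assumed fallback for out-of-range values
}
```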
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:
- Deterministic trace_id from previousLedger.id() for cross-node
correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub
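
One way the deterministic trace_id could be derived, shown as a sketch only. Taking the first 16 bytes of a 32-byte ledger hash is an assumption; the commit does not state the exact derivation, and `LedgerHash`, `TraceId`, and `traceIdFromLedger` are hypothetical names:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using LedgerHash = std::array<std::uint8_t, 32>;
using TraceId = std::array<std::uint8_t, 16>;

// Derives a trace ID from the previous ledger's hash. Every node that
// agrees on the previous ledger computes the same ID, so their spans for
// the round land in one trace without any coordination.
TraceId traceIdFromLedger(LedgerHash const& previousLedger)
{
    TraceId id{};
    std::copy_n(previousLedger.begin(), id.size(), id.begin());
    return id;
}
```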
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
(per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing
Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.tx.hash (static), xrpl.tx.local and xrpl.tx.status
(dynamic) search filters for Phase 3 transaction span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.rpc.command (static), xrpl.rpc.status and xrpl.rpc.role
(dynamic) search filters for Phase 2 RPC span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>