The label_values() Grafana template query function (Prometheus data
source) needs a metric name as its first argument here. Without it, the
query returns raw label hashes instead of readable node names like
"validator-0:6006".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
followed by AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
using boundaries_ member instead of aggregate initialization
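
The signed-to-unsigned conversion in Add() can be sketched as follows.
This is a minimal illustration, not the OTelCollector code: the
UInt64CounterStub and CounterAdapter names are hypothetical stand-ins,
and dropping negative increments is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an unsigned SDK counter; it only accumulates.
struct UInt64CounterStub
{
    std::uint64_t total = 0;

    void
    Add(std::uint64_t value)
    {
        total += value;
    }
};

// Hypothetical adapter that keeps a signed increment() interface and
// converts to the unsigned type at the SDK boundary.
class CounterAdapter
{
public:
    explicit CounterAdapter(UInt64CounterStub& counter) : counter_(counter)
    {
    }

    void
    increment(std::int64_t amount)
    {
        // Assumption: counters are monotonic, so non-positive increments
        // are dropped instead of being wrapped around by the cast.
        if (amount <= 0)
            return;
        counter_.Add(static_cast<std::uint64_t>(amount));
    }

private:
    UInt64CounterStub& counter_;
};
```

Checking the sign before the cast keeps a stray negative value from
turning into a very large unsigned increment.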
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add $node template variable to all 5 system-* Grafana dashboards
with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
service.instance.id resource attribute on metrics (matches trace
exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.
- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
reference docs
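
The dot-to-underscore naming from Task 7.5 might look roughly like the
sketch below. formatMetricName and its prefix handling are hypothetical;
the only behavior taken from the task list is that dots become
underscores so names match the StatsD->Prometheus convention.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: joins an optional prefix and a metric name with a
// dot, then rewrites every dot as an underscore.
std::string
formatMetricName(std::string const& prefix, std::string const& name)
{
    std::string out = prefix.empty() ? name : prefix + "." + name;
    std::replace(out.begin(), out.end(), '.', '_');
    return out;
}
```

For example, formatMetricName("rippled", "ledger.close_time") yields
"rippled_ledger_close_time" (both input strings are made up for
illustration).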
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.
Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.
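
A minimal sketch of the marker placement, assuming the disabled form of
the macro is ((void)0) as described above. The attribute keys and the
processLedger function are hypothetical.

```cpp
#include <cassert>

// Disabled form of the macro, as used when XRPL_ENABLE_TELEMETRY is not
// defined. The arguments are discarded by the preprocessor.
#define XRPL_TRACE_SET_ATTR(key, value) ((void)0)

int
processLedger(int txCount)
{
    // Single-line call: one marker is enough.
    XRPL_TRACE_SET_ATTR("xrpl.example.tx_count", txCount);  // LCOV_EXCL_LINE

    // Call split by the formatter: mark every physical line.
    XRPL_TRACE_SET_ATTR(                      // LCOV_EXCL_LINE
        "xrpl.example.doubled", txCount * 2);  // LCOV_EXCL_LINE

    return txCount + 1;
}
```

The comments are removed before macro expansion, so they do not affect
the generated code; they only tell lcov to skip those lines.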
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the toDisplayString() Title Case values (Proposing, Observing,
Wrong Ledger, Switched Ledger) in the $consensus_mode template
variable description.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set Prometheus as default datasource (isDefault: true) so dashboard
panels using type: prometheus without explicit UID resolve correctly.
Previously Jaeger was implicitly the default, causing all PromQL
panels to return empty results.
- Explicitly set isDefault: false on Jaeger datasource to prevent
implicit default behavior.
- Add service_instance_id=rippled-standalone to xrpld-telemetry.cfg
so Grafana $node dropdown shows a readable name instead of the
base58 node public key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set service_instance_id=rippled-node-N in each test node's [telemetry]
section so Grafana/Tempo filters show readable names instead of base58
public keys. Update dashboard descriptions and Tempo datasource comments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OTel Collector's Prometheus exporter maps the OTel resource attribute
service.instance.id to the 'instance' label, which then conflicts with
the built-in scrape 'instance' label. Prometheus resolves this (with
honor_labels left at false) by renaming the exported label to
'exported_instance'.
Update all dashboard PromQL queries and template variable queries to
use exported_instance so the Node dropdown correctly populates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard template variables reference datasource uid "prometheus" but
the provisioning config had no uid set, causing Grafana to auto-assign
a random one and break dashboard panel queries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resource_metrics_key_attributes to spanmetrics connector so
service.instance.id becomes a Prometheus label for per-node filtering
- Add 'node' dropdown (service_instance_id) to all 3 dashboards
- Add 'command' dropdown (xrpl_rpc_command) to RPC Performance
- Add 'tx_origin' dropdown (xrpl_tx_local) to Transaction Overview
- Add 'consensus_mode' dropdown (xrpl_consensus_mode) to Consensus Health
- Update all panel PromQL queries to include $node filter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Ledger Apply Duration and Close Time Agreement panels to
consensus-health dashboard
- Add consensus.accept.apply to telemetry runbook with TraceQL
queries for close time disagreements and consensus failures
- Add the consensus.accept.apply span to the TESTING.md expected span
catalog and verification loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.
Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value
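
A sketch of what such a helper could look like. The enum below is a
local copy for illustration and the "Unknown" fallback is an assumption;
the four display names are the ones listed above.

```cpp
#include <cassert>
#include <string>

// Local stand-in for the consensus mode enum.
enum class ConsensusMode { proposing, observing, wrongLedger, switchedLedger };

// Title Case names intended for span attributes and dashboards.
inline std::string
toDisplayString(ConsensusMode m)
{
    switch (m)
    {
        case ConsensusMode::proposing:
            return "Proposing";
        case ConsensusMode::observing:
            return "Observing";
        case ConsensusMode::wrongLedger:
            return "Wrong Ledger";
        case ConsensusMode::switchedLedger:
            return "Switched Ledger";
    }
    return "Unknown";  // assumption: fallback for out-of-range values
}
```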
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:
- Deterministic trace_id from previousLedger.id() for cross-node
correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub
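
One way a deterministic trace ID could be derived from the previous
ledger ID is sketched below. The commit does not specify the derivation,
so taking the first 16 bytes of a 32-byte hash is purely an assumption,
as are the LedgerHash, TraceId, and deriveTraceId names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using LedgerHash = std::array<std::uint8_t, 32>;  // hypothetical alias
using TraceId = std::array<std::uint8_t, 16>;     // 128-bit trace ID

// Every node that agrees on the previous ledger computes the same trace
// ID, so spans from different nodes land in the same trace.
TraceId
deriveTraceId(LedgerHash const& previousLedgerId)
{
    TraceId id{};
    std::copy_n(previousLedgerId.begin(), id.size(), id.begin());

    // An all-zero trace ID is invalid in W3C Trace Context, so nudge it.
    bool const allZero = std::all_of(
        id.begin(), id.end(), [](std::uint8_t b) { return b == 0; });
    if (allZero)
        id.back() = 1;
    return id;
}
```

Because the function is pure, no coordination between nodes is needed
beyond agreeing on the previous ledger.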
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
(per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing
Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.tx.hash (static), xrpl.tx.local and xrpl.tx.status
(dynamic) search filters for Phase 3 transaction span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.rpc.command (static), xrpl.rpc.status and xrpl.rpc.role
(dynamic) search filters for Phase 2 RPC span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add tracing instrumentation to the RPC request handling layer:
TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined
RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility
ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
("rpc.ws_message"), and processRequest ("rpc.process")
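
The RAII pattern behind the span macros can be sketched like this. It is
a simplified illustration: SpanLog is a hypothetical recorder standing in
for a tracer, and this SpanGuard is not the class from
TracingInstrumentation.h.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder standing in for a tracer backend.
struct SpanLog
{
    std::vector<std::string> events;
};

// Starts a span on construction and ends it on destruction, so every
// exit path from the enclosing scope closes the span.
class SpanGuard
{
public:
    SpanGuard(SpanLog& log, std::string name)
        : log_(log), name_(std::move(name))
    {
        log_.events.push_back("start " + name_);
    }

    ~SpanGuard()
    {
        log_.events.push_back("end " + name_);
    }

    SpanGuard(SpanGuard const&) = delete;
    SpanGuard&
    operator=(SpanGuard const&) = delete;

private:
    SpanLog& log_;
    std::string name_;
};

// Mirrors the "rpc.command.<method>" span naming described above.
void
handleCommand(SpanLog& log, std::string const& method)
{
    SpanGuard guard(log, "rpc.command." + method);
    log.events.push_back("handle " + method);
}
```

Calling handleCommand(log, "server_info") records the start event, the
handler event, and the end event in that order.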
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only override serviceInstanceId with the node public key when the user
hasn't explicitly set service_instance_id in the [telemetry] section.
This allows operators to assign human-friendly names (e.g. "validator-1")
while still defaulting to the node's base58 public key.
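
The precedence rule can be sketched as a small function.
resolveServiceInstanceId is a hypothetical name, and treating an empty
string as "not set" is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Prefer the operator-configured name; fall back to the node public key
// only when nothing was configured.
std::string
resolveServiceInstanceId(
    std::string const& configured,
    std::string const& nodePublicKey)
{
    return configured.empty() ? nodePublicKey : configured;
}
```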
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Telemetry object is constructed in ApplicationImp's member initializer
list where nodeIdentity_ is not yet available, resulting in an empty
service.instance.id resource attribute. Add setServiceInstanceId() virtual
method that Application::setup() calls after nodeIdentity_ is known but
before telemetry_->start() creates the OTel resource.
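
The ordering constraint can be illustrated with a small stand-in class.
TelemetrySketch is hypothetical; it only models the rule that the
instance ID must be set before start() builds the resource.

```cpp
#include <cassert>
#include <string>
#include <utility>

class TelemetrySketch
{
public:
    // Must be called before start(); the resource is fixed after that.
    void
    setServiceInstanceId(std::string id)
    {
        assert(!started_);
        serviceInstanceId_ = std::move(id);
    }

    // Snapshots the instance ID into the (simulated) resource attribute.
    void
    start()
    {
        resourceInstanceId_ = serviceInstanceId_;
        started_ = true;
    }

    std::string const&
    resourceInstanceId() const
    {
        return resourceInstanceId_;
    }

private:
    std::string serviceInstanceId_;
    std::string resourceInstanceId_;
    bool started_ = false;
};
```

Constructing the object early and filling in the ID later lets the
member initializer order stay unchanged.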
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add resource-level search filters for node identity:
- service.instance.id (node public key) — unique node identifier
- service.version (rippled build version)
- xrpl.network.id (numeric network ID)
- xrpl.network.type (mainnet/testnet/devnet/standalone)
These enable filtering traces by specific nodes in multi-node
deployments and by network in mixed environments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Configure Grafana Tempo datasource with pre-built search filters
(service.name, span name, status, duration) for the Explore UI.
Enable Tempo metrics_generator with service-graphs and span-metrics
processors to power Grafana's service map visualization.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Tempo 2.7.2 service to docker-compose with local storage
- Add otlp/tempo exporter to OTel Collector traces pipeline
- Add Tempo Grafana datasource provisioning with node graph
- Update 05-configuration-reference.md examples with Tempo
- OTel Collector fans traces to both Jaeger and Tempo simultaneously
Jaeger provides a standalone UI at :16686 for quick lookups.
Tempo is queryable via Grafana Explore using TraceQL and is the
recommended backend for production (supports S3/GCS storage).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>