Add $node template variable (exported_instance) to rippled-fee-market,
rippled-job-queue, and rippled-rpc-perf dashboards enabling multi-node
filtering. Add $job_type variable to job-queue and $method variable to
rpc-perf dashboards. Inject exported_instance=~"$node" filter into all
PromQL queries across these dashboards including rate(), histogram_quantile(),
topk(), and sum() expressions. Also add the instance filter to Phase 9
panels (NodeStore, Cache, CountedObjects) in system-node-health dashboard.
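
The filter injection described above can be sketched as a small rewrite helper. This is an illustration only, not the tooling used for this change: the helper name `add_node_filter` and the metric names in the usage lines are hypothetical, and the sketch only handles plain selectors for metric names passed in explicitly (label values containing `}` are out of scope).

```python
import re

NODE_MATCHER = 'exported_instance=~"$node"'


def add_node_filter(expr, metric_names, matcher=NODE_MATCHER):
    """Append `matcher` to every selector of the given metrics in `expr`."""
    for name in metric_names:
        # Match the metric name plus an optional {...} label block.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b(\{([^}]*)\})?")

        def repl(match, name=name):
            existing = (match.group(2) or "").strip()
            if existing:
                # Keep the existing matchers and add the node filter.
                return name + "{" + existing + ", " + matcher + "}"
            return name + "{" + matcher + "}"

        expr = pattern.sub(repl, expr)
    return expr


# Hypothetical metric names, used only to exercise the helper.
rate_query = add_node_filter(
    "sum(rate(rippled_rpc_requests[5m]))", ["rippled_rpc_requests"]
)
quantile_query = add_node_filter(
    "histogram_quantile(0.95, sum by (le) "
    '(rate(rippled_rpc_time_bucket{method=~"$method"}[5m])))',
    ["rippled_rpc_time_bucket"],
)
print(rate_query)
print(quantile_query)
```

The same matcher is added whether the selector sits inside rate(), histogram_quantile(), topk(), or sum(), because the rewrite targets the selector rather than the enclosing function.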
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upgrade Loki from 2.9.0 to 3.4.2 which supports native OTLP ingestion.
Replace removed `loki` exporter with `otlphttp/loki` pointed at Loki's
/otlp endpoint. The `loki` exporter was dropped in otel-collector-contrib
v0.147.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.
Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.
Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.
Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.
Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.
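
As a rough sketch of Tasks 8.3 and 8.5 together, the snippet below extracts the listed fields from a log line and then cross-checks each trace_id against a lookup. The line layout, the regex, and the sample values are assumptions for illustration; the real regex_parser pattern and the Jaeger query in integration-test.sh are not reproduced here, and `trace_exists` is a hypothetical callable standing in for that query.

```python
import re

# Assumed layout (not taken from the actual collector config):
# "<date> <time> UTC <partition>:<severity> trace_id=<32 hex> span_id=<16 hex> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+ UTC) "
    r"(?P<partition>[^:\s]+):(?P<severity>[A-Z]{3}) "
    r"(?:trace_id=(?P<trace_id>[0-9a-f]{32}) span_id=(?P<span_id>[0-9a-f]{16}) )?"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def correlated_trace_ids(lines, trace_exists):
    """Collect trace_ids found in the log that the trace backend also knows."""
    found = []
    for line in lines:
        fields = parse_line(line)
        if fields and fields["trace_id"] and trace_exists(fields["trace_id"]):
            found.append(fields["trace_id"])
    return found


# Sample lines are invented for this sketch.
sample = [
    "2025-Jan-15 10:23:45.123456789 UTC LedgerConsensus:NFO "
    "trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 "
    "Accepted ledger",
    "2025-Jan-15 10:23:46.000000001 UTC NetworkOPs:WRN No active span here",
]
known_traces = {"4bf92f3577b34da6a3ce929d0e0e4736"}  # stand-in for the Jaeger lookup

parsed = parse_line(sample[0])
matches = correlated_trace_ids(sample, lambda tid: tid in known_traces)
print(parsed["partition"], parsed["severity"], matches)
```

Lines written without an active span still parse; their trace_id and span_id fields come back as None and are skipped by the correlation check.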
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Grafana Loki data source with derivedFields config linking
trace_id values in log lines to Tempo traces. This enables one-click
log-to-trace navigation in Grafana Explore.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
label_values() is a Grafana template-variable query function for the
Prometheus datasource, not PromQL. Pass the metric name as its first
argument; without it, the dropdown showed raw label hashes instead of
readable node names like "validator-0:6006".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.
- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs
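
The dot-to-underscore formatting from Task 7.5 can be illustrated with a minimal sketch. The helper name, the prefix handling, and the example metric names are assumptions for illustration, not taken from the OTelCollector code.

```python
def prometheus_name(prefix, name):
    """Join prefix and name, then replace dots with underscores."""
    full = f"{prefix}.{name}" if prefix else name
    return full.replace(".", "_")


# Hypothetical metric names, used only to show the mapping.
print(prometheus_name("rippled", "jobq.job_count"))  # rippled_jobq_job_count
print(prometheus_name("", "ledger.age"))             # ledger_age
```

Keeping the names identical to what the StatsD path produced in Prometheus means existing dashboard queries keep resolving after the transport change.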
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match the toDisplayString() Title Case values (Proposing, Observing,
Wrong Ledger, Switched Ledger) in the $consensus_mode template
variable description.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set Prometheus as default datasource (isDefault: true) so dashboard
  panels using type: prometheus without explicit UID resolve correctly.
  Previously Jaeger was implicitly the default, causing all PromQL
  panels to return empty results.
- Explicitly set isDefault: false on Jaeger datasource to prevent
  implicit default behavior.
- Add service_instance_id=rippled-standalone to xrpld-telemetry.cfg
  so Grafana $node dropdown shows a readable name instead of the
  base58 node public key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set service_instance_id=rippled-node-N in each test node's [telemetry]
section so Grafana/Tempo filters show readable names instead of base58
public keys. Update dashboard descriptions and Tempo datasource comments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OTel resource attribute service.instance.id is exported to
Prometheus as an 'instance' label, which conflicts with the 'instance'
target label that Prometheus attaches at scrape time. By default
(honor_labels: false) Prometheus keeps its own label and renames the
scraped one to 'exported_instance'.
Update all dashboard PromQL queries and template variable queries to
use exported_instance so the Node dropdown correctly populates.
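
The renaming can be modeled with a short sketch of how Prometheus merges scraped labels with target labels when honor_labels is false (its default). The function name and the sample label values are made up for illustration.

```python
def merge_scrape_labels(scraped, target):
    """Model Prometheus label merging with honor_labels: false.

    Scraped labels that collide with a target label are kept under an
    "exported_" prefix; the target label wins the original name.
    """
    merged = {}
    for key, value in scraped.items():
        if key in target:
            merged["exported_" + key] = value
        else:
            merged[key] = value
    merged.update(target)
    return merged


# Sample values are invented for this sketch.
labels = merge_scrape_labels(
    scraped={"instance": "rippled-node-1", "job": "rippled"},
    target={"instance": "otel-collector:8889", "job": "otel-collector"},
)
print(labels["exported_instance"])
```

This is why dashboard queries have to select on exported_instance: the plain instance label ends up holding the scrape target's address rather than the node name.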
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard template variables reference datasource uid "prometheus" but
the provisioning config had no uid set, causing Grafana to auto-assign
a random one and break dashboard panel queries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resource_metrics_key_attributes to spanmetrics connector so
  service.instance.id becomes a Prometheus label for per-node filtering
- Add 'node' dropdown (service_instance_id) to all 3 dashboards
- Add 'command' dropdown (xrpl_rpc_command) to RPC Performance
- Add 'tx_origin' dropdown (xrpl_tx_local) to Transaction Overview
- Add 'consensus_mode' dropdown (xrpl_consensus_mode) to Consensus Health
- Update all panel PromQL queries to include $node filter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Ledger Apply Duration and Close Time Agreement panels to
  consensus-health dashboard
- Add consensus.accept.apply to telemetry runbook with TraceQL
  queries for close time disagreements and consensus failures
- Add span to TESTING.md expected span catalog and verification loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
  (per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing
Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.
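
To make the 75% threshold concrete, here is a simplified sketch of deciding whether enough validators agree on one close time. It is not the rippled implementation: the function name is hypothetical, the vote list is invented, and rippled's own rounding of the participant threshold may differ from the plain percentage comparison used here.

```python
from collections import Counter

CLOSE_TIME_CONSENSUS_PCT = 75  # value quoted above for avCT_CONSENSUS_PCT


def close_time_agreed(votes, threshold_pct=CLOSE_TIME_CONSENSUS_PCT):
    """Return (most common close time, whether it reached the threshold)."""
    if not votes:
        return None, False
    close_time, count = Counter(votes).most_common(1)[0]
    # Integer comparison avoids floating-point rounding at the boundary.
    return close_time, count * 100 >= threshold_pct * len(votes)


# Invented close times in epoch seconds, one per validator.
agreed = close_time_agreed([100, 100, 100, 110])
split = close_time_agreed([100, 100, 110, 110])
print(agreed, split)
```

In this model the first result would correspond to xrpl.consensus.close_time_correct being true, and the second to a close time disagreement.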
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.tx.hash (static), xrpl.tx.local and xrpl.tx.status
(dynamic) search filters for Phase 3 transaction span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add xrpl.rpc.command (static), xrpl.rpc.status and xrpl.rpc.role
(dynamic) search filters for Phase 2 RPC span attributes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add resource-level search filters for node identity:
- service.instance.id (node public key) — unique node identifier
- service.version (rippled build version)
- xrpl.network.id (numeric network ID)
- xrpl.network.type (mainnet/testnet/devnet/standalone)
These enable filtering traces by specific nodes in multi-node
deployments and by network in mixed environments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Configure Grafana Tempo datasource with pre-built search filters
(service.name, span name, status, duration) for the Explore UI.
Enable Tempo metrics_generator with service-graphs and span-metrics
processors to power Grafana's service map visualization.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Tempo 2.7.2 service to docker-compose with local storage
- Add otlp/tempo exporter to OTel Collector traces pipeline
- Add Tempo Grafana datasource provisioning with node graph
- Update 05-configuration-reference.md examples with Tempo
- OTel Collector fans traces to both Jaeger and Tempo simultaneously

Jaeger provides a standalone UI at :16686 for quick lookups.
Tempo is queryable via Grafana Explore using TraceQL and is the
recommended backend for production (supports S3/GCS storage).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>