Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Panels 8-15 from statsd-node-health.json and panels 8-9 from
statsd-network-traffic.json were lost when Phase 7 renamed these files
to system-*. The merge (5cd71ed107) took Phase 7's smaller version
without the extra panels added by commit b933e8ae00 on Phase 6.
Recovered panels (system-node-health.json):
- Key Jobs Execution Time (11 job types)
- Key Jobs Dequeue Wait Time (11 job types)
- FullBelowCache Size
- FullBelowCache Hit Rate
- Ledger Publish Gap (validated - published age delta)
- State Duration Rate (Full vs Tracking)
- All Jobs Execution Time Detail (34 job types)
- All Jobs Dequeue Wait Detail (34 job types)
Recovered panels (system-network-traffic.json):
- Duplicate Traffic (Wasted Bandwidth)
- All Traffic Categories Detail (topk 15 by byte rate)
All recovered panels updated to include exported_instance=~"$node"
filter per project dashboard guidelines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MetricsRegistry emits OTel SDK metrics with the xrpld_ prefix
(MetricsRegistry.cpp defines "xrpld_nodestore_state",
"xrpld_cache_metrics", etc.), but the Phase 9 dashboards and the
Step 10c integration-test assertions introduced in 892fee638a
queried the rippled_ prefix. Every Phase 9 panel and assertion
therefore rendered "No data" or failed on a live run, even though
the underlying series were being exported correctly.
Rename the rippled_ prefix to xrpld_ for every MetricsRegistry
metric in dashboards and the integration test:
- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
object_count
- rpc_method_started_total / _finished_total / _errored_total /
_duration_us_bucket
- job_queued_total / _started_total / _finished_total /
_queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
db_metrics, complete_ledgers, build_info, state_tracking
- ledgers_closed_total, validations_sent_total,
validations_checked_total, state_changes_total
- validation_agreement (ValidationTracker 1h/24h/7d windows)
Also add ValidationTracker window-gauge assertions to Step 10c of
integration-test.sh so the 1h/24h/7d agreement and miss counts are
checked alongside the other Phase 9 gauges.
The rippled_ prefix is preserved for beast::insight metrics
(rippled_LedgerMaster_*, rippled_Peer_Finder_*, rippled_total_*,
rippled_Overlay_*, rippled_State_Accounting_*, rippled_transactions_*,
rippled_proposals_*, rippled_validations_Messages_*) because those
flow through the StatsD-style OTelCollector configured with
`[insight] prefix=rippled` and remain on that prefix by design.
Verified against a live 6-node consensus network: all 22 Phase 9 +
ValidationTracker assertions now report 6+ series per metric.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tempo /api/traces/{id} returns OTLP-shaped JSON with a top-level
"batches" key, not "data". The cross-check in check_log_correlation
was querying jq '.data | length' which always returned null, causing
the Log-Tempo cross-check to fail even when the trace existed.
- Replace GetSpan() with direct context value check in Logs::format()
to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
detect missing log files, fix silent grep error masking
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix use-after-free: extract gauge callback to static function and call
RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
[0, INT64_MAX] not [0, UINT64_MAX]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The StatsD receiver config was lost during a branch rebase (--ours
conflict resolution dropped it). Re-add the statsd receiver to the
OTel Collector config and wire it into the metrics pipeline so
beast::insight UDP metrics flow to Prometheus.
Also fixes:
- Metric prefix mismatch: docs used xrpld_ but dashboards/tests use
rippled_ — align all documentation to match the runnable stack
- Remove phantom Peer_Disconnects_Charges from docs (plain atomic,
not a beast::insight gauge)
- Remove premature .codecov.yml exclusions for Phase 7 OTelCollector
files that don't exist on this branch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add missing xrpl.consensus.quorum attribute to consensus.accept in runbook
- Fix dashboard legend formats: add exported_instance, use Title Case
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 14 missing spans to runbook (6 TxQ + 8 consensus)
- Fix tx.receive attributes and config table in runbook
- Document dispute.resolve and tx.included span events
- Add spanmetrics dimensions for close_time_correct and tx.suppressed
- Fix Close Time Agreement and TX Receive vs Suppressed panel PromQL
- Wire $consensus_mode template variable to all consensus panels
- Add 10 Tempo search filters for operational attributes
- Apply rename script artifacts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve merge conflicts taking phase 4 consensus span improvements,
fix bashate indentation in integration test script, and apply rename
script to Phase5 integration test docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 5 new panels to the consensus-health Grafana dashboard using Tempo
TraceQL queries against consensus.accept.apply span attributes:
- Close Time: Raw Proposals (Per Node) — each node's unrounded
wall-clock close_time_self, reveals clock drift across validators
- Close Time: Effective / Quantized — the consensus-agreed close_time
after rounding to resolution bins, written to ledger header
- Close Time Vote Bins & Resolution — number of distinct vote bins
(close_time_vote_bins) and bin size (close_resolution_ms) on dual axes
- Close Time Resolution Direction — whether resolution increased
(coarser), decreased (finer), or stayed unchanged
- Close Time Bin Distribution — bar chart showing how raw proposals
distribute across quantized bins per round
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>