- `src/test/telemetry/MetricsRegistry_test.cpp` (Beast `unit_test::suite`
format under `src/test/`) duplicates the GTest version already
maintained at `src/tests/libxrpl/telemetry/MetricsRegistry.cpp`.
Project rule (`tasks/lessons.md` §Test Format): all new tests use
GTest under `src/tests/libxrpl/`. The GTest version exercises the
same four cases (disabled construction, start/stop lifecycle, recording
no-op, destructor-calls-stop). Deleting the Beast duplicate eliminates
drift and keeps the test authoritative in one place.
- Drop the matching `test.telemetry > xrpl.basics/xrpl.core/xrpld.telemetry`
entries from `.github/scripts/levelization/results/ordering.txt`
because `xrpl.test.telemetry` (the GTest binary) retains its own
entries; the removed ones belonged to the deleted Beast suite.
- `.claude/instructions.md` was committed as a symlink to an
author-local absolute path
(`/home/pratik/sourceCode/personal/Rippled/instructions.md`) that does
not exist for any other contributor or in CI. Remove the symlink from
git tracking and add `.claude/` to
`.gitignore` so future agent commits do not re-add per-developer
settings.
MetricsRegistry observable-gauge callbacks run on the OTel reader thread
and read live state from nodeStore_, overlay_, networkOPs_, ledgerMaster,
inboundLedgers, loadManager, and others. The old shutdown sequence called
metricsRegistry_->stop() AFTER all those services were already stopped,
which left a race window between each service's stop() and the final
provider_->ForceFlush() during which a callback could dereference
already-stopped service state. The try/catch guards in each callback
caught exceptions but could not prevent reads from freed members.
- Add MetricsRegistry::detachCallbacks() that sets an atomic<bool>
callbacksDetached_ with release ordering. Idempotent.
- Guard every ObservableGauge callback entry with an acquire-load of the
same flag and return early if it is set. Covers all 15 registered
callbacks (cacheHitRate, txq, objectCount, loadFactor, nodeStore,
serverInfo, buildInfo, completeLedgers, dbMetrics, validatorHealth,
peerQuality, ledgerEconomy, stateTracking, storageDetail,
validationAgreement).
- Application::run() shutdown sequence now calls
metricsRegistry_->detachCallbacks() right after m_loadManager->stop()
and BEFORE m_shaMapStore, m_jobQueue, overlay_, grpcServer_,
m_networkOPs, serverHandler_, m_ledgerReplayer, m_inboundTransactions,
m_inboundLedgers, ledgerCleaner_, m_nodeStore, perfLog_ are stopped.
The acquire/release pair guarantees subsequent reader-thread ticks see
the detach before they dereference stopped services.
- metricsRegistry_->stop() keeps setting the flag as a belt-and-suspenders
defense in case a future caller forgets to detach first.
- Drop the misleading "No explicit RemoveCallback is needed" comment
from stop(); provider destruction alone does not beat the reader
thread to already-freed state.
The objectCountGauge callback previously discarded its state pointer
via `void* /* state */`; restore the state argument so it can access
self->callbacksDetached_ too.
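
The detach flag described above can be sketched roughly as follows. This is a minimal illustration, not the actual MetricsRegistry code: `MetricsRegistrySketch`, `FakeService`, and the callback signature are hypothetical stand-ins, and only `detachCallbacks()`, `stop()`, and `callbacksDetached_` are names taken from the commit message.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a service whose live state a gauge callback reads.
struct FakeService
{
    int objectCount = 42;
};

// Hypothetical, simplified registry; the real class owns OTel providers.
class MetricsRegistrySketch
{
public:
    explicit MetricsRegistrySketch(FakeService* service) : service_(service)
    {
    }

    // Idempotent: repeated calls simply store `true` again.
    void
    detachCallbacks() noexcept
    {
        callbacksDetached_.store(true, std::memory_order_release);
    }

    // stop() sets the flag as well, in case a caller forgot to detach first.
    void
    stop() noexcept
    {
        detachCallbacks();
        // Provider flush and teardown would follow here (omitted).
    }

    // Shape of an observable-gauge callback: the registry arrives as an
    // opaque state pointer. Returns true if a value was observed.
    static bool
    objectCountCallback(int& observed, void* state)
    {
        auto* self = static_cast<MetricsRegistrySketch*>(state);
        if (self->callbacksDetached_.load(std::memory_order_acquire))
            return false;  // detached: do not touch service state
        observed = self->service_->objectCount;
        return true;
    }

private:
    FakeService* service_;
    std::atomic<bool> callbacksDetached_{false};
};
```

Before `detachCallbacks()` the callback reads the service and reports a value; afterwards it returns early without touching the service. In this sketch the flag only stops callbacks that start after the store is visible; it does not wait for a callback that has already passed the check.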
Rename files to match the post-`docs.sh` project-wide rename and the UID
rename applied in the previous commit. Five phase-9 dashboards are
affected:
- rippled-fee-market.json -> xrpld-fee-market.json
- rippled-job-queue.json -> xrpld-job-queue.json
- rippled-peer-quality.json -> xrpld-peer-quality.json
- rippled-rpc-perf.json -> xrpld-rpc-perf-otel.json
- rippled-validator-health.json -> xrpld-validator-health.json
`rippled-rpc-perf.json` is renamed to `xrpld-rpc-perf-otel.json` (rather
than `xrpld-rpc-perf.json`) to avoid colliding with the
phase-6 `rpc-performance.json` dashboard which also uses the
`xrpld-rpc-perf` UID. The new filename matches its now-unique
`xrpld-rpc-perf-otel` UID that was set in the merge commit.
Phase 4 added a span catalog in `06-implementation-phases.md` listing the
source location for each consensus span. Line-number references such as
`Consensus.h:707` and `RCLConsensus.cpp:232/341/492/541/900` drift on
every refactor and would go stale PR after PR. The filename alone is
enough for operators to grep; the RCLConsensus.cpp spans are already
unambiguous from the span name itself.
Follow-up to the phase-6 dashboard cleanup. The three dashboards
introduced by commit f6105ece98 (consensus-health, rpc-performance,
transaction-overview) were missed in the initial UID rename and still
carried `rippled-*` UIDs plus line-number refs in panel descriptions.
- UIDs: `rippled-consensus` -> `xrpld-consensus`,
`rippled-rpc-perf` -> `xrpld-rpc-perf`,
`rippled-transactions` -> `xrpld-transactions`, matching the
post-`docs.sh`-rename runbook and the other dashboards in this PR.
- Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`,
`NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers
drift on every refactor; the filename is enough to grep.
- Fix the Overall RPC Throughput panel: two targets filtered on
`span_name="rpc.request"` (never emitted) instead of
`span_name="rpc.http_request"` (the real emitted name). The panel
would have shown zero data until this fix.
Follow-up to the dashboard cleanup on this branch. Caught additional sites
in TESTING.md that still reference the never-emitted `rpc.request` span:
- TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on
`name="rpc.http_request"` (the real emitted name).
- Expected-spans table replaces `rpc.request` with `rpc.http_request`.
- Query loop under the Prometheus verification section now iterates over
the full set of emitted RPC entry-point names
(`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`).
Also drop `exporter=otlp_http` from the sample telemetry config block.
`TelemetryConfig.cpp` does not parse an `exporter` key in any phase through
Phase 8; only OTLP/HTTP is wired up, so the line is either a silently
ignored no-op or misleading documentation.
Phase-6 introduces ledger-operations, peer-network, and the five StatsD
dashboards. Align them with the rest of the chain:
- Rename dashboard UIDs from `rippled-*` to `xrpld-*` so the provisioned
UIDs match the post-rename-script documentation (`docs.sh` rewrites
.md but not .json, so the two drifted). The runbook references
`xrpld-rpc-perf`, `xrpld-transactions`, etc.; the JSON now matches.
- Add the `$node` template variable + `exported_instance=~"$node"` filter
to every target in the five `statsd-*` dashboards. Mirrors the pattern
already used by consensus-health, ledger-operations, and peer-network
per the project rule that every dashboard must support per-node
filtering.
- Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references
in every dashboard panel description and in docker/telemetry/TESTING.md.
Line numbers drift on every refactor; the filename alone is enough to
grep.
- Replace stale `rpc.request` entries with the real emitted span names
(`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
in TESTING.md so operators can copy-paste the filters and hit real
traces.
- Also drop the `:706` line ref from the `StatsDCollector.cpp` callout
in `06-implementation-phases.md`.
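
For illustration, the `:<line>` and `:NN-NN` stripping described above could be expressed as a small regex helper. The commit does not say how the edit was actually performed; `stripLineRefs` and the set of file extensions it recognizes are assumptions made for this sketch.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: removes a ":<line>" or ":NN-NN" suffix that directly
// follows a C++ file name, leaving the file name itself in place.
// The extension list (.cpp, .h, .hpp, .ipp) is an assumption.
std::string
stripLineRefs(std::string const& text)
{
    // Group 1 captures the extension so the replacement can keep it.
    static std::regex const lineRef(R"((\.(?:cpp|h|hpp|ipp)):\d+(?:-\d+)?)");
    return std::regex_replace(text, lineRef, "$1");
}
```

Anchoring the match on the file extension keeps other colon-number pairs, such as the port in `localhost:4318`, untouched.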
- RPC Spans table: `rpc.request` was documented but the code actually emits
`rpc.http_request`. Listed the actual emitted names
(`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
and their parent/child relationship.
- Drop `:<line>` suffixes from Source File columns in both RPC and
Transaction span tables. Line numbers drift with every refactor; the
filename is enough for operators to grep.
- Summary table: replace the never-emitted `rpc.request` row with the real
entry points so `span_name=` filters in PromQL / TraceQL match.
Phase-1a plan documents advertised OTLP/gRPC on port 4317 as the default
exporter, four unparsed [telemetry] config keys, and "Phase 4a Complete"
status with exit-criteria checkboxes marked done. Every downstream branch
through Phase 5 ships only OTLP/HTTP on port 4318 via OtlpHttpExporterFactory,
never parses the advertised keys, and the Phase 4 work is not yet delivered.
Fixes:
- 02-design-decisions.md: flip §2.1.1 SDK dependency recommendations to
OTLP/HTTP (shipped) with OTLP/gRPC marked Future. Update §2.2 architecture
diagram and text from OTLP/gRPC:4317 to OTLP/HTTP:4318. Rewrite §2.2.1 as
"OTLP/HTTP (Shipped)" and §2.2.2 as "OTLP/gRPC (Future Work — Planned
Upgrade)" with a concrete checklist (Conan dep, config parsing, factory
branch, runbook/dashboard updates) for landing the gRPC transport later.
- 05-configuration-reference.md: drop the fabricated exporter/otlp_grpc key
and the :4317 default from the sample config block and the options-summary
table. Move trace_pathfind, trace_txq, trace_validator, trace_amendment
into a new "Planned (not yet implemented)" table citing the phase that will
add each one. Keep the example config minimal so copy-paste does not produce
a silently-ignored stanza.
- 06-implementation-phases.md: reset Phase 4 Exit Criteria checkboxes from
[x] to [ ] (Phase 4 is not shipped at Phase-1a time). Rename "Phase 4a
Complete" to "Phase 4a Plan" and describe the work as future. Replace the
broken forward link to Phase4_taskList.md (introduced in the Phase 2 PR)
with a sentence pointing readers to where that spec will land. Renumber
the final section 6.12 to 6.11 so it sits directly after 6.10; section 6.11
("Effort Summary") was intentionally removed in earlier edits.
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.
Add categoryToSpanKind() mapping:
- Rpc -> kServer (inbound synchronous request)
- Peer -> kConsumer (inbound async peer message)
- Transactions -> kInternal
- Consensus -> kInternal
- Ledger -> kInternal
Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
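
A minimal sketch of the mapping listed above. To stay self-contained it uses local enums: `SpanCategory` is a hypothetical name for the category type, and the local `SpanKind` only mirrors the enumerator names mentioned in this message rather than including the OpenTelemetry headers.

```cpp
#include <cassert>

// Hypothetical category enum; the enumerators follow the list above.
enum class SpanCategory { Rpc, Peer, Transactions, Consensus, Ledger };

// Local stand-in for the SDK's span kind, kept here so the sketch compiles
// without OpenTelemetry headers.
enum class SpanKind { kInternal, kServer, kClient, kProducer, kConsumer };

constexpr SpanKind
categoryToSpanKind(SpanCategory category) noexcept
{
    switch (category)
    {
        case SpanCategory::Rpc:
            return SpanKind::kServer;  // inbound synchronous request
        case SpanCategory::Peer:
            return SpanKind::kConsumer;  // inbound async peer message
        case SpanCategory::Transactions:
        case SpanCategory::Consensus:
        case SpanCategory::Ledger:
            return SpanKind::kInternal;
    }
    // Unknown values fall back to the previous default.
    return SpanKind::kInternal;
}
```

Listing every enumerator without a `default:` label lets the compiler warn when a new category is added without a mapping.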
Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
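
The naming convention referenced in the attribute fixes above can be illustrated with a few constants. This is a sketch only: the attribute strings come from the commit messages, but the constant identifiers, the `sketch` namespace, and the `isQualified` helper are hypothetical and do not reproduce `SpanNames.h`.

```cpp
#include <cassert>
#include <string_view>

namespace sketch {

// Bare, span-local attribute names (strings as given in the messages above).
inline constexpr std::string_view kCommand = "command";
inline constexpr std::string_view kRpcStatus = "rpc_status";
inline constexpr std::string_view kTxStatus = "tx_status";

// Shared canonical or domain-qualified attribute names.
inline constexpr std::string_view kTxHash = "xrpl.tx.hash";
inline constexpr std::string_view kLedgerSeq = "xrpl.ledger.seq";
inline constexpr std::string_view kConsensusMode = "xrpl.consensus.mode";

// True when a name carries the "xrpl." prefix.
constexpr bool
isQualified(std::string_view name) noexcept
{
    return name.substr(0, 5) == "xrpl.";
}

}  // namespace sketch
```

A dashboard filter or spanmetrics dimension has to use exactly the string the C++ side emits, so a filter on `xrpl.rpc.command` would not match a span that carries the bare `command` attribute.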
Panels 8-15 from statsd-node-health.json and panels 8-9 from
statsd-network-traffic.json were lost when Phase 7 renamed these files
to system-*. The merge (5cd71ed107) took Phase 7's smaller version
without the extra panels added by commit b933e8ae00 on Phase 6.
Recovered panels (system-node-health.json):
- Key Jobs Execution Time (11 job types)
- Key Jobs Dequeue Wait Time (11 job types)
- FullBelowCache Size
- FullBelowCache Hit Rate
- Ledger Publish Gap (validated - published age delta)
- State Duration Rate (Full vs Tracking)
- All Jobs Execution Time Detail (34 job types)
- All Jobs Dequeue Wait Detail (34 job types)
Recovered panels (system-network-traffic.json):
- Duplicate Traffic (Wasted Bandwidth)
- All Traffic Categories Detail (topk 15 by byte rate)
All recovered panels updated to include exported_instance=~"$node"
filter per project dashboard guidelines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>