Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation

# Conflicts:
#   OpenTelemetryPlan/09-data-collection-reference.md
2026-07-25 08:00:22 +00:00 · 2026-06-05 19:42:53 +01:00
parent fd1c8c6060 2ee4d2ff2d
commit 758a3fec29
9 changed files with 1098 additions and 632 deletions
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -78,22 +78,45 @@ There are two independent telemetry pipelines entering a single **OTel Collector

 ## 1. OpenTelemetry Spans

-### 1.1 Complete Span Inventory (16 spans)
+### 1.1 Complete Span Inventory (~37 spans)

 > **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.

+> **Span names vs. attribute keys**: span names use dotted `subsystem.operation`
+> form (e.g. `rpc.http_request`). Span _attribute_ keys use the bare/underscore
+> form from the 2026-05-13 naming redesign (e.g. `tx_hash`, not `xrpl.tx.hash`).
+> The dotted `xrpl.*` form is reserved for OTel **resource** attributes set once
+> at startup, plus the single span attribute `xrpl.ledger.hash`. See §1.2 for the full attribute inventory.
+
 #### RPC Spans

 Controlled by `trace_rpc=1` in `[telemetry]` config.

-| Span Name            | Parent        | Source File       | Description                                                              |
-| -------------------- | ------------- | ----------------- | ------------------------------------------------------------------------ |
-| `rpc.request`        | —             | ServerHandler.cpp | Top-level HTTP RPC request entry point                                   |
-| `rpc.process`        | `rpc.request` | ServerHandler.cpp | RPC processing pipeline                                                  |
-| `rpc.ws_message`     | —             | ServerHandler.cpp | WebSocket message handling                                               |
-| `rpc.command.<name>` | `rpc.process` | RPCHandler.cpp    | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
+| Span Name            | Parent             | Source File       | Description                                                              |
+| -------------------- | ------------------ | ----------------- | ------------------------------------------------------------------------ |
+| `rpc.http_request`   | —                  | ServerHandler.cpp | Top-level HTTP JSON-RPC request entry point                              |
+| `rpc.ws_message`     | —                  | ServerHandler.cpp | WebSocket message handling (one per inbound frame)                       |
+| `rpc.ws_upgrade`     | —                  | ServerHandler.cpp | WebSocket upgrade handshake (records handshake failures)                 |
+| `rpc.process`        | `rpc.http_request` | ServerHandler.cpp | RPC processing pipeline (single or batch request)                        |
+| `rpc.command.<name>` | `rpc.process`      | RPCHandler.cpp    | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |

-**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"rpc.request|rpc.command.*"}`
+**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"rpc.http_request|rpc.command.*"}`
+
+**Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)
+
+#### gRPC Spans
+
+Controlled by `trace_rpc=1` in `[telemetry]` config.
+
+| Span Name           | Parent | Source File    | Description                                                                                                               |
+| ------------------- | ------ | -------------- | ------------------------------------------------------------------------------------------------------------------------- |
+| `grpc.<MethodName>` | —      | GRPCServer.cpp | One flat span per gRPC method (e.g., `grpc.GetLedger`, `grpc.GetLedgerData`, `grpc.GetLedgerDiff`, `grpc.GetLedgerEntry`) |
+
+The method name is embedded in the span name (formed at the call site as
+`grpc.<MethodName>`), so dashboards break out per-method latency and error
+rates without TraceQL attribute filters.
+
+**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"grpc.*"}`

 **Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)

@@ -121,17 +144,46 @@ or, for the apply pipeline: `{resource.service.name="xrpld" && name=~"tx.preflig

 **Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)

+#### Transaction Queue (TxQ) Spans
+
+Controlled by `trace_transactions=1` in `[telemetry]` config.
+
+| Span Name          | Parent        | Source File | Description                                         |
+| ------------------ | ------------- | ----------- | --------------------------------------------------- |
+| `txq.enqueue`      | `tx.process`  | TxQ.cpp     | Enqueue decision when a tx is submitted             |
+| `txq.apply_direct` | `txq.enqueue` | TxQ.cpp     | Direct apply attempt that bypasses the queue        |
+| `txq.batch_clear`  | `txq.enqueue` | TxQ.cpp     | Batch clear of an account's queued txs              |
+| `txq.accept`       | —             | TxQ.cpp     | Ledger-close accept loop (drains the queue)         |
+| `txq.accept.tx`    | `txq.accept`  | TxQ.cpp     | Per-queued-transaction apply inside the accept loop |
+| `txq.cleanup`      | —             | TxQ.cpp     | Post-close cleanup of expired queue entries         |
+
+**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"txq.*"}`
+
+**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
+
 #### Consensus Spans

 Controlled by `trace_consensus=1` in `[telemetry]` config.

-| Span Name                   | Parent | Source File      | Description                                   |
-| --------------------------- | ------ | ---------------- | --------------------------------------------- |
-| `consensus.proposal.send`   | —      | RCLConsensus.cpp | Node broadcasts its transaction set proposal  |
-| `consensus.ledger_close`    | —      | RCLConsensus.cpp | Ledger close event triggered by consensus     |
-| `consensus.accept`          | —      | RCLConsensus.cpp | Consensus accepts a ledger (round complete)   |
-| `consensus.validation.send` | —      | RCLConsensus.cpp | Validation message sent after ledger accepted |
-| `consensus.accept.apply`    | —      | RCLConsensus.cpp | Ledger application with close time details    |
+| Span Name                      | Parent             | Source File      | Description                                                         |
+| ------------------------------ | ------------------ | ---------------- | ------------------------------------------------------------------- |
+| `consensus.round`              | — (root)           | RCLConsensus.cpp | Root span for one consensus round (deterministic trace per round)   |
+| `consensus.phase.open`         | `consensus.round`  | Consensus.h      | Open phase — collecting transactions before close                   |
+| `consensus.proposal.send`      | `consensus.round`  | RCLConsensus.cpp | Node broadcasts its transaction set proposal                        |
+| `consensus.ledger_close`       | `consensus.round`  | RCLConsensus.cpp | Ledger close event triggered by consensus                           |
+| `consensus.establish`          | `consensus.round`  | Consensus.h      | Establish phase — converging on the transaction set                 |
+| `consensus.update_positions`   | `consensus.round`  | Consensus.h      | Position update with per-dispute vote details                       |
+| `consensus.check`              | `consensus.round`  | Consensus.h      | Consensus threshold check (agree/disagree tally)                    |
+| `consensus.accept`             | `consensus.round`  | RCLConsensus.cpp | Consensus accepts a ledger (round complete)                         |
+| `consensus.accept.apply`       | `consensus.accept` | RCLConsensus.cpp | Ledger application with close-time details (jtACCEPT thread)        |
+| `consensus.validation.send`    | `consensus.round`  | RCLConsensus.cpp | Validation message sent after ledger accepted (follows-from link)   |
+| `consensus.mode_change`        | `consensus.round`  | RCLConsensus.cpp | Operating-mode transition during the round                          |
+| `consensus.proposal.receive`   | (context)          | PeerImp.cpp      | Proposal received from a peer (context-propagated into the round)   |
+| `consensus.validation.receive` | (context)          | PeerImp.cpp      | Validation received from a peer (context-propagated into the round) |
+
+The `.receive` spans are created per-message in the overlay and joined to the
+round trace via context propagation rather than direct parenting. The
+`consensus.validation.send` span uses a follows-from link off the round.

 **Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"consensus.*"}`

@@ -141,11 +193,12 @@ Controlled by `trace_consensus=1` in `[telemetry]` config.

 Controlled by `trace_ledger=1` in `[telemetry]` config.

-| Span Name         | Parent | Source File      | Description                                    |
-| ----------------- | ------ | ---------------- | ---------------------------------------------- |
-| `ledger.build`    | —      | BuildLedger.cpp  | Build new ledger from accepted transaction set |
-| `ledger.validate` | —      | LedgerMaster.cpp | Ledger promoted to validated status            |
-| `ledger.store`    | —      | LedgerMaster.cpp | Ledger stored to database/history              |
+| Span Name         | Parent | Source File       | Description                                    |
+| ----------------- | ------ | ----------------- | ---------------------------------------------- |
+| `ledger.build`    | —      | BuildLedger.cpp   | Build new ledger from accepted transaction set |
+| `ledger.validate` | —      | LedgerMaster.cpp  | Ledger promoted to validated status            |
+| `ledger.store`    | —      | LedgerMaster.cpp  | Ledger stored to database/history              |
+| `ledger.acquire`  | —      | InboundLedger.cpp | Fetch a missing ledger from peers              |

 **Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"ledger.*"}`

@@ -164,88 +217,206 @@ Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (h

 **Grafana dashboard**: _Peer Network_ (`xrpld-peer-net`)

+#### PathFind Spans
+
+Controlled by `trace_rpc=1` in `[telemetry]` config.
+
+| Span Name             | Parent             | Source File     | Description                                                |
+| --------------------- | ------------------ | --------------- | ---------------------------------------------------------- |
+| `pathfind.request`    | `rpc.command.*`    | PathRequest.cpp | `path_find` / `ripple_path_find` RPC entry                 |
+| `pathfind.compute`    | `pathfind.request` | PathRequest.cpp | Path computation for one request (`PathRequest::doUpdate`) |
+| `pathfind.discover`   | `pathfind.compute` | Pathfinder.cpp  | Graph exploration (one per RPC call)                       |
+| `pathfind.update_all` | —                  | PathRequest.cpp | Async recomputation of all active requests at ledger close |
+
+**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"pathfind.*"}`
+
 ---

-### 1.2 Complete Attribute Inventory (22 attributes)
+### 1.2 Complete Attribute Inventory (bare/underscore keys)

 > **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.

-Every span can carry key-value attributes that provide context for filtering and aggregation.
+Every span can carry key-value attributes that provide context for filtering and
+aggregation. Per the 2026-05-13 naming redesign, span-attribute keys use the
+**bare** field name (the span name already carries the domain), or the
+`<domain>_<field>` underscore form where a bare name would collide (e.g.
+`rpc_status`, `grpc_status`, `tx_status`, `txq_status`).
+
+> **Dotted exceptions** (the only keys that keep the dotted `xrpl.*` form):
+>
+> - `xrpl.ledger.hash` is the **only** dotted span attribute. It is a shared
+>   constant set on `peer.validation.receive`. Note that `consensus.validation.send`
+>   uses the **bare** `ledger_hash` instead.
+> - `xrpl.network.id` and `xrpl.network.type` are **resource** attributes set
+>   once at startup on the OTel resource — not span attributes. They appear on
+>   every span's resource scope, queried as `{resource.xrpl.network.id=...}`.

 #### RPC Attributes

-| Attribute       | Type   | Set On          | Description                                      |
-| --------------- | ------ | --------------- | ------------------------------------------------ |
-| `command`       | string | `rpc.command.*` | RPC command name (e.g., `server_info`, `ledger`) |
-| `version`       | int64  | `rpc.command.*` | API version number                               |
-| `rpc_role`      | string | `rpc.command.*` | Caller role: `"admin"` or `"user"`               |
-| `rpc_status`    | string | `rpc.command.*` | Result: `"success"` or `"error"`                 |
-| `duration_ms`   | int64  | `rpc.command.*` | Command execution time in milliseconds           |
-| `error_message` | string | `rpc.command.*` | Error details (only set on failure)              |
+| Attribute              | Type    | Set On                            | Description                                      |
+| ---------------------- | ------- | --------------------------------- | ------------------------------------------------ |
+| `command`              | string  | `rpc.command.*`, `rpc.ws_message` | RPC command name (e.g., `server_info`, `ledger`) |
+| `version`              | int64   | `rpc.command.*`                   | API version number                               |
+| `rpc_role`             | string  | `rpc.command.*`                   | Caller role: `"admin"` or `"user"`               |
+| `rpc_status`           | string  | `rpc.command.*`                   | Result: `"success"` or `"error"`                 |
+| `request_payload_size` | int64   | `rpc.http_request`                | Bytes of inbound request payload                 |
+| `is_batch`             | boolean | `rpc.process`                     | `true` if the request is a JSON-RPC batch        |
+| `batch_size`           | int64   | `rpc.process`                     | Number of sub-requests in a batch                |
+| `load_type`            | string  | `rpc.command.*`                   | Resource cost category after execution           |

 **Tempo query**: `{span.command="server_info"}` to find all `server_info` calls.

-**Prometheus label**: `xrpl_rpc_command` (dots converted to underscores by SpanMetrics).
+**Prometheus label**: `command` (used as a SpanMetrics dimension).
+
+#### gRPC Attributes
+
+| Attribute     | Type   | Set On              | Description                          |
+| ------------- | ------ | ------------------- | ------------------------------------ |
+| `method`      | string | `grpc.<MethodName>` | gRPC method name (e.g., `GetLedger`) |
+| `grpc_role`   | string | `grpc.<MethodName>` | Caller role: `"admin"` or `"user"`   |
+| `grpc_status` | string | `grpc.<MethodName>` | Result: `"success"` or `"error"`     |
+
+**Tempo query**: `{span.method="GetLedger"}` or `{name="grpc.GetLedger"}`.
+
+**Prometheus labels**: `method`, `grpc_role`, `grpc_status` (SpanMetrics dimensions).

 #### Transaction Attributes

-| Attribute           | Type    | Set On                                         | Description                                                           |
-| ------------------- | ------- | ---------------------------------------------- | --------------------------------------------------------------------- |
-| `xrpl.tx.hash`      | string  | `tx.process`, `tx.receive`                     | Transaction hash (hex-encoded)                                        |
-| `local`             | boolean | `tx.process`                                   | `true` if locally submitted, `false` if peer-relayed                  |
-| `path`              | string  | `tx.process`                                   | Submission path: `"sync"` or `"async"`                                |
-| `suppressed`        | boolean | `tx.receive`                                   | `true` if transaction was suppressed (duplicate)                      |
-| `tx_status`         | string  | `tx.receive`                                   | Transaction status (e.g., `"known_bad"`)                              |
-| `xrpl.peer.id`      | int64   | `tx.receive`                                   | Peer identifier (also set on peer spans)                              |
-| `xrpl.peer.version` | string  | `tx.receive`                                   | Peer protocol version string                                          |
-| `stage`             | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Apply-pipeline stage: `preflight`, `preclaim`, or `apply`             |
-| `tx_type`           | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Transaction type name (e.g., `Payment`)                               |
-| `ter_result`        | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Engine result token for that stage (e.g., `tesSUCCESS`, `terPRE_SEQ`) |
-| `applied`           | boolean | `tx.transactor`                                | `true` if the transaction was applied to the ledger                   |
+| Attribute      | Type    | Set On                                                       | Description                                                           |
+| -------------- | ------- | ------------------------------------------------------------ | --------------------------------------------------------------------- |
+| `tx_hash`      | string  | `tx.process`, `tx.receive`                                   | Transaction hash (hex-encoded)                                        |
+| `local`        | boolean | `tx.process`                                                 | `true` if locally submitted, `false` if peer-relayed                  |
+| `path`         | string  | `tx.process`                                                 | Submission path: `"sync"` or `"async"`                                |
+| `tx_type`      | string  | `tx.process`, `tx.preflight`, `tx.preclaim`, `tx.transactor` | Transaction type name (e.g., `Payment`)                               |
+| `fee`          | int64   | `tx.process`                                                 | Transaction fee in drops                                              |
+| `sequence`     | int64   | `tx.process`                                                 | Transaction sequence number                                           |
+| `suppressed`   | boolean | `tx.receive`                                                 | `true` if transaction was suppressed (duplicate)                      |
+| `tx_status`    | string  | `tx.receive`                                                 | Transaction status (e.g., `"known_bad"`)                              |
+| `peer_id`      | int64   | `tx.receive`                                                 | Peer identifier (also set on peer spans)                              |
+| `peer_version` | string  | `tx.receive`                                                 | Peer protocol version string                                          |
+| `stage`        | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor`               | Apply-pipeline stage: `preflight`, `preclaim`, or `apply`             |
+| `ter_result`   | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor`               | Engine result token for that stage (e.g., `tesSUCCESS`, `terPRE_SEQ`) |
+| `applied`      | boolean | `tx.transactor`                                              | `true` if the transaction was applied to the ledger                   |

-**Tempo query**: `{span.xrpl.tx.hash="<hash>"}` to trace a specific transaction across nodes.
+**Tempo query**: `{span.tx_hash="<hash>"}` to trace a specific transaction across nodes.

-**Prometheus label**: `xrpl_tx_local` (used as SpanMetrics dimension).
+**Prometheus labels**: `local`, `suppressed`, `tx_type`, `ter_result`, `stage` (SpanMetrics dimensions).
+
+#### Transaction Queue (TxQ) Attributes
+
+| Attribute            | Type    | Set On                         | Description                                                 |
+| -------------------- | ------- | ------------------------------ | ----------------------------------------------------------- |
+| `tx_hash`            | string  | `txq.enqueue`, `txq.accept.tx` | Transaction hash                                            |
+| `tx_type`            | string  | `txq.enqueue`                  | Transaction type name                                       |
+| `txq_status`         | string  | `txq.enqueue`, `txq.accept.tx` | Queue outcome (e.g. `queued`, `applied_direct`, `rejected`) |
+| `fee_level_paid`     | int64   | `txq.enqueue`                  | Fee level paid by the queued tx                             |
+| `required_fee_level` | int64   | `txq.enqueue`                  | Minimum fee level for inclusion                             |
+| `num_cleared`        | int64   | `txq.batch_clear`              | Entries cleared in a batch                                  |
+| `queue_size`         | int64   | `txq.accept`                   | Current TxQ depth                                           |
+| `ledger_changed`     | boolean | `txq.accept`                   | Whether the ledger changed since last attempt               |
+| `ter_code`           | int64   | `txq.accept.tx`                | Transaction engine result code                              |
+| `retries_remaining`  | int64   | `txq.accept.tx`                | Retries left before discard                                 |
+| `ledger_seq`         | int64   | `txq.cleanup`                  | Ledger sequence number                                      |
+| `expired_count`      | int64   | `txq.cleanup`                  | Number of expired entries cleared                           |
+
+**Prometheus label**: `txq_status` (SpanMetrics dimension).

 #### Consensus Attributes

-| Attribute                            | Type    | Set On                                                                                              | Description                                                   |
-| ------------------------------------ | ------- | --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
-| `xrpl.consensus.round`               | int64   | `consensus.proposal.send`                                                                           | Consensus round number                                        |
-| `xrpl.consensus.mode`                | string  | `consensus.proposal.send`, `consensus.ledger_close`                                                 | Node mode: `"syncing"`, `"tracking"`, `"full"`, `"proposing"` |
-| `xrpl.consensus.proposers`           | int64   | `consensus.proposal.send`, `consensus.accept`                                                       | Number of proposers in the round                              |
-| `xrpl.consensus.proposing`           | boolean | `consensus.validation.send`                                                                         | Whether this node was a proposer                              |
-| `xrpl.consensus.ledger.seq`          | int64   | `consensus.ledger_close`, `consensus.accept`, `consensus.validation.send`, `consensus.accept.apply` | Ledger sequence number                                        |
-| `xrpl.consensus.close_time`          | int64   | `consensus.accept.apply`                                                                            | Agreed-upon ledger close time (epoch seconds)                 |
-| `xrpl.consensus.close_time_correct`  | boolean | `consensus.accept.apply`                                                                            | Whether validators reached agreement on close time            |
-| `xrpl.consensus.close_resolution_ms` | int64   | `consensus.accept.apply`                                                                            | Close time rounding granularity in milliseconds               |
-| `xrpl.consensus.state`               | string  | `consensus.accept.apply`                                                                            | Consensus outcome: `"finished"` or `"moved_on"`               |
-| `xrpl.consensus.round_time_ms`       | int64   | `consensus.accept.apply`                                                                            | Total consensus round duration in milliseconds                |
+| Attribute                  | Type    | Set On                                                                                             | Description                                              |
+| -------------------------- | ------- | -------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
+| `consensus_ledger_id`      | string  | `consensus.round`                                                                                  | Previous-ledger id anchoring the round                   |
+| `ledger_seq`               | int64   | `consensus.round`, `consensus.ledger_close`, `consensus.accept.apply`, `consensus.validation.send` | Ledger sequence number                                   |
+| `consensus_mode`           | string  | `consensus.round`, `consensus.ledger_close`                                                        | Node mode: `"Proposing"`, `"Observing"`, `"Wrong"`, etc. |
+| `consensus_round_id`       | int64   | `consensus.round`                                                                                  | Round identifier                                         |
+| `consensus_phase`          | string  | `consensus.round`                                                                                  | Current phase name (updated on each transition)          |
+| `trace_strategy`           | string  | `consensus.round`                                                                                  | Trace-id strategy (`deterministic` / `random`)           |
+| `previous_ledger_seq`      | int64   | `consensus.round`                                                                                  | Sequence of the previous ledger                          |
+| `previous_proposers`       | int64   | `consensus.round`                                                                                  | Proposer count in the previous round                     |
+| `previous_round_time_ms`   | int64   | `consensus.round`                                                                                  | Duration of the previous round                           |
+| `consensus_round`          | int64   | `consensus.proposal.send`                                                                          | Proposal sequence number for the broadcast proposal      |
+| `is_bow_out`               | boolean | `consensus.proposal.send`                                                                          | Whether the proposal is a bow-out (resigning the round)  |
+| `tx_count_open`            | int64   | `consensus.ledger_close`                                                                           | Transactions in the open ledger at close                 |
+| `close_time_resolution_ms` | int64   | `consensus.ledger_close`                                                                           | Close-time rounding granularity                          |
+| `converge_percent`         | int64   | `consensus.establish`, `consensus.update_positions`                                                | Convergence percentage                                   |
+| `establish_count`          | int64   | `consensus.establish`                                                                              | Establish-phase iteration count                          |
+| `proposers`                | int64   | `consensus.establish`, `consensus.update_positions`, `consensus.accept`                            | Number of proposers                                      |
+| `disputes_count`           | int64   | `consensus.establish`, `consensus.update_positions`                                                | Number of disputed transactions                          |
+| `tx_id`                    | string  | `consensus.update_positions`                                                                       | Disputed transaction id (per-dispute event)              |
+| `dispute_our_vote`         | boolean | `consensus.update_positions`                                                                       | Our vote on the disputed tx                              |
+| `dispute_yays`             | int64   | `consensus.update_positions`                                                                       | Yes votes on the disputed tx                             |
+| `dispute_nays`             | int64   | `consensus.update_positions`                                                                       | No votes on the disputed tx                              |
+| `agree_count`              | int64   | `consensus.check`                                                                                  | Agreeing proposer count                                  |
+| `disagree_count`           | int64   | `consensus.check`                                                                                  | Disagreeing proposer count                               |
+| `threshold_percent`        | int64   | `consensus.check`                                                                                  | Agreement threshold percentage                           |
+| `consensus_result`         | string  | `consensus.check`                                                                                  | Check outcome                                            |
+| `quorum`                   | int64   | `consensus.check`, `consensus.accept`                                                              | Quorum required                                          |
+| `round_time_ms`            | int64   | `consensus.accept`, `consensus.accept.apply`                                                       | Total consensus round duration in milliseconds           |
+| `consensus_state`          | string  | `consensus.accept.apply`                                                                           | Consensus outcome: `"finished"` or `"moved_on"`          |
+| `close_time`               | int64   | `consensus.accept.apply`                                                                           | Agreed-upon ledger close time (epoch seconds)            |
+| `close_time_correct`       | boolean | `consensus.accept.apply`                                                                           | Whether validators agreed on close time                  |
+| `close_resolution_ms`      | int64   | `consensus.accept.apply`                                                                           | Close-time rounding granularity in milliseconds          |
+| `proposing`                | boolean | `consensus.accept.apply`, `consensus.validation.send`                                              | Whether this node was a proposer                         |
+| `parent_close_time`        | int64   | `consensus.accept.apply`                                                                           | Parent ledger close time                                 |
+| `close_time_self`          | int64   | `consensus.accept.apply`                                                                           | This node's close-time vote                              |
+| `close_time_vote_bins`     | string  | `consensus.accept.apply`                                                                           | Distribution of close-time votes                         |
+| `resolution_direction`     | string  | `consensus.accept.apply`                                                                           | Whether close resolution increased/decreased/unchanged   |
+| `tx_count`                 | int64   | `consensus.accept.apply`                                                                           | Transactions in the accepted set                         |
+| `ledger_hash`              | string  | `consensus.validation.send`                                                                        | Full hash of the validated ledger (**bare**, not dotted) |
+| `full_validation`          | boolean | `consensus.validation.send`                                                                        | Whether this is a full validation                        |
+| `validation_sign_time`     | int64   | `consensus.validation.send`                                                                        | Validation signing time                                  |
+| `mode_old`                 | string  | `consensus.mode_change`                                                                            | Operating mode before the transition                     |
+| `mode_new`                 | string  | `consensus.mode_change`                                                                            | Operating mode after the transition                      |

-**Tempo query**: `{span.xrpl.consensus.mode="proposing"}` to find rounds where node was proposing.
+**Tempo query**: `{span.consensus_mode="Proposing"}` to find rounds where the node was proposing.

-**Prometheus label**: `xrpl_consensus_mode` (used as SpanMetrics dimension).
+**Prometheus labels**: `consensus_mode`, `consensus_state`, `consensus_phase`, `consensus_result`, `consensus_stalled`, `mode_new`, `close_time_correct` (SpanMetrics dimensions).

 #### Ledger Attributes

-| Attribute                 | Type  | Set On                                                        | Description                                    |
-| ------------------------- | ----- | ------------------------------------------------------------- | ---------------------------------------------- |
-| `xrpl.ledger.seq`         | int64 | `ledger.build`, `ledger.validate`, `ledger.store`, `tx.apply` | Ledger sequence number                         |
-| `xrpl.ledger.validations` | int64 | `ledger.validate`                                             | Number of validations received for this ledger |
-| `xrpl.ledger.tx_count`    | int64 | `ledger.build`, `tx.apply`                                    | Transactions in the ledger                     |
-| `xrpl.ledger.tx_failed`   | int64 | `ledger.build`, `tx.apply`                                    | Failed transactions in the ledger              |
+| Attribute             | Type    | Set On                                            | Description                                      |
+| --------------------- | ------- | ------------------------------------------------- | ------------------------------------------------ |
+| `ledger_seq`          | int64   | `ledger.build`, `ledger.validate`, `ledger.store` | Ledger sequence number                           |
+| `close_time`          | int64   | `ledger.build`                                    | Ledger close time (epoch seconds)                |
+| `close_time_correct`  | boolean | `ledger.build`                                    | Whether close time was agreed upon by validators |
+| `close_resolution_ms` | int64   | `ledger.build`                                    | Close time rounding granularity in milliseconds  |
+| `tx_count`            | int64   | `tx.apply`                                        | Transactions applied to the ledger               |
+| `tx_failed`           | int64   | `tx.apply`                                        | Failed transactions in the apply set             |
+| `validations`         | int64   | `ledger.validate`                                 | Number of validations received for this ledger   |
+| `acquire_reason`      | string  | `ledger.acquire`                                  | Why the ledger fetch was triggered               |
+| `timeouts`            | int64   | `ledger.acquire`                                  | Number of fetch timeouts                         |
+| `peer_count`          | int64   | `ledger.acquire`                                  | Peers queried during the fetch                   |
+| `outcome`             | string  | `ledger.acquire`                                  | Fetch outcome                                    |

-**Tempo query**: `{span.xrpl.ledger.seq=12345}` to find all spans for a specific ledger.
+The apply-step span `tx.apply` (child of `ledger.build`) carries `tx_count`/`tx_failed`;
+the parent `ledger.build` carries `ledger_seq` and the close-time attributes.
+`ledger.acquire` (InboundLedger) also sets `ledger_seq`.
+
+**Tempo query**: `{span.ledger_seq=12345}` to find all spans for a specific ledger.

 #### Peer Attributes

-| Attribute                      | Type    | Set On                                                           | Description                                          |
-| ------------------------------ | ------- | ---------------------------------------------------------------- | ---------------------------------------------------- |
-| `xrpl.peer.id`                 | int64   | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier                                      |
-| `xrpl.peer.proposal.trusted`   | boolean | `peer.proposal.receive`                                          | Whether the proposal came from a trusted validator   |
-| `xrpl.peer.validation.trusted` | boolean | `peer.validation.receive`                                        | Whether the validation came from a trusted validator |
+| Attribute            | Type    | Set On                                                           | Description                                          |
+| -------------------- | ------- | ---------------------------------------------------------------- | ---------------------------------------------------- |
+| `peer_id`            | int64   | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier                                      |
+| `proposal_trusted`   | boolean | `peer.proposal.receive`                                          | Whether the proposal came from a trusted validator   |
+| `validation_trusted` | boolean | `peer.validation.receive`                                        | Whether the validation came from a trusted validator |
+| `validation_full`    | boolean | `peer.validation.receive`                                        | Whether the validation is a full validation          |
+| `xrpl.ledger.hash`   | string  | `peer.validation.receive`                                        | Validated ledger hash (**dotted** — shared constant) |

-**Prometheus labels**: `xrpl_peer_proposal_trusted`, `xrpl_peer_validation_trusted` (SpanMetrics dimensions).
+**Prometheus labels**: `proposal_trusted`, `validation_trusted` (SpanMetrics dimensions).
+
+#### PathFind Attributes
+
+| Attribute                 | Type    | Set On                | Description                              |
+| ------------------------- | ------- | --------------------- | ---------------------------------------- |
+| `pathfind_source_account` | string  | `pathfind.request`    | Originating account for the path search  |
+| `pathfind_dest_account`   | string  | `pathfind.request`    | Destination account                      |
+| `pathfind_fast`           | boolean | `pathfind.compute`    | Whether fast pathfinding mode is enabled |
+| `pathfind_search_level`   | int64   | `pathfind.discover`   | Depth of graph exploration               |
+| `pathfind_num_paths`      | int64   | `pathfind.discover`   | Total paths produced                     |
+| `pathfind_ledger_index`   | int64   | `pathfind.update_all` | Target ledger index                      |
+| `pathfind_num_requests`   | int64   | `pathfind.update_all` | Active requests recomputed               |

 ---

@@ -264,17 +435,34 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er

 **Standard labels on every metric**: `span_name`, `status_code`, `service_name`, `span_kind`

-**Additional dimension labels** (configured in `otel-collector-config.yaml`):
+**Additional dimension labels** (configured in `otel-collector-config.yaml`).
+The Prometheus label is the **bare span-attribute key verbatim** — the
+SpanMetrics connector does not rewrite or prefix it:

-| Span Attribute        | Prometheus Label               | Applies To                                     |
-| --------------------- | ------------------------------ | ---------------------------------------------- |
-| `command`             | `xrpl_rpc_command`             | `rpc.command.*`                                |
-| `rpc_status`          | `xrpl_rpc_status`              | `rpc.command.*`                                |
-| `xrpl.consensus.mode` | `xrpl_consensus_mode`          | `consensus.ledger_close`                       |
-| `local`               | `xrpl_tx_local`                | `tx.process`                                   |
-| `proposal_trusted`    | `xrpl_peer_proposal_trusted`   | `peer.proposal.receive`                        |
-| `validation_trusted`  | `xrpl_peer_validation_trusted` | `peer.validation.receive`                      |
-| `stage`               | `stage`                        | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
+| Prometheus Label / Span Attribute | Type    | Applies To                                     |
+| --------------------------------- | ------- | ---------------------------------------------- |
+| `command`                         | string  | `rpc.command.*`                                |
+| `rpc_status`                      | string  | `rpc.command.*`                                |
+| `consensus_mode`                  | string  | `consensus.round`, `consensus.ledger_close`    |
+| `close_time_correct`              | boolean | `consensus.accept.apply`                       |
+| `local`                           | boolean | `tx.process`                                   |
+| `suppressed`                      | boolean | `tx.receive`                                   |
+| `proposal_trusted`                | boolean | `peer.proposal.receive`                        |
+| `validation_trusted`              | boolean | `peer.validation.receive`                      |
+| `tx_type`                         | string  | `tx.*`, `txq.enqueue`                          |
+| `ter_result`                      | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
+| `stage`                           | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
+| `txq_status`                      | string  | `txq.enqueue`, `txq.accept.tx`                 |
+| `consensus_state`                 | string  | `consensus.accept.apply`                       |
+| `load_type`                       | string  | `rpc.command.*`                                |
+| `is_batch`                        | boolean | `rpc.process`                                  |
+| `mode_new`                        | string  | `consensus.mode_change`                        |
+| `consensus_stalled`               | boolean | `consensus.check`                              |
+| `consensus_phase`                 | string  | `consensus.round`                              |
+| `consensus_result`                | string  | `consensus.check`                              |
+| `method`                          | string  | `grpc.<MethodName>`                            |
+| `grpc_role`                       | string  | `grpc.<MethodName>`                            |
+| `grpc_status`                     | string  | `grpc.<MethodName>`                            |
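
Because the labels are now the bare attribute keys, existing PromQL that still uses the old `xrpl_*` labels needs rewriting. The following is a minimal sketch of such a rewrite, assuming the only renames are the six visible in the table rows removed and added above. The helper name `migrate_labels` is hypothetical and not part of the repository.

```python
import re

# Old SpanMetrics label -> bare label, transcribed from the removed and
# added rows of the dimension table. Assumption: no other labels changed.
LABEL_RENAMES = {
    "xrpl_rpc_command": "command",
    "xrpl_rpc_status": "rpc_status",
    "xrpl_consensus_mode": "consensus_mode",
    "xrpl_tx_local": "local",
    "xrpl_peer_proposal_trusted": "proposal_trusted",
    "xrpl_peer_validation_trusted": "validation_trusted",
}

# Match whole identifiers only, so a longer name that merely contains an
# old label is left untouched ("_" counts as a word character).
_OLD_LABEL = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in LABEL_RENAMES) + r")\b"
)


def migrate_labels(expr: str) -> str:
    """Rewrite old xrpl_* SpanMetrics labels in a PromQL string to bare keys."""
    return _OLD_LABEL.sub(lambda m: LABEL_RENAMES[m.group(1)], expr)


old = 'sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))'
print(migrate_labels(old))
```

The rewrite is purely textual, so it is worth reviewing each changed dashboard expression by hand afterwards.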

 The `stage` dimension (3 values: `preflight`, `preclaim`, `apply`) turns the
 apply-pipeline spans into per-stage RED metrics with no native instruments — the
@@ -337,7 +525,7 @@ prefix=xrpld
 | `xrpld_Peer_Finder_Active_Outbound_Peers`         | PeerfinderManager.cpp | Active outbound peer connections          | 10–21                           |
 | `xrpld_Overlay_Peer_Disconnects`                  | OverlayImpl.cpp       | Cumulative peer disconnection count       | Low growth                      |
 | `xrpld_Overlay_Peer_Disconnects_Charges`          | OverlayImpl.cpp       | Disconnects due to resource limit charges | Low growth (subset of above)    |
-| `xrpld_job_count`                                 | JobQueue.cpp          | Current job queue depth                   | 0–100 (healthy)                 |
+| `xrpld_jobq_job_count`                            | JobQueue.cpp          | Current job queue depth (group `jobq`)    | 0–100 (healthy)                 |

 **Grafana dashboard**: _Node Health (System Metrics)_ (`xrpld-system-node-health`)

@@ -439,38 +627,47 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo

 | What to Find             | Tempo TraceQL Query                                                            |
 | ------------------------ | ------------------------------------------------------------------------------ |
-| All RPC calls            | `{resource.service.name="xrpld" && name="rpc.request"}`                        |
+| All RPC calls            | `{resource.service.name="xrpld" && name="rpc.http_request"}`                   |
 | Specific RPC command     | `{resource.service.name="xrpld" && name="rpc.command.server_info"}`            |
 | Slow RPC calls           | `{resource.service.name="xrpld" && name=~"rpc.command.*"} \| duration > 100ms` |
 | Failed RPC calls         | `{span.rpc_status="error"}`                                                    |
-| Specific transaction     | `{span.xrpl.tx.hash="<hex_hash>"}`                                             |
-| Local transactions only  | `{span.xrpl.tx.local=true}`                                                    |
-| Consensus rounds         | `{resource.service.name="xrpld" && name="consensus.accept"}`                   |
-| Rounds by mode           | `{span.xrpl.consensus.mode="proposing"}`                                       |
-| Specific ledger          | `{span.xrpl.ledger.seq=12345}`                                                 |
-| Peer proposals (trusted) | `{span.xrpl.peer.proposal.trusted=true}`                                       |
+| gRPC method calls        | `{resource.service.name="xrpld" && name="grpc.GetLedger"}`                     |
+| Specific transaction     | `{span.tx_hash="<hex_hash>"}`                                                  |
+| Local transactions only  | `{span.local=true}`                                                            |
+| Consensus rounds         | `{resource.service.name="xrpld" && name="consensus.round"}`                    |
+| Rounds by mode           | `{span.consensus_mode="Proposing"}`                                            |
+| Specific ledger          | `{span.ledger_seq=12345}`                                                      |
+| Peer proposals (trusted) | `{span.proposal_trusted=true}`                                                 |
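
The queries in the table share one shape: an optional service and span-name clause, followed by `span.<key>=<value>` filters on bare attribute keys. A small sketch that assembles such strings is shown here. The function names are hypothetical, and it covers only the `=` comparisons joined with `&&` that appear in the table.

```python
def traceql_value(value):
    """Format a Python value as a TraceQL literal."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def span_query(name=None, service=None, **attrs):
    """Build a TraceQL span selector from a span name and bare attribute keys."""
    clauses = []
    if service is not None:
        clauses.append(f"resource.service.name={traceql_value(service)}")
    if name is not None:
        clauses.append(f"name={traceql_value(name)}")
    for key, value in attrs.items():
        clauses.append(f"span.{key}={traceql_value(value)}")
    return "{" + " && ".join(clauses) + "}"


print(span_query(service="xrpld", name="consensus.round"))
print(span_query(consensus_mode="Proposing"))
print(span_query(ledger_seq=12345))
```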

 ### Trace Structure

 A typical RPC trace shows the span hierarchy:

 ```
-rpc.request (ServerHandler)
+rpc.http_request (ServerHandler)
  └── rpc.process (ServerHandler)
       └── rpc.command.server_info (RPCHandler)
 ```

-A consensus round produces independent spans (not parent-child):
+A consensus round groups its lifecycle spans under a single root
+(`consensus.round`); the `ledger.*` spans form their own separate trees:

 ```
-consensus.ledger_close        (close event)
-consensus.proposal.send       (broadcast proposal)
-ledger.build                  (build new ledger)
-  └── tx.apply                (apply transaction set)
-consensus.accept              (accept result)
-consensus.validation.send     (send validation)
-ledger.validate               (promote to validated)
-ledger.store                  (persist to DB)
+consensus.round                    (root — one per round)
+  ├── consensus.phase.open         (open phase)
+  ├── consensus.proposal.send      (broadcast proposal)
+  ├── consensus.ledger_close       (close event)
+  ├── consensus.establish          (establish phase)
+  ├── consensus.update_positions   (position updates)
+  ├── consensus.check              (threshold check)
+  ├── consensus.accept             (accept result)
+  │     └── consensus.accept.apply (apply, jtACCEPT thread)
+  └── consensus.validation.send    (send validation, follows-from link)
+
+ledger.build                       (build new ledger)
+  └── tx.apply                     (apply transaction set)
+ledger.validate                    (promote to validated)
+ledger.store                       (persist to DB)
 ```
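
The hierarchy above can be checked mechanically against spans pulled from Tempo. The sketch below transcribes the trees into a child-to-parent map and reports spans whose observed parent differs. The input format (pairs of span name and parent name) and the function names are assumptions for illustration, not the format of `expected_spans.json`. `consensus.validation.send` is left out of the map because the tree marks it as a follows-from link rather than a parent-child edge.

```python
# Child span -> expected parent span, transcribed from the trees above.
EXPECTED_PARENT = {
    "rpc.process": "rpc.http_request",
    "rpc.command.server_info": "rpc.process",
    "consensus.phase.open": "consensus.round",
    "consensus.proposal.send": "consensus.round",
    "consensus.ledger_close": "consensus.round",
    "consensus.establish": "consensus.round",
    "consensus.update_positions": "consensus.round",
    "consensus.check": "consensus.round",
    "consensus.accept": "consensus.round",
    "consensus.accept.apply": "consensus.accept",
    "tx.apply": "ledger.build",
}


def root_of(span_name):
    """Walk the expected-parent map up to the root span name."""
    seen = set()
    while span_name in EXPECTED_PARENT:
        if span_name in seen:
            raise ValueError(f"cycle at {span_name}")
        seen.add(span_name)
        span_name = EXPECTED_PARENT[span_name]
    return span_name


def parent_mismatches(observed):
    """Return (span, observed_parent, expected_parent) for each mismatch.

    `observed` is an iterable of (span_name, parent_name_or_None) pairs.
    Spans absent from EXPECTED_PARENT are not checked.
    """
    return [
        (name, parent, EXPECTED_PARENT[name])
        for name, parent in observed
        if name in EXPECTED_PARENT and EXPECTED_PARENT[name] != parent
    ]


print(root_of("consensus.accept.apply"))
print(parent_mismatches([("rpc.process", "rpc.http_request"), ("tx.apply", None)]))
```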

 ---
@@ -483,19 +680,19 @@ ledger.store                  (persist to DB)

 ```promql
 # RPC request rate by command (last 5 minutes)
-sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))
+sum by (command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))

 # RPC p95 latency by command
-histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))
+histogram_quantile(0.95, sum by (le, command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))

 # Consensus round duration p95
-histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m])))
+histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.round"}[5m])))

 # Transaction processing rate (local vs relay)
-sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))
+sum by (local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))

 # Trusted vs untrusted proposal rate
-sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
+sum by (proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
 ```

 ### StatsD Metrics
@@ -592,90 +789,22 @@ count_over_time({job="xrpld"} |= "trace_id=" [5m])

 ---

-## 5b. Future: Internal Metric Gap Fill (Phase 9)
+## 5b. Internal Metric Gap Fill (Phase 9)

-> **Status**: Planned, not yet implemented.
+> **Status**: Implemented.
 > **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context
 > **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details

-Phase 9 fills ~50+ metrics that exist inside xrpld but currently lack time-series export. Uses a hybrid approach: `beast::insight` extensions for NodeStore I/O, OTel `ObservableGauge` async callbacks for new categories.
+Phase 9 fills the metrics that exist inside xrpld but previously lacked time-series export. It
+uses a hybrid approach: `beast::insight` extensions for NodeStore I/O plus OTel `ObservableGauge`
+async callbacks for new categories.

-### New Metric Categories
-
-#### NodeStore I/O (via beast::insight)
-
-| Prometheus Metric                  | Type  | Description                         |
-| ---------------------------------- | ----- | ----------------------------------- |
-| `xrpld_nodestore_reads_total`      | Gauge | Cumulative read operations          |
-| `xrpld_nodestore_reads_hit`        | Gauge | Cache-served reads                  |
-| `xrpld_nodestore_writes`           | Gauge | Cumulative write operations         |
-| `xrpld_nodestore_written_bytes`    | Gauge | Cumulative bytes written            |
-| `xrpld_nodestore_read_bytes`       | Gauge | Cumulative bytes read               |
-| `xrpld_nodestore_read_duration_us` | Gauge | Cumulative read time (microseconds) |
-| `xrpld_nodestore_write_load`       | Gauge | Current write load score            |
-| `xrpld_nodestore_read_queue`       | Gauge | Items in read queue                 |
-
-#### Cache Hit Rates (via OTel MetricsRegistry)
-
-| Prometheus Metric             | Type  | Description                          |
-| ----------------------------- | ----- | ------------------------------------ |
-| `xrpld_cache_SLE_hit_rate`    | Gauge | SLE cache hit rate (0.0-1.0)         |
-| `xrpld_cache_ledger_hit_rate` | Gauge | Ledger object cache hit rate         |
-| `xrpld_cache_AL_hit_rate`     | Gauge | AcceptedLedger cache hit rate        |
-| `xrpld_cache_treenode_size`   | Gauge | SHAMap TreeNode cache size (entries) |
-| `xrpld_cache_fullbelow_size`  | Gauge | FullBelow cache size                 |
-
-#### Transaction Queue (via OTel MetricsRegistry)
-
-| Prometheus Metric                    | Type  | Description                      |
-| ------------------------------------ | ----- | -------------------------------- |
-| `xrpld_txq_count`                    | Gauge | Current transactions in queue    |
-| `xrpld_txq_max_size`                 | Gauge | Maximum queue capacity           |
-| `xrpld_txq_in_ledger`                | Gauge | Transactions in open ledger      |
-| `xrpld_txq_per_ledger`               | Gauge | Expected transactions per ledger |
-| `xrpld_txq_open_ledger_fee_level`    | Gauge | Open ledger fee escalation level |
-| `xrpld_txq_med_fee_level`            | Gauge | Median fee level in queue        |
-| `xrpld_txq_reference_fee_level`      | Gauge | Reference fee level              |
-| `xrpld_txq_min_processing_fee_level` | Gauge | Minimum fee to get processed     |
-
-#### PerfLog Per-RPC Method (via OTel Metrics SDK)
-
-| Prometheus Metric                     | Type      | Labels            | Description                 |
-| ------------------------------------- | --------- | ----------------- | --------------------------- |
-| `xrpld_rpc_method_started_total`      | Counter   | `method="<name>"` | RPC calls started           |
-| `xrpld_rpc_method_finished_total`     | Counter   | `method="<name>"` | RPC calls completed         |
-| `xrpld_rpc_method_errored_total`      | Counter   | `method="<name>"` | RPC calls errored           |
-| `xrpld_rpc_method_duration_us_bucket` | Histogram | `method="<name>"` | Execution time distribution |
-
-#### PerfLog Per-Job Type (via OTel Metrics SDK)
-
-| Prometheus Metric                      | Type      | Labels              | Description     |
-| -------------------------------------- | --------- | ------------------- | --------------- |
-| `xrpld_job_queued_total`               | Counter   | `job_type="<name>"` | Jobs queued     |
-| `xrpld_job_started_total`              | Counter   | `job_type="<name>"` | Jobs started    |
-| `xrpld_job_finished_total`             | Counter   | `job_type="<name>"` | Jobs completed  |
-| `xrpld_job_queued_duration_us_bucket`  | Histogram | `job_type="<name>"` | Queue wait time |
-| `xrpld_job_running_duration_us_bucket` | Histogram | `job_type="<name>"` | Execution time  |
-
-#### Counted Object Instances (via OTel MetricsRegistry)
-
-| Prometheus Metric    | Type  | Labels          | Description                     |
-| -------------------- | ----- | --------------- | ------------------------------- |
-| `xrpld_object_count` | Gauge | `type="<name>"` | Live instances of internal type |
-
-Tracked types: `Transaction`, `Ledger`, `NodeObject`, `STTx`, `STLedgerEntry`, `InboundLedger`, `Pathfinder`, `PathRequest`, `HashRouterEntry`
-
-#### Fee Escalation & Load Factors (via OTel MetricsRegistry)
-
-| Prometheus Metric                  | Type  | Description                          |
-| ---------------------------------- | ----- | ------------------------------------ |
-| `xrpld_load_factor`                | Gauge | Combined transaction cost multiplier |
-| `xrpld_load_factor_server`         | Gauge | Server + cluster + network load      |
-| `xrpld_load_factor_local`          | Gauge | Local server load only               |
-| `xrpld_load_factor_net`            | Gauge | Network-wide load estimate           |
-| `xrpld_load_factor_cluster`        | Gauge | Cluster peer load                    |
-| `xrpld_load_factor_fee_escalation` | Gauge | Open ledger fee escalation           |
-| `xrpld_load_factor_fee_queue`      | Gauge | Queue entry fee level                |
+> **Authoritative metric names live in [§ Phase 9: OTel SDK-Exported Metrics](#phase-9-otel-sdk-exported-metrics-metricsregistry) below.**
+> Most internal metrics are emitted as **labeled** gauges — one instrument carrying many logical
+> values via a `metric` label (e.g. `xrpld_cache_metrics{metric="SLE_hit_rate"}`,
+> `xrpld_txq_metrics{metric="txq_count"}`, `xrpld_load_factor_metrics{metric="load_factor"}`,
+> `xrpld_nodestore_state{metric="node_reads_total"}`) — not the flat per-name form. Query the
+> labeled names; the flat names (`xrpld_cache_SLE_hit_rate`, `xrpld_txq_count`, …) are **not** emitted.
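
As an illustration of the labeled form, the sketch below builds the selectors quoted in the note and flags PromQL that still references a flat name. Only the two flat names spelled out in the note are listed. The helper names are hypothetical, and the list would need extending from the authoritative section.

```python
import re

# Flat names the note above states are NOT emitted (only the two it names).
NOT_EMITTED = ("xrpld_cache_SLE_hit_rate", "xrpld_txq_count")


def labeled_selector(family: str, metric: str) -> str:
    """Build a selector for one logical value of a labeled gauge."""
    return f'{family}{{metric="{metric}"}}'


def flat_names_in(expr: str) -> list:
    """Return the non-emitted flat metric names referenced in a PromQL string."""
    return sorted(
        name for name in NOT_EMITTED
        if re.search(r"\b" + re.escape(name) + r"\b", expr)
    )


print(labeled_selector("xrpld_cache_metrics", "SLE_hit_rate"))
print(labeled_selector("xrpld_txq_metrics", "txq_count"))
print(flat_names_in("avg_over_time(xrpld_txq_count[5m])"))
```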

 #### Server Info (via OTel MetricsRegistry)

@@ -759,15 +888,23 @@ docker/telemetry/workload/benchmark.sh --xrpld .build/xrpld --duration 300

 ### Validated Telemetry Inventory

-| Category           | Expected Count | Validation Method                | Config File             |
-| ------------------ | -------------- | -------------------------------- | ----------------------- |
-| Trace spans        | 17             | Tempo API query                  | `expected_spans.json`   |
-| Span attributes    | 22             | Per-span attribute assertion     | `expected_spans.json`   |
-| StatsD metrics     | 255+           | Prometheus query                 | `expected_metrics.json` |
-| Phase 9 metrics    | 68+            | Prometheus query                 | `expected_metrics.json` |
-| SpanMetrics RED    | 4 per span     | Prometheus query                 | `expected_metrics.json` |
-| Grafana dashboards | 10             | Dashboard API "no data" check    | `expected_metrics.json` |
-| Log-trace links    | Present        | Loki query + Tempo reverse check | —                       |
+> **Counting note: families vs. series.** A _metric family_ is one distinct Prometheus `__name__`
+> (a histogram's `_bucket`/`_count`/`_sum` collapse to one family). A _series_ is one family
+> combined with one label combination. The legacy overlay-traffic block accounts for most of the
+> count: ~56 message categories × 4 suffixes (`_Bytes_In`, `_Bytes_Out`, `_Messages_In`,
+> `_Messages_Out`) ≈ 224 families on its own. The labeled gauges (`xrpld_cache_metrics{metric}`, …)
+> are few families but many series. Treat the figures below as counts of **families currently
+> emitting**. Idle nodes under-report: workload-gated metrics such as per-RPC/error counters
+> appear only once exercised, which is Phase 10's purpose.
+
+| Category                  | Expected Count            | Validation Method                | Config File             |
+| ------------------------- | ------------------------- | -------------------------------- | ----------------------- |
+| Trace spans               | ~37 (required + optional) | Tempo API query                  | `expected_spans.json`   |
+| Span attributes           | per-span assertion        | Per-span attribute assertion     | `expected_spans.json`   |
+| Legacy `xrpld_*` families | ~270 (≈224 traffic)       | Prometheus `__name__` query      | `expected_metrics.json` |
+| Native MetricsRegistry    | 35 instruments            | Prometheus query                 | `expected_metrics.json` |
+| SpanMetrics RED           | 4 per span                | Prometheus query                 | `expected_metrics.json` |
+| Grafana dashboards        | 15                        | Dashboard API "no data" check    | `expected_metrics.json` |
+| Log-trace links           | Present                   | Loki query + Tempo reverse check | —                       |
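
The families-versus-series distinction from the counting note can be sketched as follows. The function collapses histogram `_bucket`/`_count`/`_sum` names into one family, but only when a matching `_bucket` name exists, so a plain gauge such as `xrpld_jobq_job_count` is not mistaken for a histogram component. The function names are hypothetical. The input is assumed to be the set of `__name__` values returned by Prometheus.

```python
HISTOGRAM_SUFFIXES = ("_bucket", "_count", "_sum")
TRAFFIC_SUFFIXES = ("_Bytes_In", "_Bytes_Out", "_Messages_In", "_Messages_Out")


def metric_families(metric_names):
    """Collapse Prometheus __name__ values into metric families."""
    names = set(metric_names)
    # A base is a histogram only if its _bucket name is actually present.
    histogram_bases = {
        n[: -len("_bucket")] for n in names if n.endswith("_bucket")
    }
    families = set()
    for name in names:
        for suffix in HISTOGRAM_SUFFIXES:
            base = name[: -len(suffix)]
            if name.endswith(suffix) and base in histogram_bases:
                families.add(base)
                break
        else:
            families.add(name)
    return families


def traffic_family_count(categories: int) -> int:
    """Families contributed by the overlay-traffic block."""
    return categories * len(TRAFFIC_SUFFIXES)


names = [
    "traces_span_metrics_duration_milliseconds_bucket",
    "traces_span_metrics_duration_milliseconds_count",
    "traces_span_metrics_duration_milliseconds_sum",
    "xrpld_jobq_job_count",
    "xrpld_cache_metrics",
]
print(len(metric_families(names)))
print(traffic_family_count(56))
```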

 ### Performance Overhead Targets

@@ -1021,15 +1158,27 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full

 #### Synchronous Counters (Phase 7+)

-| Prometheus Metric                   | Type    | Description                      | Increment Site        |
-| ----------------------------------- | ------- | -------------------------------- | --------------------- |
-| `xrpld_ledgers_closed_total`        | Counter | Ledgers closed by consensus      | RCLConsensus.cpp      |
-| `xrpld_validations_sent_total`      | Counter | Validations sent                 | RCLConsensus.cpp      |
-| `xrpld_validations_checked_total`   | Counter | Network validations observed     | LedgerMaster.cpp      |
-| `xrpld_validation_agreements_total` | Counter | Cumulative validation agreements | ValidationTracker.cpp |
-| `xrpld_validation_missed_total`     | Counter | Cumulative validation misses     | ValidationTracker.cpp |
-| `xrpld_state_changes_total`         | Counter | Operating mode transitions       | NetworkOPs.cpp        |
-| `xrpld_jq_trans_overflow_total`     | Counter | Job queue transaction overflows  | JobQueue.cpp          |
+| Prometheus Metric                 | Type    | Description                     | Increment Site   |
+| --------------------------------- | ------- | ------------------------------- | ---------------- |
+| `xrpld_ledgers_closed_total`      | Counter | Ledgers closed by consensus     | RCLConsensus.cpp |
+| `xrpld_validations_sent_total`    | Counter | Validations sent                | RCLConsensus.cpp |
+| `xrpld_validations_checked_total` | Counter | Network validations observed    | LedgerMaster.cpp |
+| `xrpld_state_changes_total`       | Counter | Operating mode transitions      | NetworkOPs.cpp   |
+| `xrpld_jq_trans_overflow_total`   | Counter | Job queue transaction overflows | JobQueue.cpp     |
+
+Lifetime validation agreement/miss tallies are exported as monotonic **ObservableCounters**
+(not synchronous counters) observed from `ValidationTracker`'s gross lifetime totals:
+
+| Prometheus Metric                   | Type              | Description                                | Source                |
+| ----------------------------------- | ----------------- | ------------------------------------------ | --------------------- |
+| `xrpld_validation_agreements_total` | ObservableCounter | Lifetime validations that initially agreed | ValidationTracker.cpp |
+| `xrpld_validation_missed_total`     | ObservableCounter | Lifetime validations that initially missed | ValidationTracker.cpp |
+
+> **Counting semantics (initial classification only):** each reconciled ledger increments exactly
+> one of these two counters, at first classification. A late repair (miss → agreement) does
+> **not** move either counter. This keeps both strictly monotonic (a Prometheus `_total` must never
+> decrease) and additive (`agreements_total + missed_total` = ledgers reconciled). The
+> repair-aware, windowed view remains on `xrpld_validation_agreement{metric="…"}`.
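
The initial-classification rule can be modelled in a few lines. This is an illustrative sketch only, not the `ValidationTracker` implementation. The class and method names are hypothetical, and it assumes each ledger is identified by its sequence number.

```python
class LifetimeTallies:
    """Toy model of the two lifetime counters described above."""

    def __init__(self):
        self.agreements_total = 0
        self.missed_total = 0
        self._classified = set()  # ledger sequences already counted

    def reconcile(self, ledger_seq: int, agreed: bool) -> None:
        # Only the first classification of a ledger is counted; a late
        # repair (miss -> agreement) leaves both counters unchanged.
        if ledger_seq in self._classified:
            return
        self._classified.add(ledger_seq)
        if agreed:
            self.agreements_total += 1
        else:
            self.missed_total += 1

    @property
    def ledgers_reconciled(self) -> int:
        return len(self._classified)


tallies = LifetimeTallies()
tallies.reconcile(100, agreed=True)
tallies.reconcile(101, agreed=False)
tallies.reconcile(101, agreed=True)  # late repair: no counter moves
print(tallies.agreements_total, tallies.missed_total, tallies.ledgers_reconciled)
```

In this model neither counter can decrease, and their sum always equals the number of ledgers reconciled, which are the two properties the note relies on.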

 #### Span Attribute Enrichments (Phases 2-4)

@@ -1094,7 +1243,7 @@ State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full
 | Issue                                                              | Impact                                           | Status                                                               |
 | ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- |
 | `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m` → `\|c` change in StatsDCollector.cpp |
-| `xrpld_job_count` may not emit in standalone mode                  | Missing from Prometheus in some test configs     | Requires active job queue activity                                   |
+| `xrpld_jobq_job_count` may not emit in standalone mode             | Missing from Prometheus in some test configs     | Requires active job queue activity                                   |
 | `xrpld_rpc_requests` depends on `[insight]` config                 | Zero series if StatsD not configured             | Requires `[insight] server=statsd` in xrpld.cfg                      |
 | Peer tracing disabled by default                                   | No `peer.*` spans unless `trace_peer=1`          | Intentional — high volume on mainnet                                 |

--- a/docker/telemetry/grafana/dashboards/consensus-health.json
+++ b/docker/telemetry/grafana/dashboards/consensus-health.json
--- a/docker/telemetry/grafana/dashboards/xrpld-peer-quality.json
+++ b/docker/telemetry/grafana/dashboards/xrpld-peer-quality.json
@@ -303,14 +303,14 @@
          "datasource": {
            "type": "prometheus"
          },
-          "expr": "rippled_Peer_Finder_Active_Inbound_Peers{exported_instance=~\"$node\"}",
+          "expr": "xrpld_Peer_Finder_Active_Inbound_Peers{exported_instance=~\"$node\"}",
          "legendFormat": "Inbound [{{exported_instance}}]"
        },
        {
          "datasource": {
            "type": "prometheus"
          },
-          "expr": "rippled_Peer_Finder_Active_Outbound_Peers{exported_instance=~\"$node\"}",
+          "expr": "xrpld_Peer_Finder_Active_Outbound_Peers{exported_instance=~\"$node\"}",
          "legendFormat": "Outbound [{{exported_instance}}]"
        }
      ],
--- a/docker/telemetry/tempo.yaml
+++ b/docker/telemetry/tempo.yaml
@@ -17,6 +17,14 @@ stream_over_http_enabled: true
 server:
  http_listen_port: 3200

+# Raise the TraceQL metrics query range limit. The default
+# query_frontend.metrics.max_duration is 3h, so a dashboard set to a longer
+# window (e.g. 6h/12h) fails with "range exceeds 3h0m0s". 168h matches the
+# search max_duration and gives dashboards generous headroom.
+query_frontend:
+  metrics:
+    max_duration: 168h
+
 distributor:
  receivers:
    otlp:
--- a/src/tests/libxrpl/telemetry/ValidationTracker.cpp
+++ b/src/tests/libxrpl/telemetry/ValidationTracker.cpp
@@ -132,6 +132,8 @@ TEST_F(ValidationTrackerTest, EmptyWindowReturnsZero)
    EXPECT_EQ(tracker_.missed24h(), 0u);
    EXPECT_EQ(tracker_.totalAgreements(), 0u);
    EXPECT_EQ(tracker_.totalMissed(), 0u);
+    EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
+    EXPECT_EQ(tracker_.totalMissedEver(), 0u);
    EXPECT_EQ(tracker_.totalValidationsSent(), 0u);
    EXPECT_EQ(tracker_.totalValidationsChecked(), 0u);
 }
@@ -282,3 +284,91 @@ TEST_F(ValidationTrackerTest, OnlyWeValidated)
    EXPECT_EQ(tracker_.missed1h(), 1u);
    EXPECT_DOUBLE_EQ(tracker_.agreementPct1h(), 0.0);
 }
+
+// ---------------------------------------------------------------
+// 10. Gross miss tally is monotonic across a late repair
+//     The gross lifetime tallies (totalAgreementsEver/totalMissedEver)
+//     back the monotonic Prometheus _total counters. A late repair must
+//     move the NET totals (miss -> agreement) but must NOT move the gross
+//     tallies: a miss already counted stays counted, and the repair does
+//     not add a second (agreement) count for the same ledger.
+// ---------------------------------------------------------------
+TEST_F(ValidationTrackerTest, GrossMissedNeverDecrementsOnRepair)
+{
+    auto const hash = makeHash(10);
+    LedgerIndex const seq = 1000;
+
+    // Network validates, we do not (yet).
+    tracker_.recordNetworkValidation(hash, seq);
+
+    // Grace period elapses -- reconciled as a miss.
+    std::this_thread::sleep_for(std::chrono::seconds(9));
+    tracker_.reconcile();
+
+    // Net and gross both show exactly one initial miss, zero agreements.
+    EXPECT_EQ(tracker_.totalMissed(), 1u);
+    EXPECT_EQ(tracker_.totalMissedEver(), 1u);
+    EXPECT_EQ(tracker_.totalAgreements(), 0u);
+    EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
+
+    // Late arrival of our validation repairs the miss to an agreement.
+    tracker_.recordOurValidation(hash, seq);
+    tracker_.reconcile();
+
+    // Net totals reflect the repair...
+    EXPECT_EQ(tracker_.totalMissed(), 0u);
+    EXPECT_EQ(tracker_.totalAgreements(), 1u);
+    // ...but the gross tallies are frozen at first classification: the miss
+    // stays counted and no agreement was added (repair path excluded).
+    EXPECT_EQ(tracker_.totalMissedEver(), 1u);
+    EXPECT_EQ(tracker_.totalAgreementsEver(), 0u);
+}
+
+// ---------------------------------------------------------------
+// 11. Gross tallies count initial classification only (additive)
+//     With a mix of initial agreements and misses the gross tallies equal
+//     the net totals. A subsequent repair shifts the net totals but leaves
+//     the gross tallies unchanged, and the gross sum equals the number of
+//     reconciled ledgers (the additive invariant the _total counters rely on).
+// ---------------------------------------------------------------
+TEST_F(ValidationTrackerTest, GrossAgreementsCountInitialOnly)
+{
+    // 3 initial agreements: both sides validate.
+    for (int i = 1; i <= 3; ++i)
+    {
+        auto const h = makeHash(static_cast<std::uint64_t>(i));
+        tracker_.recordOurValidation(h, static_cast<LedgerIndex>(i));
+        tracker_.recordNetworkValidation(h, static_cast<LedgerIndex>(i));
+    }
+
+    // 2 initial misses: only network validates.
+    for (int i = 4; i <= 5; ++i)
+    {
+        auto const h = makeHash(static_cast<std::uint64_t>(i));
+        tracker_.recordNetworkValidation(h, static_cast<LedgerIndex>(i));
+    }
+
+    // Grace period elapses -- all five reconciled at first classification.
+    std::this_thread::sleep_for(std::chrono::seconds(9));
+    tracker_.reconcile();
+
+    // Before any repair, gross equals net.
+    EXPECT_EQ(tracker_.totalAgreements(), 3u);
+    EXPECT_EQ(tracker_.totalAgreementsEver(), 3u);
+    EXPECT_EQ(tracker_.totalMissed(), 2u);
+    EXPECT_EQ(tracker_.totalMissedEver(), 2u);
+
+    // Repair one of the misses (hash 4) within the repair window.
+    tracker_.recordOurValidation(makeHash(4), 4);
+    tracker_.reconcile();
+
+    // Net totals shift by the repair...
+    EXPECT_EQ(tracker_.totalAgreements(), 4u);
+    EXPECT_EQ(tracker_.totalMissed(), 1u);
+    // ...gross tallies stay at the initial classification.
+    EXPECT_EQ(tracker_.totalAgreementsEver(), 3u);
+    EXPECT_EQ(tracker_.totalMissedEver(), 2u);
+
+    // Additive invariant: gross agree + gross miss == ledgers reconciled.
+    EXPECT_EQ(tracker_.totalAgreementsEver() + tracker_.totalMissedEver(), 5u);
+}
--- a/src/xrpld/telemetry/MetricsRegistry.cpp
+++ b/src/xrpld/telemetry/MetricsRegistry.cpp
@@ -244,10 +244,9 @@ MetricsRegistry::start(std::string const& endpoint, std::string const& instanceI
        "xrpld_txq_expired_total", "Total transactions expired out of the transaction queue");
    txqDroppedCounter_ = meter_->CreateUInt64Counter(
        "xrpld_txq_dropped_total", "Total transactions refused admission to the queue by reason");
-    validationAgreementsCounter_ = meter_->CreateUInt64Counter(
-        "xrpld_validation_agreements_total", "Total validation agreements");
-    validationMissedCounter_ =
-        meter_->CreateUInt64Counter("xrpld_validation_missed_total", "Total validation misses");
+    // Note: xrpld_validation_agreements_total / xrpld_validation_missed_total
+    // are monotonic ObservableCounters created in registerValidationTotalsCounters()
+    // (below), observed from ValidationTracker's gross lifetime tallies.

    // Register all observable (async) gauges.
    registerAsyncGauges();
@@ -441,6 +440,7 @@ MetricsRegistry::registerAsyncGauges()
    registerStateTrackingGauge();
    registerStorageDetailGauge();
    registerValidationAgreementGauge();
+    registerValidationTotalsCounters();
 }

 void
@@ -1325,13 +1325,67 @@ MetricsRegistry::registerValidationAgreementGauge()
            }
        },
        this);
+}

-    // Note: validationAgreementsCounter_ and validationMissedCounter_ are
-    // created above but not currently incremented.  The
-    // xrpld_validation_agreement gauge already provides agreement and miss
-    // counts from ValidationTracker's rolling windows and lifetime totals.
-    // These counters are reserved for future use if a push-style counter
-    // integration with ValidationTracker is desired.
+void
+MetricsRegistry::registerValidationTotalsCounters()
+{
+    // Lifetime validation agreement/miss counters.
+    //
+    // These are monotonic ObservableCounters (not the sync Counters they used
+    // to be): a Prometheus _total must never decrease, but ValidationTracker's
+    // NET totals are non-monotonic (a late repair decrements the net miss
+    // count). We therefore observe the tracker's GROSS lifetime tallies, which
+    // count each ledger once at first classification and are never adjusted on
+    // repair (initial-classification semantics — see ValidationTracker). The
+    // repaired/agreement view remains available from xrpld_validation_agreement.
+    //
+    // reconcile() is called first so pending events are resolved before the
+    // tallies are read; the callback fires every ~10 s from the
+    // PeriodicExportingMetricReader thread.
+    validationAgreementsObservable_ = meter_->CreateInt64ObservableCounter(
+        "xrpld_validation_agreements_total",
+        "Lifetime validations that initially agreed with network consensus");
+    validationAgreementsObservable_->AddCallback(
+        [](opentelemetry::metrics::ObserverResult result, void* state) {
+            auto* self = static_cast<MetricsRegistry*>(state);
+            if (self->callbacksDetached_.load(std::memory_order_acquire))
+                return;
+            try
+            {
+                self->validationTracker_.reconcile();
+                opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<
+                    opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
+                    ->Observe(static_cast<int64_t>(self->validationTracker_.totalAgreementsEver()));
+            }
+            catch (...)  // NOLINT(bugprone-empty-catch)
+            {
+                // Silently skip on error.
+            }
+        },
+        this);
+
+    validationMissedObservable_ = meter_->CreateInt64ObservableCounter(
+        "xrpld_validation_missed_total",
+        "Lifetime validations that initially missed network consensus");
+    validationMissedObservable_->AddCallback(
+        [](opentelemetry::metrics::ObserverResult result, void* state) {
+            auto* self = static_cast<MetricsRegistry*>(state);
+            if (self->callbacksDetached_.load(std::memory_order_acquire))
+                return;
+            try
+            {
+                self->validationTracker_.reconcile();
+                opentelemetry::nostd::get<opentelemetry::nostd::shared_ptr<
+                    opentelemetry::metrics::ObserverResultT<int64_t>>>(result)
+                    ->Observe(static_cast<int64_t>(self->validationTracker_.totalMissedEver()));
+            }
+            catch (...)  // NOLINT(bugprone-empty-catch)
+            {
+                // Silently skip on error.
+            }
+        },
+        this);
 }

 #endif  // XRPL_ENABLE_TELEMETRY
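
The pull-style pattern used by `registerValidationTotalsCounters()` can be sketched without the OTel SDK. The block below is a minimal, standard-library-only illustration, not the opentelemetry-cpp API: `ToyObservableCounter`, `ToyRegistry`, and their members are hypothetical names. It shows the two properties the hunk above relies on: each collection reports the current cumulative value (not a delta), and a detach flag lets the callback bail out before touching its owner's state.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical miniature of an asynchronous (observable) counter: at each
// collection the reader pulls the current cumulative value from a callback.
class ToyObservableCounter
{
public:
    // A callback writes the cumulative value and returns true, or returns
    // false to skip this collection cycle.
    using Callback = std::function<bool(std::int64_t&)>;

    void
    addCallback(Callback cb)
    {
        assert(cb && "callback must be callable");
        callbacks_.push_back(std::move(cb));
    }

    // One collection cycle: gather whatever the callbacks choose to report.
    std::vector<std::int64_t>
    collect() const
    {
        std::vector<std::int64_t> out;
        for (auto const& cb : callbacks_)
        {
            std::int64_t value = 0;
            if (cb(value))
                out.push_back(value);
        }
        return out;
    }

private:
    std::vector<Callback> callbacks_;
};

// Hypothetical owner: a monotonic tally plus a detach flag, loosely mirroring
// totalMissedGross_ and callbacksDetached_ from the diff.
struct ToyRegistry
{
    std::atomic<std::uint64_t> grossMissed{0};
    std::atomic<bool> callbacksDetached{false};
    ToyObservableCounter missedTotal;

    void
    registerCallbacks()
    {
        missedTotal.addCallback([this](std::int64_t& out) {
            // Bail out early once the owner has begun shutting down.
            if (callbacksDetached.load(std::memory_order_acquire))
                return false;
            // Report the absolute lifetime total, never a delta.
            out = static_cast<std::int64_t>(grossMissed.load(std::memory_order_relaxed));
            return true;
        });
    }
};
```

Because the callback reports an absolute total, the value it reads must itself be monotonic, which is why the real callbacks read the gross tallies rather than the net ones.
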
--- a/src/xrpld/telemetry/MetricsRegistry.h
+++ b/src/xrpld/telemetry/MetricsRegistry.h
@@ -529,13 +529,16 @@ private:
    /// Counter: xrpld_txq_dropped_total{reason} — incremented when a transaction is refused
    /// admission to the queue.
    opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>> txqDroppedCounter_;
-    /// Counter: xrpld_validation_agreements_total — incremented by ValidationTracker on
-    /// agreement.
-    opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
-        validationAgreementsCounter_;
-    /// Counter: xrpld_validation_missed_total — incremented by ValidationTracker on miss.
-    opentelemetry::nostd::unique_ptr<opentelemetry::metrics::Counter<uint64_t>>
-        validationMissedCounter_;
+    /// ObservableCounter: xrpld_validation_agreements_total — observed from
+    /// ValidationTracker::totalAgreementsEver() (monotonic gross lifetime
+    /// tally, initial-classification semantics).
+    opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
+        validationAgreementsObservable_;
+    /// ObservableCounter: xrpld_validation_missed_total — observed from
+    /// ValidationTracker::totalMissedEver() (monotonic gross lifetime tally,
+    /// initial-classification semantics).
+    opentelemetry::nostd::shared_ptr<opentelemetry::metrics::ObservableInstrument>
+        validationMissedObservable_;

    /** Register all observable gauge callbacks with the OTel SDK.
        Dispatches to one helper per metric domain so that each helper
@@ -580,6 +583,8 @@ private:
    registerStorageDetailGauge();  // Task 7.13
    void
    registerValidationAgreementGauge();  // Task 7.15
+    void
+    registerValidationTotalsCounters();  // gap-fill: lifetime agree/miss _total
 #endif                                   // XRPL_ENABLE_TELEMETRY
 };

--- a/src/xrpld/telemetry/ValidationTracker.h
+++ b/src/xrpld/telemetry/ValidationTracker.h
@@ -186,6 +186,26 @@ public:
    uint64_t
    totalMissed() const;

+    /** Lifetime agreements counted at first classification only.
+     *  @note Unlike totalAgreements(), this is strictly monotonic: it is
+     *  incremented only when a ledger is first reconciled as an agreement and
+     *  is never adjusted by a late repair. It backs the monotonic Prometheus
+     *  counter xrpld_validation_agreements_total. See the counting-semantics
+     *  note in detail/ValidationTracker.cpp.
+     */
+    uint64_t
+    totalAgreementsEver() const;
+
+    /** Lifetime misses counted at first classification only.
+     *  @note Unlike totalMissed(), this is strictly monotonic: it is
+     *  incremented only when a ledger is first reconciled as a miss and is
+     *  never decremented by a late repair. It backs the monotonic Prometheus
+     *  counter xrpld_validation_missed_total. See the counting-semantics note
+     *  in detail/ValidationTracker.cpp.
+     */
+    uint64_t
+    totalMissedEver() const;
+
    /** Total validations this node sent. */
    uint64_t
    totalValidationsSent() const;
@@ -254,12 +274,33 @@ private:
    /// Sliding window of reconciled events (last 7 days).
    std::deque<WindowEvent> window7d_;

-    /// Lifetime count of agreements.
+    /// Lifetime count of agreements (net: incremented on agree, also on
+    /// repair). May be read via totalAgreements(); feeds the windowed gauge.
    std::atomic<uint64_t> totalAgreements_{0};

-    /// Lifetime count of misses.
+    /// Lifetime count of misses (net: incremented on miss, decremented on
+    /// repair). NON-monotonic. May be read via totalMissed().
    std::atomic<uint64_t> totalMissed_{0};

+    // Monotonic "gross" lifetime tallies for the Prometheus _total counters.
+    //
+    // Counting decision (initial-classification only): each reconciled ledger
+    // is counted exactly once, at its first classification, into exactly one
+    // of the two tallies below. A subsequent late repair (miss -> agreement)
+    // does NOT move either tally. This keeps both strictly monotonic (a
+    // Prometheus _total must never decrease) and additive:
+    //     totalAgreementsGross_ + totalMissedGross_ == ledgers reconciled.
+    // The repaired/agreement view is still available from the windowed gauge
+    // (xrpld_validation_agreement) and the net totals above.
+
+    /// Monotonic lifetime initial agreements; backs
+    /// xrpld_validation_agreements_total. Never adjusted on repair.
+    std::atomic<uint64_t> totalAgreementsGross_{0};
+
+    /// Monotonic lifetime initial misses; backs xrpld_validation_missed_total.
+    /// Never decremented on repair.
+    std::atomic<uint64_t> totalMissedGross_{0};
+
    /// Lifetime count of validations this node sent.
    std::atomic<uint64_t> totalValidationsSent_{0};

--- a/src/xrpld/telemetry/detail/ValidationTracker.cpp
+++ b/src/xrpld/telemetry/detail/ValidationTracker.cpp
@@ -63,10 +63,16 @@ ValidationTracker::reconcile()
            if (evt.agreed)
            {
                totalAgreements_.fetch_add(1, std::memory_order_relaxed);
+                // Gross tally: count the initial agreement once. See the
+                // counting-decision note below (repair branch).
+                totalAgreementsGross_.fetch_add(1, std::memory_order_relaxed);
            }
            else
            {
                totalMissed_.fetch_add(1, std::memory_order_relaxed);
+                // Gross tally: count the initial miss once. See the
+                // counting-decision note below (repair branch).
+                totalMissedGross_.fetch_add(1, std::memory_order_relaxed);
            }

            WindowEvent const we{.time = now, .ledgerHash = evt.ledgerHash, .agreed = evt.agreed};
@@ -78,11 +84,20 @@ ValidationTracker::reconcile()
            evt.reconciled && !evt.agreed && evt.weValidated && evt.networkValidated &&
            (now - evt.recordTime) <= kLateRepairWindow)
        {
-            // Late repair: was a miss, now both flags set.
+            // Late repair: was a miss, now both flags set. Adjust the NET
+            // totals (used by the windowed agreement gauge) so the live view
+            // reflects the repair.
            evt.agreed = true;
            totalMissed_.fetch_sub(1, std::memory_order_relaxed);
            totalAgreements_.fetch_add(1, std::memory_order_relaxed);

+            // Counting decision (initial-classification only): the gross
+            // tallies (totalAgreementsGross_ / totalMissedGross_) that back the
+            // monotonic Prometheus _total counters are deliberately NOT touched
+            // here. Each ledger is counted once, at first classification; a
+            // repair must neither decrement missed (a _total may never decrease)
+            // nor add a second agreement (which would double-count the ledger).
+
            // Flip the corresponding window entries from miss to agreement.
            repairWindowEntry(window1h_, evt.ledgerHash);
            repairWindowEntry(window24h_, evt.ledgerHash);
@@ -253,6 +268,18 @@ ValidationTracker::totalMissed() const
    return totalMissed_.load(std::memory_order_relaxed);
 }

+uint64_t
+ValidationTracker::totalAgreementsEver() const
+{
+    return totalAgreementsGross_.load(std::memory_order_relaxed);
+}
+
+uint64_t
+ValidationTracker::totalMissedEver() const
+{
+    return totalMissedGross_.load(std::memory_order_relaxed);
+}
+
 uint64_t
 ValidationTracker::totalValidationsSent() const
 {
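
The net-versus-gross counting decision in `reconcile()` can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the tracker itself: `TallyModel` and its members are hypothetical, there is no grace period, window, or atomics, and a repair is modeled as a bare call rather than a flag check on a recorded event.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ValidationTracker's four lifetime counters.
// Single-threaded illustration only; the real members are std::atomic.
struct TallyModel
{
    std::uint64_t netAgreements = 0;    // adjusted on repair
    std::uint64_t netMissed = 0;        // adjusted on repair (non-monotonic)
    std::uint64_t grossAgreements = 0;  // first classification only
    std::uint64_t grossMissed = 0;      // first classification only

    // First classification of a ledger: the net and gross views both move.
    void
    classify(bool agreed)
    {
        if (agreed)
        {
            ++netAgreements;
            ++grossAgreements;
        }
        else
        {
            ++netMissed;
            ++grossMissed;
        }
    }

    // Late repair of a previously counted miss: only the net view moves.
    // Returns false (and changes nothing) if there is no miss to repair.
    bool
    repairMiss()
    {
        if (netMissed == 0)
            return false;
        --netMissed;
        ++netAgreements;
        return true;
    }

    // Ledgers reconciled so far, per the additive invariant on gross tallies.
    std::uint64_t
    reconciled() const
    {
        return grossAgreements + grossMissed;
    }

    // In this model a repair only moves a ledger between the net buckets,
    // so the net sum and the gross sum always match.
    void
    checkInvariant() const
    {
        assert(netAgreements + netMissed == grossAgreements + grossMissed);
    }
};
```

Replaying test 11 against the model (three agreements, two misses, one repair) leaves the net view at 4 and 1 while the gross view stays at 3 and 2, and `reconciled()` still returns 5.
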