feat(telemetry): derive per-stage tx metrics from apply-pipeline spans

Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally — only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Pratik Mankawde
2026-06-05 12:42:53 +01:00
parent 759d3506b2
commit 3167a49f41
4 changed files with 251 additions and 27 deletions

View File

@@ -102,13 +102,23 @@ Controlled by `trace_rpc=1` in `[telemetry]` config.
Controlled by `trace_transactions=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------ | -------------- | --------------- | ----------------------------------------------------------------- |
| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
| Span Name | Parent | Source File | Description |
| --------------- | -------------- | --------------- | ----------------------------------------------------------------- |
| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
| `tx.preflight` | — | applySteps.cpp | Stateless checks stage (`stage=preflight`) |
| `tx.preclaim` | — | applySteps.cpp | Ledger-aware checks stage before fee claim (`stage=preclaim`) |
| `tx.transactor` | — | Transactor.cpp | Apply stage — the transactor runs (`stage=apply`) |
The three apply-pipeline spans share a deterministic `trace_id` derived from
`txID[0:16]`, so preflight, preclaim, and transactor for one transaction group
under a single trace even though they run sequentially and often on different
threads. A transaction that hard-fails preflight or preclaim never reaches the
later spans — the `stage` attribute identifies where it stopped.
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"tx.process|tx.receive"}`
or, for the apply pipeline: `{resource.service.name="xrpld" && name=~"tx.preflight|tx.preclaim|tx.transactor"}`
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
@@ -229,15 +239,19 @@ Every span can carry key-value attributes that provide context for filtering and
#### Transaction Attributes
| Attribute | Type | Set On | Description |
| ------------------- | ------- | -------------------------- | ---------------------------------------------------- |
| `xrpl.tx.hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) |
| `local` | boolean | `tx.process` | `true` if locally submitted, `false` if peer-relayed |
| `path` | string | `tx.process` | Submission path: `"sync"` or `"async"` |
| `suppressed` | boolean | `tx.receive` | `true` if transaction was suppressed (duplicate) |
| `tx_status` | string | `tx.receive` | Transaction status (e.g., `"known_bad"`) |
| `xrpl.peer.id` | int64 | `tx.receive` | Peer identifier (also set on peer spans) |
| `xrpl.peer.version` | string | `tx.receive` | Peer protocol version string |
| Attribute | Type | Set On | Description |
| ------------------- | ------- | ---------------------------------------------- | --------------------------------------------------------------------- |
| `xrpl.tx.hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) |
| `local` | boolean | `tx.process` | `true` if locally submitted, `false` if peer-relayed |
| `path` | string | `tx.process` | Submission path: `"sync"` or `"async"` |
| `suppressed` | boolean | `tx.receive` | `true` if transaction was suppressed (duplicate) |
| `tx_status` | string | `tx.receive` | Transaction status (e.g., `"known_bad"`) |
| `xrpl.peer.id` | int64 | `tx.receive` | Peer identifier (also set on peer spans) |
| `xrpl.peer.version` | string | `tx.receive` | Peer protocol version string |
| `stage` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Apply-pipeline stage: `preflight`, `preclaim`, or `apply` |
| `tx_type` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Transaction type name (e.g., `Payment`) |
| `ter_result` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Engine result token for that stage (e.g., `tesSUCCESS`, `terPRE_SEQ`) |
| `applied` | boolean | `tx.transactor` | `true` if the transaction was applied to the ledger |
**Tempo query**: `{span.xrpl.tx.hash="<hash>"}` to trace a specific transaction across nodes.
@@ -375,14 +389,25 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er
**Additional dimension labels** (configured in `otel-collector-config.yaml`):
| Span Attribute | Prometheus Label | Applies To |
| --------------------- | ------------------------------ | ------------------------- |
| `command` | `xrpl_rpc_command` | `rpc.command.*` |
| `rpc_status` | `xrpl_rpc_status` | `rpc.command.*` |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` |
| `local` | `xrpl_tx_local` | `tx.process` |
| `proposal_trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` |
| `validation_trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` |
| Span Attribute | Prometheus Label | Applies To |
| --------------------- | ------------------------------ | ---------------------------------------------- |
| `command` | `xrpl_rpc_command` | `rpc.command.*` |
| `rpc_status` | `xrpl_rpc_status` | `rpc.command.*` |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` |
| `local` | `xrpl_tx_local` | `tx.process` |
| `proposal_trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` |
| `validation_trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` |
| `stage` | `stage` | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
The `stage` dimension (3 values: `preflight`, `preclaim`, `apply`) turns the
apply-pipeline spans into per-stage RED metrics with no native instruments — the
_Transaction Overview_ dashboard charts rate, p95 latency, and failure rate by stage.
> **Sampling caveat**: span-derived metrics inherit the **tracer head-sampling**
> ratio (`sampling_ratio` in `[telemetry]`, via `TraceIdRatioBasedSampler`). At
> `sampling_ratio < 1.0` the stage RED metrics undercount proportionally — they
> reflect sampled traces, not the full transaction volume. Native StatsD/meter
> metrics do not sample. Account for this when reading absolute stage rates.
**Where to query**: Prometheus → `traces_span_metrics_calls_total{span_name="rpc.command.server_info"}`

View File

@@ -669,6 +669,138 @@
},
"overrides": []
}
},
{
"title": "Tx Apply Pipeline Rate by Stage",
"description": "Span rate for each apply-pipeline stage (preflight, preclaim, apply). A drop between stages shows where transactions are filtered out. Requires the stage dimension in spanmetrics.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 64
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (stage, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\"}[5m]))",
"legendFormat": "{{stage}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Spans / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Tx Apply Pipeline Latency by Stage (p95)",
"description": "95th-percentile duration of each apply-pipeline stage. Isolates which stage (preflight, preclaim, apply) dominates transaction processing time.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 64
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "histogram_quantile(0.95, sum by (le, stage, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\"}[5m])))",
"legendFormat": "P95 {{stage}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Tx Apply Pipeline Failure Rate by Stage",
"description": "Rate of apply-pipeline spans whose ter_result is not tesSUCCESS, split by stage. Shows whether failures concentrate in preflight, preclaim, or apply. Filters on ter_result rather than span status because a failing ter code completes the span normally; only thrown exceptions set an error status.",
"type": "timeseries",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 72
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "max"]
}
},
"targets": [
{
"datasource": {
"type": "prometheus"
},
"expr": "sum by (stage, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\", ter_result!~\"tesSUCCESS|\"}[5m]))",
"legendFormat": "{{stage}} [{{exported_instance}}]"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Failed Spans / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
}
],
"schemaVersion": 39,
@@ -768,6 +900,24 @@
},
"sort": 1,
"label": "Queue Status"
},
{
"name": "stage",
"type": "query",
"datasource": {
"type": "prometheus"
},
"query": "label_values(traces_span_metrics_calls_total{span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage!=\"\"}, stage)",
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"sort": 1,
"label": "Apply Stage"
}
]
},

View File

@@ -59,6 +59,9 @@ connectors:
- name: validation_trusted
- name: tx_type
- name: ter_result
# Apply-pipeline stage (preflight|preclaim|apply) — splits the
# tx.preflight/tx.preclaim/tx.transactor span RED metrics per stage.
- name: stage
- name: txq_status
- name: consensus_state
- name: load_type

View File

@@ -74,11 +74,20 @@ All spans instrumented in xrpld, grouped by subsystem:
### Transaction Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------ | --------------- | --------------------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp | `tx_hash`, `local`, `path`, `tx_type`, `fee`, `sequence`, `ter_result`, `applied` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp | `peer_id`, `tx_hash`, `tx_type`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp | `ledger_seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
| Span Name | Source File | Attributes | Description |
| --------------- | --------------- | --------------------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp | `tx_hash`, `local`, `path`, `tx_type`, `fee`, `sequence`, `ter_result`, `applied` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp | `peer_id`, `tx_hash`, `tx_type`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp | `ledger_seq`, `tx_count`, `tx_failed` | Transaction set applied per ledger |
| `tx.preflight` | applySteps.cpp | `stage`, `tx_type`, `ter_result` | Stateless checks stage |
| `tx.preclaim` | applySteps.cpp | `stage`, `tx_type`, `ter_result` | Ledger-aware checks stage |
| `tx.transactor` | Transactor.cpp | `stage`, `tx_type`, `ter_result`, `applied` | Apply stage (transactor runs) |
The three apply-pipeline spans (`tx.preflight`, `tx.preclaim`, `tx.transactor`)
share a deterministic `trace_id` from `txID[0:16]`, so they group under one
trace per transaction. The `stage` attribute (`preflight` / `preclaim` /
`apply`) drives the collector spanmetrics `stage` dimension, giving per-stage
RED metrics on the _Transaction Overview_ dashboard.
### Transaction Queue Spans (Phase 3)
@@ -182,6 +191,43 @@ This section shows what questions you can answer using the span attributes, with
{name=~"tx\\..*"} | tx_type = "NFTokenMint"
```
### Apply Pipeline by Stage
```
# All three stages of one transaction (preflight -> preclaim -> apply)
{name=~"tx.preflight|tx.preclaim|tx.transactor"}
# Transactions that failed at the preclaim stage
{name="tx.preclaim"} | ter_result != "tesSUCCESS"
# Transactions that hard-failed preflight (never reached preclaim/apply)
{name="tx.preflight"} | ter_result != "tesSUCCESS"
```
PromQL on the span-derived metrics (dashboard: _Transaction Overview_):
```
# Per-stage throughput — the funnel preflight >= preclaim >= apply
sum by (stage) (rate(traces_span_metrics_calls_total{span_name=~"tx.preflight|tx.preclaim|tx.transactor"}[5m]))
# Per-stage p95 latency
histogram_quantile(0.95, sum by (le, stage) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"tx.preflight|tx.preclaim|tx.transactor"}[5m])))
# Per-stage failure rate (ter_result != tesSUCCESS; a failing ter completes the
# span normally, so filter on the attribute, not status_code which only flags exceptions)
sum by (stage) (rate(traces_span_metrics_calls_total{span_name=~"tx.preflight|tx.preclaim|tx.transactor", ter_result!~"tesSUCCESS|"}[5m]))
```
> **Alerting**: a rising `tx.preflight` / `tx.preclaim` failure rate points to
> malformed or stale-sequence submissions (often spam or a misbehaving client);
> a rising `tx.transactor` failure rate points to apply-time problems. Alert per
> stage rather than on a single aggregate so the failing stage is obvious.
> **Sampling caveat**: these stage metrics are span-derived and inherit the
> **tracer head-sampling** ratio (`sampling_ratio`). At `sampling_ratio < 1.0`
> they undercount proportionally — treat them as relative trends, not absolute
> transaction counts. Native StatsD metrics are unsampled.
### Transaction Queue Health
```