feat(telemetry): derive per-stage tx metrics from apply-pipeline spans

Wire the apply-pipeline stage spans (tx.preflight, tx.preclaim,
tx.transactor) added on phase-3 through the observability stack so the
spanmetrics connector produces per-stage RED metrics without any native
instruments.

- collector: add the `stage` dimension to the spanmetrics connector so
  the three stages split into separate metric series (3 bounded values).
- dashboard: add a "Tx Apply Pipeline" section to transaction-overview
  with rate, p95 latency, and failure-rate panels grouped by stage, plus
  a `stage` template variable. Panels follow the existing config (node
  filter, exported_instance legends, Title Case, axis labels).
- The failure panel filters ter_result != tesSUCCESS rather than span
  status, because a failing ter code completes the span normally; only
  thrown exceptions set an error status. This matches the existing
  "Transaction Results by Type" panel convention.
- docs: document the spans, attributes, and stage dimension in the data
  collection reference and runbook, including the sampling caveat that
  span-derived metrics inherit tracer head-sampling and undercount at
  sampling_ratio < 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-27 00:50:45 +00:00 · 2026-06-05 12:42:53 +01:00
parent 759d3506b2
commit 3167a49f41
4 changed files with 251 additions and 27 deletions
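
Reviewer note on the failure-rate filter: the panel and runbook queries in this change select failures with the matcher `ter_result!~"tesSUCCESS|"`. Prometheus anchors regex matchers at both ends and treats a missing label as the empty string, so the trailing `|` excludes series that carry no `ter_result` at all as well as the successful ones. The sketch below reproduces that rule in Python for illustration only. The helper names are hypothetical and not part of this commit, and Python's `re` stands in for Prometheus' RE2 engine, which behaves the same for a pattern this simple.

```python
import re
from typing import Optional

# Matcher pattern copied from the dashboard and runbook queries in this commit.
FAILURE_FILTER = "tesSUCCESS|"


def negative_regex_match(pattern: str, label_value: str) -> bool:
    """Hypothetical helper: mimic a Prometheus `label!~"pattern"` matcher.

    Prometheus regex matchers are fully anchored, so the pattern must match
    the whole label value for the series to be excluded.
    """
    return re.fullmatch(pattern, label_value) is None


def counts_as_failure(ter_result: Optional[str]) -> bool:
    """Return True if a series with this ter_result is kept by the filter.

    A missing label is treated as the empty string, as Prometheus does.
    """
    return negative_regex_match(FAILURE_FILTER, ter_result or "")


if __name__ == "__main__":
    for value in ("tesSUCCESS", "terPRE_SEQ", "", None):
        print(repr(value), counts_as_failure(value))
```

Under these assumptions, only series with a non-empty, non-`tesSUCCESS` result count toward the failure panels, so spans that never set `ter_result` do not inflate the failure rate.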
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -102,13 +102,23 @@ Controlled by `trace_rpc=1` in `[telemetry]` config.

 Controlled by `trace_transactions=1` in `[telemetry]` config.

-| Span Name    | Parent         | Source File     | Description                                                       |
-| ------------ | -------------- | --------------- | ----------------------------------------------------------------- |
-| `tx.process` | —              | NetworkOPs.cpp  | Transaction submission entry point (local or peer-relayed)        |
-| `tx.receive` | —              | PeerImp.cpp     | Raw transaction received from peer overlay (before deduplication) |
-| `tx.apply`   | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus            |
+| Span Name       | Parent         | Source File     | Description                                                       |
+| --------------- | -------------- | --------------- | ----------------------------------------------------------------- |
+| `tx.process`    | —              | NetworkOPs.cpp  | Transaction submission entry point (local or peer-relayed)        |
+| `tx.receive`    | —              | PeerImp.cpp     | Raw transaction received from peer overlay (before deduplication) |
+| `tx.apply`      | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus            |
+| `tx.preflight`  | —              | applySteps.cpp  | Stateless checks stage (`stage=preflight`)                        |
+| `tx.preclaim`   | —              | applySteps.cpp  | Ledger-aware checks stage before fee claim (`stage=preclaim`)     |
+| `tx.transactor` | —              | Transactor.cpp  | Apply stage — the transactor runs (`stage=apply`)                 |
+
+The three apply-pipeline spans share a deterministic `trace_id` derived from
+`txID[0:16]`, so the preflight, preclaim, and transactor spans of one
+transaction group under a single trace even though they run at different times
+and often on different threads. A transaction that hard-fails preflight or
+preclaim never reaches the later spans; its last `stage` shows where it stopped.

 **Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"tx.process|tx.receive"}`
+or, for the apply pipeline: `{resource.service.name="xrpld" && name=~"tx.preflight|tx.preclaim|tx.transactor"}`

 **Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)

@@ -229,15 +239,19 @@ Every span can carry key-value attributes that provide context for filtering and

 #### Transaction Attributes

-| Attribute           | Type    | Set On                     | Description                                          |
-| ------------------- | ------- | -------------------------- | ---------------------------------------------------- |
-| `xrpl.tx.hash`      | string  | `tx.process`, `tx.receive` | Transaction hash (hex-encoded)                       |
-| `local`             | boolean | `tx.process`               | `true` if locally submitted, `false` if peer-relayed |
-| `path`              | string  | `tx.process`               | Submission path: `"sync"` or `"async"`               |
-| `suppressed`        | boolean | `tx.receive`               | `true` if transaction was suppressed (duplicate)     |
-| `tx_status`         | string  | `tx.receive`               | Transaction status (e.g., `"known_bad"`)             |
-| `xrpl.peer.id`      | int64   | `tx.receive`               | Peer identifier (also set on peer spans)             |
-| `xrpl.peer.version` | string  | `tx.receive`               | Peer protocol version string                         |
+| Attribute           | Type    | Set On                                         | Description                                                           |
+| ------------------- | ------- | ---------------------------------------------- | --------------------------------------------------------------------- |
+| `xrpl.tx.hash`      | string  | `tx.process`, `tx.receive`                     | Transaction hash (hex-encoded)                                        |
+| `local`             | boolean | `tx.process`                                   | `true` if locally submitted, `false` if peer-relayed                  |
+| `path`              | string  | `tx.process`                                   | Submission path: `"sync"` or `"async"`                                |
+| `suppressed`        | boolean | `tx.receive`                                   | `true` if transaction was suppressed (duplicate)                      |
+| `tx_status`         | string  | `tx.receive`                                   | Transaction status (e.g., `"known_bad"`)                              |
+| `xrpl.peer.id`      | int64   | `tx.receive`                                   | Peer identifier (also set on peer spans)                              |
+| `xrpl.peer.version` | string  | `tx.receive`                                   | Peer protocol version string                                          |
+| `stage`             | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Apply-pipeline stage: `preflight`, `preclaim`, or `apply`             |
+| `tx_type`           | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Transaction type name (e.g., `Payment`)                               |
+| `ter_result`        | string  | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Engine result token for that stage (e.g., `tesSUCCESS`, `terPRE_SEQ`) |
+| `applied`           | boolean | `tx.transactor`                                | `true` if the transaction was applied to the ledger                   |

 **Tempo query**: `{span.xrpl.tx.hash="<hash>"}` to trace a specific transaction across nodes.

@@ -375,14 +389,25 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er

 **Additional dimension labels** (configured in `otel-collector-config.yaml`):

-| Span Attribute        | Prometheus Label               | Applies To                |
-| --------------------- | ------------------------------ | ------------------------- |
-| `command`             | `xrpl_rpc_command`             | `rpc.command.*`           |
-| `rpc_status`          | `xrpl_rpc_status`              | `rpc.command.*`           |
-| `xrpl.consensus.mode` | `xrpl_consensus_mode`          | `consensus.ledger_close`  |
-| `local`               | `xrpl_tx_local`                | `tx.process`              |
-| `proposal_trusted`    | `xrpl_peer_proposal_trusted`   | `peer.proposal.receive`   |
-| `validation_trusted`  | `xrpl_peer_validation_trusted` | `peer.validation.receive` |
+| Span Attribute        | Prometheus Label               | Applies To                                     |
+| --------------------- | ------------------------------ | ---------------------------------------------- |
+| `command`             | `xrpl_rpc_command`             | `rpc.command.*`                                |
+| `rpc_status`          | `xrpl_rpc_status`              | `rpc.command.*`                                |
+| `xrpl.consensus.mode` | `xrpl_consensus_mode`          | `consensus.ledger_close`                       |
+| `local`               | `xrpl_tx_local`                | `tx.process`                                   |
+| `proposal_trusted`    | `xrpl_peer_proposal_trusted`   | `peer.proposal.receive`                        |
+| `validation_trusted`  | `xrpl_peer_validation_trusted` | `peer.validation.receive`                      |
+| `stage`               | `stage`                        | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
+
+The `stage` dimension (3 values: `preflight`, `preclaim`, `apply`) turns the
+apply-pipeline spans into per-stage RED metrics with no native instruments — the
+_Transaction Overview_ dashboard charts rate, p95 latency, and failure rate by stage.
+
+> **Sampling caveat**: span-derived metrics inherit the **tracer head-sampling**
+> ratio (`sampling_ratio` in `[telemetry]`, via `TraceIdRatioBasedSampler`). At
+> `sampling_ratio < 1.0` the stage RED metrics undercount proportionally — they
+> reflect sampled traces, not the full transaction volume. Native StatsD/meter
+> metrics do not sample. Account for this when reading absolute stage rates.

 **Where to query**: Prometheus → `traces_span_metrics_calls_total{span_name="rpc.command.server_info"}`

--- a/docker/telemetry/grafana/dashboards/transaction-overview.json
+++ b/docker/telemetry/grafana/dashboards/transaction-overview.json
@@ -669,6 +669,138 @@
        },
        "overrides": []
      }
+    },
+    {
+      "title": "Tx Apply Pipeline Rate by Stage",
+      "description": "Span rate for each apply-pipeline stage (preflight, preclaim, apply). A drop between stages shows where transactions are filtered out. Requires the stage dimension in spanmetrics.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 64
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        },
+        "legend": {
+          "displayMode": "table",
+          "placement": "right",
+          "calcs": ["mean", "max"]
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "sum by (stage, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\"}[5m]))",
+          "legendFormat": "{{stage}} [{{exported_instance}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Spans / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Tx Apply Pipeline Latency by Stage (p95)",
+      "description": "95th-percentile duration of each apply-pipeline stage. Isolates which stage (preflight, preclaim, apply) dominates transaction processing time.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 64
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        },
+        "legend": {
+          "displayMode": "table",
+          "placement": "right",
+          "calcs": ["mean", "max"]
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "histogram_quantile(0.95, sum by (le, stage, exported_instance) (rate(traces_span_metrics_duration_milliseconds_bucket{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\"}[5m])))",
+          "legendFormat": "P95 {{stage}} [{{exported_instance}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Duration (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Tx Apply Pipeline Failure Rate by Stage",
+      "description": "Rate of apply-pipeline spans whose ter_result is not tesSUCCESS, split by stage. Shows whether failures concentrate in preflight, preclaim, or apply. Filters on ter_result rather than span status because a failing ter code completes the span normally; only thrown exceptions set an error status.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 24,
+        "x": 0,
+        "y": 72
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        },
+        "legend": {
+          "displayMode": "table",
+          "placement": "right",
+          "calcs": ["mean", "max"]
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "sum by (stage, exported_instance) (rate(traces_span_metrics_calls_total{exported_instance=~\"$node\", span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage=~\"$stage\", ter_result!~\"tesSUCCESS|\"}[5m]))",
+          "legendFormat": "{{stage}} [{{exported_instance}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Failed Spans / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
    }
  ],
  "schemaVersion": 39,
@@ -768,6 +900,24 @@
        },
        "sort": 1,
        "label": "Queue Status"
+      },
+      {
+        "name": "stage",
+        "type": "query",
+        "datasource": {
+          "type": "prometheus"
+        },
+        "query": "label_values(traces_span_metrics_calls_total{span_name=~\"tx.preflight|tx.preclaim|tx.transactor\", stage!=\"\"}, stage)",
+        "refresh": 2,
+        "includeAll": true,
+        "multi": true,
+        "allValue": ".*",
+        "current": {
+          "text": "All",
+          "value": "$__all"
+        },
+        "sort": 1,
+        "label": "Apply Stage"
      }
    ]
  },
--- a/docker/telemetry/otel-collector-config.yaml
+++ b/docker/telemetry/otel-collector-config.yaml
@@ -59,6 +59,9 @@ connectors:
      - name: validation_trusted
      - name: tx_type
      - name: ter_result
+      # Apply-pipeline stage (preflight|preclaim|apply) — splits the
+      # tx.preflight/tx.preclaim/tx.transactor span RED metrics per stage.
+      - name: stage
      - name: txq_status
      - name: consensus_state
      - name: load_type
--- a/docs/telemetry-runbook.md
+++ b/docs/telemetry-runbook.md
@@ -74,11 +74,20 @@ All spans instrumented in xrpld, grouped by subsystem:

 ### Transaction Spans (Phase 3)

-| Span Name    | Source File     | Attributes                                                                        | Description                           |
-| ------------ | --------------- | --------------------------------------------------------------------------------- | ------------------------------------- |
-| `tx.process` | NetworkOPs.cpp  | `tx_hash`, `local`, `path`, `tx_type`, `fee`, `sequence`, `ter_result`, `applied` | Transaction submission and processing |
-| `tx.receive` | PeerImp.cpp     | `peer_id`, `tx_hash`, `tx_type`, `peer_version`, `suppressed`, `tx_status`        | Transaction received from peer relay  |
-| `tx.apply`   | BuildLedger.cpp | `ledger_seq`, `tx_count`, `tx_failed`                                             | Transaction set applied per ledger    |
+| Span Name       | Source File     | Attributes                                                                        | Description                           |
+| --------------- | --------------- | --------------------------------------------------------------------------------- | ------------------------------------- |
+| `tx.process`    | NetworkOPs.cpp  | `tx_hash`, `local`, `path`, `tx_type`, `fee`, `sequence`, `ter_result`, `applied` | Transaction submission and processing |
+| `tx.receive`    | PeerImp.cpp     | `peer_id`, `tx_hash`, `tx_type`, `peer_version`, `suppressed`, `tx_status`        | Transaction received from peer relay  |
+| `tx.apply`      | BuildLedger.cpp | `ledger_seq`, `tx_count`, `tx_failed`                                             | Transaction set applied per ledger    |
+| `tx.preflight`  | applySteps.cpp  | `stage`, `tx_type`, `ter_result`                                                  | Stateless checks stage                |
+| `tx.preclaim`   | applySteps.cpp  | `stage`, `tx_type`, `ter_result`                                                  | Ledger-aware checks stage             |
+| `tx.transactor` | Transactor.cpp  | `stage`, `tx_type`, `ter_result`, `applied`                                       | Apply stage (transactor runs)         |
+
+The three apply-pipeline spans (`tx.preflight`, `tx.preclaim`, `tx.transactor`)
+share a deterministic `trace_id` from `txID[0:16]`, so they group under one
+trace per transaction. The `stage` attribute (`preflight` / `preclaim` /
+`apply`) drives the collector spanmetrics `stage` dimension, giving per-stage
+RED metrics on the _Transaction Overview_ dashboard.

 ### Transaction Queue Spans (Phase 3)

@@ -182,6 +191,43 @@ This section shows what questions you can answer using the span attributes, with
 {name=~"tx\\..*"} | tx_type = "NFTokenMint"
 ```

+### Apply Pipeline by Stage
+
+```
+# Spans from all three apply-pipeline stages (preflight -> preclaim -> apply), across all transactions
+{name=~"tx.preflight|tx.preclaim|tx.transactor"}
+
+# Transactions that failed at the preclaim stage
+{name="tx.preclaim"} | ter_result != "tesSUCCESS"
+
+# Transactions that hard-failed preflight (never reached preclaim/apply)
+{name="tx.preflight"} | ter_result != "tesSUCCESS"
+```
+
+PromQL on the span-derived metrics (dashboard: _Transaction Overview_):
+
+```
+# Per-stage throughput — the funnel preflight >= preclaim >= apply
+sum by (stage) (rate(traces_span_metrics_calls_total{span_name=~"tx.preflight|tx.preclaim|tx.transactor"}[5m]))
+
+# Per-stage p95 latency
+histogram_quantile(0.95, sum by (le, stage) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"tx.preflight|tx.preclaim|tx.transactor"}[5m])))
+
+# Per-stage failure rate (ter_result != tesSUCCESS; a failing ter completes the
+# span normally, so filter on the attribute, not status_code which only flags exceptions)
+sum by (stage) (rate(traces_span_metrics_calls_total{span_name=~"tx.preflight|tx.preclaim|tx.transactor", ter_result!~"tesSUCCESS|"}[5m]))
+```
+
+> **Alerting**: a rising `tx.preflight` / `tx.preclaim` failure rate points to
+> malformed or stale-sequence submissions (often spam or a misbehaving client);
+> a rising `tx.transactor` failure rate points to apply-time problems. Alert per
+> stage rather than on a single aggregate so the failing stage is obvious.
+
+> **Sampling caveat**: these stage metrics are span-derived and inherit the
+> **tracer head-sampling** ratio (`sampling_ratio`). At `sampling_ratio < 1.0`
+> they undercount proportionally — treat them as relative trends, not absolute
+> transaction counts. Native StatsD metrics are unsampled.
+
 ### Transaction Queue Health

 ```