feat(telemetry): add missing StatsD dashboard panels from production dashboard

Compared the shared production Grafana dashboard against the Phase 6 StatsD dashboards and added 10 missing panels covering job execution/dequeue timers, cache metrics, ledger publish gap, state duration rate, duplicate traffic, and a detailed traffic breakdown.

- Node Health dashboard: 8 → 16 panels, plus a quantile template variable.
- Network Traffic dashboard: 8 → 10 panels; Total Network Bytes now uses rate().
- Updated the runbook, data collection reference, and implementation phases docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-21 14:11:07 +00:00 · 2026-04-29 14:02:27 +01:00
parent a1cb752745
commit b933e8ae00
7 changed files with 710 additions and 40 deletions
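
The commit message notes that Total Network Bytes now uses rate(). A simplified Python sketch of what that means, assuming the `Bytes_In`/`Bytes_Out` series behave as cumulative totals (as the updated panel description implies). `simple_rate` is a hypothetical helper, not a Prometheus API; it approximates the idea behind PromQL `rate()` (per-second increase over a window, tolerating resets) and omits the window-boundary extrapolation that Prometheus performs.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, cumulative byte total at that time)
Sample = Tuple[float, float]


def simple_rate(samples: Sequence[Sample]) -> float:
    """Approximate the per-second increase across a window of samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop is treated as a restart from zero, so the whole
        # current value counts as new traffic.
        increase += curr if curr < prev else curr - prev

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return increase / elapsed


if __name__ == "__main__":
    window = [(0.0, 1000.0), (60.0, 7000.0), (120.0, 13000.0)]
    print(simple_rate(window))  # bytes per second over the window
```

For the window in the example, 12000 bytes arrive over 120 seconds, so the sketch reports 100 bytes per second, which is the kind of value the panel's `Bps` unit expects.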
--- a/OpenTelemetryPlan/06-implementation-phases.md
+++ b/OpenTelemetryPlan/06-implementation-phases.md
@@ -343,8 +343,8 @@ xrpld has a mature metrics framework (`beast::insight`) that emits StatsD-format
 | 6.2  | Add `statsd` receiver to OTel Collector config                                                                  |
 | 6.3  | Expose UDP port 8125 in docker-compose.yml                                                                      |
 | 6.4  | Add `[insight]` config to integration test node configs                                                         |
-| 6.5  | Create "Node Health" Grafana dashboard (8 panels)                                                               |
-| 6.6  | Create "Network Traffic" Grafana dashboard (8 panels)                                                           |
+| 6.5  | Create "Node Health" Grafana dashboard (16 panels)                                                              |
+| 6.6  | Create "Network Traffic" Grafana dashboard (10 panels)                                                          |
 | 6.7  | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels)                                                |
 | 6.8  | Update integration test to verify StatsD metrics in Prometheus                                                  |
 | 6.9  | Update TESTING.md and telemetry-runbook.md                                                                      |
@@ -359,11 +359,11 @@ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffi

 **Node Health** (`statsd-node-health.json`, uid: `xrpld-statsd-node-health`):

- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches
+- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches, Key Jobs Execution/Dequeue Time, FullBelowCache Size/Hit Rate, Ledger Publish Gap, State Duration Rate, All Jobs Detail

 **Network Traffic** (`statsd-network-traffic.json`, uid: `xrpld-statsd-network`):

- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories
+- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories, Duplicate Traffic, All Traffic Categories Detail

 **RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `xrpld-statsd-rpc`):

--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -425,6 +425,8 @@ prefix=rippled
 | `rippled_Peer_Finder_Active_Outbound_Peers`         | PeerfinderManager.cpp | Active outbound peer connections         | 10–21                           |
 | `rippled_Overlay_Peer_Disconnects`                  | OverlayImpl.cpp       | Cumulative peer disconnection count      | Low growth                      |
 | `rippled_job_count`                                 | JobQueue.cpp          | Current job queue depth                  | 0–100 (healthy)                 |
+| `rippled_Node_family_full_below_cache_size`         | TaggedCache.h         | FullBelowCache entry count               | Varies                          |
+| `rippled_Node_family_full_below_cache_hit_rate`     | TaggedCache.h         | FullBelowCache hit rate percentage       | 0–100                           |

 **Grafana dashboard**: _Node Health (StatsD)_ (`xrpld-statsd-node-health`)

@@ -484,6 +486,35 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo

 **Grafana dashboards**: _Network Traffic_ (`xrpld-statsd-network`), _Overlay Traffic Detail_ (`xrpld-statsd-overlay-detail`), _Ledger Data & Sync_ (`xrpld-statsd-ledger-sync`)

+### 2.5 Per-Job Timer Events
+
+For each of the 36 non-special job types (defined in `JobTypes.h`), two StatsD timer events are emitted:
+
+- `rippled_{jobName}` — execution duration
+- `rippled_{jobName}_q` — dequeue wait time
+
+These produce summary metrics with quantiles (0th, 50th, 90th, 95th, 99th, 100th).
+
+**Key job types** (most operationally relevant):
+
+| Job Name            | Source Enum      | Description                   |
+| ------------------- | ---------------- | ----------------------------- |
+| `acceptLedger`      | `jtACCEPT`       | Consensus round acceptance    |
+| `advanceLedger`     | `jtADVANCE`      | Ledger advancement            |
+| `transaction`       | `jtTRANSACTION`  | Transaction processing        |
+| `writeObjects`      | `jtWRITE`        | Database object writes        |
+| `publishNewLedger`  | `jtPUBLEDGER`    | New ledger publication        |
+| `trustedValidation` | `jtVALIDATION_t` | Trusted validation processing |
+| `trustedProposal`   | `jtPROPOSAL_t`   | Trusted proposal processing   |
+| `clientRPC`         | `jtCLIENT_RPC`   | Client RPC request handling   |
+| `heartbeat`         | `jtNETOP_TIMER`  | Network heartbeat timer       |
+| `sweep`             | `jtSWEEP`        | Cache sweep / cleanup         |
+| `ledgerData`        | `jtLEDGER_DATA`  | Ledger data processing        |
+
+Special job types (`limit=0`: `peerCommand`, `diskAccess`, `processTransaction`, `orderBookSetup`, `pathFind`, `nodeRead`, `nodeWrite`, `generic`, `SyncReadNode`, `AsyncReadNode`, `WriteNode`) do **not** emit timer events.
+
+**Grafana dashboard**: _Node Health (StatsD)_ (`xrpld-statsd-node-health`) — Key Jobs and All Jobs panels
+
 ---

 ## 3. Grafana Dashboard Reference
@@ -502,13 +533,13 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo

 ### 3.2 StatsD Dashboards (5)

-| Dashboard              | UID                           | Data Source         | Key Panels                                                                        |
-| ---------------------- | ----------------------------- | ------------------- | --------------------------------------------------------------------------------- |
-| Node Health            | `xrpld-statsd-node-health`    | Prometheus (StatsD) | Ledger age, operating mode, I/O latency, job queue, fetch rate                    |
-| Network Traffic        | `xrpld-statsd-network`        | Prometheus (StatsD) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category     |
-| RPC & Pathfinding      | `xrpld-statsd-rpc`            | Prometheus (StatsD) | RPC rate, response time/size, pathfinding duration, resource warnings/drops       |
-| Overlay Traffic Detail | `xrpld-statsd-overlay-detail` | Prometheus (StatsD) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
-| Ledger Data & Sync     | `xrpld-statsd-ledger-sync`    | Prometheus (StatsD) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
+| Dashboard              | UID                           | Data Source         | Key Panels                                                                                                                                         |
+| ---------------------- | ----------------------------- | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Node Health            | `xrpld-statsd-node-health`    | Prometheus (StatsD) | Ledger age, operating mode, I/O latency, job queue, fetch rate, key/all jobs execution time, cache size/hit rate, publish gap, state duration rate |
+| Network Traffic        | `xrpld-statsd-network`        | Prometheus (StatsD) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category, duplicate traffic, all traffic categories detail                    |
+| RPC & Pathfinding      | `xrpld-statsd-rpc`            | Prometheus (StatsD) | RPC rate, response time/size, pathfinding duration, resource warnings/drops                                                                        |
+| Overlay Traffic Detail | `xrpld-statsd-overlay-detail` | Prometheus (StatsD) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths                                                                  |
+| Ledger Data & Sync     | `xrpld-statsd-ledger-sync`    | Prometheus (StatsD) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap                                                                  |

 ### 3.3 Consensus Close-Time Panels

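
The naming scheme described in section 2.5 above can be sketched in Python. This is an illustration only, not code from xrpld: `timer_metric_names` and `all_jobs_matcher` are hypothetical helpers, and the `rippled` prefix is assumed from the `prefix=rippled` setting visible in the hunk header. The sketch derives the two timer metric names for a job and builds a `__name__` matcher of the same shape as the one used by the All Jobs panels.

```python
import re
from typing import List, Tuple

PREFIX = "rippled"  # assumed from the `prefix=rippled` [insight] setting


def timer_metric_names(job_name: str) -> Tuple[str, str]:
    """Return (execution, dequeue-wait) metric names for one job type."""
    execution = f"{PREFIX}_{job_name}"
    return execution, f"{execution}_q"


def all_jobs_matcher(job_names: List[str], dequeue: bool = False) -> str:
    """Build a PromQL selector that matches one timer kind for every job."""
    suffix = "_q" if dequeue else ""
    alternation = "|".join(re.escape(name) + suffix for name in job_names)
    return f'{{__name__=~"{PREFIX}_({alternation})", quantile="$quantile"}}'


if __name__ == "__main__":
    key_jobs = ["acceptLedger", "advanceLedger", "transaction"]
    for job in key_jobs:
        print(timer_metric_names(job))
    print(all_jobs_matcher(key_jobs, dequeue=True))
```

Generating the alternation from one list keeps the execution and dequeue matchers in step, which is harder to guarantee when both regexes are maintained by hand.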
--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -226,7 +226,7 @@ The appendix contains a glossary of OpenTelemetry and xrpld-specific terms, refe

 ## 9. Data Collection Reference

-A single-source-of-truth reference documenting every piece of telemetry data collected by xrpld. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 8 Grafana dashboards. Includes Jaeger search guides and Prometheus query examples.
+A single-source-of-truth reference documenting every piece of telemetry data collected by xrpld. Covers all 16 OpenTelemetry spans with their 22 attributes, all StatsD metrics (gauges, counters, histograms, overlay traffic), SpanMetrics-derived Prometheus metrics, and all 10 Grafana dashboards. Includes Jaeger search guides and Prometheus query examples.

 ➡️ **[View Data Collection Reference](./09-data-collection-reference.md)**

--- a/cspell.config.yaml
+++ b/cspell.config.yaml
@@ -187,6 +187,7 @@ words:
  - nixfmt
  - nixos
  - nixpkgs
+  - NETOP
  - NOLINT
  - NOLINTNEXTLINE
  - nonxrp
--- a/docker/telemetry/grafana/dashboards/statsd-network-traffic.json
+++ b/docker/telemetry/grafana/dashboards/statsd-network-traffic.json
@@ -96,7 +96,7 @@
    },
    {
      "title": "Total Network Bytes",
-      "description": "Total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Provides a high-level view of network bandwidth consumption.",
+      "description": "Rate of total bytes sent and received across all peer connections. Sourced from the total.Bytes_In and total.Bytes_Out traffic category gauges (OverlayImpl.h:535-548). Wrapped in rate() to show throughput rather than cumulative counter values.",
      "type": "timeseries",
      "gridPos": {
        "h": 8,
@@ -115,22 +115,22 @@
          "datasource": {
            "type": "prometheus"
          },
-          "expr": "rippled_total_Bytes_In",
+          "expr": "rate(rippled_total_Bytes_In[5m])",
          "legendFormat": "Bytes In"
        },
        {
          "datasource": {
            "type": "prometheus"
          },
-          "expr": "rippled_total_Bytes_Out",
+          "expr": "rate(rippled_total_Bytes_Out[5m])",
          "legendFormat": "Bytes Out"
        }
      ],
      "fieldConfig": {
        "defaults": {
-          "unit": "decbytes",
+          "unit": "Bps",
          "custom": {
-            "axisLabel": "Bytes",
+            "axisLabel": "Throughput",
            "spanNulls": true,
            "insertNulls": false,
            "showPoints": "auto",
@@ -655,6 +655,119 @@
          }
        ]
      }
+    },
+    {
+      "title": "Duplicate Traffic (Wasted Bandwidth)",
+      "description": "Rate of duplicate overlay traffic across transaction, proposal, and validation categories. Duplicate messages are messages the node has already seen and discards. High duplicate rates indicate inefficient message routing or network topology issues causing redundant relays.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 32
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_transactions_duplicate_Bytes_In[5m])",
+          "legendFormat": "TX Duplicate In"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_transactions_duplicate_Bytes_Out[5m])",
+          "legendFormat": "TX Duplicate Out"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_proposals_duplicate_Bytes_In[5m])",
+          "legendFormat": "Proposals Duplicate In"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_proposals_duplicate_Bytes_Out[5m])",
+          "legendFormat": "Proposals Duplicate Out"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_validations_duplicate_Bytes_In[5m])",
+          "legendFormat": "Validations Duplicate In"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_validations_duplicate_Bytes_Out[5m])",
+          "legendFormat": "Validations Duplicate Out"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps",
+          "custom": {
+            "axisLabel": "Throughput",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "All Traffic Categories (Detail)",
+      "description": "Top 15 traffic categories by inbound byte rate, excluding the total aggregate. Provides a detailed timeseries view of which overlay message types are consuming the most bandwidth over time. Complements the bar gauge snapshot view in the Overlay Traffic panel.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 32
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "topk(15, rate(label_replace({__name__=~\"rippled_.*_Bytes_In\", __name__!~\"rippled_total_.*\"}, \"category\", \"$1\", \"__name__\", \"rippled_(.*)_Bytes_In\")[5m:]))",
+          "legendFormat": "{{category}}"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "Bps",
+          "custom": {
+            "axisLabel": "Throughput",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
    }
  ],
  "schemaVersion": 39,
--- a/docker/telemetry/grafana/dashboards/statsd-node-health.json
+++ b/docker/telemetry/grafana/dashboards/statsd-node-health.json
@@ -287,7 +287,7 @@
    },
    {
      "title": "Job Queue Depth",
-      "description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough \u2014 common during ledger replay or heavy RPC load.",
+      "description": "Current number of jobs waiting in the job queue. Sourced from the job_count gauge (JobQueue.cpp:26). A sustained high value indicates the node cannot process work fast enough — common during ledger replay or heavy RPC load.",
      "type": "timeseries",
      "gridPos": {
        "h": 8,
@@ -399,12 +399,527 @@
        },
        "overrides": []
      }
+    },
+    {
+      "title": "Key Jobs Execution Time",
+      "description": "Execution time for critical job types at the selected quantile. Sourced from per-job-type events in JobTypeData (JobTypeData.h:48). Shows how long key consensus, transaction, and maintenance jobs take to execute. Spikes indicate processing bottlenecks.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 32
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_acceptLedger{quantile=\"$quantile\"}",
+          "legendFormat": "Accept Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_advanceLedger{quantile=\"$quantile\"}",
+          "legendFormat": "Advance Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_transaction{quantile=\"$quantile\"}",
+          "legendFormat": "Transaction [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_writeObjects{quantile=\"$quantile\"}",
+          "legendFormat": "Write Objects [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_heartbeat{quantile=\"$quantile\"}",
+          "legendFormat": "Heartbeat [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_sweep{quantile=\"$quantile\"}",
+          "legendFormat": "Sweep [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_trustedValidation{quantile=\"$quantile\"}",
+          "legendFormat": "Trusted Validation [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_trustedProposal{quantile=\"$quantile\"}",
+          "legendFormat": "Trusted Proposal [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_publishNewLedger{quantile=\"$quantile\"}",
+          "legendFormat": "Publish New Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_clientRPC{quantile=\"$quantile\"}",
+          "legendFormat": "Client RPC [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_ledgerData{quantile=\"$quantile\"}",
+          "legendFormat": "Ledger Data [{{quantile}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Duration (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Key Jobs Dequeue Wait Time",
+      "description": "Time spent waiting in the job queue before execution for critical job types. Sourced from per-job-type dequeue events (JobTypeData.h:47). High dequeue times indicate the job queue is backlogged and jobs are waiting too long to be scheduled.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 32
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_acceptLedger_q{quantile=\"$quantile\"}",
+          "legendFormat": "Accept Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_advanceLedger_q{quantile=\"$quantile\"}",
+          "legendFormat": "Advance Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_transaction_q{quantile=\"$quantile\"}",
+          "legendFormat": "Transaction [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_writeObjects_q{quantile=\"$quantile\"}",
+          "legendFormat": "Write Objects [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_heartbeat_q{quantile=\"$quantile\"}",
+          "legendFormat": "Heartbeat [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_sweep_q{quantile=\"$quantile\"}",
+          "legendFormat": "Sweep [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_trustedValidation_q{quantile=\"$quantile\"}",
+          "legendFormat": "Trusted Validation [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_trustedProposal_q{quantile=\"$quantile\"}",
+          "legendFormat": "Trusted Proposal [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_publishNewLedger_q{quantile=\"$quantile\"}",
+          "legendFormat": "Publish New Ledger [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_clientRPC_q{quantile=\"$quantile\"}",
+          "legendFormat": "Client RPC [{{quantile}}]"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_ledgerData_q{quantile=\"$quantile\"}",
+          "legendFormat": "Ledger Data [{{quantile}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Wait Time (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "FullBelowCache Size",
+      "description": "Number of entries in the FullBelowCache. Sourced from the TaggedCache size gauge (TaggedCache.h:183) for the Node family full below cache (NodeFamily.cpp:29). This cache tracks which SHAMap nodes have all children present locally, avoiding redundant fetches during ledger acquisition.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 40
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_Node_family_full_below_cache_size",
+          "legendFormat": "FullBelowCache Size"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "custom": {
+            "axisLabel": "Entries",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "FullBelowCache Hit Rate",
+      "description": "Hit rate percentage for the FullBelowCache. Sourced from the TaggedCache hit_rate gauge (TaggedCache.h:184). A high hit rate means the node is efficiently reusing cached knowledge about complete SHAMap subtrees. Low hit rates during steady state warrant investigation.",
+      "type": "gauge",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 40
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_Node_family_full_below_cache_hit_rate",
+          "legendFormat": "Hit Rate"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "thresholds": {
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 25
+              },
+              {
+                "color": "green",
+                "value": 50
+              }
+            ]
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Ledger Publish Gap",
+      "description": "Difference between published and validated ledger ages. Computed as Published_Ledger_Age minus Validated_Ledger_Age. A value near zero means the publish pipeline keeps up with validation. A growing gap indicates the publish pipeline is falling behind, potentially causing stale data for subscribers.",
+      "type": "stat",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 48
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rippled_LedgerMaster_Published_Ledger_Age - rippled_LedgerMaster_Validated_Ledger_Age",
+          "legendFormat": "Publish Gap"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "thresholds": {
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 5
+              },
+              {
+                "color": "red",
+                "value": 10
+              }
+            ]
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "State Duration Rate (Full vs Tracking)",
+      "description": "Rate of change of time spent in Full and Tracking operating modes, normalized to seconds. Sourced from State_Accounting duration gauges (NetworkOPs.cpp:774-778). In steady state the Full duration rate should be close to 1.0 (gaining one second of Full-mode time per wall-clock second). A drop below 1.0 means the node is spending time in other modes.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 48
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_State_Accounting_Full_duration[5m]) / 1000000",
+          "legendFormat": "Full Mode Rate"
+        },
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "rate(rippled_State_Accounting_Tracking_duration[5m]) / 1000000",
+          "legendFormat": "Tracking Mode Rate"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "custom": {
+            "axisLabel": "Rate (s/s)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "All Jobs Execution Time (Detail)",
+      "description": "Execution time for ALL non-special job types at the selected quantile. Shows the complete picture of job execution performance. Use the Key Jobs panel for a focused view of the most critical jobs.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 24,
+        "x": 0,
+        "y": 56
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "{__name__=~\"rippled_(makeFetchPack|publishAcqLedger|untrustedValidation|manifest|localTransaction|ledgerReplayRequest|ledgerRequest|untrustedProposal|ledgerReplayTask|ledgerData|clientCommand|clientSubscribe|clientFeeChange|clientConsensus|clientAccountHistory|clientRPC|clientWebsocket|RPC|updatePaths|transaction|batch|advanceLedger|publishNewLedger|fetchTxnData|writeAhead|trustedValidation|writeObjects|acceptLedger|trustedProposal|sweep|clusterReport|heartbeat|administration|handleHaveTransactions|doTransactions)\", quantile=\"$quantile\"}",
+          "legendFormat": "{{__name__}} [{{quantile}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Duration (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "All Jobs Dequeue Wait (Detail)",
+      "description": "Dequeue wait time for ALL non-special job types at the selected quantile. Shows the complete picture of job queue waiting times. High wait times across many job types indicate systemic job queue congestion.",
+      "type": "timeseries",
+      "gridPos": {
+        "h": 8,
+        "w": 24,
+        "x": 0,
+        "y": 64
+      },
+      "options": {
+        "tooltip": {
+          "mode": "multi",
+          "sort": "desc"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus"
+          },
+          "expr": "{__name__=~\"rippled_(makeFetchPack_q|publishAcqLedger_q|untrustedValidation_q|manifest_q|localTransaction_q|ledgerReplayRequest_q|ledgerRequest_q|untrustedProposal_q|ledgerReplayTask_q|ledgerData_q|clientCommand_q|clientSubscribe_q|clientFeeChange_q|clientConsensus_q|clientAccountHistory_q|clientRPC_q|clientWebsocket_q|RPC_q|updatePaths_q|transaction_q|batch_q|advanceLedger_q|publishNewLedger_q|fetchTxnData_q|writeAhead_q|trustedValidation_q|writeObjects_q|acceptLedger_q|trustedProposal_q|sweep_q|clusterReport_q|heartbeat_q|administration_q|handleHaveTransactions_q|doTransactions_q)\", quantile=\"$quantile\"}",
+          "legendFormat": "{{__name__}} [{{quantile}}]"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Wait Time (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
    }
  ],
  "schemaVersion": 39,
  "tags": ["rippled", "statsd", "node-health", "telemetry"],
  "templating": {
-    "list": []
+    "list": [
+      {
+        "name": "quantile",
+        "label": "Quantile",
+        "type": "custom",
+        "query": "0.5,0.9,0.95,0.99",
+        "current": {
+          "selected": true,
+          "text": "0.95",
+          "value": "0.95"
+        },
+        "options": [
+          {
+            "selected": false,
+            "text": "0.5",
+            "value": "0.5"
+          },
+          {
+            "selected": false,
+            "text": "0.9",
+            "value": "0.9"
+          },
+          {
+            "selected": true,
+            "text": "0.95",
+            "value": "0.95"
+          },
+          {
+            "selected": false,
+            "text": "0.99",
+            "value": "0.99"
+          }
+        ],
+        "multi": false,
+        "includeAll": false
+      }
+    ]
  },
  "time": {
    "from": "now-1h",
--- a/docs/telemetry-runbook.md
+++ b/docs/telemetry-runbook.md
@@ -251,7 +251,7 @@ The OTel Collector receives these via a `statsd` receiver on UDP port 8125 and e

 ## Grafana Dashboards

-Eight dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
+Ten dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:

 ### RPC Performance (`xrpld-rpc-perf`)

@@ -320,29 +320,39 @@ Requires `trace_peer=1` in the `[telemetry]` config section.

 ### Node Health — StatsD (`xrpld-statsd-node-health`)

-| Panel                      | Type       | PromQL                                                 | Labels Used |
-| -------------------------- | ---------- | ------------------------------------------------------ | ----------- |
-| Validated Ledger Age       | stat       | `rippled_LedgerMaster_Validated_Ledger_Age`            | —           |
-| Published Ledger Age       | stat       | `rippled_LedgerMaster_Published_Ledger_Age`            | —           |
-| Operating Mode Duration    | timeseries | `rippled_State_Accounting_*_duration`                  | —           |
-| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions`               | —           |
-| I/O Latency                | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)` | —           |
-| Job Queue Depth            | timeseries | `rippled_job_count`                                    | —           |
-| Ledger Fetch Rate          | stat       | `rate(rippled_ledger_fetches[5m])`                     | —           |
-| Ledger History Mismatches  | stat       | `rate(rippled_ledger_history_mismatch[5m])`            | —           |
+| Panel                                  | Type       | PromQL                                                            | Labels Used |
+| -------------------------------------- | ---------- | ----------------------------------------------------------------- | ----------- |
+| Validated Ledger Age                   | stat       | `rippled_LedgerMaster_Validated_Ledger_Age`                       | —           |
+| Published Ledger Age                   | stat       | `rippled_LedgerMaster_Published_Ledger_Age`                       | —           |
+| Operating Mode Duration                | timeseries | `rippled_State_Accounting_*_duration`                             | —           |
+| Operating Mode Transitions             | timeseries | `rippled_State_Accounting_*_transitions`                          | —           |
+| I/O Latency                            | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)`            | —           |
+| Job Queue Depth                        | timeseries | `rippled_job_count`                                               | —           |
+| Ledger Fetch Rate                      | stat       | `rate(rippled_ledger_fetches[5m])`                                | —           |
+| Ledger History Mismatches              | stat       | `rate(rippled_ledger_history_mismatch[5m])`                       | —           |
+| Key Jobs Execution Time                | timeseries | `rippled_acceptLedger{quantile="$quantile"}` (+ 10 more key jobs) | `quantile`  |
+| Key Jobs Dequeue Wait Time             | timeseries | `rippled_acceptLedger_q{quantile="$quantile"}` (+ 10 more)        | `quantile`  |
+| FullBelowCache Size                    | timeseries | `rippled_Node_family_full_below_cache_size`                       | —           |
+| FullBelowCache Hit Rate                | gauge      | `rippled_Node_family_full_below_cache_hit_rate`                   | —           |
+| Ledger Publish Gap                     | stat       | `rippled_LedgerMaster_Published_Ledger_Age - rippled_LedgerMaster_Validated_Ledger_Age` | —           |
+| State Duration Rate (Full vs Tracking) | timeseries | `rate(rippled_State_Accounting_Full_duration[5m]) / 1000000`      | —           |
+| All Jobs Execution Time (Detail)       | timeseries | `{__name__=~"rippled_<all_jobs>", quantile="$quantile"}`          | `quantile`  |
+| All Jobs Dequeue Wait (Detail)         | timeseries | `{__name__=~"rippled_<all_jobs>_q", quantile="$quantile"}`        | `quantile`  |
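
The `<all_jobs>` placeholder in the last two rows stands for a regex alternation of job names, as in the "All Jobs Dequeue Wait (Detail)" panel, whose selector has the form `rippled_(makeFetchPack_q|publishAcqLedger_q|...)`. The following is a minimal sketch of how such a selector could be assembled from a list of job names. The function name `job_selector`, the `EXAMPLE_JOBS` list, and the assumption that execution-time metrics use the same names without the `_q` suffix are illustrative and are not part of the dashboard tooling.

```python
import re

# Hypothetical subset of job names copied from the panel regex; the real
# panels list many more.
EXAMPLE_JOBS = ["acceptLedger", "clientRPC", "heartbeat"]

# Only plain identifier fragments are accepted, so no regex escaping is needed.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def job_selector(jobs, dequeue=False, quantile="$quantile", prefix="rippled_"):
    """Build a PromQL selector that matches one timer series per job name.

    dequeue=True appends the `_q` suffix used by the dequeue-wait timers.
    """
    if not jobs:
        raise ValueError("at least one job name is required")
    for job in jobs:
        if not _NAME_RE.fullmatch(job):
            raise ValueError(f"not a plain metric-name fragment: {job!r}")
    suffix = "_q" if dequeue else ""
    alternation = "|".join(job + suffix for job in jobs)
    return f'{{__name__=~"{prefix}({alternation})", quantile="{quantile}"}}'


if __name__ == "__main__":
    print(job_selector(EXAMPLE_JOBS, dequeue=True))
```

Prometheus regex matchers are fully anchored, so the alternation matches only the listed metric names and not longer names that merely contain them.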

 ### Network Traffic — StatsD (`xrpld-statsd-network`)

-| Panel                  | Type       | PromQL                                 | Labels Used |
-| ---------------------- | ---------- | -------------------------------------- | ----------- |
-| Active Peers           | timeseries | `rippled_Peer_Finder_Active_*_Peers`   | —           |
-| Peer Disconnects       | timeseries | `rippled_Overlay_Peer_Disconnects`     | —           |
-| Total Network Bytes    | timeseries | `rippled_total_Bytes_In/Out`           | —           |
-| Total Network Messages | timeseries | `rippled_total_Messages_In/Out`        | —           |
-| Transaction Traffic    | timeseries | `rippled_transactions_Messages_In/Out` | —           |
-| Proposal Traffic       | timeseries | `rippled_proposals_Messages_In/Out`    | —           |
-| Validation Traffic     | timeseries | `rippled_validations_Messages_In/Out`  | —           |
-| Traffic by Category    | bargauge   | `topk(10, rippled_*_Bytes_In)`         | —           |
+| Panel                                | Type       | PromQL                                       | Labels Used |
+| ------------------------------------ | ---------- | -------------------------------------------- | ----------- |
+| Active Peers                         | timeseries | `rippled_Peer_Finder_Active_*_Peers`         | —           |
+| Peer Disconnects                     | timeseries | `rippled_Overlay_Peer_Disconnects`           | —           |
+| Total Network Bytes                  | timeseries | `rate(rippled_total_Bytes_In/Out[5m])`       | —           |
+| Total Network Messages               | timeseries | `rippled_total_Messages_In/Out`              | —           |
+| Transaction Traffic                  | timeseries | `rippled_transactions_Messages_In/Out`       | —           |
+| Proposal Traffic                     | timeseries | `rippled_proposals_Messages_In/Out`          | —           |
+| Validation Traffic                   | timeseries | `rippled_validations_Messages_In/Out`        | —           |
+| Traffic by Category                  | bargauge   | `topk(10, rippled_*_Bytes_In)`               | —           |
+| Duplicate Traffic (Wasted Bandwidth) | timeseries | `rate(rippled_*_duplicate_Bytes_In/Out[5m])` | —           |
+| All Traffic Categories (Detail)      | timeseries | `topk(15, rate(rippled_*_Bytes_In[5m]))`     | —           |
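
Entries such as `rate(rippled_total_Bytes_In/Out[5m])` are table shorthand for two separate queries, one per direction, and are not valid PromQL as written. A minimal sketch of the expansion, assuming a hypothetical helper named `expand_in_out` and the 5-minute window shown in the table:

```python
def expand_in_out(shorthand, window="5m"):
    """Expand 'metric_In/Out' shorthand into two rate() expressions.

    Only the plain In/Out form is handled; shorthand containing '*'
    wildcards is rejected because it needs a name-regex matcher instead.
    """
    marker = "In/Out"
    if not shorthand.endswith("_" + marker) or "*" in shorthand:
        raise ValueError(f"unsupported shorthand: {shorthand!r}")
    base = shorthand[: -len(marker)]  # keeps the trailing underscore
    return [f"rate({base}{direction}[{window}])" for direction in ("In", "Out")]


if __name__ == "__main__":
    for expr in expand_in_out("rippled_total_Bytes_In/Out"):
        print(expr)
```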

 ### RPC & Pathfinding — StatsD (`xrpld-statsd-rpc`)