# Observability Backend Recommendations

> **Parent Document**: [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)
> **Related**: [Implementation Phases](./06-implementation-phases.md) | [Appendix](./08-appendix.md)

---

## 7.1 Development/Testing Backends

> **OTLP** = OpenTelemetry Protocol

| Backend    | Pros                                | Cons                   | Use Case            |
| ---------- | ----------------------------------- | ---------------------- | ------------------- |
| **Tempo**  | Cost-effective, Grafana integration | Requires Grafana stack | Local dev, CI, Prod |
| **Zipkin** | Simple, lightweight                 | Basic features         | Quick prototyping   |

### Quick Start with Tempo

```bash
# Start Tempo with OTLP support
docker run -d --name tempo \
  -p 3200:3200 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/tempo:2.6.1
```

---

## 7.2 Production Backends

> **APM** = Application Performance Monitoring

| Backend           | Pros                                      | Cons                   | Use Case                    |
| ----------------- | ----------------------------------------- | ---------------------- | --------------------------- |
| **Grafana Tempo** | Cost-effective, Grafana integration       | Requires Grafana stack | Most production deployments |
| **Elastic APM**   | Full observability stack, log correlation | Resource intensive     | Existing Elastic users      |
| **Honeycomb**     | Excellent query, high cardinality         | SaaS cost              | Deep debugging needs        |
| **Datadog APM**   | Full platform, easy setup                 | SaaS cost              | Enterprise with budget      |

### Backend Selection Flowchart

```mermaid
flowchart TD
    start[Select Backend] --> budget{Budget<br/>Constraints?}
    budget -->|Yes| oss[Open Source]
    budget -->|No| saas{Prefer<br/>SaaS?}

    oss --> existing{Existing<br/>Stack?}
    existing -->|Grafana| tempo[Grafana Tempo]
    existing -->|Elastic| elastic[Elastic APM]
    existing -->|None| tempo

    saas -->|Yes| enterprise{Enterprise<br/>Support?}
    saas -->|No| oss
    enterprise -->|Yes| datadog[Datadog APM]
    enterprise -->|No| honeycomb[Honeycomb]

    tempo --> final[Configure Collector]
    elastic --> final
    honeycomb --> final
    datadog --> final

    style start fill:#0f172a,stroke:#020617,color:#fff
    style budget fill:#334155,stroke:#1e293b,color:#fff
    style oss fill:#1e293b,stroke:#0f172a,color:#fff
    style existing fill:#334155,stroke:#1e293b,color:#fff
    style saas fill:#334155,stroke:#1e293b,color:#fff
    style enterprise fill:#334155,stroke:#1e293b,color:#fff
    style final fill:#0f172a,stroke:#020617,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style elastic fill:#bf360c,stroke:#8c2809,color:#fff
    style honeycomb fill:#0d47a1,stroke:#082f6a,color:#fff
    style datadog fill:#4a148c,stroke:#2e0d57,color:#fff
```

**Reading the diagram:**

- **Budget Constraints? (Yes)**: Leads to open-source options. If you already run Grafana or Elastic, pick the matching backend; otherwise default to Grafana Tempo.
- **Budget Constraints? (No) → Prefer SaaS?**: If you want a managed service, choose between Datadog (enterprise support) and Honeycomb (developer-focused). If not, fall back to open-source.
- **Terminal nodes (Tempo / Elastic / Honeycomb / Datadog)**: Each represents a concrete backend choice, all of which feed into the same final step.
- **Configure Collector**: Regardless of backend, you always finish by configuring the OTel Collector to export to your chosen destination.

---

## 7.3 Recommended Production Architecture

> **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring | **HA** = High Availability

```mermaid
flowchart TB
    subgraph validators["Validator Nodes"]
        v1[xrpld<br/>Validator 1]
        v2[xrpld<br/>Validator 2]
    end

    subgraph stock["Stock Nodes"]
        s1[xrpld<br/>Stock 1]
        s2[xrpld<br/>Stock 2]
    end

    subgraph collector["OTel Collector Cluster"]
        c1[Collector<br/>DC1]
        c2[Collector<br/>DC2]
    end

    subgraph backends["Storage Backends"]
        tempo[(Grafana<br/>Tempo)]
        elastic[(Elastic<br/>APM)]
        archive[(S3/GCS<br/>Archive)]
    end

    subgraph ui["Visualization"]
        grafana[Grafana<br/>Dashboards]
    end

    v1 -->|OTLP| c1
    v2 -->|OTLP| c1
    s1 -->|OTLP| c2
    s2 -->|OTLP| c2

    c1 --> tempo
    c1 --> elastic
    c2 --> tempo
    c2 --> archive

    tempo --> grafana
    elastic --> grafana

    %% Note: simplified single-collector-per-DC topology shown for clarity

    style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff
    style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
    style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
    style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff
```

**Reading the diagram:**

- **Validator / Stock Nodes**: All xrpld nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
- **Collector Cluster (DC1, DC2)**: Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends.
- **Storage Backends**: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis.
- **Grafana Dashboards**: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces.
- **Data flow direction**: Nodes → Collectors → Storage → Grafana. Each arrow represents a network hop; minimizing collector-to-backend hops reduces latency.

> **Note**: Production deployments should use multiple collector instances behind a load balancer for high availability. The diagram shows a simplified single-collector topology for clarity.
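
As a rough illustration of the DC1 collector in the diagram, the following OpenTelemetry Collector configuration receives OTLP from xrpld nodes and fans out to Tempo and Elastic APM. It is a minimal sketch rather than a tested deployment: the hostnames `tempo.example.internal` and `apm.example.internal`, the Elastic APM port, and the plaintext `insecure` setting are assumptions for illustration, and the S3/GCS archive path used by DC2 is omitted.

```yaml
# Hypothetical otel-collector config for the DC1 collector (sketch only)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # OTLP/gRPC from xrpld nodes
      http:
        endpoint: 0.0.0.0:4318 # OTLP/HTTP from xrpld nodes

processors:
  batch: {} # batch spans before export to reduce outbound requests

exporters:
  otlp/tempo:
    endpoint: tempo.example.internal:4317 # assumed hostname
    tls:
      insecure: true # assumption: plaintext inside the datacenter
  otlp/elastic:
    endpoint: apm.example.internal:8200 # assumed hostname and port

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/elastic] # fan out to both backends
```

Naming the exporters `otlp/tempo` and `otlp/elastic` uses the Collector's `type/name` convention, which lets one pipeline reference two instances of the same exporter type.
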

---

## 7.4 Architecture Considerations

### 7.4.1 Collector Placement

| Strategy      | Description          | Pros                     | Cons                    |
| ------------- | -------------------- | ------------------------ | ----------------------- |
| **Sidecar**   | Collector per node   | Isolation, simple config | Resource overhead       |
| **DaemonSet** | Collector per host   | Shared resources         | Complexity              |
| **Gateway**   | Central collector(s) | Centralized processing   | Single point of failure |

**Recommendation**: Use **Gateway** pattern with regional collectors for xrpld networks:

- One collector cluster per datacenter/region
- Tail-based sampling at collector level
- Multiple export destinations for redundancy

### 7.4.2 Sampling Strategy

```mermaid
flowchart LR
    subgraph head["Head Sampling (Node)"]
        hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
    end

    subgraph tail["Tail Sampling (Collector)"]
        ts1[Keep all errors]
        ts2[Keep slow >5s]
        ts3[Keep 10% rest]
    end

    head --> tail

    ts1 --> final[Final Traces]
    ts2 --> final
    ts3 --> final

    style head fill:#0d47a1,stroke:#082f6a,color:#fff
    style tail fill:#1b5e20,stroke:#0d3d14,color:#fff
    style hs fill:#0d47a1,stroke:#082f6a,color:#fff
    style ts1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts3 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style final fill:#bf360c,stroke:#8c2809,color:#fff
```

**Reading the diagram:**

- **Head Sampling (Node)**: The first filter -- each xrpld node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
- **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
- **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
- **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
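
The three tail-sampling rules above could be expressed with the `tail_sampling` processor from the OpenTelemetry Collector contrib distribution, roughly as follows. This is a sketch under assumptions: the policy names, `decision_wait`, and `num_traces` values are illustrative and not taken from the plan, and the processor still has to be added to the `traces` pipeline of the collector.

```yaml
# Hypothetical tail-sampling fragment for the regional collector (sketch only)
processors:
  tail_sampling:
    decision_wait: 10s # assumed: how long to buffer a trace before deciding
    num_traces: 100000 # assumed: maximum traces held in memory
    policies:
      # Keep all errors
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow >5s
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 5000
      # Keep 10% of the rest
      - name: keep-10-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept when any one of these policies matches, so errors and slow traces survive regardless of the probabilistic policy. Tail sampling needs all spans of a trace to reach the same collector instance, which is worth keeping in mind when placing collectors behind a load balancer.
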

### 7.4.3 Data Retention

| Environment | Hot Storage | Warm Storage | Cold Archive |
| ----------- | ----------- | ------------ | ------------ |
| Development | 24 hours    | N/A          | N/A          |
| Staging     | 7 days      | N/A          | N/A          |
| Production  | 7 days      | 30 days      | many years   |

---

## 7.5 Integration Checklist

- [ ] Choose primary backend (Tempo recommended for cost/features)
- [ ] Deploy collector cluster with high availability
- [ ] Configure tail-based sampling for error/latency traces
- [ ] Set up Grafana dashboards for trace visualization
- [ ] Configure alerts for trace anomalies
- [ ] Establish data retention policies
- [ ] Test trace correlation with logs and metrics

---

## 7.6 Grafana Dashboard Examples

Pre-built dashboards for xrpld observability.

### 7.6.1 Consensus Health Dashboard

```json
{
  "title": "xrpld Consensus Health",
  "uid": "xrpld-consensus-health",
  "tags": ["xrpld", "consensus", "tracing"],
  "panels": [
    {
      "title": "Consensus Round Duration",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 4000 },
              { "color": "red", "value": 5000 }
            ]
          }
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "title": "Phase Duration Breakdown",
      "type": "barchart",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "title": "Proposers per Round",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
        }
      ],
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
    },
    {
      "title": "Recent Slow Rounds (>5s)",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"consensus.round\"} | duration > 5s"
        }
      ],
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 }
    }
  ]
}
```

### 7.6.2 Node Overview Dashboard

```json
{
  "title": "xrpld Node Overview",
  "uid": "xrpld-node-overview",
  "panels": [
    {
      "title": "Active Nodes",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\"} | count_over_time() by (resource.service.instance.id) | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }
    },
    {
      "title": "Total Transactions (1h)",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && status=error} | rate() / {resource.service.name=\"xrpld\"} | rate() * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 10,
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      },
      "gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 }
    },
    {
      "title": "Service Map",
      "type": "nodeGraph",
      "datasource": "Tempo",
      "gridPos": { "h": 12, "w": 12, "x": 12, "y": 0 }
    }
  ]
}
```

### 7.6.3 Alert Rules

```yaml
# grafana/provisioning/alerting/rippled-alerts.yaml
apiVersion: 1

groups:
  - name: xrpld-tracing-alerts
    folder: xrpld
    interval: 1m
    rules:
      - uid: consensus-slow
        title: Consensus Round Slow
        condition: A
        data:
          - refId: A
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="xrpld" && name="consensus.round"} | avg(duration) > 5s'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., avg(duration)) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 5m
        annotations:
          summary: Consensus rounds taking >5 seconds
          description: "Consensus duration: {{ $value }}ms"
        labels:
          severity: warning

      - uid: rpc-error-spike
        title: RPC Error Rate Spike
        condition: B
        data:
          - refId: B
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="xrpld" && name=~"rpc.command.*" && status=error} | rate() > 0.05'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., rate()) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 2m
        annotations:
          summary: RPC error rate >5%
        labels:
          severity: critical

      - uid: tx-throughput-drop
        title: Transaction Throughput Drop
        condition: C
        data:
          - refId: C
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="xrpld" && name="tx.receive"} | rate() < 10'
        for: 10m
        annotations:
          summary: Transaction throughput below threshold
        labels:
          severity: warning
```

---

## 7.7 PerfLog and Insight Correlation

> **OTLP** = OpenTelemetry Protocol

How to correlate OpenTelemetry traces with existing xrpld observability.

### 7.7.1 Correlation Architecture

```mermaid
flowchart TB
    subgraph xrpld["xrpld Node"]
        otel[OpenTelemetry<br/>Spans]
        perflog[PerfLog<br/>JSON Logs]
        insight[Beast Insight<br/>StatsD Metrics]
    end

    subgraph collectors["Data Collection"]
        otelc[OTel Collector]
        promtail[Promtail/Fluentd]
        statsd[StatsD Exporter]
    end

    subgraph storage["Storage"]
        tempo[(Tempo)]
        loki[(Loki)]
        prom[(Prometheus)]
    end

    subgraph grafana["Grafana"]
        traces[Trace View]
        logs[Log View]
        metrics[Metrics View]
        corr[Correlation<br/>Panel]
    end

    otel -->|OTLP| otelc --> tempo
    perflog -->|JSON| promtail --> loki
    insight -->|StatsD| statsd --> prom

    tempo --> traces
    loki --> logs
    prom --> metrics

    traces --> corr
    logs --> corr
    metrics --> corr

    style xrpld fill:#0d47a1,stroke:#082f6a,color:#fff
    style collectors fill:#bf360c,stroke:#8c2809,color:#fff
    style storage fill:#1b5e20,stroke:#0d3d14,color:#fff
    style grafana fill:#4a148c,stroke:#2e0d57,color:#fff
    style otel fill:#0d47a1,stroke:#082f6a,color:#fff
    style perflog fill:#0d47a1,stroke:#082f6a,color:#fff
    style insight fill:#0d47a1,stroke:#082f6a,color:#fff
    style otelc fill:#bf360c,stroke:#8c2809,color:#fff
    style promtail fill:#bf360c,stroke:#8c2809,color:#fff
    style statsd fill:#bf360c,stroke:#8c2809,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style loki fill:#1b5e20,stroke:#0d3d14,color:#fff
    style prom fill:#1b5e20,stroke:#0d3d14,color:#fff
    style traces fill:#4a148c,stroke:#2e0d57,color:#fff
    style logs fill:#4a148c,stroke:#2e0d57,color:#fff
    style metrics fill:#4a148c,stroke:#2e0d57,color:#fff
    style corr fill:#4a148c,stroke:#2e0d57,color:#fff
```

**Reading the diagram:**

- **xrpld Node (three sources)**: A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
- **Data Collection layer**: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently.
- **Storage layer (Tempo, Loki, Prometheus)**: Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation).
- **Grafana Correlation Panel**: The key integration point -- Grafana queries all three stores and links them via shared fields (`trace_id`, `xrpl.tx.hash`, `ledger_seq`), enabling a single-pane debugging experience.
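
One way to wire the trace-to-log and log-to-trace links in Grafana is through data source provisioning, sketched below. Everything specific here is an assumption for illustration: the file layout, the `tempo` and `loki` UIDs and URLs, the `{job="xrpld"}` selector, and the regex, which presumes PerfLog emits a JSON field named `trace_id`.

```yaml
# Hypothetical grafana/provisioning/datasources file (sketch only)
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo # assumed UID
    url: http://tempo:3200 # assumed URL
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki # jump from a span to Loki
        spanStartTimeShift: "-5m" # widen the log search window
        spanEndTimeShift: "5m"
        customQuery: true
        # "$$" escapes "$" so provisioning does not treat it as an env variable
        query: '{job="xrpld"} |= "$${__span.traceId}"'

  - name: Loki
    type: loki
    uid: loki # assumed UID
    url: http://loki:3100 # assumed URL
    jsonData:
      derivedFields:
        - name: TraceID
          # assumes PerfLog lines contain "trace_id":"<hex id>"
          matcherRegex: '"trace_id":\s*"(\w+)"'
          datasourceUid: tempo # jump from a log line to Tempo
          url: "$${__value.raw}"
```

With links like these in place, the trace view offers a log shortcut per span, and matching log lines expose a trace link, which covers the `trace_id` rows of the correlation model without manual copy and paste.
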

### 7.7.2 Correlation Fields

| Source      | Field                       | Link To       | Purpose                    |
| ----------- | --------------------------- | ------------- | -------------------------- |
| **Trace**   | `trace_id`                  | Logs          | Find log entries for trace |
| **Trace**   | `xrpl.tx.hash`              | Logs, Metrics | Find TX-related data       |
| **Trace**   | `xrpl.consensus.ledger.seq` | Logs          | Find ledger-related logs   |
| **PerfLog** | `trace_id` (new)            | Traces        | Jump to trace from log     |
| **PerfLog** | `ledger_seq`                | Traces        | Find consensus trace       |
| **Insight** | `exemplar.trace_id`         | Traces        | Jump from metric spike     |

### 7.7.3 Example: Debugging a Slow Transaction

**Step 1: Find the trace**

```
# In Grafana Explore with Tempo
{resource.service.name="xrpld" && span.xrpl.tx.hash="ABC123..."}
```

**Step 2: Get the trace_id from the trace view**

```
Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
```

**Step 3: Find related PerfLog entries**

```
# In Grafana Explore with Loki
{job="xrpld"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
```

**Step 4: Check Insight metrics for the time window**

```
# In Grafana with Prometheus
# timestamp_from_trace is a placeholder: substitute the trace's Unix timestamp in seconds.
# The @ modifier attaches to the selector, so it goes inside rate().
rate(xrpld_tx_applied_total[1m] @ timestamp_from_trace)
```

### 7.7.4 Unified Dashboard Example

```json
{
  "title": "xrpld Unified Observability",
  "uid": "xrpld-unified",
  "panels": [
    {
      "title": "Transaction Latency (Traces)",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | histogram_over_time(duration)"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 0, "y": 0 }
    },
    {
      "title": "Transaction Rate (Metrics)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "rate(xrpld_tx_received_total[5m])",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "links": [
            {
              "title": "View traces",
              "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"xrpld\\\" && name=\\\"tx.receive\\\"}\"}"
            }
          ]
        }
      },
      "gridPos": { "h": 6, "w": 8, "x": 8, "y": 0 }
    },
    {
      "title": "Recent Logs",
      "type": "logs",
      "datasource": "Loki",
      "targets": [
        {
          "expr": "{job=\"xrpld\"} | json"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 16, "y": 0 }
    },
    {
      "title": "Trace Search",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\"}"
        }
      ],
      "fieldConfig": {
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "traceID" },
            "properties": [
              {
                "id": "links",
                "value": [
                  {
                    "title": "View trace",
                    "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"${__value.raw}\"}"
                  },
                  {
                    "title": "View logs",
                    "url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"xrpld\\\"} |= \\\"${__value.raw}\\\"\"}"
                  }
                ]
              }
            ]
          }
        ]
      },
      "gridPos": { "h": 12, "w": 24, "x": 0, "y": 6 }
    }
  ]
}
```

---

_Previous: [Implementation Phases](./06-implementation-phases.md)_ | _Next: [Appendix](./08-appendix.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_