rippled/OpenTelemetryPlan/07-observability-backends.md
Pratik Mankawde 913a4b794c docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00


Observability Backend Recommendations

Parent Document: OpenTelemetryPlan.md
Related: Implementation Phases | Appendix


7.1 Development/Testing Backends

OTLP = OpenTelemetry Protocol

| Backend | Pros | Cons | Use Case |
|---------|------|------|----------|
| Tempo | Cost-effective, Grafana integration | Requires Grafana stack | Local dev, CI, Prod |
| Zipkin | Simple, lightweight | Basic features | Quick prototyping |

Quick Start with Tempo

# Start Tempo with OTLP support
docker run -d --name tempo \
  -p 3200:3200 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/tempo:2.6.1
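
The command above publishes Tempo's query port (3200) and the OTLP gRPC/HTTP ports (4317/4318), but Tempo generally also needs a configuration file that enables its OTLP receivers. The following is a minimal single-node sketch, assuming local-disk storage. The file name `tempo.yaml` and the `/var/tempo` paths are illustrative choices, not part of the original plan.

```yaml
# tempo.yaml (hypothetical file name): minimal single-node setup for local testing
server:
  http_listen_port: 3200        # query/API port, matches -p 3200:3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317   # OTLP over gRPC
        http:
          endpoint: 0.0.0.0:4318   # OTLP over HTTP

storage:
  trace:
    backend: local               # local disk only; not suitable for production
    wal:
      path: /var/tempo/wal       # assumed path inside the container
    local:
      path: /var/tempo/blocks    # assumed path inside the container
```

To use it, mount the file into the container (for example `-v $(pwd)/tempo.yaml:/etc/tempo.yaml`) and pass `-config.file=/etc/tempo.yaml` as the container argument.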

7.2 Production Backends

APM = Application Performance Monitoring

| Backend | Pros | Cons | Use Case |
|---------|------|------|----------|
| Grafana Tempo | Cost-effective, Grafana integration | Requires Grafana stack | Most production deployments |
| Elastic APM | Full observability stack, log correlation | Resource intensive | Existing Elastic users |
| Honeycomb | Excellent query, high cardinality | SaaS cost | Deep debugging needs |
| Datadog APM | Full platform, easy setup | SaaS cost | Enterprise with budget |

Backend Selection Flowchart

flowchart TD
    start[Select Backend] --> budget{Budget<br/>Constraints?}

    budget -->|Yes| oss[Open Source]
    budget -->|No| saas{Prefer<br/>SaaS?}

    oss --> existing{Existing<br/>Stack?}
    existing -->|Grafana| tempo[Grafana Tempo]
    existing -->|Elastic| elastic[Elastic APM]
    existing -->|None| tempo

    saas -->|Yes| enterprise{Enterprise<br/>Support?}
    saas -->|No| oss

    enterprise -->|Yes| datadog[Datadog APM]
    enterprise -->|No| honeycomb[Honeycomb]

    tempo --> final[Configure Collector]
    elastic --> final
    honeycomb --> final
    datadog --> final

    style start fill:#0f172a,stroke:#020617,color:#fff
    style budget fill:#334155,stroke:#1e293b,color:#fff
    style oss fill:#1e293b,stroke:#0f172a,color:#fff
    style existing fill:#334155,stroke:#1e293b,color:#fff
    style saas fill:#334155,stroke:#1e293b,color:#fff
    style enterprise fill:#334155,stroke:#1e293b,color:#fff
    style final fill:#0f172a,stroke:#020617,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style elastic fill:#bf360c,stroke:#8c2809,color:#fff
    style honeycomb fill:#0d47a1,stroke:#082f6a,color:#fff
    style datadog fill:#4a148c,stroke:#2e0d57,color:#fff

Reading the diagram:

  • Budget Constraints? (Yes): Leads to open-source options. If you already run Grafana or Elastic, pick the matching backend; otherwise default to Grafana Tempo.
  • Budget Constraints? (No) → Prefer SaaS?: If you want a managed service, choose between Datadog (enterprise support) and Honeycomb (developer-focused). If not, fall back to open-source.
  • Terminal nodes (Tempo / Elastic / Honeycomb / Datadog): Each represents a concrete backend choice, all of which feed into the same final step.
  • Configure Collector: Regardless of backend, you always finish by configuring the OTel Collector to export to your chosen destination.

OTLP = OpenTelemetry Protocol | APM = Application Performance Monitoring | HA = High Availability

flowchart TB
    subgraph validators["Validator Nodes"]
        v1[rippled<br/>Validator 1]
        v2[rippled<br/>Validator 2]
    end

    subgraph stock["Stock Nodes"]
        s1[rippled<br/>Stock 1]
        s2[rippled<br/>Stock 2]
    end

    subgraph collector["OTel Collector Cluster"]
        c1[Collector<br/>DC1]
        c2[Collector<br/>DC2]
    end

    subgraph backends["Storage Backends"]
        tempo[(Grafana<br/>Tempo)]
        elastic[(Elastic<br/>APM)]
        archive[(S3/GCS<br/>Archive)]
    end

    subgraph ui["Visualization"]
        grafana[Grafana<br/>Dashboards]
    end

    v1 -->|OTLP| c1
    v2 -->|OTLP| c1
    s1 -->|OTLP| c2
    s2 -->|OTLP| c2

    c1 --> tempo
    c1 --> elastic
    c2 --> tempo
    c2 --> archive

    tempo --> grafana
    elastic --> grafana

    %% Note: simplified single-collector-per-DC topology shown for clarity

    style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff
    style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
    style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
    style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff

Reading the diagram:

  • Validator / Stock Nodes: All rippled nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
  • Collector Cluster (DC1, DC2): Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends.
  • Storage Backends: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis.
  • Grafana Dashboards: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces.
  • Data flow direction: Nodes → Collectors → Storage → Grafana. Each arrow represents a network hop; minimizing collector-to-backend hops reduces latency.

Note: Production deployments should run multiple collector instances behind a load balancer for high availability. The diagram shows a simplified topology with one collector per datacenter for clarity.
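
One caveat when several collector instances perform tail-based sampling: a sampling decision needs every span of a trace to reach the same instance. A common way to arrange this is a first collector tier that routes by trace ID using the `loadbalancing` exporter from the Collector contrib distribution. The sketch below assumes a DNS name (`otel-gateway.example.internal`, hypothetical) that resolves to the sampling collectors, and plaintext transport inside a trusted network.

```yaml
# First-tier collector: receives OTLP from rippled nodes and routes
# all spans of a given trace to the same downstream collector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID          # keep each trace on one backend collector
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: trusted internal network
    resolver:
      dns:
        hostname: otel-gateway.example.internal   # hypothetical DNS name
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```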


7.4 Architecture Considerations

7.4.1 Collector Placement

| Strategy | Description | Pros | Cons |
|----------|-------------|------|------|
| Sidecar | Collector per node | Isolation, simple config | Resource overhead |
| DaemonSet | Collector per host | Shared resources | Complexity |
| Gateway | Central collector(s) | Centralized processing | Single point of failure |

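Recommendation: Use the Gateway pattern with regional collector clusters for rippled networks. Running a cluster in each region mitigates the single-point-of-failure risk of a lone gateway: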

  • One collector cluster per datacenter/region
  • Tail-based sampling at collector level
  • Multiple export destinations for redundancy
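
As an illustration of the gateway pattern with more than one export destination, here is a minimal collector configuration sketch. The hostnames `tempo` and `apm-server`, the ports, and the plaintext TLS settings are assumptions for a private network and would need adjusting for a real deployment.

```yaml
# Regional gateway collector: one OTLP receiver, two trace backends.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                       # batch spans before export (default settings)

exporters:
  otlp/tempo:
    endpoint: tempo:4317          # hypothetical Tempo hostname
    tls:
      insecure: true              # assumption: trusted internal network
  otlp/elastic:
    endpoint: apm-server:8200     # hypothetical Elastic APM Server hostname
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/elastic]   # fan out to both backends
```

Each exporter keeps its own sending queue, so a slow or unavailable backend does not necessarily block delivery to the other.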

7.4.2 Sampling Strategy

flowchart LR
    subgraph head["Head Sampling (Node)"]
        hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
    end

    subgraph tail["Tail Sampling (Collector)"]
        ts1[Keep all errors]
        ts2[Keep slow >5s]
        ts3[Keep 10% rest]
    end

    head --> tail

    ts1 --> final[Final Traces]
    ts2 --> final
    ts3 --> final

    style head fill:#0d47a1,stroke:#082f6a,color:#fff
    style tail fill:#1b5e20,stroke:#0d3d14,color:#fff
    style hs fill:#0d47a1,stroke:#082f6a,color:#fff
    style ts1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts3 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style final fill:#bf360c,stroke:#8c2809,color:#fff

Reading the diagram:

  • Head Sampling (Node): The first filter -- each rippled node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
  • Tail Sampling (Collector): The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
  • Arrow head → tail: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
  • Final Traces: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
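
The three tail-sampling rules described above can be sketched with the `tail_sampling` processor from the Collector contrib distribution. Policy names, `decision_wait`, and `num_traces` are illustrative values, not figures from this plan. A trace is kept if any one policy matches.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # assumed: how long to wait for a trace to complete
    num_traces: 100000            # assumed: traces held in memory while waiting
    policies:
      - name: keep-errors         # keep all traces containing an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow           # keep traces slower than 5 seconds
        type: latency
        latency:
          threshold_ms: 5000
      - name: keep-10-percent     # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

The processor takes effect only once it is listed in the `processors` of the traces pipeline.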

7.4.3 Data Retention

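| Environment | Hot Storage | Warm Storage | Cold Archive |
|-------------|-------------|--------------|--------------|
| Development | 24 hours | N/A | N/A |
| Staging | 7 days | N/A | N/A |
| Production | 7 days | 30 days | many years |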
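
When Tempo is the backend, the hot-storage window corresponds to the compactor's block retention. The fragment below is a sketch for the 7-day production and staging value; development would use `24h` instead. Warm and cold tiers are assumed to be handled outside Tempo (for example through object-store lifecycle rules) and are not shown.

```yaml
# Tempo configuration fragment: delete trace blocks older than 7 days
compactor:
  compaction:
    block_retention: 168h         # 7 days x 24 hours
```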

7.5 Integration Checklist

  • Choose primary backend (Tempo recommended for cost/features)
  • Deploy collector cluster with high availability
  • Configure tail-based sampling for error/latency traces
  • Set up Grafana dashboards for trace visualization
  • Configure alerts for trace anomalies
  • Establish data retention policies
  • Test trace correlation with logs and metrics

7.6 Grafana Dashboard Examples

Pre-built dashboards for rippled observability.

7.6.1 Consensus Health Dashboard

{
  "title": "rippled Consensus Health",
  "uid": "rippled-consensus-health",
  "tags": ["rippled", "consensus", "tracing"],
  "panels": [
    {
      "title": "Consensus Round Duration",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 4000 },
              { "color": "red", "value": 5000 }
            ]
          }
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "title": "Phase Duration Breakdown",
      "type": "barchart",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "title": "Proposers per Round",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
        }
      ],
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
    },
    {
      "title": "Recent Slow Rounds (>5s)",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | duration > 5s"
        }
      ],
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 }
    }
  ]
}

7.6.2 Node Overview Dashboard

{
  "title": "rippled Node Overview",
  "uid": "rippled-node-overview",
  "panels": [
    {
      "title": "Active Nodes",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\"} | count_over_time() by (resource.service.instance.id) | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }
    },
    {
      "title": "Total Transactions (1h)",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && status.code=error} | rate() / {resource.service.name=\"rippled\"} | rate() * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 10,
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      },
      "gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 }
    },
    {
      "title": "Service Map",
      "type": "nodeGraph",
      "datasource": "Tempo",
      "gridPos": { "h": 12, "w": 12, "x": 12, "y": 0 }
    }
  ]
}

7.6.3 Alert Rules

# grafana/provisioning/alerting/rippled-alerts.yaml
apiVersion: 1

groups:
  - name: rippled-tracing-alerts
    folder: rippled
    interval: 1m
    rules:
      - uid: consensus-slow
        title: Consensus Round Slow
        condition: A
        data:
          - refId: A
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., avg(duration)) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 5m
        annotations:
          summary: Consensus rounds taking >5 seconds
          description: "Consensus duration: {{ $value }}ms"
        labels:
          severity: warning

      - uid: rpc-error-spike
        title: RPC Error Rate Spike
        condition: B
        data:
          - refId: B
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status=error} | rate() > 0.05'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., rate()) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 2m
        annotations:
          summary: RPC error rate >5%
        labels:
          severity: critical

      - uid: tx-throughput-drop
        title: Transaction Throughput Drop
        condition: C
        data:
          - refId: C
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name="tx.receive"} | rate() < 10'
        for: 10m
        annotations:
          summary: Transaction throughput below threshold
        labels:
          severity: warning

7.7 PerfLog and Insight Correlation

OTLP = OpenTelemetry Protocol

How to correlate OpenTelemetry traces with existing rippled observability.

7.7.1 Correlation Architecture

flowchart TB
    subgraph rippled["rippled Node"]
        otel[OpenTelemetry<br/>Spans]
        perflog[PerfLog<br/>JSON Logs]
        insight[Beast Insight<br/>StatsD Metrics]
    end

    subgraph collectors["Data Collection"]
        otelc[OTel Collector]
        promtail[Promtail/Fluentd]
        statsd[StatsD Exporter]
    end

    subgraph storage["Storage"]
        tempo[(Tempo)]
        loki[(Loki)]
        prom[(Prometheus)]
    end

    subgraph grafana["Grafana"]
        traces[Trace View]
        logs[Log View]
        metrics[Metrics View]
        corr[Correlation<br/>Panel]
    end

    otel -->|OTLP| otelc --> tempo
    perflog -->|JSON| promtail --> loki
    insight -->|StatsD| statsd --> prom

    tempo --> traces
    loki --> logs
    prom --> metrics

    traces --> corr
    logs --> corr
    metrics --> corr

    style rippled fill:#0d47a1,stroke:#082f6a,color:#fff
    style collectors fill:#bf360c,stroke:#8c2809,color:#fff
    style storage fill:#1b5e20,stroke:#0d3d14,color:#fff
    style grafana fill:#4a148c,stroke:#2e0d57,color:#fff
    style otel fill:#0d47a1,stroke:#082f6a,color:#fff
    style perflog fill:#0d47a1,stroke:#082f6a,color:#fff
    style insight fill:#0d47a1,stroke:#082f6a,color:#fff
    style otelc fill:#bf360c,stroke:#8c2809,color:#fff
    style promtail fill:#bf360c,stroke:#8c2809,color:#fff
    style statsd fill:#bf360c,stroke:#8c2809,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style loki fill:#1b5e20,stroke:#0d3d14,color:#fff
    style prom fill:#1b5e20,stroke:#0d3d14,color:#fff
    style traces fill:#4a148c,stroke:#2e0d57,color:#fff
    style logs fill:#4a148c,stroke:#2e0d57,color:#fff
    style metrics fill:#4a148c,stroke:#2e0d57,color:#fff
    style corr fill:#4a148c,stroke:#2e0d57,color:#fff

Reading the diagram:

  • rippled Node (three sources): A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
  • Data Collection layer: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently.
  • Storage layer (Tempo, Loki, Prometheus): Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation).
  • Grafana Correlation Panel: The key integration point -- Grafana queries all three stores and links them via shared fields (trace_id, xrpl.tx.hash, ledger_seq), enabling a single-pane debugging experience.

7.7.2 Correlation Fields

| Source | Field | Link To | Purpose |
|--------|-------|---------|---------|
| Trace | trace_id | Logs | Find log entries for trace |
| Trace | xrpl.tx.hash | Logs, Metrics | Find TX-related data |
| Trace | xrpl.consensus.ledger.seq | Logs | Find ledger-related logs |
| PerfLog | trace_id (new) | Traces | Jump to trace from log |
| PerfLog | ledger_seq | Traces | Find consensus trace |
| Insight | exemplar.trace_id | Traces | Jump from metric spike |
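
The trace-to-log and log-to-trace links in this table could be wired up through Grafana data source provisioning, roughly as sketched below. The data source UIDs, URLs, the `job="rippled"` label, and the assumption that PerfLog emits a JSON field named `trace_id` are illustrative, so the regular expression must be adapted to the actual log format.

```yaml
# grafana/provisioning/datasources/rippled.yaml (hypothetical path)
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200                  # assumed Tempo address
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki                 # jump from a span to Loki
        spanStartTimeShift: '-5m'           # widen the log search window
        spanEndTimeShift: '5m'
        customQuery: true
        query: '{job="rippled"} |= "$${__span.traceId}"'

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100                   # assumed Loki address
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumes PerfLog lines contain "trace_id":"<hex id>"
          matcherRegex: '"trace_id":\s*"(\w+)"'
          datasourceUid: tempo              # jump from a log line to Tempo
          url: '$${__value.raw}'
```

The doubled `$$` escapes the variable so that Grafana's provisioning loader does not try to expand it as an environment variable.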

7.7.3 Example: Debugging a Slow Transaction

Step 1: Find the trace

# In Grafana Explore with Tempo
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}

Step 2: Get the trace_id from the trace view

Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736

Step 3: Find related PerfLog entries

# In Grafana Explore with Loki
{job="rippled"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

Step 4: Check Insight metrics for the time window

# In Grafana with Prometheus
# Replace <timestamp_from_trace> with the trace's Unix timestamp in seconds
rate(rippled_tx_applied_total[1m] @ <timestamp_from_trace>)

7.7.4 Unified Dashboard Example

{
  "title": "rippled Unified Observability",
  "uid": "rippled-unified",
  "panels": [
    {
      "title": "Transaction Latency (Traces)",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | histogram_over_time(duration)"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 0, "y": 0 }
    },
    {
      "title": "Transaction Rate (Metrics)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "rate(rippled_tx_received_total[5m])",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "links": [
            {
              "title": "View traces",
              "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"rippled\\\" && name=\\\"tx.receive\\\"}\"}"
            }
          ]
        }
      },
      "gridPos": { "h": 6, "w": 8, "x": 8, "y": 0 }
    },
    {
      "title": "Recent Logs",
      "type": "logs",
      "datasource": "Loki",
      "targets": [
        {
          "expr": "{job=\"rippled\"} | json"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 16, "y": 0 }
    },
    {
      "title": "Trace Search",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\"}"
        }
      ],
      "fieldConfig": {
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "traceID" },
            "properties": [
              {
                "id": "links",
                "value": [
                  {
                    "title": "View trace",
                    "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"${__value.raw}\"}"
                  },
                  {
                    "title": "View logs",
                    "url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"rippled\\\"} |= \\\"${__value.raw}\\\"\"}"
                  }
                ]
              }
            ]
          }
        ]
      },
      "gridPos": { "h": 12, "w": 24, "x": 0, "y": 6 }
    }
  ]
}

Previous: Implementation Phases | Next: Appendix | Back to: Overview