mirror of https://github.com/XRPLF/rippled.git synced 2026-04-29 15:37:57 +00:00

Files

Pratik Mankawde 913a4b794c docs: correct OTel overhead estimates against SDK benchmarks

Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-16 15:00:47 +01:00

21 KiB

Raw Blame History

Observability Backend Recommendations

Parent Document: OpenTelemetryPlan.md Related: Implementation Phases | Appendix

7.1 Development/Testing Backends

OTLP = OpenTelemetry Protocol

Backend	Pros	Cons	Use Case
Tempo	Cost-effective, Grafana integration	Requires Grafana stack	Local dev, CI, Prod
Zipkin	Simple, lightweight	Basic features	Quick prototyping

Quick Start with Tempo

# Start Tempo with OTLP support
docker run -d --name tempo \
  -p 3200:3200 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/tempo:2.6.1

7.2 Production Backends

APM = Application Performance Monitoring

Backend	Pros	Cons	Use Case
Grafana Tempo	Cost-effective, Grafana integration	Requires Grafana stack	Most production deployments
Elastic APM	Full observability stack, log correlation	Resource intensive	Existing Elastic users
Honeycomb	Excellent query, high cardinality	SaaS cost	Deep debugging needs
Datadog APM	Full platform, easy setup	SaaS cost	Enterprise with budget

Backend Selection Flowchart

flowchart TD
    start[Select Backend] --> budget{Budget<br/>Constraints?}

    budget -->|Yes| oss[Open Source]
    budget -->|No| saas{Prefer<br/>SaaS?}

    oss --> existing{Existing<br/>Stack?}
    existing -->|Grafana| tempo[Grafana Tempo]
    existing -->|Elastic| elastic[Elastic APM]
    existing -->|None| tempo

    saas -->|Yes| enterprise{Enterprise<br/>Support?}
    saas -->|No| oss

    enterprise -->|Yes| datadog[Datadog APM]
    enterprise -->|No| honeycomb[Honeycomb]

    tempo --> final[Configure Collector]
    elastic --> final
    honeycomb --> final
    datadog --> final

    style start fill:#0f172a,stroke:#020617,color:#fff
    style budget fill:#334155,stroke:#1e293b,color:#fff
    style oss fill:#1e293b,stroke:#0f172a,color:#fff
    style existing fill:#334155,stroke:#1e293b,color:#fff
    style saas fill:#334155,stroke:#1e293b,color:#fff
    style enterprise fill:#334155,stroke:#1e293b,color:#fff
    style final fill:#0f172a,stroke:#020617,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style elastic fill:#bf360c,stroke:#8c2809,color:#fff
    style honeycomb fill:#0d47a1,stroke:#082f6a,color:#fff
    style datadog fill:#4a148c,stroke:#2e0d57,color:#fff

Reading the diagram:

Budget Constraints? (Yes): Leads to open-source options. If you already run Grafana or Elastic, pick the matching backend; otherwise default to Grafana Tempo.
Budget Constraints? (No) → Prefer SaaS?: If you want a managed service, choose between Datadog (enterprise support) and Honeycomb (developer-focused). If not, fall back to open-source.
Terminal nodes (Tempo / Elastic / Honeycomb / Datadog): Each represents a concrete backend choice, all of which feed into the same final step.
Configure Collector: Regardless of backend, you always finish by configuring the OTel Collector to export to your chosen destination.

7.3 Recommended Production Architecture

OTLP = OpenTelemetry Protocol | APM = Application Performance Monitoring | HA = High Availability

flowchart TB
    subgraph validators["Validator Nodes"]
        v1[rippled<br/>Validator 1]
        v2[rippled<br/>Validator 2]
    end

    subgraph stock["Stock Nodes"]
        s1[rippled<br/>Stock 1]
        s2[rippled<br/>Stock 2]
    end

    subgraph collector["OTel Collector Cluster"]
        c1[Collector<br/>DC1]
        c2[Collector<br/>DC2]
    end

    subgraph backends["Storage Backends"]
        tempo[(Grafana<br/>Tempo)]
        elastic[(Elastic<br/>APM)]
        archive[(S3/GCS<br/>Archive)]
    end

    subgraph ui["Visualization"]
        grafana[Grafana<br/>Dashboards]
    end

    v1 -->|OTLP| c1
    v2 -->|OTLP| c1
    s1 -->|OTLP| c2
    s2 -->|OTLP| c2

    c1 --> tempo
    c1 --> elastic
    c2 --> tempo
    c2 --> archive

    tempo --> grafana
    elastic --> grafana

    %% Note: simplified single-collector-per-DC topology shown for clarity

    style validators fill:#b71c1c,stroke:#7f1d1d,color:#ffffff
    style stock fill:#0d47a1,stroke:#082f6a,color:#ffffff
    style collector fill:#bf360c,stroke:#8c2809,color:#ffffff
    style backends fill:#1b5e20,stroke:#0d3d14,color:#ffffff
    style ui fill:#4a148c,stroke:#2e0d57,color:#ffffff

Reading the diagram:

Validator / Stock Nodes: All rippled nodes emit trace data via OTLP. Validators and stock nodes are grouped separately because they may reside in different network zones.
Collector Cluster (DC1, DC2): Regional collectors receive OTLP from nodes in their datacenter, apply processing (sampling, enrichment), and fan out to multiple backends.
Storage Backends: Tempo and Elastic provide queryable trace storage; S3/GCS Archive provides long-term cold storage for compliance or post-incident analysis.
Grafana Dashboards: The single visualization layer that queries both Tempo and Elastic, giving operators a unified view of all traces.
Data flow direction: Nodes → Collectors → Storage → Grafana. Each arrow represents a network hop; minimizing collector-to-backend hops reduces latency.

Note

: Production deployments should use multiple collector instances behind a load balancer for high availability. The diagram shows a simplified single-collector topology for clarity.

7.4 Architecture Considerations

7.4.1 Collector Placement

Strategy	Description	Pros	Cons
Sidecar	Collector per node	Isolation, simple config	Resource overhead
DaemonSet	Collector per host	Shared resources	Complexity
Gateway	Central collector(s)	Centralized processing	Single point of failure

Recommendation: Use Gateway pattern with regional collectors for rippled networks:

One collector cluster per datacenter/region
Tail-based sampling at collector level
Multiple export destinations for redundancy

7.4.2 Sampling Strategy

flowchart LR
    subgraph head["Head Sampling (Node)"]
        hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
    end

    subgraph tail["Tail Sampling (Collector)"]
        ts1[Keep all errors]
        ts2[Keep slow >5s]
        ts3[Keep 10% rest]
    end

    head --> tail

    ts1 --> final[Final Traces]
    ts2 --> final
    ts3 --> final

    style head fill:#0d47a1,stroke:#082f6a,color:#fff
    style tail fill:#1b5e20,stroke:#0d3d14,color:#fff
    style hs fill:#0d47a1,stroke:#082f6a,color:#fff
    style ts1 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts2 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style ts3 fill:#1b5e20,stroke:#0d3d14,color:#fff
    style final fill:#bf360c,stroke:#8c2809,color:#fff

Reading the diagram:

Head Sampling (Node): The first filter -- each rippled node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
Tail Sampling (Collector): The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
Arrow head → tail: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
Final Traces: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.

7.4.3 Data Retention

Environment	Hot Storage	Warm Storage	Cold Archive
Development	24 hours	N/A	N/A
Staging	7 days	N/A	N/A
Production	7 days	30 days	many years

7.5 Integration Checklist

Choose primary backend (Tempo recommended for cost/features)
Deploy collector cluster with high availability
Configure tail-based sampling for error/latency traces
Set up Grafana dashboards for trace visualization
Configure alerts for trace anomalies
Establish data retention policies
Test trace correlation with logs and metrics

7.6 Grafana Dashboard Examples

Pre-built dashboards for rippled observability.

7.6.1 Consensus Health Dashboard

{
  "title": "rippled Consensus Health",
  "uid": "rippled-consensus-health",
  "tags": ["rippled", "consensus", "tracing"],
  "panels": [
    {
      "title": "Consensus Round Duration",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(duration) by (resource.service.instance.id)"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 4000 },
              { "color": "red", "value": 5000 }
            ]
          }
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "title": "Phase Duration Breakdown",
      "type": "barchart",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=~\"consensus.phase.*\"} | avg(duration) by (name)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "title": "Proposers per Round",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | avg(span.xrpl.consensus.proposers)"
        }
      ],
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
    },
    {
      "title": "Recent Slow Rounds (>5s)",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"consensus.round\"} | duration > 5s"
        }
      ],
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 12 }
    }
  ]
}

7.6.2 Node Overview Dashboard

{
  "title": "rippled Node Overview",
  "uid": "rippled-node-overview",
  "panels": [
    {
      "title": "Active Nodes",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\"} | count_over_time() by (resource.service.instance.id) | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }
    },
    {
      "title": "Total Transactions (1h)",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | count()"
        }
      ],
      "gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && status.code=error} | rate() / {resource.service.name=\"rippled\"} | rate() * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 10,
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      },
      "gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 }
    },
    {
      "title": "Service Map",
      "type": "nodeGraph",
      "datasource": "Tempo",
      "gridPos": { "h": 12, "w": 12, "x": 12, "y": 0 }
    }
  ]
}

7.6.3 Alert Rules

# grafana/provisioning/alerting/rippled-alerts.yaml
apiVersion: 1

groups:
  - name: rippled-tracing-alerts
    folder: rippled
    interval: 1m
    rules:
      - uid: consensus-slow
        title: Consensus Round Slow
        condition: A
        data:
          - refId: A
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name="consensus.round"} | avg(duration) > 5s'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., avg(duration)) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 5m
        annotations:
          summary: Consensus rounds taking >5 seconds
          description: "Consensus duration: {{ $value }}ms"
        labels:
          severity: warning

      - uid: rpc-error-spike
        title: RPC Error Rate Spike
        condition: B
        data:
          - refId: B
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name=~"rpc.command.*" && status.code=error} | rate() > 0.05'
              # Note: Verify TraceQL aggregate queries are supported by your
              # Tempo version. Aggregate alerting (e.g., rate()) requires
              # Tempo 2.3+ with TraceQL metrics enabled.
        for: 2m
        annotations:
          summary: RPC error rate >5%
        labels:
          severity: critical

      - uid: tx-throughput-drop
        title: Transaction Throughput Drop
        condition: C
        data:
          - refId: C
            datasourceUid: tempo
            model:
              queryType: traceql
              query: '{resource.service.name="rippled" && name="tx.receive"} | rate() < 10'
        for: 10m
        annotations:
          summary: Transaction throughput below threshold
        labels:
          severity: warning

7.7 PerfLog and Insight Correlation

OTLP = OpenTelemetry Protocol

How to correlate OpenTelemetry traces with existing rippled observability.

7.7.1 Correlation Architecture

flowchart TB
    subgraph rippled["rippled Node"]
        otel[OpenTelemetry<br/>Spans]
        perflog[PerfLog<br/>JSON Logs]
        insight[Beast Insight<br/>StatsD Metrics]
    end

    subgraph collectors["Data Collection"]
        otelc[OTel Collector]
        promtail[Promtail/Fluentd]
        statsd[StatsD Exporter]
    end

    subgraph storage["Storage"]
        tempo[(Tempo)]
        loki[(Loki)]
        prom[(Prometheus)]
    end

    subgraph grafana["Grafana"]
        traces[Trace View]
        logs[Log View]
        metrics[Metrics View]
        corr[Correlation<br/>Panel]
    end

    otel -->|OTLP| otelc --> tempo
    perflog -->|JSON| promtail --> loki
    insight -->|StatsD| statsd --> prom

    tempo --> traces
    loki --> logs
    prom --> metrics

    traces --> corr
    logs --> corr
    metrics --> corr

    style rippled fill:#0d47a1,stroke:#082f6a,color:#fff
    style collectors fill:#bf360c,stroke:#8c2809,color:#fff
    style storage fill:#1b5e20,stroke:#0d3d14,color:#fff
    style grafana fill:#4a148c,stroke:#2e0d57,color:#fff
    style otel fill:#0d47a1,stroke:#082f6a,color:#fff
    style perflog fill:#0d47a1,stroke:#082f6a,color:#fff
    style insight fill:#0d47a1,stroke:#082f6a,color:#fff
    style otelc fill:#bf360c,stroke:#8c2809,color:#fff
    style promtail fill:#bf360c,stroke:#8c2809,color:#fff
    style statsd fill:#bf360c,stroke:#8c2809,color:#fff
    style tempo fill:#1b5e20,stroke:#0d3d14,color:#fff
    style loki fill:#1b5e20,stroke:#0d3d14,color:#fff
    style prom fill:#1b5e20,stroke:#0d3d14,color:#fff
    style traces fill:#4a148c,stroke:#2e0d57,color:#fff
    style logs fill:#4a148c,stroke:#2e0d57,color:#fff
    style metrics fill:#4a148c,stroke:#2e0d57,color:#fff
    style corr fill:#4a148c,stroke:#2e0d57,color:#fff

Reading the diagram:

rippled Node (three sources): A single node emits three independent data streams -- OpenTelemetry spans, PerfLog JSON logs, and Beast Insight StatsD metrics.
Data Collection layer: Each stream has its own collector -- OTel Collector for spans, Promtail/Fluentd for logs, and a StatsD exporter for metrics. They operate independently.
Storage layer (Tempo, Loki, Prometheus): Each data type lands in a purpose-built store optimized for its query patterns (trace search, log grep, metric aggregation).
Grafana Correlation Panel: The key integration point -- Grafana queries all three stores and links them via shared fields (trace_id, xrpl.tx.hash, ledger_seq), enabling a single-pane debugging experience.

7.7.2 Correlation Fields

Source	Field	Link To	Purpose
Trace	`trace_id`	Logs	Find log entries for trace
Trace	`xrpl.tx.hash`	Logs, Metrics	Find TX-related data
Trace	`xrpl.consensus.ledger.seq`	Logs	Find ledger-related logs
PerfLog	`trace_id` (new)	Traces	Jump to trace from log
PerfLog	`ledger_seq`	Traces	Find consensus trace
Insight	`exemplar.trace_id`	Traces	Jump from metric spike

7.7.3 Example: Debugging a Slow Transaction

Step 1: Find the trace

# In Grafana Explore with Tempo
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}

Step 2: Get the trace_id from the trace view

Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736

Step 3: Find related PerfLog entries

# In Grafana Explore with Loki
{job="rippled"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

Step 4: Check Insight metrics for the time window

# In Grafana with Prometheus
rate(rippled_tx_applied_total[1m])
  @ timestamp_from_trace

7.7.4 Unified Dashboard Example

{
  "title": "rippled Unified Observability",
  "uid": "rippled-unified",
  "panels": [
    {
      "title": "Transaction Latency (Traces)",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\" && name=\"tx.receive\"} | histogram_over_time(duration)"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 0, "y": 0 }
    },
    {
      "title": "Transaction Rate (Metrics)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "rate(rippled_tx_received_total[5m])",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "links": [
            {
              "title": "View traces",
              "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"{resource.service.name=\\\"rippled\\\" && name=\\\"tx.receive\\\"}\"}"
            }
          ]
        }
      },
      "gridPos": { "h": 6, "w": 8, "x": 8, "y": 0 }
    },
    {
      "title": "Recent Logs",
      "type": "logs",
      "datasource": "Loki",
      "targets": [
        {
          "expr": "{job=\"rippled\"} | json"
        }
      ],
      "gridPos": { "h": 6, "w": 8, "x": 16, "y": 0 }
    },
    {
      "title": "Trace Search",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"rippled\"}"
        }
      ],
      "fieldConfig": {
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "traceID" },
            "properties": [
              {
                "id": "links",
                "value": [
                  {
                    "title": "View trace",
                    "url": "/explore?left={\"datasource\":\"Tempo\",\"query\":\"${__value.raw}\"}"
                  },
                  {
                    "title": "View logs",
                    "url": "/explore?left={\"datasource\":\"Loki\",\"query\":\"{job=\\\"rippled\\\"} |= \\\"${__value.raw}\\\"\"}"
                  }
                ]
              }
            ]
          }
        ]
      },
      "gridPos": { "h": 12, "w": 24, "x": 0, "y": 6 }
    }
  ]
}

Previous: Implementation Phases | Next: Appendix | Back to: Overview

21 KiB Raw Blame History

Observability Backend Recommendations

7.1 Development/Testing Backends

Quick Start with Tempo

7.2 Production Backends

Backend Selection Flowchart

7.3 Recommended Production Architecture

7.4 Architecture Considerations

7.4.1 Collector Placement

7.4.2 Sampling Strategy

7.4.3 Data Retention

7.5 Integration Checklist

7.6 Grafana Dashboard Examples

7.6.1 Consensus Health Dashboard

7.6.2 Node Overview Dashboard

7.6.3 Alert Rules

7.7 PerfLog and Insight Correlation

7.7.1 Correlation Architecture

7.7.2 Correlation Fields

7.7.3 Example: Debugging a Slow Transaction

7.7.4 Unified Dashboard Example

21 KiB

Raw Blame History