mirror of https://github.com/XRPLF/rippled.git synced 2026-06-02 16:26:48 +00:00

Files

Pratik Mankawde 9f81e770eb Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>

2026-06-01 11:36:19 +01:00

31 KiB

Raw Blame History

Configuration Reference

Parent Document: OpenTelemetryPlan.md Related: Code Samples | Implementation Phases

5.1 xrpld Configuration

OTLP = OpenTelemetry Protocol | TxQ = Transaction Queue

5.1.1 Configuration File Section

Add to cfg/xrpld-example.cfg:

# ═══════════════════════════════════════════════════════════════════════════════
# TELEMETRY (OpenTelemetry Distributed Tracing)
# ═══════════════════════════════════════════════════════════════════════════════
#
# Enables distributed tracing for transaction flow, consensus, and RPC calls.
# Traces are exported to an OpenTelemetry Collector using OTLP protocol.
#
# [telemetry]
#
# # Enable/disable telemetry (default: 0 = disabled)
# enabled=1
#
# # OTLP endpoint (default: http://localhost:4318/v1/traces - OTLP/HTTP)
# # Note: only OTLP/HTTP is shipped in Phase 1b. OTLP/gRPC support is
# # planned as future work and is not yet parsed by TelemetryConfig.cpp.
# endpoint=http://localhost:4318/v1/traces
#
# # Use TLS for exporter connection (default: 0)
# use_tls=0
#
# # Path to CA certificate for TLS (optional)
# # tls_ca_cert=/path/to/ca.crt
#
# # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling)
# # Use lower values in production to reduce overhead
# # Default: 1.0 (all traces). For production deployments with high
# # throughput, 0.1 (10%) is recommended to reduce overhead.
# # See Section 7.4.2 for sampling strategy details.
# sampling_ratio=0.1
#
# # Batch processor settings
# batch_size=512           # Spans per batch (default: 512)
# batch_delay_ms=5000      # Max delay before sending batch (default: 5000)
# max_queue_size=2048      # Max queued spans (default: 2048)
#
# # Component-specific tracing (default: all enabled except peer)
# trace_transactions=1     # Transaction relay and processing
# trace_consensus=1        # Consensus rounds and proposals
# trace_rpc=1              # RPC request handling
# trace_peer=0             # Peer messages (high volume, disabled by default)
# trace_ledger=1           # Ledger acquisition and building
#
# # Planned (not yet parsed by TelemetryConfig.cpp):
# # trace_pathfind=1       # Path computation (Phase 2)
# # trace_txq=1            # Transaction queue (Phase 3)
# # trace_validator=0      # Validator list / manifest (future)
# # trace_amendment=0      # Amendment voting (future)
#
# # Trace ID strategies for cross-node correlation
# # "deterministic" (default) derives trace_id from a workflow hash
# #   (txHash for transactions, prevLedgerHash for consensus) so all nodes
# #   produce spans under the same trace_id for the same workflow.
# # "attribute" uses random trace_id; correlation via attribute queries.
# tx_trace_strategy=deterministic
# consensus_trace_strategy=deterministic
#
# # Service identification (automatically detected if not specified)
# # service_name=xrpld
# # service_instance_id=<node_public_key>

[telemetry]
enabled=0

5.1.2 Configuration Options Summary

Option	Type	Default	Description
`enabled`	bool	`false`	Enable/disable telemetry
`endpoint`	string	`http://localhost:4318/v1/traces`	OTLP/HTTP collector endpoint
`use_tls`	bool	`false`	Enable TLS for exporter connection
`tls_ca_cert`	string	`""`	Path to CA certificate file
`sampling_ratio`	float	`1.0`	Sampling ratio (0.0-1.0)
`batch_size`	uint	`512`	Spans per export batch
`batch_delay_ms`	uint	`5000`	Max delay before sending batch (ms)
`max_queue_size`	uint	`2048`	Maximum queued spans
`trace_transactions`	bool	`true`	Enable transaction tracing
`trace_consensus`	bool	`true`	Enable consensus tracing
`trace_rpc`	bool	`true`	Enable RPC tracing
`trace_peer`	bool	`false`	Enable peer message tracing (high volume)
`trace_ledger`	bool	`true`	Enable ledger tracing
`tx_trace_strategy`	string	`"deterministic"`	TX trace ID strategy: `"deterministic"` (trace_id = txHash[0:16]) or `"attribute"` (random)
`consensus_trace_strategy`	string	`"deterministic"`	Consensus trace ID strategy: `"deterministic"` (trace_id = prevLedgerHash[0:16]) or `"attribute"` (random)
`service_name`	string	`"xrpld"`	Service name for traces
`service_instance_id`	string	`<node_pubkey>`	Instance identifier

Planned (not yet implemented): the following options appear in the design documents but are not parsed by TelemetryConfig.cpp in Phase 1b and later phases. They will be added as the corresponding subsystems are instrumented:

Option	Planned Phase	Purpose
`exporter`	Future	Select between OTLP/HTTP and OTLP/gRPC
`trace_pathfind`	Phase 2	Path computation tracing toggle
`trace_txq`	Phase 3	Transaction queue tracing toggle
`trace_validator`	Future	Validator list / manifest update tracing
`trace_amendment`	Future	Amendment voting tracing

5.2 Configuration Parser

TxQ = Transaction Queue

// src/libxrpl/telemetry/TelemetryConfig.cpp

#include <xrpl/telemetry/Telemetry.h>
#include <xrpl/basics/Log.h>

namespace xrpl {
namespace telemetry {

Telemetry::Setup
setupTelemetry(
    Section const& section,
    std::string const& nodePublicKey,
    std::string const& version)
{
    Telemetry::Setup setup;

    // Basic settings
    setup.enabled = section.value_or("enabled", false);
    setup.serviceName = section.value_or("service_name", "xrpld");
    setup.serviceVersion = version;
    setup.serviceInstanceId = section.value_or(
        "service_instance_id", nodePublicKey);

    // Exporter settings
    setup.exporterType = section.value_or("exporter", "otlp_grpc");

    if (setup.exporterType == "otlp_grpc")
        setup.exporterEndpoint = section.value_or("endpoint", "localhost:4317");
    else if (setup.exporterType == "otlp_http")
        setup.exporterEndpoint = section.value_or("endpoint", "localhost:4318");

    setup.useTls = section.value_or("use_tls", false);
    setup.tlsCertPath = section.value_or("tls_ca_cert", "");

    // Sampling
    setup.samplingRatio = section.value_or("sampling_ratio", 1.0);
    if (setup.samplingRatio < 0.0 || setup.samplingRatio > 1.0)
    {
        Throw<std::runtime_error>(
            "telemetry.sampling_ratio must be between 0.0 and 1.0");
    }

    // Batch processor
    setup.batchSize = section.value_or("batch_size", 512u);
    setup.batchDelay = std::chrono::milliseconds{
        section.value_or("batch_delay_ms", 5000u)};
    setup.maxQueueSize = section.value_or("max_queue_size", 2048u);

    // Component filtering
    setup.traceTransactions = section.value_or("trace_transactions", true);
    setup.traceConsensus = section.value_or("trace_consensus", true);
    setup.traceRpc = section.value_or("trace_rpc", true);
    setup.tracePeer = section.value_or("trace_peer", false);
    setup.traceLedger = section.value_or("trace_ledger", true);
    setup.tracePathfind = section.value_or("trace_pathfind", true);
    setup.traceTxQ = section.value_or("trace_txq", true);
    setup.traceValidator = section.value_or("trace_validator", false);
    setup.traceAmendment = section.value_or("trace_amendment", false);

    return setup;
}

} // namespace telemetry
} // namespace xrpl

5.3 Application Integration

5.3.1 ApplicationImp Changes

Deferred identity: The node public key (nodeIdentity_) is not available during ApplicationImp's member initializer list — it is resolved later in setup(). The Telemetry object is therefore constructed with an empty serviceInstanceId and patched via setServiceInstanceId() once setup() has called getNodeIdentity().

// src/xrpld/app/main/Application.cpp (modified)

#include <xrpl/telemetry/Telemetry.h>

class ApplicationImp : public Application, public BasicApp
{
    // ... existing members (perfLog_, etc.) ...

    // Telemetry — constructed in the member initializer list with
    // an empty serviceInstanceId, patched in setup().
    std::unique_ptr<telemetry::Telemetry> telemetry_;

    // Member initializer list (excerpt):
    // ...
    // , telemetry_(
    //       telemetry::makeTelemetry(
    //           telemetry::setupTelemetry(
    //               config_->section("telemetry"),
    //               "",  // Updated later via setServiceInstanceId()
    //               BuildInfo::getVersionString()),
    //           logs_->journal("Telemetry")))
    // ...

    bool setup(...) override
    {
        // ... existing setup code ...

        nodeIdentity_ = getNodeIdentity(*this, cmdline);

        // Inject node identity into telemetry resource attributes,
        // unless the user already set a custom service_instance_id.
        if (!config_->section("telemetry").exists("service_instance_id"))
            telemetry_->setServiceInstanceId(
                toBase58(TokenType::NodePublic, nodeIdentity_->first));

        // ... rest of setup ...
    }

    void start(bool withTimers) override
    {
        // ... existing start code ...
        telemetry_->start();
    }

    void run() override
    {
        // ... existing run/shutdown code ...
        telemetry_->stop();
    }

    telemetry::Telemetry&
    getTelemetry() override
    {
        return *telemetry_;
    }
};

5.3.2 ServiceRegistry Interface Addition

// include/xrpl/core/ServiceRegistry.h (modified)

namespace telemetry {
class Telemetry;
}  // namespace telemetry

class ServiceRegistry
{
public:
    // ... existing virtual methods ...

    /** Get the telemetry system for distributed tracing. */
    virtual telemetry::Telemetry&
    getTelemetry() = 0;
};

Note: Application extends ServiceRegistry, so getTelemetry() is available on both. Components that hold a ServiceRegistry& (e.g. NetworkOPsImp) call registry_.get().getTelemetry(). Components that still hold an Application& (e.g. ServerHandler, PeerImp, RCLConsensusAdaptor) call app_.getTelemetry() directly.

5.4 CMake Integration

OTLP = OpenTelemetry Protocol

5.4.1 Find OpenTelemetry Module

# cmake/FindOpenTelemetry.cmake

# Find OpenTelemetry C++ SDK
#
# This module defines:
#   OpenTelemetry_FOUND - System has OpenTelemetry
#   OpenTelemetry::api - API library target
#   OpenTelemetry::sdk - SDK library target
#   OpenTelemetry::otlp_grpc_exporter - OTLP gRPC exporter target
#   OpenTelemetry::otlp_http_exporter - OTLP HTTP exporter target

find_package(opentelemetry-cpp CONFIG QUIET)

if(opentelemetry-cpp_FOUND)
    set(OpenTelemetry_FOUND TRUE)

    # Create imported targets if not already created by config
    if(NOT TARGET OpenTelemetry::api)
        add_library(OpenTelemetry::api ALIAS opentelemetry-cpp::api)
    endif()
    if(NOT TARGET OpenTelemetry::sdk)
        add_library(OpenTelemetry::sdk ALIAS opentelemetry-cpp::sdk)
    endif()
    if(NOT TARGET OpenTelemetry::otlp_grpc_exporter)
        add_library(OpenTelemetry::otlp_grpc_exporter ALIAS
            opentelemetry-cpp::otlp_grpc_exporter)
    endif()
else()
    # Try pkg-config fallback
    find_package(PkgConfig QUIET)
    if(PKG_CONFIG_FOUND)
        pkg_check_modules(OTEL opentelemetry-cpp QUIET)
        if(OTEL_FOUND)
            set(OpenTelemetry_FOUND TRUE)
            # Create imported targets from pkg-config
            add_library(OpenTelemetry::api INTERFACE IMPORTED)
            target_include_directories(OpenTelemetry::api INTERFACE
                ${OTEL_INCLUDE_DIRS})
        endif()
    endif()
endif()

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(OpenTelemetry
    REQUIRED_VARS OpenTelemetry_FOUND)

5.4.2 CMakeLists.txt Changes

# CMakeLists.txt (additions)

# ═══════════════════════════════════════════════════════════════════════════════
# TELEMETRY OPTIONS
# ═══════════════════════════════════════════════════════════════════════════════

option(XRPL_ENABLE_TELEMETRY
    "Enable OpenTelemetry distributed tracing support" OFF)

if(XRPL_ENABLE_TELEMETRY)
    find_package(OpenTelemetry REQUIRED)

    # Define compile-time flag
    add_compile_definitions(XRPL_ENABLE_TELEMETRY)

    message(STATUS "OpenTelemetry tracing: ENABLED")
else()
    message(STATUS "OpenTelemetry tracing: DISABLED")
endif()

# ═══════════════════════════════════════════════════════════════════════════════
# TELEMETRY LIBRARY
# ═══════════════════════════════════════════════════════════════════════════════

if(XRPL_ENABLE_TELEMETRY)
    add_library(xrpl_telemetry
        src/libxrpl/telemetry/Telemetry.cpp
        src/libxrpl/telemetry/TelemetryConfig.cpp
        src/libxrpl/telemetry/TraceContext.cpp
    )

    target_include_directories(xrpl_telemetry
        PUBLIC
            ${CMAKE_CURRENT_SOURCE_DIR}/include
    )

    target_link_libraries(xrpl_telemetry
        PUBLIC
            OpenTelemetry::api
            OpenTelemetry::sdk
            OpenTelemetry::otlp_grpc_exporter
        PRIVATE
            xrpl_basics
    )

    # Add to main library dependencies
    target_link_libraries(xrpld PRIVATE xrpl_telemetry)
else()
    # Create null implementation library
    add_library(xrpl_telemetry
        src/libxrpl/telemetry/NullTelemetry.cpp
    )
    target_include_directories(xrpl_telemetry
        PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include
    )
endif()

5.5 OpenTelemetry Collector Configuration

OTLP = OpenTelemetry Protocol | APM = Application Performance Monitoring

Production hardening: The configurations in this section are starting points. For production deployments where xrpld ships telemetry across a network to a centrally-hosted collector, see Securing the OTel Pipeline for the required mTLS receiver config, NetworkPolicy, and peer trace-context validation.

5.5.1 Development Configuration

# otel-collector-dev.yaml
# Minimal configuration for local development

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 100

exporters:
  # Console output for debugging
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

  # Tempo for trace storage
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp/tempo]

5.5.2 Production Configuration

# otel-collector-prod.yaml
# Production configuration with filtering, sampling, and multiple backends

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/server.crt
          key_file: /etc/otel/server.key
          ca_file: /etc/otel/ca.crt

processors:
  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  # Batch processing for efficiency
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 1024

  # Tail-based sampling (keep errors and slow traces)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow consensus rounds (>5s)
      - name: slow-consensus
        type: latency
        latency:
          threshold_ms: 5000
      # Keep slow RPC requests (>1s)
      - name: slow-rpc
        type: and
        and:
          and_sub_policy:
            - name: rpc-spans
              type: string_attribute
              string_attribute:
                key: command
                values: [".*"]
                enabled_regex_matching: true
            - name: latency
              type: latency
              latency:
                threshold_ms: 1000
      # Probabilistic sampling for the rest
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Attribute processing
  attributes:
    actions:
      # Hash sensitive data
      - key: xrpl.tx.account
        action: hash
      # Add deployment info
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # Grafana Tempo for long-term storage
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: false
      ca_file: /etc/otel/tempo-ca.crt

  # Elastic APM for correlation with logs
  otlp/elastic:
    endpoint: apm.elastic:8200
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo, otlp/elastic]

5.6 Docker Compose Development Environment

OTLP = OpenTelemetry Protocol

# docker-compose-telemetry.yaml
version: "3.8"

services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.92.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-dev.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "13133:13133" # Health check
    depends_on:
      - tempo

  # Tempo for trace storage
  tempo:
    image: grafana/tempo:2.6.1
    container_name: tempo
    ports:
      - "3200:3200" # Tempo HTTP API
      - "4317" # OTLP gRPC (internal)

  # Grafana for dashboards
  grafana:
    image: grafana/grafana:10.2.3
    container_name: grafana
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    depends_on:
      - tempo

  # Prometheus for metrics (optional, for correlation)
  prometheus:
    image: prom/prometheus:v2.48.1
    container_name: prometheus
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"

networks:
  default:
    name: xrpld-telemetry

5.7 Configuration Architecture

OTLP = OpenTelemetry Protocol

flowchart TB
    subgraph config["Configuration Sources"]
        cfgFile["xrpld.cfg<br/>[telemetry] section"]
        cmake["CMake<br/>XRPL_ENABLE_TELEMETRY"]
    end

    subgraph init["Initialization"]
        parse["setupTelemetry()"]
        factory["makeTelemetry()"]
    end

    subgraph runtime["Runtime Components"]
        tracer["TracerProvider"]
        exporter["OTLP Exporter"]
        processor["BatchProcessor"]
    end

    subgraph collector["Collector Pipeline"]
        recv["Receivers"]
        proc["Processors"]
        exp["Exporters"]
    end

    cfgFile --> parse
    cmake -->|"compile flag"| parse
    parse --> factory
    factory --> tracer
    tracer --> processor
    processor --> exporter
    exporter -->|"OTLP"| recv
    recv --> proc
    proc --> exp

    style config fill:#e3f2fd,stroke:#1976d2
    style runtime fill:#e8f5e9,stroke:#388e3c
    style collector fill:#fff3e0,stroke:#ff9800

Reading the diagram:

Configuration Sources: xrpld.cfg provides runtime settings (endpoint, sampling) while the CMake flag controls whether telemetry is compiled in at all.
Initialization: setupTelemetry() parses config values, then makeTelemetry() constructs the provider, processor, and exporter objects.
Runtime Components: The TracerProvider creates spans, the BatchProcessor buffers them, and the OTLP Exporter serializes and sends them over the wire.
OTLP arrow to Collector: Trace data leaves the xrpld process via OTLP (gRPC or HTTP) and enters the external Collector pipeline.
Collector Pipeline: Receivers ingest OTLP data, Processors apply sampling/filtering/enrichment, and Exporters forward traces to storage backends (Tempo, etc.).

5.8 Grafana Integration

APM = Application Performance Monitoring

Step-by-step instructions for integrating xrpld traces with Grafana.

5.8.1 Data Source Configuration

Tempo (Recommended)

# grafana/provisioning/datasources/tempo.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki
        tags: ["service.name", "xrpl.tx.hash"]
        mappedTags: [{ key: "trace_id", value: "traceID" }]
        mapTagNamesEnabled: true
        filterByTraceID: true
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki

Elastic APM

# grafana/provisioning/datasources/elastic-apm.yaml
apiVersion: 1

datasources:
  - name: Elasticsearch-APM
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    database: "apm-*"
    jsonData:
      esVersion: "8.0.0"
      timeField: "@timestamp"
      logMessageField: message
      logLevelField: log.level

5.8.2 Dashboard Provisioning

# grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1

providers:
  - name: "xrpld-dashboards"
    orgId: 1
    folder: "xrpld"
    folderUid: "xrpld"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/rippled

5.8.3 Example Dashboard: RPC Performance

{
  "title": "xrpld RPC Performance",
  "uid": "xrpld-rpc-performance",
  "panels": [
    {
      "title": "RPC Latency by Command",
      "type": "heatmap",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && span.command != \"\"} | histogram_over_time(duration) by (span.command)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "title": "RPC Error Rate",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && status.code=error} | rate() by (span.command)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "title": "Top 10 Slowest RPC Commands",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && span.command != \"\"} | avg(duration) by (span.command) | topk(10)"
        }
      ],
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 }
    },
    {
      "title": "Recent Traces",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\"}"
        }
      ],
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 }
    }
  ]
}

5.8.4 Example Dashboard: Transaction Tracing

{
  "title": "xrpld Transaction Tracing",
  "uid": "xrpld-tx-tracing",
  "panels": [
    {
      "title": "Transaction Throughput",
      "type": "stat",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"tx.receive\"} | rate()"
        }
      ],
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 }
    },
    {
      "title": "Cross-Node Relay Count",
      "type": "timeseries",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"tx.relay\"} | avg(span.xrpl.tx.relay_count)"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 }
    },
    {
      "title": "Transaction Validation Errors",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{resource.service.name=\"xrpld\" && name=\"tx.validate\" && status.code=error}"
        }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 }
    }
  ]
}

5.8.5 TraceQL Query Examples

Common queries for xrpld traces:

# Find all traces for a specific transaction hash
{resource.service.name="xrpld" && span.xrpl.tx.hash="ABC123..."}

# Find slow RPC commands (>100ms)
{resource.service.name="xrpld" && name=~"rpc.command.*"} | duration > 100ms

# Find consensus rounds taking >5 seconds
{resource.service.name="xrpld" && name="consensus.round"} | duration > 5s

# Find failed transactions with error details
{resource.service.name="xrpld" && name="tx.validate" && status.code=error}

# Find transactions relayed to many peers
{resource.service.name="xrpld" && name="tx.relay"} | span.xrpl.tx.relay_count > 10

# Compare latency across nodes
{resource.service.name="xrpld" && name="rpc.command.account_info"} | avg(duration) by (resource.service.instance.id)

5.8.6 Correlation with PerfLog

To correlate OpenTelemetry traces with existing PerfLog data:

Step 1: Configure Loki to ingest PerfLog

# promtail-config.yaml
scrape_configs:
  - job_name: xrpld-perflog
    static_configs:
      - targets:
          - localhost
        labels:
          job: xrpld
          __path__: /var/log/rippled/perf*.log
    pipeline_stages:
      - json:
          expressions:
            trace_id: trace_id
            ledger_seq: ledger_seq
            tx_hash: tx_hash
      - labels:
          trace_id:
          ledger_seq:
          tx_hash:

Step 2: Add trace_id to PerfLog entries

Modify PerfLog to include trace_id when available:

// In PerfLog output, add trace_id from current span context
void logPerf(Json::Value& entry) {
    auto span = opentelemetry::trace::GetSpan(
        opentelemetry::context::RuntimeContext::GetCurrent());
    if (span && span->GetContext().IsValid()) {
        char traceIdHex[33];
        span->GetContext().trace_id().ToLowerBase16(traceIdHex);
        entry["trace_id"] = std::string(traceIdHex, 32);
    }
    // ... existing logging
}

Step 3: Configure Grafana trace-to-logs link

In Tempo data source configuration, set up the derived field:

jsonData:
  tracesToLogs:
    datasourceUid: loki
    tags: ["trace_id", "xrpl.tx.hash"]
    filterByTraceID: true
    filterBySpanID: false

5.8.7 Correlation with Insight/OTel System Metrics

To correlate traces with Beast Insight system metrics:

Step 1: Export Insight metrics to Prometheus

Beast Insight metrics are exported natively via OTLP to the OTel Collector, which exposes them on the Prometheus endpoint alongside spanmetrics. No separate StatsD exporter is needed when using server=otel.

# xrpld.cfg — native OTel metrics (recommended)
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=xrpld

Step 2: Add exemplars to metrics

OpenTelemetry SDK automatically adds exemplars (trace IDs) to metrics when using the Prometheus exporter. This links metrics spikes to specific traces.

Step 3: Configure Grafana metric-to-trace link

# In Prometheus data source
jsonData:
  exemplarTraceIdDestinations:
    - name: trace_id
      datasourceUid: tempo

Step 4: Dashboard panel with exemplars

{
  "title": "RPC Latency with Trace Links",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "histogram_quantile(0.99, rate(xrpld_rpc_duration_seconds_bucket[5m]))",
      "exemplar": true
    }
  ]
}

This allows clicking on metric data points to jump directly to the related trace.

Previous: Code Samples | Next: Implementation Phases | Back to: Overview

31 KiB Raw Blame History

Configuration Reference

5.1 xrpld Configuration

5.1.1 Configuration File Section

5.1.2 Configuration Options Summary

5.2 Configuration Parser

5.3 Application Integration

5.3.1 ApplicationImp Changes

5.3.2 ServiceRegistry Interface Addition

5.4 CMake Integration

5.4.1 Find OpenTelemetry Module

5.4.2 CMakeLists.txt Changes

5.5 OpenTelemetry Collector Configuration

5.5.1 Development Configuration

5.5.2 Production Configuration

5.6 Docker Compose Development Environment

5.7 Configuration Architecture

5.8 Grafana Integration

5.8.1 Data Source Configuration

Tempo (Recommended)

Elastic APM

5.8.2 Dashboard Provisioning

5.8.3 Example Dashboard: RPC Performance

5.8.4 Example Dashboard: Transaction Tracing

5.8.5 TraceQL Query Examples

5.8.6 Correlation with PerfLog

5.8.7 Correlation with Insight/OTel System Metrics

31 KiB

Raw Blame History