rippled/docker/telemetry/TESTING.md
2026-03-31 16:39:40 +01:00

OpenTelemetry Integration Testing Guide

This document describes how to verify the rippled OpenTelemetry pipeline end-to-end, from span generation through the observability stack (otel-collector, Jaeger, Prometheus, Grafana).


Prerequisites

Build xrpld with telemetry

conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default --target xrpld

The binary is at .build/xrpld.

Required tools

  • Docker with docker compose (v2)
  • curl
  • jq (JSON processor)

Verify binary

.build/xrpld --version

Test 1: Single-Node Standalone (Quick Verification)

This test verifies RPC and transaction spans in standalone mode. Consensus spans will not fire because standalone mode does not run consensus.

Step 1: Start the observability stack

docker compose -f docker/telemetry/docker-compose.yml up -d

Wait for services to be ready:

# otel-collector health
curl -sf http://localhost:13133/ && echo "collector ready"

# Jaeger UI
curl -sf http://localhost:16686/ > /dev/null && echo "jaeger ready"
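
The containers can take a few seconds to start, so a single curl may fail even though the stack is healthy. A minimal sketch of a polling helper follows. The function name wait_for_url and the 60-second default timeout are illustrative assumptions, not part of the repository's scripts.

```shell
# wait_for_url URL [TIMEOUT_SECONDS]
# Polls URL about once per second until curl succeeds or the timeout elapses.
# Hypothetical helper; the name and the default timeout are assumptions.
wait_for_url() {
  url=$1
  timeout=${2:-60}
  elapsed=0
  until curl -sf -o /dev/null "$url"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "ready: $url"
}

# Usage, with the ports from the commands above:
#   wait_for_url http://localhost:13133/ && wait_for_url http://localhost:16686/
```

A non-zero return value makes it easy to abort a wrapper script before xrpld is started against a stack that never came up.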

Step 2: Start xrpld in standalone mode

.build/xrpld --conf docker/telemetry/xrpld-telemetry.cfg -a --start

Wait a few seconds for the node to initialize.

Step 3: Exercise RPC spans

# server_info
curl -s http://localhost:5005 \
  -d '{"method":"server_info"}' | jq .result.info.server_state

# server_state
curl -s http://localhost:5005 \
  -d '{"method":"server_state"}' | jq .result.state.server_state

# ledger
curl -s http://localhost:5005 \
  -d '{"method":"ledger","params":[{"ledger_index":"current"}]}' \
  | jq .result.ledger_current_index

Step 4: Submit a transaction

Close the ledger first (required in standalone mode):

curl -s http://localhost:5005 -d '{"method":"ledger_accept"}'

Submit a Payment from the genesis account:

curl -s http://localhost:5005 -d '{
  "method": "submit",
  "params": [{
    "secret": "snoPBrXtMeMyMHUVTgbuqAfg1SUTb",
    "tx_json": {
      "TransactionType": "Payment",
      "Account": "rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh",
      "Destination": "rPMh7Pi9ct699iZUTWzJaUMR1o42VEfGqF",
      "Amount": "10000000"
    }
  }]
}' | jq .result.engine_result

Expected result: "tesSUCCESS".

Close the ledger again to finalize:

curl -s http://localhost:5005 -d '{"method":"ledger_accept"}'

Step 5: Verify traces in Jaeger

Wait 5 seconds for the batch export, then:

JAEGER="http://localhost:16686"

# Check rippled service is registered
curl -s "$JAEGER/api/services" | jq '.data'

# Check RPC spans
curl -s "$JAEGER/api/traces?service=rippled&operation=rpc.request&limit=5&lookback=1h" \
  | jq '.data | length'

curl -s "$JAEGER/api/traces?service=rippled&operation=rpc.process&limit=5&lookback=1h" \
  | jq '.data | length'

curl -s "$JAEGER/api/traces?service=rippled&operation=rpc.command.server_info&limit=5&lookback=1h" \
  | jq '.data | length'

# Check transaction spans
curl -s "$JAEGER/api/traces?service=rippled&operation=tx.process&limit=5&lookback=1h" \
  | jq '.data | length'

Or open the Jaeger UI: http://localhost:16686
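
The individual queries above print a bare count, which is easy to misread when many operations are checked. The following sketch wraps the same Jaeger query in a pass/fail check. The helper names trace_count and expect_traces are hypothetical; the URL and query parameters are the ones used above.

```shell
JAEGER=${JAEGER:-http://localhost:16686}

# trace_count OPERATION
# Prints how many traces Jaeger returns for the operation (same query as above).
trace_count() {
  curl -s "$JAEGER/api/traces?service=rippled&operation=$1&limit=5&lookback=1h" \
    | jq '.data | length'
}

# expect_traces OPERATION
# Prints PASS or FAIL and returns non-zero when no traces were found.
# A non-numeric count (for example "null") is treated as a failure.
expect_traces() {
  count=$(trace_count "$1")
  if [ "${count:-0}" -gt 0 ] 2>/dev/null; then
    echo "PASS $1 ($count traces)"
  else
    echo "FAIL $1 (no traces)"
    return 1
  fi
}

# Usage:
#   for op in rpc.request rpc.process rpc.command.server_info tx.process; do
#     expect_traces "$op"
#   done
```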

Step 6: Teardown

# Stop xrpld (press Ctrl+C in its terminal, or run:)
kill $(pgrep -f 'xrpld.*xrpld-telemetry')

# Stop observability stack
docker compose -f docker/telemetry/docker-compose.yml down

# Clean xrpld data
rm -rf data/

Expected spans (standalone mode)

| Span Name | Expected | Notes |
| --- | --- | --- |
| rpc.request | Yes | Every HTTP RPC call |
| rpc.process | Yes | Every processed RPC call |
| rpc.command.server_info | Yes | server_info RPC |
| rpc.command.server_state | Yes | server_state RPC |
| rpc.command.ledger | Yes | ledger RPC |
| rpc.command.submit | Yes | submit RPC |
| rpc.command.ledger_accept | Yes | ledger_accept RPC |
| tx.process | Yes | Transaction submission |
| tx.receive | No | No peers in standalone mode |
| consensus.* | No | Consensus is disabled in standalone mode |

Test 2: 6-Node Consensus Network (Full Verification)

This test verifies ALL span categories including consensus and peer transaction relay, using a 6-node validator network.

Automated

Run the integration test script:

bash docker/telemetry/integration-test.sh

The script will:

  1. Start the observability stack
  2. Generate 6 validator key pairs
  3. Create config files for each node
  4. Start all 6 nodes
  5. Wait for consensus ("proposing" state)
  6. Exercise RPC, submit transactions
  7. Verify all span categories in Jaeger
  8. Verify spanmetrics in Prometheus
  9. Print results and leave the stack running

Manual

If you prefer to run the steps manually:

Step 1: Start observability stack

docker compose -f docker/telemetry/docker-compose.yml up -d

Step 2: Generate validator keys

Start a temporary standalone xrpld:

.build/xrpld --conf docker/telemetry/xrpld-telemetry.cfg -a --start &
TEMP_PID=$!
sleep 5

Generate 6 key pairs:

for i in $(seq 1 6); do
  curl -s http://localhost:5005 \
    -d '{"method":"validation_create"}' | jq '.result'
done

Record the validation_seed and validation_public_key for each. Kill the temporary node:

kill $TEMP_PID
rm -rf data/

Step 3: Create node configs

For each node (1-6), create a config file. Template:

[server]
port_rpc
port_peer

[port_rpc]
port = {5004 + node_number}
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http

[port_peer]
port = {51234 + node_number}
ip = 0.0.0.0
protocol = peer

[node_db]
type=NuDB
path=/tmp/xrpld-integration/node{N}/nudb
online_delete=256

[database_path]
/tmp/xrpld-integration/node{N}/db

[debug_logfile]
/tmp/xrpld-integration/node{N}/debug.log

[validation_seed]
{seed from step 2}

[validators_file]
/tmp/xrpld-integration/validators.txt

[ips_fixed]
127.0.0.1 51235
127.0.0.1 51236
127.0.0.1 51237
127.0.0.1 51238
127.0.0.1 51239
127.0.0.1 51240

[peer_private]
1

[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
exporter=otlp_http
sampling_ratio=1.0
batch_size=512
batch_delay_ms=2000
max_queue_size=2048
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=0
trace_ledger=1

[rpc_startup]
{ "command": "log_level", "severity": "warning" }

[ssl_verify]
0
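
Filling in the template by hand for six nodes is error-prone, so the substitution can be scripted. The sketch below writes one config per node using the port arithmetic and paths from the template. The function name write_node_cfg and its argument order are assumptions; the output path matches the one used when starting the nodes in Step 5.

```shell
# write_node_cfg NODE_NUMBER SEED [BASE_DIR]
# Writes BASE_DIR/nodeN/xrpld.cfg from the template above.
# Hypothetical helper; SEED is a validation_seed recorded in Step 2.
write_node_cfg() {
  n=$1
  seed=$2
  base=${3:-/tmp/xrpld-integration}
  dir="$base/node$n"
  mkdir -p "$dir"
  cat > "$dir/xrpld.cfg" <<EOF
[server]
port_rpc
port_peer

[port_rpc]
port = $((5004 + n))
ip = 127.0.0.1
admin = 127.0.0.1
protocol = http

[port_peer]
port = $((51234 + n))
ip = 0.0.0.0
protocol = peer

[node_db]
type=NuDB
path=$dir/nudb
online_delete=256

[database_path]
$dir/db

[debug_logfile]
$dir/debug.log

[validation_seed]
$seed

[validators_file]
$base/validators.txt

[ips_fixed]
127.0.0.1 51235
127.0.0.1 51236
127.0.0.1 51237
127.0.0.1 51238
127.0.0.1 51239
127.0.0.1 51240

[peer_private]
1

[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
exporter=otlp_http
sampling_ratio=1.0
batch_size=512
batch_delay_ms=2000
max_queue_size=2048
trace_rpc=1
trace_transactions=1
trace_consensus=1
trace_peer=0
trace_ledger=1

[rpc_startup]
{ "command": "log_level", "severity": "warning" }

[ssl_verify]
0
EOF
}

# Usage: write_node_cfg 1 "<seed for node 1>"
```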

Step 4: Create validators.txt

[validators]
{public_key_1}
{public_key_2}
{public_key_3}
{public_key_4}
{public_key_5}
{public_key_6}
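
The validators file can be generated from the public keys recorded in Step 2. This is a small sketch; write_validators is a hypothetical helper name, and the keys are passed as arguments.

```shell
# write_validators OUTPUT_FILE KEY...
# Writes a [validators] section with one public key per line.
# Hypothetical helper; refuses to write a file without any keys.
write_validators() {
  out=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "no validator public keys given" >&2
    return 1
  fi
  {
    echo "[validators]"
    printf '%s\n' "$@"
  } > "$out"
}

# Usage:
#   write_validators /tmp/xrpld-integration/validators.txt "$KEY1" "$KEY2" ...
```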

Step 5: Start all 6 nodes

for i in $(seq 1 6); do
  .build/xrpld --conf /tmp/xrpld-integration/node$i/xrpld.cfg --start &
  echo $! > /tmp/xrpld-integration/node$i/xrpld.pid
done
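
Because the loop above records each node's PID in a file, the nodes can later be stopped from those files. The following is a sketch of such a teardown; stop_nodes is a hypothetical helper and is not part of integration-test.sh.

```shell
# stop_nodes [BASE_DIR]
# Sends SIGTERM to every PID recorded in BASE_DIR/node*/xrpld.pid
# and removes the PID files. Hypothetical helper.
stop_nodes() {
  base=${1:-/tmp/xrpld-integration}
  for pidfile in "$base"/node*/xrpld.pid; do
    # If the glob matched nothing, the literal pattern is not a file.
    [ -f "$pidfile" ] || continue
    pid=$(cat "$pidfile")
    if kill "$pid" 2>/dev/null; then
      echo "stopped node with pid $pid"
    else
      echo "pid $pid not running"
    fi
    rm -f "$pidfile"
  done
}

# Usage: stop_nodes /tmp/xrpld-integration
```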

Step 6: Wait for consensus

Poll each node until server_state = "proposing":

for port in 5005 5006 5007 5008 5009 5010; do
  while true; do
    state=$(curl -s http://localhost:$port \
      -d '{"method":"server_info"}' \
      | jq -r '.result.info.server_state')
    echo "Port $port: $state"
    [ "$state" = "proposing" ] && break
    sleep 5
  done
done

Step 7: Exercise RPC and submit transaction

# RPC calls
curl -s http://localhost:5005 -d '{"method":"server_info"}'
curl -s http://localhost:5005 -d '{"method":"server_state"}'
curl -s http://localhost:5005 -d '{"method":"ledger","params":[{"ledger_index":"current"}]}'

# Submit transaction
curl -s http://localhost:5005 -d '{
  "method": "submit",
  "params": [{
    "secret": "snoPBrXtMeMyMHUVTgbuqAfg1SUTb",
    "tx_json": {
      "TransactionType": "Payment",
      "Account": "rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh",
      "Destination": "rPMh7Pi9ct699iZUTWzJaUMR1o42VEfGqF",
      "Amount": "10000000"
    }
  }]
}'

Wait 15 seconds for consensus and batch export.

Step 8: Verify in Jaeger

See the "Verification Queries" section below.


Expected Span Catalog

All 17 production span names instrumented across Phases 2-5:

| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --- | --- | --- | --- | --- |
| rpc.request | ServerHandler.cpp:271 | 2 | -- | Any HTTP RPC call |
| rpc.process | ServerHandler.cpp:573 | 2 | -- | Any HTTP RPC call |
| rpc.ws_message | ServerHandler.cpp:384 | 2 | -- | WebSocket RPC message |
| rpc.command.<name> | RPCHandler.cpp:161 | 2 | xrpl.rpc.command, xrpl.rpc.version, xrpl.rpc.role | Any RPC command |
| tx.process | NetworkOPs.cpp:1227 | 3 | xrpl.tx.hash, xrpl.tx.local, xrpl.tx.path | Submit transaction |
| tx.receive | PeerImp.cpp:1273 | 3 | xrpl.peer.id | Peer relays transaction |
| consensus.proposal.send | RCLConsensus.cpp:177 | 4 | xrpl.consensus.round | Consensus proposing phase |
| consensus.ledger_close | RCLConsensus.cpp:282 | 4 | xrpl.consensus.ledger.seq, xrpl.consensus.mode | Ledger close event |
| consensus.accept | RCLConsensus.cpp:395 | 4 | xrpl.consensus.proposers, xrpl.consensus.round_time_ms | Ledger accepted |
| consensus.validation.send | RCLConsensus.cpp:753 | 4 | xrpl.consensus.ledger.seq, xrpl.consensus.proposing | Validation sent |
| consensus.accept.apply | RCLConsensus.cpp:453 | 4 | xrpl.consensus.close_time, close_time_correct, close_resolution_ms, state | Ledger apply + close time |
| tx.apply | BuildLedger.cpp:88 | 5 | xrpl.ledger.tx_count, xrpl.ledger.tx_failed | Ledger close (tx set) |
| ledger.build | BuildLedger.cpp:31 | 5 | xrpl.ledger.seq, xrpl.ledger.close_time, close_time_correct, close_resolution_ms | Ledger build |
| ledger.validate | LedgerMaster.cpp:915 | 5 | xrpl.ledger.seq, xrpl.ledger.validations | Ledger validated |
| ledger.store | LedgerMaster.cpp:409 | 5 | xrpl.ledger.seq | Ledger stored |
| peer.proposal.receive | PeerImp.cpp:1667 | 5 | xrpl.peer.id, xrpl.peer.proposal.trusted | Peer sends proposal |
| peer.validation.receive | PeerImp.cpp:2264 | 5 | xrpl.peer.id, xrpl.peer.validation.trusted | Peer sends validation |

Verification Queries

Jaeger API

Base URL: http://localhost:16686

JAEGER="http://localhost:16686"

# List all services
curl -s "$JAEGER/api/services" | jq '.data'

# List operations for rippled
curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'

# Query traces by operation
for op in "rpc.request" "rpc.process" \
          "rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
          "tx.process" "tx.receive" "tx.apply" \
          "consensus.proposal.send" "consensus.ledger_close" \
          "consensus.accept" "consensus.accept.apply" \
          "consensus.validation.send" \
          "ledger.build" "ledger.validate" "ledger.store" \
          "peer.proposal.receive" "peer.validation.receive"; do
  count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
    | jq '.data | length')
  printf "%-35s %s traces\n" "$op" "$count"
done

Prometheus API

Base URL: http://localhost:9090

PROM="http://localhost:9090"

# Span call counts (from spanmetrics connector)
curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total" \
  | jq '.data.result[] | {span: .metric.span_name, count: .value[1]}'

# Latency histogram sample counts
curl -s "$PROM/api/v1/query?query=traces_span_metrics_duration_milliseconds_count" \
  | jq '.data.result[] | {span: .metric.span_name, count: .value[1]}'

# RPC calls by command
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=traces_span_metrics_calls_total{span_name=~"rpc.command.*"}' \
  | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
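
PromQL selectors contain braces, quotes, and operators such as =~, which curl may interpret as URL globbing or send unencoded when they are pasted directly into a URL. A minimal sketch of a query helper that lets curl do the encoding follows. The name prom_query is hypothetical; -G and --data-urlencode are standard curl options.

```shell
PROM=${PROM:-http://localhost:9090}

# prom_query PROMQL
# Sends an instant query to Prometheus; curl URL-encodes the expression
# and, because of -G, appends it to the URL as a GET query string.
prom_query() {
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'traces_span_metrics_calls_total{span_name=~"rpc.command.*"}' \
#     | jq '.data.result | length'
```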

System Metrics (beast::insight via OTel native)

rippled's built-in beast::insight framework exports metrics natively via OTLP/HTTP to the OTel Collector on port 4318 (the same port as traces, but on the /v1/metrics path instead of /v1/traces). These appear in Prometheus alongside spanmetrics.

Requires [insight] config in xrpld.cfg:

[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled

Verify system metrics in Prometheus:

# Ledger age gauge
curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result'

# Peer counts
curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result'

# RPC request counter
curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result'

# State accounting
curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result'

# Overlay traffic
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'

Key system metrics (prefix rippled_):

| Metric | Type | Source |
| --- | --- | --- |
| LedgerMaster_Validated_Ledger_Age | gauge | LedgerMaster.h:373 |
| LedgerMaster_Published_Ledger_Age | gauge | LedgerMaster.h:374 |
| State_Accounting_{Mode}_duration | gauge | NetworkOPs.cpp:774 |
| State_Accounting_{Mode}_transitions | gauge | NetworkOPs.cpp:780 |
| Peer_Finder_Active_Inbound_Peers | gauge | PeerfinderManager.cpp:214 |
| Peer_Finder_Active_Outbound_Peers | gauge | PeerfinderManager.cpp:215 |
| Overlay_Peer_Disconnects | gauge | OverlayImpl.h:557 |
| job_count | gauge | JobQueue.cpp:26 |
| rpc_requests | counter | ServerHandler.cpp:108 |
| rpc_time | histogram | ServerHandler.cpp:110 |
| rpc_size | histogram | ServerHandler.cpp:109 |
| ios_latency | histogram | Application.cpp:438 |
| pathfind_fast | histogram | PathRequests.h:23 |
| pathfind_full | histogram | PathRequests.h:24 |
| ledger_fetches | counter | InboundLedgers.cpp:44 |
| ledger_history_mismatch | counter | LedgerHistory.cpp:16 |
| warn | counter | Logic.h:33 |
| drop | counter | Logic.h:34 |
| {category}_Bytes_In/Out | gauge | OverlayImpl.h:535 (57 traffic categories) |
| {category}_Messages_In/Out | gauge | OverlayImpl.h:535 (57 traffic categories) |

Grafana

Open http://localhost:3000 (anonymous admin access enabled).

Pre-configured dashboards (span-derived):

  • RPC Performance: Request rates, latency percentiles by command, top commands, WebSocket rate
  • Transaction Overview: Transaction processing rates, apply duration, peer relay, failed tx rate
  • Consensus Health: Consensus round duration, proposer counts, mode tracking, accept heatmap
  • Ledger Operations: Build/validate/store rates and durations, TX apply metrics
  • Peer Network: Proposal/validation receive rates, trusted vs untrusted breakdown (requires trace_peer=1)

Pre-configured dashboards (system metrics):

  • Node Health (System Metrics): Validated/published ledger age, operating mode, I/O latency, job queue
  • Network Traffic (System Metrics): Peer counts, disconnects, overlay traffic by category
  • RPC & Pathfinding (System Metrics): RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

  • Jaeger: Trace data at http://jaeger:16686
  • Tempo: Trace data at http://tempo:3200 (via Grafana Explore)
  • Prometheus: Metrics at http://prometheus:9090
  • Loki: Log data at http://loki:3100 (via Grafana Explore)

Test 3: Log-Trace Correlation (Phase 8)

Phase 8 injects trace_id and span_id into rippled's log output when a log line is emitted within an active OTel span. This test verifies the end-to-end log-trace correlation pipeline.

Step 1: Verify trace_id in log output

After running Test 1 or Test 2 (which generate RPC spans), check the rippled debug.log for trace context:

grep 'trace_id=[a-f0-9]\{32\} span_id=[a-f0-9]\{16\}' /path/to/debug.log

Expected: log lines with trace_id=<32hex> span_id=<16hex> between the severity code and the message. Example:

2024-01-15T10:30:45.123Z RPCHandler:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Calling server_info

Lines emitted outside of an active span (background tasks, startup) will NOT have trace context — this is expected.
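
To work with the trace context programmatically, the two fields can be pulled out of each log line. The sketch below assumes the log format shown in the example above; extract_trace_context is a hypothetical helper name.

```shell
# extract_trace_context
# Reads log lines on stdin and prints "<trace_id> <span_id>" for each line
# that carries trace context. Lines without trace context are skipped.
extract_trace_context() {
  sed -n 's/.*trace_id=\([a-f0-9]\{32\}\) span_id=\([a-f0-9]\{16\}\).*/\1 \2/p'
}

# Usage: list the distinct traces that produced log output
#   extract_trace_context < /path/to/debug.log | cut -d' ' -f1 | sort -u
```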

Step 2: Cross-check trace_id in Jaeger

Extract a trace_id from the log and verify it exists in Jaeger:

TRACE_ID=$(grep -o 'trace_id=[a-f0-9]\{32\}' /path/to/debug.log | head -1 | cut -d= -f2)
echo "Checking trace: $TRACE_ID"
curl -s "http://localhost:16686/api/traces/$TRACE_ID" | jq '.data | length'

Expected result: 1 (the trace exists in Jaeger).

Step 3: Verify Loki log ingestion

The OTel Collector's filelog receiver tails rippled's debug.log and exports parsed entries to Loki. Verify Loki has received entries:

# Query Loki for any rippled logs
curl -sG "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={job="rippled"}' \
  --data-urlencode 'limit=5' | jq '.data.result | length'

Expected: > 0 results.

Step 4: Verify Grafana Tempo-to-Loki correlation

  1. Open Grafana at http://localhost:3000
  2. Navigate to Explore -> select Tempo datasource
  3. Search for a trace (e.g., operation rpc.command.server_info)
  4. Click "Logs for this trace" in the trace detail view
  5. Verify that Loki log lines appear, filtered by the trace's trace_id

Step 5: Verify Grafana Loki-to-Tempo correlation

  1. In Grafana Explore, select Loki datasource
  2. Query: {job="rippled"} |= "trace_id="
  3. In the log results, click the TraceID derived field link
  4. Verify it navigates to the full trace in Tempo

Expected results

| Check | Expected |
| --- | --- |
| trace_id= in debug.log | Present in log lines within active spans |
| span_id= in debug.log | Present alongside trace_id |
| Logs without active span | No trace_id/span_id fields |
| trace_id in Jaeger | Matches a valid trace |
| Loki log ingestion | Logs visible via LogQL |
| Tempo -> Loki "Logs for trace" | Shows correlated log lines |
| Loki -> Tempo TraceID link | Navigates to correct trace |

Troubleshooting

No traces in Jaeger

  1. Check otel-collector logs:
    docker compose -f docker/telemetry/docker-compose.yml logs otel-collector
    
  2. Verify xrpld telemetry config has enabled=1 and correct endpoint
  3. Check that otel-collector port 4318 is accessible (any HTTP response, including an error status, means the port is reachable):
    curl -s -o /dev/null http://localhost:4318 && echo "reachable"
    
  4. Decrease batch_delay_ms or batch_size in the xrpld config so that spans are exported sooner

Nodes not reaching "proposing" state

  1. Check that all peer ports (51235-51240) are not in use:
    for p in 51235 51236 51237 51238 51239 51240; do
      ss -tlnp | grep ":$p " && echo "port $p in use"
    done
    
  2. Verify [ips_fixed] lists all 6 peer ports
  3. Verify validators.txt has all 6 public keys
  4. Check node debug logs: tail -50 /tmp/xrpld-integration/node1/debug.log
  5. Ensure [peer_private] is set to 1 (prevents reaching out to public network)

Transaction not processing

  1. Verify genesis account exists:
    curl -s http://localhost:5005 \
      -d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}' \
      | jq .result.account_data.Balance
    
  2. Check submit response for error codes
  3. In standalone mode, remember to call ledger_accept after submitting

No trace_id in log output (Phase 8)

  1. Verify rippled was built with telemetry=ON (-Dtelemetry=ON in CMake)
  2. Verify enabled=1 in the [telemetry] config section
  3. Log lines only contain trace context when emitted inside an active span. Background logs (startup, periodic tasks outside spans) will not have trace_id/span_id.
  4. Ensure the trace category is enabled (e.g., trace_rpc=1 for RPC logs)

No logs in Loki (Phase 8)

  1. Verify the log file mount in docker-compose.yml:
    volumes:
      - /tmp/xrpld-integration:/var/log/rippled:ro
    
  2. Check OTel Collector logs for filelog receiver errors:
    docker compose -f docker/telemetry/docker-compose.yml logs otel-collector | grep -i "filelog\|loki\|error"
    
  3. Verify Loki is running:
    curl -s http://localhost:3100/ready
    
  4. Verify the filelog receiver glob pattern matches your log files. The default pattern is /var/log/rippled/*/debug.log

Grafana trace-log correlation links not working (Phase 8)

  1. Verify tracesToLogs is configured in the Tempo datasource provisioning (docker/telemetry/grafana/provisioning/datasources/tempo.yaml)
  2. Verify derivedFields is configured in the Loki datasource provisioning (docker/telemetry/grafana/provisioning/datasources/loki.yaml)
  3. Restart Grafana after changing provisioning files:
    docker compose -f docker/telemetry/docker-compose.yml restart grafana
    

Spanmetrics not appearing in Prometheus

  1. Verify otel-collector config has spanmetrics connector
  2. Check that the metrics pipeline is configured:
    service:
      pipelines:
        metrics:
          receivers: [otlp, spanmetrics]
          exporters: [prometheus]
    
  3. Verify Prometheus can reach collector:
    curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'