rippled/OpenTelemetryPlan/presentation.md
Pratik Mankawde f135842071 docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00


OpenTelemetry Distributed Tracing for rippled


Slide 1: Introduction

CNCF = Cloud Native Computing Foundation

What is OpenTelemetry?

OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs.

Why OpenTelemetry for rippled?

  • End-to-End Transaction Visibility: Track transactions from submission → consensus → ledger inclusion
  • Cross-Node Correlation: Follow requests across multiple independent nodes using a unique trace_id
  • Consensus Round Analysis: Understand timing and behavior across validators
  • Incident Debugging: Correlate events across distributed nodes during issues

```mermaid
flowchart LR
    A["Node A<br/>tx.receive<br/>trace_id: abc123"] --> B["Node B<br/>tx.relay<br/>trace_id: abc123"] --> C["Node C<br/>tx.validate<br/>trace_id: abc123"] --> D["Node D<br/>ledger.apply<br/>trace_id: abc123"]

    style A fill:#1565c0,stroke:#0d47a1,color:#fff
    style B fill:#2e7d32,stroke:#1b5e20,color:#fff
    style C fill:#2e7d32,stroke:#1b5e20,color:#fff
    style D fill:#e65100,stroke:#bf360c,color:#fff
```

Reading the diagram:

  • Node A (blue, leftmost): The originating node that first receives the transaction and assigns a new trace_id: abc123; this ID becomes the correlation key for the entire distributed trace.
  • Node B and Node C (green, middle): Relay and validation nodes — each creates its own span but carries the same trace_id, so their work is linked to the original submission without any central coordinator.
  • Node D (orange, rightmost): The final node that applies the transaction to the ledger; the trace now spans the full lifecycle from submission to ledger inclusion.
  • Left-to-right flow: The horizontal progression shows the real-world message path — a transaction hops from node to node, and the shared trace_id stitches all hops into a single queryable trace.

Trace ID: abc123 — All nodes share the same trace, enabling cross-node correlation.


Slide 2: OpenTelemetry vs Open Source Alternatives

CNCF = Cloud Native Computing Foundation

| Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus |
|---|---|---|---|---|---|---|
| Tracing | YES | YES | YES | YES | YES | NO |
| Metrics | YES | NO | NO | YES | YES | YES |
| Logs | YES | NO | NO | YES | NO | NO |
| C++ SDK | YES (Official) | YES (Deprecated) | YES (Unmaintained) | NO | NO | YES |
| Vendor Neutral | YES (Primary goal) | NO | NO | NO | NO | NO |
| Instrumentation | Manual + Auto | Manual | Manual | Auto-first | Auto-first | Manual |
| Backend | Any (exporters) | Self | Self | Self | Self | Self |
| CNCF Status | Incubating | Graduated | NO | Incubating | NO | Graduated |

Why OpenTelemetry? It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Tempo, Prometheus, Grafana, or any commercial backend without changing instrumentation.


Slide 3: Adoption Scope — Traces Only (Current Plan)

OpenTelemetry supports three signal types: Traces, Metrics, and Logs. rippled already captures metrics (StatsD via Beast Insight) and logs (Journal/PerfLog). The question is: how much of OTel do we adopt?

Scenario A: Add distributed tracing. Keep StatsD for metrics and Journal for logs.

```mermaid
flowchart LR
    subgraph rippled["rippled Process"]
        direction TB
        OTel["OTel SDK<br/>(Traces)"]
        Insight["Beast Insight<br/>(StatsD Metrics)"]
        Journal["Journal + PerfLog<br/>(Logging)"]
    end

    OTel -->|"OTLP"| Collector["OTel Collector"]
    Insight -->|"UDP"| StatsD["StatsD Server"]
    Journal -->|"File I/O"| LogFile["perf.log / debug.log"]

    Collector --> Tempo["Tempo / Jaeger"]
    StatsD --> Graphite["Graphite / Grafana"]
    LogFile --> Loki["Loki (optional)"]

    style rippled fill:#424242,stroke:#212121,color:#fff
    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
    style Insight fill:#1565c0,stroke:#0d47a1,color:#fff
    style Journal fill:#e65100,stroke:#bf360c,color:#fff
    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
```

| Aspect | Details |
|---|---|
| What changes for operators | Deploy OTel Collector + trace backend. Existing StatsD and log pipelines stay as-is. |
| Codebase impact | New Telemetry module (~1500 LOC). Beast Insight and Journal untouched. |
| New capabilities | Cross-node trace correlation, span-based debugging, request lifecycle visibility. |
| What we still can't do | Correlate metrics with specific traces natively. StatsD metrics remain fire-and-forget with no trace exemplars. |
| Maintenance burden | Three separate observability systems to maintain (OTel + StatsD + Journal). |
| Risk | Lowest — additive change, no existing systems disturbed. |

Slide 4: Future Adoption — Metrics & Logs via OTel

Scenario B: + OTel Metrics (Replace StatsD)

Migrate StatsD to OTel Metrics API, exposing Prometheus-compatible metrics. Remove Beast Insight.

```mermaid
flowchart LR
    subgraph rippled["rippled Process"]
        direction TB
        OTel["OTel SDK<br/>(Traces + Metrics)"]
        Journal["Journal + PerfLog<br/>(Logging)"]
    end

    OTel -->|"OTLP"| Collector["OTel Collector"]
    Journal -->|"File I/O"| LogFile["perf.log / debug.log"]

    Collector --> Tempo["Tempo<br/>(Traces)"]
    Collector --> Prom["Prometheus<br/>(Metrics)"]
    LogFile --> Loki["Loki (optional)"]

    style rippled fill:#424242,stroke:#212121,color:#fff
    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
    style Journal fill:#e65100,stroke:#bf360c,color:#fff
    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
```

  • Better metrics? Yes — Prometheus gives native histograms (p50/p95/p99), multi-dimensional labels, and exemplars linking metric spikes to traces.
  • Codebase: Remove Beast::Insight + StatsDCollector (~2000 LOC). Single SDK for traces and metrics.
  • Operator effort: Rewrite dashboards from StatsD/Graphite queries to PromQL. Run both in parallel during transition.
  • Risk: Medium — operators must migrate monitoring infrastructure.

Scenario C: + OTel Logs (Full Stack)

Also replace Journal logging with OTel Logs API. Single SDK for everything.

```mermaid
flowchart LR
    subgraph rippled["rippled Process"]
        OTel["OTel SDK<br/>(Traces + Metrics + Logs)"]
    end

    OTel -->|"OTLP"| Collector["OTel Collector"]

    Collector --> Tempo["Tempo<br/>(Traces)"]
    Collector --> Prom["Prometheus<br/>(Metrics)"]
    Collector --> Loki["Loki / Elastic<br/>(Logs)"]

    style rippled fill:#424242,stroke:#212121,color:#fff
    style OTel fill:#2e7d32,stroke:#1b5e20,color:#fff
    style Collector fill:#2e7d32,stroke:#1b5e20,color:#fff
```

  • Structured logging: OTel Logs API outputs structured records with trace_id, span_id, severity, and attributes by design.
  • Full correlation: Every log line carries trace_id. Click trace → see logs. Click metric spike → see trace → see logs.
  • Codebase: Remove Beast Insight (~2000 LOC) + simplify Journal/PerfLog (~3000 LOC). One dependency instead of three.
  • Risk: Highest — beast::Journal is deeply embedded in every component. Large refactor. OTel C++ Logs API is newer (stable since v1.11, less battle-tested).

Recommendation

```mermaid
flowchart LR
    A["Phase 1<br/><b>Traces Only</b><br/>(Current Plan)"] --> B["Phase 2<br/><b>+ Metrics</b><br/>(Replace StatsD)"] --> C["Phase 3<br/><b>+ Logs</b><br/>(Full OTel)"]

    style A fill:#2e7d32,stroke:#1b5e20,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,color:#fff
    style C fill:#e65100,stroke:#bf360c,color:#fff
```

| Phase | Signal | Strategy | Risk |
|---|---|---|---|
| Phase 1 (now) | Traces | Add OTel traces. Keep StatsD and Journal. Prove value. | Low |
| Phase 2 (future) | + Metrics | Migrate StatsD → Prometheus via OTel. Remove Beast Insight. | Medium |
| Phase 3 (future) | + Logs | Adopt OTel Logs API. Align with structured logging initiative. | High |

Key Takeaway: Start with traces (unique value, lowest risk), then incrementally adopt metrics and logs as the OTel infrastructure proves itself.


Slide 5: Comparison with rippled's Existing Solutions

Current Observability Stack

| Aspect | PerfLog (JSON) | StatsD (Metrics) | OpenTelemetry (NEW) |
|---|---|---|---|
| Type | Logging | Metrics | Distributed Tracing |
| Scope | Single node | Single node | Cross-node |
| Data | JSON log entries | Counters, gauges | Spans with context |
| Correlation | By timestamp | By metric name | By trace_id |
| Overhead | Low (file I/O) | Low (UDP) | Low-Medium (configurable) |
| Question Answered | "What happened here?" | "How many? How fast?" | "What was the journey?" |

Use Case Matrix

| Scenario | PerfLog | StatsD | OpenTelemetry |
|---|---|---|---|
| "How many TXs per second?" | ❌ | ✅ | ❌ |
| "Why was this specific TX slow?" | ⚠️ | ❌ | ✅ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
| "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ |

Key Insight: In the traces-only approach (Phase 1), OpenTelemetry complements existing systems. In future phases, OTel metrics and logs could replace StatsD and Journal respectively — see Slides 3-4 for the full adoption roadmap.


Slide 6: Architecture

OTLP = OpenTelemetry Protocol | WS = WebSocket

High-Level Integration Architecture

```mermaid
flowchart TB
    subgraph rippled["rippled Node"]
        subgraph services["Core Services"]
            direction LR
            RPC["RPC Server<br/>(HTTP/WS)"] ~~~ Overlay["Overlay<br/>(P2P Network)"] ~~~ Consensus["Consensus<br/>(RCLConsensus)"]
        end

        Telemetry["Telemetry Module<br/>(OpenTelemetry SDK)"]

        services --> Telemetry
    end

    Telemetry -->|OTLP/gRPC| Collector["OTel Collector"]

    Collector --> Tempo["Grafana Tempo"]
    Collector --> Elastic["Elastic APM"]

    style rippled fill:#424242,stroke:#212121,color:#fff
    style services fill:#1565c0,stroke:#0d47a1,color:#fff
    style Telemetry fill:#2e7d32,stroke:#1b5e20,color:#fff
    style Collector fill:#e65100,stroke:#bf360c,color:#fff
```

Reading the diagram:

  • Core Services (blue, top): RPC Server, Overlay, and Consensus are the three primary components that generate trace data — they represent the entry points for client requests, peer messages, and consensus rounds respectively.
  • Telemetry Module (green, middle): The OpenTelemetry SDK sits below the core services and receives span data from all three; it acts as a single collection point within the rippled process.
  • OTel Collector (orange, center): An external process that receives spans over OTLP/gRPC from the Telemetry Module; it decouples rippled from backend choices and handles batching, sampling, and routing.
  • Backends (bottom row): Tempo and Elastic APM are interchangeable — the Collector fans out to any combination, so operators can switch backends without modifying rippled code.
  • Top-to-bottom flow: Data flows from instrumented code down through the SDK, out over the network to the Collector, and finally into storage/visualization backends.

Context Propagation

```mermaid
sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B

    Client->>NodeA: Submit TX (no context)
    Note over NodeA: Creates trace_id: abc123<br/>span: tx.receive
    NodeA->>NodeB: Relay TX<br/>(traceparent: abc123)
    Note over NodeB: Links to trace_id: abc123<br/>span: tx.relay
```

  • HTTP/RPC: W3C Trace Context headers (traceparent)
  • P2P Messages: Protocol Buffer extension fields
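
As a concrete illustration of the W3C header format, here is a minimal sketch of assembling a `traceparent` value from raw IDs. The helper names are illustrative, not rippled code; only the field layout (`00-<32 hex trace_id>-<16 hex span_id>-<2 hex flags>`) follows the W3C Trace Context spec.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Lowercase-hex encode a byte buffer.
inline std::string toHex(std::uint8_t const* data, std::size_t len) {
    static char const digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(digits[data[i] >> 4]);
        out.push_back(digits[data[i] & 0x0f]);
    }
    return out;
}

// Build "version-trace_id-span_id-flags", e.g. "00-<32 hex>-<16 hex>-01".
inline std::string makeTraceparent(
    std::array<std::uint8_t, 16> const& traceId,
    std::array<std::uint8_t, 8> const& spanId,
    std::uint8_t flags) {
    return "00-" + toHex(traceId.data(), traceId.size()) + "-" +
        toHex(spanId.data(), spanId.size()) + "-" + toHex(&flags, 1);
}
```

A sampled trace sets the flags byte to `01`, which downstream nodes read to continue sampling the same trace.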

Slide 7: Implementation Plan

5-Phase Rollout (9 Weeks)

Note: Dates shown are relative to project start, not calendar dates.

```mermaid
gantt
    title Implementation Timeline
    dateFormat  YYYY-MM-DD
    axisFormat  Week %W

    section Phase 1
    Core Infrastructure    :p1, 2024-01-01, 2w

    section Phase 2
    RPC Tracing           :p2, after p1, 2w

    section Phase 3
    Transaction Tracing   :p3, after p2, 2w

    section Phase 4
    Consensus Tracing     :p4, after p3, 2w

    section Phase 5
    Documentation         :p5, after p4, 1w
```

Phase Details

| Phase | Focus | Key Deliverables | Effort |
|---|---|---|---|
| 1 | Core Infrastructure | SDK integration, Telemetry interface, Config | 10 days |
| 2 | RPC Tracing | HTTP context extraction, Handler spans | 10 days |
| 3 | Transaction Tracing | Protobuf context, P2P relay propagation | 10 days |
| 4 | Consensus Tracing | Round spans, Proposal/validation tracing | 10 days |
| 5 | Documentation | Runbook, Dashboards, Training | 7 days |

Total Effort: ~47 developer-days (2 developers)

Future Phases (not in current scope): After traces are stable, OTel metrics can replace StatsD (~3 weeks), and OTel logs can replace Journal (~4 weeks, aligned with structured logging initiative). See Slides 3-4 for the full adoption roadmap.


Slide 8: Performance Overhead

OTLP = OpenTelemetry Protocol

Estimated System Impact

| Metric | Overhead | Notes |
|---|---|---|
| CPU | 1-3% | Span creation and attribute setting |
| Memory | ~10 MB | SDK statics + batch buffer + worker thread stack |
| Network | 10-50 KB/s | Compressed OTLP export to collector |
| Latency (p99) | <2% | With proper sampling configuration |

How We Arrived at These Numbers

Assumptions (XRPL mainnet baseline):

| Parameter | Value | Source |
|---|---|---|
| Transaction throughput | ~25 TPS (peaks to ~50) | Mainnet average |
| Default peers per node | 21 | peerfinder/detail/Tuning.h (defaultMaxPeers) |
| Consensus round frequency | ~1 round / 3-4 seconds | ConsensusParms.h (ledgerMIN_CONSENSUS=1950ms) |
| Proposers per round | ~20-35 | Mainnet UNL size |
| P2P message rate | ~160 msgs/sec | See message breakdown below |
| Avg TX processing time | ~200 μs | Profiled baseline |
| Single span creation cost | 500-1000 ns | OTel C++ SDK benchmarks (see 3.5.4) |

P2P message breakdown (per node, mainnet):

| Message Type | Rate | Derivation |
|---|---|---|
| TMTransaction | ~100/sec | ~25 TPS × ~4 relay hops per TX, deduplicated by HashRouter |
| TMValidation | ~50/sec | ~35 validators × ~1 validation/3s round ≈ ~12/sec, plus relay fan-out |
| TMProposeSet | ~10/sec | ~35 proposers / 3s round ≈ ~12/round, clustered in establish phase |
| Total | ~160/sec | Only traced message types counted |

CPU (1-3%) — Calculation:

Per-transaction tracing cost breakdown:

| Operation | Cost | Notes |
|---|---|---|
| tx.receive span (create + end + 4 attributes) | ~1400 ns | ~1000ns create + ~200ns end + 4×50ns attrs |
| tx.validate span | ~1200 ns | ~1000ns create + ~200ns for 2 attributes |
| tx.relay span | ~1200 ns | ~1000ns create + ~200ns for 2 attributes |
| Context injection into P2P message | ~200 ns | Serialize trace_id + span_id into protobuf |
| Total per TX | ~4.0 μs | |

CPU overhead: 4.0 μs / 200 μs baseline = ~2.0% per transaction. Under high load with consensus + RPC spans overlapping, reaches ~3%. Consensus itself adds only ~36 μs per 3-second round (~0.001%), so the TX path dominates. On production server hardware (3+ GHz Xeon), span creation drops to ~500-600 ns, bringing per-TX cost to ~2.6 μs (~1.3%). See Section 3.5.4 for benchmark sources.
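The per-transaction arithmetic can be checked directly. The constants below restate the table's assumptions (they are estimates, not measurements):

```cpp
// Per-transaction tracing cost, restating the table's assumptions.
constexpr double txReceiveNs  = 1000 + 200 + 4 * 50;  // create + end + 4 attributes
constexpr double txValidateNs = 1000 + 200;           // create + 2 attributes
constexpr double txRelayNs    = 1000 + 200;           // create + 2 attributes
constexpr double ctxInjectNs  = 200;                  // protobuf context injection

constexpr double perTxNs    = txReceiveNs + txValidateNs + txRelayNs + ctxInjectNs;
constexpr double baselineNs = 200'000;                // ~200 us avg TX processing time
constexpr double overheadPct = perTxNs / baselineNs * 100.0;  // ~2.0%
```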

Memory (~10 MB) — Calculation:

| Component | Size | Notes |
|---|---|---|
| TracerProvider + Exporter (gRPC channel init) | ~320 KB | Allocated once at startup |
| BatchSpanProcessor (circular buffer) | ~16 KB | 2049 × 8-byte AtomicUniquePtr entries |
| BatchSpanProcessor (worker thread stack) | ~8 MB | Default Linux thread stack size |
| Active spans (in-flight, max ~1000) | ~500-800 KB | ~500-800 bytes/span × 1000 concurrent |
| Export queue (batch buffer, max 2048 spans) | ~1 MB | ~500 bytes/span × 2048 queue depth |
| Thread-local context storage (~100 threads) | ~6.4 KB | ~64 bytes/thread |
| Total | ~10 MB | Ceiling |

Memory plateaus once the export queue fills — the max_queue_size=2048 config bounds growth. The worker thread stack (~8 MB) dominates the static footprint but is virtual memory; actual RSS depends on stack usage (typically much less). Active spans are larger than originally estimated (~500-800 bytes) because the OTel SDK Span object includes a mutex (~40 bytes), SpanData recordable (~250 bytes base), and std::map-based attribute storage (~200-500 bytes for 3-5 string attributes). See Section 3.5.4 for source references.

Network (10-50 KB/s) — Calculation:

Two sources of network overhead:

(A) OTLP span export to Collector:

| Sampling Rate | Effective Spans/sec | Avg Span Size (compressed) | Bandwidth |
|---|---|---|---|
| 100% (dev only) | ~500 | ~500 bytes | ~250 KB/s |
| 10% (recommended prod) | ~50 | ~500 bytes | ~25 KB/s |
| 1% (minimal) | ~5 | ~500 bytes | ~2.5 KB/s |

The ~500 spans/sec at 100% sampling is a conservative ceiling: ~100 TX spans + ~160 P2P context spans + ~23 consensus spans per ~3s round (~8/sec) + ~50 RPC spans sum to roughly 320/sec at average load, with headroom to ~500/sec at peak (~50 TPS) traffic. OTLP protobuf with gzip compression yields ~500 bytes/span average.

(B) P2P trace context overhead (added to existing messages, always-on regardless of sampling):

| Message Type | Rate | Context Size | Bandwidth |
|---|---|---|---|
| TMTransaction | ~100/sec | 29 bytes | ~2.9 KB/s |
| TMValidation | ~50/sec | 29 bytes | ~1.5 KB/s |
| TMProposeSet | ~10/sec | 29 bytes | ~0.3 KB/s |
| Total P2P | | | ~4.7 KB/s |

Combined: 25 KB/s (OTLP export at 10%) + 5 KB/s (P2P context) ≈ ~30 KB/s typical. The 10-50 KB/s range covers 10-20% sampling under normal to peak mainnet load.
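The combined figure follows from the two tables above; the constants below restate those assumptions:

```cpp
// Combined network overhead at 10% sampling, per the derivation above.
constexpr double otlpBytesPerSec = 500 * 0.10 * 500;      // ~50 spans/s x ~500 B/span
constexpr double p2pBytesPerSec  = (100 + 50 + 10) * 29;  // traced msgs/s x 29 B context
constexpr double totalKBPerSec   = (otlpBytesPerSec + p2pBytesPerSec) / 1000.0;  // ~30 KB/s
```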

Latency (<2%) — Calculation:

| Path | Tracing Cost | Baseline | Overhead |
|---|---|---|---|
| Fast RPC (e.g., server_info) | 2.75 μs | ~1 ms | 0.275% |
| Slow RPC (e.g., path_find) | 2.75 μs | ~100 ms | 0.003% |
| Transaction processing | 4.0 μs | ~200 μs | 2.0% |
| Consensus round | 36 μs | ~3 sec | 0.001% |

At p99, even the worst case (TX processing at 2.0%) is within the 1-3% range. RPC and consensus overhead are negligible. On production hardware, TX overhead drops to ~1.3%.

Per-Message Overhead (Context Propagation)

Each P2P message carries trace context with the following overhead:

| Field | Size | Description |
|---|---|---|
| trace_id | 16 bytes | Unique identifier for the entire trace |
| span_id | 8 bytes | Current span (becomes parent on receiver) |
| trace_flags | 1 byte | Sampling decision flags |
| trace_state | 0-4 bytes | Optional vendor-specific data |
| Total | ~29 bytes | Added per traced P2P message |


```mermaid
flowchart LR
    subgraph msg["P2P Message with Trace Context"]
        A["Original Message<br/>(variable size)"] --> B["+ TraceContext<br/>(~29 bytes)"]
    end

    subgraph breakdown["Context Breakdown"]
        C["trace_id<br/>16 bytes"]
        D["span_id<br/>8 bytes"]
        E["flags<br/>1 byte"]
        F["state<br/>0-4 bytes"]
    end

    B --> breakdown

    style A fill:#424242,stroke:#212121,color:#fff
    style B fill:#2e7d32,stroke:#1b5e20,color:#fff
    style C fill:#1565c0,stroke:#0d47a1,color:#fff
    style D fill:#1565c0,stroke:#0d47a1,color:#fff
    style E fill:#e65100,stroke:#bf360c,color:#fff
    style F fill:#4a148c,stroke:#2e0d57,color:#fff
```

Reading the diagram:

  • Original Message (gray, left): The existing P2P message payload of variable size — this is unchanged; trace context is appended, never modifying the original data.
  • + TraceContext (green, right of message): The additional 29-byte context block attached to each traced message; the arrow from the original message shows it is a pure addition.
  • Context Breakdown (right subgraph): The four fields — trace_id (16 bytes), span_id (8 bytes), flags (1 byte), and state (0-4 bytes) — show exactly what is added and their individual sizes.
  • Color coding: Blue fields (trace_id, span_id) are the core identifiers required for trace correlation; orange (flags) controls sampling decisions; purple (state) is optional vendor data typically omitted.

Note: 29 bytes represents ~1-6% overhead depending on message size (500B simple TX to 5KB proposal), which is acceptable for the observability benefits provided.
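
One way to picture the context block is as a plain struct. The raw W3C fields total 25 bytes (16 + 8 + 1); protobuf tag/length framing and the optional trace_state account for the remaining ~4 bytes in the ~29-byte figure. The struct and field names are illustrative, not the actual protobuf schema:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Wire-level trace context carried on each traced P2P message (illustrative layout).
struct TraceContext {
    std::array<std::uint8_t, 16> trace_id;  // 128-bit trace identifier
    std::array<std::uint8_t, 8>  span_id;   // 64-bit parent span identifier
    std::uint8_t trace_flags;               // bit 0 = sampled
};

// Core W3C fields are 25 bytes; framing and optional trace_state add ~4 more.
constexpr std::size_t rawContextBytes = 16 + 8 + 1;
static_assert(rawContextBytes == 25, "trace context core fields are 25 bytes");
```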

Mitigation Strategies

```mermaid
flowchart LR
    A["Head Sampling<br/>10% default"] --> B["Tail Sampling<br/>Keep errors/slow"] --> C["Batch Export<br/>Reduce I/O"] --> D["Conditional Compile<br/>XRPL_ENABLE_TELEMETRY"]

    style A fill:#1565c0,stroke:#0d47a1,color:#fff
    style B fill:#2e7d32,stroke:#1b5e20,color:#fff
    style C fill:#e65100,stroke:#bf360c,color:#fff
    style D fill:#4a148c,stroke:#2e0d57,color:#fff
```

For a detailed explanation of head vs. tail sampling, see Slide 9.

Kill Switches (Rollback Options)

  1. Config Disable: Set enabled=0 in config → instant disable; sampling changes also apply without a restart
  2. Rebuild: Compile with XRPL_ENABLE_TELEMETRY=OFF → zero overhead (no-op)
  3. Full Revert: Clean separation allows easy commit reversion
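
Kill switch #2 rests on conditional compilation: when the build flag is off, instrumentation macros expand to nothing, so traced code paths carry literally zero overhead. A minimal sketch — the macro and `telemetry::startSpan` helper are hypothetical, not the actual rippled API:

```cpp
// When XRPL_ENABLE_TELEMETRY is not defined, the macro expands to a no-op
// statement, so instrumented functions compile exactly as if untouched.
#ifdef XRPL_ENABLE_TELEMETRY
#define XRPL_TRACE_SPAN(name) auto scopedSpan_ = telemetry::startSpan(name)
#else
#define XRPL_TRACE_SPAN(name) \
    do {                      \
    } while (0)
#endif

int applyTransaction(int workUnits) {
    XRPL_TRACE_SPAN("tx.apply");  // no-op in non-telemetry builds
    return workUnits * 2;         // placeholder for the real transaction work
}
```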

Slide 9: Sampling Strategies — Head vs. Tail

Sampling controls which traces are recorded and exported. Without sampling, every operation generates a trace — at 500+ spans/sec, this overwhelms storage and network. Sampling lets you keep the signal, discard the noise.

Head Sampling (Decision at Start)

The sampling decision is made when a trace begins, before any work is done. A random number is generated; if it falls within the configured ratio, the entire trace is recorded. Otherwise, the trace is silently dropped.

```mermaid
flowchart LR
    A["New Request<br/>Arrives"] --> B{"Random < 10%?"}
    B -->|"Yes (1 in 10)"| C["Record Entire Trace<br/>(all spans)"]
    B -->|"No (9 in 10)"| D["Drop Entire Trace<br/>(zero overhead)"]

    style C fill:#2e7d32,stroke:#1b5e20,color:#fff
    style D fill:#c62828,stroke:#8c2809,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,color:#fff
```

| Aspect | Details |
|---|---|
| Where it runs | Inside rippled (SDK-level). Configured via sampling_ratio in rippled.cfg. |
| When the decision happens | At trace creation time — before the first span is even populated. |
| How it works | sampling_ratio=0.1 means each trace has a 10% probability of being recorded. Dropped traces incur near-zero overhead (no spans created, no attributes set, no export). |
| Propagation | Once a trace is sampled, the trace_flags field (1 byte in the context header) tells downstream nodes to also sample it. Unsampled traces propagate trace_flags=0, so downstream nodes skip them too. |
| Pros | Lowest overhead. Simple to configure. Predictable resource usage. |
| Cons | Blind — it doesn't know if the trace will be interesting. A rare error or slow consensus round has only a 10% chance of being captured. |
| Best for | High-volume, steady-state traffic where most traces look similar (e.g., routine RPC requests). |

rippled configuration:

```ini
[telemetry]
# Record 10% of traces (recommended for production)
sampling_ratio=0.1
```
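
Head sampling can be made deterministic per trace by deriving the decision from the trace_id itself rather than a fresh random number — the idea behind OTel's TraceIdRatioBased sampler. A sketch of that approach (the SDK's exact algorithm may differ):

```cpp
#include <cstdint>

// Keep a trace iff the low 64 bits of its trace_id fall below ratio * 2^64.
// Every node computes the same answer from the same trace_id, so they agree
// on the sampling decision without exchanging any state.
inline bool shouldSample(std::uint64_t traceIdLow64, double ratio) {
    if (ratio <= 0.0)
        return false;
    if (ratio >= 1.0)
        return true;
    auto const threshold = static_cast<std::uint64_t>(
        ratio * static_cast<double>(UINT64_MAX));
    return traceIdLow64 < threshold;
}
```

In practice the sampled/unsampled outcome is then carried downstream in trace_flags, so relay nodes need not recompute it.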

Tail Sampling (Decision at End)

The sampling decision is made after the trace completes, based on its actual content — was it slow? Did it error? Was it a consensus round? This requires buffering complete traces before deciding.

```mermaid
flowchart TB
    A["All Traces<br/>Buffered (100%)"] --> B["OTel Collector<br/>Evaluates Rules"]

    B --> C{"Error?"}
    C -->|Yes| K["KEEP"]

    C -->|No| D{"Slow?<br/>(>5s consensus,<br/>>1s RPC)"}
    D -->|Yes| K

    D -->|No| E{"Random < 10%?"}
    E -->|Yes| K
    E -->|No| F["DROP"]

    style K fill:#2e7d32,stroke:#1b5e20,color:#fff
    style F fill:#c62828,stroke:#8c2809,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,color:#fff
    style C fill:#e65100,stroke:#bf360c,color:#fff
    style D fill:#e65100,stroke:#bf360c,color:#fff
    style E fill:#4a148c,stroke:#2e0d57,color:#fff
```

| Aspect | Details |
|---|---|
| Where it runs | In the OTel Collector (external process), not inside rippled. rippled exports 100% of traces; the Collector decides what to keep. |
| When the decision happens | After the Collector has received all spans for a trace (waits decision_wait=10s for stragglers). |
| How it works | Policy rules evaluate the completed trace: keep all errors, keep slow operations above a threshold, keep all consensus rounds, then probabilistically sample the rest at 10%. |
| Pros | Never misses important traces. Errors, slow requests, and consensus anomalies are always captured regardless of probability. |
| Cons | Higher resource usage — rippled must export 100% of spans to the Collector, which buffers them in memory before deciding. The Collector needs more RAM (configured via num_traces and decision_wait). |
| Best for | Production troubleshooting where you can't afford to miss errors or anomalies. |

Collector configuration (tail sampling rules for rippled):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s # Wait for all spans in a trace
    num_traces: 100000 # Buffer up to 100K concurrent traces
    policies:
      - name: errors # Always keep error traces
        type: status_code
        status_code: { status_codes: [ERROR] }

      - name: slow-consensus # Keep consensus rounds >5s
        type: latency
        latency: { threshold_ms: 5000 }

      - name: slow-rpc # Keep slow RPC requests >1s
        type: latency
        latency: { threshold_ms: 1000 }

      - name: probabilistic # Sample 10% of everything else
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

Head vs. Tail — Side-by-Side

| | Head Sampling | Tail Sampling |
|---|---|---|
| Decision point | Trace start (inside rippled) | Trace end (in OTel Collector) |
| Knows trace content? | No (random coin flip) | Yes (evaluates completed trace) |
| Overhead on rippled | Lowest (dropped traces = no-op) | Higher (must export 100% to Collector) |
| Collector resource usage | Low (receives only sampled traces) | Higher (buffers all traces before deciding) |
| Captures all errors? | No (only if trace was randomly selected) | Yes (error policy catches them) |
| Captures slow operations? | No (random) | Yes (latency policy catches them) |
| Configuration | rippled.cfg: sampling_ratio=0.1 | otel-collector.yaml: tail_sampling processor |
| Best for | High-throughput steady-state | Troubleshooting & anomaly detection |

Use both in a layered approach:

```mermaid
flowchart LR
    subgraph rippled["rippled (Head Sampling)"]
        HS["sampling_ratio=1.0<br/>(export everything)"]
    end

    subgraph collector["OTel Collector (Tail Sampling)"]
        TS["Keep: errors + slow + 10% random<br/>Drop: routine traces"]
    end

    subgraph storage["Backend Storage"]
        ST["Only interesting traces<br/>stored long-term"]
    end

    rippled -->|"100% of spans"| collector -->|"~15-20% kept"| storage

    style rippled fill:#424242,stroke:#212121,color:#fff
    style collector fill:#1565c0,stroke:#0d47a1,color:#fff
    style storage fill:#2e7d32,stroke:#1b5e20,color:#fff
```

Why this works: rippled exports everything (no blind drops), the Collector applies intelligent filtering (keep errors/slow/anomalies, sample the rest), and only ~15-20% of traces reach storage. If Collector resource usage becomes a concern, add head sampling at sampling_ratio=0.5 to halve the export volume while still giving the Collector enough data for good tail-sampling decisions.


Slide 10: Data Collection & Privacy

What Data is Collected

| Category | Attributes Collected | Purpose |
|---|---|---|
| Transaction | tx.hash, tx.type, tx.result, tx.fee, ledger_index | Trace transaction lifecycle |
| Consensus | round, phase, mode, proposers (count of proposing validators), duration_ms | Analyze consensus timing |
| RPC | command, version, status, duration_ms | Monitor RPC performance |
| Peer | peer.id (public key), latency_ms, message.type, message.size | Network topology analysis |
| Ledger | ledger.hash, ledger.index, close_time, tx_count | Ledger progression tracking |
| Job | job.type, queue_ms, worker | JobQueue performance |

What is NOT Collected (Privacy Guarantees)

```mermaid
flowchart LR
    subgraph notCollected["❌ NOT Collected"]
        direction LR
        A["Private Keys"] ~~~ B["Account Balances"] ~~~ C["Transaction Amounts"]
    end

    subgraph alsoNot["❌ Also Excluded"]
        direction LR
        D["IP Addresses<br/>(configurable)"] ~~~ E["Personal Data"] ~~~ F["Raw TX Payloads"]
    end

    style A fill:#c62828,stroke:#8c2809,color:#fff
    style B fill:#c62828,stroke:#8c2809,color:#fff
    style C fill:#c62828,stroke:#8c2809,color:#fff
    style D fill:#c62828,stroke:#8c2809,color:#fff
    style E fill:#c62828,stroke:#8c2809,color:#fff
    style F fill:#c62828,stroke:#8c2809,color:#fff
```

Reading the diagram:

  • NOT Collected (top row, red): Private Keys, Account Balances, and Transaction Amounts are explicitly excluded — these are financial/security-sensitive fields that telemetry never touches.
  • Also Excluded (bottom row, red): IP Addresses (configurable per deployment), Personal Data, and Raw TX Payloads are also excluded — these protect operator and user privacy.
  • All-red styling: Every box is styled in red to visually reinforce that these are hard exclusions, not optional — the telemetry system has no code path to collect any of these fields.
  • Two-row layout: The split between "NOT Collected" and "Also Excluded" distinguishes between financial data (top) and operational/personal data (bottom), making the privacy boundaries clear to auditors.

Privacy Protection Mechanisms

| Mechanism | Description |
|---|---|
| Account Hashing | xrpl.tx.account is hashed at collector level before storage |
| Configurable Redaction | Sensitive fields can be excluded via config |
| Sampling | Only 10% of traces recorded by default (reduces exposure) |
| Local Control | Node operators control what gets exported |
| No Raw Payloads | Transaction content is never recorded, only metadata |

Key Principle: Telemetry collects operational metadata (timing, counts, hashes) — never sensitive content (keys, balances, amounts).
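
The account-hashing mechanism can be illustrated in a few lines. A real deployment would perform this in an OTel Collector processor with a keyed hash; std::hash and the helper name here are purely stand-ins:

```cpp
#include <functional>
#include <string>

// Replace the raw account identifier with a one-way hash before storage,
// keeping traces correlatable per account without exposing the account itself.
inline std::string redactAccount(std::string const& account) {
    auto const h = std::hash<std::string>{}(account);
    return "acct-" + std::to_string(h);
}
```

Because the hash is deterministic, the same account maps to the same redacted token, so per-account trace queries still work after redaction.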


End of Presentation