mirror of https://github.com/XRPLF/rippled.git synced 2026-06-02 16:26:48 +00:00

Files

Pratik Mankawde 98fc939851 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>

2026-06-01 15:01:19 +01:00

23 KiB

Raw Blame History

OpenTelemetry Observability for xrpld

Status: Phases 1-8 shipped. Traces, metrics, logs all live via OTel.

Slide 1: Introduction

CNCF = Cloud Native Computing Foundation | OTel = OpenTelemetry

What is OpenTelemetry?

CNCF-backed, vendor-neutral framework for traces, metrics, and logs with a single SDK and wire protocol (OTLP).

Why OTel for xrpld?

End-to-end TX visibility — submission → consensus → ledger inclusion
Cross-node correlation — shared trace_id stitches hops without a central coordinator
Consensus round analysis — phase timing across validators
Incident debugging — correlated traces, metrics, logs for one query

flowchart LR
    A["Node A<br/>tx.receive<br/>trace_id: abc123"] --> B["Node B<br/>tx.relay<br/>trace_id: abc123"] --> C["Node C<br/>tx.validate<br/>trace_id: abc123"] --> D["Node D<br/>ledger.apply<br/>trace_id: abc123"]

    style A fill:#1565c0,stroke:#0d47a1,color:#fff
    style B fill:#2e7d32,stroke:#1b5e20,color:#fff
    style C fill:#2e7d32,stroke:#1b5e20,color:#fff
    style D fill:#e65100,stroke:#bf360c,color:#fff

One trace, four nodes, full lifecycle.

Slide 2: Old Stack vs New OTel Stack

Side-by-Side

Aspect	Before (StatsD + Debug Logs)	After (OTel: Traces + Metrics + Logs)
Metrics	Beast Insight → StatsD UDP → Graphite	`MetricsRegistry` → OTLP/HTTP → Prometheus
Metric inventory	~250 metric series at runtime (28 registrations × overlay traffic categories)	23 native instruments × dimensions + RED via spanmetrics
Logs	`beast::Journal` → `debug.log` (grep / tail)	Journal → filelog tail → Loki (structured, queryable)
Traces	None	Telemetry SDK → OTLP → Tempo (cross-node)
Correlation	Timestamp + grep across files	Shared `trace_id` across all 3 signals
Format	Counter/gauge names; free-form log lines	OTLP protobuf; structured records
Backend choice	Locked to StatsD daemon + log files	Vendor-neutral via Collector exporters
Cross-node view	❌ Not possible	✅ Native via trace context propagation
Histogram p50/p95/p99	❌ Counters/gauges only	✅ Native histograms + spanmetrics

Legacy StatsD Metric Series (~250 total)

Category	Series	Notes
Overlay traffic gauges	~224	56 `TrafficCount::category` enum × 4 gauges (`Bytes_{In,Out}`, `Messages_{In,Out}`)
Peer Finder	2	`Active_{In,Out}bound_Peers`
State Accounting	10	`{Disconnected,Connected,Syncing,Tracking,Full}_{duration,transitions}`
Ledger	4	`Validated/Published_Ledger_Age`, `mismatch`, `ledger_fetches`
RPC / Pathfinding	5	`requests`, `size`, `time`, `pathfind_{fast,full}`
JobQueue / IO / Disconn	3	`job_count`, `ios_latency`, `Peer_Disconnects`
Total	~248	28 `make_*` call sites; series count balloons via overlay-category fan-out

Use Case Matrix

Scenario	StatsD	Debug Logs	OTel Traces	OTel Metrics	OTel Logs
"TXs per second?"	✅	❌	❌	✅	❌
"Why was this specific TX slow?"	❌	⚠️	✅	❌	⚠️
"Which node delayed consensus?"	❌	❌	✅	❌	❌
"TX journey across 5 nodes"	❌	❌	✅	❌	❌
"Validator error at 14:02"	❌	✅	⚠️	❌	✅
"Reproduce rare assertion / crash"	❌	✅	❌	❌	✅
"p99 RPC latency by method"	⚠️	❌	⚠️	✅	❌

Old stack: 2 signals, no correlation, single node. New stack: 3 signals, trace_id everywhere, cross-node native.

Slide 3: OTel vs Open-Source Alternatives

Feature	OpenTelemetry	Jaeger	Zipkin	SkyWalking	Pinpoint	Prometheus
Tracing	✅	✅	✅	✅	✅	❌
Metrics	✅	❌	❌	✅	✅	✅
Logs	✅	❌	❌	✅	❌	❌
C++ SDK	✅ Official	⚠️ Deprecated	⚠️ Unmaintained	❌	❌	✅
Vendor neutral	✅ Primary goal	❌	❌	❌	❌	❌
Instrumentation	Manual + Auto	Manual	Manual	Auto-first	Auto-first	Manual
Backend	Any (exporters)	Self	Self	Self	Self	Self
CNCF Status	Incubating	Graduated	—	Incubating	—	Graduated

Only actively maintained, full-signal C++ option. Backend-agnostic — Tempo/Prometheus/Loki/Elastic/commercial all work without code change.

Slide 4: Architecture (Current)

OTLP = OpenTelemetry Protocol over HTTP/gRPC

flowchart TB
    subgraph xrpld["xrpld Node"]
        direction TB
        Surfaces["RPC · TX · Consensus · Peer · Ledger · Job"]
        SDK["Telemetry SDK + MetricsRegistry"]
        Journal["beast::Journal → debug.log<br/>(trace_id/span_id injected)"]
        Surfaces --> SDK
        Surfaces --> Journal
    end

    SDK -->|"OTLP/HTTP :4318<br/>traces + metrics"| Collector["OTel Collector"]
    Journal -->|"filelog tail"| Collector

    Collector --> Tempo["Tempo<br/>(traces)"]
    Collector --> Prom["Prometheus<br/>(metrics)"]
    Collector --> Loki["Loki<br/>(logs)"]

    Tempo --> Grafana["Grafana<br/>(15 dashboards)"]
    Prom --> Grafana
    Loki --> Grafana

    style xrpld fill:#424242,stroke:#212121,color:#fff
    style SDK fill:#2e7d32,stroke:#1b5e20,color:#fff
    style Journal fill:#1565c0,stroke:#0d47a1,color:#fff
    style Collector fill:#e65100,stroke:#bf360c,color:#fff
    style Grafana fill:#4a148c,stroke:#2e0d57,color:#fff

Component	Role
Telemetry SDK	Span creation, trace context, OTLP traces export
MetricsRegistry	RPC/job/peer/consensus counters, gauges, histograms
beast::Journal filelog	`debug.log` tailed by Collector, parsed → Loki
OTel Collector	Receive OTLP + filelog; route to Tempo/Prom/Loki
Spanmetrics connector	Derives RED metrics from spans (Prometheus)

Slide 5: Signal Coverage

Surface	Traces (Spans)	Metrics (OTLP)	Logs (Journal Partition)
RPC	`rpc.request` + handler spans	request count, latency p50/p95/p99, error rate	`RPC*`
Transactions	`tx.receive`, `tx.validate`, `tx.relay`, `tx.apply`	TX/sec by result, fee escalation gauges	`TxQ`, `LedgerMaster`
Consensus	`consensus.round`, `proposal.send/recv`, `validation.send/recv`	round duration, phase histograms, mode gauge	`Consensus`, `LedgerConsensus`
Peer / Overlay	`peer.send`, `peer.receive` per message type	peer count, bytes/sec by msg type, suppression	`Overlay`, `PeerImp`
Ledger	`ledger.close`, `ledger.apply`	close time, TX count, ledger index gauge	`LedgerMaster`
Job Queue	(sampled per type)	queue depth, queue/run duration histograms	`JobQueue`

~30 distinct span kinds, ~80 metric series, structured logs from 50+ partitions.

Slide 6: Context Propagation

sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B

    Client->>NodeA: Submit TX (no context)
    Note over NodeA: Create trace_id: abc123<br/>span: tx.receive
    NodeA->>NodeB: Relay TX (TraceContext field, ~29B)
    Note over NodeB: Link trace_id: abc123<br/>span: tx.relay (parent: A)

Carrier	Mechanism
HTTP / WebSocket RPC	W3C `traceparent` header
P2P protobuf	`TraceContext` extension field per message
Internal job dispatch	Thread-local context + `SpanGuard`

Field	Size	Description
`trace_id`	16 bytes	Trace correlation key
`span_id`	8 bytes	Parent span on receiver
`trace_flags`	1 byte	Sampling decision
`trace_state`	0-4 bytes	Optional vendor data
Total	~29 B	Per traced P2P message (~1-6% of msg)

Slide 7: Performance Overhead

Metric	Overhead	Driver
CPU	1-3%	~4 μs/TX span work (~2% at 25 TPS baseline)
Memory	~10 MB	SDK statics + worker stack + 2048-span export queue
Network	10-50 KB/s	OTLP export + 29 B P2P context per traced msg
Latency (p99)	<2%	TX path dominates; RPC and consensus negligible

Kill Switches

enabled=0 in xrpld.cfg → instant disable, no restart
Build with XRPL_ENABLE_TELEMETRY=OFF → zero overhead (no-op stubs)
Reduce sampling_ratio → linear export reduction

Derivations and per-component cost tables: see 03-implementation-strategy.md §3.5.4.

Slide 8: Sampling — Head vs Tail

	Head Sampling	Tail Sampling
Where	Inside xrpld (SDK)	OTel Collector (external)
Decision time	Trace start (random coin flip)	Trace end (after all spans buffered)
Knows trace content?	No	Yes — error, latency, span kind
xrpld overhead	Lowest (drop = no-op)	Higher (export 100%)
Captures all errors?	No	Yes (status_code policy)
Captures slow ops?	No	Yes (latency policy)
Config	`xrpld.cfg`: `sampling_ratio=0.1`	`tail_sampling` processor in collector
Best for	Steady-state high volume	Anomaly + error retention

Recommended Layered Strategy

flowchart LR
    xrpld["xrpld<br/>sampling_ratio=1.0<br/>(export all)"] -->|"100%"| col["Collector<br/>tail_sampling:<br/>errors + slow + 10% random"]
    col -->|"~15-20% kept"| tempo["Tempo storage"]

    style xrpld fill:#424242,stroke:#212121,color:#fff
    style col fill:#1565c0,stroke:#0d47a1,color:#fff
    style tempo fill:#2e7d32,stroke:#1b5e20,color:#fff

If Collector resource pressure: drop sampling_ratio to 0.5 — still enough trace volume for tail decisions.

Slide 9: Data Collection & Privacy

Collected (operational metadata)

Category	Attributes
Transaction	`tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index`
Consensus	`round`, `phase`, `mode`, `proposers`, `duration_ms`
RPC	`command`, `version`, `status`, `duration_ms`
Peer	`peer.id` (public key), `latency_ms`, `message.type`, `message.size`
Ledger	`ledger.hash`, `ledger.index`, `close_time`, `tx_count`
Job	`job.type`, `queue_ms`, `worker`

NOT Collected (hard exclusions)

❌ Private keys · ❌ Account balances · ❌ Transaction amounts · ❌ Raw payloads · ❌ Personal data · ⚙️ IP addresses (configurable)

Privacy Mechanisms

Mechanism	Description
Account hashing	`xrpl.tx.account` hashed at Collector before storage
Configurable redaction	Sensitive attributes excluded via Collector config
Sampling	10% default reduces exposure
Local control	Operator owns Collector → backend pipeline
No raw payloads	Span attributes are metadata only, never message contents

Principle: telemetry records operational metadata — never financial or personal content.

Slide 10: Implementation Timeline

gantt
    title OpenTelemetry Rollout
    dateFormat  YYYY-MM-DD
    axisFormat  Week %W

    section Done
    Phase 1 Core Infra        :done, p1, 2024-01-01, 2w
    Phase 2 RPC Tracing       :done, p2, after p1, 2w
    Phase 3 TX Tracing        :done, p3, after p2, 2w
    Phase 4 Consensus         :done, p4, after p3, 2w
    Phase 5 Docs/Deploy       :done, p5, after p4, 1w
    Phase 6 StatsD Bridge     :done, p6, after p5, 1w
    Phase 7 Native OTel Metrics :done, p7, after p6, 2w
    Phase 8 Log-Trace Correlation :done, p8, after p7, 1w
    Phase 9 Metric Gap Fill   :active, p9, after p8, 2w

    section Future
    Phase 10 Workload Validation :p10, after p9, 2w
    Phase 11 3rd-Party Pipelines :p11, after p10, 3w

Phase	Focus	Status
1	SDK integration, Telemetry, Config	✅ Done
2	RPC handler spans, HTTP context	✅ Done
3	TX spans, P2P protobuf context	✅ Done
4	Consensus rounds, proposal/validation	✅ Done
5	Runbook, dashboards, deployment	✅ Done
6	StatsD bridge (interim)	✅ Done
7	Native OTel metrics (replace Beast Insight)	✅ Done
8	Log-trace correlation (Loki)	✅ Done
9	Internal metric gap fill	✅ Done

Slide 11: Current State — What Shipped

By Signal

Signal	Backend	Status	Notes
Traces	Tempo	✅	All 6 surfaces instrumented; cross-node propagation live
Metrics	Prometheus	✅	Native OTLP; Beast Insight retired
Logs	Loki	✅	filelog tailing `debug.log`; `trace_id` injected

By Surface

Surface	Spans Live	Metrics Live	Notes
RPC	✅	✅	Handler + pathfinding + TxQ
Transactions	✅	✅	Receive, validate, relay, apply
Consensus	✅	✅	Round + proposal/validation send+receive (Phase 4a)
Peer / Overlay	✅	✅	Per-msg-type send/receive
Ledger	✅	✅	Close + apply
Job Queue	✅	✅	Queue depth + duration histograms

Stack Live

Component	Version
OTel Collector (contrib)	0.121.0
Grafana Tempo	2.7.2
Grafana Loki	3.4.2
Prometheus	latest
Grafana	11.5.2
Dashboards provisioned	15

Slide 12: Future Phases

Phase 10 — Synthetic Workload Validation

Aspect	Detail
Goal	Drive instrumented surfaces under reproducible load
Why	Validate dashboards, catch regressions, measure overhead at scale
Deliverable	Workload generator + assertion suite (RPC/TX/peer churn scenarios)
Effort	~2 weeks

Phase 11 — Admin-RPC Receiver (`xrpl_*` metrics)

Aspect	Detail
Goal	Custom Go OTel Collector receiver polls xrpld admin RPC, emits `xrpl_*` Prometheus metrics
Why	Admin-RPC-only data has no native export — every consumer reinvents JSON-RPC polling
Scope	`validators` (UNL, listed keys), `feature` (amendments), `peers` (per-peer detail), `amm_info`, `book_offers`, `fee` (detail tiers)
Excluded	`server_info` / `get_counts` basics — Phase 9 (#6513) already ships `xrpld_server_info` + 14 gauges/histograms natively from in-process state
Deliverable	Go receiver plugin + custom Collector binary + 4 Grafana dashboards (UNL, amendments, AMM, DEX) + Prometheus alerts
Effort	~3 weeks

flowchart LR
    rpc["xrpld admin RPC<br/>(validators, feature, peers,<br/>amm_info, book_offers, fee)"] -->|JSON-RPC poll| recv["Custom Go receiver<br/>(in Collector)"]
    recv -->|xrpl_* metrics| prom["Prometheus"]
    prom --> graf["Grafana dashboards"]

    style rpc fill:#2e7d32,stroke:#1b5e20,color:#fff
    style recv fill:#1565c0,stroke:#0d47a1,color:#fff
    style prom fill:#e65100,stroke:#bf360c,color:#fff
    style graf fill:#6a1b9a,stroke:#4a148c,color:#fff

Phase 11 fills the gap above Phase 9 — data only reachable via admin RPC, not via in-process metric callbacks.

Slide 11: External Dashboard Parity (Phase 7+)

Bridging Community Monitoring into Native OTel

The community xrpl-validator-dashboard provides 86 metrics for validator operators. We integrated the 29 missing metrics natively into the OTel pipeline.

New Metric Categories

graph LR
    subgraph "New Observable Gauges"
        VH["Validator Health<br/>amendment_blocked, UNL expiry,<br/>quorum"]
        PQ["Peer Quality<br/>P90 latency, insane peers,<br/>version awareness"]
        LE["Ledger Economy<br/>fees, reserves, tx rate,<br/>ledger age"]
        ST["State Tracking<br/>state value 0-6,<br/>time in state"]
        VA["Validation Agreement<br/>1h/24h agreement %,<br/>agreements, misses"]
    end

    subgraph "Counters"
        C1["ledgers_closed_total"]
        C2["validations_sent_total"]
        C3["state_changes_total"]
    end

    style VH fill:#1565c0,color:#fff
    style PQ fill:#2e7d32,color:#fff
    style LE fill:#e65100,color:#fff
    style ST fill:#6a1b9a,color:#fff
    style VA fill:#c62828,color:#fff
    style C1 fill:#37474f,color:#fff
    style C2 fill:#37474f,color:#fff
    style C3 fill:#37474f,color:#fff

ValidationTracker — Agreement Computation

sequenceDiagram
    participant C as RCLConsensus
    participant VT as ValidationTracker
    participant MR as MetricsRegistry
    participant P as Prometheus

    C->>VT: recordOurValidation(hash, seq)
    Note over VT: Stores pending event
    C->>VT: recordNetworkValidation(hash, seq)
    Note over VT: Marks network validated
    MR->>VT: reconcile() [every 10s]
    Note over VT: After 8s grace period:<br/>both validated → agreed<br/>only one → missed<br/>5min late repair window
    MR->>P: Export agreement_pct_1h/24h

New Grafana Dashboards

Dashboard	Key Panels
Validator Health	Agreement %, amendment blocked, quorum, state value
Peer Quality	P90 latency, version awareness, upgrade recommended
Ledger Economy	Base fee, reserves, ledger age, transaction rate

End of Presentation

23 KiB Raw Blame History Unescape Escape