From 9ee9e566d4bcee7ff2b84d64f01caeee0260f834 Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Fri, 17 Apr 2026 13:11:59 +0100 Subject: [PATCH] removed presentation.md from root Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> --- presentation.md | 280 ------------------------------------------------ 1 file changed, 280 deletions(-) delete mode 100644 presentation.md diff --git a/presentation.md b/presentation.md deleted file mode 100644 index 7a443a635c..0000000000 --- a/presentation.md +++ /dev/null @@ -1,280 +0,0 @@ -# OpenTelemetry Distributed Tracing for rippled - ---- - -## Slide 1: Introduction - -### What is OpenTelemetry? - -OpenTelemetry is an open-source, CNCF-backed observability framework for distributed tracing, metrics, and logs. - -### Why OpenTelemetry for rippled? - -- **End-to-End Transaction Visibility**: Track transactions from submission → consensus → ledger inclusion -- **Cross-Node Correlation**: Follow requests across multiple independent nodes using a unique `trace_id` -- **Consensus Round Analysis**: Understand timing and behavior across validators -- **Incident Debugging**: Correlate events across distributed nodes during issues - -```mermaid -flowchart LR - A["Node A
tx.receive
trace_id: abc123"] --> B["Node B
tx.relay
trace_id: abc123"] --> C["Node C
tx.validate
trace_id: abc123"] --> D["Node D
ledger.apply
trace_id: abc123"] - - style A fill:#1565c0,stroke:#0d47a1,color:#fff - style B fill:#2e7d32,stroke:#1b5e20,color:#fff - style C fill:#2e7d32,stroke:#1b5e20,color:#fff - style D fill:#e65100,stroke:#bf360c,color:#fff -``` - -> **Trace ID: abc123** — All nodes share the same trace, enabling cross-node correlation. - ---- - -## Slide 2: OpenTelemetry vs Open Source Alternatives - -| Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus | -| ------------------- | ---------------- | ---------------- | ------------------ | ---------- | ---------- | ---------- | -| **Tracing** | YES | YES | YES | YES | YES | NO | -| **Metrics** | YES | NO | NO | YES | YES | YES | -| **Logs** | YES | NO | NO | YES | NO | NO | -| **C++ SDK** | YES Official | YES (Deprecated) | YES (Unmaintained) | NO | NO | YES | -| **Vendor Neutral** | YES Primary goal | NO | NO | NO | NO | NO | -| **Instrumentation** | Manual + Auto | Manual | Manual | Auto-first | Auto-first | Manual | -| **Backend** | Any (exporters) | Self | Self | Self | Self | Self | -| **CNCF Status** | Incubating | Graduated | NO | Incubating | NO | Graduated | - -> **Why OpenTelemetry?** It's the only actively maintained, full-featured C++ option with vendor neutrality — allowing export to Jaeger, Prometheus, Grafana, or any commercial backend without changing instrumentation. - ---- - -## Slide 3: Comparison with rippled's Existing Solutions - -### Current Observability Stack - -| Aspect | PerfLog (JSON) | StatsD (Metrics) | OpenTelemetry (NEW) | -| --------------------- | --------------------- | --------------------- | --------------------------- | -| **Type** | Logging | Metrics | Distributed Tracing | -| **Scope** | Single node | Single node | **Cross-node** | -| **Data** | JSON log entries | Counters, gauges | Spans with context | -| **Correlation** | By timestamp | By metric name | By `trace_id` | -| **Overhead** | Low (file I/O) | Low (UDP) | Low-Medium (configurable) | -| **Question Answered** | "What happened here?" | "How many? How fast?" | **"What was the journey?"** | - -### Use Case Matrix - -| Scenario | PerfLog | StatsD | OpenTelemetry | -| -------------------------------- | ------- | ------ | ------------- | -| "How many TXs per second?" | ❌ | ✅ | ❌ | -| "Why was this specific TX slow?" | ⚠️ | ❌ | ✅ | -| "Which node delayed consensus?" | ❌ | ❌ | ✅ | -| "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ | - -> **Key Insight**: OpenTelemetry **complements** (not replaces) existing systems. - ---- - -## Slide 4: Architecture - -### High-Level Integration Architecture - -```mermaid -flowchart TB - subgraph rippled["rippled Node"] - subgraph services["Core Services"] - direction LR - RPC["RPC Server
(HTTP/WS)"] ~~~ Overlay["Overlay
(P2P Network)"] ~~~ Consensus["Consensus
(RCLConsensus)"] - end - - Telemetry["Telemetry Module
(OpenTelemetry SDK)"] - - services --> Telemetry - end - - Telemetry -->|OTLP/gRPC| Collector["OTel Collector"] - - Collector --> Tempo["Grafana Tempo"] - Collector --> Jaeger["Jaeger"] - Collector --> Elastic["Elastic APM"] - - style rippled fill:#424242,stroke:#212121,color:#fff - style services fill:#1565c0,stroke:#0d47a1,color:#fff - style Telemetry fill:#2e7d32,stroke:#1b5e20,color:#fff - style Collector fill:#e65100,stroke:#bf360c,color:#fff -``` - -### Context Propagation - -```mermaid -sequenceDiagram - participant Client - participant NodeA as Node A - participant NodeB as Node B - - Client->>NodeA: Submit TX (no context) - Note over NodeA: Creates trace_id: abc123
span: tx.receive - NodeA->>NodeB: Relay TX
(traceparent: abc123) - Note over NodeB: Links to trace_id: abc123
span: tx.relay -``` - -- **HTTP/RPC**: W3C Trace Context headers (`traceparent`) -- **P2P Messages**: Protocol Buffer extension fields - ---- - -## Slide 5: Implementation Plan - -### 5-Phase Rollout (9 Weeks) - -```mermaid -gantt - title Implementation Timeline - dateFormat YYYY-MM-DD - axisFormat Week %W - - section Phase 1 - Core Infrastructure :p1, 2024-01-01, 2w - - section Phase 2 - RPC Tracing :p2, after p1, 2w - - section Phase 3 - Transaction Tracing :p3, after p2, 2w - - section Phase 4 - Consensus Tracing :p4, after p3, 2w - - section Phase 5 - Documentation :p5, after p4, 1w -``` - -### Phase Details - -| Phase | Focus | Key Deliverables | Effort | -| ----- | ------------------- | -------------------------------------------- | ------- | -| 1 | Core Infrastructure | SDK integration, Telemetry interface, Config | 10 days | -| 2 | RPC Tracing | HTTP context extraction, Handler spans | 10 days | -| 3 | Transaction Tracing | Protobuf context, P2P relay propagation | 10 days | -| 4 | Consensus Tracing | Round spans, Proposal/validation tracing | 10 days | -| 5 | Documentation | Runbook, Dashboards, Training | 7 days | - -**Total Effort**: ~47 developer-days (2 developers) - ---- - -## Slide 6: Performance Overhead - -### Estimated System Impact - -| Metric | Overhead | Notes | -| ----------------- | ---------- | ----------------------------------- | -| **CPU** | 1-3% | Span creation and attribute setting | -| **Memory** | 2-5 MB | Batch buffer for pending spans | -| **Network** | 10-50 KB/s | Compressed OTLP export to collector | -| **Latency (p99)** | <2% | With proper sampling configuration | - -### Per-Message Overhead (Context Propagation) - -Each P2P message carries trace context with the following overhead: - -| Field | Size | Description | -| ------------- | ------------- | ----------------------------------------- | -| `trace_id` | 16 bytes | Unique identifier for the entire trace | -| `span_id` | 8 bytes | Current span (becomes parent on receiver) | -| `trace_flags` | 4 bytes | Sampling decision flags | -| `trace_state` | 0-4 bytes | Optional vendor-specific data | -| **Total** | **~32 bytes** | **Added per traced P2P message** | - -```mermaid -flowchart LR - subgraph msg["P2P Message with Trace Context"] - A["Original Message
(variable size)"] --> B["+ TraceContext
(~32 bytes)"] - end - - subgraph breakdown["Context Breakdown"] - C["trace_id
16 bytes"] - D["span_id
8 bytes"] - E["flags
4 bytes"] - F["state
0-4 bytes"] - end - - B --> breakdown - - style A fill:#424242,stroke:#212121,color:#fff - style B fill:#2e7d32,stroke:#1b5e20,color:#fff - style C fill:#1565c0,stroke:#0d47a1,color:#fff - style D fill:#1565c0,stroke:#0d47a1,color:#fff - style E fill:#e65100,stroke:#bf360c,color:#fff - style F fill:#4a148c,stroke:#2e0d57,color:#fff -``` - -> **Note**: 32 bytes is negligible compared to typical transaction messages (hundreds to thousands of bytes) - -### Mitigation Strategies - -```mermaid -flowchart LR - A["Head Sampling
10% default"] --> B["Tail Sampling
Keep errors/slow"] --> C["Batch Export
Reduce I/O"] --> D["Conditional Compile
XRPL_ENABLE_TELEMETRY"] - - style A fill:#1565c0,stroke:#0d47a1,color:#fff - style B fill:#2e7d32,stroke:#1b5e20,color:#fff - style C fill:#e65100,stroke:#bf360c,color:#fff - style D fill:#4a148c,stroke:#2e0d57,color:#fff -``` - -### Kill Switches (Rollback Options) - -1. **Config Disable**: Set `enabled=0` in config → instant disable, no restart needed for sampling -2. **Rebuild**: Compile with `XRPL_ENABLE_TELEMETRY=OFF` → zero overhead (no-op) -3. **Full Revert**: Clean separation allows easy commit reversion - ---- - -## Slide 7: Data Collection & Privacy - -### What Data is Collected - -| Category | Attributes Collected | Purpose | -| --------------- | ---------------------------------------------------------------------------------- | --------------------------- | -| **Transaction** | `tx.hash`, `tx.type`, `tx.result`, `tx.fee`, `ledger_index` | Trace transaction lifecycle | -| **Consensus** | `round`, `phase`, `mode`, `proposers`(public key or public node id), `duration_ms` | Analyze consensus timing | -| **RPC** | `command`, `version`, `status`, `duration_ms` | Monitor RPC performance | -| **Peer** | `peer.id`(public key), `latency_ms`, `message.type`, `message.size` | Network topology analysis | -| **Ledger** | `ledger.hash`, `ledger.index`, `close_time`, `tx_count` | Ledger progression tracking | -| **Job** | `job.type`, `queue_ms`, `worker` | JobQueue performance | - -### What is NOT Collected (Privacy Guarantees) - -```mermaid -flowchart LR - subgraph notCollected["❌ NOT Collected"] - direction LR - A["Private Keys"] ~~~ B["Account Balances"] ~~~ C["Transaction Amounts"] - end - - subgraph alsoNot["❌ Also Excluded"] - direction LR - D["IP Addresses
(configurable)"] ~~~ E["Personal Data"] ~~~ F["Raw TX Payloads"] - end - - style A fill:#c62828,stroke:#8c2809,color:#fff - style B fill:#c62828,stroke:#8c2809,color:#fff - style C fill:#c62828,stroke:#8c2809,color:#fff - style D fill:#c62828,stroke:#8c2809,color:#fff - style E fill:#c62828,stroke:#8c2809,color:#fff - style F fill:#c62828,stroke:#8c2809,color:#fff -``` - -### Privacy Protection Mechanisms - -| Mechanism | Description | -| -------------------------- | ------------------------------------------------------------- | -| **Account Hashing** | `xrpl.tx.account` is hashed at collector level before storage | -| **Configurable Redaction** | Sensitive fields can be excluded via config | -| **Sampling** | Only 10% of traces recorded by default (reduces exposure) | -| **Local Control** | Node operators control what gets exported | -| **No Raw Payloads** | Transaction content is never recorded, only metadata | - -> **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts). - ---- - -_End of Presentation_