compare with other open source vendors

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Pratik Mankawde
2026-02-20 15:41:01 +00:00
parent bff83a2b92
commit 6e8f0624ce
12 changed files with 121 additions and 67 deletions


@@ -136,6 +136,7 @@ For traces to work across nodes, **trace context must be propagated** in message
### How span_id Changes at Each Hop
Only **one** `span_id` travels in the context - the sender's current span. Each node:
1. Extracts the received `span_id` and uses it as the `parent_span_id`
2. Creates a **new** `span_id` for its own span
3. Sends its own `span_id` as the parent when forwarding
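The three steps above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `TraceContext`, `Span`, `newSpanId`, `startSpanFromContext`, and `contextForForwarding` are hypothetical names, not rippled or OpenTelemetry SDK APIs, and the trace ID is shortened to 64 bits for brevity (real trace IDs are 128 bits).

```cpp
#include <cstdint>
#include <random>

// Hypothetical, simplified trace context: the data that travels in a message.
struct TraceContext {
    std::uint64_t trace_id;  // stays the same on every hop (shortened to 64 bits here)
    std::uint64_t span_id;   // the sender's current span
};

// Hypothetical span record that each node keeps locally.
struct Span {
    std::uint64_t trace_id;
    std::uint64_t span_id;
    std::uint64_t parent_span_id;
};

// Generates a random, non-zero 64-bit span ID.
inline std::uint64_t newSpanId() {
    static std::mt19937_64 rng{std::random_device{}()};
    std::uint64_t id = 0;
    while (id == 0) {
        id = rng();
    }
    return id;
}

// Steps 1 and 2: the received span_id becomes the parent,
// and a new span_id is created for this node's own span.
inline Span startSpanFromContext(TraceContext const& received) {
    return Span{received.trace_id, newSpanId(), received.span_id};
}

// Step 3: when forwarding, this node's own span_id is what travels onward.
inline TraceContext contextForForwarding(Span const& local) {
    return TraceContext{local.trace_id, local.span_id};
}
```

Chaining `startSpanFromContext` and `contextForForwarding` across nodes yields a parent/child chain in which `trace_id` never changes and each hop sees only its immediate sender's `span_id`.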
@@ -192,19 +193,23 @@ message TMTransaction {
Not every trace needs to be recorded. **Sampling** reduces overhead:
### Head Sampling (at trace start)
```
Request arrives → Random 10% chance → Record or skip entire trace
```
- ✅ Low overhead
- ❌ May miss interesting traces
### Tail Sampling (after trace completes)
```
Trace completes → Collector evaluates:
- Error? → KEEP
- Slow? → KEEP
- Normal? → Sample 10%
```
- ✅ Never loses important traces
- ❌ Higher memory usage at collector
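As a rough sketch, the two sampling decisions could look like the following. The names `headSample`, `tailSample`, and `CompletedTrace` are hypothetical, and the slow-trace threshold is an assumed parameter. Instead of a fresh random draw, the head sampler compares the trace ID against a threshold, which gives the same keep rate only if trace IDs are uniformly random; that choice is made here so the decision is reproducible.

```cpp
#include <chrono>
#include <cstdint>

// Head sampling: decide once, at trace start, from the trace ID alone.
// Assumes trace IDs are uniformly random, so roughly `ratio` of them fall below the threshold.
inline bool headSample(std::uint64_t traceId, double ratio) {
    if (ratio <= 0.0) return false;
    if (ratio >= 1.0) return true;
    // 2^64 as a double; ratio < 1.0 here, so the product fits in 64 bits.
    auto const threshold =
        static_cast<std::uint64_t>(ratio * 18446744073709551616.0);
    return traceId < threshold;
}

// Hypothetical summary of a finished trace, as a collector might see it.
struct CompletedTrace {
    std::uint64_t trace_id;
    bool has_error;
    std::chrono::milliseconds duration;
};

// Tail sampling: decide after the trace completes.
// Errors and slow traces are always kept; normal traces are sampled.
inline bool tailSample(CompletedTrace const& trace,
                       std::chrono::milliseconds slowThreshold,
                       double normalRatio) {
    if (trace.has_error) return true;
    if (trace.duration >= slowThreshold) return true;
    return headSample(trace.trace_id, normalRatio);
}
```

The trade-off in the lists above shows up directly: `headSample` needs nothing but the ID, while `tailSample` requires the whole trace to be buffered until its outcome is known.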
@@ -236,4 +241,4 @@ Trace completes → Collector evaluates:
---
*Next: [Architecture Analysis](./01-architecture-analysis.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Next: [Architecture Analysis](./01-architecture-analysis.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -250,6 +250,7 @@ After implementing OpenTelemetry, operators and developers will gain visibility
### 1.8.3 Concrete Dashboard Examples
**Transaction Trace View (Jaeger/Tempo):**
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ Trace: abc123... (Transaction Submission) Duration: 847ms │
@@ -270,6 +271,7 @@ After implementing OpenTelemetry, operators and developers will gain visibility
```
**RPC Performance Dashboard Panel:**
```
┌─────────────────────────────────────────────────────────────┐
│ RPC Command Latency (Last 1 Hour) │
@@ -325,4 +327,4 @@ xychart-beta
---
*Next: [Design Decisions](./02-design-decisions.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Next: [Design Decisions](./02-design-decisions.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -95,6 +95,7 @@ opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
```
**Examples**:
- `tx.receive` - Transaction received from peer
- `consensus.phase.establish` - Consensus establish phase
- `rpc.command.server_info` - server_info RPC command
@@ -104,51 +105,51 @@ opts.content_type = otlp::HttpRequestContentType::kJson; // or kBinary
```yaml
# Transaction Spans
tx:
receive: "Transaction received from network"
validate: "Transaction signature/format validation"
process: "Full transaction processing"
relay: "Transaction relay to peers"
apply: "Apply transaction to ledger"
receive: "Transaction received from network"
validate: "Transaction signature/format validation"
process: "Full transaction processing"
relay: "Transaction relay to peers"
apply: "Apply transaction to ledger"
# Consensus Spans
consensus:
round: "Complete consensus round"
round: "Complete consensus round"
phase:
open: "Open phase - collecting transactions"
open: "Open phase - collecting transactions"
establish: "Establish phase - reaching agreement"
accept: "Accept phase - applying consensus"
accept: "Accept phase - applying consensus"
proposal:
receive: "Receive peer proposal"
send: "Send our proposal"
receive: "Receive peer proposal"
send: "Send our proposal"
validation:
receive: "Receive peer validation"
send: "Send our validation"
receive: "Receive peer validation"
send: "Send our validation"
# RPC Spans
rpc:
request: "HTTP/WebSocket request handling"
request: "HTTP/WebSocket request handling"
command:
"*": "Specific RPC command (dynamic)"
"*": "Specific RPC command (dynamic)"
# Peer Spans
peer:
connect: "Peer connection establishment"
disconnect: "Peer disconnection"
connect: "Peer connection establishment"
disconnect: "Peer disconnection"
message:
send: "Send protocol message"
receive: "Receive protocol message"
send: "Send protocol message"
receive: "Receive protocol message"
# Ledger Spans
ledger:
acquire: "Ledger acquisition from network"
build: "Build new ledger"
validate: "Ledger validation"
close: "Close ledger"
acquire: "Ledger acquisition from network"
build: "Build new ledger"
validate: "Ledger validation"
close: "Close ledger"
# Job Spans
job:
enqueue: "Job added to queue"
execute: "Job execution"
enqueue: "Job added to queue"
execute: "Job execution"
```
---
@@ -173,6 +174,7 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
### 2.4.2 Span Attributes by Category
#### Transaction Attributes
```cpp
"xrpl.tx.hash" = string // Transaction hash (hex)
"xrpl.tx.type" = string // "Payment", "OfferCreate", etc.
@@ -184,6 +186,7 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
```
#### Consensus Attributes
```cpp
"xrpl.consensus.round" = int64 // Round number
"xrpl.consensus.phase" = string // "open", "establish", "accept"
@@ -196,6 +199,7 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
```
#### RPC Attributes
```cpp
"xrpl.rpc.command" = string // Command name
"xrpl.rpc.version" = int64 // API version
@@ -204,6 +208,7 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
```
#### Peer & Message Attributes
```cpp
"xrpl.peer.id" = string // Peer public key (base58)
"xrpl.peer.address" = string // IP:port
@@ -215,6 +220,7 @@ resource::SemanticConventions::SERVICE_INSTANCE_ID = <node_public_key_base58>
```
#### Ledger & Job Attributes
```cpp
"xrpl.ledger.hash" = string // Ledger hash
"xrpl.ledger.index" = int64 // Ledger sequence/index
@@ -352,6 +358,7 @@ rippled already has two observability mechanisms. OpenTelemetry complements (not
### 2.6.2 What Each Framework Does Best
#### PerfLog
- **Purpose**: Detailed local event logging for RPC and job execution
- **Strengths**:
- Rich JSON output with timing data
@@ -373,6 +380,7 @@ rippled already has two observability mechanisms. OpenTelemetry complements (not
```
#### Beast Insight (StatsD)
- **Purpose**: Real-time metrics for monitoring dashboards
- **Strengths**:
- Aggregated metrics (counters, gauges, histograms)
@@ -391,6 +399,7 @@ insight.timing("consensus.round", duration);
```
#### OpenTelemetry (NEW)
- **Purpose**: Distributed request tracing across nodes
- **Strengths**:
- **Cross-node correlation** via `trace_id`
@@ -411,14 +420,14 @@ span->SetAttribute("peer.id", peerId);
### 2.6.3 When to Use Each
| Scenario | PerfLog | StatsD | OpenTelemetry |
| --------------------------------------- | --------- | ------ | ------------- |
| "How many TXs per second?" | ❌ | ✅ | ❌ |
| "What's the p99 RPC latency?" | ❌ | ✅ | ✅ |
| "Why was this specific TX slow?" | ⚠️ partial | ❌ | ✅ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
| "What happened on node X at time T?" | ✅ | ❌ | ✅ |
| "Show me the TX journey across 5 nodes" | ❌ | ❌ | ✅ |
| Scenario | PerfLog | StatsD | OpenTelemetry |
| --------------------------------------- | ---------- | ------ | ------------- |
| "How many TXs per second?" | ❌ | ✅ | ❌ |
| "What's the p99 RPC latency?" | ❌ | ✅ | ✅ |
| "Why was this specific TX slow?" | ⚠️ partial | ❌ | ✅ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
| "What happened on node X at time T?" | ✅ | ❌ | ✅ |
| "Show me the TX journey across 5 nodes" | ❌ | ❌ | ✅ |
### 2.6.4 Coexistence Strategy
@@ -482,4 +491,4 @@ Status doCommand(RPC::JsonContext& context, Json::Value& result)
---
*Previous: [Architecture Analysis](./01-architecture-analysis.md)* | *Next: [Implementation Strategy](./03-implementation-strategy.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Architecture Analysis](./01-architecture-analysis.md)_ | _Next: [Implementation Strategy](./03-implementation-strategy.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -191,6 +191,7 @@ xychart-beta
```
**Notes**:
- Memory increases linearly with span rate
- Batch export prevents unbounded growth
- Queue size is configurable (default 2048 spans)
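To illustrate why a bounded queue caps memory, here is a minimal sketch. `BoundedSpanQueue` is a hypothetical class, not the OpenTelemetry SDK's batch processor; it assumes that spans arriving at a full queue are dropped, represents each span as a plain string, and is not thread-safe.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bounded queue: memory use is capped at maxSize spans.
class BoundedSpanQueue {
public:
    explicit BoundedSpanQueue(std::size_t maxSize = 2048) : maxSize_(maxSize) {}

    // Enqueues a span; returns false and counts a drop when the queue is full.
    bool push(std::string span) {
        if (queue_.size() >= maxSize_) {
            ++dropped_;
            return false;
        }
        queue_.push_back(std::move(span));
        return true;
    }

    // Removes and returns up to maxBatch spans, oldest first, for export.
    std::vector<std::string> drainBatch(std::size_t maxBatch) {
        std::vector<std::string> batch;
        while (!queue_.empty() && batch.size() < maxBatch) {
            batch.push_back(std::move(queue_.front()));
            queue_.pop_front();
        }
        return batch;
    }

    std::size_t size() const { return queue_.size(); }
    std::size_t dropped() const { return dropped_; }

private:
    std::size_t maxSize_;
    std::size_t dropped_ = 0;
    std::deque<std::string> queue_;
};
```

If spans arrive faster than batches are drained, the queue stops growing at `maxSize` and the `dropped()` counter rises instead, so the cost of overload is lost spans rather than unbounded memory.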
@@ -386,8 +387,8 @@ quadrantChart
### 3.9.5 Backward Compatibility
| Compatibility | Status | Notes |
| --------------- | ------ | ----------------------------------------------------- |
| Compatibility | Status | Notes |
| --------------- | ------- | ----------------------------------------------------- |
| **Config File** | ✅ Full | New `[telemetry]` section is optional |
| **Protocol** | ✅ Full | Optional protobuf fields with high field numbers |
| **Build** | ✅ Full | `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary |
@@ -405,6 +406,7 @@ If issues are discovered after deployment:
### 3.9.7 Code Change Examples
**Minimal RPC Instrumentation (Low Intrusiveness):**
```cpp
// Before
void ServerHandler::onRequest(...) {
@@ -425,6 +427,7 @@ void ServerHandler::onRequest(...) {
```
**Consensus Instrumentation (Medium Intrusiveness):**
```cpp
// Before
void RCLConsensusAdaptor::startRound(...) {
@@ -445,4 +448,4 @@ void RCLConsensusAdaptor::startRound(...) {
---
*Previous: [Design Decisions](./02-design-decisions.md)* | *Next: [Code Samples](./04-code-samples.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Design Decisions](./02-design-decisions.md)_ | _Next: [Code Samples](./04-code-samples.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -979,4 +979,4 @@ flowchart TB
---
*Previous: [Implementation Strategy](./03-implementation-strategy.md)* | *Next: [Configuration Reference](./05-configuration-reference.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Implementation Strategy](./03-implementation-strategy.md)_ | _Next: [Configuration Reference](./05-configuration-reference.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -506,7 +506,7 @@ service:
```yaml
# docker-compose-telemetry.yaml
version: '3.8'
version: "3.8"
services:
# OpenTelemetry Collector
@@ -517,8 +517,8 @@ services:
volumes:
- ./otel-collector-dev.yaml:/etc/otel-collector-config.yaml:ro
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "13133:13133" # Health check
depends_on:
- jaeger
@@ -628,8 +628,8 @@ datasources:
httpMethod: GET
tracesToLogs:
datasourceUid: loki
tags: ['service.name', 'xrpl.tx.hash']
mappedTags: [{ key: 'trace_id', value: 'traceID' }]
tags: ["service.name", "xrpl.tx.hash"]
mappedTags: [{ key: "trace_id", value: "traceID" }]
mapTagNamesEnabled: true
filterByTraceID: true
serviceMap:
@@ -656,7 +656,7 @@ datasources:
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ['service.name']
tags: ["service.name"]
```
#### Elastic APM
@@ -685,10 +685,10 @@ datasources:
apiVersion: 1
providers:
- name: 'rippled-dashboards'
- name: "rippled-dashboards"
orgId: 1
folder: 'rippled'
folderUid: 'rippled'
folder: "rippled"
folderUid: "rippled"
type: file
disableDeletion: false
updateIntervalSeconds: 30
@@ -880,7 +880,7 @@ In Tempo data source configuration, set up the derived field:
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ['trace_id', 'xrpl.tx.hash']
tags: ["trace_id", "xrpl.tx.hash"]
filterByTraceID: true
filterBySpanID: false
```
@@ -894,9 +894,9 @@ To correlate traces with existing Beast Insight metrics:
```yaml
# prometheus.yaml
scrape_configs:
- job_name: 'rippled-statsd'
- job_name: "rippled-statsd"
static_configs:
- targets: ['statsd-exporter:9102']
- targets: ["statsd-exporter:9102"]
```
**Step 2: Add exemplars to metrics**
@@ -933,4 +933,4 @@ This allows clicking on metric data points to jump directly to the related trace
---
*Previous: [Code Samples](./04-code-samples.md)* | *Next: [Implementation Phases](./06-implementation-phases.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Code Samples](./04-code-samples.md)_ | _Next: [Implementation Phases](./06-implementation-phases.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -315,6 +315,7 @@ flowchart TB
**Goal**: Get basic tracing working with minimal code changes.
**What You Get**:
- RPC request/response traces for all commands
- Latency breakdown per RPC command
- Error visibility with stack traces
@@ -323,6 +324,7 @@ flowchart TB
**Code Changes**: ~15 lines in `ServerHandler.cpp`, ~40 lines in new telemetry module
**Why Start Here**:
- RPC is the lowest-risk, highest-visibility component
- Immediate value for debugging client issues
- No cross-node complexity
@@ -333,6 +335,7 @@ flowchart TB
**Goal**: Add transaction lifecycle tracing across nodes.
**What You Get**:
- End-to-end transaction traces from submit to relay
- Cross-node correlation (see transaction path)
- HashRouter deduplication visibility
@@ -341,6 +344,7 @@ flowchart TB
**Code Changes**: ~120 lines across 4 files, plus protobuf extension
**Why Do This Second**:
- Builds on RPC tracing (transactions submitted via RPC)
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues
@@ -350,6 +354,7 @@ flowchart TB
**Goal**: Full observability including consensus.
**What You Get**:
- Complete consensus round visibility
- Phase transition timing
- Validator proposal tracking
@@ -358,6 +363,7 @@ flowchart TB
**Code Changes**: ~100 lines across 3 consensus files
**Why Do This Last**:
- Highest complexity (consensus is critical path)
- Requires thorough testing
- Lower relative value (consensus issues are rarer)
@@ -392,7 +398,7 @@ Clear, measurable criteria for each phase.
| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
| SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles |
| SDK Integration | `cmake --build` succeeds with `-DXRPL_ENABLE_TELEMETRY=ON` | ✅ Compiles |
| Runtime Toggle  | `enabled=0` adds no measurable overhead                    | <0.1% CPU difference         |
| Span Creation | Unit test creates and exports span | Span appears in Jaeger |
| Configuration | All config options parsed correctly | Config validation tests pass |
@@ -534,4 +540,4 @@ flowchart TB
---
*Previous: [Configuration Reference](./05-configuration-reference.md)* | *Next: [Observability Backends](./07-observability-backends.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Configuration Reference](./05-configuration-reference.md)_ | _Next: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -137,6 +137,7 @@ flowchart TB
| **Gateway** | Central collector(s) | Centralized processing | Single point of failure |
**Recommendation**: Use **Gateway** pattern with regional collectors for rippled networks:
- One collector cluster per datacenter/region
- Tail-based sampling at collector level
- Multiple export destinations for redundancy
@@ -472,23 +473,27 @@ flowchart TB
### 7.7.3 Example: Debugging a Slow Transaction
**Step 1: Find the trace**
```
# In Grafana Explore with Tempo
{resource.service.name="rippled" && span.xrpl.tx.hash="ABC123..."}
```
**Step 2: Get the trace_id from the trace view**
```
Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
```
**Step 3: Find related PerfLog entries**
```
# In Grafana Explore with Loki
{job="rippled"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
```
**Step 4: Check Insight metrics for the time window**
```
# In Grafana with Prometheus
rate(rippled_tx_applied_total[1m])
@@ -587,4 +592,4 @@ rate(rippled_tx_applied_total[1m])
---
*Previous: [Implementation Phases](./06-implementation-phases.md)* | *Next: [Appendix](./08-appendix.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Implementation Phases](./06-implementation-phases.md)_ | _Next: [Appendix](./08-appendix.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -130,4 +130,4 @@ flowchart TB
---
*Previous: [Observability Backends](./07-observability-backends.md)* | *Back to: [Overview](./OpenTelemetryPlan.md)*
_Previous: [Observability Backends](./07-observability-backends.md)_ | _Back to: [Overview](./OpenTelemetryPlan.md)_


@@ -130,6 +130,7 @@ Performance optimization strategies include probabilistic head sampling (10% def
## 4. Code Samples
Complete C++ implementation examples are provided for all telemetry components:
- `Telemetry.h` - Core interface for tracer access and span creation
- `SpanGuard.h` - RAII wrapper for automatic span lifecycle management
- `TracingInstrumentation.h` - Macros for conditional instrumentation
@@ -186,4 +187,4 @@ The appendix contains a glossary of OpenTelemetry and rippled-specific terms, re
---
*This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents.*
_This document provides a comprehensive implementation plan for integrating OpenTelemetry distributed tracing into the rippled XRP Ledger node software. For detailed information on any section, follow the links to the corresponding sub-documents._


@@ -303,3 +303,9 @@ words:
- xrplf
- xxhash
- xxhasher
- xychart
- otelc
- zpages
- traceql
- Gantt
- gantt


@@ -29,7 +29,24 @@ flowchart LR
---
## Slide 2: Comparison with Existing Solutions
## Slide 2: OpenTelemetry vs Open Source Alternatives
| Feature | OpenTelemetry | Jaeger | Zipkin | SkyWalking | Pinpoint | Prometheus |
| ------------------- | ---------------- | ---------------- | ------------------ | ---------- | ---------- | ---------- |
| **Tracing** | YES | YES | YES | YES | YES | NO |
| **Metrics** | YES | NO | NO | YES | YES | YES |
| **Logs** | YES | NO | NO | YES | NO | NO |
| **C++ SDK** | YES Official | YES (Deprecated) | YES (Unmaintained) | NO | NO | YES |
| **Vendor Neutral** | YES Primary goal | NO | NO | NO | NO | NO |
| **Instrumentation** | Manual + Auto | Manual | Manual | Auto-first | Auto-first | Manual |
| **Backend** | Any (exporters) | Self | Self | Self | Self | Self |
| **CNCF Status**     | Incubating       | Graduated        | NO                 | NO (ASF)   | NO         | Graduated  |
> **Why OpenTelemetry?** It is the only actively maintained, full-featured C++ option that is also vendor neutral: traces can be exported to Jaeger, Prometheus, Grafana, or any commercial backend without changing the instrumentation.
---
## Slide 3: Comparison with rippled's Existing Solutions
### Current Observability Stack
@@ -46,16 +63,16 @@ flowchart LR
| Scenario | PerfLog | StatsD | OpenTelemetry |
| -------------------------------- | ------- | ------ | ------------- |
| "How many TXs per second?" | ❌ | ✅ | ❌ |
| "Why was this specific TX slow?" | ⚠️ | ❌ | ✅ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
| "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ |
| "How many TXs per second?" | ❌ | ✅ | ❌ |
| "Why was this specific TX slow?" | ⚠️ | ❌ | ✅ |
| "Which node delayed consensus?" | ❌ | ❌ | ✅ |
| "Show TX journey across 5 nodes" | ❌ | ❌ | ✅ |
> **Key Insight**: OpenTelemetry **complements** (not replaces) existing systems.
---
## Slide 3: Architecture
## Slide 4: Architecture
### High-Level Integration Architecture
@@ -103,7 +120,7 @@ sequenceDiagram
---
## Slide 4: Implementation Plan
## Slide 5: Implementation Plan
### 5-Phase Rollout (9 Weeks)
@@ -143,7 +160,7 @@ gantt
---
## Slide 5: Performance Overhead
## Slide 6: Performance Overhead
### Estimated System Impact
@@ -211,7 +228,7 @@ flowchart LR
---
## Slide 6: Data Collection & Privacy
## Slide 7: Data Collection & Privacy
### What Data is Collected
@@ -260,4 +277,4 @@ flowchart LR
---
*End of Presentation*
_End of Presentation_