From 3f897e00a6c3c00a0b43b8122625138295be0069 Mon Sep 17 00:00:00 2001
From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Date: Fri, 6 Mar 2026 14:09:37 +0000
Subject: [PATCH] document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
---
.../09-data-collection-reference.md | 100 +++++++++++++++---
1 file changed, 83 insertions(+), 17 deletions(-)
diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md
index 9d24d00368..22d3592b28 100644
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -1,33 +1,83 @@
# Observability Data Collection Reference
> **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.
+>
+> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
## Data Flow Overview
```mermaid
graph LR
- subgraph rippled Node
- A[Trace Macros<br/>XRPL_TRACE_SPAN] -->|OTLP/HTTP :4318| C[OTel Collector]
- B[beast::insight<br/>StatsD metrics] -->|UDP :8125| C
+ subgraph rippledNode["rippled Node"]
+ A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
+ B["beast::insight<br/>StatsD metrics<br/>(UDP sender)"]
end
- C -->|Jaeger export| D[Jaeger :16686<br/>Trace search & visualization]
- C -->|SpanMetrics connector| E[Prometheus :9090<br/>RED metrics from spans]
- C -->|StatsD receiver| E
- E --> F[Grafana :3000<br/>8 dashboards]
- D --> F
- style A fill:#4a90d9,color:#fff
- style B fill:#d9534f,color:#fff
- style C fill:#5cb85c,color:#fff
- style D fill:#f0ad4e,color:#000
- style E fill:#f0ad4e,color:#000
- style F fill:#5bc0de,color:#000
+ subgraph collector["OTel Collector :4317 / :4318 / :8125"]
+ direction TB
+ R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
+ R2["StatsD Receiver<br/>:8125 UDP"]
+ BP["Batch Processor<br/>timeout 1s, batch 100"]
+ SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
+
+ R1 --> BP
+ BP --> SM
+ end
+
+ subgraph backends["Trace Backends (choose one or both)"]
+ D["Jaeger :16686<br/>Trace search &<br/>visualization"]
+ T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
+ end
+
+ subgraph metrics["Metrics Stack"]
+ E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + StatsD metrics"]
+ end
+
+ subgraph viz["Visualization"]
+ F["Grafana :3000<br/>8 dashboards"]
+ end
+
+ A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
+ B -->|"UDP :8125<br/>(gauges, counters, timers)"| R2
+
+ BP -->|"OTLP/gRPC :4317"| D
+ BP -->|"OTLP/gRPC"| T
+
+ SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
+ R2 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* summaries"| E
+
+ E -->|"Prometheus<br/>data source"| F
+ D -->|"Jaeger<br/>data source"| F
+ T -->|"Tempo<br/>data source"| F
+
+ style A fill:#4a90d9,color:#fff,stroke:#2a6db5
+ style B fill:#d9534f,color:#fff,stroke:#b52d2d
+ style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+ style R2 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+ style BP fill:#449d44,color:#fff,stroke:#2d6e2d
+ style SM fill:#449d44,color:#fff,stroke:#2d6e2d
+ style D fill:#f0ad4e,color:#000,stroke:#c78c2e
+ style T fill:#e8953a,color:#000,stroke:#b5732a
+ style E fill:#f0ad4e,color:#000,stroke:#c78c2e
+ style F fill:#5bc0de,color:#000,stroke:#3aa8c1
+ style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
+ style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
+ style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
+ style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
+ style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```
-There are two independent telemetry pipelines:
+Two independent telemetry pipelines feed a single **OTel Collector**:
-1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP to the collector, which sends them to Jaeger for visualization and derives RED metrics via the SpanMetrics connector.
-2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets, received by the collector's StatsD receiver, and exported to Prometheus.
+1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
+2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets to port :8125, ingested by the collector's **StatsD Receiver**, and exported alongside span-derived metrics to Prometheus.
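
To make the second pipeline concrete, the sketch below shows the plain-text StatsD datagram format (`<name>:<value>|<type>`) that a UDP sender would emit toward port 8125. It is an illustration in Python, not rippled's `beast::insight` code; the helper names and the `rippled_example_*` metric names are hypothetical.

```python
import socket

# StatsD type codes for the three metric kinds named above.
_TYPE_CODES = {"gauge": "g", "counter": "c", "timer": "ms"}


def format_statsd(name: str, value: float, kind: str) -> bytes:
    """Encode one metric as a StatsD payload: <name>:<value>|<type>."""
    code = _TYPE_CODES[kind]  # raises KeyError for an unknown kind
    # Render whole numbers without a trailing ".0".
    number = int(value) if float(value).is_integer() else value
    return f"{name}:{number}|{code}".encode("ascii")


def send_metric(name: str, value: float, kind: str,
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send one datagram to a StatsD receiver (assumed to listen on UDP 8125)."""
    payload = format_statsd(name, value, kind)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `send_metric("rippled_example_gauge", 42, "gauge")` would hand one datagram to whatever is listening on UDP 8125. Because UDP is fire-and-forget, the call does not report whether a collector actually received it.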
+
+**Trace backends** — The collector exports traces via OTLP/gRPC to one or both:
+
+- **Jaeger** (development) — Provides a trace search UI at `:16686` and is easy to run as a single binary.
+- **Grafana Tempo** (production) — Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.
+
+> **Further reading**: [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) for core OpenTelemetry concepts (traces, spans, context propagation, sampling). [07-observability-backends.md](./07-observability-backends.md) for production backend selection, collector placement, and sampling strategies.
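
As a rough illustration of what deriving RED metrics from spans means, the sketch below aggregates call count, error count, and total duration per span name. It is a conceptual Python model only, not the SpanMetrics Connector's implementation; the `Span` type, the `red_metrics` helper, and the sample span names are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified view of a finished span."""
    name: str
    duration_ms: float
    is_error: bool


def red_metrics(spans):
    """Aggregate Rate (calls), Errors, and Duration (sum) per span name."""
    totals = defaultdict(
        lambda: {"calls": 0, "errors": 0, "duration_ms_sum": 0.0}
    )
    for span in spans:
        entry = totals[span.name]
        entry["calls"] += 1
        entry["errors"] += int(span.is_error)
        entry["duration_ms_sum"] += span.duration_ms
    return dict(totals)
```

A real connector additionally splits these series by dimension labels and records durations in histogram buckets; the sketch keeps only a running sum to stay short.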
---
@@ -35,6 +85,8 @@ There are two independent telemetry pipelines:
### 1.1 Complete Span Inventory (16 spans)
+> **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.
+
#### RPC Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
@@ -110,6 +162,8 @@ Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (h
### 1.2 Complete Attribute Inventory (22 attributes)
+> **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.
+
Every span can carry key-value attributes that provide context for filtering and aggregation.
#### RPC Attributes
@@ -180,6 +234,8 @@ Every span can carry key-value attributes that provide context for filtering and
### 1.3 SpanMetrics — Derived Prometheus Metrics
+> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights.
+
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.
| Prometheus Metric | Type | Description |
@@ -208,6 +264,8 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er
## 2. StatsD Metrics (beast::insight)
+> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory.
+
These are system-level metrics emitted by rippled's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans.
### Configuration
@@ -302,6 +360,8 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
## 3. Grafana Dashboard Reference
+> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples.
+
### 3.1 Span-Derived Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
@@ -330,6 +390,8 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
## 4. Jaeger Trace Search Guide
+> **See also**: [08-appendix.md](./08-appendix.md) §8.2 for span hierarchy visualizations. [05-configuration-reference.md](./05-configuration-reference.md) §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger.
+
### Finding Traces by Type
| What to Find | Jaeger Search Parameters |
@@ -372,6 +434,8 @@ ledger.store (persist to DB)
## 5. Prometheus Query Examples
+> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus StatsD metrics with trace-derived metrics.
+
### Span-Derived Metrics
```promql
@@ -439,6 +503,8 @@ The telemetry system is designed with privacy in mind:
## 8. Configuration Quick Reference
+> **Full reference**: [05-configuration-reference.md](./05-configuration-reference.md) §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production).
+
### Minimal Setup (development)
```ini