From 3f897e00a6c3c00a0b43b8122625138295be0069 Mon Sep 17 00:00:00 2001
From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Date: Fri, 6 Mar 2026 14:09:37 +0000
Subject: [PATCH] document updates

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
---
 .../09-data-collection-reference.md | 100 +++++++++++++++---
 1 file changed, 83 insertions(+), 17 deletions(-)

diff --git a/OpenTelemetryPlan/09-data-collection-reference.md b/OpenTelemetryPlan/09-data-collection-reference.md
index 9d24d00368..22d3592b28 100644
--- a/OpenTelemetryPlan/09-data-collection-reference.md
+++ b/OpenTelemetryPlan/09-data-collection-reference.md
@@ -1,33 +1,83 @@
 # Observability Data Collection Reference
 
 > **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.
+>
+> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
 
 ## Data Flow Overview
 
 ```mermaid
 graph LR
-    subgraph rippled Node
-        A[Trace Macros<br/>XRPL_TRACE_SPAN] -->|OTLP/HTTP :4318| C[OTel Collector]
-        B[beast::insight<br/>StatsD metrics] -->|UDP :8125| C
+    subgraph rippledNode["rippled Node"]
+        A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
+        B["beast::insight<br/>StatsD metrics<br/>(UDP sender)"]
     end
 
-    C -->|Jaeger export| D[Jaeger :16686<br/>Trace search & visualization]
-    C -->|SpanMetrics connector| E[Prometheus :9090<br/>RED metrics from spans]
-    C -->|StatsD receiver| E
-    E --> F[Grafana :3000<br/>8 dashboards]
-    D --> F
-    style A fill:#4a90d9,color:#fff
-    style B fill:#d9534f,color:#fff
-    style C fill:#5cb85c,color:#fff
-    style D fill:#f0ad4e,color:#000
-    style E fill:#f0ad4e,color:#000
-    style F fill:#5bc0de,color:#000
+    subgraph collector["OTel Collector :4317 / :4318 / :8125"]
+        direction TB
+        R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP"]
+        R2["StatsD Receiver<br/>:8125 UDP"]
+        BP["Batch Processor<br/>timeout 1s, batch 100"]
+        SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
+
+        R1 --> BP
+        BP --> SM
+    end
+
+    subgraph backends["Trace Backends (choose one or both)"]
+        D["Jaeger :16686<br/>Trace search &<br/>visualization"]
+        T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
+    end
+
+    subgraph metrics["Metrics Stack"]
+        E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + StatsD metrics"]
+    end
+
+    subgraph viz["Visualization"]
+        F["Grafana :3000<br/>8 dashboards"]
+    end
+
+    A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
+    B -->|"UDP :8125<br/>(gauges, counters, timers)"| R2
+
+    BP -->|"OTLP/gRPC :4317"| D
+    BP -->|"OTLP/gRPC"| T
+
+    SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
+    R2 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* summaries"| E
+
+    E -->|"Prometheus<br/>data source"| F
+    D -->|"Jaeger<br/>data source"| F
+    T -->|"Tempo<br/>data source"| F
+
+    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
+    style B fill:#d9534f,color:#fff,stroke:#b52d2d
+    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+    style R2 fill:#5cb85c,color:#fff,stroke:#3d8b3d
+    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
+    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
+    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
+    style T fill:#e8953a,color:#000,stroke:#b5732a
+    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
+    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
+    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
+    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
+    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
+    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
+    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
 ```
 
-There are two independent telemetry pipelines:
+There are two independent telemetry pipelines entering a single **OTel Collector**:
 
-1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP to the collector, which sends them to Jaeger for visualization and derives RED metrics via the SpanMetrics connector.
-2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets, received by the collector's StatsD receiver, and exported to Prometheus.
+1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
+2. **beast::insight StatsD** — System-level gauges, counters, and timers emitted as StatsD UDP packets to port :8125, ingested by the collector's **StatsD Receiver**, and exported alongside span-derived metrics to Prometheus.
+
+**Trace backends** — The collector exports traces via OTLP/gRPC to one or both:
+
+- **Jaeger** (development) — Provides trace search UI at `:16686`. Easy single-binary setup.
+- **Grafana Tempo** (production) — Preferred for production. Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.
+
+> **Further reading**: [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) for core OpenTelemetry concepts (traces, spans, context propagation, sampling). [07-observability-backends.md](./07-observability-backends.md) for production backend selection, collector placement, and sampling strategies.
 
 ---
 
@@ -35,6 +85,8 @@ There are two independent telemetry pipelines:
 
 ### 1.1 Complete Span Inventory (16 spans)
 
+> **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.
+
 #### RPC Spans
 
 Controlled by `trace_rpc=1` in `[telemetry]` config.
@@ -110,6 +162,8 @@ Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (h
 
 ### 1.2 Complete Attribute Inventory (22 attributes)
 
+> **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.
+
 Every span can carry key-value attributes that provide context for filtering and aggregation.
 
 #### RPC Attributes
@@ -180,6 +234,8 @@ Every span can carry key-value attributes that provide context for filtering and
 
 ### 1.3 SpanMetrics — Derived Prometheus Metrics
 
+> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights.
+
 The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.
 
 | Prometheus Metric | Type | Description |
@@ -208,6 +264,8 @@ The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Er
 
 ## 2. StatsD Metrics (beast::insight)
 
+> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6 metric inventory.
+
 These are system-level metrics emitted by rippled's `beast::insight` framework via StatsD UDP. They cover operational data that doesn't map to individual trace spans.
 
 ### Configuration
@@ -302,6 +360,8 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
 
 ## 3. Grafana Dashboard Reference
 
+> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples.
+
 ### 3.1 Span-Derived Dashboards (5)
 
 | Dashboard | UID | Data Source | Key Panels |
@@ -330,6 +390,8 @@ For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), fo
 
 ## 4. Jaeger Trace Search Guide
 
+> **See also**: [08-appendix.md](./08-appendix.md) §8.2 for span hierarchy visualizations. [05-configuration-reference.md](./05-configuration-reference.md) §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger.
+
 ### Finding Traces by Type
 
 | What to Find | Jaeger Search Parameters |
@@ -372,6 +434,8 @@ ledger.store (persist to DB)
 
 ## 5. Prometheus Query Examples
 
+> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus StatsD metrics with trace-derived metrics.
+
 ### Span-Derived Metrics
 
 ```promql
@@ -439,6 +503,8 @@ The telemetry system is designed with privacy in mind:
 
 ## 8. Configuration Quick Reference
 
+> **Full reference**: [05-configuration-reference.md](./05-configuration-reference.md) §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production).
+
 ### Minimal Setup (development)
 
 ```ini