From 5598b0eac759bd072a0fa14ff2d090f0f629beaa Mon Sep 17 00:00:00 2001
From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Date: Tue, 9 Jun 2026 18:22:52 +0100
Subject: [PATCH] docs(telemetry): fix head sampling at 1.0, remove configurable ratio

Document that head sampling is intentionally fixed at 100% and no longer
exposes a sampling_ratio config knob. A per-node ratio let nodes make
divergent keep/drop decisions for the same distributed trace, producing
broken/partial traces; pinning at 1.0 with a ParentBased sampler keeps
decisions coherent across the network. Volume reduction is delegated to
collector-side tail sampling.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 OpenTelemetryPlan/00-tracing-fundamentals.md  |  9 +++++++-
 OpenTelemetryPlan/04-code-samples.md          |  6 ++++--
 .../05-configuration-reference.md             | 21 +++++++------------
 .../07-observability-backends.md              |  4 ++--
 OpenTelemetryPlan/OpenTelemetryPlan.md        |  2 +-
 5 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/OpenTelemetryPlan/00-tracing-fundamentals.md b/OpenTelemetryPlan/00-tracing-fundamentals.md
index 24322bdd09..d7cc40c5ac 100644
--- a/OpenTelemetryPlan/00-tracing-fundamentals.md
+++ b/OpenTelemetryPlan/00-tracing-fundamentals.md
@@ -514,12 +514,19 @@ Not every trace needs to be recorded. **Sampling** reduces overhead:
 ### Head Sampling (at trace start)
 
 ```
-Request arrives → Random 10% chance → Record or skip entire trace
+Request arrives → Random N% chance → Record or skip entire trace
 ```
 
 - ✅ Low overhead
 - ❌ May miss interesting traces
 
+> **xrpld note**: xrpld intentionally fixes head sampling at 100% (sample
+> everything) and does not expose a configurable ratio. A per-node ratio
+> would let different nodes make divergent keep/drop decisions for the same
+> distributed trace, producing broken/partial traces. xrpld uses a
+> `ParentBased` sampler so spans with a remote parent honor the upstream
+> decision. Volume reduction is delegated to collector-side tail sampling.
+
 ### Tail Sampling (after trace completes)
 
 ```
diff --git a/OpenTelemetryPlan/04-code-samples.md b/OpenTelemetryPlan/04-code-samples.md
index 1452c30f5e..d70bcbc760 100644
--- a/OpenTelemetryPlan/04-code-samples.md
+++ b/OpenTelemetryPlan/04-code-samples.md
@@ -53,8 +53,10 @@ public:
     bool useTls = false;
     std::string tlsCertPath;
 
-    // Sampling configuration
-    double samplingRatio = 1.0; // 1.0 = 100% sampling
+    // Head sampling: fixed at 1.0 (sample everything), not config-driven.
+    // Keeps trace keep/drop decisions coherent across nodes; volume
+    // reduction is delegated to the collector's tail sampling.
+    double samplingRatio = 1.0;
 
     // Batch processor settings
     std::uint32_t batchSize = 512;
diff --git a/OpenTelemetryPlan/05-configuration-reference.md b/OpenTelemetryPlan/05-configuration-reference.md
index d6f13e0d9d..0ea40c08a9 100644
--- a/OpenTelemetryPlan/05-configuration-reference.md
+++ b/OpenTelemetryPlan/05-configuration-reference.md
@@ -37,12 +37,11 @@ Add to `cfg/xrpld-example.cfg`:
 # # Path to CA certificate for TLS (optional)
 # # tls_ca_cert=/path/to/ca.crt
 #
-# # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling)
-# # Use lower values in production to reduce overhead
-# # Default: 1.0 (all traces). For production deployments with high
-# # throughput, 0.1 (10%) is recommended to reduce overhead.
-# # See Section 7.4.2 for sampling strategy details.
-# sampling_ratio=0.1
+# # Head sampling is intentionally fixed at 1.0 (sample everything) and is
+# # NOT configurable. A per-node head-sampling ratio would let nodes make
+# # divergent keep/drop decisions for the same distributed trace, producing
+# # broken/partial traces across the network. Volume reduction is delegated
+# # to the collector's tail sampling instead. See Section 7.4.2.
 #
 # # Batch processor settings
 # batch_size=512 # Spans per batch (default: 512)
@@ -78,7 +77,6 @@ enabled=0
 | `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint |
 | `use_tls` | bool | `false` | Enable TLS for exporter connection |
 | `tls_ca_cert` | string | `""` | Path to CA certificate file |
-| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
 | `batch_size` | uint | `512` | Spans per export batch |
 | `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
 | `max_queue_size` | uint | `2048` | Maximum queued spans |
@@ -143,13 +141,8 @@ setup_Telemetry(
     setup.useTls = section.value_or("use_tls", false);
     setup.tlsCertPath = section.value_or("tls_ca_cert", "");
 
-    // Sampling
-    setup.samplingRatio = section.value_or("sampling_ratio", 1.0);
-    if (setup.samplingRatio < 0.0 || setup.samplingRatio > 1.0)
-    {
-        Throw(
-            "telemetry.sampling_ratio must be between 0.0 and 1.0");
-    }
+    // Head sampling is fixed at 1.0 (sample everything) and is not read from
+    // config — see Section 7.4.2. setup.samplingRatio stays at its 1.0 default.
 
     // Batch processor
     setup.batchSize = section.value_or("batch_size", 512u);
diff --git a/OpenTelemetryPlan/07-observability-backends.md b/OpenTelemetryPlan/07-observability-backends.md
index a1c303b545..5d1638670a 100644
--- a/OpenTelemetryPlan/07-observability-backends.md
+++ b/OpenTelemetryPlan/07-observability-backends.md
@@ -171,7 +171,7 @@ flowchart TB
 ```mermaid
 flowchart LR
     subgraph head["Head Sampling (Node)"]
-        hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
+        hs[Node-level head sampling<br/>fixed at 100%<br/>not configurable]
     end
 
     subgraph tail["Tail Sampling (Collector)"]
@@ -197,7 +197,7 @@ flowchart LR
 
 **Reading the diagram:**
 
-- **Head Sampling (Node)**: The first filter -- each xrpld node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
+- **Head Sampling (Node)**: xrpld pins head sampling at 100% (sample everything) and does not expose a configurable ratio. This is intentional: a per-node ratio would let different nodes make divergent keep/drop decisions for the same distributed trace, producing broken/partial traces. xrpld uses a `ParentBased` sampler so spans inheriting a remote parent honor the upstream decision. Volume reduction is delegated to the collector's tail sampling.
 - **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
 - **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
 - **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
diff --git a/OpenTelemetryPlan/OpenTelemetryPlan.md b/OpenTelemetryPlan/OpenTelemetryPlan.md
index 8f7476753b..3974d79481 100644
--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -148,7 +148,7 @@ Span naming follows a hierarchical `.` convention (e.g., `
 
 The telemetry code is organized under `include/xrpl/telemetry/` for headers and `src/libxrpl/telemetry/` for implementation. Key principles include RAII-based span management via `SpanGuard`, conditional compilation with `XRPL_ENABLE_TELEMETRY`, and minimal runtime overhead through batch processing and efficient sampling.
 
-Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
+Performance optimization strategies include head sampling fixed at 100% (intentionally not configurable, so trace keep/drop decisions stay coherent across nodes), tail-based sampling at the collector for errors and slow traces to reduce volume, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.

 ➡️ **[Read full Implementation Strategy](./03-implementation-strategy.md)**
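
Illustrative note on the rationale in the commit message: the sketch below is a standalone toy model, not xrpld or OpenTelemetry SDK code. It shows how independent per-node ratios can split one trace, while a pinned 1.0 ratio or a ParentBased-style rule keeps every node in agreement. All names (`ratioDecision`, `parentBasedDecision`, `makeTraceIds`, `countBrokenTraces`) and the ID-to-position mapping are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical ratio-based head sampler: maps the trace ID onto [0, 1) and
// keeps the trace when that position falls below the node's ratio.
bool ratioDecision(std::uint64_t traceId, double ratio)
{
    assert(ratio >= 0.0 && ratio <= 1.0);
    if (ratio >= 1.0)
        return true;  // pinned at 100%: every trace is kept
    // Use the top 53 bits so the value is exactly representable as a double.
    double const position = static_cast<double>(traceId >> 11) /
        static_cast<double>(std::uint64_t{1} << 53);
    return position < ratio;
}

// ParentBased-style rule (assumed shape): honor the upstream decision when a
// remote parent exists, otherwise fall back to the local root decision.
bool parentBasedDecision(
    std::optional<bool> parentSampled,
    std::uint64_t traceId,
    double rootRatio)
{
    if (parentSampled.has_value())
        return *parentSampled;
    return ratioDecision(traceId, rootRatio);
}

// Evenly spaced, deterministic trace IDs for the illustration (n must be > 0).
std::vector<std::uint64_t> makeTraceIds(std::size_t n)
{
    assert(n > 0);
    std::uint64_t const step = std::numeric_limits<std::uint64_t>::max() / n;
    std::vector<std::uint64_t> ids;
    ids.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        ids.push_back(static_cast<std::uint64_t>(i) * step);
    return ids;
}

// Counts traces for which consecutive nodes on one path disagree on keep/drop.
std::size_t countBrokenTraces(
    std::vector<double> const& nodeRatios,
    std::vector<std::uint64_t> const& traceIds,
    bool parentBased)
{
    std::size_t broken = 0;
    for (std::uint64_t const id : traceIds)
    {
        std::optional<bool> upstream;  // empty at the node that starts the trace
        bool diverged = false;
        for (double const ratio : nodeRatios)
        {
            bool const keep = parentBased
                ? parentBasedDecision(upstream, id, ratio)
                : ratioDecision(id, ratio);
            if (upstream.has_value() && keep != *upstream)
                diverged = true;
            upstream = keep;  // decision seen by the next node on the path
        }
        if (diverged)
            ++broken;
    }
    return broken;
}
```

In this toy model, `countBrokenTraces({0.25, 0.75}, makeTraceIds(10), false)` reports five split traces out of ten, while the same call with both ratios at `1.0`, or with `parentBased` set to `true`, reports none. Real samplers and context propagation differ in detail; the model only illustrates why coherent decisions matter.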