mirror of
https://github.com/XRPLF/rippled.git
synced 2026-06-10 12:16:49 +00:00
Merge branch 'pratik/otel-phase1a-plan-docs' into pratik/otel-phase1b-telemetry-infra
This commit is contained in:
@@ -514,12 +514,19 @@ Not every trace needs to be recorded. **Sampling** reduces overhead:
|
||||
### Head Sampling (at trace start)
|
||||
|
||||
```
|
||||
Request arrives → Random 10% chance → Record or skip entire trace
|
||||
Request arrives → Random N% chance → Record or skip entire trace
|
||||
```
|
||||
|
||||
- ✅ Low overhead
|
||||
- ❌ May miss interesting traces
|
||||
|
||||
> **xrpld note**: xrpld intentionally fixes head sampling at 100% (sample
|
||||
> everything) and does not expose a configurable ratio. A per-node ratio
|
||||
> would let different nodes make divergent keep/drop decisions for the same
|
||||
> distributed trace, producing broken/partial traces. xrpld uses a
|
||||
> `ParentBased` sampler so spans with a remote parent honor the upstream
|
||||
> decision. Volume reduction is delegated to collector-side tail sampling.
|
||||
|
||||
### Tail Sampling (after trace completes)
|
||||
|
||||
```
|
||||
|
||||
@@ -53,8 +53,10 @@ public:
|
||||
bool useTls = false;
|
||||
std::string tlsCertPath;
|
||||
|
||||
// Sampling configuration
|
||||
double samplingRatio = 1.0; // 1.0 = 100% sampling
|
||||
// Head sampling: fixed at 1.0 (sample everything), not config-driven.
|
||||
// Keeps trace keep/drop decisions coherent across nodes; volume
|
||||
// reduction is delegated to the collector's tail sampling.
|
||||
double samplingRatio = 1.0;
|
||||
|
||||
// Batch processor settings
|
||||
std::uint32_t batchSize = 512;
|
||||
|
||||
@@ -37,12 +37,11 @@ Add to `cfg/xrpld-example.cfg`:
|
||||
# # Path to CA certificate for TLS (optional)
|
||||
# # tls_ca_cert=/path/to/ca.crt
|
||||
#
|
||||
# # Sampling ratio: 0.0-1.0 (default: 1.0 = 100% sampling)
|
||||
# # Use lower values in production to reduce overhead
|
||||
# # Default: 1.0 (all traces). For production deployments with high
|
||||
# # throughput, 0.1 (10%) is recommended to reduce overhead.
|
||||
# # See Section 7.4.2 for sampling strategy details.
|
||||
# sampling_ratio=0.1
|
||||
# # Head sampling is intentionally fixed at 1.0 (sample everything) and is
|
||||
# # NOT configurable. A per-node head-sampling ratio would let nodes make
|
||||
# # divergent keep/drop decisions for the same distributed trace, producing
|
||||
# # broken/partial traces across the network. Volume reduction is delegated
|
||||
# # to the collector's tail sampling instead. See Section 7.4.2.
|
||||
#
|
||||
# # Batch processor settings
|
||||
# batch_size=512 # Spans per batch (default: 512)
|
||||
@@ -78,7 +77,6 @@ enabled=0
|
||||
| `endpoint` | string | `http://localhost:4318/v1/traces` | OTLP/HTTP collector endpoint |
|
||||
| `use_tls` | bool | `false` | Enable TLS for exporter connection |
|
||||
| `tls_ca_cert` | string | `""` | Path to CA certificate file |
|
||||
| `sampling_ratio` | float | `1.0` | Sampling ratio (0.0-1.0) |
|
||||
| `batch_size` | uint | `512` | Spans per export batch |
|
||||
| `batch_delay_ms` | uint | `5000` | Max delay before sending batch (ms) |
|
||||
| `max_queue_size` | uint | `2048` | Maximum queued spans |
|
||||
@@ -143,13 +141,8 @@ setup_Telemetry(
|
||||
setup.useTls = section.value_or("use_tls", false);
|
||||
setup.tlsCertPath = section.value_or("tls_ca_cert", "");
|
||||
|
||||
// Sampling
|
||||
setup.samplingRatio = section.value_or("sampling_ratio", 1.0);
|
||||
if (setup.samplingRatio < 0.0 || setup.samplingRatio > 1.0)
|
||||
{
|
||||
Throw<std::runtime_error>(
|
||||
"telemetry.sampling_ratio must be between 0.0 and 1.0");
|
||||
}
|
||||
// Head sampling is fixed at 1.0 (sample everything) and is not read from
|
||||
// config — see Section 7.4.2. setup.samplingRatio stays at its 1.0 default.
|
||||
|
||||
// Batch processor
|
||||
setup.batchSize = section.value_or("batch_size", 512u);
|
||||
|
||||
@@ -171,7 +171,7 @@ flowchart TB
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph head["Head Sampling (Node)"]
|
||||
hs[Node-level head sampling<br/>configurable, default: 100%<br/>recommended production: 10%]
|
||||
hs[Node-level head sampling<br/>fixed at 100%<br/>not configurable]
|
||||
end
|
||||
|
||||
subgraph tail["Tail Sampling (Collector)"]
|
||||
@@ -197,7 +197,7 @@ flowchart LR
|
||||
|
||||
**Reading the diagram:**
|
||||
|
||||
- **Head Sampling (Node)**: The first filter -- each xrpld node decides whether to sample a trace at creation time (default 100%, recommended 10% in production). This controls the volume leaving the node.
|
||||
- **Head Sampling (Node)**: xrpld pins head sampling at 100% (sample everything) and does not expose a configurable ratio. This is intentional: a per-node ratio would let different nodes make divergent keep/drop decisions for the same distributed trace, producing broken/partial traces. xrpld uses a `ParentBased` sampler so spans inheriting a remote parent honor the upstream decision. Volume reduction is delegated to the collector's tail sampling.
|
||||
- **Tail Sampling (Collector)**: The second filter -- the collector inspects completed traces and applies rules: keep all errors, keep anything slower than 5 seconds, and keep 10% of the remainder.
|
||||
- **Arrow head → tail**: All head-sampled traces flow to the collector, where tail sampling further reduces volume while preserving the most valuable data.
|
||||
- **Final Traces**: The output after both sampling stages; this is what gets stored and queried. The two-stage approach balances cost with debuggability.
|
||||
|
||||
@@ -148,7 +148,7 @@ Span naming follows a hierarchical `<component>.<operation>` convention (e.g., `
|
||||
|
||||
The telemetry code is organized under `include/xrpl/telemetry/` for headers and `src/libxrpl/telemetry/` for implementation. Key principles include RAII-based span management via `SpanGuard` (with `discard()` for dropping unwanted spans), a `FilteringSpanProcessor` that intercepts `OnEnd()` to prevent discarded spans from entering the export pipeline, conditional compilation with `XRPL_ENABLE_TELEMETRY`, and minimal runtime overhead through batch processing and efficient sampling.
|
||||
|
||||
Performance optimization strategies include probabilistic head sampling (10% default), tail-based sampling at the collector for errors and slow traces, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
|
||||
Performance optimization strategies include head sampling fixed at 100% (intentionally not configurable, so trace keep/drop decisions stay coherent across nodes), tail-based sampling at the collector for errors and slow traces to reduce volume, batch export to reduce network overhead, and conditional instrumentation that compiles to no-ops when disabled.
|
||||
|
||||
➡️ **[Read full Implementation Strategy](./03-implementation-strategy.md)**
|
||||
|
||||
|
||||
Reference in New Issue
Block a user