docs(telemetry): add secure-OTel pipeline analysis and link into plan

Document the threat model and chosen hardening approach for the OTel pipeline: mTLS to the collector as primary defense (across-network deployment), NetworkPolicy as defense-in-depth, and source-side validation plus per-peer rate limiting for protocol::TraceContext on peer messages. Skips Basic Auth (wrong shape for multi-operator fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and body section, plus cross-refs from the privacy section in 02-design-decisions.md and the collector config in 05-configuration-reference.md so readers reach it from natural in-context entry points. Adds a backlink at the top of secure-OTel.md to the master plan. Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-23 07:00:21 +00:00 · 2026-05-28 12:33:16 +01:00
parent 4bd1176df5
commit 43258e8dc0
6 changed files with 275 additions and 13 deletions
--- a/OpenTelemetryPlan/02-design-decisions.md
+++ b/OpenTelemetryPlan/02-design-decisions.md
@@ -433,6 +433,8 @@ redact_peer_address=1 # Remove peer IP addresses

 > **Key Principle**: Telemetry collects **operational metadata** (timing, counts, hashes) — never **sensitive content** (keys, balances, amounts, raw payloads).

+> **See also**: [Securing the OTel Pipeline](./secure-OTel.md) covers transport-level protection for telemetry leaving the node — mTLS to the collector and validation of incoming peer trace context. Privacy controls in this section keep sensitive data out of spans; the security doc keeps the spans themselves out of untrusted hands.
+
 ---

 ## 2.5 Context Propagation Design
--- a/OpenTelemetryPlan/05-configuration-reference.md
+++ b/OpenTelemetryPlan/05-configuration-reference.md
@@ -405,6 +405,8 @@ endif()

 > **OTLP** = OpenTelemetry Protocol | **APM** = Application Performance Monitoring

+> **Production hardening**: The configurations in this section are starting points. For production deployments where rippled ships telemetry across a network to a centrally-hosted collector, see [Securing the OTel Pipeline](./secure-OTel.md) for the required mTLS receiver config, NetworkPolicy, and peer trace-context validation.
+
 ### 5.5.1 Development Configuration

 ```yaml
--- a/OpenTelemetryPlan/08-appendix.md
+++ b/OpenTelemetryPlan/08-appendix.md
@@ -170,19 +170,20 @@ flowchart TB

 ### Plan Documents

-| Document                                                         | Description                                  |
-| ---------------------------------------------------------------- | -------------------------------------------- |
-| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)                   | Master overview and executive summary        |
-| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md)       | Distributed tracing concepts and OTel primer |
-| [01-architecture-analysis.md](./01-architecture-analysis.md)     | xrpld architecture and trace points          |
-| [02-design-decisions.md](./02-design-decisions.md)               | SDK selection, exporters, span conventions   |
-| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis    |
-| [04-code-samples.md](./04-code-samples.md)                       | C++ code examples for all components         |
-| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config, CMake, Collector configs       |
-| [06-implementation-phases.md](./06-implementation-phases.md)     | Timeline, tasks, risks, success metrics      |
-| [07-observability-backends.md](./07-observability-backends.md)   | Backend selection and architecture           |
-| [08-appendix.md](./08-appendix.md)                               | Glossary, references, version history        |
-| [presentation.md](./presentation.md)                             | Slide deck for OTel plan overview            |
+| Document                                                         | Description                                        |
+| ---------------------------------------------------------------- | -------------------------------------------------- |
+| [OpenTelemetryPlan.md](./OpenTelemetryPlan.md)                   | Master overview and executive summary              |
+| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md)       | Distributed tracing concepts and OTel primer       |
+| [01-architecture-analysis.md](./01-architecture-analysis.md)     | xrpld architecture and trace points                |
+| [02-design-decisions.md](./02-design-decisions.md)               | SDK selection, exporters, span conventions         |
+| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure, performance analysis          |
+| [04-code-samples.md](./04-code-samples.md)                       | C++ code examples for all components               |
+| [05-configuration-reference.md](./05-configuration-reference.md) | xrpld config, CMake, Collector configs             |
+| [06-implementation-phases.md](./06-implementation-phases.md)     | Timeline, tasks, risks, success metrics            |
+| [07-observability-backends.md](./07-observability-backends.md)   | Backend selection and architecture                 |
+| [08-appendix.md](./08-appendix.md)                               | Glossary, references, version history              |
+| [secure-OTel.md](./secure-OTel.md)                               | Threat model and hardening (mTLS, peer validation) |
+| [presentation.md](./presentation.md)                             | Slide deck for OTel plan overview                  |

 ### Task Lists

--- a/OpenTelemetryPlan/OpenTelemetryPlan.md
+++ b/OpenTelemetryPlan/OpenTelemetryPlan.md
@@ -54,6 +54,7 @@ flowchart TB
        phases["06-implementation-phases.md"]
        backends["07-observability-backends.md"]
        appendix["08-appendix.md"]
+        secure["secure-OTel.md"]
        poc["POC_taskList.md"]
    end

@@ -70,6 +71,7 @@ flowchart TB
    config --> phases
    phases --> backends
    backends --> appendix
+    backends --> secure
    phases --> poc

    style overview fill:#1b5e20,stroke:#0d3d14,color:#fff,stroke-width:2px
@@ -86,6 +88,7 @@ flowchart TB
    style phases fill:#4a148c,stroke:#2e0d57,color:#fff
    style backends fill:#4a148c,stroke:#2e0d57,color:#fff
    style appendix fill:#4a148c,stroke:#2e0d57,color:#fff
+    style secure fill:#4a148c,stroke:#2e0d57,color:#fff
    style poc fill:#4a148c,stroke:#2e0d57,color:#fff
 ```

@@ -106,6 +109,7 @@ flowchart TB
 | **6**   | [Implementation Phases](./06-implementation-phases.md)     | 5-phase timeline, tasks, risks, success metrics                        |
 | **7**   | [Observability Backends](./07-observability-backends.md)   | Backend selection guide and production architecture                    |
 | **8**   | [Appendix](./08-appendix.md)                               | Glossary, references, version history                                  |
+| **Sec** | [Securing the OTel Pipeline](./secure-OTel.md)             | Threat model and hardening (mTLS, peer trace-context validation)       |
 | **POC** | [POC Task List](./POC_taskList.md)                         | Proof of concept tasks for RPC tracing end-to-end demo                 |

 ---
@@ -220,6 +224,14 @@ The appendix contains a glossary of OpenTelemetry and xrpld-specific terms, refe

 ---

+## Securing the OTel Pipeline
+
+Threat model and hardening guidance for production deployments where rippled nodes ship telemetry to a centrally-hosted collector across an untrusted network. Covers the two attack surfaces (collector ingress and peer trace-context spoofing) and the chosen defenses: mTLS as primary collector auth, NetworkPolicy as defense-in-depth, and source-side validation plus per-peer rate limiting for the `protocol::TraceContext` field on peer messages.
+
+➡️ **[View Securing the OTel Pipeline](./secure-OTel.md)**
+
+---
+
 ## POC Task List

 A step-by-step task list for building a minimal end-to-end proof of concept that demonstrates distributed tracing in xrpld. The POC scope is limited to RPC tracing — showing request traces flowing from xrpld through an OpenTelemetry Collector into Tempo, viewable in Grafana.
--- a/OpenTelemetryPlan/secure-OTel.md
+++ b/OpenTelemetryPlan/secure-OTel.md
@@ -0,0 +1,243 @@
+# Securing OpenTelemetry Against Trace Context Spoofing
+
+> **Part of**: [OpenTelemetry Implementation Plan](./OpenTelemetryPlan.md) — see also [Design Decisions § Privacy](./02-design-decisions.md#244-privacy--sensitive-data-policy) (what we don't collect) and [Configuration Reference § 5.5](./05-configuration-reference.md#55-opentelemetry-collector-configuration) (collector base config).
+
+Trace context spoofing (or poisoning) occurs when untrusted actors inject tampered or stale trace IDs into your system. If these requests are processed, the spans are appended to historical trace buckets, stretching trace durations, ruining p99 latency metrics, and breaking Grafana dashboards.
+
+This guide outlines two categories of defense: mitigating tampered contexts and locking down the OpenTelemetry (OTel) Collector to trusted clients only.
+
+---
+
+## Part 1: Mitigating Tampered Trace Contexts
+
+### 1. Perimeter Defense: Strip Headers at the API Gateway
+
+The most effective way to prevent spoofing from external sources is to treat your API Gateway (Envoy, NGINX, AWS ALB) as a hard boundary. Strip incoming W3C tracing headers (`traceparent`, `tracestate`) from public traffic so the gateway is forced to generate a fresh, legitimate `trace_id`.
+
+**NGINX Example (Stripping Headers):**
+
+```nginx
+server {
+    listen 80;
+
+    location / {
+        # Clear out untrusted incoming trace headers
+        proxy_set_header traceparent "";
+        proxy_set_header tracestate "";
+
+        proxy_pass http://backend_service;
+    }
+}
+```
+
+### 2. Timestamp-Anchored Trace IDs and OTTL Filtering
+
+If you use a custom trace ID generator that embeds a timestamp in the first few bytes (like AWS X-Ray or UUIDv7), you can use the OTel Collector's OpenTelemetry Transform Language (OTTL) to detect anomalies.
+
+**Collector Configuration (Conceptual OTTL Filter):**
+
+```yaml
+processors:
+  filter/stale_traces:
+    error_mode: ignore
+    traces:
+      span:
+        # The filter processor drops every span that matches a condition.
+        # Conceptual example: drop spans whose start time is more than one hour old.
+        # Note: Standard W3C trace IDs do not contain timestamps by default, so this
+        # checks the span start time rather than a timestamp embedded in the trace ID.
+        - 'Now() - start_time > Duration("1h")'
+```
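A minimal sketch of the timestamp-anchored check described in this subsection, assuming an X-Ray-style layout in which the first 4 bytes of the 16-byte trace ID hold big-endian epoch seconds. The names `TraceId`, `embeddedEpochSeconds`, and `isOutOfBounds` are hypothetical, and the check does not apply to standard random W3C trace IDs.

```cpp
#include <array>
#include <cstdint>

// Hypothetical alias: a raw 16-byte trace ID.
using TraceId = std::array<std::uint8_t, 16>;

// Reads bytes 0..3 as big-endian epoch seconds (assumed X-Ray-style layout).
inline std::uint32_t
embeddedEpochSeconds(TraceId const& id)
{
    return (std::uint32_t{id[0]} << 24) | (std::uint32_t{id[1]} << 16) |
        (std::uint32_t{id[2]} << 8) | std::uint32_t{id[3]};
}

// True when the embedded timestamp is further than toleranceSeconds from
// nowSeconds in either direction (stale or implausibly far in the future).
inline bool
isOutOfBounds(
    TraceId const& id,
    std::uint64_t nowSeconds,
    std::uint64_t toleranceSeconds)
{
    std::uint64_t const embedded = embeddedEpochSeconds(id);
    std::uint64_t const drift =
        embedded > nowSeconds ? embedded - nowSeconds : nowSeconds - embedded;
    return drift > toleranceSeconds;
}
```

Checking drift in both directions also catches IDs stamped in the future, which a one-sided "older than" comparison would miss.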
+
+## Part 2: Restricting Access to the OTel Collector
+
+Locking down the Collector ensures that only authenticated, trusted clients can submit telemetry data.
+
+### Approach A: Network Layer Security (Kubernetes Network Policies)
+
+Ensure your Collector is not exposed to the public internet. If running in Kubernetes, use a NetworkPolicy to restrict ingress traffic to specific namespaces.
+
+**Kubernetes NetworkPolicy Example:**
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-internal-otel
+  namespace: observability
+spec:
+  podSelector:
+    matchLabels:
+      app: opentelemetry-collector
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              environment: production
+      ports:
+        - protocol: TCP
+          port: 4317 # gRPC
+        - protocol: TCP
+          port: 4318 # HTTP
+```
+
+### Approach B: Transport Layer Security (Mutual TLS / mTLS)
+
+Require clients to present a valid cryptographic certificate to connect to the Collector.
+
+**Collector Configuration (mTLS):**
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+        tls:
+          client_ca_file: /certs/client_ca.pem # CA that signs trusted client certs
+          cert_file: /certs/collector.pem
+          key_file: /certs/collector.key
+          auth_type: require_and_verify_client_cert # Rejects unauthorized clients
+```
+
+### Approach C: Application Layer Authentication (Basic Auth Extension)
+
+Use the Collector's extension system to require an API key or Basic Auth credentials.
+
+**Collector Configuration (Basic Auth):**
+
+```yaml
+extensions:
+  basicauth/collector:
+    htpasswd:
+      # username: trusted-client, password: SecurePassword123 (example only)
+      inline: |
+        trusted-client:$apr1$4v8p76o6$DMTX5Wv6uOmrFAZp2X1N1.
+
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+        auth:
+          authenticator: basicauth/collector
+
+processors:
+  batch:
+
+exporters:
+  otlp:
+    endpoint: my-backend-storage:4317
+
+service:
+  extensions: [basicauth/collector]
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors: [batch]
+      exporters: [otlp]
+```
+
+**Client Setup (Environment Variables):**
+
+Developers must pass the authentication header using the standard OTel SDK environment variables:
+
+```bash
+# Base64 encoded "trusted-client:SecurePassword123"
+export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic dHJ1c3RlZC1jbGllbnQ6U2VjdXJlUGFzc3dvcmQxMjM="
+```
+
+---
+
+Available routes to build on top of: https://github.com/XRPLF/rippled/pull/6425#discussion_r3234751995
+
+---
+
+# Analysis: Applying the Guide to rippled
+
+The guide above is written for HTTP-fronted web services. rippled is a P2P node daemon, so the threat model and the applicable defenses differ. This section captures how each approach maps to rippled and the chosen direction.
+
+## Threat Model
+
+rippled has **two distinct attack surfaces**, not one. The original guide conflates them under "trace context spoofing"; for rippled they need separate defenses.
+
+| Surface                                     | Attacker                                                 | Vector                                                                                                                                                                                     | Defense                                       |
+| ------------------------------------------- | -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------- |
+| **Collector ingress** (rippled → collector) | Anyone who can reach `4317`/`4318` on the collector host | Forged OTLP traffic, telemetry exfiltration, DoS on collector                                                                                                                              | mTLS + network policy                         |
+| **Peer trace context** (peer → rippled)     | Malicious peer in the XRPL overlay                       | Crafted `protocol::TraceContext` field inside peer protobuf messages (TMTransaction, consensus, etc.) — used to forge `trace_id`/`span_id`, pollute p99, attach spans to historical traces | Validate + rate-limit at the receive boundary |
+
+**Deployment context:** Across-network. rippled nodes (potentially run by external operators or in different DCs) ship telemetry to a centrally-hosted collector across an untrusted network. The collector is NOT on the same host or private VPC as every node.
+
+```
+                ┌── peer (untrusted) ── TMTransaction{trace_context} ──▶ rippled
+                │                                                            │
+                │                                              [validate + rate-limit]
+                │                                                            │
+                │                                                            ▼
+                │                                                     SpanGuard (clean)
+                │                                                            │
+                │                                                            │ OTLP/gRPC
+                │                                                            │ + mTLS
+                │                                                            ▼
+                └─────────────────────────────────────────  [require_and_verify_client_cert]
+                                                                  OTel Collector
+                                                              (in private subnet, NetPol)
+```
+
+## Part 1 Applicability — Peer Trace-Context Validation
+
+The guide's NGINX header stripping and OTTL stale-span filtering target HTTP gateways and post-hoc cleanup. Neither fits rippled directly:
+
+- **NGINX header stripping** — N/A. There is no HTTP gateway between peers and rippled; trace context arrives inside protobuf peer messages (`protocol::TraceContext`), not as W3C `traceparent` headers. See [src/xrpld/telemetry/PropagationHelpers.h](../src/xrpld/telemetry/PropagationHelpers.h).
+- **OTTL stale-span filtering** — Weak fit. Post-hoc cleanup at the collector loses peer identity (you can't tell _which_ peer poisoned the trace). Validation at the receive site is stronger.
+
+**rippled-specific Part 1 mitigations:**
+
+1. **Validate extracted context at the boundary** in [src/xrpld/telemetry/ConsensusReceiveTracing.h](../src/xrpld/telemetry/ConsensusReceiveTracing.h) and any other peer-message receive site. Reject if `trace_id` is all-zero, wrong length, or fails W3C format checks. Treat invalid context as "no propagated context" — start a fresh span — rather than dropping the message.
+2. **Per-peer sample rate limiting** so a hostile peer cannot flood the collector with spans bearing a fabricated `trace_id`. Use probabilistic sampling on the receive path keyed by peer identity.
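A minimal sketch of the boundary check in item 1, assuming a hypothetical `RawTraceContext` struct that stands in for the decoded `protocol::TraceContext` field. The real message layout and the API in `ConsensusReceiveTracing.h` are not shown here, so every name below is illustrative. The 16-byte and 8-byte lengths and the all-zero rule follow the W3C Trace Context specification.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in for the decoded protocol::TraceContext field.
struct RawTraceContext
{
    std::string traceId;  // expected: 16 raw bytes
    std::string spanId;   // expected: 8 raw bytes
};

// Context that passed validation and may be used as a span parent.
struct ValidatedContext
{
    std::string traceId;
    std::string spanId;
};

// True when every byte is zero; W3C Trace Context treats all-zero IDs as invalid.
inline bool
allZero(std::string const& bytes)
{
    return std::all_of(
        bytes.begin(), bytes.end(), [](char c) { return c == 0; });
}

// Returns the context when lengths and non-zero checks pass. Otherwise
// returns std::nullopt, which the caller treats as "no propagated context"
// and starts a fresh span. The peer message itself is still processed.
inline std::optional<ValidatedContext>
validateTraceContext(RawTraceContext const& raw)
{
    constexpr std::size_t kTraceIdBytes = 16;
    constexpr std::size_t kSpanIdBytes = 8;

    if (raw.traceId.size() != kTraceIdBytes || allZero(raw.traceId))
        return std::nullopt;
    if (raw.spanId.size() != kSpanIdBytes || allZero(raw.spanId))
        return std::nullopt;
    return ValidatedContext{raw.traceId, raw.spanId};
}
```

Returning `std::nullopt` instead of an error keeps the policy stated in item 1: an invalid context leads to a fresh span, not a dropped message.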
+
+## Part 2 — Comparison of Collector Hardening Approaches
+
+Evaluated for the across-network deployment shape:
+
+| Approach                        | Across-network fit                                                                                                                                                                                                                                                                              | Cost                                                                 | Verdict                            |
+| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- | ---------------------------------- |
+| **A. NetworkPolicy / firewall** | Necessary baseline (don't expose `4317`/`4318` to the internet), but insufficient on its own when traffic genuinely crosses networks — you cannot NetworkPolicy the public internet.                                                                                                            | Cheap.                                                               | **Defense-in-depth, not primary.** |
+| **B. mTLS**                     | Strongest fit. Every rippled node holds a client cert; collector verifies with `require_and_verify_client_cert`. Encrypts in transit (raw OTLP over the internet leaks transaction patterns and validator identity). Compromised node = revoke one cert, no shared secret to rotate everywhere. | Cert issuance + rotation pipeline.                                   | **Primary.**                       |
+| **C. Basic Auth**               | Worst shape for this topology. Single shared password across all rippled nodes — one leaked node config compromises the whole fleet. Doesn't encrypt; you'd need TLS underneath anyway, at which point you're 80% of the way to mTLS.                                                           | Cheap to set up, expensive to operate (rotation across N operators). | **Skip.**                          |
+
+## Decision
+
+**Primary defense:** mTLS (Approach B) on the collector's OTLP receivers, with `auth_type: require_and_verify_client_cert`.
+
+**Defense-in-depth:** NetworkPolicy / firewall rules (Approach A) so `4317`/`4318` are never reachable from outside the expected operator subnets even if mTLS were misconfigured.
+
+**Skipped:** Basic Auth (Approach C) — wrong shape for an across-network, multi-operator topology.
+
+**Plus rippled-specific Part 1 work:** trace-context validation and per-peer rate limiting at peer-message receive sites.
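One way to picture the per-peer rate limiting named in this decision is sketched below. The analysis calls for probabilistic sampling keyed by peer identity; this sketch substitutes a simpler fixed-window budget per peer as a stand-in, and `PeerTraceBudget`, its parameters, and the string peer key are all hypothetical. It is not thread-safe and never evicts idle peers, both of which a real implementation would need to address.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-peer budget for honoring propagated trace context.
class PeerTraceBudget
{
public:
    using Clock = std::chrono::steady_clock;

    PeerTraceBudget(std::uint32_t maxPerWindow, Clock::duration window)
        : maxPerWindow_(maxPerWindow), window_(window)
    {
    }

    // Returns true while the peer is within its budget for the current
    // window. When false, the caller ignores the propagated context and
    // starts a fresh local span; the message is still processed.
    bool
    allow(std::string const& peerId, Clock::time_point now)
    {
        Slot& slot = slots_[peerId];
        if (now - slot.windowStart >= window_)
        {
            // Window elapsed (or first sighting): start a new window.
            slot.windowStart = now;
            slot.used = 0;
        }
        if (slot.used >= maxPerWindow_)
            return false;
        ++slot.used;
        return true;
    }

private:
    struct Slot
    {
        Clock::time_point windowStart{};
        std::uint32_t used = 0;
    };

    std::uint32_t maxPerWindow_;
    Clock::duration window_;
    std::unordered_map<std::string, Slot> slots_;
};
```

Passing `now` in as a parameter keeps the limiter deterministic under test. Because the budget is keyed by peer, one hostile peer exhausting its allowance does not affect context propagation from other peers.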
+
+## Decisions Made
+
+| Decision             | Choice                           | Rationale                                                                                                                                                                                                                                                     |
+| -------------------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Cert source for mTLS | **Reuse XRPL node identity key** | One identity per node, no separate PKI to operate. Fits XRPL's existing trust model; requires small CA tooling step to derive/sign the OTel client cert from the node key.                                                                                    |
+| Part 1 scope         | **Include in this spec**         | Collector hardening and peer trace-context validation share one threat model. Coherent design doc; can still be split into multiple PRs at implementation.                                                                                                    |
+| Dev impact           | **Production-only**              | Local `docker/telemetry/docker-compose.yml` keeps `insecure: true` and no auth for fast iteration. Only production deployment manifests gain mTLS. Accepted risk: minor dev/prod drift, mitigated by integration tests against a TLS-enabled collector in CI. |
+
+## Out of Scope
+
+- NGINX/Envoy header stripping (no HTTP gateway in front of rippled-to-collector traffic).
+- OTTL stale-span filtering at the collector (weaker than source validation; loses peer identity).
+- Local development docker-compose hardening.
+- Telemetry backend (Tempo) hardening — separate concern, downstream of the collector.
+
+## Next Step
+
+Write this up as a design doc with full sections covering:
+
+1. Threat model & architecture (this section, expanded)
+2. Collector hardening — mTLS config, NetworkPolicy
+3. Cert pipeline — deriving OTel client cert from XRPL node key
+4. Peer trace-context validation — receive-site checks in `ConsensusReceiveTracing.h`
+5. Per-peer span rate limiting
+6. Testing & rollout
--- a/cspell.config.yaml
+++ b/cspell.config.yaml
@@ -103,6 +103,7 @@ words:
  - enabled
  - endmacro
  - exceptioned
+  - exfiltration
  - Falco
  - fcontext
  - finalizers
@@ -118,6 +119,7 @@ words:
  - gpgkey
  - hotwallet
  - hicpp
+  - htpasswd
  - hwaddress
  - hwrap
  - ifndef