# Telemetry Workload Tools

Synthetic workload generation and validation tools for xrpld's OpenTelemetry telemetry stack. These tools validate that all spans, metrics, dashboards, and log-trace correlation work end-to-end under controlled load.

## Quick Start

```bash
# Build xrpld with telemetry enabled
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default

# Run full validation (starts everything, runs load, validates)
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld

# Cleanup when done
docker/telemetry/workload/run-full-validation.sh --cleanup
```

## Architecture

The validation suite runs a multi-node xrpld cluster as local processes alongside
a Docker Compose telemetry stack. The cluster exercises consensus, peer-to-peer
spans (proposals, validations), and all metric pipelines.

```
run-full-validation.sh (shell orchestrator)
  |
  |-- docker-compose.workload.yaml
  |     |-- otel-collector (traces via OTLP + StatsD receiver)
  |     |-- tempo (trace backend + TraceQL search API)
  |     |-- prometheus (metrics scraping)
  |     |-- grafana (dashboards, provisioned automatically)
  |
  |-- generate-validator-keys.sh
  |     -> validator-keys.json, validators.txt
  |
  |-- Nx xrpld nodes (local processes, full telemetry)
  |     - Each node: [telemetry] enabled=1, trace_rpc/consensus/transactions
  |     - [signing_support] true (server-side signing for tx_submitter)
  |     - Peer discovery via [ips] (not [ips_fixed]) for active peer counts
  |
  |-- workload_orchestrator.py (phased load execution)
  |     |-- rpc_load_generator.py (WebSocket RPC traffic)
  |     |-- tx_submitter.py (transaction diversity)
  |     -> workload-report.json + per-phase reports
  |
  |-- validate_telemetry.py (pass/fail checks)
  |     -> validation-report.json
  |
  |-- benchmark.sh (baseline vs telemetry comparison)
        -> benchmark-report-*.md
```

## Workload Profiles

The workload orchestrator (`workload_orchestrator.py`) reads named profiles
from `workload-profiles.json` and executes sequential load phases. Within
each phase, the RPC generator and TX submitter run concurrently.

### Available Profiles

| Profile           | Phases | Duration                     | Purpose                                                     |
| ----------------- | ------ | ---------------------------- | ----------------------------------------------------------- |
| `full-validation` | 6      | ~5 min + 1 min propagation   | Full 18-dashboard coverage with burst/idle/plateau patterns |
| `quick-smoke`     | 1      | ~30s + 30s propagation       | Fast CI smoke test                                          |
| `stress`          | 3      | ~3.5 min + 1 min propagation | Heavy sustained load for benchmarking                       |

### full-validation Phases

| Phase        | RPC Rate | TX TPS | Duration | Dashboard Coverage                              |
| ------------ | -------- | ------ | -------- | ----------------------------------------------- |
| warmup       | 5 RPS    | —      | 30s      | Node Health, Validator Health (baseline gauges) |
| steady-state | 30 RPS   | 3 TPS  | 60s      | All dashboards (plateau data)                   |
| rpc-burst    | 100 RPS  | —      | 30s      | Job Queue, RPC Performance (latency spikes)     |
| tx-flood     | 5 RPS    | 20 TPS | 30s      | Fee Market & TxQ, Transaction Overview          |
| mixed-peak   | 50 RPS   | 10 TPS | 60s      | Consensus Health, Ledger Operations             |
| cooldown     | 5 RPS    | —      | 30s      | Recovery patterns, state transitions            |

### Custom Profiles

Add profiles to `workload-profiles.json`:

```json
{
  "profiles": {
    "my-custom": {
      "description": "Custom profile for specific testing",
      "phases": [
        {
          "name": "phase-name",
          "description": "What this phase exercises",
          "duration_sec": 60,
          "rpc": { "rate": 50, "weights": { "server_info": 80, "fee": 20 } },
          "tx": { "tps": 5, "weights": { "Payment": 100 } }
        }
      ],
      "propagation_wait_sec": 30
    }
  }
}
```

Set `"rpc"` or `"tx"` to `null` to skip that generator for a phase.
Custom `"weights"` override the default command/transaction distribution.
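As a rough illustration of the profile schema above, the following sketch computes a profile's total wall-clock time and determines which generators a phase would start. The helper names `profile_duration_sec` and `active_generators` are hypothetical and are not part of `workload_orchestrator.py`; the field names come from the JSON example.

```python
def profile_duration_sec(profile: dict) -> int:
    """Sum all phase durations plus the trailing propagation wait."""
    phases = profile.get("phases", [])
    total = sum(phase["duration_sec"] for phase in phases)
    return total + profile.get("propagation_wait_sec", 0)


def active_generators(phase: dict) -> list[str]:
    """Return the generators a phase would run; a null (None) entry is skipped."""
    return [key for key in ("rpc", "tx") if phase.get(key) is not None]


# The custom profile from the JSON example, expressed as a Python dict.
MY_CUSTOM = {
    "description": "Custom profile for specific testing",
    "phases": [
        {
            "name": "phase-name",
            "duration_sec": 60,
            "rpc": {"rate": 50, "weights": {"server_info": 80, "fee": 20}},
            "tx": {"tps": 5, "weights": {"Payment": 100}},
        }
    ],
    "propagation_wait_sec": 30,
}
```

For the example profile this gives 60 seconds of load plus 30 seconds of propagation wait.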

## Tools Reference

### run-full-validation.sh

Orchestrates the complete validation pipeline. Starts the telemetry stack, starts a multi-node xrpld cluster, generates load, and validates the results.

```bash
# Full validation with defaults (uses full-validation profile)
./run-full-validation.sh --xrpld /path/to/xrpld

# Quick smoke test
./run-full-validation.sh --xrpld /path/to/xrpld --profile quick-smoke

# Stress test with benchmarks
./run-full-validation.sh --xrpld /path/to/xrpld --profile stress --with-benchmark

# Skip Loki checks (if Phase 8 is not deployed)
./run-full-validation.sh --xrpld /path/to/xrpld --skip-loki
```

### workload_orchestrator.py

Reads a named profile from `workload-profiles.json` and executes sequential
load phases. Within each phase, `rpc_load_generator.py` and `tx_submitter.py`
run as concurrent subprocesses. Produces per-phase reports and a combined
summary.

```bash
# Run with a specific profile
python3 workload_orchestrator.py --profile full-validation

# Multiple endpoints
python3 workload_orchestrator.py --profile full-validation \
    --endpoints ws://localhost:6006 ws://localhost:6007

# Save combined report
python3 workload_orchestrator.py --profile stress --report /tmp/report.json
```
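A minimal sketch of how one phase could be turned into two concurrent subprocesses. It is an illustration, not the orchestrator's actual code: `phase_commands` and `run_phase` are hypothetical names, and sending transactions to the first endpoint only is an assumption. The command-line flags are the ones shown in the generator sections of this README.

```python
import json
import subprocess


def phase_commands(phase: dict, endpoints: list[str]) -> list[list[str]]:
    """Build one argv list per generator enabled for this phase."""
    duration = str(phase["duration_sec"])
    commands = []
    rpc = phase.get("rpc")
    if rpc is not None:
        cmd = ["python3", "rpc_load_generator.py", "--endpoints", *endpoints,
               "--rate", str(rpc["rate"]), "--duration", duration]
        if rpc.get("weights"):
            cmd += ["--weights", json.dumps(rpc["weights"])]
        commands.append(cmd)
    tx = phase.get("tx")
    if tx is not None:
        # Assumption: the TX submitter targets the first endpoint only.
        cmd = ["python3", "tx_submitter.py", "--endpoint", endpoints[0],
               "--tps", str(tx["tps"]), "--duration", duration]
        if tx.get("weights"):
            cmd += ["--weights", json.dumps(tx["weights"])]
        commands.append(cmd)
    return commands


def run_phase(phase: dict, endpoints: list[str]) -> list[int]:
    """Start every generator at once, then wait for all of them to exit."""
    procs = [subprocess.Popen(cmd) for cmd in phase_commands(phase, endpoints)]
    return [proc.wait() for proc in procs]
```

Starting all processes before waiting on any of them is what makes the two generators overlap within a phase.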

### rpc_load_generator.py

Generates RPC traffic that approximates a realistic production distribution.
Uses xrpld's **native WebSocket command format** (`{"command": ...}`) with flat
parameters, the same format as `tx_submitter.py`. The default command mix is:

- 40% health checks (server_info, fee)
- 30% wallet queries (account_info, account_lines, account_objects)
- 15% explorer queries (ledger, ledger_data)
- 10% transaction lookups (tx, account_tx)
- 5% DEX queries (book_offers, amm_info)

```bash
# Basic usage
python3 rpc_load_generator.py --endpoints ws://localhost:6006 --rate 50 --duration 120

# Multiple endpoints (round-robin)
python3 rpc_load_generator.py \
    --endpoints ws://localhost:6006 ws://localhost:6007 \
    --rate 100 --duration 300

# Custom weights
python3 rpc_load_generator.py --endpoints ws://localhost:6006 \
    --weights '{"server_info": 80, "account_info": 20}'
```
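The weighted command mix can be sketched with `random.choices` from the standard library. This is an illustration only: splitting each category's share evenly across its commands is an assumption, and `default_weights` and `pick_commands` are hypothetical names rather than functions from `rpc_load_generator.py`.

```python
import random

# Category shares and commands as listed in this section.
CATEGORY_WEIGHTS = {
    "health": (40, ["server_info", "fee"]),
    "wallet": (30, ["account_info", "account_lines", "account_objects"]),
    "explorer": (15, ["ledger", "ledger_data"]),
    "tx_lookup": (10, ["tx", "account_tx"]),
    "dex": (5, ["book_offers", "amm_info"]),
}


def default_weights() -> dict[str, float]:
    """Split each category's share evenly across its commands (an assumption)."""
    weights = {}
    for share, commands in CATEGORY_WEIGHTS.values():
        for command in commands:
            weights[command] = share / len(commands)
    return weights


def pick_commands(weights: dict[str, float], count: int,
                  rng: random.Random) -> list[str]:
    """Draw `count` command names in proportion to their weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=count)
```

Passing a custom dict such as `{"server_info": 80, "account_info": 20}` to `pick_commands` mirrors what the `--weights` flag expresses on the command line.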

### tx_submitter.py

Submits diverse transaction types to exercise the full span and metric surface.
Uses xrpld's **native WebSocket command format** (`{"command": ...}`) rather
than JSON-RPC format. The response payload is inside the `"result"` key, with
`"status"` at the top level.

Supported transaction types:

- Payment (XRP transfers) — exercises `tx.process`, `tx.receive`, `tx.apply`
- OfferCreate / OfferCancel (DEX activity)
- TrustSet (trust line creation)
- NFTokenMint / NFTokenCreateOffer (NFT activity)
- EscrowCreate / EscrowFinish (escrow lifecycle)
- AMMCreate / AMMDeposit (AMM pool operations)

Requires `[signing_support] true` in the node config for server-side signing.

```bash
# Basic usage
python3 tx_submitter.py --endpoint ws://localhost:6006 --tps 5 --duration 120

# Custom mix
python3 tx_submitter.py --endpoint ws://localhost:6006 \
    --weights '{"Payment": 60, "OfferCreate": 20, "TrustSet": 20}'
```
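To make the difference between the two request formats concrete, the sketch below builds the same call in the native WebSocket format and in JSON-RPC format, then unwraps a WebSocket response. The helper names are hypothetical; the message shapes follow the description above (`"status"` at the top level, payload under `"result"`).

```python
import json


def build_ws_request(request_id: int, command: str, **params) -> str:
    """Native WebSocket format: command name plus flat parameters."""
    return json.dumps({"id": request_id, "command": command, **params})


def build_jsonrpc_request(method: str, **params) -> str:
    """JSON-RPC format, for comparison: parameters nested in a one-element list."""
    return json.dumps({"method": method, "params": [params]})


def unwrap_ws_response(raw: str) -> dict:
    """Return the payload under "result", raising if the top-level status is not success."""
    message = json.loads(raw)
    if message.get("status") != "success":
        raise RuntimeError(f"request failed: {message.get('error', 'unknown error')}")
    return message["result"]
```

A caller would send the string from `build_ws_request` over the socket and pass the reply to `unwrap_ws_response`.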

### validate_telemetry.py

Validates that all expected telemetry data exists. Every listed metric and span is required: if one never fires, validation fails.

- **Span validation**: All span types from `expected_spans.json` with required attributes and parent-child hierarchies
- **Metric validation**: All metrics from `expected_metrics.json` — SpanMetrics, StatsD gauges/counters/histograms, Phase 9 OTLP metrics. Every listed metric must have > 0 series. Uses the Prometheus `/api/v1/series` endpoint (not instant queries) to avoid false negatives from stale gauges.
- **Log-trace correlation**: trace_id/span_id in Loki logs (requires Loki)
- **Dashboard validation**: All provisioned Grafana dashboards load with panels

```bash
# Run all validations
python3 validate_telemetry.py --report /tmp/report.json

# Skip Loki checks
python3 validate_telemetry.py --skip-loki --report /tmp/report.json
```
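The series-based metric check can be sketched as follows. The endpoint path and the `match[]`, `start`, and `end` parameters are part of the Prometheus HTTP API; the helper names and the base URL used in the usage note are assumptions, and the real script may structure this differently.

```python
from urllib.parse import urlencode


def series_url(base_url: str, metric: str, start: float, end: float) -> str:
    """Build a /api/v1/series query for one metric over a time range."""
    query = urlencode({"match[]": metric, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/series?{query}"


def series_count(payload: dict) -> int:
    """Count the series in a decoded /api/v1/series response body."""
    if payload.get("status") != "success":
        return 0
    return len(payload.get("data", []))


def metric_present(payload: dict) -> bool:
    """A metric passes when at least one series exists in the range."""
    return series_count(payload) > 0
```

For example, `series_url("http://localhost:9090", "some_metric", start, end)` yields a URL that can be fetched with any HTTP client. Because the query covers a time range, a gauge that stopped emitting earlier in the run still counts as present.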

### OTel Timings Regression Gate

`capture_timings.py` + `compare_to_baseline.py` implement a regression gate
that compares OTel-derived per-span/per-RPC/per-job timings against a
committed baseline. Unlike `benchmark.sh` (which measures the overhead of
enabling telemetry on the current binary), this gate catches **xrpld
performance regressions over time** by diffing against a stored baseline
from a prior run.

How it runs inside the validation pipeline:

1. `run-full-validation.sh` executes the normal workload and validation suite.
2. After validation, `capture_timings.py` queries Prometheus for every
   metric in `regression-metrics.json` and writes `reports/timings.json`.
3. `compare_to_baseline.py` reads `timings.json`,
   `baselines/baseline-timings.json`, and `regression-thresholds.json`,
   then either:
   - Prints the paste-me JSON block (when the baseline is a placeholder
     or empty) and exits 0.
   - Prints a delta table, writes `reports/regression-report.json`, and
     exits non-zero if any metric breached both the percentage AND
     absolute bound.
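The "both bounds" rule in step 3 can be sketched as below. This is a simplified model, not the code in `compare_to_baseline.py`: it assumes higher values are worse, uses strict comparisons, and represents thresholds as hypothetical `(pct_bound, abs_bound)` tuples keyed by metric name.

```python
def is_regression(baseline: float, current: float,
                  pct_bound: float, abs_bound: float) -> bool:
    """Flag a metric only when it exceeds the percentage AND the absolute bound."""
    delta = current - baseline
    if delta <= 0:
        return False  # unchanged or faster
    pct = delta / baseline * 100.0 if baseline > 0 else float("inf")
    return pct > pct_bound and delta > abs_bound


def gate_exit_code(timings: dict, baseline: dict, thresholds: dict) -> int:
    """Return 0 for an empty baseline or a clean run, 1 if any metric regressed."""
    if not baseline:
        return 0  # placeholder baseline: nothing to compare against
    breached = [
        name for name, current in timings.items()
        if name in baseline and name in thresholds
        and is_regression(baseline[name], current, *thresholds[name])
    ]
    return 1 if breached else 0
```

Requiring both bounds keeps small absolute changes on very fast operations, and small relative changes on slow ones, from failing the gate.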

Bootstrapping a baseline:

1. Push the branch. The `Telemetry Validation` CI run prints the full
   timings JSON under "Paste into `baselines/baseline-timings.json`" in
   the workflow Step Summary.
2. Open a PR copying that JSON block verbatim into
   `baselines/baseline-timings.json`. Reviewer approval is the audit gate.
3. Subsequent runs compare against it; the gate fails on regression.

Per-run tuning:

- `--skip-regression` disables the gate (local exploration only).
- `REGRESSION_WINDOW` env var overrides the default Prometheus `rate()`
  window (`3m`). Keep it close to the workload duration.
- Metric surface lives in `regression-metrics.json`; thresholds in
  `regression-thresholds.json`; both are reviewed changes.

See [`baselines/README.md`](./baselines/README.md) for the baseline
lifecycle and refresh process.

### benchmark.sh

Compares baseline (no telemetry) vs telemetry-enabled performance:

```bash
./benchmark.sh --xrpld /path/to/xrpld --duration 300
```

Thresholds (configurable via environment):

| Metric            | Threshold | Env Variable                |
| ----------------- | --------- | --------------------------- |
| CPU overhead      | < 3%      | BENCH_CPU_OVERHEAD_PCT      |
| Memory overhead   | < 5MB     | BENCH_MEM_OVERHEAD_MB       |
| RPC p99 latency   | < 2ms     | BENCH_RPC_LATENCY_IMPACT_MS |
| Throughput impact | < 5%      | BENCH_TPS_IMPACT_PCT        |
| Consensus impact  | < 1%      | BENCH_CONSENSUS_IMPACT_PCT  |
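As a model of how environment overrides interact with the defaults in the table, the sketch below resolves each threshold from a mapping such as `os.environ`. `benchmark.sh` is a shell script, so this Python version is only an illustration; `load_thresholds` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the thresholds table.
DEFAULT_THRESHOLDS = {
    "BENCH_CPU_OVERHEAD_PCT": 3.0,
    "BENCH_MEM_OVERHEAD_MB": 5.0,
    "BENCH_RPC_LATENCY_IMPACT_MS": 2.0,
    "BENCH_TPS_IMPACT_PCT": 5.0,
    "BENCH_CONSENSUS_IMPACT_PCT": 1.0,
}


def load_thresholds(env: Optional[Mapping[str, str]] = None) -> dict[str, float]:
    """Use the environment value when set, otherwise the table default."""
    env = os.environ if env is None else env
    return {
        name: float(env.get(name, default))
        for name, default in DEFAULT_THRESHOLDS.items()
    }
```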

## Reading Validation Reports

The validation report (`validation-report.json`) is structured as:

```json
{
  "summary": {
    "total": 45,
    "passed": 42,
    "failed": 3,
    "all_passed": false
  },
  "checks": [
    {
      "name": "span.rpc.ws_message",
      "category": "span",
      "passed": true,
      "message": "rpc.ws_message: 15 traces found",
      "details": { "trace_count": 15 }
    }
  ]
}
```

Categories:

- **span**: Span type existence and attribute validation
- **metric**: Prometheus metric existence
- **log**: Log-trace correlation checks
- **dashboard**: Grafana dashboard accessibility
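A short sketch for triaging a report: it groups failing checks by category so the affected subsystem is visible at a glance. The function name is hypothetical; the field names come from the report structure shown above.

```python
from collections import defaultdict


def failed_checks_by_category(report: dict) -> dict[str, list[str]]:
    """Group "name: message" strings for every failing check by category."""
    failures = defaultdict(list)
    for check in report.get("checks", []):
        if not check.get("passed", False):
            category = check.get("category", "unknown")
            failures[category].append(f"{check['name']}: {check.get('message', '')}")
    return dict(failures)
```

Load the file with `json.load(open("validation-report.json"))` and pass the result to the function; an empty dict means every check passed.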

## CI Integration

The validation runs as a GitHub Actions workflow (`.github/workflows/telemetry-validation.yml`):

- Triggered manually or on pushes to telemetry branches
- Builds xrpld, starts the full stack, runs load, validates
- Uploads reports as artifacts
- Posts summary to PR

## Configuration Files

| File                              | Purpose                                                       |
| --------------------------------- | ------------------------------------------------------------- |
| `workload-profiles.json`          | Named load profiles with phase definitions                    |
| `expected_spans.json`             | Span inventory (names, attributes, hierarchies, config flags) |
| `expected_metrics.json`           | Metric inventory — every listed metric must be present        |
| `test_accounts.json`              | Test account roles (keys generated at runtime)                |
| `regression-metrics.json`         | Metric surface for the OTel regression gate                   |
| `regression-thresholds.json`      | Per-metric regression bounds (pct AND abs)                    |
| `baselines/baseline-timings.json` | Committed baseline — populated from first CI run              |
| `requirements.txt`                | Python dependencies                                           |

### expected_metrics.json Format

```json
{
  "category_name": {
    "description": "Human-readable description.",
    "metrics": ["metric_1", "metric_2"]
  }
}
```

Every metric listed must produce > 0 Prometheus series during the validation run. If a metric does not fire, adjust the workload so the generators produce enough load to trigger it.
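Under the format above, the inventory can be flattened and compared against observed series counts roughly as follows. The function names and the `series_counts` mapping (metric name to number of series found) are hypothetical.

```python
def all_expected_metrics(inventory: dict) -> list[str]:
    """Flatten every category's "metrics" list into one sorted, de-duplicated list."""
    names = set()
    for category in inventory.values():
        names.update(category.get("metrics", []))
    return sorted(names)


def missing_metrics(inventory: dict, series_counts: dict[str, int]) -> list[str]:
    """Return the expected metrics that produced no series at all."""
    return [
        name for name in all_expected_metrics(inventory)
        if series_counts.get(name, 0) == 0
    ]
```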

### expected_spans.json Format

Each span entry defines its name, category, parent (for hierarchy validation),
required attributes, and the `config_flag` that must be enabled:

```json
{
  "name": "rpc.command.*",
  "category": "rpc",
  "parent": "rpc.process",
  "required_attributes": ["command", "version", "rpc_role", "rpc_status"],
  "config_flag": "trace_rpc"
}
```
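A sketch of how one observed span might be checked against such an entry. Treating `*` in `name` as a shell-style wildcard is an assumption about the matching rule, and `missing_attributes` is a hypothetical helper, not a function from `validate_telemetry.py`.

```python
from fnmatch import fnmatchcase
from typing import Optional

EXPECTED = {
    "name": "rpc.command.*",
    "category": "rpc",
    "parent": "rpc.process",
    "required_attributes": ["command", "version", "rpc_role", "rpc_status"],
    "config_flag": "trace_rpc",
}


def missing_attributes(expected: dict, span_name: str,
                       span_attributes: dict) -> Optional[list[str]]:
    """Return required attributes absent from the span, or None if the name does not match."""
    # Assumption: "*" in the expected name is a glob-style wildcard.
    if not fnmatchcase(span_name, expected["name"]):
        return None
    return [attr for attr in expected["required_attributes"]
            if attr not in span_attributes]
```

An empty list means the span satisfies the entry's attribute requirements; parent-child hierarchy checks would need the trace structure as well and are not covered here.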

## Node Configuration Notes

The orchestrator (`run-full-validation.sh`) generates node configs with:

- `[telemetry] enabled=1` with all trace categories (`trace_rpc`, `trace_consensus`, `trace_transactions`)
- `[signing_support] true` — required for `tx_submitter.py` to submit signed transactions via WebSocket
- `[ips]` (not `[ips_fixed]`) — ensures peer connections are counted in `Peer_Finder_Active_Inbound/Outbound_Peers` metrics (fixed peers are excluded from these counters by design)

## StatsD Gauge Behaviour

Beast::insight StatsD gauges only emit when their value _changes_ from the previous sample. This can cause two problems in the validation environment:

1. **Initial-zero gauges**: if a gauge value is 0 from startup and never changes, the gauge would never emit. To address this, `StatsDGaugeImpl` initializes `m_dirty = true`, ensuring the first flush always emits the initial value.
2. **Stale gauges**: once a gauge stabilizes (e.g., peer count stays at 1), it stops emitting new data points, and Prometheus marks it stale after ~5 minutes. The validation script uses the Prometheus `/api/v1/series` endpoint instead of instant queries so that such gauges are still detected.
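The emit-on-change behaviour described above can be modelled in a few lines. This is a simplified Python model, not the C++ `StatsDGaugeImpl`: the class name and methods are hypothetical, and only the dirty-flag logic mirrors the description.

```python
from typing import Optional


class ChangeOnlyGauge:
    """Gauge that emits on flush only when its value changed since the last flush."""

    def __init__(self, name: str, value: int = 0) -> None:
        self.name = name
        self.value = value
        # Starts dirty (mirrors m_dirty = true) so the first flush emits the initial value.
        self.dirty = True

    def set(self, value: int) -> None:
        """Record a new value; mark dirty only if it differs from the current one."""
        if value != self.value:
            self.value = value
            self.dirty = True

    def flush(self) -> Optional[str]:
        """Return a StatsD gauge line if dirty, otherwise None (nothing is sent)."""
        if not self.dirty:
            return None
        self.dirty = False
        return f"{self.name}:{self.value}|g"
```

In this model a gauge that stays at the same value returns `None` on every flush after the first, which is the stale-gauge situation the series-based check works around.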