rippled/OpenTelemetryPlan/Phase5_taskList.md
2026-03-12 22:09:01 +00:00

Phase 5: Documentation & Deployment Task List

Goal: Production readiness — Grafana dashboards, spanmetrics pipeline, operator runbook, alert definitions, and final integration testing. This phase ensures the telemetry system is useful and maintainable in production.

Scope: Grafana dashboard definitions, OTel Collector spanmetrics connector, Prometheus integration, alert rules, operator documentation, and production-ready Docker Compose stack.

Branch: pratik/otel-phase5-docs-deployment (from pratik/otel-phase4-consensus-tracing)

| Document | Relevance |
| --- | --- |
| 07-observability-backends.md | Jaeger setup (§7.1), Grafana dashboards (§7.6), alerts (§7.6.3) |
| 05-configuration-reference.md | Collector config (§5.5), production config (§5.5.2), Docker Compose (§5.6) |
| 06-implementation-phases.md | Phase 5 tasks (§6.6), definition of done (§6.11.5) |

Task 5.1: Add Spanmetrics Connector to OTel Collector

Objective: Derive RED metrics (Rate, Errors, Duration) from trace spans automatically, enabling Grafana time-series dashboards.

What to do:

  • Edit docker/telemetry/otel-collector-config.yaml:

    • Add spanmetrics connector:
      connectors:
        spanmetrics:
          histogram:
            explicit:
              buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
          dimensions:
            - name: xrpl.rpc.command
            - name: xrpl.rpc.status
            - name: xrpl.consensus.phase
            - name: xrpl.tx.type
      
    • Add prometheus exporter:
      exporters:
        prometheus:
          endpoint: 0.0.0.0:8889
      
    • Wire the pipeline:
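      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [batch]
            # spanmetrics appears as an exporter here: the connector consumes spans from this pipeline
            exporters: [debug, otlp/jaeger, spanmetrics]
          metrics:
            # and as a receiver here: it emits the derived RED metrics into this pipeline
            receivers: [spanmetrics]
            exporters: [prometheus]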
      
  • Edit docker/telemetry/docker-compose.yml:

    • Expose port 8889 on the collector for Prometheus scraping
    • Add Prometheus service
    • Add Prometheus as Grafana datasource
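The Compose changes above could look roughly like the fragment below. This is a sketch, not the final file: the service names `otel-collector` and `prometheus`, the image tag, and the host port 9090 are assumptions chosen for illustration and should be matched to the existing stack.

```yaml
# Fragment of docker/telemetry/docker-compose.yml (service names are assumed)
services:
  otel-collector:
    ports:
      - "8889:8889"   # Prometheus exporter endpoint defined in the collector config

  prometheus:
    image: prom/prometheus:latest   # pin a specific version for reproducible runs
    volumes:
      # Mount the scrape config created in this task
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"   # Prometheus UI and API
    depends_on:
      - otel-collector
```

Inside the Compose network Prometheus can reach the collector by service name, so publishing 8889 on the host mainly helps when checking the metrics endpoint by hand.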

Key modified files:

  • docker/telemetry/otel-collector-config.yaml
  • docker/telemetry/docker-compose.yml

Key new files:

  • docker/telemetry/prometheus.yml (Prometheus scrape config)
  • docker/telemetry/grafana/provisioning/datasources/prometheus.yaml
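A minimal sketch of the two new files follows. The scrape target `otel-collector:8889`, the datasource URL `http://prometheus:9090`, the datasource uid `prometheus`, and the 15s interval are assumptions that depend on the Compose service names actually used.

```yaml
# docker/telemetry/prometheus.yml (sketch)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: otel-collector-spanmetrics
    static_configs:
      # Collector service name and Prometheus exporter port (assumed service name)
      - targets: ["otel-collector:8889"]
```

```yaml
# docker/telemetry/grafana/provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    uid: prometheus               # fixed uid so dashboards and alerts can reference it
    url: http://prometheus:9090   # assumed Compose service name and port
    isDefault: false
```

Setting an explicit `uid` keeps dashboard JSON and alert rules stable across fresh Grafana instances, which would otherwise generate a random uid.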

Reference:


Task 5.2: Create Grafana Dashboards

Objective: Provide pre-built Grafana dashboards for RPC performance, transaction lifecycle, and consensus health.

What to do:

  • Create docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml (provisioning config)

  • Create dashboard JSON files:

    1. RPC Performance Dashboard (rpc-performance.json):

      • RPC request latency (p50/p95/p99) by command — histogram panel
      • RPC throughput (requests/sec) by command — time series
      • RPC error rate by command — bar gauge
      • Top slowest RPC commands — table
    2. Transaction Overview Dashboard (transaction-overview.json):

      • Transaction processing rate — time series
      • Transaction latency distribution — histogram
      • Suppression rate (duplicates) — stat panel
      • Transaction processing path (sync vs async) — pie chart
    3. Consensus Health Dashboard (consensus-health.json):

      • Consensus round duration — time series
      • Phase duration breakdown (open/establish/accept) — stacked bar
      • Proposals sent/received per round — stat panel
      • Consensus mode distribution (proposing/observing) — pie chart
  • Store dashboards in docker/telemetry/grafana/dashboards/
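The provisioning config named above might be as small as the following. It is a sketch under assumptions: the provider name, folder name, and container path `/var/lib/grafana/dashboards` are illustrative, and the path only works if the Compose file mounts docker/telemetry/grafana/dashboards/ at that location in the Grafana container.

```yaml
# docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml (sketch)
apiVersion: 1
providers:
  - name: rippled-telemetry        # hypothetical provider name
    orgId: 1
    folder: rippled                # Grafana folder that will hold the dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30      # how often Grafana rescans the directory
    allowUiUpdates: false
    options:
      # Container path where the dashboard JSON files are mounted (assumed)
      path: /var/lib/grafana/dashboards
```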

Key new files:

  • docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
  • docker/telemetry/grafana/dashboards/rpc-performance.json
  • docker/telemetry/grafana/dashboards/transaction-overview.json
  • docker/telemetry/grafana/dashboards/consensus-health.json

Reference:


Task 5.3: Define Alert Rules

Objective: Create alert definitions for key telemetry anomalies.

What to do:

  • Create docker/telemetry/grafana/provisioning/alerting/alerts.yaml:
    • RPC Latency Alert: p99 latency > 1s for any command over 5 minutes
    • RPC Error Rate Alert: Error rate > 5% for any command over 5 minutes
    • Consensus Duration Alert: Round duration > 10s (warn), > 30s (critical)
    • Transaction Processing Alert: Processing rate drops below a configurable threshold
    • Telemetry Pipeline Health: No spans received for > 2 minutes
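As an illustration of the first rule, a Grafana alert provisioning file could be shaped like the sketch below. Several parts are assumptions: the metric name `duration_milliseconds_bucket` and the label `xrpl_rpc_command` depend on the Collector version and on the spanmetrics namespace setting, the datasource uid `prometheus` must match the provisioned datasource, and the group, folder, and rule names are hypothetical.

```yaml
# docker/telemetry/grafana/provisioning/alerting/alerts.yaml (sketch, one rule only)
apiVersion: 1
groups:
  - orgId: 1
    name: rippled-telemetry          # hypothetical group name
    folder: rippled
    interval: 1m
    rules:
      - uid: rpc-p99-latency
        title: RPC p99 latency above 1s
        condition: C                 # refId of the threshold step below
        for: 5m                      # must hold for 5 minutes before firing
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus   # assumed uid of the Prometheus datasource
            model:
              refId: A
              instant: true
              # Metric and label names are assumptions; check what the collector exposes
              expr: 'histogram_quantile(0.99, sum by (le, xrpl_rpc_command) (rate(duration_milliseconds_bucket[5m])))'
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [1000]   # milliseconds, i.e. 1s
        noDataState: NoData
        execErrState: Error
        labels:
          severity: warning
        annotations:
          summary: RPC p99 latency exceeded 1s for at least one command
```

The remaining rules would follow the same query, reduce, threshold pattern with different expressions and limits.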

Key new files:

  • docker/telemetry/grafana/provisioning/alerting/alerts.yaml

Reference:


Task 5.4: Production Collector Configuration

Objective: Create a production-ready OTel Collector configuration with tail-based sampling and resource limits.

What to do:

  • Create docker/telemetry/otel-collector-config-production.yaml:
    • Tail-based sampling policy:
      • Always sample errors and slow traces
      • 10% base sampling rate for normal traces
      • Always sample first trace for each unique RPC command
    • Resource limits:
      • Memory limiter processor (80% of available memory)
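      • Queued retry for export failures (configured through the exporter's sending_queue and retry_on_failure settings; current Collector releases no longer ship a standalone queued_retry processor)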
    • TLS configuration for production endpoints
    • Health check endpoint
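The items above could be combined along the lines of the sketch below. It assumes the Collector contrib distribution (which provides the tail_sampling processor); the certificate paths, the Jaeger endpoint, the 1000 ms threshold for "slow", and the queue and retry values are placeholders rather than recommendations. The "first trace for each unique RPC command" policy is left out because it is not obvious which built-in tail_sampling policy type would express it.

```yaml
# docker/telemetry/otel-collector-config-production.yaml (sketch)
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/tls/server.crt   # hypothetical path
          key_file: /etc/otel/tls/server.key    # hypothetical path

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80          # 80% of available memory, as listed above
    spike_limit_percentage: 20
  tail_sampling:
    decision_wait: 10s            # wait for late spans before deciding
    policies:                     # a trace is kept if any policy samples it
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000      # assumed definition of "slow"
      - name: base-rate
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317         # assumed backend address
    tls:
      ca_file: /etc/otel/tls/ca.crt   # hypothetical path
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first so it can apply back-pressure before sampling
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
```

If the spanmetrics connector is also used in production, it is worth deciding whether it should see spans before or after sampling, since metrics derived from sampled traces under-count request rates.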

Key new files:

  • docker/telemetry/otel-collector-config-production.yaml

Reference:


Task 5.5: Operator Runbook

Objective: Create operator documentation for managing the telemetry system in production.

What to do:

  • Create docs/telemetry-runbook.md:
    • Setup: How to enable telemetry in rippled
    • Configuration: All config options with descriptions
    • Collector Deployment: Docker Compose vs. Kubernetes vs. bare metal
    • Troubleshooting: Common issues and resolutions
      • No traces appearing
      • High memory usage from telemetry
      • Collector connection failures
      • Sampling configuration tuning
    • Performance Tuning: Batch size, queue size, sampling ratio guidelines
    • Upgrading: How to upgrade OTel SDK and Collector versions

Key new files:

  • docs/telemetry-runbook.md

Task 5.6: Final Integration Testing

Objective: Validate the complete telemetry stack end-to-end.

What to do:

  1. Start full Docker stack (Collector, Jaeger, Grafana, Prometheus)
  2. Build rippled with telemetry=ON
  3. Run in standalone mode with telemetry enabled
  4. Generate RPC traffic and verify traces in Jaeger
  5. Verify dashboards populate in Grafana
  6. Verify alerts trigger correctly
  7. Test telemetry OFF path (no regressions)
  8. Run full test suite

Verification Checklist:

  • Docker stack starts without errors
  • Traces appear in Jaeger with correct hierarchy
  • Grafana dashboards show metrics derived from spans
  • Prometheus scrapes spanmetrics successfully
  • Alerts can be triggered by simulated conditions
  • Build succeeds with telemetry ON and OFF
  • Full test suite passes

Summary

| Task | Description | New Files | Modified Files | Depends On |
| --- | --- | --- | --- | --- |
| 5.1 | Spanmetrics connector + Prometheus | 2 | 2 | Phase 4 |
| 5.2 | Grafana dashboards | 4 | 0 | 5.1 |
| 5.3 | Alert definitions | 1 | 0 | 5.1 |
| 5.4 | Production collector config | 1 | 0 | Phase 4 |
| 5.5 | Operator runbook | 1 | 0 | Phase 4 |
| 5.6 | Final integration testing | 0 | 0 | 5.1-5.5 |

Parallel work: Tasks 5.1, 5.4, and 5.5 can run in parallel. Tasks 5.2 and 5.3 depend on 5.1. Task 5.6 depends on all others.

Exit Criteria (from 06-implementation-phases.md §6.11.5):

  • Dashboards deployed and showing data
  • Alerts configured and tested
  • Operator documentation complete
  • Production collector config ready
  • Full test suite passes