Commit Graph

35 Commits

Author SHA1 Message Date
Pratik Mankawde
f11ebc1253 fix: StatsDGauge dirty init + tx_submitter sequence drift in CI
Two CI failures traced to root cause:

1. rippled_jobq_job_count: 0 series — StatsDGaugeImpl declared
   m_dirty{false} despite the constructor comment saying "start dirty".
   Gauges whose value starts and stays at 0 never emitted, so Prometheus
   never scraped them.  Fix: m_dirty{true} on the member initializer.

2. TX error rate 82.8% — the submitter tracked account sequences
   locally, but in a multi-node consensus network other nodes' txns
   advance sequences independently.  After a few ledger closes the
   locally-tracked sequence fell behind the ledger, producing
   tefPAST_SEQ for every subsequent submission.  Fix: refresh account
   sequences from account_info every 10 s during the submission loop.
2026-04-24 19:42:20 +01:00
Pratik Mankawde
577d1f8a21 fix: address review findings in regression gate
- capture_timings.py: fail when captured/total ratio < 50%
  (--min-capture-ratio). Prevents silent pass on unreachable Prometheus.
- run-full-validation.sh: set REGRESSION_EXIT=2 on capture failure so
  the final exit code reflects it. Update exit code docs in header.
- compare_to_baseline.py: extract _skip_delta helper to bring
  compute_delta under 80 lines. Fix 0.0-as-falsy bug in abs_bound
  resolution (use explicit None check instead of `or`). Remove dead
  variable override_prefix_key.
- prom_queries.py: extract _build_simple_entries and _build_job_entries
  to bring build_query_plan under 80 lines. Fix module docstring return
  type example. Use aiohttp.ClientTimeout instead of bare int.
- telemetry-validation.yml: add set -euo pipefail to regression summary
  step; guard jq calls with -e flag and fallback; fail on missing
  baseline file; emit ::warning annotation when timings.json missing.
- baselines/README.md: document the placeholder field.
2026-04-24 19:36:15 +01:00
Pratik Mankawde
df79d5e74b feat: add OTel-driven regression gate for Phase 10 telemetry validation
Captures per-span / per-RPC / per-job timings from Prometheus after the
workload run and diffs them against a committed baseline. Regression
requires breaching both a percentage and an absolute bound, tolerating
small-value noise. When the baseline is a placeholder, the comparator
emits the captured JSON in the exact schema for one-time paste into
baselines/baseline-timings.json, and the CI Step Summary surfaces that
block for the reviewer.

Scope: gate only — automated baseline persistence, benchmark.sh
PromQL migration, and the historical trend dashboard remain follow-ups.
2026-04-24 18:53:44 +01:00
Pratik Mankawde
8583343fd9 fix(telemetry): restore Loki, StatsD, filelog configs lost in rebase
The Jaeger-removal rebase used --ours conflict resolution which
dropped content added by intermediate phases (6-8): StatsD receiver,
filelog receiver, Loki service/exporter, health_check extension,
and OTLP metrics pipeline. Restore from pre-rebase origin minus
Jaeger references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 11:02:03 +01:00
Pratik Mankawde
7e149f7773 refactor(telemetry): remove residual Jaeger references across chain
Fix remaining Jaeger references that accumulated across intermediate
branches in the stacked PR chain. These were in files modified by
multiple phases where the per-branch fixes didn't cover all additions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:35:04 +01:00
Pratik Mankawde
a142a700e8 refactor(telemetry): migrate Phase 10 validation from Jaeger to Tempo native API
Migrate validate_telemetry.py to Tempo TraceQL search API, remove
Jaeger service from workload docker-compose, update readiness checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
ff1502f939 feat(telemetry): add workload orchestrator with phased load profiles
Add a profile-driven workload orchestrator that executes sequential load
phases with configurable RPC rates and TX throughput. Three profiles:
full-validation (6 phases covering all 18 dashboards), quick-smoke (CI),
and stress (benchmarking). Fix 10 validation failures: correct Phase 9
metric prefixes, relax peer latency bounds for localhost clusters, and
allow sub-microsecond span durations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
ecf103104b fix(telemetry): fix CI failures in MetricsRegistry test, levelization, and dashboard titles
- Update MockServiceRegistry to match current ServiceRegistry interface
  (17 method renames: get* prefix, PathRequests→PathRequestManager)
- Make throwUnimplemented() static to satisfy clang-tidy
- Regenerate levelization ordering.txt and loops.txt
- Remove 'rippled' prefix from 3 StatsD dashboard titles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
e63ca4c495 fix(telemetry): fix dashboard UID and add parity attributes to expected_spans
- Remove duplicate 'system-node-health' UID from expected_metrics.json
  (already covered by 'rippled-system-node-health')
- Add parity span attributes to expected_spans.json: node health on
  rpc.command.*, validation hash/full on consensus.validation.send,
  quorum/proposers on consensus.accept, validation hash/full on
  peer.validation.receive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
711ae43174 feat(telemetry): add external dashboard parity validation checks (Task 10.8)
Add ~28 validation checks for external dashboard parity:
- 8 span attribute checks (server_info, tx.receive, consensus, peer spans)
- 13 metric existence checks (validation agreement, validator health,
  peer quality, ledger economy, state tracking, counters, storage)
- 3 dashboard load checks (validator-health, peer-quality, system-node-health)
- 4 value sanity checks (agreement %, UNL expiry, latency, state value)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
5de8c520d1 Phase 10: Workload validation - synthetic load generation and telemetry checks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:02 +01:00
Pratik Mankawde
0644438549 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
d8c586b2fb Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
8cca4ec77b Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:32:01 +01:00
Pratik Mankawde
45ffe8e2ec fix(telemetry): add missing counters, fix dashboard metric name, clean dead code
- Add rippled_validation_agreements_total and rippled_validation_missed_total
  counter declarations and creation (wiring to ValidationTracker pending rebase)
- Fix peer-quality dashboard: query rippled_server_info{metric="peer_disconnects_resources"}
  instead of non-existent rippled_Overlay_Peer_Disconnects_Charges
- Remove dead getCountsJson() call in storageDetail callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
b92354715d feat(telemetry): add validator health, peer quality dashboards and ledger economy panels (Tasks 9.11-9.13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
936c73982d docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
892fee638a Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
facc111c22 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
5ec9f3f30a Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
8f364ed6f4 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
fdec3ce5c4 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
aa062ecdbe Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
0e15f95543 Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
2f7064ace6 Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
21192e9b3f Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
7e5591318f Phase 5b: Ledger, peer, and tx spans with expanded Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:30:59 +01:00
Pratik Mankawde
87ed778efe refactor(telemetry): migrate integration test and docs from Jaeger to Tempo API
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
d0ff82801c fix: use docker/telemetry/data/ for runtime data and add .gitignore
Move xrpld data paths from ./data/ to docker/telemetry/data/ so runtime
files stay within the docker telemetry directory. Add .gitignore to
exclude the data directory from version control.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
f940290866 Phase 5: Documentation, deployment configs, integration test infrastructure
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
a127711b86 Phase 4: Consensus tracing - round lifecycle, proposals, validations, close time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
88d17e4c04 Phase 3: Transaction tracing - protobuf context propagation, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:27 +01:00
Pratik Mankawde
945faac770 Phase 2: RPC tracing - span macros, attributes, WebSocket, command spans
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
8421134420 refactor(telemetry): remove Jaeger service, exporter, and datasource
Tempo is now the sole trace backend. Remove Jaeger all-in-one service
from docker-compose, otlp/jaeger exporter from OTel Collector config,
and Jaeger Grafana datasource provisioning file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:28:12 +01:00
Pratik Mankawde
a7470615be Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:12 +01:00