Commit Graph

12 Commits

Author SHA1 Message Date
Pratik Mankawde
9289cb671d Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
b73592f934 Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines
- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
  - MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
  - Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
  - Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, gantt, effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
2573e956f1 Phase 8: Update documentation for log-trace correlation
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
503d3f7d48 Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan
  (motivation, architecture, Mermaid diagrams, tasks table, exit criteria)
- Add Phase8_taskList.md with per-task breakdown (8.1-8.6)
- Add §5a log-trace correlation section to 09-data-collection-reference.md
- Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases
- Add Phases 6-8 to Gantt chart in 06-implementation-phases.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
7d51436d26 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
702cf63c62 Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content
- Move Phase 7 motivation (gains/losses/decision) and architecture (class
  hierarchy, data flow diagram, config) from Phase7_taskList.md into
  06-implementation-phases.md §6.8
- Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table)
- Remove Phase8_taskList.md — belongs on Phase 8 branch
- Remove §6.8.1 (Phase 8) from 06-implementation-phases.md
- Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md
- Remove Phase 8 row from OpenTelemetryPlan.md phase table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
85a2220312 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a8c2f94e8a Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
f1025d4f71 Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
64d8369dbc Add consensus.accept.apply span to data collection reference
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
4dcd65968f document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2bea046dab Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00