Commit Graph

13845 Commits

Author SHA1 Message Date
Pratik Mankawde
4545a9495b Appendix: add Phase8_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
503d3f7d48 Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan
  (motivation, architecture, Mermaid diagrams, tasks table, exit criteria)
- Add Phase8_taskList.md with per-task breakdown (8.1-8.6)
- Add §5a log-trace correlation section to 09-data-collection-reference.md
- Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases
- Add Phases 6-8 to Gantt chart in 06-implementation-phases.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
5861b555b1 Phase 8: Add Loki datasource provisioning for log-trace correlation
Adds Grafana Loki data source with derivedFields config linking
trace_id values in log lines to Tempo traces. This enables one-click
log-to-trace navigation in Grafana Explore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
8af6af6b2a Fix std::bad_weak_ptr crash in OTelGaugeImpl constructor
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code — never used — since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:41 +00:00
Pratik Mankawde
eb7e102015 Fix codecov: exclude OTel collector path in CollectorManager from coverage
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
3541deb972 Fix OTelCollector: Journal::info is a method, not a bool member
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7e515899c1 Fix system dashboard $node template variable queries
The label_values() PromQL function requires a metric name as the first
argument. Without it, Prometheus returns raw label hashes instead of
readable node names like "validator-0:6006".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a431cf97b3 Fix OTelCollector.cpp compilation errors against OTel C++ SDK
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  + AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
35eeffa068 Phase 7: Add node filters to system dashboards + OTelCollector instanceId
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7d51436d26 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
0aa0bfa5dd Appendix: add Phase7_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
702cf63c62 Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content
- Move Phase 7 motivation (gains/losses/decision) and architecture (class
  hierarchy, data flow diagram, config) from Phase7_taskList.md into
  06-implementation-phases.md §6.8
- Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table)
- Remove Phase8_taskList.md — belongs on Phase 8 branch
- Remove §6.8.1 (Phase 8) from 06-implementation-phases.md
- Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md
- Remove Phase 8 row from OpenTelemetryPlan.md phase table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
85a2220312 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a8c2f94e8a Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
f1025d4f71 Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a2082f29c4 Add StatsD dashboards for overlay traffic detail and ledger data sync
Two new Grafana dashboards covering previously uncovered beast::insight
StatsD metrics:

Overlay Traffic Detail (8 panels):
- Squelch traffic (squelch, suppressed, ignored)
- Overhead breakdown (base, cluster, manifest)
- Validator list distribution traffic
- Set get/share (transaction set exchange)
- Have/requested transactions protocol
- Unknown/unclassified traffic
- Proof path request/response
- Replay delta request/response

Ledger Data & Sync (6 panels):
- Ledger data exchange by sub-type (TX set, TX node, account state)
- Legacy ledger share/get traffic
- GetObject by type (ledger, transaction, account state, CAS, fetch pack)
- GetObject aggregate and special types
- GetObject message counts
- All-categories ranked bar gauge (top 20)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
fcb630921f formatting
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2a5dd29278 Appendix: add 09-data-collection-reference.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
64d8369dbc Add consensus.accept.apply span to data collection reference
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
4dcd65968f document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
2bea046dab Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
350a546b9b Update Consensus Mode Over Time panel description to Title Case
Match the toDisplayString() values introduced in Phase 4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:27 +00:00
Pratik Mankawde
4be2809276 Fix codecov: add LCOV_EXCL_LINE to opening lines of multi-line trace macros
clang-format splits XRPL_TRACE_SET_ATTR calls across two lines. The
LCOV_EXCL_LINE comment must appear on BOTH lines, not just the
continuation line, since gcov instruments each line independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
3b9962069e Fix codecov/patch: exclude telemetry no-op lines from coverage
Add LCOV_EXCL_LINE markers on trace macro calls that expand to ((void)0)
when telemetry is disabled. gcov instruments these no-op expressions at
-O0 causing false patch coverage failures.

Also add telemetry module paths to .codecov.yml ignore list since they
are conditionally compiled behind XRPL_ENABLE_TELEMETRY which is not
enabled in coverage builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
76d373d9b4 formatting changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
be59e4940f Phase 5b: Add ledger/peer/tx spans + expand Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:27 +00:00
Pratik Mankawde
5499cbcde4 Update consensus mode filter description to Title Case
Match the toDisplayString() Title Case values (Proposing, Observing,
Wrong Ledger, Switched Ledger) in the $consensus_mode template
variable description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:45:17 +00:00
Pratik Mankawde
3c07a42026 Fix Grafana datasource default and add human-friendly standalone node name
- Set Prometheus as default datasource (isDefault: true) so dashboard
  panels using type: prometheus without explicit UID resolve correctly.
  Previously Jaeger was implicitly the default, causing all PromQL
  panels to return empty results.
- Explicitly set isDefault: false on Jaeger datasource to prevent
  implicit default behavior.
- Add service_instance_id=rippled-standalone to xrpld-telemetry.cfg
  so Grafana $node dropdown shows a readable name instead of the
  base58 node public key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
8e2693bb35 Appendix: add Phase5_IntegrationTest_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
2b641e6947 Use human-friendly node names in integration test telemetry config
Set service_instance_id=rippled-node-N in each test node's [telemetry]
section so Grafana/Tempo filters show readable names instead of base58
public keys. Update dashboard descriptions and Tempo datasource comments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
7518fb5bb0 Fix Grafana node filter: use exported_instance instead of service_instance_id
Prometheus renames the OTel resource attribute service.instance.id to
'instance', which then conflicts with the built-in scrape 'instance'
label. Prometheus resolves this by prefixing it as 'exported_instance'.
Update all dashboard PromQL queries and template variable queries to
use exported_instance so the Node dropdown correctly populates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
95f1b10cfd Set explicit uid on Prometheus datasource for dashboard compatibility
Dashboard template variables reference datasource uid "prometheus" but
the provisioning config had no uid set, causing Grafana to auto-assign
a random one and break dashboard panel queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
400624a822 Add Grafana dashboard template variables for node and metric filtering
- Add resource_metrics_key_attributes to spanmetrics connector so
  service.instance.id becomes a Prometheus label for per-node filtering
- Add 'node' dropdown (service_instance_id) to all 3 dashboards
- Add 'command' dropdown (xrpl_rpc_command) to RPC Performance
- Add 'tx_origin' dropdown (xrpl_tx_local) to Transaction Overview
- Add 'consensus_mode' dropdown (xrpl_consensus_mode) to Consensus Health
- Update all panel PromQL queries to include $node filter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
3b22d73bef Add consensus close time panels and docs for accept.apply span
- Add Ledger Apply Duration and Close Time Agreement panels to
  consensus-health dashboard
- Add consensus.accept.apply to telemetry runbook with TraceQL
  queries for close time disagreements and consensus failures
- Add span to TESTING.md expected span catalog and verification loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
70396debcb Phase 5: Observability stack — spanmetrics, dashboards, runbook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:44:11 +00:00
Pratik Mankawde
9bb8f2e12a Use Title Case for consensus mode names in telemetry attributes
Add toDisplayString(ConsensusMode) helper that returns Title Case
names (Proposing, Observing, Wrong Ledger, Switched Ledger) for use
in OTel span attributes and Grafana dashboards. The existing
to_string() is preserved unchanged for log output stability.

Updated call sites:
- RCLConsensus.cpp: onClose, onModeChange, startRoundInternal
- TracingMacros.cpp: test attribute value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:43:55 +00:00
Pratik Mankawde
31d0ed178a Phase 4a: Establish-phase gap fill & cross-node correlation
Add full consensus tracing with deterministic trace ID correlation
and establish-phase instrumentation:

- Deterministic trace_id from previousLedger.id() for cross-node
  correlation (switchable via consensus_trace_strategy config)
- Round-to-round span links (follows-from) for causal chaining
- Establish phase spans with convergence tracking, dispute resolution
  events, and threshold escalation attributes
- Validation spans with links to round spans (thread-safe via
  roundSpanContext_ snapshot for jtACCEPT cross-thread access)
- Mode change spans for proposing/observing transitions
- New startSpan overload with span links in Telemetry interface
- XRPL_TRACE_ADD_EVENT macro with do-while(0) safety wrapper
- Config validation for consensus_trace_strategy
- Test adaptor (csf::Peer) updated with getTelemetry() stub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
7675af41ec Add consensus.accept.apply span with ledger close time attributes
Add a new trace span in doAccept() capturing ledger close time details:
- xrpl.consensus.close_time: agreed-upon close time (epoch seconds)
- xrpl.consensus.close_time_correct: whether validators converged
  (per avCT_CONSENSUS_PCT = 75% threshold)
- xrpl.consensus.close_resolution_ms: time rounding granularity
- xrpl.consensus.state: "finished" or "moved_on" (consensus failure)
- xrpl.consensus.proposing: whether this node was proposing

Update Tempo datasource with close time filters, plan docs with
new span inventory, and add test coverage for the attribute pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:00 +00:00
Pratik Mankawde
655b78a7d8 Add consensus trace filters to Grafana Tempo datasource
Add xrpl.consensus.mode (static), xrpl.consensus.round (dynamic),
and xrpl.consensus.ledger.seq (static) search filters for Phase 4
consensus span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
0bbffaebc4 Phase 4: Consensus tracing — round lifecycle, proposals, validations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:24 +00:00
Pratik Mankawde
e33e592128 Add transaction trace filters to Grafana Tempo datasource
Add xrpl.tx.hash (static), xrpl.tx.local and xrpl.tx.status
(dynamic) search filters for Phase 3 transaction span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
e6b05d700c Phase 3: Transaction tracing — protobuf context, PeerImp, NetworkOPs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f114ec5462 Fix gersemi cmake formatting in test CMakeLists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
1d05075585 Appendix: add Phase 2-5 task lists to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
e77f2f0fa5 Add RPC trace filters to Grafana Tempo datasource
Add xrpl.rpc.command (static), xrpl.rpc.status and xrpl.rpc.role
(dynamic) search filters for Phase 2 RPC span attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
ea00fb9bc8 Phase 2: Complete RPC tracing — interface, macros, attributes, tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
a707abd6f2 Phase 1c: RPC layer telemetry integration
Add tracing instrumentation to the RPC request handling layer:

TracingInstrumentation.h:
- Convenience macros (XRPL_TRACE_RPC, XRPL_TRACE_TX, etc.) that
  create RAII SpanGuard objects when telemetry is enabled
- XRPL_TRACE_SET_ATTR / XRPL_TRACE_EXCEPTION for span enrichment
- Zero-overhead no-ops when XRPL_ENABLE_TELEMETRY is not defined

RPCHandler.cpp:
- Trace each RPC command with span name "rpc.command.<method>"
- Record command name, API version, role, and status as attributes
- Capture exceptions on the span for error visibility

ServerHandler.cpp:
- Trace HTTP requests ("rpc.request"), WebSocket messages
  ("rpc.ws_message"), and processRequest ("rpc.process")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
99f4f16eb9 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
f333ae4c11 Fix gersemi cmake formatting (if/endif spacing, target_link_libraries wrapping)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00
Pratik Mankawde
9ecea839bb Appendix: add presentation.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:09:01 +00:00