Commit Graph

14358 Commits

Author SHA1 Message Date
Pratik Mankawde
34bf61ff77 merge: pratik/otel-phase9-metric-gap-fill fix(SpanKind) into pratik/otel-phase10-workload-validation
# Conflicts:
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/xrpld-telemetry.cfg
2026-05-14 15:59:39 +01:00
Pratik Mankawde
9d99ce6ae8 merge: pratik/otel-phase8-log-correlation fix(SpanKind) into pratik/otel-phase9-metric-gap-fill 2026-05-14 15:55:09 +01:00
Pratik Mankawde
577cb9b5f0 merge: pratik/otel-phase7-native-metrics fix(SpanKind) into pratik/otel-phase8-log-correlation 2026-05-14 15:55:07 +01:00
Pratik Mankawde
7d202127bb merge: pratik/otel-phase6-statsd fix(SpanKind) into pratik/otel-phase7-native-metrics 2026-05-14 15:55:05 +01:00
Pratik Mankawde
56090b0ead merge: pratik/otel-phase5-docs-deployment fix(SpanKind) into pratik/otel-phase6-statsd 2026-05-14 15:55:03 +01:00
Pratik Mankawde
6c6d6f953f merge: pratik/otel-phase4-consensus-tracing fix(SpanKind) into pratik/otel-phase5-docs-deployment 2026-05-14 15:55:01 +01:00
Pratik Mankawde
0b4b3c7bf2 merge: pratik/otel-phase3-tx-tracing fix(SpanKind) into pratik/otel-phase4-consensus-tracing 2026-05-14 15:54:59 +01:00
Pratik Mankawde
3e894f8e93 merge: pratik/otel-phase2-rpc-tracing fix(SpanKind) into pratik/otel-phase3-tx-tracing 2026-05-14 15:54:57 +01:00
Pratik Mankawde
cb7dc5c52e merge: pratik/otel-phase1c-rpc-integration fix(SpanKind) into pratik/otel-phase2-rpc-tracing 2026-05-14 15:54:55 +01:00
Pratik Mankawde
9cfb43d8d0 merge: pratik/otel-phase1b-telemetry-infra fix(SpanKind) into pratik/otel-phase1c-rpc-integration 2026-05-14 15:54:53 +01:00
Pratik Mankawde
7ada57e2a8 fix(telemetry): map TraceCategory to OTel SpanKind in SpanGuard::span()
SpanGuard::span() hardcoded SpanKind::kInternal for every span. Tempo's
service-graph and spanmetrics RED calculations rely on kServer /
kConsumer / kClient / kProducer to classify inbound vs outbound vs
internal operations. With kInternal everywhere, the service graph
collapses to a single self-loop and RED metrics attribute all latency
to internal work.

Add categoryToSpanKind() mapping:
  - Rpc           -> kServer   (inbound synchronous request)
  - Peer          -> kConsumer (inbound async peer message)
  - Transactions  -> kInternal
  - Consensus     -> kInternal
  - Ledger        -> kInternal

Only the single-argument overload is affected; childSpan / linkedSpan
continue to default to kInternal because they represent in-process
continuations of an already-kinded parent.
2026-05-14 15:53:59 +01:00
Pratik Mankawde
53e1ff82d8 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-14 14:01:46 +01:00
Pratik Mankawde
8df3ea1bbe Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-14 14:01:41 +01:00
Pratik Mankawde
5a6882f119 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/otel-collector-config.yaml
2026-05-14 14:01:36 +01:00
Pratik Mankawde
b449db0434 fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names
Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 14:01:12 +01:00
Pratik Mankawde
9babfff3c8 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-14 13:59:19 +01:00
Pratik Mankawde
68b32ed0f0 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-14 13:59:14 +01:00
Pratik Mankawde
61ab5c6fe3 fix(telemetry): align Tempo consensus search tags with C++ attribute names
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:59:08 +01:00
Pratik Mankawde
837f7e7b50 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-14 13:58:38 +01:00
Pratik Mankawde
b392035544 fix(telemetry): align Tempo TX search tags with C++ attribute names
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:31 +01:00
Pratik Mankawde
450004ebd8 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-14 13:58:19 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
5dc4ae8fcc Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-14 13:49:59 +01:00
Pratik Mankawde
690841e934 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-14 13:49:51 +01:00
Pratik Mankawde
7d61a4a0ef feat(telemetry): add missing Phase 9 metric panels to dashboards
13 metrics from 09-data-collection-reference.md were not displayed on
any Grafana dashboard. Adds panels for all of them:

system-node-health.json (+7 panels):
- NodeStore Bytes Read/Written (node_written_bytes, node_read_bytes)
- NodeStore Read Threads & Duration (node_reads_duration_us,
  read_request_bundle, read_threads_running, read_threads_total)
- AL_size added to Cache Sizes panel
- Current Ledger Index (ledger_current_index)
- NuDB Storage Size (storage_detail{metric="nudb_bytes"})

rippled-validator-health.json (+2 panels):
- UNL Blocked (validator_health{metric="unl_blocked"})
- Agreement/Missed Counters Rate (validation_agreements_total,
  validation_missed_total)

rippled-job-queue.json (+1 panel):
- Transaction Overflow Rate (jq_trans_overflow_total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:32:55 +01:00
Pratik Mankawde
93caaba5ca fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename
Panels 8-15 from statsd-node-health.json and panels 8-9 from
statsd-network-traffic.json were lost when Phase 7 renamed these files
to system-*. The merge (5cd71ed107) took Phase 7's smaller version
without the extra panels added by commit b933e8ae00 on Phase 6.

Recovered panels (system-node-health.json):
- Key Jobs Execution Time (11 job types)
- Key Jobs Dequeue Wait Time (11 job types)
- FullBelowCache Size
- FullBelowCache Hit Rate
- Ledger Publish Gap (validated - published age delta)
- State Duration Rate (Full vs Tracking)
- All Jobs Execution Time Detail (34 job types)
- All Jobs Dequeue Wait Detail (34 job types)

Recovered panels (system-network-traffic.json):
- Duplicate Traffic (Wasted Bandwidth)
- All Traffic Categories Detail (topk 15 by byte rate)

All recovered panels updated to include exported_instance=~"$node"
filter per project dashboard guidelines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 12:33:18 +01:00
Pratik Mankawde
02fe838257 auto refresh at 5seconds
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 19:00:36 +01:00
Pratik Mankawde
20477e5494 validator path changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:49:21 +01:00
Pratik Mankawde
f0c6227c06 added config for devnet test run
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:42:57 +01:00
Pratik Mankawde
a9e4006591 fix(telemetry): address clang-tidy CI failures on phase10 beast test
- Add missing direct includes (contract.h, Log.h, Journal.h, uint256.h,
  suite.h, io_context.hpp, optional, stdexcept, string).
- Replace broad unit_test.h with specific unit_test/suite.h.
- Concatenate nested namespaces (xrpl::test).
- Add [[nodiscard]] to getOpenLedger/isStopping/getTrapTxID overrides.
- Make const-eligible variables const (Journal j, registry in
  disabled_construction test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 18:27:05 +01:00
Pratik Mankawde
eac2538a55 Merge phase9 clang-tidy fixes into phase10
# Conflicts:
#	src/tests/libxrpl/telemetry/MetricsRegistry.cpp
2026-05-13 18:24:49 +01:00
Pratik Mankawde
ddca4a982b fix(telemetry): address clang-tidy CI failures on phase9
- MetricsRegistry.cpp: concatenate nested namespaces, add missing
  direct includes (Journal.h, string, string_view, cstdint), suppress
  readability-convert-member-functions-to-static in #else stubs by
  referencing enabled_ member, void unused instanceId parameter.
- MetricsRegistry test: add missing direct includes (Log.h, Journal.h,
  uint256.h, io_context.hpp, optional, stdexcept, string), make
  throwUnimplemented() static, add [[nodiscard]] to getOpenLedger/
  isStopping/getTrapTxID overrides, make const-eligible registry const.
- PerfLogImp.cpp: add braces around if/else body per
  readability-braces-around-statements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 18:24:29 +01:00
Pratik Mankawde
9fbdfa1fbe Merge phase9 clang-tidy fixes into phase10 2026-05-13 18:13:40 +01:00
Pratik Mankawde
3c13d788fd fix(telemetry): address clang-tidy CI failures on phase9
- MetricsRegistry.cpp: concatenate nested namespaces, add missing
  direct includes (Journal.h, string, string_view, cstdint), suppress
  readability-convert-member-functions-to-static in #else stubs by
  referencing enabled_ member, void unused instanceId parameter.
- MetricsRegistry test: add missing direct includes (Log.h, Journal.h,
  uint256.h, io_context.hpp, optional, stdexcept, string), make
  throwUnimplemented() static, add [[nodiscard]] to getOpenLedger/
  isStopping/getTrapTxID overrides, make const-eligible registry const.
- PerfLogImp.cpp: add braces around if/else body per
  readability-braces-around-statements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 18:13:19 +01:00
Pratik Mankawde
c10b0fd9d1 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:53:56 +01:00
Pratik Mankawde
3131d99029 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:53:54 +01:00
Pratik Mankawde
829094df5a Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:53:52 +01:00
Pratik Mankawde
9554e3889b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:53:50 +01:00
Pratik Mankawde
fe7cb33b65 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:53:47 +01:00
Pratik Mankawde
f5cf4155c2 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-05-13 16:53:45 +01:00
Pratik Mankawde
ea30adeb47 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-13 16:53:43 +01:00
Pratik Mankawde
9bc2e4abb3 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-13 16:53:32 +01:00
Pratik Mankawde
7b9e0bc300 fix(telemetry): remove unused includes from RPCHandler after node-health attr removal
NetworkOPs.h and SpanNames.h were only needed for per-span
nodeAmendmentBlocked/nodeServerState calls, which were removed
in the attr naming simplification. Fixes clang-tidy CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:53:19 +01:00
Pratik Mankawde
a04459f1f8 fix(telemetry): update collector config + tempo datasource + design doc for simplified attr names
- otel-collector-config.yaml: spanmetrics dimensions use new bare names.
- tempo.yaml: TraceQL filter tags use new bare names.
- 02-design-decisions.md: strip xrpl.txq.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:36 +01:00
Pratik Mankawde
6b5e6a49ec Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:45:23 +01:00
Pratik Mankawde
b4e4b57504 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:45:14 +01:00
Pratik Mankawde
6dd43765b5 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:45:03 +01:00
Pratik Mankawde
cbf389943f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:44:49 +01:00
Pratik Mankawde
b05e650b6f docs(telemetry): update 09-data-collection-reference + Phase5 integration test list for simplified attr naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:42:30 +01:00
Pratik Mankawde
57175ab12c Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-13 16:37:37 +01:00