Commit Graph

14212 Commits

Author SHA1 Message Date
Pratik Mankawde
e49c5997b7 added loki config.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-06 17:37:43 +01:00
Pratik Mankawde
5ad5bacc94 compilation fixes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-06 16:26:50 +01:00
Pratik Mankawde
85330920ac feat(telemetry): add Loki service and filelog receiver for Phase 8 log ingestion
Cherry-pick Loki infrastructure from phase-10 back to where it belongs
(Phase 8, Tasks 8.2/8.3):

- Add Loki 3.4.2 service to docker-compose.yml (port 3100)
- Add filelog receiver to OTel Collector config (tails debug.log,
  regex_parser extracts trace_id/span_id/partition/severity)
- Add otlphttp/loki exporter (uses Loki 3.x native OTLP ingestion)
- Add logs pipeline: filelog -> batch -> otlphttp/loki
- Add health_check extension
- Mount xrpld log directory into collector container
- Add prometheus-data and loki-data persistent volumes

StatsD receiver intentionally excluded — Phase 7 migrated to native
OTLP metrics, making the StatsD receiver unnecessary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:55:45 +01:00
Pratik Mankawde
fac6c3ac1d Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-06 14:34:17 +01:00
Pratik Mankawde
a8549a7ab2 fix(telemetry): address code review findings for Phase 8 log-trace correlation
- Replace GetSpan() with direct context value check in Logs::format()
  to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
  and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
  detect missing log files, fix silent grep error masking

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:32:46 +01:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
ed31bab500 fix: restore unrelated files to phase-6 state after revert
The blanket revert of f4555c80fe also un-reverted some files that had
been correctly matched to phase-6 (nodestore Backend API refactor,
Vault_test changes). Restore those to the base branch state so the
phase-7 PR only contains telemetry changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:15:14 +01:00
Pratik Mankawde
b62987bda7 Revert "fix: revert all unrelated upstream develop changes from phase-7 PR"
This reverts commit f4555c80fe.
2026-05-06 14:13:59 +01:00
Pratik Mankawde
f08412b3e0 ordering update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-06 13:27:29 +01:00
Pratik Mankawde
43ae5c2d20 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-04-30 17:14:42 +01:00
Pratik Mankawde
8dec56df0b Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-04-30 17:14:34 +01:00
Pratik Mankawde
beaf01ae4d fix(telemetry): fix CI failures in phase-6 build, clang-tidy, and rename checks
Build fixes in PeerImp.cpp:
- Rename duplicate `span` variable to `consSpan` in proposal and
  validation handlers to avoid redefinition error
- Fix `->` on non-pointer SpanGuard (now correctly on shared_ptr)
- Fix move-only type copy in lambda capture

Clang-tidy fixes:
- Concatenate nested namespaces in LedgerSpanNames.h and PeerSpanNames.h
- Add missing SpanNames.h includes in BuildLedger.cpp, LedgerMaster.cpp,
  PeerImp.cpp for direct seg:: symbol usage
- Add missing <chrono> and <cstdint> includes in BuildLedger.cpp
- Remove unused Feature.h include from BuildLedger.cpp

Rename check fix:
- Run docs.sh to rename rippled_ metric prefixes to xrpld_ in
  09-data-collection-reference.md and telemetry-runbook.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:09:17 +01:00
Pratik Mankawde
5e37f71139 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-04-30 17:00:27 +01:00
Pratik Mankawde
3f793b2098 fix: revert incorrect docs.sh renames (README.md URL, configLegacyName)
- protocol/README.md: restore historical GitHub URL path (src/ripple/)
- Config.cpp: restore configLegacyName as "rippled.cfg" (legacy name
  must remain as-is for backward compatibility)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 17:00:03 +01:00
Pratik Mankawde
f4555c80fe fix: revert all unrelated upstream develop changes from phase-7 PR
Reverts 259 files that carried unrelated upstream changes through the
phase-6 merge: enum class removals (cppcoreguidelines-use-enum-class),
scoped_lock→lock_guard conversions (modernize-use-scoped-lock),
nodestore Backend API changes (void const* key), .clang-tidy config,
test infrastructure deletions, and miscellaneous develop changes.

These changes belong on develop, not in the telemetry PR chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:59:24 +01:00
Pratik Mankawde
5bdb6b4eaf Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-04-30 16:45:49 +01:00
Pratik Mankawde
f44b89b99d fix: revert unrelated develop changes from phase-7 PR
- Revert reusable-build-test-config.yml to develop (action SHA update
  and "Show test failure summary" step removal don't belong here)
- Revert upload-conan-deps.yml to develop (action SHA update)
- Revert features.macro: BatchInnerSigs and Batch back to Supported::no
  (these feature flag changes are unrelated to telemetry)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:43 +01:00
Pratik Mankawde
1c0d21dba4 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-04-29 21:17:57 +01:00
Pratik Mankawde
f183e9b57f Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-04-29 21:17:48 +01:00
Pratik Mankawde
8e7a2d6c53 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
2026-04-29 21:07:32 +01:00
Pratik Mankawde
9adcc49171 fix: re-apply phase-7 doc/config changes lost during merge
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.

Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00
Pratik Mankawde
7ab6f4d34b fix: address CI rename checks (rippled -> xrpld) in phase-8 docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:09:43 +01:00
Pratik Mankawde
81b47afde7 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	docker/telemetry/grafana/dashboards/statsd-network-traffic.json
#	docker/telemetry/grafana/dashboards/statsd-node-health.json
#	docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
2026-04-29 20:07:43 +01:00
Pratik Mankawde
b65f91117f fix: address CI checks (prettier, docs.sh rename, levelization)
- Prettier formatting for markdown docs and OTelCollector header
- docs.sh rippled→xrpld renames in OTelCollector.cpp comments/strings
- Updated levelization ordering with new dependency edges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:03:22 +01:00
Pratik Mankawde
57ed0d9fd0 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 19:59:02 +01:00
Pratik Mankawde
51918ef868 Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 19:58:54 +01:00
Pratik Mankawde
c8674d61b8 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing
# Conflicts:
#	src/xrpld/app/consensus/RCLConsensus.cpp
2026-04-29 19:58:45 +01:00
Pratik Mankawde
cf18032e7f fix(telemetry): address clang-tidy errors on phase3 transaction tracing files
- Add [[maybe_unused]] to RAII span variables in TxQ.cpp
- Remove unused st.h include, add missing to_string header in TxQ.cpp
- Concatenate nested namespaces in TxQSpanNames.h, TxSpanNames.h,
  ConsensusReceiveTracing.h, PropagationHelpers.h, TxTracing.h
- Remove unused TraceContextPropagator.h include from RCLConsensus.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:57:42 +01:00
Pratik Mankawde
f18ddd95c1 fix(telemetry): address clang-tidy errors on phase4 consensus tracing files
- Add [[nodiscard]] to getConsensusTraceStrategy, getYays, getNays
- Add missing <string>, SpanGuard.h, SpanNames.h includes
- Fix widening cast placement (cast before arithmetic, not after)
- Replace nested ternary with lambda for const dir variable
- Add braces to if/else-if chains in Consensus.h
- Concatenate nested namespaces in ConsensusSpanNames.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 19:54:23 +01:00
Pratik Mankawde
769668579a Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	.codecov.yml
#	.github/scripts/levelization/results/ordering.txt
#	.github/workflows/reusable-clang-tidy-files.yml
#	CMakeLists.txt
#	OpenTelemetryPlan/00-tracing-fundamentals.md
#	OpenTelemetryPlan/01-architecture-analysis.md
#	OpenTelemetryPlan/02-design-decisions.md
#	OpenTelemetryPlan/03-implementation-strategy.md
#	OpenTelemetryPlan/04-code-samples.md
#	OpenTelemetryPlan/05-configuration-reference.md
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/07-observability-backends.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	OpenTelemetryPlan/POC_taskList.md
#	OpenTelemetryPlan/Phase2_taskList.md
#	OpenTelemetryPlan/Phase3_taskList.md
#	OpenTelemetryPlan/Phase4_taskList.md
#	OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
#	OpenTelemetryPlan/Phase5_taskList.md
#	OpenTelemetryPlan/presentation.md
#	cfg/xrpld-example.cfg
#	conan.lock
#	conanfile.py
#	cspell.config.yaml
#	docker/telemetry/TESTING.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
#	docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docker/telemetry/integration-test.sh
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/tempo.yaml
#	docker/telemetry/xrpld-telemetry.cfg
#	docs/build/telemetry.md
#	docs/telemetry-runbook.md
#	include/xrpl/core/ServiceRegistry.h
#	include/xrpl/protocol/detail/features.macro
#	include/xrpl/telemetry/SpanGuard.h
#	include/xrpl/telemetry/Telemetry.h
#	include/xrpl/telemetry/TraceContextPropagator.h
#	src/libxrpl/basics/MallocTrim.cpp
#	src/libxrpl/nodestore/backend/MemoryFactory.cpp
#	src/libxrpl/nodestore/backend/NuDBFactory.cpp
#	src/libxrpl/nodestore/backend/RocksDBFactory.cpp
#	src/libxrpl/telemetry/NullTelemetry.cpp
#	src/libxrpl/telemetry/Telemetry.cpp
#	src/libxrpl/telemetry/TelemetryConfig.cpp
#	src/tests/libxrpl/basics/MallocTrim.cpp
#	src/tests/libxrpl/telemetry/TelemetryConfig.cpp
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/consensus/RCLConsensus.h
#	src/xrpld/app/ledger/detail/BuildLedger.cpp
#	src/xrpld/app/ledger/detail/LedgerMaster.cpp
#	src/xrpld/app/main/Application.cpp
#	src/xrpld/app/misc/NetworkOPs.cpp
#	src/xrpld/consensus/Consensus.h
#	src/xrpld/overlay/detail/PeerImp.cpp
#	src/xrpld/rpc/detail/RPCHandler.cpp
#	src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
e6266e4e8d Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-04-29 18:20:23 +01:00
Pratik Mankawde
025620cc4e Merge branch 'pratik/otel-phase4-consensus-tracing' into pratik/otel-phase5-docs-deployment 2026-04-29 18:20:19 +01:00
Pratik Mankawde
692ce65f3e Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-04-29 18:20:04 +01:00
Pratik Mankawde
bb3732c22f Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-04-29 18:19:58 +01:00
Pratik Mankawde
87a8780b73 Merge branch 'pratik/otel-phase1c-rpc-integration' into pratik/otel-phase2-rpc-tracing
# Conflicts:
#	src/xrpld/rpc/detail/RPCHandler.cpp
2026-04-29 18:18:37 +01:00
Pratik Mankawde
78522ba18e Merge branch 'pratik/otel-phase1b-telemetry-infra' into pratik/otel-phase1c-rpc-integration 2026-04-29 18:17:02 +01:00
Pratik Mankawde
79fbb9c303 fix(telemetry): address clang-tidy errors on phase1c RPC integration files
- Concatenate nested namespaces in SpanNames.h, RpcSpanNames.h, GrpcSpanNames.h
- Remove unused InfoSub.h and NetworkOPs.h includes from RPCHandler.cpp
- Add missing <string_view> includes in RPCHandler.cpp and GRPCServer.cpp
- Replace nested ternary with if/else-if in RPCHandler.cpp
- Add IWYU pragma keep for json_body.h in ServerHandler.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 18:16:24 +01:00
Pratik Mankawde
e21e7b0d51 fix(telemetry): add missing PublicKey.h include for toBase58 in Application.cpp
Clang-tidy misc-include-cleaner requires direct includes for all used
symbols. Application.cpp calls toBase58(TokenType::NodePublic, ...) at
line 1359 but did not directly include PublicKey.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:51:35 +01:00
Pratik Mankawde
3dd2f34591 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
# Conflicts:
#	OpenTelemetryPlan/Phase3_taskList.md
#	docker/telemetry/grafana/provisioning/datasources/tempo.yaml
#	docs/telemetry-runbook.md
#	include/xrpl/proto/xrpl.proto
#	src/xrpld/app/consensus/RCLConsensus.cpp
#	src/xrpld/app/misc/detail/TxQ.cpp
2026-04-29 17:38:03 +01:00
Pratik Mankawde
521e0756e1 docs(telemetry): add cross-node trace propagation to runbook
Document the propagation infrastructure: send-side injection in
NetworkOPs/RCLConsensus, receive-side extraction in PeerImp via
PropagationHelpers.h and ConsensusReceiveTracing.h. Update
consensus receive span descriptions to reflect parent extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:33:10 +01:00
Pratik Mankawde
dbcd040180 fix(telemetry): fix Clang unused-variable and incomplete-type errors
- Add [[maybe_unused]] to RAII spans in TxQ.cpp
- Include Telemetry.h in RCLConsensus.cpp for complete type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
17e69e660c feat(telemetry): add toDisplayString() and use Title Case in consensus attributes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ef10c754b1 fix(telemetry): address code review findings for Phase 4 consensus tracing
Fix quorum attribute to use actual validator quorum instead of proposer
count, add missing ConsensusState::Expired handling in haveConsensus()
span, move ConsensusSpanNames.h to xrpld/consensus/ to resolve
levelization cycle, remove unused constants, enrich proposal receive
span with sequence, and correct stale documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
912890c104 fix: address PR review round 2 — event name constants, span timing
- Add cons_span::event namespace with disputeResolve and txIncluded
  constants; replace hardcoded strings in Consensus.h and RCLConsensus.cpp
- Move proposal.receive and validation.receive spans in PeerImp into
  shared_ptr captured by job lambdas so they measure checkPropose and
  checkValidation timing, not just message parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
ac68091bec code review changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
2773de7b54 docs(telemetry): mark Phase 4/4a consensus tracing tasks complete
Update Phase4_taskList.md and 06-implementation-phases.md to reflect
completed implementation of all remaining Phase 4/4a tasks (4.2-4.6,
4a.5, 4a.6, 4a.8). Update exit criteria and summary tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
5f7de1bb48 feat(telemetry): complete Phase 4 consensus tracing
Implement remaining Phase 4/4a consensus tracing tasks:

- Add consensus.phase.open span (open → closeLedger lifecycle)
- Add consensus.proposal.receive span in PeerImp with trusted attr
- Add consensus.validation.receive span in PeerImp with trusted/seq attrs
- Add tx_count attr on accept.apply, disputes_count on update_positions
- Add tx.included events with txId in doAccept transaction loop
- Enhance dispute.resolve event with yays/nays fields
- Add avalanche_threshold attr on update_positions span
- Reparent accept/accept.apply as children of round span via childSpan()

Also adds compile-time constants in ConsensusSpanNames.h and updates
the span hierarchy diagram.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
264516c37d docs update
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
eb84ac57c7 fix(telemetry): remove duplicate hashSpan(4-arg) from rebase
The 4-arg hashSpan overload was duplicated during a prior rebase
cascade — it appeared at both line 240 and line 305 in SpanGuard.cpp.
This would cause a linker error (multiple definition).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
Pratik Mankawde
021c81e978 docs(telemetry): document hashSpan factory, ConsensusSpanNames.h, and API details
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00