Commit Graph

13947 Commits

Author SHA1 Message Date
Pratik Mankawde
3f2fba1284 docs(telemetry): add deterministic TX trace ID design (Task 3.9)
Add trace_id = txHash[0:16] strategy so all nodes handling the same
transaction independently produce spans under the same trace_id,
combined with protobuf span_id propagation for parent-child ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:35:43 +01:00
Pratik Mankawde
e966e00944 refactor(telemetry): extract TX span name constants into TxSpanNames.h
Move scattered string literals from PeerImp.cpp and NetworkOPs.cpp into
compile-time constants in src/xrpld/telemetry/TxSpanNames.h. Follows
the same StaticStr/join() pattern established in Phase 1c for RPC spans.

Constants cover: span prefixes (tx), operations (receive, process),
attribute keys (hash, local, path, suppressed, status, peerId,
peerVersion), and values (sync, async, knownBad).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:35:43 +01:00
Pratik Mankawde
d7e4a2f55d docs(telemetry): update Phase 3/4 task lists for SpanGuard factory pattern
Replace references to old XRPL_TRACE_TX/CONSENSUS macros with
SpanGuard::span(TraceCategory, ...) factory calls introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:35:43 +01:00
Pratik Mankawde
d7033798a2 docs(telemetry): add Task 3.8 TX span peer version attribute spec
Adds xrpl.peer.version attribute to tx.receive spans for version-mismatch
correlation during network upgrades.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:35:43 +01:00
Pratik Mankawde
c7ef0eb647 feat(telemetry): Phase 3 transaction tracing with protobuf context propagation
- TraceContext protobuf message for cross-node trace propagation
  (added to TMTransaction, TMProposeSet, TMValidation at field 1001)
- TraceContextPropagator.h: inline extractFromProtobuf/injectToProtobuf
- PeerImp::handleTransaction: tx.receive span with peer.id, peer.version,
  tx.hash, tx.suppressed, tx.status attributes
- NetworkOPsImp::processTransaction: tx.process span with tx.hash,
  tx.local, tx.path attributes
- Tempo search filters for tx.hash, tx.local, tx.status
- Unit tests for TraceContextPropagator (round-trip, edge cases)
- Levelization: xrpld.app/overlay > xrpld.telemetry dependencies

Translated from macro API (XRPL_TRACE_TX/SET_ATTR) to SpanGuard factory
pattern introduced in Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:35:43 +01:00
Pratik Mankawde
55f8eba4ef docs(telemetry): add Task 2.9 PathFind instrumentation to Phase 2 task list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:34:28 +01:00
Pratik Mankawde
87ef7abd44 feat(telemetry): add PathFind tracing with 5 spans (Tasks 2.9/2.10)
Instrument the path finding subsystem with full span coverage:

- pathfind.request: wraps doPathFind() and doRipplePathFind() RPC handlers
- pathfind.compute: wraps PathRequest::doUpdate() with fast/normal attr
- pathfind.update_all: wraps PathRequestManager::updateAll() on ledger
  close with ledger_index attr
- pathfind.discover: wraps Pathfinder::findPaths() graph exploration
  with search_level attr
- pathfind.rank: wraps Pathfinder::computePathRanks() liquidity
  validation with num_paths attr

New file: PathFindSpanNames.h with compile-time constants following
the StaticStr/join() pattern from Phase 1c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 15:25:44 +01:00
Pratik Mankawde
c59ea77bdc fix(telemetry): address Phase 2 code review findings
- Move node health attribute strings to compile-time constants in
  SpanNames.h (attr::nodeAmendmentBlocked, attr::nodeServerState)
- Add Tempo search filters for node health attributes
- Remove unnecessary .c_str() on strOperatingMode() return
- Add samplingRatio clamping test (values > 1.0 and < 0.0)
- Fix Task 2.3 status: delivered in Phase 1c, not Phase 2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
810999a006 fix(telemetry): align TelemetryConfig tests with current API
- serviceName default is "xrpld" not "rippled"
- Remove references to nonexistent exporterType field
- Pass networkId (4th param) to setup_Telemetry()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
62690729f4 docs(telemetry): update Phase2 task list to reflect actual implementation
Mark deferred tasks (2.1→Phase 3, 2.5→low priority) with rationale.
Mark superseded tasks (2.2→Phase 1c SpanGuard factory). Add Task 2.7
for Grafana search filters. Update summary table with status column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
a4db27de48 feat(telemetry): add RPC trace filters and SpanGuard unit tests
- Grafana Tempo datasource: add rpc-command, rpc-status, rpc-role
  search filters for the Explore UI
- Unit tests: TelemetryConfig (config parsing defaults and sections),
  SpanGuardFactory (null guard safety, move semantics, discard, all
  factory methods)
- Test CMake registration with optional OTel linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
a06e96ea8a feat(telemetry): add node health attributes to RPC spans (Task 2.8)
Add amendment_blocked and server_state span attributes to every
rpc.command.* span so operators can correlate RPC behavior with node state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
c4884a3965 docs(telemetry): add Phase 2-5 task lists and appendix update
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:49:40 +01:00
Pratik Mankawde
cac45cc430 fix(telemetry): pass name_ through CallData::clone()
Without this, cloned CallData instances (created for the next incoming
gRPC request) would have an empty name_, making subsequent span attrs
blank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
759d425eb7 feat(telemetry): instrument missing critical/medium RPC span paths
Add spans to previously uninstrumented error and validation paths:

- gRPC: span in CallData::process(coro) with method name attribute,
  covers all 4 gRPC endpoints (GetLedger, GetLedgerData, etc.)
- WebSocket parse errors: span in onWSMessage() for invalid JSON
- WebSocket upgrade failures: span in onHandoff() try/catch
- Command dispatch rejections: span in doCommand() when fillHandler()
  fails (unknown command, too busy, permission denied)

New files: GrpcSpanNames.h (gRPC span constants)
Modified: GRPCServer.h (name_ member), RpcSpanNames.h (wsUpgrade op,
updated coverage diagram)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
aeacede345 docs(telemetry): replace text hierarchy with ASCII box diagrams
Follow project convention (PerfLog.h, SpanGuard.h) for documentation
diagrams. Show HTTP single, HTTP batch, and WebSocket span nesting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
7753778677 docs(telemetry): add RPC span coverage map to RpcSpanNames.h
Document the span hierarchy, covered paths, and known instrumentation
gaps directly in the header that developers reference when adding spans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
8671798864 refactor(telemetry): extract span name constants into modular headers
Centralise scattered string literals into compile-time constants using
StaticStr<N> and join() for dot-separated composition. Shared primitives
live in SpanNames.h; RPC-specific names in RpcSpanNames.h. Future modules
(consensus, peer, ledger) add their own *SpanNames.h without bloating
the central header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
29b879e731 refactor(telemetry): update RPC call sites to TraceCategory API
Replace rpcSpan(fullName) calls with span(TraceCategory::Rpc, prefix,
name). Add 'using namespace telemetry' to both RPC files so call sites
read cleanly without repeated namespace qualifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
cf92041435 feat(telemetry): replace tracing macros with SpanGuard factory pattern
Delete TracingInstrumentation.h and replace all XRPL_TRACE_* macro
invocations with direct SpanGuard::rpcSpan() calls. SpanGuard's pimpl
design and global Telemetry accessor eliminate the need for macro
wrappers and explicit Telemetry instance passing at call sites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
95a4f25596 fix(telemetry): address Phase 1c code review findings
TracingInstrumentation.h:
- Unify all span-creation macros to use std::optional<SpanGuard>
  (fixes type mismatch between XRPL_TRACE_SPAN and SET_ATTR)
- Wrap XRPL_TRACE_SET_ATTR/EXCEPTION in do-while(0) (dangling-else)
- Move macros outside namespace blocks (macros are global)
- Cache telemetry reference to avoid double-evaluation
- Remove leaked _xrpl_span_ intermediate variable
- Add @note tags for thread safety, scope, and usage constraints
- Add 3 usage examples per CLAUDE.md requirements

ServerHandler.cpp:
- Remove misleading rpc.request span from onRequest() (span ended
  before coroutine runs, producing orphan spans)
- Add rpc.http_request span to HTTP processSession() (runs inside
  the coroutine, correct parent for rpc.process/rpc.command spans)
- Add XRPL_TRACE_EXCEPTION and error status in both catch blocks
  (WS processSession and processRequest)

SpanGuard.h:
- Add null guards to all mutating methods (setOk, setStatus,
  setAttribute, addEvent, recordException) for safety after discard()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
a5feef83f2 removed presentation.md from root
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
18bb528bc5 Phase 1c: RPC integration - ServerHandler tracing, telemetry config wiring
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:48:12 +01:00
Pratik Mankawde
3c8ca1b836 docs(telemetry): fix doc references to match pimpl architecture
Replace references to non-existent TracingInstrumentation.h with
SpanGuard.cpp pimpl implementation that actually exists on this branch.
Update conditional compilation section to describe the pimpl approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 16:47:50 +01:00
Pratik Mankawde
7c8a3906a1 refactor(telemetry): replace per-category factory methods with TraceCategory enum
Replace rpcSpan(), txSpan(), consensusSpan(), peerSpan(), ledgerSpan()
with a single span(TraceCategory, prefix, name) factory method. Adding
a new traceable subsystem now requires only a new enum value and one
switch case — no new methods or header changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 17:40:37 +01:00
Pratik Mankawde
ad233846fa fix(telemetry): address Phase 1b code review findings
- SpanContext::isValid(): add inline no-op when XRPL_ENABLE_TELEMETRY
  is not defined, preventing a linker error if called in that path
- linkedSpan(): set kIsRootSpanKey on the StartSpanOptions parent
  context so linked spans start a genuinely independent sub-tree
  instead of silently becoming children of the current active span
- Telemetry::instance_: use std::atomic with acquire/release ordering
  to avoid a data race between start()/stop() and factory methods
  called from worker threads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 17:15:36 +01:00
Pratik Mankawde
714ac49f47 fix(telemetry): address Phase 1b code review findings
Redesign SpanGuard with pimpl idiom to hide all OpenTelemetry types
from public headers. Add global Telemetry accessor so SpanGuard factory
methods work without explicit Telemetry references. Add child/linked
span creation and cross-thread context propagation. Update plan docs
to reflect macro removal in favor of SpanGuard factory pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 16:31:50 +01:00
Pratik Mankawde
49f9fe52ac docs(telemetry): update plan docs for FilteringSpanProcessor and discard()
Add DiscardFlag.h and FilteringSpanProcessor references to the file
tree, key files table, and implementation summary in OpenTelemetryPlan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 17:46:03 +01:00
Pratik Mankawde
0de5a77605 feat(telemetry): add FilteringSpanProcessor and SpanGuard::discard()
Add span discard mechanism that drops unwanted spans before they enter
the batch export queue, saving both network bandwidth and storage.

FilteringSpanProcessor is a custom SpanProcessor decorator that wraps
BatchSpanProcessor. SpanGuard::discard() sets a thread-local flag
(tl_discardCurrentSpan) before calling Span::End(). The OTel SDK calls
OnEnd() synchronously on the same thread, where the flag is checked and
cleared to drop the span.

New file: DiscardFlag.h — zero-dependency header for the thread-local
flag, avoiding transitive include bloat from Telemetry.h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 17:43:58 +01:00
Pratik Mankawde
4830715d57 fix(telemetry): address review findings and PR #6437 comments
Critical fixes:
- Restore accidentally removed mallocTrim call and MallocTrim.h include
- Add missing shouldTraceLedger() to interface and all implementations
- Derive networkId/networkType from config_->NETWORK_ID (0=mainnet,
  1=testnet, 2=devnet) instead of leaving defaults unpopulated
- Clamp sampling_ratio to [0.0, 1.0] in config parser

PR comment fixes:
- Rename rippled -> xrpld in service name defaults, getTracer() calls,
  Docker network, comments, and docs/build/telemetry.md
- Remove exporter config option (only otlp_http supported)
- Add trace_ledger and service_name to example config
- Clarify head-based sampling semantics in config comments
- Add filter descriptions for span intrinsic filters in Grafana datasource
- Add inline comments to Docker Compose services

Docker/config improvements:
- Remove deprecated version: "3.8" from docker-compose.yml
- Pin images: collector 0.121.0, grafana 11.5.2
- Add health_check extension to otel-collector-config.yaml
- Comment out Tempo metrics_generator remote_write (no Prometheus service)
- Add Prometheus datasource caveat in Grafana datasource config

Other:
- Revert unrelated formatting changes in ServiceRegistry.h
- Change Conan telemetry default to False (matches CMake OFF)
- Add CLAUDE.md-required docs (ASCII diagrams, usage examples,
  @note thread-safety) to Telemetry.h and SpanGuard.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 17:07:01 +01:00
Pratik Mankawde
8a49f2c8c0 docs(telemetry): remove remaining Jaeger references from config reference
Remove duplicate otlp/tempo exporter block, duplicate tempo service
definition, and jaeger dependency from docker-compose example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:45:55 +01:00
Pratik Mankawde
bd30c4d630 refactor(telemetry): remove Jaeger service, exporter, and datasource
Tempo is now the sole trace backend. Remove Jaeger all-in-one service
from docker-compose, otlp/jaeger exporter from OTel Collector config,
and Jaeger Grafana datasource provisioning file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:37:58 +01:00
Pratik Mankawde
ce82e067bd Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:37:58 +01:00
Pratik Mankawde
193f5b39cb docs(telemetry): update plan docs for ServiceRegistry migration
Plan documents referenced Application.h and app_ for getTelemetry()
but the codebase now uses ServiceRegistry as the interface. Updated:

- 05-configuration-reference.md: getTelemetry() on ServiceRegistry,
  deferred serviceInstanceId pattern in ApplicationImp
- POC_taskList.md Task 4: target ServiceRegistry.h not Application.h,
  correct config file path and constructor pattern
- 04-code-samples.md: fix overlay() -> getOverlay(), rewrite JobQueue
  sample to reflect actual architecture (no app_ member)
- 03-implementation-strategy.md: fix file impact table path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:37:13 +01:00
Pratik Mankawde
db8111ef7c docs(telemetry): replace Jaeger with Tempo in architecture diagram
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:00:48 +01:00
Pratik Mankawde
913a4b794c docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
accea17e9d moved presentation.md file
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
c6fa00fbe3 Remove effort estimates from implementation phases document
Strip effort/risk columns from task tables and remove the §6.9 Effort
Summary section with its pie chart and resource requirements table.
Renumber §6.10 Quick Wins → §6.9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
bfb8f4f01a Add Phase 4a implementation status to plan docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
4b745a86b7 Appendix: add 00-tracing-fundamentals.md and POC_taskList.md to document index
Split document index into Plan Documents and Task Lists sections.
These files were introduced in this branch but missing from the index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
ddf894dcb0 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Sergey Kuznetsov
d52d735543 chore: Move codegen venv setup into build stage (#6617)
Co-authored-by: JCW <a1q123456@users.noreply.github.com>
Co-authored-by: Bart <bthomee@users.noreply.github.com>
2026-04-15 18:50:49 +00:00
Alex Kremer
6a0ce46755 chore: Enable most clang-tidy bugprone checks (#6929)
Co-authored-by: Ayaz Salikhov <mathbunnyru@users.noreply.github.com>
2026-04-14 20:24:21 +00:00
Bart
2f029a2120 refactor: Improve exception handling (#6540) (#6735)
Co-authored-by: Bart <11445373+bthomee@users.noreply.github.com>
2026-04-14 17:14:24 +00:00
Zhiyuan Wang
61fbde3a71 refactor: Remove unused notTooManyOffers function from NFTokenUtils (#6737) 2026-04-13 23:18:10 +00:00
Bart
e2e537b3bb fix: Change Tuning::bookOffers minimum limit to 1 (#6812)
Co-authored-by: Bart <11445373+bthomee@users.noreply.github.com>
2026-04-10 14:38:46 +00:00
Ed Hennis
a873250019 chore: Make pre-commit line ending conversions work on Windows (#6832) (#6833) 2026-04-10 10:12:52 +00:00
Gregory Tsipenyuk
56c9d1d497 fix: Add description for terLOCKED error (#6811) 2026-04-08 20:56:19 +00:00
yinyiqian1
d52dd29d20 fix: Address AI reviewer comments for Permission Delegation (#6675) 2026-04-08 20:22:19 +00:00
Mayukha Vadari
7793b5f10b refactor: Combine AMMHelpers and AMMUtils (#6733) 2026-04-08 17:38:33 +00:00