Pratik Mankawde
9e12e660fe
Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:25:13 +01:00
Pratik Mankawde
81b47afde7
Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
...
# Conflicts:
# OpenTelemetryPlan/06-implementation-phases.md
# OpenTelemetryPlan/08-appendix.md
# OpenTelemetryPlan/OpenTelemetryPlan.md
# docker/telemetry/grafana/dashboards/statsd-network-traffic.json
# docker/telemetry/grafana/dashboards/statsd-node-health.json
# docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
2026-04-29 20:07:43 +01:00
Pratik Mankawde
769668579a
Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
...
# Conflicts:
# .codecov.yml
# .github/scripts/levelization/results/ordering.txt
# .github/workflows/reusable-clang-tidy-files.yml
# CMakeLists.txt
# OpenTelemetryPlan/00-tracing-fundamentals.md
# OpenTelemetryPlan/01-architecture-analysis.md
# OpenTelemetryPlan/02-design-decisions.md
# OpenTelemetryPlan/03-implementation-strategy.md
# OpenTelemetryPlan/04-code-samples.md
# OpenTelemetryPlan/05-configuration-reference.md
# OpenTelemetryPlan/06-implementation-phases.md
# OpenTelemetryPlan/07-observability-backends.md
# OpenTelemetryPlan/08-appendix.md
# OpenTelemetryPlan/09-data-collection-reference.md
# OpenTelemetryPlan/OpenTelemetryPlan.md
# OpenTelemetryPlan/POC_taskList.md
# OpenTelemetryPlan/Phase2_taskList.md
# OpenTelemetryPlan/Phase3_taskList.md
# OpenTelemetryPlan/Phase4_taskList.md
# OpenTelemetryPlan/Phase5_IntegrationTest_taskList.md
# OpenTelemetryPlan/Phase5_taskList.md
# OpenTelemetryPlan/presentation.md
# cfg/xrpld-example.cfg
# conan.lock
# conanfile.py
# cspell.config.yaml
# docker/telemetry/TESTING.md
# docker/telemetry/docker-compose.yml
# docker/telemetry/grafana/dashboards/consensus-health.json
# docker/telemetry/grafana/dashboards/transaction-overview.json
# docker/telemetry/grafana/provisioning/dashboards/dashboards.yaml
# docker/telemetry/grafana/provisioning/datasources/tempo.yaml
# docker/telemetry/integration-test.sh
# docker/telemetry/otel-collector-config.yaml
# docker/telemetry/tempo.yaml
# docker/telemetry/xrpld-telemetry.cfg
# docs/build/telemetry.md
# docs/telemetry-runbook.md
# include/xrpl/core/ServiceRegistry.h
# include/xrpl/protocol/detail/features.macro
# include/xrpl/telemetry/SpanGuard.h
# include/xrpl/telemetry/Telemetry.h
# include/xrpl/telemetry/TraceContextPropagator.h
# src/libxrpl/basics/MallocTrim.cpp
# src/libxrpl/nodestore/backend/MemoryFactory.cpp
# src/libxrpl/nodestore/backend/NuDBFactory.cpp
# src/libxrpl/nodestore/backend/RocksDBFactory.cpp
# src/libxrpl/telemetry/NullTelemetry.cpp
# src/libxrpl/telemetry/Telemetry.cpp
# src/libxrpl/telemetry/TelemetryConfig.cpp
# src/tests/libxrpl/basics/MallocTrim.cpp
# src/tests/libxrpl/telemetry/TelemetryConfig.cpp
# src/xrpld/app/consensus/RCLConsensus.cpp
# src/xrpld/app/consensus/RCLConsensus.h
# src/xrpld/app/ledger/detail/BuildLedger.cpp
# src/xrpld/app/ledger/detail/LedgerMaster.cpp
# src/xrpld/app/main/Application.cpp
# src/xrpld/app/misc/NetworkOPs.cpp
# src/xrpld/consensus/Consensus.h
# src/xrpld/overlay/detail/PeerImp.cpp
# src/xrpld/rpc/detail/RPCHandler.cpp
# src/xrpld/rpc/detail/ServerHandler.cpp
2026-04-29 19:50:32 +01:00
Pratik Mankawde
3dd2f34591
Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
...
# Conflicts:
# OpenTelemetryPlan/Phase3_taskList.md
# docker/telemetry/grafana/provisioning/datasources/tempo.yaml
# docs/telemetry-runbook.md
# include/xrpl/proto/xrpl.proto
# src/xrpld/app/consensus/RCLConsensus.cpp
# src/xrpld/app/misc/detail/TxQ.cpp
2026-04-29 17:38:03 +01:00
Pratik Mankawde
8fb33b0818
feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
...
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 17:32:56 +01:00
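The hashSpan() note in commit 8fb33b0818 says consensus round spans share a trace_id across validators via the ledger hash. A minimal Python sketch of that idea follows, assuming the 16-byte trace ID is taken from the first half of the 32-byte ledger hash. The function name and the truncation rule are hypothetical and are not taken from the SpanGuard implementation.

```python
def trace_id_from_ledger_hash(ledger_hash_hex: str) -> str:
    """Derive a 128-bit trace ID (as 32 hex chars) from a 256-bit ledger hash.

    Hypothetical rule: keep the first 16 bytes of the hash. Any node that
    sees the same ledger hash computes the same trace ID, so spans from
    different validators land in the same trace without extra coordination.
    """
    raw = bytes.fromhex(ledger_hash_hex)
    if len(raw) != 32:
        raise ValueError("expected a 32-byte ledger hash")
    trace_id = raw[:16]
    # An all-zero trace ID is invalid in OpenTelemetry / W3C Trace Context.
    if trace_id == bytes(16):
        raise ValueError("derived trace ID is all zeros")
    return trace_id.hex()


# Two validators working on the same ledger derive the same trace ID.
ledger_hash = "ab" * 32  # placeholder hash, not a real ledger
validator_a = trace_id_from_ledger_hash(ledger_hash)
validator_b = trace_id_from_ledger_hash(ledger_hash)
```

Because the derivation is deterministic, no trace context needs to be exchanged between nodes for their round spans to correlate.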
Pratik Mankawde
3ed22580fe
fix(telemetry): address remaining clang-tidy and cspell CI failures
...
- Add "hicpp" to cspell dictionary for NOLINT annotations
- Concatenate nested namespaces in RpcSpanNames.h
- Fix include hygiene and nested ternary in RPCHandler.cpp
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 17:31:58 +01:00
Pratik Mankawde
b933e8ae00
feat(telemetry): add missing StatsD dashboard panels from production dashboard
...
Compared shared production Grafana dashboard against Phase 6 StatsD
dashboards and added 10 missing panels covering job execution/dequeue
timers, cache metrics, ledger publish gap, state duration rate, duplicate
traffic, and detailed traffic breakdown.
Node Health dashboard: 8 → 16 panels, plus quantile template variable.
Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate().
Updated runbook, data collection reference, and implementation phases docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:02:27 +01:00
Pratik Mankawde
f6105ece98
feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
...
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.
- Add Grafana dashboards: RPC performance, transaction overview,
consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
(replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
Application.cpp
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
34ee231d62
feat(telemetry): add Phase 4 consensus tracing with SpanGuard API
...
Instrument the consensus subsystem with OpenTelemetry spans covering
the full round lifecycle: round start, establish phase, proposal send,
ledger close, position updates, consensus check, accept, validation
send, and mode changes.
Key design choices adapted from the original Phase 4 implementation
to the new SpanGuard factory pattern introduced in Phase 3:
- Add SpanGuard::hashSpan() for category-gated hash-derived trace IDs
(consensus round spans share trace_id across validators via ledger hash)
- Add SpanGuard::addEvent() overload with key-value attribute pairs
(used for dispute.resolve events during position updates)
- Add ConsensusSpanNames.h with compile-time span name constants
following the colocated *SpanNames.h pattern from Phase 3
- Add consensusTraceStrategy config option ("deterministic"/"attribute")
for cross-node trace correlation strategy selection
- Use SpanGuard::linkedSpan() for follows-from relationships between
consecutive rounds and cross-thread validation spans
- Use SpanGuard::captureContext() for thread-safe context propagation
from consensus thread to jtACCEPT worker thread
Spans produced: consensus.round, consensus.proposal.send,
consensus.ledger_close, consensus.establish, consensus.update_positions,
consensus.check, consensus.accept, consensus.accept.apply,
consensus.validation.send, consensus.mode_change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:34:39 +01:00
Pratik Mankawde
a9ee819ea1
docs(telemetry): add Phase 2-5 task lists and appendix update
...
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
e9c5c3520e
fix(telemetry): address Phase 1b code review findings
...
Redesign SpanGuard with pimpl idiom to hide all OpenTelemetry types
from public headers. Add global Telemetry accessor so SpanGuard factory
methods work without explicit Telemetry references. Add child/linked
span creation and cross-thread context propagation. Update plan docs
to reflect macro removal in favor of SpanGuard factory pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
913a4b794c
docs: correct OTel overhead estimates against SDK benchmarks
...
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:
- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns
Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).
Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
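The per-TX percentages in commit 913a4b794c (2.4μs as 1.2%, 4.0μs as 2.0%) are consistent with a baseline transaction cost of about 200μs. That baseline is inferred from the two ratios and is not stated in the commit. A small Python check of the arithmetic under that assumption:

```python
# Assumed baseline: inferred from 2.4us == 1.2% and 4.0us == 2.0%,
# not a figure quoted in the commit message.
BASELINE_TX_US = 200.0


def overhead_percent(tracing_us: float, baseline_us: float = BASELINE_TX_US) -> float:
    """Tracing overhead as a percentage of the baseline per-transaction time."""
    return 100.0 * tracing_us / baseline_us


original_estimate = overhead_percent(2.4)   # estimate before the correction
corrected_estimate = overhead_percent(4.0)  # estimate after the correction

# The commit's stated target band for CPU overhead is 1-3%.
within_target = 1.0 <= corrected_estimate <= 3.0
```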
Pratik Mankawde
ddf894dcb0
Phase 1a: OpenTelemetry plan documentation
...
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:
- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Gregory Tsipenyuk
dfcad69155
feat: Add MPT support to DEX (#5285)
2026-04-08 16:17:37 +00:00
Pratik Mankawde
936c73982d
docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
...
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
(server state, uptime, peers, validated seq, last close info,
build version, complete ledgers, db sizes, historical fetch rate,
peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
fdec3ce5c4
Phase 8: Log-trace correlation with Loki and filelog receiver
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
f940290866
Phase 5: Documentation, deployment configs, integration test infrastructure
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
a127711b86
Phase 4: Consensus tracing - round lifecycle, proposals, validations, close time
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:33 +01:00
Pratik Mankawde
88d17e4c04
Phase 3: Transaction tracing - protobuf context propagation, PeerImp, NetworkOPs
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:27 +01:00
Pratik Mankawde
945faac770
Phase 2: RPC tracing - span macros, attributes, WebSocket, command spans
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
f135842071
docs: correct OTel overhead estimates against SDK benchmarks
...
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:
- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns
Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).
Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
e482b56f58
Phase 1a: OpenTelemetry plan documentation
...
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:
- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Ayaz Salikhov
dfed0481f7
docs: Rewrite conan docs for custom recipes (#6647)
2026-03-25 14:25:33 +00:00
Jingchen
b1e5ba0518
feat: Add code generator for transactions and ledger entries (#6443)
...
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Co-authored-by: Bart <bthomee@users.noreply.github.com>
2026-03-18 21:11:51 +00:00
Pratik Mankawde
5ae97fa8ae
refactor: Add no-ASAN macro for Throw statements (#6373)
...
Throwing exceptions sometimes confuses ASAN, because it cannot keep track of the stack frames. This change therefore adds a macro to skip instrumentation around the `Throw` function.
2026-03-17 13:10:32 +00:00
Pratik Mankawde
983816248a
fix: Switch to boost::coroutine2 (#6372)
...
ASAN wasn't able to keep track of `boost::coroutine` context switches, which led to many false positives. By switching to `boost::coroutine2` and `ucontext`, ASAN is able to know about the context switches advertised by the `boost::fiber` class, which in turn leads to cleaner ASAN analysis.
2026-03-16 15:34:15 +00:00
Alex Kremer
afc660a1b5
refactor: Fix clang-tidy bugprone-empty-catch check (#6419)
...
This change fixes or suppresses instances detected by the `bugprone-empty-catch` clang-tidy check.
2026-03-02 17:08:56 +00:00
Sergey Kuznetsov
0fd237d707
chore: Add nix development environment (#6314)
...
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-23 20:10:07 -05:00
Sergey Kuznetsov
31302877ab
ci: Add clang tidy workflow to ci (#6369)
2026-02-19 14:06:44 -05:00
Mayukha Vadari
bf4674f42b
refactor: Fix spelling issues in tests (#6199)
...
This change removes the `src/tests` exception from the `cspell` config and fixes all the issues that arise as a result. No functionality/test change.
2026-02-06 20:30:22 +00:00