Pratik Mankawde
4d6ddb5f1f
Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
...
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 14:56:09 +01:00
Pratik Mankawde
ba7e1f98e4
Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
...
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:24:43 +01:00
Pratik Mankawde
e7dea147cd
Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
...
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:18:36 +01:00
Pratik Mankawde
8d730b8b9a
Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
...
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
43258e8dc0
docs(telemetry): add secure-OTel pipeline analysis and link into plan
...
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).
Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.
Adds 'exfiltration' and 'htpasswd' to cspell dictionary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
9adcc49171
fix: re-apply phase-7 doc/config changes lost during merge
...
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.
Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00
Pratik Mankawde
9e12e660fe
Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 20:25:13 +01:00
Pratik Mankawde
81b47afde7
Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
...
# Conflicts:
# OpenTelemetryPlan/06-implementation-phases.md
# OpenTelemetryPlan/08-appendix.md
# OpenTelemetryPlan/OpenTelemetryPlan.md
# docker/telemetry/grafana/dashboards/statsd-network-traffic.json
# docker/telemetry/grafana/dashboards/statsd-node-health.json
# docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
2026-04-29 20:07:43 +01:00
Pratik Mankawde
cbbd6ebee2
feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards
...
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.
Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
  attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards
Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
  overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
f6105ece98
feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
...
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.
- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
a9ee819ea1
docs(telemetry): add Phase 2-5 task lists and appendix update
...
Introduces task list documents for Phases 2 through 5, with Tempo
references (replacing Jaeger) and Task 2.8 dashboard parity spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:28:07 +01:00
Pratik Mankawde
88686af850
Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
1fd971b78b
fix(docs): apply rename scripts to OpenTelemetry plan docs
...
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in all plan documentation files, fixing the check-rename
CI failure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 13:57:38 +01:00
Pratik Mankawde
913a4b794c
docs: correct OTel overhead estimates against SDK benchmarks
...
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:
- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns
Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).
Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
4b745a86b7
Appendix: add 00-tracing-fundamentals.md and POC_taskList.md to document index
...
Split document index into Plan Documents and Task Lists sections.
These files were introduced in this branch but missing from the index.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
ddf894dcb0
Phase 1a: OpenTelemetry plan documentation
...
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:
- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
936c73982d
docs: update Phase 9 docs and dashboard for push_metrics.py parity gauges
...
- Add Task 9.7a to Phase9_taskList.md documenting new gauges
- Add metric tables to 09-data-collection-reference.md (server_info,
  build_info, complete_ledgers, db_metrics, extended cache/nodestore)
- Update metric counts from ~50 to ~68 in 06-implementation-phases.md
- Add OTel MetricsRegistry gauge reference to telemetry-runbook.md
- Add 11 new panels to system-node-health.json Grafana dashboard
  (server state, uptime, peers, validated seq, last close info,
  build version, complete ledgers, db sizes, historical fetch rate,
  peer disconnects)
- Fix leftover merge conflict marker in 08-appendix.md
- Add ripplex/mseconds to cspell dictionary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
892fee638a
Phase 9: Metric gap fill - nodestore, cache, TxQ, load factor dashboards
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:49 +01:00
Pratik Mankawde
fdec3ce5c4
Phase 8: Log-trace correlation with Loki and filelog receiver
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
2f7064ace6
Phase 7: Native OTel metrics migration
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
21192e9b3f
Phase 6: StatsD metrics integration into telemetry pipeline
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
f940290866
Phase 5: Documentation, deployment configs, integration test infrastructure
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:29:30 +01:00
Pratik Mankawde
945faac770
Phase 2: RPC tracing - span macros, attributes, WebSocket, command spans
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:22 +01:00
Pratik Mankawde
a7470615be
Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
...
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:28:12 +01:00
Pratik Mankawde
f135842071
docs: correct OTel overhead estimates against SDK benchmarks
...
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:
- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns
Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).
Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
2fb6124412
Appendix: add 00-tracing-fundamentals.md and POC_taskList.md to document index
...
Split document index into Plan Documents and Task Lists sections.
These files were introduced in this branch but missing from the index.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
e482b56f58
Phase 1a: OpenTelemetry plan documentation
...
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:
- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00