Commit Graph

23 Commits

Author SHA1 Message Date
Pratik Mankawde
98fc939851 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-06-01 15:01:19 +01:00
Pratik Mankawde
ba7e1f98e4 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:24:43 +01:00
Pratik Mankawde
e7dea147cd Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:18:36 +01:00
Pratik Mankawde
8d730b8b9a Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-29 18:16:35 +01:00
Pratik Mankawde
53e8c4d54e fix(docs): apply rename scripts to secure-OTel doc references
Two stray "rippled" tokens introduced by 43258e8d ("docs(telemetry):
add secure-OTel pipeline analysis…") were caught by check-rename in
CI. Re-run docs.sh to convert them to xrpld so the rename check
passes on PR #6425 (and downstream PR #6426 once merged up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 18:27:58 +01:00
Pratik Mankawde
43258e8dc0 docs(telemetry): add secure-OTel pipeline analysis and link into plan
Document the threat model and chosen hardening approach for the OTel
pipeline: mTLS to the collector as primary defense (across-network
deployment), NetworkPolicy as defense-in-depth, and source-side
validation plus per-peer rate limiting for protocol::TraceContext on
peer messages. Skips Basic Auth (wrong shape for multi-operator
fleet) and HTTP-gateway header stripping (rippled is P2P).

Wires the new doc into the master plan ToC, mermaid diagram, and
body section, plus cross-refs from the privacy section in
02-design-decisions.md and the collector config in
05-configuration-reference.md so readers reach it from natural
in-context entry points. Adds a backlink at the top of secure-OTel.md
to the master plan.

Adds 'exfiltration' and 'htpasswd' to cspell dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:33:16 +01:00
Pratik Mankawde
9adcc49171 fix: re-apply phase-7 doc/config changes lost during merge
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.

Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00
Pratik Mankawde
70d86d7ebf Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/09-data-collection-reference.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	docker/telemetry/docker-compose.yml
#	docker/telemetry/grafana/dashboards/statsd-network-traffic.json
#	docker/telemetry/otel-collector-config.yaml
#	src/xrpld/overlay/detail/PeerImp.cpp
2026-04-29 20:38:00 +01:00
Pratik Mankawde
81b47afde7 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
#	docker/telemetry/grafana/dashboards/statsd-network-traffic.json
#	docker/telemetry/grafana/dashboards/statsd-node-health.json
#	docker/telemetry/grafana/dashboards/statsd-rpc-pathfinding.json
2026-04-29 20:07:43 +01:00
Pratik Mankawde
b933e8ae00 feat(telemetry): add missing StatsD dashboard panels from production dashboard
Compared shared production Grafana dashboard against Phase 6 StatsD
dashboards and added 10 missing panels covering job execution/dequeue
timers, cache metrics, ledger publish gap, state duration rate, duplicate
traffic, and detailed traffic breakdown.

Node Health dashboard: 8 → 16 panels, plus quantile template variable.
Network Traffic dashboard: 8 → 10 panels, Total Network Bytes now rate().
Updated runbook, data collection reference, and implementation phases docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 14:02:27 +01:00
Pratik Mankawde
1a96f75954 fix(telemetry): apply rename script to phase 6 documentation
Replace remaining rippled/Ripple references with xrpld/XRPL in
data collection reference, implementation phases, and runbook docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 11:30:50 +01:00
Pratik Mankawde
cbbd6ebee2 feat(telemetry): add Phase 6 StatsD metrics, ledger/peer spans, and expanded dashboards
Integrate the existing StatsD metrics pipeline (beast::insight) into
the OpenTelemetry observability stack and add new trace spans for
ledger build/store/validate and peer proposal/validation receive.

Phase 5b — Ledger, peer, and transaction spans:
- Add ledger.build span with close time attributes in BuildLedger.cpp
- Add tx.apply span with tx_count/tx_failed in BuildLedger.cpp
- Add ledger.store and ledger.validate spans in LedgerMaster.cpp
- Add peer.proposal.receive span with trusted attribute in PeerImp.cpp
- Add peer.validation.receive span with ledger_hash, full, trusted
  attributes in PeerImp.cpp
- Add ledger-operations and peer-network Grafana dashboards

Phase 6 — StatsD metrics integration:
- Add StatsD UDP receiver (port 8125) to OTel Collector
- Add 5 StatsD Grafana dashboards: node health, network traffic,
  overlay traffic detail, ledger data sync, RPC pathfinding
- Add 09-data-collection-reference.md cataloging all metrics/spans
- Update existing dashboards with new span panels
- Expand telemetry runbook and integration test script
- Add codecov exclusions for telemetry modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:57 +01:00
Pratik Mankawde
5e8277f36a docs(telemetry): fix doc references to match pimpl architecture
Replace references to non-existent TracingInstrumentation.h with
SpanGuard.cpp pimpl implementation that actually exists on this branch.
Update conditional compilation section to describe the pimpl approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
26947267b1 docs(telemetry): update plan docs for FilteringSpanProcessor and discard()
Add DiscardFlag.h and FilteringSpanProcessor references to the file
tree, key files table, and implementation summary in OpenTelemetryPlan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
1fd971b78b fix(docs): apply rename scripts to OpenTelemetry plan docs
Run .github/scripts/rename/docs.sh to replace rippled → xrpld
references in all plan documentation files, fixing the check-rename
CI failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 13:57:38 +01:00
Pratik Mankawde
913a4b794c docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
ddf894dcb0 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 15:00:47 +01:00
Pratik Mankawde
7e149f7773 refactor(telemetry): remove residual Jaeger references across chain
Fix remaining Jaeger references that accumulated across intermediate
branches in the stacked PR chain. These were in files modified by
multiple phases where the per-branch fixes didn't cover all additions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 22:35:04 +01:00
Pratik Mankawde
fdec3ce5c4 Phase 8: Log-trace correlation with Loki and filelog receiver
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:37 +01:00
Pratik Mankawde
2f7064ace6 Phase 7: Native OTel metrics migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:24 +01:00
Pratik Mankawde
21192e9b3f Phase 6: StatsD metrics integration into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:31:07 +01:00
Pratik Mankawde
f135842071 docs: correct OTel overhead estimates against SDK benchmarks
Verified CPU, memory, and network overhead calculations against
official OTel C++ SDK benchmarks (969 CI runs) and source code
analysis. Key corrections:

- Span creation: 200-500ns → 500-1000ns (SDK BM_SpanCreation median
  ~1000ns; original estimate matched API no-op, not SDK path)
- Per-TX overhead: 2.4μs → 4.0μs (2.0% vs 1.2%; still within 1-3%)
- Active span memory: ~200 bytes → ~500-800 bytes (Span wrapper +
  SpanData + std::map attribute storage)
- Static memory: ~456KB → ~8.3MB (BatchSpanProcessor worker thread
  stack ~8MB was omitted)
- Total memory ceiling: ~2.3MB → ~10MB
- Memory success metric target: <5MB → <10MB
- AddEvent: 50-80ns → 100-200ns

Added Section 3.5.4 with links to all benchmark sources.
Updated presentation.md with matching corrections.
High-level conclusions unchanged (1-3% CPU, negligible consensus).

Also includes: review fixes, cross-document consistency improvements,
additional component tracing docs (PathFinding, TxQ, Validator, etc.),
context size corrections (32 → 25 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00
Pratik Mankawde
e482b56f58 Phase 1a: OpenTelemetry plan documentation
Add comprehensive planning documentation for the OpenTelemetry
distributed tracing integration:

- Tracing fundamentals and concepts
- Architecture analysis of rippled's tracing surface area
- Design decisions and trade-offs
- Implementation strategy and code samples
- Configuration reference
- Implementation phases roadmap
- Observability backend comparison
- POC task list and presentation materials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 15:55:26 +01:00