Pratik Mankawde
f6105ece98
feat(telemetry): add Phase 5 documentation, deployment configs, and integration tests
Add the observability stack deployment infrastructure and integration
test framework for verifying end-to-end trace export.
- Add Grafana dashboards: RPC performance, transaction overview,
  consensus health (pre-provisioned via dashboards.yaml)
- Add Prometheus config for spanmetrics collection from OTel Collector
- Update OTel Collector config with spanmetrics connector and
  prometheus exporter for RED metrics
- Add docker-compose services: prometheus, dashboard provisioning
- Add integration-test.sh with Tempo API-based span verification
  (replaces previous Jaeger-based approach)
- Add TESTING.md with step-by-step deployment and verification guide
- Add telemetry-runbook.md for production operations reference
- Add xrpld-telemetry.cfg sample configuration
- Add toDisplayString() for ConsensusMode (human-readable span values)
- Update Phase 2/3 task lists with known-issues sections
- Add Phase 5 integration test task list
- Add TraceContext protobuf fields for future relay propagation
- Wire telemetry lifecycle (setServiceInstanceId/start/stop) in
  Application.cpp
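A rough sketch of what a toDisplayString() for ConsensusMode might look like. The four enum values are assumed to match the existing ConsensusMode enum (redeclared here so the sketch compiles on its own), and the display strings are hypothetical; the real function may return different text.

```cpp
#include <string_view>

// Assumed to mirror the existing ConsensusMode enum; redeclared only so this
// sketch is self-contained.
enum class ConsensusMode { proposing, observing, wrongLedger, switchedLedger };

// Hypothetical human-readable labels for span attribute values.
constexpr std::string_view toDisplayString(ConsensusMode mode)
{
    switch (mode)
    {
        case ConsensusMode::proposing:
            return "proposing";
        case ConsensusMode::observing:
            return "observing";
        case ConsensusMode::wrongLedger:
            return "wrong ledger";
        case ConsensusMode::switchedLedger:
            return "switched ledger";
    }
    // Unreachable for valid enum values; keeps the compiler from warning.
    return "unknown";
}
```

Returning std::string_view from a constexpr switch keeps the mapping allocation-free, which matters if it runs on every consensus span.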
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:00:40 +01:00
Pratik Mankawde
e9c5c3520e
fix(telemetry): address Phase 1b code review findings
Redesign SpanGuard with pimpl idiom to hide all OpenTelemetry types
from public headers. Add global Telemetry accessor so SpanGuard factory
methods work without explicit Telemetry references. Add child/linked
span creation and cross-thread context propagation. Update plan docs
to reflect macro removal in favor of SpanGuard factory pattern.
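A minimal sketch of the pimpl-plus-global-accessor pattern described above, assuming no OpenTelemetry dependency: the Impl struct stores only a name and a parent name as a stand-in for the real OTel span and context. The Telemetry class, its install()/instance() accessor, and the span()/childSpan() factory names are hypothetical illustrations, not the actual API in Telemetry.h or SpanGuard.h.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical global accessor: installed once at startup, cleared at shutdown.
class Telemetry
{
public:
    static Telemetry* instance() { return instance_; }
    static void install(Telemetry* t) { instance_ = t; }
    bool enabled = true;

private:
    static inline Telemetry* instance_ = nullptr;
};

// ---- "Header" part: no OpenTelemetry types are visible here. ----
class SpanGuard
{
public:
    // Starts a root span through the global accessor; inactive if none is set.
    static SpanGuard span(std::string const& name);
    // Starts a span parented to this one (inactive if this guard is inactive).
    SpanGuard childSpan(std::string const& name) const;

    SpanGuard(SpanGuard&&) noexcept;
    SpanGuard& operator=(SpanGuard&&) noexcept;
    ~SpanGuard();  // defined where Impl is complete; a real one ends the span

    bool active() const { return impl_ != nullptr; }
    std::string name() const;
    std::string parentName() const;

private:
    struct Impl;  // only forward-declared in the public header
    explicit SpanGuard(std::unique_ptr<Impl> impl);
    std::unique_ptr<Impl> impl_;
};

// ---- "Source" part: the only place that would include OTel headers. ----
struct SpanGuard::Impl
{
    std::string name;
    std::string parent;  // stand-in for the real parent span context
};

SpanGuard::SpanGuard(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}
SpanGuard::SpanGuard(SpanGuard&&) noexcept = default;
SpanGuard& SpanGuard::operator=(SpanGuard&&) noexcept = default;
SpanGuard::~SpanGuard() = default;

SpanGuard SpanGuard::span(std::string const& name)
{
    Telemetry* t = Telemetry::instance();
    if (t == nullptr || !t->enabled)
        return SpanGuard(nullptr);  // no-op guard when telemetry is off
    return SpanGuard(std::make_unique<Impl>(Impl{name, ""}));
}

SpanGuard SpanGuard::childSpan(std::string const& name) const
{
    if (!impl_)
        return SpanGuard(nullptr);
    return SpanGuard(std::make_unique<Impl>(Impl{name, impl_->name}));
}

std::string SpanGuard::name() const { return impl_ ? impl_->name : ""; }
std::string SpanGuard::parentName() const { return impl_ ? impl_->parent : ""; }
```

Because the destructor and move operations are defined after Impl is complete, callers can hold a SpanGuard by value while only ever seeing the forward declaration.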
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:26:05 +01:00
Pratik Mankawde
3852b5ae4b
fix(telemetry): address review findings and PR #6437 comments
Critical fixes:
- Restore accidentally removed mallocTrim call and MallocTrim.h include
- Add missing shouldTraceLedger() to interface and all implementations
- Derive networkId/networkType from config_->NETWORK_ID (0=mainnet,
  1=testnet, 2=devnet) instead of leaving the defaults unpopulated
- Clamp sampling_ratio to [0.0, 1.0] in config parser
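The two config-derived fixes above can be sketched as small helpers. The function names are hypothetical, NETWORK_ID is assumed to be a std::uint32_t, and the "custom" label for IDs other than 0, 1 and 2 is an assumption not stated in the commit.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical helper: clamp a parsed sampling_ratio into [0.0, 1.0] so an
// out-of-range config value cannot reach the sampler.
double clampSamplingRatio(double configured)
{
    return std::clamp(configured, 0.0, 1.0);
}

// Hypothetical helper: map config_->NETWORK_ID to a networkType label.
// 0/1/2 come from the commit message; the fallback label is an assumption.
std::string networkTypeFor(std::uint32_t networkId)
{
    switch (networkId)
    {
        case 0:
            return "mainnet";
        case 1:
            return "testnet";
        case 2:
            return "devnet";
        default:
            return "custom";
    }
}
```

Note that std::clamp returns a NaN input unchanged, so a parser that can yield NaN would need a separate check.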
PR comment fixes:
- Rename rippled -> xrpld in service name defaults, getTracer() calls,
  Docker network, comments, and docs/build/telemetry.md
- Remove exporter config option (only otlp_http supported)
- Add trace_ledger and service_name to example config
- Clarify head-based sampling semantics in config comments
- Add filter descriptions for span intrinsic filters in Grafana datasource
- Add inline comments to Docker Compose services
Docker/config improvements:
- Remove deprecated version: "3.8" from docker-compose.yml
- Pin images: collector 0.121.0, grafana 11.5.2
- Add health_check extension to otel-collector-config.yaml
- Comment out Tempo metrics_generator remote_write (no Prometheus service)
- Add Prometheus datasource caveat in Grafana datasource config
Other:
- Revert unrelated formatting changes in ServiceRegistry.h
- Change Conan telemetry default to False (matches CMake OFF)
- Add CLAUDE.md-required docs (ASCII diagrams, usage examples,
  @note thread-safety annotations) to Telemetry.h and SpanGuard.h
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
ca2d616277
refactor(telemetry): remove Jaeger service, exporter, and datasource
Tempo is now the sole trace backend. Remove Jaeger all-in-one service
from docker-compose, otlp/jaeger exporter from OTel Collector config,
and Jaeger Grafana datasource provisioning file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00
Pratik Mankawde
88686af850
Phase 1b: Telemetry core infrastructure - CMake, Conan, SpanGuard, config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 14:25:31 +01:00