Add POC for OTel for RPC layer

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Pratik Mankawde
2026-02-23 15:10:48 +00:00
parent 441ccd514c
commit 9bb7ea04cb
19 changed files with 1465 additions and 108 deletions


@@ -126,6 +126,13 @@ if (rocksdb)
target_link_libraries(xrpl_libs INTERFACE RocksDB::rocksdb)
endif ()
option(telemetry "Enable OpenTelemetry tracing" OFF)
if (telemetry)
find_package(opentelemetry-cpp CONFIG REQUIRED)
add_compile_definitions(XRPL_ENABLE_TELEMETRY)
message(STATUS "OpenTelemetry tracing enabled")
endif ()
# Work around changes to Conan recipe for now.
if (TARGET nudb::core)
set(nudb nudb::core)


@@ -0,0 +1,500 @@
# OpenTelemetry POC Task List
> **Goal**: Build a minimal end-to-end proof of concept that demonstrates distributed tracing in rippled. A successful POC will show RPC request traces flowing from rippled through an OTel Collector into Jaeger, viewable in a browser UI.
>
> **Scope**: RPC tracing only (highest value, lowest risk per the [CRAWL phase](./06-implementation-phases.md#6102-quick-wins-immediate-value) in the implementation phases). No cross-node P2P context propagation or consensus tracing in the POC.
### Related Plan Documents
| Document | Relevance to POC |
|----------|-----------------|
| [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) | Core concepts: traces, spans, context propagation, sampling |
| [01-architecture-analysis.md](./01-architecture-analysis.md) | RPC request flow (§1.5), key trace points (§1.6), instrumentation priority (§1.7) |
| [02-design-decisions.md](./02-design-decisions.md) | SDK selection (§2.1), exporter config (§2.2), span naming (§2.3), attribute schema (§2.4), coexistence with PerfLog/Insight (§2.6) |
| [03-implementation-strategy.md](./03-implementation-strategy.md) | Directory structure (§3.1), key principles (§3.2), performance overhead (§3.3-3.6), conditional compilation (§3.7.3), code intrusiveness (§3.9) |
| [04-code-samples.md](./04-code-samples.md) | Telemetry interface (§4.1), SpanGuard (§4.2), macros (§4.3), RPC instrumentation (§4.5.3) |
| [05-configuration-reference.md](./05-configuration-reference.md) | rippled config (§5.1), config parser (§5.2), Application integration (§5.3), CMake (§5.4), Collector config (§5.5), Docker Compose (§5.6), Grafana (§5.8) |
| [06-implementation-phases.md](./06-implementation-phases.md) | Phase 1 core tasks (§6.2), Phase 2 RPC tasks (§6.3), quick wins (§6.10), definition of done (§6.11) |
| [07-observability-backends.md](./07-observability-backends.md) | Jaeger dev setup (§7.1), Grafana dashboards (§7.6), alert rules (§7.6.3) |
---
## Task 0: Docker Observability Stack Setup
**Objective**: Stand up the backend infrastructure to receive, store, and display traces.
**What to do**:
- Create `docker/telemetry/docker-compose.yml` in the repo with three services:
1. **OpenTelemetry Collector** (`otel/opentelemetry-collector-contrib:latest`)
- Expose ports `4317` (OTLP gRPC) and `4318` (OTLP HTTP)
- Expose port `13133` (health check)
- Mount a config file `docker/telemetry/otel-collector-config.yaml`
2. **Jaeger** (`jaegertracing/all-in-one:latest`)
- Expose port `16686` (UI) and `14250` (gRPC collector)
- Set env `COLLECTOR_OTLP_ENABLED=true`
3. **Grafana** (`grafana/grafana:latest`) — optional but useful
- Expose port `3000`
- Enable anonymous admin access for local dev (`GF_AUTH_ANONYMOUS_ENABLED=true`, `GF_AUTH_ANONYMOUS_ORG_ROLE=Admin`)
- Provision Jaeger as a data source via `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`
- Create `docker/telemetry/otel-collector-config.yaml`:
```yaml
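receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 100
exporters:
  # `debug` replaces the `logging` exporter, which recent Collector releases no longer ship.
  debug:
    verbosity: detailed
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
extensions:
  # Serves the health endpoint on port 13133 used in the verification step.
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp/jaeger]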
```
- Create Grafana Jaeger datasource provisioning file at `docker/telemetry/grafana/provisioning/datasources/jaeger.yaml`:
```yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
```
**Verification**: Run `docker compose -f docker/telemetry/docker-compose.yml up -d`, then:
- `curl http://localhost:13133` returns healthy (Collector; this requires the `health_check` extension to be enabled in the Collector config)
- `http://localhost:16686` opens Jaeger UI (no traces yet)
- `http://localhost:3000` opens Grafana (optional)
**Reference**:
- [05-configuration-reference.md §5.5](./05-configuration-reference.md) — Collector config (dev YAML with Jaeger exporter)
- [05-configuration-reference.md §5.6](./05-configuration-reference.md) — Docker Compose development environment
- [07-observability-backends.md §7.1](./07-observability-backends.md) — Jaeger quick start and backend selection
- [05-configuration-reference.md §5.8](./05-configuration-reference.md) — Grafana datasource provisioning and dashboards
---
## Task 1: Add OpenTelemetry C++ SDK Dependency
**Objective**: Make `opentelemetry-cpp` available to the build system.
**What to do**:
- Edit `conanfile.py` to add `opentelemetry-cpp` as an **optional** dependency. The gRPC otel plugin flag (`"grpc/*:otel_plugin": False`) in the existing conanfile may need to remain false — we pull the OTel SDK separately.
- Add a Conan option: `with_telemetry = [True, False]` defaulting to `False`
- When `with_telemetry` is `True`, add `opentelemetry-cpp` to `self.requires()`
- Required OTel Conan components: `opentelemetry-cpp` (which bundles api, sdk, and exporters). If the package isn't in Conan Center, consider using `FetchContent` in CMake or building from source as a fallback.
- Edit `CMakeLists.txt`:
- Add option: `option(XRPL_ENABLE_TELEMETRY "Enable OpenTelemetry tracing" OFF)`
- When ON, `find_package(opentelemetry-cpp CONFIG REQUIRED)` and add compile definition `XRPL_ENABLE_TELEMETRY`
- When OFF, do nothing (zero build impact)
- Verify the build succeeds with `-DXRPL_ENABLE_TELEMETRY=OFF` (no regressions) and with `-DXRPL_ENABLE_TELEMETRY=ON` (SDK links successfully).
**Key files**:
- `conanfile.py` (repository root)
- `CMakeLists.txt` (repository root)
**Reference**:
- [05-configuration-reference.md §5.4](./05-configuration-reference.md) — CMake integration, `FindOpenTelemetry.cmake`, `XRPL_ENABLE_TELEMETRY` option
- [03-implementation-strategy.md §3.2](./03-implementation-strategy.md) — Key principle: zero-cost when disabled via compile-time flags
- [02-design-decisions.md §2.1](./02-design-decisions.md) — SDK selection rationale and required OTel components
---
## Task 2: Create Core Telemetry Interface and NullTelemetry
**Objective**: Define the `Telemetry` abstract interface and a no-op implementation so the rest of the codebase can reference telemetry without hard-depending on the OTel SDK.
**What to do**:
- Create `include/xrpl/telemetry/Telemetry.h`:
- Define `namespace xrpl::telemetry`
- Define `struct Telemetry::Setup` holding: `enabled`, `exporterEndpoint`, `samplingRatio`, `serviceName`, `serviceVersion`, `serviceInstanceId`, `traceRpc`, `traceTransactions`, `traceConsensus`, `tracePeer`
- Define abstract `class Telemetry` with:
- `virtual void start() = 0;`
- `virtual void stop() = 0;`
- `virtual bool isEnabled() const = 0;`
- `virtual nostd::shared_ptr<Tracer> getTracer(string_view name = "rippled") = 0;`
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, SpanKind kind = kInternal) = 0;`
- `virtual nostd::shared_ptr<Span> startSpan(string_view name, Context const& parentContext, SpanKind kind = kInternal) = 0;`
- `virtual bool shouldTraceRpc() const = 0;`
- `virtual bool shouldTraceTransactions() const = 0;`
- `virtual bool shouldTraceConsensus() const = 0;`
- Factory: `std::unique_ptr<Telemetry> make_Telemetry(Setup const&, beast::Journal);`
- Config parser: `Telemetry::Setup setup_Telemetry(Section const&, std::string const& nodePublicKey, std::string const& version);`
- Create `include/xrpl/telemetry/SpanGuard.h`:
- RAII guard that takes a `nostd::shared_ptr<Span>`, creates a `Scope`, and calls `span->End()` in its destructor.
- Convenience: `setAttribute()`, `setOk()`, `setStatus()`, `addEvent()`, `recordException()`, `context()`
- See [04-code-samples.md](./04-code-samples.md) §4.2 for the full implementation.
- Create `src/libxrpl/telemetry/NullTelemetry.cpp`:
- Implements `Telemetry` with all no-ops.
- `isEnabled()` returns `false`, `startSpan()` returns a noop span.
- This is used when `XRPL_ENABLE_TELEMETRY` is OFF or `enabled=0` in config.
- Guard all OTel SDK headers behind `#ifdef XRPL_ENABLE_TELEMETRY`. The `NullTelemetry` implementation should compile without the OTel SDK present.
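
As a rough illustration of the null-object idea in this task, the sketch below shows a trimmed interface plus a no-op implementation that compiles without any OTel headers. Everything in `namespace sketch` is hypothetical: `Span`, `NoopSpan`, and `makeTelemetry` are stand-ins, and the real interface would return `nostd::shared_ptr<Span>` from the OTel API as listed above.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <string_view>

namespace sketch {

// Hypothetical stand-in for the OTel span type.
struct Span
{
    virtual ~Span() = default;
    virtual void setAttribute(std::string_view key, std::string_view value) = 0;
    virtual void end() = 0;
};

// A span that ignores everything; this is what NullTelemetry hands out.
struct NoopSpan final : Span
{
    void setAttribute(std::string_view, std::string_view) override {}
    void end() override {}
};

// Trimmed-down version of the abstract interface described in this task.
class Telemetry
{
public:
    struct Setup
    {
        bool enabled = false;
        bool traceRpc = true;
        std::string serviceName = "rippled";
    };

    virtual ~Telemetry() = default;
    virtual void start() = 0;
    virtual void stop() = 0;
    virtual bool isEnabled() const = 0;
    virtual bool shouldTraceRpc() const = 0;
    virtual std::shared_ptr<Span> startSpan(std::string_view name) = 0;
};

// No-op implementation: never enabled, never traces, returns noop spans.
class NullTelemetry final : public Telemetry
{
public:
    void start() override {}
    void stop() override {}
    bool isEnabled() const override { return false; }
    bool shouldTraceRpc() const override { return false; }
    std::shared_ptr<Span> startSpan(std::string_view) override
    {
        // One shared instance is enough because the span holds no state.
        static auto const noop = std::make_shared<NoopSpan>();
        assert(noop != nullptr);
        return noop;
    }
};

// Factory sketch: with no real backend compiled in, always fall back to
// the null object so callers never have to check for nullptr.
inline std::unique_ptr<Telemetry> makeTelemetry(Telemetry::Setup const&)
{
    return std::make_unique<NullTelemetry>();
}

}  // namespace sketch
```

Call sites can then use `makeTelemetry(setup)->startSpan("rpc.request")` unconditionally; whether the real implementation should share this exact shape is a design decision for the POC.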
**Key new files**:
- `include/xrpl/telemetry/Telemetry.h`
- `include/xrpl/telemetry/SpanGuard.h`
- `src/libxrpl/telemetry/NullTelemetry.cpp`
**Reference**:
- [04-code-samples.md §4.1](./04-code-samples.md) — Full `Telemetry` interface with `Setup` struct, lifecycle, tracer access, span creation, and component filtering methods
- [04-code-samples.md §4.2](./04-code-samples.md) — Full `SpanGuard` RAII implementation and `NullSpanGuard` no-op class
- [03-implementation-strategy.md §3.1](./03-implementation-strategy.md) — Directory structure: `include/xrpl/telemetry/` for headers, `src/libxrpl/telemetry/` for implementation
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation and zero-cost compile-time disabled pattern
---
## Task 3: Implement OTel-Backed Telemetry
**Objective**: Implement the real `Telemetry` class that initializes the OTel SDK, configures the OTLP exporter and batch processor, and creates tracers/spans.
**What to do**:
- Create `src/libxrpl/telemetry/Telemetry.cpp` (compiled only when `XRPL_ENABLE_TELEMETRY=ON`):
- `class TelemetryImpl : public Telemetry` that:
- In `start()`: creates a `TracerProvider` with:
- Resource attributes: `service.name`, `service.version`, `service.instance.id`
- An `OtlpGrpcExporter` pointed at `setup.exporterEndpoint` (default `localhost:4317`)
- A `BatchSpanProcessor` with configurable batch size and delay
- A `TraceIdRatioBasedSampler` using `setup.samplingRatio`
- Sets the global `TracerProvider`
- In `stop()`: calls `ForceFlush()` then shuts down the provider
- In `startSpan()`: delegates to `getTracer()->StartSpan(name, ...)`
- `shouldTraceRpc()` etc. read from `Setup` fields
- Create `src/libxrpl/telemetry/TelemetryConfig.cpp`:
- `setup_Telemetry()` parses the `[telemetry]` config section from `xrpld.cfg`
- Maps config keys: `enabled`, `exporter`, `endpoint`, `sampling_ratio`, `trace_rpc`, `trace_transactions`, `trace_consensus`, `trace_peer`
- Wire `make_Telemetry()` factory:
- If `setup.enabled` is true AND `XRPL_ENABLE_TELEMETRY` is defined: return `TelemetryImpl`
- Otherwise: return `NullTelemetry`
- Add telemetry source files to CMake. When `XRPL_ENABLE_TELEMETRY=ON`, compile `Telemetry.cpp` and `TelemetryConfig.cpp` and link against `opentelemetry-cpp::api`, `opentelemetry-cpp::sdk`, `opentelemetry-cpp::otlp_grpc_exporter`. When OFF, compile only `NullTelemetry.cpp`.
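
The config-parsing and factory-selection steps above could look roughly like the following. This is a sketch under assumptions: `Section` is modeled as a plain `std::map` rather than rippled's own section class, `Setup` is a hypothetical subset of `Telemetry::Setup`, and `useRealBackend` only mirrors the "compiled in AND enabled" rule as a predicate.

```cpp
#include <algorithm>
#include <exception>
#include <map>
#include <string>

namespace sketch {

// Hypothetical subset of Telemetry::Setup from Task 2.
struct Setup
{
    bool enabled = false;
    std::string endpoint = "localhost:4317";
    double samplingRatio = 1.0;
    bool traceRpc = true;
};

// Hypothetical stand-in for a parsed [telemetry] section.
using Section = std::map<std::string, std::string>;

inline bool asBool(Section const& s, std::string const& key, bool fallback)
{
    auto const it = s.find(key);
    if (it == s.end())
        return fallback;
    return it->second == "1" || it->second == "true";
}

inline Setup setupTelemetry(Section const& s)
{
    Setup setup;
    setup.enabled = asBool(s, "enabled", setup.enabled);
    setup.traceRpc = asBool(s, "trace_rpc", setup.traceRpc);

    if (auto const it = s.find("endpoint"); it != s.end())
        setup.endpoint = it->second;

    if (auto const it = s.find("sampling_ratio"); it != s.end())
    {
        try
        {
            // Clamp out-of-range numbers into [0, 1].
            setup.samplingRatio = std::clamp(std::stod(it->second), 0.0, 1.0);
        }
        catch (std::exception const&)
        {
            // Unparseable value: keep the default rather than abort startup.
        }
    }
    return setup;
}

// Factory rule reduced to a predicate: the real backend is chosen only
// if it was compiled in AND enabled at runtime.
inline bool useRealBackend(Setup const& setup)
{
#ifdef XRPL_ENABLE_TELEMETRY
    return setup.enabled;
#else
    (void)setup;
    return false;
#endif
}

}  // namespace sketch
```

Falling back to defaults on a malformed `sampling_ratio` is one possible policy; rejecting the config at startup would be an equally defensible choice.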
**Key new files**:
- `src/libxrpl/telemetry/Telemetry.cpp`
- `src/libxrpl/telemetry/TelemetryConfig.cpp`
**Key modified files**:
- `CMakeLists.txt` (add telemetry library target)
**Reference**:
- [04-code-samples.md §4.1](./04-code-samples.md) — `Telemetry` interface that `TelemetryImpl` must implement
- [05-configuration-reference.md §5.2](./05-configuration-reference.md) — `setup_Telemetry()` config parser implementation
- [02-design-decisions.md §2.2](./02-design-decisions.md) — OTLP/gRPC exporter config (endpoint, TLS options)
- [02-design-decisions.md §2.4.1](./02-design-decisions.md) — Resource attributes: `service.name`, `service.version`, `service.instance.id`, `xrpl.network.id`
- [03-implementation-strategy.md §3.4](./03-implementation-strategy.md) — Per-operation CPU costs and overhead budget for span creation
- [03-implementation-strategy.md §3.5](./03-implementation-strategy.md) — Memory overhead: static (~456 KB) and dynamic (~1.2 MB) budgets
---
## Task 4: Integrate Telemetry into Application Lifecycle
**Objective**: Wire the `Telemetry` object into `Application` so all components can access it.
**What to do**:
- Edit `src/xrpld/app/main/Application.h`:
- Forward-declare `namespace xrpl::telemetry { class Telemetry; }`
- Add pure virtual method: `virtual telemetry::Telemetry& getTelemetry() = 0;`
- Edit `src/xrpld/app/main/Application.cpp` (the `ApplicationImp` class):
- Add member: `std::unique_ptr<telemetry::Telemetry> telemetry_;`
- In the constructor, after config is loaded and node identity is known:
```cpp
auto const telemetrySection = config_->section("telemetry");
auto telemetrySetup = telemetry::setup_Telemetry(
telemetrySection,
toBase58(TokenType::NodePublic, nodeIdentity_.publicKey()),
BuildInfo::getVersionString());
telemetry_ = telemetry::make_Telemetry(telemetrySetup, logs_->journal("Telemetry"));
```
- In `start()`: call `telemetry_->start()` early
- In `stop()` or destructor: call `telemetry_->stop()` late (to flush pending spans)
- Implement `getTelemetry()` override: return `*telemetry_`
- Add `[telemetry]` section to the example config `cfg/rippled-example.cfg`:
```ini
# [telemetry]
# enabled=1
# endpoint=localhost:4317
# sampling_ratio=1.0
# trace_rpc=1
```
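
The "start early, stop late" ordering can be illustrated with a small stand-alone model. The classes below are hypothetical stand-ins (not `ApplicationImp` or the real `Telemetry`), and the event log exists only to make the ordering visible.

```cpp
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// Records lifecycle events so the ordering can be inspected.
using EventLog = std::vector<std::string>;

// Hypothetical minimal telemetry object; only the lifecycle matters here.
class Telemetry
{
public:
    explicit Telemetry(EventLog& log) : log_(log) {}
    void start() { log_.push_back("telemetry.start"); }
    void stop() { log_.push_back("telemetry.stop"); }  // would flush spans

private:
    EventLog& log_;
};

// Hypothetical stand-in for the application object that owns telemetry.
class App
{
public:
    explicit App(EventLog& log)
        : log_(log), telemetry_(std::make_unique<Telemetry>(log))
    {
    }

    void start()
    {
        telemetry_->start();  // early: before anything that may emit spans
        log_.push_back("rpc.start");
    }

    void stop()
    {
        log_.push_back("rpc.stop");
        telemetry_->stop();  // late: after the last span producer is done
    }

    Telemetry& getTelemetry() { return *telemetry_; }

private:
    EventLog& log_;
    std::unique_ptr<Telemetry> telemetry_;
};

}  // namespace sketch
```

Stopping telemetry last gives spans created during shutdown a chance to be exported before the process exits.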
**Key modified files**:
- `src/xrpld/app/main/Application.h`
- `src/xrpld/app/main/Application.cpp`
- `cfg/rippled-example.cfg` (or equivalent example config)
**Reference**:
- [05-configuration-reference.md §5.3](./05-configuration-reference.md) — `ApplicationImp` changes: member declaration, constructor init, `start()`/`stop()` wiring, `getTelemetry()` override
- [05-configuration-reference.md §5.1](./05-configuration-reference.md) — `[telemetry]` config section format and all option defaults
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact assessment: `Application.cpp` ~15 lines added, ~3 changed (Low risk)
---
## Task 5: Create Instrumentation Macros
**Objective**: Define convenience macros that make instrumenting code one-liners, and that compile to zero-cost no-ops when telemetry is disabled.
**What to do**:
- Create `src/xrpld/telemetry/TracingInstrumentation.h`:
- When `XRPL_ENABLE_TELEMETRY` is defined:
```cpp
// Requires <optional>. Both span macros declare an optional guard so that
// XRPL_TRACE_SET_ATTR and XRPL_TRACE_EXCEPTION work after either one.
#define XRPL_TRACE_SPAN(telemetry, name) \
    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
    _xrpl_guard_.emplace((telemetry).startSpan(name))
#define XRPL_TRACE_RPC(telemetry, name) \
    std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
    if ((telemetry).shouldTraceRpc()) \
        _xrpl_guard_.emplace((telemetry).startSpan(name))
// do/while(0) keeps these safe inside unbraced if/else statements.
#define XRPL_TRACE_SET_ATTR(key, value) \
    do \
    { \
        if (_xrpl_guard_.has_value()) \
            _xrpl_guard_->setAttribute(key, value); \
    } while (0)
#define XRPL_TRACE_EXCEPTION(e) \
    do \
    { \
        if (_xrpl_guard_.has_value()) \
            _xrpl_guard_->recordException(e); \
    } while (0)
```
- When `XRPL_ENABLE_TELEMETRY` is NOT defined, all macros expand to `((void)0)`
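
To show how the enabled and disabled branches can coexist behind the same call sites, here is a self-contained sketch. All names are hypothetical: `SKETCH_ENABLE_TELEMETRY` stands in for `XRPL_ENABLE_TELEMETRY`, and `sketch::SpanGuard` only records attributes in memory instead of talking to the OTel SDK.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical switch standing in for XRPL_ENABLE_TELEMETRY; define it
// as 0 before this point to compile the no-op branch instead.
#ifndef SKETCH_ENABLE_TELEMETRY
#define SKETCH_ENABLE_TELEMETRY 1
#endif

namespace sketch {

// Attributes recorded by all guards, so the effect is observable.
inline std::vector<std::pair<std::string, std::string>>& recorded()
{
    static std::vector<std::pair<std::string, std::string>> attrs;
    return attrs;
}

// Hypothetical stand-in for telemetry::SpanGuard.
class SpanGuard
{
public:
    explicit SpanGuard(std::string name) : name_(std::move(name)) {}
    std::string const& name() const { return name_; }
    void setAttribute(std::string key, std::string value)
    {
        recorded().emplace_back(std::move(key), std::move(value));
    }

private:
    std::string name_;
};

// Hypothetical telemetry handle with a runtime toggle.
struct Telemetry
{
    bool traceRpc = true;
    bool shouldTraceRpc() const { return traceRpc; }
};

}  // namespace sketch

#if SKETCH_ENABLE_TELEMETRY
#define SKETCH_TRACE_RPC(telemetry, name) \
    std::optional<::sketch::SpanGuard> _sketch_guard_; \
    if ((telemetry).shouldTraceRpc()) \
        _sketch_guard_.emplace(name)
#define SKETCH_TRACE_SET_ATTR(key, value) \
    do \
    { \
        if (_sketch_guard_.has_value()) \
            _sketch_guard_->setAttribute(key, value); \
    } while (0)
#else
// Disabled build: the macro arguments are never evaluated.
#define SKETCH_TRACE_RPC(telemetry, name) ((void)0)
#define SKETCH_TRACE_SET_ATTR(key, value) ((void)0)
#endif

namespace sketch {

// Example call site; it reads the same in both build modes.
inline void handleRequest(Telemetry& telemetry, std::string const& command)
{
    SKETCH_TRACE_RPC(telemetry, "rpc.request");
    SKETCH_TRACE_SET_ATTR("xrpl.rpc.command", command);
}

}  // namespace sketch
```

Because the disabled macros discard their arguments, expressions passed to them should not carry side effects that the surrounding code depends on.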
**Key new file**:
- `src/xrpld/telemetry/TracingInstrumentation.h`
**Reference**:
- [04-code-samples.md §4.3](./04-code-samples.md) — Full macro definitions for `XRPL_TRACE_SPAN`, `XRPL_TRACE_RPC`, `XRPL_TRACE_CONSENSUS`, `XRPL_TRACE_SET_ATTR`, `XRPL_TRACE_EXCEPTION` with both enabled and disabled branches
- [03-implementation-strategy.md §3.7.3](./03-implementation-strategy.md) — Conditional instrumentation pattern: compile-time `#ifndef` and runtime `shouldTrace*()` checks
- [03-implementation-strategy.md §3.9.7](./03-implementation-strategy.md) — Before/after code examples showing minimal intrusiveness (~1-3 lines per instrumentation point)
---
## Task 6: Instrument RPC ServerHandler
**Objective**: Add tracing to the HTTP RPC entry point so every incoming RPC request creates a span.
**What to do**:
- Edit `src/xrpld/rpc/detail/ServerHandler.cpp`:
- `#include` the `TracingInstrumentation.h` header
- In `ServerHandler::onRequest(Session& session)`:
- At the top of the method, add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request");`
- After the RPC command name is extracted, set attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", command);`
- After the response status is known, set: `XRPL_TRACE_SET_ATTR("http.status_code", static_cast<int64_t>(statusCode));`
- Wrap error paths with: `XRPL_TRACE_EXCEPTION(e);`
- In `ServerHandler::processRequest(...)`:
- Add a child span: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.process");`
- Set method attribute: `XRPL_TRACE_SET_ATTR("xrpl.rpc.method", request_method);`
- In `ServerHandler::onWSMessage(...)` (WebSocket path):
- Add: `XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.ws.message");`
- The goal is to see spans like:
```
rpc.request
└── rpc.process
```
in Jaeger for every HTTP RPC call.
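
The parent-child relationship in that tree comes from whichever span is active when a new one starts. The sketch below models this with a thread-local stack; it is a hypothetical in-memory tracer for illustration only, whereas the OTel SDK tracks the active span through its own context and `Scope` mechanism.

```cpp
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// One span: its name and the name of its parent (empty for a root span).
struct SpanRecord
{
    std::string name;
    std::string parent;
};

// Hypothetical in-memory tracer. The thread-local stack plays the role
// of the "current context": its top entry is the active span.
class Tracer
{
public:
    // Shared, unsynchronized list; good enough for a single-threaded demo.
    static std::vector<SpanRecord>& spans()
    {
        static std::vector<SpanRecord> all;
        return all;
    }
    static std::vector<std::string>& active()
    {
        thread_local std::vector<std::string> stack;
        return stack;
    }
};

// RAII guard: activates a span on construction, deactivates on scope exit.
class ScopedSpan
{
public:
    explicit ScopedSpan(std::string name)
    {
        auto& stack = Tracer::active();
        std::string parent = stack.empty() ? std::string{} : stack.back();
        Tracer::spans().push_back(SpanRecord{name, std::move(parent)});
        stack.push_back(std::move(name));
    }
    ~ScopedSpan() { Tracer::active().pop_back(); }
    ScopedSpan(ScopedSpan const&) = delete;
    ScopedSpan& operator=(ScopedSpan const&) = delete;
};

inline void processRequest()
{
    ScopedSpan span("rpc.process");  // parent is whatever is active now
}

inline void onRequest()
{
    ScopedSpan span("rpc.request");  // root: nothing is active yet
    processRequest();
}

}  // namespace sketch
```

Note that this implicit parenting only holds while both calls run on the same thread; work handed to another thread needs the context passed explicitly.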
**Key modified file**:
- `src/xrpld/rpc/detail/ServerHandler.cpp` (~15-25 lines added)
**Reference**:
- [04-code-samples.md §4.5.3](./04-code-samples.md) — Complete `ServerHandler::onRequest()` instrumented code sample with W3C header extraction, span creation, attribute setting, and error handling
- [01-architecture-analysis.md §1.5](./01-architecture-analysis.md) — RPC request flow diagram: HTTP request -> attributes -> jobqueue.enqueue -> rpc.command -> response
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.request` in `ServerHandler.cpp::onRequest()` (Priority: High)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming convention: `rpc.request`, `rpc.command.*`
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC span attributes: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.params`
- [03-implementation-strategy.md §3.9.2](./03-implementation-strategy.md) — File impact: `ServerHandler.cpp` ~40 lines added, ~10 changed (Low risk)
---
## Task 7: Instrument RPC Command Execution
**Objective**: Add per-command tracing inside the RPC handler so each command (e.g., `submit`, `account_info`, `server_info`) gets its own child span.
**What to do**:
- Edit `src/xrpld/rpc/detail/RPCHandler.cpp`:
- `#include` the `TracingInstrumentation.h` header
- In `doCommand(RPC::JsonContext& context, Json::Value& result)`:
- At the top: `XRPL_TRACE_RPC(context.app.getTelemetry(), "rpc.command." + context.method);`
- Set attributes:
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.command", context.method);`
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));`
- `XRPL_TRACE_SET_ATTR("xrpl.rpc.role", (context.role == Role::ADMIN) ? "admin" : "user");`
- On success: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");`
- On error: `XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");` and set the error message
- After this, traces in Jaeger should look like:
```
rpc.request (xrpl.rpc.command=account_info)
└── rpc.process
    └── rpc.command.account_info (xrpl.rpc.version=2, xrpl.rpc.role=user, xrpl.rpc.status=success)
```
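
A minimal sketch of how the span name and attributes for a command could be derived is shown below. `Context` and `Role` are hypothetical stand-ins for `RPC::JsonContext` and rippled's role type, and attribute values are stringified for simplicity even though the plan sets the version as an integer.

```cpp
#include <map>
#include <string>

namespace sketch {

enum class Role { User, Admin };  // hypothetical; rippled has its own Role

// Hypothetical subset of RPC::JsonContext.
struct Context
{
    std::string method;
    unsigned apiVersion = 2;
    Role role = Role::User;
};

// Span name following the rpc.command.* convention.
inline std::string commandSpanName(Context const& ctx)
{
    return "rpc.command." + ctx.method;
}

// Attributes for the command span, keyed by the names used in this task.
inline std::map<std::string, std::string>
commandAttributes(Context const& ctx, bool ok)
{
    return {
        {"xrpl.rpc.command", ctx.method},
        {"xrpl.rpc.version", std::to_string(ctx.apiVersion)},
        {"xrpl.rpc.role", ctx.role == Role::Admin ? "admin" : "user"},
        {"xrpl.rpc.status", ok ? "success" : "error"},
    };
}

}  // namespace sketch
```

Building the name from the command keeps per-command latency easy to filter in Jaeger, at the cost of one span name per RPC method.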
**Key modified file**:
- `src/xrpld/rpc/detail/RPCHandler.cpp` (~15-20 lines added)
**Reference**:
- [04-code-samples.md §4.5.3](./04-code-samples.md) — `ServerHandler::onRequest()` code sample (includes child span pattern for `rpc.command.*`)
- [02-design-decisions.md §2.3](./02-design-decisions.md) — Span naming: `rpc.command.*` pattern with dynamic command name (e.g., `rpc.command.server_info`)
- [02-design-decisions.md §2.4.2](./02-design-decisions.md) — RPC attribute schema: `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`
- [01-architecture-analysis.md §1.6](./01-architecture-analysis.md) — Key trace points table: `rpc.command.*` in `RPCHandler.cpp::doCommand()` (Priority: High)
- [02-design-decisions.md §2.6.5](./02-design-decisions.md) — Correlation with PerfLog: how `doCommand()` can link trace_id with existing PerfLog entries
- [03-implementation-strategy.md §3.4.4](./03-implementation-strategy.md) — RPC request overhead budget: ~1.75 μs total per request
---
## Task 8: Build, Run, and Verify End-to-End
**Objective**: Prove the full pipeline works: rippled emits traces -> OTel Collector receives them -> Jaeger displays them.
**What to do**:
1. **Start the Docker stack**:
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
Verify Collector health: `curl http://localhost:13133`
2. **Build rippled with telemetry**:
```bash
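# Adjust for your actual build workflow. In this commit both the Conan option
# and the CMake option are named `telemetry`; enabling it defines XRPL_ENABLE_TELEMETRY.
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON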
cmake --build --preset default
```
3. **Configure rippled**:
Add to `rippled.cfg` (or your local test config):
```ini
[telemetry]
enabled=1
endpoint=localhost:4317
sampling_ratio=1.0
trace_rpc=1
```
4. **Start rippled** in standalone mode:
```bash
./rippled --conf rippled.cfg -a --start
```
5. **Generate RPC traffic**:
```bash
# server_info
curl -s -X POST http://localhost:5005 \
-H "Content-Type: application/json" \
-d '{"method":"server_info","params":[{}]}'
# ledger
curl -s -X POST http://localhost:5005 \
-H "Content-Type: application/json" \
-d '{"method":"ledger","params":[{"ledger_index":"current"}]}'
# account_info for the genesis account (if it errors in your setup, that's fine: errors are traced too)
curl -s -X POST http://localhost:5005 \
-H "Content-Type: application/json" \
-d '{"method":"account_info","params":[{"account":"rHb9CJAWyB4rj91VRWn96DkukG4bwdtyTh"}]}'
```
6. **Verify in Jaeger**:
- Open `http://localhost:16686`
- Select service `rippled` from the dropdown
- Click "Find Traces"
- Confirm you see traces with spans: `rpc.request` -> `rpc.process` -> `rpc.command.server_info`
- Click into a trace and verify attributes: `xrpl.rpc.command`, `xrpl.rpc.status`, `xrpl.rpc.version`
7. **Verify clean behavior when disabled**:
- Rebuild with `XRPL_ENABLE_TELEMETRY=OFF`, or set `enabled=0` in config
- Run the same RPC calls
- Confirm no new traces appear and no errors in rippled logs
**Verification Checklist**:
- [ ] Docker stack starts without errors
- [ ] rippled builds with `-DXRPL_ENABLE_TELEMETRY=ON`
- [ ] rippled starts and connects to OTel Collector (check rippled logs for telemetry messages)
- [ ] Traces appear in Jaeger UI under service "rippled"
- [ ] Span hierarchy is correct (parent-child relationships)
- [ ] Span attributes are populated (`xrpl.rpc.command`, `xrpl.rpc.status`, etc.)
- [ ] Error spans show error status and message
- [ ] Building with `XRPL_ENABLE_TELEMETRY=OFF` produces no regressions
- [ ] Setting `enabled=0` at runtime produces no traces and no errors
**Reference**:
- [06-implementation-phases.md §6.11.1](./06-implementation-phases.md) — Phase 1 definition of done: SDK compiles, runtime toggle works, span creation verified in Jaeger, config validation passes
- [06-implementation-phases.md §6.11.2](./06-implementation-phases.md) — Phase 2 definition of done: 100% RPC coverage, traceparent propagation, <1ms p99 overhead, dashboard deployed
- [06-implementation-phases.md §6.8](./06-implementation-phases.md) — Success metrics: trace coverage >95%, CPU overhead <3%, memory <5 MB, latency impact <2%
- [03-implementation-strategy.md §3.9.5](./03-implementation-strategy.md) — Backward compatibility: config optional, protocol unchanged, `XRPL_ENABLE_TELEMETRY=OFF` produces identical binary
- [01-architecture-analysis.md §1.8](./01-architecture-analysis.md) — Observable outcomes: what traces, metrics, and dashboards to expect
---
## Task 9: Document POC Results and Next Steps
**Objective**: Capture findings, screenshots, and remaining work for the team.
**What to do**:
- Take screenshots of Jaeger showing:
- The service list with "rippled"
- A trace with the full span tree
- Span detail view showing attributes
- Document any issues encountered (build issues, SDK quirks, missing attributes)
- Note performance observations (build time impact, any noticeable runtime overhead)
- Write a short summary of what the POC proves and what it doesn't cover yet:
- **Proves**: OTel SDK integrates with rippled, OTLP export works, RPC traces visible
- **Doesn't cover**: Cross-node P2P context propagation, consensus tracing, protobuf trace context, W3C traceparent header extraction, tail-based sampling, production deployment
- Outline next steps (mapping to the full plan phases):
- [Phase 2](./06-implementation-phases.md) completion: [W3C header extraction](./02-design-decisions.md) (§2.5), WebSocket tracing, all [RPC handlers](./01-architecture-analysis.md) (§1.6)
- [Phase 3](./06-implementation-phases.md): [Protobuf `TraceContext` message](./04-code-samples.md) (§4.4), [transaction relay tracing](./04-code-samples.md) (§4.5.1) across nodes
- [Phase 4](./06-implementation-phases.md): [Consensus round and phase tracing](./04-code-samples.md) (§4.5.2)
- [Phase 5](./06-implementation-phases.md): [Production collector config](./05-configuration-reference.md) (§5.5.2), [Grafana dashboards](./07-observability-backends.md) (§7.6), [alerting](./07-observability-backends.md) (§7.6.3)
**Reference**:
- [06-implementation-phases.md §6.1](./06-implementation-phases.md) — Full 5-phase timeline overview and Gantt chart
- [06-implementation-phases.md §6.10](./06-implementation-phases.md) — Crawl-Walk-Run strategy: POC is the CRAWL phase, next steps are WALK and RUN
- [06-implementation-phases.md §6.12](./06-implementation-phases.md) — Recommended implementation order (14 steps across 9 weeks)
- [03-implementation-strategy.md §3.9](./03-implementation-strategy.md) — Code intrusiveness assessment and risk matrix for each remaining component
- [07-observability-backends.md §7.2](./07-observability-backends.md) — Production backend selection (Tempo, Elastic APM, Honeycomb, Datadog)
- [02-design-decisions.md §2.5](./02-design-decisions.md) — Context propagation design: W3C HTTP headers, protobuf P2P, JobQueue internal
- [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) — Reference for team onboarding on distributed tracing concepts
---
## Summary
| Task | Description | New Files | Modified Files | Depends On |
|------|--------------------------------------|-----------|----------------|------------|
| 0 | Docker observability stack | 4 | 0 | — |
| 1 | OTel C++ SDK dependency | 0 | 2 | — |
| 2 | Core Telemetry interface + NullImpl | 3 | 0 | 1 |
| 3 | OTel-backed Telemetry implementation | 2 | 1 | 1, 2 |
| 4 | Application lifecycle integration | 0 | 3 | 2, 3 |
| 5 | Instrumentation macros | 1 | 0 | 2 |
| 6 | Instrument RPC ServerHandler | 0 | 1 | 4, 5 |
| 7 | Instrument RPC command execution | 0 | 1 | 4, 5 |
| 8 | End-to-end verification | 0 | 0 | 0-7 |
| 9 | Document results and next steps | 1 | 0 | 8 |
**Parallel work**: Tasks 0 and 1 can run in parallel. Tasks 3 and 5 have no dependency on each other (both build on Task 2). Tasks 6 and 7 can be done in parallel once Tasks 4 and 5 are complete.


@@ -1433,14 +1433,17 @@
# To accept connections on well known ports such as 80 (HTTP) or
# 443 (HTTPS), most operating systems will require xrpld to
# run with administrator privileges, or else xrpld will not start.
[network_id]
0
[logging]
level=debug
[server]
port_rpc_admin_local
port_peer
port_ws_admin_local
#port_ws_public
#ssl_key = /etc/ssl/private/server.key
#ssl_cert = /etc/ssl/certs/server.crt
port_ws_public
port_peer
[port_rpc_admin_local]
port = 5005
@@ -1449,13 +1452,8 @@ admin = 127.0.0.1
protocol = http
[port_peer]
# Many servers still use the legacy port of 51235, so for backward-compatibility
# we maintain that port number here. However, for new servers we recommend
# changing this to the default port of 2459.
port = 51235
ip = 0.0.0.0
# alternatively, to accept connections on IPv4 + IPv6, use:
#ip = ::
ip = 127.0.0.1
protocol = peer
[port_ws_admin_local]
@@ -1465,67 +1463,102 @@ admin = 127.0.0.1
protocol = ws
send_queue_limit = 500
[port_ws_public]
port = 6005
ip = 127.0.0.1
protocol = ws
send_queue_limit = 500
[port_grpc]
port = 50051
ip = 127.0.0.1
secure_gateway = 127.0.0.1
#[port_ws_public]
#port = 6005
#ip = 127.0.0.1
#protocol = wss
#send_queue_limit = 500
[validator_token]
eyJtYW5pZmVzdCI6IkpBQUFBQUZ4SWUxdFQrbVpjK2g0akY3bFpXaWdHRVZMOThmRlJ1VnJz
K1hTa0ZOQ2dSV0daSE1oQW91ZnZrMUFtSjU3ZUQ5dUdIRmZjS3Qrd1FFUXRyUDJzcnBOTDR4
MnJ3RjBka1l3UkFJZ1JOLzlndUZnUXh0UWRwYjNiSlFvQWk1VHY2WFRhRm5vd1ZKb2lvalJh
eklDSUJ3S1FQRXE4UjB4anFWUnIxOTZXeFlxcm8wZWoyY2JSM0lqdGtnT2ZCMmFjQkpBdGpU
Rkpzb2VRZU05ZlZjcnNBOGMxVVRzVWQxK0RJbXphck5JVFhIdHF2WkRhZWh6UHBqSTF1NjBG
MEpiVnAvWGtiSXovNVFyNWFnek9xTkFOa1RXQWc9PSIsInZhbGlkYXRpb25fc2VjcmV0X2tl
eSI6IjNBNURCMzczMTMxQ0E5QThGMjJDQzkxRTBGOUY2NEM2MjBGQzBCRUE2MUM1OEZFRkRC
NThDRjIyMzc5ODg3RkIifQ==
#-------------------------------------------------------------------------------
[validation_quorum]
quorum = 3
[node_size]
medium
# [node_db]
# type=RocksDB
# path=/var/lib/rippled/db
# This is primary persistent datastore for xrpld. This includes transaction
# metadata, account states, and ledger headers. Helpful information can be
# found at https://xrpl.org/capacity-planning.html#node-db-type
# type=NuDB is recommended for non-validators with fast SSDs. Validators or
# slow / spinning disks should use RocksDB. Caution: Spinning disks are
# not recommended. They do not perform well enough to consistently remain
# synced to the network.
# online_delete=512 is recommended to delete old ledgers while maintaining at
# least 512.
# advisory_delete=0 allows the online delete process to run automatically
# when the node has approximately two times the "online_delete" value of
# ledgers. No external administrative command is required to initiate
# deletion.
[node_db]
type=NuDB
path=/var/lib/xrpld/db/nudb
nudb_block_size=4096
path=~/data/livenet/db/nudb
online_delete=512
advisory_delete=0
[database_path]
/var/lib/xrpld/db
~/data/livenet/db
# This needs to be an absolute directory reference, not a relative one.
# Modify this value as required.
[debug_logfile]
/var/log/xrpld/debug.log
~/data/livenet/debug.log
# To use the XRP test network
# (see https://xrpl.org/connect-your-rippled-to-the-xrp-test-net.html),
# use the following [ips] section:
# [ips]
# r.altnet.rippletest.net 51235
# File containing trusted validator keys or validator list publishers.
# Unless an absolute path is specified, it will be considered relative to the
# folder in which the xrpld.cfg file is located.
[validators_file]
validators.txt
# Turn down default logging to save disk space in the long run.
# Valid values here are trace, debug, info, warning, error, and fatal
[rpc_startup]
{ "command": "log_level", "severity": "warning" }
{ "command": "log_level", "severity": "info" }
# If ssl_verify is 1, certificates will be validated.
# To allow the use of self-signed certificates for development or internal use,
# set to ssl_verify to 0.
[ssl_verify]
1
0
[shard_db]
path=~/data/livenet/db/nudb
max_historical_shards=12
#-------------------------------------------------------------------------------
#
# 11. Telemetry (OpenTelemetry Tracing)
#
#-------------------------------------------------------------------------------
#
# Enables distributed tracing via OpenTelemetry. Requires building with
# the telemetry option enabled (Conan: -o telemetry=True, CMake: -Dtelemetry=ON).
#
# [telemetry]
#
# enabled=0
#
# Enable or disable telemetry at runtime. Default: 0 (disabled).
#
# endpoint=http://localhost:4318/v1/traces
#
# The OpenTelemetry Collector endpoint (OTLP/HTTP). Default: http://localhost:4318/v1/traces.
#
# exporter=otlp_http
#
# Exporter type: otlp_http. Default: otlp_http.
#
# sampling_ratio=1.0
#
# Fraction of traces to sample (0.0 to 1.0). Default: 1.0 (all traces).
#
# trace_rpc=1
#
# Enable RPC request tracing. Default: 1.
#
# trace_transactions=1
#
# Enable transaction lifecycle tracing. Default: 1.
#
# trace_consensus=1
#
# Enable consensus round tracing. Default: 1.
#
# trace_peer=0
#
# Enable peer message tracing (high volume). Default: 0.
#


@@ -119,6 +119,15 @@ target_link_libraries(
add_module(xrpl tx)
target_link_libraries(xrpl.libxrpl.tx PUBLIC xrpl.libxrpl.ledger)
# Telemetry module
add_module(xrpl telemetry)
target_link_libraries(xrpl.libxrpl.telemetry PUBLIC xrpl.libxrpl.basics xrpl.libxrpl.beast)
if (telemetry)
target_link_libraries(
xrpl.libxrpl.telemetry
PUBLIC opentelemetry-cpp::opentelemetry-cpp)
endif ()
add_library(xrpl.libxrpl)
set_target_properties(xrpl.libxrpl PROPERTIES OUTPUT_NAME xrpl)
@@ -144,6 +153,7 @@ target_link_modules(
resource
server
shamap
telemetry
tx)
# All headers in libxrpl are in modules.


@@ -32,6 +32,7 @@ install(TARGETS common
xrpl.libxrpl.resource
xrpl.libxrpl.server
xrpl.libxrpl.shamap
xrpl.libxrpl.telemetry
xrpl.libxrpl.tx
antithesis-sdk-cpp
EXPORT XrplExports


@@ -22,6 +22,7 @@ class Xrpl(ConanFile):
"rocksdb": [True, False],
"shared": [True, False],
"static": [True, False],
"telemetry": [True, False],
"tests": [True, False],
"unity": [True, False],
"xrpld": [True, False],
@@ -54,6 +55,7 @@ class Xrpl(ConanFile):
"rocksdb": True,
"shared": False,
"static": True,
"telemetry": False,
"tests": False,
"unity": False,
"xrpld": False,
@@ -140,6 +142,8 @@ class Xrpl(ConanFile):
self.requires("jemalloc/5.3.0")
if self.options.rocksdb:
self.requires("rocksdb/10.5.1")
if self.options.telemetry:
self.requires("opentelemetry-cpp/1.18.0")
self.requires("xxhash/0.8.3", **transitive_headers_opt)
exports_sources = (
@@ -168,6 +172,7 @@ class Xrpl(ConanFile):
tc.variables["rocksdb"] = self.options.rocksdb
tc.variables["BUILD_SHARED_LIBS"] = self.options.shared
tc.variables["static"] = self.options.static
tc.variables["telemetry"] = self.options.telemetry
tc.variables["unity"] = self.options.unity
tc.variables["xrpld"] = self.options.xrpld
tc.generate()
@@ -220,3 +225,5 @@ class Xrpl(ConanFile):
]
if self.options.rocksdb:
libxrpl.requires.append("rocksdb::librocksdb")
if self.options.telemetry:
libxrpl.requires.append("opentelemetry-cpp::opentelemetry-cpp")


@@ -0,0 +1,44 @@
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "13133:13133" # Health check
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    depends_on:
      - jaeger
    networks:
      - rippled-telemetry
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250" # gRPC
    networks:
      - rippled-telemetry
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - jaeger
    networks:
      - rippled-telemetry
networks:
  rippled-telemetry:
    driver: bridge


@@ -0,0 +1,7 @@
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686


@@ -0,0 +1,27 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 100
exporters:
  debug:
    verbosity: detailed
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp/jaeger]


@@ -19,6 +19,9 @@ class Manager;
namespace perf {
class PerfLog;
}
namespace telemetry {
class Telemetry;
}
// This is temporary until we migrate all code to use ServiceRegistry.
class Application;
@@ -205,6 +208,9 @@ public:
virtual perf::PerfLog&
getPerfLog() = 0;
virtual telemetry::Telemetry&
getTelemetry() = 0;
// Configuration and state
virtual bool
isStopping() const = 0;


@@ -0,0 +1,105 @@
#pragma once
#ifdef XRPL_ENABLE_TELEMETRY
#include <opentelemetry/trace/span.h>
#include <opentelemetry/trace/scope.h>
#include <opentelemetry/context/runtime_context.h>
#include <opentelemetry/nostd/shared_ptr.h>
#include <exception>
#include <string>
#include <string_view>
#include <utility>
namespace xrpl {
namespace telemetry {
class SpanGuard
{
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span_;
opentelemetry::trace::Scope scope_;
public:
explicit SpanGuard(
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> span)
: span_(std::move(span)), scope_(span_)
{
}
SpanGuard(SpanGuard const&) = delete;
SpanGuard& operator=(SpanGuard const&) = delete;
SpanGuard(SpanGuard&& other) noexcept
: span_(std::move(other.span_)), scope_(span_)
{
other.span_ = nullptr;
}
SpanGuard& operator=(SpanGuard&&) = delete;
~SpanGuard()
{
if (span_)
span_->End();
}
opentelemetry::trace::Span&
span()
{
return *span_;
}
opentelemetry::trace::Span const&
span() const
{
return *span_;
}
void
setOk()
{
span_->SetStatus(opentelemetry::trace::StatusCode::kOk);
}
void
setStatus(
opentelemetry::trace::StatusCode code,
std::string_view description = "")
{
span_->SetStatus(code, std::string(description));
}
template <typename T>
void
setAttribute(std::string_view key, T&& value)
{
span_->SetAttribute(
opentelemetry::nostd::string_view(key.data(), key.size()),
std::forward<T>(value));
}
void
addEvent(std::string_view name)
{
span_->AddEvent(std::string(name));
}
void
recordException(std::exception const& e)
{
span_->AddEvent("exception", {
{"exception.type", "std::exception"},
{"exception.message", std::string(e.what())}
});
span_->SetStatus(
opentelemetry::trace::StatusCode::kError, e.what());
}
opentelemetry::context::Context
context() const
{
return opentelemetry::context::RuntimeContext::GetCurrent();
}
};
} // namespace telemetry
} // namespace xrpl
#endif // XRPL_ENABLE_TELEMETRY


@@ -0,0 +1,92 @@
#pragma once
#include <xrpl/beast/utility/Journal.h>
#include <xrpl/basics/BasicConfig.h>
#include <chrono>
#include <cstdint>
#include <memory>
#include <string>
#include <string_view>
#ifdef XRPL_ENABLE_TELEMETRY
#include <opentelemetry/trace/tracer.h>
#include <opentelemetry/trace/span.h>
#include <opentelemetry/context/context.h>
#include <opentelemetry/nostd/shared_ptr.h>
#endif
namespace xrpl {
namespace telemetry {
class Telemetry
{
public:
struct Setup
{
bool enabled = false;
std::string serviceName = "rippled";
std::string serviceVersion;
std::string serviceInstanceId;
std::string exporterType = "otlp_http";
std::string exporterEndpoint = "http://localhost:4318/v1/traces";
bool useTls = false;
std::string tlsCertPath;
double samplingRatio = 1.0;
std::uint32_t batchSize = 512;
std::chrono::milliseconds batchDelay{5000};
std::uint32_t maxQueueSize = 2048;
std::uint32_t networkId = 0;
std::string networkType = "mainnet";
bool traceTransactions = true;
bool traceConsensus = true;
bool traceRpc = true;
bool tracePeer = false;
bool traceLedger = true;
};
virtual ~Telemetry() = default;
virtual void start() = 0;
virtual void stop() = 0;
virtual bool isEnabled() const = 0;
virtual bool shouldTraceTransactions() const = 0;
virtual bool shouldTraceConsensus() const = 0;
virtual bool shouldTraceRpc() const = 0;
virtual bool shouldTracePeer() const = 0;
#ifdef XRPL_ENABLE_TELEMETRY
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>
getTracer(std::string_view name = "rippled") = 0;
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
startSpan(
std::string_view name,
opentelemetry::trace::SpanKind kind =
opentelemetry::trace::SpanKind::kInternal) = 0;
virtual opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
startSpan(
std::string_view name,
opentelemetry::context::Context const& parentContext,
opentelemetry::trace::SpanKind kind =
opentelemetry::trace::SpanKind::kInternal) = 0;
#endif
};
std::unique_ptr<Telemetry>
make_Telemetry(Telemetry::Setup const& setup, beast::Journal journal);
Telemetry::Setup
setup_Telemetry(
Section const& section,
std::string const& nodePublicKey,
std::string const& version);
} // namespace telemetry
} // namespace xrpl


@@ -0,0 +1,103 @@
#include <xrpl/telemetry/Telemetry.h>
#ifdef XRPL_ENABLE_TELEMETRY
#include <opentelemetry/trace/noop.h>
#endif
namespace xrpl {
namespace telemetry {
namespace {
class NullTelemetry : public Telemetry
{
Setup setup_;
public:
explicit NullTelemetry(Setup const& setup) : setup_(setup)
{
}
void
start() override
{
}
void
stop() override
{
}
bool
isEnabled() const override
{
return false;
}
bool
shouldTraceTransactions() const override
{
return false;
}
bool
shouldTraceConsensus() const override
{
return false;
}
bool
shouldTraceRpc() const override
{
return false;
}
bool
shouldTracePeer() const override
{
return false;
}
#ifdef XRPL_ENABLE_TELEMETRY
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>
getTracer(std::string_view) override
{
static auto noopTracer =
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>(
new opentelemetry::trace::NoopTracer());
return noopTracer;
}
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
startSpan(std::string_view, opentelemetry::trace::SpanKind) override
{
return opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>(
new opentelemetry::trace::NoopSpan(nullptr));
}
opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>
startSpan(
std::string_view,
opentelemetry::context::Context const&,
opentelemetry::trace::SpanKind) override
{
return opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>(
new opentelemetry::trace::NoopSpan(nullptr));
}
#endif
};
} // namespace
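// When XRPL_ENABLE_TELEMETRY is not defined, always return NullTelemetry.
// When it is defined, make_Telemetry is provided by Telemetry.cpp instead,
// which returns its own no-op implementation if setup.enabled is false.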
#ifndef XRPL_ENABLE_TELEMETRY
std::unique_ptr<Telemetry>
make_Telemetry(Telemetry::Setup const& setup, beast::Journal)
{
return std::make_unique<NullTelemetry>(setup);
}
#endif
} // namespace telemetry
} // namespace xrpl


@@ -0,0 +1,272 @@
#ifdef XRPL_ENABLE_TELEMETRY
#include <xrpl/telemetry/Telemetry.h>
#include <xrpl/basics/Log.h>
#include <opentelemetry/exporters/otlp/otlp_http_exporter_factory.h>
#include <opentelemetry/exporters/otlp/otlp_http_exporter_options.h>
#include <opentelemetry/sdk/resource/semantic_conventions.h>
#include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
#include <opentelemetry/sdk/trace/batch_span_processor_options.h>
#include <opentelemetry/sdk/trace/sampler.h>
#include <opentelemetry/sdk/trace/samplers/trace_id_ratio.h>
#include <opentelemetry/sdk/trace/tracer_provider.h>
#include <opentelemetry/sdk/trace/tracer_provider_factory.h>
#include <opentelemetry/trace/noop.h>
#include <opentelemetry/trace/provider.h>
namespace xrpl {
namespace telemetry {
namespace {
namespace trace_api = opentelemetry::trace;
namespace trace_sdk = opentelemetry::sdk::trace;
namespace otlp_http = opentelemetry::exporter::otlp;
namespace resource = opentelemetry::sdk::resource;
// A no-op implementation used when XRPL_ENABLE_TELEMETRY is defined
// but setup.enabled is false. This lives in the anonymous namespace
// so there is no ODR conflict with the NullTelemetry in
// NullTelemetry.cpp (which also lives in its own anonymous namespace).
class NullTelemetryOtel : public Telemetry
{
Setup setup_;
public:
explicit NullTelemetryOtel(Setup const& setup) : setup_(setup)
{
}
void
start() override
{
}
void
stop() override
{
}
bool
isEnabled() const override
{
return false;
}
bool
shouldTraceTransactions() const override
{
return false;
}
bool
shouldTraceConsensus() const override
{
return false;
}
bool
shouldTraceRpc() const override
{
return false;
}
bool
shouldTracePeer() const override
{
return false;
}
opentelemetry::nostd::shared_ptr<trace_api::Tracer>
getTracer(std::string_view) override
{
static auto noopTracer =
opentelemetry::nostd::shared_ptr<trace_api::Tracer>(
new trace_api::NoopTracer());
return noopTracer;
}
opentelemetry::nostd::shared_ptr<trace_api::Span>
startSpan(std::string_view, trace_api::SpanKind) override
{
return opentelemetry::nostd::shared_ptr<trace_api::Span>(
new trace_api::NoopSpan(nullptr));
}
opentelemetry::nostd::shared_ptr<trace_api::Span>
startSpan(
std::string_view,
opentelemetry::context::Context const&,
trace_api::SpanKind) override
{
return opentelemetry::nostd::shared_ptr<trace_api::Span>(
new trace_api::NoopSpan(nullptr));
}
};
class TelemetryImpl : public Telemetry
{
Setup setup_;
beast::Journal journal_;
std::shared_ptr<trace_sdk::TracerProvider> sdkProvider_;
public:
TelemetryImpl(Setup const& setup, beast::Journal journal)
: setup_(setup), journal_(journal)
{
}
void
start() override
{
JLOG(journal_.info())
<< "Telemetry starting: endpoint=" << setup_.exporterEndpoint
<< " sampling=" << setup_.samplingRatio;
// Configure OTLP HTTP exporter
otlp_http::OtlpHttpExporterOptions exporterOpts;
exporterOpts.url = setup_.exporterEndpoint;
if (setup_.useTls)
exporterOpts.ssl_ca_cert_path = setup_.tlsCertPath;
auto exporter =
otlp_http::OtlpHttpExporterFactory::Create(exporterOpts);
// Configure batch processor
trace_sdk::BatchSpanProcessorOptions processorOpts;
processorOpts.max_queue_size = setup_.maxQueueSize;
processorOpts.schedule_delay_millis =
std::chrono::milliseconds(setup_.batchDelay);
processorOpts.max_export_batch_size = setup_.batchSize;
auto processor =
trace_sdk::BatchSpanProcessorFactory::Create(
std::move(exporter), processorOpts);
// Configure resource attributes
auto resourceAttrs = resource::Resource::Create({
{resource::SemanticConventions::kServiceName,
setup_.serviceName},
{resource::SemanticConventions::kServiceVersion,
setup_.serviceVersion},
{resource::SemanticConventions::kServiceInstanceId,
setup_.serviceInstanceId},
{"xrpl.network.id",
static_cast<int64_t>(setup_.networkId)},
{"xrpl.network.type", setup_.networkType},
});
// Configure sampler
auto sampler =
std::make_unique<trace_sdk::TraceIdRatioBasedSampler>(
setup_.samplingRatio);
// Create TracerProvider
sdkProvider_ =
trace_sdk::TracerProviderFactory::Create(
std::move(processor),
resourceAttrs,
std::move(sampler));
// Set as global provider
trace_api::Provider::SetTracerProvider(
opentelemetry::nostd::shared_ptr<trace_api::TracerProvider>(
sdkProvider_));
JLOG(journal_.info()) << "Telemetry started successfully";
}
void
stop() override
{
JLOG(journal_.info()) << "Telemetry stopping";
if (sdkProvider_)
{
// Force flush before shutdown
sdkProvider_->ForceFlush();
sdkProvider_.reset();
trace_api::Provider::SetTracerProvider(
opentelemetry::nostd::shared_ptr<trace_api::TracerProvider>(
new trace_api::NoopTracerProvider()));
}
JLOG(journal_.info()) << "Telemetry stopped";
}
bool
isEnabled() const override
{
return true;
}
bool
shouldTraceTransactions() const override
{
return setup_.traceTransactions;
}
bool
shouldTraceConsensus() const override
{
return setup_.traceConsensus;
}
bool
shouldTraceRpc() const override
{
return setup_.traceRpc;
}
bool
shouldTracePeer() const override
{
return setup_.tracePeer;
}
opentelemetry::nostd::shared_ptr<trace_api::Tracer>
getTracer(std::string_view name) override
{
if (!sdkProvider_)
return trace_api::Provider::GetTracerProvider()->GetTracer(
std::string(name));
return sdkProvider_->GetTracer(std::string(name));
}
opentelemetry::nostd::shared_ptr<trace_api::Span>
startSpan(std::string_view name, trace_api::SpanKind kind) override
{
auto tracer = getTracer("rippled");
trace_api::StartSpanOptions opts;
opts.kind = kind;
return tracer->StartSpan(std::string(name), opts);
}
opentelemetry::nostd::shared_ptr<trace_api::Span>
startSpan(
std::string_view name,
opentelemetry::context::Context const& parentContext,
trace_api::SpanKind kind) override
{
auto tracer = getTracer("rippled");
trace_api::StartSpanOptions opts;
opts.kind = kind;
opts.parent = parentContext;
return tracer->StartSpan(std::string(name), opts);
}
};
} // namespace
std::unique_ptr<Telemetry>
make_Telemetry(Telemetry::Setup const& setup, beast::Journal journal)
{
if (setup.enabled)
return std::make_unique<TelemetryImpl>(setup, journal);
return std::make_unique<NullTelemetryOtel>(setup);
}
} // namespace telemetry
} // namespace xrpl
#endif // XRPL_ENABLE_TELEMETRY


@@ -0,0 +1,50 @@
#include <xrpl/telemetry/Telemetry.h>
namespace xrpl {
namespace telemetry {
Telemetry::Setup
setup_Telemetry(
Section const& section,
std::string const& nodePublicKey,
std::string const& version)
{
Telemetry::Setup setup;
setup.enabled = section.value_or<int>("enabled", 0) != 0;
setup.serviceName = section.value_or<std::string>("service_name", "rippled");
setup.serviceVersion = version;
setup.serviceInstanceId =
section.value_or<std::string>("service_instance_id", nodePublicKey);
setup.exporterType =
section.value_or<std::string>("exporter", "otlp_http");
setup.exporterEndpoint = section.value_or<std::string>(
"endpoint", "http://localhost:4318/v1/traces");
setup.useTls = section.value_or<int>("use_tls", 0) != 0;
setup.tlsCertPath =
section.value_or<std::string>("tls_ca_cert", "");
setup.samplingRatio = section.value_or<double>("sampling_ratio", 1.0);
setup.batchSize =
section.value_or<std::uint32_t>("batch_size", 512u);
setup.batchDelay = std::chrono::milliseconds{
section.value_or<std::uint32_t>("batch_delay_ms", 5000u)};
setup.maxQueueSize =
section.value_or<std::uint32_t>("max_queue_size", 2048u);
setup.traceTransactions =
section.value_or<int>("trace_transactions", 1) != 0;
setup.traceConsensus =
section.value_or<int>("trace_consensus", 1) != 0;
setup.traceRpc = section.value_or<int>("trace_rpc", 1) != 0;
setup.tracePeer = section.value_or<int>("trace_peer", 0) != 0;
setup.traceLedger = section.value_or<int>("trace_ledger", 1) != 0;
return setup;
}
} // namespace telemetry
} // namespace xrpl


@@ -51,6 +51,7 @@
#include <xrpl/resource/Fees.h>
#include <xrpl/server/LoadFeeTrack.h>
#include <xrpl/server/Wallet.h>
#include <xrpl/telemetry/Telemetry.h>
#include <xrpl/tx/apply.h>
#include <boost/algorithm/string/predicate.hpp>
@@ -146,6 +147,7 @@ public:
beast::Journal m_journal;
std::unique_ptr<perf::PerfLog> perfLog_;
std::unique_ptr<telemetry::Telemetry> telemetry_;
Application::MutexType m_masterMutex;
// Required by the SHAMapStore
@@ -250,45 +252,50 @@ public:
, m_journal(logs_->journal("Application"))
// PerfLog must be started before any other threads are launched.
, perfLog_(
perf::make_PerfLog(
perf::setup_PerfLog(config_->section("perf"), config_->CONFIG_DIR),
*this,
logs_->journal("PerfLog"),
[this] { signalStop("PerfLog"); }))
, perfLog_(perf::make_PerfLog(
perf::setup_PerfLog(config_->section("perf"), config_->CONFIG_DIR),
*this,
logs_->journal("PerfLog"),
[this] { signalStop("PerfLog"); }))
, telemetry_(telemetry::make_Telemetry(
telemetry::setup_Telemetry(
config_->section("telemetry"),
"", // nodePublicKey not yet available at this point
BuildInfo::getVersionString()),
logs_->journal("Telemetry")))
, m_txMaster(*this)
, m_collectorManager(
make_CollectorManager(config_->section(SECTION_INSIGHT), logs_->journal("Collector")))
, m_jobQueue(
std::make_unique<JobQueue>(
[](std::unique_ptr<Config> const& config) {
if (config->standalone() && !config->FORCE_MULTI_THREAD)
return 1;
, m_jobQueue(std::make_unique<JobQueue>(
[](std::unique_ptr<Config> const& config) {
if (config->standalone() && !config->FORCE_MULTI_THREAD)
return 1;
if (config->WORKERS)
return config->WORKERS;
if (config->WORKERS)
return config->WORKERS;
auto count = static_cast<int>(std::thread::hardware_concurrency());
auto count = static_cast<int>(std::thread::hardware_concurrency());
// Be more aggressive about the number of threads to use
// for the job queue if the server is configured as
// "large" or "huge" if there are enough cores.
if (config->NODE_SIZE >= 4 && count >= 16)
count = 6 + std::min(count, 8);
else if (config->NODE_SIZE >= 3 && count >= 8)
count = 4 + std::min(count, 6);
else
count = 2 + std::min(count, 4);
// Be more aggressive about the number of threads to use
// for the job queue if the server is configured as
// "large" or "huge" if there are enough cores.
if (config->NODE_SIZE >= 4 && count >= 16)
count = 6 + std::min(count, 8);
else if (config->NODE_SIZE >= 3 && count >= 8)
count = 4 + std::min(count, 6);
else
count = 2 + std::min(count, 4);
return count;
}(config_),
m_collectorManager->group("jobq"),
logs_->journal("JobQueue"),
*logs_,
*perfLog_))
return count;
}(config_),
m_collectorManager->group("jobq"),
logs_->journal("JobQueue"),
*logs_,
*perfLog_))
, m_nodeStoreScheduler(*m_jobQueue)
@@ -323,18 +330,16 @@ public:
, m_orderBookDB(make_OrderBookDB(*this, {config_->PATH_SEARCH_MAX, config_->standalone()}))
, m_pathRequests(
std::make_unique<PathRequests>(
*this,
logs_->journal("PathRequest"),
m_collectorManager->collector()))
, m_pathRequests(std::make_unique<PathRequests>(
*this,
logs_->journal("PathRequest"),
m_collectorManager->collector()))
, m_ledgerMaster(
std::make_unique<LedgerMaster>(
*this,
stopwatch(),
m_collectorManager->collector(),
logs_->journal("LedgerMaster")))
, m_ledgerMaster(std::make_unique<LedgerMaster>(
*this,
stopwatch(),
m_collectorManager->collector(),
logs_->journal("LedgerMaster")))
, ledgerCleaner_(make_LedgerCleaner(*this, logs_->journal("LedgerCleaner")))
@@ -350,11 +355,10 @@ public:
gotTXSet(set, fromAcquire);
}))
, m_ledgerReplayer(
std::make_unique<LedgerReplayer>(
*this,
*m_inboundLedgers,
make_PeerSetBuilder(*this)))
, m_ledgerReplayer(std::make_unique<LedgerReplayer>(
*this,
*m_inboundLedgers,
make_PeerSetBuilder(*this)))
, m_acceptedLedgerCache(
"AcceptedLedger",
@@ -385,14 +389,13 @@ public:
, publisherManifests_(std::make_unique<ManifestCache>(logs_->journal("ManifestCache")))
, validators_(
std::make_unique<ValidatorList>(
*validatorManifests_,
*publisherManifests_,
*timeKeeper_,
config_->legacy("database_path"),
logs_->journal("ValidatorList"),
config_->VALIDATION_QUORUM))
, validators_(std::make_unique<ValidatorList>(
*validatorManifests_,
*publisherManifests_,
*timeKeeper_,
config_->legacy("database_path"),
logs_->journal("ValidatorList"),
config_->VALIDATION_QUORUM))
, validatorSites_(std::make_unique<ValidatorSite>(*this))
@@ -617,6 +620,12 @@ public:
return *perfLog_;
}
telemetry::Telemetry&
getTelemetry() override
{
return *telemetry_;
}
NodeCache&
getTempNodeCache() override
{
@@ -886,9 +895,8 @@ public:
}))
{
using namespace std::chrono;
sweepTimer_.expires_after(
seconds{config_->SWEEP_INTERVAL.value_or(
config_->getValueFor(SizedItem::sweepInterval))});
sweepTimer_.expires_after(seconds{
config_->SWEEP_INTERVAL.value_or(config_->getValueFor(SizedItem::sweepInterval))});
sweepTimer_.async_wait(std::move(*optionalCountedHandler));
}
}
@@ -1463,6 +1471,7 @@ ApplicationImp::start(bool withTimers)
ledgerCleaner_->start();
perfLog_->start();
telemetry_->start();
}
void
@@ -1553,6 +1562,7 @@ ApplicationImp::run()
ledgerCleaner_->stop();
m_nodeStore->stop();
perfLog_->stop();
telemetry_->stop();
JLOG(m_journal.info()) << "Done.";
}


@@ -8,6 +8,7 @@
#include <xrpld/rpc/Role.h>
#include <xrpld/rpc/detail/Handler.h>
#include <xrpld/rpc/detail/Tuning.h>
#include <xrpld/telemetry/TracingInstrumentation.h>
#include <xrpl/basics/Log.h>
#include <xrpl/core/JobQueue.h>
@@ -157,6 +158,11 @@ template <class Object, class Method>
Status
callMethod(JsonContext& context, Method method, std::string const& name, Object& result)
{
XRPL_TRACE_RPC(context.app.getTelemetry(), "rpc.command." + name);
XRPL_TRACE_SET_ATTR("xrpl.rpc.command", name.c_str());
XRPL_TRACE_SET_ATTR("xrpl.rpc.version", static_cast<int64_t>(context.apiVersion));
XRPL_TRACE_SET_ATTR("xrpl.rpc.role", (context.role == Role::ADMIN ? "admin" : "user"));
static std::atomic<std::uint64_t> requestId{0};
auto& perfLog = context.app.getPerfLog();
std::uint64_t const curId = ++requestId;
@@ -172,12 +178,15 @@ callMethod(JsonContext& context, Method method, std::string const& name, Object&
JLOG(context.j.debug()) << "RPC call " << name << " completed in "
<< ((end - start).count() / 1000000000.0) << "seconds";
perfLog.rpcFinish(name, curId);
XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "success");
return ret;
}
catch (std::exception& e)
{
perfLog.rpcError(name, curId);
JLOG(context.j.info()) << "Caught throw: " << e.what();
XRPL_TRACE_EXCEPTION(e);
XRPL_TRACE_SET_ATTR("xrpl.rpc.status", "error");
if (context.loadType == Resource::feeReferenceRPC)
context.loadType = Resource::feeExceptionRPC;


@@ -7,6 +7,7 @@
#include <xrpld/rpc/detail/Tuning.h>
#include <xrpld/rpc/detail/WSInfoSub.h>
#include <xrpld/rpc/json_body.h>
#include <xrpld/telemetry/TracingInstrumentation.h>
#include <xrpl/basics/Log.h>
#include <xrpl/basics/base64.h>
@@ -267,6 +268,8 @@ buffers_to_string(ConstBufferSequence const& bs)
void
ServerHandler::onRequest(Session& session)
{
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.request");
// Make sure RPC is enabled on the port
if (session.port().protocol.count("http") == 0 && session.port().protocol.count("https") == 0)
{
@@ -378,6 +381,7 @@ ServerHandler::processSession(
std::shared_ptr<JobQueue::Coro> const& coro,
Json::Value const& jv)
{
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.ws_message");
auto is = std::static_pointer_cast<WSInfoSub>(session->appDefined);
if (is->getConsumer().disconnect(m_journal))
{
@@ -566,6 +570,7 @@ ServerHandler::processRequest(
std::string_view forwardedFor,
std::string_view user)
{
XRPL_TRACE_RPC(app_.getTelemetry(), "rpc.process");
auto rpcJ = app_.journal("RPC");
Json::Value jsonOrig;


@@ -0,0 +1,69 @@
#pragma once
#ifdef XRPL_ENABLE_TELEMETRY
#include <xrpl/telemetry/Telemetry.h>
#include <xrpl/telemetry/SpanGuard.h>
#include <optional>
namespace xrpl {
namespace telemetry {
// Start a span that is automatically ended when the guard goes out of scope.
// The guard is wrapped in std::optional so that XRPL_TRACE_SET_ATTR and
// XRPL_TRACE_EXCEPTION, which call has_value(), also compile after this macro.
#define XRPL_TRACE_SPAN(_tel_obj_, _span_name_) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_( \
std::in_place, (_tel_obj_).startSpan(_span_name_))
// Start a span with a specific kind
#define XRPL_TRACE_SPAN_KIND(_tel_obj_, _span_name_, _span_kind_) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_( \
std::in_place, (_tel_obj_).startSpan(_span_name_, _span_kind_))
// Conditional span for RPC tracing
#define XRPL_TRACE_RPC(_tel_obj_, _span_name_) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((_tel_obj_).shouldTraceRpc()) { \
_xrpl_guard_.emplace((_tel_obj_).startSpan(_span_name_)); \
}
// Conditional span for transaction tracing
#define XRPL_TRACE_TX(_tel_obj_, _span_name_) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((_tel_obj_).shouldTraceTransactions()) { \
_xrpl_guard_.emplace((_tel_obj_).startSpan(_span_name_)); \
}
// Conditional span for consensus tracing
#define XRPL_TRACE_CONSENSUS(_tel_obj_, _span_name_) \
std::optional<::xrpl::telemetry::SpanGuard> _xrpl_guard_; \
if ((_tel_obj_).shouldTraceConsensus()) { \
_xrpl_guard_.emplace((_tel_obj_).startSpan(_span_name_)); \
}
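// Set attribute on current span (if exists).
// Wrapped in do { } while (0) so the macro behaves as a single statement
// and is safe inside an unbraced if/else.
#define XRPL_TRACE_SET_ATTR(key, value) \
do { \
if (_xrpl_guard_.has_value()) { \
_xrpl_guard_->setAttribute(key, value); \
} \
} while (0)
// Record exception on current span
#define XRPL_TRACE_EXCEPTION(e) \
do { \
if (_xrpl_guard_.has_value()) { \
_xrpl_guard_->recordException(e); \
} \
} while (0)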
} // namespace telemetry
} // namespace xrpl
#else // XRPL_ENABLE_TELEMETRY not defined
#define XRPL_TRACE_SPAN(_tel_obj_, _span_name_) ((void)0)
#define XRPL_TRACE_SPAN_KIND(_tel_obj_, _span_name_, _span_kind_) ((void)0)
#define XRPL_TRACE_RPC(_tel_obj_, _span_name_) ((void)0)
#define XRPL_TRACE_TX(_tel_obj_, _span_name_) ((void)0)
#define XRPL_TRACE_CONSENSUS(_tel_obj_, _span_name_) ((void)0)
#define XRPL_TRACE_SET_ATTR(key, value) ((void)0)
#define XRPL_TRACE_EXCEPTION(e) ((void)0)
#endif // XRPL_ENABLE_TELEMETRY
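
The conditional macros above reduce to an RAII guard held in a `std::optional`: the guard is constructed only when tracing is enabled, and the span ends when the enclosing scope exits. The following is a minimal standalone sketch of that pattern, not code from this commit. `FakeSpan`, `FakeGuard`, and `handleRequest` are hypothetical stand-ins that avoid any dependency on opentelemetry-cpp, and the `server_info` command name is only an illustrative value.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a span: it only carries a name and a log to write to.
struct FakeSpan
{
    std::string name;
    std::vector<std::string>* log;
};

// Hypothetical RAII guard with the same shape as SpanGuard:
// it records the start on construction and the end on destruction.
class FakeGuard
{
    FakeSpan span_;

public:
    explicit FakeGuard(FakeSpan span) : span_(std::move(span))
    {
        span_.log->push_back("start:" + span_.name);
    }
    FakeGuard(FakeGuard const&) = delete;
    FakeGuard& operator=(FakeGuard const&) = delete;
    ~FakeGuard()
    {
        span_.log->push_back("end:" + span_.name);
    }
    void
    setAttribute(std::string const& key, std::string const& value)
    {
        span_.log->push_back("attr:" + key + "=" + value);
    }
};

// Hand-expanded equivalent of XRPL_TRACE_RPC followed by XRPL_TRACE_SET_ATTR.
std::vector<std::string>
handleRequest(bool traceRpc)
{
    std::vector<std::string> log;
    {
        std::optional<FakeGuard> guard;  // empty unless tracing is enabled
        if (traceRpc)
            guard.emplace(FakeSpan{"rpc.request", &log});
        if (guard.has_value())
            guard->setAttribute("xrpl.rpc.command", "server_info");
        log.push_back("work");
    }  // the guard, if any, is destroyed here and ends the span
    return log;
}
```

With tracing disabled the optional stays empty, so the only cost on that path is the `has_value()` check; no span object is created or destroyed.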