Phase 3: Transaction Tracing Task List
Goal: Trace the full transaction lifecycle from RPC submission through peer relay, including cross-node context propagation via Protocol Buffer extensions. This is the WALK phase that demonstrates true distributed tracing.
Scope: Protocol Buffer TraceContext message, context serialization, PeerImp transaction instrumentation, NetworkOPs processing instrumentation, HashRouter visibility, and multi-node relay context propagation.
Branch: pratik/otel-phase3-tx-tracing (from pratik/otel-phase2-rpc-tracing)
Related Plan Documents
| Document | Relevance |
|---|---|
| 04-code-samples.md | TraceContext protobuf (§4.4.1), PeerImp instrumentation (§4.5.1), context serialization (§4.4.2) |
| 01-architecture-analysis.md | Transaction flow (§1.3), key trace points (§1.6) |
| 06-implementation-phases.md | Phase 3 tasks (§6.4), definition of done (§6.11.3) |
| 02-design-decisions.md | Context propagation design (§2.5), attribute schema (§2.4.3) |
Task 3.1: Define TraceContext Protocol Buffer Message
Objective: Add trace context fields to the P2P protocol messages so trace IDs can propagate across nodes.
What to do:
- Edit include/xrpl/proto/xrpl.proto (or src/ripple/proto/ripple.proto, wherever the proto is):
  - Add the TraceContext message definition:

    ```protobuf
    // Field labels assume the file keeps its existing proto2 syntax.
    message TraceContext {
        optional bytes trace_id = 1;      // 16-byte trace identifier
        optional bytes span_id = 2;       // 8-byte span identifier
        optional uint32 trace_flags = 3;  // bit 0 = sampled
        optional string trace_state = 4;  // W3C tracestate value
    }
    ```

  - Add optional TraceContext trace_context = 1001; to:
    - TMTransaction
    - TMProposeSet (for Phase 4 use)
    - TMValidation (for Phase 4 use)
  - Use high field numbers (1001+) to avoid conflicts with existing fields
- Regenerate protobuf C++ code
Key modified files:
include/xrpl/proto/xrpl.proto (or equivalent)
Reference:
- 04-code-samples.md §4.4.1 — TraceContext message definition
- 02-design-decisions.md §2.5.2 — Protocol buffer context propagation design
Task 3.2: Implement Protobuf Context Serialization
Objective: Create utilities to serialize/deserialize OTel trace context to/from protobuf TraceContext messages.
What to do:
- Create include/xrpl/telemetry/TraceContextPropagator.h (extend from Phase 2 if it exists, or add protobuf methods):
  - Add protobuf-specific methods:
    - static Context extractFromProtobuf(protocol::TraceContext const& proto): reconstruct OTel context from protobuf fields
    - static void injectToProtobuf(Context const& ctx, protocol::TraceContext& proto): serialize current span context into protobuf fields
  - Guard both methods behind #ifdef XRPL_ENABLE_TELEMETRY
- Create/extend src/libxrpl/telemetry/TraceContextPropagator.cpp:
  - Implement extraction: read trace_id (16 bytes), span_id (8 bytes), and trace_flags from protobuf, construct a SpanContext, and wrap it in a Context
  - Implement injection: get current span from context, serialize its TraceId, SpanId, and TraceFlags into protobuf fields
Key new/modified files:
- include/xrpl/telemetry/TraceContextPropagator.h
- src/libxrpl/telemetry/TraceContextPropagator.cpp
Reference:
- 04-code-samples.md §4.4.2 — Full extract/inject implementation
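The extract/inject round trip described above can be sketched without any OTel or protobuf dependency. This is a minimal illustration, not the implementation from 04-code-samples.md: WireTraceContext and LocalSpanContext are hypothetical stand-ins for protocol::TraceContext and the OTel SpanContext, and the function names are invented for the example. It shows the length checks (16-byte trace_id, 8-byte span_id) and the rejection of all-zero IDs, which W3C Trace Context treats as invalid.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Hypothetical stand-in for the generated protocol::TraceContext message.
struct WireTraceContext
{
    std::string traceId;           // bytes trace_id = 1 (16 bytes expected)
    std::string spanId;            // bytes span_id = 2 (8 bytes expected)
    std::uint32_t traceFlags = 0;  // bit 0 = sampled
};

// Hypothetical stand-in for an OTel SpanContext.
struct LocalSpanContext
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
    std::uint8_t flags = 0;
};

// Injection: copy the raw ID bytes and the flags into the wire message.
WireTraceContext
injectToWire(LocalSpanContext const& ctx)
{
    WireTraceContext wire;
    wire.traceId.assign(
        reinterpret_cast<char const*>(ctx.traceId.data()), ctx.traceId.size());
    wire.spanId.assign(
        reinterpret_cast<char const*>(ctx.spanId.data()), ctx.spanId.size());
    wire.traceFlags = ctx.flags;
    assert(wire.traceId.size() == 16 && wire.spanId.size() == 8);
    return wire;
}

// Extraction: reject wrong lengths and all-zero IDs, otherwise rebuild
// the local context from the wire bytes.
std::optional<LocalSpanContext>
extractFromWire(WireTraceContext const& wire)
{
    auto const allZero = [](std::string const& s) {
        return std::all_of(s.begin(), s.end(), [](char c) { return c == 0; });
    };
    if (wire.traceId.size() != 16 || wire.spanId.size() != 8)
        return std::nullopt;
    if (allZero(wire.traceId) || allZero(wire.spanId))
        return std::nullopt;

    LocalSpanContext ctx;
    std::memcpy(ctx.traceId.data(), wire.traceId.data(), ctx.traceId.size());
    std::memcpy(ctx.spanId.data(), wire.spanId.data(), ctx.spanId.size());
    ctx.flags = static_cast<std::uint8_t>(wire.traceFlags & 0xFF);
    return ctx;
}
```

Returning std::optional keeps the "corrupted context" case explicit at the call site, so the caller decides whether to log and fall back to a new root span.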
Task 3.3: Instrument PeerImp Transaction Handling
Objective: Add trace spans to the peer-level transaction receive and relay path.
What to do:
- Edit src/xrpld/overlay/detail/PeerImp.cpp:
  - In onMessage(TMTransaction) / handleTransaction():
    - Extract parent trace context from the incoming TMTransaction::trace_context field (if present)
    - Create tx.receive span as child of extracted context (or new root if none)
    - Set attributes: xrpl.tx.hash, xrpl.peer.id, xrpl.tx.status
    - On HashRouter suppression (duplicate): set xrpl.tx.suppressed=true, add tx.duplicate event
    - Wrap validation call with child span tx.validate
    - Wrap relay with tx.relay span
  - When relaying to peers:
    - Inject current trace context into outgoing TMTransaction::trace_context
    - Set xrpl.tx.relay_count attribute
- Use the SpanGuard::span(TraceCategory::Transactions, "tx", "receive") factory (Phase 1c replaced macros with the SpanGuard factory pattern)
Key modified files:
src/xrpld/overlay/detail/PeerImp.cpp
Reference:
- 04-code-samples.md §4.5.1 — Full PeerImp instrumentation example
- 01-architecture-analysis.md §1.3 — Transaction flow diagram
- 01-architecture-analysis.md §1.6 — tx.receive trace point
Task 3.4: Instrument NetworkOPs Transaction Processing
Objective: Trace the transaction processing pipeline in NetworkOPs, covering both sync and async paths.
What to do:
- Edit src/xrpld/app/misc/NetworkOPs.cpp:
  - In processTransaction():
    - Create tx.process span
    - Set attributes: xrpl.tx.hash, xrpl.tx.type, xrpl.tx.local (whether from RPC or peer)
    - Record whether sync or async path is taken
  - In doTransactionAsync():
    - Capture parent context before queuing
    - Create tx.queue span with queue depth attribute
    - Add event when transaction is dequeued for processing
  - In doTransactionSync():
    - Create tx.process_sync span
    - Record result (applied, queued, rejected)
Key modified files:
src/xrpld/app/misc/NetworkOPs.cpp
Reference:
- 01-architecture-analysis.md §1.6 — tx.validate and tx.process trace points
- 02-design-decisions.md §2.4.3 — Transaction attribute schema
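The "capture parent context before queuing" step for the async path can be illustrated with a small stdlib-only sketch. Everything here is hypothetical (ParentContext, SpanRecord, TxJobQueue); it only demonstrates the idea that the context is copied into the queued closure at enqueue time, so the span created at dequeue time still links to the original caller even though the caller's own scope has ended.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a captured OTel parent context.
struct ParentContext
{
    std::string traceId;
    std::string spanId;
};

// One recorded span: its name, the parent it was created under, and the
// queue depth observed when it was created.
struct SpanRecord
{
    std::string name;
    std::string parentSpanId;
    std::size_t queueDepth;
};

class TxJobQueue
{
public:
    // Record a tx.queue span and copy the parent context into the job,
    // because the caller's context object may be gone when the job runs.
    void
    enqueue(ParentContext const& parent, std::vector<SpanRecord>& sink)
    {
        sink.push_back({"tx.queue", parent.spanId, jobs_.size()});
        jobs_.push_back([parent, &sink] {
            sink.push_back({"tx.process", parent.spanId, 0});
        });
    }

    // Dequeue and run every pending job in FIFO order.
    void
    runAll()
    {
        while (!jobs_.empty())
        {
            auto job = std::move(jobs_.front());
            jobs_.pop_front();
            job();
        }
    }

private:
    std::deque<std::function<void()>> jobs_;
};
```

Capturing by value is the important design choice: a reference capture would dangle once the enqueuing function returns.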
Task 3.5: Instrument HashRouter for Dedup Visibility
Objective: Make transaction deduplication visible in traces by recording HashRouter decisions as span attributes/events.
What to do:
- Edit src/xrpld/overlay/detail/PeerImp.cpp (in handleTransaction):
  - After calling HashRouter::shouldProcess() or addSuppressionPeer():
    - Record xrpl.tx.suppressed attribute (true/false)
    - Record xrpl.tx.flags showing current HashRouter state (SAVED, TRUSTED, etc.)
    - Add tx.first_seen or tx.duplicate event
- This is NOT a modification to HashRouter itself; it only records HashRouter decisions as span attributes in the existing PeerImp instrumentation from Task 3.3.
Key modified files:
src/xrpld/overlay/detail/PeerImp.cpp (same changes as 3.3, logically grouped)
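As a rough sketch of how a HashRouter decision could be turned into the attributes and event named above, the helper below maps a shouldProcess-style boolean and a flag word to the three recorded values. The flag bit values and all names (kFlagSaved, kFlagTrusted, DedupAnnotation, annotateDedup) are hypothetical and chosen only for illustration; the real HashRouter flags are defined in rippled and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical bit values, for illustration only.
constexpr std::uint32_t kFlagSaved = 0x01;
constexpr std::uint32_t kFlagTrusted = 0x02;

// Values that would be recorded on the tx.receive span.
struct DedupAnnotation
{
    bool suppressed;    // xrpl.tx.suppressed
    std::string flags;  // xrpl.tx.flags, e.g. "SAVED|TRUSTED"
    std::string event;  // tx.first_seen or tx.duplicate
};

// Translate the router's decision into span annotations.
DedupAnnotation
annotateDedup(bool shouldProcess, std::uint32_t flags)
{
    std::string names;
    auto const append = [&names](char const* name) {
        if (!names.empty())
            names += '|';
        names += name;
    };
    if (flags & kFlagSaved)
        append("SAVED");
    if (flags & kFlagTrusted)
        append("TRUSTED");

    return {
        !shouldProcess,
        names,
        shouldProcess ? "tx.first_seen" : "tx.duplicate"};
}
```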
Task 3.6: Context Propagation in Transaction Relay
Objective: Ensure trace context flows correctly when transactions are relayed between peers, creating linked spans across nodes.
What to do:
- Verify the relay path injects trace context:
  - When PeerImp relays a transaction, the TMTransaction message should carry trace_context
  - When a remote peer receives it, the context is extracted and used as parent
- Test context propagation:
  - Manually verify with a 2+ node setup that trace IDs match across nodes
  - Confirm parent-child span relationships are correct in Tempo
- Handle edge cases:
  - Missing trace context (older peers): create new root span
  - Corrupted trace context: log warning, create new root span
  - Sampled-out traces: respect trace flags
Key modified files:
- src/xrpld/overlay/detail/PeerImp.cpp
- src/xrpld/overlay/detail/OverlayImpl.cpp (if relay method needs context param)
Reference:
- 02-design-decisions.md §2.5 — Context propagation design
- 04-code-samples.md §4.5.1 — Relay context injection pattern
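The three edge cases listed for this task reduce to a small decision function. The sketch below is an assumption-laden illustration, not code from the plan: IncomingContext, ParentKind, ParentDecision and chooseParent are hypothetical names, and "corrupted" is approximated as "wrong field length".

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical view of the trace_context field on a received message.
struct IncomingContext
{
    std::string traceId;
    std::string spanId;
    std::uint32_t traceFlags = 0;
};

enum class ParentKind { NewRoot, RemoteParent };

struct ParentDecision
{
    ParentKind kind;
    bool sampled;  // whether the new span should be recorded
    bool warn;     // whether a warning should be logged
};

// Missing context  -> new root, local sampling decision, no warning.
// Corrupt context  -> new root, local sampling decision, warning.
// Valid context    -> remote parent, sampling taken from bit 0 of the flags.
ParentDecision
chooseParent(
    std::optional<IncomingContext> const& incoming,
    bool localSampleDecision)
{
    if (!incoming)
        return {ParentKind::NewRoot, localSampleDecision, false};

    bool const wellFormed =
        incoming->traceId.size() == 16 && incoming->spanId.size() == 8;
    if (!wellFormed)
        return {ParentKind::NewRoot, localSampleDecision, true};

    return {
        ParentKind::RemoteParent, (incoming->traceFlags & 0x01u) != 0, false};
}
```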
Task 3.7: Build Verification and Testing
Objective: Verify all Phase 3 changes compile and work correctly.
What to do:
- Build with telemetry=ON and verify no compilation errors
- Build with telemetry=OFF and verify no regressions
- Run existing unit tests
- Verify protobuf regeneration produces correct C++ code
- Document any issues encountered
Verification Checklist:
- Protobuf changes generate valid C++
- Build succeeds with telemetry ON
- Build succeeds with telemetry OFF
- Existing tests pass
- No undefined symbols from new telemetry calls
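One way to keep both build modes linking cleanly is to give every telemetry entry point a no-op body when XRPL_ENABLE_TELEMETRY is undefined, so call sites never need their own #ifdef. The sketch below illustrates that pattern with hypothetical names (kTelemetryEnabled, AttributeBag, recordAttribute); it is not the project's actual guard code.

```cpp
#include <cassert>
#include <map>
#include <string>

#ifdef XRPL_ENABLE_TELEMETRY
inline constexpr bool kTelemetryEnabled = true;
#else
inline constexpr bool kTelemetryEnabled = false;
#endif

// Hypothetical attribute store standing in for a span.
using AttributeBag = std::map<std::string, std::string>;

// Records the attribute when telemetry is compiled in; otherwise does
// nothing. Returns whether the attribute was recorded.
inline bool
recordAttribute(
    AttributeBag& bag,
    std::string const& key,
    std::string const& value)
{
#ifdef XRPL_ENABLE_TELEMETRY
    bag[key] = value;
    return true;
#else
    (void)bag;
    (void)key;
    (void)value;
    return false;
#endif
}
```

Because the function exists in both modes, a telemetry=OFF build cannot fail with an undefined symbol at a call site.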
Task 3.8: Transaction Span Peer Version Attribute
Source: External Dashboard Parity — adds peer version context inspired by the community xrpl-validator-dashboard.
Upstream: Phase 2 (RPC span infrastructure must exist). Downstream: Phase 10 (validation checks for this attribute).
Objective: Add the relaying peer's rippled version to tx.receive spans so operators can correlate transaction issues with peer version mismatches during network upgrades.
What to do:
- Edit src/xrpld/overlay/detail/PeerImp.cpp:
  - In the tx.receive span block (after the existing xrpl.peer.id setAttribute call):
    - Add xrpl.peer.version (string), from this->getVersion()
    - Only set if getVersion() returns a non-empty string (avoid empty-string attributes)
New span attribute:
| Attribute | Type | Source | Example |
|---|---|---|---|
| xrpl.peer.version | string | peer->getVersion() | "rippled-2.4.0" |
Rationale: Transaction relay is where version mismatches cause subtle serialization or validation bugs. Tracing "this tx came from a v2.3.0 peer" helps diagnose compatibility issues. The community dashboard tracks peer versions externally; this brings version awareness into the trace itself.
Key modified files:
src/xrpld/overlay/detail/PeerImp.cpp
Exit Criteria:
- tx.receive spans carry the xrpl.peer.version attribute with a non-empty version string
- Attribute is omitted (not set to empty string) when getVersion() returns empty
- Attribute visible in Jaeger span detail view
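The "omit rather than set empty" rule is small enough to show directly. In this sketch SpanAttributes and setPeerVersion are hypothetical; a std::map stands in for the span's attribute set.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a span's attribute set.
using SpanAttributes = std::map<std::string, std::string>;

// Set xrpl.peer.version only when the peer reported a version string,
// so the attribute is absent (not empty) for peers without one.
inline void
setPeerVersion(SpanAttributes& attrs, std::string const& version)
{
    if (!version.empty())
        attrs["xrpl.peer.version"] = version;
}
```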
Task 3.9: Deterministic Transaction Trace ID
Upstream: Task 3.2 (protobuf serialization), Task 3.3 (PeerImp span exists). Downstream: Phase 10 (workload validation can query by tx hash directly). Pattern: Mirrors the consensus deterministic trace ID in Phase 4a (createDeterministicContext in RCLConsensus.cpp), adapted for transactions.
Objective: Derive the trace_id for transaction spans deterministically from the transaction hash so that all nodes handling the same transaction independently produce spans under the same trace_id — regardless of whether protobuf context propagation succeeds.
Why: The current approach creates spans with random trace_ids and relies entirely
on protobuf TraceContext propagation to link them. If any hop in the relay chain
drops the context (older peers, message corruption, mixed-version networks), the trace
splits and downstream spans become impossible to find. With deterministic trace_ids,
correlation is guaranteed because every node derives the same trace_id from the same
txID.
Approach — deterministic trace_id + protobuf span_id propagation:
- Derive trace_id = txHash[0:16] (first 16 bytes of the 32-byte transaction hash).
- Generate a random 8-byte span_id per node (each node's span is unique within the shared trace).
- Create the span under this deterministic context as parent.
- Additionally, if protobuf TraceContext is present in the incoming TMTransaction message, extract the sender's span_id and use it as the span's parent. This preserves parent-child ordering in the trace tree.
- If protobuf context is absent (older peer, first hop), the span still has the correct deterministic trace_id: it appears as a sibling root in the same trace rather than being lost.
This gives the best of both worlds: guaranteed cross-node correlation via deterministic
trace_id, plus parent-child relay ordering via protobuf span_id when available.
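The selection logic in this approach can be sketched with plain byte arrays. The types and the function name makeTxParentContext are hypothetical, and std::mt19937_64 stands in for whatever random source the node uses; the point is only that trace_id depends solely on the hash, while the parent span_id comes from the sender when available and is random otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <random>

using TxHash = std::array<std::uint8_t, 32>;
using TraceIdBytes = std::array<std::uint8_t, 16>;
using SpanIdBytes = std::array<std::uint8_t, 8>;

// Hypothetical parent context for a transaction span.
struct TxParentContext
{
    TraceIdBytes traceId;
    SpanIdBytes parentSpanId;
    bool remote;  // true when the parent span_id came from another node
};

TxParentContext
makeTxParentContext(
    TxHash const& txHash,
    std::optional<SpanIdBytes> const& senderSpanId,
    std::mt19937_64& rng)
{
    TxParentContext ctx{};

    // Deterministic part: first 16 bytes of the transaction hash.
    std::memcpy(ctx.traceId.data(), txHash.data(), ctx.traceId.size());

    if (senderSpanId)
    {
        // Protobuf context present: keep the sender's span_id as parent.
        ctx.parentSpanId = *senderSpanId;
        ctx.remote = true;
        return ctx;
    }

    // No protobuf context: random, non-zero span_id for this node.
    std::uint64_t r = 0;
    while (r == 0)
        r = rng();
    std::memcpy(ctx.parentSpanId.data(), &r, sizeof(r));
    ctx.remote = false;
    return ctx;
}
```

Two nodes calling this with the same hash and different random sources get the same traceId, which is the correlation guarantee the approach relies on.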
What to do:
- Create createDeterministicTxContext(uint256 const& txHash) utility function:
  - Location: shared header or file-local in PeerImp.cpp and NetworkOPs.cpp (or a shared telemetry utility if both need it).
  - Pattern: identical to createDeterministicContext(uint256 const& ledgerId) in RCLConsensus.cpp: take txHash[0:16] as trace_id, random span_id via default_prng(), sampled flag set, remote=false.
  - Guard behind #ifdef XRPL_ENABLE_TELEMETRY.

  ```cpp
  #include <cstring>  // std::memcpy

  opentelemetry::context::Context
  createDeterministicTxContext(uint256 const& txHash)
  {
      namespace trace = opentelemetry::trace;

      // First 16 bytes of the 32-byte tx hash as trace ID.
      trace::TraceId traceId(
          opentelemetry::nostd::span<uint8_t const, 16>(txHash.data(), 16));

      // Random span_id so each node's span is unique within the trace.
      uint8_t spanIdBytes[8];
      auto const rval = default_prng()();
      // The PRNG output must cover all 8 span_id bytes.
      static_assert(sizeof(rval) >= sizeof(spanIdBytes));
      std::memcpy(spanIdBytes, &rval, sizeof(spanIdBytes));
      trace::SpanId spanId(
          opentelemetry::nostd::span<uint8_t const, 8>(spanIdBytes, 8));

      // TraceFlags(1) sets the sampled bit.
      trace::SpanContext syntheticCtx(
          traceId, spanId, trace::TraceFlags(1), /* remote = */ false);

      // Wrap the synthetic context in a non-recording span for use as parent.
      return opentelemetry::context::Context{}.SetValue(
          trace::kSpanKey,
          opentelemetry::nostd::shared_ptr<trace::Span>(
              new trace::DefaultSpan(syntheticCtx)));
  }
  ```
- Edit src/xrpld/overlay/detail/PeerImp.cpp, restructuring handleTransaction():
  - Move span creation after deserialization (txID must be known first):
    - Deserialize STTx and get txID (existing code at line ~1382).
    - Create deterministic parent context: auto detCtx = createDeterministicTxContext(txID).
    - If m->has_trace_context(): extract protobuf context via extractFromProtobuf() and combine it with the deterministic trace_id. Use the protobuf span_id as parent to preserve relay ordering, but override trace_id with the deterministic one.
    - If no protobuf context: create span under detCtx directly.
    - Set all existing attributes (hash, peerId, peerVersion, suppressed, etc.).
  - Combining deterministic trace_id with protobuf parent span_id: When both are available, construct a synthetic SpanContext with:
    - trace_id = txHash[0:16] (deterministic)
    - span_id = extracted from protobuf (sender's span_id → becomes parent)
    - trace_flags = from protobuf
    - remote = true (came from another node)

    ```cpp
    // Pseudo-code for the combined context; `extracted` is the Context
    // returned by extractFromProtobuf() for the incoming message.
    auto const remote = trace::GetSpan(extracted)->GetContext();
    // TraceId takes a fixed-size span rather than (pointer, length).
    trace::TraceId const detTraceId(
        opentelemetry::nostd::span<uint8_t const, 16>(txHash.data(), 16));
    trace::SpanContext combinedCtx(
        detTraceId, remote.span_id(), remote.trace_flags(), /* remote = */ true);
    // Use as parent context for the new span.
    ```
- Edit src/xrpld/app/misc/NetworkOPs.cpp, updating processTransaction():
  - transaction->getID() is already available at the top of the function.
  - Create deterministic parent context from txID.
  - Create tx.process span under this context.
  - No protobuf context to extract here (NetworkOPs is intra-node), so deterministic context alone is sufficient.
- Add the xrpl.tx.trace_strategy attribute to spans:
  - Add inline constexpr auto traceStrategy = join(xrplTx, makeStr("trace_strategy")); to TxSpanNames.h.
  - Set on each tx span: span.setAttribute(tx_span::attr::traceStrategy, "deterministic").
Key new/modified files:
src/xrpld/overlay/detail/PeerImp.cpp— restructured span creationsrc/xrpld/app/misc/NetworkOPs.cpp— deterministic context for tx.processsrc/xrpld/app/misc/TxSpanNames.h— newtraceStrategyattribute constant- New or shared utility for
createDeterministicTxContext()(location TBD: could be a shared header likeinclude/xrpl/telemetry/DeterministicContext.h, or file-local if only used in two places)
Interaction with existing tasks:
- Task 3.3 (PeerImp instrumentation): The span creation in handleTransaction() must be restructured because the span currently starts before txID is known. This task moves it after deserialization.
- Task 3.6 (Relay context propagation): Protobuf injection at the relay site remains the same: injectToProtobuf() serializes the current span's span_id. The receiver extracts it and combines it with the deterministic trace_id.
- Phase 4a (Consensus deterministic trace ID): This task follows the same pattern. Consider extracting a shared utility (e.g., createDeterministicContext(uint256)) that both consensus and transaction tracing use.
Exit Criteria:
- tx.receive and tx.process spans have deterministic trace_id = txHash[0:16]
- All nodes handling the same transaction produce spans under the same trace_id
- Protobuf span_id propagation still works when available (parent-child ordering)
- Missing protobuf context (old peer) degrades gracefully to sibling spans, not lost traces
- xrpl.tx.trace_strategy attribute set to "deterministic" on all tx spans
- Trace queryable by tx hash (truncate hash → trace_id → direct lookup in Tempo)
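The last criterion (hash → trace_id lookup) amounts to taking the first 32 hex digits of the 64-digit transaction hash. The helper below is a hypothetical convenience, not part of the plan; it assumes the hash is displayed in the same byte order as it is stored, and it lowercases the result because W3C trace IDs are written in lowercase hex.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Returns the 32-character lowercase hex trace ID for a 64-character hex
// transaction hash, or nullopt if the input is not 64 hex digits.
std::optional<std::string>
traceIdFromTxHashHex(std::string const& txHashHex)
{
    if (txHashHex.size() != 64)
        return std::nullopt;

    std::string traceId;
    traceId.reserve(32);
    for (std::size_t i = 0; i < txHashHex.size(); ++i)
    {
        auto const c = static_cast<unsigned char>(txHashHex[i]);
        if (!std::isxdigit(c))
            return std::nullopt;
        // Only the first 16 bytes (32 hex digits) form the trace ID.
        if (i < 32)
            traceId.push_back(static_cast<char>(std::tolower(c)));
    }
    return traceId;
}
```

The returned string can be pasted into a trace backend's "search by trace ID" field, under the assumptions stated above.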
Deliverables implemented (not in original plan):
- SpanGuard::txSpan() factory method (include/xrpl/telemetry/SpanGuard.h): Two overloads for creating transaction spans with deterministic trace IDs:
  - txSpan(category, group, name, txHash): standalone span (deterministic trace_id from txHash[0:16], no parent span_id).
  - txSpan(category, group, name, txHash, parentCtx): child span (deterministic trace_id combined with protobuf-extracted parent span_id for relay ordering).
- TxTracing.h helper functions (src/xrpld/overlay/detail/TxTracing.h): File-local helpers that wrap SpanGuard::txSpan() for the two main PeerImp call sites:
  - txReceiveSpan(txHash, parentCtx): creates tx.receive span with deterministic trace_id and optional protobuf parent context.
  - txProcessSpan(txHash): creates tx.process span with deterministic trace_id only (no protobuf parent, used intra-node).
  - Note: TxTracing.h includes xrpl.pb.h unconditionally (outside #ifdef XRPL_ENABLE_TELEMETRY) because protocol::TMTransaction appears in the function signatures regardless of telemetry build mode.
Task 3.10: TxQ Instrumentation
Status: COMPLETE
Objective: Trace the transaction queue lifecycle — enqueue decisions, direct apply, batch clear, ledger-close accept loop, per-tx apply, and cleanup.
Spans added:
txq.enqueue— wrapsTxQ::apply()with tx_hash attributetxq.apply_direct— wrapsTxQ::tryDirectApply()fast-pathtxq.batch_clear— wrapsTxQ::tryClearAccountQueueUpThruTx()txq.accept— wrapsTxQ::accept()ledger-close dequeue with queue_size attrtxq.accept_tx— per-tx span inside accept loop with tx_hash, ter_code, retries_remaining attributestxq.cleanup— wrapsTxQ::processClosedLedger()with ledger_seq attribute
New file: src/xrpld/app/misc/detail/TxQSpanNames.h
Modified file: src/xrpld/app/misc/detail/TxQ.cpp
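The nesting of txq.accept and its per-transaction txq.accept_tx spans follows the usual RAII guard shape: the span starts in the constructor and ends in the destructor, so every exit path closes it. The sketch below uses hypothetical types (SpanLog, ScopedSpan) and a hypothetical function acceptQueued; it is not the SpanGuard API, only an illustration of the scoping behavior.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory log of span start/end events.
struct SpanLog
{
    std::vector<std::string> events;
};

// Minimal RAII span: starts on construction, ends on destruction.
class ScopedSpan
{
public:
    ScopedSpan(SpanLog& log, std::string name)
        : log_(log), name_(std::move(name))
    {
        log_.events.push_back("start " + name_);
    }

    ~ScopedSpan()
    {
        log_.events.push_back("end " + name_);
    }

    ScopedSpan(ScopedSpan const&) = delete;
    ScopedSpan&
    operator=(ScopedSpan const&) = delete;

private:
    SpanLog& log_;
    std::string name_;
};

// Mirrors the txq.accept / txq.accept_tx nesting: one outer span for the
// accept call, one inner span per queued transaction.
void
acceptQueued(SpanLog& log, std::vector<std::string> const& txHashes)
{
    ScopedSpan accept(log, "txq.accept");
    for (auto const& hash : txHashes)
    {
        ScopedSpan perTx(log, "txq.accept_tx " + hash);
        // ... apply the transaction here ...
    }
}
```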
Summary
| Task | Description | New Files | Modified Files | Depends On |
|---|---|---|---|---|
| 3.1 | TraceContext protobuf message | 0 | 1 | Phase 2 |
| 3.2 | Protobuf context serialization | 1-2 | 0 | 3.1 |
| 3.3 | PeerImp transaction instrumentation | 0 | 1 | 3.2 |
| 3.4 | NetworkOPs transaction processing | 0 | 1 | Phase 2 |
| 3.5 | HashRouter dedup visibility | 0 | 1 | 3.3 |
| 3.6 | Relay context propagation | 0 | 1-2 | 3.3, 3.5 |
| 3.7 | Build verification and testing | 0 | 0 | 3.1-3.6 |
| 3.8 | TX span peer version attribute | 0 | 1 | 3.3 |
| 3.9 | Deterministic transaction trace ID | 0-1 | 3 | 3.2, 3.3 |
| 3.10 | TxQ instrumentation (6 spans) | 1 | 1 | 3.4 |
Parallel work: Tasks 3.1 and 3.4 can start in parallel. Task 3.2 depends on 3.1. Task 3.3 depends on 3.2, and Task 3.5 depends on 3.3. Task 3.6 depends on 3.3 and 3.5. Task 3.7 runs after 3.1-3.6. Task 3.8 depends on 3.3 (span must exist). Task 3.9 depends on 3.2 and 3.3. Task 3.10 depends on 3.4 (tx.process span must exist).
Exit Criteria (from 06-implementation-phases.md §6.11.3):
- Transaction traces span across nodes
- Trace context in Protocol Buffer messages
- HashRouter deduplication visible in traces
- <5% overhead on transaction throughput
- Deterministic trace_id: same trace_id for same tx across all nodes
- Protobuf span_id propagation preserves parent-child ordering when available