Boost.Coroutine to C++20 Standard Coroutines Migration Plan
Project: rippled (XRP Ledger node)
Branch: pratik/Switch-to-std-coroutines
Date: 2026-02-25
Status: Planning
Table of Contents
- Research & Analysis
- Current State Assessment
- Migration Strategy
- Implementation Plan
- Testing & Validation Strategy
- Risks & Mitigation
- Timeline & Milestones
- Standards & Guidelines
- Task List
1. Research & Analysis
1.1 Stackful (Boost.Coroutine) vs Stackless (C++20) Architecture
graph TD
subgraph Boost["Boost.Coroutine2 (Stackful)"]
direction TB
B1["Coroutine Created"]
B2["1 MB Stack Allocated"]
B3["Full Call Stack Available"]
B4["yield() from ANY<br/>nesting depth"]
B5["Context Switch:<br/>save/restore registers<br/>+ stack pointer<br/>~40-100 CPU cycles"]
B1 --> B2 --> B3 --> B4 --> B5
end
subgraph Std["C++20 Coroutines (Stackless)"]
direction TB
S1["Coroutine Created"]
S2["200-500 B Frame on Heap"]
S3["No Dedicated Stack"]
S4["co_await ONLY at<br/>explicit suspension points"]
S5["Context Switch:<br/>resume via function call<br/>symmetric transfer / tail-call<br/>~20-50 CPU cycles"]
S1 --> S2 --> S3 --> S4 --> S5
end
Boost (right): Each coroutine gets a full 1 MB stack. Suspension saves the
entire register set and stack pointer, so yield() can be called from any
nesting depth — the whole call chain is preserved. The cost is high per-coroutine
memory and a heavier context switch (~40-100 cycles for fcontext save/restore).
C++20 (left): The compiler allocates a small heap frame (200-500 bytes)
holding only the local variables that live across suspension points. There is no
dedicated stack — suspension is only allowed at explicit co_await expressions
in the immediate coroutine function. Resumption is a normal function call
(symmetric transfer makes it a tail-call), costing ~20-50 cycles. The trade-off
is that nested functions that need to suspend must themselves be coroutines.
1.2 API & Programming Model Comparison
| Aspect | Boost.Coroutine2 (Current) | C++20 Coroutines (Target) |
|---|---|---|
| Type | Stackful, asymmetric. Each coroutine carries its own call stack, and control transfers between a parent (caller) and a child (coroutine) — never between two siblings directly. | Stackless, asymmetric. The compiler transforms the coroutine function into a state machine allocated on the heap. The same parent/child asymmetry applies, but there is no separate stack. |
| Stack Model | Dedicated 1 MB stack per coroutine. Allocated at construction via boost::context::fixedsize_stack. The full stack is reserved even if the coroutine only uses a few hundred bytes, leading to high memory overhead under concurrency. | Heap frame of ~200-500 bytes. The compiler allocates only the local variables that live across suspension points into a coroutine frame on the heap. The frame may be elided entirely if the compiler can prove the coroutine's lifetime is bounded by its caller. |
| Suspension | (*yield_)(): can yield from any call depth. Because the coroutine has its own stack, a call chain fn_a() → fn_b() → yield() suspends the entire stack. The yield_ pointer is a push_type* provided by Boost. | co_await expr: only at explicit suspension points. Suspension is only possible in the immediate coroutine function body. If a nested regular function needs to suspend, it must itself be refactored into a coroutine returning an awaitable. |
| Resumption | coro_(): resumes from last yield. Calling the pull_type object switches back to the coroutine's stack and continues execution right after the last yield() call. | handle.resume(): resumes from last co_await. The std::coroutine_handle<> is a lightweight pointer to the coroutine frame. Calling .resume() jumps to the suspension point via a function-call dispatch (no stack switch). |
| Creation | pull_type constructor auto-starts the coroutine. When a pull_type is constructed, it immediately transfers control into the coroutine body, which runs until its first yield(). The caller must account for this eager start. | Calling a coroutine function returns a suspended handle. The function body does NOT execute until handle.resume() is called (when initial_suspend() returns suspend_always). This lazy-start model gives the caller full control over when execution begins. |
| Completion Check | static_cast<bool>(coro_) returns false when done. The pull_type is contextually convertible to bool; it becomes false after the coroutine body returns. | handle.done() returns true when done. A direct query on the coroutine handle. Calling resume() after done() is true is undefined behavior. |
| Value Passing | Typed via pull_type<T> / push_type<T>. Values are exchanged through the coroutine's type parameter: pull_type<T> pulls values out, push_type<T> pushes values in. rippled uses <void> (no values exchanged). | Via promise_type::return_value(T) or co_return. Values are stored in the promise object inside the coroutine frame. The caller retrieves them through await_resume(). For void coroutines, return_void() is used instead. |
| Exception Handling | Natural stack-based propagation. An exception thrown inside the coroutine unwinds its stack normally and propagates to the caller at the pull_type call site (i.e., whoever called coro_()). | Explicit capture via promise_type::unhandled_exception(). Exceptions thrown in the coroutine body are caught by the promise and stored (typically as std::exception_ptr). They are rethrown in await_resume() when the caller co_awaits the result. |
| Cancellation | Application-managed (poll a flag). There is no built-in cancellation. rippled uses expectEarlyExit() to mark a coroutine as abandoned during shutdown, then decrements nSuspend_ so JobQueue::stop() can proceed. | Via await_ready() or cancellation tokens. An awaiter can check a cancellation flag in await_ready() and return true to skip suspension. Alternatively, std::stop_token patterns (C++20) can be threaded through. Our JobQueueAwaiter returns false from await_suspend() when the JobQueue is stopping, effectively cancelling the suspend. |
| Keywords | None (library-only). All coroutine machinery is expressed through library types (pull_type, push_type) and regular function calls. No special language syntax required. | co_await, co_yield, co_return. The presence of any of these keywords in a function body makes it a coroutine. The compiler generates the state machine, frame allocation, and suspension/resumption code automatically. |
| Standard | Boost library (not ISO C++). Boost.Coroutine is deprecated in favor of Boost.Coroutine2, which itself has no active development. Depends on Boost.Context for platform-specific assembly-level stack switching. | ISO C++20 standard. Part of the language specification. Supported by all major compilers (GCC 11+, Clang 14+, MSVC 19.28+). Tooling, debugger support, and static analysis are steadily improving across the ecosystem. |
1.3 Performance Characteristics
| Metric | Boost.Coroutine2 | C++20 Coroutines |
|---|---|---|
| Memory per coroutine | ~1MB (fixed stack) | ~200-500 bytes (frame only) |
| 1000 concurrent coroutines | ~1 GB | ~0.5 MB |
| Context switch cost | ~40-100 CPU cycles (fcontext save/restore) | ~20-50 CPU cycles (function call) |
| Allocation | Stack allocated at creation | Heap allocation (compiler may elide) |
| Cache behavior | Poor (large stack rarely fully used) | Good (small frame, hot data close) |
| Compiler optimization | Opaque to compiler | Inlinable, optimizable |
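The memory rows in the table follow from simple arithmetic, reproduced here as a small sketch. The constant names are made up for illustration, the 500-byte frame is the upper end of the document's own estimate rather than a measurement, and a 1 MB stack is taken as 1 MiB.

```cpp
#include <cstddef>

constexpr std::size_t kCoroutines = 1000;
constexpr std::size_t kBoostStackBytes = 1024 * 1024;  // 1 MiB fixed stack per coroutine
constexpr std::size_t kFrameBytes = 500;               // assumed upper end of the 200-500 B estimate

// 1000 * 1,048,576 B = 1,048,576,000 B, roughly 1 GB
constexpr std::size_t kBoostTotalBytes = kCoroutines * kBoostStackBytes;

// 1000 * 500 B = 500,000 B, roughly 0.5 MB
constexpr std::size_t kFrameTotalBytes = kCoroutines * kFrameBytes;

// 1,048,576 / 500 = 2097 (integer division), i.e. on the order of 2000x
constexpr std::size_t kPerCoroutineRatio = kBoostStackBytes / kFrameBytes;
```

Note that the Boost figure is reserved stack space; how much of it is resident depends on the allocator and the operating system, which this sketch does not model.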
1.4 Feature Parity Analysis
Suspension Points
- Boost: Can yield from any nesting level: fn_a() calls fn_b() calls yield(). The entire call stack is preserved.
- C++20: Suspension only at co_await expressions in the immediate coroutine function. Nested functions that need to suspend must themselves be coroutines returning awaitables.
- Impact: (Assumption, needs confirmation from people who know the code better) Rippled's usage is shallow: yield() is called directly from the RPC handler lambda, never from deeply nested code. This makes migration straightforward.
Boost — yield from coroutine body, resume later via post():
jq.postCoro(jtCLIENT, "Handler", [&](auto const& coro) {
auto result = doFirstHalf();
coro->yield(); // suspend — entire stack preserved
// resumes here when coro->post() is called externally
doSecondHalf(result);
});
C++20 — co_await suspends, JobQueueAwaiter combines yield + auto-repost:
jq.postCoroTask(jtCLIENT, "Handler", [&](auto runner) -> CoroTask<void> {
auto result = doFirstHalf();
co_await JobQueueAwaiter{runner}; // suspend + auto-repost
// resumes on a worker thread when the job is picked up
doSecondHalf(result);
co_return;
});
Key difference — Boost can yield from nested calls; C++20 cannot:
// Boost — works: yield from inside a helper function
void helper(std::shared_ptr<JobQueue::Coro> coro) {
coro->yield(); // OK — stackful, entire call stack is preserved
}
jq.postCoro(jtCLIENT, "Deep", [](auto coro) { helper(coro); });
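// C++20: does NOT compile. Using co_await makes helper a coroutine,
// but void has no promise_type by default, so it is not a valid coroutine return type.
void helper(std::shared_ptr<CoroTaskRunner> runner) {
co_await runner->suspend(); // COMPILE ERROR: no promise_type for return type void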
}
// FIX: helper must itself be a coroutine returning CoroTask<void>
CoroTask<void> helper(std::shared_ptr<CoroTaskRunner> runner) {
co_await runner->suspend(); // OK — this is a coroutine
co_return;
}
Exception Handling
- Boost: Exceptions propagate naturally up the call stack across yield points.
- C++20: Exceptions in coroutine body are caught by promise_type::unhandled_exception(). Must be explicitly stored and rethrown.
- Impact: Need to implement unhandled_exception() in promise type. Pattern is well-established.
Boost — exceptions propagate naturally through yield():
jq.postCoro(jtCLIENT, "Risky", [](auto coro) {
coro->yield();
throw std::runtime_error("oops");
// Exception propagates up the coroutine stack naturally.
// The Coro::resume() caller sees it when the coroutine unwinds.
});
C++20 — exceptions are captured by promise_type and rethrown on co_await:
// Inner coroutine throws
CoroTask<int> failingOp() {
throw std::runtime_error("oops");
co_return 0; // never reached
}
// Outer coroutine catches — exception crosses coroutine boundary via promise
jq.postCoroTask(jtCLIENT, "Caller", [](auto runner) -> CoroTask<void> {
try {
int v = co_await failingOp(); // rethrows here
} catch (std::runtime_error const& e) {
// e.what() == "oops" — caught across coroutine boundary
}
co_return;
});
Key difference — C++20 requires explicit plumbing, but it's already wired up:
// Inside CoroTask<void>::promise_type (already implemented):
void unhandled_exception() {
exception_ = std::current_exception(); // capture
}
// Inside CoroTask<void>::await_resume() (already implemented):
void await_resume() {
if (auto& ep = handle_.promise().exception_)
std::rethrow_exception(ep); // rethrow to caller
}
Cancellation
- Boost: rippled uses expectEarlyExit() for graceful shutdown, not a general cancellation mechanism.
- C++20: Can check cancellation in await_ready() before suspension, or via stop_token patterns.
- Impact: C++20 provides strictly better cancellation support.
Boost — expectEarlyExit() for cleanup when coroutine never ran:
auto coro = std::make_shared<Coro>(create, jq, t, name, fn);
if (!coro->post()) {
// JobQueue is stopping — coroutine will never run.
// Must manually decrement nSuspend_ so shutdown doesn't hang.
coro->expectEarlyExit();
coro.reset();
}
No cooperative in-body cancellation — coroutine just runs to completion or gets abandoned.
C++20 — expectEarlyExit() for the same case, plus cooperative in-body checking:
// Same early-exit pattern when post() fails:
auto runner = CoroTaskRunner::create(jq, t, name);
runner->init(fn);
++nSuspend_;
if (!runner->post()) {
runner->expectEarlyExit(); // decrements nSuspend_, destroys frame
runner.reset();
}
// Cooperative cancellation — coroutine checks jq.isStopping() after each yield:
jq.postCoroTask(jtCLIENT, "Long", [jqp = &jq](auto runner) -> CoroTask<void> {
while (hasWork()) {
co_await JobQueueAwaiter{runner};
if (jqp->isStopping())
co_return; // graceful exit
doNextChunk();
}
co_return;
});
C++20 bonus — JobQueueAwaiter::await_suspend() handles shutdown automatically:
bool await_suspend(std::coroutine_handle<>) {
runner->onSuspend();
if (!runner->post()) {
// JQ stopping — undo suspend, return false so coroutine
// continues immediately (can fall through to co_return)
runner->onUndoSuspend();
return false;
}
return true; // actually suspend
}
1.5 Compiler Support
| Compiler | rippled Minimum | C++20 Coroutine Support | Status |
|---|---|---|---|
| GCC | 12.0+ | Full (since GCC 11) | Ready |
| Clang | 16.0+ | Full (since Clang 14) | Ready |
| MSVC | 19.28+ | Full (since VS2019 16.8) | Ready |
rippled already requires C++20 (CMAKE_CXX_STANDARD 20 in CMakeLists.txt). All supported compilers have mature C++20 coroutine support. No compiler upgrades required.
1.6 Viability Analysis — Addressing Stackless Concerns
C++20 stackless coroutines have well-known limitations compared to stackful coroutines. This section analyzes each concern against rippled's actual codebase to determine viability.
Concern 1: Cannot Suspend from Nested Call Stacks
Claim: Stackless coroutines cannot yield from arbitrary stack depths. If fn_a() calls fn_b() calls yield(), only stackful coroutines can suspend the entire chain.
Analysis: An exhaustive codebase audit found:
- 1 production yield() call: RipplePathFind.cpp:131, directly in the handler function body
- All test yield() calls: directly in postCoro lambda bodies (Coroutine_test.cpp, JobQueue_test.cpp)
- The push_type* architecture makes deep-nested yield() structurally impossible: the yield_ pointer is only available inside the postCoro lambda via the shared_ptr<Coro>, and handlers call context.coro->yield() at the top level
graph LR
subgraph Stackful["Stackful (Boost) — can yield anywhere"]
direction TB
A1["postCoro lambda"] --> A2["handlerFn()"]
A2 --> A3["helperFn()"]
A3 --> A4["coro→yield() ✅"]
end
subgraph Stackless["Stackless (C++20) — co_await at top only"]
direction TB
B1["postCoroTask lambda"] --> B2["co_await ✅"]
B1 --> B3["regularFn()"]
B3 -.-> B4["co_await ❌"]
end
subgraph Rippled["rippled actual usage — all shallow"]
direction TB
C1["postCoro lambda"] --> C2["context.coro→yield()<br/>(direct, no nesting)"]
end
style A4 fill:#f96,stroke:#333,color:#000
style B4 fill:#f66,stroke:#333,color:#fff
style C2 fill:#3d8,stroke:#333,color:#000
Verdict: This concern does NOT apply. All suspension is shallow.
Concern 2: Colored Function Problem (Viral co_await)
Claim: Once a function needs to suspend, every caller up the chain must also be a coroutine. This "infects" the call chain.
Analysis: In rippled's case, the coloring is minimal:
- postCoroTask() launches a coroutine; this is the "root" colored function
- The postCoro lambda itself becomes the coroutine function (returns CoroTask<void>)
- doRipplePathFind() is the only handler that calls co_await
- No other handler in the chain needs to become a coroutine; they continue to be regular functions dispatched through doCommand()
The "coloring" stops at the entry point lambda and the one handler that suspends. No deep infection.
graph TD
subgraph Feared["Feared: deep coloring infection"]
direction TB
F1["main()"] -->|"must become<br/>coroutine"| F2["Server::run()"]
F2 -->|"must become<br/>coroutine"| F3["dispatch()"]
F3 -->|"must become<br/>coroutine"| F4["doCommand()"]
F4 -->|"must become<br/>coroutine"| F5["handler()"]
F5 --> F6["co_await"]
style F1 fill:#f66,stroke:#333,color:#fff
style F2 fill:#f66,stroke:#333,color:#fff
style F3 fill:#f66,stroke:#333,color:#fff
style F4 fill:#f66,stroke:#333,color:#fff
style F5 fill:#f66,stroke:#333,color:#fff
end
subgraph Actual["Actual: coloring stops at entry point"]
direction TB
A1["main()"] --- A2["Server::run()"]
A2 --- A3["dispatch()"]
A3 --- A4["doCommand()"]
A4 -->|"only this<br/>is a coroutine"| A5["postCoroTask lambda<br/>→ CoroTask<void>"]
A5 --> A6["co_await"]
style A1 fill:#eee,stroke:#333,color:#000
style A2 fill:#eee,stroke:#333,color:#000
style A3 fill:#eee,stroke:#333,color:#000
style A4 fill:#eee,stroke:#333,color:#000
style A5 fill:#f96,stroke:#333,color:#000
style A6 fill:#3d8,stroke:#333,color:#000
end
Verdict: Minimal impact. Only 4 functions (3 entry-point lambdas + 1 handler) need to become coroutines.
Concern 3: No Standard Library Support for Common Patterns
Claim: C++20 provides the language primitives but no standard task type, executor integration, or composition utilities.
Analysis: This is accurate — we need to write custom types:
- CoroTask<T> (task/return type): well-established pattern, ~80 lines
- JobQueueAwaiter (executor integration): ~20 lines
- FinalAwaiter (continuation chaining): ~10 lines
However, these types are small, well-understood, and have extensive reference implementations (cppcoro, folly::coro, libunifex). The total boilerplate is approximately 150-200 lines of header code.
graph TD
subgraph StdLib["C++20 standard provides"]
direction LR
S1["coroutine_handle<P>"]
S2["suspend_always /<br/>suspend_never"]
S3["noop_coroutine()"]
S4["co_await / co_return"]
end
subgraph Custom["Custom types we wrote (~150 lines total)"]
direction TB
C1["CoroTask<T><br/>~80 lines<br/>(task + promise_type)"]
C2["JobQueueAwaiter<br/>~20 lines<br/>(suspend + repost)"]
C3["FinalAwaiter<br/>~10 lines<br/>(symmetric transfer)"]
C4["CoroTaskRunner<br/>~40 lines decl<br/>(lifecycle manager)"]
end
subgraph Ref["Reference implementations"]
direction LR
R1["cppcoro"]
R2["folly::coro"]
R3["libunifex"]
end
S1 --> C1
S2 --> C1
S3 --> C3
S4 --> C2
Ref -.->|"patterns<br/>borrowed from"| Custom
Verdict: Manageable. Custom types are small and well-documented in C++ community.
Concern 4: Stack Overflow from Synchronous Resumption Chains
Claim: If coroutine A co_awaits coroutine B, and B completes synchronously, B's final_suspend resumes A on the same stack, potentially building up unbounded stack depth.
Analysis: This is addressed by symmetric transfer via FinalAwaiter::await_suspend() returning a coroutine_handle<> instead of void. The compiler transforms this into a tail-call, preventing stack growth. This is the standard solution used by all major coroutine libraries and is implemented in our FinalAwaiter design (Section 4.1).
graph TD
subgraph Problem["Without symmetric transfer — stack grows"]
direction TB
P1["A resumes B"] --> P2["B::resume()<br/>stack frame +1"]
P2 --> P3["B completes<br/>final_suspend resumes A"]
P3 --> P4["A::resume()<br/>stack frame +2"]
P4 --> P5["A resumes C"]
P5 --> P6["C::resume()<br/>stack frame +3"]
P6 --> P7["... stack overflow ❌"]
style P7 fill:#f66,stroke:#333,color:#fff
end
subgraph Solution["With symmetric transfer — tail call, no growth"]
direction TB
S1["A resumes B"] --> S2["B::resume()<br/>stack frame 1"]
S2 --> S3["B completes"]
S3 -->|"FinalAwaiter returns<br/>handle → tail call"| S4["A::resume()<br/>stack frame 1 (reused)"]
S4 --> S5["A resumes C"]
S5 -->|"tail call"| S6["C::resume()<br/>stack frame 1 (reused)"]
S6 --> S7["... bounded ✅"]
style S7 fill:#3d8,stroke:#333,color:#000
end
Verdict: Solved by symmetric transfer (already in our design).
Concern 5: Dangling Reference Risk
Claim: Coroutine frames are heap-allocated and outlive the calling scope, making references to locals dangerous.
Analysis: This is a real concern that requires engineering discipline:
- Coroutine parameters are copied into the frame (safe by default)
- References passed to coroutine functions can dangle if the referent's scope ends before the coroutine completes
- Our design mitigates this: RPC::Context is passed by reference but its lifetime is managed by shared_ptr<Coro> / the entry point lambda's scope, which outlives the coroutine
graph TD
subgraph Danger["Dangling reference — ❌ use-after-free"]
direction TB
D1["caller() scope"] --> D2["int local = 42"]
D2 --> D3["launch coroutine<br/>with &local"]
D3 --> D4["caller returns<br/>local destroyed"]
D4 --> D5["coroutine resumes<br/>reads &local 💥"]
style D5 fill:#f66,stroke:#333,color:#fff
end
subgraph Safe["rippled pattern — ✅ lifetime managed"]
direction TB
S1["postCoroTask()"] --> S2["shared_ptr<Runner><br/>owns coroutine frame"]
S2 --> S3["lambda captures<br/>by value or<br/>shared_ptr"]
S3 --> S4["FuncStore keeps<br/>lambda alive on heap"]
S4 --> S5["coroutine resumes<br/>captures still valid ✅"]
style S5 fill:#3d8,stroke:#333,color:#000
end
Verdict: Real risk, but manageable with RAII patterns and ASAN testing.
Concern 6: yield_to.h / boost::asio::spawn
Claim: yield_to.h:111 uses boost::asio::spawn, suggesting broader coroutine usage.
Analysis: yield_to.h uses boost::asio::spawn with boost::context::fixedsize_stack(2 * 1024 * 1024) — this is a completely separate coroutine system:
- Different type: boost::asio::yield_context (not push_type*)
- Different purpose: test infrastructure for async I/O tests
- Different mechanism: Boost.Asio stackful coroutines (not Boost.Coroutine2)
- Not part of this migration scope: used only in tests and unrelated to JobQueue::Coro
graph TD
subgraph ThisMigration["This migration (JobQueue::Coro)"]
direction TB
M1["Boost.Coroutine2<br/>push_type / pull_type"] -->|"replace with"| M2["C++20 coroutines<br/>CoroTask + co_await"]
M3["JobQueue::Coro"] -->|"replace with"| M4["CoroTaskRunner"]
M5["coro→yield() + post()"] -->|"replace with"| M6["JobQueueAwaiter"]
end
subgraph OutOfScope["Out of scope (yield_to.h)"]
direction TB
O1["boost::asio::spawn"]
O2["yield_context"]
O3["fixedsize_stack(2MB)"]
O1 --- O2
O1 --- O3
end
ThisMigration ~~~ OutOfScope
style OutOfScope fill:#eee,stroke:#999
Verdict: Separate system. Out of scope for this migration.
Overall Viability Conclusion
The migration IS viable because:
- rippled's coroutine usage is shallow (no deep-nested yield)
- The colored function infection is limited to 4 call sites
- Custom types are small and well-understood
- Symmetric transfer solves the stack overflow concern
- ASAN/TSAN testing catches lifetime and race bugs
- The alternative (ASAN annotations for Boost.Context) only addresses sanitizer false positives — it does not provide memory savings, standard compliance, or the dependency elimination that C++20 migration delivers
1.7 Merits & Demerits Summary
Merits of C++20 Migration
- 2000x memory reduction per coroutine (1MB → ~500 bytes)
- Faster context switching (~2x improvement)
- Remove external dependency on Boost.Coroutine (and transitively Boost.Context)
- Language-native — better tooling, debugger support, static analysis
- Future-proof — ISO standard, not a deprecated library
- Compiler-optimizable — suspension points can be inlined/elided
- ASAN compatibility — eliminates Boost context-switching false positives (see
docs/build/sanitizers.md)
Demerits / Challenges
- Stackless limitation — cannot yield from nested calls (verified: not an issue for rippled's shallow usage)
- Explicit lifetime management: coroutine_handle::destroy() must be called (mitigated by RAII CoroTask)
- Verbose boilerplate: promise_type, awaiter interfaces (~150-200 lines of infrastructure code)
- Debugging — no visible coroutine stack in debugger (improving with tooling)
- Learning curve — team needs familiarity with C++20 coroutine machinery
- Dangling reference risk — coroutine frames outlive calling scope (mitigated by ASAN + careful design)
- No standard library task type — must write custom CoroTask, awaiters (well-established patterns exist)
Alternative Considered: ASAN Annotations Only
Instead of full migration, one could keep Boost.Coroutine and add __sanitizer_start_switch_fiber / __sanitizer_finish_switch_fiber annotations to Coro.ipp to suppress ASAN false positives. This was evaluated and rejected because:
- It only fixes sanitizer false positives — does NOT reduce 1MB/coroutine memory usage
- Does NOT remove the deprecated Boost.Coroutine dependency
- Does NOT provide standard compliance or future-proofing
- The full migration is feasible given shallow yield usage and delivers all the above benefits
2. Current State Assessment
2.1 Architecture Overview
graph TD
subgraph "Request Entry Points"
HTTP["HTTP Request<br/>ServerHandler::onRequest()"]
WS["WebSocket Message<br/>ServerHandler::onWSMessage()"]
GRPC["gRPC Request<br/>CallData::process()"]
end
subgraph "Coroutine Layer"
POST["JobQueue::postCoro()<br/>Creates Coro + schedules job"]
CORO["JobQueue::Coro<br/>boost::coroutines::pull_type<br/>1MB stack per instance"]
end
subgraph "JobQueue Thread Pool"
W1["Worker Thread 1"]
W2["Worker Thread 2"]
WN["Worker Thread N"]
end
subgraph "RPC Handlers"
CTX["RPC::Context<br/>holds shared_ptr#lt;Coro#gt;"]
RPC["RPC Handler<br/>e.g. doRipplePathFind"]
YIELD["coro.yield()<br/>Suspends execution"]
RESUME["coro.post()<br/>Reschedules on JobQueue"]
end
HTTP --> POST
WS --> POST
GRPC --> POST
POST --> CORO
CORO --> W1
CORO --> W2
CORO --> WN
W1 --> CTX
W2 --> CTX
CTX --> RPC
RPC --> YIELD
YIELD -.->|"event completes"| RESUME
RESUME --> W1
2.2 JobQueue::Coro Implementation Audit
File: include/xrpl/core/JobQueue.h (lines 40-120) + include/xrpl/core/Coro.ipp
Class Members
class Coro : public std::enable_shared_from_this<Coro> {
detail::LocalValues lvs_; // Per-coroutine thread-local storage
JobQueue& jq_; // Parent JobQueue reference
JobType type_; // Job type (jtCLIENT_RPC, etc.)
std::string name_; // Name for logging
bool running_; // Is currently executing
std::mutex mutex_; // Prevents concurrent resume
std::mutex mutex_run_; // Guards running_ flag
std::condition_variable cv_; // For join() blocking
boost::coroutines::asymmetric_coroutine<void>::pull_type coro_; // THE BOOST COROUTINE
boost::coroutines::asymmetric_coroutine<void>::push_type* yield_; // Yield function pointer
bool finished_; // Debug assertion flag
};
Boost.Coroutine APIs Used
| API | Location | Purpose |
|---|---|---|
| asymmetric_coroutine<void>::pull_type | JobQueue.h:51 | The coroutine object itself |
| asymmetric_coroutine<void>::push_type | JobQueue.h:52 | Yield function type |
| boost::coroutines::attributes(megabytes(1)) | Coro.ipp:23 | Stack size configuration |
| #include <boost/coroutine/all.hpp> | JobQueue.h:10 | Header inclusion |
Method Behaviors
| Method | Behavior |
|---|---|
| Constructor | Creates pull_type with 1MB stack. Lambda captures user function. Auto-runs to first yield(). |
| yield() | Increments jq_.nSuspend_, calls (*yield_)() to suspend. Returns control to caller. |
| post() | Sets running_=true, calls jq_.addJob() with a lambda that calls resume(). Returns false if JobQueue is stopping. |
| resume() | Swaps LocalValues, acquires mutex_, calls coro_() to resume. Restores LocalValues. Sets running_=false, notifies cv_. |
| runnable() | Returns static_cast<bool>(coro_), which is true if the coroutine hasn't returned. |
| expectEarlyExit() | Decrements nSuspend_, sets finished_=true. Used during shutdown. |
| join() | Blocks on cv_ until running_==false. |
2.3 Coroutine Execution Lifecycle
sequenceDiagram
participant HT as Handler Thread
participant JQ as JobQueue
participant WT as Worker Thread
participant C as Coro
participant UF as User Function
HT->>JQ: postCoro(type, name, fn)
JQ->>C: Coro::Coro() constructor
Note over C: pull_type auto-starts lambda
C->>C: yield_ = #amp;do_yield
C->>C: yield() [initial suspension]
C-->>JQ: Returns to constructor
JQ->>JQ: coro->post()
JQ->>JQ: addJob(type, name, resume_lambda)
JQ-->>HT: Returns shared_ptr#lt;Coro#gt;
Note over HT: Handler thread is FREE
WT->>C: resume() [job executes]
Note over C: Swap LocalValues
C->>C: coro_() [resume boost coroutine]
C->>UF: fn(shared_from_this())
UF->>UF: Do work...
UF->>C: coro->yield() [suspend]
Note over C: ++nSuspend_, invoke yield_()
C-->>WT: Returns from resume()
Note over WT: Worker thread is FREE
Note over UF: External event completes
UF->>C: coro->post() [reschedule]
C->>JQ: addJob(resume_lambda)
WT->>C: resume() [job executes]
C->>C: coro_() [resume]
C->>UF: Continues after yield()
UF->>UF: Finish work
UF-->>C: Return [coroutine complete]
Note over C: running_=false, cv_.notify_all()
2.4 All Coroutine Touchpoints
Core Infrastructure (Must Change)
| File | Role | Lines of Interest |
|---|---|---|
| include/xrpl/core/JobQueue.h | Coro class definition, postCoro template | Lines 10, 40-120, 385-402 |
| include/xrpl/core/Coro.ipp | Coro method implementations | All (122 lines) |
| include/xrpl/basics/LocalValue.h | Per-coroutine thread-local storage | Lines 12-59 (LocalValues) |
| cmake/deps/Boost.cmake | Boost.Coroutine dependency | Lines 7, 24 |
Entry Points (postCoro Callers)
| File | Entry Point | Job Type |
|---|---|---|
| src/xrpld/rpc/detail/ServerHandler.cpp:287 | onRequest(): HTTP RPC | jtCLIENT_RPC |
| src/xrpld/rpc/detail/ServerHandler.cpp:325 | onWSMessage(): WebSocket | jtCLIENT_WEBSOCKET |
| src/xrpld/app/main/GRPCServer.cpp:102 | CallData::process(): gRPC | jtRPC |
Context Propagation
| File | Role |
|---|---|
| src/xrpld/rpc/Context.h:27 | RPC::Context holds shared_ptr<JobQueue::Coro> coro |
| src/xrpld/rpc/ServerHandler.h:174-188 | processSession/processRequest pass coro through |
Active Coroutine Consumer (yield/post)
| File | Usage |
|---|---|
| src/xrpld/rpc/handlers/RipplePathFind.cpp:131 | context.coro->yield(): suspends for path-finding |
| src/xrpld/rpc/handlers/RipplePathFind.cpp:116-123 | Continuation calls coro->post() or coro->resume() |
Test Files
| File | Tests |
|---|---|
| src/test/core/Coroutine_test.cpp | correct_order, incorrect_order, thread_specific_storage |
| src/test/core/JobQueue_test.cpp | testPostCoro (post/resume cycles, shutdown behavior) |
| src/test/app/Path_test.cpp | Path-finding RPC via postCoro |
| src/test/jtx/impl/AMMTest.cpp | AMM RPC via postCoro |
2.5 Suspension/Continuation Model
The current model documented in src/xrpld/rpc/README.md defines four functional types:
Callback = std::function<void()> — generic 0-arg function
Continuation = std::function<void(Callback)> — calls Callback later
Suspend = std::function<void(Continuation)> — runs Continuation, suspends
Coroutine = std::function<void(Suspend)> — given a Suspend, starts work
In practice, JobQueue::Coro simplifies this to:
- Suspend = coro->yield()
- Continue = coro->post() (async on JobQueue) or coro->resume() (sync on current thread)
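To show how the four functional types nest, here is a toy sketch. It only illustrates the type composition: nothing actually suspends, the pending vector is a made-up stand-in for an external event source, and runToy is a hypothetical driver that does not exist in rippled.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Callback = std::function<void()>;
using Continuation = std::function<void(Callback)>;
using Suspend = std::function<void(Continuation)>;
using Coroutine = std::function<void(Suspend)>;

// Toy driver: records the order in which the pieces run.
std::vector<std::string> runToy() {
    std::vector<std::string> log;
    std::vector<Callback> pending;  // callbacks waiting for the toy "event"

    // Continuation: receives a Callback and arranges for it to be called later.
    Continuation onEvent = [&](Callback cb) { pending.push_back(std::move(cb)); };

    // Suspend: runs the Continuation, handing it the resume callback.
    // A real implementation would also suspend the coroutine here.
    Suspend suspend = [&](Continuation c) {
        log.push_back("suspend");
        c([&] { log.push_back("resumed"); });
    };

    // Coroutine: given a Suspend, starts work and asks to be suspended.
    Coroutine coro = [&](Suspend s) {
        log.push_back("start");
        s(onEvent);
    };

    coro(suspend);
    for (auto& cb : pending) cb();  // the "event" fires and resumes the work
    return log;
}
```

The recorded order is start, suspend, resumed. In JobQueue::Coro the Suspend role is played by yield() and the Callback handed to the continuation ends up calling post() or resume().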
2.6 CMake Dependency
In cmake/deps/Boost.cmake:
find_package(Boost REQUIRED COMPONENTS ... coroutine ...)
target_link_libraries(xrpl_boost INTERFACE ... Boost::coroutine ...)
Additionally in cmake/XrplInterface.cmake:
BOOST_COROUTINES_NO_DEPRECATION_WARNING // Suppresses Boost.Coroutine deprecation warnings
2.7 Existing C++20 Coroutine Usage
rippled already uses C++20 coroutines in test code:
- src/tests/libxrpl/net/HTTPClient.cpp uses co_await with boost::asio::use_awaitable
- Demonstrates team familiarity with C++20 coroutine syntax
- Proves compiler toolchain supports C++20 coroutines
3. Migration Strategy
3.1 Incremental vs Atomic Migration
Decision: Incremental (multi-phase) migration.
Rationale:
- Only one RPC handler (RipplePathFind) actively uses yield()/post() suspension
- The three entry points (HTTP, WS, gRPC) all funnel through postCoro()
- The RPC::Context.coro field is the sole propagation mechanism
- We can introduce a new C++20 coroutine system alongside the existing one and migrate callsites incrementally
3.2 Migration Phases
graph TD
subgraph "Phase 1: Foundation"
P1A["Create CoroTask#lt;T#gt; type<br/>(promise_type, awaiter)"]
P1B["Create JobQueueAwaiter<br/>(schedules resume on JobQueue)"]
P1C["Add postCoroTask() to JobQueue<br/>(parallel to postCoro)"]
P1D["Unit tests for new primitives"]
P1A --> P1B --> P1C --> P1D
end
subgraph "Phase 2: Entry Point Migration"
P2A["Migrate ServerHandler::onRequest()"]
P2B["Migrate ServerHandler::onWSMessage()"]
P2C["Migrate GRPCServer::CallData::process()"]
P2D["Update RPC::Context to use new type"]
P2A --> P2D
P2B --> P2D
P2C --> P2D
end
subgraph "Phase 3: Handler Migration"
P3A["Migrate RipplePathFind handler"]
P3B["Verify all other handlers<br/>(no active yield usage)"]
end
subgraph "Phase 4: Cleanup"
P4A["Remove old Coro class"]
P4B["Remove Boost.Coroutine from CMake"]
P4C["Remove deprecation warning suppression"]
P4D["Final benchmarks & validation"]
end
P1D --> P2A
P2D --> P3A
P3B --> P4A
P3A --> P4A
P4A --> P4B --> P4C --> P4D
3.3 Coexistence Strategy
During migration, both implementations will coexist:
graph LR
subgraph "Transition Period"
OLD["JobQueue::Coro<br/>(Boost, existing)"]
NEW["JobQueue::CoroTask<br/>(C++20, new)"]
CTX["RPC::Context"]
end
CTX -->|"phase 1-2"| OLD
CTX -->|"phase 2-3"| NEW
style OLD fill:#fdd,stroke:#c00,color:#000
style NEW fill:#dfd,stroke:#0a0,color:#000
- RPC::Context will temporarily hold both shared_ptr<Coro> (old) and the new coroutine handle
- Entry points will be migrated one at a time
- Each migration is independently testable
- Once all entry points and handlers are migrated, old code is removed
3.4 Breaking Changes & Compatibility
| Concern | Impact | Mitigation |
|---|---|---|
| RPC::Context::coro type change | All RPC handlers receive context | Migrate context field last, after all consumers updated |
| postCoro() removal | 3 callers | Replace with postCoroTask(), remove old API in Phase 4 |
| LocalValue integration | Thread-local storage must work | New implementation must swap LocalValues identically |
| Shutdown behavior | expectEarlyExit(), nSuspend_ tracking | Replicate in new CoroTask |
4. Implementation Plan
4.1 New Type Design
CoroTask<T> — Coroutine Return Type
classDiagram
class CoroTask~T~ {
+Handle handle_
+CoroTask(Handle h)
+destroy()
+bool done() const
+T get() const
+bool await_ready() const
+void await_suspend(coroutine_handle h) const
+T await_resume() const
}
class promise_type {
-result_ : variant~T, exception_ptr~
-continuation_ : coroutine_handle
+CoroTask get_return_object()
+suspend_always initial_suspend()
+FinalAwaiter final_suspend()
+void return_value(T)
+void return_void()
+void unhandled_exception()
}
class FinalAwaiter {
+bool await_ready()
+coroutine_handle await_suspend(coroutine_handle~promise_type~)
+void await_resume()
}
class JobQueueAwaiter {
-jq_ : JobQueue
-type_ : JobType
-name_ : string
+bool await_ready()
+void await_suspend(coroutine_handle h)
+void await_resume()
}
CoroTask --> promise_type : contains
promise_type --> FinalAwaiter : returns from final_suspend
CoroTask ..> JobQueueAwaiter : used with co_await
JobQueueAwaiter — Schedules Resumption on JobQueue
// Conceptual design — actual implementation may vary
struct JobQueueAwaiter {
JobQueue& jq;
JobType type;
std::string name;
bool await_ready() { return false; } // Always suspend
void await_suspend(std::coroutine_handle<> h) {
// Schedule coroutine resumption as a job
jq.addJob(type, name, [h]() { h.resume(); });
}
void await_resume() {}
};
4.2 Mapping: Old API → New API
graph LR
subgraph "Current (Boost)"
A1["postCoro(type, name, fn)"]
A2["coro->yield()"]
A3["coro->post()"]
A4["coro->resume()"]
A5["coro->join()"]
A6["coro->runnable()"]
A7["coro->expectEarlyExit()"]
end
subgraph "New (C++20)"
B1["postCoroTask(type, name, fn)<br/>fn returns CoroTask<void>"]
B2["co_await JobQueueAwaiter{jq, type, name}"]
B3["Built into await_suspend()<br/>(automatic scheduling)"]
B4["handle.resume()<br/>(direct call)"]
B5["co_await task<br/>(continuation-based)"]
B6["handle.done()"]
B7["handle.destroy() + cleanup"]
end
A1 --> B1
A2 --> B2
A3 --> B3
A4 --> B4
A5 --> B5
A6 --> B6
A7 --> B7
4.3 File Changes Required
Phase 1: New Coroutine Primitives
| File | Action | Description |
|---|---|---|
| include/xrpl/core/CoroTask.h | CREATE | CoroTask<T> return type with promise_type, FinalAwaiter |
| include/xrpl/core/JobQueueAwaiter.h | CREATE | Awaiter that schedules resume on JobQueue |
| include/xrpl/core/JobQueue.h | MODIFY | Add postCoroTask() template alongside existing postCoro() |
| src/test/core/CoroTask_test.cpp | CREATE | Unit tests for CoroTask<T> and JobQueueAwaiter |
Phase 2: Entry Point Migration
| File | Action | Description |
|---|---|---|
| src/xrpld/rpc/detail/ServerHandler.cpp | MODIFY | onRequest() and onWSMessage(): replace postCoro → postCoroTask |
| src/xrpld/rpc/ServerHandler.h | MODIFY | Update processSession/processRequest signatures |
| src/xrpld/app/main/GRPCServer.cpp | MODIFY | CallData::process(): replace postCoro → postCoroTask |
| src/xrpld/app/main/GRPCServer.h | MODIFY | Update process() method signature |
| src/xrpld/rpc/Context.h | MODIFY | Change shared_ptr<JobQueue::Coro> to new coroutine handle type |
Phase 3: Handler Migration
| File | Action | Description |
|---|---|---|
| src/xrpld/rpc/handlers/RipplePathFind.cpp | MODIFY | Replace context.coro->yield() / coro->post() with co_await |
| src/test/app/Path_test.cpp | MODIFY | Update test to use new coroutine API |
| src/test/jtx/impl/AMMTest.cpp | MODIFY | Update test to use new coroutine API |
Phase 4: Cleanup
| File | Action | Description |
|---|---|---|
| include/xrpl/core/Coro.ipp | DELETE | Remove old Boost.Coroutine implementation |
| include/xrpl/core/JobQueue.h | MODIFY | Remove Coro class, postCoro(), Coro_create_t, Boost includes |
| cmake/deps/Boost.cmake | MODIFY | Remove coroutine from find_package and target_link_libraries |
| cmake/XrplInterface.cmake | MODIFY | Remove BOOST_COROUTINES_NO_DEPRECATION_WARNING |
| src/test/core/Coroutine_test.cpp | MODIFY | Rewrite tests for new CoroTask |
| src/test/core/JobQueue_test.cpp | MODIFY | Update testPostCoro to use new API |
| include/xrpl/basics/LocalValue.h | MODIFY | Update LocalValues integration for C++20 coroutines |
4.4 LocalValue Integration Design
The current LocalValue system swaps per-coroutine storage on resume/yield:
sequenceDiagram
participant WT as Worker Thread
participant LV as LocalValues
participant C as Coroutine
Note over WT: Thread has its own LocalValues
WT->>LV: saved = getLocalValues().release()
WT->>LV: getLocalValues().reset(#amp;coro.lvs_)
Note over LV: Now pointing to coroutine's storage
WT->>C: coro_() / handle.resume()
Note over C: User code sees coroutine's LocalValues
C-->>WT: yield / co_await returns
WT->>LV: getLocalValues().release()
WT->>LV: getLocalValues().reset(saved)
Note over LV: Restored to thread's storage
For C++20: The same swap pattern must be implemented in the awaiter's await_suspend() and await_resume(), or in a wrapper that calls handle.resume().
4.5 RipplePathFind Migration Design
Current pattern:
// Continuation callback
auto callback = [&context]() {
std::shared_ptr<JobQueue::Coro> coroCopy{context.coro};
if (!coroCopy->post()) {
coroCopy->resume(); // Fallback: run on current thread
}
};
// Start async work, then suspend
jvResult = makeLegacyPathRequest(request, callback, ...);
if (request) {
context.coro->yield(); // ← SUSPEND HERE
jvResult = request->doStatus(context.params); // ← RESUME HERE
}
Target pattern:
// Start async work, suspend via co_await
jvResult = makeLegacyPathRequest(request, /* awaiter-based callback */, ...);
if (request) {
co_await PathFindAwaiter{context}; // ← SUSPEND + RESUME via awaiter
jvResult = request->doStatus(context.params);
}
The PathFindAwaiter will encapsulate the scheduling logic currently in the lambda continuation.
5. Testing & Validation Strategy
5.1 Test Architecture
graph TD
subgraph "Unit Tests"
UT1["CoroTask_test<br/>- Construction/destruction<br/>- co_return values<br/>- Exception propagation<br/>- Lifetime management"]
UT2["JobQueueAwaiter_test<br/>- Schedule on correct JobType<br/>- Resume on worker thread<br/>- Shutdown handling"]
UT3["LocalValue integration<br/>- Per-coroutine isolation<br/>- Multi-coroutine concurrent<br/>- Cross-thread consistency"]
end
subgraph "Migration Tests"
MT1["Coroutine_test rewrite<br/>- correct_order<br/>- incorrect_order<br/>- thread_specific_storage"]
MT2["PostCoro migration<br/>- Post/resume cycles<br/>- Shutdown rejection<br/>- Early exit"]
end
subgraph "Integration Tests"
IT1["RPC Path Finding<br/>- Suspend/resume flow<br/>- Shutdown during suspend<br/>- Concurrent requests"]
IT2["Full --unittest suite<br/>- All existing tests pass<br/>- No regressions"]
end
subgraph "Performance Tests"
PT1["Memory benchmarks"]
PT2["Context switch benchmarks"]
PT3["RPC throughput under load"]
end
subgraph "Sanitizer Tests"
ST1["ASAN<br/>(memory errors)"]
ST2["TSAN<br/>(data races)"]
ST3["UBSan<br/>(undefined behavior)"]
end
UT1 --> MT1
UT2 --> MT2
MT1 --> IT1
MT2 --> IT2
IT1 --> PT1
IT2 --> PT2
PT1 --> ST1
PT2 --> ST2
PT3 --> ST3
5.2 Benchmarking Tests
Memory Usage Benchmark
Test: Create N coroutines, measure RSS
- N = 100, 1000, 10000
- Measure: peak RSS, per-coroutine overhead
- Compare: Boost (N * 1MB + overhead) vs C++20 (N * ~500B + overhead)
- Tool: /proc/self/status (VmRSS), or getrusage()
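A minimal sketch of the getrusage() option follows, assuming a POSIX platform. peakRss and peakRssGrowth are hypothetical helper names. Note that ru_maxrss reports the peak resident set size, in kilobytes on Linux and in bytes on macOS, so it can show growth but not a later drop.

```cpp
#include <cassert>
#include <sys/resource.h>

// Peak resident set size of this process, or -1 if the call fails.
// Units are platform dependent: kilobytes on Linux, bytes on macOS.
long peakRss()
{
    rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

// Growth in peak RSS caused by running `allocate` (for example, a callable
// that creates N coroutines and keeps them alive until it returns).
template <class F>
long peakRssGrowth(F&& allocate)
{
    long const before = peakRss();
    allocate();
    long const after = peakRss();
    return after - before;
}
```

Dividing the growth by N gives an approximate per-coroutine overhead, provided the allocations are actually touched so that the pages become resident.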
Context Switch Benchmark
Test: Yield/resume M times across N coroutines
- M = 100,000 iterations
- N = 1, 10, 100 concurrent coroutines
- Measure: total time, per-switch latency (ns)
- Compare: Boost yield/resume cycle vs C++20 co_await/resume cycle
- Tool: std::chrono::high_resolution_clock
RPC Throughput Benchmark
Test: Concurrent ripple_path_find requests
- Load: 10, 50, 100 concurrent requests
- Measure: requests/second, p50/p95/p99 latency
- Compare: before vs after migration
- Tool: Custom load generator or existing perf infrastructure
5.3 Unit Test Coverage
| Test | What It Validates |
|---|---|
| CoroTask<void> basic | Coroutine runs to completion, handle cleanup |
| CoroTask<int> with value | co_return value correctly retrieved |
| CoroTask exception | unhandled_exception() captures and rethrows |
| CoroTask cancellation | Destruction before completion cleans up |
| JobQueueAwaiter basic | co_await suspends, resumes on worker thread |
| JobQueueAwaiter shutdown | Returns false / throws when JobQueue stopping |
| PostCoroTask lifecycle | Create → suspend → resume → complete |
| PostCoroTask multiple yields | Multiple co_await points in sequence |
| LocalValue isolation | 4 coroutines, each sees own LocalValue |
| LocalValue cross-thread | Resume on different thread, values preserved |
5.4 Integration Testing
- All existing --unittest tests must pass unchanged (except coroutine-specific tests that are rewritten)
- Path_test must pass with identical behavior
- AMMTest RPC tests must pass
- ServerHandler HTTP/WS handling must work end-to-end
5.5 Sanitizer Testing
Per docs/build/sanitizers.md:
# ASAN (memory errors — especially important for coroutine frame lifetime)
export SANITIZERS=address,undefinedbehavior
# Build + test
# TSAN (data races — critical for concurrent coroutine resume)
export SANITIZERS=thread
# Build + test (separate build — cannot mix with ASAN)
Key benefit: Removing Boost.Coroutine eliminates the __asan_handle_no_return false positives caused by Boost context switching (documented in docs/build/sanitizers.md line 184).
5.6 Regression Testing Methodology
graph LR
subgraph "Before Migration (Baseline)"
B1["Build on develop branch"]
B2["Run --unittest (record pass/fail)"]
B3["Run memory benchmark (record RSS)"]
B4["Run context switch benchmark (record ns/switch)"]
end
subgraph "After Migration"
A1["Build on feature branch"]
A2["Run --unittest (compare pass/fail)"]
A3["Run memory benchmark (compare RSS)"]
A4["Run context switch benchmark (compare ns/switch)"]
end
subgraph "Acceptance Criteria"
C1["Zero test regressions"]
C2["Memory: ≤ baseline"]
C3["Context switch: ≤ baseline"]
C4["ASAN/TSAN clean"]
end
B1 --> B2 --> B3 --> B4
A1 --> A2 --> A3 --> A4
B2 -.->|compare| C1
A2 -.->|compare| C1
B3 -.->|compare| C2
A3 -.->|compare| C2
B4 -.->|compare| C3
A4 -.->|compare| C3
A2 -.-> C4
6. Risks & Mitigation
6.1 Risk Matrix
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Performance regression in context switching | Low | High | Benchmark before/after; C++20 should be faster |
| Coroutine frame lifetime bugs (use-after-destroy) | Medium | High | ASAN testing, RAII wrapper for handle, code review |
| Data races on resume | Medium | High | TSAN testing, careful await_suspend() implementation |
| LocalValue corruption across threads | Low | High | Dedicated test with 4+ concurrent coroutines |
| Shutdown race conditions | Medium | Medium | Replicate existing mutex/cv pattern in new design |
| Missed coroutine consumer during migration | Low | Medium | Exhaustive grep audit (Section 2.4 is complete) |
| Compiler bugs in coroutine codegen | Low | Medium | Test on all three compilers (GCC, Clang, MSVC) |
| Exception loss across suspension points | Medium | Medium | Test exception propagation in every phase |
| Third-party code depending on Boost.Coroutine | Very Low | Low | Grep confirms only internal usage |
| Dangling references in coroutine frames | Medium | High | ASAN testing, avoid reference params in coroutine functions, use shared_ptr |
| Colored function infection spreading | Low | Medium | Only 4 call sites need co_await; no nested handlers suspend |
| Symmetric transfer not available | Very Low | High | All target compilers (GCC 12+, Clang 16+) support symmetric transfer |
| Future handler adding deep yield | Low | Medium | Code review + CI: static analysis flag any yield from nested depth |
6.2 Rollback Strategy
graph TD
START["Migration In Progress"]
CHECK{"Critical Issue<br/>Discovered?"}
PHASE{"Which Phase?"}
P1["Phase 1: Delete new files<br/>No production code changed"]
P2["Phase 2: Revert entry point changes<br/>Old postCoro still present"]
P3["Phase 3: Revert handler changes<br/>Old Coro still present"]
P4["Phase 4: Cannot easily rollback<br/>Old code deleted"]
PREVENT["Prevention:<br/>Do NOT delete old code<br/>until Phase 4 is fully validated"]
START --> CHECK
CHECK -->|Yes| PHASE
CHECK -->|No| DONE["Continue Migration"]
PHASE -->|1| P1
PHASE -->|2| P2
PHASE -->|3| P3
PHASE -->|4| P4
P4 --> PREVENT
Key principle: Old Coro class and postCoro() remain in the codebase through Phases 1-3. They are only removed in Phase 4, after all migration is validated. Each phase is independently revertible via git revert.
6.3 Specific Risk: Stackful → Stackless Limitation
The Big Question: Can all current yield() call sites work with stackless co_await?
Analysis:
graph TD
Q["Does yield() get called from<br/>a deeply nested function?"]
Q -->|Yes| PROBLEM["PROBLEM: co_await can't<br/>suspend from nested calls"]
Q -->|No| OK["OK: Direct co_await<br/>in coroutine function"]
CHECK1["RipplePathFind.cpp:131<br/>context.coro->yield()"]
CHECK1 -->|"Called directly in handler"| OK
CHECK2["Coroutine_test.cpp<br/>c.yield()"]
CHECK2 -->|"Called directly in lambda"| OK
CHECK3["JobQueue_test.cpp<br/>c.yield()"]
CHECK3 -->|"Called directly in lambda"| OK
style OK fill:#dfd,stroke:#0a0,color:#000
style PROBLEM fill:#fdd,stroke:#c00,color:#000
Result: All yield() calls are in the direct body of the postCoro lambda or RPC handler function. No deep nesting exists. Migration to stackless co_await is fully feasible without architectural redesign.
7. Timeline & Milestones
7.1 Milestone Overview
gantt
title Migration Timeline
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 1 - Foundation
CoroTask + JobQueueAwaiter design :p1a, 2026-02-26, 3d
CoroTask implementation :p1b, after p1a, 3d
Unit tests for primitives :p1c, after p1b, 2d
PR 1 - New coroutine primitives :milestone, p1m, after p1c, 0d
section Phase 2 - Entry Points
Migrate ServerHandler (HTTP + WS) :p2a, after p1m, 3d
Migrate GRPCServer :p2b, after p2a, 2d
Update RPC Context :p2c, after p2b, 1d
PR 2 - Entry point migration :milestone, p2m, after p2c, 0d
section Phase 3 - Handlers
Migrate RipplePathFind :p3a, after p2m, 3d
Update test infrastructure :p3b, after p3a, 2d
PR 3 - Handler migration :milestone, p3m, after p3b, 0d
section Phase 4 - Cleanup
Remove old Coro and update CMake :p4a, after p3m, 2d
Performance benchmarks :p4b, after p4a, 2d
Sanitizer validation :p4c, after p4b, 1d
PR 4 - Cleanup + validation :milestone, p4m, after p4c, 0d
7.2 Milestone Details
Milestone 1: New Coroutine Primitives (PR #1)
Deliverables:
- CoroTask<T> with promise_type, FinalAwaiter
- CoroTask<void> specialization
- JobQueueAwaiter for scheduling on JobQueue
- postCoroTask() on JobQueue
- LocalValue integration in new coroutine type
- Unit test suite: CoroTask_test.cpp
Acceptance Criteria:
- All new unit tests pass
- Existing --unittest suite passes (no regressions from new code)
- ASAN + TSAN clean on new tests
- Code compiles on GCC 12+, Clang 16+
Milestone 2: Entry Point Migration (PR #2)
Deliverables:
- ServerHandler::onRequest() uses postCoroTask()
- ServerHandler::onWSMessage() uses postCoroTask()
- GRPCServer::CallData::process() uses postCoroTask()
- RPC::Context updated to carry new coroutine type
- processSession/processRequest signatures updated
Acceptance Criteria:
- HTTP, WebSocket, and gRPC RPC requests work end-to-end
- Full --unittest suite passes
- Manual smoke test: ripple_path_find via HTTP/WS
Milestone 3: Handler Migration (PR #3)
Deliverables:
- RipplePathFind uses co_await instead of yield()/post()
- Path_test and AMMTest updated
- Coroutine_test and JobQueue_test updated for new API
Acceptance Criteria:
- Path-finding suspension/continuation works correctly
- All --unittest tests pass
- Shutdown-during-pathfind scenario tested
Milestone 4: Cleanup & Validation (PR #4)
Deliverables:
- Old Coro class and Coro.ipp removed
- postCoro() removed from JobQueue
- Boost::coroutine removed from CMake
- BOOST_COROUTINES_NO_DEPRECATION_WARNING removed
- Performance benchmark results documented
- Sanitizer test results documented
Acceptance Criteria:
- Build succeeds without Boost.Coroutine
- Full --unittest suite passes
- Memory per coroutine ≤ 10KB (down from 1MB)
- Context switch time ≤ baseline
- ASAN, TSAN, UBSan all clean
8. Standards & Guidelines
8.1 Coroutine Design Standards
Rule 1: All coroutine return types must use RAII for handle lifetime
// GOOD: Handle destroyed in destructor
~CoroTask() {
if (handle_) handle_.destroy();
}
// BAD: Manual destroy calls scattered in code
void cleanup() { handle_.destroy(); } // Easy to forget
Rule 2: Never resume a coroutine from within await_suspend()
// GOOD: Schedule resume on executor
void await_suspend(std::coroutine_handle<> h) {
jq_.addJob(type_, name_, [h]() { h.resume(); });
}
// BAD: Direct resume in await_suspend (blocks caller)
void await_suspend(std::coroutine_handle<> h) {
h.resume(); // Defeats the purpose of suspension
}
Rule 3: Use suspend_always for initial_suspend() (lazy start)
// GOOD: Lazy start — coroutine doesn't run until explicitly resumed
std::suspend_always initial_suspend() { return {}; }
// BAD for our use case: Eager start — runs immediately on creation
std::suspend_never initial_suspend() { return {}; }
Rationale: Matches existing Boost behavior where postCoro() schedules execution, not the constructor.
Rule 4: Always handle unhandled_exception() explicitly
void unhandled_exception() {
exception_ = std::current_exception();
// NEVER: just swallow the exception
// NEVER: std::terminate() without logging
}
Rule 5: Always suspend in final_suspend() (via a FinalAwaiter) to enable continuation
// GOOD: Suspend at end to allow cleanup and value retrieval
auto final_suspend() noexcept {
struct FinalAwaiter {
bool await_ready() noexcept { return false; }
std::coroutine_handle<> await_suspend(
std::coroutine_handle<promise_type> h) noexcept {
if (h.promise().continuation_)
return h.promise().continuation_; // Resume waiter
return std::noop_coroutine();
}
void await_resume() noexcept {}
};
return FinalAwaiter{};
}
Rule 6: Coroutine functions must be clearly marked
// GOOD: Return type makes it obvious this is a coroutine
CoroTask<Json::Value> doRipplePathFind(RPC::JsonContext& context) {
co_await ...;
co_return result;
}
// BAD: Coroutine hidden behind auto or unclear return type
auto doSomething() { co_return; }
8.2 Coding Guidelines
Thread Safety
- Never resume a coroutine concurrently from two threads. Use the same mutex pattern as existing Coro::mutex_ to prevent races.
- await_suspend() is the synchronization point. All state visible before await_suspend() must be visible after await_resume().
- Use std::atomic or mutexes for shared state between coroutine and continuation callback.
Memory Management
- CoroTask<T> owns its coroutine_handle. It is move-only, non-copyable.
- Never store raw coroutine_handle<> in long-lived data structures without clear ownership.
- Prefer shared_ptr<CoroTask<T>> when multiple parties need to observe/wait on a coroutine, mirroring the existing shared_ptr<Coro> pattern.
Error Handling
- Exceptions thrown in coroutine body are captured by promise_type::unhandled_exception() and rethrown in await_resume().
- Never let exceptions escape final_suspend(): it's noexcept.
- Shutdown path: When JobQueue is stopping and addJob() returns false, the awaiter must resume the coroutine with an error (throw or return error state) rather than leaving it suspended forever.
Naming Conventions
| Entity | Convention | Example |
|---|---|---|
| Coroutine return type | CoroTask<T> | CoroTask<void>, CoroTask<Json::Value> |
| Awaiter types | *Awaiter suffix | JobQueueAwaiter, PathFindAwaiter |
| Coroutine functions | Same as regular functions | doRipplePathFind(...) |
| Promise types | Nested promise_type | CoroTask<T>::promise_type |
| JobQueue method | postCoroTask() | jq.postCoroTask(jtCLIENT, "name", fn) |
Code Organization
- Coroutine primitives go in include/xrpl/core/ (header-only where possible)
- Application-specific awaiters go alongside their consumers
- Tests mirror source structure: src/test/core/CoroTask_test.cpp
- No conditional compilation (#ifdef) for old vs new coroutine code; the migration proceeds in clean phases
Documentation
- Each awaiter must document: what it waits for, which thread resumes, and what await_resume() returns.
- Promise type must document: exception handling behavior and suspension points.
- Migration commits must reference this plan in commit messages.
8.3 Branch Strategy
Each milestone is developed on a sub-branch of the main feature branch. This keeps PRs focused and independently reviewable.
develop
└── pratik/Switch-to-std-coroutines (main feature branch)
├── pratik/std-coro/add-coroutine-primitives (CoroTask, CoroTaskRunner, JobQueueAwaiter, postCoroTask)
├── pratik/std-coro/migrate-entry-points (ServerHandler, GRPCServer, RPC::Context)
├── pratik/std-coro/migrate-handlers (doRipplePathFind, PathFindAwaiter, tests)
└── pratik/std-coro/cleanup-boost-coroutine (delete Coro.ipp, remove Boost dep, benchmarks)
Workflow:
- Create sub-branch from pratik/Switch-to-std-coroutines for each milestone
- Develop and test on the sub-branch
- Create PR from sub-branch → pratik/Switch-to-std-coroutines
- After review + merge, start next milestone sub-branch from the updated feature branch
- Final PR from pratik/Switch-to-std-coroutines → develop
Rules:
- Never push directly to the main feature branch — always via sub-branch PR
- Each sub-branch must pass --unittest and sanitizers before PR
- Sub-branch names follow the pattern: pratik/std-coro/<descriptive-action> (e.g., add-coroutine-primitives, migrate-entry-points)
- Milestone PRs must reference this plan document in the description
8.4 Code Review Checklist
For every PR in this migration:
- coroutine_handle::destroy() called exactly once per coroutine
- No concurrent handle.resume() calls possible
- unhandled_exception() stores the exception (doesn't discard it)
- final_suspend() is noexcept
- Awaiter await_suspend() doesn't block (schedules, not runs)
- LocalValues correctly swapped on suspend/resume boundaries
- Shutdown path tested (JobQueue stopping during coroutine execution)
- ASAN clean (no use-after-free on coroutine frame)
- TSAN clean (no data races on resume)
- All existing --unittest tests still pass
9. Task List
Milestone 1: New Coroutine Primitives
- 1.1 Design CoroTask<T> class with promise_type
  - Define promise_type with initial_suspend, final_suspend, unhandled_exception, return_value/return_void
  - Implement FinalAwaiter for continuation support
  - Implement move-only RAII handle wrapper
  - Support both CoroTask<T> and CoroTask<void>
- 1.2 Design and implement JobQueueAwaiter
  - await_suspend() calls jq_.addJob(type, name, [h]{ h.resume(); })
  - Handle addJob() failure (shutdown): resume with error flag or throw
  - Integrate nSuspend_ counter increment/decrement
- 1.3 Implement LocalValues swap in new coroutine resume path
  - Before handle.resume(): save thread-local, install coroutine-local
  - After handle.resume() returns: restore thread-local
  - Ensure this works when coroutine migrates between threads
- 1.4 Add postCoroTask() template to JobQueue
  - Accept callable returning CoroTask<void>
  - Schedule initial execution on JobQueue (mirror postCoro() behavior)
  - Return a handle/shared_ptr for join/cancel
- 1.5 Write unit tests (src/test/core/CoroTask_test.cpp)
  - Test CoroTask<void> runs to completion
  - Test CoroTask<int> returns value
  - Test exception propagation across co_await
  - Test coroutine destruction before completion
  - Test JobQueueAwaiter schedules on correct thread
  - Test LocalValue isolation across 4+ coroutines
  - Test shutdown rejection (addJob returns false)
  - Test correct_order equivalent (yield → join → post → complete)
  - Test incorrect_order equivalent (post → yield → complete)
  - Test multiple sequential co_await points
- 1.6 Verify build on GCC 12+, Clang 16+
- 1.7 Run ASAN + TSAN on new tests
- 1.8 Run full --unittest suite (no regressions)
- 1.9 Self-review and create PR #1
Milestone 2: Entry Point Migration
- 2.1 Migrate ServerHandler::onRequest() (ServerHandler.cpp:287)
  - Replace m_jobQueue.postCoro(jtCLIENT_RPC, ...) with postCoroTask()
  - Update lambda to return CoroTask<void> (add co_return)
  - Update processSession to accept new coroutine type
- 2.2 Migrate ServerHandler::onWSMessage() (ServerHandler.cpp:325)
  - Replace m_jobQueue.postCoro(jtCLIENT_WEBSOCKET, ...) with postCoroTask()
  - Update lambda signature
- 2.3 Migrate GRPCServer::CallData::process() (GRPCServer.cpp:102)
  - Replace app_.getJobQueue().postCoro(JobType::jtRPC, ...) with postCoroTask()
  - Update process(shared_ptr<Coro> coro) overload signature
- 2.4 Update RPC::Context (Context.h:27)
  - Replace std::shared_ptr<JobQueue::Coro> coro{} with new coroutine wrapper type
  - Ensure all code that accesses context.coro compiles
- 2.5 Update ServerHandler.h signatures
  - processSession() and processRequest() parameter types
- 2.6 Update GRPCServer.h signatures
  - process() method parameter types
- 2.7 Run full --unittest suite
- 2.8 Manual smoke test: HTTP + WS + gRPC RPC requests
- 2.9 Run ASAN + TSAN
- 2.10 Self-review and create PR #2
Milestone 3: Handler Migration
- 3.1 Migrate doRipplePathFind() (RipplePathFind.cpp)
  - Replace context.coro->yield() with co_await PathFindAwaiter{...}
  - Replace continuation lambda's coro->post() / coro->resume() with awaiter scheduling
  - Handle shutdown case (post failure) in awaiter
- 3.2 Create PathFindAwaiter (or use generic JobQueueAwaiter)
  - Encapsulate the continuation + yield pattern from RipplePathFind.cpp lines 108-132
- 3.3 Update Path_test.cpp
  - Replace postCoro usage with postCoroTask
  - Ensure context.coro usage matches new type
- 3.4 Update AMMTest.cpp
  - Replace postCoro usage with postCoroTask
- 3.5 Rewrite Coroutine_test.cpp for new API
  - correct_order: postCoroTask → co_await → join → resume → complete
  - incorrect_order: post before yield equivalent
  - thread_specific_storage: 4 coroutines with LocalValue isolation
- 3.6 Update JobQueue_test.cpp testPostCoro
  - Migrate to postCoroTask API
- 3.7 Verify ripple_path_find works end-to-end with new coroutines
- 3.8 Test shutdown-during-pathfind scenario
- 3.9 Run full --unittest suite
- 3.10 Run ASAN + TSAN
- 3.11 Self-review and create PR #3
Milestone 4: Cleanup & Validation
- 4.1 Delete include/xrpl/core/Coro.ipp
- 4.2 Remove from JobQueue.h:
  - #include <boost/coroutine/all.hpp>
  - struct Coro_create_t
  - class Coro (entire class)
  - postCoro() template
  - Comment block (lines 322-377) describing old race condition
- 4.3 Update cmake/deps/Boost.cmake:
  - Remove coroutine from find_package(Boost REQUIRED COMPONENTS ...)
  - Remove Boost::coroutine from target_link_libraries
- 4.4 Update cmake/XrplInterface.cmake:
  - Remove BOOST_COROUTINES_NO_DEPRECATION_WARNING
- 4.5 Run memory benchmark
  - Create N=1000 coroutines, compare RSS: before vs after
  - Document results
- 4.6 Run context switch benchmark
  - 100K yield/resume cycles, compare latency: before vs after
  - Document results
- 4.7 Run RPC throughput benchmark
  - Concurrent ripple_path_find requests, compare throughput
  - Document results
- 4.8 Run full --unittest suite
- 4.9 Run ASAN, TSAN, UBSan
  - Confirm __asan_handle_no_return warnings are gone
- 4.10 Verify build on all supported compilers
- 4.11 Self-review and create PR #4
- 4.12 Document final benchmark results in PR description
Appendix A: File Inventory
Complete list of files that reference coroutines (for audit tracking):
| # | File | Must Change | Phase |
|---|---|---|---|
| 1 | include/xrpl/core/JobQueue.h | Yes | 1 (add), 4 (remove old) |
| 2 | include/xrpl/core/Coro.ipp | Yes | 4 (delete) |
| 3 | include/xrpl/basics/LocalValue.h | Maybe | 1 (if integration changes) |
| 4 | cmake/deps/Boost.cmake | Yes | 4 |
| 5 | cmake/XrplInterface.cmake | Yes | 4 |
| 6 | src/xrpld/rpc/Context.h | Yes | 2 |
| 7 | src/xrpld/rpc/detail/ServerHandler.cpp | Yes | 2 |
| 8 | src/xrpld/rpc/ServerHandler.h | Yes | 2 |
| 9 | src/xrpld/app/main/GRPCServer.cpp | Yes | 2 |
| 10 | src/xrpld/app/main/GRPCServer.h | Yes | 2 |
| 11 | src/xrpld/rpc/handlers/RipplePathFind.cpp | Yes | 3 |
| 12 | src/test/core/Coroutine_test.cpp | Yes | 3 |
| 13 | src/test/core/JobQueue_test.cpp | Yes | 3 |
| 14 | src/test/app/Path_test.cpp | Yes | 3 |
| 15 | src/test/jtx/impl/AMMTest.cpp | Yes | 3 |
| 16 | src/xrpld/rpc/README.md | Yes | 4 (update docs) |
Appendix B: New Files to Create
| # | File | Phase | Purpose |
|---|---|---|---|
| 1 | include/xrpl/core/CoroTask.h | 1 | CoroTask<T> return type + promise_type |
| 2 | include/xrpl/core/JobQueueAwaiter.h | 1 | Awaiter for scheduling on JobQueue |
| 3 | src/test/core/CoroTask_test.cpp | 1 | Unit tests for new primitives |