# Boost.Coroutine to C++20 Standard Coroutines Migration Plan
**Project**: rippled (XRP Ledger node)
**Branch**: `pratik/Switch-to-std-coroutines`
**Date**: 2026-02-25
**Status**: Planning
---
## Table of Contents
1. [Research & Analysis](#1-research--analysis)
2. [Current State Assessment](#2-current-state-assessment)
3. [Migration Strategy](#3-migration-strategy)
4. [Implementation Plan](#4-implementation-plan)
5. [Testing & Validation Strategy](#5-testing--validation-strategy)
6. [Risks & Mitigation](#6-risks--mitigation)
7. [Timeline & Milestones](#7-timeline--milestones)
8. [Standards & Guidelines](#8-standards--guidelines)
9. [Task List](#9-task-list)
---
## 1. Research & Analysis
### 1.1 Stackful (Boost.Coroutine) vs Stackless (C++20) Architecture
```mermaid
graph TD
subgraph Boost["Boost.Coroutine2 (Stackful)"]
direction TB
B1["Coroutine Created"]
B2["1 MB Stack Allocated"]
B3["Full Call Stack Available"]
B4["yield() from ANY
nesting depth"]
B5["Context Switch:
save/restore registers
+ stack pointer
~40-100 CPU cycles"]
B1 --> B2 --> B3 --> B4 --> B5
end
subgraph Std["C++20 Coroutines (Stackless)"]
direction TB
S1["Coroutine Created"]
S2["200-500 B Frame on Heap"]
S3["No Dedicated Stack"]
S4["co_await ONLY at
explicit suspension points"]
S5["Context Switch:
resume via function call
symmetric transfer / tail-call
~20-50 CPU cycles"]
S1 --> S2 --> S3 --> S4 --> S5
end
```
**Boost (right)**: Each coroutine gets a full 1 MB stack. Suspension saves the
entire register set and stack pointer, so `yield()` can be called from any
nesting depth — the whole call chain is preserved. The cost is high per-coroutine
memory and a heavier context switch (~40-100 cycles for `fcontext` save/restore).
**C++20 (left)**: The compiler allocates a small heap frame (200-500 bytes)
holding only the local variables that live across suspension points. There is no
dedicated stack — suspension is only allowed at explicit `co_await` expressions
in the immediate coroutine function. Resumption is a normal function call
(symmetric transfer makes it a tail-call), costing ~20-50 cycles. The trade-off
is that nested functions that need to suspend must themselves be coroutines.
### 1.2 API & Programming Model Comparison
| Aspect | Boost.Coroutine2 (Current) | C++20 Coroutines (Target) |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Type** | **Stackful, asymmetric.** Each coroutine carries its own call stack, and control transfers between a parent (caller) and a child (coroutine) — never between two siblings directly. | **Stackless, asymmetric.** The compiler transforms the coroutine function into a state machine allocated on the heap. The same parent/child asymmetry applies, but there is no separate stack. |
| **Stack Model** | **Dedicated 1 MB stack per coroutine.** Allocated at construction via `boost::context::fixedsize_stack`. The full stack is reserved even if the coroutine only uses a few hundred bytes, leading to high memory overhead under concurrency. | **Heap frame of ~200-500 bytes.** The compiler allocates only the local variables that live across suspension points into a coroutine frame on the heap. The frame may be elided entirely if the compiler can prove the coroutine's lifetime is bounded by its caller. |
| **Suspension** | **`(*yield_)()` — can yield from any call depth.** Because the coroutine has its own stack, a call chain `fn_a() → fn_b() → yield()` suspends the entire stack. The `yield_` pointer is a `push_type*` provided by Boost. | **`co_await expr` — only at explicit suspension points.** Suspension is only possible in the immediate coroutine function body. If a nested regular function needs to suspend, it must itself be refactored into a coroutine returning an awaitable. |
| **Resumption** | **`coro_()` — resumes from last yield.** Calling the `pull_type` object switches back to the coroutine's stack and continues execution right after the last `yield()` call. | **`handle.resume()` — resumes from last co_await.** The `std::coroutine_handle<>` is a lightweight pointer to the coroutine frame. Calling `.resume()` jumps to the suspension point via a function-call dispatch (no stack switch). |
| **Creation** | **`pull_type` constructor auto-starts the coroutine.** When a `pull_type` is constructed, it immediately transfers control into the coroutine body, which runs until its first `yield()`. The caller must account for this eager start. | **Calling a coroutine function returns a suspended handle.** The function body does NOT execute until `handle.resume()` is called (when `initial_suspend()` returns `suspend_always`). This lazy-start model gives the caller full control over when execution begins. |
| **Completion Check** | **`static_cast(coro_)` returns false when done.** The `pull_type` is contextually convertible to `bool`; it becomes false after the coroutine body returns. | **`handle.done()` returns true when done.** A direct query on the coroutine handle. Calling `resume()` after `done()` is true is undefined behavior. |
| **Value Passing** | **Typed via `pull_type` / `push_type`.** Values are exchanged through the coroutine's type parameter — `pull_type` pulls values out, `push_type` pushes values in. rippled uses `` (no values exchanged). | **Via `promise_type::return_value(T)` or `co_return`.** Values are stored in the promise object inside the coroutine frame. The caller retrieves them through `await_resume()`. For void coroutines, `return_void()` is used instead. |
| **Exception Handling** | **Natural stack-based propagation.** An exception thrown inside the coroutine unwinds its stack normally and propagates to the caller at the `pull_type` call site (i.e., whoever called `coro_()`). | **Explicit capture via `promise_type::unhandled_exception()`.** Exceptions thrown in the coroutine body are caught by the promise and stored (typically as `std::exception_ptr`). They are rethrown in `await_resume()` when the caller co_awaits the result. |
| **Cancellation** | **Application-managed (poll a flag).** There is no built-in cancellation. rippled uses `expectEarlyExit()` to mark a coroutine as abandoned during shutdown, then decrements `nSuspend_` so `JobQueue::stop()` can proceed. | **Via `await_ready()` or cancellation tokens.** An awaiter can check a cancellation flag in `await_ready()` and return true to skip suspension. Alternatively, `std::stop_token` patterns (C++20) can be threaded through. Our `JobQueueAwaiter` returns false from `await_suspend()` when the JobQueue is stopping, effectively cancelling the suspend. |
| **Keywords** | **None (library-only).** All coroutine machinery is expressed through library types (`pull_type`, `push_type`) and regular function calls. No special language syntax required. | **`co_await`, `co_yield`, `co_return`.** The presence of any of these keywords in a function body makes it a coroutine. The compiler generates the state machine, frame allocation, and suspension/resumption code automatically. |
| **Standard** | **Boost library (not ISO C++).** `Boost.Coroutine` is deprecated in favor of `Boost.Coroutine2`, which itself has no active development. Depends on `Boost.Context` for platform-specific assembly-level stack switching. | **ISO C++20 standard.** Part of the language specification. Supported by all major compilers (GCC 11+, Clang 14+, MSVC 19.28+). Tooling, debugger support, and static analysis are steadily improving across the ecosystem. |
### 1.3 Performance Characteristics
| Metric | Boost.Coroutine2 | C++20 Coroutines |
| ------------------------------ | ------------------------------------------ | ------------------------------------ |
| **Memory per coroutine** | ~1MB (fixed stack) | ~200-500 bytes (frame only) |
| **1000 concurrent coroutines** | ~1 GB | ~0.5 MB |
| **Context switch cost** | ~40-100 CPU cycles (fcontext save/restore) | ~20-50 CPU cycles (function call) |
| **Allocation** | Stack allocated at creation | Heap allocation (compiler may elide) |
| **Cache behavior** | Poor (large stack rarely fully used) | Good (small frame, hot data close) |
| **Compiler optimization** | Opaque to compiler | Inlinable, optimizable |
### 1.4 Feature Parity Analysis
#### Suspension Points
- **Boost**: Can yield from any nesting level — `fn_a()` calls `fn_b()` calls `yield()`. The entire call stack is preserved.
- **C++20**: Suspension only at `co_await` expressions in the immediate coroutine function. Nested functions that need to suspend must themselves be coroutines returning awaitables.
- **Impact**: (Assumption, needs confirmation from people who know the code better) Rippled's usage is **shallow** — `yield()` is called directly from the RPC handler lambda, never from deeply nested code. This makes migration straightforward.
**Boost** — yield from coroutine body, resume later via `post()`:
```cpp
jq.postCoro(jtCLIENT, "Handler", [&](auto const& coro) {
auto result = doFirstHalf();
coro->yield(); // suspend — entire stack preserved
// resumes here when coro->post() is called externally
doSecondHalf(result);
});
```
**C++20** — `co_await` suspends, `JobQueueAwaiter` combines yield + auto-repost:
```cpp
jq.postCoroTask(jtCLIENT, "Handler", [&](auto runner) -> CoroTask {
auto result = doFirstHalf();
co_await JobQueueAwaiter{runner}; // suspend + auto-repost
// resumes on a worker thread when the job is picked up
doSecondHalf(result);
co_return;
});
```
**Key difference** — Boost can yield from nested calls; C++20 cannot:
```cpp
// Boost — works: yield from inside a helper function
void helper(std::shared_ptr coro) {
coro->yield(); // OK — stackful, entire call stack is preserved
}
jq.postCoro(jtCLIENT, "Deep", [](auto coro) { helper(coro); });
// C++20 — does NOT work: regular functions cannot co_await
void helper(std::shared_ptr runner) {
co_await runner->suspend(); // COMPILE ERROR — not a coroutine
}
// FIX: helper must itself be a coroutine returning CoroTask
CoroTask helper(std::shared_ptr runner) {
co_await runner->suspend(); // OK — this is a coroutine
co_return;
}
```
#### Exception Handling
- **Boost**: Exceptions propagate naturally up the call stack across yield points.
- **C++20**: Exceptions in coroutine body are caught by `promise_type::unhandled_exception()`. Must be explicitly stored and rethrown.
- **Impact**: Need to implement `unhandled_exception()` in promise type. Pattern is well-established.
**Boost** — exceptions propagate naturally through `yield()`:
```cpp
jq.postCoro(jtCLIENT, "Risky", [](auto coro) {
coro->yield();
throw std::runtime_error("oops");
// Exception propagates up the coroutine stack naturally.
// The Coro::resume() caller sees it when the coroutine unwinds.
});
```
**C++20** — exceptions are captured by `promise_type` and rethrown on `co_await`:
```cpp
// Inner coroutine throws
CoroTask failingOp() {
throw std::runtime_error("oops");
co_return 0; // never reached
}
// Outer coroutine catches — exception crosses coroutine boundary via promise
jq.postCoroTask(jtCLIENT, "Caller", [](auto runner) -> CoroTask {
try {
int v = co_await failingOp(); // rethrows here
} catch (std::runtime_error const& e) {
// e.what() == "oops" — caught across coroutine boundary
}
co_return;
});
```
**Key difference** — C++20 requires explicit plumbing, but it's already wired up:
```cpp
// Inside CoroTask::promise_type (already implemented):
void unhandled_exception() {
exception_ = std::current_exception(); // capture
}
// Inside CoroTask::await_resume() (already implemented):
void await_resume() {
if (auto& ep = handle_.promise().exception_)
std::rethrow_exception(ep); // rethrow to caller
}
```
#### Cancellation
- **Boost**: rippled uses `expectEarlyExit()` for graceful shutdown — not a general cancellation mechanism.
- **C++20**: Can check cancellation in `await_ready()` before suspension, or via `stop_token` patterns.
- **Impact**: C++20 provides strictly better cancellation support.
**Boost** — `expectEarlyExit()` for cleanup when coroutine never ran:
```cpp
auto coro = std::make_shared(create, jq, t, name, fn);
if (!coro->post()) {
// JobQueue is stopping — coroutine will never run.
// Must manually decrement nSuspend_ so shutdown doesn't hang.
coro->expectEarlyExit();
coro.reset();
}
```
No cooperative in-body cancellation — coroutine just runs to completion or gets abandoned.
**C++20** — `expectEarlyExit()` for the same case, plus cooperative in-body checking:
```cpp
// Same early-exit pattern when post() fails:
auto runner = CoroTaskRunner::create(jq, t, name);
runner->init(fn);
++nSuspend_;
if (!runner->post()) {
runner->expectEarlyExit(); // decrements nSuspend_, destroys frame
runner.reset();
}
// Cooperative cancellation — coroutine checks jq.isStopping() after each yield:
jq.postCoroTask(jtCLIENT, "Long", [jqp = &jq](auto runner) -> CoroTask {
while (hasWork()) {
co_await JobQueueAwaiter{runner};
if (jqp->isStopping())
co_return; // graceful exit
doNextChunk();
}
co_return;
});
```
**C++20 bonus** — `JobQueueAwaiter::await_suspend()` handles shutdown automatically:
```cpp
bool await_suspend(std::coroutine_handle<>) {
runner->onSuspend();
if (!runner->post()) {
// JQ stopping — undo suspend, return false so coroutine
// continues immediately (can fall through to co_return)
runner->onUndoSuspend();
return false;
}
return true; // actually suspend
}
```
### 1.5 Compiler Support
| Compiler | rippled Minimum | C++20 Coroutine Support | Status |
| --------- | --------------- | ------------------------ | ------ |
| **GCC** | 12.0+ | Full (since GCC 11) | Ready |
| **Clang** | 16.0+ | Full (since Clang 14) | Ready |
| **MSVC** | 19.28+ | Full (since VS2019 16.8) | Ready |
rippled already requires C++20 (`CMAKE_CXX_STANDARD 20` in `CMakeLists.txt`). All supported compilers have mature C++20 coroutine support. **No compiler upgrades required.**
### 1.6 Viability Analysis — Addressing Stackless Concerns
C++20 stackless coroutines have well-known limitations compared to stackful coroutines. This section analyzes each concern against rippled's **actual codebase** to determine viability.
#### Concern 1: Cannot Suspend from Nested Call Stacks
**Claim**: Stackless coroutines cannot yield from arbitrary stack depths. If `fn_a()` calls `fn_b()` calls `yield()`, only stackful coroutines can suspend the entire chain.
**Analysis**: An exhaustive codebase audit found:
- **1 production yield() call**: `RipplePathFind.cpp:131` — directly in the handler function body
- **All test yield() calls**: directly in `postCoro` lambda bodies (Coroutine_test.cpp, JobQueue_test.cpp)
- **The `push_type*` architecture** makes deep-nested yield() structurally impossible — the `yield_` pointer is only available inside the `postCoro` lambda via the `shared_ptr`, and handlers call `context.coro->yield()` at the top level
```mermaid
graph LR
subgraph Stackful["Stackful (Boost) — can yield anywhere"]
direction TB
A1["postCoro lambda"] --> A2["handlerFn()"]
A2 --> A3["helperFn()"]
A3 --> A4["coro→yield() ✅"]
end
subgraph Stackless["Stackless (C++20) — co_await at top only"]
direction TB
B1["postCoroTask lambda"] --> B2["co_await ✅"]
B1 --> B3["regularFn()"]
B3 -.-> B4["co_await ❌"]
end
subgraph Rippled["rippled actual usage — all shallow"]
direction TB
C1["postCoro lambda"] --> C2["context.coro→yield()
(direct, no nesting)"]
end
style A4 fill:#f96,stroke:#333,color:#000
style B4 fill:#f66,stroke:#333,color:#fff
style C2 fill:#3d8,stroke:#333,color:#000
```
**Verdict**: This concern does NOT apply. All suspension is shallow.
#### Concern 2: Colored Function Problem (Viral co_await)
**Claim**: Once a function needs to suspend, every caller up the chain must also be a coroutine. This "infects" the call chain.
**Analysis**: In rippled's case, the coloring is minimal:
- `postCoroTask()` launches a coroutine — this is the "root" colored function
- The `postCoro` lambda itself becomes the coroutine function (returns `CoroTask`)
- `doRipplePathFind()` is the only handler that calls `co_await`
- No other handler in the chain needs to become a coroutine — they continue to be regular functions dispatched through `doCommand()`
The "coloring" stops at the entry point lambda and the one handler that suspends. No deep infection.
```mermaid
graph TD
subgraph Feared["Feared: deep coloring infection"]
direction TB
F1["main()"] -->|"must become
coroutine"| F2["Server::run()"]
F2 -->|"must become
coroutine"| F3["dispatch()"]
F3 -->|"must become
coroutine"| F4["doCommand()"]
F4 -->|"must become
coroutine"| F5["handler()"]
F5 --> F6["co_await"]
style F1 fill:#f66,stroke:#333,color:#fff
style F2 fill:#f66,stroke:#333,color:#fff
style F3 fill:#f66,stroke:#333,color:#fff
style F4 fill:#f66,stroke:#333,color:#fff
style F5 fill:#f66,stroke:#333,color:#fff
end
subgraph Actual["Actual: coloring stops at entry point"]
direction TB
A1["main()"] --- A2["Server::run()"]
A2 --- A3["dispatch()"]
A3 --- A4["doCommand()"]
A4 -->|"only this
is a coroutine"| A5["postCoroTask lambda
→ CoroTask<void>"]
A5 --> A6["co_await"]
style A1 fill:#eee,stroke:#333,color:#000
style A2 fill:#eee,stroke:#333,color:#000
style A3 fill:#eee,stroke:#333,color:#000
style A4 fill:#eee,stroke:#333,color:#000
style A5 fill:#f96,stroke:#333,color:#000
style A6 fill:#3d8,stroke:#333,color:#000
end
```
**Verdict**: Minimal impact. Only 4 lambdas (3 entry points + 1 handler) need `co_await`.
#### Concern 3: No Standard Library Support for Common Patterns
**Claim**: C++20 provides the language primitives but no standard task type, executor integration, or composition utilities.
**Analysis**: This is accurate — we need to write custom types:
- `CoroTask` (task/return type) — well-established pattern, ~80 lines
- `JobQueueAwaiter` (executor integration) — ~20 lines
- `FinalAwaiter` (continuation chaining) — ~10 lines
However, these types are small, well-understood, and have extensive reference implementations (cppcoro, folly::coro, libunifex). The total boilerplate is approximately 150-200 lines of header code.
```mermaid
graph TD
subgraph StdLib["C++20 standard provides"]
direction LR
S1["coroutine_handle<P>"]
S2["suspend_always /
suspend_never"]
S3["noop_coroutine()"]
S4["co_await / co_return"]
end
subgraph Custom["Custom types we wrote (~150 lines total)"]
direction TB
C1["CoroTask<T>
~80 lines
(task + promise_type)"]
C2["JobQueueAwaiter
~20 lines
(suspend + repost)"]
C3["FinalAwaiter
~10 lines
(symmetric transfer)"]
C4["CoroTaskRunner
~40 lines decl
(lifecycle manager)"]
end
subgraph Ref["Reference implementations"]
direction LR
R1["cppcoro"]
R2["folly::coro"]
R3["libunifex"]
end
S1 --> C1
S2 --> C1
S3 --> C3
S4 --> C2
Ref -.->|"patterns
borrowed from"| Custom
```
**Verdict**: Manageable. Custom types are small and well-documented in C++ community.
#### Concern 4: Stack Overflow from Synchronous Resumption Chains
**Claim**: If coroutine A `co_await`s coroutine B, and B completes synchronously, B's `final_suspend` resumes A on the same stack, potentially building up unbounded stack depth.
**Analysis**: This is addressed by **symmetric transfer** via `FinalAwaiter::await_suspend()` returning a `coroutine_handle<>` instead of `void`. The compiler transforms this into a tail-call, preventing stack growth. This is the standard solution used by all major coroutine libraries and is implemented in our `FinalAwaiter` design (Section 4.1).
```mermaid
graph TD
subgraph Problem["Without symmetric transfer — stack grows"]
direction TB
P1["A resumes B"] --> P2["B::resume()
stack frame +1"]
P2 --> P3["B completes
final_suspend resumes A"]
P3 --> P4["A::resume()
stack frame +2"]
P4 --> P5["A resumes C"]
P5 --> P6["C::resume()
stack frame +3"]
P6 --> P7["... stack overflow ❌"]
style P7 fill:#f66,stroke:#333,color:#fff
end
subgraph Solution["With symmetric transfer — tail call, no growth"]
direction TB
S1["A resumes B"] --> S2["B::resume()
stack frame 1"]
S2 --> S3["B completes"]
S3 -->|"FinalAwaiter returns
handle → tail call"| S4["A::resume()
stack frame 1 (reused)"]
S4 --> S5["A resumes C"]
S5 -->|"tail call"| S6["C::resume()
stack frame 1 (reused)"]
S6 --> S7["... bounded ✅"]
style S7 fill:#3d8,stroke:#333,color:#000
end
```
**Verdict**: Solved by symmetric transfer (already in our design).
#### Concern 5: Dangling Reference Risk
**Claim**: Coroutine frames are heap-allocated and outlive the calling scope, making references to locals dangerous.
**Analysis**: This is a real concern that requires engineering discipline:
- Coroutine parameters are copied into the frame (safe by default)
- References passed to coroutine functions can dangle if the referent's scope ends before the coroutine completes
- Our design mitigates this: `RPC::Context` is passed by reference but its lifetime is managed by `shared_ptr` / the entry point lambda's scope, which outlives the coroutine
```mermaid
graph TD
subgraph Danger["Dangling reference — ❌ use-after-free"]
direction TB
D1["caller() scope"] --> D2["int local = 42"]
D2 --> D3["launch coroutine
with &local"]
D3 --> D4["caller returns
local destroyed"]
D4 --> D5["coroutine resumes
reads &local 💥"]
style D5 fill:#f66,stroke:#333,color:#fff
end
subgraph Safe["rippled pattern — ✅ lifetime managed"]
direction TB
S1["postCoroTask()"] --> S2["shared_ptr<Runner>
owns coroutine frame"]
S2 --> S3["lambda captures
by value or
shared_ptr"]
S3 --> S4["FuncStore keeps
lambda alive on heap"]
S4 --> S5["coroutine resumes
captures still valid ✅"]
style S5 fill:#3d8,stroke:#333,color:#000
end
```
**Verdict**: Real risk, but manageable with RAII patterns and ASAN testing.
#### Concern 6: yield_to.h / boost::asio::spawn
**Claim**: `yield_to.h:111` uses `boost::asio::spawn`, suggesting broader coroutine usage.
**Analysis**: `yield_to.h` uses `boost::asio::spawn` with `boost::context::fixedsize_stack(2 * 1024 * 1024)` — this is a **completely separate** coroutine system:
- Different type: `boost::asio::yield_context` (not `push_type*`)
- Different purpose: test infrastructure for async I/O tests
- Different mechanism: Boost.Asio stackful coroutines (not Boost.Coroutine2)
- **Not part of this migration scope** — used only in tests and unrelated to `JobQueue::Coro`
```mermaid
graph TD
subgraph ThisMigration["This migration (JobQueue::Coro)"]
direction TB
M1["Boost.Coroutine2
push_type / pull_type"] -->|"replace with"| M2["C++20 coroutines
CoroTask + co_await"]
M3["JobQueue::Coro"] -->|"replace with"| M4["CoroTaskRunner"]
M5["coro→yield() + post()"] -->|"replace with"| M6["JobQueueAwaiter"]
end
subgraph OutOfScope["Out of scope (yield_to.h)"]
direction TB
O1["boost::asio::spawn"]
O2["yield_context"]
O3["fixedsize_stack(2MB)"]
O1 --- O2
O1 --- O3
end
ThisMigration ~~~ OutOfScope
style OutOfScope fill:#eee,stroke:#999
```
**Verdict**: Separate system. Out of scope for this migration.
#### Overall Viability Conclusion
The migration IS viable because:
1. rippled's coroutine usage is **shallow** (no deep-nested yield)
2. The **colored function infection** is limited to 4 call sites
3. Custom types are **small and well-understood**
4. **Symmetric transfer** solves the stack overflow concern
5. **ASAN/TSAN** testing catches lifetime and race bugs
6. The alternative (ASAN annotations for Boost.Context) only addresses sanitizer false positives — it does not provide memory savings, standard compliance, or the dependency elimination that C++20 migration delivers
### 1.7 Merits & Demerits Summary
#### Merits of C++20 Migration
1. **2000x memory reduction** per coroutine (1MB → ~500 bytes)
2. **Faster context switching** (~2x improvement)
3. **Remove external dependency** on Boost.Coroutine (and transitively Boost.Context)
4. **Language-native** — better tooling, debugger support, static analysis
5. **Future-proof** — ISO standard, not a deprecated library
6. **Compiler-optimizable** — suspension points can be inlined/elided
7. **ASAN compatibility** — eliminates Boost context-switching false positives (see `docs/build/sanitizers.md`)
#### Demerits / Challenges
1. **Stackless limitation** — cannot yield from nested calls (verified: not an issue for rippled's shallow usage)
2. **Explicit lifetime management** — `coroutine_handle::destroy()` must be called (mitigated by RAII CoroTask)
3. **Verbose boilerplate** — promise_type, awaiter interfaces (~150-200 lines of infrastructure code)
4. **Debugging** — no visible coroutine stack in debugger (improving with tooling)
5. **Learning curve** — team needs familiarity with C++20 coroutine machinery
6. **Dangling reference risk** — coroutine frames outlive calling scope (mitigated by ASAN + careful design)
7. **No standard library task type** — must write custom CoroTask, awaiters (well-established patterns exist)
#### Alternative Considered: ASAN Annotations Only
Instead of full migration, one could keep Boost.Coroutine and add `__sanitizer_start_switch_fiber` / `__sanitizer_finish_switch_fiber` annotations to Coro.ipp to suppress ASAN false positives. This was evaluated and rejected because:
- It only fixes sanitizer false positives — does NOT reduce 1MB/coroutine memory usage
- Does NOT remove the deprecated Boost.Coroutine dependency
- Does NOT provide standard compliance or future-proofing
- The full migration is feasible given shallow yield usage and delivers all the above benefits
---
## 2. Current State Assessment
### 2.1 Architecture Overview
```mermaid
graph TD
subgraph "Request Entry Points"
HTTP["HTTP Request
ServerHandler::onRequest()"]
WS["WebSocket Message
ServerHandler::onWSMessage()"]
GRPC["gRPC Request
CallData::process()"]
end
subgraph "Coroutine Layer"
POST["JobQueue::postCoro()
Creates Coro + schedules job"]
CORO["JobQueue::Coro
boost::coroutines::pull_type
1MB stack per instance"]
end
subgraph "JobQueue Thread Pool"
W1["Worker Thread 1"]
W2["Worker Thread 2"]
WN["Worker Thread N"]
end
subgraph "RPC Handlers"
CTX["RPC::Context
holds shared_ptr#lt;Coro#gt;"]
RPC["RPC Handler
e.g. doRipplePathFind"]
YIELD["coro.yield()
Suspends execution"]
RESUME["coro.post()
Reschedules on JobQueue"]
end
HTTP --> POST
WS --> POST
GRPC --> POST
POST --> CORO
CORO --> W1
CORO --> W2
CORO --> WN
W1 --> CTX
W2 --> CTX
CTX --> RPC
RPC --> YIELD
YIELD -.->|"event completes"| RESUME
RESUME --> W1
```
### 2.2 `JobQueue::Coro` Implementation Audit
**File**: `include/xrpl/core/JobQueue.h` (lines 40-120) + `include/xrpl/core/Coro.ipp`
#### Class Members
```cpp
class Coro : public std::enable_shared_from_this {
detail::LocalValues lvs_; // Per-coroutine thread-local storage
JobQueue& jq_; // Parent JobQueue reference
JobType type_; // Job type (jtCLIENT_RPC, etc.)
std::string name_; // Name for logging
bool running_; // Is currently executing
std::mutex mutex_; // Prevents concurrent resume
std::mutex mutex_run_; // Guards running_ flag
std::condition_variable cv_; // For join() blocking
boost::coroutines::asymmetric_coroutine::pull_type coro_; // THE BOOST COROUTINE
boost::coroutines::asymmetric_coroutine::push_type* yield_; // Yield function pointer
bool finished_; // Debug assertion flag
};
```
#### Boost.Coroutine APIs Used
| API | Location | Purpose |
| --------------------------------------------- | --------------- | --------------------------- |
| `asymmetric_coroutine::pull_type` | `JobQueue.h:51` | The coroutine object itself |
| `asymmetric_coroutine::push_type` | `JobQueue.h:52` | Yield function type |
| `boost::coroutines::attributes(megabytes(1))` | `Coro.ipp:23` | Stack size configuration |
| `#include ` | `JobQueue.h:10` | Header inclusion |
#### Method Behaviors
| Method | Behavior |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **Constructor** | Creates `pull_type` with 1MB stack. Lambda captures user function. Auto-runs to first `yield()`. |
| **`yield()`** | Increments `jq_.nSuspend_`, calls `(*yield_)()` to suspend. Returns control to caller. |
| **`post()`** | Sets `running_=true`, calls `jq_.addJob()` with a lambda that calls `resume()`. Returns false if JobQueue is stopping. |
| **`resume()`** | Swaps `LocalValues`, acquires `mutex_`, calls `coro_()` to resume. Restores `LocalValues`. Sets `running_=false`, notifies `cv_`. |
| **`runnable()`** | Returns `static_cast(coro_)` — true if coroutine hasn't returned. |
| **`expectEarlyExit()`** | Decrements `nSuspend_`, sets `finished_=true`. Used during shutdown. |
| **`join()`** | Blocks on `cv_` until `running_==false`. |
### 2.3 Coroutine Execution Lifecycle
```mermaid
sequenceDiagram
participant HT as Handler Thread
participant JQ as JobQueue
participant WT as Worker Thread
participant C as Coro
participant UF as User Function
HT->>JQ: postCoro(type, name, fn)
JQ->>C: Coro::Coro() constructor
Note over C: pull_type auto-starts lambda
C->>C: yield_ = #amp;do_yield
C->>C: yield() [initial suspension]
C-->>JQ: Returns to constructor
JQ->>JQ: coro->post()
JQ->>JQ: addJob(type, name, resume_lambda)
JQ-->>HT: Returns shared_ptr#lt;Coro#gt;
Note over HT: Handler thread is FREE
WT->>C: resume() [job executes]
Note over C: Swap LocalValues
C->>C: coro_() [resume boost coroutine]
C->>UF: fn(shared_from_this())
UF->>UF: Do work...
UF->>C: coro->yield() [suspend]
Note over C: ++nSuspend_, invoke yield_()
C-->>WT: Returns from resume()
Note over WT: Worker thread is FREE
Note over UF: External event completes
UF->>C: coro->post() [reschedule]
C->>JQ: addJob(resume_lambda)
WT->>C: resume() [job executes]
C->>C: coro_() [resume]
C->>UF: Continues after yield()
UF->>UF: Finish work
UF-->>C: Return [coroutine complete]
Note over C: running_=false, cv_.notify_all()
```
### 2.4 All Coroutine Touchpoints
#### Core Infrastructure (Must Change)
| File | Role | Lines of Interest |
| ---------------------------------- | ---------------------------------------- | ------------------------- |
| `include/xrpl/core/JobQueue.h` | Coro class definition, postCoro template | Lines 10, 40-120, 385-402 |
| `include/xrpl/core/Coro.ipp` | Coro method implementations | All (122 lines) |
| `include/xrpl/basics/LocalValue.h` | Per-coroutine thread-local storage | Lines 12-59 (LocalValues) |
| `cmake/deps/Boost.cmake` | Boost.Coroutine dependency | Lines 7, 24 |
#### Entry Points (postCoro Callers)
| File | Entry Point | Job Type |
| -------------------------------------------- | ---------------------------- | -------------------- |
| `src/xrpld/rpc/detail/ServerHandler.cpp:287` | `onRequest()` — HTTP RPC | `jtCLIENT_RPC` |
| `src/xrpld/rpc/detail/ServerHandler.cpp:325` | `onWSMessage()` — WebSocket | `jtCLIENT_WEBSOCKET` |
| `src/xrpld/app/main/GRPCServer.cpp:102` | `CallData::process()` — gRPC | `jtRPC` |
#### Context Propagation
| File | Role |
| --------------------------------------- | ------------------------------------------------------ |
| `src/xrpld/rpc/Context.h:27` | `RPC::Context` holds `shared_ptr coro` |
| `src/xrpld/rpc/ServerHandler.h:174-188` | `processSession/processRequest` pass coro through |
#### Active Coroutine Consumer (yield/post)
| File | Usage |
| --------------------------------------------------- | ----------------------------------------------------- |
| `src/xrpld/rpc/handlers/RipplePathFind.cpp:131` | `context.coro->yield()` — suspends for path-finding |
| `src/xrpld/rpc/handlers/RipplePathFind.cpp:116-123` | Continuation calls `coro->post()` or `coro->resume()` |
#### Test Files
| File | Tests |
| ---------------------------------- | ------------------------------------------------------------- |
| `src/test/core/Coroutine_test.cpp` | `correct_order`, `incorrect_order`, `thread_specific_storage` |
| `src/test/core/JobQueue_test.cpp` | `testPostCoro` (post/resume cycles, shutdown behavior) |
| `src/test/app/Path_test.cpp` | Path-finding RPC via postCoro |
| `src/test/jtx/impl/AMMTest.cpp` | AMM RPC via postCoro |
### 2.5 Suspension/Continuation Model
The current model documented in `src/xrpld/rpc/README.md` defines four functional types:
```
Callback = std::function — generic 0-arg function
Continuation = std::function — calls Callback later
Suspend = std::function — runs Continuation, suspends
Coroutine = std::function — given a Suspend, starts work
```
In practice, `JobQueue::Coro` simplifies this to:
- **Suspend** = `coro->yield()`
- **Continue** = `coro->post()` (async on JobQueue) or `coro->resume()` (sync on current thread)
### 2.6 CMake Dependency
In `cmake/deps/Boost.cmake`:
```cmake
find_package(Boost REQUIRED COMPONENTS ... coroutine ...)
target_link_libraries(xrpl_boost INTERFACE ... Boost::coroutine ...)
```
Additionally in `cmake/XrplInterface.cmake`:
```cpp
BOOST_COROUTINES_NO_DEPRECATION_WARNING // Suppresses Boost.Coroutine deprecation warnings
```
### 2.7 Existing C++20 Coroutine Usage
rippled **already uses C++20 coroutines** in test code:
- `src/tests/libxrpl/net/HTTPClient.cpp` uses `co_await` with `boost::asio::use_awaitable`
- Demonstrates team familiarity with C++20 coroutine syntax
- Proves compiler toolchain supports C++20 coroutines
---
## 3. Migration Strategy
### 3.1 Incremental vs Atomic Migration
**Decision: Incremental (multi-phase) migration.**
Rationale:
- Only **one RPC handler** (`RipplePathFind`) actively uses `yield()/post()` suspension
- The **three entry points** (HTTP, WS, gRPC) all funnel through `postCoro()`
- The `RPC::Context.coro` field is the sole propagation mechanism
- We can introduce a new C++20 coroutine system alongside the existing one and migrate callsites incrementally
### 3.2 Migration Phases
```mermaid
graph TD
subgraph "Phase 1: Foundation"
P1A["Create CoroTask#lt;T#gt; type
(promise_type, awaiter)"]
P1B["Create JobQueueAwaiter
(schedules resume on JobQueue)"]
P1C["Add postCoroTask() to JobQueue
(parallel to postCoro)"]
P1D["Unit tests for new primitives"]
P1A --> P1B --> P1C --> P1D
end
subgraph "Phase 2: Entry Point Migration"
P2A["Migrate ServerHandler::onRequest()"]
P2B["Migrate ServerHandler::onWSMessage()"]
P2C["Migrate GRPCServer::CallData::process()"]
P2D["Update RPC::Context to use new type"]
P2A --> P2D
P2B --> P2D
P2C --> P2D
end
subgraph "Phase 3: Handler Migration"
P3A["Migrate RipplePathFind handler"]
P3B["Verify all other handlers
(no active yield usage)"]
end
subgraph "Phase 4: Cleanup"
P4A["Remove old Coro class"]
P4B["Remove Boost.Coroutine from CMake"]
P4C["Remove deprecation warning suppression"]
P4D["Final benchmarks & validation"]
end
P1D --> P2A
P2D --> P3A
P3B --> P4A
P3A --> P4A
P4A --> P4B --> P4C --> P4D
```
### 3.3 Coexistence Strategy
During migration, both implementations will coexist:
```mermaid
graph LR
subgraph "Transition Period"
OLD["JobQueue::Coro
(Boost, existing)"]
NEW["JobQueue::CoroTask
(C++20, new)"]
CTX["RPC::Context"]
end
CTX -->|"phase 1-2"| OLD
CTX -->|"phase 2-3"| NEW
style OLD fill:#fdd,stroke:#c00,color:#000
style NEW fill:#dfd,stroke:#0a0,color:#000
```
- `RPC::Context` will temporarily hold both `shared_ptr` (old) and the new coroutine handle
- Entry points will be migrated one at a time
- Each migration is independently testable
- Once all entry points and handlers are migrated, old code is removed
### 3.4 Breaking Changes & Compatibility
| Concern | Impact | Mitigation |
| -------------------------------- | ----------------------------------------- | -------------------------------------------------------- |
| `RPC::Context::coro` type change | All RPC handlers receive context | Migrate context field last, after all consumers updated |
| `postCoro()` removal | 3 callers | Replace with `postCoroTask()`, remove old API in Phase 4 |
| `LocalValue` integration | Thread-local storage must work | New implementation must swap LocalValues identically |
| Shutdown behavior | `expectEarlyExit()`, `nSuspend_` tracking | Replicate in new CoroTask |
---
## 4. Implementation Plan
### 4.1 New Type Design
#### `CoroTask` — Coroutine Return Type
```mermaid
classDiagram
class CoroTask~T~ {
+Handle handle_
+CoroTask(Handle h)
+destroy()
+bool done() const
+T get() const
+bool await_ready() const
+void await_suspend(coroutine_handle h) const
+T await_resume() const
}
class promise_type {
-result_ : variant~T, exception_ptr~
-continuation_ : coroutine_handle
+CoroTask get_return_object()
+suspend_always initial_suspend()
+FinalAwaiter final_suspend()
+void return_value(T)
+void return_void()
+void unhandled_exception()
}
class FinalAwaiter {
+bool await_ready()
+coroutine_handle await_suspend(coroutine_handle~promise_type~)
+void await_resume()
}
class JobQueueAwaiter {
-jq_ : JobQueue
-type_ : JobType
-name_ : string
+bool await_ready()
+void await_suspend(coroutine_handle h)
+void await_resume()
}
CoroTask --> promise_type : contains
promise_type --> FinalAwaiter : returns from final_suspend
CoroTask ..> JobQueueAwaiter : used with co_await
```
#### `JobQueueAwaiter` — Schedules Resumption on JobQueue
```cpp
// Conceptual design — actual implementation may vary
struct JobQueueAwaiter {
JobQueue& jq;
JobType type;
std::string name;
bool await_ready() { return false; } // Always suspend
void await_suspend(std::coroutine_handle<> h) {
// Schedule coroutine resumption as a job
jq.addJob(type, name, [h]() { h.resume(); });
}
void await_resume() {}
};
```
### 4.2 Mapping: Old API → New API
```mermaid
graph LR
subgraph "Current (Boost)"
A1["postCoro(type, name, fn)"]
A2["coro->yield()"]
A3["coro->post()"]
A4["coro->resume()"]
A5["coro->join()"]
A6["coro->runnable()"]
A7["coro->expectEarlyExit()"]
end
subgraph "New (C++20)"
B1["postCoroTask(type, name, fn)
fn returns CoroTask<void>"]
B2["co_await JobQueueAwaiter{jq, type, name}"]
B3["Built into await_suspend()
(automatic scheduling)"]
B4["handle.resume()
(direct call)"]
B5["co_await task
(continuation-based)"]
B6["handle.done()"]
B7["handle.destroy() + cleanup"]
end
A1 --> B1
A2 --> B2
A3 --> B3
A4 --> B4
A5 --> B5
A6 --> B6
A7 --> B7
```
### 4.3 File Changes Required
#### Phase 1: New Coroutine Primitives
| File | Action | Description |
| ------------------------------------- | ---------- | ------------------------------------------------------------- |
| `include/xrpl/core/CoroTask.h` | **CREATE** | `CoroTask` return type with `promise_type`, `FinalAwaiter` |
| `include/xrpl/core/JobQueueAwaiter.h` | **CREATE** | Awaiter that schedules resume on JobQueue |
| `include/xrpl/core/JobQueue.h` | **MODIFY** | Add `postCoroTask()` template alongside existing `postCoro()` |
| `src/test/core/CoroTask_test.cpp` | **CREATE** | Unit tests for `CoroTask` and `JobQueueAwaiter` |
#### Phase 2: Entry Point Migration
| File | Action | Description |
| ---------------------------------------- | ---------- | ---------------------------------------------------------------------- |
| `src/xrpld/rpc/detail/ServerHandler.cpp` | **MODIFY** | `onRequest()` and `onWSMessage()`: replace `postCoro` → `postCoroTask` |
| `src/xrpld/rpc/ServerHandler.h` | **MODIFY** | Update `processSession`/`processRequest` signatures |
| `src/xrpld/app/main/GRPCServer.cpp` | **MODIFY** | `CallData::process()`: replace `postCoro` → `postCoroTask` |
| `src/xrpld/app/main/GRPCServer.h` | **MODIFY** | Update `process()` method signature |
| `src/xrpld/rpc/Context.h` | **MODIFY** | Change `shared_ptr` to new coroutine handle type |
#### Phase 3: Handler Migration
| File | Action | Description |
| ------------------------------------------- | ---------- | ---------------------------------------------------------------- |
| `src/xrpld/rpc/handlers/RipplePathFind.cpp` | **MODIFY** | Replace `context.coro->yield()` / `coro->post()` with `co_await` |
| `src/test/app/Path_test.cpp` | **MODIFY** | Update test to use new coroutine API |
| `src/test/jtx/impl/AMMTest.cpp` | **MODIFY** | Update test to use new coroutine API |
#### Phase 4: Cleanup
| File | Action | Description |
| ---------------------------------- | ---------- | ------------------------------------------------------------------ |
| `include/xrpl/core/Coro.ipp` | **DELETE** | Remove old Boost.Coroutine implementation |
| `include/xrpl/core/JobQueue.h` | **MODIFY** | Remove `Coro` class, `postCoro()`, `Coro_create_t`, Boost includes |
| `cmake/deps/Boost.cmake` | **MODIFY** | Remove `coroutine` from `find_package` and `target_link_libraries` |
| `cmake/XrplInterface.cmake` | **MODIFY** | Remove `BOOST_COROUTINES_NO_DEPRECATION_WARNING` |
| `src/test/core/Coroutine_test.cpp` | **MODIFY** | Rewrite tests for new CoroTask |
| `src/test/core/JobQueue_test.cpp` | **MODIFY** | Update `testPostCoro` to use new API |
| `include/xrpl/basics/LocalValue.h` | **MODIFY** | Update LocalValues integration for C++20 coroutines |
### 4.4 LocalValue Integration Design
The current `LocalValue` system swaps per-coroutine storage on resume/yield:
```mermaid
sequenceDiagram
participant WT as Worker Thread
participant LV as LocalValues
participant C as Coroutine
Note over WT: Thread has its own LocalValues
WT->>LV: saved = getLocalValues().release()
WT->>LV: getLocalValues().reset(#amp;coro.lvs_)
Note over LV: Now pointing to coroutine's storage
WT->>C: coro_() / handle.resume()
Note over C: User code sees coroutine's LocalValues
C-->>WT: yield / co_await returns
WT->>LV: getLocalValues().release()
WT->>LV: getLocalValues().reset(saved)
Note over LV: Restored to thread's storage
```
**For C++20**: The same swap pattern must be implemented in the awaiter's `await_suspend()` and `await_resume()`, or in a wrapper that calls `handle.resume()`.
### 4.5 RipplePathFind Migration Design
Current pattern:
```cpp
// Continuation callback
auto callback = [&context]() {
std::shared_ptr coroCopy{context.coro};
if (!coroCopy->post()) {
coroCopy->resume(); // Fallback: run on current thread
}
};
// Start async work, then suspend
jvResult = makeLegacyPathRequest(request, callback, ...);
if (request) {
context.coro->yield(); // ← SUSPEND HERE
jvResult = request->doStatus(context.params); // ← RESUME HERE
}
```
Target pattern:
```cpp
// Start async work, suspend via co_await
jvResult = makeLegacyPathRequest(request, /* awaiter-based callback */, ...);
if (request) {
co_await PathFindAwaiter{context}; // ← SUSPEND + RESUME via awaiter
jvResult = request->doStatus(context.params);
}
```
The `PathFindAwaiter` will encapsulate the scheduling logic currently in the lambda continuation.
---
## 5. Testing & Validation Strategy
### 5.1 Test Architecture
```mermaid
graph TD
subgraph "Unit Tests"
UT1["CoroTask_test
- Construction/destruction
- co_return values
- Exception propagation
- Lifetime management"]
UT2["JobQueueAwaiter_test
- Schedule on correct JobType
- Resume on worker thread
- Shutdown handling"]
UT3["LocalValue integration
- Per-coroutine isolation
- Multi-coroutine concurrent
- Cross-thread consistency"]
end
subgraph "Migration Tests"
MT1["Coroutine_test rewrite
- correct_order
- incorrect_order
- thread_specific_storage"]
MT2["PostCoro migration
- Post/resume cycles
- Shutdown rejection
- Early exit"]
end
subgraph "Integration Tests"
IT1["RPC Path Finding
- Suspend/resume flow
- Shutdown during suspend
- Concurrent requests"]
IT2["Full --unittest suite
- All existing tests pass
- No regressions"]
end
subgraph "Performance Tests"
PT1["Memory benchmarks"]
PT2["Context switch benchmarks"]
PT3["RPC throughput under load"]
end
subgraph "Sanitizer Tests"
ST1["ASAN
(memory errors)"]
ST2["TSAN
(data races)"]
ST3["UBSan
(undefined behavior)"]
end
UT1 --> MT1
UT2 --> MT2
MT1 --> IT1
MT2 --> IT2
IT1 --> PT1
IT2 --> PT2
PT1 --> ST1
PT2 --> ST2
PT3 --> ST3
```
### 5.2 Benchmarking Tests
#### Memory Usage Benchmark
```
Test: Create N coroutines, measure RSS
- N = 100, 1000, 10000
- Measure: peak RSS, per-coroutine overhead
- Compare: Boost (N * 1MB + overhead) vs C++20 (N * ~500B + overhead)
- Tool: /proc/self/status (VmRSS), or getrusage()
```
#### Context Switch Benchmark
```
Test: Yield/resume M times across N coroutines
- M = 100,000 iterations
- N = 1, 10, 100 concurrent coroutines
- Measure: total time, per-switch latency (ns)
- Compare: Boost yield/resume cycle vs C++20 co_await/resume cycle
- Tool: std::chrono::high_resolution_clock
```
#### RPC Throughput Benchmark
```
Test: Concurrent ripple_path_find requests
- Load: 10, 50, 100 concurrent requests
- Measure: requests/second, p50/p95/p99 latency
- Compare: before vs after migration
- Tool: Custom load generator or existing perf infrastructure
```
### 5.3 Unit Test Coverage
| Test | What It Validates |
| ------------------------------ | --------------------------------------------- |
| `CoroTask` basic | Coroutine runs to completion, handle cleanup |
| `CoroTask` with value | `co_return` value correctly retrieved |
| `CoroTask` exception | `unhandled_exception()` captures and rethrows |
| `CoroTask` cancellation | Destruction before completion cleans up |
| `JobQueueAwaiter` basic | `co_await` suspends, resumes on worker thread |
| `JobQueueAwaiter` shutdown | Returns false / throws when JobQueue stopping |
| `PostCoroTask` lifecycle | Create → suspend → resume → complete |
| `PostCoroTask` multiple yields | Multiple co_await points in sequence |
| `LocalValue` isolation | 4 coroutines, each sees own LocalValue |
| `LocalValue` cross-thread | Resume on different thread, values preserved |
### 5.4 Integration Testing
- **All existing `--unittest` tests must pass unchanged** (except coroutine-specific tests that are rewritten)
- **Path_test** must pass with identical behavior
- **AMMTest** RPC tests must pass
- **ServerHandler** HTTP/WS handling must work end-to-end
### 5.5 Sanitizer Testing
Per `docs/build/sanitizers.md`:
```bash
# ASAN (memory errors — especially important for coroutine frame lifetime)
export SANITIZERS=address,undefinedbehavior
# Build + test
# TSAN (data races — critical for concurrent coroutine resume)
export SANITIZERS=thread
# Build + test (separate build — cannot mix with ASAN)
```
**Key benefit**: Removing Boost.Coroutine eliminates the `__asan_handle_no_return` false positives caused by Boost context switching (documented in `docs/build/sanitizers.md` line 184).
### 5.6 Regression Testing Methodology
```mermaid
graph LR
subgraph "Before Migration (Baseline)"
B1["Build on develop branch"]
B2["Run --unittest (record pass/fail)"]
B3["Run memory benchmark (record RSS)"]
B4["Run context switch benchmark (record ns/switch)"]
end
subgraph "After Migration"
A1["Build on feature branch"]
A2["Run --unittest (compare pass/fail)"]
A3["Run memory benchmark (compare RSS)"]
A4["Run context switch benchmark (compare ns/switch)"]
end
subgraph "Acceptance Criteria"
C1["Zero test regressions"]
C2["Memory: ≤ baseline"]
C3["Context switch: ≤ baseline"]
C4["ASAN/TSAN clean"]
end
B1 --> B2 --> B3 --> B4
A1 --> A2 --> A3 --> A4
B2 -.->|compare| C1
A2 -.->|compare| C1
B3 -.->|compare| C2
A3 -.->|compare| C2
B4 -.->|compare| C3
A4 -.->|compare| C3
A2 -.-> C4
```
---
## 6. Risks & Mitigation
### 6.1 Risk Matrix
| Risk | Probability | Impact | Mitigation |
| ----------------------------------------------------- | ----------- | ------ | --------------------------------------------------------------------------- |
| **Performance regression** in context switching | Low | High | Benchmark before/after; C++20 should be faster |
| **Coroutine frame lifetime bugs** (use-after-destroy) | Medium | High | ASAN testing, RAII wrapper for handle, code review |
| **Data races on resume** | Medium | High | TSAN testing, careful await_suspend() implementation |
| **LocalValue corruption** across threads | Low | High | Dedicated test with 4+ concurrent coroutines |
| **Shutdown race conditions** | Medium | Medium | Replicate existing mutex/cv pattern in new design |
| **Missed coroutine consumer** during migration | Low | Medium | Exhaustive grep audit (Section 2.4 is complete) |
| **Compiler bugs** in coroutine codegen | Low | Medium | Test on all three compilers (GCC, Clang, MSVC) |
| **Exception loss** across suspension points | Medium | Medium | Test exception propagation in every phase |
| **Third-party code depending on Boost.Coroutine** | Very Low | Low | Grep confirms only internal usage |
| **Dangling references in coroutine frames** | Medium | High | ASAN testing, avoid reference params in coroutine functions, use shared_ptr |
| **Colored function infection spreading** | Low | Medium | Only 4 call sites need co_await; no nested handlers suspend |
| **Symmetric transfer not available** | Very Low | High | All target compilers (GCC 12+, Clang 16+) support symmetric transfer |
| **Future handler adding deep yield** | Low | Medium | Code review + CI: static analysis flag any yield from nested depth |
### 6.2 Rollback Strategy
```mermaid
graph TD
START["Migration In Progress"]
CHECK{"Critical Issue
Discovered?"}
PHASE{"Which Phase?"}
P1["Phase 1: Delete new files
No production code changed"]
P2["Phase 2: Revert entry point changes
Old postCoro still present"]
P3["Phase 3: Revert handler changes
Old Coro still present"]
P4["Phase 4: Cannot easily rollback
Old code deleted"]
PREVENT["Prevention:
Do NOT delete old code
until Phase 4 is fully validated"]
START --> CHECK
CHECK -->|Yes| PHASE
CHECK -->|No| DONE["Continue Migration"]
PHASE -->|1| P1
PHASE -->|2| P2
PHASE -->|3| P3
PHASE -->|4| P4
P4 --> PREVENT
```
**Key principle**: Old `Coro` class and `postCoro()` remain in the codebase through Phases 1-3. They are only removed in Phase 4, after all migration is validated. Each phase is independently revertible via `git revert`.
### 6.3 Specific Risk: Stackful → Stackless Limitation
**The Big Question**: Can all current `yield()` call sites work with stackless `co_await`?
**Analysis**:
```mermaid
graph TD
Q["Does yield() get called from
a deeply nested function?"]
Q -->|Yes| PROBLEM["PROBLEM: co_await can't
suspend from nested calls"]
Q -->|No| OK["OK: Direct co_await
in coroutine function"]
CHECK1["RipplePathFind.cpp:131
context.coro.yield()"]
CHECK1 -->|"Called directly in handler"| OK
CHECK2["Coroutine_test.cpp
c.yield()"]
CHECK2 -->|"Called directly in lambda"| OK
CHECK3["JobQueue_test.cpp
c.yield()"]
CHECK3 -->|"Called directly in lambda"| OK
style OK fill:#dfd,stroke:#0a0,color:#000
style PROBLEM fill:#fdd,stroke:#c00,color:#000
```
**Result**: All `yield()` calls are in the direct body of the postCoro lambda or RPC handler function. **No deep nesting exists.** Migration to stackless `co_await` is fully feasible without architectural redesign.
---
## 7. Timeline & Milestones
### 7.1 Milestone Overview
```mermaid
gantt
title Migration Timeline
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 1 - Foundation
CoroTask + JobQueueAwaiter design :p1a, 2026-02-26, 3d
CoroTask implementation :p1b, after p1a, 3d
Unit tests for primitives :p1c, after p1b, 2d
PR 1 - New coroutine primitives :milestone, p1m, after p1c, 0d
section Phase 2 - Entry Points
Migrate ServerHandler (HTTP + WS) :p2a, after p1m, 3d
Migrate GRPCServer :p2b, after p2a, 2d
Update RPC Context :p2c, after p2b, 1d
PR 2 - Entry point migration :milestone, p2m, after p2c, 0d
section Phase 3 - Handlers
Migrate RipplePathFind :p3a, after p2m, 3d
Update test infrastructure :p3b, after p3a, 2d
PR 3 - Handler migration :milestone, p3m, after p3b, 0d
section Phase 4 - Cleanup
Remove old Coro and update CMake :p4a, after p3m, 2d
Performance benchmarks :p4b, after p4a, 2d
Sanitizer validation :p4c, after p4b, 1d
PR 4 - Cleanup + validation :milestone, p4m, after p4c, 0d
```
### 7.2 Milestone Details
#### Milestone 1: New Coroutine Primitives (PR #1)
**Deliverables**:
- `CoroTask` with `promise_type`, `FinalAwaiter`
- `CoroTask` specialization
- `JobQueueAwaiter` for scheduling on JobQueue
- `postCoroTask()` on `JobQueue`
- LocalValue integration in new coroutine type
- Unit test suite: `CoroTask_test.cpp`
**Acceptance Criteria**:
- All new unit tests pass
- Existing `--unittest` suite passes (no regressions from new code)
- ASAN + TSAN clean on new tests
- Code compiles on GCC 12+, Clang 16+
#### Milestone 2: Entry Point Migration (PR #2)
**Deliverables**:
- `ServerHandler::onRequest()` uses `postCoroTask()`
- `ServerHandler::onWSMessage()` uses `postCoroTask()`
- `GRPCServer::CallData::process()` uses `postCoroTask()`
- `RPC::Context` updated to carry new coroutine type
- `processSession`/`processRequest` signatures updated
**Acceptance Criteria**:
- HTTP, WebSocket, and gRPC RPC requests work end-to-end
- Full `--unittest` suite passes
- Manual smoke test: `ripple_path_find` via HTTP/WS
#### Milestone 3: Handler Migration (PR #3)
**Deliverables**:
- `RipplePathFind` uses `co_await` instead of `yield()/post()`
- Path_test and AMMTest updated
- Coroutine_test and JobQueue_test updated for new API
**Acceptance Criteria**:
- Path-finding suspension/continuation works correctly
- All `--unittest` tests pass
- Shutdown-during-pathfind scenario tested
#### Milestone 4: Cleanup & Validation (PR #4)
**Deliverables**:
- Old `Coro` class and `Coro.ipp` removed
- `postCoro()` removed from `JobQueue`
- `Boost::coroutine` removed from CMake
- `BOOST_COROUTINES_NO_DEPRECATION_WARNING` removed
- Performance benchmark results documented
- Sanitizer test results documented
**Acceptance Criteria**:
- Build succeeds without Boost.Coroutine
- Full `--unittest` suite passes
- Memory per coroutine ≤ 10KB (down from 1MB)
- Context switch time ≤ baseline
- ASAN, TSAN, UBSan all clean
---
## 8. Standards & Guidelines
### 8.1 Coroutine Design Standards
#### Rule 1: All coroutine return types must use RAII for handle lifetime
```cpp
// GOOD: Handle destroyed in destructor
~CoroTask() {
if (handle_) handle_.destroy();
}
// BAD: Manual destroy calls scattered in code
void cleanup() { handle_.destroy(); } // Easy to forget
```
#### Rule 2: Never resume a coroutine from within `await_suspend()`
```cpp
// GOOD: Schedule resume on executor
void await_suspend(std::coroutine_handle<> h) {
jq_.addJob(type_, name_, [h]() { h.resume(); });
}
// BAD: Direct resume in await_suspend (blocks caller)
void await_suspend(std::coroutine_handle<> h) {
h.resume(); // Defeats the purpose of suspension
}
```
#### Rule 3: Use `suspend_always` for `initial_suspend()` (lazy start)
```cpp
// GOOD: Lazy start — coroutine doesn't run until explicitly resumed
std::suspend_always initial_suspend() { return {}; }
// BAD for our use case: Eager start — runs immediately on creation
std::suspend_never initial_suspend() { return {}; }
```
Rationale: Matches existing Boost behavior where `postCoro()` schedules execution, not the constructor.
#### Rule 4: Always handle `unhandled_exception()` explicitly
```cpp
void unhandled_exception() {
exception_ = std::current_exception();
// NEVER: just swallow the exception
// NEVER: std::terminate() without logging
}
```
#### Rule 5: Use `suspend_always` for `final_suspend()` to enable continuation
```cpp
// GOOD: Suspend at end to allow cleanup and value retrieval
auto final_suspend() noexcept {
struct FinalAwaiter {
bool await_ready() noexcept { return false; }
std::coroutine_handle<> await_suspend(
std::coroutine_handle h) noexcept {
if (h.promise().continuation_)
return h.promise().continuation_; // Resume waiter
return std::noop_coroutine();
}
void await_resume() noexcept {}
};
return FinalAwaiter{};
}
```
#### Rule 6: Coroutine functions must be clearly marked
```cpp
// GOOD: Return type makes it obvious this is a coroutine
CoroTask doRipplePathFind(RPC::JsonContext& context) {
co_await ...;
co_return result;
}
// BAD: Coroutine hidden behind auto or unclear return type
auto doSomething() { co_return; }
```
### 8.2 Coding Guidelines
#### Thread Safety
1. **Never resume a coroutine concurrently from two threads.** Use the same mutex pattern as existing `Coro::mutex_` to prevent races.
2. **`await_suspend()` is the synchronization point.** All state visible before `await_suspend()` must be visible after `await_resume()`.
3. **Use `std::atomic` or mutexes for shared state** between coroutine and continuation callback.
#### Memory Management
1. **`CoroTask` owns its `coroutine_handle`**. It is move-only, non-copyable.
2. **Never store raw `coroutine_handle<>`** in long-lived data structures without clear ownership.
3. **Prefer `shared_ptr>`** when multiple parties need to observe/wait on a coroutine, mirroring the existing `shared_ptr` pattern.
#### Error Handling
1. **Exceptions thrown in coroutine body** are captured by `promise_type::unhandled_exception()` and rethrown in `await_resume()`.
2. **Never let exceptions escape `final_suspend()`** — it's `noexcept`.
3. **Shutdown path**: When `JobQueue` is stopping and `addJob()` returns false, the awaiter must resume the coroutine with an error (throw or return error state) rather than leaving it suspended forever.
#### Naming Conventions
| Entity | Convention | Example |
| --------------------- | ------------------------- | ----------------------------------------- |
| Coroutine return type | `CoroTask` | `CoroTask`, `CoroTask` |
| Awaiter types | `*Awaiter` suffix | `JobQueueAwaiter`, `PathFindAwaiter` |
| Coroutine functions | Same as regular functions | `doRipplePathFind(...)` |
| Promise types | Nested `promise_type` | `CoroTask::promise_type` |
| JobQueue method | `postCoroTask()` | `jq.postCoroTask(jtCLIENT, "name", fn)` |
#### Code Organization
1. **Coroutine primitives** go in `include/xrpl/core/` (header-only where possible)
2. **Application-specific awaiters** go alongside their consumers
3. **Tests** mirror source structure: `src/test/core/CoroTask_test.cpp`
4. **No conditional compilation** (`#ifdef`) for old vs new coroutine code — migration is clean phases
#### Documentation
1. **Each awaiter must document**: what it waits for, which thread resumes, and what `await_resume()` returns.
2. **Promise type must document**: exception handling behavior and suspension points.
3. **Migration commits must reference this plan** in commit messages.
### 8.3 Branch Strategy
Each milestone is developed on a **sub-branch** of the main feature branch. This keeps PRs focused and independently reviewable.
```
develop
└── pratik/Switch-to-std-coroutines (main feature branch)
├── pratik/std-coro/add-coroutine-primitives (CoroTask, CoroTaskRunner, JobQueueAwaiter, postCoroTask)
├── pratik/std-coro/migrate-entry-points (ServerHandler, GRPCServer, RPC::Context)
├── pratik/std-coro/migrate-handlers (doRipplePathFind, PathFindAwaiter, tests)
└── pratik/std-coro/cleanup-boost-coroutine (delete Coro.ipp, remove Boost dep, benchmarks)
```
**Workflow**:
1. Create sub-branch from `pratik/Switch-to-std-coroutines` for each milestone
2. Develop and test on the sub-branch
3. Create PR from sub-branch → `pratik/Switch-to-std-coroutines`
4. After review + merge, start next milestone sub-branch from the updated feature branch
5. Final PR from `pratik/Switch-to-std-coroutines` → `develop`
**Rules**:
- Never push directly to the main feature branch — always via sub-branch PR
- Each sub-branch must pass `--unittest` and sanitizers before PR
- Sub-branch names follow the pattern: `pratik/std-coro/` (e.g., `add-coroutine-primitives`, `migrate-entry-points`)
- Milestone PRs must reference this plan document in the description
### 8.4 Code Review Checklist
For every PR in this migration:
- [ ] `coroutine_handle::destroy()` called exactly once per coroutine
- [ ] No concurrent `handle.resume()` calls possible
- [ ] `unhandled_exception()` stores the exception (doesn't discard it)
- [ ] `final_suspend()` is `noexcept`
- [ ] Awaiter `await_suspend()` doesn't block (schedules, not runs)
- [ ] `LocalValues` correctly swapped on suspend/resume boundaries
- [ ] Shutdown path tested (JobQueue stopping during coroutine execution)
- [ ] ASAN clean (no use-after-free on coroutine frame)
- [ ] TSAN clean (no data races on resume)
- [ ] All existing `--unittest` tests still pass
---
## 9. Task List
### Milestone 1: New Coroutine Primitives
- [ ] **1.1** Design `CoroTask` class with `promise_type`
- Define `promise_type` with `initial_suspend`, `final_suspend`, `unhandled_exception`, `return_value`/`return_void`
- Implement `FinalAwaiter` for continuation support
- Implement move-only RAII handle wrapper
- Support both `CoroTask` and `CoroTask`
- [ ] **1.2** Design and implement `JobQueueAwaiter`
- `await_suspend()` calls `jq_.addJob(type, name, [h]{ h.resume(); })`
- Handle `addJob()` failure (shutdown) — resume with error flag or throw
- Integrate `nSuspend_` counter increment/decrement
- [ ] **1.3** Implement `LocalValues` swap in new coroutine resume path
- Before `handle.resume()`: save thread-local, install coroutine-local
- After `handle.resume()` returns: restore thread-local
- Ensure this works when coroutine migrates between threads
- [ ] **1.4** Add `postCoroTask()` template to `JobQueue`
- Accept callable returning `CoroTask`
- Schedule initial execution on JobQueue (mirror `postCoro()` behavior)
- Return a handle/shared_ptr for join/cancel
- [ ] **1.5** Write unit tests (`src/test/core/CoroTask_test.cpp`)
- Test `CoroTask` runs to completion
- Test `CoroTask` returns value
- Test exception propagation across co_await
- Test coroutine destruction before completion
- Test `JobQueueAwaiter` schedules on correct thread
- Test `LocalValue` isolation across 4+ coroutines
- Test shutdown rejection (addJob returns false)
- Test `correct_order` equivalent (yield → join → post → complete)
- Test `incorrect_order` equivalent (post → yield → complete)
- Test multiple sequential co_await points
- [ ] **1.6** Verify build on GCC 12+, Clang 16+
- [ ] **1.7** Run ASAN + TSAN on new tests
- [ ] **1.8** Run full `--unittest` suite (no regressions)
- [ ] **1.9** Self-review and create PR #1
### Milestone 2: Entry Point Migration
- [ ] **2.1** Migrate `ServerHandler::onRequest()` (`ServerHandler.cpp:287`)
- Replace `m_jobQueue.postCoro(jtCLIENT_RPC, ...)` with `postCoroTask()`
- Update lambda to return `CoroTask` (add `co_return`)
- Update `processSession` to accept new coroutine type
- [ ] **2.2** Migrate `ServerHandler::onWSMessage()` (`ServerHandler.cpp:325`)
- Replace `m_jobQueue.postCoro(jtCLIENT_WEBSOCKET, ...)` with `postCoroTask()`
- Update lambda signature
- [ ] **2.3** Migrate `GRPCServer::CallData::process()` (`GRPCServer.cpp:102`)
- Replace `app_.getJobQueue().postCoro(JobType::jtRPC, ...)` with `postCoroTask()`
- Update `process(shared_ptr coro)` overload signature
- [ ] **2.4** Update `RPC::Context` (`Context.h:27`)
- Replace `std::shared_ptr coro{}` with new coroutine wrapper type
- Ensure all code that accesses `context.coro` compiles
- [ ] **2.5** Update `ServerHandler.h` signatures
- `processSession()` and `processRequest()` parameter types
- [ ] **2.6** Update `GRPCServer.h` signatures
- `process()` method parameter types
- [ ] **2.7** Run full `--unittest` suite
- [ ] **2.8** Manual smoke test: HTTP + WS + gRPC RPC requests
- [ ] **2.9** Run ASAN + TSAN
- [ ] **2.10** Self-review and create PR #2
### Milestone 3: Handler Migration
- [ ] **3.1** Migrate `doRipplePathFind()` (`RipplePathFind.cpp`)
- Replace `context.coro->yield()` with `co_await PathFindAwaiter{...}`
- Replace continuation lambda's `coro->post()` / `coro->resume()` with awaiter scheduling
- Handle shutdown case (post failure) in awaiter
- [ ] **3.2** Create `PathFindAwaiter` (or use generic `JobQueueAwaiter`)
- Encapsulate the continuation + yield pattern from `RipplePathFind.cpp` lines 108-132
- [ ] **3.3** Update `Path_test.cpp`
- Replace `postCoro` usage with `postCoroTask`
- Ensure `context.coro` usage matches new type
- [ ] **3.4** Update `AMMTest.cpp`
- Replace `postCoro` usage with `postCoroTask`
- [ ] **3.5** Rewrite `Coroutine_test.cpp` for new API
- `correct_order`: postCoroTask → co_await → join → resume → complete
- `incorrect_order`: post before yield equivalent
- `thread_specific_storage`: 4 coroutines with LocalValue isolation
- [ ] **3.6** Update `JobQueue_test.cpp` `testPostCoro`
- Migrate to `postCoroTask` API
- [ ] **3.7** Verify `ripple_path_find` works end-to-end with new coroutines
- [ ] **3.8** Test shutdown-during-pathfind scenario
- [ ] **3.9** Run full `--unittest` suite
- [ ] **3.10** Run ASAN + TSAN
- [ ] **3.11** Self-review and create PR #3
### Milestone 4: Cleanup & Validation
- [ ] **4.1** Delete `include/xrpl/core/Coro.ipp`
- [ ] **4.2** Remove from `JobQueue.h`:
- `#include `
- `struct Coro_create_t`
- `class Coro` (entire class)
- `postCoro()` template
- Comment block (lines 322-377) describing old race condition
- [ ] **4.3** Update `cmake/deps/Boost.cmake`:
- Remove `coroutine` from `find_package(Boost REQUIRED COMPONENTS ...)`
- Remove `Boost::coroutine` from `target_link_libraries`
- [ ] **4.4** Update `cmake/XrplInterface.cmake`:
- Remove `BOOST_COROUTINES_NO_DEPRECATION_WARNING`
- [ ] **4.5** Run memory benchmark
- Create N=1000 coroutines, compare RSS: before vs after
- Document results
- [ ] **4.6** Run context switch benchmark
- 100K yield/resume cycles, compare latency: before vs after
- Document results
- [ ] **4.7** Run RPC throughput benchmark
- Concurrent `ripple_path_find` requests, compare throughput
- Document results
- [ ] **4.8** Run full `--unittest` suite
- [ ] **4.9** Run ASAN, TSAN, UBSan
- Confirm `__asan_handle_no_return` warnings are gone
- [ ] **4.10** Verify build on all supported compilers
- [ ] **4.11** Self-review and create PR #4
- [ ] **4.12** Document final benchmark results in PR description
---
## Appendix A: File Inventory
Complete list of files that reference coroutines (for audit tracking):
| # | File | Must Change | Phase |
| --- | ------------------------------------------- | ----------- | -------------------------- |
| 1 | `include/xrpl/core/JobQueue.h` | Yes | 1 (add), 4 (remove old) |
| 2 | `include/xrpl/core/Coro.ipp` | Yes | 4 (delete) |
| 3 | `include/xrpl/basics/LocalValue.h` | Maybe | 1 (if integration changes) |
| 4 | `cmake/deps/Boost.cmake` | Yes | 4 |
| 5 | `cmake/XrplInterface.cmake` | Yes | 4 |
| 6 | `src/xrpld/rpc/Context.h` | Yes | 2 |
| 7 | `src/xrpld/rpc/detail/ServerHandler.cpp` | Yes | 2 |
| 8 | `src/xrpld/rpc/ServerHandler.h` | Yes | 2 |
| 9 | `src/xrpld/app/main/GRPCServer.cpp` | Yes | 2 |
| 10 | `src/xrpld/app/main/GRPCServer.h` | Yes | 2 |
| 11 | `src/xrpld/rpc/handlers/RipplePathFind.cpp` | Yes | 3 |
| 12 | `src/test/core/Coroutine_test.cpp` | Yes | 3 |
| 13 | `src/test/core/JobQueue_test.cpp` | Yes | 3 |
| 14 | `src/test/app/Path_test.cpp` | Yes | 3 |
| 15 | `src/test/jtx/impl/AMMTest.cpp` | Yes | 3 |
| 16 | `src/xrpld/rpc/README.md` | Yes | 4 (update docs) |
## Appendix B: New Files to Create
| # | File | Phase | Purpose |
| --- | ------------------------------------- | ----- | ---------------------------------------- |
| 1 | `include/xrpl/core/CoroTask.h` | 1 | `CoroTask` return type + promise_type |
| 2 | `include/xrpl/core/JobQueueAwaiter.h` | 1 | Awaiter for scheduling on JobQueue |
| 3 | `src/test/core/CoroTask_test.cpp` | 1 | Unit tests for new primitives |