mirror of https://github.com/XRPLF/rippled.git synced 2026-03-01 02:02:34 +00:00

Files

Pratik Mankawde 185921ea94 comments and document update

Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>

2026-02-27 16:44:58 +00:00

78 KiB

Raw Blame History

Boost.Coroutine to C++20 Standard Coroutines Migration Plan

Project: rippled (XRP Ledger node) Branch: pratik/Switch-to-std-coroutines Date: 2026-02-25 Status: Planning

Research & Analysis
Current State Assessment
Migration Strategy
Implementation Plan
Testing & Validation Strategy
Risks & Mitigation
Timeline & Milestones
Standards & Guidelines
Task List

1. Research & Analysis

1.1 Stackful (Boost.Coroutine) vs Stackless (C++20) Architecture

graph TD
    subgraph Boost["Boost.Coroutine2 (Stackful)"]
        direction TB
        B1["Coroutine Created"]
        B2["1 MB Stack Allocated"]
        B3["Full Call Stack Available"]
        B4["yield&lpar;&rpar; from ANY<br/>nesting depth"]
        B5["Context Switch:<br/>save/restore registers<br/>+ stack pointer<br/>~40-100 CPU cycles"]
        B1 --> B2 --> B3 --> B4 --> B5
    end

    subgraph Std["C++20 Coroutines (Stackless)"]
        direction TB
        S1["Coroutine Created"]
        S2["200-500 B Frame on Heap"]
        S3["No Dedicated Stack"]
        S4["co_await ONLY at<br/>explicit suspension points"]
        S5["Context Switch:<br/>resume via function call<br/>symmetric transfer / tail-call<br/>~20-50 CPU cycles"]
        S1 --> S2 --> S3 --> S4 --> S5
    end

Boost (right): Each coroutine gets a full 1 MB stack. Suspension saves the entire register set and stack pointer, so yield() can be called from any nesting depth — the whole call chain is preserved. The cost is high per-coroutine memory and a heavier context switch (~40-100 cycles for fcontext save/restore).

C++20 (left): The compiler allocates a small heap frame (200-500 bytes) holding only the local variables that live across suspension points. There is no dedicated stack — suspension is only allowed at explicit co_await expressions in the immediate coroutine function. Resumption is a normal function call (symmetric transfer makes it a tail-call), costing ~20-50 cycles. The trade-off is that nested functions that need to suspend must themselves be coroutines.

1.2 API & Programming Model Comparison

Aspect	Boost.Coroutine2 (Current)	C++20 Coroutines (Target)
Type	Stackful, asymmetric. Each coroutine carries its own call stack, and control transfers between a parent (caller) and a child (coroutine) — never between two siblings directly.	Stackless, asymmetric. The compiler transforms the coroutine function into a state machine allocated on the heap. The same parent/child asymmetry applies, but there is no separate stack.
Stack Model	Dedicated 1 MB stack per coroutine. Allocated at construction via `boost::context::fixedsize_stack`. The full stack is reserved even if the coroutine only uses a few hundred bytes, leading to high memory overhead under concurrency.	Heap frame of ~200-500 bytes. The compiler allocates only the local variables that live across suspension points into a coroutine frame on the heap. The frame may be elided entirely if the compiler can prove the coroutine's lifetime is bounded by its caller.
Suspension	*`(yield_)()` — can yield from any call depth.** Because the coroutine has its own stack, a call chain `fn_a() → fn_b() → yield()` suspends the entire stack. The `yield_` pointer is a `push_type*` provided by Boost.	`co_await expr` — only at explicit suspension points. Suspension is only possible in the immediate coroutine function body. If a nested regular function needs to suspend, it must itself be refactored into a coroutine returning an awaitable.
Resumption	`coro_()` — resumes from last yield. Calling the `pull_type` object switches back to the coroutine's stack and continues execution right after the last `yield()` call.	`handle.resume()` — resumes from last co_await. The `std::coroutine_handle<>` is a lightweight pointer to the coroutine frame. Calling `.resume()` jumps to the suspension point via a function-call dispatch (no stack switch).
Creation	`pull_type` constructor auto-starts the coroutine. When a `pull_type` is constructed, it immediately transfers control into the coroutine body, which runs until its first `yield()`. The caller must account for this eager start.	Calling a coroutine function returns a suspended handle. The function body does NOT execute until `handle.resume()` is called (when `initial_suspend()` returns `suspend_always`). This lazy-start model gives the caller full control over when execution begins.
Completion Check	`static_cast<bool>(coro_)` returns false when done. The `pull_type` is contextually convertible to `bool`; it becomes false after the coroutine body returns.	`handle.done()` returns true when done. A direct query on the coroutine handle. Calling `resume()` after `done()` is true is undefined behavior.
Value Passing	Typed via `pull_type<T>` / `push_type<T>`. Values are exchanged through the coroutine's type parameter — `pull_type<T>` pulls values out, `push_type<T>` pushes values in. rippled uses `<void>` (no values exchanged).	Via `promise_type::return_value(T)` or `co_return`. Values are stored in the promise object inside the coroutine frame. The caller retrieves them through `await_resume()`. For void coroutines, `return_void()` is used instead.
Exception Handling	Natural stack-based propagation. An exception thrown inside the coroutine unwinds its stack normally and propagates to the caller at the `pull_type` call site (i.e., whoever called `coro_()`).	Explicit capture via `promise_type::unhandled_exception()`. Exceptions thrown in the coroutine body are caught by the promise and stored (typically as `std::exception_ptr`). They are rethrown in `await_resume()` when the caller co_awaits the result.
Cancellation	Application-managed (poll a flag). There is no built-in cancellation. rippled uses `expectEarlyExit()` to mark a coroutine as abandoned during shutdown, then decrements `nSuspend_` so `JobQueue::stop()` can proceed.	Via `await_ready()` or cancellation tokens. An awaiter can check a cancellation flag in `await_ready()` and return true to skip suspension. Alternatively, `std::stop_token` patterns (C++20) can be threaded through. Our `JobQueueAwaiter` returns false from `await_suspend()` when the JobQueue is stopping, effectively cancelling the suspend.
Keywords	None (library-only). All coroutine machinery is expressed through library types (`pull_type`, `push_type`) and regular function calls. No special language syntax required.	`co_await`, `co_yield`, `co_return`. The presence of any of these keywords in a function body makes it a coroutine. The compiler generates the state machine, frame allocation, and suspension/resumption code automatically.
Standard	Boost library (not ISO C++). `Boost.Coroutine` is deprecated in favor of `Boost.Coroutine2`, which itself has no active development. Depends on `Boost.Context` for platform-specific assembly-level stack switching.	ISO C++20 standard. Part of the language specification. Supported by all major compilers (GCC 11+, Clang 14+, MSVC 19.28+). Tooling, debugger support, and static analysis are steadily improving across the ecosystem.

1.3 Performance Characteristics

Metric	Boost.Coroutine2	C++20 Coroutines
Memory per coroutine	~1MB (fixed stack)	~200-500 bytes (frame only)
1000 concurrent coroutines	~1 GB	~0.5 MB
Context switch cost	~40-100 CPU cycles (fcontext save/restore)	~20-50 CPU cycles (function call)
Allocation	Stack allocated at creation	Heap allocation (compiler may elide)
Cache behavior	Poor (large stack rarely fully used)	Good (small frame, hot data close)
Compiler optimization	Opaque to compiler	Inlinable, optimizable

1.4 Feature Parity Analysis

Suspension Points

Boost: Can yield from any nesting level — fn_a() calls fn_b() calls yield(). The entire call stack is preserved.
C++20: Suspension only at co_await expressions in the immediate coroutine function. Nested functions that need to suspend must themselves be coroutines returning awaitables.
Impact: (Assumption, needs confirmation from people who know the code better) Rippled's usage is shallow — yield() is called directly from the RPC handler lambda, never from deeply nested code. This makes migration straightforward.

Boost — yield from coroutine body, resume later via post():

jq.postCoro(jtCLIENT, "Handler", [&](auto const& coro) {
    auto result = doFirstHalf();
    coro->yield();          // suspend — entire stack preserved
    // resumes here when coro->post() is called externally
    doSecondHalf(result);
});

C++20 — co_await suspends, JobQueueAwaiter combines yield + auto-repost:

jq.postCoroTask(jtCLIENT, "Handler", [&](auto runner) -> CoroTask<void> {
    auto result = doFirstHalf();
    co_await JobQueueAwaiter{runner};  // suspend + auto-repost
    // resumes on a worker thread when the job is picked up
    doSecondHalf(result);
    co_return;
});

Key difference — Boost can yield from nested calls; C++20 cannot:

// Boost — works: yield from inside a helper function
void helper(std::shared_ptr<JobQueue::Coro> coro) {
    coro->yield();  // OK — stackful, entire call stack is preserved
}
jq.postCoro(jtCLIENT, "Deep", [](auto coro) { helper(coro); });

// C++20 — does NOT work: regular functions cannot co_await
void helper(std::shared_ptr<CoroTaskRunner> runner) {
    co_await runner->suspend();  // COMPILE ERROR — not a coroutine
}
// FIX: helper must itself be a coroutine returning CoroTask<void>
CoroTask<void> helper(std::shared_ptr<CoroTaskRunner> runner) {
    co_await runner->suspend();  // OK — this is a coroutine
    co_return;
}

Exception Handling

Boost: Exceptions propagate naturally up the call stack across yield points.
C++20: Exceptions in coroutine body are caught by promise_type::unhandled_exception(). Must be explicitly stored and rethrown.
Impact: Need to implement unhandled_exception() in promise type. Pattern is well-established.

Boost — exceptions propagate naturally through yield():

jq.postCoro(jtCLIENT, "Risky", [](auto coro) {
    coro->yield();
    throw std::runtime_error("oops");
    // Exception propagates up the coroutine stack naturally.
    // The Coro::resume() caller sees it when the coroutine unwinds.
});

C++20 — exceptions are captured by promise_type and rethrown on co_await:

// Inner coroutine throws
CoroTask<int> failingOp() {
    throw std::runtime_error("oops");
    co_return 0;  // never reached
}

// Outer coroutine catches — exception crosses coroutine boundary via promise
jq.postCoroTask(jtCLIENT, "Caller", [](auto runner) -> CoroTask<void> {
    try {
        int v = co_await failingOp();  // rethrows here
    } catch (std::runtime_error const& e) {
        // e.what() == "oops" — caught across coroutine boundary
    }
    co_return;
});

Key difference — C++20 requires explicit plumbing, but it's already wired up:

// Inside CoroTask<void>::promise_type (already implemented):
void unhandled_exception() {
    exception_ = std::current_exception();  // capture
}
// Inside CoroTask<void>::await_resume() (already implemented):
void await_resume() {
    if (auto& ep = handle_.promise().exception_)
        std::rethrow_exception(ep);          // rethrow to caller
}

Cancellation

Boost: rippled uses expectEarlyExit() for graceful shutdown — not a general cancellation mechanism.
C++20: Can check cancellation in await_ready() before suspension, or via stop_token patterns.
Impact: C++20 provides strictly better cancellation support.

Boost — expectEarlyExit() for cleanup when coroutine never ran:

auto coro = std::make_shared<Coro>(create, jq, t, name, fn);
if (!coro->post()) {
    // JobQueue is stopping — coroutine will never run.
    // Must manually decrement nSuspend_ so shutdown doesn't hang.
    coro->expectEarlyExit();
    coro.reset();
}

No cooperative in-body cancellation — coroutine just runs to completion or gets abandoned.

C++20 — expectEarlyExit() for the same case, plus cooperative in-body checking:

// Same early-exit pattern when post() fails:
auto runner = CoroTaskRunner::create(jq, t, name);
runner->init(fn);
++nSuspend_;
if (!runner->post()) {
    runner->expectEarlyExit();  // decrements nSuspend_, destroys frame
    runner.reset();
}

// Cooperative cancellation — coroutine checks jq.isStopping() after each yield:
jq.postCoroTask(jtCLIENT, "Long", [jqp = &jq](auto runner) -> CoroTask<void> {
    while (hasWork()) {
        co_await JobQueueAwaiter{runner};
        if (jqp->isStopping())
            co_return;              // graceful exit
        doNextChunk();
    }
    co_return;
});

C++20 bonus — JobQueueAwaiter::await_suspend() handles shutdown automatically:

bool await_suspend(std::coroutine_handle<>) {
    runner->onSuspend();
    if (!runner->post()) {
        // JQ stopping — undo suspend, return false so coroutine
        // continues immediately (can fall through to co_return)
        runner->onUndoSuspend();
        return false;
    }
    return true;  // actually suspend
}

1.5 Compiler Support

Compiler	rippled Minimum	C++20 Coroutine Support	Status
GCC	12.0+	Full (since GCC 11)	Ready
Clang	16.0+	Full (since Clang 14)	Ready
MSVC	19.28+	Full (since VS2019 16.8)	Ready

rippled already requires C++20 (CMAKE_CXX_STANDARD 20 in CMakeLists.txt). All supported compilers have mature C++20 coroutine support. No compiler upgrades required.

1.6 Viability Analysis — Addressing Stackless Concerns

C++20 stackless coroutines have well-known limitations compared to stackful coroutines. This section analyzes each concern against rippled's actual codebase to determine viability.

Concern 1: Cannot Suspend from Nested Call Stacks

Claim: Stackless coroutines cannot yield from arbitrary stack depths. If fn_a() calls fn_b() calls yield(), only stackful coroutines can suspend the entire chain.

Analysis: An exhaustive codebase audit found:

1 production yield() call: RipplePathFind.cpp:131 — directly in the handler function body
All test yield() calls: directly in postCoro lambda bodies (Coroutine_test.cpp, JobQueue_test.cpp)
The push_type* architecture makes deep-nested yield() structurally impossible — the yield_ pointer is only available inside the postCoro lambda via the shared_ptr<Coro>, and handlers call context.coro->yield() at the top level

graph LR
    subgraph Stackful["Stackful &lpar;Boost&rpar; — can yield anywhere"]
        direction TB
        A1["postCoro lambda"] --> A2["handlerFn&lpar;&rpar;"]
        A2 --> A3["helperFn&lpar;&rpar;"]
        A3 --> A4["coro→yield&lpar;&rpar; ✅"]
    end

    subgraph Stackless["Stackless &lpar;C++20&rpar; — co_await at top only"]
        direction TB
        B1["postCoroTask lambda"] --> B2["co_await ✅"]
        B1 --> B3["regularFn&lpar;&rpar;"]
        B3 -.-> B4["co_await ❌"]
    end

    subgraph Rippled["rippled actual usage — all shallow"]
        direction TB
        C1["postCoro lambda"] --> C2["context.coro→yield&lpar;&rpar;<br/>&lpar;direct, no nesting&rpar;"]
    end

    style A4 fill:#f96,stroke:#333,color:#000
    style B4 fill:#f66,stroke:#333,color:#fff
    style C2 fill:#3d8,stroke:#333,color:#000

Verdict: This concern does NOT apply. All suspension is shallow.

Concern 2: Colored Function Problem (Viral co_await)

Claim: Once a function needs to suspend, every caller up the chain must also be a coroutine. This "infects" the call chain.

Analysis: In rippled's case, the coloring is minimal:

postCoroTask() launches a coroutine — this is the "root" colored function
The postCoro lambda itself becomes the coroutine function (returns CoroTask<void>)
doRipplePathFind() is the only handler that calls co_await
No other handler in the chain needs to become a coroutine — they continue to be regular functions dispatched through doCommand()

The "coloring" stops at the entry point lambda and the one handler that suspends. No deep infection.

graph TD
    subgraph Feared["Feared: deep coloring infection"]
        direction TB
        F1["main&lpar;&rpar;"] -->|"must become<br/>coroutine"| F2["Server::run&lpar;&rpar;"]
        F2 -->|"must become<br/>coroutine"| F3["dispatch&lpar;&rpar;"]
        F3 -->|"must become<br/>coroutine"| F4["doCommand&lpar;&rpar;"]
        F4 -->|"must become<br/>coroutine"| F5["handler&lpar;&rpar;"]
        F5 --> F6["co_await"]
        style F1 fill:#f66,stroke:#333,color:#fff
        style F2 fill:#f66,stroke:#333,color:#fff
        style F3 fill:#f66,stroke:#333,color:#fff
        style F4 fill:#f66,stroke:#333,color:#fff
        style F5 fill:#f66,stroke:#333,color:#fff
    end

    subgraph Actual["Actual: coloring stops at entry point"]
        direction TB
        A1["main&lpar;&rpar;"] --- A2["Server::run&lpar;&rpar;"]
        A2 --- A3["dispatch&lpar;&rpar;"]
        A3 --- A4["doCommand&lpar;&rpar;"]
        A4 -->|"only this<br/>is a coroutine"| A5["postCoroTask lambda<br/>→ CoroTask&lt;void&gt;"]
        A5 --> A6["co_await"]
        style A1 fill:#eee,stroke:#333,color:#000
        style A2 fill:#eee,stroke:#333,color:#000
        style A3 fill:#eee,stroke:#333,color:#000
        style A4 fill:#eee,stroke:#333,color:#000
        style A5 fill:#f96,stroke:#333,color:#000
        style A6 fill:#3d8,stroke:#333,color:#000
    end

Verdict: Minimal impact. Only 4 lambdas (3 entry points + 1 handler) need co_await.

Concern 3: No Standard Library Support for Common Patterns

Claim: C++20 provides the language primitives but no standard task type, executor integration, or composition utilities.

Analysis: This is accurate — we need to write custom types:

CoroTask<T> (task/return type) — well-established pattern, ~80 lines
JobQueueAwaiter (executor integration) — ~20 lines
FinalAwaiter (continuation chaining) — ~10 lines

However, these types are small, well-understood, and have extensive reference implementations (cppcoro, folly::coro, libunifex). The total boilerplate is approximately 150-200 lines of header code.

graph TD
    subgraph StdLib["C++20 standard provides"]
        direction LR
        S1["coroutine_handle&lt;P&gt;"]
        S2["suspend_always /<br/>suspend_never"]
        S3["noop_coroutine&lpar;&rpar;"]
        S4["co_await / co_return"]
    end

    subgraph Custom["Custom types we wrote &lpar;~150 lines total&rpar;"]
        direction TB
        C1["CoroTask&lt;T&gt;<br/>~80 lines<br/>&lpar;task + promise_type&rpar;"]
        C2["JobQueueAwaiter<br/>~20 lines<br/>&lpar;suspend + repost&rpar;"]
        C3["FinalAwaiter<br/>~10 lines<br/>&lpar;symmetric transfer&rpar;"]
        C4["CoroTaskRunner<br/>~40 lines decl<br/>&lpar;lifecycle manager&rpar;"]
    end

    subgraph Ref["Reference implementations"]
        direction LR
        R1["cppcoro"]
        R2["folly::coro"]
        R3["libunifex"]
    end

    S1 --> C1
    S2 --> C1
    S3 --> C3
    S4 --> C2
    Ref -.->|"patterns<br/>borrowed from"| Custom

Verdict: Manageable. Custom types are small and well-documented in C++ community.

Concern 4: Stack Overflow from Synchronous Resumption Chains

Claim: If coroutine A co_awaits coroutine B, and B completes synchronously, B's final_suspend resumes A on the same stack, potentially building up unbounded stack depth.

Analysis: This is addressed by symmetric transfer via FinalAwaiter::await_suspend() returning a coroutine_handle<> instead of void. The compiler transforms this into a tail-call, preventing stack growth. This is the standard solution used by all major coroutine libraries and is implemented in our FinalAwaiter design (Section 4.1).

graph TD
    subgraph Problem["Without symmetric transfer — stack grows"]
        direction TB
        P1["A resumes B"] --> P2["B::resume&lpar;&rpar;<br/>stack frame +1"]
        P2 --> P3["B completes<br/>final_suspend resumes A"]
        P3 --> P4["A::resume&lpar;&rpar;<br/>stack frame +2"]
        P4 --> P5["A resumes C"]
        P5 --> P6["C::resume&lpar;&rpar;<br/>stack frame +3"]
        P6 --> P7["... stack overflow ❌"]
        style P7 fill:#f66,stroke:#333,color:#fff
    end

    subgraph Solution["With symmetric transfer — tail call, no growth"]
        direction TB
        S1["A resumes B"] --> S2["B::resume&lpar;&rpar;<br/>stack frame 1"]
        S2 --> S3["B completes"]
        S3 -->|"FinalAwaiter returns<br/>handle → tail call"| S4["A::resume&lpar;&rpar;<br/>stack frame 1 &lpar;reused&rpar;"]
        S4 --> S5["A resumes C"]
        S5 -->|"tail call"| S6["C::resume&lpar;&rpar;<br/>stack frame 1 &lpar;reused&rpar;"]
        S6 --> S7["... bounded ✅"]
        style S7 fill:#3d8,stroke:#333,color:#000
    end

Verdict: Solved by symmetric transfer (already in our design).

Concern 5: Dangling Reference Risk

Claim: Coroutine frames are heap-allocated and outlive the calling scope, making references to locals dangerous.

Analysis: This is a real concern that requires engineering discipline:

Coroutine parameters are copied into the frame (safe by default)
References passed to coroutine functions can dangle if the referent's scope ends before the coroutine completes
Our design mitigates this: RPC::Context is passed by reference but its lifetime is managed by shared_ptr<Coro> / the entry point lambda's scope, which outlives the coroutine

graph TD
    subgraph Danger["Dangling reference — ❌ use-after-free"]
        direction TB
        D1["caller&lpar;&rpar; scope"] --> D2["int local = 42"]
        D2 --> D3["launch coroutine<br/>with &amp;local"]
        D3 --> D4["caller returns<br/>local destroyed"]
        D4 --> D5["coroutine resumes<br/>reads &amp;local 💥"]
        style D5 fill:#f66,stroke:#333,color:#fff
    end

    subgraph Safe["rippled pattern — ✅ lifetime managed"]
        direction TB
        S1["postCoroTask&lpar;&rpar;"] --> S2["shared_ptr&lt;Runner&gt;<br/>owns coroutine frame"]
        S2 --> S3["lambda captures<br/>by value or<br/>shared_ptr"]
        S3 --> S4["FuncStore keeps<br/>lambda alive on heap"]
        S4 --> S5["coroutine resumes<br/>captures still valid ✅"]
        style S5 fill:#3d8,stroke:#333,color:#000
    end

Verdict: Real risk, but manageable with RAII patterns and ASAN testing.

Concern 6: yield_to.h / boost::asio::spawn

Claim: yield_to.h:111 uses boost::asio::spawn, suggesting broader coroutine usage.

Analysis: yield_to.h uses boost::asio::spawn with boost::context::fixedsize_stack(2 * 1024 * 1024) — this is a completely separate coroutine system:

Different type: boost::asio::yield_context (not push_type*)
Different purpose: test infrastructure for async I/O tests
Different mechanism: Boost.Asio stackful coroutines (not Boost.Coroutine2)
Not part of this migration scope — used only in tests and unrelated to JobQueue::Coro

graph TD
    subgraph ThisMigration["This migration &lpar;JobQueue::Coro&rpar;"]
        direction TB
        M1["Boost.Coroutine2<br/>push_type / pull_type"] -->|"replace with"| M2["C++20 coroutines<br/>CoroTask + co_await"]
        M3["JobQueue::Coro"] -->|"replace with"| M4["CoroTaskRunner"]
        M5["coro→yield&lpar;&rpar; + post&lpar;&rpar;"] -->|"replace with"| M6["JobQueueAwaiter"]
    end

    subgraph OutOfScope["Out of scope &lpar;yield_to.h&rpar;"]
        direction TB
        O1["boost::asio::spawn"]
        O2["yield_context"]
        O3["fixedsize_stack&lpar;2MB&rpar;"]
        O1 --- O2
        O1 --- O3
    end

    ThisMigration ~~~ OutOfScope
    style OutOfScope fill:#eee,stroke:#999

Verdict: Separate system. Out of scope for this migration.

Overall Viability Conclusion

The migration IS viable because:

rippled's coroutine usage is shallow (no deep-nested yield)
The colored function infection is limited to 4 call sites
Custom types are small and well-understood
Symmetric transfer solves the stack overflow concern
ASAN/TSAN testing catches lifetime and race bugs
The alternative (ASAN annotations for Boost.Context) only addresses sanitizer false positives — it does not provide memory savings, standard compliance, or the dependency elimination that C++20 migration delivers

1.7 Merits & Demerits Summary

Merits of C++20 Migration

2000x memory reduction per coroutine (1MB → ~500 bytes)
Faster context switching (~2x improvement)
Remove external dependency on Boost.Coroutine (and transitively Boost.Context)
Language-native — better tooling, debugger support, static analysis
Future-proof — ISO standard, not a deprecated library
Compiler-optimizable — suspension points can be inlined/elided
ASAN compatibility — eliminates Boost context-switching false positives (see docs/build/sanitizers.md)

Demerits / Challenges

Stackless limitation — cannot yield from nested calls (verified: not an issue for rippled's shallow usage)
Explicit lifetime management — coroutine_handle::destroy() must be called (mitigated by RAII CoroTask)
Verbose boilerplate — promise_type, awaiter interfaces (~150-200 lines of infrastructure code)
Debugging — no visible coroutine stack in debugger (improving with tooling)
Learning curve — team needs familiarity with C++20 coroutine machinery
Dangling reference risk — coroutine frames outlive calling scope (mitigated by ASAN + careful design)
No standard library task type — must write custom CoroTask, awaiters (well-established patterns exist)

Alternative Considered: ASAN Annotations Only

Instead of full migration, one could keep Boost.Coroutine and add __sanitizer_start_switch_fiber / __sanitizer_finish_switch_fiber annotations to Coro.ipp to suppress ASAN false positives. This was evaluated and rejected because:

It only fixes sanitizer false positives — does NOT reduce 1MB/coroutine memory usage
Does NOT remove the deprecated Boost.Coroutine dependency
Does NOT provide standard compliance or future-proofing
The full migration is feasible given shallow yield usage and delivers all the above benefits

2. Current State Assessment

2.1 Architecture Overview

graph TD
    subgraph "Request Entry Points"
        HTTP["HTTP Request<br/>ServerHandler::onRequest()"]
        WS["WebSocket Message<br/>ServerHandler::onWSMessage()"]
        GRPC["gRPC Request<br/>CallData::process()"]
    end

    subgraph "Coroutine Layer"
        POST["JobQueue::postCoro()<br/>Creates Coro + schedules job"]
        CORO["JobQueue::Coro<br/>boost::coroutines::pull_type<br/>1MB stack per instance"]
    end

    subgraph "JobQueue Thread Pool"
        W1["Worker Thread 1"]
        W2["Worker Thread 2"]
        WN["Worker Thread N"]
    end

    subgraph "RPC Handlers"
        CTX["RPC::Context<br/>holds shared_ptr#lt;Coro#gt;"]
        RPC["RPC Handler<br/>e.g. doRipplePathFind"]
        YIELD["coro.yield()<br/>Suspends execution"]
        RESUME["coro.post()<br/>Reschedules on JobQueue"]
    end

    HTTP --> POST
    WS --> POST
    GRPC --> POST
    POST --> CORO
    CORO --> W1
    CORO --> W2
    CORO --> WN
    W1 --> CTX
    W2 --> CTX
    CTX --> RPC
    RPC --> YIELD
    YIELD -.->|"event completes"| RESUME
    RESUME --> W1

2.2 `JobQueue::Coro` Implementation Audit

File: include/xrpl/core/JobQueue.h (lines 40-120) + include/xrpl/core/Coro.ipp

Class Members

class Coro : public std::enable_shared_from_this<Coro> {
    detail::LocalValues lvs_;                                         // Per-coroutine thread-local storage
    JobQueue& jq_;                                                    // Parent JobQueue reference
    JobType type_;                                                    // Job type (jtCLIENT_RPC, etc.)
    std::string name_;                                                // Name for logging
    bool running_;                                                    // Is currently executing
    std::mutex mutex_;                                                // Prevents concurrent resume
    std::mutex mutex_run_;                                            // Guards running_ flag
    std::condition_variable cv_;                                      // For join() blocking
    boost::coroutines::asymmetric_coroutine<void>::pull_type coro_;   // THE BOOST COROUTINE
    boost::coroutines::asymmetric_coroutine<void>::push_type* yield_; // Yield function pointer
    bool finished_;                                                   // Debug assertion flag
};

Boost.Coroutine APIs Used

API	Location	Purpose
`asymmetric_coroutine<void>::pull_type`	`JobQueue.h:51`	The coroutine object itself
`asymmetric_coroutine<void>::push_type`	`JobQueue.h:52`	Yield function type
`boost::coroutines::attributes(megabytes(1))`	`Coro.ipp:23`	Stack size configuration
`#include <boost/coroutine/all.hpp>`	`JobQueue.h:10`	Header inclusion

Method Behaviors

Method	Behavior
Constructor	Creates `pull_type` with 1MB stack. Lambda captures user function. Auto-runs to first `yield()`.
`yield()`	Increments `jq_.nSuspend_`, calls `(*yield_)()` to suspend. Returns control to caller.
`post()`	Sets `running_=true`, calls `jq_.addJob()` with a lambda that calls `resume()`. Returns false if JobQueue is stopping.
`resume()`	Swaps `LocalValues`, acquires `mutex_`, calls `coro_()` to resume. Restores `LocalValues`. Sets `running_=false`, notifies `cv_`.
`runnable()`	Returns `static_cast<bool>(coro_)` — true if coroutine hasn't returned.
`expectEarlyExit()`	Decrements `nSuspend_`, sets `finished_=true`. Used during shutdown.
`join()`	Blocks on `cv_` until `running_==false`.

2.3 Coroutine Execution Lifecycle

sequenceDiagram
    participant HT as Handler Thread
    participant JQ as JobQueue
    participant WT as Worker Thread
    participant C as Coro
    participant UF as User Function

    HT->>JQ: postCoro(type, name, fn)
    JQ->>C: Coro::Coro() constructor
    Note over C: pull_type auto-starts lambda
    C->>C: yield_ = #amp;do_yield
    C->>C: yield() [initial suspension]
    C-->>JQ: Returns to constructor
    JQ->>JQ: coro->post()
    JQ->>JQ: addJob(type, name, resume_lambda)
    JQ-->>HT: Returns shared_ptr#lt;Coro#gt;
    Note over HT: Handler thread is FREE

    WT->>C: resume() [job executes]
    Note over C: Swap LocalValues
    C->>C: coro_() [resume boost coroutine]
    C->>UF: fn(shared_from_this())
    UF->>UF: Do work...

    UF->>C: coro->yield() [suspend]
    Note over C: ++nSuspend_, invoke yield_()
    C-->>WT: Returns from resume()
    Note over WT: Worker thread is FREE

    Note over UF: External event completes
    UF->>C: coro->post() [reschedule]
    C->>JQ: addJob(resume_lambda)

    WT->>C: resume() [job executes]
    C->>C: coro_() [resume]
    C->>UF: Continues after yield()
    UF->>UF: Finish work
    UF-->>C: Return [coroutine complete]
    Note over C: running_=false, cv_.notify_all()

2.4 All Coroutine Touchpoints

Core Infrastructure (Must Change)

File	Role	Lines of Interest
`include/xrpl/core/JobQueue.h`	Coro class definition, postCoro template	Lines 10, 40-120, 385-402
`include/xrpl/core/Coro.ipp`	Coro method implementations	All (122 lines)
`include/xrpl/basics/LocalValue.h`	Per-coroutine thread-local storage	Lines 12-59 (LocalValues)
`cmake/deps/Boost.cmake`	Boost.Coroutine dependency	Lines 7, 24

Entry Points (postCoro Callers)

File	Entry Point	Job Type
`src/xrpld/rpc/detail/ServerHandler.cpp:287`	`onRequest()` — HTTP RPC	`jtCLIENT_RPC`
`src/xrpld/rpc/detail/ServerHandler.cpp:325`	`onWSMessage()` — WebSocket	`jtCLIENT_WEBSOCKET`
`src/xrpld/app/main/GRPCServer.cpp:102`	`CallData::process()` — gRPC	`jtRPC`

Context Propagation

File	Role
`src/xrpld/rpc/Context.h:27`	`RPC::Context` holds `shared_ptr<JobQueue::Coro> coro`
`src/xrpld/rpc/ServerHandler.h:174-188`	`processSession/processRequest` pass coro through

Active Coroutine Consumer (yield/post)

File	Usage
`src/xrpld/rpc/handlers/RipplePathFind.cpp:131`	`context.coro->yield()` — suspends for path-finding
`src/xrpld/rpc/handlers/RipplePathFind.cpp:116-123`	Continuation calls `coro->post()` or `coro->resume()`

Test Files

File	Tests
`src/test/core/Coroutine_test.cpp`	`correct_order`, `incorrect_order`, `thread_specific_storage`
`src/test/core/JobQueue_test.cpp`	`testPostCoro` (post/resume cycles, shutdown behavior)
`src/test/app/Path_test.cpp`	Path-finding RPC via postCoro
`src/test/jtx/impl/AMMTest.cpp`	AMM RPC via postCoro

2.5 Suspension/Continuation Model

The current model documented in src/xrpld/rpc/README.md defines four functional types:

Callback     = std::function<void()>           — generic 0-arg function
Continuation = std::function<void(Callback)>   — calls Callback later
Suspend      = std::function<void(Continuation)> — runs Continuation, suspends
Coroutine    = std::function<void(Suspend)>    — given a Suspend, starts work

In practice, JobQueue::Coro simplifies this to:

Suspend = coro->yield()
Continue = coro->post() (async on JobQueue) or coro->resume() (sync on current thread)

2.6 CMake Dependency

In cmake/deps/Boost.cmake:

find_package(Boost REQUIRED COMPONENTS ... coroutine ...)
target_link_libraries(xrpl_boost INTERFACE ... Boost::coroutine ...)

Additionally in cmake/XrplInterface.cmake:

BOOST_COROUTINES_NO_DEPRECATION_WARNING  // Suppresses Boost.Coroutine deprecation warnings

2.7 Existing C++20 Coroutine Usage

rippled already uses C++20 coroutines in test code:

src/tests/libxrpl/net/HTTPClient.cpp uses co_await with boost::asio::use_awaitable
Demonstrates team familiarity with C++20 coroutine syntax
Proves compiler toolchain supports C++20 coroutines

3. Migration Strategy

3.1 Incremental vs Atomic Migration

Decision: Incremental (multi-phase) migration.

Rationale:

Only one RPC handler (RipplePathFind) actively uses yield()/post() suspension
The three entry points (HTTP, WS, gRPC) all funnel through postCoro()
The RPC::Context.coro field is the sole propagation mechanism
We can introduce a new C++20 coroutine system alongside the existing one and migrate callsites incrementally

3.2 Migration Phases

graph TD
    subgraph "Phase 1: Foundation"
        P1A["Create CoroTask#lt;T#gt; type<br/>(promise_type, awaiter)"]
        P1B["Create JobQueueAwaiter<br/>(schedules resume on JobQueue)"]
        P1C["Add postCoroTask() to JobQueue<br/>(parallel to postCoro)"]
        P1D["Unit tests for new primitives"]
        P1A --> P1B --> P1C --> P1D
    end

    subgraph "Phase 2: Entry Point Migration"
        P2A["Migrate ServerHandler::onRequest()"]
        P2B["Migrate ServerHandler::onWSMessage()"]
        P2C["Migrate GRPCServer::CallData::process()"]
        P2D["Update RPC::Context to use new type"]
        P2A --> P2D
        P2B --> P2D
        P2C --> P2D
    end

    subgraph "Phase 3: Handler Migration"
        P3A["Migrate RipplePathFind handler"]
        P3B["Verify all other handlers<br/>(no active yield usage)"]
    end

    subgraph "Phase 4: Cleanup"
        P4A["Remove old Coro class"]
        P4B["Remove Boost.Coroutine from CMake"]
        P4C["Remove deprecation warning suppression"]
        P4D["Final benchmarks & validation"]
    end

    P1D --> P2A
    P2D --> P3A
    P3B --> P4A
    P3A --> P4A
    P4A --> P4B --> P4C --> P4D

3.3 Coexistence Strategy

During migration, both implementations will coexist:

graph LR
    subgraph "Transition Period"
        OLD["JobQueue::Coro<br/>(Boost, existing)"]
        NEW["JobQueue::CoroTask<br/>(C++20, new)"]
        CTX["RPC::Context"]
    end

    CTX -->|"phase 1-2"| OLD
    CTX -->|"phase 2-3"| NEW

    style OLD fill:#fdd,stroke:#c00,color:#000
    style NEW fill:#dfd,stroke:#0a0,color:#000

RPC::Context will temporarily hold both shared_ptr<Coro> (old) and the new coroutine handle
Entry points will be migrated one at a time
Each migration is independently testable
Once all entry points and handlers are migrated, old code is removed

3.4 Breaking Changes & Compatibility

Concern	Impact	Mitigation
`RPC::Context::coro` type change	All RPC handlers receive context	Migrate context field last, after all consumers updated
`postCoro()` removal	3 callers	Replace with `postCoroTask()`, remove old API in Phase 4
`LocalValue` integration	Thread-local storage must work	New implementation must swap LocalValues identically
Shutdown behavior	`expectEarlyExit()`, `nSuspend_` tracking	Replicate in new CoroTask

4. Implementation Plan

4.1 New Type Design

`CoroTask<T>` — Coroutine Return Type

classDiagram
    class CoroTask~T~ {
        +Handle handle_
        +CoroTask(Handle h)
        +destroy()
        +bool done() const
        +T get() const
        +bool await_ready() const
        +void await_suspend(coroutine_handle h) const
        +T await_resume() const
    }

    class promise_type {
        -result_ : variant~T, exception_ptr~
        -continuation_ : coroutine_handle
        +CoroTask get_return_object()
        +suspend_always initial_suspend()
        +FinalAwaiter final_suspend()
        +void return_value(T)
        +void return_void()
        +void unhandled_exception()
    }

    class FinalAwaiter {
        +bool await_ready()
        +coroutine_handle await_suspend(coroutine_handle~promise_type~)
        +void await_resume()
    }

    class JobQueueAwaiter {
        -jq_ : JobQueue
        -type_ : JobType
        -name_ : string
        +bool await_ready()
        +void await_suspend(coroutine_handle h)
        +void await_resume()
    }

    CoroTask --> promise_type : contains
    promise_type --> FinalAwaiter : returns from final_suspend
    CoroTask ..> JobQueueAwaiter : used with co_await

`JobQueueAwaiter` — Schedules Resumption on JobQueue

// Conceptual design — actual implementation may vary
struct JobQueueAwaiter {
    JobQueue& jq;
    JobType type;
    std::string name;

    bool await_ready() { return false; }  // Always suspend

    void await_suspend(std::coroutine_handle<> h) {
        // Schedule coroutine resumption as a job
        jq.addJob(type, name, [h]() { h.resume(); });
    }

    void await_resume() {}
};

4.2 Mapping: Old API → New API

graph LR
    subgraph "Current (Boost)"
        A1["postCoro(type, name, fn)"]
        A2["coro->yield()"]
        A3["coro->post()"]
        A4["coro->resume()"]
        A5["coro->join()"]
        A6["coro->runnable()"]
        A7["coro->expectEarlyExit()"]
    end

    subgraph "New (C++20)"
        B1["postCoroTask(type, name, fn)<br/>fn returns CoroTask&lt;void&gt;"]
        B2["co_await JobQueueAwaiter{jq, type, name}"]
        B3["Built into await_suspend()<br/>(automatic scheduling)"]
        B4["handle.resume()<br/>(direct call)"]
        B5["co_await task<br/>(continuation-based)"]
        B6["handle.done()"]
        B7["handle.destroy() + cleanup"]
    end

    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    A5 --> B5
    A6 --> B6
    A7 --> B7

4.3 File Changes Required

Phase 1: New Coroutine Primitives

File	Action	Description
`include/xrpl/core/CoroTask.h`	CREATE	`CoroTask<T>` return type with `promise_type`, `FinalAwaiter`
`include/xrpl/core/JobQueueAwaiter.h`	CREATE	Awaiter that schedules resume on JobQueue
`include/xrpl/core/JobQueue.h`	MODIFY	Add `postCoroTask()` template alongside existing `postCoro()`
`src/test/core/CoroTask_test.cpp`	CREATE	Unit tests for `CoroTask<T>` and `JobQueueAwaiter`

Phase 2: Entry Point Migration

File	Action	Description
`src/xrpld/rpc/detail/ServerHandler.cpp`	MODIFY	`onRequest()` and `onWSMessage()`: replace `postCoro` → `postCoroTask`
`src/xrpld/rpc/ServerHandler.h`	MODIFY	Update `processSession`/`processRequest` signatures
`src/xrpld/app/main/GRPCServer.cpp`	MODIFY	`CallData::process()`: replace `postCoro` → `postCoroTask`
`src/xrpld/app/main/GRPCServer.h`	MODIFY	Update `process()` method signature
`src/xrpld/rpc/Context.h`	MODIFY	Change `shared_ptr<JobQueue::Coro>` to new coroutine handle type

Phase 3: Handler Migration

File	Action	Description
`src/xrpld/rpc/handlers/RipplePathFind.cpp`	MODIFY	Replace `context.coro->yield()` / `coro->post()` with `co_await`
`src/test/app/Path_test.cpp`	MODIFY	Update test to use new coroutine API
`src/test/jtx/impl/AMMTest.cpp`	MODIFY	Update test to use new coroutine API

Phase 4: Cleanup

File	Action	Description
`include/xrpl/core/Coro.ipp`	DELETE	Remove old Boost.Coroutine implementation
`include/xrpl/core/JobQueue.h`	MODIFY	Remove `Coro` class, `postCoro()`, `Coro_create_t`, Boost includes
`cmake/deps/Boost.cmake`	MODIFY	Remove `coroutine` from `find_package` and `target_link_libraries`
`cmake/XrplInterface.cmake`	MODIFY	Remove `BOOST_COROUTINES_NO_DEPRECATION_WARNING`
`src/test/core/Coroutine_test.cpp`	MODIFY	Rewrite tests for new CoroTask
`src/test/core/JobQueue_test.cpp`	MODIFY	Update `testPostCoro` to use new API
`include/xrpl/basics/LocalValue.h`	MODIFY	Update LocalValues integration for C++20 coroutines

4.4 LocalValue Integration Design

The current LocalValue system swaps per-coroutine storage on resume/yield:

sequenceDiagram
    participant WT as Worker Thread
    participant LV as LocalValues
    participant C as Coroutine

    Note over WT: Thread has its own LocalValues

    WT->>LV: saved = getLocalValues().release()
    WT->>LV: getLocalValues().reset(#amp;coro.lvs_)
    Note over LV: Now pointing to coroutine's storage

    WT->>C: coro_() / handle.resume()
    Note over C: User code sees coroutine's LocalValues

    C-->>WT: yield / co_await returns

    WT->>LV: getLocalValues().release()
    WT->>LV: getLocalValues().reset(saved)
    Note over LV: Restored to thread's storage

For C++20: The same swap pattern must be implemented in the awaiter's await_suspend() and await_resume(), or in a wrapper that calls handle.resume().

4.5 RipplePathFind Migration Design

Current pattern:

// Continuation callback
auto callback = [&context]() {
    std::shared_ptr<JobQueue::Coro> coroCopy{context.coro};
    if (!coroCopy->post()) {
        coroCopy->resume();  // Fallback: run on current thread
    }
};

// Start async work, then suspend
jvResult = makeLegacyPathRequest(request, callback, ...);
if (request) {
    context.coro->yield();       // ← SUSPEND HERE
    jvResult = request->doStatus(context.params);  // ← RESUME HERE
}

Target pattern:

// Start async work, suspend via co_await
jvResult = makeLegacyPathRequest(request, /* awaiter-based callback */, ...);
if (request) {
    co_await PathFindAwaiter{context};  // ← SUSPEND + RESUME via awaiter
    jvResult = request->doStatus(context.params);
}

The PathFindAwaiter will encapsulate the scheduling logic currently in the lambda continuation.

5. Testing & Validation Strategy

5.1 Test Architecture

graph TD
    subgraph "Unit Tests"
        UT1["CoroTask_test<br/>- Construction/destruction<br/>- co_return values<br/>- Exception propagation<br/>- Lifetime management"]
        UT2["JobQueueAwaiter_test<br/>- Schedule on correct JobType<br/>- Resume on worker thread<br/>- Shutdown handling"]
        UT3["LocalValue integration<br/>- Per-coroutine isolation<br/>- Multi-coroutine concurrent<br/>- Cross-thread consistency"]
    end

    subgraph "Migration Tests"
        MT1["Coroutine_test rewrite<br/>- correct_order<br/>- incorrect_order<br/>- thread_specific_storage"]
        MT2["PostCoro migration<br/>- Post/resume cycles<br/>- Shutdown rejection<br/>- Early exit"]
    end

    subgraph "Integration Tests"
        IT1["RPC Path Finding<br/>- Suspend/resume flow<br/>- Shutdown during suspend<br/>- Concurrent requests"]
        IT2["Full --unittest suite<br/>- All existing tests pass<br/>- No regressions"]
    end

    subgraph "Performance Tests"
        PT1["Memory benchmarks"]
        PT2["Context switch benchmarks"]
        PT3["RPC throughput under load"]
    end

    subgraph "Sanitizer Tests"
        ST1["ASAN<br/>(memory errors)"]
        ST2["TSAN<br/>(data races)"]
        ST3["UBSan<br/>(undefined behavior)"]
    end

    UT1 --> MT1
    UT2 --> MT2
    MT1 --> IT1
    MT2 --> IT2
    IT1 --> PT1
    IT2 --> PT2
    PT1 --> ST1
    PT2 --> ST2
    PT3 --> ST3

5.2 Benchmarking Tests

Memory Usage Benchmark

Test: Create N coroutines, measure RSS
- N = 100, 1000, 10000
- Measure: peak RSS, per-coroutine overhead
- Compare: Boost (N * 1MB + overhead) vs C++20 (N * ~500B + overhead)
- Tool: /proc/self/status (VmRSS), or getrusage()

Context Switch Benchmark

Test: Yield/resume M times across N coroutines
- M = 100,000 iterations
- N = 1, 10, 100 concurrent coroutines
- Measure: total time, per-switch latency (ns)
- Compare: Boost yield/resume cycle vs C++20 co_await/resume cycle
- Tool: std::chrono::high_resolution_clock

RPC Throughput Benchmark

Test: Concurrent ripple_path_find requests
- Load: 10, 50, 100 concurrent requests
- Measure: requests/second, p50/p95/p99 latency
- Compare: before vs after migration
- Tool: Custom load generator or existing perf infrastructure

5.3 Unit Test Coverage

Test	What It Validates
`CoroTask<void>` basic	Coroutine runs to completion, handle cleanup
`CoroTask<int>` with value	`co_return` value correctly retrieved
`CoroTask` exception	`unhandled_exception()` captures and rethrows
`CoroTask` cancellation	Destruction before completion cleans up
`JobQueueAwaiter` basic	`co_await` suspends, resumes on worker thread
`JobQueueAwaiter` shutdown	Returns false / throws when JobQueue stopping
`PostCoroTask` lifecycle	Create → suspend → resume → complete
`PostCoroTask` multiple yields	Multiple co_await points in sequence
`LocalValue` isolation	4 coroutines, each sees own LocalValue
`LocalValue` cross-thread	Resume on different thread, values preserved

5.4 Integration Testing

All existing --unittest tests must pass unchanged (except coroutine-specific tests that are rewritten)
Path_test must pass with identical behavior
AMMTest RPC tests must pass
ServerHandler HTTP/WS handling must work end-to-end

5.5 Sanitizer Testing

Per docs/build/sanitizers.md:

# ASAN (memory errors — especially important for coroutine frame lifetime)
export SANITIZERS=address,undefinedbehavior
# Build + test

# TSAN (data races — critical for concurrent coroutine resume)
export SANITIZERS=thread
# Build + test (separate build — cannot mix with ASAN)

Key benefit: Removing Boost.Coroutine eliminates the __asan_handle_no_return false positives caused by Boost context switching (documented in docs/build/sanitizers.md line 184).

5.6 Regression Testing Methodology

graph LR
    subgraph "Before Migration (Baseline)"
        B1["Build on develop branch"]
        B2["Run --unittest (record pass/fail)"]
        B3["Run memory benchmark (record RSS)"]
        B4["Run context switch benchmark (record ns/switch)"]
    end

    subgraph "After Migration"
        A1["Build on feature branch"]
        A2["Run --unittest (compare pass/fail)"]
        A3["Run memory benchmark (compare RSS)"]
        A4["Run context switch benchmark (compare ns/switch)"]
    end

    subgraph "Acceptance Criteria"
        C1["Zero test regressions"]
        C2["Memory: ≤ baseline"]
        C3["Context switch: ≤ baseline"]
        C4["ASAN/TSAN clean"]
    end

    B1 --> B2 --> B3 --> B4
    A1 --> A2 --> A3 --> A4
    B2 -.->|compare| C1
    A2 -.->|compare| C1
    B3 -.->|compare| C2
    A3 -.->|compare| C2
    B4 -.->|compare| C3
    A4 -.->|compare| C3
    A2 -.-> C4

6. Risks & Mitigation

6.1 Risk Matrix

Risk	Probability	Impact	Mitigation
Performance regression in context switching	Low	High	Benchmark before/after; C++20 should be faster
Coroutine frame lifetime bugs (use-after-destroy)	Medium	High	ASAN testing, RAII wrapper for handle, code review
Data races on resume	Medium	High	TSAN testing, careful await_suspend() implementation
LocalValue corruption across threads	Low	High	Dedicated test with 4+ concurrent coroutines
Shutdown race conditions	Medium	Medium	Replicate existing mutex/cv pattern in new design
Missed coroutine consumer during migration	Low	Medium	Exhaustive grep audit (Section 2.4 is complete)
Compiler bugs in coroutine codegen	Low	Medium	Test on all three compilers (GCC, Clang, MSVC)
Exception loss across suspension points	Medium	Medium	Test exception propagation in every phase
Third-party code depending on Boost.Coroutine	Very Low	Low	Grep confirms only internal usage
Dangling references in coroutine frames	Medium	High	ASAN testing, avoid reference params in coroutine functions, use shared_ptr
Colored function infection spreading	Low	Medium	Only 4 call sites need co_await; no nested handlers suspend
Symmetric transfer not available	Very Low	High	All target compilers (GCC 12+, Clang 16+) support symmetric transfer
Future handler adding deep yield	Low	Medium	Code review + CI: static analysis flag any yield from nested depth

6.2 Rollback Strategy

graph TD
    START["Migration In Progress"]
    CHECK{"Critical Issue<br/>Discovered?"}
    PHASE{"Which Phase?"}

    P1["Phase 1: Delete new files<br/>No production code changed"]
    P2["Phase 2: Revert entry point changes<br/>Old postCoro still present"]
    P3["Phase 3: Revert handler changes<br/>Old Coro still present"]
    P4["Phase 4: Cannot easily rollback<br/>Old code deleted"]

    PREVENT["Prevention:<br/>Do NOT delete old code<br/>until Phase 4 is fully validated"]

    START --> CHECK
    CHECK -->|Yes| PHASE
    CHECK -->|No| DONE["Continue Migration"]
    PHASE -->|1| P1
    PHASE -->|2| P2
    PHASE -->|3| P3
    PHASE -->|4| P4
    P4 --> PREVENT

Key principle: Old Coro class and postCoro() remain in the codebase through Phases 1-3. They are only removed in Phase 4, after all migration is validated. Each phase is independently revertible via git revert.

6.3 Specific Risk: Stackful → Stackless Limitation

The Big Question: Can all current yield() call sites work with stackless co_await?

Analysis:

graph TD
    Q["Does yield() get called from<br/>a deeply nested function?"]
    Q -->|Yes| PROBLEM["PROBLEM: co_await can't<br/>suspend from nested calls"]
    Q -->|No| OK["OK: Direct co_await<br/>in coroutine function"]

    CHECK1["RipplePathFind.cpp:131<br/>context.coro.yield()"]
    CHECK1 -->|"Called directly in handler"| OK

    CHECK2["Coroutine_test.cpp<br/>c.yield()"]
    CHECK2 -->|"Called directly in lambda"| OK

    CHECK3["JobQueue_test.cpp<br/>c.yield()"]
    CHECK3 -->|"Called directly in lambda"| OK

    style OK fill:#dfd,stroke:#0a0,color:#000
    style PROBLEM fill:#fdd,stroke:#c00,color:#000

Result: All yield() calls are in the direct body of the postCoro lambda or RPC handler function. No deep nesting exists. Migration to stackless co_await is fully feasible without architectural redesign.

7. Timeline & Milestones

7.1 Milestone Overview

gantt
    title Migration Timeline
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 1 - Foundation
    CoroTask + JobQueueAwaiter design     :p1a, 2026-02-26, 3d
    CoroTask implementation               :p1b, after p1a, 3d
    Unit tests for primitives             :p1c, after p1b, 2d
    PR 1 - New coroutine primitives       :milestone, p1m, after p1c, 0d

    section Phase 2 - Entry Points
    Migrate ServerHandler (HTTP + WS)     :p2a, after p1m, 3d
    Migrate GRPCServer                    :p2b, after p2a, 2d
    Update RPC Context                     :p2c, after p2b, 1d
    PR 2 - Entry point migration          :milestone, p2m, after p2c, 0d

    section Phase 3 - Handlers
    Migrate RipplePathFind                :p3a, after p2m, 3d
    Update test infrastructure            :p3b, after p3a, 2d
    PR 3 - Handler migration              :milestone, p3m, after p3b, 0d

    section Phase 4 - Cleanup
    Remove old Coro and update CMake      :p4a, after p3m, 2d
    Performance benchmarks                :p4b, after p4a, 2d
    Sanitizer validation                  :p4c, after p4b, 1d
    PR 4 - Cleanup + validation           :milestone, p4m, after p4c, 0d

7.2 Milestone Details

Milestone 1: New Coroutine Primitives (PR #1)

Deliverables:

CoroTask<T> with promise_type, FinalAwaiter
CoroTask<void> specialization
JobQueueAwaiter for scheduling on JobQueue
postCoroTask() on JobQueue
LocalValue integration in new coroutine type
Unit test suite: CoroTask_test.cpp

Acceptance Criteria:

All new unit tests pass
Existing --unittest suite passes (no regressions from new code)
ASAN + TSAN clean on new tests
Code compiles on GCC 12+, Clang 16+

Milestone 2: Entry Point Migration (PR #2)

Deliverables:

ServerHandler::onRequest() uses postCoroTask()
ServerHandler::onWSMessage() uses postCoroTask()
GRPCServer::CallData::process() uses postCoroTask()
RPC::Context updated to carry new coroutine type
processSession/processRequest signatures updated

Acceptance Criteria:

HTTP, WebSocket, and gRPC RPC requests work end-to-end
Full --unittest suite passes
Manual smoke test: ripple_path_find via HTTP/WS

Milestone 3: Handler Migration (PR #3)

Deliverables:

RipplePathFind uses co_await instead of yield()/post()
Path_test and AMMTest updated
Coroutine_test and JobQueue_test updated for new API

Acceptance Criteria:

Path-finding suspension/continuation works correctly
All --unittest tests pass
Shutdown-during-pathfind scenario tested

Milestone 4: Cleanup & Validation (PR #4)

Deliverables:

Old Coro class and Coro.ipp removed
postCoro() removed from JobQueue
Boost::coroutine removed from CMake
BOOST_COROUTINES_NO_DEPRECATION_WARNING removed
Performance benchmark results documented
Sanitizer test results documented

Acceptance Criteria:

Build succeeds without Boost.Coroutine
Full --unittest suite passes
Memory per coroutine ≤ 10KB (down from 1MB)
Context switch time ≤ baseline
ASAN, TSAN, UBSan all clean

8. Standards & Guidelines

8.1 Coroutine Design Standards

Rule 1: All coroutine return types must use RAII for handle lifetime

// GOOD: Handle destroyed in destructor
~CoroTask() {
    if (handle_) handle_.destroy();
}

// BAD: Manual destroy calls scattered in code
void cleanup() { handle_.destroy(); } // Easy to forget

Rule 2: Never resume a coroutine from within `await_suspend()`

// GOOD: Schedule resume on executor
void await_suspend(std::coroutine_handle<> h) {
    jq_.addJob(type_, name_, [h]() { h.resume(); });
}

// BAD: Direct resume in await_suspend (blocks caller)
void await_suspend(std::coroutine_handle<> h) {
    h.resume(); // Defeats the purpose of suspension
}

Rule 3: Use `suspend_always` for `initial_suspend()` (lazy start)

// GOOD: Lazy start — coroutine doesn't run until explicitly resumed
std::suspend_always initial_suspend() { return {}; }

// BAD for our use case: Eager start — runs immediately on creation
std::suspend_never initial_suspend() { return {}; }

Rationale: Matches existing Boost behavior where postCoro() schedules execution, not the constructor.

Rule 4: Always handle `unhandled_exception()` explicitly

void unhandled_exception() {
    exception_ = std::current_exception();
    // NEVER: just swallow the exception
    // NEVER: std::terminate() without logging
}

Rule 5: Use `suspend_always` for `final_suspend()` to enable continuation

// GOOD: Suspend at end to allow cleanup and value retrieval
auto final_suspend() noexcept {
    struct FinalAwaiter {
        bool await_ready() noexcept { return false; }
        std::coroutine_handle<> await_suspend(
            std::coroutine_handle<promise_type> h) noexcept {
            if (h.promise().continuation_)
                return h.promise().continuation_;  // Resume waiter
            return std::noop_coroutine();
        }
        void await_resume() noexcept {}
    };
    return FinalAwaiter{};
}

Rule 6: Coroutine functions must be clearly marked

// GOOD: Return type makes it obvious this is a coroutine
CoroTask<Json::Value> doRipplePathFind(RPC::JsonContext& context) {
    co_await ...;
    co_return result;
}

// BAD: Coroutine hidden behind auto or unclear return type
auto doSomething() { co_return; }

8.2 Coding Guidelines

Thread Safety

Never resume a coroutine concurrently from two threads. Use the same mutex pattern as existing Coro::mutex_ to prevent races.
await_suspend() is the synchronization point. All state visible before await_suspend() must be visible after await_resume().
Use std::atomic or mutexes for shared state between coroutine and continuation callback.

Memory Management

CoroTask<T> owns its coroutine_handle. It is move-only, non-copyable.
Never store raw coroutine_handle<> in long-lived data structures without clear ownership.
Prefer shared_ptr<CoroTask<T>> when multiple parties need to observe/wait on a coroutine, mirroring the existing shared_ptr<Coro> pattern.

Error Handling

Exceptions thrown in coroutine body are captured by promise_type::unhandled_exception() and rethrown in await_resume().
Never let exceptions escape final_suspend() — it's noexcept.
Shutdown path: When JobQueue is stopping and addJob() returns false, the awaiter must resume the coroutine with an error (throw or return error state) rather than leaving it suspended forever.

Naming Conventions

Entity	Convention	Example
Coroutine return type	`CoroTask<T>`	`CoroTask<void>`, `CoroTask<Json::Value>`
Awaiter types	`*Awaiter` suffix	`JobQueueAwaiter`, `PathFindAwaiter`
Coroutine functions	Same as regular functions	`doRipplePathFind(...)`
Promise types	Nested `promise_type`	`CoroTask<T>::promise_type`
JobQueue method	`postCoroTask()`	`jq.postCoroTask(jtCLIENT, "name", fn)`

Code Organization

Coroutine primitives go in include/xrpl/core/ (header-only where possible)
Application-specific awaiters go alongside their consumers
Tests mirror source structure: src/test/core/CoroTask_test.cpp
No conditional compilation (#ifdef) for old vs new coroutine code — migration is clean phases

Documentation

Each awaiter must document: what it waits for, which thread resumes, and what await_resume() returns.
Promise type must document: exception handling behavior and suspension points.
Migration commits must reference this plan in commit messages.

8.3 Branch Strategy

Each milestone is developed on a sub-branch of the main feature branch. This keeps PRs focused and independently reviewable.

develop
  └── pratik/Switch-to-std-coroutines                       (main feature branch)
        ├── pratik/std-coro/add-coroutine-primitives        (CoroTask, CoroTaskRunner, JobQueueAwaiter, postCoroTask)
        ├── pratik/std-coro/migrate-entry-points            (ServerHandler, GRPCServer, RPC::Context)
        ├── pratik/std-coro/migrate-handlers                (doRipplePathFind, PathFindAwaiter, tests)
        └── pratik/std-coro/cleanup-boost-coroutine         (delete Coro.ipp, remove Boost dep, benchmarks)

Workflow:

Create sub-branch from pratik/Switch-to-std-coroutines for each milestone
Develop and test on the sub-branch
Create PR from sub-branch → pratik/Switch-to-std-coroutines
After review + merge, start next milestone sub-branch from the updated feature branch
Final PR from pratik/Switch-to-std-coroutines → develop

Rules:

Never push directly to the main feature branch — always via sub-branch PR
Each sub-branch must pass --unittest and sanitizers before PR
Sub-branch names follow the pattern: pratik/std-coro/<descriptive-action> (e.g., add-coroutine-primitives, migrate-entry-points)
Milestone PRs must reference this plan document in the description

8.4 Code Review Checklist

For every PR in this migration:

coroutine_handle::destroy() called exactly once per coroutine
No concurrent handle.resume() calls possible
unhandled_exception() stores the exception (doesn't discard it)
final_suspend() is noexcept
Awaiter await_suspend() doesn't block (schedules, not runs)
LocalValues correctly swapped on suspend/resume boundaries
Shutdown path tested (JobQueue stopping during coroutine execution)
ASAN clean (no use-after-free on coroutine frame)
TSAN clean (no data races on resume)
All existing --unittest tests still pass

9. Task List

Milestone 1: New Coroutine Primitives

1.1 Design CoroTask<T> class with promise_type
- Define promise_type with initial_suspend, final_suspend, unhandled_exception, return_value/return_void
- Implement FinalAwaiter for continuation support
- Implement move-only RAII handle wrapper
- Support both CoroTask<T> and CoroTask<void>
1.2 Design and implement JobQueueAwaiter
- await_suspend() calls jq_.addJob(type, name, [h]{ h.resume(); })
- Handle addJob() failure (shutdown) — resume with error flag or throw
- Integrate nSuspend_ counter increment/decrement
1.3 Implement LocalValues swap in new coroutine resume path
- Before handle.resume(): save thread-local, install coroutine-local
- After handle.resume() returns: restore thread-local
- Ensure this works when coroutine migrates between threads
1.4 Add postCoroTask() template to JobQueue
- Accept callable returning CoroTask<void>
- Schedule initial execution on JobQueue (mirror postCoro() behavior)
- Return a handle/shared_ptr for join/cancel
1.5 Write unit tests (src/test/core/CoroTask_test.cpp)
- Test CoroTask<void> runs to completion
- Test CoroTask<int> returns value
- Test exception propagation across co_await
- Test coroutine destruction before completion
- Test JobQueueAwaiter schedules on correct thread
- Test LocalValue isolation across 4+ coroutines
- Test shutdown rejection (addJob returns false)
- Test correct_order equivalent (yield → join → post → complete)
- Test incorrect_order equivalent (post → yield → complete)
- Test multiple sequential co_await points
1.6 Verify build on GCC 12+, Clang 16+
1.7 Run ASAN + TSAN on new tests
1.8 Run full --unittest suite (no regressions)
1.9 Self-review and create PR #1

Milestone 2: Entry Point Migration

2.1 Migrate ServerHandler::onRequest() (ServerHandler.cpp:287)
- Replace m_jobQueue.postCoro(jtCLIENT_RPC, ...) with postCoroTask()
- Update lambda to return CoroTask<void> (add co_return)
- Update processSession to accept new coroutine type
2.2 Migrate ServerHandler::onWSMessage() (ServerHandler.cpp:325)
- Replace m_jobQueue.postCoro(jtCLIENT_WEBSOCKET, ...) with postCoroTask()
- Update lambda signature
2.3 Migrate GRPCServer::CallData::process() (GRPCServer.cpp:102)
- Replace app_.getJobQueue().postCoro(JobType::jtRPC, ...) with postCoroTask()
- Update process(shared_ptr<Coro> coro) overload signature
2.4 Update RPC::Context (Context.h:27)
- Replace std::shared_ptr<JobQueue::Coro> coro{} with new coroutine wrapper type
- Ensure all code that accesses context.coro compiles
2.5 Update ServerHandler.h signatures
- processSession() and processRequest() parameter types
2.6 Update GRPCServer.h signatures
- process() method parameter types
2.7 Run full --unittest suite
2.8 Manual smoke test: HTTP + WS + gRPC RPC requests
2.9 Run ASAN + TSAN
2.10 Self-review and create PR #2

Milestone 3: Handler Migration

3.1 Migrate doRipplePathFind() (RipplePathFind.cpp)
- Replace context.coro->yield() with co_await PathFindAwaiter{...}
- Replace continuation lambda's coro->post() / coro->resume() with awaiter scheduling
- Handle shutdown case (post failure) in awaiter
3.2 Create PathFindAwaiter (or use generic JobQueueAwaiter)
- Encapsulate the continuation + yield pattern from RipplePathFind.cpp lines 108-132
3.3 Update Path_test.cpp
- Replace postCoro usage with postCoroTask
- Ensure context.coro usage matches new type
3.4 Update AMMTest.cpp
- Replace postCoro usage with postCoroTask
3.5 Rewrite Coroutine_test.cpp for new API
- correct_order: postCoroTask → co_await → join → resume → complete
- incorrect_order: post before yield equivalent
- thread_specific_storage: 4 coroutines with LocalValue isolation
3.6 Update JobQueue_test.cpp testPostCoro
- Migrate to postCoroTask API
3.7 Verify ripple_path_find works end-to-end with new coroutines
3.8 Test shutdown-during-pathfind scenario
3.9 Run full --unittest suite
3.10 Run ASAN + TSAN
3.11 Self-review and create PR #3

Milestone 4: Cleanup & Validation

4.1 Delete include/xrpl/core/Coro.ipp
4.2 Remove from JobQueue.h:
- #include <boost/coroutine/all.hpp>
- struct Coro_create_t
- class Coro (entire class)
- postCoro() template
- Comment block (lines 322-377) describing old race condition
4.3 Update cmake/deps/Boost.cmake:
- Remove coroutine from find_package(Boost REQUIRED COMPONENTS ...)
- Remove Boost::coroutine from target_link_libraries
4.4 Update cmake/XrplInterface.cmake:
- Remove BOOST_COROUTINES_NO_DEPRECATION_WARNING
4.5 Run memory benchmark
- Create N=1000 coroutines, compare RSS: before vs after
- Document results
4.6 Run context switch benchmark
- 100K yield/resume cycles, compare latency: before vs after
- Document results
4.7 Run RPC throughput benchmark
- Concurrent ripple_path_find requests, compare throughput
- Document results
4.8 Run full --unittest suite
4.9 Run ASAN, TSAN, UBSan
- Confirm __asan_handle_no_return warnings are gone
4.10 Verify build on all supported compilers
4.11 Self-review and create PR #4
4.12 Document final benchmark results in PR description

Appendix A: File Inventory

Complete list of files that reference coroutines (for audit tracking):

#	File	Must Change	Phase
1	`include/xrpl/core/JobQueue.h`	Yes	1 (add), 4 (remove old)
2	`include/xrpl/core/Coro.ipp`	Yes	4 (delete)
3	`include/xrpl/basics/LocalValue.h`	Maybe	1 (if integration changes)
4	`cmake/deps/Boost.cmake`	Yes	4
5	`cmake/XrplInterface.cmake`	Yes	4
6	`src/xrpld/rpc/Context.h`	Yes	2
7	`src/xrpld/rpc/detail/ServerHandler.cpp`	Yes	2
8	`src/xrpld/rpc/ServerHandler.h`	Yes	2
9	`src/xrpld/app/main/GRPCServer.cpp`	Yes	2
10	`src/xrpld/app/main/GRPCServer.h`	Yes	2
11	`src/xrpld/rpc/handlers/RipplePathFind.cpp`	Yes	3
12	`src/test/core/Coroutine_test.cpp`	Yes	3
13	`src/test/core/JobQueue_test.cpp`	Yes	3
14	`src/test/app/Path_test.cpp`	Yes	3
15	`src/test/jtx/impl/AMMTest.cpp`	Yes	3
16	`src/xrpld/rpc/README.md`	Yes	4 (update docs)

Appendix B: New Files to Create

#	File	Phase	Purpose
1	`include/xrpl/core/CoroTask.h`	1	`CoroTask<T>` return type + promise_type
2	`include/xrpl/core/JobQueueAwaiter.h`	1	Awaiter for scheduling on JobQueue
3	`src/test/core/CoroTask_test.cpp`	1	Unit tests for new primitives

78 KiB Raw Blame History

Boost.Coroutine to C++20 Standard Coroutines Migration Plan

Table of Contents

1. Research & Analysis

1.1 Stackful (Boost.Coroutine) vs Stackless (C++20) Architecture

1.2 API & Programming Model Comparison

1.3 Performance Characteristics

1.4 Feature Parity Analysis

Suspension Points

Exception Handling

Cancellation

1.5 Compiler Support

1.6 Viability Analysis — Addressing Stackless Concerns

Concern 1: Cannot Suspend from Nested Call Stacks

Concern 2: Colored Function Problem (Viral co_await)

Concern 3: No Standard Library Support for Common Patterns

Concern 4: Stack Overflow from Synchronous Resumption Chains

Concern 5: Dangling Reference Risk

Concern 6: yield_to.h / boost::asio::spawn

Overall Viability Conclusion

1.7 Merits & Demerits Summary

Merits of C++20 Migration

Demerits / Challenges

Alternative Considered: ASAN Annotations Only

2. Current State Assessment

2.1 Architecture Overview

2.2 JobQueue::Coro Implementation Audit

Class Members

Boost.Coroutine APIs Used

Method Behaviors

2.3 Coroutine Execution Lifecycle

2.4 All Coroutine Touchpoints

Core Infrastructure (Must Change)

Entry Points (postCoro Callers)

Context Propagation

Active Coroutine Consumer (yield/post)

Test Files

2.5 Suspension/Continuation Model

2.6 CMake Dependency

2.7 Existing C++20 Coroutine Usage

3. Migration Strategy

3.1 Incremental vs Atomic Migration

3.2 Migration Phases

3.3 Coexistence Strategy

3.4 Breaking Changes & Compatibility

4. Implementation Plan

4.1 New Type Design

CoroTask<T> — Coroutine Return Type

JobQueueAwaiter — Schedules Resumption on JobQueue

4.2 Mapping: Old API → New API

4.3 File Changes Required

Phase 1: New Coroutine Primitives

Phase 2: Entry Point Migration

Phase 3: Handler Migration

Phase 4: Cleanup

4.4 LocalValue Integration Design

4.5 RipplePathFind Migration Design

5. Testing & Validation Strategy

5.1 Test Architecture

5.2 Benchmarking Tests

Memory Usage Benchmark

Context Switch Benchmark

RPC Throughput Benchmark

5.3 Unit Test Coverage

5.4 Integration Testing

5.5 Sanitizer Testing

5.6 Regression Testing Methodology

6. Risks & Mitigation

6.1 Risk Matrix

6.2 Rollback Strategy

6.3 Specific Risk: Stackful → Stackless Limitation

7. Timeline & Milestones

7.1 Milestone Overview

7.2 Milestone Details

Milestone 1: New Coroutine Primitives (PR #1)

Milestone 2: Entry Point Migration (PR #2)

Milestone 3: Handler Migration (PR #3)

Milestone 4: Cleanup & Validation (PR #4)

8. Standards & Guidelines

8.1 Coroutine Design Standards

78 KiB

Raw Blame History

2.2 `JobQueue::Coro` Implementation Audit

`CoroTask<T>` — Coroutine Return Type

`JobQueueAwaiter` — Schedules Resumption on JobQueue

Rule 2: Never resume a coroutine from within `await_suspend()`

Rule 3: Use `suspend_always` for `initial_suspend()` (lazy start)

Rule 4: Always handle `unhandled_exception()` explicitly

Rule 5: Use `suspend_always` for `final_suspend()` to enable continuation