From cb5259d7c579313b1ee29c0753017e7edd9f236f Mon Sep 17 00:00:00 2001 From: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com> Date: Wed, 25 Mar 2026 15:44:19 +0000 Subject: [PATCH] docs: Add Boost to C++20 coroutine migration plan Comprehensive migration plan documenting the switch from Boost.Coroutine2 to C++20 standard coroutines in rippled, including research analysis, implementation phases, risk assessment, and testing strategy. Co-Authored-By: Claude Opus 4.6 --- BoostToStdCoroutineSwitchPlan.md | 1841 ++++++++++++++++++++++++++++++ BoostToStdCoroutineTaskList.md | 142 +++ cspell.config.yaml | 5 + 3 files changed, 1988 insertions(+) create mode 100644 BoostToStdCoroutineSwitchPlan.md create mode 100644 BoostToStdCoroutineTaskList.md diff --git a/BoostToStdCoroutineSwitchPlan.md b/BoostToStdCoroutineSwitchPlan.md new file mode 100644 index 0000000000..20958a25be --- /dev/null +++ b/BoostToStdCoroutineSwitchPlan.md @@ -0,0 +1,1841 @@ +# Boost.Coroutine to C++20 Standard Coroutines Migration Plan + +> **Status:** Implementation Complete +> **Author:** Pratik Mankawde +> **Created:** 2026-02-25 +> **Project:** rippled (XRP Ledger node) +> **Branch:** `Switch-to-std-coroutines` +> **Dependencies:** C++20 compiler support (GCC 12+, Clang 16+, MSVC 19.28+) + +--- + +## Table of Contents + +1. [What Is This?](#1-what-is-this) +2. [Why Is This Needed?](#2-why-is-this-needed) +3. [Actors, Actions & High-Level Flow](#3-actors-actions--high-level-flow) +4. [Research & Analysis](#4-research--analysis) +5. [Current State Assessment](#5-current-state-assessment) +6. [Migration Strategy](#6-migration-strategy) +7. [Implementation Plan](#7-implementation-plan) +8. [Testing & Validation Strategy](#8-testing--validation-strategy) +9. [Risks & Mitigation](#9-risks--mitigation) +10. [Timeline & Milestones](#10-timeline--milestones) +11. [Standards & Guidelines](#11-standards--guidelines) +12. [Task List](#12-task-list) +13. [FAQ](#13-faq) +14. [Glossary](#14-glossary) + +--- + +## 1. What Is This? + +This document describes the plan for migrating rippled's coroutine implementation from **Boost.Coroutine2** (a third-party C++ library that provides stackful coroutines) to **C++20 standard coroutines** (a language-native stackless coroutine facility built into modern C++ compilers). + +Coroutines in rippled are used to handle long-running RPC requests — such as pathfinding — without blocking server threads. When a request needs to wait for an external event, the coroutine **suspends** (freeing the thread for other work) and **resumes** later when the event completes. + +--- + +## 2. Why Is This Needed? + +- **Memory waste** — Each Boost coroutine allocates a **1.5 MB stack**, even though rippled's coroutines use only a few hundred bytes. With C++20, each coroutine frame is ~200–500 bytes — a **~3000x reduction**. +- **Deprecated dependency** — Boost.Coroutine (v1) is deprecated; Boost.Coroutine2 is not officially deprecated but receives minimal maintenance and no active feature development. Continuing to depend on it creates long-term maintenance risk. +- **Sanitizer incompatibility** — Boost's context-switching mechanism confuses ASAN and TSAN, producing false positives that make it harder to find real bugs. C++20 coroutines are transparent to sanitizers. +- **No compiler optimization** — Boost's stackful coroutines are opaque to the compiler. C++20 coroutines can be inlined and optimized at compile time, reducing context-switch overhead. +- **Standard compliance** — C++20 coroutines are part of the ISO C++ standard, supported by all major compilers. This eliminates a platform-specific dependency and improves portability. +- **Better tooling** — Debuggers, static analyzers, and IDE tooling increasingly understand C++20 coroutines natively, improving the development experience. + +--- + +## 3. Actors, Actions & High-Level Flow + +### 3a. Actors + +| Who (Plain English) | Technical Term | +| ---------------------------------------- | ------------------------------------------------- | +| External client sending a request | RPC Client (HTTP, WebSocket, or gRPC) | +| Server code that receives the request | Entry Point (`ServerHandler`, `GRPCServer`) | +| Thread pool that executes work | JobQueue (worker threads) | +| Wrapper that manages coroutine lifecycle | `CoroTaskRunner` (new) / `JobQueue::Coro` (old) | +| The suspended/resumable unit of work | Coroutine (`CoroTask` / `boost::coroutine`) | +| RPC handler that does the actual work | Handler (e.g., `doRipplePathFind`) | + +### 3b. Actions + +| What Happens (Plain English) | Technical Term | +| -------------------------------------------------- | ------------------------------------------- | +| Client sends a request to the server | RPC request (HTTP/WS/gRPC) | +| Server creates a coroutine to handle the request | `postCoroTask()` (new) / `postCoro()` (old) | +| Coroutine starts running on a worker thread | `resume()` / `handle.resume()` | +| Handler needs to wait for an external event | `co_await` (new) / `yield()` (old) | +| Coroutine suspends, freeing the worker thread | Suspension | +| External event completes, coroutine is rescheduled | `post()` / `yieldAndPost()` | +| Coroutine finishes and result is sent to client | Completion | + +### 3c. High-Level Flow + +```mermaid +flowchart TD + A["`Client sends request +(HTTP, WebSocket, or gRPC)`"] --> B["`Server receives request +and creates a coroutine`"] + B --> C["`Coroutine is scheduled +on a worker thread`"] + C --> D["`Handler runs and +processes the request`"] + D --> E{"`Does the handler +need to wait?`"} + E -->|No| G["`Handler finishes +and sends response`"] + E -->|Yes| F["`Coroutine suspends, +worker thread is freed`"] + F --> H["External event completes"] + H --> I["`Coroutine is rescheduled +on a worker thread`"] + I --> D +``` + +**Reading the diagram:** + +- A client (e.g., a wallet app) sends an RPC request to the rippled server. +- The server wraps the request in a coroutine and schedules it on a worker thread from the JobQueue. +- The handler processes the request. Most handlers finish immediately and return a response. +- For long-running handlers (e.g., pathfinding), the coroutine **suspends** — the worker thread is released to handle other requests. +- When the external event completes (e.g., pathfinding results are ready), the coroutine is rescheduled on an available worker thread and resumes where it left off. +- This suspend/resume cycle can repeat multiple times before the handler finishes and the response is sent. + +--- + +## 4. Research & Analysis + +> **RSS** = Resident Set Size + +### 4.1 Stackful (Boost.Coroutine) vs Stackless (C++20) Architecture + +```mermaid +graph LR + subgraph boost["Boost.Coroutine2 (Stackful)"] + direction TB + A[Coroutine Created] --> B[1.5 MB Stack Allocated] + B --> C[Full Call Stack Available] + C --> D[yield from ANY nesting depth] + D --> E["`**Context Switch:** +Save/Restore registers ++ stack pointer`"] + end + + subgraph cpp20["C++20 Coroutines (Stackless)"] + direction TB + F[Coroutine Created] --> G["200-500B Frame on Heap"] + G --> H[No Dedicated Stack] + H --> I[co_await ONLY at suspension points] + I --> J["`**Context Switch:** +Resume via function call`"] + end + + boost ~~~ cpp20 +``` + +### 4.2 API & Programming Model Comparison + +| Aspect | Boost.Coroutine2 (Current) | C++20 Coroutines (Target) | +| ---------------------- | --------------------------------------------- | ------------------------------------------------------------------------------- | +| **Type** | Stackful, asymmetric | Stackless, asymmetric | +| **Stack Model** | Dedicated 1.5 MB stack per coroutine | Coroutine frame on heap (~200-500 bytes, approximate, implementation-dependent) | +| **Suspension** | `(*yield_)()` — can yield from any call depth | `co_await expr` — only at explicit suspension points | +| **Resumption** | `coro_()` — resumes from last yield | `handle.resume()` — resumes from last co_await | +| **Creation** | `pull_type` constructor (runs to first yield) | Calling a coroutine function returns a handle | +| **Completion Check** | `static_cast(coro_)` | `handle.done()` | +| **Value Passing** | Typed via `pull_type` / `push_type` | Via `promise_type::return_value()` | +| **Exception Handling** | Natural stack-based propagation | `promise_type::unhandled_exception()` — explicit | +| **Cancellation** | Application-managed (poll a flag) | Via `await_ready()` / cancellation tokens | +| **Keywords** | None (library-only) | `co_await`, `co_yield`, `co_return` | +| **Standard** | Boost library (not ISO C++) | ISO C++20 standard | + +### 4.3 Coroutine Start Behavior: Eager vs Lazy + +A key behavioral difference between Boost.Coroutine2 and C++20 coroutines is **what happens at creation time**. + +| Aspect | Boost.Coroutine2 (`pull_type`) | C++20 (`initial_suspend = suspend_always`) | +| ------------------------------------ | ------------------------------------------------------------------------------------ | ------------------------------------------------ | +| **On construction** | Eagerly runs coroutine body until first `yield()` | No body code executes — only allocates the frame | +| **First execution** | Already happened (up to first yield) | Deferred until explicit `resume()` call | +| **Initial yield required?** | Yes — the API demands a yield to return control to the creator | No — coroutine is born suspended | +| **Stack context switch on creation** | Yes — switches to coroutine stack and back | No — construction is just a heap allocation | +| **Work wasted on creation** | Any code before first `yield()` runs eagerly, even if the coroutine is never resumed | Zero — no work done until `resume()` | + +**Was the initial-yield pattern a rippled requirement or a Boost API limitation?** + +It was a **Boost API design choice**, not a rippled requirement. Boost's `pull_type` was modeled after generators: construct the coroutine, and the first value is immediately available to pull. The coroutine runs eagerly to its first `yield()` to produce that value. rippled just needed "create a coroutine, run it later" — the eager-run-to-first-yield was Boost's mechanism for achieving that, not something rippled specifically required. + +**What C++20 does better:** + +1. **Lazy start by default** — With `suspend_always` as `initial_suspend`, construction and first execution are fully decoupled. No dummy initial yield, no wasted work, no confusing "the coroutine already ran halfway" semantics. +2. **No forced context switch on creation** — Boost's `pull_type` had to actually switch stack contexts on construction to run to the first yield. C++20 coroutine construction is just a heap allocation. +3. **Custom suspension points** — Each `co_await` is controlled by an awaiter object (`await_ready`, `await_suspend`, `await_resume`), allowing runtime decisions about whether to actually suspend, which thread to resume on, and what values to return from suspension. +4. **Symmetric transfer** — `await_suspend` can return a `coroutine_handle<>`, transferring directly to another coroutine without stack buildup. Boost coroutines couldn't do this — each resume/yield went through the scheduler. +5. **Composability** — `co_await` on another coroutine is natural. With Boost, nesting coroutines meant manually managing multiple coroutine objects. + +In rippled's migration, this means `postCoroTask()` creates a `CoroTask` that is born suspended (`initial_suspend = suspend_always`), then schedules its first `resume()` via `addJob()`. The coroutine body doesn't execute any code until a worker thread picks up the job — strictly cleaner than Boost's eager-run-to-first-yield pattern. + +### 4.4 Performance Characteristics + +| Metric | Boost.Coroutine2 | C++20 Coroutines | +| ------------------------------ | ---------------------------------------------------------------------------------------- | ------------------------------------ | +| **Memory per coroutine** | ~1.5 MB (fixed stack) | ~200-500 bytes (frame only) | +| **1000 concurrent coroutines** | ~1.5 GB | ~0.5 MB | +| **Context switch cost** | ~19 cycles / 9 ns with fcontext; ~1,130 cycles / 547 ns with ucontext (ASAN/TSAN builds) | ~20-50 CPU cycles (function call) | +| **Allocation** | Stack allocated at creation | Heap allocation (compiler may elide) | +| **Cache behavior** | Poor (large stack rarely fully used) | Good (small frame, hot data close) | +| **Compiler optimization** | Opaque to compiler | Inlinable, optimizable | + +### 4.5 Feature Parity Analysis + +#### Suspension Points + +- **Boost**: Can yield from any nesting level — `fn_a()` calls `fn_b()` calls `yield()`. The entire call stack is preserved. +- **C++20**: Suspension only at `co_await` expressions in the immediate coroutine function. Nested functions that need to suspend must themselves be coroutines returning awaitables. +- **Impact**: rippled's usage is **shallow** — `yield()` is called directly from the RPC handler lambda, never from deeply nested code. This makes migration straightforward. + +#### Exception Handling + +- **Boost**: Exceptions propagate naturally up the call stack across yield points. +- **C++20**: Exceptions in coroutine body are caught by `promise_type::unhandled_exception()`. Must be explicitly stored and rethrown. +- **Impact**: Need to implement `unhandled_exception()` in promise type. Pattern is well-established. + +#### Cancellation + +- **Boost**: rippled uses `expectEarlyExit()` for graceful shutdown — not a general cancellation mechanism. +- **C++20**: Can check cancellation in `await_ready()` before suspension, or via `stop_token` patterns. +- **Impact**: C++20 provides strictly better cancellation support. + +### 4.6 Compiler Support + +| Compiler | rippled Minimum | C++20 Coroutine Support | Status | +| --------- | --------------- | ---------------------------------------------- | ------ | +| **GCC** | 12.0+ | Full (since GCC 11) | Ready | +| **Clang** | 16.0+ | Full (since Clang 14; partial Windows support) | Ready | +| **MSVC** | 19.28+ | Full (since VS2019 16.8) | Ready | + +rippled already requires C++20 (`CMAKE_CXX_STANDARD 20` in `CMakeLists.txt`). All supported compilers have mature C++20 coroutine support. **No compiler upgrades required.** + +### 4.7 Viability Analysis — Addressing Stackless Concerns + +C++20 stackless coroutines have well-known limitations compared to stackful coroutines. This section analyzes each concern against rippled's **actual codebase** to determine viability. + +#### **Concern 1: Cannot Suspend from Nested Call Stacks** + +**Claim**: Stackless coroutines cannot yield from arbitrary stack depths. If `fn_a()` calls `fn_b()` calls `yield()`, only stackful coroutines can suspend the entire chain. + +**Analysis**: An exhaustive codebase audit found: + +- **1 production yield() call**: `RipplePathFind.cpp:131` — directly in the handler function body +- **All test yield() calls**: directly in `postCoro` lambda bodies (Coroutine_test.cpp, JobQueue_test.cpp) +- **The `push_type*` architecture** makes deep-nested yield() structurally impossible — the `yield_` pointer is only available inside the `postCoro` lambda via the `shared_ptr`, and handlers call `context.coro->yield()` at the top level + +**Verdict**: This concern does NOT apply. All suspension is shallow. + +#### **Concern 2: Colored Function Problem (Viral co_await)** + +**Claim**: Once a function needs to suspend, every caller up the chain must also be a coroutine. This "infects" the call chain. + +**Analysis**: In rippled's case, the coloring is minimal: + +- `postCoroTask()` launches a coroutine — this is the "root" colored function +- The `postCoro` lambda itself becomes the coroutine function (returns `CoroTask`) +- `doRipplePathFind()` is the only handler that calls `co_await` +- No other handler in the chain needs to become a coroutine — they continue to be regular functions dispatched through `doCommand()` + +The "coloring" stops at the entry point lambda and the one handler that suspends. No deep infection. + +**Verdict**: Minimal impact. Only 4 lambdas (3 entry points + 1 handler) need `co_await`. + +#### **Concern 3: No Standard Library Support for Common Patterns** + +**Claim**: C++20 provides the language primitives but no standard task type, executor integration, or composition utilities. + +**Analysis**: This is accurate — we need to write custom types: + +- `CoroTask` (task/return type) — well-established pattern, ~80 lines +- `JobQueueAwaiter` (executor integration) — ~20 lines +- `FinalAwaiter` (continuation chaining) — ~10 lines + +However, these types are small, well-understood, and have extensive reference implementations (cppcoro, folly::coro, libunifex). The total boilerplate is approximately 150-200 lines of header code. + +**Verdict**: Manageable. Custom types are small and well-documented in C++ community. + +#### **Concern 4: Stack Overflow from Synchronous Resumption Chains** + +**Claim**: If coroutine A `co_await`s coroutine B, and B completes synchronously, B's `final_suspend` resumes A on the same stack, potentially building up unbounded stack depth. + +**Why this is a real problem without symmetric transfer**: When `await_suspend()` returns `void`, the coroutine unconditionally suspends and returns from `.resume()`. If the awaited coroutine completes synchronously and calls `.resume()` on the awaiter, each such call adds a stack frame. In a loop that repeatedly `co_await`s short-lived coroutines (e.g., a generator producing millions of values), the stack grows with each iteration until it overflows — typically after ~1M iterations. + +**How symmetric transfer solves it**: When `await_suspend()` returns a `coroutine_handle<>` instead of `void`, the compiler destroys the current coroutine's stack frame _before_ jumping to the returned handle. This is effectively a tail-call: `resume()` becomes a `jmp` instead of a `call`, so each chained resumption consumes **zero additional stack space**. + +The C++ standard (P0913R0) mandates this by requiring: _"Implementations shall not impose any limits on how many coroutines can be resumed in this fashion."_ This effectively requires compilers to implement tail-call-like behavior — any finite stack would impose a limit otherwise. + +**Returning `std::noop_coroutine()`** from `await_suspend()` signals "suspend and return to caller" without resuming another coroutine, serving the role that `void` return used to play. + +**Applicability to rippled**: rippled does not chain coroutines (coroutine A awaiting coroutine B). The `co_await` points in rippled await `JobQueueAwaiter` (reschedules on the thread pool) and `yieldAndPost()` (suspend + re-post), both of which always suspend asynchronously. However, symmetric transfer is still implemented in our `FinalAwaiter` ([Section 7.1](#71-new-type-design)) as a best practice — it costs nothing and prevents stack overflow if the usage pattern ever changes. + +**Verdict**: Real concern for coroutine chains, but does not affect rippled's current usage. Solved by symmetric transfer in our design regardless. + +#### **Concern 5: Dangling Reference Risk** + +**Claim**: Coroutine frames are heap-allocated and outlive the calling scope, making references to locals dangerous. + +**Analysis**: This is a real concern that requires engineering discipline: + +- Coroutine parameters are copied into the frame (safe by default) +- References passed to coroutine functions can dangle if the referent's scope ends before the coroutine completes +- Our design mitigates this: `RPC::Context` is passed by reference but its lifetime is managed by `shared_ptr` / the entry point lambda's scope, which outlives the coroutine + +**Verdict**: Real risk, but manageable with RAII patterns and ASAN testing. + +#### **Concern 6: yield_to.h / boost::asio::spawn** + +**Claim**: `yield_to.h:111` uses `boost::asio::spawn`, suggesting broader coroutine usage. + +**Analysis**: `boost::asio::spawn` with `boost::context::fixedsize_stack` is a **completely separate** stackful coroutine system from `JobQueue::Coro`: + +- Different type: `boost::asio::yield_context` (not `push_type*`) +- Different mechanism: Boost.Asio stackful coroutines (not Boost.Coroutine2) +- **Not part of this migration scope** — unrelated to `JobQueue::Coro` + +**Usage sites** (both test and production): + +| File | Context | Scope | +| ------------------------------------------------ | ------------------------------------------------------- | ---------- | +| `include/xrpl/beast/test/yield_to.h` | Test infrastructure for async I/O tests | Test | +| `src/test/server/ServerStatus_test.cpp` | Server status tests via `enable_yield_to` | Test | +| `src/test/beast/beast_io_latency_probe_test.cpp` | Latency probe tests via `enable_yield_to` | Test | +| `include/xrpl/server/detail/Spawn.h` | `util::spawn()` wrapper with exception propagation | Production | +| `include/xrpl/server/detail/BaseHTTPPeer.h` | HTTP/WS connection handling (SSL handshake, read loops) | Production | + +**Verdict**: Separate system. Out of scope for this migration. + +**Consequence — `Boost::context` dependency is retained**: Because `boost::asio::spawn` depends on `Boost.Context` for its stackful fiber implementation, the `Boost::context` library **cannot be removed** as part of this migration. The CMake cleanup ([Phase 4](#phase-4-cleanup)) replaces `Boost::coroutine` with `Boost::context` — it does not eliminate the Boost fiber dependency entirely. + +Additionally, when running under ASAN or TSAN, `Boost.Context` must be built with the `ucontext` backend (not the default `fcontext`) so that it emits `__sanitizer_start_switch_fiber` / `__sanitizer_finish_switch_fiber` annotations during fiber context switches. Without these annotations, the sanitizers cannot track memory ownership across fiber stack switches and will report false positives (stack-use-after-scope under ASAN, data races under TSAN) for the `boost::asio::spawn` call sites listed above. This requires: + +- `BOOST_USE_UCONTEXT` — selects the ucontext backend (fcontext has no sanitizer annotations) +- `BOOST_USE_ASAN` / `BOOST_USE_TSAN` — enables the sanitizer fiber-switching hooks in Boost.Context +- These defines must match what Boost itself was compiled with (see `conan/profiles/sanitizers`) + +**Potential issues from retaining `Boost::context`**: + +1. **Continued 2MB stack allocation** — `boost::asio::spawn` in `BaseHTTPPeer.h` and `yield_to.h` still allocates 2MB stacks per fiber via `fixedsize_stack`. This does not benefit from the C++20 coroutine memory reduction. +2. **Sanitizer blind spots** — if Boost is not compiled with matching `BOOST_USE_UCONTEXT` / `BOOST_USE_ASAN` / `BOOST_USE_TSAN` defines, the fiber context switches in production server code (`BaseHTTPPeer`) will produce false positives or mask real bugs. +3. **Future migration needed** — `boost::asio::spawn` with stackful fibers should eventually be migrated to `boost::asio::co_spawn` with C++20 coroutines (or `boost::asio::awaitable`) to fully eliminate the `Boost::context` dependency. This is a separate initiative. + +#### Overall Viability Conclusion + +The migration IS viable because: + +1. rippled's coroutine usage is **shallow** (no deep-nested yield) +2. The **colored function infection** is limited to 4 call sites +3. Custom types are **small and well-understood** +4. **Symmetric transfer** solves the stack overflow concern +5. **ASAN/TSAN** testing catches lifetime and race bugs +6. The alternative (ASAN annotations for Boost.Context) only addresses sanitizer false positives — it does not provide memory savings, standard compliance, or the dependency elimination that C++20 migration delivers + +### 4.8 Merits & Demerits Summary + +#### Merits of C++20 Migration + +1. **~3000x memory reduction** per coroutine (1.5 MB → ~500 bytes) +2. **Faster context switching** (~2x improvement) +3. **Remove external dependency** on Boost.Coroutine (`Boost::context` is retained — see [Concern 6](#concern-6-yield_toh--boostasiospawn)) +4. **Language-native** — better tooling, debugger support, static analysis +5. **Future-proof** — ISO standard, not a deprecated library +6. **Compiler-optimizable** — suspension points can be inlined/elided +7. **ASAN compatibility** — eliminates `JobQueue::Coro` Boost context-switching false positives (see `docs/build/sanitizers.md`). Note: `boost::asio::spawn` false positives remain and require `BOOST_USE_UCONTEXT` + `BOOST_USE_ASAN`/`BOOST_USE_TSAN` — see [Concern 6](#concern-6-yield_toh--boostasiospawn) + +#### Demerits / Challenges + +1. **Stackless limitation** — cannot yield from nested calls (verified: not an issue for rippled's shallow usage) +2. **Explicit lifetime management** — `coroutine_handle::destroy()` must be called (mitigated by RAII CoroTask) +3. **Verbose boilerplate** — promise_type, awaiter interfaces (~150-200 lines of infrastructure code) +4. **Debugging** — no visible coroutine stack in debugger (improving with tooling) +5. **Learning curve** — team needs familiarity with C++20 coroutine machinery +6. **Dangling reference risk** — coroutine frames outlive calling scope (mitigated by ASAN + careful design) +7. **No standard library task type** — must write custom CoroTask, awaiters (well-established patterns exist) + +#### Alternative Considered: ASAN Annotations Only + +Instead of full migration, one could keep Boost.Coroutine and add `__sanitizer_start_switch_fiber` / `__sanitizer_finish_switch_fiber` annotations to Coro.ipp to suppress ASAN false positives. This was evaluated and rejected because: + +- It only fixes sanitizer false positives — does NOT reduce 1.5 MB/coroutine memory usage +- Does NOT remove the Boost.Coroutine dependency +- Does NOT provide standard compliance or future-proofing +- The full migration is feasible given shallow yield usage and delivers all the above benefits + +--- + +## 5. Current State Assessment + +### 5.1 Architecture Overview + +
+ +```mermaid +graph TD + subgraph "Request Entry Points" + HTTP["`**HTTP Request** +ServerHandler::onRequest()`"] + WS["`**WebSocket Message** +ServerHandler::onWSMessage()`"] + GRPC["`**gRPC Request** +CallData::process()`"] + end + + subgraph "Coroutine Layer" + POST["`**JobQueue::postCoro()** +Creates Coro ++ schedules job`"] + CORO["`**JobQueue::Coro** +boost::coroutines2:: +coroutine::pull_type +1.5 MB stack per instance`"] + end + + subgraph "JobQueue Thread Pool" + W1["Worker Thread 1"] + W2["Worker Thread 2"] + WN["Worker Thread N"] + end + + subgraph "RPC Handlers" + CTX["`**RPC::Context** +holds shared_ptr`"] + RPC["`**RPC Handler** +e.g. doRipplePathFind`"] + YIELD["`**coro.yield()** +Suspends execution`"] + RESUME["`**coro.post()** +Reschedules on JobQueue`"] + end + + HTTP --> POST + WS --> POST + GRPC --> POST + POST --> CORO + CORO --> W1 + CORO --> W2 + CORO --> WN + W1 --> CTX + W2 --> CTX + CTX --> RPC + RPC --> YIELD + YIELD -.->|"event completes"| RESUME + RESUME --> W1 +``` + +
+ +**Reading the diagram:** + +- Requests arrive via HTTP, WebSocket, or gRPC and are routed to `postCoro()`. +- `postCoro()` creates a `Coro` object (1.5 MB stack) and schedules it on the JobQueue. +- A worker thread picks up the job, creates an `RPC::Context`, and invokes the handler. +- If the handler calls `yield()`, the coroutine suspends and the worker thread is freed. +- When an external event completes, `post()` reschedules the coroutine on the JobQueue. +- A worker thread resumes the coroutine and the handler continues from where it left off. + +### 5.2 `JobQueue::Coro` Implementation Audit + +**File**: `include/xrpl/core/JobQueue.h` (lines 40-120) + `include/xrpl/core/Coro.ipp` + +#### Class Members + +```cpp +class Coro : public std::enable_shared_from_this { + detail::LocalValues lvs_; // Per-coroutine thread-local storage + JobQueue& jq_; // Parent JobQueue reference + JobType type_; // Job type (jtCLIENT_RPC, etc.) + std::string name_; // Name for logging + bool running_; // Is currently executing + std::mutex mutex_; // Prevents concurrent resume + std::mutex mutex_run_; // Guards running_ flag + std::condition_variable cv_; // For join() blocking + boost::coroutines2::coroutine::pull_type coro_; // THE BOOST COROUTINE + boost::coroutines2::coroutine::push_type* yield_; // Yield function pointer + bool finished_; // Debug assertion flag +}; +``` + +#### Boost.Coroutine APIs Used + +| API | Location | Purpose | +| -------------------------------------------------------- | --------------- | --------------------------- | +| `coroutine::pull_type` | `JobQueue.h:52` | The coroutine object itself | +| `coroutine::push_type` | `JobQueue.h:53` | Yield function type | +| `boost::context::protected_fixedsize_stack(1536 * 1024)` | `Coro.ipp:14` | Stack size configuration | +| `#include ` | `JobQueue.h:11` | Header inclusion | + +#### Method Behaviors + +| Method | Behavior | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------- | +| **Constructor** | Creates `pull_type` with 1.5 MB stack. Lambda captures user function. Auto-runs to first `yield()`. | +| **`yield()`** | Increments `jq_.nSuspend_`, calls `(*yield_)()` to suspend. Returns control to caller. | +| **`post()`** | Sets `running_=true`, calls `jq_.addJob()` with a lambda that calls `resume()`. Returns false if JobQueue is stopping. | +| **`resume()`** | Swaps `LocalValues`, acquires `mutex_`, calls `coro_()` to resume. Restores `LocalValues`. Sets `running_=false`, notifies `cv_`. | +| **`runnable()`** | Returns `static_cast(coro_)` — true if coroutine hasn't returned. | +| **`expectEarlyExit()`** | Decrements `nSuspend_`, sets `finished_=true`. Used during shutdown. | +| **`join()`** | Blocks on `cv_` until `running_==false`. | + +### 5.3 Coroutine Execution Lifecycle + +```mermaid +sequenceDiagram + participant HT as Handler Thread + participant JQ as JobQueue + participant WT as Worker Thread + participant C as Coro + participant UF as User Function + + HT->>JQ: postCoro(type, name, fn) + JQ->>C: Coro::Coro() constructor + Note over C: pull_type auto-starts lambda + C->>C: yield_ = &do_yield + C->>C: yield() [initial suspension] + C-->>JQ: Returns to constructor + JQ->>JQ: coro->post() + JQ->>JQ: addJob(type, name, resume_lambda) + JQ-->>HT: Returns shared_ptr + Note over HT: Handler thread is FREE + + WT->>C: resume() [job executes] + Note over C: Swap LocalValues + C->>C: coro_() [resume boost coroutine] + C->>UF: fn(shared_from_this()) + UF->>UF: Do work... + + UF->>C: coro->yield() [suspend] + Note over C: ++nSuspend_, invoke yield_() + C-->>WT: Returns from resume() + Note over WT: Worker thread is FREE + + Note over UF: External event completes + UF->>C: coro->post() [reschedule] + C->>JQ: addJob(resume_lambda) + + WT->>C: resume() [job executes] + C->>C: coro_() [resume] + C->>UF: Continues after yield() + UF->>UF: Finish work + UF-->>C: Return [coroutine complete] + Note over C: running_=false, cv_.notify_all() +``` + +**Reading the diagram:** + +- The handler thread calls `postCoro()`, which constructs the Boost coroutine (auto-runs to the first `yield()`), then schedules it on the JobQueue. +- A worker thread calls `resume()`, which swaps `LocalValues` and resumes the Boost coroutine. The user function runs. +- When the user function calls `yield()`, control returns to the worker thread, which is now free. +- An external event triggers `post()`, which calls `addJob()` to reschedule the coroutine. +- A worker thread resumes the coroutine again, the user function finishes, and `cv_.notify_all()` signals completion. + +### 5.4 All Coroutine Touchpoints + +#### Core Infrastructure (Must Change) + +| File | Role | Lines of Interest | +| ---------------------------------- | ---------------------------------------- | ------------------------- | +| `include/xrpl/core/JobQueue.h` | Coro class definition, postCoro template | Lines 10, 40-120, 385-402 | +| `include/xrpl/core/Coro.ipp` | Coro method implementations | All (122 lines) | +| `include/xrpl/basics/LocalValue.h` | Per-coroutine thread-local storage | Lines 12-59 (LocalValues) | +| `cmake/deps/Boost.cmake` | Boost.Coroutine dependency | Lines 7, 24 | + +#### Entry Points (postCoro Callers) + +| File | Entry Point | Job Type | +| -------------------------------------------- | ---------------------------- | -------------------- | +| `src/xrpld/rpc/detail/ServerHandler.cpp:287` | `onRequest()` — HTTP RPC | `jtCLIENT_RPC` | +| `src/xrpld/rpc/detail/ServerHandler.cpp:325` | `onWSMessage()` — WebSocket | `jtCLIENT_WEBSOCKET` | +| `src/xrpld/app/main/GRPCServer.cpp:102` | `CallData::process()` — gRPC | `jtRPC` | + +#### Context Propagation + +| File | Role | +| --------------------------------------- | ------------------------------------------------------ | +| `src/xrpld/rpc/Context.h:27` | `RPC::Context` holds `shared_ptr coro` | +| `src/xrpld/rpc/ServerHandler.h:174-188` | `processSession/processRequest` pass coro through | + +#### Active Coroutine Consumer (yield/post) + +| File | Usage | +| --------------------------------------------------- | ----------------------------------------------------- | +| `src/xrpld/rpc/handlers/RipplePathFind.cpp:131` | `context.coro->yield()` — suspends for path-finding | +| `src/xrpld/rpc/handlers/RipplePathFind.cpp:116-123` | Continuation calls `coro->post()` or `coro->resume()` | + +#### Test Files + +| File | Tests | +| ---------------------------------- | ------------------------------------------------------------- | +| `src/test/core/Coroutine_test.cpp` | `correct_order`, `incorrect_order`, `thread_specific_storage` | +| `src/test/core/JobQueue_test.cpp` | `testPostCoro` (post/resume cycles, shutdown behavior) | +| `src/test/app/Path_test.cpp` | Path-finding RPC via postCoro | +| `src/test/jtx/impl/AMMTest.cpp` | AMM RPC via postCoro | + +### 5.5 Suspension/Continuation Model + +The current model documented in `src/xrpld/rpc/README.md` defines four functional types: + +``` +Callback = std::function — generic 0-arg function +Continuation = std::function — calls Callback later +Suspend = std::function — runs Continuation, suspends +Coroutine = std::function — given a Suspend, starts work +``` + +In practice, `JobQueue::Coro` simplifies this to: + +- **Suspend** = `coro->yield()` +- **Continue** = `coro->post()` (async on JobQueue) or `coro->resume()` (sync on current thread) + +### 5.6 CMake Dependency + +In `cmake/deps/Boost.cmake`: + +```cmake +find_package(Boost REQUIRED COMPONENTS ... coroutine ...) +target_link_libraries(xrpl_boost INTERFACE ... Boost::coroutine ...) +``` + +Additionally in `cmake/XrplInterface.cmake`: + +```cpp +BOOST_COROUTINES2_NO_DEPRECATION_WARNING // Suppresses Boost.Coroutine deprecation warnings +``` + +### 5.7 Existing C++20 Coroutine Usage + +rippled **already uses C++20 coroutines** in test code: + +- `src/tests/libxrpl/net/HTTPClient.cpp` uses `co_await` with `boost::asio::use_awaitable` +- Demonstrates team familiarity with C++20 coroutine syntax +- Proves compiler toolchain supports C++20 coroutines + +--- + +## 6. Migration Strategy + +### 6.1 Incremental vs Atomic Migration + +**Decision: Incremental (multi-phase) migration.** + +Rationale: + +- Only **one RPC handler** (`RipplePathFind`) actively uses `yield()/post()` suspension +- The **three entry points** (HTTP, WS, gRPC) all funnel through `postCoro()` +- The `RPC::Context.coro` field is the sole propagation mechanism +- We can introduce a new C++20 coroutine system alongside the existing one and migrate callsites incrementally + +### 6.2 Migration Phases + +```mermaid +graph TD + subgraph PH1 ["Phase 1: Foundation"] + P1A["`Create CoroTask type +(promise_type, awaiter)`"] + P1B["`Create JobQueueAwaiter +(schedules resume on JobQueue)`"] + P1C["`Add postCoroTask() to JobQueue +(parallel to postCoro)`"] + P1D["Unit tests for new primitives"] + P1A --> P1B --> P1C --> P1D + end + + subgraph PH2 ["Phase 2: Entry Point Migration"] + P2A["`Migrate +ServerHandler::onRequest()`"] + P2B["`Migrate +ServerHandler::onWSMessage()`"] + P2C["`Migrate +GRPCServer::CallData::process()`"] + P2D["`Update +RPC::Context to use new type`"] + P2A --> P2D + P2B --> P2D + P2C --> P2D + end + + subgraph PH3 ["Phase 3: Handler Migration"] + P3A["`Migrate +RipplePathFind handler`"] + P3B["`Verify all other handlers +(no active yield usage)`"] + end + + subgraph PH4 ["Phase 4: Cleanup"] + P4A["`Remove +old Coro class`"] + P4B["`Remove +Boost.Coroutine from CMake`"] + P4C["`Remove +deprecation warning suppression`"] + P4D["`Final benchmarks +& validation`"] + P4A --> P4B --> P4C --> P4D + end + + PH1 --> PH2 + PH2 --> PH3 + PH3 --> PH4 +``` + +**Reading the diagram:** + +- **[Phase 1](#phase-1-new-coroutine-primitives)** builds the new coroutine primitives (`CoroTask`, `JobQueueAwaiter`, `postCoroTask()`) alongside the existing Boost code. No production code changes. +- **[Phase 2](#phase-2-entry-point-migration)** migrates the three entry points (HTTP, WebSocket, gRPC) to use `postCoroTask()` and updates `RPC::Context`. +- **[Phase 3](#phase-3-handler-migration)** migrates the `RipplePathFind` handler and verifies no other handlers use `yield()`. +- **[Phase 4](#phase-4-cleanup)** removes the old `Coro` class, `Coro.ipp`, `Boost::coroutine` from CMake, and runs final benchmarks. +- Each phase depends on the previous one completing. The old code is not deleted until [Phase 4](#phase-4-cleanup), so rollback is safe through Phases 1–3. + +### 6.3 Coexistence Strategy + +During migration, both implementations will coexist: + +```mermaid +graph LR + subgraph "Transition Period" + OLD["`**JobQueue::Coro** +(Boost, existing)`"] + NEW["`**JobQueue::CoroTask** +(C++20, new)`"] + CTX["RPC::Context"] + end + + CTX -->|"phase 1-2"| OLD + CTX -->|"phase 2-3"| NEW + + style OLD fill:#fdd,stroke:#c00,color:#000 + style NEW fill:#dfd,stroke:#0a0,color:#000 +``` + +- `RPC::Context` will temporarily hold both `shared_ptr` (old) and the new coroutine handle +- Entry points will be migrated one at a time +- Each migration is independently testable +- Once all entry points and handlers are migrated, old code is removed + +### 6.4 Breaking Changes & Compatibility + +| Concern | Impact | Mitigation | +| -------------------------------- | ----------------------------------------- | ---------------------------------------------------------------------------- | +| `RPC::Context::coro` type change | All RPC handlers receive context | Migrate context field last, after all consumers updated | +| `postCoro()` removal | 3 callers | Replace with `postCoroTask()`, remove old API in [Phase 4](#phase-4-cleanup) | +| `LocalValue` integration | Thread-local storage must work | New implementation must swap LocalValues identically | +| Shutdown behavior | `expectEarlyExit()`, `nSuspend_` tracking | Replicate in new CoroTask | + +--- + +## 7. Implementation Plan + +> **ASAN** = AddressSanitizer | **TSAN** = ThreadSanitizer | **RAII** = Resource Acquisition Is Initialization + +### 7.1 New Type Design + +#### `CoroTask` — Coroutine Return Type + +```mermaid +classDiagram + class CoroTask~T~ { + +Handle handle_ + +CoroTask(Handle h) + +destroy() + +bool done() const + +T get() const + +bool await_ready() const + +void await_suspend(coroutine_handle h) const + +T await_resume() const + } + + class promise_type { + -result_ : variant~T, exception_ptr~ + -continuation_ : coroutine_handle + +CoroTask get_return_object() + +suspend_always initial_suspend() + +FinalAwaiter final_suspend() + +void return_value(T) + +void return_void() + +void unhandled_exception() + } + + class FinalAwaiter { + +bool await_ready() + +coroutine_handle await_suspend(coroutine_handle~promise_type~) + +void await_resume() + } + + class JobQueueAwaiter { + -jq_ : JobQueue + -type_ : JobType + -name_ : string + +bool await_ready() + +void await_suspend(coroutine_handle h) + +void await_resume() + } + + CoroTask --> promise_type : contains + promise_type --> FinalAwaiter : returns from final_suspend + CoroTask ..> JobQueueAwaiter : used with co_await +``` + +#### `JobQueueAwaiter` — Schedules Resumption on JobQueue + +```cpp +// Conceptual design — actual implementation may vary +struct JobQueueAwaiter { + JobQueue& jq; + JobType type; + std::string name; + + bool await_ready() { return false; } // Always suspend + + void await_suspend(std::coroutine_handle<> h) { + // Schedule coroutine resumption as a job + jq.addJob(type, name, [h]() { h.resume(); }); + } + + void await_resume() {} +}; +``` + +### 7.2 New Architecture Overview + +The following diagram mirrors [Section 5.1](#51-architecture-overview) but shows the target state after migration. + +
+ +```mermaid +graph TD + subgraph "Request Entry Points" + HTTP["`**HTTP Request** +ServerHandler::onRequest()`"] + WS["`**WebSocket Message** +ServerHandler::onWSMessage()`"] + GRPC["`**gRPC Request** +CallData::process()`"] + end + + subgraph "Coroutine Layer" + POST["`**JobQueue::postCoroTask()** +Creates CoroTaskRunner ++ schedules job`"] + TASK["`**CoroTask** +~200-500 byte heap frame +managed by CoroTaskRunner`"] + end + + subgraph "JobQueue Thread Pool" + W1["Worker Thread 1"] + W2["Worker Thread 2"] + WN["Worker Thread N"] + end + + subgraph "RPC Handlers" + CTX["`**RPC::Context** +holds CoroTaskRunner ref`"] + RPC["`**RPC Handler** +e.g. doRipplePathFind`"] + AWAIT["`**co_await yieldAndPost()** +Suspends coroutine`"] + SCHED["`**JobQueueAwaiter** +Reschedules via addJob()`"] + end + + HTTP --> POST + WS --> POST + GRPC --> POST + POST --> TASK + TASK --> W1 + TASK --> W2 + TASK --> WN + W1 --> CTX + W2 --> CTX + CTX --> RPC + RPC --> AWAIT + AWAIT -.->|"event completes"| SCHED + SCHED --> W1 +``` + +
+ +**Reading the diagram:** + +- Requests arrive via HTTP, WebSocket, or gRPC and are routed to `postCoroTask()`. +- `postCoroTask()` creates a `CoroTaskRunner` wrapping a `CoroTask` (~200-500 byte heap frame) and schedules it on the JobQueue. +- A worker thread picks up the job, creates an `RPC::Context`, and invokes the handler. +- If the handler needs to wait, it calls `co_await yieldAndPost()` — the coroutine suspends and the worker thread is freed. +- When the external event completes, `JobQueueAwaiter` reschedules the coroutine via `addJob()`. +- A worker thread resumes the coroutine (`handle.resume()`) and the handler continues from where it left off. + +### 7.3 New Coroutine Execution Lifecycle + +The following diagram mirrors [Section 5.3](#53-coroutine-execution-lifecycle) but shows the C++20 coroutine flow. + +```mermaid +sequenceDiagram + participant HT as Handler Thread + participant JQ as JobQueue + participant WT as Worker Thread + participant R as CoroTaskRunner + participant UF as User Function + + HT->>JQ: postCoroTask(type, name, fn) + JQ->>R: CoroTaskRunner created + Note over R: CoroTask constructed
initial_suspend = suspend_always
~200-500 byte frame on heap + R->>JQ: addJob(type, name, resume_lambda) + JQ-->>HT: Returns shared_ptr + Note over HT: Handler thread is FREE + + WT->>R: resume() [job executes] + Note over R: Swap LocalValues + R->>R: handle_.resume() + R->>UF: fn(runner) starts + UF->>UF: Do work... + + UF->>R: co_await yieldAndPost() + Note over R: ++nSuspend_
await_suspend schedules addJob() + R-->>WT: Returns from resume() + Note over WT: Worker thread is FREE + + Note over UF: External event completes + + WT->>R: resume() [job executes] + Note over R: Swap LocalValues + R->>R: handle_.resume() + Note over UF: Continues after co_await + UF->>UF: Finish work + UF-->>R: co_return [coroutine complete] + Note over R: final_suspend → FinalAwaiter
returns noop_coroutine() + Note over R: runCount_--, cv_.notify_all() +``` + +**Reading the diagram:** + +- The handler thread calls `postCoroTask()`, which creates a `CoroTaskRunner` wrapping a lazily-started `CoroTask` (no auto-run — unlike Boost's `pull_type`), then schedules it on the JobQueue. +- A worker thread calls `resume()`, which swaps `LocalValues` and calls `handle_.resume()`. The user function starts. +- When the user function calls `co_await yieldAndPost()`, the awaiter's `await_suspend()` increments `nSuspend_` and schedules a new job via `addJob()`. The worker thread is freed. +- When a worker thread picks up the rescheduled job, `resume()` calls `handle_.resume()` again, continuing from the `co_await` point. +- When the user function executes `co_return`, `final_suspend()` returns the `FinalAwaiter`, which returns `std::noop_coroutine()` (no continuation to resume). The runner signals completion. + +**Key differences from Boost lifecycle ([Section 5.3](#53-coroutine-execution-lifecycle)):** + +| Aspect | Boost (old) | C++20 (new) | +| -------------- | ---------------------------------------------------- | ----------------------------------------------------- | +| **Creation** | `pull_type` auto-runs to first `yield()` | `initial_suspend = suspend_always` — no auto-run | +| **Frame** | 1.5 MB stack allocated at construction | ~200-500 byte heap frame | +| **Suspend** | `(*yield_)()` — context switch via fcontext/ucontext | `co_await` — compiler-generated state machine | +| **Resume** | `coro_()` — context switch back | `handle_.resume()` — function call | +| **Completion** | `static_cast(coro_)` returns false | `handle_.done()` returns true, `FinalAwaiter` runs | +| **Cleanup** | Destructor asserts `finished_` | RAII: `CoroTask` destructor calls `handle_.destroy()` | + +### 7.4 Mapping: Old API → New API + +```mermaid +graph LR + subgraph "Current (Boost)" + direction TB + A1["postCoro(type, name, fn)"] + A2["coro->yield()"] + A3["coro->post()"] + A4["coro->resume()"] + A5["coro->join()"] + A6["coro->runnable()"] + A7["coro->expectEarlyExit()"] + end + + subgraph "New (C++20)" + direction TB + B1["`postCoroTask(type, name, fn) +fn returns CoroTask`"] + B2["`co_await JobQueueAwaiter +{jq, type, name}`"] + B3["`Built into await_suspend() +(automatic scheduling)`"] + B4["`handle.resume() +(direct call)`"] + B5["`co_await task +(continuation-based)`"] + B6["handle.done()"] + B7["handle.destroy() + cleanup"] + end + + A1 --> B1 + A2 --> B2 + A3 --> B3 + A4 --> B4 + A5 --> B5 + A6 --> B6 + A7 --> B7 +``` + +### 7.5 File Changes Required + +#### Phase 1: New Coroutine Primitives + +| File | Action | Description | +| ------------------------------------- | ---------- | ------------------------------------------------------------- | +| `include/xrpl/core/CoroTask.h` | **CREATE** | `CoroTask` return type with `promise_type`, `FinalAwaiter` | +| `include/xrpl/core/JobQueueAwaiter.h` | **CREATE** | Awaiter that schedules resume on JobQueue | +| `include/xrpl/core/JobQueue.h` | **MODIFY** | Add `postCoroTask()` template alongside existing `postCoro()` | +| `src/test/core/CoroTask_test.cpp` | **CREATE** | Unit tests for `CoroTask` and `JobQueueAwaiter` | + +#### Phase 2: Entry Point Migration + +| File | Action | Description | +| ---------------------------------------- | ---------- | ---------------------------------------------------------------------- | +| `src/xrpld/rpc/detail/ServerHandler.cpp` | **MODIFY** | `onRequest()` and `onWSMessage()`: replace `postCoro` → `postCoroTask` | +| `src/xrpld/rpc/ServerHandler.h` | **MODIFY** | Update `processSession`/`processRequest` signatures | +| `src/xrpld/app/main/GRPCServer.cpp` | **MODIFY** | `CallData::process()`: replace `postCoro` → `postCoroTask` | +| `src/xrpld/app/main/GRPCServer.h` | **MODIFY** | Update `process()` method signature | +| `src/xrpld/rpc/Context.h` | **MODIFY** | Change `shared_ptr` to new coroutine handle type | + +#### Phase 3: Handler Migration + +| File | Action | Description | +| ------------------------------------------- | ---------- | ---------------------------------------------------------------- | +| `src/xrpld/rpc/handlers/RipplePathFind.cpp` | **MODIFY** | Replace `context.coro->yield()` / `coro->post()` with `co_await` | +| `src/test/app/Path_test.cpp` | **MODIFY** | Update test to use new coroutine API | +| `src/test/jtx/impl/AMMTest.cpp` | **MODIFY** | Update test to use new coroutine API | + +#### Phase 4: Cleanup + +| File | Action | Description | +| ---------------------------------- | ---------- | ------------------------------------------------------------------ | +| `include/xrpl/core/Coro.ipp` | **DELETE** | Remove old Boost.Coroutine implementation | +| `include/xrpl/core/JobQueue.h` | **MODIFY** | Remove `Coro` class, `postCoro()`, `Coro_create_t`, Boost includes | +| `cmake/deps/Boost.cmake` | **MODIFY** | Remove `coroutine` from `find_package` and `target_link_libraries` | +| `cmake/XrplInterface.cmake` | **MODIFY** | Remove `BOOST_COROUTINES2_NO_DEPRECATION_WARNING` | +| `src/test/core/Coroutine_test.cpp` | **MODIFY** | Rewrite tests for new CoroTask | +| `src/test/core/JobQueue_test.cpp` | **MODIFY** | Update `testPostCoro` to use new API | +| `include/xrpl/basics/LocalValue.h` | **MODIFY** | Update LocalValues integration for C++20 coroutines | + +### 7.6 LocalValue Integration Design + +> **TLS** = Thread-Local Storage | **LV** = LocalValues (per-coroutine storage map) + +The `LocalValue` system provides per-coroutine isolation by swapping thread-local storage when a coroutine is resumed or suspended. This swap pattern is **shared by both the old Boost and the new C++20 implementation** — only the resume/yield mechanism differs. + +#### Old Flow (Boost.Coroutine2) + +```mermaid +sequenceDiagram + participant WT as Worker Thread + participant LV as LocalValues (TLS) + participant C as Boost Coroutine + + Note over WT: Thread has its own LocalValues + + WT->>LV: saved = getLocalValues().release() + WT->>LV: getLocalValues().reset(&coro.lvs_) + Note over LV: Now pointing to coroutine's storage + + WT->>C: coro_() — enters pull_type + Note over C: User code sees coroutine's LocalValues + + C-->>WT: (*yield_)() — push_type returns + + WT->>LV: getLocalValues().release() + WT->>LV: getLocalValues().reset(saved) + Note over LV: Restored to thread's storage +``` + +#### New Flow (C++20 Coroutines) + +```mermaid +sequenceDiagram + participant WT as Worker Thread + participant LV as LocalValues (TLS) + participant C as C++20 Coroutine + + Note over WT: Thread has its own LocalValues + + WT->>LV: saved = getLocalValues().release() + WT->>LV: getLocalValues().reset(&runner.lvs_) + Note over LV: Now pointing to coroutine's storage + + WT->>C: handle.resume() + Note over C: User code sees coroutine's LocalValues + + C-->>WT: co_await suspends — returns to caller + + WT->>LV: getLocalValues().release() + WT->>LV: getLocalValues().reset(saved) + Note over LV: Restored to thread's storage +``` + +#### Reading the diagrams + +- **Before resume**: The worker thread saves its own TLS pointer and installs the coroutine's `lvs_` map into thread-local storage. Any `LocalValue` access inside the coroutine will read/write the coroutine's copy, not the thread's. +- **During execution**: The coroutine body runs with its own isolated `LocalValues`. Multiple coroutines on different threads each see their own data. +- **After suspend/return**: The thread's original TLS pointer is restored. This ensures the worker thread's own `LocalValues` are not contaminated by the coroutine's mutations. +- **Old vs new difference**: In Boost, the swap wraps a `coro_()` / `(*yield_)()` call pair. In C++20, the swap wraps `handle.resume()` and the `co_await` suspension point. The swap logic itself is identical. + +### 7.7 RipplePathFind Migration Design + +Current pattern: + +```cpp +// Continuation callback +auto callback = [&context]() { + std::shared_ptr coroCopy{context.coro}; + if (!coroCopy->post()) { + coroCopy->resume(); // Fallback: run on current thread + } +}; + +// Start async work, then suspend +jvResult = makeLegacyPathRequest(request, callback, ...); +if (request) { + context.coro->yield(); // ← SUSPEND HERE + jvResult = request->doStatus(context.params); // ← RESUME HERE +} +``` + +Target pattern: + +```cpp +// Start async work, suspend via co_await +jvResult = makeLegacyPathRequest(request, /* awaiter-based callback */, ...); +if (request) { + co_await PathFindAwaiter{context}; // ← SUSPEND + RESUME via awaiter + jvResult = request->doStatus(context.params); +} +``` + +The `PathFindAwaiter` will encapsulate the scheduling logic currently in the lambda continuation. + +--- + +## 8. Testing & Validation Strategy + +### 8.1 Test Architecture + +```mermaid +graph LR + subgraph "Unit Tests" + direction TB + UT1["`**CoroTask_test** +- Construction/destruction +- co_return values +- Exception propagation +- Lifetime management`"] + UT2["`**JobQueueAwaiter_test** +- Schedule on correct JobType +- Resume on worker thread +- Shutdown handling`"] + UT3["`**LocalValue integration** +- Per-coroutine isolation +- Multi-coroutine concurrent +- Cross-thread consistency`"] + end + + subgraph "Migration Tests" + direction TB + MT1["`**Coroutine_test rewrite** +- correct_order +- incorrect_order +- thread_specific_storage`"] + MT2["`**PostCoro migration** +- Post/resume cycles +- Shutdown rejection +- Early exit`"] + end + + subgraph "Integration Tests" + direction TB + IT1["`**RPC Path Finding** +- Suspend/resume flow +- Shutdown during suspend +- Concurrent requests`"] + IT2["`**Full --unittest suite** +- All existing tests pass +- No regressions`"] + end + + subgraph "Performance Tests" + direction TB + PT1["Memory benchmarks"] + PT2["Context switch benchmarks"] + PT3["RPC throughput under load"] + end + + subgraph "Sanitizer Tests" + direction TB + ST1["`**ASAN** +(memory errors)`"] + ST2["`**TSAN** +(data races)`"] + ST3["`**UBSan** +(undefined behavior)`"] + end + + UT1 --> MT1 + UT2 --> MT2 + MT1 --> IT1 + MT2 --> IT2 + IT1 --> PT1 + IT2 --> PT2 + PT1 --> ST1 + PT2 --> ST2 + PT3 --> ST3 +``` + +### 8.2 Benchmarking Tests + +> **RSS** = Resident Set Size (physical memory currently used by a process) + +#### Memory Usage Benchmark + +``` +Test: Create N coroutines, measure RSS +- N = 100, 1000, 10000 +- Measure: peak RSS, per-coroutine overhead +- Compare: Boost (N * 1.5 MB + overhead) vs C++20 (N * ~500B + overhead) +- Tool: /proc/self/status (VmRSS), or getrusage() +``` + +#### Context Switch Benchmark + +``` +Test: Yield/resume M times across N coroutines +- M = 100,000 iterations +- N = 1, 10, 100 concurrent coroutines +- Measure: total time, per-switch latency (ns) +- Compare: Boost yield/resume cycle vs C++20 co_await/resume cycle +- Tool: std::chrono::high_resolution_clock +``` + +#### RPC Throughput Benchmark + +``` +Test: Concurrent ripple_path_find requests +- Load: 10, 50, 100 concurrent requests +- Measure: requests/second, p50/p95/p99 latency +- Compare: before vs after migration +- Tool: Custom load generator or existing perf infrastructure +``` + +### 8.3 Unit Test Coverage + +| Test | What It Validates | +| ------------------------------ | --------------------------------------------- | +| `CoroTask` basic | Coroutine runs to completion, handle cleanup | +| `CoroTask` with value | `co_return` value correctly retrieved | +| `CoroTask` exception | `unhandled_exception()` captures and rethrows | +| `CoroTask` cancellation | Destruction before completion cleans up | +| `JobQueueAwaiter` basic | `co_await` suspends, resumes on worker thread | +| `JobQueueAwaiter` shutdown | Returns false / throws when JobQueue stopping | +| `PostCoroTask` lifecycle | Create → suspend → resume → complete | +| `PostCoroTask` multiple yields | Multiple co_await points in sequence | +| `LocalValue` isolation | 4 coroutines, each sees own LocalValue | +| `LocalValue` cross-thread | Resume on different thread, values preserved | + +### 8.4 Integration Testing + +- **All existing `--unittest` tests must pass unchanged** (except coroutine-specific tests that are rewritten) +- **Path_test** must pass with identical behavior +- **AMMTest** RPC tests must pass +- **ServerHandler** HTTP/WS handling must work end-to-end + +### 8.5 Sanitizer Testing + +Per `docs/build/sanitizers.md`: + +```bash +# ASAN (memory errors — especially important for coroutine frame lifetime) +export SANITIZERS=address,undefinedbehavior +# Build + test + +# TSAN (data races — critical for concurrent coroutine resume) +export SANITIZERS=thread +# Build + test (separate build — cannot mix with ASAN) +``` + +**Key benefit**: Removing Boost.Coroutine eliminates the `__asan_handle_no_return` false positives caused by Boost context switching (documented in `docs/build/sanitizers.md` line 184). + +### 8.6 Regression Testing Methodology + +```mermaid +graph LR + subgraph "Before Migration (Baseline)" + direction TB + B1["`Build on +develop branch`"] + B2["`Run --unittest +(record pass/fail)`"] + B3["`Run memory benchmark +(record RSS)`"] + B4["`Run context switch +benchmark (record ns/switch)`"] + B1 --> B2 --> B3 --> B4 + end + + subgraph "After Migration" + direction TB + A1["`Build on +feature branch`"] + A2["`Run --unittest +(compare pass/fail)`"] + A3["`Run memory benchmark +(compare RSS)`"] + A4["`Run context switch +benchmark (compare ns/switch)`"] + A1 --> A2 --> A3 --> A4 + end + + subgraph "Acceptance Criteria" + direction TB + C1["Zero test regressions"] + C2["Memory: ≤ baseline"] + C3["Context switch: ≤ baseline"] + C4["ASAN/TSAN clean"] + end + + B2 -.->|compare| C1 + A2 -.->|compare| C1 + B3 -.->|compare| C2 + A3 -.->|compare| C2 + B4 -.->|compare| C3 + A4 -.->|compare| C3 + A2 -.-> C4 +``` + +--- + +## 9. Risks & Mitigation + +### 9.1 Risk Matrix + +
+ +```mermaid +--- +config: + quadrantChart: + chartWidth: 800 + chartHeight: 800 + pointRadius: 5 + pointTextPadding: 8 + pointLabelFontSize: 14 + titleFontSize: 18 +--- +quadrantChart + title Risk Assessment — Probability vs Impact + x-axis Low Probability --> High Probability + y-axis Low Impact --> High Impact + quadrant-1 Mitigate Actively + quadrant-2 Monitor Closely + quadrant-3 Accept + quadrant-4 Mitigate if Easy + Frame lifetime bugs: [0.55, 0.95] + Dangling references: [0.62, 0.78] + Data races on resume: [0.48, 0.86] + Shutdown races: [0.58, 0.55] + Exception loss: [0.42, 0.50] + Perf regression: [0.22, 0.90] + LocalValue corruption: [0.18, 0.72] + Symmetric transfer unavail: [0.08, 0.92] + Compiler bugs: [0.32, 0.60] + Colored fn spread: [0.30, 0.38] + Missed consumer: [0.14, 0.45] + Future deep yield: [0.22, 0.30] + Third-party Boost dep: [0.08, 0.20] +``` + +
+ +| Risk | Probability | Impact | Mitigation | +| ----------------------------------------------------- | ----------- | ------ | -------------------------------------------------------------------------------- | +| **Performance regression** in context switching | Low | High | Benchmark before/after; C++20 should be faster | +| **Coroutine frame lifetime bugs** (use-after-destroy) | Medium | High | ASAN testing, RAII wrapper for handle, code review | +| **Data races on resume** | Medium | High | TSAN testing, careful await_suspend() implementation | +| **LocalValue corruption** across threads | Low | High | Dedicated test with 4+ concurrent coroutines | +| **Shutdown race conditions** | Medium | Medium | Replicate existing mutex/cv pattern in new design | +| **Missed coroutine consumer** during migration | Low | Medium | Exhaustive grep audit ([Section 5.4](#54-all-coroutine-touchpoints) is complete) | +| **Compiler bugs** in coroutine codegen | Low | Medium | Test on all three compilers (GCC, Clang, MSVC) | +| **Exception loss** across suspension points | Medium | Medium | Test exception propagation in every phase | +| **Third-party code depending on Boost.Coroutine** | Very Low | Low | Grep confirms only internal usage | +| **Dangling references in coroutine frames** | Medium | High | ASAN testing, avoid reference params in coroutine functions, use shared_ptr | +| **Colored function infection spreading** | Low | Medium | Only 4 call sites need co_await; no nested handlers suspend | +| **Symmetric transfer not available** | Very Low | High | All target compilers (GCC 12+, Clang 16+) support symmetric transfer | +| **Future handler adding deep yield** | Low | Medium | Code review + CI: static analysis flag any yield from nested depth | + +### 9.2 Rollback Strategy + +```mermaid +graph TD + START["Migration In Progress"] + CHECK{"`Critical Issue +Discovered?`"} + PHASE{"Which Phase?"} + + P1["`**Phase 1:** Delete new files +No production code changed`"] + P2["`**Phase 2:** Revert entry +point changes +Old postCoro still present`"] + P3["`**Phase 3:** Revert handler changes +Old Coro still present`"] + P4["`**Phase 4:** Cannot easily rollback +Old code deleted`"] + + PREVENT["`**Prevention:** +Do NOT delete old code +until Phase 4 is fully validated`"] + + START --> CHECK + CHECK -->|Yes| PHASE + CHECK -->|No| DONE["Continue Migration"] + PHASE -->|1| P1 + PHASE -->|2| P2 + PHASE -->|3| P3 + PHASE -->|4| P4 + P4 --> PREVENT +``` + +**Key principle**: Old `Coro` class and `postCoro()` remain in the codebase through Phases 1-3. They are only removed in [Phase 4](#phase-4-cleanup), after all migration is validated. Each phase is independently revertible via `git revert`. + +### 9.3 Specific Risk: Stackful → Stackless Limitation + +**The Big Question**: Can all current `yield()` call sites work with stackless `co_await`? + +**Analysis**: + +```mermaid +graph TD + Q["`Does yield() get called from +a deeply nested function?`"] + Q -->|Yes| PROBLEM["`**PROBLEM:** co_await can't +suspend from nested calls`"] + Q -->|No| OK["`**OK:** Direct co_await +in coroutine function`"] + + CHECK1["`RipplePathFind.cpp:131 +context.coro.yield()`"] + CHECK1 -->|"Called directly in handler"| OK + + CHECK2["`Coroutine_test.cpp +c.yield()`"] + CHECK2 -->|"Called directly in lambda"| OK + + CHECK3["`JobQueue_test.cpp +c.yield()`"] + CHECK3 -->|"Called directly in lambda"| OK + + style OK fill:#dfd,stroke:#0a0,color:#000 + style PROBLEM fill:#fdd,stroke:#c00,color:#000 +``` + +**Result**: All `yield()` calls are in the direct body of the postCoro lambda or RPC handler function. **No deep nesting exists.** Migration to stackless `co_await` is fully feasible without architectural redesign. + +--- + +## 10. Timeline & Milestones + +### 10.1 Milestone Overview + +```mermaid +gantt + title Migration Timeline + dateFormat YYYY-MM-DD + axisFormat %b %d + tickInterval 1week + + section Phase 1 + Design types :p1a, 2026-02-26, 14d + Implement :p1b, after p1a, 14d + Unit tests :p1c, after p1b, 10d + PR 1 :milestone, p1m, after p1c, 0d + + section Phase 2 + ServerHandler :p2a, after p1m, 14d + GRPCServer :p2b, after p2a, 10d + Context :p2c, after p2b, 10d + PR 2 :milestone, p2m, after p2c, 0d + + section Phase 3 + RipplePathFind :p3a, after p2m, 14d + Test updates :p3b, after p3a, 10d + PR 3 :milestone, p3m, after p3b, 0d + + section Phase 4 + Cleanup :p4a, after p3m, 14d + Benchmarks :p4b, after p4a, 10d + Sanitizers :p4c, after p4b, 10d + PR 4 :milestone, p4m, after p4c, 0d +``` + +### 10.2 Milestone Details + +#### Milestone 1: New Coroutine Primitives (PR #1) + +**Deliverables**: + +- `CoroTask` with `promise_type`, `FinalAwaiter` +- `CoroTask` specialization +- `JobQueueAwaiter` for scheduling on JobQueue +- `postCoroTask()` on `JobQueue` +- LocalValue integration in new coroutine type +- Unit test suite: `CoroTask_test.cpp` + +**Acceptance Criteria**: + +- All new unit tests pass +- Existing `--unittest` suite passes (no regressions from new code) +- ASAN + TSAN clean on new tests +- Code compiles on GCC 12+, Clang 16+ + +#### Milestone 2: Entry Point Migration (PR #2) + +**Deliverables**: + +- `ServerHandler::onRequest()` uses `postCoroTask()` +- `ServerHandler::onWSMessage()` uses `postCoroTask()` +- `GRPCServer::CallData::process()` uses `postCoroTask()` +- `RPC::Context` updated to carry new coroutine type +- `processSession`/`processRequest` signatures updated + +**Acceptance Criteria**: + +- HTTP, WebSocket, and gRPC RPC requests work end-to-end +- Full `--unittest` suite passes +- Manual smoke test: `ripple_path_find` via HTTP/WS + +#### Milestone 3: Handler Migration (PR #3) + +**Deliverables**: + +- `RipplePathFind` uses `co_await` instead of `yield()/post()` +- Path_test and AMMTest updated +- Coroutine_test and JobQueue_test updated for new API + +**Acceptance Criteria**: + +- Path-finding suspension/continuation works correctly +- All `--unittest` tests pass +- Shutdown-during-pathfind scenario tested + +#### Milestone 4: Cleanup & Validation (PR #4) + +**Deliverables**: + +- Old `Coro` class and `Coro.ipp` removed +- `postCoro()` removed from `JobQueue` +- `Boost::coroutine` removed from CMake +- `BOOST_COROUTINES2_NO_DEPRECATION_WARNING` removed +- Performance benchmark results documented +- Sanitizer test results documented + +**Acceptance Criteria**: + +- Build succeeds without Boost.Coroutine +- Full `--unittest` suite passes +- Memory per coroutine ≤ 10KB (down from 1.5 MB) +- Context switch time ≤ baseline +- ASAN, TSAN, UBSan all clean + +--- + +## 11. Standards & Guidelines + +### 11.1 Coroutine Design Standards + +#### Rule 1: All coroutine return types must use RAII for handle lifetime + +```cpp +// GOOD: Handle destroyed in destructor +~CoroTask() { + if (handle_) handle_.destroy(); +} + +// BAD: Manual destroy calls scattered in code +void cleanup() { handle_.destroy(); } // Easy to forget +``` + +#### Rule 2: Never resume a coroutine from within `await_suspend()` + +```cpp +// GOOD: Schedule resume on executor +void await_suspend(std::coroutine_handle<> h) { + jq_.addJob(type_, name_, [h]() { h.resume(); }); +} + +// BAD: Direct resume in await_suspend (blocks caller) +void await_suspend(std::coroutine_handle<> h) { + h.resume(); // Defeats the purpose of suspension +} +``` + +#### Rule 3: Use `suspend_always` for `initial_suspend()` (lazy start) + +```cpp +// GOOD: Lazy start — coroutine doesn't run until explicitly resumed +std::suspend_always initial_suspend() { return {}; } + +// BAD for our use case: Eager start — runs immediately on creation +std::suspend_never initial_suspend() { return {}; } +``` + +Rationale: Matches existing Boost behavior where `postCoro()` schedules execution, not the constructor. + +#### Rule 4: Always handle `unhandled_exception()` explicitly + +```cpp +void unhandled_exception() { + exception_ = std::current_exception(); + // NEVER: just swallow the exception + // NEVER: std::terminate() without logging +} +``` + +#### Rule 5: Use `suspend_always` for `final_suspend()` to enable continuation + +```cpp +// GOOD: Suspend at end to allow cleanup and value retrieval +auto final_suspend() noexcept { + struct FinalAwaiter { + bool await_ready() noexcept { return false; } + std::coroutine_handle<> await_suspend( + std::coroutine_handle h) noexcept { + if (h.promise().continuation_) + return h.promise().continuation_; // Resume waiter + return std::noop_coroutine(); + } + void await_resume() noexcept {} + }; + return FinalAwaiter{}; +} +``` + +#### Rule 6: Coroutine functions must be clearly marked + +```cpp +// GOOD: Return type makes it obvious this is a coroutine +CoroTask doRipplePathFind(RPC::JsonContext& context) { + co_await ...; + co_return result; +} + +// BAD: Coroutine hidden behind auto or unclear return type +auto doSomething() { co_return; } +``` + +### 11.2 Coding Guidelines + +#### Thread Safety + +1. **Never resume a coroutine concurrently from two threads.** Use the same mutex pattern as existing `Coro::mutex_` to prevent races. +2. **`await_suspend()` is the synchronization point.** All state visible before `await_suspend()` must be visible after `await_resume()`. +3. **Use `std::atomic` or mutexes for shared state** between coroutine and continuation callback. + +#### Memory Management + +1. **`CoroTask` owns its `coroutine_handle`**. It is move-only, non-copyable. +2. **Never store raw `coroutine_handle<>`** in long-lived data structures without clear ownership. +3. **Prefer `shared_ptr>`** when multiple parties need to observe/wait on a coroutine, mirroring the existing `shared_ptr` pattern. + +#### Error Handling + +1. **Exceptions thrown in coroutine body** are captured by `promise_type::unhandled_exception()` and rethrown in `await_resume()`. +2. **Never let exceptions escape `final_suspend()`** — it's `noexcept`. +3. **Shutdown path**: When `JobQueue` is stopping and `addJob()` returns false, the awaiter must resume the coroutine with an error (throw or return error state) rather than leaving it suspended forever. + +#### Naming Conventions + +| Entity | Convention | Example | +| --------------------- | ------------------------- | ----------------------------------------- | +| Coroutine return type | `CoroTask` | `CoroTask`, `CoroTask` | +| Awaiter types | `*Awaiter` suffix | `JobQueueAwaiter`, `PathFindAwaiter` | +| Coroutine functions | Same as regular functions | `doRipplePathFind(...)` | +| Promise types | Nested `promise_type` | `CoroTask::promise_type` | +| JobQueue method | `postCoroTask()` | `jq.postCoroTask(jtCLIENT, "name", fn)` | + +#### Code Organization + +1. **Coroutine primitives** go in `include/xrpl/core/` (header-only where possible) +2. **Application-specific awaiters** go alongside their consumers +3. **Tests** mirror source structure: `src/test/core/CoroTask_test.cpp` +4. **No conditional compilation** (`#ifdef`) for old vs new coroutine code — migration is clean phases + +#### Documentation + +1. **Each awaiter must document**: what it waits for, which thread resumes, and what `await_resume()` returns. +2. **Promise type must document**: exception handling behavior and suspension points. +3. **Migration commits must reference this plan** in commit messages. + +### 11.3 Branch Strategy + +Each milestone is developed on a **sub-branch** of the main feature branch. This keeps PRs focused and independently reviewable. + +```mermaid +--- +displayMode: compact +config: + gitGraph: + useMaxWidth: false +--- +gitGraph + commit id: "develop" + branch Switch-to-std-coroutines + commit id: "feature branch created" + + branch std-coro/add-coroutine-primitives + commit id: "CoroTask, CoroTaskRunner" + commit id: "JobQueueAwaiter, postCoroTask" + commit id: "Unit tests" type: HIGHLIGHT + + checkout Switch-to-std-coroutines + merge std-coro/add-coroutine-primitives id: "PR #1 merged" + + branch std-coro/migrate-entry-points + commit id: "ServerHandler migration" + commit id: "GRPCServer, RPC::Context" type: HIGHLIGHT + + checkout Switch-to-std-coroutines + merge std-coro/migrate-entry-points id: "PR #2 merged" + + branch std-coro/migrate-handlers + commit id: "RipplePathFind co_await" + commit id: "Test updates" type: HIGHLIGHT + + checkout Switch-to-std-coroutines + merge std-coro/migrate-handlers id: "PR #3 merged" + + branch std-coro/cleanup-boost-coroutine + commit id: "Delete Coro.ipp" + commit id: "Remove Boost dep, benchmarks" type: HIGHLIGHT + + checkout Switch-to-std-coroutines + merge std-coro/cleanup-boost-coroutine id: "PR #4 merged" + + checkout main + merge Switch-to-std-coroutines id: "Final merge to develop" +``` + +**Workflow**: + +1. Create sub-branch from `Switch-to-std-coroutines` for each milestone +2. Develop and test on the sub-branch +3. Create PR from sub-branch → `Switch-to-std-coroutines` +4. After review + merge, start next milestone sub-branch from the updated feature branch +5. Final PR from `Switch-to-std-coroutines` → `develop` + +**Rules**: + +- Never push directly to the main feature branch — always via sub-branch PR +- Each sub-branch must pass `--unittest` and sanitizers before PR +- Sub-branch names follow the pattern: `std-coro/` (e.g., `add-coroutine-primitives`, `migrate-entry-points`) +- Milestone PRs must reference this plan document in the description + +### 11.4 Code Review Checklist + +For every PR in this migration: + +- [ ] `coroutine_handle::destroy()` called exactly once per coroutine +- [ ] No concurrent `handle.resume()` calls possible +- [ ] `unhandled_exception()` stores the exception (doesn't discard it) +- [ ] `final_suspend()` is `noexcept` +- [ ] Awaiter `await_suspend()` doesn't block (schedules, not runs) +- [ ] `LocalValues` correctly swapped on suspend/resume boundaries +- [ ] Shutdown path tested (JobQueue stopping during coroutine execution) +- [ ] ASAN clean (no use-after-free on coroutine frame) +- [ ] TSAN clean (no data races on resume) +- [ ] All existing `--unittest` tests still pass + +--- + +## 12. Task List + +See [BoostToStdCoroutineTaskList.md](BoostToStdCoroutineTaskList.md) for the full task list with per-milestone checkboxes. + +--- + +## Appendix A: File Inventory + +Complete list of files that reference coroutines (for audit tracking): + +| # | File | Must Change | Phase | +| --- | ------------------------------------------- | ----------- | -------------------------- | +| 1 | `include/xrpl/core/JobQueue.h` | Yes | 1 (add), 4 (remove old) | +| 2 | `include/xrpl/core/Coro.ipp` | Yes | 4 (delete) | +| 3 | `include/xrpl/basics/LocalValue.h` | Maybe | 1 (if integration changes) | +| 4 | `cmake/deps/Boost.cmake` | Yes | 4 | +| 5 | `cmake/XrplInterface.cmake` | Yes | 4 | +| 6 | `src/xrpld/rpc/Context.h` | Yes | 2 | +| 7 | `src/xrpld/rpc/detail/ServerHandler.cpp` | Yes | 2 | +| 8 | `src/xrpld/rpc/ServerHandler.h` | Yes | 2 | +| 9 | `src/xrpld/app/main/GRPCServer.cpp` | Yes | 2 | +| 10 | `src/xrpld/app/main/GRPCServer.h` | Yes | 2 | +| 11 | `src/xrpld/rpc/handlers/RipplePathFind.cpp` | Yes | 3 | +| 12 | `src/test/core/Coroutine_test.cpp` | Yes | 3 | +| 13 | `src/test/core/JobQueue_test.cpp` | Yes | 3 | +| 14 | `src/test/app/Path_test.cpp` | Yes | 3 | +| 15 | `src/test/jtx/impl/AMMTest.cpp` | Yes | 3 | +| 16 | `src/xrpld/rpc/README.md` | Yes | 4 (update docs) | + +--- + +## Appendix B: New Files to Create + +| # | File | Phase | Purpose | +| --- | ------------------------------------- | ----- | ---------------------------------------- | +| 1 | `include/xrpl/core/CoroTask.h` | 1 | `CoroTask` return type + promise_type | +| 2 | `include/xrpl/core/JobQueueAwaiter.h` | 1 | Awaiter for scheduling on JobQueue | +| 3 | `src/test/core/CoroTask_test.cpp` | 1 | Unit tests for new primitives | + +--- + +## 13. FAQ + +**Why is `Boost::context` still a dependency after the migration?** +The migration only removes `Boost::coroutine`. rippled's production server code (`BaseHTTPPeer.h`) and test infrastructure (`yield_to.h`) still use `boost::asio::spawn`, which depends on `Boost.Context` for stackful fiber execution. Migrating those call sites to `boost::asio::co_spawn` / `boost::asio::awaitable` is a separate initiative. See [Concern 6](#concern-6-yield_toh--boostasiospawn) for details. + +**Can C++20 stackless coroutines yield from deeply nested function calls?** +No — `co_await` can only appear in the immediate coroutine function body. However, an exhaustive audit confirmed that all `yield()` calls in rippled are at the top level of their lambda or handler function. No deep nesting exists. See [Section 4.7, Concern 1](#concern-1-cannot-suspend-from-nested-call-stacks). + +**Why was `RipplePathFind` not migrated to use `co_await` as the plan originally proposed?** +During implementation, it was simpler and more robust to replace the coroutine-based yield/post pattern with a `std::condition_variable` synchronous wait. Since `RipplePathFind` is the only handler that suspends, and it already runs on a JobQueue worker thread, blocking that thread for up to 30 seconds is acceptable. This eliminates coroutine complexity from the handler entirely. + +**What is `CoroTaskRunner` and why is it not in the original plan?** +`CoroTaskRunner` emerged during implementation as a lifecycle manager for `CoroTask`. It wraps the coroutine handle, manages `LocalValues` swapping, tracks the run count for join/post synchronization, and provides the `yieldAndPost()` method. It is defined as a nested class in `JobQueue.h` with its implementation in `CoroTaskRunner.ipp`. + +**What is `yieldAndPost()` and why was it added?** +`yieldAndPost()` is an inline awaiter method on `CoroTaskRunner` that atomically suspends the coroutine and reposts it on the JobQueue. It was added to work around a GCC-12 compiler bug where an external awaiter struct used at multiple `co_await` points corrupts the coroutine state machine's resume index. The inline version avoids this by defining the awaiter inside the member function. + +**Does the migration affect the public RPC API?** +No. The migration is entirely internal — it changes how RPC handlers are scheduled and suspended, not what they accept or return. All RPC request/response formats are unchanged. + +--- + +## 14. Glossary + +| **Term** | **Definition** | +| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **ASAN** | AddressSanitizer — a compiler-based tool that detects memory errors such as use-after-free, buffer overflows, and stack-use-after-scope at runtime. | +| **Awaiter** | A C++20 type that implements `await_ready()`, `await_suspend()`, and `await_resume()` to control coroutine suspension and resumption behavior. | +| **Boost.Context** | A Boost library providing low-level context-switching primitives (fiber stacks) used by `boost::asio::spawn`. Retained after this migration for `boost::asio::spawn` support. | +| **Boost.Coroutine2** | A Boost library providing stackful asymmetric coroutines. This is the library being replaced by this migration. | +| **Colored function** | A function that must be a coroutine (i.e., use `co_await`/`co_return`) because it needs to suspend. The "colored function problem" refers to the viral nature of this requirement up the call chain. | +| **`CoroTask`** | The new C++20 coroutine return type introduced by this migration. Contains a `promise_type`, `FinalAwaiter`, and RAII handle management. | +| **`CoroTaskRunner`** | A nested class in `JobQueue` that manages the lifecycle of a `CoroTask`, including `LocalValues` swapping, run count tracking, and the `yieldAndPost()` method. | +| **`coroutine_handle<>`** | A C++20 standard library type that is a type-erased handle to a suspended coroutine frame, used to resume or destroy the coroutine. | +| **fcontext** | The default Boost.Context backend that uses hand-written assembly for context switching. Does not support sanitizer annotations. | +| **`FinalAwaiter`** | An awaiter returned by `promise_type::final_suspend()` that implements symmetric transfer — resuming a continuation handle instead of returning to the caller, preventing stack overflow. | +| **`JobQueue`** | rippled's thread pool that manages worker threads and schedules jobs. Coroutines are scheduled on it via `postCoroTask()`. | +| **`JobQueueAwaiter`** | An awaiter type that, when `co_await`ed, suspends the coroutine and schedules its resumption as a job on the `JobQueue`. | +| **`LocalValues`** | A per-coroutine thread-local storage mechanism in rippled. `LocalValues` are swapped on suspend/resume to give each coroutine its own isolated storage. | +| **`postCoro()`** | The old `JobQueue` method that creates a Boost.Coroutine and schedules it. Replaced by `postCoroTask()`. | +| **`postCoroTask()`** | The new `JobQueue` method that creates a `CoroTaskRunner` wrapping a `CoroTask` and schedules it. | +| **`promise_type`** | A nested type inside a C++20 coroutine return type that controls coroutine creation, suspension, value return, and exception handling. | +| **Stackful coroutine** | A coroutine that has its own dedicated call stack (1.5 MB in rippled), allowing suspension from any nesting depth. Boost.Coroutine2 provides stackful coroutines. | +| **Stackless coroutine** | A coroutine that stores only a small heap-allocated frame (~200–500 bytes) and can only suspend at explicit `co_await` points. C++20 coroutines are stackless. | +| **Symmetric transfer** | A technique where `await_suspend()` returns a `coroutine_handle<>` instead of `void`, enabling the compiler to tail-call into the next coroutine and prevent stack overflow. | +| **TSAN** | ThreadSanitizer — a compiler-based tool that detects data races and lock-order inversions at runtime. | +| **UBSan** | Undefined Behavior Sanitizer — a compiler-based tool that detects undefined behavior such as signed integer overflow, null pointer dereference, and type punning violations. | +| **ucontext** | An alternative Boost.Context backend that uses POSIX `ucontext_t` for context switching. Supports sanitizer fiber-switching annotations, unlike the default fcontext backend. | +| **`yieldAndPost()`** | A method on `CoroTaskRunner` that atomically suspends the coroutine and reposts it on the JobQueue. Added to work around a GCC-12 compiler bug with external awaiters. | diff --git a/BoostToStdCoroutineTaskList.md b/BoostToStdCoroutineTaskList.md new file mode 100644 index 0000000000..93f0b79cb9 --- /dev/null +++ b/BoostToStdCoroutineTaskList.md @@ -0,0 +1,142 @@ +# Boost.Coroutine to C++20 Migration — Task List + +> Parent document: [BoostToStdCoroutineSwitchPlan.md](BoostToStdCoroutineSwitchPlan.md) + +--- + +## Milestone 1: New Coroutine Primitives + +- [ ] **1.1** Design `CoroTask` class with `promise_type` + - Define `promise_type` with `initial_suspend`, `final_suspend`, `unhandled_exception`, `return_value`/`return_void` + - Implement `FinalAwaiter` for continuation support + - Implement move-only RAII handle wrapper + - Support both `CoroTask` and `CoroTask` + +- [ ] **1.2** Design and implement `JobQueueAwaiter` + - `await_suspend()` calls `jq_.addJob(type, name, [h]{ h.resume(); })` + - Handle `addJob()` failure (shutdown) — resume with error flag or throw + - Integrate `nSuspend_` counter increment/decrement + +- [ ] **1.3** Implement `LocalValues` swap in new coroutine resume path + - Before `handle.resume()`: save thread-local, install coroutine-local + - After `handle.resume()` returns: restore thread-local + - Ensure this works when coroutine migrates between threads + +- [ ] **1.4** Add `postCoroTask()` template to `JobQueue` + - Accept callable returning `CoroTask` + - Schedule initial execution on JobQueue (mirror `postCoro()` behavior) + - Return a handle/shared_ptr for join/cancel + +- [ ] **1.5** Write unit tests (`src/test/core/CoroTask_test.cpp`) + - Test `CoroTask` runs to completion + - Test `CoroTask` returns value + - Test exception propagation across co_await + - Test coroutine destruction before completion + - Test `JobQueueAwaiter` schedules on correct thread + - Test `LocalValue` isolation across 4+ coroutines + - Test shutdown rejection (addJob returns false) + - Test `correct_order` equivalent (yield → join → post → complete) + - Test `incorrect_order` equivalent (post → yield → complete) + - Test multiple sequential co_await points + +- [ ] **1.6** Verify build on GCC 12+, Clang 16+ +- [ ] **1.7** Run ASAN + TSAN on new tests +- [ ] **1.8** Run full `--unittest` suite (no regressions) +- [ ] **1.9** Self-review and create PR #1 + +--- + +## Milestone 2: Entry Point Migration + +- [ ] **2.1** Migrate `ServerHandler::onRequest()` (`ServerHandler.cpp:287`) + - Replace `m_jobQueue.postCoro(jtCLIENT_RPC, ...)` with `postCoroTask()` + - Update lambda to return `CoroTask` (add `co_return`) + - Update `processSession` to accept new coroutine type + +- [ ] **2.2** Migrate `ServerHandler::onWSMessage()` (`ServerHandler.cpp:325`) + - Replace `m_jobQueue.postCoro(jtCLIENT_WEBSOCKET, ...)` with `postCoroTask()` + - Update lambda signature + +- [ ] **2.3** Migrate `GRPCServer::CallData::process()` (`GRPCServer.cpp:102`) + - Replace `app_.getJobQueue().postCoro(JobType::jtRPC, ...)` with `postCoroTask()` + - Update `process(shared_ptr coro)` overload signature + +- [ ] **2.4** Update `RPC::Context` (`Context.h:27`) + - Replace `std::shared_ptr coro{}` with new coroutine wrapper type + - Ensure all code that accesses `context.coro` compiles + +- [ ] **2.5** Update `ServerHandler.h` signatures + - `processSession()` and `processRequest()` parameter types + +- [ ] **2.6** Update `GRPCServer.h` signatures + - `process()` method parameter types + +- [ ] **2.7** Run full `--unittest` suite +- [ ] **2.8** Manual smoke test: HTTP + WS + gRPC RPC requests +- [ ] **2.9** Run ASAN + TSAN +- [ ] **2.10** Self-review and create PR #2 + +--- + +## Milestone 3: Handler Migration + +- [ ] **3.1** Migrate `doRipplePathFind()` (`RipplePathFind.cpp`) + - Replace `context.coro->yield()` with `co_await PathFindAwaiter{...}` + - Replace continuation lambda's `coro->post()` / `coro->resume()` with awaiter scheduling + - Handle shutdown case (post failure) in awaiter + +- [ ] **3.2** Create `PathFindAwaiter` (or use generic `JobQueueAwaiter`) + - Encapsulate the continuation + yield pattern from `RipplePathFind.cpp` lines 108-132 + +- [ ] **3.3** Update `Path_test.cpp` + - Replace `postCoro` usage with `postCoroTask` + - Ensure `context.coro` usage matches new type + +- [ ] **3.4** Update `AMMTest.cpp` + - Replace `postCoro` usage with `postCoroTask` + +- [ ] **3.5** Rewrite `Coroutine_test.cpp` for new API + - `correct_order`: postCoroTask → co_await → join → resume → complete + - `incorrect_order`: post before yield equivalent + - `thread_specific_storage`: 4 coroutines with LocalValue isolation + +- [ ] **3.6** Update `JobQueue_test.cpp` `testPostCoro` + - Migrate to `postCoroTask` API + +- [ ] **3.7** Verify `ripple_path_find` works end-to-end with new coroutines +- [ ] **3.8** Test shutdown-during-pathfind scenario +- [ ] **3.9** Run full `--unittest` suite +- [ ] **3.10** Run ASAN + TSAN +- [ ] **3.11** Self-review and create PR #3 + +--- + +## Milestone 4: Cleanup & Validation + +- [ ] **4.1** Delete `include/xrpl/core/Coro.ipp` +- [ ] **4.2** Remove from `JobQueue.h`: + - `#include ` + - `struct Coro_create_t` + - `class Coro` (entire class) + - `postCoro()` template + - Comment block (lines 322-377) describing old race condition +- [ ] **4.3** Update `cmake/deps/Boost.cmake`: + - Remove `coroutine` from `find_package(Boost REQUIRED COMPONENTS ...)` + - Remove `Boost::coroutine` from `target_link_libraries` +- [ ] **4.4** Update `cmake/XrplInterface.cmake`: + - Remove `BOOST_COROUTINES2_NO_DEPRECATION_WARNING` +- [ ] **4.5** Run memory benchmark + - Create N=1000 coroutines, compare RSS: before vs after + - Document results +- [ ] **4.6** Run context switch benchmark + - 100K yield/resume cycles, compare latency: before vs after + - Document results +- [ ] **4.7** Run RPC throughput benchmark + - Concurrent `ripple_path_find` requests, compare throughput + - Document results +- [ ] **4.8** Run full `--unittest` suite +- [ ] **4.9** Run ASAN, TSAN, UBSan + - Confirm `__asan_handle_no_return` warnings are gone +- [ ] **4.10** Verify build on all supported compilers +- [ ] **4.11** Self-review and create PR #4 +- [ ] **4.12** Document final benchmark results in PR description diff --git a/cspell.config.yaml b/cspell.config.yaml index 028f02191e..26c73606f5 100644 --- a/cspell.config.yaml +++ b/cspell.config.yaml @@ -74,6 +74,7 @@ words: - coeffs - coldwallet - compr + - cppcoro - conanfile - conanrun - confs @@ -110,6 +111,7 @@ words: - fmtdur - fsanitize - funclets + - gantt - gcov - gcovr - ghead @@ -212,6 +214,7 @@ words: - protobuf - protos - ptrs + - Pratik - pushd - pyenv - pyparsing @@ -219,6 +222,7 @@ words: - queuable - Raphson - replayer + - reposts - rerere - retriable - RIPD @@ -252,6 +256,7 @@ words: - sslws - statsd - STATSDCOLLECTOR + - stackful - stissue - stnum - stobj