From 146bf2a07f2d63e24b2cf569fb742671c6ddf595 Mon Sep 17 00:00:00 2001
From: CJ Cobb <46455409+cjcobb23@users.noreply.github.com>
Date: Wed, 23 Jun 2021 11:35:13 -0400
Subject: [PATCH] Update backend README

---
 src/backend/README.md | 118 +++++++-----------------------------------
 1 file changed, 18 insertions(+), 100 deletions(-)

diff --git a/src/backend/README.md b/src/backend/README.md
index 745cbb63..b207812a 100644
--- a/src/backend/README.md
+++ b/src/backend/README.md
@@ -1,108 +1,26 @@
-Reporting mode is a special operating mode of rippled, designed to handle RPCs
-for validated data. A server running in reporting mode does not connect to the
-p2p network, but rather extracts validated data from a node that is connected
-to the p2p network. To run rippled in reporting mode, you must also run a
-separate rippled node in p2p mode, to use as an ETL source. Multiple reporting
-nodes can share access to the same network accessible databases (Postgres and
-Cassandra); at any given time, only one reporting node will be performing ETL
-and writing to the databases, while the others simply read from the databases.
-A server running in reporting mode will forward any requests that require access
-to the p2p network to a p2p node.
+The backend is clio's view into the database. The database could be either PostgreSQL or Cassandra.
+Multiple clio servers can share access to the same database.

-# Reporting ETL
-A single reporting node has one or more ETL sources, specified in the config
-file. A reporting node will subscribe to the "ledgers" stream of each of the ETL
-sources. This stream sends a message whenever a new ledger is validated. Upon
-receiving a message on the stream, reporting will then fetch the data associated
-with the newly validated ledger from one of the ETL sources. The fetch is
-performed via a gRPC request ("GetLedger"). This request returns the ledger
-header, transactions+metadata blobs, and every ledger object
-added/modified/deleted as part of this ledger. ETL then writes all of this data
-to the databases, and moves on to the next ledger. ETL does not apply
-transactions, but rather extracts the already computed results of those
-transactions (all of the added/modified/deleted SHAMap leaf nodes of the state
-tree). The new SHAMap inner nodes are computed by the ETL writer; this computation mainly
-involves manipulating child pointers and recomputing hashes, logic which is
-buried inside of SHAMap.
+`BackendInterface`, and its derived classes, store very little state. The read methods go directly to the database,
+and generally don't access any internal data structures. Nearly all of the methods are const.

-If the database is entirely empty, ETL must download an entire ledger in full
-(as opposed to just the diff, as described above). This download is done via the
-"GetLedgerData" gRPC request. "GetLedgerData" allows clients to page through an
-entire ledger over several RPC calls. ETL will page through an entire ledger,
-and write each object to the database.
+The data model used by clio is called the flat map data model. The flat map data model does not store any
+SHAMap inner nodes, and instead only stores the raw ledger objects contained in the leaf node. Ledger objects
+are stored in the database with a compound key of `(object_id, ledger_sequence)`, where `ledger_sequence` is the
+ledger in which the object was created or modified. Objects are then fetched using an inequality operation,
+such as `SELECT * FROM objects WHERE object_id = id AND ledger_sequence <= seq`, where `seq` is the ledger
+in which we are trying to look up the object. When an object is deleted, we write an empty blob.

-If the database is not empty, the reporting node will first come up in a "soft"
-read-only mode. In read-only mode, the server does not perform ETL and simply
-publishes new ledgers as they are written to the database.
-If the database is not updated within a certain time period
-(currently hard coded at 20 seconds), the reporting node will begin the ETL
-process and start writing to the database. Postgres will report an error when
-trying to write a record with a key that already exists. ETL uses this error to
-determine that another process is writing to the database, and subsequently
-falls back to a soft read-only mode. Reporting nodes can also operate in strict
-read-only mode, in which case they will never write to the database.
+Transactions are stored in a separate table, where the key is the hash.

-# Database Nuances
-The database schema for reporting mode does not allow any history gaps.
-Attempting to write a ledger to a non-empty database where the previous ledger
-does not exist will return an error.
+Ledger headers are stored in their own table.

-The databases must be set up prior to running reporting mode. This requires
-creating the Postgres database, and setting up the Cassandra keyspace. Reporting
-mode will create the objects table in Cassandra if the table does not yet exist.
+The account_tx table maps accounts to a list of transactions that affect them.

-Creating the Postgres database:
-```
-$ psql -h [host] -U [user]
-postgres=# create database [database];
-```
-Creating the keyspace:
-```
-$ cqlsh [host] [port]
-> CREATE KEYSPACE rippled WITH REPLICATION =
- {'class' : 'SimpleStrategy', 'replication_factor' : 3 };
-```
-A replication factor of 3 is recommended. However, when running locally, only a
-replication factor of 1 is supported.
-Online delete is not supported by reporting mode and must be done manually. The
-easiest way to do this would be to setup a second Cassandra keyspace and
-Postgres database, bring up a single reporting mode instance that uses those
-databases, and start ETL at a ledger of your choosing (via --startReporting on
-the command line). Once this node is caught up, the other databases can be
-deleted.
-
-To delete:
-```
-$ psql -h [host] -U [user] -d [database]
-reporting=$ truncate table ledgers cascade;
-```
-```
-$ cqlsh [host] [port]
-> truncate table objects;
-```
-# Proxy
-RPCs that require access to the p2p network and/or the open ledger are forwarded
-from the reporting node to one of the ETL sources. The request is not processed
-prior to forwarding, and the response is delivered as-is to the client.
-Reporting will forward any requests that always require p2p/open ledger access
-(fee and submit, for instance). In addition, any request that explicitly
-requests data from the open or closed ledger (via setting
-"ledger_index":"current" or "ledger_index":"closed"), will be forwarded to a
-p2p node.
-
-For the stream "transactions_proposed" (AKA "rt_transactions"), reporting
-subscribes to the "transactions_proposed" streams of each ETL source, and then
-forwards those messages to any clients subscribed to the same stream on the
-reporting node. A reporting node will subscribe to the stream on each ETL
-source, but will only forward the messages from one of the streams at any given
-time (to avoid sending the same message more than once to the same client).
-
-# API changes
-A reporting node defaults to only returning validated data. If a ledger is not
-specified, the most recently validated ledger is used. This is in contrast to
-the normal rippled behavior, where the open ledger is used by default.
-
-Reporting will reject all subscribe requests for streams "server", "manifests",
-"validations", "peer_status" and "consensus".
+### Backend Indexer
+With the elimination of SHAMap inner nodes, iterating across a ledger becomes difficult. In order to iterate,
+a keys table is maintained, which keeps a collection of all keys in a ledger. This table has one record for every
+million ledgers, where each record has all of the keys in that ledger, as well as all of the keys that were deleted
+between that ledger and the prior ledger written to the keys table. Most of this logic is contained in `BackendIndexer`.