From 146bf2a07f2d63e24b2cf569fb742671c6ddf595 Mon Sep 17 00:00:00 2001
From: CJ Cobb <46455409+cjcobb23@users.noreply.github.com>
Date: Wed, 23 Jun 2021 11:35:13 -0400
Subject: [PATCH] Update backend README

---
 src/backend/README.md | 118 +++++++-----------------------------------
 1 file changed, 18 insertions(+), 100 deletions(-)

diff --git a/src/backend/README.md b/src/backend/README.md
index 745cbb63..b207812a 100644
--- a/src/backend/README.md
+++ b/src/backend/README.md
@@ -1,108 +1,26 @@
-Reporting mode is a special operating mode of rippled, designed to handle RPCs
-for validated data. A server running in reporting mode does not connect to the
-p2p network, but rather extracts validated data from a node that is connected
-to the p2p network. To run rippled in reporting mode, you must also run a
-separate rippled node in p2p mode, to use as an ETL source. Multiple reporting
-nodes can share access to the same network accessible databases (Postgres and
-Cassandra); at any given time, only one reporting node will be performing ETL
-and writing to the databases, while the others simply read from the databases.
-A server running in reporting mode will forward any requests that require access
-to the p2p network to a p2p node.
+The backend is clio's view into the database. The database can be either PostgreSQL or Cassandra.
+Multiple clio servers can share access to the same database.
 
-# Reporting ETL
-A single reporting node has one or more ETL sources, specified in the config
-file. A reporting node will subscribe to the "ledgers" stream of each of the ETL
-sources. This stream sends a message whenever a new ledger is validated. Upon
-receiving a message on the stream, reporting will then fetch the data associated
-with the newly validated ledger from one of the ETL sources. The fetch is
-performed via a gRPC request ("GetLedger"). This request returns the ledger
-header, transactions+metadata blobs, and every ledger object
-added/modified/deleted as part of this ledger. ETL then writes all of this data
-to the databases, and moves on to the next ledger. ETL does not apply
-transactions, but rather extracts the already computed results of those
-transactions (all of the added/modified/deleted SHAMap leaf nodes of the state
-tree). The new SHAMap inner nodes are computed by the ETL writer; this computation mainly
-involves manipulating child pointers and recomputing hashes, logic which is
-buried inside of SHAMap.
+`BackendInterface` and its derived classes store very little state. The read methods go directly to the database,
+and generally don't access any internal data structures. Nearly all of the methods are `const`.
 
-If the database is entirely empty, ETL must download an entire ledger in full
-(as opposed to just the diff, as described above). This download is done via the
-"GetLedgerData" gRPC request. "GetLedgerData" allows clients to page through an
-entire ledger over several RPC calls. ETL will page through an entire ledger,
-and write each object to the database.
+The data model used by clio is called the flat map data model. The flat map data model does not store any
+SHAMap inner nodes, and instead only stores the raw ledger objects contained in the leaf nodes. Ledger objects
+are stored in the database with a compound key of `(object_id, ledger_sequence)`, where `ledger_sequence` is the
+sequence of the ledger in which the object was created or modified. Objects are then fetched using an inequality operation,
+such as `SELECT * FROM objects WHERE object_id = id AND ledger_sequence <= seq`, keeping the row with the greatest `ledger_sequence`, where `seq` is the
+sequence of the ledger in which we are trying to look up the object. When an object is deleted, we write an empty blob.
 
-If the database is not empty, the reporting node will first come up in a "soft"
-read-only mode. In read-only mode, the server does not perform ETL and simply
-publishes new ledgers as they are written to the database. 
-If the database is not updated within a certain time period
-(currently hard coded at 20 seconds), the reporting node will begin the ETL
-process and start writing to the database. Postgres will report an error when
-trying to write a record with a key that already exists. ETL uses this error to
-determine that another process is writing to the database, and subsequently
-falls back to a soft read-only mode. Reporting nodes can also operate in strict
-read-only mode, in which case they will never write to the database.
+Transactions are stored in a separate table, where the key is the hash.
 
-# Database Nuances
-The database schema for reporting mode does not allow any history gaps.
-Attempting to write a ledger to a non-empty database where the previous ledger
-does not exist will return an error.
+Ledger headers are stored in their own table.
 
-The databases must be set up prior to running reporting mode. This requires
-creating the Postgres database, and setting up the Cassandra keyspace. Reporting
-mode will create the objects table in Cassandra if the table does not yet exist. 
+The `account_tx` table maps accounts to a list of transactions that affect them.
 
-Creating the Postgres database:
-```
-$ psql -h [host] -U [user]
-postgres=# create database [database];
-```
-Creating the keyspace:
-```
-$ cqlsh [host] [port]
-> CREATE KEYSPACE rippled WITH REPLICATION =
-  {'class' : 'SimpleStrategy', 'replication_factor' : 3    };
-```
-A replication factor of 3 is recommended. However, when running locally, only a
-replication factor of 1 is supported.
 
-Online delete is not supported by reporting mode and must be done manually. The
-easiest way to do this would be to setup a second Cassandra keyspace and
-Postgres database, bring up a single reporting mode instance that uses those
-databases, and start ETL at a ledger of your choosing (via --startReporting on
-the command line). Once this node is caught up, the other databases can be
-deleted.
-
-To delete:
-```
-$ psql -h [host] -U [user] -d [database]
-reporting=$ truncate table ledgers cascade;
-```
-```
-$ cqlsh [host] [port]
-> truncate table objects;
-```
-# Proxy
-RPCs that require access to the p2p network and/or the open ledger are forwarded
-from the reporting node to one of the ETL sources. The request is not processed
-prior to forwarding, and the response is delivered as-is to the client.
-Reporting will forward any requests that always require p2p/open ledger access
-(fee and submit, for instance). In addition, any request that explicitly
-requests data from the open or closed ledger (via setting
-"ledger_index":"current" or "ledger_index":"closed"), will be forwarded to a
-p2p node. 
-
-For the stream "transactions_proposed" (AKA "rt_transactions"), reporting
-subscribes to the "transactions_proposed" streams of each ETL source, and then
-forwards those messages to any clients subscribed to the same stream on the
-reporting node. A reporting node will subscribe to the stream on each ETL
-source, but will only forward the messages from one of the streams at any given
-time (to avoid sending the same message more than once to the same client).
-
-# API changes
-A reporting node defaults to only returning validated data. If a ledger is not
-specified, the most recently validated ledger is used. This is in contrast to
-the normal rippled behavior, where the open ledger is used by default.
-
-Reporting will reject all subscribe requests for streams "server", "manifests",
-"validations", "peer_status" and "consensus".
+### Backend Indexer
 
+With the elimination of SHAMap inner nodes, iterating across a ledger becomes difficult. In order to iterate,
+a keys table is maintained, which keeps a collection of all keys in a ledger. This table has one record for every
+million ledgers, where each record has all of the keys in that ledger, as well as all of the keys that were deleted
+between that ledger and the prior ledger written to the keys table. Most of this logic is contained in `BackendIndexer`.