# Negative UNL Engineering Spec

## The Problem Statement

The moment-to-moment health of the XRP Ledger network depends on the health and connectivity of a small number of computers (nodes). The most important nodes are validators, specifically ones listed on the unique node list ([UNL](#Question-What-are-UNLs)). Ripple publishes a recommended UNL that most network nodes use to determine which peers in the network are trusted. Although most validators use the same list, they are not required to. The XRP Ledger network progresses to the next ledger when enough validators reach agreement (at or above the minimum quorum of 80%) about what transactions to include in the next ledger.

As an example, if there are 10 validators on the UNL, at least 8 validators have to agree with the latest ledger for it to become validated. But what if enough of those validators are offline to drop the network below the 80% quorum? The XRP Ledger network favors safety/correctness over advancing the ledger, which means that if enough validators are offline, the network will not be able to validate ledgers.

Unfortunately, validators can go offline at any time for many different reasons. Power outages, network connectivity issues, and hardware failures are just a few scenarios where a validator would appear "offline". Given that most of these events are temporary, it would make sense to temporarily remove that validator from the UNL. But the UNL is updated infrequently and not every node uses the same UNL. So instead of removing the unreliable validator from the Ripple recommended UNL, we can create a second list, the negative UNL, which is stored directly on the ledger (so the entire network has the same view). This will help nodes see which validators are **currently** unreliable, and adjust their quorum calculation accordingly.

*Improving the liveness of the network is the main motivation for the negative UNL.*

### Targeted Faults

In order to determine which validators are unreliable, we need to clearly define what kind of faults to measure and analyze. We want to deal with the faults we frequently observe in the production network. Hence we will only monitor for validators that do not reliably respond to network messages or that send out validations disagreeing with the locally generated validations. We will not target other byzantine faults.

To track whether or not a validator is responding to the network, we could monitor them with a “heartbeat” protocol. Instead of creating a new heartbeat protocol, we can leverage some existing protocol messages to mimic the heartbeat. We picked validation messages because validators should send one and only one validation message per ledger. In addition, we only count the validation messages that agree with the local node's validations.

With the negative UNL, the network could keep making forward progress safely even if the number of remaining validators gets to 60%. Say we have a network with 10 validators on the UNL and everything is operating correctly. The quorum required for this network would be 8 (80% of 10). When validators fail, the quorum required would be as low as 6 (60% of 10), which is the absolute ***minimum quorum***. We need the absolute minimum quorum to be strictly greater than 50% of the original UNL so that there cannot be two partitions of well-behaved nodes headed in different directions. We arbitrarily choose 60% as the minimum quorum to give a margin of safety.

Consider these events in the absence of negative UNL:

1. 1:00pm - validator1 fails, votes vs. quorum: 9 >= 8, we have quorum
1. 3:00pm - validator2 fails, votes vs. quorum: 8 >= 8, we have quorum
1. 5:00pm - validator3 fails, votes vs. quorum: 7 < 8, we don’t have quorum
    * **network cannot validate new ledgers with 3 failed validators**

We're below 80% agreement, so new ledgers cannot be validated. This is how the XRP Ledger operates today, but if the negative UNL were enabled, the events would happen as follows. (Please note that the events below are from a simplified version of our protocol.)

1. 1:00pm - validator1 fails, votes vs. quorum: 9 >= 8, we have quorum
1. 1:40pm - network adds validator1 to negative UNL, quorum changes to ceil(9 * 0.8), or 8
1. 3:00pm - validator2 fails, votes vs. quorum: 8 >= 8, we have quorum
1. 3:40pm - network adds validator2 to negative UNL, quorum changes to ceil(8 * 0.8), or 7
1. 5:00pm - validator3 fails, votes vs. quorum: 7 >= 7, we have quorum
1. 5:40pm - network adds validator3 to negative UNL, quorum changes to ceil(7 * 0.8), or 6
1. 7:00pm - validator4 fails, votes vs. quorum: 6 >= 6, we have quorum
    * **network can still validate new ledgers with 4 failed validators**

## External Interactions

### Message Format Changes

This proposal will:

1. add a new pseudo-transaction type
1. add the negative UNL to the ledger data structure.

Any tools or systems that rely on the format of this data will have to be updated.

### Amendment

This feature **will** need an amendment to activate.

## Design

This section discusses the following topics about the Negative UNL design:

* [Negative UNL protocol overview](#Negative-UNL-Protocol-Overview)
* [Validator reliability measurement](#Validator-Reliability-Measurement)
* [Format Changes](#Format-Changes)
* [Negative UNL maintenance](#Negative-UNL-Maintenance)
* [Quorum size calculation](#Quorum-Size-Calculation)
* [Filter validation messages](#Filter-Validation-Messages)
* [High level sequence diagram of code changes](#High-Level-Sequence-Diagram-of-Code-Changes)

### Negative UNL Protocol Overview

Every ledger stores a list of zero or more unreliable validators.
Updates to the list must be approved by the validators using the consensus mechanism that validators use to agree on the set of transactions. The list is used only when checking if a ledger is fully validated. If a validator V is in the list, nodes with V in their UNL adjust the quorum and V’s validation message is not counted when verifying if a ledger is fully validated. V’s flow of messages and network interactions, however, will remain the same.

We define the ***effective UNL** = original UNL - negative UNL*, and the ***effective quorum*** as the quorum of the *effective UNL*. And we set *effective quorum = Ceiling(80% * effective UNL)*.

### Validator Reliability Measurement

A node only measures the reliability of validators on its own UNL, and only proposes based on local observations. There are many metrics that a node can measure about its validators, but we have chosen ledger validation messages. This is because every validator shall send one and only one signed validation message per ledger. This keeps the measurement simple and removes timing/clock-sync issues. A node will measure the percentage of agreeing validation messages (*PAV*) received from each validator on the node's UNL. Note that the node will only count the validation messages that agree with its own validations.

We define the **PAV** as the **P**ercentage of **A**greed **V**alidation messages received for the last N ledgers, where N = 256 by default.

When the PAV drops below the ***low-water mark***, the validator is considered unreliable, and is a candidate to be disabled by being added to the negative UNL. A validator must have a PAV higher than the ***high-water mark*** to be re-enabled. The validator is re-enabled by removing it from the negative UNL. In the implementation, we plan to set the low-water mark as 50% and the high-water mark as 80%.

### Format Changes

The negative UNL component in a ledger contains three fields.
* ***NegativeUNL***: The current negative UNL, a list of unreliable validators.
* ***ToDisable***: The validator to be added to the negative UNL on the next flag ledger.
* ***ToReEnable***: The validator to be removed from the negative UNL on the next flag ledger.

All three fields are optional. When the *ToReEnable* field exists, the *NegativeUNL* field cannot be empty.

A new pseudo-transaction, ***UNLModify***, is added. It has three fields:

* ***Disabling***: A flag indicating whether the modification is to disable or to re-enable a validator.
* ***Seq***: The ledger sequence number.
* ***Validator***: The validator to be disabled or re-enabled.

There would be at most one *disable* `UNLModify` and one *re-enable* `UNLModify` transaction per flag ledger. The full machinery is described further on.

### Negative UNL Maintenance

The negative UNL can only be modified on the flag ledgers. If a validator's reliability status changes, it takes two flag ledgers to modify the negative UNL. Let's see an example of the algorithm:

* Ledger seq = 100: A validator V goes offline.
* Ledger seq = 256: This is a flag ledger, and V's reliability measurement *PAV* is lower than the low-water mark. Other validators add `UNLModify` pseudo-transactions `{true, 256, V}` to the transaction set, which goes through consensus. Then the pseudo-transaction is applied to the negative UNL ledger component by setting `ToDisable = V`.
* Ledger seq = 257 ~ 511: The negative UNL ledger component is copied from the parent ledger.
* Ledger seq = 512: This is a flag ledger, and the negative UNL is updated: `NegativeUNL = NegativeUNL + ToDisable`.

The negative UNL may have up to `MaxNegativeListed = floor(original UNL * 25%)` validators. The 25% cap follows from 75% * 80% = 60%, where 75% = 100% - 25%, 80% is the quorum of the effective UNL, and 60% is the absolute minimum quorum of the original UNL.
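
The 75% * 80% = 60% arithmetic can be checked numerically. The snippet below is only an illustrative sketch, not code from the implementation: the helper names `max_negative_listed` and `ceil_percent` are hypothetical, and integer arithmetic is an assumption made here to avoid floating-point rounding.

```python
def max_negative_listed(unl_size: int) -> int:
    """MaxNegativeListed = floor(original UNL * 25%)."""
    return unl_size // 4


def ceil_percent(count: int, percent: int) -> int:
    """Ceiling of count * percent / 100, computed with integers only."""
    return -(-count * percent // 100)


for unl_size in (10, 20):
    cap = max_negative_listed(unl_size)
    effective_quorum = ceil_percent(unl_size - cap, 80)  # 80% of the effective UNL
    minimum_quorum = ceil_percent(unl_size, 60)          # 60% of the original UNL
    print(unl_size, cap, effective_quorum, minimum_quorum)
# prints "10 2 7 6" and then "20 5 12 12"
```

With the cap fully used, 80% of the remaining validators is never less than 60% of the original UNL, which is the margin the 25% limit is meant to preserve.
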
Adding more than 25% of the validators to the negative UNL does not improve the liveness of the network, because adding more validators to the negative UNL cannot lower the effective quorum.

The following is the detailed algorithm:

* **If** the ledger seq = x is a flag ledger

    1. Compute `NegativeUNL = NegativeUNL + ToDisable - ToReEnable` if they exist in the parent ledger

    1. Try to find a candidate to disable if `sizeof NegativeUNL < MaxNegativeListed`

        1. Find a validator V that has a *PAV* lower than the low-water mark, but is not in `NegativeUNL`.

        1. If two or more are found, their public keys are XORed with the hash of the parent ledger and the one with the lowest XOR result is chosen.

        1. If V is found, create a `UNLModify` pseudo-transaction `TxDisableValidator = {true, x, V}`

    1. Try to find a candidate to re-enable if `sizeof NegativeUNL > 0`:

        1. Find a validator U that is in `NegativeUNL` and has a *PAV* higher than the high-water mark.

        1. If U is not found, try to find one in `NegativeUNL` but not in the local *UNL*.

        1. If two or more are found, their public keys are XORed with the hash of the parent ledger and the one with the lowest XOR result is chosen.

        1. If U is found, create a `UNLModify` pseudo-transaction `TxReEnableValidator = {false, x, U}`

    1. If any `UNLModify` pseudo-transactions are created, add them to the transaction set. The transaction set goes through the consensus algorithm.

    1. If they have enough support, the `UNLModify` pseudo-transactions remain in the transaction set agreed on by the validators. Then the pseudo-transactions are applied to the ledger:

        1. If there is a `TxDisableValidator`, set `ToDisable=TxDisableValidator.V`. Else clear `ToDisable`.

        1. If there is a `TxReEnableValidator`, set `ToReEnable=TxReEnableValidator.U`. Else clear `ToReEnable`.

* **Else** (not a flag ledger)

    1. Copy the negative UNL ledger component from the parent ledger

The negative UNL is stored on each ledger because we don't know when a validator may reconnect to the network.
If the negative UNL were stored only on every flag ledger, then a new validator would have to wait until it acquires the latest flag ledger to know the negative UNL. So any new ledgers created that are not flag ledgers copy the negative UNL from the parent ledger.

Note that when we have a validator to disable and a validator to re-enable at the same flag ledger, we create two separate `UNLModify` pseudo-transactions. We want either one or the other or both to make it into the ledger on their own merits.

Readers may have noticed that we defined several rules for creating the `UNLModify` pseudo-transactions but did not describe how to enforce the rules. The rules are actually enforced by the existing consensus algorithm. Unless enough validators propose the same pseudo-transaction, it will not be included in the transaction set of the ledger.

### Quorum Size Calculation

The effective quorum is 80% of the effective UNL. Note that because at most 25% of the original UNL can be on the negative UNL, the quorum should not be lower than the absolute minimum quorum (i.e. 60%) of the original UNL. However, considering that different nodes may have different UNLs, to be safe we compute `quorum = Ceiling(max(60% * original UNL, 80% * effective UNL))`.

### Filter Validation Messages

If a validator V is in the negative UNL, it still participates in consensus sessions in the same way, i.e. V still follows the protocol and publishes proposal and validation messages. The messages from V are still stored the same way by everyone, used to calculate the new PAV for V, and could be used in future consensus sessions if needed. However, V's ledger validation message is not counted when checking if the ledger is fully validated.

### High Level Sequence Diagram of Code Changes

The diagram below is the sequence of one round of consensus. Classes and components with non-trivial changes are colored green.

* The `ValidatorList` class is modified to compute the quorum of the effective UNL.
* The `Validations` class provides an interface for querying the validation messages from trusted validators.
* The `ConsensusAdaptor` component:
    * The `RCLConsensus::Adaptor` class is modified for creating `UNLModify` Pseudo-Transactions.
    * The `Change` class is modified for applying `UNLModify` Pseudo-Transactions.
    * The `Ledger` class is modified for creating and adjusting the negative UNL ledger component.
    * The `LedgerMaster` class is modified for filtering out validation messages from negative UNL validators when verifying if a ledger is fully validated.

## Roads Not Taken

### Use a Mechanism Like Fee Voting to Process UNLModify Pseudo-Transactions

The previous version of the negative UNL specification used the same mechanism as the [fee voting](https://xrpl.org/fee-voting.html#voting-process.) for creating the negative UNL, and used the negative UNL as soon as the ledger was fully validated. However, the timing of full validation can differ among nodes, so different negative UNLs could be used, resulting in different effective UNLs and different quorums for the same ledger. As a result, the network's safety is impacted.

This updated version does not impact safety, though it operates a bit more slowly. The negative UNL modifications in the *UNLModify* pseudo-transaction approved by the consensus will take effect at the next flag ledger. The extra time of the 256 ledgers should be enough for nodes to be in sync on the negative UNL modifications.

### Use an Expiration Approach to Re-enable Validators

After a validator disabled by the negative UNL becomes reliable, other validators explicitly vote for re-enabling it. An alternative approach to re-enable a validator is the expiration approach, which was considered in the previous version of the specification. In the expiration approach, every entry in the negative UNL has a fixed expiration time. One flag ledger interval was chosen as the expiration interval.
Once expired, the other validators must continue voting to keep the unreliable validator on the negative UNL. The advantage of this approach is its simplicity. But it has a requirement: the negative UNL protocol must be able to vote multiple unreliable validators to be disabled at the same flag ledger. In this version of the specification, however, only one unreliable validator can be disabled at a flag ledger. So the expiration approach cannot be simply applied.

### Validator Reliability Measurement and Flag Ledger Frequency

If the ledger time is about 4.5 seconds and the low-water mark is 50%, then in the worst case, it takes 48 minutes *((0.5 * 256 + 256 + 256) * 4.5 / 60 = 48)* to put an offline validator on the negative UNL. We considered lowering the flag ledger frequency so that the negative UNL can be more responsive. We also considered decoupling the reliability measurement and flag ledger frequency to be more flexible. In practice, however, their benefits are not clear.

## New Attack Vectors

A group of malicious validators may try to frame a reliable validator and put it on the negative UNL. But they cannot succeed, because:

1. A reliable validator sends a signed validation message every ledger. A sufficiently connected peer-to-peer network will propagate the validation messages to other validators. The validators will decide if another validator is reliable or not only by their local observation of the validation messages received. So an honest validator’s vote on another validator’s reliability is accurate.

1. Given the votes are accurate, and one vote per validator, an honest validator will not create a UNLModify transaction of a reliable validator.

1. A validator can be added to a negative UNL only through a UNLModify transaction.

Assuming the group of malicious validators is smaller than the quorum, they cannot frame a reliable validator.
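
The argument above rests on each honest validator voting purely from its own observations. As a rough illustration (not the rippled implementation), the sketch below derives a disable vote from local counts of agreeing validation messages, using the PAV and low-water mark defined earlier together with the XOR tie-break from the maintenance algorithm. The names `pav`, `xor_score`, and `pick_disable_candidate` are hypothetical, validator identifiers are treated as plain byte strings for simplicity, and the `MaxNegativeListed` cap is left out for brevity.

```python
from typing import Dict, Optional, Set

LEDGER_WINDOW = 256    # N: ledgers per measurement window (spec default)
LOW_WATER_MARK = 0.5   # planned low-water mark of 50%


def pav(agreed_validations: int, window: int = LEDGER_WINDOW) -> float:
    """Fraction of the last `window` ledgers with an agreeing validation."""
    return agreed_validations / window


def xor_score(validator_id: bytes, parent_ledger_hash: bytes) -> int:
    """Tie-break value: validator identifier XORed with the parent ledger hash."""
    return int.from_bytes(validator_id, "big") ^ int.from_bytes(parent_ledger_hash, "big")


def pick_disable_candidate(
    agreed_counts: Dict[bytes, int],
    negative_unl: Set[bytes],
    parent_ledger_hash: bytes,
) -> Optional[bytes]:
    """Return the validator this node would vote to disable, or None."""
    # Only locally observed counts matter; no other node's opinion is consulted.
    candidates = [
        validator
        for validator, count in agreed_counts.items()
        if validator not in negative_unl and pav(count) < LOW_WATER_MARK
    ]
    if not candidates:
        return None
    # Lowest XOR result wins, so honest nodes with the same view pick the same one.
    return min(candidates, key=lambda v: xor_score(v, parent_ledger_hash))
```

Because a reliable validator keeps its PAV at or above the low-water mark in every honest node's local view, it never appears in `candidates`, so no honest node proposes disabling it.
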
## Summary

The bullet points below briefly summarize the current proposal:

* The motivation of the negative UNL is to improve the liveness of the network.
* The targeted faults are the ones frequently observed in the production network.
* Validators propose negative UNL candidates based on their local measurements.
* The absolute minimum quorum is 60% of the original UNL.
* The format of the ledger is changed, and a new *UNLModify* pseudo-transaction is added. Any tools or systems that rely on the format of these data will have to be updated.
* The negative UNL can only be modified on the flag ledgers.
* At most one validator can be added to the negative UNL at a flag ledger.
* At most one validator can be removed from the negative UNL at a flag ledger.
* If a validator's reliability status changes, it takes two flag ledgers to modify the negative UNL.
* The quorum is the larger of 80% of the effective UNL and 60% of the original UNL.
* If a validator is on the negative UNL, its validation messages are ignored when the local node verifies if a ledger is fully validated.

## FAQ

### Question: What are UNLs?

Quote from the [Technical FAQ](https://xrpl.org/technical-faq.html): "They are the lists of transaction validators a given participant believes will not conspire to defraud them."

### Question: How does the negative UNL proposal affect network liveness?

The network can make forward progress when at least a quorum of the trusted validators agree with the progress. The lower the quorum size is, the easier it is for the network to progress. If the quorum is too low, however, the network is not safe because nodes may have different results. So the quorum size used in the consensus protocol is a balance between the safety and the liveness of the network. The negative UNL reduces the size of the effective UNL, resulting in a lower quorum size while keeping the network safe.
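
To make the liveness effect concrete, here is a minimal sketch of the formula from the Quorum Size Calculation section, `quorum = Ceiling(max(60% * original UNL, 80% * effective UNL))`. It is an illustration under stated assumptions rather than the actual implementation: the function names, the string validator identifiers, and the 20-validator UNL are all hypothetical.

```python
from typing import Set


def ceil_percent(count: int, percent: int) -> int:
    """Ceiling of count * percent / 100, using integers to avoid float rounding."""
    return -(-count * percent // 100)


def quorum(original_unl: Set[str], negative_unl: Set[str]) -> int:
    """Larger of 60% of the original UNL and 80% of the effective UNL."""
    # effective UNL = original UNL - negative UNL
    effective_unl = original_unl - negative_unl
    return max(
        ceil_percent(len(original_unl), 60),
        ceil_percent(len(effective_unl), 80),
    )


unl = {f"validator{i}" for i in range(1, 21)}  # hypothetical 20-validator UNL
for listed in (0, 1, 5, 8):
    negative = {f"validator{i}" for i in range(1, listed + 1)}
    print(listed, quorum(unl, negative))
# prints "0 16", "1 16", "5 12", "8 12"
```

Listing 5 of 20 validators (the 25% cap) lowers the quorum from 16 to 12. The last row shows the 60% floor holding even if a node with a different UNL sees more than 25% of its own list on the negative UNL.
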