From d5e5747d45934e5c77967245c2a5da25d2c0a79d Mon Sep 17 00:00:00 2001
From: mDuo13
Date: Tue, 4 Sep 2018 15:09:04 -0700
Subject: [PATCH] Troubleshooting - edits per reviews
---
 .../troubleshooting/diagnosing-problems.md | 18 +++++----
 .../troubleshooting/server-wont-start.md | 20 +++++-----
 .../understanding-log-messages.md | 38 +++++++++++++++----
 3 files changed, 50 insertions(+), 26 deletions(-)
diff --git a/content/tutorials/manage-the-rippled-server/troubleshooting/diagnosing-problems.md b/content/tutorials/manage-the-rippled-server/troubleshooting/diagnosing-problems.md
index b1289719a3..8e91aa4add 100644
--- a/content/tutorials/manage-the-rippled-server/troubleshooting/diagnosing-problems.md
+++ b/content/tutorials/manage-the-rippled-server/troubleshooting/diagnosing-problems.md
@@ -1,10 +1,10 @@
 # Diagnosing Problems with rippled
-If you have having problems with `rippled`, the first step is to collect more information to accurately characterize the problem. From there, it can be easier to figure out a root cause and a fix.
+If you are having problems with `rippled`, the first step is to collect more information to accurately characterize the problem. From there, it can be easier to figure out a root cause and a fix.
-If your server does not start, see [rippled Server Won't Start](server-wont-start.html) for a list of some possible causes and fixes.
+If your server does not start (such as crashing or otherwise shutting down automatically), see [rippled Server Won't Start](server-wont-start.html) for a list of some possible causes and fixes.
-The remainder of this document suggests steps for diagnosing the problem if your server starts successfully.
+The remainder of this document suggests steps for diagnosing problems that happen while your server is up and running (including if the process is active but unable to sync with the network).
 ## Get the server_info
@@ -17,9 +17,9 @@ rippled server_info
 The response to this command has a lot of information, which is documented along with the [server_info method][].
 For troubleshooting purposes, the most important fields are (from most commonly used to least):
-- **`server_state`** - Most of the time, this field should show either `full` or `proposing` depending on whether it is [configured as a validator](run-rippled-as-a-validator.html). The value `connected` means that the server can communicate with the rest of the peer-to-peer network, but it does not yet have enough data to track progress of the shared ledger state. Normally, syncing to the state of the rest of the ledger takes about 5-15 minutes after starting.
+- **`server_state`** - Most of the time, this field should show `proposing` for a server that is [configured as a validator](run-rippled-as-a-validator.html), or `full` for a non-validating server. The value `connected` means that the server can communicate with the rest of the peer-to-peer network, but it does not yet have enough data to track progress of the shared ledger state. Normally, syncing to the state of the rest of the ledger takes about 5-15 minutes after starting.
- - If your server remains in the `connected` state for hours, or returns to the `connected` state after being in the `full` or `proposing` states, that usually indicates that your server cannot keep up with the rest of the network. The most common bottlenecks are disk I/O and network bandwidth.
+ - If your server remains in the `connected` state for hours, or returns to the `connected` state after being in the `full` or `proposing` states, that usually indicates that your server cannot keep up with the rest of the network. The most common bottlenecks are disk I/O, network bandwidth, and RAM.
 - **`complete_ledgers`** - This field shows which [ledger indexes](basic-data-types.html#ledger-index) your server has complete ledger data for.
Healthy servers usually have a single range of recent ledgers, such as `"12133424-12133858"`.
@@ -36,6 +36,8 @@ For troubleshooting purposes, the most important fields are (from most commonly
 - If you have 0 peers, your server may be unable to contact the network, or your system clock may be wrong. (Ripple recommends running an [NTP](http://www.ntp.org/) daemon on all servers to keep their clocks synced.)
+ - If you have exactly 10 peers, that may indicate that your `rippled` is unable to receive incoming connections through a router using [NAT](https://en.wikipedia.org/wiki/Network_address_translation). You can improve connectivity by configuring your router's firewall to forward the port used for peer-to-peer connections (port 51235 [by default](https://github.com/ripple/rippled/blob/8429dd67e60ba360da591bfa905b58a35638fda1/cfg/rippled-example.cfg#L1065)).
+
 ### No Response from Server
 The `rippled` executable returns the following message if it wasn't able to connect as a client to the `rippled` server:
@@ -53,14 +55,14 @@ This generally indicates one of several problems:
 - The `rippled` server is just starting up, or is not running at all. Check the status of the service; if it is running, wait a few seconds and try again.
 - You may need to pass different [parameters to the `rippled` commandline client](commandline-usage.html#client-mode-options) to connect to your server.
-- The `rippled` server may not be configured not to accept JSON-RPC connections.
+- The `rippled` server may be configured not to accept JSON-RPC connections.
 ## Check the server log
-While running, `rippled` servers write information to a debug log. The location of the debug log depends on your server's configuration file. The [default configuration](https://github.com/ripple/rippled/blob/master/cfg/rippled-example.cfg#L1139-L1142) writes the server's debug log to the file `/var/log/rippled/debug.log`.
If you start the `rippled` service directly (instead of using `systemctl` or `service` to start it), it also prints log messages to the console by default.
+[By default,](https://github.com/ripple/rippled/blob/master/cfg/rippled-example.cfg#L1139-L1142) `rippled` writes the server's debug log to the file `/var/log/rippled/debug.log`. The location of the debug log can differ based on your server's configuration file. If you start the `rippled` service directly (instead of using `systemctl` or `service` to start it), it also prints log messages to the console by default.
-You can control the verbosity of the debug log with the [log_level method][]. The default config file sets the `log_level` to severity "warning" for all categories of log messages. (See the `[rpc_startup]` stanza of the config file for settings.)
+The default config file sets the log level to severity "warning" for all categories of log messages by internally using the [log_level method][] during startup. You can control the verbosity of the debug log [using the `--silent` commandline option during startup](commandline-usage.html#verbosity-options) and with the [log_level method][] while the server is running. (See the `[rpc_startup]` stanza of the config file for settings.)
 It is normal for a `rippled` the server to print many warning-level (`WRN`) messages during startup and a few warning-level messages from time to time later on. You can **safely ignore** most warnings in the first 5 to 15 minutes of server startup.
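Reviewer note (not part of the patch): the `server_info` checks described in `diagnosing-problems.md` can be sketched as a small script. This is only an illustration under stated assumptions: the function name `diagnose` is hypothetical, and the sketch assumes the response is JSON with the fields nested under `result.info`, which should be confirmed against the actual output of your server.

```python
import json
from typing import List

# States the documentation describes as normal once the server has synced.
SYNCED_STATES = {"full", "proposing"}


def diagnose(server_info_json: str) -> List[str]:
    """Return a list of findings for a server_info response (hypothetical helper)."""
    response = json.loads(server_info_json)
    # Assumption: the fields of interest are nested under result.info.
    info = response.get("result", {}).get("info", {})
    findings: List[str] = []

    # server_state should be "full" or "proposing" after the initial sync.
    state = info.get("server_state")
    if state not in SYNCED_STATES:
        findings.append(
            f"server_state is {state!r}; if this persists for hours, "
            "check disk I/O, network bandwidth, and RAM"
        )

    # A healthy server usually reports a single range of recent ledgers.
    ledgers = str(info.get("complete_ledgers", ""))
    ranges = [r for r in ledgers.split(",") if r.strip()]
    if len(ranges) != 1:
        findings.append(f"complete_ledgers is not a single range: {ledgers!r}")

    # 0 peers suggests a network or clock problem; exactly 10 suggests NAT.
    peers = info.get("peers")
    if peers == 0:
        findings.append("0 peers: check network access and the system clock (NTP)")
    elif peers == 10:
        findings.append("exactly 10 peers: incoming connections may be blocked by NAT")

    return findings
```

An empty list means none of these particular checks fired; it does not prove the server is healthy.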
diff --git a/content/tutorials/manage-the-rippled-server/troubleshooting/server-wont-start.md b/content/tutorials/manage-the-rippled-server/troubleshooting/server-wont-start.md
index 3666b36ad3..34e57dde9a 100644
--- a/content/tutorials/manage-the-rippled-server/troubleshooting/server-wont-start.md
+++ b/content/tutorials/manage-the-rippled-server/troubleshooting/server-wont-start.md
@@ -1,6 +1,6 @@
 # rippled Server Won't Start
-This page explains possible reasons the `rippled` server does not start successfully, and how to fix them.
+This page explains possible reasons the `rippled` server does not start and how to fix them.
 These instructions assume you have [installed `rippled`](install-rippled.html) on a supported platform.
@@ -18,20 +18,20 @@ This occurs because the system has a security limit on the number of files a sin
 1. Add the following lines to the end of your `/etc/security/limits.conf` file:
- * soft nofile 5200
- * hard nofile 10240
+ * soft nofile 65536
+ * hard nofile 65536
-2. Check that the number of files that can be opened is now `10240`:
+2. Check that the [hard limit on number of files that can be opened](https://ss64.com/bash/ulimit.html) is now `65536`:
 ulimit -Hn
- The command prints the hard limit on the number of open files, which should be 10240.
+ The command should output `65536`.
 3. Try starting `rippled` again.
 systemctl start rippled
-4. If `rippled` still does not start, open `/etc/sysctl.conf` and add the following line to the bottom of the file:
+4. If `rippled` still does not start, open `/etc/sysctl.conf` and append the following kernel-level setting:
 fs.file-max = 65536
@@ -49,9 +49,9 @@ Aborted (core dumped)
 Possible solutions:
-- Check that `/etc/opt/ripple/rippled.cfg` exists and the `rippled` user has read permissions to the file. (Assuming you use the `rippled` user to run the `rippled` process, and you want to use the default location for the config file.)
+- Check that the config file exists (the default location is `/etc/opt/ripple/rippled.cfg`) and the user that runs your `rippled` process (usually `rippled`) has read permissions to the file.
-- Create a config file that can be read by the `rippled` user to `$HOME/.config/ripple/rippled.cfg` (where `$HOME` points to the `rippled` user's home directory).
+- Create a config file that can be read by the `rippled` user at `$HOME/.config/ripple/rippled.cfg` (where `$HOME` points to the `rippled` user's home directory).
 **Tip:** The `rippled` repository contains [an example `rippled.cfg` file](https://github.com/ripple/rippled/blob/master/cfg/rippled-example.cfg) which is provided as the default config when you do an RPM installation. If you do not have the file, you can copy it from there.
@@ -131,7 +131,7 @@ Or, if you are sure you don't need the databases:
 rm -r /var/lib/rippled/db
 ```
-**Tip:** It is generally safe to delete the `rippled` databases, because any individual server can re-download ledger history from other servers in the XRP Ledger network. If you are using [clustering](clustering.html), be sure your servers each have a unique `[node_seed]` configured first; if not, servers may not be recognized as part of the cluster after you restart them.
+**Tip:** It is generally safe to delete the `rippled` databases, because any individual server can re-download ledger history from other servers in the XRP Ledger network.
 ## Online Delete is Less Than Ledger History
@@ -142,7 +142,7 @@ An error message such as the following indicates that the `rippled.cfg` file has
 Terminating thread rippled: main: unhandled St13runtime_error 'online_delete must not be less than ledger_history (currently 3000)
 ```
-The `[ledger_history]` setting represents how many ledgers of history the server should seek to back-fill. The `online_delete` field (in the `[node_db]` stanza) indicates how many ledgers of history to keep when dropping older history.
The `online_delete` value must be equal or larger than `[ledger_history]` to prevent the server from deleting historical ledgers that it is also trying to download.
+The `[ledger_history]` setting represents how many ledgers of history the server should seek to back-fill. The `online_delete` field (in the `[node_db]` stanza) indicates how many ledgers of history to keep when dropping older history. The `online_delete` value must be equal to or larger than `[ledger_history]` to prevent the server from deleting historical ledgers that it is also trying to download.
 To fix the problem, edit the `rippled.cfg` file and change or remove either the `[ledger_history]` or `online_delete` options. (If you omit `[ledger_history]`, it defaults to 256 ledger versions, so `online_delete`, if present, must be larger than 256. If you omit `online_delete`, it disables automatic deletion of old ledger versions.)
diff --git a/content/tutorials/manage-the-rippled-server/troubleshooting/understanding-log-messages.md b/content/tutorials/manage-the-rippled-server/troubleshooting/understanding-log-messages.md
index 96fc8e4472..e45edde23a 100644
--- a/content/tutorials/manage-the-rippled-server/troubleshooting/understanding-log-messages.md
+++ b/content/tutorials/manage-the-rippled-server/troubleshooting/understanding-log-messages.md
@@ -18,7 +18,7 @@ Terminating thread rippled: main: unhandled St13runtime_error
 If your server always crashes on startup, see [Server Won't Start](server-wont-start.html) for possible cases.
-If your server crashes randomly during operation or as a result of particular commands, make sure you are [updated](updating-rippled.html) to the latest `rippled` version. If you are on the latest version and your server is still crashing, check the following:
+If your server crashes randomly during operation or as a result of particular commands, make sure you are [updated](update-rippled.html) to the latest `rippled` version.
If you are on the latest version and your server is still crashing, check the following:
 - Is your server running out of memory? On some systems, `rippled` may be terminated by the Out Of Memory (OOM) Killer or another monitor process.
 - If your server is running in a shared environment, are other users or administrators causing the machine or service to be restarted? For example, some hosted providers automatically kill any service that uses a large amount of a shared machine's resources for an extended period of time.
@@ -39,8 +39,8 @@ Losing connections from time to time is normal for any peer-to-peer network. **O
 A large number of these messages around the same time may indicate a problem, such as:
-- Your internet connection to one or more specific peers was cut off
-- Your server may have been overloading the peer with requests, causing it to drop your server
+- Your internet connection to one or more specific peers was cut off.
+- Your server may have been overloading the peer with requests, causing it to drop your server.
 ## No hash for fetch pack
@@ -55,15 +55,37 @@ A large number of these messages around the same time may indicate a problem, su
 ## LoadMonitor Job
+Messages such as the following occur when a function takes a long time to run (over 11 seconds in this example):
+
 ```text
 2018-Aug-28 22:56:36.180827973 LoadMonitor:WRN Job: gotFetchPack run: 11566ms wait: 0ms
+```
+
+The following similar message occurs when a job spends a long time waiting to run (again, over 11 seconds in this example):
+
+```text
 2018-Aug-28 22:56:36.180970431 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 11566ms
 2018-Aug-28 22:56:36.181053831 LoadMonitor:WRN Job: AcquisitionDone run: 0ms wait: 11566ms
 2018-Aug-28 22:56:36.181110594 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 11566ms
 2018-Aug-28 22:56:36.181169931 LoadMonitor:WRN Job: AcquisitionDone run: 0ms wait: 11566ms
 ```
-***TODO: how serious is this?***
+These two types of messages often occur together, when a long-running job causes other jobs to wait a long time for it to finish.
+
+It is **normal** to display several messages of these types **during the first few minutes** after starting the server.
+
+If the messages continue for more than 5 minutes after starting the server, especially if the `run` times are well over 1000ms, that may indicate that **your server does not have sufficient resources, such as disk I/O, RAM, or CPU**. This may be caused by not having sufficiently-powerful hardware or because other processes running on the same hardware are competing with `rippled` for resources. (Examples of other processes that may compete with `rippled` for resources include scheduled backups, virus scanners, and periodic database cleaners.)
+
+Another possible cause is trying to use NuDB on rotational hard disks; NuDB should only be used with solid state drives (SSDs).
Ripple recommends always using SSD storage for `rippled`'s databases, but you _may_ be able to run `rippled` successfully on rotational disks using RocksDB. If you are using rotational disks, make sure both the `[node_db]` and the `[shard_db]` (if you have one) are configured to use RocksDB. For example:
+
+```
+[node_db]
+type=RocksDB
+# ... more config omitted
+
+[shard_db]
+type=RocksDB
+```
 ## View of consensus changed during open
@@ -76,7 +98,7 @@ Log messages such as the following occur when a server is not in sync with the r
 2018-Aug-28 22:56:22.368499966 LedgerConsensus:WRN {"accepted":true,"account_hash":"89A821400087101F1BF2D2B912C6A9F2788CC715590E8FA5710F2D10BF5E3C03","close_flags":0,"close_time":588812130,"close_time_human":"2018-Aug-28 22:55:30.000000000","close_time_resolution":30,"closed":true,"hash":"96A8DF9ECF5E9D087BAE9DDDE38C197D3C1C6FB842C7BB770F8929E56CC71661","ledger_hash":"96A8DF9ECF5E9D087BAE9DDDE38C197D3C1C6FB842C7BB770F8929E56CC71661","ledger_index":"3","parent_close_time":588812070,"parent_hash":"5F5CB224644F080BC8E1CC10E126D62E9D7F9BE1C64AD0565881E99E3F64688A","seqNum":"3","totalCoins":"100000000000000000","total_coins":"100000000000000000","transaction_hash":"0000000000000000000000000000000000000000000000000000000000000000"}
 ```
-During the first 5 to 15 minutes after the server starts up, it is normal for it to be out of sync with the rest of the network and print messages such as these. If the server writes these messages long after starting up, it could indicate a problem. Common causes include unreliable network connections and insufficient hardware specs.
+During the first 5 to 15 minutes after the server starts up, it is normal for it to be out of sync with the rest of the network and print messages such as these. If the server writes these messages long after starting up, it could indicate a problem. Common causes include unreliable network connections and insufficient hardware specs.
This can also happen when other processes running on the same hardware are competing with `rippled` for resources. (Examples of other processes that may compete with `rippled` for resources include scheduled backups, virus scanners, and periodic database cleaners.)
 ## Already validated sequence at or past
@@ -89,9 +111,9 @@ Log messages such as the following indicate that a server received validations f
 Occasional messages of this type do not usually indicate a problem. If this type of message occurs frequently with the same sending validator, it could indicate a problem, including any of the following (roughly in order of most to least likely):
-- The server writing the message is having network issues
-- The validator described in the message is having network issues
-- The validator described in the message is behaving maliciously
+- The server writing the message is having network issues.
+- The validator described in the message is having network issues.
+- The validator described in the message is behaving maliciously.
 ## Unable to determine hash of ancestor