Merge pull request #453 from mDuo13/troubleshooting

rippled Troubleshooting Docs
This commit is contained in:
Rome Reginelli
2018-09-14 18:11:43 -07:00
committed by GitHub
4 changed files with 452 additions and 3 deletions

View File

@@ -0,0 +1,76 @@
# Diagnosing Problems with rippled
If you are having problems with `rippled`, the first step is to collect more information to accurately characterize the problem. From there, it can be easier to figure out a root cause and a fix.
If your server does not start (such as crashing or otherwise shutting down automatically), see [rippled Server Won't Start](server-wont-start.html) for a list of some possible causes and fixes.
The remainder of this document suggests steps for diagnosing problems that happen while your server is up and running (including if the process is active but unable to sync with the network).
## Get the server_info
You can use the commandline to get server status information from the local `rippled` instance. For example:
```
rippled server_info
```
The response to this command has a lot of information, which is documented along with the [server_info method][].
For troubleshooting purposes, the most important fields are (from most commonly used to least):
- **`server_state`** - Most of the time, this field should show `proposing` for a server that is [configured as a validator](run-rippled-as-a-validator.html), or `full` for a non-validating server. The value `connected` means that the server can communicate with the rest of the peer-to-peer network, but it does not yet have enough data to track progress of the shared ledger state. Normally, syncing to the state of the rest of the ledger takes about 5-15 minutes after starting.
- If your server remains in the `connected` state for hours, or returns to the `connected` state after being in the `full` or `proposing` states, that usually indicates that your server cannot keep up with the rest of the network. The most common bottlenecks are disk I/O, network bandwidth, and RAM.
- **`complete_ledgers`** - This field shows which [ledger indexes](basic-data-types.html#ledger-index) your server has complete ledger data for. Healthy servers usually have a single range of recent ledgers, such as `"12133424-12133858"`.
- If you have a disjoint set of complete ledgers such as `"11845721-12133420,12133424-12133858"`, that could indicate that your server has had intermittent outages or has temporarily fallen out of sync with the rest of the network. The most common causes for this are insufficient disk I/O or network bandwidth.
- Normally, a `rippled` server downloads recent ledger history from its peers. If gaps in your ledger history persist for more than a few hours, you may not be connected to any peers who have the missing data. If this occurs, you can force your server to try and peer with one of Ripple's full-history public servers by adding the following stanza to your config file and restarting:
[ips_fixed]
s2.ripple.com 51235
- **`amendment_blocked`** - This field is normally omitted from the `server_info` response. If this field appears with the value `true`, then the network has approved an [amendment](amendments.html) for which your server doesn't have an implementation. Most likely, you can fix this by [updating rippled](update-rippled.html) to the latest version. You can also use the [feature method][] to see what amendment IDs are currently enabled and which one(s) your server does and does not support.
- **`peers`** - This field indicates how many other servers in the XRP Ledger peer-to-peer network your server is connected to. Healthy servers typically show between 5 and 50 peers, unless explicitly configured to connect only to certain peers.
- If you have 0 peers, your server may be unable to contact the network, or your system clock may be wrong. (Ripple recommends running an [NTP](http://www.ntp.org/) daemon on all servers to keep their clocks synced.)
- If you have exactly 10 peers, that may indicate that your `rippled` is unable to receive incoming connections through a router using [NAT](https://en.wikipedia.org/wiki/Network_address_translation). You can improve connectivity by configuring your router's firewall to forward the port used for peer-to-peer connections (port 51235 [by default](https://github.com/ripple/rippled/blob/8429dd67e60ba360da591bfa905b58a35638fda1/cfg/rippled-example.cfg#L1065)).
### No Response from Server
The `rippled` executable returns the following message if it wasn't able to connect as a client to the `rippled` server:
```json
{
"error" : "internal",
"error_code" : 71,
"error_message" : "Internal error.",
"error_what" : "no response from server"
}
```
This generally indicates one of several problems:
- The `rippled` server is just starting up, or is not running at all. Check the status of the service; if it is running, wait a few seconds and try again.
- You may need to pass different [parameters to the `rippled` commandline client](commandline-usage.html#client-mode-options) to connect to your server.
- The `rippled` server may be configured not to accept JSON-RPC connections.
## Check the server log
[By default,](https://github.com/ripple/rippled/blob/master/cfg/rippled-example.cfg#L1139-L1142) `rippled` writes the server's debug log to the file `/var/log/rippled/debug.log`. The location of the debug log can differ based on your server's configuration file. If you start the `rippled` service directly (instead of using `systemctl` or `service` to start it), it also prints log messages to the console by default.
The default config file sets the log level to severity "warning" for all categories of log messages by internally using the [log_level method][] during startup. You can control the verbosity of the debug log [using the `--silent` commandline option during startup](commandline-usage.html#verbosity-options) and with the [log_level method][] while the server is running. (See the `[rpc_startup]` stanza of the config file for settings.)
It is normal for a `rippled` the server to print many warning-level (`WRN`) messages during startup and a few warning-level messages from time to time later on. You can **safely ignore** most warnings in the first 5 to 15 minutes of server startup.
For a more thorough explanation of various types of log messages, see [Understanding Log Messages](understanding-log-messages.html).
<!--{# common link defs #}-->
{% include '_snippets/rippled-api-links.md' %}
{% include '_snippets/tx-type-links.md' %}
{% include '_snippets/rippled_versions.md' %}

View File

@@ -0,0 +1,169 @@
# rippled Server Won't Start
This page explains possible reasons the `rippled` server does not start and how to fix them.
These instructions assume you have [installed `rippled`](install-rippled.html) on a supported platform.
## File Descriptors Limit
On some Linux variants, you may get an error message such as the following when trying to run `rippled`:
```text
WARNING: There are only 1024 file descriptors (soft limit) available, which
limit the number of simultaneous connections.
```
This occurs because the system has a security limit on the number of files a single process may open, but the limit is set too low for `rippled`. To fix the problem, **root access is required**. Increase the number of files `rippled` is allowed to open with the following steps:
1. Add the following lines to the end of your `/etc/security/limits.conf` file:
* soft nofile 65536
* hard nofile 65536
2. Check that the [hard limit on number of files that can be opened](https://ss64.com/bash/ulimit.html) is now `65536`:
ulimit -Hn
The command should output `65536`.
3. Try starting `rippled` again.
systemctl start rippled
4. If `rippled` still does not start, open `/etc/sysctl.conf` and append the following kernel-level setting:
fs.file-max = 65536
## Failed to open /etc/opt/ripple/rippled.cfg
If `rippled` crashes on startup with an error such as the following, it means that `rippled` cannot read its config file:
```text
Loading: "/etc/opt/ripple/rippled.cfg"
Failed to open '"/etc/opt/ripple/rippled.cfg"'.
Terminating thread rippled: main: unhandled St13runtime_error 'Can not create "/var/opt/ripple"'
Aborted (core dumped)
```
Possible solutions:
- Check that the config file exists (the default location is `/etc/opt/ripple/rippled.cfg`) and the user that runs your `rippled` process (usually `rippled`) has read permissions to the file.
- Create a config file that can be read by the `rippled` user at `$HOME/.config/ripple/rippled.cfg` (where `$HOME` points to the `rippled` user's home directory).
**Tip:** The `rippled` repository contains [an example `rippled.cfg` file](https://github.com/ripple/rippled/blob/master/cfg/rippled-example.cfg) which is provided as the default config when you do an RPM installation. If you do not have the file, you can copy it from there.
- Specify the path to your preferred config file using the `--conf` [commandline option](commandline-usage.html).
## Failed to open validators file
If `rippled` crashes on startup with an error such as the following, it means it can read its primary config file, but that config file specifies a separate validators config file (typically named `validators.txt`), which `rippled` cannot read.
```text
Loading: "/home/rippled/.config/ripple/rippled.cfg"
Terminating thread rippled: main: unhandled St13runtime_error 'The file specified in [validators_file] does not exist: /home/rippled/.config/ripple/validators.txt'
Aborted (core dumped)
```
Possible solutions:
- Check that the `[validators.txt]` file exists and the `rippled` user has permissions to read it.
**Tip:** The `rippled` repository contains [an example `validators.txt` file](https://github.com/ripple/rippled/blob/master/cfg/validators-example.txt) which is provided as the default config when you do an RPM installation. If you do not have the file, you can copy it from there.
- Edit your `rippled.cfg` file and modify the `[validators_file]` setting to have the correct path to your `validators.txt` (or equivalent) file. Check for extra whitespace before or after the filename.
- Edit your `rippled.cfg` file and remove the `[validators_file]` setting. Add validator settings directly to your `rippled.cfg` file. For example:
[validator_list_sites]
https://vl.ripple.com
[validator_list_keys]
ED2677ABFFD1B33AC6FBC3062B71F1E8397C1505E1C42C64D11AD1B28FF73F4734
## Cannot create database path
If `rippled` crashes on startup with an error such as the following, it means the server does not have write permissions to the `[database_path]` from its config file.
```text
Loading: "/home/rippled/.config/ripple/rippled.cfg"
Terminating thread rippled: main: unhandled St13runtime_error 'Can not create "/var/lib/rippled/db"'
Aborted (core dumped)
```
The paths to the configuration file (`/home/rippled/.config/ripple/rippled.cfg`) and the database path (`/var/lib/rippled/db`) may vary depending on your system.
Possible solutions:
- Run `rippled` as a different user that has write permissions to the database path printed in the error message.
- Edit your `rippled.cfg` file and change the `[database_path]` setting to use a path that the `rippled` user has write permissions to.
- Grant the `rippled` user write permissions to the configured database path.
## State DB Error
The following error can occur if the `rippled` server's state database is corrupted (possibly as the result of being shutdown unexpectedly):
```text
2018-Aug-21 23:06:38.675117810 SHAMapStore:ERR state db error:
writableDbExists false archiveDbExists false
writableDb '/var/lib/rippled/db/rocksdb/rippledb.11a9' archiveDb '/var/lib/rippled/db/rocksdb/rippledb.2d73'
To resume operation, make backups of and remove the files matching /var/lib/rippled/db/state* and contents of the directory /var/lib/rippled/db/rocksdb
Terminating thread rippled: main: unhandled St13runtime_error 'state db error'
```
The easiest way to fix this problem is to delete the databases entirely. You may want to back them up elsewhere instead. For example:
```sh
mv /var/lib/rippled/db /var/lib/rippled/db-bak
```
Or, if you are sure you don't need the databases:
```sh
rm -r /var/lib/rippled/db
```
**Tip:** It is generally safe to delete the `rippled` databases, because any individual server can re-download ledger history from other servers in the XRP Ledger network.
## Online Delete is Less Than Ledger History
An error message such as the following indicates that the `rippled.cfg` file has contradictory values for `[ledger_history]` and `online_delete`.
```text
Terminating thread rippled: main: unhandled St13runtime_error 'online_delete must not be less than ledger_history (currently 3000)
```
The `[ledger_history]` setting represents how many ledgers of history the server should seek to back-fill. The `online_delete` field (in the `[node_db]` stanza) indicates how many ledgers of history to keep when dropping older history. The `online_delete` value must be equal to or larger than `[ledger_history]` to prevent the server from deleting historical ledgers that it is also trying to download.
To fix the problem, edit the `rippled.cfg` file and change or remove either the `[ledger_history]` or `online_delete` options. (If you omit `[ledger_history]`, it defaults to 256 ledger versions, so `online_delete`, if present, must be larger than 256. If you omit `online_delete`, it disables automatic deletion of old ledger versions.)
## Bad node_size value
An error such as the following indicates that the `rippled.cfg` file has an improper value for the `node_size` setting:
```text
Terminating thread rippled: main: unhandled N5beast14BadLexicalCastE 'std::bad_cast'
```
Valid parameters for the `node_size` field are `tiny`, `small`, `medium`, or `huge`. For more information see [Node Size](capacity-planning.html#node-size).
## Shard path missing
An error such as the following indicates that the `rippled.cfg` has an incomplete [history sharding](history-sharding.html) configuration:
```text
Terminating thread rippled: main: unhandled St13runtime_error 'shard path missing'
```
If your config includes a `[shard_db]` stanza, it must contain a `path` field, which points to a directory where `rippled` can write the data for the shard store. This error means the `path` field is missing or located in the wrong place. Check for extra whitespace or typos in your config file, and compare against the [Shard Configuration Example](history-sharding.html#shard-configuration-example).

View File

@@ -0,0 +1,162 @@
# Understanding Log Messages
The following sections describe some of the most common types of log messages that can appear in a `rippled` server's debug log and how to interpret them.
This is an important step in [Diagnosing Problems](diagnosing-problems.html) with `rippled`.
## Crashes
Messages in the log that mention runtime errors can indicate that the server crashed. These messages usually start with a message such as one of the following examples:
```
Throw<std::runtime_error>
```
```
Terminating thread rippled: main: unhandled St13runtime_error
```
If your server always crashes on startup, see [Server Won't Start](server-wont-start.html) for possible cases.
If your server crashes randomly during operation or as a result of particular commands, make sure you are [updated](update-rippled.html) to the latest `rippled` version. If you are on the latest version and your server is still crashing, check the following:
- Is your server running out of memory? On some systems, `rippled` may be terminated by the Out Of Memory (OOM) Killer or another monitor process.
- If your server is running in a shared environment, are other users or administrators causing the machine or service to be restarted? For example, some hosted providers automatically kill any service that uses a large amount of a shared machine's resources for an extended period of time.
- Does your server meet the [minimum requirements](install-rippled.html#minimum-system-requirements) to run `rippled`? What about the [recommendations for production servers](capacity-planning.html#recommendation-1)?
If none of the above apply, please report the issue to Ripple as a security-sensitive bug. If Ripple can reproduce the crash, you may be eligible for a bounty. See <https://ripple.com/bug-bounty/> for details.
## Connection reset by peer
The following log message indicates that a peer `rippled` server closed a connection:
```text
2018-Aug-28 22:55:41.738765510 Peer:WRN [012] onReadMessage: Connection reset by peer
```
Losing connections from time to time is normal for any peer-to-peer network. **Occasional messages of this kind do not indicate a problem.**
A large number of these messages around the same time may indicate a problem, such as:
- Your internet connection to one or more specific peers was cut off.
- Your server may have been overloading the peer with requests, causing it to drop your server.
## No hash for fetch pack
Messages such as the following are caused by a bug in `rippled` v1.1.0 and earlier when downloading historical ledgers for [history sharding](history-sharding.html):
```text
2018-Aug-28 22:56:21.397076850 LedgerMaster:ERR No hash for fetch pack. Missing Index 7159808
```
These can be safely ignored.
## LoadMonitor Job
Messages such as the following occur when a function takes a long time to run (over 11 seconds in this example):
```text
2018-Aug-28 22:56:36.180827973 LoadMonitor:WRN Job: gotFetchPack run: 11566ms wait: 0ms
```
The following similar message occurs when a job spends a long time waiting to run (again, over 11 seconds in this example):
```text
2018-Aug-28 22:56:36.180970431 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 11566ms
2018-Aug-28 22:56:36.181053831 LoadMonitor:WRN Job: AcquisitionDone run: 0ms wait: 11566ms
2018-Aug-28 22:56:36.181110594 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 11566ms
2018-Aug-28 22:56:36.181169931 LoadMonitor:WRN Job: AcquisitionDone run: 0ms wait: 11566ms
```
These two types of messages often occur together, when a long-running job causes other jobs to wait a long time for it to finish.
It is **normal** to display several messages of these types **during the first few minutes** after starting the server.
If the messages continue for more than 5 minutes after starting the server, especially if the `run` times are well over 1000ms, that may indicate that **your server does not have sufficient resources, such as disk I/O, RAM, or CPU**. This may be caused by not having sufficiently-powerful hardware or because other processes running on the same hardware are competing with `rippled` for resources. (Examples of other processes that may compete with `rippled` for resources include scheduled backups, virus scanners, and periodic database cleaners.)
Another possible cause is trying to use NuDB on rotational hard disks; NuDB should only be used with solid state drives (SSDs). Ripple recommends always using SSD storage for `rippled`'s databases, but you _may_ be able to run `rippled` successfully on rotational disks using RocksDB. If you are using rotational disks, make sure both the `[node_db]` and the `[shard_db]` (if you have one) are configured to use RocksDB. For example:
```
[node_db]
type=RocksDB
# ... more config omitted
[shard_db]
type=RocksDB
```
## View of consensus changed during open
Log messages such as the following occur when a server is not in sync with the rest of the network:
```text
2018-Aug-28 22:56:22.368460130 LedgerConsensus:WRN View of consensus changed during open status=open, mode=proposing
2018-Aug-28 22:56:22.368468202 LedgerConsensus:WRN 96A8DF9ECF5E9D087BAE9DDDE38C197D3C1C6FB842C7BB770F8929E56CC71661 to 00B1E512EF558F2FD9A0A6C263B3D922297F26A55AEB56A009341A22895B516E
2018-Aug-28 22:56:22.368499966 LedgerConsensus:WRN {"accepted":true,"account_hash":"89A821400087101F1BF2D2B912C6A9F2788CC715590E8FA5710F2D10BF5E3C03","close_flags":0,"close_time":588812130,"close_time_human":"2018-Aug-28 22:55:30.000000000","close_time_resolution":30,"closed":true,"hash":"96A8DF9ECF5E9D087BAE9DDDE38C197D3C1C6FB842C7BB770F8929E56CC71661","ledger_hash":"96A8DF9ECF5E9D087BAE9DDDE38C197D3C1C6FB842C7BB770F8929E56CC71661","ledger_index":"3","parent_close_time":588812070,"parent_hash":"5F5CB224644F080BC8E1CC10E126D62E9D7F9BE1C64AD0565881E99E3F64688A","seqNum":"3","totalCoins":"100000000000000000","total_coins":"100000000000000000","transaction_hash":"0000000000000000000000000000000000000000000000000000000000000000"}
```
During the first 5 to 15 minutes after the server starts up, it is normal for it to be out of sync with the rest of the network and print messages such as these. If the server writes these messages long after starting up, it could indicate a problem. Common causes include unreliable network connections and insufficient hardware specs. This can also happen when other processes running on the same hardware are competing with `rippled` for resources. (Examples of other processes that may compete with `rippled` for resources include scheduled backups, virus scanners, and periodic database cleaners.)
## Already validated sequence at or past
Log messages such as the following indicate that a server received validations for different ledger sequences out of order.
```text
2018-Aug-28 22:55:58.316094260 Validations:WRN Val for 2137ACEFC0D137EFA1D84C2524A39032802E4B74F93C130A289CD87C9C565011 trusted/full from nHUeUNSn3zce2xQZWNghQvd9WRH6FWEnCBKYVJu2vAizMxnXegfJ signing key n9KcRZYHLU9rhGVwB9e4wEMYsxXvUfgFxtmX25pc1QPNgweqzQf5 already validated sequence at or past 12133663 src=1
```
Occasional messages of this type do not usually indicate a problem. If this type of message occurs frequently with the same sending validator, it could indicate a problem, including any of the following (roughly in order of most to least likely):
- The server writing the message is having network issues.
- The validator described in the message is having network issues.
- The validator described in the message is behaving maliciously.
## Unable to determine hash of ancestor
Log messages such as the following occur when the server sees a validation message from a peer and it does not know the parent ledger version that server is building on. This is normal when a server is syncing to the network.
```text
2018-Aug-28 22:56:22.256065549 Validations:WRN Unable to determine hash of ancestor seq=3 from ledger hash=00B1E512EF558F2FD9A0A6C263B3D922297F26A55AEB56A009341A22895B516E seq=12133675
```
If this message occurs frequently outside of the first 5 to 15 minutes after starting the server, it could indicate a problem.
## InboundLedger Want hash
Log messages such as the following indicate that the server is requesting ledger data from other servers:
```text
InboundLedger:WRN Want: 5AE53B5E39E6388DBACD0959E5F5A0FCAF0E0DCBA45D9AB15120E8CDD21E019B
```
This is normal if your server is syncing, backfilling, or downloading [history shards](history-sharding.html).
## InboundLedger 11 timeouts for ledger
```text
InboundLedger:WRN 11 timeouts for ledger 8265938
```
This indicates that your server is having trouble requesting specific ledger data from its peers. If the [ledger index](basic-data-types.html#ledger-index) is much lower than the most recent validated ledger's index as reported by the [server_info method][], this probably indicates that your server is downloading a [history shard](history-sharding.html).
This is not strictly a problem, but if you want to acquire ledger history faster, you can configure `rippled` to connect to peers with full history by adding or editing the `[ips_fixed]` config stanza and restarting the server. For example, to always try to connect to one of Ripple's full-history servers:
```
[ips_fixed]
s2.ripple.com 51235
```
<!--{# common link defs #}-->
{% include '_snippets/rippled-api-links.md' %}
{% include '_snippets/tx-type-links.md' %}
{% include '_snippets/rippled_versions.md' %}

View File

@@ -865,7 +865,7 @@ pages:
funnel: Docs
doc_type: Tutorials
category: Manage the rippled Server
blurb: Build, install, and run the rippled server. Configure it to your use case and troubleshoot problems.
blurb: Build, install, configure, and run the rippled server.
template: template-landing-children.html
targets:
- local
@@ -950,8 +950,50 @@ pages:
targets:
- local
# TODO: troubleshoot rippled,
# run a validator, run in stand-alone mode, set up private peers
# TODO: run a validator, run in stand-alone mode, set up private peers
- name: Troubleshooting rippled
html: troubleshoot-the-rippled-server.html
funnel: Docs
doc_type: Tutorials
category: Manage the rippled Server
subcategory: Troubleshooting rippled
blurb: Troubleshoot all kinds of problems with the rippled server.
template: template-landing-children.html
targets:
- local
- md: tutorials/manage-the-rippled-server/troubleshooting/diagnosing-problems.md
html: diagnosing-problems.html
funnel: Docs
doc_type: Tutorials
category: Manage the rippled Server
subcategory: Troubleshooting rippled
blurb: Collect information to identify the cause of problems.
targets:
- local
- md: tutorials/manage-the-rippled-server/troubleshooting/understanding-log-messages.md
html: understanding-log-messages.html
funnel: Docs
doc_type: Tutorials
category: Manage the rippled Server
subcategory: Troubleshooting rippled
blurb: Interpret and respond to warning and error messages in the debug log.
targets:
- local
- md: tutorials/manage-the-rippled-server/troubleshooting/server-wont-start.md
html: server-wont-start.html
funnel: Docs
doc_type: Tutorials
category: Manage the rippled Server
subcategory: Troubleshooting rippled
blurb: A collection of problems that would cause a rippled server not to start, and how to fix them.
targets:
- local
# TODO: more troubleshooting rippled articles
- md: tutorials/manage-the-rippled-server/fix-sqlite-tx-db-page-size-issue.md
html: fix-sqlite-tx-db-page-size-issue.html