Health Check edits:

- revised per reviews
- updated for rippled#3491
This commit is contained in:
mDuo13
2020-07-08 13:39:28 -07:00
parent c82994068b
commit fa277b7197
2 changed files with 71 additions and 22 deletions

View File

@@ -3,14 +3,14 @@
The Health Check is a special [peer port method](peer-port-methods.html) for reporting on the health of an individual `rippled` server. This method is intended for use in automated monitoring to recognize outages and prompt automated or manual interventions such as restarting the server. [New in: rippled 1.6.0][]
This method checks several metrics to see if they are in ranges generally considered healthy. If all metrics are in normal ranges, this method reports that the server is healthy. If any metric is outside normal ranges, this method reports that the server is unhealthy and reports the metric(s) that are unhealthy. Since some metrics may rapidly fluctuate into and out of unhealthy ranges, you should not raise alerts unless the health check fails multiple times in a row.
**Note:** Since the health check is a [peer port method](peer-port-methods.html), it is not available when testing the server in [stand-alone mode](rippled-server-modes.html#reasons-to-run-a-rippled-server-in-stand-alone-mode).
## Request Format
To request the Health Check information, make the following HTTP request:
- **Protocol:** https
- **HTTP Method:** GET
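For example, the following Python sketch makes this request with the `requests` library. The hostname, peer port (51235), and `/health` path are assumptions based on a typical default configuration; adjust them to match your server. The peer port usually presents a self-signed TLS certificate, so the sketch disables certificate verification.
```python
# Hypothetical example of calling the health check from Python.
# Assumes a server running locally on the default peer port (51235).
import requests
import urllib3

# The peer port typically uses a self-signed certificate, so skip
# verification here; only do this for servers you control.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://localhost:51235/health", verify=False, timeout=5)
print(response.status_code)  # 200 (healthy), 503 (warning), or 500 (critical)
print(response.json())       # e.g. {"info": {}}
```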
@@ -23,6 +23,24 @@ To request the Peer Crawler information, make the following HTTP request:
## Example Response
<!-- MULTICODE_BLOCK_START -->
*Healthy*
```json
HTTP/1.1 200 OK
Server: rippled-1.6.0-b8
Content-Type: application/json
Connection: close
Transfer-Encoding: chunked
{
  "info": {}
}
```
*Warning*
```json
HTTP/1.1 503 Service Unavailable
Server: rippled-1.6.0
@@ -32,18 +50,43 @@ Transfer-Encoding: chunked
{
  "info": {
    "load_factor": 256,
    "server_state": "connected",
"validated_ledger": 2147483647
"validated_ledger": -1
}
}
```
*Critical*
```json
HTTP/1.1 500 Internal Server Error
Server: rippled-1.6.0
Content-Type: application/json
Connection: close
Transfer-Encoding: chunked
{
  "info": {
    "peers": 0,
    "server_state": "disconnected",
    "validated_ledger": -1
  }
}
```
<!-- MULTICODE_BLOCK_END -->
## Response Format
The response's HTTP status code indicates the health of the server:
| Status Code | Health Status | Description |
|:------------------------------|:--------------|:-----------------------------|
| **200 OK** | Healthy | All health metrics are within acceptable ranges. |
| **503 Service Unavailable** | Warning | One or more metrics are in the warning range. Manual intervention may or may not be necessary. |
| **500 Internal Server Error** | Critical | One or more metrics are in the critical range. There is a serious problem that probably needs manual intervention to fix. |
The response body is a JSON object with a single `info` object at the top level. The `info` object contains values for each metric that is in a warning or critical range. The response omits metrics that are in a healthy range, so a fully healthy server has an empty object.
The `info` object may contain the following fields:
@@ -53,7 +96,7 @@ The `info` object may contain the following fields:
| `load_factor` | Number | _(May be omitted)_ A measure of the overall load the server is under. This reflects I/O, CPU, and memory limitations. This is a warning if the load factor is over 100, or critical if the load factor is 1000 or higher. |
| `peers` | Number | _(May be omitted)_ The number of [peer servers](peer-protocol.html) this server is connected to. This is a warning if connected to 7 or fewer peers, and critical if connected to zero peers. |
| `server_state` | String | _(May be omitted)_ The current [server state](rippled-server-states.html). This is a warning if the server is in the `tracking`, `syncing`, or `connected` states. This is critical if the server is in the `disconnected` state. |
| `validated_ledger` | Number | _(May be omitted)_ The number of seconds since the last time a ledger was validated by [consensus](intro-to-consensus.html). If there is no validated ledger available ([as during the initial sync period when starting the server](server-doesnt-sync.html#normal-syncing-behavior)), this is the value `-1` and is considered a warning. This metric is also a warning if the last validated ledger was at least 7 seconds ago, or critical if the last validated ledger was at least 20 seconds ago. |
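As a rough illustration of reading these fields, the sketch below maps a health check response to a simple status string. The `classify_health` helper and its return format are hypothetical, not part of `rippled`; the thresholds and status codes follow the tables above.
```python
# Hypothetical helper that interprets a health check response using the
# status codes and `info` fields described above.
def classify_health(status_code: int, body: dict) -> str:
    info = body.get("info", {})
    problem_metrics = sorted(info.keys())  # empty when fully healthy
    if status_code == 200:
        return "healthy"
    if status_code == 503:
        return f"warning: {problem_metrics}"
    if status_code == 500:
        return f"critical: {problem_metrics}"
    return f"unexpected status {status_code}"

# Example usage with the "Warning" response shown earlier:
print(classify_health(503, {"info": {"load_factor": 256,
                                     "server_state": "connected",
                                     "validated_ledger": -1}}))
# warning: ['load_factor', 'server_state', 'validated_ledger']
```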
## See Also

View File

@@ -6,10 +6,12 @@ Infrastructure monitoring, and reliability engineering more generally, is an adv
## Momentary Failures
Some [metrics][] in the health check can rapidly fluctuate into unhealthy ranges and then recover automatically shortly afterward. It is unnecessary and undesirable to raise alerts every single time the health check reports an unhealthy status. An automated monitoring system should call the health check method frequently, but only escalate to a higher level of intervention based on the severity and frequency of the problem.
For example, if you check the health of the server once per second, you might raise an alert if the server reports "warning" status three times in a row, or four times in a five-second span. You might also raise an alert if the server reports "critical" status twice in a five-second span.
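A minimal sketch of such a policy is shown below, polling once per second and escalating only on repeated failures. The URL, the exact thresholds, and the `send_alert` placeholder are illustrative assumptions; adapt them to your own monitoring system.
```python
# Illustrative monitoring loop: poll the health check once per second and
# alert only when unhealthy results repeat, as described above.
import time
from collections import deque

import requests
import urllib3

HEALTH_URL = "https://localhost:51235/health"  # assumed default peer port/path
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # self-signed peer-port cert

def send_alert(message: str) -> None:
    print("ALERT:", message)  # placeholder for a real alerting system

recent = deque(maxlen=5)  # status codes from the last 5 checks

while True:
    try:
        status = requests.get(HEALTH_URL, verify=False, timeout=2).status_code
    except requests.RequestException:
        status = 500  # treat an unreachable server as critical
    recent.append(status)

    last_three = list(recent)[-3:]
    if len(last_three) == 3 and all(code == 503 for code in last_three):
        send_alert("warning status three times in a row")
    if sum(1 for code in recent if code == 503) >= 4:
        send_alert("warning status four times in the last five checks")
    if sum(1 for code in recent if code == 500) >= 2:
        send_alert("critical status twice in the last five checks")

    time.sleep(1)
```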
**Tip:** The server normally reports a "critical" status for the first few seconds after startup, switches to a "warning" status after it establishes a connection to the network, and finally reports a "healthy" status when it has fully synced to the network. After a restart, you should give a server 5 to 15 minutes to sync before taking additional interventions.
## Special Cases
Certain server configurations may always report a `warning` status even when operating normally. If your server qualifies as a special case, you must configure your automated monitoring to recognize the difference between the normal status and an actual problem. This probably involves parsing the JSON response body for the health check method and comparing the values there with expected normal ranges.
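For instance, a private server that is deliberately configured with only a few fixed peers might always report `peers` in the warning range. The sketch below shows one way to filter out such known-normal values before alerting; the baseline values are purely illustrative, not recommendations.
```python
# Illustrative filter for a special-case server: ignore metrics whose values
# match this server's known-normal baseline, and alert only on the rest.
EXPECTED_NORMAL = {
    "peers": 3,  # example: a private server intentionally limited to 3 peers
}

def unexpected_problems(info: dict) -> dict:
    """Return only the reported metrics that differ from the expected baseline."""
    return {k: v for k, v in info.items() if EXPECTED_NORMAL.get(k) != v}

# Example: the warning about `peers` is expected here, but `server_state` is not.
print(unexpected_problems({"peers": 3, "server_state": "connected"}))
# {'server_state': 'connected'}
```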
@@ -27,9 +29,9 @@ The following sections suggest some common interventions you may want to attempt
- [Redirect traffic](#redirect-traffic) away from the affected server
- [Restart](#restart) the server software or hardware
- [Upgrade](#upgrade) the `rippled` software
- [Investigate network](#investigate-network) in case the problem originates elsewhere
- [Replace hardware](#replace-hardware)
### Redirect Traffic
@@ -41,7 +43,7 @@ Redirecting traffic away from a server that is unhealthy is an appropriate respo
### Restart
The most straightforward intervention is to restart the server. This can resolve temporary issues with several types of failures, including any of the following [metrics][]:
- `load_factor`
- `peers`
@@ -59,9 +61,18 @@ A stronger intervention is to restart the entire machine.
**Caution:** After a server starts, it typically needs up to 15 minutes to sync to the network. During this time, the health check is likely to report a critical or warning status. You should be sure your automated systems give servers enough time to sync before restarting them again.
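One way to build that grace period into an automated restart is sketched below, under the assumption that `rippled` runs as a systemd service named `rippled` (typical for package installations). The helper name and 15-minute window are illustrative, mirroring the caution above.
```python
# Illustrative automated restart with a post-restart grace period, so the
# server is not restarted again while it is still syncing.
import subprocess
import time

SYNC_GRACE_SECONDS = 15 * 60  # allow up to 15 minutes to sync after a restart
_last_restart = 0.0

def maybe_restart_rippled() -> bool:
    """Restart the service unless it was restarted within the grace period."""
    global _last_restart
    if time.time() - _last_restart < SYNC_GRACE_SECONDS:
        return False  # still within the sync grace period; do nothing
    subprocess.run(["sudo", "systemctl", "restart", "rippled"], check=True)
    _last_restart = time.time()
    return True
```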
### Upgrade
If the server reports `"amendment_blocked": true` in the health check, this indicates that the XRP Ledger has enabled a [protocol amendment](amendments.html) that your server does not understand. As a precaution against misinterpreting the revised rules of the network in a way that causes you to lose money, such servers become "amendment blocked" instead of operating normally.
To resolve being amendment blocked, [update your server](install-rippled.html) to a newer software version that understands the amendment.
Also, software bugs can cause a server to get [stuck not syncing](server-doesnt-sync.html). In this case, the `server_state` metric is likely to be in a warning or critical state. If you are not using the latest stable release, you should upgrade to get the latest fixes for any known issues that could cause this.
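As a small illustration, a monitoring script might check for this field explicitly and treat it as a signal to upgrade rather than restart. Here, `info` is assumed to be the parsed `info` object from the health check body.
```python
# Hypothetical check: an amendment-blocked server needs an upgrade, not a restart.
def needs_upgrade(info: dict) -> bool:
    return info.get("amendment_blocked", False) is True

if needs_upgrade({"amendment_blocked": True}):
    print("Server is amendment blocked: upgrade rippled to a newer release.")
```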
### Investigate Network
An unreliable or insufficient network connection can cause a server to report outages. Warning or critical values in the following [metrics][] can indicate network problems:
- `peers`
- `server_state`
@@ -77,28 +88,23 @@ In this case, the necessary interventions may involve changes to other systems,
### Replace Hardware
If the outage is caused by a hardware failure or by higher load than the hardware is capable of handling, you may need to replace some components or even the entire server.
The amount of load on a server in the XRP Ledger depends in part on transaction volume in the network, which varies organically. Load also depends on your usage pattern. See [Capacity Planning](capacity-planning.html) for how to plan the appropriate hardware and settings for your situation.
Warning or critical values for the following [metrics][] may indicate insufficient hardware:
- `load_factor`
- `server_state`
- `validated_ledger`
<!--{# common link defs #}-->
[metrics]: health-check.html#response-format
{% include '_snippets/rippled-api-links.md' %}
{% include '_snippets/tx-type-links.md' %}
{% include '_snippets/rippled_versions.md' %}