Mirror of https://github.com/XRPLF/rippled.git
Synced 2025-11-14 16:15:51 +00:00

Commit: Update nudb comments.
@@ -6,34 +6,36 @@ these databases than what is traditional. NuDB provides highly
optimized and concurrent atomic, durable, and isolated fetch and
insert operations to secondary storage, along with these features:

* Low memory footprint.
* Values are immutable.
* Value sizes from 1 to 2^48 bytes (281TB).
* All keys are the same size.
* Performance independent of growth.
* Optimized for concurrent fetch.
* Key file can be rebuilt if needed.
* Inserts are atomic and consistent.
* Data files may be efficiently iterated.
* Key and data files may be on different volumes.
* Hardened against algorithmic complexity attacks.
* Header-only, nothing to build or link.

Three files are used.

* The data file holds keys and values stored sequentially and size-prefixed.
* The key file holds a series of fixed-size bucket records forming an on-disk
  hash table.
* The log file stores bookkeeping information used to restore consistency when
  an external failure occurs.

In typical cases a fetch costs one I/O cycle to consult the key file, and if the
key is present, one I/O cycle to read the value.

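As a toy illustration of this two-I/O pattern (the structures below are simplified stand-ins, not NuDB's real on-disk layout):

```python
# Model the key file as a list of buckets (dict: key -> data offset) and
# the data file as a list of values. Each subscript below stands in for
# one I/O against the corresponding file.
def fetch(key, buckets, data_file):
    bucket = buckets[hash(key) % len(buckets)]  # I/O 1: read the bucket record
    offset = bucket.get(key)
    if offset is None:
        return None                             # key absent: one I/O total
    return data_file[offset]                    # I/O 2: read the value
```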
## Usage

Callers must define these parameters when _creating_ a database:

* `KeySize`: The size of a key in bytes.
* `BlockSize`: The physical size of a key file record.

The ideal block size matches the sector size or block size of the
underlying physical media that holds the key file. Functions are
@@ -42,33 +44,37 @@ device, but a default of 4096 should work for typical installations.
The implementation tries to fit as many entries as possible in a key
file record, to maximize the amount of useful work performed per I/O.

* `LoadFactor`: The desired fraction of bucket occupancy.

`LoadFactor` is chosen to make bucket overflows unlikely without
sacrificing bucket occupancy. A value of 0.50 seems to work well with
a good hash function.

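The split trigger can be written as a one-line predicate (a sketch; the names `num_items`, `num_buckets`, and `bucket_capacity` are assumptions for illustration, not NuDB identifiers):

```python
def should_split(num_items, num_buckets, bucket_capacity, load_factor=0.5):
    # Grow the table by one bucket once average occupancy exceeds the
    # configured load factor (0.50 by default).
    return num_items > load_factor * num_buckets * bucket_capacity
```

With a capacity of 4 entries per bucket and a single bucket, the third insertion is the first to exceed a 0.50 load factor.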
Callers must also provide these parameters when a database is _opened_:

* `Appnum`: An application-defined integer constant which can be retrieved
  later from the database [TODO].
* `AllocSize`: A significant multiple of the average data size.

Memory is recycled to improve performance, so NuDB needs `AllocSize` as a
hint about the average size of the data being inserted. For an average data size
of 1KB (one kilobyte), an `AllocSize` of sixteen megabytes (16MB) is sufficient.
If the `AllocSize` is too low, the memory recycler will not make efficient use
of allocated blocks.

Two operations are defined: `fetch` and `insert`.

### `fetch`

The `fetch` operation retrieves a variable length value given the
key. The caller supplies a factory used to provide a buffer for storing
the value. This interface allows custom memory allocation strategies.

||||||
### Insert
|
### `insert`
|
||||||
|
|
||||||
Insert adds a key/value pair to the store. Value data must contain at
|
`insert` adds a key/value pair to the store. Value data must contain at least
|
||||||
least one byte. Duplicate keys are disallowed. Insertions are serialized.
|
one byte. Duplicate keys are disallowed. Insertions are serialized, which means
|
||||||
|
[TODO].
|
||||||
|
|
||||||
## Implementation

@@ -89,24 +95,24 @@ and immutable: once written, bytes are never changed.
Initially the hash table in the key file consists of a single bucket.
After the load factor is exceeded from insertions, the hash table grows
in size by one bucket by doing a "split". The split operation is the
[linear hashing algorithm](http://en.wikipedia.org/wiki/Linear_hashing)
as described by Litwin and Larson.

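The bucket-selection rule of linear hashing can be sketched as follows (an illustrative textbook formulation; NuDB's internals may differ):

```python
def bucket_index(h, num_buckets):
    # Linear hashing: hash modulo the next power of two, folding indices
    # for buckets that have not been created yet back onto their parent.
    level = 1
    while level < num_buckets:
        level *= 2                 # smallest power of two >= num_buckets
    idx = h % level
    if idx >= num_buckets:         # bucket not split off yet
        idx = h % (level // 2)     # use the parent bucket
    return idx
```

Growing one bucket at a time this way keeps every existing key addressable, with only the split bucket's keys needing to move.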
When a bucket is split, each key is rehashed, and either remains in the
original bucket or gets moved to a new bucket appended to the end of
the key file.

An insertion on a full bucket first triggers the "spill" algorithm.
First, a spill record is appended to the data file, containing header
information followed by the entire bucket record. Then the bucket's size is set
to zero and the offset of the spill record is stored in the bucket. At this
point the insertion may proceed normally, since the bucket is empty. Spilled
buckets in the data file are always full.

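A toy sketch of these three steps (the dict-based structures are illustrative only, not NuDB's record format):

```python
def insert_with_spill(bucket, data_file, key, offset, capacity):
    # Sketch of the spill algorithm on a toy in-memory bucket.
    if len(bucket["entries"]) >= capacity:
        # Step 1: append the entire full bucket to the data file
        # as a spill record.
        data_file.append({"type": "spill",
                          "entries": dict(bucket["entries"]),
                          "next": bucket["spill"]})
        # Step 2: empty the bucket and remember the spill record's offset.
        bucket["spill"] = len(data_file) - 1
        bucket["entries"] = {}
    # Step 3: the insertion proceeds normally on the now-empty bucket.
    bucket["entries"][key] = offset
```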
Because every bucket holds the offset of the next spill record in the
data file, the buckets form a linked list. In practice, careful
selection of capacity and load factor will keep the percentage of
buckets with one spill record to a minimum, with no bucket requiring
two spill records.

@@ -141,16 +147,16 @@ database stores information used to roll back partial commits.

Each record in the data file is prefixed with a header identifying
whether it is a value record or a spill record, along with the size of
the record in bytes and a copy of the key if it's a value record, so values can
be iterated by incrementing a byte counter. A key file can be regenerated from
just the data file by iterating the values and performing the key
insertion algorithm.

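A minimal sketch of walking size-prefixed records with a byte counter (the header layout here is invented for illustration and is not NuDB's actual format):

```python
import struct

KEY_SIZE = 4  # fixed key size, chosen for the example

def make_record(key, value):
    # Invented layout: uint32 value size, then the key, then value bytes.
    assert len(key) == KEY_SIZE and len(value) >= 1
    return struct.pack("<I", len(value)) + key + value

def iter_records(blob):
    # Walk the "data file" by advancing a byte counter past each record.
    pos = 0
    while pos < len(blob):
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        key = blob[pos:pos + KEY_SIZE]
        pos += KEY_SIZE
        value = blob[pos:pos + size]
        pos += size
        yield key, value
```

Regenerating a key file is then just: iterate the records and re-run the key insertion algorithm for each key and its offset.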
## Concurrency

Locks are never held during disk reads and writes. Fetches are fully
concurrent, while inserts are serialized. Inserts fail on duplicate
keys, and are atomic: they either succeed immediately or fail.
After an insert, the key is immediately visible to subsequent fetches.

## Formats

@@ -180,18 +186,18 @@ fixed-length Bucket Records.

    uint8[56]       Reserved        Zeroes
    uint8[]         Reserved        Zero-pad to block size

`Type` identifies the file as belonging to nudb. `UID` is
generated randomly when the database is created, and this value
is stored in the data and log files as well - it's used
to determine if files belong to the same database. `Salt` is
generated when the database is created and helps prevent
complexity attacks; it is prepended to the key material
when computing a hash, or used to initialize the state of
the hash function. `Appnum` is an application-defined constant
set when the database is created. It can be used for anything,
for example to distinguish between different data formats.

`Pepper` is computed by hashing `Salt` using a hash function
seeded with the salt. This is used to fingerprint the hash
function used. If a database is opened and the fingerprint
does not match the hash calculation performed using the template
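The fingerprint check on open can be sketched as follows (SHA-256 stands in for whatever hash template the database was actually built with; prepending the salt models seeding):

```python
import hashlib

def pepper(salt: bytes) -> bytes:
    # Hash the salt with a function seeded by the salt itself; changing
    # the hash function changes this fingerprint.
    return hashlib.sha256(salt + salt).digest()

def hash_function_matches(stored_pepper: bytes, salt: bytes) -> bool:
    # On open: recompute the pepper and compare with the stored value.
    return pepper(salt) == stored_pepper
```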
@@ -231,8 +237,7 @@ variable-length Value Records and Spill Records.

    uint64          UID             Unique ID generated on creation
    uint64          Appnum          Application defined constant
    uint16          KeySize         Key size in bytes
    uint8[64]       (reserved)      Zeroes

UID contains the same value as the salt in the corresponding key
file. This is placed in the data file so that key and value files