Update nudb comments.

2025-12-06 17:27:55 +00:00 · 2015-02-09 13:09:04 -05:00
parent 8bda9487c6
commit 0339904920
1 changed files with 68 additions and 63 deletions
--- a/src/beast/beast/nudb/README.md
+++ b/src/beast/beast/nudb/README.md
@@ -6,34 +6,36 @@ these database than what is traditional. NuDB provides highly
 optimized and concurrent atomic, durable, and isolated fetch and
 insert operations to secondary storage, along with these features:

-* Low memory footprint
-* Values are immutable
-* Value sizes from 1 2^48 bytes (281TB)
-* All keys are the same size
-* Performance independent of growth
-* Optimized for concurrent fetch
-* Key file can be rebuilt if needed
-* Inserts are atomic and consistent
-* Data file may be iterated, index rebuilt.
-* Key and data files may be on different volumes
-* Hardened against algorithmic complexity attacks
-* Header-only, nothing to build or link
+* Low memory footprint.
+* Values are immutable.
+* Value sizes from 1 to 2^48 bytes (281TB).
+* All keys are the same size.
+* Performance independent of growth.
+* Optimized for concurrent fetch.
+* Key file can be rebuilt if needed.
+* Inserts are atomic and consistent.
+* Data files may be efficiently iterated.
+* Key and data files may be on different volumes.
+* Hardened against algorithmic complexity attacks.
+* Header-only, nothing to build or link.

-Three files are used. The data file holds keys and values stored
-sequentially and size-prefixed. The key file holds a series of
-fixed-size bucket records forming an on-disk hash table. The log file
-stores bookkeeping information used to restore consistency when an
-external failure occurs. In typical cases a fetch costs one I/O to
-consult the key file and if the key is present, one I/O to read the
-value.
+Three files are used. 
+
+* The data file holds keys and values stored sequentially and size-prefixed.
+* The key file holds a series of fixed-size bucket records forming an on-disk
+  hash table.
+* The log file stores bookkeeping information used to restore consistency when
+an external failure occurs. 
+
+In typical cases a fetch costs one I/O cycle to consult the key file, and if the
+key is present, one I/O cycle to read the value.

 ## Usage

-Callers define these parameters when a database is created:
+Callers must define these parameters when _creating_ a database:

-* KeySize: The size of a key in bytes
-* BlockSize: The physical size of a key file record
-* LoadFactor: The desired fraction of bucket occupancy
+* `KeySize`: The size of a key in bytes.
+* `BlockSize`: The physical size of a key file record.

 The ideal block size matches the sector size or block size of the
 underlying physical media that holds the key file. Functions are
@@ -42,33 +44,37 @@ device, but a default of 4096 should work for typical installations.
 The implementation tries to fit as many entries as possible in a key
 file record, to maximize the amount of useful work performed per I/O.

-The load factor is chosen to make bucket overflows unlikely without
+* `LoadFactor`: The desired fraction of bucket occupancy
+
+`LoadFactor` is chosen to make bucket overflows unlikely without
 sacrificing bucket occupancy. A value of 0.50 seems to work well with
 a good hash function.

-Callers also provide these parameters when a database is opened:
+Callers must also provide these parameters when a database is _opened:_

-* Appnum: An application-defined integer constant
-* AllocSize: A significant multiple of the average data size
+* `Appnum`: An application-defined integer constant which can be retrieved 
+later from the database [TODO].
+* `AllocSize`: A significant multiple of the average data size.

-To improve performance, memory is recycled. NuDB needs a hint about
-the average size of the data being inserted. For an average data
-size of 1KB (one kilobyte), AllocSize of sixteen megabytes (16MB) is
-sufficient. If the AllocSize is too low, the memory recycler will
-not make efficient use of allocated blocks.
+Memory is recycled to improve performance, so NuDB needs `AllocSize` as a
+hint about the average size of the data being inserted. For an average data size
+of 1KB (one kilobyte), `AllocSize` of sixteen megabytes (16MB) is sufficient. If
+the `AllocSize` is too low, the memory recycler will not make efficient use of
+allocated blocks.

-Two operations are defined, fetch and insert.
+Two operations are defined: `fetch`, and `insert`.

-### Fetch
+### `fetch`

-The fetch operation retrieves a variable length value given the
+The `fetch` operation retrieves a variable length value given the
 key. The caller supplies a factory used to provide a buffer for storing
 the value. This interface allows custom memory allocation strategies.

-### Insert
+### `insert`

-Insert adds a key/value pair to the store. Value data must contain at
-least one byte. Duplicate keys are disallowed. Insertions are serialized.
+`insert` adds a key/value pair to the store. Value data must contain at least
+one byte. Duplicate keys are disallowed. Insertions are serialized, which means
+[TODO].

 ## Implementation

@@ -89,24 +95,24 @@ and immutable: once written, bytes are never changed.
 Initially the hash table in the key file consists of a single bucket.
 After the load factor is exceeded from insertions, the hash table grows
 in size by one bucket by doing a "split". The split operation is the
-linear hashing algorithm as described by Litwin and Larson:
-    
-http://en.wikipedia.org/wiki/Linear_hashing
+[linear hashing algorithm](http://en.wikipedia.org/wiki/Linear_hashing) 
+as described by Litwin and Larson.

-When a bucket is split, each key is rehashed and either remains in the
-original bucket or gets moved to the new bucket appended to the end of
+
+When a bucket is split, each key is rehashed, and either remains in the
+original bucket or gets moved to the a bucket appended to the end of
 the key file.

-An insertion on a full bucket first triggers the "spill" algorithm:
-First, a spill record is appended to the data file. The spill record
-contains header information followed by the entire bucket record. Then,
-the bucket's size is set to zero and the offset of the spill record is
-stored in the bucket. At this point the insertion may proceed normally,
-since the bucket is empty. Spilled buckets in the data file are always
-full.
+An insertion on a full bucket first triggers the "spill" algorithm.
+
+First, a spill record is appended to the data file, containing header
+information followed by the entire bucket record. Then the bucket's size is set
+to zero and the offset of the spill record is stored in the bucket. At this
+point the insertion may proceed normally, since the bucket is empty. Spilled
+buckets in the data file are always full.

 Because every bucket holds the offset of the next spill record in the
-data file, each bucket forms a linked list. In practice, careful
+data file, the buckets form a linked list. In practice, careful
 selection of capacity and load factor will keep the percentage of
 buckets with one spill record to a minimum, with no bucket requiring
 two spill records.
@@ -141,16 +147,16 @@ database stores information used to roll back partial commits.

 Each record in the data file is prefixed with a header identifying
 whether it is a value record or a spill record, along with the size of
-the record in bytes and a copy of the key if its a value record.
-Therefore, values may be iterated. A key file can be regenerated from
+the record in bytes and a copy of the key if it's a value record, so values can
+be iterated by incrementing a byte counter. A key file can be regenerated from
 just the data file by iterating the values and performing the key
 insertion algorithm.

 ## Concurrency

 Locks are never held during disk reads and writes. Fetches are fully
-concurrent, while inserts are serialized. Inserts prevent duplicate
-keys. Inserts are atomic, they either succeed immediately or fail.
+concurrent, while inserts are serialized. Inserts fail on duplicate
+keys, and are atomic: they either succeed immediately or fail.
 After an insert, the key is immediately visible to subsequent fetches.

 ## Formats
@@ -180,18 +186,18 @@ fixed-length Bucket Records.
    uint8[56]       Reserved        Zeroes
    uint8[]         Reserved        Zero-pad to block size

-The Type identifies the file as belonging to nudb. The UID is
+`Type` identifies the file as belonging to nudb. `UID` is
 generated randomly when the database is created, and this value
-is stored in the data and log files as well. The UID is used
-to determine if files belong to the same database. Salt is
+is stored in the data and log files as well - it's used
+to determine if files belong to the same database. `Salt` is
 generated when the database is created and helps prevent
-complexity attacks; the salt is prepended to the key material
+complexity attacks; it is prepended to the key material
 when computing a hash, or used to initialize the state of
-the hash function. Appnum is an application defined constant
+the hash function. `Appnum` is an application defined constant
 set when the database is created. It can be used for anything,
 for example to distinguish between different data formats.

-Pepper is computed by hashing the salt using a hash function
+`Pepper` is computed by hashing `Salt` using a hash function
 seeded with the salt. This is used to fingerprint the hash
 function used. If a database is opened and the fingerprint
 does not match the hash calculation performed using the template
@@ -231,8 +237,7 @@ variable-length Value Records and Spill Records.
    uint64              UID             Unique ID generated on creation
    uint64              Appnum          Application defined constant
    uint16              KeySize         Key size in bytes
-
-    uint8[64]           Reserved        Zeroes
+    uint8[64]           (reserved)      Zeroes

 UID contains the same value as the salt in the corresponding key
 file. This is placed in the data file so that key and value files