Add new RocksDBQuickFactory for benchmarking:

This new factory is intended for benchmarking against the existing RocksDBFactory and has the following differences:
* Does not use BatchWriter
* Disables WAL for writes to memtable
* Uses a hash index in blocks
* Uses RocksDB OptimizeFor… functions
See Benchmarks.md for further discussion of some of the issues raised by investigation of RocksDB performance.
Donovan Hide
2014-10-31 19:23:26 +00:00
committed by Vinnie Falco
parent 6540804571
commit a1f46e84b8
7 changed files with 455 additions and 0 deletions

View File

@@ -2520,6 +2520,11 @@
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBFactory.h">
</ClInclude>
<ClCompile Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.cpp">
<ExcludedFromBuild>True</ExcludedFromBuild>
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.h">
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\Database.h">
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\DummyScheduler.h">

View File

@@ -3561,6 +3561,12 @@
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBFactory.h">
<Filter>ripple\nodestore\backend</Filter>
</ClInclude>
<ClCompile Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.cpp">
<Filter>ripple\nodestore\backend</Filter>
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.h">
<Filter>ripple\nodestore\backend</Filter>
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\Database.h">
<Filter>ripple\nodestore</Filter>
</ClInclude>

View File

@@ -0,0 +1,45 @@
# Benchmarks
```
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdbquick,style=level,num_objects=2000000"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
       59.53         12.67            6.04          11.33     25.55           52.15   type=rocksdbquick,style=level,num_objects=2000000
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdbquick,style=level,num_objects=2000000"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
       44.29         27.45            5.95          20.47     23.58           53.60   type=rocksdbquick,style=level,num_objects=2000000
```
```
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
      377.61         30.62           10.05          17.41    201.73           64.46   type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
      405.83         29.48           11.29          25.81    209.05           55.75   type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2
```
## Discussion
RocksDBQuickFactory is intended to provide a testbed for comparing potential RocksDB performance with the existing recommended configuration in rippled.cfg. Based on repeated runs and profiling, some conclusions are presented below.
* If the write-ahead log is enabled, insert speed soon clogs up under load. The BatchWriter class tries to stop this from blocking the main threads by queuing up writes and running them in a separate thread. However, RocksDB already has threads dedicated to flushing the memtable to disk, and the memtable is itself an in-memory queue. The result is two queues with a durability guarantee in between. If the memtable were used as the sole queue, with the rocksdb::Flush() call triggered manually at opportune moments (possibly just after ledger close), that would provide similar but more predictable guarantees, while also removing an unneeded thread and unnecessary memory usage (see the first sketch after this list). An alternative point of view is that, because there will always be many other rippled instances running, there is no need for such guarantees: the nodes will always be available from another peer.
* Lookup in a block previously used binary search. Given rippled's use case, it is highly unlikely that two adjacent key/values will ever be requested one after the other, so hash indexing of blocks makes much more sense. RocksDB has a number of options for hash indexing both memtables and blocks, and these need more testing to find the best choice (see the second sketch after this list).
* The current Database implementation already has two forms of caching, so an LRU cache of blocks at the Factory level does not make any sense. However, if hash indexing and potentially the new [bloom filter](http://rocksdb.org/blog/1427/new-bloom-filter-format/) can provide faster lookup for non-existent keys, then the caching could potentially move to the Factory level.
* Multiple runs of the benchmarks can yield surprisingly different results. This can perhaps be attributed to the asynchronous nature of RocksDB's compaction process. The benchmarks are artificial and create a highly unlikely write load in order to build the dataset for measuring different read access patterns, so multiple runs are required to get a feel for the effectiveness of the changes. This contrasts sharply with the keyvadb benchmarking, where highly repeatable timings were discovered. Realistically sized datasets are also required to gain correct insight: 2,000,000 key/values (actually 4,000,000 after the two insert benchmarks complete) is too low to get a full picture.
* An interesting side effect of running the benchmarks in a profiler was that a clear pattern of what RocksDB does under the hood became observable. This led to the decision to trial hash indexing, and also to the discovery that the native CRC32 instruction was not being used.
* An important point to note: if this factory is tested against an existing set of SST files, none of the old files will benefit from the indexing changes until they are compacted at some future point in time.
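
As a sketch of the first point above: writes could skip the WAL entirely, with durability made explicit by flushing the memtable at a predictable moment. `WriteOptions::disableWAL` is what the new backend already uses; `rocksdb::DB::Flush` is the call that would replace the WAL's durability guarantee. The `QuickWriter` class and its `onLedgerClosed` hook are hypothetical illustrations, not existing rippled code.
```cpp
#include <rocksdb/db.h>

// Sketch only: skip the WAL on writes and make durability explicit by
// flushing the memtable at an opportune moment. "onLedgerClosed" is a
// hypothetical integration point, not an existing rippled hook.
class QuickWriter
{
public:
    explicit QuickWriter (rocksdb::DB* db) : m_db (db) {}

    // Writes land in the memtable only; they are not durable until the
    // next flush or compaction.
    void put (rocksdb::Slice key, rocksdb::Slice value)
    {
        rocksdb::WriteOptions options;
        options.disableWAL = true;
        m_db->Put (options, key, value);
    }

    // Manually persist the memtable, e.g. just after ledger close. This
    // replaces the durability the WAL would otherwise provide, at a
    // predictable point rather than under load.
    void onLedgerClosed ()
    {
        rocksdb::FlushOptions flushOptions;
        flushOptions.wait = true; // block until the memtable is on disk
        m_db->Flush (flushOptions);
    }

private:
    rocksdb::DB* m_db;
};
```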
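
For the second point, this sketch gathers the hash-indexing knobs in one place: the block-index and memtable options the new backend experiments with, plus the alternatives still to be benchmarked. `makeHashIndexedOptions` is an illustrative helper, and the bucket counts and bits-per-key are defaults, not tuned values.
```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

// Sketch of the hash-indexing options discussed above.
rocksdb::Options
makeHashIndexedOptions ()
{
    rocksdb::Options options;

    // Hash indexing requires a prefix extractor; the no-op transform
    // treats the whole key as the prefix, matching rippled's pure
    // point-lookup access pattern.
    options.prefix_extractor.reset (rocksdb::NewNoopTransform ());

    // Blocks: hash search instead of binary search inside SST files.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.index_type = rocksdb::BlockBasedTableOptions::kHashSearch;
    table_options.filter_policy.reset (rocksdb::NewBloomFilterPolicy (10));
    options.table_factory.reset (
        rocksdb::NewBlockBasedTableFactory (table_options));

    // Memtable: hash-based alternatives to the default skip list.
    options.memtable_factory.reset (rocksdb::NewHashSkipListRepFactory ());
    // Candidates still to be benchmarked:
    //   rocksdb::NewHashLinkListRepFactory ();
    //   rocksdb::NewHashCuckooRepFactory (options.write_buffer_size);

    return options;
}
```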

View File

@@ -0,0 +1,356 @@
//------------------------------------------------------------------------------
/*
This file is part of rippled: https://github.com/ripple/rippled
Copyright (c) 2012, 2013 Ripple Labs Inc.
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
//==============================================================================
#if RIPPLE_ROCKSDB_AVAILABLE
#include <ripple/core/Config.h>
#include <beast/threads/Thread.h>
#include <atomic>
namespace ripple {
namespace NodeStore {
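// Wraps the default Env so each RocksDB background thread gets a
// recognizable name, which makes profiler output easier to read.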
class RocksDBQuickEnv : public rocksdb::EnvWrapper
{
public:
RocksDBQuickEnv ()
: EnvWrapper (rocksdb::Env::Default())
{
}
struct ThreadParams
{
ThreadParams (void (*f_)(void*), void* a_)
: f (f_)
, a (a_)
{
}
void (*f)(void*);
void* a;
};
static
void
thread_entry (void* ptr)
{
ThreadParams* const p (reinterpret_cast <ThreadParams*> (ptr));
void (*f)(void*) = p->f;
void* a (p->a);
delete p;
static std::atomic <std::size_t> n;
std::size_t const id (++n);
std::stringstream ss;
ss << "rocksdb #" << id;
beast::Thread::setCurrentThreadName (ss.str());
(*f)(a);
}
void
StartThread (void (*f)(void*), void* a)
{
ThreadParams* const p (new ThreadParams (f, a));
EnvWrapper::StartThread (&RocksDBQuickEnv::thread_entry, p);
}
};
//------------------------------------------------------------------------------
class RocksDBQuickBackend
: public Backend
, public beast::LeakChecked <RocksDBQuickBackend>
{
public:
beast::Journal m_journal;
size_t const m_keyBytes;
std::string m_name;
std::unique_ptr <rocksdb::DB> m_db;
RocksDBQuickBackend (int keyBytes, Parameters const& keyValues,
Scheduler& scheduler, beast::Journal journal, RocksDBQuickEnv* env)
: m_journal (journal)
, m_keyBytes (keyBytes)
, m_name (keyValues ["path"].toStdString ())
{
if (m_name.empty())
throw std::runtime_error ("Missing path in RocksDBFactory backend");
// Defaults
std::uint64_t budget = 512 * 1024 * 1024; // 512MB
std::string style("level");
std::uint64_t threads = 4;
if (!keyValues["budget"].isEmpty())
budget = keyValues["budget"].getIntValue();
if (!keyValues["style"].isEmpty())
style = keyValues["style"].toStdString();
if (!keyValues["threads"].isEmpty())
threads = keyValues["threads"].getIntValue();
// Set options
rocksdb::Options options;
options.create_if_missing = true;
options.env = env;
if (style == "level")
options.OptimizeLevelStyleCompaction(budget);
if (style == "universal")
options.OptimizeUniversalStyleCompaction(budget);
if (style == "point")
options.OptimizeForPointLookup(budget / 1024 / 1024); // In MB
options.IncreaseParallelism(threads);
// A prefix extractor is required for hash indexing; the no-op
// transform treats the whole key as the prefix.
options.prefix_extractor.reset(rocksdb::NewNoopTransform());
// Override OptimizeLevelStyleCompaction
options.min_write_buffer_number_to_merge = 1;
rocksdb::BlockBasedTableOptions table_options;
// Use a hash index within each block
table_options.index_type =
rocksdb::BlockBasedTableOptions::kHashSearch;
table_options.filter_policy.reset(
rocksdb::NewBloomFilterPolicy(10));
// Higher values make reads slower
// table_options.block_size = 4096;
// No point when DatabaseImp has a cache
// table_options.block_cache =
// rocksdb::NewLRUCache(64 * 1024 * 1024);
// Construct the table factory only after table_options is complete,
// so the commented settings above would take effect if enabled.
options.table_factory.reset(
NewBlockBasedTableFactory(table_options));
options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
// Alternative:
// options.memtable_factory.reset(
// rocksdb::NewHashCuckooRepFactory(options.write_buffer_size));
rocksdb::DB* db = nullptr;
rocksdb::Status status = rocksdb::DB::Open (options, m_name, &db);
if (!status.ok () || !db)
throw std::runtime_error (std::string("Unable to open/create RocksDB: ") + status.ToString());
m_db.reset (db);
}
~RocksDBQuickBackend ()
{
}
std::string
getName()
{
return m_name;
}
//--------------------------------------------------------------------------
Status
fetch (void const* key, NodeObject::Ptr* pObject)
{
pObject->reset ();
Status status (ok);
rocksdb::ReadOptions const options;
rocksdb::Slice const slice (static_cast <char const*> (key), m_keyBytes);
std::string string;
rocksdb::Status getStatus = m_db->Get (options, slice, &string);
if (getStatus.ok ())
{
DecodedBlob decoded (key, string.data (), string.size ());
if (decoded.wasOk ())
{
*pObject = decoded.createObject ();
}
else
{
// Decoding failed, probably corrupted!
//
status = dataCorrupt;
}
}
else
{
if (getStatus.IsCorruption ())
{
status = dataCorrupt;
}
else if (getStatus.IsNotFound ())
{
status = notFound;
}
else
{
status = Status (customCode + getStatus.code());
m_journal.error << getStatus.ToString ();
}
}
return status;
}
void
store (NodeObject::ref object)
{
storeBatch(Batch{object});
}
void
storeBatch (Batch const& batch)
{
rocksdb::WriteBatch wb;
EncodedBlob encoded;
for (auto const& e : batch)
{
encoded.prepare (e);
wb.Put(
rocksdb::Slice(reinterpret_cast<char const*>(encoded.getKey()),
m_keyBytes),
rocksdb::Slice(reinterpret_cast<char const*>(encoded.getData()),
encoded.getSize()));
}
rocksdb::WriteOptions options;
// Crucial to ensure good write speed and non-blocking writes to memtable
options.disableWAL = true;
auto ret = m_db->Write (options, &wb);
if (!ret.ok ())
throw std::runtime_error ("storeBatch failed: " + ret.ToString());
}
void
for_each (std::function <void(NodeObject::Ptr)> f)
{
rocksdb::ReadOptions const options;
std::unique_ptr <rocksdb::Iterator> it (m_db->NewIterator (options));
for (it->SeekToFirst (); it->Valid (); it->Next ())
{
if (it->key ().size () == m_keyBytes)
{
DecodedBlob decoded (it->key ().data (),
it->value ().data (),
it->value ().size ());
if (decoded.wasOk ())
{
f (decoded.createObject ());
}
else
{
// Uh oh, corrupted data!
if (m_journal.fatal) m_journal.fatal <<
"Corrupt NodeObject #" << uint256 (it->key ().data ());
}
}
else
{
// VFALCO NOTE What does it mean to find an
// incorrectly sized key? Corruption?
if (m_journal.fatal) m_journal.fatal <<
"Bad key size = " << it->key ().size ();
}
}
}
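// This backend has no BatchWriter, so there is never a queued write
// load to report.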
int
getWriteLoad ()
{
return 0;
}
//--------------------------------------------------------------------------
void
writeBatch (Batch const& batch)
{
storeBatch (batch);
}
};
//------------------------------------------------------------------------------
class RocksDBQuickFactory : public Factory
{
public:
std::shared_ptr <rocksdb::Cache> m_lruCache;
RocksDBQuickEnv m_env;
RocksDBQuickFactory ()
{
}
~RocksDBQuickFactory ()
{
}
std::string
getName () const
{
return "RocksDBQuick";
}
std::unique_ptr <Backend>
createInstance (
size_t keyBytes,
Parameters const& keyValues,
Scheduler& scheduler,
beast::Journal journal)
{
return std::make_unique <RocksDBQuickBackend> (
keyBytes, keyValues, scheduler, journal, &m_env);
}
};
//------------------------------------------------------------------------------
std::unique_ptr <Factory>
make_RocksDBQuickFactory ()
{
return std::make_unique <RocksDBQuickFactory> ();
}
}
}
#endif

View File

@@ -0,0 +1,40 @@
//------------------------------------------------------------------------------
/*
This file is part of rippled: https://github.com/ripple/rippled
Copyright (c) 2012, 2013 Ripple Labs Inc.
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
//==============================================================================
#ifndef RIPPLE_NODESTORE_ROCKSDBQUICKFACTORY_H_INCLUDED
#define RIPPLE_NODESTORE_ROCKSDBQUICKFACTORY_H_INCLUDED
#if RIPPLE_ROCKSDB_AVAILABLE
#include <ripple/nodestore/Factory.h>
namespace ripple {
namespace NodeStore {
/** Factory to produce experimental RocksDB backends for the NodeStore.
@see Database
*/
std::unique_ptr <Factory> make_RocksDBQuickFactory ();
}
}
#endif
#endif

View File

@@ -63,6 +63,7 @@ public:
#if RIPPLE_ROCKSDB_AVAILABLE
add_factory (make_RocksDBFactory ());
add_factory (make_RocksDBQuickFactory ());
#endif
}

View File

@@ -45,6 +45,8 @@
#include <ripple/nodestore/backend/NullFactory.cpp>
#include <ripple/nodestore/backend/RocksDBFactory.h>
#include <ripple/nodestore/backend/RocksDBFactory.cpp>
#include <ripple/nodestore/backend/RocksDBQuickFactory.h>
#include <ripple/nodestore/backend/RocksDBQuickFactory.cpp>
#include <ripple/nodestore/impl/Backend.cpp>
#include <ripple/nodestore/impl/BatchWriter.cpp>