Add new RocksDBQuickFactory for benchmarking:

This new factory is intended for benchmarking against the existing RocksDBFactory and has the following differences:
* Does not use BatchWriter
* Disables WAL for writes to memtable
* Uses a hash index in blocks
* Uses RocksDB OptimizeFor… functions
See Benchmarks.md for further discussion of some of the issues raised by investigation of RocksDB performance.
Donovan Hide
2014-10-31 19:23:26 +00:00
committed by Vinnie Falco
parent 6540804571
commit a1f46e84b8
7 changed files with 455 additions and 0 deletions

View File

@@ -2520,6 +2520,11 @@
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBFactory.h">
</ClInclude>
<ClCompile Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.cpp">
<ExcludedFromBuild>True</ExcludedFromBuild>
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.h">
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\Database.h">
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\DummyScheduler.h">

View File

@@ -3561,6 +3561,12 @@
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBFactory.h">
<Filter>ripple\nodestore\backend</Filter>
</ClInclude>
<ClCompile Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.cpp">
<Filter>ripple\nodestore\backend</Filter>
</ClCompile>
<ClInclude Include="..\..\src\ripple\nodestore\backend\RocksDBQuickFactory.h">
<Filter>ripple\nodestore\backend</Filter>
</ClInclude>
<ClInclude Include="..\..\src\ripple\nodestore\Database.h">
<Filter>ripple\nodestore</Filter>
</ClInclude>

View File

@@ -0,0 +1,45 @@
# Benchmarks
```
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdbquick,style=level,num_objects=2000000"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
       59.53         12.67            6.04          11.33     25.55           52.15   type=rocksdbquick,style=level,num_objects=2000000
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdbquick,style=level,num_objects=2000000"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
       44.29         27.45            5.95          20.47     23.58           53.60   type=rocksdbquick,style=level,num_objects=2000000
```
```
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
      377.61         30.62           10.05          17.41    201.73           64.46   type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2
$rippled --unittest=NodeStoreTiming --unittest-arg="type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2"
ripple.bench.NodeStoreTiming repeatableObject
Batch Insert   Fetch 50/50   Fetch Missing   Fetch Random   Inserts   Ordered Fetch
      405.83         29.48           11.29          25.81    209.05           55.75   type=rocksdb,num_objects=2000000,open_files=2000,filter_bits=12,cache_mb=256,file_size_mb=8,file_size_mult=2
```
## Discussion
RocksDBQuickFactory is intended to provide a testbed for comparing potential RocksDB performance with the existing recommended configuration in rippled.cfg. Based on repeated runs and profiling, some conclusions are presented below.
* If the write-ahead log is enabled, insert speed soon clogs up under load. The BatchWriter class tries to stop this from blocking the main threads by queuing up writes and running them in a separate thread. However, RocksDB already has threads dedicated to flushing the memtable to disk, and the memtable is itself an in-memory queue. The result is two queues with a durability guarantee in between. If the memtable were used as the sole queue, with the rocksdb::Flush() call triggered manually at opportune moments (possibly just after ledger close), that would provide similar but more predictable guarantees, while also removing an unneeded thread and unnecessary memory usage (see the first sketch after this list). An alternative point of view is that, because there will always be many other rippled instances running, there is no need for such guarantees: the nodes will always be available from another peer.
* Lookup in a block previously used binary search. Given rippled's use case, it is highly unlikely that two adjacent key/values will ever be requested one after the other, so hash indexing of blocks makes much more sense. RocksDB has a number of options for hash indexing both memtables and blocks, and these need more testing to find the best choice (see the second sketch after this list).
* The current Database implementation already has two forms of caching, so an LRU cache of blocks at the Factory level does not make any sense. However, if hash indexing and potentially the new [bloom filter](http://rocksdb.org/blog/1427/new-bloom-filter-format/) can provide faster lookup for non-existent keys, then the caching could potentially move to the Factory level.
* Multiple runs of the benchmarks can yield surprisingly different results. This can perhaps be attributed to the asynchronous nature of RocksDB's compaction process. The benchmarks are artificial and create a highly unlikely write load in order to build the dataset for measuring different read access patterns, so multiple runs are required to get a feel for the effectiveness of the changes. This contrasts sharply with the keyvadb benchmarking, where highly repeatable timings were discovered. Realistically sized datasets are also required to gain correct insight: 2,000,000 key/values (actually 4,000,000 after the two insert benchmarks complete) is too low to get a full picture.
* An interesting side effect of running the benchmarks in a profiler was that a clear pattern of what RocksDB does under the hood became observable. This led to the decision to trial hash indexing, and also to the discovery that the native CRC32 instruction was not being used.
* An important point to note: if this factory is tested against an existing set of SST files, none of the old files will benefit from the indexing changes until they are compacted at some future point in time.
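
As a sketch of the first point above: writes could skip the WAL entirely, with durability made explicit by flushing the memtable at a predictable moment. `WriteOptions::disableWAL` is what the new backend already uses; `rocksdb::DB::Flush` is the call that would replace the WAL's durability guarantee. The `QuickWriter` class and its `onLedgerClosed` hook are hypothetical illustrations, not existing rippled code.
```cpp
#include <rocksdb/db.h>

// Sketch only: skip the WAL on writes and make durability explicit by
// flushing the memtable at an opportune moment. "onLedgerClosed" is a
// hypothetical integration point, not an existing rippled hook.
class QuickWriter
{
public:
    explicit QuickWriter (rocksdb::DB* db) : m_db (db) {}

    // Writes land in the memtable only; they are not durable until the
    // next flush or compaction.
    void put (rocksdb::Slice key, rocksdb::Slice value)
    {
        rocksdb::WriteOptions options;
        options.disableWAL = true;
        m_db->Put (options, key, value);
    }

    // Manually persist the memtable, e.g. just after ledger close. This
    // replaces the durability the WAL would otherwise provide, at a
    // predictable point rather than under load.
    void onLedgerClosed ()
    {
        rocksdb::FlushOptions flushOptions;
        flushOptions.wait = true; // block until the memtable is on disk
        m_db->Flush (flushOptions);
    }

private:
    rocksdb::DB* m_db;
};
```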
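
For the second point, this sketch gathers the hash-indexing knobs in one place: the block-index and memtable options the new backend experiments with, plus the alternatives still to be benchmarked. `makeHashIndexedOptions` is an illustrative helper, and the bucket counts and bits-per-key are defaults, not tuned values.
```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

// Sketch of the hash-indexing options discussed above.
rocksdb::Options
makeHashIndexedOptions ()
{
    rocksdb::Options options;

    // Hash indexing requires a prefix extractor; the no-op transform
    // treats the whole key as the prefix, matching rippled's pure
    // point-lookup access pattern.
    options.prefix_extractor.reset (rocksdb::NewNoopTransform ());

    // Blocks: hash search instead of binary search inside SST files.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.index_type = rocksdb::BlockBasedTableOptions::kHashSearch;
    table_options.filter_policy.reset (rocksdb::NewBloomFilterPolicy (10));
    options.table_factory.reset (
        rocksdb::NewBlockBasedTableFactory (table_options));

    // Memtable: hash-based alternatives to the default skip list.
    options.memtable_factory.reset (rocksdb::NewHashSkipListRepFactory ());
    // Candidates still to be benchmarked:
    //   rocksdb::NewHashLinkListRepFactory ();
    //   rocksdb::NewHashCuckooRepFactory (options.write_buffer_size);

    return options;
}
```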

View File

@@ -0,0 +1,356 @@
//------------------------------------------------------------------------------
/*
This file is part of rippled: https://github.com/ripple/rippled
Copyright (c) 2012, 2013 Ripple Labs Inc.
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
//==============================================================================
#if RIPPLE_ROCKSDB_AVAILABLE
#include <ripple/core/Config.h>
#include <beast/threads/Thread.h>
#include <atomic>
namespace ripple {
namespace NodeStore {
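// Wraps the default Env so each RocksDB background thread gets a
// recognizable name, which makes profiler output easier to read.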
class RocksDBQuickEnv : public rocksdb::EnvWrapper
{
public:
RocksDBQuickEnv ()
: EnvWrapper (rocksdb::Env::Default())
{
}
struct ThreadParams
{
ThreadParams (void (*f_)(void*), void* a_)
: f (f_)
, a (a_)
{
}
void (*f)(void*);
void* a;
};
static
void
thread_entry (void* ptr)
{
ThreadParams* const p (reinterpret_cast <ThreadParams*> (ptr));
void (*f)(void*) = p->f;
void* a (p->a);
delete p;
static std::atomic <std::size_t> n;
std::size_t const id (++n);
std::stringstream ss;
ss << "rocksdb #" << id;
beast::Thread::setCurrentThreadName (ss.str());
(*f)(a);
}
void
StartThread (void (*f)(void*), void* a)
{
ThreadParams* const p (new ThreadParams (f, a));
EnvWrapper::StartThread (&RocksDBQuickEnv::thread_entry, p);
}
};
//------------------------------------------------------------------------------
class RocksDBQuickBackend
: public Backend
, public beast::LeakChecked <RocksDBQuickBackend>
{
public:
beast::Journal m_journal;
size_t const m_keyBytes;
std::string m_name;
std::unique_ptr <rocksdb::DB> m_db;
RocksDBQuickBackend (int keyBytes, Parameters const& keyValues,
Scheduler& scheduler, beast::Journal journal, RocksDBQuickEnv* env)
: m_journal (journal)
, m_keyBytes (keyBytes)
, m_name (keyValues ["path"].toStdString ())
{
if (m_name.empty())
throw std::runtime_error ("Missing path in RocksDBFactory backend");
// Defaults
std::uint64_t budget = 512 * 1024 * 1024; // 512MB
std::string style("level");
std::uint64_t threads = 4;
if (!keyValues["budget"].isEmpty())
budget = keyValues["budget"].getIntValue();
if (!keyValues["style"].isEmpty())
style = keyValues["style"].toStdString();
if (!keyValues["threads"].isEmpty())
threads = keyValues["threads"].getIntValue();
// Set options
rocksdb::Options options;
options.create_if_missing = true;
options.env = env;
if (style == "level")
options.OptimizeLevelStyleCompaction(budget);
if (style == "universal")
options.OptimizeUniversalStyleCompaction(budget);
if (style == "point")
options.OptimizeForPointLookup(budget / 1024 / 1024); // In MB
options.IncreaseParallelism(threads);
// A prefix extractor is required for hash indexing; the no-op
// transform treats the whole key as the prefix.
options.prefix_extractor.reset(rocksdb::NewNoopTransform());
// Override OptimizeLevelStyleCompaction
options.min_write_buffer_number_to_merge = 1;
rocksdb::BlockBasedTableOptions table_options;
// Use a hash index within each block
table_options.index_type =
rocksdb::BlockBasedTableOptions::kHashSearch;
table_options.filter_policy.reset(
rocksdb::NewBloomFilterPolicy(10));
// Higher values make reads slower
// table_options.block_size = 4096;
// No point when DatabaseImp has a cache
// table_options.block_cache =
// rocksdb::NewLRUCache(64 * 1024 * 1024);
// Construct the table factory only after table_options is complete,
// so the commented settings above would take effect if enabled.
options.table_factory.reset(
NewBlockBasedTableFactory(table_options));
options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
// Alternative:
// options.memtable_factory.reset(
// rocksdb::NewHashCuckooRepFactory(options.write_buffer_size));
rocksdb::DB* db = nullptr;
rocksdb::Status status = rocksdb::DB::Open (options, m_name, &db);
if (!status.ok () || !db)
throw std::runtime_error (std::string("Unable to open/create RocksDB: ") + status.ToString());
m_db.reset (db);
}
~RocksDBQuickBackend ()
{
}
std::string
getName()
{
return m_name;
}
//--------------------------------------------------------------------------
Status
fetch (void const* key, NodeObject::Ptr* pObject)
{
pObject->reset ();
Status status (ok);
rocksdb::ReadOptions const options;
rocksdb::Slice const slice (static_cast <char const*> (key), m_keyBytes);
std::string string;
rocksdb::Status getStatus = m_db->Get (options, slice, &string);
if (getStatus.ok ())
{
DecodedBlob decoded (key, string.data (), string.size ());
if (decoded.wasOk ())
{
*pObject = decoded.createObject ();
}
else
{
// Decoding failed, probably corrupted!
//
status = dataCorrupt;
}
}
else
{
if (getStatus.IsCorruption ())
{
status = dataCorrupt;
}
else if (getStatus.IsNotFound ())
{
status = notFound;
}
else
{
status = Status (customCode + getStatus.code());
m_journal.error << getStatus.ToString ();
}
}
return status;
}
void
store (NodeObject::ref object)
{
storeBatch(Batch{object});
}
void
storeBatch (Batch const& batch)
{
rocksdb::WriteBatch wb;
EncodedBlob encoded;
for (auto const& e : batch)
{
encoded.prepare (e);
wb.Put(
rocksdb::Slice(reinterpret_cast<char const*>(encoded.getKey()),
m_keyBytes),
rocksdb::Slice(reinterpret_cast<char const*>(encoded.getData()),
encoded.getSize()));
}
rocksdb::WriteOptions options;
// Crucial to ensure good write speed and non-blocking writes to memtable
options.disableWAL = true;
auto ret = m_db->Write (options, &wb);
if (!ret.ok ())
throw std::runtime_error ("storeBatch failed: " + ret.ToString());
}
void
for_each (std::function <void(NodeObject::Ptr)> f)
{
rocksdb::ReadOptions const options;
std::unique_ptr <rocksdb::Iterator> it (m_db->NewIterator (options));
for (it->SeekToFirst (); it->Valid (); it->Next ())
{
if (it->key ().size () == m_keyBytes)
{
DecodedBlob decoded (it->key ().data (),
it->value ().data (),
it->value ().size ());
if (decoded.wasOk ())
{
f (decoded.createObject ());
}
else
{
// Uh oh, corrupted data!
if (m_journal.fatal) m_journal.fatal <<
"Corrupt NodeObject #" << uint256 (it->key ().data ());
}
}
else
{
// VFALCO NOTE What does it mean to find an
// incorrectly sized key? Corruption?
if (m_journal.fatal) m_journal.fatal <<
"Bad key size = " << it->key ().size ();
}
}
}
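// This backend has no BatchWriter, so there is never a queued write
// load to report.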
int
getWriteLoad ()
{
return 0;
}
//--------------------------------------------------------------------------
void
writeBatch (Batch const& batch)
{
storeBatch (batch);
}
};
//------------------------------------------------------------------------------
class RocksDBQuickFactory : public Factory
{
public:
std::shared_ptr <rocksdb::Cache> m_lruCache;
RocksDBQuickEnv m_env;
RocksDBQuickFactory ()
{
}
~RocksDBQuickFactory ()
{
}
std::string
getName () const
{
return "RocksDBQuick";
}
std::unique_ptr <Backend>
createInstance (
size_t keyBytes,
Parameters const& keyValues,
Scheduler& scheduler,
beast::Journal journal)
{
return std::make_unique <RocksDBQuickBackend> (
keyBytes, keyValues, scheduler, journal, &m_env);
}
};
//------------------------------------------------------------------------------
std::unique_ptr <Factory>
make_RocksDBQuickFactory ()
{
return std::make_unique <RocksDBQuickFactory> ();
}
}
}
#endif

View File

@@ -0,0 +1,40 @@
//------------------------------------------------------------------------------
/*
This file is part of rippled: https://github.com/ripple/rippled
Copyright (c) 2012, 2013 Ripple Labs Inc.
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
//==============================================================================
#ifndef RIPPLE_NODESTORE_ROCKSDBQUICKFACTORY_H_INCLUDED
#define RIPPLE_NODESTORE_ROCKSDBQUICKFACTORY_H_INCLUDED
#if RIPPLE_ROCKSDB_AVAILABLE
#include <ripple/nodestore/Factory.h>
namespace ripple {
namespace NodeStore {
/** Factory to produce experimental RocksDB backends for the NodeStore.
@see Database
*/
std::unique_ptr <Factory> make_RocksDBQuickFactory ();
}
}
#endif
#endif

View File

@@ -63,6 +63,7 @@ public:
#if RIPPLE_ROCKSDB_AVAILABLE
add_factory (make_RocksDBFactory ());
add_factory (make_RocksDBQuickFactory ());
#endif
}

View File

@@ -45,6 +45,8 @@
#include <ripple/nodestore/backend/NullFactory.cpp>
#include <ripple/nodestore/backend/RocksDBFactory.h>
#include <ripple/nodestore/backend/RocksDBFactory.cpp>
#include <ripple/nodestore/backend/RocksDBQuickFactory.h>
#include <ripple/nodestore/backend/RocksDBQuickFactory.cpp>
#include <ripple/nodestore/impl/Backend.cpp>
#include <ripple/nodestore/impl/BatchWriter.cpp>