RocksDB: out-of-core support and related performance?

I'm starting a new piece of software that should be able to handle a large dataset, i.e. some terabytes of data.
I have seen that RocksDB allows storage of large datasets, but I'm not sure whether that is an out-of-core feature. I mean, if the dataset is larger than the computer's RAM, will it handle it?
Also, in the case where there is no swapping, is there any study of the performance impact of using such an in-memory data store?
Thanks

RocksDB has no difficulty with datasets that exceed RAM size. However, you pretty much have to use Bloom filters to preserve performance, and they take up RAM. So you will see some linear memory growth as your database grows. But it's nowhere near 1-to-1, more like 1/50th or so.
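As a rough illustration, here is a minimal C++ sketch (the database path and the bits-per-key value are placeholders, not recommendations) of opening a disk-backed RocksDB instance with a Bloom filter configured, which is the setup being described:

```cpp
#include <cassert>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Attach a Bloom filter to the block-based table format so point lookups
  // can skip SST files that cannot contain the key. 10 bits per key is the
  // commonly cited trade-off (roughly 1% false positives).
  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example_db", &db);
  assert(s.ok());

  // Reads and writes look the same whether or not the dataset fits in RAM;
  // RocksDB pages SST files through the block cache and OS page cache.
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key", &value);
  assert(s.ok());

  delete db;
  return 0;
}
```

With filters enabled, RocksDB can rule out SST files without reading them from disk, which is what keeps point lookups cheap once the working set exceeds RAM.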

Related

MariaDB performance according to CPU Cache

I want to build a MariaDB server.
Before purchasing the server, I am researching how DB performance varies with the CPU specification.
Is there any benefit to be gained as the CPU cache goes from 8 MB to 12 MB?
Is purchasing a CPU with a large cache a good option?
No.
Don't pay extra for more cores. MariaDB essentially uses part of only one core for each active connection.
Don't pay extra for faster CPU or anything else to do with the CPU. If you do become CPU-bound, then come back here for help with optimizing your query; it is usually as simple as adding a composite index or reformulating a query.
Throwing hardware at something is a one-time partial fix and of low benefit -- a few percent: like 2% or, rarely, 20%.
Reformulation or indexing can (depending on the situation) give you 2x or 20x or even 200x.
More RAM up to a point is beneficial. Make a crude estimate of dataset size. Let us know that value, plus some clue of what type of app.
SSD instead of HDD is rather standard by now; if you are still on HDD, switching may give a noticeable benefit.

High memory consumption in HANA table partitioning

I have a big table with around 4 billion records. The table is partitioned, but I need to perform the partitioning again. While doing the partitioning, the memory consumption of the HANA system reached its 4 TB limit and started impacting other systems.
How can we optimize the partitioning so that it completes without consuming that much memory?
To re-partition tables, both the original table structure and the new table structure need to be kept in memory at the same time.
For the target table structures, data will be inserted into delta stores and later on merged, which again consumes memory.
To increase performance, re-partitioning happens in parallel threads, which, you may guess, again uses additional memory.
The administration guide provides a hint to lower the number of parallel threads:
Parallelism and Memory Consumption
Partitioning operations consume a high amount of memory. To reduce the memory consumption, it is possible to configure the number of threads used. You can change the default value of the parameter split_threads in the partitioning section of the indexserver.ini configuration file.
By default, 16 threads are used. In the case of a parallel partition/merge, the individual operations use a total of the configured number of threads for each host. Each operation takes at least one thread.
So, lowering split_threads is the online option for re-partitioning if your system does not have enough memory for many parallel threads.
Alternatively, you may consider an offline re-partitioning that would involve exporting the table (as CSV!), truncating(!) the table, altering the partitioning on the now empty table and re-importing the data.
Note that I wrote "truncate", as this preserves all privileges and references to the table (views, synonyms, roles, etc.), which would be lost if you dropped and recreated the table.

SQLite vacuuming / fragmentation and performance degradation

Let's say I periodically insert data into a SQLite database, then purge the first 50% of the data, but I don't vacuum.
Do I have something like zeroed-out pages for the first 50% of the file now?
If I add another batch of data, am I filling in those zeroed-out pages?
The manual mentions fragmentation of data:
Frequent inserts, updates, and deletes can cause the database file to become fragmented - where data for a single table or index is scattered around the database file.
VACUUM ensures that each table and index is largely stored contiguously within the database file. In some cases, VACUUM may also reduce the number of partially filled pages in the database, reducing the size of the database file further.
But it doesn't indicate that there's necessarily a performance degradation from this.
It mostly hints at the wasted space that could be reclaimed by vacuuming.
Is there a noticeable performance gain for data in strictly contiguous pages?
Could I expect "terrible" performance from a database with a lot of fragmented data?
SQLite automatically reuses free pages.
Fragmented pages can result in performance degradation only if
the amount of data is so large that it cannot be cached, and
your storage device is relatively slow at seeks (e.g. hard disks or cheap flash devices), and
you access the data often enough that the difference matters.
There is only one way to find out whether this is the case for your application: measure it.
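If you want to see what is going on in your own database, here is a small C++ sketch against the SQLite C API (the file name and the 25% threshold are arbitrary) that inspects the free-page count and only vacuums when a sizeable share of the file is free pages:

```cpp
#include <cstdio>
#include <sqlite3.h>

// Run a single-value PRAGMA and return its integer result (-1 on error).
static long long pragma_int(sqlite3* db, const char* sql) {
  sqlite3_stmt* stmt = nullptr;
  long long result = -1;
  if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK &&
      sqlite3_step(stmt) == SQLITE_ROW) {
    result = sqlite3_column_int64(stmt, 0);
  }
  sqlite3_finalize(stmt);
  return result;
}

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open("example.db", &db) != SQLITE_OK) {
    std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
    return 1;
  }

  long long total = pragma_int(db, "PRAGMA page_count;");
  long long free_pages = pragma_int(db, "PRAGMA freelist_count;");
  std::printf("pages: %lld total, %lld free\n", total, free_pages);

  // Deleted rows leave pages on the freelist and SQLite reuses them for new
  // data. VACUUM only matters if you want the file size back or want the
  // remaining data stored contiguously.
  if (total > 0 && free_pages * 4 > total) {  // arbitrary 25% threshold
    char* err = nullptr;
    if (sqlite3_exec(db, "VACUUM;", nullptr, nullptr, &err) != SQLITE_OK) {
      std::fprintf(stderr, "VACUUM failed: %s\n", err);
      sqlite3_free(err);
    }
  }

  sqlite3_close(db);
  return 0;
}
```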

Dividing workload across different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize the division of the workload between these machines using MPI, i.e. give nodes with more RAM and a better CPU more data to compute, so as to minimize the time nodes spend waiting on each other before the final reduction?
Thanks!
In my program, the data are divided into equal-sized batches. Each node in the network processes some of them. The results of the batches are summed up after all batches have been processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.
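To make that first approach concrete, here is a minimal C++/MPI sketch (the batch count and the per-batch work are placeholders) of a master that hands the next batch to whichever worker reports back first, so faster nodes naturally end up processing more of the data:

```cpp
#include <cstdio>
#include <mpi.h>

// Stand-in for the real per-batch work; replace with the actual computation.
static double process_batch(int batch_id) {
  return static_cast<double>(batch_id);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int kNumBatches = 100;  // assumed: many more batches than workers
  const int kTagWork = 1, kTagResult = 2, kTagStop = 3;

  if (rank == 0) {
    double total = 0.0;
    int next_batch = 0;
    int outstanding = 0;

    // Seed every worker with one batch (or a stop message if none are left).
    for (int w = 1; w < size; ++w) {
      if (next_batch < kNumBatches) {
        MPI_Send(&next_batch, 1, MPI_INT, w, kTagWork, MPI_COMM_WORLD);
        ++next_batch;
        ++outstanding;
      } else {
        MPI_Send(&next_batch, 1, MPI_INT, w, kTagStop, MPI_COMM_WORLD);
      }
    }

    // Whoever finishes first gets the next batch.
    while (outstanding > 0) {
      double partial = 0.0;
      MPI_Status status;
      MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, kTagResult,
               MPI_COMM_WORLD, &status);
      total += partial;
      --outstanding;
      if (next_batch < kNumBatches) {
        MPI_Send(&next_batch, 1, MPI_INT, status.MPI_SOURCE, kTagWork,
                 MPI_COMM_WORLD);
        ++next_batch;
        ++outstanding;
      } else {
        MPI_Send(&next_batch, 1, MPI_INT, status.MPI_SOURCE, kTagStop,
                 MPI_COMM_WORLD);
      }
    }
    std::printf("total = %f\n", total);
  } else {
    // Worker: keep asking for work until the master says stop.
    while (true) {
      int batch_id = 0;
      MPI_Status status;
      MPI_Recv(&batch_id, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      if (status.MPI_TAG == kTagStop) break;
      double partial = process_batch(batch_id);
      MPI_Send(&partial, 1, MPI_DOUBLE, 0, kTagResult, MPI_COMM_WORLD);
    }
  }

  MPI_Finalize();
  return 0;
}
```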

The cost of adding custom performance counters

When considering using performance counters on my company's .NET-based site, I was wondering how big the overhead of using them is.
Do I want to have my site continuously update its counters, or am I better off updating them only when I measure?
The overhead of setting up the performance counters is generally not high enough to worry about (setting up a shared memory region and some .NET objects, along with CLR overhead because the CLR actually does the management for you). Here I'm referring to classes like PerformanceCounter.
The overhead of registering the performance counters can be fairly slow, but it is generally not a concern because it is intended to happen once, at setup time, since you are changing machine-wide state. It will be dwarfed by any copying that you do. It's not generally something you want to do at runtime. Here I'm referring to PerformanceCounterInstaller.
The overhead of updating a performance counter generally comes down to the cost of performing an interlocked operation on the shared memory. This is slower than normal memory access, but it is a processor primitive (that's how it gets atomic operations across the entire memory subsystem, including caches). Generally this cost is not high enough to worry about. It could be 10 times a normal memory operation, potentially worse depending on the update and what contention is like across threads and CPUs (see the small benchmark sketch at the end of this answer). But consider this: it is essentially impossible to do any better than interlocked operations for cross-process communication with atomic updates, and no locks are held. Here I refer to PerformanceCounter.Increment and similar methods.
The overhead of reading a performance counter is generally a read from shared memory. As others have said, you want to sample at a reasonable period (just like any other sampling); think of PerfMon and try to keep the sampling on a human scale (seconds instead of milliseconds) and you probably won't have any problems.
Finally, an appeal to experience: Performance counters are so lightweight that they are used everywhere in Windows, from the kernel to drivers to user applications. Microsoft relies on them internally.
Advice: the real questions with performance counters are the learning curve in understanding them (which is moderate) and measuring the right things (which seems easy, but people often get it wrong).
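This is not the .NET API itself, but a small C++ micro-benchmark sketch of the update cost mentioned above: a plain increment versus an interlocked (atomic) one, which is essentially what a counter update performs on its shared-memory slot. The iteration count is arbitrary, and the actual ratio depends heavily on hardware and cross-CPU contention.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
  constexpr long kIters = 100'000'000;

  // volatile keeps the plain loop from being optimized away entirely.
  volatile long plain = 0;
  std::atomic<long> counter{0};

  auto t0 = std::chrono::steady_clock::now();
  for (long i = 0; i < kIters; ++i) plain = plain + 1;
  auto t1 = std::chrono::steady_clock::now();
  // Relaxed atomic add is the closest C++ analogue of an interlocked increment.
  for (long i = 0; i < kIters; ++i) counter.fetch_add(1, std::memory_order_relaxed);
  auto t2 = std::chrono::steady_clock::now();

  auto ms = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
  };
  std::printf("plain increments:  %lld ms\n", static_cast<long long>(ms(t0, t1)));
  std::printf("atomic increments: %lld ms\n", static_cast<long long>(ms(t1, t2)));
  return 0;
}
```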
The performance impact of updating is negligible. Microsoft's intent is that you always write to the performance counters. It's the monitoring (or capturing) of those performance counters that will cause a degradation of performance, i.e. only when you use something like PerfMon to capture the data.
In effect, the performance counter objects will have the effect of only "doing it when you measure."
I've tested these a LOT.
On an old Compaq 1 GHz single-processor machine, I was able to create about 10,000 counters and monitor them remotely at a cost of about 20% CPU usage. These weren't custom counters, just checking CPU or whatever.
Basically, you can monitor all the counters on any decent newer machine with very little impact.
The instantiation of the objects can take a long time though, a few seconds to a few minutes. I suggest you multithread this for all the counters you collect, otherwise your app will sit there forever creating these objects. I'm not sure what MS does on creation that takes so long, but you can do it for 1000 counters with 1000 threads in the same time you can do it for 1 counter and 1 thread.
A performance counter is just a pointer to 4/8 bytes in shared memory (i.e. a memory-mapped file), so its cost is very similar to that of accessing an int/long variable.
I agree with famoushamsandwich, but would add that as long as your sampling rate is reasonable (5 seconds or more) and you monitor a reasonable set of counters, then the impact of measuring is negligible as well (in most cases).
What I have found is that it is not that slow for the majority of applications. I wouldn't put one in a tight loop, or in something that is called thousands of times a second.
Secondly, I found that programmatically creating the performance counters is very slow, so make sure that you create them beforehand and not in code.
