RocksDB tuning parameters for write-only usage to reduce memory

I'm using RocksDB in a process only for persisting data. I don't need to read. Unfortunately the memory usage is skyrocketing and I want to stop this using the options.
I'm currently using these options:
rocksdb::Options options;
options.compression = rocksdb::CompressionType::kZSTD;
options.create_if_missing = true;
options.create_missing_column_families = true;
options.write_buffer_size = 512000;
My understanding is that the last parameter should stop memory from growing, because I've reduced the size of the buffers before they are flushed. However, memory is still growing.
Are there any additional parameters I can use to reduce memory for write-only usage?

If there are only writes, you can also disable the block cache:
https://github.com/facebook/rocksdb/blob/fd2079938d784babcea908c6c82aeb8b3afa42af/include/rocksdb/table.h#L232-L235
Memory growth could also be caused by the number of column families or the number of SSTables. If possible, you can also write SST files yourself and then ingest them into RocksDB, which avoids most of the overhead on the write path: http://rocksdb.org/blog/2017/02/17/bulkoad-ingest-sst-file.html
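For illustration, here is a minimal C++ sketch combining these suggestions (small, few memtables plus no block cache). The option names come from the RocksDB C++ API, but the values and the database path are placeholders to tune for your workload:

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compression = rocksdb::CompressionType::kZSTD;
  options.write_buffer_size = 512 * 1024;  // small memtables, flushed early
  options.max_write_buffer_number = 2;     // cap memtables per column family
  // The block cache only serves reads, so drop it entirely.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.no_block_cache = true;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/write-only-db", &db);
  if (!s.ok()) return 1;
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  delete db;
  return s.ok() ? 0 : 1;
}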

Related

Bulk Insert in Symfony and Doctrine: How to select batch size?

I am working on a web app using Symfony 2.7 and Doctrine. A Symfony command is used to perform an update of a large number of entities.
I followed the Doctrine guidelines and do not call $entityManager->flush() for every single entity.
This is the Doctrine example code:
<?php
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
The guidelines say:
You may need to experiment with the batch size to find the size that
works best for you. Larger batch sizes mean more prepared statement
reuse internally but also mean more work during flush.
So I tried different batch sizes. The larger the batch size, the faster the command completes its task.
Thus the question is: what are the downsides of large batch sizes? Why not use $entityManager->flush() only once, after all entities have been updated?
The documentation just says that larger batch sizes "mean more work during flush". But why/when could this be a problem?
The only downside I can see is exceptions during the update: if the script stops before the changes were flushed, they are lost. Is this the only limitation?
What are the downsides of large batch sizes?
Large batch sizes may use a lot of memory if you create, for example, 10,000 entities. If you don't save the entities in batches, they will accumulate in memory, and if the program reaches the memory limit it may crash the whole script.
Why not use $entityManager->flush() only once, after all entities have been updated?
It's possible, but storing 10,000 entities in memory before calling flush() once will use more memory than saving entities 100 at a time. It may also take more time.
The documentation just says that larger batch sizes "mean more work during flush". But why/when could this be a problem?
If you don't have any performance issues with the biggest batch sizes, it's probably because your data is not big enough to fill the memory or disrupt PHP's memory management.
So the size of the batch depends on multiple factors, mostly memory usage vs. time. If the script consumes too much RAM, the batch size has to be lowered. But using really small batches may take more time than bigger batches. So you have to run multiple tests in order to adjust the size so that it uses most of the available memory but not more.
I don't have any proof, but I remember working with thousands of entities. When I used only one flush(), I saw that the progress bar was getting slower; it looked like my program slowed down as I added more and more entities to memory.
If the flush takes too much time, you might exceed the maximum execution time of the server, and lose the connection.
In my experience, 100 entities per batch worked great. Depending on the entity, 200 was too much; for another entity, I could do 1000.
To properly insert in batches, you will need the call:
$em->clear();
after each of your flushes. The reason is that Doctrine does not free the objects it has flushed to the DB. This means that if you don't "clear" them, memory consumption will keep increasing until you hit your PHP memory limit and crash your operation.
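As a sketch, this is the question's loop with the clear() added (same CmsUser entity as above; the batch size of 100 is just the value that worked for me):

<?php
$batchSize = 100;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
        $em->clear(); // detach flushed entities so PHP can reclaim the memory
    }
}
$em->flush(); // flush the final partial batch
$em->clear();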
I would also recommend against increasing the PHP memory limit to higher values. If you do, you risk creating huge lag on your server, which could increase the number of connections to your server and then crash it.
It is also recommended to process batch operations outside of the web server's request cycle. So save the data as a blob and then process it later with a cron job that runs your batch processing at the desired time (outside of the web server's peak usage hours).
As suggested in the Doctrine documentation, the ORM is not the best tool to use for batches.
Unless your entity needs some specific logic (like listeners), avoid ORM and use DBAL directly.
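For example, a rough sketch of the same insert going through DBAL directly ($connection is assumed to be a Doctrine\DBAL\Connection and cms_users the underlying table; both names are illustrative):

<?php
for ($i = 1; $i <= 10000; ++$i) {
    // Plain INSERT, bypassing the unit of work entirely:
    // nothing accumulates in memory, so there is no flush() or clear().
    $connection->insert('cms_users', [
        'status'   => 'user',
        'username' => 'user' . $i,
        'name'     => 'Mr.Smith-' . $i,
    ]);
}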

Are global memory barriers required if only one work item reads and writes to memory

In my kernel, each work item has a reserved memory region in a buffer
that only it writes to and reads from.
Is it necessary to use memory barriers in this case?
EDIT:
I call mem_fence(CLK_GLOBAL_MEM_FENCE) before each write and before each read. Is this enough to guarantee load/store consistency?
Also, is this even necessary if only one work item is loading from and storing to this memory region?
See this other stack overflow question:
In OpenCL, what does mem_fence() do, as opposed to barrier()?
Memory barriers work at the work-group level, that is, they stop the threads belonging to the same work-group until all of them reach the barrier. If there is no intersection between the memory spaces of different work items, no extra synchronization point is needed.
Also, is this even necessary if only one work item is loading from and storing to this memory region?
Theoretically, mem_fence only guarantees that earlier memory accesses are committed before later ones. In my case, I have never seen a difference in the results of applications with or without this mem_fence call.
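To illustrate, a minimal kernel sketch where each work item touches only its own disjoint slice of the buffer, so no fence or barrier is needed (the kernel name and the region_size parameter are made up for illustration):

__kernel void per_item_scratch(__global float *buf, const uint region_size) {
    // Each work item owns a disjoint slice of buf; nothing else reads or
    // writes that slice, so ordinary loads/stores are already consistent.
    const size_t gid = get_global_id(0);
    __global float *my_region = buf + gid * region_size;
    for (uint i = 0; i < region_size; ++i) {
        my_region[i] = (float)gid;   // write
    }
    float sum = 0.0f;
    for (uint i = 0; i < region_size; ++i) {
        sum += my_region[i];         // read back within the same work item
    }
    my_region[0] = sum;
}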

How to set memory limit to a process in Golang

I use the prlimit syscall to set resource limits on a process. It works for limiting CPU time, but when testing memory usage, I ran into a problem.
package sandbox

import (
	"syscall"
	"unsafe"
)

func prLimit(pid int, limit uintptr, rlimit *syscall.Rlimit) error {
	_, _, errno := syscall.RawSyscall6(syscall.SYS_PRLIMIT64, uintptr(pid), limit, uintptr(unsafe.Pointer(rlimit)), 0, 0, 0)
	if errno != 0 {
		return errno
	}
	return nil
}
And this is my test:
package sandbox

import (
	"os"
	"syscall"
	"testing"
)

func TestMemoryLimit(t *testing.T) {
	proc, err := os.StartProcess("test/memo", []string{"memo"}, &os.ProcAttr{})
	if err != nil {
		panic(err)
	}
	defer proc.Kill()
	var rlimit syscall.Rlimit
	rlimit.Cur = 10
	rlimit.Max = 10 + 1024
	prLimit(proc.Pid, syscall.RLIMIT_DATA, &rlimit)
	status, err := proc.Wait()
	if status.Success() {
		t.Fatal("memory test failed")
	}
}
This is memo:
package main

func main() {
	var a [10000][]int
	for i := 0; i < 1000; i++ {
		a[i] = make([]int, 1024)
	}
}
I allocate a large amount of memory and set the limit to only 10 bytes, but it doesn't raise a segmentation fault signal anyway.
RLIMIT_DATA describes the maximum size of a process's data segment. Traditionally, programs that allocate memory enlarge the data segment with calls to brk() to allocate memory from the operating system.
Go doesn't use this approach. Instead, it uses a variant of the mmap() system call to request regions of memory anywhere in the address space. This is much more flexible than a brk() based approach as you can deallocate arbitrary memory regions with munmap(), whereas a brk() based approach can only deallocate memory from the end of the data segment.
The result of this is that RLIMIT_DATA is ineffective in controlling the amount of memory a process uses. Try using RLIMIT_AS instead, but beware that this limit also incorporates the address space you use for file mappings, especially in the case of shared libraries.
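For example, the test above could swap the resource constant, replacing the rlimit lines in TestMemoryLimit (the 64 MB figure is only an illustrative guess; RLIMIT_AS also covers file mappings and the runtime itself, so very small values will kill the process before it can even start):

var rlimit syscall.Rlimit
rlimit.Cur = 64 * 1024 * 1024 // 64 MB of total address space, illustrative value
rlimit.Max = 64 * 1024 * 1024
prLimit(proc.Pid, syscall.RLIMIT_AS, &rlimit)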
There is a proposal, Soft memory limit, which may be released after Go 1.18:
This option comes in two flavors: a new runtime/debug function called SetMemoryLimit and a GOMEMLIMIT environment variable. In sum, the runtime will try to maintain this memory limit by limiting the size of the heap, and by returning memory to the underlying platform more aggressively. This includes a mechanism to help mitigate garbage collection death spirals. Finally, by setting GOGC=off, the Go runtime will always grow the heap to the full memory limit.
This new option gives applications better control over their resource economy. It empowers users to:
Better utilize the memory that they already have,
Confidently decrease their memory limits, knowing Go will respect them,
Avoid unsupported forms of garbage collection tuning.
Update
This feature will be released in Go 1.19
The runtime now includes support for a soft memory limit. This memory limit includes the Go heap and all other memory managed by the runtime, and excludes external memory sources such as mappings of the binary itself, memory managed in other languages, and memory held by the operating system on behalf of the Go program.
This limit may be managed via runtime/debug.SetMemoryLimit or the equivalent GOMEMLIMIT environment variable.
The limit works in conjunction with runtime/debug.SetGCPercent / GOGC, and will be respected even if GOGC=off, allowing Go programs to always make maximal use of their memory limit, improving resource efficiency in some cases.
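For completeness, a minimal sketch of the Go 1.19+ API (the 512 MiB figure is just a placeholder):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Soft memory limit of 512 MiB for everything the runtime manages
	// (equivalent to running the program with GOMEMLIMIT=512MiB).
	debug.SetMemoryLimit(512 << 20)

	// Optionally disable the proportional GC trigger so the heap is
	// allowed to grow all the way to the limit (same as GOGC=off).
	debug.SetGCPercent(-1)

	fmt.Println("memory limit set")
}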
Here are some suggestions about using the memory limit, per A Guide to the Go Garbage Collector:
While the memory limit is a powerful tool, and the Go runtime takes steps to mitigate the worst behaviors from misuse, it's still important to use it thoughtfully. Below is a collection of tidbits of advice about where the memory limit is most useful and applicable, and where it might cause more harm than good.
Do take advantage of the memory limit when the execution environment of your Go program is entirely within your control, and the Go program is the only program with access to some set of resources (i.e. some kind of memory reservation, like a container memory limit).
A good example is the deployment of a web service into containers with a fixed amount of available memory.
In this case, a good rule of thumb is to leave an additional 5-10% of headroom to account for memory sources the Go runtime is unaware of.
Do feel free to adjust the memory limit in real time to adapt to changing conditions.
A good example is a cgo program where C libraries temporarily need to use substantially more memory.
Don't set GOGC to off with a memory limit if the Go program might share some of its limited memory with other programs, and those programs are generally decoupled from the Go program. Instead, keep the memory limit since it may help to curb undesirable transient behavior, but set GOGC to some smaller, reasonable value for the average case.
While it may be tempting to try and "reserve" memory for co-tenant programs, unless the programs are fully synchronized (e.g. the Go program calls some subprocess and blocks while its callee executes), the result will be less reliable as inevitably both programs will need more memory. Letting the Go program use less memory when it doesn't need it will generate a more reliable result overall. This advice also applies to overcommit situations, where the sum of memory limits of containers running on one machine may exceed the actual physical memory available to the machine.
Don't use the memory limit when deploying to an execution environment you don't control, especially when your program's memory use is proportional to its inputs.
A good example is a CLI tool or a desktop application. Baking a memory limit into the program when it's unclear what kind of inputs it might be fed, or how much memory might be available on the system can lead to confusing crashes and poor performance. Plus, an advanced end-user can always set a memory limit if they wish.
Don't set a memory limit to avoid out-of-memory conditions when a program is already close to its environment's memory limits.
This effectively replaces an out-of-memory risk with a risk of severe application slowdown, which is often not a favorable trade, even with the efforts Go makes to mitigate thrashing. In such a case, it would be much more effective to either increase the environment's memory limits (and then potentially set a memory limit) or decrease GOGC (which provides a much cleaner trade-off than thrashing-mitigation does).

OpenCL : Id of the physical core being used

I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So I figured... OK, on my typical target GPU I have 6 GB and I can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread points to a specific global memory area (with coalescing and stuff, but you get the idea...)
My problem is: how do I know which thread is currently being run (in the kernel code) in order to point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used, plus it does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with global size 2880, and there I obviously know where to point with the global ID.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per WI for temporary computations (you are not keeping them, otherwise you wouldn't have enough memory).
Then why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the work-groups. Every time a new work-group is launched, it takes an empty slot and sets the boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead to launch each WG, but it would probably be negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
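In case the link rots, here is a rough sketch of that slot-claiming idea in OpenCL C (the kernel and buffer names are made up for illustration; 0 marks a free slot, 1 a used one):

__kernel void claim_slot(volatile __global int *slot_flags,
                         const int num_slots,
                         __global int *group_slot) {
    // Only the first work item of each work-group claims a slot.
    if (get_local_id(0) == 0) {
        int claimed = -1;
        for (int i = 0; i < num_slots; ++i) {
            // atomic_cmpxchg returns the old value; seeing 0 means we
            // flipped this slot from free (0) to used (1) and now own it.
            if (atomic_cmpxchg(&slot_flags[i], 0, 1) == 0) {
                claimed = i;
                break;
            }
        }
        group_slot[get_group_id(0)] = claimed; // -1 if every slot was busy
    }
    // After this barrier, all work items can read their group's slot index
    // from group_slot and address buffer_base + slot * slot_size.
    barrier(CLK_GLOBAL_MEM_FENCE);
}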

Is there a maximum limit to private memory in OpenCL?

Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number?
I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.
As per the OpenCL spec, the location and size are not defined, i.e. they are left to the vendor to decide. This raises the question of how much should be used. Used correctly, private memory gives the best performance; used incorrectly, it can become the cause of a slowdown.
You can use AMD's CodeXL or NVIDIA's Nsight (if you have AMD or NVIDIA cards) to analyze memory usage by the kernel. With a little hands-on time, you can understand register spilling using these tools.
I don't think that high usage of private memory will lead to junk results; it is more likely an issue in your code.
It's different for different architectures. For example, an HD 7870's private memory (register file) per compute unit is 256 kB, and if your setting is 64 threads per compute unit, then each thread will have 4 kB of private memory, which means about 1000 float values. If you increase threads per compute unit further, private memory per thread will drop to even the 1 kB range. You should add some local memory usage to balance it.
More importantly, you cannot use all of it. The compiler uses a big portion for its own optimizations and other bookkeeping. You can never be sure without a profiler.
There isn't a theoretical limit for private memory (unlike local memory). If there were, clGetDeviceInfo would list it (it doesn't). However, I know there are practical limits. For example, some GPU implementations will try to store private memory in the register file if it fits. If you exceed this, it spills out to main memory and may be orders of magnitude more expensive. Regardless, the result should be correct (just achieved much more slowly). It should not junk your computation.
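For reference, a sketch of how the per-work-item figure mentioned in the question is queried through the standard API (error handling elided; the kernel and device handles are assumed to have been created earlier):

#include <CL/cl.h>
#include <stdio.h>

/* Query how much private memory the compiled kernel uses per work item. */
void print_private_mem_usage(cl_kernel kernel, cl_device_id device) {
    cl_ulong private_mem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("private memory per work item: %llu bytes\n",
           (unsigned long long)private_mem);
}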
