What is memory-cache in the memory plugin of collectd? - graphite

In the memory plugin of collectd, there are four attributes:
Memory Used
Memory Free
Memory Buffer
Memory Cache
What does each of them mean?

I'll explain it more simply:
Memory Used is memory that is in use by running processes.
Memory Free is memory that is not being used for anything at the moment. It is normal for the operating system to put that memory to use (for buffers and cache, for example).
Memory Buffer is memory holding data that is in transit from one storage location to another (like a circular buffer in audio processing). A buffer gives you just that - a "buffer" of data before and after your current position in the data stream.
Memory Cache is data that is cached so that the remaining data can be transferred without any performance penalty. In this context, the cache only "pre-fetches" a small amount of data (depending on the transfer rates, cache sizes, etc.).
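On Linux these figures generally come from /proc/meminfo, which is what the memory plugin reads. A minimal sketch (illustration only, not collectd code) that prints the corresponding fields:
/* Sketch: print the /proc/meminfo fields behind the free/buffer/cache values. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "MemFree:", 8) == 0 ||
            strncmp(line, "Buffers:", 8) == 0 ||
            strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}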

Related

openCL: initialize local memory buffer directly from host memory

I have a lot of situations where I create a buffer with an input data from host's memory (with either CL_MEM_COPY_HOST_PTR or CL_MEM_USE_HOST_PTR) and pass it as an argument to my kernel only to copy its contents to group's local memory right away at the beginning of the kernel.
I was wondering then, if it is maybe possible to directly initialize a local memory buffer with values from host's memory without an unnecessary write to device's global memory (which is what CL_MEM_COPY_HOST_PTR does, as far as I understand) nor its cache (which is what CL_MEM_USE_HOST_PTR does, AFAIU).
Each work-group would need to have its local buffer initialized with a different offset of the host's input data of course.
Alternatively, is there a way to tell CL_MEM_USE_HOST_PTR to definitely not cache the values, as each of them will be read only once? Whatever host-access or read-write flags I combine it with, and whether I annotate the kernel's param as __global, __constant or __global const, the performance is always a few % worse than CL_MEM_COPY_HOST_PTR, which seems to suggest that the kernel tries to cache input values heavily, I guess. (My guess is that CL_MEM_COPY_HOST_PTR writes a whole contiguous memory region, which is faster than the ad-hoc writes CL_MEM_USE_HOST_PTR does when it caches values being read.)
According to the OpenCL specification, the host has no access to local memory. Only the kernel can read and write LDS.
As for CL_MEM_COPY_HOST_PTR vs CL_MEM_USE_HOST_PTR, the former gives you full control, while with CL_MEM_USE_HOST_PTR the implementation can either copy the whole buffer to the device before kernel execution or make PCI-e transactions for every buffer read operation from inside the kernel.
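So the copy into local memory has to be done by the kernel itself. A minimal sketch of that pattern (the kernel name and tile size are made up for illustration), using async_work_group_copy so that each work-group copies a different offset of the global input:
/* Sketch only: the host cannot initialise __local memory, so the kernel
 * copies its own slice of the __global input into local storage. */
#define TILE 256   /* hypothetical per-work-group chunk size */

__kernel void process(__global const float *input, __global float *output)
{
    __local float tile[TILE];

    /* Each work-group copies a different offset of the input. */
    size_t group_offset = get_group_id(0) * TILE;
    event_t e = async_work_group_copy(tile, input + group_offset, TILE, 0);
    wait_group_events(1, &e);

    /* ... work on tile[] here ... */
    output[get_global_id(0)] = tile[get_local_id(0)];
}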

Why does InnoDB use a buffer pool, not mmap the entire file?

InnoDB uses a buffer pool of configurable size to store recently used pages (B+tree blocks).
Why not mmap the entire file instead? Yes, this does not work for changed pages, because you want to store them in the doublewrite buffer before writing them back to their destination. But mmap lets the kernel manage the LRU for pages and avoids userspace copying. Also, the in-kernel copy code does not use vector instructions (to avoid having to save their registers in the process context).
But when a page is not changed, why not use mmap to read pages and let the kernel manage caching them in the filesystem RAM cache? Then you only need a "custom" userspace cache for changed pages.
The LMDB author mentioned that he chose the mmap approach to avoid copying data from the filesystem cache to userspace and to avoid reinventing LRU.
What critical disadvantages of mmap am I missing that led to the buffer pool approach?
Disadvantages of MMAP:
Not all operating systems support it (ahem Windows)
Coarse locking. It's difficult to allow many clients to make concurrent access to the file.
Relying on the OS to buffer I/O writes leads to increased risk of data loss if the RDBMS engine crashes. Need to use a journaling filesystem, which may not be supported on all operating systems.
Can only map a file size up to the size of the virtual memory address space, so on 32-bit OS, the database files are limited to 4GB (per comment from Roger Lipscombe above).
Early versions of MongoDB tried to use MMAP in the primary storage engine (the only storage engine in the earliest MongoDB). Since then, they have introduced other storage engines, notably WiredTiger. This has greater support for tuning, better performance on multicore systems, support for encryption and compression, multi-document transactions, and so on.
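For reference, the read-only mmap path the question proposes looks roughly like this (a sketch in C; the file name is hypothetical and error handling is omitted); the kernel's page cache then handles caching and eviction:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("ibdata1", O_RDONLY);   /* hypothetical data file */
    struct stat st;
    fstat(fd, &st);

    /* Pages are faulted in, cached and evicted by the kernel. */
    const unsigned char *pages = mmap(NULL, st.st_size, PROT_READ,
                                      MAP_SHARED, fd, 0);
    /* ... read unchanged b+tree blocks directly from pages[] ... */

    munmap((void *)pages, st.st_size);
    close(fd);
    return 0;
}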

OpenCL Buffer Creation

I am fairly new to OpenCL and, though I have understood everything up until now, I am having trouble understanding how buffer objects work.
I haven't understood where a buffer object is stored. In this StackOverflow question it is stated that:
If you have one device only, probably (99.99%) is going to be in the device. (In rare cases it may be in the host if the device does not have enough memory for the time being)
To me, this means that buffer objects are stored in device memory. However, as is stated in this StackOverflow question, if the flag CL_MEM_ALLOC_HOST_PTR is used in clCreateBuffer, the memory used will most likely be pinned memory. My understanding is that, when memory is pinned it will not be swapped out. This means that pinned memory MUST be located in RAM, not in device memory.
So what is actually happening?
What I would like to know is what the flags:
CL_MEM_USE_HOST_PTR
CL_MEM_COPY_HOST_PTR
CL_MEM_ALLOC_HOST_PTR
imply about the location of the buffer.
Thank you
Let's first have a look at the signature of clCreateBuffer:
cl_mem clCreateBuffer(
cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret)
There is no argument here that would tell the OpenCL runtime which device's memory the buffer should be placed in, as a context can have multiple devices. The runtime only knows once we actually use the buffer object, e.g. read/write from/to it, because those operations need a command queue that is connected to a specific device.
Every memory object can reside in either host memory or the memory of one of the context's devices, and the runtime may migrate it as needed. So in general, every memory object might have a piece of internal host memory within the OpenCL runtime. What the runtime actually does is implementation dependent, so we cannot make too many assumptions and get no portable guarantees. That means everything about pinning etc. is implementation-dependent, and you can only hope for the best, but you can avoid patterns that will definitely prevent the use of pinned memory.
Why do we want pinned memory?
Pinned memory means that the virtual address of our memory page in our process' address space has a fixed translation into a physical memory address of the RAM. This enables DMA (Direct Memory Access) transfers (which operate on physical addresses) between the device memory of a GPU and the CPU memory over PCIe. DMA lowers the CPU load and possibly increases copy speed. So we want the internal host storage of our OpenCL memory objects to be pinned, to increase the performance of data transfers between the internal host storage and the device memory of an OpenCL memory object.
As a basic rule of thumb: if your runtime allocates the host memory, it might be pinned. If you allocate it in your application code, the runtime will pessimistically assume it is not pinned - which usually is a correct assumption.
CL_MEM_USE_HOST_PTR
Allows us to provide memory to the OpenCL implementation for internal host-storage of the object. It does not mean that the memory object will not be migrated into device memory if we call a kernel. As that memory is user-provided, the runtime cannot assume it to be pinned. This might lead to an additional copy between the un-pinned internal host storage and a pinned buffer prior to device transfer, to enable DMA for host-device-transfers.
CL_MEM_ALLOC_HOST_PTR
We tell the runtime to allocate host memory for the object. It could be pinned.
CL_MEM_COPY_HOST_PTR
We provide host memory to copy-initialise our buffer from, not to use it internally. We can also combine it with CL_MEM_ALLOC_HOST_PTR. The runtime will allocate memory for internal host storage. It could be pinned.
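A small sketch of what these variants look like in code (ctx, n and host_data are assumed to exist already; error handling is omitted):
cl_int err;

/* Use the application's own (suitably aligned) allocation as backing store. */
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          n * sizeof(float), host_data, &err);

/* Let the runtime allocate host memory for the object (possibly pinned). */
cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          n * sizeof(float), NULL, &err);

/* Let the runtime allocate, and copy-initialise the buffer from host_data. */
cl_mem c = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n * sizeof(float), host_data, &err);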
Hope that helps.
The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.
First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.
Regarding allocation modes:
CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.
CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel (a sketch of this pattern follows after this list). Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).
Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.
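The CL_MEM_ALLOC_HOST_PTR map/unmap pattern mentioned above, as a sketch (ctx, queue and nbytes are assumed to exist; error handling omitted):
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            nbytes, NULL, &err);

/* Map the buffer into the host address space and fill it from the host. */
void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                               0, nbytes, 0, NULL, NULL, &err);
/* ... write the input data into ptr ... */

/* Unmap before using the buffer in a kernel. */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);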
Note that the implementation may also use any access flags provided (CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision of where to allocate memory.
Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.
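As an illustration of "appropriately aligned memory" for CL_MEM_USE_HOST_PTR (a sketch; page alignment is a common safe choice, and CL_DEVICE_MEM_BASE_ADDR_ALIGN reports the device's actual requirement in bits; ctx and nbytes assumed to exist):
/* Allocate page-aligned host memory and hand it to the runtime. */
void *host_data = NULL;
posix_memalign(&host_data, 4096, nbytes);
/* ... fill host_data with input ... */

cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                            nbytes, host_data, &err);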

Why QSharedMemory create size and returned size() are different

I have a doubt about QSharedMemory.
If I create a shared memory segment with a size less than 4096,
the size() function returns 4096.
If the created size is greater than 4096, it returns more than the created size (rounded up to the next multiple of 4096).
Eg:
// Run 1:
QSharedMemory mem("MyApp");
mem.create(1);
qDebug("Size=%d", mem.size());   // 4096

// Run 2:
QSharedMemory mem("MyApp");
mem.create(4095);
qDebug("Size=%d", mem.size());   // 4096

// Run 3:
QSharedMemory mem("MyApp");
mem.create(4097);
qDebug("Size=%d", mem.size());   // 8192
How to get the correct size?
I am using Windows 7 32-bit OS
There is nothing wrong with QSharedMemory. It shows you the real physical memory usage, which is not what we are used to with virtual memory.
In practice, the granularity of physical memory is a page, which is several kilobytes - usually 4096 bytes. When you allocate even one byte, it consumes a whole physical page.
When a process deals with memory, it is dealing with virtual memory, which provides powerful tools. For instance, the virtual memory manager can back several one-byte allocations with the same physical page. But virtual memory is only relevant within the scope of a single process.
Here the memory is shared by several processes, so it is a different memory model. The Qt devs simply made the design decision to make this reality visible to the user of the framework.
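In other words, size() reports the requested size rounded up to the page granularity. A quick illustration of the arithmetic, assuming the 4096-byte pages from the question:
// Illustration only: rounding a requested size up to the page granularity.
int pageSize = 4096;                 // assumed page size, as in the question
int requested = 4097;
int rounded = ((requested + pageSize - 1) / pageSize) * pageSize;
qDebug("Rounded=%d", rounded);       // 8192, matching mem.size()
So if the exact requested size matters to the application, it has to remember that number itself.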

Memory transfer between host and device in OpenCL?

Consider the following code, which creates a buffer memory object from an array of doubles of length size:
coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, (sizeof(double) * size), arr, &err);
Suppose it is passed as an argument to a kernel. There are 2 possibilities depending on the device on which the kernel is running:
The device is same as host device
The device is other than host device
Here are my questions for both the possibilities:
At what step is the memory transferred to the device from the host?
How do I measure the time required for transferring the memory from host to device?
How do I measure the time required for transferring the memory from device's global memory to private memory?
Is the memory still transferred if the device is same as host device?
Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?
At what step is the memory transferred to the device from the host?
The only guarantee you have is that the data will be on the device by the time the kernel begins execution. The OpenCL specification deliberately doesn't mandate when these data transfers should happen, in order to allow different OpenCL implementations to make decisions that are suitable for their own hardware. If you only have a single device in the context, the transfer could be performed as soon as you create the buffer. In my experience, these transfers usually happen when the kernel is enqueued (or soon after), because that is when the implementation knows that it really needs the buffer on a particular device. But it really is completely up to the implementation.
How do I measure the time required for transferring the memory from host to device?
Use a profiler, which usually shows when these transfers happen and how long they take. If you transfer the data with clEnqueueWriteBuffer instead, you could use the OpenCL event profiling system.
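A sketch of the event-profiling route (queue is assumed to exist and to have been created with CL_QUEUE_PROFILING_ENABLE; error handling omitted):
cl_event ev;
clEnqueueWriteBuffer(queue, coef_mem, CL_TRUE, 0, sizeof(double) * size,
                     arr, 0, NULL, &ev);

cl_ulong start, end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
clReleaseEvent(ev);

/* The timestamps are in nanoseconds. */
double transfer_ms = (double)(end - start) * 1e-6;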
How do I measure the time required for transferring the memory from device's global memory to private memory?
Again, use a profiler. Most profilers will have a metric for the achieved bandwidth when reading from global memory, or something similar. It's not really an explicit transfer from global to private memory though.
Is the memory still transferred if the device is same as host device?
With CL_MEM_COPY_HOST_PTR, yes. If you don't want a transfer to happen, use CL_MEM_USE_HOST_PTR instead. With unified memory architectures (e.g. integrated GPU), the typical recommendation is to use CL_MEM_ALLOC_HOST_PTR to allocate a device buffer in host-accessible memory (usually pinned), and access it with clEnqueueMapBuffer.
Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?
Probably, but this will depend on the architecture, whether you have a unified memory system, and how you actually access the data in kernel (memory access patterns and caches will have a big effect).
