Why does InnoDB use a buffer pool instead of mmap-ing the entire file? - innodb

InnoDB uses a buffer pool of configurable size to cache recently used pages (B+tree blocks).
Why not mmap the entire file instead? Granted, this does not work for changed pages, because you want to store them in the doublewrite buffer before writing them back to their final location. But mmap lets the kernel manage the LRU for pages and avoids copying into userspace. Also, in-kernel copy code does not use vector instructions (to avoid having to save their registers in the process context).
But when a page is not changed, why not use mmap to read pages and let the kernel cache them in the filesystem RAM cache? Then you would need a "custom" userspace cache for changed pages only.
The LMDB author mentioned that he chose the mmap approach to avoid copying data from the filesystem cache to userspace and to avoid reinventing LRU.
What critical disadvantages of mmap am I missing that lead to the buffer pool approach?

Disadvantages of MMAP:
Not all operating systems support it natively (ahem, Windows, which offers file mapping only through a different API, CreateFileMapping/MapViewOfFile).
Coarse locking. It is difficult to give many clients concurrent access to the file.
Relying on the OS to buffer I/O writes increases the risk of data loss if the RDBMS engine crashes. You need a journaling filesystem, which may not be supported on all operating systems.
You can only map a file up to the size of the virtual memory address space, so on a 32-bit OS the database files are limited to 4 GB (per comment from Roger Lipscombe above).
Early versions of MongoDB tried to use MMAP in the primary storage engine (the only storage engine in the earliest MongoDB). Since then, they have introduced other storage engines, notably WiredTiger. WiredTiger has greater support for tuning, better performance on multicore systems, support for encryption and compression, multi-document transactions, and so on.
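To make the two read paths from the question concrete, here is a minimal sketch in Python (POSIX only, because of os.pread). It is not how InnoDB or LMDB is implemented. PAGE_SIZE is set to 16 KiB, InnoDB's default page size, purely for illustration, and the helper names are hypothetical. read_page_pread copies the page into a userspace buffer, as a buffer pool would; read_page_view returns a slice of the mapping, which still refers to the kernel's page cache.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 16 * 1024  # illustrative page size (assumption for this sketch)


def read_page_pread(fd, page_no):
    """Buffer-pool style: copy one page from the file into a userspace bytes object."""
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)


def read_page_view(whole_file_view, page_no):
    """mmap style: return a zero-copy slice of a memoryview over the mapping."""
    start = page_no * PAGE_SIZE
    return whole_file_view[start:start + PAGE_SIZE]


def demo():
    # Build a 4-page file where page n is filled with the byte value n.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for n in range(4):
            f.write(bytes([n]) * PAGE_SIZE)
        path = f.name
    fd = os.open(path, os.O_RDONLY)
    try:
        copied = read_page_pread(fd, 2)
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m, memoryview(m) as whole:
            page = read_page_view(whole, 2)
            same = bytes(page) == copied   # both paths see the same data
            first_byte = page[0]           # read straight from the mapping
            page.release()                 # release before the mapping is closed
        return same, first_byte
    finally:
        os.close(fd)
        os.unlink(path)
```

With the mapping, the kernel decides which pages stay resident. With pread, the application owns the copy and can apply its own eviction and write-back rules.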


Are Google Cloud Disks OK to use with SQLite?

Google Cloud disks are network disks that behave like local disks. SQLite expects a local disk so that locking and transactions work correctly.
A. Is it safe to use Google Cloud disks for SQLite?
B. Do they support the right locking mechanisms? How is this done over the network?
C. How do disk IOPS and throughput relate to SQLite performance? If I have a 1GB SQLite file with queries that take 40ms to complete locally, how many IOPS would this use? Which disk type should I choose (standard, balanced, SSD)?
Thanks.
Related
https://cloud.google.com/compute/docs/disks#pdspecs
Persistent disks are durable network storage devices that your instances can access like physical disks
https://www.sqlite.org/draft/useovernet.html
the SQLite library is not tested in across-a-network scenarios, nor is that reasonably possible. Hence, use of a remote database is done at the user's risk.
Yeah, the article you referenced essentially says that because reads and writes are "simplified" at the OS level, they can behave unpredictably, resulting in "lost in translation" issues when going from local to network to remote.
They also point out that it may well work fine in testing, and perhaps in production for a time, but there are known side effects that are hard to detect and mitigate, so it is a slight gamble.
Again, the setup they describe is not Google Cloud Disk specifically, but a generic remote networked arrangement.
My point is more that Google Cloud Disk may be more "virtual" than purely network-attached storage. To my mind, that is where to look, and evaluate it from there.
Check out this thread for some additional insight into the issues: https://serverfault.com/questions/823532/sqlite-on-google-cloud-persistent-disk
Additionally, I was looking around and found this thread, where one poster suggests using SQLite as a read-only asset and deploying updates in a far more controlled process.
https://news.ycombinator.com/item?id=26441125
The persistent disk acts like a normal disk in your VM and is only accessible to one VM at a time.
So it's safe to use; you won't lose any data.
As for performance, you just have to test it for your specific workload. If you have plenty of spare RAM and your database is read-heavy with few writes, the whole database will be cached by the OS (Linux) disk cache, so it will be crazy fast, even on HDD storage.
But if you are low on spare RAM, the database won't be in the OS cache. And writes are always synced to disk, which causes lots of I/O operations.
In that case, use the highest-performing disk you can afford or are willing to pay for.
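Since the advice is to test with your own workload, a rough starting point might look like the sketch below. It uses Python's standard sqlite3 module to time single-row commits on whatever disk holds the database file. The table name, payload, and commit count are arbitrary assumptions; a real test should replay your own queries and data sizes.

```python
import os
import sqlite3
import tempfile
import time


def measure_commit_rate(db_path, n_commits=50):
    """Return (commits per second, rows written) for one-row transactions."""
    conn = sqlite3.connect(db_path)
    try:
        # FULL makes SQLite sync to disk at commit, so the disk is what gets measured.
        conn.execute("PRAGMA synchronous = FULL")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        conn.commit()
        start = time.perf_counter()
        for i in range(n_commits):
            conn.execute("INSERT INTO t (payload) VALUES (?)", (f"row {i}",))
            conn.commit()  # one transaction per row
        elapsed = time.perf_counter() - start
        rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()
    return n_commits / elapsed, rows


def demo():
    # Point the directory at the disk you want to evaluate; a temp dir is used here.
    with tempfile.TemporaryDirectory() as d:
        return measure_commit_rate(os.path.join(d, "bench.db"))
```

Running the same function against a directory on each disk type gives a like-for-like comparison of commit throughput, which is the part of the workload the OS cache cannot hide.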

When should we use CL_MEM_USE_HOST_PTR

I am trying to understand when to use CL_MEM_USE_HOST_PTR on a CPU-GPU SoC by Intel.
Reading this guide I came across:
If your application uses a specific memory management algorithm, or if you want to wrap existing native application memory allocations, you can pass a pointer to clCreateBuffer along with the CL_MEM_USE_HOST_PTR flag.
Can someone explain with an example what is meant by "specific memory management algorithm" and "wrap existing native application memory allocations"?
The CL_MEM_USE_HOST_PTR flag means that memory for the OpenCL memory object will not be allocated on the device side; instead, the object will use memory already allocated on the host side. The memory content may still be cached, though (this is opaque to the user).
Imagine that you have a complicated library with its own sophisticated memory allocation mechanisms (e.g. with reference counting). It is not easy (and usually impossible) to allocate OpenCL memory objects "by hand", because they must have the same lifetime as the objects allocated by the library (and possibly the same alignment), and so on.
In that case, a much easier way is to use the CL_MEM_USE_HOST_PTR flag when creating OpenCL memory objects. All object handling is then done under the hood. This can save you a lot of pain, especially when you're working on big projects implemented in plain C, where handling memory objects is always tricky.

Memory Object Assignation to Context Mechanism In OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices associated with the context?
I'd be thankful if you could help me understand this issue :-)
Typically, the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Integrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.

Why are SQLite transactions bound to harddisk rotation?

There's the following statement in the SQLite FAQ:
A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.
As far as I know, there's a cache on the hard disk, and there might also be an extra cache in the disk driver, both of which abstract the operation perceived by the software from the actual operation against the disk platter.
Then why and how exactly are transactions so strictly bound to disk platter rotation?
From Atomic Commit In SQLite
2.0 Hardware Assumptions
SQLite assumes that the operating system will buffer writes and that a write request will return before data has actually been stored in the mass storage device. SQLite further assumes that write operations will be reordered by the operating system. For this reason, SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.
Because SQLite ensures data integrity by making sure the data is actually written to the disk rather than held in memory. Thus, if the power goes off, the database is not corrupted.
This video http://www.youtube.com/watch?v=f428dSRkTs4 talks about reasons why (e.g. because SQLite is actually used in a lot of embedded devices where the power might well suddenly go off.)
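The "about 60 transactions per second" figure from the FAQ follows from simple arithmetic, assuming, as the FAQ states, that each commit has to wait for two full platter rotations. A small sketch (the function name is just for illustration):

```python
def max_transactions_per_second(rpm, rotations_per_commit=2):
    """Upper bound on commits/second if every commit waits for whole platter rotations."""
    rotations_per_second = rpm / 60
    return rotations_per_second / rotations_per_commit


print(max_transactions_per_second(7200))  # 7200 / 60 = 120 rotations/s, / 2 = 60.0
```

The caches mentioned in the question do not raise this bound as long as the flush really waits for the data to reach the platter, which is exactly what SQLite asks for.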

UNIX Domain sockets vs Shared Memory (Mapped File)

Can anyone tell me how slow UNIX domain sockets are compared to shared memory (or, alternatively, a memory-mapped file)?
Thanks.
It's more a question of design than speed (shared memory is faster). Domain sockets are definitely more UNIX-style and cause far fewer problems. In terms of the choice, know beforehand:
Domain Sockets advantages
blocking and non-blocking mode and switching between them
you don't have to free them when tasks are completed
Domain sockets disadvantages
must read and write in a linear fashion
Shared Memory advantages
non-linear storage
will never block
multiple programs can access it
Shared Memory disadvantages
need locking implementation
need manual freeing, even if unused by any program
That's all I can think of now. However, I'd go with domain sockets any day, not to mention that it's a lot easier to reimplement them later to do distributed computing. The speed gain of shared memory will be lost because of the need for a safe design. However, if you know exactly what you're doing and use the proper kernel calls, you can achieve greater speed with shared memory.
In terms of speed shared memory is definitely the winner. With sockets you will have at least two copies of the data - from sending process to the kernel buffer, then from the kernel to the receiving process. With shared memory the latency will only be bound by the cache consistency algorithm between the cores on the box.
As Kornel notes though, dealing with shared memory is more involved since you have to come up with your own synchronization/signalling scheme, which might add a delay depending on which route you go. Definitely use semaphores in shared memory (implemented with futex on Linux) to avoid system calls in non-contended case.
Both are inter process communication (IPC) mechanisms.
UNIX domain sockets are used for communication between processes on one host, much as TCP sockets are used between different hosts.
Shared memory (SHM) is a piece of memory where you can put data and share this between processes.
SHM provides random access through pointers; sockets can be written to or read from, but you cannot rewind or seek.
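The difference between linear and random access can be sketched in a single Python process (Python 3.8+ for multiprocessing.shared_memory, POSIX for an AF_UNIX socketpair). In real use the two ends would live in different processes; the function names here are hypothetical.

```python
import socket
from multiprocessing import shared_memory


def socket_stream_demo():
    """Bytes come out of a socket in the order they went in; consumed bytes are gone."""
    a, b = socket.socketpair()  # a connected pair of UNIX domain sockets on POSIX
    try:
        a.sendall(b"hello world")
        first = b.recv(5)   # consumes the first five bytes
        rest = b.recv(6)    # the stream continues from there; no way to rewind
        return first, rest
    finally:
        a.close()
        b.close()


def shared_memory_demo():
    """Any offset of a shared memory block can be read, in any order, repeatedly."""
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        shm.buf[0:11] = b"hello world"
        tail = bytes(shm.buf[6:11])    # jump straight to offset 6
        head = bytes(shm.buf[0:5])     # then go back to offset 0
        again = bytes(shm.buf[6:11])   # re-reading does not consume anything
        return head, tail, again
    finally:
        shm.close()
        shm.unlink()  # shared memory must be freed explicitly
```

Note that the shared memory version needs an explicit unlink and, with a second process involved, some locking around the buffer, which matches the disadvantages listed in the other answers.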
@Kornel Kisielewicz's answer is good IMO. Just adding my own results here for sockets in general, not only Unix domain sockets.
Shared Memory
Performance is very high. No copies, with raw access to the data. Fastest access for sure.
Synchronization needed. The design is not so easy to set up for complex cases.
Fixed size. Growing shared memory is doable, but the memory has to be unmapped first, grown, and then remapped.
The signaling mechanism can be quite slow (see Boost.Interprocess notify() performance), especially if you want to do lots of exchanges between processes. It is also not so easy to set up.
Sockets
Easy to setup.
Can be used on different machines.
No complex synchronisation needed.
Size is not a problem if you use TCP. A simple design is a header containing the packet size, followed by the data.
Ping/pong exchange is fast because the OS can treat it as a hardware interrupt.
Performance is average: a few copies of the data are made.
High CPU consumption compared to shared memory. Socket calls are not that cheap if you use them a lot.
In my tests, exchanging small chunks of data (around 1 MByte/second) showed no real advantage for shared memory. I would even say that ping/pong exchanges were faster using TCP (due to the simple and efficient signaling mechanism). BUT when exchanging large amounts of data (around 200 MBytes/second), I had 20% CPU consumption with sockets, compared to 3% CPU using shared memory. So a huge win for shared memory in terms of CPU, because read and write socket calls are not cheap.
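The "header containing the packet size, then the data" design mentioned in the sockets list can be sketched as follows. This is one common way to do it, not the code used in the tests above: the 4-byte big-endian length header and the function names are assumptions for illustration. It works the same over TCP or a UNIX domain stream socket.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order


def send_message(sock, payload):
    """Send one message: the length header followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; recv() may return fewer than requested."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read the length header, then exactly that many payload bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

Because the receiver always knows how many bytes to expect, message size stops being a design problem, which is the point made in the list.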
In this case, sockets are faster. Writing to shared memory is faster than any other IPC, but writing to a memory-mapped file and writing to shared memory are two completely different things.
When writing to a memory-mapped file, what was written to the shared memory has to be "flushed" to the actual backing file (not exactly by you; the flush is done for you). So you copy your data first to the shared memory, and then it is copied again (flushed) to the actual file, and that is very expensive, more than anything else, even more than writing to a socket. You gain nothing by doing that.
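For reference, the write-then-flush sequence on a memory-mapped file looks roughly like this in Python. It is a minimal sketch with hypothetical helper names. On Unix, mmap.flush() is backed by msync, which asks the kernel to write the dirty pages of the mapping back to the file; without an explicit flush the kernel does that on its own schedule.

```python
import mmap
import os
import tempfile


def write_via_mmap(path, offset, data):
    """Store bytes into a shared file mapping, then flush them to the backing file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            m[offset:offset + len(data)] = data  # a plain memory store
            m.flush()                            # explicit write-back


def demo():
    # The file must already have its final size; a mapping cannot grow it.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * 4096)
        path = f.name
    try:
        write_via_mmap(path, 100, b"abc")
        with open(path, "rb") as f:
            content = f.read()
        return content[100:103], len(content)
    finally:
        os.unlink(path)
```

If the data only has to reach another process and not the disk, the flush can be skipped, and a mapping of a file on a memory-backed filesystem behaves like ordinary shared memory.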
