sqlite3 multiple inserts really slow - sqlite

I have built a file archiver on Windows that uses sqlite3 to store files and takes advantage of multicore techniques to complete the archive faster.
I am now testing a backup of 100,000 files and insertion is slow.
When I comment out the line that does the insert, the app uses 100% CPU, which is normal. With the insert enabled, it rarely gets above 25%.
As the archiving progresses, insertion gets slower and slower, processing only a few files per second at around 11% CPU usage. No disk usage is shown, so the bottleneck can't be the disk.
I've set:
PRAGMA temp_store = MEMORY
PRAGMA journal_mode = MEMORY
PRAGMA synchronous = OFF
and the entire insertion is within a transaction.
After further analysis, it seems that SQLite's problem is binding the blob via sqlite3_bind_blob64 (if I pass 0 instead, it seems to be fine).
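The insert path, stripped down, looks roughly like this (placeholder names; the actual schema is omitted):
// Simplified sketch of the insert path; table/column names are placeholders.
// The statement is prepared once, then each file binds its name and raw contents.
#include <sqlite3.h>
#include <cstdint>
#include <vector>

void insert_file(sqlite3 *db, sqlite3_stmt *stmt,
                 const char *name, const std::vector<uint8_t> &content)
{
    sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
    // SQLITE_STATIC avoids an extra copy, as long as the buffer stays alive
    // until sqlite3_step() has run.
    sqlite3_bind_blob64(stmt, 2, content.data(), content.size(), SQLITE_STATIC);
    sqlite3_step(stmt);
    sqlite3_reset(stmt);
    sqlite3_clear_bindings(stmt);
}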
Why would SQLite have a problem inserting a raw blob of data into the archive?
Any ideas?
Thanks.

Your answer may lie here:
https://www.sqlite.org/threadsafe.html
Because it says there that:
The default mode is serialized.
which might explain your observations.
According to that document, you can either change this at compile time (which I most definitely would not do myself) or at runtime via:
sqlite3_config(SQLITE_CONFIG_MULTITHREAD);
Just how stratospherically it then performs I wouldn't know.
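In code that would look roughly like this (a minimal sketch; the file name is a placeholder and error handling is omitted):
#include <sqlite3.h>

int main()
{
    // Must run before sqlite3_initialize() / the first sqlite3_open(),
    // otherwise the call returns SQLITE_MISUSE and the default serialized mode stays.
    sqlite3_config(SQLITE_CONFIG_MULTITHREAD);
    sqlite3_initialize();

    sqlite3 *db = nullptr;
    sqlite3_open("archive.db", &db);   // in this mode, never share one connection across threads
    // ... per-thread work, each thread opening its own connection ...
    sqlite3_close(db);
    sqlite3_shutdown();
}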

Related

What could cause a sqlite application to slow down over time with high load?

I'll definitely need to update this based on feedback so I apologize in advance.
The problem I'm trying to solve is roughly this.
The graph shows disk utilization in the Windows Task Manager. My sqlite application is a webserver that takes in JSON requests with timestamps, looks up the existing entry in a two-column key/value table, merges the request into the existing item (the items don't grow over time), and then writes it back to the database.
The db is created as follows. I've experimented with and without WAL, with no difference.
createStatement().use { it.executeUpdate("CREATE TABLE IF NOT EXISTS items ( key TEXT NOT NULL PRIMARY KEY, value BLOB );") }
The write/set is done as follows
try {
    val insertStatement = "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)"
    prepareStatement(insertStatement).use {
        it.setBytes(1, keySerializer.serialize(key))
        it.setBytes(2, valueSerializer.serialize(value))
        it.executeUpdate()
    }
    commit()
} catch (t: Throwable) {
    rollback()
    throw t
}
I use a single database connection the entire time which seems to be ok for my use case and greatly improves performance relative to getting a new one for each operation.
val databaseUrl = "jdbc:sqlite:${System.getProperty("java.io.tmpdir")}/$name-map-v2.sqlite"
if (connection?.isClosed == true || connection == null) {
    connection = DriverManager.getConnection(databaseUrl)
}
I'm effectively serializing access to the db. I'm pretty sure the default threading mode for the sqlite driver is serialized, and I'm also doing some serializing in Kotlin coroutines (via actors).
I'm load testing the application locally and I notice that disk utilization spikes around the one-minute mark, but I can't determine why. I do know that throughput plummets when that happens. I expect the server to chug along at a more or less constant rate. The db in these tests is pretty small too, hardly reaching 1 MB.
I'm hoping people can recommend some next steps or set me straight as far as performance expectations go. I'm assuming there is some sqlite-specific thing that happens when throughput is very high for too long, but I would have thought it would be related to WAL or something (which I'm not using).
I have a theory, but it's a bit far-fetched.
The fact that you hit a performance wall after some time makes me think that either a buffer somewhere is filling up, or some other kind of data accumulation threshold is being reached.
Where exactly the culprit is, I'm not sure.
So, I'd run the following tests.
// At the beginning
connection.setAutoCommit(true);
If the problem is in the driver side of the rollback transaction buffer, then this will slightly (hopefully) slow down operations, "spreading" the impact away from the one-minute mark. Instead of getting fast operations for 59 seconds and then some seconds of full stop, you get not so fast operations the whole time.
In case the problem is further down the line, try
PRAGMA JOURNAL_MODE=MEMORY
PRAGMA SYNCHRONOUS=OFF
The latter disables rollback journal synchronization (the data will be more at risk in case of a catastrophic power-down).
Finally, another possibility is that the page translation buffer gets filled after a sufficient number of different keys have been entered. You can test this directly with these two tests:
1) pre-fill the database with all the keys in ascending order and a large request, then start updating the same many keys.
2) run the test with only very few keys.
If the slowdown does not occur in either case, then it's either TLB management that's not up to the challenge, or database fragmentation is the problem.
It might be that issuing
PRAGMA PAGE_SIZE=32768
upon database creation solves or mitigates the problem. Conversely, PRAGMA PAGE_SIZE=1024 could "spread" the problem, avoiding performance bottlenecks.
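For example (sketched with the sqlite3 C API; the same PRAGMA can be issued through the JDBC driver), keeping in mind that page_size only takes effect before the first table is created, or after a VACUUM:
#include <sqlite3.h>

void create_db(const char *path)
{
    sqlite3 *db = nullptr;
    sqlite3_open(path, &db);
    // Takes effect here only because no table exists yet in the fresh database.
    sqlite3_exec(db, "PRAGMA page_size=32768;", nullptr, nullptr, nullptr);
    sqlite3_exec(db,
                 "CREATE TABLE IF NOT EXISTS items ("
                 "  key TEXT NOT NULL PRIMARY KEY, value BLOB );",
                 nullptr, nullptr, nullptr);
    sqlite3_close(db);
}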
Another thing to try is closing the database connection and reopening it when it gets older than, say, 30 seconds. If this works, we'll still need to understand why it works (in this case I expect the JDBC driver to be at fault).
First of all, I want to say that I do not use exactly your driver for sqlite, and I use different devices in my work (but how different are they really?).
From what I see, and correct me if I'm wrong, you use one transaction per insert statement. For every request you hit the disk and the memory, open, close, etc. That can't be fast.
The first thing I do when I have to do inserts in sqlite is to group them and use a single transaction. That way, you are using your resources in batches.
One transaction, many insert statements, a single commit. If there is a problem with a batch, handle the valid entries separately, log the faulty ones, and move on to the next batch of requests.
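Roughly, the idea looks like this (sketched with the sqlite3 C API; through JDBC the equivalent is setAutoCommit(false), one prepared statement reused in a loop, and a single commit()):
// Generic sketch of batching many inserts into one transaction.
#include <sqlite3.h>
#include <string>
#include <utility>
#include <vector>

void insert_batch(sqlite3 *db,
                  const std::vector<std::pair<std::string, std::string>> &batch)
{
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);

    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)", -1, &stmt, nullptr);

    for (const auto &kv : batch) {
        sqlite3_bind_text(stmt, 1, kv.first.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_blob(stmt, 2, kv.second.data(), (int)kv.second.size(), SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);           // reuse the prepared statement for the next row
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);   // single commit for the whole batch
}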

How Redis RDB persistance actually works behind the scene?

I was going through Redis RDB persistence. I have some doubts regarding one of its disadvantages.
My understanding so far:
We should use RDB persistence when we need to save a snapshot of the dataset currently in memory at some regular interval.
I can understand that this way we can lose some data if the server breaks down. But the other disadvantage I can't understand is how fork can be time-consuming when persisting a large dataset using RDB.
Quoting from the documentation:
RDB needs to fork() often in order to persist on disk using a child
process. Fork() can be time consuming if the dataset is big, and may
result in Redis to stop serving clients for some millisecond or even
for one second if the dataset is very big and the CPU performance not
great. AOF also needs to fork() but you can tune how often you want to
rewrite your logs without any trade-off on durability.
As far as I know, when a parent process forks it creates a new child process, and we can choose which code the child executes based on its PID, or give it a new executable to run using the exec() system call.
But what I don't understand is why this becomes a heavy task when the dataset is large.
I think I know the answer, but I'm not sure about it.
Quoting from this link: https://www.bottomupcs.com/fork_and_exec.xhtml
When a process calls fork, then
the operating system will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
As per the above statement, the whole Redis dataset will be copied to the child.
Am I understanding this right?
When a standard fork is called with copy-on-write, the OS must still copy all the page table entries, which can take time if you have small 4 KB pages and a huge dataset; this is what makes the actual fork() call slow.
You can also find that a lot of time and memory is required if your dataset is changing a lot in a sparse way, as copy-on-write semantics trigger the actual memory pages to be copied as changes are made to the original. Redis also performs incremental rehashing, maintains expiry, etc., so an instance that is more active will typically take longer to save to disk.
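You can observe the first effect with a rough, POSIX-only sketch (nothing Redis-specific): allocate and touch a large block of memory, then time a fork().
// Demonstrates that fork() gets slower as the process grows, even with
// copy-on-write, because the kernel still duplicates the page tables.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    std::vector<char> data(1ull * 1024 * 1024 * 1024);   // ~1 GB
    std::memset(data.data(), 1, data.size());            // touch every page so it really exists

    auto start = std::chrono::steady_clock::now();
    pid_t pid = fork();
    if (pid == 0) _exit(0);                               // child exits immediately
    waitpid(pid, nullptr, 0);
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("fork+wait with ~1 GB resident: %lld ms\n", (long long)ms);
    return 0;
}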
More reading:
Faster forking of large processes on Linux?
http://kirkwylie.blogspot.co.uk/2008/11/linux-fork-performance-redux-large.html

Bulk Insert in Symfony and Doctrine: How to select batch size?

I am working on a web app using Symfony 2.7 and Doctrine. A Symfony command is used to perform an update of a large number of entities.
I followed the Doctrine guidelines and do not call $entityManager->flush() for every single entity.
This is the Doctrine example code:
<?php
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
    $user = new CmsUser;
    $user->setStatus('user');
    $user->setUsername('user' . $i);
    $user->setName('Mr.Smith-' . $i);
    $em->persist($user);
    if (($i % $batchSize) === 0) {
        $em->flush();
    }
}
$em->flush(); // Persist objects that did not make up an entire batch
The guidelines say:
You may need to experiment with the batch size to find the size that
works best for you. Larger batch sizes mean more prepared statement
reuse internally but also mean more work during flush.
So I did try different batch sizes. The larger the batch size, the faster the command completes its task.
Thus the question is: What are the downsides of large batch sizes? Why not use $entityManager->flush() only once, after all entities have been updated?
The docs just say that larger batch sizes "mean more work during flush". But why/when could this be a problem?
The only downside I can see are exceptions during the update: if the script stops before the changes have been flushed, they are lost. Is this the only limitation?
What are the downsides of large batch sizes?
Large batch sizes may use a lot of memory if you create, for example, 10,000 entities. If you don't save the entities in batches, they accumulate in memory, and if the program reaches the memory limit it may crash the whole script.
Why not use $entityManager->flush() only once, after all entities have been updated?
It's possible, but storing 10,000 entities in memory before calling flush() once will use more memory than saving entities 100 at a time. It may also take more time.
The docu just says, that larger batch sizes "mean more work during flush". But why/when could this be a problem?
If you don't have any performance issues with the biggest batch sizes, it's probably because your data is not big enough to fill the memory or disrupt PHP's memory management.
So the size of the batch depends on multiple factors, mostly memory usage vs. time. If the script consumes too much RAM, the batch size has to be lowered. But really small batches may take more time than bigger ones. So you have to run multiple tests in order to tune the size so that it uses most of the available memory, but not more.
I don't have any proof, but I remember having worked with thousands of entities. When I used only one flush(), I saw that the progress bar was getting slower; it looked like my program was slowing down as I added more and more entities to memory.
If the flush takes too much time, you might also exceed the maximum execution time of the server and lose the connection.
In my experience, 100 entities per batch worked great. Depending on the entity, 200 was too much, while for another entity I could do 1,000.
To properly insert in batches, you will also need to call
$em->clear();
after each of your flushes. The reason is that Doctrine does not free the objects it flushes into the DB. This means that if you don't "clear" them, memory consumption keeps increasing until you hit your PHP memory limit and crash your operation.
I would also recommend against raising the PHP memory limit to higher values. If you do, you risk creating huge lag on your server, which could increase the number of connections to your server and then crash it.
It is also recommended to process batch operations outside of the web server's request cycle (for example, an upload form page): save the data in a blob and then process it later with a cron job at the desired time (outside of the web server's peak usage hours).
As suggested in the Doctrine documentation, the ORM is not the best tool for batch processing.
Unless your entity needs some specific logic (like listeners), avoid the ORM and use DBAL directly.

SQLITE database WAL file size keeps growing

I am writing continuously into a db file which has PRAGMA journal_mode=WAL and PRAGMA journal_size_limit=0. My C++ program has two threads: one reader (queries at 15-second intervals) and one writer (inserts at 5-second intervals).
Every 3 minutes I pause insertion to run sqlite3_wal_checkpoint_v2() from the writer thread with the mode parameter SQLITE_CHECKPOINT_RESTART. To ensure that no read operations are active at this point, I set a flag that checkpointing is about to take place and wait for the reader to complete (the connection stays open) before running the checkpoint. After the checkpoint completes, I signal the reader that it is okay to resume querying.
sqlite3_wal_checkpoint_v2() returns SQLITE_OK, with pnLog and pnCkpt equal (around 4000), indicating the complete WAL file has been synced into the main db file. So, according to the documentation, the next write should start from the beginning of the WAL. However, this does not seem to be happening: subsequent writes cause the WAL file to grow indefinitely, eventually up to several GB.
I did some searching and found that readers can cause checkpoint failures due to open transactions. However, the only reader I'm using ends its transaction before the checkpoint starts. What else could be causing the WAL file to keep growing?
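The checkpoint call itself is essentially this (simplified):
#include <sqlite3.h>
#include <cstdio>

void run_checkpoint(sqlite3 *db)
{
    int pnLog = 0, pnCkpt = 0;
    int rc = sqlite3_wal_checkpoint_v2(db, nullptr /* main database */,
                                       SQLITE_CHECKPOINT_RESTART, &pnLog, &pnCkpt);
    // rc == SQLITE_OK and pnLog == pnCkpt (both around 4000) here, which should
    // mean the whole -wal file was copied into the main db and can be overwritten.
    std::printf("rc=%d, wal frames=%d, checkpointed=%d\n", rc, pnLog, pnCkpt);
}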
This is far too late as an answer, but may be useful to other people.
According to the SQLite documentation, your expectations should be correct, but if you read this SO post, problems can also arise from non-finalized statements. So even if you just sqlite3_reset() your statement, there is still a chance that the db looks busy or locked to a checkpoint. Note that this can also happen with the more aggressive SQLITE_CHECKPOINT_* modes.
Also, the SQLITE_CHECKPOINT_TRUNCATE mode, if the checkpoint completes successfully, will truncate the -wal file to zero length. That may help you verify that all pages have been written into the db.
Another discussion in which -wal files grow larger and larger due to unfinalized statements is this.
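Putting the two suggestions together, a sketch of what you could try (the statement handle name is just an example): reset or finalize the reader's statement first, then checkpoint with TRUNCATE so a successful checkpoint is directly visible as a zero-length -wal file.
#include <sqlite3.h>

bool checkpoint_truncate(sqlite3 *db, sqlite3_stmt *readerStmt)
{
    // A statement that was stepped but never reset/finalized can keep a read
    // snapshot open and silently prevent the WAL from being restarted.
    sqlite3_reset(readerStmt);   // or sqlite3_finalize() if it is no longer needed
    int rc = sqlite3_wal_checkpoint_v2(db, nullptr, SQLITE_CHECKPOINT_TRUNCATE,
                                       nullptr, nullptr);
    return rc == SQLITE_OK;      // on success the -wal file is truncated to zero bytes
}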

How does sqlite cache_spill pragma exactly work?

I would like to know how the cache_spill = false pragma exactly works. I understand that once the cache is full, it is written to disk even before a commit actually happens. I understand this could be problematic because it requires holding an exclusive lock from that moment until the actual commit takes place. I understand one could increase the cache size to ameliorate this potential problem. And I understand that one would like to magically avoid any spill under such circumstances, although I don't believe the cache_spill pragma works in magical ways. So:
Does it make further API calls that would require the cache to grow fail, signaling to the user that a commit is in order?
Or does it stop writing to the memory cache and use the disk instead, losing performance but avoiding the spill?
The cache spilling affected by this pragma happens only when the database runs into a soft memory limit.
If you inhibit these spills, the changed data is just kept in memory.
This might result in an out-of-memory error if you need some memory for more changed data (or for anything else).
In practice, most operating systems will just swap out some data to disk (which is more inefficient because the data must be read back from swap before it is actually committed).
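To make the trade-off concrete, here is a small sketch (the sizes are placeholders) of the usual pairing: turn spilling off and give the page cache enough room to hold the whole transaction.
#include <sqlite3.h>

void configure(sqlite3 *db)
{
    // Keep dirty pages in memory until COMMIT instead of spilling them early.
    sqlite3_exec(db, "PRAGMA cache_spill=false;", nullptr, nullptr, nullptr);
    // A negative cache_size is interpreted as KiB, so this is roughly 64 MiB of page cache.
    sqlite3_exec(db, "PRAGMA cache_size=-65536;", nullptr, nullptr, nullptr);
}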
