Should ANALYZE be run in a transaction? - sqlite

In sqlite (specifically version 3), should ANALYZE be run in a transaction?
If so, and I'm at the end of a long transaction that made lots of changes, is it okay to run ANALYZE in that same transaction or should that transaction be committed first and begin another transaction for the ANALYZE?
The documentation doesn't say anything about this one way or another.

ANALYZE reads the data from indexed columns and writes statistical information into some internal table.
This is somewhat similar to the following query:
INSERT OR REPLACE INTO sqlite_statXXX
SELECT 'MyTable', 'MyColumn', COUNT(*), AVG(MyColumn) FROM MyTable
done once for every indexed column.
Like any other SQL statement that writes a small amount of data to the database, the transaction overhead will be much larger than the actual effort to write the data itself.
In your case, it is not necessary for your changed data to be available without the changed statistics, so you could just as well do the ANALYZE in the same transaction.
If the database is so big that ANALYZE runs for a long time, it might make sense to delay its execution until later when it does not conflict with more important transactions.

Related

SQL Server Data Archiving

I have a SQL Azure database on which I need to perform some data archiving operation.
Plan is to move all the irrelevant data from the actual tables into Archive_* tables.
I have tables which have up to 8-9 million records.
One option is to write a stored procedure and insert data in to the new Archive_* tables and also delete from the actual tables.
But this operation is really time consuming and running for more than 3 hrs.
I am in a situation where I can't have more than an hour's downtime.
How can I make this archiving faster?
You can use Azure Automation to schedule execution of a stored procedure every day at the same time, during maintenance window, where this stored procedure will archive the oldest one week or one month of data only, each time it runs. The store procedure should archive data older than X number of weeks/months/years only. Please read this article to create the runbook. In a few days you will have all the old data archived and the Runbook will continue to do the job from now and on.
You can't make it faster, but you can make it seamless. The first option is to have a separate task that moves data in portions from the source to the archive tables. In order to prevent table lock escalations and overall performance degradation I would suggest you to limit the size of a single transaction. E.g. start transaction, insert N records into the archive table, delete these records from the source table, commit transaction. Continue for a few days until all the necessary data is transferred. The advantage of that way is that if there is some kind of a failure, you may restart the archival process and it will continue from the point of the failure.
The second option that does not exclude the first one really depends on how critical the performance of the source tables for you and how many updates are happening with them. It if is not a problem you can write triggers that actually pour every inserted/updated record into an archive table. Then, when you want a cleanup all you need to do is to delete the obsolete records from the source tables, their copies will already be in the archive tables.
In the both cases you will not need to have any downtime.

Aggregation on FireStore/CloudDatastore. Use Cloud Functions onCreate/Update?

I want to create an expense tracker and one of the things I want to find out is how much did I spend in each month per category.
How should I do this in FireStore/DataStore?
Pull down required data and do aggregation locally? Seems very slow?
Perform aggregation everytime a transaction is created/updated and save it in a table? But this may result in many invocations of the functions, which may be costly?
Is there a better way? Seems like 2 is currently the best option? But I wonder if theres anyway I can reduce costs?
I note that I may not need the aggregated data to be realtime, so is there a way to debounce the cloud function execution? Since I note that at times, I will batch insert a bunch of transactions. Wonder if theres a way to disable functions for certain queries and manually call them after the batch has finished for example?
The two approaches you describe are indeed the most common.
The best approach mostly depends on the number of transactions you have. If you have few transactions, then it may be totally fine to do the aggregation on each client. But as you get more transactions, the overhead of downloading the data will become prohibitive and you're more likely to want to keep a running total in the database.
I'd normally recommend keeping the total up to date with any transaction. You can even do that with client-side code, by using transactions (to prevent multiple users overwriting each other's updates) and server-side security rules (to prevent malicious actors from writing an aggregate that doesn't match its transaction).
If you want to aggregate in batches, you'll want to run code periodically, either in a server you control, or in Cloud Functions.
There is nothing built into Cloud Functions to debounce document writes. You could probably keep a debounce counter in Firestore, but that would then be reading/writing a document on each transaction.
More reasonable seems to run a function on a timer, as described in this blog post and shown in this video. But you'll need to make sure your data structure in that case allows the code to detect what transactions it needs to aggregate.
One way to do this is to ensure the transactions can be ordered in some way, e.g. by giving them a timestamp, and having your aggregation code keep track (likely in the database) of the last timestamp it has aggregated already. Then whenever the aggregator runs, it:
reads the current aggregated value
queries the database for transactions that have been added since it last ran
loops over those transactions, updating the aggregated value
writes the aggregated value and the last timestamp back to the database in a transaction (to ensure either both are written, or neither is written)

Riak and time-sorted records

I'd like to sort some records, stored in riak, by a function of the each record's score and "age" (current time - creation date). What is the best way do do a "time-sensitive" query in riak? Thus far, the options I'm aware of are:
Realtime mapreduce - Do the entire calculation in a mapreduce job, at query-time
ETL job - Periodically do the query in a background job, and store the result back into riak
Punt it to the app layer - Don't sort at all using riak, and instead use an application-level layer to sort and cache the records.
Mapreduce seems the best on paper, however, I've read mixed-reports about the real-world latency of riak mapreduce.
MapReduce is a quite expensive operation and not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode where the number of concurrent mapreduce jobs can be controlled, and I would therefore not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice as described in the second option could work and allow efficient access to the prepared data through direct key access. The aggregation process could, if you are using leveldb, be based around a secondary index holding a timestamp. One downside could however be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possibly, preferably through direct key access, and then perform filtering of data that is not required as well as sorting and aggregation on the application side.

Attaching two memory databases

I am collecting data every second and storing it in a ":memory" database. Inserting data into this database is inside a transaction.
Everytime one request is sending to server and server will read data from the first memory, do some calculation, store it in the second database and send it back to the client. For this, I am creating another ":memory:" database to store the aggregated information of the first db. I cannot use the same db because I need to do some large calculation to get the aggregated result. This cannot be done inside the transaction( because if one collection takes 5 sec I will lose all the 4 seconds data). I cannot create table in the same database because I will not be able to write the aggregate data while it is collecting and inserting the original data(it is inside transaction and it is collecting every one second)
-- Sometimes I want to retrieve data from both the databses. How can I link both these memory databases? Using attach database stmt, I can attach the second db to the first one. But the problem is next time when a request comes how will I check the second db is exist or not?
-- Suppose, I am attaching the second memory db to first one. Will it lock the second database, when we write data to the first db?
-- Is there any other way to store this aggregated data??
As far as I got your idea, I don't think that you need two databases at all. I suppose you are misinterpreting the idea of transactions in sql.
If you are beginning a transaction other processes will be still allowed to read data. If you are reading data, you probably don't need a database lock.
A possible workflow could look as the following.
Insert some data to the database (use a transaction just for the
insertion process)
Perform heavy calculations on the database (but do not use a transaction, otherwise it will prevent other processes of inserting any data to your database). Even if this step includes really heavy computation, you can still insert and read data by using another process as SELECT statements will not lock your database.
Write results to the database (again, by using a transaction)
Just make sure that heavy calculations are not performed within a transaction.
If you want a more detailed description of this solution, look at the documentation about the file locking behaviour of sqlite3: http://www.sqlite.org/lockingv3.html

sqlite3 bulk insert from C?

I came across the .import command to do this (bulk insert), but is there a query version of this which I can execute using sqlite3_exec().
I would just like to copy a small text file contents into a table.
A query version of this one below,
".import demotab.txt mytable"
Sqlite's performance doesn't benefit from bulk insert. Simply performing the inserts separately (but within a single transaction!) provides very good performance.
You might benefit from increasing sqlite's page cache size; that depends on the number of indexes and/or the order in which the data is inserted. If you don't have any indexes, for a pure insert, the cache size is likely not to matter much.
Be sure to use a prepared query, as opposed to regenerating a query plan in the innermost loop. It's extremely important to wrap the statements in a transaction since this avoids the need for the filesystem to sync the database to disk - afterall, partially a written transaction is atomically aborted anyhow, meaning that all fsync()'s are delayed until the transaction completes.
Finally, indexes will limit your insert performance since their creation is somewhat expensive. If you're really dealing with a lot of data and start off with an empty table, it may be beneficial to add the indexes after the data - though this isn't a huge factor.
Oh, and you might want to get one of those intel X25-E SSD's and ensure you have an AHCI controller ;-).
I'm maintaining an app with sqlite db's with about 500000000 rows (spread over several tables) - much of which was bulk inserted using plain old begin-insert-commit: it works fine.

Resources