SQLite vs. CSV for data acquisition - sqlite

I am designing a measurement instrument and the software is written in C++/Qt5.12 on a custom-built Linux embedded system (Buildroot). The data are time series and fall into 2 categories:
actual physical data, 1..3 fields, sampling period 5 min
housekeeping parameters (temperatures, flow rates, etc.), 5..10 fields, sampling period 1..10 sec
I have been using CSV files so far, and they do the job. Although the data are not relational and the data acquisition rate is low, I am looking into SQLite because:
reduced risk of producing corrupted files in case of a crash, thanks to transactions
more flexibility to alter the data format in the long run, e.g. add a column, with less impact on the processing software
SQLite is supported by Buildroot
Questions:
Does SQLite look like a smart choice over CSV in this case?
The instrument will be running 24/7 for years, so I guess I'll have to split the database into chunks (e.g. monthly) to keep the file reasonably small and for archiving. I wonder how easy that would be. Can it be automated with a cron job?
Thanks.

Does SQLite look like a smart choice over CSV in this case?
I'd suggest yes, mainly because you will probably want to do something with the data other than spend the rest of your life looking through it.
Perhaps you want some sort of aggregated stats (a summary: averages, maximum values, minimum values, perhaps to compare periods). SQLite can make that pretty easy and pretty efficient.
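For instance (a minimal sketch, assuming the physical table with a Unix-seconds timestamp column and a fld1 reading column, as created in the example further down), a daily summary is a single query:
/* Daily minimum, maximum and average of one reading */
SELECT date(timestamp,'unixepoch') AS day,
       min(fld1) AS min_fld1,
       max(fld1) AS max_fld1,
       avg(fld1) AS avg_fld1
FROM physical
GROUP BY day
ORDER BY day;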
The instrument will be running 24/7 for years, so I guess I'll have to split the database into chunks (e.g. monthly) to keep the file reasonably small and for archiving. I wonder how easy that would be. Can it be automated with a cron job?
No need for cron: utilise the power of SQLite, where a TRIGGER could be handy.
Here's an example that shows a little of what you could do.
As you have 2 distinct sets of readings, physical and housekeeping, the example has a table for each.
The physical table has 1 column for the timestamp of the reading and 4 columns for the readings.
The housekeeping table has 1 column for the timestamp and 10 reading columns.
The example automatically generates data, just to load something that can show results. A table named just_for_load controls how much data is inserted; it has 1 row with 1 value (although it could have more rows), and this value is extracted to determine how much data is added.
With the value at 1000, 1000 physical readings will be added, one for every 5 minutes (about 3.5 days' worth of data).
With the value at 1000, 300,000 rows will be added to the housekeeping table, i.e. 300 rows for every 5-minute interval.
The example demonstrates automated (TRIGGER-based) tidying up. It doesn't back up the data, but it will clear old data from both tables (just an example showing that you can do things automatically). The TRIGGER is named auto_tidyup.
To show that the TRIGGER is being activated, it additionally records the start and end of the TRIGGER's processing, i.e. what it does when it is activated and its WHEN clause condition is met (the WHEN clause reduces how often it tries to do something). This data is stored in another table, namely tidyup_log.
The TRIGGER has been set so that the WHEN clause fires during testing (after testing, this would be changed to a suitable schedule, e.g. the first day of the month).
So, in summary: 4 tables (1 for testing purposes only) and 1 trigger.
When the data has been loaded, it is used by 3 queries to extract useful data (well, sort of).
The example SQL (note that perhaps the most complicated SQL is for loading the test data) :-
DROP TABLE IF EXISTS physical;
DROP TABLE IF EXISTS housekeeping;
DROP TRIGGER IF EXISTS auto_tidyup;
DROP TABLE IF EXISTS tidyup_log;
DROP TABLE IF EXISTS just_for_load;
CREATE TABLE IF NOT EXISTS physical(timestamp INTEGER PRIMARY KEY, fld1 REAL, fld2 REAL, fld3 REAL, fld4 REAL);
CREATE TABLE IF NOT EXISTS housekeeping(timestamp INTEGER PRIMARY KEY, prm1 REAL, prm2 REAL, prm3 REAL, prm4 REAL, prm5 REAL, prm6 REAL, prm7 REAL, prm8 REAL, prm9 REAL, prm10 REAL);
CREATE TABLE IF NOT EXISTS tidyup_log (timestamp INTEGER, action_performed TEXT);
CREATE TRIGGER IF NOT EXISTS auto_tidyup AFTER INSERT ON physical
WHEN CAST(strftime('%d','now') AS INTEGER) = 23 /* <<<<<<<<<< TESTING SO GET HITS >>>>>>>>>>*/
/*WHEN CAST(strftime('%d','now') AS INTEGER) = 1 */ /* IF TODAY IS THE FIRST DAY OF THE MONTH */
BEGIN
INSERT INTO tidyup_log VALUES (strftime('%s','now'),'TIDY Started');
DELETE FROM physical WHERE timestamp < new.timestamp - (60 * 60 * 24 * 365 /*approx a year */);
DELETE FROM housekeeping WHERE timestamp < new.timestamp - (60 * 60 * 24 * 365);
INSERT INTO tidyup_log VALUES (strftime('%s','now'),'TIDY ENDED');
END
;
/* ONLY FOR LOADING Test Data controls number of rows added */
CREATE TABLE IF NOT EXISTS just_for_load (base_count INTEGER);
INSERT INTO just_for_load VALUES(1000); /* Number of physical rows to add 5 minutes e.g. 1000 is close to 3.5 days*/
WITH RECURSIVE counter(i) AS
(SELECT 1 UNION ALL SELECT i+1 FROM counter WHERE i < (SELECT sum(base_count) FROM just_for_load))
INSERT INTO physical SELECT strftime('%s','now','+'||(i * 5)||' minutes'), random(),random(),random(),random() FROM counter
;
WITH RECURSIVE counter(i) AS
(SELECT 1 UNION ALL SELECT i+1 FROM counter WHERE i < (SELECT (sum(base_count) * 300) FROM just_for_load))
INSERT INTO housekeeping SELECT strftime('%s','now','+'||(i)||' second'), random(),random(),random(),random(), random(),random(),random(),random(), random(),random() FROM counter
;
/* <<<<<<<<<< DATA LOADED SO EXTRACT IT >>>>>>>>> */
/* First query: basically shows the 5 minute intervals (and lots of random values) */
SELECT datetime(timestamp,'unixepoch'), fld1,fld2,fld3,fld4 FROM physical;
/* This query gets the sum and average of the 10 readings over a 5 minute window */
SELECT
'From '||datetime(min(timestamp),'unixepoch')||' To '||datetime(max(timestamp),'unixepoch') AS Range,
sum(prm1) AS sumP1, avg(prm1) AS avgP1,
sum(prm2) AS sumP2, avg(prm2) AS avgP2,
sum(prm3) AS sumP3, avg(prm3) AS avgP3,
sum(prm4) AS sumP4, avg(prm4) AS avgP4,
sum(prm5) AS sumP5, avg(prm5) AS avgP5,
sum(prm6) AS sumP6, avg(prm6) AS avgP6,
sum(prm7) AS sumP7, avg(prm7) AS avgP7,
sum(prm8) AS sumP8, avg(prm8) AS avgP8,
sum(prm9) AS sumP9, avg(prm9) AS avgP9,
sum(prm10) AS sumP10, avg(prm10) AS avgP10
FROM housekeeping GROUP BY timestamp / 300
;
/* This query shows that the TRIGGER is being activated (even though it does no deletions) */
SELECT * FROM tidyup_log;
/* Tidy up the Testing environment */
DROP TABLE IF EXISTS physical;
DROP TABLE IF EXISTS housekeeping;
DROP TRIGGER IF EXISTS auto_tidyup;
DROP TABLE IF EXISTS tidyup_log;
DROP TABLE IF EXISTS just_for_load;
The comments should explain quite a bit.
You may wish to look at:
SQLite CREATE TABLE
SQLite CREATE TRIGGER
SQLite Date and Time Functions
SQLite Aggregate Functions
SQLite SQL Language Expressions
Results
Extract from the physical table (showing the 5-minute intervals of the data, aka data you probably don't want to look at)
Extract of more useful data: averages and sums of each of the 10 readings every 5 minutes
1001 rows, because rows don't start/end on a 5-minute boundary
The tidyup log (to show the TRIGGER is being activated)
Start and end entries for each physical row (noting that the WHEN criterion has been set to trigger on all inserts), hence 2000 rows
Lastly, just to show the 300,000 rows, part of the message log :-
WITH RECURSIVE counter(i) AS
(SELECT 1 UNION ALL SELECT i+1 FROM counter WHERE i < (SELECT (sum(base_count) * 300) FROM just_for_load))
INSERT INTO housekeeping SELECT strftime('%s','now','+'||(i)||' second'), random(),random(),random(),random(), random(),random(),random(),random(), random(),random()FROM counter
> Affected rows: 300000
> Time: 1.207s

Related

Hive - Filter on Map data type column stuck

We have a table with one year of data in daily date partitions. Each day has 100 billion rows. The table has a map data type column which holds 100,000+ key-value pairs. All I need is min(date) based on the two map column filters. YARN gets stuck deciding the number of mappers for this query. I only see the message below when I invoke the query, and it was stuck for more than 30 minutes, so I killed the job. Is there a way to optimize and run the query?
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Query:
select min(data_dt) from my_db.logs
where txttype = 'abcd'
and mapfields['page_name'] in ('a','b','c','d','e','d')
-- Total 50 page names
and mapfields['usedIn'] like '%Group%'
and (ctry like 'aa%' or ctry like 'bb%' or ctry like 'cc%')
;

How to select first row of each table?

I have a SQL database with 30 tables ordered by company names. They are 30 stocks. Each table has columns date, high, low, close and volume.
My question is: how do I select the first record of each table in SQL?
Is it
FROM TOP 1 * SELECT aaple_30min?
I tried
SELECT *
FROM bac_30min
LIMIT 1
but this gives me the result for only one table.
I have decided to write the query in R, where I will loop through the companies in question and use the paste function to build the calls, replacing "bac_30" above with the company names as strings.

Indices have failed me - how can I optimize this SQLite query?

I need some help optimizing the following query:
SELECT
kd2c.id as _id,
kd2c.literal as kanji
FROM
kd2_character as kd2c
JOIN krad_components as kcom ON kcom.kanji_fk = kd2c.id
WHERE kcom.radical_fk IN (1, 2, 3, etc...)
GROUP BY kd2c.id HAVING count(distinct kcom.radical_fk) = <number of integers in WHERE clause>
ORDER BY kd2c.freq IS NULL ASC, kd2c.freq, kd2c.id
The query itself (regardless of the number of fk's in the WHERE clause) takes 0.04 seconds to run, which is a long time relative to all of my other queries that take around 0.0003 seconds. I ran an EXPLAIN QUERY PLAN against the above statement and received the following:
# | selectid | order | from | detail
1 | 0 | 0 | 0 | SCAN TABLE kd2_character AS kd2c USING INTEGER PRIMARY KEY (~1000000 rows)
2 | 0 | 1 | 1 | SEARCH TABLE krad_components AS kcom USING COVERING INDEX idx_krad_components (kanji_fk=? AND radical_fk=?) (~9 rows)
3 | 0 | 0 | 0 | EXECUTE LIST SUBQUERY 1
4 | 0 | 0 | 0 | USE TEMP B-TREE FOR ORDER BY
I'm pretty sure the query takes so long because of that initial SCAN TABLE. If that's the case, how can I get rid of it? I thought creating an index on kd2_character.id would help things along, but it didn't have any noticeable effect on execution time.
How can I improve this query? Is there a better way to structure my GROUP BY, since it's probably the source of the SCAN?
When SQLite joins two tables, it uses a nested loop join, i.e., it goes through all the records of one table, and looks up corresponding record(s) in the other table.
This is faster if many of the first table's records get filtered out by some WHERE condition before they must be joined, and if the second table has an index on the joined column.
For this particular query, SQLite has estimated that using kd2_character as the outer table in the loop is faster (because you have a single index that can be used for lookups in both kanji_fk and radical_fk columns).
This might or might not be actually true.
Try running ANALYZE once to get more accurate estimations.
You can force SQLite to use a particular join order by using CROSS JOIN; check if this makes a difference:
...
FROM krad_components AS kcom
CROSS JOIN kd2_character AS kd2c ON kcom.kanji_fk = kd2c.id
WHERE ...
(This optimization is dangerous if the database's contents eventually change so that the other join order would be faster.)
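Applied to the original query, forcing krad_components to be the outer table would look something like this (a sketch; the three radical IDs and the matching HAVING count are illustrative stand-ins for the real list):
SELECT
kd2c.id AS _id,
kd2c.literal AS kanji
FROM krad_components AS kcom
CROSS JOIN kd2_character AS kd2c ON kcom.kanji_fk = kd2c.id /* CROSS JOIN forces kcom to be the outer loop */
WHERE kcom.radical_fk IN (1, 2, 3) /* example radical IDs */
GROUP BY kd2c.id HAVING count(DISTINCT kcom.radical_fk) = 3 /* number of IDs in the IN list */
ORDER BY kd2c.freq IS NULL ASC, kd2c.freq, kd2c.id;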

sqlite: query to add (subtract) cells from adjacent rows and put result in new column

I am examining a .sqlite file in FireFox's SQLite Manager and need to see if any data was not collected. An example is worth a thousand words:
ReadDate ReadValue
1361900350183.00 137
1361899753183.00 139
1361900053183.00 138
There are no primary keys and the table is NOT sorted by ReadDate or time. [Changing the input table is not an option!]
What I'd like to do is produce with simple SQL a table that looks like this:
ReadDate ReadValue TimeOffset
1361899753183.00 139
1361900053183.00 138 300000 // this is ReadDate(1) - ReadDate(0)
1361900350183.00 137 297000 // this is ReadDate(2) - ReadDate(1)
This would allow me to inspect the data and see if any data values were not captured (TimeOffset would be much greater than 300000). I could also write an additional query to get a COUNT of all TimeOffsets beyond a threshold.
I'm having trouble getting going on what I imagine is a simple exercise. I know how to do joins and sorts (order by), but here I need to compare one row to another. Do I need a cursor? And how to get the extra column? I have a gut feeling that if I just knew the vocabulary a little better, I'd be able to come up with the search terms and find the answer quickly.
Many thanks,
Dave
First, add an (empty) column to your table:
ALTER TABLE MyTable ADD COLUMN TimeOffset NUMERIC;
Then, the TimeOffset for each record is the difference between the ReadDate column of this record and of the record with the next smaller ReadDate, i.e., the record with the largest ReadDate that is still smaller than this one's:
UPDATE MyTable
SET TimeOffset = ReadDate - (SELECT MAX(ReadDate)
                             FROM MyTable AS t2
                             WHERE t2.ReadDate < MyTable.ReadDate);
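For the follow-up count the question mentions (readings whose gap is much greater than 300000 ms), a simple query against the new column could look like this (a sketch; pick whatever threshold suits your definition of "much greater"):
SELECT COUNT(*) AS missed_intervals
FROM MyTable
WHERE TimeOffset > 330000; /* e.g. the nominal 300000 ms plus a 10% margin */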

SQLite - Update with random unique value

I am trying to populate every row in a column with random values ranging from 0 to the row count.
So far I have this
UPDATE table
SET column = ABS (RANDOM() % (SELECT COUNT(id) FROM table))
This does the job but produces duplicate values, which turned out to be bad. I added a UNIQUE constraint, but that just causes it to crash.
Is there a way to update a column with random unique values from certain range?
Thanks!
If you want to later read the records in a random order, you can just do the ordering at that time:
SELECT * FROM MyTable ORDER BY random()
(This will not work if you need the same order in multiple queries.)
Otherwise, you can use a temporary table to store the random mapping between the rowids of your table and the numbers 1..N.
(Those numbers are automatically generated by the rowids of the temporary table.)
CREATE TEMP TABLE MyOrder AS
SELECT rowid AS original_rowid
FROM MyTable
ORDER BY random();
UPDATE MyTable
SET MyColumn = (SELECT rowid
                FROM MyOrder
                WHERE original_rowid = MyTable.rowid) - 1;
DROP TABLE MyOrder;
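A quick sanity check after the update could look like this (a sketch, reusing the MyTable/MyColumn names from above); the two counts should match, and the values should run from 0 to N-1:
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT MyColumn) AS distinct_values,
       MIN(MyColumn) AS lowest,
       MAX(MyColumn) AS highest
FROM MyTable;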
What you seem to be seeking is not simply a set of random numbers, but rather a random permutation of the numbers 1..N. This is harder to do. If you look in Knuth (The Art of Computer Programming), or in Bentley (Programming Pearls or More Programming Pearls), one suggested way is to create an array with the values 1..N and then, for each position, swap the current value with a randomly selected other value from the array. (I'd need to dig out the books to check whether it is swapped with any arbitrary position in the array, or only with a value following it in the array.) In your context, you then apply this permutation to the rows in the table under some ordering, so row 1 under the ordering gets the value in the array at position 1 (using 1-based indexing), and so on.
In the 1st Edition of Programming Pearls, Column 11 Searching, Bentley says:
Knuth's Algorithm P in Section 3.4.2 shuffles the array X[1..N].
for I := 1 to N do
Swap(X[I], X[RandInt(I,N)])
where the RandInt(n,m) function returns a random integer in the range [n..m] (inclusive). That's nothing if not succinct.
The alternative is to have your code thrashing around when there is one value left to update, waiting until the random number generator picks the one value that hasn't been used yet. As a hit and miss process, that can take a while, especially if the number of rows in total is large.
Actually translating that into SQLite is a separate exercise. How big is your table? Is there a convenient unique key on it (other than the one you're randomizing)?
Given that you have a primary key, you can easily generate an array of structures such that each primary key is allocated a number in the range 1..N. You then use Algorithm P to permute the numbers. Then you can update the table from the primary keys with the appropriate randomized number. You might be able to do it all with a second (temporary) table in SQL, especially if SQLite supports UPDATE statements with a join between two tables. But it is probably nearly as simple to use the array to drive singleton updates. You'd probably not want a unique constraint on the random number column while this update is in progress.
