SQLite3 + Rasperry Pi3 - 1st select on startup slow - sqlite

I am using the pi to record video surveillance data in a single, but highly populated table per camera. The table consists of 3 columns - Timestamp, offset and frame length. The video data is stored in a separate file on the filesystem. My code is written in C.
Timestamp is the date/time for a video frame in the stream, offset is the fseek/ftell offsets into the streaming data file and frame length is the length of the frame. Pretty self explanatory. The primary and only index is on the timestamp column.
There is one database writer forked process per camera and there could be multiple forked read-only processes querying the database at any time.
These processes are created by socket listeners in the classic client/server architecture which accept video streams from other processes that manage the surveillance cameras and clients that query it.
When a read-only client connects, it selects the first row in the database for a selected camera. For some reason, the select takes > 60 secs and subsequent selects on the same query are very snappy (much less than 1 sec). I've debugged the code to determine this is the cause.
I have these pragma's configured both for the reader and writer forked processes and have tried greater and lesser values with minimal if any impact:
pragma busy_timeout=7000
pragma cache_size=-4096
pragma mmap_size=4194304
I am assuming the cause is due to populating the SQLite3 caches when a read-only client connects, but I'm not sure what else to try.
I've implemented my own write caching/buffering strategy to help prevent locks, which helped significantly, but it did not solve the delay at startup problem.
I've also split the table by weekday in an attempt to help control the table population size. It seems once the population nears 100,000 rows, the problem starts occurring. The population for a table can be around 2.5 million rows per day.
Here is the query:
sprintf(sql, "select * from %s_%s.VIDEO_FRAME where TIME_STAMP = "
"(select min(TIME_STAMP) from %s_%s.VIDEO_FRAME)",
cam_name, day_of_week, cam_name, day_of_week);
(edit)
$ uname -a
Linux raspberrypi3 4.1.19-v7+ #858 SMP Tue Mar 15 15:56:00 GMT 2016 armv7l GNU/Linux
$ sqlite3
sqlite> .open Video_Camera_02__Belkin_NetCam__WED.db
sqlite> .tables VIDEO_FRAME
sqlite> .schema VIDEO_FRAME CREATE TABLE VIDEO_FRAME(TIME_STAMP UNSIGNED BIG INT NOT NULL,FRAME_OFFSET BIGINT, FRAME_LENGTH INTEGER,PRIMARY KEY(TIME_STAMP));
sqlite> explain query plan
...> select * from VIDEO_FRAME where TIME_STAMP = (select min(TIME_STAMP) from VIDEO_FRAME);
0|0|0|SEARCH TABLE VIDEO_FRAME USING INDEX sqlite_autoindex_VIDEO_FRAME_1 (TIME_STAMP=?)
0|0|0|EXECUTE SCALAR SUBQUERY 1 1|0|0|SEARCH TABLE VIDEO_FRAME USING COVERING INDEX sqlite_autoindex_VIDEO_FRAME_1 –
After some further troubleshooting, the culprit seems to be with the forked db writer process. I tried starting the r/o clients with no streaming data being written and the select returned immediately. I haven't found the root problem, but at least have isolated where it is coming from.
Thanks!

Related

Extracting data files for different dates from database table

I am on windows and on Oracle 11.0.2
I have a table TEMP_TRANSACTION consisting of transactions for 6 months or so. Each record has a transaction date and other data with it.
Now I want to do the following:
1. Extract data from the table for each transaction date
2. Create a flat file with a name of the transaction date;
3. Output the data for this transaction date to the flat file;
4. Move on to the next date and then do the steps 1-3 again.
I create a simple sql script to spool the data out for a transaction date and it works. Now I want to put this in a loop or something like that so that it iterates for each transaction date.
I know this is asking for something from scratch but I need pointers on how to proceed.
I have Powershell, Java at hand and no access to Unix.
Please help!
Edit: Removed powershell as my primary goal is to get it out from Oracle (PL/SQL) and if not then explore Powershell OR Java.
-Abhi
I was finally able to achieve what I was looking for. Below are the steps (may be not the most efficient ones, but it did work :) )
Created a SQL script which spools the data I was looking for (for a single day).
set colsep '|'
column spoolname new_val spoolname;
select 'TRANSACTION_' || substr(&1,0,8) ||'.txt' spoolname from dual;
set echo off
set feedback off
set linesize 5000
set pagesize 0
set sqlprompt ''
set trimspool on
set headsep off
set verify off
spool &spoolname
Select
''||local_timestamp|| ''||'|'||Field1|| ''||'|'||field2
from << transaction_table >>
where local_timestamp = &1;
select 'XX|'|| count(1)
from <<source_table>>
where local_timestamp = &1;
spool off
exit
I created a file named content.txt where I populated the local timestamp values (i.e. the transaction date time-stamps as
20141007000000
20140515000000
20140515000000
Finally I used a loop on powershell which picked up one value from content.txt and then called the sql script (from step 1) and passed the parameter:
PS C:\TEMP\data> $content = Get-Content C:\TEMP\content.txt
PS C:\TEMP\data> foreach ($line in $content){sqlplus user/password '#C:\temp\ExtractData.sql' $line}
And that is it!
I still have to refine few things but at least the idea of splitting the data is working :)
Hope this helps others who are looking for similar thing.

mysqldump data loss after restoration

I have tried to dump a database db1 of about 40gb into sql file using mysqldump from system A with innodb default storage engine and tried to restore it on another system B. Both have the default storage engine as innodb and same mysql version . I have checked for any table corruptions on system A using check table status and was not able to find any table corruptions on it. I have used the below query to calculate the table size and no of rows per table on both databases (db1) over system A and system B and found that there was about 6GB data loss on db1 of system B.
SELECT table_schema,
-> SUM(data_length+index_length)/1024/1024 AS total_mb,
-> SUM(data_length)/1024/1024 AS data_mb,
-> SUM(index_length)/1024/1024 AS index_mb,
-> COUNT(*) AS tables,
-> CURDATE() AS today
-> FROM information_schema.tables
-> GROUP BY table_schema
-> ORDER BY 2 DESC
Can we rely on information schema for calculating the exact no of rows, exact tablesize (datalength + indexlength) when Innodb is default storage engine ? Why a dump using mysql dump has resulted in significant data loss on restoration over system B ?
InnoDB isn't able to give a exact count (using a SELECT COUNT() query) of records found in a table. When you request a record count on a table with the InnoDB engine, you will notice that the count will flucturate.
For more information I would like to refer you to the MySQL developer page for InnoDB
http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html
Restrictions on InnoDB Tables
ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by doing eight random dives to each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE may produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.
MySQL uses index cardinality estimates only in join optimization. If some join is not optimized in the right way, you can try using ANALYZE TABLE. In the few cases that ANALYZE TABLE does not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set the max_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See Section 5.1.4, “Server System Variables”, and Section C.5.6, “Optimizer-Related Issues”.
SHOW TABLE STATUS does not give accurate statistics on InnoDB tables, except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 14.2.12.1, “InnoDB Performance Tuning Tips”.
The best solution to check if you have any data loss, is to compare the contents of your database.
mysqldump --skip-comments --skip-extended-insert -u root -p dbName1 > file1.sql
mysqldump --skip-comments --skip-extended-insert -u root -p dbName2 > file2.sql
diff file1.sql file2.sql
See this topic for more information.
Another advantage of this solution is that you can see where you have the differences.

Understanding the ORA_ROWSCN behavior in Oracle

So this is essentially a follow-up question on Finding duplicate records.
We perform data imports from text files everyday and we ended up importing 10163 records spread across 182 files twice. On running the query mentioned above to find duplicates, the total count of records we got is 10174, which is 11 records more than what are contained in the files. I assumed about the posibility of 2 records that are exactly the same and are valid ones being accounted for as well in the query. So I thought it would be best to use a timestamp field and simply find all the records that ran today (and hence ended up adding duplicate rows). I used ORA_ROWSCN using the following query:
select count(*) from my_table
where TRUNC(SCN_TO_TIMESTAMP(ORA_ROWSCN)) = '01-MAR-2012'
;
However, the count is still more i.e. 10168. Now, I am pretty sure that the total lines in the file is 10163 by running the following command in the folder that contains all the files. wc -l *.txt.
Is it possible to find out which rows are actually inserted twice?
By default, ORA_ROWSCN is stored at the block level, not at the row level. It is only stored at the row level if the table was originally built with ROWDEPENDENCIES enabled. Assuming that you can fit many rows of your table in a single block and that you're not using the APPEND hint to insert the new data above the existing high water mark of the table, you are likely inserting new data into blocks that already have some existing data in them. By default, that is going to change the ORA_ROWSCN of every row in the block causing your query to count more rows than were actually inserted.
Since ORA_ROWSCN is only guaranteed to be an upper-bound on the last time there was DML on a row, it would be much more common to determine how many rows were inserted today by adding a CREATE_DATE column to the table that defaults to SYSDATE or to rely on SQL%ROWCOUNT after your INSERT ran (assuming, of course, that you are using a single INSERT statement to insert all the rows).
Generally, using the ORA_ROWSCN and the SCN_TO_TIMESTAMP function is going to be a problematic way to identify when a row was inserted even if the table is built with ROWDEPENDENCIES. ORA_ROWSCN returns an Oracle SCN which is a System Change Number. This is a unique identifier for a particular change (i.e. a transaction). As such, there is no direct link between a SCN and a time-- my database might be generating SCN's a million times more quickly than yours and my SCN 1 may be years different from your SCN 1. The Oracle background process SMON maintains a table that maps SCN values to approximate timestamps but it only maintains that data for a limited period of time-- otherwise, your database would end up with a multi-billion row table that was just storing SCN to timestamp mappings. If the row was inserted more than, say, a week ago (and the exact limit depends on the database and database version), SCN_TO_TIMESTAMP won't be able to convert the SCN to a timestamp and will return an error.

Slow making many aggregate queries to a very large SQL Server table

I have a custom log/transaction table that tracks my users every action within the web application and it currently has millions of records and grows by the minute. In my application I need to implement some of way of precalculating a user's activities/actions in sql to determine whether other features/actions are available to the user within the application. For one example, before a page loads, I need to check if the user viewed a page X number of times.
(SELECT COUNT(*) FROM MyLog WHERE UserID = xxx and PageID = 123)
I am making several similar aggregate queries with joins for checking other conditions and the performance is poor. These checks are occuring on every page request and the application can receive hundreds of requests per minute.
I'm looking for any ideas to improve the application performance through sql and/or application code.
This is a .NET 2.0 app and using SQL Server 2008.
Much thanks in advance!
Easiest way is to store the counts in a table by themselves. Then, when adding records (hopefully through an SP), you can simply increment the affected row in your aggregate table. If you are really worried about the counts getting out of whack, you can put a trigger on the detail table to update the aggregated table, however I don't like triggers as they have very little visibility.
Also, how up to date do these counts need to be? Can this be something that can be stored into a table once a day?
Querying a log table like this may be more trouble then it is worth.
As an alternative I would suggest using something like memcache to store the value as needed. As long as you update the cache on each hit it will much faster the querying a large database table. Memcache has an build in increment operator that handles this kind of thing.
This way you only need to query the db on the first visit.
Another alternative is to use a precomputed table, updating it as needed.
Have you indexed MyLog on UserID and PageID? If not, that should give you some huge gains.
Todd this is a tough one because of the number of operations you are performing.
Have you checked your indexes on that database?
Here's a stored procedure you can execute to help at least find valid indexes. I can't remember where I found this but it helped me:
CREATE PROCEDURE [dbo].[SQLMissingIndexes]
#DBNAME varchar(100)=NULL
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) AS improvement_measure,
'CREATE INDEX [missing_index_'
+ CONVERT (varchar, mig.index_group_handle)
+ '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL
AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*,
mid.database_id,
mid.[object_id]
FROM
sys.dm_db_missing_index_groups mig
INNER JOIN
sys.dm_db_missing_index_group_stats migs
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
ON mig.index_handle = mid.index_handle
WHERE
migs.avg_total_user_cost
* (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) > 10
AND
(#DBNAME = db_name(mid.database_id) OR #DBNAME IS NULL)
ORDER BY
migs.avg_total_user_cost
* migs.avg_user_impact
* (migs.user_seeks + migs.user_scans) DESC
END
I modified it a bit to accept a db name. If you dont provide a db name it will run and give you information about all databases and give you suggestions on what fields need indexing.
To run it use:
exec DatabaseName.dbo.SQLMissingIndexes 'MyDatabaseName'
I usually put reusable SQL (Sproc) code in a seperate database called DBA then from any database I can say:
exec DBA.dbo.SQLMissingIndexes
As an example.
Edit
Just remembered the source, Bart Duncan.
Here is a direct link http://blogs.msdn.com/b/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
But remember I did modify it to accept a single db name.
We had the same problem, beginning several years ago, moved from SQL Server to OLAP cubes, and when that stopped working recently we moved again, to Hadoop and some other components.
OLTP (Online Transaction Processing) databases, of which SQL Server is one, are not very good at OLAP (Online Analytical Processing). This is what OLAP cubes are for.
OLTP provides good throughput when you're writing and reading many individual rows. It fails, as you just found, when doing many aggregate queries that require scanning many rows. Since SQL Server stores every record as a contiguous block on the disk, scanning many rows means many disk fetches. The cache saves you for a while - so long as your table is small, but when you get to tables with millions of rows the problem becomes evident.
Frankly, OLAP isn't that scalable either, and at some point (tens of millions of new records per day) you're going to have to move to a more distributed solution - either paid (Vertica, Greenplum) or free (HBase, Hypertable).
If neither is an option (e.g. no time or no budget) then for now you can alleviate your pain somewhat by spending more on hardware. You need very fast IO (fast disks, RAID), as as much RAM as you could get.

Sqlite3: Disabling primary key index while inserting?

I have an Sqlite3 database with a table and a primary key consisting of two integers, and I'm trying to insert lots of data into it (ie. around 1GB or so)
The issue I'm having is that creating primary key also implicitly creates an index, which in my case bogs down inserts to a crawl after a few commits (and that would be because the database file is on NFS.. sigh).
So, I'd like to somehow temporary disable that index. My best plan so far involved dropping the primary key's automatic index, however it seems that SQLite doesn't like it and throws an error if I attempt to do it.
My second best plan would involve the application making transparent copies of the database on the network drive, making modifications and then merging it back. Note that as opposed to most SQlite/NFS questions, I don't need access concurrency.
What would be a correct way to do something like that?
UPDATE:
I forgot to specify the flags I'm already using:
PRAGMA synchronous = OFF
PRAGMA journal_mode = OFF
PRAGMA locking_mode = EXCLUSIVE
PRAGMA temp_store = MEMORY
UPDATE 2:
I'm in fact inserting items in batches, however every next batch is slower to commit than previous one (I'm assuming this has to do with the size of index). I tried doing batches of between 10k and 50k tuples, each one being two integers and a float.
You can't remove embedded index since it's the only address of row.
Merge your 2 integer keys in single long key = (key1<<32) + key2; and make this as a INTEGER PRIMARY KEY in youd schema (in that case you will have only 1 index)
Set page size for new DB at least 4096
Remove ANY additional index except primary
Fill in data in the SORTED order so that primary key is growing.
Reuse commands, don't create each time them from string
Set page cache size to as much memory as you have left (remember that cache size is in number of pages, but not number of bytes)
Commit every 50000 items.
If you have additional indexes - create them only AFTER ALL data is in table
If you'll be able to merge key (I think you're using 32bit, while sqlite using 64bit, so it's possible) and fill data in sorted order I bet you will fill in your first Gb with the same performance as second and both will be fast enough.
Are you doing the INSERT of each new as an individual Transaction?
If you use BEGIN TRANSACTION and INSERT rows in batches then I think the index will only get rebuilt at the end of each Transaction.
See faster-bulk-inserts-in-sqlite3.

Resources