Order of data in SQLite database - sqlite

If I were to insert lots of rows into an empty table with no primary key and no indexes, with a varying number of rows inserted per transaction, could I be sure that SELECT * FROM the_table; would retrieve the data in the same order on both Linux and Windows?

No, you cannot and should never rely on the order of rows in a result set from a query that does not have ordering constraints. Even on the same platform, same database. Even if it works in your tests.
Things like VACUUMing your database (or, I think, some of the auto_vacuum modes) could change the relative block layout of your data and alter the result set even if nothing else has changed (no inserts, no query plan change).
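As a minimal sketch of the alternative, the example below uses Python's built-in sqlite3 module to store an explicit insertion counter and sort on it. The column name seq and the sample values are assumptions made for illustration; they are not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "seq" is a hypothetical column: an INTEGER PRIMARY KEY that SQLite
# fills in automatically when the INSERT does not supply a value.
conn.execute("CREATE TABLE the_table (seq INTEGER PRIMARY KEY, value TEXT)")

# Insert in several transactions of varying size, as in the question.
batches = [["a", "b", "c"], ["d"], ["e", "f"]]
for batch in batches:
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO the_table (value) VALUES (?)",
            [(v,) for v in batch],
        )

# No ORDER BY: the order is whatever the engine happens to produce.
unordered = [row[0] for row in conn.execute("SELECT value FROM the_table")]

# With ORDER BY: the order is part of the query's contract.
ordered = [
    row[0]
    for row in conn.execute("SELECT value FROM the_table ORDER BY seq")
]
print(ordered)
```

Only the second query states the order it wants, so only that one can be relied on across platforms, versions, and VACUUM runs.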

Related

SQLite: re-arrange physical position of rows inside file

My problem is that my queries are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
timestamp TEXT,
name TEXT,
result float
)
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time.)
"name" is a category. This table holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal timestamps, but the queries will be done at equal names (i.e. given a name, I want to get its time series), like:
SELECT timestamp, result FROM results WHERE name='some_name';
Now, the way I'm doing things is to have no indexes, calculate all results, then create an index on name: CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make queries on the indexed column really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that even though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,name1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows on disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal SQLite tables have, as a primary key, a 64-bit integer (known as rowid and a few other aliases). That determines the order in which rows are stored in a B*-tree (which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key, which is used to place rows in a B-tree. So if every row's (name, timestamp) columns make a unique value, that's a possibility that will leave all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
timestamp TEXT
, name TEXT
, result REAL
, PRIMARY KEY (name, timestamp)
) WITHOUT ROWID
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
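A rough way to check that such a table is searched through its primary key is to ask SQLite for the query plan. The sketch below uses Python's sqlite3 module with made-up sample rows (the timestamps, names, and result values are assumptions for illustration only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE results (
        timestamp TEXT,
        name TEXT,
        result REAL,
        PRIMARY KEY (name, timestamp)
    ) WITHOUT ROWID
    """
)

# Hypothetical sample data, inserted timestamp by timestamp as in the question.
rows = [
    (f"2020-01-01T00:00:{t:02d}", f"name{n}", float(t * n))
    for t in range(3)
    for n in range(1, 4)
]
with conn:
    conn.executemany(
        "INSERT INTO results (timestamp, name, result) VALUES (?, ?, ?)", rows
    )

query = "SELECT timestamp, result FROM results WHERE name = ? ORDER BY timestamp"

# The time series for one name.
series = conn.execute(query, ("name2",)).fetchall()

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("name2",)).fetchall()
plan_detail = " ".join(row[3] for row in plan)
print(series)
print(plan_detail)
```

If the plan detail mentions the primary key rather than a full table scan, the lookup by name is walking one contiguous range of the table's B-tree.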
Some pragmas worth looking into to tune things:
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.
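For reference, the pragmas listed above can be issued like any other statement. This is only a sketch: the sizes are placeholder assumptions, not recommendations, and whether they help depends on the workload and on how SQLite was built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A negative cache_size is interpreted as KiB rather than pages,
# so this asks for roughly 64 MiB of page cache (placeholder value).
conn.execute("PRAGMA cache_size = -65536")

# Request up to 256 MiB of memory-mapped I/O (placeholder value).
# This has no effect on an in-memory database or on builds without mmap support.
conn.execute("PRAGMA mmap_size = 268435456")

conn.execute("CREATE TABLE results (timestamp TEXT, name TEXT, result REAL)")
conn.execute("CREATE INDEX index_name ON results (name)")

# Let SQLite refresh planner statistics if it thinks that is worthwhile.
conn.execute("PRAGMA optimize")

cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache_size)
```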

Is SQLite "Insert or Replace" slower than just "Insert"?

I am copying millions of rows to a table in another database. I am doing a few things with the data in-between and have duplicates on a certain column that is used as a key in the destination table. Ignoring all the other solutions to fix this, I am testing out using "Insert or Replace" and so far processing is going smooth, but I am not sure whether this is faster than a normal "Insert" (given a case where there are no PK duplicates)?
The OR REPLACE clause works only if there is some UNIQUE (or PRIMARY KEY) constraint that could be violated.
This means that the database always has to check whether there is a duplicate, the only difference is what happens when a duplicate is found: report an error, or delete the old row.
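The difference in behavior can be seen in a small sketch using Python's sqlite3 module. The table name dest and its columns are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical destination table with a key column.
conn.execute("CREATE TABLE dest (k INTEGER PRIMARY KEY, payload TEXT)")

with conn:
    conn.execute("INSERT INTO dest (k, payload) VALUES (1, 'first')")

# A plain INSERT performs the uniqueness check and reports an error.
try:
    with conn:
        conn.execute("INSERT INTO dest (k, payload) VALUES (1, 'second')")
    plain_insert_failed = False
except sqlite3.IntegrityError:
    plain_insert_failed = True

# INSERT OR REPLACE performs the same check, but deletes the old row instead.
with conn:
    conn.execute("INSERT OR REPLACE INTO dest (k, payload) VALUES (1, 'second')")

rows = conn.execute("SELECT k, payload FROM dest").fetchall()
print(plain_insert_failed, rows)
```

In both cases the duplicate check happens; only the reaction to a hit differs.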

truncate not always working: why?

I have defined this mapper method:
@Delete("truncate table MY_TABLE")
public void wipeAllData();
It usually works, but sometimes it doesn't. Is there any particular reason or known bug for that?
I'm using mybatis 3.3.0 with oracle 11g as DBMS.
EDIT
Since you added the oracle11g tag, my previous answer is no longer valid (at least not its explanation of why it would not be working), so I edited it.
There are some reasons that I'm aware of why it sometimes does not work in Oracle. According to the Oracle docs:
You cannot individually truncate a table that is part of a cluster. You must either truncate the cluster, delete all rows from the table, or drop and re-create the table.
You cannot truncate the parent table of an enabled foreign key constraint. You must disable the constraint before truncating the table. An exception is that you can truncate the table if the integrity constraint is self-referential.
You cannot truncate the parent table of a reference-partitioned table. You must first drop the reference-partitioned child table.
But you should be aware that the usage of a TRUNCATE command is not ideal in an application scope. It should be an operation executed on the database only. The reason lies in another indication of the docs:
If table is not empty, then the database marks UNUSABLE all nonpartitioned indexes and all partitions of global partitioned indexes on the table. However, when the table is truncated, the index is also truncated, and a new high water mark is calculated for the index segment. This operation is equivalent to creating a new segment for the index. Therefore, at the end of the truncate operation, the indexes are once again USABLE.
So it can be a painfully long operation depending on indexes and the size of the table.
Also, for tables that have constraints, the truncate operation will not drop the table; it will delete records one by one if you have ON DELETE CASCADE on your constraints. If not, an error will be thrown. This is still true for Oracle Database.
Another thing you should be aware of is:
Removing rows with the TRUNCATE TABLE statement can be faster than removing all rows with the DELETE statement, especially if the table has numerous triggers, indexes, and other dependencies.
So if by any means you have a trigger on that table, it will not fire.
The original DOC about TRUNCATE command is here:
TRUNCATE TABLE

Understanding the ORA_ROWSCN behavior in Oracle

So this is essentially a follow-up question on Finding duplicate records.
We perform data imports from text files every day, and we ended up importing 10163 records spread across 182 files twice. On running the query mentioned above to find duplicates, the total count of records we got is 10174, which is 11 records more than are contained in the files. I assumed that two records that are exactly the same, and both valid, might be counted by the query as well. So I thought it would be best to use a timestamp field and simply find all the records that ran today (and hence ended up adding duplicate rows). I used ORA_ROWSCN in the following query:
select count(*) from my_table
where TRUNC(SCN_TO_TIMESTAMP(ORA_ROWSCN)) = '01-MAR-2012'
;
However, the count is still higher, i.e. 10168. I am pretty sure that the total number of lines in the files is 10163, having run the following command in the folder that contains all the files: wc -l *.txt
Is it possible to find out which rows are actually inserted twice?
By default, ORA_ROWSCN is stored at the block level, not at the row level. It is only stored at the row level if the table was originally built with ROWDEPENDENCIES enabled. Assuming that you can fit many rows of your table in a single block and that you're not using the APPEND hint to insert the new data above the existing high water mark of the table, you are likely inserting new data into blocks that already have some existing data in them. By default, that is going to change the ORA_ROWSCN of every row in the block causing your query to count more rows than were actually inserted.
Since ORA_ROWSCN is only guaranteed to be an upper-bound on the last time there was DML on a row, it would be much more common to determine how many rows were inserted today by adding a CREATE_DATE column to the table that defaults to SYSDATE or to rely on SQL%ROWCOUNT after your INSERT ran (assuming, of course, that you are using a single INSERT statement to insert all the rows).
Generally, using ORA_ROWSCN and the SCN_TO_TIMESTAMP function is going to be a problematic way to identify when a row was inserted, even if the table is built with ROWDEPENDENCIES. ORA_ROWSCN returns an Oracle SCN, which is a System Change Number. This is a unique identifier for a particular change (i.e. a transaction). As such, there is no direct link between an SCN and a time: my database might be generating SCNs a million times more quickly than yours, and my SCN 1 may be years different from your SCN 1. The Oracle background process SMON maintains a table that maps SCN values to approximate timestamps, but it only maintains that data for a limited period of time; otherwise, your database would end up with a multi-billion row table that was just storing SCN to timestamp mappings. If the row was inserted more than, say, a week ago (and the exact limit depends on the database and database version), SCN_TO_TIMESTAMP won't be able to convert the SCN to a timestamp and will return an error.

Sqlite3: Disabling primary key index while inserting?

I have an SQLite3 database with a table and a primary key consisting of two integers, and I'm trying to insert lots of data into it (i.e. around 1GB or so).
The issue I'm having is that creating a primary key also implicitly creates an index, which in my case slows inserts to a crawl after a few commits (and that would be because the database file is on NFS.. sigh).
So, I'd like to somehow temporarily disable that index. My best plan so far involved dropping the primary key's automatic index; however, it seems that SQLite doesn't like it and throws an error if I attempt to do it.
My second best plan would involve the application making transparent copies of the database on the network drive, making modifications and then merging them back. Note that as opposed to most SQLite/NFS questions, I don't need access concurrency.
What would be a correct way to do something like that?
UPDATE:
I forgot to specify the flags I'm already using:
PRAGMA synchronous = OFF
PRAGMA journal_mode = OFF
PRAGMA locking_mode = EXCLUSIVE
PRAGMA temp_store = MEMORY
UPDATE 2:
I'm in fact inserting items in batches, however every next batch is slower to commit than previous one (I'm assuming this has to do with the size of index). I tried doing batches of between 10k and 50k tuples, each one being two integers and a float.
You can't remove the embedded index, since it's the only address of the row.
Merge your 2 integer keys into a single long key = (key1<<32) + key2; and make this an INTEGER PRIMARY KEY in your schema (in that case you will have only 1 index).
Set the page size for the new DB to at least 4096.
Remove ANY additional index except the primary one.
Fill in data in SORTED order so that the primary key is growing.
Reuse commands (prepared statements); don't create them from a string each time.
Set the page cache size to as much memory as you have left (remember that the cache size is in number of pages, not number of bytes).
Commit every 50000 items.
If you have additional indexes, create them only AFTER ALL data is in the table.
If you're able to merge the keys (I think you're using 32-bit integers, while SQLite uses 64-bit, so it's possible) and fill in data in sorted order, I bet you will fill in your first GB with the same performance as the second, and both will be fast enough.
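The key-merging and sorted-insert advice above could look roughly like the following sketch in Python's sqlite3 module. The table name data, the helper names pack_key and unpack_key, and the sample tuples are all assumptions for illustration; the sketch also assumes key1 is non-negative and below 2^31 and key2 is in the range 0 to 2^32 - 1, so the packed value fits SQLite's signed 64-bit integer.

```python
import sqlite3


def pack_key(key1: int, key2: int) -> int:
    """Combine two 32-bit keys into one 64-bit key: (key1 << 32) + key2."""
    return (key1 << 32) + key2


def unpack_key(packed: int) -> tuple:
    """Recover (key1, key2) from a packed key."""
    return packed >> 32, packed & 0xFFFFFFFF


conn = sqlite3.connect(":memory:")
# The page size must be set before the first table is created.
conn.execute("PRAGMA page_size = 4096")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, value REAL)")

# Hypothetical input: (key1, key2, value) tuples in arbitrary order.
tuples = [(2, 7, 0.5), (1, 9, 1.5), (1, 3, 2.5)]

# Sort by packed key so the primary key only ever grows during the load.
rows = sorted((pack_key(k1, k2), value) for k1, k2, value in tuples)

BATCH_SIZE = 50000  # commit every 50000 items, as suggested above
for start in range(0, len(rows), BATCH_SIZE):
    with conn:  # one transaction per batch
        # The same SQL string is reused for every row of the batch.
        conn.executemany(
            "INSERT INTO data (id, value) VALUES (?, ?)",
            rows[start:start + BATCH_SIZE],
        )

stored = conn.execute("SELECT id, value FROM data ORDER BY id").fetchall()
print([unpack_key(packed) for packed, _ in stored])
```

Because id is an INTEGER PRIMARY KEY, it is the table's rowid, so no separate index is created for it.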
Are you doing the INSERT of each new row as an individual transaction?
If you use BEGIN TRANSACTION and INSERT rows in batches then I think the index will only get rebuilt at the end of each Transaction.
See faster-bulk-inserts-in-sqlite3.
