Understanding the ORA_ROWSCN behavior in Oracle - oracle11g

So this is essentially a follow-up question on Finding duplicate records.
We perform data imports from text files everyday and we ended up importing 10163 records spread across 182 files twice. On running the query mentioned above to find duplicates, the total count of records we got is 10174, which is 11 records more than what are contained in the files. I assumed about the posibility of 2 records that are exactly the same and are valid ones being accounted for as well in the query. So I thought it would be best to use a timestamp field and simply find all the records that ran today (and hence ended up adding duplicate rows). I used ORA_ROWSCN using the following query:
select count(*) from my_table
where TRUNC(SCN_TO_TIMESTAMP(ORA_ROWSCN)) = '01-MAR-2012'
;
However, the count is still higher, i.e. 10168. I am pretty sure that the total number of lines across the files is 10163, which I verified by running wc -l *.txt in the folder that contains all the files.
Is it possible to find out which rows are actually inserted twice?

By default, ORA_ROWSCN is stored at the block level, not at the row level. It is only stored at the row level if the table was originally built with ROWDEPENDENCIES enabled. Assuming that you can fit many rows of your table in a single block and that you're not using the APPEND hint to insert the new data above the existing high water mark of the table, you are likely inserting new data into blocks that already have some existing data in them. By default, that is going to change the ORA_ROWSCN of every row in the block causing your query to count more rows than were actually inserted.
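For row-level tracking, the table has to be built with the ROWDEPENDENCIES clause at creation time (it cannot be switched on later with ALTER TABLE). A minimal sketch, using hypothetical columns for the my_table from the question:
CREATE TABLE my_table (
  id   NUMBER,
  data VARCHAR2(100)
) ROWDEPENDENCIES;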
Since ORA_ROWSCN is only guaranteed to be an upper-bound on the last time there was DML on a row, it would be much more common to determine how many rows were inserted today by adding a CREATE_DATE column to the table that defaults to SYSDATE or to rely on SQL%ROWCOUNT after your INSERT ran (assuming, of course, that you are using a single INSERT statement to insert all the rows).
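A minimal sketch of the CREATE_DATE approach (the column name create_date is an assumption; my_table is from the question). Note that existing rows will all pick up the time the column was added, so only rows inserted afterwards get a meaningful value:
ALTER TABLE my_table ADD (create_date DATE DEFAULT SYSDATE);

SELECT COUNT(*)
  FROM my_table
 WHERE create_date >= DATE '2012-03-01'
   AND create_date <  DATE '2012-03-02';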
Generally, using the ORA_ROWSCN and the SCN_TO_TIMESTAMP function is going to be a problematic way to identify when a row was inserted even if the table is built with ROWDEPENDENCIES. ORA_ROWSCN returns an Oracle SCN which is a System Change Number. This is a unique identifier for a particular change (i.e. a transaction). As such, there is no direct link between a SCN and a time-- my database might be generating SCN's a million times more quickly than yours and my SCN 1 may be years different from your SCN 1. The Oracle background process SMON maintains a table that maps SCN values to approximate timestamps but it only maintains that data for a limited period of time-- otherwise, your database would end up with a multi-billion row table that was just storing SCN to timestamp mappings. If the row was inserted more than, say, a week ago (and the exact limit depends on the database and database version), SCN_TO_TIMESTAMP won't be able to convert the SCN to a timestamp and will return an error.

Related

SQLite: re-arrange physical position of rows inside file

My problem is that my queries are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
timestamp TEXT,
name TEXT,
result float
)
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This calculation holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal-timestamps, but the querys will be done at equal-names (i.e. I want given a name, get its time series), like:
SELECT timestamp,result WHERE name='some_name';
Now, the way I'm doing things now is to have no indexes, calculate all results, then create an index on name CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make querys on the index really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that even though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,name1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows in disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal sqlite tables have, as a primary key, a 64-bit integer (known as rowid and a few other aliases). That determines the order in which rows are stored in a B*-tree (which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key, which is then used to place rows in a B-tree. So if every row's (name, timestamp) pair is unique, that's an option which will keep all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
timestamp TEXT
, name TEXT
, result REAL
, PRIMARY KEY (name, timestamp)
) WITHOUT ROWID
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
Some pragmas worth looking into to tune things (example settings are sketched after this list):
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.
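A sketch of how these pragmas might be set (the specific values here are assumptions for illustration only; tune them to your machine and database size):
PRAGMA cache_size = -200000;    -- negative means size in KiB, so roughly 200 MB of page cache
PRAGMA mmap_size = 1073741824;  -- allow up to 1 GiB of the file to be memory-mapped
PRAGMA optimize;                -- run after the index on name has been created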

How can I replace certain values by their average in an sqlite database?

I have an sqlite database with a table that logs electric power values over time, i.e. there is a timestamp column and one for the associated power value.
With a value coming in roughly every second, this table grows significantly over time, which is why I want to thin out old values, for example by replacing all 60 values within a minute with their average.
I know how to query for the average.
I know how to insert the query's result back into the table.
But how do I delete the original values without also deleting the newly inserted average value (which has a timestamp within the same range)?
Note that I would like to perform the operation entirely inside sqlite query language, i.e. without storing for example row ids in the C code that is executing the queries.
The easiest way would be to use a temporary table:
BEGIN;
CREATE TEMP TABLE Averages AS
SELECT MIN(Timestamp), AVG(Value)
FROM MyTable
WHERE (old)
GROUP BY (minute);
DELETE FROM MyTable WHERE (old);
INSERT INTO MyTable(Timestamp, Value) SELECT * FROM Averages;
DROP TABLE Averages;
COMMIT;
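As a concrete sketch, assuming Timestamp is stored as ISO-8601 text and "old" means anything before 2024-01-01 (both assumptions for illustration only), the placeholders could be filled in like this:
BEGIN;
CREATE TEMP TABLE Averages AS
SELECT MIN(Timestamp) AS Timestamp, AVG(Value) AS Value
FROM MyTable
WHERE Timestamp < '2024-01-01'                   -- the "old" rows
GROUP BY strftime('%Y-%m-%d %H:%M', Timestamp);  -- one group per minute
DELETE FROM MyTable WHERE Timestamp < '2024-01-01';
INSERT INTO MyTable(Timestamp, Value) SELECT Timestamp, Value FROM Averages;
DROP TABLE Averages;
COMMIT;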

sqlite3 - the philosophy behind sqlite design for this scenario

Suppose we have a database file with just one table named TableA, and this table has just one column named Text;
let's say we populate our TableA with 3,000,000 strings like these (each line a record):
Many of our patients are incontinent.
Many of our patients are severely disturbed.
Many of our patients need help with dressing.
If I save the file at this point it'll be: ~326 MB
Now let's say we want to increase the speed of our queries, and therefore we set our Text column as the primary key (or create an index on it);
if I save the file at this point it'll be: ~700 MB
our query:
SELECT Text FROM "TableA" where Text like '% home %'
for the table without index: ~5.545s
for the indexed table: ~2.231s
As far as I know, when we create an index on a column or make a column the primary key, the sqlite engine doesn't need to refer to the table itself (if no other column is requested in the query); it answers the query from the index, and hence query execution gets faster.
My question is: in the scenario above, where we have just one column and that column is also the primary key, why does sqlite hold on to what seems like unnecessary data (in this case the ~326 MB of table data)? Why not keep just the index/primary-key data?
In SQLite, table rows are stored in the order of the internal rowid column.
Therefore, indexes must be stored separately.
In SQLite 3.8.2 or later, you can create a WITHOUT ROWID table which is stored in order of its primary key values.
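A sketch of what that looks like for the table in the question (requires SQLite 3.8.2 or later):
CREATE TABLE TableA (
Text TEXT PRIMARY KEY
) WITHOUT ROWID;
Here the table itself is stored as a B-tree keyed on Text, so there is no separate rowid-ordered copy of the data alongside the index.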

How to use one sequence for all table in sqlite

When I'm creating tables in an SQLite database, separate sequences are automatically created for each table.
However I want to use one sequence for all tables in my SQLite database, and I also need to set min and max values (e.g. min=10000 and max=999999) for that sequence (min and max meaning the value the sequence starts at and the maximum value it can increment to).
I know this can be done in an Oracle database, but don't know how to do it in SQLite.
Is there any way to do this?
Unfortunately, you cannot do this: SQLite automatically maintains a separate sequence for each table, stored in the special sqlite_sequence service table (one row per AUTOINCREMENT table).
And even if you somehow forced it to take a single sequence as the source for all your tables, it would not work the way you expect. For example, in PostgreSQL or Oracle, if you set a sequence to, say, 1 while the table already has rows with values 1, 2, ..., any attempt to insert new rows using that sequence as a source would fail with a unique constraint violation.
In SQLite, however, if you manually reset sequence using something like:
UPDATE sqlite_sequence SET seq = 1 WHERE name = 'mytable'
and you already have rows 1, 2, 3, ..., new attempts to insert will NOT fail; SQLite simply bumps the sequence back up to the maximum existing value of the corresponding auto-increment column, so it can keep going up from there.
Note that you can use this hack to assign a starting value for the sequence; however, you cannot make it stop incrementing at some maximum value (unless you enforce that by other means).
First of all, this is a bad idea.
The performance of database queries depends on predictability, and by fiddling with the sequence counters you are introducing offsets which will confuse the database engine.
However, to achieve this you could determine the highest sequence number currently in use by any table:
SELECT MAX(seq) FROM sqlite_sequence
This needs to be done after each INSERT query, followed by an update of all sequences for all tables:
UPDATE sqlite_sequence SET seq=determined_max
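A minimal sketch that combines both steps into a single statement (assuming you want every table's sequence pushed up to the current global maximum):
UPDATE sqlite_sequence SET seq = (SELECT MAX(seq) FROM sqlite_sequence);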

Sqlite3: Disabling primary key index while inserting?

I have an Sqlite3 database with a table whose primary key consists of two integers, and I'm trying to insert lots of data into it (i.e. around 1 GB or so).
The issue I'm having is that creating the primary key also implicitly creates an index, which in my case bogs inserts down to a crawl after a few commits (and that would be because the database file is on NFS... sigh).
So, I'd like to somehow temporarily disable that index. My best plan so far involved dropping the primary key's automatic index; however, it seems that SQLite doesn't like that and throws an error if I attempt to do it.
My second-best plan would involve the application making transparent copies of the database on the network drive, making modifications, and then merging them back. Note that, as opposed to most SQLite/NFS questions, I don't need concurrent access.
What would be a correct way to do something like that?
UPDATE:
I forgot to specify the flags I'm already using:
PRAGMA synchronous = OFF
PRAGMA journal_mode = OFF
PRAGMA locking_mode = EXCLUSIVE
PRAGMA temp_store = MEMORY
UPDATE 2:
I'm in fact inserting items in batches; however, each successive batch is slower to commit than the previous one (I'm assuming this has to do with the size of the index). I tried batches of between 10k and 50k tuples, each tuple being two integers and a float.
You can't remove the implicit index, since it's the only address of a row.
Merge your 2 integer keys into a single long key = (key1 << 32) + key2, and make this an INTEGER PRIMARY KEY in your schema (in that case you will have only 1 index).
Set the page size for the new DB to at least 4096.
Remove ANY additional indexes except the primary one.
Insert the data in SORTED order so that the primary key is always growing.
Reuse prepared statements; don't recreate them from strings each time.
Set the page cache size to as much memory as you have to spare (remember that cache size is measured in pages, not bytes).
Commit every 50000 items.
If you need additional indexes, create them only AFTER ALL the data is in the table.
If you are able to merge the keys (I think you're using 32-bit values, while sqlite uses 64-bit integers, so it's possible) and fill the data in sorted order, as sketched below, I bet you will load your first GB with the same performance as the second, and both will be fast enough.
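A minimal sketch of the merged-key idea (the table and column names are hypothetical; it assumes both keys are non-negative and fit in 32 bits):
CREATE TABLE samples (
id INTEGER PRIMARY KEY,   -- holds (key1 << 32) + key2, the only index on the table
value REAL
);
INSERT INTO samples(id, value) VALUES ((? << 32) + ?, ?);  -- parameters: key1, key2, value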
Are you doing the INSERT of each new row as an individual transaction?
If you use BEGIN TRANSACTION and INSERT rows in batches, then I think the index will only get rebuilt at the end of each transaction.
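A sketch of what batched inserts look like (table and column names are hypothetical):
BEGIN TRANSACTION;
INSERT INTO samples(id, value) VALUES (1, 0.5);
INSERT INTO samples(id, value) VALUES (2, 0.7);
-- ... many more INSERTs in the same batch ...
COMMIT;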
See faster-bulk-inserts-in-sqlite3.
