I am copying millions of rows to a table in another database. I am doing a few things with the data in-between and have duplicates on a certain column that is used as a key in the destination table. Ignoring all the other solutions to fix this, I am testing out using "Insert or Replace" and so far processing is going smooth, but I am not sure whether this is faster than a normal "Insert" (given a case where there are no PK duplicates)?
The OR REPLACE clause works only if there is some UNIQUE (or PRIMARY KEY) constraint that could be violated.
This means that the database always has to check whether there is a duplicate, the only difference is what happens when a duplicate is found: report an error, or delete the old row.
Related
My problem is that my querys are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
timestamp TEXT,
name TEXT,
result float,
)
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This calculation holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal-timestamps, but the querys will be done at equal-names (i.e. I want given a name, get its time series), like:
SELECT timestamp,result WHERE name='some_name';
Now, the way I'm doing things now is to have no indexes, calculate all results, then create an index on name CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make querys on the index really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that every though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,namee1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows in disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal sqlite tables have, as a primary key, a 64-bit integer (Known as rowid and a few other aliases). That determines the order that rows are stored in a B*-tree (Which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key which is used to place rows in a B-tree. So if every row's (name, timestamp) columns make a unique value, that's a possibility that will leave all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
timestamp TEXT
, name TEXT
, result REAL
, PRIMARY KEY (name, timestamp)
) WITHOUT ROWID
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
Some pragmas worth looking into to tune things:
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.
I am currently working on a database structure in SQLite Studio (not sure whether that's in itself important, but might as well mention), and error messages are making me wonder whether I'm just going at it the wrong way or there's some subtlety I'm missing.
Assume two tables, people-basics (person-ID, person-NAME, person-GENDER) and people-stats (person-ID, person-NAME, person-SIZE). What I'm looking into achieving is "Every record in people-basics corresponds to a single record in people-stats.", ideally with the added property that person-ID and person-NAME in people-stats reflect the associated person-ID and person-NAME in people-basics.
I've been assuming up to now that one would achieve this with Foreign Keys, but I've also been unable to get this to work.
When I add a person in people-basics, it works fine, but then when I go over to people-stats no corresponding record exists and if I try to create one and fill the Foreign Key column with corresponding data, I get this message: "Cannot edit this cell. Details: Error while executing SQL query on database 'People': no such column: people-basics.person" (I think the message is truncated).
The DDL I currently have for my tables (auto-generated by SQLite Studio based on my GUI operations):
CREATE TABLE [people-basics] (
[person-ID] INTEGER PRIMARY KEY AUTOINCREMENT
UNIQUE
NOT NULL,
[person-NAME] TEXT UNIQUE
NOT NULL,
[person-GENDER] TEXT
);
CREATE TABLE [people-stats] (
[person-NAME] TEXT REFERENCES [people-basics] ([person-NAME]),
[person-SIZE] NUMERIC
);
(I've removed the person-ID column from people-stats for now as it seemed like I should only have one foreign key at a time, not sure whether that's true.)
Alright, that was a little silly.
The entire problem was solved by removing hyphens from table names and column names. (So: charBasics instead of char-basics, etc.)
Ah well.
I have defined this mapper method:
#Delete("truncate table MY_TABLE")
public void wipeAllData();
and it usually works...anyway sometimes it doesn't...is there any particular reason/known bug for that?
I'm using mybatis 3.3.0 with oracle 11g as DBMS.
EDIT
Since you added the oracle11g tag. My previous answer is no longer valid, at least not the reason why it would not be working. So I edited it.
There are some reasons that I'm aware of why sometimes it is not working in ORACLE. According to the ORACLE docs
You cannot individually truncate a table that is part of a cluster. You must either truncate the cluster, delete all rows from the table, or drop and re-create the table.
You cannot truncate the parent table of an enabled foreign key constraint. You must disable the constraint before truncating the table. An exception is that you can truncate the table if the integrity constraint is self-referential.
You cannot truncate the parent table of a reference-partitioned table. You must first drop the reference-partitioned child table.
But you should be aware that the usage or a TRUNCATE command is not ideal in an application scope. It should be an operation executed on the database only. The reason lies in another indication of the docs:
If table is not empty, then the database marks UNUSABLE all nonpartitioned indexes and all partitions of global partitioned indexes on the table. However, when the table is truncated, the index is also truncated, and a new high water mark is calculated for the index segment. This operation is equivalent to creating a new segment for the index. Therefore, at the end of the truncate operation, the indexes are once again USABLE.
So it can be a painfully long operation depending on indexes and the size of the table.
Also, for tables that have constraints the truncate operation will not drop the table, it will delete registries one by one. If you have ON DELETE CASCADE on your constraints, if not, an error will be thrown. This is still true for oracle database
Another thing will should aware of is
Removing rows with the TRUNCATE TABLE statement can be faster than removing all rows with the DELETE statement, especially if the table has numerous triggers, indexes, and other dependencies.
So if by any means you have a trigger on that table it will do nothing.
The original DOC about TRUNCATE command is here:
TRUNCATE TABLE
I'm working with SQLite in Flash.
I have this unique index:
CREATE UNIQUE INDEX songsIndex ON songs ( DiscID, Artist, Title )
I have a parametised recursive function set up to insert any new rows (single or multiple).
It works fine if I try to insert a row with the same DiscID, Artist and Title as an existing row - ie it ignores inserting the existing row, and tells me that 0 out of 1 records were updated - GOOD.
However, if, for example the DiscId is blank, but the artist and title are not, a new record is created when there is already one with a blank DiscId and the same artist and title - BAD.
I traced out the disc id prior to the insert, and Flash is telling me it's undefined. So I've coded it to set anything undefined to "" (an empty string) to make sure it's truly an empty string being inserted - but subsequent inserts still ignore the unique index and add a brand new row even though the same row exists.
What am I misunderstanding?
Thanks for your time and help.
SQLite allows NULLable fields to participate in UNIQUE indexes. If you have such an index, and if you add records such that two of the three columns have identical values and the other column is NULL in both records, SQLite will allow that, matching the behavior you're seeing.
Therefore the most likely explanation is that despite your effort to INSERT zero-length strings, you're actually still INSERTing NULLs.
Also, unless you've explicitly included OR IGNORE in your INSERT statements, the expected behavior of SQLite is to throw an error when you attempt to insert a duplicate INDEX value into a UNIQUE INDEX. Since you're not seeing that behavior, I'm guessing that Flash provides some kind of wrapper around SQLite that's hiding the true behavior from you (and could also be translating empty strings to NULL).
Larry's answer is great. To anyone having the same problem here's the SQLite docs citation explaining that in this case all NULLs are treated as different values:
For the purposes of unique indices, all NULL values are considered
different from all other NULL values and are thus unique. This is one
of the two possible interpretations of the SQL-92 standard (the
language in the standard is ambiguous). The interpretation used by
SQLite is the same and is the interpretation followed by PostgreSQL,
MySQL, Firebird, and Oracle. Informix and Microsoft SQL Server follow
the other interpretation of the standard, which is that all NULL
values are equal to one another.
See here: https://www.sqlite.org/lang_createindex.html
If I was to insert lots of rows into an empty table without primary key, nor any indexes. Varying number of rows might be inserted per transaction. Could I then be sure that a SELECT * FROM the_table; would retrieve the data in the same order on both Linux and Windows?
No, you cannot and should never rely on the order of rows in a result set from a query that does not have ordering constraints. Even on the same platform, same database. Even if it works in your tests.
Things like VACCUMing your database (or some of the auto_vaccum modes I think) could change the relative block layout of your data and alter the result set even if nothing else has changed elsewhere (no inserts, no query plan change).