How best to efficiently extract data from a large SQLite database? - r

I am using SQLite to store a large amount of data and am having trouble extracting that data with very simple queries. At the moment, my database is just one table, with about 50 million rows and 15 columns. I would like to extract one complete column from this table.
I have tried using RSQLite: dbGetQuery(db, 'select qs from CSI'), where qs and CSI are my column and table names respectively. The qs values are character strings. This query runs for hours before I give up (R version 3.3.3, RSQLite_1.1-2).
I also tried the DB Browser for SQLite (v3.9.1) with the same query and again gave up after a few hours of run time. I do not have an ID key or any indexing, but I thought that since I want the entire column, this should not have any impact.
I am running on a 64-bit Windows machine with 16 GB of RAM. How can I extract columns from my table within a reasonable time? Or is there a better way I should be storing my data for easy access?

To get a column value, SQLite has to read the row up to the column. So to get the values from all rows, it has to read practically everything.
With an index on this column, you would have a covering index that would reduce the amount of data to be read from disk.
If you do not actually need multiple values from the same row, consider storing the columns in different tables, or using a different database.
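For the covering-index suggestion above, a minimal sketch using the table and column names from the question (the index name is arbitrary; building the index on 50 million rows is itself a one-time cost in both time and file size, so it only pays off if the column is extracted repeatedly):
-- After this index exists, "SELECT qs FROM CSI" can be answered
-- from the index alone (a covering index for that query) instead
-- of reading through every full 15-column row.
CREATE INDEX idx_csi_qs ON CSI(qs);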

Related

Set table vs multi set table performance

I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTER, 20 DECIMAL), and I think I will have about 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should silently skip duplicate entries while inserting new data.
A MULTISET table with a UPI - it will give an error upon inserting a duplicate row.
The INSERT statements are going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
At a high level: since your table will hold a comparatively large number of rows, it is advisable not to use a SET table; go with a MULTISET table instead.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about Duplicate Rows? When you store weekly aggregates there should be no duplicates at all. And Duplicate Rows are not the same as duplicate Primary Key values.
Simply choose a PI which best fits your join/access pattern (and maybe partition by date). To avoid any potential duplicates you might simply use MERGE instead of INSERT.
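As a rough sketch of that MERGE approach (the table, column and source names here are purely illustrative; note that in Teradata the ON condition has to cover the target table's primary index, and the partitioning column if the table is partitioned):
MERGE INTO weekly_results AS tgt
USING (SELECT report_week, account_id, amount
       FROM staging_weekly) AS src
ON  tgt.report_week = src.report_week
AND tgt.account_id  = src.account_id
WHEN MATCHED THEN
    UPDATE SET amount = src.amount    -- or omit this branch to silently ignore re-sent rows
WHEN NOT MATCHED THEN
    INSERT (report_week, account_id, amount)
    VALUES (src.report_week, src.account_id, src.amount);
This sidesteps both the silent duplicate filtering of a SET table and the error handling a MULTISET table with a UPI would require.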

Organizing tables with data-heavy rows to optimize access times

I am working with a sqlite3 database of around 70 gigabytes right now. This db has three tables: one with about 30 million rows, and two more with ~150 and ~300 million each, with each table running from 6-11 columns.
The table with the fewest rows is consuming the bulk of the space, as it contains a raw data column of zipped BLOBs, generally running between 1 and 6 kilobytes per row; all other columns in the database are numeric, and the zipped data is immutable so inefficiency in modification is not a concern.
I have noticed that creating indexes on the numeric columns of this table:
[15:52:36] Query finished in 723.253 second(s).
takes several times as long as creating a comparable index on the table with five times as many rows:
[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).
Would it be better practice to store the BLOB data in a separate table to access with JOINs? The extra width of each row is the most likely candidate for the slow scan rate of this table.
My current suspicions are:
This is mostly due to the way data is read from disk, making skipping medium-sized amounts of data impractical and yielding a very low ratio of usable data per sector read from the disk by the operating system, and
It is therefore probably standard practice (one I did not know about as a relative newcomer to relational databases) to avoid putting large, variable-width data in the same table as other data that may need to be scanned without indices,
but I would appreciate some feedback from someone with more knowledge in the field.
In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.
When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.
Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:
CREATE TABLE t (
id INTEGER PRIMARY KEY,
a INTEGER,
[...],
i REAL,
data BLOB NOT NULL
);
With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.
If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.
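A minimal sketch of that separate-table layout (the names are illustrative, not from the original schema):
CREATE TABLE t (
    id INTEGER PRIMARY KEY,
    a  INTEGER,
    i  REAL
);
CREATE TABLE t_blob (
    id   INTEGER PRIMARY KEY REFERENCES t(id),
    data BLOB NOT NULL
);
-- Scans and index builds over the numeric columns now touch only the
-- small rows of t; the blob is fetched on demand:
-- SELECT data FROM t_blob WHERE id = ?;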

How to import a data frame in RSQLite while specifying column constraints?

I am trying to put a large data frame into a new table of a database. It could simply be done via:
dbWriteTable(conn=db,name="sometablename",value=my.data)
However, I want to specify the primary keys, foreign keys and the column types such as NUMERIC, TEXT and so on.
Is there anything I can do? Should I create a table with my columns first and then add the data frame to it?
RSQLite assumes your data.frame is already all set before writing it to disk; there is not much you can specify in the writing call. So I see two approaches: fix the schema either before firing the query that writes the data, or afterwards. I usually write the table from R to disk, then polish it using dbGetQuery to alter the table's attributes. The only problem with this workflow is that SQLite has very limited features for altering tables.
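If you take the other route and define the schema first, a minimal sketch of the DDL might look like the following (the column names and types are assumptions, not taken from the actual data frame); the data frame is then appended into the existing table with dbWriteTable(..., append = TRUE) instead of letting RSQLite create the table itself:
CREATE TABLE sometablename (
    id       INTEGER PRIMARY KEY,            -- primary key declared up front
    label    TEXT NOT NULL,
    value    NUMERIC,
    group_id INTEGER REFERENCES groups(id)   -- foreign key; enforced only with PRAGMA foreign_keys = ON
);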

How to determine position of specific character/string in SQLite string column value?

I have values in a SQLite table* that contain a number of strings, of different lengths, joined by periods, something like this:
SomeApp.SomeNameSpace.InterestingString.NotInteresting
SomeApp.OtherNameSpace.WantThisOne.ReallyQuiteDull
SomeApp.OtherNameSpace.WantThisOne.AlsoDull
SomeApp.DifferentNameSpace.AlwaysWorthALook.LittleValue
I'd like to extract (in this case) the third period-delimited substring so I could write something like
SELECT interesting_string, COUNT(*)
FROM ( SELECT third_part_of_period_delimited_string(name) interesting_string )
GROUP BY interesting_string;
Obviously I can do this any number of ways programmatically; I'm wondering if there's any way to achieve this in a SQLite SELECT query?
* It's a SharpDevelop Profiler database, if anyone's curious
No.
You can, as you mention, work with the strings after you have selected them from the database. Or you can split them up into separate columns when they are stored.
If you do not have access to the code that is storing the data, you might consider reading the data in its entirety, splitting the strings, and storing the split-out tokens in separate columns in a new table. If the data is not too large, you might look at keeping this table in an in-memory database for excellent performance.
Whether this is worthwhile depends on whether one pass to split the data strings can be reused many times. If the data is constantly changing, then this scheme would probably not work well.
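A minimal sketch of that split-into-columns table (the names, and the assumption of exactly four parts, are illustrative only):
CREATE TABLE name_parts (
    full_name TEXT,
    part1     TEXT,   -- e.g. 'SomeApp'
    part2     TEXT,   -- e.g. 'OtherNameSpace'
    part3     TEXT,   -- the interesting third component
    part4     TEXT
);
-- The original aggregation then becomes trivial:
SELECT part3 AS interesting_string, COUNT(*)
FROM name_parts
GROUP BY part3;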

What's the fastest way to fill a table in SQLite?

I'm writing an application which produces a lot of data to store in a database.
The DB schema is very simple: it's a table with just 4 columns, but I must fill it with more than 30000 rows.
I'm using SQLite and QSql as API.
Data is produced very fast (no sleeps) and I'm using QSqlQuery to insert one row at a time.
However it seems that it takes 7-8 seconds to store 100 rows (I'm using QTime for time counting).
I tried using QSqlTableModel but I noticed no performance improvements, even calling QSqlTableModel::submitAll every 1000 rows (QTime shows 70-80 seconds for 1000 rows).
Is there any way to store rows faster? What is the fastest way to fill a table with SQLite?
You could try looking at whether you've got transactions set up correctly: commits are expensive because they have to sync to disk, and without an explicit transaction SQLite wraps every single INSERT in its own transaction and commit.
Also bear in mind that SQLite is more heavily optimized for reading anyway.
You might also try dropping any indexes at the start and adding them back after all records have been imported. Results will vary, of course, depending on whether you're emptying the table first or just appending new records.
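A minimal sketch of the batched-transaction approach (the table and column names are made up; with QSql the BEGIN/COMMIT pair corresponds to calling QSqlDatabase::transaction() and QSqlDatabase::commit() around a prepared, reused QSqlQuery):
BEGIN TRANSACTION;
INSERT INTO samples (a, b, c, d) VALUES (1, 0.5, 'x', 10);
INSERT INTO samples (a, b, c, d) VALUES (2, 1.5, 'y', 20);
-- ... the remaining rows, ideally through one prepared statement ...
COMMIT;
One commit for the whole batch replaces tens of thousands of separate disk syncs, which is typically the dominant cost when single-row inserts are this slow.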
