Organizing tables with data-heavy rows to optimize access times - sqlite

I am working with an SQLite3 database of around 70 gigabytes. It has three tables: one with about 30 million rows and two more with roughly 150 million and 300 million rows, each table having between 6 and 11 columns.
The table with the fewest rows consumes the bulk of the space, because it contains a raw-data column of zipped BLOBs, generally 1 to 6 kilobytes per row. All other columns in the database are numeric, and the zipped data is immutable, so inefficient modification is not a concern.
I have noticed that creating indexes on the numeric columns of this table:
[15:52:36] Query finished in 723.253 second(s).
takes several times as long as creating a comparable index on the table with five times as many rows:
[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).
Would it be better practice to store the BLOB data in a separate table and access it with JOINs? The extra width of each row is the most likely culprit for this table's slow scan rate.
My current suspicions are:
1. that this is mostly down to how data is read from disk, which makes skipping medium-sized chunks of data impractical and yields a very low ratio of usable data per sector the operating system reads, and
2. that it is therefore probably standard practice, one I did not know as a relative newcomer to relational databases, to keep larger, variable-width data out of tables whose other columns may need to be scanned without indices,
but I would appreciate some feedback from someone with more knowledge in the field.

In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.
When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.
Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:
CREATE TABLE t (
    id   INTEGER PRIMARY KEY,
    a    INTEGER,
    [...],
    i    REAL,
    data BLOB NOT NULL
);
With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.
If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.
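A minimal sketch of that split, reusing the column names from the example above (the t_blob table name is made up):

-- Narrow table: only numeric columns, so many rows fit per page.
CREATE TABLE t (
    id INTEGER PRIMARY KEY,
    a  INTEGER,
    i  REAL
);

-- Blobs live in their own table, keyed by the same id.
CREATE TABLE t_blob (
    id   INTEGER PRIMARY KEY REFERENCES t(id),
    data BLOB NOT NULL
);

-- Scans over the numeric columns never touch the blob pages;
-- fetching the blob costs one extra lookup per row.
SELECT t.a, t.i, t_blob.data
FROM t
JOIN t_blob ON t_blob.id = t.id
WHERE t.a = 42;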

Related

How to overcome Row size too large (> 8126) error on Google Cloud MySQL 5.7 Second Generation

Google Cloud MySQL Engine supports the InnoDB storage engine only.
I am getting the following error when creating a table with 300 columns.
[Err] 1118 - Row size too large (> 8126).
Changing some columns to TEXT or BLOB may help. In the current row format, the BLOB prefix of 0 bytes is stored inline.
I also tried creating the table with some columns as TEXT and others as BLOB, but it did not work.
Even modifying innodb_log_file_size is not possible, as changing it is not allowed on the Google Cloud SQL platform.
"Vertical Partitioning"
A table with lots of columns is pushing several limits; you hit one of them. There are several reasonable workarounds; Vertical Partitioning may be the best, especially if many of the columns are TEXT/BLOB.
Instead of a single table, have multiple tables with the same PRIMARY KEY, except that one may be AUTO_INCREMENT. JOIN them together as needed to collect the columns. You could even have VIEWs to hide the fact that you split up the table. I recommend grouping the columns logically, based on the application and which columns are needed 'together'.
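A hedged sketch of that layout; the table, column, and view names here are made up for illustration:

-- Frequently-used columns in the main table.
CREATE TABLE orders_main (
    id INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    created_at DATETIME NOT NULL
);

-- Rarely-used wide columns share the same PRIMARY KEY, without AUTO_INCREMENT.
CREATE TABLE orders_extra (
    id INT PRIMARY KEY,
    notes TEXT,
    attachment BLOB
);

-- Optional view so application code can keep selecting from one name.
CREATE VIEW orders_all AS
SELECT m.id, m.customer_id, m.created_at, e.notes, e.attachment
FROM orders_main m
LEFT JOIN orders_extra e ON e.id = m.id;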
Do not splay an array of things across columns; instead, have another table with multiple rows to handle the repetition. Example of what to avoid: address1, state1, country1, address2, state2, country2.
Do not use CHAR or BINARY except for truly fixed-length columns. Most such columns are very short. Also, most CHAR columns should be CHARACTER SET ascii, not utf8. (Think country_code, zipcode, md5.)
innodb_log_file_size is only indirectly related to your question. What is its value?
Directly related is innodb_page_size, which defaults to 16K, and virtually no one ever changes. I would expect Cloud Engines to prohibit changing it.
(I'm with Bill on desiring more info about your schema -- so we can be more specific about how to help you.)
You don't have many options here. InnoDB's default page size is 16KB, and you must design your tables so that at least two rows fit in a page. That's where the limit of 8126 bytes per row comes from.
Variable-length columns like VARCHAR, VARBINARY, BLOB, and TEXT can be longer, because data exceeding the row size limit can be stored on extra pages. To take advantage of this, you must enable the Barracuda table format, and choose ROW_FORMAT=DYNAMIC.
In config:
[mysqld]
innodb_file_per_table = ON
innodb_file_format = Barracuda
innodb_default_row_format = DYNAMIC
I don't know if these settings are already enabled in Google Cloud SQL, or if they allow you to change these settings.
Read https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format.html for more information
Again, the advantage of DYNAMIC row format only applies to variable-length data types. If you have 300 columns that are fixed-length, like CHAR, then it doesn't help.
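If those settings are available, the row format can also be requested explicitly per table; a sketch with made-up table and column names:

CREATE TABLE wide_table (
    id INT PRIMARY KEY,
    c1 VARCHAR(2000),
    c2 TEXT,
    c3 BLOB
    -- ... further variable-length columns ...
) ROW_FORMAT=DYNAMIC;

With DYNAMIC, long variable-length values are stored entirely off-page with only a 20-byte pointer left in the row, which is what keeps the in-row size under the 8126-byte limit.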
By the way, innodb_log_file_size has nothing to do with this error about row size.
In order to do what you want on a Cloud SQL instance, first run this to set the innodb_strict_mode variable:
SET innodb_strict_mode = 0;
After that you should be able to create your table.

write joined objects to file in an optimized fashion

I have two R data frames. For example, orders and customers. If I write them to file with saveRDS(), they take up a certain amount of space. If I join them, I'll end up with one big data frame. If I save that to file, the file is much larger than the initial two. However, no new data has actually been created. I think R is treating each row as completely unique and independent. If a customer has 10 orders, their info is just repeated 10 times instead of stored as a single entity. Is there a way to optimize this? Is the only option to just save the two tables and join them every time?

How best to efficiently extract data from a large SQLite database?

I am using SQLite to store a large amount of data and am having trouble extracting that data using very simple queries. At the moment, my database is just one table, with about 50 million rows and 15 columns. I would like to extract one complete column from this table.
I have tried using RSQLite: dbGetQuery(db, 'select qs from CSI'), where qs and CSI are my column and table names respectively. The qs values are character strings. This query runs for hours before I give up (R version 3.3.3, RSQLite_1.1-2).
I also tried DB Browser for SQLite (v3.9.1) with the same query and again gave up after a few hours of run time. I do not have an ID key or any indexing, but since I want the entire column, I thought this should not matter.
I am running on a 64-bit Windows machine with 16 GB of RAM. How can I extract columns from my table within a reasonable time? Or is there a better way I should be storing my data for easy access?
To get a column value, SQLite has to read the row up to the column. So to get the values from all rows, it has to read practically everything.
With an index on this column, you would have a covering index that would reduce the amount of data to be read from disk.
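A sketch of that, using the table and column names from the question:

-- One-time cost: builds a B-tree containing only the qs values (plus rowids).
CREATE INDEX idx_csi_qs ON CSI(qs);

-- The query can now be answered from the index alone (a covering index),
-- without reading the much wider table rows.
SELECT qs FROM CSI;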
If you do not actually need multiple values from the same row, consider storing the columns in different tables, or using a different database.

SQLite with large data, is it best to split

I am using SQLite because this needs to be cross-platform. I have about 10 tables with a small amount of data (maybe a few dozen rows each), but I also have a set of data which might have a million or more rows.
The small dataset isn't really modified that much, just queried, but the large data set will be queried and modified frequently.
Rather than have a single SQLite database with all the tables in it, I was wondering whether splitting it into two databases might be smarter.
Basically I'd have one database, let's call it "settings", with the 10 tables in it. I'd then have another database, let's call it "userdata", with the million rows.
I'll be creating a third database called "audits" where I record each change to the "userdata" database. This database is expected to grow (for a short time period).
I am just wondering if people have an opinion as to whether it is a good idea to split my data into multiple databases or if I should just have one massive one.
My thinking is the queries on the "userdata" database might be slightly more efficient since it will only have one table.
Note, this is not for the long term; it will be queried and edited for about a week, and then it is done.
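For what it's worth, splitting does not rule out cross-database queries; a hedged sketch using SQLite's ATTACH, with made-up file, table, and column names:

-- Open userdata.db as the main database, then attach the others.
ATTACH DATABASE 'settings.db' AS settings;
ATTACH DATABASE 'audits.db' AS audits;

-- Tables in an attached database are addressed with a schema prefix.
SELECT u.id, s.value
FROM user_rows u
JOIN settings.app_settings s ON s.key = u.setting_key;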

Do SQLite queries that return large result sets take more time?

When performing an SQLite query, does the size of the returned data set affect how long the query takes? Let's assume for this question that I don't actually access any of the data in the result; I just want to know if the query itself takes longer. Let's also assume that I am simply selecting all rows and have no WHERE or ORDER BY clauses.
For example, suppose I have two tables A and B. Table A has a million rows and table B has 10 rows, and both tables have the same number and types of columns. Will selecting all rows in table A take longer than selecting all rows in table B?
This is a follow-up to my question How does a cursor refer to deleted rows?. I am guessing that if, during the query, SQLite makes a copy of the data, then queries that return large data sets may take longer, unless there is an optimization that only copies the result data when the underlying data changes while the query is still alive.
Depending on some details, yes, a query may take different amounts of time.
Example: I have a table with some 20k entries. I do a GLOB search that must try every row, with a LIMIT. If the LIMIT is met, the query can stop early. If not, it must go through the entire table (or JOIN). So searches with many matches return more quickly than searches with only a few.
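A sketch of that pattern, with a made-up table and pattern:

-- Rows are scanned until 20 matches are found, then the query stops early.
-- If fewer than 20 rows match, the whole table has to be scanned.
SELECT * FROM entries WHERE name GLOB '*foo*' LIMIT 20;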
If the query must run through the same amount of data, I don't expect there is a significant difference between a smaller and larger amount of selected rows. There will probably be IO cost, of course.
