I have two R data frames, for example orders and customers. If I write them to file with saveRDS(), they take up a certain amount of space. If I join them, I end up with one big data frame, and if I save that to file, the resulting file is much larger than the two originals combined. However, no new data has actually been created. I think R is treating each row as completely unique and independent: if a customer has 10 orders, their info is repeated 10 times instead of being stored as a single entity. Is there a way to optimize this? Or is the only option to save the two tables and join them every time?
I have two CSV files. The first contains first_name and last_name; the second contains email and phone. The two files correspond by line index (they have the same number of records). I need to save all the data in Parquet format.
First option: merge the two schemas into one and save everything in a single Parquet file.
Second option: save the two schemas separately (as two Parquet files).
Given my use case, I will most likely take the second option (two files). In the end I need to query the data using various tools, most often Presto.
Question 1: is it possible to pull data from the two Parquet files together (let's say select first_name, email)?
Question 2: will there be a difference in run times?
I have run some tests, but cannot reach a conclusive answer.
You can pull data from those two tables, but you need some join key in order to combine the records. If there is none, you might have to use row_number(), assuming the rows are in the same order in both tables. Data size also matters here.
In the big data world, a denormalized format is the usual recommendation if you have to join those tables very frequently in your queries; it will give you better performance.
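For Question 1, here is a minimal sketch of the row_number() approach in Presto, assuming the two Parquet files are exposed as tables named names_tbl(first_name, last_name) and contacts_tbl(email, phone) (illustrative names). Note that row_number() OVER () gives no ordering guarantee, so this only pairs rows correctly if both files are read in their original order:

-- Pair the two tables by a synthetic row number (illustrative table names).
WITH n AS (
    SELECT first_name, last_name, row_number() OVER () AS rn
    FROM names_tbl
),
c AS (
    SELECT email, phone, row_number() OVER () AS rn
    FROM contacts_tbl
)
SELECT n.first_name, c.email
FROM n
JOIN c ON n.rn = c.rn;

If a real shared key exists in both files, joining on that key is safer than relying on row order.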
I am using SQLite to store a large amount of data and am having trouble extracting that data using very simple queries. At the moment my database is just one table, with about 50 million rows and 15 columns. I would like to extract one complete column from this table.
I have tried using RSQLite: dbGetQuery(db, 'select qs from CSI'), where qs and CSI are my column and table names respectively. The qs values are character strings. This query runs for hours before I give up (R version 3.3.3, RSQLite_1.1-2).
I also tried DB Browser for SQLite (v3.9.1) with the same query and again gave up after a few hours of run time. I do not have an ID key or any indexes, but since I want the entire column, I assumed that would not matter.
I am running on a 64-bit Windows machine with 16 GB of RAM. How can I extract columns from my table within a reasonable time? Or is there a better way I should be storing my data for easy access?
To get a column value, SQLite has to read the row up to the column. So to get the values from all rows, it has to read practically everything.
With an index on this column, you would have a covering index that would reduce the amount of data to be read from disk.
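A minimal sketch of what that covering index would look like for the table in the question (the index name is illustrative):

-- An index on qs alone covers the query 'select qs from CSI':
-- SQLite should be able to answer it by scanning the index b-tree
-- instead of reading every full 15-column row.
CREATE INDEX idx_csi_qs ON CSI(qs);
SELECT qs FROM CSI;

Building the index still has to read the whole table once, but afterwards the column can be scanned without touching the wide rows.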
If you do not actually need multiple values from the same row, consider storing the columns in different tables, or using a different database.
I am using SQLite because this needs to be cross-platform. I have about 10 tables with a small amount of data (maybe a few dozen rows each), but I also have a set of data which might have a million or more rows.
The small data set isn't really modified much, just queried, but the large data set will be queried and modified frequently.
Rather than having a single SQLite database with all the tables in it, I was wondering whether splitting it into two databases might be the smarter approach.
Basically I'd have one database, let's call it "settings", with the 10 tables in it. I'd then have another database, let's call it "userdata", with the million rows.
I'll be creating a third database called "audits" where I record each change to the "userdata" database. This database is expected to grow (for a short time period).
I am just wondering if people have an opinion on whether it is a good idea to split my data into multiple databases or if I should just have one big one.
My thinking is that queries on the "userdata" database might be slightly more efficient since it will only have one table.
Note, this is not long-term: the data will be queried and edited for about a week, and then it is done.
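If the data does get split, the files can still be queried together from one connection with ATTACH, so the split does not force separate connections. A rough sketch, with file, table, and column names that are purely illustrative:

-- Open settings.db as the main database, then attach the others.
ATTACH DATABASE 'userdata.db' AS userdata;
ATTACH DATABASE 'audits.db' AS audits;

-- Cross-database statements work as usual, e.g. recording changes:
INSERT INTO audits.changes (row_id, changed_at)
SELECT id, CURRENT_TIMESTAMP
FROM userdata.records
WHERE modified = 1;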
I am working with a sqlite3 database of around 70 gigabytes right now. This db has three tables: one with about 30 million rows, and two more with ~150 and ~300 million rows respectively, each table having between 6 and 11 columns.
The table with the fewest rows is consuming the bulk of the space, as it contains a raw data column of zipped BLOBs, generally running between 1 and 6 kilobytes per row; all other columns in the database are numeric, and the zipped data is immutable so inefficiency in modification is not a concern.
I have noticed that creating indexes on the numeric columns of this table:
[15:52:36] Query finished in 723.253 second(s).
takes several times as long as creating a comparable index on the table with five times as many rows:
[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).
Would it be better practice to store the BLOB data in a separate table to access with JOINs? The extra width of each row is the most likely candidate for the slow scan rate of this table.
My current suspicions are:
This is mostly due to the way data is read from disk, which makes skipping medium-sized amounts of data impractical and yields a very low ratio of usable data per sector read by the operating system, and
It is therefore probably standard practice, which I did not know as a relative newcomer to relational databases, to avoid putting large, variable-width data in the same table as other data that may need to be scanned without indexes,
but I would appreciate some feedback from someone with more knowledge in the field.
In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.
When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.
Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:
CREATE TABLE t (
    id   INTEGER PRIMARY KEY,
    a    INTEGER,
    [...],
    i    REAL,
    data BLOB NOT NULL   -- large value last, after all frequently read columns
);
With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.
If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.
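A minimal sketch of that separate-table layout, with illustrative names; the blob table shares the id of the main table so the two can be joined cheaply:

-- Narrow table: only the numeric columns, so many rows fit per page.
CREATE TABLE t_meta (
    id INTEGER PRIMARY KEY,
    a  INTEGER,
    i  REAL
);

-- Wide table: the zipped blob, keyed by the same id.
CREATE TABLE t_blob (
    id   INTEGER PRIMARY KEY REFERENCES t_meta(id),
    data BLOB NOT NULL
);

-- Scans and index builds on the numeric columns now touch only t_meta;
-- the blob is fetched with a join only when it is actually needed:
SELECT m.a, m.i, b.data
FROM t_meta AS m
JOIN t_blob AS b ON b.id = m.id
WHERE m.a > 100;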
When performing a SQLite query, does the size of the returned data set affect how long the query takes? Let's assume for this question that I don't actually access any of the data in the result; I just want to know whether the query itself takes longer. Let's also assume that I am simply selecting all rows and have no WHERE or ORDER BY clauses.
For example, say I have two tables, A and B: table A has a million rows, table B has 10 rows, and both tables have the same number and types of columns. Will selecting all rows in table A take longer than selecting all rows in table B?
This is a follow-up to my question "How does a cursor refer to deleted rows?". I am guessing that if SQLite makes a copy of the data during the query, then queries that return large data sets may take longer, unless there is an optimization that only copies the query result data if the data in the db changes while the query is still alive?
Depending on some details, yes, a query may take different amounts of time.
Example: I have a table with some 20k entries. I do a GLOB search that must try every row, with a LIMIT. If the LIMIT is met, the query can stop early; if not, it must go through the entire table (or JOIN). So searches with many results return more quickly than searches with only a few.
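A rough sketch of that situation (table and column names are illustrative): with no usable index, SQLite scans row by row, but it can stop as soon as 50 matches are found, so a common pattern may finish early while a rare pattern forces a full scan.

SELECT *
FROM entries
WHERE title GLOB '*foo*'
LIMIT 50;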
If the query must run through the same amount of data, I don't expect a significant difference between a smaller and a larger number of selected rows. There will of course still be I/O cost.