RSQLite Faster Subsetting of large Table? - sqlite

So I have a large dataset (see my previous question) that I need to subset based on IDs which I have in another table.
I use a statement like:
vars <- dbListFields(db, "UNIVERSE")
ids <- dbGetQuery(db, "SELECT ID FROM LIST1")
dbGetQuery(db,
  paste("CREATE TABLE SUB1 (",
        paste(vars, collapse = " int, "),
        " int )"))
dbGetQuery(db,
  paste("INSERT INTO SUB1 (",
        paste(vars, collapse = ","),
        ") SELECT * FROM UNIVERSE WHERE UNIVERSE.ID IN (",
        paste(t(ids), collapse = ","),
        ")"))
The code runs, but it takes a while since my table UNIVERSE is about 10 gigs in size. The major problem is that I'm going to have to run this for many different tables "LIST#" to make "SUB#", and the subsets are not disjoint, so I can't just delete records from UNIVERSE when I'm done with them.
I'm wondering if I've gone about subsetting the wrong way or if there's other ways I can speed this up?
Thanks for the help.

This is kind of an old question and I don't know if you found the solution or not. If UNIVERSE.ID is a unique, non-NULL integer, setting it up as an 'INTEGER PRIMARY KEY' should speed things up a lot. There's some code and discussion here:
http://www.mail-archive.com/r-sig-db%40stat.math.ethz.ch/msg00363.html
I don't know if using an inner join would speed things up or not; it might be worth a try too.
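For what it's worth, a sketch of the join-based variant (table and column names taken from the question; the database file name is made up):

```r
library(RSQLite)

db <- dbConnect(SQLite(), "universe.sqlite")  # hypothetical file name

# Build the subset with a join instead of a long IN (...) list; if
# UNIVERSE.ID is declared INTEGER PRIMARY KEY, each match is a fast
# rowid lookup rather than a full scan of the 10-gig table.
dbExecute(db, "CREATE TABLE SUB1 AS
               SELECT u.* FROM UNIVERSE u
               INNER JOIN LIST1 l ON u.ID = l.ID")
```

CREATE TABLE ... AS SELECT also sidesteps building the column list by hand, though note it does not carry over primary keys or constraints to the new table.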

Do you have an index on UNIVERSE.ID? I'm no SQLite guru, but generally you want fields that you are going to query on to have indexes.
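In RSQLite terms that's a one-liner (a sketch; the index name is made up):

```r
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_universe_id ON UNIVERSE(ID)")
```

With the index in place, the IN (...) filter can be answered by index lookups instead of a full scan of UNIVERSE.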

Related

Snowflake, Python/Jupyter analysis

I am new to Snowflake, and running a query to get a couple of days' data returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter, and the kernel restarts/dies before the query ends. Even if it got into Jupyter, I doubt I could analyze the data in any reasonable timeline (but maybe using dask?).
I am not really sure where to start. I am trying to check the data for missing values, and my first instinct was to use Jupyter, but I am lost at the moment.
My next idea is to stay within Snowflake and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/directions I could try, or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing values, note this will only find values that are an empty string.
It can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows a table has, and it knows how many NULLs each column contains. If you are loading data into a table, it might make more sense not to load empty strings and to use NULL instead; then, for no compute cost, you can run:
count(*) - count(column) as number_null_values
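Putting the two checks side by side in one pass (a sketch; my_table and column_value are placeholder names):

```sql
select count_if(column_value = '')    as number_empty_strings,
       count(*) - count(column_value) as number_nulls
from my_table;
```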
Also of note: if you have two tables in Snowflake, you can compare them via MINUS, e.g.
select * from table_1
minus
select * from table_2
This is useful for finding missing rows, though you do have to run it in both directions.
You can also HASH rows, or hash a whole table via HASH_AGG.
But normally when looking for missing data you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so the above, and COUNT DISTINCT-like commands, come in useful.
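A sketch of those comparisons (table names as in the examples above):

```sql
-- rows of table_1 missing from table_2
select * from table_1 minus select * from table_2;

-- and the reverse direction
select * from table_2 minus select * from table_1;

-- or compare entire tables at once:
-- matching hashes (almost certainly) mean matching contents
select (select hash_agg(*) from table_1) = (select hash_agg(*) from table_2);
```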

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient, since they will basically be put into an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can get help from the between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
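A sketch of the divide-and-conquer idea, assuming a batch size of 1,000 (the chunk size is arbitrary; tune it to what your database tolerates):

```r
library(dplyr)
library(purrr)

# split the 500K+ values into chunks small enough for an IN clause
batches <- split(my_values$FIELD,
                 ceiling(seq_along(my_values$FIELD) / 1000))

# one query per chunk, results row-bound together
result <- map_dfr(batches, function(b) {
  con %>% tbl("MY_TABLE") %>% filter(FIELD %in% b) %>% collect()
})
```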
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db query per value using purrr::map and then combine the results with dplyr::bind_rows() at the end. This way you'll have nice piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solution.

Quickest way to load data from PostgreSQL to R

I’m planning to do some data analysis with R; the datasets are stored in PostgreSQL tables and some of them contain up to 2 million records. I thought this would not be a big problem for R and that loading the records would be rather quick, but things turned out differently.
Doing something like this may take a minute or more, which is not what I was expecting:
library(RPostgreSQL);
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", user="me", password="my_password");
records <- dbGetQuery(con, statement = paste(
"SELECT *",
"FROM my_table",
"ORDER BY id")); # 1.5M records
Alternative code is even slower:
records2 <- dbSendQuery(con, "select * from my_table ORDER BY id")
fetch(records2,n=-1)
I can’t say my hardware is the most advanced in the world, but it’s a rather decent MacBook Pro with 8G RAM and SSD. When I fetch the same data with, let’s say, QGIS, things are done significantly faster.
What can one do to increase performance in such case? Alternative libraries? Tricks and hacks? Anything else?
You should exclude the ORDER BY: it is not part of loading the data, and it may significantly slow your query.
You can order the data afterwards, once it is in R memory. If you are looking for fast ordering, check this SO answer.
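A sketch of that: drop the ORDER BY from the SQL and sort in R instead:

```r
records <- dbGetQuery(con, "SELECT * FROM my_table")  # no ORDER BY
records <- records[order(records$id), ]               # sort in R memory
```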
More of a redesign than answer to the question, but...
You could always plug R directly into PostgreSQL, and run your query without moving the data anywhere. Can't move it any faster than not moving it at all :)
PL/R for PostgreSQL

Efficient way to query an sqlite database with an R vector

I have a vector of values in R and want to get responding values from a sqlite database. I use the following code.
values <- c()
for (a in keys) {
  result <- dbGetQuery(con, "SELECT content FROM aacontent WHERE Id = ?",
                       params = list(a))
  values <- c(values, result)
}
Unfortunately, this code is very slow. Is there a more efficient way to do this?
Thanks,
Johannes
If aacontent isn't very large then read it all into R and use something like R's match function, or the sqldf function, or data.table functions
If aacontent is too large for that, and keys is small-ish, then write keys to an sqlite table and do a join query. You might benefit from creating an index on one or both of them.
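A sketch of the second approach with RSQLite (the temporary table and index names are made up):

```r
# write the keys to a temporary table, index it, then join
dbWriteTable(con, "keys_tmp", data.frame(Id = keys), temporary = TRUE)
dbExecute(con, "CREATE INDEX idx_keys_tmp ON keys_tmp(Id)")
values <- dbGetQuery(con,
  "SELECT a.content FROM aacontent a JOIN keys_tmp k ON a.Id = k.Id")
```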
There are certainly pre-built tools for SQL querying tasks like this from R (since you're using SQLite, I'd be sure to check out sqldf), but in my experience I just end up writing lots of little helper-wrapper functions for building queries.
For instance, in your case, your problem isn't really the R piece, it's that you want to roll all the values in keys into one query. So you'd want a query that looks more like:
SELECT content FROM aacontent WHERE Id IN (val1,val2,...)
and then the trick is using paste in R to build the IN clause. I tend to just use a simple wrapper function on dbGetQuery that uses the ... argument and paste to stitch queries together from various pieces. Something like this:
myQuery <- function(con, ...) {
  arg <- list(...)
  res <- dbGetQuery(con, paste(arg, collapse = ""))
  res
}
So that it's a bit easier to stitch together stuff when using IN clauses:
myQuery(con, "SELECT content FROM aacontent WHERE Id IN (",
        paste(keys, collapse = ","), ")")
Note that it's a bit harder if the values in keys are characters, since then you need to do some more work with paste to get single quotes around each element, but it's not that much more work.
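For the character case, one way is to double any embedded single quotes and wrap each key (a sketch, reusing the myQuery helper above):

```r
quoted <- paste0("'", gsub("'", "''", keys), "'", collapse = ",")
myQuery(con, "SELECT content FROM aacontent WHERE Id IN (", quoted, ")")
```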
This advice is all more relevant if the db in question is fairly small; if you're dealing with bigger data, Spacedman's suggestions are probably more worth looking into.

sqlite subqueries with group_concat as columns in select statements

I have two tables: one, called watch_list, contains a list of items with some important attributes, and the other, called price_history, is just a list of prices. What I would like to do is group together the 10 lowest prices into a single column with a group_concat operation, then create a row with the item attributes from watch_list along with those 10 lowest prices for each item in watch_list. First I tried joins, but then I realized the operations were happening in the wrong order, so there was no way to get the desired result with a join. Then I tried the obvious thing: query price_history for every row in watch_list and glue everything together in the host environment. That worked but seemed very inefficient. Now I have the following query, which looks like it should work but isn't giving me the results I want. What is wrong with the following statement?
select w.asin,w.title,
(select group_concat(lowest_used_price) from price_history as p
where p.asin=w.asin limit 10)
as lowest_used
from watch_list as w
Basically I want the limit to apply before group_concat does anything, but I can't think of a SQL statement that will do that.
Figured it out, as somebody once said "All problems in computer science can be solved by another level of indirection." and in this case an extra select subquery did the trick:
select w.asin, w.title,
  (select group_concat(lowest_used_price)
   from (select lowest_used_price from price_history as p
         where p.asin = w.asin
         order by lowest_used_price
         limit 10)) as lowest_used
from watch_list as w
