dplyr Filter Database Table with Large Number of Matches - r

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient when they will basically be put in an IN statement (It essentially hangs). Normally if I was writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?

Note sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive or the list of values can be inferred by some other columns, you can seek help from the between filter: filter(between(FIELD, a, b))?
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.

Looking at your restrictions, I would approach it similar to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bindrows() at the end. This way you'll have a nice piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, not sure of any other solutions.

Related

Writing Apache Arrow dataset in batches in R

I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset as described here in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
data_ids <- c(123, 234, 345, 456, 567)
# write data in batches
for (id in data_ids) {
## assume this is some complicated computation that returns 1,000,000 records
df <- data_load_helper(id)
df <- group_by(df, col_1, col_2, col_3)
arrow::write_dataset(df, "arrow_dataset/", format = 'arrow')
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format="arrow", partitioning=c("col_1", "col_2", "col_3"))
# check some data
dat %>%
filter(col_1 == 123) %>%
collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template. Otherwise every call to write_dataset will try and create part-0.arrow and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.

Am I using the most efficient (or right) R instructions?

first question, I'll try to go straight to the point.
I'm currently working with tables and I've chosen R because it has no limit with dataframe sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate it at my will, merges, concats and row and column manipulation works fine; but I recently had to run a loop with 0.00001 sec/instruction over a 6 Mill table row and it took over an hour.
Maybe the approach of R was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignments instead of c(list,new_element)) but, since as far as I can tell, this is not something that you can optimize with some sort of algorithm like graphs or heaps (is just tables, you have to iterate through it all) I was wondering if there might be some other instructions or other basic ways to work with tables that I don't know (assign, extract...) that take less time, or configuration over RStudio to improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list",nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"]%>%pull(1),"%Y-%m-%d")),format = "%Y-%m-%d")
}
The table, as aforementioned, has over 6 Mill rows and 25 variables. I want the list to be filled to append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, realized, had some tibbles inside) into a dataframe and followed the aforementioned keys.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 Mill rows: 25.25 sec.

How to use daply (from plyr) on 2billion rows using less memory

Does any one know, how one could apply the following function that converts 3 columns table into a matrix using a file that has 2 billion rows (with less than 10GB memory).
where x is 1st, y is 2nd and z is 3rd column.
library(plyr)
daply(a, .(x, y), function(x) x$z)
If you cannot load all the tuples at once
I know this is not the answer you are looking for: use SQLite.
The problem with R is that it must load the entire frame at once. If you don't have enough memory, then it simply can't continue.
SQLite is way smarter than R to do aggregates. Perhaps the most important feature is that it optimizes the memory available, and if it can, it does not need to read all the elements at once. See this for details on how to do it.
http://www.r-bloggers.com/using-sqlite-in-r/
If SQLite does not support the aggregate you want, you can create it yourself (see user defined functions in SQLite).
Alternatively you can try to partition your data (outside R), so you can aggregate in stages. But that will still require some sort of program that can read process the files in less than the available memory. Unix/MacOS/Linux sort is one of those utilities that can deal with more-than-available-memory data. It might be useful.

File organization - how to handle different combinations of filters on one data.frame efficiently?

I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1,20,by=2)
gross <- seq(2,20,by=2)
color <- c("green", "blue", "white")
height <- c(170,172,180,188)
library(data.table)
df <- data.table(net,gross,color,height)
In order to obtain results, I do apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R-scripts with a very specific job (no interaction between them) that calculate and write the results to an excel file using XL Connect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or multiple filters, as shown above with df_green_high. By filtering, filter files create a
new data.table and source the "source file" with this new filtered table.
I am currently challenged, since I have too many filter files. Having 7 variables, there is such a large number of combinations of filter, so I'll get lost sooner or later.
How can I do my analysis more efficient (reduce the number of "filter files"?)
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case, I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On SOF, I also read about creating a package, but I am not sure if this reasonable.
I usually do it like this:
create a list called say "my_case_list"
filter data, do computation on the filtered data
add a column called "case" to each filtered dataset. Fill this column with some string i.e. "case 1: color=="green" & height>175"
put this data to my_case_list
convert list to data.frame like object
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.

Efficent way to query an sqlite database with an R vector

I have a vector of values in R and want to get responding values from a sqlite database. I use the following code.
values = c()
for (a in keys)
{
result <- dbGetQuery(con," SELECT content FROM aacontent WHERE Id=?",a)
values = c(values,results)
}
Unfortunatly, this code is very slow. Is there a more efficent way to do this?
Thanks,
Johannes
If aacontent isn't very large then read it all into R and use something like R's match function, or the sqldf function, or data.table functions
If aacontent is too large for that, and keys is small-ish, then write keys to an sqlite table and do a join query. You might benefit from creating an index on one or both of them.
The are certainly pre-built tools for SQL querying tasks like this from R (since you're using SQLite, I'd be sure to check out sqldf), but in my experience I just end up writing lots of little helper-wrapper functions for building queries.
For instance, in your case, your problem isn't really the R piece, it's that you want to roll all the values in keys into one query. So you'd want a query that looks more like:
SELECT content FROM aacontent WHERE Id IN (val1,val2,...)
and then the trick is using paste in R to build the IN clause. I tend to just use a simple wrapper function on dbGetQuery that uses the ... argument and paste to stitch queries together from various pieces. Something like this:
myQuery <- function(con,...){
arg <- list(...)
res <- dbGetQuery(con,paste(arg,collapse = ""))
res
}
So that it's a bit easier to stitch together stuff when using IN clauses:
myQuery(con,"SELECT content FROM aacontent WHERE Id IN (",
paste(keys,collapse = ","),"))
Note that it's a bit harder if the values in keys are characters, since then you need to do some more work with paste to get single quotes around each element, but it's not that much more work.
This advice is all more relevant if the db in question is fairly small; if you're dealing with bigger data, Spacedman's suggestions are probably more worth looking into.

Resources