Efficient way to query an SQLite database with an R vector - sqlite

I have a vector of values in R and want to get the corresponding values from an SQLite database. I use the following code:
values <- c()
for (a in keys) {
  result <- dbGetQuery(con, "SELECT content FROM aacontent WHERE Id = ?", params = list(a))
  values <- c(values, result$content)
}
Unfortunately, this code is very slow. Is there a more efficient way to do this?
Thanks,
Johannes

If aacontent isn't very large, then read it all into R and use something like R's match function, the sqldf function, or data.table functions.
If aacontent is too large for that, and keys is small-ish, then write keys to an SQLite table and do a join query (see the sketch below). You might benefit from creating an index on one or both of them.
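A minimal sketch of the join route, assuming con is an open RSQLite connection and keys holds the ids to look up; the staging table and index names here are illustrative:
library(DBI)
# stage the keys in a temporary table, then join instead of looping
dbWriteTable(con, "keys_tmp", data.frame(Id = keys), temporary = TRUE)
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_aacontent_id ON aacontent(Id)")
values <- dbGetQuery(con,
  "SELECT a.content FROM aacontent a JOIN keys_tmp k ON a.Id = k.Id")$content
This turns N round-trips into a single query, which is usually where the time goes.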

There are certainly pre-built tools for SQL querying tasks like this from R (since you're using SQLite, I'd be sure to check out sqldf), but in my experience I just end up writing lots of little helper-wrapper functions for building queries.
For instance, in your case, your problem isn't really the R piece, it's that you want to roll all the values in keys into one query. So you'd want a query that looks more like:
SELECT content FROM aacontent WHERE Id IN (val1,val2,...)
and then the trick is using paste in R to build the IN clause. I tend to just use a simple wrapper function on dbGetQuery that uses the ... argument and paste to stitch queries together from various pieces. Something like this:
myQuery <- function(con, ...) {
  arg <- list(...)
  res <- dbGetQuery(con, paste(arg, collapse = ""))
  res
}
So that it's a bit easier to stitch together stuff when using IN clauses:
myQuery(con, "SELECT content FROM aacontent WHERE Id IN (",
        paste(keys, collapse = ","), ")")
Note that it's a bit harder if the values in keys are characters, since then you need to do some more work with paste to get single quotes around each element, but it's not that much more work.
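A minimal sketch of that quoting step, assuming keys is a character vector; rather than hand-rolling the single quotes with paste, DBI's dbQuoteString() handles the quoting and escaping:
quoted <- DBI::dbQuoteString(con, keys)   # safely single-quotes each element
in_clause <- paste(quoted, collapse = ",")
myQuery(con, "SELECT content FROM aacontent WHERE Id IN (", in_clause, ")")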
This advice is all more relevant if the db in question is fairly small; if you're dealing with bigger data, Spacedman's suggestions are probably more worth looking into.

Related

Creating a data frame by applying a function to each element of a vector and combining the results

I am working on a project where we frequently work with a list of usernames. We also have a function to take a username and return a dataframe with that user's data. E.g.
users = c("bob", "john", "michael")
get_data_for_user = function(user) {
  data.frame(user = user, data = sample(10))
}
We often:
Iterate over each element of users
Call get_data_for_user to get their data
rbind the results into a single data frame
I am currently doing this in a purely imperative way:
ret = get_data_for_user(users[1])
for (i in 2:length(users)) {
  ret = rbind(ret, get_data_for_user(users[i]))
}
This works, but my impression is that all the cool kids are now using libraries like purrr to do this in a single line. I am fairly new to purrr, and the closest I can see is using map_df to convert the vector of usernames to a vector of dataframes. I.e.
dfs = map_df(users, get_data_for_user)
That is, it seems like I would still be on the hook for writing a loop to do the rbind.
I'd like to clarify whether my solution (which works) is currently considered best practice in R / amongst users of the tidyverse.
Thanks.
That looks right to me - map_df handles the rbind internally (you'll need {dplyr} in addition to {purrr}).
FWIW, purrr::map_dfr() will do the same thing, but the function name is a bit more explicit, noting that it will be binding rows; purrr::map_dfc() binds columns.
I would suggest a slight adjustment:
dfs = map_dfr(users, get_data_for_user)
map_dfr() explicitly states that you want to do a row bind, and I would be inclined to call this best practice when working with purrr.
For the sake of completeness, here are some additional approaches:
Using base R functions:
Reduce(rbind, lapply(users, get_data_for_user))
Using data.table:
library(data.table)
rbindlist(lapply(users, get_data_for_user))

Am I using the most efficient (or right) R instructions?

First question here, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate the data at will; merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec per instruction, over a table of 6 million rows, and it took over an hour.
Maybe the approach was wrong to begin with. I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)), but since, as far as I can tell, this is not something you can optimize with an algorithm like graphs or heaps (it's just tables; you have to iterate through them all), I was wondering whether there are other instructions or more basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some RStudio configuration that improves performance.
This is the loop, in case it helps to understand the question:
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")),
                         format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. Once the list is filled, I want to append it to the table as a column.
Please let me know if this lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the keys above.
df <- as.data.frame(table)
In this case, the dates were converted directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
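For reference, a vectorized sketch of the same conversion, assuming the Date_of_count column holds "%Y-%m-%d" strings; as.Date() and format() are vectorized, so the per-row loop disappears entirely:
df <- as.data.frame(table)
# one pass over the whole column instead of 6 million iterations
df$Date_of_count_chr <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"), "%Y-%m-%d")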

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient when they will basically be put in an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but here are a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use a range filter instead: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db query per batch using purrr::map and then combine the results with dplyr::bind_rows() at the end (see the sketch below). This way you'll have nice piped code that adapts if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
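A minimal sketch of that batched approach, reusing con, MY_TABLE, and my_values from the question; the batch size of 1000 is an illustrative guess, and collect() assumes you want the matches pulled into R:
library(dplyr)
library(purrr)
# split the 500K keys into chunks small enough for an IN clause
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))
result <- map_dfr(batches, function(b) {
  con %>%
    tbl("MY_TABLE") %>%
    filter(FIELD %in% !!b) %>%   # dbplyr translates this into an IN clause
    collect()
})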

Storing Multiple Query results in a single variable

I'm running into the problem of storing the results of multiple queries in a list.
Currently I have the results from a previously executed query stored in a list.
Currently I can't find a way to store the new results in a list or anything like that.
queryResults2 holds a 2D list.
# Sample code for 2nd select
for (i in 1:length(queryResults[[1]])) {
  query_pd <- paste0("SELECT price_date, price FROM price_master WHERE stock_id = '",
                     queryResults[[1]][i], "' ORDER BY price_date")
  queryResults2 <- dbGetQuery(conn, query_pd)
  # storing value here
}
First of all, don't worry too much about the whole "don't use loops" thing. Here are three basic options:
for loop. The basic template would be:
result <- vector("list", length(queryResults[[1]]))
for (i in ...) {
  # Do stuff
  result[[i]] <- something
}
lapply. Here the basic format would be:
lapply(seq_along(queryResults[[1]]), function(i) dbGetQuery(conn, paste(...)))
You don't necessarily need the function to take the query and connection object as arguments, R's scoping will be able to find them if they exist in the calling environment.
Run it all as one query with an IN clause and then split it afterwards (see the sketch below).
Personally, I try to use (3) as much as possible.
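A minimal sketch of option (3), reusing the table and column names from the question and assuming the stock ids are numeric:
ids <- queryResults[[1]]
sql <- paste0("SELECT stock_id, price_date, price FROM price_master WHERE stock_id IN (",
              paste(ids, collapse = ","),
              ") ORDER BY price_date")
all_prices <- dbGetQuery(conn, sql)
# one data frame per stock_id, as a named list
queryResults2 <- split(all_prices, all_prices$stock_id)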

Ordered Map / Hash Table in R

While working with lists I've noticed an issue that I didn't expect.
result5 <- vector("list", length(queryResults[[1]]))
for (i in 1:length(queryResults[[1]])) {
  id <- queryResults[[1]][i]
  result5[[id]] <- getPrices(id)
}
The problem is that after this code runs, instead of the result staying the same size (whatever length queryResults[[1]] is), it grows up to the largest index, creating a bunch of NULL entries in the middle.
result5 currently stores a number of (int, double) lists, so it looks like:
result5[[index(int)]][[row]][col]
While on its own that's not too problematic, I would rather avoid it, simply for easier size calculations later on.
For clarification, id is an integer. And in the given case, a for loop offers the same performance as, but greater convenience than, the apply functions.
After some testing, it seems the easiest way of doing it is:
Using the hash package to convert the results into a hash:
result6 <- hash(queryResults[[1]], lapply(queryResults[[1]], getPrices))
And, if it needs to be accessed, calling
result6[[toString(id)]]
The difference in performance is marginal, although it's still fairly annoying to have to include toString in your code.
It's not clear exactly what your question is, but judging by the structure of the loop, you probably want
result5[[i]] <- getPrices(id)
rather than result5[[id]] <- getPrices(id).
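A minimal sketch combining that fix with named access, assuming queryResults[[1]] holds the integer ids and getPrices() is defined as before:
ids <- queryResults[[1]]
result5 <- vector("list", length(ids))
for (i in seq_along(ids)) {
  result5[[i]] <- getPrices(ids[i])   # positional index: no gaps
}
names(result5) <- ids                 # ids become character names
result5[[as.character(id)]]           # lookup by id, no toString needed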
