How do I perform the GROUP BY ... HAVING query using dbplyr in dplyr?
I have a list of IDs and I have to group by IDs which are not in this list.
Is there a way I can directly execute the query with tbl(), if not what is the dplyr verb for the same?
Using group_by_if function from dplyr doesn't seem to do it.
I want to execute something like
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1 HAVING COL2 NOT IN ID_LIST
where ID_LIST is an R vector
For the example you have given, it is not clear to me how
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1
HAVING COL2 NOT IN ID_LIST
is different from
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
AND COL2 NOT IN ID_LIST
GROUP BY COL1
Hence @Rohit's suggestion of applying a filter is an effective solution.
HAVING operates much like WHERE, but it is applied after aggregation, with the added feature that you can use aggregate functions in the HAVING clause. See this discussion. But in this case you are not using aggregate functions in the HAVING clause, so you are free to use a WHERE clause instead.
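For what it's worth, here is a minimal dbplyr sketch of the WHERE-based version (assuming a DBI connection con, a remote table TBL, and the R vector ID_LIST from the question; whether year() translates depends on your backend):
library(dplyr)
library(dbplyr)

tbl(con, "TBL") %>%
  filter(year(DATE) == 2001,        # -> WHERE YEAR(DATE) = 2001 on many backends
         !COL2 %in% ID_LIST) %>%    # -> AND COL2 NOT IN (...), with the R vector inlined
  group_by(COL1) %>%
  summarise(n = n()) %>%            # stand-in aggregation; adjust to your real query
  show_query()                      # inspect the generated SQL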
Regarding the nested SQL queries that dbplyr produces: it might seem counterintuitive given the usual emphasis on clean, human-readable code, but for dbplyr auto-generated queries I recommend not worrying about the quality of the machine-generated code. It is written by a machine and it is (mostly) read by a machine, so its human readability is less important.
Efficiency could be a concern with many layers of nesting. However, on 2017-06-09 dbplyr gained a basic SQL optimiser. I have not found (though I have not tested extensively) nested auto-generated queries to perform significantly worse than non-nested, user-written queries. But if performance is critical, you probably want to build your SQL query manually by paste()-ing together text strings in R.
One final thought - the length of ID_LIST is also important to consider. It is discussed in this question.
Related
I have a large MySQL table (92 cols, 3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and had a question or two about dbplyr's lazy evaluation as I'm experiencing some considerable slowness.
Suppose I have connected to my database and wish to perform a select operation to reduce the number of columns:
results <- my_conn_table %>%
  select(id, type, date)
Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):
> results
# Source:   lazy query [?? x 3]
# Database: mysql []
     id type     date
  <int> <chr>    <date>
1     1 Customer 1997-01-04
2     2 Customer 1997-02-08
3     3 Business 1997-03-29
...
However, if I attempt to perform a filter operation with:
results <- my_df %>%
  select(id, type, date) %>%
  filter(type == "Business")
the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time simply a function of nrow ~= 3 million? (In other words, is there nothing I can do about it because the dataset is large?) Or can someone suggest some general ways to make this more performant?
My initial understanding of the lazy evaluation was that the filter() query would only return the top few rows, preventing the long-run-time scenario. If I want all the data, I can run collect() to gather the results into my local R session (which I would expect to take a considerable amount of time depending on the query).
Building on @NovaEthos's answer, you can call show_query(results) to get the SQL query that dbplyr generated and is passing to the database. Posting this query here will make it clear whether there is any inefficiency in how the database is being queried.
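For example, a minimal sketch assuming the pipeline from the question:
my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  show_query()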
A further thing to investigate is how your data is indexed. Like an index in a book, an index on a database table helps find records faster.
You might only be asking for 1000 records with type = 'business' but if these records only start from row 2 million, then the computer has to scan two-thirds of your data before it finds the first record of interest.
You can add indexes to existing database tables using something like the following:
query <- paste0("CREATE NONCLUSTERED INDEX my_index_name ON ",
                database_name, ".", schema_name, ".", table_name,
                " (", column_name, ");")
DBI::dbExecute(db_connection, query)
Note that this syntax is for SQL Server; your database may require slightly different syntax. In practice I wrap this in a function with additional checks, such as 'does the indexing column exist in the table?'. This is demonstrated here.
One thing you could try is to run the same query but using the DBI package:
res <- DBI::dbGetQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
If it's the same speed, it's your data. If not, it's dbplyr. You could also try playing around with different filter parameters (maybe specify a single ID or date and see how long that takes) and see if the problem persists.
I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, my_values contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient, since the values will essentially be dumped into one giant IN statement (the query essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use a between filter instead: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
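A rough sketch of the batching idea (assuming the con and my_values from the question, and an arbitrary batch size of 10,000 that you would tune to what your database tolerates):
library(dplyr)
library(purrr)

# split the 500K+ values into chunks of at most 10,000
batches <- split(my_values$FIELD,
                 ceiling(seq_along(my_values$FIELD) / 10000))

# query each batch separately and stack the results
result <- map_dfr(batches, function(batch) {
  tbl(con, "MY_TABLE") %>%
    filter(FIELD %in% batch) %>%
    collect()
})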
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end. This way you'll have nice piped code that adapts if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
I want to perform association rule mining in R using arules:apriori function and that needs a transactions type input. This is nothing but list of factors with each element representing the unique set of products purchased in that transaction. An example below:
  products transaction
1 {a,b}              1
2 {a,b,c}            2
3 {b}                3
In the package documentation, they recommend using split to generate this like so:
split(DT[,"products",with=FALSE], DT[,"transaction",with=FALSE])
But when I try the same on a large set of transactions, it is painfully slow. Example MWE below:
library(data.table)
#Number of transactions
ntrxn = 1000000
#Generating a dummy transactions table
#Recycling transaction vector over products
DT = data.table(transaction = seq(1, ntrxn, 1),
                products = rep(letters[1:3], ntrxn))[order(transaction)]
TEST = split(DT[,"products",with=FALSE], DT[,"transaction",with=FALSE])
Is there a way to speed this up by leveraging data.table by condition? I have tried this:
DT[,list(as.factor(.SD$products)),by=transaction]
But it just gives me back a data.table (which makes sense in hindsight). Is there a way to get a list of vectors using a similar expression, leveraging the performant data.table internals to take care of the heavy lifting?
If data.table alone is not the answer here, I am really curious which approach would get me to the output I am looking for.
Wrapping the OP's last line of code to make a list column:
DT[, .(.(products)), by=transaction]
.() is an alias for list(). This is faster on my computer, anyway.
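If you then need a plain named list of vectors (similar to what split() returns, e.g. for coercing to arules transactions), here is a small sketch building on the same idea; the column name items is just for illustration:
res <- DT[, .(items = .(products)), by = transaction]
trans_list <- setNames(res$items, res$transaction)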
I have some tables in Hive that I need to join together. Since I need to do some work on each of them (normalize the key, remove outliers, ...), and since I keep adding more and more tables, this chaining process has turned into a big mess.
It is easy to lose track of where you are, and the query is getting out of control.
However, I have a pretty clear idea of how the final table should look, and each column is fairly independent of the other tables.
For example:
table_class1
name       id  score
Alex        1     90
Chad        3     50
...
table_class2
name       id  score
Alexandar   1     50
Benjamin    2    100
...
In the end, I really want something that looks like:
name  id  class1  class2  ...
alex   1      90      50
ben    2      NA     100
chad   3      50      NA
I know it could be done with left outer joins, but I am having a hard time creating a separate table for each class after normalization and then left outer joining each of them against the union of the keys...
I am thinking about dumping the processed data into NoSQL (HBase) in a long key-value format, like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like melt and cast from the R reshape package to bring that data back into a wide table.
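To illustrate the melt/cast idea on a toy example (a sketch using reshape2::dcast, the successor of reshape's cast, on a small data frame shaped like the key-value layout above):
library(reshape2)

# long key-value layout, one row per (key, variable, value)
long <- data.frame(
  name     = c("alex", "chad", "alex", "benjamin"),
  id       = c(1, 3, 1, 2),
  variable = c("class1", "class1", "class2", "class2"),
  value    = c(90, 50, 50, 100)
)

# pivot back to one row per key, one column per class
wide <- dcast(long, name + id ~ variable, value.var = "value")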
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) I don't know whether this is a legitimate approach.
(2) If so, is there any big data tool to pivot a long HBase table into a wide Hive table?
Honestly, I would love to help more, but I am not clear about what you're trying to achieve (maybe because I've never used R). Please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them, and you can even CREATE VIEW to simplify the query if it's too large; maybe that's what you're looking for (HIVE manual). Unless you have a good reason for using HBase, I'd stick to HIVE alone to avoid the additional complexity. Don't get me wrong, there are a lot of valid reasons for using HBase.
About your second question: you can define and use HBase tables as HIVE tables, and you can even CREATE them and INSERT ... SELECT into them entirely inside HIVE. Is that what you're looking for? See the HBase/HIVE integration doc.
One last thing, in case you don't know: you can create custom functions (UDFs) in HIVE very easily to help with the tedious normalization process; take a look at this.
I have a vector of values in R and want to get the corresponding values from an SQLite database. I use the following code.
values <- c()
for (a in keys) {
  # bind the current key as a query parameter and collect the matching content
  result <- dbGetQuery(con, "SELECT content FROM aacontent WHERE Id = ?",
                       params = list(a))
  values <- c(values, result$content)
}
Unfortunately, this code is very slow. Is there a more efficient way to do this?
Thanks,
Johannes
If aacontent isn't very large, then read it all into R and use something like R's match function, the sqldf function, or data.table functions.
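A rough sketch of that first route (assuming aacontent fits comfortably in memory and Id is unique):
aacontent_df <- DBI::dbGetQuery(con, "SELECT Id, content FROM aacontent")
values <- aacontent_df$content[match(keys, aacontent_df$Id)]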
If aacontent is too large for that, and keys is small-ish, then write keys to an sqlite table and do a join query. You might benefit from creating an index on one or both of them.
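And a rough sketch of the join route with RSQLite (the temporary table and index names here are made up):
library(DBI)

# write the keys into a temporary table and index it
dbWriteTable(con, "keys_tmp", data.frame(Id = keys), temporary = TRUE)
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_keys_tmp ON keys_tmp (Id)")

# join instead of issuing one query per key
values <- dbGetQuery(con,
  "SELECT a.content FROM aacontent a JOIN keys_tmp k ON a.Id = k.Id")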
There are certainly pre-built tools for SQL querying tasks like this from R (since you're using SQLite, I'd be sure to check out sqldf), but in my experience I just end up writing lots of little helper-wrapper functions for building queries.
For instance, in your case, your problem isn't really the R piece, it's that you want to roll all the values in keys into one query. So you'd want a query that looks more like:
SELECT content FROM aacontent WHERE Id IN (val1,val2,...)
and then the trick is using paste in R to build the IN clause. I tend to just use a simple wrapper function on dbGetQuery that uses the ... argument and paste to stitch queries together from various pieces. Something like this:
myQuery <- function(con, ...) {
  arg <- list(...)
  res <- dbGetQuery(con, paste(arg, collapse = ""))
  res
}
So that it's a bit easier to stitch together stuff when using IN clauses:
myQuery(con,"SELECT content FROM aacontent WHERE Id IN (",
paste(keys,collapse = ","),"))
Note that it's a bit harder if the values in keys are characters, since then you need to do some more work with paste to get single quotes around each element, but it's not that much more work.
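For instance, a minimal sketch of the character case (assuming the key values contain no embedded single quotes, which would need escaping):
in_clause <- paste0("'", keys, "'", collapse = ",")
myQuery(con, "SELECT content FROM aacontent WHERE Id IN (", in_clause, ")")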
This advice is all more relevant if the db in question is fairly small; if you're dealing with bigger data, Spacedman's suggestions are probably more worth looking into.