I want to perform association rule mining in R using the arules::apriori function, which needs a transactions-type input. This is essentially a list of factors, with each element representing the unique set of products purchased in that transaction. An example below:
  products transaction
1    {a,b}           1
2  {a,b,c}           2
3      {b}           3
In the package documentation, they recommend using split to generate this like so:
split(DT[,"products",with=FALSE], DT[,"transaction",with=FALSE])
But when I try the same on a large set of transactions, it is painfully slow. MWE below:
library(data.table)
#Number of transactions
ntrxn = 1000000
#Generating a dummy transactions table
#Recycling transaction vector over products
DT = data.table(transaction = seq(1, ntrxn, 1),
                products = rep(letters[1:3], ntrxn))[order(transaction)]
TEST = split(DT[,"products",with=FALSE], DT[,"transaction",with=FALSE])
Is there a way to speed this up by leveraging data.table's by-group processing? I have tried this:
DT[,list(as.factor(.SD$products)),by=transaction]
But it just gives me back a data.table (which makes sense in hindsight). Is there a way to get a list of vectors using a similar expression, but letting the performant data.table internals take care of the heavy lifting?
If data.table alone is not the answer here, I am really curious which approach would get me to the output I am looking for.
Wrapping the OP's last line of code to make a list column:
DT[, .(.(products)), by=transaction]
.() is an alias for list(). This is faster on my computer, anyway.
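For reference, a minimal sketch of going from that result to the input apriori() expects, assuming the products column is character and that coercing a named list to "transactions" is what you're after; the object and column names (grouped, items, trx, and the support/confidence values) are just illustrative:
library(data.table)
library(arules)

# Build the list column, then pull it out as a plain named list
grouped  <- DT[, .(items = .(products)), by = transaction]
trx_list <- setNames(grouped$items, grouped$transaction)

# Coerce to the transactions class that apriori() consumes
trx   <- as(trx_list, "transactions")
rules <- apriori(trx, parameter = list(supp = 0.1, conf = 0.5))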
This is my first question; I'll try to go straight to the point.
I'm currently working with tables, and I've chosen R because it has no hard limit on data frame size and can perform many operations on the data within the tables. I'm happy with that, as I can manipulate the data at will: merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop at roughly 0.00001 sec per instruction over a 6-million-row table, and it took over an hour.
Maybe my approach in R was wrong to begin with. I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)), but as far as I can tell this is not something you can optimize with an algorithm like graphs or heaps (it's just a table; you have to iterate through it all). So I was wondering if there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some RStudio configuration that improves performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list", nrow(table[,"Date_of_count"]))
for (i in 1:nrow(table[,"Date_of_count"])) {
  # row by row: parse the date and format it back to "%Y-%m-%d"
  my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"] %>% pull(1), "%Y-%m-%d")),
                         format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once finished.
Please let me know if the question lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and to work properly with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 million rows: 25.25 sec.
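For comparison, a vectorized sketch of the same date reformatting, with the column name taken from the question and the rest illustrative; this avoids the row-by-row loop entirely:
# df is the plain data.frame version of the original table
df <- as.data.frame(table)

# One vectorized call over the whole column instead of a
# 6-million-iteration loop
df$Date_of_count <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"),
                           format = "%Y-%m-%d")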
I understand from excellent resources here, here, and here that data.table utilises automatic indexing (to create a key, i.e. supercharged row names) and binary-search-based subsetting, in contrast to the tidyverse, which relies on vector scanning.
I understand that vector scanning requires scanning each individual row and the creation of nrow(dataset) length logical vectors, and that doing this repeatedly is not as efficient.
I'm wondering if someone can help me frame exactly how these two methods mean that data.table operations run a lot faster than tidyverse when you need to group by a variable. That is, is it because data.table automatically indexes the group_by column, breaks the data into grouped subsets, and runs operations on each subset, whereas a vector-scanning approach would require generating one logical vector per unique group, running operations on each of those logical vectors, and then collating the results?
Also, according to the data.table vignette,
We can set keys on multiple columns and the column can be of different
types...
Since the rows are reordered, a data.table can have at most one key
because it can not be sorted in more than one way.
What does it mean that we can set keys on multiple columns and yet a data.table can have at most one key? That is, is it that at any given moment during an operation there is only one reference key, but which column(s) the key is set on can change as we progress to another component of the overall operation?
Thank you in advance!
No, there is no such mechanism.
There are different ways of finding groups, and then of computing an expression by group. Each of these can be implemented differently, and neither is tied to keys or indexes. Also, data.table does not automatically create a key/index during a group-by (as of now).
data.table has a very fast, carefully implemented order function, which is used to find the groups. It was later contributed to base R. There is an idea to use it in dplyr to speed up grouping: https://github.com/tidyverse/dplyr/issues/4406
data.table's order function has been improved since then, though, and now scales even better.
Aside from finding groups, there is the part about computing an expression. Evaluating a user-defined function will always be much slower. Many common functions are internally optimized so that they don't switch between R and C for every group; for these, data.table has very carefully implemented "GForce" functions. I'm not sure, but the dplyr equivalent is called "hybrid evaluation".
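A quick way to see whether GForce kicks in is the verbose argument of [.data.table; a minimal sketch on a made-up toy table:
library(data.table)

# Toy data just to demonstrate the verbose output
DT <- data.table(g = sample(letters[1:5], 1e6, TRUE), x = runif(1e6))

# With verbose = TRUE, data.table reports whether j was GForce-optimised
DT[, sum(x), by = g, verbose = TRUE]
# the verbose output should mention something like "GForce optimized j to ..."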
It is always important to test on your particular data and use case. If you have just 2 unique groups in the data, then fast grouping algorithms will not shine much.
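A minimal timing sketch for checking this on your own data (group counts and sizes are what matter; the table below is just a stand-in):
library(data.table)
library(dplyr)

# Stand-in data: many rows, many groups (adjust to match your real case)
n  <- 1e7
DT <- data.table(g = sample(1e5, n, replace = TRUE), x = rnorm(n))
df <- as.data.frame(DT)

system.time(DT[, .(m = mean(x)), by = g])
system.time(df %>% group_by(g) %>% summarise(m = mean(x)))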
There is also a community repository meant to describe data.table's algorithms, https://github.com/asantucci/algo_data.table, but it is not very active. I recently posted a question there about "group-by optimization" and will paste it here as well. The answer was provided by data.table author Matt Dowle.
Q: Does GForce allocate memory for the biggest group, then copy the values of a group there to aggregate, so it can benefit from being contiguous in memory and be more cache-efficient? If so, do we check whether the groups are already sorted, so we can avoid the allocation and copy?
A: GForce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-GForce (dogroups.c), which copies to the largest group. See the branch in dogroups.c that knows whether groups are already grouped: it switches to a memcpy. The memcpy is very fast (contiguous, pre-fetch), so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint; it's an offset from the SEXP.
I am currently learning data.table in R. A few questions have got me confused:
1. Does subsetting columns always preserve the order of records? (I.e., rows 1, 2, 3 will stay as rows 1, 2, 3 instead of 1, 3, 2.) Also, does the same conclusion apply to different expressions, such as DB[[1]], DB$V1, etc.?
2. When subsetting multiple columns, I know I need to use something like DB[, .(V1, V2)], but I am confused about the result of DB[, V1, V2]. The code runs and seems to produce a result, but the rows are not in the same order as in the original table. If someone can explain what the latter code means, that would be a great help.
Thanks a lot!
I'd like to start with a small suggestion... if you post a data-processing question on SO, it is enormously better to include reproducible code in the question, and the expected output if it isn't obvious. You will reach a much bigger audience and gather better solutions. This is common practice on the r tag.
Subsetting preserves order. The underlying storage of the data is column-oriented, unlike a regular SQL database (which is not aware of row order); it works exactly the same as subsetting a vector in base R, just much faster.
Regarding [[ and $, these are just methods for extracting a column from a data.table (and from a list in general); you can use DB[[1]], DB[["V1"]], or DB$V1. They behave differently depending on whether the column/list element exists.
The third argument of the data.table [ operator is by, which expects the columns to group over, so DB[, V1, V2] queries column V1 grouped by V2, without using any aggregate function. This is very different from DB[, .(V1, V2)], DB[, c("V1","V2"), with=FALSE], DB[, list(V1,V2)], DB[, .SD, .SDcols=c("V1","V2")], etc. Most of the API is borrowed from base R functions like subset() and with().
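A tiny illustration of the difference, on made-up toy data using the column names from the question:
library(data.table)
DB <- data.table(V1 = 1:6, V2 = c("b", "a", "b", "a", "b", "a"))

# Column subset: rows come back in their original order
DB[, .(V1, V2)]

# Third argument is `by`: V1 grouped by V2, so rows come back
# in group order ("b" rows first, then "a" rows), not original order
DB[, V1, V2]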
Finally, I would recommend going through the data.table vignettes; there is also my recent, longish post that walks through various data.table examples: Boost Your Data Munging with R.
When doing sequencing, I normally apply TraMineR's seqdef function on a dataset to generate a single sequence object:
sequence_object <- seqdef(data)
However, let's say I want to loop through a dataframe and generate 1 sequence object per every chunk of 10 columns. Then I would do something like this:
colpicks <- seq(10,1000,by=10)
mapply(function(start,stop) seqdef(df[,start:stop]), colpicks-9, colpicks)
Now, I want to store these objects in some suitable manner. Two questions:
1. What is the most suitable way of storing (or maybe just automatically naming) 100 objects, so that I can easily loop through each of them at a later point?
2. How can I modify my code above so that it stores the data per your answer to (1)?
"Most suitable" is completely subjective and dependent on your goal.
I'm assuming this question is related to your previous question, and thus I would suggest setting the SIMPLIFY argument of mapply to FALSE:
myMatrixList <- mapply(...., SIMPLIFY=FALSE)
However, even that is not necessary, as you can just combine this with the sapply from the previous question and skip the middle step.
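A minimal sketch of keeping the objects in one named list (TraMineR and a 1000-column df assumed from the question; the naming scheme is just a suggestion):
library(TraMineR)

colpicks <- seq(10, 1000, by = 10)

# Keep mapply from simplifying so the result stays a plain list
seq_objects <- mapply(function(start, stop) seqdef(df[, start:stop]),
                      colpicks - 9, colpicks,
                      SIMPLIFY = FALSE)

# Name each element after the column range it came from
names(seq_objects) <- paste0("cols_", colpicks - 9, "_", colpicks)

# Later on, access one by name or loop over all of them
seq_objects[["cols_1_10"]]
for (nm in names(seq_objects)) {
  print(dim(seq_objects[[nm]]))
}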
I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data, user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine so.
But merge doesn't take advantage of these data frames being sorted on the common column, which makes the merge pretty CPU/memory heavy, even though it could be done in O(n).
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?
I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve.
For most practical purposes, a data.table is a data.frame plus an index. As a consequence, when used right, this improves the performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've set it up, functions like merge can easily use the index for better performance.
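A rough sketch of the keyed data.table approach, with the column name taken from the question's example and the rest illustrative:
library(data.table)

# One-time cost: convert and key both tables (this sorts them by user)
DT.data <- as.data.table(some.data)
DT.user <- as.data.table(user)
setkey(DT.data, user)
setkey(DT.user, user)

# Keyed merge on the common column; with fully overlapping, duplicate-free
# keys this gives the same rows as merge(some.data, user)
m <- merge(DT.data, DT.user, by = "user")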
If your set of common keys/indexes is totally overlapping, that is...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE, and they are, as you said, sorted, and there are no key duplicates, then your merging problem is reduced to adding columns to a data.frame. Something along the lines of...
library(log4r)

# my.logger is assumed to be an already-configured log4r logger,
# e.g. my.logger <- create.logger(logfile = "merge.log", level = "INFO")
t1 <- system.time(z <- merge(user, some.data, by = 'user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))

t2 <- Sys.time()
r <- data.frame(user.id = user$user.id, V1.x = user$V1, V2.x = user$V2)
r[, names(some.data)] <- some.data[, names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3 - t2))
If the assumptions above do not hold, then it gets slightly messier (set union of both key sets, a translation function, NA padding), but the sorted and fully overlapping assumptions alone get you a long way ahead.
Notice also that the timing of the second approach is biased, since it calls Sys.time() twice and takes the difference, unlike the merge() timing, which uses a single system.time() call.
(Excuse my lame usage of S.O. mark-up)