Speed up complex data.table operation (subset, sum, group) in R

I have a large data.table which I need to subset, sum, and group in the same way at several points in my code. Therefore, I store the result to save time. The operation still takes rather long, and I would like to know how to speed it up.
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
The keys of inventory are period and articleID. The size varies depending on the parameters but is always greater than 3 GB. It has about 62,670,000 rows of 7 variables.
My thoughts so far on each part of the expression:
1. Subset: period > p
This could be faster with a keyed (binary search) lookup instead of the vector scan, but I would need to generate the sequence of periods from p+1 to max(period) for that, which takes additional time. Also, the data.table is already sorted by period, so I suppose the speed gain would not be large.
2. Aggregate: sum(incoming)
No idea how to improve this.
3. Group: by = articleID
This grouping might be faster with a different key on the table, but that would have a negative impact on the rest of my code.
4. Access: [, V1]
This could be skipped here and done in later operations, but I doubt it would yield a speed gain.
Do you have ideas for detailed profiling or improving this operation?
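For the profiling part, a small sketch (not from the original question): data.table's verbose mode reports how the subset, the grouping and the sum are handled internally (e.g. whether the grouped sum is optimised), and base R's sampling profiler shows where the time goes.
options(datatable.verbose = TRUE)
inco <- inventory[period > p, sum(incoming), by = articleID][, V1]
options(datatable.verbose = FALSE)

Rprof("inco.prof")                 # base R sampling profiler
inco <- inventory[period > p, sum(incoming), by = articleID][, V1]
Rprof(NULL)
summaryRprof("inco.prof")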
Minimal reproducible example (decrease n to make it run on your machine, if necessary):
library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period = seq(1, n, 1),
                weight = c(0.1, 1),
                volume = c(1, 10),
                price = c(1, 1000),
                E_demand = c(1000),
                VK = seq(from = 0.2, to = 0.8, by = 0.2),
                s = c(seq(1, 99, 1), seq(from = 100, to = 1000, by = 20)))
inventory[, articleID := paste0("W", weight, "V", volume, "P", price,
                                "E", round(E_demand, 2), "VK", round(VK, 3), "s", s)]
inventory[, incoming := rgamma(nrow(inventory), shape = 0.3, rate = 1)]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
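As a further sketch (not from the original question; microbenchmark is only an assumed convenience, any timing tool works): the vector scan can be compared against a keyed join on period, which uses binary search on the existing (period, articleID) key.
library(microbenchmark)
# In this example every period in (p+1):n exists; with gaps in period,
# the join would add NA rows unless nomatch is handled.
microbenchmark(
  scan = inventory[period > p, sum(incoming), by = articleID][, V1],
  join = inventory[.((p + 1):n), sum(incoming), by = articleID][, V1],
  times = 10
)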

Related

How can I handle a very large number of combinations in R?

Here is my problem - I would like to generate a fairly large number of factorial combinations and then apply some constraints on them to narrow down the list of all possible combinations. However, this becomes an issue when the number of all possible combinations becomes extremely large.
Let's take an example - Assume we have 8 variables (A; B; C; etc.) each taking 3 levels/values (A={1,2,3}; B={1,2,3}; etc.).
The list of all possible combinations would be 3^8 (= 6561) and can be generated as follows:
tic <- function(){start.time <<- Sys.time()}
toc <- function(){round(Sys.time() - start.time, 4)}
nX = 8
tic()
lk = as.list(NULL)
lk = lapply(1:nX, function(x) c(1,2,3))
toc()
tic()
mapx = expand.grid(lk)
mapx$idx = 1:nrow(mapx)
toc()
So far so good, these operations are done pretty quickly (< 1 second) even if we significantly increase the number of variables.
The next step is to generate a corrected set of all pairwise comparisons. (An uncorrected set would be obtained by freely combining all 6561 options with each other, leading to 6561*6561 = 43,046,721 combinations.) The size of this "universe" would be 6561*(6561-1)/2 = 21,520,080. Already pretty big!
I am using the built-in R function combn to get it done. In this example the running time remains acceptable (about 20 seconds on my PC), but things become impossible with a higher number of variables and/or more levels per variable (the running time increases exponentially; for example, it already took 177 seconds with 9 variables!). But my biggest concern is actually that the object would become so large that R can no longer handle it (memory issue).
tic()
univ = t(combn(mapx$idx,2))
toc()
The next step would be to identify the list of combinations meeting some pre-defined constraints. For instance, I would like to sub-select all combinations sharing exactly 3 common elements (i.e. 3 variables take the same values). Again the running time will be very long (even with 8 variables), as my approach is to loop over all combinations previously defined.
tic()
vrf = NULL
vrf = sapply(1:nrow(univ), function(x){
  j1 = mapx[mapx$idx==univ[x,1], -ncol(mapx)]
  j2 = mapx[mapx$idx==univ[x,2], -ncol(mapx)]
  cond = ifelse(sum(j1==j2)==3, 1, 0)
  return(cond)})
toc()
tic()
univ = univ[vrf==1,]
toc()
Would you know how to overcome this issue? Any tips/advice would be more than welcome!
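One direction that might help (a sketch, not from the original thread, reusing nX and mapx from the code above; onehot, S and pairs3 are names introduced here): the number of positions on which two combinations agree can be computed for all pairs at once by one-hot encoding the levels and taking a matrix cross-product, instead of looping over the pairs produced by combn. The price is an n-by-n matrix in memory.
# One column per (variable, level) pair.
onehot <- do.call(cbind, lapply(seq_len(nX), function(j) {
  outer(mapx[[j]], 1:3, "==") * 1L
}))
# S[i, j] = number of variables on which combinations i and j agree.
# Note: S is 6561 x 6561 (~330 MB of doubles), so this trades memory for speed.
S <- tcrossprod(onehot)
# Pairs (i < j) sharing exactly 3 common elements.
pairs3 <- which(S == 3 & upper.tri(S), arr.ind = TRUE)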

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a and b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
Rows are names, columns are features, values are real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset covering anywhere from 20% to 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
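To illustrate that last idea with a hypothetical setup (not from the answer): if b were the identity permutation except on a known, relatively small set of positions, only those positions would need a random-access lookup; the rest is a sequential copy that the cache handles well.
n <- 90e6
a <- sample(c(TRUE, FALSE), n, replace = TRUE)
b <- seq_len(n)
changed <- sample(n, 1e5)          # pretend only these positions move
b[changed] <- sample(changed)      # a permutation restricted to `changed`

newa <- a                          # positions outside `changed` are unchanged
newa[changed] <- a[b[changed]]     # far fewer random lookups than a[b]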
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
>

How to vectorize comparison of subsequent rows

I am trying to speed up my current implementation for the following problem:
The arrival of an article articleID ordered in a period is determined by its leadtime, which varies from period to period. I would like to determine the number of times an article has been overtaken, i.e. an article ordered in a later period arrived earlier than one ordered in an earlier period.
My current implementation (yes, two ugly for-loops):
library(data.table)
lt <- CJ(articleID=c("A", "B", "C"), period=1:100)
lt$leadtime <- round(runif(length(lt$period))*100,0)
lt[, arrival:=period+leadtime]
setkey(lt,articleID,period)
overtakenSum <- 0
for (art in unique(lt$articleID)) {
  for (p in sort(unique(lt[art, period]))) {
    # find subsequent period of item where arrival is before arrival period of article in current period
    overtakenSum <- overtakenSum +
      sum(lt[art == articleID & period > p, arrival] <
            lt[.(art, p), arrival])
  }
  print(overtakenSum)
}
The implementation is too slow for the number of articles and periods I need to consider.
Is there a way to
1. vectorize these operations (like with some advanced use of diff), or
2. take advantage of special functions in data.table?
I don't know how you can avoid the basically by-row part of your problem, since the various sums for each row don't appear to have a relationship with each other. However, with a simple rewriting of your solution I see a speedup of about 3x:
lt[, {perArt = .SD; # renaming, to be able to run the next line correctly
perArt[, sum(perArt[period > p, arrival] < arrival), by = .(p = period)][, sum(V1)]}
, by = articleID][, cumsum(V1)]
#[1] 1450 2599 3760
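A further sketch (not from the thread): because lt is keyed by (articleID, period), the rows within each article are already in period order, so a single outer() comparison per article can replace the nested loops, at the cost of an n_periods-by-n_periods logical matrix per article (harmless here with 100 periods).
overtaken <- lt[, {
  A <- outer(arrival, arrival, ">")   # A[i, j]: order from period i arrived after order from period j
  sum(A[upper.tri(A)])                # keep only pairs where i was ordered before j
}, by = articleID]
overtaken[, cumsum(V1)]               # cumulative counts, as printed by the loop version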

Speeding up identification of subsequences

I am using a dataset which has hundreds of events in each sequence. I am trying to identify subsequences and sequential association rules using TraMineR. For example, here is code that I would write:
# Frequent subsequences:
fsubseq <- seqefsub(weaver, minSupport = 0.05, maxK = 4)
fsubseq <- seqentrans(fsubseq)
fsb <- fsubseq[fsubseq$data$nevent > 1]
plot(fsb[1:20], col = "cyan")
# Sequential association rules:
rules <- TraMineR:::seqerules(fsubseq)
rules[order(rules$Lift, decreasing = TRUE)[1:25], 1:4]
This is usually workable as long as I set maxK to 1-3, but when I go above that value the computation takes hours if not days. Are there any specific parameters I can adjust to speed up these computations?
Computation time is strongly linked to:
Number of events per sequence. The algorithm was designed for a small number of events per sequence (typically fewer than 6) and many sequences. You can try removing some events that are not your main interest, or analysing groups of events. I would guess that the relationship between the number of events and computation time is at least exponential; with more than 10 events per sequence, it can be really slow.
Minimum support. With a low minimum support, the number of possible subsequences gets really big. Try setting it to a higher value.
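A minimal sketch of that advice, reusing the question's own call (the new parameter values are purely illustrative):
library(TraMineR)
# cap the subsequence length and raise the minimum support to shrink the search space
fsubseq <- seqefsub(weaver, minSupport = 0.10, maxK = 3)
fsubseq <- seqentrans(fsubseq)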
Hope this helps.

R- Speed up calculation related with subset of data.table

I need help speeding up the case below:
I have roughly 8.5 million rows of order history for 1.3M orders. I need to calculate the time it takes between two steps of each order. I use the calculation below:
History[, time_to_next_status :=
            get_time_to_next_step(id_sales_order_item_status_history,
                                  id_sales_order_item, History_subset),
        by = 'id_sales_order_item_status_history']
In the code above:
id_sales_order_item - id of a sales order item; multiple history records can have the same id_sales_order_item
id_sales_order_item_status_history - id of a row
History_subset is a subset of History which contains only 3 columns [id_sales_order_item_status_history, id_sales_order_item, created_at] needed in the calculations.
created_at is the time the history was created
The function get_time_to_next_step is defined as below:
get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by,
                                  dataSet){
  # keep only the history rows of this sales order item
  # (the original used dplyr's old %.% chaining operator)
  dataSet <- dplyr::filter(dataSet, id_sales_order_item == filter_by)
  index <- match(id_sales_order_item_status_history,
                 dataSet$id_sales_order_item_status_history)
  time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
  time_to_next_status
}
The issue is that it takes 15 minutes to run on around 10k records of History, so it would take roughly 9 days to complete the whole calculation. Is there any way I can speed this up without breaking the data into multiple subsets?
I'll take a shot. Could you try something like this?
History[, Index := 1:.N, by = id_sales_order_item]
# within each order item, Index runs 1:.N, so created_at[Index + 1] is the
# timestamp of the next status (NA for the last row)
History[, time_to_next_status := created_at[Index + 1] - created_at[Index],
        by = id_sales_order_item]
I would think this would be pretty fast.
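A related sketch (not from the answer), assuming History rows are already in chronological order within each id_sales_order_item: newer data.table versions provide shift(), which expresses the same lead/lag idea without the helper Index column.
History[, time_to_next_status := shift(created_at, type = "lead") - created_at,
        by = id_sales_order_item]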
