How to vectorize comparison of subsequent rows - r

I am trying to speed up my current implementation of the following problem:
The arrival of an article (articleID) ordered in a given period is determined by its lead time, which varies from period to period. I would like to determine the number of times an article has been overtaken, i.e. how often an article ordered in a later period arrived earlier than one ordered in an earlier period.
My current implementation (yes, two ugly for-loops):
library(data.table)
lt <- CJ(articleID=c("A", "B", "C"), period=1:100)
lt$leadtime <- round(runif(length(lt$period))*100,0)
lt[, arrival:=period+leadtime]
setkey(lt,articleID,period)
overtakenSum <- 0
for (art in unique(lt$articleID)) {
  for (p in sort(unique(lt[art, period]))) {
    # count later-period orders of this article whose arrival precedes the arrival of the order placed in period p
    overtakenSum <- overtakenSum +
      sum(lt[art == articleID & period > p, arrival] <
            lt[.(art, p), arrival])
  }
  print(overtakenSum)
}
The implementation is too slow for the number of articles and periods I need to consider.
Is there a way to
- vectorize these operations (like with some advanced use of diff), or
- take advantage of special functions in data.table?

I don't see how to avoid the essentially row-by-row nature of your problem, since the per-row sums don't appear to be related to each other. However, with a simple rewrite of your solution I see a speedup of about 3x:
lt[, {
     perArt <- .SD  # rename .SD so it can be referenced inside the nested subset below
     perArt[, sum(perArt[period > p, arrival] < arrival),
            by = .(p = period)][, sum(V1)]
   },
   by = articleID][, cumsum(V1)]
#[1] 1450 2599 3760
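For comparison, here is a fully vectorised sketch of the same count (my own suggestion, not part of the original answer): per article, an order placed in period j overtakes one placed in period i exactly when period[i] < period[j] and arrival[i] > arrival[j], so the per-article count is the number of such pairs, which outer() can enumerate directly. With 100 periods per article the 100-by-100 logical matrices are tiny; note that memory use grows quadratically in the number of periods.
overtaken <- lt[, .(n = sum(outer(arrival, arrival, ">") &
                            outer(period, period, "<"))),
                by = articleID]
overtaken[, cumsum(n)]  # same cumulative counts as the loop above, for the same lt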

Related

Randomly pairing elements of a vector in R to count unique arrangements

Background:
On this combinatorics question, the issue is how to determine the sample space: the number of ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see the comments on the OP) and 105 (see the edit within the OP and the answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
Following a now-deleted answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, is the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of using unlist is from @Song Zhengyi, and if his answer is undeleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
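The same counting argument can be cross-checked without any simulation (my own addition): there are 8! ways to line the teams up, 4! orderings of the four pairs, and 2 orientations within each pair.
factorial(8) / (factorial(4) * 2^4)  # 105 draws, ignoring pair order and field advantage
factorial(8) / factorial(4)          # 1680 draws when field advantage matters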

Calculate Number of Times Weight Changes more than a percentage

Suppose we have a table A with date and weight columns; basically, this is daily weight data across 10 years. We want to count the number of times the weight has changed by more than 3% in either direction. Is the pseudo-code below roughly correct?
count <- 0
for (i in 1:nrow(A)) {
  weight_initial <- A$weight[i]
  for (j in 1:nrow(A)) {
    weight_compare <- A$weight[j]
    if (weight_compare >= 1.03 * weight_initial || weight_compare <= 0.97 * weight_initial) {
      count <- count + 1
    }
  }
}
It's better to do vectorized computations in R whenever possible. This is a quick and dirty approach (it does twice as many computations as necessary, but should still be pretty quick):
weight <- rnorm(10, mean = 1, sd = 0.1)
wtcomp <- outer(weight, weight, "/")            # ratio of every weight to every other weight
sum(abs(wtcomp[lower.tri(wtcomp)] - 1) > 0.03)  # pairs whose ratio deviates from 1 by more than 3%
This solution is similar to what your pseudocode would produce, except that yours compares each value against both past and future values, so it would (I think) give double my answer.
Do you really want to count weight gain/loss against all future times? E.g. should weight = (1,2,2,2,2) really be counted as 4 weight-gain events and not just one?
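To make the difference concrete (my own tiny illustration, using the weight = (1,2,2,2,2) example above): the all-pairs count registers four changes of more than 3%, while counting only consecutive changes registers one.
w <- c(1, 2, 2, 2, 2)
m <- outer(w, w, "/")
sum(abs(m[lower.tri(m)] - 1) > 0.03)      # 4: every later value compared with the first
sum(abs(diff(w) / w[-length(w)]) > 0.03)  # 1: only the single day-to-day jump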
It is not very clear to me what output you want. So, if you have your time series of weights:
a <- c(1, 1.5, 2, 4, 3, 3.005, 3.05, 0.5, 0.99)
and you want to compare each measurement only against the initial measurement, to check how many times it changed by more than 3%:
sum(abs((a - a[1]) / a[1]) > 0.03)
But if you want to calculate how many times there was a daily change of more than 3% with respect to the previous measurement, then:
sum(abs(diff(a) / a[-length(a)]) > 0.03)
Cheers, F.

Speed up complex data.table operation (subset, sum, group)

I have a large data.table which I need to subset, sum and group in the same way at several points in my code. Therefore, I store the result to save time. The operation still takes rather long and I would like to know how to speed it up.
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
The keys of inventory are period and articleID. The size varies depending on the parameters but is always greater than 3 GB. It has about 62,670,000 rows and 7 variables.
My thoughts on each step so far:
1. Subset: period > p
This could be faster with a vector scan, but I would need to generate the sequence from p to max(period) for that, which takes additional time. Also, the data.table is already sorted by period, so I suppose the gain would not be large.
2. Aggregate: sum(incoming)
No idea how to improve this.
3. Group: by = articleID
The grouping might be faster with a different key on the table, but that would have a bad impact on the rest of my code.
4. Access: [, V1]
This could be skipped and done in a later operation, but I doubt it would yield a speed gain.
Do you have ideas for detailed profiling or improving this operation?
Minimal reproducible example (decrease n to make it run on your machine, if necessary):
library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period = seq(1, n, 1), weight = c(0.1, 1), volume = c(1, 10),
                price = c(1, 1000), E_demand = c(1000),
                VK = seq(from = 0.2, to = 0.8, by = 0.2),
                s = c(seq(1, 99, 1), seq(from = 100, to = 1000, by = 20)))
inventory[, articleID := paste0("W", weight, "V", volume, "P", price,
                                "E", round(E_demand, 2), "VK", round(VK, 3), "s", s)]
inventory[, incoming := rgamma(nrow(inventory), shape = 0.3, rate = 1)]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
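One direction that might help (a sketch of my own, not an answer from the thread; per_period, inc, cum_inc, total_inc and arts are names I made up): since the identical subset/sum/group is evaluated for several values of p, you can collapse the 62M rows once into one row per (articleID, period) and then answer each p with a keyed lookup and a subtraction.
# Collapse once: per-article, per-period totals plus running and overall sums.
per_period <- inventory[, .(inc = sum(incoming)), by = .(articleID, period)]
setkey(per_period, articleID, period)
per_period[, cum_inc   := cumsum(inc), by = articleID]
per_period[, total_inc := sum(inc),    by = articleID]
# Sum of incoming over periods > p = article total minus cumulative sum up to p.
# The rolling join picks, per article, the last period <= p (assumes p >= min(period)).
arts <- unique(per_period$articleID)
inco_fast <- per_period[.(arts, p), total_inc - cum_inc, roll = TRUE]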

Speeding up identification of subsequences

I am using a dataset which has hundreds of events in each sequence. I am trying to identify subsequences and sequential association rules using TraMineR. For example, here is code that I would write:
# Frequent subsequences:
fsubseq <- seqefsub(weaver, minSupport = 0.05, maxK = 4)
fsubseq <- seqentrans(fsubseq)
fsb <- fsubseq[fsubseq$data$nevent > 1]
plot(fsb[1:20], col = "cyan")
# Sequential association rules:
rules <- TraMineR:::seqerules(fsubseq)
rules[order(rules$Lift, decreasing = TRUE)[1:25], 1:4]
This is usually workable as long as I set maxK to 1-3, but beyond that the computation takes hours, if not days. Are there any specific parameters I can adjust to speed these computations up?
Computation time is strongly linked to:
Number of events per sequence. The algorithm was designed for a small number of events per sequence (typically fewer than 6) and many sequences. You can try removing events that are not of primary interest or analysing groups of events. I would guess that the relationship between the number of events and computation time is at least exponential; with more than 10 events per sequence it can get really slow.
Minimum support. With a low minimum support, the number of possible subsequences gets really big. Try setting it to a higher value (see the example below).
Hope this helps.
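As a concrete illustration of that advice (the parameter values here are purely illustrative, reusing the call from the question): a higher minSupport and a smaller maxK both shrink the search space considerably.
fsubseq <- seqefsub(weaver, minSupport = 0.2, maxK = 3)  # stricter support, shorter subsequences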

R- Speed up calculation related with subset of data.table

I need help speeding up the case below:
I have roughly 8.5 million rows of order history for 1.3M orders. I need to calculate the time it takes between two steps of each order. I use the calculation below:
History[, time_to_next_status :=
            get_time_to_next_step(id_sales_order_item_status_history,
                                  id_sales_order_item, History_subset),
        by = 'id_sales_order_item_status_history']
In the code above:
id_sales_order_item - id of a sales order item; multiple history records can share the same id_sales_order_item
id_sales_order_item_status_history - id of a row
History_subset - a subset of History containing only the 3 columns needed in the calculation [id_sales_order_item_status_history, id_sales_order_item, created_at]
created_at - the time the history record was created
The function get_time_to_next_step is defined as below:
get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by, dataSet) {
  # keep only the history rows of this sales order item
  dataSet <- dataSet[id_sales_order_item == filter_by]
  # locate the current row and take the time difference to the next status row
  index <- match(id_sales_order_item_status_history, dataSet$id_sales_order_item_status_history)
  time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
  time_to_next_status
}
The issue is that it takes 15 minutes to run around 10k records of History, so it would take roughly 9 days to complete the calculation. Is there any way I can speed this up without breaking the data into multiple subsets?
I will take a shot. Can't you try something like this:
History[, Index := 1:.N, by = id_sales_order_item]
History[, time_to_next_status := created_at[Index + 1] - created_at[Index],
        by = id_sales_order_item]
I would think this would be pretty fast.
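For reference (my own note, not part of the original answer): data.table's shift() expresses the same per-group lead difference directly, with the same result:
History[, time_to_next_status := shift(created_at, type = "lead") - created_at,
        by = id_sales_order_item]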
