Speeding up identification of subsequences - r

I am using a dataset which has hundreds of events in each sequence. I am trying to identify subsequences and sequential association rules using TraMineR. For example, here is code that I would write:
# Frequent subsequences:
fsubseq <- seqefsub(weaver, minSupport = 0.05, maxK = 4)
fsubseq <- seqentrans(fsubseq)
fsb <- fsubseq[fsubseq$data$nevent > 1]
plot(fsb[1:20], col = "cyan")
# Sequential association rules:
rules <- TraMineR:::seqerules(fsubseq)
rules[order(rules$Lift, decreasing = TRUE)[1:25], 1:4]
This is usually workable as long as I set maxK between 1 and 3, but beyond that the computations take hours, if not days. Are there any specific parameters I can adjust to speed these computations up?

Computation time is strongly linked to:
Number of events per sequence. The algorithm was designed for a small number of events per sequence (typically fewer than 6) and many sequences. You can try removing events that are not your main interest, or analysing groups of events. I guess that the relationship between the number of events and computation time is at least exponential. With more than 10 events per sequence, it can be really slow.
Minimum support. With a low minimum support, the number of possible subsequences gets really big. Try setting it to a higher value.
Hope this helps.
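To get a feel for why both the number of events and maxK matter, here is a rough back-of-the-envelope sketch in base R. It only counts candidate subsequences under simplifying assumptions (all events in a sequence are distinct and occur at different times). It is not what TraMineR does internally, and n_subseq is a made-up helper name.

```r
# Upper bound on the number of subsequences of length 1..max_k that a single
# sequence with n_events distinct, ordered events can contain.
# Assumption: choosing k of the n events (order is fixed) gives one subsequence.
n_subseq <- function(n_events, max_k) {
  sum(choose(n_events, seq_len(max_k)))
}

# Candidates per sequence for short vs. long sequences, with maxK = 4
sapply(c(5, 10, 100, 300), n_subseq, max_k = 4)

# Effect of maxK for a sequence with 100 events
sapply(1:5, function(k) n_subseq(100, k))
```

The counts grow very quickly with both arguments, which is consistent with the advice to shorten the sequences and keep maxK small.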

Related

How can I handle a very large number of combinations in R?

Here is my problem - I would like to generate a fairly large number of factorial combinations and then apply some constraints on them to narrow down the list of all possible combinations. However, this becomes an issue when the number of all possible combinations becomes extremely large.
Let's take an example - Assume we have 8 variables (A; B; C; etc.) each taking 3 levels/values (A={1,2,3}; B={1,2,3}; etc.).
The list of all possible combinations has 3^8 (= 6561) rows and can be generated as follows:
tic <- function(){start.time <<- Sys.time()}
toc <- function(){round(Sys.time() - start.time, 4)}
nX = 8
tic()
lk = as.list(NULL)
lk = lapply(1:nX, function(x) c(1,2,3))
toc()
tic()
mapx = expand.grid(lk)
mapx$idx = 1:nrow(mapx)
toc()
So far so good, these operations are done pretty quickly (< 1 second) even if we significantly increase the number of variables.
The next step is to generate a corrected set of all pairwise comparisons. (An uncorrected set would be obtained by freely combining all 6561 options with each other, leading to 6561 × 6561 = 43046721 combinations.) The size of this "universe" would be 6561 × (6561 - 1) / 2 = 21520080. Already pretty big!
I am using the built-in R function combn to get it done. In this example the running time remains acceptable (about 20 seconds on my PC), but things become impossible with a higher number of variables and/or more levels per variable (the running time increases exponentially; for example, it already took 177 seconds with 9 variables!). But my biggest concern is actually that the object would become so large that R can no longer handle it (a memory issue).
tic()
univ = t(combn(mapx$idx,2))
toc()
The next step would be to identify the combinations meeting some pre-defined constraints. For instance, I would like to select all combinations sharing exactly 3 common elements (i.e. 3 variables take the same values). Again, the running time is very long (even with 8 variables), as my approach is to loop over all the combinations defined previously.
tic()
vrf = NULL
vrf = sapply(1:nrow(univ), function(x){
j1 = mapx[mapx$idx==univ[x,1],-ncol(mapx)]
j2 = mapx[mapx$idx==univ[x,2],-ncol(mapx)]
cond = ifelse(sum(j1==j2)==3,1,0)
return(cond)})
toc()
tic()
univ = univ[vrf==1,]
toc()
Would you know how to overcome this issue? Any tips or advice would be more than welcome!
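One possible direction, offered as a sketch rather than a benchmarked solution: skip the full combn() matrix and compare each row with all later rows in one vectorised step, keeping only the pairs that meet the constraint. pairs_sharing_k is a hypothetical helper name, and the example uses 4 variables instead of 8 so that it runs instantly.

```r
# Return all index pairs (i < j) of rows of matrix m that share exactly k values.
pairs_sharing_k <- function(m, k) {
  n <- nrow(m)
  out <- vector("list", n - 1)
  for (i in seq_len(n - 1)) {
    rest <- (i + 1):n
    # t() puts variables in rows, so m[i, ] is recycled down each column,
    # i.e. row i is compared with every later row at once
    same <- colSums(t(m[rest, , drop = FALSE]) == m[i, ])
    hits <- rest[same == k]
    out[[i]] <- cbind(i = rep(i, length(hits)), j = hits)
  }
  do.call(rbind, out)
}

nX <- 4                                            # small case for illustration
m <- as.matrix(expand.grid(rep(list(1:3), nX)))    # 3^4 = 81 combinations
res <- pairs_sharing_k(m, 3)
nrow(res)
```

Only the matching pairs are stored, so memory use depends on how many pairs satisfy the constraint rather than on the size of the whole "universe".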

Making a for loop in r

I am just getting started with R, so I am sorry if I say things that don't make sense.
I am trying to make a for loop which does the following,
l_dtest[[1]]<-vector()
l_dtest[[2]]<-vector()
l_dtest[[3]]<-vector()
l_dtest[[4]]<-vector()
l_dtest[[5]]<-vector()
all the way up to some number n. For example, if n were 100, it would repeat this all the way to l_dtest[[100]]<-vector().
I have tried multiple different attempts at doing this and here is one of them.
n<-4
p<-(1:n)
l_dtest<-list()
for(i in p){
print((l_dtest[i]<-vector())<-i)
}
Again I am VERY new to R so I don't know what I am doing or what is wrong with this loop.
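For reference, a minimal sketch of a loop that does what the first five lines describe, assuming the goal is simply a list of n empty vectors. The names n and l_dtest are taken from the question.

```r
n <- 4
l_dtest <- list()
for (i in seq_len(n)) {
  l_dtest[[i]] <- vector()   # element i becomes an empty (logical) vector
}

# An equivalent one-liner without an explicit loop
l_dtest2 <- replicate(n, vector(), simplify = FALSE)

length(l_dtest)
```

The attempt above fails because `(l_dtest[i] <- vector()) <- i` tries to assign to the result of another assignment. Inside the loop, a single `l_dtest[[i]] <- vector()` is enough.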
The detailed background for why I need to do this is that I need to write an R function that receives as input the size of the population "n", runs a simulation of the model below with that population size, and returns the number of generations it took to reach a MRCA (most recent common ancestor).
Here is the model,
We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made randomly and equally likely over the n possibilities and each individual chooses twice. All choices are made independently. Thus, for example, it is possible that, when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.
I don't understand the specific step at the beginning of this post or why I need to do it, but my teacher said I do. I don't know if this helps, but the next step is choosing parents for the first person and then combining the lists from the step I posted with a previous step. It looks like this:
sample(1:5, 2, replace=T)
#[1] 1 2
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[1]]) # As I understand it, l_dtemp[[1]] now receives the list of descendants from l_d[[1]], because the latter chose l_dtemp[[1]] as its first parent
l_dtemp[[2]]<-union(l_dtemp[[2]], l_d[[1]]) # Same as above, but for the second choice of l_d[[1]], which is l_dtemp[[2]]
sample(1:5, 2, replace=T)
#[1] 1 3
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[2]])
l_dtemp[[3]]<-union(l_dtemp[[3]], l_d[[2]])
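Putting the steps above together, the whole simulation might be sketched as follows. This is one reading of the model, not necessarily what the teacher intends. simulate_mrca and max_gen are hypothetical names, and the stopping rule (stop as soon as one ancestor has all n present-day individuals as descendants) is an assumption about what "reaching an MRCA" means here.

```r
simulate_mrca <- function(n, max_gen = 10000) {
  # l_d[[i]]: present-day descendants of individual i in the current generation.
  # Generation 0: each individual is its own (only) descendant.
  l_d <- as.list(seq_len(n))
  for (gen in seq_len(max_gen)) {
    # Empty descendant lists for the parents' generation (the step from the post)
    l_dtemp <- replicate(n, integer(0), simplify = FALSE)
    for (child in seq_len(n)) {
      parents <- sample(seq_len(n), 2, replace = TRUE)
      for (p in unique(parents)) {
        # Each chosen parent inherits all descendants of the child
        l_dtemp[[p]] <- union(l_dtemp[[p]], l_d[[child]])
      }
    }
    l_d <- l_dtemp
    # Assumed stopping rule: someone is an ancestor of the whole present population
    if (any(lengths(l_d) == n)) return(gen)
  }
  NA_integer_   # no common ancestor found within max_gen generations
}

set.seed(1)
g <- simulate_mrca(10)
g
```

The empty lists created in each generation play the role of l_dtemp in the post: they are filled as the children pick their parents, then become l_d for the next step back in time.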

Forcing a discrete time series to be monotonically decreasing

I have a series of evaluations. Each evaluation can take discrete values ranging from 0 to 4. The series should decrease over time. However, since values are entered manually, errors can happen.
Therefore, I would like to modify my series so that it is monotonically decreasing. Moreover, I would like to minimize the number of evaluations modified. Finally, if two or more series satisfy these criteria, I would choose the one with the highest overall sum of values.
E.g.
Recorded evaluation
4332422111
Ideal evaluation
4332222111
Recorded evaluation
4332322111
Ideal evaluation
4333322111
(In this case, 4332222111 would have satisfied the criteria too, but I chose the one with the higher values.)
I tried a brute-force approach: generating all possible combinations, selecting those that are monotonically decreasing, and finally comparing each of these with the recorded series.
However, a series can be as long as 20 evaluations, and there would be too many combinations.
x1 <- c(4,3,3,2,4,2,2,1,1,1)
x2 <- c(4,3,3,2,3,2,2,1,1,1)
You could almost certainly break this algorithm, but here's a first try: replace locations with increased values by NA, then fill them in with the previous location.
dfun <- function(x) {
r <- replace(x,which(c(0,diff(x))>0),NA)
zoo::na.locf(r)
}
dfun(x1)
dfun(x2)
This gives the "less-ideal" answer in the second case.
For the record, I also tried
dfun2 <- function(x) {
s <- as.stepfun(isoreg(-x))
-s(seq_along(x))
}
but this doesn't handle the first example as desired.
You could also try to do this with discrete programming (about which I know almost nothing), or a slightly more sophisticated form of brute force -- use a stochastic algorithm that strongly penalizes non-monotonicity and weakly penalizes the distance from the initial sequence ... (e.g. optim(..., method="SANN") with a candidate function that adds or subtracts 1 from an element at random)
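As one concrete take on the "discrete programming" idea, here is a small dynamic-programming sketch. It is an assumption about how the criteria could be encoded, not a tested general solution: each change costs `big`, which is larger than any possible difference in sums, so the number of changes is minimised first and the sum is maximised second. fix_monotone is a hypothetical name, and `levels` must be sorted in increasing order.

```r
fix_monotone <- function(x, levels = 0:4) {
  n <- length(x)
  k <- length(levels)
  big <- (max(levels) - min(levels)) * n + 1   # one change outweighs any sum gain
  score <- matrix(Inf, n, k)                   # best score ending at position i with level v
  from <- matrix(NA_integer_, n, k)            # back-pointers for reconstruction
  score[1, ] <- (levels != x[1]) * big - levels
  for (i in seq_len(n)[-1]) {
    for (v in seq_len(k)) {
      prev <- v:k                              # previous value must be >= current value
      best <- prev[which.min(score[i - 1, prev])]
      from[i, v] <- best
      score[i, v] <- score[i - 1, best] + (levels[v] != x[i]) * big - levels[v]
    }
  }
  v <- which.min(score[n, ])
  out <- integer(n)
  for (i in n:1) {                             # walk the back-pointers
    out[i] <- levels[v]
    if (i > 1) v <- from[i, v]
  }
  out
}

x1 <- c(4,3,3,2,4,2,2,1,1,1)
x2 <- c(4,3,3,2,3,2,2,1,1,1)
fix_monotone(x1)
fix_monotone(x2)
```

The work grows linearly with the length of the series (times the square of the number of levels), so 20 evaluations should not be a problem.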

TraMineR, Extract all present combination of events as dummy variables

Let's say I have this data. My objective is to extract combinations of sequences.
I have one constraint: the time between two events may not be more than 5. Let's call this maxGap.
User <- c(rep(1,3)) # One user
Event <- c("C","B","C") # Say these are random events; they could be anything from LETTERS[1:4]
Time <- c(1,12,13) # This is a timeline
df <- data.frame(User=User,
Event=Event,
Time=Time)
I want to use these sequences as binary explanatory variables in an analysis.
Given this dataframe the result should be like this.
res.df <- data.frame(User=1,
C=1,
B=1,
CB=0,
BC=1,
CBC=0)
(CB) and (CBC) will be 0, since the gap between C and B (11) exceeds maxGap (5).
I was trying to write a function for this using many for-loops, but it becomes very complex as the sequences get longer and the number of different events grows, and also if the number of different users grows to 100 000.
Is it possible to do this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)
constr <- seqeconstraint(maxGap=5)
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
(presence <- seqeapplysub(subseq, method="presence"))
which gives
                   (B) (B)-(C) (C)
1-(C)-11-(B)-1-(C)   1       1   1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual and the columns will be the binary variable you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence? )
However, be aware that TraMineR works fine only with subsequences of length up to about 4 or 5. We suggest setting maxK to 3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet), as long as you restrict the maximal length of the subsequences you are looking for.
Hope this helps
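To make the meaning of the gap constraint concrete, here is a small base R sketch that checks, for one user, whether a given subsequence occurs with at most max_gap time units between consecutive matched events. It is only an illustration under simplifying assumptions (events sorted by time, one event per timestamp) and does not reproduce TraMineR's implementation. has_subseq is a hypothetical helper name.

```r
has_subseq <- function(events, times, pattern, max_gap = Inf) {
  # k: position in the pattern to match next; last: index of the previously
  # matched event (0 means nothing has been matched yet)
  search <- function(k, last) {
    if (k > length(pattern)) return(TRUE)
    cand <- which(events == pattern[k] & seq_along(events) > last)
    if (last > 0) cand <- cand[times[cand] - times[last] <= max_gap]
    for (j in cand) {
      if (search(k + 1, j)) return(TRUE)
    }
    FALSE
  }
  search(1, 0)
}

ev <- c("C", "B", "C")
tm <- c(1, 12, 13)
pats <- list(C = "C", B = "B", CB = c("C", "B"),
             BC = c("B", "C"), CBC = c("C", "B", "C"))
dummies <- sapply(pats, function(p) as.integer(has_subseq(ev, tm, p, max_gap = 5)))
dummies
```

For the example data this gives 1 for C, B and BC and 0 for CB and CBC, which matches the res.df requested in the question.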

How to vectorize comparison of subsequent rows

I am trying to speed up my current implementation for the following problem:
The arrival of an article articleID ordered in a given period is determined by its leadtime, which varies from period to period. I would like to determine the number of times an article has been overtaken, i.e. an article ordered in a later period arrived earlier than an article ordered in an earlier period.
My current implementation (yes, two ugly for-loops):
library(data.table)
lt <- CJ(articleID=c("A", "B", "C"), period=1:100)
lt$leadtime <- round(runif(length(lt$period))*100,0)
lt[, arrival:=period+leadtime]
setkey(lt,articleID,period)
overtakenSum <- 0
for (art in unique(lt$articleID)) {
for (p in sort(unique(lt[art,period]))) {
# find subsequent period of item where arrival is before arrival period of article in current period
overtakenSum <- overtakenSum +
sum(lt[art==articleID & period>p,arrival] <
lt[.(art,p),arrival])
}
print(overtakenSum)
}
The implementation is too slow for the number of articles and periods I need to consider.
Is there a way to
vectorize these operations (like with some advanced use of diff) or
take advantage of special functions in data.table?
I don't know how you can avoid the basically by-row part of your problem, since the various sums for each row don't appear to have a relationship with each other. However, with a simple rewriting of your solution I see a speedup of about 3x:
lt[, {perArt = .SD; # renaming, to be able to run the next line correctly
perArt[, sum(perArt[period > p, arrival] < arrival), by = .(p = period)][, sum(V1)]}
, by = articleID][, cumsum(V1)]
#[1] 1450 2599 3760
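Another way to look at it, as a sketch only: for each article, the number of overtakings is the number of pairs of periods (earlier, later) where the later order arrives strictly before the earlier one. For a moderate number of periods this can be counted with outer() in base R, at the cost of one n-by-n logical matrix per article. count_overtaken is a hypothetical helper name, and the data below is generated without data.table to keep the example self-contained.

```r
count_overtaken <- function(arrival) {
  # arrival must be ordered by period.
  # gt[i, j] is TRUE when order i arrives strictly after order j
  gt <- outer(arrival, arrival, ">")
  # Keep only pairs with i < j: the earlier order i is overtaken by the later order j
  sum(gt[upper.tri(gt)])
}

set.seed(42)
lt <- expand.grid(period = 1:100, articleID = c("A", "B", "C"))
lt$arrival <- lt$period + round(runif(nrow(lt)) * 100)

# Rows are already ordered by period within each article
per_article <- tapply(lt$arrival, lt$articleID, count_overtaken)
cumsum(per_article)   # running total, like the print() in the original loop
```

With many more periods per article the quadratic matrix becomes the bottleneck, and a merge-sort style inversion count would be the usual alternative.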

Resources