When doing frequent sequence mining, one typically wants to do the following tasks:
1. Find sequential patterns (frequent sequences).
2. Determine which sequential patterns apply to a transaction, i.e. given a transaction, which of the frequent sequences found are present?
I'm having trouble doing the latter.
Using R, I am applying the cspade algorithm from the arulesSequences package to the following toy dataset:
data <- data.frame(id = 1:10,
                   transaction = c("A B B A",
                                   "A B C B D C B B B F A",
                                   "A A B",
                                   "B A B A",
                                   "A B B B B",
                                   "A A A B",
                                   "A B B A B B",
                                   "E F F A C B D A B C D E",
                                   "A B B A B",
                                   "A B"))
Then I split the data using the str_split function from the stringr package:
data_for_fseq_mining <- str_split(string = data$transaction, pattern = " ")
Next, use the identifiers to uniquely name the list elements in data_for_fseq_mining; this is a prerequisite for the as.transactions function used below.
names(data_for_fseq_mining) <- data$id
To convert this kind of data to an object of class 'transactions', I use the as.transactions function from https://github.com/cran/clickstream/blob/master/R/Clickstream.r:
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
Now that the data is in the proper format, I run the cspade algorithm with some parameters:
sequences <- cspade(data = data_for_fseq_mining_trans,
                    parameter = list(support = 0.3, maxsize = 10, maxlen = 10,
                                     mingap = 1, maxgap = 10),
                    control = list(tidList = TRUE, verbose = TRUE))
Summarizing the results (sequence and relative support):
sequences_df <- cbind(sequence = labels(sequences), support = sequences@quality)
sequence support
1 <{A}> 1.0
2 <{B}> 1.0
3 <{A},{B}> 1.0
4 <{B},{B}> 0.7
5 <{A},{B},{B}> 0.6
6 <{B},{B},{B}> 0.4
7 <{A},{B},{B},{B}> 0.4
8 <{B},{B},{B},{B}> 0.3
9 <{A},{B},{B},{B},{B}> 0.3
10 <{A},{A},{B}> 0.5
11 <{B},{A},{B}> 0.4
12 <{A},{B},{A},{B}> 0.3
13 <{A},{A}> 0.8
14 <{B},{A}> 0.6
15 <{A},{B},{A}> 0.6
16 <{B},{B},{A}> 0.5
17 <{A},{B},{B},{A}> 0.4
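(As an aside: coercing the result with as(sequences, "data.frame") should, if I read the arulesSequences documentation correctly, produce much the same summary in one step.)
sequences_df <- as(sequences, "data.frame")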
That's perfectly fine, but now I would like to know, for each transaction, whether each sequence is present or not (TRUE/FALSE). To do this, I tried to use the tidList:
sequences_score <- as.matrix(sequences@tidLists@data)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[9,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
From this result, I assume each row corresponds to a transaction and each column to a sequence. But the 4th column says that pattern <{B},{B}> is not present in transactions 2, 4 and 7, even though these transactions clearly all contain it. Are my assumptions about the output wrong?
An alternative approach is the code provided by juliesls in R arulesSequences Find which patterns are supported by a sequence.
However, an error occurs when applying the following lines of code.
ids <- unique(data_for_fseq_mining_trans@itemsetInfo$sequenceID)
sequences_score <- data.frame()
for (seq_id in 1:length(sequences)) {
  sequences_score[, labels(sequences[seq_id])] <- logical(0)
}
for (id in ids) {
  transaction_subset <- data_for_fseq_mining_trans[data_for_fseq_mining_trans@itemsetInfo$sequenceID == id]
  sequences_score[id, ] <- as.logical(support(x = sequences,
                                              transactions = transaction_subset,
                                              type = "absolute"))
}
Any clues?
To see whether each sequence is present or not, you can indeed use the code you provided:
sequences_score <- as.matrix(sequences@tidLists@data)
However, you have to map the resulting matrix to your data using another slot of your sequences object, as follows:
# Get the mapping ids, converted to numeric values
mapping_ids <- as.numeric(sequences@tidLists@transactionInfo$sequenceID)
# Then reorder sequences_score so its rows match the order of your data
sequences_score <- sequences_score[order(mapping_ids), ]
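Putting the pieces together, here is a minimal sketch of a fully labelled lookup table. The colnames/rownames assignments are an assumption on my part: they presuppose that the tidList columns follow the order of labels(sequences) and that the reordered rows follow the sorted sequence ids.
sequences_score <- as.matrix(sequences@tidLists@data)
mapping_ids <- as.numeric(sequences@tidLists@transactionInfo$sequenceID)
sequences_score <- sequences_score[order(mapping_ids), ]
# Assumed mapping: column order matches labels(sequences)
colnames(sequences_score) <- labels(sequences)
# Assumed mapping: reordered rows follow the sorted transaction ids
rownames(sequences_score) <- sort(mapping_ids)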
I need to run lapply on an activation_status list t times, so that iteration t of the function remembers the results from iteration t-1.
The list is basically a two-dimensional array representing the status of each single item i over multiple periods t, and looks like this:
n_items <<- 100
n_iterations <<- 10
activation_status <-
lapply(1:n_iterations,
FUN = function(t, bool, i) rep(bool, t),
FALSE, n_items)
Now, during each iteration t, I randomly activate (set to TRUE) a number of items within the list, but I want all items already activated at time t-1 to stay active. (Note that I use activation_status within the update function so that it's accessible in the inner functions.)
updateActivation <- function(t) {
  activation_status[[t]] <- as.logical(rbinom(n_items, 1, prob = .5))
  activation_status[[t]][activation_status[[t - 1]] == TRUE] <- TRUE
}
But then
lapply(1:n_iterations, updateActivation)
throws an error (at t = 1, activation_status[[t - 1]] tries to select element 0):
Error in activation_status[[t - 1]] : attempt to select less than one element in get1index
I know I could use a loop, but I wonder:
Is it possible to do something like this with an apply function?
Would it be faster?
Not sure if I fully understood the question, but it seems like you are looking for a recursion.
In that case, Reduce() can be used instead of lapply():
activation_status <- rep(FALSE, 10)
n_iterations <- 5
Reduce(function(y, x) as.logical(rbinom(length(y), 1, prob = 0.1)) | y,
       x = 1:n_iterations, init = activation_status, accumulate = TRUE)
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[[3]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[[4]]
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[5]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[6]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
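If you only want the n_iterations updated states, you can drop the initial all-FALSE vector from the result. A sketch using the question's settings (n_items = 100, prob = .5):
n_items <- 100
n_iterations <- 10
activation_status <- Reduce(
  function(prev, t) as.logical(rbinom(n_items, 1, prob = .5)) | prev,
  x = seq_len(n_iterations),
  init = rep(FALSE, n_items),
  accumulate = TRUE
)[-1]  # drop the initial all-FALSE state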
We could probably do this without using any apply command.
#Set seed for reproducibility
set.seed(123)
#Create initialization demo data
activation_status <- rep(FALSE, 10)
#Number of values to select
n_iterations <- 5
#Sequence from 1:n_iterations
seq_n_iterations <- seq_len(n_iterations)
#Create matrix to hold output
output <- replicate(n_iterations, activation_status)
#Select n_iterations random values from 1:length(activation_status)
#You can change this if you want to use some specific distribution
points <- sample(length(activation_status), n_iterations)
#Create column indices
cols <- rep(seq_n_iterations, seq_n_iterations)
#Create row indices
rows <- points[ave(cols, cols, FUN = seq_along)]
#Change those values to TRUE
output[cbind(rows, cols)] <- TRUE
output
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE TRUE
# [7,] FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE
#[10,] FALSE TRUE TRUE TRUE TRUE
If you want them as lists:
asplit(output, 2)
#[[1]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[[2]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[3]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[4]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
#[[5]]
# [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
I am interested in writing a program that gives the number of elements of a vector x that are smaller than or equal to any given value within x.
Let's say
x <- c(1, 3, 8, 7, 6, 4, 3, 10, 12)
I want to calculate the number of elements within x that are smaller than or equal to 1, to 3, to 8, etc. For example, the fifth element, x[5], is 6, and the number of elements smaller than or equal to 6 is 5. However, I only know how to do an element-wise comparison, e.g. x[1] <= x[3].
I suppose I will be using a for loop and have something like this:
for (i in 1:length(x)) {
  if (x[i] <= x[i]) {
    print(x[i])
  }
  # count number of TRUEs
}
However, this code obviously does not do what I want.
Use outer to make all comparisons at once:
outer(x, x, "<=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [2,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
# [4,] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
# [5,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
# [6,] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
# [7,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
colSums(outer(x, x, "<="))
#[1] 1 3 7 6 5 4 3 8 9
You can also use the *apply family, as follows:
sapply(x, function(i) sum(x <= i))
#[1] 1 3 7 6 5 4 3 8 9
We can use findInterval, which counts for each element of x how many values in the sorted vector are smaller than or equal to it:
findInterval(x, sort(x))
#[1] 1 3 7 6 5 4 3 8 9
Another alternative is rank. Setting the ties.method argument to "max" makes each value's rank equal to the number of elements smaller than or equal to it ("<=" rather than "<"):
rank(x, ties.method="max")
[1] 1 3 7 6 5 4 3 8 9
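As a quick illustration of the difference: ties.method = "min" gives one plus the number of strictly smaller elements, so subtracting 1 yields the "<" counts instead.
rank(x, ties.method = "min") - 1
[1] 0 1 6 5 4 3 1 7 8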
Consider the data frame in R:
set.seed(36)
y <- runif(10,0,200)
group <- sample(rep(1:2, each=5))
d <- data.frame(y, group)
I want to compare all y against all y within each group. The following codes do this correctly:
d_split <- split(d, d$group)
a <- with(d_split[[1]],outer(y, y, "<="))
b <- with(d_split[[2]],outer(y, y, "<="))
But since I am doing this inside a function, and the number of groups varies (group will be an argument of that function), I cannot proceed in this manner. How can I elegantly rewrite the last three lines of code to compare all y against all y within each group?
To perform the same operation for multiple groups, we can use lapply and apply outer to every group:
lapply(split(d, d$group), function(x) outer(x[["y"]], x[["y"]], "<="))
#$`1`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE TRUE
#[4,] TRUE TRUE TRUE TRUE TRUE
#[5,] TRUE TRUE FALSE FALSE TRUE
#$`2`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE TRUE FALSE
#[2,] FALSE TRUE FALSE TRUE FALSE
#[3,] TRUE TRUE TRUE TRUE TRUE
#[4,] FALSE FALSE FALSE TRUE FALSE
#[5,] TRUE TRUE FALSE TRUE TRUE
Here is an option without splitting
library(data.table)
setDT(d)[, as.data.table(outer(y, y, "<=")), group]
# group V1 V2 V3 V4 V5
#1: 1 TRUE TRUE FALSE FALSE FALSE
#2: 1 FALSE TRUE FALSE FALSE FALSE
#3: 1 TRUE TRUE TRUE FALSE TRUE
#4: 1 TRUE TRUE TRUE TRUE TRUE
#5: 1 TRUE TRUE FALSE FALSE TRUE
#6: 2 TRUE TRUE FALSE TRUE FALSE
#7: 2 FALSE TRUE FALSE TRUE FALSE
#8: 2 TRUE TRUE TRUE TRUE TRUE
#9: 2 FALSE FALSE FALSE TRUE FALSE
#10: 2 TRUE TRUE FALSE TRUE TRUE
Or in a 'long' format with CJ (note that CJ sorts its inputs by default; pass sorted = FALSE if you need to keep the original within-group order):
setDT(d)[, CJ(y, y), group][, V1 <= V2, group]
I have a vector representing a partitioning of objects into clusters:
#9 objects partitioned into 6 clusters
> part1 <- c(1,2,3,1,4,2,2,5,6)
I can easily create a similarity matrix where the measure of similarity is just {0,1}: 0 if two elements are in different clusters, and 1 if they are in the same one:
> sim <- outer(part1,part1,"==")
> sim
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[7,] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
But for large vectors (hundreds of thousands of objects) this doesn't work due to memory limits.
Clusters are small on average, so a sparse matrix would be compact enough. I've looked through the Matrix package and couldn't find anything like outer() for sparse objects.
So is there any other simple way to create such a matrix directly from the vector (without looping over all pairs of elements and populating a sparse matrix element by element)?
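Not from the original thread, but one possible sketch with Matrix::sparseMatrix: enumerate the within-cluster index pairs explicitly and build the sparse matrix from triplets. Memory grows with the sum of squared cluster sizes rather than with length(part1)^2, so this stays compact as long as clusters are small.
library(Matrix)

part1 <- c(1, 2, 3, 1, 4, 2, 2, 5, 6)

# Object indices grouped by cluster
idx <- split(seq_along(part1), part1)

# All within-cluster (row, col) pairs, including the diagonal
pairs <- do.call(rbind, lapply(idx, function(i) expand.grid(row = i, col = i)))

# Sparse logical similarity matrix
sim <- sparseMatrix(i = pairs$row, j = pairs$col, x = TRUE,
                    dims = rep(length(part1), 2))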
If I have this matrix (which I named data):
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[2,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[8,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
And I want to combine the columns into one single column like this (a row becomes TRUE if at least one of its entries is TRUE):
[,1]
[1,] TRUE
[2,] TRUE
[3,] FALSE
[4,] TRUE
[5,] TRUE
[6,] FALSE
[7,] TRUE
[8,] TRUE
[9,] FALSE
I know I could do something like this (using |):
data2[1:9, 1] <- data[,1] | data[,2] | data[,3] | data[,4]…
data2 would then contain a single column with the different columns combined. But this is not a good approach if I had lots of columns (for example, ncol = 100).
I guess there is some simple way of doing it?
Thanks
Here is another answer that takes advantage of how R converts between logicals and numerics:
When going from logical to numeric, FALSE becomes 0 and TRUE becomes 1, so rowSums gives you the number of TRUEs per row:
rowSums(data)
# [1] 3 3 0 3 3 0 3 3 0
When going from numeric to logical, 0 becomes FALSE and anything else becomes TRUE, so you can feed the output of rowSums to as.logical and it will indicate whether a row has at least one TRUE:
as.logical(rowSums(data))
# [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
I like Tyler's answer though; it might be less efficient (to be proven), but I find it more intuitive.
You could use any with apply as in:
mat <- matrix(sample(c(TRUE, FALSE), 100,TRUE), 10)
apply(mat, 1, any)
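For the question's data matrix, a Reduce-based sketch is another way to fold "|" across all columns without writing them out by hand (asplit needs R >= 3.6):
Reduce(`|`, asplit(data, 2))
# [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE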