R crashes when cspade is trained on a large data set

The code below works to extract sequences using the cspade algorithm.
library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
                 eventID = c(100, 101, 102, 103, 104),
                 site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
                 sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df[, "site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID), ]
seq <- cspade(df.trans, parameter = list(support = 0.2),
              control = list(verbose = TRUE))
The problem is that my actual data has ~2 million rows, with sequence values running up to ~20 for each person. Using the code above, cspade quickly consumes all RAM and R crashes. Does anyone have tips on how to perform sequence mining on large datasets like mine? Thanks!

How many unique IDs do you have in df$sequence? Judging by the last column of your sample dataset, there are 3 sequence options. Do you think sequences of up to 20 are necessary? One thing you could do is set the maxlen parameter in your cspade call to something like 4 or 5 and evaluate your predictive accuracy, assuming that's what you are after.
So you would have something like seq <- cspade(df.trans, parameter = list(support = 0.2, maxlen = 4), control = list(verbose = TRUE)).
Hope that helps!
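If trimming maxlen alone is not enough, the control list of cspade also exposes memory-related options. As a hedged sketch (check ?cspade on your version for the exact names and defaults, since I'm quoting them from memory), memsize caps the working memory in MB and numpart splits the database into partitions that are mined separately:
# Sketch: constrain pattern length and partition the database.
# memsize / numpart are SPADE control options; verify them in ?cspade.
seq <- cspade(df.trans,
              parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE, memsize = 1024, numpart = 8))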

Related

Add p-value column in qwraps2::summary_table

I want to make a little summary table for my colleagues in R Markdown using qwraps2::summary_table. The data.frame contains information on different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
  list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
            "Exposure 2" = ~ n_perc(exp_2 %in% 1),
            "Exposure 3" = ~ n_perc(exp_3 %in% 1),
            "Exposure 4" = ~ n_perc(exp_4 %in% 1)))
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the group differences between the control and intervention groups, e.g. from fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for the class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different ways but failed, unfortunately.
Is there a convenient way to add individual columns via qwraps2::summary_table, especially p-values corresponding to the grouped columns?
Thanks for your help!
Best,
Florian
[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p-value column. Maybe it is not the smartest solution, but it worked, at least for me.
First I calculated the p-values with a function that extracts the p.value component from the output of fisher.test, and stored them in an object, in my case a simple numeric vector:
# function to extract the p-value from a fisher.test result
fisher.pvalue <- function(x) {
  value <- fisher.test(x)$p.value
  return(value)
}
# fisher test / generate p-values
p.vals <- round(sapply(list(
  table(df$exp_1, df$therapy),
  table(df$exp_2, df$therapy),
  table(df$exp_3, df$therapy),
  table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
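As an aside, the same vector can be built without writing out each table by looping over the column names; a small sketch of my own, assuming the exposures are named exp_1 through exp_4:
# Sketch: apply fisher.pvalue over the exposure columns by name
p.vals <- round(sapply(paste0("exp_", 1:4), function(v)
  fisher.pvalue(table(df[[v]], df$therapy))), digits = 2)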
In the following step I simply added an empty table column called P-Value and filled its cells with the p.vals.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked up the corresponding cell indices in overall_table (for the P-Value column these are 9:12) and filled them using base subsetting. In the vignette of qwraps2 (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author uses regular expressions to identify the right cells (see section 3.2).
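For reference, a hedged sketch of that name-based addressing, assuming the exposure labels end up as the row names of the returned character matrix (print rownames(overall_table) to confirm on your version):
# Sketch: fill the cells by row/column name instead of hard-coded indices
idx <- grepl("^Exposure", rownames(overall_table))
overall_table[idx, "P-Value"] <- p.vals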
If there are other methods to add individual columns to a qwraps2::summary_table, I would appreciate seeing how it is done.
Best,
Florian

fast way to create a transition frequency table

I have a table showing sequences of patterns, represented with the id of the sequence and the ordered values of the consecutive patterns over several rows (patterns all have the same length, but sequences have different lengths).
patterns_seq_a <- c("ABC", "BCD", "ABC", "CBA")
patterns_seq_b <- c("BCD", "BCC", "BAC", "ABC", "BCD")
patterns_seq_c <- c("ABC", "ABC")
sequence_df <- data.frame(sequence_id = c(rep("a", length(patterns_seq_a)),
                                          rep("b", length(patterns_seq_b)),
                                          rep("c", length(patterns_seq_c))),
                          pattern = c(patterns_seq_a, patterns_seq_b, patterns_seq_c))
and I'm trying to count the frequency of transitions from one pattern to another over all the sequences
(and ultimately to build transition probabilities from the frequency table).
So, for example, the frequency of the transition from "CBA" to "BCD" over all sequences would be 0 (NB rows 4 and 5 don't belong to the same sequence).
The end result would have the form of:
unique_patterns <- unique(sequence_df$pattern)
result <- data.frame(matrix(0, ncol = length(unique_patterns), nrow = length(unique_patterns)))
colnames(result) <- unique_patterns
rownames(result) <- unique_patterns
(but it could also be a hash of pattern -> pattern -> frequency if it has to be)
On the web I've found a solution based on concatenating the sequences and then grepping, but it's unusable as it's too slow (profvis blames the greps). Something along these lines:
freq_table <- c()
# concat_sequence_df (construction not shown) holds each sequence's patterns
# concatenated into one comma-separated string
for (start_pattern in unique_patterns) {
  for (end_pattern in unique_patterns) {
    transition_pattern <- paste0(start_pattern, ',', end_pattern)
    sequence_holding_transition <- concat_sequence_df[grep(transition_pattern, concat_sequence_df$patterns_sequence), ]
    if (nrow(sequence_holding_transition) < 1) {
      transition_frequency <- c(transition_pattern, 0)
    } else {
      concat_sequence_holding_transition <- paste0(sequence_holding_transition$patterns_sequence, collapse = ",", sep = "/")
      transition_pattern_positions <- gregexpr(pattern = transition_pattern, text = concat_sequence_holding_transition)[[1]]
      transition_frequency <- c(transition_pattern, length(transition_pattern_positions))
    }
    freq_table <- rbind(freq_table, transition_frequency)
  }
}
frequency_table <- data.frame(pattern_transition = freq_table[, 1], counts = freq_table[, 2])
frequency_table$pattern.from <- sapply(strsplit(as.character(frequency_table$pattern_transition), ","), `[`, 1)
frequency_table$pattern.to <- sapply(strsplit(as.character(frequency_table$pattern_transition), ","), `[`, 2)
frequency_table <- t(matrix(as.numeric(as.character(frequency_table$counts)), ncol=length(unique_patterns), nrow=length(unique_patterns)))
colnames(frequency_table) <- unique_patterns
rownames(frequency_table) <- unique_patterns
I haven't been able to vectorize it, and those greps are really slow in any case. I tried parallelizing it, but it's still too slow.
Would someone have a lightning-fast solution for building this frequency table?
I don't really care about the code sample; it's just here to show an example of an (unusable) solution.
Thanks!
UPDATE:
Please find below the expected output, assuming I've counted correctly.
Dropping rows with no transitions (such as "CBA" to "XXX") is acceptable, and of course the actual type of the structure doesn't matter:
structure(
  list(
    ABC = c(1, 1, 0, 1, 0),
    BAC = c(0, 0, 1, 0, 0),
    BCC = c(0, 0, 0, 1, 0),
    BCD = c(2, 0, 0, 0, 0),
    CBA = c(1, 0, 0, 0, 0)
  ),
  row.names = c("ABC", "BAC", "BCC", "BCD", "CBA"),
  class = "data.frame"
)
I have not benchmarked it, but it seems that the following solution uses base R functions that usually tend to be fast. Given sequence_df from the question:
table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                    FUN = function(p) paste0(p[-length(p)], p[-1]))))
I am using tapply to examine each sequence_id on its own and paste0 to build the transition patterns. unlist puts it all into one large vector, which table can count. I am not sure whether one would call that fully vectorized, but at least it has no nested for loops and no regular expressions.
It certainly lacks code to transform the table into a matrix right now. That can be written once we know how it compares runtime-wise with other solutions, and depending on whether the matrix is really the ideal format for the operations that come after.
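For what it's worth, here is one possible sketch of that reshaping step, assuming a "," separator is inserted between the two patterns so they can be split apart again:
# Sketch: split the "from,to" names of the counts and cross-tabulate
counts <- table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                              FUN = function(p) paste0(p[-length(p)], ",", p[-1]))))
parts <- do.call(rbind, strsplit(names(counts), ","))
tr <- data.frame(from = parts[, 1], to = parts[, 2], n = as.integer(counts))
freq_matrix <- xtabs(n ~ from + to, data = tr)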
OK, so I had a look at the solutions from @Bernhard and @user20650, and while I haven't seriously validated the results yet, it looks like they both do the job.
One difference is that the tapply solution doesn't generate the 0-frequency transitions while the data.table one does.
Both solutions are faster than the nested for / rbind example, as expected.
So I have tried to benchmark both, and assuming no mistake on my side adapting your code, the data.table solution is more than twice as fast as the tapply one.
Thank you both for these elegant answers, appreciated!
library(data.table)
library(tidyr)
library(microbenchmark)
patterns_sequences <- lapply(1:8, function(id) {
  length_sequence <- sample(3:10, 1)
  do.call(paste0, replicate(3, sample(c("A", "B", "C"), length_sequence, TRUE), FALSE))
})
sequence_df <- data.frame(sequence_id = c(rep("a", length(patterns_sequences[[1]])),
                                          rep("b", length(patterns_sequences[[2]])),
                                          rep("c", length(patterns_sequences[[3]])),
                                          rep("d", length(patterns_sequences[[4]])),
                                          rep("e", length(patterns_sequences[[5]])),
                                          rep("f", length(patterns_sequences[[6]])),
                                          rep("g", length(patterns_sequences[[7]])),
                                          rep("h", length(patterns_sequences[[8]]))),
                          pattern = unlist(patterns_sequences))
build_frequency_table_base <- function(sequence_df) {
  ft0 <- as.data.frame(table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                                           FUN = function(p) paste0(p[-length(p)], ",", p[-1])))),
                       stringsAsFactors = FALSE)
  ft1 <- ft0 %>%
    tidyr::separate(Var1, c("from_pattern", "to_pattern"), ",")
  ft5 <- tidyr::spread(ft1, to_pattern, Freq, fill = 0)
  rownames(ft5) <- ft5$from_pattern
  ft5$from_pattern <- NULL
  ft5
}
build_frequency_table_dt <- function(sequence_df) {
  dt <- as.data.table(sequence_df)
  dt[, pattern := factor(pattern)]
  dt[, pl := shift(pattern), by = sequence_id][, pl := factor(pl, levels = levels(pattern))]
  res_dt <- with(dt, table(pl, pattern))
  as.data.frame.matrix(res_dt)
}
tictoc::tic("base")
res_base <- build_frequency_table_base(sequence_df)
tictoc::toc()
tictoc::tic("DT")
res_dt <- build_frequency_table_dt(sequence_df)
tictoc::toc()
(bench <- microbenchmark::microbenchmark(
  build_frequency_table_base(sequence_df),
  build_frequency_table_dt(sequence_df),
  times = 1000L
))
ggplot2::autoplot(bench)

What happens when no chromosomes satisfy the constraints?

I've found this page from the R blog, where there is an example of how the genalg library works.
I've written a piece of code, mainly copy-pasted from the page linked above. What I expect from the code is that no chromosome is a good solution, and I've been wondering what happens when, in theory, all the chromosomes should be discarded. It seems that the algorithm is always able to return a solution, but what kind of solution is one that does not satisfy the constraints?
What I can imagine is that the algorithm should always return a chromosome made only of zeroes. I tried running it several times, but that does not happen, i.e. it also returns ones in the chromosome sequence. But how is this possible? What am I missing? Does anyone have more experience than me with genetic algorithms in general, and with the genalg library for R in particular?
The code:
library("genalg");
evalFunc <- function(x) {
current_solution_survivalpoints <- x %*% dataset$survivalpoints;
current_solution_weight <- x %*% dataset$weight;
if (current_solution_weight > weightlimit){
return(0);
}
else{
return(-current_solution_survivalpoints);
}
}
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
weight = c(5, 10, 20, 5, 12, 10, 5));
#this is the constraint that cannot be satisfied
weightlimit <- 4;
GAmodel <- rbga.bin(size = 7, popSize = 200, iters = 100, mutationChance = 0.01, evalFunc = evalFunc);
filter <- GAmodel$evaluations == min(GAmodel$evaluations);
chromosome <- GAmodel$population[filter, , drop= FALSE][1,];
print("the solution is");print(chromosome);

Why is the actual number of generations not as specified for genetic algorithms in R

I am working with the genalg library for R and trying to save all the generations when I run a binary genetic algorithm. There does not seem to be a built-in method for that in the library, so my attempt was to save each chromosome, x, coming through the evaluation function.
To test this method I inserted print(x) in the evaluation function so I could see all the evaluated chromosomes. However, the number of printed chromosomes does not always match what I am expecting.
I thought that the number of printed chromosomes would equal the number of iterations times the population size, but that does not seem to be true all the time.
The problem is that I want to know which generation (or iteration) each chromosome belongs to, which I can't tell if the number of chromosomes differs from iters times popSize.
What is the reason for this, and how can I "fix" it? Or is there another way of saving each chromosome along with the iteration it belongs to?
Below is an example where I thought the evaluation function would print 2x5 chromosomes, but it only prints 8.
library(genalg)
library(ggplot2)
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1))
weightlimit <- 20
evalFunc <- function(x) {
  print(x)
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize, iters = iter, mutationChance = 0.1,
                    elitism = T, evalFunc = evalFunc)
Looking at the function code, it seems that at each iteration (generation) a subset of chromosomes is chosen from the population (population = 5 chromosomes in your example) with a certain probability (0.1 in your case) and mutated. The evaluation function is called only for the mutated chromosomes at each generation (and of course for all the chromosomes in the first iteration, to establish their initial values).
Note that this subset does not include the elitist group, which in your example you have defined as 1 element big (you passed elitism = TRUE, and TRUE is implicitly converted to 1).
Anyway, to inspect the population at each generation, you can pass a monitor function through the monitorFunc parameter, e.g.:
# obj contains a lot of information; try printing it
monitor <- function(obj) {
  print(paste(" GENERATION :", obj$iter))
  print("POPULATION:")
  print(obj$population)
  print("VALUES:")
  print(obj$evaluations)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize,
                    iters = iter, mutationChance = 0.1,
                    elitism = 1, evalFunc = evalFunc, monitorFunc = monitor)
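Building on that, a hedged sketch of how the monitor callback could accumulate every generation's population in a list, assuming obj$iter and obj$population are filled in as shown above (check them on your genalg version):
# Sketch: store each generation's population, indexed by iteration
generations <- list()
monitor_store <- function(obj) {
  generations[[obj$iter]] <<- obj$population
}
GAmodel <- rbga.bin(size = 7, popSize = popSize, iters = iter,
                    mutationChance = 0.1, elitism = 1,
                    evalFunc = evalFunc, monitorFunc = monitor_store)
# generations[[1]] now holds the first generation's chromosomes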

Generate random numbers in a for loop in R

I would like to ask how to generate random numbers in a for loop in R.
I am trying to make a table with two columns, ID and time. Each ID has 7 times: 0, 1, 2, 3, 4, 5, and a last one that has to be random, between 6 and 7.
ID <- data.frame(rep(1:100, each = 7))
for (i in unique(ID)) {
  ID$time <- c(0, 1, 2, 3, 4, 5, x <- runif(1, 6, 7), 100)[ID == i]
}
An error message popped up:
Error in `$<-.data.frame`(`*tmp*`, "time", value = c(0, NA, NA, NA, NA, :
replacement has 8 rows, data has 700
You could try using replicate, such as
ID$time <- c(replicate(100, c(0:5, runif(1, 6, 7))))
Although replicate is a wrapper for sapply, which is basically a hidden for loop. Instead, you could also try a vectorized approach, such as:
ID$time <- 0:6
ID[ID$time == 6, "time"] <- runif(100, 6, 7)
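A quick check of the result (my addition), assuming ID was built with rep(1:100, each = 7) as in the question:
head(ID, 7)  # one person's rows: times 0..5 plus a random value in [6, 7)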
I think you want this:
set.seed(123)
ID <- data.frame(time = c(0, 1, 2, 3, 4, 5), x = runif(6, 6, 7))
You don't want to use '<-' for the arguments to data.frame.
But maybe it's this that you want:
ID <- data.frame( time = rep( c(0,1,2,3,4,5,6), each=100) , x = runif(700,6,7))
(it's always a good idea to describe in natural language what it is that you want.)
