I have a table showing sequences of patterns, represented by the id of the sequence and the ordered values of the consecutive patterns over several rows (patterns all have the same length, but sequences have different lengths).
patterns_seq_a <- c("ABC", "BCD", "ABC", "CBA")
patterns_seq_b <- c("BCD", "BCC", "BAC", "ABC", "BCD")
patterns_seq_c <- c("ABC", "ABC")
sequence_df <- data.frame(sequence_id = c(rep("a", length(patterns_seq_a)),
                                          rep("b", length(patterns_seq_b)),
                                          rep("c", length(patterns_seq_c))),
                          pattern = c(patterns_seq_a, patterns_seq_b, patterns_seq_c))
and I'm trying to count the frequency of transitions from one pattern to another over all the sequences
(and will ultimately build transition probabilities from the frequency table).
So, for example, the frequency of the transition from "CBA" to "BCD" over all sequences would be 0 (NB: rows 4 and 5 don't belong to the same sequence).
The end result would have the form of:
unique_patterns <- unique(sequence_df$pattern)
result <- data.frame(matrix(0, ncol = length(unique_patterns), nrow = length(unique_patterns)))
colnames(result) <- unique_patterns
rownames(result) <- unique_patterns
(but it could also be a hash of pattern -> pattern -> frequency if need be)
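The probability step mentioned above could then be a simple row normalisation of the counts, e.g. (sketch, once result actually holds the counts):
freq_mat <- as.matrix(result)                              # counts, from-pattern x to-pattern
row_totals <- rowSums(freq_mat)
prob_mat <- sweep(freq_mat, 1, pmax(row_totals, 1), "/")   # rows with no outgoing transition stay all zero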
On the web I've found a solution based on concatenating the sequences and then grepping them, but it's unusable as it's too slow (profvis blames the greps). Something along the lines of this:
# concat_sequence_df is assumed to hold, per sequence, the patterns concatenated
# into one comma-separated string (column patterns_sequence)
freq_table <- c()
for (start_pattern in unique_patterns) {
  for (end_pattern in unique_patterns) {
    transition_pattern <- paste0(start_pattern, ',', end_pattern)
    sequence_holding_transition <- concat_sequence_df[grep(transition_pattern, concat_sequence_df$patterns_sequence), ]
    if (nrow(sequence_holding_transition) < 1) {
      transition_frequency <- c(transition_pattern, 0)
    } else {
      concat_sequence_holding_transition <- paste0(sequence_holding_transition$patterns_sequence, collapse = ",", sep = "/")
      transition_pattern_positions <- gregexpr(pattern = transition_pattern, text = concat_sequence_holding_transition)[[1]]
      transition_frequency <- c(transition_pattern, length(transition_pattern_positions))
    }
    freq_table <- rbind(freq_table, transition_frequency)
  }
}
frequency_table <- data.frame(pattern_transition = freq_table[, 1], counts = freq_table[, 2])
frequency_table$pattern.from <- sapply(strsplit(as.character(frequency_table$pattern_transition), ","), `[`, 1)
frequency_table$pattern.to <- sapply(strsplit(as.character(frequency_table$pattern_transition), ","), `[`, 2)
frequency_table <- t(matrix(as.numeric(as.character(frequency_table$counts)), ncol=length(unique_patterns), nrow=length(unique_patterns)))
colnames(frequency_table) <- unique_patterns
rownames(frequency_table) <- unique_patterns
I haven't been able to vectorize it and those greps are really slow in any case. I tried parallelizing it but it's still too slow.
Would someone have a lightning-fast solution for building this frequency table?
I don't really care about the code sample; it's just here to show an example of an (unusable) solution.
Thanks!
UPDATE:
Please find below the expected output, assuming I'm counting correctly.
Dropping rows with no transitions (such as "CBA" to "XXX") is acceptable, and of course the actual type of the structure doesn't matter:
structure(
  list(
    ABC = c(1, 1, 0, 1, 0),
    BAC = c(0, 0, 1, 0, 0),
    BCC = c(0, 0, 0, 1, 0),
    BCD = c(2, 0, 0, 0, 0),
    CBA = c(1, 0, 0, 0, 0)
  ),
  row.names = c("ABC", "BAC", "BCC", "BCD", "CBA"),
  class = "data.frame"
)
I have not benchmarked it, but it seems that the following solution uses basic R functions that tend to be fast. Given sequence_df from the question:
table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                    FUN = function(p) paste0(p[-length(p)], p[-1]))))
I am using tapply to examine each sequence_id on its own and paste0 to build the transition patterns. unlist puts everything into one large vector, which table can then count. I am not sure whether one would call that fully vectorised, but at least it has no nested for loops and no regular expressions.
It still lacks the code to transform the resulting table into a matrix. That can be written once we know how it compares runtime-wise with other solutions, and depending on whether a matrix really is the ideal format for the operations that come after.
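For completeness, one possible (untested) way to get from that table to the requested matrix, assuming the from/to patterns are pasted with a "," separator so they can be split apart again:
tr <- table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                          FUN = function(p) paste0(p[-length(p)], ",", p[-1]))))
parts <- do.call(rbind, strsplit(names(tr), ",", fixed = TRUE))  # from/to pairs
pats <- sort(unique(sequence_df$pattern))
freq_mat <- matrix(0, length(pats), length(pats), dimnames = list(pats, pats))
freq_mat[parts] <- as.integer(tr)  # character matrix indexing fills the (from, to) cells
freq_mat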
OK, so I had a look at the solutions from @Berhard and @user20650, and while I haven't seriously validated the results yet, it looks like they both do the job.
One difference is that the tapply solution doesn't generate the zero-frequency transitions while the data.table one does.
Both solutions are faster than the nested for / rbind example, as expected.
So I have tried to benchmark both, and assuming no mistake on my side when adapting your code, the data.table one is more than twice as fast as the tapply one.
Thank you both for these elegant answers, appreciated!
library(data.table)
library(tidyr)
library(microbenchmark)
patterns_sequences <- lapply(1:8, function(id) {
  length_sequence <- sample(3:10, 1)
  do.call(paste0, replicate(3, sample(c("A", "B", "C"), length_sequence, TRUE), FALSE))
})
sequence_df <- data.frame(sequence_id = c(rep("a", length(patterns_sequences[[1]])),
                                          rep("b", length(patterns_sequences[[2]])),
                                          rep("c", length(patterns_sequences[[3]])),
                                          rep("d", length(patterns_sequences[[4]])),
                                          rep("e", length(patterns_sequences[[5]])),
                                          rep("f", length(patterns_sequences[[6]])),
                                          rep("g", length(patterns_sequences[[7]])),
                                          rep("h", length(patterns_sequences[[8]]))),
                          pattern = unlist(patterns_sequences))
build_frequency_table_base <- function(sequence_df) {
  ft0 <- as.data.frame(table(unlist(tapply(sequence_df$pattern, sequence_df$sequence_id,
                                           FUN = function(p) paste0(p[-length(p)], ",", p[-1])))),
                       stringsAsFactors = FALSE)
  ft1 <- ft0 %>%
    tidyr::separate(Var1, c("from_pattern", "to_pattern"), ",")
  ft5 <- tidyr::spread(ft1, to_pattern, Freq, fill = 0)
  rownames(ft5) <- ft5$from_pattern
  ft5$from_pattern <- NULL
  ft5
}
build_frequency_table_dt <- function(sequence_df) {
  dt <- as.data.table(sequence_df)
  dt[, pattern := factor(pattern)]
  dt[, pl := shift(pattern), by = sequence_id][, pl := factor(pl, levels = levels(pattern))]
  res_dt <- with(dt, table(pl, pattern))
  as.data.frame.matrix(res_dt)
}
tictoc::tic("base")
res_base <- build_frequency_table_base(sequence_df)
tictoc::toc()
tictoc::tic("DT")
res_dt <- build_frequency_table_dt(sequence_df)
tictoc::toc()
(bench = microbenchmark::microbenchmark(
build_frequency_table_base(sequence_df),
res_dt <- build_frequency_table_dt(sequence_df),
times=1000L
))
ggplot2::autoplot(bench)
I was working on a problem where I had two grouping variables and one value, and I only want to keep the rows where at least two of the values in the group are close to each other in value. In the example I wanted the groups that had at least one pair of values within 10 of each other.
Below is what I initially tried. Something about making a flag variable made me feel like I was doing it in a roundabout way, and I just wanted to know if there's a cleaner, more intended way to do something like this in data.table. Thank you!
x and y are the categories, z is the value.
library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, T),
  y = sample(letters, 1000, T),
  z = sample(100, 1000, T),
  key = tail(letters, 3)
)
dt <- unique(dt)
dt <- dt[dt[, .(flag = any(diff(z) <= 11)), .(x, y)], on = c("x", "y")][(flag)]
dt[, flag := NULL]
dt
You can use .I with an if to determine whether to include each group (here want matches your final dt)
dt <- unique(dt)
want <- dt[dt[, if(any(diff(z) <= 11)) .I, .(x, y)]$V1]
You could do
res <- dt[, if (.N > 1L && min(diff(z)) <= 11) .SD, by=.(x, y)]
I used min instead of any since I guess it leads to fewer computations.
I added the .N > 1L condition since you need to think about how to handle the single-row case (where diff(z) is empty). You could use either of the following (the second variant is sketched just after this list):
.N > 1L && to drop those cases
.N == 1L || to keep them
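A minimal sketch of that "keep them" variant, for illustration:
res_keep <- dt[, if (.N == 1L || min(diff(z)) <= 11) .SD, by = .(x, y)]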
I just wanted to know if there's a cleaner more intended way to do something like this in data.table
I think a having= syntax would be convenient for this. It's currently a feature request.
Input data (since OP overwrites it):
library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, T),
  y = sample(letters, 1000, T),
  z = sample(100, 1000, T),
  key = tail(letters, 3)
)
dt <- unique(dt)
Is there a better way to achieve the result below?
I commonly use this sort of coding approach for operating on many columns (imagine there were 20+ "v" columns, so I need a dynamic way to build the new columns).
set.seed(1)
dt <- data.table(m = 1:10, v1 = rnorm(10, 1), v2 = rnorm(10, 1), v3 = rnorm(10, 1))
run_across <- function(x, y = "m", z = c("v1")) {
  # browser()
  x[, c(paste0(y, ".", z), paste0(y, ".div.", z)) := list(get(y) * get(z), get(y) / get(z))]
}
# works fine for an individual case:
run_across(dt)
# Error: z receives the values of column v1 rather than the column name:
dt[, lapply(X=.SD, FUN = run_across, x = dt, y = "m"), .SDcols = paste0("v", 1:3)]
# This has desired result, but is there a more elegant approach?
invisible(lapply(paste0("v", 1:3), run_across, x = dt, y = "m"))
First of all, "elegance" is a very subjective term that is not really helpful in evaluating code, from my point of view.
Second, the approach the OP offered themselves is not bad at all, using a single lapply call.
Third, the best I can think of to make this a more "elegant" (in the sense of R-like or simple) operation is to convert your data to long format before running your custom function:
dt <- melt(dt, id.vars = "m")
run_across(dt, y = "m", z = "value")
That means you can do it in a single call to run_across, without looping.
If you need to convert to wide format afterwards, you can use dcast:
dcast(dt, m ~ variable, value.var = c("value","m.value", "m.div.value"))
Given the following test matrix:
testMatrix <- matrix( c(1,1,2,10,20,30,300,100,200,"A","B","C"), 3, 4)
colnames(testMatrix) <- c("GroupID", "ElementID", "Value", "Name")
Here I want to find the max Value per group and then return the Name belonging to that row.
E.g. I would expect 1, A and 2, C. If there is a tie for the max, the first match would be fine.
After that I would have to attach this to the matrix as a new column "GroupName".
How can I do this?
I already have the Group, Max Value combination:
groupMax <- aggregate (as.numeric(testMatrix[,3]), by=list( testMatrix[,1] ), max )
The way I used to add columns to my matrix works like this (let's assume there is also already a matrix groupNames with GroupID, Name combinations):
testMatrix <- cbind ( testMatrix, groupNames[match( testMatrix[,1], groupNames[,1] ), 2] )
A data.table solution for time and memory efficiency and syntactic elegance
library(data.table)
DT <- as.data.table(testMatrix)
DT[, Value := as.numeric(Value)]  # the matrix stores everything as character, so convert Value back to numeric
DT[, list(Name = Name[which.max(Value)]), by = GroupID]
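If you then need to attach this back as the "GroupName" column mentioned in the question, one possible way (sketch, using an update join) is:
group_names <- DT[, .(GroupName = Name[which.max(Value)]), by = GroupID]
DT[group_names, GroupName := i.GroupName, on = "GroupID"]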
Base solution, not as simple as Dan M's:
testMatrix <- data.frame(GroupID = c(1, 1, 2), ElementID = c(10, 20, 30),
                         Value = c(300, 100, 200), Name = c("A", "B", "C"))
A <- lapply(split(testMatrix, testMatrix$GroupID), function(x) {
  x[which.max(x$Value), c(1, 4)]
})
do.call(rbind, A)
As @Tyler said, a data.frame is easier to work with. Here's an option:
library(plyr)
testMatrix <- data.frame(GroupID = c(1, 1, 2), ElementID = c(10, 20, 30), Value = c(300, 100, 200), Name = c("A", "B", "C"))
ddply(testMatrix, .(GroupID), summarize, Name = Name[which.max(Value)])
I figured out a nice way to do this via dplyr (this assumes the data.frame version of testMatrix from above):
library(dplyr)
filter(group_by(testMatrix, GroupID), min_rank(desc(Value)) == 1)
This is a follow-up question from R: t-test over all columns.
Suppose I have a huge data set, and then I create numerous subsets based on certain conditions. The subsets should have the same number of columns. Then I want to do a t-test on two subsets at a time (outer loop) and then, for each combination of subsets, go through all columns one column at a time (inner loop).
Here is what I have come up with based on the previous answer. This one stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
  combo2 <- combn(ncol(Data.Subsets[x]), 2)
  adply(combo2, 2, function(y) {
    test <- t.test(Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm = TRUE)
    out <- data.frame("Subset" = rownames(Data.Subsets[x])
                      , "Row" = colnames(x)[y[1]]
                      , "Column" = colnames(x[y[2]])
                      , "t.value" = round(test$statistic, 3)
                      , "df" = test$parameter
                      , "p.value" = round(test$p.value, 3))
    return(out)
  })
})
First of all, you can more easily define your dataset using gl, and by avoiding creating individual variables for the columns.
Data <- data.frame(
  C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
library(reshape)  # reshape2's melt() works the same way here
longData <- melt(Data, id.vars = "C")
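For reference, a rough base reshape() equivalent of that melt step might look like this (untested sketch):
longData <- reshape(Data, direction = "long",
                    varying = c("X", "Y", "Z"), v.names = "value",
                    timevar = "variable", times = c("X", "Y", "Z"))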
Now use pairwise.t.test to compute t-tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than lots of individual calls to t.test, because you need to adjust your p-values if you run lots of tests. (See, e.g., xkcd for an explanation.)
In general, pairwise t-tests are inferior to a regression, so be careful about their usage.
You can use get(Data.Subsets[x]), which will pick out the relevant data frame. But I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (I didn't test it) return a list of data frames, each subsetted on the various conditions you passed in.
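If you want to keep addressing the subsets by condition name, a small optional addition:
names(dfs) <- conditions  # so dfs[["c1"]] plays the role of Data.c1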
However, as @Richie Cotton points out, it would be a much better idea to reshape your data frame and use pairwise t-tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to try to figure out whether you can use an ANOVA of some sort, as they are used for almost exactly this purpose.
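For example, a minimal sketch of the ANOVA route, assuming the long-format longData from the earlier answer:
fit <- aov(value ~ C * variable, data = longData)
summary(fit)
TukeyHSD(fit)  # post-hoc pairwise comparisons with a built-in multiplicity adjustment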